---
title: "Individual-based models of cultural evolution"
subtitle: "A step-by-step guide using R"
author:
- Alberto Acerbi
- Alex Mesoudi
- Marco Smolla
date: "`r Sys.Date()`"
site: bookdown::bookdown_site
documentclass: book
bibliography: "biblio.bib"
link-citations: true
---
# Introduction {-#Introduction}
TO DO
<!--chapter:end:index.Rmd-->
# (PART\*) Basics {-}
# Unbiased transmission
We start by simulating a simple case of unbiased cultural transmission. We will detail each step of the simulation and explain the code line-by-line. In the following chapters, we will reuse most of this initial model, building up the complexity of our simulations.
## Initialising the simulation
Here we will simulate a case where $N$ individuals each possess one of two mutually exclusive cultural traits. These alternative traits are denoted $A$ and $B$. For example, $A$ might be eating a vegetarian diet, and $B$ might be eating a non-vegetarian diet. In reality, traits are seldom clear-cut (e.g. what about pescatarians?), but models are designed to cut away all the complexity to give tractable answers to simplified situations.
Our model has non-overlapping generations. In each generation, all $N$ individuals are replaced with $N$ new individuals. Again, this is unlike any real biological group but provides a simple way of simulating change over time. Generations here could correspond to biological generations, but could equally be 'cultural generations' (or learning episodes), which might be much shorter.
Each new individual of each new generation picks a member of the previous generation at random and copies their cultural trait. This is known as unbiased oblique cultural transmission. 'Unbiased' refers to the fact that traits are copied entirely at random. The term 'oblique' means that members of one generation learn from those of the previous, non-overlapping, generation. This is different from, for example, horizontal cultural transmission, where individuals copy members of the same generation, and vertical cultural transmission, where offspring copy their biological parents.
If we assume that the two cultural traits are transmitted in an unbiased way, what does that mean for the average trait frequency in the population? To answer this question, we must track the proportion of individuals who possess trait $A$ over successive generations. We will call this proportion $p$. We could also track the proportion who possess trait $B$, but this will always be $1 - p$ given that the two traits are mutually exclusive. For example, if $70\%$ of the population have trait $A$ $(p=0.7)$, then the remaining $30\%$ must have trait $B$ (i.e. $1-p=1-0.7=0.3$).
The output of the model will be a plot showing $p$ over all generations up to the last generation. Generations (or time steps) are denoted by $t$, where generation one is $t=1$, generation two is $t=2$, up to the last generation $t=t_{\text{max}}$.
First, we need to specify the fixed parameters of the model. These are quantities that we decide on at the start and do not change during the simulation. In this model these are $N$ (the number of individuals) and $t_{\text{max}}$ (the number of generations). Let's start with $N=100$ and $t_{\text{max}}=200$:
```{r 1.1}
N <- 100
t_max <- 200
```
Now we need to create our individuals. The only information we need to keep about our individuals is their cultural trait ($A$ or $B$). We'll make **population** the data structure containing the individuals. The type of data structure we have chosen here is a tibble. This is a more user-friendly version of a dataframe.
Initially, we'll give each individual either an $A$ or $B$ at random, using the `sample()` command. This can be seen in the code chunk below. The `sample()` command takes three arguments (i.e. inputs or options). The first argument lists the elements to pick at random, in our case, the traits $A$ and $B$. The second argument gives the number of times to pick, in our case $N$ times, once for each individual. The final argument says to replace or reuse the elements specified in the first argument after they've been picked (otherwise there would only be one copy of $A$ and one copy of $B$, so we could only give two individuals traits before running out). Within the `tibble()` command, the word $trait$ denotes the name of the variable within the tibble that contains the random $A$s and $B$s, and the whole tibble is assigned the name **population**.
There is one line before `sample()` in the chunk below: we need to call the `tidyverse` library, which we will use throughout this chapter. Here, it allows us to create a tibble using the `tibble()` command.
```{r 1.2, message = FALSE}
library(tidyverse)
population <- tibble(trait = sample(c("A", "B"), N, replace = TRUE))
```
We can see the cultural traits of our population by simply entering its name in the R console:
```{r 1.3}
population
```
As expected, there is a single column called $trait$ containing $A$s and $B$s. The type of the column, in this case 'chr' (i.e. character), is reported below the name.
A specific individual's trait can be retrieved using the square bracket notation in R. For example, individual 4's trait can be retrieved by typing:
```{r 1.4}
population$trait[4]
```
This should match the fourth row in the table above.
We also need a tibble to record the output of our simulation, that is, to track the trait frequency $p$ in each generation. This will have two columns with $t_{\text{max}}$ rows, one row for each generation. The first column is simply a counter of the generations, from 1 to $t_{\text{max}}$. This will be useful to plot the output later. The other column contains the values of $p$ for each generation.
At this stage we don't know what $p$ will be in each generation, so, for now, let's fill the **output** tibble with lots of NAs, which is R's symbol for Not Available, or missing value. We can use the `rep()` (repeat) command to repeat NA $t_{\text{max}}$ times. We're using NA rather than, say, zero, because zero could be misinterpreted as $p = 0$, which would mean that all individuals have trait $B$. This would be misleading, because at the moment we haven't yet calculated $p$, so it's nonexistent, rather than zero.
```{r 1.5}
output <- tibble(generation = 1:t_max, p = rep(NA, t_max))
```
We can, however, fill in the first value of $p$ for our already-created first generation of individuals, held in **population**. The command below sums the number of $A$s in **population** and divides by $N$ to get a proportion out of 1 rather than an absolute number. It then puts this proportion in the first slot of $p$ in **output**, the one for the first generation, $t=1$. We can again write the name of the tibble, `output`, to see that it worked.
```{r 1.6}
output$p[1] <- sum(population$trait == "A") / N
output
```
This first value of $p$ should be around $0.5$, meaning that around 50 individuals have trait $A$, and 50 have trait $B$. Even though `sample()` returns either trait with equal probability, this does not necessarily mean that we will get exactly 50 $A$s and 50 $B$s. This happens with simulations and finite population sizes: they are probabilistic (or stochastic), not deterministic. Analogously, flipping a coin 100 times will not always give exactly 50 heads and 50 tails. Sometimes we will get 51 heads, sometimes 49, etc. To see this in our simulation, re-run the above code a few times: the initial value of $p$ will vary slightly around 0.5.
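To see the same effect in miniature, here is a quick standalone illustration (separate from the model code): counting the heads in five batches of 100 coin flips rarely gives exactly 50 each time.
```{r}
# Illustration only: the number of heads in 100 fair coin flips varies around 50
replicate(5, sum(sample(c("heads", "tails"), 100, replace = TRUE) == "heads"))
```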
## Execute generation turnover many times
Now that we have built the population, we can simulate what individuals do in each generation. We iterate these actions over $t_{\text{max}}$ generations. In each generation, we need to:
* copy the current individuals to a separate tibble called **previous_population** to use as demonstrators for the new individuals; this allows us to implement oblique transmission with its non-overlapping generations, rather than mixing up the generations
* create a new generation of individuals, each of whose trait is picked at random from the **previous_population** tibble
* calculate $p$ for this new generation and store it in the appropriate slot in **output**
To iterate, we'll use a for-loop, using $t$ to track the generation. We've already done generation 1 so we'll start at generation 2. The random picking of demonstrators is done with `sample()` again, but this time picking from the traits held in **previous_population**. Note that we have added comments briefly explaining what each line does. This is perhaps superfluous when the code is this simple, but it's always good practice. Code often gets cut-and-pasted into other places and loses its context. Explaining what each line does lets other people - and a future, forgetful you - know what's going on.
```{r 1.7}
for (t in 2:t_max) {
previous_population <- population # copy the population tibble to previous_population tibble
population <- tibble(trait = sample(previous_population$trait, N, replace = TRUE)) # randomly copy from previous generation's individuals
output$p[t] <- sum(population$trait == "A") / N # get p and put it into the output slot for this generation t
}
```
Now we should have 200 values of $p$ stored in **output**, one for each generation. You can list them by typing **output**, but it is more effective to plot them.
## Plotting the model results
We use `ggplot()` to plot our data. The syntax of ggplot may seem slightly obscure at first, but it forces us to have a clear picture of the data before plotting.
In the first line in the code below, we are telling ggplot that the data we want to plot is in the tibble **output**. Then, with the command `aes()` we declare the 'aesthetics' of the plot, that is, how we want our data mapped in our plot. In this case, we want the values of $p$ on the y-axis, and the values of $generation$ on the x-axis (this is why we earlier created, in the tibble **output**, a column keeping the count of generations).
We then use `geom_line()`. In ggplot, 'geoms' describe what kind of visual representation should be plotted: lines, bars, boxes and so on. This visual representation is independent of the mapping that we declared before with `aes()`. The same data, with the same mapping, can be visually represented in many different ways. In this case, we are asking ggplot to represent the data as a line. You can change `geom_line()` in the code below to `geom_point()`, and see what happens (other geoms have less obvious effects, and we will see some of them in the later chapters).
The other commands are mainly to make the plot look nicer. We want the y-axis to span all the possible values of $p$, from 0 to 1, and we use a particular 'theme' for our plot, in this case, a standard theme with white background. With the command `labs()` we give a more informative label to the y-axis (ggplot automatically labels the axis with the name of the tibble columns that are plotted: this is good for $generation$, but less so for $p$).
```{r 1.8}
ggplot(data = output, aes(y = p, x = generation)) +
geom_line() +
ylim(c(0, 1)) +
theme_bw() +
labs(y = "p (proportion of individuals with trait A)")
```
The proportion of individuals with trait $A$ should start off hovering around 0.5 and then oscillate randomly (in some cases it may also reach 0, meaning that all $A$s have disappeared, or 1, meaning that all $B$s have disappeared). Unbiased transmission, or random copying, is by definition random, so different runs of this simulation will generate different plots. If you rerun all the code you will get something different, and in all likelihood $p$ will go to 0 or 1 at some point. At $p = 0$ there are no $A$s and every individual possesses $B$. At $p = 1$ there are no $B$s and every individual possesses $A$. This is a typical feature of cultural drift, analogous to genetic drift: in small populations, with no selection or other directional processes operating, traits can be lost purely by chance after some generations.
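If you want to check whether, and when, one of the traits was lost in your run, a quick check on the **output** tibble created above is to ask for the first generation at which $p$ reached 0 or 1:
```{r}
# First generation (if any) at which p hit 0 or 1; NA means neither trait was lost
which(output$p == 0 | output$p == 1)[1]
```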
## Write a function to wrap the model code
Ideally, we would like to repeat the simulation to explore this idea in more detail, perhaps changing some of the parameters. For example, if we increase $N$, are we more or less likely to lose one of the traits? As noted above, individual-based models like this one are probabilistic or stochastic, thus it is essential to run simulations many times to understand what happens. With our code scattered about in chunks, it is hard to quickly repeat the simulation. Instead, we can wrap it all up in a function:
```{r 1.9}
unbiased_transmission_1 <- function(N, t_max) {
population <- tibble(trait = sample(c("A", "B"), N, replace = TRUE))
output <- tibble(generation = 1:t_max, p = rep(NA, t_max))
output$p[1] <- sum(population$trait == "A") / N
for (t in 2:t_max) {
previous_population <- population # copy individuals to previous_population tibble
population <- tibble(trait = sample(previous_population$trait, N, replace = TRUE))
# randomly copy from previous generation
output$p[t] <- sum(population$trait == "A") / N # get p and put it into output slot for this generation t
}
output
}
```
This is just all of the code snippets that we already ran above, wrapped in a function that takes $N$ and $t_{\text{max}}$ as arguments. In addition, `unbiased_transmission_1()` ends with the line `output`. This means that this tibble will be exported from the function when it is run. This is useful for storing data from simulations wrapped in functions; otherwise that data is lost after the function is executed.
Nothing will happen when you run the above code, because all you have done is define the function but not actually run it. The point is that we can now call the function in one go, easily changing the values of $N$ and $t_{\text{max}}$. Let's try first with the same values of $N$ and $t_{\text{max}}$ as before, and save the output from the simulation into **data_model**, as a record of what happened.
```{r 1.10}
data_model <- unbiased_transmission_1(N = 100, t_max = 200)
```
We also need to create another function to plot the data, so we do not need to rewrite all the plotting instructions each time. While this may seem unnecessary now, it is convenient to separate the function that runs the simulation from the function that plots the data, for various reasons. With more complicated models, we do not want to rerun a simulation just because we want to change some detail in the plot. It also makes conceptual sense to keep the raw output of the model separate from the various ways we can visualise it, or the further analysis we want to perform on it. As above, the code is identical to what we already wrote:
```{r 1.11}
plot_single_run <- function(data_model) {
ggplot(data = data_model, aes(y = p, x = generation)) +
geom_line() +
ylim(c(0, 1)) +
theme_bw() +
labs(y = "p (proportion of individuals with trait A)")
}
```
At this point, we can visualise the results:
```{r 1.12}
plot_single_run(data_model)
```
As anticipated, the plot is different from the simulation we ran before, even though the code is exactly the same. This is due to the stochastic nature of the simulation.
Now let's try changing the parameters. We can call the simulation and the plotting functions together. The code below reruns and plots the simulation with a much larger $N$.
```{r 1.13}
data_model <- unbiased_transmission_1(N = 10000, t_max = 200)
plot_single_run(data_model)
```
You should see much less fluctuation. Rarely in a population of $N = 10000$ will either trait go to fixation. Try re-running the previous code chunk to explore the effect of $N$ on long-term dynamics.
## Run several independent simulations and plot their results
Wrapping a simulation in a function like this is good because we can easily re-run it with just a single command. However, it's a bit laborious to manually re-run it. Say we wanted to re-run the simulation 10 times with the same parameter values to see how many times $A$ goes to fixation, and how many times $B$ goes to fixation. Currently, we'd have to manually run the `unbiased_transmission_1()` function 10 times and record somewhere else what happened in each run. It would be better to automatically re-run the simulation several times and plot each run as a separate line on the same plot. We could also add a line showing the mean value of $p$ across all runs.
Let's use a new parameter $r_{\text{max}}$ to specify the number of independent runs, and use another for-loop to cycle over the $r_{\text{max}}$ runs. Let's rewrite the `unbiased_transmission_1()` function to handle multiple runs. We will call the new function `unbiased_transmission_2()`.
```{r 1.14}
unbiased_transmission_2 <- function(N, t_max, r_max) {
output <- tibble(generation = rep(1:t_max, r_max), p = as.numeric(rep(NA, t_max * r_max)), run = as.factor(rep(1:r_max, each = t_max))) # create the output tibble
for (r in 1:r_max) { # for each run
population <- tibble(trait = sample(c("A", "B"), N, replace = TRUE))
# create first generation
output[output$generation == 1 & output$run == r, ]$p <- sum(population$trait == "A") / N # add first generation's p for run r
for (t in 2:t_max) {
previous_population <- population # copy individuals to previous_population tibble
population <- tibble(trait = sample(previous_population$trait, N, replace = TRUE))
# randomly copy from previous generation
output[output$generation == t & output$run == r, ]$p <- sum(population$trait == "A") / N # get p and put it into output slot for this generation t and run r
}
}
output # export data from function
}
```
There are a few changes here. First, we need a different **output** tibble, because we need to store data for all the runs. For that, we initialise the same $generation$ and $p$ columns as before, but with space for all the runs. $generation$ is now built by repeating the count of each generation $r_{\text{max}}$ times, and $p$ is NA repeated for all generations, for all runs.
We also need a new column called $run$ that keeps track of which run the data in the other two columns belongs to. Note that the definition of $run$ is preceded by `as.factor()`. This specifies the type of data to put in the $run$ column. We want $run$ to be a 'factor' or categorical variable so that, even if runs are labelled with numbers (1, 2, 3...), this should not be misinterpreted as a continuous, real number: there is no sense in which run 2 is twice as 'runny' as run 1, or run 3 half as 'runny' as run 6. Runs could equally have been labelled using letters, or any other arbitrary scheme. While omitting `as.factor()` does not make any difference when running the simulation, it would create problems when plotting the data because ggplot would treat runs as continuous real numbers rather than discrete categories (you can see this yourself by modifying the definition of **output** in the previous code chunk). This is a good example of how it is important to have a clear understanding of your data before trying to plot or analyse them.
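To see this classification for yourself, here is a minimal standalone illustration (using made-up labels, not the simulation data) of how R treats the same labels with and without `as.factor()`:
```{r}
# Illustration only: the same labels treated as numbers vs as a factor
run_numeric <- rep(1:5, each = 2)
run_factor <- as.factor(run_numeric)
class(run_numeric) # "integer": ggplot would map this to a continuous scale
class(run_factor) # "factor": ggplot maps this to discrete colours, one per run
```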
Going back to the function, we then set up an $r$ loop, which executes once for each run. The code within this loop is mostly the same as before, except that we now use the `[output$generation == t & output$run == r, ]` notation to put $p$ into the right place in **output**.
The plotting function is also changed to handle multiple runs:
```{r 1.15}
plot_multiple_runs <- function(data_model) {
ggplot(data = data_model, aes(y = p, x = generation)) +
geom_line(aes(colour = run)) +
stat_summary(fun = mean, geom = "line", size = 1) +
ylim(c(0, 1)) +
theme_bw() +
labs(y = "p (proportion of individuals with trait A)")
}
```
To understand how the above code works, we need to explain the general functioning of ggplot. As explained above, `aes()` specifies the 'aesthetics', or how the data are mapped in the plot. This is independent of the possible visual representations of this mapping, or 'geoms'. If we declare specific aesthetics when we call `ggplot()`, these aesthetics will be applied to all geoms we call afterwards. Alternatively, we can specify the aesthetics in the geom itself. For example this:
```{r 1.16, eval=FALSE}
ggplot(data = output, aes(y = p, x = generation)) +
geom_line()
```
is equivalent to this:
```{r 1.17, eval=FALSE}
ggplot(data = output) +
geom_line(aes(y = p, x = generation))
```
We can use this property to make more complex plots. The plot created in `plot_multiple_runs` has a first geom, `geom_line()`. This inherits the aesthetics specified in the initial call to `ggplot()` but also has a new mapping specific to `geom_line()`, `colour = run`. This tells ggplot to plot each run line with a different colour. The following command, `stat_summary()`, calculates the mean of all runs. However, this only inherits the mapping specified in the initial `ggplot()` call. If in the aesthetic of `stat_summary()` we had also specified `colour = run`, it would separate the data by run, and it would calculate the mean of each run. This, though, is just the lines we have already plotted with the `geom_line()` command. For this reason, we did not put `colour = run` in the `ggplot()` call, only in `geom_line()`. As always, there are various ways to obtain the same result. This code:
```{r 1.18, eval=FALSE}
ggplot(data = output) +
geom_line(aes(y = p, x = generation, colour = run)) +
stat_summary(aes(y = p, x = generation), fun = mean, geom = "line", size = 1)
```
is equivalent to the code we wrapped in the function above. However, the original code is clearer, as it distinguishes the global mapping, and the mappings specific to each visual representation.
`stat_summary()` is a generic ggplot function which can be used to plot different statistics to summarise our data. In this case, we want to calculate the mean of the data mapped on $y$, plot it with a line, and make this line thicker than the lines for the single runs. The default line size for `geom_line()` is 0.5, so `size = 1` doubles the thickness.
Let's now run the function and plot the results for five runs with the same parameters we used at the beginning ($N=100$ and $t_{\text{max}}=200$):
```{r 1.19}
data_model <- unbiased_transmission_2(N = 100, t_max = 200, r_max = 5)
plot_multiple_runs(data_model)
```
You should be able to see five independent runs of our simulation shown as regular thin lines, along with a thicker line showing the mean of these lines. Some runs have probably gone to 0 or 1, and the mean should be somewhere in between. The data is stored in **data_model**, which we can inspect by writing its name.
```{r 1.20}
data_model
```
Now let's run the `unbiased_transmission_2()` model with $N = 10000$, to compare with $N = 100$.
```{r 1.21}
data_model <- unbiased_transmission_2(N = 10000, t_max = 200, r_max = 5)
plot_multiple_runs(data_model)
```
The mean line should be almost exactly at $p = 0.5$ now, with the five independent runs fairly close to it.
## Varying initial conditions
Let's add one final modification. So far the starting frequencies of $A$ and $B$ have been the same, roughly 0.5 each. But what if we were to start at different initial frequencies of $A$ and $B$? Say, $p = 0.2$ or $p = 0.9$? Would unbiased transmission keep $p$ at these initial values, or would it go to $p = 0.5$ as we have found so far?
To find out, we can add another parameter, $p_0$, which specifies the initial probability of an individual having an $A$ rather than a $B$ in the first generation. Previously this was always $p_0 = 0.5$, but in the new function below we add it to the `sample()` function to weight the initial allocation of traits in $t = 1$.
```{r 1.22}
unbiased_transmission_3 <- function(N, p_0, t_max, r_max) {
output <- tibble(generation = rep(1:t_max, r_max), p = as.numeric(rep(NA, t_max * r_max)), run = as.factor(rep(1:r_max, each = t_max)))
for (r in 1:r_max) {
population <- tibble(trait = sample(c("A", "B"), N, replace = TRUE, prob = c(p_0, 1 - p_0)))
# create first generation
output[output$generation == 1 & output$run == r, ]$p <- sum(population$trait == "A") / N # add first generation's p for run r
for (t in 2:t_max) {
previous_population <- population # copy individuals to previous_population tibble
population <- tibble(trait = sample(previous_population$trait, N, replace = TRUE))
# randomly copy from previous generation
output[output$generation == t & output$run == r, ]$p <- sum(population$trait == "A") / N # get p and put it into output slot for this generation t and run r
}
}
output # export data from function
}
```
`unbiased_transmission_3()` is almost identical to the previous function. The only changes are the addition of $p_0$ as an argument to the function, and the $prob$ argument in the `sample()` command. The $prob$ argument gives the probability of picking each option, in our case $A$ and $B$, in the first generation. The probability of $A$ is now $p_0$, and the probability of $B$ is now $1 - p_0$. We can use the same plotting function as before to visualise the result. Let's see what happens with a different value of $p_0$, for example $p_0 = 0.2$.
```{r 1.23}
data_model <- unbiased_transmission_3(N = 10000, p_0 = 0.2, t_max = 200, r_max = 5)
plot_multiple_runs(data_model)
```
With $p_0 = 0.2$, trait frequencies stay at $p = 0.2$. Unbiased transmission is truly non-directional: it maintains trait frequencies at whatever they were in the previous generation, barring random fluctuations caused by small population sizes.
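We can confirm this numerically with a quick check on the **data_model** tibble just created, averaging $p$ in the final generation across the five runs:
```{r}
# Mean frequency of A in the final generation; this should be close to p_0 = 0.2
data_model %>%
  filter(generation == max(generation)) %>%
  summarise(mean_final_p = mean(p))
```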
***
***
## Analytical model {-}
If $p$ is the frequency of $A$ in one generation, we are interested in calculating $p'$, the frequency of $A$ in the next generation under the assumption of unbiased transmission. Each new individual in the next generation picks a demonstrator at random from among the previous generation. The demonstrator will have $A$ with probability $p$. The frequency of $A$ in the next generation, then, is simply the frequency of $A$ in the previous generation:
$$p' = p \hspace{30 mm}(1.1)$$
Equation 1.1 simply says that under unbiased transmission there is no change in $p$ over time. If, as we assumed above, the initial value of $p$ in a particular population is $p_0$, then the equilibrium value of $p$, $p^*$, at which there is no change in $p$ over time, is just $p_0$.
We can plot this recursion, to recreate the final simulation plot above:
```{r 1.24}
p_0 <- 0.2
t_max <- 200
pop_analytical <- tibble(p = rep(NA, t_max), generation = 1:t_max)
pop_analytical$p[1] <- p_0
for (i in 2:t_max) {
pop_analytical$p[i] <- pop_analytical$p[i - 1]
}
ggplot(data = pop_analytical, aes(y = p, x = generation)) +
geom_line() +
ylim(c(0, 1)) +
theme_bw() +
labs(y = "p (proportion of individuals with trait A)")
```
Here, we use a **for** loop to cycle through each generation, each time updating $p$ according to the recursion equation above. Remember, there is no $N$ here because the recursion is deterministic and assumes an infinite population size; hence there is no stochasticity due to finite population sizes. There is also no need to have multiple runs as each run is identical, hence no $r_{\text{max}}$.
Don't worry, it gets more complicated than this in later chapters. The key point here is that analytical (or deterministic) models assume infinite populations and no stochasticity. Simulations with very large populations should give the same results as analytical models. Basically, the closer we can get in stochastic models to the assumption of infinite populations, the closer the match to infinite-population deterministic models. Deterministic models give the ideal case; stochastic models permit more realistic dynamics based on finite populations.
More generally, creating deterministic recursion-based models can be a good way of verifying simulation models, and vice versa: if the same dynamics occur in both individual-based and recursion-based models, then we can be more confident that those dynamics are genuine and not the result of a programming error or mathematical mistake.
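As a quick sketch of this verification idea, we can overlay the deterministic recursion (the **pop_analytical** tibble created above) on stochastic runs of `unbiased_transmission_3()` with a large $N$; the dashed analytical line should sit on top of the simulation mean:
```{r}
data_model <- unbiased_transmission_3(N = 10000, p_0 = 0.2, t_max = 200, r_max = 5)
plot_multiple_runs(data_model) +
  geom_line(data = pop_analytical, aes(y = p, x = generation), linetype = "dashed")
```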
***
***
## Summary of the model
Even this extremely simple model provides some valuable insights. First, unbiased transmission does not in itself change trait frequencies. As long as populations are large, trait frequencies remain the same.
Second, the smaller the population size, the more likely traits are to be lost by chance. This is a basic insight from population genetics, known there as genetic drift, but it can also be applied to cultural evolution. Many studies have tested (and some supported) the idea that population size and other demographic factors can shape cultural diversity.
Furthermore, generating expectations about cultural change under simple assumptions like random cultural drift can be useful for detecting non-random patterns like selection. If we don't have such a baseline, we won't recognise selection or other directional processes when we see them.
We have also introduced several programming techniques that will be useful in later simulations. We have seen how to use tibbles to hold the characteristics of individuals and the outputs of simulations, how to use loops to cycle through generations and simulation runs, how to use `sample()` to pick randomly from sets of elements, how to wrap simulations in functions to easily re-run them with different parameter values, and how to use `ggplot()` to plot the results of simulations.
## Further reading
@cavalli-sforza_cultural_1981 explored how cultural drift affects cultural evolution, which was extended by @neiman_stylistic_1995 in an archaeological context. @bentley_random_2004 present models of unbiased transmission for several cultural datasets. @lansing_domain_2011 and commentaries explore the underlying assumptions of applying random drift to cultural evolution.
<!--chapter:end:01-Unbiased_transmission.Rmd-->
# Unbiased and biased mutation
Evolution doesn't work without a source of new variation upon which selection, drift and other processes can act. In genetic evolution, mutation is almost always blind with respect to function. Beneficial genetic mutations are no more likely to arise when they are needed than when they are not needed - in fact, most genetic mutations are neutral or detrimental to an organism. Cultural evolution is more interesting, in that novel variation may sometimes be directed to solve specific problems, or systematically biased due to features of our cognition. In the models below, we'll simulate both unbiased and biased mutation.
## Unbiased mutation
First, we will simulate unbiased mutation in the same basic model as used in the previous chapter. We'll remove unbiased transmission to see the effect of unbiased mutation alone.
As in the previous model, we assume $N$ individuals each of whom possesses one of two discrete cultural traits, denoted $A$ and $B$. In each generation, from $t = 1$ to $t = t_{\text{max}}$, the $N$ individuals are replaced with $N$ new individuals. Instead of random copying, each individual now gives rise to a new individual with the same cultural trait as them. (Another way of looking at this is in terms of timesteps, such as years: the same $N$ individuals live for $t_{\text{max}}$ years and keep their cultural trait from one year to the next.)
At each generation, however, there is a probability $\mu$ that each individual mutates from their current trait to the other trait (the Greek letter mu, $\mu$, is the standard notation for the mutation rate in genetic evolution, and it has an analogous function here). For example, vegetarian individuals can decide to eat animal products, and vice versa. Remember, this is not copied from other individuals, as in the previous model, but can be thought of as an individual decision. Another way to see this is that the probability of changing trait applies to each individual independently; whether an individual mutates has no bearing on whether or how many other individuals have mutated. On average, this means that $\mu N$ individuals mutate each generation. As in the previous model, we are interested in tracking the proportion $p$ of agents with trait $A$ over time.
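As a quick numerical check of the claim that, on average, $\mu N$ individuals mutate each generation (an illustration only, using $N = 100$ and $\mu = 0.05$, so the expectation is 5), we can repeat the random draw many times and average the number of mutants:
```{r}
# Average number of mutants over 1000 repeated draws; expected value is mu * N = 5
mean(replicate(1000, sum(sample(c(TRUE, FALSE), 100, prob = c(0.05, 0.95), replace = TRUE))))
```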
We'll wrap this in a function called `unbiased_mutation()`, using much of the same code as `unbiased_transmission_3()`. As before, we need to call the tidyverse library. We also set a seed for the random number generator, so that the results will be the same each time we rerun the code. Of course, if you want to see the stochasticity inherent in the simulation, you can remove the `set.seed()` command, or set it to a different number.
```{r 2.1, message = FALSE}
library(tidyverse)
set.seed(111)
unbiased_mutation <- function(N, mu, p_0, t_max, r_max) {
output <- tibble(generation = rep(1:t_max, r_max), p = as.numeric(rep(NA, t_max * r_max)), run = as.factor(rep(1:r_max, each = t_max))) # create the output tibble
for (r in 1:r_max) {
population <- tibble(trait = sample(c("A", "B"), N, replace = TRUE, prob = c(p_0, 1 - p_0)))
output[output$generation == 1 & output$run == r, ]$p <- sum(population$trait == "A") / N # add first generation's p for run r
for (t in 2:t_max) {
previous_population <- population # copy individuals to previous_population tibble
mutate <- sample(c(TRUE, FALSE), N, prob = c(mu, 1 - mu), replace = TRUE) # determine 'mutant' individuals
if (nrow(population[mutate & previous_population$trait == "A", ]) > 0) { # if there are 'mutants' from A to B
population[mutate & previous_population$trait == "A", ]$trait <- "B" # then flip them to B
}
if (nrow(population[mutate & previous_population$trait == "B", ]) > 0) { # if there are 'mutants' from B to A
population[mutate & previous_population$trait == "B", ]$trait <- "A" # then flip them to A
}
output[output$generation == t & output$run == r, ]$p <- sum(population$trait == "A") / N # get p and put it into output slot for this generation t and run r
}
}
output # export data from function
}
```
The only changes from the previous model are the addition of $\mu$, the parameter that specifies the probability of mutation, in the function definition and new lines of code within the `for` loop on $t$ which replace the random copying command with unbiased mutation. Let's examine these lines to see how they work.
The most obvious way of implementing unbiased mutation - which is not done above - would have been to set up another `for` loop. We would cycle through each individual one by one, each time deciding whether it should mutate or not based on $\mu$. This would certainly work, but R is notoriously slow at loops. It's always preferable in R, where possible, to use 'vectorised' code. That's what is done above in our three added lines, starting from `mutate <- sample()`.
First, we pre-specify for each individual whether it mutates. For this, we again use the function `sample()`, picking `TRUE` (corresponding to being a mutant) or `FALSE` (not mutating, i.e. keeping the same cultural trait) $N$ times. The draw, however, is not uniform: the probability of drawing `TRUE` is equal to $\mu$, and the probability of drawing `FALSE` is $1-\mu$. You can think about the procedure in this way: each individual in the population flips a biased coin that has probability $\mu$ of landing on, say, heads, and $1-\mu$ of landing on tails. If it lands on heads, they change their cultural trait.
After that, in the following lines, we change the traits of the 'mutant' individuals. We need to check whether any individuals change their trait, both from $A$ to $B$ and vice versa, using the two `if` conditionals: if there are no such individuals, assigning a new value to an empty tibble returns an error. To check, we make sure that the number of rows is greater than 0 (using `nrow() > 0` within the `if`).
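For comparison, here is a sketch of what the slower loop-based alternative would look like (illustration only, not run; it assumes `N`, `mu`, **population** and **previous_population** exist as they do inside the function above):
```{r, eval=FALSE}
# Loop-based equivalent of the vectorised mutation step; much slower in R
for (i in 1:N) {
  if (runif(1) < mu) { # individual i mutates with probability mu
    if (previous_population$trait[i] == "A") {
      population$trait[i] <- "B" # A mutates to B
    } else {
      population$trait[i] <- "A" # B mutates to A
    }
  }
}
```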
To plot the results, we can use the same function `plot_multiple_runs()` we wrote in the [previous chapter][Unbiased transmission], reproduced here for convenience.
```{r 2.2, echo=FALSE}
plot_multiple_runs <- function(data_model) {
ggplot(data = data_model, aes(y = p, x = generation)) +
geom_line(aes(colour = run)) +
stat_summary(fun = mean, geom = "line", size = 1) +
ylim(c(0, 1)) +
theme_bw() +
labs(y = "p (proportion of individuals with trait A)")
}
```
Let's now run and plot the model:
```{r 2.3}
data_model <- unbiased_mutation(N = 100, mu = 0.05, p_0 = 0.5, t_max = 200, r_max = 5)
plot_multiple_runs(data_model)
```
Unbiased mutation produces random fluctuations over time and does not alter the overall frequency of $A$, which stays around $p = 0.5$. Because mutations from $A$ to $B$ are just as likely as mutations from $B$ to $A$, there is no overall directional trend.
If you remember from the previous chapter, with unbiased transmission, when populations were small (e.g. $N=100$), one of the traits generally disappeared after a few generations. Here, though, with $N=100$, both traits remain until the end of the simulation. Why this difference? You can think of it in this way: when one trait becomes popular, say when the frequency of $A$ is equal to $0.8$, with unbiased transmission individuals of the new generation are more likely to pick up $A$ when copying at random; the few individuals with trait $B$ each have an 80% probability of copying $A$. With unbiased mutation, on the other hand, since $\mu$ is applied independently to each individual, when $A$ is common there will be more individuals flipping from $A$ to $B$ (on average $\mu p N$ individuals, in our case 4) than flipping from $B$ to $A$ ($\mu (1-p) N$ individuals, in our case 1), keeping the traits at similar frequencies.
But what if we were to start at different initial frequencies of $A$ and $B$? Say, $p=0.1$ or $p=0.9$? Would $A$ disappear? Would unbiased mutation keep $p$ at these initial values, as we saw unbiased transmission does in Model 1?
To find out, let's change $p_0$, which specifies the initial probability of drawing an $A$ rather than a $B$ in the first generation.
```{r 2.4}
data_model <- unbiased_mutation(N = 100, mu = 0.05, p_0 = 0.1, t_max = 200, r_max = 5)
plot_multiple_runs(data_model)
```
You should see $p$ go from 0.1 up to 0.5. In fact, whatever the initial starting frequencies of $A$ and $B$, unbiased mutation always leads to $p = 0.5$, for the reason explained above: unbiased mutation always tends to balance the proportion of $A$s and $B$s.
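The same happens starting from the other extreme: with $p_0 = 0.9$, $p$ should decline to 0.5.
```{r}
data_model <- unbiased_mutation(N = 100, mu = 0.05, p_0 = 0.9, t_max = 200, r_max = 5)
plot_multiple_runs(data_model)
```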
## Biased mutation
A more interesting case is biased mutation. Let's assume now that there is a probability $\mu_b$ that an individual with trait $B$ mutates into $A$, but there is no possibility of trait $A$ mutating into trait $B$. Perhaps trait $A$ is a particularly catchy or memorable version of a story or an intuitive explanation of a phenomenon, and $B$ is difficult to remember or unintuitive to understand.
The function `biased_mutation()` captures this unidirectional mutation.
```{r 2.5}
biased_mutation <- function(N, mu_b, p_0, t_max, r_max) {
output <- tibble(generation = rep(1:t_max, r_max), p = as.numeric(rep(NA, t_max * r_max)), run = as.factor(rep(1:r_max, each = t_max))) # create the output tibble
for (r in 1:r_max) {
population <- tibble(trait = sample(c("A", "B"), N, replace = TRUE, prob = c(p_0, 1 - p_0)))
output[output$generation == 1 & output$run == r, ]$p <- sum(population$trait == "A") / N # add first generation's p for run r
for (t in 2:t_max) {
previous_population <- population # copy individuals to previous_population tibble
mutate <- sample(c(TRUE, FALSE), N, prob = c(mu_b, 1 - mu_b), replace = TRUE) # find 'mutant' individuals
if (nrow(population[mutate & previous_population$trait == "B", ]) > 0) {
population[mutate & previous_population$trait == "B", ]$trait <- "A" # if individual was B and mutates, flip to A
}
output[output$generation == t & output$run == r, ]$p <- sum(population$trait == "A") / N # get p and put it into output slot for this generation t and run r
}
}
output # export data from function
}
```
There are just two changes in this code compared to `unbiased_mutation()`. First, we've replaced $\mu$ with $\mu_b$ to keep the two parameters distinct and avoid confusion. Second, the line in `unbiased_mutation()` which caused individuals with $A$ to mutate to $B$ has been deleted.
Let's see what effect this has by running `biased_mutation()`. We'll start with the population entirely composed of individuals with $B$, i.e. $p_0 = 0$, to see how quickly and in what manner $A$ spreads via biased mutation.
```{r 2.6}
data_model <- biased_mutation(N = 100, mu_b = 0.05, p_0 = 0, t_max = 200, r_max = 5)
plot_multiple_runs(data_model)
```
The plot shows a steep increase that slows and plateaus at $p = 1$ by around generation $t = 100$. There should be a bit of fluctuation in the different runs, but not much. Now let's try a larger population size.
```{r 2.7}
data_model <- biased_mutation(N = 10000, mu_b = 0.05, p_0 = 0, t_max = 200, r_max = 5)
plot_multiple_runs(data_model)
```
With $N = 10000$ the line should be smooth with little (if any) fluctuation across the runs. But notice that it plateaus at about the same generation, around $t = 100$. Population size has little effect on the rate at which a novel trait spreads via biased mutation. $\mu_b$, on the other hand, does affect this speed. Let's double the biased mutation rate to 0.1.
```{r 2.8}
data_model <- biased_mutation(N = 10000, mu_b = 0.1, p_0 = 0, t_max = 200, r_max = 5)
plot_multiple_runs(data_model)
```
Now trait $A$ reaches fixation around generation $t = 50$. Play around with $N$ and $\mu_b$ to confirm that the latter determines the rate of diffusion of trait $A$, and that it takes the same form each time - roughly an 'r' shape with an initial steep increase followed by a plateauing at $p = 1$.
***
***
## Analytical model {-}
If $p$ is the frequency of $A$ in one generation, we are interested in calculating $p'$, the frequency of $A$ in the next generation under the assumption of unbiased mutation. The next generation retains the cultural traits of the previous generation, except that a proportion $\mu$ of them switch to the other trait. There are therefore two sources of $A$ in the next generation: members of the previous generation who had $A$ and didn't mutate, therefore staying $A$, and members of the previous generation who had $B$ and did mutate, therefore switching to $A$. The frequency of $A$ in the next generation is therefore:
$$p' = p(1-\mu) + (1-p)\mu \hspace{30 mm}(2.1)$$
The first term on the right-hand side of Equation 2.1 represents the first group, the $(1 - \mu)$ proportion of the $p$ $A$-carriers who didn't mutate. The second term represents the second group, the $\mu$ proportion of the $1 - p$ $B$-carriers who did mutate.
To calculate the equilibrium value of $p$, $p^*$, we want to know when $p' = p$, or when the frequency of $A$ in one generation is identical to the frequency of $A$ in the next generation. This can be found by setting $p' = p$ in Equation 2.1, which gives:
$$p = p(1-\mu) + (1-p)\mu \hspace{30 mm}(2.2)$$
Expanding the right-hand side of Equation 2.2 gives $p = p - p\mu + \mu - p\mu$. Subtracting $p$ from both sides and factoring out $\mu$ gives:
$$\mu(1 - 2p) = 0 \hspace{30 mm}(2.3)$$
The left-hand side of Equation 2.3 equals zero when either $\mu = 0$, which given our assumption that $\mu > 0$ cannot be the case, or when $1 - 2p = 0$, which after rearranging gives the single equilibrium $p^* = 0.5$. This matches our simulation results above. As we found in the simulations, this does not depend on $\mu$ or the starting frequency of $p$.
We can also plot the recursion in Equation 2.1 like so:
```{r 2.9}
p_0 <- 0
t_max <- 200
mu <- 0.1
pop_analytical <- tibble(p = rep(NA, t_max), generation = 1:t_max)
pop_analytical$p[1] <- p_0
for (i in 2:t_max) {
pop_analytical$p[i] <- pop_analytical$p[i - 1] * (1 - mu) + (1 - pop_analytical$p[i - 1]) * mu
}
ggplot(data = pop_analytical, aes(y = p, x = generation)) +
geom_line() +
ylim(c(0, 1)) +
theme_bw() +
labs(y = "p (proportion of individuals with trait A)")
```
Again, this should resemble the figure generated by the simulations above, and confirm that $p^* = 0.5$.
For biased mutation, assume that only $B$s are switching to $A$, and with probability $\mu_b$ instead of $\mu$. The first term on the right-hand side becomes simply $p$, because $A$s do not switch. The second term remains the same, but with $\mu_b$. Thus,
$$p' = p + (1-p)\mu_b \hspace{30 mm}(2.4)$$
The equilibrium value $p^*$ can be found by again setting $p' = p$, which reduces Equation 2.4 to $(1-p)\mu_b = 0$. Assuming $\mu_b > 0$, this gives the single equilibrium $p^* = 1$, which again matches the simulation results.
We can plot the above recursion like so:
```{r 2.10}
p_0 <- 0
t_max <- 200
mu_b <- 0.1
pop_analytical <- tibble(p = rep(NA, t_max), generation = 1:t_max)
pop_analytical$p[1] <- p_0
for (i in 2:t_max) {
pop_analytical$p[i] <- pop_analytical$p[i - 1] + (1 - pop_analytical$p[i - 1]) * mu_b
}
ggplot(data = pop_analytical, aes(y = p, x = generation)) +
geom_line() +
ylim(c(0, 1)) +
theme_bw() +
labs(y = "p (proportion of individuals with trait A)")
```
Hopefully, this looks identical to the final simulation plot with the same value of $\mu_b$.
Furthermore, we can specify an equation for the change in $p$ from one generation to the next, or $\Delta p$. We do this by subtracting $p$ from both sides of Equation 2.4, giving:
$$\Delta p = p' - p = (1-p)\mu_b \hspace{30 mm}(2.5)$$
Seeing this helps explain two things. First, the $1 - p$ part explains the r-shape of the curve: the smaller $p$ is, the larger $\Delta p$ will be. This explains why $p$ increases very quickly at first, when $p$ is near zero, and why the increase slows as $p$ gets larger. We have already determined that the increase stops altogether ($\Delta p = 0$) when $p = p^* = 1$.
Second, it says that the rate of increase is proportional to $\mu_b$. This explains our observation in the simulations that larger values of $\mu_b$ cause $p$ to reach its maximum value faster.
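Both points can be seen with a quick calculation of $\Delta p$ at a few values of $p$ (an illustration only):
```{r}
mu_b <- 0.1
p <- c(0.1, 0.5, 0.9)
(1 - p) * mu_b # delta p is largest when p is small, and scales with mu_b
```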
***
***
## Summary of the model
With this simple model, we can draw the following insights. Unbiased mutation, which resembles genetic mutation in being non-directional, always leads to an equal mix of the two traits. It introduces and maintains cultural variation in the population. It is interesting to compare unbiased mutation to unbiased transmission from Model 1. While unbiased transmission did not change $p$ over time, unbiased mutation always converges on $p^* = 0.5$, irrespective of the starting frequency. (NB $p^* = 0.5$ assuming there are two traits; more generally, $p^* = 1/v$, where $v$ is the number of traits.)
Biased mutation, which is far more common - perhaps even typical - in cultural evolution, shows different dynamics. Novel traits favoured by biased mutation spread in a characteristic fashion - an r-shaped diffusion curve - with a speed characterised by the mutation rate $\mu_b$. Population size has little effect, whether $N = 100$ or $N = 10000$. Whenever biased mutation is present ($\mu_b > 0$), the favoured trait goes to fixation, even if it is not initially present.
In terms of programming techniques, the major novelty in Model 2 is the use of `sample()` to determine which individuals undergo some event that occurs with a fixed probability (in our case, mutation). This could be done with a loop, but vectorising code in the way we did here is much faster in R than looping.
## Further reading
@boyd_culture_1985 model what they call 'guided variation', which is equivalent to biased mutation as modelled in this chapter. @henrich_cultural_2001 shows how biased mutation / guided variation generates r-shaped curves similar to those generated here.
<!--chapter:end:02-Unbiased_and_biased_mutation.Rmd-->
# Biased transmission: direct bias
So far we have looked at unbiased transmission ([Chapter 1][Unbiased transmission]) and mutation, both unbiased and biased ([Chapter 2][Unbiased and biased mutation]). Let's complete the set by looking at biased transmission. This occurs when one trait is more likely to be copied than another trait. When the choice depends on the features of the trait, it is often called 'direct' or 'content' bias. When the choice depends on features of the demonstrators (the individuals from whom one is copying), it is often called 'indirect' or 'context' bias. Both are sometimes also called 'cultural selection' because one trait is selected to be copied over another trait. In this chapter, we will look at trait-based (direct, content) bias.
(As an aside, there is a confusing array of terminology in the field of cultural evolution, as illustrated by the preceding paragraph. That's why models are so useful. Words and verbal descriptions can be ambiguous. Often the writer doesn't realise that there are hidden assumptions or unrecognised ambiguities in their descriptions. They may not realise that what they mean by 'cultural selection' is entirely different from how someone else uses it. Models are great because they force us to precisely specify exactly what we mean by a particular term or process. We can use the words in the paragraph above to describe biased transmission, but it's only really clear when we model it, making all our assumptions explicit.)
To simulate biased transmission, following the simulations in [Chapter 1][Unbiased transmission], we assume there are two traits $A$ and $B$, and that each individual chooses another individual from the previous generation at random. This time, however, we give the traits two different probabilities of being copied: we can call them $s_a$ and $s_b$, respectively. When an individual encounters another individual with trait $A$, they will copy them with probability $s_a$. When they encounter an individual with trait $B$, they will copy them with probability $s_b$.
With $s_a=s_b$, copying is unbiased: individuals switch to the encountered alternative with the same probability whichever trait it is, reproducing the results of the unbiased transmission simulations. If $s_a=s_b=1$, the model is exactly the same as in [Chapter 1][Unbiased transmission]. The relevant situation in this chapter is when $s_a>s_b$ (or vice versa), so that we have biased transmission. Perhaps $A$ (or $B$) is a more effective tool, a more memorable story, or a more easily pronounced word.
Let's first write the function, and then explore what happens in this case. Below is a function `biased_transmission_direct()` that implements all of these ideas.
```{r 3.1}
library(tidyverse)
set.seed(111)
biased_transmission_direct <- function (N, s_a, s_b, p_0, t_max, r_max) {
output <- tibble(generation = rep(1:t_max, r_max), p = as.numeric(rep(NA, t_max * r_max)), run = as.factor(rep(1:r_max, each = t_max)))
for (r in 1:r_max) {
population <- tibble(trait = sample(c("A", "B"), N, replace = TRUE, prob = c(p_0, 1 - p_0))) # create first generation
output[output$generation == 1 & output$run == r, ]$p <- sum(population$trait == "A") / N # add first generation's p for run r
for (t in 2:t_max) {
previous_population <- population # copy individuals to previous_population tibble
demonstrator_trait <- tibble(trait = sample(previous_population$trait, N, replace = TRUE))
# for each individual, pick a random individual from the previous generation to act as demonstrator and store their trait
# biased probabilities to copy:
copy_a <- sample(c(TRUE, FALSE), N, prob = c(s_a, 1 - s_a), replace = TRUE)
copy_b <- sample(c(TRUE, FALSE), N, prob = c(s_b, 1 - s_b), replace = TRUE)
if (nrow(population[copy_a & demonstrator_trait == "A", ]) > 0) {
population[copy_a & demonstrator_trait == "A", ]$trait <- "A"
}
if (nrow(population[copy_b & demonstrator_trait == "B", ]) > 0) {
population[copy_b & demonstrator_trait == "B", ]$trait <- "B"
}
output[output$generation == t & output$run == r, ]$p <- sum(population$trait == "A") / N # get p and put it into output slot for this generation t and run r
}
}
output # export data from function
}
```
Most of `biased_transmission_direct()` is recycled from the previous models. As before, we initialise the data structure **output** from multiple runs, and in generation $t = 1$, we create a **population** tibble to hold the trait of each individual.
The major change is that we now include biased transmission. We first select at random the demonstrators from the previous generation (using the same code we used in `unbiased_transmission()`) and store their traits in **demonstrator_trait**. Then we get the probabilities of copying $A$ and of copying $B$ for the entire population, using the same code used in `biased_mutation()`, except that this time it produces a probability of copying rather than of mutating. Again using the same code as in `biased_mutation()`, we have the individuals copy the trait at hand with the desired probability.
Let's run our function `biased_transmission_direct()`. As before, to plot the results, we can use the same function `plot_multiple_runs()` we wrote in [Chapter 1][Unbiased transmission].
As noted above, the interesting case is when one trait is favoured over the other. We can assume, for example, $s_a=0.1$ and $s_b=0$. This means that when individuals encounter another individual with trait $A$, they copy them 1 in every 10 times, but when they encounter an individual with trait $B$, they never switch. We also assume that the favoured trait, $A$, is initially rare in the population ($p_0=0.01$) to see how selection favours this initially-rare trait. (Note that $p_0$ needs to be higher than 0: since there is no mutation in this model, we need to include at least some $A$s at the beginning of the simulation, otherwise $A$ could never appear.)
```{r 3.2, echo=FALSE}
plot_multiple_runs <- function(data_model) {
  ggplot(data = data_model, aes(y = p, x = generation)) +
    geom_line(aes(colour = run)) +
    stat_summary(fun = mean, geom = "line", size = 1) +
    ylim(c(0, 1)) +
    theme_bw() +
    labs(y = "p (proportion of individuals with trait A)")
}
```
```{r 3.3}
data_model <- biased_transmission_direct(N = 10000, s_a = 0.1, s_b = 0,
                                         p_0 = 0.01, t_max = 150, r_max = 5)
plot_multiple_runs(data_model)
```
With this moderate selection strength, we can see that $A$ gradually replaces $B$ and goes to fixation. It does so in a characteristic manner: the increase is slow at first, then picks up speed, then plateaus.
Note the difference from biased mutation. Where biased mutation was r-shaped, with a steep initial increase, biased transmission is s-shaped, with an initial slow uptake. This is because the strength of biased transmission (like selection in general) is proportional to the variation in the population. When $A$ is initially rare, there is only a small chance of picking another individual with $A$. As $A$ spreads, the chance of picking an $A$ individual increases. As $A$ becomes very common, there are few $B$ individuals left to switch. With biased mutation, in contrast, the probability of switching is independent of the variation in the population.
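To visualise this contrast, here is a quick sketch (our addition, not part of the model code). It iterates two simple recursions side by side: a one-way biased mutation, assumed here to follow $p' = p + \mu(1-p)$, and the biased transmission rule $p' = p + sp(1-p)$ that we derive analytically below, both with rate 0.1.
```{r}
t_max <- 150
curves <- tibble(generation = 1:t_max,
                 mutation = as.numeric(rep(NA, t_max)),
                 transmission = as.numeric(rep(NA, t_max)))
curves$mutation[1] <- 0.01 # both processes start from the same low frequency
curves$transmission[1] <- 0.01
for (i in 2:t_max) {
  # r-shaped: switching does not depend on the frequency of A
  curves$mutation[i] <- curves$mutation[i - 1] +
    0.1 * (1 - curves$mutation[i - 1])
  # s-shaped: switching requires meeting an A-bearer, so it scales with p
  curves$transmission[i] <- curves$transmission[i - 1] +
    0.1 * curves$transmission[i - 1] * (1 - curves$transmission[i - 1])
}
ggplot(data = pivot_longer(curves, c(mutation, transmission),
                           names_to = "process", values_to = "p"),
       aes(y = p, x = generation, linetype = process)) +
  geom_line() +
  ylim(c(0, 1)) +
  theme_bw() +
  labs(y = "p (proportion of individuals with trait A)")
```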
## Strength of selection
On what does the strength of selection depend? Notably, not on the specific values of $s_a$ and $s_b$: what counts is their relative difference, in this case $s_a-s_b = 0.1$. If we run a simulation with, say, $s_a=0.6$ and $s_b=0.5$, we see the same pattern, albeit with slightly more noise; that is, the single runs differ more from one another than in the previous simulation. This is because switches from $A$ to $B$ are now also possible.
```{r 3.4}
data_model <- biased_transmission_direct(N = 10000, s_a = 0.6, s_b = 0.5,
                                         p_0 = 0.01, t_max = 150, r_max = 5)
plot_multiple_runs(data_model)
```
To change the selection strength, we need to modify the difference between $s_a$ and $s_b$. We can double the strength by setting $s_a = 0.2$, and keeping $s_b=0$.
```{r 3.5}
data_model <- biased_transmission_direct(N = 10000, s_a = 0.2, s_b = 0,
                                         p_0 = 0.01, t_max = 150, r_max = 5)
plot_multiple_runs(data_model)
```
As we might expect, increasing the strength of selection increases the speed with which $A$ goes to fixation. Note, though, that it retains the s-shape.
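As a rough check on this speed-up, here is a small exploration (our addition, not in the original text) that reruns the simulation for a few values of $s_a$ and records, for each run, the first generation in which $p$ exceeds 0.99:
```{r}
for (s in c(0.1, 0.2, 0.4)) {
  data_model <- biased_transmission_direct(N = 10000, s_a = s, s_b = 0,
                                           p_0 = 0.01, t_max = 150, r_max = 5)
  # first generation in which p exceeds 0.99, per run
  # (a run that never got there would give Inf)
  fixation <- data_model %>%
    group_by(run) %>%
    summarise(first_generation = min(generation[p > 0.99]))
  print(c(s = s, mean_first_generation = mean(fixation$first_generation)))
}
```
Doubling the difference between $s_a$ and $s_b$ should roughly halve the number of generations needed to approach fixation.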
***
***
## Analytical model {-}
As before, we have $p$ individuals with trait $A$ and $1 - p$ individuals with trait $B$. Since what matters is the relative difference between the probabilities of copying the two traits, and not their absolute values, we can always set $s_b=0$ and vary $s_a$, which we call simply $s$. Thus, the $p$ individuals with trait $A$ always keep their $A$s. Each of the $1 - p$ individuals with trait $B$ picks another individual at random, who has trait $A$ with probability $p$, and then switches to $A$ with probability $s$. We can therefore write the recursion for $p$ under biased transmission as:
$$p' = p + p(1-p)s \hspace{30 mm}(3.1)$$
The first term on the right-hand side is the unchanged $A$-bearers, and the second term is the $1-p$ $B$-bearers who find one of the $p$ $A$-bearers and switch with probability $s$.
Here is some code to plot this biased transmission recursion:
```{r 3.6}
t_max <- 150
s <- 0.1
pop_analytical <- tibble(p = rep(NA, t_max), generation = 1:t_max)
pop_analytical$p[1] <- 0.01
for (i in 2:t_max) {
  pop_analytical$p[i] <- pop_analytical$p[i - 1] +
    pop_analytical$p[i - 1] * (1 - pop_analytical$p[i - 1]) * s
}
ggplot(data = pop_analytical, aes(y = p, x = generation)) +
  geom_line() +
  ylim(c(0, 1)) +
  theme_bw() +
  labs(y = "p (proportion of individuals with trait A)")
```
This curve should be identical to the one from our first simulation above, which used the same biased transmission strength ($s_a = 0.1$, $s_b = 0$) and a large enough $N$ to minimise stochasticity.
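As a quick visual check (our addition, not in the original text), we can rerun the simulation with those parameters and overlay the analytical recursion on the simulated runs:
```{r}
data_model <- biased_transmission_direct(N = 10000, s_a = 0.1, s_b = 0,
                                         p_0 = 0.01, t_max = 150, r_max = 5)
# dashed line: the analytical recursion computed above
plot_multiple_runs(data_model) +
  geom_line(data = pop_analytical, aes(y = p, x = generation), linetype = "dashed")
```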
From the equation above, we can see how the strength of biased transmission depends on variation in the population: the term $p(1 - p)$ measures that variation, being highest when $p = 0.5$ and shrinking to zero as $p$ approaches 0 or 1. This term determines the shape of the curve, while $s$ determines the speed with which the equilibrium $p^*$ is reached.
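To see this at a glance, here is a short sketch (our addition) plotting the expected one-generation change, $\Delta p = sp(1-p)$, as a function of $p$:
```{r}
s <- 0.1
change <- tibble(p = seq(0, 1, by = 0.01))
# the expected change peaks at p = 0.5, where variation is greatest,
# and vanishes at p = 0 and p = 1
change$delta_p <- s * change$p * (1 - change$p)
ggplot(data = change, aes(y = delta_p, x = p)) +
  geom_line() +
  theme_bw() +
  labs(y = "expected change in p in one generation")
```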
But what is the equilibrium $p^*$ here? In fact, there are two. As before, the equilibrium can be found by setting the change in $p$ to zero, or when:
$$p(1-p)s = 0 \hspace{30 mm}(3.2)$$
There are three ways in which the left-hand side can equal zero: when $p = 0$, when $p = 1$ and when $s = 0$. The last case is uninteresting: it would mean that biased transmission is not occurring. The first two cases simply say that if either trait reaches fixation, then it will stay at fixation. This is to be expected, given that we have no mutation in our model. It contrasts with unbiased and biased mutation, where there is only one equilibrium value of $p$.
We can also say that $p = 0$ is an unstable equilibrium, meaning that any slight perturbation away from $p = 0$ moves $p$ away from that value. This is essentially what we simulated above: a slight perturbation up to $p = 0.01$ went all the way up to $p = 1$. In contrast, $p = 1$ is a stable equilibrium: any slight perturbation from $p = 1$ immediately goes back to $p = 1$.
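We can verify this with a minimal numerical sketch (our addition): apply the recursion once to a small perturbation away from each equilibrium.
```{r}
s <- 0.1
p <- c(0.01, 0.99) # small perturbations away from p = 0 and p = 1
# in both cases p moves towards 1: away from the unstable equilibrium
# at p = 0, and back towards the stable equilibrium at p = 1
p + p * (1 - p) * s
```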
***
***
## Summary of the model
We have seen how biased transmission causes a trait favoured by cultural selection to spread and go to fixation in a population, even when it is initially very rare. Biased transmission differs in its dynamics from biased mutation. Its action is proportional to the variation in the population at the time at which it acts. It is strongest when there is lots of variation (in our model, when there are equal numbers of $A$ and $B$ at $p = 0.5$), and weakest when there is little variation (when $p$ is close to 0 or 1).
## Further reading
@boyd_culture_1985 modelled direct bias, while @henrich_cultural_2001 added directly biased transmission to his guided variation / biased mutation model, showing that this generates s-shaped curves similar to those generated here. Note, though, that subsequent work has shown that s-shaped curves can be generated via other processes (e.g. @reader_distinguishing_2004), so they should not be considered definitive evidence for biased transmission.
<!--chapter:end:03-Biased_transmission_direct_bias.Rmd-->
# Biased transmission: frequency-dependent indirect bias
## The logic of conformity
In [Chapter 3][Biased transmission: direct bias] we looked at the case where one cultural trait is intrinsically more likely to be copied than another. Here we will start looking at the other kind of biased transmission, where traits are equivalent but individuals are more likely to adopt a trait depending on the characteristics of the population, in particular on which other individuals already have it. (As we mentioned previously, these are often called 'indirect' or 'context' biases.)
A first possibility is that individuals are influenced by the frequency of the trait in the population, i.e. how many other individuals already have it. The most studied case is conformity (or 'positive frequency-dependent bias'): individuals are disproportionately more likely to adopt the most common trait in the population, irrespective of its intrinsic characteristics. (The opposite case, anti-conformity or negative frequency-dependent bias, where the least common trait is more likely to be copied, is also possible, but probably less common in real life.)
For example, imagine trait $A$ has a frequency of 0.7 in the population, with the rest possessing trait $B$. An unbiased learner would adopt trait $A$ with a probability exactly equal to 0.7. This is unbiased transmission, and it is what happens in the model described in [Chapter 1][Unbiased transmission]: by picking a member of the previous generation at random, the probability of adoption is equal to the frequency of that trait among the previous generation.
A conformist learner, on the other hand, would adopt trait $A$ with a probability greater than 0.7. In other words, common traits get an 'adoption boost' relative to unbiased transmission. Uncommon traits get an equivalent 'adoption penalty'. The magnitude of this boost or penalty can be controlled by a parameter, which we will call $D$.
Let's keep things simple in our model. Rather than assuming that individuals sample across the entire population, which in any case might be implausible in large populations, let's assume they pick only three demonstrators at random. Why three? This is the minimum number of demonstrators that can yield a majority (i.e. 2 vs 1), which we need to implement conformity. When two demonstrators have one trait and the other demonstrator has a different trait, we want to boost the probability of adoption for the majority trait, and reduce it for the minority trait.
We can specify the probability of adoption as follows:
**Table 1: Probability of adopting trait $A$ for each possible combination of traits amongst three demonstrators**
Demonstrator 1 | Demonstrator 2 | Demonstrator 3 | Probability of adopting trait $A$
-------------- | -------------- | -------------- | --------------------------------- |
$A$ | $A$ | $A$ | 1
| | |
$A$ | $A$ | $B$ | $2/3 + D/3$
$A$ | $B$ | $A$ | $2/3 + D/3$
$B$ | $A$ | $A$ | $2/3 + D/3$
| | |
$A$ | $B$ | $B$ | $1/3 - D/3$
$B$ | $A$ | $B$ | $1/3 - D/3$
$B$ | $B$ | $A$ | $1/3 - D/3$
| | |
$B$ | $B$ | $B$ | 0
The first row says that when all demonstrators have trait $A$, then trait $A$ is definitely adopted. Similarly, the bottom row says that when all demonstrators have trait $B$, then trait $A$ is never adopted, and by implication trait $B$ is always adopted.
For the three combinations where there are two $A$s and one $B$, the probability of adopting trait $A$ is $2/3$, which is what it would be under unbiased transmission (because two out of three demonstrators have $A$), plus the conformist adoption boost specified by $D$. As we want $D$ to vary between 0 and 1, it is divided by three, so that the maximum probability of adoption is 1 (when $D=1$).
Similarly, for the three combinations where there are two $B$s and one $A$, the probability of adopting $A$ is $1/3$ minus the conformist adoption penalty specified by $D$, again divided by three.
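As a quick numerical check (our addition, not in the original text), we can print these adoption probabilities for full and for weak conformity:
```{r}
# probability of adopting A when A is the majority (2 of 3 demonstrators)
# and when it is the minority (1 of 3)
for (D in c(1, 0.1)) {
  print(c(D = D, majority = 2/3 + D/3, minority = 1/3 - D/3))
}
```
With $D=1$ the majority trait is always adopted and the minority trait never is; with $D=0.1$ the probabilities are only slightly distorted away from the unbiased values of $2/3$ and $1/3$.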
Let's implement these assumptions in the kind of individual-based model we've been building so far. As before, assume $N$ individuals each of whom possesses one of two traits $A$ or $B$. The frequency of $A$ is denoted by $p$. The initial frequency of $A$ in generation $t = 1$ is $p_0$. Rather than going straight to a function, let's go step by step.
First, we'll specify our parameters, $N$ and $p_0$ as before, plus the new conformity parameter $D$. We also create the usual **population** tibble and fill it with $A$s and $B$s in the proportion specified by $p_0$, again exactly as before.
```{r 4.1}
library(tidyverse)
set.seed(111)
N <- 100
p_0 <- 0.5
D <- 1
population <- tibble(trait = sample(c("A", "B"), N, replace = TRUE, prob = c(p_0, 1 - p_0))) # create first generation
```
Now we create another tibble, called **demonstrators**, that picks, for each new individual in the next generation, three demonstrators at random from the current population. It therefore needs three columns/variables, one for each demonstrator, and $N$ rows, one for each individual. We fill each column with randomly chosen traits from the **population** tibble. We can have a look at **demonstrators** by entering its name in the R console.
```{r 4.2}
# create dataframe with a set of 3 randomly-picked demonstrators for each agent
demonstrators <- tibble(dem1 = sample(population$trait, N, replace = TRUE),
                        dem2 = sample(population$trait, N, replace = TRUE),
                        dem3 = sample(population$trait, N, replace = TRUE))
demonstrators
```
Think of each row here as containing the traits of the three demonstrators randomly chosen for one new next-generation individual. Now we want to calculate the probability of adopting $A$ for each of these three-trait demonstrator combinations.
First we need to count the number of $A$s in each combination. Then we can replace the traits in **population** based on the probabilities in Table 1. When all demonstrators have $A$, we set the new trait to $A$. When no demonstrators have $A$, we set it to $B$. When two out of three demonstrators have $A$, we set it to $A$ with probability $2/3 + D/3$ and to $B$ otherwise. When one out of three demonstrators has $A$, we set it to $A$ with probability $1/3 - D/3$ and to $B$ otherwise.
```{r 4.3}
# get the number of As in each 3-dem combo
num_As <- rowSums(demonstrators == "A")

population$trait[num_As == 3] <- "A" # for dem combos with all As, set to A
population$trait[num_As == 0] <- "B" # for dem combos with no As, set to B

prob_majority <- sample(c(TRUE, FALSE), N, replace = TRUE,
                        prob = c(2/3 + D/3, 1 - (2/3 + D/3)))
prob_minority <- sample(c(TRUE, FALSE), N, replace = TRUE,
                        prob = c(1/3 - D/3, 1 - (1/3 - D/3)))

# when A is a majority (2 of 3 demonstrators)
if (nrow(population[prob_majority & num_As == 2, ]) > 0) {
  population[prob_majority & num_As == 2, ] <- "A"
}
if (nrow(population[prob_majority == FALSE & num_As == 2, ]) > 0) {
  population[prob_majority == FALSE & num_As == 2, ] <- "B"
}

# when A is a minority (1 of 3 demonstrators)
if (nrow(population[prob_minority & num_As == 1, ]) > 0) {
  population[prob_minority & num_As == 1, ] <- "A"
}
if (nrow(population[prob_minority == FALSE & num_As == 1, ]) > 0) {
  population[prob_minority == FALSE & num_As == 1, ] <- "B"
}
```
To check it works, we can add the new **population** tibble as a column to **demonstrators** and have a look at it. This will let us see the three demonstrators and the resulting new trait side by side.
```{r 4.4}
# for testing only, add the new traits to the demonstrator dataframe and show it
demonstrators <- add_column(demonstrators, new_trait = population$trait)
demonstrators
```
Because we set $D=1$ above, the new trait is always the majority trait among the three demonstrators. This is perfect conformity. We can weaken conformity by reducing $D$. Here is an example with $D=0.1$. All the code is the same as what we already discussed above.
```{r 4.5}
N <- 100
p_0 <- 0.5
D <- 0.1

# create first generation
population <- tibble(trait = sample(c("A", "B"), N, replace = TRUE,
                                    prob = c(p_0, 1 - p_0)))

# create dataframe with a set of 3 randomly-picked demonstrators for each agent
demonstrators <- tibble(dem1 = sample(population$trait, N, replace = TRUE),
                        dem2 = sample(population$trait, N, replace = TRUE),
                        dem3 = sample(population$trait, N, replace = TRUE))

# get the number of As in each 3-dem combo
num_As <- rowSums(demonstrators == "A")

population$trait[num_As == 3] <- "A" # for dem combos with all As, set to A
population$trait[num_As == 0] <- "B" # for dem combos with no As, set to B

prob_majority <- sample(c(TRUE, FALSE), N, replace = TRUE,
                        prob = c(2/3 + D/3, 1 - (2/3 + D/3)))
prob_minority <- sample(c(TRUE, FALSE), N, replace = TRUE,
                        prob = c(1/3 - D/3, 1 - (1/3 - D/3)))

# when A is a majority (2 of 3 demonstrators)
if (nrow(population[prob_majority & num_As == 2, ]) > 0) {
  population[prob_majority & num_As == 2, ] <- "A"
}
if (nrow(population[prob_majority == FALSE & num_As == 2, ]) > 0) {
  population[prob_majority == FALSE & num_As == 2, ] <- "B"
}

# when A is a minority (1 of 3 demonstrators)
if (nrow(population[prob_minority & num_As == 1, ]) > 0) {
  population[prob_minority & num_As == 1, ] <- "A"
}
if (nrow(population[prob_minority == FALSE & num_As == 1, ]) > 0) {
  population[prob_minority == FALSE & num_As == 1, ] <- "B"
}

# for testing only, add the new traits to the demonstrator dataframe and show it
demonstrators <- add_column(demonstrators, new_trait = population$trait)
demonstrators
```
Now that conformity is weaker, sometimes the new trait is not the majority amongst the three demonstrators.
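As a rough check (our addition, not in the original text), we can compute the proportion of individuals whose new trait matches the majority trait among their three demonstrators:
```{r}
# the majority trait in each row of demonstrators
majority_trait <- ifelse(num_As >= 2, "A", "B")
# with D = 1 this proportion would be exactly 1; with D = 0.1 it is lower
mean(demonstrators$new_trait == majority_trait)
```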
## Testing conformist transmission
As in the previous chapters, we can put all this code together into a function to see what happens over multiple generations and in multiple runs. There is nothing new in the code below: it combines the code we already wrote in [Chapter 1][Unbiased transmission] with the new bits of code for conformity introduced above.
```{r 4.6}
conformist_transmission <- function (N, p_0, D, t_max, r_max) {
  output <- tibble(generation = rep(1:t_max, r_max),
                   p = as.numeric(rep(NA, t_max * r_max)),
                   run = as.factor(rep(1:r_max, each = t_max)))
  for (r in 1:r_max) {
    # create first generation
    population <- tibble(trait = sample(c("A", "B"), N, replace = TRUE,
                                        prob = c(p_0, 1 - p_0)))
    # add first generation's p for run r
    output[output$generation == 1 & output$run == r, ]$p <-
      sum(population$trait == "A") / N
    for (t in 2:t_max) {
      # create dataframe with a set of 3 randomly-picked demonstrators for each agent
      demonstrators <- tibble(dem1 = sample(population$trait, N, replace = TRUE),
                              dem2 = sample(population$trait, N, replace = TRUE),
                              dem3 = sample(population$trait, N, replace = TRUE))
      # get the number of As in each 3-dem combo
      num_As <- rowSums(demonstrators == "A")
      population$trait[num_As == 3] <- "A" # for dem combos with all As, set to A
      population$trait[num_As == 0] <- "B" # for dem combos with no As, set to B
      prob_majority <- sample(c(TRUE, FALSE), N, replace = TRUE,
                              prob = c(2/3 + D/3, 1 - (2/3 + D/3)))
      prob_minority <- sample(c(TRUE, FALSE), N, replace = TRUE,
                              prob = c(1/3 - D/3, 1 - (1/3 - D/3)))
      # when A is a majority (2 of 3 demonstrators)
      if (nrow(population[prob_majority & num_As == 2, ]) > 0) {
        population[prob_majority & num_As == 2, ] <- "A"
      }
      if (nrow(population[prob_majority == FALSE & num_As == 2, ]) > 0) {
        population[prob_majority == FALSE & num_As == 2, ] <- "B"
      }
      # when A is a minority (1 of 3 demonstrators)
      if (nrow(population[prob_minority & num_As == 1, ]) > 0) {
        population[prob_minority & num_As == 1, ] <- "A"
      }
      if (nrow(population[prob_minority == FALSE & num_As == 1, ]) > 0) {
        population[prob_minority == FALSE & num_As == 1, ] <- "B"
      }
      # get p and put it into output slot for this generation t and run r
      output[output$generation == t & output$run == r, ]$p <-
        sum(population$trait == "A") / N
    }
  }
  output # export data from function
}
```
We can test the function with perfect conformity ($D=1$) and plot it (again we use the function `plot_multiple_runs()` we wrote in [Chapter 1][Unbiased transmission]).
```{r 4.7, echo=FALSE}
plot_multiple_runs <- function(data_model) {
  ggplot(data = data_model, aes(y = p, x = generation)) +
    geom_line(aes(colour = run)) +
    stat_summary(fun = mean, geom = "line", size = 1) +
    ylim(c(0, 1)) +
    theme_bw() +
    labs(y = "p (proportion of individuals with trait A)")
}
```
```{r 4.8}