I'll shortly be posting a
preprint about the methodological quality of studies in the field of neurogenetics.
It's something I've been working on with a group of colleagues for a while, and
we are aiming to make recommendations to improve the field.

I won't go into details
here, as you will be able to read the preprint fairly soon. Instead, what I
want to do here is to expand on a small point that cropped up as I looked at
this literature, and which I think is underappreciated.

It's to do with sampling.
There's a particular problem that I started to think about a while back when I
heard someone give a talk about a candidate gene study. I can't remember who it
was or even what the candidate gene was, but basically they took a bunch of
students, genotyped them, and then looked for associations between their
genotypes and measures of memory. They were excited because they found some
significant results. But I was, as usual, sitting there thinking convoluted
thoughts about all of this, and wondering whether it really made sense. In
particular, if you have a common genetic variant that has such a big effect on
memory, would this really show up in a bunch of students – who are presumably
people who have pretty good memories? Wouldn't it rather be the case that what
you'd expect would be an alteration in the frequencies of genotypes in the
student population?

Whenever I have an
intuition like that, I find the best thing to do is to try a simulation. Sometimes
the intuition is confirmed, and sometimes things turn out differently and, very
often, more complicated.

But this time, I'm
pleased to say my intuition seems to have something going for it.

So here's the nuts and
bolts.

I simulated genotypes and
associated phenotypes by just using R's nice mvrnorm function. For the examples
below, I specified that a and A are equally common (i.e. minor allele frequency
is .5), so we have 25% as aa, 50% as aA, and 25% as AA. The script lets you
specify how strongly genotype correlates with the phenotype, but from what we know
about genetics, it's very unlikely that a common variant would have a correlation
of more than about .25.
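
For readers who don't use R, here is a minimal Python sketch of the same kind of simulated population. It is not the script linked from this post and it does not use mvrnorm: it swaps in a simpler recipe, drawing two alleles per person and then adding normal noise to a standardised genotype score so that the population correlation equals r. The function name simulate_population, the sample size and the seed are all made up for illustration.

```python
import math
import random

def simulate_population(n, r, maf=0.5, seed=1):
    """Return (genotypes, phenotypes) for n simulated people.

    genotypes:  number of A alleles per person (0, 1 or 2)
    phenotypes: z-scores whose population correlation with genotype is r
    """
    rng = random.Random(seed)
    mean_g = 2 * maf                        # expected number of A alleles
    sd_g = math.sqrt(2 * maf * (1 - maf))   # SD of the allele count
    noise_sd = math.sqrt(1 - r * r)         # keeps phenotype variance at 1
    genotypes, phenotypes = [], []
    for _ in range(n):
        # Two independent alleles, each A with probability maf
        g = (rng.random() < maf) + (rng.random() < maf)
        z = r * (g - mean_g) / sd_g + rng.gauss(0, noise_sd)
        genotypes.append(g)
        phenotypes.append(z)
    return genotypes, phenotypes

genotypes, phenotypes = simulate_population(n=10000, r=0.25)
counts = [genotypes.count(k) for k in (0, 1, 2)]
print(counts)  # aa, aA, AA counts; these should be close to a 1:2:1 ratio
```

With a MAF of .5 the allele count has mean 1 and variance .5, which is why the genotype is rescaled before being multiplied by r.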

We can then test for two
things:

1) How far does the distribution of genotypes in
the sample (i.e. people who are aa, aA or AA) resemble that in the general
population? If we know that MAF is .5, we expect this distribution to be 1:2:1.

2) We can assign each
person a score corresponding to number of A alleles (coding aa as zero, aA as
1, and AA as 2) and look at the regression of the phenotype on the genotype.
That's the standard approach to looking for genotype-phenotype association.
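
The two tests can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the code behind the figures: the helper names are hypothetical, the demo data are generated with a simple stand-in recipe, and the p-value for the genotype-phenotype association uses the Fisher z approximation instead of the exact t-test a regression routine would report. The chi square p-value is exact, because with three genotype categories there are 2 degrees of freedom.

```python
import math
import random
from statistics import NormalDist

def genotype_chisq_p(genotypes, expected=(0.25, 0.5, 0.25)):
    """Chi square goodness-of-fit p-value for aa:aA:AA counts vs expected 1:2:1."""
    n = len(genotypes)
    chi2 = 0.0
    for k, prop in enumerate(expected):
        observed = genotypes.count(k)
        chi2 += (observed - n * prop) ** 2 / (n * prop)
    # The upper tail of a chi square with 2 df is exactly exp(-chi2 / 2)
    return math.exp(-chi2 / 2)

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_p(r, n):
    """Approximate two-sided p-value for a correlation (Fisher z transform)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Demo data (assumed values): MAF .5, true correlation .25, 5000 people
rng = random.Random(2)
genotypes = [(rng.random() < 0.5) + (rng.random() < 0.5) for _ in range(5000)]
phenotypes = [0.25 * (g - 1) / math.sqrt(0.5) + rng.gauss(0, math.sqrt(1 - 0.25 ** 2))
              for g in genotypes]

r_obs = pearson_r(genotypes, phenotypes)
print("chi square p for 1:2:1:", genotype_chisq_p(genotypes))
print("correlation:", r_obs, "approx p:", correlation_p(r_obs, len(genotypes)))
```

With a single predictor, testing the regression slope and testing the correlation amount to the same thing, which is why the sketch works with the correlation directly.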

If we work with the whole
population of simulated data, these values will correspond to those that we
specified in setting up the simulation, provided we have a reasonably large
sample size.

But what if we take a
selective sample of cases who fall above some cutoff on the phenotype? This is
equivalent to taking, for instance, a sample from a student population from a
selective institution, when the phenotype is a measure of cognitive function.
You're not likely to get into the institution unless you have a good
cognitive ability. Then, working with this selected subgroup, we recompute our
two measures, i.e. the proportions of each genotype, and the correlation
between the genotype and the phenotype.

Now, the really
interesting thing here is that, as the selection cutoff gets more extreme, two
things happen:

a) The proportions of
people with different genotypes start to depart from the values expected for
the population in general. We can test to see when the departure becomes
statistically significant with a chi square test.

b) The regression of the
phenotype on the genotype weakens. We can quantify this effect by just computing
the p-value associated with the correlation between genotype and phenotype.
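
Both effects can be reproduced with a short, self-contained Python sketch. It is a stand-in for the R script, not a copy of it: the simulation recipe, the population size of 20000, the true correlation of .25 and the list of cutoffs are all assumptions chosen for illustration. Because the simulated population is large, the sketch prints the correlation itself rather than its p-value, alongside the genotype proportions and the chi square p-value against 1:2:1.

```python
import math
import random

def simulate(n, r, seed=3):
    """Genotypes (0/1/2 A alleles, MAF .5) and z-score phenotypes correlated r."""
    rng = random.Random(seed)
    gs, zs = [], []
    for _ in range(n):
        g = (rng.random() < 0.5) + (rng.random() < 0.5)
        gs.append(g)
        zs.append(r * (g - 1) / math.sqrt(0.5) + rng.gauss(0, math.sqrt(1 - r * r)))
    return gs, zs

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def chisq_p(genotypes):
    """Goodness of fit to 1:2:1; exp(-chi2/2) is the exact tail for 2 df."""
    n = len(genotypes)
    chi2 = sum((genotypes.count(k) - n * p) ** 2 / (n * p)
               for k, p in enumerate((0.25, 0.5, 0.25)))
    return math.exp(-chi2 / 2)

def summarise(genotypes, phenotypes, cutoff):
    """Keep only people at or above the phenotype cutoff, then recompute both measures."""
    kept = [(g, z) for g, z in zip(genotypes, phenotypes) if z >= cutoff]
    gs = [g for g, _ in kept]
    zs = [z for _, z in kept]
    return {
        "cutoff": cutoff,
        "n": len(gs),
        "props": [gs.count(k) / len(gs) for k in (0, 1, 2)],
        "r": pearson_r(gs, zs),
        "chisq_p": chisq_p(gs),
    }

genotypes, phenotypes = simulate(n=20000, r=0.25)
# -inf keeps everyone, i.e. the unselected sample
results = [summarise(genotypes, phenotypes, c) for c in (-math.inf, 0.0, 0.5, 1.0)]
for row in results:
    props = ", ".join(f"{p:.3f}" for p in row["props"])
    print(f"cutoff {row['cutoff']}: n={row['n']}, aa/aA/AA={props}, "
          f"r={row['r']:.3f}, chi square p={row['chisq_p']:.3g}")
```

As the cutoff rises, the r column should shrink while the AA proportion climbs above 25% and the chi square p-value falls.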

Figure 1: Genotype-phenotype associations for samples selected on phenotype

Figure 1 shows the mean
phenotype scores for each genotype for three samples: an unselected sample, a
sample selected with z-score cutoff zero (corresponding to the top 50% of the
population on the phenotype) and a sample selected with z-score cutoff of .5
(roughly selecting the top third of the population).
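
The correspondence between z-score cutoffs and the share of the population retained can be checked with Python's standard library. This small sketch assumes a normally distributed phenotype and, for the IQ conversion, the conventional scaling of mean 100 and SD 15, which the post does not state explicitly.

```python
from statistics import NormalDist

std_normal = NormalDist()  # phenotype expressed as z-scores

# Share of the population at or above each cutoff
fractions = {c: 1 - std_normal.cdf(c) for c in (0.0, 0.5)}

for cutoff, top_fraction in fractions.items():
    iq = 100 + 15 * cutoff  # assumed IQ scaling: mean 100, SD 15
    print(f"z >= {cutoff}: top {top_fraction:.1%} of population, IQ about {iq}")
```

A cutoff of .5 keeps a little under 31% of the population, which is the "top third" referred to above.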

It's immediately apparent
from the figure that the selection dramatically weakens the association between
genotype and phenotype. In effect, we are distorting the relationship between
genotype and phenotype by focusing just on a restricted range.

Figure 2: Comparison of p-values from conventional regression approach and chi square test on genotype frequencies in relation to sample selection

Figure 2 shows the data
from another perspective, by considering the statistical results from a
conventional regression analysis, when different z-score cutoffs are used,
selecting an increasingly extreme subset of the population. If we take a cutoff
of zero, in effect selecting just the top half of the population, the
regression effect (predicting phenotype from genotype, shown by the blue line), which was strong in the full
population, is already much reduced. If you select only people with z-scores of
.5 or above (equivalent to an IQ score of around 108), then the regression is
no longer significant. But notice what happens to the black line. This shows
the p-value from a chi square test which compares the distribution of genotypes
in relation to expected population values in each subsample. If there is a true
association between genotype and phenotype, then the greater the selection on the
phenotype, the more the genotype distribution departs from expected values. The
specific patterns observed will depend on the true association in the
population and on the sample size, but this kind of cross-over is a typical
result.

So what's the moral of
this exercise? Well, if you are interested in a phenotype that has a particular
distribution in the general population, you need to be careful when selecting a
sample for a genetic association study. If you pick a sample that has a
restricted range of phenotypes relative to the general population, then you
make it less likely that you will detect a true genetic association in a
conventional regression analysis. In fact, if you take a selected sample, there
comes a point when the optimal way to demonstrate an association is by looking
for a change in the frequency of different genotypes in the selected population
vs the general population.

No doubt this effect is
already well-known to geneticists, and it's all pretty obvious to anyone who is statistically savvy, but I was pleased to be able to quantify the effect via simulations. It is clear that it has implications for those who work
predominantly with selected samples such as university students. For some
phenotypes, use of a student sample may not be a problem, provided they are
similar to the general population in the range of phenotype scores. But for
cognitive phenotypes that's very unlikely, and attempting to show genetic
effects in such samples seems a doomed enterprise.

The script for this
simulation should be available here: https://github.com/oscci/mysqing.

(I am a github novice,
but I'm sure someone will tell me if I've got that wrong).