The wonders of statistics in gene expression experiments

Post by Bryn Gaertner


Binding site turnover produces pervasive quantitative changes in transcription factor binding between closely related Drosophila species

RK Bradley, XY Li, C Trapnell, S Davidson, L Pachter, HC Chu, LA Tonkin, MD Biggin, MB Eisen

PLoS Biology 8(3) 2010

Flies in the Drosophila genus all look about the same, and the early-development transcription factors that we all know and love (Hunchback, Krüppel, Bicoid, Giant, Knirps, etc.) are expressed in roughly the same patterns. However, there is about one single-nucleotide difference per 10 bp between these species, which strongly suggests that many TF binding targets are no longer conserved. How do the TFs still know where to go?

This was a beast of a paper. The main experiment was to see how transcription factor binding changes between closely related species, given that the species remain phenotypically similar. Earlier studies found that transcription factors, which are responsible for initiating gene expression, have affinity for specific nucleotide sequences that are about 8 base pairs long. However, by chance alone any given 8-bp combination will show up many times in a genome (a specific 8-mer is expected about once every 4^8 = 65,536 bp of random sequence), and transcription factors are a bit sloppy in their binding affinity, so they can bind to other sequences too.
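To put a rough number on "shows up many times," here is a back-of-the-envelope sketch. It assumes equal base frequencies and a random genome, and the genome size is a round illustrative figure I picked, not a number from the paper.

```python
# Expected chance occurrences of one specific k-bp word in a random genome.
# GENOME_SIZE_BP is an illustrative assumption, not a value from the paper.
GENOME_SIZE_BP = 140_000_000
MOTIF_LENGTH = 8


def expected_chance_hits(genome_size_bp, motif_length, both_strands=True):
    """Expected count of one exact word, assuming each base has probability 1/4."""
    positions = genome_size_bp - motif_length + 1   # possible start sites
    per_position = 0.25 ** motif_length             # chance of a match at one site
    # Counting both strands assumes the word is not its own reverse complement.
    strands = 2 if both_strands else 1
    return positions * per_position * strands


hits = expected_chance_hits(GENOME_SIZE_BP, MOTIF_LENGTH)
print(f"about {hits:.0f} chance matches")
```

Under those assumptions you get a few thousand chance matches for a single 8-bp word, which is why a perfect motif match on its own can't tell you where a factor actually binds.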

These particular transcription factors set up the anterior-posterior axis on flies, so they are responsible for turning on a whole ton of genes.  But we’re not sure exactly which ones, and we’re not sure how they “know” to turn on the right genes considering that as these species diverge, they’re going to be accumulating mutations in these 8 bp recognition sequences.

The way this group addressed this question is by using ChIP-seq. How this works: you take embryos at the point in development where they’re setting up their AP axis and dip them in formaldehyde, which cross-links any protein that was bound to DNA at that instant so it stays stuck. Then you use antibodies to pull that particular protein, plus the DNA attached to it, out of a big sample of lysed cell goo. Then you sequence the DNA you pulled out and align it back to the genome. This way, you can see where the proteins were bound in the genome. Not only that, but you get a quantitative estimate of how well or tightly the transcription factor was bound, because with a big enough sample size we go from probability land to frequency land. So the more reads that pile up on a sequence, presumably the more tightly the TF was bound there.
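The "more counts, tighter binding" idea boils down to tallying aligned reads per stretch of genome. Here is a toy sketch of that tally. The read coordinates and the 500-bp window are made up, and a real analysis would use a peak caller and normalize against a control sample, so treat this as an illustration only.

```python
from collections import Counter


def count_reads_in_windows(read_starts, window_size):
    """Count aligned reads per fixed-size window (window index -> read count)."""
    counts = Counter()
    for pos in read_starts:
        counts[pos // window_size] += 1
    return counts


# Hypothetical read start coordinates on a single chromosome.
read_starts = [105, 110, 130, 190, 1020, 1500, 1510, 1515, 1580, 1590]
WINDOW = 500  # illustrative window size in bp

counts = count_reads_in_windows(read_starts, WINDOW)
for window, n in sorted(counts.items()):
    start = window * WINDOW
    print(f"{start}-{start + WINDOW - 1}: {n} reads")
```

Windows with more reads are the candidate strongly bound regions, and the per-window counts are the kind of number you can then compare between species.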

They found pretty high correlations between the two species of Drosophila in where and how strongly the TFs bind, which is consistent with their similar phenotypes. However, using a fairly large window, they couldn’t find any correlation between nucleotide divergence and differences in binding strength. So they used a search algorithm to identify “words” (short base-pair sequences) that occurred under these binding peaks and whose non-conservation between species was associated with changes in binding strength.
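For the cross-species comparison, the basic statistic is just a correlation between binding strengths at matched regions. A minimal sketch, with made-up numbers standing in for read counts in the two species (the paper's actual normalization and region matching are not described in this post, so none of that is attempted here):

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical binding strengths at five aligned regions in each species.
mel = [120, 85, 40, 200, 15]
yak = [100, 90, 55, 170, 30]

r = pearson_r(mel, yak)
print(f"r = {r:.3f}")
```

A high r says the same regions tend to be strongly bound in both species. It says nothing about why the remaining differences exist, which is where the word search comes in.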

Using this statistical approach, they were able to recover all known binding sequences for these transcription factors. If conservation were the only factor that predicts binding strength, then you should be able to find a correlation between the two and predict that strongly bound regions would be subject to more functional constraint. But that wasn’t really true, which suggests that there’s another factor influencing these binding properties.

So they used another strategy, which is principal component analysis. What PCA does is find patterns in the variance of the data that you’re not totally aware of. So if you have some variables that are highly correlated, PCA collapses them into one variable and you get more predictive power from it. In this case, they found that there is one variable (Principal Component 1) that explains 38% of the variance in the data. Put another way, there are whole suites of genomic regions where binding strengths are highly correlated. Turns out, when they applied the word-finding algorithm to this principal component, there’s a short little “TAGteam” sequence that goes with these correlated data. This is a known binding site for a transcription factor called Zelda, whose function isn’t really well known but is thought to have something to do with chromatin remodelling: changing the chromosomes around to make some regions more accessible to transcription factors.
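To make the "collapse correlated variables into one" idea concrete, here is a small sketch that finds the first principal component of a toy data set and the fraction of variance it explains. It uses power iteration on the covariance matrix, which is one simple way to get PC1 and not necessarily what the authors did. The numbers are invented: rows are genomic regions, columns are binding strengths for three hypothetical factors.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of a list-of-lists table."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_principal_component(cov, iterations=500):
    """Leading eigenvector and eigenvalue of cov, found by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient: eigenvalue for the unit-length vector v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return v, eigenvalue


# Hypothetical data: the first two factors track each other, the third does not.
binding = [
    [1.0, 1.1, 0.2],
    [2.0, 1.9, 0.9],
    [3.0, 3.2, 0.1],
    [4.0, 3.8, 0.8],
    [5.0, 5.1, 0.3],
]

cov = covariance_matrix(binding)
pc1, eigenvalue = first_principal_component(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
print("PC1 loadings:", [round(x, 2) for x in pc1])
print(f"fraction of variance explained by PC1: {explained:.2f}")
```

Because the first two columns rise and fall together, PC1 loads heavily on both and soaks up most of the variance. The 38% reported in the paper is the same kind of number, just for a much noisier real data set.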

So the COOL! part of this paper is that they were able to computationally extract a purpose for a transcription factor, and now they can test that experimentally. However, none of us were completely clear on how they did their PCA, so we all wanted a bit more detail in that regard. Specifically, which of their data went into the PCA? Was it just the peak locations? Was it differentiated by transcription factor? If the latter, why show all six principal components, since the last few probably don’t explain much of the variance? How much variance does each of the PCs explain? (Though we all thought that, regardless, a heat map is a really visually informative way to display factor loadings.)

The implications for evolution are kind of interesting.  What the PCA suggested is that there is as much variation within species for this chromatin remodelling factor as there is between species.  That’s consistent with the gradualist perspective of speciation, which is that there’s nothing more amazing going on in speciation than there is when alleles are segregating in a population.  The other interesting thing is that the genome appears to be a big old mess.  TFs are binding all over the place, who knows how many funky RNA transcripts are floating around, and yet… we get nice clean organisms.  Especially with these early patterning factors, the traits are highly canalized, meaning you have to seriously perturb the system to get a change in phenotype.  Yet we DO get changes in phenotype.  The book is still open on how that happens, of course.


2 responses to “The wonders of statistics in gene expression experiments”

  1. Stephen Richards

    “Flies in the Drosophila genus all look about the same” I think if you actually do look at some, you will see that this is not true. The Hawaiians in particular are very pretty, but even the closer species are different.

  2. A very fair point. I forget how large that genus is. I should say that D. melanogaster and D. yakuba (the two species examined here) are strikingly phenotypically similar, and also that all Drosophilids have similar body plans, despite any derived modifications on a theme (especially the Hawaiian guys).
