Some years ago, while working on a collaborative science fiction project, one of the other people mentioned that they had read that chimpanzee DNA is more than 98% identical to human DNA. “It only takes that little 1% to make a huge difference!”
I had seen articles quote this figure as well, but since the Human Genome Project was still underway at the time, I was a little skeptical that it was accurate. However, one of my friends showed me a reference to the paper in a peer-reviewed journal where the statistic came from, and a little research seemed to indicate that it was true. So I accepted it, occasionally quoted it myself, and didn’t think about it.
Until I read a book about the Human Genome Project, which talked about that old statistical claim in particular, and explained exactly how it came about.
If complicated scientific theories or statistics make your head spin, don’t panic! I’m going to explain it in a way that will not cause you any distress.
Imagine that you have printed out the text of a pair of books that are roughly the same length. You have printed it out single-sided, double-spaced, and in a comfortably sized font. Now, you take a pair of scissors to the first book, and you start cutting each page up. You don’t cut them up randomly; you cut them so that you have several thousand little pieces of paper, each of which has one and only one word on it.
Now, you sort them. You make a pile of all of the slips of paper that have the word “the” on them. You make another pile of all the slips that have the word “blue” on them, and so on, until you have a bunch of piles of little slips of paper, each pile containing every instance of a single word.
Keep only the ten biggest piles and discard the rest. Now count the number of slips of paper in each of your ten piles, and write down the number of times each word occurs in the book you cut up.
Now, go repeat the whole process on the second book, and when you’re done, compare the two lists. For each word, calculate by what percentage the two counts differ, and then average those differences.
When you’re done, you find that there is a 98% match between a Harry Potter book and Fifty Shades of Grey. “Look!” you declare, “They’re practically the same book!”
But you haven’t compared the two books; you’ve only counted words and compared the counts of the most common words between the two books. If you perform this treatment on any two books written in the same language, you’re going to find a match.
And I think everyone realizes, when I explain it this way, that what you’ve done does not measure how similar the books are.
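If you would rather see the scissors-and-piles experiment as a few lines of code, here is a rough sketch in Python. It is only my illustration of the analogy, not anything the researchers ran. The function names are made up for this post, and I have assumed two details the description above leaves open: the words compared are the ten biggest piles from either book, and the difference for each word is measured against the larger of the two counts.

```python
import re
from collections import Counter


def word_counts(text):
    """Cut the text into lowercase words and count each pile."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def top_word_similarity(text_a, text_b, n=10):
    """Compare two texts using only the counts of their most common words.

    Assumption: a word is compared if it is among the n biggest piles
    of either text, and each difference is taken relative to the larger
    of the two counts. Returns a value between 0.0 and 1.0.
    """
    counts_a = word_counts(text_a)
    counts_b = word_counts(text_b)

    # Keep only the biggest piles from each book; discard the rest.
    words = {w for w, _ in counts_a.most_common(n)}
    words |= {w for w, _ in counts_b.most_common(n)}
    if not words:
        raise ValueError("both texts are empty")

    # Fractional difference between the two counts for each word.
    differences = [
        abs(counts_a[w] - counts_b[w]) / max(counts_a[w], counts_b[w])
        for w in words
    ]

    # Average the differences and turn the result into a "match" score.
    return 1 - sum(differences) / len(differences)


original = "the cat sat on the mat and the dog sat on the rug"
# Same words in reverse order: a different "book" with identical piles.
shuffled = " ".join(reversed(original.split()))
print(top_word_similarity(original, shuffled))  # 1.0
```

The reversed sentence is gibberish, yet it scores a perfect match, because nothing in the calculation ever looks at the order the words come in.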
Of course this makes you wonder what the scientists were thinking when they did something very similar to the DNA of humans and chimpanzees.
To be fair, the scientists who authored the original paper never claimed that humans and chimpanzees only varied from each other by 1 or 2 percent. They said that, in the portions of the chromosomes they compared, the number and distribution of certain combinations of base pairs were about 98% similar. They knew that they weren’t comparing the entire genome, because no one had mapped the entire thing for either species.
At the time, the methods we had for analyzing DNA were crude. We could separate chromosomes, we could pull out certain sequences and count those, but there was a lot we couldn’t do.
We also assumed that the long, repetitive bits at the end of each chromosome were junk, or filler, but new research is beginning to cast doubt on that.
Make no mistake, the more we study both species’ chromosomes, the more similarity we keep finding between us. Chimps and bonobos are both clearly very closely related to us. But we aren’t “practically the same species.”
My point is that the 98% figure was a fact. It was even a true fact, but it was a very specific fact: when comparing certain portions of each species’ DNA, and when counting building blocks without much regard to how those building blocks relate to each other, the number of those building blocks is about 98% the same in each species.
Before we can know what that fact means, we need a whole lot more facts and a much better understanding of the context.