Long, long ago in a Pittsburgh far away, I earned a bachelor’s degree in mathematics. I then spent the first twenty years of my career as a “techie”, using that math and computer science to analyze and make sense of the world. Every so often that experience is useful here in the real world.
Often that experience leads me to be cynical about the use of statistics in the news, especially when the mathematics is hidden behind the buzzwords of “big data” and “AI”.
One simple example of where this can go wrong is genetic analysis services like 23andMe. I bought my wife and me a pair of kits a few years ago as an anniversary gift to see how closely related we are. We’re both from a people known to marry within the group. Turns out we are very distant cousins.
How distant? That’s where the statistics and AI are breaking down.
Over the winter break I was looking on 23andMe to see what new cousins had joined. The service notifies me about this every month or so, but rarely has any relative I recognize joined. Instead, I’m presented with a lot of 2nd and 3rd cousins I’ve never heard of, who share no family names, and who are often from parts of the world where no 2nd cousin of mine could have been born. My ancestors and close relatives have been born in the U.S. for too many generations for me to have 2nd cousins born overseas.
The latest set of connections exposed the bug in 23andMe’s statistics and shows why you can’t completely trust what services like this report.
One new “2nd cousin” and I connected today. 23andMe reported that she’s my mother’s 3rd cousin. How can someone be more distantly related to my mother than to me? Simple. 23andMe also reported that this new cousin is my father’s 5th cousin.
Take any sufficiently inbred group and you get a lot of double cousins: people related to each other through two separate lines of descent at once. Go out to 3rd and 4th cousins, and there are probably some triple cousins in that mix too.
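To put rough numbers on the double-cousin effect, here is a minimal sketch in Python. It uses the textbook expected share of DNA for full cousins along a single line of descent (1/8 for 1st cousins, 1/32 for 2nd, 1/128 for 3rd) and assumes the shares from separate lines simply add. The `naive_label` function is a hypothetical stand-in for a labeler that only knows about single-path relationships. I don’t know how 23andMe actually computes its labels, and real shared DNA varies a lot around these averages.

```python
import math
from fractions import Fraction


def cousin_share(degree, removed=0):
    """Expected fraction of DNA shared by full cousins along ONE line of descent.

    Textbook averages: 1/8 for 1st cousins, 1/32 for 2nd, 1/128 for 3rd,
    halved again for each generation "removed".
    """
    return Fraction(1, 2 ** (2 * degree + 1 + removed))


def naive_label(total_share, max_degree=8):
    """Hypothetical single-path labeler (an assumption, not 23andMe's method).

    Returns the plain nth-cousin degree whose expected share is closest,
    on a log2 scale, to the observed total.
    """
    observed = math.log2(float(total_share))
    return min(
        range(1, max_degree + 1),
        key=lambda n: abs(observed - math.log2(float(cousin_share(n)))),
    )


# Two people who are 3rd cousins through four separate lines of descent.
paths = [cousin_share(3)] * 4
total = sum(paths)  # expected shares add across independent lines

print(total)               # 1/32
print(naive_label(total))  # 2
```

Four 3rd-cousin connections add up to the same expected share as a single 2nd-cousin connection, so a labeler that assumes only one line of descent reports a 2nd cousin where none exists.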
The mathematics 23andMe uses to report cousins ignores these cases, even though 23andMe has the data to compute these overlaps. Instead, it is clear that 23andMe is assuming its data is uncorrelated except via some ideal, unrealistic, purely hierarchical family tree. It isn’t.