In recent years, companies like 23andMe have gained traction by feeding our desire to understand the roots of our ancestry. They promise insights into the geographical origins of our genetic makeup, using just a droplet of saliva.
In this article, we are going to take a (rather simplistic) look at the bioinformatics side of such an analysis. We will walk through all the preprocessing steps needed to get from raw sequencing reads to a machine learning model that aims to cluster human samples according to their geographical origin.
But before we get to the coding, let’s quickly clarify why this works. …