From raw sequencing reads to a machine learning model that infers an individual's geographical origin from their genomic variation.

Photo by Clay Banks on Unsplash

In recent years, companies like 23andMe have gained traction by feeding our desire to understand the roots of our ancestry. They promise insights into the geographical origins of our genetic makeup, using just a droplet of saliva.

In this article, we will take a (rather simplistic) look at the bioinformatics side of such an analysis. We will walk through the preprocessing steps needed to get from raw sequencing reads to a machine learning model that aims to cluster human samples by their geographical origin.

But before we get to the coding, let’s quickly…

Nico Alavi

Bioinformatics Master's Student at FU Berlin
