In 2020, my friend Artem Babaian and I started the Serratus Project, an effort to build a “tree of life” for Covid-19. The project later grew into an effort to search for and catalogue all known viruses, finding them hidden deep inside a vast database of freely available genomics data submitted by labs all around the world.
We submitted this work to Nature, a renowned scientific journal, where it was published in January 2022.
I developed and scaled the original search and computing architecture for the project, mostly from scratch, including a custom job scheduler, scaling logic, Terraform scripts to provision infrastructure, Docker images, and a monitoring and debugging stack. The system was capable of scaling itself up to nearly 10k AWS EC2 instances, which it did on several occasions during our big multi-day pushes to process large amounts of data (10PB+) as quickly as possible.
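As a rough illustration of what queue-driven scaling logic can look like, here is a minimal sketch. It is not the project's actual code: the function name, the `jobs_per_instance` parameter, and the 10,000-instance cap are assumptions chosen only to mirror the scale described above.

```python
import math


def desired_instances(pending_jobs: int, jobs_per_instance: int,
                      max_instances: int = 10_000) -> int:
    """Return how many workers to run for the current backlog.

    Hypothetical helper: sizes the fleet so that each instance handles
    roughly `jobs_per_instance` queued jobs, never exceeding the cap.
    """
    if jobs_per_instance <= 0:
        raise ValueError("jobs_per_instance must be positive")
    if pending_jobs <= 0:
        # Nothing queued: scale the fleet down to zero.
        return 0
    needed = math.ceil(pending_jobs / jobs_per_instance)
    return min(needed, max_instances)
```

A scheduler loop could call this periodically with the current queue depth and ask the cloud provider to converge on the returned count. The cap keeps a sudden backlog from requesting more instances than the account limits allow.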
The project is open-source and available on GitHub.