Open-source Big Compute for Life Sciences
In collaboration with Dr. Kenji Takeda, Solutions Architect and Tech Manager, Microsoft Research.
The amazing rate of scientific progress is increasingly enabled by advances in big data and big compute technology. We at Microsoft are excited to be working with the global scientific community to make the best use of cloud computing through our Azure for Research program. Every day we’re learning how researchers across all domains, such as computer science, engineering, and life sciences, are able to accelerate their projects’ progress with the cloud.
Genetics, genomics and life on Earth
Some of the most profound developments in science are centered on harnessing our knowledge of genetics; changing the way we think about biology, medicine and all living things. By mapping genomes through DNA sequencing techniques and bioinformatics, we are making huge strides in our understanding and exploitation of genetics. The cost of gene sequencing is falling faster than Moore’s Law, meaning we can gather more genomic data at a faster rate than ever before, but having access to more data is only half the story. More data means more storage, processing and analysis; often it’s difficult to keep up with the computing power needed. We’ve been working with genomics researchers around the world, and the experiences of Simon O’Hanlon and Professor Matthew Fisher at Imperial College London are typical of how Azure Big Compute can help with this bioinformatics computing challenge.
The Challenge: Hunting a serial killer using next-generation sequencing
Next-generation sequencing (NGS) is a technique that massively parallelizes the process for obtaining genetic data. The current, most popular approach to NGS is to perform ‘sequencing by synthesis’, which is a slow but ‘embarrassingly parallel’ process. Typically researchers split genomic DNA into small fragments (tens to hundreds of nucleotides long) and sequence between millions and billions of these pieces simultaneously. This process can take several days but produces a very large amount of sequence data.
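The fragment-and-sample step is easy to picture in a toy sketch. The Python below uses made-up sizes and a randomly generated stand-in genome (a real sequencer samples an unknown genome, so nothing here reflects actual NGS chemistry or scale), but it shows why the reads are independent of one another and hence ‘embarrassingly parallel’:

```python
import random

def shred(genome: str, read_len: int, coverage: int) -> list[str]:
    """Randomly sample short reads from a genome, shotgun-style.

    `coverage` is how many times each base is read on average, so the
    number of reads is roughly coverage * len(genome) / read_len.
    """
    n_reads = coverage * len(genome) // read_len
    reads = []
    for _ in range(n_reads):
        # Each read is drawn independently: no read depends on any other,
        # which is what makes the sequencing (and later processing) parallel.
        start = random.randrange(len(genome) - read_len + 1)
        reads.append(genome[start:start + read_len])
    return reads

random.seed(42)
genome = "".join(random.choice("ACGT") for _ in range(1000))
reads = shred(genome, read_len=100, coverage=30)
print(len(reads))  # 300 independent fragments from a 1,000-base toy genome
```

Scale those toy numbers up to a multi-gigabase genome at 30x coverage and the output quickly becomes the “very large amount of sequence data” described above.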
NGS is transforming work in the biological sciences but poses significant computational challenges. Azure is helping researchers to speed up their NGS pipelines, from working towards a cure for cancer to diagnosing rare diseases. In addition to well-known threats to human health, many diseases originate in other animal species. It’s critical that we understand this ‘zoonotic pool’ of pathogens that may transfer from animals to humans, to prevent future outbreaks of diseases such as Ebola, bird flu and BSE (mad cow disease). One of the most virulent pathogens for vertebrates is Batrachochytrium dendrobatidis (Bd), a fungal infection that causes chytridiomycosis in amphibians. Bd has been shown to infect over 500 species of amphibians, and is responsible for the decline or extinction of over 200 of these. The World Conservation Union (IUCN) declared in 2005 that chytridiomycosis “is the worst infectious disease ever recorded among vertebrates in terms of the number of species impacted, and its propensity to drive them to extinction.”
Matthew and Simon at Imperial College London are at the front line of combating this severe amphibian threat. They are using NGS to understand the origins and spread of Bd, and require large amounts of computing power to do so. While NGS is fast, it produces an overwhelming amount of data of varying quality. The small fragments of DNA sequenced by NGS machines must first be reassembled into a complete genome, in a process analogous to piecing together a very large jigsaw. There is a bewildering array of evolutionary and environmental processes that may affect rates of genetic change, and advanced techniques such as machine learning are needed to make sense of the data. Traditionally, bioinformaticians have used on-premises computing to number-crunch and search their sequences, but with NGS it is becoming harder to match local computing capability to the deluge of data coming from the sequencing machines, as Simon explains:
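The jigsaw analogy maps onto a classic overlap-and-merge idea: find the reads whose ends overlap, and glue them together. The deliberately naive Python sketch below captures only that intuition; real assemblers use far more sophisticated graph algorithms, cope with sequencing errors and repeats, and are one reason this work needs serious compute:

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads: list[str]) -> str:
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, 0, 1)  # (overlap length, index i, index j)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # no remaining overlaps; the pieces don't join
        merged = reads[i] + reads[j][k:]
        reads = [r for n, r in enumerate(reads) if n not in (i, j)] + [merged]
    return reads[0]

pieces = ["GATTACAGG", "ACAGGCTT", "GCTTAGTC"]
print(greedy_assemble(pieces))  # GATTACAGGCTTAGTC
```

Even this toy version compares every read against every other, which hints at why assembling millions of real reads is so computationally demanding.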
“At Imperial we are lucky enough to have access to some great computational resources at both Institutional level and departmental level. Our departmental cluster is powerful and runs Windows which works out fantastic for those people whose toolchains are written for that environment, but we needed a more Linux-centric solution as many of our tools are Unix-based. The central HPC service is great, but there are limits on computational resources, and of course time spent queuing jobs to be run when there are many concurrent users.”
The Contender: Azure Big Compute for open-source solutions
There are thousands of researchers around the world working on genomics, creating a rich ecosystem of tools and technologies to choose from. Most of these tools are built on open-source frameworks running on Linux, and are used on everything from laptops to supercomputers. Teams such as Matthew’s have access to many university compute resources, but the everyday experience with these is not always ideal. Azure Big Compute provides two major advantages: powerful virtual machines and no waiting.
Azure’s powerful A11 compute instances, with 16 fast cores and 112 GB of RAM, give Simon a real boost when running his genomics processing pipeline. Having instant access to a personal Azure cluster with 256 or more cores means he is in full control of his work, as he explains:
“Azure for me was perfect. I could scale my compute capability with my current computational needs. What is also great is that as we didn’t have to outlay any capital investment in a particular architecture I could also change up the physical makeup of my cluster as required. For instance, for some problems I might need nodes with large amounts of RAM and processors (such as Azure A11). At other times I might need faster SSD storage and faster interconnects between nodes on my cluster. I can do all this from the Azure web portal or more usually, programmatically using the Azure command line tools. I can set up a whole new cluster with just a few lines of script code. Once I had set up my cluster it was easy to run a few short scripts to set up an open source Grid Engine job scheduler to distribute computational jobs across my cluster. The best thing about having your own cluster though, is of course no wait times in queues!”
The Solution: A genomics time machine, doing a week’s work in an afternoon
Simon is used to running his computations from the command line, so he was quickly able to make use of Azure without changing his working habits. The Linux command-line interface (CLI) enabled him to easily set up and manage virtual machines that he could configure for his workflow. This included setting up the well-known Genome Analysis Toolkit (GATK) on Azure for variant calling, and coupling this with RAxML for maximum-likelihood phylogenetic analysis. Simon then installed Grid Engine on his cluster of Ubuntu 14.04 Azure VMs to manage job runs using MPI and OpenMP across A11 compute nodes.
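The scheduler’s role in a pipeline like this is essentially to fan independent jobs out across the cluster’s cores. That pattern can be sketched with Python’s standard library standing in for Grid Engine; the per-chromosome task and its names below are invented for illustration (the real jobs are GATK and RAxML runs submitted to the scheduler):

```python
from concurrent.futures import ThreadPoolExecutor

def call_variants(chromosome: str) -> str:
    # Placeholder for one independent pipeline job; in the real workflow
    # this would be e.g. a variant-calling run submitted to Grid Engine.
    return f"{chromosome}: done"

# Invented job list: one independent task per chromosome.
chromosomes = [f"chr{i}" for i in range(1, 9)]

# Each job works on different data, so a scheduler can spread them over
# as many workers as are free (threads here; Grid Engine uses whole nodes).
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(call_variants, chromosomes))

print(results[0])  # chr1: done
```

With no shared state between jobs, adding nodes to the cluster shortens the wall-clock time almost linearly, which is exactly what on-demand scaling buys.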
“Using Azure VMs allowed me to port my workflow directly from my local setup at Imperial straight into the cloud, with almost no tweaking required. The Azure command line tools are also great. They really helped manage the workflow for both cluster setup and subsequent management. Since cluster use is billed per minute it’s essential to have great tools to spin up and shut down your resources efficiently, simply and reliably. You are of course also admin on your cluster which gave me ultimate flexibility to instantly install and update whatever tools and software I needed, or just wanted to test out. I could organize my workflow and environment in a way that was right for me.”
On the surface of it, this setup looks pretty similar to what was already available, and THAT is the point. Simon was able to easily recreate his complex working environment in Azure using familiar open-source tools and without re-architecting it. With this in place he now has access to huge amounts of computing that is, importantly, available on-demand.
For his latest set of genome sequencing data, Simon was able to reduce his time to solution from a week to an afternoon!
For many researchers, the runtime on the cluster can be fast, but waiting in a queue is not only time-consuming, but also unpredictable. Azure Big Compute provides a compelling combination of fast, big memory VMs and no queuing, making it a huge time-saver for researchers.
This is a great example of how well suited Azure is to running compute-intensive Linux and OSS workloads. It shows how some of the biggest and most complex processes can be moved easily to the cloud to take advantage of bigger VMs and on-demand scalability. The experience of doing this was even smoother than might be imagined, with Azure truly becoming an extension of the desktop, seamlessly supporting standard OSS software stacks.
When Big Compute was launched we coupled fast processors and big memory with InfiniBand QDR low-latency, high-bandwidth network interconnects. Subsequently we have found that users like Simon and Matthew want powerful compute instances but are happy to use standard Gigabit Ethernet networking. It’s nice to see that the A10 and A11 instances can accelerate these complex workloads beyond what is otherwise possible.
Some final words
We’ve loved working with Simon and Matthew to see how Azure could help them, and we want to hear from you about how your workloads can be accelerated with Azure. Both the Big Compute team and we here in Microsoft Research are here to support you. For Simon, finishing a week’s work in an afternoon is great, and scaling that up means running a year’s worth of computing in less than a month. Just imagine how that kind of speed-up could help you.
To learn more about Big Compute solutions on Azure, see:
To learn more about Linux and OSS on Azure, see:
To learn more about Azure for Research, see:
To learn more about Genomics at Microsoft, see:
Source: Microsoft Azure News