Lessons Learned Optimizing AlphaFold at Moderna

written by Eric J. Ma on 2023-12-13 | tags: alphafold moderna technology cloudcomputing aws infrastructure code optimization refactoring scaling challenges continuous integration

At work (Moderna), we use AlphaFold. I was one of the people who deployed it to run on our internal infrastructure. At the same time, my initial implementation could have been more efficient: we ran the entire AlphaFold pipeline on a p3 or g5 instance, which was costly per folding run. On a small scale, it got the job done. That was, of course, until we had power users hammer AlphaFold. That's when the flaws in our current implementation became evident - we started hitting limits on AWS' GPU instance availability.

Our initial deployment of AlphaFold was inefficient because the multiple sequence alignment (MSA) step, which was CPU-bound, was running on the GPU instance while not using the GPU. In my final week of work (week of December 5) before I went on vacation for the rest of the year (to work on house renovation matters), I made a successful attempt to refactor the AlphaFold execution script such that it would run the CPU-bound MSA steps separately from the GPU-bound structure folding steps. While the original implementation was initially designed to work on a beefy workstation or High-Performance Computing (HPC) cluster, we had scalable infrastructure that allowed us to run many more jobs than before, which also meant passing information from the MSA job to the structure prediction job via the filesystem.

At its core, the key object passed from the MSA job to the structure prediction job is the features dictionary, which (by default) gets saved to disk as a Python pickle by AlphaFold. Once we identified this key interface (which took me days to figure out by digging through source code), it was immediately apparent where to perform code lobotomy. Once the refactor was complete, we could run MSA jobs on r6a instances and structure prediction jobs on g5 instances while orchestrating the two jobs from a dirt-cheap c5 instance. In doing so, we estimated that running jobs AlphaFold would cost 1/3 the original price. But beyond the impact, I recognized that I had a bunch of technical lessons that I thought would be useful for others, and I'd like to share them here.

Lesson 1: Static analysis tools can help catch code errors early

The first lesson I wanted to share is investing in mastering an integrated development environment (IDE). VSCode, when fully configured, can be a super powerful tool! While hacking on the deployment codebase, I used static analysis tools (pyright, flake8, ruff) to check that we had the correct code inside the Typer CLI we built for AlphaFold. Along the way, when splitting functions up, if we weren't using an argument in a function or didn't declare it in the function signature, I would get a red squiggly line under that variable, immediately giving me a visual check that the code contained errors. Catching these errors early on helped with development cycle time.

Lesson 2: CI/CD caching is incredibly useful

This lesson is an inverse lesson for me because we use Bitbucket at work and don't have the kind of cache sizes available to GitHub Actions users that I'm used to in my open-source projects. As such, my cycle times for testing were 30-45 minutes rather than 2-3 minutes, which can be incredibly distracting. On the other hand, I could work around the long cycle times by setting a watch timer for AlphaFold. I am sure that we would have finished the refactor faster if we had better caching!

Lesson 3: It pays to read code first before writing code

For a while, I was intimidated by what it might take to refactor the AlphaFold executor script. Our original rewrite included mixed abstraction levels, making it difficult to know whether I had broken something. In some ways, I was paying the price for my own design choices made earlier. That said, taking the time to slowly read the code while making small breaking changes, which was approximately 1-2 days worth of time, really helped build momentum for the breakthrough on 3rd day when the needed abstractions finally clicked in my head. Greg Brockman of OpenAI once said,

Much of the challenge in machine learning engineering is fitting the complexity into your head. Best way to approach is with great patience & a desire to dig into any detail.

Twitter

This statement resonated so much during this refactor.

Lesson 4: Scaling is hard, and not for the reasons you might think

When scaling from hundreds of AlphaFold jobs a month to thousands, we ran into all sorts of limits that I was unaware of. A majority of these were AWS' limits: they have a limited pool of GPU instances that we share with the entire set of users of us-<region_name>-<number>, and when we hit NeurIPS season, that pool of available GPUs also shrunk. Additionally, AWS limits the number of IP addresses we can use within a subnet, the total amount of EBS storage provided for EC2 instances, the total number of instances we can run, and more. A lot of these limits help prevent accidental surprise bills that may show up. However, these limits also put an upper limit on the amount of actual computation we might want to do, so bursting our jobs in the cloud may result in incredibly frustrating experiences as people wait for others ahead in the queue to finish their work. And because of the way our cybersecurity works, we have to submit tickets internally to request account limit changes. Talk about kicking out bureaucracy.

These limits were the most surprising piece for me because I had previously assumed that scaling would be difficult for most problems because of complex technical matters. However, scaling AlphaFold was challenging because of invisible business guardrails instead.

Lesson 5: Working in the "open" breeds trust

Some technical folks have a penchant for working behind closed doors and delivering a gift-wrapped deliverable to their colleagues. In my case, though, I chose to work as openly as I could internally, including sharing screenshots of code, dev builds, job failures and successes, and even personal frustrations (think "ARRRRGH!" or "WOOT!" peppering MS Teams chat rooms). Heart-on-sleeve sharing turned out to help my colleagues better understand the challenges we faced when building and scaling systems. The constant stream of update snippets helped reassure my colleagues that progress, however choppy, was being made day by day. By contrast, if I had chosen to work in the dark and surprise them at the end, I can't imagine the kind of anxiety they would have faced otherwise.

Conclusion

I've had the privilege of reverse engineering, designing, and building some complex mathematical models and systems, such as:

reverse-engineering a hierarchical Dirichlet process autoregressive Gaussian hidden Markov model (2018 with my then-intern Nitin Kumar Mittal at NIBR),
re-implementing a protein language model (recurrent neural network) from scratch (2019 with Arkadij Kummer at NIBR) and speeding it up 100X,
co-designing an abstracted data model for laboratory process capture (2023 at Moderna) and
building a scalable, automated machine learning system for chemical property prediction (2019-2020 at NIBR)

But this one feels like the crowning achievement of 2023, given the intensity of the refactor, the speed at which we managed to pull it through, and the impact it should have on my colleagues in the Research organization. It touched on cloud limits, which was a first for me. I also had the support of a larger group of colleagues - both the maintainers of our internal infrastructure in Seattle and the power AlphaFold users in Cambridge, and having that very human support was important for pulling through. Refactoring AlphaFold was also, in the words of my teammate Jackie Valeri, the B-plot to the first ever Moderna Data Science Guild hackathon that was ongoing concurrently. That made it double the fun!

What would you consider to be your crowning technical achievement of the year?

I send out a newsletter with tips and tools for data scientists. Come check it out at Substack.

If you would like to sponsor the coffee that goes into making my posts, please consider GitHub Sponsors!

Finally, I do free 30-minute GenAI strategy calls for organizations who are seeking guidance on how to best leverage this technology. Consider booking a call on Calendly if you're interested!

Eric J Ma's Website