written by Eric J. Ma on 2023-09-09 | tags: data science biotech research machine learning problem spotting problem scoping problem shepherding solution translating laboratory science protein engineering antibody therapies
Today, I had a chance to read the article "4 Skills the Next Generation of Data Scientists Needs to Develop." (h/t Eric Yang for sharing this one on LinkedIn.) This article resonated massively with me, and I wanted to share how some of its ideas translate to the DSAI Research team at Moderna.
The first skill is distinguishing the real issues faced by our collaborators from the apparent problems. The real issues might be different from the issues they think they have.
Within biotech research, scientists vary in how familiar they are with machine learning methods and with what those methods can bring to laboratory science. As such, our collaborators may come to us with requests that need refining before they match what the collaborators actually need.
One example that I can recall was related to the hype of ChatGPT. A news and perspectives article in Nature wrote:
"Language models similar to those behind ChatGPT have been used to improve antibody therapies against COVID-19, Ebola and other viruses."
And pretty soon, we were getting inquiries about whether we could use an internal implementation of GPT-4 to design protein sequences.
An inexperienced data scientist would say, "That sounds like a cool idea! We should try it out!"
But an experienced data scientist well-versed in the training of ML models and laboratory science would immediately see the flaws in that logic and instead ask what the real problems are -- why would one want to design proteins in the first place? Here are some examples of questions we might ask:
Notice how the questions we're asking aren't related to ChatGPT at all! Instead, they're all questions about the laboratory science itself. These questions are designed to address the lab scientist's fundamental scientific challenges. Doing so is paramount for our team to continue being valuable and innovative: uncovering the fundamental problems orients us to build Model Ts instead of breeding faster horses.
The second skill is asking probing questions that narrow down the space of problems and the possible solutions that our collaborators could use.
Let's continue the previous example by expanding on option #2, which is antibody-related.
Gaining clarity would involve asking follow-up questions that look like these:
With answers to these and other questions of the same spirit, we can enhance our understanding of the problem space. We can also begin to see connections between what our collaborators need and what other collaborators may need and begin to design entire systems that can benefit more than one group at a time.
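The third skill is shepherding the problem together with our collaborators instead of solving it in isolation. As mentioned in the article, it's tempting for a data scientist to dive into the problem for an extended period and come back up with a solution that may impress a collaborator.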
Within a biotech context, this is junior-level thinking -- and an easy way to lose the trust of sharp and well-trained PhD-level scientists and leaders. Universities train PhDs to question everything they don't understand. No assumption will go unexamined.
A junior-thinking data scientist who stays disconnected in order to do a "big reveal" will likely not understand the questions the laboratory scientists will ask. They may end up building something that the laboratory scientists don't understand and are unwilling to adopt and use.
By contrast, the senior-thinking data scientist who makes the concerted effort to jointly co-create the solution with the laboratory scientist will cultivate trust in the solution, build their vocabulary within the scientific domain, and construct a library of misconceptions and assumptions they can anticipate and address in future builds.
The fourth skill is translating solutions into the audience's language, which the original article treats as paramount. As you might have guessed from the previous points, when developing solutions for laboratory science, we need to speak the language of laboratory science throughout the entire conversation. It's insufficient to stop at general questions that sound like:
By contrast, the questioning strategy that garners trust involves asking progressively more specific questions. Notice the specificity level of the questions above in section 2. In my book, those are only the basic questions; our questions are best when they show that we truly understand the underlying biology, chemistry, immunology, analytical chemistry, organic chemistry, or other theory that backs a problem. When we also speak the vocabulary of the laboratory methods, shown, for example, by actively questioning how experiments are designed and possessing an order-of-magnitude intuition about assay throughputs, that trust with laboratory scientists is solidified.