State of Data Science
This was inspired by my participation in the TAO Data Science Panel.
I'm starting to see a bifurcation in research vs business data science.
How this translates to training needs and hiring
And notes for managing data scientists:
TAO Data Science Panel
Feels like the early days of synthetic biology: 10 professors giving 11 definitions.
For me, it is the use of modelling methods to:
1. Answer scientific questions at the boundaries of knowledge.
2. Accelerate decision-making in routine research processes.
Tons! Here's a sampling of the themes I've seen at work:
It's my full-time job :).
Our specialties mean we are always conducting highly independent research work - as part of a team of scientists who are answering a bigger question.
Each team looks for a slightly different thing. Some require deep background knowledge. Others require
In terms of so-called "soft" skills, in my interviews, I look for the ability to translate problems into code, and vice versa. I also look for the ability to explain a model's core succinctly enough that another person could implement it in code.
Research vs Business Data Science
One of my colleagues (well, strictly speaking my boss' boss) recently crystallized a key idea for my colleagues: the difference between biomedical research data science and tech business data science. I gave his ideas some thought and decided to pen down what I saw as the biggest similarities and differences.
The goals of the two "forms" of data science are different:
There are issues that I'm seeing in the data science field. Some of the problems I have seen thus far:
And what I think is needed:
The key difference, I think, is that the end goal of business data science is capturing value from existing processes, while the end goal of research data science is opening up new avenues of value from unknown, undeveloped, and uncaptured business processes. The latter is and has always been an investment to make; in a well-oiled system, the former likely generates profit that can and should be invested in the latter.
Learn how to learn fast
How do we learn how to learn fast?
I can mostly only speak for myself, and even then, I know I'm not the fastest learner. But some principles come to mind, which appear to have been battle-tested.
Data scientists should learn how to write good code
Data scientists most commonly write and develop custom code. It's the most flexible way to write all the abstractions that are needed. Because we write custom code, we need tools to help with code quality.
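As a rough illustration of what "code quality" can look like in practice, here is a minimal sketch of a small custom function with type hints, a docstring, input checks, and a pytest-style test. The function name `standardize` and the test are hypothetical examples of my own choosing, not something prescribed by any particular tool.

```python
from typing import Sequence


def standardize(values: Sequence[float]) -> list[float]:
    """Rescale values to zero mean and unit (population) standard deviation."""
    if len(values) == 0:
        raise ValueError("values must not be empty")
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    if variance == 0:
        raise ValueError("values must not all be identical")
    std = variance ** 0.5
    return [(v - mean) / std for v in values]


def test_standardize() -> None:
    """Pytest-style check: the output should have mean 0 and variance 1."""
    result = standardize([1.0, 2.0, 3.0, 4.0])
    assert abs(sum(result) / len(result)) < 1e-9
    assert abs(sum(r ** 2 for r in result) / len(result) - 1.0) < 1e-9
```

The type hints give a static checker something to verify, the docstring records intent, and the test can run automatically whenever the code changes.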
How to motivate your problem solvers
From Brandon Rohrer's LinkedIn post
Hey engineering managers,
(I don’t speak for all of your team, but I probably speak for a couple of them.) Nothing pushes us to do our best work like a good sticky problem to solve. Motivational sayings don’t do much. Targets and reporting is a necessary evil. Micromanaging is a huge annoyance. Bonuses are nice, but not really a reason to jump out of bed in the morning. And browbeating is the fastest way to chase us off. But give us a problem to solve and the space to solve it, and you won't be able to pry it out of our hands at the end of the workday.
Sincerely,
Your problem solvers
Operate outside your pay grade
I have heard rumblings before that we shouldn't deliver more value than what we're paid for. Grumbles that sound like, "It's just a job", or, "The company won't take care of you."
As a matter of career advice I would give to someone else, though, I think operating outside our pay grade is what we ought to be doing.
Operating outside our pay grade means both going above what we are paid to do, to get a better/higher contextual view of what we're doing, and moving sideways to do adjacent things beyond our role. (That is where "above and beyond" comes from, I guess.)
What does "operating outside of our pay grade" mean? I think the following:
So operating outside of our pay grade really means mastering adjacent skills in pursuit of being able to own a project end-to-end.
index
This is the landing page for my notes.
This is 100% inspired by Andy Matuschak's famous notes page. I'm not technically skilled enough to replicate the full "Andy Mode", though, so I just did some simple hacks. If you're curious how these notes are compiled, check out the summary in How these notes are made into HTML pages.
This is my "notes garden". I tend to it on a daily basis, and it contains some of my less fully-formed thoughts. Nothing here is intended to be cited, as the link structure evolves over time. The notes are best viewed on a desktop/laptop computer, because of the use of hovers for previews.
There's no formal "navigation", or "search" for these pages. To go somewhere, click on any of the "high-level" notes below, and enjoy.
Data scientists should know the data generating process behind their data
I think a data scientist has a responsibility to be fully informed about how the numbers they're using are generated. This relates very much to knowing the data generating process for whatever input numbers they are using. We shouldn't take for granted the inputs that arrive in our hands.
Some key questions to always ask:
See also: Embeddings should be treated with care
Researchers think mechanistically about the world
What do we mean by "thinking mechanistically"? This refers to being able to think mechanistically through data generating processes. In research data science, this likely requires a deep knowledge of the field that one is applying quantitative methods to, i.e. domain expertise. Though deep domain knowledge is usually correlated with doctoral training or many prior years of work experience, a newcomer can compensate for domain knowledge deficits by demonstrating the skill of being able to learn domain knowledge really quickly, or by having complementary breadth of modelling knowledge that has been honed in a diverse set of settings.
Someone who does not yet have the domain expertise to think mechanistically about a problem needs to Learn how to learn fast.
Avoiding technical debt with ML pipelines
One of my readings from the web, taken from here.
By Hamza Tahir.
Emphasis on pipelines, end-to-end ownership, well-defined interfaces... sounds a ton like modern software development. (Which is why I think knowing modern software development workflow is a boon for data scientists.)
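To make "pipelines with well-defined interfaces" a little more concrete, here is a minimal sketch in which every step shares one signature (a list of rows in, a list of rows out), so steps can be composed, swapped, and tested independently. The names `Pipeline`, `drop_missing`, and `add_squared` are hypothetical and are not taken from the article.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Assumed interface for this sketch: a row is a dict, a step maps rows to rows.
Row = Dict[str, Optional[float]]
Step = Callable[[List[Row]], List[Row]]


def drop_missing(rows: List[Row]) -> List[Row]:
    """Keep only rows that have a non-missing 'value' field."""
    return [r for r in rows if r.get("value") is not None]


def add_squared(rows: List[Row]) -> List[Row]:
    """Add a derived 'value_squared' feature without mutating the input rows."""
    return [{**r, "value_squared": r["value"] ** 2} for r in rows]


@dataclass
class Pipeline:
    """Run a sequence of steps in order, feeding each output to the next step."""

    steps: List[Step]

    def run(self, rows: List[Row]) -> List[Row]:
        for step in self.steps:
            rows = step(rows)
        return rows


pipeline = Pipeline(steps=[drop_missing, add_squared])
cleaned = pipeline.run([{"value": 2.0}, {"value": None}])
```

Because every step honours the same interface, adding or replacing a step does not require touching the others, which is the same idea behind modular design in software development.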