Eric's Notes

State of Data Science

This was inspired by my participation in the TAO Data Science Panel.

I'm starting to see a bifurcation in research vs business data science.

Research vs Business Data Science

How this translates to training needs and hiring

Researchers think mechanistically about the world
Learn how to learn fast
Operate outside your pay grade
Data scientists should know the data generating process behind their data
Data scientists should learn how to write good code
Avoiding technical debt with ML pipelines

And notes for managing data scientists:

How to motivate your problem solvers

TAO Data Science Panel

Questions and answers

How does your company define data science? How would you describe it to a layperson?

Feels like the early days of synthetic biology: 10 professors giving 11 definitions.

For me, it is the use of modelling methods to:
1. Answer scientific questions at the boundaries of knowledge.
2. Accelerate decision-making in routine research processes.

What types of problems do data scientists work on at your organization?

Tons! Here's a sampling of the themes I've seen at work:

Building knowledge graphs
Image search and retrieval
Predictive models of molecule properties
Input design optimization

How do you use data science in your role?

It's my full-time job :).

How much of your work is independent vs team-oriented?

Our specialties mean that each of us is always conducting highly independent research work, but as part of a team of scientists who are answering a bigger question.

What do you wish you had known when you were going to school/first starting your career in relation to data science?
How can those interested in a data science career get access to the tools they need to succeed? What skills are commonly missing in candidates?
Is it challenging to get talent in data science? What type of background is most valuable in your data science hires?

Each team looks for a slightly different thing. Some require deep background knowledge. Others require

What types of data does your organization work with?
What challenges does your organization face that can be solved with data science techniques?
Where do you think the field is headed?
What technical and soft skills are critical?

In terms of so-called "soft" skills, in my interviews, I look for the ability to translate problems into code, and vice versa. That includes being able to explain a model's core succinctly enough that another person could implement it in code.

Research vs Business Data Science

One of my colleagues (well, strictly speaking, my boss's boss) recently crystallized a key idea for my colleagues: the difference between biomedical research data science and tech business data science. I gave his ideas some thought, and decided to pen down what I saw as the biggest similarities and differences.

The goals of the two "forms" of data science are different:

The end goals of business data science
The end goals of research data science

There are issues that I'm seeing in the data science field. Some of the problems I have seen thus far:

Model-focused enhancements are saturating
Experimentation is getting a bit out of hand

And what I think is needed:

The goal of scientific model building is high explanatory power
The place for overparametrized models is tooling
Finding the appropriate model to apply is key

The key difference, I think, is that the end goals of business data science are about capturing value from existing processes, while the end goals of research data science are about opening new avenues of value from unknown, undeveloped, and uncaptured business processes. The latter is, and has always been, an investment; in a well-oiled system, the former likely generates profit that can and should be invested in the latter.

Learn how to learn fast

How do we learn how to learn fast?

I can mostly only speak for myself, and even then, I know I'm not the fastest learner. But some principles come to mind, which appear to have been battle-tested.

The first is to Build a project portfolio.
The second is to Learn like Feynman.
The third is to Learn adjacent topics, rather than distant ones.

Data scientists should learn how to write good code

Data scientists most commonly write and develop custom code. It's the most flexible way to write all the abstractions that are needed. Because we write custom code, we need some tools to help with code quality.

Code quality tools
Use relative paths to project roots
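
As one way to picture the "relative paths to project roots" idea, here is a minimal sketch. It assumes the project root can be recognised by a marker file or directory (`pyproject.toml` or `.git` here, which is my assumption), and the helper name `find_project_root` is hypothetical rather than taken from any library.

```python
from pathlib import Path
from typing import Iterable


def find_project_root(
    start: Path,
    markers: Iterable[str] = ("pyproject.toml", ".git"),  # assumed markers
) -> Path:
    """Walk upward from `start` until a directory holding a marker is found."""
    start = start.resolve()
    markers = tuple(markers)
    # Check the starting directory first, then each parent in turn.
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError(f"No marker from {markers} found at or above {start}")
```

With something like this, a notebook or script can build paths such as `find_project_root(Path.cwd()) / "data"` (a hypothetical folder), so the same code runs regardless of which subdirectory it is launched from.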

How to motivate your problem solvers

From Brandon Rohrer's LinkedIn post

Hey engineering managers,

(I don’t speak for all of your team, but I probably speak for a couple of them.) Nothing pushes us to do our best work like a good sticky problem to solve. Motivational sayings don’t do much. Targets and reporting is a necessary evil. Micromanaging is a huge annoyance. Bonuses are nice, but not really a reason to jump out of bed in the morning. And browbeating is the fastest way to chase us off. But give us a problem to solve and the space to solve it, and you won't be able to pry it out of our hands at the end of the workday.

Sincerely,
Your problem solvers

Operate outside your pay grade

I have heard rumblings before that we shouldn't deliver more value than what we're paid for. Grumbles that sound like, "It's just a job", or, "The company won't take care of you."

As a matter of career advice I would give to someone else, though, I think operating outside our pay grade is what we ought to be doing.

Operating outside our pay grade means both going above what we are paid to do, to get a better/higher contextual view of what we're doing, and moving sideways to do adjacent things beyond our role. (That is where "above and beyond" comes from, I guess.)

What does "operating outside of our pay grade" mean? I think the following:

Taking projects from conception to completion, so that you can build that portfolio of stuff done. (see: Build a project portfolio)
Mastering something rare and valuable.
Constantly cross-training (see: The importance of cross-training) to learn adjacent skills (see: Learn adjacent topics) and complementary ones.

So operating outside of our pay grade really means mastering adjacent skills in pursuit of being able to own a project end-to-end.

index

This is the landing page for my notes.

This is 100% inspired by Andy Matuschak's famous notes page. I'm not technically skilled enough to replicate the full "Andy Mode", though, so I just did some simple hacks. If you're curious how these notes are compiled, check out the summary in How these notes are made into HTML pages.

This is my "notes garden". I tend to it on a daily basis, and it contains some of my less fully-formed thoughts. Nothing here is intended to be cited, as the link structure evolves over time. The notes are best viewed on a desktop/laptop computer, because of the use of hovers for previews.

There's no formal "navigation", or "search" for these pages. To go somewhere, click on any of the "high-level" notes below, and enjoy.

Notes on statistics
Notes on differential computing
The State of Data Science
Network science
Scholarly readings
Software skills for data scientists
The Data Science Programming Newsletter MOC
Life and computer hacks
Reading Bazaar
Blog drafts
Conference Proposals

Data scientists should know the data generating process behind their data

I think a data scientist has a responsibility to be fully informed about how the numbers they're using are generated. This relates very much to knowing the data generating process for whatever input numbers they are using. We shouldn't take for granted the inputs that arrive in our hands.

Some key questions to always ask:

How was a number generated? What are the possible causal mechanisms?
What data are missing? Why are they missing? Are there possibly multiple causes for missing data?
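
The missing-data questions above can be made a bit more concrete. Below is a minimal sketch, using only the standard library, that checks whether missingness in one field differs across groups defined by another field. The function name, the field names, and the toy records are all made up for illustration; a large difference between groups is only a hint about a possible cause of missingness, not proof of one.

```python
from collections import defaultdict
from typing import Any, Dict, Hashable, List

Record = Dict[str, Any]


def missing_fraction_by_group(
    records: List[Record], value_key: str, group_key: str
) -> Dict[Hashable, float]:
    """Fraction of records per group whose `value_key` is missing (None)."""
    totals: Dict[Hashable, int] = defaultdict(int)
    missing: Dict[Hashable, int] = defaultdict(int)
    for record in records:
        group = record.get(group_key)
        totals[group] += 1
        if record.get(value_key) is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Toy, invented records: readings taken on two hypothetical instruments.
records = [
    {"instrument": "A", "reading": 0.82},
    {"instrument": "A", "reading": 0.79},
    {"instrument": "B", "reading": None},
    {"instrument": "B", "reading": 0.91},
    {"instrument": "B", "reading": None},
    {"instrument": "A", "reading": 0.85},
]

fractions = missing_fraction_by_group(records, "reading", "instrument")
```

In this toy case all of the missing readings come from instrument "B", which would prompt the follow-up question of why that instrument drops values.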

Researchers think mechanistically about the world

What do we mean by "thinking mechanistically"? This refers to being able to reason about the mechanisms behind data generating processes. In research data science, this likely requires a deep knowledge of the field that one is applying quantitative methods to, i.e. domain expertise. Though deep domain knowledge is usually correlated with doctoral training or many prior years of work experience, a newcomer can compensate for domain knowledge deficits by demonstrating the skill of being able to learn domain knowledge really quickly, or by having complementary breadth of modelling knowledge that has been honed in a diverse set of settings.

Someone who does not yet have the domain expertise to think mechanistically about a problem needs to Learn how to learn fast.

Avoiding technical debt with ML pipelines

One of my readings from the web, taken from here.

By Hamza Tahir.

Emphasis on pipelines, end-to-end ownership, well-defined interfaces... sounds a ton like modern software development. (Which is why I think knowing modern software development workflow is a boon for data scientists.)
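
To illustrate what "well-defined interfaces" between pipeline steps might look like, here is a minimal sketch. It is my own illustration and not taken from the article: the `Dataset` type, the two step functions, and `run_pipeline` are all hypothetical names. The idea is simply that every step accepts and returns the same typed object, so steps can be tested on their own and reordered or swapped.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Dataset:
    """The single interface that every pipeline step consumes and produces."""
    features: List[List[float]]
    targets: List[float]


Step = Callable[[Dataset], Dataset]


def drop_rows_with_nan(data: Dataset) -> Dataset:
    """Remove rows where any feature or the target is NaN."""
    kept = [
        (x, y)
        for x, y in zip(data.features, data.targets)
        if not math.isnan(y) and not any(math.isnan(v) for v in x)
    ]
    return Dataset([x for x, _ in kept], [y for _, y in kept])


def center_targets(data: Dataset) -> Dataset:
    """Subtract the mean target value from every target."""
    if not data.targets:
        return data
    mean = sum(data.targets) / len(data.targets)
    return Dataset(data.features, [y - mean for y in data.targets])


def run_pipeline(data: Dataset, steps: Sequence[Step]) -> Dataset:
    """Apply each step in order, passing the output of one to the next."""
    for step in steps:
        data = step(data)
    return data


raw = Dataset([[1.0], [float("nan")], [3.0]], [2.0, 5.0, 4.0])
clean = run_pipeline(raw, [drop_rows_with_nan, center_targets])
```

Because each step is a plain function with the same signature, unit tests can target one step at a time, which is the kind of habit that carries over from software development.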