# Data scientists should learn how to write good code
Data scientists most commonly write custom code: it's the most flexible way to build the abstractions we need. But writing custom code also means we need tools that help us maintain code quality.
## Use paths relative to the project root
Inside Jupyter notebooks, I commonly see data read in from the filesystem using paths that look like this:

```python
df = pd.read_csv("../../data/something.csv")
```
This is fragile: if the notebook moves, the relative path breaks.
We can get around this by using `pyprojroot`:

```python
from pyprojroot import here

df = pd.read_csv(here() / "data/something.csv")
```
Now, the only time we need to update the path in our notebooks is when the data file itself moves.
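Under the hood, `here()` locates the project root by walking up the directory tree until it finds a root marker (such as a `.git` directory or a `.here` file), which is why the result is stable no matter how deeply the notebook is nested. Here is a minimal sketch of that idea using only `pathlib`; the `find_root` name and the exact marker list are my own illustration, not pyprojroot's actual API:

```python
from pathlib import Path

# Illustrative re-implementation of root-marker discovery (not pyprojroot itself).
def find_root(start: Path, markers=(".git", ".here", "pyproject.toml")) -> Path:
    """Walk upward from `start` until a directory containing a root marker is found."""
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError(f"No project root marker found above {start}")
```

Because the search starts wherever the code runs and walks upward, every notebook in the project resolves to the same root directory.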
If multiple notebooks use the same file, it may be prudent to refactor the file path itself into a variable that gets imported. That way, you have a single source of truth for the path to the file of interest:
```python
# this is a custom source file, like "custom.py"
from pyprojroot import here

something_path = here() / "data/something.csv"
```
And then in our notebooks:
```python
from custom import something_path

df = pd.read_csv(something_path)
```
Now, if the file path changes, we update one location and the code keeps working across all notebooks; and if the notebook itself moves, we need to do nothing at all, because the data path is still correct.
## Code quality tools
For `.py` files:

- `black`: code formatting
- `mypy`: optional static type checking
- `isort`: sorting imports sanely
- `pylance`: fast code quality checking in VSCode

For Jupyter notebooks:

- `nbqa`: run any file checker that would run on `.py` files on `.ipynb` notebooks instead
- `nbstripout`: strip outputs from Jupyter notebooks to make them clean before committing

# State of Data Science
This was inspired by my participation in the TAO Data Science Panel.
I'm starting to see a bifurcation between research data science and business data science.
- How this translates to training needs and hiring
- Notes for managing data scientists