# Data scientists should learn how to write good code
Data scientists most commonly write custom code: it's the most flexible way to build the abstractions we need. But writing custom code also means we need tools that help us maintain code quality.
## Use paths relative to the project root
Inside Jupyter notebooks, I commonly see data read in from the filesystem using paths that look like this:

```python
df = pd.read_csv("../../data/something.csv")
```
This is fragile: if the notebook moves, the relative path breaks.
We can get around this by using `pyprojroot`:

```python
from pyprojroot import here

df = pd.read_csv(here() / "data/something.csv")
```
Now, the only time we need to update the path in our notebooks is when the data file itself moves.
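Under the hood, `here()` locates the project root by walking up the directory tree until it finds a root marker (such as a `.git` directory or a `.here` file), which is why the result is stable no matter how deeply the notebook is nested. Here is a minimal sketch of that idea using only `pathlib`; the `find_root` name and the exact marker list are my own illustration, not pyprojroot's actual API:

```python
from pathlib import Path

# Illustrative re-implementation of root-marker discovery (not pyprojroot itself).
def find_root(start: Path, markers=(".git", ".here", "pyproject.toml")) -> Path:
    """Walk upward from `start` until a directory containing a root marker is found."""
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError(f"No project root marker found above {start}")
```

Because the search starts wherever the code runs and walks upward, every notebook in the project resolves to the same root directory.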
If multiple notebooks use the same file, it may be prudent to refactor the file path itself into a variable that gets imported. That way, you have a single source of truth for the path to the file of interest:
```python
# this is a custom source file, like "custom.py"
from pyprojroot import here

something_path = here() / "data/something.csv"
```
And then in our notebooks:
```python
from custom import something_path

df = pd.read_csv(something_path)
```
Now, if the file path changes, we update one location and the code keeps working across all notebooks; and if the notebook itself moves, we need to do nothing at all, because the data path is still correct.
## Code quality tools
For `.py` files:

- `black`: code formatting
- `mypy`: optional static type checking
- `isort`: sorting imports sanely
- `pylance`: fast code quality checking in VSCode

For Jupyter notebooks:

- `nbqa`: run any file checker that would run on `.py` files on `.ipynb` notebooks instead
- `nbstripout`: strip outputs from Jupyter notebooks to make them clean before committing

# State of Data Science
This was inspired by my participation in the TAO Data Science Panel.
I'm starting to see a bifurcation between research data science and business data science.
- How this translates to training needs and hiring
- Notes for managing data scientists