Data scientists should learn how to write good code

Data scientists spend much of their time writing custom code, because it is the most flexible way to build the abstractions a project needs. Since we write so much custom code, we also need tools and habits that help keep its quality high.

Use relative paths to project roots

Inside Jupyter notebooks, I commonly see that we read in data from the filesystem using paths that look like this:

df = pd.read_csv("../../data/something.csv")

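This is troublesome because the path is resolved relative to the notebook's working directory (by default, the directory the notebook lives in). If the notebook moves, the relative path no longer points at the data.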

We can get around this by using pyprojroot, whose here() function gives us a path anchored at the project root rather than at the notebook.

import pandas as pd
from pyprojroot import here

df = pd.read_csv(here() / "data/something.csv")

Now we need to update the path in our notebooks only if the data itself moves, not when a notebook does.
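To make the idea behind a project-root lookup concrete, here is a minimal sketch using only the standard library. It is not pyprojroot's implementation: the function name find_project_root and the marker list are assumptions chosen for illustration. The sketch walks upward from a starting directory until it finds a directory containing one of the marker files.

```python
from pathlib import Path
from typing import Iterable, Optional

# Hypothetical marker names; pick whatever reliably identifies your project root.
ROOT_MARKERS = (".git", "pyproject.toml", ".here")


def find_project_root(
    start: Optional[Path] = None,
    markers: Iterable[str] = ROOT_MARKERS,
) -> Path:
    """Walk upward from `start` until a directory containing a marker is found."""
    current = (start or Path.cwd()).resolve()
    markers = tuple(markers)
    # Check the starting directory first, then each parent in turn.
    for candidate in (current, *current.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError(f"No project root marker found at or above {current}")
```

Because the search starts from wherever the notebook is running and climbs upward, a notebook can move between subdirectories of the project and still resolve the same root.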

If multiple notebooks use the same file, it may be prudent to refactor the file path itself into a variable that gets imported. That way, you have a single source of truth for the path to the file of interest:

# this is a custom source file, like "custom.py"
from pyprojroot import here

something_path = here() / "data/something.csv"

And then in our notebooks:

import pandas as pd
from custom import something_path

df = pd.read_csv(something_path)

Now, if the file path changes, we update one location and the code should work across all notebooks; if the notebook file path changes, we need not do anything to guarantee that the data path is correct.

Code quality tools

For .py files:

  • black: Code formatting
  • mypy: Optional static type checking
  • isort: Sorting imports sanely
  • pylance: Fast code quality checking in VS Code
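As a small illustration of what these tools work with, here is a sketch of a type-annotated function. The function name column_mean is hypothetical and only serves as an example. The annotations are what a static type checker such as mypy reads; they do not change how the code runs.

```python
from typing import Sequence


def column_mean(values: Sequence[float]) -> float:
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)


# A static type checker is expected to reject a call such as
# column_mean("abc") before the code runs, because a str is not
# a Sequence[float]. At runtime, Python itself does not enforce the hints.
```

The value of the annotations is that mistakes of this kind can surface during checking rather than halfway through a long-running notebook.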

For Jupyter notebooks:

  • nbqa: Run any checker that works on .py files on .ipynb notebooks instead.
  • nbstripout: Strip outputs from Jupyter notebooks to make them clean before committing.
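To show what "stripping outputs" means in practice, here is a minimal sketch that uses only the standard library. It is not nbstripout's implementation, and the function name strip_outputs is an assumption for illustration. It relies on the fact that a notebook is a JSON file whose code cells carry "outputs" and "execution_count" fields.

```python
import json
from pathlib import Path


def strip_outputs(notebook_path: Path) -> None:
    """Clear outputs and execution counts from every code cell, in place."""
    notebook = json.loads(notebook_path.read_text(encoding="utf-8"))
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []              # drop rendered results, plots, tracebacks
            cell["execution_count"] = None    # reset the In[...] counter
    notebook_path.write_text(json.dumps(notebook, indent=1) + "\n", encoding="utf-8")
```

With outputs removed, version-control diffs show only changes to the code and prose, which is the main reason to strip notebooks before committing.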
