About::Work
I accelerate biological and chemical research using statistical and machine learning methods.
"Data scientist" at NIBR.
I've played every role you can think of: protein engineer, data engineer, software engineer, data scientist, scientist who talks to leadership...
I have struggled through enough bad workflows to be able to share what I think good workflows look like.
About::Open Source
Clean APIs for cleaning data
Beginner-friendly to contribute!
Teaching applied network science
Freely available for all on the web.
You can purchase an offline book!
Also raffling two download codes for today's talk 😄
Come let us know if you're interested in reading it!
Everything outside of building the model
Growth is dictated not by total resources available, but by the scarcest resource (limiting factor).
(Ref.: Wikipedia)
All of the answers to my examples above will be in the negative.
X belongs in the set of:
I have unpublished work and unfinished things because those projects didn't start with a defined impact.
Always, always define what impact looks like before you embark on a project.
What does "production" look like?
API?
curl https://my.company.server/predict/
FastAPI is a superb tool for building APIs!
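A minimal sketch of what such a predict endpoint could look like, assuming FastAPI; the route, file name, and return value are made up for illustration:
# app.py -- hypothetical sketch of a /predict endpoint
from fastapi import FastAPI

app = FastAPI()

@app.get("/predict/")
def predict():
    # In a real project, load your trained model and compute a prediction here.
    return {"prediction": 0.42}
Serve it with, e.g., uvicorn app:app, and the curl command above would hit it.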
Software library?
from project_source import stuff
Dashboard?
https://my.company.server/dashboard
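A minimal, hypothetical dashboard sketch, assuming Streamlit (the same tool used in the Makefile's app target later); the file path and displayed values are illustrative:
# apps/app.py -- hypothetical sketch
import streamlit as st

st.title("Model predictions")
value = st.number_input("Input value", value=1.0)
# In a real project, call into your source library (e.g. project_source) for predictions.
st.write(f"Prediction for {value} goes here.")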
Excel spreadsheet?
./outputs
|- model_output-20200721.xlsx
Auto-generated emailed reports to leadership inboxes?
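A hedged sketch of how such a report email could be sent with the Python standard library; the SMTP host, addresses, and attachment path are placeholders, not a prescribed setup:
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Weekly model report"
msg["From"] = "ds-bot@my.company.server"    # placeholder sender
msg["To"] = "leadership@my.company.server"  # placeholder recipient
msg.set_content("Latest model output attached.")
with open("outputs/model_output-20200721.xlsx", "rb") as f:
    msg.add_attachment(
        f.read(),
        maintype="application",
        subtype="octet-stream",
        filename="model_output-20200721.xlsx",
    )
with smtplib.SMTP("smtp.my.company.server") as server:  # placeholder SMTP host
    server.send_message(msg)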
A data science project is often surrounded by a bunch of software.
We might as well learn from software developers.
Shape up the project.
The book is geared towards software teams. Data teams will need to adapt.
Work backwards from the end goal, or design your project outside-in.
Most complex model not always needed.
If <simple model here>
solves your problem...
...then that's what you should use.
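For instance, a hypothetical baseline sketch, assuming scikit-learn; the data and the choice of model are purely illustrative:
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.arange(10).reshape(-1, 1)  # toy feature
y = 2 * X.ravel() + 1             # toy target
model = LinearRegression().fit(X, y)
print(model.predict([[11]]))      # if this is good enough, stop here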
Oftentimes, the harder part is satisfying client requests using code.
"We might as well learn from software developers."
./ # my project directory
./
|- .git/
./
|- .git/
|- environment.yml
./
|- .git/
|- environment.yml
|- src/ # LOOK HERE!
   |- proj_package/
   |- setup.py
Minimize code duplication. Single source of truth. Easier to test.
Good software: "well-defined categories of things".
./
|- .git/
|- environment.yml
|- src/
   |- proj_package/
   |- setup.py
|- notebooks/ # LOOK HERE!
./
|- .git/
|- environment.yml
|- src/
   |- proj_package/
   |- setup.py
|- tests/ # LOOK HERE!
|- notebooks/
Automatically test your code and validate your data.
./
|- .git/
|- environment.yml
|- src/
   |- proj_package/
   |- setup.py
|- tests/
|- notebooks/
|- README.md # LOOK HERE!
Explain what the project is all about!
./
|- .git/
|- environment.yml
|- src/
   |- proj_package/
   |- setup.py
|- tests/
|- notebooks/
|- README.md
|- docs/ # LOOK HERE!
You want to be able to work on any machine, not just your laptop.
If you can develop your project on any machine, your end product has a chance of being portable across any machine too.
A sane baseline environment file:
name: my-project
channels:
  - conda-forge
dependencies:
  - python=3.8
  # Basic data science stack
  - jupyter>=2.0  # wherever possible, pin versions
  - jupyterlab
  - conda
  - mamba  # use mamba for fast conda environment solving!
  - ipython
  - ipykernel
  - numpy
  - matplotlib
  - scipy
  - pandas
  - pip
  # Code quality utilities
  - pre-commit
  - black
  - nbstripout
  - mypy
  - flake8
  - pycodestyle
  - pydocstyle
  - pip:
      # Good for documentation!
      - mkdocs
      - mkdocs-material
      - mkdocstrings
      - mknotebooks
Environment files don't capture operating-system-level dependencies.
If you need operating-system-wide packages, you need a Dockerfile.
If you're going to deploy in a containerized environment, you need a Dockerfile.
FROM continuumio/miniconda3
COPY environment.yml /tmp/conda-tmp/.
RUN conda env create -f /tmp/conda-tmp/environment.yml && \
    rm -rf /tmp/conda-tmp/
Modified from here.
Invest early in automation, saving keystrokes, and mental shortcuts to commonly used things.
They pay dividends later.
Makefiles are your friend.
init: # make init
	bash install.sh
test: # make test
	pytest -v --cov
docs: # make docs
	mkdocs build
train: # make train
	python scripts/model_training.py
app: # make app
	streamlit run apps/app.py
Wrap commonly-executed bash commands inside a Makefile target so they are easy to execute.
Easily-executable commands make your Jenkins or Travis or Azure pipelines easy to develop.
# .travis.yml
# stuff above
script:
  - make test
# stuff below
If the project has a defined "production" end-point, then everything before it ought to be made reliable.
Your data sources may change under you, breaking things in unexpected ways.
The simplest data test:
def test_data_loading():
    data = load_some_data()
    assert set(data.columns) == set([...])
Schema validation with pandera. Say you have some data:
import pandas as pd
import pandera as pa
# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
})

# define schema
schema = pa.DataFrameSchema({
    # Column 1 gets a "statistical support" check
    "column1": pa.Column(pa.Int, checks=pa.Check.less_than_or_equal_to(10)),
    "column2": pa.Column(pa.Float, checks=pa.Check.less_than(-1.2)),
    "column3": pa.Column(pa.String, checks=[
        pa.Check.str_startswith("value_"),
        # define custom checks as functions that take a series as input and
        # output a boolean or boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})

validated_df = schema.validate(df)
print(validated_df)
Use schema in tests:
# tests/test_data.py
from project_source.data import some_schema, load_some_data
def test_load_some_data():
    data = load_some_data()
    some_schema.validate(data)  # test-time validation!
More tools:
CI systems that automatically execute tests on every git commit give you confidence that source code is correct.
CI systems that automatically execute your analyses, to ensure they are at least not borked, let you quickly isolate breaking changes and feed fixes back.
CI systems that automatically build and generate model artifacts save you from opening your notebooks just to regenerate outputs.
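A hypothetical extension of the .travis.yml snippet above, reusing the Makefile targets already defined; exact keys depend on your CI system:
# .travis.yml
script:
  - make test    # check that the source code is correct
  - make train   # regenerate model artifacts
  - make docs    # rebuild the documentation site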
One source library per project, with minimized code duplication.
No copies of data. Pull from the raw-est source. Transform using code.
All transformations done with code. No manual Excel handling.
Excel is great for prototyping, but you should translate your work into code very quickly.
Develop shortcut functions to your data.
This is an investment in saving keystrokes!
# src/project_source/data.py
import pandas as pd
import wget
def load_some_raw_data():
    """
    Very informative docstring goes here.
    Describe provenance of data, and data involved.
    """
    remote_url = "..."
    filename = wget.download(remote_url)
    return pd.read_csv(filename)
import janitor  # gives you some API superpowers!
import numpy as np
from functools import lru_cache  # alternative: cachier

@lru_cache(maxsize=32)  # speeds up reload time
def load_some_transformed_data():
    """
    Very informative docstring goes here.
    Describe provenance of data,
    and why transforms were done.
    """
    data = (
        load_some_raw_data()
        .add_column(...)
        .transform_column(..., np.log10, "log_transformed")
        .dropnotnull(...)
    )
    return data
./
|- docs/
   |- index.md
   |- more.md
   |- contributing.md
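A minimal, hypothetical mkdocs.yml sketch that would turn that docs/ folder into a website, using the mkdocs-material theme from the environment file; the site name and nav entries are illustrative:
# mkdocs.yml
site_name: My Project
theme:
  name: material
nav:
  - Home: index.md
  - More: more.md
  - Contributing: contributing.md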
Build and publish your docs on every project rebuild.
Beautiful websites like:
This Makefile line:
docs:
	mkdocs build
gives you this bash line:
make docs # :)
Habits build up over time.
Growth is dictated not by total resources available, but by the scarcest resource (limiting factor).
Your project will only move as far, as fast, and as high as its biggest limiting factor allows.
What slows you down?
There's probably a software practice that you can adapt!
Manually checking the same things?
Automate with a script!
Data issues failing you?
Automate data validation!
Unstable copy/pasted code?
Refactor your code!
Environment can't be reproduced?
Declare it with a file!
Can't remember how to do things?
Write documentation for it!
Sorry, but you will have even less time, and many-fold more stress, if you don't!
You build them in one by one.
The time will come back to you many-fold.
Earn credibility points by delivering something that they want.
Then spend the credibility points on requests for time to build reliability.
Build the virtuous cycle.