Create runtime environment variable configuration files for each of your projects
When you work on your projects, one assumption you will usually have is that your development environment will look like your project's runtime environment with all of its environment variables. The runtime environment is usually your "production" setting: a web app or API, a model in a pipeline, or a software package that gets distributed. (For more on environment variables, see: Take full control of your shell environment variables)
Here, I'm assuming that you follow the practice of One project should get one git repository and that you Use pyprojroot to define relative paths to the project root.
To configure environment variables for your project, a recommended practice is to create a .env file in your project's root directory, which stores your environment variables as such:
export ENV_VAR_1="some_value"
export DATABASE_CONNECTION_STRING="some_database_connection_string"
export ENV_VAR_3="some_other_value"
We use the export syntax here because we can, in our shells, run the command source .env and have the environment variables defined in there applied to our environment.
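For example, in a shell session at the project root (using the variable from the sample .env above):
source .env
echo $DATABASE_CONNECTION_STRING  # should print the value defined in .env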
Now, if you're using a Python project, make sure you have the package python-dotenv (GitHub repo here) installed in the conda environment. Then, in your Python .py source files:
from dotenv import load_dotenv
from pyprojroot import here
import os
dotenv_path = here() / ".env"
load_dotenv(dotenv_path=dotenv_path) # this will load the .env file in your project directory root.
# Now, get the environment variable.
DATABASE_CONNECTION_STRING = os.getenv("DATABASE_CONNECTION_STRING")
In this way, your runtime environment variables get loaded into the runtime environment, and become available to all child processes started from within the shell (e.g. Jupyter Lab, or Python, etc.).
Your .env file might contain some sensitive secrets. You should always ensure that your .gitignore file contains .env in it.
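If you want a quick, idempotent way to do that from the project root, a shell one-liner like this works (it creates .gitignore if it doesn't exist yet):
# Append .env to .gitignore only if it isn't already listed there
grep -qxF '.env' .gitignore 2>/dev/null || echo '.env' >> .gitignore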
See also: Set up an awesome default gitignore for your projects
Get prepped per project
Treat your projects as if they were software projects for maximum organizational effectiveness. Why? The biggest reason is that it will nudge us towards getting organized. The "magic" behind well-constructed software projects is that someone sat down and thought clearly about how to organize things. The same principle can be applied to data analysis projects.
Firstly, some overall ideas to ground the specifics:
Some ideas pertaining to Git:
Notes that pertain to organizing files:
Notes that pertain to your compute environment:
And notes that pertain to good coding practices:
Treating projects as if they were software projects, but without software engineering's stricter practices, keeps us primed to think about the generalizability of what we do, but without the over-engineering that might constrain future flexibility.
Take full control of your shell environment variables
If you're not sure what environment variables are, I have an essay on them that you can reference. Mastering environment variables is crucial for data scientists!
Your shell environment, whether it is zsh or bash or fish or something else, is supremely important. It determines the runtime environment, which in turn determines which Python you're using, whether you have proxies set correctly, and more. Rather than leave this to chance, I would recommend instead gaining full control over your environment variables.
The simplest way is to set them explicitly in your shell initialization script. For bash shells, it's either .bashrc or .bash_profile. For the Z shell, it'll be the .zshrc file. In there, step by step, set the environment variables that you need system-wide.
For example, explicitly set your PATH environment variable with explainers that tell future you why you ordered the PATH in a certain way.
# Start with an explicit minimal PATH
export PATH=/bin:/usr/bin:/usr/local/bin
# Add in my custom binaries that I want available across projects
export PATH=$HOME/bin:$PATH
# Add in anaconda installation path
export PATH=$HOME/anaconda/bin:$PATH
# Add more stuff below...
If you want your shell initialization script to be cleaner, you can refactor it out into a second bash script called env_vars.sh, which lives either inside your home directory or your dotfiles repository (see: Leverage dotfiles to get your machine configured quickly). Then, source the env_vars.sh script from the shell initialization script:
source ~/env_vars.sh
Other installers, such as the Anaconda installer, may offer to modify your shell initialization script for you. If so, be sure to keep this in the back of your mind. At the end of your shell initialization script, you can echo the final state of your environment variables to help you debug.
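For example, a minimal sketch of what the tail end of your .bashrc or .zshrc might look like (swap in whichever variables you actually care about):
# Debugging aid: print the final state of important environment variables
echo "PATH=$PATH"
echo "CONDA_PREFIX=$CONDA_PREFIX"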
Environment variables that need to be set on a per-project basis are handled slightly differently. See Create runtime environment variable configuration files for each of your projects.
Use Jupyter as an experimentation playground
I use Jupyter notebooks in the following ways.
Firstly, I use them as a prototyping environment. They are wonderful, because I can hold the state of a program in memory and interactively modify it until I get what I need out of the program. (This especially saves on time spent re-computing things.)
Secondly, I use Jupyter as an authoring environment for interactive computational teaching material. For example, I structured Network Analysis Made Simple as a series of Jupyter notebooks.
Finally, on occasion, I use Jupyter with ipywidgets and Voila to build out dashboards and interactive applications for my colleagues.
Get Jupyter installed in each of your environments by including it in your environment.yml file. (see: Create one conda environment per project)
Doing so is based on advice I received at SciPy 2016, in which one of the Jupyter developers strongly advised against "global" installations of Jupyter, to avoid package conflicts.
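As a sketch, a per-project environment.yml that bundles JupyterLab (and ipykernel, which we'll need below) might look like the following; the environment name and Python version are placeholders:
name: my_project_env  # placeholder name
channels:
  - conda-forge
dependencies:
  - python=3.10
  - jupyterlab
  - ipykernel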
To get Jupyter to recognize the Python interpreter that is defined by your conda environment (see: Create one conda environment per project), you need to make sure you have ipykernel installed inside your environment. Then, use the following command:
export ENV_NAME="put_your_environment_name_here"
conda activate $ENV_NAME
python -m ipykernel install --user --name $ENV_NAME
Newcomers to Anaconda are usually spoonfed the GUI, but I am a proponent of launching Jupyter from the terminal because doing so makes us fully aware of our environment, including the environment variables. (see the related: Create runtime environment variable configuration files for each of your projects and Take full control of your shell environment variables)
To launch Jupyter:
jupyter lab
In shell terms:
cd /path/to/project/directory
conda activate $ENV_NAME
jupyter lab
Use docker containers for system-level packages
If conda environments are such a great environment isolation tool, why would we need Docker?
That's because sometimes, your project might have an unavoidable dependency on system-level packages. I have seen some projects that use spatial mapping tooling require system-level packages. Others that depend on audio processing might require packages that can only be obtained outside of conda. In these cases, yes, installing them locally on your machine can be handy (see Install homebrew on your machine), but if you're also interested in building an app, then you'll need them packaged up inside a Docker container.
What is a Docker container? The best way to think about it is as a fully-fledged operating system completely insulated from its host (i.e. your computer). It has no knowledge of your runtime environment variables (see: Create runtime environment variable configuration files for each of your projects and Take full control of your shell environment variables). It's like having a completely clean operating system, without the cost of buying new hardware.
I'm assuming you've already obtained Docker on your system. (see: Install Docker on your machine).
The core thing you need to know how to write is a Dockerfile. This file specifies exactly how a Docker container is to be built. The easiest way to think about the Dockerfile syntax is that it's almost bash, with a bit of additional syntax. The Docker docs give an extremely thorough tutorial. For those who are more hands-on, I recommend pair coding with another more experienced individual who is willing to teach you the ropes, to build a Docker container when it becomes relevant to your problem.
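To make the shape of a Dockerfile concrete, here is a hypothetical sketch for a project that needs a system-level audio library; the base image, package, and file names are placeholders for whatever your project actually requires:
# Start from a Debian-based image that ships with conda pre-installed
FROM continuumio/miniconda3:latest

# Install a system-level package that conda/pip cannot provide (hypothetical example)
RUN apt-get update && apt-get install -y libsndfile1 && rm -rf /var/lib/apt/lists/*

# Recreate the project's conda environment inside the container
WORKDIR /app
COPY environment.yml .
RUN conda env create -f environment.yml

# Copy in the rest of the project source
COPY . .
CMD ["bash"]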
Use pyprojroot to define relative paths to the project root
If you follow the practice of One project should get one git repository, then everything related to the project will be housed inside that repository. Under this assumption, if you also develop a custom source code library for your project (see Place custom source code inside a lightweight package for why), then you'll likely encounter the need to find paths to things, such as data files, relative to the project root. Rather than hard-coding paths into your library and Jupyter notebooks, you can instead leverage pyprojroot to define a library of paths that are useful across the project.
Firstly, make sure you have an importable source_package.paths module. (I'm assuming you have written a custom source package!) In there, define project paths:
from pyprojroot import here
root = here(proj_files=[".git"])
notebooks_dir = root / "notebooks"
data_dir = root / "data"
timeseries_data_dir = data_dir / "timeseries"
here() returns a Python pathlib.Path object.
You can go as granular or as coarse-grained as you want.
Then, inside your Jupyter notebooks or Python scripts, you can import those paths as needed.
from source_package.paths import timeseries_data_dir
import pandas as pd
data = pd.read_csv(timeseries_data_dir / "2016-2019.csv")
Now, if for whatever reason you have to move the data files to a different subdirectory (say, to keep things even more organized than you already are, you awesome person!), then you just have to update one location in source_package.paths, and you're able to reference the data file in all of your scripts!
See also: Define single sources of truth for your data sources.
Set up an awesome default gitignore for your projects
There will be some files you'll never want to commit to Git. Some include:
If you commit them, then:
Some believe that your .gitignore should be curated. I believe that you should use a good default one that is widely applicable. To do so, go to gitignore.io, fill in the languages and operating systems involved in your project, and copy/paste the one that fits you. If you want an awesome default one for Python:
cd /path/to/project/root
curl https://www.toptal.com/developers/gitignore/api/python -o .gitignore
It will have .env available in there too! (see: Create runtime environment variable configuration files for each of your projects)
How is the .gitignore file parsed?
A .gitignore file is parsed according to the rules on its documentation page. It essentially follows the unix glob syntax while adding on logical modifiers. Here are a few examples to get you oriented:
.DS_Store files
These are files generated by macOS' Finder. You can ignore them by appending the following line to your .gitignore:
*.DS_Store
The site/ directory
If you use MkDocs to build documentation, it will place the output into the directory site/. You will want to ignore the entire directory by appending the following line:
site/
.ipynb_checkpoints directories
If you have Jupyter notebooks inside your repository, you can ignore any path containing .ipynb_checkpoints.
.ipynb_checkpoints
Adding this line will prevent your Jupyter notebook checkpoints from being committed into your Git repository.
One project should get one git repository
This helps a ton with organization. When you have one project targeted to one Git repository, you can easily house everything related to that project in that one Git repository. I mean everything. This includes:
In doing so, you have one mental location that you can point to for everything related to a project. This is a saner way of operating than over-engineering the separation of concerns at the beginning, with docs in one place and out-of-sync with the source code in another place... you get where we're going with this point.
Easy! Create your Git repo for the project, and then start putting stuff in there :).
Enough said here!
What should you name the Git repo? See the page: Sanely name things consistently
After you have set up your Git repo, make sure to Set up your project with a sane directory structure.
Also, Set up an awesome default gitignore for your projects!
Adhere to best git practices
Git is a unique piece of software. It does one and only one thing well: store versions of hand-curated files. Adhering to Git best practices will ensure that you use Git in its intended fashion.
The most significant point to keep in mind: only commit files to Git that you have had to create manually. That usually means version controlling:
There are also things you should actively avoid committing.
For specific files, you can set up a .gitignore file. See the page Set up an awesome default gitignore for your projects for more information on preventing yourself from committing them automatically.
For Jupyter notebooks, it is considered good practice to avoid committing notebooks that still have outputs. It is best to clear them out using nbstripout. That can be automated before committing them through the use of pre-commit hooks. (See: Set up pre-commit hooks to automate checks before making Git commits)
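As a sketch, the relevant entry in a .pre-commit-config.yaml might look like this; pin rev to whichever nbstripout release you actually use:
repos:
  - repo: https://github.com/kynan/nbstripout
    rev: 0.6.1  # placeholder: pin to the release you use
    hooks:
      - id: nbstripout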