A Data Scientist's Guide to Environment Variables
You might have encountered a piece of software asking you for permission to modify your
or another program's installation instructions cryptically telling you
that you have to "set your
LD_LIBRARY_PATH variable correctly".
As a data scientist, you might encounter other environment variable issues when interacting with your compute stack (particularly if you don't have full control over it, like I do). This post is meant to demystify what an environment variable is, and how it gets used in a data science context.
What Is An Environment Variable?
First off, let me explain what an environment variable is,
by going in-depth into the
PATH environment variable.
I'd encourage you to execute the commands here inside your bash terminal
(with appropriate modifications -- read the text to figure out what I'm doing!).
When you log into your computer system, say,
your local computer’s terminal or your remote server via SSH,
your bash interpreter needs to know where to look for particular programs, such as
nano (the text editor),
git (your version control software),
or your Python executable. This is controlled by your PATH variable.
It specifies the paths to folders where your executable programs are found.
By historical convention, command line programs,
are found in the directory
By historical convention, the
/bin folder is for software binaries,
which is why they are named
These are the ones that are bundled with your operating system,
and as such, need special permissions to upgrade.
Try it out in your terminal:
    $ which which
    /usr/bin/which
    $ which top
    /usr/bin/top
Other programs are installed (for whatever reason) into
ls is one example:
    $ which ls
    /bin/ls
Yet other programs might be installed in other special directories:
    $ which nano
    /usr/local/bin/nano
How does your Bash terminal figure out where to go to look for stuff?
It uses the
PATH environment variable.
It looks something like this:
    $ echo $PATH
    /usr/bin:/bin:/usr/local/bin
The most important thing to remember about the
PATH variable is that it is "colon-delimited".
That is, each directory path is separated from the next by a colon (:).
The order in which your bash terminal is looking for programs goes from left to right:
On my particular computer, when I type in
my bash interpreter will look inside the
/usr/bin directory first.
It'll find that
ls doesn't exist in
and so it'll move to the next directory,
ls exists under
it'll execute the
ls program from there.
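If it helps to see that left-to-right search spelled out, here is a minimal Python sketch of it. The function name `find_on_path` is my own invention for illustration, and this is only an approximation of what bash does (bash also handles built-ins, aliases, and a lookup cache, none of which are modelled here). The standard library's `shutil.which` does a similar lookup and is printed alongside for comparison.

```python
import os
import shutil


def find_on_path(program, path=None):
    """Return the first executable named `program` on a colon-delimited path, or None.

    `find_on_path` is a hypothetical helper written for this post; it is a
    simplified model of the lookup, not bash's actual implementation.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    # Directories are checked in order, so earlier entries win.
    for directory in path.split(os.pathsep):
        if not directory:
            # A real shell treats an empty entry as the current directory;
            # this sketch simply skips it.
            continue
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


print(find_on_path("ls"))
print(shutil.which("ls"))
```

Swapping the order of two directories in the `path` argument changes which copy of a program is returned, which is exactly the priority behaviour described above.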
You can see, then, that this is simultaneously super flexible for customizing your compute environment,
yet also potentially super frustrating if a program modified your
PATH variable without you knowing.
Wait, you can actually modify your
PATH variable? Yep, and there are a few ways to do this.
How To Modify the
Using a Bash Session
The first way is transient, or temporary, and only applies to your current bash session.
You can make a folder have higher priority than the existing paths by "pre-pending" it to the
    $ export PATH=/path/to/my/folder:$PATH
    $ echo $PATH
    /path/to/my/folder:/usr/bin:/bin:/usr/local/bin
Or I can make it have a lower priority than existing paths by "appending" it to the
    $ export PATH=$PATH:/path/to/my/folder
    $ echo $PATH
    /usr/bin:/bin:/usr/local/bin:/path/to/my/folder
This is temporary because the export only applies to my current bash session.
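The same "only for this session" behaviour shows up if you modify PATH from inside a Python process. The sketch below assumes a made-up directory, `/path/to/my/folder`, and prepends it to `os.environ["PATH"]`. The change is visible to the current process and to any child process it starts, but it disappears once the process exits and never touches the shell you launched it from.

```python
import os
import subprocess
import sys

extra_dir = "/path/to/my/folder"  # hypothetical directory, for illustration only

# Prepend the directory so it has the highest priority.
original_path = os.environ.get("PATH", "")
os.environ["PATH"] = extra_dir + os.pathsep + original_path

# A child process inherits the modified environment.
child_path = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['PATH'])"],
    capture_output=True,
    text=True,
    check=True,
).stdout.strip()

print(child_path.split(os.pathsep)[0])
```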
If I wanted to make my changes somewhat more permanent,
then I would include inside my
(I recommend using the
.bash_profile file lives inside your home directory
$HOME environment variable specifies this),
and is a file that your bash interpreter will execute when it first loads.
It will execute all of the commands inside there.
This means, you can change your PATH variable by simply putting inside your
    ...other stuff above...
    # Make /path/to/folder have higher priority
    export PATH=/path/to/folder:$PATH
    # Make /path/to/other/folder have lower priority
    export PATH=$PATH:/path/to/other/folder
    ...other stuff below...
Data Science and the
PATH environment variable
Now, how is this relevant to data scientists?
Well, if you're a data scientist, chances are that you use Python,
and that your Python interpreter comes from the Anaconda Python distribution
(a seriously awesome thing, go get it!).
What the Anaconda Python installer does is prioritize the
/path/to/anaconda/bin folder in the
PATH environment variable.
You might have other Python interpreters installed on your system
(for example, Apple ships its own).
PATH modification ensures that
each time you type
python into your Bash terminal,
you execute the Python interpreter shipped with the Anaconda Python distribution.
In my case, after installing the Anaconda Python distribution, my
PATH looks like:
    $ echo $PATH
    /Users/ericmjl/anaconda/bin:/usr/bin:/bin:/usr/local/bin
Even better, what conda environments do is
prepend the path to the conda environment binaries folder
while the environment is activated.
For example, with my blog, I keep it in an environment named
    $ echo $PATH
    /Users/ericmjl/anaconda/bin:/usr/bin:/bin:/usr/local/bin
    $ which python
    /Users/ericmjl/anaconda/bin/python
    $ source activate lektor
    $ echo $PATH
    /Users/ericmjl/anaconda/envs/lektor/bin:/Users/ericmjl/anaconda/bin:/usr/bin:/bin:/usr/local/bin
    $ which python
    /Users/ericmjl/anaconda/envs/lektor/bin/python
Notice how the bash terminal now preferentially picks the Python inside the higher-priority
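You can run a similar check from inside Python itself, which is handy in a notebook where it isn't obvious which interpreter you ended up with. This is a small sketch using only the standard library; `describe_python` is a name I made up for illustration.

```python
import shutil
import sys


def describe_python():
    """Collect a few facts about the interpreter that is running this code."""
    return {
        # Full path of the interpreter executing this script.
        "running_interpreter": sys.executable,
        # Root of the installation or environment the interpreter belongs to.
        "prefix": sys.prefix,
        # What typing `python` would resolve to on the current PATH, if anything.
        "first_python_on_path": shutil.which("python"),
    }


for key, value in describe_python().items():
    print(f"{key}: {value}")
```

If `running_interpreter` and `first_python_on_path` disagree, then the Python you are running is not the one your PATH would pick, which is often the first clue when packages seem to be "missing".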
If you've gotten to this point, then you'll hopefully realize there are a few important concepts listed here. Let's recap them:
PATH is an environment variable stored as a plain text string used by the bash interpreter to figure out where to find executable programs.
PATH is colon-delimited; higher priority directories are to the left of the string, while lower priority directories are to the right of the string.
PATH can be modified by prepending or appending directories to the environment variable. It can be done transiently inside a bash session by running the
export command at the command prompt, or it can be done permanently across bash sessions by adding an
export line inside your
Other Environment Variables of Interest
Now, what other environment variables might a data scientist encounter? These are a sampling of them that you might see, and might have to fix, especially in contexts where your system administrators are off on vacation (or taking too long to respond).
For general use, you'll definitely want to know where your
HOME folder is -- on Linux systems, it's often
/home/username, while on macOS systems, it's often
/Users/username. You can figure out what
HOME is by doing:
    $ echo $HOME
    /Users/ericmjl
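From Python, the same information is available without shelling out. Here is a minimal sketch using only the standard library; on Linux and macOS, both `Path.home()` and `os.path.expanduser("~")` consult HOME when it is set.

```python
import os
from pathlib import Path

# The raw environment variable (None if HOME is not set).
home_from_env = os.environ.get("HOME")

# Two standard-library ways to resolve the home directory.
home_from_pathlib = Path.home()
home_from_expanduser = Path(os.path.expanduser("~"))

print(home_from_env)
print(home_from_pathlib)
print(home_from_expanduser)
```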
If you're a Python user,
PYTHONPATH is one variable that might be useful.
It is used by the Python interpreter,
and specifies where to find Python modules/packages.
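To see PYTHONPATH in action, the sketch below writes a throwaway module into a temporary directory, then starts a fresh Python interpreter with PYTHONPATH pointing at that directory. The module name `my_helpers` is made up for the demonstration. The child interpreter can import the module only because the directory was added to its search path through the environment variable.

```python
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as module_dir:
    # Hypothetical module, created only for this demonstration.
    with open(os.path.join(module_dir, "my_helpers.py"), "w") as handle:
        handle.write("GREETING = 'hello from PYTHONPATH'\n")

    # Copy the current environment and point PYTHONPATH at the temp directory.
    env = dict(os.environ, PYTHONPATH=module_dir)

    # Start a new interpreter; it picks up PYTHONPATH at startup.
    result = subprocess.run(
        [sys.executable, "-c", "import my_helpers; print(my_helpers.GREETING)"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    greeting = result.stdout.strip()

print(greeting)
```

Like PATH, PYTHONPATH can hold several directories separated by colons.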
If you have to deal with C++ libraries,
then knowing your
LD_LIBRARY_PATH environment variable is going to be very important.
I'm not well-versed enough in this to expound on it intelligently,
so I would defer to this website
for more information on best practices for using the
If you're working with Spark,
PYSPARK_PYTHON environment variable would be of interest.
This essentially tells Spark which Python to use for both its driver and its workers;
you can also set the
to be separate from the
PYSPARK_PYTHON environment variable, if needed.
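One pattern I have seen is to set PYSPARK_PYTHON from inside the driver script itself, before any Spark session is created, so that the workers use the same interpreter as the driver. Treat the sketch below as an assumption-laden illustration rather than a recommendation: it only sets the variable and does not import or start Spark, and it assumes the same interpreter path also exists on your worker machines, which may not be true on a cluster.

```python
import os
import sys

# Ask Spark to launch Python workers with the interpreter running this script.
# This must happen before a Spark session or context is created to take effect.
os.environ["PYSPARK_PYTHON"] = sys.executable

print(os.environ["PYSPARK_PYTHON"])
```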
Data science apps
If you're developing data science apps,
then according to the 12 factor app development principles,
your credentials to databases and other sensitive information
should be stored securely and loaded dynamically into the environment at runtime.
How then do you mimic this in a "local" environment (i.e. your computer)
without hard-coding sensitive information in your source
One way to handle this situation is as follows:
Firstly, create a
.env file in your home directory.
In there, store your credentials:
Next, add it to your
.gitignore, so you never add it to your version control system.
    # other things
    .env
Finally, in your source
.py files, use
python-dotenv to load the environment variables at runtime.
    import os

    from dotenv import load_dotenv

    # Load the variables defined in the .env file into the process environment.
    load_dotenv()

    username = os.getenv("SOME_USERNAME")
    password = os.getenv("SOME_PASSWORD")
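One thing to watch out for: `os.getenv` quietly returns `None` when a variable is missing, so a typo in the `.env` file might only surface later as a confusing login failure. A small guard can make that failure loud and early. The helper below, `require_env`, is a name I made up for illustration and is not part of python-dotenv; it only uses the standard library.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Stand-in value so this sketch runs on its own; in real use the value
# would come from your .env file or the deployment environment.
os.environ.setdefault("SOME_USERNAME", "demo-user")

username = require_env("SOME_USERNAME")
print(username)
```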
Hack Your Environment Variables
This is where the most fun happens! Follow along for some stuff you might be able to do by hacking your environment variables.
Hack #1: Enable access to PyPy.
I occasionally keep up with the development of PyPy,
but because PyPy is not yet the default Python interpreter,
and is not yet
I have to put it in its own
To enable access to the PyPy interpreter,
I have to make sure that my
/path/to/pypy is present
PATH environment variable,
but at a lower priority than my regular CPython interpreter.
Hack #2: Enable access to other language interpreters/compilers.
This is analogous to PyPy.
I once was trying out Lua's JIT interpreter to use Torch for deep learning,
and needed to add a path to there in my
Hack #3: Install Python packages to your home directory.
On shared Linux compute systems that use the
modulefile that you load might be configured
with a virtual environment that you don't have permissions to modify.
If you need to install a Python package,
you might want to
pip install --user my_pkg_name.
This will install it to
Ensuring that your
at a high enough priority is going to be important in this case.
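If you're not sure where `--user` installs end up on a given machine, Python can tell you. The sketch below uses the standard library's `site` module to print the per-user base directory and site-packages directory, then checks whether the matching `bin` folder is on your PATH. The `bin` subfolder name is an assumption that holds for typical Linux and macOS layouts.

```python
import os
import site

# Base directory for per-user installs, and where user packages are placed.
user_base = site.getuserbase()
user_site = site.getusersitepackages()

# Command line scripts from user installs usually land here on Linux and macOS
# (an assumption of this sketch; other platforms use a different layout).
user_bin = os.path.join(user_base, "bin")

path_entries = os.environ.get("PATH", "").split(os.pathsep)
user_bin_on_path = user_bin in path_entries

print("User base:", user_base)
print("User site-packages:", user_site)
print("User bin on PATH:", user_bin_on_path)
```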
Hack #4: Debugging when things go wrong.
If something throws an error, or you see unexpected behaviour (something I encountered before was my Python interpreter not being found correctly after loading all of my Linux modules), one way to debug is to temporarily reset your PATH environment variable to some sensible "defaults" by sourcing a small file. With PATH reset, you can manually prepend or append directories while debugging.
To do this, place the following line inside a file named
inside your home directory:
    export PATH=/usr/bin:/bin:/usr/local/bin  # resets PATH to a sensible default; customize as needed.
After something goes wrong, you can reset your PATH environment variable by using the "source" command:
    $ echo $PATH
    /some/complicated/path:/more/complicated/paths:/really/complicated/paths
    $ source ~/.path_default
    $ echo $PATH
    /usr/bin:/bin:/usr/local/bin
Note - you can also execute the exact same commands inside your bash session; the interactivity may also be helpful.
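Before resetting anything, it can also help to look at what is actually wrong with the current PATH. Here is a rough sketch of a checker that flags empty entries, duplicates, and directories that don't exist. The function name `audit_path` and the problem labels are my own; adapt them to whatever you find useful.

```python
import os


def audit_path(path=None):
    """Return (entry, problem) pairs for suspicious entries in a colon-delimited path."""
    if path is None:
        path = os.environ.get("PATH", "")
    problems = []
    seen = set()
    for entry in path.split(os.pathsep):
        if entry == "":
            # Shells treat an empty entry as the current directory.
            problems.append((entry, "empty"))
        elif entry in seen:
            problems.append((entry, "duplicate"))
        elif not os.path.isdir(entry):
            problems.append((entry, "missing"))
        seen.add(entry)
    return problems


for entry, problem in audit_path():
    print(f"{entry!r}: {problem}")
```

An empty report doesn't guarantee the ordering is what you want, but it does rule out the most common kinds of clutter.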
I hope you enjoyed this article, and that it'll give you a, ahem, path forward whenever you encounter these environment variables!