A Data Scientist's Guide to Environment Variables
You might have encountered a piece of software asking you for permission to modify your PATH variable,
or another program's installation instructions cryptically telling you
that you have to "set your LD_LIBRARY_PATH variable correctly".
As a data scientist, you might encounter other environment variable issues when interacting with your compute stack (particularly if you don't have full control over it, like I do). This post is meant to demystify what an environment variable is, and how it gets used in a data science context.
What Is An Environment Variable?
First off, let me explain what an environment variable is,
by going in-depth into the PATH environment variable.
I'd encourage you to execute the commands here inside your bash terminal
(with appropriate modifications -- read the text to figure out what I'm doing!).
When you log into your computer system, say,
your local computer’s terminal or your remote server via SSH,
your bash interpreter needs to know where to look for particular programs,
such as nano (the text editor), or git (your version control software),
or your Python executable. This is controlled by your PATH variable.
It specifies the paths to folders where your executable programs are found.
By historical convention, command line programs,
such as which and top,
are found in the directory /usr/bin.
By historical convention, the /bin folder is for software binaries,
which is why it is named /bin.
These are the ones that are bundled with your operating system,
and as such, need special permissions to upgrade.
Try it out in your terminal:
$ which which
/usr/bin/which
$ which top
/usr/bin/top
Other programs are installed (for whatever reason) into /bin instead.
ls is one example:
$ which ls
/bin/ls
Yet other programs might be installed in other special directories:
$ which nano
/usr/local/bin/nano
How does your Bash terminal figure out where to go to look for stuff?
It uses the PATH environment variable.
It looks something like this:
$ echo $PATH
/usr/bin:/bin:/usr/local/bin
The most important thing to remember about the PATH variable is that it is "colon-delimited".
That is, each directory path is separated from the next by a "colon" (:) character.
The order in which your bash terminal is looking for programs goes from left to right:
/usr/bin
/bin
/usr/local/bin
On my particular computer, when I type in ls,
my bash interpreter will look inside the /usr/bin directory first.
It'll find that ls doesn't exist in /usr/bin,
and so it'll move to the next directory, /bin.
Since my ls exists under /bin,
it'll execute the ls program from there.
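The left-to-right search described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how bash is actually implemented; the function names, the program name `mytool`, and the temporary directories are all made up for the example.

```python
import os
import stat
import tempfile


def find_program(name, path_string):
    """Return the first executable called `name` in a colon-delimited PATH string."""
    for directory in path_string.split(":"):
        candidate = os.path.join(directory, name)
        # Left-to-right search: the first executable match wins.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def make_fake_program(directory, name):
    """Create a tiny executable file so the search has something to find."""
    full_path = os.path.join(directory, name)
    with open(full_path, "w") as f:
        f.write("#!/bin/sh\n")
    os.chmod(full_path, os.stat(full_path).st_mode | stat.S_IXUSR)
    return full_path


with tempfile.TemporaryDirectory() as first, tempfile.TemporaryDirectory() as second:
    # The hypothetical program `mytool` exists only in the second directory,
    # just like ls existing in /bin but not in /usr/bin.
    expected = make_fake_program(second, "mytool")
    fake_path = first + ":" + second
    found = find_program("mytool", fake_path)
    print(found == expected)  # True: the search skipped `first` and matched in `second`
```

Python's standard library offers `shutil.which` for real use; the hand-written loop is here only to make the search order visible.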
You can see, then, that this is simultaneously super flexible for customizing your compute environment,
yet also potentially super frustrating if a program modified your PATH variable without you knowing.
Wait, you can actually modify your PATH variable? Yep, and there are a few ways to do this.
How To Modify the PATH variable
Using a Bash Session
The first way is transient, or temporary, and only applies to your particular bash session.
You can make a folder have higher priority than the existing paths by "pre-pending" it to the PATH variable:
$ export PATH=/path/to/my/folder:$PATH
$ echo $PATH
/path/to/my/folder:/usr/bin:/bin:/usr/local/bin
Or I can make it have a lower priority than existing paths by "appending" it to the PATH variable:
$ export PATH=$PATH:/path/to/my/folder
$ echo $PATH
/usr/bin:/bin:/usr/local/bin:/path/to/my/folder
This is temporary because the export only applies to my current bash session.
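The same prepend/append logic, and the fact that a change only lives in the process where it was made (and in anything that process launches), can be mirrored in Python. This is a rough sketch; the helper names `prepend_to_path` and `append_to_path` are invented for this example.

```python
import os
import subprocess
import sys


def prepend_to_path(path_string, directory):
    """Give `directory` higher priority than everything already in the PATH string."""
    return directory + ":" + path_string


def append_to_path(path_string, directory):
    """Give `directory` lower priority than everything already in the PATH string."""
    return path_string + ":" + directory


base = "/usr/bin:/bin:/usr/local/bin"
print(prepend_to_path(base, "/path/to/my/folder"))
# /path/to/my/folder:/usr/bin:/bin:/usr/local/bin
print(append_to_path(base, "/path/to/my/folder"))
# /usr/bin:/bin:/usr/local/bin:/path/to/my/folder

# Hand a modified PATH to a child process only; this process keeps its own PATH.
child_env = dict(
    os.environ,
    PATH=prepend_to_path(os.environ.get("PATH", ""), "/path/to/my/folder"),
)
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['PATH'])"],
    env=child_env,
    capture_output=True,
    text=True,
    check=True,
)
child_path = result.stdout.strip()
print(child_path.startswith("/path/to/my/folder:"))  # True: the child sees the new entry
```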
.bashrc or .bash_profile File
If I wanted to make my changes somewhat more permanent,
then I would include the export line inside my .bashrc or .bash_profile file.
(I recommend using the .bashrc file.)
The .bashrc/.bash_profile file lives inside your home directory
(your $HOME environment variable specifies this),
and is a file that your bash interpreter will execute when it first loads.
It will execute all of the commands inside there.
This means you can change your PATH variable by simply putting the following inside your .bashrc:
...other stuff above...
# Make /path/to/folder have higher priority
export PATH=/path/to/folder:$PATH
# Make /path/to/other/folder have lower priority
export PATH=$PATH:/path/to/other/folder
...other stuff below...
Data Science and the PATH environment variable
Now, how is this relevant to data scientists?
Well, if you're a data scientist, chances are that you use Python,
and that your Python interpreter comes from the Anaconda Python distribution
(a seriously awesome thing, go get it!).
What the Anaconda Python installer does is prioritize
the /path/to/anaconda/bin folder in the PATH environment variable.
You might have other Python interpreters installed on your system
(for example, Apple ships its own).
However, this PATH modification ensures that
each time you type python into your Bash terminal,
you execute the Python interpreter shipped with the Anaconda Python distribution.
In my case, after installing the Anaconda Python distribution, my PATH looks like:
$ echo $PATH
/Users/ericmjl/anaconda/bin:/usr/bin:/bin:/usr/local/bin
Even better, what conda environments do is
prepend the path to the conda environment binaries folder
while the environment is activated.
For example, with my blog, I keep it in an environment named lektor.
Thus...
$ echo $PATH
/Users/ericmjl/anaconda/bin:/usr/bin:/bin:/usr/local/bin
$ which python
/Users/ericmjl/anaconda/bin/python
$ source activate lektor
$ echo $PATH
/Users/ericmjl/anaconda/envs/lektor/bin:/Users/ericmjl/anaconda/bin:/usr/bin:/bin:/usr/local/bin
$ which python
/Users/ericmjl/anaconda/envs/lektor/bin/python
Notice how the bash terminal now preferentially picks the Python inside the higher-priority lektor environment.
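If you are ever unsure which interpreter won the PATH lookup, you can also ask from inside Python. The snippet below is a small sketch; `describe_interpreter` is a name I made up, and it relies on the `CONDA_DEFAULT_ENV` and `CONDA_PREFIX` variables that conda sets on activation (they will simply be `None` if no conda environment is active).

```python
import os
import sys


def describe_interpreter():
    """Collect a few facts about the Python interpreter that is currently running."""
    return {
        # Full path of the interpreter binary that was actually launched.
        "executable": sys.executable,
        # Root folder of the installation or environment it belongs to.
        "prefix": sys.prefix,
        # Set by conda when an environment is activated; None otherwise.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        "conda_prefix": os.environ.get("CONDA_PREFIX"),
    }


for key, value in describe_interpreter().items():
    print(f"{key}: {value}")
```

With the lektor environment active, I would expect `executable` to point inside the `envs/lektor/bin` folder shown above.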
If you've gotten to this point, then you'll hopefully realize there are a few important concepts listed here. Let's recap them:
PATH is an environment variable stored as a plain text string used by the bash interpreter to figure out where to find executable programs.
PATH is colon-delimited; higher priority directories are to the left of the string, while lower priority directories are to the right of the string.
PATH can be modified by prepending or appending directories to the environment variable. It can be done transiently inside a bash session by running the export command at the command prompt, or it can be done permanently across bash sessions by adding an export line inside your .bashrc or .bash_profile.
Other Environment Variables of Interest
Now, what other environment variables might a data scientist encounter? These are a sampling of them that you might see, and might have to fix, especially in contexts where your system administrators are off on vacation (or taking too long to respond).
General Use
For general use, you'll definitely want to know where your HOME folder is -- on Linux systems, it's often /home/username, while on macOS systems, it's often /Users/username. You can figure out what HOME is by doing:
$ echo $HOME
/Users/ericmjl
Python
If you're a Python user,
then PYTHONPATH is one variable that might be useful.
It is used by the Python interpreter,
and specifies where to find Python modules/packages.
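To see PYTHONPATH in action without touching your own setup, here is a small sketch that launches a child Python process with PYTHONPATH pointing at a temporary folder. The module name `my_helper_module` and its contents are made up purely for this demonstration.

```python
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as module_dir:
    # Write a throwaway module into a folder Python would not normally search.
    with open(os.path.join(module_dir, "my_helper_module.py"), "w") as f:
        f.write("GREETING = 'hello from PYTHONPATH'\n")

    # Only the child process gets the modified PYTHONPATH.
    child_env = dict(os.environ, PYTHONPATH=module_dir)
    result = subprocess.run(
        [sys.executable, "-c", "import my_helper_module; print(my_helper_module.GREETING)"],
        env=child_env,
        capture_output=True,
        text=True,
        check=True,
    )
    greeting = result.stdout.strip()

print(greeting)  # hello from PYTHONPATH
```

Inside a running interpreter, the directories from PYTHONPATH show up in `sys.path`, which is the list Python walks through when importing.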
C++ libraries
If you have to deal with C++ libraries,
then knowing your LD_LIBRARY_PATH environment variable is going to be very important.
I'm not well-versed enough in this to expound on it intelligently,
so I would defer to this website
for more information on best practices for using the LD_LIBRARY_PATH variable.
Spark
If you're working with Spark,
then the PYSPARK_PYTHON environment variable would be of interest.
This essentially tells Spark which Python to use for both its driver and its workers;
you can also set the PYSPARK_DRIVER_PYTHON to be separate from the PYSPARK_PYTHON environment variable, if needed.
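One pattern I'd hedge as "works for simple, local setups" is to set PYSPARK_PYTHON from inside your script, before any Spark session is created. The sketch below assumes that setup and only touches the environment; it does not import pyspark, and the function name `use_current_interpreter_for_spark` is made up for this example.

```python
import os
import sys


def use_current_interpreter_for_spark():
    """Point PYSPARK_PYTHON at the interpreter running this script.

    Assumption: this is called before any Spark session or context is created,
    since the variable is read when Spark starts its Python processes.
    """
    os.environ["PYSPARK_PYTHON"] = sys.executable
    return os.environ["PYSPARK_PYTHON"]


chosen_python = use_current_interpreter_for_spark()
print(chosen_python)
# ...create your Spark session after this point...
```

On a shared cluster, exporting the variable in your shell or job configuration is likely the more appropriate route; check with whoever maintains the cluster.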
Data science apps
If you're developing data science apps,
then according to the 12 factor app development principles,
your credentials to databases and other sensitive information
are securely stored and dynamically loaded into the environment at runtime.
How then do you mimic this in a "local" environment (i.e. your computer)
without hard-coding sensitive information in your source .py files?
One way to handle this situation is as follows:
Firstly, create a .env file in your home directory.
In there, store your credentials:
SOME_PASSWORD="put_your_pw_here"
SOME_USERNAME="put_your_username_here"
Next, add it to your .gitignore, so you never add it to your version control system.
# other things
.env
Finally, in your source .py files, use python-dotenv to load the environment variables at runtime.
from dotenv import load_dotenv
load_dotenv()
import os
username = os.getenv("SOME_USERNAME")
password = os.getenv("SOME_PASSWORD")
Hack Your Environment Variables
This is where the most fun happens! Follow along for some stuff you might be able to do by hacking your environment variables.
Hack #1: Enable access to PyPy.
I occasionally keep up with the development of PyPy,
but because PyPy is not yet the default Python interpreter,
and is not yet conda install-able,
I have to put it in its own $HOME/pypy/bin directory.
To enable access to the PyPy interpreter,
I have to make sure that my /path/to/pypy is present
in the PATH environment variable,
but at a lower priority than my regular CPython interpreter.
Hack #2: Enable access to other language interpreters/compilers.
This is analogous to PyPy.
I once was trying out Lua's JIT interpreter to use Torch for deep learning,
and needed to add a path to there in my .bashrc.
Hack #3: Install Python packages to your home directory.
On shared Linux compute systems that use the modules system
rather than conda environments,
a modulefile that you load might be configured
with a virtual environment that you don't have permissions to modify.
If you need to install a Python package,
you might want to pip install --user my_pkg_name.
This will install it to $HOME/.local/lib/python[version]/site-packages/.
Ensuring that your PYTHONPATH includes $HOME/.local/lib/python[version]/site-packages
at a high enough priority is going to be important in this case.
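Rather than guessing the exact folder name, you can ask Python where its per-user site-packages directory is. This short sketch uses the standard library's site module; the exact path it prints will depend on your operating system and Python version.

```python
import site
import sys

# The per-user install location that `pip install --user` targets.
user_site = site.getusersitepackages()
print(user_site)

# Whether this interpreter is currently searching that folder for imports.
print(user_site in sys.path)

# Whether the user site directory is enabled for this interpreter.
print(site.ENABLE_USER_SITE)
```

If the folder is not in sys.path, that is a hint that your PYTHONPATH (or the environment you loaded) needs adjusting.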
Hack #4: Debugging when things go wrong.
In case something throws an error, or you have unexpected behaviour (something I encountered before was my Python interpreter not being found correctly after loading all of my Linux modules), a way to debug is to keep some sensible PATH "defaults" in a file and source that file, effectively "resetting" your PATH variable, so that you can manually prepend/append while debugging.
To do this, place the following lines inside a file named .path_default,
inside your home directory:
# Resets PATH to a sensible default, discarding whatever was there before; customize as needed.
export PATH=/usr/bin:/bin:/usr/local/bin
After something goes wrong, you can reset your PATH environment variable by using the "source" command:
$ echo $PATH
/some/complicated/path:/more/complicated/paths:/really/complicated/paths
$ source ~/.path_default
$ echo $PATH
/usr/bin:/bin:/usr/local/bin
Note - you can also execute the exact same commands inside your bash session; the interactivity may also be helpful.
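Before resetting, it can help to look at what is actually in your PATH. The following is a rough diagnostic sketch (the function name `summarize_path` is made up) that lists each entry in priority order and flags folders that don't exist or appear more than once.

```python
import os


def summarize_path(path_string):
    """Describe each entry of a colon-delimited PATH string, in priority order."""
    seen = set()
    report = []
    for position, directory in enumerate(path_string.split(":")):
        report.append(
            {
                "position": position,
                "directory": directory,
                "exists": os.path.isdir(directory),
                # True if an identical entry appeared earlier (higher priority).
                "duplicate": directory in seen,
            }
        )
        seen.add(directory)
    return report


for entry in summarize_path(os.environ.get("PATH", "")):
    notes = []
    if not entry["exists"]:
        notes.append("missing")
    if entry["duplicate"]:
        notes.append("duplicate")
    suffix = f"  <- {', '.join(notes)}" if notes else ""
    print(f"{entry['position']}: {entry['directory']}{suffix}")
```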
Conclusion
I hope you enjoyed this article, and that it'll give you a, ahem, path forward whenever you encounter these environment variables!