Configure your machine

After getting access to your development machine, you'll want to configure it and take full control over how it works. Backing the following steps are a core set of ideas:

  • Give yourself a rich set of commonly needed tooling right from the beginning, without unnecessary bloat.
  • Standardize your compute environment for maximal portability from computer to computer.
  • Build up automation to get you up and running as fast as possible.
  • Have full control over your system, such that you know as much about your configuration as possible.
  • Squeeze as much productivity out of your UI as possible.

Head over to the following pages to see how you can get things going.

Initial setup

Getting Anaconda Python installed

Master the shell

Further configuration

Advanced stuff

Bootstrap a scratch conda environment

A scratch environment is your playground

In a pinch, you might want to muck around on your system with a quick-and-dirty experiment. Having a suite of packages inside a scratch environment can be handy. Your scratch environment can be your base environment if you'd like, but I would strongly recommend creating a separate scratch environment instead.

How to bootstrap a scratch environment

I would recommend that you bootstrap a scratch conda environment with some basic data science packages.

mamba activate base
mamba install -c conda-forge \
    scipy numpy pandas matplotlib \
    jupyter jupyterlab \
    scikit-learn ipython ipykernel \
    ipywidgets mamba

(Replace mamba with conda if you don't have mamba installed on your system.)

Doing so gives you an environment where you can quickly prototype new things without necessarily going through the overhead of creating an entirely new project (and with it, a full conda environment).
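If you follow the advice above and keep your base environment pristine, a minimal sketch of bootstrapping a dedicated scratch environment instead looks like this (the name scratch is arbitrary):

mamba create -n scratch -c conda-forge \
    scipy numpy pandas matplotlib \
    jupyter jupyterlab scikit-learn \
    ipython ipykernel ipywidgets
mamba activate scratch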

Installing mamba is worthwhile if you want faster package installation. (see: Use Mamba as a faster drop-in replacement for conda for more information.)

Take full control of your shell environment variables

Why control your environment variables

If you're not sure what environment variables are, I have an essay on them that you can reference. Mastering environment variables is crucial for data scientists!

Your shell environment, whether it is zsh or bash or fish or something else, is supremely important. It determines the runtime environment, which in turn determines which Python you're using, whether you have proxies set correctly, and more. Rather than leave this to chance, I would recommend instead gaining full control over your environment variables.

How do I control my environment variables

The simplest way is to set them explicitly in your shell initialization script. For bash shells, it's either .bashrc or .bash_profile. For the Z shell, it'll be the .zshrc file. In there, step by step, set the environment variables that you need system-wide.

For example, explicitly set your PATH environment variable with explainers that tell future you why you ordered the PATH in a certain way.

# Start with an explicit minimal PATH
export PATH=/bin:/usr/bin:/usr/local/bin

# Add in my custom binaries that I want available across projects
export PATH=$HOME/bin:$PATH

# Add in anaconda installation path
export PATH=$HOME/anaconda/bin:$PATH

# Add more stuff below...

If you want your shell initialization script to be cleaner, you can refactor it out into a second bash script called env_vars.sh, which lives either inside your home directory or your dotfiles repository (see: Leverage dotfiles to get your machine configured quickly). Then, source the env_vars.sh script from the shell initialization script:

source ~/env_vars.sh
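A slightly more defensive version sources the file only if it exists, which helps on machines that you haven't fully bootstrapped yet:

if [ -f ~/env_vars.sh ]; then
    source ~/env_vars.sh
fi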

There is a chance that other tools, like the Anaconda installer, will offer to modify your shell initialization script for you. If you accept, keep track of what they add, so that you still know everything your script does. At the end of your shell initialization script, you can echo the final state of your environment variables to help you debug.
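For example, a single echo at the bottom of the script makes PATH-ordering mistakes easy to spot:

# at the very end of your shell initialization script
echo "PATH: $PATH"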

Environment variables that need to be set on a per-project basis are handled slightly differently. See Create runtime environment variable configuration files for each of your projects.

Automate the bootstrapping of your new computer

Why automate your configuration

In automating your shell's configuration, you save yourself time each time you get access to a new computer. That is the primary value proposition of automation! No more spending 2-3 hours setting things up. Instead, simply type ./install.sh at the terminal!

How to create a bootstrap script

The way I would recommend doing this is by creating a dotfiles repository. (see: Leverage dotfiles to get your machine configured quickly) Place every file needed for shell initialization inside there -- primarily the .zshrc or .bashrc/.bash_profile files, and any other files on which you depend.

Then, create the main script install.sh, which you execute from within the dotfiles repository, and have it perform all of the necessary actions to place the right files in the right place. (Or symlink them from the dotfiles repository to the correct places.)

Keep in mind, there's no "perfect" setup except for the one that matches your brain. (We are, after all, talking about setting up your own computer.) Sophistication is also not a prerequisite. All you need is to guarantee that your setup ends up working the way you'd like. If you want, you can use my dotfiles as a starting point, but I would strongly suggest that you customize them to your own needs!
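As a minimal sketch, assuming a symlink-based approach and the three files named below (adapt the list to your own dotfiles), install.sh could look like this:

#!/usr/bin/env bash
# Run this from inside the dotfiles repository.
for file in .zshrc .path .aliases; do
    ln -sf "$(pwd)/$file" "$HOME/$file"   # symlink each dotfile into $HOME
done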

Create shell command aliases for your commonly used commands

Why create shell aliases

Shell aliases can save you keystrokes, which save time. That time saved is compound interest over long time horizons!

How do I create aliases?

Shell aliases are easy to create. In your shell initialization script, use the following syntax; the example below aliases ls to exa with a configuration flag:

alias ls="exa --long"

Now, typing ls at the shell will instead execute exa! (To learn what exa is, see Install a suite of really cool utilities on your machine using homebrew.)

Where do I store these aliases?

In order for these shell aliases to take effect each time you open up your shell, you should ensure that they get sourced in your shell initialization script (see: Take full control of your shell environment variables for more information). You have two options:

  1. These aliases can be declared in your .zshrc or .bashrc (or analogous) file, or
  2. They can be declared in ~/.aliases, which you source inside your shell initialization script file (i.e. .zshrc/.bashrc/etc.)

I recommend the second option as doing so means you'll be putting into practice the philosophy of having clear categories of things in one place.
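With the second option, the sourcing line in your shell initialization script is a one-liner:

# in ~/.zshrc or ~/.bashrc
source ~/.aliases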

What are some aliases that could be useful?

In my dotfiles repository, I have a .shell_aliases directory which contains a full suite of aliases that I have installed.
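For illustration, here are a few hypothetical examples of the kind of aliases you might collect (these are not the actual contents of that directory):

alias jl="jupyter lab"
alias gs="git status"
alias ..="cd .."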

Other people's alias collections, which you can find all over the internet, can also serve as inspiration for your personal collection.

And finally, to top it off, Twitter user @ctrlshifti suggests aliasing please to sudo for a pleasant experience at the terminal:

alias please="sudo"

# Now you type:
# please apt-get update
# please apt-get upgrade
# etc...

Configure your conda installation

Why you would want to configure your conda installation

Configuring some things with conda can help lubricate your interactions with the conda package manager. It will save you keystrokes at the terminal, primarily, thus saving you time. The place to do this configuration is in the .condarc file, which the conda package manager searches for by default in your user's home directory.

The condarc docs are your best bet for the full configuration, but I have some favourites that I'm more than happy to share below.

How to configure your condarc

Firstly, you create a file in your home directory called .condarc. Then edit it to have the following contents:

channels:
  - conda-forge
  - defaults

auto_update_conda: True

always_yes: True

The whys

  • The auto_update_conda setting saves me from having to update conda all the time.
  • always_yes lets me always answer y to the conda installation and update prompts.
  • Setting conda-forge as the default channel above the defaults channel allows me to type conda install some_package rather than conda install -c conda-forge some_package each time I want to install a package, as conda will prioritize channels according to their order under the channels section.

About channel priorities

If you prefer, you can set the channel priorities in a different order and/or expand the list. For example, bioinformatics users may want to add in the bioconda channel, while R users may want to add in the r channel. Users who prefer stability may want to prioritize defaults ahead of conda-forge.

What this affects is how conda will look for packages when you execute the conda install command. However, it doesn't affect the channel priority in your per-project environment.yml file (see: Create one conda environment per project).
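If you ever want to verify the channel priority that conda is actually using, you can ask it directly at the terminal:

conda config --show channels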

Other conda-related pages to look at

Configure Jupyter and Jupyter Lab

Project Jupyter is a cornerstone of the data science ecosystem. When getting set up to work with Jupyter, here are some settings that you may wish to configure if you are running your own Jupyter server (and don't have a managed one provided).

Jupyter configuration files

The best way to learn about Jupyter's configuration system is to read the official documentation. However, I will strive to provide an up-to-date overview here.

You can find Jupyter's configuration files in your home directory under ~/.jupyter/. These configure Jupyter's server, which is what runs when you execute jupyter lab (as of January 2021, the lab interface is the future of Jupyter).

To configure Jupyter's behaviour, you usually start by generating the configuration file:

jupyter lab --generate-config

Now, there will be a file created at ~/.jupyter/jupyter_lab_config.py. You can edit that file to configure Jupyter's behaviour.

Change IP address to 0.0.0.0

By convention, Jupyter's server runs on 127.0.0.1, the address that localhost refers to. Usually, this is not a problem if you run the server from your local machine and then access it from a browser on the same device. However, this configuration can get troublesome if you run Jupyter on a remote workstation or server.

Here's why.

When configured to run on 127.0.0.1, the Jupyter server will only accept browser requests originating from the local machine, and will deny browser requests that originate from another device. However, when the Jupyter server runs on 0.0.0.0 instead, it will be able to accept browser requests that originate from outside the server itself (i.e. your laptop).
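To make this change, set the corresponding config traitlet in ~/.jupyter/jupyter_lab_config.py (the same file used for the port setting below):

c.ServerApp.ip = "0.0.0.0"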

Are there risks to serving the notebook up on 0.0.0.0? Yes, the primary one is that your notebook server is now accessible to anybody on the same network as the machine that is serving up Jupyter. To mitigate that risk, Jupyter has "token-based authentication" turned on by default; you might also want to consider turning on password protection.

Add password protection

You can turn on password protection by running the following command at the terminal:

jupyter lab password

Jupyter will prompt you for a password and store a hash of the password in the file ~/.jupyter/jupyter_server_config.json. (You don't need to edit this file!)

When you enable password protection, you will be prompted for a password when you access the Jupyter lab interface.

Change the minimum port number

Sometimes you and your colleagues might share the same workstation and run your Jupyter servers on there. Because the default server configuration starts at port 8888, you might end up with the following scenario (especially if you practice running one Jupyter server per project):

| Notebook port | Owner          |
|:-------------:|:--------------:|
| 8888          | yourself       |
| 8889          | yourself       |
| 8890          | your colleague |
| 8891          | yourself       |

To avoid this scenario, you can agree with your colleague that they can keep their configuration, and you can start your port numbers at 9000 (or 10000). To do so, open up ~/.jupyter/jupyter_lab_config.py and set the config traitlet below:

c.ServerApp.port = 9000

Now, your port numbers will start from port 9000 upwards, helping to keep your port number range and your colleagues' port number range effectively separated.
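If you prefer not to edit the configuration file, both the IP address and the port can also be overridden for a single session at the command line:

jupyter lab --ip=0.0.0.0 --port=9000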

Get bootstrapped on your data science projects

Why this knowledge base exists

I'm super glad you made it to my knowledge base on bootstrapping your data science machine - otherwise known as getting set up for success with great organization and sane structure. The content inside here has been battle-tested through real-world experience with colleagues and others skilled in their computing domains, but a bit new to the modern tooling offered to us in the data science world.

This knowledge base exists because I want to encourage more data scientists to adopt sane practices and conventions that promote collaboration and reproducibility in our data work. These are practices that, through years of practice in developing data projects and open source software, I have come to see the importance of.

Where I think you, the reader, are coming from

The most important thing I'm assuming about you, the reader, is that you have experienced the same challenges I encountered when structure and workflow were absent from my work. I wrote down this knowledge base for your benefit. Based on one decade (as of 2023) of continual refinement, you'll learn how to:

  1. structure your computer for data analysis projects, and
  2. structure a data analysis project for maximum effectiveness.

Because I'm a Pythonista who uses Jupyter and VSCode, some tips are specific to the language and these tools. However, being a Python programmer isn't a hard requirement. More than the specifics, I hope this knowledge base imparts to you a particular philosophy of how to work. That philosophy should be portable across languages and tooling, though having specific tooling can sometimes help you adhere to the philosophy. To read more about the philosophies behind this knowledge base, check out the page: The philosophies that ground the bootstrap.

For the beginner

As you grow in your knowledge and skillsets, this knowledge base should help you keep an eye out for critical topics you might want to learn.

For the moderately experienced

If you're looking to refine your skillsets, this knowledge graph should give you the base from which you dive deeper into specific tools.

For the seasoned data scientist

If you're a seasoned data practitioner, this guide should be able to serve you the way it helps me: as a cookbook/recipe guide to remind you of things when you forget them.

Things you'll learn

The things you'll learn here cover the first steps, starting at configuring your laptop or workstation for data science development up to some practices that help you organize your projects, regardless of where you do your computing.

I have a recommended order below, based on my experience with colleagues and other newcomers to a project:

  1. Configure your machine
  2. Get prepped per project
  3. Navigate the packaging world
  4. Handling data
  5. Choose and customize your development environment

However, you may wish to follow the guide differently and not read it in the way I prescribed above. That's not a problem! The online version is intentionally structured as a knowledge graph and not a book so that you can explore it on your own terms.

Apply these ideas just-in-time

As you go through this content, I would also encourage you to keep in mind: Time will distill the best practices in your context. Don't feel pressured to apply every single thing you see here to your project. Incrementally adopt these practices as they make sense. They're all quite composable with one another.

Not everything written here is applicable to every single project. Indeed, rarely do I use 100% of everything I've written here. Sometimes, my projects end up being more software tool development oriented, and hence I use a lot of the software-oriented ideas. Sometimes my projects are one-off, and so I ease off on the reproducibility aspect. Most of the time, my projects require a lot of exploratory work beyond simple exploratory data analysis, and imposing structure early on can be stifling for the project.

So rather than see this collection of notes as something that we must impose on every project, I would encourage you to be picky and choosy, and use only what helps you, in a just-in-time fashion, to increase your effectiveness in a project. Just-in-time adoption of a practice or a tool is preferable, because doing so eases the pressure to be rigidly complete from the get-go. In my own work, I incorporate a practice into the project just-in-time as I sense the need for it.

Moreover, as my colleague Zachary Barry would say, none of these practices can be mastered overnight. It takes running into walls to appreciate why these practices are important. For an individual who has not yet encountered problems with disorganized code, multiple versions of the same dataset, and other issues I describe here, it is difficult to deeply appreciate why it matters to apply simple and basic software development practices to your data science work. So I would encourage you to use this knowledge base as a reference tool that helps you find out, in a just-in-time fashion, a practice or tool that helps you solve a problem.

Ways to support the project

If you wish to support the project, there are a few ways:

Firstly, I spent some time linearizing this content based on my experience guiding skilled newcomers to the DS world. It's available as an eBook on LeanPub. If you purchase a copy, you will get instructions to access the repository that houses the book source and automation to bootstrap each of your Python data science projects easily!

Secondly, you can support my data science education work on Patreon! My supporters get early access to the data science content that I make.

Finally, if you have a question regarding the content, please feel free to reach out on LinkedIn. (If I make substantial edits on the basis of your comments or questions, I might reach out to you to offer a free copy of the eBook!)

Install zsh and oh-my-zsh for shell hacks

Why install zsh

The default shell on most Linux systems is bash. While bash is nice, in its vanilla state, you can't get a ton of information at a glance. For example, your bash shell might not be configured by default to:

  • Show the current working directory
  • Show the status of the Git repository
  • Show the time you executed the last command you were running
  • Show the hostname of the machine you're on (especially useful if you work on a remote machine and locally at the same time).

To illustrate, at a vanilla bash shell, you usually are given:

username@hostname $

With some of the zsh themes, with minimal configuration, you get a shell that looks like this:

username@hostname at /path/to/current/dir (yyyy-mm-dd hh:mm) [+] $

The [+] at the end gives us the Git status of a directory that is also a Git repository.

The Z shell (zsh) is an alternative shell that is largely Bash-compatible and comes pre-configured with themes that you can apply; other alternative shells, such as fish, offer similar conveniences, though fish intentionally does not aim for Bash compatibility. For theming with minimal effort, zsh is especially handy.

Of course, a fancier prompt is one thing, but each shell also comes with its own set of cool tooling to enhance your productivity at the terminal.

How do I install zsh and oh-my-zsh

If you're on a fresh install of the latest versions of macOS, the zsh is already your default shell. You probably want to then install oh-my-zsh. Follow instructions on the project's webpage. Then, pick one of the themes that you like, and configure your shell that way.
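At the time of writing, the oh-my-zsh installation one-liner looks like the following, but do check the project's webpage for the current command:

sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"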

Install homebrew on your machine

Why install Homebrew?

Your Mac comes with a lot of neat apps, but it's a bit crippled when it comes to shell utilities. (Linux machines can use Homebrew too! Read on to see when you might need it.)

As claimed, Homebrew is the missing package manager for the Mac. From it, you can get shell utilities and apps that don't come pre-installed on your computer, such as wget. Installing these shell utilities can give you a leg-up as you strive to gain mastery over your machine. (see: Install a suite of really cool utilities on your machine using homebrew)

How do we install Homebrew?

Follow the instructions on the Homebrew website; essentially, it's a single bash command. Usually, you would copy/paste it from the Homebrew website, but I've copied it over so you don't have to context-switch:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

It can be executed anywhere, but if you're feeling superstitious, you can always move to your home directory first (cd ~) before executing the command.

Once you're done...

If you're planning to install Anaconda (see: Install Anaconda on your machine), then make sure you install wget, as my bootstrap step for installing Anaconda relies on wget to pull the installer from the internet.

brew install wget

You can also install some other cool utilities using brew! (see: Install a suite of really cool utilities on your machine using homebrew)

What about Linux machines?

Linux machines usually come with their own package manager, such as yum on CentOS and apt on Ubuntu. If you have the necessary privileges to install packages, which usually means having sudo privileges on your machine, then you probably don't need to install Homebrew on Linux.

However, if you do not have sudo privileges on your machine, then you should consider installing Homebrew inside your home directory. This enables you to use brew to install Linux utilities that might not be built-in to your system. It's a pretty neat hack to have when you're working on a managed system, such as a high performance computing system.
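As a sketch of that sudo-free route, Homebrew's documentation describes an "untar anywhere" installation; adapted to your home directory, it looks roughly like this (check the Homebrew docs for the current incantation):

mkdir -p ~/homebrew
curl -L https://github.com/Homebrew/brew/tarball/master | tar xz --strip-components 1 -C ~/homebrew
export PATH=$HOME/homebrew/bin:$PATH   # add this line to your shell initialization script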

Leverage dotfiles to get your machine configured quickly

Why create a dotfiles repository

Your dotfiles control the baseline of your computing environment. Creating a dotfiles repository lets you version control it, make a backup of it on a hosted version control site (like Github or Bitbucket) and quickly deploy it to a new system.

How do you structure a dotfiles repository

It's really up to you, but you want to make sure that you capture all of the .some_file_extension files stored in your home directory that are also important for your shell runtime environment.

For example, you might want to include your .zshrc or your .bashrc files, i.e. the shell initialization scripts.

You might also want to refactor out some pieces from the .zshrc and put them into separate files that get sourced inside it. For example, I have two: one for the PATH environment variable, named .path (see: Take full control of your shell environment variables), and one for aliases, named .aliases (see: Create shell command aliases for your commonly used commands). I source these files in the .zshrc file, so everything defined in .path and .aliases is available to me.
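A minimal sketch of that sourcing, with a guard so that a missing file doesn't break shell startup:

# in ~/.zshrc
for file in ~/.path ~/.aliases; do
    [ -f "$file" ] && source "$file"
done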

You can also create an install.sh script that, when executed at the shell, symlinks all the files from the dotfiles directory into the home directory or copies them. (I usually opt to symlink because I can apply updates more easily.) The install.sh script can be as simple as:

cp .zshrc $HOME/.zshrc
cp .path $HOME/.path
cp .aliases $HOME/.aliases

Everything outlined above forms the basis of your bootstrap for a new computer, which I alluded to in Automate the bootstrapping of your new computer.


If you want to see a few examples of dotfiles in action, check out the following repositories and pages:

  • The official "dotfiles" GitHub pages
  • My own dotfiles: ericmjl/dotfiles, which are inspired by mathiasbynens/dotfiles


Install and configure git on your machine

Why do we need Git

Git is an extremely important tool! We use it to do what is known as "version control" -- the act of explicitly curating and keeping track of changes that are made to files in a repository of text files. Using Git, you can even restore files to a previous state. It's like having an extremely powerful undo button at the command line.

Knowing Git also gets you access to the world of open source tooling available on hosted version control storage providers, like GitHub, GitLab, and more.

How to install Git

Linux systems usually come with git pre-installed.

On macOS, you can type git at the Terminal, and a pop-up will appear prompting you to install Xcode and the developer tools for macOS. Accept it, and go about the rest of your day.

Sometimes, the built-in versions of git might be a bit outdated. If you want to install one of the latest versions of git, then you can use Homebrew to install Git. (see: Install homebrew on your machine)

How to configure Git with basic information

You might want to configure git with some basic information.

For example, you might need to configure Git with your username and email address, so that your commits can be attributed to your user accounts on GitHub, GitLab, or Bitbucket. To do this:

git config --global user.name "My name in quotes"
git config --global user.email "myemail@address.com"

This sets your configuration to be "global". However, you can also have "local" (i.e. per-repository) configurations, by changing the --global flag to --local:

# inside a repository, say, your company's project
git config --local user.name "My name in quotes"
git config --local user.email "myemail@company.com"

Doing so is important because you want to ensure that your Git commits are tied to the appropriate email address. The "global" setting gives you the convenience of a sane default, which you can override with a "local", per-repository configuration.
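To check which identity a given repository will actually use, and which configuration file it came from, you can ask git directly:

git config user.email                 # the effective value (local overrides global)
git config --show-origin user.email   # shows which config file it was read from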

How to configure Git with fancy features

If you installed the cool tools from "Install a suite of really cool utilities on your machine using homebrew", then you'll be thrilled to know that you can configure Git to use diff-so-fancy to render diffs!

Follow the instructions in the diff-so-fancy repository. As of 10 December 2020, my favored set of configurations are:

git config --global core.pager "diff-so-fancy | less --tabs=4 -RFX"

git config --global color.ui true

git config --global color.diff-highlight.oldNormal    "red bold"
git config --global color.diff-highlight.oldHighlight "red bold 52"
git config --global color.diff-highlight.newNormal    "green bold"
git config --global color.diff-highlight.newHighlight "green bold 22"

git config --global color.diff.meta       "11"
git config --global color.diff.frag       "magenta bold"
git config --global color.diff.commit     "yellow bold"
git config --global color.diff.old        "red bold"
git config --global color.diff.new        "green bold"
git config --global color.diff.whitespace "red reverse"

Install Docker on your machine

Why you will need Docker

Docker is used in a myriad of ways, but here are the main reasons I see for a data scientist to use Docker:

  1. It solves the "it works on my machine" headache.
  2. Those of us who operate as full-stack data scientists might sometimes be involved in deploying models as apps with front-ends for those whom we serve.
  3. Vicki Boykis said so.

While conda environments give you everything that you need for the practice of data science on your local (or remote) machine, Docker containers give you the ultimate portability. From a technical standpoint, conda environments package up the data science stack but stop at the things that ship with the operating system, on which your project might unavoidably depend. Docker lets you ship an entire operating system plus anything else you install in it.

Docker (the company) has a few good reasons why you would want to use Docker (the tool), and you can read about them here. Those reasons are likely applicable to your own work too.

How do you install Docker

Install the Docker Desktop client first. Don't worry about registering for an account on Docker Hub, though that can be useful later.

If you're on Linux, there are a few guides you can follow, my favourite being the curated guides from DigitalOcean.

If you do a quick Google search for "install docker" plus your operating system name, look out first for the DigitalOcean tutorials, which are, in my opinion, the best maintained.

Install Anaconda on your machine

What is anaconda

Anaconda is a way to get Python installed on your system.

One of the neat but oftentimes confusing things about Python is that you can have multiple Python executables living around on your system. Anaconda makes it easy for you to:

  1. Obtain Python
  2. Manage different Python versions in isolated environments using a consistent interface
  3. Install packages into these environments

Why use anaconda (or one of its variants)?

Why is this a good thing? Primarily because you might have individual projects that need different versions of Python and different versions of Python packages. Also, default Python installations, such as the ones shipped with older versions of macOS, tend to lag behind the latest release, to the detriment of your projects. Moreover, some built-in apps in an operating system may depend on that old version of Python (such as iPhoto), which means that if you mess up the installation, you might break those built-in apps. Hence, you will want a tool that lets you easily create isolated Python environments.

The Anaconda Python distribution fulfills the following key needs:

  1. You'll be able to create isolated environments on a per-project basis. (see: Follow the rule of one-to-one in managing your projects)
  2. You'll be able to install packages into those isolated environments, and evolve them over time. (see: Create one conda environment per project)

Installing Anaconda on your local machine thus helps you get easy access to Python, Jupyter (see: Use Jupyter as an experimentation playground), and other tools for modelling and analysis.

How to get anaconda?

To install the Miniforge variant of Anaconda, which is lighter-weight than the full Anaconda distribution, use the following commands:

cd ~
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh" -O anaconda.sh

This takes you to your home directory and then downloads the Miniforge bash installer from the project's GitHub releases page, saving it as anaconda.sh.

Now, install Anaconda:

bash anaconda.sh -b -p $HOME/anaconda/

This will install the Anaconda distribution of Python onto your system inside your home directory. You can now install packages at will, without needing sudo privileges!
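If you want conda itself available in new shell sessions, one option (assuming the $HOME/anaconda install location used above) is to let conda wire itself into your shell initialization script; alternatively, manage your PATH yourself, as described in Take full control of your shell environment variables.

$HOME/anaconda/bin/conda init zsh   # or "bash", depending on your shell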

Next steps

Level-up your conda skills

Install a suite of really cool utilities on your machine using homebrew

What utilities are recommended?

gcc

Install gcc if you want to have the GNU C compiler available on your Mac at the same time as the clang C compiler.

C compilers come in handy for some numerical computing packages, which multiple data science languages (Julia, Python, R) depend on.

mobile-shell

If you've ever been disconnected from SSH because of a flaky internet connection, mosh can be your saviour. Check out the tool's homepage.

tmux

This is a tool for multiplexing your shell sessions -- uber handy if you want to persist a shell session on a remote server even after you disconnect. If you're of the type who has a habit of creating new shell sessions for every project, then tmux might be able to help you get things under control. Check out the tool's homepage.

tree

The tree command line tool allows you to see the file tree at the terminal. If you pair it with exa, you will have an upgraded file tree experience. See its homepage.

exa

exa is next-level ls (which is used to list files in a directory). According to the website, "A modern replacement for ls". See the homepage. If you alias ls to exa, it's next-level convenience! (see Create shell command aliases for your commonly used commands)

ripgrep

ripgrep provides a command line tool rg, which recursively scans down the file tree from the current directory for files that contain text that you want to search. Its Github repo should reveal all of its secrets.

diff-so-fancy

This gives you a tool for viewing differences between files, aka "diffs". Check out its Github repo for more information. You can also configure git to use diff-so-fancy to render diffs at the terminal. (see: Install and configure git on your machine)

bat

bat is next-level cat, which is a utility for viewing text files in the terminal. Check out the Github repository for what you get. You can alias cat to bat, and in that way, not need to deviate from muscle memory to use bat.

fd

fd gives you a faster replacement for the shell tool find, which you can use to find files by name. Check out the Github repository to learn more about it.

fzf

On recommendation from my colleague Arkadij Kummer, grab fzf to have extremely fast fuzzy text search on the filesystem. Check out the project's GitHub repository!

croc

Use croc as a tool to move data from one machine to another easily in a secure fashion. (I have used this in lieu of commercial utilities that cost tens of thousands of dollars in license fees.) Check out the project's GitHub repository!
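To give a flavour of how croc works, here is a hypothetical transfer: the sender runs croc send, croc prints a code phrase, and the receiver types that phrase on their machine.

# on the sending machine:
croc send path/to/file.txt

# croc prints a code phrase; on the receiving machine, run:
croc <code-phrase>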

Install these really cool utilities

Now that you've read about these utilities' reason for existence, go ahead and install them!

brew install \
    git gcc tmux wget mobile-shell \
    tree exa diff-so-fancy ripgrep \
    bat fd fzf croc

See also