Install Docker on your machine

Why you will need Docker

Docker is used in a myriad of ways, but here are the main reasons I see a data scientist will want to use Docker:

  1. It solves the "it works on my machine" headache.
  2. Those of us who operate as full-stack data scientists might sometimes be involved in deploying models as apps with front-ends for those whom we serve.
  3. Vicki Boykis said so.

While conda environments give you everything that you need for the practice of data science on your local (or remote) machine, Docker containers will give you the ultimate portability. From a technical standpoint, conda environments package data science packages, stopping the stack at stuff that ships with the operating system, which your project might unavoidably depend on. Docker lets you ship an entire operating system + anything else you install in it.

Docker (the company) has a few good reasons why you would want to use Docker (the tool), and you can read about them here. Those reasons you read about on the Docker website are likely also applicable

How do you install Docker

Install the Desktop client first. Don't worry about registering for an account on Docker Hub, though that can be useful later.

If you're on Linux, there are a few guides you can follow, my favourite being the curated guides from DigitalOcean.

If you do a quick Google search for "install docker" and tag on your operating system name, look out first for the Digital Ocean tutorials, which are in my opinion the best maintained.

Configure your machine

After getting access to your development machine, you'll want to configure it and take full control over how it works. Backing the following steps are a core set of ideas:

  • Give yourself a rich set of commonly necessary tooling right from the beginning, but without the bloat that might be unnecessary.
  • Standardize your compute environment for maximal portability from computer to computer.
  • Build up automation to get you up and running as fast as possible.
  • Have full control over your system, such that you know as much about your configuration as possible.
  • Squeeze as much productivity out of your UI as possible.

Head over to the following pages to see how you can get things going.

Initial setup

Getting Anaconda Python installed

Master the shell

Further configuration

Use docker containers for system-level packages

Why you might need to use Docker

If conda environments are such a great environment isolation tool, why would we need Docker?

That's because sometimes, your project might have an unavoidable dependency on system-level packages. I have seen some projects that use spatial mapping tooling require system-level packages. Others that depend on audio processing might require packages that can only be obtained outside of conda. In these cases, yes, installing them locally on your machine can be handy (see Install homebrew on your machine), but if you're also interested in building an app, then you'll need them packaged up inside a Docker container.

What is a Docker container? The best anchoring way to thinking about it is a fully-fledged operating system completely insulated from its host (i.e. your computer). It has no knowledge of your runtime environment variables (see: Create runtime environment variable configuration files for each of your projects and Take full control of your shell environment variables). It's like having a completely clean operating system, without the cost of buying new hardware.

How do we use Docker

I'm assuming you've already obtained Docker on your system. (see: Install Docker on your machine).

The core thing you need to know how to write is a Dockerfile. This file specifies exactly how a Docker container is to be built. The easiest way to think about the Dockerfile syntax is that it's almost bash, with a bit more additional syntax. The Docker docs give an extremely thorough tutorial. For those who are more hands-on, I recommend pair coding with another more experienced individual who is willing to teach you the ropes, to build a Docker container when it becomes relevant to your problem.