Navigate the packaging world

Where does our software come from? Most commonly, it comes from package repositories that we interact with through package managers. Package managers come in many flavours. In the Python data science world alone, there are two or three package managers one needs to be aware of, as the ecosystem is big and wide. Oftentimes we have to compose them together.

The big picture is that there are "conda" packages, which overlap heavily with "pip" packages, and both overlap very little with "system" packages.

These are the general "levels" of abstraction at which packages can be installed:

  • "Project-specific" - at which environment.yml comes into play
  • "User-specific" - at which Homebrew comes into play
  • "System-wide" - for which your system package manager comes into play (if applicable)

Be sure you know the best level of abstraction at which to install each tool, so that you can compose together the toolset you need!
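As a rough illustration (a sketch only; the exact commands and package names will vary with your platform and project), installing something at each level might look like this:

# Project-specific: dependencies declared in the project's environment.yml
conda env create -f environment.yml

# User-specific: tools installed for your user account via Homebrew
brew install graphviz

# System-wide: packages installed by your system package manager (Debian/Ubuntu shown)
sudo apt-get install build-essential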

Get bootstrapped on your data science projects

Why this knowledge base exists

I'm super glad you made it to my knowledge base on bootstrapping your data science machine - otherwise known as getting set up for success with great organization and sane structure. The content inside here has been battle-tested through real-world experience with colleagues and others skilled in their computing domains, but a bit new to the modern tooling offered to us in the data science world.

This knowledge base exists because I want to encourage more data scientists to adopt sane practices and conventions that promote collaboration and reproducibility in our data work. These are practices whose importance I have come to appreciate through years of developing data projects and open source software.

Where I think you, the reader, are coming from

The most important thing I'm assuming about you, the reader, is that you have experienced the same challenges I encountered when structure and workflow were absent from my work. I wrote down this knowledge base for your benefit. Based on one decade (as of 2023) of continual refinement, you'll learn how to:

  1. structure your computer for data analysis projects, and
  2. structure a data analysis project for maximum effectiveness.

Because I'm a Pythonista who uses Jupyter and VSCode, some tips are specific to the language and these tools. However, being a Python programmer isn't a hard requirement. More than the specifics, I hope this knowledge base imparts to you a particular philosophy of how to work. That philosophy should be portable across languages and tooling, though having specific tooling can sometimes help you adhere to the philosophy. To read more about the philosophies behind this knowledge base, check out the page: The philosophies that ground the bootstrap.

For the beginner

As you grow in your knowledge and skillsets, this knowledge base should help you keep an eye out for critical topics you might want to learn.

For the moderately experienced

If you're looking to refine your skillsets, this knowledge graph should give you the base from which you dive deeper into specific tools.

For the seasoned data scientist

If you're a seasoned data practitioner, this guide should serve you the way it serves me: as a cookbook of recipes to remind you of things when you forget them.

Things you'll learn

The things you'll learn here cover the first steps, starting with configuring your laptop or workstation for data science development and moving up to practices that help you organize your projects, regardless of where you do your computing.

I have a recommended order below, based on my experience with colleagues and other newcomers to a project:

  1. Configure your machine
  2. Get prepped per project
  3. Navigate the packaging world
  4. Handling data
  5. Choose and customize your development environment

However, you may wish to follow the guide differently and not read it in the way I prescribed above. That's not a problem! The online version is intentionally structured as a knowledge graph and not a book so that you can explore it on your own terms.

Apply these ideas just-in-time

As you go through this content, I would also encourage you to keep in mind: Time will distill the best practices in your context. Don't feel pressured to apply every single thing you see here to your project. Incrementally adopt these practices as they make sense. They're all quite composable with one another.

Not everything written here is applicable to every single project. Indeed, rarely do I use 100% of everything I've written here. Sometimes, my projects end up being more software tool development oriented, and hence I use a lot of the software-oriented ideas. Sometimes my projects are one-off, and so I ease off on the reproducibility aspect. Most of the time, my projects require a lot of exploratory work beyond simple exploratory data analysis, and imposing structure early on can be stifling for the project.

So rather than see this collection of notes as something that we must impose on every project, I would encourage you to be picky and choosy, and use only what helps you, in a just-in-time fashion, to increase your effectiveness in a project. Just-in-time adoption of a practice or a tool is preferable, because doing so eases the pressure to be rigidly complete from the get-go. In my own work, I incorporate a practice into the project just-in-time as I sense the need for it.

Moreover, as my colleague Zachary Barry would say, none of these practices can be mastered overnight. It takes running into walls to appreciate why these practices are important. For an individual who has not yet encountered problems with disorganized code, multiple versions of the same dataset, and other issues I describe here, it is difficult to deeply appreciate why it matters to apply simple and basic software development practices to your data science work. So I would encourage you to use this knowledge base as a reference tool that helps you find out, in a just-in-time fashion, a practice or tool that helps you solve a problem.

Ways to support the project

If you wish to support the project, there are a few ways:

Firstly, I spent some time linearizing this content based on my experience guiding skilled newcomers to the DS world. It's available as an eBook on LeanPub. If you purchase a copy, you will get instructions to access the repository that houses the book source and automation to bootstrap each of your Python data science projects easily!

Secondly, you can support my data science education work on Patreon! My supporters get early access to the data science content that I make.

Finally, if you have a question regarding the content, please feel free to reach out on LinkedIn. (If I make substantial edits on the basis of your comments or questions, I might reach out to you to offer a free copy of the eBook!)

Use docker containers for system-level packages

Why you might need to use Docker

If conda environments are such a great environment isolation tool, why would we need Docker?

That's because sometimes your project might have an unavoidable dependency on system-level packages. I have seen projects that use spatial mapping tooling require system-level libraries, and others that depend on audio processing need packages that can only be obtained outside of conda. In these cases, yes, installing them locally on your machine can be handy (see Install homebrew on your machine), but if you're also interested in building an app, then you'll need them packaged up inside a Docker container.

What is a Docker container? The best way to anchor your thinking is to treat it as a fully-fledged operating system that is completely insulated from its host (i.e. your computer). It has no knowledge of your runtime environment variables (see: Create runtime environment variable configuration files for each of your projects and Take full control of your shell environment variables). It's like having a completely clean operating system, without the cost of buying new hardware.

How do we use Docker

I'm assuming you've already installed Docker on your system (see: Install Docker on your machine).

The core thing you need to know how to write is a Dockerfile. This file specifies exactly how a Docker image, from which containers are run, is to be built. The easiest way to think about the Dockerfile syntax is that it's almost bash, with a bit of additional syntax. The Docker docs give an extremely thorough tutorial. For those who are more hands-on, I recommend pair coding with a more experienced individual who is willing to teach you the ropes, and building a Docker container when it becomes relevant to your problem.
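To make this concrete, here is a minimal sketch of what such a Dockerfile might look like, assuming a hypothetical project that needs a system-level audio library (libsndfile), keeps its conda environment in an environment.yml named some_env_name (as in the example elsewhere in this knowledge base), and has an entry point at app.py; treat the library, file, and image names as placeholders rather than prescriptions:

FROM continuumio/miniconda3

# System-level dependencies that conda/pip cannot provide (illustrative example).
RUN apt-get update && \
    apt-get install -y --no-install-recommends libsndfile1 && \
    rm -rf /var/lib/apt/lists/*

# Recreate the project's conda environment inside the container.
COPY environment.yml /tmp/environment.yml
RUN conda env create -f /tmp/environment.yml

# Copy the project source into the image.
COPY . /app
WORKDIR /app

# Run the app inside the conda environment by default.
CMD ["conda", "run", "-n", "some_env_name", "python", "app.py"]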

Use pip only when you cannot find packages on conda

When you can use pip

If you can't find a package on conda (see Prioritize conda to install packages), then pip can serve as a viable alternative for adding packages to your environment.

How to use pip with conda environments

In your environment.yml file:

name: some_env_name
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pandas
  - scipy
  - numpy
  - ...
  - pip:
    - some_pip_package==2.1

Some things to note here.

Firstly, the pip section uses the same version-pinning syntax as requirements.txt: it uses == rather than the = that conda uses. This is because its contents are dumped to a temporary requirements file that pip itself parses.

Secondly, keep monitoring for when the package shows up on conda-forge, as that will help you retain the advantages of installing packages with a single package manager.
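A lightweight way to keep checking (using the hypothetical package name from the example above) is to search the conda-forge channel periodically:

conda search -c conda-forge some_pip_package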

Prioritize conda to install packages

Why should you use conda for packages

As a matter of practical advice, I usually prefer conda-installed packages over pip-installed packages. Here are the reasons why.

Firstly, conda packages have their versions and dependencies tracked properly, and so the conda dependency solver (or its drop-in replacement, mamba) can be used to pick out a mutually compatible set of packages.

Secondly, on occasion one might need to use packages that come from multiple languages. There have been projects I worked on that used Python calling out to R packages. Conda was designed to handle multiple programming languages in the same environment, and will help you pull down packages from each of those languages along with all of their dependencies.

Thirdly, as the suite of packages available on conda-forge grows, and as the conda-forge developers build more tooling to automatically mirror language-specific packages onto conda-forge, it becomes progressively easier to rely primarily on the conda package manager. This idea relates to the notion of specifying single sources of truth for categories of stuff.
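As an aside on mamba: because it mirrors the conda command-line interface, swapping it in is usually just a matter of replacing the command name. For example, using the environment file shown earlier (the environment and package names are illustrative):

mamba env create -f environment.yml
mamba install -n some_env_name scipy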

How to search for conda-installable versions of packages

As background, you specify your environment using environment.yml files. These are used by the conda package manager to download the desired packages, their dependencies, and the appropriate versions onto your machine.

When you want a package, before you assume it's only available on PyPI, search for it on Anaconda.org. You can do this either by running:

conda search package_name

or by going to the Anaconda.org website and searching for the package that you're interested in.

Also, be sure to check the package's GitHub repository, under its "Installation" instructions, for anything that suggests you can install the package from conda-forge.

Once you've found it, add the package to your environment.yml file under the dependencies section.
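For example, if you found a hypothetical package_name on conda-forge, the addition would look like this (the version pin is optional and illustrative):

dependencies:
  - python=3.8
  - package_name=1.2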

If you can't find a conda-installable version of the package, then consider using pip (see: Use pip only when you cannot find packages on conda).