The philosophies that ground the bootstrap

There should be one, and preferably only one, obvious source of truth for things

This philosophy is a play on the Zen of Python, in which one line states:

There should be one-- and preferably only one --obvious way to do it.

Part of how we manage complexity in our projects is by distilling things down to single sources of truth. With a single source of truth defined, we avoid the hairy situation of being unsure which version or copy of a thing we ought to depend on, which minimizes confusion, and even potential conflicts, downstream.
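
For instance, one small way this shows up in a Python project is a single configuration module that everything else imports. (The module name and paths below are hypothetical, just a sketch of the idea rather than a prescribed layout.)

    # config.py: a hypothetical single source of truth for project-wide paths.
    from pathlib import Path

    # Notebooks and scripts import these names instead of hard-coding
    # their own copies, so there is exactly one place to update when
    # the data location changes.
    PROJECT_ROOT = Path(__file__).resolve().parent
    DATA_DIR = PROJECT_ROOT / "data"
    RAW_DATA = DATA_DIR / "raw" / "measurements.csv"

Elsewhere in the project, a notebook or script would simply write "from config import RAW_DATA" rather than re-declaring the path.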

As you go through the knowledge base, you will see this philosophy at play in how we structure and organize our individual projects.

Data scientists should strive to know every last detail about their compute stack

I believe in this philosophy: to execute on your best work, you need to have end-to-end control over every last detail.

This is illustrated by the following examples.

Firstly, consider the story of Jeff Dean and Sanjay Ghemawat: to optimize Google's software in the company's early days, they went all the way down to the hardware and manipulated bits. I believe we should aspire to do the same if we want to be maximally effective.

Secondly, consider another tech giant constantly on my radar: Apple makes some of the world's most beloved (and, to some, most hated) computers. One thing that cannot be denied is the general pattern of excellence in their execution, and one of its causal factors is their relentless control over every last detail of how they build computers. I believe the same applies to our work as data scientists.

Thirdly, I remember digging deep into 64-bit vs. 32-bit computation on GPUs vs. CPUs to solve a multinomial sampler problem in NumPy, so that I could complete the Bayesian neural network implementation I was writing. If I hadn't dug deep, I would have been stuck and unable to finish that implementation, which would have been a pity: my PyData NYC 2017 talk on Bayesian deep learning would never have materialized.
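
To give a flavour of the kind of issue involved (this is an illustrative sketch, not the exact bug I hit): probabilities computed in 32-bit floating point can trip up consistency checks once they reach a 64-bit sampler, and the usual defensive fix is to cast up and renormalize before sampling.

    import numpy as np

    # Probabilities produced in float32 (e.g. on a GPU) may not sum to
    # exactly 1.0 once interpreted in float64, which can cause
    # np.random.multinomial to raise a ValueError about pvals.
    probs32 = np.float32([0.1, 0.2, 0.3, 0.4])
    probs32 /= probs32.sum()

    # Defensive fix: cast to float64 and renormalize before sampling.
    probs64 = probs32.astype(np.float64)
    probs64 /= probs64.sum()
    draws = np.random.multinomial(100, probs64)
    print(draws)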

The compute stack that we use, from something as low-level as the processor architecture (ARM/x86/x64/RISC) to the web technologies that we touch, affects what we do on a day-to-day basis, and thus what we can deliver. Having the breadth of knowledge to navigate every layer of the stack effectively gives us superpowers to deliver the best work that we can.

Get bootstrapped on your data science projects

Why this knowledge base exists

I'm super glad you made it to my knowledge base on bootstrapping your data science machine, otherwise known as getting set up for success with great organization and sane structure. The content here has been battle-tested through real-world experience with colleagues and others who are skilled in their own computing domains but a bit new to the modern tooling offered to us in the data science world.

This knowledge base exists because I want to encourage more data scientists to adopt sane practices and conventions that promote collaboration and reproducibility in our data work. Through years of developing data projects and open source software, I have come to see the importance of these practices.

Where I think you, the reader, are coming from

The most important thing I'm assuming about you, the reader, is that you have experienced the same challenges I encountered when structure and workflow were absent from my work. I wrote this knowledge base down for your benefit. Drawing on a decade (as of 2023) of continual refinement, you'll learn how to:

  1. structure your computer for data analysis projects, and
  2. structure a data analysis project for maximum effectiveness.

Because I'm a Pythonista who uses Jupyter and VSCode, some tips are specific to the language and these tools. However, being a Python programmer isn't a hard requirement. More than the specifics, I hope this knowledge base imparts to you a particular philosophy of how to work. That philosophy should be portable across languages and tooling, though having specific tooling can sometimes help you adhere to the philosophy. To read more about the philosophies behind this knowledge base, check out the page: The philosophies that ground the bootstrap.

For the beginner

As you grow in your knowledge and skillsets, this knowledge base should help you keep an eye out for critical topics you might want to learn.

For the moderately experienced

If you're looking to refine your skillsets, this knowledge graph should give you the base from which to dive deeper into specific tools.

For the seasoned data scientist

If you're a seasoned data practitioner, this guide should be able to serve you the way it helps me: as a cookbook/recipe guide to remind you of things when you forget them.

Things you'll learn

The things you'll learn here cover the first steps: from configuring your laptop or workstation for data science development, to practices that help you organize your projects regardless of where you do your computing.

I have a recommended order below, based on my experience with colleagues and other newcomers to a project:

  1. Configure your machine
  2. Get prepped per project
  3. Navigate the packaging world
  4. Handling data
  5. Choose and customize your development environment

However, you may wish to follow the guide differently and not read it in the way I prescribed above. That's not a problem! The online version is intentionally structured as a knowledge graph and not a book so that you can explore it on your own terms.

Apply these ideas just-in-time

As you go through this content, I would also encourage you to keep in mind: Time will distill the best practices in your context. Don't feel pressured to apply every single thing you see here to your project. Incrementally adopt these practices as they make sense. They're all quite composable with one another.

Not everything written here is applicable to every single project. Indeed, rarely do I use 100% of everything I've written here. Sometimes, my projects end up being more software tool development oriented, and hence I use a lot of the software-oriented ideas. Sometimes my projects are one-off, and so I ease off on the reproducibility aspect. Most of the time, my projects require a lot of exploratory work beyond simple exploratory data analysis, and imposing structure early on can be stifling for the project.

So rather than see this collection of notes as something that we must impose on every project, I would encourage you to be picky and choosy, and use only what helps you, in a just-in-time fashion, to increase your effectiveness in a project. Just-in-time adoption of a practice or a tool is preferable, because doing so eases the pressure to be rigidly complete from the get-go. In my own work, I incorporate a practice into the project just-in-time as I sense the need for it.

Moreover, as my colleague Zachary Barry would say, none of these practices can be mastered overnight. It takes running into walls to appreciate why these practices are important. For an individual who has not yet encountered problems with disorganized code, multiple versions of the same dataset, and other issues I describe here, it is difficult to deeply appreciate why it matters to apply simple and basic software development practices to your data science work. So I would encourage you to use this knowledge base as a reference tool that helps you find out, in a just-in-time fashion, a practice or tool that helps you solve a problem.

Ways to support the project

If you wish to support the project, there are a few ways:

Firstly, I spent some time linearizing this content based on my experience guiding skilled newcomers to the DS world. It's available as an eBook on LeanPub. If you purchase a copy, you will get instructions to access the repository that houses the book's source and automation to bootstrap each of your Python data science projects easily!

Secondly, you can support my data science education work on Patreon! My supporters get early access to the data science content that I make.

Finally, if you have a question regarding the content, please feel free to reach out on LinkedIn. (If I make substantial edits on the basis of your comments or questions, I might reach out to you to offer a free copy of the eBook!)

Organize your projects by leveraging categories

This philosophy is best reflected in software development: the best software developers are masters of organization. Browse a few well-structured projects on GitHub and you'll easily glean this point. These projects keep things simple, are modular, have awesome documentation, and rely on single sources of truth for everything. Plain, unambiguous, and organized: these are the best adjectives to describe them.

At the core of this philosophy is the fact that these developers have thought carefully about categories of things. You can think of a project as being composed of distinct categories of entities: data, notebooks, scripts, source code, and more. They relate to each other in particular ways: data are consumed by notebooks, notebooks import source code, and so on. If we're extremely clear about the categories of things that exist in our project, and strive to cleanly describe the relationships between them, then our projects become very well organized, as the sketch below illustrates.
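
For instance, a hypothetical project layout that makes these categories explicit might look something like this (the names are illustrative, not a prescribed standard):

    data/         raw and processed datasets
    notebooks/    exploratory and reporting notebooks
    src/          importable source code shared across notebooks and scripts
    scripts/      one-off or pipeline scripts
    docs/         project documentation

Each directory maps onto one category of thing, which makes the relationships between them (notebooks import from src/, scripts read from data/) easy to see at a glance.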

I believe data science projects ought to be organized the same way, especially if they are collaborative projects involving more than one person. As such, it should be possible for us to adopt a sane way of working that is heavily inspired by the software development world, and thus inject structure into our projects.

Now, structure for the sake of structure is pointless; structure should exist for our utilitarian benefit. We impose a particular file structure so that we can navigate through it and find what we want quickly. We structure our source code so that we can find what we need more easily. With clearly defined categories of things and their relationships, we can more cleanly collaborate with others.

Eliminate drudgery by investing in automation

This philosophy has a slightly more practical bent. Investing early in personal and project-based automation pays compound-interest dividends much later on. The earlier you identify the categories of things that can be automated, the faster you will find yourself running your project later on, and the fewer errors you will encounter from manual, out-of-order execution of steps.

The easiest things to automate are oftentimes not the biggest. Rather, they are the small things that occur over and over. These are the things you should automate.

If you need an example: the zip archive of all of these notes that I make available to readers who purchase the eBook is automatically built and pushed to its hosting location inside a continuous deployment pipeline. Without this automation, I'd be stuck doing this step manually every time I update the knowledge base!
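
As an illustration of how small such an automation can be (the paths and names below are hypothetical, not the actual pipeline), the kind of script a continuous deployment job might call could look like this:

    # build_archive.py: a hypothetical sketch of a tiny automation step.
    # It zips up a notes/ directory so that a CI/CD job can upload the
    # resulting archive; the paths are illustrative only.
    from pathlib import Path
    import shutil

    NOTES_DIR = Path("notes")        # directory of notes to archive (assumed)
    ARCHIVE_NAME = "knowledge-base"  # produces knowledge-base.zip

    def build_archive() -> Path:
        """Create a zip of the notes directory and return its path."""
        archive = shutil.make_archive(ARCHIVE_NAME, "zip", root_dir=NOTES_DIR)
        return Path(archive)

    if __name__ == "__main__":
        print(f"Built {build_archive()}")

Wiring a one-liner like "python build_archive.py" into the deployment pipeline is what turns a repetitive manual chore into something you never have to think about again.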