The Data Science Bootstrap Notes

Welcome, and thank you for coming! I'm super glad you made it to my knowledge base on bootstrapping your data science machine, projects, and code - known more pedestrianly as getting set up for success with great organization and sane structure.

This book exists because I've seen too many brilliant data scientists struggle not with the math, statistics, or domain knowledge, but with the basic computing infrastructure that supports their work. It's the missing education that most data science programs don't teach - the fundamental computing skills that make everything else possible.

All of data science is founded on computing, and when you master your computer - really understand how to wield it as an extension of your thinking - you unlock the ability to do truly great things. The content inside here has been battle-tested through real-world experience from myself, colleagues, and others skilled in their computing domains.

This knowledge base exists because I want to encourage more data scientists to adopt sane practices and conventions that promote collaboration and reproducibility in our data work. Through years of developing data projects and open source software, I have come to see how important these practices are.

Where I think you, the reader, are coming from

The most important thing I'm assuming about you, the reader, is that you have experienced the same challenges I encountered when structure and workflow were absent from my work. I wrote this knowledge base for your benefit. It draws on one decade (as of 2023) of continual refinement, and from it you'll learn how to:

bootstrap your computer for data analysis projects, and
structure a data analysis project for maximum effectiveness.

Because I'm a Pythonista who uses Jupyter and VSCode, some tips are specific to the language and these tools. However, being a Python programmer isn't a hard requirement. More than the specifics, I hope this knowledge base imparts to you a particular philosophy of how to work. That philosophy should be portable across languages and tooling, though having specific tooling can sometimes help you adhere to the philosophy.

The four core philosophies that ground everything in this book are knowing your compute stack, establishing single sources of truth, automating relentlessly, and categorizing everything. These aren't abstract principles - they're practical approaches that compound into serious productivity gains. To understand how these philosophies work together and why they matter, check out: Philosophies.

For the beginner

As you grow in your knowledge and skillsets, this knowledge base should help you keep an eye out for critical topics you might want to learn.

For the moderately experienced

If you're looking to refine your skillsets, this book should give you the base from which you dive deeper into specific tools.

For the seasoned data scientist

If you're a seasoned data practitioner, this guide should serve you the way it serves me: as a cookbook/recipe guide to remind you of things when you forget them.

Things you'll learn

This knowledge base covers several key areas that are essential for effective data science work:

Development Philosophy: You'll understand the key philosophies and principles that guide effective data science work, helping you make better decisions about tools and practices.
Machine Setup: You'll learn how to configure your development environment, including essential tools like package managers, version control, and development environments. This ensures you have a robust foundation for your data science work.
Project Organization: You'll discover how to structure data science projects effectively, including managing environments, documenting your work, and organizing your data and code in a reproducible way.
Core Skills: You'll learn essential skills that every data scientist needs, from data manipulation to visualization and analysis.
Ways of Working: You'll explore best practices for collaboration, working with notebooks, and leveraging AI tools effectively in your data science workflow.

Each of these areas builds upon the others, giving you a comprehensive foundation for your data science journey. Whether you're setting up a new machine, starting a new project, or improving your existing workflow, you'll find practical guidance here.

Apply these ideas just-in-time

As you go through this content, I would also encourage you to keep in mind: Time will distill the best practices in your context. Don't feel pressured to apply every single thing you see here to your project. Incrementally adopt these practices as they make sense. They're all quite composable with one another.

Not everything written here is applicable to every single project. Indeed, rarely do I use 100% of everything I've written here. Sometimes, my projects end up being more software tool development oriented, and hence I use a lot of the software-oriented ideas. Sometimes my projects are one-off, and so I ease off on the reproducibility aspect. Most of the time, my projects require a lot of exploratory work beyond simple exploratory data analysis, and imposing structure early on can be stifling for the project.

So rather than see this collection of notes as something that we must impose on every project, I would encourage you to be picky and choosy, and use only what helps you, in a just-in-time fashion, to increase your effectiveness in a project. Just-in-time adoption of a practice or a tool is preferable, because doing so eases the pressure to be rigidly complete from the get-go. In my own work, I incorporate a practice into the project just-in-time as I sense the need for it.

Moreover, as my colleague Zachary Barry would say, none of these practices can be mastered overnight. It takes running into walls to appreciate why these practices are important. For an individual who has not yet encountered problems with disorganized code, multiple versions of the same dataset, and other issues I describe here, it is difficult to deeply appreciate why it matters to apply simple and basic software development practices to your data science work. So I would encourage you to use this knowledge base as a reference tool that helps you find out, in a just-in-time fashion, a practice or tool that helps you solve a problem.

Changes from the first edition

I wrote this book in 2017 and decided to update it in 2024. In the intervening seven years, Python data science tooling has evolved. GenAI has also changed how I think about tool design. Finally, I gained much more experience with and exposure to cloud tooling. I have updated this book to synthesize that knowledge and reflect the latest in tooling. You will see certain tools featured, such as:

pixi for environment management,
pyds-cli for project initialization,
llamabot for building LLM applications, and
GitHub Actions to give you a trigger-able bot for automation.

Additionally, I switched the format from a navigable knowledge base to a linearized online book, and took advantage of MkDocs to make this happen. My previous hand-coded, bespoke website mimicked as closely as possible what Andy Matuschak did with his knowledge base. After multiple years of experimenting with that format, I decided that simple was better than cool, and switched to MkDocs for formatting and publishing.

Ways to support the project

If you wish to support the project, there are a few ways:

Firstly, I spent some time linearizing this content based on my experience guiding skilled newcomers to the data science world. That linearized content is available in the eBook version on LeanPub.

Secondly, you can support my data science education work on GitHub! All I ask is support for some coffee!

Finally, if you have a question regarding the content, please feel free to reach out on LinkedIn. (If I make substantial edits on the basis of your comments or questions, I might reach out to you to offer a free copy of the eBook!)