Get bootstrapped on your data science projects
I'm super glad you made it to my knowledge base on bootstrapping your data science machine - otherwise known as getting set up for success with great organization and sane structure. The content inside here has been battle-tested through real-world experience with colleagues and others skilled in their computing domains, but a bit new to the modern tooling offered to us in the data science world.
This knowledge base exists because I want to encourage more data scientists to adopt sane practices and conventions that promote collaboration and reproducibility in our data work. These are practices that, through years of practice in developing data projects and open source software, I have come to see the importance of.
The most important thing I'm assuming about you, the reader, is that you have experienced the same challenges I encountered when structure and workflow were absent from my work. I wrote down this knowledge base for your benefit. Based on my seven years (as of 2020) of continual refinement, you'll learn how to:
Because I'm a Pythonista who uses Jupyter and VSCode, some tips are specific to the language and these tools. However, being a Python programmer isn't a hard requirement. More than the specifics, I hope this knowledge base imparts to you a particular philosophy of how to work. That philosophy should be portable across languages and tooling, though having specific tooling can sometimes help you adhere to the philosophy. To read more about the philosophies behind this knowledge base, check out the page: The philosophies that ground the bootstrap.
As you grow in your knowledge and skillsets, this knowledge base should help you keep an eye out for critical topics you might want to learn.
If you're looking to refine your skillsets, this knowledge graph should give you the base from which you dive deeper into specific tools.
If you're a seasoned data practitioner, this guide should be able to serve you the way it helps me: as a cookbook/recipe guide to remind you of things when you forget them.
The things you'll learn here cover the first steps, starting at configuring your laptop or workstation for data science development up to some practices that help you organize your projects, regardless of where you do your computing.
I have a recommended order below, based on my experience with colleagues and other newcomers to a project:
However, you may wish to follow the guide differently and not read it in the way I prescribed above. That's not a problem! The online version is intentionally structured as a knowledge graph and not a book so that you can explore it on your own terms.
As you go through this content, I would also encourage you to keep in mind: Time will distill the best practices in your context. Don't feel pressured to apply every single thing you see here to your project. Incrementally adopt these practices as they make sense. They're all quite composable with one another.
Not everything written here is applicable to every single project. Indeed, rarely do I use 100% of everything I've written here. Sometimes, my projects end up being more software tool development oriented, and hence I use a lot of the software-oriented ideas. Sometimes my projects are one-off, and so I ease off on the reproducibility aspect. Most of the time, my projects require a lot of exploratory work beyond simple exploratory data analysis, and imposing structure early on can be stifling for the project.
So rather than see this collection of notes as something that we must impose on every project, I would encourage you to be picky and choosy, and use only what helps you, in a just-in-time fashion, to increase your effectiveness in a project. Just-in-time adoption of a practice or a tool is preferable, because doing so eases the pressure to be rigidly complete from the get-go. In my own work, I incorporate a practice into the project just-in-time as I sense the need for it.
Moreover, as my colleague Zachary Barry would say, none of these practices can be mastered overnight. It takes running into walls to appreciate why these practices are important. For an individual who has not yet encountered problems with disorganized code, multiple versions of the same dataset, and other issues I describe here, it is difficult to deeply appreciate why it matters to apply simple and basic software development practices to your data science work. So I would encourage you to use this knowledge base as a reference tool that helps you find out, in a just-in-time fashion, a practice or tool that helps you solve a problem.
If you wish to support the project, there are a few ways:
Firstly, I spent some time linearizing this content based on my experience guiding skilled newcomers to the DS world. That's available on the eBook version on LeanPub. If you purchase a copy (when it's released), you will get instructions to access the repository that houses the book source and automation to bootstrap each of your Python data science projects easily!
Secondly, you can support my data science education work on Patreon! My supporters get early access to the data science content that I make. Including a free copy of the eBook, which has bonus content in there!(Special thanks goes to my Supporters!)
Finally, if you have a question regarding the content, please feel free to reach out on Shortwhale. (If I make substantial edits on the basis of your comments or questions, I might reach out to you to offer a free copy of the eBook!)
Get prepped per project
Treat your projects as if they were software projects for maximum organizational effectiveness. Why? The biggest reason is that it will nudge us towards getting organized. The "magic" behind well-constructed software projects is that someone sat down and thought clearly about how to organize things. The same principle can be applied to data analysis projects.
Firstly, some overall ideas to ground the specifics:
Some ideas pertaining to Git:
Notes that pertain to organizing files:
Notes that pertain to your compute environment:
And notes that pertain to good coding practices:
Treating projects as if they were software projects, but without software engineering's stricter practices, keeps us primed to think about the generalizability of what we do, but without the over-engineering that might constrain future flexibility.
Handling data in a data science project is very tricky. Primarily, we have to worry about the following:
The notes linked in this section should give you an overview on how to approach handling data in a sane fashion on your project.
Configure your machine
After getting access to your development machine, you'll want to configure it and take full control over how it works. Backing the following steps are a core set of ideas:
Head over to the following pages to see how you can get things going.
I'd like to thank my Patreon supporters!
Your backing helps keep me caffeinated to continue producing educational DS material for the broader community!
Choose and customize your development environment
At the end of the day, we choose a development environment that we are most comfortable with. The interface with our colleagues is at the level of what we share, so this should not be the highest of your concerns. Nonetheless, let me showcase where some tools can be used. Above all, avoid religious wars about text editors. Be productive, stay productive.
The philosophies that ground the bootstrap
Here are the philosophies that ground the bootstrap. Internalizing these philosophies will help you understand where we're coming from.