Choose and customize your development environment

At the end of the day, we choose a development environment that we are most comfortable with. The interface with our colleagues is at the level of what we share, so this should not be the highest of your concerns. Nonetheless, let me showcase where some tools can be used. Above all, avoid religious wars about text editors. Be productive, stay productive.

Use Jupyter as an experimentation playground

What are the use cases for Jupyter?

I use Jupyter notebooks in the following ways.

Firstly, I use them as a prototyping environment. They are wonderful, because I can hold the state of a program in memory and interactively modify it until I get what I need out of the program. (This especially saves on time spent re-computing things.)

Secondly, I use Jupyter as an authoring environment for interactive computational teaching material. For example, I structured Network Analysis Made Simple as a series of Jupyter notebooks.

Finally, on occasion, I use Jupyter with ipywidgets and Voila to build out dashboards and interactive applications for my colleagues.

How do you get Jupyter?

Get Jupyter installed in each of your environments, by including it in your environment.yml file. (see: Create one conda environment per project)

Doing so is based on advice I received at SciPy 2016, in which one of the Jupyter developers strongly advised against "global" installations of Jupyter, to avoid package conflicts.

How do you get Jupyter to recognize your environment's Python?

To get Jupyter to recognize the Python interpreter that is defined by your conda environment (see: Create one conda environment per project), you need to make sure you have ipykernel installed inside your environment. Then, use the following commands:

export ENV_NAME="put_your_environment_name_here"
conda activate $ENV_NAME
python -m ipykernel install --user --name $ENV_NAME
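
Once the kernel is registered and selected in a notebook, a quick sanity check from inside a cell can confirm which interpreter the kernel is actually running. The snippet below is a minimal sketch that uses only the standard library. The helper name describe_interpreter is hypothetical, and the claim that the prefix's last path component matches the environment name is an assumption that holds for typical conda layouts, not a guarantee.

```python
import sys
from pathlib import Path


def describe_interpreter() -> dict:
    """Report which Python interpreter is executing this code."""
    return {
        # Full path to the running interpreter; inside a conda environment
        # this typically lives under that environment's directory.
        "executable": sys.executable,
        # Root directory of the active Python installation or environment.
        "prefix": sys.prefix,
        # Last path component of the prefix; for a conda environment this
        # is usually (assumption) the environment's name.
        "prefix_name": Path(sys.prefix).name,
    }


info = describe_interpreter()
for key, value in info.items():
    print(f"{key}: {value}")
```

If the printed paths point somewhere other than your project's environment, the notebook is probably attached to a different kernel than the one you registered.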

How do you launch Jupyter?

Newcomers to Anaconda are usually spoonfed the GUI, but I am a proponent of launching Jupyter from the terminal because doing so makes us fully aware of our environment, including the environment variables. (see the related: Create runtime environment variable configuration files for each of your projects and Take full control of your shell environment variables)

To launch Jupyter:

  1. Open your shell
  2. Navigate to your project directory
  3. Activate your conda environment
  4. Then launch Jupyter Lab: jupyter lab

In shell terms:

cd /path/to/project/directory
conda activate $ENV_NAME
jupyter lab
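
Because Jupyter inherits the environment of the shell that launched it, you can check from a notebook cell which variables made it through. Here is a minimal sketch, assuming a standard conda setup: CONDA_DEFAULT_ENV and CONDA_PREFIX are set by conda activate, while MY_PROJECT_DATA_DIR and the helper report_env are hypothetical names used only for illustration.

```python
import os


def report_env(names):
    """Return the value of each named environment variable, or None if unset."""
    return {name: os.environ.get(name) for name in names}


# CONDA_DEFAULT_ENV and CONDA_PREFIX are set by `conda activate`;
# MY_PROJECT_DATA_DIR is a hypothetical project-specific variable.
watched = ["CONDA_DEFAULT_ENV", "CONDA_PREFIX", "MY_PROJECT_DATA_DIR"]
for name, value in report_env(watched).items():
    print(f"{name} = {value if value is not None else '<not set>'}")
```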

Use VSCode to help you with software development and collaboration

Why you might want to consider using VSCode

VSCode's most significant selling point is not that it's free. Its biggest selling points are actually:

  1. Remote editing
  2. Collaborative coding

If you are developing on a remote machine, the VSCode Remote Development extension pack will enable you to code on a powerful remote machine while having the UI local. VSCode's remote capabilities are convenient, as you can then install other utilities to turn VSCode into a full-fledged IDE (such as code linters and more). You can avoid running code locally, which is fantastic on a power-sipping laptop, and instead take advantage of more powerful compute servers. (It also beats mounting remote file shares, as the latency can kill your patience!)

If you're pair coding, nothing beats the collaborative coding capabilities of VSCode Live Share. This extension pack will enable you to invite a colleague or friend to type in your VSCode session with minimal latency. (It beats using MS Teams remote control!)

How do we get those extension packs?

I linked to the extension packs above, but in case you were distracted for a moment, they are:

  • VSCode Remote Development
  • VSCode Live Share

Of course, I'm assuming you have VSCode available on your system :).

See also: Configure VSCode for maximum productivity.

Alternatives to VSCode

Some of you may have customized vim, emacs, Eclipse, or PyCharm to your heart's content. If that's the case, you're already well-equipped!

Shell-based plain text editors give you a quick way to edit texts

Why you should learn how to use shell-based text editors

As a data scientist, you'll possibly end up working on a remote machine. This necessitates using a remote ssh session, usually in a terminal shell. At other times, in a pinch, you might need to quickly edit a file on your local filesystem with nothing but a shell at hand. Thus, knowing how to use at least one of the shell-based text editors that are widely available on most systems will allow you to make quick edits on the fly.

Most important shell text editor usage steps

The most important things to learn are to:

  1. Open a file
  2. Edit and navigate through it
  3. Save and close it

As long as you can master those three actions in any text editor, you're gold. Don't worry about other extensions and add-ons until you've mastered these steps and can execute them from memory.

Which shell text editors exist?

Three of them are the most famous:

  • nano
  • vi/vim
  • emacs

The venerable nano is available on most systems and has a gentle learning curve. You can also customize it to do syntax highlighting! (see: Enhance nano with syntax highlighting)

vi/vim is the butt of many jokes on "how to exit text editors", but really it's easy:

  1. Hit escape
  2. Quickly hit :wq (the colon enters command-line mode, the w stands for write, and the q means quit)
  3. Hit enter

In normal mode, there is also ZZ (hold Shift and press z twice) for save and quit, or ZQ (hold Shift, then press z followed by q) for quit without saving. (h/t Arkadij Kummer for surfacing this one to me.)

If you need to save the file, you'll be prompted.

Get bootstrapped on your data science projects

Why this knowledge base exists

I'm super glad you made it to my knowledge base on bootstrapping your data science machine - otherwise known as getting set up for success with great organization and sane structure. The content inside here has been battle-tested through real-world experience with colleagues and others skilled in their computing domains, but a bit new to the modern tooling offered to us in the data science world.

This knowledge base exists because I want to encourage more data scientists to adopt sane practices and conventions that promote collaboration and reproducibility in our data work. Through years of developing data projects and open source software, I have come to see the importance of these practices.

Where I think you, the reader, are coming from

The most important thing I'm assuming about you, the reader, is that you have experienced the same challenges I encountered when structure and workflow were absent from my work. I wrote down this knowledge base for your benefit. Based on my seven years (as of 2020) of continual refinement, you'll learn how to:

  1. structure your computer for data analysis projects, and
  2. structure a data analysis project for maximum effectiveness.

Because I'm a Pythonista who uses Jupyter and VSCode, some tips are specific to the language and these tools. However, being a Python programmer isn't a hard requirement. More than the specifics, I hope this knowledge base imparts to you a particular philosophy of how to work. That philosophy should be portable across languages and tooling, though having specific tooling can sometimes help you adhere to the philosophy. To read more about the philosophies behind this knowledge base, check out the page: The philosophies that ground the bootstrap.

For the beginner

As you grow in your knowledge and skillsets, this knowledge base should help you keep an eye out for critical topics you might want to learn.

For the moderately experienced

If you're looking to refine your skillsets, this knowledge graph should give you the base from which you dive deeper into specific tools.

For the seasoned data scientist

If you're a seasoned data practitioner, this guide should be able to serve you the way it helps me: as a cookbook/recipe guide to remind you of things when you forget them.

Things you'll learn

The things you'll learn here cover the first steps, starting at configuring your laptop or workstation for data science development up to some practices that help you organize your projects, regardless of where you do your computing.

I have a recommended order below, based on my experience with colleagues and other newcomers to a project:

  1. Configure your machine
  2. Get prepped per project
  3. Navigate the packaging world
  4. Handling data
  5. Choose and customize your development environment

However, you may wish to follow the guide differently and not read it in the way I prescribed above. That's not a problem! The online version is intentionally structured as a knowledge graph and not a book so that you can explore it on your own terms.

Apply these ideas just-in-time

As you go through this content, I would also encourage you to keep in mind: Time will distill the best practices in your context. Don't feel pressured to apply every single thing you see here to your project. Incrementally adopt these practices as they make sense. They're all quite composable with one another.

Not everything written here is applicable to every single project. Indeed, rarely do I use 100% of everything I've written here. Sometimes, my projects end up being more software tool development oriented, and hence I use a lot of the software-oriented ideas. Sometimes my projects are one-off, and so I ease off on the reproducibility aspect. Most of the time, my projects require a lot of exploratory work beyond simple exploratory data analysis, and imposing structure early on can be stifling for the project.

So rather than see this collection of notes as something that we must impose on every project, I would encourage you to be picky and choosy, and use only what helps you, in a just-in-time fashion, to increase your effectiveness in a project. Just-in-time adoption of a practice or a tool is preferable, because doing so eases the pressure to be rigidly complete from the get-go. In my own work, I incorporate a practice into the project just-in-time as I sense the need for it.

Moreover, as my colleague Zachary Barry would say, none of these practices can be mastered overnight. It takes running into walls to appreciate why these practices are important. For an individual who has not yet encountered problems with disorganized code, multiple versions of the same dataset, and other issues I describe here, it is difficult to deeply appreciate why it matters to apply simple and basic software development practices to your data science work. So I would encourage you to use this knowledge base as a reference tool that helps you find out, in a just-in-time fashion, a practice or tool that helps you solve a problem.

Ways to support the project

If you wish to support the project, there are a few ways:

Firstly, I spent some time linearizing this content based on my experience guiding skilled newcomers to the DS world. That's available on the eBook version on LeanPub. If you purchase a copy (when it's released), you will get instructions to access the repository that houses the book source and automation to bootstrap each of your Python data science projects easily!

Secondly, you can support my data science education work on Patreon! My supporters get early access to the data science content that I make, including a free copy of the eBook, which has bonus content in there! (Special thanks go to my Supporters!)

Finally, if you have a question regarding the content, please feel free to reach out on Shortwhale. (If I make substantial edits on the basis of your comments or questions, I might reach out to you to offer a free copy of the eBook!)

Turbocharge Jupyter Lab using Language Servers

With the Language Server Protocol's (LSP) development, it's now possible to turbocharge your Jupyter Lab installation! I'm going to show you how to make this work.

Why install Jupyter LSP?

Installing the Jupyter LSP brings a world of superpowers to your Jupyter experience! If you are a human programmer, you will undoubtedly make mistakes in programming, such as forgetting to assign a value to a variable or forgetting a function's signature. According to the repository README, you'll get superpowers such as:

  1. Hover tooltips to remind yourself of what a variable is
  2. Code style diagnostics (squiggly yellow lines!)
  3. Ability to jump to the definition of a variable/function
  4. Automatic completion and hinting
  5. Function signature suggestions
  6. Kernel-free suggestions (for those times your kernel is busy)
  7. Rename functions and variables throughout the workspace

If you've used the Python extension and Jupyter notebook interface inside VSCode, these should feel familiar to you!

Prerequisites

To make this work, you will need jupyterlab>=3.0.

Installation

Installation starts by installing the jupyterlab-lsp package. It's available on both PyPI and conda. You can follow the official installation instructions online, though I am also summarizing the instructions below.

tl;dr

The tl;dr version is as follows. Ensure you have the following packages in your environment.yml:

name: <your_env_name>
channels:
- conda-forge
dependencies:
- python
- jupyterlab>=3.0
- jupyter
- ipykernel
- jupyter-lsp
- jupyterlab-lsp
- python-language-server
- jedi=0.17.2 # pinned because one of the language server packages needs it precisely pinned here
# other packages that you need go below!

Then, update your environment and run Jupyter Lab.

conda activate <your_env_name>
conda env update -f environment.yml
jupyter lab

Finally, configure the Code Completion section in JupyterLab's advanced settings. The most pertinent key-value pairs to configure are:

{
    "continuousHinting": true,
    "kernelCompletionsFirst": true
}

Install the LSP packages

We first need to ensure that you have the language server protocol packages installed for Jupyter and JupyterLab. Ensure that you install them in the appropriate environment.

conda activate <your_environment>
conda install -c conda-forge jupyter-lsp jupyterlab-lsp

Install a particular language server

We now need a language server for the particular language we work in. The available language servers are listed in the docs. If you're a conda user, you can install the Python language server from conda-forge:

conda install -c conda-forge python-language-server

Now, the Jupyter LSP package will communicate with the Python language server when Jupyter or Jupyter Lab are running!
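
If the LSP features do not show up, one thing worth ruling out is a package that never made it into the environment. The following is a minimal sketch that looks up installed versions with the standard library's importlib.metadata (Python 3.8+). The helper installed_versions is a hypothetical name, and the snippet assumes each conda package registers its metadata under the same name used in environment.yml.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names taken from the environment.yml in the tl;dr section; this
# assumes the conda packages register metadata under the same names.
required = ["jupyterlab", "jupyter-lsp", "jupyterlab-lsp", "python-language-server"]
for name, version in installed_versions(required).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Run it with the environment's own Python (for example, in a notebook cell) so that it inspects the right installation.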

Configure code completion

Of the many things that the language server protocol provides, code completion is probably the handiest of them all to configure. I've listed the section to configure above in the tl;dr section, so I'll ask you to refer back to it above in the spirit of Don't Repeat Yourself.

Enhance nano with syntax highlighting

Why you might want syntax highlighting in nano

Nano is pretty bare-bones. That said, syntax highlighting is the one biggest upgrade to plain text nano that one can make. Syntax highlighting in nano helps the way syntax highlighting helps in any other text editor: it helps you find the salient features of a language (flow control and loops, and error messages) that might be handy for reading code.

How to upgrade nano with syntax highlighting

Anthony Scopatz has this repo of nano syntax highlighting configurations that you can use. It includes installation instructions as well!

Configure VSCode for maximum productivity

How do we configure VSCode?

VSCode has a built-in settings editor that you can open by pressing Ctrl/Cmd together with , (comma).

At the same time, VSCode is extensible using its Marketplace of extensions.

What built-in options should I configure in VSCode for maximum productivity?

Autosave

Save some keystrokes by configuring VSCode to autosave your files after 10 ms. This is useful if your workflow doesn't involve running a live server that reloads code on every save.

Format on save/paste

When you explicitly command VSCode to save your file, it can run code formatters (such as black) on the file being saved.

Insert Final Newline

This option will insert a final newline at the end of a plain text file on save, which makes viewing files in the terminal much easier.

Trim trailing whitespace

This option will strip trailing whitespace from each line on save. Again, it makes viewing files in the terminal a bit easier.
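
The four options above can also be set per project rather than through the settings UI. As a minimal sketch, the script below merges them into a workspace-level .vscode/settings.json file. The setting IDs are the ones VSCode uses in settings.json; the helper write_workspace_settings is a hypothetical name, and the script assumes any existing settings file is strict JSON without comments. The demo writes into a temporary directory so that it does not touch a real project.

```python
import json
import tempfile
from pathlib import Path

# Setting IDs as they appear in VSCode's settings.json.
SETTINGS = {
    "files.autoSave": "afterDelay",
    "files.autoSaveDelay": 10,  # milliseconds
    "editor.formatOnSave": True,
    "editor.formatOnPaste": True,
    "files.insertFinalNewline": True,
    "files.trimTrailingWhitespace": True,
}


def write_workspace_settings(project_dir: Path) -> Path:
    """Merge SETTINGS into <project_dir>/.vscode/settings.json and return its path."""
    settings_path = project_dir / ".vscode" / "settings.json"
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    existing = {}
    if settings_path.exists():
        # Assumes the existing file is strict JSON (no comments).
        existing = json.loads(settings_path.read_text())
    existing.update(SETTINGS)
    settings_path.write_text(json.dumps(existing, indent=4) + "\n")
    return settings_path


# Demo against a throwaway directory instead of a real project.
with tempfile.TemporaryDirectory() as tmp:
    path = write_workspace_settings(Path(tmp))
    print(path.read_text())
```

Keeping these in the workspace file means collaborators who open the same project directory pick up the same behaviour.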

What VSCode extensions help with productivity?

Python + Pylance

The official Python extension for VSCode contains a ton of goodies that are configurable in the settings. It can load conda environments in the shell directly, automatically lint open source files, debug a program as it executes, automatically import resolvable functions and modules, and offers many, many more goodies for Python developers and data scientists.

In fact, you can write a .py file and execute it interactively as if it were a Jupyter notebook, without the overhead of the Jupyter front-end interface. Steven Mortimer even has an R-bloggers post on how to configure VSCode to behave like RStudio for Python, for those who prefer the RStudio interface!
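
To illustrate the interactive mode, here is a minimal sketch of a plain .py file split into cells. The Python extension treats each line starting with # %% as a cell boundary and lets you run the cells one at a time, while the file remains an ordinary script that also runs top to bottom. The data values are made up for illustration.

```python
# %% Load (or, here, make up) some data
import statistics

measurements = [2.1, 2.5, 1.9, 2.8, 2.4]

# %% Summarize it; re-run only this cell while iterating
summary = {
    "mean": statistics.mean(measurements),
    "stdev": statistics.stdev(measurements),
}
print(summary)
```

Because the cell markers are just comments, the file diffs cleanly in version control, which is one reason to prefer this over a notebook for code that is headed for a module.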

As powerful as the Python extension is, the Pylance extension supercharges it even more. Using an extremely performant code analyzer, it highlights nearly all problems it detects in a source .py file within a few hundred milliseconds (at most) of the file being saved. When I have had to do tool development as part of my data science work, Pylance has been essential for my productivity.

Peacock

One good practice we have written about here is to Follow the rule of one-to-one in managing your projects. When one project has one directory, that directory acts as a "workspace", which in turn gets opened in one VSCode window. If you have 3-4 windows open, then figuring out which one corresponds to which project can take a few seconds. Peacock lets you give each workspace's window its own colour, so you can tell them apart at a glance.

VSCode Remote

This extension gives you superpowers. It will enable you to edit code on a remote server while still having all of the goodies of a locally-running VSCode session. Leverage this extension to help you develop on a powerful remote machine without ever leaving your local editor.

indent-rainbow

This extension highlights indentation in different colours, making it easier to read Python (and other indentation-friendly languages') source files.

Rainbow CSV

Rainbow CSV highlights the columns in a CSV file when you open it up in VSCode, making it easier to view your data. This one was suggested by one of my book reviewers Simon Eng.

markdownlint

This extension checks your Markdown files for formatting issues, such as headers containing punctuation or missing line breaks after a header. The checks are based on the Node.js markdownlint package by David Anson.

Markdown Table Prettifier

Also suggested by Simon, Markdown Table Prettifier helps you format your Markdown tables such that they are easily readable in plain text mode.

Polacode

Polacode is "polaroid for code", giving you the ability to create "screenshots" of your code to share with others. If you've ever used carbon.now.sh, you'll enjoy this one.
