written by Eric J. Ma on 2025-01-10 | tags: cybersecurity pre-commit hooks jupyter secrets management best practices data science
In this post, I share a practical approach to managing secrets in data science workflows, learned from personal experience with both successes and mistakes. I cover essential tools like direnv
and .env
files for local development, strategies for secure secret handling in Jupyter notebooks, and crucial version control practices including pre-commit hooks to catch accidental API key commits. I also discuss team collaboration approaches for secret sharing, platform-specific secrets management features, and what to do when secrets accidentally get committed to repositories. While tools like AWS Secrets Manager exist for enterprise needs, I focus on practical, accessible methods that create robust security through layered defenses, following proven software development principles that apply equally well to data science work.
I've made my share of mistakes with secrets management - from accidentally committing API keys in Jupyter notebooks to storing passwords in plain text. Today, I want to share what I've learned about keeping secrets secure in our data science workflows.
Let's face it: as data scientists, we deal with a lot of sensitive credentials. API keys for data vendors, database passwords, cloud service tokens - the list goes on. Here's my practical approach to handling these securely.
The foundation of my secret management starts with environment variables. I've found that using direnv
combined with .env
files is a game-changer for local development. Here's the how and why.
First, install direnv
in your shell. This tool automatically loads environment variables when you enter a directory and unloads them when you leave. This is way more secure than setting globals in your .bashrc
file, where any program in your user space could potentially access them.
In my projects, I create a .env
file with permissions set to 400 (meaning only I can read it):
chmod 400 .env
Inside this file, I store my secrets:
DATA_VENDOR_API_KEY=abc123 DB_PASSWORD=supersecret AWS_ACCESS_KEY=xyz789
Now, here's an important technical detail: if your Jupyter notebooks are launched in an environment where direnv
is already active, the environment variables will be automatically available through os.getenv()
. You won't need any additional setup:
import os api_key = os.getenv('DATA_VENDOR_API_KEY') # Works if direnv is active
However, if your notebooks launch without direnv
support (which is common in many setups), you can fall back to python-dotenv
:
from dotenv import load_dotenv import os load_dotenv() # Backup for when direnv isn't available api_key = os.getenv('DATA_VENDOR_API_KEY')
I generally include python-dotenv
in my projects anyway - it's a lightweight safeguard that ensures my code works across different environments, whether direnv
is available or not.
One crucial habit I've developed is to immediately add .env
to my .gitignore
file. This prevents me from accidentally committing secrets to version control. Here's what I add to .gitignore
:
.env # <---this is the key thing to add!
# These are also good to add!
*.pem
*credentials*
While .gitignore
helps prevent committing specific files, I've learned that having an automated check for accidental API key commits is invaluable. I use a pre-commit hook that scans for potential API keys in my code before they ever make it into a commit.
Here's the pre-commit hook I use (you can find it here on GitHub):
repos: - repo: https://github.com/ericmjl/api-key-precommit rev: v0.1.0 hooks: - id: api-key-checker args: - "--pattern" - "[a-zA-Z0-9]{32,}" # Catches generic API keys - "--pattern" - "api[_-]key[_-][a-zA-Z0-9]{16,}" # API keys with prefix - "--pattern" - "sk-[A-Za-z0-9]{48}" # OpenAI-style keys - "--pattern" - "ghp_[A-Za-z0-9]{36}" # GitHub tokens
This hook has saved me numerous times. Based on the configuration, you can catch keys like:
Hot tip: use a local LLM to help you generate the pattern. You can copy/paste the key a few times in a plain text editor, flip around 4-5 characters in there, and then ask it for the regex pattern. Much faster than trying to decipher it yourself! Here's the prompt:
I have the following example API keys (not real, obfuscated): sk-38qv80ehlkjsh4t8-dfo9387hanlc sk-38qv80ehl3jsh4se-dfe9387haelc sk-319voi4nwp8uryv4;jnyqr8gy[d9u63t sk-490408tyklnldsf-498324uohsiuaodf Return for me a regex pattern that can match these keys and beyond.
Once you have this set up, the pre-commit hook will automatically scan your code before each commit. If it finds anything that looks like an API key, it'll stop the commit process and point out exactly where the potential exposure is. This gives you a chance to remove any secrets before they ever make it into version control - which has saved me countless times from accidentally exposing sensitive information.
I've found that combining this with .gitignore
creates a double-layered defense against accidentally exposing secrets. The .gitignore
prevents committing known secret-containing files, while the pre-commit hook catches any secrets that might slip into your actual code.
First, let me address something critical: if you absolutely must share a secret with a teammate (which should be rare), never do it through chat, email, or externally hosted services. I've learned to use self-destroying notes through open source tools like Cryptgeon, which you can deploy on your internal network. Here's the thing - NEVER trust externally hosted secret-sharing services. You have no idea what they're doing with your data behind the scenes!
While I've built my own tool (pigeon notes, which I'll open source later this year), I actually recommend you don't trust my hosted version either. The safest approach is to deploy your own instance of an open source tool like Cryptgeon on your internal infrastructure. This way, you have full control and visibility over how your sensitive data is handled.
Now, for managing secrets in your development workflow, I've found that almost every major vendor platform that involves code execution provides some form of secrets management. For example:
The key is to check your platform's documentation and invest time in understanding its secrets management capabilities. Instead of hard-coding secrets directly in your notebooks or scripts (which is both insecure and a common mistake), leverage these built-in features - they're usually well-tested and integrated with the platform's security model. While it might take some initial setup time, it's worth learning how to store and access secrets properly within each environment you work in.
When a secret gets committed (it happens), the first step is to immediately rotate any exposed credentials - this minimizes the window of vulnerability. Then comes the cleanup: BFG Repo Cleaner helps scrub the secret from Git history. Install it using homebrew (brew install bfg
) or download the jar file. With BFG installed, run:
bfg --replace-text secrets.txt your-repo.git
Where secrets.txt
contains the patterns of secrets you want to remove. After BFG does its work, force push the cleaned repository:
git push origin --force
Let the team know they'll need to re-clone the repository fresh - any local copies still contain the secret in their history. How long it takes depends on your repository's history and complexity, so coordinate with your team to pause development until the cleanup is complete to avoid commit conflicts.
Through trial and error, I've developed these habits:
direnv
Everything I've described here isn't unique to data science - it's exactly how software developers have been building applications for years. The principles behind the 12-factor app methodology, which has guided software development for over a decade, apply just as well to our data science work. Whether we're building a web service or training a machine learning model, we're still writing code, and the same best practices around configuration and secrets management apply.
Secrets management, like Swiss cheese - each security measure has holes, but when layered together, they create a robust defense. While AWS Secrets Manager might be overkill for some projects, even simple practices like using .env
files and proper permissions can significantly improve security.
It only takes one exposed secret to potentially compromise your data or services. As data scientists, we handle sensitive information daily, so taking these precautions isn't just good practice - it's essential.
@article{
ericmjl-2025-a-projects,
author = {Eric J. Ma},
title = {A practical guide to securing secrets in data science projects},
year = {2025},
month = {01},
day = {10},
howpublished = {\url{https://ericmjl.github.io}},
journal = {Eric J. Ma's Blog},
url = {https://ericmjl.github.io/blog/2025/1/10/a-practical-guide-to-securing-secrets-in-data-science-projects},
}
I send out a newsletter with tips and tools for data scientists. Come check it out at Substack.
If you would like to sponsor the coffee that goes into making my posts, please consider GitHub Sponsors!
Finally, I do free 30-minute GenAI strategy calls for teams that are looking to leverage GenAI for maximum impact. Consider booking a call on Calendly if you're interested!