Collaborating on Data Science Projects

Data science work doesn't happen in isolation. While you might have gone through solo research training, real-world projects often require collaboration with other data scientists, engineers, and stakeholders. In this chapter, I want to share effective patterns for collaboration that I've seen work well in practice.

The power of pair programming

One of the most transformative practices you can adopt is regular pair programming sessions. While it might seem counterintuitive to have two people working on the same task, the benefits are substantial and compound over time. Knowledge sharing happens organically as team members work side by side, with each person bringing their unique perspectives and experiences to the problem at hand. You'll find complex problems get solved faster because you have two minds actively engaged, catching edge cases and spotting potential issues before they become problems.

The real-time review aspect of pair programming naturally leads to higher quality code. Instead of waiting for formal code reviews where context might be lost, your code gets reviewed as it's written. This immediate feedback loop helps catch not just bugs, but also architectural issues and potential maintainability problems early in the development cycle. Perhaps most importantly, team members stay naturally aligned on practices and patterns through this continuous collaboration, reducing the friction that often comes from different coding styles and approaches within a team.

The key is to schedule substantial sessions - aim for 2-3 hours, twice a week. This gives you enough time to dive deep into problems while maintaining a regular cadence of collaboration. Shorter sessions often don't allow enough time to get into the flow state where the real benefits of pairing emerge.

For traditional driver-navigator pair programming, where one person is at the keyboard while the other observes and guides, you really don't need much - a simple screen share will do just fine. This style works particularly well with notebooks, where you're often thinking through problems together and the navigator can focus on the bigger picture while the driver handles the implementation details.

If you want both people actively coding - maybe splitting related tasks or collaboratively editing the same file - then you'll want either to be in the same room or to use tools like VS Code's Live Share or Jupyter's Real-Time Collaboration extension. These tools give you that same-room feeling even when working remotely.

Prioritize the collaboration itself over perfect tooling choices.

Leveraging AI in collaborative work

Pair programming already concentrates two people's attention on one problem; AI slots into that rhythm without replacing either role. It can close the gap between how fast you think and how fast you type, just as it does in solo work. When pairing, use AI to:

Generate boilerplate code faster than either person could type it
Draft documentation while you focus on logic
Propose refactoring options for duplicated code
Help explain complex code sections to your partner

For more detailed strategies on incorporating AI into your workflow, see Working with AI tools; you'll find specific techniques there for coding at the speed of thought.

The science in data science projects

Unlike traditional software development projects, data science work often resists neat categorization into sprint-sized chunks. The scientific nature of the work means you'll inevitably go through periods of apparent unproductiveness while searching for the right approach.

This is where good management and coaching become crucial. Instead of forcing data science work into a traditional agile framework with story points and fixed sprint deliverables, focus on producing meaningful work products that demonstrate progress in understanding:

Evidence notebooks that document:
- Why certain approaches didn't work
- Supporting visualizations and analysis
- Clear plans for next steps with reasoning
Literature reviews that:
- Survey similar problems and solutions
- Analyze pros and cons of different approaches
- Connect existing research to your specific context
Minimally Complex Examples (MCE) that:
- Demonstrate the complete system at a small scale
- Include mock data and working code
- Force refinement of problem understanding
Exploratory analyses that:
- Test fundamental assumptions
- Validate or invalidate hypotheses about the data
- Reveal unexpected patterns or challenges
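
To make the Minimally Complex Example idea concrete, here is a rough sketch of what one might look like: mock data, a fit, and a holdout evaluation, all in one small script. Everything in it is hypothetical. The linear relationship, the noise level, and the function names are placeholders standing in for whatever your real problem needs, and a real MCE would swap each piece for a small version of the actual component.

```python
import random


def make_mock_data(n=200, slope=2.0, intercept=1.0, noise=0.5, seed=0):
    """Generate mock (x, y) pairs from an assumed linear relationship."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [slope * x + intercept + rng.gauss(0, noise) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(xs, ys, slope, intercept):
    """Root mean squared error of the fitted line on the given data."""
    squared = [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]
    return (sum(squared) / len(squared)) ** 0.5


def run_mce():
    """Mock data -> train/holdout split -> fit -> evaluate."""
    xs, ys = make_mock_data()
    split = int(0.8 * len(xs))
    slope, intercept = fit_line(xs[:split], ys[:split])
    error = rmse(xs[split:], ys[split:], slope, intercept)
    return {"slope": slope, "intercept": intercept, "holdout_rmse": error}


if __name__ == "__main__":
    print(run_mce())
```

The value is less in the model than in being forced to name every stage: what the data looks like, what gets fit, and how you would know it worked.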

Making this work requires the right organizational environment. Management needs to fundamentally understand that data science is a scientific endeavor, not a software development activity with predictable outputs. Product owners need to be comfortable with the exploratory nature of the work and the inherent uncertainty in working with stochastic systems. You can't guarantee outcomes in data science the way you might guarantee a feature delivery date - and stakeholders need to embrace this reality. Success here means creating space for exploration while maintaining clear communication about progress, even when that progress looks different from traditional software development milestones.

Effective work distribution

Once everyone agrees the work is exploratory, the next question is how you divide it without forcing sprint-shaped fiction onto scientific work. Resist the urge to play agile theater - don't waste time categorizing work into T-shirt sizes and story points. These artificial sizing exercises might work for software engineering, but they break down completely for scientific exploration. Instead, organize work around coherent work streams with moderate complexity and scope. This approach aligns better with how data scientists naturally work:

Taking ambiguous problems
Decomposing them into conceptual units
Working through them in an order that emerges iteratively as the exploration unfolds

This doesn't mean avoiding structure entirely - use version control (git) and maintain regular communication. Just don't fall into the trap of fake agile practices that prioritize the appearance of progress over actual scientific investigation.

Pull requests, review, and what CI enforces

When nobody is at the keyboard with you, pull requests carry the pairing forward. I keep the same habits I would use in a live session: small pull requests, a short description of what changed and why, and explicit callouts when behavior or data assumptions shift.

For data science work, reviewers should look past syntax. Do tests cover the new path? Does the notebook diff show exploration only, or did executable logic land in a module where it belongs? If metrics or features move, does the description include a line a stakeholder could read without opening every file?
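
One way a reviewer might spot executable logic hiding in a notebook is to list the functions and classes it defines. The sketch below is a simple heuristic, not a standard tool: the function names are made up for illustration, it assumes the nbformat 4 JSON layout (a top-level "cells" list whose code cells carry a "source" string or list of lines), and it skips any cell it cannot parse.

```python
import ast
import json


def definitions_in_notebook(notebook):
    """Return names of functions and classes defined in code cells.

    `notebook` is the parsed JSON of an .ipynb file (nbformat 4 layout).
    """
    names = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be stored as a list of lines
            source = "".join(source)
        # Drop IPython magics and shell escapes, which are not valid Python.
        lines = [
            line for line in source.splitlines()
            if not line.lstrip().startswith(("%", "!"))
        ]
        try:
            tree = ast.parse("\n".join(lines))
        except SyntaxError:
            continue  # this simple filter cannot handle the cell; skip it
        for node in ast.walk(tree):
            if isinstance(
                node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
            ):
                names.append(node.name)
    return names


def definitions_in_file(path):
    """Convenience wrapper: load an .ipynb file and list its definitions."""
    with open(path, encoding="utf-8") as handle:
        return definitions_in_notebook(json.load(handle))
```

A non-empty result is not automatically a problem; it is a prompt to ask whether those definitions should move into a module with tests.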

That human pass is only half the contract. Continuous integration is the other half, written in YAML: when CI runs tests, lint, or a docs build on every pull request, "green" means the agreed checks passed. Merging on red CI should be a conscious exception, not a casual habit. The book goes deeper on wiring that up in Use CI/CD to automate tasks. Conventions for branches, .github/ layout, and keeping agent instructions versioned with the repo are in Start with a sane repository structure.

When a change affects how others run the project or interpret results, the PR should carry the documentation update (or a clear follow-up issue). For keeping docs next to code and using them in stakeholder updates, see Store your project documentation in your project repository.

Handling merge conflicts without the drama

Shared repositories mean merge conflicts even when your work streams make sense. On teams that expect basic software skills from data scientists, you will hit them sooner or later. These moments can be genuinely confusing, especially when Git starts talking about "incoming changes" versus "current changes" and you're just trying to get your analysis done. Here's the thing: you don't need to understand the philosophical differences between these concepts to resolve conflicts effectively.

The key insight is to stop worrying about what "incoming" and "current" actually mean. Instead, focus on getting your files into a working state and moving on. When you see a merge conflict, your goal is simple: make the file contain what it should contain to do its job. Look at the conflict markers, understand what each version is trying to do, and create a version that combines the best of both or chooses the right approach for your current needs.

Remember that those conflict markers (<<<<<<<, =======, >>>>>>>) are just text in your file. You can delete them entirely and edit the file however you need it to be. Don't feel constrained by Git's suggested structure - just make the file work the way you want it to work and call it a day.
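
Because the markers are plain text, it is easy to check that none were left behind after you finish editing. The following is a minimal sketch of such a check. The function name is hypothetical, and the pattern assumes Git's default marker length of seven characters; the "|||||||" line only shows up if you use the diff3 or zdiff3 conflict style.

```python
import re

# A marker is seven identical characters at the start of a line, followed
# by a space and a label, or by nothing at all (the "=======" separator).
CONFLICT_MARKER = re.compile(r"^(<{7}|={7}|>{7}|\|{7})( |$)")


def find_conflict_markers(text):
    """Return (line_number, line) pairs for leftover conflict markers."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if CONFLICT_MARKER.match(line)
    ]


if __name__ == "__main__":
    import sys

    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            for number, line in find_conflict_markers(handle.read()):
                print(f"{path}:{number}: {line}")
```

One caveat: a line consisting of exactly seven equals signs (for example, a heading underline in some text formats) would be flagged too, so treat hits as things to look at rather than as certain errors.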

For auto-generated files like pixi.lock, don't overthink it. If you're not sure which version to keep, just pick one, commit it, and merge. After the merge is complete, run pixi lock to regenerate the lock file properly. The lock file will be correct, and you can move on with your actual work. Spending time trying to manually resolve conflicts in generated files is a waste of your mental energy.
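
The "pick one, merge, regenerate" routine can be written down as a short sequence of commands. The sketch below only assembles that sequence and, by default, prints it rather than running anything. The helper names are hypothetical, and it assumes you are in the middle of a merge where the lock file is the only conflict left; if other files still conflict, the commit step will fail.

```python
import subprocess


def lockfile_resolution_steps(lockfile="pixi.lock", side="ours"):
    """Return the commands for 'pick a side, finish the merge, regenerate'."""
    if side not in ("ours", "theirs"):
        raise ValueError("side must be 'ours' or 'theirs'")
    return [
        # Keep one version of the file whole instead of editing markers.
        ["git", "checkout", f"--{side}", "--", lockfile],
        # Mark the conflict as resolved.
        ["git", "add", lockfile],
        # Conclude the merge with the default merge message.
        ["git", "commit", "--no-edit"],
        # Regenerate the lock file from the project manifest.
        ["pixi", "lock"],
    ]


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in steps:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_steps(lockfile_resolution_steps())  # dry run: prints, changes nothing
```

If the regenerated lock file differs from the version you picked, commit it as a normal follow-up change.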

The same principle applies to other generated files - documentation that gets auto-built, configuration files that get templated, or any file that can be recreated from source. Don't get bogged down in the mechanics of merge resolution when the file can be regenerated. Your time is better spent on the analysis and insights that actually matter for your project.

Remember: the goal isn't to become a Git expert. The goal is to keep your work flowing smoothly so you can focus on the science, not the tooling.

The art of managing unproductive patches

Every data science project will have periods where progress seems slow. These are often crucial phases in which the team's understanding of the problem is deepening. Good managers understand this and look for indicators of progress beyond immediate deliverables:

Clear documentation of failed approaches
Well-reasoned plans for next steps
Improved understanding of problem constraints
Refined hypotheses based on data exploration

Transparent communication matters most during these stretches. Regular pair programming sessions help because they create natural checkpoints for sharing thinking and getting feedback.

From there, calibrate collaboration to your team. While collaboration is powerful, don't force it just because it's prescribed - even by this book. Balance is critical: avoid disappearing for weeks at a time, but also protect space for deep individual work. Pair up when tackling new territory or when you're stuck; work solo when you need to think deeply about a problem. Use AI to accelerate the mechanical parts of your work. But above all, focus on doing real science rather than just going through the motions of agile ceremonies.

Scaling tacit knowledge

Much of what makes a team effective lives in tacit knowledge that never fits cleanly into a README. Pair programming surfaces some of that in the moment; Agent Skills give you a durable way to capture specialized workflows (for example, a specific debugging sequence for instrument traces) so teammates can reuse them without repeating the same hallway explanations. That matters especially for international teams, where written playbooks reduce dependence on real-time fluency in one shared language.

Those artifacts pair naturally with how you document norms for coding agents; see Building repository memory.