Collaborating on Data Science Projects

Data science work doesn't happen in isolation. While you might have gone through solo research training, real-world projects often require collaboration with other data scientists, engineers, and stakeholders. In this chapter, I want to share effective patterns for collaboration that I've seen work well in practice.

The power of pair programming

One of the most transformative practices you can adopt is regular pair programming sessions. While it might seem counterintuitive to have two people working on the same task, the benefits are substantial and compound over time. Knowledge sharing happens organically as team members work side by side, with each person bringing their unique perspectives and experiences to the problem at hand. You'll find complex problems get solved faster because you have two minds actively engaged, catching edge cases and spotting potential issues before they become problems.

The real-time review aspect of pair programming naturally leads to higher quality code. Instead of waiting for formal code reviews where context might be lost, your code gets reviewed as it's written. This immediate feedback loop helps catch not just bugs, but also architectural issues and potential maintainability problems early in the development cycle. Perhaps most importantly, team members stay naturally aligned on practices and patterns through this continuous collaboration, reducing the friction that often comes from different coding styles and approaches within a team.

The key is to schedule substantial sessions - aim for 2-3 hours, twice a week. This gives you enough time to dive deep into problems while maintaining a regular cadence of collaboration. Shorter sessions often don't allow enough time to get into the flow state where the real benefits of pairing emerge.

For traditional driver-navigator pair programming, where one person is at the keyboard while the other observes and guides, you really don't need much - a simple screen share will do just fine. This style works particularly well with notebooks, where you're often thinking through problems together and the navigator can focus on the bigger picture while the driver handles the implementation details.

If you want both people actively coding - maybe splitting related tasks or collaboratively editing the same file - then you'll want either to be in the same room or to use tools like VSCode's Live Share or Jupyter's Real-Time Collaboration extension. These tools give you that same-room feeling even when working remotely.

The key is to focus on the collaboration itself rather than getting caught up in the tooling.

Leveraging AI in collaborative work

Just as AI tools can help bridge the gap between thinking and coding speed for individuals, they can supercharge pair programming sessions. When pairing, use AI to:

  • Generate boilerplate code faster than either person could type it
  • Draft documentation while you focus on logic
  • Propose refactoring options for duplicated code
  • Help explain complex code sections to your partner

For more detailed strategies on incorporating AI into your workflow, check out the chapter on ways of working with AI tools - you'll find specific techniques there for coding at the speed of thought.

The science in data science projects

Unlike traditional software development projects, data science work often resists neat categorization into sprint-sized chunks. The scientific nature of the work means you'll inevitably go through periods of apparent unproductiveness while searching for the right approach.

This is where good management and coaching become crucial. Instead of forcing data science work into a traditional agile framework with story points and fixed sprint deliverables, focus on producing meaningful work products that demonstrate progress in understanding:

  1. Evidence notebooks that document:

    • Why certain approaches didn't work
    • Supporting visualizations and analysis
    • Clear plans for next steps with reasoning
  2. Literature reviews that:

    • Survey similar problems and solutions
    • Analyze pros and cons of different approaches
    • Connect existing research to your specific context
  3. Minimally Complex Examples (MCEs) that:

    • Demonstrate the complete system at a small scale
    • Include mock data and working code
    • Force refinement of problem understanding
  4. Exploratory analyses that:

    • Test fundamental assumptions
    • Validate or invalidate hypotheses about the data
    • Reveal unexpected patterns or challenges
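To make the MCE idea concrete, here is a minimal sketch of what one might look like: a complete mock-data-to-evaluation pipeline at toy scale. Everything in it (the function names, the linear model, the noise levels) is illustrative, not taken from any particular project.

```python
import random

def make_mock_data(n=100, seed=0):
    # Mock data with a known relationship: y = 2x + 1 plus noise.
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    # Ordinary least squares for y = a*x + b, by hand to stay dependency-free.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b = my - a * mx
    return a, b

# The "complete system" end to end: data in, model fit, sanity check out.
xs, ys = make_mock_data()
a, b = fit_line(xs, ys)
print(f"slope={a:.2f}, intercept={b:.2f}")  # should recover roughly 2 and 1
```

The point is not the model - it's that building even this forces you to decide what the inputs, outputs, and success criteria of the full system are.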

Making this work requires the right organizational environment. Management needs to fundamentally understand that data science is a scientific endeavor, not a software development activity with predictable outputs. Product owners need to be comfortable with the exploratory nature of the work and the inherent uncertainty in working with stochastic systems. You can't guarantee outcomes in data science the way you might guarantee a feature delivery date - and stakeholders need to embrace this reality. Success here means creating space for exploration while maintaining clear communication about progress, even when that progress looks different from traditional software development milestones.

Effective work distribution

In data science projects, resist the urge to play agile theater - don't waste time categorizing work into T-shirt sizes and story points. These artificial sizing exercises might work for software engineering, but they break down completely for scientific exploration. Instead, organize work around coherent work streams with moderate complexity and scope. This approach aligns better with how data scientists naturally work:

  • Taking ambiguous problems
  • Decomposing them into conceptual units
  • Working through them in an order that emerges from the problem structure

This doesn't mean avoiding structure entirely - use version control (git) and maintain regular communication. Just don't fall into the trap of fake agile practices that prioritize the appearance of progress over actual scientific investigation.

Handling merge conflicts without the drama

When you're working in a team that emphasizes data scientists having basic software skills, you'll inevitably encounter merge conflicts. These can be genuinely confusing, especially when Git starts talking about "incoming changes" versus "current changes" and you're just trying to get your analysis done. Here's the thing: you don't need to understand the philosophical differences between these concepts to resolve conflicts effectively.

The key insight is to stop worrying about what "incoming" and "current" actually mean. Instead, focus on getting your files into a working state and moving on. When you see a merge conflict, your goal is simple: make the file contain what it should contain to do its job. Look at the conflict markers, understand what each version is trying to do, and create a version that combines the best of both or chooses the right approach for your current needs.

Remember that those conflict markers (<<<<<<<, =======, >>>>>>>) are just text in your file. You can delete them entirely and edit the file however you need it to be. Don't feel constrained by Git's suggested structure - just make the file work the way you want it to work and call it a day.
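To drive home that the markers are just text, here is a deliberately naive sketch that resolves a conflict by keeping the "incoming" side. (This is an illustration, not a recommended tool - real resolution is usually a quick manual edit - and it would misfire on any legitimate line that happens to start with a marker string.)

```python
def keep_theirs(text: str) -> str:
    # Walk a conflicted file line by line, dropping the "ours" section
    # between <<<<<<< and ======= and keeping the "theirs" section
    # between ======= and >>>>>>>.
    out, keep = [], True
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            keep = False          # start of "ours": skip
        elif line.startswith("======="):
            keep = True           # start of "theirs": keep
        elif line.startswith(">>>>>>>"):
            pass                  # closing marker: drop it
        elif keep:
            out.append(line)
    return "\n".join(out)

conflicted = (
    "shared line\n"
    "<<<<<<< HEAD\n"
    "our version\n"
    "=======\n"
    "their version\n"
    ">>>>>>> feature\n"
    "another shared line"
)
print(keep_theirs(conflicted))  # shared lines plus the "theirs" side
```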

For auto-generated files like pixi.lock, don't overthink it. If you're not sure which version to keep, just pick one, commit it, and merge. After the merge is complete, run pixi lock to regenerate the lock file properly. The lock file will be correct, and you can move on with your actual work. Spending time trying to manually resolve conflicts in generated files is a waste of your mental energy.
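The workflow above can be sketched as a self-contained shell session. This demo builds a throwaway repo, forces a conflict in a generated file, keeps one side wholesale, and then "regenerates" it - the final echo stands in for running pixi lock, since pixi may not be installed. The file names and commit messages are illustrative.

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email demo@example.com
git config user.name demo

echo "lock v1" > demo.lock
git add demo.lock
git commit -qm "initial lock file"

git checkout -q -b feature
echo "feature lock" > demo.lock
git commit -qam "feature updates lock"

git checkout -q -                      # back to the default branch
echo "main lock" > demo.lock
git commit -qam "main updates lock"

git merge feature >/dev/null 2>&1 || true   # conflict in demo.lock
git checkout --theirs demo.lock             # just pick a side; don't hand-edit
git add demo.lock
git commit -qm "merge: keep one side of the generated file"

echo "regenerated lock" > demo.lock         # stand-in for: pixi lock
git commit -qam "regenerate lock file"
cat demo.lock
```

Whether you pick --ours or --theirs is irrelevant here, because the regeneration step produces the correct contents either way.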

The same principle applies to other generated files - documentation that gets auto-built, configuration files that get templated, or any file that can be recreated from source. Don't get bogged down in the mechanics of merge resolution when the file can be regenerated. Your time is better spent on the analysis and insights that actually matter for your project.

Remember: the goal isn't to become a Git expert. The goal is to keep your work flowing smoothly so you can focus on the science, not the tooling.

The art of managing unproductive patches

Every data science project will have periods where progress seems slow. These are often crucial phases where deep learning is happening. Good managers understand this and look for indicators of progress beyond immediate deliverables:

  • Clear documentation of failed approaches
  • Well-reasoned plans for next steps
  • Improved understanding of problem constraints
  • Refined hypotheses based on data exploration

The key is maintaining transparent communication during these phases. Regular pair programming sessions can help here too - they provide natural checkpoints for sharing thinking and getting feedback.

Ultimately, find what works for your team. While collaboration is powerful, don't force it just because it's prescribed - even by this book. Balance is critical: avoid disappearing for weeks at a time, but also protect space for deep individual work. Pair up when tackling new territory or when you're stuck, work solo when you need to think deeply about a problem. Use AI to accelerate the mechanical parts of your work. But above all, focus on doing real science rather than just going through the motions of agile ceremonies.