Collaborating on Data Science Projects
Data science work doesn't happen in isolation. While you might have gone through solo research training, real-world projects often require collaboration with other data scientists, engineers, and stakeholders. In this chapter, I want to share effective patterns for collaboration that I've seen work well in practice.
The power of pair programming
One of the most transformative practices you can adopt is regular pair programming sessions. While it might seem counterintuitive to have two people working on the same task, the benefits are substantial and compound over time. Knowledge sharing happens organically as team members work side by side, with each person bringing their unique perspectives and experiences to the problem at hand. You'll find complex problems get solved faster because you have two minds actively engaged, catching edge cases and spotting potential issues before they become problems.
The real-time review aspect of pair programming naturally leads to higher quality code. Instead of waiting for formal code reviews where context might be lost, your code gets reviewed as it's written. This immediate feedback loop helps catch not just bugs, but also architectural issues and potential maintainability problems early in the development cycle. Perhaps most importantly, team members stay naturally aligned on practices and patterns through this continuous collaboration, reducing the friction that often comes from different coding styles and approaches within a team.
The key is to schedule substantial sessions - aim for 2-3 hours, twice a week. This gives you enough time to dive deep into problems while maintaining a regular cadence of collaboration. Shorter sessions often don't allow enough time to get into the flow state where the real benefits of pairing emerge.
For traditional driver-navigator pair programming, where one person is at the keyboard while the other observes and guides, you really don't need much - a simple screen share will do just fine. This style works particularly well with notebooks, where you're often thinking through problems together and the navigator can focus on the bigger picture while the driver handles the implementation details.
If you want both people actively coding - maybe splitting related tasks or collaboratively editing the same file - then you'll want either to be in the same room or to use tools like VSCode's Live Share or Jupyter's Real-Time Collaboration extension. These tools give you that same-room feeling even when working remotely.
Whichever setup you choose, focus on the collaboration itself rather than getting caught up in the tooling.
Leveraging AI in collaborative work
Just as AI tools can help bridge the gap between thinking and coding speed for individuals, they can supercharge pair programming sessions. When pairing, use AI to:
- Generate boilerplate code faster than either person could type it
- Draft documentation while you focus on logic
- Propose refactoring options for duplicated code
- Help explain complex code sections to your partner
For more detailed strategies on incorporating AI into your workflow, check out the chapter on ways of working with AI tools - you'll find specific techniques there for coding at the speed of thought.
The science in data science projects
Unlike traditional software development projects, data science work often resists neat categorization into sprint-sized chunks. The scientific nature of the work means you'll inevitably go through periods of apparent unproductiveness while searching for the right approach.
This is where good management and coaching become crucial. Instead of forcing data science work into a traditional agile framework with story points and fixed sprint deliverables, focus on producing meaningful work products that demonstrate progress in understanding:
- Evidence notebooks that document:
  - Why certain approaches didn't work
  - Supporting visualizations and analysis
  - Clear plans for next steps with reasoning
- Literature reviews that:
  - Survey similar problems and solutions
  - Analyze pros and cons of different approaches
  - Connect existing research to your specific context
- Minimally Complex Examples (MCEs) that:
  - Demonstrate the complete system at a small scale
  - Include mock data and working code
  - Force refinement of problem understanding
- Exploratory analyses that:
  - Test fundamental assumptions
  - Validate or invalidate hypotheses about the data
  - Reveal unexpected patterns or challenges
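To make the MCE idea concrete, here is a minimal sketch of what one might look like. Everything in it is illustrative: the churn-prediction framing, the feature names, and the rule-based baseline are all hypothetical stand-ins for whatever your real system does, but the shape is the point - mock data, a working end-to-end path, and an explicit check of a modeling assumption, all small enough to run in seconds.

```python
# A hypothetical MCE: a toy churn "model" built end to end on mock data.
# The features, labels, and rule are stand-ins for the real system.
import random

random.seed(42)

# Mock data: (tenure_months, support_tickets) pairs.
rows = [(random.randint(1, 60), random.randint(0, 10)) for _ in range(200)]

# Synthetic labels: short tenure plus many tickets tends to mean churn,
# with 10% label noise so the baseline isn't trivially perfect.
data = []
for tenure, tickets in rows:
    label = 1 if (tenure < 12 and tickets > 3) else 0
    if random.random() < 0.1:
        label = 1 - label
    data.append((tenure, tickets, label))

# Assumption check before modeling: both classes must actually occur.
labels = [churned for _, _, churned in data]
assert 0 < sum(labels) < len(labels), "need both churners and non-churners"

# A transparent rule-based baseline standing in for the real model.
def predict(tenure, tickets):
    return 1 if (tenure < 12 and tickets > 3) else 0

accuracy = sum(predict(t, s) == c for t, s, c in data) / len(data)
print(f"baseline accuracy on mock data: {accuracy:.2f}")
```

Because everything is mocked, writing even this much forces you to pin down what the inputs, outputs, and success criteria of the full system actually are - which is exactly the refinement of problem understanding an MCE is for.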
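An exploratory analysis that validates or invalidates a hypothesis can be similarly lightweight. As one sketch (with entirely synthetic data standing in for yours), a permutation test checks whether an apparent difference between two groups is distinguishable from chance, using nothing beyond the standard library:

```python
# A minimal sketch of hypothesis validation via a permutation test.
# The two groups and the 0.5 mean shift are synthetic placeholders.
import random

random.seed(0)

group_a = [random.gauss(0.0, 1.0) for _ in range(50)]
group_b = [random.gauss(0.5, 1.0) for _ in range(50)]

# Observed difference in group means.
observed = sum(group_b) / len(group_b) - sum(group_a) / len(group_a)

# Null hypothesis: group labels don't matter. Shuffle the pooled data
# and see how often a relabeling produces a difference this large.
pooled = group_a + group_b
n_perm = 2000
n_extreme = 0
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = sum(pooled[50:]) / 50 - sum(pooled[:50]) / 50
    if diff >= observed:
        n_extreme += 1

p_value = n_extreme / n_perm
print(f"observed difference: {observed:.2f}, permutation p-value: {p_value:.3f}")
```

A notebook cell like this, with a sentence of interpretation, is exactly the kind of evidence artifact that demonstrates progress even when no model has shipped.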
Making this work requires the right organizational environment. Management needs to fundamentally understand that data science is a scientific endeavor, not a software development activity with predictable outputs. Product owners need to be comfortable with the exploratory nature of the work and the inherent uncertainty in working with stochastic systems. You can't guarantee outcomes in data science the way you might guarantee a feature delivery date - and stakeholders need to embrace this reality. Success here means creating space for exploration while maintaining clear communication about progress, even when that progress looks different from traditional software development milestones.
Effective work distribution
In data science projects, resist the urge to play agile theater - don't waste time categorizing work into T-shirt sizes and story points. These artificial sizing exercises might work for software engineering, but they break down completely for scientific exploration. Instead, organize work around coherent work streams with moderate complexity and scope. This approach aligns better with how data scientists naturally work:
- Taking ambiguous problems
- Decomposing them into conceptual units
- Working through them in an order that emerges from the problem structure
This doesn't mean avoiding structure entirely - use version control (git) and maintain regular communication. Just don't fall into the trap of fake agile practices that prioritize the appearance of progress over actual scientific investigation.
The art of managing unproductive patches
Every data science project will have periods where progress seems slow. These are often crucial phases where deep learning is happening. Good managers understand this and look for indicators of progress beyond immediate deliverables:
- Clear documentation of failed approaches
- Well-reasoned plans for next steps
- Improved understanding of problem constraints
- Refined hypotheses based on data exploration
The key is maintaining transparent communication during these phases. Regular pair programming sessions can help here too - they provide natural checkpoints for sharing thinking and getting feedback.
Ultimately, it comes down to finding what works for your team. While collaboration is powerful, don't force it just because it's prescribed - even by this book. Balance is critical: avoid disappearing for weeks at a time, but also protect space for deep individual work. Pair up when tackling new territory or when you're stuck; work solo when you need to think deeply about a problem. Use AI to accelerate the mechanical parts of your work. But above all, focus on doing real science rather than just going through the motions of agile ceremonies.