Coding as a Team Sport

How to Work With Other Data Scientists or Engineers

Feb 21, 2025

In one of my first data science roles, right after I got access to the shared server, I thought I deleted all of our data. I didn’t have any experience with Linux, and after duplicating a data folder and deleting it, I panicked, thinking I’d accidentally removed the original as well. I sat at our team’s table not telling anyone and thinking, “this is going to be my last day at this job.”

I didn’t actually delete it—and my perspective now is that if permissions are set up such that one new person can delete all of your data, the fault doesn’t primarily lie with that person. But it was a terrifying introduction to the fact that when you’re coding with other people, you’re working in a shared space, and your coding process and habits directly affect your teammates.

Apart from thinking I deleted all of our data, it was a good first experience coding with others. And since then, I’ve done data science on various other teams where we’ve had to figure out processes and habits for how to code more effectively together.

The core problems to solve are: how to make our work reproducible by teammates, how to let others easily run and build upon our code, and how to work on different parts of a project in parallel without conflicts.

There’s no one-size-fits-all set of rules: I once had to use a linter that enforced specific code formatting, which would be overkill for most of the work I do. But I think the following practices will get most data people most of the way there most of the time.

Do Everything in Code

When you do something outside of code, like downloading data, manipulating it by hand, or making a graph in Excel, you’re making your work less reproducible by your teammates and therefore less useful. (The exception is a one-time download that won’t need updating, in which case you might just document what you did.)

Think of it this way: if you’re out for a week and someone needs to rerun everything you did, can they do that, or is there a part where you’re tweaking something manually? Even if it’s written down somewhere, the existence of that step creates extra work for your team.
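To make that concrete, here’s a minimal sketch of moving hand edits into code. The column names, the sample data, and the `STATE_FIXES` table are all made up for illustration; the point is only that every fix you’d otherwise make by hand in a spreadsheet lives in one place that a teammate can rerun.

```python
import csv
import io

# Hypothetical fixes that might otherwise be typed directly into a spreadsheet.
STATE_FIXES = {"Calif.": "CA", "california": "CA", "N.Y.": "NY"}

# Stand-in for a raw export; in real work this would be a file that is never edited.
RAW_CSV = """state,sales
Calif.,"1,200"
CA,300
N.Y.,50
"""


def clean_rows(rows):
    """Return cleaned copies of the rows; the raw input is left untouched."""
    cleaned = []
    for row in rows:
        state = row["state"].strip()
        cleaned.append({
            "state": STATE_FIXES.get(state, state),
            # Strip thousands separators before converting to a number.
            "sales": float(row["sales"].replace(",", "")),
        })
    return cleaned


def total_by_state(rows):
    """Sum sales per state from already-cleaned rows."""
    totals = {}
    for row in rows:
        totals[row["state"]] = totals.get(row["state"], 0.0) + row["sales"]
    return totals


if __name__ == "__main__":
    raw_rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
    print(total_by_state(clean_rows(raw_rows)))
```

If someone has to rerun this while you’re out, there’s no manual step to rediscover: the fixes are part of the script.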

Organize Your Code

Once you’ve committed to doing everything in code, the next challenge is how to write and organize that code in a way that others can follow and maintain.

This hit home for me recently with a project at work where I wrote multiple Python files for different parts of a process without connecting them. If I didn’t remember to change the names of the input and output files, I ended up running my analysis on outdated data.

This was tripping me up to some degree, but what I really wanted was for it to be usable by others, so I rewrote it with a configuration class/pipeline. Now the handoff between steps is automated, with no need to hard code anything, so a new person can run my code with just one line. The parameters—the few variables that might change each time you run it—are easily accessible in the config file, without having to touch any underlying code.

I did it this way because there’s a multi-step process with various inputs and outputs. But the same general principle should apply to whatever you’re coding with other people: make it easy to understand your work, run it, and modify the likely parameters without getting deeper into the code. Do this by organizing each .py or other code file into functions or objects as appropriate, and by thoughtfully organizing your files overall, including using main or config files as needed.
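As a rough illustration of the idea (not the actual code from that project), here’s a small sketch of a config object plus a pipeline. Every name in it, including `Config`, `filter_step`, `summarize_step`, and the `value` column, is hypothetical. The intermediate file paths are derived from the config rather than typed by hand, which is what removes the risk of a step reading an outdated file.

```python
import csv
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Config:
    """The few parameters likely to change between runs (all hypothetical)."""
    raw_path: Path
    work_dir: Path
    min_value: float = 0.0

    @property
    def filtered_path(self) -> Path:
        # Derived from work_dir, so no step can point at a stale file name.
        return self.work_dir / "filtered.csv"

    @property
    def summary_path(self) -> Path:
        return self.work_dir / "summary.txt"


def filter_step(cfg: Config) -> None:
    """Keep rows whose value is at least cfg.min_value."""
    with cfg.raw_path.open(newline="") as src, \
            cfg.filtered_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if float(row["value"]) >= cfg.min_value:
                writer.writerow(row)


def summarize_step(cfg: Config) -> None:
    """Read the previous step's output and write a one-line summary."""
    with cfg.filtered_path.open(newline="") as src:
        values = [float(row["value"]) for row in csv.DictReader(src)]
    mean = sum(values) / len(values) if values else 0.0
    cfg.summary_path.write_text(f"n={len(values)} mean={mean:.2f}\n")


def run_pipeline(cfg: Config) -> str:
    """Run every step in order; this is the one line a teammate calls."""
    cfg.work_dir.mkdir(parents=True, exist_ok=True)
    for step in (filter_step, summarize_step):
        step(cfg)
    return cfg.summary_path.read_text()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw.csv"
        raw.write_text("id,value\n1,5\n2,-3\n3,10\n")
        print(run_pipeline(Config(raw_path=raw, work_dir=Path(tmp) / "work")))
```

A new person only needs to build a `Config` and call `run_pipeline`; changing `min_value` doesn’t require touching the step functions.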

This is from a recent project structure. There’s both some file organization and a readme showing that organization.

The opposite of this? Code—in a Jupyter notebook or not—without functions, where it’s unclear what was exploratory analysis versus what needs to be kept, where it needs to be run in a non-obvious order, or where parts are randomly commented out. It’s very hard for another person to use that or build on it.

Branching and Pull Requests

Beyond organizing individual files and projects, there’s the question of how to manage changes over time. When I’m writing code on my own, I use git, but not like a software developer. I push to main. I have huge commits with multiple features. My commit messages are incoherent. And I would argue some of these things are fine and appropriate for solo work where the output is an analysis, or even in some cases an ongoing process.

But when I’m coding with other people, those practices don’t work anymore. It becomes necessary to use version control more like developers: create branches for specific features or fixes so multiple people can work in parallel without overwriting each other’s changes. These branches should be focused and relatively small, making it easier to review changes and merge them back into the main branch through pull requests. This keeps everyone’s code in sync and lets the team move faster by working on different features simultaneously.

This is from a terrific introduction on pull requests from Fred Hutch Data Science Lab (DaSL). Licensed under CC-BY-4.0.

It also makes it much easier to track what I’m doing and why, and to roll it back if necessary: for instance, if I added a feature that isn’t needed anymore, I can just revert that one commit instead of digging through the code to undo it by hand.
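For reference, the git commands behind that workflow are short. Here’s a sketch that builds them as lists and, by default, only prints them. The branch name, file name, and commit message are made up, and it assumes a remote called `origin` and a git version recent enough to have `git switch`. Opening the pull request itself happens on your hosting platform after the push.

```python
import shlex
import subprocess


def feature_branch_commands(branch, files, message, remote="origin"):
    """Commands to create a focused branch, commit to it, and push it."""
    return [
        ["git", "switch", "-c", branch],        # create the branch and move onto it
        ["git", "add", *files],                 # stage only the files for this change
        ["git", "commit", "-m", message],
        ["git", "push", "-u", remote, branch],  # publish so a pull request can be opened
    ]


def revert_command(commit):
    """Command that undoes one commit by adding a new commit reversing it."""
    return ["git", "revert", "--no-edit", commit]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical branch, file, and message, printed rather than executed.
    run_all(feature_branch_commands(
        "fix-date-parsing", ["clean.py"], "Fix date parsing in clean step"
    ))
```

Keeping each branch to one change like this is what makes reverting a single commit practical later.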

The Payoff

Data scientists often have a reputation for poor coding practices, but the truth is there’s a huge range of both practices and needs: what constitutes good coding practice is different if you’re coding by yourself for an ad hoc analysis vs. building something with others vs. writing software.

But eventually, you’re probably going to want to code with someone else. And when that happens, you don’t want your contributions to go unused because your code didn’t meet professional standards, or you didn’t know how to make a pull request.

And with LLMs, it’s much easier to adopt these behaviors. If you have a Jupyter notebook that does what you want but lacks functions or code organization, the time it takes to refactor it into a well-organized Python file is minimal, but it should be you doing it and not a teammate, because you’re the person who knows the original intent. Similarly, getting the git commands you need to create branches and push code is as simple as asking. LLMs aren’t amazing at coding all things all the time, but these are the kinds of things they do extremely well.

