A Developer Is You

Coding Tools and Practices for Data People

Aug 26, 2024

I struggled to complete my PhD. And my dissertation chair* cycled through a set of motivational strategies to get me to finish writing. My favorite was to tell me the following: “Lots of people dumber than you have finished their dissertations.”

The operative word here is not “dumber.” You can substitute whatever you'd like for that. Less organized. Less motivated. Busier. But what I got from this was that finishing your dissertation is a totally normal thing that people figure out how to do, even people who have better reasons than you to be having a hard time. You can do it too.

This is a reference to an RPG called Kingdom of Loathing, not an endorsement of drinking and coding

I take this with me whenever I'm struggling with something new. I'm sure there are people with more positive growth mindset–related self-talk, but this is mine.

And a big part of what I struggle with – but wish to do anyway – is learning new tools and development practices. I started coding in Python for data science about seven years ago with no background in computer science. I'd written a small amount of code in R and in VBA, but what I’d used the most was Stata, a proprietary tool for statistical analysis that bears little resemblance to a general-purpose programming language. Then, suddenly, after a couple of Coursera courses and a bit of coding on my own, someone was paying me to code in Python.

I'm now midway along the data-science-to-engineering pipeline; I’ve even started calling myself an engineer at work, and no one stopped me. The learning curve these last seven years has been steep, but one of the things that's helped is catching myself whenever I start thinking “This isn't for me” or “I can't do this.” I remind myself that people figure this out all the time.

You can try the same thing. Even if you just started coding because you wanted to clean some data and make some graphs, don’t be afraid to look into a wider set of practices from data science and software development. Remind yourself that people like you (or, yes, less equipped than you along every relevant dimension) have learned them. And these practices are for you.

Here's what I'm going to go through in this piece:

How to figure out what to learn
How to learn it
Specific tools or methods if you're a data scientist or want to be one

How Do You Figure Out What to Learn?

A good way to identify what to learn is to look for things that are annoying in your current workflow. What takes you forever, or is always leading to errors? Where do you feel like there's got to be a better way?

Once you know what to work on, there are different ways to look for solutions. You can Google your problem. You can ask your coworkers or other technical people.

But you can also be on the lookout for connections between your problem and things you hear out in the world – and put yourself in more situations where you're likely to encounter solutions. This might mean going to meetups or conferences, or just being in spaces where technical people talk about what they're doing.

For instance, I was recently at posit conf and I was chatting with Michael Chow, a software engineer and one of the developers of Great Tables, a Python library for formatting tables. It wasn’t on my radar previously, but when he mentioned it and showed me some examples, I thought about a report I built and that I run periodically.

My report is in HTML, and I'm currently using a combination of the pandas Styler and rolling my own HTML with a lot of help from Claude-3.5. The table looks ok. The code is clunky and longer than it needs to be. I don’t feel good about my ability to tweak it or build on it quickly.

This is a cool, professional-looking table. You can see other Great Tables examples (and the code) here

So when Michael brought up Great Tables, my ears perked up. We talked about my use case. I looked at some code. And now, if I have time to go back to that report, or next time I'm building something similar, I have a new tool to try out.

So that's one way to run into potentially useful tools or practices. But there are others.

For instance, if everyone in your field is doing something – or even just everyone who seems like a better coder than you, or who has a job you want – you should at least file that information away, even if you don't currently have time to learn it. You might also have a conversation with one of them about it. Talk about what problems you try to solve and how. Ask, Hey, should I be doing this? You might get useful information back.

You can also learn new tools or improve your practices just for the sake of it. You can pick up a book on software development for data science. You can learn something because it sounds cool or you want to put it on your resume. That’s great! If you're doing that, you're probably already sold on this whole concept of learning and applying things. But if you don't do this, that's ok as well – it's not the only way to expand your skills.

Methods for Learning

To truly learn a new tool or practice, you have to use it over and over. Taking a class isn't sufficient. And doing something once isn't going to stick with you. We learn through repetition; ideally, you can even add this new tool or practice to your workflow.

But how do you get to the point where you can use something for the first time?

It depends on your specific use case. The kinds of tradeoffs you'll typically run into include:

You can pay for more accountability and personal attention by going to an in-person class.
If you know exactly what you need and can be self-directed, a resource you can work through on your own time will be faster and cheaper.

I also highly recommend using LLMs – they’re by far the tool I use the most in developing new practices. But be honest with yourself about what you're trying to get from them. Because it could be a few different things:

A substitute for learning: I was never going to learn CSS. An LLM writes my CSS for me so I can use it to format my markdown slides. I'm not fooling myself into thinking I know CSS.
A guide to learning: You ask the LLM to teach you something. Initially you have it tell you how to do the thing, but you transition to doing it on your own.
Something in between: You're paying attention and reading what it's telling you to do, but you're also heavily relying on it to get the task done. There are tools where it's basically impossible to not learn something about what's going on if you use an LLM, just from reading the output, even if you don't memorize the syntax.

I tested this out and it was exactly what I was looking for.

Specific Tools

If you're writing code and you're not using git for version control, please start right now. Like, stop reading this and go learn git.

Beyond that, my personal list of tools and practices for early-career data coders goes something like:

A development environment that's not a notebook, like PyCharm, RStudio, or Visual Studio Code. (For more on why, see this.)
Using coding practices like putting all of your code in functions, with each function doing one thing; not repeating yourself; using relative rather than absolute references; and naming your functions and objects in a useful way
Virtual environments
Writing a readme file that explains what you're doing and why
Being able to run your code on a Linux server
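
To make the coding practices in that list a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical and made up for illustration: the "data" folder, the "scores.csv" file, the column names, and all of the function names. The point is only the shape: each function does one thing, the names say what they do, and the file path is built relative to a project folder rather than hard-coded to one machine.

```python
import csv
import tempfile
from pathlib import Path


def build_data_path(project_root: Path, file_name: str) -> Path:
    """Build a path relative to the project root (assumes a 'data' subfolder)."""
    return project_root / "data" / file_name


def read_rows(csv_path: Path) -> list[dict]:
    """Read a CSV file into a list of dictionaries, one per row."""
    with csv_path.open(newline="") as csv_file:
        return list(csv.DictReader(csv_file))


def drop_rows_missing_value(rows: list[dict], column: str) -> list[dict]:
    """Keep only the rows that have a non-empty value in the given column."""
    return [row for row in rows if (row.get(column) or "").strip()]


def mean_of_column(rows: list[dict], column: str) -> float:
    """Average a numeric column; assumes every remaining value parses as a float."""
    values = [float(row[column]) for row in rows]
    return sum(values) / len(values)


if __name__ == "__main__":
    # A throwaway folder stands in for a real project so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as temp_dir:
        project_root = Path(temp_dir)
        (project_root / "data").mkdir()
        scores_path = build_data_path(project_root, "scores.csv")
        scores_path.write_text("name,score\nana,1\nben,\ncy,3\n")

        complete_rows = drop_rows_missing_value(read_rows(scores_path), "score")
        print(mean_of_column(complete_rows, "score"))
```

Because each step is its own function, you can reuse the cleaning step on another file, or test it, without copying and pasting the whole script.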

The things I've added to my specific toolkit over the last year or so include:

Docker. I put this off for a very long time, and then suddenly I had a very specific problem where this was the only answer – so I used it.
Classes. I'd never written classes outside of a tutorial. But I've found I like them as a way of organizing my code.
Visual Studio Code. The extensions lower the barrier to adopting other practices, because you can do so many things from within it.
GitHub Actions. You can write test cases and have them run when you push or merge your code. Both times I've implemented this, it's been red for push after push because of dependency issues. But I got it fixed.
Tools for package creation, including for automatically generating documentation.
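
If classes and test cases sound abstract, this is roughly the scale I mean. It is a hypothetical sketch, not code from my actual projects: the SurveyReport class, its methods, and the test are all invented for illustration. The test function uses the "test_" naming convention that pytest looks for, which is an assumption about the test runner; a workflow that runs your tests on every push would be calling functions like this one.

```python
from dataclasses import dataclass, field


@dataclass
class SurveyReport:
    """Hypothetical example: keep a dataset and the operations on it together."""

    title: str
    scores: list[float] = field(default_factory=list)

    def add_score(self, score: float) -> None:
        """Record one score, rejecting values that can't be valid."""
        if score < 0:
            raise ValueError("score must be non-negative")
        self.scores.append(score)

    def average(self) -> float:
        """Return the mean score; fail loudly if nothing was recorded."""
        if not self.scores:
            raise ValueError("no scores recorded yet")
        return sum(self.scores) / len(self.scores)

    def summary(self) -> str:
        """Describe the report in one line, rounding the average to one decimal."""
        return f"{self.title}: {len(self.scores)} scores, average {self.average():.1f}"


def test_summary_reports_count_and_average():
    # A test case is just a small function that asserts what should be true.
    report = SurveyReport(title="Pilot survey")
    report.add_score(3.0)
    report.add_score(4.0)
    assert report.summary() == "Pilot survey: 2 scores, average 3.5"
```

What I like about this layout is that the data and the things you do to it live in one place, and the test documents what the class is supposed to do.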

I'm telling on myself a little because it took me 6+ years to get to these (which is also how long it took me to get my PhD!). But I'm ok with that. There are people who got there much faster and people who got there slower, and there are no prizes for any of us beyond getting to use the thing.

A Developer Is You

I did finish my dissertation, eventually. I'm glad I did.**

It was harder than anything I've done career-wise since. But that was partly because I was super stuck in my head about it. The pressure was intense. You feel like a failure because it's not done, and that keeps you from doing it, and it just gets worse.

After that, nothing was as bad, but I was still like “This isn't for me!” when it came to tools or practices that developers were telling me to use, in part because it was hard to get past that initial discomfort from not getting it right away.

This wasn’t even my first time with GitHub Actions. It was my second, and my tests still failed eight times before they passed. Also, please ignore my commit messages.

It's easier now because I've failed to get something immediately so many times that it's normal – and I also know I can actually learn it.

But one last piece of advice: if you really can't figure it out, and the LLM can't help you, ask another person. I’ve done this a few times in the past year, and it's humbling when a software developer solves the problem I struggled with for hours in under ten minutes. But also, I'm glad I asked. There’s also no prize for never asking for help.

________________________________________________________

* I asked him if I could cite him and he told me I could name him as well.

** For an alternative perspective, see Alex Gold's blog post, I left a PhD Program, maybe you should too!

