If you’re a student or seeking entry-level data analytics or data science jobs and you have a Jupyter notebook-heavy GitHub portfolio, this is for you. 🎓
If you have or plan to create repos, with or without Jupyter notebooks, this may be for you. 🛠
If you are like, “unit tests? obviously I already have unit tests!”, this is definitely not for you, but that’s awesome! 👏
Your GitHub portfolio can help you:
Show what you did, that you understand your work, that your methods make sense for your problem, and that you can communicate well.
Ensure your code is easy to run and build on.
Show that you understand and adhere to industry norms and best practices.
These things have some overlap. But what each one looks like in practice depends on your background, existing skills, and the kinds of jobs you’re targeting. The purpose of this guide is to help you get the most out of whatever time you want to put in.
Basic and (Relatively) Quick 🏃‍♂️
These tasks are the low-hanging fruit if your repo is meant to be looked at:
Package management. Place all install statements and imports at the top of your notebook. This makes your code more readable. It also makes it clear what dependencies your code has, and makes it easy for someone else to install them and run your code.
Explain at the beginning of the notebook what you're doing. Start with an explanation of your approach, goals, and processes. What can they expect to see if they keep reading? Use markdown and lists. Keep it brief.
Use headings throughout. Use headings in markdown to explain each new section in order to walk your reader through it.
Remove irrelevant code and figures to focus on your narrative. When you’re doing data analysis, you often try a bunch of things that don’t end up in your final version. Keep your notebook focused on the parts that matter for the story you’re trying to tell. You’re not trying to prove you’re familiar with a lot of different methods or that you wrote a lot of code; you’re just showing what wound up being important. This also gives you less to document and reorganize.
Code organization. Now that you’ve pared down your code and results, reorganize it in a way that tells the story you’re trying to tell. This might be different from the order in which you initially wrote or ran it.
Make sure your code runs. Your most recent commit shouldn’t contain error messages or code that broke when you ran it.
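To make the list above concrete, here is a rough sketch of how the opening cells of a pared-down notebook might look. The data, the variable names, and the packages in the commented-out install line are all made up for illustration, and the sketch sticks to the standard library so it runs anywhere.

```python
# --- Cell 1: installs and imports, all in one place ---
# In a real notebook the install line would be a Jupyter magic, for example:
#   %pip install pandas matplotlib
# (package names are illustrative; this sketch only needs the standard library)
import csv
import io
import statistics

# --- Cell 2: load the data ---
# A tiny inline data set stands in for whatever file you would actually read.
RAW_CSV = """city,temp_c
Oslo,4.0
Lima,19.5
Cairo,28.0
"""
rows = list(csv.DictReader(io.StringIO(RAW_CSV)))

# --- Cell 3: the one result the narrative needs ---
mean_temp = statistics.mean(float(row["temp_c"]) for row in rows)
print(f"Mean temperature across {len(rows)} cities: {mean_temp:.1f} °C")
```

In an actual notebook, each "Cell" comment would be a markdown heading above its own code cell, and anything you tried along the way that didn't make it into the story would simply be gone.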
Important but Potentially Less Quick 🐢
If you want to spend more time, I think these areas are where you should focus.
Organize your code in functions. Organizing all of your notebook code into functions makes your code easier to read, understand, and reuse. When you write code in a Jupyter Notebook without functions, it's easier to make mistakes and harder to follow. Functions help you avoid these mistakes by keeping your code organized and making sure each part does its own job without messing with other parts. If this isn’t how you’re used to programming, it may feel like a major shift, but there are a lot of resources out there that can help you.
Write documentation to explain inputs, outputs, and usage. In addition to your previous documentation, briefly explain each of your inputs (API keys, data) and your outputs (graphs/tables), and explain how to run your code.
Automate all of the pieces of your work. For instance, if your analysis involves downloading a data set or otherwise getting data, do that in your script. This makes it easier for others to reproduce your work, and also makes it easier for you to update your analysis if the data changes. You can also include the specific version numbers of the packages you’re using in your install statements and use relative file paths. These assist with reproduction as well.
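Here is one way those three ideas could fit together; treat it as a sketch, not the only right layout. The file name, the sample rows, and the function names are hypothetical. The `get_data` function is a stand-in for a real download step: it writes a small sample file so the example runs without a network connection.

```python
import csv
import statistics
from pathlib import Path

# Relative path, so the code works wherever the repo is cloned.
DATA_PATH = Path("data") / "temperatures.csv"

# Made-up rows used by the stand-in "download" below.
SAMPLE_ROWS = [("city", "temp_c"), ("Oslo", "4.0"), ("Lima", "19.5"), ("Cairo", "28.0")]


def get_data(path: Path = DATA_PATH) -> Path:
    """Make sure the data file exists and return its path.

    Stand-in for a real download: if the file is missing, write a
    small sample file in its place.
    """
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", newline="") as f:
            csv.writer(f).writerows(SAMPLE_ROWS)
    return path


def load_temperatures(path: Path) -> list[float]:
    """Read the temp_c column of the CSV at `path` as floats."""
    with path.open(newline="") as f:
        return [float(row["temp_c"]) for row in csv.DictReader(f)]


def summarize(values: list[float]) -> dict[str, float]:
    """Return the mean, minimum, and maximum of `values`."""
    return {"mean": statistics.mean(values), "min": min(values), "max": max(values)}


def main() -> dict[str, float]:
    """Run the whole analysis, from getting the data to the summary."""
    return summarize(load_temperatures(get_data()))
```

Each function does one job and says in its docstring what goes in and what comes out, and calling `main()` reruns everything from scratch. If the data changes, only `get_data` needs to know about it.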
I’m mostly focusing on the technical pieces of this, because those are easier to explain. But it’s also worth revisiting whether your methods were appropriate to your problem and if your results convey what you intended them to.
If they weren’t or they don’t, you can redo some of the work, or you can briefly acknowledge this in your documentation and suggest alternative methods you would use if you were to expand it.
Next Steps – If You Want To 🛣️
If you want to keep going, you can reorganize your file structure and use the whole repository rather than just the notebook.
Use a requirements.txt file for your dependencies.
Put your overall documentation into a readme.
Put your functions into one or more .py files.
If you still want to show figures and walk readers through what you did, import your functions into your notebook from your .py file and call them from there.
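As a rough illustration of the last two steps, a module might look something like the sketch below. The file name `analysis.py` and both functions are hypothetical, picked only to show the shape.

```python
"""analysis.py (hypothetical file name): reusable functions for the notebook."""


def normalize(values):
    """Scale numbers linearly to the 0-1 range; a constant list maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def top_n(scores, n=3):
    """Return the `n` (name, score) pairs with the highest scores, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]
```

With that file sitting next to the notebook, a notebook cell containing `from analysis import normalize, top_n` makes both functions available, and the notebook itself can stay focused on figures and explanation. The requirements.txt file would sit alongside it, with one package per line and ideally a pinned version, in the form `package==version`.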
And if you want to continue engineering — by all means, keep doing so. Engineer as much as you want, if you’re enjoying it. But you shouldn’t feel like you have to.
Do This By Committing and Pushing ⬆️
For all of the changes you make, get into the habit of making them locally and then committing and pushing them to your repo, rather than editing the code or documentation in the GitHub GUI. This is a cleaner process, and it’s closer to how you’ll most likely work with git on a team.
Use Large Language Models To Help You 🤖
I think you should use an LLM like GPT-4 or Claude 3 to help you with this. It can:
Provide guidance with git and troubleshoot your errors
Help you write clearer documentation
Help you organize your code into functions
You should anticipate all of these being a back-and-forth process where you ask, see its responses, and iterate with it or modify its outputs if they don’t work for you.
Conclusion
You don’t have to do any of these things.
I can’t make you do them.
These practices will rarely be the deciding factor between having no job options and having some job options.
But even though you don't have to do any of these things, it's worth considering at least some of them. The thing is, these aren't just random hoops to jump through, or some kind of hazing process. They're all practices that are both useful and that you're likely to come across pretty soon in your data science career. And like anything else, they get easier with practice. So if you're already putting together a portfolio, it's probably a good idea to spend some time both getting used to coding this way and showing that you can.