So You Want to Hire a Data Scientist

Nov 11, 2024

You've got data. You're thinking about hiring someone to work with that data. And you're pretty sure you want a data scientist.

But before you write that job posting, let's talk about whether you’re actually looking for a data scientist, what skills you should advertise and screen for, and what kinds of interview questions to ask your candidates. Because getting this right up front will save you (and your candidates) a lot of headaches down the road.

Do You Actually Want a “Data Scientist”?

It’s important to align your job title with how professionals in the field describe themselves, because the title on the job listing affects who applies to the role. If you post for a “data scientist,” your applicants will tend to have a certain range of skills and expect to be doing a certain set of activities at your company.

However, sometimes when organizations think they want a data scientist, they really need something else entirely. Let's break down some common data-related roles:

Data Analyst: Examines data to spot trends and create summaries. If you mainly need someone to work with existing data to answer business questions, create dashboards, and generate reports, a data analyst might be what you're looking for.
Data Engineer: Builds and maintains data infrastructure, pipelines, and warehouses. If your data isn't organized or accessible yet, you might need one of these first.
Machine Learning Engineer: Specializes in deploying and maintaining machine learning models in production. ML engineers are thinking about automation and scalability.
Data Architect: Designs the overall structure of data systems and how data flows between them. Think big-picture planning of how your organization handles data.
Data Scientist: Combines statistics, programming, and domain knowledge. May do some or all of the following: analyze data, build machine learning models, create dashboards – and is likely to do all of these via a programming language rather than via drag-and-drop tools or statistical software.

Sometimes organizations hire data scientists when what they really need is a data analyst with Tableau skills. Or they hire a data scientist when their data infrastructure isn't ready. In both cases, this ends badly: you've brought people on to do tasks they either can't perform well or don't want to do. And they’ll likely either underperform or start looking for their next job.

What if you primarily need someone with domain expertise, like in accounting or biology or economics, but who can also code? You may want to use a job title on your listing that reflects both of these, like Economist (Data Science). Or, if you need a specific technical expertise, you may mention that instead – like Data Scientist (Natural Language Processing), or Geospatial Data Scientist.

This is the original Data Science Venn Diagram from Drew Conway, which many other data science Venn diagrams are descended from.

What Tools Do Data Scientists Use?

Understanding the typical data science toolkit can both help you write better job listings and guide you in what to look for on resumes.

Data scientists typically work with:

Programming languages (especially Python and R)
SQL for database querying
Data analysis and machine learning libraries (for instance, pandas, NumPy, PyTorch, and LangChain for Python; tidymodels, dplyr, and caret in R)
Data visualization and dashboarding tools (ggplot, plotly/dash, shiny, streamlit)
Cloud platforms (AWS, Azure)
Development tools like git for version control and virtual environments or Docker for reproducibility
Tools and packages that are specific to the area they’re working in – like for analyzing geographic data, text data, or time-series data

While the fundamental skills (statistics, programming, problem-solving) change slowly, the specific tools, particularly Python and R packages, change more quickly.

This means that if you choose to list specific tooling in your job announcement, like packages for data analysis or dashboarding, regularly update your listings. Otherwise you risk being out of sync with what your candidates are using.

Trending Machine Learning Packages Based on Stack Overflow Questions

This chart shows the relative interest in three Python machine learning libraries, based on how often each one is tagged in questions on Stack Overflow.

Also, if your tech stack (the tools you expect your data scientists to work with) looks different from the list above, say so in the advertisement. That way you attract people who want to work with those tools, and you avoid hiring data scientists who end up unhappy with your stack or who don’t get the career development they’re looking for. And if your tech stack is dramatically different from what data scientists typically use, this is another good moment to step back and consider whether you really need a data scientist rather than another type of data professional.

Evaluating Candidates: Why to Work With Subject Matter Experts

When hiring, you need to understand which differences matter and which don't, regarding both your listings and your candidates.

Some combinations of skills are commonly found in a single person, while others are not. A listing whose duties include data cleaning, modeling, optimization, coding, and visualization describes a role that one person could reasonably fill, whereas a listing that combines data analysis, data governance, and data architecture does not.
Some skills are closely related and transfer easily; others do not. Someone who only knows SAS might struggle in a Python environment, because the gap between the two is quite large. But using different packages for data analysis or dashboarding? Not a big deal: a candidate who knows one can easily pick up another.
Some areas are more universal and some are more niche. If someone's resume reflects that they clean and analyze data, they should understand concepts like filtering and aggregation – and be able to walk through how to do that in their language of choice. But they may not be able to get into the specifics of date formats if they don't work much with that type of data.

All of these distinctions, however, can be difficult to parse without subject matter experts, so you should get them involved early and often in your hiring process. They can help design job listings, screen resumes, write interview questions, and interview candidates. If you can find data scientists for this, perhaps elsewhere in your organization, that's ideal, but someone in one of the adjacent roles discussed above may still be able to assist.

Specific Questions to Ask (And Not Ask)

How do you evaluate candidates for the skills you're looking for? Ideally, you'd come up with interview questions in the context of the specific role you're hiring for – the functional area, the level of seniority, the specific tech stack. But that's not always possible.

Here are some of the sorts of questions that will give you useful information about candidates for a wide range of roles. You should look for clear communication about technical concepts, experiences and skills that can be transferred to new problems, and coding and statistical knowledge. You'll want subject matter experts to help develop grading rubrics and evaluate responses.

Data and Statistical Reasoning

“You start exploring a data set and find that some data is missing. Walk through some steps you would take to deal with this.”
“Can you explain how you would approach a situation where two different data sources have conflicting results?”
“You have a medical test for a particular disease. Walk me through how to think about the relative costs of false positives and false negatives. How do I determine the probability that someone has the disease, given that they tested positive?”
“We have a model that’s 98% accurate, but our stakeholders aren’t happy with it. Walk me through why this might be and what questions you’d ask.”
“You’ve built a model that performs well on your test data but poorly in production. What are some possible reasons why?”
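
If you're not a data scientist yourself, it can help to see roughly what a good answer to the medical-test and 98%-accuracy questions is getting at. Here's a minimal sketch in Python. The prevalence, sensitivity, and specificity figures are made-up numbers for illustration, and the function names are hypothetical rather than taken from any library.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test), computed with Bayes' theorem."""
    true_positive = sensitivity * prevalence
    false_positive = (1 - specificity) * (1 - prevalence)
    return true_positive / (true_positive + false_positive)


def accuracy_and_recall(y_true, y_pred):
    """Share of all predictions that are correct, and share of real positives caught."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    positives = sum(t == 1 for t in y_true)
    caught = sum(t == 1 and p == 1 for t, p in pairs)
    recall = caught / positives if positives else float("nan")
    return correct / len(pairs), recall


# Illustrative assumption: 1% of people have the disease, and the test is
# right 95% of the time for both sick and healthy people.
p_disease = posterior_given_positive(prevalence=0.01, sensitivity=0.95, specificity=0.95)
print(f"P(disease | positive) = {p_disease:.3f}")  # about 0.161

# Illustrative assumption: 20 real positives out of 1,000 cases, and a "model"
# that simply predicts negative for everyone.
y_true = [1] * 20 + [0] * 980
y_pred = [0] * 1000
acc, rec = accuracy_and_recall(y_true, y_pred)
print(f"accuracy = {acc:.2f}, recall = {rec:.2f}")  # accuracy 0.98, recall 0.00
```

With these numbers, a positive result from a fairly good test still means only about a 16% chance of having the disease, because the disease is rare. And a model can be 98% accurate while missing every case the stakeholders care about. A strong candidate doesn't need to write code like this in the interview, but should be able to explain both ideas in words.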

Technical Implementation and Coding Practices

“Let’s say you analyze some data for a one-time report, but now the client wants it to run automatically every day. What are you going to change about the tools you use and how you approach the problem?”
“How do you decide whether to spend time automating part of your analysis process or doing it manually each time?”
“Describe your process for version control and managing changes in your data science projects.”
“Describe how you would write code to count the number of unique values in a dataset column. What would the syntax look like in your preferred language?”
“Explain how you would write and execute a SQL query to find the most common value in a column.”
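
For the last two questions, here's a rough sketch of the kind of answer you might hear, using only Python's standard library. The table name, column name, and values are hypothetical and exist only for illustration. Candidates may reasonably answer in R, or with a different SQL dialect.

```python
import sqlite3
from collections import Counter

# Hypothetical column of values, made up for this example.
plans = ["basic", "pro", "basic", "enterprise", "pro", "basic"]

# Count the unique values in plain Python.
n_unique = len(set(plans))

# Find the most common value in plain Python.
top_value, top_count = Counter(plans).most_common(1)[0]

# Answer the same two questions in SQL, using an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (plan TEXT)")
conn.executemany("INSERT INTO customers (plan) VALUES (?)", [(p,) for p in plans])

sql_unique = conn.execute(
    "SELECT COUNT(DISTINCT plan) FROM customers"
).fetchone()[0]

sql_top = conn.execute(
    """
    SELECT plan, COUNT(*) AS n
    FROM customers
    GROUP BY plan
    ORDER BY n DESC
    LIMIT 1
    """
).fetchone()
conn.close()

print(n_unique, top_value, top_count)  # 3 basic 3
print(sql_unique, sql_top)             # 3 ('basic', 3)
```

A candidate who works in pandas would more likely reach for `Series.nunique()` and `Series.value_counts()`. The exact syntax matters less than whether they can explain the grouping, counting, and sorting steps, and whether they think to ask what should happen when two values tie for most common.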

If you ask a candidate about automation vs. doing tasks manually, they may reference this comic.

Project Management and Stakeholder Communication

“What questions would you ask before starting a new data science project?”
“How would you handle a situation where a model’s predictions may have significant ethical implications (e.g., in hiring or lending)?”
“A stakeholder asks you to use ChatGPT to analyze our customer feedback data. How would you discuss the benefits and limitations of this approach?”

Questions to Avoid Asking

Questions focused on data governance, data architecture, or other roles that happen to have ‘data’ in the title.
Leadership or strategy questions, unless this is a leadership or strategy role.
Questions specific to niche tools when there are multiple solutions available.
Highly specialized technical questions that aren’t essential for the job.

Your goal is to understand how candidates think and work in areas that are relevant to what they're going to be doing.

Other Assessments

In addition to interviews, some organizations use other assessments to hire data scientists. These can take several forms: a timed coding test, take-home assignment, code sample, or live coding exercise during an interview. I don't think there's a one-size-fits-all answer to whether you should include one or more of these in your assessment process. It depends on the skills and abilities of your applicant pool relative to what you need, and the ability of other parts of your hiring process to effectively screen for what you're looking for.

HackerRank is a platform that offers timed coding assessments for employers to evaluate technical skills.

However, the emergence of large language models (LLMs) has changed this calculation, as these tools can handle many standard coding tests. If you give a test that you're not directly watching someone take, you should assume candidates will use LLMs. You can either design a test that LLMs won't be able to substantially assist with, which is going to be a challenge, or explicitly permit LLM use so that you don't penalize the more honest candidates.

The Bottom Line

If you think you need a data scientist, first make sure that's actually what you need. Maybe you need a data analyst who's great at Tableau, or a data engineer to get your infrastructure in order. Use job titles that match how people in the field describe themselves.

When you're ready to hire:

If you know what your tech stack is, include it – but if you don't, make sure you're not listing tools that are out of date or not widely used.
Get subject matter experts involved in your hiring process.
Ask questions that evaluate coding, communication, and statistical skills.

And one final note: candidates are evaluating you just as much as you're evaluating them. The questions you ask and how you assess their skills tell them a lot about whether your organization understands data science and will be a good place for them to work. By creating an assessment process that shows you know what data scientists know and do, you can send positive signals.
