I think there is a big difference between not knowing how, say, a bubble sort works (a basic comp sci concept, but not useful to a data scientist), not understanding your graphing code (might be OK, assuming you’re sure the right data is being plotted the way you want and you can verify that just by looking at it), and not understanding the polars/pandas/SQL code that is doing the analysis.
For example, I have used an LLM to translate a Plotly graph to Altair (so I could have shaded error bars, which Plotly doesn’t support out of the box), but it was a long back-and-forth, and part of me wishes I had just learned Altair rather than outsourcing it to an LLM.
I think the jury is still very much out on the productivity gains from LLMs for coding, and there is evidence that using them inhibits learning and understanding:
https://www.anthropic.com/research/AI-assistance-coding-skills
LLMs can work for coding because you can always test the code, and I think you are right about the importance of tests. But I want to understand any actual analysis code so I can make sure the analysis is correct! The LLM doesn’t understand anything; only humans know what our true intent is. If we divorce ourselves from understanding the analytic code, then the risk is high that subtle but believable errors will creep in.
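To make "subtle but believable" concrete, here is a minimal sketch of the kind of error I mean. Everything in it is made up for illustration: the `orders` and `shipments` tables, the amounts, and the helper names are all hypothetical. It uses Python's built-in `sqlite3` so it runs anywhere. The naive query looks perfectly reasonable, runs without complaint, and returns a plausible number, but the join fans out on orders with more than one shipment and double counts revenue.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
        CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INTEGER);
        INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
        -- Order 1 was split across two shipments.
        INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);
        """
    )
    return conn


# Looks fine at a glance: "revenue from shipped orders".
# The join yields one row per shipment, so order 1 is counted twice.
NAIVE_SQL = """
    SELECT SUM(o.amount)
    FROM orders o
    JOIN shipments s ON s.order_id = o.order_id
"""

# One way to avoid the fan-out: test for existence instead of joining.
SAFE_SQL = """
    SELECT SUM(o.amount)
    FROM orders o
    WHERE EXISTS (SELECT 1 FROM shipments s WHERE s.order_id = o.order_id)
"""


def total(conn, sql):
    """Run a single-value aggregate query and return the value."""
    return conn.execute(sql).fetchone()[0]


conn = build_demo_db()
naive_total = total(conn, NAIVE_SQL)  # 250.0, inflated by the duplicated row
safe_total = total(conn, SAFE_SQL)    # 150.0
expected_total = total(conn, "SELECT SUM(amount) FROM orders")  # 150.0

print(naive_total, safe_total, expected_total)
```

A test only catches this if someone who understands the intent ("each order counts once") writes the check against `expected_total`. The code itself never signals that anything is wrong.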
I also disagree that we have no choice but to learn/use these tools and change how we work. Companies are already setting limits on tokens and realizing that maybe all this AI spend isn’t worth the money. I use coding tools for what I think they are good at, and that’s what every data scientist should do. There is no need to use them out of FOMO.
LLMs are just another layer of abstraction. There are areas where it's going to be important to understand exactly what the code is doing and areas where it's not, and you don't necessarily need to read the code to get that understanding.
Also, the baseline level of code and coding practices is so low in data science that issues of tool dependence or not understanding code strike me as secondary to the ability to use these tools to improve practices. In a context where lots of people are still coding in commented-out Jupyter notebooks, LLMs represent a huge opportunity for process improvements.