6 Comments
Frank Corrigan

Very nicely communicated => “learning Python” increasingly means learning to think about architecture and workflows: how do you want data to flow, what should update automatically, how should pieces connect, what should you test? You no longer need to master syntax to work in these ecosystems.

Dan Elton

Nobody is actually writing code anymore. So just have the AI write Python!

Abigail Haddad

I appreciate your much shorter version of my post!

Colm

Liked the examples of code provided - technically this post is “just use Python and SQL”! I agree with the idea that making the data and code public is essential nowadays. Reproducibility is a real issue as newspapers, for example, start to use AI to assist in data analysis pieces. I recently wrote to one publication to ask whether they had released their prompts and code publicly, assuming they would have a GitHub, and was informed that they considered this to be IP and wouldn’t share! Hard to know what to make of research in this sort of environment.

Matt

It's fucking embarrassing that people still use any of those tools. Python is the modern choice. R has more advanced statistical packages than any of the proprietary systems. There's just no excuse.

It should become popular to pass state laws that public university econ profs get fired if they use Stata and biostats/PH profs get fired if they use SAS.

Mamadou S Diallo

This piece captures something that’s often missed in the “Python vs. Stata/SAS” debate: the real difference isn’t syntax or even cost, it’s infrastructure.

What you describe with the OPM data is exactly the gap between availability and usability. Releasing hundreds of files via one-by-one downloads technically satisfies openness, but it doesn’t support reproducibility, auditing, or longitudinal workflows. Once you move into a Python-centric ecosystem, you’re no longer just “analyzing data” — you’re building a pipeline that can be rerun, inspected, and shared.

I also appreciate the emphasis on the broader lifecycle. In survey and official statistics work, the problem starts before analysis: questionnaires, skip logic, and metadata are still treated as static documents rather than versioned software artifacts. That’s where a lot of irreproducibility creeps in.

Python makes it possible to treat the entire chain — instrument design, data ingestion, validation, analysis, and publication — as one coherent system. That’s the motivation behind projects like svy (https://svylab.com/docs), not replacing statistics, but modernizing the workflow around them.

The most convincing part of your argument, to me, is that none of this is hypothetical anymore. The tooling exists, the sharing infrastructure exists, and AI assistance has lowered the switching cost dramatically. At this point, sticking with closed, non-programmable ecosystems is more a historical artifact than a technical necessity.