I’ve heard this refrain many times. However, the distinction never really made sense to me.
R and Python are merely programming languages. You don’t have to do stats in R and data processing in Python. You can do data processing in R, and statistics in Python.
What is R? It’s a programming language designed by statisticians, and so there are tons of one-liner functions to do stats easily.
What is Python? It’s a programming language that’s well designed for general-purpose computing, so it’s really expressive, and others can build tools on top of it.
What is data processing? I don’t think I can do justice to its definition here, but I’ll offer my own simple take: making data usable for other programming functions.
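To make that take concrete, here’s a minimal sketch in Python using only the standard library. The column names and the raw text are made up for illustration, and `clean_rows` is a hypothetical helper, not anything from a real project. It takes messy text and turns it into typed records that other functions can use directly.

```python
import csv
import io

# Made-up raw data: stray whitespace and one missing weight.
RAW = """name,height_cm,weight_kg
alice, 162 ,55.5
bob,180,
carol,171,68.0
"""

def clean_rows(text):
    """Parse CSV text into dicts with numeric fields, skipping incomplete rows."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        height = rec["height_cm"].strip()
        weight = rec["weight_kg"].strip()
        if not height or not weight:
            continue  # drop rows with a missing measurement
        rows.append({
            "name": rec["name"].strip(),
            "height_cm": float(height),
            "weight_kg": float(weight),
        })
    return rows

records = clean_rows(RAW)
print(records)
```

Nothing here is specific to Python; the same cleanup is just as doable in R.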
What is statistics? I think statistics, at its core, is really about describing/summarizing data, and figuring out how probable it is that our data came from some model of randomness. That’s all it is, and it’s all about playing with numbers, really. There’s nothing more than that. Technically, you can do statistics in any programming language, because technically, all programming languages deal with numbers...
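As a rough illustration of both halves of that description, here’s a sketch in plain Python with no stats packages beyond the standard library. The two groups are made-up numbers, and `summarize` and `permutation_p_value` are names I picked for this example. The first function describes the data; the second asks how often pure random relabeling of the pooled values would produce a difference in means at least as large as the one observed (a simple permutation test, which is one possible "model of randomness", not the only one).

```python
import random
import statistics

def summarize(values):
    """Describe a sample with a few summary numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }

def permutation_p_value(a, b, n_shuffles=5000, seed=0):
    """Estimate the share of random relabelings whose absolute mean
    difference is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / n_shuffles

# Made-up measurements, for illustration only.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]
group_b = [6.4, 6.9, 7.1, 6.2, 6.8, 7.3]

print(summarize(group_a))
print(summarize(group_b))
print(permutation_p_value(group_a, group_b))
```

It’s more typing than an R one-liner, but it’s the same playing with numbers.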
Which brings me to the point I want to make - as long as you have data, and you’re doing data science, you technically can use any language for it; the differences are not in the language itself, but in the ecosystem, ease-of-use, and other aspects.
Other bloggers have written about the benefits of using a single language, which include:
After all, as Wes McKinney wrote, the real problem isn't R vs. Python. It's the ability to move data seamlessly.