Faster iteration over dataframes

written by Eric J. Ma on 2020-09-07 | tags: data science pandas tricks tips productivity

If df.iterrows() is slow, what is the alternative? Read on to see how to make looping over dataframes roughly 1000X faster :).

I can't claim credit for this one, as I found it in a post on LinkedIn. That said, in case it's helpful for you, here's the tip:

We are usually taught to loop through a dataframe using df.iterrows():

for r, d in df.iterrows():
    print(d["column_name"])

Turns out, the faster way to loop over rows in a dataframe is:

for d in df.itertuples():
    print(d.column_name)
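
One detail worth knowing when switching to attribute access: the rows that .itertuples() yields are namedtuples, so the index shows up as a field called Index, and a column whose name is not a valid Python identifier gets a positional field name instead. Here is a minimal sketch; the dataframe and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data: "other col" contains a space, so it cannot be an attribute name.
df = pd.DataFrame({"column_name": [1, 2], "other col": ["x", "y"]})

first = next(df.itertuples())
print(first._fields)        # field names pandas assigned to the namedtuple
print(first.Index)          # the row's index label
print(first.column_name)    # regular attribute access
print(first[2])             # positional access works for any column

# index=False drops the Index field from each row.
first_no_index = next(df.itertuples(index=False))
print(first_no_index._fields)
```

If the column names are awkward, positional access (or renaming the columns before looping) is the safer route.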

According to the post, .itertuples() is on the order of 1000X faster than .iterrows().

The main reason for the speedup is that .itertuples() yields each row as a namedtuple. By contrast, .iterrows() yields (index, row) pairs in which the row d is a pandas Series, and a Series is much slower to construct on each loop iteration.
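
To see the difference for yourself, here is a rough timing sketch. The dataframe, its column names, and the helper functions are all hypothetical, and the actual speedup will depend on your data, pandas version, and machine, so treat the printed numbers as indicative only.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical example data; any dataframe would do.
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": np.arange(10_000), "b": rng.random(10_000)})

# Inspect what each method yields per row.
_, first_series = next(df.iterrows())   # a pandas Series
first_tuple = next(df.itertuples())     # a namedtuple
print(type(first_series), type(first_tuple))


def sum_with_iterrows(frame):
    total = 0.0
    for _, row in frame.iterrows():
        total += row["b"]
    return total


def sum_with_itertuples(frame):
    total = 0.0
    for row in frame.itertuples():
        total += row.b
    return total


def time_call(func, frame):
    # Wall-clock time for a single call; a crude but simple measurement.
    start = time.perf_counter()
    func(frame)
    return time.perf_counter() - start


t_rows = time_call(sum_with_iterrows, df)
t_tuples = time_call(sum_with_itertuples, df)
print(f"iterrows: {t_rows:.4f}s, itertuples: {t_tuples:.4f}s")
```

For more careful measurements, the timeit module or IPython's %timeit would give more stable numbers than a single perf_counter reading.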

Of course, one would usually try to vectorize as much as possible, but in the event that looping is unavoidable, this might be a good tip to keep in your back pocket!
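
For comparison, here is a small sketch of the same computation done with a loop and with a vectorized expression. The price and quantity columns are invented for this example.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"price": [1.5, 2.0, 3.25], "quantity": [2, 3, 4]})

# Looping with .itertuples(): fine when each row needs custom Python logic.
totals_loop = [row.price * row.quantity for row in df.itertuples(index=False)]

# Vectorized: one expression over whole columns, no Python-level loop.
totals_vectorized = df["price"] * df["quantity"]

print(totals_loop)
print(totals_vectorized.tolist())
```

When the per-row work can be written as column arithmetic like this, the vectorized form is usually the one to reach for first.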

Cite this blog post:

@article{
    ericmjl-2020-faster-iteration-over-dataframes,
    author = {Eric J. Ma},
    title = {Faster iteration over dataframes},
    year = {2020},
    month = {09},
    day = {07},
    howpublished = {\url{https://ericmjl.github.io}},
    journal = {Eric J. Ma's Blog},
    url = {https://ericmjl.github.io/blog/2020/9/7/faster-iteration-over-dataframes},
}
