written by Eric J. Ma on 2020-09-07 | tags: data science pandas tricks tips productivity
I can't claim credit for this one, as I found it here on LinkedIn. That said, in case it's helpful for you, here's the tip:
We are usually taught to loop through a dataframe using df.iterrows():
for r, d in df.iterrows():
    print(d["column_name"])
Turns out, the faster way to loop over rows in a dataframe is:
for d in df.itertuples():
    print(d.column_name)
According to the post, this is on the order of 1000X faster, though the actual speedup will vary with the dataframe's shape and dtypes.
The main reason for the speedup is that .itertuples() constructs a namedtuple for each row. By contrast, .iterrows() returns d as a pandas Series, which is slower to construct on each loop iteration.
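If you want to check this on your own machine, here is a minimal sketch that inspects what each method yields and times both loops. The toy dataframe, the second column "other", and the two helper functions are made up for illustration, and the timings you see will depend on your data and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical toy dataframe; "column_name" mirrors the snippets above.
df = pd.DataFrame(
    {
        "column_name": np.arange(2_000),
        "other": np.arange(2_000) * 2.0,
    }
)

# .iterrows() yields (index, Series) pairs.
index, row_series = next(df.iterrows())
# .itertuples() yields namedtuples; the index is exposed as the `Index` field.
row_tuple = next(df.itertuples())

print(type(row_series))
print(type(row_tuple), row_tuple._fields)


def sum_with_iterrows(frame):
    """Sum one column by looping with .iterrows()."""
    total = 0
    for _, d in frame.iterrows():
        total += d["column_name"]
    return total


def sum_with_itertuples(frame):
    """Sum the same column by looping with .itertuples()."""
    total = 0
    for d in frame.itertuples():
        total += d.column_name
    return total


# Time a few repetitions of each loop; absolute numbers are machine-dependent.
t_rows = timeit.timeit(lambda: sum_with_iterrows(df), number=3)
t_tuples = timeit.timeit(lambda: sum_with_itertuples(df), number=3)
print(f"iterrows:   {t_rows:.3f} s")
print(f"itertuples: {t_tuples:.3f} s")
```

One thing to keep in mind: attribute access like d.column_name only works when the column name is a valid Python identifier. pandas renames other columns to positional names in the namedtuple.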
Of course, one would usually try to vectorize as much as possible, but in the event that looping is unavoidable, this might be a good tip to keep in your back pocket!
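To make the vectorization point concrete, here is a small sketch with a made-up dataframe. It computes the same column total once with an .itertuples() loop and once with a single vectorized call, which is usually the better option when the operation allows it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe for illustration.
df = pd.DataFrame({"column_name": np.arange(1_000)})

# Row-by-row loop: fine when each step needs arbitrary Python logic.
loop_total = 0
for d in df.itertuples():
    loop_total += d.column_name

# Vectorized equivalent: one call that operates on the whole column.
vectorized_total = df["column_name"].sum()

print(loop_total, vectorized_total)
```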