
python 2.7 - Iterrows performance

Question:

I'm working with Python 2.7 and pandas (version 0.18.1) data frames.

I have to modify one column of a data frame based on several other columns in the same data frame.

My data frame is called data. Sample data looks like this:

+---+---+----+----+---+---------+---+----+----+---+----------+
| a | b | c  | d  | e | f       | g | h  | i  | j | discount |
+---+---+----+----+---+---------+---+----+----+---+----------+
| 0 |   |    |    |   | 65497.6 |   |    |    |   | 0        |
| 0 |   |    |    |   | 73882.8 |   |    |    |   | 0        |
| 0 |   |    |    |   | 88588   |   | 22 |    |   | 0        |
| 0 |   |    |    |   | 106480  |   | 20 | 10 |   | 0        |
| 0 |   |    |    |   | 52500   |   |    |    |   | 0        |
| 0 |   | 20 | 10 |   | 22997.5 |   |    |    |   | 0        |
|   |   |    |    |   |         |   |    |    |   | 0        |
| 0 |   |    | 20 |   | 0       |   |    |    |   | 0        |
| 0 |   |    |    |   | 10520   |   |    |    |   | 0        |
+---+---+----+----+---+---------+---+----+----+---+----------+

My code is:

columns1 = ['a','b','c','d','e']
columns2 = ['f','g','h','i','j']

data['discount'] = 0

for i, row in data.iterrows():
    a = 0
    b = 0
    # a: first positive value among columns1
    for col1 in columns1:
        value = row[col1]
        if value > 0:
            a = value
            break
    # b: first positive value among columns2
    for col2 in columns2:
        value = row[col2]
        if value > 0:
            b = value
            break
    # set the discount only when both groups have a positive value
    if a != 0 and b != 0:
        data.loc[i, 'discount'] = abs(a - b)

Done this way, it takes a lot of time and memory on a large dataset. I have 700 MB of data; processing it takes more than 120 GB of RAM, and after approximately 10 hours the process fails with a MemoryError.

According to http://stackoverflow.com/a/24871316, I should not use iterrows like this. Please let me know how I can write this code more efficiently.

Please also let me know the reason for downvoting my question, so that I can learn.

Answer:

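Assuming your empty cells are NaN values and df is your data frame (data in the question), this gives you the first positive value of each row for the group of columns you are interested in. df[df>0] turns every non-positive value into NaN, and bfill(axis=1) then pulls the first remaining value into the leftmost column: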

df[df>0][columns1].bfill(axis=1).iloc[:,0]

0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
5    20.0
6     NaN
7    20.0
8     NaN
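
To see why this works, the masking and back-filling steps can be traced on a tiny frame. This is only an illustrative sketch: the frame demo and its values are made up here and are not part of the question's data, and where(demo > 0) is used as what should be an equivalent spelling of demo[demo > 0].

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame, invented for illustration only.
demo = pd.DataFrame({
    'a': [0, 0, np.nan],
    'c': [np.nan, 20, np.nan],
    'd': [np.nan, 10, 5],
})

# Step 1: keep positive values, turn everything else (0, NaN) into NaN.
masked = demo.where(demo > 0)

# Step 2: back-fill along each row, so the first surviving value
# in a row is copied leftwards into the first column.
filled = masked.bfill(axis=1)

# Step 3: the first column now holds the first positive value per row,
# or NaN when the row has no positive value at all.
first_positive = filled.iloc[:, 0]
print(first_positive)
```

Rows with no positive value stay NaN, which is what later lets the subtraction leave those rows without a discount.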

Thus, this will give you the abs(a-b) you're searching for:

res = (df[df>0][columns1].bfill(axis=1).iloc[:,0]
      -df[df>0][columns2].bfill(axis=1).iloc[:,0]).abs()
res

0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
5    22977.5
6        NaN
7        NaN
8        NaN

You can either combine it with your initialized discount column:

res.combine_first(df.discount)

or fill the blanks:

res.fillna(0)
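
Putting the pieces together, a minimal end-to-end sketch might look like the following. The sample frame is rebuilt by hand from the table in the question, with empty cells assumed to be NaN. The helper names first_positive and compute_discount are hypothetical, introduced here only for illustration.

```python
import numpy as np
import pandas as pd

columns1 = ['a', 'b', 'c', 'd', 'e']
columns2 = ['f', 'g', 'h', 'i', 'j']

nan = np.nan
# Sample data from the question; empty cells are assumed to be NaN.
df = pd.DataFrame(
    [
        [0, nan, nan, nan, nan, 65497.6, nan, nan, nan, nan],
        [0, nan, nan, nan, nan, 73882.8, nan, nan, nan, nan],
        [0, nan, nan, nan, nan, 88588, nan, 22, nan, nan],
        [0, nan, nan, nan, nan, 106480, nan, 20, 10, nan],
        [0, nan, nan, nan, nan, 52500, nan, nan, nan, nan],
        [0, nan, 20, 10, nan, 22997.5, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
        [0, nan, nan, 20, nan, 0, nan, nan, nan, nan],
        [0, nan, nan, nan, nan, 10520, nan, nan, nan, nan],
    ],
    columns=columns1 + columns2,
)


def first_positive(frame, cols):
    """Return, per row, the first value > 0 among cols (NaN if none)."""
    sub = frame[cols]
    # Mask non-positive values, then back-fill along the row so the
    # first positive value ends up in the leftmost column.
    return sub.where(sub > 0).bfill(axis=1).iloc[:, 0]


def compute_discount(frame):
    """abs(a - b) where both groups have a positive value, else 0."""
    a = first_positive(frame, columns1)
    b = first_positive(frame, columns2)
    # NaN on either side propagates through the subtraction,
    # and fillna(0) maps those rows to a discount of 0.
    return (a - b).abs().fillna(0)


df['discount'] = compute_discount(df)
print(df['discount'])
```

Only the row at index 5 has a positive value in both column groups, so it is the only row that receives a non-zero discount. Selecting just the needed columns before masking, as done here, also avoids building a masked copy of the whole frame, which may matter on a 700 MB dataset.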