当前位置: 动力学知识库 > 问答 > 编程问答 >

python - normalize column in pandas dataframe based on value in another column

问题描述:

I want to normalize the values in one column of a pandas dataframe based on the value in another column. It's not a pure normalization in a statistical sense. The second value is a type; I want to sum all the first values for each type, then, in each row, divide the value by the total for the type of that row. An example should make this clearer.

df = pd.read_table(datafile, names = ["A", "B", "value", "type"])

A B value type

0 A1 B1 1 type1

1 A2 B2 1 type1

2 A1 B1 1 type2

3 A1 B3 1 type3

4 A2 B2 1 type2

5 A2 B4 1 type3

6 A3 B4 1 type2

7 A3 B5 1 type3

8 A4 B6 1 type2

9 A4 B7 1 type3

Then I can find the sums with something like:

types = df.groupby(["type"])["value"].sum()

type

type1 2

type2 4

type3 4

Name: value, dtype: int64

Then how do I use this to normalize the value in each row?

I can use a loop like this to calculate the normalized values:

norms = []

for ix, row in df.iterrows():

norms.append(row["value"]/types[row["type"]])

And then replace the column with a new one with these values:

df["value"] = pd.Series(norms)

A B value type

0 A1 B1 0.50 type1

1 A2 B2 0.50 type1

2 A1 B1 0.25 type2

3 A1 B3 0.25 type3

4 A2 B2 0.25 type2

5 A2 B4 0.25 type3

6 A3 B4 0.25 type2

7 A3 B5 0.25 type3

8 A4 B6 0.25 type2

9 A4 B7 0.25 type3

But as I understand, using loops like this is not very efficient or proper, and there is likely a way to do it using some standard pandas functions.

Thanks.

网友答案:

You could use transform, which performs an operation on each group and then expands the result back up to match the original index. For example"

>>> df["value"] /= df.groupby("type")["value"].transform(sum)
>>> df
    A   B  value   type
0  A1  B1   0.50  type1
1  A2  B2   0.50  type1
2  A1  B1   0.25  type2
3  A1  B3   0.25  type3
4  A2  B2   0.25  type2
5  A2  B4   0.25  type3
6  A3  B4   0.25  type2
7  A3  B5   0.25  type3
8  A4  B6   0.25  type2
9  A4  B7   0.25  type3

because we have

>>> df.groupby("type")["value"].transform(sum)
0    2
1    2
2    4
3    4
4    4
5    4
6    4
7    4
8    4
9    4
dtype: int64
网友答案:

I think the best way to achieve this is by using the .apply() method on your groupby object:

# Using backslashes for explicit line continuation, not seen
#   that often in Python but useful in pandas when you're
#   chaining a lot of methods one after the other
df['value_normed'] = df.groupby('type', group_keys=False)\
    .apply(lambda g: g['value'] / g['value'].sum())
df
Out[9]: 
    A   B  value   type  value_normed
0  A1  B1      1  type1          0.50
1  A2  B2      1  type1          0.50
2  A1  B1      1  type2          0.25
3  A1  B3      1  type3          0.25
4  A2  B2      1  type2          0.25
5  A2  B4      1  type3          0.25
6  A3  B4      1  type2          0.25
7  A3  B5      1  type3          0.25
8  A4  B6      1  type2          0.25
9  A4  B7      1  type3          0.25

You need the group_keys=False argument so that type doesn't become the index for each group's data, which prevents you from matching the transformed values back up to your original dataframe easily.

分享给朋友:
您可能感兴趣的文章:
随机阅读: