
python - How to append columns based on other column values to pandas dataframe

Problem description:

I have the following problem: I want to append columns to a dataframe. These columns should be the unique values found in another column of this dataframe, filled with the number of occurrences of each value in that row. It looks like this:

df:

   Column1 Column2
0        1   a,b,c
1        2     a,e
2        3       a
3        4     c,f
4        5     c,f

What I am trying to get is:

   Column1 Column2  a  b  c  e  f
0        1   a,b,c  1  1  1
1        2     a,e  1        1
2        3       a  1
3        4     c,f        1     1
4        5     c,f        1     1

(the empty cells can be NaN or 0; it doesn't matter.)

I have now written some code to achieve this, but instead of appending columns, it appends rows, so that my output looks like this:

   Column1 Column2
0        1   a,b,c
1        2     a,e
2        3       a
3        4     c,f
4        5     c,f
a        1       1
b        1       1
c        1       1
e        1       1
f        1       1

The code looks like this:

def NewCols(x):
    for i, value in df['Column2'].iteritems():
        listi = value.split(',')
        for value in listi:
            string = value
            x[string] = listi.count(string)
    return x

df1 = df.apply(NewCols)

What I am trying to do here is to iterate through each row of the dataframe and split the string contained in Column2 (e.g. "a,b,c") at the commas, so that the variable listi is a list of the separated string values. For each of these values I then want to make a new column and fill it with the number of occurrences of that value in listi. I am confused why the code appends rows instead of columns. Does somebody know why, and how I can correct that?
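For a single Column2 value, the splitting and counting step by itself does what I expect:

>>> listi = 'a,b,c'.split(',')
>>> listi
['a', 'b', 'c']
>>> listi.count('a')
1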

Answer:

While we could do this using get_dummies, we can also cheat and use pd.value_counts directly:

>>> df = pd.DataFrame({'Column1': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'Column2': {0: 'a,b,c', 1: 'a,e', 2: 'a', 3: 'c,f', 4: 'c,f'}})
>>> df.join(df.Column2.str.split(",").apply(pd.value_counts).fillna(0))
   Column1 Column2  a  b  c  e  f
0        1   a,b,c  1  1  1  0  0
1        2     a,e  1  0  0  1  0
2        3       a  1  0  0  0  0
3        4     c,f  0  0  1  0  1
4        5     c,f  0  0  1  0  1

Step-by-step, we have

>>> df.Column2.str.split(",")
0    [a, b, c]
1       [a, e]
2          [a]
3       [c, f]
4       [c, f]
dtype: object
>>> df.Column2.str.split(",").apply(pd.value_counts)
    a   b   c   e   f
0   1   1   1 NaN NaN
1   1 NaN NaN   1 NaN
2   1 NaN NaN NaN NaN
3 NaN NaN   1 NaN   1
4 NaN NaN   1 NaN   1
>>> df.Column2.str.split(",").apply(pd.value_counts).fillna(0)
   a  b  c  e  f
0  1  1  1  0  0
1  1  0  0  1  0
2  1  0  0  0  0
3  0  0  1  0  1
4  0  0  1  0  1
>>> df.join(df.Column2.str.split(",").apply(pd.value_counts).fillna(0))
   Column1 Column2  a  b  c  e  f
0        1   a,b,c  1  1  1  0  0
1        2     a,e  1  0  0  1  0
2        3       a  1  0  0  0  0
3        4     c,f  0  0  1  0  1
4        5     c,f  0  0  1  0  1
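
As hinted at the start, the get_dummies route is just as short: Series.str.get_dummies splits on the separator and builds one 0/1 indicator column per value, which matches the counts here because no value repeats within a row:

>>> df.join(df.Column2.str.get_dummies(sep=','))
   Column1 Column2  a  b  c  e  f
0        1   a,b,c  1  1  1  0  0
1        2     a,e  1  0  0  1  0
2        3       a  1  0  0  0  0
3        4     c,f  0  0  1  0  1
4        5     c,f  0  0  1  0  1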
Answer:

When you use apply, pandas calls your function once for each column, passing that column (a Series) as the argument. So x inside NewCols is a single column. When you do x[string] = listi.count(string), you are adding new index entries to that column. Since apply is called once per column, those entries get appended to both columns, and they show up as the extra rows you see in your result.
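
You can verify this column-wise calling convention with a quick check (illustrative only, using the df from the question):

>>> df.apply(lambda col: col.name)
Column1    Column1
Column2    Column2
dtype: object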

apply is not the right choice when your computation depends only on the values of a single column. Instead, use map. In this case, what you need to do is write a NewCol function that accepts a single Column2 value and returns the data for a single row. You can return this as a dict, or, handily, a dict-like object such as a collections.Counter. Then you need to wrap this new row data into a DataFrame and attach it column-wise to your existing data using concat. Here is an example:

import collections
import pandas

def NewCols(val):
    return collections.Counter(val.split(','))

>>> pandas.concat([df, pandas.DataFrame.from_records(df.Column2.map(NewCols))], axis=1)
   Column1 Column2   a   b   c   e   f
0        1   a,b,c   1   1   1 NaN NaN
1        2     a,e   1 NaN NaN   1 NaN
2        3       a   1 NaN NaN NaN NaN
3        4     c,f NaN NaN   1 NaN   1
4        5     c,f NaN NaN   1 NaN   1

For this particular computation, you actually don't need to write your own function at all, because pandas has split built in as an operation under the .str method accessor. So you can do this:

>>> pandas.concat([df, pandas.DataFrame.from_records(df.Column2.str.split(',').map(collections.Counter))], axis=1)
   Column1 Column2   a   b   c   e   f
0        1   a,b,c   1   1   1 NaN NaN
1        2     a,e   1 NaN NaN   1 NaN
2        3       a   1 NaN NaN NaN NaN
3        4     c,f NaN NaN   1 NaN   1
4        5     c,f NaN NaN   1 NaN   1
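
If you would rather have 0 than NaN in the empty cells (the question allows either), you can tack on fillna(0); since fillna leaves float columns, this sketch also casts back to int:

>>> counts = pandas.DataFrame.from_records(df.Column2.str.split(',').map(collections.Counter))
>>> pandas.concat([df, counts.fillna(0).astype(int)], axis=1)
   Column1 Column2  a  b  c  e  f
0        1   a,b,c  1  1  1  0  0
1        2     a,e  1  0  0  1  0
2        3       a  1  0  0  0  0
3        4     c,f  0  0  1  0  1
4        5     c,f  0  0  1  0  1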
Answer:

You could use something like this:

import pandas as pd
import sklearn.feature_extraction.text

# binary=True yields 0/1 indicators; the custom token_pattern also matches single-character tokens
vect = sklearn.feature_extraction.text.CountVectorizer(binary=True, token_pattern=u'(?u)\\b\\w+\\b')
df = ...
v = [a for a in df['Column2']]
new_df = df.combine_first(pd.DataFrame(vect.fit_transform(v).todense(), columns=vect.get_feature_names()))
print(new_df)

Cheers!
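
A note on versions: get_feature_names() was removed in newer scikit-learn releases, where get_feature_names_out() is used instead. Filling in the elided df with the sample data from the question, the snippet would run roughly like this (a sketch assuming a recent scikit-learn):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# sample data from the question
df = pd.DataFrame({'Column1': [1, 2, 3, 4, 5],
                   'Column2': ['a,b,c', 'a,e', 'a', 'c,f', 'c,f']})

vect = CountVectorizer(binary=True, token_pattern=u'(?u)\\b\\w+\\b')
counts = pd.DataFrame(vect.fit_transform(df['Column2']).toarray(),
                      columns=vect.get_feature_names_out())  # get_feature_names() on older scikit-learn
new_df = df.combine_first(counts)
print(new_df)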
