
python - Pandas Shaping Data for Covariance

Problem description:

I need to conduct a simple covariance analysis in a time series. My raw data comes in the shape like this:

WEEK_END_DATE               TITLE_SHORT          SALES
2012-02-25 00:00:00.000000  "Bob" (EBK)          1
                            "Bob" (EBK)          1
2012-03-31 00:00:00.000000  "Bob" (EBK)          1
                            "Bob" (EBK)          1
2012-03-03 00:00:00.000000  "Sally" (EBK)        1
2012-03-10 00:00:00.000000  "Sally" (EBK)        1
2012-03-17 00:00:00.000000  "Sally" (EBK)        1
                            "Sally" (EBK)        1
2012-04-07 00:00:00.000000  "Sally" (EBK)        1

As you can see, there are some duplicates. Unless I'm missing something, I need this data to become a set of vectors for each title, so that I can use numpy.cov.

Question:

How do I find rows that are duplicated on date and title and aggregate them by sum? I've been trying pandas groupby on WEEK_END_DATE and TITLE_SHORT, but the result comes out indexed in a way that I don't understand.

EDIT:

To be specific, when I try df.groupby(["WEEK_END_DATE", "TITLE_SHORT"]), I get this:

>df.ix[0:3]
WEEK_END_DATE               TITLE_SHORT
2012-02-04 00:00:00.000000  'SALEM'S LOT (EBK)            <pandas.core.indexing._NDFrameIndexer object a...
                            'TIS THE SEASON! (EBK)        <pandas.core.indexing._NDFrameIndexer object a...
                            (NOT THAT YOU ASKED) (EBK)    <pandas.core.indexing._NDFrameIndexer object a...
dtype: object

and trying to select df.ix[1,] gets this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/series.py", line 613, in __getitem__
    return self.index.get_value(self, key)
  File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/index.py", line 1630, in get_value
    loc = self.get_loc(key)
  File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/index.py", line 2285, in get_loc
    result = slice(*self.slice_locs(key, key))
  File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/index.py", line 2226, in slice_locs
    start_slice = self._partial_tup_index(start, side='left')
  File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/index.py", line 2250, in _partial_tup_index
    raise Exception('Level type mismatch: %s' % lab)
Exception: Level type mismatch: 3

Answer:

I'm not entirely certain I know what's going on, but here's what I'd start with. First, get the data (which looks fixed-width to me):

>>> df = pd.read_fwf("weekend.dat", widths=(26, 20, 9), parse_dates=[0])
>>> df = df.fillna(method="ffill")
>>> df
        WEEK_END_DATE    TITLE_SHORT  SALES
0 2012-02-25 00:00:00    "Bob" (EBK)      1
1 2012-02-25 00:00:00    "Bob" (EBK)      1
2 2012-03-31 00:00:00    "Bob" (EBK)      1
3 2012-03-31 00:00:00    "Bob" (EBK)      1
4 2012-03-03 00:00:00  "Sally" (EBK)      1
5 2012-03-10 00:00:00  "Sally" (EBK)      1
6 2012-03-17 00:00:00  "Sally" (EBK)      1
7 2012-03-17 00:00:00  "Sally" (EBK)      1
8 2012-04-07 00:00:00  "Sally" (EBK)      1

Then aggregate the dups:

>>> g = df.groupby(["WEEK_END_DATE", "TITLE_SHORT"]).sum().reset_index()
>>> g
        WEEK_END_DATE    TITLE_SHORT  SALES
0 2012-02-25 00:00:00    "Bob" (EBK)      2
1 2012-03-03 00:00:00  "Sally" (EBK)      1
2 2012-03-10 00:00:00  "Sally" (EBK)      1
3 2012-03-17 00:00:00  "Sally" (EBK)      2
4 2012-03-31 00:00:00    "Bob" (EBK)      2
5 2012-04-07 00:00:00  "Sally" (EBK)      1
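
As an aside on the "Level type mismatch" error: after the groupby the result carries a (WEEK_END_DATE, TITLE_SHORT) MultiIndex, so a bare positional label like df.ix[1,] most likely doesn't match the level types; the reset_index() above flattens that back into ordinary columns. If you'd rather keep the MultiIndex, a minimal sketch of selecting from it (using the sample data above) would be:

>>> s = df.groupby(["WEEK_END_DATE", "TITLE_SHORT"])["SALES"].sum()
>>> s[(pd.Timestamp("2012-03-17"), '"Sally" (EBK)')]  # one (week, title) cell
2
>>> s.xs('"Sally" (EBK)', level="TITLE_SHORT")        # the weekly vector for one title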

And then do whatever cov stuff you need to (note that cov is a Series/DataFrame/GroupBy method too, so you shouldn't need to call np.cov specifically).
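
For example, if you want the per-title vectors mentioned in the question, one way (just a sketch, assuming you want weeks as rows, one column per title, and weeks with no recorded sales counted as 0) is to pivot the aggregated frame and take the covariance of its columns:

>>> wide = g.pivot(index="WEEK_END_DATE", columns="TITLE_SHORT", values="SALES").fillna(0)
>>> wide.cov()                     # pairwise covariance between titles
>>> import numpy as np
>>> np.cov(wide.values, rowvar=0)  # same idea via numpy, columns as variables

Whether filling the missing weeks with 0 (rather than dropping them or aligning only on weeks both titles have) is the right call depends on what the analysis is supposed to mean, so treat that fillna(0) as an assumption.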
