
python - How to filter this data in pandas data frame or numpy array?

Problem description:

I'm trying to plot performance metrics of various assets in a back test.

I have imported 'test_predictions.json' into a pandas DataFrame. It is a list of dictionaries and contains results for various assets (listed one after the other).

Here is a sample of the data:

trading_pair           return             timestamp    prediction
[u'Poloniex_ETH_BTC'   0.003013302628677  1450753200L  -0.157053292753482]
[u'Poloniex_ETH_BTC'   0.006013302628677  1450753206L  -0.187053292753482]
...
[u'Poloniex_FCT_BTC'   0.006013302628677  1450753100L   0.257053292753482]

Each backtest starts and ends at different times.

Here is the data for the assets of interest:

'''
#These are the assets I would like to analyse
Poloniex_DOGE_BTC 2015-10-21 02:00:00 1445392800
Poloniex_DOGE_BTC 2016-01-12 05:00:00 1452574800
Poloniex_XRP_BTC  2015-10-28 06:00:00 1446012000
Poloniex_XRP_BTC  2016-01-12 05:00:00 1452574800
Poloniex_XMR_BTC  2015-10-21 14:00:00 1445436000
Poloniex_XMR_BTC  2016-01-12 06:00:00 1452578400
Poloniex_VRC_BTC  2015-10-25 07:00:00 1445756400
Poloniex_VRC_BTC  2016-01-12 00:00:00 1452556800
'''

So I'm trying to make a new array that contains the data for these assets. Each asset must be sliced appropriately so they all start from the latest start time and end at the earliest end time (otherwise there will be incomplete data).

#each array should start and end:
#start 2015-10-28 06:00:00
#end   2016-01-12 00:00:00

So the question is:

How can I search for an asset, i.e. Poloniex_DOGE_BTC, and then acquire the indices for the start and end times specified above?

I will be plotting the data via NumPy, so maybe it's better to turn it into a NumPy array with df.values and then conduct the search? Then I could use np.hstack((df_index_asset1, df_index_asset2)) so it's in the right form to plot. So the problem is: using either pandas or NumPy, how do I retrieve the data for the specified assets that falls within the master start and end times?
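Roughly what I had in mind for the NumPy route is below; it's just a sketch, with the column names guessed from the sample above, and not tested:

import numpy as np
import pandas as pd

df = pd.read_json('test_predictions.json')

# Boolean mask per asset, then drop down to a plain NumPy array for plotting.
doge = df[df['trading_pair'] == 'Poloniex_DOGE_BTC'][['timestamp', 'return']].values
xrp = df[df['trading_pair'] == 'Poloniex_XRP_BTC'][['timestamp', 'return']].values

# Once both arrays cover exactly the same time window they can be stacked side by side.
combined = np.hstack((doge, xrp))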

On a side note, here is the code I wrote to get the start and end dates; it's not the most efficient, so improving that would be a bonus.

EDIT:

From Kartik's answer, I tried to obtain just the data for the asset 'Poloniex_DOGE_BTC' using the following code:

import pandas as pd
import numpy as np

preds = 'test_predictions.json'
df = pd.read_json(preds)

asset = 'Poloniex_DOGE_BTC'
grouped = df.groupby(asset)
print grouped

But this throws an error.

EDIT2: I have changed the link to the data so it is test_predictions.json

EDIT3: this worked a treat:

preds = 'test_predictions.json'
df = pd.read_json(preds)

asset = 'Poloniex_DOGE_BTC'
grouped = df.groupby('market_trading_pair')
print grouped.get_group(asset)

#each array should start and end:
#start 2015-10-28 06:00:00 1446012000
#end   2016-01-12 00:00:00 1452556800

Now how can we truncate the data so that it starts and ends at the above timestamps?

Answer:

Firstly, why write it like this?

data = pd.read_json(preds).values
df = pd.DataFrame(data)

You can just write that as:

df = pd.read_json(preds)

And if you want a NumPy array from df then you can execute data = df.values later.

And it should put the data in a DataFrame. (Unless I am much mistaken, because I have never used read_json() before.)

The second thing is getting the data for each asset out. For that, I am assuming you need to process all assets. To do that, you can simply do:

# To convert it to datetime.
# This is not important, and you can skip it if you want, because epoch times in
# seconds will work perfectly well with the rest of the method.
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')

# This will give you a group for each asset on which you can apply some function.
# We will apply min and max to get the desired output.
grouped = df.groupby('trading_pair')  # Where 'trading_pair' is the name of the column that has the asset names
start_times = grouped['timestamp'].min()
end_times = grouped['timestamp'].max()

Now start_times and end_times will be Series. The index of each Series will be your asset names, and the values will be the minimum and maximum times respectively.

I think this is the answer you are looking for, from my understanding of your question. Please let me know if that is not the case.
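For example, to collapse those two Series into the single common window described in the question (latest start, earliest end), a minimal sketch, assuming the code above has already run:

# The latest start and the earliest end define the window where every asset has data.
master_start = start_times.max()
master_end = end_times.min()
print master_start, master_end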

EDIT

If you are specifically looking for a few (one or two or ten) assets, you can modify the above code like so:

asset = ['list', 'of', 'required', 'assets']  # Even one element is fine.
req_df = df[df['trading_pair'].isin(asset)]

grouped = req_df.groupby('trading_pair')  # Where 'trading_pair' is the name of the column that has the asset names
start_times = grouped['timestamp'].min()
end_times = grouped['timestamp'].max()
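
And once you have the latest start and the earliest end from those, truncating is just a boolean mask on the timestamp column. A sketch, under the same column-name assumptions as above:

# Latest start / earliest end across the chosen assets.
master_start = start_times.max()
master_end = end_times.min()

# Keep only the rows that fall inside the common window.
in_window = (req_df['timestamp'] >= master_start) & (req_df['timestamp'] <= master_end)
trimmed = req_df[in_window]

# A single asset can then be pulled back out, e.g. as a NumPy array for plotting.
doge = trimmed[trimmed['trading_pair'] == 'Poloniex_DOGE_BTC'].values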



As an aside, plotting datetimes from Pandas is very convenient as well. I use it all the time to produce most of the plots I create. And all of my data is timestamped.
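
For instance, with the timestamp column converted via to_datetime as above, something as simple as this gives one line per asset (a rough sketch; it assumes matplotlib is installed and that 'return' is the column you want to plot):

import matplotlib.pyplot as plt

# One line per asset, indexed by datetime, using pandas' built-in plotting.
for name, group in req_df.groupby('trading_pair'):
    group.set_index('timestamp')['return'].plot(label=name)

plt.legend()
plt.show()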
