# python - Efficient way to calculate selected differences in an array

I have two arrays as an output from a simulation script where one contains IDs and one times, i.e. something like:

```
ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])
```

These arrays are always of the same size. Now I need to calculate the differences of `times`, but only between entries with the same id. Of course, I can simply loop over the different `ids` and do

```
for id in np.unique(ids):
    diffs = np.diff(times[ids == id])
    print(diffs)
    # do stuff with diffs
```

However, this is quite inefficient and the two arrays can be very large. Does anyone have a good idea on how to do that more efficiently?

You can use `array.argsort()` and ignore the values corresponding to changes in `ids`:

```
>>> id_ind = ids.argsort(kind='mergesort')
>>> times_diffs = np.diff(times[id_ind])
>>> times_diffs
array([ 0.2, -0.2,  0.3,  0.6, -1.1,  1.2])
```

To see which values you need to discard, you could use a Counter to count the number of times per id (`from collections import Counter`)

or just sort `ids` and see where its diff is nonzero: these are the indices where the id changes, and where your time diffs are irrelevant:

```
times_diffs[np.diff(ids[id_ind]) == 0]  # ids[id_ind] is the sorted ids
```

and finally you can split this array with `np.split` and `np.where` (each chunk after the first then starts with an irrelevant boundary diff, which can be dropped):

```
np.split(times_diffs, np.where(np.diff(ids[id_ind]) != 0)[0])
```
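As a sketch of the `Counter` idea mentioned above (the variable names besides those from the answer are hypothetical), the cumulative counts of each id, taken in sorted id order, give the positions where a new id starts, i.e. which entries of `times_diffs` to discard:

```python
import numpy as np
from collections import Counter

ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])

id_ind = ids.argsort(kind='mergesort')
times_diffs = np.diff(times[id_ind])

# Count occurrences per id; the cumulative sums (in sorted id order)
# are the indices in the sorted array where a new id starts.
counts = Counter(ids)
boundaries = np.cumsum([counts[k] for k in sorted(counts)])[:-1]

# times_diffs[b - 1] straddles two different ids for each boundary b,
# so those entries are the ones to discard.
valid = np.ones(len(times_diffs), dtype=bool)
valid[boundaries - 1] = False
```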

As you mentioned in your comment, `argsort()`'s default algorithm (quicksort) might not preserve the order between equal values, so the `argsort(kind='mergesort')` option must be used.
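Putting it all together, here is a minimal self-contained sketch of this approach; splitting the sorted times first and diffing each chunk sidesteps the boundary bookkeeping entirely (`starts`, `chunks`, and `diffs_per_id` are hypothetical names):

```python
import numpy as np

ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])

# Stable sort so equal ids keep their original time order.
inds = ids.argsort(kind='mergesort')
sorted_ids = ids[inds]

# Indices in the sorted arrays where a new id starts.
starts = np.flatnonzero(np.diff(sorted_ids) != 0) + 1

# Split the sorted times into one chunk per id, then diff each chunk.
chunks = np.split(times[inds], starts)
unique_ids = sorted_ids[np.concatenate(([0], starts))]
diffs_per_id = {uid: np.diff(chunk) for uid, chunk in zip(unique_ids, chunks)}
```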

Say you `np.argsort` by `ids`:

```
>>> inds = np.argsort(ids, kind='mergesort')
>>> inds
array([1, 3, 2, 4, 5, 0, 6])
```

Now sort `times` by this, apply `np.diff`, and prepend a `nan`:

```
>>> diffs = np.concatenate(([np.nan], np.diff(times[inds])))
>>> diffs
array([ nan,  0.2, -0.2,  0.3,  0.6, -1.1,  1.2])
```

These differences are correct except at the boundaries between ids. Let's find those:

```
>>> boundaries = np.concatenate(([False], ids[inds][1:] == ids[inds][:-1]))
>>> boundaries
array([False,  True, False,  True,  True, False,  True], dtype=bool)
```

Now we can just do

```
>>> diffs[~boundaries] = np.nan
```

Let's see what we got:

```
>>> ids[inds]
array([0, 0, 1, 1, 1, 2, 2])

>>> times[inds]
array([ 0.3,  0.5,  0.3,  0.6,  1.2,  0.1,  1.3])

>>> diffs
array([ nan,  0.2,  nan,  0.3,  0.6,  nan,  1.2])
```
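The whole walkthrough can be collected into one small helper (a sketch; `grouped_diffs` is a hypothetical name, and the returned arrays are in sorted-by-id order, not the original order):

```python
import numpy as np

def grouped_diffs(ids, times):
    """Return (inds, diffs): the stable sort order by id, and the
    differences of times[inds], with nan wherever the id changes."""
    inds = np.argsort(ids, kind='mergesort')
    diffs = np.concatenate(([np.nan], np.diff(times[inds])))
    # True where the entry has the same id as its predecessor,
    # i.e. where the diff is taken within a single id.
    same_id = np.concatenate(([False], ids[inds][1:] == ids[inds][:-1]))
    diffs[~same_id] = np.nan
    return inds, diffs

ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])
inds, diffs = grouped_diffs(ids, times)
```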

The numpy_indexed package (disclaimer: I am its author) contains efficient and flexible functionality for this kind of grouping operation:

```
import numpy_indexed as npi
unique_ids, diffed_time_groups = npi.group_by(keys=ids, values=times, reduction=np.diff)
```

Unlike pandas, it does not require a specialized data structure just to perform this kind of rather elementary operation.

I'm adding another answer since, even though these things are possible in `numpy`, I think the higher-level `pandas` is much more natural for them.

In `pandas`, you could do this in one step, after creating a DataFrame:

```
df = pd.DataFrame({'ids': ids, 'times': times})

df['diffs'] = df.groupby('ids')['times'].transform(pd.Series.diff)
```

This gives:

```
>>> df
   ids  times  diffs
0    2    0.1    NaN
1    0    0.3    NaN
2    1    0.3    NaN
3    0    0.5    0.2
4    1    0.6    0.3
5    1    1.2    0.6
6    2    1.3    1.2
```
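Since `diff` is itself available on a pandas groupby object, the transform step can also be written more directly (a small variation on the above, not from the original answer); `groupby(...).diff()` differences within each group and aligns the result with the original row order, leaving `NaN` at group starts:

```python
import numpy as np
import pandas as pd

ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])
df = pd.DataFrame({'ids': ids, 'times': times})

# Per-group diff, aligned with the original (unsorted) rows.
df['diffs'] = df.groupby('ids')['times'].diff()
```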