问题描述:

I have an 1D array of numbers, and want to calculate all pairwise euclidean distances. I have a method (thanks to SO) of doing this with broadcasting, but it's inefficient because it calculates each distance twice. And it doesn't scale well.

Here's an example that gives me what I want with an array of 1000 numbers.

`import numpy as np`

import random

r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])

dists = np.abs(r - r[:, None])

What's the fastest implementation in scipy/numpy/scikit-learn that I can use to do this, given that it has to scale to situations where the 1D array has >10k values.

Note: the matrix is symmetric, so I'm guessing that it's possible to get at least a 2x speedup by addressing that, I just don't know how.

Neither of the other answers quite answered the question - 1 was in Cython, one was slower. But both provided very useful hints. Following up on them suggests that `scipy.spatial.distance.pdist`

is the way to go.

Here's some code:

```
import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]
def option1(r):
dists = np.abs(r - r[:, None])
def option2(r):
dists = scipy.spatial.distance.pdist(r, 'cityblock')
def option3(r):
dists = sklearn.metrics.pairwise.manhattan_distances(r)
```

Timing with IPython:

```
In [36]: timeit option1(r)
100 loops, best of 3: 5.31 ms per loop
In [37]: timeit option2(c)
1000 loops, best of 3: 1.84 ms per loop
In [38]: timeit option3(c)
100 loops, best of 3: 11.5 ms per loop
```

I didn't try the Cython implementation (I can't use it for this project), but comparing my results to the other answer that did, it looks like `scipy.spatial.distance.pdist`

is roughly a third slower than the Cython implementation (taking into account the different machines by benchmarking on the np.abs solution).

Here is a Cython implementation that gives more than 3X speed improvement for this example on my computer. This timing should be reviewed for bigger arrays tough, because the BLAS routines can probably scale much better than this rather naive code.

I know you asked for something inside scipy/numpy/scikit-learn, but maybe this will open new possibilities for you:

File `my_cython.pyx`

:

```
import numpy as np
cimport numpy as np
import cython
cdef extern from "math.h":
double abs(double t)
@cython.wraparound(False)
@cython.boundscheck(False)
def pairwise_distance(np.ndarray[np.double_t, ndim=1] r):
cdef int i, j, c, size
cdef np.ndarray[np.double_t, ndim=1] ans
size = sum(range(1, r.shape[0]+1))
ans = np.empty(size, dtype=r.dtype)
c = -1
for i in range(r.shape[0]):
for j in range(i, r.shape[0]):
c += 1
ans[c] = abs(r[i] - r[j])
return ans
```

The answer is a 1-D array containing all non-repeated evaluations.

To import into Python:

```
import numpy as np
import random
import pyximport; pyximport.install()
from my_cython import pairwise_distance
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)], dtype=float)
def solOP(r):
return np.abs(r - r[:, None])
```

Timing with IPython:

```
In [2]: timeit solOP(r)
100 loops, best of 3: 7.38 ms per loop
In [3]: timeit pairwise_distance(r)
1000 loops, best of 3: 1.77 ms per loop
```

Using half the memory, but 6 times slower than `np.abs(r - r[:, None])`

:

```
triu = np.triu_indices(r.shape[0],1)
dists2 = abs(r[triu[1]]-r[triu[0]])
```

您可能感兴趣的文章：

- vbscript - can't copy files from appdata folder
- php - Return multiple count rows number in mysql
- driver - Linux device file is missing
- mysql - Working with sorl-thumbnail and Django Admin
- MongoDB conditionally $addToSet sub-document in array by specific field
- php - Reading a specific line from a text file
- clr - Any implementation of an Unrolled Linked List in C#?
- Finding Hudson Log Files
- Forward to a payment-gateway together with POST data using cURL (or any other PHP server side solution)
- WCF in Winforms app - is it always single-threaded?

随机阅读：

**推荐内容**-

**热点内容**