# python - Efficient Empirical CDF Computation / Storage

I'm trying to precompute the distributions of several random variables. In particular, these random variables are the results of functions evaluated at locations in a genome, so there will be on the order of 10^8 or 10^9 values for each. The functions are pretty smooth, so I don't think I'll lose much accuracy by only evaluating at every 2nd/10th/100th? base or so, but regardless there will be a large number of samples. My plan is to precompute quantile tables (maybe percentiles) for each function and reference these in the execution of my main program to avoid having to compute these distribution statistics in every run.

But I don't really see how I can easily do this: storing, sorting, and reducing an array of 10^9 floats isn't really feasible, but I can't think of another way that doesn't lose information about the distribution. Is there a way of measuring the quantiles of a sample distribution that doesn't require storing the whole thing in memory?

I agree with @katriealex's comment: ask someone w/ a strong statistics background.

You could easily evaluate min/max/mean/std deviation w/o needing to store any significant amount of memory. (note for mean + std deviation: use Knuth's technique:

``````delta = x - m[n-1]
m[n] = m[n-1] +  1/n * delta
S[n] = S[n-1] + (x[n] - m[n])*delta
mean = m[n]
std dev = sqrt(S[n]/n)
``````

This prevents you from floating point overflow/underflow problems encountered in the naive calculation of std dev, e.g. taking S1 = the sum of x[k] and S2 = the sum of x[k]^2 and trying to calculate std deviation = sqrt(S2/N - S1^2/N^2). See also Wikipedia.)

There are probably other stream-oriented algorithms for computing higher characteristic moments of the distribution, but I don't know what they are.

Or alternatively, you could also use histogramming techniques with enough bins to characterize the distribution.