
csv - histogram in gnuplot vs histogram in unix utilities

Question:

I have a CSV file and want to create a histogram from column 6. With Linux utilities this is simple:

$ cut -f6 -d, data.csv | sort | uniq -c | sort -k2,2n

563 0.0
72 0.025
35 0.05
22 0.075
14 0.1
21 0.125
14 0.15
10 0.175
5 0.2
3 0.225
7 0.25
3 0.275
6 0.3
5 0.325
3 0.35
1 0.375
3 0.4
1 0.425
3 0.45
3 0.475
5 0.5
7 0.525
11 0.55
3 0.575
4 0.6
3 0.625
11 0.65
5 0.675
9 0.7
5 0.725
7 0.75
8 0.775
5 0.8
3 0.825
3 0.85
4 0.875
2 0.9
1 0.925
1 0.975
109 1.0

But I would like to plot it with gnuplot. My attempt was to modify a script that I found; this is my modified version:

#!/usr/bin/gnuplot -p
# http://psy.swansea.ac.uk/staff/carter/gnuplot/gnuplot_frequency.htm
clear
reset
set datafile separator ",";
# set term dumb
set key off
set border 3

# Add a vertical dotted line at x=0 to show centre (mean) of distribution.
set yzeroaxis

# Each bar is half the (visual) width of its x-range.
set boxwidth 0.05 absolute
set style fill solid 1.0 noborder

bin_width = 0.1;
bin_number(x) = floor(x/bin_width)
rounded(x) = bin_width * ( bin_number(x) + 0.5 )

# MAKE BINS
# plot dataset_path using (rounded($6)):(6) smooth frequency with boxes

# DO NOT MAKE BINS
plot "data.csv" using 6:6 smooth frequency with boxes

This is the result: http://oi57.tinypic.com/x1acrm.jpg

It shows something completely different from the Unix tools. I've seen various types of histograms in gnuplot: some follow a normal distribution pattern, others are ordered by frequency (as if I replaced the last sort -k2,2n with sort -n), and others are ordered by the values the histogram was built from (my case). It would be nice if I could choose.

Answer:

smooth frequency makes the data monotonic in x (i.e. the value given in the first using column, in your case the numerical value from column 6) and then sums up all y-values (the values given in the second using column) that share the same x.

In your script the second using column is also the sixth column, which is wrong if you want to count the number of occurrences of each distinct value in the sixth column. Use using 6:(1) instead, i.e. the constant 1 as the y-value, so that the sum is the actual number of occurrences of each value:

set style fill solid noborder
set boxwidth 0.8 relative
set datafile separator ','
plot 'nupic_out.csv' using 6:(1) smooth frequency with boxes notitle
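
To see why the two using specifications give such different pictures, here is a minimal Python sketch of the summing step described above. It is an illustration only, not gnuplot's actual implementation; the helper name smooth_frequency and the sample values are made up for this example.

```python
from collections import defaultdict

def smooth_frequency(points):
    """Sum the y-values of all points sharing the same x; return pairs sorted by x."""
    totals = defaultdict(float)
    for x, y in points:
        totals[x] += y
    return sorted(totals.items())

# Hypothetical sample of column-6 values (not taken from the real file).
column6 = [0.0, 0.0, 0.0, 0.025, 0.025, 1.0, 1.0]

# using 6:6   -> y is the value itself, so each bar is value * count
as_in_question = smooth_frequency((v, v) for v in column6)

# using 6:(1) -> y is the constant 1, so each bar is the plain count
as_in_answer = smooth_frequency((v, 1) for v in column6)

print(as_in_question)
print(as_in_answer)
```

Applied to the counts from the question, using 6:6 would give the 563 zeros a bar of height 0 (563 times 0.0) and the 109 ones a bar of height 109, which is why that plot disagrees with the uniq -c output.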

To apply a logscale to the smoothed data, you must first save it to a temporary file with set table ...; plot and then plot this temporary file:

set datafile separator ','
set table 'tmp.dat'
plot 'nupic_out.csv' using 6:(1) smooth frequency with lines
unset table

Here you must pay attention, because a bug in gnuplot adds a wrong last line to the output file, which you must skip. You can do this either with a filter in the using statement, e.g.

plot 'tmp.dat' using (strcol(3) eq "i" ? $1 : 1/0):2 with boxes

which works fine here, or you could use head to cut off the last two lines:

plot '< head -n-2 tmp.dat' using 1:2 with boxes

Another point to note is that gnuplot always writes its data files with whitespace as the separator, so you must change the data file separator back to whitespace before plotting tmp.dat.

A full working script could be

set style fill solid noborder
set boxwidth 0.8 relative
set datafile separator ','

set table 'tmp.dat'
plot 'nupic_out.csv' using 6:(1) smooth frequency with lines notitle
unset table

set datafile separator whitespace
set logscale y
set yrange [0.8:*]
set autoscale xfix
plot '< head -n-2 tmp.dat' using 1:2 with boxes notitle

To use a binning function for the values in the sixth column, replace the 6 in using 6:(1) with a function that operates on the value in the sixth column. The function call must be enclosed in (), and inside it you reference the current value of the sixth column as $6:

plot 'nupic_out.csv' using (bin($6)):(1) smooth frequency with lines

Again, a full working script, using ChrisW's binning function, could be

set style fill solid noborder
set datafile separator ','

set boxwidth 0.09 absolute
Min = -0.05
Max = 1.05
n = 11.0
width = (Max-Min)/n
bin(x) = width*(floor((x-Min)/width)+0.5) + Min

set table 'tmp.dat'
plot 'nupic_out.csv' using (bin($6)):(1) smooth frequency with lines notitle
unset table

set datafile separator whitespace
set logscale y
set xrange [-0.05:1.05]
set tics nomirror out
plot '< head -n-2 tmp.dat' using 1:2 with boxes notitle
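
As a rough check of what the binning function does, the following Python sketch mirrors bin(x) with the same Min, Max and n and counts a few values per bin centre. The sample values are hypothetical, and rounding the centres to 9 decimals is only there to merge floating-point near-duplicates in this sketch; gnuplot does not do that.

```python
import math
from collections import Counter

# Same parameters as in the gnuplot script.
MIN, MAX, N = -0.05, 1.05, 11.0
WIDTH = (MAX - MIN) / N

def bin_center(x):
    """Return the centre of the bin that contains x (mirrors gnuplot's bin(x))."""
    return WIDTH * (math.floor((x - MIN) / WIDTH) + 0.5) + MIN

# Hypothetical sample values, chosen to lie well inside their bins.
sample = [0.0, 0.025, 0.075, 0.5, 0.525, 1.0]

# Count how many sample values fall into each bin, keyed by bin centre.
counts = Counter(round(bin_center(v), 9) for v in sample)
for center, count in sorted(counts.items()):
    print(center, count)
```

With these parameters the bins are 0.1 wide and centred on 0.0, 0.1, ..., 1.0, so 0.0 and 0.025 end up in the same bar. Values such as 0.05 or 0.15 lie exactly on a bin boundary, so which side they land on may depend on floating-point rounding.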
