I wanted to have a look at the Julia language, so I wrote a little script to import a dataset I'm working with. But when I run and profile the script, it turns out to be much slower than a similar script in R.
Profiling tells me that all the cat commands perform badly.
The files look like this:
I primarily want to get the data_strings and split them up into a matrix of single characters.
This is a more or less minimal code example:
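    f = open("/file1")
    first = true
    m = Array(Any, 1,0)
    for ln in eachline(f)
        # Skip comment lines, empty lines and lines starting with '/'
        if ln[1] != '#' && ln[1] != '\n' && ln[1] != '/'
            s = split(ln[1:end-1])
            s = split(s[2],"")  # assumes the data string is the second field
            if first
                m = reshape(s,1,length(s))
                first = false
            else
                s = reshape(s,1,length(s))
                m = vcat(m, s)
            end
        end
    end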
Any idea why Julia might be slow with the cat command, or how I can do it differently?
Thanks for any suggestions!
Using cat like that is slow because it requires a lot of memory allocations. Every time we call vcat we allocate a whole new array m, which is mostly the same as the old m. Here is how I'd rewrite your code in a more Julian way, where m is only created at the end:
    function loadfile2()
        f = open("./sotest.txt","r")
        lines = Any[]
        for ln in eachline(f)
            # Skip comment lines, empty lines and lines starting with '/'
            if ln[1] == '#' || ln[1] == '\n' || ln[1] == '/'
                continue
            end
            # Assumes the data string is the second space-separated field
            data_str = split(ln[1:end-1]," ")[2]
            data_chars = split(data_str,"")
            # Can make even faster (2x in my tests) with
            # data_chars = [data_str[i] for i in 1:length(data_str)]
            # But this inherently assumes ASCII data
            push!(lines, data_chars)
        end
        close(f)
        m = hcat(lines...)'  # Stick column vectors together then transpose
    end
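For anyone reading this on Julia 1.x, where the `Array(Any, 1, 0)` constructor no longer exists, the same "collect the rows first, build the matrix once" idea might look roughly like the sketch below. The function name `load_chars` is made up for illustration, and the code assumes that the data string is the second whitespace-separated field on each line and that all data strings have the same length.

```julia
# A sketch for Julia 1.x. `load_chars` is a hypothetical name, and the
# "data string is the second field" layout is an assumption about the file.
function load_chars(path::AbstractString)
    rows = Vector{Vector{Char}}()
    for ln in eachline(path)              # eachline drops the trailing newline
        # Skip empty lines and lines starting with '#' or '/'
        (isempty(ln) || ln[1] == '#' || ln[1] == '/') && continue
        fields = split(ln)                # split on whitespace
        length(fields) >= 2 || continue   # no data field on this line
        push!(rows, collect(fields[2]))   # one Char per future column
    end
    isempty(rows) && return Matrix{Char}(undef, 0, 0)
    # Build the matrix once: each row vector becomes a column, then flip.
    return permutedims(reduce(hcat, rows))
end
```

Calling `m = load_chars("./sotest.txt")` should then give a `Matrix{Char}` with one row per data line. If the data strings differ in length, `hcat` throws a `DimensionMismatch`, so ragged input would need padding first.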
I made a 10,000 line version of your example data and found the following performance:
Old version:

    elapsed time: 3.937826405 seconds (3900659448 bytes allocated, 43.81% gc time)
    elapsed time: 3.581752309 seconds (3900645648 bytes allocated, 36.02% gc time)
    elapsed time: 3.57753696 seconds (3900645648 bytes allocated, 37.52% gc time)

New version:

    elapsed time: 0.010351067 seconds (11568448 bytes allocated)
    elapsed time: 0.011136188 seconds (11568448 bytes allocated)
    elapsed time: 0.010654002 seconds (11568448 bytes allocated)
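If you want to try a comparison like this without the original file, a rough sketch in Julia 1.x syntax could use synthetic rows instead. The row count, row width and helper names here are arbitrary assumptions, and the timings will differ from the numbers above.

```julia
# Synthetic rows standing in for the real data (sizes are arbitrary).
rows = [rand('a':'z', 50) for _ in 1:2_000]

# Grows the matrix one row at a time: every vcat copies everything so far.
function grow_with_vcat(rows)
    m = Matrix{Char}(undef, 0, length(first(rows)))
    for r in rows
        m = vcat(m, permutedims(r))   # permutedims turns the vector into a 1×n row
    end
    return m
end

# Combines all rows in a single step at the end.
build_once(rows) = permutedims(reduce(hcat, rows))

# Call each function once first so compilation is not part of the timing.
grow_with_vcat(rows); build_once(rows)

@time a = grow_with_vcat(rows)
@time b = build_once(rows)
a == b   # both approaches should produce the same matrix
```

The repeated `vcat` version copies the whole matrix on every iteration, so its cost grows roughly quadratically with the number of rows, while the build-once version touches each element a small, fixed number of times.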