Idea. Read several files line by line, concatenate them, and process the list of lines from all files.
Implementation. It can be written like this:
    import qualified Data.ByteString.Char8 as B
    import System.Environment (getArgs)

    readFiles :: [FilePath] -> IO B.ByteString
    readFiles = fmap B.concat . mapM B.readFile

    main = do
      files <- getArgs
      allLines <- readFiles files
      -- processing of allLines is not shown here
Problem. This runs unbearably slowly. Notably, the real and user times are orders of magnitude higher than the system time (measured with the UNIX time command), so I suppose the problem is that too much time is spent in IO.
I haven't managed to find a simple and effective way to solve this problem in Haskell.
For instance, processing two files (30,000 lines and 1.2 MB each) takes:
20.98 real 18.52 user 0.25 sys
This is the output when running
         157,972,000 bytes allocated in the heap
           6,153,848 bytes copied during GC
           5,716,824 bytes maximum residency (4 sample(s))
           1,740,768 bytes maximum slop
                  10 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max pause
      Gen  0       295 colls,     0 par    0.01s    0.01s     0.0000s    0.0006s
      Gen  1         4 colls,     0 par    0.00s    0.00s     0.0010s    0.0019s

      INIT    time    0.00s  (  0.01s elapsed)
      MUT     time   16.09s  ( 16.38s elapsed)
      GC      time    0.01s  (  0.02s elapsed)
      EXIT    time    0.00s  (  0.00s elapsed)
      Total   time   16.11s  ( 16.41s elapsed)

      %GC     time       0.1%  (0.1% elapsed)

      Alloc rate    9,815,312 bytes per MUT second

      Productivity  99.9% of total user, 98.1% of total elapsed

    16.41 real 16.10 user 0.12 sys
Why is concatenating files with the code above so slow?
How should I write the readFiles function in Haskell to make it faster?
You should show us exactly what your processing steps are.
This program is very performant even when run on multiple input files of the kind you are using (1.2 MB, 30k lines each):
    import Control.Monad
    import Data.List
    import System.Environment
    import qualified Data.ByteString.Char8 as B

    readFiles :: [FilePath] -> IO B.ByteString
    readFiles = fmap B.concat . mapM B.readFile

    main = do
      files <- getArgs
      allLines <- readFiles files
      print $ foldl' (\s _ -> s+1) 0 (B.words allLines)
Here is how I created the input file:
    import Control.Monad

    main = do
      forM_ [1..30000] $ \i -> do
        putStrLn $ unwords ["line", show i, "this is a test of the emergency"]
    time ./program input              -- 27 milliseconds
    time ./program input input        -- 49 milliseconds
    time ./program input input input  -- 69 milliseconds
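Since the stated goal is a list of lines from all files, here is a minimal sketch of a line-oriented variant. The names linesOfAll and readLines are hypothetical, and counting the lines with length is only a placeholder for the processing step that the question does not show. The sketch splits each file into lines before joining the results, so a file that lacks a trailing newline cannot have its last line fused with the first line of the next file, which is what happens if B.concat runs first.

```haskell
import qualified Data.ByteString.Char8 as B
import System.Environment (getArgs)

-- Split each file's contents into lines separately, then join the lists.
-- (Hypothetical helper, not part of the original program.)
linesOfAll :: [B.ByteString] -> [B.ByteString]
linesOfAll = concatMap B.lines

-- Read every file strictly and return the lines of all files in order.
-- (Hypothetical replacement for readFiles.)
readLines :: [FilePath] -> IO [B.ByteString]
readLines paths = linesOfAll <$> mapM B.readFile paths

main :: IO ()
main = do
  files <- getArgs
  allLines <- readLines files
  -- Placeholder processing step: count the lines.
  print (length allLines)
```

If a variant like this is also fast on your inputs, the time is probably going into the processing step rather than into reading or concatenating the files.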