Haskell: Data.Text vs. Data.Text.Lazy Performance

Question

For training purposes I wrote a short Haskell program as a replacement for a Perl script. The program reads a log file containing multi-line messages and simply joins them to produce one line per message. My test input file has 14,000 lines and a size of 1 MB. The version using Data.Text has a runtime of 3.5 seconds; the one using Data.Text.Lazy only 0.1 seconds (the original Perl script needs 0.2 seconds). I have read in other posts that Data.Text.Lazy only pays off for really large amounts of data, so I didn't expect such a difference. Can anybody explain why, or tell me what's wrong with my program?

The relevant part of the source (the only difference between the two versions is the Data.Text* imports):

{-# LANGUAGE OverloadedStrings #-}
module Main where
import Data.Char (isDigit)
import Data.Monoid ((<>))
import qualified Data.Text.Lazy    as T
import qualified Data.Text.Lazy.IO as T
import System.Environment (getArgs, getProgName)
import System.IO (openFile, stdin, stdout, Handle,
                  IOMode(ReadMode,WriteMode))

main :: IO ()
main = do
  (inp,out) <- getInput
  actLog <- T.hGetContents inp
  let newLog = processLog actLog
  T.hPutStr out newLog

processLog :: T.Text -> T.Text
processLog = foldr joinLines "" . T.lines

joinLines :: T.Text -> T.Text -> T.Text
joinLines elem accu
  | T.length elem == 0    = accu                     -- Blank Line
  | T.head elem   == ' '  = textElem <> accu         -- Continuation
  | isDigit (T.head elem) = "\n" <> textElem <> accu -- Start
  | otherwise             = accu                     -- Garbage
    where textElem = T.strip elem
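
(getInput is not shown in the post. A minimal version consistent with the imports above could look like the following hypothetical reconstruction; the unused WriteMode and getProgName imports suggest the real version also opens an output file and prints a usage message.)

-- Hypothetical reconstruction, not the original code: read from the file
-- named on the command line, or fall back to stdin; always write to stdout.
getInput :: IO (Handle, Handle)
getInput = do
  args <- getArgs
  case args of
    [name] -> do
      inp <- openFile name ReadMode
      return (inp, stdout)
    _ -> return (stdin, stdout)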

Tags: haskell, performance | 2017-01-06 09:01 | 2 Answers

Answers (2)

  1. 2017-01-06 09:01

    The difference between the lazy and the strict Data.Text is whether the entire file is read into memory at once by

    actLog <- T.hGetContents inp 
    

    When processing lazily, Haskell reads only the lines directly required to produce the output. So instead of reading the entire file and then processing and writing it, the program can read, process and write as needed, eliminating the wait for the entire file to be read into memory.
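
    To see the difference in isolation, here is a small sketch (not from the original answer; the file name is made up). The strict copy cannot write anything until the whole file is in memory, while the lazy copy interleaves reading and writing chunk by chunk:

    import qualified Data.Text.IO      as TS  -- strict Text IO
    import qualified Data.Text.Lazy.IO as TL  -- lazy Text IO

    -- Strict: readFile forces the entire file into one buffer before
    -- putStr can start writing.
    copyStrict :: FilePath -> IO ()
    copyStrict path = TS.readFile path >>= TS.putStr

    -- Lazy: chunks are read on demand as putStr forces them, so reading
    -- and writing are interleaved.
    copyLazy :: FilePath -> IO ()
    copyLazy path = TL.readFile path >>= TL.putStr

    main :: IO ()
    main = copyLazy "messages.log"  -- hypothetical input file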

  2. 2017-01-06 13:01

    This looks like a data structures issue rather than a laziness issue. A strict Text is essentially a single big chunk of memory, while a lazy Text is essentially a linked list of strict Texts ("chunks"). The way that a lazy Text is split up into chunks isn't supposed to be part of the meaning of the value (which is just the text obtained by concatenating all the chunks). But it can have a big effect on the cost of operations.
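
    Concretely, the lazy type is defined in Data.Text.Internal.Lazy (of the text package) as a cons-list of strict chunks, roughly:

    import qualified Data.Text as S  -- strict Text

    -- Paraphrased from Data.Text.Internal.Lazy: a lazy Text is either
    -- empty or one strict chunk followed by the rest of the text.
    data Text = Empty
              | Chunk !S.Text Text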

    You have a bunch of operations of the form short <> accu, where accu grows with the size of your output. If these are strict Texts, this concatenation has to copy the entire contents of both short and accu into a new strict Text value. The total runtime is necessarily quadratic.

    If they are lazy Texts, then <> has another option: it can prepend the list of chunks that is short to the list of chunks that is accu. This doesn't need to touch accu at all, but even if it did, the spine of the linked list of chunks that makes up accu could be much less data to process than the entire textual contents of accu, depending on the chunk size. That's why your lazy Text version is so much faster.
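
    To make the asymptotics concrete, here is a sketch of the same right fold over both types (illustrative, not from the original answer): the strict version does quadratic copying work, while the lazy version only conses chunks and stays linear.

    {-# LANGUAGE OverloadedStrings #-}
    import Data.Monoid ((<>))
    import qualified Data.Text      as TS
    import qualified Data.Text.Lazy as TL

    -- Strict: each (<>) allocates a fresh buffer and copies both operands,
    -- so the growing accumulator is recopied at every step: O(n^2) overall.
    buildStrict :: [TS.Text] -> TS.Text
    buildStrict = foldr (\line accu -> line <> "\n" <> accu) ""

    -- Lazy: (<>) only links the chunk lists, so each line is consed onto
    -- the result without copying the accumulator: O(n) overall.
    buildLazy :: [TL.Text] -> TL.Text
    buildLazy = foldr (\line accu -> line <> "\n" <> accu) ""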

    It looks like you could write your program in the form

    processLog = T.unlines . processLines . T.lines
    processLines :: [T.Text] -> [T.Text]
    processLines = ...
    

    which leaves the problem of how to concatenate to the library function T.unlines. You can then expect it to be efficient whether you use strict Text or lazy Text, since a single concatenation of all the pieces can compute the final size once and copy each piece only once.
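
    For instance, a processLines adapted from the question's joinLines logic could look like this (a sketch, not part of the original answer; it drops any stray continuation text that precedes the first start line):

    import Data.Char (isDigit)
    import Data.Monoid ((<>))
    import qualified Data.Text as T  -- works the same with Data.Text.Lazy

    processLines :: [T.Text] -> [T.Text]
    processLines = snd . foldr step (T.empty, [])
      where
        -- pending accumulates stripped continuation text until the start
        -- line of its message is reached (foldr visits lines right to left)
        step line (pending, msgs)
          | T.null line           = (pending, msgs)                             -- blank line
          | T.head line == ' '    = (T.strip line <> pending, msgs)             -- continuation
          | isDigit (T.head line) = (T.empty, (T.strip line <> pending) : msgs) -- message start
          | otherwise             = (pending, msgs)                             -- garbage

    With T.unlines doing the one final concatenation, this runs in roughly linear time with either the strict or the lazy type.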
