Sorting text

Text sorting is a classic problem. It turns out to be incredibly useful, as a command line tool, and a poor implementation could be unusable, even if it performed to its specification.

K&R broke this chapter into two implementations: a simpler, in-memory sort program, and a out-of-core sort that uses temporary files to reduce its memory footprint and improve its performance (successfully, even in Haskell).

Following along, here is the initial attempt:


  simplesort - simple in-memory line sorter




  simplesort reads lines from the standard input, sorts them into
  ascending lexicographic order, then writes them back to its standard

The original read the incoming text into a buffer, arranging an array containing offsets of the beginning of each line in the text. This array was then sorted and each line printed in order. K&P used a quicksort to sort the array. (By the way, I have a peculiar question about the original: The rquick procedure has:

if (i - lo < hi - i) then begin
    rquick(lo, i-1);
    rquick(i+1, hi)
else begin
    rquick(i+1, hi);
    rquick(lo, i-1)

as the recursive phase. What the heck is with the reversible order of the recursive operations? It doesn't seem to be described in the text.)

simplesort in Haskell is much, much simpler:

sortLines :: String -> String
sortLines = unlines . sort . lines

main = interact sortLines

The Haskell Prelude includes, in addition to the lines (String -> [String], breaking the incoming text into lines) and unlines ([String] -> String, reassembling the lines into a single block of text) functions, a List sorting function, sort. GHC currently uses a merge sort, which is tolerably fast even in this context.

simplesort, however, has one major problem: it requires the input to fit into memory. Given the overhead of Haskell's the garbage collector, that restriction is almost as problematic on modern hardware as it would be in Pascal on K&P's hardware.

Sorting big files


  sort - sort text lines




  sort sorts its input into ascending lexicographic order.  Two lines
  are in order if they are identical or if the leftmost character
  position in which they differ contains characters which are in
  order, using the internal numeric representation of characters.  If
  a line is a proper prefix of another line, it precedes that line in
  sort order.

  sort writes intermediate data to files named stemp#, where # is a
  small decimal digit string; these filenames should be avoided.


  To print the sorted output of a program:

    program | sort | print

The Haskell approach to I/O makes an external sort significantly different from the previous basic internal sort. The main function is built from three IO actions:

main :: IO ()
main = do
       n <- doRuns
       n' <- mergeRuns n
       showRun n'

The first step reads the input, breaking it into bite-sized chunks. These runs are sorted and written to temporary files. The first step returns an integer, where temporary files numbered from 1 to n-1 were created. The second step merges the n files, producing a file identified by n'. This step deletes the temporary files after they have been merged. The final step prints the contents of the final temporary file to standard out and deletes it.

Chunking input

doRuns uses another standard library function, getContents. This lazily reads stdin, producing a String:

doRuns :: IO Int
doRuns = do
         text <- getContents
         doRuns' 0 text

The contents of stdin are broken into approximately 64k chunks, continuing until the completion of the final line. Each run is sorted, using the same function as simplesort, and then written to a temporary file.

runSize = 2^16

doRuns' :: Int -> String -> IO Int
doRuns' i [] = return i
doRuns' i text = do
                 putTemp i $ sortLines this
                 doRuns' (i+1) rest
    (this,rest) = splitAtNextLine runSize text

splitAtNextLine sz text = (pre ++ eol, rest')
    (pre,suf) = splitAt sz text
    (eol,rest) = span (/= '\n') suf
    rest' = if (null rest) then rest else tail rest

Temporary files

Temporary file handling is relatively simple, with the file named "stemp" followed by the identifying number.

tmpname i = "stemp" ++ (show i)

putTemp :: Int -> String -> IO ()
putTemp i text = do
                hnd <- mustopen (tmpname i) IO.WriteMode
                hPutStr hnd text
                hClose hnd


In order to fix the memory used to merge the temporary files, each temporary file is read one line at a time, and a limited number of files are merged at once. Each incoming line (and the remainder of the file) is held in a priority queue until it is written to a new temporary.

Another fancy data structure

This is a leftist heap, courtesy of Chris Okasaki, Purely Functional Data Structures. The heap is used to provide the priority queue to manage the temporary files during the merge phase, by allowing easy access to the next line to merge.

data Heap a = Ord a => Empty | Node (a,Int,Heap a,Heap a)

rankH :: Heap a -> Int
rankH Empty = 0
rankH (Node (_,r,_,_)) = r

makeNodeH :: Ord a => a -> Heap a -> Heap a -> Heap a
makeNodeH x a b = let ra = rankH a
                      rb = rankH b
                  in if ra >= rb
                     then Node (x, rb + 1, a, b)
                     else Node (x, ra + 1, b, a)

emptyH :: Ord a => Heap a
emptyH = Empty

isEmptyH :: Ord a => Heap a -> Bool
isEmptyH Empty = True
isEmptyH _ = False

mergeH :: Ord a => Heap a -> Heap a -> Heap a
mergeH h Empty = h
mergeH Empty h = h
mergeH h1@(Node (x,_,a1,b1)) h2@(Node (y,_,a2,b2))
           = case compare x y of GT -> makeNodeH y a2 (mergeH h1 b2)
                                 _  -> makeNodeH x a1 (mergeH b1 h2)

insertH :: Ord a => a -> Heap a -> Heap a
insertH x h = mergeH (Node (x,1,Empty,Empty)) h

findMinH Empty = error "Heap: findMinH Empty"
findMinH (Node (x,_,_,_)) = x

deleteMinH Empty = error "Heap: deleteMinH Empty"
deleteMinH (Node (_,_,a,b)) = mergeH a b

Since the contents of each temporary file is read and converted to lines lazily, the first remaining line is separated and kept exposed, while the rest of the list is unevaluated until it is needed. These two elements make up a SF structure (String-File, maybe?), which is made an instance of Ord in order to be stored in the heap.

Insertions into the heap require some slight special support: an empty list of lines is ignored, returning the heap unchanged, while a non-empty list has the first line read to build the SF.

data SF = SF (String, [String])

instance Eq SF where
    (SF l) == (SF r) = (fst l) == (fst r)

instance Ord SF where
    compare (SF l) (SF r) = compare (fst l) (fst r)

insertSFHeap :: [String] -> Heap SF -> Heap SF
insertSFHeap [] heap = heap
insertSFHeap (h:t) heap = insertH (SF (h,t)) heap

mergeRuns goes over the 0..(i-1) temporary files, in groups of 8, writing the merged output to file i. When the 8 files are merged, they are removed, the new temporary is appended to the list of temporaries to be processed, and mergeRuns' recurses. This process completes when only one temporary is left.

The function merge itself inserts the texts from the temporary files into the priority queue and then pulls each line out in turn.

mergeorder = 8

mergeRuns :: Int -> IO Int
mergeRuns i = mergeRuns' [0..(i-1)] i

mergeRuns' :: [Int] -> Int -> IO Int
mergeRuns' []  _ = error "no runs to merge"
mergeRuns' [i] _ = return i
mergeRuns' files i = do
                     fds <- mapM (\i -> mustopen IO.ReadMode $ tmpname i) files'
                     texts <- mapM hGetContents fds
                     putTemp i $ merge texts
                     mapM_ hClose fds
                     mapM_ (removeFile . tmpname) files'
                     mergeRuns' (files'' ++ [i]) (i+1)
    (files',files'') = splitAt mergeorder files

merge :: [String] -> String
merge texts = unlines $ merge' texts''
    texts' = map lines texts
    texts'' = foldr insertSFHeap emptyH texts'

merge' :: Heap SF -> [String]
merge' h | isEmptyH h = []
         | otherwise  = min : (merge' rest)
    SF (min,minS) = findMinH h
    rest = insertSFHeap minS $ deleteMinH h


The final step is to copy the contents of the final temporary file to standard output and remove it:

showRun :: Int -> IO ()
showRun n = do
            fd <- mustopen IO.ReadMode fn
            hGetContents fd >>= putStr
            hClose fd
            removeFile fn
    fn = tmpname n


Throughout sort, I have used a pair of problematic functions: getContents and hGetContents. These two functions return Strings which can be evaluated a character at a time to read the contents of the file. The problem is the relation of this lazy input to the strict file manipulations. In particular, if any input not read before the file is closed will not be available. For example, the following two functions produce different output:

f1 = do
  fd <- openFile "/etc/motd" IO.ReadMode
  f <- hGetContents fd
  putStrLn f
  hClose fd

f2 = do
  fd <- openFile "/etc/motd" IO.ReadMode
  f <- hGetContents fd
  hClose fd
  putStrLn f

The first prints the contents of the file; the second prints an empty line. (Things get more complicated if reading and writing are mixed.)

sort requires lazy I/O to limit its memory usage, and avoids the difficulties with laziness. For example, when merging the IO action which consumes the text output is invoked before the IO action which closes the input file. As a result, the ordering of the IO monad ensures that the operations take place in the correct order. The key is that I have taken care to ensure that in no case does an expression involving the input text "leak" past the operation closing the input file.

Separation of function: unique

K&R use the unique program to make a point about separation of function: sorting and unique-ifying are commonly used together, and

Why should there be two separate programs when a single slightly more complicated one will do?

The answer is that on the one hand, the unique program is useful on its own, and on the other, the two programs implement different functionality, and

In its early stages, at least, a program should implement a single function.

  unique - delete adjacent duplicate lines




  unique writes to its output only the first line from each group of
  adjacent identical input lines.  It is most useful for text that has
  been sorted to bring identical lines together; in this case it
  passes through only unique instances of input lines.


  To eliminate duplicate lines in the output of a program:

    program | sort | unique

The unique program itself is fairly simple, since it only works on adjacent lines.

unique :: String -> String
unique text = unlines $ unique' $ lines text
      unique' :: [String] -> [String]
      unique' (l:l':ls) | l == l'   =     unique' (l':ls)
      unique' (l:ls)                = l : unique'     ls
      unique' []                    = []

main = interact $ unique

Permuted index

One use for a text sorting program, particularly for someone writing a book, is to produce an index. A particularly fancy form of index is a "keyword-in-context" or permuted index, which presents the index entries in the textural context that they appear. The kwic.hs and unrotate.hs programs produce a permuted index of every word of the input, sorted and rearranged so that the keywords line up.



  kwic - produce lines for KWIC index




  kwic writes one or more "folded" versions of each line to its
  output.  A line is "folded" at the beginning of each nonwhitespace
  string within the line by writing from that string through the end
  of the line, followed by the fold character, #, followed by the
  beginning of the line.

  kwic is used with sort and unrotate to produce a Key Word In
  Context, or KWIC, index.


    $ echo "This is a test." | runhaskell kwic.hs  | runhaskell simplesort.hs
    This is a test.#
    a test.#This is
    is a test.#This
    test.#This is a

The program reads input lines, producing all possible rotations of the words on the line. The end of the original line is marked in each output line by a '#'. To do that, it uses two functions:

  • doLines breaks the input into lines, maps the rotate across each line and concatenates the result into a list of new lines, then uses unlines to produce the final String.
  • rotate breaks a line into words and finds all possible breaks of the line into a prefix and suffix of the words. Ignoring the possibility of the empty suffix, it joins the suffix followed by the "#" character followed by the prefix, producing all of the relevant rotations of the line.
rotate :: String -> [String]
rotate line = map splice $ init $ zip (inits ws) (tails ws)
      splice (pre,suf) = unwords suf ++ "#" ++ unwords pre
      ws = words line

doLines :: String -> String
doLines = unlines . concatMap rotate . lines

main = interact $ doLines


Following the production of the rotated lines and sorting, an unrotate program is used to produce the final output.


  unrotate - format lines for a KWIC index




  unrotate reads its input a line at a time and writes an "unfolded"
  version to its output.. A line is "folded" if it contains within it
  an instance of the fold character #; "unfolding" involves writing
  from the end of the line down to but not including the fold
  character, starting in column 39 of the output line, wrapping
  characters that would thus appear before column 1 around to the end
  of the line, then writing the remainder of the line starting at
  column 41 and wrapping around at column 80 if necessary.

  unrotate is used with kwic and sort to produce a Key Word In
  Context, or KWIC, index.


    $ echo "This is a test." | runhaskell kwic.hs | runhaskell simplesort.hs \
      | runhaskell unrotate.hs
                                             This is a test.
                                    This is  a test.
                                       This  is a test.
                                  This is a  test.


  This does not currently handle line wrapping correctly.

unrotate uses one function, which re-formats each incoming line:

middle = 40

unrotate line = replicate (middle - lenp) ' ' ++ pre ++ "  " ++ suf
      (suf,('#':pre)) = span (/= '#') line
      lenp = length pre

main = interact $ unlines . map unrotate . lines

This program will fail if an incoming line does not contain a '#' character, since the pattern matching used with span in unrotate would fail.

gloria i ad inferni
faciamus opus

Return to Top | About this site...
Last edited Sat Aug 8 03:29:10 2009.
Copyright © 2005-2016 Tommy M. McGuire