CHAPTER ONE

Basic character I/O

The basic requirement for most of these programs is to read input, process it, and write output. Most of the programs are designed to act as filters, to read from standard input and write to standard output. Haskell provides several approaches to reading standard input. First, I will discuss the low-level (but slightly more general) interface reading a single character at a time. Then, I present (following excellent suggestions on the haskell-cafe mailing list) a higher-level, more Haskell-ish interface.

getc

getc reads a character from stdin. The original used a flag value, ENDFILE or -1, to indicate the end of input, while I find this both bad practice and unnecessary. Instead, I am using Haskell's Maybe type, with Nothing indicating no further input: the end of the file, and a Just value indicating a newly read character.

This is the first monadic code to be seen here. getc builds an IO action out of three parts:

  • It checks for the end of the input.
  • If the end has been reached, it returns Nothing, elevating it via "return".
  • If the end has not been reached, it reads the next character and returns Just that character.

The return value of getc is not a character or any other simple value; it is not even a Maybe Char. Instead, it is an IO action, and a complex one that involves a decision. This is hidden by the do notation, including the "<-" pseudo-assignment operator. As a result, I wrote this code as if I were writing in a simple, albeit strange, imperative language.

Note: in most cases, at least until later chapters, the type annotations were added after the functions were written. The annotations make good (sort of) documentation, but one of the problems I've had getting a grip on Haskell has been the IO monad and types, so I am writing this without specifying (or thinking too much about) the types.

Haskell does make a pretty good imperative language, although I find its lack of control structures disturbing.

(Yes, I know I can write my own damn control structures, more or less. It's just not the same. Besides, that would ruin the joke.)

getc :: IO (Maybe Char)
getc = do
       eof <- isEOF
       if eof
          then return Nothing
          else do
               c <- getChar
               return (Just c)

putc

putc writes a character to stdout. Just another name for putChar, really.

putc :: Char -> IO ()
putc = putChar

Copying stdin to stdout

Copy is the basic filter: copy the standard input to the standard output.

This uses a recursive function call rather than the loop of the original. Right offhand, I do not know whether Haskell's standard library provides anything like a "while" loop, but using functional programming's native control flow is completely workable.

Again, this function is based on the do notation for monadic IO. However, and here's the weird part, the IO "action" returned by this function involves a loop and is potentially arbitrarily large or long lasting. "Actions" are functions.

For a very good view of the interaction between functions, actions-as-data-structures, and actions-as-code-that-does-stuff, see "Tackling the Awkward Squad: monadic input/output, concurrency, exceptions, and foreign-language calls in Haskell," by Simon Peyton Jones.

copy :: IO ()
copy = do
       ch <- getc
       case ch of
               Nothing -> return ()
               Just c -> do
                         putc c
                         copy

The copy program

The copy program satisfies the following documentation:

PROGRAM

  copyprog  copy input to output

USAGE

  copyprog

FUNCTION

  copyprog copies its input to its output unchanged.  It is useful
  for copying from a terminal to a file, from file to file, or even
  from terminal to terminal.  It may be used for displaying the
  contents of a file, without interpretation or formatting, by
  copying from a file to terminal.

EXAMPLE

  To echo lines typed at your terminal:

   $ runhaskell copyprog.hs
   hello there, are you listening
   hello there, are you listening
   yes, I am.
   yes, I am.
   ^Z

Note that this program simply calls the SoftwareTools.copy function.

import SoftwareTools

main = copy

Counting characters in stdin

The charcount program count characters from the standard input.

PROGRAM

  charcount   count characters in input

USAGE

  charcount

FUNCTION

  charcount counts the characters in its input and writes the total
  as a single line of text to the output.  Since each line of text
  is internally delimited by a NEWLINE character, the total count
  is the number of lines plus the number of characters within each
  line.

EXAMPLE

    $ runhaskell charcount.hs
    A single line of input.
    ^Z
    24

  (I apologize for the Windows-ism.)

Once again with the recursive functions. (I'm not sure why I feel the need to point that out; probably because the original uses loops and I feel slightly bad for not doing a direct translation rather than a more idiomatic one.) In this case, a helper does the actual looping. This helper returns the character count. Note that Haskell's stdio support (correctly) converts Windows' CRNL pairs into NL, so the character count is off by the number of lines. (I'm currently developing this on Windows. Bleah.)

The $! operator in the recursive call is need to force the addition, to make it strict. In this case that prevents a stack overflow on large inputs. See wc.hs below for further details. Here, it is syntactically similar to the $ parentheses-elimination operator.

charcount :: IO ()
charcount = do
            nc <- charcount' 0
            putStrLn (show nc)
    where
    charcount' nc = do
                    ch <- getc
                    case ch of
                            Nothing -> return nc
                            Just c -> charcount' $! nc + 1

Counting lines

linecount counts lines from standard input.

PROGRAM

  linecount  count lines in input

USAGE

  linecount

FUNCTION

  linecount counts the lines in its input and writes the total as a
  line of text to the output.

EXAMPLE

    $ runhaskell linecount.hs
    A single line of input.
    ^Z
    1

This case is slightly different from the preceding, in that it deals with a specific character, the newline. The original introduces a NEWLINE constant, and the equivalent would be a new data type which is either a Newline or some RandomCharacter c. Since I'm an old-school Unix/C guy, I'm comfortable with fuddling around with character constants and am just using 'n' (and this is probably decent Haskelly style, since excessive data typery is excessive anywhere). On the other hand, the case expression has sprouted an extra guard, to handle difference between the newline and J. Random Characters. Pattern matching is fun. The "_" in the second Just case is a pseudovariable that indicates, "don't care".

linecount :: IO ()
linecount = do
            lc <- linecount' 0
            putStrLn (show lc)
    where
    linecount' lc = do
                    ch <- getc
                    case ch of
                            Nothing -> return lc
                            Just '\n' -> linecount' $! lc + 1
                            Just _ -> linecount' lc

Counting words

wordcount counts words from standard input.

PROGRAM

  wordcount  count words in input

USAGE

  wordcount

FUNCTION

  wordcount counts the words in its input and writes the total as a
  line of text to the output.  A "word" is a maximal sequence of
  characters not containing a blank or tab or newline.

EXAMPLE

  $ runhaskell wordcount.hs
  A single line of input
  ^Z
  5

BUGS

  The definition of "word" is simplistic.

In the wordcount function, the code was getting wider than I wanted, so I broke out another helper function, handlechar, that is mutually recursive with wordcount'. The extra parameter of wordcount' is a flag, indicating whether the current position in the input is inside of a word or not. The word/nonword decision, and the count, is made when the first nonwhitespace character is read following whitespace.

The handlechar function is defined piecewise. The first case is when the character is whitespace, the second is when it is not whitespace but inword is false, indicating a non-word to word transition. The final case handles a nonwhitespace character while the position is already in a word. This code counts the non-word to word transitions in the input by incrementing the count in the second case.

wordcount :: IO ()
wordcount = do
            wc <- wordcount' False 0
            putStrLn (show wc)
    where
    wordcount' inword wc = do
                           ch <- getc
                           case ch of
                                   Nothing -> return wc
                                   Just c  -> handlechar c inword wc

    handlechar c _     wc | isSpace c = wordcount' False wc
    handlechar _ False wc             = wordcount' True  $! wc + 1
    handlechar _ True  wc             = wordcount' True  wc

    isSpace c = (c == ' ' || c == '\n' || c == '\t')

I/O Alternatives

There are alternatives to character-by-character I/O in Haskell. One of the nicest is based on the function interact from the standard prelude. interact has the type (String -> String) -> IO (); in other words, it takes a function from a String to a String and produces an IO action. Internally, interact pulls characters from the result of the function and prints them to the standard output, while at the same time reading the standard input lazily, one character at a time. Here is an example, a replacement for copy:

copy' = interact id

id is the identity function, which returns its input unchanged; interact uses this to bind the input to the output.

To get fancier, a replacement for charcount would be:

charcount' = interact $ show1Ln . length
    where
    show1Ln v = (show v) ++ "\n"

The function length returns the number of elements in a list, and a String is a list of characters. show1Ln takes that length (as a number), converts it to a string and adds a newline; the result of combining the two is a function that takes a list and produces a newline-terminated string containing the number of elements in the list. interact uses that function to count the characters in the input and write that count to the output.

This function uses the "pointfree" style; it is identical to the function:

charcount'' = interact (\input -> show1Ln (length input))

To go further, an IO action that counts the number of words in the input (a replacement for wordcount) and another that counts the number of lines would be:

wordcount' = interact $ show1Ln . length . words

linecount' = interact $ show1Ln . length . lines

(Using the same show1Ln helper function and the Haskell Prelude's words (which breaks a String into a list of Strings containing "words") and lines (which breaks a String into a list of Strings containing lines) functions.)

interact has a very nice property: it completely hides the monadic, imperative IO from the purely functional processing of the function. In all of the examples above, the logic of the program is dealing solely with a (admittedly, lazily read) String, not any IO actions.

interact is appropriately named; it can be used for interactive tasks, as seen in the following, peculiar, program:

main = interact f
    where
    f l = "I'm starting\n" ++ (g l)
    g ('I':chs) = "Saw an I" ++ (g chs)
    g (ch:chs) = ch : (g chs)
    g [] = "I'm done\n"

This program prints "I'm starting", then copies its input to its output, noting when it sees the character 'I' and completing with the message, "I'm done". Running the program interactively results in the initial line, one line of user input followed by an output copy of the line (with I's noted), the second line of user input, and so on.

Functions similar to interact can be defined to deal with named files (which will need to be managed later), allowing this technique to deal with almost all of the programs from Software Tools. While the code in this translation was originally written using an imperative, getc-ish approach, I am replacing that code with interact-ish definitions because they have a number of advantages:

  • interact's use of a string to string function typically results in clearer and shorter code,
  • the code is much better Haskell style, and
  • the results are typically significantly faster than the imperative versions. (The reverse of an "abstraction penalty".)

On the other hand, interact (and similar techniques) appear not to cover all of the needs of IO (possibly including exceptions and multi-file manipulations, for an example of which see the sort program later).

[This section was added following many suggestions from the haskell-cafe mailing list, including Gwern Branwen, apfelmus, and many others.]

wc - Haskell wc

wc here is not part of Software Tools, but is a natural extension of the individual programs above. I could not resist an implementation.

PROGRAM

  wc   Count lines, words, and characters

USAGE

  wc

FUNCTION

  Counts the lines, words, and characters in its input and writes the
  total of each as a line of text to the output.

  Lines are distinguished by the newline character.

  Words are a maximal sequence of characters not containing a blank,
  tab, or newline.

EXAMPLE

  $ runhaskell wc.hs
  A single line input.
  ^Z
  1 4 21

BUGS

  The definition of "word" is simplistic.

One complication is that a naive version produces the following error:

$ ./wc < /usr/share/emacs/21.2/etc/DOC-X-21.2.1
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize' to increase it.

The problem is that, even though the function is written to be tail recursive, the additions performing the counting in handlechar are evaluated lazily, meaning that the thunks computing (c+1), for example, pile up until they are finally forced by the (show c):

handlechar    ch    w l c
-> wc'              w l (c+1) False
-> handlechar ch'   w l (c+1)
-> wc'              w l ((c+1)+1) False
-> handlechar ch''  w l ((c+1)+1)
-> wc'              w l (((c+1)+1)+1) False
-> handlechar ch''' w l (((c+1)+1)+1)
-> wc'              w l ((((c+1)+1)+1)+1) False
...
-> putStrLn $ (show l) ++ ... ++ (show ((...((c+1)+1)...)+1))

In general, it is possible that laziness would be a performance win, if the (+) operation were expensive, the number of them were small enough not to overflow the stack, and the result were ultimately not evaluated by the overall context. In this case, addition is cheap, the number of additions can grow large, and show will always have to evaluate the expression to determine the number.

There are two functions which force expressions to be evaluated strictly: $! and seq. The first acts like the $ operator. The second takes two arguments, forces evaluation of the first, and returns the second. Given that c, for example, is forced by seq, the further uses in the recursive call of handlechar will be of the integer value.

The code below, hints for which I found on the Haskell wiki and elsewhere on the web, forces the parameters to wc' to be evaluated strictly, preventing the thunks from accumulating. See the first line of wc'.

(See Learning Haskell and Number Theory: The End of GCD.)

This first version uses getc and the imperative style.

wc = do
     (w,l,c) <- wc' 0 0 0 False
     putStrLn $ (show l) ++ " " ++ (show w) ++ " " ++ (show c)
    where
    wc' w l c inword = w `seq` l `seq` c `seq`
                       do
                       ch <- getc
                       case ch of
                               Nothing -> return (w,l,c)
                               Just char -> handlechar char w l c inword
    handlechar ' '  w l c _     = wc' w     l     (c+1) False
    handlechar '\t' w l c _     = wc' w     l     (c+1) False
    handlechar '\n' w l c _     = wc' w     (l+1) (c+1) False
    handlechar _    w l c False = wc' (w+1) l     (c+1) True
    handlechar _    w l c True  = wc' w     l     (c+1) True

There are a couple of ways of approaching wc using interact. A simple one is:

wc' = interact $ show3Ln . (\text -> (length $ lines text,
                                      length $ words text,
                                      length text))
     where
     show3Ln (x,y,z) = (show x) ++ " " ++ (show y) ++ " " ++ (show z) ++ "\n"

This is indeed simple, but it has the weakness that the entire input must be read and stored: the input variable allows the program to use the incoming string three different times.

On the other hand, this definition effectively exploints laziness; it does not need extra strictness analysis to avoid failing on medium to large inputs.

To avoid forcing the storage of the entire input, the program can be "deforested"; the three separate uses combined:

wc'' = interact $ show3Ln . (cnt (0,0,0) False)
    where
      cnt a _ [] = a
      cnt a _ (ch:cs) | isSpc ch = force a `seq` cnt (l,w,c+1)   False cs where (l,w,c) = a
      cnt a _ (ch:cs) | isNl ch  = force a `seq` cnt (l+1,w,c+1) False cs where (l,w,c) = a
      cnt a False (_:cs)         = force a `seq` cnt (l,w+1,c+1) True  cs where (l,w,c) = a
      cnt a True (_:cs)          = force a `seq` cnt (l,w,c+1)   True  cs where (l,w,c) = a

      force (l,w,c) = l `seq` w `seq` c

      isSpc ch = ch == ' ' || ch == '\t'
      isLn ch = ch == '\n'

This version uses the same logic as the original wc to process each character from the input string one at a time, and uses an accumulator (the triple) to allow it to be tail-recursive. (As a reminder, the Boolean flag is used to indicate whether or not the processing is currently in a word. Further, note that this version is not correctly strict and will fail on large inputs.)

Removing tabs

detab:

PROGRAM

  detab  convert tabs to blanks

USAGE

  detab

FUNCTION

  detab copies its input to its output, expanding horizontal tabs to
  blanks along the way, so that the output is visually the same as the
  input, but contains no tab characters.  Tab stops are assumed to be
  set every four columns (i.e., 1, 5, 9,...), so that each tab
  character is replaced by from one to four blanks.

EXAMPLE

  Using -> as a visible tab:

    $ runhaskell detab.hs
    ->col 1->2->34->rest
        col 1   2   34  rest
    ^Z

BUGS

  detab is naive about backspaces, vertical motions, and non-printing
  characters.

The default tabstops are set every four characters; tabstops is provided as an explicit array based on that to allow extensions (for example, by providing weird, arbitrary tabstops):

tabspace = 4

tabstops :: [Bool]
tabstops = map (\col -> col `mod` tabspace == 1) [ 1 .. ]

tabspaces is a function that converts the list of boolean tabstops to the number of spaces that a tab in a given column would need to produce:

tabspaces :: [Bool] -> [Int]
tabspaces ts = spacesToTabstop tablessPrefix ++ tabspaces rest'
    where
      (tablessPrefix, (_:rest)) = span (\stop -> not stop) ts
      rest' = False:rest
      spacesToTabstop prefix = reverse [1 .. (length prefix)]

tabspaces walks down the tapstops list, breaking off spans that do not contain a tabstop and using that span to compute the number of spaces to the next tabstop. (A tab occurring in a column with a tabstop advances to the next tabstop, a behavior accomodated by replacing the True of the tabstop in the suffix provided by span with a False, as seen in rest'.)

detab is implemented using interact and an algorithm inspired by Tillmann Rendel and apfelmus:

detab' :: [Bool] -> String -> String
detab' tabstops text = detab'' ts text
    where
    detab'' _     []           = []
    detab'' _     ('\n':text') = '\n'       : detab'' ts text'
    detab'' (s:r) ('\t':text') = (toSpc s) ++ detab'' (adv s r) text'
    detab'' (_:r) (char:text') = char       : detab'' r text'

    ts = tabspaces tabstops     -- convert tabstops to column spaces
    toSpc n = replicate n ' '   -- produce n spaces
    adv n l = drop (n - 1) l    -- skip n-1 elements of l

detab :: IO ()
detab = interact $ detab' tabstops

detab' uses detab'' to work its way down the text string one character at a time, ensuring that it also knows how many spaces would need to be inserted when it sees a tab character. The key part of the latter is the third line of the detab'' definition: after inserting spaces to the next tabstop, it advances a similar distance into the tabspaces for the line. As a result the function ensures that the tabstops occur at the fixed intervals.

A piece of wisdom from K&P:

Most real programs are subjected to a steady flow of changes and improvements over their lifetimes, and many programmers spend most of their time maintaining and modifying existing programs. This is an expensive process, so one of the most important design considerations for a program is that it be easy to change.

gloria i ad inferni
faciamus opus

Return to Top | About this site...
Last edited Sat Aug 8 03:29:10 2009.
Copyright © 2005-2016 Tommy M. McGuire