CHAPTER TWO

Basic processing

Putting tabs back

entab is very nearly the inverse of detab:

PROGRAM

  entab -- convert runs of blanks to tabs

USAGE

  entab

FUNCTION

  entab copies its input to its output, replacing strings of blanks by
  tabs so that the output is visually the same as the input, but
  contains fewer characters.  Tab stops are assumed to be set every
  four columns (i.e., 1, 5, 9,...), so that each sequence of one to
  four blanks ending on a tab stop is replaced by one tab character.

EXAMPLE

  Using > as a visible tab:

    $ runhaskell entab.hs
        col 1   2   34  rest
    >col>1>2>34>rest
    ^Z

    $ runhaskell detab.hs | runhaskell entab.hs
    >col 1>2>34>rest
    ^Z
    >col>1>2>34>rest

BUGS

  entab is naive about backspaces, vertical motions, and non-printing
  characters.  entab will convert a single blank to a tab if it occurs
  at a tab stop.  Thus entab is not an exact inverse of detab.

The code for entab uses (once again) interact and a String -> String function. The function, entab', is further simplified by using perLine to map across the lines of the input:

perLine f = unlines . map f . lines

In this case, entab' breaks each input line into tabstop-separation blocks, and replaces any trailing spaces in each block with a tab (unless it is the last block of the input).

entab' :: [Bool] -> String -> String
entab' ts = perLine $ entab'' (tabspaces ts)
    where
      entab'' :: [Int] -> String -> String
      entab'' _ [] = []
      entab'' (col:rest) txt = doBlock col pre ++ entab'' (drop (col-1) rest) suf
          where
            (pre, suf) = splitAt col txt

      doBlock len blk | length blk < len = blk
      doBlock _   blk = case spcs
                        of "" -> blk   -- in case the final blk ends in spaces
                           _ -> (reverse rest) ++ "\t"
          where
            (spcs, rest) = break (/= ' ') $ reverse blk

entab = interact $ entab' tabstops

As a side note, this is why tabspaces produces a list of integers rather than a list of space-strings.

Overstrikes

Back in the day, people used teletypewriters to interact with computers. These ttys typed user input onto a roll of paper while passing it to the computer and then printed the computer's output on the paper. One nifty trick was to emulate bold text by doubling a character: print the character, backspace, then print the character again. The duplication would produce a larger, darker glyph. The same trick would allow multiple overlapping glyphs, to produce underlining for example. When CRT terminals entered the scene, they typically followed the conventions of ttys (called "glass ttys", they were), including allowing the nifty backspace trick.

Printers (including, apparently, those available to the Bell Labs folks), however, did not follow this convention. Instead, the input had to be built of a special control character at the beginning of each line. An initial space allowed the line to be printed normally, but an initial '+' caused the print head to return to the beginning of the line without advancing the paper. (A "carriage return" without the matching "newline".) Then, the line output would be printed over the previous line, allowing bold, underlines, and whatever other special effects that could be done with overlapping glyphs.

The overstrike program converts between the first convention and the second.

To preserve what remains of my sanity, I used the '@' character rather than a backspace. If I remember correctly, some systems did use it so, since their terminals did not have backspace keys, so maybe I am not entirely off in the weeds.

PROGRAM

  overstrike - replace overstrikes by multiple lines

USAGE

  overstrike

FUNCTION

  overstrike copies its input to its output, replacing lines
  containing backspaces by multiple lines that overstrike to print the
  same as the input, but contain no backspaces.  It is assumed that
  the output is to be printed on a device that takes the first
  character of each line as a carriage control; a blank carriage
  control causes normal space before print, while a plus sign '+'
  suppresses space before print and hence causes the remainder of the
  line to overstrike the previous line.

EXAMPLE

    $ runhaskell overstrike.hs
    abc@@@___
     abc
    +___
    ^Z

BUGS

  overstrike is naive about vertical motions and non-printing
  characters.  It produces one overstruck line for each sequence of
  backspaces.

overstrike based on interact starts by breaking the incoming text into lines then breaking each line into a list of segments, where each segment is a run of backspaces (i.e. '@') or regular text. mapAccumL is used to process each segment, passing an accumulating parameter through the segments. An '@' segment is replaced by a newline plus a continuation character followed by enough spaces to skip the non-overstricken prefix. The accumulator collects the indentation needed to space over the prefix.

When the segments are combined with concat, the result is that a line has been converted to one or more lines where any new lines begin with the continuation character and represent the necessary overstrike.

overstrike = interact $ overstrike1'
    where
      overstrike1' txt = " " ++ body txt ++ "\n"

      body             = concat $ intersperse "\n " . map doLine . lines

      doLine ln        = concat $ snd $ mapAccumL doSeg 0 $ segment ln

      doSeg i l@('@':_) = (n, "\n+" ++ replicate n ' ') where n = i - (length l)
      doSeg i l         = (n, l)                        where n = i + (length l)

      segment = groupBy (\l r -> (l == '@' || r /= '@') && (l /= '@' || r == '@'))

Text compression

This program provides compression using simple run-length encoding. The escape character is ~, the run length is encoded as A-Z where A=1, B=2, etc. One or more ~'s in the input should be encoded as ~l~ where l is A, B, .... Normally, the threshold for compression is 4, so ~D will be the lowest escape and run length value.

PROGRAM

  compress - compress input by encoding repeated characters

USAGE

  compress

FUNCTION

  compress copies its input to its output, replacing strings of
  four or more identical characters by a code sequence so that
  the output generally contains fewer characters than the input.
  A run of x's is encoded as ~nx, where the count n is a
  character: 'A' calls for a repetition of one x, 'B' a
  repetition of two x's, and so on.  Runs longer than 26 are
  broken into several shorter ones.  Runs of ~'s of any length
  are encoded.

EXAMPLE

    $ runhaskell compress.hs >out
    Item    Name         Value
    1       car          ~$7,000.00
    ^Z
    $ cat out
    Item~D Name~I Value
    1~G car~J ~A~$7,000.00

BUGS

The implementation assumes 26 legal uppercase letters beginning
with A.

This is another case where using interact and higher-level functional programming simplifies the solution greatly. My original version, using character-by-character input, was significantly longer and much less comprehensible (which also describes the original).

In this implementation, compress' uses group to break the incoming string into runs of single characters. Those runs are transformed into (length, character) pairs, which are then transformed back into strings following the rle (run-length-encoding) rules.

  • A length greater than 26 (since the length is encoded by 'A'-'Z') is replaced by an encoding of the first 26 followed by an encoding of the remainder.
  • Any number of tildes is encoded, to prevent it from being misunderstood on decompression.
  • A length less than 4 is simply used as-is, since "~Ab" is longer than "b".
  • Anything else is encoded.
compress :: IO ()
compress = interact compress'


compress' :: String -> String
compress' = concat . map rle . map toPair . group
    where
      toPair :: String -> (Int, Char)
      toPair str = (length str, head str)

      rle :: (Int,Char) -> String
      rle (n,ch) | n > max              = rle (max, ch) ++ rle (n - max, ch)
      rle (n,ch) | n < min && ch /= '~' = replicate n ch
      rle (n,ch)                        = ['~', toChr n, ch]

      min = 4
      max = 26
      toChr i = chr (ordA + i - 1)
      ordA = ord 'A'

Text expansion

Text expansion, from run-length encoding, is pretty straightforward.

PROGRAM

  expand - expand compressed input

USAGE

  expand

FUNCTION

  expand copies its input, which has presumably been encoded by
  compress, to its output, replacing code sequences ~nc by the
  repeated characters they stand for so that the text output exactly
  matches that which was originally encoded.  The occurrence of the
  warning character ~ in the input means that the next character is a
  repetition count; 'A' calls for one instance of the following
  character, 'B' calls for two, and so on up to 'Z'.

EXAMPLE

    $ runhaskell expand.hs
    Item~D Name~I Value
    Item    Name         Value
    1~G car~J ~A~$7,000.00
    1       car          ~$7,000.00
    ^Z

expand is very simple; it merely needs to handle three cases: the end of the text, text matching the tilde escape string, and everything else.

There seems to be an error in the original; the way I read it, if the file terminates with a ~, the function will call getc to get the EOF, fall through and print the ~, then go to the top of the loop to call getc again, which is undefined at that point. Likewise if the file ends with ~A.

expand = interact expand'

expand' [] = []
expand' ('~':n:ch:rest) = replicate count ch ++ expand' rest where count = (ord n) - (ord 'A') + 1)
expand' (ch:rest) = ch : expand' rest

Arrows

As an aside, Don Stewart has a discussion of run-length encoding using Arrows. It's neat, but....

Command arguments

echo! Whoohoo!

PROGRAM

  echo - echo arguments to output

USAGE

  echo [ argument ... ]

FUNCTION

  echo copies its command line arguments to its output as a line of
  text with one space between each argument.  If there are no
  arguments, no output is produced.

EXAMPLE

    $ runhaskell echo.hs Hello World!
    Hello World!

Gets the command line arguments from System.getArgs, interleaves a space between the elements, joins all the elements into a single string and then prints it. Extra complexity is needed to avoid printing a newline if no arguments were provided.

Note the extravagant use of mapM_ to call putc on all of the characters of the string.

echo = do
       args <- getArgs
       mapM_ putc $ concat $ intersperse " " args
       if (length args) > 0 then putc '\n' else return ()

Character transliteration

translit is very similar to the Unix program tr.

PROGRAM

  translit - transliterate characters

USAGE

  translit [^]src [dest]

FUNCTION

  translit maps its input, on a character by character basis, and writes
  the translated version to its output.  In the simplest case, each
  character in the argument src is translated to the corresponding
  character in the argument dest; all other characters are copied as is.
  Both src and dest may contain substrings of the form "c1-c2" as a
  shorthand for all of the characters in the range c1..c2.  c1 and c2
  must both be digits, or both be letters of the same case.

  If dest is absent, all characters represented by src are deleted.
  Otherwise, if dest is shorter than src, all characters in src that
  would map to or beyond the last character of dest are mapped to the
  last character in dest; moreover adjacent instances of such characters
  in the input are represented in the output by a single instance of the
  last character in dest.  Thus

    translit 0-9 9

  converts each string of digits to the single digit 9.

  Finally, if src is preceded by a ^, then *all but* the characters
  represented by src are taken as the source string; i.e., they are all
  deleted if dest is absent, or they are all collapsed if the last
  character in dest is present.

EXAMPLE

To convert upper case to lower:

  translit A-Z a-z

to discard punctuation and isolate words by spaces on each line:

  translit ^a-zA-Z@n " "

Oh, and tabs can be specified as @t and newlines as @n in the command line arguments....

Minor functions

index returns Just the index of the first occurrance a character in a list (or string), or Nothing. Is strict in the counter argument to index', n. See the wordcount program, wc.hs, for a more in-depth look at strictness and laziness. (There is a library function for this already, List.elemIndex.)

index :: (Num n, Eq a) => [a] -> a -> Maybe n
index str c = index' 0 str
    where
    index' n []                = Nothing
    index' n (ch:_) | c == ch  = Just n
    index' n (_:rest)          = let n' = (n+1) in n' `seq` index' n' rest

addchar adds c to a set s. It is written to be polymorphic over list elements, but it is intended to be used on strings (hence the name).

addchar :: a -> [a] -> [a]
addchar c s = c:s

addstrs appends the characters between l and u to s.

addstr :: Char -> Char -> String -> String
addstr l u s = s `union` [ l .. u ]

isAlphaNum is true if c is alphanumeric.

isAlphaNum :: Char -> Bool
isAlphaNum c = (isAlpha c) || (isDigit c)

esc converts escaped characters to their normal representation: 'n' becomes newline, etc.

esc :: Char -> Char
esc 'n' = '\n'
esc 't' = '\t'
esc c   = c

translit

dodash expands runs of characters described as 'a-z', for example, to 'abc...xyz'.

K&P claim to have written dodash in a general fashion with the arbitrary ending delimeter to use "in later dealings with sets of characters." I have left it so for the moment, although it would be better to remove the generalization. (The delimeter, an argument of chr 0, is unused in translit.)

dodash :: Char -> String -> String -> String
dodash delim src dst = dodash' src dst
    where
    dodash' [] dst                   = dst
    dodash' (c:src) dst | c == delim = dst
    dodash' ('@':c:src) dst          = dodash delim src (addchar (esc c) dst)
    dodash' (l:'-':r:src) dst
        | (isAlphaNum l) && (isAlphaNum r) && l < r    =
            dodash delim src (addstr l r dst)
    dodash' (c:src) dst              = dodash delim src (addchar c dst)

makeset expands a string inset to a full character set using dodash.

makeset inset = dodash (chr 0) inset []

The first problem for translit is to parse the command line arguments, producing the negation flag and the from and to sets.

To keep things simple, parseargs works in stages. The function itself first generally parses the arguments, which can be either one string or two. parseargs' takes over from there.

The task of parseargs' is to check for the negation of the from string, the first argument. If it is specified, it sets "allbut" in the output, passing along the results of parseargs''.

parseargs'' does the (relatively) heavy lifting, making sets from the two arguments and doing some input error checking.

parseArgs is an IO action because it potentially throws an error, and it seemed cleaner to keep that out of pure code.

parseargs :: IO (Bool,String,String)
parseargs = do
            args <- getArgs
            case args of
                      (from:to:[]) -> parseargs' from to
                      (from:[]) -> parseargs' from ""
                      _ -> error "usage: translit from [ to ]"
    where
    -- parseargs' picks off the negation operator if necessary
    parseargs' :: String -> String -> IO (Bool,String,String)
    parseargs' ('^':from) to = do
                               (fromset,toset) <- parseargs'' from to
                               return (True,fromset,toset)
    parseargs' from to = do
                         (fromset,toset) <- parseargs'' from to
                         return (False,fromset,toset)
    -- parseargs'' provides the base parsing, after negation
    parseargs'' :: String -> String -> IO (String,String)
    parseargs'' from to =
        let fromset = makeset from
            toset = makeset to
        in
        if length(fromset) < length(toset)
            then error "translit: \"from\" shorter than \"to\""
            else return (fromset,toset)

xindex is an interface for index. Given a list (or string), a character, a flag indicating whether the result should be inverted, and an integer representing a squash length (...), produce a value which may be an index into the fromset, may be a flag indicating no such index exists (-1), or may be some special key value, that squash length + 1.

In the original, an alternative of xindex is presented that uses more complex boolean logic to compute the same result. However, that version assumes that index will handle the Maybe Char produced by getc to indicate end of file. This translation does not, to keep index more general.

This function is largely the brains of translit. It is used to identify the transliteration characters in the fromset, including handling the special cases of the fromset being negated, and fromset being longer than toset.

This nonsense is nasty, and represents one of the finest examples of a need for data abstraction. I have largely left it in the same form as the original, partially to illustrate the advances made in the last 30 years of programming, and partially because I cannot figure out how it works.

xindex :: Eq a => [a] -> a -> Bool -> Int -> Int
xindex inset c allbut lastto =
    case (index inset c) of
                         Nothing -> if allbut then lastto + 1 else -1
                         Just i -> if allbut then -1 else i

This is the actual transliteration. translit' is the function from a string to a string, also acting as a placeholder, to provide a scope with the values allbut, fromset, toset, lastto, and squash---these are computed from the command line parameters but are constant in the loop.

The code is fairly hairy. One particual problem is the use of special integer values from xindex. Unfortunately, I could not decide how to clean that up.

The function itself (along with xindex above) was originally more-or-less directly transcribed from the original, then refactored to its present state. The refactoring has made the loop structure of translit'' much clearer.

This code uses the "skip" function to "do nothing" as a monadic statement. The alternative, either using "return ()" or repeating code, would be more confusing. The two skips are important: the first is what doesn't print anything while squashing runs of the last character of a toset, while the second is what doesn't print anything when the toset has been omitted.

The squashing parameter to translit'' is responsible for the requirement that adjacent squashed characters all be replaced by a single squash character. This code would otherwise replace all of the squashed characters by a copy of the terminal character of the toset---no squashing would occur.

translit' :: Bool -> String -> String -> Int -> Bool -> String -> String
translit' allbut fromset toset lastto squash = translit'' False
    where
    translit'' squashing [] = []
    translit'' squashing (c:rest) =
        let i = xindex fromset c allbut lastto in
        if squash && (i >= lastto) && (lastto >= 0)
        then
            if not squashing
            then (toset !! lastto) : translit'' True rest
            else                     translit'' True rest
        else
            if (i >= 0) && (lastto >= 0)
            then      (toset !! i) : translit'' False rest
            else if (i < 0)
                 then            c : translit'' False rest
                 else                translit'' False rest

The final IO action is fairly simple, although it needs to integrate the assorted components:

translit = do (allbut,fromset,toset) <- parseargs
              let lastto = (length toset) - 1
                  squash = (length fromset) > lastto || allbut
              interact $ translit' allbut fromset toset lastto squash

I theorize that allbut is present because the strings used by K&P cannot hold all possible characters, so negated sets cannot be directly represented. I have kept it here, where its sole purpose is to tell xindex to reverse (more or less) the result of the index test.

Further:

To answer the obvious question raised by this experiment, we did not use sets for fromset and toset because sets that are large enough are not always available. The mapping array suggested in one of the following exercises should make translit run faster, but it does not mesh well with some of the other uses we have planned for dodash in Chapters 5 and 6.

Numbers

Software Tools provides a small number of numeric functions, used to convert numbers to and from strings. These functions are used by the subsequent code. Most of these duplicate elements of the Prelude or library, but their definition is presented here for completeness.

itoc itself is pretty uninteresting. It replicates a standard library function, but illustrates one type-related issue: The Int type is a requirement; no other type would satisfy this because the result of ord is an Int.

Also, beware the difference between quotRem and divMod for negative numbers:

> (-13) `quotRem` 10
(-1,-3)
> (-13) `divMod` 10
(-2,7)

The recursive call is not in tail position; this function will not execute as a loop in constant space. However, the number of digits in an Int is not that large. On the other hand converting it to tail-recursive form would not be difficult.

Note that this function does not correctly handle twos-complement numbers; there is one more negative number than can be represented as a positive number, so the first line will do something funny with the minimal negative Int. Like overflow.

itoc :: Int -> String
itoc n | n < 0 = '-' : itoc (-n)
itoc n = let (q,r) = n `quotRem` 10
             d = chr ((ord '0') + r)
         in if q == 0
            then [ d ]
            else (itoc q) ++ [d]

putdec calls itoc to do the conversion from Int to String, then uses two mapM_ calls to format the output. The first prints spaces to pad the field width from nd, the length of the string, to w. The second actually prints the characters of the string.

putdec :: Int -> Int -> IO ()
putdec n w = do
             mapM_ (\n -> putc ' ') [ nd .. (w - 1) ]
             mapM_ putc             s
    where
    s = itoc n
    nd = length s

ctoi is written as a case expression rather than as a sequence of pattern-matched definitions because the ctoi' helper is called from several branches of ctoi. (A 'where' definition like ctoi' would only be available to the immediately previous branch.)

The original uses a var i argument to index into the string being parsed. This argument is updated by ctoi, and presumably will be further used when parsing the string following the number. (They have used this and similar approaches in several other functions including getc, as well.) This code replaces that with a pair value: the first element is the number and the second is the remainder of the string with the number removed.

ctoi :: String -> (Int,String)
ctoi s = case s of
                (c:s')  | c == ' ' ||
                          c == '\t' -> ctoi s'
                ('-':s')            -> let (n,s'') = (ctoi' 0 s') in (-1*n,s'')
                ('+':s')            -> ctoi' 0 s'
                (c:s')  | isDigit c -> ctoi' 0 (c:s')
                s'                  -> (0,s')
    where
    ctoi' n (c:s) | isDigit c = let d = (ord c) - (ord '0') in
                                    ctoi' (10*n + d) s
    ctoi' n s = (n,s)

gloria i ad inferni
faciamus opus

Return to Top | About this site...
Last edited Sat Aug 8 03:29:10 2009.
Copyright © 2005-2016 Tommy M. McGuire