Lab: Pipeline Style in Haskell

Pipleline style in Haskell - Introduction

In this lab you will implement a solution of the Pipeline style in Haskell for the term frequency task.

Implementing the solution in Haskell gives you the opportunity to practice some of the functional programming concepts we have covered in the course (e.g., higher-order functions, currying, function composition, repetition without loops, and so on).

The task you have to solve and the constraints of the style are the same as the ones specified in the JavaScript version lab.

Create and Clone Your GitHub Repository

To create the repository for this lab, fork this special starter repository.

Your repository now exists on the GitHub servers. If you want to work on it, you first have to “clone” it on your computer.

Implement Your Program

Once you have the repository (a directory) on your computer, you can open it in your IDE (e.g., in VS Code). You can easily do this using the code command, passing as an argument the path to the appropriate directory:

code ~/lab-h1-pipeline-haskell

Note that the starting repository for this lab is slightly different from the previous ones. Make sure to read the README.md. The repository contains a Haskell file named Pipeline.hs that you should use as a starting point. Take a look at the content of the file to get an idea of the structure of the program, as well as the functions you need to implement (with names, types, and descriptions).

There are no automated tests for reasons we are going to discuss soon.

Let’s Ignore I/O for Now

Performing I/O in Haskell involves monads, which we still have to cover in the course. Therefore, we will focus for the moment on writing the core of our program. We will implement a pipeline function that has a String parameter (the text to be processed) and produces the result as a String.

This function is passed to the interact function, which reads the entire input (from the standard input), passes it to our pipeline function, and then prints the result to the standard output. That will be our main function.

All of this is already implemented for you in the starter code.

Note that at the beginning of the file, the content of the stop words file and the number of frequencies to print are hardcoded. The only “input” to our program is the text to be processed. (This is why we don’t have automated tests for this lab. There are still reference output files for you to manually compare.)

Haskell Fragments for Your Functions

Try to implement and use each function in the REPL (ghci) before integrating it into the pipeline in the main file.

To wit: do not try to write the entire implementation at once and compile it only at the end. You will likely end up with several errors that are hard to understand. Benefit from having small functions that you can work on individually!

The next sections give you ALL the ingredients you need to implement the pipeline. We will not always use the shortest or most idiomatic Haskell code: we will instead keep focusing on versions you can fully understand at this point.

Repeat computation with higher-order functions

Use higher-order functions to repeat computations.

Map

ghci> map (\n -> n * n) [1, 10, 100]
[1, 100, 10000]

Filter

ghci> filter (\s -> length s >= 1) ["ab", "", "a"]
["ab", "a"]

Reduce / Fold

ghci> foldl (+) 0 [1, 2, 3, 4]
10
ghci> foldl (\acc n -> acc + n) 0 [1, 2, 3, 4]
10

These two variants are equivalent. The first one uses the + operator directly (surrounded by parentheses, treated as a function), while the second one explicitly uses a lambda to combine the accumulator and the current element. The first is more idiomatic, but feel free to use the second one if you prefer more explicit code.

Working with pairs

A pair is a tuple of two elements. You can use the functions fst and snd to access the first and second element, respectively.

ghci> fst ("a", 1)
"a"
ghci> snd ("a", 1)
1

Working with lists

Concatenate lists

Use the ++ operator:

ghci> [1, 2] ++ [3, 4]
[1, 2, 3, 4]

Check if an element is in a list

Use the elem function:

ghci> elem 2 [1, 2, 3]
True
ghci> elem 4 [1, 2, 3]
False

You may find the opposite function notElem useful as well.

Reverse a list

Use the reverse function:

ghci> reverse [1, 2, 3]
[3, 2, 1]

Sort a list on a key

Use the sortOn function from the Data.List module, passing a function that extracts the key to sort on:

ghci> import Data.List (sortOn)
ghci> sortOn snd [("a", 2), ("b", 1)]
[("b", 1), ("a", 2)]

Take the first `n` elements of a list

Use the take function:

ghci> take 2 [1, 2, 3]
[1, 2]
ghci> take 5 [1, 2, 3]
[1, 2, 3]

Working with characters and strings

Check if a character is alphanumeric

Use isAlphaNum from the Data.Char module:

ghci> import Data.Char (isAlphaNum)
ghci> isAlphaNum 'a'
True
ghci> isAlphaNum ':'
False

Convert a character to lowercase

Use toLower from the Data.Char module:

ghci> import Data.Char (toLower)
ghci> toLower 'A'
'a'

Turn a character into a string

A string is a list of characters. You can simply use the list literal syntax:

ghci> ['a']
"a"

Concatenate strings

A string is a list of character, so you can use the ++ operator as shown above.

“Stringify” a value

Use the show function (e.g., to convert an integer to a string):

ghci> show 42
"42"

Range of characters

You can use the “range syntax”:

ghci> ['a'..'c']
"abc"  -- equivalent to ['a', 'b', 'c']

Split a String

The built-in words function splits a string on spaces:

ghci> words "hello world"
["hello", "world"]

For a more generic split operation, use the splitOn function from the Data.List.Split library (we have simply packaged this library in Split.hs for you to use without extra installation):

ghci> :load Split.hs
ghci> splitOn "," "hello,world"
["hello", "world"]

Working with maps / dictionaries

Create an empty map

Use empty from the Data.Map module:

ghci> import Data.Map (empty)
ghci> empty
fromList []

Update or insert a key-value pair

Use the insertWith function from the Data.Map module. If the key is not present in the map, the value is inserted. If the key is present, the function passed as the first argument is used to combine the old and new values.

ghci> import Data.Map (insertWith)
ghci> mapAfterFirstElem = insertWith (+) "a" 10 empty
ghci> mapAfterFirstElem
fromList [("a", 10)]
ghci> insertWith (+) "a" 5 mapAfterFirstElem
fromList [("a", 15)]
ghci> insertWith (+) "b" 5 mapAfterFirstElem
fromList [("a", 10), ("b", 5)]

The first argument is a function that combines the old and new value if the key already exists. The second argument is the key. The third argument is the new value (to be used as a default if the key does not exist). The fourth and last argument is the map to update.

This data structure is immutable, thus insertWith returns a new, updated map.

Turn a map into a list of key-value pairs

Use the toList function from the Data.Map module:

ghci> import Data.Map (toList)
ghci> toList mapAfterFirstElem
[("a", 10)]

Compose functions

As seen during the course, you can compose functions using the infix . operator, evocative of the “ring” symbol used to denote function composition in mathematics.

The function on the right is applied first. A good mnemonic is to read . as “after”.

ghci> increment x = x + 1
ghci> square x = x * x
ghci> squareAfterIncrement = square . increment
ghci> squareAfterIncrement 2
9

Implement the Entire Pipeline

Once you have working functions, add your implementation in the Pipeline.hs file. The file already contains all the signatures of the functions, which both help you to understand their behavior and the compiler to provide better error messages.

Run Your Program

Compile your program with ghc:

ghc Pipeline.hs

This will produce an executable file named Pipeline.

You can run your program by executing the following command, using the shell redirection operator < to pass the content of a file as input:

./Pipeline < input-small.txt

The output will be printed to the standard output (courtesy of the interact function).