Last month, Daryl Lee gave us a taste of the language Scheme in the article It's time to learn Scheme with a C++ code generator. This time we will be looking at some practical examples written with Scheme Shell (SCSH): finding and replacing text in a bunch of files, sorting files in two different ways, and converting data from a CSV file to an HTML file.
SCSH is a scripting language based on the programming language Scheme. It was created by Olin Shivers to replace long sh or bash scripts and extends the Scheme language to make it far more suited for shell scripting.
SCSH wraps the Unix system (Linux, BSD, Cygwin) underneath in a Lisp-y interface, giving you a regular expression domain-specific language and an awk domain-specific language to work with. SCSH looks strange at first for a Unix user who uses Perl or the shell for scripting because of the underlying philosophical differences between Unix and Lisp.
Unix vs. Lisp philosophy
The Unix philosophy focuses on specialization and strings. Specialization in Unix means writing small programs that each do one task well instead of writing one giant program that does many tasks. This increases modularity because you will have small simple components that can be combined together to form larger programs.
Running programs with string arguments is the only way to pass data from one program to another in Unix. This means that you lose the type of the data you're passing, and the receiving program has to parse the string and convert it to its proper data type. For example, when you run kill 223, the string "223" is parsed and turned into a number. Each Unix program that accepts input must do its own parsing to turn the string into whatever type of object it needs. This makes it difficult to send full objects around.
Lisp programs, in both Scheme and Common Lisp dialects, like modularity too, but they also like to pass data in the form of integers, symbols, lists, and other objects to other programs. You can see this in Emacs, which uses small Emacs-Lisp programs that deal with various types of objects, not only strings.
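The contrast can be sketched in a few lines of plain Scheme. This is only an illustration; the names pid-text, signal-request, and request-pid are hypothetical and do not come from SCSH itself.

```scheme
;; Unix style: the argument arrives as text and must be parsed before use.
(define pid-text "223")
(define pid (string->number pid-text))   ; convert the string to a number

;; Lisp style: hand over structured data; nothing needs to be re-parsed.
(define signal-request (list 'kill 223)) ; a symbol and a number in a list
(define (request-pid req) (cadr req))    ; second element is already a number
```

In the second form the receiver gets a number and a symbol directly, so no parsing step stands between the sender and the receiver.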
Features of Scheme Shell
In SCSH, the regular expressions engine is not string-based as is common in other languages. Instead, it is a domain-specific language embedded in Scheme Shell with a Lisp-like syntax. SCSH allows the slash ("/") character in symbol names, which lets you create filenames that are not strings. Also, SCSH provides a network socket interface with higher-level functions that automate the creation of a server or a client.
The syntax for creating regexes is known as SRE and looks similar to Scheme (i.e. a bunch of lists), which gives it some advantages over the typical string representation of a regular expression: you can add comments explaining the regex, and compose regexes. Adding comments to the SRE code is done by adding Scheme comments. Because SREs are lists, there is no need, as in Perl, to drop comments directly into the regular expression representation (which is a violation of the POSIX regex standard and is probably incompatible with other regex engines). This is helpful for some of the longer regexes that you may encounter, such as the Sudoku solver written in Perl regex, or the famous email address validation regex. Regexes can include dynamic variables and are generated on demand. This is similar to the variable interpolation that Perl and other languages allow, except in SCSH the variable interpolations do not interfere with the POSIX standard regular expressions.
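As a rough sketch of what commented, composed SREs can look like, assuming an SCSH interpreter: the names ws, month, and date-line are hypothetical, and the month list is shortened for brevity.

```scheme
;; Small regexes are built first, then reused inside a larger one.
(define ws (rx (+ whitespace)))            ; one or more whitespace characters
(define month (rx (| "Jan" "Feb" "Mar")))  ; alternation; shortened list of months

(define date-line
  (rx (: "Date:" ,ws                 ; header name, then whitespace
         ,month ,ws                  ; a month name, reusing the regexes above
         (submatch (+ digit)))))     ; day of the month, captured as submatch 1

;; regexp-search returns a match object, or #f when nothing matches.
(define m (regexp-search date-line "Date: Feb 14"))
(if m
    (begin (display (match:substring m 1)) (newline)))
```

The comments sit beside the parts they explain, and the comma forms insert previously defined regexes into the new one, which is the composition the paragraph above describes.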
Lisp macros allow the manipulation of syntax at compile-time. This means that you can rearrange function calls or whatever else you like. The Scheme specification refers to macros as syntax, and the language provides a pattern-matching tool to help with the definition of new syntax. The regular expression engine, process notation, and awk notation in Scheme Shell are all defined as macros/syntax. The definition of new syntax helps hide the messy details of what you want to do. For example, the awk syntax hides the code that loops through the file looking for records by letting you specify which record reader to use.
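A minimal sketch of defining new syntax with the standard pattern-matching tool, syntax-rules. The form my-unless is a hypothetical example, not part of SCSH.

```scheme
;; my-unless runs its body only when the condition is false.
;; The pattern (_ condition body ...) is matched against each use of the form
;; and rewritten into the template below it before the code runs.
(define-syntax my-unless
  (syntax-rules ()
    ((_ condition body ...)
     (if condition #f (begin body ...)))))

(my-unless (> 1 2)
  (display "1 is not greater than 2")
  (newline))
```

Because the rewrite happens before evaluation, the body expressions are not evaluated at all when the condition holds, which an ordinary function could not guarantee.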
Replacing pieces of text in HTML files
Now let's see how SCSH can help you with real work. For a previous project I had to generate HTML files from a LaTeX file using the latex2html program. Unfortunately, when run with no arguments, latex2html generated absolute pathnames for the navigation bar images. A Web browser viewing the generated HTML would look for the images in /usr/lib/latex2html/icons/, which is not accessible when the Web pages are served over the Internet or when latex2html is not installed on the viewer's machine.
The solution was to package the images needed for the navigation menu and to find and replace all instances of the absolute pathname. To try this out, I ran latex2html with a single argument, latex2html testdoc.latex, which created a directory named testdoc and placed the generated HTML files in that directory:
<!-- Navigation Panel -->
<IMG WIDTH="37" HEIGHT="24" ALIGN="BOTTOM" BORDER="0" ALT="next"
<IMG WIDTH="26" HEIGHT="24" ALIGN="BOTTOM" BORDER="0" ALT="up"
<!-- More HTML -->
SCSH needs to search for file:/usr/lib/latex2html/icons/ and replace it with a different location or an empty string. For my project I replaced it with an empty string and bundled the HTML files with the icons that I wanted to use. The following is the code for doing this simple task:
#!/usr/local/bin/scsh -s
!#
; The interpreter path above is an assumption; adjust it for your system.
(define replace (rx "file:/usr/lib/latex2html/icons/"))
(define (read-lines) (port->string-list (current-input-port)))
(define (replace-line line)
  (regexp-substitute/global (current-output-port) replace line 'pre 'post)
  (newline))
(for-each (lambda (fname)
            (let ((lines (with-input-from-file fname read-lines)))
              (rename-file fname (string-append fname ".bak"))
              (with-output-to-file fname
                (lambda () (for-each replace-line lines)))))
          (glob "*.html"))
This example demonstrates the basics of SCSH. The first line is the shebang line, which tells the shell what interpreter to use. The next three lines define the regular expression we are looking to replace, a function that converts the current input port to a list of strings, and a function that replaces the regex match with a blank string. The regular expression syntax does not appear special in this case -- it is simply a string to match.
The next line calls the for-each function, which applies the function defined by the lambda to all files whose names end in ".html". In the lambda-defined function, we use the let form to set the variable lines to a list of lines read from the file fname. For backup purposes, the next part of the function renames the file to the filename plus the file extension ".bak". Now we reach the part where string replacement occurs:
(with-output-to-file fname
  (lambda () (for-each replace-line lines)))))
This opens the file fname and, for each string in the variable lines, applies the replace-line function to find and replace the regex specified earlier and output the line to the open file.
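The substitution step can also be tried on a single string, without touching any files. This is a sketch assuming an SCSH interpreter; the sample line, the icon filename in it, and the "icons/" replacement are hypothetical.

```scheme
(define icon-prefix (rx "file:/usr/lib/latex2html/icons/"))
(define sample "<IMG SRC=\"file:/usr/lib/latex2html/icons/next.png\">")

;; With #f in place of a port, the result is returned as a string.
;; 'pre stands for the text before each match; 'post stands for the rest of
;; the string, which is searched again for further matches.
(define stripped
  (regexp-substitute/global #f icon-prefix sample 'pre 'post))

;; A string placed between 'pre and 'post is inserted where the match was.
(define relocated
  (regexp-substitute/global #f icon-prefix sample 'pre "icons/" 'post))
```

Here stripped should hold the tag with a bare filename, while relocated points the tag at a relative icons/ directory instead, which corresponds to the "different location" option mentioned earlier.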
Sorting files by date and time
The next practical example is a script to display and sort files modified recently.
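#!/usr/local/bin/scsh \
-o sort -s
!#
; The interpreter path above is an assumption; adjust it for your system.
(define (new-date day month year)
  (make-date 0 0 0 day (- month 1) (- year 1900)))
(define older-than? <=)
(define newer-than? >=)
(define (date-is comparison-proc day month year)
  (lambda (f) (comparison-proc (file-last-mod f)
                               (time (new-date day month year)))))
(define (sort-by-date filter-proc filenames)
  (sort-list (filter filter-proc filenames)
             (lambda (a b) (older-than? (file-last-mod a)
                                        (file-last-mod b)))))
(define (display-filename/date filename)
  (format #t "~a - ~a~%"
          (format-date "~d ~B ~Y" (date (file-last-mod filename))) filename))
; The source of the filename list was lost from this listing;
; (directory-files) is an assumed stand-in for the current directory.
(for-each display-filename/date
          (sort-by-date (date-is newer-than? 21 4 2008)
                        (directory-files)))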
The first line is the shebang again. The second line holds a few command-line arguments for the scsh interpreter: -o sort opens the sort package, which provides sort-list, and -s tells the interpreter to run this file as a script. The next line defines a function for creating a date/time object, which works around a quirk in how SCSH creates them (the month parameter must be between 0 and 11, and the year parameter is the given year minus 1900). The next two lines define the aliases older-than? and newer-than? for the less-than-or-equal-to and greater-than-or-equal-to comparison functions. The date-is function returns an anonymous function that uses a comparison function to compare the modification time of the file f with the day, month, and year given to date-is. One advantage of doing this is that it makes the call to sort-by-date read better; e.g. (date-is newer-than? 2 1 2008) returns a function that returns true if the file modification date is newer than 2 January 2008. Next, the sort-by-date function returns a filtered and sorted list of filename strings.
Finally we have the definition of the display-filename/date function and the display of the sorted filenames using that function. The function display-filename/date controls how the filename and date are displayed: currently, the date in the form "day month year", followed by the filename.
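The date quirk handled by new-date can be seen by inspecting the record it builds. This is a sketch assuming an SCSH interpreter and its date:month and date:year record accessors; the variable d is hypothetical.

```scheme
;; Same helper as in the script: callers pass a calendar month and a full year.
(define (new-date day month year)
  (make-date 0 0 0 day (- month 1) (- year 1900)))

(define d (new-date 21 4 2008))   ; 21 April 2008

(date:month d)   ; month as stored: April is index 3, counting from 0
(date:year d)    ; year as stored: 2008 is held as an offset of 108 from 1900
```

Keeping the subtraction inside one helper means the rest of the script can use ordinary calendar numbers and never has to remember the offsets.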
Taking data from CSV files and converting it to HTML
Scheme Shell's other embedded domain-specific language is awk, a Unix tool that helps users parse records and fields from a text stream. You invoke the domain-specific language by using the macro awk, which lets you specify how to separate records and fields, and which records to skip. This awk syntax abstracts away looping through the records and fields in a file and lets you define what happens to a record and when.
The awk syntax requires a record processing function, names for the values returned by the processing function, and a list of conditional clauses. The record processing function reads a record from an input stream and returns the record and fields that it parses out from the record. You can create a field reader with the SCSH function field-reader. Typically, the record processing function will return the record read and a list holding the fields. The variable names that the syntax requires make it easy to refer to those values.
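The pieces named above can be sketched on an in-memory string instead of a file. This assumes an SCSH interpreter; the sample data and the names read-csv-line, sample-port, and names are hypothetical.

```scheme
;; Record processing function: reads one line and splits it on commas.
(define read-csv-line (field-reader (infix-splitter ",")))

;; A string port stands in for a file so the sketch is self-contained.
(define sample-port
  (make-string-input-port "name,qty\napple,3\npear,5\n"))

;; (record fields) names the two values the reader returns.
;; The state variable names starts as the empty list; each clause body
;; returns its next value, and awk returns the final value at end of input.
(define names
  (awk (read-csv-line sample-port) (record fields) ((names '()))
    (1    names)                        ; record 1 is the header: leave state alone
    (else (cons (car fields) names))))  ; other records: collect the first field
```

Because cons adds to the front of the list, the first fields end up in reverse reading order; wrapping the result in reverse would restore the file order.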
A common way of storing data for a graph, chart, or spreadsheet is in a comma-separated values (CSV) file. Each record is on a new line, and each field is separated with a comma:
In this example there are three records, one on each line, and each has three fields. All of those would be converted to strings by the awk syntax and placed in a list for processing.
An example of a real-world CSV file you may have to deal with is one that holds contact information. These are useful for making backups, and you can automate importing this data into another program.
Hiro Protagonist,,"Last of the freelance hackers",,,,,,,,,,,,,,,,,
Mr. Lee,firstname.lastname@example.org,,,,,,Mr. Lee's Greater Hong Kong,President,,,,,,,,,,,
Casimir Radon,email@example.com,"Physics club head, friend of Sarah",,,,,,,555.555.1234,,,,,,,,,
In this CSV file the first line contains a list of all field names. Our code will have to ignore the first line, and the awk syntax of SCSH allows us to do this.
The following code prints out the name and email address of a person in HTML form, but only if the person has an email address:
(define read-csv (field-reader (infix-splitter "," 20)))
(define (empty-field? x) (string= x ""))
; The HTML inside the two page-frame helpers was lost; this markup is a minimal stand-in.
(define (start-html page-title) (format #t #<<END
<html><head><title>~a</title></head><body>
 <p>
END
page-title))
(define (end-html) (display " </p>
</body></html>
"))
(define (display-email-address email name)
  (format #t #<@ <a href="mailto:~a">~a</a><br/>~%@ email name))
(define $ list-ref)
(with-input-from-file "contacts.csv"
  (lambda ()
    (start-html "Contact List")
    (awk (read-csv) (record fields) n-records ()
      (range: 1 #f (if (not (empty-field? ($ fields 1)))
                       (display-email-address ($ fields 1) ($ fields 0)))))
    (end-html)))
The fields of this CSV file contain commas. This could be a huge problem when parsing but is easily dealt with in SCSH. A field enclosed with double quotes can include a comma. Thus, when we define the function for reading CSV records and fields, we can use the infix-splitter and designate the comma as the delimiter for the fields without worry.
The next definition is for the function empty-field?, which checks to see if the string given to it is empty. The start-html function displays the beginning of an HTML page and lets you set the title of the page. It uses a here-string for the HTML content, which means that you can include double quotes without escaping them. The end-html function simply prints the end of the HTML page. The display-email-address function, using a here-string, constructs and then displays an HTML link to an email address. The here-string in this case is delimited with the @ character.
After that, SCSH opens the file contacts.csv and executes the lambda function. At the end of that lambda function, the file is automatically closed. In the lambda function we start with the start-html function and then use the awk syntax with the read-csv record reader. Every time a record is read, the code checks to see if one of the conditions is met. In this example, if the record number is any number after 1, then SCSH will run the next expression, which checks to see whether the current record's second field, the email address, is empty. If it is not empty, then the email address is displayed.
Awk can check other types of conditions, such as the line number, whether the record read matches a regular expression, or a simple if check.
You can find SCSH libraries with modules that make the language more useful and competitive with Perl, Python, and Ruby. Some of the more useful libraries are SSAX for XML parsing and SUNet for Internet-related scripting. SUNet contains clients for the FTP, SMTP, POP3, Daytime, Time, and DNS protocols. It also contains an FTP server and an HTTP/Web server. You can also find libraries for interfacing with PostgreSQL and MySQL databases, along with a library for extracting information about images.
Scheme Shell illustrates the power of having a small core language, Scheme, that can be molded to solve problems in a particular domain (shell scripting). It has an innovative method for creating regular expressions and makes shell scripting a little less painful. While it may seem verbose compared to some Perl code, judging code by the number of characters or words is a downward spiral that ends in a language like APL. Shell scripting is an important task, and taking an extra minute to type out full function names should not be seen as a burden. Using Scheme Shell you could create a Web server, a GUI application, or a typical Ncurses-based installation script.