Shell Lab: An Introduction to Shell Scripting
1 Overview
Fluency with the shell is a life skill. How can you list 10 header files in
/usr/include/
that have at least 100 lines? How can you get all
the unique second fields in a bunch of CSV files? How do you add a #include
at the top of several files automatically? All of these tasks can be done in
one line of shell scripting:
$ cd /usr/include; wc -l *.h | grep ' [1-9]\{3\}' | head -n 10
957 aalib.h
244 aio.h
1596 alpm.h
436 ansidecl.h
138 a.out.h
721 archive_entry.h
437 argon2.h
564 argp.h
156 argz.h
813 aspell.h
$ cd ~/src/algs4/data; cut -d, -f2 *.csv | sort | uniq -u
[...]
$ cd ~/src/mycode; sed -i '1s/^/#include\n/' *.c
The kind of commands in the above example are limited. To iterate through
files, for instance, I have used globbing, i.e., simple wildcards: *
is all
files (including directories), *.h
is all files ending with .h
, and [a-z]*
would be all files starting with a lowercase letter. These are expanded by
the shell, in the sense that they are replaced by the actual list of files
before calling the program. For instance, echo
just prints its arguments,
nothing else, yet:
$ echo hello world
hello world
$ echo *.c
main.c lib.c
If we need to iterate through files, rather than pass all filenames to a program
(as with wc(1)
, cut(1)
, and sed(1)
above), we need an actual programming language.
This is provided by your shell: it has if
statements, for
and while
loops,
and case
switches. We will go over these and over a few commands in this
lab.
Shell scripts are slow, inefficient, in the same way a Swiss-army knife is bad at all the jobs you throw at it, but it can do them. Its power resides in the ability to express complicated tasks easily, making it extraordinarily common and useful for one-time or simple tasks. As with most scripting languages, there are several ways of doing any task (for the examples above, see this, this, and this), and although there are better ways than others, shell scripting is mostly about saving time, so the first working solution is the best.
This project is divided into 6 parts and a bonus part. The commands presented in each Part must be used in your solutions, except for the bonus part.
2 Downloading the assignment
Start by extracting ~prof/shelllab-handout.tar.bz2
in your home:
$ cd ~/
$ tar xvf ~prof/shelllab-handout.tar.bz2
This will cause a number of files to be unpacked into the directory. You
should only modify the files src/*.sh
, and these are the only files that are
submitted. The test(1)
folder contains the different tests and grading examples
for your program. Use the make test
command to run the test driver. This
would run all the tests for all parts. See Section 11.2 for more finely
grained testing options.
3 Concepts used across multiple parts
3.1 Scripts
Rather than typing commands directly in the prompt, we will write script
files. These are text files bearing the extension .sh
(by convention).
They contain a sequence of commands, separated by newlines. If you want to
put two commands on a single line, you can separate them with a semicolon.
To run your script, you should always indicate the directory in which it is,
even if you are currently in that directory (in that case, type
./myscript.sh
as ./
indicates the current directory).
To debug a script, you can issue, within it, the command setopt xtrace
.
This will print each line that the shell executes:
$ cat test.sh
#!/bin/zsh -f
setopt xtrace # this is a comment!
var=value # var is a variable
echo $var # $var means the value of var
$ ./test.sh
+./test.sh:3> var=value
+./test.sh:4> echo value
value
The first line of the script file should indicate which shell we will be using. Each of the solution files already contains that line, called a shebang:
#!/bin/zsh -f
This lets the kernel know that it should use the Z shell, a standard shell,
for interpreting the program. The file is then processed as if it were typed
into a shell. To read the documentation of the Z shell, look at its info
page (type info zsh
; search by typing s
, an index is accessible with key
i
). Use info info
to learn more about navigation into info pages.
—Note
The Z shell is now fairly standard. Since 2019, it is the default shell on macOS, and it is easily installed on all Linux distributions. Its scripting language alleviates a lot of the weird behaviors, peculiarities, and inconsistencies of the older shells (in particular the completely standard Bourne shell (sh) and Bourne Again shell (bash)), at the expense of backward compatibility.
—Note
All of your scripts already have “execution rights,” which is a technicality of the file system. If you write your own scripts and obtainzsh:
permission denied: ./myscript.sh
, this is likely due to the lack of
execution rights. You can solve this by adding the right, using chmod +x
./myscript.sh
(see the manpage of chmod(1)
for more).
3.2 Standard input: <
and |
At this point, you should be familiar with the standard input in C. This
is the “file” that corresponds to the user interactive input which we read to
get commands in the Dict Lab. Most text processing commands read by default the
standard input, for instance, here is sort(1)
, a program that sorts the lines in
input:
$ sort
3
1
4
1
[Ctrl-d]
1
1
3
4
—Note
As we’ve seen in the Dict Lab, when the input is provided from the shell, the key combo Ctrl-d ends the input. The programsort(1)
needs to read all of its
input in order to produce any output.
With programs that can produce output as they read input, looking at the
terminal can be a bit weird. For instance, the program cat(1)
, without
arguments, reads the standard input and repeats it on the standard output
(it’s of nearly no use):
$ cat
hello
hello
how do you do
how do you do
good talk
good talk
[Ctrl-d]
It takes a good understanding of what the command is doing in order to distinguish the input, typed by the user, and the output, printed by the program. Can you tell which is which?
In a shell, we can replace the standard input with a file. This is done with
the <
operator:
$ cat myfile
lorem
ipsum
dolor
$ sort < myfile
dolor
ipsum
lorem
(This is equivalent to sort myfile
since sort(1)
can read the contents from
files passed as arguments, see its manpage. This is true of lots of commands.)
We can also replace the standard input of a program with the standard output of
another. This fundamental operation is called piping or pipelining and uses
the |
operator:
$ echo hello | sort
hello
Thus, the statement cat file | sort
is functionally equivalent to sort <
file
. We strongly prefer the latter, since it doesn’t require running an extra
command.
—Note
The programecho
(which is actually a builtin, not a program) prints its
arguments followed by a newline. If we wanted multiple lines, we could do:
$ echo 'line3\nline1\nline2' | sort
line1
line2
line3
This is because echo
interprets the two characters \n
as a single newline.
Another, more advanced option, is to use a subshell, which would have a
sequence of echo
separated by semicolons:
$ { echo line1; echo line2; echo line3 } | sort
line1
line2
line3
Try to answer these questions: What would cat | cat
do? What about cat <
myfile | cat
?
—Note
What do the words program, executable, builtin, and command mean?- Program, executable: These are synonymous. They refer to a file on the disk that can be executed (run). This can be a native assembly program (we will see how these files are laid out during the course) or a script (the interpreter of which being given in the shebang).
- Builtin: You can “call” builtins in a shell just like a program, but instead
of having the shell find the executable file and running it, the shell itself
implements that feature. When you type
$ echo hello
, the program/bin/echo
is not called; instead, the shell just printshello
. This is more efficient. Here are some other builtins implemented by the Z shell:alias
,bg
,break
,cd
,continue
,declare
,echo
,eval
,exec
,exit
,export
,false
,fg
,history
,jobs
,kill
,printf
,pwd
,read
,return
,shift
,source
,stat
,trap
,true
,wait
,which
. - Command: This is the general term for a single instruction in a shell. They can either run a program or a builtin.
3.3 Quoting
Shell scripting is a fairly loose language in terms of typing. A string
"echo"
can be used as a command and the strings 'test'
and "test"
are the
same from the point of view of the shell. So why and when do we use quotes?
Grouping. If we have a file named
file with spaces.txt
, then if we pass it tocat(1)
(which just dumps the contents of the file to the standard output), we need to make sure thatcat(1)
takes this filename as one argument. Compare:$ cat filename with spaces.txt cat: filename: No such file or directory cat: with: No such file or directory cat: spaces.txt: No such file or directory $ cat "filename with spaces.txt" [contents of file]
Preventing expansion. Variables (starting with
$
, see The variables$#
and$@
) and some special characters (mostly backslash) are interpreted in a different way by the shell. Compare the following, whereecho -E
is used to prevent any kind of interpretation byecho(1)
itself:$ myvar=value $ echo -E 'the\ val:\ \"$myvar\"' the\ val:\ \"$myvar\" $ echo -E "the\ val:\ \"$myvar\"" the\ val:\ "value" $ echo -E the\ val:\ \"$myvar\" the val: "value"
In words, single quotes do not change their contents at all, double quotes interpret
\"
as double quotes and evaluate variables, and without any quotes, backslashes are removed and variables evaluated.- Raw spaces. It makes sense for some commands to pass a space as argument;
this is the case for
cut(1)
for instance, which selects fields in a line given a delimiter. If you want the third word of a line, you might say “use space as delimiter.” To do so, you need quotes:cut -d ' '
. See the Bonus Part for more. An even more extreme situation is when you want to pass an empty argument to a program; you would writecmd ''
.
In the following, echo(1)
receives only one argument; can you explain the behaviors?
$ echo -E lorem"'"ipsum
lorem'ipsum
$ echo -E 'lo '\'' rem'
lo ' rem
$ echo -E ''""''""''
$ echo -E '\'\'"' \""
\'' "
3.4 Regular expressions
Two of the most common programs used in scripting are sed(1)
and grep(1)
. They
both rely on patterns to match lines of input files, and do something with
them (sed(1)
transforms them, while grep(1)
just prints the matching lines).
Patterns are universally expressed using regular expressions (regex), not to
be confused with globbing. We will make sparse use of regexes in this lab,
but it is an indispensable computer skill that you will sharpen over the
years. Regexes are implemented in each and every programming language.
A regex is a string that describes a simple string property, for instance
“starts with some repetitions (possibly 0) of b
, followed by a sequence of 3
digits and ends with a nondigit” (e.g., bb314c
and 2a
match it, but 314
and c1b
don’t). Rather than introducing regexes completely and formally, we
will go through examples.
Regex | Meaning | Match examples | Nonmatch examples |
---|---|---|---|
abc |
the exact string abc |
abc |
lol , ok |
\(ab\)* |
any repetition of ab |
abab , the empty string |
a , ba |
ab* |
a followed by any number of b |
a , ab , abb |
abab , the empty string |
. |
any single character | a , 8 , # |
az , the empty string |
.* |
any number of character | ab3k , the empty string |
none, all strings match |
a.*b |
starts with a , ends with b |
ab , aloremb |
a , b , lorem |
[abc] |
either a , b , or c |
a , b , c |
ab , bc , the empty string |
l[ou]c |
either loc or luc |
loc , luc |
lc , louc |
l\(o\|u\)c |
same as previous with or | and grouping |
same as previous | same as previous |
ab\|bc |
either ab or bc |
ab , bc |
abc , abbc |
x\(ab\|ba\)*y |
x , then sequence of ab or ba , then y |
xy , xaby , xbaaby |
xaay , xbby , the empty string |
[a-z] |
any lowercase letter | a , z , t |
0 , # , the empty string |
[a-zA-Z0-9] |
any alphanumeric character | c , H , 7 |
# , lo , the empty string |
[^a-z] |
any character but lowercase letters | 8 , # , K |
t , KO , the empty string |
Regexes are used in grep(1)
, sed(1)
, and many applications to find a
substring that follows a pattern within a string (e.g., find a.*e
in
kloatre
: the substring atre
matches it). We can thus additionally ask for
the regex to match only at the beginning or the end of the string:
Regex | Meaning | Match examples | Nonmatch examples |
---|---|---|---|
^[0-9] |
starts with a digit | 0t , 44tro1 |
t0lt , the empty string |
[a-z]$ |
ends with a lowercase letter | 0k b , lo#l |
a0k8 , the empty string |
Let’s now cover some more involved regexes:
/[^/]*$
: This finds a sequence of characters that are not slashes at the end of the string. If the string where a path/usr/include/file.h
, then this would match the substring/file.h
.^[a-zA-Z0-9]*$
: This only matches alphanumeric strings.ff\(-[0-9]\|\)$
: The string ends withff
or withff-N
withN
a digit.\(^\|,\)[^,]*
: A substring that matches this regex is a sequence of characters that are not commas that either appears at the beginning of the string or after a comma. This is useful, for instance, to find fields in a CSV field.[^,]*\($\|,\)
: This has the same effect as the previous one, with a different strategy: we match a sequence of noncommas that is either at the end of the string or followed by a comma.
It is often much easier to write a regex than to read one, so you can practice with the following (these are harder than what you’ll need in the lab):
- Write a regex for strings that start with
a
, end withb
, and contain at least onec
. - Write a regex for strings that are IPv4 addresses (
0.0.0.0
to255.255.255.255
). - Write a regex for strings that are dates (say from
1/1/1900
to12/31/2099
). - Write a regex that matches the substring of a CSV line that corresponds to
the first two fields (e.g., matching
x,y
inx,y,z,t
). - We’ve seen that
^.*$
is any string. How would you write a regex for strings that contain only one kind of letter? The stringsaaaaaa
,bb
, andccccc
, for instance, should match it, but notab
orbc
. (There’s no “good” way of doing it with the regexes we’ve seen.)
To test regexes, you can use grep(1)
. Its first argument is a regex and it
reads from the standard input by default. grep(1)
will print the matching
input lines:
$ echo "lorem 314" | grep '^[^0-9]*1'
$ echo "ipsum 159" | grep '^[^0-9]*1'
ipsum 159
$ grep '[a-z]'
NOTREPRINTED BY GREP
Reprinted by grep
Reprinted by grep
314159 (NO LOWERCASE)
314159 (with lowercase)
314159 (with lowercase)
(Note that the repetitions are printed by grep(1)
.)
Programming languages will always implement at least the regexes we just saw.
However, many will implement all sorts of extensions, see for instance the
Python module for regex and the info page for grep(1)
(run info grep
then go to
section 3).
—Note
The backslashes in grouping and disjunction (e.g.,\(a\|b\)
) are cumbersome.
The simplest extension of regexes drops them (grep -E
implements that, and so
does Python). Further extensions allow nongreedy iterations (e.g., ^.*a
matches the last a
, as .*
is taken for as long as possible; some languages
implement ^.*?a
to say “match .*
on the shortest possible string”), and
backward references (with Perl-style regexes, that grep(1)
understands with the
-P
flag, the regex ^(.)\1*$
matches any repeated sequence of the same
letter). To go further, you can look at the regex howto of Python.
4 Part 1: Most frequent line
In this part, you will write a script that reads lines from the standard input and prints out the line that appears most often at the end.
4.1 Example use
$ ./maxocc.sh
lorem
ipsum
dolor
ipsum
sit
amet
sit
ipsum
[Ctrl-d]
ipsum
4.2 sort
, uniq
, tail
, sed
The solution (or, really, a solution) for this part combines a few programs on a single line, using pipes.
sort(1)
: We’ve seen that one, but as with every program, options can change its behavior. Usingsort -n
, the program will sort according to the numerical value of the line, so that if a line starts with34 xyz...
it appears before the line starting with188 xyz...
, even though the latter would appear first in a dictionary, as1
appears before3
. Other very useful options are-u
that flushes the repeating lines and-r
that reverses the order.uniq(1)
: Without options, this deletes repeated adjacent lines, so as to keep only a unique occurrence of each. Note that In particularsort -u
andsort | uniq
do the same thing. With option-c
, it counts how many occurrences of the repeated lines appear:$ uniq -c hello dup dup hello [Ctrl-d] 1 hello 2 dup 1 hello
tail(1)
: The programshead(1)
andtail(1)
print the first few or last few lines of a file (or the standard input). The number of lines is indicated with the-n
option. The programtail(1)
has a very useful option-f
to follow a file (not used in this lab): every time the file grows,tail(1)
would print the new lines; this is often used to watch log files.$ tail -n 1 hello world [Ctrl-d] world
sed(1)
: We could do a whole lab onsed(1)
(and its siblingsperl(1)
andawk(1)
), and I encourage you to look at the info pages to better understand the power of these tools. We will use it in its most common form: substitute a pattern with a string in each line of a file. The commandsed(1)
takes as first argument its instructions and reads from the standard input (if no other arguments are provided), applying the program on each input line. The only instruction we will use iss/PATTERN/REPLACEMENT/
which replaces substrings matchingPATTERN
withREPLACEMENT
(only the first match within the line is replaced by default). For instance, if we want to prefix each line withX:
, we can use:$ cat myfile lorem ipsum dolor $ sed 's/^/X:/' < myfile X:lorem X:ipsum X:dolor
If we want to erase a bunch of spaces, a bunch of digits, and a bunch of spaces at the beginning of each line, we can do:
$ echo ' 18 lorem' | sed 's/^ *[0-9]* *//' lorem
4.3 Writing the script
Your script should be in the file maxocc.sh
. As it should read from the
standard input, the first command you start can simply read from the standard
input. For instance, if your script only contains the command sort
, then
it will behave like sort(1)
.
The strategy to obtain the line that appears most often is to:
5 Part 2: Print all arguments
In this part, you will write a script that prints all of its arguments, one per line.
5.1 Example use
$ ./println.sh a b c
a
b
c
$ ./println.sh a 'b c'
a
b c
$ ./println.sh a'\ x' 'b c'
a\ x
b c
5.2 The variables $#
and $@
Shell scripts can define variables and use them:
$ var=value; echo $var
value
$ var2=$var; echo $var2
value
In a shell script written in a file, there are two variables defined by
default, @
and #
. The first one is a special variable that contains each
of the arguments to the script, the second one contains the number of
arguments (argv
and argc
in C). For instance, suppose we have a script
called args.sh
that contains the following:
#!/bin/zsh -f echo Number of args: $# echo These args are: $@
Then executing it with different arguments give:
$ ./args.sh
Number of args: 0
These args are:
$ ./args.sh hello world
Number of args: 2
These args are: hello world
$ ./args.sh 'hello world'
Number of args: 1
These args are: hello world
Note that hello world
is two arguments, while 'hello world'
(or "hello
world"
) is one argument. Note that using echo "$@"
doesn’t allow us to
distinguish between these two cases. Separating the arguments with newlines
would.
To access a specific argument, say the third, one can use either $3
(in
classic notation) or $@[3]
(in array notation, starting at 1).
5.3 for
loops
The Z shell and other shells have a vast diversity of different for
loop
constructions, but we will focus on the classical one. Its syntax is:
for elt in elt1 elt2 elt3 elt4; do # use $elt done
The list of items can be replaced by:
- An array: The only array we will be using is
$@
, the array of arguments to your program. To avoid any weird behavior with empty arguments (myprog '' ''
has two arguments!), we should usefor elt in "$@";
instead offor elt in $@;
. - The output of a program (not used in this lab). The syntax is then
for elt in $(cmd);
. To “split” the output ofcmd
into elements, the shell uses a special variable$IFS
that contains the list of characters at which it should split. - A globbing pattern that matches filenames (not used in this lab), for
instance
*
that matches all filenames, or*.h
which matches all filenames with extension.h
.
5.4 Writing the script
Your script should be in the file println.sh
. You simply have to iterate
through the array $@
and print each of the items you see. To print them
even if they contain special characters like backslashes, you will have to
use echo -E
(see Quoting).
6 Part 3: Lowercase
The script for this part should turn its input(s) to lowercase. It follows the usual UNIX convention:
- If no argument is provided, it reads the standard input, and outputs the lowercased version on the standard output.
- If arguments are provided, these are files that should be read, and their contents printed on the standard output, in lowercase.
6.1 Example use
$ echo 'lOrem' | ./lowercase.sh
lorem
$ cat t1
HELLo
worlD
$ ./lowercase.sh t1 t1
hello
world
hello
world
6.2 tr
, if
tr(1)
: Shell scripting is very much line driven, andtr(1)
is one standard command that breaks that rule. The commandtr(1)
translates a set of characters to another one. The set of characters can be expressed like the ranges of the regexes[a-z]
. For instance, uppercasing a word can be done using:$ echo hello | tr 'a-z' 'A-Z' HELLO
The \(i\)-th element of the first set is mapped to the \(i\)-th element of the second (here,
k
, say, is mapped toK
). Useful options (not for this lab) include-d
, used to delete a set of characters, and-c
, used to complement a set. The option-s
will be seen in the Bonus Part.if
: This builtin of the shell behaves like a normalif
statement. Its syntax is:if CONDITION; then ... else ... fi
In there,
CONDITION
can either be:A program. The return value of a program is a small integer (0 to 127). In your script, you can exit with a return value
n
usingexit n
. In shell parlance, a program succeeds when it returns0
and fails otherwise. This means that the true part of theif
statement is executed if the program returns0
. There are two standard builtins calledtrue(1)
andfalse(1)
that just return0
and1
, respectively, hence:$ if true; then echo hello; fi hello $ if false; then echo X; else echo Y; fi Y
After executing any statement, the shell stores the return value of the previous program in the variable
$?
:$ true; echo $? 0 $ false; echo $? 1
- A numerical condition
(( COND ))
. For instance, one can use(( $# == 0 ))
or(( 43 > 88 ))
. Numerical conditions are fairly powerful, as they can also use for instance the ternary conditional operator. - A test condition
[[ COND ]]
(used once in this lab, in the next exercise;[[
is an alias for the programtest(1)
). This is used to test file properties and properties of strings. For instance,[[ -e myfile ]]
tests that a file namedmyfile
exists, and[[ $var == "" ]]
tests that$var
is the empty string. A logical combination of the above (not used in this lab), using
||
(or)&&
(and) and!
(not):$ if true && ! false; then echo X; else echo Y; fi X
6.3 Writing the script
Your script should be in the file lowercase.sh
. You first want to test
whether there are arguments provided to your program; if there is none, just
call tr(1)
, which will read the standard input. If not, you should iterate
through all the arguments provided and use tr(1)
on each. Funnily enough,
tr(1)
does not follow the convention of accepting filenames as arguments, so
you should replace its standard input with each of the input filenames, in
turn.
7 Part 4: Validate input
For this part, your program should check that it receives exactly one
argument, and verify that it is alphanumeric using grep(1)
. On success, it
will print accepted
and return 0
, otherwise it will print rejected
and
return 1
.
7.1 Example use
$ ./validate.sh
usage: ./validate.sh PASSWD
$ ./validate.sh a b c
usage: ./validate.sh PASSWD
$ ./validate.sh loremipsum
accepted
$ echo $?
0
$ ./validate.sh to+be_rejected
rejected
$ echo $?
1
7.2 grep
, variable as standard input
grep(1)
: We have quickly seengrep(1)
before: it lists all the lines on its input that match a given regex. To use it as a “yes/no” program, we will use the-q
option. With this option (standing for quiet),grep(1)
will exit immediately with return value0
if any match is found, and with value1
if no match is found. For instance, checking that the input contains ana
:$ echo 'hello' | grep -q 'a'; echo $? 1 $ echo 'target' | grep -q 'a'; echo $? 0
- Variables as standard input: You will receive the argument as the variable
$1
(that is, the first element of the array$@
of arguments). To pass it on the standard input of a commandcmd
, the usual way is withecho "$1" | cmd
. For technical reasons, it is not efficient to use a pipe simply to put a variable on the standard input; modern shells implement the better-looking constructcmd <<< $1
. Can you explain the difference between this andcmd < $1
?—Note
The “technical reasons” mentioned above have to do with how pipes are traditionally implemented. In a commandcmd1 | cmd2
, the shell would create two new processes, forcmd1
andcmd2
, and branch the standard output ofcmd1
to the standard input ofcmd2
, which is a lot of work. Avoiding doing this just forecho(1)
can save a bit of time. More on this “branching” and “creating processes” during the session!
7.3 Writing the script
Your script should be in the file validate.sh
. You will first test that
the number of arguments is correct; if it isn’t, you should print the
message usage: ./validate.sh PASSWD
and exit with code 1
. Note that it is
bad practice to write ./validate.sh
verbatim in your script, as your
program may be called from somewhere else. You should use the
variable $0
which contains the name of the program:
$ cat t.sh
echo $0
$ ./t.sh
./t.sh
$ cd child_directory
$ ../t.sh
../t.sh
$ cd ..
$ mv t.sh othername.sh
$ ./othername.sh
./othername.sh
$ ././././othername.sh
././././othername.sh
After this, you should check that your only argument is alphanumeric using
grep -q
as the condition of an if
statement. If it is, you should print
accepted
and exit with code 0
, otherwise, print rejected
with code 1
.
8 Part 5: REPL (Read, Evaluate, Print, Loop)
In this part, you will implement a mid-layer between the user and the shell.
Your script should prompt the user for input using >
, read the user input,
evaluate it as a command, and print the outputs of the command. In addition,
it should log its activity in a file named log
, prefixing input commands
with 0:
, the standard output with 1:
, and the standard error output with
2:
.
8.1 Example use
$ rm -f log # delete the file log
$ ./repl.sh
> echo hello\ world
hello world
> cat existing nonexisting
I'm the contents of the file 'existing'
cat: nonexisting: No such file or directory
> [Ctrl-d]
$ cat log
0:echo hello\ world
1:hello world
0:cat existing nonexisting
1:I'm the contents of the file 'existing'
2:cat: nonexisting: No such file or directory
8.2 while
, read
, eval
, redirecting output to file
while
: Your main looping mechanism in this script is a while loop, that will say “as long as there is input from the user.” The syntax is:while CONDITION; do ... done
The
CONDITION
is the same as in anif
statement.read(1)
: The builtinread(1)
reads one line of standard input and puts it in a variable. Consider:$ echo hello | read var $ echo $var hello
To avoid the input to be treated specially (with backslashes, field splitting, etc.),
read(1)
takes the option-r
(for raw):$ echo -E 'read\ me' | read var; echo -E $var read me $ echo -E 'read\ me' | read -r var; echo -E $var read\ me
The command
read(1)
returns a nonzero status when reaching the end of the input, there is thus this very idiomatic use of it to read each line of the standard input:while read -r line; do # use $line done
—Note
To read from a file instead, we would have writtendone < myfile
instead ofdone
, or equivalently,< myfile while
instead ofwhile
. If we had writtenwhile read -r line < myfile
instead, every timeread(1)
would have been called,myfile
would have been reopened, and this would result in an infinite loop.eval
: This builtin takes one argument and executes it as is, as if interpreted by the shell. Consider:$ var='t=8; echo "$t\ x" | sed s/8/9/' $ echo -E $var t=8; echo "$t\ x" | sed s/8/9/ $ eval $var 9\ x
Redirecting output to file: We have seen one way to redirect the output of a program, using pipes. To redirect to a file we use the
>
redirection:$ echo hello > myfile $ cat myfile hello
The
>
redirection first erases the file, then writes to it. To append to a file, we use>>
:$ echo hello > myfile; echo world > myfile; cat myfile world $ echo wide >> myfile; echo web >> myfile; cat myfile world wide web
—Note
There is an alternative, that comes in handy in multiple situations. The programtee(1)
reads its standard input and writes it to both the standard output and to a provided file. Consider:$ echo hello | tee myfile hello $ cat myfile hello
Programs can output on the standard output (
echo hello
) or the error output (echo hello >&2
, not used in this lab). To redirect the error output to a file, we use2>
instead of>
:$ cat t.sh echo 'to std out' echo 'to error out' >&2 $ ./t.sh 2> stderr to std out $ cat stderr to std err $ ./t.sh > stdout 2> stderr $ cat stdout to std out $ cat stderr to error out
—Note
When outputting to the error output (with>&2
) or taking the error output to a file (with2>
), the number2
indicates the so-called file descriptor of the output. Three numbers are reserved:0
is the standard input,1
the standard output, and2
the error output. More on this later in the session.
8.3 Writing the script
Your script should be in the file repl.sh
. It will first print the prompt,
using echo -n '> '
(the -n
option does not add a trailing newline, and
the quotes are necessary since >
is a special symbol for the shell and we
need a space). After this, we enter the loop, reading each line of the
standard input using the while read
construction. In the body of the loop,
we:
- Evaluate the line we just read, redirecting its standard output to a
temporary file, and the error output to another. You can use
/tmp/out
and/tmp/err
, or look into the manpage ofmktemp(1)
. - The user should see the outputs, so we
cat(1)
both temporary files (the standard output first); both outputs go on the standard output. - We populate the file
log
: we write the line we received, preceded with0:
, this can be done usingecho -E "0:$line"
and the appropriate redirection. Then we take care of the standard and error outputs of the command just evaluated. We can usesed(1)
to add a prefix to each line of/tmp/out
and/tmp/err
and redirect them to the filelog
(see Part 1 for how to usesed(1)
to do this). - We can now print the prompt
>
again.
9 Part 6: Most similar files
In this part, your script is provided with at least 3 files in argument, and
should return the pair that is “most similar.” Our definition of “most
similar” is the two files that differ at the furthest character. For
instance, lorem
and loram
differ at the fourth character, while lorem
and lorel
do so at the fifth, so they are more similar.
9.1 Example use
$ grep . f* # print the contents of all the files with name f...
f1:lolem ipsum dolor sit amet
f2:lorem ipsam dolor sit amet
f3:lorem ipsum dolqr sit amet
f4:lorem ipsum dolor sit amet
f5:lorem ipsum dolor sit apet
$ ./mostsim.sh f*
f4
f5
Indeed, the files f4
and f5
differ on character 24, all others differ
before that. You are guaranteed that the pair of files that are
most similar is unique. When a file appears twice, we say that it differs at
its size plus one:
$ ./mostsim.sh f* f2
f2
f2
$ echo hello > f6
$ ./mostsim.sh f4 f5 f6 f6
f4
f5
The files f6
and f6
are only similar “up to” byte 6.
9.2 cmp
, wc
, seq
, more on variables
cmp(1)
: This is the most basic tool to check whether two files differ. It simply tells you if they do, and where is the first mismatch; it exits with error code0
and no output only if they have the same contents. More involved commands, such asdiff(1)
, list the differences between two files, and can be extra useful when you have multiple versions of a file and you’re searching for a change that broke a feature. Here is an example use ofcmp(1)
:$ cmp f4 f5 f4 f5 differ: byte 24, line 1 $ echo $? 1 $ cmp f4 f4 $ echo $? 0
Note that the output is always the same. In particular, to retrieve the byte indicated by
cmp(1)
, one can replace everything up to and includingbyte
with the empty string (insed(1)
parlance,s/^.*byte //
), and everything including and after the comma with the empty string (insed(1)
parlance,s/,.*$//
).wc(1)
: This command provides statistics on the number of bytes, words, and lines of a file:$ wc myfile 3 3 15 myfile
In this part, we are only interested in the
-c
option, that counts the number of bytes (c
is for character). To have a clean output, without extra spaces and the filename, we can use the following trick:$ wc -c < myfile 15
This is the filesize of the file. This will be used in case
cmp(1)
does not return any information, meaning that its two input files have the same contents.seq(1)
: You will have to compare each pair of files from$@
. To avoid comparing a file with itself, or comparing them twice, we will address$@
as an array (indexed from 1). Theseq(1)
command prints the sequence of numbers between two integers:$ seq 1 10 1 2 3 4 5 6 7 8 9 10 $ i=30; seq $((i + 1)) 34 31 32 33 34
This last example also illustrates how simple arithmetic computations can be made (although you shouldn’t need more than what is in that example). You can then iterate through the indices of
$@
usingfor idx in $(seq 1 $#); do ...; done
.—Note
Usingseq(1)
is… old school. Modern alternatives are:- To use a range:
{1..5}
evaluates to the string1 2 3 4 5
, and can be used asfor idx in {1..$#}; do ...; done
. Use an arithmetic for loop:
for (( i = 1; i <= $#; ++i )) do # use $@[i] done
- To use a range:
More on variables. Setting a variable is done using the syntax
var=value
, as we’ve seen. We are not allowed to use spaces around the equal sign (var = value
in a shell means: execute the programvar
with two arguments,=
andvalue
). To set a variable to the output of a program, we can either useread(1)
as above, or the more idiomatic:var=$(cmd)
. Note thatcmd
can be any command, for instance:$ var=$(cmp f1 f2 | sed 's/byte/char/'); echo $var f1 f2 differ: char 3, line 1
Similarly, to set a variable to, say, the third argument of the script, one can use
var=$@[3]
(orvar=$3
).
9.3 Writing the script
Your script should be in the file mostsim.sh
. First, as in any computation
of a maximum, we define a variable that will contain the current maximum
number of common bytes between two files; we can set it to -1
at first.
Then two nested loops will iterate through $@
, the outer loop with index i
from 1
to
$# - 1
, the inner loop with index j
from i + 1
to $#
.
In the body of the inner loop, we compute the “similarity” of the files at
$@[i]
and $@[j]
(note that $
is not needed in array indices, but
$@[$i]
would still work). To compute this value:
- We set a variable to the output of
cmp(1)
on the files$@[i]
and$@[j]
, where we isolate withsed(1)
the byte number of the first difference (refer tocmp(1)
above to see howsed(1)
can do that). - If that variable ends up empty (e.g.,
[[ $var == "" ]]
), this means that the two files have same contents. In that case, we set our variable to the filesize of, say,$@[i]
, plus one. - Now we can check if our variable is bigger than the current maximum:
if (( var > curmax )); then
. If it is, then we can save$i
and$j
in some variables$save1
and$save2
, then update our current maximum.
At the end, we can print the two filenames at our saved positions within
$@
, using echo $@[save1]; echo $@[save2]
.
—Note
As we have seen, arithmetic is done through$((math expr))
. Everything
behaves pretty much like you would expect, including when mixing floating
point numbers:
$ var=3
$ echo $((var / 2))
1
$ echo $((var / 2.0))
1.5
Note that the $
sign in front of variables is optional and that spaces
don’t matter. One bug you may experience is with the expression $(($# -
1))
. This works as expected, but not if you remove the spaces; this is
because $#-
is a variable on its own (its meaning is not important) and
$#-1
is then interpreted as that variable followed by 1
. Hence
make sure to use either $((# - 1))
(with our without spaces) or $(($# -
1))
(with spaces).
10 Bonus Part: Reformat the output of wc
The output of wc(1)
is human-friendly, but it may not be the best format for,
say, a spreadsheet application. In this part, we will reformat the output of
wc(1)
. I propose one strategy to do it, but you are free to use any strategy
that works.
10.1 Example use
$ wc *.sh
9 25 124 lowercase.sh
4 14 94 maxocc.sh
18 52 382 mostsim.sh
5 11 51 println.sh
15 47 297 repl.sh
13 32 188 validate.sh
9 26 176 wcreformat.sh
73 207 1312 total
$ ./wcreformat.sh *.sh
124,25,9,lowercase.sh
94,14,4,maxocc.sh
382,52,18,mostsim.sh
51,11,5,println.sh
297,47,15,repl.sh
188,32,13,validate.sh
176,26,9,wcreformat.sh
As you can see, the new format gives, for each file, its number of bytes,
words, and lines (which is the reversed order from wc(1)
), and then the
filename, all separated with commas instead of spaces.
10.2 tr -s
, cut
, case
, and wc "$@"
tr -s
: This option oftr(1)
“squeezes” the characters that are provided. The output ofwc(1)
, as seen in the example above, adds a variable number of spaces between the columns, which makes field splitting a bit awkward. By usingtr -s
, we avoid this:$ echo " 3 1 4 1 5" | tr -s ' ' 3 1 4 1 5
cut(1)
: This command extracts fields from each input line. To indicate how fields appear, we indicated a delimiter using the-d
option; to select a field, we use the-f
option:$ echo 'hello:world' | cut -d':' -f1 hello $ echo 'hello:world' | cut -d':' -f2 world
—Note
Modern shells have an array type, that is less error-prone than trusting the shell to “do the right thing” when it comes to splitting. To convert a string into an array, we indicate how we want to split it. The syntax is a bit awkward, and oftentimescut(1)
does the job. But in a case where you’d be interested in extracting multiple fields, it may be better to split with the shell feature; the syntax reads:$ mystring='split:me::at:colons' $ ar=("${(@s/:/)mystring}") $ echo "1: $ar[1], 2: $ar[2], 3: $ar[3], 4: $ar[4], 5: $ar[5]" 1: split, 2: me, 3: , 4: at, 5: colons
The
:
in the splitting can be changed to any separator.case
: Thecase
switch in shell scripting is used to compare a string against very simple patterns (globbing, just like for files). For instance:case $myvar; in a*) echo 'myvar starts with a';; *k*) echo 'myvar contains a k' echo 'but it does not start with a!';; [0-9]) echo 'myvar is a single digit';; lorem) echo 'myvar is the string lorem';; *) echo 'myvar is something else';; esac
Note the double semicolon at the end of the sequence of instructions for each case. This is somewhat equivalent to the
break
instruction in C, but it is mandatory here.wc "$@"
: The syntax"$@"
means “reproduce all the arguments to the current script as is.” Hence the commandwc "$@"
passes all the arguments towc(1)
. Note that whenwc(1)
receives more than one file, it adds an extra line showing the “total” of bytes, words, and lines. If you are reading the output ofwc(1)
line by line, you may want to skip this extra line if it appears. One way to do it is to check whether the current line ends withtotal
, using acase
switch.
10.3 Writing the script
Your script should be in the file wcreformat.sh
. Here is a possible
strategy:
- Use
wc "$@"
to pass all of your arguments towc(1)
. - Pipe this into a
tr(1)
that would squeeze the spaces. - Pipe this into a
while read
loop (yes, you can do that!). - In the loop, check that you are not reading the
total
line. If not, then split the line into its four fields using fourcut <<< $line
, and print them in the right order, separated with commas.
11 Evaluation
11.1 Points
The parts are graded as follows:
- Parts 1 to 6 are worth 10 points each,
- The bonus part is worth 20 points.
11.2 Driver
This project is evaluated by a test driver. You can run the full driver by typing, at the root of your project:
$ make test
This moves to the directory tests
and runs the driver ./driver.sh
.
—Note
The driver itself is a shell script. All of your labs are evaluated using shell scripts and you have full access to the sources. They use, at times, advanced constructions, but reading them can be educational.As you work on each successive parts, you may want to run the driver on each part separately and on smaller test files:
$ cd tests/
$ ./driver.sh -h
usage: ./driver.sh [-P PART]...
Grade the dictlab.
-P PART Grade only part PART (1-6, B). Default is all.
$ ./driver.sh -P 1
* PART 1: MOST FREQUENT LINE
...
PART 1 SCORE 10 / 10
FINAL 10 / 10
For each test, the driver prints what is the command executed. For instance, when the driver prints:
[OK] ../src/maxocc.sh < files/part1-3
it ran ../src/maxocc.sh
with files/part1-3
as input.
The result, for each test, is readily printed; in bracket, there’s a shorthand notation for the result:
OK
: Test passedKO
: Test failed
—Note
The driver is executed from thetest(1)
directory, hence to access your scripts,
it needs to call them as, for instance, ../src/maxocc.sh
, which means “go to the parent
directory, then to the src
directory, then find maxocc.sh
.”
12 Handing in Your Work
When you have completed the lab, you will submit it as follows:
$ make submit
This will package your whole src/
folder into a tarball that is saved in
~/submissions/
, where the instructor will get it. Since this is simply saved
on your account, you can use make submit
as many times as you want up until
the deadline.