Introduction


Today is about working with a command line interface. There are many shell variants, and we will be working with Bash (the Bourne Again SHell). This is the default shell on most Linux distributions, ships with macOS (where zsh is now the default, but behaves very similarly) and needs to be installed separately on Windows. It is easiest to work with a Unix-based operating system, such as Linux or macOS, when coding. However, I do realise that many of you will be working with Windows.

If you want to know more about Linux then you are welcome to come and speak to me about it. I do most of my work in a Linux environment and I think programming is much easier in Linux and macOS. I use Manjaro, but I think that Pop!_OS is a good Linux distribution to start with. It is quite similar to macOS and you don’t need to use the command line as frequently as with other Linux distributions.

In this section we will look at how to set up a project as well as how to work with the command line. You might be used to working with graphical user interfaces for most of your coding career, but it is useful to know how to work through the shell. This is not something that is often taught in economics programmes, but if you are looking at a career in data science it can be very useful. We will start by looking at the basic file structure of a computer. Here are some of the reasons why we might be interested in using the shell,

  1. Power
  2. Reproducibility
  3. Interacting with servers and super computers (NB)
  4. Automating workflow and analysis pipelines

The primary references for this section are the notes from Merely Useful and the slides from Grant McDermott.

Note: The output displayed will be for the file structure on my computer. For your PC things will obviously be different.

Motivation

Let us quickly talk about some things that we can use the shell for.

  • renaming and moving files en masse
  • finding things on the computer
  • combining and manipulating PDFs
  • installing and updating software
  • scheduling tasks
  • monitoring resources on your system
  • connecting to cloud environments
  • running jobs on super computers

There are many more examples, but these are some of the fundamental things that make the command line useful. I think the following quote encapsulates my thoughts on the command line,

As data scientists, we work with a lot of data. This data is often stored in files. It is important to know how to work with files (and the directories they live in) on the command line. Every action that you can do using a GUI, you can do with command-line tools (and much more).

Hello World!


Once you have opened your terminal it should look something like this,

I am using Linux, so obviously my terminal is going to look different from yours. Mine is customised to look the way I like it. If you are using Windows it will probably look something more along the lines of this,

The first thing that you want to do is open up the Bash shell. You can do this through the built-in terminal in RStudio if you prefer. I will quickly demonstrate this in class.

To start, when Bash runs it presents us with a prompt to indicate that it is waiting for us to type something. This prompt is a simple dollar sign by default:

$

For our very first command in the terminal we will enter echo 'Hello World!' in the command line and see what happens.

$ echo 'Hello World!'
## Hello World!

You will see the syntax above for the rest of the notes. The first part is the command that I have entered and below we will see the output from issuing that command.

Exercise

  1. Try and print your own name to the terminal.
  2. Clear your terminal using the clear command.

Listing files


Now that we have that pesky first command out of the way, let us continue our journey with shell commands. The following commands will let us explore our folders and files, and will also introduce us to several conventions that most Unix tools follow.

The first thing I normally do is check my current working directory and list the files located in this directory. You have encountered the idea of a current working directory in R, so this should be familiar. We can do this by running the pwd command, which is short for print working directory. (Notice that we only run the command; there are no options or arguments.)

$ pwd
## /home/dawie/Dropbox/2022/871-data-science/DataScience-871/01-shell
$ type pwd
## pwd is a shell builtin

On my computer the current working directory is the 01-shell folder. We can also see that pwd is a built-in command of the shell.

It is important to note that all Bash commands have the same basic syntax – command, option(s), argument(s). Below is a picture to illustrate the point,

The options and arguments are not always needed, but you will need a command (such as ls). The options start with a dash and are usually one letter. Options are sometimes referred to as switches or flags. Arguments tell the command what to operate on, so this is usually a file, path or set of files and folders.
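
As a quick, concrete breakdown of this anatomy (using the ls command and the images/ folder from my own directory as the argument),

$ ls -F images/   # ls = command, -F = option (classify entries by type), images/ = argument (what to list)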

Let us see what happens if we type the command and option from above into the shell,

$ ls -F 
## 01-shell.html
## 01-shell.Rmd
## images/

The ls command lists the files in the current working directory, while the -F option classifies the entries by type (appending / to directories, for example).

Note: Directory names in a path are separated with / on Unix, but \ on Windows.

We can combine other options like -a, -l and -h. The -a shows all files and folders (including hidden ones that start with a dot), -l gives the long format and -h makes the sizes human readable.

$ ls -l -a -h 
## total 1.5M
## drwxr-xr-x  3 dawie dawie 4.0K Feb 13 13:19 .
## drwxr-xr-x 18 dawie dawie 4.0K Feb 24 13:42 ..
## -rw-r--r--  1 dawie dawie 1.5M Feb 13 13:19 01-shell.html
## -rw-r--r--  1 dawie dawie  39K Mar  1 16:25 01-shell.Rmd
## drwxr-xr-x  2 dawie dawie 4.0K Feb 12 21:12 images

You might notice that there is a lot of information here. Let us analyse this one piece at a time.

The first column indicates the object type. It can either be a (d) directory / folder, (l) link or (-) file. In the case of the first line, we see that this is a directory.

The following nine characters indicate the permissions associated with the object for the three user types (owner, group and everyone else). In each case we can have r (read), w (write) or x (execute) access. A - indicates a missing permission.

The remaining columns show the number of hard links to the object, the owner and group of the object, and then some descriptive elements (size, date of last modification and name).

In terms of the first row, the output shows a special directory called ., which is the current working directory. In the second row, we have .. which means the directory that contains the current one. This is referred to as the parent directory.

Beyond the first two rows, the rest of the objects are the contents of the directory: an HTML file, an RMarkdown file and the images directory.
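
To make the permission string concrete, take the images entry from the listing above; its first field, drwxr-xr-x, breaks down as,

d    – the object is a directory
rwx  – the owner (dawie) may read, write and execute (enter) it
r-x  – the group may read and enter it, but not write to it
r-x  – everyone else may read and enter it, but not write to it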

The manual


It is worth noting that the options for most commands, such as ls, are documented in the manual. We can access the manual page for a command in the following way,

$ man ls | head -n 30 # head lets us look at the first 30 lines here
## LS(1)                            User Commands                           LS(1)
## 
## NAME
##        ls - list directory contents
## 
## SYNOPSIS
##        ls [OPTION]... [FILE]...
## 
## DESCRIPTION
##        List  information  about  the FILEs (the current directory by default).
##        Sort entries alphabetically if none of -cftuvSUX nor --sort  is  speci‐
##        fied.
## 
##        Mandatory  arguments  to  long  options are mandatory for short options
##        too.
## 
##        -a, --all
##               do not ignore entries starting with .
## 
##        -A, --almost-all
##               do not list implied . and ..
## 
##        --author
##               with -l, print the author of each file
## 
##        -b, --escape
##               print C-style escapes for nongraphic characters
## 
##        --block-size=SIZE
##               with  -l,  scale  sizes  by  SIZE  when  printing  them;   e.g.,

Creating directories


In the next part we will be creating some files and directories that relate to our research project. In order to do this we use the command mkdir, which is short for make directory.

$ cd ../research-project
$ rmdir new_dir # remove in case it already exists
$ mkdir new_dir
$ ls
## rmdir: failed to remove 'new_dir': No such file or directory
## bin
## data
## docs
## example1.txt
## example2.txt
## examples
## journal
## new_dir
## README.md
## results
## src

You should now be able to see the new_dir directory that we have created with this command. This is similar to creating a new folder, as you would with a graphical file explorer in your operating system of choice.

Naming directories


There are a few “rules” about naming directories that we can quickly mention.

  1. Don’t use spaces.
  2. Don’t begin the name with a dash.
  3. Stick to letters, digits, dashes and underscores for names.

Examples of bad names

  • Data Science 871 – Has spaces
  • -DataScience871 – Starts with a dash
  • #DataScienceLife – Don’t use hashtags!

Examples of “good” names

  • DataScience871
  • DataScience-871
  • datascience-871
  • data_science_871

I generally stick to lowercase with dashes, but that is a preference. It is simply easier to type.
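
To see why spaces in names cause trouble, consider the following (throwaway names, just for illustration),

$ mkdir Data Science 871      # creates THREE directories: Data, Science and 871
$ mkdir "Data Science 871"    # the quotes create a single directory, but now you have to quote the name every time you use it
$ rmdir Data Science 871 "Data Science 871"   # clean up again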

Using touch and rm


Let us get back to our example with the new directory we just created.

$ cd ../research-project
$ ls new_dir

You will see that this directory is empty, so let us create a file and put it into this new directory. The file’s name is going to be draft.txt. The .txt extension indicates that this will be a text file. In order to create an empty file we can use the touch command.

$ cd ../research-project/new_dir
$ touch draft.txt
$ ls
## draft.txt

We can also delete the objects that we created with the rm command.

NB: Please note that there is no undo option with the shell. If you remove something it is permanently deleted. So make very sure about your actions before committing to them. This is one of the reasons why version control is so important, which we will discuss in the next lecture.

$ cd ../research-project/new_dir
$ rm draft.txt
$ ls

Exercise

What happens when we execute rm -i new_dir/draft.txt? Why would we want this protection when using rm?

If you want to delete the entire directory then you can use rmdir. If there are still files in the directory, rmdir will refuse and warn you that this might not be the best idea. If there are files and you really do want to remove the directory, then you can use rm with the recursive option -r.

$ cd ../research-project
$ rmdir new_dir
$ ls
## bin
## data
## docs
## example1.txt
## example2.txt
## examples
## journal
## README.md
## results
## src

You will note that new_dir is now gone from the list.

Copying and renaming


Another important command is cp, for copying files. Let us create some example files in a sub-directory and then copy one of them across to another folder.

$ cd ../research-project
$ #rm -r examples # delete in case this already exists (just for this example)
$ #mkdir examples
$ cd examples
$ touch example1.txt example2.txt

Let us now copy example1.txt into the docs folder with a new name.

$ cd ../research-project
$ cp examples/example1.txt docs/doc1.txt
$ ls docs
## doc1.txt

We can also move and rename files with the mv command. This is similar to copying, but completely moves the file to a new location.

$ cd ../research-project
$ mv examples/example2.txt docs
$ ls docs
## doc1.txt
## example2.txt

We can also move the file back to its original location.

$ cd ../research-project
$ mv docs/example2.txt examples
$ ls docs
## doc1.txt

If you are moving the object into the same directory but with a new object name, then you are effectively just renaming it.

$ cd ../research-project
$ mv docs/doc1.txt docs/doc_new.txt
$ ls docs
## doc_new.txt

There is a more convenient way to do this: the rename command. The syntax here is pattern, replacement, file(s). (Be aware that some systems ship a Perl-based rename with a different syntax; the version used here is the util-linux one.)

$ cd ../research-project
$ rename txt csv docs/doc_new.txt
$ ls docs
## doc_new.csv

We can also easily revert our changes.

$ cd ../research-project
$ rename csv txt docs/doc_new.csv
$ mv docs/doc_new.txt docs/doc1.txt
$ ls docs
## doc1.txt

Wildcards * and ?


The place where rename becomes really useful is when we combine it with wildcards and regular expressions. You would have dealt with the concept of regular expressions in the first part of the course.

With these methods we could change all the .txt file extensions in the examples folder to .csv in one line.

$ cd ../research-project
$ rename txt csv examples/* # the star represents the wild card expression here
$ ls examples
## example1.csv
## example2.csv
## example3.csv
## hello.sh
## nursery.csv

We can then just as easily change it back, with a similar command. Make sure you understand what is happening here.

$ cd ../research-project
$ rename csv txt examples/* 
$ ls examples
## example1.txt
## example2.txt
## example3.txt
## hello.sh
## nursery.txt

Wildcards are special characters that can be used as a replacement for other characters. The two most important ones are the * and ?.

We haven’t mentioned the ? wildcard yet. It matches exactly one character in a filename, so ?.txt matches a.txt but not any.txt.
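
A quick illustration in our examples folder, assuming the three example files from earlier are still there,

$ cd ../research-project/examples
$ ls example?.txt   # ? matches exactly one character, so example1, example2 and example3 all match
## example1.txt
## example2.txt
## example3.txt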

Working with text files


Next we will take a look at some data from the Project Gutenberg website. We have some full ebooks in our data folder.

$ cd ../research-project
$ ls data
## dracula.txt
## frankenstein.txt
## jane_eyre.txt
## moby_dick.txt
## sense_and_sensibility.txt
## sherlock_holmes.txt
## time_machine.txt

There are some really interesting books in this folder. I recommend reading all of them if you have the time (especially Frankenstein!). If you don’t have a lot of time, read The Time Machine; it is quite short.

The word count command wc

Moving on, we can use the wc command, which is short for word count. It tells us how many lines, words and characters there are in a file.

$ cd ../research-project/data
$ wc moby_dick.txt
##   22331  215832 1253891 moby_dick.txt

We can’t actually run the wc command on the entire directory.

$ cd ../research-project
$ wc data
$ pwd
## wc: data: Is a directory
##       0       0       0 data
## /home/dawie/Dropbox/2022/871-data-science/DataScience-871/research-project

What if we wanted to get the word count for two books? Then we would just specify the filenames of both books as inputs,

$ cd ../research-project/data
$ wc moby_dick.txt frankenstein.txt
##   22331  215832 1253891 moby_dick.txt
##    7832   78100  442967 frankenstein.txt
##   30163  293932 1696858 total

If we wanted to know the word counts for all the books in the folder we could use wildcards.

$ cd ../research-project/data
$ wc *.txt
##   15975  164429  867222 dracula.txt
##    7832   78100  442967 frankenstein.txt
##   21054  188460 1049294 jane_eyre.txt
##   22331  215832 1253891 moby_dick.txt
##   13028  121593  693116 sense_and_sensibility.txt
##   13053  107536  581903 sherlock_holmes.txt
##    3582   35527  200928 time_machine.txt
##   96855  911477 5089321 total

We could also just show the number of lines in each file,

$ cd ../research-project/data
$ wc -l *.txt
##   15975 dracula.txt
##    7832 frankenstein.txt
##   21054 jane_eyre.txt
##   22331 moby_dick.txt
##   13028 sense_and_sensibility.txt
##   13053 sherlock_holmes.txt
##    3582 time_machine.txt
##   96855 total

The easiest way to print the contents of a text file is with the cat command, which is short for “concatenate”. cat will print all of the text. We don’t want all the text here, only the first few lines, so we use the head command instead. This is similar to the head() function in R.

$ cd ../research-project/data
$ head -n 18 frankenstein.txt
## Project Gutenberg's Frankenstein, by Mary Wollstonecraft (Godwin) Shelley
## 
## This eBook is for the use of anyone anywhere at no cost and with
## almost no restrictions whatsoever.  You may copy it, give it away or
## re-use it under the terms of the Project Gutenberg License included
## with this eBook or online at www.gutenberg.net
## 
## 
## 
## Title: Frankenstein
##        or The Modern Prometheus
## Author: Mary Wollstonecraft (Godwin) Shelley
## Editor:
## Release Date: June 17, 2008 [EBook #84]
## Posting Date:
## Last updated: January 13, 2018
## Language: English
## Character set encoding: UTF-8

Regular expressions

If we want to find specific patterns in the text, we can use regular-expression matching with grep, which stands for “global regular expression print”.

What if we wanted to find a famous line in Frankenstein? One such line is,

I was benevolent and good; misery made me a fiend.

We can search for the first part of the sentence as follows,

$ cd ../research-project/data
$ grep -n "I was benevolent" frankenstein.txt 
## 3119:alone am irrevocably excluded.  I was benevolent and good; misery made
## 3128:compassion?  Believe me, Frankenstein, I was benevolent; my soul glowed

We can look for specific words as well, such as fear.

$ cd ../research-project/data
$ grep -n "fear" frankenstein.txt | head -n 5
## 151:conquer all fear of danger or death and to induce me to commence this
## 340:the trembling sensation, half pleasurable and half fearful, with which
## 463:morning, fearing to encounter in the dark those large loose masses which
## 503:feared that his sufferings had deprived him of understanding.  When he
## 675:fear to encounter your unbelief, perhaps your ridicule; but many things

It seems there are a lot of fear-related words in the book! However, what if we are looking just for fear and not for words like fearful that contain it? In that case we can use the -w flag, which indicates that we want to match the whole word only.

$ cd ../research-project/data
$ grep -n -w "fear" frankenstein.txt | head -n 5
## 151:conquer all fear of danger or death and to induce me to commence this
## 675:fear to encounter your unbelief, perhaps your ridicule; but many things
## 1662:what I was doing.  My heart palpitated in the sickness of fear, and I
## 1666:  Doth walk in fear and dread,
## 1853:lake.  I fear that he will become an idler unless we yield the point

We can use grep to find patterns in a group of files as well. Say we want to find the word benevolent across all the books in the directory,

$ cd ../research-project
$ grep -R -l "benevolent" data/
## data/sherlock_holmes.txt
## data/moby_dick.txt
## data/frankenstein.txt
## data/jane_eyre.txt

It seems that only four books in our list contain the word. It should be noted that there are many options for grep. If you want to find out which options are available you can consult the manual page for the command using man.

One of the things that we could do is figure out how many lines in the books in our list contain the word benevolent.

$ cd ../research-project
$ grep -w -r "benevolent" data | wc -l
## 23

It looks like there are 23 lines in all of the books that contain this specific word.

One of the real powers of regular expressions is the fact that we aren’t limited to characters and strings when defining our search, we can also use metacharacters.

Metacharacters are characters that can be used to represent other characters. As an example, the period "." is a metacharacter that matches any single character. If we wanted to search Frankenstein for the character "a" followed by any character and then the character "x" we could use the following command,

$ cd ../research-project/data
$ grep -P "a.x" frankenstein.txt
## grow watchful with anxious thoughts, when a strange sight suddenly
## favourite was menaced, she could no longer control her anxiety. She
## succeeded. But my enthusiasm was checked by my anxiety, and I appeared
## of my toils.  With an anxiety that almost amounted to agony, I
## with the most anxious affection.  Poor Justine was very ill; but other
## “I have written myself into better spirits, dear cousin; but my anxiety
## letter: “I will write instantly and relieve them from the anxiety
## quitted the cottage that morning, and waited anxiously to discover from
## unutterable anxiety; and fear not but that when you are ready I shall
## and anxious to gain experience and instruction.  The difference of
## dreadful crime and avoided with shuddering anxiety any encounter with my
## tortured as I have been by anxious suspense; yet I hope to see peace in
## heart the anxiety that preyed there and entered with seeming earnestness
## shapes of objects, a thousand fears arose in my mind.  I was anxious
## destruction, and you will anxiously await my return. Years will pass, and

You see some nice words here like anxious and anxiety. Besides metacharacters that represent other characters, we also have quantifiers, which allow you to specify how many times the preceding expression should appear. The most basic quantifiers are + and *. The plus represents one or more occurrences of the preceding expression, so an expression like s+as means one or more s characters followed by as. Let us illustrate with an example,

$ cd ../research-project/data
$ grep -P "s+as" frankenstein.txt
## You will rejoice to hear that no disaster has accompanied the
## has been.  I do not know that the relation of my disasters will be
## “Come, Victor; not brooding thoughts of vengeance against the assassin,
## “your disaster is irreparable.  What do you intend to do?”
## a not less terrible, disaster.  I tried to calm Ernest; I enquired more
## some other topic than that of our disaster, had not Ernest exclaimed,
## assassinated, and the murderer escapes; he walks about the world free,
## and the irresistible, disastrous future imparted to me a kind of calm
## assassin of those most innocent victims; they died by my machinations.

Some examples here are disaster and assassin.

It is not possible to go through all the cool things you can do with regular expressions in our lecture, so you can go and read more on it here if you find this interesting.

Finding files

While the grep command helps us find things inside files, we can use find to locate the files themselves. Running find . lists everything in the current directory and all of its subdirectories.

$ cd ../research-project
$ find .
## .
## ./data
## ./data/dracula.txt
## ./data/sense_and_sensibility.txt
## ./data/sherlock_holmes.txt
## ./data/time_machine.txt
## ./data/moby_dick.txt
## ./data/frankenstein.txt
## ./data/jane_eyre.txt
## ./example1.txt
## ./bin
## ./bin/top_words.sh
## ./bin/book_summary.sh
## ./README.md
## ./docs
## ./docs/doc1.txt
## ./results
## ./journal
## ./journal/makefile
## ./example2.txt
## ./src
## ./src/top-words.R
## ./examples
## ./examples/example1.txt
## ./examples/example2.txt
## ./examples/nursery.txt
## ./examples/example3.txt
## ./examples/hello.sh

If we only wanted to find directories then we could do,

$ cd ../research-project
$ find . -type d
## .
## ./data
## ./bin
## ./docs
## ./results
## ./journal
## ./src
## ./examples

What about finding the files?

$ cd ../research-project
$ find . -type f
## ./data/dracula.txt
## ./data/sense_and_sensibility.txt
## ./data/sherlock_holmes.txt
## ./data/time_machine.txt
## ./data/moby_dick.txt
## ./data/frankenstein.txt
## ./data/jane_eyre.txt
## ./example1.txt
## ./bin/top_words.sh
## ./bin/book_summary.sh
## ./README.md
## ./docs/doc1.txt
## ./journal/makefile
## ./example2.txt
## ./src/top-words.R
## ./examples/example1.txt
## ./examples/example2.txt
## ./examples/nursery.txt
## ./examples/example3.txt
## ./examples/hello.sh

We could also find files of a certain type using the * wildcard.

$ cd ../research-project
$ find . -name "*.txt"
## ./data/dracula.txt
## ./data/sense_and_sensibility.txt
## ./data/sherlock_holmes.txt
## ./data/time_machine.txt
## ./data/moby_dick.txt
## ./data/frankenstein.txt
## ./data/jane_eyre.txt
## ./example1.txt
## ./docs/doc1.txt
## ./example2.txt
## ./examples/example1.txt
## ./examples/example2.txt
## ./examples/nursery.txt
## ./examples/example3.txt

We can combine some of the knowledge we have accumulated thus far to see how large the files are in terms of the number of lines. Here $(...) is command substitution: the output of find is substituted into the command line and handed to wc as its list of arguments,

$ cd ../research-project
$ wc -l $(find . -name "*.txt")
##   15975 ./data/dracula.txt
##   13028 ./data/sense_and_sensibility.txt
##   13053 ./data/sherlock_holmes.txt
##    3582 ./data/time_machine.txt
##   22331 ./data/moby_dick.txt
##    7832 ./data/frankenstein.txt
##   21054 ./data/jane_eyre.txt
##       0 ./example1.txt
##       0 ./docs/doc1.txt
##       0 ./example2.txt
##       0 ./examples/example1.txt
##       0 ./examples/example2.txt
##       2 ./examples/nursery.txt
##       2 ./examples/example3.txt
##   96859 total

You will often see find and grep used together. The following is a nice way to show the author of each of the books in the data/ directory,

$ cd ../research-project/data
$ grep "Author:" $(find . -name "*.txt")
## ./dracula.txt:Author: Bram Stoker
## ./sense_and_sensibility.txt:Author: Jane Austen
## ./sherlock_holmes.txt:Author: Arthur Conan Doyle
## ./time_machine.txt:Author: H. G. Wells
## ./moby_dick.txt:Author: Herman Melville
## ./frankenstein.txt:Author: Mary Wollstonecraft (Godwin) Shelley
## ./jane_eyre.txt:Author: Charlotte Bronte

Using awk and sed

Let us continue on our journey of manipulating text in the shell. Here we encounter two more tools: sed, a stream editor that is great for find-and-replace operations, and awk, a small programming language for processing text by fields.

$ cd ../research-project
$ cat examples/nursery.txt
## Jack and Jill
## Went up the hill

Let us now change “Jack” to “Bill”.

$ cd ../research-project
$ sed -i 's/Jack/Bill/g' examples/nursery.txt
$ cat examples/nursery.txt
## Bill and Jill
## Went up the hill

And then change it back.

$ cd ../research-project
$ sed -i 's/Bill/Jack/g' examples/nursery.txt
$ cat examples/nursery.txt
## Jack and Jill
## Went up the hill
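
Although the heading mentions awk, we have only used sed so far. As a minimal sketch, awk splits each line of its input into fields (words, by default) and lets you act on them; here it is on the same nursery rhyme,

$ cd ../research-project
$ awk '{ print $1 }' examples/nursery.txt   # print the first field of every line
## Jack
## Went
$ awk '{ print NF, $0 }' examples/nursery.txt   # NF is the number of fields on the line, $0 is the whole line
## 3 Jack and Jill
## 4 Went up the hill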

In this next example we are going to count the 10 most commonly used words in Frankenstein. We are going to be using the pipe operator |, similar to the one that you have used in R, to chain some commands together. We will also use the input redirect operator <.

$ cd ../research-project
$ sed -e 's/\s/\n/g' < data/frankenstein.txt | sort | uniq -c | sort -n -r | head -10
##    4056 the
##    3102 
##    2971 and
##    2741 of
##    2719 I
##    2142 to
##    1631 my
##    1394 a
##    1125 in
##     993 was

Let us talk a bit more about the pipe and redirect operators.

Pipes and redirect


You can send output from the shell to a file using the redirect operator >. We can print a message to the shell using the echo command.

$ echo "The shell really isn't all that scary. Or is it?"
## The shell really isn't all that scary. Or is it?

You can also save this output to a file by redirecting it to a particular filename.

$ cd ../research-project/examples
$ echo "Some text for the example file" > example3.txt
$ ls -a
## .
## ..
## example1.txt
## example2.txt
## example3.txt
## hello.sh
## nursery.txt

You can see that example3.txt has now been created, and if you open the file you will see the text we sent to it. You can append text to an existing file if you use >>.

$ cd ../research-project/examples
$ echo "Some more text for the example file! We appended!" >> example3.txt
$ cat example3.txt
## Some text for the example file
## Some more text for the example file! We appended!

A nice operation that you can do with the shell, when working with git, is to issue the following command, echo "*.csv" >> .gitignore. Can you figure out what this is going to do and why it is useful?

Next we discuss the pipe operator in a bit more detail. You should be familiar with how this works in R by now, but it is also a really cool feature of the shell.

We can easily chain together a string of simple operations and make it something quite complex. See the following example,

$ cd ../research-project
$ cat -n data/frankenstein.txt | head -n1002 | tail -n10
##    993   When I returned home my first care was to procure the whole works of this
##    994   author, and afterwards of Paracelsus and Albertus Magnus. I read and
##    995   studied the wild fancies of these writers with delight; they appeared to me
##    996   treasures known to few besides myself. I have described myself as always
##    997   having been imbued with a fervent longing to penetrate the secrets of
##    998   nature. In spite of the intense labour and wonderful discoveries of modern
##    999   philosophers, I always came from my studies discontented and unsatisfied.
##   1000   Sir Isaac Newton is said to have avowed that he felt like a child picking
##   1001   up shells beside the great and unexplored ocean of truth. Those of his
##   1002   successors in each branch of natural philosophy with whom I was acquainted

If we wanted to check which of the books is the longest we could do the following. First, we use the redirect operator > to send the list of line counts to a text file, instead of printing it to the shell. Second, we use the sort command to sort the lines in that file, with the -n flag to sort numerically.

$ cd ../research-project
$ wc -l data/*.txt > examples/lengths.txt
$ sort -n examples/lengths.txt
##    3582 data/time_machine.txt
##    7832 data/frankenstein.txt
##   13028 data/sense_and_sensibility.txt
##   13053 data/sherlock_holmes.txt
##   15975 data/dracula.txt
##   21054 data/jane_eyre.txt
##   22331 data/moby_dick.txt
##   96855 total

It seems that Moby Dick is the longest book and The Time Machine the shortest. See, I told you it was a quick read. Frankenstein is the second shortest, also a really good book, so it is worth reading.

$ cd ../research-project
$ rm examples/lengths.txt

We remove the lengths file since we aren’t going to be using it further.

A much easier way to do the same operation is to use a pipe.

$ cd ../research-project/data
$ wc -l *.txt | sort -n
##    3582 time_machine.txt
##    7832 frankenstein.txt
##   13028 sense_and_sensibility.txt
##   13053 sherlock_holmes.txt
##   15975 dracula.txt
##   21054 jane_eyre.txt
##   22331 moby_dick.txt
##   96855 total

Below is a graphical illustration of the operations we performed. The code is not exactly the same, but it performs a very similar function.

Exercise

In our research-project/data directory, we want to find the 3 files which have the fewest lines. Which of the commands listed below would work?

  1. wc -l *.txt > sort -n > head -n 3
  2. wc -l *.txt | sort -n | head -n 1-3
  3. wc -l *.txt | head -n 3 | sort -n
  4. wc -l *.txt | sort -n | head -n 3

Before we move on to for loops, a topic you would have encountered in R, let us take a quick detour to look at downloading files with curl.

Using curl

One thing that I have not mentioned is how to download files from the internet using the command line. I often use curl to download software or a text file from a website. Say we wanted to include a new book from the Project Gutenberg website in our data folder; we could do the following (the -s flag simply runs curl silently, without a progress bar),

$ cd ../research-project/data
$ curl -s "https://www.gutenberg.org/files/11/11-0.txt" > alice_in_wonderland.txt
$ ls -a -h
## .
## ..
## alice_in_wonderland.txt
## dracula.txt
## frankenstein.txt
## jane_eyre.txt
## moby_dick.txt
## sense_and_sensibility.txt
## sherlock_holmes.txt
## time_machine.txt

We could also perform operations on the file without having to store it on our computer.

$ curl -s "https://www.gutenberg.org/files/11/11-0.txt" | grep " CHAPTER"
##  CHAPTER I.     Down the Rabbit-Hole
##  CHAPTER II.    The Pool of Tears
##  CHAPTER III.   A Caucus-Race and a Long Tale
##  CHAPTER IV.    The Rabbit Sends in a Little Bill
##  CHAPTER V.     Advice from a Caterpillar
##  CHAPTER VI.    Pig and Pepper
##  CHAPTER VII.   A Mad Tea-Party
##  CHAPTER VIII.  The Queen’s Croquet-Ground
##  CHAPTER IX.    The Mock Turtle’s Story
##  CHAPTER X.     The Lobster Quadrille
##  CHAPTER XI.    Who Stole the Tarts?
##  CHAPTER XII.   Alice’s Evidence

Let us clean up the directory.

$ cd ../research-project/data
$ rm alice_in_wonderland.txt

If you want to explore other ways in which data can be downloaded and transformed, then the following link is a great read.

Iteration (for loops)


These loops work similarly to those in other programming languages. The basic syntax is the following,

for i in LIST
do 
  OPERATION $i ## the $ sign indicates a variable in bash
done

We can also condense things into a single line by using “;” appropriately.

for i in LIST; do OPERATION $i; done

The second method is more compact and perhaps better to use in the shell. The semicolons indicate line endings. Let us show an example of a for loop in the shell.

$ for i in 1 2 3 4 5; do echo $i; done
## 1
## 2
## 3
## 4
## 5

We could have also used the brace expansion {1..n} instead of writing out all the numbers from 1 to n.

$ for i in {1..5}; do echo $i; done
## 1
## 2
## 3
## 4
## 5

A loop is useful if we want to repeat a certain set of commands for each item in a list. Let us go back to our Project Gutenberg example. Suppose we wanted to take 8 lines out of each book whose name starts with an s, skipping the initial 9 lines of preamble. In other words, we want lines 10 to 17 of each file. We actually did something similar to this before,

$ cd ../research-project
$ head -n 17 data/frankenstein.txt | tail -n 8
## Title: Frankenstein
##        or The Modern Prometheus
## Author: Mary Wollstonecraft (Godwin) Shelley
## Editor:
## Release Date: June 17, 2008 [EBook #84]
## Posting Date:
## Last updated: January 13, 2018
## Language: English

We can use wildcards to find all the books that start with s,

$ cd ../research-project
$ for filename in data/s*.txt; do head -n 17 $filename | tail -n 8; done
## Title: Sense and Sensibility
## 
## Author: Jane Austen
## Editor:
## Release Date: May 25, 2008 [EBook #161]
## Posting Date:
## Last updated: February 11, 2015
## Language: English
## Title: The Adventures of Sherlock Holmes
## 
## Author: Arthur Conan Doyle
## Editor:
## Release Date: April 18, 2011 [EBook #1661]
## Posting Date: November 29, 2002
## Latest Update:
## Language: English

Functions

We can also write functions, similar to those in R and other scripting languages. The following function computes the factorial of the integer that we pass to it. Don’t be too concerned about how it works for now; it is simply nice to know that this can be done.

$ fac() { (echo 1; seq $1) | paste -s -d\* - | bc; }
$ type -a fac
## fac is a function
## fac () 
## { 
##     ( echo 1;
##     seq $1 ) | paste -s -d\* - | bc
## }
$ fac 5
## 120
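
A simpler function, just to see the moving parts (the name greet is made up for this illustration). Arguments are available inside the body as $1, $2 and so on,

$ greet() { echo "Hello, $1!"; }   # define a function that greets its first argument
$ greet Bash
## Hello, Bash!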

Shell scripting


We can explore data and file structure with the shell, but we can also write shell scripts that combine commands. These scripts normally have a .sh extension.

I have written a short shell script called hello.sh. You can probably guess what the script is going to say / do?

$ cd ../research-project
$ cat examples/hello.sh
## #!/bin/sh
## echo -e "\nHello World!\n"

From reading this you might be able to figure out what it does, but let us unpack it.

  • #!/bin/sh is a shebang. It indicates which program should be used to run the script (here the system shell).
  • echo -e "\nHello World!\n" is the actual command that we would like to run. The -e flag tells echo to interpret backslash escape sequences such as \n (a newline), as the comparison below shows.
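
To see the effect of -e with Bash’s built-in echo, compare the following two calls,

$ echo "First\nSecond"      # without -e the \n is printed literally
## First\nSecond
$ echo -e "First\nSecond"   # with -e the \n becomes a newline
## First
## Second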

If we want to run this script then we can simply pass the filename to bash.

$ cd ../research-project
$ bash examples/hello.sh
## 
## Hello World!

The directory for runnable programs is normally the bin/ directory. We have one of these in our file structure. Within that directory we also have a file called book_summary.sh, which is an empty shell script that we can fill from the shell.

We start by writing to the empty shell script with the following command,

$ cd ../research-project/bin
$ echo "head -n 17 ../data/moby_dick.txt | tail -n 8 | grep Author" > book_summary.sh

Can you make sense of what is happening here? If you are following along with these commands then you are starting to understand shell commands.

We can run the shell script as before,

$ cd ../research-project/bin
$ bash book_summary.sh
## Author: Herman Melville

Shell scripts are especially useful if you have a long list of commands that you want to pipe together. The following is quite unwieldy to type in the terminal and works much better in a shell script,

curl -sL "https://www.gutenberg.org/files/11/11-0.txt" |
tr '[:upper:]' '[:lower:]' |
grep -oE "[a-z\']{2,}" |
sort |
uniq -c |
sort -nr |
head -n 10

Then we can easily call the shell script (in this case top_words.sh) and see the result, without having to do all the typing,

$ cd ../research-project/bin
$ bash top_words.sh
##    1839 the
##     942 and
##     811 to
##     638 of
##     610 it
##     553 she
##     486 you
##     462 said
##     435 in
##     403 alice

Editing text files


I will say a few words in class about editing text files and text editors that you could potentially use. This is an important topic, but we do not have the scope to fully cover it in this lecture. I mostly use VS Code to edit text documents. However, there are other options like nano, vim and Emacs for those that are interested. You can also use RStudio, but I wouldn’t really recommend it. There are better text editors with much more functionality.

As always, find what works best for you. Find the space that best fits your style and you will enjoy the journey so much more.

Make


In the early days of computing there were no nice graphical user interfaces like the ones we see today. Most of the work was done on the command line, which is exactly why the command-line tools are so well developed. One of the primary things computer scientists wanted to do was share software, but it was always a bit tricky to install software across different systems. This is where make comes in. The idea behind make is the following,

  1. Download files required for installation into a certain directory.
  2. cd into that directory.
  3. Run the make command.

In order to do this one had to construct a makefile that contains the instructions for how the different downloaded components should be built and combined. The nice thing about make is that it gives us a process whereby we can create documents automatically. This is amazing for reproducibility in projects! You are essentially writing down the steps that should be executed, and in which order, to get to a finished product.

We will show a very easy example of constructing our own makefile and in the next lecture we will go into version control and project management in a bit more detail.

We created a makefile file in the journal directory of the research-project folder, using a text editor (in this case VS Code). In the makefile we have the following text,

draft_journal_entry.txt:
  touch draft_journal_entry.txt

The general format for a makefile entry is the following,

[target]: [dependencies...]
  [commands...]

The indent in the second line is important. It must be a tab (made with the tab key), not spaces. If this whitespace is missing or wrong then make will complain and the command will not execute.

In our simple example above, the target is the text file draft_journal_entry.txt. This file is created with the touch command in the second line. We can check the journal folder to see the contents.

$ cd ../research-project/journal
$ ls
## makefile

Let us now use the make command to see if the makefile is properly executed.

$ cd ../research-project/journal
$ make draft_journal_entry.txt
## touch draft_journal_entry.txt

It seems like this has worked properly. We have created a new file with our command in the makefile. Let us try running it again and see what happens.

$ cd ../research-project/journal
$ make draft_journal_entry.txt
## make: 'draft_journal_entry.txt' is up to date.

Interesting, it tells us that the target is up to date. The file already exists, so there is nothing to be done.

Now let us do something more complex. Let us add a table of contents to our journal,

$ cd ../research-project/journal
$ echo "1. 2022-02-11-Lecture-Stellenbosch" > toc.txt
$ ls
## draft_journal_entry.txt
## makefile
## toc.txt

Next, let us edit our makefile in VS Code to include the following lines,

readme.txt: toc.txt
  echo "This journal contains the following number of entries:" > readme.txt
  wc -l toc.txt | egrep -o "[0-9]+" >> readme.txt

Now let us run make with the readme.txt as the target.

$ cd ../research-project/journal
$ make readme.txt
## echo "This journal contains the following number of entries:" > readme.txt
## wc -l toc.txt | egrep -o "[0-9]+" >> readme.txt

To see that this works, let us look at the contents of readme.txt.

$ cd ../research-project/journal
$ cat readme.txt
## This journal contains the following number of entries:
## 1

Now let’s modify toc.txt then we’ll try running make again.

$ cd ../research-project/journal
$ echo "2. 2022-02-12-Chilling-SomersetWest" >> toc.txt
$ make readme.txt
$ cat readme.txt
## echo "This journal contains the following number of entries:" > readme.txt
## wc -l toc.txt | egrep -o "[0-9]+" >> readme.txt
## This journal contains the following number of entries:
## 2

In order to simplify the make experience, we can create a rule at the top of our makefile called all where we can list all of the files that are built by the makefile. By adding the all target we can simply run make without any arguments in order to build all of the targets in the makefile.

Let us edit our makefile in VS Code so that it looks as follows,

all: draft_journal_entry.txt readme.txt

draft_journal_entry.txt:
  touch draft_journal_entry.txt
  
readme.txt: toc.txt
  echo "This journal contains the following number of entries:" > readme.txt
  wc -l toc.txt | egrep -o "[0-9]+" >> readme.txt

clean:
  rm draft_journal_entry.txt
  rm readme.txt

We add the clean rule to delete the files created by our makefile. Normally you wouldn’t do this, but for the purpose of the lecture it helps to keep the folder empty so that we can repeat the process from the start.
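
One small refinement worth knowing about, although it is not strictly needed for our example: targets like all and clean do not correspond to actual files, so it is conventional to declare them as phony targets at the top of the makefile,

.PHONY: all clean

This tells make that these rules should always run when asked, even if a file named all or clean happens to exist in the directory.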

Let’s save and close our makefile then let’s test it out. First let’s clean up our repository:

$ cd ../research-project/journal
$ make clean
$ ls
## rm draft_journal_entry.txt
## rm readme.txt
## makefile
## toc.txt

The clean rule works as expected. Now let us simply run make and check that it executes the correct commands.

$ cd ../research-project/journal
$ make
$ ls
## touch draft_journal_entry.txt
## echo "This journal contains the following number of entries:" > readme.txt
## wc -l toc.txt | egrep -o "[0-9]+" >> readme.txt
## draft_journal_entry.txt
## makefile
## readme.txt
## toc.txt

It does exactly what we wanted it to do. Success!

In the next lecture we will be looking at some examples that include R scripts, so that we can show how powerful make really can be. However, even if you didn’t understand it all fully, you might already feel like this is something that could be useful in automating a lot of processes.

Conclusion


This has been a lot of information to process. Hopefully you found it useful. We will be seeing more on shell commands in some of our other lectures, so this work will not go to waste.

There are many topics that we did not cover, in particular SSH, but we will perhaps have time to look at some of these as we continue the course, especially toward the end when we talk about cloud computing.

There is so much you can do with the command line, that there is even a book called “Data Science at the Command Line”. It is worth checking out if you are interested.