Today is about working with a command line interface. There are many shell variants, and we will be working with Bash (the Bourne Again Shell). This is the default for Linux and macOS and needs to be installed separately on Windows. It is easiest to work with a Unix-based operating system when coding, such as Linux or macOS. However, I do realise that many of you will be working with Windows.
If you want to know more about Linux then you are welcome to come and speak to me about it. I do most of my work in a Linux environment and I think programming is much easier in Linux and MacOS. I use Manjaro, but I think that PopOS is a good Linux distribution to start with. It is quite similar to MacOS and you don’t need to use the command line as frequently as with other Linux distributions.
In this section we will look at how to set up a project as well as how to work with the command line. You might be used to working with graphical user interfaces for most of your coding career, but it is useful to know how to work through the shell. This is not something that is often taught in economics programmes, but if you are looking at a career in data science it might be very useful. Here are some of the reasons why we might be interested in using the shell. We will start by looking at the basic file structures within a computer. The key points are the following,
The primary references for this section are the notes from Merely Useful and the slides from Grant McDermott.
Note: The output displayed will be for the file structure on my computer. For your PC things will obviously be different.
Let us quickly talk about some things that we can use the shell for.
There are many more examples, but these are some of the fundamental things that make the command line useful. I think the following quote encapsulates my thoughts on the command line,
As data scientists, we work with a lot of data. This data is often stored in files. It is important to know how to work with files (and the directories they live in) on the command line. Every action that you can do using a GUI, you can do with command-line tools (and much more).
Once you have opened your terminal it should look something like this,
I am using Linux, so obviously my terminal is going to look different from yours. Mine is customised to look the way I like it. If you are using Windows it will probably look something more along the lines of this,
The first thing that you want to do is open up the Bash shell. You can do this through the built-in terminal in RStudio if you prefer. I will quickly demonstrate this in class.
To start, when Bash runs it presents us with a prompt to indicate that it is waiting for us to type something. This prompt is a simple dollar sign by default:
$
For our very first command in the terminal we will enter echo 'Hello World!' in the command line and see what happens.
$ echo 'Hello World!'
## Hello World!
You will see the syntax above for the rest of the notes. The first part is the command that I have entered and below we will see the output from issuing that command.
You can clear the screen at any point with the clear command. Now that we have that pesky first command out of the way, let us continue our journey with shell commands. The following commands will let us explore our folders and files, and will also introduce us to several conventions that most Unix tools follow.
The first thing I normally do is check my current working directory and list the files located in this directory. You have encountered the idea of a current working directory in R, so this should be familiar. We can do this by running the following command, which is short for print working directory. (Notice that we only run the command; there are no options or arguments.)
$ pwd
## /home/dawie/Dropbox/2022/871-data-science/DataScience-871/01-shell
$ type pwd
## pwd is a shell builtin
On my computer the present working directory is the 01-shell folder. We can also see, from the type command, that pwd is a built-in command for the shell.
It is important to note that all Bash commands have the same basic syntax – command, option(s), argument(s). Below is a picture to illustrate the point,
The options and arguments are not always needed, but you will always need a command (such as ls). The options start with a dash and are usually one letter; they are sometimes referred to as switches or flags. Arguments tell the command what to operate on, so this is usually a file, path or set of files and folders.
Let us see what happens if we type the command and option from above into the shell,
$ ls -F
## 01-shell.html
## 01-shell.Rmd
## images/
The ls command lists the files in the current working directory, while the -F option appends a character to each name that classifies the type of file (such as / for directories).
Note: Directory names in a path are separated with / on Unix, but \ on Windows.
We could use other options like -a, -l and -h. The -a option shows all of the folders and files, including hidden ones. The -l is for long format and the -h is for human-readable file sizes.
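As an aside, short single-letter options can usually be combined behind a single dash in ls and most other Unix tools, so the command in the next example could equally be written in the shorter combined form:

```shell
# These two invocations are equivalent:
ls -l -a -h
ls -lah
```
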
$ ls -l -a -h
## total 1.5M
## drwxr-xr-x 3 dawie dawie 4.0K Feb 13 13:19 .
## drwxr-xr-x 18 dawie dawie 4.0K Feb 24 13:42 ..
## -rw-r--r-- 1 dawie dawie 1.5M Feb 13 13:19 01-shell.html
## -rw-r--r-- 1 dawie dawie 39K Mar 1 16:25 01-shell.Rmd
## drwxr-xr-x 2 dawie dawie 4.0K Feb 12 21:12 images
You might notice that there is a lot of information here. Let us analyse this one piece at a time.
The first column indicates the object type. It can either be a directory / folder (d), a link (l) or a file (-). In the case of the first line, we see that this is a directory.
The next nine characters indicate the permissions associated with the object for the different user types (owner, group and others). In each case we can have r (read), w (write) or x (execute) access. The - indicates a missing permission.
The remaining columns represent the number of hard links to the object, the identity of the object's owner (and group), and then descriptive elements about the object, such as its size and modification date.
In terms of the first row, the output shows a special directory called ., which is the current working directory. In the second row, we have .., which means the directory that contains the current one. This is referred to as the parent directory.
Beyond the first two rows, the rest of the objects are files and directories. We can see an html file, an RMarkdown file and an images directory listed here.
It is worth noting that the options for many of the commands, such as ls, are documented in the manual. We can access the manual for a command in the following way,
$ man ls | head -n 30 # head lets us look at the first 30 lines here
## LS(1) User Commands LS(1)
##
## NAME
## ls - list directory contents
##
## SYNOPSIS
## ls [OPTION]... [FILE]...
##
## DESCRIPTION
## List information about the FILEs (the current directory by default).
## Sort entries alphabetically if none of -cftuvSUX nor --sort is speci‐
## fied.
##
## Mandatory arguments to long options are mandatory for short options
## too.
##
## -a, --all
## do not ignore entries starting with .
##
## -A, --almost-all
## do not list implied . and ..
##
## --author
## with -l, print the author of each file
##
## -b, --escape
## print C-style escapes for nongraphic characters
##
## --block-size=SIZE
## with -l, scale sizes by SIZE when printing them; e.g.,
In the next part we will be creating some files and directories that relate to our research project. In order to do this we use the command mkdir, which is short for make directory.
$ cd ../research-project
$ rmdir new_dir # remove in case it already exists
$ mkdir new_dir
$ ls
## rmdir: failed to remove 'new_dir': No such file or directory
## bin
## data
## docs
## example1.txt
## example2.txt
## examples
## journal
## new_dir
## README.md
## results
## src
You should now be able to see the new_dir directory that we have created with this command. This is similar to creating a new folder, as you would with a graphical file explorer in your operating system of choice.
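A related option worth knowing: by default mkdir refuses to create nested directories in one step, but the -p flag creates any missing parent directories along the way. A small sketch (the directory names below are made up for illustration):

```shell
# Without -p this would fail, because new_dir/raw does not exist yet
mkdir -p new_dir/raw/2022
ls new_dir/raw    # shows the 2022 directory
```
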
There are a few “rules” about naming directories that we can quickly mention.
Examples of bad names
Examples of “good” names
I generally stick to lowercase with dashes, but that is a preference. It is simply easier to type.
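To see why spaces make for bad names, remember that the shell splits commands on whitespace. A name containing a space must be quoted every single time you refer to it (the names below are hypothetical):

```shell
mkdir "my project"   # works, but only because of the quotes
ls my project        # fails: ls looks for two objects, 'my' and 'project'
ls "my project"      # correct, but you have to remember the quotes forever
```
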
touch and rm
Let us get back to our example with the new directory we just created.
$ cd ../research-project
$ ls new_dir
You will see that this directory is empty. So let us create a file and put it into this new directory. The file's name is going to be draft.txt. The .txt extension indicates to us that this will be a text file. In order to create an empty file we can use the touch command.
$ cd ../research-project/new_dir
$ touch draft.txt
$ ls
## draft.txt
We can also delete the objects that we created, using the rm command.
NB: Please note that there is no undo option with the shell. If you remove something it is permanently deleted. So be very sure about your actions before committing to them. This is one of the reasons why version control is so important, which we will discuss in the next lecture.
$ cd ../research-project/new_dir
$ rm draft.txt
$ ls
What happens when we execute rm -i new_dir/draft.txt? Why would we want this protection when using rm?
If you wanted to delete the entire directory then you would have to use rmdir. If there are files in the directory you will get a warning telling you that this might not be the best idea. If there are files and you still want to remove the directory, then you can use rm with the recursive option -r.
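A short sketch of the difference, using a throwaway directory name:

```shell
mkdir tmp_dir
touch tmp_dir/file.txt
rmdir tmp_dir    # fails: rmdir only removes empty directories
rm -r tmp_dir    # removes the directory together with its contents
```
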
$ cd ../research-project
$ rmdir new_dir
$ ls
## bin
## data
## docs
## example1.txt
## example2.txt
## examples
## journal
## README.md
## results
## src
You will note that new_dir is now gone from the list.
Another important command is copy. Let us make a new sub-directory with copies and then copy across files from another folder.
$ cd ../research-project
$ #rmdir examples -r # delete in case this already exists (just for this example)
$ #mkdir examples
$ cd examples
$ touch example1.txt example2.txt
Let us now copy example1.txt into the docs folder with a new name.
$ cd ../research-project
$ cp examples/example1.txt docs/doc1.txt
$ ls docs
## doc1.txt
We can also move and rename files with the mv command. This is similar to copying, but completely moves the file to a new location.
$ cd ../research-project
$ mv examples/example2.txt docs
$ ls docs
## doc1.txt
## example2.txt
We can also move the file back to its original location.
$ cd ../research-project
$ mv docs/example2.txt examples
$ ls docs
## doc1.txt
If you are moving the object into the same directory but with a new object name, then you are effectively just renaming it.
$ cd ../research-project
$ mv docs/doc1.txt docs/doc_new.txt
$ ls docs
## doc_new.txt
There is a more convenient way to do this: the rename command. The syntax here is pattern, replacement, file(s).
$ cd ../research-project
$ rename txt csv docs/doc_new.txt
$ ls docs
## doc_new.csv
We can also easily revert our changes.
$ cd ../research-project
$ rename csv txt docs/doc_new.csv
$ mv docs/doc_new.txt docs/doc1.txt
$ ls docs
## doc1.txt
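One caveat: rename is not installed on every system, and there are two common variants (the util-linux one used above and a Perl version) with different syntax. A more portable way to achieve the same renaming is mv in a loop together with the shell's suffix-stripping parameter expansion:

```shell
# Rename every .txt file in docs/ to .csv without the rename command.
# ${f%.txt} expands to $f with the trailing '.txt' removed.
for f in docs/*.txt; do
    mv "$f" "${f%.txt}.csv"
done
```
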
* and ?
The place where rename is super useful is in combination with regular expressions and wildcards. You will have dealt with the concept of regular expressions in the first part of the course.
With these methods we could change all the .txt file extensions in the examples folder to .csv in one line.
$ cd ../research-project
$ rename txt csv examples/* # the star represents the wild card expression here
$ ls examples
## example1.csv
## example2.csv
## example3.csv
## hello.sh
## nursery.csv
We can then just as easily change it back, with a similar command. Make sure you understand what is happening here.
$ cd ../research-project
$ rename csv txt examples/*
$ ls examples
## example1.txt
## example2.txt
## example3.txt
## hello.sh
## nursery.txt
Wildcards are special characters that can be used as a replacement for other characters. The two most important ones are the * and the ?.
We haven't mentioned the ? wildcard yet. It matches any single character in a filename, so ?.txt matches a.txt but not any.txt.
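A quick sketch to convince yourself, with throwaway file names:

```shell
touch a.txt ab.txt abc.txt
ls ?.txt     # matches a.txt only: exactly one character before .txt
ls ??.txt    # matches ab.txt only
ls *.txt     # matches all three
```
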
Next we will take a look at some data from the Project Gutenberg website. We have some full ebooks in our data folder.
$ cd ../research-project
$ ls data
## dracula.txt
## frankenstein.txt
## jane_eyre.txt
## moby_dick.txt
## sense_and_sensibility.txt
## sherlock_holmes.txt
## time_machine.txt
There are some really interesting books in this folder. I recommend reading all of them if you have the time (especially Frankenstein!). If you don't have a lot of time, read The Time Machine; it is quite short.
wc
Moving on, we can use the wc command, which is short for word count. It will tell us how many lines, words and characters there are in a file.
$ cd ../research-project/data
$ wc moby_dick.txt
## 22331 215832 1253891 moby_dick.txt
We can't actually run the wc command on the entire directory.
$ cd ../research-project
$ wc data
$ pwd
## wc: data: Is a directory
## 0 0 0 data
## /home/dawie/Dropbox/2022/871-data-science/DataScience-871/research-project
What if we wanted to get the word count for two books? Well, then we would just specify the filenames of both books as inputs,
$ cd ../research-project/data
$ wc moby_dick.txt frankenstein.txt
## 22331 215832 1253891 moby_dick.txt
## 7832 78100 442967 frankenstein.txt
## 30163 293932 1696858 total
If we wanted to know the word counts for all the books in the folder we could use wildcards.
$ cd ../research-project/data
$ wc *.txt
## 15975 164429 867222 dracula.txt
## 7832 78100 442967 frankenstein.txt
## 21054 188460 1049294 jane_eyre.txt
## 22331 215832 1253891 moby_dick.txt
## 13028 121593 693116 sense_and_sensibility.txt
## 13053 107536 581903 sherlock_holmes.txt
## 3582 35527 200928 time_machine.txt
## 96855 911477 5089321 total
We could also just show the number of lines in each file,
$ cd ../research-project/data
$ wc -l *.txt
## 15975 dracula.txt
## 7832 frankenstein.txt
## 21054 jane_eyre.txt
## 22331 moby_dick.txt
## 13028 sense_and_sensibility.txt
## 13053 sherlock_holmes.txt
## 3582 time_machine.txt
## 96855 total
The easiest way to read text is with the cat command, which stands for "concatenate". cat will read in all of the text. We don't want all the text, we simply want the first few lines, so we use the head command. This is similar to the head command in R.
$ cd ../research-project/data
$ head -n 18 frankenstein.txt
## Project Gutenberg's Frankenstein, by Mary Wollstonecraft (Godwin) Shelley
##
## This eBook is for the use of anyone anywhere at no cost and with
## almost no restrictions whatsoever. You may copy it, give it away or
## re-use it under the terms of the Project Gutenberg License included
## with this eBook or online at www.gutenberg.net
##
##
##
## Title: Frankenstein
## or The Modern Prometheus
## Author: Mary Wollstonecraft (Godwin) Shelley
## Editor:
## Release Date: June 17, 2008 [EBook #84]
## Posting Date:
## Last updated: January 13, 2018
## Language: English
## Character set encoding: UTF-8
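head has a natural counterpart, tail, which shows the last lines of a file instead; we will use it later in these notes. A minimal sketch with a made-up file:

```shell
printf 'one\ntwo\nthree\nfour\n' > lines.txt
tail -n 2 lines.txt    # prints the last two lines: three, four
```
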
If we want to find specific patterns in the text, we can use regular expression matching with grep. The name stands for "global regular expression print".
What if we wanted to find some famous line in Frankenstein? One such line is,
I was benevolent and good; misery made me a fiend.
We can search for the first part of the sentence as follows,
$ cd ../research-project/data
$ grep -n "I was benevolent" frankenstein.txt
## 3119:alone am irrevocably excluded. I was benevolent and good; misery made
## 3128:compassion? Believe me, Frankenstein, I was benevolent; my soul glowed
We can look for specific words as well, such as fear.
$ cd ../research-project/data
$ grep -n "fear" frankenstein.txt | head -n 5
## 151:conquer all fear of danger or death and to induce me to commence this
## 340:the trembling sensation, half pleasurable and half fearful, with which
## 463:morning, fearing to encounter in the dark those large loose masses which
## 503:feared that his sufferings had deprived him of understanding. When he
## 675:fear to encounter your unbelief, perhaps your ridicule; but many things
It seems there are a lot of fear-related words in the book! However, what if we are looking just for fear and not for words like fearful that contain fear? In this case we can use the -w flag, which indicates that we are looking to match the whole word.
$ cd ../research-project/data
$ grep -n -w "fear" frankenstein.txt | head -n 5
## 151:conquer all fear of danger or death and to induce me to commence this
## 675:fear to encounter your unbelief, perhaps your ridicule; but many things
## 1662:what I was doing. My heart palpitated in the sickness of fear, and I
## 1666: Doth walk in fear and dread,
## 1853:lake. I fear that he will become an idler unless we yield the point
We can use grep to find patterns in a group of files as well. Say we want to find the word benevolent across all the books in the directory,
$ cd ../research-project
$ grep -R -l "benevolent" data/
## data/sherlock_holmes.txt
## data/moby_dick.txt
## data/frankenstein.txt
## data/jane_eyre.txt
It seems that only four books in our list contain the word. It should be noted that there are many options for grep. If you want to find out what options are available, you can consult the manual via the man command.
One of the things that we could do is figure out how many lines in the books in our list contain the word benevolent.
$ cd ../research-project
$ grep -w -r "benevolent" data | wc -l
## 23
It looks like there are 23 lines in all of the books that contain this specific word.
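grep can also do the counting itself with the -c option, which reports the number of matching lines per file and saves the extra wc step. A self-contained sketch (the sample file here is made up):

```shell
printf 'fear\nno match here\nfearful thoughts\n' > sample.txt
grep -c "fear" sample.txt      # 2: lines containing 'fear' anywhere
grep -c -w "fear" sample.txt   # 1: lines containing 'fear' as a whole word
```
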
One of the real powers of regular expressions is that we aren't limited to literal characters and strings when defining our search; we can also use metacharacters. Metacharacters are characters that can be used to represent other characters. As an example, the period "." is a metacharacter which represents any single character. If we wanted to search Frankenstein for the character "a" followed by any character and then the character "x", we could use the following command,
$ cd ../research-project/data
$ grep -P "a.x" frankenstein.txt
## grow watchful with anxious thoughts, when a strange sight suddenly
## favourite was menaced, she could no longer control her anxiety. She
## succeeded. But my enthusiasm was checked by my anxiety, and I appeared
## of my toils. With an anxiety that almost amounted to agony, I
## with the most anxious affection. Poor Justine was very ill; but other
## “I have written myself into better spirits, dear cousin; but my anxiety
## letter: “I will write instantly and relieve them from the anxiety
## quitted the cottage that morning, and waited anxiously to discover from
## unutterable anxiety; and fear not but that when you are ready I shall
## and anxious to gain experience and instruction. The difference of
## dreadful crime and avoided with shuddering anxiety any encounter with my
## tortured as I have been by anxious suspense; yet I hope to see peace in
## heart the anxiety that preyed there and entered with seeming earnestness
## shapes of objects, a thousand fears arose in my mind. I was anxious
## destruction, and you will anxiously await my return. Years will pass, and
You see some nice words here like anxious and anxiety. Besides metacharacters that represent other characters, we also have quantifiers, which allow you to specify the number of times a particular expression should appear in a string. The most basic quantifiers are + and *. The plus represents one or more occurrences of the preceding expression. This means an expression like s+as would mean: one or more s followed by as. Let us illustrate with an example,
$ cd ../research-project/data
$ grep -P "s+as" frankenstein.txt
## You will rejoice to hear that no disaster has accompanied the
## has been. I do not know that the relation of my disasters will be
## “Come, Victor; not brooding thoughts of vengeance against the assassin,
## “your disaster is irreparable. What do you intend to do?”
## a not less terrible, disaster. I tried to calm Ernest; I enquired more
## some other topic than that of our disaster, had not Ernest exclaimed,
## assassinated, and the murderer escapes; he walks about the world free,
## and the irresistible, disastrous future imparted to me a kind of calm
## assassin of those most innocent victims; they died by my machinations.
Some examples here are disaster and assassin.
It is not possible to go through all the cool things you can do with regular expressions in our lecture, so you can go and read more about them here if you find this interesting.
While the grep command helps us to find things in files, we can use find to find the files themselves. We can use find . to find and list everything in a directory.
$ cd ../research-project
$ find .
## .
## ./data
## ./data/dracula.txt
## ./data/sense_and_sensibility.txt
## ./data/sherlock_holmes.txt
## ./data/time_machine.txt
## ./data/moby_dick.txt
## ./data/frankenstein.txt
## ./data/jane_eyre.txt
## ./example1.txt
## ./bin
## ./bin/top_words.sh
## ./bin/book_summary.sh
## ./README.md
## ./docs
## ./docs/doc1.txt
## ./results
## ./journal
## ./journal/makefile
## ./example2.txt
## ./src
## ./src/top-words.R
## ./examples
## ./examples/example1.txt
## ./examples/example2.txt
## ./examples/nursery.txt
## ./examples/example3.txt
## ./examples/hello.sh
If we only wanted to find directories then we could do,
$ cd ../research-project
$ find . -type d
## .
## ./data
## ./bin
## ./docs
## ./results
## ./journal
## ./src
## ./examples
What about finding the files?
$ cd ../research-project
$ find . -type f
## ./data/dracula.txt
## ./data/sense_and_sensibility.txt
## ./data/sherlock_holmes.txt
## ./data/time_machine.txt
## ./data/moby_dick.txt
## ./data/frankenstein.txt
## ./data/jane_eyre.txt
## ./example1.txt
## ./bin/top_words.sh
## ./bin/book_summary.sh
## ./README.md
## ./docs/doc1.txt
## ./journal/makefile
## ./example2.txt
## ./src/top-words.R
## ./examples/example1.txt
## ./examples/example2.txt
## ./examples/nursery.txt
## ./examples/example3.txt
## ./examples/hello.sh
We could also find files of a certain type using the * wildcard.
$ cd ../research-project
$ find . -name "*.txt"
## ./data/dracula.txt
## ./data/sense_and_sensibility.txt
## ./data/sherlock_holmes.txt
## ./data/time_machine.txt
## ./data/moby_dick.txt
## ./data/frankenstein.txt
## ./data/jane_eyre.txt
## ./example1.txt
## ./docs/doc1.txt
## ./example2.txt
## ./examples/example1.txt
## ./examples/example2.txt
## ./examples/nursery.txt
## ./examples/example3.txt
We can combine some of the knowledge we have accumulated thus far to see how large the files are in terms of number of lines,
$ cd ../research-project
$ wc -l $(find . -name "*.txt")
## 15975 ./data/dracula.txt
## 13028 ./data/sense_and_sensibility.txt
## 13053 ./data/sherlock_holmes.txt
## 3582 ./data/time_machine.txt
## 22331 ./data/moby_dick.txt
## 7832 ./data/frankenstein.txt
## 21054 ./data/jane_eyre.txt
## 0 ./example1.txt
## 0 ./docs/doc1.txt
## 0 ./example2.txt
## 0 ./examples/example1.txt
## 0 ./examples/example2.txt
## 2 ./examples/nursery.txt
## 2 ./examples/example3.txt
## 96859 total
You will often see find and grep used together. The following is a nice way to show the author of each of the books in the data/ directory,
$ cd ../research-project/data
$ grep "Author:" $(find . -name "*.txt")
## ./dracula.txt:Author: Bram Stoker
## ./sense_and_sensibility.txt:Author: Jane Austen
## ./sherlock_holmes.txt:Author: Arthur Conan Doyle
## ./time_machine.txt:Author: H. G. Wells
## ./moby_dick.txt:Author: Herman Melville
## ./frankenstein.txt:Author: Mary Wollstonecraft (Godwin) Shelley
## ./jane_eyre.txt:Author: Charlotte Bronte
awk and sed
Let us continue on our route of manipulating text in the shell. Here we encounter two more tools, sed and awk.
$ cd ../research-project
$ cat examples/nursery.txt
## Jack and Jill
## Went up the hill
Let us now change “Jack” to “Bill”.
$ cd ../research-project
$ sed -i 's/Jack/Bill/g' examples/nursery.txt
$ cat examples/nursery.txt
## Bill and Jill
## Went up the hill
And then change it back.
$ cd ../research-project
$ sed -i 's/Bill/Jack/g' examples/nursery.txt
$ cat examples/nursery.txt
## Jack and Jill
## Went up the hill
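A word of caution on -i: it overwrites the file in place. Leaving the option out makes sed print the edited text to the screen while leaving the file untouched, which is a safe way to preview a substitution first (sketched here on a throwaway copy):

```shell
printf 'Jack and Jill\n' > nursery_copy.txt
sed 's/Jack/Bill/g' nursery_copy.txt   # prints: Bill and Jill
cat nursery_copy.txt                   # file still contains: Jack and Jill
```
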
In this next example we are going to count the 10 most commonly used words in Frankenstein. We are going to be using a pipe operator |, similar to the one that you have used in R, to chain some commands together. We will also use the redirect operator <.
$ cd ../research-project
$ sed -e 's/\s/\n/g' < data/frankenstein.txt | sort | uniq -c | sort -n -r | head -10
## 4056 the
## 3102
## 2971 and
## 2741 of
## 2719 I
## 2142 to
## 1631 my
## 1394 a
## 1125 in
## 993 was
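We have now used sed but not yet awk, the other tool named above. awk reads its input line by line and splits each line into whitespace-separated fields ($1, $2, and so on, with NF holding the field count), which makes quick column work easy. A minimal sketch:

```shell
# Print the number of fields and the first field of each input line
printf 'Jack and Jill\nWent up the hill\n' | awk '{ print NF, $1 }'
# 3 Jack
# 4 Went
```
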
Let us talk a bit more about the pipe and redirect operators.
You can send output from the shell to a file using the redirect operator >. We can print a message to the shell using the echo command.
$ echo "The shell really isn't all that scary. Or is it?"
## The shell really isn't all that scary. Or is it?
You could also save this output to a file; just redirect it to a particular filename.
$ cd ../research-project/examples
$ echo "Some text for the example file" > example3.txt
$ ls -a
## .
## ..
## example1.txt
## example2.txt
## example3.txt
## hello.sh
## nursery.txt
You can see that example3.txt has now been created, and if you open the file you will see the text located in that file. You can append to an existing file if you use >>.
$ cd ../research-project/examples
$ echo "Some more text for the example file! We appended!" >> example3.txt
$ cat example3.txt
## Some text for the example file
## Some more text for the example file! We appended!
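The crucial difference between the two operators: > truncates the file before writing, while >> appends to whatever is already there. A quick sketch:

```shell
echo "first"  > log.txt   # creates (or overwrites) log.txt
echo "second" > log.txt   # > again: 'first' is gone
echo "third" >> log.txt   # >> appends below 'second'
cat log.txt               # prints: second, then third
```
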
A nice operation that you can do with the shell, when working with git, is to issue the following command: echo "*.csv" >> .gitignore. Can you figure out what this is going to do and why it is useful?
Next we discuss the pipe operator in a bit more detail. You should be familiar with how this works in R by now, but it is also a really cool feature that can be used in the shell.
We can easily chain together a string of simple operations and make it something quite complex. See the following example,
$ cd ../research-project
$ cat -n data/frankenstein.txt | head -n1002 | tail -n10
## 993 When I returned home my first care was to procure the whole works of this
## 994 author, and afterwards of Paracelsus and Albertus Magnus. I read and
## 995 studied the wild fancies of these writers with delight; they appeared to me
## 996 treasures known to few besides myself. I have described myself as always
## 997 having been imbued with a fervent longing to penetrate the secrets of
## 998 nature. In spite of the intense labour and wonderful discoveries of modern
## 999 philosophers, I always came from my studies discontented and unsatisfied.
## 1000 Sir Isaac Newton is said to have avowed that he felt like a child picking
## 1001 up shells beside the great and unexplored ocean of truth. Those of his
## 1002 successors in each branch of natural philosophy with whom I was acquainted
If we wanted to check which of the books is the longest in this list we could do the following. First, we use the redirect operator > to send the list of line lengths to a text document, instead of printing it to the shell. Second, we use the sort command to sort the lines in the file, using the -n flag to sort numerically.
$ cd ../research-project
$ wc -l data/*.txt > examples/lengths.txt
$ sort -n examples/lengths.txt
## 3582 data/time_machine.txt
## 7832 data/frankenstein.txt
## 13028 data/sense_and_sensibility.txt
## 13053 data/sherlock_holmes.txt
## 15975 data/dracula.txt
## 21054 data/jane_eyre.txt
## 22331 data/moby_dick.txt
## 96855 total
It seems that Moby Dick is the longest book and The Time Machine the shortest. See, I told you it was a quick read. Frankenstein is the second shortest, and also a really good book, so it is worth reading.
$ cd ../research-project
$ rm examples/lengths.txt
We remove the lengths file since we aren’t going to be using it further.
A much easier way to have done the same operation above uses a pipe operator.
$ cd ../research-project/data
$ wc -l *.txt | sort -n
## 3582 time_machine.txt
## 7832 frankenstein.txt
## 13028 sense_and_sensibility.txt
## 13053 sherlock_holmes.txt
## 15975 dracula.txt
## 21054 jane_eyre.txt
## 22331 moby_dick.txt
## 96855 total
Below is a graphical illustration of the operations we performed. The code is not exactly the same, but it performs a very similar function.
In our research-project/data directory, we want to find the 3 files which have the fewest lines. Which command listed below would work?
wc -l *.txt > sort -n > head -n 3
wc -l *.txt | sort -n | head -n 1-3
wc -l *.txt | head -n 3 | sort -n
wc -l *.txt | sort -n | head -n 3
Next let us move on to an interesting topic that you will have encountered in R: the usage of for loops.
curl
One thing that I have not mentioned is how to download files from the internet using the command line. I often use this command to download software or some text file from a website. Say that we wanted to include a new book in our data folder from the Project Gutenberg website; we could do the following,
$ cd ../research-project/data
$ curl -s "https://www.gutenberg.org/files/11/11-0.txt" > alice_in_wonderland.txt
$ ls -a -h
## .
## ..
## alice_in_wonderland.txt
## dracula.txt
## frankenstein.txt
## jane_eyre.txt
## moby_dick.txt
## sense_and_sensibility.txt
## sherlock_holmes.txt
## time_machine.txt
We could also perform operations on the file without having to store it on our computer.
$ curl -s "https://www.gutenberg.org/files/11/11-0.txt" | grep " CHAPTER"
## CHAPTER I. Down the Rabbit-Hole
## CHAPTER II. The Pool of Tears
## CHAPTER III. A Caucus-Race and a Long Tale
## CHAPTER IV. The Rabbit Sends in a Little Bill
## CHAPTER V. Advice from a Caterpillar
## CHAPTER VI. Pig and Pepper
## CHAPTER VII. A Mad Tea-Party
## CHAPTER VIII. The Queen’s Croquet-Ground
## CHAPTER IX. The Mock Turtle’s Story
## CHAPTER X. The Lobster Quadrille
## CHAPTER XI. Who Stole the Tarts?
## CHAPTER XII. Alice’s Evidence
Let us clean up the directory.
$ cd ../research-project/data
$ rm alice_in_wonderland.txt
If you want to explore other ways in which data can be downloaded and transformed, then the following link provides a great read.
These loops work similarly to those in other programming languages. The basic syntax is the following,
for i in LIST
do
OPERATION $i ## the $ sign indicates a variable in bash
done
We can also condense things into a single line by using “;” appropriately.
for i in LIST; do OPERATION $i; done
The second method is more compact and perhaps better suited to the shell. The semicolons indicate line endings. Let us show an example of a for loop in the shell.
$ for i in 1 2 3 4 5; do echo $i; done
## 1
## 2
## 3
## 4
## 5
We could also have used the brace expansion {1..n} instead of writing out all the numbers from 1 to n.
$ for i in {1..5}; do echo $i; done
## 1
## 2
## 3
## 4
## 5
A loop might be useful if we wanted to repeat a certain set of commands for each item in a list. Let us go back to our Project Gutenberg example. Suppose we wanted to take 8 lines out of each book whose name starts with an s, skipping the initial 9 lines of preamble. We actually did something similar to this before,
$ cd ../research-project
$ head -n 17 data/frankenstein.txt | tail -n 8
## Title: Frankenstein
## or The Modern Prometheus
## Author: Mary Wollstonecraft (Godwin) Shelley
## Editor:
## Release Date: June 17, 2008 [EBook #84]
## Posting Date:
## Last updated: January 13, 2018
## Language: English
We can use wildcards to find all the books that start with s,
$ cd ../research-project
$ for filename in data/s*.txt; do head -n 17 $filename | tail -n 8; done
## Title: Sense and Sensibility
##
## Author: Jane Austen
## Editor:
## Release Date: May 25, 2008 [EBook #161]
## Posting Date:
## Last updated: February 11, 2015
## Language: English
## Title: The Adventures of Sherlock Holmes
##
## Author: Arthur Conan Doyle
## Editor:
## Release Date: April 18, 2011 [EBook #1661]
## Posting Date: November 29, 2002
## Latest Update:
## Language: English
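A hedged aside on the loop above: the unquoted $filename works here, but if a file name ever contains a space, the variable expansion is split into separate words and head receives the wrong arguments. Quoting the variable avoids this. A self-contained sketch (the /tmp/glob-demo path is just for illustration):

```shell
mkdir -p /tmp/glob-demo
echo "hello" > "/tmp/glob-demo/s book.txt"   # a file name containing a space
for filename in /tmp/glob-demo/s*.txt; do
  head -n 1 "$filename"   # the quotes keep the whole name as a single argument
done
```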
We can also write functions, similar to those in R and other scripting languages. The following function computes the factorial of some integer that we input. Don't be too concerned about how this works for now; it is simply nice to know that this can be done.
$ fac() { (echo 1; seq $1) | paste -s -d\* - | bc; }
$ type -a fac
## fac is a function
## fac ()
## {
## ( echo 1;
## seq $1 ) | paste -s -d\* - | bc
## }
$ fac 5
## 120
We can explore data and file structure with the shell, but we can also write shell scripts that combine commands. These scripts normally have a .sh extension.
I have written a short shell script called hello.sh. You can probably guess what the script is going to say / do?
$ cd ../research-project
$ cat examples/hello.sh
## #!/bin/sh
## echo -e "\nHello World!\n"
From reading this you might be able to figure out what it does, but let us unpack it. The first line, #!/bin/sh, is a shebang: it indicates which program should be used to run the script. The second line, echo -e "\nHello World!\n", is the actual command that we would like to run; the -e flag tells echo to interpret backslash escape sequences such as \n. If we want to run this script then we can simply pass the filename to bash.
$ cd ../research-project
$ bash examples/hello.sh
##
## Hello World!
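As an aside: instead of typing bash each time, you can make the script executable, in which case the shebang decides which interpreter runs it. A small self-contained sketch (the /tmp/script-demo path is just for illustration):

```shell
mkdir -p /tmp/script-demo
printf '#!/bin/sh\necho "Hello World!"\n' > /tmp/script-demo/hello.sh
chmod +x /tmp/script-demo/hello.sh   # grant execute permission
/tmp/script-demo/hello.sh            # prints: Hello World!
```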
The directory for runnable programs is normally the bin/ directory. We have one of these in our file structure. Within that directory we also have a file called book_summary.sh, which is an empty shell script that we can fill from the shell.
We start by writing to the empty shell script with the following command,
$ cd ../research-project/bin
$ echo "head -n 17 ../data/moby_dick.txt | tail -n 8 | grep Author" > book_summary.sh
Can you make sense of what is happening here? If you are following along with these commands then you are starting to understand shell commands.
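One detail worth noticing is the quoting: everything inside the double quotes, including the pipes, is treated as literal text, so only the final > is interpreted by the shell, as a redirection into the file. A self-contained sketch of the same pattern (the /tmp/summary-demo path is just for illustration):

```shell
mkdir -p /tmp/summary-demo
cd /tmp/summary-demo
echo "head -n 17 ../data/moby_dick.txt | tail -n 8 | grep Author" > book_summary.sh
cat book_summary.sh   # the quoted pipeline was written to the file as literal text
```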
We can run the shell script as before,
$ cd ../research-project/bin
$ bash book_summary.sh
## Author: Herman Melville
Shell scripts are especially useful if you have a long list of commands that you want to pipe together. The following is quite unwieldy in the terminal and works much better in a shell script,
curl -sL "https://www.gutenberg.org/files/11/11-0.txt" | # download the book (silently, following redirects)
tr '[:upper:]' '[:lower:]' | # convert everything to lowercase
grep -oE "[a-z\']{2,}" | # extract words of two or more characters
sort | # sort the words so duplicates are adjacent
uniq -c | # count the occurrences of each word
sort -nr | # sort the counts in descending numerical order
head -n 10 # keep the ten most frequent words
Then we can easily call the shell script (in this case top_words.sh) and see the result, without having to do all the typing,
$ cd ../research-project/bin
$ bash top_words.sh
## 1839 the
## 942 and
## 811 to
## 638 of
## 610 it
## 553 she
## 486 you
## 462 said
## 435 in
## 403 alice
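Scripts become even more flexible once they accept arguments. Inside a script, $1 refers to the first command-line argument, so a variant of top_words.sh could take the URL as input rather than hard-coding it. A minimal self-contained sketch of the mechanism (the demo.sh script and its path are hypothetical):

```shell
mkdir -p /tmp/args-demo
printf '#!/bin/sh\necho "Top words for: $1"\n' > /tmp/args-demo/demo.sh
bash /tmp/args-demo/demo.sh "https://www.gutenberg.org/files/11/11-0.txt"
# prints: Top words for: https://www.gutenberg.org/files/11/11-0.txt
```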
I will say a few words in class about editing text files and text editors that you could potentially use. This is an important topic, but we do not have the scope to fully cover it in this lecture. I mostly use VS Code to edit text documents. However, there are other options like nano, vim and Emacs for those that are interested. You can also use RStudio, but I wouldn't really recommend it. There are better text editors with much more functionality.
As always, find what works best for you. Find the space that best fits your style and you will enjoy the journey so much more.
In the early days of computing there were none of the nice user interfaces that we see today. Most of the work was done on the command line, which is exactly why command-line tools are so well developed. One of the primary things computer scientists wanted to do was to share software, but it was always a bit tricky to install software across different systems. This is where make comes in. The idea behind make is the following: download the source files for the software, cd into that directory, and then simply type the make command.
command.In order to do this one had to construct a makefile
that contains all the instructions on how different parts of the downloaded components need to interact with each other. The nice thing about make
is that it allows for a process whereby we can create documents automatically. This is amazing for reproducibility in projects! You are essentially writing down the steps whereby things should be executed in order to get to a finished product.
We will show a very easy example of constructing our own makefile, and in the next lecture we will go into version control and project management in a bit more detail.
We created a makefile file in the journal directory of the research-project folder, using a text editor (in this case VS Code). In the makefile we have the following text,
draft_journal_entry.txt:
	touch draft_journal_entry.txt
The general format for makefile entries is the following,
[target]: [dependencies...]
	[commands...]
The indent in the second line is important. It should be done with the tab key. If this whitespace is not used then the command will not execute.
In our simple example above, the target is the text file draft_journal_entry.txt. This file is created with the touch command in the second line. We can check the journal folder to see the contents.
$ cd ../research-project/journal
$ ls
## makefile
Let us now use the make command to see if the makefile is properly executed.
$ cd ../research-project/journal
$ make draft_journal_entry.txt
## touch draft_journal_entry.txt
It seems like this has worked properly. We have created a new file with our command in the makefile. Let us try running it again and see what happens.
$ cd ../research-project/journal
$ make draft_journal_entry.txt
## make: 'draft_journal_entry.txt' is up to date.
Interesting, it tells us that the commands are up to date. This means that there is nothing to be done.
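Under the hood, make compares modification times: a target is rebuilt only if it is missing or older than one of its dependencies. A self-contained sketch of this logic (the /tmp/make-demo path is just for illustration, and it assumes make is installed):

```shell
mkdir -p /tmp/make-demo
cd /tmp/make-demo
printf 'readme.txt: toc.txt\n\twc -l < toc.txt > readme.txt\n' > Makefile  # \t writes the required tab
echo "entry 1" > toc.txt
make readme.txt   # the target is missing, so the rule runs
make readme.txt   # the target is not older than toc.txt, so make reports it is up to date
```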
Now let us do something more complex. Let us add a table of contents to our journal,
$ cd ../research-project/journal
$ echo "1. 2022-02-11-Lecture-Stellenbosch" > toc.txt
$ ls
## draft_journal_entry.txt
## makefile
## toc.txt
Next, let us edit our makefile in VS Code to include the following lines,
readme.txt: toc.txt
	echo "This journal contains the following number of entries:" > readme.txt
	wc -l toc.txt | egrep -o "[0-9]+" >> readme.txt
Now let us run make with readme.txt as the target.
$ cd ../research-project/journal
$ make readme.txt
## echo "This journal contains the following number of entries:" > readme.txt
## wc -l toc.txt | egrep -o "[0-9]+" >> readme.txt
To see that this works, let us look at the contents of readme.txt.
$ cd ../research-project/journal
$ cat readme.txt
## This journal contains the following number of entries:
## 1
Now let's modify toc.txt, then we'll try running make again.
$ cd ../research-project/journal
$ echo "2. 2022-02-12-Chilling-SomersetWest" >> toc.txt
$ make readme.txt
$ cat readme.txt
## echo "This journal contains the following number of entries:" > readme.txt
## wc -l toc.txt | egrep -o "[0-9]+" >> readme.txt
## This journal contains the following number of entries:
## 2
In order to simplify the make experience, we can create a rule at the top of our makefile called all, where we list all of the files that are built by the makefile. By adding the all target we can simply run make without any arguments in order to build all of the targets in the makefile.
Let us edit our makefile in VS Code so that it looks as follows,
all: draft_journal_entry.txt readme.txt

draft_journal_entry.txt:
	touch draft_journal_entry.txt

readme.txt: toc.txt
	echo "This journal contains the following number of entries:" > readme.txt
	wc -l toc.txt | egrep -o "[0-9]+" >> readme.txt

clean:
	rm draft_journal_entry.txt
	rm readme.txt
We add the clean rule to destroy the files created by our makefile. Normally you wouldn't do this, but for the purpose of the lecture it helps to keep the folder empty if we want to repeat the process from the start.
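One refinement that is standard practice, though not used in the lecture example: all and clean never actually produce files with those names, so if a file called clean ever appeared in the directory, make would consider the target up to date and skip the rule. Declaring such targets as phony avoids this:

```make
.PHONY: all clean
```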
Let's save and close our makefile, then let's test it out. First let's clean up our repository:
$ cd ../research-project/journal
$ make clean
$ ls
## rm draft_journal_entry.txt
## rm readme.txt
## makefile
## toc.txt
The clean command works as expected. Now let us simply run make and check that the makefile executes the correct commands.
$ cd ../research-project/journal
$ make
$ ls
## touch draft_journal_entry.txt
## echo "This journal contains the following number of entries:" > readme.txt
## wc -l toc.txt | egrep -o "[0-9]+" >> readme.txt
## draft_journal_entry.txt
## makefile
## readme.txt
## toc.txt
It does exactly what we wanted it to do. Success!
In the next lecture we will be looking at some examples that include R scripts, so that we can show how powerful make really can be. However, even if you didn't understand it all fully, you might already feel like this is something that could be useful in automating a lot of processes.
This has been a lot of information to process. Hopefully you found it useful. We will be seeing more on shell commands in some of our other lectures, so this work will not go to waste.
There are many topics that we did not cover, in particular SSH, but we will perhaps have time to look at some more ideas as we continue the course, especially toward the end when we talk about cloud computing.
There is so much you can do with the command line that there is even a book called "Data Science at the Command Line". It is worth checking out if you are interested.