How to solve data science problems

Tutorial 1, Advanced Crime Analysis, BSc Security and Crime Science, UCL

Outcomes of this tutorial

This tutorial shows you how to solve programming problems around data science. Each approach to a research problem is different and so are the problems and errors that you will encounter.

Because the aim of this module is not to turn you into a core R programmer but into a data science problem-solver, we take a pragmatic approach to programming. That is, we want you to be able to solve data science problems with R using all the help and tools available. This means you do not need to begin with the fundamentals of programming (bottom-up) but instead start with a problem (e.g. scraping all missing persons data) and solve that problem stepwise (top-down).

The pragmatic approach

enables you to solve problems quickly
gives you fast success
puts the problem first

At times, you will feel like “WTF!?” - but don’t worry. A central aspect of that pragmatic approach is that you know where to get help.

The aim of this tutorial is to equip you with the skills and knowledge needed to find help to solve (almost) all problems you will encounter in this module.

Structure of this tutorial

You will work through a set of problems that you might encounter in the module and your capstone data science project. These problems are deliberately chosen at a difficulty that we do not (yet) expect you to solve.

For the first problem, the task for you is to try to solve the problem as best as you can in 15 minutes.

After that, we will discuss how you approached that problem and will then for the rest of today’s tutorial show you how to approach R problems differently. Your task is then to solve a set of other problems using the help options we showed you.

Problem 1: Reading a full year of police data

Problem outline

When you’re working with open-source data from the police.uk data repository, you will be provided with data per month. In the data/tutorial1/police_data folder, you will find data for each police-recorded crime for Greater Manchester Police from Dec. 2017 - Nov. 2018 as a csv file.

The problem is that each month is a separate file (e.g. 2017-12-... for Dec. 2017, etc.). You want to work with all crimes in one file (e.g. to check for temporal effects) so you’d need to ‘paste’ them together somehow.

One option would be to read all 12 files one by one. However, this does not scale if the data were to grow (e.g. 100,000 files), so a more ‘generative’ approach is needed.

Task

Read the 12 months of crime data in a way that is extendable to 100,000 files (assuming the same structure). Give this problem 15 minutes.

#write your solution here

Check your solution: the code should produce a data.frame which has the following dimensions: dim(your_big_dataframe) –> 394017 rows, 12 columns.

WTF!? How to solve these kinds of problems?

Step 1: defining the problem

A starting point when facing a problem like the one above is “reconstructing” the problem through its parts. Rather than solving the whole problem at once (which will often fail), you can start by identifying the “sub-problems” within.

For example, in Problem 1: Reading a full year of police data, there are two sub-problems:

reading a .csv file from a specific path
repeating that read operation and ‘binding’ the 12 .csv files

If you have attempted to read the .csv file from a path, you might have encountered three sub-sub-problems: (1) finding a way to get all file names, (2) specifying a long path that takes into account the exact relative location of the files, and (3) pasting all files together. Thus:

reading a .csv file from a specific path
repeating that read operation and ‘binding’ the 12 .csv files
1. finding all files
2. getting the relative paths to the files
3. combining the files row-wise

Once the problem space is mapped out, you can start solving each of them:

reading a .csv file from a specific path: find out how to read a .csv file
repeating that read operation and ‘binding’ the 12 .csv files
1. finding all files: find out how to show all files in a folder/directory
2. getting the relative paths to the file: find out how to retrieve the relative file path
3. combining the files row-wise: find out how to combine/append files
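Each of the mapped-out sub-problems can be tried in isolation on a tiny example before touching the real data. The sketch below is one possible way to do that in base R. The folder demo_dir and the two small files are made up for illustration and are not part of the tutorial data.

```r
# Hypothetical demo folder with two tiny csv files (NOT the police data).
demo_dir = file.path(tempdir(), "subproblem_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, crime = c("a", "b"))
          , file = file.path(demo_dir, "2017-12-demo.csv")
          , row.names = FALSE)
write.csv(data.frame(id = 3:4, crime = c("c", "d"))
          , file = file.path(demo_dir, "2018-01-demo.csv")
          , row.names = FALSE)

# Sub-problem: reading a .csv file from a specific path
one_month = read.csv(file.path(demo_dir, "2017-12-demo.csv"), header = TRUE)

# Sub-sub-problem 1: finding all files in a folder (names only)
file_names = list.files(path = demo_dir)

# Sub-sub-problem 2: getting the paths to the files
file_paths = list.files(path = demo_dir, full.names = TRUE)

# Sub-sub-problem 3: combining two data frames row-wise
other_month = read.csv(file_paths[2], header = TRUE)
both_months = rbind(one_month, other_month)
dim(both_months)
```

Solving the pieces separately like this makes it easier to see which piece fails when something goes wrong.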

Step 2: problem abstraction

An important skill for solving programming problems is to ‘abstract’ a particular problem. For example, although the specific problem here is to bulk-read crime data from 12 csv files, a solution that shows you how to bulk-read 2 .txt files would probably bring you very close to the overall solution.

You will see that problem abstraction is very useful for the other problems in this tutorial.
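To make the abstraction concrete, the sketch below bulk-reads two small .txt files instead of 12 csv files. The folder name, file names and contents are invented for illustration; the part that carries over to the real problem is the pattern of listing, reading and binding.

```r
# Hypothetical toy folder with two small tab-separated .txt files.
toy_dir = file.path(tempdir(), "abstraction_demo")
dir.create(toy_dir, showWarnings = FALSE)
write.table(data.frame(x = 1:3), file.path(toy_dir, "a.txt")
            , sep = "\t", row.names = FALSE)
write.table(data.frame(x = 4:6), file.path(toy_dir, "b.txt")
            , sep = "\t", row.names = FALSE)

# List the files with their paths, keeping only names that end in .txt
toy_files = list.files(path = toy_dir, pattern = "\\.txt$", full.names = TRUE)

# Read every file into a list of data frames ...
toy_list = lapply(X = toy_files
                  , FUN = function(f){
                    read.table(f, header = TRUE, sep = "\t")
                    }
                  )

# ... and bind the list row-wise into one data frame
toy_data = do.call(what = rbind, args = toy_list)
nrow(toy_data)
```

Because nothing in the code refers to the number of files, the same lines would work for 2 files or for many more, as long as all files share the same structure.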

Step 3: finding actual solutions to the (sub-) problem(s)

We will focus on four approaches to finding solutions to the problems. Sometimes, the full problem can be solved at once, other times you might have to define the problem carefully because no solutions to that specific problem exist (yet).

Each of the following approaches can be helpful for both the sub-problem route as well as for solving the full problem at once.

Proper use of Google

While it may seem obvious to use Google to search for solutions to problems, many struggle to find good search terms for programming problems.

Some advice for good search queries:

include the programming language: (ideally) you want to find a solution in your desired programming language, so prepend or append it to your search query string (e.g. “create list r”)
stick to the necessary bits: no need to search as you would write; a search string “how to create a list in r” is nearly identical to “create list r”.
if you copy-and-paste an error message, exclude the actual file name or path: the specific file name is irrelevant to the problem and you do not want to reveal details about the files (e.g. when closed-source data) or your computer (e.g. your folder structure or name). This is particularly important when using Stackoverflow (see below).

Let’s look at some examples of how to use Google as an initial help:

For sub-problems

Here we have used example search queries (and links to them) that should help you solve each of the sub-problems identified above.

reading a .csv file from a specific path: “read csv file from path r”
repeating that read operation and ‘binding’ the 12 .csv files
1. finding all files: “list files in folder r”
2. getting the relative paths to the file: “get relative path of file r”
3. combining the files: “bind files r”, “append files from directory r”

For error messages

Often (especially at the beginning) there will be error messages as a result of your code. In the current example, it’s possible that you got an error like:

cannot open file '2017-12-greater-manchester-street.csv': No such file or directory
Error in file(file, "rt") : cannot open the connection

To fix the code, you must know what the error is telling you. A simple way to find out is copy-and-pasting the error message into Google.

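Note the difference between these two searches, the first using the full error message including the file name and the second using the message without the file name: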

The second one will bring much more meaningful results because it excludes the file name and reduces the error message to its essence.
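One way to see where such a message comes from, and to turn it into a search-friendly string, is sketched below. The demo folder and file name are made up, and the sketch assumes that no file with that name exists in your working directory. Removing the quoted file name with gsub(...) is just one possible way to shorten the message.

```r
# Hypothetical demo folder with one small csv file.
err_dir = file.path(tempdir(), "error_demo")
dir.create(err_dir, showWarnings = FALSE)
write.csv(data.frame(a = 1:2)
          , file = file.path(err_dir, "2017-12-error-demo.csv")
          , row.names = FALSE)

# By default, list.files() returns bare file names without the folder
bare_names = list.files(path = err_dir)

# R looks for these names in the working directory, where they do not exist
file.exists(bare_names)

# Capture the message as a string instead of stopping the script
msg = tryCatch(read.csv(bare_names[1])
               , warning = function(w) conditionMessage(w)
               , error = function(e) conditionMessage(e))

# Remove everything in single quotes (here: the file name) before searching
search_query = gsub(pattern = "'[^']*'", replacement = "", x = msg)
search_query
```

The call to file.exists(...) is often the quickest check for this kind of error: if it returns FALSE, the path you passed to read.csv(...) does not point to the file.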

For the full problem

You can of course always attempt to find solutions to the problem as a whole. This can often work for general problems (e.g. file input/output) and common operations (e.g. string operations), but will likely fail once you work on your own project.

For the current problem, let’s try to solve it all at once:

You see that these search results will bring you to solutions that are very close to the ones you’re looking for (e.g. this one). You still have to add the “relative path” aspect to the final solution. This highlights that for more specific problems, it’s often useful to identify the sub-problems.

Stackoverflow

You will have noticed that many (most) of Google’s search results point to the website stackoverflow.com. Stackoverflow is a large Q&A platform for programmers where users post programming questions and get answers from experts around the world.

Often, questions are answered within minutes but it also happens that questions remain unanswered. The better the question, the higher the chance of getting a high-quality answer. If you cannot find a solution to your problem even after trying the sub-problem path, or cannot find a solution to a sub-problem, you could consider posting a question on stackoverflow yourself. Some advice on how to do this:

provide example code EXAMPLE
- ! read this primer on reproducible R code
use the problem in its purest (most abstract) form EXAMPLE
be brief EXAMPLE
check for related questions (these are suggested automatically on the right-hand side while you type your question)
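As an illustration of the advice above, a question could contain a small, self-contained example like the following sketch. The data frame example_data and the "why is my mean NA?" problem are made up for illustration; the point is that anyone can copy, paste and run the code.

```r
# A small made-up data frame that reproduces the structure of the problem
example_data = data.frame(id = 1:3
                          , value = c(2.5, NA, 4.1))

# dput() prints R code that recreates the object exactly,
# so others can copy it into their own R session
dput(example_data)

# The smallest piece of code that shows the unexpected behaviour
problem_result = mean(example_data$value)
problem_result

# The R version can matter, so it is worth including in the question
R.version.string
```

Note that the example contains no file paths and no real data, in line with the advice on not revealing details about your files or computer.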

Who knows, maybe at some point you can help others find solutions to their (yet) unanswered R questions.

Using R’s built-in help

If you already have a function from R core or an installed R package in mind, you can check R’s help for details on the function and its parameters. You call R’s help by typing ? followed by the function name.

In the example, suppose you know that list.files(...) is what you need to list the files in a directory but you’re not quite sure on how to get the relative path: it’s then worth checking this with:

?list.files

This help file shows you that there is an argument called full.names in the list.files(...) function which is

a logical value. If TRUE, the directory path is prepended to the file names to give a relative file path. If FALSE, the file names (rather than paths) are returned.

Under “Usage” you see what each of the parameters is set to by default:

#from the help file:
list.files(path = "."
           , pattern = NULL
           , all.files = FALSE
           , full.names = FALSE
           , recursive = FALSE
           , ignore.case = FALSE
           , include.dirs = FALSE
           , no.. = FALSE
           )
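Besides reading the help page, it can help to try out the argument in question on a small folder. The sketch below compares full.names = FALSE with full.names = TRUE; the folder help_dir and the empty file in it are made up for illustration.

```r
# Other ways to reach the same information as ?list.files
# help("list.files")   # opens the same help page
args(list.files)       # prints the arguments and their defaults

# Hypothetical demo folder with one empty file.
help_dir = file.path(tempdir(), "help_demo")
dir.create(help_dir, showWarnings = FALSE)
file.create(file.path(help_dir, "example.csv"))

# Default behaviour: file names only
names_only = list.files(path = help_dir, full.names = FALSE)

# With full.names = TRUE the directory path is prepended
with_path = list.files(path = help_dir, full.names = TRUE)

# Stripping the folder from the path gives back the bare file name
basename(with_path) == names_only
```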

Re-using code

Another approach, which you will likely use once you have produced more code in the next weeks, is re-using code.

Suppose you have the working code for the current example. Even if the problem changes, the code might still help you to solve a related problem.

Example: Suppose your new problem is to read all .txt files (from a folder with many different file formats) and bind them together column-wise (i.e. adding a column for each new file).

You can solve this problem by re-using the code since the structure of the problem is the same with a few minor additions:

~~reading a .csv file from a specific path~~ reading a .txt file from a specific path
1. NEW: select files by file format (select only .txt files)
~~repeating that read operation and ‘binding’ the 12 .csv files~~ repeating that read operation for all .txt files and ‘binding’ them
1. finding all files
2. getting the relative paths to the files
3. combining the files column-wise

You find a folder with files where each contains a column of 100 variables at ./data/tutorial1/mixed_file_formats. Read and column-bind only the .txt files. Modify the code below to solve the new problem.

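#list all files in the folder, including the path to each file
all_files = list.files(path = './data/tutorial1/police_data'
                       , full.names = TRUE)

#read each file into a data.frame and bind all of them row-wise
big_data_frame = do.call(what = rbind
                         , args = lapply(X = all_files
                                         , FUN = function(x){
                                           read.csv(x
                                                    , header = TRUE
                                                    )
                                           }
                                         )
                         )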

You can find the solution to each problem (incl. Problem 1) in this R Notebook. We encourage you to only check the solutions after you have solved the problem. We assume that you attempt to solve the problems yourself with the strategies outlined. In doing so, you will acquire the problem-solving skills that are necessary to write the code in the next weeks and for your capstone project.

Problem 2: Calculation with dates

Problem outline

In some cases, you might be interested in temporal effects (e.g. how language use develops over time) which might require you to do arithmetic operations with dates (e.g. calculating the difference in minutes between two dates).
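As a small, abstracted illustration of date arithmetic in base R, consider the sketch below. The dates and times are made up and are unrelated to the data used in the task.

```r
# Two made-up dates, converted from text to Date objects
posted = as.Date("2018-11-01", format = "%Y-%m-%d")
reference = as.Date("2018-11-30", format = "%Y-%m-%d")

# Subtracting two Date objects gives a time difference in days
days_between = as.numeric(reference - posted)
days_between

# For date-times, difftime() lets you choose the unit explicitly
t1 = as.POSIXct("2018-11-30 10:00:00", tz = "UTC")
t2 = as.POSIXct("2018-11-30 12:30:00", tz = "UTC")
minutes_between = as.numeric(difftime(t2, t1, units = "mins"))
minutes_between
```

Wrapping the result in as.numeric(...) turns the time difference into a plain number that can be used in further calculations.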

Task

In the folder ./data/tutorial1/vlogs_data you can find an .RData file called vlogs_data.RData. This file contains a dataframe with four columns expressing YouTube metadata (the YouTuber’s name, the video URL, number of views and date of the video posting) from the ~~alt-right~~ controversial YouTuber Milo Yiannopoulos and popular vlogger Casey Neistat. Suppose you’re interested in comparing the view count: a problem you’d encounter is that view count might be highly correlated with the number of days the video has been on the platform.

Your task is to calculate a new variable (column) that is called view_count_corrected and is equal to the original view count divided by the number of days the video is active. All videos were scraped on the 30th of November 2018 (use this date as the reference date).

#your R code here

Problem 3: Cleaning text data

Problem outline

Later in this module, you will work with text data. This source of data is one of the most exciting ones ~~but~~ because it is very messy and unstructured. This means that you often have to spend a considerable amount of time to clean the data.

Task

Read the .csv file called messy_text_data.csv from the folder ./data/tutorial1/messy_text. You can see that each row contains two variables: the original string (names of the top 5 most wanted terrorists by the FBI) and the cleaned string. Your task is to reproduce the cleaned string from the original one. You can check your work against the cleaned_string column. Name your output variable cleaned_string_check.

#your R code here

Problem 4: Writing data to individual files

Problem outline

Another issue you could encounter when dealing with text data (e.g. when sharing text data) is that of writing data “in reverse” (i.e. writing from a dataframe to individual files).

Task

In the folder ./data/tutorial1/vlogs_data_2 you can find the .RData file vlogs_data_with_text.RData. This file loads the dataframe called vlogs_with_text which contains a vlog ID (channel_vlog_id), the video URL (url), and the actual transcript from the vlog. Your task is to write each vlog transcript to a separate .txt file. Each file should contain only the transcript and should have as a file name the channel_vlog_id.

Tip: you’d need to loop through the dataframe and then use the standard function to write a table (write.table(...)).
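To illustrate the tip on a toy example, the sketch below loops through a small, made-up data frame and writes one file per row. The data frame toy_text, its column names and the output folder are invented for illustration and differ from the data in the task.

```r
# Made-up toy data: one ID and one text per row
toy_text = data.frame(doc_id = c("doc_1", "doc_2")
                      , text = c("first text", "second text")
                      , stringsAsFactors = FALSE)

# Hypothetical output folder in R's temporary directory
out_dir = file.path(tempdir(), "write_demo")
dir.create(out_dir, showWarnings = FALSE)

# Loop through the rows and write each text to its own .txt file
for(i in seq_len(nrow(toy_text))){
  out_path = file.path(out_dir, paste0(toy_text$doc_id[i], ".txt"))
  write.table(x = toy_text$text[i]
              , file = out_path
              , row.names = FALSE
              , col.names = FALSE
              , quote = FALSE)
}

list.files(out_dir)
```

Setting row.names, col.names and quote to FALSE keeps row numbers, headers and quotation marks out of the files, so each file contains only the text.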

#your R code here

Problem 5: Transforming wide data to long data

Problem outline

A common data transformation problem is that of converting wide data to long data (brief explainer). Often, you want to analyse the data using factors instead of different columns. For example, rather than having one column for burglaries and one for violent crimes per city, you’d rather have one column indicating the crime type (which is either burglaries or violent crimes) and another indicating the count. This kind of dataframe representation is in line with Wickham’s (2014) idea of tidy data.
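The difference between the two formats is easiest to see on a tiny example. The sketch below uses base R’s reshape(...) on made-up numbers for two fictional cities; other routes (e.g. functions from add-on packages) exist, and this is only one of them.

```r
# Made-up wide data: one row per city, one column per crime type
wide = data.frame(city = c("A", "B")
                  , burglaries = c(10, 20)
                  , violent_crimes = c(5, 15)
                  , stringsAsFactors = FALSE)

# Convert to long format: one row per city and crime type
long = reshape(data = wide
               , direction = "long"
               , varying = c("burglaries", "violent_crimes")
               , v.names = "count"
               , timevar = "crime_type"
               , times = c("burglaries", "violent_crimes")
               , idvar = "city")

# Drop the automatically generated row names
rownames(long) = NULL
long
```

The wide data frame has 2 rows and 3 columns; the long version has 4 rows, one for each combination of city and crime type.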

Task

Read the .txt file called crime_data.txt from the folder ./data/tutorial1/wide_data. You will notice that the data is in the wide format (i.e. having different columns for burglaries and violent crimes). Your task is to convert this wide data frame to a long dataframe where:

a new ‘key’ column is created called crime_type
a new ‘value’ column is created called count

#your R code here

Dept of Security and Crime Science, UCL

B Kleinberg

8 January 2019

END