Processing math: 100%
+ - 0:00:00
Notes for current slide
Notes for next slide

Lecture 5

Overview of Data Types

Tyler Ransom

ECON 5253, University of Oklahoma

1 / 21

Plan for the day

Focus on grappling with data

  • File formats and extensions

  • Archiving & file compression

  • Dictionaries (hash tables)

  • Big Data file types

2 / 21

A note on file extensions

  • Often, if you download a file, you will immediately understand what type of a file it is by its extension

  • File extensions in and of themselves don't serve any particular purpose other than convenience

  • File extensions were created so that humans could keep track of which files on their workspace are scripts, which are binaries, etc.

3 / 21

Why is the file format important?

  • File formats matter because they may need to match the environment you're working in

  • If you use the wrong file format, it may cause your computations to run slower than otherwise

  • To the extent that the environment you're working in requires a specific file format, then using the correct format is essential

4 / 21

Common file extensions when working with data

  • In the following table, I list some of the most common file extensions

  • For a more complete list of almost every file extension imaginable (note: they missed Stata's .do and .dta formats), see here.

  • Another great discussion about file formats is here on stackexchange

5 / 21

Open-format file extensions

The following file extensions are not tied to a specific software program

  • In this sense they are "raw" and can be viewed in any sort of text editor
File extension Description
CSV Comma separated values; data is in tabular form with column breaks marked by commas
TSV Tab separated values; data is in tabular form with column breaks marked by tabs
DAT Tab-delimited tabular data (ASCII file)
TXT Plain text; not organized in any specific manner (though usually columns are delimited with tabs or commsa)
XML eXtensible Markup Language; data is in text form with tags marking different fields
HTML HyperText Markup Language; similar to XML; used for almost every webpage you view
YAML YAML Ain't Markup Language; human readable version of XML
JSON JavaScript Object Notation; similar to YAML in that it has a human readable element. YAML is technically a superset of JSON.
HDF Hierarchical Data Format; bills itself as a "scientific data format" that can handle all types of data (tabular, text, images, etc.)
PDF Portable Document Format; not a great way to store data, but there exists much data in PDF format. Good luck unpacking it!
TIFF, JPEG These common image formats are used to store data in the form of images, or sometimes pictures of text data. There exist image processing libraries in almost every scientific programming language that can convert data in this format into more usable formats.
MP3, WAV These common audio formats may be used to store data. For example, voice-to-text applications have some way of converting income audio (in some format) into data that the machine can comprehend. The same holds for MP4 and other video file formats (e.g. for video input to self-driving cars, etc.)
6 / 21

Examples of CSV, TSV, XML, YAML, and JSON files

A possible JSON representation describing a person (source)

{
"firstName": "John",
"lastName": "Smith",
"isAlive": true,
"age": 27,
"address": {
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": "10021-3100"
},
"phoneNumbers": [
{
"type": "home",
"number": "212 555-1234"
},
{
"type": "office",
"number": "646 555-4567"
},
{
"type": "mobile",
"number": "123 456-7890"
}
],
"children": [],
"spouse": null
}
7 / 21

Examples of CSV, TSV, XML, YAML, and JSON files

The same example as previously, but in XML: (source)

<person>
<firstName>John</firstName>
<lastName>Smith</lastName>
<age>25</age>
<address>
<streetAddress>21 2nd Street</streetAddress>
<city>New York</city>
<state>NY</state>
<postalCode>10021</postalCode>
</address>
<phoneNumber>
<type>home</type>
<number>212 555-1234</number>
</phoneNumber>
<phoneNumber>
<type>fax</type>
<number>646 555-4567</number>
</phoneNumber>
<gender>
<type>male</type>
</gender>
</person>
8 / 21

Examples of CSV, TSV, XML, YAML, and JSON files

The same example, but in YAML: (source)

firstName: John
lastName: Smith
age: 25
address:
streetAddress: 21 2nd Street
city: New York
state: NY
postalCode: '10021'
phoneNumber:
- type: home
number: 212 555-1234
- type: fax
number: 646 555-4567
gender:
type: male

Note that the JSON code above is also valid YAML; YAML simply has an alternative syntax that makes it more human-readable

9 / 21

Proprietary file extensions

The following file extensions typically require additional software to read, edit, or convert to another format

File extension Description
DB A common file extension for tabular data for SQLite
SQLITE Another common file extension for tabular data for SQLite
XLS, XLSX Tab-delimited tabular data for Microsoft Excel
RDA, RDATA Tabular file format for R
MAT ... for Matlab
SAS7BDAT ... for SAS
SAV ... for SPSS
DTA ... for Stata
10 / 21

Archiving & file compression

Because data can be big and bulky, it is often easier to store and share the data in compressed form

File extension Description
ZIP The most common format for file compression
Z Alternative to ZIP; uses a slightly different format for compression
7Z Alternative to ZIP; uses 7-Zip software for compression
GZ Another alternative to ZIP (primarily used in Linux systems), using what's called gzip
TAR So-called "tarball" which is a way to collect many files into one archive file. TAR stands for "Tape ARchive"
TAR.GZ; TGZ A compressed version of a tarball (compression via gzip)
TAR.BZ2; .TB2; .TBZ; .TBZ2 Compressed tarball (via bzip2)

Tarball vs. TGZ

11 / 21

Other file types that aren't data

  • There are many file types that don't correspond to readable data. For example, script files (e.g. .R, .py, .jl, .sql, .do, .cpp, .f90, ...) are text files with convenient extensions to help the user remember which programming language the code is in

  • As a rule of thumb, if you don't recognize the extension of a file, it's best to inspect the file in a text editor (though pay attention to the size of the file as this can also help you discern whether it's code or data)

12 / 21

General Types of Data

  • When you think of data, you probably think of rows and columns, like a matrix or a spreadsheet

  • But it turns out there are other ways to store data, and you should know their similarities and differences to tabular data

13 / 21

Dictionaries (a.k.a. Hash tables)

  • A dictionary is a list that contains keys and values

  • Each key points to one value

  • While this may seem like an odd way to store data, it turns out that there are many, many applications in which this is the most efficient way to store things

  • We won't get into the nitty gritty details of dictionaries, but they are the workhorse of computer science, and you should at least know what they are and how they differ from tabular data

  • In fact, dictionaries are often used to store multiple arrays in one file (e.g. in Matlab and other languages where the system can store multiple arrays at once)

14 / 21

Dictionaries (a.k.a Hash tables)

  • The capability to manipulate hash tables is included in almost every major scientific programming language (although it is quite clunky in R ... this is why R is not considered to be a "general purpose programming language" by some people)

Dictionaries

The Julia code for constructing the dictionary above is

mydictionary = Dict("John Smith"=>"521-1234", "Lisa Smith"=>"521-8976", "Sandra Dee"=>"521-9655")
15 / 21

Why are dictionaries useful?

  • You might look at the previous example and think a vector would be a better way to store phone numbers

  • The power of dictionaries is in their lookup speed

  • Looking up an index in a dictionary takes the same amount of time no matter how long the dictionary is!

    • Computer scientists call this O(1) access time
  • Moreover, dictionaries can index objects, not just scalars

  • So I could have a dictionary of data frames, a dictionary of arrays, ...

16 / 21

Big Data file types

  • Big Data file systems like Hadoop and Spark often use the same file types as R, SQL, Python, and Julia

  • That is, CSV and TSV files are the workhorse

  • Because of the nature of distributed file systems (which we will discuss in much greater detail next time), it is often the case that JSON and XML are not good choices because they can't be broken up across machines

  • Note: there is a distinction between JSON files and JSON records; see the second link at the end of this document for further details

17 / 21

Sequence

  • Sequence files are dictionaries that have been optimized for Hadoop and friends

  • The advantage to taking the dictionary approach is that the files can easily be coupled and decoupled

18 / 21

Avro

  • Avro is an evolved version of Sequence---it contains more capability to store complex objects natively
19 / 21

Parquet

  • Parquet is a format that allows Hadoop and friends to partition the data column-wise (rather than row-wise)

  • Other formats in this vein are RC (Record Columnar) and ORC (Optimized Record Columnar)

20 / 21

Plan for the day

Focus on grappling with data

  • File formats and extensions

  • Archiving & file compression

  • Dictionaries (hash tables)

  • Big Data file types

2 / 21
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow