Aims of this homework
- learn the basics of dealing with text data in R
- compute some text metrics
- build a linguistic representation of your own texts
Task 1: Creating your corpus
In this homework, you will use your own text data for some initial linguistic analysis.
Select at least three assignments that you handed in as coursework in your BSc so far. To import them into R, either copy-and-paste them as raw text into a string variable, or save the raw text as .txt files and read these files in.
Create a corpus of these texts using the quanteda package.
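As a starting point, the two import routes described above could look roughly like the sketch below. The object names (my_texts, my_corpus), the example sentences and the folder name my_assignments are placeholders invented for illustration, not part of the assignment.

```r
library(quanteda)

# Route 1: paste the raw text into a named character vector.
# These short strings are hypothetical stand-ins for your own assignments.
my_texts <- c(
  essay_1 = "Crime rates fell last year. Burglary fell the most.",
  essay_2 = "Text mining turns documents into numbers. Those numbers can be analysed.",
  essay_3 = "Police data are often incomplete. Analysts must handle missing values carefully."
)

# Route 2 (commented out): read every .txt file in a folder, one string per file.
# The folder name "my_assignments" is an assumption.
# files <- list.files("my_assignments", pattern = "\\.txt$", full.names = TRUE)
# my_texts <- vapply(
#   files,
#   function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
#   character(1)
# )

# Build the corpus; the vector names become the document names
my_corpus <- corpus(my_texts)
summary(my_corpus)
```

Naming the elements of the character vector is optional, but it makes later per-text output easier to read.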
#your code here
Task 2: Text statistics
Calculate the average number of characters per word and the average number of words per sentence.
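One possible way to get both averages is sketched below on a toy corpus. It assumes that punctuation should not count as a word and that sentence counts come from quanteda's sentence tokeniser; the corpus and all object names are invented for illustration.

```r
library(quanteda)

# Hypothetical toy corpus; replace with your own corpus object
corp <- corpus(c(
  doc_a = "The cat sat. The dog barked loudly.",
  doc_b = "Short words win."
))

# Word tokens without punctuation, so full stops are not counted as words
word_toks <- tokens(corp, remove_punct = TRUE)

# Average number of characters per word, computed separately for each text
chars_per_word <- sapply(as.list(word_toks), function(w) mean(nchar(w)))

# Number of sentences per text, obtained by tokenising into sentences
n_sentences <- ntoken(tokens(corp, what = "sentence"))

# Average number of words per sentence for each text
words_per_sentence <- ntoken(word_toks) / n_sentences

chars_per_word
words_per_sentence
```

Whether punctuation counts as a token changes both averages, so it is worth stating the choice you made when reporting the numbers.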
#your code here
Task 3: Text metrics
What is the type-token ratio (TTR) of your texts (each text individually)?
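The TTR is the number of distinct word forms (types) divided by the total number of words (tokens). A minimal sketch on a toy corpus follows; lowercasing and removing punctuation first are assumptions about how types should be counted, and the texts are invented.

```r
library(quanteda)

# Hypothetical toy corpus; replace with your own corpus object
corp <- corpus(c(
  doc_a = "The cat saw the other cat.",
  doc_b = "Every word here is different."
))

# Lowercase and drop punctuation so that "The" and "the" count as one type
toks <- tokens_tolower(tokens(corp, remove_punct = TRUE))

# Type-token ratio per text: distinct word forms divided by total words
ttr <- ntype(toks) / ntoken(toks)
ttr
```

The quanteda.textstats package also offers textstat_lexdiv() with measure = "TTR", which may be more convenient. Note that TTR tends to fall as texts get longer, so comparisons between texts of very different lengths need care.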
#your code here
Task 4: Word frequencies
Build a term frequency count representation, and retrieve the top features (hint: topfeatures) for each text.
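A rough sketch of this step on a toy corpus is given below. Calling topfeatures() on one document row at a time is only one way to get per-text results; the texts and object names are invented for illustration.

```r
library(quanteda)

# Hypothetical toy corpus; replace with your own corpus object
corp <- corpus(c(
  doc_a = "Crime data show that crime is rising and crime reports vary.",
  doc_b = "Police data and court data differ from survey data."
))

# Document-feature matrix of raw term counts (lowercased by default)
dfmat <- dfm(tokens(corp, remove_punct = TRUE))

# Top features across the whole corpus
topfeatures(dfmat, n = 5)

# Top features per text: subset the matrix to one document at a time
top_by_doc <- lapply(docnames(dfmat), function(d) topfeatures(dfmat[d, ], n = 3))
names(top_by_doc) <- docnames(dfmat)
top_by_doc
```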
#your code here
Task 5: TF-IDF
Now build a TF-IDF weighted representation of your corpus. Perform this transformation in five different ways: (1) based on the raw texts, (2) removing stopwords, (3) removing punctuation, (4) stemming the words, and (5) combining (2)-(4).
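The five variants could be organised along the lines of the sketch below. It uses a toy corpus and the default settings of dfm_tfidf() (raw counts multiplied by a base-10 inverse document frequency); whether those defaults are the intended weighting is an assumption, as are all object names.

```r
library(quanteda)

# Hypothetical toy corpus; replace with your own corpus object
corp <- corpus(c(
  doc_a = "The offenders were running from the police.",
  doc_b = "The police recorded the offences.",
  doc_c = "The analysts studied the recorded data."
))

# (1) raw texts: default tokenisation, punctuation kept
toks_raw <- tokens(corp)

# (2) stopwords removed
toks_nostop <- tokens_remove(toks_raw, stopwords("en"))

# (3) punctuation removed
toks_nopunct <- tokens(corp, remove_punct = TRUE)

# (4) words stemmed
toks_stem <- tokens_wordstem(toks_raw)

# (5) all three combined; stopwords are removed before stemming
toks_all <- tokens_wordstem(tokens_remove(toks_nopunct, stopwords("en")))

variants <- list(
  raw = toks_raw,
  no_stopwords = toks_nostop,
  no_punct = toks_nopunct,
  stemmed = toks_stem,
  combined = toks_all
)

# Build a count matrix for each variant and apply TF-IDF weighting
tfidf <- lapply(variants, function(tk) dfm_tfidf(dfm(tk)))

# Compare how many features each variant keeps
sapply(tfidf, nfeat)
```

With these defaults, a term that occurs in every text receives a weight of zero, which is why frequent function words drop out of the top-weighted features even in the raw variant.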
#your code here
Task 6: Parts-of-speech
For all texts, calculate the part-of-speech proportions and find out which POS tag you use most often in each text.
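Once each token has a POS tag, the proportions are a matter of tabulating tags per text. The sketch below starts from a small hand-tagged table so that it runs without a tagger. In practice the table would come from a tagging package (for example the output of spacy_parse() in spacyr, which is an assumption about your setup), and the column names doc_id, token and pos are chosen to resemble that kind of output.

```r
# Hypothetical hand-tagged tokens standing in for real tagger output
tagged <- data.frame(
  doc_id = c(rep("essay_1", 5), rep("essay_2", 6)),
  token  = c("Officers", "recorded", "the", "crime", "quickly",
             "She", "ran", "and", "jumped", "and", "laughed"),
  pos    = c("NOUN", "VERB", "DET", "NOUN", "ADV",
             "PRON", "VERB", "CCONJ", "VERB", "CCONJ", "VERB")
)

# Count each POS tag within each text
pos_counts <- table(tagged$doc_id, tagged$pos)

# Convert counts to proportions within each text (rows sum to 1)
pos_props <- prop.table(pos_counts, margin = 1)
round(pos_props, 2)

# Most frequent POS tag per text (the first tag wins if there is a tie)
most_common <- apply(pos_props, 1, function(p) names(p)[which.max(p)])
most_common
```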
#your code here
END