Homework: week 12
Due in class, Tuesday/Thursday/Monday Nov 12/14/18.
The file email.rds contains 200 emails from the ENRON email database. The first 100 are regular emails, the second 100 are spam.
Download the data to a local disk and read it into R using the command readRDS. If the result is saved in the object enron, you can access each individual email using double brackets: the command cat(enron[[1]]) will print the first email.
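As a minimal sketch of this read-and-inspect step, assuming email.rds holds a list with one character string per email: the snippet below builds a small stand-in list (both messages are made up, not ENRON data), round-trips it through a temporary file, and prints the first element. With the real data only the readRDS("email.rds") call is needed.

```r
# Hypothetical stand-in for email.rds: a list of raw email texts.
toy <- list(
  "From: Alice Example <alice@example.com>\nSubject: Meeting notes\n\nSee attached.",
  "From: Deals <deals@example.net>\nSubject: FREE OFFER\n\nClick now!"
)

# Round-trip through a temporary .rds file so the sketch runs anywhere.
path <- tempfile(fileext = ".rds")
saveRDS(toy, path)
enron <- readRDS(path)  # with the real data: enron <- readRDS("email.rds")

length(enron)    # number of emails in the object
cat(enron[[1]])  # double brackets return the email text itself
class(enron[1])  # single brackets return a list of length one
```

The last two lines show why double brackets matter here: cat() needs the character string, not a one-element list wrapped around it.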
We want to process these emails in several steps.
- Write functions that allow you to extract the sender, subject, and date received of an email. To get you started, investigate what the following pattern is doing: gsub(".*From:[^<]*<([^>]*)>.*", "\\1", enron[[1]]). Wrap this pattern into a function, then call
- Write functions that extract from a character string:
  - the ratio of upper case to lower case letters
  - true/false for the presence/absence of a keyword
- Process the emails from the ENRON database with the help of your functions: as a first step, summarize all emails by sender, subject, and date. Then process the subject lines further. Think of five keywords that might allow you to distinguish between spam and regular email. Report the percentages of regular email and spam for each of these keywords.
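The steps above can be sketched on made-up data. This is only one possible shape, not a reference solution: the function names get_sender, get_subject, case_ratio and has_keyword are hypothetical, the four toy emails are invented, and the subject pattern assumes a header line that starts with "Subject:". The percentages are computed as the share of each class whose subject contains the keyword, which is one reading of the task. Date extraction would follow the same shape as get_subject and is left out.

```r
# Invented toy emails; regular ones first, spam second, as in the assignment.
emails <- c(
  "From: Alice Example <alice@example.com>\nSubject: Meeting notes\n\nSee attached.",
  "From: Bob Example <bob@example.com>\nSubject: Re: free for lunch?\n\nNoon works.",
  "From: Deals <deals@example.net>\nSubject: FREE OFFER\n\nClick now!",
  "From: Winner <win@example.net>\nSubject: You WON a free prize\n\nClaim it."
)
label <- rep(c("regular", "spam"), each = 2)

# Wraps the pattern from the assignment: keeps the address between < and >.
# If an email has no matching From: header, the input comes back unchanged.
get_sender <- function(x) {
  gsub(".*From:[^<]*<([^>]*)>.*", "\\1", x)
}

# Takes the first "Subject:" line and strips the field name (assumed layout).
get_subject <- function(x) {
  vapply(x, function(e) {
    m <- regmatches(e, regexpr("Subject:[^\n]*", e))
    if (length(m) == 0) NA_character_ else sub("^Subject:[[:space:]]*", "", m)
  }, character(1), USE.NAMES = FALSE)
}

# Ratio of upper case to lower case letters; Inf or NaN when there are
# no lower case letters.
case_ratio <- function(x) {
  n_upper <- lengths(regmatches(x, gregexpr("[[:upper:]]", x)))
  n_lower <- lengths(regmatches(x, gregexpr("[[:lower:]]", x)))
  n_upper / n_lower
}

# TRUE/FALSE for the presence of a keyword, ignoring case.
has_keyword <- function(x, keyword) {
  grepl(tolower(keyword), tolower(x), fixed = TRUE)
}

# Summarize, then process the subject lines.
summary_df <- data.frame(
  label = label,
  sender = get_sender(emails),
  subject = get_subject(emails),
  stringsAsFactors = FALSE
)
summary_df$ratio <- case_ratio(summary_df$subject)
summary_df$free <- has_keyword(summary_df$subject, "free")

# Percentage of emails in each class whose subject contains the keyword.
pct <- tapply(summary_df$free, summary_df$label, mean) * 100
print(summary_df)
print(pct)
```

A keyword such as "free" appears in one of the toy regular subjects as well, which illustrates why a single keyword rarely separates the two classes cleanly and why the upper/lower case ratio may be a useful second feature.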
Submit a commented R Markdown script of your code. Use the filename firstname-lastname-hw12-X.Rmd, where X is A, B, or XW depending on the section you're in (Tuesday is B, Thursday is A, Monday is XW).