Stat 579

Homework: week 12

Due in class on Tuesday, Nov 12; Thursday, Nov 14; or Monday, Nov 18, depending on your section.

Spam Detector

The file email.rds contains 200 emails from the ENRON email database. The first 100 are regular emails; the second 100 are spam.

Download the data to a local disk and read it into R using the command readRDS. If you save the result in an object named enron, you can access each individual email by using double brackets: the command cat(enron[[1]]) will print the first email.
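A minimal sketch of the read-and-inspect step. So that it runs without the download, it writes a made-up two-email list to a temporary file first; the toy emails, the object name toy, and the variable path are illustrative assumptions, not part of the homework data.

```r
# Toy stand-in for email.rds: a list of two made-up raw email strings.
toy <- list(
  "From: Alice Example <alice@example.com>\nSubject: Meeting\n\nSee you at 3.",
  "From: Bob Example <bob@example.com>\nSubject: WIN NOW\n\nClick here."
)
path <- tempfile(fileext = ".rds")
saveRDS(toy, path)

# For the homework, replace `path` with the location of the downloaded email.rds.
enron <- readRDS(path)

length(enron)     # number of emails in the list
cat(enron[[1]])   # double brackets return the email itself, a character string
class(enron[1])   # single brackets return a list of length one instead
```

With the real file, length(enron) should report the 200 emails described above.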

We want to process these emails in several steps.

  1. Write functions that extract the sender, subject, and date received from an email. To get you started, investigate what the following pattern is doing: gsub(".*From:[^<]*<([^>]*)>.*", "\\1", enron[[1]]). Wrap this pattern into a function, then call ldply(enron, myfunction), where ldply comes from the plyr package.
  2. Write functions that extract from a character string
    • the ratio of upper case to lower case letters
    • true/false for the presence/absence of a key word
  3. Process the emails from the ENRON database with the help of your functions, i.e. in a first step summarize all emails by sender, subject, and date. Then further process subject lines. Think of five keywords that might allow you to distinguish between spam and regular email. Report percentages of regular email/spam for each of these keywords.
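The steps above could be sketched roughly as follows. This is one possible shape, not a model answer: the four toy emails, all function names (get_sender, get_subject, case_ratio, has_keyword), and the keyword "free" are hypothetical. It uses regexec/regmatches rather than the gsub pattern from step 1, so that an email without a match gives NA instead of the unchanged text, and base R sapply rather than ldply to avoid a package dependency. "Percentage" is read here as the share of each class whose subject contains the keyword, which is only one interpretation of step 3.

```r
# Made-up emails standing in for the ENRON data: two regular, then two spam.
toy <- c(
  "From: Alice Example <alice@example.com>\nSubject: Meeting notes\n\nSee attached.",
  "From: Carol Example <carol@example.com>\nSubject: Lunch on Friday\n\nNoon?",
  "From: Deals <deals@example.net>\nSubject: FREE offer inside\n\nClick now.",
  "From: Promo <promo@example.net>\nSubject: Free money, act NOW\n\nLimited time."
)
type <- rep(c("regular", "spam"), each = 2)  # each = 100 for the real data

# Address between angle brackets on the From: line; NA if there is no match.
get_sender <- function(email) {
  m <- regmatches(email, regexec("From:[^<\r\n]*<([^>]*)>", email))[[1]]
  m[2]
}

# Text after "Subject:" up to the end of that line; NA if there is no match.
get_subject <- function(email) {
  m <- regmatches(email, regexec("Subject:[ \t]*([^\r\n]*)", email))[[1]]
  m[2]
}

# Ratio of upper case to lower case letters (Inf or NaN if no lower case).
case_ratio <- function(x) {
  n_upper <- nchar(gsub("[^[:upper:]]", "", x))
  n_lower <- nchar(gsub("[^[:lower:]]", "", x))
  n_upper / n_lower
}

# TRUE/FALSE for presence of a keyword, ignoring case.
has_keyword <- function(x, keyword) {
  grepl(tolower(keyword), tolower(x), fixed = TRUE)
}

# Summarize the emails, then process the subject lines.
emails <- data.frame(
  type = type,
  sender = sapply(toy, get_sender, USE.NAMES = FALSE),
  subject = sapply(toy, get_subject, USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
emails$upper_lower <- case_ratio(emails$subject)
emails$has_free <- has_keyword(emails$subject, "free")

# Percentage of emails in each class whose subject contains the keyword.
pct_free <- tapply(emails$has_free, emails$type, mean) * 100
print(emails)
print(pct_free)
```

A date extractor could follow the same shape as get_subject, assuming the emails carry a "Date:" header line; that should be checked against the actual data.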
Deliverables

Submit a commented R markdown script of your code. Use the filename firstname-lastname-hw12-X.Rmd, where X is A, B, or XW depending on the section you're in (Tuesday is B, Thursday is A, Monday is XW).

    Great Answers: