The goal of this project is to extract information from large chat datasets. I will be using a Google Hangouts conversation as the example dataset.
The first part presents the results of my case study; the second part shows how to replicate this work with code in R.
Part 1 – Case Study: Emery & Aimee (December 2016 – October 2017)
The dialogue started in December 2016, and the data was pulled in mid-October 2017, for a total of over 44,000 words sent back and forth.
The heaviest messaging occurred in early June and again after August, when we were geographically separated. The quiet stretch in the middle, with very few messages, is when we were living together and on vacation.
                             | Emery | Aimée
Total Words                  | 23742 | 20925
Unique Words                 | 5350  | 4457
Rate of Unique Words         | 0.225 | 0.213
Total Sentiment Score (Bing) | 1081  | 1346
After cleaning and stemming the conversations, the table above gives a brief synopsis. Emery had a greater total word count, more unique words, and a higher rate of unique-word use. Aimée had a greater sentiment score using the Bing method, summed over all words with more than 10 occurrences.
[Figure: WordCloud of Emery and WordCloud of Aimée]
Most of our sentiment contributions were relatively similar; the notable differences are that Emery used the word "tired" (negative) more, while Aimée used "sorry" (negative) significantly more.
Part 2 – Tutorial and R Script
library(dplyr)
library(tidyr)
library(wordcloud)
library(tm)
library(ggplot2)
library(SnowballC)
library(stringr)
library(tidytext)      # also provides the sentiment lexicons via get_sentiments()
library(RColorBrewer)
These libraries are necessary; you may need to run install.packages("package") for any that aren't already installed on your system.
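If you want to install everything in one shot, a quick sketch:

# One-time setup: install any of these packages you don't already have
install.packages(c("dplyr", "tidyr", "wordcloud", "tm", "ggplot2",
                   "SnowballC", "stringr", "tidytext", "RColorBrewer"))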
The next block of code imports the .csv data file(s). I was combining two files of conversations between the same two people; you will probably only deal with one file.
data1 <- read.csv("~/Desktop/hang1.csv", header = T, sep = ";")  # import data
data4 <- subset(data1, select = c(timestamp, sender_name, message))
unique(data4$sender_name)                  # confirming only 2 names
countfram <- data4 %>% count(sender_name)  # count of total messages sent per person
data4$time <- data4$timestamp              # copy the timestamp into a new column
data4$time <- as.Date(data4$time)          # drop the time of day, keeping dates only (pass format = if your timestamps need it)
ggplot(data4) +
  geom_bar(aes(x = time, fill = sender_name)) +
  xlab("") + ylab("")
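If you are combining two export files like I was, read the second file and stack it onto the first before the subset step above. A minimal sketch (hang2.csv is a hypothetical second export with the same columns):

# Read a second export and stack it under the first (assumes identical columns)
data2 <- read.csv("~/Desktop/hang2.csv", header = T, sep = ";")
data1 <- rbind(data1, data2)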
The last snippet, the ggplot call, creates the visual timeline of sending rates. You may need to adjust the bin width to get this where you want it (see the sketch below).
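The geom_bar() call above draws one bar per distinct date. One way to control the bin width explicitly is to switch to geom_histogram(), where binwidth is measured in days on a Date axis; a sketch:

# Weekly bins instead of one bar per day (binwidth is in days for Date axes)
ggplot(data4) +
  geom_histogram(aes(x = time, fill = sender_name), binwidth = 7) +
  xlab("") + ylab("")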
Emery <- subset(data4, sender_name == "Emery")
Aimee <- subset(data4, sender_name == "Aimee")
tidy_emery <- Emery %>% unnest_tokens(word, message)
tidy_aimee <- Aimee %>% unnest_tokens(word, message)
tidy_emery2 <- subset(tidy_emery, select = word)
tidy_aimee2 <- subset(tidy_aimee, select = word)

####### Emery WordCloud ########
emery_corpus <- Corpus(DataframeSource(tidy_emery2))
emery_corpus <- tm_map(emery_corpus, content_transformer(tolower))
emery_corpus <- tm_map(emery_corpus, removeWords, stopwords("english"))
Etdm <- TermDocumentMatrix(emery_corpus)
Edf <- tidy(Etdm)
emery_count <- Edf %>% count(term, sort = TRUE)
emery_count <- emery_count[-c(38, 96, 244, 245, 424), ]  # remove noise words from the data
wordcloud(emery_count$term, emery_count$n, max.words = 200, colors = brewer.pal(8, "Dark2"))
########################
The code above creates the WordClouds. In this example, I plotted the 200 most-used words, removed stopwords, and dropped a few rows that were noise in the data.
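Only Emery's cloud is shown above; Aimée's follows the exact same steps. A minimal sketch (the noise-term filter here is a placeholder — inspect your own aimee_count to decide what to drop):

####### Aimee WordCloud ########
aimee_corpus <- Corpus(DataframeSource(tidy_aimee2))
aimee_corpus <- tm_map(aimee_corpus, content_transformer(tolower))
aimee_corpus <- tm_map(aimee_corpus, removeWords, stopwords("english"))
Atdm <- TermDocumentMatrix(aimee_corpus)
Adf <- tidy(Atdm)
aimee_count <- Adf %>% count(term, sort = TRUE)
# Drop noise terms by name rather than row index (these terms are placeholders)
# aimee_count <- aimee_count %>% filter(!term %in% c("amp", "httpswww"))
wordcloud(aimee_count$term, aimee_count$n, max.words = 200, colors = brewer.pal(8, "Dark2"))
########################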
The final step in this process is to determine sentiments. I use examples from both the Bing and NRC sentiment lexicons.
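One note before running the next block: in newer versions of tidytext, get_sentiments("nrc") fetches the NRC lexicon through the textdata package, so you may need that installed as well. A sketch:

# The NRC lexicon is downloaded via textdata in newer tidytext releases
install.packages("textdata")   # one-time, if not already installed
get_sentiments("nrc")          # prompts to download the lexicon on first use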
### NRC lexicon to find positive words ####
nrcPositive <- get_sentiments("nrc") %>%
  filter(sentiment == "positive")
emery_Positive <- tidy_emery2 %>%
  inner_join(nrcPositive) %>%
  count(word, sort = TRUE)
sum(emery_Positive$n)  # count of all positive words = 1630

### Emery sentiment contributions, Bing ###
bing <- get_sentiments("bing")
bing_word_countsE <- tidy_emery2 %>%
  inner_join(bing) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_countsE %>%
  filter(n > 10) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab("Contribution to sentiment") +
  ggtitle("Emery Sentiment Contributions") +
  theme(plot.title = element_text(hjust = 0.5))
The code above is used to create the sentiment charts.
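The snippet above produces Emery's chart; Aimée's chart is the same pipeline pointed at her tokens. A sketch (assuming tidy_aimee2 from the word-cloud step and bing from the block above):

### Aimee sentiment contributions, Bing ###
bing_word_countsA <- tidy_aimee2 %>%
  inner_join(bing) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_countsA %>%
  filter(n > 10) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab("Contribution to sentiment") +
  ggtitle("Aimee Sentiment Contributions") +
  theme(plot.title = element_text(hjust = 0.5))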
##### Unique word counts and unique words per total words
sum(emery_count$n)  # = 23742 total words, 5350 unique, = 0.225 unique rate
sum(aimee_count$n)  # = 20925 total words, 4457 unique, = 0.213 unique rate

##### Summation of Bing sentiment scores for words with usage > 10
Ecleansent <- bing_word_countsE %>%
  filter(n > 10) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n))
sum(Ecleansent$n)  # = 1081 for Emery
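For completeness, the unique-word figures in the Part 1 table come straight from the count data frames, and Aimée's Bing score follows the same filter-and-sum pattern. This sketch assumes aimee_count and bing_word_countsA from the earlier sketches:

nrow(emery_count)                        # 5350 unique terms for Emery
nrow(emery_count) / sum(emery_count$n)   # 5350 / 23742 ≈ 0.225 unique rate
nrow(aimee_count)                        # 4457 unique terms for Aimée
nrow(aimee_count) / sum(aimee_count$n)   # 4457 / 20925 ≈ 0.213 unique rate

Acleansent <- bing_word_countsA %>%
  filter(n > 10) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n))
sum(Acleansent$n)                        # should match the 1346 in the Part 1 table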
This work was made possible by StackExchange and Tidy Text.
Feel free to use, edit, and share anything here.
WordPress messed up some of my code when trying to copy paste. Tried my best to fix it!