The goal of this project is to extract information from large chat datasets. I will be using a Google Hangouts conversation as the example.

The first part presents the results of my case study; the second part shows how to replicate the work with R code.

Part 1 – Case Study: Emery & Aimee (December 2016 – October 2017)

Messages sent over time

The conversation started in December 2016, and the data was pulled in mid-October 2017, with a total of over 44,000 words sent back and forth.

The heaviest messaging occurred in early June and again after August, when we were geographically separated. The quiet stretch in the middle is when we were living together and on vacation.

                               Emery    Aimée
Total Words                    23742    20925
Unique Words                    5350     4457
Rate of Unique Words           0.225    0.213
Total Sentiment Score (Bing)    1081     1346

After cleaning and stemming the conversations, the table above gives a brief synopsis. Emery sent more total words, used more unique words, and had a higher rate of unique words. Aimée had a higher sentiment score using the Bing method, summed over all words with more than 10 occurrences.

 

WordCloud of Emery / WordCloud of Aimée

 

 

Emery Sentiment Contributions

 

Aimee Sentiment Contributions

Most of our sentiment contributions are relatively similar; the main differences are that Emery uses the word “tired” (negative) more, and Aimée uses “sorry” (negative) significantly more.

 

Part 2 – Tutorial and R Script


library(dplyr)
library(tidyr)
library(wordcloud)
library(tm)
library(ggplot2)
library(SnowballC)
library(stringr)
library(tidytext)
library(RColorBrewer)

These libraries are necessary. You may need to run install.packages("package") for any that aren't already installed on your system.
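If several are missing, installing them all at once should look roughly like this (these are just the packages loaded above):

install.packages(c("dplyr", "tidyr", "wordcloud", "tm", "ggplot2",
                   "SnowballC", "stringr", "tidytext", "RColorBrewer"))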

The next set of code imports the .csv data file(s). I was combining two files of conversations between the same two people; you will probably only deal with one file.
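If you do need to merge two exports, a minimal sketch of that first step (replacing the single read.csv below, and assuming both files share the same columns; hang2.csv is a hypothetical second file) could be:

data1 <- read.csv("~/Desktop/hang1.csv", header = T, sep = ";") #first export
data2 <- read.csv("~/Desktop/hang2.csv", header = T, sep = ";") #hypothetical second export
data1 <- rbind(data1, data2) #stack both conversations into one data frame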


data1 <- read.csv("~/Desktop/hang1.csv", header = T, sep = ";") #import data

data4 <- subset(data1, select = c(timestamp, sender_name, message)) #keep only the columns we need
unique(data4$sender_name) #confirming only 2 names
countfram <- data4 %>% count(sender_name) #creates count of total messages sent per person
data4$time <- data4$timestamp #copying the timestamp into a new column
data4$time <- as.Date(data4$time) #dropping the time of day to be left with dates only (assumes timestamps like "2017-10-14 18:32:00"; pass format = if yours differ)
ggplot(data4) + geom_bar(aes(x = time, fill = sender_name)) + xlab("") + ylab("") #messages per day, coloured by sender

The last line of ggplot creates the visual timeline of sending rates. You may need to adjust the bin width to get this where you want it; a sketch of one way to do that follows.
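One option (a sketch, not the exact chart above) is to switch from geom_bar to geom_histogram, which takes a binwidth in days when x is a Date:

ggplot(data4) +
  geom_histogram(aes(x = time, fill = sender_name), binwidth = 7) + #one bar per week; change binwidth (in days) to taste
  xlab("") + ylab("")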

Emery <- subset(data4, sender_name == 'Emery')
Aimee <- subset(data4, sender_name == "Aimee")

tidy_emery <- Emery %>% unnest_tokens(word, message) #one row per word
tidy_aimee <- Aimee %>% unnest_tokens(word, message)

tidy_emery2 <- subset(tidy_emery, select = word) #keep only the word column
tidy_aimee2 <- subset(tidy_aimee, select = word)

####### Emery WordCloud ########
emery_corpus <- Corpus(VectorSource(tidy_emery2$word)) #build a corpus from the individual words
emery_corpus <- tm_map(emery_corpus, content_transformer(tolower))
emery_corpus <- tm_map(emery_corpus, removeWords, stopwords('english'))

Etdm <- TermDocumentMatrix(emery_corpus)
Edf <- tidy(Etdm) #tidytext's tidy() flattens the term-document matrix into one row per term/document
emery_count <- Edf %>% count(term, sort = TRUE) #word frequencies
emery_count <- emery_count[-c(38,96,244,245,424),] #remove noise rows; these row numbers are specific to my data

wordcloud(emery_count$term, emery_count$n, max.words = 200, colors = brewer.pal(8, "Dark2"))
########################

The code above creates the WordClouds. In this example, I used the 200 most-used words, removed stopwords, and dropped a few rows that were noise in the data. The Aimée cloud is built the same way, as sketched below.
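A sketch of the mirrored steps for Aimée (the noise rows to drop will differ for your own data, so I leave that line out; aimee_count is reused later for the word totals):

####### Aimee WordCloud ########
aimee_corpus <- Corpus(VectorSource(tidy_aimee2$word))
aimee_corpus <- tm_map(aimee_corpus, content_transformer(tolower))
aimee_corpus <- tm_map(aimee_corpus, removeWords, stopwords('english'))

Atdm <- TermDocumentMatrix(aimee_corpus)
Adf <- tidy(Atdm)
aimee_count <- Adf %>% count(term, sort = TRUE)

wordcloud(aimee_count$term, aimee_count$n, max.words = 200, colors = brewer.pal(8, "Dark2"))
########################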

 

The final step in this process is to score sentiment. I use examples from both the Bing and NRC sentiment lexicons.

### NRC lexicon to find positive words ####
nrcPositive <- get_sentiments("nrc") %>% filter(sentiment == "positive")
emery_Positive <- tidy_emery2 %>% inner_join(nrcPositive) %>% count(word, sort = TRUE)

sum(emery_Positive$n) # Count of all positive words = 1630
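### the same pattern works for any other NRC category, e.g. joy (a sketch; the names below are illustrative) ###
nrcJoy <- get_sentiments("nrc") %>% filter(sentiment == "joy")
emery_Joy <- tidy_emery2 %>% inner_join(nrcJoy) %>% count(word, sort = TRUE) # Emery's most frequent joy words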
### emery sentiment contributions Bing ###
bing <- get_sentiments("bing")

bing_word_countsE <- tidy_emery2 %>% inner_join(bing) %>% count(word, sentiment, sort = TRUE) %>% ungroup()

bing_word_countsE %>% filter(n > 10) %>% mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab("Contribution to sentiment") + ggtitle("Emery Sentiment Contributions") +
  theme(plot.title = element_text(hjust = 0.5))

The code above creates the Emery sentiment chart; the Aimée chart uses the same steps, sketched below the chart.

Emery Sentiment Contributions
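The Aimée chart shown in Part 1 comes from the same pipeline with the names swapped; a sketch (bing_word_countsA is a name introduced here for illustration):

bing_word_countsA <- tidy_aimee2 %>% inner_join(bing) %>% count(word, sentiment, sort = TRUE) %>% ungroup()

bing_word_countsA %>% filter(n > 10) %>% mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab("Contribution to sentiment") + ggtitle("Aimee Sentiment Contributions") +
  theme(plot.title = element_text(hjust = 0.5))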

 

##### unique words count, unique words per total words
sum(emery_count$n) # = 23742 total words,  5350 unique, = 0.225 unique rate
sum(aimee_count$n) # = 20925 total words,  4457 unique, = 0.213 unique rate
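# a sketch of where the unique counts above come from: each row of the count tables is one distinct term
nrow(emery_count) # unique words for Emery; divide by sum(emery_count$n) for the unique rate
nrow(aimee_count) # unique words for Aimee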

##### Summation of Bing Sentiment scores with usage > 10

Ecleansent <- bing_word_countsE %>% filter(n > 10) %>% mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n))

sum(Ecleansent$n) # = 1081 for Emery
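Aimée's Bing score in the table (1346) would come from the same filtering and summation applied to her counts; a sketch (Acleansent and bing_word_countsA are names used here for illustration):

Acleansent <- bing_word_countsA %>% filter(n > 10) %>% mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n))

sum(Acleansent$n) # Bing sentiment sum for Aimee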

 

This work was made possible by StackExchange and Tidy Text.

Feel free to use, edit, and share anything here.

WordPress messed up some of my code during copy-paste. I tried my best to fix it!