author: "Jonathan Lau"

1 Introduction

This report showcases the power of R for social media analysis - Twitter in this specific case. The analysis below is only cursory and was prepared over a few days. The example code focuses on generating insight for TEDxGlasgow and the topic of climate from Twitter data. The note below provides some context for reading this report.

NB

  • This document was prepared without full access to the Twitter API
  • It should only be used / read as a proof of concept and technical ability
  • This was cobbled together quickly and has not been reviewed
  • Any analysis or commentary is subject to change

I have applied to Twitter for a developer licence to see what data options I have as an individual. A higher tier of access would provide for more comprehensive and historical analysis.

2 Understanding Twitter data

This is an introduction to the power of Twitter data and what you can achieve using social media analysis.

The first step is accessing tweets via the Twitter API and then leveraging the power of the rtweet library. Next, we can use this extracted data for social media analysis.

2.1 Power of Twitter data

The volume and velocity of tweets posted on Twitter every second is an indicator of the power of Twitter data.

The enormous amount of information available, from the tweet text and its metadata, gives great scope for analyzing extracted tweets and deriving insights.

The following code extracts a 1% random sample of live tweets using stream_tweets() for a 10 second window and saves it into a data frame.

The dimensions of the data frame will give insights about the number of live tweets extracted and the number of columns that contain the actual tweeted text and metadata on the tweets.

# Load libraries
library(tidyverse)
library(httpuv)
library(rtweet)
# Extract live tweets for 10 seconds window
tweets10s <- stream_tweets("", timeout = 10)
## Streaming tweets for 10 seconds...
## Finished streaming tweets!
# View dimensions of the data frame with live tweets
dim(tweets10s)
## [1] 498  90

Comment

[Twitter allows the extraction of only a limited number of tweets with a free account.]

2.2 Search and extract tweets

Many functions are available in R to extract twitter data for analysis.

search_tweets() is a powerful function from rtweet which is used to extract tweets based on a search query.

The function returns a maximum of 18,000 tweets for each request posted.

# Extract tweets on "TEDxGlasgow" and include retweets
twts_tedx <- search_tweets("TEDxGlasgow", 
                 n = 18000, 
                 include_rts = TRUE, 
                 lang = "en")

# View tweets
twts_tedx %>% 
  relocate(text, screen_name)
# Extract tweets on "TEDxGlaClimate" and include retweets    
twts_tedx_climate <- search_tweets("TEDxGlaClimate", 
                 n = 18000, 
                 include_rts = TRUE, 
                 lang = "en")

# View tweets
twts_tedx_climate %>% 
  relocate(text, screen_name)

You can see various tweets posted by users.

2.3 Search and extract timelines

Similar to search_tweets(), get_timeline() is another function in the rtweet library that can be used to extract tweets.

The get_timeline() function is different from search_tweets(). It extracts tweets posted by a given user to their timeline instead of searching based on a query.

The get_timeline() function can extract up to 3,200 tweets at a time.

Tweets posted by TEDxGlasgow to their timeline are extracted below.

# Extract tweets posted by the user @TEDxGlasgow
get_TedX <- get_timeline("@TEDxGlasgow", n = 3200)

# View output
get_TedX

2.4 User interest and tweet counts

The metadata components of extracted twitter data can be analyzed to derive insights.

To identify twitter users who are interested in a topic, you can look at users who tweet often on that topic. The insights derived can be used to promote targeted events to interested users.

The code below identifies users who have tweeted often on the topic “TEDxGlasgow”.

# Create a table of users and tweet counts for the topic
sc_name <- table(get_TedX$screen_name)

# Sort the table in descending order of tweet counts
sc_name_sort <- sort(sc_name, decreasing = TRUE)

# View sorted table for top 10 users
head(sc_name_sort, 10)
## TEDxGlasgow 
##        3180

2.5 Compare follower count

The follower count of a Twitter account indicates the popularity of a personality or business entity and is a measure of influence on social media.

Knowing the follower counts helps digital marketers strategically position ads on popular twitter accounts for increased visibility.

The following code extracts user data and compares follower counts for the Twitter accounts of popular Scottish news sites.

# Extract user data for the twitter accounts of news sites and Darin O Lien for comparison
users <- lookup_users(c("DarinOlien", "ScotEntNews", "BBCScotlandNews", "STVNews", "heraldscotland", "Scotland", "VisitScotNews", "TheScotsman", "BBCRadioScot"))

# Create a data frame of screen names and follower counts
user_df <- users[,c("screen_name","followers_count")]

# Display and compare the follower counts for the accounts
user_df

2.6 Retweet counts

A retweet helps utilize existing content to build a following for your brand.

The number of times a tweet is retweeted indicates what is trending. These insights can be leveraged by promoting your brand through the popular retweets.

The code below identifies tweets on “TEDxGlaClimate” that have been retweeted the most.

# Create a data frame of tweet text and retweet count
rtwt <- twts_tedx_climate[,c("text", "retweet_count")]
head(rtwt)
# Sort data frame based on descending order of retweet counts
rtwt_sort <- arrange(rtwt, desc(retweet_count))

# Exclude rows with duplicate text from sorted data frame
# (base unique() has no "by" argument, so use duplicated() on the text column)
rtwt_unique <- rtwt_sort[!duplicated(rtwt_sort$text), ]

# Print top 6 unique posts retweeted most number of times
rownames(rtwt_unique) <- NULL
head(rtwt_unique)

3 Analyzing Twitter data

It’s time to go deeper: applying filters to tweets, and analysing Twitter users through the golden ratio and the Twitter lists they subscribe to. Then we can extract trending topics and analyse Twitter data over time to identify interesting insights.

3.1 Filtering for original tweets

An original tweet is an original posting by a Twitter user and is not a retweet, quote, or reply.

The “-filter” can be combined with a search query to exclude retweets, quotes, and replies during tweet extraction.

# Extract 5000 original tweets on "Climate"
tweets_org <- search_tweets("Climate -filter:retweets -filter:quote -filter:replies", n = 5000)

# Check for presence of replies
tweets_org %>% 
    count(reply_to_screen_name)
# Check for presence of quotes
tweets_org %>% 
    count(is_quote)
# Check for presence of retweets
tweets_org %>% 
    count(is_retweet)

For (just shy of) the 5000 tweets, the output of NA for reply_to_screen_name and FALSE for is_quote and is_retweet confirms that the filtered tweets are original posts and not replies, quotes, or retweets.

3.2 Filtering on tweet language

You can use the language filter with a search query to filter tweets based on the language of the tweet.

The filter extracts tweets that have been classified by Twitter as being of a particular language.

# Extract tweets on "Climate" in French
tweets_french <- search_tweets("Climate", lang = "fr")

# Display the tweets and language metadata
tweets_french %>% 
  select(text, lang)

3.3 Filter based on tweet popularity

Popular tweets are tweets that are retweeted and favourited several times.

They are useful in identifying current trends. A brand can promote its merchandise and build brand loyalty by identifying popular tweets and retweeting them.

The code below extracts tweets on “TEDx” that have been retweeted a minimum of 50 times and favorited by at least 50 users.

# Extract tweets with a minimum of 50 retweets and 50 favorites
tweets_pop <- search_tweets("TEDx min_retweets:50 AND min_faves:50")

# Create a data frame to check retweet and favorite counts
counts <- tweets_pop[c("retweet_count", "favorite_count")]
head(counts)
# View the tweets
head(tweets_pop$text)
## [1] "@NeuroClastic One autistic person I know:\nHas PhD, does public speaking, holds down academic job, did TEDx talk. \n\nAnother: \nCan't drive, forgets to shower, goes mute in social situations, had meltodowns everday for 5 yrs as teenager, struggles to clean. \n\nThey're both me. I am peaks &amp; troughs"    
## [2] "No voy a descansar hasta tener mi propia TEDx Talk"                                                                                                                                                                                                                                                                  
## [3] "God, this quote is so powerful I just can't share it enough...\n\n'Saying someone shouldn't be depressed because other people have it worse\n\nIs like saying someone shouldn't be happy because other people have it better...' Bristol Tedx talk\n\nNEVER invalidate someone's struggling..."                      
## [4] "\U0001f525Super excited to share my journey in advancing #SustainableFinance #SDGs #ESG at @TEDx Hong Kong, as part of @TEDTalks Global Countdown #JoinTheCountdown \n\nhttps://t.co/0iW44C0BXb\n\nSign up online to catch me at this exciting occasion\U0001f447\n\nhttps://t.co/WIoVXicZOE https://t.co/3dZYYN1nFE"
## [5] "Ma conférence #TEDx est en ligne !  \n\nDes anecdotes sur le métier de “foulologue”, des explications scientifiques et des conseils de survie dans une foule en panique... c'est par ici :\nhttps://t.co/ANRf7M2sFy\n\n #TEDtalks https://t.co/upCYi62kIE"

3.4 Extract user information

Analyzing twitter user data provides vital information which can be used to plan relevant promotional strategies.

User information contains data on the number of followers and friends of the twitter user.

The user information may have multiple instances of the same user as the user might have tweeted multiple times on a given subject. You need to take the mean values of the follower and friend counts in order to consider only one instance.

#TEDxGlasgow related

# Extract user information of people who have tweeted on TEDxGlasgow
user_cos <- users_data(twts_tedx)

# View few rows of user data
head(user_cos)
# Aggregate screen name, follower and friend counts
counts_df <- user_cos %>%
               group_by(screen_name) %>%
               summarise(follower = mean(followers_count, na.rm = TRUE),
                   friend = mean(friends_count, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
# View the output
counts_df

The screen names have been tabulated with their corresponding counts of followers and friends. In the next exercise, you will learn how to use this data to calculate the golden ratio.

3.5 Explore users based on the golden ratio

The ratio of the number of followers to the number of friends a user has is called the golden ratio.

This ratio is a useful metric for marketers when planning promotions.

# Calculate and store the golden ratio
counts_df$ratio <- counts_df$follower/counts_df$friend

# Sort the data frame in decreasing order of follower count
counts_sort <- arrange(counts_df, desc(follower))

# View the first few rows
head(counts_sort)
# Select rows where the follower count is greater than 50000
counts_sort[counts_sort$follower > 50000,]
# Select rows where the follower count is less than 1000
counts_sort[counts_sort$follower < 1000,]

Users with a high follower count should also have a high ratio. These users can act as a medium to promote a brand to a wide audience.

3.6 Subscribers to twitter lists

A twitter list is a curated group of twitter accounts.

Twitter users subscribe to lists that interest them. Collecting user information from Twitter lists could help brands promote products to interested delegates.

The code below extracts the lists that the “TEDxGlasgow” Twitter account subscribes to.

# Loading library
library(tidyverse)

# Extract all the lists "TEDxGlasgow" subscribes to
lst_TEDx <- lists_users("TEDxGlasgow")

lst_TEDx %>%
    arrange(desc(subscriber_count)) %>%
  head()
# Extract subscribers of the list with ID "9783131" and view the first 4 columns
list_TED_sub <- lists_subscribers("9783131", n = 500) %>%
    arrange(followers_count)

list_TED_sub[,1:4]
# Create a list of screen names from the subscribers list
users <- list_TED_sub$screen_name %>%
    head()

# Extract user information for the list and view the first 4 columns
users_TEDx_sub <- lookup_users(users)
users_TEDx_sub

You now have extracted user data of potential delegates to whom you can promote TEDxGlasgow.

3.9 Visualizing frequency of tweets

Visualizing the frequency of tweets over time helps in understanding the interest level in a product.

It would be interesting to check the interest level and recall for #ClimateCrisis by visualizing the frequency of tweets.

# Extract tweets on #ClimateCrisis and exclude retweets
ClimateCrisis_twts <- search_tweets("#ClimateCrisis", n = 18000, include_rts = FALSE)

# View the output
head(ClimateCrisis_twts)
# Create a time series plot
ts_plot(ClimateCrisis_twts, by = "hours", color = "blue")

Comment
ClimateCrisis appears to have a cyclical pattern over the days accessed - with peaks declining most recently.

3.10 Create time series objects

A time series object contains the aggregated frequency of tweets over a specified time interval.

Creating time series objects is the first step before visualizing tweet frequencies for comparison.

This code creates time series objects for two TED events for comparison.

# Create a time series object for TEDxGlasgow at hourly intervals
TEDxGlasgow_ts <- ts_data(twts_tedx, by = "hours")

# Rename the two columns in the time series object
names(TEDxGlasgow_ts) <- c("time", "TEDxGlasgow_n")

# View data
TEDxGlasgow_ts
# Create a time series object for TEDxGlaClimate at hourly intervals
TEDxGlaClimate_ts <- ts_data(twts_tedx_climate, by = "hours")

# Rename the two columns in the time series object
names(TEDxGlaClimate_ts) <- c("time", "TEDxGlaClimate_n")

# View data
TEDxGlaClimate_ts
# # Get TEDsummit2019 data for comparison
# twts_tedsummit2019 <- search_tweets("#TEDSummit2019", 
#                  n = 18000, 
#                  include_rts = TRUE, 
#                  lang = "en")
# 
# 
# # Create a time series object for TEDSummit2019 at hourly intervals
# tedsummit2019_ts <- ts_data(twts_tedsummit2019, by = "hours")
# 
# # Rename the two columns in the time series object
# names(tedsummit2019_ts) <- c("time", "tedsummit_n")
# 
# # View data
# tedsummit2019_ts

Time series objects aggregate tweet frequencies over time. They are useful for creating time series plots for comparison.

3.11 Compare tweet frequencies for two tags

The volume of tweets posted for a product is a strong indicator of its brand salience. Let’s compare “brand” salience for two related tags, #TEDxGlasgow and #TEDxGlaClimate.

library(reshape)
# Merge the two time series objects and retain "time" column
merged_df <- merge(TEDxGlasgow_ts, TEDxGlaClimate_ts, by = "time", all = TRUE)
head(merged_df)
# Stack the tweet frequency columns
melt_df <- melt(merged_df, na.rm = TRUE, id.vars = "time")

# View the output
head(melt_df)
# Plot frequency of tweets on TEDxGlasgow and TEDxGlaClimate
ggplot(data = melt_df, aes(x = time, y = value, col = variable)) +
  geom_line(lwd = 0.8)

Comments
For the data accessed (limited by using a free account), #TEDxGlaClimate seems to have somewhat different activity from the main Twitter handle, apart from the peak around Oct 10th.

4 Visualize Tweet texts

A picture is worth a thousand words! The following code explores how you can visualize text from tweets using bar plots and word clouds. Tweet text will be processed to prepare a clean text corpus for analysis. Imagine being able to extract key discussion topics and people’s perceptions about a subject or brand from the tweets they are sharing. This is possible using topic modeling and sentiment analysis.

4.1 Remove URLs and characters other than letters

Tweet text posted by twitter users is unstructured, noisy, and raw.

It contains emoticons, URLs, and numbers. This redundant information has to be cleaned before analysis in order to yield reliable results.

The code below removes URLs and replaces characters other than letters with spaces in #ClimateCrisis tweets.

# Loading Regex library
library(qdapRegex)

# Extract tweet text from #ClimateCrisis dataset
twt_txt <- ClimateCrisis_twts$text
head(twt_txt)
## [1] "Graywater #Recycling Is An Excellent Strategy For #Drought #climate #drought #climatechange #climatecrisis https://t.co/VBSzrBqTlF https://t.co/N4XbblvFVG"      
## [2] "How #Israel Became A Leader In #Water Use In The #MiddleEast #climate #drought #climatechange #climatecrisis https://t.co/WFqhIWGLVJ https://t.co/J1ZSVYKlQ2"    
## [3] "Why Focusing On Cutting #Emissions Alone Won't Halt #EcologicalDecline #climatechange #climatecrisis #climate... https://t.co/j3Yb6KvecB https://t.co/bxcFAPNazt"
## [4] "How #Israel Became A Leader In #Water Use In The #MiddleEast #climate #drought #climatechange #climatecrisis https://t.co/WFqhIWGLVJ https://t.co/KgdcoQ85BK"    
## [5] "How #Israel Became A Leader In #Water Use In The #MiddleEast #climate #drought #climatechange #climatecrisis https://t.co/WFqhIWGLVJ https://t.co/bAhpeoDVrc"    
## [6] "Can We #Terraform the Sahara to Stop Climate Change? #climatecrisis #climate https://t.co/ie8wSnERIV https://t.co/m27zgbMtFZ"
# Remove URLs from the tweet text and view the output
twt_txt_url <- rm_twitter_url(twt_txt)
head(twt_txt_url)
## [1] "Graywater #Recycling Is An Excellent Strategy For #Drought #climate #drought #climatechange #climatecrisis"      
## [2] "How #Israel Became A Leader In #Water Use In The #MiddleEast #climate #drought #climatechange #climatecrisis"    
## [3] "Why Focusing On Cutting #Emissions Alone Won't Halt #EcologicalDecline #climatechange #climatecrisis #climate..."
## [4] "How #Israel Became A Leader In #Water Use In The #MiddleEast #climate #drought #climatechange #climatecrisis"    
## [5] "How #Israel Became A Leader In #Water Use In The #MiddleEast #climate #drought #climatechange #climatecrisis"    
## [6] "Can We #Terraform the Sahara to Stop Climate Change? #climatecrisis #climate"
# Replace special characters, punctuation, & numbers with spaces
twt_txt_chrs  <- gsub("[^A-Za-z]"," " , twt_txt_url)

# View text after replacing special characters, punctuation, & numbers
head(twt_txt_chrs)
## [1] "Graywater  Recycling Is An Excellent Strategy For  Drought  climate  drought  climatechange  climatecrisis"      
## [2] "How  Israel Became A Leader In  Water Use In The  MiddleEast  climate  drought  climatechange  climatecrisis"    
## [3] "Why Focusing On Cutting  Emissions Alone Won t Halt  EcologicalDecline  climatechange  climatecrisis  climate   "
## [4] "How  Israel Became A Leader In  Water Use In The  MiddleEast  climate  drought  climatechange  climatecrisis"    
## [5] "How  Israel Became A Leader In  Water Use In The  MiddleEast  climate  drought  climatechange  climatecrisis"    
## [6] "Can We  Terraform the Sahara to Stop Climate Change   climatecrisis  climate"

The URLs have been removed and special characters, punctuation, & numbers have been replaced with additional spaces in the text.

4.2 Build a corpus and convert to lowercase

A corpus is a list of text documents. You have to convert the tweet text into a corpus to facilitate subsequent steps in text processing.

When analyzing text, you want to ensure that a word is not counted as two different words because the case is different in the two instances. Hence, you need to convert text to lowercase.

The code below creates a text corpus and converts all characters to lowercase.

# Loading text mining library
library(tm)

# Convert text in "twt_txt_chrs" to a text corpus and view output
twt_corpus <- twt_txt_chrs %>% 
                VectorSource() %>% 
                Corpus() 
head(twt_corpus$content)
## [1] "Graywater  Recycling Is An Excellent Strategy For  Drought  climate  drought  climatechange  climatecrisis"      
## [2] "How  Israel Became A Leader In  Water Use In The  MiddleEast  climate  drought  climatechange  climatecrisis"    
## [3] "Why Focusing On Cutting  Emissions Alone Won t Halt  EcologicalDecline  climatechange  climatecrisis  climate   "
## [4] "How  Israel Became A Leader In  Water Use In The  MiddleEast  climate  drought  climatechange  climatecrisis"    
## [5] "How  Israel Became A Leader In  Water Use In The  MiddleEast  climate  drought  climatechange  climatecrisis"    
## [6] "Can We  Terraform the Sahara to Stop Climate Change   climatecrisis  climate"
# Convert the corpus to lowercase
twt_corpus_lwr <- tm_map(twt_corpus, tolower) 
## Warning in tm_map.SimpleCorpus(twt_corpus, tolower): transformation drops
## documents
# View the corpus after converting to lowercase
head(twt_corpus_lwr$content)
## [1] "graywater  recycling is an excellent strategy for  drought  climate  drought  climatechange  climatecrisis"      
## [2] "how  israel became a leader in  water use in the  middleeast  climate  drought  climatechange  climatecrisis"    
## [3] "why focusing on cutting  emissions alone won t halt  ecologicaldecline  climatechange  climatecrisis  climate   "
## [4] "how  israel became a leader in  water use in the  middleeast  climate  drought  climatechange  climatecrisis"    
## [5] "how  israel became a leader in  water use in the  middleeast  climate  drought  climatechange  climatecrisis"    
## [6] "can we  terraform the sahara to stop climate change   climatecrisis  climate"

The corpus has been built from the tweet text, and the characters in the corpus have been converted to lowercase.

4.3 Remove stop words and additional spaces

The text corpus usually has many common words like a, an, the, of, and but. These are called stop words.

Stop words are usually removed during text processing so one can focus on the important words in the corpus to derive insights.

Also, the additional spaces created during the removal of special characters, punctuation, numbers, and stop words need to be removed from the corpus.

# Remove English stop words from the corpus using SMART dictionary and view the corpus
twt_corpus_stpwd <- tm_map(twt_corpus_lwr, removeWords, stopwords("smart"))
## Warning in tm_map.SimpleCorpus(twt_corpus_lwr, removeWords, stopwords("smart")):
## transformation drops documents
head(twt_corpus_stpwd$content)
## [1] "graywater  recycling   excellent strategy   drought  climate  drought  climatechange  climatecrisis"  
## [2] "  israel   leader   water     middleeast  climate  drought  climatechange  climatecrisis"             
## [3] " focusing  cutting  emissions  won  halt  ecologicaldecline  climatechange  climatecrisis  climate   "
## [4] "  israel   leader   water     middleeast  climate  drought  climatechange  climatecrisis"             
## [5] "  israel   leader   water     middleeast  climate  drought  climatechange  climatecrisis"             
## [6] "   terraform  sahara  stop climate change   climatecrisis  climate"
# Remove additional spaces from the corpus
twt_corpus_spaces <- tm_map(twt_corpus_stpwd, stripWhitespace)
## Warning in tm_map.SimpleCorpus(twt_corpus_stpwd, stripWhitespace):
## transformation drops documents
# View the text corpus after removing spaces
head(twt_corpus_spaces$content)
## [1] "graywater recycling excellent strategy drought climate drought climatechange climatecrisis" 
## [2] " israel leader water middleeast climate drought climatechange climatecrisis"                
## [3] " focusing cutting emissions won halt ecologicaldecline climatechange climatecrisis climate "
## [4] " israel leader water middleeast climate drought climatechange climatecrisis"                
## [5] " israel leader water middleeast climate drought climatechange climatecrisis"                
## [6] " terraform sahara stop climate change climatecrisis climate"

You can see some of the common stop words and all the additional spaces removed in the output.

4.4 Removing custom stop words

Popular terms in a text corpus can be visualized using bar plots or word clouds.

However, it is important to remove (custom) stop words present in the corpus first before using the visualization tools.

The code below will check the term frequencies and remove (custom) stop words from the text corpus created for “ClimateCrisis”.

# Loading library for text analysis
library(qdap)

# Extract term frequencies for top 60 words and view output
termfreq  <-  freq_terms(twt_corpus_spaces, 60)
termfreq
# Create a vector of custom stop words
custom_stopwds <- c("climatecrisis", "amp", "ve", "don")

# Remove custom stop words and create a refined corpus
corp_refined <- tm_map(twt_corpus_spaces, removeWords, custom_stopwds) 
## Warning in tm_map.SimpleCorpus(twt_corpus_spaces, removeWords, custom_stopwds):
## transformation drops documents
# Extract term frequencies for the top 20 words
termfreq_clean <- freq_terms(corp_refined, 20)
termfreq_clean

You can see that the corpus has only the relevant and important terms after the stop words are removed. Let’s use this refined corpus to create visualizations in the next exercise.

4.6 Word clouds for visualization

A word cloud is an image made up of words in which the size of each word indicates its frequency.

It is an effective promotional image for marketing campaigns.

The code below will create word clouds using the words in a text corpus.

library(RColorBrewer)
library(wordcloud)

# Create word cloud with 6 colors and max 50 words
wordcloud(corp_refined, max.words = 50, 
    colors = brewer.pal(6, "Dark2"), 
    scale=c(4,1), random.order = FALSE)
## Warning in wordcloud(corp_refined, max.words = 50, colors = brewer.pal(6, :
## climateaction could not be fit on page. It will not be plotted.
## Warning in wordcloud(corp_refined, max.words = 50, colors = brewer.pal(6, :
## facetheclimateemergency could not be fit on page. It will not be plotted.

Comment
You can see that popular terms like climate and wedonthavetime are in large font sizes and positioned at the center of the word cloud to highlight their relevance and importance.

4.7 The LDA algorithm

The Latent Dirichlet Allocation (LDA) algorithm is used for topic modeling.

The document term matrix and the number of topics are input into the LDA() function.
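As a minimal, self-contained sketch of that interface (the tiny synthetic documents here are invented purely for illustration, not taken from the report's data):

```r
# Toy illustration of the LDA() interface (synthetic text, not report data)
library(tm)
library(topicmodels)

docs <- c("climate change energy", "energy policy emissions",
          "football match goals", "goals match league")

# Build a small document term matrix from the toy corpus
dtm_toy <- DocumentTermMatrix(Corpus(VectorSource(docs)))

# k is the number of topics to discover; a fixed seed makes the fit reproducible
lda_toy <- LDA(dtm_toy, k = 2, control = list(seed = 1234))

# Extract the top 3 terms in each discovered topic
terms(lda_toy, 3)
```

The same two-step pattern, DTM in, top terms out, is applied to the real #ClimateCrisis corpus in the sections that follow.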

4.8 Create a document term matrix

The document term matrix or DTM is a matrix representation of a corpus.

Creating the DTM from the text corpus is the first step towards building a topic model.

# Create a document term matrix (DTM) for #ClimateCrisis
dtm_ClimateCrisis <- DocumentTermMatrix(corp_refined)
dtm_ClimateCrisis
## <<DocumentTermMatrix (documents: 8492, terms: 20415)>>
## Non-/sparse entries: 110885/173253295
## Sparsity           : 100%
## Maximal term length: 43
## Weighting          : term frequency (tf)
# Find the sum of word counts in each document
rowTotals <- apply(dtm_ClimateCrisis, 1, sum)
head(rowTotals)
## 1 2 3 4 5 6 
## 8 7 8 7 7 6
# Select rows with a row total greater than zero
dtm_ClimateCrisis_new <- dtm_ClimateCrisis[rowTotals > 0, ]
dtm_ClimateCrisis_new
## <<DocumentTermMatrix (documents: 8374, terms: 20415)>>
## Non-/sparse entries: 110885/170844325
## Sparsity           : 100%
## Maximal term length: 43
## Weighting          : term frequency (tf)

Comment
You can see that the final DTM has 8,374 documents and 20,415 terms after empty rows are removed. The code below will use this DTM to perform topic modeling.

4.9 Create a topic model

Topic modeling is the task of automatically discovering topics from a vast amount of text.

You can create topic models from the tweet text to quickly summarize the vast information available into distinct topics and gain insights.

The code below will extract distinct topics from tweets on #ClimateCrisis.

# Load libraries
library(topicmodels)
# Create a topic model with 5 topics
topicmodl_5 <- LDA(dtm_ClimateCrisis_new, k = 5)

# Select and view the top 10 terms in the topic model
top_10terms <- terms(topicmodl_5, 10)
top_10terms 
##       Topic 1         Topic 2            Topic 3                  
##  [1,] "climate"       "climate"          "climatechange"          
##  [2,] "climatechange" "climatechange"    "climateaction"          
##  [3,] "trump"         "change"           "climateemergency"       
##  [4,] "vote"          "climateaction"    "gretathunberg"          
##  [5,] "climateaction" "arctic"           "climate"                
##  [6,] "planet"        "world"            "climatechangeisreal"    
##  [7,] "change"        "global"           "actonclimate"           
##  [8,] "green"         "energy"           "facetheclimateemergency"
##  [9,] "covid"         "climateemergency" "climateactionnow"       
## [10,] "people"        "action"           "climatejustice"         
##       Topic 4         Topic 5             
##  [1,] "climate"       "futureofcap"       
##  [2,] "food"          "eppgroup"          
##  [3,] "million"       "reneweurope"       
##  [4,] "climatechange" "theprogressives"   
##  [5,] "year"          "care"              
##  [6,] "fueling"       "extinction"        
##  [7,] "responsible"   "biodiversitycrisis"
##  [8,] "canadian"      "policy"            
##  [9,] "wasted"        "late"              
## [10,] "tonnes"        "make"

Comment
By comparison with TEDxGlaClimate, terms such as sustainability, carbon, energy, water, and global warming aren’t in the top topics here.

4.10 Extract sentiment scores

Sentiment analysis is useful in social media monitoring since it gives an overview of people’s sentiments.

Climate change is a widely discussed topic for which the perceptions range from being a severe threat to nothing but a hoax.

The code below will perform sentiment analysis and extract the sentiment scores for tweets on “Climate change”.

These can be used to plot and analyze how the collective sentiment varies among people.

library(syuzhet)

# Perform sentiment analysis for tweets on `ClimateCrisis` 
sa.value <- get_nrc_sentiment(ClimateCrisis_twts$text)
## Warning: `filter_()` is deprecated as of dplyr 0.7.0.
## Please use `filter()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: `data_frame()` is deprecated as of tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
# View the sentiment scores
head(sa.value, 10)

Next, these scores are used to plot the sentiments and analyze the results.

4.11 Perform sentiment analysis

Can we plot and analyze the most prevalent sentiments among people and see how the collective sentiment varies for #ClimateCrisis?

# Calculate sum of sentiment scores
score <- colSums(sa.value[,])

# Convert the sum of scores to a data frame
score_df <- data.frame(score)

# Convert row names into 'sentiment' column and combine with sentiment scores
score_df2 <- cbind(sentiment = row.names(score_df),  
                  score_df, row.names = NULL)
print(score_df2)
##       sentiment score
## 1         anger  3542
## 2  anticipation  4164
## 3       disgust  2003
## 4          fear  4663
## 5           joy  3208
## 6       sadness  3184
## 7      surprise  2398
## 8         trust  6208
## 9      negative  6929
## 10     positive  9565
# Plot the sentiment scores
ggplot(data = score_df2, aes(x = sentiment, y = score, fill = sentiment)) +
     geom_bar(stat = "identity") +
       theme(axis.text.x = element_text(angle = 45, hjust = 1))

Comment

For #ClimateCrisis, it is interesting to see that positive sentiments collectively outnumber the negative ones. Trust, anticipation and fear are notable too.

5 Network analysis

Twitter users tweet, like, follow, and retweet, creating complex network structures. We can analyse these structures and visualise the relationships between individual users as a retweet network. By extracting geolocation data from tweets, we can also display tweet locations on a map and answer questions such as which regions are talking about a brand the most. Geographic data adds a new dimension to Twitter data analysis.

5.1 Preparing data for a retweet network

A retweet network is a network of Twitter users who retweet tweets posted by other users.

People who retweet on #TEDxGlasgow are potential amplifiers for messages about upcoming events.

For starters, the following code will prepare the tweet data on #TEDxGlasgow for creating a retweet network.

# Extract source vertex and target vertex from the tweet data frame
rtwt_df <- twts_tedx[, c("screen_name" , "retweet_screen_name" )]

# View the data frame
head(rtwt_df)
# Remove rows with missing values
rtwt_df_new <- rtwt_df[complete.cases(rtwt_df), ]

# Create a matrix
rtwt_matrx <- as.matrix(rtwt_df_new)
head(rtwt_matrx)
##      screen_name   retweet_screen_name
## [1,] "JMcKWriter"  "TEDxGlasgow"      
## [2,] "jo_lynn13"   "TEDxGlasgow"      
## [3,] "TEDxGlasgow" "WeDontHaveTime"   
## [4,] "TEDxGlasgow" "jo_lynn13"        
## [5,] "zebunsia"    "TEDxGlasgow"      
## [6,] "zebunsia"    "CountUsInSOCIAL"

5.2 Create a retweet network

The core step in network analysis is creating a network object, such as a retweet network, which makes it possible to analyse the relationships between nodes.

Understanding the position of potential customers in a retweet network allows a brand to identify key players who are likely to retweet posts and spread its messaging.

library(igraph)
# Convert the matrix to a retweet network
nw_rtweet <- graph_from_edgelist(el = rtwt_matrx, directed = TRUE)

# View the retweet network
print.igraph(nw_rtweet)
## IGRAPH a8573cc DN-- 13 12 -- 
## + attr: name (v/c)
## + edges from a8573cc (vertex names):
##  [1] JMcKWriter     ->TEDxGlasgow     jo_lynn13      ->TEDxGlasgow    
##  [3] TEDxGlasgow    ->WeDontHaveTime  TEDxGlasgow    ->jo_lynn13      
##  [5] zebunsia       ->TEDxGlasgow     zebunsia       ->CountUsInSOCIAL
##  [7] zebunsia       ->bryonygeorge    zebunsia       ->TEDxGlasgow    
##  [9] lovelypurplebee->TEDxGlasgow     Kjsalgado1967  ->WeDontHaveTime 
## [11] IWillRoam      ->JamieACooke     Michaela_v89   ->ScotEntNews

Comment
The output shows a directed network (DN) of 13 vertices and 12 edges, with each edge running from the retweeting user (source) to the user whose tweet was retweeted (target).
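
The network size can also be checked programmatically; a quick sketch using the `nw_rtweet` object created above:

```r
# Count vertices (users) and directed edges (retweets) in the network
vcount(nw_rtweet)  # 13 users
ecount(nw_rtweet)  # 12 retweet edges
```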

5.3 Network centrality measures

  • Influence of a vertex is measured by the number of edges and its position
  • Network centrality is the measure of importance of a vertex in a network
  • Network centrality measures assign a numerical value to each vertex
  • Value is a measure of a vertex’s influence on other vertices
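
To illustrate how these measures differ, consider a minimal toy graph with hypothetical users A, B and C (unrelated to the Twitter data), where A retweets B and B retweets C:

```r
library(igraph)

# Toy retweet network: A -> B, B -> C
toy <- graph_from_edgelist(
  matrix(c("A", "B",
           "B", "C"), ncol = 2, byrow = TRUE),
  directed = TRUE)

degree(toy, mode = "out")          # A and B each retweet once
degree(toy, mode = "in")           # B and C are each retweeted once
betweenness(toy, directed = TRUE)  # only B bridges A and C
```

Here B scores on all three measures, which is why the measures are usually read together rather than in isolation.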

5.4 Calculate out-degree scores

In a retweet network, the out-degree of a user indicates the number of times the user retweets posts.

Users with high out-degree scores are key players who can amplify promotional posts through their retweets.

# Calculate out-degree scores from the retweet network
out_degree <- degree(nw_rtweet, mode = c("out"))

# Sort the out-degree scores in decreasing order
out_degree_sort <- sort(out_degree, decreasing = TRUE)

# View users with the top 10 out-degree scores
out_degree_sort[1:10]
##        zebunsia     TEDxGlasgow      JMcKWriter       jo_lynn13 lovelypurplebee 
##               4               2               1               1               1 
##   Kjsalgado1967       IWillRoam    Michaela_v89  WeDontHaveTime CountUsInSOCIAL 
##               1               1               1               0               0

Comment
You now have 10 users who can be key players to promote posts for a conference through their retweets.

5.5 Compute the in-degree scores

In a retweet network, the in-degree of a user indicates the number of times the user’s posts are retweeted.

Users with high in-degrees are influential as their tweets are retweeted many times.

# Compute the in-degree scores from the retweet network
in_degree <- degree(nw_rtweet, mode = c("in"))

# Sort the in-degree scores in decreasing order
in_degree_sort <- sort(in_degree, decreasing = TRUE)

# View users with the top 10 in-degree scores
in_degree_sort[1:10]
##     TEDxGlasgow  WeDontHaveTime       jo_lynn13 CountUsInSOCIAL    bryonygeorge 
##               5               2               1               1               1 
##     JamieACooke     ScotEntNews      JMcKWriter        zebunsia lovelypurplebee 
##               1               1               0               0               0

You have identified 10 influential users who can be used to initiate branding messages for TEDxGlasgow.

5.6 Calculate the betweenness scores

Betweenness centrality measures how often a vertex lies on the shortest paths between other vertices in the network.

In a retweet network, a user with a high betweenness centrality score would have more control over the network because more information will pass through the user.

The code below identifies users who are central to people who retweet the most and those whose tweets are retweeted frequently.

# Calculate the betweenness scores from the retweet network
betwn_nw <- betweenness(nw_rtweet, directed = TRUE)

# Sort betweenness scores in decreasing order and round the values
betwn_nw_sort <- betwn_nw %>%
                    sort(decreasing = TRUE) %>%
                    round()

# View users with the top 10 betweenness scores 
betwn_nw_sort[1:10]
##     TEDxGlasgow      JMcKWriter       jo_lynn13  WeDontHaveTime        zebunsia 
##               7               0               0               0               0 
## CountUsInSOCIAL    bryonygeorge lovelypurplebee   Kjsalgado1967       IWillRoam 
##               0               0               0               0               0

Comment
These are critical users who act as bridges for information flow between users who initiate brand messaging and users who promote such posts through their retweets.
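
As a sketch, the three measures computed above can be combined into a single ranking table (assuming the `out_degree`, `in_degree` and `betwn_nw` objects from the previous steps):

```r
# Combine the centrality measures into one data frame for comparison
centrality_df <- data.frame(
  user        = names(out_degree),
  out_degree  = out_degree,
  in_degree   = in_degree[names(out_degree)],
  betweenness = betwn_nw[names(out_degree)],
  row.names   = NULL)

# Rank users by betweenness, then by in-degree
centrality_df[order(-centrality_df$betweenness, -centrality_df$in_degree), ]
```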

5.7 Create a network plot with attributes

Visualising Twitter networks makes complex network structures easier to understand.

You can format a plot to enhance its readability and visual appeal.

The code below visualises a retweet network on #TEDxGlasgow.

# Create a basic network plot
plot.igraph(nw_rtweet)

# Create a network plot with formatting attributes
plot(nw_rtweet, asp = 9/12, 
     vertex.size = 10,
       vertex.color = "green", 
     edge.arrow.size = 0.5,
     edge.color = "black",
     vertex.label.cex = 0.9,
     vertex.label.color = "black")

Comment

We have an interesting network plot with multiple groups of users who tweet and retweet on #TEDxGlasgow.

5.8 Network plot based on centrality measure

It is more meaningful if the vertex size in the plot is proportional to the number of times the user retweets.

The code below adds attributes so that the vertex size indicates the number of times each user retweets.

# Create a variable for out-degree
deg_out <- degree(nw_rtweet, mode = c("out"))
deg_out
##      JMcKWriter     TEDxGlasgow       jo_lynn13  WeDontHaveTime        zebunsia 
##               1               2               1               0               4 
## CountUsInSOCIAL    bryonygeorge lovelypurplebee   Kjsalgado1967       IWillRoam 
##               0               0               1               1               1 
##     JamieACooke    Michaela_v89     ScotEntNews 
##               0               1               0
# Amplify the out-degree values
vert_size <- (deg_out * 3) + 5

# Set vertex size to amplified out-degree values
plot(nw_rtweet, asp = 10/11, 
     vertex.size = vert_size, vertex.color = "lightblue",
     edge.arrow.size = 0.5,
     edge.color = "grey",
     vertex.label.cex = 0.8,
     vertex.label.color = "black")

The vertex size in the plot is now proportionate to the out-degree. Vertices with bigger circles are the users who retweet more.

5.9 Follower count to enhance the network plot

The users who retweet most will add more value if they have a high follower count as their retweets will reach a wider audience.

In a network plot, the combination of vertex size indicating the number of retweets by a user and vertex color indicating a high follower count provides clear insights on the most influential users who can promote a brand.

The code below creates a plot showing the most influential users.

# Create a column and categorize follower counts above and below 500
user_cos$follow <- ifelse(user_cos$followers_count > 500, "1", "0")
head(user_cos)
# Assign the new column as vertex attribute to the retweet network
V(nw_rtweet)$followers <- user_cos$follow
## Warning in vattrs[[name]][index] <- value: number of items to replace is not a
## multiple of replacement length
vertex_attr(nw_rtweet)
## $name
##  [1] "JMcKWriter"      "TEDxGlasgow"     "jo_lynn13"       "WeDontHaveTime" 
##  [5] "zebunsia"        "CountUsInSOCIAL" "bryonygeorge"    "lovelypurplebee"
##  [9] "Kjsalgado1967"   "IWillRoam"       "JamieACooke"     "Michaela_v89"   
## [13] "ScotEntNews"    
## 
## $followers
##  [1] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
# Set the vertex colors based on follower count and create a plot
sub_color <- c("lightgreen", "tomato")
plot(nw_rtweet, asp = 9/12,
     vertex.size = vert_size, edge.arrow.size = 0.5,
     vertex.label.cex = 0.8,
     vertex.color = sub_color[as.factor(vertex_attr(nw_rtweet, "followers"))],
     vertex.label.color = "black", vertex.frame.color = "grey")

This shows the most influential users in the network: larger vertices retweet the most, and vertex colour flags users with more than 500 followers. In this run every user fell into the high-follower category, so all vertices share one colour.
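
A legend would make the colour coding explicit; a minimal sketch, assuming both follower categories are present so that factor levels "0" and "1" map to "lightgreen" and "tomato" respectively:

```r
# Add a legend to the active plot mapping colour to follower category
legend("topright",
       legend = c("<= 500 followers", "> 500 followers"),
       pt.bg  = c("lightgreen", "tomato"),
       pch = 21, pt.cex = 1.5, cex = 0.8, bty = "n")
```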

5.10 Extract geolocation coordinates

Analysing the geolocation of tweets helps influence customers with targeted marketing.

The first step in analyzing geolocation data using maps is to extract the available geolocation coordinates.

# Extract geo-coordinates data to append as new columns
cc_coord <- lat_lng(ClimateCrisis_twts)

# View the columns with geo-coordinates for first 20 tweets
head(cc_coord[c("lat","lng")], 20)

Comment
The output shows NA values for the first 20 rows, as these tweets did not include geolocation data.
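
The share of tweets carrying coordinates can be quantified directly; a quick check on the `cc_coord` data frame from above:

```r
# Number and proportion of tweets with geolocation data
sum(!is.na(cc_coord$lat))
mean(!is.na(cc_coord$lat))
```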

5.11 Twitter data on the map

It will be interesting to visualise tweets on #ClimateCrisis on a map to see the regions where they are tweeted most. A brand could target people in these regions with its marketing.

Remember not all tweets will have the geolocation data as this is an optional input for the users.

library(maps)
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
# Omit rows with missing geo-coordinates in the data frame
cc_geo <- na.omit(cc_coord[,c("lat", "lng")])

# View the output
head(cc_geo)
# Plot longitude and latitude values of tweets on the US state map
map(database = "state", fill = TRUE, col = "light yellow")
with(cc_geo, points(lng, lat, pch = 20, cex = 1, col = 'blue'))

# Plot longitude and latitude values of tweets on the world map
map(database = "world", fill = TRUE, col = "light yellow")
with(cc_geo, points(lng, lat, pch = 20, cex = 1, col = 'blue'))

Comment
#ClimateCrisis appears to be used most in Europe, the US East Coast and India.