Topical Analysis of Twitter User Clusters

This project was the culminating work of my Master's degree. It was influenced by the RADII (Reticular Analysis of Discourse for Influence Indicators) project, which was active in my lab at the time.

It involved creating artificial dialogues from clusters of users in a Twitter dataset on the 2017 French election. To create the clusters, I used the Leiden clustering algorithm after comparing it to several other clustering algorithms. I then performed topical analysis using the SCIL toolset, which I created previously, to identify meso-topics (the most prevalent topics in a dialogue). I also analyzed the hashtags used in the tweets to give further insight into what each cluster was about.
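The hashtag step can be illustrated with a short sketch. This is not the code used in the project: the function name, the (user, text) input format, the lowercasing of hashtags, and the toy data are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumption: a hashtag is '#' followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")

def top_hashtags_by_cluster(tweets, user_cluster, k=3):
    """Count hashtags per cluster and return the k most common for each.

    tweets: iterable of (user, text) pairs (hypothetical input format).
    user_cluster: dict mapping user -> cluster id, e.g. from a clustering run.
    """
    counts = defaultdict(Counter)
    for user, text in tweets:
        cluster = user_cluster.get(user)
        if cluster is None:
            continue  # skip users that were not assigned to any cluster
        # Lowercase so that '#Election' and '#election' are counted together.
        counts[cluster].update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return {c: counter.most_common(k) for c, counter in counts.items()}

# Invented toy data, not taken from the dataset.
tweets = [
    ("alice", "Watching tonight #Election #debate"),
    ("bob", "#election results soon"),
    ("carol", "Rally this weekend #campaign"),
    ("dave", "no cluster for this user #ignored"),
]
user_cluster = {"alice": 0, "bob": 0, "carol": 1}
top = top_hashtags_by_cluster(tweets, user_cluster)
```

Here `top[0]` lists the most frequent hashtags for cluster 0, which gives a quick label for what the users in that cluster were discussing.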

The dataset:

#Tweets    #Users
674309     59144

Tweets by type (the three counts sum to the 674309 total):

#Retweets  #Replies  #Other tweets
504526     116676    53107

Retweets were used to create a directed graph (digraph) of Twitter users. User A had an edge to user B if A retweeted B, and the weight of that edge was the number of times A retweeted B. Some users in the graph were located outside of the dataset, meaning that minimal information was available on their tweets. They were included deliberately: because every tweet used from the dataset was a retweet, the content of each retweeted tweet was known, even when the original tweet was outside of the dataset.
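The graph construction can be sketched as follows. This is a minimal illustration rather than the project's code: the function name, the (retweeter, retweeted) pair format, and the toy data are assumptions.

```python
from collections import Counter

def build_retweet_graph(retweets):
    """Build a weighted digraph from retweet events.

    retweets: iterable of (retweeter, retweeted_user) pairs, one per retweet
    (hypothetical input format).
    Returns (nodes, edges), where edges maps (A, B) -> number of times A retweeted B.
    """
    edges = Counter((a, b) for a, b in retweets)
    # A node is any user who retweeted or was retweeted, so retweeted users
    # that are not otherwise present in the dataset still appear in the graph.
    nodes = {user for edge in edges for user in edge}
    return nodes, dict(edges)

# Invented toy data: 'a' retweets 'b' twice, so edge (a, b) has weight 2.
retweets = [("a", "b"), ("a", "b"), ("c", "b"), ("b", "d")]
nodes, edges = build_retweet_graph(retweets)
total_weight = sum(edges.values())
```

Because each retweet adds exactly one unit of weight, the total edge weight equals the number of retweets. This matches the dataset, where the weighted edge total and the retweet count are both 504526.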

The graph:

#Nodes   Overlap With Dataset   #Edges   Weighted
65517    46683                  222231   504526

The idea behind this project was that people would be more likely to retweet people who share their opinions. Using the graph structure created by the retweets, we can cluster users to understand more about how influence is used and propagated through social networks.

As many tweets were in French, the tweets were translated into English using the Helsinki-NLP opus-mt-fr-en model with Hugging Face Transformers. The work was presented at the Fall 2022 Master's Project postering session at Rensselaer Polytechnic Institute in front of fellow graduate students and professors. The visualization in the poster (linked below) was made using the ForceAtlas 2 algorithm within Gephi on a subset of the data (nodes with a weighted degree of at least 50 belonging to the 10 largest clusters found using the Leiden clustering algorithm).
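The subset selection for the visualization can be sketched as below. This is an illustration under stated assumptions, not the procedure actually run in Gephi: the function and parameter names are hypothetical, and weighted degree is taken here to be the sum of incoming and outgoing edge weights.

```python
from collections import Counter, defaultdict

def visualization_subset(edges, membership, min_weighted_degree=50, n_clusters=10):
    """Keep nodes with a high weighted degree that belong to the largest clusters.

    edges: dict mapping (source, target) -> weight.
    membership: dict mapping node -> cluster id.
    Assumption: weighted degree = incoming weight + outgoing weight.
    """
    degree = defaultdict(int)
    for (src, dst), weight in edges.items():
        degree[src] += weight
        degree[dst] += weight

    # Identify the n_clusters clusters with the most members.
    sizes = Counter(membership.values())
    largest = {cluster for cluster, _ in sizes.most_common(n_clusters)}

    keep = {
        node
        for node, cluster in membership.items()
        if cluster in largest and degree.get(node, 0) >= min_weighted_degree
    }
    # Retain only edges whose endpoints both survive the filter.
    sub_edges = {e: w for e, w in edges.items() if e[0] in keep and e[1] in keep}
    return keep, sub_edges

# Invented toy graph with small thresholds so the effect is visible.
edges = {("a", "b"): 3, ("c", "b"): 1, ("b", "d"): 2, ("e", "f"): 5}
membership = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 2, "f": 2}
keep, sub_edges = visualization_subset(edges, membership, min_weighted_degree=3, n_clusters=2)
```

With the defaults (`min_weighted_degree=50`, `n_clusters=10`), the same function would apply the thresholds described above.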

Topical Analysis of Twitter User Clusters

Links