Twitter Copy paste Propaganda Analysis

Copy Pasta on Twitter and Its Analysis


Problem Statement

Platforms like Twitter and Instagram are used by millions of people around the world every day. For some, they are the main source of news and information on current events. They can also be misused to spread misinformation, push incorrect information to gullible users, and create a harmful environment across the platform.

After the 2019 elections in India, Twitter released figures vaunting the degree of engagement on its platform in the world’s largest democracy. It reported 396 million “conversations” from January 1st to May 23 that year, a figure it said was a 600% increase over the prior election period in 2014. “Prime minister Narendra Modi emerged as the most mentioned political personality throughout the course of elections, while @BJP4India,” the Bharatiya Janata Party or BJP, “was the most mentioned political party on Twitter,” reported the Economic Times.



We believe that posting content one was told to post on Twitter, instead of sharing original thoughts, is detrimental to society, so we decided to analyze the extent of this practice.


Central ministers turn bots, put out identical tweets to prove ‘Demonetisation Success’


In this project, we collect data from Twitter for selected time periods around important events and analyze the posted content for similarity. The results follow.

Related Work

An existing analysis of copy pasta organized through WhatsApp groups is published here


This work focused on evaluating hashtags and their reach over time. The authors of the paper joined a large number of WhatsApp groups and analyzed the resulting Twitter trends. The campaigns were highly effective at producing lasting Twitter trends with a relatively small number of participants. Out of the 75 campaigns, 69 succeeded in reaching India-wide trend status. While Twitter’s criteria for trend status seem to depend on various factors, most campaigns made it onto the trend list after accumulating about 5,000 tweets within 30 minutes. Not all of these needed to be copy-paste tweets from the tweet bank; some were retweets, or organic conversations spun off from coordinated activities.




A tweet bank hosted on Google Docs frames the campaign narrative with pre-written tweets


The campaign started with 2,100 tweets per 15-minute interval at 9:00AM.


This paper was a good inspiration for us. It covered several social media platforms, including WhatsApp, Twitter, and Facebook, but looked at only a small portion of the timeline during the 2019 elections. We decided to expand this timeline and analyzed several important events, such as the GST introduction, the UP and Punjab elections, the 2014 elections, and the farmers' protest, looking for copy-paste content.


Methodology pipeline

Data Collection :- We decided to focus on verified handles on Twitter. Restricting the study to verified handles weeds out unwanted bot accounts and places the burden of responsibility and authenticity on the user. It also narrows the search area, and we believe that verified individuals and celebrities have a much larger influence. We included the mainstream political parties BJP, Congress, and AAP, with 450 handles from BJP, 50 from Congress, and 50 from AAP. Once the handles were collected, we used the Twitter API v2 to get the metadata for each user account. The handles and the code to collect the metadata are available here. The sample data collected for each account is given below.




Sample user data, number of followers and description of each account
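As a rough illustration of the metadata step, the sketch below builds request URLs for the Twitter API v2 user lookup endpoint, which accepts up to 100 usernames per request. It only constructs the URLs and does not send them; a real call would also need a bearer token in the Authorization header. The function name, the chosen fields, and the placeholder handles are assumptions for illustration, not the code used in the project.

```python
from urllib.parse import urlencode

# Twitter API v2 user lookup endpoint (up to 100 usernames per request).
USERS_BY_URL = "https://api.twitter.com/2/users/by"
MAX_USERNAMES_PER_REQUEST = 100


def build_lookup_urls(handles, fields=("description", "public_metrics", "verified")):
    """Return one request URL per batch of handles. Nothing is sent here."""
    urls = []
    for start in range(0, len(handles), MAX_USERNAMES_PER_REQUEST):
        batch = handles[start:start + MAX_USERNAMES_PER_REQUEST]
        query = urlencode({
            "usernames": ",".join(batch),
            "user.fields": ",".join(fields),
        })
        urls.append(f"{USERS_BY_URL}?{query}")
    return urls


# Hypothetical handle list: one real party handle plus placeholders,
# sized to match the 550 handles mentioned in the post.
handles = ["BJP4India"] + [f"example_handle_{i}" for i in range(549)]
urls = build_lookup_urls(handles)
print(len(urls), "requests needed")
```

The follower count for each account would then come from the public_metrics field of each returned user object.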

Once we had collected the handles, we scraped the user account data using SNScrape. We collected over 10 GB of data, going back as far as 2012, for all the collected accounts. The sample data collected for each tweet of an individual account is given below.

Data Analysis :-

We collected a large amount of data, but we didn't have enough resources to search all of it for copy pastas. So we selected important events, narrowed down the time periods in which they happened, and extracted the tweets posted during those periods for our analysis. The events and time periods are given below.
Events and time periods selected.
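The time-window extraction can be sketched as follows. This is a minimal example, assuming each tweet is a dictionary with a "date" timestamp and a "content" string; both field names and the event window are placeholders, not the periods or schema used in the study.

```python
from datetime import datetime, timezone

# Hypothetical event window (start inclusive, end exclusive).
# The dates are placeholders, not the periods selected in the study.
EVENT_WINDOWS = {
    "example_event": (
        datetime(2020, 1, 1, tzinfo=timezone.utc),
        datetime(2020, 2, 1, tzinfo=timezone.utc),
    ),
}


def tweets_in_window(tweets, start, end):
    """Keep only tweets whose timestamp falls in [start, end)."""
    return [t for t in tweets if start <= t["date"] < end]


# Toy tweets for illustration only.
tweets = [
    {"date": datetime(2019, 12, 31, 23, 59, tzinfo=timezone.utc), "content": "before the window"},
    {"date": datetime(2020, 1, 15, 10, 0, tzinfo=timezone.utc), "content": "inside the window"},
    {"date": datetime(2020, 2, 1, 0, 0, tzinfo=timezone.utc), "content": "on the end boundary"},
]

start, end = EVENT_WINDOWS["example_event"]
selected = tweets_in_window(tweets, start, end)
print([t["content"] for t in selected])
```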

Once the tweet data was curated by time period, we analyzed it using machine learning. We first removed stop words from the text. We then used SBERT to represent each tweet's content as a numerical vector that a machine can work with, and performed hierarchical clustering using cosine distance (one minus cosine similarity) as the distance measure. In short, the machine looks for similarities between the tweets and groups similar tweets into clusters.
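The pipeline described above can be sketched in plain Python. To keep the example self-contained, it swaps the SBERT embedding for simple token counts, uses a tiny illustrative stop-word list, and assumes single-linkage clustering cut at a distance threshold of 0.3; the post does not state the linkage or threshold used, so treat all of these as assumptions rather than the project's actual settings.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real run would use a full list.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "for", "on"}


def embed(text):
    """Stand-in for an SBERT embedding: stop-word-filtered token counts."""
    tokens = re.findall(r"[#@\w]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def cosine_distance(u, v):
    """One minus the cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return 1.0 - dot / norm if norm else 1.0


def cluster(texts, max_distance=0.3):
    """Single-linkage clustering cut at max_distance, implemented with union-find."""
    vectors = [embed(t) for t in texts]
    parent = list(range(len(texts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Merge every pair of tweets that is closer than the threshold.
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine_distance(vectors[i], vectors[j]) <= max_distance:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


# Toy tweets for illustration only.
texts = [
    "Demonetisation is a success for the nation",
    "Demonetisation is a big success for the nation",
    "Farmers are protesting in Delhi today",
]
print(cluster(texts))
```

Clusters with more than one tweet are the copy-paste candidates. With real sentence embeddings, the same clustering step would also catch paraphrased tweets that share few exact words.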

Results :-

The following section contains important observations from our study.

Farmers Protest

During the farmers' protest period, we found that BJP Twitter handles posted far more copy-paste content than the other parties. The figure below shows more than 30k tweets from BJP, while Congress and AAP account for fewer than 5k. It is important to note that our data collection had more Twitter handles for BJP than for Congress or AAP, but we collected the most vocal accounts from all the parties.
 

Hashtag trends during the farmers' protest

Hashtags and frequency 


Users and number of tweets copy pasted


Sambit Patra (account name: sambit swaraj) has the most copy-paste tweets, with more than 600. It is fair to say that most of the content he shares might not be his original thoughts. It is really interesting to note that on average he pasted 10 tweets a day :(


One important observation here is that even accounts with 1M followers are involved in copy-pasting, which undermines their credibility.

Demonetization

Copy paste tweets vs parties



User vs copy pastes and followers vs copy paste graphs


Similarly, a few other observations follow.










Hashtag analysis and retweets according to timeline

Tweet: #ModiPunishesPak India delivers stinging blow to Pakistan across the LoC #ModiPunishesPak
Timeline analysis



Tweet: #IndiaAgainstPropaganda India’s sovereignty cannot be compromised. External forces can be spectators but not participants. Indians know India and should decide for India. Let's remain united as a nation
Timeline analysis




Tweet: My PM My Pride #BravePmModi https://t.co/Oyc5AjWtn2
Timeline analysis


Ablations
Some of the things we tried initially that didn't work:
We applied for extended Twitter API access to gather tweet and retweet information. Apparently Twitter approves API access manually, and approval varies from case to case.

We also tried analyzing Hindi tweets using the NLTK library, but we faced technical difficulties getting the library to work. Since none of the team has an NLP background and time was limited, we unfortunately had to drop the Hindi tweet analysis. We do, however, encourage and invite readers to do this analysis if they can and share the results.

Future Work

Building a browser extension for Twitter that shows the percentage of copy-paste tweets under a user's profile. For example: "This user has a copy-paste percentage of 67%. The views expressed might not be their original ideas." The extension could simply call an API and fetch updated figures every month.
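The number behind such a label is simple to compute. The sketch below is only an illustration of the idea: the function names and the tweet counts are hypothetical, and a real extension would fetch the counts from a backend API rather than hard-code them.

```python
def copy_paste_percentage(total_tweets, copy_paste_tweets):
    """Share of a user's tweets flagged as copy-paste, in percent."""
    if total_tweets == 0:
        return 0.0
    return 100.0 * copy_paste_tweets / total_tweets


def profile_note(percentage):
    """Text the extension could show under a user's profile."""
    return (
        f"This user has a copy-paste percentage of {percentage:.0f}%. "
        "The views expressed might not be their original ideas."
    )


# Hypothetical counts for illustration only.
pct = copy_paste_percentage(total_tweets=300, copy_paste_tweets=201)
print(profile_note(pct))
```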

The Team:

1. Abhiram DV
2. Dattatreya Ch 
3. Sai Vishwak Gangam 
4. Ravi Teja 
5. Maneesh Gupta
6. Prudhvi Koppuravuri
