Team 7: Discovering and Categorising Language Biases
“Bias and Impartiality is in the eye of the beholder.” - Samuel Johnson.
Biased language is made up of words or phrases that might make certain people or groups feel excluded or underrepresented. Such ideologies and biases become especially pernicious when they concern vulnerable groups of people who share certain protected attributes, including ethnicity, gender, and religion. Identifying language biases towards these protected attributes can offer important cues for tracing harmful beliefs fostered in online spaces.
Platforms such as YouTube are increasingly connected to issues of racism, sexism, and other forms of discrimination, so there is a need to examine and monitor the language used on these platforms. Language bias is prevalent in YouTube video transcripts, and the goal of our research is to identify and characterise the linguistic bias in such auto-generated captions.
“We tend to look through the language and not realize how much power language has” - Deborah Tannen
Understanding such biases is important for AI systems to interact adequately in the social world; failure to do so can lead to the deployment of harmful technologies (e.g., conversational AI systems turning sexist and racist).
Related work that they used
Links -
Gender and Dialect Bias in YouTube’s Automatic Captions
HateBERT: Retraining BERT for Abusive Language Detection in English
Discovering and Categorising Language Biases in Reddit
Why Racial Bias Still Haunts Speech-Recognition AI
Biases Make People Vulnerable to Misinformation Spread by Social Media
Bias in Natural Language Processing (NLP): A Dangerous But Fixable Problem
Research Questions
What kinds of biases are present in YouTube videos, as revealed by analysing their transcripts?
Which kinds of biases are prevalent in particular demographics?
Are people able to identify such biases in YouTube videos?
Methodology pipeline
Dataset Creation
Selecting a seed video: Seed videos are chosen using YouTube search results. For each bias, we searched for terms associated with it and assembled a set of videos to act as seeds for the next step.
Crawling video suggestions: Starting from each seed video, we crawl its suggested videos, following 3 suggestions per video and continuing until we reach a depth of 3 levels. This yields roughly 40 videos per seed (3 + 9 + 27 = 39 new videos); in total, our dataset contains close to 500 videos.
Extracting video metadata: Once all the videos have been assembled, we extract the transcripts together with metadata such as video length, description, and comments. For this experiment we focus mainly on the transcript; the remaining metadata is kept for further analysis. A sketch of the crawl and transcript extraction follows below.
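A minimal sketch of this crawl in Python, under two assumptions: YouTube does not expose a stable public endpoint for suggested videos, so get_suggested is a hypothetical helper that would need to be backed by the YouTube Data API or watch-page scraping; and transcripts are fetched with the third-party youtube-transcript-api package, which the post itself does not name.

from collections import deque
from youtube_transcript_api import YouTubeTranscriptApi  # third-party package; an assumption, not named in the post

def get_suggested(video_id, k=3):
    # Hypothetical helper: return up to k suggested-video IDs for video_id.
    # Back this with the YouTube Data API or watch-page scraping.
    raise NotImplementedError

def crawl_suggestions(seed_id, per_video=3, max_depth=3):
    # Breadth-first crawl: 3 suggestions per video down to depth 3,
    # i.e. 3 + 9 + 27 = 39 new videos (~40 including the seed).
    seen, frontier = {seed_id}, deque([(seed_id, 0)])
    while frontier:
        vid, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for sug in get_suggested(vid, per_video):
            if sug not in seen:
                seen.add(sug)
                frontier.append((sug, depth + 1))
    return seen

def fetch_transcript(video_id):
    # Join the auto-generated caption segments of one video into a single string.
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    return " ".join(seg["text"] for seg in segments)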
Conducting the survey: We conducted a survey to gauge awareness of biases present in language. Transcripts of videos extracted through the above procedure were shared with participants, who were asked whether they could detect biases and, if detected, whether they could categorise them.
Experimentation
We used a data-driven approach based on word embeddings to discover and categorise language biases in YouTube transcripts. Protected attributes are connected to evaluative words found in the data, which are then categorised through a semantic analysis system. With this approach we discover gender bias, religion bias, and ethnic bias in these transcripts.
Target word sets used in the research (here, for gender):
women = ["sister" , "female" , "woman" , "girl" , "daughter" , "she" , "hers" , "her"]
men = ["brother" , "male" , "man" , "boy" , "son" , "he" , "his" , "him"]
By exploring the properties and values of the words most biased towards one target set compared to the other, we can get interesting insights about the community that produced the language and about how the two target sets compare.
This process leverages the embeddings of the biased words to aggregate words that are close in the embedding space. The function returns the partition with the best intra-cluster similarity (intrasim) found for the words biased towards target set 1 (cl1) and for target set 2 (cl2).
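A rough sketch of that partitioning step, reconstructed as an assumption from the description above: k-means over the embeddings of the biased words, keeping the number of clusters whose partition has the highest mean intra-cluster cosine similarity. The project's actual implementation may differ.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def best_partition(model, biased_words, k_range=range(2, 8)):
    # Cluster biased words in embedding space and keep the partition whose
    # clusters are tightest under mean intra-cluster cosine similarity.
    X = np.array([model.wv[w] for w in biased_words])
    best, best_intrasim = None, -1.0
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        sims = [cosine_similarity(X[labels == c]).mean() for c in range(k)]
        intrasim = float(np.mean(sims))
        if intrasim > best_intrasim:
            best_intrasim = intrasim
            best = [[w for w, l in zip(biased_words, labels) if l == c]
                    for c in range(k)]
    return best, best_intrasim

# Illustrative usage; the word lists are hypothetical outputs of the
# bias-scoring step above:
# cl1, _ = best_partition(model, words_biased_towards_women)  # target set 1
# cl2, _ = best_partition(model, words_biased_towards_men)    # target set 2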
Ablations - what they tried initially and what didn't work
One of the major issues was finding biased transcripts: locating videos that contain biases, and justifying that label, was difficult. We tried different network-analysis approaches for finding biased videos, which turned out to be a poor way to create the dataset. Adding the video transcript to the classifier features also didn't work out.
Deductions/Discussion
Looking at the results of the survey, we observe that many people are able to identify the biases present in video transcripts, but detection is far from universal and certain categories of bias are routinely confused. In particular:
Gender and racial biases are very prevalent in the transcripts of YouTube videos.
People find it difficult to differentiate between racial and religious bias, and frequently confuse words biased along these two dimensions.
In more than 40% of cases, people are unable to detect the biases present in the transcripts.
Future analyses could also use search results, trending lists, and recommendations to probe for bias in YouTube's algorithms themselves.
Conclusion
We have answered our research questions. We identified the different kinds of biases in YouTube videos, finding racial and gender biases to be the most prevalent. Through the surveys we then tested whether people are able to identify such biases: more than 60% of participants could, although they often confused racial and religious biases.
Future Work
Currently we have only used YouTube transcripts for discovering and categorising biases, and have focused on a particular set of biases. In the future, this work can be expanded to multiple social media platforms, to a larger set of biases, and to surveys of more people from diverse backgrounds, to obtain a better analysis.