Paper Notes
Tags: content moderation, hatespeech, ai, coling
Socio-Culturally Aware Evaluation Framework for LLM-Based Content Moderation
Ensuring fair and effective content moderation across diverse cultures remains a major challenge. At COLING 2025, at the SUMEval2 workshop, I presented our work on a Socio-Culturally Aware Evaluation Framework for LLM-Based Content Moderation, which addresses these challenges through a novel dataset and evaluation approach.
SUMEval Workshop & Panels
The workshop focused on multilingual evaluation, fairness, cultural representation, dataset creation, and model probing across languages. It was organized by Hellina Hailu Nigatu (UC Berkeley), Monojit Choudhury (MBZUAI), Oana Ignat (Santa Clara University), Sebastian Ruder (Cohere), Sunayana Sitaram (Microsoft), and Vishrav Chaudhary (Meta).
It featured two expert panels:
- Multilingual Dataset Collection: Discussed challenges in data collection, licensing, and community engagement. Panelists included Ekaterina Artemova (Toloka AI), Gloria Emezue (Lanfrica), Neha Sengupta (Inception AI), and Vivek Seshadri (Karya).
- Next Billion AI Users: Explored socio-cultural nuances in AI deployment for underserved communities. Panelists included Alham Fikri Aji (MBZUAI), Mitesh Khapra (IIT Madras), Sandipan Dandapat (Microsoft), and Vishrav Chaudhary (Meta).


Motivation for the Work
My work at Microsoft AI involved collecting data, building classical models, and fine-tuning BERT-based models for the following areas of content moderation for Bing Search:
- Hate speech detection
- Misinformation detection
- Self-harm intent detection
- Sexual content detection
Finding high-quality open-source datasets for content moderation was challenging. For hate speech, many datasets existed, but most focused only on hate directed at a particular group. Additionally, most datasets contained a large share of positive (hateful) examples mentioning marginalized groups, so models trained on them became overly sensitive and wrongly classified benign sentences mentioning these groups as hate speech. Misinformation datasets were limited to a few conspiracy theories and had noisy annotations. Moreover, whether a social-media post is perceived as hate speech or as promoting a conspiracy depends heavily on personal characteristics and experiences, and this subjective nature of hate speech was not well represented in existing datasets. We also did not find relevant datasets for self-harm and adult content. Since all four taxonomies are crucial for content moderation systems, we aimed to create a single dataset covering all of them.
Our Contributions
- We propose a scalable approach for generating diverse datasets for content moderation with minimal human annotation, leveraging predefined personas to introduce a wide range of cultural and individual perspectives (see the sketch after this list).
- We present a content moderation benchmark that enables comprehensive evaluation across multiple dimensions of diversity.
- We evaluate five LLMs on persona-driven datasets, analyzing their performance and highlighting challenges posed by socio-cultural diversity.
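To make the persona-driven idea concrete, here is a minimal, hypothetical sketch of how persona descriptions could be injected into generation and labelling prompts. It is my own illustration under stated assumptions, not the pipeline from the paper: the `Persona` class, the `call_llm` callable, and the exact prompt wording are all placeholders; only the four moderation categories come from the post above.

```python
# Illustrative sketch only (not the paper's actual pipeline): persona-conditioned
# prompt construction for generating and labelling content-moderation examples.
# `call_llm` is a hypothetical stand-in for whatever LLM client you use.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Persona:
    name: str          # hypothetical identifier, e.g. "retired_teacher_rural"
    description: str   # short socio-cultural profile injected into the prompt

CATEGORIES = ["hate speech", "misinformation", "self-harm intent", "sexual content"]

def build_generation_prompt(persona: Persona, category: str) -> str:
    """Ask the model to write a post as seen through the persona's lens."""
    return (
        f"You are {persona.description}\n"
        f"Write one short social-media post that this person might encounter and "
        f"consider '{category}'. Return only the post text."
    )

def build_judgement_prompt(persona: Persona, post: str, category: str) -> str:
    """Ask the model to label the post from the persona's perspective."""
    return (
        f"You are {persona.description}\n"
        f"Post: {post}\n"
        f"From this person's perspective, is the post '{category}'? Answer yes or no."
    )

def generate_dataset(personas: List[Persona],
                     call_llm: Callable[[str], str]) -> List[dict]:
    """Cross every persona with every category; labels stay persona-relative."""
    rows = []
    for persona in personas:
        for category in CATEGORIES:
            post = call_llm(build_generation_prompt(persona, category))
            label = call_llm(build_judgement_prompt(persona, post, category))
            rows.append({"persona": persona.name, "category": category,
                         "text": post, "persona_label": label.strip().lower()})
    return rows
```

The key design point this sketch tries to convey is that the label is tied to a persona rather than treated as a single ground truth, which is what lets the benchmark surface disagreement driven by socio-cultural background. For the actual personas, prompts, and labelling protocol, see the paper and repository linked below.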
Links
- Paper: https://arxiv.org/abs/2412.13578
- Repository: https://github.com/gaurikholkar/Socio-Culturally-Aware-Evaluation-Framework-for-LLM-Based-Content-Moderation
- Poster: Download Poster
- HF Dataset Links: Persona-driven dataset and diversity-driven dataset
Citation
If you use our work, please cite:
@misc{kumar2024socioculturallyawareevaluationframework,
      title={Socio-Culturally Aware Evaluation Framework for LLM-Based Content Moderation},
      author={Shanu Kumar and Gauri Kholkar and Saish Mendke and Anubhav Sadana and Parag Agrawal and Sandipan Dandapat},
      year={2024},
      eprint={2412.13578},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13578}
}