% This must be in the first 5 lines to tell arXiv to use pdfLaTeX, which is strongly recommended.
\pdfoutput=1
% In particular, the hyperref package requires pdfLaTeX in order to break URLs across lines.
\documentclass[11pt]{article}
% Change "review" to "final" to generate the final (sometimes called camera-ready) version.
% Change to "preprint" to generate a non-anonymous version with page numbers.
\usepackage[final]{coling}
% Standard package includes
\usepackage{times}
\usepackage{latexsym}
% To render Arabic letters (needed for the \< ... > commands used in the body)
\usepackage{arabtex}
\usepackage{utf8}
\setcode{utf8}
% For proper rendering and hyphenation of words containing Latin characters (including in bib files)
\usepackage[T1]{fontenc}
% For Vietnamese characters
% \usepackage[T5]{fontenc}
% See https://www.latex-project.org/help/documentation/encguide.pdf for other character sets
% This assumes your files are encoded as UTF8
\usepackage[utf8]{inputenc}
% This is not strictly necessary, and may be commented out,
% but it will improve the layout of the manuscript,
% and will typically save some space.
\usepackage{microtype}
% This is also not strictly necessary, and may be commented out.
% However, it will improve the aesthetics of text in
% the typewriter font.
\usepackage{inconsolata}
%Including images in your LaTeX document requires adding
%additional package(s)
\usepackage{graphicx}
% If the title and author information does not fit in the area allocated, uncomment the following
%
%\setlength\titlebox{<dim>}
%
% and set <dim> to something 5cm or larger.
% ###################################
\title{Navigating Dialectal Bias and Ethical Complexities in Levantine Arabic Hate Speech Detection}
% Author information can be set in various styles:
% For several authors from the same institution:
% \author{Author 1 \and ... \and Author n \\
% Address line \\ ... \\ Address line}
% if the names do not fit well on one line use
% Author 1 \\ {\bf Author 2} \\ ... \\ {\bf Author n} \\
% For authors from different institutions:
% \author{Author 1 \\ Address line \\ ... \\ Address line
% \And Author n \\ Address line \\ ... \\ Address line
% \And Author n \\ Address line \\ ... \\ Address line
% \And Author n \\ Address line \\ ... \\ Address line}
% To start a separate ``row'' of authors use \AND, as in
% \author{Author 1 \\ Address line \\ ... \\ Address line
% \AND
% Author 2 \\ Address line \\ ... \\ Address line \And
% Author 3 \\ Address line \\ ... \\ Address line}
% \author{First Author \\
% Affiliation / Address line 1 \\
% Affiliation / Address line 2 \\
% Affiliation / Address line 3 \\
% \texttt{email@domain} \\\And
% Second Author \\
% Affiliation / Address line 1 \\
% Affiliation / Address line 2 \\
% Affiliation / Address line 3 \\
% \texttt{email@domain} \\}
\author{
\textbf{Ahmed Haj Ahmed\textsuperscript{1,2}},
\textbf{Rui-Jie Yew\textsuperscript{2}},
\textbf{Xerxes Minocher\textsuperscript{1}},
\textbf{Suresh Venkatasubramanian\textsuperscript{2}},
% \\
% \textbf{Fifth Author\textsuperscript{1,2}},
% \textbf{Sixth Author\textsuperscript{1}},
% \textbf{Seventh Author\textsuperscript{1}},
% \textbf{Eighth Author \textsuperscript{1,2,3,4}},
% \\
% \textbf{Ninth Author\textsuperscript{1}},
% \textbf{Tenth Author\textsuperscript{1}},
% \textbf{Eleventh E. Author\textsuperscript{1,2,3,4,5}},
% \textbf{Twelfth Author\textsuperscript{1}},
% \\
% \textbf{Thirteenth Author\textsuperscript{3}},
% \textbf{Fourteenth F. Author\textsuperscript{2,4}},
% \textbf{Fifteenth Author\textsuperscript{1}},
% \textbf{Sixteenth Author\textsuperscript{1}},
% \\
% \textbf{Seventeenth S. Author\textsuperscript{4,5}},
% \textbf{Eighteenth Author\textsuperscript{3,4}},
% \textbf{Nineteenth N. Author\textsuperscript{2,5}},
% \textbf{Twentieth Author\textsuperscript{1}}
% \\
\\
\textsuperscript{1}Haverford College,
\textsuperscript{2}Brown University
% \textsuperscript{3}Affiliation 3,
% \textsuperscript{4}Affiliation 4,
% \textsuperscript{5}Affiliation 5
\\
\small{
\textbf{Correspondence:} \href{mailto:ahajahmed@haverford.edu}{ahajahmed@haverford.edu}
}
}
\begin{document}
% \setlength{\parskip}{0pt}
% \raggedbottom
\maketitle
\begin{abstract}
% {\small \textbf{Content Warning:} The content of this paper may be upsetting or triggering to some readers.}
% \vspace{0.4em} % Adds space between the warning and the abstract text.
Social media platforms have become central to global communication, yet they also facilitate the spread of hate speech. For underrepresented dialects like Levantine Arabic, detecting hate speech presents unique cultural, ethical, and linguistic challenges. This paper explores the complex sociopolitical and linguistic landscape of Levantine Arabic and critically examines the limitations of current datasets used in hate speech detection. We highlight the scarcity of publicly available, diverse datasets and analyze the consequences of dialectal bias within existing resources. By emphasizing the need for culturally and contextually informed natural language processing (NLP) tools, we advocate for a more nuanced and inclusive approach to hate speech detection in the Arab world.
\vspace{0.3em}
\noindent \textbf{Warning:} The content of this paper may be upsetting or triggering to some readers.
% \end{abstract}
\end{abstract}
% %%%%% Add the content warning here
% \noindent \textbf{Content Warning:} The content of this paper may be upsetting or triggering to some readers.
\section{Introduction}
Language, reflecting culture and identity, can both unite and divide communities. In the Levant, deep-rooted socio-political tensions have turned language into a weapon. The rise of digital platforms has amplified hate speech, necessitating robust detection and mitigation mechanisms \citep{CASTANOPULGARIN2021101608, https://doi.org/10.1002/1944-2866.POI364}. Automated tools leveraging NLP are essential for curbing online hate speech \citep{JAHAN2023126232}. However, these tools are not equally effective across all languages and dialects. While significant progress has been made for languages like English, Levantine Arabic remains underserved \citep{bender2019rule}.
Levantine Arabic, spoken across Syria, Jordan, Palestine, and Lebanon, is a dialect continuum with significant regional variations, making it challenging for current NLP technologies to capture \citep{haff2022currasbaladilevantine}. Existing hate speech detection models often overlook the rich cultural and sociolinguistic nuances of the dialect. This paper addresses the ethical, cultural, and linguistic challenges in detecting hate speech in Levantine Arabic and highlights the critical need for more representative datasets.
\section{The Linguistic Complexity of Levantine Arabic}
\subsection{Dialectal Variation}
Levantine Arabic is a continuum of dialects differing significantly across countries and regions. In Syria, for example, the Damascus dialect contrasts with that of Idlib or rural areas: "clothes" is "awaei" in Damascus but "teyab" in Aleppo, and "girl" is "bint" in Damascus and "sabiye" elsewhere, with pronunciation variations like consonant softening altering meanings \citep{Naïm+2012+920+935}. Jordanian Arabic varies between urban centers like Amman and rural areas that preserve traditional forms; "now" is "halla" in urban settings and "hassa" in rural regions, and the letter jim may be pronounced as a soft "j" or a hard "g" \citep{c3909b9d-f472-38b7-b06d-dfca0dcd22bb}. Palestinian Arabic differs between Jerusalem, the West Bank, Gaza, and diaspora communities; "cup" is "kasseh" in Jerusalem but "kubayeh" in Gaza. In Lebanon, Beirut's Arabic incorporates French loanwords due to historical influences—unlike regions like Tripoli or the south; the pronunciation of the letter qaf also varies between a glottal stop, a hard "k," or the standard "q" sound \citep{obégi1971phonemic, Naïm+2012+920+935}.
These regional differences are deeply tied to cultural and socio-political identities. Variations in expressions, idioms, and pronunciation can carry different meanings depending on locality, posing significant challenges for NLP tools. Generic models, often trained on standardized Arabic, may not capture these subtleties.
\subsection{The Role of Sociolinguistic Context}
Understanding hate speech in Levantine Arabic requires not only linguistic proficiency but also a deep understanding of the socio-political context in which the language is used. The Levant is a region marked by ongoing conflicts, occupation, and political instability. Hate speech is often employed strategically to exacerbate sectarian divisions, mobilize political supporters, or criticize opposition groups.
In Syria, for instance, even subtle linguistic features like the pronunciation of the qaf (\<ق>) have become sociopolitical markers. Historically a neutral phonetic variation, the qaf pronunciation shifted during the conflict to signal regime alignment \citep{Omran_2021}. Security forces used it in propaganda to stoke sectarian fears, while opposition groups mocked it as a regime identifier, transforming a simple linguistic trait into a symbol of allegiance and deepening societal divides.
Similar dynamics can be observed in Lebanon, where political factions often use divisive rhetoric to maintain control. Hate speech is not merely offensive language but part of broader strategies to sustain political dominance and suppress dissent. Any attempt to detect and mitigate hate speech in this context must account for these complex and shifting dynamics, including the sociopolitical significance of linguistic nuances.
\section{The Problem with Current Datasets}
\subsection{Lack of Publicly Available Datasets}
One of the most significant barriers to improving hate speech detection in Levantine Arabic is the lack of publicly available datasets. While several datasets exist for Modern Standard Arabic (MSA), Egyptian Arabic, Gulf Arabic, and others \citep{ALAKROT2018174, mubarak-etal-2017-abusive, 10.5555/3382225.3382239, 8593146}, there is a striking absence of resources dedicated to Levantine Arabic. This gap limits the ability of researchers and developers to create effective hate speech detection models for the region.
The few datasets that do exist for Levantine Arabic are often restricted in scope, limiting their utility for broader research. Moreover, these datasets are rarely representative of the full spectrum of dialectal variation found within the Levant. Without publicly available, diverse datasets, the development of inclusive and effective NLP tools remains out of reach \citep{barocas-hardt-narayanan}.
\subsection{Dialectal Bias in Existing Datasets}
Even the best available datasets for Levantine Arabic are biased toward specific regional dialects. A prominent case in point is the Levantine Hate Speech and Abusive Language (L-HSAB) Twitter dataset—the first and only publicly available dataset dedicated to hate speech and abusive language in Levantine Arabic \citep{mulki-etal-2019-l}. While L-HSAB is invaluable due to its size and scope, it disproportionately focuses on Lebanese Arabic. This bias stems primarily from its data collection methodology, which involved extracting tweets using keywords related to Lebanese political figures and events \citep{f5a0fc6c-286b-3484-9927-c1949a72ae5c}.
The most frequently mentioned entities in L-HSAB are predominantly Lebanese. "Gebran Bassil," a Lebanese politician, is mentioned over 1,000 times. The term "Lebanon" appears around 250 times, and "Wiam Wahhab," another Lebanese politician and journalist, is mentioned approximately 200 times. This concentration on specific individuals and topics skews the dataset toward Lebanese political discourse, thereby overlooking the linguistic and sociopolitical nuances present in other Levantine regions.
This skew introduces significant bias, as the linguistic features, idiomatic expressions, and even manifestations of hate speech in Lebanese Arabic differ markedly from other Levantine dialects. For instance, certain derogatory terms or politically charged phrases common in Lebanese discourse may be absent or hold different connotations in Syrian or Jordanian contexts. A term like "\<زعران>" ("za‘ran", meaning "thugs" in Lebanese Arabic) is a strong insult in Lebanon but does not carry the same weight in Syrian Arabic. Conversely, a Syrian expression such as "\<شبيحة>" ("shabbiha", referring to pro-regime militias) is a loaded term in Syria but might not evoke the same response or understanding among Lebanese speakers \citep{doi:10.1177/2633002420907771}.
Moreover, the focus on specific events and actors further narrows the dataset's applicability. The political landscape and issues prevalent in Lebanon are unique and may not reflect the concerns or conflicts in Syria, Jordan, or Palestine. Hate speech related to Lebanese political parties like the Free Patriotic Movement or events like the Lebanese protests of 2019 would not encompass the types of hate speech prevalent in other regions.
As a result, models trained on datasets like L-HSAB are less likely to generalize effectively to other dialects. They may fail to detect hate speech in Syrian, Jordanian, or Palestinian Arabic due to differences in vocabulary, idioms, and sociopolitical references. This limitation reduces the overall effectiveness of hate speech detection tools across the Levantine region.
Furthermore, this bias can lead to misclassification, where non-hateful speech in one dialect is incorrectly flagged as abusive because the model does not accurately interpret the linguistic nuances of that dialect. Conversely, actual hate speech may go undetected in underrepresented dialects, allowing harmful content to proliferate.
In summary, while datasets like L-HSAB are crucial stepping stones in advancing hate speech detection for Levantine Arabic, their dialectal and topical biases highlight the need for more inclusive data collection strategies. Expanding the dataset to include a broader range of dialects and sociopolitical contexts is essential. By doing so, we can develop NLP tools that are both effective and equitable, ensuring that all communities within the Levantine region are adequately represented and protected in the digital space \citep{barocas-hardt-narayanan}.
\subsection{Limitations of Pre-trained Embeddings and the Need for Domain-Specific Models}
In addition to dataset biases, the choice of language models and embeddings plays a crucial role in the effectiveness of hate speech detection systems. Our analyses and experiments on the L-HSAB dataset underscore the limitations of relying on pre-trained embeddings that are not tailored to the specific linguistic characteristics of Levantine Arabic.
We evaluated several embedding techniques to assess their performance in detecting hate speech within the L-HSAB dataset. The methods included traditional approaches like Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF), as well as neural embeddings such as pre-trained Arabic fastText, custom-trained Word2Vec on Levantine Arabic data, pre-trained GoogleNews Word2Vec, and pre-trained GloVe embeddings \citep{Harris1954DistributionalS, 10.5555/106765.106782, bojanowski2016enriching, mikolov2013efficientestimationwordrepresentations, pennington-etal-2014-glove}.
\textbf{Effective Techniques:} Our experiments revealed that BoW, TF-IDF, pre-trained Arabic fastText, and custom-trained Word2Vec embeddings significantly outperformed the other methods. These techniques achieved higher F1 scores, indicating better precision and recall in identifying hate speech content. The success of these models can be attributed to their alignment with the linguistic properties of Levantine Arabic, either through their focus on Arabic text or customization to the specific dialect.
\textbf{Ineffective Techniques:} In stark contrast, pre-trained embeddings like GoogleNews Word2Vec and GloVe, which are primarily trained on English corpora, achieved near-zero F1 scores. This drastic underperformance highlights a critical issue: models trained predominantly on English data fail to recognize or interpret Arabic text accurately. Consequently, they are ineffective for tasks involving Levantine Arabic hate speech detection.
These findings emphasize the importance of domain-specific adaptations in NLP models. Utilizing embeddings and language models that are trained or fine-tuned on Levantine Arabic data is essential for capturing the unique linguistic features and nuances of the dialect. Relying on generic, pre-trained models not only reduces accuracy but also risks missing or misclassifying hate speech, thereby undermining the effectiveness of detection systems.
By investing in domain-specific models, researchers and technologists can create more accurate and reliable hate speech detection tools. Such tools will be better equipped to handle the linguistic diversity of Levantine Arabic, ultimately contributing to a safer and more inclusive online environment for speakers of all regional dialects.
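As a concrete illustration, the following minimal sketch reproduces the flavor of this comparison for the lexical baselines using scikit-learn. The variable names (\texttt{texts}, \texttt{labels}) and the logistic-regression classifier are illustrative assumptions rather than our exact experimental pipeline; embeddings trained on English corpora can be scored in the same loop, but because their vocabularies contain almost no Arabic tokens, their F1 scores collapse.
\begin{verbatim}
# Sketch: lexical baselines for L-HSAB-style data.
# `texts` / `labels` are hypothetical: lists of Levantine
# Arabic posts and their annotations, loaded by the caller.
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfVectorizer)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def evaluate_baselines(texts, labels, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, labels, test_size=0.2,
        stratify=labels, random_state=seed)
    pipelines = {
        # Vocabulary is induced from the Arabic data itself,
        # so dialect-specific tokens are represented.
        "bow": make_pipeline(CountVectorizer(),
                             LogisticRegression(max_iter=1000)),
        "tfidf": make_pipeline(TfidfVectorizer(),
                               LogisticRegression(max_iter=1000)),
    }
    scores = {}
    for name, pipe in pipelines.items():
        pipe.fit(X_tr, y_tr)
        scores[name] = f1_score(
            y_te, pipe.predict(X_te), average="macro")
    return scores
\end{verbatim}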
\section{Ethical Considerations in Hate Speech Detection}
The dialectal bias identified above privileges one regional dialect over others and risks marginalizing communities whose voices are already underrepresented in the digital sphere. There are also ethical concerns beyond issues of data bias. False positives—where non-hate speech is misclassified—can result in the suppression of legitimate cultural expressions, especially in a region where language is tightly bound to identity. A prominent example is the misclassification of the Arabic word "\<شهيد>" ("shaheed", meaning "martyr") by social media platforms like Meta \citep{oversightboard_shaheed_2024}. The term holds significant cultural and religious importance, often used to honor individuals who have died for a sacred cause. However, automated moderation systems have frequently removed content containing "shaheed," interpreting it as a reference to terrorism or violent extremism due to its association with entities on terrorism watchlists.
Conversely, false negatives—where actual hate speech goes undetected—allow harmful narratives to spread unchecked, fueling further violence. For example, derogatory terms or slurs specific to a particular region or group may go unnoticed by moderation systems trained primarily on other dialects or on Modern Standard Arabic. In the context of the Syrian conflict, hate speech containing region-specific pejoratives aimed at certain ethnic or sectarian groups might not be recognized as such by models lacking comprehensive dialectal data. This oversight enables the propagation of inflammatory content that can exacerbate tensions and incite real-world violence.
Technologists and researchers have a responsibility to develop models that not only detect hate speech but do so in a way that respects the linguistic and cultural integrity of Levantine Arabic. Practically, ethical considerations are particularly relevant within a conflict-ridden region like the Levant where the failure to identify and address hate speech content undermines efforts to promote peace and stability. By incorporating diverse linguistic inputs and cultural insights, developers can create more nuanced models that differentiate between harmful content and legitimate expression, thereby protecting both free speech and community safety.
\section{Towards More Culturally Aware Language Technologies}
Addressing the challenges of hate speech detection in Levantine Arabic requires practical solutions that consider the language's unique properties. \citet{bergman2022responsiblenaturallanguageannotation} offer valuable guidelines for developing effective and ethically sound NLP tools for underrepresented dialects. By incorporating these recommendations, we can create language technologies that are culturally aware and inclusive, specifically tailored to Levantine Arabic.
\subsection{Engaging Local Communities}
Engaging local communities is essential for capturing the full spectrum of dialectal variations and cultural contexts within Levantine Arabic. The language's rich diversity necessitates collaboration with native speakers from various regions. Involving annotators and experts who possess both language proficiency and deep understanding of local contexts ensures that the linguistic nuances specific to each dialect are accurately represented \citep{radiya-dixit_bogen_2024}.
\subsection{Rethinking Data Collection and Annotation}
To overcome dialectal bias, new data collection and annotation strategies must account for Levantine Arabic's specific properties. Given the significant dialectal variations, stratified sampling techniques are crucial for comprehensively capturing the linguistic landscape \citep{bergman2022responsiblenaturallanguageannotation}. Annotation processes should prioritize using annotators proficient in specific regional dialects and familiar with local sociopolitical contexts \citep{doi:10.1126/science.aal4230, radiya-dixit_bogen_2024}. Researchers must be mindful of potential consequences when collecting data from conflict-affected regions, as certain linguistic features can carry sociopolitical implications. Providing transparent annotation guidelines and support systems for annotators is also critical.
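As a rough illustration of what region-stratified sampling might look like in practice, the sketch below draws an equal number of posts per regional dialect; the \texttt{corpus} DataFrame and its \texttt{region} column are hypothetical placeholders, not a prescription for any particular collection pipeline.
\begin{verbatim}
# Sketch: equal-size sampling across regional dialects.
# `corpus` is a hypothetical pandas DataFrame with at least
# a "text" column and a "region" column (e.g. Syrian,
# Jordanian, Palestinian, Lebanese).
import pandas as pd

def stratified_sample(corpus, per_region, seed=0):
    return (corpus
            .groupby("region", group_keys=False)
            .apply(lambda g: g.sample(
                n=min(per_region, len(g)),
                random_state=seed)))
\end{verbatim}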
\subsection{Prioritizing Ethical Design}
Developing NLP tools for Levantine Arabic must be grounded in ethical design principles that account for the language's unique properties. Practitioners should carefully consider the granularity of language divisions within Levantine Arabic and strive for inclusivity without compromising annotation quality \citep{bergman2022responsiblenaturallanguageannotation}. Providing support systems for annotators is essential, especially given potential exposure to disturbing content in conflict-affected regions. By adopting these strategies, researchers can develop hate speech detection models that are equipped to handle Levantine Arabic's dialectal diversity and cultural contexts, promoting an inclusive digital environment.
\section{Conclusion}
Detecting hate speech in Levantine Arabic presents unique cultural, linguistic, and ethical challenges due to intricate dialectal variations and biased datasets. This highlights the urgent need for more inclusive NLP approaches. By engaging local communities, reimagining data collection, and embedding ethical considerations into technology design, we can develop tools that effectively identify hate speech while honoring the Levant's rich linguistic diversity. This paper advocates for renewed cultural sensitivity in NLP applications targeting Levantine Arabic. Addressing sociolinguistic complexities and ethical implications enables us to create tools that serve all speakers, enhance detection accuracy, and promote a more just digital environment throughout the Arab world.
\section{Limitations}
This paper primarily offers a conceptual discussion on the challenges of detecting hate speech in Levantine Arabic without providing empirical data. The absence of quantitative analysis limits the assessment of the practical impact of our recommendations. Additionally, while we discuss dialectal variations across Syria, Jordan, Palestine, and Lebanon, the linguistic analysis is not exhaustive, and some regional nuances may not be fully represented. Lastly, although we reference frameworks like the playbook by \citet{bergman2022responsiblenaturallanguageannotation}, we do not offer a detailed roadmap for creating inclusive and effective hate speech detection models. Future work should focus on developing concrete tools to operationalize these recommendations.
\section*{Acknowledgments}
We would like to express our sincere gratitude to Dr. Nagham El Karhili at the Global Internet Forum to Counter Terrorism, Dr. [or Professor?] Stevie Bergman at Google DeepMind and Brown University, Professor Manar Darwish at Haverford College, and Anna Lacy, Digital Scholarship Librarian at Haverford College, for their insightful feedback and guidance throughout this research. Special thanks to the authors of the L-HSAB dataset—Hala Mulki, Hatem Haddad, Chedi Bechikh Ali, and Halima Alshabani—whose valuable work significantly contributed to this study. We also extend our appreciation to Professor Daniel Ritchie, the Computer Science Department at Brown University, and Google Research for offering and directing the exploreCSR program and for their funding support. We are grateful to the Digital Scholarship team at Haverford College for their assistance and support. Lastly, we thank the Marian E. Koshland Integrated Natural Sciences Center at Haverford College for their funding and support.
% Bibliography entries for the entire Anthology, followed by custom entries
%\bibliography{anthology,custom}
% Custom bibliography entries only
\bibliography{custom}
\end{document}