% This must be in the first 5 lines to tell arXiv to use pdfLaTeX, which is strongly recommended.
\pdfoutput=1
% In particular, the hyperref package requires pdfLaTeX in order to break URLs across lines.
\documentclass[11pt]{article}
% Change "review" to "final" to generate the final (sometimes called camera-ready) version.
% Change to "preprint" to generate a non-anonymous version with page numbers.
\usepackage[final]{coling}
% Standard package includes
\usepackage{times}
\usepackage{latexsym}
%% To render Arabic letters %%
\usepackage{arabtex}
% \usepackage{utf8}
% \setcode{utf8}
% For proper rendering and hyphenation of words containing Latin characters (including in bib files)
\usepackage[T1]{fontenc}
% For Vietnamese characters
% \usepackage[T5]{fontenc}
% See https://www.latex-project.org/help/documentation/encguide.pdf for other character sets
% This assumes your files are encoded as UTF8
\usepackage[utf8]{inputenc}
% This is not strictly necessary, and may be commented out,
% but it will improve the layout of the manuscript,
% and will typically save some space.
\usepackage{microtype}
% This is also not strictly necessary, and may be commented out.
% However, it will improve the aesthetics of text in
% the typewriter font.
\usepackage{inconsolata}
%Including images in your LaTeX document requires adding
%additional package(s)
\usepackage{graphicx}
% If the title and author information does not fit in the area allocated, uncomment the following
%
%\setlength\titlebox{<dim>}
%
% and set <dim> to something 5cm or larger.
% ###################################
\title{Navigating Dialectal Bias and Ethical Complexities in Levantine Arabic Hate Speech Detection}
% Author information can be set in various styles:
% For several authors from the same institution:
% \author{Author 1 \and ... \and Author n \\
% Address line \\ ... \\ Address line}
% if the names do not fit well on one line use
% Author 1 \\ {\bf Author 2} \\ ... \\ {\bf Author n} \\
% For authors from different institutions:
% \author{Author 1 \\ Address line \\ ... \\ Address line
% \And Author n \\ Address line \\ ... \\ Address line
% \And Author n \\ Address line \\ ... \\ Address line
% \And Author n \\ Address line \\ ... \\ Address line}
% To start a separate ``row'' of authors use \AND, as in
% \author{Author 1 \\ Address line \\ ... \\ Address line
% \AND
% Author 2 \\ Address line \\ ... \\ Address line \And
% Author 3 \\ Address line \\ ... \\ Address line}
% \author{First Author \\
% Affiliation / Address line 1 \\
% Affiliation / Address line 2 \\
% Affiliation / Address line 3 \\
% \texttt{email@domain} \\\And
% Second Author \\
% Affiliation / Address line 1 \\
% Affiliation / Address line 2 \\
% Affiliation / Address line 3 \\
% \texttt{email@domain} \\}
\author{
\textbf{Ahmed Haj Ahmed\textsuperscript{1,2}},
\textbf{Rui-Jie Yew\textsuperscript{2}},
\textbf{Xerxes Minocher\textsuperscript{1}}
\\
\textsuperscript{1}Haverford College,
\textsuperscript{2}Brown University
% \textsuperscript{3}Affiliation 3,
% \textsuperscript{4}Affiliation 4,
% \textsuperscript{5}Affiliation 5
\\
\small{
\textbf{Correspondence:} \href{mailto:email@domain}{email@domain}
}
}
\begin{document}
\maketitle
\begin{abstract}
Social media platforms have become central to global communication, yet they also facilitate the spread of hate speech. For underrepresented dialects like Levantine Arabic, detecting hate speech presents unique cultural, ethical, and linguistic challenges. This paper explores the complex sociopolitical and linguistic landscape of Levantine Arabic and critically examines the limitations of current datasets used in hate speech detection. We highlight the scarcity of publicly available, diverse datasets and analyze the consequences of dialectal bias within existing resources. By emphasizing the need for culturally and contextually informed natural language processing (NLP) tools, we advocate for a more nuanced and inclusive approach to hate speech detection in the Arab world.
\end{abstract}
\section{Introduction}
Language, reflecting culture and identity, can both unite and divide communities. In the Levant, deep-rooted socio-political tensions have turned language into a potent weapon. The rise of digital platforms has amplified hate speech, necessitating robust detection and mitigation mechanisms (Castaño-Pulgarín et al., 2021; Awan, 2014). Automated tools leveraging NLP are essential for curbing online hate speech (Jahan and Oussalah, 2023). However, these tools are not equally effective across all languages and dialects. While significant progress has been made for languages like English, Levantine Arabic remains underserved (Bender, 2019).
Levantine Arabic, spoken across Syria, Jordan, Palestine, and Lebanon, is a dialect continuum with significant regional variations, making it challenging for current NLP technologies to capture (El Haff et al., 2022). Existing hate speech detection models often overlook the rich cultural and sociolinguistic nuances of the dialect. This paper addresses the ethical, cultural, and linguistic challenges in detecting hate speech in Levantine Arabic and highlights the critical need for more representative datasets.
\section{The Linguistic Complexity of Levantine Arabic}
\subsection{Dialectal Variation}
Levantine Arabic is a continuum of dialects that differ significantly across countries and regions. In Syria, the Damascus dialect contrasts with others: "clothes" is "awaei" in Damascus but "teyab" in Aleppo, and "girl" is "bint" in Damascus versus "sabiye" elsewhere. Jordanian Arabic exhibits distinctions between urban and rural speakers (Sakarna, 2005). Palestinian Arabic varies between Jerusalem, the West Bank, Gaza, and diaspora communities; the word for "cup," for example, is "kasseh" in Jerusalem but "kubayeh" in Gaza. In Lebanon, Beirut's Arabic incorporates many French loanwords due to historical influences: a Beiruti might use "merci" for "thank you," while others use the traditional Arabic "shukran."
These regional differences are deeply tied to cultural and socio-political identities. Variations in expressions, idioms, and pronunciation can carry different meanings depending on locality, posing significant challenges for NLP tools aiming to detect hate speech across the Levantine region. Generic models, often trained on standardized Arabic, may not capture these subtleties.
\subsection{The Role of Sociolinguistic Context}
Understanding hate speech in Levantine Arabic requires not only linguistic proficiency but also a deep understanding of the socio-political context in which the language is used. The Levant is a region marked by ongoing conflicts, occupation, and political instability. Hate speech is often employed strategically to exacerbate sectarian divisions, mobilize political supporters, or criticize opposition groups.
In Syria, for instance, even subtle linguistic features like the pronunciation of the qaf (ق) have become sociopolitical markers. Historically a neutral phonetic variation, the qaf pronunciation shifted during the conflict to signal regime alignment (Omran, 2021). Security forces used it in propaganda to stoke sectarian fears, while opposition groups mocked it as a regime identifier, transforming a simple linguistic trait into a symbol of allegiance and deepening societal divides.
Similar dynamics can be observed in Lebanon, where political factions often use divisive rhetoric to maintain control. In this landscape, hate speech is not merely offensive language but part of broader strategies to sustain political dominance and suppress dissent. Any attempt to detect and mitigate hate speech in this context must account for these complex and shifting dynamics, including the sociopolitical significance of linguistic nuances.
\section{The Problem with Current Datasets}
\subsection{Lack of Publicly Available Datasets}
One of the most significant barriers to improving hate speech detection in Levantine Arabic is the lack of publicly available datasets. While several datasets exist for Modern Standard Arabic (MSA), Egyptian Arabic, Gulf Arabic, and others (Alakrota et al., 2018; Mubarak et al., 2017; Albadi et al., 2018; Al-Ajlan and Ykhlef, 2018), there is a striking absence of resources dedicated to Levantine Arabic. This gap limits the ability of researchers and developers to create effective hate speech detection models for the region.
The few datasets that do exist for Levantine Arabic are often proprietary or restricted in scope, limiting their utility for broader research. Moreover, these datasets are rarely representative of the full spectrum of dialectal variation found within the Levant. Without publicly available, diverse datasets, the development of inclusive and effective NLP tools remains out of reach (Barocas et al., 2019).
\subsection{Dialectal Bias in Existing Datasets}
Even the best available datasets for Levantine Arabic are biased toward specific regional dialects. A prominent case in point is the Levantine Hate Speech and Abusive Language (L-HSAB) Twitter dataset—the first and only publicly available dataset dedicated to hate speech and abusive language in Levantine Arabic (Mulki et al., 2019). While L-HSAB is invaluable due to its size and scope, it disproportionately focuses on Lebanese Arabic. This bias stems primarily from its data collection methodology, which involved extracting tweets using keywords related to Lebanese political figures and events (Barocas and Selbst, 2016).
The most frequently mentioned entities in L-HSAB are predominantly Lebanese. "Gebran Bassil," a Lebanese politician and leader of the Free Patriotic Movement, is mentioned over 1,000 times. The term "Lebanon" appears around 250 times, and "Wiam Wahhab," another Lebanese politician and journalist, is mentioned approximately 200 times. This concentration on specific individuals and topics skews the dataset toward Lebanese political discourse, thereby overlooking the linguistic and sociopolitical nuances present in other Levantine regions.
This skew introduces significant bias, as the linguistic features, idiomatic expressions, and even manifestations of hate speech in Lebanese Arabic differ markedly from those in other Levantine dialects. For instance, certain derogatory terms or politically charged phrases common in Lebanese discourse may be absent or hold different connotations in Syrian or Jordanian contexts. A term like "زعران" ("za‘ran", meaning "thugs" in Lebanese Arabic) is a strong insult in Lebanon but does not carry the same weight in Syrian Arabic. Conversely, a Syrian expression such as "شبيحة" ("shabbiha", referring to pro-regime militias) is a loaded term in Syria but might not evoke the same response or understanding among Lebanese speakers (Üngör, 2020).
Moreover, the focus on specific events and actors further narrows the dataset's applicability. The political landscape and issues prevalent in Lebanon are unique and may not reflect the concerns or conflicts in Syria, Jordan, or Palestine. For example, hate speech related to Lebanese political parties like the Free Patriotic Movement or events like the Lebanese protests of 2019 would not encompass the types of hate speech prevalent in other regions.
As a result, models trained on datasets like L-HSAB are less likely to generalize effectively to other dialects. They may fail to detect hate speech in Syrian, Jordanian, or Palestinian Arabic due to differences in vocabulary, idioms, and sociopolitical references. This limitation reduces the overall effectiveness of hate speech detection tools across the Levantine region.
Dialectal bias in data collection also has ethical implications. By privileging one regional dialect over others, these datasets risk marginalizing communities whose voices are already underrepresented in the digital sphere. The exclusion of specific dialects perpetuates existing inequalities and limits the ability of affected communities to engage meaningfully in online discourse. For instance, Syrian or Palestinian users may face hate speech that goes undetected by models biased toward Lebanese Arabic, leaving them vulnerable to online abuse without adequate protection.
Furthermore, this bias can lead to misclassification, where non-hateful speech in one dialect is incorrectly flagged as abusive because the model does not accurately interpret the linguistic nuances of that dialect. Conversely, actual hate speech may go undetected in underrepresented dialects, allowing harmful content to proliferate.
In summary, while datasets like L-HSAB are crucial stepping stones in advancing hate speech detection for Levantine Arabic, their dialectal and topical biases highlight the need for more inclusive data collection strategies. Expanding the dataset to include a broader range of dialects and sociopolitical contexts is essential. By doing so, we can develop NLP tools that are both effective and equitable, ensuring that all communities within the Levantine region are adequately represented and protected in the digital space (Barocas et al., 2019).
\subsection{Limitations of Pre-trained Embeddings and the Need for Domain-Specific Models}
In addition to dataset biases, the choice of language models and embeddings plays a crucial role in the effectiveness of hate speech detection systems. Our analyses and experiments on the L-HSAB dataset underscore the limitations of relying on pre-trained embeddings that are not tailored to the specific linguistic characteristics of Levantine Arabic.
We evaluated several embedding techniques to assess their performance in detecting hate speech within the L-HSAB dataset. The methods included traditional approaches like Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF), as well as neural embeddings such as pre-trained Arabic fastText, custom-trained Word2Vec on Levantine Arabic data, pre-trained GoogleNews Word2Vec, and pre-trained GloVe embeddings (Harris, 1954; Jones, 1972; Bojanowski et al., 2016; Mikolov et al., 2013; Pennington et al., 2014).
\paragraph{Effective Techniques.} Our experiments revealed that BoW, TF-IDF, pre-trained Arabic fastText, and custom-trained Word2Vec embeddings significantly outperformed the other methods. These techniques achieved higher F1 scores, indicating better precision and recall in identifying hate speech content. The success of these models can be attributed to their alignment with the linguistic properties of Levantine Arabic, either through their focus on Arabic text or customization to the specific dialect.
\paragraph{Ineffective Techniques.} In stark contrast, pre-trained embeddings like GoogleNews Word2Vec and GloVe, which are primarily trained on English corpora, scored nearly 0\% in F1 metrics. This drastic underperformance highlights a critical issue: models trained predominantly on English data fail to recognize or interpret Arabic text accurately. Consequently, they are ineffective for tasks involving Levantine Arabic hate speech detection.
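A minimal sketch of this comparison, using one of the effective representations (TF-IDF over character n-grams) feeding a linear classifier scored with macro F1, might look as follows. The transliterated toy texts and labels are hypothetical stand-ins for L-HSAB tweets, which are not reproduced here; the point is the shape of the pipeline, not the numbers it produces.

```python
# Sketch: TF-IDF features + linear classifier, evaluated with macro F1.
# The toy corpus below is illustrative only (hypothetical stand-in for L-HSAB).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Hypothetical transliterated examples; label 1 = abusive, 0 = benign.
train_texts = ["ruhu ya zeran", "shukran ktir habibi",
               "ya shabbih wlak", "sabah el kheir"] * 5
train_labels = [1, 0, 1, 0] * 5
test_texts = ["wlak ruhu ya zeran", "sabah el kheir habibi"]
test_labels = [1, 0]

pipeline = make_pipeline(
    # Character n-grams are somewhat robust to dialectal spelling variation.
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(train_texts, train_labels)
preds = pipeline.predict(test_texts)
macro_f1 = f1_score(test_labels, preds, average="macro")
print(f"macro F1: {macro_f1:.2f}")
```

Swapping the vectorizer for averaged fastText or Word2Vec vectors (or English-only GloVe, to reproduce the failure mode above) changes only the first pipeline stage, which makes the comparison straightforward to run.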
These findings emphasize the importance of domain-specific adaptations in NLP models. Utilizing embeddings and language models that are trained or fine-tuned on Levantine Arabic data is essential for capturing the unique linguistic features and nuances of the dialect. Relying on generic, pre-trained models not only reduces accuracy but also risks missing or misclassifying hate speech, thereby undermining the effectiveness of detection systems.
By investing in domain-specific models, researchers and technologists can create more accurate and reliable hate speech detection tools. Such tools will be better equipped to handle the linguistic diversity of Levantine Arabic, ultimately contributing to a safer and more inclusive online environment for speakers of all regional dialects.
\section{Ethical Considerations in Hate Speech Detection}
The ethical challenges of hate speech detection in Levantine Arabic extend beyond issues of data bias. False positives—where non-hate speech is misclassified—can result in the suppression of legitimate cultural expressions, especially in a region where language is tightly bound to identity. A prominent example is the misclassification of the Arabic word "شهيد" ("shaheed", meaning "martyr") by social media platforms like Meta (Oversight Board, 2024). The term holds significant cultural and religious importance, often used to honor individuals who have died for a sacred cause. However, automated moderation systems have frequently removed content containing "shaheed," interpreting it as a reference to terrorism or violent extremism due to its association with entities on terrorism watchlists. This over-enforcement not only suppresses legitimate expression but also disproportionately affects Muslim and Arabic-speaking communities, infringing on their ability to communicate shared values and experiences (Diaz, 2023).
Misclassifications of regionally specific idioms or expressions as hate speech alienate speakers and contribute to the erasure of linguistic diversity, effectively elevating one perspective over others. A colloquial phrase used affectionately in one Levantine dialect might be misunderstood and flagged as offensive by a system not attuned to regional nuances. Conversely, false negatives—where actual hate speech goes undetected—allow harmful narratives to spread unchecked, fueling further violence. For example, derogatory terms or slurs specific to a particular region or group may go unnoticed by moderation systems trained primarily on other dialects or on Modern Standard Arabic. In the context of the Syrian conflict, hate speech containing region-specific pejoratives aimed at certain ethnic or sectarian groups might not be recognized as such by models lacking comprehensive dialectal data. This oversight enables the propagation of inflammatory content that can exacerbate tensions and incite real-world violence.
Ethically, technologists and researchers have a responsibility to develop models that not only detect hate speech but do so in a way that respects the linguistic and cultural integrity of Levantine Arabic. Practically, these considerations are especially pressing in a conflict-ridden region like the Levant, where failure to identify and address hateful content undermines efforts to promote peace and stability. This requires engaging with local communities to better understand the sociolinguistic context in which hate speech occurs and ensuring that datasets are representative of the entire dialect continuum. By incorporating diverse linguistic inputs and cultural insights, developers can create more nuanced models that differentiate between harmful content and legitimate expression, thereby protecting both free speech and community safety.
\section{Towards More Culturally Aware Language Technologies}
Addressing the challenges of hate speech detection in Levantine Arabic requires practical solutions that consider the language's unique properties. Bergman and Diab (2022) offer valuable guidelines for developing effective and ethically sound NLP tools for underrepresented dialects. By incorporating these recommendations, we can create language technologies that are culturally aware and inclusive, specifically tailored to Levantine Arabic.
\subsection{Engaging Local Communities}
Engaging local communities is essential for capturing the full spectrum of dialectal variations and cultural contexts within Levantine Arabic. The language's rich diversity necessitates collaboration with native speakers from various regions. For instance, idioms, pronunciations, and expressions in Syrian Arabic differ markedly from those in Jordanian Arabic. Involving annotators and experts who possess both language proficiency and deep understanding of local contexts ensures that the linguistic nuances specific to each dialect are accurately represented (Radiya-Dixit and Bogen, 2024).
\subsection{Rethinking Data Collection and Annotation}
To overcome dialectal bias, new data collection and annotation strategies must account for Levantine Arabic's specific properties. Bergman and Diab (2022) recommend that data sampling be representative of the target user cohort's speech and orthographies. Given the significant dialectal variations, stratified sampling techniques are crucial for comprehensively capturing the linguistic landscape. Annotation processes should prioritize using annotators proficient in specific regional dialects and familiar with local sociopolitical contexts (Caliskan et al., 2017; Radiya-Dixit and Bogen, 2024). For example, annotators from rural Jordan better understand local colloquialisms than those from urban areas or other countries. Implementing sample routing systems ensures content is reviewed by those best equipped to handle specific dialects, reducing misinterpretation and increasing annotation accuracy (Bergman and Diab, 2022).

Ethical considerations should also guide data collection practices. Researchers must be mindful of potential consequences when collecting data from conflict-affected regions, as certain linguistic features can carry sociopolitical implications. Avoiding the marginalization of any group or exacerbation of existing inequalities is paramount (Bergman and Diab, 2022). Providing transparent annotation guidelines and support systems for annotators is also critical.
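The stratified sampling and routing strategy recommended above can be sketched briefly. The dialect labels, post texts, and annotator pools here are hypothetical (L-HSAB ships no dialect annotations), so this illustrates the proposed collection design rather than any existing pipeline.

```python
import random
from collections import defaultdict

# Hypothetical candidate pool: (text, dialect) pairs. In practice the
# dialect label would come from metadata or an upstream dialect identifier.
posts = [
    ("post A", "syrian"), ("post B", "syrian"), ("post C", "syrian"),
    ("post D", "lebanese"), ("post E", "lebanese"),
    ("post F", "jordanian"), ("post G", "palestinian"),
]
QUOTA = 2  # per-dialect quota, so no single dialect dominates the sample

# Stratify: group candidates by dialect, then sample up to the quota from each.
by_dialect = defaultdict(list)
for text, dialect in posts:
    by_dialect[dialect].append(text)

rng = random.Random(0)  # fixed seed for reproducible sampling
sample = {d: rng.sample(texts, min(QUOTA, len(texts)))
          for d, texts in by_dialect.items()}

# Route: send each sampled post to annotators proficient in that dialect.
annotator_pools = {
    "syrian": ["annotator_1"], "lebanese": ["annotator_2"],
    "jordanian": ["annotator_3"], "palestinian": ["annotator_4"],
}
assignments = [(text, annotator_pools[d][0])
               for d, texts in sample.items() for text in texts]
```

The per-dialect quota is the key design choice: it trades raw sample size for balanced coverage, directly countering the single-dialect skew observed in L-HSAB.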
\subsection{Prioritizing Ethical Design}
Developing NLP tools for Levantine Arabic must be grounded in ethical design principles that account for the language's unique properties. This involves prioritizing high-quality annotation over breadth when resources are limited, ensuring models are accurate and respectful of the communities they serve. Practitioners should carefully consider the granularity of language divisions within Levantine Arabic and strive for inclusivity without compromising annotation quality (Bergman and Diab, 2022). Providing support systems for annotators is essential, especially given potential exposure to disturbing content in conflict-affected regions. Access to psychological support and reasonable work conditions safeguards annotators' well-being and contributes to reliable annotations. By adopting these strategies, researchers can develop hate speech detection models that are culturally aware and ethically responsible. Incorporating the playbook proposed by Bergman and Diab (2022) offers a practical roadmap for creating NLP applications equipped to handle Levantine Arabic's dialectal diversity and cultural contexts, promoting a more inclusive digital environment.
\section{Conclusion}
Detecting hate speech in Levantine Arabic poses unique cultural, linguistic, and ethical challenges due to intricate dialectal variations and biased datasets across Syria, Jordan, Palestine, and Lebanon. This highlights an urgent need for more inclusive and representative NLP approaches. By engaging local communities, reimagining data collection methods, and embedding ethical considerations into technology design, we can develop language tools that effectively identify hate speech while honoring the Levant's rich linguistic diversity. This paper advocates for renewed cultural sensitivity in NLP applications targeting Levantine Arabic. By addressing sociolinguistic complexities and ethical implications, we can create tools that genuinely serve all speakers, regardless of regional or political backgrounds, thus enhancing hate speech detection accuracy and promoting a more just digital environment throughout the Arab world.
\section*{Acknowledgments}
% We would like to express our sincere gratitude to Stevie Bergman at Google DeepMind and Brown University, Nagham El Karhili at the Global Internet Forum to Counter Terrorism, Deepak Kumar at Bryn Mawr College, and Manar Darwish at Haverford College for their insightful feedback and guidance. Special thanks to the authors of the L-HSAB dataset—Hala Mulki, Hatem Haddad, Chedi Bechikh Ali, and Halima Alshabani—for their valuable work on the dataset that significantly contributed to this study. We also extend our appreciation to the Computer Science Department at Brown University and Google Research for offering the exploreCSR program and for their funding support. Lastly, we are grateful to the Marian E. Koshland Integrated Natural Sciences Center at Haverford College for their funding and support.
% Bibliography entries for the entire Anthology, followed by custom entries
%\bibliography{anthology,custom}
% Custom bibliography entries only
\bibliography{custom}
\end{document}