Detecting biases in natural language understanding (NLU) for African American Vernacular English (AAVE) is crucial to developing inclusive natural language processing (NLP) systems. To address dialect-induced performance discrepancies, we introduce AAVENUE (AAVE Natural Language Understanding Evaluation), a benchmark for evaluating large language model (LLM) performance on NLU tasks in AAVE and Standard American English (SAE). AAVENUE builds upon and extends existing benchmarks like VALUE, replacing deterministic syntactic and morphological transformations with a more flexible methodology that leverages LLM-based translation with few-shot prompting, improving performance across several evaluation metrics when translating key tasks from the GLUE and SuperGLUE benchmarks. We compare AAVENUE and VALUE translations using five popular LLMs and a comprehensive set of metrics including fluency, BARTScore, quality, coherence, and understandability. Additionally, the fluency of the AAVENUE translations is validated by annotations from AAVE speakers. Our evaluations reveal that LLMs consistently perform better on SAE tasks than on their AAVE-translated versions, underscoring inherent biases and highlighting the need for more inclusive NLP models.
The AAVENUE benchmark is designed to evaluate LLM performance on NLU tasks across AAVE and SAE. We extended existing benchmarks by leveraging few-shot prompted translations using GPT-4o-mini, enhancing flexibility compared to deterministic linguistic transformations used by benchmarks like VALUE. Below, we explain our task selection, translation process, validation, and evaluation metrics.
We selected five key tasks from the GLUE and SuperGLUE benchmarks to test model performance in SAE and AAVE. The chosen tasks assess different aspects of NLU:
Task | Description | Aspect Tested |
---|---|---|
BoolQ | Yes/no questions based on a passage. | Comprehension and information processing. |
MultiRC | Answering questions requiring connection of information across a passage. | Handling complex and interconnected texts. |
SST-2 | Sentiment analysis of movie reviews. | Understanding sentiment in different dialects. |
COPA | Choosing the most plausible outcome or cause from two alternatives. | Cause-and-effect reasoning. |
WSC | Determining which noun a pronoun refers to in ambiguous contexts. | Pronoun resolution and dialectal nuances. |
We used GPT-4o-mini with few-shot prompting to translate 1000 data points for each task from SAE to AAVE. Below are examples of translations for each task:
Task | SAE Example | AAVE Example |
---|---|---|
SST-2 | The movie was preachy and poorly acted. | Ain't much to like, it be preachy and acted bad. |
BoolQ | Can I be sacked for falling asleep at work? | Can I get fired for fallin' asleep on the job? |
COPA | Man lost the competition. Choice 1: The competition was sabotaged. Choice 2: He intimidated his competitors. (Selected: Choice 1) | Man lost da competition. Choice 1: Da competition got messed up. Choice 2: He scared off his competitors. (Selected: Choice 1) |
WSC | Sam Goodman's biography of the Spartan general Xenophanes shows the difficulties he faced in his childhood. | Sam Goodman's biography on that Spartan general Xenophanes show y'all the tough times he had growin' up. |
MultiRC | Paragraph: A stranger in town... How does Jason react to the stranger's presence? | Paragraph: A stranger in town... How Jason be actin' to that stranger around? |
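As a rough illustration of this translation step, the sketch below shows how a few-shot SAE-to-AAVE translation call could be implemented with the OpenAI Python client. The system prompt and in-context examples are illustrative placeholders drawn from the table above, not the exact prompts used to construct AAVENUE.

```python
# Minimal sketch of few-shot SAE -> AAVE translation with GPT-4o-mini.
# The system prompt and few-shot pairs are illustrative placeholders,
# not the exact prompts used to build AAVENUE.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FEW_SHOT = [
    ("The movie was preachy and poorly acted.",
     "Ain't much to like, it be preachy and acted bad."),
    ("Can I be sacked for falling asleep at work?",
     "Can I get fired for fallin' asleep on the job?"),
]

def translate_to_aave(sae_text: str) -> str:
    messages = [{"role": "system",
                 "content": "Translate Standard American English into natural, "
                            "fluent African American Vernacular English. "
                            "Preserve the original meaning and any labels."}]
    # Few-shot examples are supplied as prior user/assistant turns.
    for sae, aave in FEW_SHOT:
        messages.append({"role": "user", "content": sae})
        messages.append({"role": "assistant", "content": aave})
    messages.append({"role": "user", "content": sae_text})

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.3,
    )
    return response.choices[0].message.content.strip()

print(translate_to_aave("Sam lost the competition because he was intimidated."))
```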
We validated our translations using several key metrics. The results from our evaluations are summarized below:
Metric | Description |
---|---|
Fluency | Evaluates the grammatical correctness and natural flow of the text (scored out of 100). |
Coherence | Assesses the logical flow and consistency of the text (scored out of 100). |
Understandability | Determines how easily the translation can be understood (scored out of 100). |
Quality | Overall quality assessment of the translation (scored out of 100). |
BARTScore | Measures how closely the AAVE translation aligns with the original SAE sentence, with lower scores indicating better alignment. |
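For BARTScore, the sketch below gives a rough approximation of how such an alignment score can be computed with an off-the-shelf BART model from Hugging Face; it reports the per-token negative log-likelihood of the AAVE text given the SAE source, so lower values indicate closer alignment, matching the convention above. The official BARTScore implementation differs in model choice and details, so this is purely illustrative.

```python
# Rough sketch of a BARTScore-style alignment score between an SAE source
# and its AAVE translation, using an off-the-shelf BART model.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
model.eval()

def alignment_score(sae: str, aave: str) -> float:
    """Per-token negative log-likelihood of the AAVE text given the SAE text.
    Lower values indicate closer alignment."""
    inputs = tokenizer(sae, return_tensors="pt", truncation=True)
    labels = tokenizer(aave, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        loss = model(**inputs, labels=labels).loss  # mean NLL over target tokens
    return loss.item()

print(alignment_score(
    "Can I be sacked for falling asleep at work?",
    "Can I get fired for fallin' asleep on the job?",
))
```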
We recruited 10 fluent AAVE speakers from the Bronx and Queens, NY, to evaluate the cultural and linguistic authenticity of the AAVE translations. Each translation was rated on a scale of 1 to 10 for its accuracy in reflecting AAVE. Below are the average scores from the human validators:
Task | Average Score (Out of 10) |
---|---|
BoolQ | 7.02 |
MultiRC | 7.27 |
SST-2 | 7.09 |
COPA | 7.22 |
WSC | 7.25 |
We compared the AAVENUE translations to those generated by the VALUE benchmark across several large language models (LLMs) using binary comparison tasks. Below are the results of these comparisons:
Task | Model | AAVENUE Preference (%) | VALUE Preference (%) | About the Same (%) |
---|---|---|---|---|
BoolQ | GPT-4-turbo | 94.51% | 4.62% | 0.88% |
BoolQ | GPT-4o-mini | 88.79% | 10.33% | 0.88% |
COPA | GPT-4o-mini | 90.42% | 9.38% | 0.21% |
MultiRC | Gemini-1.5-flash | 93.85% | 6.15% | 0.00% |
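A minimal sketch of how such a binary preference comparison could be posed to an LLM is shown below. The prompt wording, answer parsing, and default model handle are assumptions for illustration, not the exact setup used to produce the numbers above.

```python
# Illustrative sketch of a binary preference comparison between an AAVENUE
# translation and a VALUE translation of the same SAE sentence.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "You will see a Standard American English sentence and two candidate AAVE "
    "translations. Answer with exactly 'A', 'B', or 'SAME' to indicate which "
    "translation is more fluent and natural, or whether they are about the same.\n\n"
    "SAE: {sae}\nTranslation A: {a}\nTranslation B: {b}\nAnswer:"
)

def compare(sae: str, aavenue: str, value: str, model: str = "gpt-4-turbo") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT.format(sae=sae, a=aavenue, b=value)}],
        temperature=0.0,
    )
    answer = response.choices[0].message.content.strip().upper()
    return {"A": "AAVENUE", "B": "VALUE"}.get(answer, "ABOUT_THE_SAME")
```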
Our evaluations reveal significant performance differences between SAE and AAVE translations across five NLU tasks. Large language models (LLMs) consistently performed better on SAE tasks than their AAVE counterparts. Below, we summarize the accuracy scores and comparison metrics used to evaluate model performance across these tasks.
We evaluated the accuracy of translations across five key tasks using five popular LLMs. The following tables present the accuracy scores for GPT and Gemini models across both SAE and AAVE tasks.
Task | GPT-4o-mini (SAE/AAVE) | GPT-4-turbo (SAE/AAVE) | GPT-4o (SAE/AAVE) |
---|---|---|---|
SST-2 | 90.40% / 88.40% (-2.0) | 94.00% / 92.80% (-1.2) | 88.80% / 87.30% (-1.5) |
BoolQ | 88.29% / 85.29% (-3.0) | 88.09% / 86.49% (-1.6) | 89.19% / 86.89% (-2.3) |
COPA | 95.40% / 93.20% (-2.2) | 97.60% / 96.80% (-0.8) | 97.20% / 96.40% (-0.8) |
WSC | 60.03% / 57.90% (-2.1) | 69.60% / 68.69% (-0.9) | 70.36% / 67.02% (-3.3) |
MultiRC | 84.50% / 72.00% (-12.5) | 86.20% / 73.70% (-12.5) | 87.50% / 71.30% (-16.2) |
Task | Gemini-1.5-Flash (SAE/AAVE) | Gemini-1.5-Pro (SAE/AAVE) |
---|---|---|
SST-2 | 87.70% / 87.10% (-0.6) | 92.00% / 91.40% (-0.6) |
BoolQ | 89.69% / 87.29% (-2.4) | 89.49% / 85.89% (-3.6) |
COPA | 91.40% / 92.00% (+0.6) | 97.40% / 95.80% (-1.6) |
WSC | 48.78% / 48.48% (-0.3) | 51.37% / 51.22% (-0.2) |
MultiRC | 84.10% / 70.70% (-13.4) | 85.90% / 71.90% (-14.0) |
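The accuracy gaps in these tables reduce to a simple paired comparison: each example is scored once with its SAE text and once with its AAVE translation. The sketch below shows this computation under assumed field names (`sae_text`, `aave_text`, `label`); the prediction function stands in for whichever LLM is being evaluated.

```python
# Minimal sketch of the SAE-vs-AAVE accuracy comparison for a classification
# task such as SST-2. Field names and the predict callable are illustrative.
from typing import Callable, Dict, List

def accuracy(examples: List[Dict], predict: Callable[[str], str], text_key: str) -> float:
    correct = sum(predict(ex[text_key]) == ex["label"] for ex in examples)
    return 100.0 * correct / len(examples)

def sae_aave_gap(examples: List[Dict], predict: Callable[[str], str]) -> None:
    """Each example holds the original SAE text, its AAVE translation, and a gold label."""
    sae_acc = accuracy(examples, predict, "sae_text")
    aave_acc = accuracy(examples, predict, "aave_text")
    print(f"SAE:  {sae_acc:.2f}%")
    print(f"AAVE: {aave_acc:.2f}%")
    print(f"Gap:  {aave_acc - sae_acc:+.2f} points")
```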
The accuracy scores reveal consistent performance drops when models handle AAVE translations. Tasks like MultiRC and WSC showed the largest accuracy drops, indicating challenges in reading comprehension and pronoun resolution. While GPT-4-turbo exhibited relatively smaller drops, models like GPT-4o-mini struggled more with AAVE translations, particularly in contextually complex tasks. This suggests a clear need for more inclusive training data to better handle AAVE.
We analyzed the intersection over union (IoU) between incorrect answers in SAE and AAVE translations to understand whether models faced similar difficulties across dialects. The following table summarizes the IoU percentages:
Task | GPT-4o-mini | GPT-4-turbo | GPT-4o | Gemini-1.5-Flash | Gemini-1.5-Pro |
---|---|---|---|---|---|
SST-2 | 8.40% | 5.10% | 9.80% | 10.40% | 6.20% |
BoolQ | 10.21% | 10.71% | 8.91% | 8.51% | 8.41% |
COPA | 3.00% | 1.60% | 2.00% | 5.80% | 1.80% |
WSC | 35.56% | 24.01% | 25.68% | 49.54% | 44.53% |
MultiRC | 9.60% | 9.00% | 8.30% | 9.90% | 7.90% |
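Here, IoU is computed over the sets of examples a model answers incorrectly in each dialect: the size of the intersection divided by the size of the union, expressed as a percentage. A minimal sketch (with hypothetical example IDs) follows.

```python
# Sketch of the intersection-over-union (IoU) of incorrectly answered examples
# across the SAE and AAVE versions of a task. Example IDs are hypothetical.
def error_iou(sae_errors: set, aave_errors: set) -> float:
    """IoU of the sets of example IDs answered incorrectly in each dialect."""
    union = sae_errors | aave_errors
    if not union:
        return 0.0
    return 100.0 * len(sae_errors & aave_errors) / len(union)

# e.g. IDs of misclassified examples in each dialect
print(f"{error_iou({3, 17, 42, 88}, {17, 42, 95, 130}):.2f}% overlap")
```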
Our IoU analysis indicates that the challenges in handling AAVE are often dialect-specific. For most tasks, the IoU scores reveal minimal overlap in incorrect responses between SAE and AAVE, suggesting that models encounter distinct challenges when processing AAVE texts. WSC is the exception, showing substantially higher overlap and indicating that pronoun resolution is difficult in both dialects.
While AAVENUE presents a comprehensive benchmark for evaluating large language models (LLMs) across Standard American English (SAE) and African American Vernacular English (AAVE), it is important to acknowledge several limitations. First, our benchmark focuses primarily on a select number of tasks from the GLUE and SuperGLUE benchmarks, which may not fully capture the broad range of real-world applications where dialectal differences play a role. Additionally, AAVE itself varies across regions and communities, and while we validated our translations with fluent AAVE speakers—who were compensated for their contributions—the inherent variability in AAVE may limit the generalizability of our findings.
Another limitation is our reliance on GPT-4o-mini for translations, which, despite being an advanced model, may still reflect biases present in its pre-training data. This reliance on a single model restricts the diversity of approaches we could explore for reducing translation biases. Furthermore, AAVENUE currently focuses exclusively on AAVE and SAE, leaving out other underrepresented dialects that could benefit from similar evaluations. Expanding the benchmark to include a wider range of dialects would provide a more complete picture of LLM inclusivity.
Lastly, while we employed various quantitative metrics such as fluency and coherence to evaluate the translations, a deeper qualitative analysis involving AAVE speakers could offer better insights into the cultural and linguistic nuances that automated metrics might overlook. These limitations highlight important areas for future research and development in ensuring more equitable and representative natural language processing systems.
Our research is guided by ethical principles aimed at promoting fairness and inclusivity, particularly in evaluating and addressing dialectal biases in large language models (LLMs). To ensure cultural and linguistic authenticity, we collected original data and recruited fluent AAVE speakers to validate our translations. All participants provided informed consent, and the AAVE speakers were compensated for their time and contributions.
We took careful steps to avoid potential harm and bias throughout the research process. The data collection and evaluation processes adhered to ethical guidelines, and we prioritized transparency in reporting our findings. Our goal is to contribute to the development of more inclusive natural language processing (NLP) systems that serve underrepresented dialects effectively.
By making our code and evaluation methods publicly available, we encourage further collaboration and research in this area. We believe this transparency will support ongoing efforts to create equitable NLP systems, ensuring these technologies are fair and reliable across diverse linguistic communities.
Our work builds upon several important benchmarks and studies in the field of natural language processing. The GLUE benchmark provides a platform for evaluating language model performance across a variety of standard linguistic tasks, primarily focusing on Standard American English (SAE). SuperGLUE extends this by introducing more challenging tasks that require nuanced understanding and reasoning.
VALUE (VernAcular Language Understanding Evaluation) addresses dialect disparity in natural language understanding by using a set of linguistic transformation rules to evaluate models on African American Vernacular English (AAVE). However, its deterministic approach can limit generalizability across different contexts.
Additionally, studies such as "The Risk of Racial Bias in Hate Speech Detection" by Sap et al. and "Language (Technology) is Power: A Critical Survey of 'Bias' in NLP" by Blodgett et al. have highlighted the biases that exist in language technologies, underscoring the importance of addressing these issues to develop fair and equitable NLP systems.
Our benchmark, AAVENUE, seeks to expand upon these foundations by providing a more comprehensive evaluation of dialectal bias, specifically focusing on the performance of large language models in handling tasks in AAVE compared to SAE.