EACL 2026 Abjad NLP Shared Task: Medical text classification in Arabic
A collaborative research challenge advancing Arabic natural language processing in the medical domain
Overview
Welcome to the medical text classification shared task at the Abjad NLP workshop in EACL 2026! This shared task brings together researchers and practitioners to develop and evaluate state-of-the-art models for processing Arabic medical texts.
Each row in the dataset contains a question-answer pair in Arabic under the column named "text", along with a category under the "category" column (the class to be predicted). There are 82 categories in total, and the dataset is considerably imbalanced across them, which makes the problem interesting.
Note that the value you need to predict is the integer in the "label" column, which encodes the class named in the "category" column. The category names were originally in Arabic; we translated them into English using an LLM to aid modeling.
These question-answer pairs are from the healthcare domain, and given the importance of NLP applications for healthcare in Abjad script languages such as Arabic, we are confident that this shared task will attract significant interest and have a positive impact on the community.
هيا نستمتع! (Let's have fun!)
Registration
Sign up for the competition on Kaggle after filling out the Google form!
Getting Started
To help you get started, we have shared a sample Colab notebook in which we fine-tune CamelBERT.
Important Dates
All deadlines are 11:59 PM UTC-12 (Anywhere on Earth)
Task Description
Task Format
Participants will develop systems to perform multi-class classification of Arabic medical text into 82 predefined categories. Each text instance must be assigned to exactly one category represented by an integer label between 0 and 81.
Dataset Information
The dataset consists of authentic medical-domain text in Arabic. Each row in the dataset contains:
- text: A medical-domain text segment written in Arabic
- category: The English name of the corresponding medical category
- label: The integer class label (0–81) that participants must predict
There are 82 categories in total, and the dataset exhibits notable class imbalance, making the task both challenging and practically important for real-world healthcare NLP applications.
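One common way to account for class imbalance is to weight each class inversely to its frequency when computing the training loss. The snippet below is a minimal sketch of that idea, not part of the official task materials: the helper name `balanced_class_weights` and the toy labels are illustrative assumptions, and the formula `n_samples / (n_classes * count)` is just one widely used heuristic.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class that is inversely proportional to its frequency.

    Hypothetical helper for illustration: weight = n_samples / (n_classes * count),
    so rare classes get weights above 1 and frequent classes get weights below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)  # only classes that actually occur in `labels`
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy label column (assumed data, not drawn from the shared-task dataset).
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
weights = balanced_class_weights(toy_labels)

for label in sorted(weights):
    print(f"label {label}: weight {weights[label]:.3f}")
```

In practice the labels would come from the "label" column of the training dataset, and the resulting weights could be passed to a weighted loss function in whichever framework you use.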
Dataset Links
- Training Dataset: Download here
- Evaluation Dataset (no labels): Download here
Evaluation Metric
Submissions will be evaluated using the macro-averaged F1 score across all 82 classes. This metric assigns equal weight to each category, encouraging solutions that perform well even on minority classes.
For more details about the macro F1 score, refer to the scikit-learn documentation.
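To make the metric concrete, here is a small pure-Python sketch of macro-averaged F1: compute F1 separately for each class, then take the unweighted mean. The function name `macro_f1` and the toy predictions are illustrative assumptions. Scoring a class with no true and no predicted instances as 0 is also an assumption, and the official scorer may treat that edge case differently.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores over `labels`."""
    f1_scores = []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); assumed to be 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example with three classes: the majority class is predicted well,
# while the single example of class 1 is missed entirely.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 2]

score = macro_f1(y_true, y_pred, labels=[0, 1, 2])
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}, macro F1 = {score:.3f}")
```

In this toy case accuracy is 5/6 while macro F1 is noticeably lower, because the missed minority class contributes an F1 of 0 with the same weight as the other two classes.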
Contact
For questions or clarifications, please contact the organising team.
We look forward to your participation in the AbjadNLP Medical Text Classification shared task and to advancing medical NLP for Arabic and other Abjad-script languages.
Submission Guidelines
📢 Sign Up on Kaggle
To participate in this shared task, please register through the Google form first (see the link at the top of the page), then join the competition on Kaggle.
Prediction File Format
You will submit your predictions as a CSV file with 2 columns: Id and Predicted.
Id,Predicted
0,34
1,76
2,43
Each row should contain:
- Id: Row identifier from the evaluation dataset
- Predicted: Your predicted integer label (0–81) corresponding to one of the 82 categories
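A file in this format could be produced with Python's standard `csv` module, as in the sketch below. The helper name `write_submission` and the range check are illustrative assumptions, and the example writes to an in-memory buffer so it runs without touching the disk.

```python
import csv
import io

NUM_CLASSES = 82  # labels are integers from 0 to 81


def write_submission(ids, predictions, out_file):
    """Write Id,Predicted rows to an open text file (hypothetical helper)."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["Id", "Predicted"])
    for row_id, label in zip(ids, predictions):
        label = int(label)
        if not 0 <= label < NUM_CLASSES:
            raise ValueError(f"label {label} for Id {row_id} is outside 0-81")
        writer.writerow([row_id, label])


# Reproduce the three example rows shown above.
buffer = io.StringIO()
write_submission([0, 1, 2], [34, 76, 43], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

To write to disk instead, pass a file opened with `open("submission.csv", "w", newline="")`; the file name here is only an example.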
Submission Instructions
- Download Data: Download the training and evaluation datasets from the provided Google Drive links
- Develop System: Build and train your classification model using the training dataset
- Generate Predictions: Run your system on the evaluation dataset to generate predictions
- Format Output: Create a CSV file with Id and Predicted columns as shown above
- Submit: Submit your CSV file through the submission portal before the deadline (December 31, 2025)
- Await Results: Final results will be released on January 2, 2026
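Because the evaluation dataset has no labels, one option during the "Develop System" step is to hold out part of the training data as a development set. With heavily imbalanced classes, a stratified split keeps rare categories represented on both sides. The following is a minimal stdlib-only sketch: the function name, the 20% fraction, the seed, and the toy labels are all assumptions for illustration, not an official protocol.

```python
import random
from collections import defaultdict


def stratified_split(labels, dev_fraction=0.1, seed=13):
    """Split row indices into train/dev so each class keeps roughly the same ratio.

    Classes with a single example stay entirely in the training split.
    """
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    train_idx, dev_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # At least one dev example per class when possible, but never
        # move the last remaining example out of the training split.
        n_dev = min(len(indices) - 1, max(1, round(len(indices) * dev_fraction)))
        dev_idx.extend(indices[:n_dev])
        train_idx.extend(indices[n_dev:])
    return sorted(train_idx), sorted(dev_idx)


# Toy imbalanced labels: 20 rows of class 0, 5 of class 1, 1 of class 2.
toy_labels = [0] * 20 + [1] * 5 + [2]
train_idx, dev_idx = stratified_split(toy_labels, dev_fraction=0.2)
print(len(train_idx), "training rows,", len(dev_idx), "development rows")
```

Scoring a model on such a development split with macro F1 gives a rough local estimate before submitting, although it will not necessarily match the leaderboard score.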
System Description Papers
Participants are highly encouraged to submit system description papers detailing their approaches. Submitting a paper to a shared task is a low-stress way to join the AI research community. Accepted system description papers will be part of the ACL Anthology. As long as you adhere to the system paper submission guidelines, your paper will most likely be accepted. It DOES NOT matter whether you topped the leaderboard or not: we are excited to see the progress you have made, and all of us will have something to learn from your explorations.
Writing a system description paper might sound challenging at first, but we are here to help! Here is a tutorial to get you started, and our contact information can be found at the bottom of this page.
Papers should cover:
- Model architecture and approach
- Preprocessing and feature engineering techniques
- Training procedure and hyperparameters
- External resources used (if any)
- Results and analysis
Paper submission deadline: January 13, 2026
Notification of acceptance: January 20, 2026
Camera-ready versions due: February 3, 2026
Contact for Submissions
For questions about submissions or technical issues, please contact the organizers.
Organizers
The EACL 2026 Abjad NLP Medical Text Classification Shared Task is organized by a team of researchers and practitioners specializing in natural language processing.
Niranjan Kumar M
Specialization in Data Science, Sr Data Scientist at Lowe's
Co-Organizer
LinkedIn Profile
Balaji Nagarajan
Master in Data Science, Senior Manager of Data Science at Lowe's
Co-Organizer
LinkedIn Profile
Imed Zitouni
PhD, Senior Director of Engineering at Meta
Editor-in-Chief at ACM TALLIP
Co-Organizer
LinkedIn Profile
For questions or clarifications, please contact the organising team at abjadnlpmedicaltextclassificat@gmail.com.