Developing synthetic datasets for improving multilabel text classification in the Sustainable Development domain
Faculty Member's Name: Marina G Erechtchoukova
Faculty Member's Email Address: marina@yorku.ca
Department/School: School of Information Technology
Project Title: Developing synthetic datasets for improving multilabel text classification in the Sustainable Development domain
Description of Research Project
Applying pre-trained Large Language Models (LLMs) to downstream Natural Language Processing (NLP) tasks in a specific domain requires fine-tuning the models on relevant corpora of textual documents. For domains with limited resources, assembling such corpora is challenging, which limits the use of LLMs in practical technological solutions. The aim of this project is to investigate data augmentation techniques for developing synthetic datasets used to fine-tune selected pre-trained LLMs and adapt them to the Sustainability domain. The techniques will be evaluated by comparing the performance of the fine-tuned models on a multilabel text classification task using standard NLP metrics. The project will provide insights into the relative importance of the quality of a fine-tuning corpus versus the complexity of the selected model. The results will become part of a framework for adapting an LLM to a low-resource domain.
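To illustrate the kind of evaluation involved, the following is a minimal sketch of how multilabel predictions could be scored. The posting only specifies "standard NLP metrics", so the choice of micro-averaged F1, macro-averaged F1, and Hamming loss is an assumption, and the function name `multilabel_scores` and the SDG label strings are hypothetical.

```python
from typing import Dict, List, Set


def multilabel_scores(
    y_true: List[Set[str]], y_pred: List[Set[str]], labels: List[str]
) -> Dict[str, float]:
    """Score multilabel predictions (assumed metrics, for illustration only)."""
    # Per-label counts of true positives, false positives, false negatives.
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}

    for true_set, pred_set in zip(y_true, y_pred):
        for label in labels:
            if label in pred_set and label in true_set:
                tp[label] += 1
            elif label in pred_set:
                fp[label] += 1
            elif label in true_set:
                fn[label] += 1

    def f1(t: int, p: int, n: int) -> float:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the label never occurs.
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    total_fp = sum(fp.values())
    total_fn = sum(fn.values())

    return {
        # Micro-F1 pools the counts over all labels before computing F1.
        "micro_f1": f1(sum(tp.values()), total_fp, total_fn),
        # Macro-F1 averages per-label F1, so rare labels weigh as much as common ones.
        "macro_f1": sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels),
        # Hamming loss is the fraction of wrong (document, label) decisions.
        "hamming_loss": (total_fp + total_fn) / (len(y_true) * len(labels)),
    }


# Hypothetical example: three documents tagged with a subset of three SDG labels.
labels = ["SDG6", "SDG7", "SDG13"]
y_true = [{"SDG6", "SDG13"}, {"SDG7"}, {"SDG13"}]
y_pred = [{"SDG6"}, {"SDG7", "SDG13"}, {"SDG13"}]
scores = multilabel_scores(y_true, y_pred, labels)
print(scores)
```

Reporting both averages is useful when labels are imbalanced: micro-F1 is dominated by frequent labels, while macro-F1 shows whether rarely occurring labels are also classified well.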
Undergraduate Student Responsibilities
1. Learn data augmentation approaches;
2. Build corpora relevant to the 17 Sustainable Development Goals by collecting documents publicly available online;
3. Implement text preprocessing following the selected text paraphrasing pipeline;
4. Develop synthetic datasets using language models following the selected text paraphrasing pipeline;
5. Participate in planning and preparing computational experiments and be responsible for their execution;
6. Participate in the comparative analysis of the fine-tuned models.
Qualifications Required
1. Ability to write and debug computer code in Python or R;
2. Ability to read and understand software manuals and use the software to run computational experiments;
3. Interest in Artificial Intelligence and Natural Language Processing is a must;
4. Desire to learn and explore new computational tools is a must;
5. Willingness to run a large number of computational experiments;
6. Understanding of basic statistics;
7. Familiarity with cloud-based platforms and text preprocessing techniques is an asset.

Interested in this project posting?
Submit your resumé and a unique cover letter for this project to the faculty supervisor. Deadline: February 6, 2026, at 4 p.m.
