Tonmoy Talukder, G M Shahariar
Abstract
This paper introduces Bangla Key2Text, a large-scale dataset of 2.6 million Bangla keyword–text pairs designed for keyword-driven text generation in a low-resource language. The dataset is constructed using a BERT-based keyword extraction pipeline applied to millions of Bangla news texts, transforming raw articles into structured keyword–text pairs suitable for supervised learning. To establish baseline performance on this new benchmark, we fine-tune two sequence-to-sequence models, mT5 and BanglaT5, and evaluate them using multiple automatic metrics and human judgments. Experimental results show that task-specific fine-tuning substantially improves keyword-conditioned text generation in Bangla compared to zero-shot large language models. The dataset, trained models, and code are publicly released to support future research in Bangla natural language generation and keyword-to-text generation tasks.
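The dataset pairs extracted keywords with their source texts for supervised sequence-to-sequence training. As a minimal sketch of what such a pair might look like, the snippet below joins a keyword list into a prompt-style source string and pairs it with the original text as the target; the prompt prefix and field names here are illustrative assumptions, not the paper's actual format.

```python
# Hypothetical sketch of forming a keyword–text training pair for a
# seq2seq model such as mT5 or BanglaT5. The prompt prefix and the
# "source"/"target" field names are assumptions for illustration.

def make_pair(keywords, text, prefix="generate text from keywords: "):
    """Join keywords into a source string and pair it with the
    original article text as the generation target."""
    source = prefix + ", ".join(keywords)
    return {"source": source, "target": text}

# Example with Bangla placeholder keywords and text.
pair = make_pair(["ঢাকা", "নির্বাচন"], "ঢাকায় আজ নির্বাচন অনুষ্ঠিত হয়েছে।")
print(pair["source"])
```

In a real fine-tuning setup, each such pair would be tokenized and fed to the encoder (source) and decoder (target) of the model.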
BibTeX Citation
@article{talukder2026bangla,
  title={Bangla Key2Text: Text Generation from Keywords for a Low Resource Language},
  author={Talukder, Tonmoy and Shahariar, G M},
  journal={arXiv preprint arXiv:2604.19508},
  year={2026}
}