Impact
2.6M keyword-text pairs generated
Source
Mixed repos
Repos
1 linked service
Problem
Bangla Key2Text needed a scalable way to transform raw Bangla news sentences into keyword-conditioned examples. Manual keyword annotation would not scale to millions of rows, and general-purpose, English-centric keyword extractors are a poor fit for Bangla morphology and punctuation, and for the limited model coverage available to a low-resource language.
Research Goal
Generate reliable Bangla keyword-text pairs at dataset scale while keeping the pipeline reproducible in notebooks for model and metric comparison.
Extraction Pipeline
- Load large Bangla sentence CSVs from the research dataset workspace.
- Normalize punctuation and filter stop tokens before candidate ranking.
- Encode each sentence with Bangla transformer models from Hugging Face.
- Compute a sentence-level mean embedding and compare token embeddings against it.
- Rank candidate tokens by cosine similarity and keep the strongest keywords per sentence.
- Merge generated shards into release-ready keyword-text CSV files.
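As a minimal sketch of the normalization and stop-token filtering step listed above, the following assumes whitespace tokenization and a tiny placeholder stop list. `STOP_TOKENS`, `normalize`, and `candidate_tokens` are hypothetical names; the notebooks' actual cleanup rules are not shown here and may differ.

```python
import re
import unicodedata

# Placeholder stop-token list for illustration only (assumption, not the project's list).
STOP_TOKENS = {"এবং", "ও", "যে", "এই"}

# Bangla danda (U+0964), double danda (U+0965), common ASCII punctuation,
# and typographic quotes. Hyphens are left alone so compounds stay intact.
PUNCTUATION = re.compile(r"[\u0964\u0965,;:!?\"'()\[\]{}\u2018\u2019\u201c\u201d]+")


def normalize(sentence: str) -> str:
    """Apply Unicode NFC, replace punctuation with spaces, collapse whitespace."""
    text = unicodedata.normalize("NFC", sentence)
    text = PUNCTUATION.sub(" ", text)
    return " ".join(text.split())


def candidate_tokens(sentence: str, stop_tokens=STOP_TOKENS) -> list:
    """Whitespace-tokenize the normalized sentence and drop stop tokens."""
    return [tok for tok in normalize(sentence).split() if tok not in stop_tokens]


if __name__ == "__main__":
    print(candidate_tokens("আজ ঢাকা শহরে বৃষ্টি হবে, এবং তাপমাত্রা কমবে।"))
```

Filtering before ranking keeps stop tokens out of the candidate pool, so they never compete for the top positions.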
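```python
# Assumes PyTorch tensors; the original snippet does not show its imports.
from torch.nn.functional import cosine_similarity

def rank_keywords(tokens, embeddings):
    # embeddings: (num_tokens, hidden_dim); stop_words: a set defined elsewhere in the notebook
    sentence_vector = embeddings.mean(dim=0, keepdim=True)
    scores = cosine_similarity(embeddings, sentence_vector, dim=1).tolist()
    ranked = sorted(zip(tokens, scores), key=lambda item: item[1], reverse=True)
    return [token for token, _ in ranked if token not in stop_words]
```

Modeling Choices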
The notebooks compare a main Electra/BERT-style extractor with a BanglaBERT-base variant, plus non-transformer baselines. The core idea is intentionally simple: a useful keyword should be semantically close to the sentence representation while surviving language-specific cleanup.
- Transformer embeddings capture semantic relevance better than raw frequency for short Bangla text.
- Mean sentence embedding keeps inference simple and cheap enough for dataset generation.
- Notebook experiments visualize token embeddings with t-SNE/UMAP to inspect separation and ranking behavior.
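To make the ranking idea concrete, here is a dependency-free toy example. The 2-D vectors are invented for illustration and stand in for transformer token embeddings; `top_keywords` and its `k` parameter are hypothetical and not taken from the notebooks.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors (the sentence representation)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_keywords(tokens, vectors, k=2):
    """Score each token against the mean vector and keep the k most similar."""
    sentence_vector = mean_vector(vectors)
    scored = [(tok, cosine(vec, sentence_vector)) for tok, vec in zip(tokens, vectors)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy data: the first two tokens point in roughly the same direction as the mean,
# the third points away from it and therefore ranks last.
tokens = ["বৃষ্টি", "তাপমাত্রা", "আজ"]
vectors = [[1.0, 0.2], [0.9, 0.4], [-0.2, 1.0]]

if __name__ == "__main__":
    for token, score in top_keywords(tokens, vectors, k=2):
        print(token, round(score, 3))
```

Tokens whose vectors align with the mean dominate the ranking, which is why the approach favors words that carry the sentence's overall meaning.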
Benchmark
| Extractor | MRR | MAP | NDCG |
|---|---|---|---|
| bnKey2Text main | 0.3359 | 0.3356 | 0.3381 |
| BanglaBERT-base variant | 0.3349 | 0.3243 | 0.3111 |
| TextRank | 0.3380 | 0.3160 | 0.2701 |
| YAKE | 0.3355 | 0.3163 | 0.3791 |
The performance notebook also reports a 74.1% average exact-set matching score between human keywords and the main bnKey2Text output on the tested sample. No method leads on every metric: TextRank has the highest MRR, YAKE the highest NDCG, and bnKey2Text main the highest MAP. Given that mixed profile, the transformer extractor was adopted as a practical choice for dataset construction, not as a claim that it dominates the baselines.
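For reference, the three ranking metrics in the table are commonly computed as sketched below. This assumes binary relevance (a predicted keyword counts as relevant if it appears in the human keyword set) and unique predicted keywords per sentence. The notebooks' exact definitions may differ, and all function names here are illustrative.

```python
import math


def reciprocal_rank(ranked, gold):
    """1 / rank of the first relevant keyword, or 0.0 if none is relevant."""
    for rank, token in enumerate(ranked, start=1):
        if token in gold:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, gold):
    """Mean of precision@rank over relevant hits, normalized by the gold set size."""
    hits, total = 0, 0.0
    for rank, token in enumerate(ranked, start=1):
        if token in gold:
            hits += 1
            total += hits / rank
    return total / len(gold) if gold else 0.0


def ndcg(ranked, gold):
    """Binary-gain DCG divided by the DCG of an ideal ordering of the same length."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, token in enumerate(ranked, start=1)
        if token in gold
    )
    ideal_hits = min(len(gold), len(ranked))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def evaluate(predictions, references):
    """Average each metric over sentences to obtain MRR, MAP, and mean NDCG."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and aligned")
    pairs = list(zip(predictions, references))
    n = len(pairs)
    return {
        "MRR": sum(reciprocal_rank(p, g) for p, g in pairs) / n,
        "MAP": sum(average_precision(p, g) for p, g in pairs) / n,
        "NDCG": sum(ndcg(p, g) for p, g in pairs) / n,
    }
```

Because MRR looks only at the first relevant hit while NDCG rewards the whole ordering, two extractors can swap places between metrics, as TextRank and YAKE do in the table.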
Impact
This keyword extraction work became the data construction layer behind Bangla Key2Text, enabling supervised keyword-to-text generation experiments for a low-resource language instead of relying only on prompt-based or zero-shot generation.