Keyword Extractor Case Study

Impact: 2.6M keyword-text pairs generated
Source: Mixed repos
Repos: 1 linked service

Repository Shape

Keyword Extractor

GitHub

Problem

Bangla Key2Text needed a scalable way to transform raw Bangla news sentences into keyword-conditioned examples. Manual keyword annotation would not scale to millions of rows, and general-purpose, English-centric keyword extractors handle Bangla morphology and punctuation poorly and offer little model coverage for a low-resource language.

Research Goal

Generate reliable Bangla keyword-text pairs at dataset scale while keeping the pipeline reproducible in notebooks for model and metric comparison.

Extraction Pipeline

Load large Bangla sentence CSVs from the research dataset workspace.
Normalize punctuation and filter stop tokens before candidate ranking.
Encode each sentence with Bangla transformer models from Hugging Face.
Compute a sentence-level mean embedding and compare token embeddings against it.
Rank candidate tokens by cosine similarity and keep the strongest keywords per sentence.
Merge generated shards into release-ready keyword-text CSV files.
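
The cleanup step before candidate ranking can be illustrated with a small, dependency-free sketch. Everything named here is an assumption for illustration: the stop-token set, the punctuation inventory, and the function names `normalize` and `candidate_tokens` are hypothetical and are not taken from the project's notebooks.

```python
import re
import unicodedata

# Hypothetical stop-token set for illustration; the project's real list is not shown.
STOP_TOKENS = {"এবং", "ও", "এই", "যে", "করে"}

# Assumed punctuation inventory: Bangla danda and double danda, common ASCII marks,
# and typographic quotes and dashes (written as escapes).
PUNCT_CHARS = "।॥,;:!?\"'()[]{}-." + "\u2018\u2019\u201c\u201d\u2013\u2014"
PUNCTUATION = re.compile("[" + re.escape(PUNCT_CHARS) + "]")


def normalize(sentence):
    """Apply Unicode NFC, replace punctuation with spaces, collapse whitespace."""
    text = unicodedata.normalize("NFC", sentence)
    text = PUNCTUATION.sub(" ", text)
    return " ".join(text.split())


def candidate_tokens(sentence):
    """Whitespace-tokenize the normalized sentence and drop stop tokens."""
    return [tok for tok in normalize(sentence).split() if tok not in STOP_TOKENS]


if __name__ == "__main__":
    print(candidate_tokens("ঢাকা ও চট্টগ্রামে, আজ বৃষ্টি হবে।"))
```

Whitespace tokenization is a simplification. A transformer pipeline would normally rely on the model's own subword tokenizer, so this sketch only shows where punctuation normalization and stop-token filtering sit relative to ranking.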

keyword_ranker.py

def rank_keywords(tokens, embeddings):
    sentence_vector = embeddings.mean(dim=0)
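
The snippet above stops after computing the sentence vector. The remaining steps described in the pipeline (compare each token embedding with the mean, rank by cosine similarity, keep the top candidates) might look roughly like the following. This is a minimal sketch using plain Python lists in place of tensors so it runs without a deep learning framework; the name `rank_keywords_sketch`, the `top_k` parameter, and the toy vectors are assumptions, not the project's code.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors (the sentence-level embedding)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_keywords_sketch(tokens, embeddings, top_k=3):
    """Score each token against the mean embedding and keep the top_k tokens."""
    sentence_vector = mean_vector(embeddings)
    scored = [
        (token, cosine(vector, sentence_vector))
        for token, vector in zip(tokens, embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy 2-dimensional vectors; real token embeddings come from the transformer.
    tokens = ["ঢাকা", "বৃষ্টি", "আজ"]
    embeddings = [[1.0, 0.0], [1.0, 1.0], [0.2, 1.0]]
    for token, score in rank_keywords_sketch(tokens, embeddings, top_k=2):
        print(token, round(score, 3))
```

With tensor inputs, the same logic would use the framework's mean and cosine-similarity operations instead of the helper functions shown here.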

Modeling Choices

The notebooks compare a main Electra/BERT-style extractor with a BanglaBERT-base variant, plus non-transformer baselines. The core idea is intentionally simple: a useful keyword should be semantically close to the sentence representation while surviving language-specific cleanup.

Transformer embeddings capture semantic relevance better than raw frequency for short Bangla text.
Mean sentence embedding keeps inference simple and cheap enough for dataset generation.
Notebook experiments visualize token embeddings with t-SNE/UMAP to inspect separation and ranking behavior.

Benchmark

Extractor	MRR	MAP	NDCG
bnKey2Text main	0.3359	0.3356	0.3381
BanglaBERT-base variant	0.3349	0.3243	0.3111
TextRank	0.3380	0.3160	0.2701
YAKE	0.3355	0.3163	0.3791

The performance notebook also reports a 74.1% average exact-set matching score between human keywords and the main bnKey2Text output on the tested sample. No extractor leads on every metric: TextRank has the highest MRR, bnKey2Text main the highest MAP, and YAKE the highest NDCG. Given this mixed profile, the transformer extractor was adopted as a practical choice for dataset construction, not as a claim that one method dominates.
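
For readers unfamiliar with the ranking metrics in the table above, here is a hedged sketch of their standard per-sentence forms with binary relevance (a predicted keyword either appears in the human keyword set or it does not). The notebook's exact formulation, including cutoffs and relevance grading, is not given in this write-up, so these definitions are assumptions. The exact-set matching score is not reproduced because its definition is not stated.

```python
import math


def reciprocal_rank(predicted, gold):
    """1 / rank of the first predicted keyword found in the gold set, else 0."""
    for rank, token in enumerate(predicted, start=1):
        if token in gold:
            return 1.0 / rank
    return 0.0


def average_precision(predicted, gold):
    """Mean of precision@rank at each hit, normalized by the gold set size.

    Assumes predicted keywords are unique.
    """
    hits, total = 0, 0.0
    for rank, token in enumerate(predicted, start=1):
        if token in gold:
            hits += 1
            total += hits / rank
    return total / len(gold) if gold else 0.0


def ndcg(predicted, gold):
    """Binary-relevance DCG divided by the DCG of an ideal ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, token in enumerate(predicted, start=1)
        if token in gold
    )
    ideal_hits = min(len(gold), len(predicted))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


if __name__ == "__main__":
    predicted = ["x", "a", "b"]  # ranked extractor output (toy example)
    gold = {"a", "b"}            # human keywords (toy example)
    print(reciprocal_rank(predicted, gold))
    print(average_precision(predicted, gold))
    print(ndcg(predicted, gold))
```

MRR and MAP are the means of `reciprocal_rank` and `average_precision` over all evaluated sentences, and a corpus-level NDCG is usually averaged the same way.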

Impact

2.6M keyword-text pairs

Extractor families compared

74.1% exact-set match on the tested sample

This keyword extraction work became the data construction layer behind Bangla Key2Text, enabling supervised keyword-to-text generation experiments for a low-resource language instead of relying only on prompt-based or zero-shot generation.
