Case Study: Research engineering / Bangla NLP

Keyword Extractor

Built a Bangla keyword extraction pipeline that ranks sentence tokens with transformer embeddings and produces keyword-text pairs for low-resource text generation research.

Python, PyTorch, Hugging Face, BanglaBERT, NLP

Impact

2.6M keyword-text pairs generated

Source

Mixed repos

Repos

1 linked service

Repository Shape

Keyword Extractor

Problem

Bangla Key2Text needed a scalable way to transform raw Bangla news sentences into keyword-conditioned examples. Manual keyword annotation would not scale to millions of rows, and general-purpose, English-centric keyword extractors are a poor fit for Bangla morphology and punctuation and for the limited model coverage of a low-resource language.

Research Goal

Generate reliable Bangla keyword-text pairs at dataset scale while keeping the pipeline reproducible in notebooks for model and metric comparison.

Extraction Pipeline

  • Load large Bangla sentence CSVs from the research dataset workspace.
  • Normalize punctuation and filter stop tokens before candidate ranking.
  • Encode each sentence with Bangla transformer models from Hugging Face.
  • Compute a sentence-level mean embedding and compare token embeddings against it.
  • Rank candidate tokens by cosine similarity and keep the strongest keywords per sentence.
  • Merge generated shards into release-ready keyword-text CSV files.
keyword_ranker.py
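import torch.nn.functional as F

def rank_keywords(tokens, embeddings, stop_words=frozenset()):
    # embeddings: (num_tokens, hidden_dim) tensor, one row per token.
    # stop_words is passed in explicitly here; the original snippet read it from outer scope.
    sentence_vector = embeddings.mean(dim=0, keepdim=True)
    # Cosine similarity between every token embedding and the mean sentence vector.
    scores = F.cosine_similarity(embeddings, sentence_vector, dim=1).tolist()
    ranked = sorted(zip(tokens, scores), key=lambda item: item[1], reverse=True)
    return [token for token, _ in ranked if token not in stop_words]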
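The normalization and stop-token steps listed above might look roughly like the sketch below. It is illustrative only: the stop list, the punctuation set, and the function names `normalize` and `candidate_tokens` are assumptions, not taken from the project notebooks, and whitespace splitting stands in for whatever tokenizer the pipeline actually uses.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop list; a real run would load a fuller Bangla list.
STOP_TOKENS = {"এবং", "ও", "এই", "যে", "না"}

# Assumed punctuation set: Bangla danda and double danda plus common ASCII and curly marks.
PUNCT_CHARS = "।॥!?,;:.\"'()[]{}“”‘’-"
PUNCTUATION = re.compile("[" + re.escape(PUNCT_CHARS) + "]+")


def normalize(sentence: str) -> str:
    """Apply Unicode NFC, replace punctuation with spaces, and collapse whitespace."""
    text = unicodedata.normalize("NFC", sentence)
    text = PUNCTUATION.sub(" ", text)
    return " ".join(text.split())


def candidate_tokens(sentence: str) -> list[str]:
    """Split the normalized sentence on whitespace and drop stop tokens."""
    return [tok for tok in normalize(sentence).split() if tok not in STOP_TOKENS]


if __name__ == "__main__":
    print(candidate_tokens("ঢাকা ও চট্টগ্রামে আজ বৃষ্টি।"))
```

Filtering before ranking keeps stop tokens out of the candidate pool, so they never need to be scored against the sentence vector.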

Modeling Choices

The notebooks compare a main ELECTRA/BERT-style extractor with a BanglaBERT-base variant, plus non-transformer baselines. The core idea is intentionally simple: a useful keyword should be semantically close to the sentence representation while surviving language-specific cleanup.

  • Transformer embeddings capture semantic relevance better than raw frequency for short Bangla text.
  • Mean sentence embedding keeps inference simple and cheap enough for dataset generation.
  • Notebook experiments visualize token embeddings with t-SNE/UMAP to inspect separation and ranking behavior.
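To make the mean-embedding idea concrete, here is a small worked example in plain Python with made-up 2-D vectors in place of real transformer embeddings. The cutoff `k` and the helper names are assumptions for illustration; the source only says the strongest keywords per sentence are kept.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_keywords(tokens, vectors, k=2):
    """Score each token against the mean of all token vectors and keep the k best."""
    dim = len(vectors[0])
    mean = [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]
    scored = [(tok, cosine(vec, mean)) for tok, vec in zip(tokens, vectors)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy vectors chosen by hand; real embeddings would have hundreds of dimensions.
    tokens = ["ঢাকা", "বৃষ্টি", "আজ"]
    vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
    for token, score in top_k_keywords(tokens, vectors, k=2):
        print(token, round(score, 4))
```

In this toy case the token whose vector points closest to the average direction ranks first, which is the behavior the ranking relies on.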

Benchmark

Extractor                 MRR      MAP      NDCG
bnKey2Text main           0.3359   0.3356   0.3381
BanglaBERT-base variant   0.3349   0.3243   0.3111
TextRank                  0.3380   0.3160   0.2701
YAKE                      0.3355   0.3163   0.3791

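The performance notebook also reports a 74.1% average exact-set matching score between human keywords and the main bnKey2Text output on the tested sample. No extractor leads on every metric: TextRank has the highest MRR (0.3380), the main bnKey2Text extractor the highest MAP (0.3356), and YAKE the highest NDCG (0.3791). This mixed profile made the transformer extractor a practical choice for dataset construction rather than a claim that one method dominates every metric.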
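For readers unfamiliar with the ranking metrics in the table, the sketch below shows the standard per-sentence definitions with binary relevance (a predicted keyword counts as relevant if it appears in the human keyword set). This is an assumption about how such metrics are usually computed, not a copy of the project's evaluation code, and it assumes each predicted list has no duplicate tokens. MRR and MAP are the means of the first two quantities over all evaluated sentences.

```python
import math


def reciprocal_rank(predicted, gold):
    """1 / rank of the first predicted keyword found in the gold set, else 0."""
    for rank, token in enumerate(predicted, start=1):
        if token in gold:
            return 1.0 / rank
    return 0.0


def average_precision(predicted, gold):
    """Mean of precision@rank taken at each rank where a gold keyword appears."""
    hits, total = 0, 0.0
    for rank, token in enumerate(predicted, start=1):
        if token in gold:
            hits += 1
            total += hits / rank
    return total / len(gold) if gold else 0.0


def ndcg(predicted, gold):
    """Binary-relevance DCG divided by the DCG of an ideal ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, token in enumerate(predicted, start=1)
        if token in gold
    )
    ideal_hits = min(len(gold), len(predicted))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


if __name__ == "__main__":
    predicted = ["a", "b", "c"]  # ranked extractor output (placeholder tokens)
    gold = {"b", "c"}            # human keywords (placeholder tokens)
    print(reciprocal_rank(predicted, gold), average_precision(predicted, gold), ndcg(predicted, gold))
```

Because the three metrics weight rank positions differently, it is plausible for different extractors to lead on different columns, as the table shows.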

Impact

  • 2.6M keyword-text pairs
  • 4 extractor families compared
  • 74.1% exact-set match on the tested sample

This keyword extraction work became the data construction layer behind Bangla Key2Text, enabling supervised keyword-to-text generation experiments for a low-resource language instead of relying only on prompt-based or zero-shot generation.
