Case Study | Research engineering

BanglaVQA Dataset Pipeline

Built the collection, annotation, and quality-control pipeline behind a Bangla visual question answering benchmark.

Python | HuggingFace | CLIP

Impact

20K+ annotated QA pairs

Source

Mixed public and private repos

Repos

2 linked services

Repository Shape

  • Dataset Tools (GitHub)
  • Annotation App (Private)

Private repositories are represented through architecture notes, impact, and design tradeoffs instead of source links.

Role

Research engineering across data collection, annotation workflow, filtering, and reproducibility.

Architecture

  • Crawler jobs gather candidate image-question pairs and normalize metadata.
  • Annotation tooling captures human labels with agreement checks and review queues.
  • Filtering scripts produce release-ready splits and reproducible experiment manifests.
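
As a rough illustration of the last step, one common way to get reproducible splits is to derive each example's split from a stable hash of its ID and to record the result in a manifest. The sketch below is not taken from the project's repositories; the function names (assign_split, build_manifest), the 80/10/10 proportions, and the manifest fields are all assumptions made for the example.

```python
import hashlib
import json
from typing import Dict, List


def assign_split(example_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map an example ID to a split via a stable hash (hypothetical helper).

    The same ID always lands in the same split, independent of run order.
    """
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 99]
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


def build_manifest(example_ids: List[str]) -> Dict:
    """Group IDs by split and attach a checksum (assumed manifest layout)."""
    splits: Dict[str, List[str]] = {"train": [], "val": [], "test": []}
    for example_id in sorted(example_ids):  # sort so input order is irrelevant
        splits[assign_split(example_id)].append(example_id)
    payload = json.dumps(splits, sort_keys=True).encode("utf-8")
    return {
        "counts": {name: len(ids) for name, ids in splits.items()},
        "splits": splits,
        "checksum": hashlib.sha256(payload).hexdigest(),
    }
```

Because the split depends only on the ID, adding new examples later does not move existing ones between splits, and the checksum gives experiments a single value to cite when stating which release they ran on.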

Highlights

  • 20K+ annotated QA pairs across diverse visual domains.
  • Quality gates reduce noisy examples before model evaluation.
  • Pipeline supports both research iteration and publication-grade dataset packaging.
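
To make the idea of a quality gate concrete, here is a minimal sketch of one possible gate based on annotator agreement: examples whose majority answer is shared by enough annotators are accepted, and the rest are routed to a review queue. The record layout, the two-thirds threshold, and the helper names (agreement, apply_gate) are assumptions for illustration, not the project's actual rules.

```python
import unicodedata
from collections import Counter
from typing import Dict, List, Optional, Tuple


def agreement(answers: List[str]) -> Tuple[Optional[str], float, int]:
    """Return (majority answer, share of annotators giving it, non-empty count)."""
    # NFC normalization so visually identical Bangla strings compare equal.
    cleaned = [unicodedata.normalize("NFC", a).strip() for a in answers]
    cleaned = [a for a in cleaned if a]
    if not cleaned:
        return None, 0.0, 0
    answer, votes = Counter(cleaned).most_common(1)[0]
    return answer, votes / len(cleaned), len(cleaned)


def apply_gate(
    records: List[Dict],
    min_agreement: float = 2 / 3,  # assumed threshold
    min_annotators: int = 2,       # assumed minimum number of labels
) -> Tuple[List[Dict], List[Dict]]:
    """Split records into (accepted, needs_review) by agreement."""
    accepted, needs_review = [], []
    for record in records:
        answer, share, count = agreement(record["answers"])
        if answer is not None and count >= min_annotators and share >= min_agreement:
            accepted.append({**record, "answer": answer, "agreement": share})
        else:
            needs_review.append(record)
    return accepted, needs_review
```

A gate like this keeps the filtering decision auditable: every rejected example stays in the review list with its raw answers, so thresholds can be tuned without re-annotating.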

Constraints

Research Assets

Some annotation tooling and unreleased dataset assets are private. Public artifacts can be linked separately from private workflow repositories.