Problem & Constraints
Building a multimodal dataset for a low-resource language like Bangla presents challenges at every stage: image sourcing at scale, question generation in a language with limited NLP tooling, quality filtering without ground-truth baselines, and annotation at acceptable inter-annotator agreement. Off-the-shelf annotation tools assume English-first workflows and break on right-to-left or complex scripts.
The goal was to produce 20K+ image-QA pairs covering diverse visual categories (objects, scenes, activities, text in image) with question types spanning yes/no, counting, attribute, and open-ended — all verified at 90%+ inter-annotator agreement.
Pipeline Architecture
The pipeline has four stages: (1) Image collection via distributed crawlers filtered by CLIP embedding similarity to target visual categories. (2) Question generation using a fine-tuned mT5 model seeded with visual category labels. (3) Human annotation through a custom web interface with Bangla keyboard support. (4) Quality filtering using cross-annotator agreement scoring and an automated consistency checker.
Celery workers handle the crawling and embedding stages asynchronously. PostgreSQL stores the job graph and annotation state. The annotation UI is a lightweight Flask app with a Bangla Unicode input layer — critical because standard IMEs frequently mangle conjunct consonants.
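A minimal sketch of how the crawl and embedding stages can be chained as asynchronous Celery tasks. The app name, broker URL, and task bodies are illustrative placeholders, not the actual implementation; the real tasks write their results into the PostgreSQL job graph.

from celery import Celery, chain

# Broker URL is a placeholder; any Celery-supported broker works here.
app = Celery("bangla_vqa_pipeline", broker="amqp://guest@localhost//")

@app.task
def crawl_batch(seed_query: str) -> list[str]:
    # Real implementation fetches candidate images and returns local paths.
    return []

@app.task
def embed_and_filter(image_paths: list[str]) -> list[str]:
    # Real implementation scores each image against the CLIP category prompts
    # (see the CLIP Embedding Filter section) and keeps only relevant ones.
    return list(image_paths)

@app.task
def register_for_annotation(image_paths: list[str]) -> None:
    # Real implementation inserts accepted images as pending annotation jobs
    # in PostgreSQL.
    pass

# One crawl job: crawl -> filter -> register, executed off the web process.
chain(
    crawl_batch.s("bicycle"),
    embed_and_filter.s(),
    register_for_annotation.s(),
).apply_async()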
Key Design Decisions
CLIP for image filtering, not keyword search
Why: Keyword-based crawling retrieves images matching the word, not the concept. CLIP embedding similarity filters for semantic relevance — e.g., "bicycle" returns images of bicycles, not bicycles-as-logos.
Alternative: Google/Bing image search with keyword filters: faster but noisier, with many irrelevant images.
mT5 for Bangla question generation
Why: mT5 has decent Bangla coverage and can be fine-tuned on a small seed set of human-written QA pairs. It generates grammatically plausible questions that annotators can accept or edit rather than write from scratch.
Alternative: Template-based generation: faster, but produces unnatural phrasing and limited diversity.
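A sketch of how a fine-tuned mT5 checkpoint could be prompted with a category label to draft candidate questions. The checkpoint path, prompt format, and sampling settings are assumptions for illustration, not the project's actual fine-tuning setup.

from transformers import AutoTokenizer, MT5ForConditionalGeneration

# Hypothetical path to the checkpoint fine-tuned on the seed QA pairs.
MODEL_DIR = "checkpoints/mt5-bangla-qgen"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = MT5ForConditionalGeneration.from_pretrained(MODEL_DIR)

def draft_questions(category: str, question_type: str, n: int = 5) -> list[str]:
    # Prompt format is an assumption: category plus desired question type.
    prompt = f"generate {question_type} question: {category}"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,        # sampling for diversity across drafts
        top_p=0.9,
        num_return_sequences=n,
        max_new_tokens=48,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# Drafts go to annotators to accept or edit rather than being used verbatim.
drafts = draft_questions("bicycle", "counting")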
PostgreSQL job graph over simple queue
Why: Annotation tasks have dependencies; quality review for an item can only start after at least two independent annotators have completed it. A DAG in PostgreSQL makes these dependencies explicit and auditable.
Alternative: Redis queue: simpler but cannot express inter-task dependencies cleanly.
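A sketch of the kind of readiness query the job graph enables, assuming hypothetical annotation_tasks and task_dependencies tables; the schema and status values are illustrative.

import psycopg2

# Hypothetical schema: annotation_tasks(id, status) and
# task_dependencies(task_id, depends_on) encode the DAG edges.
READY_TASKS_SQL = """
SELECT t.id
FROM annotation_tasks t
WHERE t.status = 'pending'
  AND NOT EXISTS (
      SELECT 1
      FROM task_dependencies d
      JOIN annotation_tasks dep ON dep.id = d.depends_on
      WHERE d.task_id = t.id
        AND dep.status <> 'done'
  );
"""

def fetch_ready_tasks(dsn: str) -> list[int]:
    # Returns tasks whose upstream dependencies (e.g. two completed
    # independent annotations) are all finished.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(READY_TASKS_SQL)
        return [row[0] for row in cur.fetchall()]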
Fleiss's kappa as agreement threshold
Why: Fleiss's kappa corrects for chance agreement, unlike raw percentage agreement. Setting kappa ≥ 0.80 as the acceptance threshold ensures that agreement reflects genuine consensus rather than chance-level overlap.
Alternative: Raw percentage agreement: easier to compute but inflates quality metrics for imbalanced answer distributions.
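A sketch of the acceptance check using the Fleiss' kappa implementation in statsmodels; the step that encodes normalized answer strings as integer category codes is an assumption about the bookkeeping, not part of the library API.

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def batch_kappa(answers: list[list[str]]) -> float:
    # answers[i] holds the three annotators' normalized answers for item i.
    # Map each distinct answer string to an integer category code.
    vocab = {a: i for i, a in enumerate(sorted({a for row in answers for a in row}))}
    coded = np.array([[vocab[a] for a in row] for row in answers])
    table, _ = aggregate_raters(coded)   # items x categories count matrix
    return fleiss_kappa(table, method="fleiss")

def batch_accepted(answers: list[list[str]], threshold: float = 0.80) -> bool:
    return batch_kappa(answers) >= threshold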
Quality Control
Every image-QA pair passes through three independent annotators. Answers are normalized (Unicode NFC plus whitespace trimming) before comparison. Pairs where at least 2 of 3 annotators agree are accepted; the rest go to a senior reviewer. The automated consistency checker flags semantically contradictory answers using cross-lingual sentence embeddings.
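A sketch of the normalization and 2-of-3 routing rule described above; the function names and the "senior_review" label are illustrative.

import unicodedata
from collections import Counter

def normalize_answer(text: str) -> str:
    # Unicode NFC normalization plus whitespace trimming and collapsing.
    return " ".join(unicodedata.normalize("NFC", text).split())

def route(answers: list[str]) -> str:
    # Accept when at least 2 of 3 annotators agree after normalization;
    # otherwise send the pair to the senior reviewer.
    counts = Counter(normalize_answer(a) for a in answers)
    _, votes = counts.most_common(1)[0]
    return "accepted" if votes >= 2 else "senior_review"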
Key Insight
Bangla script normalization is non-trivial. Identical-looking characters have multiple Unicode representations (e.g., visarga vs. aspirate marker). Pre-normalization reduced false disagreements by 8% and improved kappa from 0.76 to 0.83.
CLIP Embedding Filter
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

CATEGORY_PROMPTS = [
    "a photo of a bicycle",
    "a photo of food being eaten",
    "a photo of text written in Bengali",
    # ... 40 more categories
]

def score_image(image_path: str) -> dict[str, float]:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(
        text=CATEGORY_PROMPTS, images=image, return_tensors="pt", padding=True
    )
    with torch.no_grad():
        outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=1)
    return {
        prompt: float(prob)
        for prompt, prob in zip(CATEGORY_PROMPTS, probs[0])
    }

def is_relevant(image_path: str, threshold: float = 0.15) -> bool:
    scores = score_image(image_path)
    return max(scores.values()) >= threshold

Tradeoffs & Limitations
The pipeline is optimized for precision over recall. The CLIP filter and kappa threshold together reject roughly 35% of crawled images and 18% of generated QA pairs. This is intentional — dataset quality matters more than size for establishing a benchmark. The tradeoff is cost: rejected pairs represent annotator hours that cannot be recovered.
The question generator has a known bias toward object-recognition questions ("What is this?") and underproduces spatial reasoning questions ("Which object is to the left of...?"). A second fine-tuning pass with stratified sampling by question type partially corrected this but did not eliminate it.
Lessons & What I'd Improve
The biggest bottleneck wasn't compute — it was annotator throughput. The annotation UI had too much friction: annotators had to switch between mouse and keyboard, and the Bangla input method required a mode switch that broke flow. A redesign with keyboard-first navigation and inline IME integration reduced per-annotation time from 45 seconds to 28 seconds.
I'd also add a model-in-the-loop stage: run a baseline VQA model on each generated QA pair and flag questions where the model answers correctly with very high confidence. These are likely too easy for the benchmark. Filtering them out earlier would improve dataset difficulty without requiring extra human review.
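A hedged sketch of that model-in-the-loop filter: predict stands for any baseline VQA model wrapper returning an answer and a confidence score, and the 0.95 cutoff is an arbitrary placeholder to tune.

from typing import Callable

def too_easy(
    image_path: str,
    question: str,
    gold_answer: str,
    predict: Callable[[str, str], tuple[str, float]],
    confidence_cutoff: float = 0.95,
) -> bool:
    # predict(image_path, question) -> (answer, confidence) for any baseline model.
    answer, confidence = predict(image_path, question)
    # In practice the same Unicode NFC + whitespace normalization used for
    # annotator answers would be applied before comparing.
    return answer.strip() == gold_answer.strip() and confidence >= confidence_cutoff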