Problem & Constraints
Building a multimodal dataset for a low-resource language like Bangla presents challenges at every stage: image sourcing at scale, question generation in a language with limited NLP tooling, quality filtering without ground-truth baselines, and annotation at acceptable inter-annotator agreement. Off-the-shelf annotation tools assume English-first workflows and break on right-to-left or complex scripts.
The goal was to produce 20K+ image-QA pairs covering diverse visual categories (objects, scenes, activities, text in image) with question types spanning yes/no, counting, attribute, and open-ended — all verified at 90%+ inter-annotator agreement.
Pipeline Architecture
The pipeline has four stages: (1) Image collection via distributed crawlers filtered by CLIP embedding similarity to target visual categories. (2) Question generation using a fine-tuned mT5 model seeded with visual category labels. (3) Human annotation through a custom web interface with Bangla keyboard support. (4) Quality filtering using cross-annotator agreement scoring and an automated consistency checker.
Celery workers handle the crawling and embedding stages asynchronously. PostgreSQL stores the job graph and annotation state. The annotation UI is a lightweight Flask app with a Bangla Unicode input layer — critical because standard IMEs frequently mangle conjunct consonants.
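A minimal sketch of how the crawl and embedding stages can be chained as asynchronous Celery tasks. The app name, broker URL, and task bodies are illustrative placeholders, not the actual implementation; the real tasks write their results into the PostgreSQL job graph.

from celery import Celery, chain

# Broker URL is a placeholder; any Celery-supported broker works here.
app = Celery("bangla_vqa_pipeline", broker="amqp://guest@localhost//")

@app.task
def crawl_batch(seed_query: str) -> list[str]:
    # Real implementation fetches candidate images and returns local paths.
    return []

@app.task
def embed_and_filter(image_paths: list[str]) -> list[str]:
    # Real implementation scores each image against the CLIP category prompts
    # (see the CLIP Embedding Filter section) and keeps only relevant ones.
    return list(image_paths)

@app.task
def register_for_annotation(image_paths: list[str]) -> None:
    # Real implementation inserts accepted images as pending annotation jobs
    # in PostgreSQL.
    pass

# One crawl job: crawl -> filter -> register, executed off the web process.
chain(
    crawl_batch.s("bicycle"),
    embed_and_filter.s(),
    register_for_annotation.s(),
).apply_async()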
Key Design Decisions
CLIP for image filtering, not keyword search
Why: Keyword-based crawling retrieves images matching the word, not the concept. CLIP embedding similarity filters for semantic relevance — e.g., "bicycle" returns images of bicycles, not bicycles-as-logos.
Alternative: Google/Bing image search with keyword filters: faster but noisier, with many irrelevant images.
mT5 for Bangla question generation
Why: mT5 has decent Bangla coverage and can be fine-tuned on a small seed set of human-written QA pairs. It generates grammatically plausible questions that annotators can accept or edit rather than write from scratch.
Alternative: Template-based generation: faster, but produces unnatural phrasing and limited diversity.
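A sketch of how a fine-tuned mT5 checkpoint could be prompted with a category label to draft candidate questions. The checkpoint path, prompt format, and sampling settings are assumptions for illustration, not the project's actual fine-tuning setup.

from transformers import AutoTokenizer, MT5ForConditionalGeneration

# Hypothetical path to the checkpoint fine-tuned on the seed QA pairs.
MODEL_DIR = "checkpoints/mt5-bangla-qgen"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = MT5ForConditionalGeneration.from_pretrained(MODEL_DIR)

def draft_questions(category: str, question_type: str, n: int = 5) -> list[str]:
    # Prompt format is an assumption: category plus desired question type.
    prompt = f"generate {question_type} question: {category}"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,        # sampling for diversity across drafts
        top_p=0.9,
        num_return_sequences=n,
        max_new_tokens=48,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# Drafts go to annotators to accept or edit rather than being used verbatim.
drafts = draft_questions("bicycle", "counting")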
PostgreSQL job graph over simple queue
Why: Annotation tasks have dependencies; quality review for an item can only start after at least two independent annotators have completed it. A DAG in PostgreSQL makes these dependencies explicit and auditable.
Alternative: Redis queue: simpler but cannot express inter-task dependencies cleanly.
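A sketch of the kind of readiness query the job graph enables, assuming hypothetical annotation_tasks and task_dependencies tables; the schema and status values are illustrative.

import psycopg2

# Hypothetical schema: annotation_tasks(id, status) and
# task_dependencies(task_id, depends_on) encode the DAG edges.
READY_TASKS_SQL = """
SELECT t.id
FROM annotation_tasks t
WHERE t.status = 'pending'
  AND NOT EXISTS (
      SELECT 1
      FROM task_dependencies d
      JOIN annotation_tasks dep ON dep.id = d.depends_on
      WHERE d.task_id = t.id
        AND dep.status <> 'done'
  );
"""

def fetch_ready_tasks(dsn: str) -> list[int]:
    # Returns tasks whose upstream dependencies (e.g. two completed
    # independent annotations) are all finished.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(READY_TASKS_SQL)
        return [row[0] for row in cur.fetchall()]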
Fleiss's kappa as agreement threshold
Why: Fleiss's kappa corrects for chance agreement, unlike raw percentage agreement. Setting kappa ≥ 0.80 as the acceptance threshold ensures that agreement reflects genuine consensus rather than chance-level overlap.
Alternative: Raw percentage agreement: easier to compute but inflates quality metrics for imbalanced answer distributions.
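A sketch of the acceptance check using the Fleiss' kappa implementation in statsmodels; the step that encodes normalized answer strings as integer category codes is an assumption about the bookkeeping, not part of the library API.

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def batch_kappa(answers: list[list[str]]) -> float:
    # answers[i] holds the three annotators' normalized answers for item i.
    # Map each distinct answer string to an integer category code.
    vocab = {a: i for i, a in enumerate(sorted({a for row in answers for a in row}))}
    coded = np.array([[vocab[a] for a in row] for row in answers])
    table, _ = aggregate_raters(coded)   # items x categories count matrix
    return fleiss_kappa(table, method="fleiss")

def batch_accepted(answers: list[list[str]], threshold: float = 0.80) -> bool:
    return batch_kappa(answers) >= threshold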
Quality Control
Every image-QA pair passes through three independent annotators. Answers are normalized (Unicode NFC plus whitespace trimming) before comparison. Pairs where at least 2 of 3 annotators agree are accepted; the rest go to a senior reviewer. The automated consistency checker flags semantically contradictory answers using cross-lingual sentence embeddings.
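A sketch of the normalization and 2-of-3 routing rule described above; the function names and the "senior_review" label are illustrative.

import unicodedata
from collections import Counter

def normalize_answer(text: str) -> str:
    # Unicode NFC normalization plus whitespace trimming and collapsing.
    return " ".join(unicodedata.normalize("NFC", text).split())

def route(answers: list[str]) -> str:
    # Accept when at least 2 of 3 annotators agree after normalization;
    # otherwise send the pair to the senior reviewer.
    counts = Counter(normalize_answer(a) for a in answers)
    _, votes = counts.most_common(1)[0]
    return "accepted" if votes >= 2 else "senior_review"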
Key Insight
Bangla script normalization is non-trivial. Identical-looking characters have multiple Unicode representations (e.g., visarga vs. aspirate marker). Pre-normalization reduced false disagreements by 8% and improved kappa from 0.76 to 0.83.
CLIP Embedding Filter
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

CATEGORY_PROMPTS = [
    "a photo of a bicycle",
    "a photo of food being eaten",
    "a photo of text written in Bengali",
    # ... 40 more categories
]

def score_image(image_path: str) -> dict[str, float]:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(
        text=CATEGORY_PROMPTS, images=image, return_tensors="pt", padding=True
    )
    with torch.no_grad():
        outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=1)
    return {
        prompt: float(prob)
        for prompt, prob in zip(CATEGORY_PROMPTS, probs[0])
    }

def is_relevant(image_path: str, threshold: float = 0.15) -> bool:
    scores = score_image(image_path)
    return max(scores.values()) >= threshold

Tradeoffs & Limitations
The pipeline is optimized for precision over recall. The CLIP filter and kappa threshold together reject roughly 35% of crawled images and 18% of generated QA pairs. This is intentional — dataset quality matters more than size for establishing a benchmark. The tradeoff is cost: rejected pairs represent annotator hours that cannot be recovered.
The question generator has a known bias toward object-recognition questions ("What is this?") and underproduces spatial reasoning questions ("Which object is to the left of...?"). A second fine-tuning pass with stratified sampling by question type partially corrected this but did not eliminate it.
Lessons & What I'd Improve
The biggest bottleneck wasn't compute — it was annotator throughput. The annotation UI had too much friction: annotators had to switch between mouse and keyboard, and the Bangla input method required a mode switch that broke flow. A redesign with keyboard-first navigation and inline IME integration reduced per-annotation time from 45 seconds to 28 seconds.
I'd also add a model-in-the-loop stage: run a baseline VQA model on each generated QA pair and flag questions where the model answers correctly with very high confidence. These are likely too easy for the benchmark. Filtering them out earlier would improve dataset difficulty without requiring extra human review.
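A hedged sketch of that model-in-the-loop filter: predict stands for any baseline VQA model wrapper returning an answer and a confidence score, and the 0.95 cutoff is an arbitrary placeholder to tune.

from typing import Callable

def too_easy(
    image_path: str,
    question: str,
    gold_answer: str,
    predict: Callable[[str, str], tuple[str, float]],
    confidence_cutoff: float = 0.95,
) -> bool:
    # predict(image_path, question) -> (answer, confidence) for any baseline model.
    answer, confidence = predict(image_path, question)
    # In practice the same Unicode NFC + whitespace normalization used for
    # annotator answers would be applied before comparing.
    return answer.strip() == gold_answer.strip() and confidence >= confidence_cutoff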