Data Annotation for Computer Vision: YOLO, COCO, IoU & Beyond
Data annotation for computer vision is the process of labelling images and video frames so machine learning models can learn to detect, classify, and segment objects. The four core annotation types are bounding boxes, polygons, semantic segmentation, and keypoints. YOLO format (plain .txt, normalised coordinates) is the standard for detection. COCO JSON supports richer tasks including segmentation and keypoints. Quality is measured by IoU (Intersection over Union) — a score of ≥0.85 between annotators indicates production-grade quality. India-based annotation costs ₹3–9/image for bounding boxes, 60–70% cheaper than US providers.
What Is Data Annotation for Computer Vision?
Computer vision models learn to see by studying examples — thousands or millions of images where a human has already identified what is in each one. Data annotation is the process of attaching those labels: drawing a box around every car, colouring every pixel that belongs to a road, or marking each joint of a human skeleton. Without annotation, a model has no ground truth to learn from.
The relationship between annotation quality and model performance is direct and unforgiving. A model cannot exceed the accuracy ceiling set by its training data. Loose bounding boxes (IoU 0.6 where 0.9 is achievable), mislabelled classes, or systematically missed instances will embed those errors permanently into model weights. Researchers at Google Brain found that a 1% improvement in annotation quality correlates with approximately 2–3% improvement in mAP on standard benchmarks.
This guide covers every layer of the annotation stack: what to annotate and how, the formats your team needs to know (YOLO and COCO), the metrics that define quality (IoU, mAP, IAA), strategies to build better datasets (augmentation, stratified sampling, edge case mining), and the economics of outsourcing annotation to India.
Where Annotation Sits in the ML Pipeline
Raw data collection → Annotation → Dataset splits → Model training → Evaluation → Deployment. Annotation sits at the foundation. Everything downstream — training time, model architecture choices, hyperparameter tuning — is constrained by the quality of annotated data fed in. The most common reason computer vision projects fail is not insufficient model capacity but insufficient annotation quality or quantity.
For teams outsourcing annotation: India-based providers like Data Terminal handle all major annotation types — bounding boxes, polygons, semantic segmentation, keypoints, LiDAR 3D — with four-layer QA and COCO/YOLO/Pascal VOC export. Contact: +91-9014387222 | contact@dataterminal.co.
Annotation Types: The Complete Taxonomy
Choosing the wrong annotation type is one of the most expensive mistakes a CV team can make: you cannot cheaply upgrade bounding boxes to polygons or segmentation masks, and re-annotating 50,000 images is weeks of lost time. Here is every annotation type, when to use it, and what it costs.
2D Bounding Box
Axis-aligned rectangle drawn around each object. Fastest and cheapest annotation type. Sufficient for most object detection tasks.
Polygon
Irregular polygon that tightly fits the object boundary. More accurate than a bounding box for non-rectangular objects (aircraft, leaves, irregular containers).
Semantic Segmentation
Every pixel in the image is assigned a class label. Distinguishes road from building from sky at pixel level. Does NOT distinguish between individual instances of the same class.
Instance Segmentation
Pixel-level labelling that additionally distinguishes individual object instances. Two cars side by side each get unique masks. Combines the best of polygon and semantic segmentation.
Keypoints
Specific landmark points on an object (e.g., 17 COCO body keypoints: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles). Used for pose estimation and action recognition.
3D Bounding Box (LiDAR Cuboid)
Cuboid drawn in 3D point cloud space, defined by centre (x,y,z), dimensions (l,w,h), and heading angle. Required for autonomous driving and robotics.
Annotation Type Decision Framework
| Task | Recommended Type | Why | Cost Index |
|---|---|---|---|
| Vehicle detection (dashcam) | 2D Bounding Box | Speed and accuracy sufficient; real-time inference requirement | 1x |
| Retail product localisation | 2D Bounding Box | Regular shapes; high volume; cost-sensitive | 1x |
| Medical imaging (tumour margin) | Polygon / Instance Seg | Tight boundary critical for area measurement | 5–8x |
| Autonomous driving (road/pedestrian) | Semantic Segmentation | Need per-pixel class; NVIDIA DRIVE pipeline | 6–10x |
| Crowd counting / individual tracking | Instance Segmentation | Need to distinguish individuals in crowd | 8–15x |
| Human pose estimation | Keypoint (17-point) | COCO skeleton; body part relationships | 2–4x |
| 3D object detection (LiDAR) | 3D Bounding Box | Point cloud data; autonomous vehicles | 12–30x |
YOLO Format: The Industry Standard
YOLO format is the most widely used annotation format for object detection. It is plain text: one .txt file per image, containing one line per object. Every coordinate is normalised to the range [0, 1] relative to the image dimensions.

    # Format: class_id x_center y_center width height
    # All values normalised 0–1 relative to image dimensions
    # File: image_001.txt (one per image)
    # Comments are for explanation only; typical YOLO loaders expect just the numbers
    0 0.512 0.438 0.234 0.187   # class 0 (car), centre at 51.2%, 43.8% of W/H, 23.4% wide, 18.7% tall
    1 0.721 0.612 0.098 0.124   # class 1 (pedestrian)
    2 0.143 0.289 0.412 0.356   # class 2 (truck)

Class names and dataset paths live in a separate data.yaml file:

    path: /datasets/my_project  # root dataset directory
    train: images/train
    val: images/val
    test: images/test  # optional
    nc: 3  # number of classes
    names: ['car', 'pedestrian', 'truck']

YOLO Coordinate System
A critical point: YOLO uses centre coordinates, not top-left corner. To convert from absolute pixel coordinates to YOLO format for an image of width W and height H:

    # Given: top-left corner (x1, y1), bottom-right corner (x2, y2) in pixels
    # Image dimensions: W (width), H (height)
    x_center = (x1 + x2) / 2 / W
    y_center = (y1 + y2) / 2 / H
    width = (x2 - x1) / W
    height = (y2 - y1) / H

    # Example: box from (245, 180) to (695, 500) on a 1920×1080 image
    x_center = (245 + 695) / 2 / 1920   # = 0.2448
    y_center = (180 + 500) / 2 / 1080   # = 0.3148
    width = (695 - 245) / 1920          # = 0.2344
    height = (500 - 180) / 1080         # = 0.2963

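The conversion above can be wrapped into a pair of helper functions so it can be checked in both directions. This is a minimal sketch; the function names `pixels_to_yolo` and `yolo_to_pixels` are illustrative choices, not part of any YOLO library.

```python
def pixels_to_yolo(x1, y1, x2, y2, img_w, img_h):
    """Convert pixel corners to normalised YOLO (x_center, y_center, width, height)."""
    x_center = (x1 + x2) / 2 / img_w
    y_center = (y1 + y2) / 2 / img_h
    width = (x2 - x1) / img_w
    height = (y2 - y1) / img_h
    return x_center, y_center, width, height


def yolo_to_pixels(x_center, y_center, width, height, img_w, img_h):
    """Convert normalised YOLO values back to pixel corners (x1, y1, x2, y2)."""
    x1 = (x_center - width / 2) * img_w
    y1 = (y_center - height / 2) * img_h
    x2 = (x_center + width / 2) * img_w
    y2 = (y_center + height / 2) * img_h
    return x1, y1, x2, y2


# Same example box as above: (245, 180) to (695, 500) on a 1920x1080 image
yolo_box = pixels_to_yolo(245, 180, 695, 500, 1920, 1080)
print([round(v, 4) for v in yolo_box])  # [0.2448, 0.3148, 0.2344, 0.2963]

# Round trip back to pixels as a sanity check
pixel_box = [round(v) for v in yolo_to_pixels(*yolo_box, 1920, 1080)]
print(pixel_box)  # [245, 180, 695, 500]
```

A round-trip check like this is a cheap way to catch width/height mix-ups before a full export.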
YOLO Format Pros & Cons
| Pros | Cons |
|---|---|
| ✓ Simplest possible format — plain text, human readable | ✗ Bounding box only — no native segmentation masks in basic format |
| ✓ Fast to parse — no JSON overhead | ✗ No image metadata (filename, dimensions stored separately) |
| ✓ Native to all YOLO versions (v5, v8, v9, v11) | ✗ No support for crowd annotations or iscrowd flag |
| ✓ Exportable from all major annotation tools | ✗ Class names stored in separate data.yaml — easy to desync |
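Because class names live apart from the label files, a small consistency check can catch desync early. The sketch below is one possible approach, not a standard tool: `check_dataset_config` is a hypothetical helper, and the config is passed as a plain dict (in practice it would be loaded from data.yaml with a YAML parser).

```python
def check_dataset_config(config, label_lines):
    """Return a list of problems; an empty list means config and labels agree."""
    problems = []
    if config["nc"] != len(config["names"]):
        problems.append(f"nc={config['nc']} but {len(config['names'])} names listed")
    for line_no, line in enumerate(label_lines, start=1):
        parts = line.split()
        if len(parts) != 5:
            problems.append(f"line {line_no}: expected 5 fields, got {len(parts)}")
            continue
        class_id = int(parts[0])
        if not 0 <= class_id < len(config["names"]):
            problems.append(f"line {line_no}: class id {class_id} has no name")
        if any(not 0.0 <= float(value) <= 1.0 for value in parts[1:]):
            problems.append(f"line {line_no}: coordinate outside [0, 1]")
    return problems


config = {"nc": 3, "names": ["car", "pedestrian", "truck"]}
good_labels = ["0 0.512 0.438 0.234 0.187", "2 0.143 0.289 0.412 0.356"]
bad_labels = ["3 0.5 0.5 0.2 0.2", "1 1.2 0.5 0.2 0.2"]

print(check_dataset_config(config, good_labels))  # []
for problem in check_dataset_config(config, bad_labels):
    print(problem)
```

Running a check like this on every export keeps the label files and the class list from drifting apart unnoticed.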
COCO Format: Rich Metadata for Complex Tasks
COCO (Common Objects in Context) format is a JSON structure that encodes images, annotations, and category definitions in a single file. It is the standard for instance segmentation, keypoint detection, and panoptic segmentation tasks. Frameworks like Detectron2, MMDetection, and Hugging Face's transformers library all use COCO natively.
Unlike YOLO format, COCO bounding box coordinates are absolute pixel values in [x, y, width, height] format where (x, y) is the top-left corner — NOT normalised, NOT centre coordinates.
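To make the difference concrete, here is a minimal sketch of converting one COCO box to YOLO values. The helper name `coco_to_yolo` is an illustrative assumption; the image size must come from the matching entry in the COCO "images" list.

```python
def coco_to_yolo(bbox, img_w, img_h):
    """Convert a COCO [x_min, y_min, width, height] pixel box to normalised YOLO values."""
    x_min, y_min, box_w, box_h = bbox
    x_center = (x_min + box_w / 2) / img_w
    y_center = (y_min + box_h / 2) / img_h
    return x_center, y_center, box_w / img_w, box_h / img_h


# A 450x320 px box with its top-left corner at (245, 180) on a 1920x1080 image
yolo_box = coco_to_yolo([245, 180, 450, 320], 1920, 1080)
print(" ".join(f"{value:.4f}" for value in yolo_box))  # 0.2448 0.3148 0.2344 0.2963
```

Class ids usually need remapping as well: YOLO class ids are zero-based indices into the names list, while COCO category ids are whatever the "categories" list defines.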

    {
      "info": { "year": 2026, "description": "My CV Dataset", "contributor": "Data Terminal" },
      "images": [
        { "id": 1, "file_name": "image_001.jpg", "width": 1920, "height": 1080 },
        { "id": 2, "file_name": "image_002.jpg", "width": 1920, "height": 1080 }
      ],
      "categories": [
        { "id": 1, "name": "car", "supercategory": "vehicle" },
        { "id": 2, "name": "pedestrian", "supercategory": "person" }
      ],
      "annotations": [
        {
          "id": 1,
          "image_id": 1,
          "category_id": 1,
          "bbox": [245, 180, 450, 320],
          "area": 144000,
          "iscrowd": 0,
          "segmentation": [[245,180, 695,180, 695,500, 245,500]]
        }
      ]
    }

The segmentation field holds polygon coordinates as a flat list of x, y pixel values.
Format warning: COCO bbox is [x_min, y_min, width, height] in absolute pixels. YOLO is [x_center, y_center, width, height] normalised 0–1. Getting these confused is one of the most common bugs in annotation pipelines and will produce bounding boxes shifted far from the actual objects.
COCO Evaluation Metrics
COCO benchmark uses a set of metrics far more demanding than simple AP@0.5. The primary metric is mAP@[0.5:0.95] — the mean of mAP computed at IoU thresholds from 0.5 to 0.95 in steps of 0.05. This penalises models with imprecise localisation heavily. A model that achieves AP50=85% but AP75=40% has great classification but poor box precision.
| Metric | Definition | Typical Good Score |
|---|---|---|
| AP@0.5 (AP50) | mAP at IoU threshold 0.5 — loose matching | > 65% |
| AP@0.75 (AP75) | mAP at IoU threshold 0.75 — strict matching | > 45% |
| AP@[0.5:0.95] | Primary COCO metric — mean over 10 thresholds | > 40% |
| AP_S / AP_M / AP_L | mAP for small / medium / large objects | Varies by dataset |
| AR@1 / AR@10 / AR@100 | Average Recall at 1, 10, 100 detections per image | > 60% |
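The final averaging step behind AP@[0.5:0.95] is simple to sketch. The per-threshold AP values below are hypothetical numbers chosen to match the AP50 = 85%, AP75 = 40% example above; a real evaluation first computes them from precision-recall curves per category, which this sketch does not do.

```python
# Ten IoU thresholds: 0.50, 0.55, ..., 0.95
thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]

# Hypothetical AP at each threshold for one model (illustrative values only)
ap_per_threshold = [0.85, 0.82, 0.78, 0.72, 0.64, 0.40, 0.30, 0.20, 0.10, 0.03]

ap_by_iou = dict(zip(thresholds, ap_per_threshold))
primary_metric = sum(ap_per_threshold) / len(ap_per_threshold)

print(f"AP50 = {ap_by_iou[0.5]:.2f}, AP75 = {ap_by_iou[0.75]:.2f}")
print(f"AP@[0.5:0.95] = {primary_metric:.3f}")
```

With these made-up values, a strong AP50 is pulled down to roughly 0.48 overall because AP falls off quickly at the stricter thresholds.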
IoU: Measuring Annotation Accuracy
Intersection over Union (IoU) is the universal metric for measuring how well two bounding boxes overlap — whether comparing a prediction to a ground truth, or comparing two annotators' boxes to each other. It is simple, intuitive, and scale-invariant.
When two boxes perfectly overlap, the intersection equals the union and IoU = 1.0. When they do not overlap at all, the intersection is zero and IoU = 0. Every meaningful quality threshold in computer vision is expressed as an IoU threshold.
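Written out for two boxes, the definition is the overlap area divided by the combined area. The symbols A and B are introduced here for illustration, and the worked example uses made-up box sizes.

```latex
\[
\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}
                   = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
\]
% Worked example: two 100 x 100 px boxes, one shifted 10 px to the right
\[
\mathrm{IoU} = \frac{90 \times 100}{10000 + 10000 - 9000}
             = \frac{9000}{11000} \approx 0.82
\]
```

A 10-pixel shift on a 100-pixel box is already enough to fall below a 0.85 agreement threshold, which is why tight boxes matter so much for small objects.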
IoU for Annotation Quality Control (IAA)
Inter-Annotator Agreement (IAA) using IoU works as follows: take a calibration set of 50–100 images and have two annotators label the same images independently. For each object, match annotator A's box to annotator B's best-overlapping box and compute the IoU of that pair. Average across all matched pairs.
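The matching-and-averaging step can be sketched as follows. This is one simple greedy approach under stated assumptions, not a standard IAA implementation: `mean_iaa` is a hypothetical helper, the box values are invented, and boxes that only one annotator drew are ignored rather than penalised.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as [x1, y1, x2, y2] in absolute pixels."""
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def mean_iaa(boxes_a, boxes_b):
    """Greedily pair each of annotator A's boxes with B's best overlap, then average IoU."""
    unmatched_b = list(boxes_b)
    scores = []
    for box_a in boxes_a:
        if not unmatched_b:
            break
        best = max(unmatched_b, key=lambda box_b: iou(box_a, box_b))
        score = iou(box_a, best)
        if score > 0:  # only overlapping boxes count as a matched pair
            scores.append(score)
            unmatched_b.remove(best)
    return sum(scores) / len(scores) if scores else 0.0


# Two annotators labelling the same image (example values)
annotator_a = [[245, 180, 695, 500], [100, 100, 200, 200]]
annotator_b = [[102, 98, 205, 203], [250, 185, 700, 510]]

score = mean_iaa(annotator_a, annotator_b)
print(f"Mean IAA IoU: {score:.3f}")
```

In a real audit you would also report how many boxes went unmatched, since a high mean IoU can hide missed instances.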

    # Simple IoU computation between two boxes
    def iou(box_a, box_b):
        # boxes: [x1, y1, x2, y2] in absolute pixels
        xi1 = max(box_a[0], box_b[0])
        yi1 = max(box_a[1], box_b[1])
        xi2 = min(box_a[2], box_b[2])
        yi2 = min(box_a[3], box_b[3])
        intersection = max(0, xi2 - xi1) * max(0, yi2 - yi1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - intersection
        return intersection / union if union > 0 else 0.0

    def flag_for_guideline_review():
        # Placeholder: hook this into your own QA workflow
        print("IAA below 0.85: review annotation guidelines")

    # IAA threshold check on one matched pair of boxes (example values)
    annotator_a_box = [245, 180, 695, 500]
    annotator_b_box = [250, 185, 700, 510]
    iaa_score = iou(annotator_a_box, annotator_b_box)
    if iaa_score < 0.85:
        flag_for_guideline_review()

Common Annotation Errors by IoU Impact
| Error Type | Typical IoU | Root Cause | Fix |
|---|---|---|---|
| Loose box (too large) | 0.65–0.80 | Annotator includes shadow/background | Add example with correct tight margin in guidelines |
| Tight box (truncates object) | 0.70–0.85 | Annotator cuts off extremities | Show before/after with truncated vs correct |
| Class confusion | 0.90+ box, wrong class | Ambiguous class boundary (van vs truck) | Add decision tree for ambiguous classes |
| Missing instances | 0 (no matching box) | Small objects overlooked | Zoom-in annotation protocol for objects <32px |
| Duplicate annotations | IoU>0.9 between two boxes | Annotator lost track of already-labelled objects | Implement auto-dedup in annotation tool |
Dataset Quality: Errors, Bias & Best Practices
A dataset is more than a collection of annotated images. Its composition — class distribution, scene diversity, split ratios, representation of edge cases — determines whether a model trained on it will generalise to production conditions. These structural issues are harder to fix than annotation errors.
Class Imbalance
When one class has 10× more examples than another, a model learns to be overconfident on the majority class and recall-deficient on the minority. The threshold for concern is roughly a 10:1 ratio for balanced detection tasks. Fixes: oversample rare classes, undersample majority classes, use focal loss during training, annotate more rare-class examples.
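Checking for imbalance only requires counting class ids across the label files. The sketch below assumes YOLO-format labels and uses inline strings in place of real .txt files; `class_counts` and `imbalance_ratio` are illustrative helper names.

```python
from collections import Counter


def class_counts(label_texts):
    """Count instances per class id across the contents of YOLO label files."""
    counts = Counter()
    for text in label_texts:
        for line in text.splitlines():
            parts = line.split()
            if parts:
                counts[int(parts[0])] += 1
    return counts


def imbalance_ratio(counts):
    """Ratio between the most and least frequent class."""
    return max(counts.values()) / min(counts.values())


# Two stand-in label files: 11 cars (class 0) and 1 pedestrian (class 1) each
one_file = "0 0.5 0.5 0.2 0.2\n" * 11 + "1 0.7 0.6 0.1 0.1"
counts = class_counts([one_file, one_file])
ratio = imbalance_ratio(counts)

print(dict(counts))
if ratio >= 10:
    print(f"Imbalance ratio {ratio:.1f}:1, consider rebalancing")
```

Running this on each dataset version makes it easy to see whether newly annotated batches are closing the gap or widening it.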
Train / Val / Test Splits
Scene Diversity Requirements
| Dimension | What to Cover | Why It Matters |
|---|---|---|
| Lighting | Daylight, dusk, night, artificial, direct sun, shadow | Models trained only on bright images fail at night |
| Weather | Clear, overcast, rain, fog, snow | Automotive CV needs all-weather robustness |
| Occlusion | 0%, 30%, 60%, 90% occluded instances | Partially visible objects are the most common real-world condition |
| Viewpoint | Frontal, lateral, overhead, oblique, rear | Object appearance changes dramatically with angle |
| Scale | Near, mid, far (varying object pixel sizes) | Small object detection requires dedicated examples <32px |
| Background | Simple, cluttered, similar-colour to object | Cluttered backgrounds increase false positives |
QA Layers for Production Annotation
Data Augmentation Strategies
Augmentation transforms existing annotated images to create additional training examples without additional annotation cost. A well-configured augmentation pipeline can multiply the effective dataset size 5–15x and dramatically improve model robustness to real-world variation.
The standard library is Albumentations (Python): a fast, CPU-based library built on NumPy and OpenCV that is bounding-box and keypoint aware, with 70+ transforms. Roboflow's web interface offers visual augmentation configuration without code.
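"Bounding-box aware" means the labels are transformed together with the pixels. As a library-independent illustration, a horizontal flip on YOLO labels only mirrors the x centre; `hflip_yolo` is a hypothetical helper written for this example, not an Albumentations function.

```python
def hflip_yolo(boxes):
    """Mirror YOLO boxes (class_id, x_center, y_center, width, height) for a flipped image."""
    return [(class_id, 1.0 - xc, yc, w, h) for class_id, xc, yc, w, h in boxes]


boxes = [(0, 0.512, 0.438, 0.234, 0.187), (1, 0.721, 0.612, 0.098, 0.124)]
flipped = hflip_yolo(boxes)

for class_id, xc, yc, w, h in flipped:
    print(class_id, f"{xc:.3f} {yc:.3f} {w:.3f} {h:.3f}")
```

Width, height, and the y centre stay unchanged; an augmentation pipeline that flips the image but leaves the labels untouched would corrupt every box that is not centred.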
Albumentations Quick-Start

    import albumentations as A
    from albumentations.pytorch import ToTensorV2

    transform = A.Compose([
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(p=0.3),
        A.HueSaturationValue(hue_shift_limit=10, p=0.3),
        A.GaussNoise(var_limit=(10, 50), p=0.2),  # argument name varies by Albumentations version; check your release
        A.MotionBlur(blur_limit=7, p=0.2),
        A.RandomShadow(p=0.2),  # for outdoor/automotive datasets
        A.Rotate(limit=15, p=0.4),
        ToTensorV2()
    ], bbox_params=A.BboxParams(
        format='yolo',  # or 'coco', 'pascal_voc', 'albumentations'
        label_fields=['class_labels']
    ))

    # Usage: pass the image, its boxes, and the matching labels together
    # augmented = transform(image=image, bboxes=bboxes, class_labels=class_labels)

Annotation Tools Compared
The annotation tool you choose affects annotator productivity, QA workflow, export format support, and team collaboration. There is no universal best tool — the right choice depends on team size, annotation type, budget, and hosting preference.
CVAT
Computer Vision Annotation Tool, originally developed by Intel. The most feature-complete open-source tool. Supports bounding box, polygon, polyline, keypoints, segmentation, and 3D point clouds. Built-in review workflow, task assignment, and annotation statistics.
- Free forever on cloud.cvat.ai
- All annotation types supported
- Built-in review and task management
- YOLO, COCO, Pascal VOC, MOT export
- Semi-automatic annotation with SAM
Label Studio
Most flexible open-source tool. Handles images, audio, text, video, NLP, and time-series in one interface. ML-assisted labelling, prediction import, and active learning loop support.
- Multi-modal: image, text, audio, video
- ML backend integration (any model)
- Active learning and pre-label support
- Large plugin ecosystem
- Free self-hosted, paid cloud
Roboflow
Fastest to start: upload images, annotate, apply augmentations, version the dataset, and train a model in one platform. Excellent for solo researchers and small teams who want to iterate quickly.
- Auto-annotation with SAM (segment anything)
- Built-in dataset versioning and health checks
- One-click augmentation + export
- Integrated model training (Roboflow Train)
- Direct YOLO/COCO/TFRecord export
Scale AI / V7 Darwin
Managed annotation services combined with a tool platform. Scale AI provides the annotator workforce; you provide the data and guidelines. V7 Darwin specialises in medical-grade annotation with SAM-powered auto-labelling.
- Managed workforce (no hiring needed)
- SLA-backed quality guarantees
- Advanced QA pipeline built in
- V7: medical and pharma-grade quality
- API-first for ML pipeline integration
Tool Selection Matrix
| Scenario | Best Tool | Why |
|---|---|---|
| Solo researcher, tight budget | Roboflow (free) | Fastest setup, integrated training, generous free tier |
| Team of 5–20 annotators, bounding box focus | CVAT cloud | Free, built-in task assignment, all formats |
| Multi-modal (images + text + audio) | Label Studio | Only tool that handles all modalities well |
| Enterprise, need managed workforce | Scale AI | Full service — tool + annotators + QA |
| Medical / pharma grade precision | V7 Darwin | Highest QA standards, audit trails, ISO workflows |
| Outsourcing to India | Data Terminal | End-to-end: guidelines → annotation → QA → YOLO/COCO export |
Building Your Annotation Pipeline
An annotation pipeline is a repeatable system that takes raw images as input and produces export-ready annotated datasets as output — consistently, at scale, without quality degradation over time. Most annotation failures are pipeline failures, not annotator failures.
Outsourcing to India: Cost & Quality Guide
India has become a leading global hub for computer vision data annotation. English literacy, a large technically educated workforce, competitive pricing, and flexible timezone coverage (IST = UTC+5:30, so the Indian working day overlaps with the European morning and evening shifts overlap with the US morning) make Indian annotation providers well positioned for global AI teams.
2026 India Annotation Pricing
What to Ask an Indian Annotation Vendor
Before committing to any annotation vendor, ask these questions. A serious provider answers all of them without hesitation:
| Question | What a Good Answer Looks Like |
|---|---|
| What is your IAA score on a calibration set? | Provides a number ≥ 0.85 for bounding box, with a protocol for how it is measured |
| Can you provide a free sample annotation? | Yes — typically 50–100 images at no cost before contract signing |
| Which formats do you export? | YOLO, COCO JSON, Pascal VOC at minimum; ideally also CVAT XML, TFRecord |
| What is your QA process? | Describes specific layers: auto-check + peer review + senior audit. Not just 'we have QA' |
| Do you sign an NDA before receiving data? | Yes, immediately. No negotiation on this point. |
| What is your turnaround time for 10,000 images? | Should give a specific timeline with daily capacity figure |
Getting Started: Your First 1,000 Images
The hardest part of starting a computer vision project is not the model — it is getting from zero annotated images to a working baseline quickly, without making decisions you will regret at 50,000 images. Here is the fastest path.
Frequently Asked Questions
Get a Free 100-Image Sample Annotation
YOLO · COCO · Pascal VOC · 99.5% IoU · 4-layer QA · NDA before data receipt
+91-9014387222 · contact@dataterminal.co · HITEC City, Hyderabad