Computer Vision Study Guide
📖 Core Concepts
Computer Vision (CV) – The study of how computers can obtain high‑level understanding from images or video (e.g., recognize objects, infer 3‑D structure, make decisions).
Image Understanding – Converts raw pixel data into symbolic descriptions (objects, actions, scene layout) using geometry, physics, statistics, and learning.
Scope of CV Tasks – Acquire → process → analyze → understand → produce numerical or symbolic output (classification, pose, decision).
Hierarchy of Abstraction
Low‑level: edges, textures, regions.
Mid‑level: boundaries, surfaces, volumes.
High‑level: objects, scenes, events.
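A concrete taste of the low level: gradient magnitude via the standard 3×3 Sobel kernels, sketched in plain NumPy (the explicit loop is for clarity, not speed):

```python
import numpy as np

def sobel_magnitude(img):
    """Low-level processing sketch: gradient magnitude via 3x3 Sobel kernels."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            patch = img[y - 1:y + 2, x - 1:x + 2]
            gx[y, x] = np.sum(kx * patch)
            gy[y, x] = np.sum(ky * patch)
    return np.hypot(gx, gy)

# A vertical step edge: strong response at the boundary, zero in flat regions.
img = np.zeros((8, 8)); img[:, 4:] = 1.0
mag = sobel_magnitude(img)
```

Edges, the lowest rung of the hierarchy, fall out of nothing more than local intensity differences.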
Distinctions
Image Processing: input → transformed image (e.g., filtering).
Computer Vision: input → analysis/decision (may output a description, not an image).
Machine Vision: CV in controlled, real‑time industrial settings (fixed lighting, actuator integration).
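The image-processing vs. computer-vision distinction is visible in the return types: one gives back another image, the other a symbolic decision. Both functions below are toy sketches, not library APIs:

```python
import numpy as np

def box_blur(img):
    """Image processing: image in, transformed image out."""
    out = img.astype(float).copy()
    h, w = img.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y, x] = img[y - 1:y + 2, x - 1:x + 2].mean()
    return out  # output is still an image

def classify_brightness(img, thresh=0.5):
    """Computer vision (toy): image in, symbolic label out."""
    return "bright" if img.mean() > thresh else "dark"

img = np.full((8, 8), 0.8)
```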
---
📌 Must Remember
Recognition vs. Identification – Recognition = class label (e.g., “car”). Identification = specific instance (e.g., “my red Toyota”).
Detection – Scan whole image, return locations (bounding boxes) of objects of interest.
Pose Estimation – Find 3‑D position + orientation of an object relative to the camera.
Optical Flow – Apparent 2‑D motion field of each pixel between consecutive frames.
Egomotion – Rigid 3‑D motion of the camera itself (rotation + translation).
SLAM (Simultaneous Localization & Mapping) – Builds a metric map while estimating the robot/vehicle’s pose.
Segmentation – Partition image into meaningful regions (foreground/background, object parts).
Co‑segmentation – Jointly segment the same object across multiple images/videos.
Scale‑Space – Multi‑scale representation that reveals structures at appropriate spatial scales (e.g., Gaussian pyramid).
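The Gaussian pyramid, the simplest scale-space, can be sketched as repeated blur-then-subsample. The separable [1 2 1]/4 binomial kernel below is a common small-Gaussian approximation:

```python
import numpy as np

def binomial_blur(img):
    """Separable [1 2 1]/4 blur, a cheap approximation to a small Gaussian."""
    k = np.array([0.25, 0.5, 0.25])
    # pad with edge values so the output keeps the input shape
    p = np.pad(img, 1, mode="edge")
    rows = k[0] * p[:-2, 1:-1] + k[1] * p[1:-1, 1:-1] + k[2] * p[2:, 1:-1]
    p = np.pad(rows, 1, mode="edge")
    return k[0] * p[1:-1, :-2] + k[1] * p[1:-1, 1:-1] + k[2] * p[1:-1, 2:]

def gaussian_pyramid(img, levels=3):
    """Blur then subsample by 2 at each level: coarser levels keep coarser structure."""
    pyr = [img]
    for _ in range(levels - 1):
        img = binomial_blur(img)[::2, ::2]
        pyr.append(img)
    return pyr

pyr = gaussian_pyramid(np.random.default_rng(0).random((16, 16)), levels=3)
```

Each level halves the resolution, so structures smaller than the blur radius vanish while large-scale layout survives.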
---
🔄 Key Processes
Pre‑processing – Build a scale‑space, normalize illumination, denoise.
Feature Extraction – Detect edges, corners, blobs, or dense descriptors (SIFT, ORB).
Detection / Segmentation
Detection: slide‑window or region‑proposal → confidence scores → bounding boxes.
Segmentation: assign each pixel to a region (thresholding, graph‑cut, CNN‑based masks).
Higher‑Level Processing (application‑specific)
Recognition: classification of detected regions (CNN, SVM).
Parameter Estimation: pose, size, shape fitting (PnP, ICP).
Motion Analysis: tracking, optical flow, egomotion estimation.
Scene Reconstruction: triangulate points → point cloud → surface mesh.
Decision Making – Pass/fail, match/no‑match, alert generation.
Typical recognition pipeline:
$$\text{Image} \;\xrightarrow{\text{pre‑process}}\; \text{Clean image} \;\xrightarrow{\text{extract}}\; \text{Features} \;\xrightarrow{\text{detect/segment}}\; \text{Regions} \;\xrightarrow{\text{classify \& estimate pose}}\; \text{Labels/pose} \;\xrightarrow{\text{decide}}\; \text{Decision}$$
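A toy end-to-end version of the pipeline above, using a global threshold as the "detector" and a foreground-area check as the "decision" (the threshold and minimum area are arbitrary illustrative values):

```python
import numpy as np

def pipeline(img, thresh=0.5, min_area=4):
    """Toy pipeline: normalize -> segment -> bounding box -> pass/fail decision."""
    # pre-process: scale intensities to [0, 1]
    img = (img - img.min()) / (img.max() - img.min() + 1e-9)
    # detect/segment: foreground mask by global threshold
    mask = img > thresh
    if not mask.any():
        return None, "no object"
    # localize: tight bounding box around foreground pixels
    ys, xs = np.nonzero(mask)
    box = (ys.min(), xs.min(), ys.max(), xs.max())
    # decide: pass/fail based on foreground area
    decision = "pass" if mask.sum() >= min_area else "fail"
    return box, decision

img = np.zeros((10, 10)); img[3:6, 4:8] = 1.0
box, decision = pipeline(img)
```

Real systems replace each stage with something far stronger (CNN detectors, graph-cut masks), but the stage boundaries stay the same.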
---
🔍 Key Comparisons
Computer Vision vs. Image Processing
CV → analysis → symbolic output.
Image Processing → transformation → another image.
Computer Vision vs. Computer Graphics
Graphics: model → image.
CV: image → model.
Machine Vision vs. General Computer Vision
Machine Vision: real‑time, controlled lighting, actuator feedback.
CV: broader research scope, may tolerate offline processing.
Recognition vs. Detection vs. Identification
Detection: “where is an object?” (bounding box).
Recognition: “what is it?” (class label).
Identification: “which specific instance?” (ID).
Optical Flow vs. Egomotion
Optical flow: pixel‑wise apparent motion.
Egomotion: 3‑D rigid motion of the camera; derived from flow + depth.
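The brightness-constancy idea behind optical flow can be sketched as a single-patch Lucas-Kanade solve: least squares on Ix·u + Iy·v + It = 0, one (u, v) for the whole patch. The Gaussian blob and its 0.5 px shift are synthetic test data:

```python
import numpy as np

def lucas_kanade_patch(I1, I2):
    """Estimate one (u, v) for a whole patch from the brightness-constancy
    equations Ix*u + Iy*v + It = 0, solved in least squares."""
    Ix = np.gradient(I1, axis=1)
    Iy = np.gradient(I1, axis=0)
    It = I2 - I1
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v

# Synthetic smooth blob shifted 0.5 px to the right between frames.
ys, xs = np.mgrid[0:32, 0:32]
blob = lambda cx: np.exp(-(((xs - cx) ** 2 + (ys - 16) ** 2) / 40.0))
u, v = lucas_kanade_patch(blob(16.0), blob(16.5))
```

Note what this gives you: pixel motion only. Turning (u, v) into camera egomotion or object speed needs depth or extra geometric constraints.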
---
⚠️ Common Misunderstandings
“CV = Deep Learning” – Classic geometry‑based methods (e.g., PnP, SLAM) remain essential, especially where data is scarce.
“Image processing yields decisions” – Pure image processing stops at an enhanced image; decisions require higher‑level interpretation.
“Optical flow gives absolute object speed” – It provides relative pixel motion; depth is needed for real‑world speed.
“Segmentation always produces perfect object masks” – Occlusions, similar textures, and lighting can cause leakage or missing parts.
“Medical imaging is a separate field” – It heavily relies on CV techniques (CNNs for disease detection, registration for multimodal scans).
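The "flow is not speed" point reduces to one pinhole-camera formula: a flow of Δx pixels at depth Z with focal length f (in pixels) corresponds to Z·Δx/f metres of lateral motion. All numbers below are illustrative:

```python
def metric_speed(flow_px, depth_m, focal_px, dt_s):
    """Pinhole-camera conversion of image-plane flow to lateral metric speed:
    flow_px pixels of displacement at depth depth_m corresponds to
    depth_m * flow_px / focal_px metres of lateral motion per frame."""
    return depth_m * flow_px / focal_px / dt_s

# 10 px of flow per frame at 30 fps, object 5 m away, 500 px focal length:
speed = metric_speed(flow_px=10, depth_m=5.0, focal_px=500, dt_s=1 / 30)
```

The same 10 px flow at 50 m depth would mean ten times the speed, which is exactly why flow alone cannot give absolute velocity.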
---
🧠 Mental Models / Intuition
“Seeing → Understanding → Acting” – Imagine a human looking at a scene: first low‑level edges appear, then parts are grouped, then objects are recognized, finally actions are decided.
Scale‑Space as “Zoom Levels” – Small Gaussian blur = fine details; large blur = coarse structures. Different tasks (edge detection vs. object detection) operate at different “zoom levels”.
Feature Hierarchy – Corners are interest points where two edges meet; clusters of corners form keypoints that survive across scale/rotation, serving as reliable anchors.
---
🚩 Exceptions & Edge Cases
Real‑time constraints – In machine vision, algorithmic complexity must be bounded; lightweight descriptors (ORB) may replace heavy CNNs.
Extreme lighting / motion blur – Standard edge detectors fail; need robust preprocessing (de‑blurring, HDR techniques).
Textureless regions – Optical flow becomes ambiguous; incorporate global priors or use feature tracking instead.
Occlusion in Pose Estimation – If keypoints are hidden, pose may be under‑determined; use model‑based fitting or multiple views.
---
📍 When to Use Which
| Situation | Preferred Method |
|-----------|------------------|
| Fast, controlled industrial inspection | Machine‑vision pipeline → simple filters + fixed‑pattern detectors |
| Variable illumination, complex objects | Deep‑learning based detection/segmentation (e.g., Mask R‑CNN) |
| Need exact 3‑D geometry from few images | Classical multi‑view stereo + bundle adjustment |
| Tracking many points in texture‑rich video | Sparse optical flow (Lucas‑Kanade) or dense flow if GPU available |
| Estimating camera motion (SLAM) in unknown environment | Visual‑odometry + loop‑closure (feature‑based SLAM) |
| Identifying a specific individual (face ID) | Face embedding + nearest‑neighbor search (recognition + identification) |
| Restoring heavily degraded images | Model‑based restoration (e.g., non‑local means, deep de‑blurring) |
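The face-ID row combines recognition-style embeddings with nearest-neighbour search over cosine similarity. The 4‑D vectors below are made-up stand-ins for real face-network outputs:

```python
import numpy as np

def identify(query, gallery):
    """Nearest-neighbour identification: return the gallery ID whose embedding
    has the highest cosine similarity to the query embedding."""
    def unit(v):
        return v / np.linalg.norm(v)
    q = unit(query)
    best_id, best_sim = None, -1.0
    for person_id, emb in gallery.items():
        sim = float(q @ unit(emb))
        if sim > best_sim:
            best_id, best_sim = person_id, sim
    return best_id, best_sim

# Hypothetical 4-D embeddings standing in for real face-network outputs.
gallery = {"alice": np.array([1.0, 0.1, 0.0, 0.0]),
           "bob":   np.array([0.0, 1.0, 0.2, 0.0])}
who, sim = identify(np.array([0.9, 0.15, 0.05, 0.0]), gallery)
```

This is identification (which instance?) layered on top of recognition (a face-shaped embedding), matching the distinction from the Must Remember section.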
---
👀 Patterns to Recognize
Edge → Corner → Keypoint – A chain that often indicates a robust feature for matching.
Repeating texture + uniform color – Signals potential failure of pure intensity‑based flow; consider gradient‑based or feature‑based tracking.
Sharp intensity gradient + high curvature – Likely object boundary → good seed for segmentation.
Temporal consistency of masks – When masks change slowly across frames, co‑segmentation can exploit this for better stability.
Large motion vectors + blurred edges – Indicates motion blur → image restoration needed before reliable feature extraction.
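The edge → corner pattern is exactly what the Harris response measures: two strong gradient directions in one window. A minimal NumPy sketch with 3×3 window sums (k = 0.05 is the usual empirical constant):

```python
import numpy as np

def harris_response(img, k=0.05):
    """Harris response R = det(M) - k*trace(M)^2 of the structure tensor M,
    accumulated over a 3x3 window. R > 0 at corners, R < 0 on edges, R ~ 0 in flats."""
    Ix = np.gradient(img, axis=1)
    Iy = np.gradient(img, axis=0)
    def window_sum(a):
        p = np.pad(a, 1)
        n, m = a.shape
        return sum(p[1 + dy:n + 1 + dy, 1 + dx:m + 1 + dx]
                   for dy in (-1, 0, 1) for dx in (-1, 0, 1))
    Sxx, Syy, Sxy = window_sum(Ix * Ix), window_sum(Iy * Iy), window_sum(Ix * Iy)
    return (Sxx * Syy - Sxy ** 2) - k * (Sxx + Syy) ** 2

# Bright square on dark background: corners score higher than edge midpoints.
img = np.zeros((16, 16)); img[4:12, 4:12] = 1.0
R = harris_response(img)
```

An edge has one large eigenvalue of M (det ≈ 0, so R goes negative); a corner has two (det large, R positive), which is why corners survive as matching anchors.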
---
🗂️ Exam Traps
“All computer‑vision systems output a class label.” – Many output continuous values (pose, depth, flow) or binary decisions (pass/fail).
“Optical flow directly yields depth.” – Depth requires additional constraints (stereo baseline, known motion).
“Machine vision is just a hardware problem.” – Algorithmic design (real‑time detection, lighting normalization) is equally critical.
“Segmentation always precedes detection.” – In many pipelines, detection (region proposals) comes first, then segmentation refines the region.
“Deep learning eliminates the need for preprocessing.” – Pre‑processing (normalization, scale‑space) still improves robustness and training stability.
---