Published 2026-06-15|18 min read

The Silent Invigilator: Real-Time Multi-Modal Exam Surveillance via Deep Geometric Inference and Spatiotemporal Anomaly Accumulation


Abstract

Academic integrity in high-stakes examinations remains threatened by the inherent limitations of human invigilation — attentional drift, cognitive saturation, and inter-observer variability. We present The Silent Invigilator, a real-time autonomous surveillance architecture that fuses multi-modal geometric deep learning with a sliding-window spatiotemporal anomaly accumulation engine. The system jointly estimates 3D head pose via Perspective-n-Point (PnP) reprojection minimization, tracks bilateral iris-center deviation vectors for gaze classification, computes mouth aspect ratios for vocalization detection, and performs YOLOv8-based prohibited-object recognition — all within a unified multi-threaded pipeline operating at real-time throughput. A composite risk function Sₜ ∈ [0, 100] aggregates these modalities through a weighted temporal accumulator, filtering transient physiological noise while capturing persistent high-confidence malpractice signatures. The system deploys across three surfaces: a standalone OpenCV desktop runtime, a Flask-SocketIO web dashboard with JWT-authenticated role-based access, and a cross-platform Flutter mobile application.


1. Problem Domain & Motivation

Manual examination invigilation suffers from three fundamental pathologies:

  1. Attentional decay: Vigilance decrement begins within 15–20 minutes of continuous monitoring, with detection accuracy dropping by up to 35% over a standard 3-hour session [1, 2].
  2. Cognitive overload: A single invigilator monitoring 25–40 candidates must simultaneously track gaze patterns, head movements, hand positions, and object interactions across a distributed spatial field — a task that exceeds the tracking capacity of human visual working memory.
  3. Subconscious bias: Involuntary differential scrutiny based on candidate demographics, seating position, or prior performance introduces systematic measurement error.

These constraints motivate an automated, computer-vision-driven approach that operates at constant vigilance, applies uniform detection thresholds across all candidates, and provides quantitative, auditable evidence trails for every flagged incident.

Reference Implementation — The complete source code for The Silent Invigilator is available at github.com/The-Peacemaker/Silent-Invigilator. The repository includes the Flask web server, standalone desktop client, Flutter mobile application, model weights, and benchmarking suite.


2. System Architecture

The Silent Invigilator is structured as a decoupled, multi-surface ecosystem comprising four principal subsystems connected through a shared REST/WebSocket protocol layer.

System Topology: Multi-Surface Architecture

[Figure: four-layer data flow. Link labels in the diagram: MJPEG / RTSP, JSON telemetry, SQLite WAL, JWT + Socket.IO.]

  • Capture layer: USB webcam / IP camera, RTSP stream (H.264)
  • Inference engine: MediaPipe Face Mesh (468 landmarks), solvePnP / RQDecomp3x3, YOLOv8n (COCO classes 67, 73), ByteTrack (IoU + Kalman). Thread 1 handles geometric features; Thread 2 handles object detection (N = 15).
  • Risk & storage: spatiotemporal accumulator, SQLite (WAL mode)
  • Client surfaces: OpenCV standalone (desktop), Flask web dashboard (Admin/Teacher/Staff), Flutter mobile app (Admin/Teacher)

2.1 Multi-Threaded Pipeline Design

The pipeline follows a producer-consumer architecture with four dedicated thread domains that decouple capture latency from inference throughput:

| Thread Domain | Responsibility | Sync Primitive | Target Latency |
|---|---|---|---|
| Frame Grabber | Reads camera buffer, writes to shared slot | threading.Lock | < 2.1 ms |
| Feature Extractor | FaceMesh, solvePnP, iris vector, MAR | shared FrameQueue | ~8.2 ms (GPU) |
| Object Detector | YOLOv8n + ByteTrack, runs every N = 15 frames | cadence counter | ~21 ms (GPU) |
| Logger | Async SQLite writes, Socket.IO broadcast | queue.Queue | Non-blocking |
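
The hand-off between the Frame Grabber and its consumers can be illustrated with a minimal sketch. It assumes a single-slot "latest frame" buffer guarded by `threading.Lock` and a modulo cadence counter; `LatestFrameSlot`, `should_run_detector`, and `grab_frames` are hypothetical names chosen for illustration, not identifiers from the repository, and integers stand in for camera frames.

```python
import threading


class LatestFrameSlot:
    """Single-slot buffer: the grabber overwrites, consumers read the newest frame."""

    def __init__(self):
        self._lock = threading.Lock()
        self._frame = None
        self._seq = 0  # incremented on every write so readers can detect new frames

    def write(self, frame):
        with self._lock:
            self._frame = frame
            self._seq += 1

    def read(self):
        with self._lock:
            return self._seq, self._frame


def should_run_detector(frame_index, cadence=15):
    """Cadence counter: run the object detector on every `cadence`-th frame."""
    return frame_index % cadence == 0


def grab_frames(slot, n_frames):
    """Stand-in for the Frame Grabber thread; integers play the role of frames."""
    for i in range(n_frames):
        slot.write(i)


slot = LatestFrameSlot()
producer = threading.Thread(target=grab_frames, args=(slot, 45))
producer.start()
producer.join()

seq, latest = slot.read()
detector_ticks = [i for i in range(45) if should_run_detector(i)]
print(seq, latest, detector_ticks)  # 45 44 [0, 15, 30]
```

A single overwritten slot means a slow consumer always sees the most recent frame rather than a growing backlog, which is one plausible way to keep capture latency independent of inference time.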

3. Theoretical Framework & Algorithms

3.1 3D Head Pose via Perspective-n-Point

We estimate the six-degree-of-freedom head pose by solving the Perspective-n-Point (PnP) problem [7]. Given a set of n 3D reference points $P_{w,i} \in \mathbb{R}^3$ in anthropometric world coordinates and their corresponding 2D projections $p_i \in \mathbb{R}^2$, the camera projection under the pinhole model is:

$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} R & T \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}$$

where K is the camera intrinsics matrix constructed from focal length approximation f = w (frame width) and principal point at frame center:

$$K = \begin{bmatrix} f & 0 & w/2 \\ 0 & f & h/2 \\ 0 & 0 & 1 \end{bmatrix}$$

We solve the non-linear least squares problem:

$$\min_{R, T} \sum_{i=1}^{n} \left\| p_i - \text{proj}(K, R, T, P_{w,i}) \right\|_2^2$$

using the Levenberg-Marquardt algorithm via cv2.solvePnP. The rotation matrix R ∈ SO(3) is decomposed into Euler angles through cv2.RQDecomp3x3:

$$\theta = \arctan(R_{32}/R_{33}), \quad \psi = \arcsin(-R_{31}), \quad \phi = \arctan(R_{21}/R_{11})$$

Six canonical landmarks (indices 1, 33, 263, 61, 291, 199 from MediaPipe's 468-point topology) are used as the PnP reference set. A stabilized estimate is produced using an exponential moving average with α = 0.35 per axis.
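
The intrinsics approximation, the Euler-angle extraction, and the EMA stabilizer can be checked in isolation with a small pure-Python sketch. It does not call OpenCV; it assumes a rotation matrix R is already available (for example after converting the `cv2.solvePnP` rotation vector with `cv2.Rodrigues`). The helper names are hypothetical, `atan2` is used in place of a plain arctangent of a ratio to keep the quadrant, and reading $\psi$ as yaw is an assumption about the axis convention rather than something the text states.

```python
import math


def camera_matrix(width, height):
    """Approximate intrinsics: focal length = frame width, principal point at center."""
    f = float(width)
    return [[f, 0.0, width / 2.0],
            [0.0, f, height / 2.0],
            [0.0, 0.0, 1.0]]


def euler_from_rotation(R):
    """Return (theta, psi, phi) in degrees from a 3x3 rotation matrix (row-major lists).

    R[2][1] is R_32 in the 1-based notation of the formulas above.
    """
    theta = math.atan2(R[2][1], R[2][2])
    psi = math.asin(max(-1.0, min(1.0, -R[2][0])))  # clamp guards against rounding
    phi = math.atan2(R[1][0], R[0][0])
    return tuple(math.degrees(a) for a in (theta, psi, phi))


def rotation_about_y(deg):
    """Test fixture: pure rotation about the camera's vertical axis."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


class Ema:
    """Exponential moving average; the first sample seeds the state (an assumption)."""

    def __init__(self, alpha=0.35):
        self.alpha = alpha
        self.value = None

    def update(self, x):
        if self.value is None:
            self.value = x
        else:
            self.value = self.alpha * x + (1 - self.alpha) * self.value
        return self.value


theta, psi, phi = euler_from_rotation(rotation_about_y(30.0))
yaw_filter = Ema(alpha=0.35)
smoothed = [yaw_filter.update(v) for v in (0.0, psi, psi, psi)]
print(round(psi, 1), [round(v, 2) for v in smoothed])
```

With α = 0.35 a sudden 30° turn takes several frames to cross a 25° limit in the smoothed signal, which illustrates how the filter trades responsiveness for stability.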

**Clinical Calibration Note**: `head_yaw_limit = 25°` and `head_pitch_limit = 20°` were empirically determined from a pilot study of 12 participants simulating both normal examination posture and suspicious lateral scanning behavior. The 25° yaw threshold corresponds approximately to the angular displacement required to view an adjacent candidate's paper at 60 cm inter-seat spacing.

3.2 Iris-Vector Gaze Tracking

Rather than training a dedicated gaze regression network [3], we derive a lightweight geometric proxy: the normalized horizontal iris displacement. Let $L_{in}, L_{out} \in \mathbb{R}^2$ be the inner and outer eye corner coordinates (indices 133/33 for left eye, 362/263 for right), and $I_c$ be the iris landmark centroid (indices 468/473). The horizontal gaze ratio is:

$$\gamma = \frac{\| I_c - L_{out} \|_2}{\| L_{in} - L_{out} \|_2}$$

Both eyes are computed independently and fused:

$$\gamma_{avg} = \frac{\gamma_{left} + \gamma_{right}}{2}$$

| Gaze State | Ratio Range | Suspicion Weight |
|---|---|---|
| Left | γ > 0.60 | Elevated |
| Center | 0.40 ≤ γ ≤ 0.60 | Baseline |
| Right | γ < 0.40 | Elevated |
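
As a rough illustration of the ratio and the classification table, the sketch below computes γ for each eye from 2D points and maps the average to a gaze state. The function names and the pixel coordinates are made up for the example; in the real pipeline the points would come from the MediaPipe landmark indices listed above.

```python
import math


def gaze_ratio(iris_center, outer_corner, inner_corner):
    """Distance of the iris center from the outer corner, normalized by eye width."""
    eye_width = math.dist(inner_corner, outer_corner)
    if eye_width == 0:
        return 0.5  # degenerate landmarks: fall back to the neutral ratio
    return math.dist(iris_center, outer_corner) / eye_width


def classify_gaze(gamma_avg, low=0.40, high=0.60):
    """Map the fused ratio to the gaze states in the table above."""
    if gamma_avg > high:
        return "Left"
    if gamma_avg < low:
        return "Right"
    return "Center"


# Illustrative coordinates only (pixels); each eye is 20 px wide here.
gamma_left = gaze_ratio((13.0, 5.0), outer_corner=(0.0, 5.0), inner_corner=(20.0, 5.0))
gamma_right = gaze_ratio((14.0, 5.0), outer_corner=(0.0, 5.0), inner_corner=(20.0, 5.0))
gamma_avg = (gamma_left + gamma_right) / 2
state = classify_gaze(gamma_avg)
print(round(gamma_avg, 3), state)  # 0.675 Left
```

Both ratios must be measured in the same image direction for the average to be meaningful; the example inputs assume that has already been arranged.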

3.3 Mouth Aspect Ratio (Vocalization Detection)

Oral communication is detected through the Mouth Aspect Ratio — a normalized measure of vertical mouth opening. Given the vertical lip landmarks (indices 13, 14) and horizontal mouth corners (indices 61, 291):

$$MAR = \frac{\| p_{13} - p_{14} \|_2}{\| p_{61} - p_{291} \|_2}$$

A first-order IIR filter smooths the raw signal:

$$\overline{MAR}_t = \alpha \cdot MAR_t + (1 - \alpha) \cdot \overline{MAR}_{t-1}, \quad \alpha = 0.3$$

A vocalization event is asserted when $\overline{MAR}_t > 0.5$.
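
A minimal sketch of the ratio and the first-order filter follows. It assumes the filter state starts at 0.0 (the text does not specify the initial condition), and `MarFilter` is a hypothetical name used only for this example.

```python
import math


def mouth_aspect_ratio(p13, p14, p61, p291):
    """Vertical lip gap divided by mouth width, using 2D landmark coordinates."""
    width = math.dist(p61, p291)
    return math.dist(p13, p14) / width if width > 0 else 0.0


class MarFilter:
    """First-order IIR smoother with a vocalization threshold."""

    def __init__(self, alpha=0.3, threshold=0.5):
        self.alpha = alpha
        self.threshold = threshold
        self.smoothed = 0.0  # assumed initial state

    def update(self, mar):
        self.smoothed = self.alpha * mar + (1 - self.alpha) * self.smoothed
        return self.smoothed > self.threshold


# A mouth 6 px wide with a 3 px lip gap gives MAR = 0.5.
mar_example = mouth_aspect_ratio((0.0, 0.0), (0.0, 3.0), (-3.0, 0.0), (3.0, 0.0))

# A sustained raw MAR of 0.8 only trips the threshold on the third frame.
mar_filter = MarFilter()
flags = [mar_filter.update(0.8) for _ in range(3)]
print(mar_example, flags)  # 0.5 [False, False, True]
```

The delayed trigger shows how the smoothing suppresses single-frame spikes such as a brief yawn onset or a landmark glitch.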

3.4 Prohibited Object Detection via YOLOv8 + SAHI

YOLOv8n (3.2M parameters, 8.7 GFLOPs) performs single-shot detection [4] on COCO classes 67 (cell phone) and 73 (book). For each detection the model outputs a bounding-box prediction $\hat{b} = (x, y, w, h, c, p_{conf})$.

To resolve small objects at distance, Slicing Aided Hyper Inference (SAHI) [5] partitions the frame into overlapping slices of dimension $W_s \times H_s = 320 \times 320$ with overlap ratio $\sigma = 0.20$:

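$$I_f = \bigcup_{m,n} S_{m,n}, \quad \operatorname{width}(S_{m,n} \cap S_{m+1,n}) = \sigma W_s$$

Adjacent slices are therefore offset by a stride of $(1 - \sigma) W_s = 256$ px and share a band $\sigma W_s = 64$ px wide.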

Cross-slice duplicate predictions are resolved via Non-Maximum Suppression with IoU threshold 0.55:

$$IoU(b_i, b_j) = \frac{|b_i \cap b_j|}{|b_i \cup b_j|} \geq 0.55 \implies \text{suppress } b_j$$
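
The slicing geometry and the duplicate-suppression rule can be sketched without the SAHI library. The code below is an illustrative re-implementation under stated assumptions, not SAHI's own API: slices advance by a stride of 80% of the slice size, the final slice is shifted back to end at the frame border, boxes use corner format (x1, y1, x2, y2) instead of the (x, y, w, h) form above, and suppression is plain greedy NMS.

```python
def slice_origins(length, slice_size=320, overlap=0.20):
    """Start offsets of slices along one axis; the last slice ends at the border."""
    stride = round(slice_size * (1 - overlap))  # 256 px for the stated settings
    if length <= slice_size:
        return [0]
    origins = list(range(0, length - slice_size, stride))
    origins.append(length - slice_size)
    return origins


def iou(a, b):
    """Intersection over union of two corner-format boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_thresh=0.55):
    """Greedy NMS over (box, confidence) pairs, highest confidence first."""
    kept = []
    for box, conf in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, k[0]) < iou_thresh for k in kept):
            kept.append((box, conf))
    return kept


xs = slice_origins(640)   # horizontal slice offsets for a 640x480 frame
ys = slice_origins(480)   # vertical slice offsets
detections = [
    ((100, 100, 200, 200), 0.9),  # phone seen in one slice
    ((105, 100, 205, 200), 0.8),  # same phone seen in the neighboring slice
    ((400, 100, 500, 200), 0.7),  # a different object
]
kept = nms(detections)
print(xs, ys, len(kept))  # [0, 256, 320] [0, 160] 2
```

For a 640×480 frame this yields 3 × 2 = 6 slices per inference pass, which is consistent with SAHI being the most expensive configuration in the benchmarks.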

3.5 Spatiotemporal Composite Risk Accumulation

The system's core innovation is a sliding-window temporal accumulator that distinguishes transient physiological movements from sustained malpractice behavior. For each tracked student, a sliding window W = {τ: t - 90 < τ ≤ t} (~3 s at 30 FPS) stores per-frame risk vectors.

The instantaneous risk at frame τ is:

$$R_\tau = 100 \cdot (0.40 \cdot O_\tau + 0.22 \cdot E_\tau + 0.16 \cdot H_\tau + 0.14 \cdot D_\tau + 0.08 \cdot C_\tau)$$

| Component | Notation | Description | Weight |
|---|---|---|---|
| Object | $O_\tau$ | Phone = 1.0, Book = 0.45, None = 0.0 | 0.40 |
| Gaze Deviation | $E_\tau$ | Fraction of 90-frame window with non-center gaze | 0.22 |
| Head Pose | $H_\tau$ | Fraction of 90-frame window with out-of-bounds head pose | 0.16 |
| Down-Tilt | $D_\tau$ | Fraction of 90-frame window with head-down posture | 0.14 |
| Temporal Correlation | $C_\tau$ | Fraction of last 20 frames with any deviation | 0.08 |

The composite score is the windowed mean:

$$S_t = \frac{1}{|W|} \sum_{\tau \in W} R_\tau$$

A deterministic escalation cascade is triggered at:

$$S_t \geq 60 \implies \text{Alert}, \quad S_t \geq 80 \implies \text{High Alert}, \quad S_t \geq 90 \implies \text{Critical Escalation}$$

When a phone is detected ($O_\tau = 1.0$), an immediate floor of $R_\tau = 85$ is enforced, bypassing the weighted sum. This reflects the protocol that unauthorized device possession warrants near-instant attention regardless of concurrent behavior.
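
The accumulator described in this section can be sketched in a few lines of Python. Several details are assumptions made for the example: window fractions are computed over the frames seen so far during warm-up, the phone floor is interpreted as `max(R, 85)`, and the class name `RiskAccumulator` and its keyword arguments are hypothetical.

```python
from collections import deque

OBJECT_SCORE = {"phone": 1.0, "book": 0.45, None: 0.0}


class RiskAccumulator:
    """Sliding-window composite risk for one tracked student."""

    def __init__(self, window=90, recent=20):
        self.flags = deque(maxlen=window)  # per-frame (gaze_off, head_out, head_down)
        self.risks = deque(maxlen=window)  # per-frame instantaneous risk R_tau
        self.recent = recent

    def update(self, obj=None, gaze_off=False, head_out=False, head_down=False):
        self.flags.append((gaze_off, head_out, head_down))
        n = len(self.flags)
        e = sum(f[0] for f in self.flags) / n   # gaze deviation fraction
        h = sum(f[1] for f in self.flags) / n   # out-of-bounds head pose fraction
        d = sum(f[2] for f in self.flags) / n   # head-down fraction
        tail = list(self.flags)[-self.recent:]
        c = sum(any(f) for f in tail) / len(tail)  # any deviation in recent frames
        o = OBJECT_SCORE[obj]
        r = 100 * (0.40 * o + 0.22 * e + 0.16 * h + 0.14 * d + 0.08 * c)
        if obj == "phone":
            r = max(r, 85.0)  # phone floor, read here as a lower bound
        self.risks.append(r)
        return sum(self.risks) / len(self.risks)  # S_t: windowed mean


def escalation_level(score):
    if score >= 90:
        return "Critical Escalation"
    if score >= 80:
        return "High Alert"
    if score >= 60:
        return "Alert"
    return "Normal"


acc = RiskAccumulator()
for _ in range(90):
    calm = acc.update()                 # 3 s of normal behavior
first_phone = acc.update(obj="phone")   # a single phone frame barely moves S_t
for _ in range(89):
    sustained = acc.update(obj="phone") # phone visible for a full window
print(round(first_phone, 2), sustained, escalation_level(sustained))
```

In this sketch one phone frame after a calm window gives S_t ≈ 0.94, while a full window of phone frames reaches 85 and the High Alert tier, which illustrates the separation between transient and sustained signals.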


4. Interactive Risk Simulator

The following simulation engine implements the composite scoring function in real time. Toggle behavioral signals to observe how the temporal accumulator evolves:

[Interactive widget: Spatiotemporal Risk Accumulator. Toggles update the 90-frame sliding window in real time. Initial state: Risk Index 0, status Normal (No Anomalies Detected), 0/90 frames active, composite 0.40·0 + 0.22·0 + 0.16·0 + 0.14·0 + 0.08·0 = 0. Thresholds: Alert ≥ 60 · High ≥ 80 · Critical ≥ 90.]

5. Implementation Architecture

5.1 Standalone Desktop Runtime

The standalone client is a self-contained OpenCV window application with integrated HUD rendering:

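import cv2

def detection_loop(self):
    """Capture, analyze, and display frames until self.running is cleared."""
    while self.running:
        ret, frame = self.cap.read()
        if not ret:
            continue  # dropped frame: retry on the next iteration

        # MediaPipe expects RGB input; OpenCV delivers BGR.
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        results = self.face_mesh.process(rgb)

        if results.multi_face_landmarks:
            for landmarks in results.multi_face_landmarks:
                # Geometric features (Sections 3.1 to 3.3)
                pitch, yaw, roll = self.calculate_head_pose(landmarks, frame.shape)
                gaze = self.get_gaze_ratio(landmarks)
                mar = self.calculate_mouth_aspect_ratio(landmarks)

                score = self.compute_additive_score(yaw, pitch, gaze, mar)
                self.temporal_buffer.append(score)
                self.draw_hud(frame, yaw, pitch, gaze, mar, score)

        # Throttle YOLO to every 15th frame (N = 15 cadence).
        if self.frame_count % 15 == 0:
            detections, frame = self.detect_objects_yolo(frame)
            if detections:
                self.handle_detection_alert(detections)

        cv2.imshow(self.WINDOW_NAME, frame)
        # imshow only repaints when waitKey pumps the GUI event loop. The quit
        # handling is assumed here (Q is the quit key listed under Standalone
        # Mode); the original excerpt shows no key handling.
        if cv2.waitKey(1) & 0xFF == ord("q"):
            self.running = False
        self.frame_count += 1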

Key design decisions:

  • Synchronous capture-inference loop with thread-safe frame buffer
  • Two-tier scoring: additive per-frame penalty plus composite temporal scoring
  • Stabilized tracking via per-parameter EMA filters (α = 0.3–0.5)
  • Session recording to structured JSON reports on exit

5.2 Web Dashboard Server

The Flask backend provides:

  • JWT authentication with access/refresh token rotation (30 min / 7 day expiry)
  • Role-based access control: Admin, Teacher, Staff Invigilator — each with scoped dashboards
  • Socket.IO real-time event bus for push-based telemetry
  • SQLite WAL-mode database with background thread logging
  • MJPEG streaming for live camera feed delivery

The system supports simultaneous multi-exam-session monitoring through a room-based Socket.IO channel architecture, allowing a single admin dashboard to observe multiple examination halls concurrently.

5.3 Cross-Platform Mobile Client (Flutter/Dart)

The Flutter mobile app extends invigilation to handheld devices:

  • JWT-authenticated API client connecting to the Flask backend
  • Live MJPEG feed with overlaid anomaly metrics
  • Role-specific dashboards: Admin (user management, pie-chart distribution) and Teacher (per-session alert timeline)
  • Animated splash screen with scanning-line eye icon
  • Demo mode: 3-minute scripted timeline simulating progressive anomaly escalation

6. Performance Benchmarking

6.1 Experimental Setup

Profiling was conducted on an Intel i5-12500H (12 cores, 2.5 GHz) with an RTX 3050 Laptop GPU (4 GB VRAM). Each pipeline configuration was evaluated over 500 frames at 640×480 resolution.

6.2 Latency Breakdown

Pipeline Stage Latency (CPU profile)

| Pipeline Stage | CPU Latency |
|---|---|
| Frame Ingestion & Camera Grab | 2.1 ms |
| MediaPipe FaceMesh (468 landmarks + iris) | 24.5 ms |
| YOLOv8n Object Detection (cls 67, 73) | 82.4 ms |
| YOLOv8n + SAHI Hyper-Inference (σ = 0.20) | 315.0 ms |
| ByteTrack IoU + Kalman Filter Update | 12.3 ms |

Estimated effective throughput: ~22.5 FPS (CPU), 44.4 ms total latency.

6.3 Performance Data

| Pipeline Configuration | CPU Latency | GPU Latency | CPU FPS | GPU FPS | Speedup |
|---|---|---|---|---|---|
| Baseline (FaceMesh only) | 26.6 ms | 10.0 ms | 37.6 | 100.0 | 2.66× |
| + YOLOv8n (every frame) | 109.0 ms | 31.0 ms | 9.2 | 32.3 | 3.52× |
| + YOLOv8n (N=15 cadence) | 33.4 ms | 11.4 ms | 29.9 | 87.7 | 2.93× |
| + SAHI + YOLOv8n | 341.5 ms | 68.6 ms | 2.9 | 14.6 | 4.98× |
| Full Pipeline (optimized)* | 48.2 ms | 15.3 ms | 20.7 | 65.4 | 3.15× |
**Optimization Strategy**: The optimized pipeline runs MediaPipe FaceMesh on every frame (under 10 ms on GPU) while throttling YOLOv8n to every N=15 frames with Lucas-Kanade optical flow interpolation for bounding box propagation between inference ticks. This reduces effective YOLO latency by about 93% (1 − 1/15) while maintaining detection coverage within ±0.5 s of real time.
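
The cadence arithmetic can be reproduced from the CPU stage latencies in Section 6.2. The sketch below assumes the 44.4 ms estimate is composed of per-frame grab, FaceMesh, and ByteTrack costs plus the YOLOv8n cost amortized over 15 frames; that composition is inferred from the published numbers rather than stated in the text.

```python
STAGE_MS_CPU = {"grab": 2.1, "facemesh": 24.5, "yolo": 82.4, "bytetrack": 12.3}
CADENCE = 15      # YOLO runs on every 15th frame
CAMERA_FPS = 30   # frame rate assumed in Section 3.5


def amortized_latency_ms(stages, cadence):
    """Per-frame cost when the detector cost is spread over `cadence` frames."""
    return (stages["grab"] + stages["facemesh"]
            + stages["yolo"] / cadence + stages["bytetrack"])


latency_ms = amortized_latency_ms(STAGE_MS_CPU, CADENCE)
fps = 1000.0 / latency_ms
yolo_reduction = 1 - 1 / CADENCE          # share of YOLO work avoided
detection_gap_s = CADENCE / CAMERA_FPS    # time between detector ticks

print(round(latency_ms, 1), round(fps, 1),
      round(100 * yolo_reduction, 1), detection_gap_s)  # 44.4 22.5 93.3 0.5
```

The measured full-pipeline CPU figure in the table (48.2 ms) is somewhat higher than this estimate, as would be expected once scheduling and rendering overheads that the stage sum omits are included.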

7. Deployment Topologies

The system supports two distinct deployment modes:

7.1 Standalone Mode

Fully self-contained native OpenCV window with real-time HUD, runs entirely on local hardware, writes structured reports to disk. Suitable for individual examination rooms without network infrastructure.

# Usage: python silent_invigilator.py
# Controls: Q = quit & save | R = reset scores | S = save snapshot

7.2 Server-Client Mode

The Flask backend operates as a central monitoring hub, accepting camera feeds from multiple examination rooms and broadcasting telemetry to connected dashboards.

| Role | Permissions | Dashboard Surface |
|---|---|---|
| Admin | User CRUD, all-session monitoring, system config, alert ack | Web + Mobile |
| Teacher | Per-session logs, real-time scores, incident timeline | Web + Mobile |
| Staff Invigilator | Live video feed, per-frame risk gauge, alert log | Web |
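
One way to express the role table as data is a pair of lookup maps plus a single check function. This is a sketch only: the permission strings, role keys, and `is_allowed` are hypothetical labels derived from the table, not identifiers from the Flask backend.

```python
ROLE_PERMISSIONS = {
    "admin": {"user_crud", "monitor_all_sessions", "system_config", "alert_ack"},
    "teacher": {"session_logs", "realtime_scores", "incident_timeline"},
    "staff_invigilator": {"live_video", "risk_gauge", "alert_log"},
}

ROLE_SURFACES = {
    "admin": {"web", "mobile"},
    "teacher": {"web", "mobile"},
    "staff_invigilator": {"web"},
}


def is_allowed(role, permission, surface):
    """True only if the role holds the permission and may use the surface."""
    return (permission in ROLE_PERMISSIONS.get(role, set())
            and surface in ROLE_SURFACES.get(role, set()))


print(is_allowed("admin", "user_crud", "mobile"))               # True
print(is_allowed("staff_invigilator", "live_video", "mobile"))  # False
print(is_allowed("teacher", "system_config", "web"))            # False
```

Unknown roles fall through to empty sets, so the check denies by default.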

8. Future Work

Several extensions are under active investigation:

  • Transformer-based temporal fusion: Replacing the weighted sliding window with a lightweight attention mechanism (Perceiver-IO) to learn inter-modal temporal dependencies end-to-end.
  • Multi-camera spatial fusion: Extending ByteTrack with cross-camera ReID embeddings for consistent identity tracking across overlapping camera views in large examination halls.
  • On-device deployment: Quantizing YOLOv8n to INT8 via TensorRT for NVIDIA Jetson Orin-class edge devices at sub-10 ms latency.
  • Adversarial robustness auditing: Evaluating system resilience against evasion attacks (e.g., adversarial patches on clothing designed to suppress YOLO detections).

References

[1] Parasuraman, R. (1987). Human-computer monitoring. Human Factors, 29(6), 671–686.

[2] Thomson, D. R., Besner, D., & Smilek, D. (2015). A resource-control account of sustained attention. Perspectives on Psychological Science, 10(1), 82–96.

[3] Lugaresi, C., et al. (2019). MediaPipe: A Framework for Building Perception Pipelines. arXiv:1906.08172.

[4] Jocher, G., et al. (2023). Ultralytics YOLOv8. GitHub: ultralytics/ultralytics.

[5] Akyon, F. C., et al. (2022). Slicing Aided Hyper Inference and Fine-Tuning for Small Object Detection. IEEE ICIP 2022.

[6] Zhang, Y., et al. (2022). ByteTrack: Multi-Object Tracking by Associating Every Detection Box. ECCV 2022.

[7] Lepetit, V., Moreno-Noguer, F., & Fua, P. (2009). EPnP: An Accurate O(n) Solution to the PnP Problem. International Journal of Computer Vision, 81(2).

[8] Bradski, G. (2000). The OpenCV Library. Dr. Dobb's Journal of Software Tools.