Benedict's Notebook

Abstract

Academic integrity in high-stakes examinations remains threatened by the inherent limitations of human invigilation — attentional drift, cognitive saturation, and inter-observer variability. We present The Silent Invigilator, a real-time autonomous surveillance architecture that fuses multi-modal geometric deep learning with a sliding-window spatiotemporal anomaly accumulation engine. The system jointly estimates 3D head pose via Perspective-n-Point (PnP) reprojection minimization, tracks bilateral iris-center deviation vectors for gaze classification, computes mouth aspect ratios for vocalization detection, and performs YOLOv8-based prohibited-object recognition — all within a unified multi-threaded pipeline operating at real-time throughput. A composite risk function Sₜ ∈ [0, 100] aggregates these modalities through a weighted temporal accumulator, filtering transient physiological noise while capturing persistent high-confidence malpractice signatures. The system deploys across three surfaces: a standalone OpenCV desktop runtime, a Flask-SocketIO web dashboard with JWT-authenticated role-based access, and a cross-platform Flutter mobile application.

1. Problem Domain & Motivation

Manual examination invigilation suffers from three fundamental pathologies:

Attentional decay: Vigilance decrement begins within 15–20 minutes of continuous monitoring, with detection accuracy dropping by up to 35% over a standard 3-hour session [1, 2].
Cognitive overload: A single invigilator monitoring 25–40 candidates must simultaneously track gaze patterns, head movements, hand positions, and object interactions across a distributed spatial field — a task that exceeds the tracking capacity of human visual working memory.
Subconscious bias: Involuntary differential scrutiny based on candidate demographics, seating position, or prior performance introduces systematic measurement error.

These constraints motivate an automated, computer-vision-driven approach that operates at constant vigilance, applies uniform detection thresholds across all candidates, and provides quantitative, auditable evidence trails for every flagged incident.

Reference Implementation — The complete source code for The Silent Invigilator is available at github.com/The-Peacemaker/Silent-Invigilator. The repository includes the Flask web server, standalone desktop client, Flutter mobile application, model weights, and benchmarking suite.

2. System Architecture

The Silent Invigilator is structured as a decoupled, multi-surface ecosystem comprising four principal subsystems connected through a shared REST/WebSocket protocol layer.

System Topology — Multi-Surface Architecture

2.1 Multi-Threaded Pipeline Design

The pipeline operates on a producer-consumer architecture with four dedicated thread domains to decouple capture latency from inference throughput:

Thread Domain	Responsibility	Sync Primitive	Target Latency
Frame Grabber	Reads camera buffer, writes to shared slot	threading.Lock	< 2.1 ms
Feature Extractor	FaceMesh, solvePnP, iris vector, MAR	shared FrameQueue	~ 8.2 ms (GPU)
Object Detector	YOLOv8n + ByteTrack, runs every N=15	cadence counter	~ 21 ms (GPU)
Logger	Async SQLite writes, Socket.IO broadcast	queue.Queue	Non-blocking
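The thread layout above can be sketched with the standard library alone. This is a minimal illustration, not the repository's implementation: `FrameSlot` and `run_pipeline` are hypothetical names, frames are synthetic strings instead of camera images, and the feature extractor and cadence-gated object detector are collapsed into one worker thread for brevity.

```python
import queue
import threading
import time


class FrameSlot:
    """Single-slot buffer: the grabber overwrites it, the worker reads the newest frame."""

    def __init__(self):
        self._lock = threading.Lock()
        self._index = -1
        self._frame = None

    def write(self, index, frame):
        with self._lock:
            self._index, self._frame = index, frame

    def read(self):
        with self._lock:
            return self._index, self._frame


def run_pipeline(num_frames=60, cadence=15):
    """Run a toy grabber/worker/logger pipeline and return the logged events."""
    slot = FrameSlot()
    log_queue = queue.Queue()
    grab_done = threading.Event()
    logged = []

    def grabber():
        for i in range(num_frames):
            slot.write(i, f"frame-{i}")  # stand-in for a camera read
            time.sleep(0.001)
        grab_done.set()

    def worker():
        last_seen, processed = -1, 0
        while True:
            finished = grab_done.is_set()  # sample before reading the slot
            index, _frame = slot.read()
            if index != last_seen:
                last_seen = index
                log_queue.put(("features", index))  # runs on every processed frame
                if processed % cadence == 0:
                    log_queue.put(("objects", index))  # cadence-gated detector
                processed += 1
            elif finished:
                break
            else:
                time.sleep(0.0005)
        log_queue.put(None)  # sentinel tells the logger to stop

    def logger():
        while True:
            event = log_queue.get()
            if event is None:
                break
            logged.append(event)

    threads = [threading.Thread(target=fn) for fn in (grabber, worker, logger)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return logged
```

Because the slot holds only the newest frame, a slow worker skips stale frames instead of queueing them, which is the point of decoupling capture from inference. The logger never blocks the worker because all writes go through the queue.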

3. Theoretical Framework & Algorithms

3.1 3D Head Pose via Perspective-n-Point

We estimate the six-degree-of-freedom head pose (rotation and translation) by solving the Perspective-n-Point (PnP) problem [7]. Given a set of n 3D reference points $P_{w,i} \in \mathbb{R}^3$ in anthropometric world coordinates and their corresponding 2D projections $p_i \in \mathbb{R}^2$, the camera projection under the pinhole model is:

$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} R & T \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}$

where K is the camera intrinsics matrix constructed from focal length approximation f = w (frame width) and principal point at frame center:

$K = \begin{bmatrix} f & 0 & w/2 \\ 0 & f & h/2 \\ 0 & 0 & 1 \end{bmatrix}$

We solve the non-linear least squares problem:

$\min_{R, T} \sum_{i=1}^{n} \left\| p_i - \text{proj}(K, R, T, P_{w,i}) \right\|_2^2$

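using cv2.solvePnP, whose default iterative mode refines the pose with the Levenberg-Marquardt algorithm. The returned rotation vector is converted to a rotation matrix R ∈ SO(3) with cv2.Rodrigues, and R is decomposed into Euler angles through cv2.RQDecomp3x3, giving pitch θ, yaw ψ, and roll φ: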

$\theta = \arctan(R_{32}/R_{33}), \quad \psi = \arcsin(-R_{31}), \quad \phi = \arctan(R_{21}/R_{11})$

Six canonical landmarks (indices 1, 33, 263, 61, 291, 199 from MediaPipe's 468-point topology) are used as the PnP reference set. A stabilized estimate is produced using an exponential moving average with α = 0.35 per axis.
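The intrinsics, Euler-angle extraction, and smoothing steps above can be sketched in pure Python. This is an illustration under stated assumptions rather than the project's code: the helper names (`intrinsics`, `euler_from_rotation`, `AngleEMA`, `rotation_about_y`) are hypothetical, the axis and sign conventions are assumed to match the formulas in the text, and `atan2` stands in for the arctangent of a ratio so the result stays defined when the denominator is zero. The PnP solve itself is not reproduced here.

```python
import math


def intrinsics(width, height):
    """Pinhole intrinsics with f = frame width and the principal point at the frame center."""
    f = float(width)
    return [[f, 0.0, width / 2.0],
            [0.0, f, height / 2.0],
            [0.0, 0.0, 1.0]]


def euler_from_rotation(R):
    """Return (pitch, yaw, roll) in degrees from a 3x3 rotation matrix.

    The text uses 1-based indices, so R_31 is R[2][0] here.
    """
    pitch = math.atan2(R[2][1], R[2][2])
    yaw = math.asin(max(-1.0, min(1.0, -R[2][0])))  # clamp guards against rounding
    roll = math.atan2(R[1][0], R[0][0])
    return tuple(math.degrees(a) for a in (pitch, yaw, roll))


class AngleEMA:
    """Per-axis exponential moving average; alpha = 0.35 as stated in the text."""

    def __init__(self, alpha=0.35):
        self.alpha = alpha
        self.state = None

    def update(self, angles):
        if self.state is None:
            self.state = tuple(angles)  # first sample initializes the filter
        else:
            a = self.alpha
            self.state = tuple(a * x + (1 - a) * s
                               for x, s in zip(angles, self.state))
        return self.state


def rotation_about_y(deg):
    """Test helper: rotation matrix for a pure yaw of `deg` degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
```

For example, `euler_from_rotation(rotation_about_y(30.0))` yields a yaw of about 30° with zero pitch and roll, which would exceed a 25° yaw limit.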


**Clinical Calibration Note**: `head_yaw_limit = 25°` and `head_pitch_limit = 20°` were empirically determined from a pilot study of 12 participants simulating both normal examination posture and suspicious lateral scanning behavior. The 25° yaw threshold corresponds approximately to the angular displacement required to view an adjacent candidate's paper at 60 cm inter-seat spacing.

3.2 Iris-Vector Gaze Tracking

Rather than training a dedicated gaze regression network [3], we derive a lightweight geometric proxy: the normalized horizontal iris displacement. Let $L_{in}, L_{out} \in \mathbb{R}^2$ be the inner and outer eye corner coordinates (indices 133/33 for left eye, 362/263 for right), and $I_c$ be the iris landmark centroid (indices 468/473). The horizontal gaze ratio is:

$\gamma = \frac{\| I_c - L_{out} \|}{\| L_{in} - L_{out} \|}$

Both eyes are computed independently and fused:

$\gamma_{avg} = \frac{\gamma_{left} + \gamma_{right}}{2}$

Gaze State	Ratio Range	Suspicion Weight
Left	γ > 0.60	Elevated
Center	0.40 ≤ γ ≤ 0.60	Baseline
Right	γ < 0.40	Elevated
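A small sketch of the ratio and the three-way classification follows. It assumes the ratio is the iris-to-outer-corner distance divided by the corner-to-corner distance, and the function names `gaze_ratio` and `classify_gaze` are illustrative only. Landmark coordinates are plain (x, y) tuples.

```python
import math


def gaze_ratio(iris, inner, outer):
    """Normalized iris displacement: 0 at the outer corner, 1 at the inner corner."""
    span = math.dist(inner, outer)
    if span == 0:
        raise ValueError("eye corners coincide; cannot normalize")
    return math.dist(iris, outer) / span


def classify_gaze(gamma_left, gamma_right, low=0.40, high=0.60):
    """Fuse both eyes by averaging, then apply the thresholds from the table."""
    gamma = (gamma_left + gamma_right) / 2.0
    if gamma > high:
        return gamma, "left"
    if gamma < low:
        return gamma, "right"
    return gamma, "center"
```

Averaging the two eyes damps single-eye landmark jitter, at the cost of hiding cases where only one eye is tracked reliably.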

3.3 Mouth Aspect Ratio (Vocalization Detection)

Oral communication is detected through the Mouth Aspect Ratio — a normalized measure of vertical mouth opening. Given the vertical lip landmarks (indices 13, 14) and horizontal mouth corners (indices 61, 291):

$MAR = \frac{\| p_{13} - p_{14} \|}{\| p_{61} - p_{291} \|}$

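A first-order IIR filter smooths the raw signal:

$\overline{MAR}_t = \alpha \cdot MAR_t + (1 - \alpha) \cdot \overline{MAR}_{t-1}$

where α ∈ (0, 1) is the smoothing factor. A vocalization event is asserted when $\overline{MAR}_{t} > 0.5$.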
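The ratio and its smoothing can be sketched as follows. The names `mouth_aspect_ratio` and `MarFilter` are hypothetical, and the default smoothing factor of 0.4 is an assumption, since the text does not give a value for this filter.

```python
import math


def mouth_aspect_ratio(p13, p14, p61, p291):
    """Vertical lip gap divided by mouth width, using (x, y) landmark tuples."""
    width = math.dist(p61, p291)
    if width == 0:
        raise ValueError("mouth corners coincide; cannot normalize")
    return math.dist(p13, p14) / width


class MarFilter:
    """First-order IIR (exponential) smoother with a vocalization threshold."""

    def __init__(self, alpha=0.4, threshold=0.5):  # alpha is an assumed value
        self.alpha = alpha
        self.threshold = threshold
        self.value = None

    def update(self, mar):
        if self.value is None:
            self.value = mar  # first sample initializes the filter
        else:
            self.value = self.alpha * mar + (1 - self.alpha) * self.value
        return self.value, self.value > self.threshold
```

Thresholding the smoothed value rather than the raw one means a single wide-open frame, such as a yawn onset or a landmark glitch, does not immediately assert a vocalization event.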

3.4 Prohibited Object Detection via YOLOv8 + SAHI

YOLOv8n (3.2M parameters, 8.7 GFLOPs) performs single-shot detection [4] on COCO classes 67 (cell phone) and 73 (book). The model outputs bounding box predictions $\hat{b} = (x, y, w, h, c, p_{conf})$.

To resolve small objects at distance, Slicing Aided Hyper Inference (SAHI) [5] partitions the frame into overlapping slices of dimension $W_s \times H_s = 320 \times 320$ with overlap ratio $\sigma = 0.20$ :

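$I_f = \bigcup_{m,n} S_{m,n}, \quad \text{stride} = (1 - \sigma) W_s, \quad \text{overlap} = \sigma W_s$

Horizontally adjacent slices $S_{m,n}$ and $S_{m+1,n}$ therefore start $(1 - \sigma) W_s = 256$ px apart and share a strip $\sigma W_s = 64$ px wide.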
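To make the slicing geometry concrete, the sketch below computes slice rectangles for a frame. It is a simplified stand-in written for this note, not the SAHI library's API: `slice_starts` and `slice_boxes` are hypothetical helpers, and the choice to place the last slice flush against the frame edge is an assumption.

```python
def slice_starts(length, slice_size=320, overlap=0.20):
    """Start offsets along one axis; the last slice is placed flush with the edge."""
    if length <= slice_size:
        return [0]
    stride = round(slice_size * (1 - overlap))  # 256 px for the values in the text
    starts = []
    pos = 0
    while pos + slice_size < length:
        starts.append(pos)
        pos += stride
    starts.append(length - slice_size)
    return starts


def slice_boxes(width, height, slice_size=320, overlap=0.20):
    """Return (x0, y0, x1, y1) rectangles covering the whole frame, row by row."""
    return [(x, y, x + slice_size, y + slice_size)
            for y in slice_starts(height, slice_size, overlap)
            for x in slice_starts(width, slice_size, overlap)]
```

For a 640×480 frame this gives horizontal starts of 0, 256, and 320 and vertical starts of 0 and 160, so six slices per frame. Each slice needs its own forward pass, which is why slicing multiplies inference cost.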

Cross-slice duplicate predictions are resolved via Non-Maximum Suppression with IoU threshold 0.55:

$IoU(b_i, b_j) = \frac{|b_i \cap b_j|}{|b_i \cup b_j|} \geq 0.55 \implies \text{suppress } b_j$
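The suppression rule can be illustrated with a greedy, class-agnostic NMS over corner-format boxes. The function names and the (x0, y0, x1, y1) box layout are choices made for this sketch. A per-class variant would group detections by class first.

```python
def iou(a, b):
    """Intersection over union for boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, threshold=0.55):
    """Greedy NMS over (box, score) pairs; keeps the highest-scoring box per cluster."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        # Suppress the candidate if it overlaps any kept box at or above the threshold.
        if all(iou(box, kept_box) < threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept
```

Two detections of the same phone from neighboring slices typically overlap heavily once mapped back to frame coordinates, so only the higher-confidence one survives.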

3.5 Spatiotemporal Composite Risk Accumulation

The system's core innovation is a sliding-window temporal accumulator that distinguishes transient physiological movements from sustained malpractice behavior. For each tracked student, a sliding window W = {τ: t - 90 < τ ≤ t} (~3 s at 30 FPS) stores per-frame risk vectors.

The instantaneous risk at frame τ is:

$R_\tau = 100 \cdot (0.40 \cdot O_\tau + 0.22 \cdot E_\tau + 0.16 \cdot H_\tau + 0.14 \cdot D_\tau + 0.08 \cdot C_\tau)$

Component	Notation	Description	Weight
Object	O_τ	Phone = 1.0, Book = 0.45, None = 0.0	0.40
Gaze Deviation	E_τ	Fraction of 90-frame window with non-center gaze	0.22
Head Pose	H_τ	Fraction of 90-frame window with out-of-bounds head pose	0.16
Down-Tilt	D_τ	Fraction of 90-frame window with head-down posture	0.14
Temporal Correlation	C_τ	Fraction of last 20 frames with any deviation	0.08

The composite score is the windowed mean:

$S_t = \frac{1}{|W|} \sum_{\tau \in W} R_\tau$

A deterministic escalation cascade is triggered at:

$S_t \geq 60 \implies \text{Alert}, \quad S_t \geq 80 \implies \text{High Alert}, \quad S_t \geq 90 \implies \text{Critical Escalation}$

When a phone is detected ( $O_{\tau} = 1.0$ ), an immediate floor of $R_{\tau} = 85$ is enforced, bypassing the weighted sum — reflecting the protocol that unauthorized device possession warrants near-instant attention regardless of concurrent behavior.
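The accumulator described in this section can be sketched as below. Several details are assumptions made for illustration: the class and function names are hypothetical, the window fractions are computed over the frames seen so far until the window fills, and the phone floor is applied as max(R, 85) on the per-frame risk.

```python
from collections import deque

WEIGHTS = {"object": 0.40, "gaze": 0.22, "head": 0.16, "down": 0.14, "corr": 0.08}
OBJECT_SCORE = {None: 0.0, "book": 0.45, "phone": 1.0}
PHONE_FLOOR = 85.0


class RiskAccumulator:
    """Sliding-window composite risk score for one tracked student."""

    def __init__(self, window=90, short_window=20):
        self.flags = deque(maxlen=window)  # per-frame (gaze_off, head_oob, head_down)
        self.risks = deque(maxlen=window)  # per-frame instantaneous risk
        self.short_window = short_window

    def update(self, obj=None, gaze_off=False, head_oob=False, head_down=False):
        """Push one frame of observations and return the composite score."""
        self.flags.append((gaze_off, head_oob, head_down))
        n = len(self.flags)
        recent = list(self.flags)[-self.short_window:]
        components = {
            "object": OBJECT_SCORE[obj],
            "gaze": sum(f[0] for f in self.flags) / n,
            "head": sum(f[1] for f in self.flags) / n,
            "down": sum(f[2] for f in self.flags) / n,
            "corr": sum(any(f) for f in recent) / len(recent),
        }
        risk = 100.0 * sum(WEIGHTS[k] * v for k, v in components.items())
        if obj == "phone":
            risk = max(risk, PHONE_FLOOR)  # floor for unauthorized devices
        self.risks.append(risk)
        return sum(self.risks) / len(self.risks)


def escalation(score):
    """Map a composite score to the escalation tiers given in the text."""
    if score >= 90:
        return "Critical Escalation"
    if score >= 80:
        return "High Alert"
    if score >= 60:
        return "Alert"
    return "Normal"
```

Under these assumptions, a phone held in view for a full window settles at a score of 85 (High Alert), while all four behavioral signals sustained without any object contribute at most 100 · (0.22 + 0.16 + 0.14 + 0.08) = 60, which sits right at the Alert threshold.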

4. Interactive Risk Simulator

The following simulation engine implements the composite scoring function in real-time. Toggle behavioral signals to observe how the temporal accumulator evolves:

Spatiotemporal Risk Accumulator (interactive simulator; toggles update the 90-frame sliding window in real time)

Toggle	Weight	Trigger
Prohibited Device (Phone)	0.40	COCO cls 67, immediate floor 85
Prohibited Material (Book)	0.40	COCO cls 73, baseline score 45
Sustained Gaze Aversion	0.22	γ ∉ [0.40, 0.60] over 90-frame window
Head Pose Out-of-Bounds	0.16	|ψ| > 25° or |θ| > 20°
Head Down-Tilt Posture	0.14	Sustained downward pitch deviation
Short-Term Temporal Spikes	0.08	Correlated deviations in last 20 frames

Thresholds: Alert ≥ 60 · High ≥ 80 · Critical ≥ 90

5. Implementation Architecture

5.1 Standalone Desktop Runtime

The standalone client is a self-contained OpenCV window application with integrated HUD rendering:

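import cv2

def detection_loop(self):
    """Capture frames, score each detected face, and run YOLO on a fixed cadence."""
    while self.running:
        ret, frame = self.cap.read()
        if not ret:
            continue  # dropped frame: retry without advancing the cadence counter

        # MediaPipe expects RGB input; OpenCV delivers BGR.
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        results = self.face_mesh.process(rgb)

        if results.multi_face_landmarks:
            for landmarks in results.multi_face_landmarks:
                pitch, yaw, roll = self.calculate_head_pose(landmarks, frame.shape)
                gaze = self.get_gaze_ratio(landmarks)
                mar = self.calculate_mouth_aspect_ratio(landmarks)

                # The per-frame additive score feeds the temporal buffer.
                score = self.compute_additive_score(yaw, pitch, gaze, mar)
                self.temporal_buffer.append(score)
                self.draw_hud(frame, yaw, pitch, gaze, mar, score)

        # Object detection is throttled to every 15th frame.
        if self.frame_count % 15 == 0:
            detections, frame = self.detect_objects_yolo(frame)
            if detections:
                self.handle_detection_alert(detections)

        cv2.imshow(self.WINDOW_NAME, frame)
        # imshow only repaints when the GUI event loop is pumped; Q quits.
        if cv2.waitKey(1) & 0xFF == ord("q"):
            self.running = False
        self.frame_count += 1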

Key design decisions:

Synchronous capture-inference loop with thread-safe frame buffer
Layered scoring: an additive per-frame penalty feeds composite temporal scoring, which maps to three alert tiers
Stabilized tracking via per-parameter EMA filters (α = 0.3–0.5)
Session recording to structured JSON reports on exit

5.2 Web Dashboard Server

The Flask backend implements production-grade architecture:

JWT authentication with access/refresh token rotation (30 min / 7 day expiry)
Role-based access control: Admin, Teacher, Staff Invigilator — each with scoped dashboards
Socket.IO real-time event bus for push-based telemetry
SQLite WAL-mode database with background thread logging
MJPEG streaming for live camera feed delivery

The system supports simultaneous multi-exam-session monitoring through a room-based Socket.IO channel architecture, allowing a single admin dashboard to observe multiple examination halls concurrently.

5.3 Cross-Platform Mobile Client (Flutter/Dart)

The Flutter mobile app extends invigilation to handheld devices:

JWT-authenticated API client connecting to the Flask backend
Live MJPEG feed with overlaid anomaly metrics
Role-specific dashboards: Admin (user management, pie-chart distribution) and Teacher (per-session alert timeline)
Animated splash screen with scanning-line eye icon
Demo mode: 3-minute scripted timeline simulating progressive anomaly escalation

6. Performance Benchmarking

6.1 Experimental Setup

Profiling was conducted on an Intel i5-12500H (12 cores, 2.5 GHz) with an RTX 3050 Laptop GPU (4 GB VRAM). Each pipeline configuration was evaluated over 500 frames at 640×480 resolution.

6.2 Latency Breakdown

Pipeline Stage Latency (CPU profile)

Pipeline Stage	CPU Latency
Frame Ingestion & Camera Grab	2.1 ms
MediaPipe FaceMesh (468 landmarks + iris)	24.5 ms
YOLOv8n Object Detection (cls 67, 73)	82.4 ms
YOLOv8n + SAHI Hyper-Inference (σ = 0.20)	315.0 ms
ByteTrack IoU + Kalman Filter Update	12.3 ms

Estimated effective throughput: ~22.5 FPS (CPU), 44.4 ms total latency
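One decomposition that reproduces the 44.4 ms figure from the CPU stage latencies is to charge frame grab, FaceMesh, and ByteTrack on every frame and amortize a YOLOv8n pass over 15 frames. This is an assumption about how the estimate was obtained, and the function name is illustrative.

```python
def amortized_latency_ms(per_frame_ms, cadenced_ms, cadence):
    """Per-frame cost when one expensive stage runs only every `cadence` frames."""
    return sum(per_frame_ms) + cadenced_ms / cadence


# CPU stage latencies from the list above: grab, FaceMesh, ByteTrack, then YOLOv8n.
cpu_ms = amortized_latency_ms([2.1, 24.5, 12.3], cadenced_ms=82.4, cadence=15)
cpu_fps = 1000.0 / cpu_ms

print(f"{cpu_ms:.1f} ms per frame, {cpu_fps:.1f} FPS")
```

The amortized figure describes average throughput only. The frame on which YOLOv8n actually runs still pays the full 82.4 ms unless detection is moved to its own thread.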

6.3 Performance Data

Pipeline Configuration	CPU Latency	GPU Latency	CPU FPS	GPU FPS	Speedup
Baseline (FaceMesh only)	26.6 ms	10.0 ms	37.6	100.0	2.66×
+ YOLOv8n (every frame)	109.0 ms	31.0 ms	9.2	32.3	3.52×
+ YOLOv8n (N=15 cadence)	33.4 ms	11.4 ms	29.9	87.7	2.93×
+ SAHI + YOLOv8n	341.5 ms	68.6 ms	2.9	14.6	4.98×
Full Pipeline (optimized)*	48.2 ms	15.3 ms	20.7	65.4	3.15×


**Optimization Strategy**: The optimized pipeline runs MediaPipe FaceMesh on every frame (roughly 8 ms on GPU, per the feature-extractor budget in Section 2.1) while throttling YOLOv8n to every N = 15 frames, with Lucas-Kanade optical flow propagating bounding boxes between inference ticks. This cuts amortized YOLO latency by about 93% (1 − 1/15) while keeping detections no more than 0.5 s stale at 30 FPS.
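The two figures in this note follow directly from the cadence. The short check below uses hypothetical helper names and assumes a 30 FPS capture rate, as in the sliding-window description.

```python
def cadence_savings(cadence):
    """Fraction of detector invocations avoided by running every `cadence` frames."""
    return 1.0 - 1.0 / cadence


def worst_case_staleness_s(cadence, fps):
    """Longest time a new object can go undetected between detector ticks."""
    return cadence / fps


savings = cadence_savings(15)               # about 0.933
staleness = worst_case_staleness_s(15, 30)  # 0.5 seconds
```

Raising N trades more latency savings for longer detection gaps, so the cadence is effectively a bound on how quickly a phone can be produced and hidden without being seen by the detector.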

7. Deployment Topologies

The system supports two distinct deployment modes:

7.1 Standalone Mode

Standalone mode is a fully self-contained native OpenCV window with a real-time HUD. It runs entirely on local hardware and writes structured reports to disk, which makes it suitable for individual examination rooms without network infrastructure.

# Usage: python silent_invigilator.py
# Controls: Q = quit & save | R = reset scores | S = save snapshot

7.2 Server-Client Mode

The Flask backend operates as a central monitoring hub, accepting camera feeds from multiple examination rooms and broadcasting telemetry to connected dashboards.

Role	Permissions	Dashboard Surface
Admin	User CRUD, all-session monitoring, system config, alert ack	Web + Mobile
Teacher	Per-session logs, real-time scores, incident timeline	Web + Mobile
Staff Invigilator	Live video feed, per-frame risk gauge, alert log	Web
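The role table can be expressed as a simple lookup that a backend might consult before serving a request. The permission identifiers and the `is_allowed` helper below are hypothetical names invented for this sketch; they are not taken from the repository.

```python
# Permissions and surfaces transcribed from the role table above.
PERMISSIONS = {
    "admin": {"user_crud", "all_sessions", "system_config", "alert_ack"},
    "teacher": {"session_logs", "realtime_scores", "incident_timeline"},
    "staff_invigilator": {"live_feed", "risk_gauge", "alert_log"},
}

SURFACES = {
    "admin": {"web", "mobile"},
    "teacher": {"web", "mobile"},
    "staff_invigilator": {"web"},
}


def is_allowed(role, permission, surface):
    """Return True only if the role holds the permission on the given surface."""
    return (permission in PERMISSIONS.get(role, set())
            and surface in SURFACES.get(role, set()))
```

Unknown roles fall through to empty sets, so the check denies by default.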

8. Future Work

Several extensions are under active investigation:

Transformer-based temporal fusion: Replacing the weighted sliding window with a lightweight attention mechanism (Perceiver-IO) to learn inter-modal temporal dependencies end-to-end.
Multi-camera spatial fusion: Extending ByteTrack with cross-camera ReID embeddings for consistent identity tracking across overlapping camera views in large examination halls.
On-device deployment: Quantizing YOLOv8n to INT8 via TensorRT for NVIDIA Jetson Orin-class edge devices at sub-10 ms latency.
Adversarial robustness auditing: Evaluating system resilience against evasion attacks (e.g., adversarial patches on clothing designed to suppress YOLO detections).

References

[1] Parasuraman, R. (1987). Human-computer monitoring. Human Factors, 29(6), 671–686.

[2] Thomson, D. R., Besner, D., & Smilek, D. (2015). A resource-control account of sustained attention. Perspectives on Psychological Science, 10(1), 82–96.

[3] Lugaresi, C., et al. (2019). MediaPipe: A Framework for Building Perception Pipelines. arXiv:1906.08172.

[4] Jocher, G., et al. (2023). Ultralytics YOLOv8. GitHub: ultralytics/ultralytics.

[5] Akyon, F. C., et al. (2022). Slicing Aided Hyper Inference and Fine-Tuning for Small Object Detection. IEEE ICIP 2022.

[6] Zhang, Y., et al. (2022). ByteTrack: Multi-Object Tracking by Associating Every Detection Box. ECCV 2022.

[7] Lepetit, V., Moreno-Noguer, F., & Fua, P. (2009). EPnP: An Accurate O(n) Solution to the PnP Problem. International Journal of Computer Vision, 81(2).

[8] Bradski, G. (2000). The OpenCV Library. Dr. Dobb's Journal of Software Tools.

The Silent Invigilator: Real-Time Multi-Modal Exam Surveillance via Deep Geometric Inference and Spatiotemporal Anomaly Accumulation