Abstract
Academic integrity in high-stakes examinations remains threatened by the inherent limitations of human invigilation — attentional drift, cognitive saturation, and inter-observer variability. We present The Silent Invigilator, a real-time autonomous surveillance architecture that fuses multi-modal geometric deep learning with a sliding-window spatiotemporal anomaly accumulation engine. The system jointly estimates 3D head pose via Perspective-n-Point (PnP) reprojection minimization, tracks bilateral iris-center deviation vectors for gaze classification, computes mouth aspect ratios for vocalization detection, and performs YOLOv8-based prohibited-object recognition — all within a unified multi-threaded pipeline operating at real-time throughput. A composite risk function Sₜ ∈ [0, 100] aggregates these modalities through a weighted temporal accumulator, filtering transient physiological noise while capturing persistent high-confidence malpractice signatures. The system deploys across three surfaces: a standalone OpenCV desktop runtime, a Flask-SocketIO web dashboard with JWT-authenticated role-based access, and a cross-platform Flutter mobile application.
1. Problem Domain & Motivation
Manual examination invigilation suffers from three fundamental pathologies:
- Attentional decay: Vigilance decrement begins within 15–20 minutes of continuous monitoring, with detection accuracy dropping by up to 35% over a standard 3-hour session [1, 2].
- Cognitive overload: A single invigilator monitoring 25–40 candidates must simultaneously track gaze patterns, head movements, hand positions, and object interactions across a distributed spatial field — a task that exceeds the tracking capacity of human visual working memory.
- Subconscious bias: Involuntary differential scrutiny based on candidate demographics, seating position, or prior performance introduces systematic measurement error.
These constraints motivate an automated, computer-vision-driven approach that operates at constant vigilance, applies uniform detection thresholds across all candidates, and provides quantitative, auditable evidence trails for every flagged incident.
Reference Implementation — The complete source code for The Silent Invigilator is available at github.com/The-Peacemaker/Silent-Invigilator. The repository includes the Flask web server, standalone desktop client, Flutter mobile application, model weights, and benchmarking suite.
2. System Architecture
The Silent Invigilator is structured as a decoupled, multi-surface ecosystem comprising four principal subsystems connected through a shared REST/WebSocket protocol layer.
2.1 Multi-Threaded Pipeline Design
The pipeline operates on a producer-consumer architecture with four dedicated thread domains to decouple capture latency from inference throughput:
| Thread Domain | Responsibility | Sync Primitive | Target Latency |
|---|---|---|---|
| Frame Grabber | Reads camera buffer, writes to shared slot | threading.Lock | < 2.1 ms |
| Feature Extractor | FaceMesh, solvePnP, iris vector, MAR | shared FrameQueue | ~ 8.2 ms (GPU) |
| Object Detector | YOLOv8n + ByteTrack, runs every N = 15 frames | cadence counter | ~ 21 ms (GPU) |
| Logger | Async SQLite writes, Socket.IO broadcast | queue.Queue | Non-blocking |
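The thread layout in the table can be sketched with the standard library alone. The following is a minimal illustration, not the project's code: `LatestFrameSlot`, `run_demo`, and the string "frames" are hypothetical stand-ins for the camera buffer, and unbounded queues are assumed so that the demonstration is deterministic.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a consumer thread to exit


class LatestFrameSlot:
    """Lock-protected single slot: the grabber overwrites, readers see the newest frame."""

    def __init__(self):
        self._lock = threading.Lock()
        self._index = -1
        self._frame = None

    def put(self, frame):
        with self._lock:
            self._index += 1
            self._frame = frame
            return self._index

    def get(self):
        with self._lock:
            return self._index, self._frame


def run_demo(num_frames=45, cadence=15):
    """Push synthetic frames through grabber -> extractor -> logger threads."""
    slot = LatestFrameSlot()
    frame_queue = queue.Queue()  # grabber -> feature extractor
    log_queue = queue.Queue()    # extractor -> logger
    logged = []

    def grabber():
        for i in range(num_frames):
            index = slot.put(f"frame-{i}")  # stands in for a camera read
            frame_queue.put(index)
        frame_queue.put(_STOP)

    def extractor():
        while True:
            index = frame_queue.get()
            if index is _STOP:
                log_queue.put(_STOP)
                return
            # Cadence counter: the object detector only runs every `cadence` frames.
            if index % cadence == 0:
                log_queue.put(("object_pass", index))

    def logger():
        while True:
            event = log_queue.get()
            if event is _STOP:
                return
            logged.append(event)  # stands in for an asynchronous database write

    threads = [threading.Thread(target=fn) for fn in (grabber, extractor, logger)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return slot.get(), logged
```

With 45 frames and a cadence of 15, the logger receives object-detection passes for frames 0, 15, and 30, while the slot ends up holding the last frame written.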
3. Theoretical Framework & Algorithms
3.1 3D Head Pose via Perspective-n-Point
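We estimate the six-degree-of-freedom head orientation by solving the Perspective-n-Point (PnP) problem [7]. Given a set of n 3D reference points $\mathbf{P}_i = (X_i, Y_i, Z_i)^\top$ in anthropometric world coordinates and their corresponding 2D projections $\mathbf{p}_i = (u_i, v_i)^\top$, the camera projection under the pinhole model is:

$$s_i \begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix} = K \, [R \mid \mathbf{t}] \begin{bmatrix} X_i \\ Y_i \\ Z_i \\ 1 \end{bmatrix}$$

where $s_i$ is the projective depth, $R$ and $\mathbf{t}$ are the head rotation and translation, and K is the camera intrinsics matrix constructed from focal length approximation f = w (frame width) and principal point at frame center (for a frame of width w and height h):

$$K = \begin{bmatrix} f & 0 & w/2 \\ 0 & f & h/2 \\ 0 & 0 & 1 \end{bmatrix}$$

We solve the non-linear least squares problem:

$$(R^*, \mathbf{t}^*) = \arg\min_{R,\, \mathbf{t}} \sum_{i=1}^{n} \left\| \mathbf{p}_i - \pi\!\left(K (R\,\mathbf{P}_i + \mathbf{t})\right) \right\|_2^2$$

where $\pi$ denotes perspective division, using the Levenberg-Marquardt algorithm via cv2.solvePnP. The rotation matrix R ∈ SO(3) is decomposed into Euler angles through cv2.RQDecomp3x3:

$$(\theta_{\text{pitch}}, \theta_{\text{yaw}}, \theta_{\text{roll}}) = \operatorname{RQDecomp3x3}(R)$$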
Six canonical landmarks (indices 1, 33, 263, 61, 291, 199 from MediaPipe's 468-point topology) are used as the PnP reference set. A stabilized estimate is produced using an exponential moving average with α = 0.35 per axis.
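The intrinsics approximation and the per-axis smoothing can be illustrated without OpenCV. The sketch below is an assumption-laden illustration rather than the project's implementation: `intrinsics`, `project`, and `AngleEMA` are hypothetical names, the point passed to `project` is assumed to be already in camera coordinates, and the EMA is assumed to weight the newest sample by α.

```python
def intrinsics(width, height):
    """Intrinsics matrix K with f = frame width and principal point at the frame center."""
    f = float(width)
    return [[f, 0.0, width / 2.0],
            [0.0, f, height / 2.0],
            [0.0, 0.0, 1.0]]


def project(K, point_cam):
    """Pinhole projection of a 3D point given in camera coordinates."""
    X, Y, Z = point_cam
    if Z <= 0:
        raise ValueError("point must lie in front of the camera")
    u = K[0][0] * X / Z + K[0][2]
    v = K[1][1] * Y / Z + K[1][2]
    return u, v


class AngleEMA:
    """Per-axis exponential moving average over (pitch, yaw, roll)."""

    def __init__(self, alpha=0.35):
        self.alpha = alpha
        self.state = None

    def update(self, angles):
        if self.state is None:
            # The first sample initializes the filter.
            self.state = tuple(angles)
        else:
            self.state = tuple(
                self.alpha * new + (1.0 - self.alpha) * old
                for new, old in zip(angles, self.state)
            )
        return self.state
```

For a 640×480 frame, a point on the optical axis projects to the principal point (320, 240). A jump in pitch from 10° to 20° moves the smoothed estimate only to 13.5° after one update, which is the damping the stabilized estimate relies on.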
3.2 Iris-Vector Gaze Tracking
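Rather than training a dedicated gaze regression network [3], we derive a lightweight geometric proxy: the normalized horizontal iris displacement. Let $x_a^{(e)}$ and $x_b^{(e)}$ be the horizontal coordinates of the two corners of eye $e$ (inner and outer corner indices 133/33 for the left eye, 362/263 for the right), taken in the same image direction for both eyes, and let $x_c^{(e)}$ be the horizontal coordinate of the iris landmark centroid (indices 468/473, which MediaPipe FaceMesh provides only when iris refinement is enabled). The horizontal gaze ratio can be written as:

$$\gamma_e = \frac{x_c^{(e)} - x_a^{(e)}}{x_b^{(e)} - x_a^{(e)}}$$

so that $\gamma_e \approx 0.5$ corresponds to a centered iris. Both eyes are computed independently and fused:

$$\gamma = \tfrac{1}{2}\left(\gamma_L + \gamma_R\right)$$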
| Gaze State | Ratio Range | Suspicion Weight |
|---|---|---|
| Left | γ > 0.60 | Elevated |
| Center | 0.40 ≤ γ ≤ 0.60 | Baseline |
| Right | γ < 0.40 | Elevated |
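The thresholds in the table translate directly into a small classifier. The following sketch uses hypothetical function names and assumes that the two corners of each eye are passed in the same left-to-right image order, so that the ratios of both eyes move in the same direction.

```python
def gaze_ratio(iris_x, corner_a_x, corner_b_x):
    """Horizontal iris position normalized to [0, 1] between the two eye corners."""
    span = corner_b_x - corner_a_x
    if span == 0:
        raise ValueError("eye corners coincide")
    return (iris_x - corner_a_x) / span


def fuse(left_ratio, right_ratio):
    """Average the per-eye ratios into a single gaze ratio."""
    return 0.5 * (left_ratio + right_ratio)


def classify_gaze(gamma, low=0.40, high=0.60):
    """Map the fused ratio to the gaze states of the table above."""
    if gamma > high:
        return "Left"
    if gamma < low:
        return "Right"
    return "Center"
```

An iris sitting midway between its corners yields a ratio of 0.5 and is classified as Center. The boundary values 0.40 and 0.60 themselves count as Center, matching the closed interval in the table.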
3.3 Mouth Aspect Ratio (Vocalization Detection)
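Oral communication is detected through the Mouth Aspect Ratio (MAR), a normalized measure of vertical mouth opening. Given the vertical lip landmarks $\mathbf{p}_{13}, \mathbf{p}_{14}$ (indices 13, 14) and horizontal mouth corners $\mathbf{p}_{61}, \mathbf{p}_{291}$ (indices 61, 291):

$$\mathrm{MAR}_t = \frac{\left\| \mathbf{p}_{13} - \mathbf{p}_{14} \right\|_2}{\left\| \mathbf{p}_{61} - \mathbf{p}_{291} \right\|_2}$$

A first-order IIR filter with smoothing coefficient $\beta \in (0, 1]$ smooths the raw signal:

$$\widetilde{\mathrm{MAR}}_t = \beta \, \mathrm{MAR}_t + (1 - \beta)\, \widetilde{\mathrm{MAR}}_{t-1}$$

A vocalization event is asserted when the smoothed ratio exceeds a fixed threshold, $\widetilde{\mathrm{MAR}}_t > \theta_{\mathrm{MAR}}$.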
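As a rough sketch of this step, the ratio and the first-order filter fit in a few lines. The smoothing coefficient and threshold used here are placeholders chosen for illustration, since the text does not state their values, and `VocalizationDetector` is a hypothetical name.

```python
import math


def mouth_aspect_ratio(upper, lower, left, right):
    """Vertical lip gap divided by mouth width, from four (x, y) landmarks."""
    horizontal = math.dist(left, right)
    if horizontal == 0:
        raise ValueError("mouth corners coincide")
    return math.dist(upper, lower) / horizontal


class VocalizationDetector:
    """First-order IIR smoothing followed by a fixed threshold."""

    def __init__(self, beta=0.5, threshold=0.30):
        # beta and threshold are illustrative placeholders, not values from the paper.
        self.beta = beta
        self.threshold = threshold
        self.smoothed = None

    def update(self, mar):
        if self.smoothed is None:
            self.smoothed = mar
        else:
            self.smoothed = self.beta * mar + (1.0 - self.beta) * self.smoothed
        return self.smoothed > self.threshold
```

With these placeholder values, a mouth that opens from 0.0 to a raw ratio of 0.5 is flagged only on the second open frame (smoothed values 0.25, then 0.375), which is how the filter suppresses single-frame spikes.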
3.4 Prohibited Object Detection via YOLOv8 + SAHI
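YOLOv8n (3.2M parameters, 8.7 GFLOPs) performs single-shot detection [4] on COCO classes 67 (cell phone) and 73 (book). The model outputs quantized bounding box predictions $\hat{b} = (x, y, w, h, s, c)$: box center, width, height, confidence score, and class label.

To resolve small objects at distance, Slicing Aided Hyper Inference (SAHI) [5] partitions the frame into overlapping slices of dimension $s_w \times s_h$ with overlap ratio $\rho$. Slice $(i, j)$ covers:

$$\left[\, i\, s_w (1 - \rho),\; i\, s_w (1 - \rho) + s_w \,\right] \times \left[\, j\, s_h (1 - \rho),\; j\, s_h (1 - \rho) + s_h \,\right]$$

Cross-slice duplicate predictions are resolved via Non-Maximum Suppression with IoU threshold 0.55. A lower-scoring box $b_j$ is discarded when it overlaps an already kept box $b_i$ too strongly:

$$\mathrm{IoU}(b_i, b_j) = \frac{|b_i \cap b_j|}{|b_i \cup b_j|} > 0.55$$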
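The slicing and merging logic can be sketched independently of the detector. The code below is a simplified illustration, not the SAHI library's API: the function names are hypothetical, the 320-pixel slice size and 0.2 overlap in the usage note are assumed values, and the NMS shown is the plain greedy, class-agnostic variant.

```python
def slice_origins(length, slice_len, overlap):
    """Start offsets of overlapping 1D slices that cover [0, length)."""
    if slice_len >= length:
        return [0]
    stride = max(1, round(slice_len * (1.0 - overlap)))
    origins = list(range(0, length - slice_len, stride))
    origins.append(length - slice_len)  # last slice sits flush with the border
    return origins


def slices(width, height, slice_w, slice_h, overlap):
    """All slice rectangles as (x0, y0, x1, y1), row by row."""
    return [(x, y, min(x + slice_w, width), min(y + slice_h, height))
            for y in slice_origins(height, slice_h, overlap)
            for x in slice_origins(width, slice_w, overlap)]


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.55):
    """Greedy NMS over (box, score) pairs already mapped to full-frame coordinates."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept
```

Under these assumptions a 640×480 frame yields a 3×2 grid of slices. Two detections of the same phone reported by neighboring slices (IoU about 0.82) collapse to the higher-scoring one, while a distant box survives.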
3.5 Spatiotemporal Composite Risk Accumulation
The system's core innovation is a sliding-window temporal accumulator that distinguishes transient physiological movements from sustained malpractice behavior. For each tracked student, a sliding window W = {τ: t - 90 < τ ≤ t} (~3 s at 30 FPS) stores per-frame risk vectors.
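The instantaneous risk at frame τ is the weighted sum of five components:

$$r_\tau = 0.40\, O_\tau + 0.22\, E_\tau + 0.16\, H_\tau + 0.14\, D_\tau + 0.08\, C_\tau$$

The weights sum to 1, so $r_\tau \in [0, 1]$. The components are defined as follows: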
| Component | Notation | Description | Weight |
|---|---|---|---|
| Object | Oτ | Phone = 1.0, Book = 0.45, None = 0.0 | 0.40 |
| Gaze Deviation | Eτ | Fraction of 90-frame window with non-center gaze | 0.22 |
| Head Pose | Hτ | Fraction of 90-frame window with out-of-bounds head pose | 0.16 |
| Down-Tilt | Dτ | Fraction of 90-frame window with head-down posture | 0.14 |
| Temporal Correlation | Cτ | Fraction of last 20 frames with any deviation | 0.08 |
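The composite score is the windowed mean of the instantaneous risk $r_\tau$, scaled to the range [0, 100]:

$$S_t = \frac{100}{|W|} \sum_{\tau \in W} r_\tau$$

A deterministic escalation cascade is triggered as $S_t$ crosses successive alert thresholds.
When a phone is detected ($O_\tau = 1.0$), an immediate floor on $S_t$ is enforced, bypassing the weighted sum. This reflects the protocol that unauthorized device possession warrants near-instant attention regardless of concurrent behavior.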
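A compact sketch of the accumulator follows, using the weights and object risks from the component table. It is an illustration under stated assumptions: `RiskAccumulator` is a hypothetical name, windows that are still filling are averaged over the frames seen so far, and the phone floor is left as an optional parameter because its value is not given in the text.

```python
from collections import deque

# Weights and object risks as listed in the component table.
WEIGHTS = {"object": 0.40, "gaze": 0.22, "head": 0.16, "down": 0.14, "corr": 0.08}
OBJECT_RISK = {"phone": 1.0, "book": 0.45, None: 0.0}


def fraction(flags):
    """Share of True entries in a window (0.0 for an empty window)."""
    return sum(flags) / len(flags) if flags else 0.0


class RiskAccumulator:
    """Sliding-window composite risk score in [0, 100] for one tracked student."""

    def __init__(self, window=90, corr_window=20, phone_floor=None):
        self.gaze = deque(maxlen=window)
        self.head = deque(maxlen=window)
        self.down = deque(maxlen=window)
        self.recent = deque(maxlen=corr_window)
        self.risk = deque(maxlen=window)
        self.phone_floor = phone_floor  # hypothetical knob; no value is stated in the text

    def update(self, obj=None, gaze_off=False, head_off=False, head_down=False):
        self.gaze.append(gaze_off)
        self.head.append(head_off)
        self.down.append(head_down)
        self.recent.append(gaze_off or head_off or head_down)
        # Instantaneous risk: weighted sum of the five components.
        r = (WEIGHTS["object"] * OBJECT_RISK[obj]
             + WEIGHTS["gaze"] * fraction(self.gaze)
             + WEIGHTS["head"] * fraction(self.head)
             + WEIGHTS["down"] * fraction(self.down)
             + WEIGHTS["corr"] * fraction(self.recent))
        self.risk.append(r)
        # Composite score: windowed mean of r, scaled to [0, 100].
        score = 100.0 * sum(self.risk) / len(self.risk)
        if obj == "phone" and self.phone_floor is not None:
            score = max(score, self.phone_floor)
        return score
```

Feeding 89 clean frames followed by a single off-center glance leaves the score well below 1, whereas 90 consecutive frames with every signal active and a phone in view drive it to 100. This is the transient-versus-sustained separation the accumulator is designed for.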
4. Interactive Risk Simulator
Interactive figure: Spatiotemporal Risk Accumulator. Toggling behavioral signals updates the 90-frame sliding window and shows how the composite score evolves in real time.
5. Implementation Architecture
5.1 Standalone Desktop Runtime
The standalone client is a self-contained OpenCV window application with integrated HUD rendering:
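```python
import cv2


def detection_loop(self):
    """Capture-inference loop of the standalone client (a method of the client class)."""
    while self.running:
        ret, frame = self.cap.read()
        if not ret:
            continue  # dropped frame: try the next read

        # MediaPipe FaceMesh expects RGB input, while OpenCV captures BGR.
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        results = self.face_mesh.process(rgb)

        if results.multi_face_landmarks:
            for landmarks in results.multi_face_landmarks:
                pitch, yaw, roll = self.calculate_head_pose(landmarks, frame.shape)
                gaze = self.get_gaze_ratio(landmarks)
                mar = self.calculate_mouth_aspect_ratio(landmarks)
                score = self.compute_additive_score(yaw, pitch, gaze, mar)
                self.temporal_buffer.append(score)
                self.draw_hud(frame, yaw, pitch, gaze, mar, score)

        # Object detection runs on every 15th frame to amortize its cost.
        if self.frame_count % 15 == 0:
            detections, frame = self.detect_objects_yolo(frame)
            if detections:
                self.handle_detection_alert(detections)

        cv2.imshow(self.WINDOW_NAME, frame)
        # imshow needs waitKey to pump the GUI event loop; Q quits.
        if cv2.waitKey(1) & 0xFF == ord("q"):
            self.running = False
        self.frame_count += 1
```

Key design decisions: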
- Synchronous capture-inference loop with thread-safe frame buffer
- Two-tier scoring: additive per-frame penalty plus composite temporal scoring
- Stabilized tracking via per-parameter EMA filters (α = 0.3–0.5)
- Session recording to structured JSON reports on exit
5.2 Web Dashboard Server
The Flask backend provides the following components:
- JWT authentication with access/refresh token rotation (30 min / 7 day expiry)
- Role-based access control: Admin, Teacher, Staff Invigilator — each with scoped dashboards
- Socket.IO real-time event bus for push-based telemetry
- SQLite WAL-mode database with background thread logging
- MJPEG streaming for live camera feed delivery
The system supports simultaneous multi-exam-session monitoring through a room-based Socket.IO channel architecture, allowing a single admin dashboard to observe multiple examination halls concurrently.
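To make the token lifetimes concrete, the following stdlib-only sketch issues and verifies an HS256-signed JWT with a role claim. It is an illustration of the mechanism and not the project's code: the function names and claim layout are assumptions, the header's `alg` field is not validated, and a real deployment would normally rely on a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json

ACCESS_TTL_S = 30 * 60            # 30-minute access tokens
REFRESH_TTL_S = 7 * 24 * 60 * 60  # 7-day refresh tokens


def _b64url(data):
    """Base64url without padding, as used by JWT."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text):
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret, signing_input):
    digest = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return _b64url(digest)


def issue_token(secret, subject, role, now, ttl_s):
    """Build header.payload.signature with an expiry `ttl_s` seconds after `now`."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode("utf-8"))
    claims = {"sub": subject, "role": role, "iat": now, "exp": now + ttl_s}
    payload = _b64url(json.dumps(claims).encode("utf-8"))
    signing_input = f"{header}.{payload}"
    return f"{signing_input}.{_sign(secret, signing_input)}"


def verify_token(secret, token, now):
    """Return the claims if the signature matches and the token has not expired."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header, payload, signature = parts
    expected = _sign(secret, f"{header}.{payload}")
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload))
    if now >= claims["exp"]:
        raise ValueError("token expired")
    return claims
```

A token issued with `ACCESS_TTL_S` verifies one minute later but is rejected once 30 minutes have passed, at which point a client would present its refresh token to obtain a new access token.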
5.3 Cross-Platform Mobile Client (Flutter/Dart)
The Flutter mobile app extends invigilation to handheld devices:
- JWT-authenticated API client connecting to the Flask backend
- Live MJPEG feed with overlaid anomaly metrics
- Role-specific dashboards: Admin (user management, pie-chart distribution) and Teacher (per-session alert timeline)
- Animated splash screen with scanning-line eye icon
- Demo mode: 3-minute scripted timeline simulating progressive anomaly escalation
6. Performance Benchmarking
6.1 Experimental Setup
Profiling was conducted on an Intel i5-12500H (12 cores, 2.5 GHz) with an RTX 3050 Laptop GPU (4 GB VRAM). Each pipeline configuration was evaluated over 500 frames at 640×480 resolution.
6.2 Latency Breakdown
Figure: Pipeline Stage Latency (CPU and GPU profiling views).
6.3 Performance Data
| Pipeline Configuration | CPU Latency | GPU Latency | CPU FPS | GPU FPS | Speedup |
|---|---|---|---|---|---|
| Baseline (FaceMesh only) | 26.6 ms | 10.0 ms | 37.6 | 100.0 | 2.66× |
| + YOLOv8n (every frame) | 109.0 ms | 31.0 ms | 9.2 | 32.3 | 3.52× |
| + YOLOv8n (N=15 cadence) | 33.4 ms | 11.4 ms | 29.9 | 87.7 | 2.93× |
| + SAHI + YOLOv8n | 341.5 ms | 68.6 ms | 2.9 | 14.6 | 4.98× |
| Full Pipeline (optimized)* | 48.2 ms | 15.3 ms | 20.7 | 65.4 | 3.15× |
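The derived columns of the table follow from the latencies by simple arithmetic, which the short sketch below reproduces. The helper names are hypothetical, and `amortized_latency` is an idealized estimate that assumes the detector's cost is spread evenly over the cadence interval with no scheduling overhead.

```python
def fps(latency_ms):
    """Frames per second for a given mean per-frame latency."""
    return 1000.0 / latency_ms


def speedup(cpu_ms, gpu_ms):
    """GPU speedup factor relative to CPU."""
    return cpu_ms / gpu_ms


def amortized_latency(base_ms, detector_ms, cadence):
    """Mean per-frame latency when the detector runs once every `cadence` frames."""
    return base_ms + detector_ms / cadence
```

On the GPU, the detector adds 31.0 − 10.0 = 21.0 ms per pass, so running it every 15th frame gives an estimated 10.0 + 21.0 / 15 = 11.4 ms, which matches the measured cadence row. On the CPU, the same estimate gives about 32.1 ms against a measured 33.4 ms.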
7. Deployment Topologies
The system supports two distinct deployment modes:
7.1 Standalone Mode
Standalone mode runs a fully self-contained native OpenCV window with a real-time HUD. It executes entirely on local hardware and writes structured reports to disk, which suits individual examination rooms without network infrastructure.
```
# Usage: python silent_invigilator.py
# Controls: Q = quit & save | R = reset scores | S = save snapshot
```
7.2 Server-Client Mode
The Flask backend operates as a central monitoring hub, accepting camera feeds from multiple examination rooms and broadcasting telemetry to connected dashboards.
| Role | Permissions | Dashboard Surface |
|---|---|---|
| Admin | User CRUD, all-session monitoring, system config, alert ack | Web + Mobile |
| Teacher | Per-session logs, real-time scores, incident timeline | Web + Mobile |
| Staff Invigilator | Live video feed, per-frame risk gauge, alert log | Web |
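The role table maps naturally onto a lookup-based access check. In the sketch below the permission identifiers (`user_crud`, `live_video`, and so on) are hypothetical names derived from the table entries, not identifiers taken from the codebase.

```python
# Role -> permissions, mirroring the table above (identifier names are assumed).
PERMISSIONS = {
    "Admin": {"user_crud", "all_session_monitoring", "system_config", "alert_ack"},
    "Teacher": {"session_logs", "realtime_scores", "incident_timeline"},
    "Staff Invigilator": {"live_video", "risk_gauge", "alert_log"},
}

# Role -> dashboard surfaces on which the role may log in.
SURFACES = {
    "Admin": {"web", "mobile"},
    "Teacher": {"web", "mobile"},
    "Staff Invigilator": {"web"},
}


def is_allowed(role, permission, surface):
    """True if `role` holds `permission` and may use the given dashboard surface."""
    return (permission in PERMISSIONS.get(role, set())
            and surface in SURFACES.get(role, set()))
```

Unknown roles fall through to empty sets and are denied by default, so a Staff Invigilator request from the mobile app is rejected even for a permission the role holds on the web.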
8. Future Work
Several extensions are under active investigation:
- Transformer-based temporal fusion: Replacing the weighted sliding window with a lightweight attention mechanism (Perceiver-IO) to learn inter-modal temporal dependencies end-to-end.
- Multi-camera spatial fusion: Extending ByteTrack with cross-camera ReID embeddings for consistent identity tracking across overlapping camera views in large examination halls.
- On-device deployment: Quantizing YOLOv8n to INT8 via TensorRT for NVIDIA Jetson Orin-class edge devices at sub-10 ms latency.
- Adversarial robustness auditing: Evaluating system resilience against evasion attacks (e.g., adversarial patches on clothing designed to suppress YOLO detections).
References
[1] Parasuraman, R. (1987). Human-computer monitoring. Human Factors, 29(6), 671–686.
[2] Thomson, D. R., Besner, D., & Smilek, D. (2015). A resource-control account of sustained attention. Perspectives on Psychological Science, 10(1), 82–96.
[3] Lugaresi, C., et al. (2019). MediaPipe: A Framework for Building Perception Pipelines. arXiv:1906.08172.
[4] Jocher, G., et al. (2023). Ultralytics YOLOv8. GitHub: ultralytics/ultralytics.
[5] Akyon, F. C., et al. (2022). Slicing Aided Hyper Inference and Fine-Tuning for Small Object Detection. IEEE ICIP 2022.
[6] Zhang, Y., et al. (2022). ByteTrack: Multi-Object Tracking by Associating Every Detection Box. ECCV 2022.
[7] Lepetit, V., Moreno-Noguer, F., & Fua, P. (2009). EPnP: An Accurate O(n) Solution to the PnP Problem. International Journal of Computer Vision, 81(2).
[8] Bradski, G. (2000). The OpenCV Library. Dr. Dobb's Journal of Software Tools.