Benedict's Notebook

Visa applications exist in a peculiar regulatory space: high-stakes, information-asymmetric, and governed by decision processes that applicants experience as a black box. You submit your documents. You wait. Months later, a letter arrives. The internal logic — the weight assigned to your education versus your employer's establishment year versus your continent of origin — remains invisible.

The dataset tells a story that the bureaucracy does not. Twenty-five thousand, four hundred eighty records, each representing a human life paused at a border, waiting for a decision. Eleven input features per record, among them continent, education, job experience, company size, prevailing wage, and application month. And one synthesized target, processing_time_days, constructed from domain-aware heuristics that model real-world visa adjudication patterns.

This is the story of building a system that makes that black box marginally more transparent.

The Data: 25,480 Windows into Immigration

The EasyVisa dataset captures the multidimensional landscape of employment-based visa applications. Each record represents an application filed by a US employer on behalf of a foreign worker — the H-1B and related visa categories that form the backbone of skilled immigration.

Feature Space

Feature	Type	Description
`continent`	Categorical (6)	Applicant's continent of origin
`education_of_employee`	Categorical (4)	Highest degree (High School → Doctorate)
`has_job_experience`	Binary	Prior relevant work experience
`requires_job_training`	Binary	On-the-job training required
`no_of_employees`	Integer	Employer's workforce size
`yr_of_estab`	Integer	Employer's founding year
`region_of_employment`	Categorical (5)	US geographic region
`prevailing_wage`	Float	Prevailing wage, expressed per `unit_of_wage` period
`unit_of_wage`	Categorical (4)	Wage period (Hour, Week, Month, Year)
`full_time_position`	Binary	Full-time or part-time
`application_month`	Integer (1–12)	Month of application filing
`processing_time_days`	Integer (target)	Days from application to decision

The target variable — processing_time_days — was synthesized using a domain-aware heuristic that incorporates continental baselines, education-level adjustments, wage-tier multipliers, and seasonal variation. The synthesis logic encodes real-world patterns: applicants from higher-processing-volume continents receive faster baselines; higher education levels correlate with reduced processing times due to streamlined adjudication pathways; higher wage categories trigger additional scrutiny and longer timelines.
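
The synthesis heuristic itself is not published here. As a rough illustration of the shape such a rule could take, the sketch below combines a continental baseline, an education adjustment, a wage-tier multiplier, and a seasonal offset. The function name `synthesize_processing_days` and every constant in it are hypothetical placeholders, not the values used to build the dataset; only the 1 to 365 day range is taken from the edge cases mentioned later in this post.

```python
import random

# Hypothetical constants for illustration only; the real heuristic is not shown.
CONTINENT_BASELINE_DAYS = {"Asia": 140.0, "Europe": 100.0, "North America": 75.0}
DEFAULT_BASELINE_DAYS = 120.0
EDUCATION_ADJUSTMENT_DAYS = {
    "High School": 10.0, "Bachelor's": 0.0, "Master's": -5.0, "Doctorate": -10.0,
}
WAGE_TIER_MULTIPLIER = {0: 1.00, 1: 1.00, 2: 1.02, 3: 1.05, 4: 1.10}  # quintile -> factor
SEASONAL_OFFSET_DAYS = {8: 10.0, 9: 10.0, 10: -5.0, 11: -5.0, 12: -5.0}  # month -> days


def synthesize_processing_days(continent, education, wage_tier, month,
                               noise_sd=5.0, rng=None):
    """Compose a synthetic processing time from the four ingredients named above."""
    rng = rng or random.Random()
    days = CONTINENT_BASELINE_DAYS.get(continent, DEFAULT_BASELINE_DAYS)
    days += EDUCATION_ADJUSTMENT_DAYS.get(education, 0.0)   # education-level adjustment
    days *= WAGE_TIER_MULTIPLIER.get(wage_tier, 1.0)        # wage-tier multiplier
    days += SEASONAL_OFFSET_DAYS.get(month, 0.0)            # seasonal variation
    days += rng.gauss(0.0, noise_sd)                        # per-record noise
    return max(1, min(365, round(days)))                    # clamp to 1..365 days


example_days = synthesize_processing_days("Asia", "Master's", wage_tier=2, month=5,
                                          rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the synthetic target reproducible, which matters if the dataset ever needs to be regenerated.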

The Feature Engineering Pipeline

Raw categorical data does not generalize. The transformation from raw records to predictive features passes through three stages:

Temporal Decomposition

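import pandas as pd

# Assumes a raw `application_date` column; skip this step if
# `application_month` is already present as an integer.
df['application_month'] = pd.to_datetime(df['application_date']).dt.month

# Map the month number to a meteorological season (December-February is Winter).
df['season'] = df['application_month'].apply(
    lambda m: 'Spring' if 3 <= m <= 5
    else 'Summer' if 6 <= m <= 8
    else 'Fall' if 9 <= m <= 11
    else 'Winter'
)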

Applications filed in late summer (August–September) historically experience longer processing times due to the H-1B cap season rush. Fall applications (October–December), filed after the cap has been reached and the new fiscal year begins, process faster. The season feature captures this cyclical pattern only coarsely: its buckets place September in Fall and December in Winter, so the raw application_month stays in the feature set alongside it.

Geographical Baselines

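# Mean processing time per continent, broadcast back to every row.
# Caveat: this is derived from the target, so compute it on the training split only.
df['continent_avg'] = df.groupby('continent')['processing_time_days'].transform('mean')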

The average processing time by continent encodes the underlying administrative infrastructure of each region's consular network. Europe and North America process faster; Asia and Africa process slower. In this dataset the gap comes from the continental baselines built into the synthesized target, which are meant to model application volume per consular officer rather than bias.
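
Because `continent_avg` is computed from the target, calculating it over the full dataset would let information from evaluation rows leak into training. A minimal sketch of a leak-free variant, assuming a conventional train/holdout split: the means are learned from training rows and then mapped onto any other frame. The helper names and the toy values are illustrative and not taken from the project.

```python
import pandas as pd


def fit_continent_means(train: pd.DataFrame):
    """Learn per-continent mean processing time from training rows only."""
    means = train.groupby("continent")["processing_time_days"].mean()
    global_mean = float(train["processing_time_days"].mean())
    return means, global_mean


def add_continent_avg(frame: pd.DataFrame, means: pd.Series, global_mean: float) -> pd.DataFrame:
    """Attach continent_avg from stored means; unseen continents get the global mean."""
    out = frame.copy()
    out["continent_avg"] = out["continent"].map(means).fillna(global_mean)
    return out


# Toy data, for illustration only.
train = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "processing_time_days": [150, 140, 100, 96],
})
holdout = pd.DataFrame({"continent": ["Asia", "Europe", "Oceania"]})

means, global_mean = fit_continent_means(train)
holdout_encoded = add_continent_avg(holdout, means, global_mean)
```

The same stored means can be reused at inference time, where no target exists at all.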

Economic Indicators

The wage_category_index bins prevailing wages into quintiles and encodes them as ordinal values. Higher-wage applications receive additional scrutiny (longer processing) under the hypothesis that H-1B-dependent employers and high-salary positions undergo more rigorous compliance review.
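
One way to build such an index is sketched below. Since `prevailing_wage` is reported per hour, week, month, or year, the sketch first converts everything to an annual figure and then applies `pandas.qcut` for equal-frequency bins. The conversion factors (2,080 working hours per year and so on) and the intermediate `annual_wage` column are assumptions; the post does not say whether the project annualizes wages before binning.

```python
import pandas as pd

# Assumed conversion factors (a full-time year taken as 2,080 hours); not stated in the post.
PERIODS_PER_YEAR = {"Hour": 2080, "Week": 52, "Month": 12, "Year": 1}


def add_wage_category_index(df: pd.DataFrame) -> pd.DataFrame:
    """Annualize the wage, then bin it into quintiles encoded as ordinal codes."""
    out = df.copy()
    out["annual_wage"] = out["prevailing_wage"] * out["unit_of_wage"].map(PERIODS_PER_YEAR)
    # labels=False returns integer bin codes (0 = lowest quintile);
    # duplicates="drop" merges bins if many wages share the same value.
    out["wage_category_index"] = pd.qcut(
        out["annual_wage"], q=5, labels=False, duplicates="drop"
    )
    return out


# Toy wages, for illustration only.
sample = pd.DataFrame({
    "prevailing_wage": [20, 900, 4200, 60000, 95000, 30, 1500, 7000, 120000, 45000],
    "unit_of_wage": ["Hour", "Week", "Month", "Year", "Year",
                     "Hour", "Week", "Month", "Year", "Year"],
})
binned = add_wage_category_index(sample)
```

As with the continental average, the quintile edges would normally be learned on training data and reused at inference, rather than recomputed per request.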

After engineering, the feature matrix expanded from 12 raw columns to 23 engineered features, including one-hot encodings of categorical variables and standardized continuous features.
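
The encoding and scaling step might look roughly like the following, using scikit-learn's `ColumnTransformer`. This is a sketch under assumptions: the split into categorical and numeric columns, the three-row sample, and its category values are illustrative, and the project's own pipeline (which reaches 23 columns and scales with a separately saved `scaler`) may be organized differently.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

CATEGORICAL = ["continent", "education_of_employee", "region_of_employment"]
NUMERIC = ["no_of_employees", "yr_of_estab", "prevailing_wage"]

preprocessor = ColumnTransformer(
    transformers=[
        # Unknown categories at inference time become an all-zero block.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
        ("scale", StandardScaler(), NUMERIC),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Illustrative rows only.
train = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia"],
    "education_of_employee": ["Master's", "Bachelor's", "Doctorate"],
    "region_of_employment": ["West", "South", "West"],
    "no_of_employees": [500, 1200, 80],
    "yr_of_estab": [2010, 1995, 2018],
    "prevailing_wage": [50400.0, 72000.0, 64000.0],
})
X_train = preprocessor.fit_transform(train)

# A request with a continent never seen during fitting still transforms cleanly.
new_row = train.iloc[[0]].assign(continent="Oceania")
X_new = preprocessor.transform(new_row)
```

Fitting the transformer once and persisting it keeps training and inference on the same column layout, which is the property the backend relies on when it applies "the same pipeline as training".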

Model Selection: Why Random Forest Won

Four candidate architectures were evaluated:

Model	MAE	RMSE	R²
Linear Regression	8.7 d	12.3 d	0.62
Random Forest	4.2 d	6.1 d	0.87
Gradient Boosting	4.8 d	7.0 d	0.83
Ridge Regression	8.5 d	12.0 d	0.64
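
For reference, the three columns in the table follow the standard definitions: MAE is the mean absolute error in days, RMSE is the square root of the mean squared error (so it penalizes large misses more), and R² is one minus the ratio of residual to total variance. A small pure-Python sketch, with made-up actual and predicted values used only to exercise the function:

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Illustrative values in days, not taken from the dataset.
metrics = regression_metrics([100, 120, 140, 160], [104, 118, 135, 163])
```

scikit-learn exposes the same quantities as `mean_absolute_error`, `mean_squared_error`, and `r2_score` in `sklearn.metrics`.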

Random Forest was selected as the production model for three reasons:

Non-linear feature interactions. The relationship between prevailing_wage and processing time is not monotonic — mid-range wages process faster than both low-wage (fewer exemptions) and high-wage (additional scrutiny) brackets. Linear models cannot capture this.
Categorical handling. The categorical features are low-cardinality (6 continents, 5 regions, 4 education levels), so one-hot encoding them adds only a handful of columns, and tree splits work on those columns directly without embedding layers or feature scaling.
Robustness to outliers. The dataset contains edge cases — applications with implausibly fast (1 day) or slow (365 days) processing times. Random Forest's ensemble structure naturally dampens the influence of these outliers.

Feature Importance

The trained model revealed that processing time is primarily driven by:

Geographical baseline — The continent of origin accounts for approximately 28% of predictive power. This reflects real consular processing capacity differentials.
Education level — ~18% importance. Advanced degrees correlate with faster processing through premium processing eligibility and higher petition approval rates.
Wage category — ~15% importance. Higher wages introduce additional compliance scrutiny.
Application month — ~12% importance. Seasonal cap dynamics create predictable processing windows.
Company establishment year — ~10% importance. Older, established firms have more predictable petition histories.
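
With one-hot encoded inputs, a fitted forest reports importance per encoded column (in scikit-learn, via `model.feature_importances_`), so per-feature figures like the ones above require summing the columns that belong to the same raw feature. The sketch below shows one way to do that, assuming encoded columns are named `<feature>_<value>` as `pandas.get_dummies` would produce. The column names and weights in the example are hypothetical.

```python
from collections import defaultdict

RAW_FEATURES = [
    "continent", "education_of_employee", "has_job_experience",
    "requires_job_training", "no_of_employees", "yr_of_estab",
    "region_of_employment", "prevailing_wage", "unit_of_wage",
    "full_time_position", "application_month",
]


def group_importances(columns, importances, raw_features=RAW_FEATURES):
    """Sum encoded-column importances back onto the raw feature they came from."""
    # Check longer names first so a column matches its most specific raw feature.
    ordered = sorted(raw_features, key=len, reverse=True)
    grouped = defaultdict(float)
    for column, weight in zip(columns, importances):
        owner = next(
            (name for name in ordered
             if column == name or column.startswith(name + "_")),
            column,  # engineered columns with no raw parent keep their own name
        )
        grouped[owner] += weight
    return dict(sorted(grouped.items(), key=lambda item: item[1], reverse=True))


# Hypothetical encoded columns and weights.
columns = ["continent_Asia", "continent_Europe",
           "education_of_employee_Master's", "prevailing_wage", "season_Fall"]
weights = [0.20, 0.08, 0.18, 0.15, 0.05]
by_feature = group_importances(columns, weights)
```

Impurity-based importances of this kind sum to 1 across all columns, so the five drivers listed above (about 83% combined) leave the remainder spread over the other features.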

Architecture: Full-Stack ML Deployment

The system was designed as a production-grade, serverless ML inference pipeline with zero infrastructure maintenance overhead.

User Browser (React 18 + Vite SPA)
    │
    │ POST /api/predict { application_features }
    │
    ▼
Netlify CDN ──────────────────────────────────┐
(Static assets, edge caching)                 │
                                               │
                                               ▼
                                     Vercel Serverless Function
                                     (Flask, Python 3.11)
                                               │
                                    ┌──────────┼──────────┐
                                    ▼          ▼          ▼
                              best_model   scaler     predictor.py
                              .joblib      .joblib    (pipeline)
                                    │
                                    ▼
                              Response JSON:
                              { predicted_days, confidence_score,
                                trend_forecast, regional_comparison }
                                    │
                                    ▼
                              React Dashboard
                              (Recharts gauge + trend chart)

Frontend: React 18 + Vite

The frontend is a single-page application with three primary views:

HomePage — Product landing with feature highlights, call-to-action, and live demo preview. Built with Tailwind CSS utility classes and Framer Motion entry animations.

DashboardPage — The core prediction interface. An interactive form collects the eleven application features and submits them to the backend API. Results render as:

A confidence gauge (Recharts radial bar chart) displaying the model's prediction certainty
A trend chart showing month-by-month processing time projections for the given profile
A regional comparison radar chart comparing the applicant's predicted processing time against continental averages

// Prediction result display (simplified)
const ResultPanel = ({ prediction }) => (
  <div className="space-y-6">
    <GaugeChart value={prediction.predicted_days} max={365}
                label="Estimated Processing Time" />
    <ConfidenceBadge score={prediction.confidence_score} />
    <TrendChart data={prediction.trend_forecast} />
    <RegionalComparison data={prediction.regional_comparison} />
  </div>
);

HistoryPage — Client-side prediction history stored in localStorage. Searchable, filterable, with CSV export. No server-side tracking — predictions are ephemeral by design.

Backend: Flask on Vercel Serverless

The prediction endpoint is a Flask microservice deployed as a Vercel serverless function:

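from flask import Flask, jsonify, request

# engineer_features, compute_confidence, generate_trend_forecast and
# compute_regional_avg are project helpers (assumed to come from predictor.py);
# model and scaler are the loaded joblib artifacts.
app = Flask(__name__)

@app.route('/api/predict', methods=['POST'])
def predict():
    data = request.get_json(silent=True)
    if not isinstance(data, dict):           # Reject missing or malformed JSON
        return jsonify({'error': 'Expected a JSON object'}), 400
    features = engineer_features(data)       # Apply same pipeline as training
    scaled = scaler.transform(features)      # StandardScaler
    prediction = model.predict(scaled)[0]    # Random Forest inference
    confidence = compute_confidence(prediction, features)
    trend = generate_trend_forecast(data, model)
    return jsonify({
        'predicted_days': round(float(prediction), 1),
        'confidence_score': round(float(confidence), 1),
        'trend_forecast': trend,
        'regional_comparison': compute_regional_avg(data['continent'])
    })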
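
The endpoint as shown passes the request body straight into feature engineering. A small validation step in front of it could look like the sketch below, which checks only what the feature table and the sample request make explicit: the eleven field names, their basic types, the Y/N flags, the four wage units, and the 1 to 12 month range. The `validate_payload` helper is hypothetical and not part of the project's code.

```python
REQUIRED_FIELDS = {
    "continent": str,
    "education_of_employee": str,
    "has_job_experience": str,
    "requires_job_training": str,
    "no_of_employees": int,
    "yr_of_estab": int,
    "region_of_employment": str,
    "prevailing_wage": (int, float),
    "unit_of_wage": str,
    "full_time_position": str,
    "application_month": int,
}
YES_NO_FIELDS = ("has_job_experience", "requires_job_training", "full_time_position")
WAGE_UNITS = {"Hour", "Week", "Month", "Year"}


def validate_payload(data):
    """Return a list of problems; an empty list means the payload looks usable."""
    if not isinstance(data, dict):
        return ["body must be a JSON object"]
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so rule it out explicitly.
        elif isinstance(data[field], bool) or not isinstance(data[field], expected):
            problems.append(f"wrong type for {field}")
    if problems:
        return problems
    for field in YES_NO_FIELDS:
        if data[field] not in ("Y", "N"):
            problems.append(f"{field} must be 'Y' or 'N'")
    if data["unit_of_wage"] not in WAGE_UNITS:
        problems.append("unit_of_wage must be one of Hour, Week, Month, Year")
    if not 1 <= data["application_month"] <= 12:
        problems.append("application_month must be between 1 and 12")
    if data["prevailing_wage"] <= 0:
        problems.append("prevailing_wage must be positive")
    return problems


sample_payload = {
    "continent": "Asia", "education_of_employee": "Master's",
    "has_job_experience": "Y", "requires_job_training": "N",
    "no_of_employees": 500, "yr_of_estab": 2010,
    "region_of_employment": "West", "prevailing_wage": 4200,
    "unit_of_wage": "Month", "full_time_position": "Y",
    "application_month": 5,
}
issues = validate_payload(sample_payload)
```

Inside the route, a non-empty result would typically be returned as a 400 response before any model code runs.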

Key design decisions:

Cold-start optimized. The model and scaler are loaded lazily and cached across invocations via Vercel's serverless function instance reuse.
Deterministic inference. The same input always produces the same output — critical for auditability in a regulatory-adjacent domain.
Mock fallback. If the backend is unreachable, the frontend falls back to a client-side mock mode using pre-computed baseline averages, ensuring the demo never shows an error state.
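
The lazy-load-and-cache behaviour can be sketched as a small module-level cache: nothing is read from disk at import time, the first request pays the load cost, and later requests served by the same warm function instance reuse the objects. This is an assumed structure rather than the project's actual code; `ArtifactCache` is a hypothetical name, and the loader is injectable so the caching logic can be exercised without real model files.

```python
from typing import Any, Callable, Dict, Optional


def _load_with_joblib(path: str) -> Any:
    # Deferred import: the cost of importing joblib is only paid on the first real load.
    import joblib
    return joblib.load(path)


class ArtifactCache:
    """Load each artifact at most once per process and hand back the cached object."""

    def __init__(self, loader: Optional[Callable[[str], Any]] = None) -> None:
        self._loader = loader or _load_with_joblib
        self._cache: Dict[str, Any] = {}

    def get(self, path: str) -> Any:
        if path not in self._cache:          # first use: load and remember
            self._cache[path] = self._loader(path)
        return self._cache[path]             # later uses: no disk access


# Module-level instance, so it survives across invocations while the instance is warm.
ARTIFACTS = ArtifactCache()
```

In the route, `ARTIFACTS.get("best_model.joblib")` and `ARTIFACTS.get("scaler.joblib")` would replace module-level `joblib.load` calls. The cache is per process, so a fresh cold start loads both files again.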

Deployment Topology

Layer	Platform	Configuration
Frontend	Netlify CDN	`netlify.toml` with SPA redirect rules
Backend	Vercel Serverless	`vercel.json` with Python runtime config
Model Storage	Git LFS (via repo)	`best_model.joblib` (~45 MB)
API Domain	`visa-status-prediction.vercel.app`	CORS-enabled for Netlify origin

The Prediction in Practice

A sample inference request:

curl -X POST https://visa-status-prediction.vercel.app/api/predict \
  -H "Content-Type: application/json" \
  -d '{
    "continent": "Asia",
    "education_of_employee": "Master'\''s",
    "has_job_experience": "Y",
    "requires_job_training": "N",
    "no_of_employees": 500,
    "yr_of_estab": 2010,
    "region_of_employment": "West",
    "prevailing_wage": 4200,
    "unit_of_wage": "Month",
    "full_time_position": "Y",
    "application_month": 5
  }'

Response:

{
  "predicted_days": 142.3,
  "confidence_score": 78.4,
  "trend_forecast": [
    {"month": "June", "predicted": 138.1},
    {"month": "July", "predicted": 145.7},
    {"month": "August", "predicted": 156.2}
  ],
  "regional_comparison": {
    "applicant_continent": 142.3,
    "asia_avg": 145.0,
    "europe_avg": 98.2,
    "north_america_avg": 72.5
  }
}

The model predicts 142 days (~4.7 months) for a Master's-level applicant from Asia with a $4,200 monthly prevailing wage, employed at a 500-employee firm established in 2010 on the West Coast, filing in May. The confidence score of 78.4% reflects moderate certainty — the model is most confident for modal profiles (Asia, Master's, West Coast) and least confident for edge cases (Africa, Doctorate, Northeast, January filing).
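
The `trend_forecast` array in the sample response covers the three months after the May filing. The implementation of `generate_trend_forecast` is not shown, so the following is only one plausible reading: re-run the prediction for the same profile with `application_month` shifted forward one month at a time. The function name, the `predict_days` callable, and the stub predictor are all hypothetical stand-ins.

```python
import calendar
from typing import Callable, Dict, List


def trend_forecast_sketch(profile: Dict, predict_days: Callable[[Dict], float],
                          horizon: int = 3) -> List[Dict]:
    """Predict the same profile for each of the next `horizon` filing months."""
    forecast = []
    for step in range(1, horizon + 1):
        # Shift the month forward, wrapping December (12) back to January (1).
        month = (profile["application_month"] - 1 + step) % 12 + 1
        shifted = {**profile, "application_month": month}
        forecast.append({
            "month": calendar.month_name[month],
            "predicted": round(float(predict_days(shifted)), 1),
        })
    return forecast


def stub_predictor(profile: Dict) -> float:
    """Stand-in for feature engineering plus model inference; not a real model."""
    return 100.0 + profile["application_month"]


trend = trend_forecast_sketch({"continent": "Asia", "application_month": 5}, stub_predictor)
```

In the real service, `predict_days` would wrap the same feature engineering, scaling, and `model.predict` steps used for the main prediction, so the trend and the headline number stay consistent.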

What the Model Cannot See

The prediction is a point estimate in a high-variance system. The model captures structural patterns in historical data but cannot account for:

Policy changes. A presidential proclamation, consular shutdown, or fee restructuring can shift processing times by months overnight. The model has no mechanism to incorporate forward-looking regulatory signals.
Individual case complexity. Applications requiring Request for Evidence (RFE) or undergoing consular processing rather than USCIS processing follow fundamentally different timelines that the feature space does not capture.
Consular workload variation. A specific embassy experiencing staffing shortages or political disruption will process slower than the continental average — a local effect that the aggregate feature continent cannot resolve.

The prediction is an instrument for expectation-setting, not a guarantee. A 78% confidence score means 22% of profiles similar to this one will fall outside the predicted window.
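
How `compute_confidence` arrives at its score, and what the "predicted window" is, are not spelled out. One common way to attach a range to a Random Forest prediction, assumed here purely for illustration, is to look at how much the individual trees disagree: the spread of per-tree predictions gives a standard deviation and an empirical interval around the ensemble mean. The function name and the sample values are hypothetical.

```python
import math
import statistics


def summarize_tree_predictions(tree_predictions, coverage=0.8):
    """Summarize per-tree predictions as a mean, a spread, and an empirical interval."""
    ordered = sorted(tree_predictions)
    n = len(ordered)
    tail = (1.0 - coverage) / 2.0
    # Conservative nearest-rank bounds: round the lower index down, the upper index up.
    lower = ordered[math.floor(tail * (n - 1))]
    upper = ordered[math.ceil((1.0 - tail) * (n - 1))]
    return {
        "mean": statistics.fmean(ordered),
        "std": statistics.pstdev(ordered),
        "lower": lower,
        "upper": upper,
    }


# With scikit-learn, per-tree predictions for one row X could be collected as
# [tree.predict(X)[0] for tree in model.estimators_]; the values below are made up.
per_tree_days = [130, 132, 134, 136, 138, 140, 142, 144, 146, 148, 150]
summary = summarize_tree_predictions(per_tree_days)
```

An interval built this way reflects disagreement inside the model, not the policy shocks and case-level effects listed above, so it should be read as a lower bound on the real uncertainty.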

Live Demo

The complete system — frontend, backend, and model — is deployed and publicly accessible:

Frontend: visa-status-prediction.netlify.app
Backend API: visa-status-prediction.vercel.app
Source: github.com/The-Peacemaker/VISA-STATUS-PREDICTION

The frontend is a React SPA deployed on Netlify's global CDN. The backend is a Flask serverless function on Vercel. The model is a Random Forest regressor trained on 25,480 visa application records with 23 engineered features and a synthesized processing-time target. Inference completes in under 500ms including cold-start.

Predicting the Bureaucratic Black Box: Visa Processing Times with Machine Learning