Study Overview
Key metrics update as you filter by interviewer or score range · Click any chart's empty area to reset filters
Study context: This dashboard presents an empirical analysis of interviewer scoring consistency in the HKUST TIE (Technology, Innovation & Entrepreneurship) programme. Each candidate was independently scored by 2 of 6 interviewers on a 1–5 scale, with free-text comments and a full interview transcript. We combine EDA, BERTopic-driven dimension discovery, LLM-based coding, and SVR modelling to quantify interviewer alignment and identify what drives scores.
Key metrics (current filter):

| Metric | Value | Note |
|---|---|---|
| Scorer rows | 107 | of 107 total |
| Candidates | 54 | unique |
| Avg score | 3.27 | filtered mean |
| Interviewers | 6 | selected |
| MAE (model) | 0.43 | filtered subset |
Headline finding: Interviewers are highly aligned (Cronbach α = 0.93, Pearson r = 0.88), but systematic hawk/dove tendencies exist. A 37-feature SVR model explains ≈ 56 % of score variance (MAE ≈ 0.43), and the single strongest predictor is the interviewer's overall sentiment, not any specific technical dimension.
1 · Scoring Consistency
Click a histogram bar to zoom that range · Click an interviewer bar to solo that interviewer
Inter-rater reliability is strong: Cronbach's α = 0.93 (excellent) and paired Pearson r = 0.88 (p < 10⁻¹⁷). The mean absolute difference between two scorers for the same candidate is only 0.17 points, and zero pairs disagree by more than 1.0 point. However, systematic tendencies emerge at the interviewer level.
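For reproducibility, here is a minimal sketch of how these reliability figures can be computed from a long-format scores table. The file name and the columns candidate_id, interviewer_id, and score are assumptions, not the study's actual schema:

```python
import pandas as pd
from scipy import stats

def cronbach_alpha(wide: pd.DataFrame) -> float:
    """Cronbach's alpha for a candidates-by-raters score matrix."""
    k = wide.shape[1]
    item_variances = wide.var(axis=0, ddof=1)        # variance of each rater column
    total_variance = wide.sum(axis=1).var(ddof=1)    # variance of per-candidate totals
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

# Long-format scores: one row per (candidate, interviewer) pair -- schema assumed.
scores = pd.read_csv("scores.csv")                   # candidate_id, interviewer_id, score

# Each candidate is seen by a different pair of interviewers, so pivot on the
# within-candidate rater slot (0 or 1) rather than on interviewer identity.
scores["rater_slot"] = scores.groupby("candidate_id").cumcount()
wide = scores.pivot(index="candidate_id", columns="rater_slot", values="score").dropna()

alpha = cronbach_alpha(wide)
r, p = stats.pearsonr(wide[0], wide[1])              # paired Pearson r between the two scorers
mean_abs_diff = (wide[0] - wide[1]).abs().mean()
print(f"alpha={alpha:.2f}  r={r:.2f} (p={p:.1e})  mean |diff|={mean_abs_diff:.2f}")
```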
Charts: Score Distribution (107 rows) · Interviewer Mean Scores (6 interviewers) · Pairwise Score Differences · Scores by Interviewer (strip plot)
🕊️ Dove vs 🦅 Hawk: Interviewer 0 averages 3.47 (most lenient) while Interviewer 5 averages 2.87 (most strict) — a 0.60-point spread on the 5-point scale. Interviewer 5 also shows the highest variance (σ = 1.11), while Interviewer 1 is the most consistent scorer (σ = 0.63). Click an interviewer chip above to isolate their pattern.
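The hawk/dove profile itself is a simple group-by over the same assumed table:

```python
# Per-interviewer leniency profile over the assumed long-format scores table.
profile = (scores.groupby("interviewer_id")["score"]
                 .agg(mean="mean", std="std", n="count")
                 .sort_values("mean", ascending=False))
print(profile)    # top = most lenient (dove), bottom = most strict (hawk)
print("dove-hawk spread:", round(profile["mean"].max() - profile["mean"].min(), 2))
```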
2 · Interview Dynamics
How behavioural features correlate with score — recalculated for filtered data
Short responses are a red flag: short_response_ratio is the 3rd most important feature in the final model (negative coefficient). Candidates who give many brief answers tend to score lower — likely signalling weaker preparation or shallower knowledge.
Charts: Behavioural Feature Correlations vs Score · Interview Length vs Score (scatter)
Engagement matters: Longer interviews correlate with higher scores (interview_length_words ranks 7th in the model). This likely reflects a virtuous cycle — stronger candidates elicit more in-depth questioning. Vocabulary richness and response variance also carry positive signal, suggesting that articulate, detailed answers leave a measurable mark.
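Both callouts rest on a handful of transcript-level counts. A sketch of how such behavioural features could be derived, assuming each transcript is a list of (speaker, utterance) turns; the "candidate" speaker label and the 10-word cutoff for a "short" answer are illustrative assumptions, not the study's definitions:

```python
import numpy as np

def behavioural_features(turns: list[tuple[str, str]], short_cutoff: int = 10) -> dict:
    """Derive simple behavioural features from a transcript of (speaker, utterance) turns."""
    answers = [utterance for speaker, utterance in turns if speaker == "candidate"]
    answer_lengths = [len(a.split()) for a in answers]
    all_words = " ".join(utterance for _, utterance in turns).split()
    return {
        "interview_length_words": len(all_words),
        "short_response_ratio": float(np.mean([n < short_cutoff for n in answer_lengths])),
        "response_length_variance": float(np.var(answer_lengths)),
        "vocabulary_richness": len({w.lower() for w in all_words}) / len(all_words),
    }
```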
3 · Evaluation Dimensions
Comment (pink) vs transcript (blue) dimension correlations — recomputed on filtered data
Transcripts carry richer signal: Seven transcript-derived dimensions have r > 0.45 with score, compared to only one comment-based dimension (overall_sentiment, r = 0.67). The strongest transcript predictors are overall readiness (r = 0.68), leadership signals (r = 0.61), and entrepreneurial spirit (r = 0.60). This suggests full transcripts capture nuances that terse comments miss.
Chart: Dimension–Score Correlations (comment vs transcript)
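A sketch of the comparison itself, assuming a candidate-level table where the LLM-coded dimensions appear as comment_* and transcript_* columns alongside the score (the column naming is an assumption):

```python
import pandas as pd
from scipy import stats

def dimension_correlations(df: pd.DataFrame, target: str = "score") -> pd.DataFrame:
    """Pearson r of every comment_*/transcript_* dimension column against the score."""
    rows = []
    for col in df.columns:
        if col.startswith(("comment_", "transcript_")):
            r, p = stats.pearsonr(df[col], df[target])
            source, dimension = col.split("_", 1)
            rows.append({"dimension": dimension, "source": source, "r": r, "p": p})
    return pd.DataFrame(rows).sort_values("r", ascending=False)
```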
Taxonomy gap — novel dimensions discovered: BERTopic analysis of 1,578 interview segments identified 63 latent topics, which LLM refinement synthesised into 10 interpretable dimensions. Two are entirely novel — Problem Identification & Market Validation and Experimental Methodology & Technical Rigor — neither appears in the hand-crafted rubric. Conversely, four hand-crafted dimensions (academic record, problem solving, overall sentiment, overall readiness) were not surfaced by topic modelling.
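A minimal sketch of the discovery step, assuming the segments are available as a list of strings; the study's actual BERTopic configuration and LLM refinement prompt are not documented here, so the parameters below are placeholders:

```python
from bertopic import BERTopic

# segments: list[str] of interview segments (loading not shown; ~1,578 in the study).
topic_model = BERTopic(language="english", min_topic_size=10)
topics, probs = topic_model.fit_transform(segments)

# Inspect the raw topics and their keyword signatures. In the study, the 63 raw
# topics were subsequently merged into 10 interpretable dimensions via LLM review.
print(topic_model.get_topic_info().head(20))
```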
Cross-method agreement is uneven: Hardware assessment shows the best comment–transcript agreement (r = 0.67), while vision/leadership show poor cross-method alignment (r < 0.23). This means interviewers' brief comments and the full transcript can tell different stories — especially for "soft" qualities like vision and leadership.
4 · Predictive Model
Click any point to see candidate details · Filtered points are highlighted
Model performance: A linear SVR with Leave-One-Group-Out CV (grouped by candidate to prevent leakage) achieves MAE = 0.43, R² = 0.56, r = 0.75 using 37 features. The top predictors are: (1) comment overall sentiment (coeff. 0.162), (2) transcript prototyping ability (0.114), (3) short response ratio (−0.089), (4) transcript software depth (−0.074, counterintuitively negative), and (5) transcript leadership signals (0.068). Including interviewer identity only improves MAE by 0.016 — confirming bias is detectable but small relative to candidate-level signals.
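A sketch of this evaluation protocol with scikit-learn, assuming a features.csv that holds the 37 feature columns plus score and candidate_id (file and column names are assumptions, and the hyper-parameters shown are defaults rather than the tuned values):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

df = pd.read_csv("features.csv")                      # assumed: 37 feature columns + score + candidate_id
X = df.drop(columns=["score", "candidate_id"]).values
y = df["score"].values
groups = df["candidate_id"].values                    # group rows by candidate

model = make_pipeline(StandardScaler(), LinearSVR(max_iter=10_000))

# Leave-one-candidate-out CV: both scorer rows for a candidate are held out together,
# so neither interviewer's score for that candidate leaks into training.
y_pred = cross_val_predict(model, X, y, groups=groups, cv=LeaveOneGroupOut())
print(f"MAE={mean_absolute_error(y, y_pred):.2f}  "
      f"R2={r2_score(y, y_pred):.2f}  r={np.corrcoef(y, y_pred)[0, 1]:.2f}")

# Refit on all rows to read off the signed coefficients behind the feature ranking above.
model.fit(X, y)
coefficients = model.named_steps["linearsvr"].coef_
```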
Charts: Predicted vs Actual (107 pts) · Residual Histogram (filtered)

Model Comparison (– = not reported in this summary)

| Feature set | Features (n) | MAE | r | R² |
|---|---|---|---|---|
| Comments only | – | 0.47 | – | – |
| Behavioural only | – | 0.53 | – | – |
| Transcript-LLM dimensions only | – | 0.53 | – | – |
| All combined | 37 | 0.43 | 0.75 | 0.56 |
Three feature families are complementary: Comments alone (MAE 0.47), behavioural features alone (0.53), and transcript-LLM dimensions alone (0.53) each capture different facets of candidate quality. Combining all three drops MAE to 0.43 — a 9 % improvement over comments alone. Prototyping is the most predictive specific competency (both comment & transcript features rank in the top 6), while business skills have near-zero predictive power — suggesting interviewers may underweight commercial competence.
Candidate Details
5 · Interviewer Tendencies
Residuals (actual − predicted) reveal systematic hawk/dove tendencies · Click a bar to solo that interviewer
Residual analysis: After controlling for candidate quality via the model, positive residuals indicate an interviewer scores higher than predicted (lenient / dove 🕊️) and negative residuals indicate lower than predicted (strict / hawk 🦅). Interviewer 0 shows the most positive median residual (consistently generous), while Interviewer 5 shows the most negative (consistently harsh). The IQR spread reveals Interviewer 5 also has the widest variability — suggesting both bias and inconsistency.
Charts: Median Residual by Interviewer · Residual Distribution
Anomaly thresholds: Based on the model's residual distribution, scores are flagged as moderate anomalies at ±0.54 points, severe at ±0.81, and extreme at ±1.08. These thresholds can power an automated early-warning system that identifies candidates whose scores warrant a second look.
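A sketch of the flagging rule implied by those thresholds:

```python
def flag_score(residual: float) -> str:
    """Map a model residual (actual - predicted score) to an anomaly tier,
    using the thresholds reported above."""
    magnitude = abs(residual)
    if magnitude >= 1.08:
        return "extreme"
    if magnitude >= 0.81:
        return "severe"
    if magnitude >= 0.54:
        return "moderate"
    return "within expected noise"
```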
6 · Recommendations
Actionable steps derived from the analysis
① Adopt a hybrid rubric. Merge the 10 data-mined dimensions with the 4 hand-crafted dimensions not surfaced by topic modelling (academic record, problem solving, overall sentiment, overall readiness). Include the 2 novel discovered dimensions — Problem Identification & Market Validation and Experimental Methodology & Technical Rigor — to close gaps in the evaluation framework.
② Calibrate interviewers. The 0.60-point spread between the most lenient (Int 0) and most strict (Int 5) interviewer is meaningful on a 5-point scale. Pair high-bias interviewers for norming exercises, share anchor examples, and periodically review residual plots to track drift.
③ Anchor scoring with behavioural examples. Provide dimension-level anchors (e.g., "a score of 4 in prototyping means the candidate demonstrated a complete build cycle with user testing"). The LLM-generated scoring guidance from Phase 1.2 is a ready-made starting point.
④ Flag divergent scores automatically. Use the anomaly-detection thresholds (± 0.54 / 0.81 / 1.08) to flag candidate scores that deviate beyond expected noise. This acts as a quality-control layer without overriding interviewer judgment.
⑤ Leverage transcripts for scalable assessment. Transcript-derived dimensions outperform brief comments (7 dimensions with r > 0.45 vs. 1). Investing in transcript-based LLM coding could provide a richer, more consistent signal — especially useful for large cohorts or auditing purposes.
Limitations: Small sample (54 candidates) limits statistical power — several topic correlations did not reach significance. BERTopic discovered 63 topics, which is likely over-split. LLM coding reproducibility has not been formally tested across runs. These findings should be validated on a future cohort.