AI vs Thai Exams
The AI vs Thai Exams project evaluates large language models on Thailand's standardized exams.
I built this project to create an up-to-date benchmark for AI models on Thai educational content, as existing leaderboards had become outdated with newer models. The exam datasets come from the ThaiExam Dataset from SCB 10X and the OpenThaiGPT Evaluation Dataset.
View Live Results: https://ai-vs-thai-exams.pages.dev/
Motivation
In 2024, I began testing various large language models on O-NET high school exams using SCB 10X's ThaiExam dataset. With new models being released frequently (such as Claude 3.7 Sonnet at the time), I wanted to see how they performed on Thai content.
Prior work includes the ThaiLLM Leaderboard project, as well as the ThaiExam Leaderboard on CRFM HELM, but the data was quite outdated: there was no data for models newer than Claude 3.5, and many models have improved significantly since then.
I initially considered cloning those projects to run the benchmarks myself, but couldn't get them running locally (Python version conflicts, multi-gigabyte dependencies like PyTorch), so I built my own lightweight benchmark using Bun + Vercel AI SDK.
Methodology
The benchmark differs from existing approaches in several key ways:
- API-only models: Focuses only on models accessible via API, eliminating complex inference dependencies and keeping the project lightweight. We use Vercel AI SDK with the official API providers for OpenAI, Anthropic, and Google, while other open models are accessed via OpenRouter.
- Zero-shot testing: Uses zero-shot prompting, with no example questions or answers provided. Models see only the expected JSON input/output format followed by the question to solve.
- Reasoning transparency: Prompts allow models to think and explain their reasoning before answering, which improves performance and reveals how they arrive at each answer.
- Individual evaluation: Unlike HELM's approach, each question is evaluated in its own API call rather than in batches (see the sketch after this list).
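To make this concrete, here is a minimal sketch of how a single question could be evaluated with the Vercel AI SDK. The `ExamQuestion` shape, the prompt wording, the `gpt-4o` model choice, and the answer-extraction logic are illustrative assumptions for this sketch, not the project's actual code.

```ts
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Hypothetical question shape; the real ThaiExam records may use different fields.
interface ExamQuestion {
  question: string;
  choices: Record<string, string>; // e.g. { a: "...", b: "...", ... }
  answer: string; // key of the correct choice
}

async function evaluateQuestion(q: ExamQuestion): Promise<boolean> {
  const choiceList = Object.entries(q.choices)
    .map(([key, text]) => `${key}. ${text}`)
    .join("\n");

  // Zero-shot: the model sees only the expected output format and the
  // question itself; no example questions or answers are included.
  const { text } = await generateText({
    model: openai("gpt-4o"),
    prompt: [
      "Answer the following multiple-choice question.",
      "Think step by step and explain your reasoning first,",
      'then finish with a JSON object like {"answer": "a"}.',
      "",
      q.question,
      choiceList,
    ].join("\n"),
  });

  // Grade on the last JSON object in the output, so the reasoning text
  // that precedes it does not interfere with answer extraction.
  const lastJson = text.match(/\{[^{}]*\}/g)?.at(-1);
  if (!lastJson) return false;
  return JSON.parse(lastJson).answer === q.answer;
}

// Individual evaluation: each question is its own API call, e.g.
// for (const q of questions) results.push(await evaluateQuestion(q));
```

Sending each question as a separate `generateText` call keeps failures isolated: one malformed response affects only that question's score, not an entire batch.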
Community Support
The project is community-funded. Generous individual sponsors contribute funds and API keys to cover the cost of running expensive models like o1-preview and Claude Opus:
- Jetbodin Prakoonsuksapan
- Sakol Assawasagool
- Khachain Wangthammang
- Kasidis Satangmongkol
- Tossapol Pomsuwan
- R'ket via Veha Suwatphisankij and Natechawin Suthison
- Chrisada Sookdhis
Source code on GitHub: https://github.com/dtinth/thaiexamjs
Updates
I share regular updates on Facebook about new models and findings:
| Date | Update |
|---|---|
| 2025-08-08 | GPT-5 becomes second model to exceed 90% on O-NET (90.24%), joining Gemini 2.5 Pro (91.46%) |
| 2025-05-23 | Claude 4 performance analysis - coding-focused model shows decreased O-NET performance vs 3.7 Sonnet |
| 2025-04-18 | Gemini 2.5 Pro breakthrough - first model to exceed 90% on O-NET (up from 72% in June 2024) |
| 2025-03-08 | Major model additions - GPT-4.5, o3-mini, o1, Gemma-2-27b, plus IC Licensing Exams |
| 2025-02-28 | TPAT-1 medical ethics findings - models excel at intelligence (92%) but struggle with ethics (68%) |
| 2025-02-26 | Project announcement - lightweight benchmark launch with web report interface |