AI vs Thai Exams
The AI vs Thai Exams project evaluates large language models on Thailand's standardized exams.
I built this project to create an up-to-date benchmark for AI models on Thai educational content, as existing leaderboards had become outdated with newer models. The exam datasets come from the ThaiExam Dataset from SCB 10X and the OpenThaiGPT Evaluation Dataset.
View Live Results: https://ai-vs-thai-exams.pages.dev/
Motivation
In 2024, I began testing various large language models on O-NET high school exams using SCB 10X's ThaiExam dataset. With new models being released frequently (such as Claude 3.7 Sonnet at the time), I wanted to see how they performed on Thai content.
Prior work includes the ThaiLLM Leaderboard project, as well as the ThaiExam Leaderboard on CRFM HELM, but the data was quite outdated: there was no data for models newer than Claude 3.5, and many models have improved significantly since then.
I initially considered cloning those projects to run the benchmarks myself, but couldn't get them running locally (Python version conflicts, multi-gigabyte dependencies like PyTorch), so I built my own lightweight benchmark using Bun + Vercel AI SDK.
Methodology
The benchmark differs from existing approaches in several key ways:
- API-only models: Focuses only on models accessible via API, eliminating complex inference dependencies and keeping the project lightweight. We use Vercel AI SDK with the official API providers for OpenAI, Anthropic, and Google, while other open models are accessed via OpenRouter.
- Zero-shot testing: Uses zero-shot prompting, with no example questions or answers provided. Models see only the expected JSON input/output format followed by the question to solve.
- Reasoning transparency: Prompts allow models to think and explain their reasoning before answering, which improves performance and reveals how they arrive at each answer.
- Individual evaluation: Unlike HELM's approach, each question is evaluated in its own API call rather than in batches (see the sketch after this list).
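To make this concrete, here is a minimal sketch of how a single question could be evaluated with the Vercel AI SDK. The `ExamQuestion` shape, the prompt wording, the `gpt-4o` model choice, and the answer-extraction logic are illustrative assumptions for this sketch, not the project's actual code.

```ts
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Hypothetical question shape; the real ThaiExam records may use different fields.
interface ExamQuestion {
  question: string;
  choices: Record<string, string>; // e.g. { a: "...", b: "...", ... }
  answer: string; // key of the correct choice
}

async function evaluateQuestion(q: ExamQuestion): Promise<boolean> {
  const choiceList = Object.entries(q.choices)
    .map(([key, text]) => `${key}. ${text}`)
    .join("\n");

  // Zero-shot: the model sees only the expected output format and the
  // question itself; no example questions or answers are included.
  const { text } = await generateText({
    model: openai("gpt-4o"),
    prompt: [
      "Answer the following multiple-choice question.",
      "Think step by step and explain your reasoning first,",
      'then finish with a JSON object like {"answer": "a"}.',
      "",
      q.question,
      choiceList,
    ].join("\n"),
  });

  // Grade on the last JSON object in the output, so the reasoning text
  // that precedes it does not interfere with answer extraction.
  const lastJson = text.match(/\{[^{}]*\}/g)?.at(-1);
  if (!lastJson) return false;
  return JSON.parse(lastJson).answer === q.answer;
}

// Individual evaluation: each question is its own API call, e.g.
// for (const q of questions) results.push(await evaluateQuestion(q));
```

Sending each question as a separate `generateText` call keeps failures isolated: one malformed response affects only that question's score, not an entire batch.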
Community Support
The project is community-funded. Generous individual sponsors contribute funds and API keys to cover the cost of running expensive models like o1-preview and Claude Opus:
- Jetbodin Prakoonsuksapan
- Sakol Assawasagool
- Khachain Wangthammang
- Kasidis Satangmongkol
- Tossapol Pomsuwan
- R'ket via Veha Suwatphisankij and Natechawin Suthison
- Chrisada Sookdhis
Source code on GitHub: https://github.com/dtinth/thaiexamjs
Updates
I share regular updates on Facebook about new models and findings:
| Date | Update |
|---|---|
| 2025-08-08 | GPT-5 becomes second model to exceed 90% on O-NET (90.24%), joining Gemini 2.5 Pro (91.46%) |
| 2025-05-23 | Claude 4 performance analysis - coding-focused model shows decreased O-NET performance vs 3.7 Sonnet |
| 2025-04-18 | Gemini 2.5 Pro breakthrough - first model to exceed 90% on O-NET (up from 72% in June 2024) |
| 2025-03-08 | Major model additions - GPT-4.5, o3-mini, o1, Gemma-2-27b, plus IC Licensing Exams |
| 2025-02-28 | TPAT-1 medical ethics findings - models excel at intelligence (92%) but struggle with ethics (68%) |
| 2025-02-26 | Project announcement - lightweight benchmark launch with web report interface |