Finalist — Data Mining Category (Gemastik)

Kemendiktisaintek · National

Reached the finals of the Data Mining category at Gemastik XVIII 2025, Indonesia's largest national student technology competition, organized by Kemendiktisaintek. We built an agentic Text-to-SQL system targeting a real-world problem: most Indonesian civil servants (ASN) lack SQL skills, so data-driven policymaking depends on the availability of technical staff.

The problem

Only ~30% of Indonesian civil servants (ASN) can work digitally, and almost none can write SQL, which creates an information bottleneck for government data access.
Policy makers and leaders are forced to rely on a small pool of technical staff just to query databases.
Goal: build a bridge between natural language and SQL so non-technical users can interact with government databases directly.

What we built

An agentic Text-to-SQL system that translates natural language questions (English) into executable SQL queries using a Small Language Model (SLM).
Model: Qwen2.5-Coder-32B-Instruct (32.5B parameters), chosen for its coding capability and its resource efficiency relative to large commercial LLMs.
Architecture: three-area workflow — User Area (input handling), Management Area (schema validation, prompt construction, agent scratchpad), and Generation Area (SLM + tool binding).
The agent loop uses the Model Context Protocol (MCP) to bind three tools (sql_db_list_tables, sql_db_schema, sql_db_query), enabling iterative query generation, execution, and self-correction.
A Thinking Pool stores thought/action/observation traces per iteration, concatenated into the prompt for multi-step reasoning.
Evaluated on 198 questions from a representative subset of the Spider dataset (6 databases, 69 schemas).
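
The agent loop and Thinking Pool described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the system's actual implementation: the three tools are plain Python callables over an in-memory SQLite database rather than MCP-bound tools, the SLM is replaced by a hypothetical scripted_model stub, and the names make_tools, build_prompt, run_agent and the apartments table are invented for the example.

```python
import sqlite3


def make_tools(conn):
    """Return the three tools as plain callables (stand-ins for MCP-bound tools)."""

    def sql_db_list_tables(_arg=""):
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
        return ", ".join(name for (name,) in rows)

    def sql_db_schema(table):
        row = conn.execute(
            "SELECT sql FROM sqlite_master WHERE type='table' AND name=?",
            (table.strip(),),
        ).fetchone()
        return row[0] if row else f"Error: unknown table {table!r}"

    def sql_db_query(query):
        # Errors are returned as text so the agent can read them and retry.
        try:
            return str(conn.execute(query).fetchall())
        except sqlite3.Error as exc:
            return f"Error: {exc}"

    return {
        "sql_db_list_tables": sql_db_list_tables,
        "sql_db_schema": sql_db_schema,
        "sql_db_query": sql_db_query,
    }


def build_prompt(question, thinking_pool):
    """Concatenate every stored thought/action/observation trace into the prompt."""
    lines = [f"Question: {question}"]
    for thought, action, action_input, observation in thinking_pool:
        lines.append(f"Thought: {thought}")
        lines.append(f"Action: {action}[{action_input}]")
        lines.append(f"Observation: {observation}")
    return "\n".join(lines)


def run_agent(question, model, tools, max_steps=6):
    """Iterate think -> act -> observe until the model emits a final answer."""
    thinking_pool = []  # one (thought, action, input, observation) tuple per step
    for _ in range(max_steps):
        thought, action, action_input = model(build_prompt(question, thinking_pool))
        if action == "final_answer":
            return action_input, thinking_pool
        tool = tools.get(action)
        observation = tool(action_input) if tool else f"Error: unknown tool {action!r}"
        thinking_pool.append((thought, action, action_input, observation))
    return None, thinking_pool  # step budget exhausted without a final answer


def scripted_model(prompt):
    """Hypothetical stand-in for the SLM: replays a fixed plan, one step per call."""
    good_sql = "SELECT COUNT(*) FROM apartments WHERE bedroom_count > 2"
    script = [
        ("List the available tables.", "sql_db_list_tables", ""),
        ("Inspect the apartments table.", "sql_db_schema", "apartments"),
        ("Try a first query.", "sql_db_query",
         "SELECT COUNT(*) FROM apartments WHERE bedrooms > 2"),  # wrong column
        ("The column is bedroom_count; retry.", "sql_db_query", good_sql),
        ("The query ran; return it.", "final_answer", good_sql),
    ]
    step = prompt.count("Observation:")  # number of completed iterations so far
    return script[min(step, len(script) - 1)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE apartments (apt_id INTEGER PRIMARY KEY, bedroom_count INTEGER)")
conn.executemany("INSERT INTO apartments VALUES (?, ?)", [(1, 1), (2, 3), (3, 4)])

final_sql, pool = run_agent(
    "How many apartments have more than 2 bedrooms?", scripted_model, make_tools(conn)
)
print(final_sql)
for trace in pool:
    print(trace)
```

In this toy run the third step fails on a non-existent column, the error text is stored in the pool, and the next step issues a corrected query, which mirrors the self-correction behaviour described above. Returning errors as observations instead of raising them is a design choice of this sketch.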

Results

100% execution success rate — all 198 questions produced runnable SQL queries.
Hybrid Similarity: 7.75/10 — combining token similarity (7.37), AST structural similarity (7.89), and embedding-level semantic similarity (9.67).
SLM-based Semantic Equivalence: 8.19/10 — indicating the system reliably captures the intent behind questions.
Best performing database: apartment_rentals (hybrid 8.28, semantic 9.24) — simpler linear schema.
Hardest database: formula_1 (hybrid 7.09, semantic 6.21) — 13 tables with complex seasonal relations (lap_times, pit_stops, constructors).
Key failure pattern: semantic mismatch on relative or ambiguous terms (e.g., "popular" or "payment methods"). The model hallucinates table or column choices when the natural-language wording does not map exactly to schema names.
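
To make the hybrid score more concrete, here is a rough sketch of a token-level similarity and a weighted combination on a 0 to 10 scale. It rests on assumptions: the source does not state how token similarity was computed or how the three components were weighted, so the Jaccard overlap, the regex tokenizer, the equal default weights, and the function names sql_tokens, token_similarity and hybrid_score are all placeholders. The AST and embedding components are taken as precomputed inputs.

```python
import re


def sql_tokens(sql):
    """Split a SQL string into a set of lowercase identifiers, numbers, and symbols."""
    return set(re.findall(r"[a-z_][a-z_0-9]*|\d+|[^\sa-z_0-9]", sql.lower()))


def token_similarity(predicted_sql, reference_sql):
    """Jaccard overlap of the two token sets, scaled to 0-10 (assumed metric)."""
    pred, ref = sql_tokens(predicted_sql), sql_tokens(reference_sql)
    if not pred and not ref:
        return 10.0  # two empty queries are treated as identical
    return 10.0 * len(pred & ref) / len(pred | ref)


def hybrid_score(token, ast, embedding, weights=(1.0, 1.0, 1.0)):
    """Weighted mean of the three component scores; the weights are placeholders."""
    w_token, w_ast, w_embedding = weights
    total = w_token + w_ast + w_embedding
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (w_token * token + w_ast * ast + w_embedding * embedding) / total


predicted = "SELECT name FROM users"
reference = "SELECT name, age FROM users"
print(round(token_similarity(predicted, reference), 2))
print(round(hybrid_score(7.37, 7.89, 9.67), 2))
```

With equal weights the reported component scores average to about 8.31 rather than the reported 7.75, so the original combination presumably used a different weighting or aggregation than this placeholder.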

Research paper

Paper title: "Sistem Tanya-Jawab Agentis Berbasis SQL Menggunakan Small Language Model" (English: "An Agentic SQL-Based Question-Answering System Using a Small Language Model").
Contribution: demonstrates that a lightweight SLM-based agentic approach can match LLM-scale Text-to-SQL performance at a fraction of the computational cost — suitable for deployment in resource-constrained government environments.

Links

Paper · Certificate