Achievements
2024 · Competition
4th Place Final — Data Analytics FINDIT!
Gadjah Mada University · National
Placed 4th in the Finals of the Data Analytics competition at FINDIT! 2024, organized by Gadjah Mada University — a national competition with over 200 participating teams. The task: predict customer promotion acceptance for a supermarket using demographic and behavioral data.
The problem
- Multiclass classification: predict jumlah_promosi (0–6), i.e., which of the six promotions a customer accepts, with class 0 meaning the customer accepted none.
- Dataset: 3,817 customers × 17 features — demographics (age, education, marital status, income), shopping behavior (per-category spend, purchase channel), and complaint history.
- Key challenge: missing values, ordinal categorical features, and class imbalance across 7 promotion classes.
EDA and feature engineering
- Found that income range strongly correlates with promotion acceptance, so income was discretized into 5 quantile groups (pendapatan_score) to help the model capture this pattern; the preprocessing steps in this list are sketched in code below.
- Discovered that customers born before 1930 always had jumlah_promosi = 3 — flagged for potential age-group discretization.
- Identified that jumlah_anak_balita and jumlah_anak_remaja had similar patterns relative to jumlah_promosi — merged into a single jumlah_anak feature.
- Handled missing values using MICE (Multivariate Imputation by Chained Equations) with XGBoost as the base estimator.
- Replaced impossible values ('5') in pendidikan and status_pernikahan with NaN before imputation.
- Mapped ordinal categoricals (SMP→0, SMA→1, Sarjana→2, Magister→3, Doktor→4) so the encoding preserves the natural order of education levels.
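These steps chain together into one preprocessing pass. The sketch below is a minimal, non-authoritative reconstruction: column names like pendidikan, status_pernikahan, jumlah_anak_balita, and jumlah_anak_remaja come from the write-up, while the raw income column name (pendapatan), the file name, and all imputer settings are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from xgboost import XGBRegressor

df = pd.read_csv("train.csv")  # hypothetical file name

# Impossible code '5' in these columns becomes NaN so the imputer fills it in.
for col in ["pendidikan", "status_pernikahan"]:
    df[col] = df[col].replace([5, "5"], np.nan)

# Ordinal mapping keeps education levels in their natural order.
df["pendidikan"] = df["pendidikan"].map(
    {"SMP": 0, "SMA": 1, "Sarjana": 2, "Magister": 3, "Doktor": 4}
)
# status_pernikahan would be integer-encoded similarly; its categories are
# not listed in the write-up, so that step is omitted here.

# Merge the two child-count features that behaved alike vs. jumlah_promosi.
df["jumlah_anak"] = df["jumlah_anak_balita"] + df["jumlah_anak_remaja"]

# MICE-style imputation with XGBoost as the base estimator
# (IterativeImputer is scikit-learn's chained-equations imputer).
num_cols = df.select_dtypes(include="number").columns
imputer = IterativeImputer(estimator=XGBRegressor(n_estimators=100), random_state=42)
df[num_cols] = imputer.fit_transform(df[num_cols])
df["pendidikan"] = df["pendidikan"].round().clip(0, 4)  # snap back to valid codes

# Income discretized into 5 quantile groups -> pendapatan_score.
df["pendapatan_score"] = pd.qcut(df["pendapatan"], q=5, labels=False)
```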
Modeling
- Tested multiple treatments: baseline, income/age discretization, jumlah_anak feature creation, and Spearman-correlation-based feature clustering (a sketch of the clustering recipe follows this list).
- Best single treatment: discretization (Macro F1: 0.761 ± 0.026 vs baseline 0.748).
- Final model: a stacking ensemble with CatBoost, XGBoost, LightGBM, Hist Gradient Boosting, Random Forest, and Extra Trees as base estimators and Logistic Regression as the meta-learner (sketched after this list).
- Stacking outperformed all individual models: Macro F1 0.782 ± 0.009, Accuracy 0.779 ± 0.009.
- Evaluated using Stratified K-Fold cross-validation to preserve class proportions across folds.
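The write-up does not detail the Spearman-based treatment, so the sketch below follows the common recipe of hierarchical clustering of features on a 1 − |ρ| distance; the linkage method and the 0.5 cut threshold are hypothetical choices, not the team's.

```python
import pandas as pd
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

features = df.drop(columns=["jumlah_promosi"])  # df from the sketch above
corr = features.corr(method="spearman")         # rank-correlation matrix
dist = 1.0 - corr.abs()                         # similar features -> small distance
condensed = squareform(dist.values, checks=False)
linkage = hierarchy.linkage(condensed, method="average")
cluster_ids = hierarchy.fcluster(linkage, t=0.5, criterion="distance")
clusters = pd.Series(cluster_ids, index=features.columns)
print(clusters.sort_values())  # features grouped by correlation cluster
```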
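A minimal sketch of the final ensemble and its evaluation, assuming scikit-learn's StackingClassifier: the base-estimator line-up, the Logistic Regression meta-learner, and the Stratified K-Fold / Macro F1 evaluation follow the write-up, while every hyperparameter and random seed is an assumption.

```python
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import (
    ExtraTreesClassifier,
    HistGradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# df comes from the preprocessing sketch above.
X = df.drop(columns=["jumlah_promosi"])
y = df["jumlah_promosi"]

stack = StackingClassifier(
    estimators=[
        ("catboost", CatBoostClassifier(verbose=0, random_state=42)),
        ("xgboost", XGBClassifier(random_state=42)),
        ("lightgbm", LGBMClassifier(random_state=42)),
        ("hgb", HistGradientBoostingClassifier(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("et", ExtraTreesClassifier(random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # out-of-fold base predictions train the meta-learner
)

# Stratified folds keep the 7 imbalanced promotion classes proportioned;
# Macro F1 weights every class equally, which matters for the rare ones.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(stack, X, y, cv=cv, scoring="f1_macro")
print(f"Macro F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Note that cv=5 inside StackingClassifier means each base model contributes out-of-fold predictions, not training-set fits, as features for the meta-learner; that is what keeps the stack from overfitting to its own base models.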
Results
- 4th place out of 200+ teams as team LabtekV.
- Stacking was the best model overall — outperforming soft voting (0.769), Random Forest (0.768), CatBoost (0.754), and XGBoost (0.734).
- Main error pattern: most misclassifications involved label 0 (customers who never accepted any promotion), as expected for the dominant class.