role · Data Science Research Intern

Sibelco

Two-year KE@Work research internship — optimizing energy consumption of an industrial ball mill through data analysis, predictive modeling, and automated control. Sep 2023 – Jul 2025.

Languages

Python Java LaTeX

Methods

Time Series Outlier Detection MPC Optimization

Tools

InfluxDB Grafana Scikit-learn TensorFlow

Type

Research KE@Work BSc Thesis

01 // summary

Two-year research internship at Sibelco as part of the KE@Work honours programme at Maastricht University. The project focused on reducing energy consumption of an industrial ball mill — a grinding machine where a gas burner, controlled by a PID loop, is the dominant energy consumer. The core challenge: a delay between the burner and the temperature sensor causes the controller to operate inefficiently, and no principled method existed for selecting the temperature setpoint.

The work progressed from exploratory correlation analysis and setpoint studies, through time-series modeling and data quality improvement, to a final predictive control system that automates setpoint selection. Compared with operator-selected setpoints, the system achieved an 8.9% increase in throughput (statistically significant, p = 0.001) and a 2.6% reduction in energy consumption (not statistically significant, p = 0.487). The work culminated in a bachelor thesis: Automating Temperature Setpoint Selection For Ball Mill Using Predictive Modeling.

02 // work

3 phases

§

Exploratory Analysis & Setpoint Study

The first phase established which factors drive energy consumption and what constitutes a good temperature setpoint. Pearson correlations were computed across both short (3h) and long (1yr) time windows to identify stable relationships. A dedicated setpoint analysis divided operating conditions into outside-temperature ranges and found local optima for each — the first principled basis for setpoint selection in the installation.
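The short-versus-long window comparison can be sketched as follows. This is a minimal illustration rather than the project's code: the series, the window length, and the "same sign in every window" stability rule are all assumptions made for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def windowed_pearson(x, y, window):
    """Pearson's r for each consecutive, non-overlapping window of `window` samples."""
    return [
        pearson(x[start:start + window], y[start:start + window])
        for start in range(0, len(x) - window + 1, window)
    ]


# Hypothetical toy series (not project data): burner load and energy use.
burner = [1.0, 2.0, 3.0, 3.0, 2.0, 1.0, 4.0, 5.0, 6.0]
energy = [2.1, 3.9, 6.2, 5.8, 4.1, 2.0, 8.1, 9.8, 12.2]

short_term = windowed_pearson(burner, energy, window=3)  # stand-in for the 3h windows
long_term = pearson(burner, energy)                      # stand-in for the 1yr window

# Assumed stability rule: every short window agrees in sign with the long horizon.
stable = all(r * long_term > 0 for r in short_term)
```

A feature whose short-window coefficients flip sign would fail this check even if its long-horizon correlation looks strong.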

Correlation analysis over short and long time horizons to identify energy-driving features
Energy consumption unit transformation and distribution comparison
Setpoint optimality study segmented by ambient temperature bands
Initial regression model for forecasting process variables x time steps ahead
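The setpoint study segmented by ambient temperature can be illustrated with a small grouping routine. The band width of 5 °C, the record layout, and the choice of "lowest mean energy" as the optimality criterion are assumptions for this sketch; the source does not state how the local optima were defined.

```python
from collections import defaultdict
from statistics import mean


def band_of(outside_temp, width=5):
    """Lower edge of the ambient temperature band containing `outside_temp`."""
    return int(outside_temp // width) * width


def best_setpoint_per_band(records, width=5):
    """Map each ambient band to the setpoint with the lowest mean energy use.

    `records` is an iterable of (outside_temp, setpoint, energy) tuples.
    """
    energy_by_band = defaultdict(lambda: defaultdict(list))
    for outside_temp, setpoint, energy in records:
        energy_by_band[band_of(outside_temp, width)][setpoint].append(energy)

    return {
        band: min(by_setpoint, key=lambda sp: mean(by_setpoint[sp]))
        for band, by_setpoint in sorted(energy_by_band.items())
    }


# Hypothetical observations: (outside temp in °C, setpoint in °C, energy in arbitrary units).
records = [
    (3.0, 95, 10.2), (4.5, 95, 10.0), (2.0, 100, 9.1), (4.0, 100, 9.3),
    (12.0, 90, 7.9), (13.5, 90, 8.1), (11.0, 95, 8.6), (14.0, 95, 8.8),
]
optima = best_setpoint_per_band(records)
```

With these toy records the 0–5 °C band prefers a setpoint of 100 and the 10–15 °C band prefers 90, which is the kind of per-band lookup the study produced.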

§

Time-Series Modeling & Data Quality

The second phase attempted mathematical modeling of the system (AR, ARIMA, ARIMAX, LSTM) and found that data quality was the limiting factor — noise from maintenance periods, warm-up cycles, and sensor artefacts obscured most structure. This prompted a systematic data quality improvement effort across three methods.

Compared AR, ARIMA, ARIMAX, and LSTM on five held-out datasets; stationarity verified with Augmented Dickey–Fuller
Rule-based filtering using operator field knowledge — 13 rules, reducing the dataset by over 75% while significantly improving signal quality
Inverse variance weighting via generalised variance of the covariance matrix to down-weight noisy periods
Outlier detection: Isolation Forest, Mahalanobis distance, Local Outlier Factor, and One-Class SVM — algorithms compared on predicted inlier/outlier assignments
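The inverse variance weighting step can be sketched as below, taking the generalised variance to be the determinant of each window's sample covariance matrix. How the data were split into windows, the `eps` guard, and the normalisation to unit sum are assumptions for illustration. The example sticks to the standard library so it runs anywhere; in practice `numpy.cov` and `numpy.linalg.det` would do the same job in two lines.

```python
def covariance_matrix(rows):
    """Sample covariance matrix of `rows`, a list of equal-length feature vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
         for j in range(d)]
        for i in range(d)
    ]


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    det = 1.0
    for col in range(len(m)):
        pivot = max(range(col, len(m)), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, len(m)):
            factor = m[r][col] / m[col][col]
            for c in range(col, len(m)):
                m[r][c] -= factor * m[col][c]
    return det


def inverse_variance_weights(windows, eps=1e-12):
    """Weight each window by 1 / generalised variance, normalised to sum to 1."""
    raw = [1.0 / (determinant(covariance_matrix(w)) + eps) for w in windows]
    total = sum(raw)
    return [w / total for w in raw]


# Hypothetical two-feature windows: one quiet period, one noisy period.
quiet = [(1.0, 2.0), (1.1, 2.1), (0.9, 2.0), (1.0, 1.9)]
noisy = [(1.0, 2.0), (3.0, 0.5), (-1.0, 4.0), (2.0, 2.5)]
weights = inverse_variance_weights([quiet, noisy])
```

The quiet window receives almost all of the weight, so samples from noisy periods contribute little to a subsequently fitted model.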

Figure: SHAP feature importance for the energy consumption predictor

§

Predictive Control & Automation

The final phase built a model predictive control (MPC) system to automate setpoint selection. Four ML models (Linear Regression, Decision Tree, Random Forest, and XGBoost) were tuned and compared for forecasting energy consumption and throughput. XGBoost performed best (MAE = 0.081, R² = 0.51 on standardised data) and was used in the final pipeline. Feature analysis across all models consistently identified lagged production values and outside temperature as the dominant drivers. Since outside temperature is an external input, this leaves the temperature setpoint as the main lever worth optimising. The system evaluates a cost function over all candidate integer setpoints (80–120°C) and selects the one with the lowest combined cost.
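The setpoint selection step amounts to an exhaustive search over 41 candidates. A minimal sketch follows, assuming the cost form C = w₁E + w₂(1/T) with equal weights used in this phase. The two predictor functions are hypothetical stand-ins for the trained forecasting models, the weight value 0.5 is an arbitrary choice of "equal", and the sketch assumes predicted throughput is strictly positive.

```python
def select_setpoint(predict_energy, predict_throughput, outside_temp,
                    setpoints=range(80, 121), w_energy=0.5, w_throughput=0.5):
    """Return (best_setpoint, best_cost) minimising C = w1*E + w2*(1/T)."""
    best_setpoint, best_cost = None, float("inf")
    for setpoint in setpoints:
        energy = predict_energy(setpoint, outside_temp)
        throughput = predict_throughput(setpoint, outside_temp)
        cost = w_energy * energy + w_throughput * (1.0 / throughput)
        if cost < best_cost:
            best_setpoint, best_cost = setpoint, cost
    return best_setpoint, best_cost


# Hypothetical stand-ins for the trained models (not the project's predictors).
def toy_energy(setpoint, outside_temp):
    # Energy grows with the gap between setpoint and ambient temperature.
    return 0.01 * (setpoint - outside_temp)


def toy_throughput(setpoint, outside_temp):
    # Throughput peaks at a setpoint of 100 °C in this toy model.
    return 1.0 / (1.0 + 0.001 * (setpoint - 100) ** 2)


setpoint, cost = select_setpoint(toy_energy, toy_throughput, outside_temp=10.0)
```

With these toy predictors the search settles on 95 °C, slightly below the throughput peak, because the energy term pulls the optimum toward cooler setpoints. A grid search is adequate here since the candidate set is small and the predictors are cheap to evaluate.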

Four regression models compared: Linear Regression, Decision Tree, Random Forest, XGBoost — XGBoost best across MAE, MSE, and R²
Feature importances computed per model: lagged throughput and energy consumption dominate; the setpoint is the key controllable factor, and outside temperature the key external one
Cost function C = w₁E + w₂(1/T) with equal weights, minimised via grid search over integer setpoints 80–120°C
Welch's t-test on outcome distributions: throughput gain significant (p = 0.001), energy reduction not significant (p = 0.487)
End-to-end pipeline: weather forecast → predict outcomes at each setpoint → select optimal → 2.6% energy reduction, 8.9% throughput improvement over operator baseline
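The significance check in the list above relies on Welch's t-test, which does not assume equal variances in the two groups. The sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom with the standard library only. The sample values are invented for illustration and are not the project's data. Obtaining a p-value additionally requires the t-distribution CDF, for which `scipy.stats.ttest_ind(a, b, equal_var=False)` is the usual route.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard errors of the two sample means.
    se2_a, se2_b = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Hypothetical throughput samples in arbitrary units.
automated = [10.9, 11.2, 11.0, 11.4, 11.1]
operator = [10.1, 10.4, 9.9, 10.3, 10.2, 10.0]
t_stat, dof = welch_t(automated, operator)
```

A large positive statistic indicates that the automated setpoints gave higher throughput than the operator baseline in this toy sample.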

Figure: Sample output of the final setpoint optimisation system
