themes
piotr@marciniak ~/work/sibelco
← home summary work
role · Data Science Research Intern

Sibelco

Two-year KE@Work research internship — optimizing energy consumption of an industrial ball mill through data analysis, predictive modeling, and automated control. Sep 2023 – Jul 2025.

Languages
Python Java LaTeX
Methods
Time Series Outlier Detection MPC Optimization
Tools
InfluxDB Graphana Scikit-learn TensorFlow
Type
Research KE@Work BSc Thesis
01 // summary

Two-year research internship at Sibelco as part of the KE@Work honours programme at Maastricht University. The project focused on reducing energy consumption of an industrial ball mill — a grinding machine where a gas burner, controlled by a PID loop, is the dominant energy consumer. The core challenge: a delay between the burner and the temperature sensor causes the controller to operate inefficiently, and no principled method existed for selecting the temperature setpoint.

The work progressed from exploratory correlation analysis and setpoint studies, through time-series modeling and data quality improvement, to a final predictive control system that automates setpoint selection. The system achieved a 2.6% reduction in energy consumption and an 8.9% increase in throughput over operator-selected setpoints. Culminated in a bachelor thesis: Automating Temperature Setpoint Selection For Ball Mill Using Predictive Modeling.

02 // work
3 phases
§

Exploratory Analysis & Setpoint Study

The first phase established which factors drive energy consumption and what constitutes a good temperature setpoint. Pearson correlations were computed across both short (3h) and long (1yr) time windows to identify stable relationships. A dedicated setpoint analysis divided operating conditions into outside-temperature ranges and found local optima for each — the first principled basis for setpoint selection in the installation.

  • Correlation analysis over short and long time horizons to identify energy-driving features
  • Energy consumption unit transformation and distribution comparison
  • Setpoint optimality study segmented by ambient temperature bands
  • Initial regression model for forecasting process variables at time advance x
§

Time-Series Modeling & Data Quality

The second phase attempted mathematical modeling of the system (AR, ARIMA, ARIMAX, LSTM) and found that data quality was the limiting factor — noise from maintenance periods, warm-up cycles, and sensor artefacts obscured most structure. This prompted a systematic data quality improvement effort across three methods.

  • Compared AR, ARIMA, ARIMAX, and LSTM on five held-out datasets; stationarity verified with Augmented Dickey–Fuller
  • Rule-based filtering using operator field knowledge — 13 rules, reducing the dataset by over 75% while significantly improving signal quality
  • Inverse variance weighting via generalised variance of the covariance matrix to down-weight noisy periods
  • Outlier detection: Isolation Forest, Mahalanobis distance, Local Outlier Factor, and One-Class SVM — algorithms compared on predicted inlier/outlier assignments
SHAP feature importance for the energy consumption predictor
SHAP values for the energy consumption regression model. Previous average energy consumption is the dominant driver; weather variables and throughput provide secondary signal.
§

Predictive Control & Automation

The final phase built a model predictive control (MPC) system to automate setpoint selection. Four ML models — Linear Regression, Decision Tree, Random Forest, and XGBoost — were tuned and compared for forecasting energy consumption and throughput. XGBoost performed best (MAE = 0.081, R² = 0.51 on standardised data) and was used in the final pipeline. Feature analysis across all models consistently identified lagged production values and outside temperature as the dominant drivers — confirming that the setpoint and ambient temperature are the only levers worth optimising. The system evaluates a cost function over all candidate integer setpoints (80–120°C) and selects the one with the lowest combined cost.

  • Four regression models compared: Linear Regression, Decision Tree, Random Forest, XGBoost — XGBoost best across MAE, MSE, and R²
  • Feature importances computed per model: lagged throughput and energy consumption dominate; setpoint and outside temperature are the key controllable factors
  • Cost function C = w₁E + w₂(1/T) with equal weights, minimised via grid search over integer setpoints 80–120°C
  • Welch t-test on outcome distributions: throughput gain significant (p = 0.001), energy gain not significant (p = 0.487)
  • End-to-end pipeline: weather forecast → predict outcomes at each setpoint → select optimal → 2.6% energy reduction, 8.9% throughput improvement over operator baseline
Sample output of the final setpoint optimisation system
Final system output: predicted energy consumption and throughput at each candidate setpoint (left) and the combined cost function (right). Optimal setpoint at 92°C vs. operator-selected 100°C.