2022 Predictive Modeling

Physics-Guided ML for Global Sensor Data

Trained and benchmarked ML models on high-frequency IoT data (124 global locations), implementing robust data-splitting to ensure real-world reliability.

Python Scikit-Learn TensorFlow Data Pipelines

THE PROBLEM

ML models are useless if they fail when deployed to new geographical locations. Standard training workflows often lead models to memorize location-specific noise rather than learn generalizable patterns, producing over-optimistic performance estimates that collapse in real-world deployment.

THE ACTION

Trained three powerful ML models (LightGBM, Random Forest, Neural Networks) on IoT data from 124 global locations. Enforced strict spatial cross-validation by keeping the training and test sites completely separate, so that no site contributed data to both.
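A site-wise split of this kind can be sketched with scikit-learn's GroupKFold, which keeps all rows sharing a group label in the same fold. This is only an illustration under assumptions: the data below is random, the 5 folds, 10 rows per site, and 4 features are arbitrary choices, and the names site_ids and folds are hypothetical rather than taken from the project's source.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic stand-in data (assumption): 124 sites, 10 rows each, 4 features.
rng = np.random.default_rng(0)
n_sites, rows_per_site = 124, 10
site_ids = np.repeat(np.arange(n_sites), rows_per_site)
X = rng.normal(size=(site_ids.size, 4))
y = rng.normal(size=site_ids.size)

# GroupKFold assigns every row of a site to the same fold,
# so a site never appears in both the training and the test part.
cv = GroupKFold(n_splits=5)
folds = []
for train_idx, test_idx in cv.split(X, y, groups=site_ids):
    train_sites = set(site_ids[train_idx].tolist())
    test_sites = set(site_ids[test_idx].tolist())
    folds.append((train_sites, test_sites))

for i, (train_sites, test_sites) in enumerate(folds):
    print(f"fold {i}: {len(train_sites)} training sites, {len(test_sites)} test sites")
```

Any estimator can then be fitted on X[train_idx] and scored on X[test_idx] inside the loop.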

THE RESULT

The models generalized successfully to unseen sites and significantly outperformed traditional industry benchmarks. However, repeating the evaluation with a standard random data split produced considerably over-optimistic scores. This showed that strict spatial validation is essential for trustworthy estimates of real-world reliability.
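The gap between the two evaluation schemes can be reproduced on a toy problem. The sketch below is not the project's data or models: it uses synthetic rows in which each hypothetical site has its own offset, and a deliberately naive "model" that memorizes per-site means and falls back to the global mean for sites it has never seen. All names and sizes are assumptions chosen for illustration.

```python
import random
from statistics import mean

random.seed(0)

# Synthetic rows (site_id, target): each site has its own offset plus noise.
rows = []
for site in range(20):
    offset = random.gauss(0, 5)
    rows += [(site, offset + random.gauss(0, 1)) for _ in range(50)]

def fit_site_means(train):
    """Memorize the mean target per site, plus a global fallback mean."""
    by_site = {}
    for site, y in train:
        by_site.setdefault(site, []).append(y)
    site_means = {s: mean(v) for s, v in by_site.items()}
    global_mean = mean(y for _, y in train)
    return site_means, global_mean

def mae(train, test):
    """Mean absolute error of the memorizing model on the test rows."""
    site_means, global_mean = fit_site_means(train)
    return mean(abs(y - site_means.get(s, global_mean)) for s, y in test)

# Random split: rows from every site end up on both sides.
shuffled = rows[:]
random.shuffle(shuffled)
cut = int(0.75 * len(shuffled))
random_mae = mae(shuffled[:cut], shuffled[cut:])

# Spatial split: sites 15-19 are held out entirely.
held_out = set(range(15, 20))
spatial_train = [r for r in rows if r[0] not in held_out]
spatial_test = [r for r in rows if r[0] in held_out]
spatial_mae = mae(spatial_train, spatial_test)

print(f"random-split MAE:  {random_mae:.2f}")
print(f"spatial-split MAE: {spatial_mae:.2f}")
```

The random split rewards memorizing site-specific offsets, so its error looks small. The spatial split scores the same model on sites it has never seen, which is the situation that matters at deployment.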


Inspiration & Context

Evapotranspiration is one of the most difficult variables to measure directly in the field. At the same time, it is a critical variable in analyzing the global water cycle. I saw great potential in machine learning models to fill this gap for locations where measurement is practically impossible. By leveraging the massive FLUXNET dataset, my main question was: despite their success in other fields, could these models actually learn physical patterns that are generalizable to unseen locations?

Global distribution of FLUXNET sites used in my analysis, categorized by IGBP vegetation types and Köppen climate zones.

I also noticed that the scientific community typically trains and tests models on the exact same locations, which causes spatial data leakage. That is why I decided to take three of the most powerful ML regression models and put them to the test.

New Geographical Locations

These are locations whose data is not represented in the training set.
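In code, this definition reduces to a set difference on site identifiers. The helpers below are a minimal sketch; the function names and the placeholder site IDs are hypothetical and do not come from the project or from FLUXNET.

```python
def unseen_sites(train_sites, test_sites):
    """Return the test sites that contribute no data to the training set."""
    return set(test_sites) - set(train_sites)

def assert_no_site_leakage(train_sites, test_sites):
    """Raise if any site appears in both the training and the test split."""
    shared = set(train_sites) & set(test_sites)
    if shared:
        raise ValueError(f"Sites present in both splits: {sorted(shared)}")

# Placeholder site identifiers (assumption, for illustration only).
train = ["SITE-A", "SITE-B", "SITE-C"]
test = ["SITE-D", "SITE-E"]

assert_no_site_leakage(train, test)  # passes silently: no shared sites
new_locations = unseen_sites(train, test)
```

Running such a guard before every evaluation makes an accidental overlap fail loudly instead of silently inflating the scores.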