Personal Build • Data & Machine Learning

Geo-Climate Change Algorithm

A data-prep + regression modeling pipeline built in MATLAB to explore long-term temperature trends across geography and time. Focused on cleaning inconsistent timestamps, preserving informative outliers, and comparing multiple supervised learning models for predictive performance.

Data Cleaning · Visualization · Supervised ML · MATLAB

Focus

Turn messy, real-world climate data into stable features → train + compare regressors.

Overview

This project is a full workflow exercise: take an imperfect public dataset and make it usable for regression modeling. The goal wasn’t “perfect accuracy” — it was to practice the real steps that decide whether ML is meaningful: consistent features, honest handling of anomalies, and clear evaluation.

What I built

A MATLAB pipeline for preprocessing + visualization + supervised regression, then benchmarking multiple learners (tree-based, linear, and kernel methods).

What it demonstrates

Data hygiene, feature design, and model comparison.

Data Prep

The dataset contained inconsistent timestamps (multiple date formats plus missing entries), which forced an explicit decision: normalize every date into a single consistent representation, or engineer a simpler derived feature (e.g., Year) to keep the model stable.
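
A minimal sketch of the simpler "derived Year" option, assuming the raw dates sit in a text column T.dt (a hypothetical name). A four-digit year is recoverable from every common date format, so it can be pulled out directly instead of reconciling each format into a full datetime:

    % Extract a 4-digit year from each raw date string, whatever its format.
    m = regexp(T.dt, '\d{4}', 'match', 'once');   % '' where no year is found
    T.Year = str2double(m);                       % '' parses to NaN
    T = T(~isnan(T.Year), :);                     % drop rows with no usable date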

Key features used

Date/Year (temporal trend), AverageTemperature (target/response), and Latitude (geographic signal). This kept the model interpretable while still predictive.

Outliers: kept on purpose

Instead of deleting extreme points automatically, I preserved them when plausible, since climate data can include true anomalies (events, measurement artifacts, regime shifts).
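
A minimal sketch of the flag-don't-delete approach, using isoutlier's default rule (more than 3 scaled MADs from the median) purely to mark points for review:

    % Flag extremes for inspection instead of removing them.
    flag = isoutlier(T.AverageTemperature);
    T.IsExtreme = flag;     % rows stay in the table, carrying the flag along
    fprintf('%d of %d points flagged as extreme (kept).\n', nnz(flag), height(T));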

One-Hot Encoding

Categorical and mixed-format fields needed to be converted into numeric features for regression. One-hot encoding let the model represent discrete labels without imposing fake ordering.
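
A minimal sketch of the encoding step, assuming a categorical text column Country (a hypothetical name) and dummyvar from the Statistics and Machine Learning Toolbox:

    % One column per category, 0/1 values, no implied ordering.
    c = categorical(T.Country);
    onehot = dummyvar(c);
    labels = matlab.lang.makeValidName(strcat('Country_', categories(c)));
    T = [T, array2table(onehot, 'VariableNames', labels)];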

Interactive Plots

I wrote a MATLAB routine that repeatedly prompts for a city and month (1–12), then plots the selected temperature history. Points above the mean are highlighted to make “unusual” periods visually obvious. This created an easy way to compare cities across regions without manually slicing the dataset.
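
A condensed sketch of that loop; the City and Month columns are assumed to exist from the prep stage, and the loop exits on an empty city name:

    while true
        city = input('City (empty to quit): ', 's');
        if isempty(city), break; end
        mo = input('Month (1-12): ');
        rows = strcmpi(T.City, city) & T.Month == mo;
        yrs  = T.Year(rows);
        temp = T.AverageTemperature(rows);
        mu   = mean(temp, 'omitnan');
        figure; plot(yrs, temp, '-o'); hold on
        scatter(yrs(temp > mu), temp(temp > mu), 'filled')   % above-mean points
        yline(mu, '--', 'mean');
        title(sprintf('%s, month %d', city, mo))
        xlabel('Year'); ylabel('Average temperature')
    end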

Why this mattered

Visualization caught issues early (bad slices, weird distributions, missing data), and made model results easier to sanity-check.

UI goal

Minimal friction: type a city + month → immediately see the story in the data.

Model Training

I treated this as a regression problem (predicting a continuous temperature value). I compared multiple learners in MATLAB’s Regression Learner and prioritized models that performed well while still being explainable.
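
A minimal sketch of the same comparison done programmatically rather than in the app, assuming a feature table with the columns above; the 5-fold setup and specific learners are illustrative, not the exact app configuration:

    X = T(:, {'Year', 'Latitude'});
    y = T.AverageTemperature;
    cvTree = fitrtree(X, y, 'KFold', 5);                              % tree-based
    cvSvm  = fitrsvm(X, y, 'KernelFunction', 'gaussian', 'KFold', 5); % kernel
    cvLin  = fitrlinear(table2array(X), y, 'KFold', 5);               % linear
    fprintf('tree    CV RMSE %.3f\n', sqrt(kfoldLoss(cvTree)));
    fprintf('svm     CV RMSE %.3f\n', sqrt(kfoldLoss(cvSvm)));
    fprintf('linear  CV RMSE %.3f\n', sqrt(kfoldLoss(cvLin)));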

What worked best

Tree-based models performed strongly, suggesting the relationship between features and temperature isn’t purely linear.

What I looked for

Error metrics, residual behavior, and consistency across slices — not just one “best score.”
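
A sketch of the slice check, assuming mdl is a fitted regressor on the same Year/Latitude table (e.g., fitrtree(X, y) refit without 'KFold'); the latitude band edges are illustrative:

    res   = T.AverageTemperature - predict(mdl, T(:, {'Year', 'Latitude'}));
    edges = [-90 -30 0 30 90];              % illustrative latitude bands
    band  = discretize(T.Latitude, edges);
    for b = 1:numel(edges)-1
        r = res(band == b);
        fprintf('lat %4d..%4d   RMSE %.2f   bias %+.2f\n', ...
            edges(b), edges(b+1), sqrt(mean(r.^2, 'omitnan')), mean(r, 'omitnan'));
    end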

Results

The most consistent models captured nonlinear structure in the data and produced more stable residuals. Beyond the leaderboard, the bigger takeaway was how strongly performance depends on preprocessing quality and feature decisions.

Takeaway

Better features beat “more complex” models when the dataset is noisy or inconsistent.

Next improvement

Add richer geography features (region bins, hemisphere, elevation proxy) and validate using time-based splits (train on early years → test on later years).
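
A sketch of that time-based split, with an illustrative cutoff year:

    cut  = 1990;                            % illustrative split year
    cols  = {'Year', 'Latitude', 'AverageTemperature'};
    train = T(T.Year <  cut, cols);
    test  = T(T.Year >= cut, cols);
    mdl  = fitrtree(train, 'AverageTemperature');   % fit on early years only
    pred = predict(mdl, test);
    fprintf('Out-of-period RMSE (>= %d): %.2f\n', cut, ...
        sqrt(mean((test.AverageTemperature - pred).^2, 'omitnan')));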

Reflection

Challenges

• Long training times for heavier learners limited how many sweeps I could run.

• Feature formatting in MATLAB apps required careful conversions (categorical ↔ numeric ↔ cell).

• Dataset inconsistencies forced explicit tradeoffs instead of “auto-cleaning.”

What I learned

• How to structure a large preprocessing workflow that stays debuggable.

• How to compare regressors beyond one score (residuals, stability, sanity checks).

• Transferable habits for data work (feature design, slicing strategy, validation logic).