Overview
The goal of this project was to classify cat vs. dog images using multiple supervised learning methods and evaluate their performance on a shared dataset. I implemented and compared four classifiers: Closest Average (CA), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Nearest Neighbor (NN). I also explored Principal Component Analysis (PCA) for dimensionality reduction and studied how the number of retained components affects error rates.
What I built
• A consistent pipeline to train/test multiple classifiers on the same splits
• PCA feature compression + analysis of error vs. the number of retained dimensions \(k\)
• Error-rate comparisons to understand bias/variance trade-offs
Why it matters
Classical methods still matter in real systems: they’re interpretable, efficient, and great for building intuition about data geometry and representation.
Dataset
The dataset consists of labeled images (cat or dog). Images are vectorized into feature vectors for classical ML methods. PCA is applied using training statistics and then reused on the test set to avoid leakage.
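For concreteness, a loading sketch along these lines could work. The directory layout (data/cat, data/dog), PNG format, 64×64 grayscale resizing, and the helper name are all assumptions for illustration, not the project's actual settings.

```python
# Hypothetical loading sketch -- directory layout and image size are assumed.
from pathlib import Path

import numpy as np
from PIL import Image

def load_vectorized(root: str, size=(64, 64)):
    """Load images, convert to grayscale, resize, and flatten each
    one into a feature vector scaled to [0, 1]."""
    X, y = [], []
    for label, cls in enumerate(["cat", "dog"]):  # 0 = cat, 1 = dog
        for path in sorted(Path(root, cls).glob("*.png")):
            img = Image.open(path).convert("L").resize(size)
            X.append(np.asarray(img, dtype=np.float64).ravel() / 255.0)
            y.append(label)
    return np.stack(X), np.array(y)
```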
Supervised Learning
I implemented multiple supervised classifiers using the same train/test splits to compare performance fairly. Results were evaluated using training vs testing error to measure generalization and sensitivity to feature dimension.
Classification
Supervised learning is implemented through classification techniques that rely on labeled data. Closest Average (CA) assigns labels based on distance to class mean vectors. Linear Discriminant Analysis (LDA) models both classes as Gaussians with shared covariance, while Quadratic Discriminant Analysis (QDA) allows class-specific covariances. Nearest Neighbor (NN) predicts labels using the class of the closest labeled training example.
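To make the geometry concrete, here is a minimal NumPy sketch of the two distance-based classifiers. The function names are mine and this is an illustration of the technique, not the project's actual code; LDA and QDA could sit behind the same predict-style interface, e.g. via scikit-learn's LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis.

```python
import numpy as np

def closest_average_predict(X_train, y_train, X_test):
    """Closest Average: assign each test point to the nearest class mean."""
    classes = np.unique(y_train)
    means = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    # Distance from every test point to every class mean; pick the closest.
    dists = np.linalg.norm(X_test[:, None, :] - means[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

def nearest_neighbor_predict(X_train, y_train, X_test):
    """1-NN: copy the label of the closest training example.
    Brute-force broadcasting -- fine for small datasets like this one."""
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    return y_train[dists.argmin(axis=1)]
```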
Evaluation
Each classifier is evaluated using training and testing error rates. Experiments are repeated across PCA dimensions \(k\) to study trade-offs between representation size, model assumptions, and accuracy.
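The metric itself is simple; a minimal sketch (the helper name error_rate is mine, not the project's):

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Fraction of misclassified examples; computed separately on the
    training split and the test split to measure generalization."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))
```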
Unsupervised Learning
The unsupervised component is Principal Component Analysis (PCA) used for dimensionality reduction. PCA identifies directions of maximum variance and projects data into a lower-dimensional subspace. The reduced representation is then fed into the supervised classifiers.
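For reference, the standard formulation (generic notation, not necessarily the symbols used in the project): with training mean \(\mu\) and sample covariance \(\Sigma\), the reduced representation of a vector \(x\) is its projection onto the top \(k\) eigenvectors of \(\Sigma\),

\[
\Sigma = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)(x_i - \mu)^{\top},
\qquad
z = W_k^{\top}(x - \mu),
\]

where the columns of \(W_k\) are the eigenvectors of \(\Sigma\) with the \(k\) largest eigenvalues.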
Why PCA helps
• Reduces dimensionality → faster training
• Removes noise directions
• Lets you study accuracy vs representation size
Key rule
Fit PCA on training data only, then apply the same transform to test data (no leakage).
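As an illustration of this rule, a minimal SVD-based PCA sketch. The helper names and the choice k=50 are mine, and X_train / X_test are assumed to be the vectorized splits from earlier.

```python
import numpy as np

def pca_fit(X_train, k):
    """Fit PCA on TRAINING data only: mean + top-k principal directions."""
    mu = X_train.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    return mu, Vt[:k].T                      # shapes (d,), (d, k)

def pca_transform(X, mu, W):
    """Apply the SAME training-set transform to any split (no leakage)."""
    return (X - mu) @ W

mu, W = pca_fit(X_train, k=50)               # statistics from training data only
Z_train = pca_transform(X_train, mu, W)
Z_test = pca_transform(X_test, mu, W)        # reuse the transform, never refit
```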
Implementation
The implementation emphasizes consistent preprocessing and fair comparisons. Training statistics (means, covariances, PCA basis) are computed on the training set and reused for the test set, and each classifier is evaluated under the same conditions; a sketch of the full pipeline follows the list below.
Pipeline
• Load + vectorize images
• Split train/test
• Optional PCA(k)
• Train classifier
• Compute error rates
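One plausible shape for that pipeline, reusing the hypothetical helpers sketched in the earlier sections; the split ratio, the CA classifier, and the \(k\) values are illustrative, not the project's actual configuration.

```python
from sklearn.model_selection import train_test_split

# X, y as produced by a loader like load_vectorized above (assumed).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

for k in (10, 25, 50, 100):                  # illustrative k values
    mu, W = pca_fit(X_train, k)              # PCA fit on training data only
    Z_train = pca_transform(X_train, mu, W)
    Z_test = pca_transform(X_test, mu, W)
    train_err = error_rate(y_train, closest_average_predict(Z_train, y_train, Z_train))
    test_err = error_rate(y_test, closest_average_predict(Z_train, y_train, Z_test))
    print(f"k={k:4d}  train={train_err:.3f}  test={test_err:.3f}")
```

Holding everything fixed except \(k\) is what makes the error-rate curves comparable across classifiers.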
What I emphasized
• Correctness + comparability
• Clear metrics (train vs test error)
• Controlled experiments (vary only k)
Results
Results were analyzed by comparing error rates across different classifiers and PCA dimensions. This revealed practical trade-offs: some models improve with higher dimensionality, while others overfit or become unstable. In particular, QDA estimates a separate covariance matrix per class, so its estimates degrade (and can become singular) as \(k\) approaches the number of training samples per class, whereas LDA's pooled covariance is more robust at the same \(k\).
Takeaways
• Representation matters as much as the classifier
• PCA can improve speed and sometimes generalization
• LDA/QDA behavior depends on covariance quality
What I’d add next
• Confusion matrices + ROC-style analysis
• Regularized covariance (shrinkage) for QDA
• Hyperparameter search for NN (k-NN)
Reflection
Challenges
• Feature selection affecting results
• Long training / evaluation times at high dimensions
• Environment / package friction
What I learned
• How classical classifiers behave geometrically
• PCA as a practical tool, not just theory
• How to run fair comparisons across models
by Justin Yu