The goal of this final class project (EC418: Introduction to Reinforcement Learning) is to create an agent that drives the Mario Kart-style game Pytux by implementing reinforcement learning algorithms, training neural networks to predict aim points from image frames, and improving performance to minimize completion times. Potential enhancements included refining the controller, using better planning methods, incorporating reinforcement learning, and leveraging additional features such as obstacle prediction and multiple aim points.
I started with reinforcement learning algorithms, beginning with Q-Learning:
\[
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]
\]
This Q-learning implementation uses a table-based approach to map states (rounded aim point and velocity)
to actions (e.g., steering, braking). The reward function combines progress (aim point alignment), velocity
matching to a target, and penalties for collisions, encouraging the agent to learn efficient and safe driving.
Exploration and exploitation are balanced using an epsilon-greedy strategy with decay, and Q-values are
updated with a weighted combination of the immediate reward and the future state value. The approach requires
an excessive number of rollouts, and the agent still struggles to converge.
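A minimal sketch of this tabular setup is shown below; the discretization granularity, action set, and hyperparameters are illustrative assumptions rather than the exact values from my implementation.
\begin{verbatim}
import random
from collections import defaultdict

ACTIONS = [-1.0, -0.5, 0.0, 0.5, 1.0]    # assumed discrete steering commands

def discretize(aim_point, velocity):
    # Round the continuous observation into a small table key
    return (round(aim_point[0], 1), round(velocity / 5.0))

class TabularQ:
    def __init__(self, alpha=0.1, gamma=0.95, eps=1.0, eps_decay=0.995):
        self.Q = defaultdict(float)              # (state, action) -> value
        self.alpha, self.gamma = alpha, gamma
        self.eps, self.eps_decay = eps, eps_decay

    def act(self, state):
        # Epsilon-greedy: explore with probability eps, otherwise exploit
        if random.random() < self.eps:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.Q[(state, a)])

    def update(self, s, a, r, s_next):
        # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        best_next = max(self.Q[(s_next, a2)] for a2 in ACTIONS)
        self.Q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.Q[(s, a)])
        self.eps *= self.eps_decay               # decay exploration over time
\end{verbatim}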
Since tabular Q-Learning performed poorly, I next implemented Temporal Difference Learning with Linear Approximation:
\[
V_{t+1}(s) = \sum_{a} \pi(a|s) r(s, a) + \gamma \sum_{a} \pi(a|s) \sum_{s'} P(s'|a,s) V_t(s')
\]
\[
\theta_{t+1} = \theta_t + \alpha_t \left( [r_t + \gamma \theta_t^T x(s_{t+1})] - \theta_t^T x(s_t) \right) x(s_t)
\]
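As a concrete illustration, a TD(0) update with a linear value function could look like the sketch below; the feature vector and step sizes are assumptions for illustration, not the exact ones I used.
\begin{verbatim}
import numpy as np

def value_features(aim_point, velocity):
    # Hand-crafted feature vector x(s) for the linear value function
    return np.array([1.0, aim_point[0], abs(aim_point[0]), velocity / 25.0])

theta = np.zeros(4)
alpha, gamma = 0.01, 0.95

def td0_update(state, reward, next_state):
    # theta <- theta + alpha * (r + gamma * theta^T x(s') - theta^T x(s)) * x(s)
    global theta
    x, x_next = value_features(*state), value_features(*next_state)
    td_error = reward + gamma * theta @ x_next - theta @ x
    theta += alpha * td_error * x
\end{verbatim}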
Reintroducing Q-Learning, this time with linear approximation, produced better results after 50+ rollouts. However, the agent
still struggled to learn an optimal policy due to the high-dimensional state space and a poor choice of features, which consisted
of velocity and acceleration.
\[
\theta_{t+1} = \theta_t + \alpha_t \left( r_t + \gamma \max_{a'} \theta_t^T x(s_{t+1}, a') - \theta_t^T x(s_t, a_t) \right) x(s_t, a_t)
\]
Following this algorithm, I used features such as the aim point
and velocity to represent states, which are evaluated through weight vectors (\( \theta \)) for actions
such as steering, accelerating, and braking. The weights are updated using temporal difference learning, the agent
balances exploration and exploitation with an epsilon-greedy policy, and the reward function encourages smooth
driving, progress, and speed maintenance.
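A minimal sketch of this linear Q-function, assuming a hand-crafted feature vector built from the aim point, velocity, and action (the feature choices and hyperparameters below are illustrative):
\begin{verbatim}
import numpy as np

ACTIONS = [-1.0, 0.0, 1.0]               # assumed discrete steering actions

def q_features(aim_point, velocity, action):
    # Hand-crafted feature vector x(s, a) for one state-action pair
    return np.array([1.0, aim_point[0], velocity / 25.0,
                     action, action * aim_point[0]])

class LinearQ:
    def __init__(self, n_features=5, alpha=0.01, gamma=0.95, eps=0.1):
        self.theta = np.zeros(n_features)
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def q(self, aim_point, velocity, action):
        return self.theta @ q_features(aim_point, velocity, action)

    def act(self, aim_point, velocity):
        if np.random.rand() < self.eps:  # epsilon-greedy exploration
            return float(np.random.choice(ACTIONS))
        return max(ACTIONS, key=lambda a: self.q(aim_point, velocity, a))

    def update(self, s, a, r, s_next):
        # theta <- theta + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)) * x(s,a)
        (aim, vel), (aim_n, vel_n) = s, s_next
        best_next = max(self.q(aim_n, vel_n, a2) for a2 in ACTIONS)
        td_error = r + self.gamma * best_next - self.q(aim, vel, a)
        self.theta += self.alpha * td_error * q_features(aim, vel, a)
\end{verbatim}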
Deep Q-Learning still struggled, producing results similar to TD-Learning with Linear Approximation, but it was somewhat better because it uses a neural network to approximate the \( Q \)-function, enabling it to handle complex, high-dimensional state-action spaces more efficiently.
\[
\theta_{t+1} = \theta_t + \alpha_t \left( r + \gamma \max_{a'} Q_{\theta_t}(s', a') - Q_{\theta_t}(s, a) \right) \nabla_{\theta} Q_{\theta_t}(s, a)
\]
In theory, by leveraging gradient descent it minimizes the loss between the predicted \( Q \)-value and the target \( y_{\text{target}} = r + \gamma \max_{a'} Q_\theta(s', a') \), adjusting the parameters \( \theta \) in the direction of the gradient to improve the approximation iteratively. This combination allows the model to generalize across states, making it more effective than tabular Q-learning.
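A minimal PyTorch sketch of this update, assuming a small fully connected Q-network over a low-dimensional state (aim point and velocity) and a discrete action set; the architecture, hyperparameters, and the helper name dqn_step are illustrative assumptions rather than my exact code:
\begin{verbatim}
import torch
import torch.nn as nn

N_ACTIONS, GAMMA = 5, 0.95               # assumed action count and discount

class QNet(nn.Module):
    def __init__(self, state_dim=3, n_actions=N_ACTIONS):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, s):
        return self.net(s)               # Q(s, .) for every action

q_net = QNet()
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_step(s, a, r, s_next, done):
    # One gradient step on the squared TD error for a batch of transitions
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # y_target = r + gamma * max_a' Q(s', a'); no future value at episode end
        y_target = r + GAMMA * q_net(s_next).max(dim=1).values * (1 - done)
    loss = nn.functional.mse_loss(q_sa, y_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
\end{verbatim}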
Modifying the planner's CNN hyperparameters showed that smaller kernel sizes improved sharp-turn handling, which significantly lowered the number of frames needed to finish. Using ReLU activation functions led to greater consistency, but the agent still struggled on very sudden turns on certain tracks. Adding batch normalization and dropout resolved shortcut attempts but did not meaningfully reduce completion times; the number of time steps remained similar to TD-Learning. Overall, these adjustments significantly improved the agent's performance, whereas tabular Q-Learning failed badly even after 1000+ rollouts and required a higher max-frame cap as well.
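To make these modifications concrete, below is a rough sketch of the kind of CNN planner I experimented with; the exact layer widths, input resolution, and dropout rate are assumptions rather than the assignment's original architecture.
\begin{verbatim}
import torch
import torch.nn as nn

class Planner(nn.Module):
    # Predicts a 2-D aim point from an input image frame
    def __init__(self, dropout=0.2):
        super().__init__()
        def block(c_in, c_out):
            # Smaller 3x3 kernels + BatchNorm + ReLU + Dropout
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(),
                nn.Dropout2d(dropout),
            )
        self.features = nn.Sequential(
            block(3, 16), block(16, 32), block(32, 64), block(64, 128),
        )
        self.head = nn.Linear(128, 2)    # (x, y) aim point

    def forward(self, img):
        z = self.features(img)           # (B, 128, H', W')
        z = z.mean(dim=(2, 3))           # global average pooling
        return self.head(z)
\end{verbatim}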
Policy gradient methods optimize the policy directly by adjusting the parameters \(\theta\) in the direction of the gradient of the expected reward. Another reinforcement learning algorithm
that I would like to try is Actor-Critic, which uses two neural networks:
\[
y_{\text{target}} = r + \gamma Q_{\theta_Q}(s', \mu_{\theta_\mu}(s'))
\]
\[
\theta_{Q}(t + 1) = \theta_{Q}(t) + \alpha(t) \left( y_{\text{target}} - Q_{\theta_{Q}}(s, \mu_{\theta_{\mu}}(s)) \right) \nabla_{\theta_{Q}} Q_{\theta_{Q}}(s, \mu_{\theta_{\mu}}(s))
\]
Using two networks, the actor maps states (e.g., kart velocity, track curvature, steering angle) to actions (e.g., steering, acceleration, drift), while the critic evaluates the value of the action the actor selects in each state. Thus, the actor learns and optimizes its policy \( \pi_{\theta}(s, a) \) via policy gradients, and the critic minimizes the temporal difference error between the predicted value and the target value.
\[
\theta_{\mu}(t + 1) = \theta_{\mu}(t) + \beta(t) \nabla_{\theta_{\mu}} Q_{\theta_{Q}}(s, \mu_{\theta_{\mu}}(s))
\]
\[
\nabla_{\theta_{\mu}} Q_{\theta_{Q}}(s, \mu_{\theta_{\mu}}(s)) = \nabla_{a} Q_{\theta_{Q}}(s, a) \nabla_{\theta_{\mu}} \mu_{\theta_{\mu}}(s)
\]
Both networks could share similar input features, although TD-Learning with Linear Approximation and Deep Q-Learning have
proven to perform poorly with the currently selected features. Another possibility is to combine them with a CNN, which has shown to be more
effective, and to extract features such as track geometry and sudden turns through labeling.
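Although I did not implement it, a rough PyTorch sketch of the actor and critic updates described above might look like the following; the state and action dimensions, network sizes, and learning rates are assumptions for illustration.
\begin{verbatim}
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, GAMMA = 4, 2, 0.99     # assumed dimensions and discount

actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, ACTION_DIM), nn.Tanh())   # mu_theta(s)
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))                      # Q_theta(s, a)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def actor_critic_step(s, a, r, s_next, done):
    # Critic update: minimize (y_target - Q(s, a))^2
    with torch.no_grad():
        a_next = actor(s_next)
        y_target = r + GAMMA * (1 - done) * \
            critic(torch.cat([s_next, a_next], dim=1)).squeeze(1)
    q_sa = critic(torch.cat([s, a], dim=1)).squeeze(1)
    critic_loss = nn.functional.mse_loss(q_sa, y_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: ascend the gradient of Q(s, mu(s)) w.r.t. the actor parameters
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
\end{verbatim}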
Finally, I experimented with a general PID controller, which manages the steering and velocity of the kart through a feedback-loop mechanism:
\[
u(t) = K_p e(t) + K_i \int_{0}^{t} e(\tau) d\tau + K_d \frac{d}{dt} e(t)
\]
For this specific application, the steering PID controller minimizes the error between the kart's aim point and the track center, while the speed
PID controller adjusts acceleration or braking to maintain a target velocity. The control() function uses these controllers to compute steering,
acceleration, braking, and drifting actions, with mechanisms to reset the integral terms and rescue the kart when it is stuck or off-track. This
controller significantly outperformed all of the reinforcement learning algorithms I tried with my chosen features, completing tracks in the
fewest frames, with only minor struggles on sharp continuous turns at times.
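A simplified sketch of the two PID loops is shown below; the gains, target velocity, and thresholds are placeholder values, and the real control() function also includes the reset and rescue logic described above.
\begin{verbatim}
class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev_error = 0.0, None

    def reset(self):
        self.integral, self.prev_error = 0.0, None

    def step(self, error, dt=1.0):
        # u(t) = Kp*e + Ki*integral(e) + Kd*de/dt
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

steer_pid = PID(kp=4.0, ki=0.05, kd=0.5)   # placeholder gains
speed_pid = PID(kp=1.0, ki=0.01, kd=0.1)
TARGET_VELOCITY = 20.0                     # placeholder target speed

def control(aim_point, current_vel):
    # Map the planner's aim point and current velocity to driving actions
    steer = max(-1.0, min(1.0, steer_pid.step(aim_point[0])))
    throttle = speed_pid.step(TARGET_VELOCITY - current_vel)
    return {
        "steer": steer,
        "acceleration": max(0.0, min(1.0, throttle)),
        "brake": throttle < -0.5,
        "drift": abs(aim_point[0]) > 0.4,  # drift on sharp turns
    }
\end{verbatim}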
The main challenges I faced were:
• Poor results with the selected features
• Excessively long training times
• Limited prior knowledge of neural networks
• Incompatibilities between the environment and Anaconda packages
Key takeaways from the project include:
• Various reinforcement learning algorithms and their implementations
• More practice with Python, including libraries such as TensorFlow and PyTorch
• An introduction to neural networks and their applications