Justin Yu
PyTux Neural Network
Overview

The goal of this final class project (EC418 Intro to Reinforcement Learning) is to create an agent capable of driving the Mario Kart-style game PyTux by implementing reinforcement learning algorithms and training neural networks to predict aim points from frames, with the aim of minimizing completion times. Potential enhancements included refining the controller, using better planning methods, incorporating reinforcement learning, or leveraging additional features like obstacle prediction and multiple aim points.

Reinforcement Learning

I started with reinforcement learning algorithms, beginning with Q-Learning:
\[ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] \] This Q-learning implementation uses a table-based approach to map states (rounded aim point and velocity) to actions (e.g., steering, braking). The reward function combines progress (aim point alignment), velocity matching to a target, and penalties for collisions, encouraging the agent to learn efficient and safe driving. Exploration and exploitation are balanced using an epsilon-greedy strategy with decay, and Q-values are updated with a weighted combination of the immediate reward and the future state value. Excessive rollouts are required, and the agent still struggles to converge.
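
A minimal sketch of this tabular update, assuming the state is discretized by rounding the aim point and velocity; the action set, discretization, and hyperparameters here are illustrative placeholders rather than the project's actual values:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.2          # illustrative hyperparameters
ACTIONS = [(-1, 0), (0, 0), (1, 0), (0, 1)]     # simplified (steer, brake) pairs

Q = defaultdict(float)                           # Q[(state, action)] -> value

def discretize(aim_point, velocity):
    """Round the continuous observation into a coarse table key."""
    return (round(aim_point[0], 1), round(aim_point[1], 1), round(velocity))

def choose_action(state):
    """Epsilon-greedy action selection over the table."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```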

Q-Learning performed poorly, so next I decided to implement Temporal Difference Learning with Linear Approximation:
\[ V_{t+1}(s) = \sum_{a} \pi(a|s)r(s, a) + \gamma \sum_{a} \pi(a|s) \sum_{s'} P(s'|a,s)V_t(s') \] \[ \theta_{t+1} = \theta_t + \alpha_t \left( [r_t + \gamma \theta_t^T x(s_{t+1})] - \theta_t^T x(s_t) \right) x(s_t) \] I then reintroduced Q-Learning with Linear Approximation, which produced better results after 50+ rollouts. However, the agent struggled to learn an optimal policy due to the high-dimensional state space and a poor choice of features (velocity and acceleration). \[ \theta_{t+1} = \theta_t + \alpha_t \left( r_t + \gamma \max_{a'} \theta_t^T x(s_{t+1}, a') - \theta_t^T x(s_t, a_t) \right) x(s_t, a_t) \] Following this algorithm, I used features like aim point and velocity to extract states, which are evaluated through weight vectors (\( \theta \)) for actions such as steering, accelerating, and braking. The weights are updated using temporal difference learning, and the agent balances exploration and exploitation with an epsilon-greedy policy, while the reward function encourages smooth driving, progress, and speed maintenance.
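
A sketch of the linear-approximation Q-update above, assuming a hypothetical state-action feature vector built from the aim point, velocity, and the chosen action; the feature layout and learning rate are assumptions for illustration:

```python
import numpy as np

ALPHA, GAMMA = 0.01, 0.95
ACTIONS = [(-1, 0), (0, 0), (1, 0), (0, 1)]      # illustrative (steer, brake) set

def features(state, action):
    """Hypothetical state-action features: aim point, velocity, action, bias."""
    aim_x, aim_y, vel = state
    steer, brake = action
    return np.array([aim_x, aim_y, vel, steer, brake, 1.0])

theta = np.zeros(6)                              # weight vector

def q_value(state, action):
    return theta @ features(state, action)

def td_update(state, action, reward, next_state):
    """theta <- theta + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)) * x(s,a)."""
    global theta
    target = reward + GAMMA * max(q_value(next_state, a) for a in ACTIONS)
    td_error = target - q_value(state, action)
    theta += ALPHA * td_error * features(state, action)
```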

Neural Networks

Deep Q-Learning still struggled, producing results similar to TD-Learning with Linear Approximation, but it fared somewhat better because it uses a neural network to approximate the \( Q \)-function, enabling it to handle complex, high-dimensional state-action spaces efficiently. \[ \theta_{t+1} = \theta_t + \alpha_t \left( r + \gamma \max_{a'} Q_{\theta_t}(s', a') - Q_{\theta_t}(s, a) \right) \nabla_{\theta} Q_{\theta_t}(s, a) \] In theory, by leveraging gradient descent it minimizes the loss between the predicted \( Q \)-value and the target \( y_{\text{target}} = r + \gamma \max_{a'} Q_\theta(s', a') \), adjusting the parameters \( \theta \) in the direction of the gradient to improve the approximation iteratively. This combination allows the model to generalize well across states, making it more effective than traditional Q-learning.
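
A minimal sketch of one deep Q-learning gradient step in PyTorch; the network size, state dimension, and discrete action count are assumptions, not the project's exact architecture:

```python
import torch
import torch.nn as nn

GAMMA = 0.95
STATE_DIM, N_ACTIONS = 4, 4                      # assumed: (aim_x, aim_y, velocity, ...) and 4 discrete actions

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_step(state, action, reward, next_state):
    """One gradient step on (Q(s,a) - [r + gamma * max_a' Q(s',a')])^2."""
    q_sa = q_net(state)[action]
    with torch.no_grad():
        target = reward + GAMMA * q_net(next_state).max()
    loss = (q_sa - target).pow(2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Here state and next_state would be 1-D float tensors of length STATE_DIM and action an integer index; a replay buffer and target network would normally be added on top of this.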

Modifying the planner's CNN parameters showed that smaller kernel sizes improved sharp-turn handling, which significantly lowered the frame count. Using ReLU activation functions led to greater consistency, but the agent still struggled on very sudden turns on certain tracks. Adding batch normalization and dropout eliminated shortcut attempts but did not meaningfully reduce completion times; time steps were similar to TD-Learning. Overall, these adjustments significantly improved the agent's performance, whereas Q-Learning failed badly even after 1000+ rollouts and required a higher maximum frame cap as well.
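
A sketch of the kind of CNN planner these adjustments refer to, with small 3x3 kernels, ReLU, batch normalization, and dropout; the layer sizes, input resolution, and the assumption that it regresses a 2-D aim point from an RGB frame are illustrative, not the project's exact network:

```python
import torch
import torch.nn as nn

class Planner(nn.Module):
    """Predicts a 2-D aim point from an RGB frame (layer sizes are illustrative)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),   # smaller 3x3 kernels
            nn.BatchNorm2d(16), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Dropout(0.2), nn.Linear(64, 2),                      # (x, y) aim point
        )

    def forward(self, img):
        return self.head(self.features(img))

# Example: a batch of 96x128 RGB frames -> a batch of aim points.
aim = Planner()(torch.randn(8, 3, 96, 128))
```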

[Figure: CNN planner architecture]

Policy gradient methods optimize the policy directly by adjusting the parameters \(\theta\) in the direction of the gradient of the expected reward. Another reinforcement learning algorithm that I would like to try is Actor-Critic, which uses two neural networks:
\[ y_{\text{target}} = r + \gamma Q_{\theta_Q}(s', \mu_{\theta_\mu}(s')) \] \[ \theta_{Q}(t + 1) = \theta_{Q}(t) + \alpha(t) \left( y_{\text{target}} - Q_{\theta_{Q}}(s, \mu_{\theta_{\mu}}(s)) \right) \nabla_{\theta_{Q}} Q_{\theta_{Q}}(s, \mu_{\theta_{\mu}}(s)) \] Of the two networks, the actor maps states (e.g., kart velocity, track curvature, steer angle) to actions (e.g., steering, acceleration, drift), while the critic evaluates the value of the actor's actions in each state. Thus, the actor learns and optimizes its policy \( \mu_{\theta_{\mu}}(s) \) via policy gradients, and the critic minimizes the temporal difference error between its predicted value and the target value. \[ \theta_{\mu}(t + 1) = \theta_{\mu}(t) + \beta(t) \nabla_{\theta_{\mu}} Q_{\theta_{Q}}(s, \mu_{\theta_{\mu}}(s)) \] \[ \nabla_{\theta_{\mu}} Q_{\theta_{Q}}(s, \mu_{\theta_{\mu}}(s)) = \nabla_{a} Q_{\theta_{Q}}(s, a) \nabla_{\theta_{\mu}} \mu_{\theta_{\mu}}(s) \] The two networks can share similar features, although TD-Learning with Linear Approximation and Deep Q-Learning have performed poorly under the currently selected features. Another possibility is to use a CNN, which has proven more effective, and try to extract features such as track geometry and upcoming turns through labeling.
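
Since this is a proposed direction rather than something implemented, here is only a brief sketch of the deterministic actor-critic updates above in PyTorch; the state and action dimensions, network sizes, and learning rates are all assumptions:

```python
import torch
import torch.nn as nn

GAMMA, STATE_DIM, ACTION_DIM = 0.95, 4, 3        # assumed: state (velocity, curvature, ...), action (steer, accel, drift)

actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, ACTION_DIM), nn.Tanh())
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ac_step(s, a, r, s_next):
    """One critic step on the TD error and one actor step along the critic's gradient."""
    # Critic: minimize (Q(s,a) - [r + gamma * Q(s', mu(s'))])^2.
    with torch.no_grad():
        target = r + GAMMA * critic(torch.cat([s_next, actor(s_next)]))
    critic_loss = (critic(torch.cat([s, a])) - target).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: maximize Q(s, mu(s)), i.e. minimize its negative.
    actor_loss = -critic(torch.cat([s, actor(s)])).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```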

PID Controller

Finally, I experimented with a general PID controller, which manages the steering and velocity of the kart through a feedback-loop mechanism:
\[ u(t) = K_p e(t) + K_i \int_{0}^{t} e(\tau) d\tau + K_d \frac{d}{dt} e(t) \] For this specific application, the steering PID controller minimizes the error between the kart's aim point and the track center, while the speed PID controller adjusts acceleration or braking to maintain a target velocity. The control() function uses these controllers to compute steering, acceleration, braking, and drifting actions, with mechanisms to reset the integrals and rescue the kart when it is stuck or off-track. This controller significantly outperformed all of the reinforcement learning algorithms I tried with my chosen features, and completed tracks in the fewest frames, with only minor struggles on sharp continuous turns at times.
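
A minimal sketch of the discrete PID law above applied to steering and speed; the gains, target velocity, clamping, and the simplified control() signature are illustrative assumptions rather than the project's actual tuning or implementation:

```python
class PID:
    """Discrete PID controller: u = Kp*e + Ki*sum(e*dt) + Kd*de/dt."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def reset(self):
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error, dt=1.0):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Illustrative use: steer toward the aim point and hold a target speed.
steer_pid = PID(kp=2.0, ki=0.0, kd=0.3)          # placeholder gains, not the tuned values
speed_pid = PID(kp=1.0, ki=0.05, kd=0.0)

def control(aim_point, current_vel, target_vel=20.0):
    steer = max(-1.0, min(1.0, steer_pid.step(aim_point[0])))   # error = horizontal aim offset
    accel = max(0.0, min(1.0, speed_pid.step(target_vel - current_vel)))
    brake = current_vel > target_vel * 1.2
    return steer, accel, brake
```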

Challenges

• Poor results with the selected features

• Excessively long training times

• Limited prior knowledge of neural networks

• Incompatibilities between the environment and Anaconda packages

Lessons Learned

• Various reinforcement learning algorithms and their implementations

• More practice with Python, including libraries like TensorFlow and PyTorch

• Introduction to neural networks and their applications
