Abstract
Deep Reinforcement Learning (RL) shows promising results for control problems with continuous action spaces. One drawback of Deep RL is that it can be very computationally intensive, which is a particular concern when fielding Deep RL applications on the compute- and power-constrained edge hardware typically carried onboard autonomous vehicle platforms. Another drawback is that Deep RL agents can learn control strategies that exhibit high-frequency, high-amplitude oscillations, which can degrade performance and damage real-world systems. The first part of this thesis focuses on improving the computational efficiency of the Deep Deterministic Policy Gradient (DDPG) algorithm using mixed numerical precision methods. Mixed-precision methods are an active research area aimed at improving the computational efficiency of Deep Learning. While mixed-precision approaches are well understood for supervised learning tasks, they remain relatively unexplored for Deep RL. We aim to fill this gap by presenting a method that improves the computational efficiency of the DDPG algorithm using mixed numerical precision and loss scaling. The second part of this thesis presents a numerical study of how different neural network architectures, all commonly used in the Deep RL literature, affect oscillations in the control signals output by DDPG agents on a complex continuous control problem. This thesis will first present numerical cases comparing the performance and computational improvements of DDPG agents trained with mixed precision to those trained with single precision, in the context of continuous control of a complex Autonomous Undersea Vehicle model at various levels of control system and Deep RL model complexity.
Then, a numerical study will examine how different DDPG actor and critic neural network architectures affect action selection, with the goal of minimizing undesirable oscillations in the control signals output by DDPG agents.