Abstract
Connected and Autonomous Vehicles (CAVs), a.k.a. driverless cars, promise safer and more efficient surface transportation. Yet CAV technology introduces new security risks beyond standard vehicle vulnerabilities. Machine Learning (ML) promises to secure CAVs by using telemetry data to validate legitimacy among CAVs as a method of Misbehavior Detection. However, most ML models rest on the mistaken assumption of a benign environment, in which training and testing data share the same distribution; this makes them vulnerable to adversarial perturbations. Furthermore, limited knowledge of CAV behavior opens new attack surfaces through which adversaries can invalidate ML models. This thesis examines the attacker's mindset for fooling an ML system for Misbehavior Detection in CAVs, so that CAV defense strategies can be made robust to adversarial ML. Three software packages, OMNeT++, SUMO, and Veins, are integrated to simulate CAVs. The ML dataset comes from the Vehicular Reference Misbehavior Dataset (VeReMi), which this work expands with new attacks. The work then evaluates the accuracy of K-Nearest Neighbors and Random Forest as ML algorithms, and of a Logistic Regression Neural Network and a Recurrent Neural Network with Long Short-Term Memory as Deep Learning (DL) algorithms, all coded in Python. Synthetic data following a normal distribution, generated from a Random Walk movement theory, is used to further evaluate the robustness of the ML/DL models, i.e., the effectiveness of spoofed data from the attacker's point of view. The work concludes that DL models perform better than ML models in most cases. It contributes to ML-based Misbehavior Detection in CAVs by exposing the challenges posed by adversarial ML and recommending DL over ML against adversaries, thereby strengthening CAV defense with robust DL models. Future work will address the limitations of DL, such as the need for large quantities of data and for balanced classes.
To correct unbalanced classes, synthetic data can be generated with techniques such as SMOTE, for which Python implementations are available. Oversampling the minority classes or undersampling the majority classes could be another solution. Pairing model training with feature analysis, such as Exploratory Data Analysis, to identify the most relevant features would help train the models more effectively and accurately. Other future work includes engineering features from the datasets and using real-world datasets from CAV pilot cities.
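The resampling options mentioned for future work can be illustrated with a minimal sketch. It uses plain random resampling in NumPy, not SMOTE (which interpolates new minority samples instead of duplicating existing ones), and it is not the thesis's implementation. The function name `rebalance`, its parameters, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def rebalance(X, y, mode="over", seed=0):
    """Randomly resample rows so every class ends up with the same count.

    mode="over":  duplicate minority-class rows up to the largest class size.
    mode="under": drop majority-class rows down to the smallest class size.
    """
    if mode not in ("over", "under"):
        raise ValueError("mode must be 'over' or 'under'")
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max() if mode == "over" else counts.min()
    picked = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        if mode == "over" and n < target:
            # Keep every original row, then draw the shortfall with replacement.
            extra = rng.choice(idx, size=target - n, replace=True)
            idx = np.concatenate([idx, extra])
        elif mode == "under" and n > target:
            # Keep a random subset of the rows, drawn without replacement.
            idx = rng.choice(idx, size=target, replace=False)
        picked.append(idx)
    # Shuffle so the classes are not grouped together in the output.
    order = rng.permutation(np.concatenate(picked))
    return X[order], y[order]


# Hypothetical toy data: 90 "genuine" (0) and 10 "misbehaving" (1) samples.
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

X_over, y_over = rebalance(X, y, mode="over")
X_under, y_under = rebalance(X, y, mode="under")
print(np.bincount(y_over), np.bincount(y_under))
```

In practice, resampling of this kind is normally applied to the training split only, so that the test set keeps the original class distribution.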