Temporal Driver Activity Detection

The project focuses on implementing temporal activity detection in the context of driver monitoring, extending classification models so that driver activities can be identified in a continuous input stream. The initial phase involved a thorough examination of the Drive&Act dataset (Drive&Act website; Martin et al., ICCV 2019). This multimodal dataset provides a comprehensive framework for fine-grained driver behavior recognition, incorporating synchronized data streams such as RGB videos, depth images, optical flow, and 3D skeletal data captured in naturalistic driving scenarios. During dataset exploration, the primary focus was on the RGB videos captured from the A-column co-driver camera location, together with the "Activity" annotations, which provide detailed temporal labeling of driver behaviors. These annotations were pivotal for designing and evaluating models aimed at detecting and segmenting driver activities accurately, and this understanding laid the foundation for the detection pipeline and model development in subsequent stages.

Initially, I researched whether the driver activity classification problem could be tackled with YOLOv5 instead of a sliding window: YOLOv5 would extract per-frame features, which would then be passed to a recurrent neural network (RNN) to model the sequential nature of driver activities. Driver monitoring systems based on such detection models are reportedly deployed by automotive manufacturers such as Volvo and Mercedes-Benz. YOLOv5 is structured into four primary components: input, backbone, neck, and prediction layers. The backbone serves as the feature extractor, capturing essential details from input images, while the neck aggregates spatial and semantic information across different scales. For driver activity recognition, the output of YOLOv5 typically includes a bounding box for each detected activity, a class label per box identifying the activity (e.g., "Using Phone", "Working on Laptop", "Closing Bottle"), and a confidence score for each detection; a sketch of such per-frame inference is shown below. However, training variations of this model and the temporal module on a CPU significantly increased training time, often taking several hours even for a small dataset, and runs were interrupted by runtime disconnects. Trimming the videos to shorter lengths did not help; even a one-minute segment took a long time to train. This limitation also restricted hyperparameter tuning. Given the limited time for experimentation, we therefore fell back on the traditional approach, although I still consider the YOLOv5-based pipeline promising.

My next task was to construct a detection model. The original plan was to extend the sliding-window driver activity classification model and pass its output to the detection model, but the sliding window was only finished close to the deadline. Having anticipated this, we changed our approach to reduce the dependencies between modules: the input data was run through the I3D model, which produced per-frame class probabilities stored as .pt files. My approach was to take the np.argmax of these class-probability arrays and pass the resulting label sequence to the detection model as training input.
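The following is a minimal, hedged sketch of what the per-frame YOLOv5 inference described above could look like. The frame path and the use of stock pretrained weights (rather than a model fine-tuned on driver activities) are illustrative assumptions, not the project's actual setup.

```python
import torch

# Illustrative only: load stock YOLOv5 weights via torch.hub and run inference
# on a single frame. A driver-activity model would instead load custom weights,
# e.g. torch.hub.load('ultralytics/yolov5', 'custom', path='best.pt').
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

results = model('frame_000001.jpg')     # hypothetical path to one RGB frame
detections = results.xyxy[0]            # rows: [x1, y1, x2, y2, confidence, class]

for *box, conf, cls in detections.tolist():
    print(box, round(conf, 2), model.names[int(cls)])
```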

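For the approach that was ultimately pursued, the stored class probabilities first have to be turned into per-frame label sequences. A minimal sketch of that preprocessing step follows; the file names and the annotation column name are assumptions, not the actual repository paths.

```python
import numpy as np
import pandas as pd
import torch

# Hypothetical file names; the real .pt/.csv paths in the repository differ.
probs = torch.load('participant1_probs.pt')      # assumed shape: (num_frames, 34)
labels = np.argmax(probs.numpy(), axis=1)        # per-frame predicted class index

annotations = pd.read_csv('participant1_annotations.csv')
targets = annotations['activity'].to_numpy()     # assumed column name; 40 = missing label

print(labels.shape, targets.shape)               # both (num_frames,), e.g. (19071,)
```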
Again due to computational limitations, we handled each video from the dataset individually. I trained the model on Participant 1 and tested it on Participant 5. The training file consisted of 19,071 frames, each represented by an array of 34 class probabilities. Annotations for the frames are stored in CSV files, with the value 40 used as a placeholder for missing annotations. Preprocessing converted the class probabilities to discrete class labels with np.argmax, followed by batch processing to keep memory usage manageable.

To improve accuracy, several model architectures were considered, including transformers, temporal convolutional networks (TCNs), and recurrent neural networks (RNNs). A significant challenge was aligning the input and target dimensions for the loss computation; batch-size mismatches and tensor-shape inconsistencies were resolved through careful debugging and adjustments to the data pipeline. Some of our approaches reached high training accuracies of 88-89%, but test accuracy remained very low. I also incorporated early stopping and batch normalization to stabilize training and mitigate overfitting. Early stopping halted training once performance on the validation set plateaued, preventing unnecessary overtraining; however, accuracy did not improve. Increasing the amount of training data and tuning hyperparameters had no effect either, and some model changes even lowered the training accuracy to 55-60%.

The LSTM model was configured with multiple layers and hidden units to model the temporal dynamics of the input data. During training, gradient clipping and dropout were applied to prevent exploding gradients and overfitting, respectively. The LSTM showed promising results in capturing temporal relationships within the data, achieving a high training accuracy, but tuning its hyperparameters, such as the number of layers, hidden units, and learning rate, was critical to its performance. Despite these efforts, the LSTM occasionally overfit, particularly on smaller subsets of the data, necessitating further exploration of regularization techniques. The model in the picture reached a training accuracy of 83% but a test accuracy of just 44%.
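A hedged sketch of this kind of per-frame LSTM follows, including the dropout, batch normalization, gradient clipping, early stopping, and the logits/target reshaping for the loss mentioned above. Layer sizes, learning rate, and patience are illustrative assumptions, not the notebook's exact settings.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 34    # I3D output classes
PLACEHOLDER = 40    # annotation value marking a missing label

class TemporalLSTM(nn.Module):
    """Embeds per-frame class indices and predicts an activity label per frame."""
    def __init__(self, embed_dim=32, hidden=128, layers=2, dropout=0.3):
        super().__init__()
        self.embed = nn.Embedding(NUM_CLASSES, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=layers,
                            batch_first=True, dropout=dropout)
        self.norm = nn.BatchNorm1d(hidden)
        self.head = nn.Linear(hidden, NUM_CLASSES)

    def forward(self, x):                    # x: (batch, seq) of class indices
        h, _ = self.lstm(self.embed(x))      # (batch, seq, hidden)
        h = self.norm(h.transpose(1, 2)).transpose(1, 2)
        return self.head(h)                  # (batch, seq, NUM_CLASSES)

def train(model, train_loader, val_loader, epochs=50, patience=5):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    # ignore_index skips frames whose annotation is the placeholder value 40.
    loss_fn = nn.CrossEntropyLoss(ignore_index=PLACEHOLDER)
    best_val, bad_epochs = float('inf'), 0
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:                              # y: (batch, seq)
            opt.zero_grad()
            logits = model(x)                                  # (batch, seq, C)
            # Flatten so logits are (batch*seq, C) and targets are (batch*seq,).
            loss = loss_fn(logits.reshape(-1, NUM_CLASSES), y.reshape(-1))
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # tame exploding gradients
            opt.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(x).reshape(-1, NUM_CLASSES),
                                   y.reshape(-1)).item()
                           for x, y in val_loader) / len(val_loader)
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                         # early stopping
                break
```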

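Continuing from the sketch above, a usage example under the same assumptions: the per-frame label sequences from the preprocessing step are cut into fixed-length windows for the Participant 1 / Participant 5 split. Window length, batch size, and the random stand-in arrays are illustrative.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(labels, targets, seq_len=256, batch_size=8, shuffle=True):
    """Cut per-frame label/target arrays into fixed-length windows (assumed scheme)."""
    n = (len(labels) // seq_len) * seq_len
    x = torch.as_tensor(labels[:n], dtype=torch.long).reshape(-1, seq_len)
    y = torch.as_tensor(targets[:n], dtype=torch.long).reshape(-1, seq_len)
    return DataLoader(TensorDataset(x, y), batch_size=batch_size, shuffle=shuffle)

# Random stand-ins; in practice these are the per-frame labels/annotations
# produced by the preprocessing sketch for Participants 1 and 5.
labels_p1 = np.random.randint(0, 34, size=19071)
targets_p1 = np.random.randint(0, 34, size=19071)
labels_p5 = np.random.randint(0, 34, size=15000)   # length illustrative
targets_p5 = np.random.randint(0, 34, size=15000)

train_loader = make_loader(labels_p1, targets_p1)
val_loader = make_loader(labels_p5, targets_p5, shuffle=False)

model = TemporalLSTM()                 # defined in the sketch above
train(model, train_loader, val_loader)
```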
My best approach, LSTM_with_more_inputs.ipynb, added further data manipulations, but it only reached a training accuracy of 64%, a negligible improvement. I do not consider my approach fully correct, as I was still experimenting with both the data and the models.
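One plausible reading of "more inputs" is feeding the LSTM the full 34-dimensional probability vectors per frame instead of only their argmax; this is an assumption about the notebook, sketched below, not a confirmed description of it.

```python
import torch
import torch.nn as nn

# Assumption: "more inputs" = the full 34-dim I3D probability vectors per frame
# rather than a single argmax label; this variant therefore skips the embedding.
class ProbLSTM(nn.Module):
    def __init__(self, num_classes=34, hidden=128, layers=2, dropout=0.3):
        super().__init__()
        self.lstm = nn.LSTM(num_classes, hidden, num_layers=layers,
                            batch_first=True, dropout=dropout)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):      # x: (batch, seq, 34) float probabilities
        h, _ = self.lstm(x)
        return self.head(h)    # (batch, seq, 34) logits
```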

This project received the highest grade in the course Machine Perception Learning at the University of Stuttgart in the Winter Semester 2024.

The partial project can be found on GitHub.