
Computer Vision
Real-Time Inference Loop
Overview:
I developed a Python-based self-driving video game player that uses computer vision to automate keyboard controls. The program monitors the region of the screen where the game is running and makes real-time decisions to control the game.
Check out the code and documentation on GitHub.
Technologies Used:
- Python
- PyAutoGUI for screen capture
- pynput for keyboard input and control
- TensorFlow for deep learning
- CNN + LSTM model architecture
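The real-time loop at the heart of the player can be sketched as follows. The capture, prediction, and key functions are injected so the control logic stands alone; in the real player they would wrap a PyAutoGUI screenshot of the game region, the TensorFlow model, and pynput's keyboard controller (the key names and 0.5 threshold are illustrative assumptions):

```python
import time
from collections import deque

def drive(grab_frame, predict, press, release,
          keys=("w", "a", "s", "d"), fps=10, window=20, max_steps=None):
    """Fixed-rate control loop: keep the last `window` frames, run the model
    each tick, and hold/release keys based on its per-key probabilities."""
    buf = deque(maxlen=window)
    interval = 1.0 / fps
    held = set()
    step = 0
    while max_steps is None or step < max_steps:
        tick = time.perf_counter()
        buf.append(grab_frame())
        if len(buf) == window:
            probs = predict(list(buf))  # one probability per key
            want = {k for k, p in zip(keys, probs) if p > 0.5}
            for k in want - held:
                press(k)
            for k in held - want:
                release(k)
            held = want
        step += 1
        # Sleep off whatever is left of this tick's time budget.
        time.sleep(max(0.0, interval - (time.perf_counter() - tick)))
```

Injecting the dependencies also makes the loop easy to test with fakes before wiring it up to a live game window.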
Project Timeline and Challenges:
Data Collection:
The project began with building a tool to record my own training data. I recorded 50 sessions, each about 45 seconds long, at 10 frames per second. Frames were captured as 256×256 RGB images, resulting in a manageable initial dataset. However, segmenting this data into training examples drastically inflated its size, presenting a significant challenge in terms of volume and management.
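A 45-second session at 10 FPS is roughly 450 frames, so the 50 raw sessions come to about 22,500 labeled frames. The recorder can be sketched like this, with the screenshot and key-listener functions injected (in the real tool these wrap PyAutoGUI and a pynput listener; the crude strided downscale is a stand-in for a proper resize):

```python
import numpy as np

def record_frames(grab_frame, get_keys, num_frames, size=(256, 256)):
    """Capture `num_frames` screenshots, each paired with the keys held at
    capture time, downscaled to `size` RGB frames."""
    frames, labels = [], []
    for _ in range(num_frames):
        frame = np.asarray(grab_frame())
        h, w = frame.shape[:2]
        # Downscale by striding, then crop to exactly `size`.
        small = frame[::max(1, h // size[0]), ::max(1, w // size[1])]
        small = small[:size[0], :size[1]]
        frames.append(small)
        labels.append(tuple(sorted(get_keys())))
    return np.stack(frames), labels
```

Frame-rate pacing is omitted here; the loop in the real recorder ticked at 10 FPS.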
Data Pre-processing:
To handle the expanded dataset, I built a dedicated pre-processing pipeline. The recordings had to be segmented into training examples: a sliding window of 20 frames was used to predict the keystrokes to make next, and this segmentation significantly increased the data size. Working within the Google ecosystem, I used Google Colab for both pre-processing and model training; the platform provided the computational resources needed for efficient data handling.
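The segmentation step can be sketched as a sliding window over each session. Because every example copies 20 consecutive frames, the stored data grows by roughly a factor of 20, which is the inflation described above (labeling the frame after the window is an assumption of this sketch):

```python
import numpy as np

def make_windows(frames, keys, window=20):
    """Segment one recorded session into training examples.

    Each example is `window` consecutive frames; its label is the keystroke
    vector for the frame immediately following the window.
    """
    X, y = [], []
    for i in range(len(frames) - window):
        X.append(frames[i:i + window])  # copies 20 frames per example
        y.append(keys[i + window])
    return np.stack(X), np.stack(y)
```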
Training the Model:
Training the model revealed additional hurdles. The large dataset had to be batched to fit into RAM, and Google Colab made it easy to switch to higher-RAM runtimes and GPUs, but significant work was still needed to train on the full dataset. I initially implemented my own convolutional neural network (CNN) but found its performance lacking. This led me to pre-trained models like ResNet, and transfer learning significantly improved the model's performance and efficiency.
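A minimal Keras sketch of the CNN + LSTM architecture with a ResNet backbone might look like this. The specific choices here (ResNet50, four output keys, 64×64 frames, a 128-unit LSTM) are illustrative assumptions; the real model ran on 256×256 frames, and transfer learning means loading `weights="imagenet"`, which is set to `None` below only to avoid a download in this sketch:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_driver_model(window=20, frame_size=(64, 64), num_keys=4):
    """CNN + LSTM: a ResNet backbone extracts per-frame features, an LSTM
    integrates them across the window, and a sigmoid head predicts which
    keys to hold (multi-label, since keys can be held together)."""
    backbone = tf.keras.applications.ResNet50(
        include_top=False,
        weights=None,  # the real project used "imagenet" for transfer learning
        pooling="avg",
        input_shape=(*frame_size, 3),
    )
    inputs = layers.Input(shape=(window, *frame_size, 3))
    x = layers.TimeDistributed(backbone)(inputs)  # (batch, window, 2048)
    x = layers.LSTM(128)(x)                       # temporal integration
    outputs = layers.Dense(num_keys, activation="sigmoid")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```

With ImageNet weights loaded, the backbone can be frozen at first so only the LSTM and head train, which is the usual transfer-learning starting point.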
Model Optimization:
One of the key challenges during implementation was inference time. The model needed to make inferences at 10 frames per second to be effective in real-time gameplay, but the initial models were too large and inference was slow. I optimized the model's architecture to reduce its size, ensuring faster inference and a more responsive real-time decision-making process.
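At 10 FPS the whole capture-infer-act cycle has a budget of 100 ms per frame. A small benchmark helper (an illustrative sketch, not part of the original code) makes it easy to check whether a candidate model fits that budget before wiring it into the loop:

```python
import time

def meets_realtime_budget(infer, sample, fps=10, warmup=3, trials=20):
    """Return (fits_budget, mean_latency_s) for `infer(sample)` against the
    per-frame budget of 1/fps seconds. Warm-up runs absorb one-time costs
    such as graph tracing or cache warming."""
    budget = 1.0 / fps
    for _ in range(warmup):
        infer(sample)
    start = time.perf_counter()
    for _ in range(trials):
        infer(sample)
    mean = (time.perf_counter() - start) / trials
    return mean <= budget, mean
```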
Performance Improvement:
Performance issues also arose due to training data imbalances. Much of the training data consisted of driving straight, which caused the model to struggle with turns. To address this, I developed a method to balance the training examples, ensuring that turning actions were more heavily represented. This adjustment improved the model's ability to handle various driving scenarios effectively.
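One simple way to implement that balancing is to oversample the under-represented key combinations until each appears as often as the most common one, so turn examples stop being drowned out by straight driving. This sketch (a plausible approach, not necessarily the exact method used) treats each row of the label matrix as a class:

```python
import numpy as np

def balance_by_oversampling(X, y, rng=None):
    """Duplicate examples of minority key-combinations until every class
    matches the count of the most common one, then shuffle."""
    rng = rng if rng is not None else np.random.default_rng(0)
    classes = {}
    for i, row in enumerate(y):
        classes.setdefault(tuple(row), []).append(i)
    target = max(len(idx) for idx in classes.values())
    keep = []
    for idx in classes.values():
        keep.extend(idx)
        extra = target - len(idx)
        if extra:
            # Resample minority examples (e.g. turns) with replacement.
            keep.extend(rng.choice(idx, size=extra, replace=True))
    keep = rng.permutation(keep)
    return X[keep], y[keep]
```

Alternatives such as per-class loss weights would address the same imbalance without duplicating data.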
Achievements:
I successfully built a prototype that can play and drive in a video game by interpreting the screen's visual data. This project not only honed my skills in computer vision and machine learning but also demonstrated the practical application of AI in visual tasks. Additionally, I gained valuable experience in managing and processing large datasets, improving my ability to handle extensive data for machine learning projects. The project also taught me the importance of pre-trained models and transfer learning, which greatly enhanced the effectiveness of my solution.
By following a structured approach, addressing challenges methodically, and iterating on the design, I was able to develop a functional and efficient self-driving video game player.