This project develops a system that helps visually impaired people understand their surroundings by converting images into text and then into speech. The system combines two types of deep learning models: a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network. The CNN analyzes the image and extracts its visual features, while the LSTM generates a text caption from those features. Once the image has been converted to text, the system uses text-to-speech technology to read the caption aloud.
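Below is a minimal sketch of how such a CNN + LSTM captioning model might be wired together in Keras. The choice of InceptionV3 as the feature extractor, the layer sizes, and placeholder constants such as VOCAB_SIZE and MAX_LEN are illustrative assumptions, not the project's tuned configuration.

```python
# Sketch of a CNN encoder + LSTM decoder for image captioning.
# All sizes below are assumed placeholders, not tuned values.
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

VOCAB_SIZE = 8000   # assumed vocabulary size after tokenizing the captions
MAX_LEN = 34        # assumed maximum caption length in tokens
EMBED_DIM = 256

# CNN encoder: a pretrained InceptionV3 with its classification head removed,
# used as a fixed feature extractor (one 2048-d vector per image).
cnn = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

# Decoder: image features and the partial caption are merged, and an LSTM
# predicts the next word at each step.
img_in = Input(shape=(2048,))
img_feat = Dropout(0.5)(img_in)
img_feat = Dense(EMBED_DIM, activation="relu")(img_feat)

seq_in = Input(shape=(MAX_LEN,))
seq_feat = Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(seq_in)
seq_feat = Dropout(0.5)(seq_feat)
seq_feat = LSTM(EMBED_DIM)(seq_feat)

merged = add([img_feat, seq_feat])
merged = Dense(EMBED_DIM, activation="relu")(merged)
out = Dense(VOCAB_SIZE, activation="softmax")(merged)

model = Model(inputs=[img_in, seq_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

For the final step, any text-to-speech library can speak the decoded caption; one common option (an assumption here, since the project text does not name an engine) is gTTS, e.g. `gTTS(text=caption, lang="en").save("caption.mp3")`.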
Dataset Details: https://www.kaggle.com/datasets/adityajn105/flickr8k
The dataset consists of 8,000 images, each paired with five different captions that provide clear descriptions of the salient entities and events. The images were chosen from six different Flickr groups and tend not to contain any well-known people or locations; they were manually selected to depict a variety of scenes and situations.
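As a starting point, the captions can be grouped by image before tokenization. The sketch below assumes the Kaggle release's layout: an Images/ folder plus a captions.txt CSV whose rows pair an image filename with one of its five captions.

```python
# Sketch of loading the Flickr8k captions, assuming a captions.txt CSV
# with an "image,caption" header (the layout of the Kaggle release).
import csv
from collections import defaultdict

def load_captions(path="captions.txt"):
    """Map each image filename to its list of (roughly five) captions."""
    captions = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)  # expects header: image,caption
        for row in reader:
            captions[row["image"]].append(row["caption"].strip())
    return captions

captions = load_captions()
print(len(captions))                  # expected: about 8,000 images
print(next(iter(captions.items())))   # one image with its caption list
```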