AI-powered BlindSpot VisionGuide system on raspberry Pi for enhancing independence of visually impaired users

Sudha, M.; Swaminathan, S.; Suba, M.; Suyamburajan, A.

doi:10.1038/s41598-026-39724-9

Article
Open access
Published: 27 February 2026

Scientific Reports volume 16, Article number: 11316 (2026)

Abstract

This work describes BlindSpot-VisionGuide, an integrated, AI-based assistive system that aims to empower visually impaired people towards independence through real-time audio interaction. The system incorporates three fundamental capabilities—face recognition, image captioning, and reading online newspapers—into a voice-based platform deployable in Raspberry Pi hardware. The face recognition capability recognizes known people using deep facial embeddings and returns instant voice feedback. The image captioning module uses a transformer-based BLIP model to produce natural language descriptions of scenes captured. The online newspaper module fetches structured news content through APIs and converts it into speech through a text-to-speech engine. The voice interface is centralized for all the modules, enabling users to interact with their surroundings without their hands. The system has been tested for recognition accuracy, response time, and memory consumption on a Raspberry Pi 5. Experiments indicate that the platform operates reliably in all modules, striking a balance between computation and user-friendliness. Optimized for offline use and low-power devices, BlindSpot illustrates the practical applicability of embedded AI towards the creation of inclusive, scalable assistive technology. The authors conclude by noting potential extensions, such as object detection, multi-language support, and caregiver incorporation, making BlindSpot a fundamental model for vision-based accessibility systems of the next generation.

Introduction

The evolution of embedded artificial intelligence and low-cost edge computing has opened new avenues for creating assistive technology for the visually impaired. Current systems can sense, analyze, and report on the physical world in real time using technologies such as the Internet of Things (IoT) and deep learning. As reported in¹, IoT-connected devices with intelligent processing capabilities are becoming increasingly proficient at detection and recognition tasks, allowing people to engage more intuitively with their environments. These technologies have been further strengthened by computer vision and pre-trained neural networks, which enable their application to face detection, object identification, and real-time scene analysis².

In the domain of accessibility, several efforts have been made to transform visual content into auditory feedback. For example, some systems analyze structured web content and vocalize the headlines of the news and articles through Text-to-Speech (TTS) engines, enabling users to navigate live streams of information via audio-based navigation³. Concurrently, the wider context for such innovations lies in the increasing necessity to help more than 200 million individuals across the globe with vision impairment—a figure set to increase considerably over the next few decades⁴. With improvements in deep learning and computer vision, assistive technologies like wearable smart glasses, smartphone applications for navigation, and object recognition software have gone from idea to useful implementation.

Face recognition is among the fundamental capabilities needed in such systems, since awareness of identity makes social interaction possible. Convolutional Neural Networks (CNNs) are now the backbone of such systems, extracting robust features for facial recognition and classification under varied conditions^5,6,7,8. Yet, although these technologies are available in pieces, few systems have combined several high-impact capabilities (person identification, visual scene captioning, and dynamic content reading) into one lightweight, voice-driven system deployable on low-cost edge hardware^9,10,11,12.

To meet this requirement, the current study presents BlindSpot-VisionGuide, a multi-purpose assistive system based on Raspberry Pi that combines face recognition, image captioning, and online newspaper reading into a single, speech-based interface. The system is offline-capable, operates in real time, and adapts to the different use cases that visually impaired users face in daily life.

Problem statement and significance

Visually impaired persons are severely disadvantaged when it comes to independently identifying individuals, interpreting scenes, or accessing printed and digital information. Most commercial applications focus on isolated functionalities (e.g., voice-over of text) or rely on internet-based services, which do not work in low-resource contexts^13,14,15,16,17. Additionally, advanced wearable hardware is usually beyond users' budgets and unavailable in many areas^17,18,19.

This research identifies three interconnected problems that a cohesive assistive system needs to resolve:

Recognizing known persons in dynamic environments using real-time face recognition.
Describing and interpreting intricate scenes without the use of sight.
Pulling and speaking timely news content in an organized, interactive format.

Resolving these issues within the hardware limitations of a Raspberry Pi—with ease of use and offline functionality—is the essence of the system’s design objectives.

Key contributions of this work

This work primarily focuses on the practical integration of assistive technologies for the visually impaired, rather than on the creation of a new deep learning model. Face recognition, image captioning, and TTS are implemented using existing algorithms; the novelty lies in the smooth orchestration of these components under resource constraints and their context-aware deployment on a single platform. More specifically, the contributions are:

1. Unified Multi-Modal Assistive Platform: Development of a hybrid system called BlindSpot-VisionGuide integrating real-time face recognition, transformer-based image captioning, and online newspaper reading via API, all on a low-cost Raspberry Pi 5.
2. Modular Voice-Driven Orchestration: A modular control pipeline dynamically manages module execution, resource allocation, and task sharing, enabling smooth switching between tasks without added latency.
3. Selective Content Filtering and Privacy-Preserving News Retrieval: Beyond prior Pi Readers, this work offers API-driven structured news retrieval with redundancy filtering, region/date constraints, and offline fallback mechanisms to protect user privacy and ensure usability in low-connectivity settings.
4. Optimized Resource Utilization for Edge AI: The system achieves multimodal responsiveness below 2.5 s per article, with a peak usage of around 350 MB RAM, validated with people with visual impairments, showing the real-time feasibility of the system without depending on the cloud.

Scientific contribution of the study

The scientific contribution of this work lies mainly in engineering-driven systems integration rather than in the creation of deep multimodal fusion algorithms. The proposed BlindSpot-VisionGuide platform does not attempt representation-level fusion or joint learning across modalities; rather, it addresses the systems-science challenges of deploying multiple AI-driven assistive services on resource-limited edge hardware.

The contribution from a systems science standpoint is placed at the crossroads of embedded AI, HCI, and assistive technology engineering, where the most important research problems are:

Ensuring the reliable performance of mixed AI applications under strict memory and power restrictions,
Handling user-interactive tasks with attention to latency,
Offline operation with assured privacy and no cloud dependency, and
Usability and accessibility for visually impaired people under real-world conditions.

The integration strategy might look like a task-switching architecture on the algorithmic level, but its scientific significance lies in involving the simultaneous use, coordination, and assessment of three computationally intensive vision–language models on one low-cost embedded platform without sacrificing user responsiveness and trustworthiness. The Raspberry Pi-based implementation of such a system necessitates meticulous design choices that go far beyond simple module interconnection, and that involve sharing of resources, control of execution, and unification of interfaces.

Hence, the authors’ contribution has to be viewed as a systems engineering validation of embedded multimodal assistive AI rather than a deep semantic fusion claim. Deep cross-modal fusion remains an important research direction and is identified as future work, but it was intentionally excluded from the present implementation to prioritize system robustness, interpretability, and real-time feasibility under edge constraints.

Structure of the paper

The rest of the paper is organized as follows. Section “Related work” provides a comprehensive review of the existing literature on face recognition, image captioning, and online newspaper reading technologies, with specific reference to assistive systems for visually impaired people. Section “Proposed work” presents the proposed work, detailing the system architecture and implementation of every core module, followed by their integration into a single, speech-controlled platform. Lastly, the final section concludes the paper by summarizing key findings, presenting limitations, and proposing avenues for future improvements.

Related work

Assistive technologies for visually impaired users have recently benefited from advances in artificial intelligence, particularly in computer vision and deep learning. Current systems perform real-time object recognition, image captioning, and facial recognition, and provide feedback via an audio interface.

Image Captioning: Traditional methods combine CNNs for spatial feature extraction and LSTMs for sentence generation, employing backbones like ResNet, VGG16, and AlexNet. Such models are coupled with a TTS engine to provide image descriptions for assistive technologies²⁰. Transformer-based models like BLIP and ViT-GPT2, on the other hand, have improved captioning accuracy in recent times by jointly learning visual and linguistic representations.
Face Recognition: In most real-time applications, offline operation remains an important requirement; hence, core algorithms such as Dlib's deep metric learning-based pipeline and lightweight detectors such as Haar Cascades are deployed in various assistive devices^21,22. Wearable systems based on the Raspberry Pi and Pi camera demonstrate the feasibility of recognizing known people and generating audio cues for social awareness.
Object Detection and Navigation: Object detection models can perform multi-object detection at high speeds with spatial context, which is crucial for scene description and navigation information²³. Embedded systems that use MobileNet- or PSO-MobileNetV2-based implementations with Raspberry Pi cameras for obstacle detection and scene description offer real-time TTS audio feedback²⁴.
Edge Computing and Hardware Platforms: The Raspberry Pi is a popular choice for assistive devices because of its low price, GPIO availability, and ability to interface with different sensors²⁵. Using GPU resources together with data augmentation helps improve both model generalization and inference speed²⁶. Other microcontroller platforms such as the NodeMCU or ESP8266 are mostly used as auxiliary sensor nodes given their limited processing capabilities²⁷.
Assistive System Trends: Recent solutions combine vision, navigation, and safety features such as location tracking, obstacle alerting, and caregiver notifications to enhance the systems' autonomy and security²⁸.

Table 1 summarizes the trade-offs of representative pretrained assistive systems, emphasizing the differences in inference speed, accuracy, and resource requirements.

Table 1 Comparison of representative pretrained assistive systems.

The Pi-based assistive systems offer strong building blocks for facial recognition, object detection, and scene description. Yet many implementations still either lack integrated modules or must connect to the Internet for full use. Hence, BlindSpot-VisionGuide ties these major modules together into a single modular, voice-activated platform designed for offline use, resource efficiency, and real-time responsiveness. This positions our system as the practical next step beyond isolated single-function prototypes, especially for embedded edge deployments. Table 2 compares the advantages and disadvantages of various pretrained models, and Table 3 compares the advantages and disadvantages of face recognition techniques.

Table 2 Comparison of advantages and disadvantages of various pretrained models.

Table 3 Comparison of advantages and disadvantages of face recognition techniques.

Some pretrained detectors have difficulty with small objects, whereas MobileNet is suitable for embedded deployment at the cost of some accuracy. COCO-pretrained models generalize well but need to be tuned for particular tasks. These comparisons guided our selection of models appropriate for edge-based assistive applications.

In the field of face recognition, various methods have been explored depending on the target deployment environment and the constraints of the dataset. AdaBoost classifiers improve accuracy in challenging environments by aggregating weak learners, but they are susceptible to noise. SVM classifiers exhibit stability on limited datasets, yet their computational needs rise steeply with larger data. Lighter algorithms such as Haar Cascades continue to prove useful for fast face detection on embedded platforms, but with reduced precision in unconstrained settings. Our face recognition component builds on these findings by incorporating a Dlib-based pipeline that balances performance and efficiency on Raspberry Pi hardware.

In content access technology for the blind, online newspaper readers are increasingly built with structured pipelines and voice interfaces. As Table 4 describes, API-based solutions provide real-time, structured access to news content but come with parsing issues and access constraints when wrapper tools are not used. Gesture-assisted interfaces enhance hands-free navigation but are limited by external conditions such as lighting. Electronic newspaper formats enhance user experience through multimedia content and flexibility, even if they cannot be fully used by users with limited digital literacy. These considerations informed our use of a newspaper reading module that employs APIs with dynamic source filtering and TTS conversion to provide real-time, audible news without the need for visual interaction or touchscreen navigation.

Table 4 Comparison of advantages and disadvantages of online newspaper reading techniques.

Many assistive technologies have been proposed for visually challenged users. Most prior works focus only on certain modules or on proprietary systems with design-based considerations; those with an actual end-to-end setup are few and far between. Table 5 compares representative systems from the literature, and the key distinctions are as follows:

In contrast with the assistive systems in the literature, BlindSpot-VisionGuide brings together three essential modules under voice activation in a single system, thereby opening new possibilities for multitasking usage.
Performance metrics in prior works mostly focus on one dimension, for example, recognition accuracy, inference time, or power consumption. We evaluate resource efficiency, user task success, and cognitive load.
Engineering demonstrations are sometimes documented only in sources such as blogs or preprints, but published literature has been drawn upon to the extent possible to benchmark accuracy and latency. For example, Pi-Assist reports 92% face recognition accuracy on a small dataset, while BlindSpot achieves 93.8% under similar conditions.

Table 5 Comparison of assistive systems.

This comparison brings out the practical novelty: the system does not propose new algorithms but rather demonstrates effective modular integration that runs in real time on a low-cost embedded platform while keeping the accessibility features available offline.

Proposed work

Face recognition module

Objective

The Face Recognition module of the BlindSpot-VisionGuide system is designed primarily to help visually impaired users recognize people in their environment through real-time audio feedback. The system bridges the gap between visual and auditory perception by taking visual input from a camera and analyzing it with artificial intelligence on a Raspberry Pi. The goal is to give users timely, contextually appropriate recognition of known individuals, thereby increasing their social confidence and mobility. The module is intended to work offline and efficiently on a limited embedded platform, making it affordable and portable in real-world assistive situations.

System overview

This module acts as a lightweight but robust face recognition pipeline embedded within the larger assistive system. When the system is activated, the Raspberry Pi reads video frames from a connected webcam. Faces are detected in real time by a HOG-based face detector from the Dlib library. Regions of interest (ROIs) containing faces are passed to a pre-trained deep learning encoder to obtain compact facial embeddings. These embeddings are numerical representations that retain identity-specific facial features. Recognition relies on a simple Artificial Neural Network (ANN)-style classifier that compares the current embedding against the stored embeddings of known persons, whose names it also stores, and declares a match when the Euclidean distance falls within a threshold. On a match, the system retrieves the individual's name and speaks it using a Text-to-Speech (TTS) engine. Where there is no match, the system verbally informs the user that the person is not recognized.

Technical architecture

The module operates using a sequence of interdependent stages. Initially, the image acquisition process is handled by a webcam that streams frames directly into the Raspberry Pi. These frames are converted from BGR to RGB format and resized for faster processing. Dlib’s frontal face detector identifies facial regions, and each detected face is passed to the face_recognition library for feature extraction. This library utilizes a ResNet-34-based architecture to encode each face into a 128-dimensional feature vector, which remains consistent for the same individual under varying conditions. The encoded vector is compared against a locally stored dictionary of known embeddings, using Euclidean distance as the comparison metric. If the closest match falls within the pre-defined threshold (0.6 in this system), the corresponding name is selected; otherwise, the identity is marked as “Unknown.” After recognition, the name is passed to the speech engine, which delivers real-time auditory feedback. The system is designed to support multiple users and allows new faces to be added by capturing an image, extracting the embedding, and storing it with a name label in the internal database.
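The matching and enrollment steps described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names `identify` and `enroll`, the names "Alice" and "Bob", and the random vectors standing in for real 128-dimensional face embeddings are all assumptions; only the Euclidean distance metric, the 0.6 threshold, and the "Unknown" label come from the text.

```python
import math
import random

MATCH_THRESHOLD = 0.6  # distance threshold stated in the text


def identify(embedding, known, threshold=MATCH_THRESHOLD):
    """Return the name of the closest stored embedding, or "Unknown"."""
    best_name, best_dist = "Unknown", float("inf")
    for name, stored in known.items():
        d = math.dist(embedding, stored)  # Euclidean distance
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name if best_dist <= threshold else "Unknown"


def enroll(known, name, embedding):
    """Add a person by storing the embedding under a name label."""
    known[name] = list(embedding)


def random_unit_vector(rng, dim=128):
    """Hypothetical stand-in for a real face embedding."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Illustrative data only: random vectors, not real faces.
rng = random.Random(0)
known_faces = {}
alice = random_unit_vector(rng)
bob = random_unit_vector(rng)
enroll(known_faces, "Alice", alice)
enroll(known_faces, "Bob", bob)

probe_same = [x + rng.gauss(0.0, 0.01) for x in alice]  # slightly perturbed Alice
probe_new = random_unit_vector(rng)                     # an unenrolled identity

print(identify(probe_same, known_faces))
print(identify(probe_new, known_faces))
```

In a real deployment the embeddings would come from the face encoder rather than a random generator, and the resulting name would be handed to the speech engine.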

User interaction and workflow

Interaction with the face recognition module is entirely voice-controlled. The system waits for specific trigger phrases such as "run the face module" to initiate the recognition process. Upon activation, it begins sampling video input and processes each frame to detect and classify faces. Detected identities are announced through a TTS engine so that users can recognize people in their field of view without requiring tactile input or visual feedback. The user can terminate the session or reset the system through other voice commands. The hands-free interface makes the module fully accessible for its target users while maintaining usability in public or mobile environments. The face recognition process is depicted in Fig. 1.
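One simple way to route a recognized utterance to a module is a phrase-to-handler table. The sketch below assumes the speech recognizer has already produced a text transcript; the handler functions and the caption and news phrases are hypothetical placeholders, and only "run the face module" is taken from the text.

```python
def run_face_module():
    # Placeholder for starting the face recognition loop.
    return "face recognition started"


def run_caption_module():
    # Placeholder for starting image captioning.
    return "image captioning started"


def run_news_module():
    # Placeholder for starting the newspaper reader.
    return "news reading started"


COMMANDS = {
    "run the face module": run_face_module,        # phrase quoted in the text
    "run the caption module": run_caption_module,  # hypothetical phrase
    "run the news module": run_news_module,        # hypothetical phrase
}


def dispatch(transcript):
    """Normalize a transcript and call the first handler whose phrase it contains."""
    text = " ".join(transcript.lower().split())
    for phrase, handler in COMMANDS.items():
        if phrase in text:
            return handler()
    return "command not recognized"


print(dispatch("Please RUN the  face module"))
print(dispatch("what time is it"))
```

Keeping the table separate from the handlers means a new module can be registered by adding one entry, which fits the modular design the paper describes.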

This figure illustrates the complete pipeline of the face recognition module in the BlindSpot-VisionGuide system. The flow begins with the live camera feed, which is processed by a Dlib HOG-based detector for face localization. Detected faces then pass to the face encoding stage, where the face_recognition library uses a ResNet-based embedder to convert each face into a 128-dimensional vector. These embeddings are then matched against a locally stored database using ANN-style Euclidean distance comparison. Finally, the name of the recognized person is output through TTS and the display. The diagram highlights the modular, sequential composition of the processing pipeline and its real-time implementation on Raspberry Pi.

System strengths and innovations

One of the strongest features of the module is its offline capability, ensuring smooth operation regardless of internet connection. This is especially crucial for field deployment in sparsely connected or rural regions. In addition, the matching process, while conceptually similar to an ANN, is implemented through basic distance-based matching, which dramatically reduces computational overhead. The utilization of a local voice engine avoids cloud-based service latency and ensures better privacy. Additionally, the modularity of the module ensures that it can function both independently and as part of the entire system, with the other components communicating with it smoothly through a shared command pipeline. The modularity ensures scalability and flexibility in deployment.

System output and behavior

The system provides intuitive feedback for all meaningful events. When a known person is detected, the system announces the name clearly and logs the interaction. In the event of no match, it tells the user by voice that the face is not known. At system startup and shutdown, the system provides audio messages of operational status, e.g., "intuitive audio prompts" or "timer expired, exiting." These outputs keep the user informed and confident about the system's actions, establishing trust and reliability. Visual debugging feedback is also supported during development, with bounding boxes and names around detected faces. Figure 2 displays the output for known (a), (c) and unknown (b) faces.

Limitations and considerations

While effective, the face recognition module is not without its drawbacks. Performance degrades in low light or under partial face occlusion. The system performs best with subjects facing the camera; with side profiles or other non-frontal views, the recognition rate decreases.

Further, face embedding storage is limited by memory on the Raspberry Pi. While the current prototype has little trouble supporting 10–26 individuals, scaling beyond this would require database optimization or offloading the storage. The threshold-based classifier, while easy to implement, may require dynamic adjustment in visually noisy or crowded environments.

Performance metrics

The face recognition module was tested on a Raspberry Pi 5 (8 GB RAM) with a real-time webcam feed under controlled indoor lighting conditions. Testing considered both computational performance and recognition accuracy over multiple test iterations. A 300-sample labeled dataset was used to measure recognition accuracy and classification stability.

For a robust and fair assessment of the face recognition module, the dataset consisting of 300 labeled samples has been carefully constructed for demographic and environmental diversity:

Subjects: 20 in all (15 images for each subject)
Gender Distribution: 55% male, 45% female
Age Range: 18–60 years
Lighting: Indoor lighting 60%, outdoor lighting 40%
Pose Variability: Frontal (50%), semi-profile (30%), profile (20%)
Resolution: 640 × 480

Class balance has been maintained using stratified sampling to ensure equal representation of each individual during training and testing phases.

Evaluation protocol

Two complementary protocols were used to assess both the identification and rejection capabilities of the system:

Closed-set protocol

All the identities that appear in the training set also appear in the test set.
Split: 70% training, 30% testing.
Used to measure the baseline recognition performance.

Open-set protocol

The test set contained 30% of identities not previously seen.
Used to evaluate the system’s capacity to reject unknown individuals.
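The two protocols can be illustrated with a small splitting routine. This is a sketch under stated assumptions, not the authors' procedure: the file names are placeholders, the helper names are hypothetical, and the open-set rule is read here as holding out 30% of the identities entirely so that they appear only in the test set with the label "Unknown". The 20 subjects, 15 images per subject, and 70/30 split come from the text.

```python
import random


def make_dataset(n_subjects=20, images_per_subject=15):
    """Placeholder dataset: subject id -> list of image names."""
    return {
        f"subject_{i:02d}": [f"subject_{i:02d}_img_{j:02d}" for j in range(images_per_subject)]
        for i in range(n_subjects)
    }


def closed_set_split(dataset, train_frac=0.7, seed=0):
    """Stratified split: every subject contributes to both train and test."""
    rng = random.Random(seed)
    train, test = [], []
    for subject, images in dataset.items():
        shuffled = images[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_frac)
        train += [(img, subject) for img in shuffled[:cut]]
        test += [(img, subject) for img in shuffled[cut:]]
    return train, test


def open_set_split(dataset, unseen_frac=0.3, train_frac=0.7, seed=0):
    """Hold out a fraction of identities; they appear only in test as "Unknown"."""
    rng = random.Random(seed)
    subjects = sorted(dataset)
    rng.shuffle(subjects)
    n_unseen = int(round(len(subjects) * unseen_frac))
    unseen = set(subjects[:n_unseen])
    known = {s: dataset[s] for s in subjects[n_unseen:]}
    train, test = closed_set_split(known, train_frac, seed)
    test += [(img, "Unknown") for s in sorted(unseen) for img in dataset[s]]
    return train, test, unseen


data = make_dataset()
closed_train, closed_test = closed_set_split(data)
open_train, open_test, unseen_ids = open_set_split(data)
print(len(closed_train), len(closed_test))
print(len(open_train), len(open_test), len(unseen_ids))
```

With 15 images per subject, a 70% cut yields 10 training and 5 test images per person; repeating either split with different seeds gives the randomized folds.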

Under both protocols, evaluation was repeated over 10 randomized folds, and the metrics were averaged and reported with 95% confidence intervals to account for variability.
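A common way to obtain such an interval from 10 fold scores is the Student's t interval around the mean. The paper does not state which interval construction was used, so the choice below is an assumption, and the fold accuracies are made-up values for illustration, not results from the study.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 9 degrees of freedom (10 folds).
T_CRIT_95_DF9 = 2.262


def mean_ci95(scores, t_crit=T_CRIT_95_DF9):
    """Return (mean, lower, upper) of a t-based 95% confidence interval."""
    n = len(scores)
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / math.sqrt(n)  # standard error of the mean
    half_width = t_crit * sem
    return mean, mean - half_width, mean + half_width


# Made-up per-fold accuracies, for illustration only.
fold_accuracy = [0.93, 0.94, 0.92, 0.95, 0.94, 0.93, 0.94, 0.95, 0.93, 0.94]
mean, low, high = mean_ci95(fold_accuracy)
print(f"mean={mean:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

If a different number of folds is used, the critical value must be replaced with the one for the matching degrees of freedom.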

Performance metrics

The performance metrics considered are the following:

False Acceptance Rate (FAR): The percentage of unauthorized users wrongly accepted.
False Rejection Rate (FRR): The percentage of authorized users wrongly rejected.
Equal Error Rate (EER): Error rate when FAR is equal to FRR.
Receiver Operating Characteristic (ROC) Curve: Graph showing trade-off between sensitivity and specificity.
Detection Error Tradeoff (DET) Curve: Shows the compromise between FAR and FRR on a logarithmic scale.
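For a distance-based matcher, these error rates follow directly from two sets of distances: genuine pairs (same person) and impostor pairs (different people). The sketch below assumes a face is accepted when its distance is at or below the threshold, and approximates the EER by sweeping the observed distances as candidate thresholds. The distance values are invented for illustration and are not measurements from the paper.

```python
def far_frr(genuine, impostor, threshold):
    """FAR: share of impostor pairs accepted. FRR: share of genuine pairs rejected."""
    far = sum(d <= threshold for d in impostor) / len(impostor)
    frr = sum(d > threshold for d in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Sweep candidate thresholds and return (EER estimate, threshold) where FAR and FRR are closest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Invented distances, for illustration only.
genuine = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.62, 0.70]
impostor = [0.58, 0.65, 0.72, 0.80, 0.85, 0.90, 0.95, 1.00]

far, frr = far_frr(genuine, impostor, 0.6)
eer, eer_threshold = equal_error_rate(genuine, impostor)
print(f"FAR={far:.3f} FRR={frr:.3f} at threshold 0.6")
print(f"EER={eer:.3f} at threshold {eer_threshold}")
```

Plotting the (FAR, FRR) pairs from the same sweep on logarithmic axes gives the DET curve, and plotting 1 - FRR against FAR gives the ROC curve.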

The performance metrics show how effective the suggested strategy is, as shown in Table 6.

Table 6 Performance metrics results.

Fig. 2 shows the result of the face recognition module in identifying individuals. Subfigures (a) and (c) show faces that were recognized correctly, with bounding boxes and labels correctly assigned. Subfigure (b), however, has the face tagged as unknown, indicating that the system can also deal with new and unregistered individuals. The figure highlights the module's ability to distinguish between known and unknown users, as well as the debugging aids that can be visualized during development. Table 6 reports the performance metrics results, and Table 7 compares closed-set and open-set performance.

Table 7 Closed-set versus open-set performance.

ROC and DET analysis

The ROC curve (Fig. 3) yields an AUC of 0.96 and 0.91 for closed-set and open-set scenarios, respectively, denoting very high discrimination ability under controlled testing conditions but slightly diminished performance in handling unseen faces.
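For readers who want to reproduce an AUC figure from raw match distances, the area under the ROC curve equals the probability that a randomly chosen genuine pair has a smaller distance than a randomly chosen impostor pair, with ties counted as one half. The sketch below uses that rank-based definition; the function name and the distance values are illustrative assumptions and do not correspond to the reported 0.96 or 0.91.

```python
def roc_auc_from_distances(genuine, impostor):
    """AUC as the fraction of (genuine, impostor) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for g in genuine:
        for i in impostor:
            if g < i:
                wins += 1.0
            elif g == i:
                wins += 0.5
    return wins / (len(genuine) * len(impostor))


# Invented distances, for illustration only.
genuine = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.62, 0.70]
impostor = [0.58, 0.65, 0.72, 0.80, 0.85, 0.90, 0.95, 1.00]

auc = roc_auc_from_distances(genuine, impostor)
print(f"AUC={auc:.4f}")
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a few hundred distances; larger evaluations would normally sort the scores instead.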

The DET curve (Fig. 4) demonstrates increases in the FAR at corresponding FRR values in open-set cases, which further highlight the necessity of dynamic thresholding in constrained environments.

Inclusion of open-set evaluation and confidence intervals resolves a major drawback of many Raspberry Pi-based aids that typically give single-run accuracy without considering real-world variation. Our results indicate:

In closed-set protocols, recognition results confirm the identity of people known to the system with little to no false alarms.
In open-set protocols, the evaluation characterizes how the system behaves when it encounters strangers, a vital consideration for public deployment.
The ROC and DET visualizations enable one to tune thresholds according to one’s tolerance for false acceptance versus false rejection.

Future work will explore an incremental learning framework that allows new users to be added to the running system without retraining and that supports dynamic threshold settings based on environmental context (e.g., lighting, crowd density).