
Master theses

Current and past ideas and concepts for Master Theses.

Semantic Scene Understanding with Vision-Language Models for Safer Aerial Robot Motion Planning

Subject

If a single student or a group of students is interested in this subject, please contact ir. Mohammad Javad Zallaghi, PhD Candidate (Mohammad.Javad.Zallaghi@vub.be), dr. ir. Bryan Convens (bryan.convens@vub.be), Prof. dr. ir. Bram Vanderborght (bram.vanderborght@vub.be), and Prof. dr. ir. Adrian Munteanu (adrian.munteanu@vub.be) for further information.
Javad will be responsible for weekly technical guidance during the project. To learn more about Javad’s robotics research, see https://mjavadzallaghi.github.io/.
Period: Academic year 2025-2026, first and second session deadlines for written/oral defense in May/June and August/September, respectively.
Who: Any interested student or team of students with experience in at least one aspect of robotics, learning-based control and autonomous robots, or learning-based vision and perception algorithms should email us. Javad will organize a meeting to answer questions and to discuss, based on the student's background and interests, which aspects the project could focus on.


Autonomous aerial robots navigating unknown, obstacle-cluttered environments must make fast decisions to avoid collisions based on information from the perception pipeline of the autonomy stack. State-of-the-art navigation pipelines often rely on geometric perception (data streams from depth cameras and lidar sensors) and reactive safety filters (such as Reference Governors or Control Barrier Functions) to prevent crashes. However, purely geometric understanding can miss crucial context: for example, a depth camera might not perceive a glass window, or might not understand that a human is a delicate, moving obstacle requiring a larger safety margin than a static obstacle. Recent advances in Vision-Language Models (VLMs), such as OpenAI's CLIP and GPT-4 Vision, enable AI systems to interpret visual scenes with rich semantic understanding and even follow natural language instructions. This opens up new possibilities to make aerial robot navigation perception-aware in a deeper sense: beyond geometry, the robot can understand what objects are and which areas are semantically unsafe or off-limits.

Kind of work

For this MA2 project the main research question is:
“Can a VLM provide spatial and semantic cues (such as online semantic perception data reflecting the safety requirements of navigation) that help a robot avoid collisions more safely than depth data alone?”

A VLM can enhance an aerial robot's perception pipeline by identifying the types of objects and structures in its environment, not just their distance. For instance, the robot's camera feed could be processed by a pre-trained model [1] (such as CLIP or an open-vocabulary detector) to recognize obstacles like “glass” (invisible to a depth sensor) or “people”. This semantic knowledge is then used to inform the navigation policy or safety layer, enabling the robot to navigate more safely. Recent research in autonomous driving has shown that CLIP-based vision models are robust at learning visual concepts from natural language supervision and can be optimized for real-time edge deployment [2]. These models enable open-set recognition: the ability to detect objects that were not explicitly seen during training by using text queries for those objects [3]. This is particularly useful for aerial robots in unknown environments, as the robot can query “is there a thick or a thin tree in front of me?” and get a detection response from the VLM even without a detector specialized for that class. Fusing the VLM output with the geometric occupancy map then reveals perceptually occupied regions that were missed in the depth frame, such as glass. If the VLM detects a glass pane ahead, the resulting semantic occupancy input map allows the navigation policy to act more conservatively and prevent collisions while navigating.
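As a concrete illustration of this open-set, text-query recognition, the short Python sketch below scores a single camera frame against a set of free-form hazard prompts with a pre-trained CLIP model from the Hugging Face transformers library. The checkpoint name, prompt list, and input file are illustrative assumptions rather than choices fixed by this project.

# Minimal sketch: zero-shot hazard recognition on one camera frame with CLIP.
# The checkpoint, prompts, and image path are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Open-set text queries: a new hazard class is just a new prompt, no retraining.
prompts = ["a glass window", "a person", "a thick tree trunk", "a thin tree branch", "open space"]

frame = Image.open("camera_frame.png")  # placeholder for the robot's RGB stream
inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (num_images, num_prompts); softmax gives per-prompt scores.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for label, p in zip(prompts, probs):
    print(f"{label}: {p.item():.2f}")

In practice the prompt list could be extended or swapped at runtime, which is exactly the open-set property described above; a detector-based variant (see the sketch in the next section) would additionally localize each hazard in the image.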

Framework of the Thesis

This project could integrate an open-vocabulary object detection model into the robot's perception stack. OWL-ViT (Open-Vocabulary Vision Transformer) from Google or a CLIP-based region classifier could be used to scan the RGB image for specified hazard classes (like “glass” or “human”). Tools like NVIDIA's NanoOWL (which optimizes OWL-ViT for Jetson platforms) demonstrate real-time performance of such models on Orin-based hardware [5]. The perception stack must generate a semantic occupancy grid or annotated depth map in which areas corresponding to recognized hazards are marked as occupied or high-cost for the robot's motion planner. An interesting case study is detecting transparent or reflective obstacles: a VLM can infer the presence of a glass wall from visual cues (reflections, frames) and label that region accordingly. Another interesting case is detecting trees, their different parts, and the level of clutter of different zones in a forest image, and building a semantic occupancy grid of the environment for the planner. The expected outcome of this project is a perception system whose navigation policy is less prone to “surprise” collisions, because it is augmented with a semantic understanding of the scene. As an example, a scene-aware safety filter that tracks moving objects and identifies risky observations has boosted an aerial robot navigation system's success rate by over 64% [5]. In this project, we propose using VLMs as the mechanism to achieve such scene awareness (identifying objects/regions that a purely geometric method might miss), thereby reducing collisions in complex environments.
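A minimal sketch of one way such a pipeline could look is given below, assuming the public Hugging Face OWL-ViT checkpoint and the transformers post-processing API; the hazard classes, detection threshold, and the dummy geometric map are placeholders, and a real system would instead consume the onboard camera stream and the depth/lidar occupancy grid.

# Sketch (one possible pipeline, not the project's fixed design): open-vocabulary
# hazard detection with OWL-ViT, turned into a per-pixel semantic mask that can be
# fused with a geometric occupancy map for the planner.
import numpy as np
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

hazard_queries = [["a glass window", "a person", "a tree trunk"]]  # illustrative classes
frame = Image.open("camera_frame.png")                             # placeholder RGB frame

inputs = processor(text=hazard_queries, images=frame, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to boxes in pixel coordinates (threshold is an assumption).
target_sizes = torch.tensor([frame.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, threshold=0.2, target_sizes=target_sizes)[0]

# Mark detected hazard boxes in a boolean semantic mask over the image plane.
semantic_mask = np.zeros((frame.size[1], frame.size[0]), dtype=bool)
for box, score, label in zip(detections["boxes"], detections["scores"], detections["labels"]):
    x0, y0, x1, y1 = [int(v) for v in box.tolist()]
    semantic_mask[y0:y1, x0:x1] = True
    print(f"{hazard_queries[0][int(label)]}: score {score.item():.2f}")

# Fuse with a depth-based occupancy mask (here a dummy array standing in for the
# geometric pipeline): a cell counts as occupied if either source flags it.
depth_occupancy = np.zeros_like(semantic_mask)   # placeholder geometric map
fused_occupancy = np.logical_or(depth_occupancy, semantic_mask)

The fused map would then serve as the semantic occupancy input to the motion planner or safety filter described above; on embedded hardware, the same pattern could run on an optimized variant such as NanoOWL.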

Number of Students

1-2

Expected Student Profile

Knowledge of Python/C/C++
Experience with AI and deep learning frameworks
Familiarity with VLMs

Promotor

Prof. Dr. Ir. Adrian Munteanu

+32 (0)2 629 1684

acmuntea@etrovub.be


Supervisor

Mr. Mohammad Zallaghi

+32 (0)2 629 1529

mjzallag@etrovub.be


Image

Glass occupancy detection using an efficient VLM as an implementation scenario for this project. While a traditional navigation policy relies on the depth frame (top right), a more reliable navigation policy should also avoid obstacles that are not visible in the depth frame. A VLM-based occupancy map (bottom right) would provide that perception for the autonomy stack.

