Adversarial Attacks in Vision-Language Models
Under the guidance of Professor Zsolt Kira, I led a project exploring adversarial attacks on Vision-Language Models (VLMs). Using the CLIP text encoder, I extracted embeddings of harmful concepts such as nudity and violence, injected them into the latent space, and generated optimized adversarial prompts with both gradient-based (PEZ) and non-gradient-based (genetic algorithm) methods. These optimized prompts bypassed the safety mechanisms of text-to-image (T2I) models such as Stable Diffusion and Flux, enabling the generation of inappropriate images.
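To make the non-gradient branch concrete, here is a minimal sketch of a genetic-algorithm prompt search against the CLIP text encoder. It assumes a standard Hugging Face CLIP checkpoint; the prompt length, population size, mutation rate, and the benign stand-in concept are illustrative placeholders, not the project's actual settings.

```python
import random

import torch
from transformers import CLIPModel, CLIPTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def text_embedding(texts):
    """L2-normalised CLIP text embeddings for a list of strings."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(device)
    feats = model.get_text_features(**batch)
    return feats / feats.norm(dim=-1, keepdim=True)


# Target concept embedding (a benign placeholder standing in for the harmful concepts).
target = text_embedding(["a red apple on a wooden table"])  # shape (1, dim)

vocab_ids = list(tokenizer.get_vocab().values())
PROMPT_LEN, POP_SIZE, GENERATIONS, MUT_RATE = 8, 64, 100, 0.1  # illustrative values


def decode(ids):
    return tokenizer.decode(ids, skip_special_tokens=True)


def fitness(population):
    """Cosine similarity between each candidate prompt and the target concept."""
    embs = text_embedding([decode(ids) for ids in population])
    return (embs @ target.T).squeeze(-1)


population = [[random.choice(vocab_ids) for _ in range(PROMPT_LEN)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    scores = fitness(population)
    ranked = [population[i] for i in scores.argsort(descending=True).tolist()]
    parents = ranked[: POP_SIZE // 4]  # truncation selection
    children = []
    while len(children) < POP_SIZE - len(parents):
        a, b = random.sample(parents, 2)
        cut = random.randint(1, PROMPT_LEN - 1)  # single-point crossover
        child = a[:cut] + b[cut:]
        child = [random.choice(vocab_ids) if random.random() < MUT_RATE else tok
                 for tok in child]  # token-level mutation
        children.append(child)
    population = parents + children

scores = fitness(population)
print("best prompt:", decode(population[scores.argmax().item()]),
      "| similarity:", round(scores.max().item(), 3))
```

The gradient-based PEZ variant targets the same cosine-similarity objective, but updates continuous token embeddings and projects them back onto the vocabulary rather than searching discretely.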
Building on this work, I am now extending these attacks to other modalities, such as audio and depth, using ImageBind. This research aims to uncover vulnerabilities across multi-modal systems, enhancing our understanding of adversarial techniques and reinforcing my commitment to advancing trustworthy AI.
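Because ImageBind maps every modality into a single shared embedding space, a concept embedding derived from text can, in principle, be scored against audio or depth embeddings as well. Below is a minimal sketch of that cross-modal probe, assuming the API of the public ImageBind repository; the audio file path and the concept text are hypothetical placeholders.

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pretrained ImageBind model (downloads the imagebind_huge checkpoint on first use).
model = imagebind_model.imagebind_huge(pretrained=True).to(device).eval()

# A concept description and a candidate audio clip (placeholder path).
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(["a dog barking loudly"], device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(["samples/dog_bark.wav"], device),
}

with torch.no_grad():
    embeddings = model(inputs)

# How close does the audio sit to the text concept in the shared space?
text_emb = torch.nn.functional.normalize(embeddings[ModalityType.TEXT], dim=-1)
audio_emb = torch.nn.functional.normalize(embeddings[ModalityType.AUDIO], dim=-1)
print("text-audio cosine similarity:", (text_emb * audio_emb).sum(dim=-1).item())
```

In principle, this similarity score could replace the CLIP fitness function in the sketch above, turning the text-only prompt search into a cross-modal one.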
This project has strengthened my expertise in adversarial attacks, prompt optimization, and multi-modal vulnerabilities.