research projects
I'm currently working on projects that further explore the integration of LLMs and VLMs for multimodal reasoning, fact-checking, and tool use. These projects range from curiosity-driven experiments to working prototypes for the production deployment of deep learning systems on cloud infrastructure. Each project below demonstrates a practical application, an experimental methodology, or an insight into how LLMs and VLMs work.
vlm
Factual image captioning using VLM and knowledge graphs
VLMs struggle to produce factually correct image captions: they either generate partial facts or hallucinate outright, and increasing model size can make this worse rather than better. My experiments suggest that hallucination in VLMs is often linked to poor hierarchical knowledge, since spatial and relational structure is a crucial part of visual understanding. I show that incorporating hierarchical, structured knowledge such as country → city → landmark improves the factuality of VLM captions and is a promising path to explore. I also systematically ablate different kinds of hierarchical knowledge augmentation and compare their results; a minimal sketch of the augmentation step appears after the metadata below.
- VLMs' ability to fact-check benefits more from hierarchical knowledge than from simply scaling up models
- hierarchical knowledge augmentation helps the model traverse more nuanced reasoning pathways
Models: Qwen/Qwen2-VL-2B-Instruct
Frameworks/Libraries: Graphx, Python
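A minimal sketch of the augmentation idea, assuming a NetworkX-style graph library; the entities and the prompt template here are illustrative, not the exact ones from my experiments:

```python
# Build a country -> city -> landmark hierarchy, retrieve the ancestry path
# for a recognized entity, and prepend it to the VLM captioning prompt.
import networkx as nx

def build_hierarchy():
    g = nx.DiGraph()
    # (parent, child) edges encode the hierarchy: country -> city -> landmark
    g.add_edges_from([
        ("France", "Paris"),
        ("Paris", "Eiffel Tower"),
        ("Italy", "Rome"),
        ("Rome", "Colosseum"),
    ])
    return g

def hierarchy_context(g, entity):
    # Walk up the hierarchy to the root and return the path as a textual fact.
    path = [entity]
    while True:
        parents = list(g.predecessors(path[-1]))
        if not parents:
            break
        path.append(parents[0])
    return " -> ".join(reversed(path))  # e.g. "France -> Paris -> Eiffel Tower"

def augmented_prompt(entity):
    context = hierarchy_context(build_hierarchy(), entity)
    return (f"Known hierarchy: {context}.\n"
            f"Using only facts consistent with this hierarchy, "
            f"caption the image showing {entity}.")

print(augmented_prompt("Eiffel Tower"))
```

The augmented prompt is then passed to the VLM (here, Qwen2-VL) alongside the image, so the model grounds its caption in the structured path instead of free-associating facts.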
computer vision
Extracting reusable filters from Diffusion models for controlled image manipulation
This project leverages a diffusion model to compute a semantic difference matrix from the text embeddings of two prompts, representing the transformation between them in latent space. Applying this difference matrix as a filter to the model's latents makes prompt-driven changes visually explorable and captures transferable semantic shifts that can be reused across different images; a sketch follows below.
- I was fascinated to discover that a prompt difference matrix directly maps to visual changes in both latent and pixel spaces.
Models: Stable Diffusion 1.5
Frameworks/Libraries: Python, Hugging Face
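A minimal sketch of the difference-matrix idea, assuming the `diffusers` library, the `runwayml/stable-diffusion-v1-5` checkpoint, and a CUDA GPU; the prompts are illustrative:

```python
# Compute a "difference filter" between two prompt embeddings and reapply it
# to a different subject to transfer the same semantic shift.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def embed(prompt):
    # Encode a prompt into CLIP text embeddings (shape: [1, 77, 768]).
    tokens = pipe.tokenizer(
        prompt, padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        return_tensors="pt",
    ).input_ids.to("cuda")
    with torch.no_grad():
        return pipe.text_encoder(tokens)[0]

# The difference matrix captures the shift "photo -> watercolor painting".
delta = embed("a watercolor painting of a house") - embed("a photo of a house")

# Reapply the same shift to a different image subject.
steered = embed("a photo of a mountain") + delta
image = pipe(prompt_embeds=steered).images[0]
image.save("mountain_watercolor.png")
```

Because the difference lives in embedding space rather than pixel space, the same `delta` can be added to embeddings of unrelated prompts, which is what makes the filter reusable.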
llm
Identifying RAG outputs in an LLM-generated response
When an LLM response is augmented with externally retrieved knowledge, it is hard to cross-check which parts of the response actually come from that external knowledge. I demonstrate a simple technique that blends chain-of-thought prompting with token manipulation. The result is a practical method that highlights the retrieved parts of a generated response in a different color, making the output more interpretable; a sketch of the highlighting step appears below.
- LLMs often know when they lack information; the more compact the model, the stronger this ability
- they can tag external information that was not present in their parametric knowledge
Models: Llama-3.2-Instruct
Frameworks/Libraries: Python
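A minimal sketch of the tagging-and-highlighting step; the `<ext>` / `</ext>` delimiters are hypothetical markers the model is instructed to emit around retrieved facts, not a standard of any library:

```python
# Instruct the model to tag retrieved spans, then post-process its output,
# replacing the tags with ANSI color codes so retrieved text stands out.
import re

SYSTEM_PROMPT = (
    "Answer the question. Whenever you use a fact from the provided context "
    "rather than your own knowledge, wrap that span in <ext> and </ext> tags."
)

def highlight(response, color="\033[93m", reset="\033[0m"):
    # Swap each tagged span for the same text wrapped in color escape codes.
    return re.sub(
        r"<ext>(.*?)</ext>",
        lambda m: f"{color}{m.group(1)}{reset}",
        response,
        flags=re.DOTALL,
    )

# Stand-in model output for demonstration purposes.
demo = "The capital is <ext>Canberra, established in 1913</ext>, not Sydney."
print(highlight(demo))
```

In the actual pipeline the `SYSTEM_PROMPT` is combined with the retrieved context and sent to Llama-3.2-Instruct; the post-processing step works the same regardless of the model.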
Image feature aggregation using attention mechanisms
I demonstrate feature aggregation: producing a condensed view of an image by applying attention over its parts. I use pretrained CLIP embeddings, which already encode visual knowledge, break the image into patches, and compute attention over the patch embeddings; a from-scratch sketch follows below.
- I built this simple project to exercise my understanding of attention mechanisms by coding up one from scratch.
- Interpreting image tokens using simple self-attention and multi-head attention.
Models: CLIP
Frameworks/Libraries: Python
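A minimal from-scratch sketch of single-head self-attention over patch embeddings; the patch tensor here is random stand-in data rather than real CLIP output, and the projection sizes are illustrative:

```python
# Aggregate patch embeddings into a condensed image vector via self-attention.
import torch
import torch.nn.functional as F

def self_attention(patches, d_k=64):
    # patches: [num_patches, dim], e.g. a ViT-style grid of 49 patches of dim 768.
    dim = patches.shape[-1]
    # Random projection matrices stand in for learned Q/K/V weights.
    w_q = torch.randn(dim, d_k) / dim**0.5
    w_k = torch.randn(dim, d_k) / dim**0.5
    w_v = torch.randn(dim, dim) / dim**0.5
    q, k, v = patches @ w_q, patches @ w_k, patches @ w_v
    # Scaled dot-product attention: [num_patches, num_patches] weights.
    attn = F.softmax(q @ k.T / d_k**0.5, dim=-1)
    return attn @ v, attn  # attended features + attention weights

patches = torch.randn(49, 768)  # stand-in for CLIP patch embeddings
aggregated, weights = self_attention(patches)
# Mean-pool the attended patches into one condensed image vector.
condensed = aggregated.mean(dim=0)
print(condensed.shape)  # torch.Size([768])
```

Extending this to multi-head attention just means running several such projections in parallel and concatenating the outputs; the attention weight matrix is what I visualize to interpret which patches contribute to each image token.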
demo
Prompt Autocomplete App - an end-to-end ML workflow pipeline
A production workflow for building, training, deploying, and scaling a transformer-based model that auto-completes prompts for the text-to-image use case; a sketch of the train-and-deploy step appears below.
Models: DistilBERT
Frameworks/Libraries: AWS SageMaker, AWS Lambda, Hugging Face PyTorch wrappers for training and deploying models, AWS API Gateway
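A minimal sketch of the train-and-deploy step using the SageMaker Hugging Face wrappers; the entry point, IAM role, S3 path, and framework versions are placeholders, not the exact values from my pipeline:

```python
# Train a DistilBERT-based prompt-completion model on SageMaker and deploy it
# behind a real-time endpoint; API Gateway + Lambda then front that endpoint.
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point="train.py",  # fine-tuning script for prompt completion
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder IAM role
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    hyperparameters={"model_name": "distilbert-base-uncased", "epochs": 3},
)
estimator.fit({"train": "s3://my-bucket/prompt-dataset"})  # placeholder S3 path

predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
print(predictor.predict({"inputs": "a photo of a"}))
```

Scaling is then a matter of endpoint configuration (instance count or auto-scaling policies) rather than code changes, which is what makes this workflow production-friendly.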