research projects
I'm currently working on projects that further explore the integration of LLMs and VLMs for multimodal reasoning, fact-checking, and tool use. These projects range from curiosity-driven experiments to working prototypes for the production deployment of deep learning systems on cloud infrastructure. Each project below demonstrates a practical application, an experimental methodology, or an insight into how LLMs and VLMs work.
vlm
Factual image captioning using VLM and knowledge graphs
VLMs struggle to produce factually correct image captions: they either generate partial facts or hallucinate outright, and increasing model size can make this worse rather than better. My experiments suggest that hallucination in VLMs is often linked to poor hierarchical knowledge, since spatial and relational structure is a crucial part of visual understanding. I show that incorporating hierarchical, structured knowledge such as country → city → landmark improves the factuality of VLM captions and is a promising path to explore. I also systematically ablate different kinds of hierarchical knowledge augmentation and compare their results; a minimal sketch of the augmentation step appears after the metadata below.
- VLMs' ability to fact-check benefits more from hierarchical knowledge than from simply scaling up models
- hierarchical knowledge augmentation helps the model traverse more nuanced reasoning pathways
Models: Qwen/Qwen2-VL-2B-Instruct
Frameworks/Libraries: Graphx, Python
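A minimal sketch of the augmentation idea, assuming a NetworkX-style graph library; the entities and the prompt template here are illustrative, not the exact ones from my experiments:

```python
# Build a country -> city -> landmark hierarchy, retrieve the ancestry path
# for a recognized entity, and prepend it to the VLM captioning prompt.
import networkx as nx

def build_hierarchy():
    g = nx.DiGraph()
    # (parent, child) edges encode the hierarchy: country -> city -> landmark
    g.add_edges_from([
        ("France", "Paris"),
        ("Paris", "Eiffel Tower"),
        ("Italy", "Rome"),
        ("Rome", "Colosseum"),
    ])
    return g

def hierarchy_context(g, entity):
    # Walk up the hierarchy to the root and return the path as a textual fact.
    path = [entity]
    while True:
        parents = list(g.predecessors(path[-1]))
        if not parents:
            break
        path.append(parents[0])
    return " -> ".join(reversed(path))  # e.g. "France -> Paris -> Eiffel Tower"

def augmented_prompt(entity):
    context = hierarchy_context(build_hierarchy(), entity)
    return (f"Known hierarchy: {context}.\n"
            f"Using only facts consistent with this hierarchy, "
            f"caption the image showing {entity}.")

print(augmented_prompt("Eiffel Tower"))
```

The augmented prompt is then passed to the VLM (here, Qwen2-VL) alongside the image, so the model grounds its caption in the structured path instead of free-associating facts.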
computer vision
Extracting reusable filters from Diffusion models for controlled image manipulation
This project leverages a diffusion model to compute a semantic difference matrix from the text embeddings of two prompts, representing the transformation between them in latent space. Applying this difference matrix as a filter to the model's latents makes prompt-driven changes visually explorable and captures transferable semantic shifts that can be reused across different images; a sketch follows below.
- I was fascinated to discover that a prompt difference matrix directly maps to visual changes in both latent and pixel spaces.
Models: Stable Diffusion 1.5
Frameworks/Libraries: Python, Hugging Face
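A minimal sketch of the difference-matrix idea, assuming the `diffusers` library, the `runwayml/stable-diffusion-v1-5` checkpoint, and a CUDA GPU; the prompts are illustrative:

```python
# Compute a "difference filter" between two prompt embeddings and reapply it
# to a different subject to transfer the same semantic shift.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def embed(prompt):
    # Encode a prompt into CLIP text embeddings (shape: [1, 77, 768]).
    tokens = pipe.tokenizer(
        prompt, padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        return_tensors="pt",
    ).input_ids.to("cuda")
    with torch.no_grad():
        return pipe.text_encoder(tokens)[0]

# The difference matrix captures the shift "photo -> watercolor painting".
delta = embed("a watercolor painting of a house") - embed("a photo of a house")

# Reapply the same shift to a different image subject.
steered = embed("a photo of a mountain") + delta
image = pipe(prompt_embeds=steered).images[0]
image.save("mountain_watercolor.png")
```

Because the difference lives in embedding space rather than pixel space, the same `delta` can be added to embeddings of unrelated prompts, which is what makes the filter reusable.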
llm
Identifying RAG outputs in an LLM-generated response
When an LLM response is augmented with externally retrieved knowledge, it is hard to cross-check which parts of the response actually come from that external knowledge. I demonstrate a simple technique that blends chain-of-thought prompting with token manipulation. The result is a practical method that highlights the retrieved parts of a generated response in a different color, making the output more interpretable; a sketch of the highlighting step appears below.
- LLMs often know when they lack information; the more compact the model, the stronger this ability
- they can tag external information that was not present in their parametric knowledge
Models: Llama-3.2-Instruct
Frameworks/Libraries: Python
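A minimal sketch of the tagging-and-highlighting step; the `<ext>` / `</ext>` delimiters are hypothetical markers the model is instructed to emit around retrieved facts, not a standard of any library:

```python
# Instruct the model to tag retrieved spans, then post-process its output,
# replacing the tags with ANSI color codes so retrieved text stands out.
import re

SYSTEM_PROMPT = (
    "Answer the question. Whenever you use a fact from the provided context "
    "rather than your own knowledge, wrap that span in <ext> and </ext> tags."
)

def highlight(response, color="\033[93m", reset="\033[0m"):
    # Swap each tagged span for the same text wrapped in color escape codes.
    return re.sub(
        r"<ext>(.*?)</ext>",
        lambda m: f"{color}{m.group(1)}{reset}",
        response,
        flags=re.DOTALL,
    )

# Stand-in model output for demonstration purposes.
demo = "The capital is <ext>Canberra, established in 1913</ext>, not Sydney."
print(highlight(demo))
```

In the actual pipeline the `SYSTEM_PROMPT` is combined with the retrieved context and sent to Llama-3.2-Instruct; the post-processing step works the same regardless of the model.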
Image feature aggregation using attention mechanisms
I demonstrate feature aggregation: producing a condensed view of an image by applying attention over its parts. I use pretrained CLIP embeddings, which already encode visual knowledge, break the image into patches, and compute attention over the patch embeddings; a from-scratch sketch follows below.
- I built this simple project to exercise my understanding of attention mechanisms by coding up one from scratch.
- Interpreting image tokens using simple self-attention and multi-head attention.
Models: CLIP
Frameworks/Libraries: Python
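A minimal from-scratch sketch of single-head self-attention over patch embeddings; the patch tensor here is random stand-in data rather than real CLIP output, and the projection sizes are illustrative:

```python
# Aggregate patch embeddings into a condensed image vector via self-attention.
import torch
import torch.nn.functional as F

def self_attention(patches, d_k=64):
    # patches: [num_patches, dim], e.g. a ViT-style grid of 49 patches of dim 768.
    dim = patches.shape[-1]
    # Random projection matrices stand in for learned Q/K/V weights.
    w_q = torch.randn(dim, d_k) / dim**0.5
    w_k = torch.randn(dim, d_k) / dim**0.5
    w_v = torch.randn(dim, dim) / dim**0.5
    q, k, v = patches @ w_q, patches @ w_k, patches @ w_v
    # Scaled dot-product attention: [num_patches, num_patches] weights.
    attn = F.softmax(q @ k.T / d_k**0.5, dim=-1)
    return attn @ v, attn  # attended features + attention weights

patches = torch.randn(49, 768)  # stand-in for CLIP patch embeddings
aggregated, weights = self_attention(patches)
# Mean-pool the attended patches into one condensed image vector.
condensed = aggregated.mean(dim=0)
print(condensed.shape)  # torch.Size([768])
```

Extending this to multi-head attention just means running several such projections in parallel and concatenating the outputs; the attention weight matrix is what I visualize to interpret which patches contribute to each image token.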
demo
Prompt Autocomplete App - an end-to-end ML workflow pipeline
A production workflow for building, training, deploying, and scaling a transformer-based model that auto-completes prompts for the text-to-image use case; a sketch of the train-and-deploy step appears below.
Models: DistilBERT
Frameworks/Libraries: AWS SageMaker, AWS Lambda, Hugging Face PyTorch wrappers for training and deploying models, AWS API Gateway
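A minimal sketch of the train-and-deploy step using the SageMaker Hugging Face wrappers; the entry point, IAM role, S3 path, and framework versions are placeholders, not the exact values from my pipeline:

```python
# Train a DistilBERT-based prompt-completion model on SageMaker and deploy it
# behind a real-time endpoint; API Gateway + Lambda then front that endpoint.
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point="train.py",  # fine-tuning script for prompt completion
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder IAM role
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    hyperparameters={"model_name": "distilbert-base-uncased", "epochs": 3},
)
estimator.fit({"train": "s3://my-bucket/prompt-dataset"})  # placeholder S3 path

predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
print(predictor.predict({"inputs": "a photo of a"}))
```

Scaling is then a matter of endpoint configuration (instance count or auto-scaling policies) rather than code changes, which is what makes this workflow production-friendly.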