 # Overview

This project illustrates how to use **Multimodal LLMs** in Dataiku DSS.

Motivated by the strong reasoning capabilities of LLMs, researchers are now developing Multimodal Large Language Models (M-LLM) to extend these capabilities to the vision domain, creating a new research hotspot. 

M-LLMs are able to interpret and generate human-like language while simultaneously processing and integrating information from multiple data modalities to enhance its comprehension and expressive capabilities. Data modalities encompass diverse types such as text, image, audio, and tabular data. 

![multimodal LLMs.png](IZyK87lKnwa2)

M-LLMs can revolutionize tasks like image captioning, visual question answering (VQA), cross-modal retrieval, translation, content creation, and accessibility. They adeptly handle diverse modalities enabling versatile applications.

Our focus will narrow to **images and text as input sources** and **text as output** of the M-LLMs.

This project covers the following aspects:

- M-LLMs for **Image Understanding** using zero-shot approaches
 - **Visual question answering** using LENS, IDEFICS and GPT-4(Vision)
 - **Image captioning** using IDEFICS and GPT-4(Vision)
- M-LLMs for **Computer Vision tasks** using zero-shot approaches
 - **Image classification** using GPT-4(Vision)
- **Chatting about images**
- **Captions visualization and listening**

This project can be [downloaded](https://downloads.dataiku.com/public/dss-samples/EX_VISION_LLM/) and instructions to reuse it with your own datasets are provided in the last section.

# Data

This project uses multiple datasets to cover different types of tasks.

The input data is contained in the  [1. Inputs](flow_zone:U5MIxg3) Flow Zone.

## VQA: Answer questions on images

The first [dataset](managed_folder:TVEwE7rl) is a small sample of 5 images from [unsplash.com](https://unsplash.com/fr): 
 - Three images containing various **ingredients like vegetables and condiments**
 - One image corresponding to a **simple classic setting in a kitchen**
 - One image corresponding to **four kittens in a basket**

![dataset1.png](0fxazP1hkkjn)

## Image captioning: Generate descriptive captions from paintings

The second [dataset](managed_folder:rwNI6KPi) is a sample of 9 images of paintings from [unsplash.com](https://unsplash.com/fr). 

![dataset2.png](iMJPM1hGWtPH)

## Image classification: Classify waste

The third [dataset](managed_folder:JWJHJp2t) is a sample of the [RealWaste dataset from UC Irvine](https://archive.ics.uci.edu/dataset/908/realwaste). It contains images of waste items across 9 major material types, collected within an authentic landfill environment. The class labels are Cardboard, Food Organics, Glass, Metal, Miscellaneous Trash, Paper, Plastic, Textile Trash, and Vegetation. Below is an example of images from each class.

![dataset3.png](tUZM2Jnzqn9K)

# Walkthrough

The project is divided into two main parts: firstly, we emphasize the use of M-LLMs for image understanding to perform VQA and image captioning tasks. Secondly, we will explore M-LLMs for computer vision tasks, specifically image classification. All of these tasks are performed using **zero-shot approaches** (i.e. no examples including the ground truth output are provided).

*Theorically, Image Understanding often entails a large panel of tasks, from basic computer vision tasks to interpretation of visual elements (i.e. VQA and image captioning). In this project, Image Understanding refers only to the latter. Basic computer Vision tasks are addressed in a separate section. *

## M-LLMs for Image Understanding

Image Understanding refers to the ability of a system or model to interpret visual information within images. In this section, we focus on VQA and image captioning tasks. 

- **Visual Question Answering (VQA)**  is a task within computer vision that involves developing models capable of answering questions related to visual content, such as images or videos. These models integrate both image understanding and natural language processing to generate accurate responses to queries about the visual input.  
- On the other hand, **Image Captioning** is a task where models generate descriptive and contextually relevant textual captions for images. This involves not only recognizing the objects and scenes within the image but also creating coherent and informative textual descriptions that convey the content of the visual input.  

Both VQA and Image Captioning showcase the intersection of vision and language, highlighting the potential of artificial intelligence to comprehend and generate meaningful insights across multiple modalities.

### Visual Question Answering with M-LLMs

#### Visual Question Answering using LENS

The first approach consists in using a framework called [LENS](https://arxiv.org/abs/2306.16410) (Large Language Models ENhanced to See). LENS has been proposed by [Contextual AI](https://contextual.ai/) and [Stanford University](https://www.stanford.edu/). It combines the power of **independent vision modules** and LLMs to enable **comprehensive multimodal understanding**.

LENS works in two main steps:
1. Rich textual information is extracted using existing vision modules, such as contrastive text-image models (CLIP) and image-captioning models (BLIP). This text information includes tags, attributes, captions.
2. Then, the text is sent to a reasoning module, a frozen LLM which generates answers based on the textual descriptions fed by the vision modules.

![lens.png](fewFMG0aFEum)

<div align=center  style="margin-bottom:20px;"><i>Diagram taken from the <a href="https://arxiv.org/abs/2306.16410">LENS paper</a></i></div>


The originality of their approach lies in 3 aspects: first, LENS handles CV challenges by using language models’ zero-shot, in-context learning capabilities through natural language descriptions of visual inputs. Second, LENS gives any off-the-shelf LLM the ability to see without further training or data. Lastly, they use frozen LLMs to handle object recognition and visual reasoning tasks without additional vision-and-language alignment or multimodal data.

In our project, LENS has been used to answer [3 simple questions](dataset:questions) on these [images](managed_folder:TVEwE7rl):
- *What recipe can you make with these ingredients?*
- *Is there a lemon?*
- *How many cats are in the picture?*

In Dataiku, this approach is illustrated in this [Flow Zone](flow_zone:khT2Xmn). It first consists in using a [Python recipe](recipe:compute_answers_lens) to generate the captions of the images using the `Lens`and `LensProcessor` class. These classes are provided by this [GitHub repository](https://github.com/ContextualAI/lens) and have been downloaded in the [library](https://design.ds-platform.ondku.net/projects/COMPUTERVISIONLLMS/libedition/versioned) of this project. Then, a [Prompt recipe](recipe:compute_captions_generated) is used to apply a LLM (e.g. GPT-4) on the captions with the following prompt: 
```
Provide a short answer to the question
```

The results of this approach are contained in this [dataset](dataset:answers). The column `captions` contains the captions of the images generated by the `LensProcessor` object while the column `llm_output` contains the answers. 

The captions obtained by the `LensProcessor` object are not always consensual. For instance, the number of cats in the image is indicated to be 3, 4 or 5. As a result, the LLM could be misled. 
Thus, this approach really depends on the results of the caption generation step. 

#### Visual Question Answering using IDEFICS

The second approach consists in using an open-source M-LLM called [IDEFICS](https://arxiv.org/abs/2306.16527) (Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS). IDEFICS has been released by [Hugging Face](https://huggingface.co/blog/idefics) and has been the **first open-access visual language model at the 80B scale**. IDEFICS is a **reproduction of [Flamingo](https://arxiv.org/abs/2204.14198)** (Apr, 2022), a multimodal model developed by [DeepMind](https://deepmind.google/) which has not been publicly released. It takes **sequence of interleaved images and texts as inputs** and **generates text outputs**. 

As Flamingo, it consists of two parts but pretrained in that case:
 - **A vision encoder**: OpenCLIP, to get image embeddings
 - **A language model**: Llama v1, to get text embeddings
 
![flamingo.png](n8g4hz6TLZwR) 
<div align=center  style="margin-bottom:20px;"><i>Diagram taken from the <a href="https://arxiv.org/abs/2204.14198">Flamingo paper</a></i></div>

The foundation of IDEFICS’ capabilities lies in the data it was trained on (Wikipedia, Public Multimodal Dataset, LAION, and the new OBELICS dataset). 

Within Dataiku, this methodology is showcased in this [Flow Zone](flow_zone:iCcBXiJ) through a single [Python recipe](recipe:compute_idefics_answers). The process involves utilizing the `idefics-9b-instruct` model for inference, where **the model is prompted directly with questions and accompanying images**. In this recipe, IDEFICS is employed to generate responses for a dataset comprising questions and images. The code is structured to optimize the IDEFICS model's memory usage, leveraging **quantization** to diminish its memory footprint. Subsequently, responses are extracted and appended to the dataset for storage. The outcomes are encapsulated within this particular [dataset](dataset:idefics_answers). 

<div class="alert"> <b>Side note: What is Quantization</b>
 <br>
Quantization is a technique used in machine learning to reduce the memory requirements and computational costs of a model. It involves representing numerical values with fewer bits than the original precision. In the context of deep learning, which often involves using 32-bit floating-point numbers, quantization typically reduces the precision to lower bit-width representations, such as 8-bit or even 4-bit integers.
</div>

The results for the questions "Is there a lemon?" and "How many cats are there in the picture?" appear more accurate this time. The model correctly identifies the lemon and indicates its position on the workspace. Additionally, the obtained responses regarding cooking recipes are sensible, with well-defined cooking steps.

#### Visual Question Answering using GPT-4(Vision)

The third approach used to perform VQA is using a proprietary LLM, specifically [GPT-4(Vision)](https://cdn.openai.com/papers/GPTV_System_Card.pdf). GPT-4 with vision (GPT-4V) enables users to **instruct GPT-4 to analyze image inputs** provided by the user. This is the latest GPT-4 capability [OpenAI](https://openai.com/) is making broadly available.

As the architecture of GPT-4V has not been released publicly, here are a few known elements:
- Similar to GPT-4, **training of GPT-4V was completed in 2022** and they began providing early access to the system in March 2023
- As GPT-4 is the technology behind the visual capabilities of GPT-4V, its training process was the same
  - The pre-trained model was first **trained to predict the next word in a document**, using a **large dataset of text and image data** from the Internet as well as licensed sources of data.
  - It was then **fine-tuned with additional data**, using **reinforcement learning from human feedback (RLHF)**, to produce outputs that are preferred by human trainers.
- OpenAI gave a diverse set of alpha users access to GPT-4V earlier this year, including Be My Eyes, an organization that builds tools for visually impaired users, to **conduct a pilot from March 2023 to August 2023**. A key goal of the pilot was to inform **how GPT-4V can be deployed responsibly**.

This [Flow Zone](flow_zone:uzX9hHC) showcases the use of GPT-4V for a VQA task. Similar to IDEFICS, this model requires a prompt type that incorporates images. Consequently, a [Python recipe](recipe:compute_answers_gpt4v) has been utilized. The code encodes images in base64, constructs input messages for the Chat Completion API, calls the OpenAI API, and stores responses in a new dataset in Dataiku, effectively combining text and image inputs for question-answering with GPT-4.

Regarding the [results](dataset:gpt4v_answers), we noticed that this time the model provided very precise answers compared to those of IDEFICS. For instance, for the question *"How many cats are in the picture?"*, the response is as follows: *"There are four cats in the picture. They are all kittens, and they appear to be sitting in a wooden basket outdoors"*. Similarly, for cooking recipes, the answer was much more organized, with the model presenting a structured list of ingredients followed by the recipe steps. 

### Image captioning with M-LLMs

The objective in this section is to generate descriptions of images.

#### Image captioning using IDEFICS

The first approach tested involves utilizing IDEFICS to generate descriptions of the images contained within this [folder](managed_folder:rwNI6KPi). IDEFICS, as detailed in section [Visual Question Answering using IDEFICS](#Visual Question Answering using IDEFICS), accepts both text and image prompts. However, when captioning an image, providing a text prompt to the model is not mandatory; only the preprocessed input image can be provided. In the absence of a text prompt, the model will initiate text generation from the BOS (beginning-of-sequence) token, thus forming a caption. For the image input to the model, either an image object (PIL.Image) or a URL from which the image can be retrieved can be used. 
Optionally, a text prompt can also be provided, which the model will continue based on the given image. Both textual and image prompts can be passed to the model's processor as a single list to generate appropriate inputs.

In our project, we used the following prompt to generate captions : 
```
You are an expert in art. Provide a descriptive caption for the painting.
```
This approach is illustrated in this [Python recipe](recipe:compute_image_captions_IDEFICS). 

The results are gathered in this [dataset](dataset:image_captions_idefics). 

In the results, we observe that:
- The generated **descriptions can vary in length**, ranging from very descriptive to quite concise. For instance, one of the images was described with the caption: "A city street with buildings and cars."
- IDEFICS generates **descriptions that closely match the content of the image** without any added embellishment.
- The **descriptions does not include additional contextual elements** about the image, such as the author or publication date.

#### Image captioning using GPT-4(Vision)

The second approach used to caption the images is to leverage [GPT-4(Vision)](https://cdn.openai.com/papers/GPTV_System_Card.pdf).

This is illustrated in this [Flow Zone](flow_zone:t2qDWhG), where the goal was still to caption a [set of paintings](managed_folder:rwNI6KPi). The following prompt has been submitted to GPT-4(Vision) in combination with each image, in this [Python recipe](recipe:compute_captions_gpt4v): 
```
You are an expert in art. Provide a descriptive caption for the painting.
```

Below is an example of result for one image:
![image captioning example.png](G6J5A10D2ooI)
The remaining results can be accessed either in this [dataset](dataset:images_captions) or by navigating through this [webapp](web_app:7kN1NaD). 

In the results, we can observe several things:
- The **length of the description varies significantly**, ranging from around twenty to a hundred words depending on the images.
- For some images, the **model generates a short title**, for example, "Baroque Splendor: A Celestial Assembly in the Heavens" or "Three Cats in Repose Amidst Foliage and Flowers," in addition to a more detailed description.
- Sometimes, the **model only generates a short title without a detailed description**.
- In general, the **model provides a description of what is present in the image** but does not provide details about the author, the actual title, or the year of creation. If the model is prompted to provide these information, it may succeed for some images, but not all of them. 

## M-LLMs for Computer Vision tasks 

Computer vision encompasses a variety of tasks aimed at understanding and interpreting visual data. Among the fundamental tasks in computer vision are image classification, object detection, and image segmentation. 
 - **Image classification** involves categorizing an entire image into predefined classes or categories. 
 - **Object detection**, on the other hand, goes a step further by not only identifying the objects present in an image but also locating their precise positions by drawing bounding boxes around them. 
 - **Image segmentation** divides an image into semantically meaningful regions, assigning each pixel to a particular class or category, thereby enabling more detailed analysis and understanding of the image's content. 
 
In this section, we will showcase the application of M-LLMs for the task of image classification.

### Image classification using GPT-4(Vision)

In this section, the choice was made to leverage GPT-4 (Vision) for accomplishing an image classification task. The dataset used is [Image for image classification](managed_folder:JWJHJp2t). 

The approach is illustrated in this [Flow Zone](flow_zone:C2RRxY9). In a nutshell, everything happens in this [Python recipe](recipe:compute_classified_images). As before, each image is fed into the `gpt-4-vision-preview` model with the following prompt:
```
You are an expert in image classification. Assign a category to the image from the following category list: 
"Cardboard", "Food Organics", "Glass", "Metal", "Miscellaneous Trash", "Paper", "Plastic", "Textile Trash", "Vegetation".
Only output the category. Do not create a sentence.
```

The model predictions are contained within this [dataset](dataset:classified_images). An evaluation recipe was then employed to compare the model predictions against the true labels. The obtained **F1 score is 0.75**, which is highly satisfactory. For comparison, an EfficientNet B4 model yields a score of 0.595 after training on GPU on the same dataset.

<div class="alert"> <b>Side note: Function calling and gpt-4-vision-preview</b>
 <br>
It's worth noting here that the list of categories was directly provided to the model in the prompt, but there is no constraint for the model to adhere strictly to the contents of this list. It would have been interesting to employ a method such as <a href="https://platform.openai.com/docs/guides/function-calling">function calling</a> to increase the probability of the model correctly labeling the image with a category from this list. Unfortunately, function calling is currently not supported by gpt-4-vision-preview (<a href="https://platform.openai.com/docs/guides/function-calling/supported-models">as of February 16, 2024</a>).
</div>

Please also note that a powerful approach to perform zero-shot or few-shot image classification is to use a joint text-image model such as CLIP, as shown in this [example project](https://gallery.dataiku.com/projects/EX_CLIP/flow/).

### M-LLMs for other tasks (e.g. object detection, image segmentation)

In this section, various approaches have been explored, although definitive results have not yet been achieved. We are actively investigating this area, and the outcomes will be shared in an upcoming version of this project.

## Chatting about images

To showcase this project, an [initial web app](web_app:2d8Sjr0) has been developed. This web app is a chatbot that enables users to converse with an assistant while integrating an image into the conversation. The chatbot has been built to be able to use either the LENS framework or `gpt-4-vision-preview`. 
A preview of the chatbot is provided below:

![chatbot.png](JtUIQoVX1yPv)

## Navigating through generated captions

Secondly, a [second web app](web_app:7kN1NaD) has been developed to enable users to browse the captions of the images generated by GPT-4 (Vision). An audio file is also associated with these images and captions. 
A preview is provided below:

![image captioning webapp.png](RsMrjsR1jBBo)

# Next: Perform your own multimodal task

The project can be downloaded [here](https://downloads.dataiku.com/public/dss-samples/EX_VISION_LLM/).

## Technical requirements

This project:

- leverages features available starting from **Dataiku 12.3**;
- requires a Python 3.9 code environment named `py_39_llm_vision` with the packages and resources specified in  [Appendix: code environments](article:2) .

## How to reuse this project

Once you have imported the project, you can directly navigate the Flow. You need to specify the LLM connection you want to use in the `"LLM_id"` global variable. If you want to use GPT-4(V), you also need to provide an OpenAI API key as an `openai_key`user secret.

If you want to use your own data and depending on what you want to achieve, you can choose to replace:

- the images in this [folder](managed_folder:TVEwE7rl) and the questions in this [dataset](dataset:questions) for the VQA task
- the images in this [folder](managed_folder:rwNI6KPi) for the Image Captioning task 
- the images in this [folder](managed_folder:JWJHJp2t) for the Image Classification task following the classical tree structure by class

All the datasets are stored in filesystem so no remapping will be needed. However you have the option to [change the connection type](https://knowledge.dataiku.com/latest/data-sourcing/connections/concept-connection-changes.html#connection-changes) if you want to rely on a specific data storage type.

If you want to leverage certain parts of the Flow directly, keep in mind that Dataiku allows you to reuse elements at different levels. In particular it is possible to:

- [duplicate a whole project](https://knowledge.dataiku.com/latest/getting-started/dataiku-ui/how-to-duplicate-project.html)
- [copy and paste entire subflows](https://knowledge.dataiku.com/latest/collaboration/sharing-projects-assets/how-to-copy-flow-items.html)
- copy and paste recipes and remap input/output datasets
- [copy and paste preparation steps](https://doc.dataiku.com/dss/latest/preparation/copy-steps.html) within a Prepare recipe

# Related resources

- [LLM Mesh](https://doc.dataiku.com/dss/latest/generative-ai/index.html) in Dataiku
- [Load and re-use a Hugging Face model](https://developer.dataiku.com/latest/tutorials/machine-learning/code-env-resources/hf-resources/index.html) in Dataiku