# Overview

This _**Dataiku LLM Developer Kit**_ shows how Large Language Models (LLMs) like GPT-4o, Gemini or Mistral can be used in Dataiku. It illustrates the notions presented in the ["Introduction to Large Language Models With Dataiku" guidebook](https://content.dataiku.com/llms-dataiku/dataiku-llm-starter-kit).

**You are invited to read the [guidebook](https://content.dataiku.com/llms-dataiku/dataiku-llm-starter-kit) and learn about the notion of [LLM Mesh](https://doc.dataiku.com/dss/latest/generative-ai/introduction.html) before exploring this project**.

The project shows how to:
- Perform various Natural Language Processing (NLP) tasks with **visual recipes**;
- Call LLMs with Python **code recipes** to better control the text generation process;
- Implement a **question answering system based on documents** in various formats;
- Augment LLMs with **external tools** that enable them to act in the real world;
- Implement a **question answering system based on a SQL database**;
- **Fine-tune an LLM**;
- Implement a **conversational agent**.

The project can be [downloaded](https://downloads.dataiku.com/public/dss-samples/EX_LLM_STARTER_KIT/) and the [last section](#next-develop-your-own-llm-use-cases-1) provides instructions to reuse it with your own datasets.

# Walkthrough

## Performing natural language processing tasks without code

Using visual recipes is the simplest way to leverage LLMs in Dataiku.

In the [1. Basic use](flow_zone:JSx6cEw) Flow zone, we illustrate such recipes with a [small sample](dataset:product_reviews) of the [Amazon Reviews dataset](https://nijianmo.github.io/amazon/index.html). This dataset includes 21 reviews distributed in 7 product categories ("Cell Phones and Accessories", "Digital Music", "Electronics", "Industrial and Scientific", "Office Products", "Software", "Toys and Games") and classified as "positive", "negative" or "neutral".

For example, a review categorized as "Cell Phones and Accessories" and "positive" is:
```
Can't say enough good things. Fits snugly and provides great all round protection. Good grip but slides easily into pocket. Really adds to the quality feel of the Moto G 2014. I had the red and it looks sharp. Highly recommended.
```

### Text classification

If we first want to classify the reviews in various categories, we can simply create a ["Classify text" recipe](recipe:compute_classification2), list the targeted classes and optionally specify some few-shot examples. In this way, we can perform **zero-shot or few-shot in-context learning** as described in the ["Introduction to Large Language Models With Dataiku" guidebook](https://content.dataiku.com/llms-dataiku/dataiku-llm-starter-kit). It is also possible to request row-level explanations for the predicted classes.

The [resulting dataset](dataset:product_reviews_classified) includes several columns in addition to those of the input dataset: one for the predictions, one for the explanations (if requested), one for the LLM raw response and for potential LLM error messages.
![classification_v12.3.png](85suKFenVGEt)

### Text summarization

Similarly, a ["Summarize" recipe](recipe:compute_product_reviews_summarized) can be used to summarize the product reviews with an optional limit on the number of words or sentences.
![summarization_v12.3.png](D0H9zMf2jBvJ)

### Text generation

The two previous recipes are specific to certain tasks. More generally, the **[Prompt recipe](https://knowledge.dataiku.com/latest/ml-analytics/gen-ai/concept-prompt-studio.html)** allows to use any prompt. This enables users to easily perform a wide range of NLP tasks. The prompt template, the few-shot examples and various settings can be specified directly through the Prompt recipe or via Prompt Studios. We use this recipe to accomplish a [translation task](recipe:compute_product_reviews_translated) and a [structured information extraction task](recipe:compute_product_reviews_extracted).
![generation_v12.3.png](5IoS2dHBavHU)

## Using the LLM Mesh with code recipes

The visual LLM recipes enable Dataiku users to easily accomplish natural language processing tasks. As shown in the [2. Basic use (with code)](flow_zone:default) Flow zone, we can go a step further with **code recipes**.

### LLM connection

If an [LLM connection](https://doc.dataiku.com/dss/latest/generative-ai/llm-connections.html) has been configured by the administrator of the Dataiku instance, we can use it in both visual and code recipes. In this way, we can take advantage of all the benefits of the [LLM Mesh](https://blog.dataiku.com/llm-mesh) (decoupling application from service layer, enforcing a secure gateway, controlling cost and performance, etc.). For instance, we show how to perform [few-shot sentiment analysis](recipe:compute_sentiment_llm_mesh) on the same [dataset](dataset:product_reviews) as before.

### Function calling and structured outputs

We may want the text generated by an LLM to follow a very specific format. This is important if this text is subsequently processed by a computer program. However, **LLMs may deviate from the targeted format in spite of the instructions provided in the prompt**.

It is possible to **minimize or prevent the risk of formatting errors** with function calling or structured outputs which are increasingly common features offered by LLM providers. With "[structured outputs](https://medium.com/data-from-the-trenches/taming-llm-outputs-59a58ee3246d)", we can [specify a JSON schema](recipe:compute_extracted_json_structured_output) to which the LLM response will adhere.

Alternatively, if the structured outputs feature is not available for the LLM connection, we can [use]((recipe:compute_extracted_json_openai_function_call)) "[function calling](https://developer.dataiku.com/latest/concepts-and-examples/llm-mesh.html#tool-calls)" to in **increase the likelihood** the LLM output will follow a certain JSON schema, without guarantee.

### Handling other modalities, e.g. images

Multimodal LLMs like GPT-4o can handle **non-textual inputs such as images**. For example, we can [generate](recipe:compute_image_descriptions) captions for [some images](managed_folder:e23xCOAh). Conversely, starting from text descriptions, we can [create](recipe:compute_pvFOY2UN) [images](managed_folder:pvFOY2UN) with a text-to-image model (which is not an LLM...).

## Answering questions with documents

In the [3. Retrieval-based question answering - visual](flow_zone:nS8CTF4) and [4. Retrieval-based question answering - code](flow_zone:wwVg0ys) Flow zones, we create a question-answering system based on a specific collection of documents. More specifically, we implement a ***retrieval-augmented generation* pipeline** which includes the following steps:
1. Receiving the question from the user;
2. Retrieving (from the collection of documents) the passages most semantically similar to the question;
3. Incorporating both the question and the selected passages in a question-answering prompt ;
4. Querying a LLM with this prompt and receiving the answer.

You can find more details about the retrieval-augmented generation pipeline in the ["Introduction to Large Language Models With Dataiku" guidebook](https://content.dataiku.com/llms-dataiku/dataiku-llm-starter-kit) or this [blog post](https://medium.com/data-from-the-trenches/digging-through-the-minutiae-query-your-documents-with-gpt-3-c4600635a55).

The pipeline in the [3. Retrieval-based question answering - visual](flow_zone:nS8CTF4) Flow zone is based on visual features available from Dataiku version 12.3 onwards while the pipelline in the [4. Retrieval-based question answering - code](flow_zone:wwVg0ys) Flow zone is based only on code recipes and leverages [LangChain](https://python.langchain.com/en/latest/), a rich and fast-evolving toolkit to create LLM-powered applications.

### Loading, splitting and vectorizing documents

Before answering questions, we need to **preprocess the documents** once and for all (at least until the next update of the documents) to enable the retrieval of relevant passages at Step 2 above.

In our example, we want to answer questions with [4 documents](managed_folder:TyR7HVoz) related to biodiversity: two .pdf documents, one .docx document and one .html page. We first:
- Load the content of the documents;
- Split the documents' content in chunks;
- Compute the embeddings of these chunks with a semantic similarity model;
- Create an index of these embeddings to be able to quickly retrieve them;
- Save this index in a vector store.

This happens:
- in the [3. Retrieval-based question answering - visual](flow_zone:nS8CTF4) Flow zone, through:
  - a first [visual recipe](recipe:compute_documents_extracted) of the [Text Extraction and OCR plugin](https://www.dataiku.com/product/plugins/tesseract-ocr/) to load the documents' content;
  - a [Prepare recipe](recipe:compute_documents_extracted_prepared) to clean the documents' content, split the extracts in chunks, enrich the chunks with relevant metadata, and create a meaningful extract id (for example,  `wikipedia_biodiversity.pdf - Page 10`) to better identify the sources of an answer;
  - a second [visual recipe](recipe:compute_knowledge) for all the other steps;
- in the [4. Retrieval-based question answering - code](flow_zone:wwVg0ys) Flow zone through a [Python recipe](recipe:compute_4pbb7xKD).

### Generating answers and sources

With the document indexed, it is then straightforward to implement the retrieval-augmented generation pipeline described above. 

In the [3. Retrieval-based question answering - visual](flow_zone:nS8CTF4) Flow zone, we use a [Prompt recipe](recipe:compute_answers2) and leverage an augmented LLM connection to get [answers](dataset:answers) for a [set of questions](dataset:questions).

In the [4. Retrieval-based question answering - code](flow_zone:wwVg0ys) Flow zone, we use a [Python recipe](recipe:compute_answers) to generate [answers](dataset:answers2) for the same set of questions. More precisely, we implement a simple RAG chain with LangChain. Alternatively, we perform exactly the same operation with a [Code Agent](saved_model:bR8FHxfn) and the corresponding [Prompt Recipe](recipe:compute_answers2).

A very simple [web app](web_app:gfMP9Yq) calling the [Code Agent](saved_model:bR8FHxfn) allows users to get answers and provide feedback (😀 or 🙁) on the answers (in which case, the query, the answer and the feedback are [logged](managed_folder:zhYE6r2h). **Please note that you need to download the project, import it on your own Dataiku instance and provide an LLM connection to test the web app. It cannot be used on Dataiku's public gallery**. Two alternative web apps are available: [one using a conversational interface](web_app:1yaSXSp) and a minimalistic [one using only a Dataiku LLM connection](web_app:qpPlJBw), without LangChain.
![Question answering web app](gASJVkUuWqtZ)

Moreover, another [web app](web_app:MNIEqRT) enables you to interactively search the [knowledge bank](retrievable_knowledge:iXhIIXV5) used in the [3. Retrieval-based question answering - visual](flow_zone:nS8CTF4) Flow zone. It can help debug a question-answering pipeline and it requires Dataiku v12.4.1 or above. You can learn more about semantic search through this [blog post](https://medium.com/data-from-the-trenches/semantic-search-an-overlooked-nlp-superpower-b67c4b1b119a) and this [Dataiku example project](https://gallery.dataiku.com/projects/EX_SEMANTICSEARCH/).
![semantic_search.png](CC3AOEIthgzV)

### Multimodal retrieval-augmented generation

The approach presented above works only with text. However, many documents include charts, diagrams or tables whose information is useful to answer potential questions. In this case, a **multimodal RAG approach** is desirable and [one](https://knowledge.dataiku.com/latest/ml-analytics/gen-ai/tutorial-multimodal-embedding.html) is implemented with visual recipes in the [3. Retrieval-based question answering - visual](flow_zone:nS8CTF4). The steps ([content extraction and embedding](recipe:compute_documents_multimodal_embedded), [retrieval and generation](recipe:compute_answers_multimodal)) are similar as before but this time:
- the content embeddings are computed on textual representations of each page of the documents;
- when content embeddings are identified as nearest neighbours of the query embedding, it is the [images](managed_folder:Os2WPwaP) of the corresponding pages of the documents that are fed to the LLM (not their textual representations).

### Automatically comparing generated answers and ground-truth answers

Evaluating a question answering system is challenging because various valid answers can be phrased differently. Simply comparing word for word generated answers and ground-truth answers would most likely underestimate the proportion of correct answers. We use the [LLM Evaluation recipe](https://doc.dataiku.com/dss/latest/mlops/model-evaluations/llm-evaluation.html) to **perform automated (but potentially noisy) evaluations**.

This evaluation takes place in four recipes ([one](recipe:evaluate_answers) for the text-only RAG pipeline in the [3. Retrieval-based question answering - visual](flow_zone:nS8CTF4) Flow zone, [one](recipe:evaluate_answers_multimodal) for the multimodal RAG pipeline in the same Flow zone, [one](recipe:evaluate_answers2) for the Python recipe in the [4. Retrieval-based question answering - code](flow_zone:wwVg0ys) Flow zone, and [one](recipe:evaluate_answers_code_prepared) for the Code Agent in the same Flow zone). Each of these LLM Evaluation recipes return their results in two datasets and one model evaluation store.

## Augmenting an LLM with external tools

In the [5. LLMs combined with tools](flow_zone:Gzn9KXV) Flow zone, we create an ***agent***, ie. an LLM augmented with ***tools*** so that it can act in the real world, access external knowledge sources, or better perform certain tasks. For this fictitious example, we imagine the case of an Internet provider handling [customer requests](dataset:requests) related to Internet connection troubles, password reinitialization, options...

### Defining tools

In this context, tools are simply programs that take some text as input and provide or summarize their results as text. Here, we consider 9 tools (included in the `fictitious_tools.py` file in the project's libraries):
- `get_customer_id` retrieves the customer ID given his or her name;
- `get_details` retrieves the customer data of a specific customer;
- `reset_password` reinitializes the password of a specific customer;
- `schedule_local_intervention` sends an email allowing a specific customer to schedule an appointment for a technician's visit;
- `schedule_distant_intervention` sends an email allowing a specific customer to schedule an appointment for a phone discussion with a technician;
- `cancel_appointment` cancels an upcoming appointment;
- `run_diagnostics` determines whether an Internet connection problem requires the local or distant intervention of a technician;
- `sign_up_to_option` activates the Premium option or the TV option for a specific customer;
- `cancel_option` deactivates the Premium option or the TV option for a specific customer.

For example, the `reset_password` tool is the following Python function:
```
def reset_password(customer_id):
    """
    Reset the password of a customer.
    Use this if the customer mentions problems with his or her password.
    The input should be a single integer representing the customer's id. E.g. '123'.
    """
    try:
        row = customers_df.loc[extract_integer(customer_id)]
        return f"Email sent so that customer #{customer_id} can define a new password. No further action is needed to reinitialize the password."
    except (KeyError, ValueError):
        return "Unknown customer id"
```
Since our example is fictitious, this function is "fake", in the sense that it does not really reset a password. It just confirms that it did. Of course, tools can be arbitrarily complex functions: they may call API endpoints, they may leverage a predictive model... They may even be another LLM augmented with other tools. What matters here from the perspective of the LLM is that **the function is described in a docstring and returns an intelligible string**.

### Implementing the ReAct approach

Once the tools are defined, we can create an agent based on the **ReAct** approach, either as a [Python recipe](recipe:compute_requests_processed_langgraph) or a [Code Agent](saved_model:2k5v5N3O) (you can read more about ReAct in the ["Introduction to Large Language Models with Dataiku" guidebook](https://content.dataiku.com/llms-dataiku/dataiku-llm-starter-kit) or in the original [research paper](https://arxiv.org/abs/2210.03629)).

This involves writing the prompt template and implementing all the machinery to call the LLM, parse its answers and execute the actions requested by the LLM with the tools. Fortunately, **LangGraph** makes this quite straightforward.

Our code allows for two variants. In the default situation, the agent can potentially affect all customers. This creates the risk that handling a request inadequately impacts another customer. In the alternative situation, the customer is assumed to be known. We can then replace the generic tools by **restricted tools** specific to this customer. Given that agents can make mistakes and take inadequate actions, it is obviously better to use restricted tools whenever possible.

Given some fictitious [customer data](dataset:customers), we use this new agent to [process](recipe:compute_requests_processed_langgraph) [examples of requests](dataset:requests) using the restricted tools. The results are satisfying. For example, with the following request:

```
Hi. I have a problem with my Internet connection. Can you please solve this? Many thanks!
```
... we get the following reply:
```
I have scheduled a technician to visit you and solve the Internet connection problem. Please let me know if you need any further assistance.
```
... with these actions:
```
run_diagnostics()
schedule_local_intervention()
```
Just with the description of the tools, the agent correctly inferred that it should run some diagnostics as soon as an Internet connection problem is reported. It then appropriately chose to organize the visit of a technician. The draft reply is also appropriate and well written.

The course of action and the draft reply of the agent were adequate for all [examples of requests](dataset:requests). Still, **agents can make mistakes and should be rigorously developed and evaluated** (cf. the ["Introduction to Large Language Models with Dataiku" guidebook](https://content.dataiku.com/llms-dataiku/dataiku-llm-starter-kit) for practical recommendations on making agents more robust).

You can test the agent through a [web app](web_app:e5X6LOZ), with either the general tools (when the customer is not identified) or the restricted tools specific to a certain customer. **Please note that you need to download the project, import it on your own Dataiku instance and set up an LLM connection to test the web app. It cannot be used on Dataiku's public gallery.**

![agent.png](lCWoxTemTO6f)

### Evaluating the answers of an agent

Evaluating an agent is challenging because we may need to assess both its final answers and its trajectories, ie. all the actions it took along the way. For both the final answers and the trajectories, there may be several distinct but equally valid choices. The LLM Starter Kit illustrates **three general evaluation methods for agents**. The table below summarizes their required inputs, their outputs and the Python packages leveraged in the evaluation recipe.

| Evaluation method | Required inputs | Outputs | Python package |
|:---:|:---:|:---:|:---:|
| **LLM-as-a-judge**<br>(without ground truth) | - Request<br>- Reply<br>- Intermediate steps | - Score<br>- Justification | `langchain` |
| **LLM-as-a-judge**<br>(with ground truth) | - Request<br> - Reply<br>- Ground truth reply | - Score<br>- Justification | `mlflow` |
| **Conformance checking**<br>(process mining technique) | - Intermediate steps<br>- Model for the ground truth trajectory | - Trajectory fit score<br>- Aligned trajectories | `pm4py` |

The **first evaluation method** simply asks an LLM to score and comment the reply and the trajectory on the basis of the request. The LLM is not provided a ground truth answer or a ground truth trajectory, which makes this method convenient but limits its accuracy.

The **second evaluation method** is the same but the LLM is given access to a ground truth reply but not the intermediate steps. This may be relevant when only the final answer matters and when inadequate actions are inexpensive and have no side effects.

The  **third evaluation method** uses the trajectory of the agent and a *model* of the ground truth trajectory, ie. the list of required actions and the potential constraints on their execution. With these pieces of information, we can use a [process mining technique](https://en.wikipedia.org/wiki/Conformance_checking) called **conformance checking** to score the trajectory of the agent and align, to the extent possible, this trajectory and an ideal trajectory. This allows to detect inadequate trajectories and, within these trajectories, which actions are missing or superfluous.

For example, if the user's request is "I would like to subscribe to the TV and Premium options", a model for the ground truth trajectory could be:
```
PARALLEL:
  - sign_up_to_option({"s"="TV"})
  - sign_up_to_option({"s"="Premium"})
```
... which means that the actions `sign_up_to_option({"s": "TV"})` and ``sign_up_to_option({"s": "Premium"})`` should both take place, in any order. If the trajectory of the agent is only:
```
sign_up_to_option({"s": "TV"})
```
... then the trajectory fit will be 0.667 (a score less than 1 denotes an incorrect trajectory) and the aligned trajectories will be:

| Trajectory of the agent | Aligned ground-truth trajectory |
|:---:|:---:|
| - | sign_up_to_option({"s": "Premium"}) |
| sign_up_to_option({"s": "TV"}) | sign_up_to_option({"s": "TV"}) |

These aligned trajectories show that the `sign_up_to_option({"s": "Premium"}) ` action is missing from the trajectory of the agent.

The first and second evaluation methods are implemented in a [code recipe](recipe:compute_agent_answers_evaluated) of the project. In contrast, the third evaluation method only appears in a [companion project](https://gallery.dataiku.com/projects/EX_LLM_STARTER_KIT_AGENT/) (because it requires `pm4py`, a Python package licensed under the GPL v3 license).

### Visual agent

Implementing the agent of the previous sections in either a code agent or a Python recipe requires writing Python code. It is also possible to create a simple agent with only visual tools. For example, the [visual agent](saved_model:BOcLarF3) included in the [5. LLMs combined with tools](flow_zone:Gzn9KXV) Flow zone is based on two readily available tools, one that performs a lookup on a [dataset](dataset:customers) and one that queries an LLM augmented with a [knowledge bank](retrievable_knowledge:iXhIIXV5).

## Using LLM for text2SQL tasks

Building upon the [previous section](#augmenting-an-llm-with-external-tools-1), the [6. LLMs for Text-to-SQL](flow_zone:N8de5Jm) Flow zone shows how we can use LLMs to make information captured in relational databases more accessible for users by parsing natural language queries into SQL queries. With tools, we can build agents capable of autonomously taking action and forming SQL queries to answer the question provided.

For our example here, we use relatively two small datasets containing information about [international football results](https://www.kaggle.com/datasets/martj42/international-football-results-from-1872-to-2017) from Kaggle, from a flat file format (csv) and query it using [pandasql](https://pypi.org/project/pandasql/). 

> :warning: **Note**: We suggest to use data that are directly stored in a SQL Database. While we show that it is possible to use flat files, pandasql requires the data to be loaded in memory. Loading a huge flat file may impact performance.

### Tools

In this context, the tools provided to the Text-to-SQL agent are (included in the `text2sql_agent.py` file in the project's libraries):
- `get_table_columns` returns list of table column names and types in JSON.
- `execute_sql`returns the result of SQL query execution of from the generated SQL

### Implementation

Given some [sample questions](dataset:questions_on_data), we use this text2SQL agent to generate [answers](recipe:compute_questions_answered), capturing the results and intermediate steps in [the output](dataset:questions_answered). The results are satisfying. For example, with the following question:

```
Who is the top 3 goalscorers, not counting penalties or own goals?
```
... we get the following reply:
```
The top 3 goalscorers, not counting penalties or own goals, are:
1. Cristiano Ronaldo with 92 goals
2. Romelu Lukaku with 55 goals
3. Robert Lewandowski with 50 goals
```
... with these actions:
```
1. get_table_columns({'table_names': ['match_goalscorers']})

2. execute_sql({'table_names': ['match_goalscorers'], 'sql_query': 'SELECT scorer, COUNT(*) as goals FROM match_goalscorers WHERE penalty = FALSE AND own_goal = FALSE GROUP BY scorer ORDER BY goals DESC LIMIT 3;'})
```

The agent of the [Python recipe](recipe:compute_questions_answered) is also implemented in an equivalent [Code Agent](saved_model:MSIGB581).

You can test the agent through a [web app](web_app:x9jctZd), **Please note that you need to download the project, import it on your own Dataiku instance, setup an [LLM connection](https://knowledge.dataiku.com/latest/ml-analytics/gen-ai/concept-llm-connections.html). It cannot be used on Dataiku's public gallery.**

![text2sql_webapp.png](wA7fSwZjgfzk)

## Fine-tuning LLMs

Proprietary and open source models can be used out-of-the-box for a wide variety of tasks. As demonstrated in the previous sections, adjusting instructions in the prompt, adding few-shot examples, implementing a retrieval-augmented generation approach or giving access to tools are all effective methods to use an LLM for a given use case. If the techniques above are insufficient to reach the targeted performance, fine-tuning an LLM can help go a step further.

A [standard way](https://arxiv.org/abs/2203.02155) to train and fine-tune an LLM involves the following stages:
1. **Unsupervised pre-training**: the LLM is trained to estimate the next token probability on an internet-scale text corpus;
2. **Supervised fine-tuning**: the LLM is trained to estimate the next token probability on a cautiously curated dataset of high-quality demonstrations for specific tasks;
3. **Preference optimization**: the LLM is fine-tuned to reflect the preferences expressed in a comparison dataset. This comparison dataset includes prompts and, for each prompt, a chosen answer and a rejected answer.

Pre-training LLMs in an unsupervised manner is out of reach for most organizations but fine-tuning an existing open source model in a supervised manner or through preference optimization is much more accessible. This is what we illustrate in the [8. LLM fine-tuning](flow_zone:LZpUc7e) Flow zone.

### Supervised fine-tuning

We apply supervised fine-tuning to a [dataset](dataset:sft_data) derived from the [GEM Viggo](https://huggingface.co/datasets/GEM/viggo) dataset. The dataset includes a message related to video games and the corresponding JSON string that contains some key details of the message. For example, such a pair is:
- Message (input):
```
Dirt: Showdown from 2012 is a sport racing game for the PlayStation, Xbox, PC rated E 10+ (for Everyone 10 and Older). It's not available on Steam, Linux, or Mac.
```
- Corresponding JSON string (output):
```
{"text_type": "inform", "name": "Dirt: Showdown", "release_year": 2012, "esrb": "E 10+ (for Everyone 10 and Older)", "genres": ["driving/racing", "sport"], "platforms": ["PlayStation", "Xbox", "PC"], "available_on_steam": "no", "has_linux_release": "no", "has_mac_release": "no"}
```

We assume that we want to generate such a JSON string for new messages. This is not a trivial task because the JSON strings in the dataset comply with a pretty complicated JSON schema:
- with 15 potential keys;
- with one required key (`text_type`);
- with values of different types (string, integer or array);
- with some values being limited to a few predefined categories.

After having [split](recipe:split_sft_data) the initial dataset into a [train set](dataset:sft_train), a [validation set](dataset:sft_validation) and a [test set](dataset:sft_test), we fine-tune an LLM in two different ways:
- we use the [visual fine-tuning recipe](recipe:finetune_finetuned_for_information_extraction) to fine-tune `gpt-4o-mini` thanks to the OpenAI API;
- we use a [Python recipe](recipe:compute_74GhsK3Z) and the [TRL](https://huggingface.co/docs/trl/index) library to fine-tune [mistral-7B-instruct-0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) on this task. We leverage the [LoRA](https://arxiv.org/abs/2106.09685) technique that reduces the memory requirements and increases the training throughput. The fine-tuning step results in a LoRA adapter stored in a [saved model](saved_model:VOqI4fIf).

We can then generate the answers with the two fine-tuned models of the [test set](dataset:sft_test_sampled)  and [compare](recipe:compute_mYObd5To) them with [those](dataset:sft_predictions_gpt_4o) obtained with `gpt-4o`prompted with the JSON schema and some few-shot examples.

![sft_results.png](1AOxDF05z3VQ)

We assess both the compliance the target JSON schema and the similarity of the generated outputs with the ground-truth answers. For both these criteria,  the predictions of the fine-tuned models are competitive if not better than those `gpt-4o`. Beyond producing good answers, fine-tuning also significantly reduce the prompt's length (and the inference computations and costs) because the full description of the expected format and the few-shot examples can be omitted.

### Preference optimization

Preference optimization consists in adjusting a model so that it better reflects the preferences expressed in a comparison dataset. It is particularly relevant when critiquing LLM-generated answers is much easier and quicker than writing adequate answers.

The comparison dataset includes **prompts** and, **for each prompt**, at least **an approved answer and a rejected answer**. These answers have typically been generated by the model that we want to fine-tune and then assessed by either human evaluators or an LLM.

Here, we use a [sample](dataset:po_data) of the [lvwerra/stack-exchange-paired](https://huggingface.co/datasets/lvwerra/stack-exchange-paired) dataset  which includes tuples of a StackExchange question, an accepted answer and a rejected answer. We filtered the dataset so that the accepted answers share some features that will be easy to detect in the answers generated by the fine-tuned model:
- The accepted answer always includes a code block while the rejected answer never does;
- The accepted answer is shorter than the rejected answer.

We again [split](recipe:split_po_data) the dataset and then [fine-tune](recipe:compute_Yt3MdkME) the model using the [Direct Preference Optimization](https://arxiv.org/abs/2305.18290) (DPO) algorithm. As opposed to previous [popular approaches](https://arxiv.org/abs/2203.02155), DPO does not rely on reinforcement learning. As with supervised fine-tuning, we do not fine-tune the whole model but only train a LoRA adapter, stored in a [saved model](saved_model:Yt3MdkME).

We then [generate](recipe:compute_po_test_answers_finetuned) [answers](dataset:po_test_answers_finetuned) with both the base model and its fine-tuned version. By design of the [training set](dataset:po_data), we can hope that the fine-tuning step improved the quality of the answers, increased the likelihood of the answers including a code block and made the answers generally more concise. We test these three assumptions with recipes that evaluate the quality of the answers with an LLM-as-a-judge approach and a [Prepare recipe](recipe:compute_po_predictions_prepared) that measures the length of the answers and verify whether they include code blocks. The [results ](dataset:po_answers_evaluated2) for the [test set](dataset:po_test) are as follows:

| Metric | Base model | Fine-tuned model |
|---|---|---|
| Answer correctness assessed by an LLM-as-a-judge | 0.30 | 0.29 |
| Average number of characters in the answers | 1766 | 1507 |
| Percentage of answers including a code block | 86% | 90% |

**As expected, the answers generated by the fine-tuned model are shorter** than those from the base model **and more likely to include code blocks**. Their quality is assessed less favorably but the difference is limited.

Please note that the train set should ideally only include accepted and rejected answers generated by the base model. This is not the case here because we reused an existing dataset out of convenience (cf. Section 4 of the [DPO paper](https://arxiv.org/abs/2305.18290) for additional details).

## Bonus: implementing a conversational agent

The web apps presented above in the context of a [question answering system](web_app:gfMP9Yq) and an [LLM agent](web_app:e5X6LOZ) are very simple: the user asks a question and gets a response, and if a new question is submitted, the previous interaction is not taken into account in the new reply.

In many use cases, moving from single shot answers to a **conversational experience** is both more productive and user-friendly. This requires a **chat user interface** and a **memory component** that feeds the LLM with past interactions. This project includes a simple web app implementing such a conversational agent thanks to LangChain. It comes in two versions: [one](web_app:3hKu5FA) naively wrapping a chat model model and [one](web_app:1yaSXSp) answering questions on the basis of the same [documents](managed_folder:TyR7HVoz) used before.

![q_and_a-chatbot.png](Z7l56RvieofY)

Please note that this webapp is included in this project to illustrate a simple conversational interface and to enable you to create a basic interface for development purposes. For use cases in production, Dataiku offers [Dataiku Answers](https://doc.dataiku.com/dss/latest/generative-ai/chat-ui/answers.html), a packaged, scalable web application for LLM chat and RAG applications.

# Next: develop your own LLM use cases

You can [download](https://downloads.dataiku.com/public/dss-samples/EX_LLM_STARTER_KIT/) and import it in your own Dataiku instance.

## Technical requirements
This project:
- leverages features available starting from **Dataiku 13.4**;
- uses the [Text Extraction and OCR plugin](https://www.dataiku.com/product/plugins/tesseract-ocr/) (version 2.3 or above);
- uses the [Traces Explorer plugin](https://doc.dataiku.com/dss/latest/generative-ai/agents/tracing.html#traces-explorer);
- requires the ids of LLM connections as project variables (`LLM_id` for a chat LLM, `embedding_model_id` for an embedding model, `image_generation_model_id` for a text-to-image model);
- requires a Hugging Face tokeb stored as a [user secret](https://doc.dataiku.com/dss/latest/security/user-secrets.html) for the recipes using `mistral-7b-instruct-v0.2`. The key for this secret should be `hf_token`. You will need to visit the corresponding [Hugging Face page](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) and be granted access to the model;
- uses the [unstructured.io system dependencies](https://unstructured-io.github.io/unstructured/installing.html) corresponding to the document formats you want to cover in the [4. Retrieval-based question answering - code](flow_zone:wwVg0ys) Flow zone. For example, only `libreoffice` needs to be installed with the [documents provided as examples](managed_folder:TyR7HVoz);
- requires to create a Python 3.10 code environment named `py_310_sample_llm` with the packages and the resource [initialization script](https://doc.dataiku.com/dss/latest/code-envs/operations-python.html#managed-code-environment-resources-directory) mentioned in this [wiki page](article:2);
- requires to create a Python 3.9 code environment named `py39_huggingface` with the packages of the *Local HuggingFace Models* preset but with `trl` version 0.13.0 or more and `datasets` version 2.21.0 or more;
- requires to create a "Retrieval augmented generation" Python 3.9 code environment through the admin interface (Administration > Settings > LLM Mesh > configuration);
- requires to create a Python 3.9 code environment named `py39_llm_evaluation` with the packages of the *Evaluation of LLMs* preset.

## Reusing parts of this project

You first need to [download](https://downloads.dataiku.com/public/dss-samples/EX_LLM_STARTER_KIT/) the project and import it on your Dataiku instance.

All the datasets and files are stored in filesystem so no remapping will be needed. However you have the option to [change the connection type](https://knowledge.dataiku.com/latest/courses/flow-views-and-actions/connection-changes-concept-summary.html#connection-changes) if you want to rely on a specific data storage type. 

If you want to leverage certain parts of the Flow directly, keep in mind that Dataiku allows you to reuse elements at different levels. In particular it is possible to:
- [Duplicate a whole project](https://knowledge.dataiku.com/latest/kb/governance/How-to-duplicate-a-DSS-project.html)
- [Copy and paste entire subflows](https://knowledge.dataiku.com/latest/courses/flow-views-and-actions/connection-changes-concept-summary.html#copy-subflow)
- [Copy and paste recipes](https://knowledge.dataiku.com/latest/kb/collaboration/How-to-copy-a-recipe-in-your-Flow.html) and remap input/output datasets
- [Copy and paste preparation steps](https://doc.dataiku.com/dss/latest/preparation/copy-steps.html) within a Prepare recipe

More specifically:
 - Regarding the [3. Retrieval-based question answering - visual](flow_zone:nS8CTF4) and [4. Retrieval-based question answering - code](flow_zone:wwVg0ys) Flow zones: you can replace the [documents](managed_folder:TyR7HVoz) with your own documents. In the [4. Retrieval-based question answering - code](flow_zone:wwVg0ys) Flow zone, you may need to adapt the code of the [indexing recipe](recipe:compute_4pbb7xKD) and leverage other LangChain [document loaders](https://python.langchain.com/en/latest/modules/indexes/document_loaders.html) to cover other formats.
 - Regarding the [5. LLMs combined with tools](flow_zone:Gzn9KXV) Flow zone: you need to create your own tools with Python functions (cf. the examples in the `fictitious_tools.py` file in the project's library) and adapt the code in the Python code recipes (cf. the comments in this recipe which show which parts to modify).

## If you want to use other LLMs...

This project leverages an "[LLM connection](https://doc.dataiku.com/dss/latest/generative-ai/llm-connections.html)" (introduced in Dataiku version 12.3) and `mistral-7b-instruct-v0.2`, an open source model.

For all visual recipes using an LLM connection, for example the visual recipes in the [1. Basic use - visual](flow_zone:JSx6cEw) Flow zone, you can choose to replace the LLM with another one among those configured by your administrator.

For the code recipes using the LLM connection, you can switch to another model by modifying the `LLM_id` project variable.

# Related resources
- [Introduction to Large Language Models With Dataiku](https://content.dataiku.com/llms-dataiku/dataiku-llm-starter-kit)
- [Dataiku for Generative AI](https://www.dataiku.com/product/generative-ai/)
- [Documentation on Dataiku's LLM Mesh](https://doc.dataiku.com/dss/latest/generative-ai/index.html)
- "Question Answering with GPT-3" Dataiku project ([demo](https://gallery.dataiku.com/projects/EX_QUESTION_ANSWERING/), [download site](https://downloads.dataiku.com/public/dss-samples/EX_QUESTION_ANSWERING/), [blog post](https://medium.com/data-from-the-trenches/digging-through-the-minutiae-query-your-documents-with-gpt-3-c4600635a55))
- "Text Classification" Dataiku project ([demo](https://gallery.dataiku.com/projects/EX_TEXT_CLASSIFICATION/), [download site](https://downloads.dataiku.com/public/dss-samples/EX_TEXT_CLASSIFICATION/), [blog post](https://medium.com/data-from-the-trenches/7-text-classification-techniques-for-any-scenario-be428ea68b71))
- "Token Classification" Dataiku project ([demo](https://gallery.dataiku.com/projects/EX_TOKEN_CLASSIFICATION/), [download site](https://downloads.dataiku.com/public/dss-samples/EX_TOKEN_CLASSIFICATION/))
- "Advanced RAG" Dataiku project ([demo](https://gallery.dataiku.com/projects/EX_ADVANCED_RAG/), [download site](https://downloads.dataiku.com/public/dss-samples/EX_ADVANCED_RAG/), [blog post](https://medium.com/data-from-the-trenches/from-sketch-to-success-strategies-for-building-and-evaluating-an-advanced-rag-system-edd7bc46375d))
- "Multimodal LLM" Dataiku project ([demo](https://gallery.dataiku.com/projects/EX_VISION_LLM/), [download site](https://downloads.dataiku.com/public/dss-samples/EX_VISION_LLM/), [blog post](https://medium.com/data-from-the-trenches/demystifying-multimodal-llm-053143c07d6f))
- "Web-based RAG" Dataiku project ([demo](https://gallery.dataiku.com/projects/EX_WEB_RAG/), [download site](https://downloads.dataiku.com/public/dss-samples/EX_WEB_RAG/), [blog post](https://medium.com/data-from-the-trenches/standing-on-the-shoulders-of-a-giant-cefe2a50881a))
- "Multimodal RAG" Dataiku project ([demo](https://gallery.dataiku.com/projects/EX_MULTIMODAL_RAG/), [download site](https://downloads.dataiku.com/public/dss-samples/EX_MULTIMODAL_RAG/), [blog post](https://medium.com/data-from-the-trenches/beyond-text-taking-advantage-of-rich-information-sources-with-multimodal-rag-0f98ff077308))
- [Text Extraction and OCR plugin](https://www.dataiku.com/product/plugins/tesseract-ocr/)