# Overview

This example project illustrates how to **evaluate and improve a Retrieval-Augmented Generation (RAG) pipeline**. It is best explored as a companion to this [blog post](https://medium.com/data-from-the-trenches/from-sketch-to-success-strategies-for-building-and-evaluating-an-advanced-rag-system-edd7bc46375d) (and after having read this [primer](https://medium.com/data-from-the-trenches/digging-through-the-minutiae-query-your-documents-with-gpt-3-c4600635a55) or our [Introduction to LLMs with Dataiku](https://content.dataiku.com/llms-dataiku/dataiku-llm-starter-kit) if you are unfamiliar with the notion of RAG). The example project shows how to:
- **Implement a baseline RAG pipeline**;
- **Improve** it using Dataiku's [LLM Mesh](https://blog.dataiku.com/llm-mesh), in combination with [LangChain](https://python.langchain.com/docs/get_started/introduction) or [LlamaIndex](https://docs.llamaindex.ai/en/stable/);
- **Manually evaluate** a RAG pipeline, either by grading individual generated answers or by comparing two generated answers;
- Compute various **automated metrics** for a RAG pipeline;
- **Quantify the correlation between the manual and automated evaluations**;
- **Visualize the answers** generated by a RAG pipeline, along with its automated evaluation results;
- Generate a **synthetic test dataset**;
- Incorporate the RAG pipeline in a **basic web app**.

This project can be [downloaded](https://downloads.dataiku.com/public/dss-samples/EX_ADVANCED_RAG/) and instructions to reuse it for your own use case are provided in the [last section](#next-build-your-own-advanced-rag-pipeline-1).

# Data

We assume that we want to build a question-answering system based on Dataiku's online technical documentation. For this, we use a [managed folder](managed_folder:o87Qdwlh) that includes approximately 2,000 web pages downloaded from Dataiku's website, as well as a [dataset of test questions](dataset:questions) with the corresponding ground-truth answers.

# Walkthrough

## Improving a baseline RAG pipeline

### Baseline RAG pipeline

In the [2. Baseline](flow_zone:6S14LG3) Flow zone, we build a **baseline RAG pipeline using only visual recipes**. We first extract the content of the web pages with the [Text Extraction recipe](recipe:compute_text_extracted) of the [Text Extraction and OCR](https://www.dataiku.com/product/plugins/tesseract-ocr/) plugin. We then split this content into chunks, compute the corresponding embeddings, and store these embeddings in a [knowledge bank](retrievable_knowledge:XIujm5qr) through an [Embed recipe](recipe:compute_kb). With this knowledge bank, we generate the answers to the test questions with a [Prompt recipe](recipe:compute_answers_baseline) and we [reformat](recipe:compute_structure_scored_evaluation_dataset_formatted_1) [them](dataset:answers_baseline_formatted) so that all generated answers in this project share the same format.

### Structure-aware chunking

In the baseline RAG pipeline, the documents are split with a simple rule based on the number of characters. In order to get more relevant chunks, we can **take advantage of the structure of the input documents** and split them first along their sections and subsections and then according to the number of characters. This ensures that no chunk spans several sections or subsections. Moreover, we can systematically add the titles of the current section and subsections to the chunks to make them more self-contained.

In the [3. Structure-aware chunking](flow_zone:5bdEBbj) Flow zone, we do so with a [code recipe](recipe:compute_chunks) using a text splitter specific to HTML documents. The rest of the pipeline is similar to the baseline RAG pipeline.
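The idea can be illustrated with a minimal sketch that relies only on the Python standard library. It is not the code of the recipe: the `SectionParser` class, the `chunk_html` function, the `max_chars` parameter and the " > " separator between titles are all hypothetical names and choices made for this illustration. The sketch splits an HTML page at `h1` to `h3` headings, then cuts each section by number of characters, and prefixes every chunk with the titles of its section and subsections.

```python
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """Collect (titles, text) pairs, one per section or subsection."""

    HEADINGS = ("h1", "h2", "h3")

    def __init__(self):
        super().__init__()
        self.titles = {}      # heading level -> title currently in scope
        self.sections = []    # list of (list of titles, section text)
        self._heading = None  # level of the heading being read, if any
        self._buffer = []     # text fragments of the current section

    def _flush(self):
        # Normalize whitespace and store the section if it is not empty.
        text = " ".join(" ".join(self._buffer).split())
        if text:
            path = [" ".join(self.titles[k].split()) for k in sorted(self.titles)]
            self.sections.append((path, text))
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._flush()
            self._heading = int(tag[1])
            # A new heading closes the headings of the same or a deeper level.
            self.titles = {k: v for k, v in self.titles.items() if k < self._heading}
            self.titles[self._heading] = ""

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._heading = None

    def handle_data(self, data):
        if self._heading is not None:
            self.titles[self._heading] += data
        else:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def chunk_html(html, max_chars=200):
    """Split by section first, then by characters, and prefix the titles."""
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    chunks = []
    for path, text in parser.sections:
        prefix = " > ".join(path)
        for start in range(0, len(text), max_chars):
            body = text[start:start + max_chars]
            chunks.append(f"{prefix}\n{body}" if prefix else body)
    return chunks


page = (
    "<h1>Guide</h1><p>Intro text.</p>"
    "<h2>Install</h2><p>Run the installer.</p>"
    "<h2>Usage</h2><p>Open the app.</p>"
)
for chunk in chunk_html(page):
    print(repr(chunk))
```

The fixed-width character window can cut a word in half. A real text splitter would typically cut at separators such as sentence or paragraph boundaries.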

### Prompt engineering

Another improvement consists of **adapting the prompt to give the LLM more context information on the use case**. For example, here we can provide more details about Dataiku (e.g. "Dataiku is an end-to-end AI and data science platform...") and additional guidance on the proper way to answer (e.g. "be concise"). 

In the [4. Prompt engineering](flow_zone:fjM5DtI) Flow zone, we do it simply by modifying the prompt in the [Prompt recipe](recipe:compute_prompt_scored_evaluation_dataset).
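As an illustration only, here is a sketch of what such an enriched prompt could look like when assembled in Python. The template wording, the `build_prompt` function and its parameters are assumptions made for this example and do not reproduce the prompt used in the recipe. The context sentence is the one quoted above and is left truncated as in the text.

```python
# Context sentence quoted from the text above; it is truncated there and kept as-is.
USE_CASE_CONTEXT = "Dataiku is an end-to-end AI and data science platform..."
GUIDANCE = "Be concise."

# Hypothetical template: use case context first, then guidance, chunks and question.
PROMPT_TEMPLATE = """{context}

Answer the question using only the excerpts below. {guidance}

Excerpts:
{excerpts}

Question: {question}
Answer:"""


def build_prompt(question, chunks, context=USE_CASE_CONTEXT, guidance=GUIDANCE):
    """Fill the template with the retrieved chunks and the user's question."""
    excerpts = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return PROMPT_TEMPLATE.format(
        context=context, guidance=guidance, excerpts=excerpts, question=question
    )


print(build_prompt("How do I create a scenario?", ["Scenarios automate jobs."]))
```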

### Hybrid search

The baseline RAG pipeline described above uses a semantic similarity model to retrieve relevant chunks from the set of documents. Such a model was pre-trained on a large corpus of internet documents and it is generally quite effective for a wide range of documents. However, it can fail when the documents or the user's queries include some highly specific jargon, for example some acronyms specific to a given organization and unused in public documents.

In such a case, a simple improvement is to **combine the semantic similarity search with an exact keyword search**. As shown in the [5. Hybrid search](flow_zone:T6tPm60) Flow zone, this requires [building](recipe:compute_J8NbddOj) an index for the exact keyword search and [combining](recipe:compute_hybrid_scored_evaluation_dataset) the corresponding retriever with a standard dense retriever, which is easy with the [EnsembleRetriever class in LangChain](https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble).
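To give an intuition of how two retrievers can be combined, here is a small sketch of weighted reciprocal rank fusion, one common way to merge ranked lists. It is an assumption that this rule is a good stand-in for what the recipe does. The function name, the constant `c=60` and the chunk identifiers are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, weights=None, c=60):
    """Merge several ranked lists of chunk ids into a single ranked list.

    Each list contributes weight / (c + rank) to the score of a chunk,
    so chunks ranked high by several retrievers come first.
    """
    weights = weights or [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += weight / (c + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


keyword_results = ["a", "b", "c"]  # hypothetical output of the keyword retriever
dense_results = ["b", "d", "a"]    # hypothetical output of the dense retriever
print(reciprocal_rank_fusion([keyword_results, dense_results]))
```

Chunks returned by both retrievers ("a" and "b" here) end up ahead of chunks returned by only one of them. The `weights` argument lets one retriever count more than the other.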

### Parent-children chunk retrieval

The text of the chunks in the baseline RAG pipeline is used both to compute the embeddings for the semantic similarity search during the retrieval step and to feed the prompt if the chunks have been retrieved. This creates a dilemma: if a chunk is too short, it may not be informative enough, and if it is too long, this could dilute its content and make it harder to retrieve. Several techniques such as **parent-children retrieval** or **auto-merging retrieval** aim at overcoming this dilemma. The documents are split into relatively small chunks that allow precise retrieval, and the content of these chunks is enriched with the surrounding text before being included in the prompt.

The [6. Parent-children chunk retrieval](flow_zone:LMMCK3R) Flow zone illustrates a parent-children retrieval RAG pipeline based on LlamaIndex. It includes a [code recipe](recipe:compute_DpAtwQpa) to build a hierarchical index and a [code recipe](recipe:compute_parent_scored_evaluation_dataset) to generate the answers with this index.
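The principle can be sketched in a few lines of plain Python, independently of LlamaIndex. In this toy example, which does not reflect the code of the recipes, the children are sentences, the parents are paragraphs, and a simple word overlap stands in for the embedding similarity. All function names are hypothetical.

```python
def build_children(parents):
    """Split each parent chunk into sentences that remember their parent."""
    children = []
    for parent_id, parent in enumerate(parents):
        for sentence in parent.split(". "):
            children.append((sentence, parent_id))
    return children


def overlap(query, text):
    """Toy relevance score: number of shared lowercase words.

    This is only a stand-in for an embedding-based similarity.
    """
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query, parents, children, top_k=1):
    """Search over the small children, but return the larger parents."""
    ranked = sorted(children, key=lambda child: overlap(query, child[0]), reverse=True)
    parent_ids = []
    for _, parent_id in ranked[:top_k]:
        if parent_id not in parent_ids:  # avoid returning the same parent twice
            parent_ids.append(parent_id)
    return [parents[i] for i in parent_ids]


parents = [
    "Recipes transform datasets. A Prepare recipe cleans data. It has many processors.",
    "Scenarios automate jobs. A trigger starts a scenario. Reporters send alerts.",
]
children = build_children(parents)
print(retrieve_parents("what starts a scenario", parents, children))
```

The query matches a single short sentence, but the whole paragraph is what would be passed to the prompt.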

## Evaluating a RAG pipeline

Among all the potential improvements of a RAG pipeline (and the previous section only offers a glimpse of them), how can we identify the most promising ones and measure their added value? The key is to evaluate a RAG pipeline in a way that:
- Is accurate enough;
- Realistically reflects the diversity of the potential questions asked;
- Does not create an excessive manual burden;
- Points to promising improvements.

This requires **a mix of manual and automated evaluations and a mix of qualitative and quantitative assessments**.

### Human evaluation

![annotation.png](ZqcEUdUzEKp4)

The [7. Human evaluation](flow_zone:h2AbQmP) Flow zone shows two ways to assess the generated answers:
- One using the [record labeling](https://doc.dataiku.com/dss/latest/machine-learning/labeling.html#record-labeling) feature. A human annotator can [grade](labeling_task:UN3M4u909W) each generated answer, for example on a 1-to-5 scale (cf. image above);
- One using a custom [web app](web_app:UstrtIC). The human annotator can visualize two generated answers, along with the question and the reference answer and choose the one he or she prefers (cf. image below).

![comparison.png](V9GpsGxnKLny)

In the second case, we can [compute](recipe:compute_ranking) the corresponding [ranking of the various approaches evaluated](dataset:ranking) and generate a [win matrix](managed_folder:jGYaG5fn). Please note that the win matrix below was generated with only 10 questions and is therefore not a reliable evaluation of the approaches tested.

![heatmap.png](aL2s11PQU4Bk)
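For illustration, a win matrix and a simple ranking can be derived from pairwise preferences as sketched below. Ranking by win rate is an assumption made for this example. The recipe of the project may rely on a different method, and the approach names and comparison results here are made up.

```python
from collections import defaultdict


def win_matrix(comparisons):
    """Count, for each pair (winner, loser), how often the winner was preferred."""
    wins = defaultdict(lambda: defaultdict(int))
    for winner, loser in comparisons:
        wins[winner][loser] += 1
    return {winner: dict(losers) for winner, losers in wins.items()}


def rank_by_win_rate(comparisons):
    """Rank the approaches by the share of comparisons they won."""
    won = defaultdict(int)
    played = defaultdict(int)
    for winner, loser in comparisons:
        won[winner] += 1
        played[winner] += 1
        played[loser] += 1
    return sorted(played, key=lambda name: won[name] / played[name], reverse=True)


# Hypothetical annotations: each pair is (preferred approach, other approach).
comparisons = [
    ("hybrid", "baseline"),
    ("hybrid", "prompt"),
    ("prompt", "baseline"),
]
print(win_matrix(comparisons))
print(rank_by_win_rate(comparisons))
```

With few comparisons, such a ranking is very sensitive to individual annotations, which is why a larger set of questions is needed for a reliable evaluation.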

### Automated evaluation

Human evaluation is hardly scalable, because repeated evaluations are needed to test the added value of changes to the RAG pipeline. It is therefore important to also consider **automated metrics**.

In the [8. Automated evaluation](flow_zone:BGtBKE5) Flow zone, we [compute](recipe:compute_a3nnPHvq) several metrics:
- a ***statistical* metric**: BERT score;
- ***LLM-as-a-judge* metrics**: Answer correctness, Answer relevance, Context Relevance and Faithfulness;
- the **number of tokens** in the answer or in the prompt.

*LLM-as-a-judge* metrics use an LLM to provide a numerical grade and a qualitative justification. *BERT score* and ***Answer correctness*** both assess the extent to which the generated answer is similar to the reference answer while:
- ***Answer relevance*** assesses whether the answer seems relevant given the question;
- ***Context relevance*** assesses whether the retrieved chunks seem relevant given the question;
- ***Faithfulness*** assesses whether the answer is grounded in the retrieved chunks.

For a given use case, it is important to understand whether the automated metrics align with human judgment. For this we can [compute](recipe:compute_correlation_evaluation_methods) the [correlation](dataset:automated_evaluation_methods) between the automated metrics and the human scores.
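As a sketch of what such a computation can look like, the snippet below computes a Spearman rank correlation in plain Python. The choice of Spearman is an assumption for this example and the scores are made up. The recipe of the project may use another correlation measure.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


human_scores = [1, 2, 3, 4, 5]                # hypothetical 1-to-5 grades
automated_scores = [0.1, 0.3, 0.2, 0.8, 0.9]  # hypothetical metric values
print(round(spearman(human_scores, automated_scores), 3))
```

A rank correlation only checks whether the two evaluations order the answers in a similar way, which is convenient when the human grades and the automated scores are on different scales.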

### Visualization of generated answers

Tracking metrics is not enough to understand the failure cases of a RAG pipeline. Selecting the next potential improvements also requires analyzing the errors in the generated answers. For this we need to be able to **explore and visualize the generated answers**. A [web app](web_app:gVwnADo) is provided in the project for this.

With this web app, the user can:
- **Visualize all the answers** generated through the various approaches;
- **Visualize the chunks** associated with the generated answers;
- **Zoom in on a specific generated answer** and the corresponding chunks;
- **Visualize the scores** of the automated metrics, along with the corresponding **qualitative explanations for the LLM-as-a-judge metrics**;
- **Sort or filter** these answers by question, approach or quantitative score.

![visualization.png](Js7NFFXlhCPJ)

### Generation of a synthetic test dataset

As seen above, the LLM can be used to generate and evaluate answers. It can also be used to create a synthetic test dataset. In the [9. Test set generation](flow_zone:jrDn450) Flow zone, we leverage LlamaIndex to automatically [generate](recipe:compute_synthetic_test_dataset) a set of questions along with the reference answers and the corresponding chunks. This dataset can serve to compute the previous metrics. Additionally, since the ground truth context is known, we can also compute the precision and the recall of the retrieval step.
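For a single question, the precision and recall of the retrieval step can be computed as in the sketch below. The function name and the chunk identifiers are hypothetical, and the sketch assumes that chunks can be matched by identifier.

```python
def retrieval_precision_recall(retrieved_ids, relevant_ids):
    """Compare the retrieved chunks with the ground truth chunks of a question.

    Precision: share of retrieved chunks that are relevant.
    Recall: share of relevant chunks that were retrieved.
    """
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


precision, recall = retrieval_precision_recall(
    retrieved_ids=["c1", "c2", "c3", "c4"],  # chunks returned by the retriever
    relevant_ids=["c2", "c5"],               # ground truth chunks of the question
)
print(precision, recall)
```

Averaging these two values over all the questions of the synthetic test dataset gives an overall view of the retrieval quality.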

## Packaging the RAG pipeline in a web app

The example project includes a question answering [web app](web_app:kLJSVaK) that can easily be reused and adjusted to incorporate a RAG pipeline.

![webapp.png](lLKz3aaaatot)

# Next: Build your own advanced RAG pipeline

The project can be downloaded [here](https://downloads.dataiku.com/public/dss-samples/EX_ADVANCED_RAG/).

## Technical requirements

This project requires:
- features available starting from **Dataiku 12.5**;
- a Python 3.10 code environment named `py_310_sample_rag` with the packages and [resources](https://doc.dataiku.com/dss/latest/code-envs/operations-python.html#managed-code-environment-resources-directory) specified in [Appendix: code environment](article:5);
- a Python 3.9 code environment named `py39_rag` as defined in the Initial setup section of this [page](https://doc.dataiku.com/dss/latest/generative-ai/rag.html#initial-setup);
- an LLM connection and an embedding model connection specified in the project variables. You can get a list of all available LLM connections and embedding model connections with the `list_llms` [method](https://developer.dataiku.com/latest/api-reference/python/projects.html#dataikuapi.dss.project.DSSProject.list_llms);
- the [Text Extraction and OCR plugin](https://www.dataiku.com/product/plugins/tesseract-ocr/).

## How to reuse this project

Once you have imported the project, you can directly navigate the Flow.

If you want to use your own data, you can just replace the files in the [documents](managed_folder:o87Qdwlh) managed folder and add your own questions in the [questions](dataset:questions) dataset.

All the datasets are stored on the filesystem, so no remapping is needed. However, you have the option to [change the connection type](https://knowledge.dataiku.com/latest/courses/flow-views-and-actions/connection-changes-concept-summary.html#connection-changes) if you want to rely on a specific data storage type.

If you want to leverage certain parts of the Flow directly, keep in mind that Dataiku allows you to reuse elements at different levels. In particular it is possible to:
- [duplicate a whole project](https://knowledge.dataiku.com/latest/kb/governance/How-to-duplicate-a-DSS-project.html)
- [copy and paste entire subflows](https://knowledge.dataiku.com/latest/courses/flow-views-and-actions/connection-changes-concept-summary.html#copy-subflow)
- [copy and paste recipes](https://knowledge.dataiku.com/latest/kb/collaboration/How-to-copy-a-recipe-in-your-Flow.html) and remap input/output datasets
- [copy and paste preparation steps](https://doc.dataiku.com/dss/latest/preparation/copy-steps.html) within a Prepare recipe  

# Related Resources
- Blog posts on [advanced RAG](https://medium.com/data-from-the-trenches/from-sketch-to-success-strategies-for-building-and-evaluating-an-advanced-rag-system-edd7bc46375d) and [basic RAG](https://medium.com/data-from-the-trenches/digging-through-the-minutiae-query-your-documents-with-gpt-3-c4600635a55)
- [Introduction to Large Language Models with Dataiku](https://content.dataiku.com/llms-dataiku/dataiku-llm-starter-kit)
- [Dataiku LLM Starter Kit](https://gallery.dataiku.com/projects/EX_LLM_STARTER_KIT/)
- [Text Extraction and OCR plugin](https://www.dataiku.com/product/plugins/tesseract-ocr/)
