# Overview

This example project shows how **LLM-powered agents** can be implemented in Dataiku. In particular, following an upcoming [blog post](https://medium.com/data-from-the-trenches), it illustrates how to convert common Dataiku objects into tools with simple, generic Python functions. **Please read the [blog post](https://medium.com/data-from-the-trenches) before exploring this project.**

More specifically, this example project illustrates how to:
- Create tools from various Dataiku objects:
  - An **endpoint** corresponding to a **Python function**;
  - An **endpoint** corresponding to a machine learning **model**;
  - A collection of **datasets**;
  - A set of **documents**;
- Build an agent that leverages these tools.

# Fictitious example

In this example project, we imagine the **case of an internet provider handling [customer requests](dataset:customer_requests)** related to Internet connection problems, password resets, options, and so on. This company already knows how to perform the corresponding tasks (running diagnostic tests, resetting passwords, activating options...) through its IT system. The key contribution of an LLM agent in this context is to interpret a natural language query, then identify and plan the right actions to take. We will see below how this company can easily create tools to be used by an LLM if the corresponding resources already exist in Dataiku.

# Converting Dataiku objects into tools

The simplest way to create a LangChain tool is to provide:
- A **name**;
- A **description**;
- A **Python function**.
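
As a rough illustration of these three components, the sketch below uses a small stand-in class rather than LangChain itself, so that it runs without dependencies. `SimpleTool`, `get_product_info` and the one-entry catalog are hypothetical; in LangChain, the same three pieces are typically passed to the `Tool` class.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SimpleTool:
    """Hypothetical stand-in for a LangChain tool: a name, a description and a function."""
    name: str
    description: str
    func: Callable[[str], str]

    def run(self, tool_input: str) -> str:
        # The agent passes a string and reads back a string observation.
        return self.func(tool_input)


def get_product_info(product_name: str) -> str:
    # Hypothetical function; a real tool would call an endpoint or a database.
    catalog = {"Iphone 10": "The Iphone 10 is available."}
    return catalog.get(product_name, "Unknown product")


product_tool = SimpleTool(
    name="get_product_info",
    description="Give the characteristics of a product. Input is the product name.",
    func=get_product_info,
)
print(product_tool.run("Iphone 10"))
```

The description matters as much as the function: it is the only information the LLM has when deciding whether and how to call the tool.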

The Libraries section of this example project includes several generic Python functions that convert certain Dataiku objects into tools. To do so, these functions need to derive the name, the description and the Python function from the Dataiku objects and inputs provided by the developer. **Please note that you cannot view the Libraries section of a project on the public Dataiku gallery**. If you want to see the generic Python functions, you will need to download the project and import it into your own instance.

### Python function endpoint

The function `create_tool_from_python_endpoint` converts an [API endpoint exposing a Python function](https://doc.dataiku.com/dss/latest/apinode/endpoint-python-function.html) into a tool, which will simply call this endpoint:

```python
def create_tool_from_python_endpoint(url, description, name=None, list_arguments=None, api_key="", **kwargs)
```

The arguments are:
- ```url```: the URL of the endpoint;
- ```description```: the description of the tool;
- ```name```: the name of the tool;
- ```list_arguments```: the list of the names of the arguments of the function. If it is not given, it is inferred from a call to the endpoint (which could be harmful if the endpoint actually modifies something);
- ```api_key```: the API key granting the authorization to access the endpoint;
- ```**kwargs```: additional named arguments to be included in the API calls. These are useful to restrict the tool given the context. For example, if the purpose of an endpoint is to modify customers' accounts, if it includes a `customer_id` argument, and if an agent is meant to handle a request from a certain customer, we can pass `customer_id=123` when creating the tool. In this way, we prevent the tool from affecting other customers' accounts. Please note that the description should be changed accordingly in this case.
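
The effect of `**kwargs` can be sketched as follows. This is not the project's implementation: `make_endpoint_function`, `call_endpoint` and `fake_call_endpoint` are hypothetical names, and the HTTP call is replaced by an injected callable so that the example runs offline.

```python
from typing import Any, Callable, Dict, List


def make_endpoint_function(
    call_endpoint: Callable[[Dict[str, Any]], Any],
    argument_names: List[str],
    **fixed_arguments: Any,
) -> Callable[[str], str]:
    """Build the function wrapped by a tool, with some arguments pinned in advance."""
    # Only the arguments that are not pinned are left for the agent to fill in.
    free_names = [n for n in argument_names if n not in fixed_arguments]

    def tool_function(tool_input: str) -> str:
        values = [v.strip() for v in tool_input.split(",")] if free_names else []
        if len(values) != len(free_names):
            return f"Expected {len(free_names)} argument(s): {', '.join(free_names)}"
        payload = dict(zip(free_names, values))
        # Pinned values are applied last so the agent cannot override them.
        payload.update(fixed_arguments)
        return str(call_endpoint(payload))

    return tool_function


def fake_call_endpoint(payload: Dict[str, Any]) -> str:
    # Hypothetical stand-in for the HTTP request to the API node.
    return f"Option {payload['option']} deactivated for customer #{payload['customer_id']}"


cancel_option = make_endpoint_function(
    fake_call_endpoint, ["customer_id", "option"], customer_id=123
)
print(cancel_option("TV"))
```

Because `customer_id` is pinned when the tool is created, the agent only supplies the option name and has no way to target another customer.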

For example, with an endpoint exposing a Python function providing information on products, we can create a tool based on this endpoint with the following snippet:
```python
python_endpoint_description = """Give the characteristics of a product. Input is the name of the product and output is the description. The only products available are:
    -"Iphone 10"
    -"Iphone 11"
    -"Samsung galaxy S9"
    -"Samsung galaxy S10" """

create_tool_from_python_endpoint("http://xxx.yy.z.aa:bbbbb/public/api/v1/example_agent/get_product_info", python_endpoint_description, api_key=API_KEY)
```
Here is an example of the tool being used by an agent:
```
The customer is asking details about an Iphone 10.
Action: get_product_info
Action Input: 'Iphone 10'
Observation: The Iphone 10 costs 300 dollars, it is available and can be sent in 3 days.
```
### Model endpoint

The approach is similar with an [endpoint exposing a model](https://doc.dataiku.com/dss/latest/apinode/endpoint-std.html). The generic Python function creating the tool has a slightly different signature:

```python
def create_tool_from_model_endpoint(url, description, name=None, post_process=lambda x: x, api_key="", key="id", value=None)
```

The arguments are:
- ```url```: the URL of the endpoint;
- ```description```: the description of the tool;
- ```name```: the name of the tool;
- ```post_process```: a Python function that converts the output of the model into an explicit string that is more easily understood by the LLM agent;
- ```api_key```: the API key granting the authorization to access the endpoint;
- ```key```: the name of the parameter used to enrich the prediction query. We only cover the case where the model endpoint leverages the [data enrichment feature](https://doc.dataiku.com/dss/latest/apinode/enrich-prediction-queries.html). `key` refers to the lookup key in the reference dataset;
- ```value```: the value with which the model is called. As with the `**kwargs` of the [Python function endpoint](#python-function-endpoint-1), we might want to impose this value when creating the tool.
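
A sketch of how `key`, `value` and `post_process` might fit together is given below. It assumes that the endpoint accepts a JSON body of the form `{"features": {...}}` and returns the prediction under `result.prediction`; this payload shape and the names `make_model_function`, `send_query` and `fake_send_query` are assumptions made for illustration, not the project's actual code.

```python
from typing import Any, Callable, Dict, Optional


def make_model_function(
    send_query: Callable[[Dict[str, Any]], Dict[str, Any]],
    key: str = "id",
    value: Optional[Any] = None,
    post_process: Callable[[Any], str] = str,
) -> Callable[[str], str]:
    """Build the function wrapped by a model-based tool."""

    def tool_function(tool_input: str) -> str:
        # If a value was imposed at creation time, the agent's input is ignored.
        lookup_value = value if value is not None else tool_input.strip().strip("'\"")
        response = send_query({"features": {key: lookup_value}})
        prediction = response["result"]["prediction"]
        # Turn the raw prediction into a sentence the LLM can act on.
        return post_process(prediction)

    return tool_function


def fake_send_query(body: Dict[str, Any]) -> Dict[str, Any]:
    # Hypothetical stand-in for the HTTP request: only customer 7 needs a visit.
    needs_visit = str(body["features"]["client_id"]) == "7"
    return {"result": {"prediction": "1" if needs_visit else "0"}}


labels = {"0": "No intervention is needed.", "1": "A technician must visit the customer."}


def describe(prediction: Any) -> str:
    return labels.get(str(prediction), "Unexpected output")


run_diagnostic = make_model_function(fake_send_query, key="client_id", post_process=describe)
print(run_diagnostic("'7'"))
```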

For example, the project includes a ```run_diagnostic``` model endpoint that takes a `client_id` and returns whether an intervention by a technician is needed. You can create a tool that leverages this model with the following snippet:

```python
import random

def post_process(x):
    # Convert the raw model output into an explicit sentence for the agent
    if str(x) == "0":
        return f"The customer can directly solve the connection problem by following the instructions provided in the FAQ: www.telco-operator/faq/{random.randint(1, 100)}. An intervention of a technician is not warranted."
    elif str(x) == "1":
        return "A technician must visit the customer to solve the connection problem. An appointment needs to be made."
    elif str(x) == "2":
        return "A technician must discuss with the customer over the phone to solve the connection problem. An appointment needs to be made."
    return "Unexpected output"
    
model_endpoint_description = """Determine the best course of action when a customer experiences an Internet connection problem.
This should be the first action to take if a customer reports an Internet connection problem.
The input should be a single integer representing the customer's id. E.g. '123'."""

create_tool_from_model_endpoint("http://xxx.yy.z.aa:bbbbb/public/api/v1/example_agent/run_diagnostic", model_endpoint_description, api_key=API_KEY, post_process=post_process, key="client_id")
```

Here is an example of an agent using this tool:
```
I should start by running a diagnostic to determine the best course of action for Mr. John Doe's Internet connection problem.
Action: run_diagnostic
Action Input: 1
Observation: the system is fine
```

### Datasets

We can create a tool that answers questions grounded in Dataiku datasets thanks to a text2SQL chain, i.e., an LLM chain that generates SQL queries and interprets their results. Here is the signature of the generic Python function creating such a tool:

```python
def create_tool_from_datasets(tag, name="query_datasets", description=None, datasets_restrictions=[], llm=None)
```

The arguments are:
- ```tag```: a tag identifying the datasets covered by the tool;
- ```name```: the name of the tool;
- ```description```: the description of the tool. If `None`, the description is built with the short description of the datasets;
- ```datasets_restrictions```: a list of restrictions (cf. below);
- ```llm```: the LangChain LLM used to generate SQL queries and interpret their results.

The `datasets_restrictions` argument restricts the rows of the datasets that the SQL queries can see. For example, if the agent handles the request of a specific customer whose `id` is `2`, such a restriction ensures that the tool does not reveal information about other customers. For instance, if the tool needs to query the [customers_info_sql](dataset:customers_info_sql) dataset, the string `customers_info_sql` is replaced by `(customers_info_sql WHERE client_id = 2)` in all queries.
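
The idea behind these restrictions can be sketched with `sqlite3` from the standard library. The sketch below is an assumption about how such a substitution could work, not the project's code: `apply_restrictions` is a hypothetical helper, it only rewrites table names that follow `FROM` or `JOIN`, and it substitutes an aliased, filtered subquery so that the rewritten query is valid in SQLite.

```python
import re
import sqlite3
from typing import Any, Dict, List


def apply_restrictions(query: str, restrictions: List[Dict[str, Any]]) -> str:
    """Replace each restricted table by a subquery limited to the allowed rows."""
    for restriction in restrictions:
        table, key = restriction["dataset_name"], restriction["key"]
        # int() keeps this sketch safe; other value types would need proper quoting.
        value = int(restriction["value"])
        subquery = f"(SELECT * FROM {table} WHERE {key} = {value}) AS {table}"
        pattern = rf"\b(FROM|JOIN)\s+{re.escape(table)}\b"
        query = re.sub(
            pattern, lambda m: f"{m.group(1)} {subquery}", query, flags=re.IGNORECASE
        )
    return query


# Small in-memory table with invoices for two customers (made-up amounts).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE customers_invoices_sql (client_id INTEGER, amount REAL)")
connection.executemany(
    "INSERT INTO customers_invoices_sql VALUES (?, ?)",
    [(1, 40.0), (1, 65.0), (2, 500.0)],
)

restrictions = [{"dataset_name": "customers_invoices_sql", "key": "client_id", "value": 1}]
restricted_query = apply_restrictions(
    "SELECT SUM(amount) FROM customers_invoices_sql", restrictions
)
total = connection.execute(restricted_query).fetchone()[0]
print(restricted_query)
print(total)
```

Even though the generated query asks for the sum over all invoices, only the rows of customer 1 contribute to the result.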

The SQL query is run via ```dataiku.SQLExecutor2``` if all datasets support SQL with the same execution engine, and with `pandas` and `sqlite` otherwise.

For example, the project uses a collection of three datasets containing customer information, invoice data and the availability of internet fiber. The corresponding tool can be created like this:
```python
from langchain.chat_models import ChatOpenAI

customer_id = 1
datasets_restrictions = [
    {"dataset_name": "total_customers_info", "key": "id", "value": customer_id},
    {"dataset_name": "customers_invoices_sql", "key": "client_id", "value": customer_id},
]
tag = "invoices"
llm = ChatOpenAI(temperature=0)

create_tool_from_datasets(tag, datasets_restrictions=datasets_restrictions, llm=llm)
```
In this case, neither a name nor a description is specified. The name is assigned a default value while the description is built with the following template:

name:
```
query_datasets
```

description:
```python
"""This is a tool you can use to get information from datasets. The input is the question you want to ask the datasets and the output will be the answer based on the datasets.
The available datasets are:
{}
Include in the question all relevant contextual information.
The input is the question you want to ask the answer of, not the actual query.
If a SQL syntax error is returned, try to rephrase the question or provide additional contextual information.
"""
```

Here is an example of the tool being used by an agent:

```
I need to check if fiber is available for Mrs. Samantha Lee.
Action: query_datasets
Action Input: "Is fiber available for customer with id 6?"
Observation: ('Fiber is available for customer with id 6', 'SELECT is_available\nFROM fiber_availabilities\nWHERE city = (SELECT city FROM total_customers_info WHERE id = 6)')
```

### Managed folder

Finally, we can create a tool answering questions with a collection of documents by leveraging a [**Retrieval-Augmented Generation**](https://medium.com/data-from-the-trenches/digging-through-the-minutiae-query-your-documents-with-gpt-3-c4600635a55) (RAG) approach. For this, we assume that a [LangChain FAISS vector database](https://python.langchain.com/docs/integrations/vectorstores/faiss) of these documents is already included in a managed folder.

The generic Python function creating such a tool has the following signature:

```python
def create_tool_from_managed_folder(folder_id, additional_instructions, name="query_documents", description=None, num_chunks=5, llm_chain=None, filters={}, embeddings=None)
```

The arguments are:
- ```folder_id```: the id of the managed folder including the FAISS index;
- ```additional_instructions```: the additional instructions to add to the prompt. Only used if `llm_chain=None`;
- ```name```: the name of the tool;
- ```description```: the description of the tool. If `None`, the description is based on the short description of the managed folder;
- ```num_chunks```: the number of chunks included in the prompt to answer the question;
- ```llm_chain```: an optional LangChain RAG chain;
- ```filters```: the filters applied to the metadata of the chunks to restrict the scope of the tool (cf. below);
- ```embeddings```: the embeddings used (provided as an Embeddings LangChain object).

The filters are the equivalent of the datasets restrictions in the [previous section](#datasets-1). They limit the retrieval of relevant chunks to those whose metadata satisfy the given conditions. This can, for example, prevent sensitive information from being inappropriately shared.
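
As an illustration of what such a metadata filter does, the sketch below applies it to a list of already-retrieved chunks. The chunk structure, the sample texts and the `filter_chunks` helper are hypothetical; the project may apply the filter differently, for example inside the vector store.

```python
from typing import Any, Dict, List


def filter_chunks(chunks: List[Dict[str, Any]], filters: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Keep only the chunks whose metadata matches every key/value pair in `filters`."""
    return [
        chunk
        for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in filters.items())
    ]


chunks = [
    {"text": "How to cancel an option...", "metadata": {"docs": "manual"}},
    {"text": "Internal pricing notes...", "metadata": {"docs": "internal"}},
]
allowed = filter_chunks(chunks, {"docs": "manual"})
print([chunk["text"] for chunk in allowed])
```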

In this example project, the managed folder used for this tool includes a manual that describes the procedures to answer common questions. You can create the tool like this:

```python
additional_instructions = """
Your answer should just be exactly what the customer should do according to the manual
Focus on answering the precise question, do not make a list of answers according to other questions. Just give the procedure for the precise question.
"""
# The keyword argument is `filters`, as in the signature above
filters = {"docs": "manual"}
create_tool_from_managed_folder("DQwRivV3", additional_instructions=additional_instructions, filters=filters)
```

In this case, neither a name nor a description is specified. The name is assigned a default value while the description is built with the following template:

name:
```
query_documents
```

description:
```python
"""This is a tool that uses documents to answer a question. 
The input is the question for which you want to obtain the answer based on the documents.
The documents consist of: {}
If the tool doesn't help you then you must try to continue without it.
"""
```

Here is an example of the tool being used:

```
I need to provide information about the benefits of the premium plan compared to the basic one. I will use query_documents to get this information from the manual.
Action: query_documents
Action Input: What are the benefits of the premium plan compared to the basic one?
Observation: According to the manual, to answer the question "What are the benefits of the premium plan compared to the basic one?", you should inform the customer that the premium plan offers multiple benefits, including faster data speed and premium TV channels such as sports channels. However, it is important to note that this list is not exclusive and the customer should be directed to visit our website "telecommunication/company.com" for more information. Additionally, the price of the premium plan is $20 per month.

Reference: 
- How to answer the question of a customer > Premium Plan > What are the benefits of the premium plan
```

# Implementing the agent

This [Python recipe](recipe:compute_agent_answer) shows how to build the agent corresponding to the fictitious example described in the [blog post](https://medium.com/data-from-the-trenches). Here, the appropriate tools may depend on the customer asking the question, so an agent with suitably restricted tools can be created on the fly for each request. The following function addresses a single request:
```python
def process_request(request, customer_id=None, datasets_restrictions=[], llm=ChatOpenAI(temperature=0), additional_instructions="", separator_for_additional_instructions="Begin!"):
```
The arguments are:
- ```request```: the request that we need the agent for;
- ```customer_id```: the id of the customer who made the request;
- ```datasets_restrictions```: the restrictions of the datasets for the [tool that uses datasets](#datasets-1);
- ```llm```: the LLM used to power the agent;
- ```additional_instructions```: instructions to add to the default prompt of LangChain;
- ```separator_for_additional_instructions```: string used to know where to include the additional instructions.
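
The role of `separator_for_additional_instructions` can be sketched with plain string manipulation. The prompt text and the `insert_instructions` helper below are hypothetical and only illustrate the idea of placing the extra instructions just before the separator.

```python
def insert_instructions(prompt: str, additional_instructions: str, separator: str = "Begin!") -> str:
    """Insert the additional instructions right before the first occurrence of the separator."""
    if not additional_instructions or separator not in prompt:
        # Nothing to insert, or no known position: leave the prompt unchanged.
        return prompt
    before, after = prompt.split(separator, 1)
    return f"{before}{additional_instructions.strip()}\n\n{separator}{after}"


# Hypothetical prompt; the real default prompt comes from LangChain.
default_prompt = "Answer the following questions as best you can.\n\nBegin!\n\nQuestion: {input}"
result = insert_instructions(default_prompt, "Always check the manual before acting.")
print(result)
```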

The function creates all the necessary tools and the corresponding agent, runs the agent on the request and returns the answer. You can see a sample of answers in the [answers](dataset:answers) dataset.

Here is an example of how an agent handles the following request: "Good afternoon. I'm not interested anymore in the television option. Can I stop paying for it?". We added a (quite arbitrary) condition in the manual that the agent can only cancel an option if the customer has less than $150 worth of invoices. We can see that the agent properly interprets the manual, computes the total amount of invoices and cancels the option.

```
1. The customer wants to cancel the TV option. I should check the manual to confirm if I can proceed with this request.
Action: query_documents
Action Input: Can I cancel a customer's TV option?
Observation: To cancel a customer's TV option, you need to follow the procedure outlined in the manual. First, check whether the customer has more than $150 worth of invoices. If the customer has less than $150 worth of invoices and wants to cancel their subscription, you should cancel any appointments they might have and cancel the options they have. However, if the customer has more than $150 worth of invoices, you should inform them that they will be contacted by the customer support team.

If the customer specifically requests to cancel an option without specifying the name, you should ask the customer which option they would like to deactivate. Once you have this information, you can proceed to cancel the requested option.

Please note that if the customer wants to suspend their subscription due to losing their phone, you can cancel their phone subscription.

For more information on the benefits of the premium plan, you can direct the customer to visit our website "www.telco-operator/premium".

References:
- Manual: Answering customer questions > Subscription cancellation
- Manual: Answering customer questions > Options
- Manual: Answering customer questions > Miscellaneous
- Manual: Answering customer questions > Premium Plan

2. The manual confirms that I can cancel the TV option for the customer. However, I need to check if the customer has more than $150 worth of invoices. I should use the customers_invoices_sql dataset to find this information.
Action: query_datasets
Action Input: What is the total amount of invoices for customer with id 7?
Observation: ('The total amount of invoices for customer with id 7 is 105', 'SELECT SUM(amount) AS total_amount\nFROM customers_invoices_sql\nWHERE client_id = 7')

3. The customer has less than $150 worth of invoices. Therefore, I can proceed to cancel the TV option for the customer.
Action: cancel_option
Action Input: TV
Observation: The TV option has successfully been deactivated for customer #7
```

# Next: test the example project on your own Dataiku instance

You can [download](https://downloads.dataiku.com/public/dss-samples/EX_LLM_AGENT/) this example project and import it into your own Dataiku instance.

## Technical requirements

This example project requires:

- An OpenAI API key stored as a user secret if you want to use OpenAI models. The key for this secret should be: `openai_key`;
- An API node if you want to test the tools based on either the Python function endpoints or the model endpoints. In this case, you will also need an [API key](https://doc.dataiku.com/dss/latest/apinode/api/admin-api.html#authentication) provided as a user secret with the `API_KEY` key;
- A Python environment named `py_39_sample_agent` with the following packages:
```
langchain==0.0.278
openai==0.28.0
tabulate==0.9.0
transformers==4.32.1
sqlalchemy==1.4.22
protobuf==3.20.0
pandasql==0.7.3
faiss-cpu==1.7.4
urllib3==1.26.16
sqlparse==0.4.4
```
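
Once these secrets are defined, they can be read from Python. The sketch below assumes that the authentication information has the shape returned by `dataiku.api_client().get_auth_info(with_secrets=True)`, that is, a dictionary with a `secrets` list of key/value pairs. `get_secret` is a hypothetical helper, and the sample dictionary stands in for the real call so that the example runs outside Dataiku.

```python
from typing import Any, Dict


def get_secret(auth_info: Dict[str, Any], secret_key: str) -> str:
    """Return the value of a user secret, or raise if it is not defined."""
    for secret in auth_info.get("secrets", []):
        if secret["key"] == secret_key:
            return secret["value"]
    raise KeyError(f"User secret '{secret_key}' is not defined")


# In a Dataiku notebook or recipe, auth_info would come from:
#   auth_info = dataiku.api_client().get_auth_info(with_secrets=True)
# The placeholder below is not a real key.
auth_info = {"secrets": [{"key": "openai_key", "value": "sk-placeholder"}]}
print(get_secret(auth_info, "openai_key"))
```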

## Using the example project

After having imported the project:
- if you want to test the tools based on the API endpoints, you first need to deploy the [API service](lambda_service:agents_api_service) to your API node. Once you have done this, modify the project's [global variables](https://knowledge.dataiku.com/latest/mlops-o16n/variables/concept-global-vs-project.html) `get_product_info_url` and `run_diagnostic_url` with the corresponding endpoints;
- you can change the connections of the [customers_info_sql](dataset:customers_info_sql), [customers_invoices_sql](dataset:customers_invoices_sql) and [fiber_availability_sql](dataset:fiber_availability_sql) datasets so that the tool based on these datasets can leverage a SQL database;
- you can modify the [requests](dataset:customer_requests) and obtain the corresponding [answers](dataset:answers) from the agent.

# Related resources

- [Introduction to Large Language Models With Dataiku](https://content.dataiku.com/llms-dataiku/dataiku-llm-starter-kit)
- [Dataiku's LLM Starter Kit](https://gallery.dataiku.com/projects/EX_LLM_STARTER_KIT/), which includes a simpler example of an agent
- [Dataiku for Generative AI](https://www.dataiku.com/product/generative-ai/)