# Description

This example project illustrates how to match records referring to the same entity across different datasets. More specifically, we set out to **match 2,000 companies** in our Customer Relationship Management (CRM) system **against a larger dataset of 20,057 companies** from an external PROVIDER, for data enrichment purposes. The challenge is that company IDs differ between the two datasets, and **company names do not always match**.

Key Takeaways:

* **Matching can be automated** in Dataiku by preprocessing names and fuzzy joining.
* The [Visual Edit](https://www.dataiku.com/product/plugins/visual-edit/) capability provides a no-code solution to set up webapps for **domain experts to validate automated matches and to find missing matches**.
* The output is a matching table whose contents are the result of automation and human review. It’s a new data source that can be leveraged in other analytics projects.

# About this Wiki

To make the most of this Wiki, we recommend first reading the project's accompanying [blog post](https://blog.dataiku.com/accelerating-entity-resolution), which gives an overview of how entity matching can be automated and how validation webapps can be created in Dataiku.

The current Wiki article provides:

* Technical details on the project's components: Flow, Scenarios, Dashboards and Webapps configuration
* Instructions to reuse this project and apply it to your own data

The next article in the Wiki discusses [how to use Entity Resolution in practice](article:2).

# Components

## Flow

**[Load Data](flow_zone:t1fcwwG)**:
- Two Sync recipes copy example edit logs into the _editlog_ datasets that are used and maintained by the Visual Edit webapps. They are only used for demonstration purposes, so that the webapp looks like it’s been used.
- A [Stack recipe](recipe:compute_entities_ext) uses the [entity_ext_none](dataset:entity_ext_none) editable dataset to add a "NONE" company with ID `0` at the top of the external dataset, so that it can be picked in the webapps to indicate that no matching entity exists.

**[Automated Matching](flow_zone:SBgMyvR)**:
- Preprocessing:
  - [business_structure](dataset:business_structure) is an editable dataset that holds the list of acronyms to be removed from company names. Its contents are read by the [Update business structure](scenario:UPDATEBUSINESSSTRUCTURE) scenario and written into a `business_structure` project variable.
  - Both input datasets get processed by [Prepare recipes](recipe:compute_entities_ref_prepared) with the same steps (including removal of acronyms based on the `business_structure` project variable).
- Fuzzy Matching:
  - We [Fuzzy Join](recipe:compute_companies_joined) the resulting datasets based on their names, keeping matches where the Damerau-Levenshtein textual distance between preprocessed names is less than 20%, and where country and industry values match perfectly.
  - We [Prepare](recipe:compute_companies_joined_prepared) the Fuzzy Join output: we extract `Confidence` values based on textual distance, and add a `Match Type` column that takes one of the values below. A [Pie chart](insight:G1p985d) showing counts by match type is added to the Automated Matching page of the Matching Management dashboard.
    - If no match was found: "Manual Matching Required".
    - If Confidence is above 90%: "Automatically Validated".
    - Otherwise: "Validation Required".
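
The logic of this zone can be approximated outside of Dataiku with the minimal sketch below. It is not the implementation used by the Prepare and Fuzzy Join recipes: the acronym list, the function names, the normalization of the distance by the length of the longer name, and the definition of confidence as one minus the relative distance are all assumptions made for illustration. The exact-match conditions on country and industry are left out for brevity.

```python
import re

# Hypothetical acronym list; in the project it is read from the
# `business_structure` project variable.
BUSINESS_STRUCTURES = {"inc", "ltd", "llc", "sa", "gmbh", "corp"}
MAX_RELATIVE_DISTANCE = 0.20  # keep candidates strictly below this threshold
AUTO_VALIDATION_CONFIDENCE = 0.90


def preprocess(name: str) -> str:
    """Lowercase, replace punctuation with spaces, drop business structure acronyms."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in BUSINESS_STRUCTURES)


def osa_distance(a: str, b: str) -> int:
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def relative_distance(ref_name: str, ext_name: str) -> float:
    """Distance between preprocessed names, as a fraction of the longer name (assumption)."""
    a, b = preprocess(ref_name), preprocess(ext_name)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # nothing left to compare: treat as no match
    return osa_distance(a, b) / longest


def is_candidate(ref_name: str, ext_name: str) -> bool:
    """True if the pair would be kept as a match candidate."""
    return relative_distance(ref_name, ext_name) < MAX_RELATIVE_DISTANCE


def match_type(confidence):
    """Map a confidence value (None when no match was found) to a Match Type."""
    if confidence is None:
        return "Manual Matching Required"
    if confidence > AUTO_VALIDATION_CONFIDENCE:
        return "Automatically Validated"
    return "Validation Required"
```

With these assumptions, "Acme Inc." and "ACME Ltd" both preprocess to `acme`, so their relative distance is 0 and the pair would be automatically validated.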
    
**[Split Matches](flow_zone:udTezJ6)**:
- For each entity of the reference dataset, we only keep its closest match. We do this by [Grouping](recipe:compute_companies_joined_prepared_by_id_companies_ref) matches by `ID` of the reference dataset.
- We [Split](recipe:split_matches_edited_prepared) into 3 datasets: automatically validated matches, missing matches, and match suggestions.
- We [Prepare](recipe:compute_match_suggestions_prepared) the last two for usage in the Visual Edit webapps (see the Dashboards section below), by [adding empty feedback columns as instructed in the Visual Edit documentation](https://dataiku.github.io/dss-visual-edit/validate): `Validated` (boolean) for match suggestions, and `Comments` (text) for both. We also reorder columns:
  - We want to start with the `Validated` column of checkboxes, so that marking matches as valid looks like ticking items off of a checklist.
  - We want `Confidence` to follow, since it's likely to be used to sort the table.
  - We want the `Matched Entity` to be right after the company `Name` (for easy comparison), followed by `Comments` (to share any insights on the matching).
  - `Country` and `Industry` are moved to the end because they will be hidden or only used for filtering purposes.
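
As an illustration of the first two steps of this zone, here is a minimal sketch in plain Python. Rows are modeled as dictionaries; the function names are hypothetical, and treating a missing `Confidence` as 0 when comparing candidates is an assumption. In the project itself these steps are performed by the Group and Split recipes.

```python
from collections import defaultdict


def closest_match_per_entity(candidates):
    """Keep, for each reference `ID`, the candidate with the highest Confidence."""
    best = {}
    for row in candidates:
        current = best.get(row["ID"])
        # A missing Confidence (no match found) is treated as 0 (assumption).
        if current is None or (row["Confidence"] or 0.0) > (current["Confidence"] or 0.0):
            best[row["ID"]] = row
    return list(best.values())


def split_by_match_type(rows):
    """Group rows into one list per Match Type value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["Match Type"]].append(row)
    return dict(groups)
```

Calling `split_by_match_type(closest_match_per_entity(candidates))` returns at most three lists, keyed by "Automatically Validated", "Manual Matching Required", and "Validation Required".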
  
**[Webapp Internals](flow_zone:d4I4UQO)**:
- This Flow Zone contains _editlog_ datasets managed by the Visual Edit webapps, and _edits_ and _edited_ datasets built from them.
- We [Prepare](recipe:compute_match_suggestions_statuses) both edited datasets to make it easy to monitor progress: a `Status` column indicates whether each match is "Pending", "Matched", or "Unmatchable". [Pie charts](insight:onjXhk9) showing counts by Status are added to the Matching Management dashboard.

**[Dispatch Edits](flow_zone:sukZYRX)**:
- From the _edits_ datasets we use [Split](recipe:split_match_suggestions_edits) recipes to extract (validated) rows where a matching entity was actually found.
- We [Stack](recipe:compute_matching_table) these rows to the automatically validated matches, so that they all make their way into the [matching table](dataset:matching_table).
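
The `Status` values and the dispatch step can be sketched together as follows. The exact rules used by the Prepare and Split recipes are not spelled out in this article, so the rule below is an assumption: a row is "Pending" until it has been reviewed, "Unmatchable" when the "NONE" entity (ID `0`) was picked, and "Matched" otherwise. Function names are hypothetical and rows are modeled as dictionaries.

```python
NONE_ID = "0"  # ID of the "NONE" placeholder added to the external dataset


def status(row, has_validation_column):
    """Assumed rule for the Status column of a reviewed row."""
    matched = row.get("Matched Entity")
    if has_validation_column and not row.get("Validated"):
        return "Pending"  # suggestion not validated yet
    if matched in (None, ""):
        return "Pending"  # no entity picked yet
    return "Unmatchable" if str(matched) == NONE_ID else "Matched"


def build_matching_table(auto_validated, suggestions, missing):
    """Stack automatically validated matches with reviewed rows that found an entity."""
    table = [(r["ID"], r["Matched Entity"]) for r in auto_validated]
    table += [(r["ID"], r["Matched Entity"]) for r in suggestions if status(r, True) == "Matched"]
    table += [(r["ID"], r["Matched Entity"]) for r in missing if status(r, False) == "Matched"]
    return table
```

Under this rule, rows mapped to "NONE" and rows still awaiting review never reach the matching table.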

## Scenarios

This project implements the [automation Scenarios recommended in the Visual Edit documentation](https://dataiku.github.io/dss-visual-edit/build-complete-application#integrating-edits-in-automation-scenarios):

* [Build edited and downstream datasets](scenario:Build_matches_edited_and_downstream)
* [Update Source](scenario:COMPUTEMATCHES)
  * One specificity is that it starts with the previously mentioned [Update business structure scenario](scenario:UPDATEBUSINESSSTRUCTURE)
  * It should be run to take into account any changes in...
    * The CRM data
    * The PROVIDER's data
    * The list of business structure acronyms to filter out from company names
    * The matching pipeline
* [Commit Edits](scenario:UPDATE)
* [Reset Edits](scenario:RESET)

Each Scenario applies to both webapps at the same time.

## Dashboards

**[Matching Management](dashboard:DEuqpap)**:
This dashboard would be used by Data Stewards when the project is running in production:
*  **Rerun the matching pipeline**  to take into account any changes in the source data or in the list of business structure acronyms, using a button that triggers the Update Source scenario discussed above.
* **Review results**. Note: this could also be used by the owner of the data pipeline, to help adjust its parameters.
* **Monitor progress**  of work done via both webapps.
  
**[Match CRM against PROVIDER](dashboard:JDVodKt)**:
This dashboard would be used by Domain Experts when the project is running in production:
* **Validation and manual matching webapps** powered by Visual Edit, each embedded in its own page of the dashboard.
* **Help page** explaining how matched entities can be edited or validated, providing tips on reviewing strategy, and how to best set up the data table to make work easier.

## Visual Edit Webapps

### Settings

The validation webapp is based on the [match_suggestions dataset](dataset:match_suggestions) and the manual matching webapp is based on the [matches_missing dataset](dataset:matches_missing).

Here are more details on the columns found in the validation webapp:

- Entities are grouped by industry and then by country.
- 2 feedback columns were added (all values initially empty):
  - a validation column with checkboxes to mark uncertain matches as valid;
  - a comments column.
- When necessary, each company's `Matched Entity` can be edited via a dropdown menu that shows company names found in the external provider's dataset, rather than their IDs.
- Confidence scores are displayed for guidance (based on textual distance).

This translates into the following settings:

* **Primary keys:**
  * `Entity ID` - Text - hidden
* **Display-only columns:**
  * `Confidence` - Numerical
  * `Name` - Text - used to sort data initially
  * `Country` - Text - used as grouping column
  * `Industry` - Text - used as grouping column
* **Editable columns:**
  * `Matched Entity`
    * Categorical
    * Machine-generated values representing entity IDs in the external dataset
    * Edited via dropdown with search functionality ("starts with")
      * 20k options listed in Linked Dataset [entities_ext_stacked](dataset:entities_ext_stacked)
      * Showing the `Name` instead of each option's `ID`, and `Country` and `Industry` as lookup columns, to make it easier for the user to pick the right option.
* **Feedback columns:**
  * `Validated` - Validation column - Boolean - False by default
  * `Comments` - Comments column - Text - empty by default

See the actual implementation of these settings: [Validate Matches](web_app:SxOehJF).

The manual matching webapp has the same dropdown menus for Matched Entity, but values are initially empty, and there is no validation column.
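
To make the dropdown behavior concrete, here is a minimal sketch of a "starts with" search over the linked dataset. It is not how Visual Edit is implemented; the function name, the `limit` parameter, and the shape of the returned options are assumptions. It only illustrates that the user sees names and lookup columns while the stored value is the entity ID.

```python
def search_options(linked_rows, prefix, limit=10):
    """Return dropdown options whose Name starts with `prefix` (case-insensitive)."""
    prefix = prefix.casefold()
    options = []
    for row in linked_rows:
        if row["Name"].casefold().startswith(prefix):
            options.append({
                "value": row["ID"],    # what gets stored in `Matched Entity`
                "label": row["Name"],  # what the user sees
                "lookup": {"Country": row["Country"], "Industry": row["Industry"]},
            })
        if len(options) == limit:
            break
    return options
```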

### Editlogs

Each row of the editlog corresponds to a single edit. Columns include the date and time of the edit, the username, key values that identify the edited row, the name of the edited column, and the new value it was given. Notice that when the `Matched Entity` column was edited, Visual Edit stored company external IDs, even though the interface showed company names.
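
The relationship between an editlog and the edited data can be illustrated with the sketch below, which replays edits in chronological order so that the latest value wins for each row and column. The field names (`date`, `user`, `key`, `column_name`, `value`) and both function names are assumptions chosen to mirror the description above, not the plugin's actual schema or code.

```python
def replay_editlog(editlog):
    """Return {key: {column: latest value}} by replaying edits in date order."""
    edits = {}
    # ISO 8601 timestamps sort chronologically as plain strings.
    for entry in sorted(editlog, key=lambda e: e["date"]):
        edits.setdefault(entry["key"], {})[entry["column_name"]] = entry["value"]
    return edits


def apply_edits(original_rows, edits, key_column):
    """Overlay the replayed edits on copies of the original rows."""
    edited = []
    for row in original_rows:
        new_row = dict(row)
        new_row.update(edits.get(row[key_column], {}))
        edited.append(new_row)
    return edited
```

Note that the value replayed for `Matched Entity` is an external company ID, consistent with what the editlog stores.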

# Applying to your own data

## Requirements

As this project is based on Visual Edit, you should confirm that your Dataiku installation is [compatible](https://dataiku.github.io/dss-visual-edit/compatibility). Note that the project's datasets were configured to use a FileSystem data connection, in order to be compatible with the Dataiku Gallery for demonstration purposes; however, we recommend using one of the connections listed on the compatibility page, to ensure production readiness.

You'll need a reference and an external dataset of entities. If these entities are companies, the schemas of your own datasets are likely to be similar to the ones in this project. For other Entity Resolution problems, you may need to adapt the matching pipeline to different schemas of the input datasets. You may also need to adapt the configuration of the Visual Edit webapps; if so, we recommend using the [Qualification Framework](https://docs.google.com/document/d/1b7Y0A84qT337nfhnuF4xKpqlq_CghHXiSkMoPpJFzy8/edit?tab=t.0#heading=h.qnrk7w2ihhwm) as a first step.

## Reusing for Company Resolution

The project can be downloaded [here](https://downloads.dataiku.com/public/dss-samples/EX_ENTITY_RESOLUTION/). Import it into Dataiku as a new project and make sure it works properly with the example data we provide, before using your own company data.

### Reproducing on sample data

The project bundle includes example internal and external company datasets. Build the rest of the project’s datasets and get the webapps running with these scenarios:

* [Update Source](scenario:COMPUTEMATCHES): this will build the datasets used in Visual Edit, i.e. the datasets in the Automated Matching and Split Matches flow zones.
  * ⚠️ It will fail to build the edited datasets, because the edits datasets will not have been built.
* [Commit Edits](scenario:UPDATE): this will build the _edits_, _edited_, and downstream datasets all the way down to the matching table.
* [Restart webapps](scenario:RESTART_WEBAPPS).

The project bundle also includes example editlogs for match suggestions and for missing matches, for demonstration purposes and so that the webapp looks like it’s been used. These can be cleared from the webapps by running the [Reset Edits](scenario:RESET) scenario. Feel free to also delete the example editlog datasets and the Sync recipes that follow.

### Adapting to your own data

* Duplicate the project imported in the previous step.
* Delete the example editlog datasets.
* Replace [entities_ref](dataset:entities_ref) and [entities_ext_stacked](dataset:entities_ext_stacked).
  * Use this as an opportunity to change the project's datasets' connection if needed.
  * Make sure that column names match those used in this project (`Name`, `Industry`, `Country`).
  * Note that `entities_ext` is used as Linked Dataset in Visual Edit Webapps and only the first 10,000 rows will be used, unless this dataset is on a SQL connection.
* Review and edit [business_structure](dataset:business_structure).
* Run [Reset Edits](scenario:RESET) to initialize the _editlog_ and _edits_ datasets.
  * ⚠️ It will fail when "committing" the empty contents of the editlogs, because the original datasets will not have been built.
* Adjust parameters of the matching pipeline, such as the threshold for textual distance between entity names.
* Run [Update Source](scenario:COMPUTEMATCHES) and [Restart webapps](scenario:RESTART_WEBAPPS).
* Rename the [Match CRM against PROVIDER](dashboard:JDVodKt) dashboard to suit your needs, customize its Help page and each page's introductory text, test the webapps, and share them with Domain Experts for further tests.
* Test the [Matching Management](dashboard:DEuqpap) dashboard and share it with the Data Steward for further tests.
* Test the use of the matching table in another project.

### Deploying to production

Because this project has interfaces where users can enter data, which then gets processed, you'll need to have two instances of the project: one for development and one for production; each will have its own set of edits.

Once you've run successful tests with end-users on the development project, here are the steps to take for production usage:

* [Deploy your project](https://dataiku.github.io/dss-visual-edit/deploy) on an automation node.
* Run automated matching on the automation node.
* Have Domain Experts validate matches and the Data Steward publish results, as explained in the [blog post](https://blog.dataiku.com/accelerating-entity-resolution).
* Deploy downstream analytics projects which rely on the matching table.

## Extending to other types of Entities

The above can be adapted to other types of entities, such as products and persons, which are characterized by attributes other than name, industry and country. For this, you would need to adapt the [Fuzzy Join recipe](recipe:compute_companies_joined) to these attributes. Our sample project uses Text columns only, but [Join conditions](https://doc.dataiku.com/dss/latest/other_recipes/fuzzy-join.html#join-conditions) can also be set on Numeric and Geopoint columns.

### Reconciliation

If the nature of entities of the external dataset is different from that of the reference dataset, for instance if you need to match payments to invoices, you should consider our [Reconciliation Solution](https://www.dataiku.com/solutions/catalog/reconciliation/).

#### Features and limitations of the Reconciliation Solution

* Reconciliation supports 1-to-many and many-to-many mappings.
* Reconciliation has a simple configuration UI that removes the need to go into the Flow and customize recipes.
* Reconciliation has a "Focus" view which presents a single item from the reference (aka primary) dataset, along with several match candidates to Approve or Reject from the external (aka secondary) dataset. By focusing on a single item, it becomes possible to display the values of all attributes of this item and of its match candidates, which can be very powerful for some use cases.
* Reconciliation does not have dropdown editing / Linked Records.
* Reconciliation currently discards entities for which no match was found by its engine.

#### Supporting many-to-many mappings in the Entity Resolution project

* **Many-to-1**: Our webapp enables many-to-1 mappings; this means that the same company from the external dataset can be selected several times (which would only make sense if there are duplicate entities in the internal dataset).
* **1-to-Many**: The project would need to be modified by removing the data transformation that only keeps the closest match of each entity; doing so would make it possible to have several match suggestions per entity, and to validate several matches.
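
As a rough illustration of the 1-to-Many change, the sketch below keeps the best `n` candidates per reference entity instead of only the closest one. The function name and the default `n=3` are hypothetical; in the project this would mean changing or removing the Group recipe rather than writing code.

```python
from collections import defaultdict


def top_candidates(candidates, n=3):
    """Keep up to `n` candidates per reference `ID`, ordered by decreasing Confidence."""
    by_id = defaultdict(list)
    for row in candidates:
        by_id[row["ID"]].append(row)
    return {
        ref_id: sorted(rows, key=lambda r: r["Confidence"], reverse=True)[:n]
        for ref_id, rows in by_id.items()
    }
```

Each suggestion would then be a separate row in the validation webapp, so that several of them can be validated for the same entity.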

# Next Steps

[How to use Entity Resolution in practice](article:2)