# How to standardize text fields using fuzzy values clustering[¶](https://knowledge.dataiku.com/latest/kb/data-prep/prepare-recipe/How-to-standardize-text-fields-using-fuzzy-values-clustering.html#how-to-standardize-text-fields-using-fuzzy-values-clustering "Permalink to this headline")

When working with large amounts of disparate, user-entered text data, we often need to standardize or collapse entries into a resolved form.

For example, how can we get a computer to recognize that strings like “Abraham Lincoln”, “Abe Lincoln”, and “Abrahm Lincoln” are actually the same category? We want to map these close variations into a smaller number of clean labels.

In the image below, the *emp\_title* column has thousands of distinct values and a long tail of unique values (those that appear only once).

Text cleaning strategies like **normalization** that standardize on case and remove punctuation may help in this regard, but they don’t get us all of the way there. Similarly, if it is user-entered data, you might not be able to introduce a rule-based strategy using **regular expressions**.

## Fuzzy Matching[¶](https://knowledge.dataiku.com/latest/kb/data-prep/prepare-recipe/How-to-standardize-text-fields-using-fuzzy-values-clustering.html#fuzzy-matching "Permalink to this headline")

The answer lies in a technique called fuzzy matching.

When viewing a dataset in the Explore tab, select **Analyze** from the context menu of the column header to review the number of distinct and unique values and their distribution.

If you switch over from the Categorical tab to the Values Clustering tab, you have the option to search for clusters.

* You can choose a clustering strategy of “Fuzzy” or “Highly fuzzy” depending on how close you require values in the cluster to be. In this case, we’ll choose “Fuzzy”.

* You can also compare words or characters. For this use case, we’ll compare words.

If you repeat these steps in a **Prepare** recipe, you can take action on these clusters.

You can adjust the final, standardized label if needed. Then, you can click “Merge Selected” to perform a merge on all or a selection of clusters so that the original values are mapped to the new cluster label.

To take the first example, all of these 5 variations will now have the same label– “Dept of Homeland Security”.

This merge action creates a fully-transparent and editable **Find and Replace step** in the Prepare recipe with hundreds - or even thousands - of individual replacements.

Super simple and a huge time saver!

## What’s next?[¶](https://knowledge.dataiku.com/latest/kb/data-prep/prepare-recipe/How-to-standardize-text-fields-using-fuzzy-values-clustering.html#what-s-next "Permalink to this headline")

For more tips on the Prepare recipe, be sure to check out free Academy courses like Basics 102 or the Visual Recipes 101.
