
AI-based Identity Resolution: The Key for Linking Diverse Customer Data

Shelby Heinecke
Apr 18 - 5 min read

Written by Shelby Heinecke and Scott Nyberg.

Companies want a comprehensive view of their customers, enabling them to solve business and marketing challenges, such as personalization, segmentation, and targeting — but they face an uphill battle as they are drowning in data. For example, many companies cannot match the identity of a customer who visits their website with the same customer who visits their store. In fact, 33% of companies cannot glean actionable insights from their data and 30% simply cannot handle the volume.

How can a comprehensive customer view be efficiently achieved? It starts by leveraging multiple data sets. For example, companies may possess email interaction data, purchase records, and customer service records. These data sets are useful for understanding a customer’s needs.

However, a customer may be represented slightly differently across these data sets. For example, “Mr. John Doe” may be listed in one data set, while in another that same person may appear as “J. Doe”. Identity resolution, which means identifying customer identities across data sets and merging their data, can be a puzzling and time-consuming data cleaning problem that sometimes requires a company to manually search and merge customer data to support downstream analytics.

Could there be a better way? Enter Salesforce AI. Together with Salesforce Data Cloud engineering teams, Salesforce AI kicked off AI-based identity resolution in 2021, developing a fuzzy first-name matching approach that leverages a large language model (LLM) to match individuals.

That approach has since evolved into a soft matching system that lets companies choose how rigorous matching should be for each business case.

For a deep dive into this technical solution, read on…

How does AI-based fuzzy matching help with identity resolution?

As they examine customer data, companies must analyze multiple data sets in which a person’s name may not be represented consistently. For example, someone may have the first name Robert in one data source and appear as Bob in another. Fuzzy first name matching links those first names despite such real-world variations, enabling companies to identify a unique person across multiple data sets.

Diving deeper, the fuzzy first name matching model developed by Salesforce AI and Salesforce Data Cloud consists of a fine-tuned LLM and data-derived rules, and it classifies a pair of first names as either a match or not a match. For example, one record for a particular customer may show the name (or initial) “J”, while another may show “Jessica”. In this case, the AI model would consider the pair a match. By contrast, “Marisa” and “Mario” would not be a match.
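To make the idea of "rules plus a model" concrete, here is a minimal sketch of a binary first-name matcher. It is not the Salesforce implementation: the nickname table, the initial rule, and the 0.85 threshold are all assumptions for illustration, and Python's `difflib.SequenceMatcher` stands in for the fine-tuned LLM.

```python
from difflib import SequenceMatcher

# Hypothetical nickname rules; a real system would derive these from data.
NICKNAMES = {"bob": "robert", "bill": "william"}


def canonical(name: str) -> str:
    """Lowercase, trim, drop a trailing period, and expand known nicknames."""
    cleaned = name.strip().lower().rstrip(".")
    return NICKNAMES.get(cleaned, cleaned)


def is_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Return True if two first names are treated as the same person's name."""
    x, y = canonical(a), canonical(b)
    if not x or not y:
        return False
    if x == y:
        return True
    # Assumed rule: a single-letter initial matches any name that starts with it.
    if len(x) == 1 or len(y) == 1:
        return x[0] == y[0]
    # Stand-in for the model: plain string similarity compared to a threshold.
    return SequenceMatcher(None, x, y).ratio() >= threshold


print(is_match("J", "Jessica"))
print(is_match("Robert", "Bob"))
print(is_match("Marisa", "Mario"))
```

In this sketch, “J” and “Jessica” match through the initial rule and “Robert” and “Bob” match through the nickname table, while “Marisa” and “Mario” fall below the similarity threshold.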

However, matches can vary depending on the context, domain, and data sets. In highly regulated fields, such as the medical industry, a name record mismatch could have serious consequences. There, the name (or initial) “J” and “Jessica” would not constitute a match. In less regulated domains, however, “J” and “Jessica” may be a suitable match.

Understanding that there is not a “one-size-fits-all” approach to matching and that companies require a more sophisticated method for identifying their customers, Salesforce AI and Salesforce Data Cloud engineers took fuzzy matching to the next level with soft matching.

How does AI-based soft matching improve identity resolution?

Using advanced AI models, soft matching extends identity resolution to diverse data sets and domains by replacing binary answers with match scores, giving companies control over how their data is merged. Once they select their desired precision (low, medium, or high), an AI model returns the matching first names accordingly.

Examples of soft matching scoring.

What does this look like? High precision matches are stringent, nearly exact matches. This includes nicknames, whereby “William” and “Bill” would be considered a strong match, and punctuation, whereby “Mary-Joe” and “Mary Joe” are matches.

Medium and low precision matches allow for more fuzziness, enabling companies to capture a wider range of potentially matching customers. For example, the initial “S” and the name “Sharon”, or even loosely similar names (“Bob” vs. “Roberto”), deliver medium precision matches. Additionally, selecting these lower levels may be helpful if the data contains name misspellings or typos.

How should organizations select their levels of rigor? It depends on the domain, as the level of fuzziness that a user deems permissible for first names remains context dependent. For example, the medical field may select high precision matches to ensure the integrity of patient records. Alternatively, businesses might select medium or low matches to maximize the ROI of their data — significantly expanding the reach of their target market.
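One way to picture the precision levels is as thresholds applied to a match score. The sketch below assumes scores in the range 0 to 1. The threshold values, the scores, and the “Micheal” typo pair are invented for illustration and are not taken from the Data Cloud model.

```python
# Hypothetical cut-offs per precision level; real values would be tuned on data.
THRESHOLDS = {"high": 0.9, "medium": 0.6, "low": 0.4}

# Hypothetical scored pairs: (first name A, first name B, match score).
SCORED_PAIRS = [
    ("William", "Bill", 0.95),
    ("S", "Sharon", 0.70),
    ("Bob", "Roberto", 0.65),
    ("Micheal", "Michael", 0.50),
    ("Marisa", "Mario", 0.20),
]


def select_matches(scored_pairs, precision):
    """Keep the name pairs whose score meets the chosen precision level."""
    cutoff = THRESHOLDS[precision]
    return [(a, b) for a, b, score in scored_pairs if score >= cutoff]


for level in ("high", "medium", "low"):
    print(level, select_matches(SCORED_PAIRS, level))
```

Under these assumed numbers, a medical use case that selects "high" keeps only the nickname pair, while a marketing use case that selects "low" also picks up the initial, the loosely similar names, and the typo.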

How did the Salesforce AI team innovate soft matching?

Identity resolution research began by training the AI model to find matches between two first names, delivering binary results. However, the team did not want to produce just zeroes and ones. They challenged themselves to enhance their fine-tuned LLM into a soft matching AI model that would generate a smooth range of confidence scores, supporting high, medium, and low precision matches.

To produce this wide range of scores, the team trained a regularized multilayer perceptron (MLP) — a neural network — to align name similarity scores and name embeddings produced by their fine-tuned LLM to determine if two name strings are semantically similar. As a result, instead of producing too many values close to one or zero, the model produced a smoother distribution of scores, ranging from high to low.
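For readers who want a feel for the mechanics, the following is a rough sketch of a small MLP that turns two name embeddings into a score between 0 and 1, together with an L2-regularized squared-error loss. Everything here is an assumption: the article does not describe the feature construction, layer sizes, loss, or type of regularization, and the short lists stand in for LLM embeddings.

```python
import math
import random


def pair_features(emb_a, emb_b):
    """Combine two embeddings into one feature vector (an assumed design)."""
    diff = [abs(x - y) for x, y in zip(emb_a, emb_b)]
    prod = [x * y for x, y in zip(emb_a, emb_b)]
    return diff + prod


class TinyMLP:
    """One hidden ReLU layer followed by a sigmoid output."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0, 0.1) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.gauss(0, 0.1) for _ in range(n_hidden)]
        self.b2 = 0.0

    def score(self, features):
        """Forward pass: returns a soft match score in (0, 1)."""
        hidden = [max(0.0, sum(w * f for w, f in zip(row, features)) + b)
                  for row, b in zip(self.w1, self.b1)]
        logit = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return 1.0 / (1.0 + math.exp(-logit))

    def l2_penalty(self):
        """Sum of squared weights, used as the regularization term."""
        return (sum(w * w for row in self.w1 for w in row)
                + sum(w * w for w in self.w2))


def regularized_loss(model, features, target, weight_decay=1e-3):
    """Squared error against a target score plus a weighted L2 penalty."""
    pred = model.score(features)
    return (pred - target) ** 2 + weight_decay * model.l2_penalty()


# Toy 4-dimensional "embeddings"; real ones would come from the fine-tuned LLM.
features = pair_features([0.1, 0.2, 0.3, 0.4], [0.1, 0.25, 0.3, 0.5])
model = TinyMLP(n_in=len(features), n_hidden=4)
print(model.score(features), regularized_loss(model, features, target=1.0))
```

The sketch covers only the forward pass and the loss. Training would adjust the weights to reduce this loss, and the penalty term is one common way to discourage the extreme weights that push scores toward zero or one.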

Next, the AI team iterated with Salesforce Data Cloud engineers, tuning the MLP’s performance through various training approaches to better define the scores, and implemented the MLP model in Java.

Lastly, the Data Cloud team integrated the MLP into the Data Cloud platform, making the model available for customers.

How does AI-based soft matching overcome the global language barrier?

The soft matching model must support Data Cloud customers around the world, which creates a challenge as international first names may involve different conventions, accents, alphabets, and other variables. How does the AI team overcome this hurdle?

First, the team used a multilingual DistilBERT model that is pre-trained on over 100 different languages.

Next, the team fine-tuned the multilingual DistilBERT model on first name data across several languages. This further improved multilingual performance on first names.

Finally, the team leveraged multilingual nickname dictionaries to ensure that certain nicknames were consistently recognized.
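The nickname dictionary step can be sketched as a lookup applied after Unicode normalization. The dictionary entries and the accent-stripping step below are assumptions chosen for illustration, and the article does not say how the dictionaries are combined with the model.

```python
import unicodedata

# Illustrative entries only; a production dictionary would be far larger.
NICKNAMES = {
    "pepe": "jose",        # Spanish
    "саша": "александр",   # Russian
    "bill": "william",     # English
}


def normalize(name: str) -> str:
    """Case-fold the name and strip combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", name.strip().casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def canonical_name(name: str) -> str:
    """Map a nickname to its full form when the dictionary knows it."""
    key = normalize(name)
    return NICKNAMES.get(key, key)


def same_canonical_name(a: str, b: str) -> bool:
    """True when both names resolve to the same dictionary form."""
    return canonical_name(a) == canonical_name(b)


print(same_canonical_name("Pepe", "José"))
print(same_canonical_name("Саша", "Александр"))
print(same_canonical_name("Pepe", "William"))
```

Stripping accents is a simplification. In some languages an accent distinguishes two different names, so a real system may prefer to leave that judgment to the multilingual model.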
