
In our “Engineering Energizers” Q&A series, we shine a spotlight on the brilliant engineers fueling innovation at Salesforce. Today, we feature Torrey Teats, a Software Engineering Principal Architect who spearheaded the architectural vision and technical strategy for identity resolution in Data Cloud.
Discover how Torrey’s team revolutionized indexing to bypass Elasticsearch limitations, introduced a two-phase fuzzy matching system to rein in combinatorial candidate growth, and revamped merge logic to prevent memory failures in large-scale customer profile unification.
What is your team’s mission as it relates to scaling identity resolution in Data Cloud?
The mission revolves around designing, building, and scaling the identity resolution engine that powers unified profiles within Data Cloud. This engine transforms fragmented customer data from systems such as CRM, Snowflake, and S3 into coherent, accurate identities used across Salesforce applications.
The team oversees the entire lifecycle of this process, from match rule authoring and embedding-based candidate generation to scoring, unification, and publishing. Key engineering goals include achieving low-latency resolution, maintaining high match accuracy, minimizing cost-to-serve, and ensuring fault tolerance across diverse workloads.
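As a purely illustrative mental model, that lifecycle can be compressed into a few lines of Java. The record shape and the email-blocking heuristic below are hypothetical stand-ins, not Data Cloud’s actual interfaces:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LifecycleSketch {
    // Hypothetical source record; real Data Cloud records carry far more fields.
    record SourceRecord(String id, String email) {}

    public static void main(String[] args) {
        List<SourceRecord> fragments = List.of(
            new SourceRecord("crm-1", "pat@example.com"),
            new SourceRecord("s3-9", "PAT@EXAMPLE.COM"));

        // Candidate generation: block records on a normalized key.
        Map<String, List<SourceRecord>> candidates = fragments.stream()
            .collect(Collectors.groupingBy(r -> r.email().toLowerCase()));

        // Scoring and unification collapsed for brevity: each block becomes one profile.
        candidates.values().forEach(group ->
            System.out.println("unified profile: " +
                group.stream().map(SourceRecord::id).toList())); // publish step
    }
}
```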
For near-real-time data pipelines, the system aims for a sub-five-minute turnaround to keep customer data current. For batch ingestion, resolution must complete within one hour to ensure efficient data processing. The architecture must support horizontal scaling while maintaining correctness, even when handling billions of records.
This mission is essential to Data Cloud’s capability to deliver accurate Customer 360 views. To stay performant and reliable at scale, continuous evolution of infrastructure, indexing strategies, model tuning, and observability is imperative.
What architectural constraints in Data Cloud forced a rethinking of identity resolution indexing strategy?
Initially, the system depended entirely on a managed Elasticsearch service to retrieve candidate records. While this worked well for smaller datasets, performance issues arose as data volumes increased. Index creation turned into a significant bottleneck, and highly concurrent queries led to contention and degraded performance under load. These problems became particularly evident when datasets surpassed 50 million records, causing query delays and index instability that hindered the system’s ability to scale to enterprise levels.
To overcome these challenges, the indexing strategy for Data Cloud’s identity resolution engine was overhauled to a distributed Lucene-based model. Each Spark worker node now generates a local Lucene index, eliminating cross-node contention and enabling fully parallelized candidate retrieval across the Spark cluster.
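Below is a minimal sketch of that per-partition pattern. The two-field record layout, the normalized-name match key, and the 20-candidate limit are assumptions for illustration; the Lucene and Spark APIs shown are real, but Data Cloud’s actual schema and match rules are far richer:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.spark.api.java.JavaRDD;

public class LocalIndexRetrieval {
    // records: {id, normalizedName}; each partition indexes and queries only itself.
    static JavaRDD<String> candidatePairs(JavaRDD<String[]> records) {
        return records.mapPartitions(partition -> {
            ByteBuffersDirectory dir = new ByteBuffersDirectory(); // worker-local index
            IndexWriter writer =
                new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
            List<String[]> buffered = new ArrayList<>();
            while (partition.hasNext()) {
                String[] r = partition.next();
                buffered.add(r);
                Document doc = new Document();
                doc.add(new StringField("id", r[0], Field.Store.YES));
                doc.add(new StringField("nameKey", r[1], Field.Store.NO)); // exact-match key
                writer.addDocument(doc);
            }
            writer.close();
            IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
            List<String> pairs = new ArrayList<>();
            for (String[] r : buffered) {
                // Retrieve up to 20 candidates per record, all from the local index.
                for (ScoreDoc hit :
                        searcher.search(new TermQuery(new Term("nameKey", r[1])), 20).scoreDocs) {
                    pairs.add(r[0] + "," + searcher.doc(hit.doc).get("id"));
                }
            }
            return pairs.iterator();
        });
    }
}
```

Because every partition owns its own directory, there is no network hop during retrieval, which is what removes the contention seen with the shared Elasticsearch cluster.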
This new approach allowed Data Cloud to scale from handling 50 million to 2 billion source records in production-scale testing. The system maintains stable performance even under extreme concurrency, and the use of local indexes significantly reduces latency spikes. This redesign stands as the most significant technical advance in scaling the indexing layer.

Fields with fuzzy matching generate embeddings, which are then transformed into hash buckets of similar records.
What was the most significant technical challenge encountered while scaling identity resolution?
The introduction of fuzzy matching significantly increased processing complexity. Unlike deterministic matching, which compares exact values, fuzzy logic evaluates non-identical values using probabilistic scoring, leading to a much larger pool of potential matches.
To address this, a two-phase system was implemented. The first phase generates searchable hashes using a locality-sensitive hashing (LSH) algorithm to quickly find likely candidates based on precomputed field embeddings. The second phase then scores these candidates using a learned model to provide a confidence score.
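One common way to implement the first phase is random-hyperplane LSH, a standard LSH family for cosine similarity: similar embeddings receive the same bit signature with high probability, so only same-bucket pairs move on to scoring. The sketch below assumes that family purely for illustration; Data Cloud’s actual hashing scheme and bucket layout may differ:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class LshBuckets {
    // Project the embedding onto k random hyperplanes; each sign bit becomes
    // one bit of the signature, so nearby vectors tend to share signatures.
    static int signature(double[] embedding, double[][] hyperplanes) {
        int sig = 0;
        for (int i = 0; i < hyperplanes.length; i++) {
            double dot = 0;
            for (int j = 0; j < embedding.length; j++) dot += embedding[j] * hyperplanes[i][j];
            if (dot >= 0) sig |= (1 << i);
        }
        return sig;
    }

    static double[] randomVec(Random rnd, int dim) {
        double[] v = new double[dim];
        for (int i = 0; i < dim; i++) v[i] = rnd.nextGaussian();
        return v;
    }

    public static void main(String[] args) {
        int dim = 64, bits = 16; // more bits means tighter, smaller buckets
        Random rnd = new Random(42);
        double[][] planes = new double[bits][dim];
        for (double[] p : planes)
            for (int j = 0; j < dim; j++) p[j] = rnd.nextGaussian();

        // Bucket records by signature; only same-bucket pairs reach phase two.
        Map<String, double[]> embeddings = new HashMap<>();
        embeddings.put("rec-1", randomVec(rnd, dim));
        embeddings.put("rec-2", randomVec(rnd, dim));
        Map<Integer, List<String>> buckets = new HashMap<>();
        embeddings.forEach((id, vec) ->
            buckets.computeIfAbsent(signature(vec, planes), k -> new ArrayList<>()).add(id));
        System.out.println(buckets);
    }
}
```

Raising the bit count tightens buckets toward fewer, more similar candidates; lowering it widens recall at the cost of more work in the scoring phase.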
Generating embeddings for every incoming record at query time was too resource-intensive. To optimize, embeddings for low-cardinality fields like state, country, and standardized identifiers are pre-generated and stored in lookup tables. This reduces latency and cost while maintaining match quality.
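A minimal sketch of that precomputation follows, with embedFn as a hypothetical stand-in for whatever model produces the field embeddings:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class EmbeddingLookup {
    private final Map<String, float[]> table = new HashMap<>();

    // Built once, offline, over the full (small) value domain, e.g. state codes.
    EmbeddingLookup(List<String> domain, Function<String, float[]> embedFn) {
        for (String value : domain) table.put(normalize(value), embedFn.apply(value));
    }

    // Query time: an O(1) lookup with no model invocation.
    float[] embeddingFor(String value) { return table.get(normalize(value)); }

    private static String normalize(String s) { return s.trim().toUpperCase(); }
}
```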
The retrieval phase also ranks its search results and caps how many candidates advance to the scoring phase, preventing bottlenecks. This two-phase method ensures that Data Cloud can perform probabilistic resolution at scale without sacrificing speed or accuracy.
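One straightforward way to enforce such a cap is a bounded min-heap that keeps only the K strongest retrieval hits per record; the Candidate type and its similarity field below are illustrative:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class CandidateCap {
    record Candidate(String id, double retrievalSimilarity) {}

    static List<Candidate> topK(Iterable<Candidate> retrieved, int k) {
        PriorityQueue<Candidate> heap =
            new PriorityQueue<>(Comparator.comparingDouble(Candidate::retrievalSimilarity));
        for (Candidate c : retrieved) {
            heap.offer(c);
            if (heap.size() > k) heap.poll(); // evict the current weakest candidate
        }
        return new ArrayList<>(heap); // at most k candidates reach scoring
    }
}
```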
How are scalability-related challenges managed across datasets, architectures, and match models?
Scalability challenges are managed through curated test datasets, regression tracking, and layered observability. Over multiple release cycles, datasets have been built to simulate the toughest Data Cloud scenarios, such as extremely large match clusters, fuzzy edge cases, inconsistent formatting, and high-cardinality fields.
Each pipeline component — retrieval, scoring, and merging — is closely monitored across software releases. Even minor performance drops trigger engineering reviews. Performance metrics are tracked in detail, including latency, memory usage, and failure rates per stage.
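A hypothetical regression gate along those lines might diff each stage’s metrics against a stored baseline; the metric names and the five percent threshold below are illustrative assumptions, not the team’s actual tooling:

```java
import java.util.Map;

public class RegressionGate {
    static final double TOLERANCE = 0.05; // flag >5% degradation for review

    // For latency, memory, and failure-rate metrics, lower is better,
    // so any rise beyond the tolerance is surfaced for engineering review.
    static void check(String stage, Map<String, Double> baseline, Map<String, Double> current) {
        baseline.forEach((metric, base) -> {
            double now = current.getOrDefault(metric, Double.NaN);
            if (now > base * (1 + TOLERANCE)) {
                System.out.printf("REGRESSION %s.%s: %.2f -> %.2f%n", stage, metric, base, now);
            }
        });
    }

    public static void main(String[] args) {
        check("retrieval",
              Map.of("p99LatencyMs", 850.0, "peakMemoryMb", 4096.0),
              Map.of("p99LatencyMs", 940.0, "peakMemoryMb", 4100.0));
        // Prints a regression line for p99LatencyMs (940 > 850 * 1.05).
    }
}
```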
This rigorous process ensures that scalability is not only maintained but also continuously improved. Observability tools and simulation environments help engineers pinpoint performance issues to specific match rules or model configurations. Combined with the Lucene + Spark architecture, this approach ensures the reliability and transparency needed to handle identity workloads at scale within Data Cloud.
What recent R&D efforts improved the ability to scale identity resolution?
Recent R&D efforts have focused on enhancing memory efficiency during the profile unification phase. Large unified clusters can include up to 50,000 matched records. In earlier versions, the pipeline would try to load all profile data into memory during merge operations, often causing failures, especially in environments with limited memory. These issues affected job reliability and delayed downstream processing.
To address this, the profile merge logic was revamped to stream and process data incrementally. Now, only the necessary fields are loaded at each stage, and intermediate results are discarded as soon as they are no longer needed. This approach significantly reduces memory pressure and the risk of job failures.
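The underlying pattern is a streaming fold, sketched below: each matched record is merged into the running profile and then dropped, so peak memory tracks one profile rather than all 50,000 cluster members. The last-non-blank-wins rule is a deliberate simplification of real survivorship logic:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class StreamingMerge {
    // Only the running merged state stays in memory; each source record is
    // folded in and then becomes eligible for garbage collection.
    static Map<String, String> mergeIncrementally(Iterator<Map<String, String>> matchedRecords) {
        Map<String, String> unified = new HashMap<>();
        while (matchedRecords.hasNext()) {
            Map<String, String> record = matchedRecords.next();
            record.forEach((field, value) -> {
                if (value != null && !value.isBlank()) unified.put(field, value);
            });
            // record goes out of scope here; no per-cluster buffering
        }
        return unified;
    }
}
```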
As a result, job success rates have improved, particularly for large Data Cloud tenants, and the processing time for complex clusters has decreased. This R&D effort has increased the system’s capacity without the need for additional hardware and has helped stabilize long-tail workloads across the platform.
What role does the identity graph play in improving transparency — and how does it support trust tools like Agentforce?
The identity graph offers full visibility into the match decision process, especially in transitive match scenarios. In traditional resolution, a user might see that Record A matched Record C but have no insight into the linking logic. This lack of transparency can erode trust and lead to support cases where customers ask, “Why were these records matched?”
The identity graph solves this by clearly mapping the linkage chain. For instance: Record A matches Record B via email, and Record B matches Record C via phone, so Record A and Record C are grouped together. Each linkage is annotated with the reason, such as a matching normalized address or a high-confidence fuzzy name match. This provides full traceability.
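The sketch below illustrates that traceability with a toy graph: every edge stores its match reason, and a breadth-first search reconstructs the annotated chain between any two records. The representation and reason strings are assumptions that mirror the A, B, C example above:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IdentityGraphExplain {
    record Edge(String to, String reason) {}
    static final Map<String, List<Edge>> graph = new HashMap<>();

    static void link(String a, String b, String reason) {
        graph.computeIfAbsent(a, k -> new ArrayList<>()).add(new Edge(b, reason));
        graph.computeIfAbsent(b, k -> new ArrayList<>()).add(new Edge(a, reason));
    }

    // BFS from source to target, keeping predecessor and reason for display.
    static List<String> explain(String source, String target) {
        Map<String, String> via = new HashMap<>(); // node -> "prev|reason"
        Deque<String> queue = new ArrayDeque<>(List.of(source));
        via.put(source, null);
        while (!queue.isEmpty()) {
            String node = queue.poll();
            for (Edge e : graph.getOrDefault(node, List.of())) {
                if (!via.containsKey(e.to)) {
                    via.put(e.to, node + "|" + e.reason);
                    queue.add(e.to);
                }
            }
        }
        List<String> chain = new ArrayList<>();
        for (String cur = target; via.get(cur) != null; ) {
            String[] prev = via.get(cur).split("\\|", 2);
            chain.add(0, prev[0] + " -> " + cur + " (" + prev[1] + ")");
            cur = prev[0];
        }
        return chain;
    }

    public static void main(String[] args) {
        link("A", "B", "email match");
        link("B", "C", "phone match");
        explain("A", "C").forEach(System.out::println);
        // A -> B (email match)
        // B -> C (phone match)
    }
}
```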
This transparency feature, which will soon be integrated into Data Cloud, supports trust-layer platforms like Agentforce. Dashboards will show why a match occurred, and audits can verify the strength of each connection. By making opaque match chains navigable and explainable, the identity graph ensures that resolution is not only scalable but also inspectable, which is crucial for enterprise-grade trust and governance.
Learn more
- Stay connected — join our Talent Community!
- Check out our Technology and Product teams to learn how you can get involved.