Deterministic vs. probabilistic matching
Key Takeaways:
- Identity resolution creates a comprehensive view of a customer by aggregating data from different sources into a single database record.
- Deterministic identity resolution matches information from different sources more precisely, making it useful when accurate personalization is essential.
- Probabilistic identity resolution is less precise but yields a larger customer database to draw from, making it useful for real-time or broad-based advertising campaigns.
What is identity resolution?
Successful marketing requires a thorough understanding of each customer and prospect. Marketers gather this information from many different sources: social media interactions, ecommerce, point-of-sale and in-store payments, customer service queries, emails and texts, and so on. Together, these data sources create a comprehensive view of each customer or prospect.
Among the tools marketers use to gather this information is technology that aggregates data gathered from these different sources into a single database record. This process is called identity resolution. The result is a unified, 360-degree view of each customer.
This comprehensive view connects the experiences and interactions a customer has with your brand with specific characteristics about a customer or prospect.
Deterministic vs. probabilistic identity resolution
There are two basic types of identity resolution: deterministic and probabilistic, also called “deterministic matching” and “probabilistic matching.” Each has benefits and drawbacks, and the two can be used in tandem on the same data sources.
Deterministic identity resolution
This type of matching uses a company’s first-party data and relies on exact matches. It takes static or rarely changing information like name, home and email address, birthdate, phone number, or passport number and matches two or more customer records in which the same information is present.
Because it relies on data that tends to remain constant, like someone’s name, deterministic matching models are more precise. However, they are limited to records containing static data, and they may overlook valid matches because of imprecise data like a misspelled name. As a result, the scope of records they can match is more limited.
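As a rough illustration of exact matching, the sketch below groups records from different sources under one profile when their email addresses are identical after basic cleanup. The record fields, the `normalize_email` helper, and the sample data are all hypothetical; a real tool would handle many more identifiers and conflict rules.

```python
def normalize_email(email: str) -> str:
    """Lowercase and trim so formatting differences alone do not block a match."""
    return email.strip().lower()


def resolve_by_email(records):
    """Merge records that share the exact same normalized email into one profile."""
    profiles = {}
    for record in records:
        key = normalize_email(record["email"])
        profile = profiles.setdefault(key, {"email": key, "sources": []})
        profile["sources"].append(record["source"])
        for field, value in record.items():
            if field not in ("email", "source"):
                # Keep the first value seen for each field (an assumed rule).
                profile.setdefault(field, value)
    return profiles


# Hypothetical sample records from three sources.
records = [
    {"source": "ecommerce", "email": "Ana.Diaz@example.com", "name": "Ana Diaz"},
    {"source": "support", "email": " ana.diaz@example.com", "phone": "555-0100"},
    {"source": "newsletter", "email": "ana.dias@example.com", "name": "Ana Diaz"},
]

profiles = resolve_by_email(records)
for email, profile in profiles.items():
    print(email, profile["sources"])
```

The first two records merge into one profile. The third stays separate because its email is spelled differently, which is the kind of missed match that exact matching produces.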
Benefits of deterministic matching
Deterministic matching yields 70–80% accuracy because it relies on known identifiers like email addresses and job titles. This inherent accuracy provides several benefits:
- It improves the quality of your customer database and lets you personalize emails and device-specific in-app messages confidently.
- It creates more intuitive, personalized customer journeys based on granular criteria like previous product purchases, gender, and race.
- The databases it builds are more durable: as new information is added, it maintains the matches among existing records while easily matching new data to existing records.
- You can control the matching rules.
- The data can also be verified more easily against third-party sources, further improving its accuracy.
In addition, because deterministic matching algorithms are straightforward and use identifiable data, their connections are easy for humans to understand.
Drawbacks of deterministic matching
Because it uses multiple factors to determine identities, deterministic matching cannot build an accurate identity graph if one or more of those factors is missing from a specific record. It also struggles to build the graph when the two records differ because of a misspelling or alternate spelling. While the matches it does make are more likely to be accurate, the tradeoff is that it also produces false negatives. For any dataset, identity matches are more accurate, but there are fewer of them.
Types of deterministic matching
There are three main ways that deterministic identity resolution tools work:
- Single-field matching uses only one variable or unique identifier (an email address, for example) to decide that two records apply to the same person. This is the least accurate of the three approaches.
- Composite field matching compares two or more identifiers, such as name and email, before deciding that two records apply to the same person.
- Cascading deterministic heuristic matching is similar to composite field matching, but the rules include if/then scenarios. For example, if the email address and last name of two records do not match exactly, the fallback might be to declare a match if the email address and the first four letters of the last name match. The “cascade” would continue, with each step requiring less precision for a match to be declared. It can also identify a match even when certain variables are inconsistent. Using multiple variables reduces the number of false positives and false negatives.
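The three approaches above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the field names (`email`, `last_name`), the helper names, and the rule labels are assumptions, and the cascade has only the two steps used in the example above.

```python
def norm(value):
    """Normalize a field value; missing values become an empty string."""
    return (value or "").strip().lower()


def same(a, b, field, prefix=None):
    """True when both records have a non-empty value for `field` and the values agree.

    If `prefix` is given, only the first `prefix` characters are compared.
    """
    left, right = norm(a.get(field)), norm(b.get(field))
    if prefix is not None:
        left, right = left[:prefix], right[:prefix]
    return bool(left) and left == right


def single_field_match(a, b):
    """Single-field matching: the email address alone decides."""
    return same(a, b, "email")


def composite_match(a, b):
    """Composite field matching: email and last name must both agree exactly."""
    return same(a, b, "email") and same(a, b, "last_name")


def cascading_match(a, b):
    """Try rules from strictest to loosest; return the first rule that fires, or None."""
    if composite_match(a, b):
        return "email+last_name"
    # Fallback: email plus the first four letters of the last name.
    if same(a, b, "email") and same(a, b, "last_name", prefix=4):
        return "email+last_name_prefix"
    return None


# Hypothetical records: the same person with an alternate spelling of the last name.
crm = {"email": "j.anderson@example.com", "last_name": "Anderson"}
web = {"email": "J.Anderson@example.com", "last_name": "Andersen"}
other = {"email": "j.anderson@example.com", "last_name": "Smith"}

print(single_field_match(crm, web))   # True
print(composite_match(crm, web))      # False
print(cascading_match(crm, web))      # email+last_name_prefix
print(cascading_match(crm, other))    # None
```

Returning the name of the rule that fired, rather than a bare true or false, keeps each match explainable, which fits the point that deterministic connections are easy for humans to understand.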
Probabilistic identity resolution
This type of matching uses algorithms that predict matches among several similar data records. In addition to assessing static information, it can take into account behavioral data like user journeys and device usage. The algorithms make informed guesses about the likelihood that several pieces of data relate to the same customer or prospect.
While they may be riskier than deterministic matching, probabilistic models can uncover less obvious connections because the algorithms can analyze a wider array of data and make allowances for incorrect or missing data.
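One simple way to picture an "informed guess" is a weighted agreement score. The sketch below is an assumption-laden illustration: the signals, the weights, and the 0.75 threshold are invented for the example, and production systems typically learn such values from data rather than hard-coding them.

```python
# Hypothetical signals and weights; higher weight means stronger evidence.
WEIGHTS = {"ip_address": 4, "os": 1, "city": 2, "last_product_viewed": 3}
MATCH_THRESHOLD = 0.75  # assumed cutoff for declaring a likely match


def match_score(a, b, weights=WEIGHTS):
    """Weighted share of comparable signals on which two records agree (0.0 to 1.0)."""
    agreed = 0
    comparable = 0
    for field, weight in weights.items():
        if a.get(field) is None or b.get(field) is None:
            # Missing data is skipped instead of being counted as a mismatch.
            continue
        comparable += weight
        if a[field] == b[field]:
            agreed += weight
    return agreed / comparable if comparable else 0.0


# Hypothetical device records with no name or email attached.
phone = {"ip_address": "203.0.113.7", "os": "iOS", "city": "Austin",
         "last_product_viewed": "sku-42"}
laptop = {"ip_address": "203.0.113.7", "os": "macOS", "city": "Austin",
          "last_product_viewed": None}
stranger = {"ip_address": "198.51.100.9", "os": "iOS", "city": "Austin",
            "last_product_viewed": "sku-7"}

for name, candidate in (("laptop", laptop), ("stranger", stranger)):
    score = match_score(phone, candidate)
    print(name, round(score, 2), score >= MATCH_THRESHOLD)
```

Here the phone and laptop score about 0.86 and clear the threshold despite a missing field and a different operating system, while the stranger scores 0.3. Lowering the threshold widens the net at the cost of more false positives.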
Benefits of probabilistic matching
Probabilistic matching can assess information like IP addresses, operating systems, real-time geographic location, and network. It can also assess behavioral data, such as customer purchases or content they download from a website. This means you can build a user profile without collecting the kind of personal data deterministic matching algorithms rely on — data that is often protected by privacy laws or industry regulations.
Whereas deterministic matching improves database quality, probabilistic matching increases the size of your database and enables you to cast a wider net with your marketing campaigns. It can also:
- Improve top-of-funnel content marketing by building more accurate target customer personas, rather than messaging for specific customers.
- Let you target customers based on their interest in various topics or products in near real-time.
- Predict how customers may behave in the future, enabling you to market your products or services sooner in their purchasing journey.
Drawbacks of probabilistic matching
Probabilistic matching algorithms are less accurate than deterministic ones because they guess at the connections among various data sources.
Probabilistic matching — sometimes called “fuzzy” matching — also incorporates behavioral data. Because a customer’s behavior and preferences can change, the matches may grow less accurate with time.
In addition, probabilistic models have trouble differentiating between someone interested in purchasing a product and someone merely researching it. As a result, some of the connections they make are irrelevant; these are known as false positives.
Furthermore, new privacy regulations and the death of third-party cookies make it harder to collect the kind of third-party data that probabilistic matching needs. And the accuracy of a probabilistic algorithm decreases as the number of available data points decreases.
Inaccurate customer profiles can:
- Jeopardize the customer experience by demonstrating that your brand misunderstands its messaging’s target audience.
- Increase the cost of advertising and marketing campaigns because they missed the intended audience or targeted a less relevant audience.
- Force you to intervene manually to keep your databases accurate.
Probabilistic matching also has more difficulty matching new data to existing records, further decreasing its accuracy.
Types of probabilistic matching
Probabilistic matching algorithms employ various techniques.
- Fuzzy string matching identifies matches by increasing the tolerance for differences between the two pieces of data. Search engines that can guess the correct spelling of misspelled words also use this matching type.
- Advanced machine learning matching is a category of AI-driven search that includes:
  - Evaluating the relationship between words and concepts
  - Neural matching, which assesses the relationship between queries and web pages rather than relying on keywords
- Cascading mixed heuristic matching applies different deterministic and probabilistic algorithms in order from strictest to least strict. This enables the tool to determine matches based on a “cascade” of criteria. Even when there’s no match between the criteria at the top of the cascade, the algorithm attempts to find a match based on criteria further down in the cascade that carry less weight in confirming a match. Like cascading deterministic heuristic matching, this model reduces false positives and false negatives.
- Phonetic matching uses either simple lookup tables or a machine learning algorithm to determine a match when two words or names sound alike but are spelled differently — “Jon” vs. “John,” for example.
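To make a few of these techniques concrete, here is a minimal sketch using only Python's standard library. Fuzzy string matching is approximated with `difflib.SequenceMatcher`, and phonetic matching uses American Soundex as one example of a lookup-table approach; both are illustrative choices, not what any particular product uses. The `cascade_name_match` function, its rule order, and the 0.85 threshold are assumptions.

```python
from difflib import SequenceMatcher

# Soundex lookup table: consonants that sound alike share a digit.
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def similarity(a, b):
    """Fuzzy string similarity between 0.0 and 1.0, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def soundex(name):
    """American Soundex code: first letter plus three digits."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":
            # Vowels reset the previous code; h and w do not.
            prev = digit
    return (code + "000")[:4]


def cascade_name_match(a, b, fuzzy_threshold=0.85):
    """Mixed cascade: exact, then phonetic, then fuzzy. Returns the rule that fired."""
    if a.strip().lower() == b.strip().lower():
        return "exact"
    if soundex(a) and soundex(a) == soundex(b):
        return "phonetic"
    if similarity(a, b) >= fuzzy_threshold:
        return "fuzzy"
    return None


print(soundex("Jon"), soundex("John"))               # J500 J500
print(cascade_name_match("Jon", "John"))             # phonetic
print(cascade_name_match("Katherine", "Catherine"))  # fuzzy
print(cascade_name_match("Jon", "Maria"))            # None
```

"Katherine" and "Catherine" sound alike, but Soundex keeps the first letter, so the phonetic rule misses them and the looser fuzzy rule catches them instead. That gap is one reason to combine techniques in a cascade.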
Choosing between deterministic and probabilistic identity resolution
In general, deterministic models are used when the goal is accurate personalization. Probabilistic models are used to reach a broader audience with more general messaging.
Use cases for deterministic identity resolution
The accuracy of deterministically matched data makes it appropriate for campaigns that require a hyper-personalized customer experience, or where granular audience segmentation is critical. For example, brands with a broad product line, like makeup companies, might use deterministic data to target certain users with specific products.
It is also ideal for targeting prospective purchasers exclusively, for example to market an upgrade to the owner of a specific cell phone or to the subscriber of a specific cable TV package.
Use cases for probabilistic identity resolution
Probabilistic identity matching’s risk of false positives can sometimes be helpful. Overshooting the target audience, for example, could be beneficial in an advertising campaign for high-end or luxury goods, where reaching beyond the main audience can generate brand awareness even among people unlikely to become customers. Organizations often use probabilistic models for aligning digital ad placement with content for the same reason.
Probabilistic data is also ideal for reaching potential customers at every stage of the customer journey simultaneously, a tactic often used by car or B2C software companies.
Using identity resolution for marketing intelligence
Organizations can also use deterministic and probabilistic identity resolution to gather marketing data. For example, most customers use more than one device to browse the same website. By tying those devices to one customer through a shared identifier such as a login email, deterministic matching can help you understand which marketing channel results in the most conversions.
Probabilistic identity resolution projects provide marketing teams with information for understanding:
- Which campaigns will resonate with which customers
- Which customers are in danger of churning
- When in the marketing journey a prospect became a customer
- Which tactic drove the most upsells
- Which customers have become disengaged with the brand
Buy vs build: How to decide on an identity resolution platform
Should you buy an off-the-shelf identity resolution solution or build your own?
As with any customizable SaaS platform or database, deciding whether to buy one from a software vendor or build a platform in-house depends on your business priorities and internal software development capacity.
An in-house solution can give you more control over customization, integrations with other tools and platforms, new features, and security. It can also be more costly and time-consuming to build and maintain than purchasing a third-party solution, especially if your customer engagement strategy changes rapidly.
Consider these factors when weighing buy vs. build:
Level of risk involved in generating incorrect matches
Consider the risk level if you build an in-house solution without identity resolution experts. For example, a bank sending highly personalized emails is risky if the data underlying those emails is incorrect. An online retailer rolling out a paid media campaign risks less should the target list be inaccurate.
Data quality and accuracy
If your data quality is poor, purchasing a product may be the wrong approach. Building it in-house or hiring consultants to build a custom solution will enable you to improve the quality of your data.
Complexity of matching logic and algorithms
If the deterministic or probabilistic matching rules are simple, a third-party product will likely be overkill. If the logic and algorithms are extremely complex, a third-party solution may not be up to the task.