Understanding the Comparison Process

During a Validation Run, Validata follows a multi-step process to compare records between the Source and Target tables in each Validation Pair. This process includes:

Filtering out records with null values in the comparison key columns,
Excluding duplicate records, and
Evaluating the remaining unique records

This section offers a conceptual view of the comparison process and explains how Validata computes comparison metrics.

Forming Validation Pairs

The comparison process begins by forming Validation Pairs, where each Pair consists of a table from the Source dataset and its corresponding table from the Target dataset. Validata intelligently auto-maps Source tables to Target tables and maps Source and Target columns within each Pair to one another. You may modify or override any of these mappings and select which mapped tables and columns Validata should evaluate.

Validata will only compare the data in the mapped columns that you have selected for comparison. If the Source and Target tables contain mapped columns that you have not selected for comparison, Validata will not process and evaluate the data in those columns.

A few additional points:

If a column in the Source or Target table contains a data type unsupported by the current Validata edition, that column cannot be selected for comparison
If a column in one table of the Validation Pair is not mapped to a column in the other table in the Pair, that column cannot be selected for comparison.
All Validation types require you to specify comparison key(s), except for Custom Validation where the comparison key is optional but recommended.

Selecting Comparison Keys

The next step is identifying the column or set of columns that serve as Comparison Keys. These keys determine how Validata matches records in the Source table to records in the Target table.

We recommend that you choose a column that contains unique values as the comparison key column to ensure that the comparison does not result in unexpected comparison results.

If the data system requires a primary key, we strongly recommend that you choose the primary key as the comparison key. Depending on your use case, you may also choose a column other than the primary key as your comparison key, or you may choose multiple columns to act as a composite comparison key. Validata filters out records that have null values in any of the comparison key columns and does not compare records that contain duplicate values in the comparison key columns; therefore, you should choose your comparison keys appropriately.

In this section, we present two scenarios to illustrate the comparison process:

Comparison using 1 column as the comparison key
Comparison using 2 columns to form a composite comparison key

We will use fictional point-of-sale data that is replicated from a relational database (Source) to a data warehouse (Target). We assume that the data warehouse is configured in APPEND mode; therefore it supports duplicate records, i.e. records that share the same primary key.

Scenario-1: Using 1 column as the Comparison Key

In this scenario, we map Source.TxnID to Target.TxnID to form the Comparison Key. Validata uses this comparison key to match records in the Source table to corresponding records in the Target table. We also map semantically equivalent data in both tables to ensure accurate comparison (e.g., Source.MerchName is mapped to Target.MerchName).

Filtering null records

When ingesting data, Validata automatically filters out records that have null values in the comparison key column (i.e., TxnID). These null records are not processed, compared, or included in the validation metrics or reconciliation scripts.

Identifying and excluding duplicate records

Next, Validata identifies and excludes duplicate keys and their associated duplicate records from the comparison.

As defined previously, a duplicate key is a comparison key value that appears in two or more records within either the Source or the Target table after null records have been filtered. In the current example, Validata has identified 3 duplicate keys: TxnID = 1003, TxnID = 12054, and TxnID = 12061.

Any record in the Source or Target table of a Validation Pair whose TxnID matches any of the duplicate keys is classified as a duplicate record and excluded from further comparison.

As shown in the figure above, Validata has flagged 3 Source records and 6 Target records as duplicate records:

TxnID = 1003: 1 record in the Source table, and 2 records in the Target table
TxnID = 12054: 1 record in the Source table, and 2 records in the Target table
TxnID = 11061: 1 record in the Source table, and 2 records in the Target table

Comparing the remaining unique records

After filtering null records and excluding duplicate records, the remaining records in both tables have unique values in the comparison key columns. Validata compares these unique records based on the comparison key:

If a Source record and a Target record share the same comparison key value and all corresponding column values match , they are classified as In-Sync.
If the comparison key matches but one or more non-key columns differ, Validata marks the records as Out-of-Sync: Content mismatch.
Validata marks any remaining unmatched records—those appearing in only the Source table or only the Target table—as Out-of-Sync, either Extra in Source or Extra in Target respectively.

We consider 4 comparisons in the figure above to illustrate In-Sync and Out-of-Sync concepts:

Validata has marked the record with TxnID = 1005 as In-Sync because its contents are identical for all corresponding columns.
Validata has marked the record with TxnID = 12056 as Out-of-Sync: Content mismatch because ItemName = “Bluetooth Spkr” in the Source record has been truncated to ItemName = “Bluetooth” in the Target record.
Validata has marked the record with TxnID = 13059 as Out-of-Sync: Extra in Source because it is an unmatched record that appears only in the Source table.
Validata has marked the record with TxnID = 1000 as Out-of-Sync: Extra in Target because it is an unmatched record that appears only in the Target table.

Computing the Metrics for the Validation Pair

After completing the comparison for a Validation Pair, Validata generates a Validation Pair Report summarizing the comparison metrics. The report includes essential information such as the list of Duplicate Keys, details on Out-of-Sync Records, and a reconciliation script. You can download the script and execute it on the external data systems to resolve the detected discrepancies.

Scenario-2: Using 2 columns to form a composite Comparison Key

You may choose to use multiple columns to form a composite comparison key for granular analysis.

In this scenario, we map Source.TxnID to Target.TxnID and Source.MerchID to Target.MerchID to form a composite Comparison Key. Validata uses this composite comparison key to match records in the Source and Target tables. We also map semantically equivalent data in both tables to ensure accurate comparison (e.g., Source.ItemName is mapped to Target.ItemName).

Filtering null records

When ingesting data, Validata automatically filters out records that have null values in any of the comparison key columns, i.e. either TxnID or MerchID.

In this example, Validata has identified 5 records as null records as shown in the figure above:

Source table: (TxnID, MerchID) = (12063, null)
Target table: (TxnID, MerchID) = (1003, null), (12054, null), (12054, null), (12062, null)

Note that the addition of MerchID to form a composite comparison key has resulted in 4 extra null records—1 in the Source and 3 in the Target—due to null values in the MerchID column.

Identifying and excluding duplicate records

Next, Validata identifies and excludes duplicate keys and their associated duplicate records from the comparison.

In the current example, a duplicate key is a composite comparison key value that appears in two or more records within either the Source or the Target table after null records have been filtered. Here, Validata has identified 1 duplicate key: (TxnID, MerchID) = (12061, M072), and classified 3 records—2 in the Source and 1 in the Target—as duplicate records.

Validata_Concepts_2ComparisonKeys_Table2_Duplicates.png

Thus, the addition of MerchID to form a composite comparison key has altered the number of keys and records that Validata classifies as duplicates.

Comparing the remaining unique records

After filtering null records and excluding duplicate records, the remaining records in both tables have unique values in the comparison key columns, as shown below.

As discussed previously, Validata compares these unique records based on the comparison key:

If a Source record and a Target record share the same composite comparison key value and all corresponding column values match, they are classified as In-Sync.
If the composite comparison key value matches but one or more non-key columns differ, Validata marks the records as Out-of-Sync: Content mismatch.
Validata marks any remaining unmatched records—those appearing in only the Source table or only the Target table—as Out-of-Sync, either Extra in Source or Extra in Target respectively.

We consider 4 comparisons in the figure above to illustrate In-Sync and Out-of-Sync concepts:

Validata has marked the record with (TxnID, MerchID) = (1003, MS002) as In-Sync because its contents are identical for all corresponding columns.
Validata has marked the record with (TxnID, MerchID) = (12056, M014) as Out-of-Sync: Content mismatch because ItemName = “Bluetooth Spkr” in the Source record has been truncated to ItemName = “Bluetooth” in the Target record.
Validata has marked the record with (TxnID, MerchID) = (12058, S1009) as Out-of-Sync: Extra in Source because it is an unmatched record that appears only in the Source table.
Validata has marked the record with (TxnID, MerchID) = (1000, MS005) as Out-of-Sync: Extra in Target because it is an unmatched record that appears only in the Target table.

Understanding the Comparison Process

Forming Validation Pairs

Selecting Comparison Keys

Scenario-1: Using 1 column as the Comparison Key

Filtering null records

Identifying and excluding duplicate records

Comparing the remaining unique records

Computing the Metrics for the Validation Pair

Scenario-2: Using 2 columns to form a composite Comparison Key

Filtering null records

Identifying and excluding duplicate records

Comparing the remaining unique records

Computing the Metrics for the Validation Pair

Search results