This is the third post in our Foundations series on Validata. Click here to read the first post, which is an introduction to Validata, and here to read the second, which discusses why row matching is an exact identity problem.
When you sit down to validate two datasets, the first instinct is to ask how to get the most accurate answer. With Validata, this first question is answered by the fact that every validation method produces identical outcomes against the same configuration. The semantics are constant. What changes is execution: where computation happens, how much data crosses the wire, and how granular the mismatch information is when you’re done.
That distinction matters. Choosing a validation method isn’t a tradeoff between accuracy and speed. It’s a placement decision. Where do you want the work to happen? And once you understand the spine that’s shared across every method, the choice becomes obvious.
This is the third post in the Validata Foundations series. The first introduced why we built Validata. In the second, we discussed why row matching is an exact identity problem. This one goes a layer deeper into how we made it work.
The Shared Spine
Every Validata validation, regardless of method, runs the same logical comparison flow.
First, the comparison key is resolved. You define it explicitly: single column or composite. Validata enforces it regardless of whether the underlying data system has primary keys.
Second, the rows are filtered. Records with null values in the comparison key are excluded – nulls break exact matching, full stop. Records with duplicate keys are surfaced on both sides and excluded from the precise comparison.
Third, Validata pairs the remaining records by key and evaluates them. Each row is assigned to one of four classifications: in-sync, content mismatch, extra at source, or extra at target. There’s no sampling. No silent loss. No “approximately fine.”
Fourth, the result is stored deterministically in the Validata Historian. Configuration snapshot, comparison key, classifications, mismatch counts, reconciliation script. All preserved against the exact configuration that produced them. Edit the validation later, and the historical reports don’t move.
This foundation — comparison key resolution, null and duplicate filtering, four-way classification, deterministic Historian — is the same across every validation method we ship. It is the part you cannot opt out of, and it is what makes Validata results trustworthy.
What varies is the physics underneath.
Three Axes of Execution
Three things can change between validation methods, and only three:
- Where the computation happens. Inside the source and target systems? On the Validata server? Some hybrid?
- How much data crosses the network. Compact signatures? Comparison keys plus hashes? Full rows?
- What precision of mismatch information you get back. Out-of-sync summary? Specific keys? Full content diff?
Every Validata method picks a position on those three axes. None of them changes the semantics. None of them changes whether the comparison is correct. They just move the work to a different place.
| Engineering tradeoffs in distributed systems usually aren’t about whether something is true. They’re about who pays the cost of making it true. Validation is no different. |
The Full-Dataset Methods
Three methods evaluate each selected column for every record in the full dataset. They differ in where the bulk of the comparison work happens.
Now let’s dive into each validation method, what they are, how they work, and which execution method is right for your use case.
Vector Validation
Validata computes compact vector signatures of the source and target tables inside their respective data systems, then compares only those signatures on the Validata server. The compute load goes up on the source and target. Network traffic drops dramatically over large datasets. This is the default full-dataset method, and it is purpose-built for very large tables where bandwidth is the bottleneck.
Fast Record Validation
Validata pulls the comparison-key columns and a hash of the non-key columns from each side back to the Validata server. The hashes are computed inside the source and target systems; the diff is computed on the Validata server.
This is the middle position on the bandwidth axis. You move more data than Vector — keys plus hashes for every row — but less than Full Record (discussed next). This is the right choice for medium-to-large datasets with reduced compute increase on Source and Target, without paying the full bandwidth cost.
Full Record Validation
Validata fetches every selected column of every record from both sides to the Validata server, and runs a full column-by-column comparison there.
The trade is plain: maximum data movement, minimum compute on the source and target systems. This is the forensic mode. Use it when you’re investigating mismatches, debugging an integration, or when you cannot increase any compute on source and target. On small-to-medium datasets, the bandwidth cost is acceptable. For large datasets, expect transfer times and processing overhead to increase rapidly.
The Partial-Dataset Methods
Two methods evaluate a deliberate subset of the data. Both fetch from the source and the target to the Validata server.
Key Validation
Validata compares only the comparison key values across the two systems. It tells you whether each key in the source has a counterpart in the target – and vice versa. It does not look at non-key content.
That sounds like a limitation. But that’s the point. Key Validation is the right answer when the question is “Is anything missing or extra?” Record presence, not content equivalence. It is the cheapest way to surface inserts that didn’t propagate, or deletes that shouldn’t. When the deeper question is whether content matches, you escalate to Vector or Fast Record. Until then, Key Validation answers the cheap question fast.
Interval Validation
Validata applies a time-based predicate at query time, evaluating only records updated within a defined window. You can configure the window as absolute timestamps or relative duration (e.g, “the last two hours”).
Interval requires a timestamp or datetime column on both sides. When you have one, this is the most efficient method for recurring drift detection. Run it every two hours instead of running a full Vector validation overnight. Surface the gaps that opened during the window. Ignore the data that didn’t change. Interval Validation is operational telemetry expressed in validation form.
Custom Validation: The Escape Hatch
If it happens that the built-in methods cannot represent the comparison you need, Validata gives you Custom Validation – a single Source-Target pair, validated using SQL you write yourself.
This is the right method for two specific situations.
The first is transformed data. Replication pipelines that reformat dates, trim or pad strings, adjust numeric precision, or change timezone handling produce target data that is equivalent to the source under your business rules but not literally identical. Validata’s built-in methods compare values as-is. If the transformation can be expressed in SQL, (and most can) Custom Validation lets you align the values inside your query, then compare what comes out.
The second is append-mode replication. Validata’s built-in methods make sense with data warehouses (Snowflake, BigQuery, Databricks) only when data is written in MERGE mode. Data warehouses and data lakes record updates as new rows rather than mutations. Therefore, they can’t be validated by methods that exclude duplicates by design. Validata wants to flag every single updated record and in the current edition, Custom Validation is how you handle it: write SQL that defines what “the latest state” means in your topology, and Validata compares those results.
Custom Validation is intentional friction. You write the rules, you own the rules, and the rules are auditable. Validata doesn’t guess what you meant.
Two operational aspects are worth noting up front. Custom Validation is only available at the Singleton Validation Pair scope – Validation Sets, which are the multi-pair, schema-wide configurations, do not support it. And Custom Validation does not support Revalidation (introduced next). This is by design: when you’ve defined the equivalence rule yourself, Validata cannot retroactively distinguish replication-lag mismatches from real ones. That is your decision to make in how you structure the SQL.
Revalidation: Identical Behavior Throughout the Five
All five built-in methods – Vector, Fast Record, Full Record, Key, and Interval – support Revalidation. After a user-defined wait time keyed to your replication latency expectations, Validata reruns the comparison only on rows previously marked out-of-sync. Discrepancies that have since been resolved through replication catch-up are no longer reported. The remaining discrepancies are flagged as real.
Revalidation works identically across the five methods because they all expose the same artifact at the end – a set of out-of-sync rows. The execution physics that gets you there differs. The downstream behavior does not.
| That is the Validata design principle in a single line: identical semantics, different execution. |
How to Choose
The practical mapping every data engineer should keep in mind:
- Massive dataset, source and target have spare compute, bandwidth is the bottleneck → Vector Validation.(the fastest option)
- Medium-to-large dataset, you cannot afford compute increase on source or target → Fast Record Validation.
- Small-to-medium dataset, forensic investigation, audit, or debugging → Full Record Validation. (the fastest option for small datasets)
- You need to know whether records are present in both systems. Content equivalence is a separate question → Key Validation.
- Recurring drift checks against time-windowed data, timestamp column available on both sides → Interval Validation.
- Data is transformed before landing, or the target is in append mode → Custom Validation.
A complete validation strategy usually combines these. Run Interval Validation every two hours to catch drift fast. Schedule a full Vector or Fast Record validation during the maintenance window for full coverage. Use Key Validation as a lightweight pre-check before kicking off heavier work. Reach for Full Record when something has gone wrong, and you need the granular truth.
You don’t choose one method and live with its tradeoffs. You compose methods that align with the operational profile of your data and the question you’re trying to answer at any given moment.
The Engineering Payoff
Here’s why the shared spine matters operationally. When semantics are constant across every method, your chosen validation method is never a judgment call about correctness. It’s a placement decision. Where do you want the compute to take place? How much bandwidth can you spend? How granular do the mismatch reports need to be? Those are engineering questions with operational answers, and they’re the only questions left, because Validata has already answered the harder one: every method tells you the same truth about your data.
That’s a different mental model than most validation tooling expects from you. Other tools force you to pick between exactness and speed, between completeness and feasibility, between forensic detail and operational throughput. Validata refuses the tradeoff. Every method is correct. The choice is about where you want the work to happen, not how much of the truth you’re willing to give up.
The implication for data engineering teams is clear. Validation stops being a quality decision dependent and starts being an operational decision about scheduling, resource assignment, and reporting cadence. You scale validation over a fleet of pipelines without weakening the rigor of any individual check.
Composability Beats Reinvention
Most validation tooling on the market today asks you to make tradeoffs every time you set up a new check. Pick the wrong method and your reports lie. Pick the wrong cadence and your team drowns in noise. Pick the wrong granularity and the engineering cost balloons. That ongoing decision tax is real, and it’s the reason most teams either underspecify their validation strategy or abandon validation altogether after the first few painful months.
The Validata bet is the opposite. One comparison logic, one truth model, one historian, one classification taxonomy – applied across every method. The methods themselves are six points in execution space, not six different opinions about what equivalence means.
When the spine is shared, the methods are composable. You don’t reinvent a validation strategy for every dataset. You compose methods. You schedule them. You scale them. You trust the result, because the result is built on a substrate that doesn’t move.
| One comparison logic, many execution strategies. That’s not a slogan. It’s the design principle that makes the rest of the platform work. |
In the next post, we’ll go deeper into the Validata Historian itself – the deterministic, configuration-snapshotted store that makes validation results auditable months and years after they’re written. That’s the layer that turns Validata from an operational tool into a regulatory artifact.
Try Validata
See Validata running against your own data. Bring a real source-target pair, and our team will show you the six methods in action – vector signatures over a billion rows, custom SQL against an append-mode lake, interval drift checks running every two hours. Forty minutes, real data, no slides.
Schedule a demo →
Read our first post in the Validata Foundations series: Introducing Validata



