Validating Migrated Data: A Practical Field Guide for Teams

Data migration is like moving a house of cards across a room. You can plan every step, use the best tools, and still end up with a collapsed mess if you don't check each card along the way. That's what validation is for: proving that what arrived is what you sent. This guide gives your team a practical, repeatable system for validating migrated data—no academic theory, just field-tested steps that work.

Why Data Validation Often Fails (and What That Costs)

Most teams treat validation as an afterthought. They run a few row counts, spot-check a handful of records, and call it done. Then the first Monday after go-live, someone notices that all customer phone numbers are missing the area code, or that order totals don't match the original system. The cost of these failures goes beyond fixing the data—it erodes trust, delays downstream projects, and can trigger compliance issues.

Why does this happen so often? First, validation is boring compared to building the migration scripts. Teams underestimate the time it takes. Second, many tools only check surface-level attributes (row count, null checks) but miss deeper problems like referential integrity or business rule violations. Third, there's no shared language for what 'valid' means—different stakeholders expect different things.

We've seen projects where the validation phase was squeezed into the last week before cutover, leading to rushed checks and missed errors. In one composite case, a retail company migrated its inventory system but didn't validate that the 'last updated' timestamps were converted to the new timezone. Every stock report showed negative quantities for a month until someone caught the mismatch. The fix required a full re-sync and a public apology to suppliers.

The takeaway: validation is not a step you add at the end. It's a mindset you embed from the first mapping discussion. When you treat validation as a core part of the migration, you catch problems early, when they're cheap to fix.

Common Validation Traps

Teams often fall into these traps: (1) Only checking the happy path—what works when everything is perfect. (2) Assuming the source system is clean. (3) Not validating relationships between tables. (4) Relying on a single validation method (e.g., only row counts). (5) Forgetting to validate metadata like timestamps, user IDs, and audit trails.

The Core Idea: Validation Is a Comparison, Not a Test

At its heart, validation is a comparison between two states: the source data and the target data. You're asking, 'Are these the same?' But 'same' is tricky. The target might have different data types, different column names, or different business rules. A direct field-by-field compare often fails because of these transformations.

Think of it like translating a book from English to Spanish. You don't check that every word matches—you check that the meaning and structure are preserved. Similarly, you need to compare at the right level of abstraction. For some fields, exact match is required (like IDs). For others, you accept a transformation (like dates formatted differently).

The key is to define what 'valid' means for each data element before you start migrating. Create a validation matrix that lists: the source field, the target field, the transformation rule, the validation method (exact match, fuzzy match, business rule check), and the acceptable tolerance. This matrix becomes your contract between the migration team and the business stakeholders.

Validation Levels

We recommend four levels of validation: (1) Structural—are all tables and columns present? (2) Quantitative—do row counts and sum totals match? (3) Qualitative—are individual field values correct? (4) Business logic—do the data still obey business rules (e.g., discounts don't exceed 100%)? Each level catches different types of errors, and skipping any one leaves you blind.

How Validation Works Under the Hood

Let's walk through the technical process. Most validation tools work by extracting data from both source and target, then running a comparison engine. But the devil is in the details: how you extract, how you compare, and how you report differences.

First, you need a consistent snapshot. If the source is still being updated during migration, you must either freeze it or implement change data capture to track what changed after the snapshot. Without a consistent baseline, every difference is a false alarm.

Second, the comparison engine must handle data type mismatches. For example, a source might store phone numbers as a 10-digit integer, while the target uses a string with formatting. The engine needs to normalize before comparing. Similarly, floating-point numbers require tolerance thresholds—exact equality is impossible due to rounding.

Third, reporting is critical. A good validation tool gives you a summary (how many records match, how many fail) and drill-down details (which fields failed, what the values were). It should also categorize errors by severity: critical (data loss), major (business rule violation), minor (formatting difference).

Automated vs. Manual Checks

Not everything can be automated. Business logic validation often requires manual review by domain experts. For example, a migration of insurance policies might need an underwriter to verify that risk scores were calculated correctly after the move. The trick is to automate the boring, repetitive checks (row counts, null checks, referential integrity) and reserve manual effort for high-value, complex validations.

A Worked Example: Migrating a Customer Database

Let's say you're migrating a customer database from a legacy CRM to a modern platform. The source has 50,000 records with fields like name, email, phone, address, and a 'notes' field with unstructured text. The target has a stricter schema: email is required, phone is split into country code and number, and notes are limited to 500 characters.

Here's how you'd validate this migration step by step:

Structural validation: Check that all source tables (Customers, Addresses, Orders) exist in the target. Verify that all required columns are present and have the correct data types.
Quantitative validation: Count rows in source and target. Should be 50,000. Also sum a numeric field like 'lifetime value' to ensure totals match.
Field-level validation: For a random sample of 1,000 records, compare each field. For email, check exact match. For phone, normalize both to the same format before comparing. For notes, check that truncation didn't cut off important information.
Business rule validation: Verify that all customers with an 'active' status have a non-null email (target requirement). Check that no two customers share the same email (uniqueness constraint).

During this process, you'll likely find issues: some emails are missing, some phone numbers are malformed, and a few notes were truncated mid-sentence. Each issue gets logged, triaged, and either fixed in the source (re-migrate) or accepted with a documented decision.

Handling the 'Notes' Field

The notes field is a classic edge case. The source allows unlimited text, but the target caps at 500 characters. You have three options: truncate, split into multiple fields, or store in a separate table. Each has trade-offs. Truncation is simplest but loses data. Splitting is complex. A separate table preserves data but changes the schema. The validation must confirm that the chosen approach was applied consistently and that no critical data was lost.

Edge Cases and Exceptions

Even with a solid plan, edge cases will surprise you. Here are some of the most common ones we've encountered:

Large Datasets

When you have millions of records, full comparison becomes impractical. You can't extract every row and compare it field-by-field—the time and bandwidth are prohibitive. Instead, use sampling strategies: random sampling for statistical confidence, stratified sampling for high-value records, and hash-based checksums for bulk verification. For example, compute a hash of each row in source and target, then compare the hashes. Any mismatch tells you exactly which row to investigate.

Real-Time Sync

If the target system is live during migration (cutover is phased), validation becomes a moving target. You need to compare at a point in time, then track changes after that point. Use timestamps or version numbers to identify which records changed after the snapshot. Alternatively, use a dual-write pattern: write to both systems simultaneously and validate in near-real time.

Legacy Systems with No API

Some old systems only allow file exports. You get a CSV dump and that's it. In this case, you must validate the export first (is it complete? are delimiters correct?) before even starting the migration. Then, after the import, you compare the target against the export file. This adds a layer of indirection but is still doable.

Data Cleanup During Migration

Many teams decide to clean data during migration—deduplicate, fix formatting, fill missing values. This is risky because you're changing the data and validating against a moving baseline. The best practice is to keep a copy of the original source data and validate against that, not the cleaned version. Then separately validate that the cleanup rules were applied correctly.

Limits of the Validation Approach

No validation process can guarantee 100% correctness. Here are the inherent limits you need to accept:

Sample size: If you sample, you might miss rare errors. Statistical sampling gives you confidence intervals, but it can't catch a single corrupted record in a million.
Business logic complexity: Some business rules are too complex to automate. For example, validating that a pricing algorithm produces the same result after migration might require manual recalculation.
Human error: The validation matrix itself can have mistakes. If you define the wrong transformation rule, you'll validate against the wrong expectation.
Time pressure: In most projects, validation is the first thing to be cut when timelines slip. You can fight this by making validation visible to stakeholders—show them the error rates and risks of skipping steps.

The goal is not perfection; it's risk management. You want to catch the errors that would have the biggest business impact. Prioritize validation effort on critical data: financial records, customer PII, compliance data. For less critical data, a lighter check may be sufficient.

When to Stop Validating

There's a point of diminishing returns. If you've run multiple rounds of validation and the error rate is below a threshold agreed with stakeholders, it's time to move on. Document the remaining known issues (if any) and get sign-off. Perfectionism can delay go-live indefinitely.

Reader FAQ

How long should validation take?

It depends on data volume and complexity. A good rule of thumb is 20-30% of the total migration timeline. For a 3-month migration, plan at least 3 weeks for validation. That includes building the validation matrix, running automated checks, and manual review of exceptions.

What tools should we use?

There are many options: open-source frameworks like Great Expectations, commercial tools like Informatica Data Validation, or custom scripts in Python or SQL. Choose based on your team's skills and the complexity of your data. For simple migrations, a set of SQL queries may be enough. For complex ones, invest in a dedicated tool.

How do we handle validation when the source is still live?

Use a snapshot or freeze the source during extraction. If that's not possible, implement change data capture and track changes after the snapshot. Validate the snapshot first, then validate the incremental changes.

What if we find errors after go-live?

Have a rollback plan and a data repair process. Not all errors require rollback—some can be fixed in place. But if the error is systemic (e.g., all dates are off by one day), you may need to re-migrate that subset. The key is to detect errors quickly through monitoring and to have a documented escalation path.

Should we validate every field?

No. Focus on fields that are critical to business operations, required for compliance, or prone to transformation errors. Skip fields that are purely informational or rarely used. Your validation matrix should reflect this prioritization.

This guide is intended for general informational purposes and does not constitute professional advice. Always consult with qualified data engineers and legal advisors for your specific migration context.

Validating Migrated Data: A Practical Field Guide for Teams

Table of Contents

Why Data Validation Often Fails (and What That Costs)

Common Validation Traps

The Core Idea: Validation Is a Comparison, Not a Test

Validation Levels

How Validation Works Under the Hood

Automated vs. Manual Checks

A Worked Example: Migrating a Customer Database

Handling the 'Notes' Field

Edge Cases and Exceptions

Large Datasets

Real-Time Sync

Legacy Systems with No API

Data Cleanup During Migration

Limits of the Validation Approach

When to Stop Validating

Reader FAQ

How long should validation take?

What tools should we use?

How do we handle validation when the source is still live?

What if we find errors after go-live?

Should we validate every field?

Comments (0)

Table of Contents

Why Data Validation Often Fails (and What That Costs)

Common Validation Traps

The Core Idea: Validation Is a Comparison, Not a Test

Validation Levels

How Validation Works Under the Hood

Automated vs. Manual Checks

A Worked Example: Migrating a Customer Database

Handling the 'Notes' Field

Edge Cases and Exceptions

Large Datasets

Real-Time Sync

Legacy Systems with No API

Data Cleanup During Migration

Limits of the Validation Approach

When to Stop Validating

Reader FAQ

How long should validation take?

What tools should we use?

How do we handle validation when the source is still live?

What if we find errors after go-live?

Should we validate every field?

Share this article:

Comments (0)

Related Articles

Mastering Migration Execution: A Step-by-Step Validation Framework for Seamless Transitions

Everything You Need to Know About Migration Execution & Validation

Migration Execution & Validation: Expert Insights for Flawless Data Transitions