Theory12 min read

Introducing DITTOE: Data Integration Theory of Everything

Andy Blum

January 15, 2025

E = Σ(S_FADS →BAPPTE T_FADS)

At the heart of data integration are two problems that need to be addressed. The first glaring issue is that, after over 30 years of data integration tools, there is still no standard, structured schema to describe an integration routine independently of the engine that runs it. The impact is enormous. Without such a standard, we cannot describe how to move data without writing code or working with UI's. A better world would establish a single standard, enabling data integration engines to compete effectively.

When Google released Dataflow, they had as good a chance as anyone to do this. They needed a way to import ETL pipelines from competing tools into their system. Yet they did not release a standard. Like most ETL products, they use a proprietary approach. Long before that, Microsoft's SSIS proposed an XML-based standard that proved overly complex and failed to gain traction. DITTOE aims to change that.

The second issue is one of technique. Too many teams treat each S→T combination as a one-off effort. Nothing is more inefficient in scaling from dozens to hundreds to thousands of routines. If you want to introduce a new aspect across your pipelines, you end up changing every visual designer canvas or Python script. Across large enterprises, data architects (myself included) tended to coalesce around in-house frameworks built from configuration and mapping data. Even when we used third-party engines for execution, the most sophisticated shops modeled metadata for sources, targets, and mappings, and generated configuration from that metadata. Yet each of these systems remains locked away in corporate code.

When businesses move data, whether from a file to a database or from an API to a queue, it can seem complicated. DITTOE (Data Integration Theory of Everything) offers a clear and consistent framework for describing how any type of data is integrated, transformed, and utilized.

Core Principles

At its core, DITTOE is built on a few basic ideas.

•Every transfer starts in one place and ends in another: Source to Target (S→T).
•The source and target are always one of four types: file, API, database, or stream (FADS). These are the enduring building blocks of modern data systems, hardly a "fad."
•No matter how complex a pipeline is, it can be broken down into six parts, as outlined in the BAPPTE model: Begin, Acquire, Process, Package, Transmit, and End. These steps outline the complete lifecycle of a data routine, from initiation to completion.

Instead of relying on custom scripts or ad hoc logic, DITTOE describes each routine with Pipeline Metadata. This is a structured description that tells the system about the data, where it should be sent, the actions required along the way, and how to handle it. Think of it as a blueprint for your data pipeline.

Why It Matters

Standardizing how integrations are described opens the door to:

•Reusable pipelines that do not depend on fragile scripts
•Easier testing, tracking, and troubleshooting
•Better coordination across teams and tools
•Faster development of new workflows

At the heart of this is the Pipeline Metadata Document (PMD), a blueprint that outlines everything an integration needs, including where the data resides, its format, how it should be transformed, and how it should be delivered. The result is that engineers and analysts spend less time on plumbing and more time providing value. Because everything is expressed in a consistent, portable format, your data workflows become easier to manage, scale, and trust.

1. DITTOE - Data Integration Theory of Everything (Concept)

DITTOE is a unifying theory for understanding and modeling all data integration routines. The formula shown above represents this theory.

This is not just a metaphor. It's a formal grammar for expressing any data integration routine. And at the center of that grammar is the Pipeline Metadata Document (PMD). First, let's explain the formula above and how, based on first principles, we can derive a Universal Data Integration Schema.

A. THE 3 DITTOE AXIOMS:

1. We always move a data resource from a source to a target.

We can express this with the following formula

S→T

If the above statement is true, then every routine that has ever been moved between a Source and a Target can be expressed using the formula below.

E = Σ(S→T)

Simply put, this formula represents everything that has ever been moved in the history of data integration.

2. A resource can only be one of four types: file, API, database, or stream.

E = Σ(S_FADS→T_FADS)

This leads to the claim that, throughout the history of data integration, from the first copy command to the most modern system, we will only be moving a File, API, Database, or Stream (FADS) to another FADS. It's a joke, these are the least "fadful" concepts in computing.

3. Every templated pipeline can be analyzed, documented, built, and observed using the following 6 formal stages:

Begin, Acquire, Process, Package, Transmit, and End.

Begin

Acquire

Process

Package

Transmit

End

The addition of "BAPPTE" in the formula below represents the universal stages of a data pipeline. We assert that any pipeline can be broken down into the following stages. We allow for custom code to be introduced after multiple stages.

•Begin - The start of every routine. In DITTOE, we always retrieve the metadata related to the particular S2T integration. Instrumentation is called to indicate the start of a routine.
•Acquire - This is where we receive the data from the client.
•Process - Transformations and other data manipulations are handled in this stage
•Package - Data may be further compressed or masked.
•Transmit - Data is sent to the Target
•End - Resources are freed, and instrumentation is called to indicate the end of the routine

2. Pipeline Metadata (Concept)

At the heart of EveryArrow is a bold foundational idea: any data integration routine - past, present, or future - can be described using a universal standard. With this in place, programming becomes optional. A pipeline engine doesn't need custom code to execute logic - it only needs a schematic, declarative description of what data should move, where it should go, and how it should be shaped.

That description is what we refer to as Pipeline Metadata. It captures everything required to define a pipeline - its source and target systems, the transformations it performs, the fields and data types it handles, the behaviors it applies, and the rules that govern its execution. It is the real-world expression of the DITTOE formula shown at the beginning of this article.

Although this schema serves as the foundation of EveryArrow's internal engine, its design is intentionally platform-agnostic. Any DI/ETL system that supports scripting, runtime parameter injection, or workflow templating - such as NiFi, Spark, AirByte, SSIS, Matillion, or Airflow. Instead of authoring integration logic repeatedly, teams can operate from a single, declarative standard that describes integration at scale - consistently, repeatably, and with governance built in from the start. It captures the full configuration of a pipeline in a structured, declarative format - including the source and target resources, the behaviors assigned to each BAPPTE stage, and the mappings that define how data flows between systems.

It defines the Server (where the data resides), the Resource (what is being accessed), the Fields (how the data is structured), the Data Types (how values are interpreted), and the Source-to-Target Mapping (how data is transformed). Metadata may also include runtime parameters, global properties, and inter-pipeline Dependencies for coordinating complex workflows. By wrapping all of this into a portable, platform-agnostic format, Pipeline Metadata enables pipelines to be triggered, configured, executed, scheduled, and observed without custom code - making integration logic reusable, auditable, and automated at scale. It replaces brittle scripting with a universal structure that supports governance across the entire data integration lifecycle.

Core Concepts

3. Dependency (Concept)

Describes a directed relationship between two pipelines, where one must complete (successfully or with tolerable errors) before the other can begin. This models execution order and inter-pipeline coordination.

A dependency includes:

•The pipeline that depends on another.
•The pipeline depends on it.
•Whether the dependent pipeline is allowed to proceed even if the prerequisite pipeline fails.

Dependencies may exist between individual pipelines, even across different S→T combinations.

4. Server (Concept)

A Server represents any external system that your data integration platform can connect to. This could be a database, an API host, a file storage location, or a streaming platform. The idea of a Server abstracts away the low-level details of connectivity - such as hostnames, ports, credentials, and protocols - and provides a standardized description of how to reach and authenticate with that system.

By modeling servers as metadata, the platform can support a wide range of technologies in a uniform way, without hardcoding any one integration. This allows pipeline logic to remain generic and reusable, while the actual connections are defined declaratively.

In practice, a Server is the "where" in your integration landscape - it tells the system which external platform it should interact with, and how to get there.

5. Resource (Concept)

A Resource represents a specific unit of data that the platform can either read from (as a source) or write to (as a target). Examples include a table in a database, a file in cloud storage, an API endpoint, or a topic in a stream. Each resource is defined in the context of a Server, which provides the connection details for reaching it.

While the Server answers "Where is this?", the Resource answers "What is it?" and "How should it be used?". It includes information about the data's structure, format, access strategy (read/write), and other usage-specific metadata.

Resources enable a clear separation of concerns: connection logic is centralized in the Server definition, while data-specific logic lives with the Resource. This allows teams to reuse infrastructure across different datasets, and cleanly express intent in pipeline design.

6. Fields (Concept)

A Field defines the shape, structure, and semantics of the data within a Resource. Where the Resource itself describes what is being accessed (e.g., a table, file, or API response), the Fields describe what lives inside that structure - column names, types, meanings, and how to treat them.

Think of Fields as the schema metadata for a Resource. They allow the platform to:

•Interpret data accurately.
•Validate inputs and outputs.
•Drive automated transformation, mapping, or enrichment steps.
•Communicate structure across tools and teams.

7. DataType (Concept)

A DataType defines the format and behavior of a piece of data within a Field - such as whether it represents a string, number, timestamp, or boolean. While Fields describe the structure of a Resource, DataTypes provide the semantics needed to interpret and process each value correctly. Because different systems use different native types, the platform translates all source-specific types into a canonical internal standard. This allows Resources to move seamlessly between technologies - ensuring consistent validation, transformation, and interoperability across files, APIs, databases, and streams.

8. Source-to-Target Mapping (Concept)

A Source-to-Target Mapping defines how data from a source Resource is aligned, transformed, and delivered into a target Resource. It describes the correspondence between fields - what data should move, where it should go, and how it should be shaped along the way.

This mapping captures not only field-to-field relationships, but also data type transformations, default values, renaming rules, and calculated expressions. By externalizing this logic into metadata, the platform enables pipelines to automatically adapt to changing schemas and business rules - without rewriting code. This is what turns a pipeline into a portable, declarative specification for how data should flow from one system to another.

Now we get into the concepts that effect how our pipelines would be executed.

9. Stage (Concept)

A Stage represents a defined point in the lifecycle of a data integration routine. In the DITTOE architecture, these stages follow a standardized progression known as BAPPTE: Begin, Process, Package, Transmit, and End.

Each stage abstracts a class of activities commonly found in any integration workflow:

•Begin - Initialization, setup, or pre-check logic.
•Process - Core data handling, transformation, and conditioning.
•Package - Structuring data into its final form (e.g., batch assembly, format conversion).
•Transmit - Moving data across boundaries (e.g., writing to files, pushing to APIs).
•End - Post-process cleanup, audit logging, or finalization.

This model acts as a universal lens for analyzing, organizing, and governing any integration pipeline - whether it's internal, third-party, event-driven, or batch-based. By assigning Behaviors to these canonical stages, the system achieves clear execution boundaries, consistent observability, and a governance-friendly structure that applies across all technologies.

10. Behavior (Concept)

A Behavior is a modular, declaratively defined unit of work that performs a specific action within a pipeline. Behaviors are assigned to stages - such as Begin, Process, Package, Transmit, or End - in accordance with the BAPPTE model, which provides a universal structure for describing all data integration routines.

Each Behavior expresses a single transformation, validation, enrichment, or control operation. Unlike hardcoded steps, Behaviors are defined through metadata, making them configurable, reusable, and portable across pipelines and execution contexts.

Behaviors may describe logic executed by the platform itself, or represent proxies to external systems (e.g., invoking a third-party tool like Airflow or a script in an external ETL engine). Their placement within the BAPPTE stage sequence allows the system to reason about when and why a Behavior should run, providing a consistent framework for execution, auditing, and governance.

By breaking pipelines into clearly staged Behaviors, DITTOE enables integration to be understood, modified, and governed as a sequence of composable intentions, rather than as opaque procedural logic.

Conclusion

For three decades, data integration has lacked a portable way to describe what every data pipeline actually is. DITTOE closes that gap. By treating integrations as a formal grammar: S→T over FADS, executed through BAPPTE and captured in a Pipeline Metadata Document, we replace brittle scripts and proprietary formats with a precise, testable, engine-agnostic specification.

The payoff is practical: faster builds, cleaner handoffs, easier audits, safer changes, and a path to true portability across tools. Instead of re-authoring logic per engine, teams share a common language for how data should move and transform - then let runtimes compete on performance, cost, and ergonomics.