In the simplest terms, schema is the structure of data inside a database. The structure of data can include things like field and table names, views, indexes, and snapshots. The definition of schema will often expand to include the relationships between data, for example, primary and foreign keys that logically connect separate tables.
Analytics systems and legacy data management systems require a schema, which can be generated either on write or on read. When schema is generated on write, the schema comes before the data. A very common schema on write scenario is that a data engineer creates several tables with a rigid schema in a relational database and links them with primary and foreign keys. Then, the data engineer populates the tables with data. In a schema on read scenario, different types of data, potentially both structured and unstructured, are loaded into the destination, and the schema is generated when queries against the data are executed. This means the data engineer can spend more time crafting queries to gain better insights and less time carefully defining fields up front.
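To make the contrast concrete, here is a minimal sketch in Python. It is an illustration only: the table names (`customers`, `orders`), the sample records, and the helper `read_totals` are all assumptions made up for this example, and an in-memory SQLite database stands in for the relational destination.

```python
import json
import sqlite3

# --- Schema on write: define the tables first, then load rows. ---
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " total REAL NOT NULL)"
)
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders (id, customer_id, total) VALUES (10, 1, 42.5)")
write_rows = conn.execute(
    "SELECT c.name, o.total FROM orders o JOIN customers c ON c.id = o.customer_id"
).fetchall()

# --- Schema on read: store raw records, apply structure at query time. ---
raw_records = [
    '{"customer": "Ada", "total": 42.5}',
    '{"customer": "Grace", "total": 17.0, "coupon": "SPRING"}',
]

def read_totals(lines):
    """Parse each raw record and pick out the fields this query needs."""
    return [(rec["customer"], rec["total"]) for rec in map(json.loads, lines)]

read_rows = read_totals(raw_records)
```

In the first half, the structure exists before any row is inserted. In the second half, the raw records are stored as-is, and the extra `coupon` field in one of them requires no up-front definition.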
Schema Past and Future
Schema on write was the default method for decades. Data engineers would spend a significant portion of their time defining schema and relationships before ever starting to analyze their data. Today, more modern data tools tend towards schema on read. The trend is towards automation of time-consuming and manual processes that don’t need human intervention. Defining schema falls squarely into this bucket.
At a Glance: Schema on Read vs. Schema on Write
We’ve already gone over the main differentiator between schema on read and schema on write, but there are other more subtle differences. Let’s explore them.
| | Schema on Write | Schema on Read |
|---|---|---|
| Schema | User has to define a schema | Schema is inferred from the data |
| Data | Structured and relational | Structured and unstructured |
| End User Experience | The only queryable data is pre-selected | Allows richer data exploration |
| Positive Features | Lightweight | Adaptable |
When you have to define your data before it arrives at your destination, as you do with schema on write, it most often has to be structured and relational. Schema on read, on the other hand, can handle all kinds of data, including unstructured and structured.
Regarding the end user experience, schema on write forces data architects and engineers to be explicit about what data goes into their warehouse before they can analyze it. This can pose a problem: any field left out of the original definition cannot be queried until the schema is changed. Schema on read allows for more flexibility and a richer data exploration experience because analysts can pull in fields as needed.
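The difference in end user experience can be sketched in a few lines of Python. Everything here is hypothetical and chosen only for illustration: the event records, the `DECLARED_COLUMNS` tuple, and the two helper functions. The sketch assumes a fixed-schema loader that drops undeclared fields, which is one common behavior but not the only one.

```python
import json

raw_events = [
    '{"user": "ada", "page": "/home"}',
    '{"user": "grace", "page": "/pricing", "referrer": "newsletter"}',
]

# Schema on write: only the columns chosen up front survive loading.
DECLARED_COLUMNS = ("user", "page")

def load_with_fixed_schema(lines):
    """Keep only pre-declared columns; anything else is dropped at load time."""
    return [
        {col: rec.get(col) for col in DECLARED_COLUMNS}
        for rec in map(json.loads, lines)
    ]

def select_on_read(lines, *fields):
    """Parse raw records at query time and project whichever fields are requested."""
    return [
        tuple(rec.get(f) for f in fields)
        for rec in map(json.loads, lines)
    ]

fixed = load_with_fixed_schema(raw_events)
explored = select_on_read(raw_events, "user", "referrer")
```

After loading with the fixed schema, `referrer` is no longer available to query. With the on-read helper, an analyst can ask for `referrer` later without anyone having declared it in advance.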
Finally, while there is no right or wrong way to apply schema, both approaches have their strengths. Schema on read benefits from the adaptability inherent in its design, while schema on write is a lightweight solution that can deliver very fast query performance.
StreamSets and Schema
StreamSets aligns with the more modern way of handling schema, taking a schema on read approach. This design choice means pipelines don’t need to be rewritten when new fields are created in the origin. Instead, the schema is inferred and passed to the transforms and destination without the need for human intervention. This makes for robust pipelines that can adapt to change. In other words, StreamSets pipelines respond automatically to data drift, a critical function for a modern data strategy.
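The general idea of inferring schema per batch and passing it downstream can be sketched in plain Python. This is not StreamSets code and does not use any StreamSets API; the functions `infer_schema`, `run_pipeline`, and `uppercase_strings` and the sample batches are assumptions invented for this illustration.

```python
def infer_schema(records):
    """Map each field name in the batch to the set of Python type names observed."""
    schema = {}
    for rec in records:
        for field, value in rec.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

def run_pipeline(records, transform):
    """Infer the schema for this batch, then apply the transform to every record."""
    schema = infer_schema(records)
    return schema, [transform(rec) for rec in records]

def uppercase_strings(rec):
    """A field-agnostic transform: it never names a specific column."""
    return {k: v.upper() if isinstance(v, str) else v for k, v in rec.items()}

batch_1 = [{"id": 1, "city": "Oslo"}]
batch_2 = [{"id": 2, "city": "Lima", "country": "pe"}]  # a new field appears upstream

schema_1, out_1 = run_pipeline(batch_1, uppercase_strings)
schema_2, out_2 = run_pipeline(batch_2, uppercase_strings)
```

Because the transform never hard-codes column names, the new `country` field in the second batch flows through without any change to the pipeline code, and the inferred schema simply gains one entry.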