Article originally appeared as Schema On Read vs. Schema On Write Explained.
What’s the difference between schema on read and schema on write?
How did schema on read shift the way data is stored?
Since the inception of relational databases in the 1970s, schema on write has been the de facto procedure for storing data to be analyzed. Recently, however, there has been a shift toward a schema on read approach, which has fueled the exploding popularity of Big Data platforms and NoSQL databases. In this post, let’s take a deep dive into the differences between schema on read and schema on write.
What is Schema On Write
Schema on write means defining a schema for data before writing it into the database. If you have done any kind of development against a database, you understand the structured nature of a Relational Database Management System (RDBMS), because you have used Structured Query Language (SQL) to read data from it.
One of the most time-consuming tasks in an RDBMS is Extract, Transform, Load (ETL) work. Just because the data ends up structured doesn’t mean it starts out that way; most data in existence is unstructured. Not only do you have to define the schema for the data, you must also restructure the data to fit that schema.
For example, if I wanted to store menu data for a local restaurant, how would I begin to set the schema and write the data into the database?
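As a minimal sketch of the schema-on-write steps, here is what storing that menu data might look like using SQLite from Python. The table and column names are invented for illustration; the original article does not define an actual schema.

```python
import sqlite3

# Schema on write: the structure must be declared before any data lands.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE menu_items (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        section TEXT NOT NULL,   -- e.g. 'appetizers', 'entrees'
        price   REAL NOT NULL
    )
""")

# Every row must already fit the schema before it can be written.
# A free-text daily special would be rejected until it is
# transformed (the "T" in ETL) to match these columns.
conn.execute(
    "INSERT INTO menu_items (name, section, price) VALUES (?, ?, ?)",
    ("House Salad", "appetizers", 6.50),
)

rows = conn.execute("SELECT name, price FROM menu_items").fetchall()
print(rows)  # [('House Salad', 6.5)]
```

The up-front `CREATE TABLE` is exactly the overhead the article describes: no data can be ingested until the schema and the transformation into it are worked out.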
All of those steps had to be done before the data could be stored and analyzed for new insights. The overhead of this ETL work is one reason new data sets are hard to get into your Enterprise Data Warehouse (EDW) quickly.
What is Schema On Read
Schema on read differs from schema on write in that you create the schema only when reading the data. Structure is applied to the data only when it’s read, which allows unstructured data to be stored in the database. Since it isn’t necessary to define a schema before storing the data, it’s easier to bring in new data sources on the fly.
The exploding growth of unstructured data, combined with the ETL overhead of storing data in an RDBMS, is the main driver of the shift to schema on read. Analysts often aren’t sure what insights they will gain from a new data source, which is why onboarding one into a schema-on-write system is so time-consuming. Thinking back to our schema-on-write scenario, let’s walk through it using schema on read.
Boom! We are done! All of the menu data is in the database. For any insight we want to investigate, we can try applying a schema while testing. Let’s be clear, though: we are still doing ETL to fit the data into a schema, but only when reading it. Think of this as schema on demand!
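The same menu scenario can be sketched schema-on-read style: store the raw records as-is and impose structure only at query time. The JSON field names below are invented for illustration.

```python
import json
import sqlite3

# Schema on read: the only "schema" at write time is "a blob of text".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_menu (doc TEXT)")

# Ingest whatever the source hands us, with no up-front ETL.
conn.execute(
    "INSERT INTO raw_menu VALUES (?)",
    ('{"name": "House Salad", "price": 6.50, "notes": "ask about dressing"}',),
)

def read_prices(connection):
    """Parse and shape the raw documents only when they are read."""
    for (doc,) in connection.execute("SELECT doc FROM raw_menu"):
        record = json.loads(doc)  # structure is applied here, at read time
        yield record["name"], record["price"]

prices = list(read_prices(conn))
print(prices)  # [('House Salad', 6.5)]
```

Note that the transform work hasn’t disappeared; it has moved into `read_prices`, paid on every read instead of once at load time, which is the core trade-off the next section discusses.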
Key Differences Schema On Read vs. Schema On Write
Since schema on read allows data to be inserted without applying a schema, should it become the de facto approach? No; there are pros and cons to each. For example, when the structure of the data is known, schema on write is a great fit because it can return results quickly.
There is no outright winner between schema on read and schema on write. As with most things in development, it depends on the use case. Is the workload mostly data supporting a dashboard, where results need to be fast and repeatable? Then it calls for a schema-on-write database. Will there be a lot of unknowns in the data and a constant stream of new sources? That sounds like a job for schema on read.
The post How Schema On Read vs. Schema On Write Started It All appeared first on Dell EMC Big Data.
I guess we could say that the problem with the database and application world is that, like most worlds, it isn’t perfect. As such, you are correct in saying that an application that behaves like an OLTP system during the day shifts to batch/DSS-like behavior at night. This happens due to natural business cycles, where the operational systems’ data must produce summary reports, run backups, or be extracted, cleansed, and sent to other systems such as Data Warehouses (DW), Business Intelligence (BI), OLAP, etc. The result is that the border between OLTP and DSS softens. When I ask DBAs today whether their database is OLTP or DSS, the answer is most often: both!
There is another trend that softens the boundary: in-memory databases (IMDBs). Because IMDBs are very fast, they are often positioned as a single repository of data for OLTP and DSS alike. I’m not a great believer in this trend. A true BI or DW system does show DSS workload behavior (large data scans, sequential access, and so on), but its *content* is not the same as the operational systems’. It is often an aggregation and summary of data from multiple operational systems and requires some cleansing first (data in different operational systems often isn’t inserted in the same way, may include missing records, may be too detailed, etc.).
Regarding your question about treating transaction logs as DSS: although, as you said, the logs are typically sequential writes (with reads for archiving), we tend to apply the terms DSS and OLTP to a whole database or application, and more to the nature of how the data is accessed than to the transaction logs. Still, you’re right if you only consider the access pattern per storage group.
Regarding eNAS on VMAX3/AF, we don’t necessarily distinguish between FC and iSCSI (block storage) or eNAS (file storage) as far as application type goes; we leave it to the customer to decide how they want to access their data. You can find a white paper on VMAX3 with SQL Server and eNAS here.
To go back to what you actually asked, “I was trying to identify examples of host apps/services that fall under the workload type of DSS”: I’m not enough of an expert in this space to provide a list. However, one approach is to look at Gartner’s “Magic Quadrant” (example here; click the image to magnify. The full report goes into more detail but costs money), or to google the subject (here is one result I found). Since I didn’t read either of these two links closely, I offer them only as examples and am not advocating their content. Note, however, how they usually point to an application and not a database. If you dig deeper, some of these applications use the same databases that OLTP systems may use, and others have their own.
The Graph History function in IBM Navigator for i helps you do exactly that. With Collection Services historical data and the Graph History function, part of the Performance task on the web, you can see how your system performance compares to last week, last month, or even last year. Collection Services now provides a new collection type that aggregates a subset of data over long time periods, and Performance Data Investigator can analyze and display this new historical collection type. Historical data, combined with the power to view and analyze it across days, weeks, and years, provides new insight into system performance. This article describes how the Graph History function in IBM Navigator for i can help you understand your data and how system performance changes over time.