Schema-less or Schema On Demand
Posted by Dylan Wan on January 29, 2017
I am trying to define the criteria for a schema-less, or schema-on-demand, data store.
In a relational database, we use DDL (data definition language) to define the schema.
We have to create a table before we can insert data into it. When we update data, we refer to the columns by name. We use DDL before we use DML.
We also have to know the column names in order to construct a SQL statement with a specific select list. If a column does not exist, the system throws an error when we try to query it. SELECT * FROM does not have that requirement. CREATE TABLE…AS SELECT also gives us some tolerance, but the table it creates will be unstable. INSERT…SELECT * is a bad practice: when the source schema changes, the statement breaks.
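To make this contract concrete, here is a minimal sketch using Python's standard-library sqlite3 module (the table and column names are illustrative):

```python
import sqlite3

# DDL-before-DML: the table must exist before data can be inserted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")  # DDL first
conn.execute("INSERT INTO orders VALUES (1, 9.99)")            # then DML

# Selecting a column that exists works; selecting one that does not
# raises an error, because the query is validated against the schema.
rows = conn.execute("SELECT amount FROM orders").fetchall()
print(rows)  # [(9.99,)]
try:
    conn.execute("SELECT discount FROM orders")
except sqlite3.OperationalError as e:
    print("error:", e)  # no such column: discount
```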
Schema describes the name of table, the names and orders of the columns, and the data type (or domain) of the columns.
Here are what I feel possible and something we can pursue:
1. Infer Data Schema during initial create/load
If DDL is required, it should be hidden, much like CREATE TABLE…AS SELECT: I can inspect the created schema after the data is loaded. Apache Spark's CSV loader supports this:
df = sqlContext.read.load('file:///mydata.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')
Incorta can also do this for CSV files. If the source is a database, you can get the metadata through JDBC or the database's own catalog. Parquet files also carry their schema definitions. To load or insert a brand-new schema, you should not need to create the schema first. Elasticsearch is considered schema-less because its PUT and POST APIs do not require any DDL-like step to be done first.
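The inference step can be sketched by hand. This is a toy illustration of the idea, not Incorta's or Spark's actual logic; infer_type and infer_schema are hypothetical helpers:

```python
import csv
import io

def infer_type(values):
    """Pick the narrowest SQL-ish type that fits every sample value."""
    for cast, name in ((int, "INTEGER"), (float, "REAL")):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "TEXT"

def infer_schema(csv_text):
    """Derive a column-name -> type mapping from a CSV with a header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = list(zip(*data))  # column-wise samples
    return {h: infer_type(c) for h, c in zip(header, columns)}

schema = infer_schema("id,price,name\n1,9.99,ann\n2,3.50,bob\n")
print(schema)  # {'id': 'INTEGER', 'price': 'REAL', 'name': 'TEXT'}
```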
2. Update Schema – automatically ALTER SCHEMA during incremental load/refresh
This requirement needs to be separated into several layers:
2.1 Create new columns
In a columnar database, this should be possible without reloading the existing data. Elasticsearch allows this: if you PUT a document with the same ID, it performs an update, and if the document introduces a new field, the field is added to the document mapping (the term Elastic uses for schema) and indexed, which is equivalent to inserting the data into a database. Elastic calls this dynamic mapping. It is not always desirable, so they let the schema developer turn the feature off.
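A toy model of dynamic mapping, including the on/off switch. TinyIndex and its methods are invented for illustration and are not the real Elasticsearch API:

```python
class TinyIndex:
    """Sketch of an index whose schema grows as documents arrive."""

    def __init__(self, dynamic=True):
        self.dynamic = dynamic  # the switch Elastic exposes
        self.mapping = {}       # field name -> inferred type name
        self.docs = {}

    def put(self, doc_id, doc):
        # Any unmapped field either extends the mapping (dynamic)
        # or is rejected (strict).
        for field, value in doc.items():
            if field not in self.mapping:
                if not self.dynamic:
                    raise KeyError(f"unmapped field: {field}")
                self.mapping[field] = type(value).__name__
        self.docs[doc_id] = doc  # same ID -> update the document

idx = TinyIndex()
idx.put(1, {"title": "a", "views": 10})
idx.put(1, {"title": "a", "views": 11, "tags": "new"})  # new field, same ID
print(idx.mapping)  # {'title': 'str', 'views': 'int', 'tags': 'str'}

strict = TinyIndex(dynamic=False)
# strict.put(2, {"title": "x"})  # would raise KeyError: unmapped field
```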
2.2 Ignore the undefined columns
When you try to insert into a column that is not already defined in the schema, an Oracle database throws an error. While the approach described in 2.1 may be desirable in some cases, especially during development, it may not be desirable in production.
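A sketch of the "ignore undefined columns" policy for production: unknown fields are dropped (and reported) rather than causing an error. insert_ignoring_unknown is a hypothetical helper:

```python
def insert_ignoring_unknown(schema, record):
    """Keep only the fields defined in the schema; return the
    filtered row plus the names of the fields that were dropped."""
    row = {k: v for k, v in record.items() if k in schema}
    dropped = sorted(set(record) - set(schema))
    return row, dropped

schema = {"id": "INTEGER", "name": "TEXT"}
row, dropped = insert_ignoring_unknown(schema, {"id": 1, "name": "ann", "nick": "a"})
print(row, dropped)  # {'id': 1, 'name': 'ann'} ['nick']
```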
2.3 Ignore a column with a different type
What if the schema inference during the initial stage does not get the data with all proper data types?
Since the data type affects how the data is stored, a type change always means reloading the existing data. The question is whether this should be performed dynamically.
I still believe that we need a development and testing phase and a production phase. A data type change should be a development-time change; in production, wrongly typed data should result in an error. However, failing to insert one column should not result in rejecting the whole record (or document).
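A sketch of that last point: a field that fails its declared type is dropped and reported, but the rest of the record still loads. load_record is a hypothetical helper:

```python
CASTS = {"INTEGER": int, "REAL": float, "TEXT": str}

def load_record(schema, record):
    """Cast each field to its declared type; a field that fails the
    cast is dropped (and reported), but the record itself survives."""
    row, errors = {}, []
    for col, typ in schema.items():
        if col in record:
            try:
                row[col] = CASTS[typ](record[col])
            except (ValueError, TypeError):
                errors.append(col)
    return row, errors

schema = {"id": "INTEGER", "price": "REAL"}
row, errors = load_record(schema, {"id": "7", "price": "n/a"})
print(row, errors)  # {'id': 7} ['price']
```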
2.4 Mapping the source column to a new column or to multiple columns
A possible way to deal with the above problem is to create a new column, with a new type, to hold the data. This avoids losing the data.
Elasticsearch allows a single column from the source to be indexed in different ways, so it becomes multiple columns that can be searched and selected. Since numbers and text can be stored and handled differently, storing two copies of the same data may be desirable if it avoids transformation at run time. Oracle's function-based indexes and indexed virtual columns serve a similar purpose.
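A toy version of this multi-field idea, loosely modeled on Elasticsearch's behavior: the raw text is always kept, and a numeric copy is added when the value parses as a number, so no cast is needed at query time. index_field is a hypothetical helper:

```python
def index_field(value):
    """Store one source value under several 'columns': the raw text
    always, plus a numeric copy when the value parses as a number."""
    out = {"raw": str(value)}
    try:
        out["as_number"] = float(value)
    except (ValueError, TypeError):
        pass  # not numeric: only the text copy is kept
    return out

print(index_field("42"))     # {'raw': '42', 'as_number': 42.0}
print(index_field("hello"))  # {'raw': 'hello'}
```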
3. Query based on schema definition
3.1 Allow an alias or synonym to be given for a name changed column or table
This is an approach for dealing with backward compatibility issues.
The OBIEE presentation layer allows the BI admin to add aliases. This is a very handy feature.
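A minimal sketch of such an alias map (the names are made up, and this is not how OBIEE implements it): old column names keep working after a rename by resolving through the alias table first.

```python
# Hypothetical alias table: old name -> current name.
ALIASES = {"cust_name": "customer_name", "amt": "amount"}

def resolve(column):
    """Translate a possibly renamed column to its current name."""
    return ALIASES.get(column, column)

print(resolve("cust_name"))  # customer_name
print(resolve("amount"))     # amount (no alias needed)
```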
3.2 Ignore the non-existing columns in the SELECT statement
A DBMS performs this type of validation because it does not know how to handle a missing column, but the validation builds a dependency from consumers to producers.
What if we just simply ignore the columns?
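A sketch of that tolerant behavior: the projection silently skips requested columns that the data does not have, instead of raising an error. tolerant_select is a hypothetical helper:

```python
def tolerant_select(rows, columns):
    """Project the requested columns, silently skipping any that
    the data does not have, instead of raising an error."""
    available = [c for c in columns if rows and c in rows[0]]
    return [{c: r[c] for c in available} for r in rows]

rows = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
print(tolerant_select(rows, ["id", "discount"]))
# [{'id': 1}, {'id': 2}]
```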