Semi-structured Data Type
Summary
Shortcomings of the current DataType

- `DataType` is an enum type, so we must match on the specific variant before using it. For example, if we want to create a deserializer/serializer by `DataType`, we always have to do the matching. This does not mean that matching is never necessary, but as more and more functions are added to `DataType`, the matching becomes very annoying.
- `DataType` is represented as an enum type, so we can't use it as a generic argument.
- `DataType` may involve nested data types, such as `DataType::Struct`, but we put `DataField` inside `DataType`, which is logically unreasonable.
- It is hard to attach attributes to the enum-based `DataType`, such as the nullable attribute (#3726, #3769).
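To illustrate the first point, here is a minimal sketch (the enum variants and the `type_name` function are hypothetical, not the real definitions): every new capability on an enum-based `DataType` requires another exhaustive match, whereas a trait-based design would let each type implement the capability directly.

```rust
// Hypothetical, heavily simplified stand-in for the real DataType enum.
pub enum DataType {
    Int64,
    Float64,
    String,
}

// Each new function over the enum needs a match like this one; the arms
// multiply with every variant and with every new capability added.
pub fn type_name(dt: &DataType) -> &'static str {
    match dt {
        DataType::Int64 => "Int64",
        DataType::Float64 => "Float64",
        DataType::String => "String",
    }
}
```

With a trait, `type_name` would instead be a method each concrete type implements, and no central match would need updating.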
Too many concepts about columns (Series/Column/Array)

- `DataColumn` is an enum, including `Constant(value)` and `Array(Series)`:

```rust
pub enum DataColumn {
    // Array of values.
    Array(Series),
    // A single value, repeated to the given length.
    Constant(DataValue, usize),
}
```

- `Series` is a wrapper around `SeriesTrait`:

```rust
pub struct Series(pub Arc<dyn SeriesTrait>);
```
Summary
Semi-structured data types are used to represent schemaless data formats, such as JSON, XML, and so on. In order to be compatible with Snowflake's SQL syntax, we support the following three semi-structured data types:

- `Variant`: a tagged universal type, which can store values of any other type, including `Object` and `Array`.
- `Object`: used to represent collections of key-value pairs, where the key is a non-empty string and the value is a value of `Variant` type.
- `Array`: used to represent dense or sparse arrays of arbitrary size, where the index is a non-negative integer (up to 2^31 - 1) and the values are of `Variant` type.

Since `Object` and `Array` can be regarded as kinds of `Variant`, the following introduction mainly takes `Variant` as an example.
Examples
The following example shows how to create a table with `VARIANT`, `ARRAY`, and `OBJECT` data types, then insert and query some test data.
```sql
CREATE TABLE test_semi_structured (
    var variant,
    arr array,
    obj object
);

INSERT INTO test_semi_structured (var, arr, obj)
SELECT 1, array_construct(1, 2, 3)
     , parse_json(' { "key1": "value1", "key2": "value2" } ');

INSERT INTO test_semi_structured (var, arr, obj)
SELECT to_variant('abc')
     , array_construct('a', 'b', 'c')
     , parse_json(' { "key1": [1, 2, 3], "key2": ["a", "b", "c"] } ');

SELECT * FROM test_semi_structured;
```

```
+-------+-------------------+----------------------------------------------------+
| var   | arr               | obj                                                |
+-------+-------------------+----------------------------------------------------+
| 1     | [ 1, 2, 3 ]       | { "key1": "value1", "key2": "value2" }             |
| "abc" | [ "a", "b", "c" ] | { "key1": [ 1, 2, 3 ], "key2": [ "a", "b", "c" ] } |
+-------+-------------------+----------------------------------------------------+
```
Design Details
Data storage format
In order to store `Variant` type values in Parquet files with a schema, we need to do some conversion on the original raw values. We have the following two choices:
Store data in one column as JSON or binary JSON-like format
JSON (JavaScript Object Notation) is the most common semi-structured format and can represent arbitrarily complex hierarchical values, which makes it very suitable for this kind of semi-structured data. Data of `Variant` type can be encoded in JSON format and stored as a raw string value.
The main disadvantage of the JSON format is that each access requires expensive parsing of the raw string, so there are several optimized binary JSON-like formats that improve parsing speed and single-key access.
For example, MongoDB and PostgreSQL use BSON and jsonb respectively to store data in JSON format.
UBJSON is another compatible binary JSON format specification; it provides universal compatibility and is as easy to use as JSON while being faster and more efficient.
All of these binary JSON formats have better performance; the only problem is that they lack good Rust implementations.
Store each unique key of data in sub-columns
Although the JSON format can represent arbitrary data, in practice JSON data is usually generated by machines, so we can predict its schema and structure. Based on this observation, we can extract each unique key in the JSON data and flatten it into multiple independent virtual sub-columns.
For example, suppose we have a column named `tweet` that stores the following JSON data:

```json
{"id": 1, "date": "1/11", "type": "story", "score": 3, "desc": 2, "title": "...", "url": "..."}
{"id": 2, "date": "1/12", "type": "poll", "score": 5, "desc": 2, "title": "..."}
{"id": 3, "date": "1/13", "type": "pollop", "score": 6, "poll": 2, "title": "..."}
{"id": 4, "date": "1/14", "type": "story", "score": 1, "desc": 1, "title": "...", "url": "..."}
{"id": 5, "date": "1/15", "type": "comment", "parent": 4, "text": "..."}
{"id": 6, "date": "1/16", "type": "comment", "parent": 1, "text": "..."}
{"id": 7, "date": "1/17", "type": "pollop", "score": 3, "poll": 2, "title": "..."}
{"id": 8, "date": "1/18", "type": "comment", "parent": 1, "text": "..."}
```
This column can be split into 10 virtual sub-columns: `tweet.id`, `tweet.date`, `tweet.type`, `tweet.score`, `tweet.desc`, `tweet.title`, `tweet.url`, `tweet.parent`, `tweet.text`, and `tweet.poll`.
The data type of each sub-column can also be automatically deduced from its values, so we can automatically create those sub-columns and insert the corresponding values.
The main advantage of this storage format is that queries do not need to parse the raw JSON string, which can greatly speed up query processing. The disadvantage is that additional processing is required when inserting data, and the schema of each row is not exactly the same: in scenarios where rows differ widely, many sub-column values will be NULL. To strike a good balance across such scenarios, we can refer to the optimization algorithms introduced in the paper JSON Tiles.
From a performance perspective, the better solution is to store data in a binary JSON-like format and extract frequently queried unique keys into sub-columns. However, in order to simplify development, we use the JSON format in the first version; a binary JSON-like format and separately stored sub-columns will be adopted in a future optimized version.
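The flattening step can be sketched with a toy JSON value type (the `Json` enum and `flatten` function below are illustrative only, not part of the design): each unique key becomes a sub-column, and rows that lack a key contribute NULL (`None`) to that sub-column.

```rust
use std::collections::BTreeMap;

// Toy JSON value covering only the shapes this sketch needs.
#[derive(Clone, Debug, PartialEq)]
pub enum Json {
    Str(String),
    Num(i64),
}

// Flatten rows of key-value objects into per-key sub-columns.
pub fn flatten(rows: &[Vec<(&str, Json)>]) -> BTreeMap<String, Vec<Option<Json>>> {
    let mut cols: BTreeMap<String, Vec<Option<Json>>> = BTreeMap::new();
    // Pass 1: collect every unique key across all rows.
    for row in rows {
        for (key, _) in row {
            cols.entry((*key).to_string()).or_default();
        }
    }
    // Pass 2: for each row, push its value (or None) into every sub-column,
    // so every sub-column ends up with exactly one entry per row.
    for row in rows {
        for (key, col) in cols.iter_mut() {
            let value = row
                .iter()
                .find(|(k, _)| *k == key.as_str())
                .map(|(_, v)| v.clone());
            col.push(value);
        }
    }
    cols
}
```

A production implementation would also deduce a type per sub-column and handle nested objects; this sketch only shows the key extraction and NULL padding.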
Data Types
Add three new values `Variant`, `VariantArray`, and `VariantObject` to the enumeration `TypeID` to support these three semi-structured data types.
Since `TypeID` already has a value named `Array`, we name the semi-structured array type `VariantArray` to distinguish the two.
Define the corresponding structures for these types, and implement the trait `DataType` for them.
The `PhysicalTypeID` corresponding to all three types is `String`; the JSON value will be converted to a raw string for storage.
```rust
pub enum TypeID {
    // ...
    Variant,
    VariantArray,
    VariantObject,
}

pub struct VariantType {}
pub struct VariantArrayType {}
pub struct VariantObjectType {}
```
Object Column
Currently `Column` is only implemented for fundamental types; custom structs or enumerations like `serde_json::Value` don't have a suitable `Column` implementation to store them.
Define `ObjectColumn` and `MutableObjectColumn` as generic structures for storing custom data types, and implement the traits `Column` and `MutableColumn` for them respectively.
`ObjectType` can be any custom struct or enumeration; for example, we can define `JsonColumn` by specifying the type parameter as `serde_json::Value`.
All `variant` data will be automatically cast to `serde_json::Value` to generate a `JsonColumn`.
Other custom data types, such as a `BitmapColumn`, can easily be supported in the future.
```rust
#[derive(Clone)]
pub struct ObjectColumn<T: ObjectType> {
    values: Vec<T>,
}

#[derive(Debug)]
pub struct MutableObjectColumn<T: ObjectType> {
    data_type: DataTypeImpl,
    pub(crate) values: Vec<T>,
}

type JsonColumn = ObjectColumn<serde_json::Value>;
```
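As a dependency-free sketch of how `ObjectColumn` is used (simplified: the `ObjectType` bound is reduced to a marker trait, the `data_type` field is dropped, and raw JSON strings stand in for `serde_json::Value`):

```rust
// Simplified stand-in for the ObjectType bound described above.
pub trait ObjectType: Clone + std::fmt::Debug {}
impl ObjectType for String {}

// A generic column over any custom value type, mirroring ObjectColumn.
#[derive(Clone, Debug)]
pub struct ObjectColumn<T: ObjectType> {
    values: Vec<T>,
}

impl<T: ObjectType> ObjectColumn<T> {
    pub fn new() -> Self {
        Self { values: Vec::new() }
    }
    pub fn push(&mut self, value: T) {
        self.values.push(value);
    }
    pub fn len(&self) -> usize {
        self.values.len()
    }
    pub fn get(&self, index: usize) -> Option<&T> {
        self.values.get(index)
    }
}

// With serde_json available this would be ObjectColumn<serde_json::Value>;
// raw JSON strings are used here only to keep the sketch self-contained.
type JsonColumn = ObjectColumn<String>;
```

The same generic shape is what allows a future `BitmapColumn` to reuse the implementation by plugging in a different `ObjectType`.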
TODO
- Use better storage formats to improve query performance