Querying Avro Files in Stage
Syntax:
Avro Querying Features Overview
Databend provides comprehensive support for querying Avro files directly from stages. This allows for flexible data exploration and transformation without needing to load the data into a table first.
- Variant Representation: Each row in an Avro file is treated as a variant, referenced by
$1
. This allows for flexible access to nested structures within the Avro data. - Type Mapping: Each Avro type is mapped to a corresponding variant type in Databend.
- Metadata Access: You can access metadata columns like
METADATA$FILENAME
andMETADATA$FILE_ROW_NUMBER
for additional context about the source file and row.
Tutorial
This tutorial demonstrates how to query Avro files stored in a stage.
Step 1. Prepare an Avro File
Consider an Avro file with the following schema named user
:
{
"type": "record",
"name": "user",
"fields": [
{
"name": "id",
"type": "long"
},
{
"name": "name",
"type": "string"
}
]
}
Step 2. Create an External Stage
Create an external stage with your own S3 bucket and credentials where your Avro files are stored.
CREATE STAGE avro_query_stage
URL = 's3://load/avro/'
CONNECTION = (
ACCESS_KEY_ID = '<your-access-key-id>'
SECRET_ACCESS_KEY = '<your-secret-access-key>'
);
Step 3. Query Avro Files
Basic Query
Query Avro files directly from a stage:
SELECT
CAST($1:id AS INT) AS id,
$1:name AS name
FROM @avro_query_stage
(
FILE_FORMAT => 'AVRO',
PATTERN => '.*[.]avro'
);
Query with Metadata
Query Avro files directly from a stage, including metadata columns like METADATA$FILENAME
and METADATA$FILE_ROW_NUMBER
:
SELECT
METADATA$FILENAME,
METADATA$FILE_ROW_NUMBER,
CAST($1:id AS INT) AS id,
$1:name AS name
FROM @avro_query_stage
(
FILE_FORMAT => 'AVRO',
PATTERN => '.*[.]avro'
);
Type Mapping to Variant
Variants in Databend are stored as JSONB. While most Avro types map straightforwardly, some special considerations apply:
- Time Types:
TimeMillis
andTimeMicros
are mapped toINT64
as JSONB does not have a native Time type. Users should be aware of the original type when processing these values. - Decimal Types: Decimals are loaded as
DECIMAL128
orDECIMAL256
. An error may occur if the precision exceeds the supported limits. - Enum Types: Avro
ENUM
types are mapped toSTRING
values in Databend.