Querying Avro Files in Stage
Query Avro Files in Stage
Syntax:
SELECT [<alias>.]$1:<column> [, $1:<column> ...]
FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}
[(
[<connection_parameters>],
[ PATTERN => '<regex_pattern>'],
[ FILE_FORMAT => 'AVRO'],
[ FILES => ( '<file_name>' [ , '<file_name>' ] [ , ... ] ) ]
)]
Tips
Query Return Content Explanation:
- Return Format: Each row as a single variant object (referenced as
$1
) - Access Method: Use path expressions
$1:column_name
- Example:
SELECT $1:id, $1:name FROM @stage_name
- Key Features:
- Must use path notation to access specific fields
- Type casting required for type-specific operations (e.g.,
CAST($1:id AS INT)
) - Avro schema is mapped to variant structure
- Whole row is represented as a single variant object
Avro Querying Features Overview
Databend provides comprehensive support for querying Avro files directly from stages. This allows for flexible data exploration and transformation without needing to load the data into a table first.
- Variant Representation: Each row in an Avro file is treated as a variant, referenced by
$1
. This allows for flexible access to nested structures within the Avro data. - Type Mapping: Each Avro type is mapped to a corresponding variant type in Databend.
- Metadata Access: You can access metadata columns like
METADATA$FILENAME
andMETADATA$FILE_ROW_NUMBER
for additional context about the source file and row.
Tutorial
This tutorial demonstrates how to query Avro files stored in a stage.
Step 1. Prepare an Avro File
Consider an Avro file with the following schema named user
:
{
"type": "record",
"name": "user",
"fields": [
{
"name": "id",
"type": "long"
},
{
"name": "name",
"type": "string"
}
]
}
Step 2. Create an External Stage
Create an external stage with your own S3 bucket and credentials where your Avro files are stored.
CREATE STAGE avro_query_stage
URL = 's3://load/avro/'
CONNECTION = (
ACCESS_KEY_ID = '<your-access-key-id>'
SECRET_ACCESS_KEY = '<your-secret-access-key>'
);
Step 3. Query Avro Files
Basic Query
Query Avro files directly from a stage:
SELECT
CAST($1:id AS INT) AS id,
$1:name AS name
FROM @avro_query_stage
(
FILE_FORMAT => 'AVRO',
PATTERN => '.*[.]avro'
);
Query with Metadata
Query Avro files directly from a stage, including metadata columns like METADATA$FILENAME
and METADATA$FILE_ROW_NUMBER
:
SELECT
METADATA$FILENAME,
METADATA$FILE_ROW_NUMBER,
CAST($1:id AS INT) AS id,
$1:name AS name
FROM @avro_query_stage
(
FILE_FORMAT => 'AVRO',
PATTERN => '.*[.]avro'
);
Type Mapping to Variant
Variants in Databend are stored as JSONB. While most Avro types map straightforwardly, some special considerations apply:
- Time Types:
TimeMillis
andTimeMicros
are mapped toINT64
as JSONB does not have a native Time type. Users should be aware of the original type when processing these values. - Decimal Types: Decimals are loaded as
DECIMAL128
orDECIMAL256
. An error may occur if the precision exceeds the supported limits. - Enum Types: Avro
ENUM
types are mapped toSTRING
values in Databend.