Skip to main content

Querying & Transforming

Databend enables direct querying of staged files without loading data into tables first. Query files from any stage type (user, internal, external) or directly from object storage and HTTPS URLs. Ideal for data inspection, validation, and transformation before or after loading.

Syntax

query only

SELECT {
[<alias>.]<column> [, [<alias>.]<column> ...] -- Query columns by name
| [<alias>.]$<col_position> [, [<alias>.]$<col_position> ...] -- Query columns by position
| [<alias>.]$1[:<column>] [, [<alias>.]$1[:<column>] ...] -- Query rows as Variants
}
FROM {@<stage_name>[/<path>] | '<uri>'} -- stage table function
[( -- stage table function parameters
[<connection_parameters>],
[ PATTERN => '<regex_pattern>'],
[ FILE_FORMAT => 'CSV | TSV | NDJSON | PARQUET | ORC | Avro | <custom_format_name>'],
[ FILES => ( '<file_name>' [ , '<file_name>' ... ])],
[ CASE_SENSITIVE => true | false ]
)]
[<alias>]

copy with transform

COPY INTO [<database_name>.]<table_name> [ ( <col_name> [ , <col_name> ... ] ) ]
FROM (
SELECT {
[<alias>.]<column> [, [<alias>.]<column> ...] -- Query columns by name
| [<alias>.]$<col_position> [, [<alias>.]$<col_position> ...] -- Query columns by position
| [<alias>.]$1[:<column>] [, [<alias>.]$1[:<column>] ...] -- Query rows as Variants
} ]
FROM {@<stage_name>[/<path>] | '<uri>'}
)
[ FILES = ( '<file_name>' [ , '<file_name>' ] [ , ... ] ) ]
[ PATTERN = '<regex_pattern>' ]
[ FILE_FORMAT = (
FORMAT_NAME = '<your-custom-format>'
| TYPE = { CSV | TSV | NDJSON | PARQUET | ORC | AVRO } [ formatTypeOptions ]
) ]
[ copyOptions ]
Note

compared the two syntaxes

  • Same Select List
  • Same FROM {@<stage_name>[/<path>] | '<uri>'}
  • diff parameters:
    • query use table function parameters, i.e. (<key> => <value>, ...)
    • transform use Options at the end of Copy into table

FROM Clause

the FROM Clause use similar syntax of Table Function. Like ordinary tables, table alias can be used when join with other tables.

table function parameters:

ParameterDescription
FILE_FORMATFile format type (CSV, TSV, NDJSON, PARQUET, ORC, Avro)
PATTERNRegex pattern to filter files
FILESExplicit list of files to query
CASE_SENSITIVEColumn name case sensitivity (Parquet only)
connection_parametersExternal storage connection details

Query File Data

The select list supports three syntaxes; only one may be used, with no mixing.

Query rows as Variants

  • Supported File Formats: NDJSON, AVRO, Parquet, ORC
Note

Currently for Parquet and ORC, Query rows as Variants is slower than Query columns by name, and the two methods can not be mix used.

syntax:

SELECT [<alias>.]$1[:<column>] [, [<alias>.]$1[:<column>]  ...] <FROM Clause>
  • Example: SELECT $1:id, $1:name FROM ...
  • Table Schema: ($1: Variant). i.e. Single Column with Variant Object Type, each Variant representing a whole row
  • Notes:
    • The type of path expressions like $1:column is Variant too, it can be auto cast to native types when used in expressions or load to dest table column, sometimes you may want to cast manually before for type-specific operations (e.g., CAST($1:id AS INT)) to make the semantics more explicit.

Query columns by name

  • supported File Formats: NDJSON, AVRO, Parquet, ORC
SELECT [<alias>.]<column> [, [<alias>.]<column>  ...] <FROM Clause>
  • Example: SELECT id, name FROM ...
  • Table Schema: Columns Mapping from Parquet or ORC file schema
  • Notes:
    • All files are required to have the same Parquet/ORC schema; otherwise, an error will be returned

Query columns by Position

  • supported File Formats: CSV, TSV
SELECT [<alias>.]$<col_position>[, [<alias>.]$<col_position>,  ...] <FROM Clause>
  • Example: SELECT $1, $2 FROM ...
  • Table Schema: Columns of type VARCHAR NULL
  • Notes
    • <col_position> starts from 1

Query Metadata

You can also include file metadata in your queries, which is useful for tracking data lineage and debugging:

SELECT METADATA$FILENAME, METADATA$FILE_ROW_NUMBER, $1, <FROM Clause>
(
FILE_FORMAT => 'ndjson_query_format',
PATTERN => '.*[.]ndjson'
);

The following file-level metadata fields are available for the supported file formats:

File MetadataTypeDescription
METADATA$FILENAMEVARCHARThe path of the file from which the row was read
METADATA$FILE_ROW_NUMBERINTThe row number within the file (starting from 0)

Use cases:

  • Data lineage: Track which source file contributed each record
  • Debugging: Identify problematic records by file and line number
  • Incremental processing: Process only specific files or ranges within files

Tutorials by File Formats