Skip to main content

INFER_SCHEMA

Automatically detects the file metadata schema and retrieves the column definitions.

infer_schema currently supports the following file formats:

  • Parquet - Native support for schema inference
  • CSV - With customizable delimiters and header detection
  • NDJSON - Newline-delimited JSON files

Compression Support: All formats also support compressed files with extensions .zip, .xz, .zst.

File Size Limit

Each individual file has a maximum size limit of 100MB for schema inference.

Schema Merging

When processing multiple files, infer_schema automatically merges different schemas:

  • Compatible types are promoted (e.g., INT8 + INT16 → INT16)
  • Incompatible types fall back to VARCHAR (e.g., INT + FLOAT → VARCHAR)
  • Missing columns in some files are marked as nullable
  • New columns from later files are added to the final schema

This ensures all files can be read using the unified schema.

Syntax

INFER_SCHEMA(
LOCATION => '{ internalStage | externalStage }'
[ PATTERN => '<regex_pattern>']
[ FILE_FORMAT => '<format_name>' ]
[ MAX_RECORDS_PRE_FILE => <number> ]
[ MAX_FILE_COUNT => <number> ]
)

Parameters

ParameterDescriptionDefaultExample
LOCATIONStage location: @<stage_name>[/<path>]Required'@my_stage/data/'
PATTERNFile name pattern to matchAll files'*.csv', '*.parquet'
FILE_FORMATFile format name for parsingStage's format'csv_format', 'NDJSON'
MAX_RECORDS_PRE_FILEMax records to sample per fileAll records100, 1000
MAX_FILE_COUNTMax number of files to processAll files5, 10

Examples

Parquet Files

-- Create stage and export data
CREATE STAGE test_parquet;
COPY INTO @test_parquet FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE = 'PARQUET');

-- Infer schema from parquet files using pattern
SELECT * FROM INFER_SCHEMA(
location => '@test_parquet',
pattern => '*.parquet'
);

Result:

+-------------+-----------------+----------+----------+----------+
| column_name | type | nullable | filenames| order_id |
+-------------+-----------------+----------+----------+----------+
| number | BIGINT UNSIGNED | false | data_... | 0 |
+-------------+-----------------+----------+----------+----------+

CSV Files

-- Create stage and export CSV data
CREATE STAGE test_csv;
COPY INTO @test_csv FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE = 'CSV');

-- Create a CSV file format
CREATE FILE FORMAT csv_format TYPE = 'CSV';

-- Infer schema using pattern and file format
SELECT * FROM INFER_SCHEMA(
location => '@test_csv',
pattern => '*.csv',
file_format => 'csv_format'
);

Result:

+-------------+---------+----------+----------+----------+
| column_name | type | nullable | filenames| order_id |
+-------------+---------+----------+----------+----------+
| column_1 | BIGINT | true | data_... | 0 |
+-------------+---------+----------+----------+----------+

For CSV files with headers:

-- Create CSV file format with header support
CREATE FILE FORMAT csv_headers_format
TYPE = 'CSV'
field_delimiter = ','
skip_header = 1;

-- Export data with headers
CREATE STAGE test_csv_headers;
COPY INTO @test_csv_headers FROM (
SELECT number as user_id, 'user_' || number::string as user_name
FROM numbers(5)
) FILE_FORMAT = (TYPE = 'CSV', output_header = true);

-- Infer schema with headers
SELECT * FROM INFER_SCHEMA(
location => '@test_csv_headers',
file_format => 'csv_headers_format'
);

Limit records for faster inference:

-- Sample only first 5 records for schema inference
SELECT * FROM INFER_SCHEMA(
location => '@test_csv',
pattern => '*.csv',
file_format => 'csv_format',
max_records_pre_file => 5
);

NDJSON Files

-- Create stage and export NDJSON data
CREATE STAGE test_ndjson;
COPY INTO @test_ndjson FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE = 'NDJSON');

-- Infer schema using pattern and NDJSON format
SELECT * FROM INFER_SCHEMA(
location => '@test_ndjson',
pattern => '*.ndjson',
file_format => 'NDJSON'
);

Result:

+-------------+---------+----------+----------+----------+
| column_name | type | nullable | filenames| order_id |
+-------------+---------+----------+----------+----------+
| number | BIGINT | true | data_... | 0 |
+-------------+---------+----------+----------+----------+

Limit records for faster inference:

-- Sample only first 5 records for schema inference
SELECT * FROM INFER_SCHEMA(
location => '@test_ndjson',
pattern => '*.ndjson',
file_format => 'NDJSON',
max_records_pre_file => 5
);

Schema Merging with Multiple Files

When files have different schemas, infer_schema merges them intelligently:

-- Suppose you have multiple CSV files with different schemas:
-- file1.csv: id(INT), name(VARCHAR)
-- file2.csv: id(INT), name(VARCHAR), age(INT)
-- file3.csv: id(FLOAT), name(VARCHAR), age(INT)

SELECT * FROM INFER_SCHEMA(
location => '@my_stage/',
pattern => '*.csv',
file_format => 'csv_format'
);

Result shows merged schema:

+-------------+---------+----------+-----------+----------+
| column_name | type | nullable | filenames | order_id |
+-------------+---------+----------+-----------+----------+
| id | VARCHAR | true | file1,... | 0 | -- INT+FLOAT→VARCHAR
| name | VARCHAR | true | file1,... | 1 |
| age | BIGINT | true | file1,... | 2 | -- Missing in file1→nullable
+-------------+---------+----------+-----------+----------+

Pattern Matching and File Limits

Use pattern matching to infer schema from multiple files:

-- Infer schema from all CSV files in the directory
SELECT * FROM INFER_SCHEMA(
location => '@my_stage/',
pattern => '*.csv'
);

Limit the number of files processed to improve performance:

-- Process only the first 5 matching files
SELECT * FROM INFER_SCHEMA(
location => '@my_stage/',
pattern => '*.csv',
max_file_count => 5
);

Compressed Files

infer_schema automatically handles compressed files:

-- Works with compressed CSV files
SELECT * FROM INFER_SCHEMA(location => '@my_stage/data.csv.zip');

-- Works with compressed NDJSON files
SELECT * FROM INFER_SCHEMA(
location => '@my_stage/data.ndjson.xz',
file_format => 'NDJSON',
max_records_pre_file => 50
);

Create Table from Inferred Schema

The infer_schema function displays the schema but doesn't create tables. To create a table from the inferred schema:

-- Create table structure from file schema
CREATE TABLE my_table AS
SELECT * FROM @my_stage/ (pattern=>'*.parquet')
LIMIT 0;

-- Verify the table structure
DESC my_table;