Transforming Data on Load
Databend's COPY INTO
command allows data transformation during the loading process. This streamlines your ETL pipeline by incorporating basic transformations, eliminating the need for temporary tables.
The general syntax for COPY INTO
with transformations is:
COPY INTO [<database_name>.]<table_name> [ ( <col_name> [ , <col_name> ... ] ) ]
FROM ( SELECT [<file_col> ... ]
FROM { userStage | internalStage | externalStage } )
[ FILES = ( '<file_name>' [ , '<file_name>' ] [ , ... ] ) ]
[ PATTERN = '<regex_pattern>' ]
[ FILE_FORMAT = (
FORMAT_NAME = '<your-custom-format>'
| TYPE = { CSV | TSV | NDJSON | PARQUET | ORC | Avro } [ formatTypeOptions ]
) ]
[ copyOptions ]
For more details, see COPY INTO.
Key transformations you can perform:
- Loading a subset of data columns: Selectively import specific columns.
- Reordering columns: Change column order during load.
- Converting datatypes: Ensure consistency and compatibility.
- Performing arithmetic operations: Generate new derived data.
- Loading data to a table with additional columns: Map and insert data into existing structures.
Tutorials
These tutorials demonstrate data transformation during loading. Each example shows loading from a staged file.
Before You Begin
Create a stage and generate a sample Parquet file:
CREATE STAGE my_parquet_stage;
COPY INTO @my_parquet_stage
FROM (
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS id,
'Name_' || CAST(number AS VARCHAR) AS name,
20 + MOD(number, 23) AS age,
DATE_ADD('day', MOD(number, 60), '2022-01-01') AS onboarded
FROM numbers(10)
)
FILE_FORMAT = (TYPE = PARQUET);
Query the staged sample file:
SELECT * FROM @my_parquet_stage;
Result:
┌───────────────────────────────────────┐
│ id │ name │ age │ onboarded │
├────────┼────────┼────────┼────────────┤
│ 1 │ Name_0 │ 20 │ 2022-01-01 │
│ 2 │ Name_5 │ 25 │ 2022-01-06 │
│ 3 │ Name_1 │ 21 │ 2022-01-02 │
│ 4 │ Name_6 │ 26 │ 2022-01-07 │
│ 5 │ Name_7 │ 27 │ 2022-01-08 │
│ 6 │ Name_2 │ 22 │ 2022-01-03 │
│ 7 │ Name_8 │ 28 │ 2022-01-09 │
│ 8 │ Name_3 │ 23 │ 2022-01-04 │
│ 9 │ Name_4 │ 24 │ 2022-01-05 │
│ 10 │ Name_9 │ 29 │ 2022-01-10 │
└───────────────────────────────────────┘
Tutorial 1 - Loading a Subset of Data Columns
Load data into a table with fewer columns than the source file (e.g., excluding 'age').
CREATE TABLE employees_no_age (
id INT,
name VARCHAR,
onboarded timestamp
);
COPY INTO employees_no_age
FROM (
SELECT t.id,
t.name,
t.onboarded
FROM @my_parquet_stage t
)
FILE_FORMAT = (TYPE = PARQUET)
PATTERN = '.*parquet';
SELECT * FROM employees_no_age;
Result (first 3 rows):
┌──────────────────────────────────────────────────────────┐
│ id │ name │ onboarded │
├─────────────────┼──────────────────┼─────────────────────┤
│ 1 │ Name_0 │ 2022-01-01 00:00:00 │
│ 2 │ Name_5 │ 2022-01-06 00:00:00 │
│ 3 │ Name_1 │ 2022-01-02 00:00:00 │
└──────────────────────────────────────────────────────────┘
Tutorial 2 - Reordering Columns During Load
Load data into a table with columns in a different order (e.g., 'age' before 'name').
CREATE TABLE employees_new_order (
id INT,
age INT,
name VARCHAR,
onboarded timestamp
);
COPY INTO employees_new_order
FROM (
SELECT
t.id,
t.age,
t.name,
t.onboarded
FROM @my_parquet_stage t
)
FILE_FORMAT = (TYPE = PARQUET)
PATTERN = '.*parquet';
SELECT * FROM employees_new_order;
Result (first 3 rows):
┌────────────────────────────────────────────────────────────────────────────┐
│ id │ age │ name │ onboarded │
├─────────────────┼─────────────────┼──────────────────┼─────────────────────┤
│ 1 │ 20 │ Name_0 │ 2022-01-01 00:00:00 │
│ 2 │ 25 │ Name_5 │ 2022-01-06 00:00:00 │
│ 3 │ 21 │ Name_1 │ 2022-01-02 00:00:00 │
└────────────────────────────────────────────────────────────────────────────┘
Tutorial 3 - Converting Datatypes During Load
Load data and convert a column's datatype (e.g., 'onboarded' to DATE
).
CREATE TABLE employees_date (
id INT,
name VARCHAR,
age INT,
onboarded date
);
COPY INTO employees_date
FROM (
SELECT
t.id,
t.name,
t.age,
to_date(t.onboarded)
FROM @my_parquet_stage t
)
FILE_FORMAT = (TYPE = PARQUET)
PATTERN = '.*parquet';
SELECT * FROM employees_date;
Result (first 3 rows):
┌───────────────────────────────────────────────────────────────────────┐
│ id │ name │ age │ onboarded │
├─────────────────┼──────────────────┼─────────────────┼────────────────┤
│ 1 │ Name_0 │ 20 │ 2022-01-01 │
│ 2 │ Name_5 │ 25 │ 2022-01-06 │
│ 3 │ Name_1 │ 21 │ 2022-01-02 │
└───────────────────────────────────────────────────────────────────────┘
Tutorial 4 - Performing Arithmetic Operations During Load
Load data and perform arithmetic operations (e.g., increment 'age' by 1).
CREATE TABLE employees_new_age (
id INT,
name VARCHAR,
age INT,
onboarded timestamp
);
COPY INTO employees_new_age
FROM (
SELECT
t.id,
t.name,
t.age + 1,
t.onboarded
FROM @my_parquet_stage t
)
FILE_FORMAT = (TYPE = PARQUET)
PATTERN = '.*parquet';
SELECT * FROM employees_new_age;
Result (first 3 rows):
┌────────────────────────────────────────────────────────────────────────────┐
│ id │ name │ age │ onboarded │
├─────────────────┼──────────────────┼─────────────────┼─────────────────────┤
│ 1 │ Name_0 │ 21 │ 2022-01-01 00:00:00 │
│ 2 │ Name_5 │ 26 │ 2022-01-06 00:00:00 │
│ 3 │ Name_1 │ 22 │ 2022-01-02 00:00:00 │
└────────────────────────────────────────────────────────────────────────────┘
Tutorial 5 - Loading to a Table with Additional Columns
Load data into a table that has more columns than the source file.
CREATE TABLE employees_plus (
id INT,
name VARCHAR,
age INT,
onboarded timestamp,
lastday timestamp
);
COPY INTO employees_plus (id, name, age, onboarded)
FROM (
SELECT
t.id,
t.name,
t.age,
t.onboarded
FROM @my_parquet_stage t
)
FILE_FORMAT = (TYPE = PARQUET)
PATTERN = '.*parquet';
SELECT * FROM employees_plus;
Result (first 3 rows):
┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ id │ name │ age │ onboarded │ lastday │
├─────────────────┼──────────────────┼─────────────────┼─────────────────────┼─────────────────────┤
│ 1 │ Name_0 │ 20 │ 2022-01-01 00:00:00 │ NULL │
│ 2 │ Name_5 │ 25 │ 2022-01-06 00:00:00 │ NULL │
│ 3 │ Name_1 │ 21 │ 2022-01-02 00:00:00 │ NULL │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘