- RFC PR: datafuselabs/databend#6503
- Tracking Issue: datafuselabs/databend#6215
Summary
Add a new SQL statement for `PRESIGN`, so users can generate a presigned URL for uploading or downloading.
Motivation
Databend supports loading data via internal stage:
- Call the HTTP API `upload_to_stage` to upload files: `curl -H "x-databend-stage-name:my_int_stage" -F "upload=@./books.csv" -XPUT http://localhost:8000/v1/upload_to_stage`
- Call `COPY INTO` to copy data: `COPY INTO books FROM '@my_int_stage'`
This workflow's throughput is limited by databend's HTTP API `upload_to_stage`. We can improve the throughput by allowing users to upload to our backend storage directly. For example, we can use AWS Authenticating Requests: Using Query Parameters to generate a presigned URL. This way, users upload content to AWS S3 directly without going through databend.
Besides, `PRESIGN` can reduce networking expense: all traffic goes directly from the user side to S3, with no extra cost for databend instances.
Guide-level explanation
Users can generate a URL for reading:
MySQL [(none)]> PRESIGN @my_stage/books.csv
+--------+---------+---------------------------------------------------------------------------------+
| method | headers | url |
+--------+---------+---------------------------------------------------------------------------------+
| GET | [] | https://example.s3.amazonaws.com/books.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&... |
+--------+---------+---------------------------------------------------------------------------------+
By default, the presigned URL expires in 1 hour. Users can set the expiry time to 2 hours like the following:
MySQL [(none)]> PRESIGN @my_stage/books.csv EXPIRE=7200
+--------+---------+---------------------------------------------------------------------------------+
| method | headers | url |
+--------+---------+---------------------------------------------------------------------------------+
| GET | {} | https://example.s3.amazonaws.com/books.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&... |
+--------+---------+---------------------------------------------------------------------------------+
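As an aside (not part of the proposal), the returned URL can be fetched by any HTTP client before it expires. Below is a minimal sketch in Rust, assuming the `reqwest` crate with its blocking feature and the placeholder URL from the example output above:

// Hypothetical client-side download using the presigned GET URL above.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder copied from the example result set; use the real `url` column.
    let url = "https://example.s3.amazonaws.com/books.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&...";
    // No databend credentials are needed: the signature in the query string
    // authorizes this single request until it expires.
    let body = reqwest::blocking::get(url)?.bytes()?;
    std::fs::write("books.csv", body)?;
    Ok(())
}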
A presigned URL is generated for the `download` action by default. Users can create a URL for `upload` like the following:
MySQL [(none)]> PRESIGN UPLOAD @my_stage/books.csv EXPIRE=7200
+--------+---------+---------------------------------------------------------------------------------+
| method | headers | url |
+--------+---------+---------------------------------------------------------------------------------+
| PUT | {} | https://example.s3.amazonaws.com/books.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&... |
+--------+---------+---------------------------------------------------------------------------------+
If the `headers` returned by `presign` are not empty, users should include them in the actual requests.
MySQL [(none)]> PRESIGN UPLOAD @my_stage/books.csv
+--------+--------------------------+---------------------------------------------------------------------------------+
| method | headers | url |
+--------+--------------------------+---------------------------------------------------------------------------------+
| PUT | {'x-amz-key': 'value'} | https://example.s3.amazonaws.com/books.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&... |
+--------+--------------------------+---------------------------------------------------------------------------------+
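For illustration only (not databend code), a client can replay the returned request with any HTTP library, attaching every pair from the `headers` column. A sketch in Rust, again assuming the `reqwest` blocking client and the placeholder values from the output above:

// Hypothetical client-side upload driven by the `PRESIGN UPLOAD` output above.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholders copied from the example result set.
    let url = "https://example.s3.amazonaws.com/books.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&...";
    let body = std::fs::read("books.csv")?;

    let resp = reqwest::blocking::Client::new()
        .put(url)                     // the `method` column
        .header("x-amz-key", "value") // every pair from the `headers` column
        .body(body)
        .send()?;

    // A 2xx status means the file is now in the stage's backend storage,
    // ready for `COPY INTO`.
    assert!(resp.status().is_success());
    Ok(())
}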
Reference-level explanation
`PRESIGN` will be implemented as a statement instead of a function so that we can return the HTTP method, headers, and URL at the same time. Most of the work has already been done via Apache OpenDAL's presign support.
The syntax will be:
PRESIGN [(DOWNLOAD | UPLOAD)] <location> [EXPIRE = <SECONDS>]
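For reference, OpenDAL's presign support already yields the method, headers, and URI that map onto the three columns of the result set. The sketch below is illustrative only and written against a recent OpenDAL Rust API; exact signatures may differ in the version databend pins, and `presign_upload` is a hypothetical helper, not the actual interpreter code.

use std::time::Duration;

use opendal::Operator;

// Illustrative: derive the (method, headers, url) triple returned by
// `PRESIGN UPLOAD <location> EXPIRE = <seconds>` from OpenDAL.
async fn presign_upload(op: &Operator, path: &str, expire_secs: u64) -> opendal::Result<()> {
    // `presign_write` signs a request that any HTTP client can send directly
    // to the backend storage, without going through databend.
    let req = op.presign_write(path, Duration::from_secs(expire_secs)).await?;

    println!("method : {}", req.method());
    println!("headers: {:?}", req.header());
    println!("url    : {}", req.uri());
    Ok(())
}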
In databend, we will:
- Add `PRESIGN` in the parser (only in the new planner)
- Implement `presign` interpreters
- Add stateful tests for `presign`
- Add docs around `presign`
Drawbacks
None.
Rationale and alternatives
Snowflake GET_PRESIGNED_URL
Snowflake has a SQL function called `GET_PRESIGNED_URL`.
GET_PRESIGNED_URL( @<stage_name> , '<relative_file_path>' , [ <expiration_time> ] )
Users can get a presigned URL of files in a stage for downloading only:
select get_presigned_url(@images_stage, 'us/yosemite/half_dome.jpg', 3600);
+----------------------------------------------------------------------+
| GET_PRESIGNED_URL(@IMAGES_STAGE, 'US/YOSEMITE/HALF_DOME.JPG', 3600)  |
+----------------------------------------------------------------------+
| http://myaccount.s3.amazonaws.com/national_parks/us/yosemite/half_dome.jpg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=xxxxxxxx&X-Amz-Date=20200625T162738Z&X-Amz-Expires=3600&X-Amz-Security-Token=xxxxxxxx&X-Amz-SignedHeaders=host&X-Amz-Signature=xxxxxxxx |
+----------------------------------------------------------------------+
Snowflake doesn't allow generating a presigned URL for uploading via SQL. Instead, they implement a similar feature in their SDKs.
Take the Snowflake golang SDK as an example. It builds the transfer client with a short-lived S3 token:
func (util *snowflakeS3Client) createClient(info *execResponseStageInfo, useAccelerateEndpoint bool) (cloudClient, error) {
    stageCredentials := info.Creds
    var resolver s3.EndpointResolver
    if info.EndPoint != "" {
        resolver = s3.EndpointResolverFromURL("https://" + info.EndPoint) // FIPS endpoint
    }
    return s3.New(s3.Options{
        Region: info.Region,
        Credentials: aws.NewCredentialsCache(credentials.NewStaticCredentialsProvider(
            stageCredentials.AwsKeyID,
            stageCredentials.AwsSecretKey,
            stageCredentials.AwsToken)),
        EndpointResolver: resolver,
        UseAccelerate:    useAccelerateEndpoint,
    }), nil
}
`execResponseStageInfo` is fetched via an internal API:
jsonBody, err := json.Marshal(req)
if err != nil {
    return nil, err
}
data, err := sc.rest.FuncPostQuery(ctx, sc.rest, &url.Values{}, headers,
    jsonBody, sc.rest.RequestTimeout, requestID, sc.cfg)
if err != nil {
    return data, err
}
code := -1
if data.Code != "" {
    code, err = strconv.Atoi(data.Code)
    if err != nil {
        return data, err
    }
}
Databend prefers to implement this feature in the kernel instead of in SDKs.
Unresolved questions
None.
Future possibilities
Extend support for location
We can extend support to a location, just like `COPY`:
PRESIGN 's3://bucket/books.csv'
Multipart support
We can generate multipart-related actions to allow uploading a single 10TB file:
PRESIGN UPLOAD_PART 's3://bucket/books.csv.zst'