Skip to main content

Summary

Add a new SQL statement for PRESIGN, so users can generate a presigned URL for uploading or downloading.

Motivation

Databend supports loading data via internal stage:

  • Call HTTP API upload_to_stage to upload files: curl -H "x-databend-stage-name:my_int_stage" -F "upload=@./books.csv" -XPUT http://localhost:8000/v1/upload_to_stage
  • Call COPY INTO to copy data: COPY INTO books FROM '@my_int_stage'

This workflow's throughput is limited by databend's HTTP API: upload_to_stage. We can improve the throughput by allowing users to upload to our backend storage directly. For example, we can use AWS Authenticating Requests: Using Query Parameters to generate a presigned URL. This way, users upload content to AWS s3 directly without going through the databend.

Besides, PRESIGN can reduce the expanse of networking. All traffic goes through from the user side to s3 directly, with no extra cost for databend instances.

Guide-level explanation

Users can generate a URL for reading:

MySQL [(none)]> PRESIGN @my_stage/books.csv
+--------+---------+---------------------------------------------------------------------------------+
| method | headers | url |
+--------+---------+---------------------------------------------------------------------------------+
| GET | [] | https://example.s3.amazonaws.com/books.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&... |
+--------+---------+---------------------------------------------------------------------------------+

By default, the presigned URL will be expired in 1 hour. Users can specify the expiry time to 2 hours like the following:

MySQL [(none)]> PRESIGN @my_stage/books.csv EXPIRE=7200
+--------+---------+---------------------------------------------------------------------------------+
| method | headers | url |
+--------+---------+---------------------------------------------------------------------------------+
| GET | {} | https://example.s3.amazonaws.com/books.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&... |
+--------+---------+---------------------------------------------------------------------------------+

A presigned URL is generated for the download action by default. Users can create a URL for upload like the following:

MySQL [(none)]> PRESIGN UPLOAD @my_stage/books.csv EXPIRE=7200
+--------+---------+---------------------------------------------------------------------------------+
| method | headers | url |
+--------+---------+---------------------------------------------------------------------------------+
| PUT | {} | https://example.s3.amazonaws.com/books.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&... |
+--------+---------+---------------------------------------------------------------------------------+

If the headers returned by presign is not empty, users should include them in the actual requests.

MySQL [(none)]> PRESIGN UPLOAD @my_stage/books.csv
+--------+--------------------------+---------------------------------------------------------------------------------+
| method | headers | url |
+--------+--------------------------+---------------------------------------------------------------------------------+
| PUT | {'x-amz-key': 'value'} | https://example.s3.amazonaws.com/books.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&... |
+--------+--------------------------+---------------------------------------------------------------------------------+

Reference-level explanation

PRESIGN will be implemented as a statement instead of a function so that we can return the HTTP method, headers, and URL at the same time.

Most jobs have been done via Apache OpenDAL presign.

The syntax will be:

PRESIGN [(DOWNLOAD | UPLOAD)] <location> [EXPIRE = <SECONDS>]

In databend, we will:

  • Add PRESIGN in the parser (only in the new planner)
  • Implement presign interpreters
  • Add stateful tests for presign
  • Add docs around presign

Drawbacks

None.

Rationale and alternatives

Snowflake GET_PRESIGNED_URL

Snowflake has a sql function called GET_PRESIGNED_URL.

GET_PRESIGNED_URL( @<stage_name> , '<relative_file_path>' , [ <expiration_time> ] )

Users can get a presigned of files in stage for downloading only:

select get_presigned_url(@images_stage, 'us/yosemite/half_dome.jpg', 3600);

+================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================-------+
| GET_PRESIGNED_URL(@IMAGES_STAGE, 'US/YOSEMITE/HALF_DOME.JPG', 3600) |
|================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================-------|
| http://myaccount.s3.amazonaws.com/national_parks/us/yosemite/half_dome.jpg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxAus-west-xxxxxxxxxaws1_request&X-Amz-Date=20200625T162738Z&X-Amz-Expires=3600&X-Amz-Security-Token=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-Amz-SignedHeaders=host&X-Amz-Signature=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |
+================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================-------+

Snowflake doesn't allow generating a presigned URL uploading via SQL. Instead, they implement a similar feature in their SDKs.

Take snowflake golang SDK as an example:

They build the transfer client with short-live s3 token:

func (util *snowflakeS3Client) createClient(info *execResponseStageInfo, useAccelerateEndpoint bool) (cloudClient, error) {
stageCredentials := info.Creds
var resolver s3.EndpointResolver
if info.EndPoint != "" {
resolver = s3.EndpointResolverFromURL("https://" + info.EndPoint) // FIPS endpoint
}

return s3.New(s3.Options{
Region: info.Region,
Credentials: aws.NewCredentialsCache(credentials.NewStaticCredentialsProvider(
stageCredentials.AwsKeyID,
stageCredentials.AwsSecretKey,
stageCredentials.AwsToken)),
EndpointResolver: resolver,
UseAccelerate: useAccelerateEndpoint,
}), nil
}

execResponseStageInfo is fetched via internal API:

jsonBody, err := json.Marshal(req)
if err != nil {
return nil, err
}

data, err := sc.rest.FuncPostQuery(ctx, sc.rest, &url.Values{}, headers,
jsonBody, sc.rest.RequestTimeout, requestID, sc.cfg)
if err != nil {
return data, err
}
code := -1
if data.Code != "" {
code, err = strconv.Atoi(data.Code)
if err != nil {
return data, err
}
}

Databend prefers to implement the related feature in the kernel instead of SDKs.

Unresolved questions

None.

Future possibilities

Extend support for location

We can extend support to a location like COPY:

PRESIGN 's3://bucket/books.csv'

Multipart support

We can generate multipart related actions to allow upload a single 10TB file:

PRESIGN UPLOAD_PART 's3://bucket/books.csv.zst'
Explore Databend Cloud for FREE
Low-cost
Fast Analytics
Easy Data Ingestion
Elastic Scaling
Try it today