External Functions in Databend Cloud
External functions in Databend allow you to define custom operations for processing data using external servers written in programming languages like Python. These functions enable you to extend Databend's capabilities by integrating custom logic, leveraging external libraries, and handling complex processing tasks. Key features of external functions include:
- Scalability: Ideal for complex and resource-intensive data operations.
- External Libraries: Leverage additional functionality through external libraries and dependencies.
- Advanced Logic: Implement sophisticated data processing logic for complex scenarios.
Supported Programming Languages
The following table lists the supported languages and the required libraries for creating external functions in Databend:
Language | Required Library |
---|---|
Python | databend-udf |
Managing External Functions
You can manage external functions using SQL commands such as CREATE FUNCTION
, DROP FUNCTION
, and SHOW FUNCTIONS
. For more details, see External Function.
Configuring External Functions in Databend Cloud
To use external functions in Databend Cloud, you need to allowlist the addresses of your external function servers. The external function server must be accessible via a domain name over HTTPS. Please contact Databend Cloud support to add your allowed UDF server addresses:
- Navigate to Support > Create New Ticket in the Databend Cloud console.
- Provide the external server addresses (with HTTPS domain names) you wish to allowlist.
- Submit the ticket and await confirmation from the support team.
Usage Example: Creating an External Function in Python
This section demonstrates how to create an external function using Python.
1. Install the Required Library
Install the databend-udf library using pip
:
pip install databend-udf
2. Define Your Function
Create a Python file (e.g., external_function.py
) and define your external function. The following example defines an external server in Python that exposes a custom function gcd
for calculating the greatest common divisor of two integers:
from databend_udf import udf, UDFServer
@udf(
input_types=["INT", "INT"],
result_type="INT",
skip_null=True,
)
def gcd(x: int, y: int) -> int:
while y != 0:
x, y = y, x % y
return x
if __name__ == '__main__':
# Create an external server listening at '0.0.0.0:8815'
server = UDFServer("0.0.0.0:8815")
# Add the defined function
server.add_function(gcd)
# Start the external server
server.serve()
Explanation of @udf
Decorator Parameters:
Parameter | Description |
---|---|
input_types | A list of strings specifying the input data types (e.g., ["INT", "VARCHAR"] ). |
result_type | A string specifying the return value type (e.g., "INT" ). |
name | (Optional) Custom name for the function. If not provided, the original function name is used. |
io_threads | Number of I/O threads used per data chunk for I/O-bound functions. |
skip_null | If set to True , NULL values are not passed to the function, and the corresponding return value is set to NULL. Default is False . |
Data Type Mappings Between Databend and Python:
Databend Type | Python Type |
---|---|
BOOLEAN | bool |
TINYINT (UNSIGNED) | int |
SMALLINT (UNSIGNED) | int |
INT (UNSIGNED) | int |
BIGINT (UNSIGNED) | int |
FLOAT | float |
DOUBLE | float |
DECIMAL | decimal.Decimal |
DATE | datetime.date |
TIMESTAMP | datetime.datetime |
VARCHAR | str |
VARIANT | any |
MAP(K,V) | dict |
ARRAY(T) | list[T] |
TUPLE(T,...) | tuple(T,...) |
3. Run the External Server
Run the Python file to start the external server:
python3 external_function.py
Note: Ensure that the server is accessible from Databend Cloud and that the address is allowlisted. If not already done, contact Databend Cloud support to add the server address to the allowlist.
4. Register the Function in Databend Cloud
Register the function gcd
in Databend using the CREATE FUNCTION
statement:
CREATE FUNCTION gcd (INT, INT)
RETURNS INT
LANGUAGE PYTHON
HANDLER = 'gcd'
ADDRESS = '<your-allowed-server-address>';
- Replace
<your-allowed-server-address>
with the actual address of your external server that has been allowlisted in Databend Cloud (must be an HTTPS domain). - The
HANDLER
specifies the name of the function as defined in your Python code. - The
ADDRESS
should match the address where your external server is running and must be allowlisted by Databend Cloud.
Example:
CREATE FUNCTION gcd (INT, INT)
RETURNS INT
LANGUAGE PYTHON
HANDLER = 'gcd'
ADDRESS = 'https://your-server-address';
Important: Before executing this statement, ensure that 'https://your-server-address'
is allowlisted in Databend Cloud by contacting support.
You can now use the external function gcd
in your SQL queries:
SELECT gcd(48, 18); -- Returns 6
Conclusion
External functions in Databend Cloud provide a powerful way to extend the functionality of your data processing pipelines by integrating custom code written in languages like Python. By following the steps outlined above, you can create and use external functions to handle complex processing tasks, leverage external libraries, and implement advanced logic.