Databend Community
Databend is an open-source, elastic, and workload-aware cloud data warehouse built in Rust, offering a cost-effective alternative to Snowflake. It's designed for complex analysis of the world's largest datasets.
- Performance
- Data Manipulation
- Object Storage
- Blazing-fast data analytics on object storage.
- Leverages data-level parallelism and instruction-level parallelism technologies for optimal performance.
- No indexes to build, no manual tuning, and no need to figure out partitions or shard data.
- Supports atomic operations such as `SELECT`, `INSERT`, `DELETE`, `UPDATE`, `REPLACE`, `COPY`, and `MERGE`.
- Provides advanced features such as Time Travel and Multi Catalog (Apache Hive / Apache Iceberg).
- Supports ingestion of semi-structured data in various formats like CSV, JSON, and Parquet.
- Supports semi-structured data types such as ARRAY, MAP, and JSON.
- Supports Git-like MVCC storage for easy querying, cloning, and restoration of historical data.
- Supports various object storage platforms; see the Databend documentation for the full list of supported platforms.
- Allows instant elasticity, enabling users to scale up or down based on their application needs.
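The Time Travel and Git-like MVCC features listed above can be pictured with a small copy-on-write model. The following is a minimal sketch, not Databend's implementation: the `VersionedTable` class and its `commit`, `read`, and `clone` methods are hypothetical names, and it assumes every write produces a new immutable snapshot while older snapshots remain readable.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedTable:
    """Toy table that keeps every committed snapshot (hypothetical, not Databend's API)."""

    # Each snapshot is an immutable tuple of rows; its list index is the snapshot id.
    snapshots: list = field(default_factory=lambda: [()])

    def commit(self, rows):
        """Append rows by writing a new snapshot; older snapshots stay untouched."""
        self.snapshots.append(self.snapshots[-1] + tuple(rows))
        return len(self.snapshots) - 1  # id of the new snapshot

    def read(self, snapshot_id=None):
        """Read the latest snapshot, or an older one for a time-travel query."""
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return list(self.snapshots[snapshot_id])

    def clone(self, snapshot_id):
        """Create a new table whose history ends at the given snapshot."""
        return VersionedTable(snapshots=self.snapshots[: snapshot_id + 1])


table = VersionedTable()
v1 = table.commit([("alice", 1)])
v2 = table.commit([("bob", 2)])

# Query historical data, then restore it into an independent clone.
rows_at_v1 = table.read(v1)
restored = table.clone(v1)
```

Because snapshots are never modified in place, reading an old version and cloning from it need no locking in this model; a write to `restored` leaves `table` unchanged.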
Databend's high-level architecture is composed of a meta-service layer, a query layer, and a storage layer.
- Meta-Service Layer
- Query Layer
- Storage Layer
Databend efficiently supports multiple tenants through its meta-service layer, which plays a crucial role in the system:
- Metadata Management: Handles metadata for databases, tables, clusters, transactions, and more.
- Security: Manages user authentication and authorization for a secure environment.
Discover more about the meta-service layer in the meta directory on GitHub.
The query layer in Databend handles query computations and is composed of multiple clusters, each containing several nodes. Each node, a core unit in the query layer, consists of:
- Planner: Develops execution plans for SQL statements using elements from relational algebra, incorporating operators like Projection, Filter, and Limit.
- Optimizer: A rule-based optimizer applies predefined rules, such as "predicate pushdown" and "pruning of unused columns", for optimal query execution.
- Processors: Construct a query execution pipeline based on planner instructions, following a Pull&Push approach. Processors are interconnected, forming a pipeline that can be distributed across nodes for enhanced performance.
Discover more about the query layer in the query directory on GitHub.
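To make the planner and optimizer roles more concrete, here is a minimal sketch of a plan tree built from Projection, Filter, and Limit operators, together with one "predicate pushdown" rule. It is an illustration under stated assumptions, not Databend's planner: the `Scan` operator, the `pushed_filters` field, and the `push_down_predicates` function are hypothetical names, and predicates are kept as plain strings for brevity.

```python
from dataclasses import dataclass, replace


@dataclass
class Scan:
    table: str
    # Predicates the scan evaluates itself while reading data (hypothetical field).
    pushed_filters: tuple = ()


@dataclass
class Filter:
    predicate: str
    child: object


@dataclass
class Projection:
    columns: tuple
    child: object


@dataclass
class Limit:
    count: int
    child: object


def push_down_predicates(plan):
    """Rewrite rule: merge a Filter that sits directly on a Scan into that Scan."""
    if isinstance(plan, Scan):
        return plan
    # Rewrite the subtree first so stacked filters collapse from the bottom up.
    child = push_down_predicates(plan.child)
    if isinstance(plan, Filter) and isinstance(child, Scan):
        return Scan(child.table, child.pushed_filters + (plan.predicate,))
    return replace(plan, child=child)


# Roughly: SELECT name FROM users WHERE age > 30 LIMIT 10
plan = Limit(10, Projection(("name",), Filter("age > 30", Scan("users"))))
optimized = push_down_predicates(plan)
```

After the rewrite, the `Filter` node is gone and the scan carries the predicate, so rows that fail it never enter the rest of the plan. A real optimizer would apply many such rules and would also check that moving a predicate preserves the query's meaning.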
Databend employs Parquet, an open-source columnar format, and introduces its own table format to boost query performance. Key features include:
- Secondary Indexes: Speeds up data location and access across various analysis dimensions.
- Complex Data Type Indexes: Aimed at accelerating data processing and analysis for intricate types such as semi-structured data.
- Segments: Databend effectively organizes data into segments, enhancing data management and retrieval efficiency.
- Clustering: Employs user-defined clustering keys within segments to streamline data scanning.
Discover more about the storage layer in the storage directory on GitHub.
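One way segments and clustering keys can streamline scanning is range-based pruning. The sketch below assumes that each segment records the minimum and maximum clustering-key value it contains; that assumption, the `Segment` class, and the `segments_to_scan` function are illustrative and are not taken from Databend's table format.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    min_key: int  # smallest clustering-key value stored in the segment (assumed metadata)
    max_key: int  # largest clustering-key value stored in the segment (assumed metadata)


def segments_to_scan(segments, low, high):
    """Keep only segments whose [min_key, max_key] range overlaps [low, high]."""
    return [s for s in segments if s.max_key >= low and s.min_key <= high]


segments = [
    Segment("seg-0", 1, 100),
    Segment("seg-1", 101, 200),
    Segment("seg-2", 201, 300),
]

# A query filtering on the clustering key between 150 and 220 can skip seg-0.
selected = segments_to_scan(segments, 150, 220)
```

The better the data is clustered on the key, the less the segment ranges overlap, and the more segments a range filter can skip without reading them.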