Databend Metrics
Metrics are crucial for monitoring the performance and health of the system. Databend collects and stores two types of metrics, Meta Metrics and Query Metrics, in Prometheus format. Meta Metrics are used for real-time monitoring and debugging of the Metasrv component, while Query Metrics are used for monitoring the performance of the Databend-query component.
You can access the metrics through a web browser using the following URLs:
- Meta Metrics: http://<admin_api_address>/v1/metrics. Defaults to 0.0.0.0:28101/v1/metrics.
- Query Metrics: http://<metric_api_address>/metrics. Defaults to 0.0.0.0:7070/metrics.
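The endpoints can also be read from a script. The following is a minimal sketch, assuming the endpoints serve the standard Prometheus text format: it fetches a URL with Python's standard library and parses each sample line into a (name, labels, value) tuple. The helper names fetch_metrics and parse_metrics are hypothetical, the sample text is made up, and the parser is simplified (it assumes label values contain no escaped quotes).

```python
import re
import urllib.request

# Hypothetical helpers for illustration; they are not part of Databend.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def fetch_metrics(url, timeout=5.0):
    """Return the raw text served by a metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text):
    """Parse Prometheus text lines into (name, labels, value) tuples.

    Simplified: comment lines are skipped, and label values are assumed
    to contain no escaped quotes.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Made-up sample text, used here instead of a live server.
sample_text = """
# TYPE metasrv_server_current_leader_id gauge
metasrv_server_current_leader_id 1
metasrv_raft_network_sent_bytes{to="2"} 4096
"""
parsed = parse_metrics(sample_text)
print(parsed)
```

Against a running node, the same parser could be fed with fetch_metrics("http://<admin_api_address>/v1/metrics"), substituting the address at which the node is actually reachable.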
Alternatively, you can visualize the metrics using third-party tools. For information about supported tools and integration tutorials, refer to Monitor > Using 3rd-party Tools. When employing the Prometheus & Grafana solution, you can create dashboards using our provided dashboard templates, available here. For more details, check out the Prometheus & Grafana guide.
Meta Metrics
Here's a list of Meta metrics captured by Databend.
Server
These metrics describe the status of the metasrv. All these metrics are prefixed with metasrv_server_.
Name | Description | Type |
---|---|---|
current_leader_id | Current leader ID of the cluster; 0 means no leader. | Gauge |
is_leader | Whether or not this node is the current leader. | Gauge |
node_is_health | Whether or not this node is healthy. | Gauge |
leader_changes | Number of leader changes seen. | Counter |
applying_snapshot | Whether or not the state machine is applying a snapshot. | Gauge |
proposals_applied | Total number of consensus proposals applied. | Gauge |
last_log_index | Index of the last log entry. | Gauge |
current_term | Current term. | Gauge |
proposals_pending | Total number of pending proposals. | Gauge |
proposals_failed | Total number of failed proposals. | Counter |
watchers | Total number of active watchers. | Gauge |
current_leader_id indicates the ID of the cluster's current leader; 0 means there is no leader. A cluster without a leader is unavailable.
is_leader indicates whether this metasrv is currently the leader of the cluster, and leader_changes shows the total number of leader changes since start. Frequent leader changes degrade the performance of the metasrv and signal that the cluster is unstable.
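To put a number on how frequently the leader changes, one option is to compare two readings of leader_changes taken some time apart. The sketch below is only an illustration: the function name, the sample readings, and the threshold of 3 changes per hour are assumptions, not Databend recommendations.

```python
def leader_changes_per_hour(prev_count, curr_count, elapsed_seconds):
    """Estimate leader changes per hour from two counter readings.

    A counter starts again from zero when the process restarts, so a
    drop is treated as a reset and the current reading is the delta.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta = curr_count - prev_count if curr_count >= prev_count else curr_count
    return delta * 3600.0 / elapsed_seconds


# Hypothetical readings taken 30 minutes apart.
rate = leader_changes_per_hour(prev_count=4, curr_count=7, elapsed_seconds=1800)
ASSUMED_THRESHOLD_PER_HOUR = 3.0  # illustrative value; tune for your cluster
print(rate, rate > ASSUMED_THRESHOLD_PER_HOUR)
```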
node_is_health is 1 if and only if the node state is Follower or Leader; otherwise it is 0.
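The rules above for current_leader_id and node_is_health can be turned into a simple check. This is a minimal sketch that assumes the metric values have already been collected into a dict keyed by full metric name; the function name describe_node and the message strings are hypothetical.

```python
def describe_node(metrics):
    """Return a list of problems derived from metasrv_server_* values."""
    problems = []
    # A leader ID of 0 means the cluster has no leader and is unavailable.
    if metrics.get("metasrv_server_current_leader_id", 0) == 0:
        problems.append("cluster has no leader")
    # node_is_health is 1 only when the node is a Follower or a Leader.
    if metrics.get("metasrv_server_node_is_health", 0) != 1:
        problems.append("node is neither Follower nor Leader")
    return problems


# Hypothetical readings from a single node.
healthy = {"metasrv_server_current_leader_id": 2, "metasrv_server_node_is_health": 1}
print(describe_node(healthy))
```

Missing keys are treated as unhealthy here, which is a deliberately conservative choice for the illustration.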
proposals_applied records the total number of applied write requests.
last_log_index records the index of the last entry appended to this Raft node's log, and current_term records the current term of the Raft node.
proposals_pending indicates how many proposals are currently queued to commit. A rising number of pending proposals suggests that client load is high or that the member cannot commit proposals.
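One way to spot a rising proposals_pending value is to look at the last few scraped readings. The sketch below is an assumption about how such a check might look, not a built-in Databend feature: is_rising and the window size of three readings are hypothetical.

```python
def is_rising(samples, min_points=3):
    """Return True if the last min_points readings strictly increase."""
    if len(samples) < min_points:
        return False
    window = samples[-min_points:]
    return all(later > earlier for earlier, later in zip(window, window[1:]))


# Hypothetical proposals_pending readings from consecutive scrapes.
readings = [0, 1, 2, 5]
print(is_rising(readings))
```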
proposals_failed shows the total number of failed write requests. Failures are normally related to one of two issues: temporary failures during a leader election, or longer downtime caused by a loss of quorum in the cluster.
watchers shows the total number of currently active watchers.
Raft Network
These metrics describe the network status of raft nodes in the metasrv. All these metrics are prefixed with metasrv_raft_network_.
Name | Description | Labels | Type |
---|---|---|---|
active_peers | Current number of active connections to peers. | id(node id), address(peer address) | Gauge |
fail_connect_to_peer | Total number of failed connections to peers. | id(node id), address(peer address) | Counter |
sent_bytes | Total number of sent bytes to peers. | to(node id) | Counter |
recv_bytes | Total number of received bytes from peers. | from(remote address) | Counter |
sent_failures | Total number of send failures to peers. | to(node id) | Counter |
snapshot_send_success | Total number of successful snapshot sends. | to(node id) | Counter |
snapshot_send_failures | Total number of snapshot send failures. | to(node id) | Counter |
snapshot_send_inflights | Total number of inflight snapshot sends. | to(node id) | Gauge |
snapshot_sent_seconds | Latency distribution of snapshot sends. | to(node id) | Histogram |
snapshot_recv_success | Total number of successful snapshot receives. | from(remote address) | Counter |
snapshot_recv_failures | Total number of snapshot receive failures. | from(remote address) | Counter |
snapshot_recv_inflights | Total number of inflight snapshot receives. | from(remote address) | Gauge |
snapshot_recv_seconds | Latency distribution of snapshot receives. | from(remote address) | Histogram |
active_peers indicates the number of active connections between cluster members, and fail_connect_to_peer indicates the number of failed connections to peers. Each has the labels id (node id) and address (peer address).
sent_bytes and recv_bytes record the bytes sent to and received from peers, and sent_failures records the number of failed sends to peers.
snapshot_send_success and snapshot_send_failures indicate the number of successful and failed snapshot sends. snapshot_send_inflights indicates the number of in-flight snapshot sends: it is incremented by one each time a snapshot send starts and decremented by one when the send is done.
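The increment-on-start, decrement-on-done behavior of an in-flight gauge can be illustrated with a small context manager. This is a sketch of the general pattern only; the InflightGauge class is hypothetical and is not how the metasrv implements the metric. It also assumes that a failed send counts as done.

```python
from contextlib import contextmanager


class InflightGauge:
    """Minimal stand-in for a gauge; not Databend's implementation."""

    def __init__(self):
        self.value = 0

    @contextmanager
    def track(self):
        self.value += 1  # an operation starts
        try:
            yield
        finally:
            self.value -= 1  # the operation is done, successfully or not


gauge = InflightGauge()
with gauge.track():
    during = gauge.value  # value while one send is in flight
after = gauge.value  # value once the send is done
print(during, after)
```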
snapshot_sent_seconds indicates the latency distribution of snapshot sends.
snapshot_recv_success and snapshot_recv_failures indicate the number of successful and failed snapshot receives. snapshot_recv_inflights indicates the number of in-flight snapshot receives: it is incremented by one each time a snapshot receive starts and decremented by one when the receive is done.
snapshot_recv_seconds indicates the latency distribution of snapshot receives.
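Histogram metrics such as snapshot_sent_seconds and snapshot_recv_seconds are exposed in Prometheus format as cumulative buckets, so a latency quantile has to be estimated from bucket counts. The sketch below uses linear interpolation within a bucket, similar in spirit to Prometheus's histogram_quantile function. The function name and the bucket bounds and counts are made up for illustration.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    The pairs must be sorted by upper bound and end with math.inf,
    mirroring the le labels of a Prometheus histogram.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # cannot interpolate into the +Inf bucket
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative buckets, in seconds.
example_buckets = [(1.0, 2), (5.0, 8), (10.0, 10), (math.inf, 10)]
median = estimate_quantile(0.5, example_buckets)
print(median)
```

The estimate is only as fine as the bucket boundaries, so it should be read as an approximation rather than an exact latency.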
Raft Storage
These metrics describe the storage status of raft nodes in the metasrv. All these metrics are prefixed with metasrv_raft_storage_.
Name | Description | Labels | Type |
---|---|---|---|
raft_store_write_failed | Total number of raft store write failures. | func(function name) | Counter |
raft_store_read_failed | Total number of raft store read failures. | func(function name) | Counter |
raft_store_write_failed and raft_store_read_failed indicate the total number of raft store write and read failures.
Meta Network
These metrics describe the network status of the meta service in the metasrv. All these metrics are prefixed with metasrv_meta_network_.
Name | Description | Type |
---|---|---|
sent_bytes | Total number of sent bytes to meta grpc client. | Counter |
recv_bytes | Total number of received bytes from meta grpc client. | Counter |
inflights | Total number of inflight meta grpc requests. | Gauge |
req_success | Total number of successful requests from meta grpc client. | Counter |
req_failed | Total number of failed requests from meta grpc client. | Counter |
rpc_delay_seconds | Latency distribution of meta-service API in seconds. | Histogram |
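Because req_success and req_failed are counters, a failure ratio is most meaningful over a window between two scrapes. The following sketch assumes each scrape has been reduced to a dict holding the two counter values; the function name and the numbers are hypothetical, and counter resets are not handled.

```python
def request_failure_ratio(before, after):
    """Failure ratio of meta gRPC requests between two scrapes."""
    ok = after["req_success"] - before["req_success"]
    failed = after["req_failed"] - before["req_failed"]
    total = ok + failed
    return failed / total if total > 0 else 0.0


# Hypothetical counter values from two consecutive scrapes.
first = {"req_success": 100, "req_failed": 5}
second = {"req_success": 190, "req_failed": 15}
ratio = request_failure_ratio(first, second)
print(ratio)
```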
Query Metrics
Here's a list of Query metrics captured by Databend.
Name | Type | Description | Labels |
---|---|---|---|
databend_cache_access_count | Counter | Number of cache accesses. | cache_name |
databend_cache_hit_count | Counter | Counts the number of cache hits for different cache types. | cache_name |
databend_cache_miss_count | Counter | Number of cache misses. | cache_name |
databend_cache_miss_load_millisecond | Histogram | Distribution of cache miss load times. | cache_name |
databend_cluster_discovered_node | Gauge | Reports information about discovered nodes exposed externally. | local_id, cluster_id, tenant_id, flight_address |
databend_compact_hook_compaction_ms | Histogram | Histogram of the time spent on compaction operations. | operation |
databend_compact_hook_execution_ms | Histogram | Distribution of execution time for compact hook operations. | operation: MergeInto, Insert |
databend_fuse_block_index_read_bytes | Counter | Number of bytes read for block index. | |
databend_fuse_block_index_write_bytes_total | Counter | Total number of bytes written for index blocks. | |
databend_fuse_block_index_write_milliseconds | Histogram | Distribution of the time taken to write index blocks. | |
databend_fuse_block_index_write_nums_total | Counter | Total number of index blocks written. | |
databend_fuse_block_write_bytes | Counter | Total number of bytes written. | |
databend_fuse_block_write_millioseconds | Histogram | Distribution of time taken to write blocks. | |
databend_fuse_block_write_nums | Counter | Total number of blocks written. | |
databend_fuse_blocks_bloom_pruning_after | Counter | Number of blocks after executing block-level bloom pruning. | |
databend_fuse_blocks_bloom_pruning_before | Counter | Number of blocks before executing block-level bloom pruning. | |
databend_fuse_blocks_range_pruning_after | Counter | Number of blocks after executing block-level range pruning. | |
databend_fuse_blocks_range_pruning_before | Counter | Number of blocks before executing block-level range pruning. | |
databend_fuse_bytes_block_bloom_pruning_after | Counter | Data size in bytes after executing block-level bloom pruning. | |
databend_fuse_bytes_block_bloom_pruning_before | Counter | Data size in bytes before executing block-level bloom pruning. | |
databend_fuse_bytes_segment_range_pruning_after | Counter | Data size in bytes after executing segment-level range pruning. | |
databend_fuse_bytes_segment_range_pruning_before | Counter | Data size in bytes before executing segment-level range pruning. | |
databend_fuse_commit_aborts | Counter | Number of times commit aborted due to errors. | |
databend_fuse_commit_copied_files | Counter | Total number of files copied during commit operations. | |
databend_fuse_commit_milliseconds | Counter | Total time taken for commit mutations. | |
databend_fuse_commit_mutation_modified_segment_exists_in_latest | Counter | Counts the existence of modified segments in the latest commit mutation. | |
databend_fuse_commit_mutation_success | Counter | Number of successful mutations committed. | |
databend_fuse_commit_mutation_unresolvable_conflict | Counter | Number of times unresolvable commit conflicts occurred. | |
databend_fuse_compact_block_build_lazy_part_milliseconds | Histogram | Distribution of the time spent building the lazy part during compaction. | |
databend_fuse_compact_block_build_task_milliseconds | Histogram | Distribution of the time spent building the compact block. | |
databend_fuse_compact_block_read_bytes | Counter | Cumulative size of blocks read during compaction, in bytes. | |
databend_fuse_compact_block_read_milliseconds | Histogram | Histogram of time spent reading blocks during compaction. | |
databend_fuse_compact_block_read_nums | Counter | Counts the number of blocks read during compaction. | |
databend_fuse_pruning_milliseconds | Histogram | Time spent on pruning segments. | |
databend_fuse_remote_io_deserialize_milliseconds | Histogram | Time spent on decompressing and deserializing raw data into DataBlocks. | |
databend_fuse_remote_io_read_bytes | Counter | Cumulative number of bytes read from object storage. | |
databend_fuse_remote_io_read_bytes_after_merged | Counter | Cumulative number of bytes read from object storage after merging. | |
databend_fuse_remote_io_read_milliseconds | Histogram | Histogram of time spent reading from S3. | |
databend_fuse_remote_io_read_parts | Counter | Cumulative count of partitioned table data blocks read from object storage. | |
databend_fuse_remote_io_seeks | Counter | Cumulative count of independent IO operations during reads from object storage. | |
databend_fuse_remote_io_seeks_after_merged | Counter | Cumulative count of IO merges during reads from object storage. | |
databend_fuse_segments_range_pruning_after | Counter | Number of segments after executing segment-level range pruning. | |
databend_fuse_segments_range_pruning_before | Counter | Number of segments before executing segment-level range pruning. | |
databend_merge_into_accumulate_milliseconds | Histogram | Overall time distribution for merge operations. | |
databend_merge_into_append_blocks_counter | Counter | Total number of blocks written in merge into. | |
databend_merge_into_append_blocks_rows_counter | Counter | Total number of rows written in merge into. | |
databend_merge_into_apply_milliseconds | Histogram | Time distribution for merge into operations. | |
databend_merge_into_matched_operation_milliseconds | Histogram | Time distribution for matched operations in merge operations. | |
databend_merge_into_matched_rows | Counter | Total number of matched rows in merge operations. | |
databend_merge_into_not_matched_operation_milliseconds | Histogram | Time distribution for 'not matched' part of merge into operations. | |
databend_merge_into_replace_blocks_counter | Counter | Number of replacement blocks generated by merge operations. | |
databend_merge_into_replace_blocks_rows_counter | Counter | Number of rows replaced by merge operations. | |
databend_merge_into_split_milliseconds | Histogram | Time taken for splitting merge operations. | |
databend_merge_into_unmatched_rows | Counter | Total number of rows unmatched in merge into. | |
databend_meta_grpc_client_request_duration_ms | Histogram | Distribution of request durations for different types of requests (Upsert, Txn, StreamList, StreamMGet, GetClientInfo) made to the meta leader. | endpoint, request |
databend_meta_grpc_client_request_inflight | Gauge | Current number of queries connecting to the meta. | |
databend_meta_grpc_client_request_success | Counter | Number of successful requests to the meta. | endpoint, request |
databend_opendal_bytes | Counter | Total number of bytes read and written by the OpenDAL endpoint. | scheme (the scheme used for the operation, e.g., "s3"), op (the type of operation, e.g., "read" or "write") |
databend_opendal_bytes_histogram | Histogram | Distribution of response times and counts by operation. | scheme (the scheme used for the operation, e.g., "s3"), op (the type of operation, e.g., "write") |
databend_opendal_errors | Counter | Number of errors and their types encountered in OpenDAL operations. | scheme (the scheme used for the operation, e.g., "s3"), op (the type of operation, e.g., "read"), err (the type of error encountered, e.g., "NotFound") |
databend_opendal_request_duration_seconds | Histogram | Duration of OpenDAL requests to object storage. | scheme (the scheme used for the operation, e.g., "s3"), op (the type of operation, e.g., "read") |
databend_opendal_requests | Counter | Number of various types of requests made using OpenDAL. | scheme (the scheme used for the request, e.g., "s3"), op (the operation type, e.g., "batch", "list", "presign", "read", "write", "delete", "stat") |
databend_process_cpu_seconds_total | Counter | Total CPU time (seconds) used by users and system. | |
databend_process_max_fds | Gauge | Maximum number of open file descriptors. | |
databend_process_open_fds | Gauge | Number of open file descriptors. | |
databend_process_resident_memory_bytes | Gauge | Resident memory size in bytes. | |
databend_process_start_time_seconds | Gauge | Start time of the process since Unix epoch in seconds. | |
databend_process_threads | Gauge | Number of OS threads in use. | |
databend_process_virtual_memory_bytes | Gauge | Virtual memory size in bytes. | |
databend_query_duration_ms | Histogram | Tracks the distribution of execution times for different types of queries initiated by various handlers. | handler, kind, tenant, cluster |
databend_query_error | Counter | Total number of query errors. | handler="HTTPQuery", kind="Other", tenant="wubx", cluster="w189" |
databend_query_failed | Counter | Total number of failed requests. | |
databend_query_http_requests_count | Counter | Number of HTTP requests, categorized by method, API endpoint, and status code. | method, api, status |
databend_query_http_response_duration_seconds | Histogram | Query response time distribution, categorized by HTTP method and API endpoint. | method, api, le, sum, count |
databend_query_http_response_errors_count | Counter | Counts and types of request errors. | code, err |
databend_query_result_bytes | Counter | Total number of bytes in the data returned by each query. | handler, kind, tenant, cluster |
databend_query_result_rows | Counter | Total number of data rows returned by each query. | handler, kind, tenant, cluster |
databend_query_scan_bytes | Counter | Total size of data scanned by queries in bytes. | handler, kind, tenant, cluster |
databend_query_scan_io_bytes | Counter | Total size of data scanned and transferred during queries, in bytes. | handler, kind, tenant, cluster |
databend_query_scan_io_bytes_cost_ms | Histogram | Distribution of IO scan time during queries. | handler, kind, tenant, cluster |
databend_query_scan_partitions | Counter | Total number of partitions (blocks) scanned by queries. | handler, kind, tenant, cluster |
databend_query_scan_rows | Counter | Total number of data rows scanned by queries. | handler, kind, tenant, cluster |
databend_query_start | Counter | Tracks the number of query executions initiated by different handlers. It categorizes queries into various kinds such as SELECT, UPDATE, INSERT, and others. | handler, kind, tenant, cluster |
databend_query_success | Counter | Number of successful queries by type. | handler, kind, tenant, cluster |
databend_query_total_partitions | Counter | Total number of partitions (blocks) involved in the query. | handler, kind, tenant, cluster |
databend_query_write_bytes | Counter | Cumulative number of bytes written by queries. | handler, kind, tenant, cluster |
databend_query_write_io_bytes | Counter | Total size of data written and transmitted by queries. | handler, kind, tenant, cluster |
databend_query_write_io_bytes_cost_ms | Histogram | Time cost of writing IO bytes for queries. | handler, kind, tenant, cluster |
databend_query_write_rows | Counter | Cumulative number of rows written by queries. | handler, kind, tenant, cluster |
databend_session_close_numbers | Counter | Number of session closures. | |
databend_session_connect_numbers | Counter | Records the cumulative total number of connections made to the nodes since the system started. | |
databend_session_connections | Gauge | Measures the current number of active connections to the nodes. | |
databend_session_queue_acquire_duration_ms | Histogram | Distribution of waiting queue acquisition time. | |
databend_session_queued_queries | Gauge | Number of SQL queries currently in the query queue. | |
databend_session_running_acquired_queries | Gauge | Current number of acquired queries in the running session. | |
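Several of these counters are most useful in combination. As one example, the sketch below derives a hit ratio per cache from databend_cache_hit_count and databend_cache_access_count, grouped by the cache_name label. It assumes the samples are available as (name, labels, value) tuples; the function name and the cache_name values are made up for illustration.

```python
from collections import defaultdict


def cache_hit_ratios(samples):
    """Compute hits / accesses per cache_name from parsed samples."""
    hits = defaultdict(float)
    accesses = defaultdict(float)
    for name, labels, value in samples:
        cache = labels.get("cache_name", "")
        if name == "databend_cache_hit_count":
            hits[cache] += value
        elif name == "databend_cache_access_count":
            accesses[cache] += value
    # Caches with no recorded accesses are skipped to avoid dividing by zero.
    return {cache: hits[cache] / total for cache, total in accesses.items() if total > 0}


# Hypothetical samples; "cache_a" and "cache_b" are placeholder label values.
samples = [
    ("databend_cache_access_count", {"cache_name": "cache_a"}, 40.0),
    ("databend_cache_hit_count", {"cache_name": "cache_a"}, 30.0),
    ("databend_cache_access_count", {"cache_name": "cache_b"}, 0.0),
]
ratios = cache_hit_ratios(samples)
print(ratios)
```

The same grouping approach could be applied to the pruning counters, for example by comparing the before and after values of a pruning stage.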