Databend Metrics

Metrics are crucial for monitoring the performance and health of the system. Databend collects and stores two types of metrics, Meta Metrics and Query Metrics, in Prometheus format. Meta Metrics are used for real-time monitoring and debugging of the Metasrv component, while Query Metrics are used for monitoring the performance of the Databend-query component.

You can access the metrics through a web browser using the following URLs:

  • Meta Metrics: http://<admin_api_address>/v1/metrics. Defaults to 0.0.0.0:28101/v1/metrics.
  • Query Metrics: http://<metric_api_address>/metrics. Defaults to 0.0.0.0:7070/metrics.
Tip: You can also visualize the metrics using third-party tools. For information about supported tools and integration tutorials, refer to Monitor > Using 3rd-party Tools. When using the Prometheus & Grafana solution, you can create dashboards from the dashboard templates Databend provides. For more details, check out the Prometheus & Grafana guide.
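Both endpoints return plain text in the Prometheus exposition format, so they can also be read from a script. The following is a minimal sketch rather than an official client: the function names are hypothetical, the default URL assumes a query node reachable on 127.0.0.1 with the default port, and the parser only handles the simple `name{labels} value` sample lines.

```python
import re
from urllib.request import urlopen

# One sample line: metric name, optional {labels}, then the value.
# A trailing timestamp, if present, is ignored.
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition text into a list of (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


def fetch_metrics(url="http://127.0.0.1:7070/metrics", timeout=5.0):
    """Download and parse one scrape. The URL is an assumed local default."""
    with urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

Calling `fetch_metrics("http://<admin_api_address>/v1/metrics")` with a real address would read the Meta Metrics in the same way.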

Meta Metrics

Here's a list of Meta metrics captured by Databend.

Server

These metrics describe the status of the metasrv. All these metrics are prefixed with metasrv_server_.

| Name | Description | Type |
|---|---|---|
| current_leader_id | Current leader ID of the cluster; 0 means no leader. | Gauge |
| is_leader | Whether or not this node is the current leader. | Gauge |
| node_is_health | Whether or not this node is healthy. | Gauge |
| leader_changes | Number of leader changes seen. | Counter |
| applying_snapshot | Whether or not the state machine is applying a snapshot. | Gauge |
| proposals_applied | Total number of consensus proposals applied. | Gauge |
| last_log_index | Index of the last log entry. | Gauge |
| current_term | Current term. | Gauge |
| proposals_pending | Total number of pending proposals. | Gauge |
| proposals_failed | Total number of failed proposals. | Counter |
| watchers | Total number of active watchers. | Gauge |

current_leader_id indicates the current leader ID of the cluster; 0 means there is no leader. A cluster without a leader is unavailable.

is_leader indicates whether this metasrv is currently the leader of the cluster, and leader_changes shows the total number of leader changes since startup. Frequent leader changes degrade metasrv performance and signal that the cluster is unstable.

node_is_health is 1 if and only if the node state is Follower or Leader; otherwise it is 0.

proposals_applied records the total number of applied write requests.

last_log_index records the index of the last entry appended to this Raft node's log, and current_term records the current term of the Raft node.

proposals_pending indicates how many proposals are currently queued to commit. A rising number of pending proposals suggests high client load or that the member cannot commit proposals.

proposals_failed shows the total number of failed write requests. Failures are normally related to one of two issues: temporary failures during a leader election, or longer downtime caused by a loss of quorum in the cluster.

watchers shows the current number of active watchers.
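The rules of thumb above can be turned into a simple check over two consecutive scrapes of one node. This is only an illustrative sketch: the function name, the dictionary layout (metric names without the metasrv_server_ prefix), and the warning texts are assumptions, and it flags any increase between scrapes without applying thresholds.

```python
def check_meta_health(current, previous=None):
    """Return warnings derived from metasrv_server_* values.

    `current` and `previous` map unprefixed metric names to the values
    read in two consecutive scrapes of the same node.
    """
    warnings = []
    # A leader id of 0 means the cluster has no leader.
    if current.get("current_leader_id", 0) == 0:
        warnings.append("cluster has no leader and is unavailable")
    # node_is_health is 1 only for Follower or Leader nodes.
    if current.get("node_is_health", 0) != 1:
        warnings.append("node is neither Follower nor Leader")
    if previous is not None:
        if current.get("leader_changes", 0) > previous.get("leader_changes", 0):
            warnings.append("leader changed since the previous scrape")
        if current.get("proposals_failed", 0) > previous.get("proposals_failed", 0):
            warnings.append("write proposals failed since the previous scrape")
        if current.get("proposals_pending", 0) > previous.get("proposals_pending", 0):
            warnings.append("pending proposals are rising")
    return warnings
```

In practice these conditions would more likely be expressed as alerting rules in a monitoring tool; the function only makes the interpretation of each metric explicit.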

Raft Network

These metrics describe the network status of raft nodes in the metasrv. All these metrics are prefixed with metasrv_raft_network_.

| Name | Description | Labels | Type |
|---|---|---|---|
| active_peers | Current number of active connections to peers. | id (node id), address (peer address) | Gauge |
| fail_connect_to_peer | Total number of failed connections to peers. | id (node id), address (peer address) | Counter |
| sent_bytes | Total number of bytes sent to peers. | to (node id) | Counter |
| recv_bytes | Total number of bytes received from peers. | from (remote address) | Counter |
| sent_failures | Total number of send failures to peers. | to (node id) | Counter |
| snapshot_send_success | Total number of successful snapshot sends. | to (node id) | Counter |
| snapshot_send_failures | Total number of snapshot send failures. | to (node id) | Counter |
| snapshot_send_inflights | Total number of inflight snapshot sends. | to (node id) | Gauge |
| snapshot_sent_seconds | Total latency distributions of snapshot sends. | to (node id) | Histogram |
| snapshot_recv_success | Total number of successful snapshot receives. | from (remote address) | Counter |
| snapshot_recv_failures | Total number of snapshot receive failures. | from (remote address) | Counter |
| snapshot_recv_inflights | Total number of inflight snapshot receives. | from (remote address) | Gauge |
| snapshot_recv_seconds | Total latency distributions of snapshot receives. | from (remote address) | Histogram |

active_peers indicates how many connections between cluster members are active, and fail_connect_to_peer indicates the number of failed connections to peers. Each has the labels id (node id) and address (peer address).

sent_bytes and recv_bytes record the bytes sent to and received from peers, and sent_failures records the number of failed sends to peers.

snapshot_send_success and snapshot_send_failures indicate the number of successful and failed snapshot sends. snapshot_send_inflights indicates the number of inflight snapshot sends: it is incremented by one each time a snapshot send starts and decremented by one when the send is done.

snapshot_sent_seconds indicates the total latency distribution of snapshot sends.

snapshot_recv_success and snapshot_recv_failures indicate the number of successful and failed snapshot receives. snapshot_recv_inflights indicates the number of inflight snapshot receives: it is incremented by one each time a snapshot receive starts and decremented by one when the receive is done.

snapshot_recv_seconds indicates the total latency distribution of snapshot receives.
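The relationship between the inflight gauges and the success and failure counters can be modelled in a few lines. This is a toy model for illustration only, not Databend's implementation; the class and method names are hypothetical.

```python
from contextlib import contextmanager


class SnapshotSendMetrics:
    """Toy model of snapshot_send_success, _failures and _inflights."""

    def __init__(self):
        self.success = 0    # counter: only ever increases
        self.failures = 0   # counter: only ever increases
        self.inflights = 0  # gauge: rises and falls

    @contextmanager
    def track(self):
        # Starting a snapshot send increments the gauge.
        self.inflights += 1
        try:
            yield
        except Exception:
            self.failures += 1
            raise
        else:
            self.success += 1
        finally:
            # Finishing the send, successfully or not, decrements it again.
            self.inflights -= 1
```

An inflight gauge that stays above zero for a long time therefore points at snapshot transfers that are slow or stuck, while the counters only show how many transfers have finished.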

Raft Storage

These metrics describe the storage status of raft nodes in the metasrv. All these metrics are prefixed with metasrv_raft_storage_.

| Name | Description | Labels | Type |
|---|---|---|---|
| raft_store_write_failed | Total number of raft store write failures. | func (function name) | Counter |
| raft_store_read_failed | Total number of raft store read failures. | func (function name) | Counter |

raft_store_write_failed and raft_store_read_failed indicate the total number of raft store write and read failures.

Meta Network

These metrics describe the network status of meta service in the metasrv. All these metrics are prefixed with metasrv_meta_network_.

| Name | Description | Type |
|---|---|---|
| sent_bytes | Total number of bytes sent to meta gRPC clients. | Counter |
| recv_bytes | Total number of bytes received from meta gRPC clients. | Counter |
| inflights | Total number of inflight meta gRPC requests. | Gauge |
| req_success | Total number of successful requests from meta gRPC clients. | Counter |
| req_failed | Total number of failed requests from meta gRPC clients. | Counter |
| rpc_delay_seconds | Latency distribution of the meta-service API in seconds. | Histogram |
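Histogram metrics such as rpc_delay_seconds are exposed in Prometheus format as cumulative `_bucket` series with an `le` upper bound, plus `_sum` and `_count` series. A latency quantile can be estimated from the buckets by linear interpolation, in the same spirit as the `histogram_quantile` function in PromQL. The sketch below is a simplified version under that assumption, and the bucket values in the usage note are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][1] == 0:
        return math.nan  # no observations recorded
    rank = q * buckets[-1][1]
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies beyond the last finite bucket.
                return lower_bound
            if count == lower_count:
                return upper_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

For example, with the invented buckets `[(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]`, the estimated 95th percentile is 0.75 seconds. The estimate is only as precise as the bucket boundaries allow.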

Query Metrics

Here's a list of Query metrics captured by Databend.

| Name | Type | Description | Labels |
|---|---|---|---|
| databend_cache_access_count | Counter | Number of cache accesses. | cache_name |
| databend_cache_hit_count | Counter | Counts the number of cache hits for different cache types. | cache_name |
| databend_cache_miss_count | Counter | Number of cache misses. | cache_name |
| databend_cache_miss_load_millisecond | Histogram | Distribution of cache miss load times. | cache_name |
| databend_cluster_discovered_node | Gauge | Reports information about discovered nodes exposed externally. | local_id, cluster_id, tenant_id, flight_address |
| databend_compact_hook_compaction_ms | Histogram | Histogram of the time spent on compaction operations. | operation |
| databend_compact_hook_execution_ms | Histogram | Distribution of execution time for compact hook operations. | operation: MergeInto, Insert |
| databend_fuse_block_index_read_bytes | Counter | Number of bytes read for block index. | |
| databend_fuse_block_index_write_bytes_total | Counter | Total number of bytes written for index blocks. | |
| databend_fuse_block_index_write_milliseconds | Histogram | Distribution of the time taken to write index blocks. | |
| databend_fuse_block_index_write_nums_total | Counter | Total number of index blocks written. | |
| databend_fuse_block_write_bytes | Counter | Total number of bytes written. | |
| databend_fuse_block_write_millioseconds | Histogram | Distribution of time taken to write blocks. | |
| databend_fuse_block_write_nums | Counter | Total number of blocks written. | |
| databend_fuse_blocks_bloom_pruning_after | Counter | Number of blocks after executing block-level bloom pruning. | |
| databend_fuse_blocks_bloom_pruning_before | Counter | Number of blocks before executing block-level bloom pruning. | |
| databend_fuse_blocks_range_pruning_after | Counter | Number of blocks after executing block-level range pruning. | |
| databend_fuse_blocks_range_pruning_before | Counter | Number of blocks before executing block-level range pruning. | |
| databend_fuse_bytes_block_bloom_pruning_after | Counter | Data size in bytes after executing block-level bloom pruning. | |
| databend_fuse_bytes_block_bloom_pruning_before | Counter | Data size in bytes before executing block-level bloom pruning. | |
| databend_fuse_bytes_segment_range_pruning_after | Counter | Data size in bytes after executing segment-level range pruning. | |
| databend_fuse_bytes_segment_range_pruning_before | Counter | Data size in bytes before executing segment-level range pruning. | |
| databend_fuse_commit_aborts | Counter | Number of times commit aborted due to errors. | |
| databend_fuse_commit_copied_files | Counter | Total number of files copied during commit operations. | |
| databend_fuse_commit_milliseconds | Counter | Total time taken for commit mutations. | |
| databend_fuse_commit_mutation_modified_segment_exists_in_latest | Counter | Counts the existence of modified segments in the latest commit mutation. | |
| databend_fuse_commit_mutation_success | Counter | Number of successful mutations committed. | |
| databend_fuse_commit_mutation_unresolvable_conflict | Counter | Number of times unresolvable commit conflicts occurred. | |
| databend_fuse_compact_block_build_lazy_part_milliseconds | Histogram | Distribution of the time spent building the lazy part during compaction. | |
| databend_fuse_compact_block_build_task_milliseconds | Histogram | Distribution of the time spent building the compact block. | |
| databend_fuse_compact_block_read_bytes | Counter | Cumulative size of blocks read during compaction, in bytes. | |
| databend_fuse_compact_block_read_milliseconds | Histogram | Histogram of time spent reading blocks during compaction. | |
| databend_fuse_compact_block_read_nums | Counter | Counts the number of blocks read during compaction. | |
| databend_fuse_pruning_milliseconds | Histogram | Time spent on pruning segments. | |
| databend_fuse_remote_io_deserialize_milliseconds | Histogram | Time spent on decompressing and deserializing raw data into DataBlocks. | |
| databend_fuse_remote_io_read_bytes | Counter | Cumulative number of bytes read from object storage. | |
| databend_fuse_remote_io_read_bytes_after_merged | Counter | Cumulative number of bytes read from object storage after merging. | |
| databend_fuse_remote_io_read_milliseconds | Histogram | Histogram of time spent reading from S3. | |
| databend_fuse_remote_io_read_parts | Counter | Cumulative count of partitioned table data blocks read from object storage. | |
| databend_fuse_remote_io_seeks | Counter | Cumulative count of independent IO operations during reads from object storage. | |
| databend_fuse_remote_io_seeks_after_merged | Counter | Cumulative count of IO merges during reads from object storage. | |
| databend_fuse_segments_range_pruning_after | Counter | Number of segments after executing segment-level range pruning. | |
| databend_fuse_segments_range_pruning_before | Counter | Number of segments before executing segment-level range pruning. | |
| databend_merge_into_accumulate_milliseconds | Histogram | Overall time distribution for merge operations. | |
| databend_merge_into_append_blocks_counter | Counter | Total number of blocks written in merge into. | |
| databend_merge_into_append_blocks_rows_counter | Counter | Total number of rows written in merge into. | |
| databend_merge_into_apply_milliseconds | Histogram | Time distribution for merge into operations. | |
| databend_merge_into_matched_operation_milliseconds | Histogram | Time distribution for matched operations in merge operations. | |
| databend_merge_into_matched_rows | Counter | Total number of matched rows in merge operations. | |
| databend_merge_into_not_matched_operation_milliseconds | Histogram | Time distribution for 'not matched' part of merge into operations. | |
| databend_merge_into_replace_blocks_counter | Counter | Number of replacement blocks generated by merge operations. | |
| databend_merge_into_replace_blocks_rows_counter | Counter | Number of rows replaced by merge operations. | |
| databend_merge_into_split_milliseconds | Histogram | Time taken for splitting merge operations. | |
| databend_merge_into_unmatched_rows | Counter | Total number of rows unmatched in merge into. | |
| databend_meta_grpc_client_request_duration_ms | Histogram | Distribution of request durations for different types of requests (Upsert, Txn, StreamList, StreamMGet, GetClientInfo) made to the meta leader. | endpoint, request |
| databend_meta_grpc_client_request_inflight | Gauge | Current number of queries connecting to the meta. | |
| databend_meta_grpc_client_request_success | Counter | Number of successful requests to the meta. | endpoint, request |
| databend_opendal_bytes | Counter | Total number of bytes read and written by the OpenDAL endpoint. | scheme (the scheme used for the operation, e.g., "s3"), op (the type of operation, e.g., "read" or "write") |
| databend_opendal_bytes_histogram | Histogram | Distribution of response times and counts by operation. | scheme (the scheme used for the operation, e.g., "s3"), op (the type of operation, e.g., "write") |
| databend_opendal_errors | Counter | Number of errors and their types encountered in OpenDAL operations. | scheme (the scheme used for the operation, e.g., "s3"), op (the type of operation, e.g., "read"), err (the type of error encountered, e.g., "NotFound") |
| databend_opendal_request_duration_seconds | Histogram | Duration of OpenDAL requests to object storage. | scheme (the scheme used for the operation, e.g., "s3"), op (the type of operation, e.g., "read") |
| databend_opendal_requests | Counter | Number of various types of requests made using OpenDAL. | scheme (the scheme used for the request, e.g., "s3"), op (the operation type, e.g., "batch", "list", "presign", "read", "write", "delete", "stat") |
| databend_process_cpu_seconds_total | Counter | Total CPU time (seconds) used by users and system. | |
| databend_process_max_fds | Gauge | Maximum number of open file descriptors. | |
| databend_process_open_fds | Gauge | Number of open file descriptors. | |
| databend_process_resident_memory_bytes | Gauge | Resident memory size in bytes. | |
| databend_process_start_time_seconds | Gauge | Start time of the process since Unix epoch in seconds. | |
| databend_process_threads | Gauge | Number of OS threads in use. | |
| databend_process_virtual_memory_bytes | Gauge | Virtual memory size in bytes. | |
| databend_query_duration_ms | Histogram | Tracks the distribution of execution times for different types of queries initiated by various handlers. | handler, kind, tenant, cluster |
| databend_query_error | Counter | Total number of query errors. | handler, kind, tenant, cluster (e.g., handler="HTTPQuery", kind="Other", tenant="wubx", cluster="w189") |
| databend_query_failed | Counter | Total number of failed requests. | |
| databend_query_http_requests_count | Counter | Number of HTTP requests, categorized by method, API endpoint, and status code. | method, api, status |
| databend_query_http_response_duration_seconds | Histogram | Query response time distribution, categorized by HTTP method and API endpoint. | method, api, le, sum, count |
| databend_query_http_response_errors_count | Counter | Counts and types of request errors. | code, err |
| databend_query_result_bytes | Counter | Total number of bytes in the data returned by each query. | handler, kind, tenant, cluster |
| databend_query_result_rows | Counter | Total number of data rows returned by each query. | handler, kind, tenant, cluster |
| databend_query_scan_bytes | Counter | Total size of data scanned by queries in bytes. | handler, kind, tenant, cluster |
| databend_query_scan_io_bytes | Counter | Total size of data scanned and transferred during queries, in bytes. | handler, kind, tenant, cluster |
| databend_query_scan_io_bytes_cost_ms | Histogram | Distribution of IO scan time during queries. | handler, kind, tenant, cluster |
| databend_query_scan_partitions | Counter | Total number of partitions (blocks) scanned by queries. | handler, kind, tenant, cluster |
| databend_query_scan_rows | Counter | Total number of data rows scanned by queries. | handler, kind, tenant, cluster |
| databend_query_start | Counter | Tracks the number of query executions initiated by different handlers. It categorizes queries into various kinds such as SELECT, UPDATE, INSERT, and others. | handler, kind, tenant, cluster |
| databend_query_success | Counter | Number of successful queries by type. | handler, kind, tenant, cluster |
| databend_query_total_partitions | Counter | Total number of partitions (blocks) involved in the query. | handler, kind, tenant, cluster |
| databend_query_write_bytes | Counter | Cumulative number of bytes written by queries. | handler, kind, tenant, cluster |
| databend_query_write_io_bytes | Counter | Total size of data written and transmitted by queries. | handler, kind, tenant, cluster |
| databend_query_write_io_bytes_cost_ms | Histogram | Time cost of writing IO bytes for queries. | handler, kind, tenant, cluster |
| databend_query_write_rows | Counter | Cumulative number of rows written by queries. | handler, kind, tenant, cluster |
| databend_session_close_numbers | Counter | Number of session closures. | |
| databend_session_connect_numbers | Counter | Records the cumulative total number of connections made to the nodes since the system started. | |
| databend_session_connections | Gauge | Measures the current number of active connections to the nodes. | |
| databend_session_queue_acquire_duration_ms | Histogram | Distribution of waiting queue acquisition time. | |
| databend_session_queued_queries | Gauge | Number of SQL queries currently in the query queue. | |
| databend_session_running_acquired_queries | Gauge | Current number of acquired queries in the running session. | |
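
Several of the counters above are most useful in pairs. As a hedged sketch, the helpers below derive a cache hit ratio and a pruning ratio from already-parsed samples. The helper names, the `(name, labels, value)` tuple layout, and the sample values in the usage note are all assumptions made for illustration.

```python
def counter_total(samples, name, **labels):
    """Sum the values of all samples with this name and matching labels."""
    return sum(
        value
        for sample_name, sample_labels, value in samples
        if sample_name == name
        and all(sample_labels.get(k) == v for k, v in labels.items())
    )


def cache_hit_ratio(samples, cache_name):
    """Hits divided by accesses for one cache, or None without accesses."""
    hits = counter_total(samples, "databend_cache_hit_count", cache_name=cache_name)
    accesses = counter_total(samples, "databend_cache_access_count", cache_name=cache_name)
    return hits / accesses if accesses else None


def pruned_fraction(samples, stem):
    """Share of items removed by pruning, from a *_before / *_after pair."""
    before = counter_total(samples, stem + "_before")
    after = counter_total(samples, stem + "_after")
    return (before - after) / before if before else None
```

For instance, `pruned_fraction(samples, "databend_fuse_blocks_range_pruning")` returns 0.75 when 1000 blocks were seen before range pruning and 250 remained afterwards. Because counters accumulate from process start, these ratios describe the whole lifetime of the process unless the difference between two scrapes is used instead.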