HISTOGRAM
Introduced or updated: v1.2.377
Generates a data distribution histogram using an "equal height" bucketing strategy.
Syntax
HISTOGRAM(<expr>)
-- The following two forms are equivalent:
HISTOGRAM(<max_num_buckets>)(<expr>)
HISTOGRAM(<expr> [, <max_num_buckets>])
Parameter | Description |
---|---|
expr | The data type of expr should be sortable. |
max_num_buckets | Optional positive integer specifying the maximum number of buckets. Default is 128. |
Return Type
Returns either an empty string or a JSON object with the following structure:
- buckets: List of buckets with detailed information:
- lower: Lower bound of the bucket.
- upper: Upper bound of the bucket.
- count: Number of elements in the bucket.
- pre_sum: Cumulative count of elements up to the current bucket.
- ndv: Number of distinct values in the bucket.
Examples
This example shows how the HISTOGRAM function analyzes the distribution of c_int
values in the histagg
table, returning bucket boundaries, distinct value counts, element counts, and cumulative counts:
CREATE TABLE histagg (
c_id INT,
c_tinyint TINYINT,
c_smallint SMALLINT,
c_int INT
);
INSERT INTO histagg VALUES
(1, 10, 20, 30),
(1, 11, 21, 33),
(1, 11, 12, 13),
(2, 21, 22, 23),
(2, 31, 32, 33),
(2, 10, 20, 30);
SELECT HISTOGRAM(c_int) FROM histagg;
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ histogram(c_int) │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ [{"lower":"13","upper":"13","ndv":1,"count":1,"pre_sum":0},{"lower":"23","upper":"23","ndv":1,"count":1,"pre_sum":1},{"lower":"30","upper":"30","ndv":1,"count":2,"pre_sum":2},{"lower":"33","upper":"33","ndv":1,"count":2,"pre_sum":4}] │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
The result is returned as a JSON array:
[
{
"lower": "13",
"upper": "13",
"ndv": 1,
"count": 1,
"pre_sum": 0
},
{
"lower": "23",
"upper": "23",
"ndv": 1,
"count": 1,
"pre_sum": 1
},
{
"lower": "30",
"upper": "30",
"ndv": 1,
"count": 2,
"pre_sum": 2
},
{
"lower": "33",
"upper": "33",
"ndv": 1,
"count": 2,
"pre_sum": 4
}
]
This example shows how HISTOGRAM(2)
groups c_int values into two buckets:
SELECT HISTOGRAM(2)(c_int) FROM histagg;
-- Or
SELECT HISTOGRAM(c_int, 2) FROM histagg;
┌────── ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ histogram(2)(c_int) │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ [{"lower":"13","upper":"30","ndv":3,"count":4,"pre_sum":0},{"lower":"33","upper":"33","ndv":1,"count":2,"pre_sum":4}] │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
The result is returned as a JSON array:
[
{
"lower": "13",
"upper": "30",
"ndv": 3,
"count": 4,
"pre_sum": 0
},
{
"lower": "33",
"upper": "33",
"ndv": 1,
"count": 2,
"pre_sum": 4
}
]