分析函数 · k8s原理与实践

[TOC] # 分析函数 ## 简介 **分析函数**（某些数据库下也叫做**窗口函数**）与聚合函数类似，计算总是基于一组行的集合，不同的是，聚合函数一组只能返回一行，而分析函数每组可以返回多行，组内每一行都是基于窗口的逻辑计算的结果。分析函数可以显著优化需要 self-join 的查询。 ### 分析函数语法 “窗口”也称为 FRAME，OceanBase 数据库同时支持 ROWS 与 RANGE 两种 FRAME 语义，前者是基于物理行偏移的窗口，后者则是基于逻辑值偏移的窗口。分析函数语法如下： ~~~ analytic_function: analytic_function([ arguments ]) OVER (analytic_clause) analytic_clause: [ query_partition_clause ] [ order_by_clause [ windowing_clause ] ] query_partition_clause: PARTITION BY { expr[, expr ]... | ( expr[, expr ]... ) } order_by_clause: ORDER [ SIBLINGS ] BY{ expr | position | c_alias } [ ASC | DESC ] [ NULLS FIRST | NULLS LAST ] [, { expr | position | c_alias } [ ASC | DESC ][ NULLS FIRST | NULLS LAST ]]... windowing_clause: { ROWS | RANGE } { BETWEEN { UNBOUNDED PRECEDING | CURRENT ROW | value_expr { PRECEDING | FOLLOWING } } AND{ UNBOUNDED FOLLOWING | CURRENT ROW | value_expr { PRECEDING | FOLLOWING } } | { UNBOUNDED PRECEDING | CURRENT ROW| value_expr PRECEDING}} ~~~ ## SUM/MIN/MAX/COUNT/AVG **声明** SUM 的语法为：`SUM([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]` MIN 的语法为：`MIN([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]` MAX 的语法为：`MAX([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]` COUNT 的语法为：`COUNT({ * | [ DISTINCT | ALL ] expr }) [ OVER (analytic_clause) ]` AVG 的语法为：`AVG([ DISTINCT | ALL ] expr) [ OVER(analytic_clause) ]` **说明** 以上分析函数都有对应的聚合函数，其中，`SUM`返回`expr`的和，`MIN`/`MAX`返回`expr`的最小值/最大值，`COUNT`返回窗口中查询的行数，`AVG`返回`expr`的平均值。对于`COUNT`函数，如果指定了`expr`，即返回`expr`不为 NULL 的统计个数，如果指定`COUNT(*)`返回所有行的统计数目。 **例子** ~~~ obclient> create table exployees(last_name char(10), salary decimal, job_id char(32)); Query OK, 0 rows affected (0.17 sec) obclient> insert into exployees values('jim', 2000, 'cleaner'); Query OK, 1 row affected (0.03 sec) obclient> insert into exployees values('mike', 12000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> insert into exployees values('lily', 13000, 'engineering'); Query OK, 1 row affected (0.01 sec) obclient> insert into exployees values('tom', 11000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient>select last_name, sum(salary) over(partition by job_id) totol_s, min(salary) over(partition by job_id) min_s, max(salary) over(partition by job_id) max_s, count(*) over(partition by job_id) count_s from exployees; +-----------+---------+-------+-------+---------+ | last_name | totol_s | min_s | max_s | count_s | +-----------+---------+-------+-------+---------+ | jim | 2000 | 2000 | 2000 | 1 | | mike | 36000 | 11000 | 13000 | 3 | | lily | 36000 | 11000 | 13000 | 3 | | tom | 36000 | 11000 | 13000 | 3 | +-----------+---------+-------+-------+---------+ 4 rows in set (0.01 sec) ~~~ ## NTH\_VALUE/FIRST\_VALUE/LAST\_VALUE **声明** NTH\_VALUE 的语法为：`NTH_VALUE (measure_expr, n) [ FROM { FIRST | LAST } ] [ { RESPECT | IGNORE } NULLS ] OVER (analytic_clause)` FIRST\_VALUE 的语法为：`FIRST_VALUE { (expr) [ {RESPECT | IGNORE} NULLS ] | (expr [ {RESPECT | IGNORE} NULLS ])} OVER (analytic_clause)` LAST\_VALUE 的语法为：`LAST_VALUE { (expr) [ {RESPECT | IGNORE} NULLS ] | (expr [ {RESPECT | IGNORE} NULLS ])} OVER (analytic_clause)` **说明** NTH\_VALUE 函数表示第几个值，方向由`[ FROM { FIRST | LAST } ]`确定，默认为`FROM FIRST`，含有是否忽略 NULL 值的标志。其窗口为统一的`analytic_clause`。这里`n`应该是正数，如果`n`是 NULL，函数将返回错误；如果`n`大于窗口内所有的行数，此函数将返回 NULL。 FIRST\_VALUE 和 LAST\_VALUE 表示从第一个开始计数或者是从最后一个开始计数。 **例子** ~~~ obclient> create table exployees(last_name char(10), salary decimal, job_id char(32)); Query OK, 0 rows affected (0.08 sec) obclient> insert into exployees values('jim', 2000, 'cleaner'); Query OK, 1 row affected (0.11 sec) obclient> insert into exployees values('mike', 12000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> insert into exployees values('lily', 13000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> insert into exployees values('tom', 11000, 'engineering'); Query OK, 1 row affected (0.01 sec) obclient> select last_name, first_value(salary) over(partition by job_id) totol_s, last_value(salary) over(partition by job_id) min_s, max(salary) over(partition by job_id) max_s from exployees; +-----------+---------+-------+-------+ | last_name | totol_s | min_s | max_s | +-----------+---------+-------+-------+ | jim | 2000 | 2000 | 2000 | | mike | 12000 | 11000 | 13000 | | lily | 12000 | 11000 | 13000 | | tom | 12000 | 11000 | 13000 | +-----------+---------+-------+-------+ 4 rows in set (0.01 sec) ~~~ ## LEAD/LAG **声明** LEAD 的语法为：`LEAD { ( value_expr [, offset [, default]]) [ { RESPECT | IGNORE } NULLS ] | ( value_expr [ { RESPECT | IGNORE } NULLS ] [, offset [, default]] )} OVER ([ query_partition_clause ] order_by_clause)` LAG 的语法为：`LAG { ( value_expr [, offset [, default]]) [ { RESPECT | IGNORE } NULLS ] | ( value_expr [ { RESPECT | IGNORE } NULLS ] [, offset [, default]] )} OVER ([ query_partition_clause ] order_by_clause)` **说明** LEAD 和 LAG 含义为可以在一次查询中取出当前行的同一个字段的前面或后面第 N 行的数据，这种操作可以使用相同表的自连接来实现，但 LEAD/LAG 窗口函数有更高的效率。其中，`value_expr`是要做比对的字段，`offset`是`value_expr`的偏移量，`default`参数的默认值为 NULL，即如果在 LEAD/LAG 没有显示的设置`default`值的情况下，返回值为 NULL。例如：对 LAG 来说，当前行为 4，`offset`值为 6，这时候所要找的数据就是第 -2 行，不存在此行即返回`default`的值。 `[ { RESPECT | IGNORE } NULLS ]`的语法为是否考虑 NULL 值，默认为`RESPECT`，考虑 NULL 值。注意 LEAD/LAG 两个函数后必须有`order_by_clause`，数据应该在一个列上排序之后才能有前多少行后多少行的概念。`query_partition_clause`是可选的，如果没有`query_partition_clause`，就是全局的数据。 **例子** ~~~ obclient> create table exployees(last_name char(10), salary decimal, job_id char(32)); Query OK, 0 rows affected (0.08 sec) obclient> insert into exployees values('jim', 2000, 'cleaner'); Query OK, 1 row affected (0.11 sec) obclient> insert into exployees values('mike', 12000, 'engineering'); Query OK, 1 row affected (0.01 sec) obclient> insert into exployees values('lily', 13000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> insert into exployees values('tom', 11000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> select last_name, lead(salary) over(order by salary) lead, lag(salary) over(order by salary) lag from exployees; +-----------+-------+-------+ | last_name | lead | lag | +-----------+-------+-------+ | jim | 11000 | NULL | | tom | 12000 | 2000 | | mike | 13000 | 11000 | | lily | NULL | 12000 | +-----------+-------+-------+ 4 rows in set (0.01 sec) ~~~ ## STDDEV/VARIANCE/STDDEV\_SAMP/STDDEV\_POP **声明** VARIANCE 的语法为：`VARIANCE([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]` STDDEV 的语法为：`STDDEV([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]` STDDEV\_SAMP 的语法为：`STDDEV_SAMP(expr) [ OVER (analytic_clause) ]` STDDEV\_POP 的语法为：`STDDEV_POP(expr) [ OVER (analytic_clause) ]` **说明** VARIANCE 返回的是`expr`的方差，`expr`可能是数值类型或者可以转换成数值类型的类型，方差的类型和输入的值的类型相同。 STDDEV 返回的是`expr`的标准差，参数类型方面和 VARIANCE 的相同。 STDDEV\_SAMP 返回的是样本标准差。 STDDEV\_POP 返回的是总体标准差。 **例子** ~~~ obclient> create table exployees(last_name char(10), salary decimal, job_id char(32)); Query OK, 0 rows affected (0.08 sec) obclient> insert into exployees values('jim', 2000, 'cleaner'); Query OK, 1 row affected (0.11 sec) obclient> insert into exployees values('mike', 12000, 'engineering'); Query OK, 1 row affected (0.01 sec) obclient> insert into exployees values('lily', 13000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> insert into exployees values('tom', 11000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> select last_name, stddev(salary) over(order by salary) std, variance(salary) over(order by salary) var, stddev_pop(salary) over() std_pop, stddev_samp(salary) over() from exployees; +-----------+-------------------+--------------------+-------------------+----------------------------+ | last_name | std | var | std_pop | stddev_samp(salary) over() | +-----------+-------------------+--------------------+-------------------+----------------------------+ | jim | 0 | 0 | 4387.482193696061 | 5066.228051190222 | | tom | 4500 | 20250000 | 4387.482193696061 | 5066.228051190222 | | mike | 4496.912521077347 | 20222222.222222224 | 4387.482193696061 | 5066.228051190222 | | lily | 4387.482193696061 | 19250000 | 4387.482193696061 | 5066.228051190222 | +-----------+-------------------+--------------------+-------------------+----------------------------+ 4 rows in set (0.00 sec) ~~~ ## NTILE **声明** `NTILE(expr) OVER ([ query_partition_clause ] order_by_clause)` **说明** NTILE 函数将分区中已经排序的行划分为大小尽可能相同的指定数量的分组，并返回给每行组号。`expr`如果是 NULL，则返回 NULL。例子 ~~~ obclient> create table exployees(last_name char(10), salary decimal, job_id char(32)); Query OK, 0 rows affected (0.08 sec) obclient> insert into exployees values('jim', 2000, 'cleaner'); Query OK, 1 row affected (0.11 sec) obclient> insert into exployees values('mike', 12000, 'engineering'); Query OK, 1 row affected (0.01 sec) obclient> insert into exployees values('lily', 13000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> insert into exployees values('tom', 11000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> select last_name, ntile(10) over(partition by job_id order by salary) ntl from exployees; +-----------+------+ | last_name | ntl | +-----------+------+ | jim | 1 | | tom | 1 | | mike | 2 | | lily | 3 | +-----------+------+ 4 rows in set (0.01 sec) ~~~ ## ROW\_NUMBER 声明 `ROW_NUMBER( ) OVER ([ query_partition_clause ] order_by_clause)` 说明 ROW\_NUMBER 函数按照`order_by_clause`子句中指定的行的顺序，为每一行分配一个编号。例子 ~~~ obclient> create table exployees(last_name char(10), salary decimal, job_id char(32)); Query OK, 0 rows affected (0.08 sec) obclient> insert into exployees values('jim', 2000, 'cleaner'); Query OK, 1 row affected (0.11 sec) obclient> insert into exployees values('mike', 12000, 'engineering'); Query OK, 1 row affected (0.01 sec) obclient> insert into exployees values('lily', 13000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> insert into exployees values('tom', 11000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> select last_name, row_number() over(partition by job_id order by salary) ntl from exployees; +-----------+------+ | last_name | ntl | +-----------+------+ | jim | 1 | | tom | 1 | | mike | 2 | | lily | 3 | +-----------+------+ 4 rows in set (0.00 sec) ~~~ ## RANK/DENSE\_RANK/PERCENT\_RANK **声****明** RANK 的语法为：`RANK( ) OVER ([ query_partition_clause ] order_by_clause)` DENSE\_RANK 的语法为：`DENSE_RANK( ) OVER([ query_partition_clause ] order_by_clause)` PERCENT\_RANK 的语法为：`PERCENT_RANK( ) OVER ([ query_partition_clause ] order_by_clause)` **说明** RANK 计算每一行数据在某列上的排序，该列由`order_by_clause`中的列决定。例如，按照 salary 排序可以看出员工的收入排名。 DENSE\_RANK 的语义基本和 RANK 函数相同，但是 RANK 的排序中间会有‘跳过’，但是 DENSE\_RANK 中不会有。 PERCENT\_RANK 的语义基本和 RANK 函数相同，但是 PERCENT\_RANK 排序的结果是百分比，计算的是给定行的百分比。 **例子** ~~~ obclient> create table exployees(last_name char(10), salary decimal, job_id char(32)); Query OK, 0 rows affected (0.10 sec) obclient> insert into exployees values('jim', 2000, 'cleaner'); Query OK, 1 row affected (0.11 sec) obclient> insert into exployees values('mike', 12000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> insert into exployees values('lily', 13000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> insert into exployees values('tom', 11000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> select last_name, rank() over(partition by job_id order by salary) rank, dense_rank() over(partition by job_id order by salary) dense_rank, percent_rank() over(partition by job_id order by salary) percent_rank from exployees; +-----------+------+------------+----------------------------------+ | last_name | rank | dense_rank | percent_rank | +-----------+------+------------+----------------------------------+ | jim | 1 | 1 | 0.000000000000000000000000000000 | | tom | 1 | 1 | 0.000000000000000000000000000000 | | mike | 2 | 2 | 0.500000000000000000000000000000 | | lily | 3 | 3 | 1.000000000000000000000000000000 | +-----------+------+------------+----------------------------------+ 4 rows in set (0.01 sec) ~~~ ## CUME\_DIST **声明** `CUME_DIST() OVER ([ query_partition_clause ] order_by_clause)` **说明** 该函数计算一个值的分布，返回值为大于 0 小于等于 1 的值。作为一个分析函数，CUME\_DIST 在升序情况下计算比当前行的特定列小的数据的占比。例如如下例子中，按 job\_id 分组并在薪水排序的情况下，每行数据在窗口内的排序列上的占比。 **例子** ~~~ obclient> create table exployees(last_name char(10), salary decimal, job_id char(32)); Query OK, 0 rows affected (0.10 sec) obclient> insert into exployees values('jim', 2000, 'cleaner'); Query OK, 1 row affected (0.11 sec) obclient> insert into exployees values('mike', 12000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> insert into exployees values('lily', 13000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> insert into exployees values('tom', 11000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> select last_name, cume_dist() over(partition by job_id order by salary) cume_dist from exployees; +-----------+----------------------------------+ | last_name | cume_dist | +-----------+----------------------------------+ | jim | 1.000000000000000000000000000000 | | tom | 0.333333333333333333333333333333 | | mike | 0.666666666666666666666666666667 | | lily | 1.000000000000000000000000000000 | +-----------+----------------------------------+ 4 rows in set (0.01 sec) ~~~