KingbaseES 使用百分比函数获取中位数

发布时间 2023-05-09 19:41:51作者: KINGBASE研究院

客户从Oracle数据库迁移至KingbaseES数据库,应用中使用MEDIAN函数来求中位数。KingbaseES数据库中没有MEDIAN函数,但可以通过百分比函数来实现相应的功能。

MEDIAN 函数
MEDIAN是一个假设连续分布模型的逆分布函数。它采用一个数字或日期时间值,并返回中间值或插值,该值在值排序后将成为中间值。在计算中会忽略null。
以下是摘自Oracle官网说明

语法:
MEDIAN(expr) [ OVER ( [ query_partition_clause ] ) ]

MEDIAN is an inverse distribution function that assumes a continuous distribution model. It takes a numeric or datetime value and returns the middle value or an interpolated value that would be the middle value once the values are sorted. Nulls are ignored in the calculation.

The result of MEDIAN is computed by first ordering the rows. Using N as the number of rows in the group, Oracle calculates the row number (RN) of interest with the formula RN = (1 + (0.5*(N-1)). The final result of the aggregate function is computed by linear interpolation between the values from rows at row numbers CRN = CEILING(RN) and FRN = FLOOR(RN).
计算公式:
if (CRN = FRN = RN) then
      (value of expression from row at RN)
   else
      (CRN - RN) * (value of expression for row at FRN) +
      (RN - FRN) * (value of expression for row at CRN)

KingbaseES中有2个用于计算百分比的函数 PERCENTILE_CONT 和 PERCENTILE_DISC。

PERCENTILE_CONT 函数

PERCENTILE_CONT 是一种假定连续分布模型的逆分布函数。该函数具有一个百分比值和一个排序规范,并返回一个在有关排序规范的给定百分比值范围内的内插值。
PERCENTILE_CONT 在对值进行排序后计算值之间的线性内插。通过在聚合组中使用百分比值 (P) 和非 null 行数 (N),该函数会在根据排序规范对行进行排序后计算行号。根据公式 (RN) 计算此行号 RN = (1+ (P*(N-1))。聚合函数的最终结果通过行号 CRN = CEILING(RN) 和 FRN = FLOOR(RN) 的行中的值之间的线性内插计算。
最终结果将如下所示:
如果 (CRN = FRN = RN),则结果为 (value of expression from row at RN)
否则结果为 (CRN - RN) * (value of expression for row at FRN) + (RN - FRN) * (value of expression for row at CRN)

语法:
PERCENTILE_CONT ( percentile ) WITHIN GROUP (ORDER BY expr) OVER (  [ PARTITION BY expr_list ]  )

参数说明
percentile:介于 0 和 1 之间的数字常数。计算中将忽略 Null。
WITHIN GROUP ( ORDER BY expr) :指定用于排序和计算百分比的数字或日期/时间值。
OVER :指定窗口分区。
PARTITION BY expr 设置 OVER 子句中每个组的记录范围的可选参数。

PERCENTILE_DISC 函数
PERCENTILE_DISC 是一种假定离散分布模型的逆分布函数。该函数具有一个百分比值和一个排序规范,并返回给定集合中的元素。
对于给定的百分比值 P,PERCENTILE_DISC 在 ORDER BY 子句中对表达式的值进行排序,并返回带有大于或等于 P 的最小累积分布值(相对于同一排序规范)的值。

语法
PERCENTILE_DISC ( percentile ) WITHIN GROUP (ORDER BY expr) OVER (  [ PARTITION BY expr_list ]  )
参数说明
percentile:介于 0 和 1 之间的数字常数。计算中将忽略 Null。
WITHIN GROUP ( ORDER BY expr) :指定用于排序和计算百分比的数字或日期/时间值。
OVER :指定窗口分区。
PARTITION BY expr 设置 OVER 子句中每个组的记录范围的可选参数。

注意:KingbaseES数据库在 V008R006C007B0024 版本开始支持 over PARTITION BY 的语法,之前的版本不支持

示例

在oracle和KingbaseES数据库创建表并插入数据

CREATE TABLE EMP (EMPNO NUMBER(4) NOT NULL,
                  ENAME VARCHAR2(10),
                  JOB VARCHAR2(9),
                  SAL NUMBER(7, 2),
                  DEPTNO NUMBER(2));

INSERT INTO EMP VALUES (1, 'SMITH', 'CLERK',     800,    20);
INSERT INTO EMP VALUES (2, 'ALLEN', 'SALESMAN', 1600,    30);
INSERT INTO EMP VALUES (3, 'WARD',  'SALESMAN', 1250,    30);
INSERT INTO EMP VALUES (4, 'JONES', 'MANAGER',  2975,    20);
INSERT INTO EMP VALUES (5, 'MARTIN','SALESMAN', 1250,    30);
INSERT INTO EMP VALUES (6, 'BLAKE', 'MANAGER',  2850,    30);
INSERT INTO EMP VALUES (7, 'CLARK', 'MANAGER',  2850,    10);
INSERT INTO EMP VALUES (8, 'SCOTT', 'ANALYST',  3000,    20);
INSERT INTO EMP VALUES (9, 'KING',  'PRESIDENT',3000,    10);
INSERT INTO EMP VALUES (10,'TURNER','SALESMAN', 1500,    30);
INSERT INTO EMP VALUES (11,'ADAMS', 'CLERK',    1500,    20);

oracle
TEST@orclpdb> SELECT MEDIAN(sal) FROM emp;

MEDIAN(SAL)
-----------
       1600

TEST@orclpdb> SELECT MEDIAN(sal) over (partition by deptno)  avg,deptno from emp;

       AVG     DEPTNO
---------- ----------
      2925         10
      2925         10
    2237.5         20
    2237.5         20
    2237.5         20
    2237.5         20
      1500         30
      1500         30
      1500         30
      1500         30
      1500         30

KingbaseES

test=# select percentile_cont(0.5) within group (order by sal) from emp;
 percentile_cont
-----------------
            1600
(1 行记录)

test=# select percentile_cont(0.5) within group (order by sal) over (partition by deptno) ,deptno from emp;
 percentile_cont | deptno
-----------------+--------
            2925 |     10
            2925 |     10
          2237.5 |     20
          2237.5 |     20
          2237.5 |     20
          2237.5 |     20
            1500 |     30
            1500 |     30
            1500 |     30
            1500 |     30
            1500 |     30
(11 行记录)

test=# select percentile_disc(0.5) within group (order by sal) from emp;
 percentile_disc
-----------------
         1600.00
(1 行记录)

test=# select percentile_disc(0.5) within group (order by sal) over (partition by deptno) ,deptno from emp;
 percentile_disc | deptno
-----------------+--------
         2850.00 |     10
         2850.00 |     10
         1500.00 |     20
         1500.00 |     20
         1500.00 |     20
         1500.00 |     20
         1500.00 |     30
         1500.00 |     30
         1500.00 |     30
         1500.00 |     30
         1500.00 |     30
(11 行记录)

删除处于中位数的数据再进行测试(id=2,sal=1600)
KingbaseES

test=# delete from emp where empno =2 ;
DELETE 1
test=# select percentile_cont(0.5) within group (order by sal) from emp;
 percentile_cont
-----------------
            2175
(1 行记录)

test=# select percentile_cont(0.5) within group (order by sal) over (partition by deptno) ,deptno from emp;
 percentile_cont | deptno
-----------------+--------
            2925 |     10
            2925 |     10
          2237.5 |     20
          2237.5 |     20
          2237.5 |     20
          2237.5 |     20
            1375 |     30
            1375 |     30
            1375 |     30
            1375 |     30
(10 行记录)

test=# select percentile_disc(0.5) within group (order by sal) from emp;
 percentile_disc
-----------------
         1500.00
(1 行记录)

test=# select percentile_disc(0.5) within group (order by sal) over (partition by deptno) ,deptno from emp;
 percentile_disc | deptno
-----------------+--------
         2850.00 |     10
         2850.00 |     10
         1500.00 |     20
         1500.00 |     20
         1500.00 |     20
         1500.00 |     20
         1250.00 |     30
         1250.00 |     30
         1250.00 |     30
         1250.00 |     30
(10 行记录)

oracle

TEST@orclpdb>  SELECT MEDIAN(sal) FROM emp;

MEDIAN(SAL)
-----------
       2175

TEST@orclpdb> SELECT MEDIAN(sal) over (partition by deptno)  avg,deptno from emp;

       AVG     DEPTNO
---------- ----------
      2925         10
      2925         10
    2237.5         20
    2237.5         20
    2237.5         20
    2237.5         20
      1375         30
      1375         30
      1375         30
      1375         30

10 rows selected.

从测试结果看percentile_cont(0.5)函数与oracle的MEDIAN函数结果是一致的,可以进行完美替换。
percentile_disc(0.5) 和 percentile_cont(0.5) 在特定的情况下结果会一致(数据集为奇数的时候)