Hive中的explode函数、posexplode函数与later view函数

1.概述

在离线数仓处理通过HQL业务数据时，经常会遇到行转列或者列转行之类的操作，就像concat_ws之类的函数被广泛使用，今天这个也是经常要使用的拓展方法。

2.explode函数

2.1 函数语法

-- explode(a) - separates the elements of array a into multiple rows, or the elements of a map into multiple rows and columns 
Function class:org.apache.hadoop.hive.ql.udf.generic.GenericUDTFExplode
Function type:BUILTIN
-- explode()用于array的语法如下
select explode(arraycol) as newcol from tablename;
-- explode()用于map的语法如下：
select explode(mapcol) as (keyname,valuename) from tablename;

2.2 函数说明

explode 函数是UDTF函数，将hive一列中复杂的array或者map结构拆分成多行。

Explode函数是不允许在select再有其他字段,

explode(ARRAY) 列表中的每个元素生成一行。

explode(MAP) map中每个key-value对，生成一行，key为一列，value为一列。

2.3 使用案例

-- explode (array)
select explode(array('A','B','C'));
select explode(array('A','B','C')) as col;
select tf.* from (select 0) t lateral view explode(array('A','B','C')) tf;
select tf.* from (select 0) t lateral view explode(array('A','B','C')) tf as col;
-- 结果
col
A
B
C
-- explode (map)
select explode(map('A',10,'B',20,'C',30));
select explode(map('A',10,'B',20,'C',30)) as (key,value);
select tf.* from (select 0) t lateral view explode(map('A',10,'B',20,'C',30)) tf;
select tf.* from (select 0) t lateral view explode(map('A',10,'B',20,'C',30)) tf as key,value;
-- 结果
key value
A	10
B	20
C	30

3.posexplode函数

2.1 函数语法

-- posexplode(a) - behaves like explode for arrays, but includes the position of items in the original array
Function class:org.apache.hadoop.hive.ql.udf.generic.GenericUDTFPosExplode
Function type:BUILTIN
select posexplode(ARRAY a)
-- Explodes an array to multiple rows with additional positional column of int type (position of items in the original array, starting with 0). Returns a row-set with two columns (pos,val), one row for each element from the array.

2.2 函数说明

posexplode 函数，将ARRAY数组a展开，每个Value一行，每行两列分别对应数组从0开始的下标和数组元素。

2.3 使用案例

-- posexplode (array)
select posexplode(array('A','B','C'));
select posexplode(array('A','B','C')) as (pos,val);
select tf.* from (select 0) t lateral view posexplode(array('A','B','C')) tf;
select tf.* from (select 0) t lateral view posexplode(array('A','B','C')) tf as pos,val;
-- 结果
pos val
0	A
1	B
2	C

4.later view

4.1 语法

lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*
-- columnAlias是给udtf(expression)列起的别名。
-- tableAlias 虚拟表的别名。

4.2 用法描述

lateral view为侧视图,意义是为了配合UDTF来使用,把某一行数据拆分成多行数据。
不加lateral view的UDTF只能提取单个字段拆分,并不能塞回原来数据表中。
加上lateral view就可以将拆分的单个字段数据与原始表数据关联上。

lateral view函数会将UDTF生成的结果放到一个虚拟表中，然后虚拟表中的数据和输入行进行join来达到连接UDTF外的select字段的目的。(本质是笛卡尔积)

4.3 使用案例

4.3.1 准备数据

下表 pageAds. 它有两个字段: pageid (页码) and adid_list (页面上的adid):

Column name	Column type
pageid	STRING
adid_list	Array

表中数据如下：

pageid	adid_list
front_page	[1, 2, 3]
contact_page	[3, 4, 5]

需求：统计各个页面出现的广告的次数

4.3.2 代码实现

第一步：使用 lateral view 和 explore() 函数将 adid_list 列的 list 拆分，sql代码如下：

select pageid, adid
FROM pageAds lateral view explode(adid_list) ad_view as adid;

可的如下结果

pageid	adid
front_page	1
front_page	2
front_page	3
contact_page	4
contact_page	5

第二步：使用 count/group by 语句统计出每个adid出现的次数：

select adid,count(1) as cnt
FROM pageAds lateral view explode(adid_list) ad_view as adid
group by adid;

adid	cnt
1	1
2	1
3	2
4	1
5	1

4.4 Multiple Lateral Views

FROM子句可以有多个LATERAL VIEW子句。后面的LATERAL VIEWS子句可以引用出现在LATERAL VIEWS左侧表的任何列。

例如，如下查询：

SELECT * FROM exampleTable
LATERAL VIEW explode(col1) myTable1 AS myCol1
LATERAL VIEW explode(col2) myTable2 AS myCol2;

例如使用以下基表：

Array pageid_list	Array adid_list
[1, 2, 3]	[“a”, “b”, “c”]
[3, 4]	[“c”, “d”]

单个Lateral View查询:

SELECT pageid_list, adid
FROM pageAds_1
         LATERAL VIEW explode(adid_list) adTable AS adid;
[1,2,3]	a
[1,2,3]	b
[1,2,3]	c
[4,5]	c
[4,5]	d

多个Lateral View查询：

select pageid,adid FROM pageAds_1
lateral view explode(pageid_list) adTable as pageid
lateral view explode(adid_list) adTable as adid;
1,a
1,b
1,c
2,a
2,b
2,c
3,a
3,b
3,c
3,c
3,d
4,c
4,d

4.5 later view json_tuple()

4.5.1 准备数据

create table lateral_tal_3
(
    id   int,
    col1 string,
    col2 string
);
 insert into lateral_tal_3 values(1234,'{"part1" : "61", "total" : "623", "part2" : "560", "part3" : "1", "part4" : "1"}','	{"to_part2" : "0", "to_part4" : "0", "to_up" : "0", "to_part3" : "0", "to_part34" : "0"}'),
(4567,'{"part1" : "451", "total" : "89928", "part2" : "88653", "part3" : "789", "part4" : "35"}','{"to_part2" : "54", "to_part4" : "6", "to_up" : "65", "to_part3" : "2", "to_part34" : "3"}'),
(7890,'{"part1" : "142", "total" : "351808", "part2" : "346778", "part3" : "4321", "part4" : "567"}','{"to_part2" : "76", "to_part4" : "23", "to_up" : "65", "to_part3" : "14", "to_part34" : "53"}');

id	col1	col2
1234	{“part1” : “61”, “total” : “623”, “part2” : “560”, “part3” : “1”, “part4” : “1”}	{“to_part2” : “0”, “to_part4” : “0”, “to_up” : “0”, “to_part3” : “0”, “to_part34” : “0”}
4567	{“part1” : “451”, “total” : “89928”, “part2” : “88653”, “part3” : “789”, “part4” : “35”}	{“to_part2” : “54”, “to_part4” : “6”, “to_up” : “65”, “to_part3” : “2”, “to_part34” : “3”}
7890	{“part1” : “142”, “total” : “351808”, “part2” : “346778”, “part3” : “4321”, “part4” : “567”}	{“to_part2” : “76”, “to_part4” : “23”, “to_up” : “65”, “to_part3” : “14”, “to_part34” : “53”}

需求：解析非结构化的json数据类型

“json_tuple(jsonStr, p1, p2, …, pn) - like get_json_object, but it takes multiple names and return a tuple. All the input parameters and output column types are string.”
Function class:org.apache.hadoop.hive.ql.udf.generic.GenericUDTFJSONTuple
Function type:BUILTIN

json_tuple : 第一个参数是json 字符串所在的列名,其它参数是获取 json 字符串中的哪些key值;

4.5.2 代码实现

SELECT id,
       part1,
       part3,
       part4,
       to_part2,
       to_part3,
       to_part4,
       IF(part3 = 0, 0.0, to_part3 / part3) as ratio3,
       IF(part4 = 0, 0.0, to_part4 / part4) as ratio4
FROM lateral_tal_3
         lateral VIEW json_tuple(col1, 'part3', 'part4', 'part1') json1 AS part3, part4, part1
         lateral VIEW json_tuple(col2, 'to_part2','to_part3', 'to_part4') json2 AS to_part2, to_part3, to_part4
;
1234,61,1,1,0,0,0,0,0
4567,451,789,35,54,2,6,0.0025348542458808617,0.17142857142857143
7890,142,4321,567,76,14,23,0.0032399907428835918,0.04056437389770723

5.使用案例

需求1：如何产生1-100的连续的数字？

--方式1： 结合space函数与split函数，posexplode函数，lateral view函数获得
select id_start + pos as id
from (
         select 1   as id_start,
                100 as id_end
     ) m lateral view posexplode(split(space(id_end - id_start), '')) t as pos, val;
  
-- 方式2：结合space函数与split函数，explode函数，lateral view函数+窗口函数获得
select row_number() over () as id
from (select split(space(99), '') as x) t
         lateral view
             explode(x) ex;
-- 方式2：结合space函数与split函数，posexplode函数，lateral view函数获取
from (select split(space(99), ' ') as x) t
         lateral view
             posexplode(x) ex as pos,val;

需求2：获取2024-07-15至2024-07-29间所有的日期

SELECT pos,
       date_add(start_date, pos) dd
FROM (SELECT '2024-07-15' AS start_date, '2024-07-29' AS end_date) temp
         lateral VIEW
             posexplode(split(space(datediff(end_date, start_date)), '')) t
         AS pos, val;

分类:热门推荐日期:2024-04-29浏览:1评论:0

1.概述

2.explode函数

2.1 函数语法

2.2 函数说明

2.3 使用案例

3.posexplode函数

2.1 函数语法

2.2 函数说明

2.3 使用案例

4.later view

4.1 语法

4.2 用法描述

4.3 使用案例

4.3.1 准备数据

4.3.2 代码实现

4.4 Multiple Lateral Views

4.5 later view json_tuple()

4.5.1 准备数据

4.5.2 代码实现

5.使用案例

函数 字段

相关推荐