
Sequence Files VS ORC Files

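Disk space on the cluster is tight, so table data needs to be compressed. Either Sequence or ORC will be chosen as the storage format for warehouse table files going forward. This post compares table size and query performance for the two formats under Snappy and ZLIB (Gzip) compression.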

Compression parameters

Sequence

-- Snappy
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapred.output.compression.type=BLOCK;

-- Gzip
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapred.output.compression.type=BLOCK;
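
As a rough illustration of how these session settings take effect, the sketch below writes a SequenceFile copy of a text table. The table names order_master_text and order_master_seq_snappy are hypothetical; the source does not say how the test tables were actually created.

```sql
-- Assumes the Snappy settings above have already been applied in this session.
-- Table names are hypothetical placeholders.
CREATE TABLE order_master_seq_snappy
STORED AS SEQUENCEFILE
AS
SELECT * FROM order_master_text;
```

Re-running the same statement under the Gzip settings, with a different target name, would presumably give the seq_gzip variant.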

ORC

-- Snappy
set hive.exec.orc.default.compress=SNAPPY;

-- ZLIB
set hive.exec.orc.default.compress=ZLIB;
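
A minimal sketch of building the ORC test tables, again with hypothetical table names. The first statement relies on the session default set above; the second pins the codec on the table itself through the orc.compress table property, which is an alternative the source does not mention.

```sql
-- Hypothetical table names, for illustration only.
-- Variant 1: rely on the session-level default codec.
set hive.exec.orc.default.compress=SNAPPY;
CREATE TABLE order_master_orc_snappy
STORED AS ORC
AS
SELECT * FROM order_master_text;

-- Variant 2: set the codec per table with the orc.compress property.
CREATE TABLE order_master_orc_zlib
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB")
AS
SELECT * FROM order_master_text;
```

Setting the property on the table keeps the codec choice with the table definition rather than with whichever session happens to write to it.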

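Size comparison

Order master table

Format | File size | Space used | Compression ratio (x)
text | 19.8G | 59.5G | -
seq_snappy | 6.1G | 18.4G | 3.24
orc_snappy | 4.2G | 12.5G | 4.71
seq_gzip | 3.4G | 10.3G | 5.82
orc_zlib | 3.0G | 9.0G | 6.6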
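
The compression ratio column appears to be the text file size divided by the compressed file size. The query below is a small sketch that recomputes it from the order master figures; results may differ from the table in the last digit because the published sizes are already rounded. The space-used column is about three times the file size, which would be consistent with an HDFS replication factor of 3, though the source does not say so.

```sql
-- Compression ratio = text file size / compressed file size (sizes in GB,
-- taken from the order master table figures).
SELECT
  round(19.8 / 6.1, 2) AS seq_snappy_ratio,
  round(19.8 / 4.2, 2) AS orc_snappy_ratio,
  round(19.8 / 3.4, 2) AS seq_gzip_ratio,
  round(19.8 / 3.0, 2) AS orc_zlib_ratio;
```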

Order detail table

Format | File size | Space used | Compression ratio (x)
text | 18.8G | 56.5G | -
seq_snappy | 4.2G | 12.5G | 4.38
orc_snappy | 2.6G | 7.8G | 7.23
seq_gzip | 2.3G | 6.9G | 8.17
orc_zlib | 1.7G | 5.0G | 11.05

SQL query performance

Simple count

Format | Run 1 (s) | Run 2 (s) | Run 3 (s)
seq_snappy | 23.16 | 21.813 | 23.031
orc_snappy | 22.154 | 19.911 | 20.582
seq_gzip | 33.362 | 26.446 | 27.537
orc_zlib | 21.197 | 19.23 | 18.499

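Filter and group by

Format | Run 1 (s) | Run 2 (s) | Run 3 (s)
seq_snappy | 27.436 | 26.453 | 27.796
orc_snappy | 21.568 | 25.971 | 29.075
seq_gzip | 35.16 | 78.649 | 38.198
orc_zlib | 20.273 | 21.491 | 19.539

(The seq_gzip values are run together in the source as 35.1678.64938.198; the split shown is the most plausible reading.)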
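
The source does not show the query used for this test. A query of this kind might look like the sketch below, where the table name and the order_status and order_date columns are all hypothetical.

```sql
-- Hypothetical table and column names, for illustration only.
SELECT order_status, count(*) AS order_cnt
FROM order_master_orc_zlib
WHERE order_date >= '2020-01-01'
GROUP BY order_status;
```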

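Join count

The master table is joined to the detail table on order ID.

Format | Run 1 (s) | Run 2 (s) | Run 3 (s)
seq_snappy | 64.336 | 68.128 | 63.755
orc_snappy | 71.017 | 100.511 | 65.912
seq_gzip | 98.201 | 79.246 | 79.736
orc_zlib | 99.996 | 69.23 | 74.677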
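
The join itself is not shown in the source. Assuming both tables carry an order_id column (the table and column names here are hypothetical), the test query could be sketched as:

```sql
-- Hypothetical table and column names, for illustration only.
SELECT count(*)
FROM order_master_orc_zlib m
JOIN order_detail_orc_zlib d
  ON m.order_id = d.order_id;
```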

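Real-world query

The master and detail tables are joined on order ID and grouped, with a single-table select * and aggregates such as max and sum.

Format | Run 1 (s) | Run 2 (s) | Run 3 (s) | Run 4 (s) | Run 5 (s) | Run 6 (s)
seq_snappy | 580.938 | 1065.992 | 787.999 | 1365.383 | 622.456 | 605.303
orc_snappy | 696.887 | 1333.952 | 1034.791 | 806.678 | 789.647 | 794.424
seq_gzip | 806.734 | 1448.114 | 1237.069 | 901.431 | 901.75 | 854
orc_zlib | 1983 | 1983 | 1294.967 | 906.825 | 926.922 | 975.837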

Details of the sixth run

Type | Job | Stage | Map | Reduce | Elapsed | Vcore Map (Seconds) | Vcore Reduce (Seconds) | Vcore Total (Seconds) | -(orc-seq)/seq
seq_snap | job_1583725051064_59160 | Stage-1 | 57 | 11 | 5mins,44sec | 1694222 | 2924159 | 4618381 | -
orc_snap | job_1583725051064_59190 | Stage-1 | 49 | 7 | 8mins,45sec | 1494752 | 2895072 | 4389824 | 4.95%
seq_snap | job_1583725051064_59187 | Stage-2 | 38 | 10 | 4mins,8sec | 1193042 | 1820443 | 3013485 | -
orc_snap | job_1583725051064_59198 | Stage-2 | 39 | 10 | 4mins,12sec | 1197185 | 1941388 | 3138573 | -4.15%
seq_gzip | job_1583725051064_59028 | Stage-1 | 49 | 6 | 9mins,39sec | 1667028 | 2851133 | 4518161 | -
orc_zlib | job_1583725051064_59046 | Stage-1 | 48 | 5 | 11mins,24sec | 1477584 | 2937803 | 4415387 | 2.27%
seq_gzip | job_1583725051064_59043 | Stage-2 | 42 | 10 | 4mins,19sec | 1213325 | 2140155 | 3353480 | -
orc_zlib | job_1583725051064_59065 | Stage-2 | 40 | 10 | 4mins,37sec | 1206666 | 2188277 | 3394943 | -1.24%

Type | Elapsed | Vcore Total (Seconds) | -(orc-seq)/seq
seq_snap | 9mins,52sec | 7631866 | -
orc_snap | 12mins,57sec | 7528397 | 1.36%
seq_gzip | 13mins,58sec | 7871641 | -
orc_zlib | 16mins | 7810330 | 0.78%
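
The -(orc-seq)/seq column reads as the fraction of vcore time that ORC saves relative to Sequence, so a positive value means ORC used less CPU. As a quick check, the sketch below recomputes the two summary percentages from the vcore totals of the sixth run; the column aliases are made up for illustration.

```sql
-- Relative CPU saving of ORC over Sequence, in percent:
--   -(orc - seq) / seq * 100
-- Inputs are the vcore totals of the sixth run.
SELECT
  round(-(7528397 - 7631866) / 7631866 * 100, 2) AS snappy_cpu_saving_pct,
  round(-(7810330 - 7871641) / 7871641 * 100, 2) AS zlib_cpu_saving_pct;
```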

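Conclusion

  • In terms of compressed data size, ORC wins outright.
  • For SQL queries, ORC files are smaller, so the number of map and reduce tasks is relatively lower, which lengthens run time (by about 20%). Total CPU consumption is nevertheless about 1% lower than with the Sequence format.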
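
The run-time figure can be sanity-checked against the sixth-run elapsed times. The sketch below converts them to seconds by hand (9mins,52sec = 592, 12mins,57sec = 777, 13mins,58sec = 838, 16mins = 960) and computes the relative increase of ORC over Sequence. It gives roughly 31% for Snappy and 15% for ZLIB, which brackets the 20% quoted above. The column aliases are illustrative.

```sql
-- Relative increase in elapsed time, ORC vs. Sequence, in percent.
-- Elapsed times of the sixth run, converted to seconds by hand.
SELECT
  round((777 - 592) / 592 * 100, 1) AS snappy_elapsed_increase_pct,
  round((960 - 838) / 838 * 100, 1) AS zlib_elapsed_increase_pct;
```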