PostGIS: Multi-Core Parallel Processing of Geographic Data

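This article comes from the Alibaba Cloud Yunqi Community.

Tags

PostgreSQL, PostGIS, raster, multi-core parallelism

Background

Since PostgreSQL 9.6 introduced multi-core parallel query, users of PostGIS, one of the most popular PostgreSQL extensions, have increasingly asked for multi-core support. The reason is that many PostGIS operations are very CPU-intensive, such as operations on the raster type.

"PostGIS tends to involve CPU-intensive calculations on geometries, support for parallel query has been at the top of our request list to the core team for a long time. Now that it is finally arriving the question is: does it really help?"

PostGIS 2.3.1 already shows real progress. The following functions now support parallel execution:

"Mark ST_Extent, ST_3DExtent and ST_Mem* agg functions as parallel safe so they can be parallelized"

Scan parallelism

Parallel scans can greatly improve efficiency when a scan node has to filter a large amount of data, or when the query clauses involve heavy computation. For example:

1. When the filter discards a large amount of data and the filter itself is expensive to evaluate, multi-core parallelism helps noticeably.

    select * from table where filter ...;

2. When a function (func) or operator (OP) is expensive to evaluate, multi-core parallelism helps a great deal. Typical cases are aggregation and business-logic computations: even if the table holds only a few rows, parallelism pays off when each row takes a long time to process.

    select func(x), x op y from table ...;

JOIN parallelism

When we inspect a SQL statement with explain, or profile its cost with perf, a JOIN across several tables can turn out to be the bottleneck of the whole statement if the joined data set is large. Multi-core parallelism can now be used to speed up such JOINs.

Aggregate parallelism

Aggregations, such as the average, maximum, minimum, or SUM over some dimension, are used heavily in finance and analytics, where both the data volume and the amount of computation are large. Besides parallelizing the scan, the aggregate function itself must also support parallelism. For sum, count, and avg, parallel processing is evidently safe. More precisely, any aggregate function that a distributed database supports as a two-stage aggregate is also parallel safe.

For the principle behind two-stage parallel aggregation in distributed databases, see "hll 插件在 Greenplum 中的使用以及分布式聚合函数优化思路" (Using the hll extension in Greenplum, and approaches to optimizing distributed aggregate functions; original link: https:/ ).

Disabling parallelism

    alter table table_name set (parallel_workers=0);
    set force_parallel_mode = off;

How parallelism works

See "PostgreSQL 9.6 并行计算优化器算法浅析" (A brief analysis of the PostgreSQL 9.6 parallel query optimizer algorithm; original link: https:/ ).

Query plan for a parallel sequential scan on pd with an st_area(geom) filter:

    Finalize Aggregate  (cost=20482.97..20482.98 rows=1 width=8) (actual time=345.855..345.856 rows=1 loops=1)
      ->  Gather  (cost=20482.65..20482.96 rows=3 width=8) (actual time=345.674..345.846 rows=4 loops=1)
            Number of Workers: 3
            ->  Partial Aggregate  (cost=19482.65..19482.66 rows=1 width=8) (actual time=336.663..336.664 rows=1 loops=4)
                  ->  Parallel Seq Scan on pd  (cost=0.00..19463.96 rows=7477 width=0) (actual time=0.154..331.815 rows=15540 loops=4)
                        Filter: (st_area(geom) > 10000)
                        Rows Removed by Filter: 1844
    Planning time: 0.145 ms
    Execution time: 349.345 ms

2. JOIN parallelism

    CREATE TABLE pts AS
    SELECT ST_PointOnSurface(geom)::Geometry(point, 3347) AS geom, gid, fed_num
    FROM pd;

    CREATE INDEX pts_gix ON pts USING GIST (geom);

Find the points that overlap the blue areas:

    EXPLAIN ANALYZE
    SELECT Count(*)
    FROM pd
    JOIN pts ON ST_Intersects(pd.geom, pts.geom);

"The ST_Intersects() function is actually a SQL wrapper on top of the && operator ..."

"UPDATE: Marking the geometry_overlaps function which is bound to the && operator ..."

For example:

    EXPLAIN ANALYZE
    SELECT ST_Area(ST_MemUnion(geom))
    FROM pd
    WHERE fed_num = 47005;

    Finalize Aggregate  (cost=16536.53..16536.79 rows=1 width=8) (actual time=2263.638..2263.639 rows=1 loops=1)
      ->  Gather  (cost=16461.22..16461.53 rows=3 width=32) (actual time=754.309..757.204 rows=4 loops=1)
            Number of Workers: 3
            ->  Partial Aggregate  (cost=15461.22..15461.23 rows=1 width=32) (actual time=676.738..676.739 rows=1 loops=4)
                  ->  Parallel Seq Scan on pd  (cost=0.00..13856.38 rows=64 width=2311) (actual time=3.009..27.321 rows=42 loops=4)
                        Filter: (fed_num = 47005)
                        Rows Removed by Filter: 17341
    Planning time: 0.219 ms
    Execution time: 2264.684 ms

Summary

1. Since PostgreSQL 9.6 added parallel query, PostgreSQL has exposed the interfaces needed to take part in it. An aggregate function, for instance, runs in two stages when executed in parallel, so you need to add a combine function. Extensions can therefore convert their existing aggregates and operators to parallel mode with little effort and benefit from PostgreSQL's multi-core parallelism.

2. Other acceleration techniques include LLVM, column storage, vectorization, operator reuse, and GPU acceleration.

"PostgreSQL 向量化执行插件(瓦片式实现) 10x 提速 OLAP" (PostgreSQL vectorized execution extension (tile-based implementation): 10x faster OLAP; original link: https:/ )

"- LLVM、列存、多核并行、算子复用大联姻 - 一起来开启 PostgreSQL 的百宝箱" (LLVM, column storage, multi-core parallelism, and operator reuse combined: opening PostgreSQL's treasure chest; original link: https:/ )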
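
The two-stage execution mentioned in the summary can be sketched with a user-defined aggregate. This is only a minimal illustration, not how PostGIS defines its own aggregates: the names demo_sum, demo_sum_trans, demo_sum_combine, the table t, and the column x are all hypothetical. It assumes PostgreSQL 9.6 or later, where CREATE AGGREGATE accepts COMBINEFUNC and PARALLEL.

```sql
-- Hypothetical names throughout: demo_sum, demo_sum_trans, demo_sum_combine, t, x.

-- Stage 1 (partial aggregate): each worker folds its share of the rows
-- into a partial state. NULL inputs are treated as 0.
CREATE FUNCTION demo_sum_trans(acc bigint, val integer)
RETURNS bigint
LANGUAGE sql IMMUTABLE PARALLEL SAFE
AS $$ SELECT acc + coalesce(val, 0) $$;

-- Stage 2 (finalize aggregate): the partial states gathered from the
-- workers are merged pairwise by the combine function.
CREATE FUNCTION demo_sum_combine(a bigint, b bigint)
RETURNS bigint
LANGUAGE sql IMMUTABLE PARALLEL SAFE
AS $$ SELECT a + b $$;

-- Without COMBINEFUNC and PARALLEL = SAFE the planner cannot split the
-- aggregate into Partial Aggregate and Finalize Aggregate nodes.
-- INITCOND = '0' means an empty input yields 0 rather than NULL.
CREATE AGGREGATE demo_sum(integer) (
    SFUNC       = demo_sum_trans,
    STYPE       = bigint,
    INITCOND    = '0',
    COMBINEFUNC = demo_sum_combine,
    PARALLEL    = SAFE
);

-- Inspect the plan; whether a parallel plan is actually chosen depends
-- on the table size and on the session's parallel query settings.
EXPLAIN SELECT demo_sum(x) FROM t;
```

If the planner does choose a parallel plan, it should have the same Partial Aggregate / Gather / Finalize Aggregate shape as the plans shown earlier in this article.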
