A Guide to GeoMesa Pitfalls - Shaun's Space

Preface

I need to build a GeoMesa microservice, so this is a quick tour to get familiar with GeoMesa.

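Basics

GeoMesa can be thought of as the PostGIS of the big-data world: its main job is to provide indexes for storing and processing GIS data so that processing runs faster. GeoMesa is built on GeoTools, whose two most important concepts are SimpleFeatureType and SimpleFeature. A SimpleFeatureType corresponds to a table definition in a relational database (table name, column attributes, and so on), while a SimpleFeature corresponds to a single row of that table. The rest of this section focuses on SimpleFeatureType in GeoMesa and on how its indexes are created.

In GeoMesa a SimpleFeatureType is usually created with SimpleFeatureTypes.createType. The method has two overloads; take the one without a namespace parameter as an example: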

def createType(typeName: String, spec: String): SimpleFeatureType = {
    val (namespace, name) = parseTypeName(typeName)
    createType(namespace, name, spec)
}

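parseTypeName first parses typeName using : as the separator. The last valid (non-empty) segment becomes the table name (name); the remainder, if valid, becomes the namespace, and otherwise the namespace is null. The spec parameter commonly takes forms such as the following: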

val spec = "name:String,dtg:Date,*geom:Point:srid=4326"

val spec = "name:String,dtg:Date,*geom:Point:srid=4326;geomesa.indices.enabled='z2,id,z3'"

val spec = "name:String:index=true,tags:String:json=true,dtg:Date:default=true,*geom:Point:srid=4326;geomesa.indices.enabled='z2,id,z3'"

val spec = "userId:String,trackId:String,altitude:Double,dtg:Date,*geom:Point:srid=4326;geomesa.index.dtg='dtg',geomesa.table.sharing='true',geomesa.indices='z3:4:3,z2:3:3,id:2:3',geomesa.table.sharing.prefix='\\u0001'"
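
The typeName rule and the layout of the spec strings above can be sketched in plain Java. This is only an illustration written for this post, not GeoMesa's parser: the class SpecSketch and its three methods are hypothetical names, the sketch assumes that ; separates the columns from the userData, that , separates entries and that : separates the parts of a column, and it ignores column types whose definition itself contains a comma.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; GeoMesa's real parser is more elaborate.
class SpecSketch {

    // Returns {namespace, name}; namespace is null when absent or empty.
    static String[] parseTypeName(String typeName) {
        String trimmed = typeName.replaceAll(":+$", ""); // ignore trailing ':'
        int i = trimmed.lastIndexOf(':');
        String name = trimmed.substring(i + 1);
        String namespace = i > 0 ? trimmed.substring(0, i) : null;
        return new String[] { namespace, name };
    }

    // Splits the column half of a spec into {name, type[, options]} arrays.
    static List<String[]> parseColumns(String spec) {
        List<String[]> columns = new ArrayList<>();
        for (String column : spec.split(";", 2)[0].split(",")) {
            columns.add(column.trim().split(":", 3));
        }
        return columns;
    }

    // Splits the userData half on ',' while respecting single quotes.
    static Map<String, String> parseUserData(String spec) {
        Map<String, String> userData = new LinkedHashMap<>();
        String[] halves = spec.split(";", 2);
        if (halves.length < 2) {
            return userData; // the spec carries no userData
        }
        List<String> entries = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean quoted = false;
        for (char c : halves[1].toCharArray()) {
            if (c == '\'') {
                quoted = !quoted;            // quotes are dropped from the value
            } else if (c == ',' && !quoted) {
                entries.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        entries.add(current.toString());
        for (String entry : entries) {
            String[] kv = entry.split("=", 2);
            if (kv.length == 2) {
                userData.put(kv[0].trim(), kv[1].trim());
            }
        }
        return userData;
    }

    public static void main(String[] args) {
        String spec = "name:String:index=true,dtg:Date,*geom:Point:srid=4326;"
                + "geomesa.indices.enabled='z2,id,z3'";
        for (String[] column : parseColumns(spec)) {
            // A leading '*' marks the geometry column used for indexing.
            String note = column[0].startsWith("*") ? "  <- indexed geometry" : "";
            System.out.println(String.join(" | ", column) + note);
        }
        System.out.println(parseUserData(spec));
    }
}
```

The userData half needs the quote-aware loop because a value such as 'z2,id,z3' contains the same , that separates entries; a plain split on , would cut it into three pieces.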

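A spec is split first on ;, then on ,, and finally on :. The ; separator splits the spec into two strings. The first describes all the columns of the table: , separates the columns, and : separates each column into its name, its data type, and optional column attributes (whether it is indexed, whether it holds JSON data, whether it is the default index field, and so on). A leading * on a column name marks that field as the geometry used for indexing. Geometries are usually written in WKT, although they are stored in the database in a compact binary encoding. The second string holds the userData applied when the table is created, again split on , into individual entries. The predefined userData keys can be found in SimpleFeatureTypes.Configs, and users can define others. The key worth dwelling on is geomesa.indices.enabled. GeoMesa currently supports 8 kinds of index:

"attr", // attribute index
"id", // primary-key index
"s2", // Hilbert-curve spatial index for points
"s3", // Hilbert-curve spatio-temporal index for points
"z2", // Z-order-curve spatial index for points
"xz2", // Z-order-curve spatial index for lines and polygons
"z3",  // Z-order-curve spatio-temporal index for points
"xz3" // Z-order-curve spatio-temporal index for lines and polygons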

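GeoMesa indexes generally exist in several versions, and geomesa.indices.enabled uses the latest version by default. To pin a specific version, use geomesa.indices instead. That key is internal to GeoMesa and is not exposed publicly; its general format is: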

s"$name:$version:${mode.flag}:${attributes.mkString(":")}"

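Here name is the index type, version is the index version, mode.flag is the index mode (whether reads and writes are supported; usually 3, meaning both), and attributes lists the fields the index is built on. A spec may consist of the column descriptions alone, with no userData at all, in which case GeoMesa adds default index information. If the type has both a spatial and a temporal field, GeoMesa builds a z3 index (when the spatial field is a Point) or an xz3 index (when it is a non-Point type such as a line or polygon). With several spatial or temporal fields, the first spatial field and the first temporal field are the ones indexed. If there is only a spatial field, it builds a z2 or xz2 index; if there is only a temporal field, it builds an attribute index on that field. Index information that was left out of the spec can also be added afterwards, as follows: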

import org.locationtech.geomesa.utils.interop.SimpleFeatureTypes;
import org.opengis.feature.simple.SimpleFeatureType;

String spec = "name:String,dtg:Date,*geom:Point:srid=4326";
SimpleFeatureType sft = SimpleFeatureTypes.createType("mySft", spec);
// enable a default z3 and a default attribute index
sft.getUserData().put("geomesa.indices.enabled", "z3,attr:name");
// or, enable a default z3 and an attribute index with a Z2 secondary index
sft.getUserData().put("geomesa.indices.enabled", "z3,attr:name:geom");
// or, enable a default z3 and an attribute index with a temporal secondary index
sft.getUserData().put("geomesa.indices.enabled", "z3,attr:name:dtg");
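
The default-index rule and the geomesa.indices entry format described earlier in this section can be summarised in a few lines of plain Java. This is a sketch of the rule as stated in this post, not GeoMesa's own code: DefaultIndexSketch, defaultIndex and indexEntry are hypothetical names, and the case with neither a spatial nor a temporal field is left empty because the text does not cover it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration of the rules described in this post.
class DefaultIndexSketch {

    // geometryType is the type of the first spatial field, or null if none;
    // hasDate tells whether the type has at least one temporal field.
    static Optional<String> defaultIndex(String geometryType, boolean hasDate) {
        if (geometryType != null) {
            boolean point = "Point".equals(geometryType);
            if (hasDate) {
                return Optional.of(point ? "z3" : "xz3");
            }
            return Optional.of(point ? "z2" : "xz2");
        }
        // Only a temporal field: attribute index on it. Neither: not covered.
        return hasDate ? Optional.of("attr") : Optional.empty();
    }

    // Mirrors the format name:version:mode.flag:attributes (joined with ':').
    static String indexEntry(String name, int version, int modeFlag, List<String> attributes) {
        return name + ":" + version + ":" + modeFlag + ":" + String.join(":", attributes);
    }

    public static void main(String[] args) {
        System.out.println(defaultIndex("Point", true));
        System.out.println(defaultIndex("LineString", false));
        // Mode flag 3 means the index is both readable and writable.
        System.out.println(indexEntry("z3", 4, 3, Arrays.asList("geom", "dtg")));
    }
}
```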

Pitfalls

Importing OSM data

When OSM data is imported with osm-ways as the SimpleFeatureType, GeoMesa uses a database as temporary storage for nodes, and by default that database is H2 Database. To use a different database, the corresponding JDBC package must be added to lib. With PostgreSQL this triggers a bug in GeoMesa: PostgreSQL has no double type, only double precision, so creating the table fails. See geomesa/geomesa-convert/geomesa-convert-osm/src/main/scala/org/locationtech/geomesa/convert/osm/OsmWaysConverter.scala:

private def createNodesTable(): Unit = {
    val sql = "create table nodes(id BIGINT NOT NULL PRIMARY KEY, lon DOUBLE, lat DOUBLE);"
    WithClose(connection.prepareStatement(sql))(_.execute())
}

So to import OSM data with geomesa-convert-osm, go to the geomesa/geomesa-convert/geomesa-convert-osm directory and run

mvn dependency:copy-dependencies -DoutputDirectory=./depLib

This exports the dependencies of geomesa-convert-osm. Then copy h2, osm4j, dynsax, trove4j and the other related libraries from that output into $GEOMESA_HBASE_HOME/lib.
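
If PostgreSQL really is needed for the temporary node store, one possible patch is to change the column type in the createNodesTable statement quoted above. The sketch below is an untested assumption, not a fix that exists in GeoMesa: NodesTableSketch is a hypothetical class, and it relies on DOUBLE PRECISION being accepted by PostgreSQL and, as far as I know, by H2 as well.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical patch sketch; not part of GeoMesa.
class NodesTableSketch {

    // Same table as in OsmWaysConverter, but with a column type that
    // PostgreSQL understands (it has no plain DOUBLE type).
    static final String NODES_DDL =
            "create table nodes(id BIGINT NOT NULL PRIMARY KEY, "
                    + "lon DOUBLE PRECISION, lat DOUBLE PRECISION)";

    // Executes the DDL on a JDBC connection supplied by the caller.
    static void createNodesTable(Connection connection) throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(NODES_DDL)) {
            statement.execute();
        }
    }
}
```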

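The s2 index

The s2 index is an index that the Google S2 Geometry algorithm builds on the Hilbert curve. GeoMesa's s2 index was contributed by a developer from China. As of version 3.2 it only supports spatio-temporal indexing of points, not of lines and polygons. The GeoMesa team is also working on its own Hilbert-curve implementation, so hopefully a later GeoMesa release will include an h2 index. When Shaun imported OSM data with the s2 index enabled, the import failed with an error saying the index was not supported. Comparing the defaults function in geomesa-index-api2Index.scala and geomesa-index-api2Index.scala shows that S2Index simply returns an empty result, while the fromName function in geomesa-index-api.scala needs to call defaults, which is why the s2 index is reported as unsupported. Modifying the defaults function of S2Index fixes the problem (remember to add import org.locationtech.geomesa.utils.geotools.RichSimpleFeatureType.RichSimpleFeatureType as the first line of the S2Index class).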

Afterword

That is all I have learned so far; I will keep updating this post as I get more familiar with GeoMesa (ง •_•)ง.

Appendix

Selected parameters of the GeoMesa command-line tools

GeoMesa command-line parameters:

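Parameter	Description
-c, --catalog *	Catalog table that stores the schema metadata (comparable to a database)
-f, --feature-name	Schema name (comparable to a table in a database)
-s, --spec	Specification of the SimpleFeatureType to create (the description of the table's columns, i.e. the table schema, e.g. "name:String,age:Int,dtg:Date,*geom:Point:srid=4326")
-C, --converter	Converter to use; must be one of: 1. the name of a converter already on the classpath; 2. a converter configuration (as a string); 3. the name of a file containing the converter configuration
--converter-error-mode	Error mode for the custom converter
-t, --threads	Degree of parallelism
--input-format	Format of the input source (e.g. csv, tsv, avro, shp, json)
--no-tracking	Controls when the tool exits after an ingest job has been submitted (commonly used in scripts)
--run-mode	Run mode; must be one of local, distributed, or distributedcombine (distributed with combined inputs)
--split-max-size	In distributed mode, maximum split size (in bytes)
--src-list	Treat the input files as text files that list the inputs, one per line
--force	Suppress all prompts
[files]…	Input files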

Reference: GeoMesa Command-Line Tools - Ingest Command