更省时的批量操作

发布于 2016-02-29 14:25:14 | 575 次阅读 | 评论: 0 | 来源: 网络整理

就像mget允许我们一次性检索多个文档一样，bulk API允许我们使用单一请求来实现多个文档的create、index、update或delete。这对索引类似于日志活动这样的数据流非常有用，它们可以以成百上千的数据为一个批次按序进行索引。

bulk请求体如下，它有一点不同寻常：

{ action: { metadata }}\n
{ request body        }\n
{ action: { metadata }}\n
{ request body        }\n
...

这种格式类似于用"\n"符号连接起来的一行一行的JSON文档流(stream)。两个重要的点需要注意：

每行必须以"\n"符号结尾，包括最后一行。这些都是作为每行有效的分离而做的标记。
每一行的数据不能包含未被转义的换行符，它们会干扰分析——这意味着JSON不能被美化打印。

提示:

在《批量格式》一章我们介绍了为什么bulk API使用这种格式。

action/metadata这一行定义了文档行为(what action)发生在哪个文档(which document)之上。

行为(action)必须是以下几种：

行为	解释
`create`	当文档不存在时创建之。详见《创建文档》
`index`	创建新文档或替换已有文档。见《索引文档》和《更新文档》
`update`	局部更新文档。见《局部更新》
`delete`	删除一个文档。见《删除文档》

在索引、创建、更新或删除时必须指定文档的_index、_type、_id这些元数据(metadata)。

例如删除请求看起来像这样：

{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}

请求体(request body)由文档的_source组成——文档所包含的一些字段以及其值。它被index和create操作所必须，这是有道理的：你必须提供文档用来索引。

这些还被update操作所必需，而且请求体的组成应该与update API（doc, upsert, script等等）一致。删除操作不需要请求体(request body)。

{ "create":  { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "My first blog post" }

如果定义_id，ID将会被自动创建：

{ "index": { "_index": "website", "_type": "blog" }}
{ "title":    "My second blog post" }

为了将这些放在一起，bulk请求表单是这样的：

POST /_bulk
{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }} <1>
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "My first blog post" }
{ "index":  { "_index": "website", "_type": "blog" }}
{ "title":    "My second blog post" }
{ "update": { "_index": "website", "_type": "blog", "_id": "123", "_retry_on_conflict" : 3} }
{ "doc" : {"title" : "My updated blog post"} } <2>

注意`delete`**行为(action)**没有请求体，它紧接着另一个**行为(action)**
记得最后一个换行符

Elasticsearch响应包含一个items数组，它罗列了每一个请求的结果，结果的顺序与我们请求的顺序相同：

{
   "took": 4,
   "errors": false, <1>
   "items": [
      {  "delete": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 2,
            "status":   200,
            "found":    true
      }},
      {  "create": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 3,
            "status":   201
      }},
      {  "create": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "EiwfApScQiiy7TIKFxRCTw",
            "_version": 1,
            "status":   201
      }},
      {  "update": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 4,
            "status":   200
      }}
   ]
}}

所有子请求都成功完成。

每个子请求都被独立的执行，所以一个子请求的错误并不影响其它请求。如果任何一个请求失败，顶层的error标记将被设置为true，然后错误的细节将在相应的请求中被报告：

POST /_bulk
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "Cannot create - it already exists" }
{ "index":  { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "But we can update it" }

响应中我们将看到create文档123失败了，因为文档已经存在，但是后来的在123上执行的index请求成功了：

{
   "took": 3,
   "errors": true, <1>
   "items": [
      {  "create": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "status":   409, <2>
            "error":    "DocumentAlreadyExistsException <3>
                        [[website][4] [blog][123]:
                        document already exists]"
      }},
      {  "index": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 5,
            "status":   200 <4>
      }}
   ]
}

一个或多个请求失败。
这个请求的HTTP状态码被报告为`409 CONFLICT`。
错误消息说明了什么请求错误。
第二个请求成功了，状态码是`200 OK`。

这些说明bulk请求不是原子操作——它们不能实现事务。每个请求操作时分开的，所以每个请求的成功与否不干扰其它操作。

不要重复

你可能在同一个index下的同一个type里批量索引日志数据。为每个文档指定相同的元数据是多余的。就像mget API，bulk请求也可以在URL中使用/_index或/_index/_type:

POST /website/_bulk
{ "index": { "_type": "log" }}
{ "event": "User logged in" }

你依旧可以覆盖元数据行的_index和_type，在没有覆盖时它会使用URL中的值作为默认值：

POST /website/log/_bulk
{ "index": {}}
{ "event": "User logged in" }
{ "index": { "_type": "blog" }}
{ "title": "Overriding the default type" }

多大才算太大？

整个批量请求需要被加载到接受我们请求节点的内存里，所以请求越大，给其它请求可用的内存就越小。有一个最佳的bulk请求大小。超过这个大小，性能不再提升而且可能降低。

最佳大小，当然并不是一个固定的数字。它完全取决于你的硬件、你文档的大小和复杂度以及索引和搜索的负载。幸运的是，这个最佳点(sweetspot)还是容易找到的：

试着批量索引标准的文档，随着大小的增长，当性能开始降低，说明你每个批次的大小太大了。开始的数量可以在1000~5000个文档之间，如果你的文档非常大，可以使用较小的批次。

通常着眼于你请求批次的物理大小是非常有用的。一千个1kB的文档和一千个1MB的文档大不相同。一个好的批次最好保持在5-15MB大小间。

更省时的批量操作