Elasticsearch in Depth: REST APIs / Document APIs / Reindex API / Reindexing Across Clusters

Published: 2023-06-09 14:26:07 | Author: 左扬

Reindex from remote (cross-cluster reindexing)

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-reindex.html#reindex-from-remote

Reindex supports reindexing from a remote Elasticsearch cluster:

curl -X POST "localhost:9200/_reindex?pretty" -H 'Content-Type: application/json' -d'
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200",
      "username": "user",
      "password": "pass"
    },
    "index": "my-index-000001",
    "query": {
      "match": {
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "my-new-index-000001"
  }
}'

The host parameter must contain a scheme, host, and port (e.g. https://otherhost:9200), and may include a path (e.g. https://otherhost:9200/proxy). The username and password parameters are optional; when they are present, _reindex connects to the remote Elasticsearch node using basic auth. Be sure to use https with basic auth, or the password will be sent in plain text. A range of settings is available to configure the behavior of the https connection.
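The plain-text warning can be made concrete with a variant of the request above that targets an https endpoint. This is a minimal sketch, not the documented example. It assumes the remote cluster is reachable at https://otherhost:9200. The variable names REMOTE_USER, REMOTE_PASS, and ES_URL and the send_reindex helper are hypothetical. It also assumes the credentials contain no characters that need JSON escaping.

```shell
#!/bin/sh
# Credentials come from the environment so they stay out of the script.
# The defaults are placeholders for illustration only.
REMOTE_USER="${REMOTE_USER:-user}"
REMOTE_PASS="${REMOTE_PASS:-pass}"

# Build the request body. The https scheme keeps the basic auth
# credentials encrypted on the way to the remote cluster.
body=$(cat <<EOF
{
  "source": {
    "remote": {
      "host": "https://otherhost:9200",
      "username": "${REMOTE_USER}",
      "password": "${REMOTE_PASS}"
    },
    "index": "my-index-000001"
  },
  "dest": {
    "index": "my-new-index-000001"
  }
}
EOF
)

# Hypothetical helper: sends the body to the local cluster's _reindex endpoint.
send_reindex() {
  curl -sS -X POST "${ES_URL:-http://localhost:9200}/_reindex?pretty" \
    -H 'Content-Type: application/json' -d "$body"
}

# Print the body for inspection; call send_reindex once it looks right.
printf '%s\n' "$body"
```

Keeping the send step in a separate function makes it possible to review the generated JSON before anything is sent to the cluster.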

When using Elastic Cloud, you can also authenticate against the remote cluster with a valid API key:

curl -X POST "localhost:9200/_reindex?pretty" -H 'Content-Type: application/json' -d'
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200",
      "headers": {
        "Authorization": "ApiKey API_KEY_VALUE"
      }
    },
    "index": "my-index-000001",
    "query": {
      "match": {
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "my-new-index-000001"
  }
}'

Remote hosts must be explicitly allowed in elasticsearch.yml using the reindex.remote.whitelist property. It can be set to a comma-delimited list of allowed remote host and port combinations. The scheme is ignored; only the host and port are used. For example:

reindex.remote.whitelist: "otherhost:9200, another:9200, 127.0.10.*:9200, localhost:*"

The list of allowed hosts must be configured on every node that will coordinate the reindex.

This feature should work with remote clusters running any version of Elasticsearch you are likely to find. It should therefore let you upgrade from any version of Elasticsearch to the current version by reindexing from a cluster running the old version.

Elasticsearch does not support forward compatibility across major versions. For example, you cannot reindex from a 7.x cluster into a 6.x cluster.

To enable queries sent to older versions of Elasticsearch, the query parameter is sent directly to the remote host without validation or modification.

Reindexing from remote clusters does not support manual or automatic slicing.

Reindexing from a remote server uses an on-heap buffer that defaults to a maximum size of 100mb. If the remote index includes very large documents, you'll need to use a smaller batch size. The example below sets the batch size to 10, which is very, very small.

curl -X POST "localhost:9200/_reindex?pretty" -H 'Content-Type: application/json' -d'
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200"
    },
    "index": "source",
    "size": 10,
    "query": {
      "match": {
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "dest"
  }
}'

You can also set the socket read timeout on the remote connection with the socket_timeout field and the connection timeout with the connect_timeout field. Both default to 30 seconds. This example sets the socket read timeout to one minute and the connection timeout to 10 seconds:
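A minimal sketch of such a request follows, reusing the source and dest index names from the batch size example. The two timeout fields sit inside the remote object next to host. The ES_URL variable and the send_reindex helper are hypothetical conveniences, not part of the Reindex API.

```shell
#!/bin/sh
# Build the request body: a one-minute socket read timeout and a
# ten-second connection timeout on the remote connection.
body=$(cat <<'EOF'
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200",
      "socket_timeout": "1m",
      "connect_timeout": "10s"
    },
    "index": "source",
    "query": {
      "match": {
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "dest"
  }
}
EOF
)

# Hypothetical helper: sends the body to the local cluster's _reindex endpoint.
send_reindex() {
  curl -sS -X POST "${ES_URL:-http://localhost:9200}/_reindex?pretty" \
    -H 'Content-Type: application/json' -d "$body"
}

# Print the body for inspection; call send_reindex to submit it.
printf '%s\n' "$body"
```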