
Cluster Not Ready

You said:

bash
Warning  Unhealthy  33s (x6045 over 14h)  kubelet  Readiness probe failed: Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=yellow&timeout=3s" )
Cluster is not yet ready (request params: "wait_for_status=yellow&timeout=3s" )

ChatGPT said:

This error message means that your Elasticsearch instance's readiness probe keeps failing because the cluster status has not yet reached yellow (i.e. at least all primary shards allocated):

txt
Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=yellow&timeout=3s")

It has already failed 6045 times over 14 hours, which means the cluster has been unhealthy ever since startup.


Common causes and troubleshooting steps

✅ 1. Not enough nodes

By default, Elasticsearch allocates primary and replica shards for every index. If the replica shards cannot be allocated, the cluster gets stuck at status=red (or never reaches yellow).

Check the number of nodes:

bash
curl -s http://localhost:9200/_cat/nodes?v

If you only have one node, explicitly disable replicas:

bash
curl -X PUT "localhost:9200/_settings" -H 'Content-Type: application/json' -d '{
  "index" : {
    "number_of_replicas" : 0
  }
}'

Or set it in elasticsearch.yml or in the Helm chart:

yaml
index:
  number_of_replicas: 0

✅ 2. Check cluster health

Check the cluster status:

bash
curl -s http://localhost:9200/_cluster/health?pretty

This shows exactly which part is not ready, for example which shards are unassigned.
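
If the status isn't what you expect, a quick follow-up is to list only the unassigned shards. This is just the _cat/shards API combined with grep; nothing beyond curl access to port 9200 is assumed:

bash
# List every shard, keep only the unassigned ones, and show the reason column
curl -s 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED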


✅ 3. PersistentVolume issues (especially on Kubernetes)

Check the PVC status of the Elasticsearch Pod:

bash
kubectl get pvc

If a PVC is stuck (Pending or Lost), the data directory may fail to mount, which in turn prevents the cluster from starting.


✅ 4. Misconfiguration / insufficient memory

  • If the JVM heap is sized incorrectly, Elasticsearch can also hang at startup.
  • Check the container logs:
bash
kubectl logs <elasticsearch-pod-name>

Look for errors related to memory, disk, or permissions.
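
If the logs point at heap sizing, one way to pin it down is via the chart values. esJavaOpts and resources are standard keys in the official elastic/elasticsearch Helm chart; the sizes below are purely illustrative assumptions:

yaml
# values.yaml sketch: explicit 1 GB JVM heap with matching container memory limits
esJavaOpts: "-Xms1g -Xmx1g"
resources:
  requests:
    memory: "2Gi"
  limits:
    memory: "2Gi"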


Suggested fixes (pick whichever applies):

  • Single-node test environment ➜ set number_of_replicas: 0
  • Multi-node environment ➜ make sure at least 2 nodes are Ready and the network between them is healthy
  • Errors in the Pod logs ➜ analyze based on what the logs say
  • Mount failures ➜ check the PVC and StorageClass configuration
  • readinessProbe too strict ➜ increase initialDelaySeconds / timeoutSeconds to tolerate a cold start (see the sketch below)
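
A minimal sketch of a more tolerant probe. The exec command mirrors the wait_for_status check from the error above; the exact command and the numbers are illustrative assumptions to adapt to your chart, not defaults taken from it:

yaml
readinessProbe:
  exec:
    command:
      - bash
      - -c
      - 'curl -sf "http://localhost:9200/_cluster/health?wait_for_status=yellow&timeout=5s" > /dev/null'
  initialDelaySeconds: 120   # allow a cold start / long index recovery
  periodSeconds: 10
  timeoutSeconds: 10
  failureThreshold: 6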

If you can paste the actual Elasticsearch runtime configuration (for example the Helm chart values file you use) or the logs, I can dig further. Are you running a single node or a multi-node cluster?


You said:

bash
{"type": "server", "timestamp": "2025-04-25T02:41:41,317Z", "level": "INFO", "component": "o.e.x.s.s.SecurityStatusChangeListener", "cluster.name": "elasticsearch", "node.name": "foobar-elasticsearch-0", "message": "Active license is now [BASIC]; Security is disabled", "cluster.uuid": "7CUzHiZgTt6Ejk3V6agXZg", "node.id": "VqBKmdrcQgCAgFKWazeyoQ"  }
{"type": "server", "timestamp": "2025-04-25T02:41:41,321Z", "level": "INFO", "component": "o.e.g.GatewayService", "cluster.name": "elasticsearch", "node.name": "foobar-elasticsearch-0", "message": "recovered [300] indices into cluster_state", "cluster.uuid": "7CUzHiZgTt6Ejk3V6agXZg", "node.id": "VqBKmdrcQgCAgFKWazeyoQ"  }
{"type": "server", "timestamp": "2025-04-25T02:41:41,830Z", "level": "INFO", "component": "o.w.a.d.Dictionary", "cluster.name": "elasticsearch", "node.name": "foobar-elasticsearch-0", "message": "try load config from /usr/share/elasticsearch/config/analysis-ik/IKAnalyzer.cfg.xml", "cluster.uuid": "7CUzHiZgTt6Ejk3V6agXZg", "node.id": "VqBKmdrcQgCAgFKWazeyoQ"  }
{"type": "server", "timestamp": "2025-04-25T02:41:41,830Z", "level": "INFO", "component": "o.w.a.d.Dictionary", "cluster.name": "elasticsearch", "node.name": "foobar-elasticsearch-0", "message": "try load config from /usr/share/elasticsearch/plugins/ik/config/IKAnalyzer.cfg.xml", "cluster.uuid": "7CUzHiZgTt6Ejk3V6agXZg", "node.id": "VqBKmdrcQgCAgFKWazeyoQ"  }
{"type": "server", "timestamp": "2025-04-25T02:41:45,220Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "elasticsearch", "node.name": "foobar-elasticsearch-0", "message": "[gc][7] overhead, spent [336ms] collecting in the last [1s]", "cluster.uuid": "7CUzHiZgTt6Ejk3V6agXZg", "node.id": "VqBKmdrcQgCAgFKWazeyoQ"  }

ChatGPT said:

From the logs you pasted, Elasticsearch has in fact recovered 300 indices and there are no obvious errors, so the service itself starts fine:

txt
"recovered [300] indices into cluster_state"

But the readiness probe keeps failing, which means that although Elasticsearch considers itself up, the health status it exposes has not met the threshold yet (at least not yellow).


🎯 Check the current cluster health

Run the following commands to see the actual cluster health and shard allocation:

bash
kubectl exec -it foobar-elasticsearch-0 -- curl -s localhost:9200/_cluster/health?pretty

You can also look at the unassigned shards:

bash
kubectl exec -it foobar-elasticsearch-0 -- curl -s localhost:9200/_cat/shards?pretty

🔍 Likely cause (based on your setup)

This is a single-node deployment (the Pod name is foobar-elasticsearch-0) and it recovered 300 indices, so the most likely cause is:

✅ Cause: replica shards cannot be allocated (only one node)

By default every index has 1 replica, but since you only have one node the replicas can never be allocated, so the cluster health never reaches yellow.

🔧 Fix: set the replica count to 0

You can temporarily set the replica count of all indices to 0:

bash
kubectl exec -it foobar-elasticsearch-0 -- \
curl -X PUT "localhost:9200/_all/_settings" -H 'Content-Type: application/json' -d '{
  "index" : {
    "number_of_replicas" : 0
  }
}'

Or, if you only want newly created indices to default to 0 replicas, add a default template:

bash
kubectl exec -it foobar-elasticsearch-0 -- \
curl -X PUT "localhost:9200/_template/default" -H 'Content-Type: application/json' -d '{
  "index_patterns": ["*"],
  "settings": {
    "number_of_replicas": 0
  }
}'

✅ Follow-up suggestions

  1. Adjust the Helm chart defaults (if you deployed with Helm):

    yaml
    replicas: 1
    minimumMasterNodes: 1
    esConfig:
      elasticsearch.yml: |
        node.name: foobar-elasticsearch-0
        cluster.name: elasticsearch
        network.host: 0.0.0.0
        discovery.type: single-node
  2. Or explicitly disable the replica setting when configuring the IK plugin (so old indices don't keep carrying replicas)


If you deployed via Helm or something similar, I can help you adjust values.yaml to avoid this problem. Do you want me to help tune the Helm chart configuration?


You said:

Defaulted container "foobar-elasticsearch" out of: foobar-elasticsearch, configure-sysctl (init)

{ "cluster_name" : "elasticsearch", "status" : "red", "timed_out" : false, "number_of_nodes" : 1, "number_of_data_nodes" : 1, "active_primary_shards" : 499, "active_shards" : 499, "relocating_shards" : 0, "initializing_shards" : 0, "unassigned_shards" : 500, "delayed_unassigned_shards" : 0, "number_of_pending_tasks" : 0, "number_of_in_flight_fetch" : 0, "task_max_waiting_in_queue_millis" : 0, "active_shards_percent_as_number" : 49.94994994994995 }

ChatGPT said:

Thanks for pasting the full /_cluster/health output; the situation is now very clear:


💡 Core problem

Your Elasticsearch has only one node, but:

  • The cluster status is red
  • Only 499 primary shards are active
  • There are still 500 unassigned shards (unassigned_shards: 500)

This suggests the unassigned shards are replica shards: a single-node deployment cannot allocate replicas, so the cluster stays stuck at red, never reaching yellow.


✅ Fix: disable replica shards in bulk

You already have a lot of indices (at least 499 recovered). Run the following command to set the replica count of all indices to 0:

bash
kubectl exec -it foobar-elasticsearch-0 -- \
curl -X PUT "localhost:9200/_all/_settings" -H 'Content-Type: application/json' -d '{
  "index" : {
    "number_of_replicas" : 0
  }
}'

Once that command has finished, check the cluster status again; it should now be yellow or green:

bash
kubectl exec -it foobar-elasticsearch-0 -- curl -s localhost:9200/_cluster/health?pretty

🧼 Optional: delete all indices (⚠ dangerous)

If you don't need to keep these indices, you can simply wipe all the data:

bash
kubectl exec -it foobar-elasticsearch-0 -- \
curl -X DELETE "localhost:9200/_all"

This completely empties the cluster and makes it go green immediately.


🛠 Recommended Helm configuration changes

If you installed via Helm, you can add the following to values.yaml to prevent this from happening again:

yaml
esConfig:
  elasticsearch.yml: |
    discovery.type: single-node

extraEnvs:
  - name: "discovery.type"
    value: "single-node"

## Or disable the replica-count initialization in templates/init-sysctl.yaml (depends on your chart's layout)

Try running the command that disables replicas first and see whether that gets the cluster to yellow. I can also put together a full example Helm chart configuration for you; just tell me which chart you're using.


You said:

Defaulted container "foobar-elasticsearch" out of: foobar-elasticsearch, configure-sysctl (init) { "cluster_name" : "elasticsearch", "status" : "red", "timed_out" : false, "number_of_nodes" : 1, "number_of_data_nodes" : 1, "active_primary_shards" : 499, "active_shards" : 499, "relocating_shards" : 0, "initializing_shards" : 0, "unassigned_shards" : 1, "delayed_unassigned_shards" : 0, "number_of_pending_tasks" : 0, "number_of_in_flight_fetch" : 0, "task_max_waiting_in_queue_millis" : 0, "active_shards_percent_as_number" : 99.8 }

ChatGPT said:

👍 It's clear you just ran the command that set the replica count to 0, and now:

  • unassigned_shards dropped from 500 to 1
  • active_shards_percent climbed from ~49% to 99.8%
  • The remaining 1 is probably an index that got stuck during recovery, or a leftover orphaned replica shard

✅ Suggested next steps

1️⃣ Find out which shard is unassigned:

bash
kubectl exec -it foobar-elasticsearch-0 -- \
curl -s localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason | grep UNASSIGNED

This shows exactly which index and which shard is stuck.


2️⃣ Force-allocate it (if it really is a primary shard)

If you confirm it is a primary shard (prirep shows p), you can allocate it manually:

bash
kubectl exec -it foobar-elasticsearch-0 -- \
curl -X POST "localhost:9200/_cluster/reroute" -H 'Content-Type: application/json' -d '{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "<your-index-name>",
        "shard": 0,
        "node": "foobar-elasticsearch-0",
        "accept_data_loss": true
      }
    }
  ]
}'

If the UNASSIGNED shard is an r (replica), you can actually ignore it; it will clear up on its own shortly, or you can run the _all/_settings call again to set the replica count to 0 and remove it for good.


3️⃣ Check the health again

bash
kubectl exec -it foobar-elasticsearch-0 -- curl -s localhost:9200/_cluster/health?pretty

Once it reaches yellow, the readiness probe will succeed and the Pod will become Ready.
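
To confirm from the Kubernetes side, watching the Pod is enough; this uses plain kubectl and assumes only the Pod name from this thread:

bash
# Watch until the READY column shows 1/1
kubectl get pod foobar-elasticsearch-0 -w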


🧩 If you want the shortcut:

You can run the replica cleanup and then re-check health in one go:

bash
## Remove replicas
kubectl exec -it foobar-elasticsearch-0 -- \
curl -X PUT "localhost:9200/_all/_settings" -H 'Content-Type: application/json' -d '{
  "index": { "number_of_replicas": 0 }
}'

## Then re-check cluster health
kubectl exec -it foobar-elasticsearch-0 -- curl -s localhost:9200/_cluster/health?pretty

If you paste which shard is UNASSIGNED, I can generate the exact allocate_stale_primary command for you.


You said:

items-osc 1 p UNASSIGNED CLUSTER_RECOVERED

ChatGPT said:

Great, you've located the unassigned shard:

  • Index: items-osc
  • Shard: 1
  • Type: primary shard (p)
  • Reason: CLUSTER_RECOVERED (usually means it failed to recover automatically after an Elasticsearch restart)

✅ Fix: force-allocate this primary shard

This shard probably cannot recover automatically because its metadata is incomplete. You can use the following command to force-allocate it as a "stale primary" and accept data loss (appropriate if you know this isn't critical data, or you're willing to give up that part of it):

bash
kubectl exec -it foobar-elasticsearch-0 -- \
curl -X POST "localhost:9200/_cluster/reroute" -H 'Content-Type: application/json' -d '{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "items-osc",
        "shard": 1,
        "node": "foobar-elasticsearch-0",
        "accept_data_loss": true
      }
    }
  ]
}'

After running it, check the cluster status again:

bash
kubectl exec -it foobar-elasticsearch-0 -- curl -s localhost:9200/_cluster/health?pretty

If everything goes well, you should see:

  • status becomes yellow or green
  • unassigned_shards drops to 0
  • the readiness probe stops failing and the Pod becomes Ready

If you'd like a script that automatically checks for and fixes this kind of issue, just say so 😎


You said:

Still red

ChatGPT said:

OK, so the shard still didn't recover after the forced allocation. Let's work through a few possible causes:


🔍 1. Check whether the allocation command actually took effect

Run the following command to see whether the allocation command was accepted by Elasticsearch:

bash
kubectl exec -it foobar-elasticsearch-0 -- \
curl -s "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d '{
  "index": "items-osc",
  "shard": 1,
  "primary": true
}'

This command tells you why the shard cannot be allocated. If the response contains something like:

json
"node_allocation_decisions" : [
  {
    "node_id" : "...",
    "node_name" : "...",
    "deciders" : [
      {
        "decider" : "same_shard",
        "decision" : "NO",
        "explanation" : "..."
      }
    ]
  }
]

then we know which decider is blocking the allocation.


🛠️ 2. Try allocate_empty_primary instead

If allocate_stale_primary doesn't work, this variant is sometimes more reliable (it also causes data loss):

bash
kubectl exec -it foobar-elasticsearch-0 -- \
curl -X POST "localhost:9200/_cluster/reroute" -H 'Content-Type: application/json' -d '{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "items-osc",
        "shard": 1,
        "node": "foobar-elasticsearch-0",
        "accept_data_loss": true
      }
    }
  ]
}'

This command means: "I accept losing all of this shard's original data; allocate an empty shard so the cluster can recover."


🧼 The nuclear option (delete the index entirely)

If items-osc isn't important, or you don't mind deleting it:

bash
kubectl exec -it foobar-elasticsearch-0 -- \
curl -X DELETE "localhost:9200/items-osc"

Then check again:

bash
kubectl exec -it foobar-elasticsearch-0 -- curl -s localhost:9200/_cluster/health?pretty

A couple of questions to confirm:

  • Did you create items-osc yourself? Is it important?
  • Would you rather delete it, or try as hard as possible to keep it?

Depending on which path you choose, I can walk you through it step by step.


You said:

{ "index" : "items-osc", "shard" : 1, "primary" : true, "current_state" : "unassigned", "unassigned_info" : { "reason" : "ALLOCATION_FAILED", "at" : "2025-04-25T02:45:59.588Z", "failed_allocation_attempts" : 1, "details" : "failed shard on node [VqBKmdrcQgCAgFKWazeyoQ]: failed recovery, failure RecoveryFailedException[[items-osc][1]: Recovery failed on {foobar-elasticsearch-0}{VqBKmdrcQgCAgFKWazeyoQ}{0adHZwLkRuK67EXUhTVODA}{100.105.36.208}{100.105.36.208:9300}{cdhilmrstw}{ml.machine_memory=4294967296, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}]; nested: IndexShardRecoveryException[failed to fetch index version after copying it over]; nested: IndexShardRecoveryException[shard allocated for local recovery (post api), should exist, but doesn't, current files: [_5.cfe, _5.cfs, _5.si, _6.cfe, _6.cfs, _6.si, _7.cfe, _7.cfs, _7.si, _8.cfe, _8.cfs, _8.si, _a.cfe, _a.cfs, _a.si, _c.cfe, _c.cfs, _c.si, write.lock]]; nested: IndexNotFoundException[no segments* file found in store(ByteSizeCachingDirectory(HybridDirectory@/usr/share/elasticsearch/data/nodes/0/indices/DB_gNN0kTrq9TuzEcWywQw/1/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@6c40bb07)): files: [_5.cfe, _5.cfs, _5.si, _6.cfe, _6.cfs, _6.si, _7.cfe, _7.cfs, _7.si, _8.cfe, _8.cfs, _8.si, _a.cfe, _a.cfs, _a.si, _c.cfe, _c.cfs, _c.si, write.lock]]; ", "last_allocation_status" : "no_valid_shard_copy" }, "can_allocate" : "no_valid_shard_copy", "allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt", "node_allocation_decisions" : [ { "node_id" : "VqBKmdrcQgCAgFKWazeyoQ", "node_name" : "foobar-elasticsearch-0", "transport_address" : "100.105.36.208:9300", "node_attributes" : { "ml.machine_memory" : "4294967296", "xpack.installed" : "true", "transform.node" : "true", "ml.max_open_jobs" : "20" }, "node_decision" : "no", "store" : { "in_sync" : false, "allocation_id" : "2VF9YhuxSdG7NrDDH_C85g", "store_exception" : { "type" : "index_not_found_exception", "reason" : "no segments* file found in SimpleFSDirectory@/usr/share/elasticsearch/data/nodes/0/indices/DB_gNN0kTrq9TuzEcWywQw/1/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@6c40bb07: files: [_5.cfe, _5.cfs, _5.si, _6.cfe, _6.cfs, _6.si, _7.cfe, _7.cfs, _7.si, _8.cfe, _8.cfs, _8.si, _a.cfe, _a.cfs, _a.si, _c.cfe, _c.cfs, _c.si, write.lock]" } } } ] }

ChatGPT said:

The diagnostics you pasted make the root cause very clear:


❗ Key error:

json
"reason": "no segments* file found",
"can_allocate": "no_valid_shard_copy",
"allocate_explanation": "cannot allocate because all found copies of the shard are either stale or corrupt"

In short: the shard's data is corrupted, and ES cannot find the segment files needed to recover it.


✅ Solutions

The data for primary shard 1 of the items-osc index can no longer be recovered: the physical files are corrupted or missing, and ES explicitly reports "no valid shard copy".

So, to get the cluster back to yellow or green, you have only the following two options:


✅ Option 1: delete the items-osc index (recommended)

If this index isn't very important, just delete it to get rid of the "dead" shard and restore the cluster status:

bash
kubectl exec -it foobar-elasticsearch-0 -- \
curl -X DELETE "localhost:9200/items-osc"

Then check the cluster status again:

bash
kubectl exec -it foobar-elasticsearch-0 -- \
curl -s localhost:9200/_cluster/health?pretty

You should see the status go to yellow (or green once no replicas remain), the readiness probe will pass, and the Pod will return to normal.


❗ Option 2: manually recreate the index

If you don't want to simply delete items-osc, you can recreate it like this:

bash
kubectl exec -it foobar-elasticsearch-0 -- \
curl -X PUT "localhost:9200/items-osc" -H 'Content-Type: application/json' -d '{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 0
  }
}'

Note: this rebuilds the index and overwrites the previous one; the old data is lost as well.


✅ Bottom line

Based on the error, the shard files are completely unrecoverable; neither allocate_stale_primary nor allocate_empty_primary will get the data back, because the physical files are incomplete.

So the most practical approach is:

✅ Delete the items-osc index directly and restore cluster health.


Let me know which way you want to go and I'll paste the exact command so you can fix it in one shot 😎


You said:

{ "cluster_name" : "elasticsearch", "status" : "green", "timed_out" : false, "number_of_nodes" : 1, "number_of_data_nodes" : 1, "active_primary_shards" : 497, "active_shards" : 497, "relocating_shards" : 0, "initializing_shards" : 0, "unassigned_shards" : 0, "delayed_unassigned_shards" : 0, "number_of_pending_tasks" : 0, "number_of_in_flight_fetch" : 0, "task_max_waiting_in_queue_millis" : 0, "active_shards_percent_as_number" : 100.0 }

ChatGPT said:

Excellent! 🎉 The Elasticsearch cluster is now back to green:

txt
"status" : "green",
"unassigned_shards" : 0,
"active_shards_percent_as_number" : 100.0

This means:

  • All primary and replica shards are allocated;
  • The cluster is fully healthy;
  • The readiness probe should no longer fail, and the Pod should be Running / Ready

If you manage it with Helm or an Operator, you can now go ahead and deploy the applications on top, or let the services that depend on Elasticsearch run normally.

Want me to write a small script that periodically checks for and automatically cleans up this kind of "unallocatable shard" situation? Or we could add a simple alert.
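
For reference, a minimal sketch of such a check in bash, assuming the same kubectl + curl access used above and the Pod name from this thread; it only reports, it does not delete anything automatically:

bash
#!/usr/bin/env bash
# Report cluster health and any unassigned shards for the single-node ES Pod.
POD="foobar-elasticsearch-0"   # assumed Pod name from this thread

# Current health status (compact JSON, so the grep pattern matches "status":"...")
kubectl exec "$POD" -- curl -s 'localhost:9200/_cluster/health' | grep -o '"status":"[a-z]*"'

# List shards that are still UNASSIGNED, with the reason, so a human can decide what to do
kubectl exec "$POD" -- curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' \
  | grep UNASSIGNED || echo "no unassigned shards"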