org.elasticsearch.transport.ConnectTransportException or the case of the yellow cluster

Some time ago we needed to add two data nodes to our Elasticsearch cluster, which we happily ordered from our cloud provider. The first one joined OK and shards started moving around nicely. A happy green cluster. However, upon adding the second node, the cluster started accepting shards but remained in a yellow state. Consistently. For hours. Even trying to empty the node in order to remove it was not working; some shards would stay there forever.
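
If you find yourself staring at a similarly stubborn yellow cluster, Elasticsearch can usually tell you which shards are stuck and why it refuses to allocate them. A quick sketch, assuming the cluster answers on localhost:9200:

curl -s 'localhost:9200/_cat/health?v'
curl -s 'localhost:9200/_cat/shards?v' | grep -v STARTED
curl -s 'localhost:9200/_cluster/allocation/explain?pretty'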

Upon looking at the node logs, here is what caught our attention:

org.elasticsearch.transport.ConnectTransportException: [datanode-7][10.1.2.7:9300] connect_exception

A similar log entry was found in datanode-7's log file. What was going on here? Well, these two machines were assigned sequential IP addresses, 10.1.2.6 and 10.1.2.7. They could literally ping the whole internet but not find each other. To which the cloud provider's support group replied:

in this case you need to configure a hostroute via the gw as our switch doesn't allow a direct communication.
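
Fair enough. You can see the symptom for yourself from either box: everything else answers, the neighbour does not. A quick sketch, run from the 10.1.2.6 machine (9300 is the default Elasticsearch transport port):

ping -c 3 10.1.2.7             # no reply from the neighbour
nc -zv -w 3 10.1.2.7 9300      # transport port unreachable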

Enter systemd territory then, and, not wanting to make this yet another service, I defaulted to the oldest boot solution: edit /etc/rc.local (in reality /etc/rc.d/rc.local) on both machines with the appropriate routes:

ip route add 10.1.2.7/32 via 10.1.2.1 dev enp0s31f6
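
The mirror entry goes on the other machine. Roughly what each rc.local ends up carrying, as a sketch; 10.1.2.1 is our gateway and enp0s31f6 is the interface name on my boxes, yours will likely differ:

# /etc/rc.d/rc.local on the 10.1.2.6 machine
ip route add 10.1.2.7/32 via 10.1.2.1 dev enp0s31f6

# /etc/rc.d/rc.local on datanode-7 (10.1.2.7)
ip route add 10.1.2.6/32 via 10.1.2.1 dev enp0s31f6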

Then enable rc.local on both machines:

# chmod +x /etc/rc.d/rc.local
# systemctl enable rc-local
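
To check that the route is in effect now and will come back after a reboot, something like this should do (a sketch):

# ip route get 10.1.2.7
# systemctl is-enabled rc-local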

rc.local will never die. It is that loyal friend that will wait to be called upon when you need them most.
