Backup and Restore for etcd
Backup
Manually taking an etcd snapshot
SSH into one of the etcd nodes and run the following commands:
set -a; source /var/run/coreos/etcdadm-environment; set +a
/opt/bin/etcdadm save
The command takes an etcd snapshot by running an appropriate etcdctl snapshot save command.
The snapshot is then exported to the S3 URI s3://<your-bucket-name>/.../<your-cluster-name>/exported/etcd-snapshots/snapshot.db.
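To confirm that the export succeeded, the object can be listed with the AWS CLI. The following is a minimal sketch, assuming the AWS CLI is installed on the machine you run it from and has read access to the bucket. snapshot_exists is a hypothetical helper name introduced for this example, and the full S3 URI is supplied by the caller.

```shell
# Sketch: check that something exists at the given S3 URI.
# Assumes the AWS CLI is installed and can read the bucket.
# snapshot_exists is a hypothetical helper, not part of etcdadm.
snapshot_exists() {
  if [ -z "$1" ]; then
    echo "usage: snapshot_exists <s3-uri>" >&2
    return 2
  fi
  # Prints the listing (date, time, size, name) so the snapshot age is visible.
  # `aws s3 ls` is expected to exit non-zero when nothing matches the URI.
  aws s3 ls "$1"
}

# Usage (replace the argument with your real snapshot URI):
#   snapshot_exists "<your-snapshot-s3-uri>" && echo "snapshot found"
```

Note that aws s3 ls matches by prefix, so an object whose name merely starts with snapshot.db would also be reported.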
Automatically taking an etcd snapshot
Periodic snapshots of the etcd cluster can be enabled by specifying the following in cluster.yaml:
etcd:
  snapshot:
    automated: true
When enabled, the command etcdadm save is called periodically (every minute by default) by a systemd timer.
Restore
Note that you must have taken an etcd snapshot beforehand to restore your cluster. A snapshot can be taken manually or automatically, as described above.
Manually restoring a permanently failed etcd node from etcd snapshot
Restoring a single failed node from a snapshot is not possible. However, you can recover a permanently failed etcd node, without losing data, by "resetting" the node. Concretely, run the following commands to remove the etcd member from the cluster, wipe its etcd data, and then re-add the member to the cluster:
sudo systemctl stop etcd-member.service
set -a; source /var/run/coreos/etcdadm-environment; set +a
/opt/bin/etcdadm replace
sudo systemctl start etcd-member.service
The reset member eventually catches up with the data from the rest of the etcd cluster, so the recovery completes without losing data.
For more details, see the relevant upstream issue.
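The reset steps above can be wrapped so that a failing step stops the sequence instead of continuing. This is a sketch only: ETCDADM_BIN and ETCDADM_ENV_FILE are hypothetical override variables introduced for this example (their defaults are the paths used in the commands above), and reset_etcd_member is a hypothetical function name.

```shell
# Sketch: reset the local etcd member, stopping at the first failing step.
# ETCDADM_BIN and ETCDADM_ENV_FILE are hypothetical variables for this
# example; they default to the paths used in the manual commands.
ETCDADM_BIN="${ETCDADM_BIN:-/opt/bin/etcdadm}"
ETCDADM_ENV_FILE="${ETCDADM_ENV_FILE:-/var/run/coreos/etcdadm-environment}"

reset_etcd_member() {
  sudo systemctl stop etcd-member.service || return 1
  # Run in a subshell so the exported environment does not leak to the caller.
  (
    set -a
    . "$ETCDADM_ENV_FILE" || exit 1
    set +a
    "$ETCDADM_BIN" replace
  ) || return 1
  # Only reached when the replace step succeeded.
  sudo systemctl start etcd-member.service
}

# Usage, on the failed etcd node:
#   reset_etcd_member || echo "reset failed; etcd-member.service is left stopped" >&2
```

If the replace step fails, this sketch deliberately leaves etcd-member.service stopped so that the node can be inspected before anything is started again.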
Manually restoring a cluster from etcd snapshot
SSH into every etcd node and stop the etcd3 process:
# $hosts is a space-separated list of your etcd node addresses
for h in $hosts; do
  ssh -i path/to/your/key core@$h sudo systemctl stop etcd-member.service
done
and then start the etcd3 process:
for h in $hosts; do
  ssh -i path/to/your/key core@$h sudo systemctl start etcd-member.service
done
Doing this triggers the automated disaster recovery process on the etcd nodes by running etcdadm-reconfigure.service, and your cluster will eventually be restored from the snapshot stored at s3://<your-bucket-name>/.../<your-cluster-name>/exported/etcd-snapshots/snapshot.db.
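The stop-then-start sequence above can be combined into one helper that aborts if any node cannot be reached. This is a minimal sketch: HOSTS and SSH_KEY are hypothetical variables introduced for this example, and etcd_all and restart_etcd_cluster are hypothetical function names. The core user and the etcd-member.service unit are taken from the commands above.

```shell
# Sketch: stop etcd on every node, then start it on every node.
# HOSTS (space-separated node addresses) and SSH_KEY are hypothetical
# variables for this example; set them for your own cluster.
HOSTS="${HOSTS:-}"
SSH_KEY="${SSH_KEY:-path/to/your/key}"

# Run "systemctl <action> etcd-member.service" on each host, in order.
etcd_all() {
  for h in $HOSTS; do
    ssh -i "$SSH_KEY" "core@$h" sudo systemctl "$1" etcd-member.service || return 1
  done
}

restart_etcd_cluster() {
  if [ -z "$HOSTS" ]; then
    echo "HOSTS is empty; nothing to do" >&2
    return 2
  fi
  # Following the order above: all members are stopped before any is started.
  etcd_all stop && etcd_all start
}

# Usage:
#   HOSTS="<node-1> <node-2> <node-3>" SSH_KEY=path/to/your/key
#   restart_etcd_cluster
```

Because the start phase only runs after every stop succeeded, a node that is unreachable during the stop phase leaves the already stopped members down until the helper is run again or they are started by hand.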
Automatic recovery
A feature to automatically restore a permanently failed etcd member or a cluster can be enabled by specifying:
etcd:
  disasterRecovery:
    automated: true
When enabled:
- The command etcdadm check is called periodically by a systemd timer
  - The etcd cluster and each etcd node (=member) is checked by running the etcdctl endpoint health command
- When up to 1/N etcd nodes fail successive health checks, each failed node is removed as an etcd member and then added again as a new member
  - The new member eventually catches up with the data from the etcd cluster
- When more than 1/N etcd nodes fail successive health checks, a disaster recovery process is executed to recover all the etcd nodes from the latest etcd snapshot
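The decision described in the list above can be illustrated with a small sketch. It is not the etcdadm implementation: count_unhealthy and choose_recovery_action are hypothetical helpers, the action labels they print are invented for this example, and the threshold is passed in by the caller rather than computed. The sketch also performs a single health pass, not the successive checks mentioned above, and assumes etcdctl can reach each endpoint with whatever TLS settings your cluster requires already configured in the environment.

```shell
# Sketch: count unhealthy endpoints, then choose a recovery action.
# Both helpers and the printed labels are hypothetical illustrations.

# Print how many of the given endpoints fail "etcdctl endpoint health".
count_unhealthy() {
  unhealthy=0
  for ep in "$@"; do
    ETCDCTL_API=3 etcdctl --endpoints="$ep" endpoint health >/dev/null 2>&1 \
      || unhealthy=$((unhealthy + 1))
  done
  echo "$unhealthy"
}

# $1 = number of unhealthy members, $2 = caller-supplied threshold.
choose_recovery_action() {
  if [ "$1" -le 0 ]; then
    echo "none"
  elif [ "$1" -le "$2" ]; then
    echo "replace-members"          # remove and re-add each failed member
  else
    echo "restore-from-snapshot"    # recover all nodes from the latest snapshot
  fi
}

# Usage (endpoints and threshold are placeholders):
#   n=$(count_unhealthy "<endpoint-1>" "<endpoint-2>" "<endpoint-3>")
#   choose_recovery_action "$n" 1
```

Keeping the threshold as a parameter avoids baking an assumption about cluster size into the example.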