...
To be able to back up a k8s cluster, first we need the executable file etcdctl, downloadable from here (choose the appropriate release). Also in the compressed file are two other executables, etcd and etcdutl, which may come in handy in the future. After that, unpack the archive file (this results in a directory containing the binaries) and add the executable binaries to your path (e.g. /usr/local/bin).
Code Block |
---|
language | bash |
---|
title | Download binary |
---|
|
# For example, let's download release 3.5.4
$ wget https://github.com/etcd-io/etcd/releases/download/v3.5.4/etcd-v3.5.4-linux-amd64.tar.gz
$ tar xzvf etcd-v3.5.4-linux-amd64.tar.gz
# In addition to the etcdctl executable, we also take etcd and etcdutl
$ sudo cp etcd-v3.5.4-linux-amd64/etcd* /usr/local/bin/
$
# Check that everything is OK
$ etcdctl version
etcdctl version: 3.5.4
API version: 3.5
$ etcdutl version
etcdutl version: 3.5.4
API version: 3.5
$ etcd --version
etcd Version: 3.5.4
Git SHA: 08407ff76
Go Version: go1.16.15
Go OS/Arch: linux/amd64 |
Once we have the executable files, we need the certificates to be able to communicate with the etcd node(s). If you don't know the location of the certificates, you can retrieve it by grepping inside the /etc/kubernetes folder on the master node (the default directory that holds the certificates on the etcd node is /etc/ssl/etcd/ssl), for example as shown below.
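A minimal sketch of that search; the exact file names depend on how the cluster was deployed, so treat the paths below as assumptions to adapt:
Code Block |
---|
language | bash |
---|
title | Locate certificates |
---|
|
# On the master node, search the kube-apiserver manifest for the etcd client certificate paths
$ sudo grep -r "etcd-" /etc/kubernetes/manifests/kube-apiserver.yaml
# On the etcd node, list the default certificate directory
$ ls /etc/ssl/etcd/ssl/ |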
Save the location of the certificates in the following environment variables.
Code Block |
---|
language | bash |
---|
title | etcd certificates |
---|
|
# Insert the following lines inside the ".bashrc" file, then use "source .bashrc" to apply the changes
export ETCDCTL_CERT=/<path>/cert.pem
export ETCDCTL_CACERT=/<path>/ca.pem
export ETCDCTL_KEY=/<path>/key.pem
export ETCDCTL_ENDPOINTS=etcd1:2379,etcd2:2379,etcd3:2379 |
Let's try running some commands to check the status of the etcd cluster.
Code Block |
---|
language | bash |
---|
title | Example commands |
---|
collapse | true |
---|
|
$ etcdctl member list --write-out=table
+------------------+---------+-------+------------------------------+------------------------------+------------+
|        ID        | STATUS  | NAME  |          PEER ADDRS          |         CLIENT ADDRS         | IS LEARNER |
+------------------+---------+-------+------------------------------+------------------------------+------------+
| 10d6fab05b506a11 | started | etcd3 | https://192.168.100.180:2380 | https://192.168.100.180:2379 |      false |
| 263e9aba708b17f7 | started | etcd1 | https://192.168.100.151:2380 | https://192.168.100.151:2379 |      false |
| 74e5f49a4cd290f1 | started | etcd2 | https://192.168.100.88:2380  | https://192.168.100.88:2379  |      false |
+------------------+---------+-------+------------------------------+------------------------------+------------+
$ etcdctl endpoint health --write-out=table
+------------+--------+--------------+-------+
|  ENDPOINT  | HEALTH |     TOOK     | ERROR |
+------------+--------+--------------+-------+
| etcd2:2379 |   true |  10.314162ms |       |
| etcd1:2379 |   true |  10.775429ms |       |
| etcd3:2379 |   true | 114.846224ms |       |
+------------+--------+--------------+-------+
$ etcdctl endpoint status --write-out=table
+------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|  ENDPOINT  |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| etcd1:2379 | 263e9aba708b17f7 |   3.5.4 |   31 MB |     false |      false |         5 |     296505 |             296505 |        |
| etcd2:2379 | 74e5f49a4cd290f1 |   3.5.4 |   30 MB |      true |      false |         5 |     296505 |             296505 |        |
| etcd3:2379 | 10d6fab05b506a11 |   3.5.4 |   30 MB |     false |      false |         5 |     296505 |             296505 |        |
+------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+ |
Save and Restore
Warning |
---|
If you have an etcd cluster, you must select only one node, otherwise you get the error "snapshot must be requested to one selected node, not multiple". Also, unset the ETCDCTL_ENDPOINTS environment variable, if present. |
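For instance, assuming ETCDCTL_ENDPOINTS was exported in .bashrc as shown earlier, a quick way to comply is:
Code Block |
---|
language | bash |
---|
title | Unset endpoints |
---|
|
# Remove the multi-endpoint variable from the current shell
$ unset ETCDCTL_ENDPOINTS
# Remove or comment out the corresponding line in .bashrc to make the change permanent |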
Take a snapshot of the etcd datastore using the following command (official documentation), which generates the <snapshot> file.
Code Block |
---|
language | bash |
---|
title | Save snapshot |
---|
|
$ etcdctl snapshot save <path>/<snapshot> --endpoints=<endpoint>:<port>
# Instead of <endpoint> you can substitute a hostname or an IP
$ etcdctl snapshot save snapshot.db --endpoints=etcd1:2379
$ etcdctl snapshot save snapshot.db --endpoints=192.168.100.88:2379
# View that the snapshot was successful
$ etcdctl snapshot status snapshot.db --write-out=table
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| b89543b8 | 40881777 |      54340 |     106 MB |
+----------+----------+------------+------------+ |
...
To restore a cluster, all that is needed is a single snapshot file (snapshot.db). A cluster restore with etcdctl snapshot restore creates new etcd data directories; all members should restore using the same snapshot. Restoring overwrites some snapshot metadata (specifically, the member ID and cluster ID), so the member loses its former identity. Therefore, in order to start a cluster from a snapshot, the restore must start a new logical cluster.
Now we will use the snapshot backup to restore etcd as shown below. If you want to use a specific data directory for the restore, you can add the location using the --data-dir flag, but the destination directory must be empty and must have write permissions.
Code Block |
---|
language | bash |
---|
title | Restore snapshot |
---|
|
# Copy the snapshot.db file to all etcd nodes
$ scp snapshot.db etcd1:
# Repeat this command for all etcd members, to create the directory
$ etcdctl snapshot restore <path>/<snapshot> [--data-dir <data_dir>] --name etcd1 --initial-cluster etcd1=https://<IP1>:2380,etcd2=https://<IP2>:2380,etcd3=https://<IP3>:2380 --initial-cluster-token <token> --initial-advertise-peer-urls https://<IP1>:2380
# For instance, for the first node
$ etcdctl snapshot restore snapshot.db --name etcd1 --initial-cluster etcd1=https://etcd1:2380,etcd2=https://etcd2:2380,etcd3=https://etcd3:2380 --initial-cluster-token k8s_etcd --initial-advertise-peer-urls https://etcd1:2380 |
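The example above restores into the default data directory (an etcd1.etcd folder under the current directory). If you prefer a dedicated location, the same command can point at an empty directory with --data-dir; /tmp/snap_dir below is just an illustrative path, and the generated member directory is then copied from there in the later step.
Code Block |
---|
language | bash |
---|
title | Restore with --data-dir |
---|
|
# Same restore as above, but writing into an explicit (empty) data directory
$ etcdctl snapshot restore snapshot.db --data-dir /tmp/snap_dir --name etcd1 --initial-cluster etcd1=https://etcd1:2380,etcd2=https://etcd2:2380,etcd3=https://etcd3:2380 --initial-cluster-token k8s_etcd --initial-advertise-peer-urls https://etcd1:2380
# Later, after stopping etcd, copy the generated member directory into the etcd data path
$ sudo cp -r /tmp/snap_dir/member /var/lib/etcd/ |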
Before we continue, let's stop all the API server instances. Then stop the etcd service on the nodes.
Code Block |
---|
language | bash |
---|
title | Pause cluster |
---|
|
# Let's go to the master(s) and temporarily move the "kube-apiserver.yaml" file
$ sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
# Stop etcd service on etcd node(s)
$ sudo systemctl stop etcd.service |
As said, the restore command generates the member directory, which must be copied into the path where the etcd node data are stored (the default path is /var/lib/etcd/).
Code Block |
---|
language | bash |
---|
title | Copy snapshot |
---|
|
# Paste the snapshot into the path where the etcd node data are stored
$ sudo cp -r <path>/<restore>/member/ /var/lib/etcd/
# For each etcd node
$ sudo cp -r $HOME/etcd1.etcd/member/ /var/lib/etcd/ |
Finally, we restart the etcd service on the nodes and restore the API server.
Code Block |
---|
language | bash |
---|
title | Restart cluster |
---|
|
# Start etcd service on etcd node(s)
$ sudo systemctl start etcd.service
# Restore the API server from the master(s)
$ sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/ |
Tip |
---|
It's also recommended to restart any components (e.g. kube-scheduler, kube-controller-manager, kubelet) to ensure that they don't rely on stale data. Note that in practice, the restore takes a bit of time. During the restoration, critical components will lose their leader lock and restart themselves. |
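A sketch of one way to do that on a kubeadm-style control plane, where the scheduler and controller manager run as static pods (the label selectors below are assumptions matching the default kubeadm manifests; adapt them to your deployment):
Code Block |
---|
language | bash |
---|
title | Restart components |
---|
|
# Restart the kubelet on each node
$ sudo systemctl restart kubelet.service
# Recreate the static control plane pods so they pick up the restored state
$ kubectl -n kube-system delete pod -l component=kube-scheduler
$ kubectl -n kube-system delete pod -l component=kube-controller-manager |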