Cluster Maintenance
When performing upgrades on worker nodes, it's best practice to:
Drain the node: kubectl drain node01
When the upgrade is completed, uncordon it to make it schedulable again: kubectl uncordon node01
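A minimal sketch of that sequence (node01 is a placeholder node name; --ignore-daemonsets is usually needed when DaemonSet pods are running on the node):

```bash
kubectl drain node01 --ignore-daemonsets   # evict workloads and mark the node unschedulable
# ... perform the upgrade on node01 ...
kubectl uncordon node01                    # make the node schedulable again
```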
When you download a new Kubernetes release, it comes with all the core Kubernetes components at the same version, although some components such as etcd and CoreDNS ship with different versions, as they are separate projects.
During a cluster upgrade, here are important aspects to note. Let's take v1.10 as an example, with kube-apiserver at v1.10 acting as the reference version:
controller-manager must be v1.9 or v1.10
kube-scheduler must be v1.9 or v1.10
kubelet must be v1.8, v1.9 or v1.10
kube-proxy must be v1.8, v1.9 or v1.10
This is not the case for kubectl as it can be v1.9, v1.10 or v1.11
This version skew allows us to do live upgrades, component by component if required.
Let's say we are at Kubernetes v1.10. Only the three latest minor releases are supported at any time, so once v1.13 is released, v1.10 becomes unsupported. Just before the v1.13 release is therefore a great time to upgrade.
To upgrade, it's recommended to go one minor version at a time: v1.10 to v1.11, then v1.12, then v1.13.
The Kubernetes upgrade process depends on how the cluster was set up.
If we deployed our cluster using AWS EKS, it can be done within a few clicks.
If we deployed our cluster using the kubeadm tool, it can be done with a few commands.
If we deployed our cluster the hard way (from scratch), then we need to upgrade each component manually ourselves.
We will be upgrading the cluster using the kubeadm tool.
1) Upgrade the master nodes: while this is in progress, management functions are down, although workloads on the worker nodes still serve users.
2) Upgrade the worker nodes, using one of these strategies:
a) Bring all nodes down then back up (has downtime)
b) One node at a time
c) Create new nodes with the newer version and delete the old ones (convenient if on the cloud)
Remember, kubeadm doesn't install or upgrade kubelets, so we need to upgrade them ourselves on each node.
Also, we need to upgrade the kubeadm tool itself before we perform the cluster upgrade, as sketched below.
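A minimal sketch of that order on a control-plane node (the 1.29.0-1.1 package version is only an example; drain/uncordon steps omitted):

```bash
# 1) upgrade kubeadm itself first
sudo apt-get update && sudo apt-get install -y kubeadm=1.29.0-1.1
# 2) plan and apply the control-plane upgrade
sudo kubeadm upgrade plan
sudo kubeadm upgrade apply v1.29.0
# 3) kubeadm does not upgrade the kubelet, so do that separately
sudo apt-get install -y kubelet=1.29.0-1.1 kubectl=1.29.0-1.1
sudo systemctl daemon-reload && sudo systemctl restart kubelet
```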
I will share my experience of trying this out using the docs in the upcoming lab.
Now, I will discuss what happened in my lab. I ran kubeadm upgrade plan, which showed that my cluster version was v1.28.0 and that the latest target version I could upgrade to was v1.28.7, although the question asked me to upgrade to exactly v1.29.0.
I had to switch the Kubernetes package repository to v1.29 by editing the /etc/apt/sources.list.d/kubernetes.list file.
Listing the available package versions afterwards showed v1.29.0-1.1 for kubeadm, which is what I wanted.
I proceeded and used =1.29.0-1.1 throughout the upgrade commands until I finished upgrading the controlplane and uncordoned it.
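For reference, the repository switch and version listing looked roughly like this (the deb line follows the current pkgs.k8s.io layout; double-check the signed-by keyring path against the official install docs):

```bash
# point apt at the v1.29 package repository
vim /etc/apt/sources.list.d/kubernetes.list
# deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.29/deb/ /

sudo apt update
# list the kubeadm versions now available (this is where 1.29.0-1.1 showed up)
sudo apt-cache madison kubeadm
```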
After that, I had to upgrade the worker nodes. The huge mistake I made was trying to run kubectl and kubeadm upgrade commands against the cluster from the worker node as if it were the controlplane; that won't work, since a worker node doesn't run the control plane components.
After entering the worker node, all I had to do was run the same package upgrade commands as for the master node with =1.29.0-1.1, plus one extra command after upgrading kubeadm and before upgrading the kubelet: sudo kubeadm upgrade node, as shown in the Kubernetes worker node upgrade docs.
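Roughly, the worker-node flow looks like this (node01 and the package version are examples; the package repository on the worker must be switched to v1.29 the same way as on the controlplane):

```bash
# from the controlplane: drain the worker first
kubectl drain node01 --ignore-daemonsets

# on the worker node:
sudo apt-get update && sudo apt-get install -y kubeadm=1.29.0-1.1
sudo kubeadm upgrade node
sudo apt-get install -y kubelet=1.29.0-1.1
sudo systemctl daemon-reload && sudo systemctl restart kubelet

# back on the controlplane: make the node schedulable again
kubectl uncordon node01
```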
That's it!
Backing up resource configurations through the kube-apiserver (for example, with a kubectl get all query) only covers a few resource groups; other resource groups must be considered as well.
Some solutions, such as Velero, take care of that for you.
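For context, a kube-apiserver based resource backup and a Velero equivalent look roughly like this (file and backup names are placeholders):

```bash
# query the kube-apiserver for resource definitions (covers only the resource groups returned by "get all")
kubectl get all --all-namespaces -o yaml > all-resources-backup.yaml

# a tool like Velero can back up cluster resources more completely
velero backup create my-cluster-backup
```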
All resources created within the cluster are stored in the etcd server, so it's crucial to take a backup of it.
The etcd server is hosted on the master nodes; when configuring etcd we specify a path where all its data is stored.
Remember to specify the endpoints, CA certificate, etcd certificate and key for authentication and access to the etcd cluster when using snapshot save, as in the example below.
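For a kubeadm cluster where etcd runs as a static pod on the controlplane, a snapshot save typically looks like this (the certificate paths shown are the kubeadm defaults; confirm them against your etcd pod definition):

```bash
ETCDCTL_API=3 etcdctl snapshot save /opt/snapshot-pre-upgrade.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```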
If you are using a managed Kubernetes cluster, you might not have access to the etcd server, so resource backups through the kube-apiserver are the better option.
In this case, we are restoring the snapshot to a different directory but on the same server where we took the backup (the controlplane node). As a result, the only required option for the restore command is the --data-dir.
/etc/kubernetes/manifests/etcd.yaml: We have now restored the etcd snapshot to a new path on the controlplane - /var/lib/etcd-from-backup - so the only change to be made in the YAML file is the hostPath for the volume called etcd-data, from the old directory (/var/lib/etcd) to the new directory (/var/lib/etcd-from-backup).
With this change, /var/lib/etcd on the container points to /var/lib/etcd-from-backup on the controlplane (which is what we want).
When this file is updated, the etcd pod is automatically re-created, as this is a static pod placed under the /etc/kubernetes/manifests directory.
Note 1: As the ETCD pod definition has changed, it will automatically restart, and so will kube-controller-manager, kube-apiserver and kube-scheduler. Wait 1-2 minutes for these pods to restart. You can run the command watch "crictl ps | grep etcd" to see when the ETCD pod has restarted.
Note 2: If the etcd pod is not reaching Ready 1/1, restart it with kubectl delete pod -n kube-system etcd-controlplane and wait 1 minute.
Note 3: This is the simplest way to make sure that ETCD uses the restored data after the ETCD pod is recreated. You don't have to change anything else. If you do change --data-dir to /var/lib/etcd-from-backup in the ETCD YAML file, make sure that the volumeMounts entry for etcd-data is updated as well, with the mountPath pointing to /var/lib/etcd-from-backup.
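A quick way to sanity-check the manifest after the edit (just grepping the static pod file for the volume definition):

```bash
# confirm the etcd-data volume now points at the restored directory
grep -A2 "etcd-data" /etc/kubernetes/manifests/etcd.yaml
```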
When switching to a new cluster and you want to check the control plane component pods, make sure to ssh to that cluster's controlplane node.
If you can't see an etcd pod, the cluster is most probably using an external etcd; describe the kube-apiserver pod to find the external etcd IP, which must be mentioned there, so you can ssh into that server if needed.
Running a command like ps -aux | grep etcd after ssh-ing into the etcd server is crucial for getting information about its endpoints, data dir, certificates, etc.
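A sketch of that flow (the kube-apiserver pod name and the etcd-server hostname are placeholders):

```bash
# on the cluster's controlplane: the --etcd-servers flag reveals the external etcd endpoint
kubectl -n kube-system describe pod kube-apiserver-controlplane | grep etcd-servers

# then ssh into that server and read the flags of the running etcd process
ssh etcd-server
ps -aux | grep etcd
```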
Make sure to utilize the scp command to copy files from one node to another, especially when doing an etcd snapshot backup and restore; examples are shown below.
Make sure to also utilize chown, especially when restoring an etcd snapshot, so the etcd user owns the new data directory.
Make sure to edit the etcd service to update the data-dir after restoring; it is usually found at the /etc/systemd/system/etcd.service path.
Never edit a static pod's live object directly if you want to modify something like the data-dir of the etcd pod; always check the /etc/kubernetes/manifests folder first, and if the manifest is not found there, check /etc/systemd/system/etcd.service (external etcd).
The kubelet always re-creates static pods if they are deleted, so to apply changes to a static pod's manifest, delete the pod and let it come back with the new configuration.
On the student-node, first set the context to cluster1:
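```bash
kubectl config use-context cluster1
```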
Next, inspect the endpoints and certificates used by the etcd pod. We will make use of these to take the backup.
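One way to do that (the pod name etcd-cluster1-controlplane is an assumption based on typical kubeadm naming; check kubectl get pods -n kube-system first):

```bash
kubectl describe pod etcd-cluster1-controlplane -n kube-system \
  | grep -E "advertise-client-urls|cert-file|key-file|trusted-ca-file"
```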
SSH to the controlplane node of cluster1 and then take the backup using the endpoints and certificates we identified above:
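A sketch, assuming the default kubeadm certificate paths and placeholder hostname and output path:

```bash
ssh cluster1-controlplane
ETCDCTL_API=3 etcdctl snapshot save /opt/cluster1.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```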
Finally, copy the backup to the student-node. To do this, go back to the student-node and use scp as shown below:
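```bash
# run from the student-node; hostname and paths are examples
scp cluster1-controlplane:/opt/cluster1.db /opt/
```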
Copy the snapshot file from the student-node to the etcd-server. In the example below, we are copying it to the /root directory:
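```bash
# run from the student-node; copy the snapshot to the etcd-server's /root directory
scp /opt/cluster1.db etcd-server:/root/
```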
Restore the snapshot on cluster2. Since we are restoring directly on the etcd-server, we can use the endpoint https://127.0.0.1. Use the same certificates that were identified earlier. Make sure to use the data-dir as /var/lib/etcd-data-new:
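A sketch of the restore, assuming the snapshot was copied to /root/cluster1.db (pass the endpoint and certificate flags identified earlier as well, if required):

```bash
ETCDCTL_API=3 etcdctl snapshot restore /root/cluster1.db \
  --data-dir /var/lib/etcd-data-new
```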
Update the systemd service unit file for etcd by running vim /etc/systemd/system/etcd.service and add the new value for data-dir:
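Inside the unit file, the change is to the --data-dir flag on the ExecStart line (shown here as a comment sketch):

```bash
vim /etc/systemd/system/etcd.service
# in the [Service] section, on the ExecStart line, point the flag at the restored directory:
#   --data-dir /var/lib/etcd-data-new
```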
Make sure the permissions on the new directory are correct (it should be owned by the etcd user).
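For example (assuming the service runs as the etcd user and group):

```bash
chown -R etcd:etcd /var/lib/etcd-data-new
```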
Finally, reload and restart the etcd service.
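```bash
systemctl daemon-reload
systemctl restart etcd
systemctl status etcd   # optional: confirm the service is active again
```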
etcdctl is a command line client for etcd.
In all our Kubernetes Hands-on labs, the ETCD key-value database is deployed as a static pod on the master. The version used is v3.
To make use of etcdctl for tasks such as back up and restore, make sure that you set the ETCDCTL_API to 3.
You can do this by exporting the variable ETCDCTL_API prior to using the etcdctl client.
To see all the options for a specific sub-command, make use of the -h or --help flag.
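For example:

```bash
export ETCDCTL_API=3
etcdctl snapshot save -h
etcdctl snapshot restore -h
```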