
Published 2022-08-10 12:11:52
Kubernetes Scheduling - Learn Affinity & Anti-Affinity
In a Kubernetes cluster, there is a process called kube-scheduler which is responsible for matching a Pod with a Node. The scheduler determines which Node is the best fit by scoring the candidate nodes on things like resource requests and limits (set in the Pod spec), node constraints, affinity and anti-affinity rules, taints and tolerations, etc.
The node that gets the highest score is selected and the Pod is bound to it. Next, the kubelet agent running on that node makes sure that the Pod gets up and running.
Usually, this is completely transparent to users and you don't need to make any effort to control this process. Behind the scenes, the cluster performs quite a bit of computation to determine where each Pod should be scheduled.
There are use cases when you need to control where a Pod should be scheduled.
For instance, you may want the Pod to get scheduled on a specific node type.
Other requirements could be that a Pod needs a specific storage type, to run in a particular availability zone, to be placed on a GPU node, to run alongside another application that runs in another pod on a specific node, etc.
In this article, we will learn how we can control the scheduling of Pods to different nodes based on different criteria.
Requirements
This article assumes that you have a running cluster. It won't instruct you on how to set up a Kubernetes cluster.
In the examples below, a Linode Kubernetes Engine (LKE) cluster is being used, but you can run any cluster you want. It can be a local one like minikube, k3d, or Minishift, or a managed Kubernetes service in the public cloud like EKS, GKE, AKS, or, as in my case, LKE. The most important thing is that you have full access to the cluster so you can run kubectl commands.
The article won't teach you the fundamentals of the kubectl CLI, but even with limited kubectl knowledge you should be fine just following along.
Labels and Selectors
Before going further, let's learn about labels and selectors.
Labels in Kubernetes are key-value pairs and are very handy for grouping and managing resources.
With labels, you basically add metadata to objects and resources. If you work as a Kubernetes administrator, this can make your life easier since it lets you group and manage your resources in a structured way.
Here are some examples of labels that you could define:
- owner=snoopy
- env=dev
- env=prod
- disktype=ssd
The following list shows the requirements for label values:
- 63 characters or less
- can be empty
- if not empty, the value must begin and end with an alphanumeric character ([a-z0-9A-Z])
- in between, it can contain:
  - dashes (-)
  - underscores (_)
  - dots (.)
  - alphanumerics
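As a quick sketch (the node name node-1 below is hypothetical), one of the example labels above could be applied and later changed like this:
# add the label disktype=ssd to a node called node-1 (hypothetical name)
kubectl label nodes node-1 disktype=ssd
# change the value of an existing label; --overwrite is required when the key already exists
kubectl label nodes node-1 disktype=nvme --overwrite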
You can list objects in Kubernetes and their labels. In this example, we will list nodes and corresponding labels (there are many labels assigned to the nodes by default):
$ kubectl get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
lke68498-106277-62f385c4572c Ready <none> 51m v1.23.6 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g6-standard-1,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-central,kubernetes.io/arch=amd64,kubernetes.io/hostname=lke68498-106277-62f385c4572c,kubernetes.io/os=linux,lke.linode.com/pool-id=106277,node.kubernetes.io/instance-type=g6-standard-1,topology.kubernetes.io/region=eu-central,topology.linode.com/region=eu-central
lke68498-106277-62f385c4b3c0 Ready <none> 51m v1.23.6 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g6-standard-1,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-central,kubernetes.io/arch=amd64,kubernetes.io/hostname=lke68498-106277-62f385c4b3c0,kubernetes.io/os=linux,lke.linode.com/pool-id=106277,node.kubernetes.io/instance-type=g6-standard-1,topology.kubernetes.io/region=eu-central,topology.linode.com/region=eu-central
lke68498-106277-62f385c51015 Ready <none> 50m v1.23.6 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g6-standard-1,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-central,kubernetes.io/arch=amd64,kubernetes.io/hostname=lke68498-106277-62f385c51015,kubernetes.io/os=linux,lke.linode.com/pool-id=106277,node.kubernetes.io/instance-type=g6-standard-1,topology.kubernetes.io/region=eu-central,topology.linode.com/region=eu-central
You can also display the values of specific label keys as columns (you can list several keys):
$ kubectl get nodes -L=env,az
NAME STATUS ROLES AGE VERSION ENV AZ
lke68498-106277-62f385c4572c Ready <none> 22h v1.23.6 prod az-1
lke68498-106277-62f385c4b3c0 Ready <none> 22h v1.23.6 dev az-2
lke68498-106277-62f385c51015 Ready <none> 22h v1.23.6 az-2
Selectors allow you to filter objects based on the labels assigned to them. Many third-party operators use label selectors to react to changes at the object level, and Kubernetes itself uses selectors internally to filter objects and resources. As a Kubernetes administrator, you can do the same to structure your resources and objects in a sensible way. For instance, your automation can make changes to objects based on their labels.
Labels and selectors can be used with objects like:
- Pods
- Deployments
- Nodes
- Services
- Secrets
- Ingress Resource
- Namespaces
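As a short example of such filtering on the command line (using the label values from earlier in this article), selectors can be equality-based, set-based, or existence-based:
# equality-based selector: all Pods labeled env=dev
kubectl get pods --selector env=dev
# set-based selector: all Pods whose env label is either dev or prod
kubectl get pods -l 'env in (dev,prod)'
# existence-based selector: all nodes that carry an owner label, regardless of its value
kubectl get nodes -l owner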
In the following example, we have a matchLabels
selector that holds a map of key-value pairs.
These key-value pairs can then be referenced when you write matchExpressions rules, which we will learn about soon.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: cache
Example of matchExpressions
where we're specifying a PodAntiAffinity rule based on the app: cache
key-value pair:
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - cache
        topologyKey: kubernetes.io/hostname
Note that a Pod anti-affinity term also needs a topologyKey (here kubernetes.io/hostname), which defines the domain, such as a node or an availability zone, within which the rule is evaluated.
You could also write simple queries using the selector flag:
$ kubectl get pods -o wide --selector app=cache
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
redis-cache-57885c55ff-4br9m 0/1 Pending 0 11h <none> <none> <none> <none>
redis-cache-57885c55ff-xcg9g 1/1 Running 0 11h 10.2.0.14 lke68498-106277-62f385c4b3c0 <none> <none>
redis-cache-57885c55ff-xv69b 1/1 Running 0 11h 10.2.2.11 lke68498-106277-62f385c51015 <none> <none>
Another way is to list the Pods with app shown as a column, selecting only those where it is set to cache:
kubectl get pods \
  --label-columns=app \
  --selector=app=cache
NodeSelector
This is the simplest way to control where a Pod will be scheduled. Let's have a look at a real-life scenario.
Use Case - nodeSelector - based on node labels
We need to have a pod scheduled on a node that has a particular label. First, let's imagine that you have a big cluster where you run both dev and production workloads. In this scenario, we have nodes that only host dev-related workloads.
Those nodes have fewer resources (CPU and Memory) available compared to the prod nodes.
The labels can look like the following:
env=dev
env=prod
The following diagram illustrates the placement of a Pod based on a label belonging to the worker node named n-1:
Returning to the use case scenario: to fulfill the requirement, we need to add an appropriate label to the node(s). We will then apply a Pod manifest to our cluster that states a nodeSelector with env: dev.
First, let's find out which worker nodes are running in our cluster:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
lke68498-106277-62f385c4572c Ready <none> 33m v1.23.6
lke68498-106277-62f385c4b3c0 Ready <none> 34m v1.23.6
lke68498-106277-62f385c51015 Ready <none> 33m v1.23.6
If you run kubectl get nodes --show-labels, you will get all nodes with all the labels assigned to them (the full output was already shown in the Labels and Selectors section above).
Let's imagine that the node lke68498-106277-62f385c51015 is dedicated to the dev environment. It then makes sense to add a label like env=dev to it:
# add the label to the node
kubectl label nodes lke68498-106277-62f385c51015 env=dev
We can list the nodes that have the env=dev label with the selector parameter (we only have one such node):
$ kubectl get nodes --selector env=dev
Output:
NAME STATUS ROLES AGE VERSION
lke68498-106277-62f385c51015 Ready <none> 56m v1.23.6
We'll deploy a Pod on the fly by applying the following Pod Manifest:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: dev
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    env: dev
EOF
❗When specifying labels in your manifests, you need to write those in the following format: key: value
The result:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx 1/1 Running 0 84s 10.2.2.2 lke68498-106277-62f385c51015 <none> <none>
As we can see, the pod got scheduled on the node: lke68498-106277-62f385c51015
This was the simplest way to schedule a Pod on a specific node. In the following section, we will learn about affinity and anti-affinity rules.
Affinity and anti-affinity
Affinity and anti-affinity rules provide even more control than the nodeSelector, which is the simplest way of controlling the assignment of a Pod to a node.
The following are characteristics of affinity rules:
- you can write more expressive rules
- you can express preferences ("soft" rules) instead of hard requirements
- you can base rules on the labels of other Pods (not just node labels)
There are two types of affinity:
- Node affinity/anti-affinity
- Pod affinity/anti-affinity
Node affinity/anti-affinity
Node affinity is a set of rules the scheduler follows to decide where a Pod should be scheduled. It's based on labels applied to the nodes. Now, the question is: how is this different from the nodeSelector approach that we used in the previous example?
What the rule in the Pod spec says is that the Pod is required to be scheduled on a node that has the env=dev key-value pair.
If we look at our cluster, we can see that only two nodes are candidates.
The Pod spec also specifies that the preferred node instance type is t2.medium (picked from the AWS EC2 instance type list just for this example).
The following diagram illustrates the scenario. If the worker node called n-2 has enough resources and no constraints, the Pod will most likely be scheduled on this node.
An important thing to remember is that it's not guaranteed that the kube-scheduler will schedule the Pod on node n-2; the Pod could end up on the worker node called n-1 instead.
Worker node n-3 will never be considered as a candidate to host our Pod since it doesn't match the env=dev label in our matchExpressions.
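As an illustration (a sketch of my own, not taken from the walk-through above), the described scenario could be expressed like this, assuming the instance type is exposed through the well-known node.kubernetes.io/instance-type node label:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:   # hard rule: only nodes with env=dev
        nodeSelectorTerms:
        - matchExpressions:
          - key: env
            operator: In
            values:
            - dev
      preferredDuringSchedulingIgnoredDuringExecution:  # soft rule: prefer t2.medium nodes
      - weight: 1
        preference:
          matchExpressions:
          - key: node.kubernetes.io/instance-type
            operator: In
            values:
            - t2.medium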
Type of rules
- required <requiredDuringSchedulingIgnoredDuringExecution>: this rule will be enforced. If the rule cannot be met, the Pod will NOT be scheduled.
- preferred <preferredDuringSchedulingIgnoredDuringExecution>: the scheduler will try to apply the rule, but it's not guaranteed, and the Pod WILL BE scheduled on another node if the rule cannot be enforced.
The required rule is the so-called hard rule and the condition must be met before the Pod gets scheduled.
The preferred rule is the so-called soft rule: the scheduler will honor it if possible, but it's not guaranteed. In case no node can meet the preference, the Pod will be scheduled on another node.
To understand the preferred rule better, imagine that you want to schedule a Pod on a less expensive node type. The application running in that Pod doesn't require extensive resources in terms of CPU and Memory, but it has to run somewhere, so you may add a rule saying that the preferred node type is a spot instance or something very cheap. If the specified node type is not available, the Pod can be scheduled on any other node type. The scheduler will never guarantee that the Pod gets scheduled on your preferred choice, but that could be totally fine. By expressing a preference instead of a requirement, the chances are greater that the Pod gets scheduled fairly quickly.
❗Understand the word IgnoredDuringExecution
in requiredDuringSchedulingIgnoredDuringExecution
or preferredDuringSchedulingIgnoredDuringExecution
The Pod will continue to run on the node even if the rule is no longer valid; the running Pod won't be affected.
E.g. if you remove a label from the node where the Pod is running, the Pod will continue to run on that node for the rest of its lifecycle.
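A quick way to observe this behavior (a side experiment, not part of the walk-through), assuming the nginx Pod from the nodeSelector example is still running on lke68498-106277-62f385c51015:
# remove the env label from the node the Pod was scheduled on
kubectl label nodes lke68498-106277-62f385c51015 env-
# the Pod keeps running even though no node matches its nodeSelector anymore
kubectl get pod nginx -o wide
# re-add the label afterwards if you want to keep the dev node labeled
kubectl label nodes lke68498-106277-62f385c51015 env=dev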
Use Case - nodeAffinity - required rule - <requiredDuringSchedulingIgnoredDuringExecution>
In big Kubernetes cluster environments, it's pretty common that the cluster worker nodes span over several regions and availability zones. In our imaginary environment, our cluster spans over three availability zones.
We have a web app that needs to be placed close to a Postgres DB that is running in Availability Zone 1. The reason is to minimize network latency that can impact customer satisfaction.
Each node in our cluster has a label that tells which availability zone it is running in. For example:
- node n-1: az=az-1
- node n-2: az=az-2
- node n-3: az=az-3
In this use case, the requirement is to schedule a Pod on a node located in a specific availability zone, in this case a node labeled az=az-1.
We can fake this a little bit, since not every cloud provider (and certainly not a local cluster) exposes availability zones or regions. For that reason, we will create our own labels representing different availability zones.
We will deploy an Nginx pod to our cluster with the required affinity rule that will fulfill the use-case's requirement.
In my cluster, I have the following nodes:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
lke68498-106277-62f385c4572c Ready <none> 127m v1.23.6
lke68498-106277-62f385c4b3c0 Ready <none> 128m v1.23.6
lke68498-106277-62f385c51015 Ready <none> 126m v1.23.6
We will assign a label to a node. We can do that with the following command:
$ kubectl label nodes lke68498-106277-62f385c4572c az=az-1
Output:
node/lke68498-106277-62f385c4572c labeled
We can now check if the node(s) has a specific label by using a selector flag:
$ kubectl get nodes --selector az=az-1
NAME STATUS ROLES AGE VERSION
lke68498-106277-62f385c4572c Ready <none> 130m v1.23.6
We will define and apply a Pod manifest, but before we do that, we will have a look at the affinity rule that will be part of the manifest.
The affinity definition:
spec:
  affinity:
    nodeAffinity: # nodeAffinity holds the node affinity scheduling rules (required and preferred)
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: az # key is the node label key the rule matches against
            operator: In # operator is the type of comparison to perform
            values:
            - az-1 # values is the list of values to compare with the node's label
This is a rule of type <required>, so the key-value pair az=az-1 must be found on a node before the scheduler is able to place a Pod that carries this rule in its Pod spec.
Now let's deploy our Nginx app with an affinity rule of type <required>:
cat <<EoF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: example-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: az
            operator: In
            values:
            - az-1
  containers:
  - name: example-node-affinity
    image: nginx
EoF
We expect the Nginx pod to get scheduled on the node lke68498-106277-62f385c4572c
since it (the node) holds the label az=az-1
With the following command, we will get the status of our Pod. With the -o wide flag, we also see the node where the Pod is running:
$ kubectl get pod example-node-affinity -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
example-node-affinity 1/1 Running 0 86s 10.2.1.2 lke68498-106277-62f385c4572c <none> <none>
Great, our Pod got scheduled to the correct node according to the required rule that we have defined. This was not much different from the nodeSelector case.
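As a side note: if no node had carried the az=az-1 label, this required rule could not have been satisfied and the Pod would have stayed in the Pending state. You can inspect the scheduling events in such a case with:
# check the Events section at the bottom of the output for scheduling failures
kubectl describe pod example-node-affinity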
But let's flip this case over a little bit and have a look at the following use case.
Clear Out
Let's delete the az=az-1 label from the node lke68498-106277-62f385c4572c:
$ kubectl label nodes lke68498-106277-62f385c4572c az-
❗A tip - you can delete a label from a node by appending a dash (-) to the label key, e.g. kubectl label nodes lke68498-106277-62f385c4572c az-
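To double-check that the label is gone, the selector query from before can be repeated; it should now return an empty list:
# no node should carry az=az-1 anymore
kubectl get nodes --selector az=az-1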
Use Case - nodeAffinity - preferred rule - <preferredDuringSchedulingIgnoredDuringExecution>
This use case is similar to the previous one, with some small but significant differences.
Our application should preferably run in the zone called az-1, since that zone also hosts the Postgres DB and latency should be minimal.
If the nodes in the az-1 availability zone are not available or are constrained in some way, the Pod should run on any other node, regardless of which availability zone that node is placed in.
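A minimal sketch of what such a Pod manifest could look like (my own illustration, using the same az label key as before); note that a preferred rule takes a weight between 1 and 100 and a preference block instead of nodeSelectorTerms:
cat <<EoF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: example-node-affinity-preferred
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1            # relative weight of this preference (1-100)
        preference:
          matchExpressions:
          - key: az
            operator: In
            values:
            - az-1
  containers:
  - name: example-node-affinity-preferred
    image: nginx
EoF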