
Published 2022-08-10 12:11:52
Kubernetes Scheduling - Learn Affinity & Anti-Affinity
In a Kubernetes cluster, there is a process called kube-scheduler which is responsible for matching a Pod with a Node. It determines which Node is the best fit by computing a combined score based on things like resource requests and limits (set in the Pod spec), node constraints, affinity and anti-affinity rules, taints and tolerations, and so on.
The node with the highest score is selected and the Pod is bound to it. The kubelet agent running on that node then makes sure that the Pod gets up and running.
Usually, this is transparent to users and you don't need to make any effort to control the process. Behind the scenes, the cluster does quite a bit of computation to determine where each Pod should be scheduled.
There are use cases when you need to control where a Pod should be scheduled.
For instance, you may want the Pod to get scheduled on a specific node type.
Other requirements could be that a Pod needs a specific storage type, to run in a particular availability zone, to be placed on a GPU node, to run alongside another application that runs in another pod on a specific node, etc.
In this article, we will learn how to control the scheduling of Pods onto different nodes based on different criteria.
Requirements
This article assumes that you have a running cluster. It won't instruct you on how to set up a Kubernetes cluster.
In the examples below, a Linode Kubernetes Engine (LKE) cluster is used, but you can run any cluster you want. It can be a local one like minikube, k3d, or Minishift, or a managed cluster in the public cloud such as EKS, GKE, AKS, or, as in my case, LKE. The most important thing is that you have full access to the cluster so you can run kubectl commands.
The article won't teach you the fundamentals of the kubectl CLI, but even with limited knowledge of kubectl you should be fine just following along.
Labels and Selectors
Before going further, let's learn about labels and selectors.
Labels in Kubernetes are key-value pairs and are very handy for grouping and managing resources.
With labels, you basically add metadata to objects and resources. If you work as a Kubernetes administrator, this can make your life easier since it lets you group and manage your resources in a structured way.
Here are some examples of labels that you could define:
- owner=snoopy
- env=dev
- env=prod
- disktype=ssd
The following list shows the requirements for label values:
- 63 characters or less
- can be empty
- if not empty, must begin and end with an alphanumeric character ([a-z0-9A-Z])
- can contain:
  - dashes (-)
  - underscores (_)
  - dots (.)
  - alphanumerics
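As a quick illustration of how labels are managed in practice, here is a minimal sketch (assuming a Pod named nginx already exists; the name is just an example):
# add a label
$ kubectl label pod nginx env=dev
# change an existing label (requires --overwrite)
$ kubectl label pod nginx env=prod --overwrite
# remove the label again by appending a dash to the key
$ kubectl label pod nginx env-
# verify the result
$ kubectl get pod nginx --show-labels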
You can list objects in Kubernetes and their labels. In this example, we will list nodes and corresponding labels (there are many labels assigned to the nodes by default):
$ kubectl get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
lke68498-106277-62f385c4572c Ready <none> 51m v1.23.6 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g6-standard-1,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-central,kubernetes.io/arch=amd64,kubernetes.io/hostname=lke68498-106277-62f385c4572c,kubernetes.io/os=linux,lke.linode.com/pool-id=106277,node.kubernetes.io/instance-type=g6-standard-1,topology.kubernetes.io/region=eu-central,topology.linode.com/region=eu-central
lke68498-106277-62f385c4b3c0 Ready <none> 51m v1.23.6 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g6-standard-1,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-central,kubernetes.io/arch=amd64,kubernetes.io/hostname=lke68498-106277-62f385c4b3c0,kubernetes.io/os=linux,lke.linode.com/pool-id=106277,node.kubernetes.io/instance-type=g6-standard-1,topology.kubernetes.io/region=eu-central,topology.linode.com/region=eu-central
lke68498-106277-62f385c51015 Ready <none> 50m v1.23.6 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g6-standard-1,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-central,kubernetes.io/arch=amd64,kubernetes.io/hostname=lke68498-106277-62f385c51015,kubernetes.io/os=linux,lke.linode.com/pool-id=106277,node.kubernetes.io/instance-type=g6-standard-1,topology.kubernetes.io/region=eu-central,topology.linode.com/region=eu-central
You can also list the nodes with the values of specific label keys shown as columns (you can add several keys):
$ kubectl get nodes -L=env,az
NAME STATUS ROLES AGE VERSION ENV AZ
lke68498-106277-62f385c4572c Ready <none> 22h v1.23.6 prod az-1
lke68498-106277-62f385c4b3c0 Ready <none> 22h v1.23.6 dev az-2
lke68498-106277-62f385c51015 Ready <none> 22h v1.23.6 az-2
Selectors allow you to filter objects based on the labels assigned to them. Many third-party operators use label selectors to react to changes at the object level, and Kubernetes itself relies on selectors internally to filter objects and resources. As a Kubernetes administrator, you can do the same to structure your resources in a sensible way; for instance, your automation can make changes to objects based on their labels.
Labels and selectors can manage objects like:
- Pods
- Deployments
- Nodes
- Services
- Secrets
- Ingress Resource
- Namespaces
In the following example, we have a matchLabels selector that holds a map of key-value pairs.
These key-value pairs can then be used when you write matchExpressions rules, which we will learn about soon.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: cache
Example of matchExpressions
where we're specifying a PodAntiAffinity rule based on the app: cache
key-value pair:
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - cache
You could also write simple queries using the selector flag:
$ kubectl get pods -o wide --selector app=cache
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
redis-cache-57885c55ff-4br9m 0/1 Pending 0 11h <none> <none> <none> <none>
redis-cache-57885c55ff-xcg9g 1/1 Running 0 11h 10.2.0.14 lke68498-106277-62f385c4b3c0 <none> <none>
redis-cache-57885c55ff-xv69b 1/1 Running 0 11h 10.2.2.11 lke68498-106277-62f385c51015 <none> <none>
Another way is to list the Pods with app shown as a column, selecting only those where it is set to cache:
kubectl get pods \
  --label-columns=app \
  --selector=app=cache
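Label selectors also support set-based expressions, not just plain equality. A few hedged examples, reusing the env label from earlier:
# equality-based selector
$ kubectl get pods -l env=dev
# set-based selector: Pods whose env label is either dev or prod
$ kubectl get pods -l 'env in (dev,prod)'
# Pods that have an env label at all, regardless of its value
$ kubectl get pods -l env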
NodeSelector
This is the simplest way to control where a Pod will be scheduled. Let's have a look at a real-life scenario.
Use Case - nodeSelector - based on node labels
We need to have a pod scheduled on a node that has a particular label. First, let's imagine that you have a big cluster where you run both dev and production workloads. In this scenario, we have nodes that only host dev-related workloads.
Those nodes have fewer resources (CPU and Memory) available compared to the prod nodes.
The labels can look like the following:
env=dev
env=prod
The following diagram illustrates the placement of a Pod based on a label belonging to the worker node named n-1:
Returning to the use case scenario:
To fulfill the requirement, we need to add an appropriate label to the node(s). We will then apply a Pod manifest to our cluster that specifies a nodeSelector of env: dev.
First, let's find out which worker nodes are running in our cluster:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
lke68498-106277-62f385c4572c Ready <none> 33m v1.23.6
lke68498-106277-62f385c4b3c0 Ready <none> 34m v1.23.6
lke68498-106277-62f385c51015 Ready <none> 33m v1.23.6
If you run kubectl get nodes --show-labels, you will get all nodes with all the labels assigned to them, just like the output shown in the Labels and Selectors section above.
Let's imagine that the node lke68498-106277-62f385c51015 is dedicated to the dev environment.
It makes sense to add a label like env=dev to it:
# add the label to the node
kubectl label nodes lke68498-106277-62f385c51015 env=dev
We could list the nodes having env=dev
label with a selector parameter (we only have one):
$ kubectl get nodes --selector env=dev
Output:
NAME STATUS ROLES AGE VERSION
lke68498-106277-62f385c51015 Ready <none> 56m v1.23.6
We'll deploy a Pod on the fly by applying the following Pod Manifest:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: dev
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    env: dev
EOF
❗When specifying labels in your manifests, you need to write those in the following format: key: value
The result:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx 1/1 Running 0 84s 10.2.2.2 lke68498-106277-62f385c51015 <none> <none>
As we can see, the pod got scheduled on the node: lke68498-106277-62f385c51015
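It's worth knowing that nodeSelector is a hard constraint: if no node carried the env=dev label, the Pod would simply stay in Pending. You can see the scheduler's reasoning in the Pod events (nginx is the Pod name from the example above):
$ kubectl describe pod nginx | grep -A 5 Events: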
This was the simplest way to schedule a Pod on a specific node. In the following section, we will learn about affinity and anti-affinity rules.
Affinity and anti-affinity
Affinity and anti-affinity rules provide even more control than the nodeSelector, which is the simplest way of controlling the assignment of a Pod to a node.
The following are characteristics of affinity rules:
- more expressive matching rules
- preference ("soft") rules instead of only hard rules
- rules based on other Pods' labels (not just node labels)
There are two types of affinity:
- Node affinity/anti-affinity
- Pod affinity/anti-affinity
Node affinity/anti-affinity
Node affinity is a set of rules the scheduler follows to decide where a Pod should be scheduled, based on labels applied to the nodes. Now, the question is, how is this different from the nodeSelector approach that we used in the previous example?
The rule in the Pod spec says that the Pod must be scheduled on a node that has the env=dev key-value pair.
If we look at our cluster, we can see that only two nodes are likely to become candidates.
Now, in the Pod spec, it's also specified that the preferred node instance type is t2.medium
(picked from EC2 AWS instance type list for this simple purpose).
The following diagram illustrates the scenario described. If the worker node called n-2
has enough resources and no constraints, the Pod will most likely be scheduled on this node.
An important thing to remember is that it's not guaranteed that the kube-scheduler will place the Pod on node n-2; the Pod could actually end up on the worker node called n-1.
Worker node n-3 will never be considered as an alternative to host our Pod, since it doesn't match the env=dev label in our matchExpressions.
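As a sketch, such a Pod spec could look like the following. This assumes env is a custom label you applied to the nodes and uses the well-known node.kubernetes.io/instance-type label for the instance type; the t2.medium value is just the example from the scenario:
apiVersion: v1
kind: Pod
metadata:
  name: example-scenario
spec:
  affinity:
    nodeAffinity:
      # hard requirement: only nodes labeled env=dev are candidates
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: env
            operator: In
            values:
            - dev
      # soft preference: favor t2.medium instances if available
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: node.kubernetes.io/instance-type
            operator: In
            values:
            - t2.medium
  containers:
  - name: nginx
    image: nginx
The required block narrows the candidates down to env=dev nodes, while the preferred block only nudges the scheduler towards t2.medium instances.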
Type of rules
- required (requiredDuringSchedulingIgnoredDuringExecution): this rule will be enforced. If the rule cannot be met, the Pod will NOT be scheduled.
- preferred (preferredDuringSchedulingIgnoredDuringExecution): the scheduler will try to apply the rule, but it's not guaranteed, and the Pod WILL BE scheduled on another node if the rule cannot be satisfied.
The required rule is the so-called hard rule and the condition must be met before the Pod gets scheduled.
The preferred rule is the so-called soft rule: the scheduler will try to honor it, but there is no guarantee. If no node can meet the preference, the Pod will simply be scheduled on another node.
To understand the preferred rule better, imagine that you want to schedule a Pod on a less expensive node type. The application running in that Pod doesn't require extensive CPU and memory resources, but it has to run somewhere, so you may add a rule saying that the preferred node type is a spot instance or something very cheap. If the specified node type is not available, the Pod can be scheduled on any other node type. The scheduler will never guarantee that the Pod lands on your preferred choice, but that could be totally fine; by expressing it as a preference, the chances are greater that the Pod gets scheduled fairly quickly.
❗Understand the word IgnoredDuringExecution
in requiredDuringSchedulingIgnoredDuringExecution
or preferredDuringSchedulingIgnoredDuringExecution
The Pod will continue to run on its node even if the rule is no longer satisfied; the running Pod won't be affected.
E.g. if you remove a label from a node where the Pod is running, the Pod will still continue to run on that node for the rest of its lifetime.
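A quick way to observe this behavior, reusing the nginx Pod and the env=dev label from the nodeSelector example (re-add the label afterwards if you want to keep following along):
# remove the label from the node hosting the Pod
$ kubectl label nodes lke68498-106277-62f385c51015 env-
# the Pod is not evicted; it is still Running on the same node
$ kubectl get pod nginx -o wide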
Use Case - nodeAffinity - required rule - <requiredDuringSchedulingIgnoredDuringExecution>
In big Kubernetes cluster environments, it's pretty common that the cluster worker nodes span over several regions and availability zones. In our imaginary environment, our cluster spans over three availability zones.
We have a web app that needs to be placed close to a Postgres DB that is running in Availability Zone 1. The reason is to minimize network latency that can impact customer satisfaction.
Each node in our cluster has a label that tells which availability zone it is running in. Example:
- node n-1: az=az-1
- node n-2: az=az-2
- node n-3: az=az-3
In this use case, the requirement is to schedule a Pod on a node that is located in a specific availability zone. The node should be labeled az=az-1.
We can fake this a little bit, since not every cloud provider, and especially not a local cluster, exposes availability zones or regions. For that reason, we will create our own labels specifying different availability zones.
We will deploy an Nginx pod to our cluster with the required affinity rule that will fulfill the use-case's requirement.
In my cluster, I have the following nodes:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
lke68498-106277-62f385c4572c Ready <none> 127m v1.23.6
lke68498-106277-62f385c4b3c0 Ready <none> 128m v1.23.6
lke68498-106277-62f385c51015 Ready <none> 126m v1.23.6
We will assign a label to a node. We can do that with the following command:
$ kubectl label nodes lke68498-106277-62f385c4572c az=az-1
Output:
node/lke68498-106277-62f385c4572c labeled
We can now check if the node(s) has a specific label by using a selector flag:
$ kubectl get nodes --selector az=az-1
NAME STATUS ROLES AGE VERSION
lke68498-106277-62f385c4572c Ready <none> 130m v1.23.6
We will define and apply a Pod manifest, but before we do that, we will have a look at the affinity rule that will be part of the manifest.
The affinity definition:
spec:
  affinity:
    nodeAffinity: # nodeAffinity holds the required and preferred node affinity scheduling rules
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: az # key is the node label key to match
            operator: In # operator is the type of comparison to perform
            values:
            - az-1 # values is the list of values to compare against the node's label
This is a rule of type required, so the key-value pair az=az-1 must be present on the node before the scheduler can place a Pod that carries this rule in its spec.
Now let's deploy our Nginx app with an affinity rule of type required:
cat <<EoF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: example-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: az
            operator: In
            values:
            - az-1
  containers:
  - name: example-node-affinity
    image: nginx
EoF
We expect the Nginx pod to get scheduled on the node lke68498-106277-62f385c4572c since that node holds the label az=az-1.
With the following command, we will get the status of our Pod. With -o wide
flag, we can also get the node where the Pod is running:
$ kubectl get pod example-node-affinity -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
example-node-affinity 1/1 Running 0 86s 10.2.1.2 lke68498-106277-62f385c4572c <none> <none>
Great, our Pod got scheduled to the correct node according to the required rule that we have defined. This was not much different from the nodeSelector case.
But let's flip this case over a little bit and have a look at the following use case.
Clear Out
Let's delete the az label (az=az-1) from the node lke68498-106277-62f385c4572c:
$ kubectl label nodes lke68498-106277-62f385c4572c az-
❗A tip - you can delete a label from a node by appending a dash (-) to the label key, e.g. kubectl label nodes lke68498-106277-62f385c4572c az-
Use Case - nodeAffinity - preferred rule - <preferredDuringSchedulingIgnoredDuringExecution>
This slightly different use case will be similar to the previous one, with some small but significant differences.
Our application should preferably be running in a zone called az-1
since that zone also hosts a Postgres DB and the latency should be minimal.
If the nodes in the az-1 availability zone are not available or are constrained in some way, the Pod should run on any other node, regardless of which availability zone that node is in.
To simulate this scenario, we actually don't need to apply any az labels; we will fake it a little bit by simply not putting the az=az-1 label on any node.
Another possible way to test this is to actually apply the label, deploy a HorizontalPodAutoscaler object into the cluster with the az=az-1 label on one of the nodes, then run a load generator (like busybox) and let the deployment scale. By doing so, the kube-scheduler will probably pick another node.
The following manifest defines a Pod with the name example-node-affinity-preferred:
cat <<EoF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: example-node-affinity-preferred
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: az
            operator: In
            values:
            - az-1
  containers:
  - name: example-node-affinity-preferred
    image: nginx
EoF
❗If you don't want to fake this use case by leaving out the label, a somewhat crude way to test scheduling is by creating a HorizontalPodAutoscaler resource. It's somewhat beyond the scope of this article, but it should give you an idea of how you could test this use case more properly. Still, there is no guarantee that it will fully work.
This test may be fine if you want to test with the label az=az-1 in place. It can work if you run fairly small worker nodes; otherwise it will most likely not be worth the effort, since bigger worker nodes can probably cope with the load.
Requirement: metrics-server. Your cluster needs to have the metrics-server installed before you can take full advantage of HPA. Installing it is beyond the scope of this article.
If you want to learn more about HorizontalPodAutoscaler
check out this blog post: Horizontal Pod Autoscaler
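To quickly check whether metrics are available in your cluster, you can try something like the following (a sketch; the metrics-server Deployment usually lives in the kube-system namespace, but this can differ per distribution):
$ kubectl get deployment metrics-server -n kube-system
# only returns data once the metrics API is working
$ kubectl top nodes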
Label one of the nodes:
$ kubectl label nodes lke68498-106277-62f385c4572c az=az-1
In this step, we will create a HorizontalPodAutoscaler in our cluster that refers to a Deployment called load-generator:
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: test-load
spec:
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: load-generator
EOF
The next step is to apply a Deployment manifest for the load-generator app:
cat <<EoF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    run: load-generator
  name: load-generator
spec:
  selector:
    matchLabels:
      run: load-generator
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        run: load-generator
    spec:
      containers:
      - image: busybox
        imagePullPolicy: Always
        name: load-generator
        ports:
        - containerPort: 8080
          protocol: TCP
        command: ["/bin/sh", "-c", "while true; do wget -q -O- http://php-apache; done"]
        resources:
          requests:
            cpu: "200m"
          limits:
            cpu: "500m"
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: az
                operator: In
                values:
                - az-1
EoF
In the HPA, we have specified averageUtilization: 50, which means the Deployment is scaled up once average CPU utilization crosses 50% of the requested CPU. After a while, the Pod replicas should ramp up from 1 to 2, 3, and 4 due to the fairly low CPU request.
The load-generator Pod will repeatedly send requests to the php-apache endpoint and generate load.
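While the load is running, you can follow what the HPA and the Deployment are doing; test-load and load-generator are the objects we created above:
$ kubectl get hpa test-load --watch
$ kubectl get deployment load-generator --watch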
If you want to scale down the load-generator, just scale the Deployment down with the following command:
$ kubectl scale --replicas=0 deployment load-generator
deployment.apps/load-generator scaled
load-generator 0/0 0 0 85m
❗When using <preferredDuringSchedulingIgnoredDuringExecution> rules, you also need to specify a field called weight; the node with the highest total score is preferred.
The weight field takes a value between 1 and 100. For every node that satisfies all the other scheduling requirements, the scheduler adds the weight of each matching preference expression to that node's score, together with its other priority scores (such as resource availability). The node with the highest total score gets the highest priority.
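To illustrate how weights steer the choice, here is a sketch that is not part of the earlier examples: two preferences are combined, and a node matching az=az-1 contributes far more to the score than a node matching a hypothetical disktype=ssd label:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 80
  preference:
    matchExpressions:
    - key: az
      operator: In
      values:
      - az-1
- weight: 20
  preference:
    matchExpressions:
    - key: disktype
      operator: In
      values:
      - ssd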
Let's have a look at our Pods. We will use the -o wide flag to see which nodes the Pods are running on:
$ kubectl get pods -o wide | grep example-node-affinity-preferred
example-node-affinity-preferred 1/1 Running 0 9s 10.2.2.6 lke68498-106277-62f385c51015 <none> <none>
Instead of a required rule, we specified a preferred rule this time, which the scheduler will only honor if that is feasible.
If you don't want to run a busybox image with HPA, you can simply delete and re-deploy the Pod spec a number of times; it will eventually end up on a different node.
What we have learned from this use case is that the Pod got scheduled on a different node than the one labeled with az=az-1.
Clear Out
You can delete the labels with the following command:
$ kubectl label node lke68498-106277-62f385c51015 az-
node/lke68498-106277-62f385c51015 unlabeled
More Advanced Use Case - nodeAffinity - combining required and preferred rules
In this use case, we will combine both required and preferred rules, and let the scheduler take a decision based on both expressions.
The requirement is to schedule a Pod to az-1 or az-2. As a preferred rule, we also want to schedule the Pod on a node that has the label env=dev.
Those are the requirements:
- the Pod shall run on nodes labeled az=az-1 or az=az-2
- preferred nodes shall have the label env=dev
There are three worker nodes in my cluster:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
lke68498-106277-62f385c4572c Ready <none> 33m v1.23.6
lke68498-106277-62f385c4b3c0 Ready <none> 34m v1.23.6
lke68498-106277-62f385c51015 Ready <none> 33m v1.23.6
The nodes shall have the following labels:
- lke68498-106277-62f385c4572c:
- az=az-1
- env=prod
- lke68498-106277-62f385c4b3c0:
- az=az-2
- env=dev
- lke68498-106277-62f385c51015
- az=az-2
Label the nodes:
$ kubectl label node lke68498-106277-62f385c4572c az=az-1 env=prod
node/lke68498-106277-62f385c4572c labeled
$ kubectl label node lke68498-106277-62f385c4b3c0 az=az-2 env=dev
node/lke68498-106277-62f385c4b3c0 labeled
$ kubectl label node lke68498-106277-62f385c51015 az=az-2
node/lke68498-106277-62f385c51015 labeled
The following is the Pod manifest that we will apply to the cluster:
cat <<EoF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: example-node-affinity-mixed-rules
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: az
            operator: In
            values:
            - az-1
            - az-2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: env
            operator: In
            values:
            - dev
  containers:
  - name: example-node-affinity-mixed-rules
    image: nginx
EoF
Read the manifest and try to guess which node will be selected by the scheduler, given that there is no particular load on any node. What would be your best guess?
My guess is lke68498-106277-62f385c4b3c0. The reason is that it has both the required and the preferred labels:
- az=az-2
- env=dev
$ kubectl get nodes -L=env,az
NAME STATUS ROLES AGE VERSION ENV AZ
lke68498-106277-62f385c4572c Ready <none> 35h v1.23.6 prod az-1
lke68498-106277-62f385c4b3c0 Ready <none> 35h v1.23.6 dev az-2
lke68498-106277-62f385c51015 Ready <none> 35h v1.23.6 az-2
The result:
$ kubectl get pods -o wide | grep example-node-affinity-mixed
example-node-affinity-mixed 1/1 Running 0 11s 10.2.0.6 lke68498-106277-62f385c4b3c0 <none> <none>
As expected, the pod is running on node lke68498-106277-62f385c4b3c0
Anti-affinity
With the NotIn and DoesNotExist operators, you can achieve anti-affinity behavior.
We will have a look at an example specifying an anti-affinity rule with the NotIn operator.
Use Case - anti-affinity - no scheduling if the label exists
The requirement is to not schedule the Pod on the following nodes:
- lke68498-106277-62f385c51015
- lke68498-106277-62f385c4572c
The nodes are having the following labels, with no changes from the previous use case:
- lke68498-106277-62f385c4572c:
- az=az-1
- env=prod
- lke68498-106277-62f385c4b3c0:
- az=az-2
- env=dev
- lke68498-106277-62f385c51015
- az=az-2
The expression that achieves the anti-affinity behavior looks like the following:
- key: kubernetes.io/hostname
  operator: NotIn
  values:
  - lke68498-106277-62f385c51015
  - lke68498-106277-62f385c4572c
We will apply the full manifest, including this rule, to our cluster:
cat <<EoF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: example-node-affinity-anti-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: az
            operator: In
            values:
            - az-2
          - key: kubernetes.io/hostname
            operator: NotIn
            values:
            - lke68498-106277-62f385c51015
            - lke68498-106277-62f385c4572c
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: env
            operator: In
            values:
            - dev
  containers:
  - name: example-node-affinity-anti-affinity
    image: nginx
EoF
Make a guess on which node the Pod will be scheduled. Again, it's the lke68498-106277-62f385c4b3c0 node, since it has the preferred labels and is not in the list of the NotIn operator.
Result:
$ kubectl get pods -o wide | grep example-node-affinity-anti-affinity
example-node-affinity-anti-affinity 1/1 Running 0 5s 10.2.0.8 lke68498-106277-62f385c4b3c0 <none> <none>
Exactly what we expected. Since we also have a preference rule for the env=dev key-value pair, the Pod got scheduled on lke68498-106277-62f385c4b3c0.
Supported Operators
Node affinity supports the following operators:
- In
- NotIn: gives anti-affinity behavior
- Exists
- DoesNotExist: gives anti-affinity behavior
- Gt: greater than
- Lt: less than
You can use the NotIn and DoesNotExist operators to achieve anti-affinity rules.
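For example, Gt and Lt compare the label value as an integer. A small sketch using a hypothetical cpu-count node label:
- matchExpressions:
  - key: cpu-count
    operator: Gt
    values:
    - "8" # only nodes whose cpu-count label value is greater than 8 match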
Some things to be aware of
The following are the conditions that you should be aware of:
- If you have nodeSelector and nodeAffinity in the same manifest, both conditions must be met before the Pod gets scheduled on a node (see the sketch after this list).
- If you have multiple matchExpressions in a single nodeSelectorTerms entry, all of the expressions must be met.
- If you remove a label while a Pod is running, the Pod will continue running on the node for its lifetime. The affinity rules only apply before scheduling happens.
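A minimal sketch of the first two points, assuming a disktype=ssd node label in addition to the env and az labels used earlier:
spec:
  nodeSelector:
    disktype: ssd            # this AND the nodeAffinity rule below must both be satisfied
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:  # all expressions within a single term must match
          - key: env
            operator: In
            values:
            - dev
          - key: az
            operator: In
            values:
            - az-2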
Pod affinity/anti-affinity
Pod affinity and anti-affinity are very similar to the node affinity we have worked with so far.
They can be used to control the placement of workloads that need to be coupled due to different requirements.
Assume that we need to deploy a Redis in-memory cache server and a web server, and that those two should run close to each other.
For performance reasons, we want each pair of a Redis cache server and a web server to run on the same node.
In the following use case, we will have a look at exactly that kind of scenario.
Use Case - Schedule two different Pods on the same node
This use case is a little bit more advanced compared to the examples we have seen so far.
It is driven by performance constraints that you want to avoid.
Requirements: You have a web server that is using Redis in-memory cache. You need to deploy the following applications:
- web-server
- Redis server
Those two should be close to each other, so ideally the Pods should run on the same nodes.
In addition, you just want to run one Redis Pod per node, so you don't want to schedule two Redis Pods on the same node.
The following deployment manifest will deploy the Redis server, one on each node:
cat <<EoF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: cache
  replicas: 3
  template:
    metadata:
      labels:
        app: cache
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - cache
            topologyKey: "kubernetes.io/hostname" # the anti-affinity applies per topology domain, here per node (hostname)
      containers:
      - name: redis-server
        image: redis:latest
EoF
- replicas: 3: three replicas are specified
- the selector is configured to match app=cache
- podAntiAffinity: matches Pods labeled app=cache
- topologyKey: "kubernetes.io/hostname": makes sure that the next Pod will not end up on the same hostname as an existing app=cache Pod
❗What is topologyKey?
From the official Kubernetes documentation on topologyKey:
"topologyKey is the key of node labels. If two Nodes are labeled with this key and have identical values for that label, the scheduler treats both Nodes as being in the same topology. The scheduler tries to place a balanced number of Pods into each topology domain."
In other words, as part of podAntiAffinity, this makes sure that the next Redis Pod will not end up on the same node as one that is already running.
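As a variation, the same anti-affinity rule can spread Pods per availability zone instead of per node, assuming your nodes carry the standard topology.kubernetes.io/zone label (managed clouds usually set it; in our faked setup you would use the custom az label instead):
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchExpressions:
      - key: app
        operator: In
        values:
        - cache
    topologyKey: "topology.kubernetes.io/zone" # at most one cache Pod per zone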
The result:
$ kubectl get pods -o wide --show-labels | grep redis-cache
redis-cache-5b78df76d7-j7cc2 1/1 Running 0 5m30s 10.2.1.10 lke68498-106277-62f385c4572c <none> <none> app=cache,pod-template-hash=5b78df76d7
redis-cache-5b78df76d7-l5j2r 1/1 Running 0 5m30s 10.2.2.8 lke68498-106277-62f385c51015 <none> <none> app=cache,pod-template-hash=5b78df76d7
redis-cache-5b78df76d7-wzhbt 1/1 Running 0 5m30s 10.2.0.9 lke68498-106277-62f385c4b3c0 <none> <none> app=cache,pod-template-hash=5b78df76d7
As expected, we got our Redis Pods running on different nodes.
In the next phase, we need to deploy our web server Pods. The rule we want to set is that each node should run one web server Pod, co-existing with a Redis cache server.
The full manifest looks like the following:
cat <<EoF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  selector:
    matchLabels:
      app: web-server
  replicas: 3
  template:
    metadata:
      labels:
        app: web-server
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-server
            topologyKey: "kubernetes.io/hostname"
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - cache
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-app
        image: nginx:latest
EoF
Again, we will make sure that we only have one web server Pod per node, so we shouldn't end up with more than one web server Pod running on the same node.
Conditions:
- Pod Affinity: app=cache
- Pod Anti-affinity: app=web-server
We will break the manifest down a little bit. The following part of our Pod spec makes sure that the web server app will never be scheduled on a node where a web-server instance is already running:
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - web-server
        topologyKey: "kubernetes.io/hostname" # one web-server Pod per hostname (node)
The podAffinity part makes sure that the web server runs on the same node as a Redis cache Pod. It looks for the app=cache key-value pair (the label assigned to the Redis cache Pods), so the web-server Pod will only be scheduled on a node where a redis-cache Pod is already running:
podAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchExpressions:
      - key: app
        operator: In
        values:
        - cache
    topologyKey: "kubernetes.io/hostname"
Let's have a look at the result of the web server pod placement:
$ kubectl get pods -o wide --show-labels | grep web-server
web-server-56d5cbb77-hc57b 1/1 Running 0 13s 10.2.0.10 lke68498-106277-62f385c4b3c0 <none> <none> app=web-server,pod-template-hash=56d5cbb77
web-server-56d5cbb77-hvl8s 1/1 Running 0 13s 10.2.2.9 lke68498-106277-62f385c51015 <none> <none> app=web-server,pod-template-hash=56d5cbb77
web-server-56d5cbb77-tllxg 1/1 Running 0 13s 10.2.1.11 lke68498-106277-62f385c4572c <none> <none> app=web-server,pod-template-hash=56d5cbb77
$ kubectl get pods -o wide --show-labels | grep redis-cache
redis-cache-5b78df76d7-j7cc2 1/1 Running 0 5m30s 10.2.1.10 lke68498-106277-62f385c4572c <none> <none> app=cache,pod-template-hash=5b78df76d7
redis-cache-5b78df76d7-l5j2r 1/1 Running 0 5m30s 10.2.2.8 lke68498-106277-62f385c51015 <none> <none> app=cache,pod-template-hash=5b78df76d7
redis-cache-5b78df76d7-wzhbt 1/1 Running 0 5m30s 10.2.0.9 lke68498-106277-62f385c4b3c0 <none> <none> app=cache,pod-template-hash=5b78df76d7
We have learned how to make sure that two applications end up on the same nodes.
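A convenient way to verify the pairing at a glance is to sort the Pods by the node they landed on:
$ kubectl get pods -o wide --sort-by=.spec.nodeName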
Recommendations
We have learned how to control the scheduling of workloads in a Kubernetes cluster.
Be very cautious before you go down the path of creating tons of rules. It may get to a point where they become extremely difficult to manage due to the complexity.
My recommendation is to only use affinity and anti-affinity rules when it's really necessary. Kubernetes does scheduling pretty well on its own.
Before defining the rules, there must be a good reason for doing so. In my opinion, the beauty of Kubernetes is that it's so dynamic.
There are other ways of designing solutions; affinity and anti-affinity rules and node selectors are not the only way to make sure that two applications run next to each other.
Still, hopefully, you have gotten some ideas of how you can make use of those rules in different scenarios.
TL;DR
In this article, you have learned:
- nodeSelector
  - the simplest way to control scheduling
  - how to specify which node the scheduler should choose
- Labels and Selectors
  - how to add labels to nodes and pods
  - how to update and remove labels
  - how to use labels
- Node affinity
  - affinity and anti-affinity rules
  - preferred rules
  - required rules
- Pod affinity
  - affinity and anti-affinity rules
  - topologyKey: how to control scheduling based on the same topology
In addition, we have looked at some use cases to be able to understand the concepts better.
In the next post, we will cover the taints and tolerations in Kubernetes.
About the Author
Aleksandro Matejic, a Cloud Architect, began working in the IT industry over 20 years ago as a technical consultant at an IT consultancy firm in southern Sweden. Since then, he has worked in various companies and industries in various architect roles. In his spare time, Aleksandro develops and runs devoriales.com, a blog and learning platform launched in 2022. In addition, he likes to read and write technical articles about software development and DevOps methods and tools. You can contact Aleksandro by paying a visit to his LinkedIn Profile.