
Published 2022-08-10 12:11:52
Kubernetes Scheduling - Learn Affinity & Anti-Affinity
In a Kubernetes cluster, there is a process called kube-scheduler which is responsible for matching a Pod with a Node. It determines which Node is the best fit by computing a combined score based on things like resource requests and limits (set in the Pod spec), node constraints, affinity and anti-affinity rules, taints and tolerations, and so on.
The node with the highest score is selected and the Pod is bound to it. The kubelet agent running on that node then makes sure that the Pod gets up and running.
Usually, this is transparent to users and you don't need to make any effort to control the process. Behind the scenes, the cluster does quite a bit of computation to determine where each Pod should be scheduled.
There are use cases when you need to control where a Pod should be scheduled.
For instance, you may want the Pod to get scheduled on a specific node type.
Other requirements could be that a Pod needs a specific storage type, to run in a particular availability zone, to be placed on a GPU node, to run alongside another application that runs in another pod on a specific node, etc.
In this article, we will learn how to control the scheduling of Pods onto different nodes based on different criteria.
Requirements
This article assumes that you have a running cluster. It won't instruct you on how to set up a Kubernetes cluster.
In the examples below, a Linode Kubernetes Engine (LKE) cluster is used, but you can run any cluster you want. It can be a local one like minikube, k3d, or Minishift, or a managed cluster in the public cloud such as EKS, GKE, AKS, or, as in my case, LKE. The most important thing is that you have full access to the cluster so you can run kubectl commands.
The article won't teach you the fundamentals of the kubectl CLI, but even with limited knowledge of kubectl you should be fine just following along.
Labels and Selectors
Before going further, let's learn about labels and selectors.
Labels in Kubernetes are key-value pairs and are very handy for grouping and managing resources.
With labels, you basically add metadata to objects and resources. If you work as a Kubernetes administrator, this can make your life easier since it lets you group and manage your resources in a structured way.
Here are some examples of labels that you could define:
- owner=snoopy
- env=dev
- env=prod
- disktype=ssd
The following list shows the requirements for label values:
- 63 characters or less
- can be empty
- if not empty, must begin and end with an alphanumeric character ([a-z0-9A-Z])
- can contain:
  - dashes (-)
  - underscores (_)
  - dots (.)
  - alphanumerics
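As a quick illustration of how labels are managed in practice, here is a minimal sketch (assuming a Pod named nginx already exists; the name is just an example):
# add a label
$ kubectl label pod nginx env=dev
# change an existing label (requires --overwrite)
$ kubectl label pod nginx env=prod --overwrite
# remove the label again by appending a dash to the key
$ kubectl label pod nginx env-
# verify the result
$ kubectl get pod nginx --show-labels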
You can list objects in Kubernetes and their labels. In this example, we will list nodes and corresponding labels (there are many labels assigned to the nodes by default):
$ kubectl get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
lke68498-106277-62f385c4572c Ready <none> 51m v1.23.6 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g6-standard-1,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-central,kubernetes.io/arch=amd64,kubernetes.io/hostname=lke68498-106277-62f385c4572c,kubernetes.io/os=linux,lke.linode.com/pool-id=106277,node.kubernetes.io/instance-type=g6-standard-1,topology.kubernetes.io/region=eu-central,topology.linode.com/region=eu-central
lke68498-106277-62f385c4b3c0 Ready <none> 51m v1.23.6 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g6-standard-1,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-central,kubernetes.io/arch=amd64,kubernetes.io/hostname=lke68498-106277-62f385c4b3c0,kubernetes.io/os=linux,lke.linode.com/pool-id=106277,node.kubernetes.io/instance-type=g6-standard-1,topology.kubernetes.io/region=eu-central,topology.linode.com/region=eu-central
lke68498-106277-62f385c51015 Ready <none> 50m v1.23.6 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g6-standard-1,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-central,kubernetes.io/arch=amd64,kubernetes.io/hostname=lke68498-106277-62f385c51015,kubernetes.io/os=linux,lke.linode.com/pool-id=106277,node.kubernetes.io/instance-type=g6-standard-1,topology.kubernetes.io/region=eu-central,topology.linode.com/region=eu-central
You can also list the nodes with the values of specific label keys shown as columns (you can add several keys):
$ kubectl get nodes -L=env,az
NAME STATUS ROLES AGE VERSION ENV AZ
lke68498-106277-62f385c4572c Ready <none> 22h v1.23.6 prod az-1
lke68498-106277-62f385c4b3c0 Ready <none> 22h v1.23.6 dev az-2
lke68498-106277-62f385c51015 Ready <none> 22h v1.23.6 az-2
Selectors allow you to filter objects based on the labels assigned to them. Many third-party operators use label selectors to react to changes at the object level, and Kubernetes itself relies on selectors internally to filter objects and resources. As a Kubernetes administrator, you can do the same to structure your resources in a sensible way; for instance, your automation can make changes to objects based on their labels.
Labels and selectors can manage objects like:
- Pods
- Deployments
- Nodes
- Services
- Secrets
- Ingress Resource
- Namespaces
In the following example, we have a matchLabels selector that holds a map of key-value pairs.
These key-value pairs can then be used when you write matchExpressions rules, which we will learn about soon.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: cache
Example of matchExpressions
where we're specifying a PodAntiAffinity rule based on the app: cache
key-value pair:
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - cache
You could also write simple queries using the selector flag:
$ kubectl get pods -o wide --selector app=cache
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
redis-cache-57885c55ff-4br9m 0/1 Pending 0 11h <none> <none> <none> <none>
redis-cache-57885c55ff-xcg9g 1/1 Running 0 11h 10.2.0.14 lke68498-106277-62f385c4b3c0 <none> <none>
redis-cache-57885c55ff-xv69b 1/1 Running 0 11h 10.2.2.11 lke68498-106277-62f385c51015 <none> <none>
Another way is to list the Pods with app shown as a column, selecting only those where it is set to cache:
kubectl get pods \
  --label-columns=app \
  --selector=app=cache
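Label selectors also support set-based expressions, not just plain equality. A few hedged examples, reusing the env label from earlier:
# equality-based selector
$ kubectl get pods -l env=dev
# set-based selector: Pods whose env label is either dev or prod
$ kubectl get pods -l 'env in (dev,prod)'
# Pods that have an env label at all, regardless of its value
$ kubectl get pods -l env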
NodeSelector
This is the simplest way to control where a Pod will be scheduled. Let's have a look at a real-life scenario.
Use Case - nodeSelector - based on node labels
We need to have a pod scheduled on a node that has a particular label. First, let's imagine that you have a big cluster where you run both dev and production workloads. In this scenario, we have nodes that only host dev-related workloads.
Those nodes have fewer resources (CPU and Memory) available compared to the prod nodes.
The labels can look like the following:
env=dev
env=prod
The following diagram illustrates the placement of a Pod based on a label belonging to the worker node named n-1:
Returning to the use case scenario:
To fulfill the requirement, we need to add an appropriate label to the node(s). We will then apply a Pod manifest to our cluster that specifies a nodeSelector of env: dev.
First, let's find out which worker nodes are running in our cluster:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
lke68498-106277-62f385c4572c Ready <none> 33m v1.23.6
lke68498-106277-62f385c4b3c0 Ready <none> 34m v1.23.6
lke68498-106277-62f385c51015 Ready <none> 33m v1.23.6
If you run kubectl get nodes --show-labels, you will get all nodes with all the labels assigned to them, just like the output shown in the Labels and Selectors section above.
Let's imagine that the node lke68498-106277-62f385c51015 is dedicated to the dev environment.
It makes sense to add a label like env=dev to it:
# add the label to the node
kubectl label nodes lke68498-106277-62f385c51015 env=dev
We could list the nodes having env=dev
label with a selector parameter (we only have one):
$ kubectl get nodes --selector env=dev
Output:
NAME STATUS ROLES AGE VERSION
lke68498-106277-62f385c51015 Ready <none> 56m v1.23.6
We'll deploy a Pod on the fly by applying the following Pod Manifest:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: dev
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    env: dev
EOF
❗When specifying labels in your manifests, you need to write those in the following format: key: value
The result:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx 1/1 Running 0 84s 10.2.2.2 lke68498-106277-62f385c51015 <none> <none>
As we can see, the pod got scheduled on the node: lke68498-106277-62f385c51015
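It's worth knowing that nodeSelector is a hard constraint: if no node carried the env=dev label, the Pod would simply stay in Pending. You can see the scheduler's reasoning in the Pod events (nginx is the Pod name from the example above):
$ kubectl describe pod nginx | grep -A 5 Events: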
This was the simplest way to schedule a Pod on a specific node. In the following section, we will learn about affinity and anti-affinity rules.
Affinity and anti-affinity
Affinity and anti-affinity rules provide even more control than the nodeSelector, which is the simplest way of controlling the assignment of a Pod to a node.
The following are characteristics of affinity rules:
- more expressive matching rules
- preference ("soft") rules instead of only hard rules
- rules based on other Pods' labels (not just node labels)
There are two types of affinity:
- Node affinity/anti-affinity
- Pod affinity/anti-affinity
Node affinity/anti-affinity
Node affinity is a set of rules the scheduler follows to decide where a Pod should be scheduled, based on labels applied to the nodes. Now, the question is, how is this different from the nodeSelector approach that we used in the previous example?
The rule in the Pod spec says that the Pod must be scheduled on a node that has the env=dev key-value pair.
If we look at our cluster, we can see that only two nodes are likely to become candidates.
Now, in the Pod spec, it's also specified that the preferred node instance type is t2.medium
(picked from EC2 AWS instance type list for this simple purpose).
The following diagram illustrates the scenario described. If the worker node called n-2
has enough resources and no constraints, the Pod will most likely be scheduled on this node.
An important thing to remember is that it's not guaranteed that the kube-scheduler will place the Pod on node n-2; the Pod could actually end up on the worker node called n-1.
Worker node n-3 will never be considered as an alternative to host our Pod, since it doesn't match the env=dev label in our matchExpressions.
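As a sketch, such a Pod spec could look like the following. This assumes env is a custom label you applied to the nodes and uses the well-known node.kubernetes.io/instance-type label for the instance type; the t2.medium value is just the example from the scenario:
apiVersion: v1
kind: Pod
metadata:
  name: example-scenario
spec:
  affinity:
    nodeAffinity:
      # hard requirement: only nodes labeled env=dev are candidates
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: env
            operator: In
            values:
            - dev
      # soft preference: favor t2.medium instances if available
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: node.kubernetes.io/instance-type
            operator: In
            values:
            - t2.medium
  containers:
  - name: nginx
    image: nginx
The required block narrows the candidates down to env=dev nodes, while the preferred block only nudges the scheduler towards t2.medium instances.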
Type of rules
- required (requiredDuringSchedulingIgnoredDuringExecution): this rule will be enforced. If the rule cannot be met, the Pod will NOT be scheduled.
- preferred (preferredDuringSchedulingIgnoredDuringExecution): the scheduler will try to apply the rule, but it's not guaranteed, and the Pod WILL BE scheduled on another node if the rule cannot be satisfied.
The required rule is the so-called hard rule and the condition must be met before the Pod gets scheduled.
The preferred rule is the so-called soft rule: the scheduler will try to honor it, but there is no guarantee. If no node can meet the preference, the Pod will simply be scheduled on another node.
To understand the preferred rule better, imagine that you want to schedule a Pod on a less expensive node type. The application running in that Pod doesn't require extensive CPU and memory resources, but it has to run somewhere, so you may add a rule saying that the preferred node type is a spot instance or something very cheap. If the specified node type is not available, the Pod can be scheduled on any other node type. The scheduler will never guarantee that the Pod lands on your preferred choice, but that could be totally fine; by expressing it as a preference, the chances are greater that the Pod gets scheduled fairly quickly.
❗Understand the word IgnoredDuringExecution
in requiredDuringSchedulingIgnoredDuringExecution
or preferredDuringSchedulingIgnoredDuringExecution
The Pod will continue to run on its node even if the rule is no longer satisfied; the running Pod won't be affected.
E.g. if you remove a label from a node where the Pod is running, the Pod will still continue to run on that node for the rest of its lifetime.
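A quick way to observe this behavior, reusing the nginx Pod and the env=dev label from the nodeSelector example (re-add the label afterwards if you want to keep following along):
# remove the label from the node hosting the Pod
$ kubectl label nodes lke68498-106277-62f385c51015 env-
# the Pod is not evicted; it is still Running on the same node
$ kubectl get pod nginx -o wide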
Use Case - nodeAffinity - required rule - <requiredDuringSchedulingIgnoredDuringExecution>
In big Kubernetes cluster environments, it's pretty common that the cluster worker nodes span over several regions and availability zones. In our imaginary environment, our cluster spans over three availability zones.
We have a web app that needs to be placed close to a Postgres DB that is running in Availability Zone 1. The reason is to minimize network latency that can impact customer satisfaction.
Each node in our cluster has a label that tells which availability zone it is running in. Example:
- node n-1: az=az-1
- node n-2: az=az-2
- node n-3: az=az-3
In this use case, the requirement is to schedule a Pod on a node that is located in a specific availability zone. The node should be labeled az=az-1.
We can fake this a little bit, since not every cloud provider, and especially not a local cluster, exposes availability zones or regions. For that reason, we will create our own labels specifying different availability zones.
We will deploy an Nginx pod to our cluster with the required affinity rule that will fulfill the use-case's requirement.
In my cluster, I have the following nodes:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
lke68498-106277-62f385c4572c Ready <none> 127m v1.23.6
lke68498-106277-62f385c4b3c0 Ready <none> 128m v1.23.6
lke68498-106277-62f385c51015 Ready <none> 126m v1.23.6
We will assign a label to a node. We can do that with the following command:
$ kubectl label nodes lke68498-106277-62f385c4572c az=az-1
Output:
node/lke68498-106277-62f385c4572c labeled
We can now check if the node(s) has a specific label by using a selector flag:
$ kubectl get nodes --selector az=az-1
NAME STATUS ROLES AGE VERSION
lke68498-106277-62f385c4572c Ready <none> 130m v1.23.6
We will define and apply a Pod manifest, but before we do that, we will have a look at the affinity rule that will be part of the manifest.
The affinity definition:
spec:
  affinity:
    nodeAffinity: # nodeAffinity holds the required and preferred node affinity scheduling rules
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: az # key is the node label key to match
            operator: In # operator is the type of comparison to perform
            values:
            - az-1 # values is the list of values to compare against the node's label
This is a rule of type required, so the key-value pair az=az-1 must be present on the node before the scheduler can place a Pod that carries this rule in its spec.
Now let's deploy our Nginx app with an affinity rule of type required:
cat <<EoF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: example-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: az
            operator: In
            values:
            - az-1
  containers:
  - name: example-node-affinity
    image: nginx
EoF
We expect the Nginx pod to get scheduled on the node lke68498-106277-62f385c4572c since that node holds the label az=az-1.
With the following command, we will get the status of our Pod. With -o wide
flag, we can also get the node where the Pod is running:
$ kubectl get pod example-node-affinity -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
example-node-affinity 1/1 Running 0 86s 10.2.1.2 lke68498-106277-62f385c4572c <none> <none>
Great, our Pod got scheduled to the correct node according to the required rule that we have defined. This was not much different from the nodeSelector case.
But let's flip this case over a little bit and have a look at the following use case.
Clear Out
Let's delete the az label (az=az-1) from the node lke68498-106277-62f385c4572c:
$ kubectl label nodes lke68498-106277-62f385c4572c az-
❗A tip - you can delete a label from a node by appending a dash (-) to the label key, e.g. kubectl label nodes lke68498-106277-62f385c4572c az-
Use Case - nodeAffinity - preferred rule - <preferredDuringSchedulingIgnoredDuringExecution>
This slightly different use case will be similar to the previous one, with some small but significant differences.
Our application should preferably be running in a zone called az-1
since that zone also hosts a Postgres DB and the latency should be minimal.
If the nodes in the az-1 availability zone are not available or are constrained in some way, the Pod should run on any other node, regardless of which availability zone that node is in.
To simulate this scenario, we actually don't need to apply any az labels; we will fake it a little bit by simply not putting the az=az-1 label on any node.
Another possible way to test this is to actually apply the label, deploy a HorizontalPodAutoscaler object into the cluster with the az=az-1 label on one of the nodes, then run a load generator (like busybox) and let the deployment scale. By doing so, the kube-scheduler will probably pick another node.
The following manifest defines a Pod with the name example-node-affinity-preferred:
cat <<EoF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: example-node-affinity-preferred
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: az
            operator: In
            values:
            - az-1
  containers:
  - name: example-node-affinity-preferred
    image: nginx
EoF
❗If you don't want to fake this use case by leaving out the label, a somewhat crude way to test scheduling is by creating a HorizontalPodAutoscaler resource. It's somewhat beyond the scope of this article, but it should give you an idea of how you could test this use case more properly. Still, there is no guarantee that it will fully work.
This test may be fine if you want to test with the label az=az-1 in place. It can work if you run fairly small worker nodes; otherwise it will most likely not be worth the effort, since bigger worker nodes can probably cope with the load.
Requirement: metrics-server. Your cluster needs to have the metrics-server installed before you can take full advantage of HPA. Installing it is beyond the scope of this article.
If you want to learn more about HorizontalPodAutoscaler
check out this blog post: Horizontal Pod Autoscaler
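To quickly check whether metrics are available in your cluster, you can try something like the following (a sketch; the metrics-server Deployment usually lives in the kube-system namespace, but this can differ per distribution):
$ kubectl get deployment metrics-server -n kube-system
# only returns data once the metrics API is working
$ kubectl top nodes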
Label one of the nodes:
$ kubectl label nodes lke68498-106277-62f385c4572c az=az-1
In this step, we will create a HorizontalPodAutoscaler in our cluster that refers to a Deployment called load-generator:
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: test-load
spec:
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: load-generator
EOF
The next step is to apply a Deployment manifest for the load-generator app:
cat <<EoF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    run: load-generator
  name: load-generator
spec:
  selector:
    matchLabels:
      run: load-generator
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        run: load-generator
    spec:
      containers:
      - image: busybox
        imagePullPolicy: Always
        name: load-generator
        ports:
        - containerPort: 8080
          protocol: TCP
        command: ["/bin/sh", "-c", "while true; do wget -q -O- http://php-apache; done"]
        resources:
          requests:
            cpu: "200m"
          limits:
            cpu: "500m"
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: az
                operator: In
                values:
                - az-1
EoF
In the HPA, we have specified averageUtilization: 50, which means the Deployment is scaled up once average CPU utilization crosses 50% of the requested CPU. After a while, the Pod replicas should ramp up from 1 to 2, 3, and 4 due to the fairly low CPU request.
The load-generator Pod will repeatedly send requests to the php-apache endpoint and generate load.
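While the load is running, you can follow what the HPA and the Deployment are doing; test-load and load-generator are the objects we created above:
$ kubectl get hpa test-load --watch
$ kubectl get deployment load-generator --watch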
If you want to scale down the load-generator, just scale the Deployment down with the following command:
$ kubectl scale --replicas=0 deployment load-generator
deployment.apps/load-generator scaled
load-generator 0/0 0 0 85m
❗When using <preferredDuringSchedulingIgnoredDuringExecution> rules, you also need to specify a field called weight; the node with the highest total score is preferred.
The weight field takes a value between 1 and 100. For every node that satisfies all the other scheduling requirements, the scheduler adds the weight of each matching preference expression to that node's score, together with its other priority scores (such as resource availability). The node with the highest total score gets the highest priority.
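To illustrate how weights steer the choice, here is a sketch that is not part of the earlier examples: two preferences are combined, and a node matching az=az-1 contributes far more to the score than a node matching a hypothetical disktype=ssd label:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 80
  preference:
    matchExpressions:
    - key: az
      operator: In
      values:
      - az-1
- weight: 20
  preference:
    matchExpressions:
    - key: disktype
      operator: In
      values:
      - ssd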
Let's have a look at our Pods. We will use the -o wide flag to see which nodes the Pods are running on:
$ kubectl get pods -o wide | grep example-node-affinity-preferred
example-node-affinity-preferred 1/1 Running 0 9s 10.2.2.6 lke68498-106277-62f385c51015 <none> <none>
Instead of a required rule, we specified a preferred rule this time, which the scheduler will only honor if that is feasible.
If you don't want to run a busybox image with HPA, you can simply delete and re-deploy the Pod spec a number of times; it will eventually end up on a different node.
What we have learned from this use case is that the Pod got scheduled on a different node than the one labeled with az=az-1.
Clear Out
You can delete the labels with the following command:
$ kubectl label node lke68498-106277-62f385c51015 az-
node/lke68498-106277-62f385c51015 unlabeled
More Advanced Use Case - nodeAffinity - combining required and preferred rules
In this use case, we will combine both required and preferred rules, and let the scheduler take a decision based on both expressions.
The requirement is to schedule a Pod to az-1 or az-2. As a preferred rule, we also want to schedule the Pod on a node that has the label env=dev.
Those are the requirements:
- the Pod shall run on nodes labeled az=az-1 or az=az-2
- preferred nodes shall have the label env=dev
There are three worker nodes in my cluster:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
lke68498-106277-62f385c4572c Ready <none> 33m v1.23.6
lke68498-106277-62f385c4b3c0 Ready <none> 34m v1.23.6
lke68498-106277-62f385c51015 Ready <none> 33m v1.23.6
The nodes shall have the following labels:
- lke68498-106277-62f385c4572c:
- az=az-1
- env=prod
- lke68498-106277-62f385c4b3c0:
- az=az-2
- env=dev
- lke68498-106277-62f385c51015
- az=az-2
Label the nodes:
$ kubectl label node lke68498-106277-62f385c4572c az=az-1 env=prod
node/lke68498-106277-62f385c4572c labeled
$ kubectl label node lke68498-106277-62f385c4b3c0 az=az-2 env=dev
node/lke68498-106277-62f385c4b3c0 labeled
$ kubectl label node lke68498-106277-62f385c51015 az=az-2
node/lke68498-106277-62f385c51015 labeled
The following is the Pod manifest that we will apply to the cluster:
cat <<EoF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: example-node-affinity-mixed-rules
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: az
            operator: In
            values:
            - az-1
            - az-2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: env
            operator: In
            values:
            - dev
  containers:
  - name: example-node-affinity-mixed-rules
    image: nginx
EoF
Read the manifest and try to guess which node will be selected by the scheduler, given that there is no particular load on any node. What would be your best guess?
My guess is lke68498-106277-62f385c4b3c0. The reason is that it has both the required and the preferred labels:
- az=az-2
- env=dev
$ kubectl get nodes -L=env,az
NAME STATUS ROLES AGE VERSION ENV AZ
lke68498-106277-62f385c4572c Ready <none> 35h v1.23.6 prod az-1
lke68498-106277-62f385c4b3c0 Ready <none> 35h v1.23.6 dev az-2
lke68498-106277-62f385c51015 Ready <none> 35h v1.23.6 az-2
The result:
$ kubectl get pods -o wide | grep example-node-affinity-mixed
example-node-affinity-mixed 1/1 Running 0 11s 10.2.0.6 lke68498-106277-62f385c4b3c0 <none> <none>
As expected, the pod is running on node lke68498-106277-62f385c4b3c0
Anti-affinity
With the NotIn and DoesNotExist operators, you can achieve anti-affinity behavior.
We will have a look at an example specifying an anti-affinity rule with the NotIn operator.
Use Case - anti-affinity - no scheduling if the label exists
The requirement is to not schedule the Pod on the following nodes:
- lke68498-106277-62f385c51015
- lke68498-106277-62f385c4572c
The nodes are having the following labels, with no changes from the previous use case:
- lke68498-106277-62f385c4572c:
- az=az-1
- env=prod
- lke68498-106277-62f385c4b3c0:
- az=az-2
- env=dev
- lke68498-106277-62f385c51015
- az=az-2
The expression that achieves the anti-affinity behavior looks like the following:
- key: kubernetes.io/hostname
  operator: NotIn
  values:
  - lke68498-106277-62f385c51015
  - lke68498-106277-62f385c4572c
We will apply the full manifest, including this rule, to our cluster:
cat <<EoF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: example-node-affinity-anti-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: az
            operator: In
            values:
            - az-2
          - key: kubernetes.io/hostname
            operator: NotIn
            values:
            - lke68498-106277-62f385c51015
            - lke68498-106277-62f385c4572c
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: env
            operator: In
            values:
            - dev
  containers:
  - name: example-node-affinity-anti-affinity
    image: nginx
EoF
Make a guess on which node the Pod will be scheduled. Again, it's the lke68498-106277-62f385c4b3c0 node, since it has the preferred labels and is not in the list of the NotIn operator.
Result:
$ kubectl get pods -o wide | grep example-node-affinity-anti-affinity
example-node-affinity-anti-affinity 1/1 Running 0 5s 10.2.0.8 lke68498-106277-62f385c4b3c0 <none> <none>
Exactly what we expected. Since we also have a preference rule for the env=dev key-value pair, the Pod got scheduled on lke68498-106277-62f385c4b3c0.
Supported Operators
Node affinity supports the following operators:
- In
- NotIn: gives anti-affinity behavior
- Exists
- DoesNotExist: gives anti-affinity behavior
- Gt: greater than
- Lt: less than
You can use the NotIn and DoesNotExist operators to achieve anti-affinity rules.
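For example, Gt and Lt compare the label value as an integer. A small sketch using a hypothetical cpu-count node label:
- matchExpressions:
  - key: cpu-count
    operator: Gt
    values:
    - "8" # only nodes whose cpu-count label value is greater than 8 match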
Some things to be aware of
The following are the conditions that you should be aware of:
- If you have nodeSelector and nodeAffinity in the same manifest, both conditions must be met before the Pod gets scheduled on a node (see the sketch after this list).
- If you have multiple matchExpressions in a single nodeSelectorTerms entry, all of the expressions must be met.
- If you remove a label while a Pod is running, the Pod will continue running on the node for its lifetime. The affinity rules only apply before scheduling happens.
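A minimal sketch of the first two points, assuming a disktype=ssd node label in addition to the env and az labels used earlier:
spec:
  nodeSelector:
    disktype: ssd            # this AND the nodeAffinity rule below must both be satisfied
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:  # all expressions within a single term must match
          - key: env
            operator: In
            values:
            - dev
          - key: az
            operator: In
            values:
            - az-2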
Pod affinity/anti-affinity
Pod affinity and anti-affinity are very similar to the node affinity we have worked with so far.
They can be used to control the placement of workloads that need to be coupled due to different requirements.
Assume that we need to deploy a Redis in-memory cache server and a web server, and that those two should run close to each other.
For performance reasons, we want each pair of a Redis cache server and a web server to run on the same node.
In the following use case, we will have a look at exactly that kind of scenario.
Use Case - Schedule two different Pods on the same node
This use case is a little bit more advanced compared to the examples we have seen so far.
It is driven by performance constraints that you want to avoid.
Requirements: You have a web server that is using Redis in-memory cache. You need to deploy the following applications:
- web-server
- Redis server
Those two should be close to each other, so ideally the Pods should run on the same nodes.
In addition, you just want to run one Redis Pod per node, so you don't want to schedule two Redis Pods on the same node.
The following deployment manifest will deploy the Redis server, one on each node:
cat <<EoF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: cache
  replicas: 3
  template:
    metadata:
      labels:
        app: cache
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - cache
            topologyKey: "kubernetes.io/hostname" # the anti-affinity applies per topology domain, here per node (hostname)
      containers:
      - name: redis-server
        image: redis:latest
EoF
- replicas: 3: three replicas are specified
- the selector is configured to match app=cache
- podAntiAffinity: matches Pods labeled app=cache
- topologyKey: "kubernetes.io/hostname": makes sure that the next Pod will not end up on the same hostname as an existing app=cache Pod
❗What is topologyKey?
From the official Kubernetes documentation on topologyKey:
"topologyKey is the key of node labels. If two Nodes are labeled with this key and have identical values for that label, the scheduler treats both Nodes as being in the same topology. The scheduler tries to place a balanced number of Pods into each topology domain."
In other words, as part of podAntiAffinity, this makes sure that the next Redis Pod will not end up on the same node as one that is already running.
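As a variation, the same anti-affinity rule can spread Pods per availability zone instead of per node, assuming your nodes carry the standard topology.kubernetes.io/zone label (managed clouds usually set it; in our faked setup you would use the custom az label instead):
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchExpressions:
      - key: app
        operator: In
        values:
        - cache
    topologyKey: "topology.kubernetes.io/zone" # at most one cache Pod per zone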
The result:
$ kubectl get pods -o wide --show-labels | grep redis-cache
redis-cache-5b78df76d7-j7cc2 1/1 Running 0 5m30s 10.2.1.10 lke68498-106277-62f385c4572c <none> <none> app=cache,pod-template-hash=5b78df76d7
redis-cache-5b78df76d7-l5j2r 1/1 Running 0 5m30s 10.2.2.8 lke68498-106277-62f385c51015 <none> <none> app=cache,pod-template-hash=5b78df76d7
redis-cache-5b78df76d7-wzhbt 1/1 Running 0 5m30s 10.2.0.9 lke68498-106277-62f385c4b3c0 <none> <none> app=cache,pod-template-hash=5b78df76d7
As expected, we got our Redis Pods running on different nodes.
In the next phase, we need to deploy our web server Pods. The rule we want to set is that each node should run one web server Pod, co-existing with a Redis cache server.
The full manifest looks like the following:
cat <<EoF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  selector:
    matchLabels:
      app: web-server
  replicas: 3
  template:
    metadata:
      labels:
        app: web-server
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-server
            topologyKey: "kubernetes.io/hostname"
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - cache
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-app
        image: nginx:latest
EoF
Again, we will make sure that we only have one web server Pod per node, so we shouldn't end up with more than one web server Pod running on the same node.
Conditions:
- Pod Affinity: app=cache
- Pod Anti-affinity: app=web-server
We will break the manifest down a little bit. The following part of our Pod spec makes sure that the web server app will never be scheduled on a node where a web-server instance is already running:
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - web-server
        topologyKey: "kubernetes.io/hostname" # one web-server Pod per hostname (node)
The podAffinity part makes sure that the web server runs on the same node as a Redis cache Pod. It looks for the app=cache key-value pair (the label assigned to the Redis cache Pods), so the web-server Pod will only be scheduled on a node where a redis-cache Pod is already running:
podAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchExpressions:
      - key: app
        operator: In
        values:
        - cache
    topologyKey: "kubernetes.io/hostname"
Let's have a look at the result of the web server pod placement:
$ kubectl get pods -o wide --show-labels | grep web-server
web-server-56d5cbb77-hc57b 1/1 Running 0 13s 10.2.0.10 lke68498-106277-62f385c4b3c0 <none> <none> app=web-server,pod-template-hash=56d5cbb77
web-server-56d5cbb77-hvl8s 1/1 Running 0 13s 10.2.2.9 lke68498-106277-62f385c51015 <none> <none> app=web-server,pod-template-hash=56d5cbb77
web-server-56d5cbb77-tllxg 1/1 Running 0 13s 10.2.1.11 lke68498-106277-62f385c4572c <none> <none> app=web-server,pod-template-hash=56d5cbb77
$ kubectl get pods -o wide --show-labels | grep redis-cache
redis-cache-5b78df76d7-j7cc2 1/1 Running 0 5m30s 10.2.1.10 lke68498-106277-62f385c4572c <none> <none> app=cache,pod-template-hash=5b78df76d7
redis-cache-5b78df76d7-l5j2r 1/1 Running 0 5m30s 10.2.2.8 lke68498-106277-62f385c51015 <none> <none> app=cache,pod-template-hash=5b78df76d7
redis-cache-5b78df76d7-wzhbt 1/1 Running 0 5m30s 10.2.0.9 lke68498-106277-62f385c4b3c0 <none> <none> app=cache,pod-template-hash=5b78df76d7
We have learned how to make sure that two applications end up on the same nodes.
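A convenient way to verify the pairing at a glance is to sort the Pods by the node they landed on:
$ kubectl get pods -o wide --sort-by=.spec.nodeName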
Recommendations
We have learned how to control the scheduling of workloads in a Kubernetes cluster.
Be very cautious before you go down the path of creating tons of rules. It may get to a point where they become extremely difficult to manage due to the complexity.
My recommendation is to only use affinity and anti-affinity rules when it's really necessary. Kubernetes does scheduling pretty well on its own.
Before defining the rules, there must be a good reason for doing so. In my opinion, the beauty of Kubernetes is that it's so dynamic.
There are other ways of designing solutions; affinity and anti-affinity rules and node selectors are not the only way to make sure that two applications run next to each other.
Still, hopefully, you have gotten some ideas of how you can make use of those rules in different scenarios.
TL;DR
In this article, you have learned:
- nodeSelector
  - the simplest way to control scheduling
  - how to specify which node the scheduler should choose
- Labels and Selectors
  - how to add labels to nodes and pods
  - how to update and remove labels
  - how to use labels
- Node affinity
  - affinity and anti-affinity rules
  - preferred rules
  - required rules
- Pod affinity
  - affinity and anti-affinity rules
  - topologyKey: how to control scheduling based on the same topology
In addition, we have looked at some use cases to be able to understand the concepts better.
In the next post, we will cover the taints and tolerations in Kubernetes.
About the Author
Aleksandro Matejic, a Cloud Architect, began working in the IT industry over 20 years ago as a technical consultant at an IT consultancy firm in southern Sweden. Since then, he has worked in various companies and industries in various architect roles. In his spare time, Aleksandro develops and runs devoriales.com, a blog and learning platform launched in 2022. In addition, he likes to read and write technical articles about software development and DevOps methods and tools. You can contact Aleksandro by paying a visit to his LinkedIn Profile.