k8s is really complex...

How Linux makes containers possible

  • Linux namespaces
  • Linux control groups (cgroups), typically managed via systemd

Notes

  • Container images are composed of layers, which can be shared and reused across multiple images.

Docker command and arguments

  • ENTRYPOINT ["python", "app.py"]
  • CMD ["-w", "4"]

ENTRYPOINT is the fixed part of the command; CMD supplies default arguments and can be overridden at run time (e.g. by extra arguments passed to docker run).

Kubernetes

Components

Master

  • API server
  • Scheduler
  • Controller manager
  • etcd

Worker

  • Container runtime
  • Kubelet
  • kube-proxy

Services

  • solve the problem of pod IPs constantly changing (pods are ephemeral, so clients need a stable virtual IP)

Service types

  • ClusterIP: internal-only virtual IP
  • LoadBalancer: external access through a cloud load balancer

When to use multiple containers in a Pod?

  • Do they need to be run together or can they run on different hosts?
  • Do they represent a single whole or are they independent components?
  • Must they be scaled together or individually?

Namespace

Namespaces do not provide isolation for nodes or the network (network isolation depends on the networking solution deployed with Kubernetes).

Liveness Probes

spec.containers[0].livenessProbe

  • HTTP GET probe
  • TCP Socket probe
  • Exec probe

If the liveness probe fails, the container is killed and restarted (according to the pod's restart policy).
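
A minimal sketch of an HTTP GET liveness probe (pod and image names are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: liveness-demo
spec:
  containers:
  - name: app
    image: example/app:1.0           # hypothetical image
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15        # give the app time to start before probing
      periodSeconds: 10
      failureThreshold: 3            # restart after 3 consecutive failures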

Readiness Probes

  • Exec
  • HTTP GET
  • TCP Socket

If the readiness probe fails, the pod is removed from the service's endpoints, but it is not restarted.
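
A minimal sketch of an exec readiness probe (names and the readiness marker file are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: readiness-demo
spec:
  containers:
  - name: app
    image: example/app:1.0           # hypothetical image
    readinessProbe:
      exec:
        command: ["cat", "/tmp/ready"]   # hypothetical readiness marker file
      initialDelaySeconds: 5
      periodSeconds: 5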

ReplicationController

Keeps the desired number of pods running at all times.

  • label selector
  • replica count
  • pod template

Pods can be removed from a controller's scope by changing their labels.

ReplicaSet

Similar to a ReplicationController, but with more expressive label matching (matchLabels and matchExpressions).

Always use ReplicaSet instead of ReplicationController.
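
A minimal ReplicaSet sketch using a matchExpressions selector (names are hypothetical):

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: web-rs
spec:
  replicas: 3
  selector:
    matchExpressions:                # richer matching than a ReplicationController's equality-only selector
    - key: app
      operator: In
      values: ["web"]
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: example/web:1.0       # hypothetical image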

DaemonSet

Runs one pod on every cluster node (even nodes marked unschedulable).

Job

Runs pods that terminate once the work completes successfully (they are not restarted afterwards).

spec.template.spec.restartPolicy:

  • Always (the default; must be changed for a Job)
  • OnFailure
  • Never

sequential: spec.completions: n (run n pods to successful completion, one after another)

parallel: spec.parallelism: n (run up to n pods at the same time)
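
A minimal Job sketch combining completions and parallelism (names are hypothetical):

apiVersion: batch/v1
kind: Job
metadata:
  name: batch-demo
spec:
  completions: 5                     # run 5 pods to successful completion in total
  parallelism: 2                     # at most 2 pods at a time
  template:
    spec:
      restartPolicy: OnFailure       # Always is not allowed for Jobs
      containers:
      - name: worker
        image: example/worker:1.0    # hypothetical image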

CronJob

cron job for k8s.

spec.schedule: "<minute> <hour> <day of month> <month> <day of week>" (standard space-separated cron format)
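
A minimal CronJob sketch (batch/v1 on recent clusters, batch/v1beta1 on older ones; names are hypothetical):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-job
spec:
  schedule: "0 3 * * *"              # every day at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: worker
            image: example/worker:1.0   # hypothetical image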

Service

Creates a single, constant point of entry to a group of pods (operates at the TCP/UDP level).

Stick a client to one pod by IP: spec.sessionAffinity: ClientIP (default: None). Note that a keep-alive connection always hits the same pod even when this is set to None.
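
A minimal ClusterIP Service sketch with ClientIP session affinity (names are hypothetical):

apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  sessionAffinity: ClientIP          # default is None
  selector:
    app: web
  ports:
  - port: 80                         # the service's port
    targetPort: 8080                 # the container's port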

Pods started after the Service exists can get its IP and port from environment variables:

  • <SERVICE_NAME>_SERVICE_HOST
  • <SERVICE_NAME>_SERVICE_PORT

Dashes in the service name will be converted to underscores and all letters are uppercased.

FQDN (fully qualified domain name):

<service_name>.<namespace>.svc.cluster.local

"svc.cluster.local" can be omitted. If they are in the same namespace, it can also be omitted.

spec.type

  • ExternalName: the service is a DNS alias (CNAME) for an external endpoint; pods connecting to it reach the external endpoint directly
  • NodePort: each node opens a port and redirects traffic to the underlying service
  • LoadBalancer: an external load balancer provided by the cloud infrastructure Kubernetes is running on

spec.externalTrafficPolicy

  • Local: traffic is only redirected to pods on the node it arrives at (if that node has no matching pod, the connection hangs), and load balancing happens per node rather than across all pods
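
A minimal NodePort Service sketch with the Local traffic policy (names and the node port are hypothetical):

apiVersion: v1
kind: Service
metadata:
  name: web-nodeport
spec:
  type: NodePort
  externalTrafficPolicy: Local       # only route to pods on the node that received the traffic
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 8080
    nodePort: 30080                  # must fall in the cluster's node-port range (30000-32767 by default)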

Endpoints

A Service without a pod selector can point at external endpoints by creating an Endpoints resource manually.

metadata.name must match service name

IPs are listed in subsets.addresses.
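
A sketch of a selector-less Service paired with a manually created Endpoints resource (names and IPs are hypothetical):

apiVersion: v1
kind: Service
metadata:
  name: external-db                  # no selector, so no Endpoints are created automatically
spec:
  ports:
  - port: 5432
---
apiVersion: v1
kind: Endpoints
metadata:
  name: external-db                  # must match the service name
subsets:
- addresses:
  - ip: 10.1.1.10                    # hypothetical external IPs
  - ip: 10.1.1.11
  ports:
  - port: 5432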

Ingress

Operates at the HTTP level, so it can offer features a Service cannot, such as cookie- or header-based session affinity. (L4 support is also planned.)

Ingress needs an ingress controller (for example the NGINX ingress controller) to do the actual load balancing.

The ingress controller doesn't forward requests through the service; it only uses the service to look up the pod endpoints and sends the request to a pod directly.
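
A minimal Ingress sketch using the current networking.k8s.io/v1 API (host and names are hypothetical):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
spec:
  rules:
  - host: example.com                # hypothetical host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web                # hypothetical backing service
            port:
              number: 80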

Headless Service

set "spec.clusterIP: None" to get a headless service.

With headless services, DNS will return the pods' IPs directly. It still provides load balancing across pods, but through the DNS round-robin mechanism instead of through the service proxy.
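
A minimal headless Service sketch (names are hypothetical):

apiVersion: v1
kind: Service
metadata:
  name: web-headless
spec:
  clusterIP: None                    # headless: DNS returns the pod IPs directly
  selector:
    app: web
  ports:
  - port: 80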

Volumes

types:

  • emptyDir: lifetime is tied to the pod (disk or memory)
  • hostPath: mount directories from the worker node's filesystem (mostly useful for DaemonSets)
  • gitRepo: initialized by checking out the contents of a Git repository (deprecated)
  • nfs: NFS share mounted into the pod
  • gcePersistentDisk(GCE), awsElasticBlockStore(AWS), azureDisk(Azure)
  • cinder, cephfs, iscsi, flocker, glusterfs, quobyte, rbd, flexVolume, vsphereVolume, photonPersistentDisk, scaleIO: other types of network storage
  • configMap, secret, downwardAPI: used to expose certain K8s resources and cluster information (metadata not data)
  • persistentVolumeClaim: pre- or dynamically provisioned persistent storage
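
As an example of the first type, a minimal emptyDir sketch backed by memory (names are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: emptydir-demo
spec:
  containers:
  - name: app
    image: example/app:1.0           # hypothetical image
    volumeMounts:
    - name: cache
      mountPath: /cache
  volumes:
  - name: cache
    emptyDir:
      medium: Memory                 # omit this line to use the node's disk instead of tmpfs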

PersistentVolume

Persistent storage set up (provisioned) by the cluster admin.

A PV still needs an underlying volume type (NFS, a cloud disk, etc.) as its backing storage.

  • capacity
  • accessModes
  • persistentVolumeReclaimPolicy (Retain or Delete)

PVs don't belong to any namespace.

Access modes (they describe how many nodes, not pods, can use the volume):

  • RWO: ReadWriteOnce
  • ROX: ReadOnlyMany
  • RWX: ReadWriteMany
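
A minimal PersistentVolume sketch backed by NFS (server and path are hypothetical):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-demo
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 10.0.0.5                 # hypothetical NFS server
    path: /exports/data              # hypothetical export path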

PersistentVolumeClaim

  • resources
  • accessModes
  • storageClassName

PVC can only be created in a specific namespace.
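
A minimal PersistentVolumeClaim sketch (names are hypothetical):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-demo
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: ""               # empty string: bind to a pre-provisioned PV, skip dynamic provisioning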

StorageClass

StorageClass resources aren't namespaced. They provision PVs dynamically, so you can't run out of PVs (only of actual storage space).
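
A minimal StorageClass sketch; the provisioner depends on the cluster's infrastructure (the GCE persistent-disk provisioner here is just one example):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast                         # hypothetical class name
provisioner: kubernetes.io/gce-pd    # use the provisioner matching your infrastructure
parameters:
  type: pd-ssd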

Environment Variables

spec.containers[*]:

  • command: override ENTRYPOINT
  • args: override CMD
  • env[*]{name:value}: environment variables
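
A minimal sketch showing command, args, and env together (names and values are hypothetical); command and args map to the Docker ENTRYPOINT/CMD discussed above:

apiVersion: v1
kind: Pod
metadata:
  name: env-demo
spec:
  containers:
  - name: app
    image: example/app:1.0           # hypothetical image
    command: ["python", "app.py"]    # overrides the image's ENTRYPOINT
    args: ["-w", "4"]                # overrides the image's CMD
    env:
    - name: LOG_LEVEL                # hypothetical environment variable
      value: "debug"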

ConfigMap

define: key-value pairs under the ConfigMap's data field

usage:

  • spec.containers[*].env[*].valueFrom.configMapKeyRef
  • spec.containers[*].envFrom[*].configMapRef

A ConfigMap can also be exposed as a volume (each entry becomes a file).

Mounting a volume into a directory hides the files that already exist in that directory (unless you mount a single entry with volumeMounts[*].subPath).

Changes to a ConfigMap are propagated to running pods without a restart for entries mounted as configMap volume files (but not for subPath mounts or environment variables).
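
A sketch of a ConfigMap consumed both as an environment variable and as a volume (names and values are hypothetical):

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  log-level: debug
---
apiVersion: v1
kind: Pod
metadata:
  name: configmap-demo
spec:
  containers:
  - name: app
    image: example/app:1.0           # hypothetical image
    env:
    - name: LOG_LEVEL
      valueFrom:
        configMapKeyRef:
          name: app-config
          key: log-level
    volumeMounts:
    - name: config
      mountPath: /etc/config         # each ConfigMap entry becomes a file here
  volumes:
  - name: config
    configMap:
      name: app-config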

Secrets

The contents of a Secret's entries are shown as base64 encoded strings.

Maximum size is limited to 1MB.
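
A minimal Secret sketch (names and values are hypothetical); consuming it mirrors ConfigMaps, via secretKeyRef or a secret volume:

apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
stringData:                          # plain-text input; stored and displayed base64 encoded under data
  username: admin
  password: s3cr3t                   # hypothetical value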

Downward API for metadata

  • the pod's name, IP, namespace, labels, and annotations
  • the name of the node and of the service account
  • CPU and memory requests/limits for each container

These can be passed into pods with environment variables or volumes.
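
A sketch exposing a few Downward API fields as environment variables (names are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: downward-demo
spec:
  containers:
  - name: app
    image: example/app:1.0           # hypothetical image
    resources:
      requests:
        cpu: 100m
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
    - name: CPU_REQUEST
      valueFrom:
        resourceFieldRef:
          containerName: app
          resource: requests.cpu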

Deployment

A Deployment is backed by a ReplicaSet.

During a rolling update, the Deployment creates a new ReplicaSet for the new version's pods, so there is one ReplicaSet per revision. Old ReplicaSets are kept for rollbacks, up to revisionHistoryLimit (10 by default in apps/v1).

kubectl rollout history deployment <name>
kubectl rollout undo deployment <name> --to-revision=1
kubectl rollout pause deployment <name>
kubectl rollout resume deployment <name>

spec.strategy.rollingUpdate:

  • maxSurge: how many pod instances are allowed to exist above the desired replica count
  • maxUnavailable: how many pod instances are allowed to be unavailable relative to the desired replica count
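
A minimal Deployment sketch with an explicit rolling-update strategy (names are hypothetical):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                    # at most one extra pod above the desired count during the update
      maxUnavailable: 0              # never drop below the desired count
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: example/web:1.1       # hypothetical image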

Useful CMD

kubectl explain pod
kubectl explain pod.spec

kubectl exec <pod> -- <cmd>
kubectl exec -it <pod> -- /bin/bash

kubectl run <name> --image=<image> --command -- sleep infinity
(older kubectl also required --generator=run-pod/v1; the flag has since been removed)

kubectl get endpoints

kubectl port-forward <name> <port_client>:<port_pod>

kubectl exec downward -- env
kubectl exec downward -- ls -lL /etc/downward

kubectl proxy

kubectl patch deployment <name> -p '{"spec": {"minReadySeconds": 10}}'

Details

GPU scheduling

kubectl label node gpu-node gpu=true
pod.spec.nodeSelector: gpu: "true"
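
A minimal pod sketch that uses the nodeSelector from above (pod and image names are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  nodeSelector:
    gpu: "true"                      # only schedule onto nodes labeled gpu=true
  containers:
  - name: app
    image: example/gpu-app:1.0       # hypothetical image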