k8s is really complex...
How Linux makes containers possible
- Linux namespaces
- Linux control groups (cgroups) (commonly managed through systemd)
Notes
- Container images are composed of layers, which can be shared and reused across multiple images.
Docker command and arguments
- ENTRYPOINT ["python", "app.py"]
- CMD ["-w", "4"]
ENTRYPOINT defines the executable; CMD supplies default arguments that can be overridden. Arguments passed to `docker run <image>` replace CMD, so this container runs `python app.py -w 4` by default.
Kubernetes
Components
Master
- API server
- Scheduler
- Controller manager
- etcd
Worker
- Container runtime
- Kubelet
- kube-proxy
Services
- solve the problem of pods' ephemeral IPs by providing a single, stable IP in front of a group of pods
Type
- ClusterIP: internal network
- LoadBalancer: external access
When to use multiple containers in a Pod?
- Do they need to be run together or can they run on different hosts?
- Do they represent a single whole or are they independent components?
- Must they be scaled together or individually?
Namespace
Namespaces do not isolate nodes, and network isolation depends on the networking solution deployed with k8s.
Liveness Probes
spec.containers[0].livenessProbe
- HTTP GET probe
- TCP Socket probe
- Exec probe
If the liveness probe fails, the container is killed and restarted (according to the pod's restartPolicy).
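A minimal liveness probe sketch (the image name and /healthz path are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: liveness-demo        # hypothetical name
spec:
  containers:
  - name: app
    image: example/app:1.0   # placeholder image
    livenessProbe:
      httpGet:
        path: /healthz       # assumed health endpoint
        port: 8080
      initialDelaySeconds: 15  # wait before the first probe
      periodSeconds: 10        # probe every 10 seconds
```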
Readiness Probes
- Exec
- HTTP GET
- TCP Socket
If the readiness probe fails, the pod is removed from the service's endpoints (the container is not restarted).
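A readiness probe sketch; this nests under the same container entry as the liveness example above, and the checked file is a placeholder convention:

```yaml
# goes under spec.containers[0], next to livenessProbe
readinessProbe:
  exec:
    command: ["ls", "/var/ready"]   # ready only while this placeholder file exists
  initialDelaySeconds: 5
  periodSeconds: 10
```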
ReplicationController
keeps the desired number of pods running
- label selector
- replica count
- pod template
a pod can be moved out of the controller's scope by changing its labels
ReplicaSet
similar to a ReplicationController, but with more expressive label matching (matchLabels and matchExpressions)
Always use ReplicaSet instead of ReplicationController.
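A ReplicaSet sketch using matchExpressions (names and image are placeholders):

```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: app-rs               # hypothetical name
spec:
  replicas: 3
  selector:
    matchExpressions:        # richer than an RC's equality-only selector
    - key: app
      operator: In
      values: ["web"]
  template:
    metadata:
      labels:
        app: web             # must satisfy the selector above
    spec:
      containers:
      - name: app
        image: example/app:1.0   # placeholder image
```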
DaemonSet
Runs one pod on every cluster node (even unschedulable nodes, since DaemonSet pods bypass the scheduler).
Job
The pod terminates when the job finishes successfully and is not restarted.
spec.template.spec.restartPolicy:
- Always (the default; must be changed for a Job)
- OnFailure
- Never
sequentially: spec.completions: n (run the pod to completion n times, one after another)
parallel: spec.parallelism: n (run up to n pods at the same time)
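A Job sketch combining the fields above (names and image are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-demo               # hypothetical name
spec:
  completions: 5                 # run 5 pods to completion in total
  parallelism: 2                 # at most 2 pods at a time
  template:
    spec:
      restartPolicy: OnFailure   # Always is not allowed for Jobs
      containers:
      - name: worker
        image: example/worker:1.0   # placeholder image
```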
CronJob
a Job that runs on a cron schedule.
spec.schedule: "<minute> <hour> <day of month> <month> <day of week>"
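A CronJob sketch (names and image are placeholders; older clusters use apiVersion batch/v1beta1):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cron-demo              # hypothetical name
spec:
  schedule: "0,30 * * * *"     # at minute 0 and 30 of every hour
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: worker
            image: example/worker:1.0   # placeholder image
```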
Service
Creates a single, constant point of entry to a group of pods (at the TCP/UDP level).
session affinity: spec.sessionAffinity: ClientIP (default: None) (requests on a keep-alive connection always hit the same pod, even when set to None)
Pods started after the Service exists can read its IP and port from environment variables:
- <SERVICE_NAME>_SERVICE_HOST
- <SERVICE_NAME>_SERVICE_PORT
Dashes in the service name will be converted to underscores and all letters are uppercased.
FQDN (fully qualified domain name):
<service_name>.<namespace>.svc.cluster.local
"svc.cluster.local" can be omitted; if the client is in the same namespace, the namespace part can be omitted too.
spec.type
- ExternalName: clients connecting to this service are redirected (via a DNS CNAME) straight to an external endpoint
- NodePort: each node opens a port and redirects traffic to the underlying service
- LoadBalancer: a load balancer provisioned by the cloud infrastructure k8s is running on
spec.externalTrafficPolicy
- Local: traffic is forwarded only to pods on the node it arrives at (if that node has no local pod, the connection hangs; load balancing also happens per node rather than across all pods)
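A NodePort service sketch combining the fields above (names, labels, and ports are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-nodeport               # hypothetical name
spec:
  type: NodePort
  externalTrafficPolicy: Local     # keep traffic on the node it arrives at
  selector:
    app: web                       # assumed pod label
  ports:
  - port: 80                       # service port inside the cluster
    targetPort: 8080               # container port
    nodePort: 30080                # port opened on every node (30000-32767)
```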
Endpoints
An Endpoints resource can point a (selector-less) service at external endpoints.
metadata.name must match the service name
IPs are listed in subsets.addresses.
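A sketch of a selector-less service backed by a manual Endpoints resource (names and the IP are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: external-db          # hypothetical name
spec:
  ports:                     # no selector, so no endpoints are auto-created
  - port: 5432
---
apiVersion: v1
kind: Endpoints
metadata:
  name: external-db          # must match the service name
subsets:
- addresses:
  - ip: 203.0.113.10         # placeholder external IP
  ports:
  - port: 5432
```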
Ingress
Operates at the HTTP level (enabling cookie- or header-based session affinity); L4 support is also planned.
Ingress needs an ingress controller (e.g. Nginx) to do the load balancing.
The ingress controller doesn't forward requests through the service; it only uses the service to select a pod and sends requests to the pod directly.
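An Ingress sketch (the host and backing service name are placeholders):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress          # hypothetical name
spec:
  rules:
  - host: app.example.com    # placeholder host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web        # assumed backing service
            port:
              number: 80
```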
Headless Service
set "spec.clusterIP: None" to get a headless service.
With headless services, DNS will return the pods' IPs directly. It still provides load balancing across pods, but through the DNS round-robin mechanism instead of through the service proxy.
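A headless service sketch (name and label are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-headless         # hypothetical name
spec:
  clusterIP: None            # headless: DNS returns the pod IPs directly
  selector:
    app: web                 # assumed pod label
  ports:
  - port: 80
```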
Volumes
types:
- emptyDir: lifetime is tied to the pod (disk or memory)
- hostPath: mount directories from the worker node's filesystem (DaemonSet)
- gitRepo: init by checking out the contents of a Git repo
- nfs: NFS share mounted into the pod
- gcePersistentDisk(GCE), awsElasticBlockStore(AWS), azureDisk(Azure)
- cinder, cephfs, iscsi, flocker, glusterfs, quobyte, rbd, flexVolume, vsphereVolume, photonPersistentDisk, scaleIO: other types of network storage
- configMap, secret, downwardAPI: used to expose certain K8s resources and cluster information (metadata not data)
- persistentVolumeClaim: pre- or dynamically provisioned persistent storage
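An emptyDir sketch (names and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: emptydir-demo        # hypothetical name
spec:
  containers:
  - name: app
    image: example/app:1.0   # placeholder image
    volumeMounts:
    - name: scratch
      mountPath: /tmp/scratch
  volumes:
  - name: scratch
    emptyDir:
      medium: Memory         # tmpfs; omit this line to use disk instead
```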
PersistentVolume
set up by the cluster admin as a pool of persistent storage.
A PV is still backed by one of the underlying volume types (NFS, GCE PD, etc.).
- capacity
- accessModes
- persistentVolumeReclaimPolicy (Retain or Delete)
PVs don't belong to any namespace.
Mode:
- RWO: ReadWriteOnce
- ROX: ReadOnlyMany
- RWX: ReadWriteMany
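A PV sketch assuming an NFS backing (server and path are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-demo              # hypothetical name; PVs are cluster-wide
spec:
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  nfs:                       # assumed backing; any volume type works here
    server: 10.0.0.1         # placeholder server
    path: /exports/demo      # placeholder path
```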
PersistentVolumeClaim
- resources
- accessModes
- storageClassName
PVC can only be created in a specific namespace.
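A matching PVC sketch (the name is a placeholder; the empty storageClassName binds to a pre-provisioned PV instead of triggering dynamic provisioning):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-demo             # hypothetical name; PVCs are namespaced
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: ""       # bind to a pre-provisioned PV
```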
StorageClass
StorageClass resources aren't namespaced. They provision PVs dynamically, so it's impossible to run out of PVs (though not of actual storage space).
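A StorageClass sketch assuming the GCE PD provisioner (name and parameters are placeholders):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast                        # hypothetical name
provisioner: kubernetes.io/gce-pd   # assumes GCE; use your infrastructure's provisioner
parameters:
  type: pd-ssd                      # provisioner-specific parameter
```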
Environment Variables
spec.containers[*]:
- command: overrides ENTRYPOINT
- args: overrides CMD
- env[*] {name: value}: environment variables
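A sketch of all three fields (names, image, and values are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: env-demo             # hypothetical name
spec:
  containers:
  - name: app
    image: example/app:1.0   # placeholder image
    command: ["python", "app.py"]   # overrides the image's ENTRYPOINT
    args: ["-w", "8"]               # overrides the image's CMD
    env:
    - name: LOG_LEVEL        # hypothetical variable
      value: "debug"
```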
ConfigMap
define: key-value pairs in the data section
usage:
- spec.containers[].env[].valueFrom.configMapKeyRef
- spec.containers[*].envFrom.configMapRef
A ConfigMap can also be consumed through a volume.
Mounting a directory hides existing files in that directory (unless you use volumeMount.subPath).
Changes to a ConfigMap are propagated to files mounted through a volume without a restart (environment variables are not updated, and neither are files mounted with subPath).
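A sketch of a ConfigMap consumed both as an env var and as a volume (all names and values are placeholders):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config           # hypothetical name
data:
  log-level: "debug"         # placeholder keys and values
---
apiVersion: v1
kind: Pod
metadata:
  name: configmap-demo
spec:
  containers:
  - name: app
    image: example/app:1.0   # placeholder image
    env:
    - name: LOG_LEVEL
      valueFrom:
        configMapKeyRef:
          name: app-config
          key: log-level
    volumeMounts:
    - name: config
      mountPath: /etc/config # each key becomes a file in this directory
  volumes:
  - name: config
    configMap:
      name: app-config
```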
Secrets
The contents of a Secret's entries are shown as base64 encoded strings.
Maximum size is limited to 1MB.
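A Secret sketch; stringData is a write-only convenience field that accepts plain text, which is then stored base64-encoded (name and value are placeholders):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: db-secret     # hypothetical name
stringData:
  password: s3cret    # placeholder; stored (and displayed) base64-encoded
```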
Downward API for metadata
- pod's name, IP, namespace, labels, annotations,
- name of node, name of service account
- CPU and memory requests/limits for each container
These can be passed into pods with environment variables or volumes.
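A Downward API sketch exposing pod metadata through env vars (names and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: downward-demo            # hypothetical name
spec:
  containers:
  - name: app
    image: example/app:1.0       # placeholder image
    resources:
      requests:
        cpu: 100m
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: CPU_REQUEST
      valueFrom:
        resourceFieldRef:
          resource: requests.cpu
          divisor: 1m            # report the value in millicores
```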
Deployment
A Deployment is backed by a ReplicaSet.
During a rolling update, the Deployment creates a new ReplicaSet to manage the new-version pods, so each new version gets its own ReplicaSet (revisionHistoryLimit, 2 by default, controls how many old ones are kept).
kubectl rollout undo deployment <name> --to-revision=1
kubectl rollout history deployment <name>
- maxSurge: how many pod instances are allowed to exist above the desired replica count
- maxUnavailable: how many pod instances are allowed to be unavailable relative to the desired replica count
kubectl rollout pause deployment <name>
kubectl rollout resume deployment <name>
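A Deployment sketch combining the rollout fields above (names and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                  # hypothetical name
spec:
  replicas: 3
  minReadySeconds: 10        # a pod must stay ready this long to count as available
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1            # at most 4 pods exist during the update
      maxUnavailable: 0      # never drop below 3 available pods
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: app
        image: example/app:1.0   # placeholder image
```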
Useful CMD
kubectl explain pod
kubectl explain pod.spec
kubectl exec <pod> -- <cmd>
kubectl exec -it <pod> -- /bin/bash
kubectl run <name> --image=<> --generator=run-pod/v1 --command -- sleep infinity
kubectl get endpoints
kubectl port-forward <name> <port_client>:<port_pod>
kubectl exec downward -- env
kubectl exec downward -- ls -lL /etc/downward
kubectl proxy
kubectl patch deployment <name> -p '{"spec": {"minReadySeconds": 10}}'
Details
GPU scheduling
kubectl label node gpu-node gpu=true
pod.spec.nodeSelector: gpu: "true"
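A pod sketch that targets the labeled GPU node (name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo             # hypothetical name
spec:
  nodeSelector:
    gpu: "true"              # matches the node label added above
  containers:
  - name: app
    image: example/gpu-app:1.0   # placeholder image
```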