. This is an example config that records spans for 1 in 10000 requests, and uses the default OpenTelemetry endpoint:
apiVersion: apiserver.config.k8s.io/v1beta1
kind: TracingConfiguration
# default value
#endpoint: localhost:4317
samplingRatePerMillion: 100
For more information about the TracingConfiguration struct, see API server config API (v1beta1).
kubelet traces
FEATURE STATE: Kubernetes v1.27 [beta]
The kubelet CRI interface and authenticated http servers are instrumented to generate trace spans. As with the apiserver, the endpoint and sampling rate are configurable. Trace context propagation is also configured. A parent span's sampling decision is always respected. A provided tracing configuration sampling rate will apply to spans without a parent. Enabled without a configured endpoint, the default OpenTelemetry Collector receiver address of "localhost:4317" is set.
Enabling tracing in the kubelet
To enable tracing, apply the tracing configuration. This is an example snippet of a kubelet config that records spans for 1 in 10000 requests, and uses the default OpenTelemetry endpoint:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
KubeletTracing: true
tracing:
# default value
#endpoint: localhost:4317
samplingRatePerMillion: 100
If the samplingRatePerMillion is set to one million (1000000), then every span will be sent to the exporter.
The kubelet in Kubernetes v1.30 collects spans from the garbage collection, pod synchronization routine as well as every gRPC method. The kubelet propagates trace context with gRPC requests so that container runtimes with trace instrumentation, such as CRI-O and containerd, can associate their exported spans with the trace context from the kubelet. The resulting traces will have parent-child links between kubelet and container runtime spans, providing helpful context when debugging node issues.
Please note that exporting spans always comes with a small performance overhead on the networking and CPU side, depending on the overall configuration of the system. If there is any issue like that in a cluster which is running with tracing enabled, then mitigate the problem by either reducing the samplingRatePerMillion or disabling tracing completely by removing the configuration.
Stability
Tracing instrumentation is still under active development, and may change in a variety of ways. This includes span names, attached attributes, instrumented endpoints, etc. Until this feature graduates to stable, there are no guarantees of backwards compatibility for tracing instrumentation.
What's next
Read about Getting Started with the OpenTelemetry Collector
Read about OTLP Exporter Configuration
11.9 - Proxies in Kubernetes
This page explains proxies used with Kubernetes.
Proxies
There are several different proxies you may encounter when using Kubernetes:
The kubectl proxy:
runs on a user's desktop or in a pod
proxies from a localhost address to the Kubernetes apiserver
client to proxy uses HTTP
proxy to apiserver uses HTTPS
locates apiserver
adds authentication headers
The apiserver proxy:
is a bastion built into the apiserver
connects a user outside of the cluster to cluster IPs which otherwise might not be reachable
runs in the apiserver processes
client to proxy uses HTTPS (or http if apiserver so configured)
proxy to target may use HTTP or HTTPS as chosen by proxy using available information
can be used to reach a Node, Pod, or Service
does load balancing when used to reach a Service
The kube proxy:
runs on each node
proxies UDP, TCP and SCTP
does not understand HTTP
provides load balancing
is only used to reach services
A Proxy/Load-balancer in front of apiserver(s):
existence and implementation varies from cluster to cluster (e.g. nginx)
sits between all clients and one or more apiservers
acts as load balancer if there are several apiservers.
Cloud Load Balancers on external services:
are provided by some cloud providers (e.g. AWS ELB, Google Cloud Load Balancer)
are created automatically when the Kubernetes service has type LoadBalancer
usually supports UDP/TCP only
SCTP support is up to the load balancer implementation of the cloud provider
implementation varies by cloud provider.
Kubernetes users will typically not need to worry about anything other than the first two types. The cluster admin will typically ensure that the latter types are set up correctly.
Requesting redirects
Proxies have replaced redirect capabilities. Redirects have been deprecated.
11.10 - API Priority and Fairness
FEATURE STATE: Kubernetes v1.29 [stable]
Controlling the behavior of the Kubernetes API server in an overload situation is a key task for cluster administrators. The kube-apiserver has some controls available (i.e. the --max-requests-inflight and --max-mutating-requests-inflight command-line flags) to limit the amount of outstanding work that will be accepted, preventing a flood of inbound requests from overloading and potentially crashing the API server, but these flags are not enough to ensure that the most important requests get through in a period of high traffic.
The API Priority and Fairness feature (APF) is an alternative that improves upon aforementioned max-inflight limitations. APF classifies and isolates requests in a more fine-grained way. It also introduces a limited amount of queuing, so that no requests are rejected in cases of very brief bursts. Requests are dispatched from queues using a fair queuing technique so that, for example, a poorly-behaved controller need not starve others (even at the same priority level).
This feature is designed to work well with standard controllers, which use informers and react to failures of API requests with exponential back-off, and other clients that also work this way.
Caution:
Some requests classified as "long-running"—such as remote command execution or log tailing—are not subject to the API Priority and Fairness filter. This is also true for the --max-requests-inflight flag without the API Priority and Fairness feature enabled. API Priority and Fairness does apply to watch requests. When API Priority and Fairness is disabled, watch requests are not subject to the --max-requests-inflight limit.
Enabling/Disabling API Priority and Fairness
The API Priority and Fairness feature is controlled by a command-line flag and is enabled by default. See Options for a general explanation of the available kube-apiserver command-line options and how to enable and disable them. The name of the command-line option for APF is "--enable-priority-and-fairness". This feature also involves an API Group with: (a) a stable v1 version, introduced in 1.29, and enabled by default (b) a v1beta3 version, enabled by default, and deprecated in v1.29. You can disable the API group beta version v1beta3 by adding the following command-line flags to your kube-apiserver invocation:
kube-apiserver \
--runtime-config=flowcontrol.apiserver.k8s.io/v1beta3=false \
# …and other flags as usual
The command-line flag --enable-priority-and-fairness=false will disable the API Priority and Fairness feature.
Recursive server scenarios
API Priority and Fairness must be used carefully in recursive server scenarios. These are scenarios in which some server A, while serving a request, issues a subsidiary request to some server B. Perhaps server B might even make a further subsidiary call back to server A. In situations where Priority and Fairness control is applied to both the original request and some subsidiary ones(s), no matter how deep in the recursion, there is a danger of priority inversions and/or deadlocks.
One example of recursion is when the kube-apiserver issues an admission webhook call to server B, and while serving that call, server B makes a further subsidiary request back to the kube-apiserver. Another example of recursion is when an APIService object directs the kube-apiserver to delegate requests about a certain API group to a custom external server B (this is one of the things called "aggregation").
When the original request is known to belong to a certain priority level, and the subsidiary controlled requests are classified to higher priority levels, this is one possible solution. When the original requests can belong to any priority level, the subsidiary controlled requests have to be exempt from Priority and Fairness limitation. One way to do that is with the objects that configure classification and handling, discussed below. Another way is to disable Priority and Fairness on server B entirely, using the techniques discussed above. A third way, which is the simplest to use when server B is not kube-apisever, is to build server B with Priority and Fairness disabled in the code.
Concepts
There are several distinct features involved in the API Priority and Fairness feature. Incoming requests are classified by attributes of the request using FlowSchemas, and assigned to priority levels. Priority levels add a degree of isolation by maintaining separate concurrency limits, so that requests assigned to different priority levels cannot starve each other. Within a priority level, a fair-queuing algorithm prevents requests from different flows from starving each other, and allows for requests to be queued to prevent bursty traffic from causing failed requests when the average load is acceptably low.
Priority Levels
Without APF enabled, overall concurrency in the API server is limited by the kube-apiserver flags --max-requests-inflight and --max-mutating-requests-inflight. With APF enabled, the concurrency limits defined by these flags are summed and then the sum is divided up among a configurable set of priority levels. Each incoming request is assigned to a single priority level, and each priority level will only dispatch as many concurrent requests as its particular limit allows.
The default configuration, for example, includes separate priority levels for leader-election requests, requests from built-in controllers, and requests from Pods. This means that an ill-behaved Pod that floods the API server with requests cannot prevent leader election or actions by the built-in controllers from succeeding.
The concurrency limits of the priority levels are periodically adjusted, allowing under-utilized priority levels to temporarily lend concurrency to heavily-utilized levels. These limits are based on nominal limits and bounds on how much concurrency a priority level may lend and how much it may borrow, all derived from the configuration objects mentioned below.
Seats Occupied by a Request
The above description of concurrency management is the baseline story. Requests have different durations but are counted equally at any given moment when comparing against a priority level's concurrency limit. In the baseline story, each request occupies one unit of concurrency. The word "seat" is used to mean one unit of concurrency, inspired by the way each passenger on a train or aircraft takes up one of the fixed supply of seats.
But some requests take up more than one seat. Some of these are list requests that the server estimates will return a large number of objects. These have been found to put an exceptionally heavy burden on the server. For this reason, the server estimates the number of objects that will be returned and considers the request to take a number of seats that is proportional to that estimated number.
Execution time tweaks for watch requests
API Priority and Fairness manages watch requests, but this involves a couple more excursions from the baseline behavior. The first concerns how long a watch request is considered to occupy its seat. Depending on request parameters, the response to a watch request may or may not begin with create notifications for all the relevant pre-existing objects. API Priority and Fairness considers a watch request to be done with its seat once that initial burst of notifications, if any, is over.
The normal notifications are sent in a concurrent burst to all relevant watch response streams whenever the server is notified of an object create/update/delete. To account for this work, API Priority and Fairness considers every write request to spend some additional time occupying seats after the actual writing is done. The server estimates the number of notifications to be sent and adjusts the write request's number of seats and seat occupancy time to include this extra work.
Queuing
Even within a priority level there may be a large number of distinct sources of traffic. In an overload situation, it is valuable to prevent one stream of requests from starving others (in particular, in the relatively common case of a single buggy client flooding the kube-apiserver with requests, that buggy client would ideally not have much measurable impact on other clients at all). This is handled by use of a fair-queuing algorithm to process requests that are assigned the same priority level. Each request is assigned to a flow, identified by the name of the matching FlowSchema plus a flow distinguisher — which is either the requesting user, the target resource's namespace, or nothing — and the system attempts to give approximately equal weight to requests in different flows of the same priority level. To enable distinct handling of distinct instances, controllers that have many instances should authenticate with distinct usernames
After classifying a request into a flow, the API Priority and Fairness feature then may assign the request to a queue. This assignment uses a technique known as shuffle sharding, which makes relatively efficient use of queues to insulate low-intensity flows from high-intensity flows.
The details of the queuing algorithm are tunable for each priority level, and allow administrators to trade off memory use, fairness (the property that independent flows will all make progress when total traffic exceeds capacity), tolerance for bursty traffic, and the added latency induced by queuing.
Exempt requests
Some requests are considered sufficiently important that they are not subject to any of the limitations imposed by this feature. These exemptions prevent an improperly-configured flow control configuration from totally disabling an API server.
Resources
The flow control API involves two kinds of resources. PriorityLevelConfigurations define the available priority levels, the share of the available concurrency budget that each can handle, and allow for fine-tuning queuing behavior. FlowSchemas are used to classify individual inbound requests, matching each to a single PriorityLevelConfiguration.
PriorityLevelConfiguration
A PriorityLevelConfiguration represents a single priority level. Each PriorityLevelConfiguration has an independent limit on the number of outstanding requests, and limitations on the number of queued requests.
The nominal concurrency limit for a PriorityLevelConfiguration is not specified in an absolute number of seats, but rather in "nominal concurrency shares." The total concurrency limit for the API Server is distributed among the existing PriorityLevelConfigurations in proportion to these shares, to give each level its nominal limit in terms of seats. This allows a cluster administrator to scale up or down the total amount of traffic to a server by restarting kube-apiserver with a different value for --max-requests-inflight (or --max-mutating-requests-inflight), and all PriorityLevelConfigurations will see their maximum allowed concurrency go up (or down) by the same fraction.
Caution:
In the versions before v1beta3 the relevant PriorityLevelConfiguration field is named "assured concurrency shares" rather than "nominal concurrency shares". Also, in Kubernetes release 1.25 and earlier there were no periodic adjustments: the nominal/assured limits were always applied without adjustment.
The bounds on how much concurrency a priority level may lend and how much it may borrow are expressed in the PriorityLevelConfiguration as percentages of the level's nominal limit. These are resolved to absolute numbers of seats by multiplying with the nominal limit / 100.0 and rounding. The dynamically adjusted concurrency limit of a priority level is constrained to lie between (a) a lower bound of its nominal limit minus its lendable seats and (b) an upper bound of its nominal limit plus the seats it may borrow. At each adjustment the dynamic limits are derived by each priority level reclaiming any lent seats for which demand recently appeared and then jointly fairly responding to the recent seat demand on the priority levels, within the bounds just described.
Caution:
With the Priority and Fairness feature enabled, the total concurrency limit for the server is set to the sum of --max-requests-inflight and --max-mutating-requests-inflight. There is no longer any distinction made between mutating and non-mutating requests; if you want to treat them separately for a given resource, make separate FlowSchemas that match the mutating and non-mutating verbs respectively.
When the volume of inbound requests assigned to a single PriorityLevelConfiguration is more than its permitted concurrency level, the type field of its specification determines what will happen to extra requests. A type of Reject means that excess traffic will immediately be rejected with an HTTP 429 (Too Many Requests) error. A type of Queue means that requests above the threshold will be queued, with the shuffle sharding and fair queuing techniques used to balance progress between request flows.
The queuing configuration allows tuning the fair queuing algorithm for a priority level. Details of the algorithm can be read in the enhancement proposal, but in short:
Increasing queues reduces the rate of collisions between different flows, at the cost of increased memory usage. A value of 1 here effectively disables the fair-queuing logic, but still allows requests to be queued.
Increasing queueLengthLimit allows larger bursts of traffic to be sustained without dropping any requests, at the cost of increased latency and memory usage.
Changing handSize allows you to adjust the probability of collisions between different flows and the overall concurrency available to a single flow in an overload situation.
Note:
A larger handSize makes it less likely for two individual flows to collide (and therefore for one to be able to starve the other), but more likely that a small number of flows can dominate the apiserver. A larger handSize also potentially increases the amount of latency that a single high-traffic flow can cause. The maximum number of queued requests possible from a single flow is handSize * queueLengthLimit.
Following is a table showing an interesting collection of shuffle sharding configurations, showing for each the probability that a given mouse (low-intensity flow) is squished by the elephants (high-intensity flows) for an illustrative collection of numbers of elephants. See https://play.golang.org/p/Gi0PLgVHiUg , which computes this table.
HandSize Queues 1 elephant 4 elephants 16 elephants
12 32 4.428838398950118e-09 0.11431348830099144 0.9935089607656024
10 32 1.550093439632541e-08 0.0626479840223545 0.9753101519027554
10 64 6.601827268370426e-12 0.00045571320990370776 0.49999929150089345
9 64 3.6310049976037345e-11 0.00045501212304112273 0.4282314876454858
8 64 2.25929199850899e-10 0.0004886697053040446 0.35935114681123076
8 128 6.994461389026097e-13 3.4055790161620863e-06 0.02746173137155063
7 128 1.0579122850901972e-11 6.960839379258192e-06 0.02406157386340147
7 256 7.597695465552631e-14 6.728547142019406e-08 0.0006709661542533682
6 256 2.7134626662687968e-12 2.9516464018476436e-07 0.0008895654642000348
6 512 4.116062922897309e-14 4.982983350480894e-09 2.26025764343413e-05
6 1024 6.337324016514285e-16 8.09060164312957e-11 4.517408062903668e-07
FlowSchema
A FlowSchema matches some inbound requests and assigns them to a priority level. Every inbound request is tested against FlowSchemas, starting with those with the numerically lowest matchingPrecedence and working upward. The first match wins.
Caution:
Only the first matching FlowSchema for a given request matters. If multiple FlowSchemas match a single inbound request, it will be assigned based on the one with the highest matchingPrecedence. If multiple FlowSchemas with equal matchingPrecedence match the same request, the one with lexicographically smaller name will win, but it's better not to rely on this, and instead to ensure that no two FlowSchemas have the same matchingPrecedence.
A FlowSchema matches a given request if at least one of its rules matches. A rule matches if at least one of its subjects and at least one of its resourceRules or nonResourceRules (depending on whether the incoming request is for a resource or non-resource URL) match the request.
For the name field in subjects, and the verbs, apiGroups, resources, namespaces, and nonResourceURLs fields of resource and non-resource rules, the wildcard * may be specified to match all values for the given field, effectively removing it from consideration.
A FlowSchema's distinguisherMethod.type determines how requests matching that schema will be separated into flows. It may be ByUser, in which one requesting user will not be able to starve other users of capacity; ByNamespace, in which requests for resources in one namespace will not be able to starve requests for resources in other namespaces of capacity; or blank (or distinguisherMethod may be omitted entirely), in which all requests matched by this FlowSchema will be considered part of a single flow. The correct choice for a given FlowSchema depends on the resource and your particular environment.
Defaults
Each kube-apiserver maintains two sorts of APF configuration objects: mandatory and suggested.
Mandatory Configuration Objects
The four mandatory configuration objects reflect fixed built-in guardrail behavior. This is behavior that the servers have before those objects exist, and when those objects exist their specs reflect this behavior. The four mandatory objects are as follows.
The mandatory exempt priority level is used for requests that are not subject to flow control at all: they will always be dispatched immediately. The mandatory exempt FlowSchema classifies all requests from the system:masters group into this priority level. You may define other FlowSchemas that direct other requests to this priority level, if appropriate.
The mandatory catch-all priority level is used in combination with the mandatory catch-all FlowSchema to make sure that every request gets some kind of classification. Typically you should not rely on this catch-all configuration, and should create your own catch-all FlowSchema and PriorityLevelConfiguration (or use the suggested global-default priority level that is installed by default) as appropriate. Because it is not expected to be used normally, the mandatory catch-all priority level has a very small concurrency share and does not queue requests.
Suggested Configuration Objects
The suggested FlowSchemas and PriorityLevelConfigurations constitute a reasonable default configuration. You can modify these and/or create additional configuration objects if you want. If your cluster is likely to experience heavy load then you should consider what configuration will work best.
The suggested configuration groups requests into six priority levels:
The node-high priority level is for health updates from nodes.
The system priority level is for non-health requests from the system:nodes group, i.e. Kubelets, which must be able to contact the API server in order for workloads to be able to schedule on them.
The leader-election priority level is for leader election requests from built-in controllers (in particular, requests for endpoints, configmaps, or leases coming from the system:kube-controller-manager or system:kube-scheduler users and service accounts in the kube-system namespace). These are important to isolate from other traffic because failures in leader election cause their controllers to fail and restart, which in turn causes more expensive traffic as the new controllers sync their informers.
The workload-high priority level is for other requests from built-in controllers.
The workload-low priority level is for requests from any other service account, which will typically include all requests from controllers running in Pods.
The global-default priority level handles all other traffic, e.g. interactive kubectl commands run by nonprivileged users.
The suggested FlowSchemas serve to steer requests into the above priority levels, and are not enumerated here.
Maintenance of the Mandatory and Suggested Configuration Objects
Each kube-apiserver independently maintains the mandatory and suggested configuration objects, using initial and periodic behavior. Thus, in a situation with a mixture of servers of different versions there may be thrashing as long as different servers have different opinions of the proper content of these objects.
Each kube-apiserver makes an initial maintenance pass over the mandatory and suggested configuration objects, and after that does periodic maintenance (once per minute) of those objects.
For the mandatory configuration objects, maintenance consists of ensuring that the object exists and, if it does, has the proper spec. The server refuses to allow a creation or update with a spec that is inconsistent with the server's guardrail behavior.
Maintenance of suggested configuration objects is designed to allow their specs to be overridden. Deletion, on the other hand, is not respected: maintenance will restore the object. If you do not want a suggested configuration object then you need to keep it around but set its spec to have minimal consequences. Maintenance of suggested objects is also designed to support automatic migration when a new version of the kube-apiserver is rolled out, albeit potentially with thrashing while there is a mixed population of servers.
Maintenance of a suggested configuration object consists of creating it --- with the server's suggested spec --- if the object does not exist. OTOH, if the object already exists, maintenance behavior depends on whether the kube-apiservers or the users control the object. In the former case, the server ensures that the object's spec is what the server suggests; in the latter case, the spec is left alone.
The question of who controls the object is answered by first looking for an annotation with key apf.kubernetes.io/autoupdate-spec. If there is such an annotation and its value is true then the kube-apiservers control the object. If there is such an annotation and its value is false then the users control the object. If neither of those conditions holds then the metadata.generation of the object is consulted. If that is 1 then the kube-apiservers control the object. Otherwise the users control the object. These rules were introduced in release 1.22 and their consideration of metadata.generation is for the sake of migration from the simpler earlier behavior. Users who wish to control a suggested configuration object should set its apf.kubernetes.io/autoupdate-spec annotation to false.
Maintenance of a mandatory or suggested configuration object also includes ensuring that it has an apf.kubernetes.io/autoupdate-spec annotation that accurately reflects whether the kube-apiservers control the object.
Maintenance also includes deleting objects that are neither mandatory nor suggested but are annotated apf.kubernetes.io/autoupdate-spec=true.
Health check concurrency exemption
The suggested configuration gives no special treatment to the health check requests on kube-apiservers from their local kubelets --- which tend to use the secured port but supply no credentials. With the suggested config, these requests get assigned to the global-default FlowSchema and the corresponding global-default priority level, where other traffic can crowd them out.
If you add the following additional FlowSchema, this exempts those requests from rate limiting.
Caution:
Making this change also allows any hostile party to then send health-check requests that match this FlowSchema, at any volume they like. If you have a web traffic filter or similar external security mechanism to protect your cluster's API server from general internet traffic, you can configure rules to block any health check requests that originate from outside your cluster.
priority-and-fairness/health-for-strangers.yaml Copy priority-and-fairness/health-for-strangers.yaml to clipboard
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
name: health-for-strangers
spec:
matchingPrecedence: 1000
priorityLevelConfiguration:
name: exempt
rules:
- nonResourceRules:
- nonResourceURLs:
- "/healthz"
- "/livez"
- "/readyz"
verbs:
- "*"
subjects:
- kind: Group
group:
name: "system:unauthenticated"
Observability
Metrics
Note:
In versions of Kubernetes before v1.20, the labels flow_schema and priority_level were inconsistently named flowSchema and priorityLevel, respectively. If you're running Kubernetes versions v1.19 and earlier, you should refer to the documentation for your version.
When you enable the API Priority and Fairness feature, the kube-apiserver exports additional metrics. Monitoring these can help you determine whether your configuration is inappropriately throttling important traffic, or find poorly-behaved workloads that may be harming system health.
Maturity level BETA
apiserver_flowcontrol_rejected_requests_total is a counter vector (cumulative since server start) of requests that were rejected, broken down by the labels flow_schema (indicating the one that matched the request), priority_level (indicating the one to which the request was assigned), and reason. The reason label will be one of the following values:
queue-full, indicating that too many requests were already queued.
concurrency-limit, indicating that the PriorityLevelConfiguration is configured to reject rather than queue excess requests.
time-out, indicating that the request was still in the queue when its queuing time limit expired.
cancelled, indicating that the request is not purge locked and has been ejected from the queue.
apiserver_flowcontrol_dispatched_requests_total is a counter vector (cumulative since server start) of requests that began executing, broken down by flow_schema and priority_level.
apiserver_flowcontrol_current_inqueue_requests is a gauge vector holding the instantaneous number of queued (not executing) requests, broken down by priority_level and flow_schema.
apiserver_flowcontrol_current_executing_requests is a gauge vector holding the instantaneous number of executing (not waiting in a queue) requests, broken down by priority_level and flow_schema.
apiserver_flowcontrol_current_executing_seats is a gauge vector holding the instantaneous number of occupied seats, broken down by priority_level and flow_schema.
apiserver_flowcontrol_request_wait_duration_seconds is a histogram vector of how long requests spent queued, broken down by the labels flow_schema, priority_level, and execute. The execute label indicates whether the request has started executing.
Note:
Since each FlowSchema always assigns requests to a single PriorityLevelConfiguration, you can add the histograms for all the FlowSchemas for one priority level to get the effective histogram for requests assigned to that priority level.
apiserver_flowcontrol_nominal_limit_seats is a gauge vector holding each priority level's nominal concurrency limit, computed from the API server's total concurrency limit and the priority level's configured nominal concurrency shares.
Maturity level ALPHA
apiserver_current_inqueue_requests is a gauge vector of recent high water marks of the number of queued requests, grouped by a label named request_kind whose value is mutating or readOnly. These high water marks describe the largest number seen in the one second window most recently completed. These complement the older apiserver_current_inflight_requests gauge vector that holds the last window's high water mark of number of requests actively being served.
apiserver_current_inqueue_seats is a gauge vector of the sum over queued requests of the largest number of seats each will occupy, grouped by labels named flow_schema and priority_level.
apiserver_flowcontrol_read_vs_write_current_requests is a histogram vector of observations, made at the end of every nanosecond, of the number of requests broken down by the labels phase (which takes on the values waiting and executing) and request_kind (which takes on the values mutating and readOnly). Each observed value is a ratio, between 0 and 1, of the number of requests divided by the corresponding limit on the number of requests (queue volume limit for waiting and concurrency limit for executing).
apiserver_flowcontrol_request_concurrency_in_use is a gauge vector holding the instantaneous number of occupied seats, broken down by priority_level and flow_schema.
apiserver_flowcontrol_priority_level_request_utilization is a histogram vector of observations, made at the end of each nanosecond, of the number of requests broken down by the labels phase (which takes on the values waiting and executing) and priority_level. Each observed value is a ratio, between 0 and 1, of a number of requests divided by the corresponding limit on the number of requests (queue volume limit for waiting and concurrency limit for executing).
apiserver_flowcontrol_priority_level_seat_utilization is a histogram vector of observations, made at the end of each nanosecond, of the utilization of a priority level's concurrency limit, broken down by priority_level. This utilization is the fraction (number of seats occupied) / (concurrency limit). This metric considers all stages of execution (both normal and the extra delay at the end of a write to cover for the corresponding notification work) of all requests except WATCHes; for those it considers only the initial stage that delivers notifications of pre-existing objects. Each histogram in the vector is also labeled with phase: executing (there is no seat limit for the waiting phase).
apiserver_flowcontrol_request_queue_length_after_enqueue is a histogram vector of queue lengths for the queues, broken down by priority_level and flow_schema, as sampled by the enqueued requests. Each request that gets queued contributes one sample to its histogram, reporting the length of the queue immediately after the request was added. Note that this produces different statistics than an unbiased survey would.
Note:
An outlier value in a histogram here means it is likely that a single flow (i.e., requests by one user or for one namespace, depending on configuration) is flooding the API server, and being throttled. By contrast, if one priority level's histogram shows that all queues for that priority level are longer than those for other priority levels, it may be appropriate to increase that PriorityLevelConfiguration's concurrency shares.
apiserver_flowcontrol_request_concurrency_limit is the same as apiserver_flowcontrol_nominal_limit_seats. Before the introduction of concurrency borrowing between priority levels, this was always equal to apiserver_flowcontrol_current_limit_seats (which did not exist as a distinct metric).
apiserver_flowcontrol_lower_limit_seats is a gauge vector holding the lower bound on each priority level's dynamic concurrency limit.
apiserver_flowcontrol_upper_limit_seats is a gauge vector holding the upper bound on each priority level's dynamic concurrency limit.
apiserver_flowcontrol_demand_seats is a histogram vector counting observations, at the end of every nanosecond, of each priority level's ratio of (seat demand) / (nominal concurrency limit). A priority level's seat demand is the sum, over both queued requests and those in the initial phase of execution, of the maximum of the number of seats occupied in the request's initial and final execution phases.
apiserver_flowcontrol_demand_seats_high_watermark is a gauge vector holding, for each priority level, the maximum seat demand seen during the last concurrency borrowing adjustment period.
apiserver_flowcontrol_demand_seats_average is a gauge vector holding, for each priority level, the time-weighted average seat demand seen during the last concurrency borrowing adjustment period.
apiserver_flowcontrol_demand_seats_stdev is a gauge vector holding, for each priority level, the time-weighted population standard deviation of seat demand seen during the last concurrency borrowing adjustment period.
apiserver_flowcontrol_demand_seats_smoothed is a gauge vector holding, for each priority level, the smoothed enveloped seat demand determined at the last concurrency adjustment.
apiserver_flowcontrol_target_seats is a gauge vector holding, for each priority level, the concurrency target going into the borrowing allocation problem.
apiserver_flowcontrol_seat_fair_frac is a gauge holding the fair allocation fraction determined in the last borrowing adjustment.
apiserver_flowcontrol_current_limit_seats is a gauge vector holding, for each priority level, the dynamic concurrency limit derived in the last adjustment.
apiserver_flowcontrol_request_execution_seconds is a histogram vector of how long requests took to actually execute, broken down by flow_schema and priority_level.
apiserver_flowcontrol_watch_count_samples is a histogram vector of the number of active WATCH requests relevant to a given write, broken down by flow_schema and priority_level.
apiserver_flowcontrol_work_estimated_seats is a histogram vector of the number of estimated seats (maximum of initial and final stage of execution) associated with requests, broken down by flow_schema and priority_level.
apiserver_flowcontrol_request_dispatch_no_accommodation_total is a counter vector of the number of events that in principle could have led to a request being dispatched but did not, due to lack of available concurrency, broken down by flow_schema and priority_level.
apiserver_flowcontrol_epoch_advance_total is a counter vector of the number of attempts to jump a priority level's progress meter backward to avoid numeric overflow, grouped by priority_level and success.
Good practices for using API Priority and Fairness
When a given priority level exceeds its permitted concurrency, requests can experience increased latency or be dropped with an HTTP 429 (Too Many Requests) error. To prevent these side effects of APF, you can modify your workload or tweak your APF settings to ensure there are sufficient seats available to serve your requests.
To detect whether requests are being rejected due to APF, check the following metrics:
apiserver_flowcontrol_rejected_requests_total: the total number of requests rejected per FlowSchema and PriorityLevelConfiguration.
apiserver_flowcontrol_current_inqueue_requests: the current number of requests queued per FlowSchema and PriorityLevelConfiguration.
apiserver_flowcontrol_request_wait_duration_seconds: the latency added to requests waiting in queues.
apiserver_flowcontrol_priority_level_seat_utilization: the seat utilization per PriorityLevelConfiguration.
Workload modifications
To prevent requests from queuing and adding latency or being dropped due to APF, you can optimize your requests by:
Reducing the rate at which requests are executed. A fewer number of requests over a fixed period will result in a fewer number of seats being needed at a given time.
Avoid issuing a large number of expensive requests concurrently. Requests can be optimized to use fewer seats or have lower latency so that these requests hold those seats for a shorter duration. List requests can occupy more than 1 seat depending on the number of objects fetched during the request. Restricting the number of objects retrieved in a list request, for example by using pagination, will use less total seats over a shorter period. Furthermore, replacing list requests with watch requests will require lower total concurrency shares as watch requests only occupy 1 seat during its initial burst of notifications. If using streaming lists in versions 1.27 and later, watch requests will occupy the same number of seats as a list request for its initial burst of notifications because the entire state of the collection has to be streamed. Note that in both cases, a watch request will not hold any seats after this initial phase.
Keep in mind that queuing or rejected requests from APF could be induced by either an increase in the number of requests or an increase in latency for existing requests. For example, if requests that normally take 1s to execute start taking 60s, it is possible that APF will start rejecting requests because requests are occupying seats for a longer duration than normal due to this increase in latency. If APF starts rejecting requests across multiple priority levels without a significant change in workload, it is possible there is an underlying issue with control plane performance rather than the workload or APF settings.
Priority and fairness settings
You can also modify the default FlowSchema and PriorityLevelConfiguration objects or create new objects of these types to better accommodate your workload.
APF settings can be modified to:
Give more seats to high priority requests.
Isolate non-essential or expensive requests that would starve a concurrency level if it was shared with other flows.
Give more seats to high priority requests
If possible, the number of seats available across all priority levels for a particular kube-apiserver can be increased by increasing the values for the max-requests-inflight and max-mutating-requests-inflight flags. Alternatively, horizontally scaling the number of kube-apiserver instances will increase the total concurrency per priority level across the cluster assuming there is sufficient load balancing of requests.
You can create a new FlowSchema which references a PriorityLevelConfiguration with a larger concurrency level. This new PriorityLevelConfiguration could be an existing level or a new level with its own set of nominal concurrency shares. For example, a new FlowSchema could be introduced to change the PriorityLevelConfiguration for your requests from global-default to workload-low to increase the number of seats available to your user. Creating a new PriorityLevelConfiguration will reduce the number of seats designated for existing levels. Recall that editing a default FlowSchema or PriorityLevelConfiguration will require setting the apf.kubernetes.io/autoupdate-spec annotation to false.
You can also increase the NominalConcurrencyShares for the PriorityLevelConfiguration which is serving your high priority requests. Alternatively, for versions 1.26 and later, you can increase the LendablePercent for competing priority levels so that the given priority level has a higher pool of seats it can borrow.
Isolate non-essential requests from starving other flows
For request isolation, you can create a FlowSchema whose subject matches the user making these requests or create a FlowSchema that matches what the request is (corresponding to the resourceRules). Next, you can map this FlowSchema to a PriorityLevelConfiguration with a low share of seats.
For example, suppose list event requests from Pods running in the default namespace are using 10 seats each and execute for 1 minute. To prevent these expensive requests from impacting requests from other Pods using the existing service-accounts FlowSchema, you can apply the following FlowSchema to isolate these list calls from other requests.
Example FlowSchema object to isolate list event requests:
priority-and-fairness/list-events-default-service-account.yaml Copy priority-and-fairness/list-events-default-service-account.yaml to clipboard
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
name: list-events-default-service-account
spec:
distinguisherMethod:
type: ByUser
matchingPrecedence: 8000
priorityLevelConfiguration:
name: catch-all
rules:
- resourceRules:
- apiGroups:
- '*'
namespaces:
- default
resources:
- events
verbs:
- list
subjects:
- kind: ServiceAccount
serviceAccount:
name: default
namespace: default
This FlowSchema captures all list event calls made by the default service account in the default namespace. The matching precedence 8000 is lower than the value of 9000 used by the existing service-accounts FlowSchema so these list event calls will match list-events-default-service-account rather than service-accounts.
The catch-all PriorityLevelConfiguration is used to isolate these requests. The catch-all priority level has a very small concurrency share and does not queue requests.
What's next
You can visit flow control reference doc to learn more about troubleshooting.
For background information on design details for API priority and fairness, see the enhancement proposal.
You can make suggestions and feature requests via SIG API Machinery or the feature's slack channel.
11.11 - Cluster Autoscaling
Automatically manage the nodes in your cluster to adapt to demand.
Kubernetes requires nodes in your cluster to run pods. This means providing capacity for the workload Pods and for Kubernetes itself.
You can adjust the amount of resources available in your cluster automatically: node autoscaling. You can either change the number of nodes, or change the capacity that nodes provide. The first approach is referred to as horizontal scaling, while the second is referred to as vertical scaling.
Kubernetes can even provide multidimensional automatic scaling for nodes.
Manual node management
You can manually manage node-level capacity, where you configure a fixed amount of nodes; you can use this approach even if the provisioning (the process to set up, manage, and decommission) for these nodes is automated.
This page is about taking the next step, and automating management of the amount of node capacity (CPU, memory, and other node resources) available in your cluster.
Automatic horizontal scaling
Cluster Autoscaler
You can use the Cluster Autoscaler to manage the scale of your nodes automatically. The cluster autoscaler can integrate with a cloud provider, or with Kubernetes' cluster API, to achieve the actual node management that's needed.
The cluster autoscaler adds nodes when there are unschedulable Pods, and removes nodes when those nodes are empty.
Cloud provider integrations
The README for the cluster autoscaler lists some of the cloud provider integrations that are available.
Cost-aware multidimensional scaling
Karpenter
Karpenter supports direct node management, via plugins that integrate with specific cloud providers, and can manage nodes for you whilst optimizing for overall cost.
Karpenter automatically launches just the right compute resources to handle your cluster's applications. It is designed to let you take full advantage of the cloud with fast and simple compute provisioning for Kubernetes clusters.
The Karpenter tool is designed to integrate with a cloud provider that provides API-driven server management, and where the price information for available servers is also available via a web API.
For example, if you start some more Pods in your cluster, the Karpenter tool might buy a new node that is larger than one of the nodes you are already using, and then shut down an existing node once the new node is in service.
Cloud provider integrations
Note: Items on this page refer to vendors external to Kubernetes. The Kubernetes project authors aren't responsible for those third-party products or projects. To add a vendor, product or project to this list, read the content guide before submitting a change. More information.
There are integrations available between Karpenter's core and the following cloud providers:
Amazon Web Services
Azure
Related components
Descheduler
The descheduler can help you consolidate Pods onto a smaller number of nodes, to help with automatic scale down when the cluster has space capacity.
Sizing a workload based on cluster size
Cluster proportional autoscaler
For workloads that need to be scaled based on the size of the cluster (for example cluster-dns or other system components), you can use the Cluster Proportional Autoscaler.
The Cluster Proportional Autoscaler watches the number of schedulable nodes and cores, and scales the number of replicas of the target workload accordingly.
Cluster proportional vertical autoscaler
If the number of replicas should stay the same, you can scale your workloads vertically according to the cluster size using the Cluster Proportional Vertical Autoscaler. This project is in beta and can be found on GitHub.
While the Cluster Proportional Autoscaler scales the number of replicas of a workload, the Cluster Proportional Vertical Autoscaler adjusts the resource requests for a workload (for example a Deployment or DaemonSet) based on the number of nodes and/or cores in the cluster.
What's next
Read about workload-level autoscaling
11.12 - Installing Addons
Note: This section links to third party projects that provide functionality required by Kubernetes. The Kubernetes project authors aren't responsible for these projects, which are listed alphabetically. To add a project to this list, read the content guide before submitting a change. More information.
Add-ons extend the functionality of Kubernetes.
This page lists some of the available add-ons and links to their respective installation instructions. The list does not try to be exhaustive.
Networking and Network Policy
ACI provides integrated container networking and network security with Cisco ACI.
Antrea operates at Layer 3/4 to provide networking and security services for Kubernetes, leveraging Open vSwitch as the networking data plane. Antrea is a CNCF project at the Sandbox level.
Calico is a networking and network policy provider. Calico supports a flexible set of networking options so you can choose the most efficient option for your situation, including non-overlay and overlay networks, with or without BGP. Calico uses the same engine to enforce network policy for hosts, pods, and (if using Istio & Envoy) applications at the service mesh layer.
Canal unites Flannel and Calico, providing networking and network policy.
Cilium is a networking, observability, and security solution with an eBPF-based data plane. Cilium provides a simple flat Layer 3 network with the ability to span multiple clusters in either a native routing or overlay/encapsulation mode, and can enforce network policies on L3-L7 using an identity-based security model that is decoupled from network addressing. Cilium can act as a replacement for kube-proxy; it also offers additional, opt-in observability and security features. Cilium is a CNCF project at the Graduated level.
CNI-Genie enables Kubernetes to seamlessly connect to a choice of CNI plugins, such as Calico, Canal, Flannel, or Weave. CNI-Genie is a CNCF project at the Sandbox level.
Contiv provides configurable networking (native L3 using BGP, overlay using vxlan, classic L2, and Cisco-SDN/ACI) for various use cases and a rich policy framework. Contiv project is fully open sourced. The installer provides both kubeadm and non-kubeadm based installation options.
Contrail, based on Tungsten Fabric, is an open source, multi-cloud network virtualization and policy management platform. Contrail and Tungsten Fabric are integrated with orchestration systems such as Kubernetes, OpenShift, OpenStack and Mesos, and provide isolation modes for virtual machines, containers/pods and bare metal workloads.
Flannel is an overlay network provider that can be used with Kubernetes.
Gateway API is an open source project managed by the SIG Network community and provides an expressive, extensible, and role-oriented API for modeling service networking.
Knitter is a plugin to support multiple network interfaces in a Kubernetes pod.
Multus is a Multi plugin for multiple network support in Kubernetes to support all CNI plugins (e.g. Calico, Cilium, Contiv, Flannel), in addition to SRIOV, DPDK, OVS-DPDK and VPP based workloads in Kubernetes.
OVN-Kubernetes is a networking provider for Kubernetes based on OVN (Open Virtual Network), a virtual networking implementation that came out of the Open vSwitch (OVS) project. OVN-Kubernetes provides an overlay based networking implementation for Kubernetes, including an OVS based implementation of load balancing and network policy.
Nodus is an OVN based CNI controller plugin to provide cloud native based Service function chaining(SFC).
NSX-T Container Plug-in (NCP) provides integration between VMware NSX-T and container orchestrators such as Kubernetes, as well as integration between NSX-T and container-based CaaS/PaaS platforms such as Pivotal Container Service (PKS) and OpenShift.
Nuage is an SDN platform that provides policy-based networking between Kubernetes Pods and non-Kubernetes environments with visibility and security monitoring.
Romana is a Layer 3 networking solution for pod networks that also supports the NetworkPolicy API.
Spiderpool is an underlay and RDMA networking solution for Kubernetes. Spiderpool is supported on bare metal, virtual machines, and public cloud environments.
Weave Net provides networking and network policy, will carry on working on both sides of a network partition, and does not require an external database.
Service Discovery
CoreDNS is a flexible, extensible DNS server which can be installed as the in-cluster DNS for pods.
Visualization & Control
Dashboard is a dashboard web interface for Kubernetes.
Weave Scope is a tool for visualizing your containers, Pods, Services and more.
Infrastructure
KubeVirt is an add-on to run virtual machines on Kubernetes. Usually run on bare-metal clusters.
The node problem detector runs on Linux nodes and reports system issues as either Events or Node conditions.
Instrumentation
kube-state-metrics
Legacy Add-ons
There are several other add-ons documented in the deprecated cluster/addons directory.
Well-maintained ones should be linked to here. PRs welcome!
12 - Windows in Kubernetes
Kubernetes supports nodes that run Microsoft Windows.
Kubernetes supports worker nodes running either Linux or Microsoft Windows.
🛇 This item links to a third party project or product that is not part of Kubernetes itself. More information
The CNCF and its parent the Linux Foundation take a vendor-neutral approach towards compatibility. It is possible to join your Windows server as a worker node to a Kubernetes cluster.
You can install and set up kubectl on Windows no matter what operating system you use within your cluster.
If you are using Windows nodes, you can read:
Networking On Windows
Windows Storage In Kubernetes
Resource Management for Windows Nodes
Configure RunAsUserName for Windows Pods and Containers
Create A Windows HostProcess Pod
Configure Group Managed Service Accounts for Windows Pods and Containers
Security For Windows Nodes
Windows Debugging Tips
Guide for Scheduling Windows Containers in Kubernetes
or, for an overview, read:
12.1 - Windows containers in Kubernetes
Windows applications constitute a large portion of the services and applications that run in many organizations. Windows containers provide a way to encapsulate processes and package dependencies, making it easier to use DevOps practices and follow cloud native patterns for Windows applications.
Organizations with investments in Windows-based applications and Linux-based applications don't have to look for separate orchestrators to manage their workloads, leading to increased operational efficiencies across their deployments, regardless of operating system.
Windows nodes in Kubernetes
To enable the orchestration of Windows containers in Kubernetes, include Windows nodes in your existing Linux cluster. Scheduling Windows containers in Pods on Kubernetes is similar to scheduling Linux-based containers.
In order to run Windows containers, your Kubernetes cluster must include multiple operating systems. While you can only run the control plane on Linux, you can deploy worker nodes running either Windows or Linux.
Windows nodes are supported provided that the operating system is Windows Server 2019 or Windows Server 2022.
This document uses the term Windows containers to mean Windows containers with process isolation. Kubernetes does not support running Windows containers with Hyper-V isolation.
Compatibility and limitations
Some node features are only available if you use a specific container runtime; others are not available on Windows nodes, including:
HugePages: not supported for Windows containers
Privileged containers: not supported for Windows containers. HostProcess Containers offer similar functionality.
TerminationGracePeriod: requires containerD
Not all features of shared namespaces are supported. See API compatibility for more details.
See Windows OS version compatibility for details on the Windows versions that Kubernetes is tested against.
From an API and kubectl perspective, Windows containers behave in much the same way as Linux-based containers. However, there are some notable differences in key functionality which are outlined in this section.
Comparison with Linux
Key Kubernetes elements work the same way in Windows as they do in Linux. This section refers to several key workload abstractions and how they map to Windows.
Pods
A Pod is the basic building block of Kubernetes–the smallest and simplest unit in the Kubernetes object model that you create or deploy. You may not deploy Windows and Linux containers in the same Pod. All containers in a Pod are scheduled onto a single Node where each Node represents a specific platform and architecture. The following Pod capabilities, properties and events are supported with Windows containers:
Single or multiple containers per Pod with process isolation and volume sharing
Pod status fields
Readiness, liveness, and startup probes
postStart & preStop container lifecycle hooks
ConfigMap, Secrets: as environment variables or volumes
emptyDir volumes
Named pipe host mounts
Resource limits
OS field:
The .spec.os.name field should be set to windows to indicate that the current Pod uses Windows containers.
If you set the .spec.os.name field to windows, you must not set the following fields in the .spec of that Pod:
spec.hostPID
spec.hostIPC
spec.securityContext.seLinuxOptions
spec.securityContext.seccompProfile
spec.securityContext.fsGroup
spec.securityContext.fsGroupChangePolicy
spec.securityContext.sysctls
spec.shareProcessNamespace
spec.securityContext.runAsUser
spec.securityContext.runAsGroup
spec.securityContext.supplementalGroups
spec.containers[*].securityContext.seLinuxOptions
spec.containers[*].securityContext.seccompProfile
spec.containers[*].securityContext.capabilities
spec.containers[*].securityContext.readOnlyRootFilesystem
spec.containers[*].securityContext.privileged
spec.containers[*].securityContext.allowPrivilegeEscalation
spec.containers[*].securityContext.procMount
spec.containers[*].securityContext.runAsUser
spec.containers[*].securityContext.runAsGroup
In the above list, wildcards (*) indicate all elements in a list. For example, spec.containers[*].securityContext refers to the SecurityContext object for all containers. If any of these fields is specified, the Pod will not be admitted by the API server.
Workload resources including:
ReplicaSet
Deployment
StatefulSet
DaemonSet
Job
CronJob
ReplicationController
Services See Load balancing and Services for more details.
Pods, workload resources, and Services are critical elements to managing Windows workloads on Kubernetes. However, on their own they are not enough to enable the proper lifecycle management of Windows workloads in a dynamic cloud native environment.
kubectl exec
Pod and container metrics
Horizontal pod autoscaling
Resource quotas
Scheduler preemption
Command line options for the kubelet
Some kubelet command line options behave differently on Windows, as described below:
The --windows-priorityclass lets you set the scheduling priority of the kubelet process (see CPU resource management)
The --kube-reserved, --system-reserved , and --eviction-hard flags update NodeAllocatable
Eviction by using --enforce-node-allocable is not implemented
Eviction by using --eviction-hard and --eviction-soft are not implemented
When running on a Windows node the kubelet does not have memory or CPU restrictions. --kube-reserved and --system-reserved only subtract from NodeAllocatable and do not guarantee resource provided for workloads. See Resource Management for Windows nodes for more information.
The MemoryPressure Condition is not implemented
The kubelet does not take OOM eviction actions
API compatibility
There are subtle differences in the way the Kubernetes APIs work for Windows due to the OS and container runtime. Some workload properties were designed for Linux, and fail to run on Windows.
At a high level, these OS concepts are different:
Identity - Linux uses userID (UID) and groupID (GID) which are represented as integer types. User and group names are not canonical - they are just an alias in /etc/groups or /etc/passwd back to UID+GID. Windows uses a larger binary security identifier (SID) which is stored in the Windows Security Access Manager (SAM) database. This database is not shared between the host and containers, or between containers.
File permissions - Windows uses an access control list based on (SIDs), whereas POSIX systems such as Linux use a bitmask based on object permissions and UID+GID, plus optional access control lists.
File paths - the convention on Windows is to use \ instead of /. The Go IO libraries typically accept both and just make it work, but when you're setting a path or command line that's interpreted inside a container, \ may be needed.
Signals - Windows interactive apps handle termination differently, and can implement one or more of these:
A UI thread handles well-defined messages including WM_CLOSE.
Console apps handle Ctrl-C or Ctrl-break using a Control Handler.
Services register a Service Control Handler function that can accept SERVICE_CONTROL_STOP control codes.
Container exit codes follow the same convention where 0 is success, and nonzero is failure. The specific error codes may differ across Windows and Linux. However, exit codes passed from the Kubernetes components (kubelet, kube-proxy) are unchanged.
Field compatibility for container specifications
The following list documents differences between how Pod container specifications work between Windows and Linux:
Huge pages are not implemented in the Windows container runtime, and are not available. They require asserting a user privilege that's not configurable for containers.
requests.cpu and requests.memory - requests are subtracted from node available resources, so they can be used to avoid overprovisioning a node. However, they cannot be used to guarantee resources in an overprovisioned node. They should be applied to all containers as a best practice if the operator wants to avoid overprovisioning entirely.
securityContext.allowPrivilegeEscalation - not possible on Windows; none of the capabilities are hooked up
securityContext.capabilities - POSIX capabilities are not implemented on Windows
securityContext.privileged - Windows doesn't support privileged containers, use HostProcess Containers instead
securityContext.procMount - Windows doesn't have a /proc filesystem
securityContext.readOnlyRootFilesystem - not possible on Windows; write access is required for registry & system processes to run inside the container
securityContext.runAsGroup - not possible on Windows as there is no GID support
securityContext.runAsNonRoot - this setting will prevent containers from running as ContainerAdministrator which is the closest equivalent to a root user on Windows.
securityContext.runAsUser - use runAsUserName instead
securityContext.seLinuxOptions - not possible on Windows as SELinux is Linux-specific
terminationMessagePath - this has some limitations in that Windows doesn't support mapping single files. The default value is /dev/termination-log, which does work because it does not exist on Windows by default.
Field compatibility for Pod specifications
The following list documents differences between how Pod specifications work between Windows and Linux:
hostIPC and hostpid - host namespace sharing is not possible on Windows
hostNetwork - see below
dnsPolicy - setting the Pod dnsPolicy to ClusterFirstWithHostNet is not supported on Windows because host networking is not provided. Pods always run with a container network.
podSecurityContext see below
shareProcessNamespace - this is a beta feature, and depends on Linux namespaces which are not implemented on Windows. Windows cannot share process namespaces or the container's root filesystem. Only the network can be shared.
terminationGracePeriodSeconds - this is not fully implemented in Docker on Windows, see the GitHub issue. The behavior today is that the ENTRYPOINT process is sent CTRL_SHUTDOWN_EVENT, then Windows waits 5 seconds by default, and finally shuts down all processes using the normal Windows shutdown behavior. The 5 second default is actually in the Windows registry inside the container, so it can be overridden when the container is built.
volumeDevices - this is a beta feature, and is not implemented on Windows. Windows cannot attach raw block devices to pods.
volumes
If you define an emptyDir volume, you cannot set its volume source to memory.
You cannot enable mountPropagation for volume mounts as this is not supported on Windows.
Field compatibility for hostNetwork
FEATURE STATE: Kubernetes v1.26 [alpha]
The kubelet can now request that pods running on Windows nodes use the host's network namespace instead of creating a new pod network namespace. To enable this functionality pass --feature-gates=WindowsHostNetwork=true to the kubelet.
Note:
This functionality requires a container runtime that supports this functionality.
Field compatibility for Pod security context
Only the securityContext.runAsNonRoot and securityContext.windowsOptions from the Pod securityContext fields work on Windows.
Node problem detector
The node problem detector (see Monitor Node Health) has preliminary support for Windows. For more information, visit the project's GitHub page.
Pause container
In a Kubernetes Pod, an infrastructure or “pause” container is first created to host the container. In Linux, the cgroups and namespaces that make up a pod need a process to maintain their continued existence; the pause process provides this. Containers that belong to the same pod, including infrastructure and worker containers, share a common network endpoint (same IPv4 and / or IPv6 address, same network port spaces). Kubernetes uses pause containers to allow for worker containers crashing or restarting without losing any of the networking configuration.
Kubernetes maintains a multi-architecture image that includes support for Windows. For Kubernetes v1.30.0 the recommended pause image is registry.k8s.io/pause:3.6. The source code is available on GitHub.
Microsoft maintains a different multi-architecture image, with Linux and Windows amd64 support, that you can find as mcr.microsoft.com/oss/kubernetes/pause:3.6. This image is built from the same source as the Kubernetes maintained image but all of the Windows binaries are authenticode signed by Microsoft. The Kubernetes project recommends using the Microsoft maintained image if you are deploying to a production or production-like environment that requires signed binaries.
Container runtimes
You need to install a container runtime into each node in the cluster so that Pods can run there.
The following container runtimes work with Windows:
Note: This section links to third party projects that provide functionality required by Kubernetes. The Kubernetes project authors aren't responsible for these projects, which are listed alphabetically. To add a project to this list, read the content guide before submitting a change. More information.
ContainerD
FEATURE STATE: Kubernetes v1.20 [stable]
You can use ContainerD 1.4.0+ as the container runtime for Kubernetes nodes that run Windows.
Learn how to install ContainerD on a Windows node.
Note:
There is a known limitation when using GMSA with containerd to access Windows network shares, which requires a kernel patch.
Mirantis Container Runtime
Mirantis Container Runtime (MCR) is available as a container runtime for all Windows Server 2019 and later versions.
See Install MCR on Windows Servers for more information.
Windows OS version compatibility
On Windows nodes, strict compatibility rules apply where the host OS version must match the container base image OS version. Only Windows containers with a container operating system of Windows Server 2019 are fully supported.
For Kubernetes v1.30, operating system compatibility for Windows nodes (and Pods) is as follows:
Windows Server LTSC release
Windows Server 2019
Windows Server 2022
Windows Server SAC release
Windows Server version 20H2
The Kubernetes version-skew policy also applies.
Hardware recommendations and considerations
Note: This section links to third party projects that provide functionality required by Kubernetes. The Kubernetes project authors aren't responsible for these projects, which are listed alphabetically. To add a project to this list, read the content guide before submitting a change. More information.
Note:
The following hardware specifications outlined here should be regarded as sensible default values. They are not intended to represent minimum requirements or specific recommendations for production environments. Depending on the requirements for your workload these values may need to be adjusted.
64-bit processor 4 CPU cores or more, capable of supporting virtualization
8GB or more of RAM
50GB or more of free disk space
Refer to Hardware requirements for Windows Server Microsoft documentation for the most up-to-date information on minimum hardware requirements. For guidance on deciding on resources for production worker nodes refer to Production worker nodes Kubernetes documentation.
To optimize system resources, if a graphical user interface is not required, it may be preferable to use a Windows Server OS installation that excludes the Windows Desktop Experience installation option, as this configuration typically frees up more system resources.
In assessing disk space for Windows worker nodes, take note that Windows container images are typically larger than Linux container images, with container image sizes ranging from 300MB to over 10GB for a single image. Additionally, take note that the C: drive in Windows containers represents a virtual free size of 20GB by default, which is not the actual consumed space, but rather the disk size for which a single container can grow to occupy when using local storage on the host. See Containers on Windows - Container Storage Documentation for more detail.
Getting help and troubleshooting
Your main source of help for troubleshooting your Kubernetes cluster should start with the Troubleshooting page.
Some additional, Windows-specific troubleshooting help is included in this section. Logs are an important element of troubleshooting issues in Kubernetes. Make sure to include them any time you seek troubleshooting assistance from other contributors. Follow the instructions in the SIG Windows contributing guide on gathering logs.
Reporting issues and feature requests
If you have what looks like a bug, or you would like to make a feature request, please follow the SIG Windows contributing guide to create a new issue. You should first search the list of issues in case it was reported previously and comment with your experience on the issue and add additional logs. SIG Windows channel on the Kubernetes Slack is also a great avenue to get some initial support and troubleshooting ideas prior to creating a ticket.
Validating the Windows cluster operability
The Kubernetes project provides a Windows Operational Readiness specification, accompanied by a structured test suite. This suite is split into two sets of tests, core and extended, each containing categories aimed at testing specific areas. It can be used to validate all the functionalities of a Windows and hybrid system (mixed with Linux nodes) with full coverage.
To set up the project on a newly created cluster, refer to the instructions in the project guide.
Deployment tools
The kubeadm tool helps you to deploy a Kubernetes cluster, providing the control plane to manage the cluster it, and nodes to run your workloads.
The Kubernetes cluster API project also provides means to automate deployment of Windows nodes.
Windows distribution channels
For a detailed explanation of Windows distribution channels see the Microsoft documentation.
Information on the different Windows Server servicing channels including their support models can be found at Windows Server servicing channels.
12.2 - Guide for Running Windows Containers in Kubernetes
This page provides a walkthrough for some steps you can follow to run Windows containers using Kubernetes. The page also highlights some Windows specific functionality within Kubernetes.
It is important to note that creating and deploying services and workloads on Kubernetes behaves in much the same way for Linux and Windows containers. The kubectl commands to interface with the cluster are identical. The examples in this page are provided to jumpstart your experience with Windows containers.
Objectives
Configure an example deployment to run Windows containers on a Windows node.
Before you begin
You should already have access to a Kubernetes cluster that includes a worker node running Windows Server.
Getting Started: Deploying a Windows workload
The example YAML file below deploys a simple webserver application running inside a Windows container.
Create a manifest named win-webserver.yaml with the contents below:
---
apiVersion: v1
kind: Service
metadata:
name: win-webserver
labels:
app: win-webserver
spec:
ports:
# the port that this service should serve on
- port: 80
targetPort: 80
selector:
app: win-webserver
type: NodePort
---
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: win-webserver
name: win-webserver
spec:
replicas: 2
selector:
matchLabels:
app: win-webserver
template:
metadata:
labels:
app: win-webserver
name: win-webserver
spec:
containers:
- name: windowswebserver
image: mcr.microsoft.com/windows/servercore:ltsc2019
command:
- powershell.exe
- -command
- "<#code used from https://gist.github.com/19WAS85/5424431#> ; $$listener = New-Object System.Net.HttpListener ; $$listener.Prefixes.Add('http://*:80/') ; $$listener.Start() ; $$callerCounts = @{} ; Write-Host('Listening at http://*:80/') ; while ($$listener.IsListening) { ;$$context = $$listener.GetContext() ;$$requestUrl = $$context.Request.Url ;$$clientIP = $$context.Request.RemoteEndPoint.Address ;$$response = $$context.Response ;Write-Host '' ;Write-Host('> {0}' -f $$requestUrl) ; ;$$count = 1 ;$$k=$$callerCounts.Get_Item($$clientIP) ;if ($$k -ne $$null) { $$count += $$k } ;$$callerCounts.Set_Item($$clientIP, $$count) ;$$ip=(Get-NetAdapter | Get-NetIpAddress); $$header='Windows Container Web Server
' ;$$callerCountsString='' ;$$callerCounts.Keys | % { $$callerCountsString+='IP {0} callerCount {1} ' -f $$ip[1].IPAddress,$$callerCounts.Item($$_) } ;$$footer='' ;$$content='{0}{1}{2}' -f $$header,$$callerCountsString,$$footer ;Write-Output $$content ;$$buffer = [System.Text.Encoding]::UTF8.GetBytes($$content) ;$$response.ContentLength64 = $$buffer.Length ;$$response.OutputStream.Write($$buffer, 0, $$buffer.Length) ;$$response.Close() ;$$responseStatus = $$response.StatusCode ;Write-Host('< {0}' -f $$responseStatus) } ; "
nodeSelector:
kubernetes.io/os: windows
Note:
Port mapping is also supported, but for simplicity this example exposes port 80 of the container directly to the Service.
Check that all nodes are healthy:
kubectl get nodes
Deploy the service and watch for pod updates:
kubectl apply -f win-webserver.yaml
kubectl get pods -o wide -w
When the service is deployed correctly both Pods are marked as Ready. To exit the watch command, press Ctrl+C.
Check that the deployment succeeded. To verify:
Several pods listed from the Linux control plane node, use kubectl get pods
Node-to-pod communication across the network, curl port 80 of your pod IPs from the Linux control plane node to check for a web server response
Pod-to-pod communication, ping between pods (and across hosts, if you have more than one Windows node) using kubectl exec
Service-to-pod communication, curl the virtual service IP (seen under kubectl get services) from the Linux control plane node and from individual pods
Service discovery, curl the service name with the Kubernetes default DNS suffix
Inbound connectivity, curl the NodePort from the Linux control plane node or machines outside of the cluster
Outbound connectivity, curl external IPs from inside the pod using kubectl exec
Note:
Windows container hosts are not able to access the IP of services scheduled on them due to current platform limitations of the Windows networking stack. Only Windows pods are able to access service IPs.
Observability
Capturing logs from workloads
Logs are an important element of observability; they enable users to gain insights into the operational aspect of workloads and are a key ingredient to troubleshooting issues. Because Windows containers and workloads inside Windows containers behave differently from Linux containers, users had a hard time collecting logs, limiting operational visibility. Windows workloads for example are usually configured to log to ETW (Event Tracing for Windows) or push entries to the application event log. LogMonitor, an open source tool by Microsoft, is the recommended way to monitor configured log sources inside a Windows container. LogMonitor supports monitoring event logs, ETW providers, and custom application logs, piping them to STDOUT for consumption by kubectl logs .
Follow the instructions in the LogMonitor GitHub page to copy its binaries and configuration files to all your containers and add the necessary entrypoints for LogMonitor to push your logs to STDOUT.
Configuring container user
Using configurable Container usernames
Windows containers can be configured to run their entrypoints and processes with different usernames than the image defaults. Learn more about it here.
Managing Workload Identity with Group Managed Service Accounts
Windows container workloads can be configured to use Group Managed Service Accounts (GMSA). Group Managed Service Accounts are a specific type of Active Directory account that provide automatic password management, simplified service principal name (SPN) management, and the ability to delegate the management to other administrators across multiple servers. Containers configured with a GMSA can access external Active Directory Domain resources while carrying the identity configured with the GMSA. Learn more about configuring and using GMSA for Windows containers here.
Taints and tolerations
Users need to use some combination of taint and node selectors in order to schedule Linux and Windows workloads to their respective OS-specific nodes. The recommended approach is outlined below, with one of its main goals being that this approach should not break compatibility for existing Linux workloads.
You can (and should) set .spec.os.name for each Pod, to indicate the operating system that the containers in that Pod are designed for. For Pods that run Linux containers, set .spec.os.name to linux. For Pods that run Windows containers, set .spec.os.name to windows.
Note:
If you are running a version of Kubernetes older than 1.24, you may need to enable the IdentifyPodOS feature gate to be able to set a value for .spec.pod.os.
The scheduler does not use the value of .spec.os.name when assigning Pods to nodes. You should use normal Kubernetes mechanisms for assigning pods to nodes to ensure that the control plane for your cluster places pods onto nodes that are running the appropriate operating system.
The .spec.os.name value has no effect on the scheduling of the Windows pods, so taints and tolerations (or node selectors) are still required to ensure that the Windows pods land onto appropriate Windows nodes.
Ensuring OS-specific workloads land on the appropriate container host
Users can ensure Windows containers can be scheduled on the appropriate host using taints and tolerations. All Kubernetes nodes running Kubernetes 1.30 have the following default labels:
kubernetes.io/os = [windows|linux]
kubernetes.io/arch = [amd64|arm64|...]
If a Pod specification does not specify a nodeSelector such as "kubernetes.io/os": windows, it is possible the Pod can be scheduled on any host, Windows or Linux. This can be problematic since a Windows container can only run on Windows and a Linux container can only run on Linux. The best practice for Kubernetes 1.30 is to use a nodeSelector.
However, in many cases users have a pre-existing large number of deployments for Linux containers, as well as an ecosystem of off-the-shelf configurations, such as community Helm charts, and programmatic Pod generation cases, such as with operators. In those situations, you may be hesitant to make the configuration change to add nodeSelector fields to all Pods and Pod templates. The alternative is to use taints. Because the kubelet can set taints during registration, it could easily be modified to automatically add a taint when running on Windows only.
For example: --register-with-taints='os=windows:NoSchedule'
By adding a taint to all Windows nodes, nothing will be scheduled on them (that includes existing Linux Pods). In order for a Windows Pod to be scheduled on a Windows node, it would need both the nodeSelector and the appropriate matching toleration to choose Windows.
nodeSelector:
kubernetes.io/os: windows
node.kubernetes.io/windows-build: '10.0.17763'
tolerations:
- key: "os"
operator: "Equal"
value: "windows"
effect: "NoSchedule"
Handling multiple Windows versions in the same cluster
The Windows Server version used by each pod must match that of the node. If you want to use multiple Windows Server versions in the same cluster, then you should set additional node labels and nodeSelector fields.
Kubernetes automatically adds a label, node.kubernetes.io/windows-build to simplify this.
This label reflects the Windows major, minor, and build number that need to match for compatibility. Here are values used for each Windows Server version:
Product Name Version
Windows Server 2019 10.0.17763
Windows Server 2022 10.0.20348
Simplifying with RuntimeClass
RuntimeClass can be used to simplify the process of using taints and tolerations. A cluster administrator can create a RuntimeClass object which is used to encapsulate these taints and tolerations.
Save this file to runtimeClasses.yml. It includes the appropriate nodeSelector for the Windows OS, architecture, and version.
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: windows-2019
handler: example-container-runtime-handler
scheduling:
nodeSelector:
kubernetes.io/os: 'windows'
kubernetes.io/arch: 'amd64'
node.kubernetes.io/windows-build: '10.0.17763'
tolerations:
- effect: NoSchedule
key: os
operator: Equal
value: "windows"
Run kubectl create -f runtimeClasses.yml using as a cluster administrator
Add runtimeClassName: windows-2019 as appropriate to Pod specs
For example:
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: iis-2019
labels:
app: iis-2019
spec:
replicas: 1
template:
metadata:
name: iis-2019
labels:
app: iis-2019
spec:
runtimeClassName: windows-2019
containers:
- name: iis
image: mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019
resources:
limits:
cpu: 1
memory: 800Mi
requests:
cpu: .1
memory: 300Mi
ports:
- containerPort: 80
selector:
matchLabels:
app: iis-2019
---
apiVersion: v1
kind: Service
metadata:
name: iis
spec:
type: LoadBalancer
ports:
- protocol: TCP
port: 80
selector:
app: iis-2019
13 - Extending Kubernetes
Different ways to change the behavior of your Kubernetes cluster.
Kubernetes is highly configurable and extensible. As a result, there is rarely a need to fork or submit patches to the Kubernetes project code.
This guide describes the options for customizing a Kubernetes cluster. It is aimed at cluster operators who want to understand how to adapt their Kubernetes cluster to the needs of their work environment. Developers who are prospective Platform Developers or Kubernetes Project Contributors will also find it useful as an introduction to what extension points and patterns exist, and their trade-offs and limitations.
Customization approaches can be broadly divided into configuration, which only involves changing command line arguments, local configuration files, or API resources; and extensions, which involve running additional programs, additional network services, or both. This document is primarily about extensions.
Configuration
Configuration files and command arguments are documented in the Reference section of the online documentation, with a page for each binary:
kube-apiserver
kube-controller-manager
kube-scheduler
kubelet
kube-proxy
Command arguments and configuration files may not always be changeable in a hosted Kubernetes service or a distribution with managed installation. When they are changeable, they are usually only changeable by the cluster operator. Also, they are subject to change in future Kubernetes versions, and setting them may require restarting processes. For those reasons, they should be used only when there are no other options.
Built-in policy APIs, such as ResourceQuota, NetworkPolicy and Role-based Access Control (RBAC), are built-in Kubernetes APIs that provide declaratively configured policy settings. APIs are typically usable even with hosted Kubernetes services and with managed Kubernetes installations. The built-in policy APIs follow the same conventions as other Kubernetes resources such as Pods. When you use a policy APIs that is stable, you benefit from a defined support policy like other Kubernetes APIs. For these reasons, policy APIs are recommended over configuration files and command arguments where suitable.
Extensions
Extensions are software components that extend and deeply integrate with Kubernetes. They adapt it to support new types and new kinds of hardware.
Many cluster administrators use a hosted or distribution instance of Kubernetes. These clusters come with extensions pre-installed. As a result, most Kubernetes users will not need to install extensions and even fewer users will need to author new ones.
Extension patterns
Kubernetes is designed to be automated by writing client programs. Any program that reads and/or writes to the Kubernetes API can provide useful automation. Automation can run on the cluster or off it. By following the guidance in this doc you can write highly available and robust automation. Automation generally works with any Kubernetes cluster, including hosted clusters and managed installations.
There is a specific pattern for writing client programs that work well with Kubernetes called the controller pattern. Controllers typically read an object's .spec, possibly do things, and then update the object's .status.
A controller is a client of the Kubernetes API. When Kubernetes is the client and calls out to a remote service, Kubernetes calls this a webhook. The remote service is called a webhook backend. As with custom controllers, webhooks do add a point of failure.
Note:
Outside of Kubernetes, the term “webhook” typically refers to a mechanism for asynchronous notifications, where the webhook call serves as a one-way notification to another system or component. In the Kubernetes ecosystem, even synchronous HTTP callouts are often described as “webhooks”.
In the webhook model, Kubernetes makes a network request to a remote service. With the alternative binary Plugin model, Kubernetes executes a binary (program). Binary plugins are used by the kubelet (for example, CSI storage plugins and CNI network plugins), and by kubectl (see Extend kubectl with plugins).
Extension points
This diagram shows the extension points in a Kubernetes cluster and the clients that access it.
Symbolic representation of seven numbered extension points for Kubernetes
Kubernetes extension points
Key to the figure
Users often interact with the Kubernetes API using kubectl. Plugins customise the behaviour of clients. There are generic extensions that can apply to different clients, as well as specific ways to extend kubectl.
The API server handles all requests. Several types of extension points in the API server allow authenticating requests, or blocking them based on their content, editing content, and handling deletion. These are described in the API Access Extensions section.
The API server serves various kinds of resources. Built-in resource kinds, such as pods, are defined by the Kubernetes project and can't be changed. Read API extensions to learn about extending the Kubernetes API.
The Kubernetes scheduler decides which nodes to place pods on. There are several ways to extend scheduling, which are described in the Scheduling extensions section.
Much of the behavior of Kubernetes is implemented by programs called controllers, that are clients of the API server. Controllers are often used in conjunction with custom resources. Read combining new APIs with automation and Changing built-in resources to learn more.
The kubelet runs on servers (nodes), and helps pods appear like virtual servers with their own IPs on the cluster network. Network Plugins allow for different implementations of pod networking.
You can use Device Plugins to integrate custom hardware or other special node-local facilities, and make these available to Pods running in your cluster. The kubelet includes support for working with device plugins.
The kubelet also mounts and unmounts volume for pods and their containers. You can use Storage Plugins to add support for new kinds of storage and other volume types.
Extension point choice flowchart
If you are unsure where to start, this flowchart can help. Note that some solutions may involve several types of extensions.
Flowchart with questions about use cases and guidance for implementers. Green circles indicate yes; red circles indicate no.
Flowchart guide to select an extension approach
Client extensions
Plugins for kubectl are separate binaries that add or replace the behavior of specific subcommands. The kubectl tool can also integrate with credential plugins These extensions only affect a individual user's local environment, and so cannot enforce site-wide policies.
If you want to extend the kubectl tool, read Extend kubectl with plugins.
API extensions
Custom resource definitions
Consider adding a Custom Resource to Kubernetes if you want to define new controllers, application configuration objects or other declarative APIs, and to manage them using Kubernetes tools, such as kubectl.
For more about Custom Resources, see the Custom Resources concept guide.
API aggregation layer
You can use Kubernetes' API Aggregation Layer to integrate the Kubernetes API with additional services such as for metrics.
Combining new APIs with automation
A combination of a custom resource API and a control loop is called the controllers pattern. If your controller takes the place of a human operator deploying infrastructure based on a desired state, then the controller may also be following the operator pattern. The Operator pattern is used to manage specific applications; usually, these are applications that maintain state and require care in how they are managed.
You can also make your own custom APIs and control loops that manage other resources, such as storage, or to define policies (such as an access control restriction).
Changing built-in resources
When you extend the Kubernetes API by adding custom resources, the added resources always fall into a new API Groups. You cannot replace or change existing API groups. Adding an API does not directly let you affect the behavior of existing APIs (such as Pods), whereas API Access Extensions do.
API access extensions
When a request reaches the Kubernetes API Server, it is first authenticated, then authorized, and is then subject to various types of admission control (some requests are in fact not authenticated, and get special treatment). See Controlling Access to the Kubernetes API for more on this flow.
Each of the steps in the Kubernetes authentication / authorization flow offers extension points.
Authentication
Authentication maps headers or certificates in all requests to a username for the client making the request.
Kubernetes has several built-in authentication methods that it supports. It can also sit behind an authenticating proxy, and it can send a token from an Authorization: header to a remote service for verification (an authentication webhook) if those don't meet your needs.
Authorization
Authorization determines whether specific users can read, write, and do other operations on API resources. It works at the level of whole resources -- it doesn't discriminate based on arbitrary object fields.
If the built-in authorization options don't meet your needs, an authorization webhook allows calling out to custom code that makes an authorization decision.
Dynamic admission control
After a request is authorized, if it is a write operation, it also goes through Admission Control steps. In addition to the built-in steps, there are several extensions:
The Image Policy webhook restricts what images can be run in containers.
To make arbitrary admission control decisions, a general Admission webhook can be used. Admission webhooks can reject creations or updates. Some admission webhooks modify the incoming request data before it is handled further by Kubernetes.
Infrastructure extensions
Device plugins
Device plugins allow a node to discover new Node resources (in addition to the builtin ones like cpu and memory) via a Device Plugin.
Storage plugins
Container Storage Interface (CSI) plugins provide a way to extend Kubernetes with supports for new kinds of volumes. The volumes can be backed by durable external storage, or provide ephemeral storage, or they might offer a read-only interface to information using a filesystem paradigm.
Kubernetes also includes support for FlexVolume plugins, which are deprecated since Kubernetes v1.23 (in favour of CSI).
FlexVolume plugins allow users to mount volume types that aren't natively supported by Kubernetes. When you run a Pod that relies on FlexVolume storage, the kubelet calls a binary plugin to mount the volume. The archived FlexVolume design proposal has more detail on this approach.
The Kubernetes Volume Plugin FAQ for Storage Vendors includes general information on storage plugins.
Network plugins
Your Kubernetes cluster needs a network plugin in order to have a working Pod network and to support other aspects of the Kubernetes network model.
Network Plugins allow Kubernetes to work with different networking topologies and technologies.
Kubelet image credential provider plugins
FEATURE STATE: Kubernetes v1.26 [stable]
Kubelet image credential providers are plugins for the kubelet to dynamically retrieve image registry credentials. The credentials are then used when pulling images from container image registries that match the configuration.
The plugins can communicate with external services or use local files to obtain credentials. This way, the kubelet does not need to have static credentials for each registry, and can support various authentication methods and protocols.
For plugin configuration details, see Configure a kubelet image credential provider.
Scheduling extensions
The scheduler is a special type of controller that watches pods, and assigns pods to nodes. The default scheduler can be replaced entirely, while continuing to use other Kubernetes components, or multiple schedulers can run at the same time.
This is a significant undertaking, and almost all Kubernetes users find they do not need to modify the scheduler.
You can control which scheduling plugins are active, or associate sets of plugins with different named scheduler profiles. You can also write your own plugin that integrates with one or more of the kube-scheduler's extension points.
Finally, the built in kube-scheduler component supports a webhook that permits a remote HTTP backend (scheduler extension) to filter and / or prioritize the nodes that the kube-scheduler chooses for a pod.
Note:
You can only affect node filtering and node prioritization with a scheduler extender webhook; other extension points are not available through the webhook integration.
What's next
Learn more about infrastructure extensions
Device Plugins
Network Plugins
CSI storage plugins
Learn about kubectl plugins
Learn more about Custom Resources
Learn more about Extension API Servers
Learn about Dynamic admission control
Learn about the Operator pattern
13.1 - Compute, Storage, and Networking Extensions
This section covers extensions to your cluster that do not come as part as Kubernetes itself. You can use these extensions to enhance the nodes in your cluster, or to provide the network fabric that links Pods together.
CSI and FlexVolume storage plugins
Container Storage Interface (CSI) plugins provide a way to extend Kubernetes with supports for new kinds of volumes. The volumes can be backed by durable external storage, or provide ephemeral storage, or they might offer a read-only interface to information using a filesystem paradigm.
Kubernetes also includes support for FlexVolume plugins, which are deprecated since Kubernetes v1.23 (in favour of CSI).
FlexVolume plugins allow users to mount volume types that aren't natively supported by Kubernetes. When you run a Pod that relies on FlexVolume storage, the kubelet calls a binary plugin to mount the volume. The archived FlexVolume design proposal has more detail on this approach.
The Kubernetes Volume Plugin FAQ for Storage Vendors includes general information on storage plugins.
Device plugins
Device plugins allow a node to discover new Node facilities (in addition to the built-in node resources such as cpu and memory), and provide these custom node-local facilities to Pods that request them.
Network plugins
A network plugin allow Kubernetes to work with different networking topologies and technologies. Your Kubernetes cluster needs a network plugin in order to have a working Pod network and to support other aspects of the Kubernetes network model.
Kubernetes 1.30 is compatible with CNI network plugins.
13.1.1 - Network Plugins
Kubernetes 1.30 supports Container Network Interface (CNI) plugins for cluster networking. You must use a CNI plugin that is compatible with your cluster and that suits your needs. Different plugins are available (both open- and closed- source) in the wider Kubernetes ecosystem.
A CNI plugin is required to implement the Kubernetes network model.
You must use a CNI plugin that is compatible with the v0.4.0 or later releases of the CNI specification. The Kubernetes project recommends using a plugin that is compatible with the v1.0.0 CNI specification (plugins can be compatible with multiple spec versions).
Installation
A Container Runtime, in the networking context, is a daemon on a node configured to provide CRI Services for kubelet. In particular, the Container Runtime must be configured to load the CNI plugins required to implement the Kubernetes network model.
Note:
Prior to Kubernetes 1.24, the CNI plugins could also be managed by the kubelet using the cni-bin-dir and network-plugin command-line parameters. These command-line parameters were removed in Kubernetes 1.24, with management of the CNI no longer in scope for kubelet.
See Troubleshooting CNI plugin-related errors if you are facing issues following the removal of dockershim.
For specific information about how a Container Runtime manages the CNI plugins, see the documentation for that Container Runtime, for example:
containerd
CRI-O
For specific information about how to install and manage a CNI plugin, see the documentation for that plugin or networking provider.
Network Plugin Requirements
Loopback CNI
In addition to the CNI plugin installed on the nodes for implementing the Kubernetes network model, Kubernetes also requires the container runtimes to provide a loopback interface lo, which is used for each sandbox (pod sandboxes, vm sandboxes, ...). Implementing the loopback interface can be accomplished by re-using the CNI loopback plugin. or by developing your own code to achieve this (see this example from CRI-O).
Support hostPort
The CNI networking plugin supports hostPort. You can use the official portmap plugin offered by the CNI plugin team or use your own plugin with portMapping functionality.
If you want to enable hostPort support, you must specify portMappings capability in your cni-conf-dir. For example:
{
"name": "k8s-pod-network",
"cniVersion": "0.4.0",
"plugins": [
{
"type": "calico",
"log_level": "info",
"datastore_type": "kubernetes",
"nodename": "127.0.0.1",
"ipam": {
"type": "host-local",
"subnet": "usePodCidr"
},
"policy": {
"type": "k8s"
},
"kubernetes": {
"kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
}
},
{
"type": "portmap",
"capabilities": {"portMappings": true},
"externalSetMarkChain": "KUBE-MARK-MASQ"
}
]
}
Support traffic shaping
Experimental Feature
The CNI networking plugin also supports pod ingress and egress traffic shaping. You can use the official bandwidth plugin offered by the CNI plugin team or use your own plugin with bandwidth control functionality.
If you want to enable traffic shaping support, you must add the bandwidth plugin to your CNI configuration file (default /etc/cni/net.d) and ensure that the binary is included in your CNI bin dir (default /opt/cni/bin).
{
"name": "k8s-pod-network",
"cniVersion": "0.4.0",
"plugins": [
{
"type": "calico",
"log_level": "info",
"datastore_type": "kubernetes",
"nodename": "127.0.0.1",
"ipam": {
"type": "host-local",
"subnet": "usePodCidr"
},
"policy": {
"type": "k8s"
},
"kubernetes": {
"kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
}
},
{
"type": "bandwidth",
"capabilities": {"bandwidth": true}
}
]
}
Now you can add the kubernetes.io/ingress-bandwidth and kubernetes.io/egress-bandwidth annotations to your Pod. For example:
apiVersion: v1
kind: Pod
metadata:
annotations:
kubernetes.io/ingress-bandwidth: 1M
kubernetes.io/egress-bandwidth: 1M
...
What's next
Learn more about Cluster Networking
Learn more about Network Policies
Learn about the Troubleshooting CNI plugin-related errors
13.1.2 - Device Plugins
Device plugins let you configure your cluster with support for devices or resources that require vendor-specific setup, such as GPUs, NICs, FPGAs, or non-volatile main memory.
FEATURE STATE: Kubernetes v1.26 [stable]
Kubernetes provides a device plugin framework that you can use to advertise system hardware resources to the Kubelet.
Instead of customizing the code for Kubernetes itself, vendors can implement a device plugin that you deploy either manually or as a DaemonSet. The targeted devices include GPUs, high-performance NICs, FPGAs, InfiniBand adapters, and other similar computing resources that may require vendor specific initialization and setup.
Device plugin registration
The kubelet exports a Registration gRPC service:
service Registration {
rpc Register(RegisterRequest) returns (Empty) {}
}
A device plugin can register itself with the kubelet through this gRPC service. During the registration, the device plugin needs to send:
The name of its Unix socket.
The Device Plugin API version against which it was built.
The ResourceName it wants to advertise. Here ResourceName needs to follow the extended resource naming scheme as vendor-domain/resourcetype. (For example, an NVIDIA GPU is advertised as nvidia.com/gpu.)
Following a successful registration, the device plugin sends the kubelet the list of devices it manages, and the kubelet is then in charge of advertising those resources to the API server as part of the kubelet node status update. For example, after a device plugin registers hardware-vendor.example/foo with the kubelet and reports two healthy devices on a node, the node status is updated to advertise that the node has 2 "Foo" devices installed and available.
Then, users can request devices as part of a Pod specification (see container). Requesting extended resources is similar to how you manage requests and limits for other resources, with the following differences:
Extended resources are only supported as integer resources and cannot be overcommitted.
Devices cannot be shared between containers.
Example
Suppose a Kubernetes cluster is running a device plugin that advertises resource hardware-vendor.example/foo on certain nodes. Here is an example of a pod requesting this resource to run a demo workload:
---
apiVersion: v1
kind: Pod
metadata:
name: demo-pod
spec:
containers:
- name: demo-container-1
image: registry.k8s.io/pause:2.0
resources:
limits:
hardware-vendor.example/foo: 2
#
# This Pod needs 2 of the hardware-vendor.example/foo devices
# and can only schedule onto a Node that's able to satisfy
# that need.
#
# If the Node has more than 2 of those devices available, the
# remainder would be available for other Pods to use.
Device plugin implementation
The general workflow of a device plugin includes the following steps:
Initialization. During this phase, the device plugin performs vendor-specific initialization and setup to make sure the devices are in a ready state.
The plugin starts a gRPC service, with a Unix socket under the host path /var/lib/kubelet/device-plugins/, that implements the following interfaces:
service DevicePlugin {
// GetDevicePluginOptions returns options to be communicated with Device Manager.
rpc GetDevicePluginOptions(Empty) returns (DevicePluginOptions) {}
// ListAndWatch returns a stream of List of Devices
// Whenever a Device state change or a Device disappears, ListAndWatch
// returns the new list
rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}
// Allocate is called during container creation so that the Device
// Plugin can run device specific operations and instruct Kubelet
// of the steps to make the Device available in the container
rpc Allocate(AllocateRequest) returns (AllocateResponse) {}
// GetPreferredAllocation returns a preferred set of devices to allocate
// from a list of available ones. The resulting preferred allocation is not
// guaranteed to be the allocation ultimately performed by the
// devicemanager. It is only designed to help the devicemanager make a more
// informed allocation decision when possible.
rpc GetPreferredAllocation(PreferredAllocationRequest) returns (PreferredAllocationResponse) {}
// PreStartContainer is called, if indicated by Device Plugin during registration phase,
// before each container start. Device plugin can run device specific operations
// such as resetting the device before making devices available to the container.
rpc PreStartContainer(PreStartContainerRequest) returns (PreStartContainerResponse) {}
}
Note:
Plugins are not required to provide useful implementations for GetPreferredAllocation() or PreStartContainer(). Flags indicating the availability of these calls, if any, should be set in the DevicePluginOptions message sent back by a call to GetDevicePluginOptions(). The kubelet will always call GetDevicePluginOptions() to see which optional functions are available, before calling any of them directly.
The plugin registers itself with the kubelet through the Unix socket at host path /var/lib/kubelet/device-plugins/kubelet.sock.
Note:
The ordering of the workflow is important. A plugin MUST start serving gRPC service before registering itself with kubelet for successful registration.
After successfully registering itself, the device plugin runs in serving mode, during which it keeps monitoring device health and reports back to the kubelet upon any device state changes. It is also responsible for serving Allocate gRPC requests. During Allocate, the device plugin may do device-specific preparation; for example, GPU cleanup or QRNG initialization. If the operations succeed, the device plugin returns an AllocateResponse that contains container runtime configurations for accessing the allocated devices. The kubelet passes this information to the container runtime.
An AllocateResponse contains zero or more ContainerAllocateResponse objects. In these, the device plugin defines modifications that must be made to a container's definition to provide access to the device. These modifications include:
Annotations
device nodes
environment variables
mounts
fully-qualified CDI device names
Note:
The processing of the fully-qualified CDI device names by the Device Manager requires that the DevicePluginCDIDevices feature gate is enabled for both the kubelet and the kube-apiserver. This was added as an alpha feature in Kubernetes v1.28 and graduated to beta in v1.29.
Handling kubelet restarts
A device plugin is expected to detect kubelet restarts and re-register itself with the new kubelet instance. A new kubelet instance deletes all the existing Unix sockets under /var/lib/kubelet/device-plugins when it starts. A device plugin can monitor the deletion of its Unix socket and re-register itself upon such an event.
Device plugin deployment
You can deploy a device plugin as a DaemonSet, as a package for your node's operating system, or manually.
The canonical directory /var/lib/kubelet/device-plugins requires privileged access, so a device plugin must run in a privileged security context. If you're deploying a device plugin as a DaemonSet, /var/lib/kubelet/device-plugins must be mounted as a Volume in the plugin's PodSpec.
If you choose the DaemonSet approach you can rely on Kubernetes to: place the device plugin's Pod onto Nodes, to restart the daemon Pod after failure, and to help automate upgrades.
API compatibility
Previously, the versioning scheme required the Device Plugin's API version to match exactly the Kubelet's version. Since the graduation of this feature to Beta in v1.12 this is no longer a hard requirement. The API is versioned and has been stable since Beta graduation of this feature. Because of this, kubelet upgrades should be seamless but there still may be changes in the API before stabilization making upgrades not guaranteed to be non-breaking.
Note:
Although the Device Manager component of Kubernetes is a generally available feature, the device plugin API is not stable. For information on the device plugin API and version compatibility, read Device Plugin API versions.
As a project, Kubernetes recommends that device plugin developers:
Watch for Device Plugin API changes in the future releases.
Support multiple versions of the device plugin API for backward/forward compatibility.
To run device plugins on nodes that need to be upgraded to a Kubernetes release with a newer device plugin API version, upgrade your device plugins to support both versions before upgrading these nodes. Taking that approach will ensure the continuous functioning of the device allocations during the upgrade.
Monitoring device plugin resources
FEATURE STATE: Kubernetes v1.28 [stable]
In order to monitor resources provided by device plugins, monitoring agents need to be able to discover the set of devices that are in-use on the node and obtain metadata to describe which container the metric should be associated with. Prometheus metrics exposed by device monitoring agents should follow the Kubernetes Instrumentation Guidelines, identifying containers using pod, namespace, and container prometheus labels.
The kubelet provides a gRPC service to enable discovery of in-use devices, and to provide metadata for these devices:
// PodResourcesLister is a service provided by the kubelet that provides information about the
// node resources consumed by pods and containers on the node
service PodResourcesLister {
rpc List(ListPodResourcesRequest) returns (ListPodResourcesResponse) {}
rpc GetAllocatableResources(AllocatableResourcesRequest) returns (AllocatableResourcesResponse) {}
rpc Get(GetPodResourcesRequest) returns (GetPodResourcesResponse) {}
}
List gRPC endpoint
The List endpoint provides information on resources of running pods, with details such as the id of exclusively allocated CPUs, device id as it was reported by device plugins and id of the NUMA node where these devices are allocated. Also, for NUMA-based machines, it contains the information about memory and hugepages reserved for a container.
Starting from Kubernetes v1.27, the List endpoint can provide information on resources of running pods allocated in ResourceClaims by the DynamicResourceAllocation API. To enable this feature kubelet must be started with the following flags:
--feature-gates=DynamicResourceAllocation=true,KubeletPodResourcesDynamicResources=true
// ListPodResourcesResponse is the response returned by List function
message ListPodResourcesResponse {
repeated PodResources pod_resources = 1;
}
// PodResources contains information about the node resources assigned to a pod
message PodResources {
string name = 1;
string namespace = 2;
repeated ContainerResources containers = 3;
}
// ContainerResources contains information about the resources assigned to a container
message ContainerResources {
string name = 1;
repeated ContainerDevices devices = 2;
repeated int64 cpu_ids = 3;
repeated ContainerMemory memory = 4;
repeated DynamicResource dynamic_resources = 5;
}
// ContainerMemory contains information about memory and hugepages assigned to a container
message ContainerMemory {
string memory_type = 1;
uint64 size = 2;
TopologyInfo topology = 3;
}
// Topology describes hardware topology of the resource
message TopologyInfo {
repeated NUMANode nodes = 1;
}
// NUMA representation of NUMA node
message NUMANode {
int64 ID = 1;
}
// ContainerDevices contains information about the devices assigned to a container
message ContainerDevices {
string resource_name = 1;
repeated string device_ids = 2;
TopologyInfo topology = 3;
}
// DynamicResource contains information about the devices assigned to a container by Dynamic Resource Allocation
message DynamicResource {
string class_name = 1;
string claim_name = 2;
string claim_namespace = 3;
repeated ClaimResource claim_resources = 4;
}
// ClaimResource contains per-plugin resource information
message ClaimResource {
repeated CDIDevice cdi_devices = 1 [(gogoproto.customname) = "CDIDevices"];
}
// CDIDevice specifies a CDI device information
message CDIDevice {
// Fully qualified CDI device name
// for example: vendor.com/gpu=gpudevice1
// see more details in the CDI specification:
// https://github.com/container-orchestrated-devices/container-device-interface/blob/main/SPEC.md
string name = 1;
}
Note:
cpu_ids in the ContainerResources in the List endpoint correspond to exclusive CPUs allocated to a particular container. If the goal is to evaluate CPUs that belong to the shared pool, the List endpoint needs to be used in conjunction with the GetAllocatableResources endpoint as explained below:
Call GetAllocatableResources to get a list of all the allocatable CPUs
Call GetCpuIds on all ContainerResources in the system
Subtract out all of the CPUs from the GetCpuIds calls from the GetAllocatableResources call
GetAllocatableResources gRPC endpoint
FEATURE STATE: Kubernetes v1.28 [stable]
GetAllocatableResources provides information on resources initially available on the worker node. It provides more information than kubelet exports to APIServer.
Note:
GetAllocatableResources should only be used to evaluate allocatable resources on a node. If the goal is to evaluate free/unallocated resources it should be used in conjunction with the List() endpoint. The result obtained by GetAllocatableResources would remain the same unless the underlying resources exposed to kubelet change. This happens rarely but when it does (for example: hotplug/hotunplug, device health changes), client is expected to call GetAlloctableResources endpoint.
However, calling GetAllocatableResources endpoint is not sufficient in case of cpu and/or memory update and Kubelet needs to be restarted to reflect the correct resource capacity and allocatable.
// AllocatableResourcesResponses contains information about all the devices known by the kubelet
message AllocatableResourcesResponse {
repeated ContainerDevices devices = 1;
repeated int64 cpu_ids = 2;
repeated ContainerMemory memory = 3;
}
ContainerDevices do expose the topology information declaring to which NUMA cells the device is affine. The NUMA cells are identified using a opaque integer ID, which value is consistent to what device plugins report when they register themselves to the kubelet.
The gRPC service is served over a unix socket at /var/lib/kubelet/pod-resources/kubelet.sock. Monitoring agents for device plugin resources can be deployed as a daemon, or as a DaemonSet. The canonical directory /var/lib/kubelet/pod-resources requires privileged access, so monitoring agents must run in a privileged security context. If a device monitoring agent is running as a DaemonSet, /var/lib/kubelet/pod-resources must be mounted as a Volume in the device monitoring agent's PodSpec.
Note:
When accessing the /var/lib/kubelet/pod-resources/kubelet.sock from DaemonSet or any other app deployed as a container on the host, which is mounting socket as a volume, it is a good practice to mount directory /var/lib/kubelet/pod-resources/ instead of the /var/lib/kubelet/pod-resources/kubelet.sock. This will ensure that after kubelet restart, container will be able to re-connect to this socket.
Container mounts are managed by inode referencing the socket or directory, depending on what was mounted. When kubelet restarts, socket is deleted and a new socket is created, while directory stays untouched. So the original inode for the socket become unusable. Inode to directory will continue working.
Get gRPC endpoint
FEATURE STATE: Kubernetes v1.27 [alpha]
The Get endpoint provides information on resources of a running Pod. It exposes information similar to those described in the List endpoint. The Get endpoint requires PodName and PodNamespace of the running Pod.
// GetPodResourcesRequest contains information about the pod
message GetPodResourcesRequest {
string pod_name = 1;
string pod_namespace = 2;
}
To enable this feature, you must start your kubelet services with the following flag:
--feature-gates=KubeletPodResourcesGet=true
The Get endpoint can provide Pod information related to dynamic resources allocated by the dynamic resource allocation API. To enable this feature, you must ensure your kubelet services are started with the following flags:
--feature-gates=KubeletPodResourcesGet=true,DynamicResourceAllocation=true,KubeletPodResourcesDynamicResources=true
Device plugin integration with the Topology Manager
FEATURE STATE: Kubernetes v1.27 [stable]
The Topology Manager is a Kubelet component that allows resources to be co-ordinated in a Topology aligned manner. In order to do this, the Device Plugin API was extended to include a TopologyInfo struct.
message TopologyInfo {
repeated NUMANode nodes = 1;
}
message NUMANode {
int64 ID = 1;
}
Device Plugins that wish to leverage the Topology Manager can send back a populated TopologyInfo struct as part of the device registration, along with the device IDs and the health of the device. The device manager will then use this information to consult with the Topology Manager and make resource assignment decisions.
TopologyInfo supports setting a nodes field to either nil or a list of NUMA nodes. This allows the Device Plugin to advertise a device that spans multiple NUMA nodes.
Setting TopologyInfo to nil or providing an empty list of NUMA nodes for a given device indicates that the Device Plugin does not have a NUMA affinity preference for that device.
An example TopologyInfo struct populated for a device by a Device Plugin:
pluginapi.Device{ID: "25102017", Health: pluginapi.Healthy, Topology:&pluginapi.TopologyInfo{Nodes: []*pluginapi.NUMANode{&pluginapi.NUMANode{ID: 0,},}}}
Device plugin examples
Note: This section links to third party projects that provide functionality required by Kubernetes. The Kubernetes project authors aren't responsible for these projects, which are listed alphabetically. To add a project to this list, read the content guide before submitting a change. More information.
Here are some examples of device plugin implementations:
Akri, which lets you easily expose heterogeneous leaf devices (such as IP cameras and USB devices).
The AMD GPU device plugin
The generic device plugin for generic Linux devices and USB devices
The Intel device plugins for Intel GPU, FPGA, QAT, VPU, SGX, DSA, DLB and IAA devices
The KubeVirt device plugins for hardware-assisted virtualization
The NVIDIA GPU device plugin for Container-Optimized OS
The RDMA device plugin
The SocketCAN device plugin
The Solarflare device plugin
The SR-IOV Network device plugin
The Xilinx FPGA device plugins for Xilinx FPGA devices
What's next
Learn about scheduling GPU resources using device plugins
Learn about advertising extended resources on a node
Learn about the Topology Manager
Read about using hardware acceleration for TLS ingress with Kubernetes
13.2 - Extending the Kubernetes API
Custom resources are extensions of the Kubernetes API. Kubernetes provides two ways to add custom resources to your cluster:
The CustomResourceDefinition (CRD) mechanism allows you to declaratively define a new custom API with an API group, kind, and schema that you specify. The Kubernetes control plane serves and handles the storage of your custom resource. CRDs allow you to create new types of resources for your cluster without writing and running a custom API server.
The aggregation layer sits behind the primary API server, which acts as a proxy. This arrangement is called API Aggregation (AA), which allows you to provide specialized implementations for your custom resources by writing and deploying your own API server. The main API server delegates requests to your API server for the custom APIs that you specify, making them available to all of its clients.
13.2.1 - Custom Resources
Custom resources are extensions of the Kubernetes API. This page discusses when to add a custom resource to your Kubernetes cluster and when to use a standalone service. It describes the two methods for adding custom resources and how to choose between them.
Custom resources
A resource is an endpoint in the Kubernetes API that stores a collection of API objects of a certain kind; for example, the built-in pods resource contains a collection of Pod objects.
A custom resource is an extension of the Kubernetes API that is not necessarily available in a default Kubernetes installation. It represents a customization of a particular Kubernetes installation. However, many core Kubernetes functions are now built using custom resources, making Kubernetes more modular.
Custom resources can appear and disappear in a running cluster through dynamic registration, and cluster admins can update custom resources independently of the cluster itself. Once a custom resource is installed, users can create and access its objects using kubectl, just as they do for built-in resources like Pods.
Custom controllers
On their own, custom resources let you store and retrieve structured data. When you combine a custom resource with a custom controller, custom resources provide a true declarative API.
The Kubernetes declarative API enforces a separation of responsibilities. You declare the desired state of your resource. The Kubernetes controller keeps the current state of Kubernetes objects in sync with your declared desired state. This is in contrast to an imperative API, where you instruct a server what to do.
You can deploy and update a custom controller on a running cluster, independently of the cluster's lifecycle. Custom controllers can work with any kind of resource, but they are especially effective when combined with custom resources. The Operator pattern combines custom resources and custom controllers. You can use custom controllers to encode domain knowledge for specific applications into an extension of the Kubernetes API.
Should I add a custom resource to my Kubernetes cluster?
When creating a new API, consider whether to aggregate your API with the Kubernetes cluster APIs or let your API stand alone.
Consider API aggregation if: Prefer a stand-alone API if:
Your API is Declarative. Your API does not fit the Declarative model.
You want your new types to be readable and writable using kubectl. kubectl support is not required
You want to view your new types in a Kubernetes UI, such as dashboard, alongside built-in types. Kubernetes UI support is not required.
You are developing a new API. You already have a program that serves your API and works well.
You are willing to accept the format restriction that Kubernetes puts on REST resource paths, such as API Groups and Namespaces. (See the API Overview.) You need to have specific REST paths to be compatible with an already defined REST API.
Your resources are naturally scoped to a cluster or namespaces of a cluster. Cluster or namespace scoped resources are a poor fit; you need control over the specifics of resource paths.
You want to reuse Kubernetes API support features. You don't need those features.
Declarative APIs
In a Declarative API, typically:
Your API consists of a relatively small number of relatively small objects (resources).
The objects define configuration of applications or infrastructure.
The objects are updated relatively infrequently.
Humans often need to read and write the objects.
The main operations on the objects are CRUD-y (creating, reading, updating and deleting).
Transactions across objects are not required: the API represents a desired state, not an exact state.
Imperative APIs are not declarative. Signs that your API might not be declarative include:
The client says "do this", and then gets a synchronous response back when it is done.
The client says "do this", and then gets an operation ID back, and has to check a separate Operation object to determine completion of the request.
You talk about Remote Procedure Calls (RPCs).
Directly storing large amounts of data; for example, > a few kB per object, or > 1000s of objects.
High bandwidth access (10s of requests per second sustained) needed.
Store end-user data (such as images, PII, etc.) or other large-scale data processed by applications.
The natural operations on the objects are not CRUD-y.
The API is not easily modeled as objects.
You chose to represent pending operations with an operation ID or an operation object.
Should I use a ConfigMap or a custom resource?
Use a ConfigMap if any of the following apply:
There is an existing, well-documented configuration file format, such as a mysql.cnf or pom.xml.
You want to put the entire configuration into one key of a ConfigMap.
The main use of the configuration file is for a program running in a Pod on your cluster to consume the file to configure itself.
Consumers of the file prefer to consume via file in a Pod or environment variable in a pod, rather than the Kubernetes API.
You want to perform rolling updates via Deployment, etc., when the file is updated.
Note:
Use a Secret for sensitive data, which is similar to a ConfigMap but more secure.
Use a custom resource (CRD or Aggregated API) if most of the following apply:
You want to use Kubernetes client libraries and CLIs to create and update the new resource.
You want top-level support from kubectl; for example, kubectl get my-object object-name.
You want to build new automation that watches for updates on the new object, and then CRUD other objects, or vice versa.
You want to write automation that handles updates to the object.
You want to use Kubernetes API conventions like .spec, .status, and .metadata.
You want the object to be an abstraction over a collection of controlled resources, or a summarization of other resources.
Adding custom resources
Kubernetes provides two ways to add custom resources to your cluster:
CRDs are simple and can be created without any programming.
API Aggregation requires programming, but allows more control over API behaviors like how data is stored and conversion between API versions.
Kubernetes provides these two options to meet the needs of different users, so that neither ease of use nor flexibility is compromised.
Aggregated APIs are subordinate API servers that sit behind the primary API server, which acts as a proxy. This arrangement is called API Aggregation(AA). To users, the Kubernetes API appears extended.
CRDs allow users to create new types of resources without adding another API server. You do not need to understand API Aggregation to use CRDs.
Regardless of how they are installed, the new resources are referred to as Custom Resources to distinguish them from built-in Kubernetes resources (like pods).
Note:
Avoid using a Custom Resource as data storage for application, end user, or monitoring data: architecture designs that store application data within the Kubernetes API typically represent a design that is too closely coupled.
Architecturally, cloud native application architectures favor loose coupling between components. If part of your workload requires a backing service for its routine operation, run that backing service as a component or consume it as an external service. This way, your workload does not rely on the Kubernetes API for its normal operation.
CustomResourceDefinitions
The CustomResourceDefinition API resource allows you to define custom resources. Defining a CRD object creates a new custom resource with a name and schema that you specify. The Kubernetes API serves and handles the storage of your custom resource. The name of a CRD object must be a valid DNS subdomain name.
This frees you from writing your own API server to handle the custom resource, but the generic nature of the implementation means you have less flexibility than with API server aggregation.
Refer to the custom controller example for an example of how to register a new custom resource, work with instances of your new resource type, and use a controller to handle events.
API server aggregation
Usually, each resource in the Kubernetes API requires code that handles REST requests and manages persistent storage of objects. The main Kubernetes API server handles built-in resources like pods and services, and can also generically handle custom resources through CRDs.
The aggregation layer allows you to provide specialized implementations for your custom resources by writing and deploying your own API server. The main API server delegates requests to your API server for the custom resources that you handle, making them available to all of its clients.
Choosing a method for adding custom resources
CRDs are easier to use. Aggregated APIs are more flexible. Choose the method that best meets your needs.
Typically, CRDs are a good fit if:
You have a handful of fields
You are using the resource within your company, or as part of a small open-source project (as opposed to a commercial product)
Comparing ease of use
CRDs are easier to create than Aggregated APIs.
CRDs Aggregated API
Do not require programming. Users can choose any language for a CRD controller. Requires programming and building binary and image.
No additional service to run; CRDs are handled by API server. An additional service to create and that could fail.
No ongoing support once the CRD is created. Any bug fixes are picked up as part of normal Kubernetes Master upgrades. May need to periodically pickup bug fixes from upstream and rebuild and update the Aggregated API server.
No need to handle multiple versions of your API; for example, when you control the client for this resource, you can upgrade it in sync with the API. You need to handle multiple versions of your API; for example, when developing an extension to share with the world.
Advanced features and flexibility
Aggregated APIs offer more advanced API features and customization of other features; for example, the storage layer.
Feature Description CRDs Aggregated API
Validation Help users prevent errors and allow you to evolve your API independently of your clients. These features are most useful when there are many clients who can't all update at the same time. Yes. Most validation can be specified in the CRD using OpenAPI v3.0 validation. CRDValidationRatcheting feature gate allows failing validations specified using OpenAPI also can be ignored if the failing part of the resource was unchanged. Any other validations supported by addition of a Validating Webhook. Yes, arbitrary validation checks
Defaulting See above Yes, either via OpenAPI v3.0 validation default keyword (GA in 1.17), or via a Mutating Webhook (though this will not be run when reading from etcd for old objects). Yes
Multi-versioning Allows serving the same object through two API versions. Can help ease API changes like renaming fields. Less important if you control your client versions. Yes Yes
Custom Storage If you need storage with a different performance mode (for example, a time-series database instead of key-value store) or isolation for security (for example, encryption of sensitive information, etc.) No Yes
Custom Business Logic Perform arbitrary checks or actions when creating, reading, updating or deleting an object Yes, using Webhooks. Yes
Scale Subresource Allows systems like HorizontalPodAutoscaler and PodDisruptionBudget interact with your new resource Yes Yes
Status Subresource Allows fine-grained access control where user writes the spec section and the controller writes the status section. Allows incrementing object Generation on custom resource data mutation (requires separate spec and status sections in the resource) Yes Yes
Other Subresources Add operations other than CRUD, such as "logs" or "exec". No Yes
strategic-merge-patch The new endpoints support PATCH with Content-Type: application/strategic-merge-patch+json. Useful for updating objects that may be modified both locally, and by the server. For more information, see "Update API Objects in Place Using kubectl patch" No Yes
Protocol Buffers The new resource supports clients that want to use Protocol Buffers No Yes
OpenAPI Schema Is there an OpenAPI (swagger) schema for the types that can be dynamically fetched from the server? Is the user protected from misspelling field names by ensuring only allowed fields are set? Are types enforced (in other words, don't put an int in a string field?) Yes, based on the OpenAPI v3.0 validation schema (GA in 1.16). Yes
Common Features
When you create a custom resource, either via a CRD or an AA, you get many features for your API, compared to implementing it outside the Kubernetes platform:
Feature What it does
CRUD The new endpoints support CRUD basic operations via HTTP and kubectl
Watch The new endpoints support Kubernetes Watch operations via HTTP
Discovery Clients like kubectl and dashboard automatically offer list, display, and field edit operations on your resources
json-patch The new endpoints support PATCH with Content-Type: application/json-patch+json
merge-patch The new endpoints support PATCH with Content-Type: application/merge-patch+json
HTTPS The new endpoints uses HTTPS
Built-in Authentication Access to the extension uses the core API server (aggregation layer) for authentication
Built-in Authorization Access to the extension can reuse the authorization used by the core API server; for example, RBAC.
Finalizers Block deletion of extension resources until external cleanup happens.
Admission Webhooks Set default values and validate extension resources during any create/update/delete operation.
UI/CLI Display Kubectl, dashboard can display extension resources.
Unset versus Empty Clients can distinguish unset fields from zero-valued fields.
Client Libraries Generation Kubernetes provides generic client libraries, as well as tools to generate type-specific client libraries.
Labels and annotations Common metadata across objects that tools know how to edit for core and custom resources.
Preparing to install a custom resource
There are several points to be aware of before adding a custom resource to your cluster.
Third party code and new points of failure
While creating a CRD does not automatically add any new points of failure (for example, by causing third party code to run on your API server), packages (for example, Charts) or other installation bundles often include CRDs as well as a Deployment of third-party code that implements the business logic for a new custom resource.
Installing an Aggregated API server always involves running a new Deployment.
Storage
Custom resources consume storage space in the same way that ConfigMaps do. Creating too many custom resources may overload your API server's storage space.
Aggregated API servers may use the same storage as the main API server, in which case the same warning applies.
Authentication, authorization, and auditing
CRDs always use the same authentication, authorization, and audit logging as the built-in resources of your API server.
If you use RBAC for authorization, most RBAC roles will not grant access to the new resources (except the cluster-admin role or any role created with wildcard rules). You'll need to explicitly grant access to the new resources. CRDs and Aggregated APIs often come bundled with new role definitions for the types they add.
Aggregated API servers may or may not use the same authentication, authorization, and auditing as the primary API server.
Accessing a custom resource
Kubernetes client libraries can be used to access custom resources. Not all client libraries support custom resources. The Go and Python client libraries do.
When you add a custom resource, you can access it using:
kubectl
The Kubernetes dynamic client.
A REST client that you write.
A client generated using Kubernetes client generation tools (generating one is an advanced undertaking, but some projects may provide a client along with the CRD or AA).
Custom resource field selectors
Field Selectors let clients select custom resources based on the value of one or more resource fields.
All custom resources support the metadata.name and metadata.namespace field selectors.
Fields declared in a CustomResourceDefinition may also be used with field selectors when included in the spec.versions[*].selectableFields field of the CustomResourceDefinition.
Selectable fields for custom resources
FEATURE STATE: Kubernetes v1.30 [alpha]
You need to enable the CustomResourceFieldSelectors feature gate to use this behavior, which then applies to all CustomResourceDefinitions in your cluster.
The spec.versions[*].selectableFields field of a CustomResourceDefinition may be used to declare which other fields in a custom resource may be used in field selectors. The following example adds the .spec.color and .spec.size fields as selectable fields.
customresourcedefinition/shirt-resource-definition.yaml Copy customresourcedefinition/shirt-resource-definition.yaml to clipboard
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
name: shirts.stable.example.com
spec:
group: stable.example.com
scope: Namespaced
names:
plural: shirts
singular: shirt
kind: Shirt
versions:
- name: v1
served: true
storage: true
schema:
openAPIV3Schema:
type: object
properties:
spec:
type: object
properties:
color:
type: string
size:
type: string
selectableFields:
- jsonPath: .spec.color
- jsonPath: .spec.size
additionalPrinterColumns:
- jsonPath: .spec.color
name: Color
type: string
- jsonPath: .spec.size
name: Size
type: string
Field selectors can then be used to get only resources with a color of blue:
kubectl get shirts.stable.example.com --field-selector spec.color=blue
The output should be:
NAME COLOR SIZE
example1 blue S
example2 blue M
What's next
Learn how to Extend the Kubernetes API with the aggregation layer.
Learn how to Extend the Kubernetes API with CustomResourceDefinition.
13.2.2 - Kubernetes API Aggregation Layer
The aggregation layer allows Kubernetes to be extended with additional APIs, beyond what is offered by the core Kubernetes APIs. The additional APIs can either be ready-made solutions such as a metrics server, or APIs that you develop yourself.
The aggregation layer is different from Custom Resources, which are a way to make the kube-apiserver recognise new kinds of object.
Aggregation layer
The aggregation layer runs in-process with the kube-apiserver. Until an extension resource is registered, the aggregation layer will do nothing. To register an API, you add an APIService object, which "claims" the URL path in the Kubernetes API. At that point, the aggregation layer will proxy anything sent to that API path (e.g. /apis/myextension.mycompany.io/v1/…) to the registered APIService.
The most common way to implement the APIService is to run an extension API server in Pod(s) that run in your cluster. If you're using the extension API server to manage resources in your cluster, the extension API server (also written as "extension-apiserver") is typically paired with one or more controllers. The apiserver-builder library provides a skeleton for both extension API servers and the associated controller(s).
Response latency
Extension API servers should have low latency networking to and from the kube-apiserver. Discovery requests are required to round-trip from the kube-apiserver in five seconds or less.
If your extension API server cannot achieve that latency requirement, consider making changes that let you meet it.
What's next
To get the aggregator working in your environment, configure the aggregation layer.
Then, setup an extension api-server to work with the aggregation layer.
Read about APIService in the API reference
Alternatively: learn how to extend the Kubernetes API using Custom Resource Definitions.
13.3 - Operator pattern
Operators are software extensions to Kubernetes that make use of custom resources to manage applications and their components. Operators follow Kubernetes principles, notably the control loop.
Motivation
The operator pattern aims to capture the key aim of a human operator who is managing a service or set of services. Human operators who look after specific applications and services have deep knowledge of how the system ought to behave, how to deploy it, and how to react if there are problems.
People who run workloads on Kubernetes often like to use automation to take care of repeatable tasks. The operator pattern captures how you can write code to automate a task beyond what Kubernetes itself provides.
Operators in Kubernetes
Kubernetes is designed for automation. Out of the box, you get lots of built-in automation from the core of Kubernetes. You can use Kubernetes to automate deploying and running workloads, and you can automate how Kubernetes does that.
Kubernetes' operator pattern concept lets you extend the cluster's behaviour without modifying the code of Kubernetes itself by linking controllers to one or more custom resources. Operators are clients of the Kubernetes API that act as controllers for a Custom Resource.
An example operator
Some of the things that you can use an operator to automate include:
deploying an application on demand
taking and restoring backups of that application's state
handling upgrades of the application code alongside related changes such as database schemas or extra configuration settings
publishing a Service to applications that don't support Kubernetes APIs to discover them
simulating failure in all or part of your cluster to test its resilience
choosing a leader for a distributed application without an internal member election process
What might an operator look like in more detail? Here's an example:
A custom resource named SampleDB, that you can configure into the cluster.
A Deployment that makes sure a Pod is running that contains the controller part of the operator.
A container image of the operator code.
Controller code that queries the control plane to find out what SampleDB resources are configured.
The core of the operator is code to tell the API server how to make reality match the configured resources.
If you add a new SampleDB, the operator sets up PersistentVolumeClaims to provide durable database storage, a StatefulSet to run SampleDB and a Job to handle initial configuration.
If you delete it, the operator takes a snapshot, then makes sure that the StatefulSet and Volumes are also removed.
The operator also manages regular database backups. For each SampleDB resource, the operator determines when to create a Pod that can connect to the database and take backups. These Pods would rely on a ConfigMap and / or a Secret that has database connection details and credentials.
Because the operator aims to provide robust automation for the resource it manages, there would be additional supporting code. For this example, code checks to see if the database is running an old version and, if so, creates Job objects that upgrade it for you.
Deploying operators
The most common way to deploy an operator is to add the Custom Resource Definition and its associated Controller to your cluster. The Controller will normally run outside of the control plane, much as you would run any containerized application. For example, you can run the controller in your cluster as a Deployment.
Using an operator
Once you have an operator deployed, you'd use it by adding, modifying or deleting the kind of resource that the operator uses. Following the above example, you would set up a Deployment for the operator itself, and then:
kubectl get SampleDB # find configured databases
kubectl edit SampleDB/example-database # manually change some settings
…and that's it! The operator will take care of applying the changes as well as keeping the existing service in good shape.
Writing your own operator
If there isn't an operator in the ecosystem that implements the behavior you want, you can code your own.
You also implement an operator (that is, a Controller) using any language / runtime that can act as a client for the Kubernetes API.
Following are a few libraries and tools you can use to write your own cloud native operator.
Note: This section links to third party projects that provide functionality required by Kubernetes. The Kubernetes project authors aren't responsible for these projects, which are listed alphabetically. To add a project to this list, read the content guide before submitting a change. More information.
Charmed Operator Framework
Java Operator SDK
Kopf (Kubernetes Operator Pythonic Framework)
kube-rs (Rust)
kubebuilder
KubeOps (.NET operator SDK)
Mast
Metacontroller along with WebHooks that you implement yourself
Operator Framework
shell-operator
What's next
Read the CNCF Operator White Paper.
Learn more about Custom Resources
Find ready-made operators on OperatorHub.io to suit your use case
Publish your operator for other people to use
Read CoreOS' original article that introduced the operator pattern (this is an archived version of the original article).
Read an article from Google Cloud about best practices for building operators