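When you deploy applications on Kubernetes, what do you do if user growth outpaces expectations and compute resources run short? Kubernetes' answer is auto-scaling: a set of cooperating autoscaling components monitors your cluster around the clock and adjusts capacity as load changes, so that it keeps up with user demand.

How do you make a Kubernetes cluster scale automatically? A comprehensive look at Cluster Autoscaler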

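Autoscaling components

Horizontal Pod Autoscaler (HPA)

HPA automatically scales the number of Pods in a Replication Controller, Deployment, or Replica Set based on real-time CPU utilization. Paired with Metrics Server, which supplies the resource metrics, it can also scale on other metrics such as memory.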
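The Kubernetes documentation describes HPA's core rule as scaling the replica count in proportion to the ratio between the observed metric and its target. A minimal Go sketch of that arithmetic follows. The function name and sample numbers are illustrative, and the real controller also applies a tolerance band, min/max replica bounds, and stabilization, all of which are left out here.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the proportional rule:
// desired = ceil(current * currentMetric / targetMetric).
// It returns current unchanged when the inputs make no sense.
func desiredReplicas(current int, currentMetric, targetMetric float64) int {
	if current <= 0 || targetMetric <= 0 {
		return current
	}
	return int(math.Ceil(float64(current) * currentMetric / targetMetric))
}

func main() {
	// 4 replicas averaging 90% CPU against a 60% target.
	fmt.Println(desiredReplicas(4, 90, 60))
}
```

With these numbers the ratio is 1.5, so the sketch asks for 6 replicas.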

Vertical Pod Autoscaler (VPA)

VPA automatically sets a Pod's resource requests based on the resources the Pod actually uses, and can adjust them at runtime.

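Cluster Autoscaler (CA)

CA is a component that automatically scales the cluster's Nodes. If the cluster has Pods that cannot be scheduled, it adds Nodes so that those Pods can run; if it finds Nodes whose resource utilization is too low, it removes them to save resources.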

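Addon Resizer

This is a small add-on that runs as a Sidecar and vertically scales another container in the same Deployment. Its only strategy at present is to scale linearly with the number of nodes in the cluster. It is typically used with [Metrics Server](https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/metrics-server/metrics-server-deployment.yaml#L66), so that Metrics Server can keep serving the metrics API for the whole cluster as the cluster grows.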
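The linear rule mentioned above amounts to a base amount plus a fixed increment per node. The sketch below shows only that arithmetic; the function name and the numbers are hypothetical and are not taken from the Addon Resizer source or from the linked manifest.

```go
package main

import "fmt"

// linearRequest returns base + perNode*nodes.
// Units are whatever the caller uses (millicores here).
func linearRequest(base, perNode, nodes int64) int64 {
	if nodes < 0 {
		nodes = 0
	}
	return base + perNode*nodes
}

func main() {
	// Hypothetical sizing: 40m CPU base plus 1m per node.
	for _, n := range []int64{1, 10, 100} {
		fmt.Printf("nodes=%d cpu=%dm\n", n, linearRequest(40, 1, n))
	}
}
```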

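Scaling stateless applications with HPA and stateful applications with VPA, with CA guaranteeing the compute resources underneath, gives you a complete autoscaling solution.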

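Cluster Autoscaler in detail

Of the four components introduced above, HPA lives in the kubernetes code repository and is released with each kubernetes version, so it needs no separate deployment and can be used directly. The other three are maintained in the official community repository (https://github.com/kubernetes/autoscaler). Cluster Autoscaler v1.0 (GA) was released together with kubernetes 1.8; the other two are still in beta.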

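Deployment

Cluster Autoscaler usually works together with a cloud vendor. It exposes a Cloud Provider interface for vendors to plug into, and the vendor adds or removes ECS-style compute nodes through its scaling group (Scaling Group) or node pool (Node Pool) feature.

Cloud vendors integrated as of v1.18.1: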

Alicloud: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/alicloud/README.md

AWS: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md

Azure: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/README.md

Baiducloud: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/baiducloud/README.md

DigitalOcean: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/digitalocean/README.md

GoogleCloud GCE: https://kubernetes.io/docs/tasks/administer-cluster/cluster-management/#upgrading-google-compute-engine-clusters

GoogleCloud GKE: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler

OpenStack Magnum: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/magnum/README.md

Packet: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/packet/README.md

Startup parameters: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-the-parameters-to-ca

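How it works

Cluster Autoscaler abstracts a concept called NodeGroup, which corresponds to the cloud vendor's scaling group service. It uses the NodeGroups supplied by the CloudProvider to work out the node resources in the cluster and scales on that basis.

Once started, Cluster Autoscaler periodically (every 10s by default) checks for unschedulable Pods and inspects Node resource usage, then performs Scale UP or Scale Down as appropriate.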
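The periodic check described above can be pictured as one decision per iteration. The sketch below is a deliberately simplified model: the `snapshot` type, the `decide` function, and the rule that scale-up takes precedence over scale-down are assumptions made for illustration, not code from Cluster Autoscaler.

```go
package main

import "fmt"

// snapshot is a hypothetical summary of what one periodic check observes.
type snapshot struct {
	unschedulablePods int
	unneededNodes     int
}

// decide picks the action for one iteration: pending Pods trigger a
// scale-up; otherwise unneeded Nodes make the cluster a scale-down candidate.
func decide(s snapshot) string {
	switch {
	case s.unschedulablePods > 0:
		return "scale-up"
	case s.unneededNodes > 0:
		return "scale-down"
	default:
		return "none"
	}
}

func main() {
	// Each element stands for one iteration of the loop.
	for i, s := range []snapshot{{3, 0}, {0, 2}, {0, 0}} {
		fmt.Printf("iteration %d: %s\n", i, decide(s))
	}
}
```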

Scale UP

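When Cluster Autoscaler finds Pods that cannot be scheduled because of insufficient resources, it calls `ScaleUp` to expand the cluster.

Scale UP only takes into account Nodes that belong to a NodeGroup, so all Worker Nodes can be handed over to scaling groups for management. Because a scaling group adds nodes asynchronously, Upcoming Nodes are considered as well.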

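To meet business needs, a cluster may contain Nodes of different specifications, so we can create multiple NodeGroups. On scale-up, the strategy set with the --expander option picks which node group to expand. The following [five strategies](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-expanders) are supported: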

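random: picks a NodeGroup at random. This is the default when no strategy is specified.

most-pods: picks the NodeGroup that can schedule the most Pods. For example, if some Pods are unscheduled because of a nodeSelector, this strategy prefers a NodeGroup that satisfies it, so that as many Pods as possible get scheduled.

least-waste: to avoid waste, picks the NodeGroup with the smallest resource type that still satisfies the Pods' resource requests.

price: picks the cheapest NodeGroup according to the pricing model supplied by the CloudProvider.

priority: picks by user-configured priorities. It is more cumbersome to use because it needs extra configuration; see the documentation (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/expander/priority/readme.md).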
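To make two of the strategies above concrete, here is a small sketch of most-pods and least-waste style selection. The `option` type and its fields are hypothetical simplifications; the real expanders work on richer option objects and use more detailed tie-breaking.

```go
package main

import "fmt"

// option is a hypothetical, simplified candidate scale-up.
type option struct {
	group      string
	podsHelped int     // pending Pods that would become schedulable
	wastedCPU  float64 // fraction of CPU left idle after placing them
}

// mostPods returns the option that schedules the most pending Pods.
func mostPods(opts []option) (option, bool) {
	if len(opts) == 0 {
		return option{}, false
	}
	best := opts[0]
	for _, o := range opts[1:] {
		if o.podsHelped > best.podsHelped {
			best = o
		}
	}
	return best, true
}

// leastWaste returns the option that leaves the smallest idle fraction.
func leastWaste(opts []option) (option, bool) {
	if len(opts) == 0 {
		return option{}, false
	}
	best := opts[0]
	for _, o := range opts[1:] {
		if o.wastedCPU < best.wastedCPU {
			best = o
		}
	}
	return best, true
}

func main() {
	opts := []option{
		{group: "small", podsHelped: 3, wastedCPU: 0.10},
		{group: "large", podsHelped: 5, wastedCPU: 0.40},
	}
	a, _ := mostPods(opts)
	b, _ := leastWaste(opts)
	fmt.Println("most-pods:", a.group, "least-waste:", b.group)
}
```

The two strategies can disagree on the same input, which is why the choice of --expander matters when node groups differ in size.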

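If needed, Cluster Autoscaler can also balance the number of Nodes across similar NodeGroups, which avoids one NodeGroup hitting MaxSize and being unable to accept new Nodes. This is controlled by the --balance-similar-node-groups option, which defaults to false.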
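One way to picture balancing is to hand out new Nodes one at a time to the smallest group that still has room. The sketch below does exactly that; it is an illustration under that assumption, with hypothetical type and function names, and is not the algorithm from the Cluster Autoscaler source.

```go
package main

import "fmt"

// group is a hypothetical, simplified node group.
type group struct {
	name string
	size int
	max  int
}

// balance distributes newNodes across groups, always growing the smallest
// group that is still below its max. It returns the increase per group.
func balance(groups []group, newNodes int) map[string]int {
	sizes := make([]int, len(groups))
	for i, g := range groups {
		sizes[i] = g.size
	}
	added := make(map[string]int)
	for n := 0; n < newNodes; n++ {
		pick := -1
		for i, g := range groups {
			if sizes[i] >= g.max {
				continue // this group is full
			}
			if pick == -1 || sizes[i] < sizes[pick] {
				pick = i
			}
		}
		if pick == -1 {
			break // every group is at its max
		}
		sizes[pick]++
		added[groups[pick].name]++
	}
	return added
}

func main() {
	groups := []group{{"a", 1, 5}, {"b", 3, 5}}
	fmt.Println(balance(groups, 4))
}
```

Starting from sizes 1 and 3, four new Nodes bring both groups to 4 in this sketch.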

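After this series of steps, Cluster Autoscaler arrives at the number of Nodes to add and the NodeGroup to add them to, then calls IncreaseSize through the CloudProvider to grow the vendor's scaling group, which completes the scale-up.

If anything in the description above is unclear, refer to the ScaleUP source walkthrough below.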

Scale Down

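Scale-down is optional. It is controlled by the --scale-down-enabled option, which defaults to true.

While monitoring Node resources, Cluster Autoscaler marks a Node as unneeded when it meets all three of the following conditions:

The sum of the CPU and memory requests of all Pods running on the Node is less than 50% of the Node's allocatable capacity. The threshold can be changed with the --scale-down-utilization-threshold option.

All Pods on the Node can be scheduled onto other nodes.

The Node has no annotation that marks it as not eligible for scale-down.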
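The three conditions above can be combined into a single predicate. In the sketch below, taking the larger of the CPU and memory ratios as the Node's utilization is an assumption of this example, and the `node` type is hypothetical. The annotation key is the `cluster-autoscaler.kubernetes.io/scale-down-disabled` key described in the Cluster Autoscaler FAQ.

```go
package main

import (
	"fmt"
	"math"
)

const scaleDownDisabledKey = "cluster-autoscaler.kubernetes.io/scale-down-disabled"

// node is a hypothetical summary of one Node's state.
type node struct {
	cpuRequested, cpuAllocatable int64 // millicores
	memRequested, memAllocatable int64 // bytes
	podsMovable                  bool  // every Pod fits on some other node
	annotations                  map[string]string
}

// utilization returns the larger of the CPU and memory request ratios
// (an assumption of this sketch).
func utilization(n node) float64 {
	cpu := float64(n.cpuRequested) / float64(n.cpuAllocatable)
	mem := float64(n.memRequested) / float64(n.memAllocatable)
	return math.Max(cpu, mem)
}

// unneeded reports whether the Node meets all three conditions.
func unneeded(n node, threshold float64) bool {
	if n.cpuAllocatable <= 0 || n.memAllocatable <= 0 {
		return false
	}
	if n.annotations[scaleDownDisabledKey] == "true" {
		return false
	}
	return n.podsMovable && utilization(n) < threshold
}

func main() {
	n := node{
		cpuRequested: 1000, cpuAllocatable: 4000,
		memRequested: 2 << 30, memAllocatable: 8 << 30,
		podsMovable: true,
	}
	fmt.Println(unneeded(n, 0.5))
}
```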

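If a Node stays marked as unneeded for more than 10 minutes (configurable with the --scale-down-unneeded-time option), Cluster Autoscaler calls DeleteNodes through the CloudProvider to remove it. At most one non-empty unneeded Node is deleted at a time, but empty Nodes can be deleted in bulk, up to 10 per pass (configurable with the --max-empty-bulk-delete option).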
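The deletion limits just described can be sketched as a selection function. The rule used here, that empty Nodes are handled first in bulk and a single non-empty Node is removed only when there are no empty ones, is an assumption of this example; the `candidate` type and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// candidate is a hypothetical record for one unneeded Node.
type candidate struct {
	name        string
	empty       bool          // no Pods that would need rescheduling
	unneededFor time.Duration // how long the Node has been marked unneeded
}

// minutes is a small helper to keep the examples readable.
func minutes(n int) time.Duration { return time.Duration(n) * time.Minute }

// pickForDeletion returns the Nodes to delete in one pass: up to
// maxEmptyBulk empty Nodes, or else a single non-empty Node.
func pickForDeletion(cands []candidate, unneededTime time.Duration, maxEmptyBulk int) []string {
	var empty, nonEmpty []string
	for _, c := range cands {
		if c.unneededFor < unneededTime {
			continue // not unneeded for long enough yet
		}
		if c.empty {
			empty = append(empty, c.name)
		} else {
			nonEmpty = append(nonEmpty, c.name)
		}
	}
	if len(empty) > 0 {
		if len(empty) > maxEmptyBulk {
			empty = empty[:maxEmptyBulk]
		}
		return empty
	}
	if len(nonEmpty) > 0 {
		return nonEmpty[:1]
	}
	return nil
}

func main() {
	cands := []candidate{
		{"a", true, minutes(12)},
		{"b", false, minutes(30)},
		{"c", true, minutes(11)},
		{"d", true, minutes(3)},
	}
	fmt.Println(pickForDeletion(cands, minutes(10), 10))
}
```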

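In practice these are not the only criteria. Other conditions can block deletion of a Node, for example the NodeGroup having reached MinSize, or a Scale UP having happened within the past 10 minutes (configurable with the --scale-down-delay-after-add option). See the documentation for details (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-does-scale-down-work).

Cluster Autoscaler's mechanics are complex, but most of them can be configured through flags. If you need to, read the documentation carefully: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md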
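The two blocking conditions named above can be expressed as a guard that runs before any deletion. The `groupState` type, the function name, and the reason strings below are hypothetical and only illustrate the checks.

```go
package main

import (
	"fmt"
	"time"
)

// groupState is a hypothetical view of one node group.
type groupState struct {
	targetSize int
	minSize    int
}

// minutes is a small helper to keep the examples readable.
func minutes(n int) time.Duration { return time.Duration(n) * time.Minute }

// scaleDownAllowed reports whether a Node may be removed from the group,
// and gives a reason when it may not.
func scaleDownAllowed(g groupState, sinceLastScaleUp, delayAfterAdd time.Duration) (bool, string) {
	if g.targetSize <= g.minSize {
		return false, "node group already at MinSize"
	}
	if sinceLastScaleUp < delayAfterAdd {
		return false, "recent scale-up, still inside the delay window"
	}
	return true, ""
}

func main() {
	ok, reason := scaleDownAllowed(groupState{targetSize: 3, minSize: 3}, minutes(30), minutes(10))
	fmt.Println(ok, reason)
	ok, reason = scaleDownAllowed(groupState{targetSize: 5, minSize: 3}, minutes(15), minutes(10))
	fmt.Println(ok, reason)
}
```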

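How to implement a CloudProvider

If you use one of the cloud vendors already integrated above, you only need to name it with the --cloud-provider option. If you want to hook up your own IaaS or have specific business logic, you need to implement the CloudProvider interface and the NodeGroup interface yourself, and register the implementation in the builder so that it can be selected with the --cloud-provider flag.

Providers are registered in builder_all.go (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/builder/builder_all.go) under cloudprovider/builder. You can also add a builder file of your own there and use the +build constraint at the top of the Go file to choose which CloudProvider is compiled in.

The CloudProvider and NodeGroup interfaces are defined in cloud_provider.go (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/cloud_provider.go). Pay particular attention to the Refresh method: it is called at the start of every loop iteration (every 10 seconds by default), so that is the place to call the vendor's API and refresh NodeGroup state. The usual approach is to add a manager that holds this state. For anything unclear, refer to the other CloudProvider implementations.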
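The registration idea can be reduced to a function that maps the --cloud-provider value to a constructor. The sketch below is a stand-alone simplification: the `CloudProvider` interface is cut down to one method, and `myCloud`, `buildCloudProvider`, and the "mycloud" name are hypothetical. In the real repository the equivalent file would also carry a build constraint above its package clause.

```go
package main

import (
	"errors"
	"fmt"
)

// CloudProvider is reduced to a single method for this sketch.
type CloudProvider interface {
	Name() string
}

// myCloud is a hypothetical provider for an in-house IaaS.
type myCloud struct{}

func (myCloud) Name() string { return "mycloud" }

var errUnknownProvider = errors.New("unknown cloud provider")

// buildCloudProvider returns the provider selected by name.
func buildCloudProvider(name string) (CloudProvider, error) {
	switch name {
	case "mycloud":
		return myCloud{}, nil
	}
	return nil, fmt.Errorf("%w: %q", errUnknownProvider, name)
}

func main() {
	p, err := buildCloudProvider("mycloud")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("using provider:", p.Name())
}
```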

type CloudProvider interface {
	// Name returns name of the cloud provider.
	Name() string

	// NodeGroups returns all node groups configured for this cloud provider.
	// Note: this is called several times per loop iteration, so do not call the
	// cloud vendor's API on every call; store the state during Refresh instead.
	NodeGroups() []NodeGroup

	// NodeGroupForNode returns the node group for the given node, nil if the node
	// should not be processed by cluster autoscaler, or non-nil error if such
	// occurred. Must be implemented.
	// Note: same as above.
	NodeGroupForNode(*apiv1.Node) (NodeGroup, error)

	// Pricing returns pricing model for this cloud provider or error if not available.
	// Implementation optional.
	// Note: not needed unless you use the price expander.
	Pricing() (PricingModel, errors.AutoscalerError)

	// GetAvailableMachineTypes get all machine types that can be requested from the cloud provider.
	// Implementation optional.
	// Note: unused in practice; no need to implement.
	GetAvailableMachineTypes() ([]string, error)

	// NewNodeGroup builds a theoretical node group based on the node definition provided. The node group is not automatically
	// created on the cloud provider side. The node group is not returned by NodeGroups() until it is created.
	// Implementation optional.
	// Note: usually not needed, but can be implemented if you want ClusterAutoscaler to create a default NodeGroup.
	// A better approach is to define the default NodeGroup in the cloud-side scaling group.
	NewNodeGroup(machineType string, labels map[string]string, systemLabels map[string]string,
		taints []apiv1.Taint, extraResources map[string]resource.Quantity) (NodeGroup, error)

	// GetResourceLimiter returns struct containing limits (max, min) for resources (cores, memory etc.).
	// Note: the resource limiter is passed in at build time. It normally does not need changing unless the
	// cloud side explicitly tells users where it is changed; otherwise it will confuse them.
	GetResourceLimiter() (*ResourceLimiter, error)

	// GPULabel returns the label added to nodes with GPU resource.
	// Note: GPU related; return the matching value if the cluster uses GPU resources.
	// hack: we assume anything which is not cpu/memory to be a gpu.
	GPULabel() string

	// GetAvailableGPUTypes return all available GPU types cloud provider supports.
	// Note: same as above.
	GetAvailableGPUTypes() map[string]struct{}

	// Cleanup cleans up open resources before the cloud provider is destroyed, i.e. go routines etc.
	// Note: the CloudProvider is initialized only once, at startup; release any resources it still holds here.
	Cleanup() error

	// Refresh is called before every main loop and can be used to dynamically update cloud provider state.
	// In particular the list of node groups returned by NodeGroups can change as a result of CloudProvider.Refresh().
	// Note: called from StaticAutoscaler RunOnce.
	Refresh() error
}
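The notes on NodeGroups and Refresh above suggest a manager that talks to the cloud API once per loop and answers every other call from memory. A minimal sketch of that pattern follows. The `manager` type and `fetchFunc` are hypothetical, and node groups are reduced to plain names to keep the example self-contained.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchFunc is a hypothetical stand-in for a call to the cloud vendor's API.
type fetchFunc func() ([]string, error)

// manager caches the node group list between Refresh calls.
type manager struct {
	mu     sync.RWMutex
	fetch  fetchFunc
	groups []string
}

// Refresh is meant to run once per loop iteration; it is the only place
// that reaches out to the cloud API.
func (m *manager) Refresh() error {
	groups, err := m.fetch()
	if err != nil {
		return err // keep the previous snapshot on failure
	}
	m.mu.Lock()
	m.groups = groups
	m.mu.Unlock()
	return nil
}

// NodeGroups answers from the cache, so it is cheap to call repeatedly.
func (m *manager) NodeGroups() []string {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return append([]string(nil), m.groups...)
}

func main() {
	calls := 0
	m := &manager{fetch: func() ([]string, error) {
		calls++
		return []string{"group-a", "group-b"}, nil
	}}
	if err := m.Refresh(); err != nil {
		fmt.Println("refresh failed:", err)
		return
	}
	for i := 0; i < 3; i++ {
		_ = m.NodeGroups()
	}
	fmt.Println("cloud API calls:", calls)
}
```

Returning a copy from NodeGroups keeps callers from mutating the cached slice between refreshes.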

// NodeGroup contains configuration info and functions to control a set
// of nodes that have the same capacity and set of labels.
type NodeGroup interface {
	// MaxSize returns maximum size of the node group.
	MaxSize() int

	// MinSize returns minimum size of the node group.
	MinSize() int

	// TargetSize returns the current target size of the node group. It is possible that the
	// number of nodes in Kubernetes is different at the moment but should be equal
	// to Size() once everything stabilizes (new nodes finish startup and registration or
	// removed nodes are deleted completely). Implementation required.
	// Note: this returns the scaling group's node count, which does not necessarily match
	// the number of nodes in kubernetes.
	TargetSize() (int, error)

	// IncreaseSize increases the size of the node group. To delete a node you need
	// to explicitly name it and use DeleteNode. This function should wait until
	// node group size is updated. Implementation required.
	// Note: the scale-up method; increases the scaling group's node count.
	IncreaseSize(delta int) error

	// DeleteNodes deletes nodes from this node group. Error is returned either on
	// failure or if the given node doesn't belong to this node group. This function
	// should wait until node group size is updated. Implementation required.
	// Note: the nodes to delete must belong to this node group.
	DeleteNodes([]*apiv1.Node) error

	// DecreaseTargetSize decreases the target size of the node group. This function
	// doesn't permit to delete any existing node and can be used only to reduce the
	// request for new nodes that have not been yet fulfilled. Delta should be negative.
	// It is assumed that cloud provider will not delete the existing nodes when there
	// is an option to just decrease the target. Implementation required.
	// Note: called when ClusterAutoscaler finds that the kubernetes node count and the
	// scaling group's node count have disagreed for a long time.
	DecreaseTargetSize(delta int) error

	// Id returns an unique identifier of the node group.
	Id() string

	// Debug returns a string containing all information regarding this node group.
	Debug() string

	// Nodes returns a list of all nodes that belong to this node group.
	// It is required that Instance objects returned by this method have Id field set.
	// Other fields are optional.
	// This list should include also instances that might have not become a kubernetes node yet.
	// Note: returns every node in the scaling group, even those that are not kubernetes nodes yet.
	Nodes() ([]Instance, error)

	// TemplateNodeInfo returns a schedulernodeinfo.NodeInfo structure of an empty
	// (as if just started) node. This will be used in scale-up simulations to
	// predict what would a new node look like if a node group was expanded. The returned
	// NodeInfo is expected to have a fully populated Node object, with all of the labels,
	// capacity and allocatable information as well as all pods that are started on
	// the node by default, using manifest (most likely only kube-proxy). Implementation optional.
	// Note: ClusterAutoscaler matches node info to node groups to evaluate resource conditions;
	// for an empty node group it uses this method to synthesize the node info.
	TemplateNodeInfo() (*schedulernodeinfo.NodeInfo, error)

	// Exist checks if the node group really exists on the cloud provider side. Allows to tell the
	// theoretical node group from the real one. Implementation required.
	Exist() bool

	// Create creates the node group on the cloud provider side. Implementation optional.
	// Note: used together with CloudProvider.NewNodeGroup.
	Create() (NodeGroup, error)

	// Delete deletes the node group on the cloud provider side.
	// This will be executed only for autoprovisioned node groups, once their size drops to 0.
	// Implementation optional.
	Delete() error

	// Autoprovisioned returns true if the node group is autoprovisioned. An autoprovisioned group
	// was created by CA and can be deleted when scaled to 0.
	Autoprovisioned() bool
}
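To show how the size-related methods of the interface above fit together, here is an in-memory sketch that implements only a subset of them. The `memGroup` type is hypothetical, the validation rules are one reasonable reading of the interface comments, and a real implementation would call the cloud vendor's scaling group API instead of mutating fields.

```go
package main

import (
	"errors"
	"fmt"
)

// memGroup is a hypothetical in-memory node group.
type memGroup struct {
	id         string
	min, max   int
	target     int // size requested from the scaling group
	registered int // nodes that already exist in the group
}

func (g *memGroup) Id() string               { return g.id }
func (g *memGroup) MaxSize() int             { return g.max }
func (g *memGroup) MinSize() int             { return g.min }
func (g *memGroup) TargetSize() (int, error) { return g.target, nil }

// IncreaseSize grows the target size; delta must be positive and the
// result must not exceed MaxSize.
func (g *memGroup) IncreaseSize(delta int) error {
	if delta <= 0 {
		return errors.New("size increase must be positive")
	}
	if g.target+delta > g.max {
		return fmt.Errorf("size increase too large: desired %d, max %d", g.target+delta, g.max)
	}
	g.target += delta
	return nil
}

// DecreaseTargetSize only cancels requests for nodes that do not exist
// yet; delta must be negative and existing nodes are never removed.
func (g *memGroup) DecreaseTargetSize(delta int) error {
	if delta >= 0 {
		return errors.New("size decrease must be negative")
	}
	if g.target+delta < g.registered {
		return fmt.Errorf("attempt to delete existing nodes: target %d, existing %d", g.target+delta, g.registered)
	}
	g.target += delta
	return nil
}

func main() {
	g := &memGroup{id: "pool-a", min: 1, max: 5, target: 2, registered: 2}
	err := g.IncreaseSize(2)
	fmt.Println("increase by 2:", err, "target:", g.target)
	err = g.IncreaseSize(5)
	fmt.Println("increase by 5:", err)
	err = g.DecreaseTargetSize(-2)
	fmt.Println("decrease by 2:", err, "target:", g.target)
}
```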

ScaleUP source walkthrough

func ScaleUp(context *context.AutoscalingContext, processors *ca_processors.AutoscalingProcessors, clusterStateRegistry *clusterstate.ClusterStateRegistry, unschedulablePods []*apiv1.Pod, nodes []*apiv1.Node, daemonSets []*appsv1.DaemonSet, nodeInfos map[string]*schedulernodeinfo.NodeInfo, ignoredTaints taints.TaintKeySet) (*status.ScaleUpStatus, errors.AutoscalerError) {

	......

	// Check whether every ready node in the cluster comes from a node group,
	// and collect the nodes that belong to no group
	nodesFromNotAutoscaledGroups, err := utils.FilterOutNodesFromNotAutoscaledGroups(nodes, context.CloudProvider)
	if err != nil {
		return &status.ScaleUpStatus{Result: status.ScaleUpError}, err.AddPrefix("failed to filter out nodes which are from not autoscaled groups: ")
	}

	nodeGroups := context.CloudProvider.NodeGroups()
	gpuLabel := context.CloudProvider.GPULabel()
	availableGPUTypes := context.CloudProvider.GetAvailableGPUTypes()

	// The resource limiter, passed in when the cloud provider is built.
	// It can be changed inside the CloudProvider if needed, but that is not
	// recommended because it confuses users
	resourceLimiter, errCP := context.CloudProvider.GetResourceLimiter()
	if errCP != nil {
		return &status.ScaleUpStatus{Result: status.ScaleUpError}, errors.ToAutoscalerError(
			errors.CloudProviderError,
			errCP)
	}

	// Compute the resource limits.
	// nodeInfos maps each node group to a sample node for that group.
	// The sample node prefers data from a real node; if the NodeGroup has no real
	// node deployed yet, the Template node data is used
	scaleUpResourcesLeft, errLimits := computeScaleUpResourcesLeftLimits(context.CloudProvider, nodeGroups, nodeInfos, nodesFromNotAutoscaledGroups, resourceLimiter)
	if errLimits != nil {
		return &status.ScaleUpStatus{Result: status.ScaleUpError}, errLimits.AddPrefix("Could not compute total resources: ")
	}

	// Work out how many nodes are about to join the cluster, from the current nodes
	// and the nodes in the NodeGroups.
	// A cloud vendor's increase-size operation does not add nodes synchronously,
	// so count them here for the node-resource calculations below
	upcomingNodes := make([]*schedulernodeinfo.NodeInfo, 0)
	for nodeGroup, numberOfNodes := range clusterStateRegistry.GetUpcomingNodes() {
		......
	}
	klog.V(4).Infof("Upcoming %d nodes", len(upcomingNodes))

	// Candidate node groups that go into the final selection
	expansionOptions := make(map[string]expander.Option, 0)

	......

	// Node groups that cannot take new nodes because of some limit or error,
	// e.g. the group has reached MaxSize
	skippedNodeGroups := map[string]status.Reasons{}

	// Weigh all the conditions and filter the node groups
	for _, nodeGroup := range nodeGroups {
		......
	}

	if len(expansionOptions) == 0 {
		klog.V(1).Info("No expansion options")
		return &status.ScaleUpStatus{
			Result:                  status.ScaleUpNoOptionsAvailable,
			PodsRemainUnschedulable: getRemainingPods(podEquivalenceGroups, skippedNodeGroups),
			ConsideredNodeGroups:    nodeGroups,
		}, nil
	}

	......

	// Pick the best node group to expand. The expander chooses a suitable node group;
	// the default is RandomExpander, flag: expander
	// random: picks one at random, suitable when there is only one node group
	// most-pods: picks the node group that can schedule the most pods, e.g. if some
	//   unschedulable pods have a nodeSelector, it prefers a matching node group so
	//   that most pods are satisfied
	// least-waste: prefers the node group with the smallest resource type that still
	//   satisfies the pods' resource requests
	// price: picks the cheapest according to the pricing model
	// priority: picks by priority
	bestOption := context.ExpanderStrategy.BestOption(options, nodeInfos)
	if bestOption != nil && bestOption.NodeCount > 0 {
		......

		newNodes := bestOption.NodeCount

		// Recompute the number of nodes to add this time, taking upcomingNodes into account
		if context.MaxNodesTotal > 0 && len(nodes)+newNodes+len(upcomingNodes) > context.MaxNodesTotal {
			klog.V(1).Infof("Capping size to max cluster total size (%d)", context.MaxNodesTotal)
			newNodes = context.MaxNodesTotal - len(nodes) - len(upcomingNodes)
			if newNodes < 1 {
				return &status.ScaleUpStatus{Result: status.ScaleUpError}, errors.NewAutoscalerError(
					errors.TransientError,
					"max node total count already reached")
			}
		}

		createNodeGroupResults := make([]nodegroups.CreateNodeGroupResult, 0)

		// If the node group does not exist on the cloud vendor's side, try to create
		// a cloud-side node group from the information at hand.
		// At present, however, none of the CloudProvider implementations seems to allow
		// this, so the method looks redundant.
		// Cloud vendors do not want to, and should not, hand the right to create
		// cloud-side node groups to ClusterAutoscaler
		if !bestOption.NodeGroup.Exist() {
			oldId := bestOption.NodeGroup.Id()
			createNodeGroupResult, err := processors.NodeGroupManager.CreateNodeGroup(context, bestOption.NodeGroup)
			......
		}

		// Get the sample node of the best node group
		nodeInfo, found := nodeInfos[bestOption.NodeGroup.Id()]
		if !found {
			// This should never happen, as we already should have retrieved
			// nodeInfo for any considered nodegroup.
			klog.Errorf("No node info for: %s", bestOption.NodeGroup.Id())
			return &status.ScaleUpStatus{Result: status.ScaleUpError, CreateNodeGroupResults: createNodeGroupResults}, errors.NewAutoscalerError(
				errors.CloudProviderError,
				"No node info for best expansion option!")
		}

		// Work out how many Nodes are needed from CPU, memory, and any GPU resources
		// (hack: we assume anything which is not cpu/memory to be a gpu.)
		newNodes, err = applyScaleUpResourcesLimits(context.CloudProvider, newNodes, scaleUpResourcesLeft, nodeInfo, bestOption.NodeGroup, resourceLimiter)
		if err != nil {
			return &status.ScaleUpStatus{Result: status.ScaleUpError, CreateNodeGroupResults: createNodeGroupResults}, err
		}

		// Node groups to balance
		targetNodeGroups := []cloudprovider.NodeGroup{bestOption.NodeGroup}

		// Balance node groups if requested via the balance-similar-node-groups flag:
		// find similar node groups and balance the node counts between them
		if context.BalanceSimilarNodeGroups {
			......
		}

		// For the balancing strategy itself, see (b *BalancingNodeGroupSetProcessor) BalanceScaleUpBetweenGroups
		scaleUpInfos, typedErr := processors.NodeGroupSetProcessor.BalanceScaleUpBetweenGroups(context, targetNodeGroups, newNodes)
		if typedErr != nil {
			return &status.ScaleUpStatus{Result: status.ScaleUpError, CreateNodeGroupResults: createNodeGroupResults}, typedErr
		}
		klog.V(1).Infof("Final scale-up plan: %v", scaleUpInfos)

		// Start the scale-up, via IncreaseSize
		for _, info := range scaleUpInfos {
			typedErr := executeScaleUp(context, clusterStateRegistry, info, gpu.GetGpuTypeForMetrics(gpuLabel, availableGPUTypes, nodeInfo.Node(), nil), now)
			if typedErr != nil {
				return &status.ScaleUpStatus{Result: status.ScaleUpError, CreateNodeGroupResults: createNodeGroupResults}, typedErr
			}
		}

		......
	}

	......
}
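The MaxNodesTotal capping step in the excerpt above is easy to isolate and test. The sketch below extracts just that arithmetic into a stand-alone function; `capNewNodes` and its parameter names are made up for this example, while the logic mirrors the condition shown in the excerpt.

```go
package main

import (
	"errors"
	"fmt"
)

var errMaxReached = errors.New("max node total count already reached")

// capNewNodes limits the requested number of new nodes so that current,
// upcoming, and new nodes together do not exceed maxTotal.
// A maxTotal of 0 or less means no cluster-wide limit.
func capNewNodes(requested, current, upcoming, maxTotal int) (int, error) {
	if maxTotal > 0 && current+requested+upcoming > maxTotal {
		requested = maxTotal - current - upcoming
		if requested < 1 {
			return 0, errMaxReached
		}
	}
	return requested, nil
}

func main() {
	// 8 nodes running, 1 on its way, limit of 10, 5 more requested.
	n, err := capNewNodes(5, 8, 1, 10)
	fmt.Println(n, err)
}
```

With 8 running nodes and 1 upcoming against a limit of 10, only one of the five requested nodes can still be added.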