Illustrating the core data structures of the Kubernetes scheduler framework


The scheduling framework is Kubernetes' second take on scheduler extensibility. Where the Scheduler Extender relies on remote, standalone services, the framework's core implements an in-process, standardized workflow built around extension points

1. Goals of the extension mechanism

The framework's design is clearly described in the official documentation, although the API is not yet stable. Based on version 1.18, this article covers implementation details beyond the official description

1.1 Stage extension points

At present the official extension points are concentrated around the former predicate (filtering) and priority (scoring) stages, which get the most hooks. Each extension point corresponds to a plugin type; we can write plugins for the appropriate stage to enhance scheduling as needed

In the current version the priority (scoring) plugins have already been moved into the framework; the predicate (filtering) plugins should follow, which will likely take some time to stabilize

1.2 Context and CycleState

Every plugin call in every extension phase receives two objects: a Context and a CycleState. The Context is used much as in most Go programs; here it mainly provides a unified exit path when phases run work in parallel. The CycleState stores all data for the current scheduling cycle and is a concurrency-safe structure guarded by an internal read-write lock
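The Context's role can be sketched with a toy example: one context gives many parallel per-node checks a single cancellation point. None of this is scheduler code; `checkNode`, `filterNodes`, and the node names are illustrative.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// checkNode stands in for one per-node plugin call.
func checkNode(ctx context.Context, node string) bool {
	select {
	case <-ctx.Done():
		return false // someone already asked everyone to stop
	default:
		return node != "bad-node"
	}
}

// filterNodes runs checks in parallel; cancel() is the single exit path
// every goroutine observes, which is what the Context provides.
func filterNodes(nodes []string) []string {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var (
		mu       sync.Mutex
		feasible []string
		wg       sync.WaitGroup
	)
	for _, n := range nodes {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			if checkNode(ctx, n) {
				mu.Lock()
				feasible = append(feasible, n)
				mu.Unlock()
			}
		}(n)
	}
	wg.Wait()
	return feasible
}

func main() {
	fmt.Println(len(filterNodes([]string{"node-a", "bad-node", "node-b"}))) // 2
}
```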

1.3 Permit

Permit runs immediately before Bind. Its design goal is to make the final decision: whether the current pod may proceed to the bind operation. Each Permit plugin holds a one-vote veto; if any plugin rejects, the pod goes back for rescheduling
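The one-vote veto folds naturally into a few lines. This is a hypothetical sketch, not the framework's real Status type: a single reject wins outright, and a single wait downgrades an otherwise-allowed result.

```go
package main

import "fmt"

type permitDecision int

const (
	allow permitDecision = iota
	wait
	reject
)

// runPermit folds per-plugin decisions: any reject wins immediately,
// otherwise one wait turns the overall result into wait.
func runPermit(decisions []permitDecision) permitDecision {
	result := allow
	for _, d := range decisions {
		switch d {
		case reject:
			return reject // one veto is enough; the pod is rescheduled
		case wait:
			result = wait
		}
	}
	return result
}

func main() {
	fmt.Println(runPermit([]permitDecision{allow, wait, allow}) == wait) // true
}
```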

2. Core source code implementation

2.1 Framework core data structure

The framework's core data structure divides into three parts: the plugin collections (one per extension stage), the metadata-access interfaces (cluster and snapshot data), and the set of waiting pods

2.1.1 Plugin sets

Plugins are grouped and stored by type. There is also a map of plugin weights, currently used only by the scoring stage; weights for filtering may be added later

    pluginNameToWeightMap map[string]int
    queueSortPlugins      []QueueSortPlugin
    preFilterPlugins      []PreFilterPlugin
    filterPlugins         []FilterPlugin
    postFilterPlugins     []PostFilterPlugin
    scorePlugins          []ScorePlugin
    reservePlugins        []ReservePlugin
    preBindPlugins        []PreBindPlugin
    bindPlugins           []BindPlugin
    postBindPlugins       []PostBindPlugin
    unreservePlugins      []UnreservePlugin
    permitPlugins         []PermitPlugin

2.1.2 Cluster data access

These fields implement the cluster data-access interfaces, chiefly FrameworkHandle, which provides plugins with interfaces for data access and cluster operations

    clientSet       clientset.Interface
    informerFactory informers.SharedInformerFactory
    volumeBinder    *volumebinder.VolumeBinder
    snapshotSharedLister  schedulerlisters.SharedLister

2.1.3 Waiting pod set

The waiting-pod collection holds pods parked in the Permit stage. If a pod is deleted during its wait, it is rejected immediately

    waitingPods           *waitingPodsMap

2.1.4 Plugin factory registry

All registered plugin factories are stored in the registry; concrete plugins are then built through them

    registry              Registry

2.2 Plugin factory registry

2.2.1 Plugin factory function

A factory function takes the plugin's configuration and builds a plugin instance. The FrameworkHandle is mainly used to obtain cluster snapshots and other data

type PluginFactory = func(configuration *runtime.Unknown, f FrameworkHandle) (Plugin, error)

2.2.2 Plugin factory registry implementation

As in most Go code, the factory registry is implemented as a map, exposing Register and Unregister (and Merge) methods

type Registry map[string]PluginFactory

// Register adds a new plugin to the registry. If a plugin with the same name
// exists, it returns an error.
func (r Registry) Register(name string, factory PluginFactory) error {
    if _, ok := r[name]; ok {
        return fmt.Errorf("a plugin named %v already exists", name)
    }
    r[name] = factory
    return nil
}

// Unregister removes an existing plugin from the registry. If no plugin with
// the provided name exists, it returns an error.
func (r Registry) Unregister(name string) error {
    if _, ok := r[name]; !ok {
        return fmt.Errorf("no plugin named %v exists", name)
    }
    delete(r, name)
    return nil
}

// Merge merges the provided registry to the current one.
func (r Registry) Merge(in Registry) error {
    for name, factory := range in {
        if err := r.Register(name, factory); err != nil {
            return err
        }
    }
    return nil
}
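A quick usage sketch of the map-based registry pattern, with Plugin and PluginFactory simplified into self-contained stand-ins for the framework types:

```go
package main

import "fmt"

// Simplified stand-ins for the framework's Plugin and PluginFactory types.
type Plugin interface{ Name() string }

type PluginFactory func() (Plugin, error)

type Registry map[string]PluginFactory

// Register adds a factory, rejecting duplicate names as the framework does.
func (r Registry) Register(name string, factory PluginFactory) error {
	if _, ok := r[name]; ok {
		return fmt.Errorf("a plugin named %v already exists", name)
	}
	r[name] = factory
	return nil
}

type noop struct{}

func (noop) Name() string { return "noop" }

func main() {
	r := Registry{}
	if err := r.Register("noop", func() (Plugin, error) { return noop{}, nil }); err != nil {
		panic(err)
	}
	// A duplicate name is rejected.
	if err := r.Register("noop", nil); err != nil {
		fmt.Println(err)
	}
}
```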

2.3 Plugin registration implementation

Taking the preFilter plugins as an example, let's walk through the whole registration flow

2.3.1 Plugins

The Plugins struct lists, for each extension point, the plugin set enabled in the current configuration

type Plugins struct {
    // QueueSort is a list of plugins that should be invoked when sorting pods in the scheduling queue.
    QueueSort *PluginSet

    // PreFilter is a list of plugins that should be invoked at "PreFilter" extension point of the scheduling framework.
    PreFilter *PluginSet

    // Filter is a list of plugins that should be invoked when filtering out nodes that cannot run the Pod.
    Filter *PluginSet

    // PostFilter is a list of plugins that are invoked after filtering out infeasible nodes.
    PostFilter *PluginSet

    // Score is a list of plugins that should be invoked when ranking nodes that have passed the filtering phase.
    Score *PluginSet

    // Reserve is a list of plugins invoked when reserving a node to run the pod.
    Reserve *PluginSet

    // Permit is a list of plugins that control binding of a Pod. These plugins can prevent or delay binding of a Pod.
    Permit *PluginSet

    // PreBind is a list of plugins that should be invoked before a pod is bound.
    PreBind *PluginSet

    // Bind is a list of plugins that should be invoked at "Bind" extension point of the scheduling framework.
    // The scheduler call these plugins in order. Scheduler skips the rest of these plugins as soon as one returns success.
    Bind *PluginSet

    // PostBind is a list of plugins that should be invoked after a pod is successfully bound.
    PostBind *PluginSet

    // Unreserve is a list of plugins invoked when a pod that was previously reserved is rejected in a later phase.
    Unreserve *PluginSet
}

2.3.2 Plugin set mapping

This method maps each configured plugin set to the matching typed plugin slice in the framework, e.g. PreFilter to its preFilterPlugins slice; conceptually it maps a plugin type (string) to []PreFilterPlugin, carrying a pointer to the slice so it can be filled in later via reflection

func (f *framework) getExtensionPoints(plugins *config.Plugins) []extensionPoint {
    return []extensionPoint{
        {plugins.PreFilter, &f.preFilterPlugins},
        {plugins.Filter, &f.filterPlugins},
        {plugins.Reserve, &f.reservePlugins},
        {plugins.PostFilter, &f.postFilterPlugins},
        {plugins.Score, &f.scorePlugins},
        {plugins.PreBind, &f.preBindPlugins},
        {plugins.Bind, &f.bindPlugins},
        {plugins.PostBind, &f.postBindPlugins},
        {plugins.Unreserve, &f.unreservePlugins},
        {plugins.Permit, &f.permitPlugins},
        {plugins.QueueSort, &f.queueSortPlugins},
    }
}

2.3.3 Scanning and collecting all enabled plugins

This method traverses all the mappings above, but instead of registering plugins into their typed slices it collects every enabled plugin into a single map, pgMap

func (f *framework) pluginsNeeded(plugins *config.Plugins) map[string]config.Plugin {
    pgMap := make(map[string]config.Plugin)

    if plugins == nil {
        return pgMap
    }

    // Build an anonymous function that modifies pgMap via closure to collect all enabled plugins
    find := func(pgs *config.PluginSet) {
        if pgs == nil {
            return
        }
        for _, pg := range pgs.Enabled { // traverse the enabled plugin set
            pgMap[pg.Name] = pg // save to the map
        }
    }
    // Traverse all the mappings above
    for _, e := range f.getExtensionPoints(plugins) {
        find(e.plugins)
    }
    return pgMap
}

2.3.4 Building plugins through the factories

The plugin factory registry generated earlier is then iterated: each plugin is built through its factory and saved into pluginsMap

pluginsMap := make(map[string]Plugin)
for name, factory := range r {
    // pg is the pgMap built above; only plugins that are actually needed are built
    if _, ok := pg[name]; !ok {
        continue
    }

    p, err := factory(pluginConfig[name], f)
    if err != nil {
        return nil, fmt.Errorf("error initializing plugin %q: %v", name, err)
    }
    pluginsMap[name] = p

    // Save the weight, defaulting to 1
    f.pluginNameToWeightMap[name] = int(pg[name].Weight)
    if f.pluginNameToWeightMap[name] == 0 {
        f.pluginNameToWeightMap[name] = 1
    }
    // Checks totalPriority against MaxTotalScore to avoid overflow
    if int64(f.pluginNameToWeightMap[name])*MaxNodeScore > MaxTotalScore-totalPriority {
        return nil, fmt.Errorf("total score of Score plugins could overflow")
    }
    totalPriority += int64(f.pluginNameToWeightMap[name]) * MaxNodeScore
}

2.3.5 Registering plugins by type

Finally, the pluginsMap built above is combined with reflection through e.slicePtr to register each plugin into the slice for its specific type

    for _, e := range f.getExtensionPoints(plugins) {
        if err := updatePluginList(e.slicePtr, e.plugins, pluginsMap); err != nil {
            return nil, err
        }
    }

updatePluginList is implemented with reflection: getExtensionPoints supplies the address of the corresponding slice in the framework, and each plugin is validated and appended to it via reflection

func updatePluginList(pluginList interface{}, pluginSet *config.PluginSet, pluginsMap map[string]Plugin) error {
    if pluginSet == nil {
        return nil
    }

    // First, get the slice behind the pointer via Elem
    plugins := reflect.ValueOf(pluginList).Elem()
    // Then get the type of the slice's elements
    pluginType := plugins.Type().Elem()
    set := sets.NewString()
    for _, ep := range pluginSet.Enabled {
        pg, ok := pluginsMap[ep.Name]
        if !ok {
            return fmt.Errorf("%s %q does not exist", pluginType.Name(), ep.Name)
        }

        // Validity check: if the plugin does not implement the extension point's interface, report an error
        if !reflect.TypeOf(pg).Implements(pluginType) {
            return fmt.Errorf("plugin %q does not extend %s plugin", ep.Name, pluginType.Name())
        }

        if set.Has(ep.Name) {
            return fmt.Errorf("plugin %q already registered as %q", ep.Name, pluginType.Name())
        }
        set.Insert(ep.Name)

        // Append the plugin to the slice and write the result back through the pointer
        newPlugins := reflect.Append(plugins, reflect.ValueOf(pg))
        plugins.Set(newPlugins)
    }
    return nil
}
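The crucial detail is writing the result of reflect.Append back with Set: Append returns a new slice value rather than mutating in place. A standalone sketch of the same trick (FilterPlugin and alwaysTrue are illustrative, not framework types):

```go
package main

import (
	"fmt"
	"reflect"
)

// Illustrative stand-ins for an extension-point interface and a plugin.
type FilterPlugin interface{ Filter() bool }

type alwaysTrue struct{}

func (alwaysTrue) Filter() bool { return true }

// appendPlugin mirrors the shape of updatePluginList for a single plugin:
// dereference the pointer, check the interface, append, and set back.
func appendPlugin(slicePtr interface{}, pg interface{}) error {
	plugins := reflect.ValueOf(slicePtr).Elem() // the slice behind the pointer
	pluginType := plugins.Type().Elem()         // the slice's element (interface) type
	if !reflect.TypeOf(pg).Implements(pluginType) {
		return fmt.Errorf("plugin does not implement %s", pluginType.Name())
	}
	// Append returns a new slice value; Set writes it back through the pointer.
	plugins.Set(reflect.Append(plugins, reflect.ValueOf(pg)))
	return nil
}

func main() {
	var filters []FilterPlugin
	if err := appendPlugin(&filters, alwaysTrue{}); err != nil {
		panic(err)
	}
	fmt.Println(len(filters)) // 1
}
```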

2.4 CycleState

CycleState is responsible for storing and cloning data across the scheduling cycle. It exposes its read-write lock so that each extension point's plugins can choose the appropriate lock themselves

2.4.1 data structure

CycleState's implementation is simple: it mainly stores StateData values, which need only implement a Clone method. Any plugin in the current framework may add or modify data in the CycleState; thread safety is ensured by the read-write lock, but there are no per-plugin restrictions: all plugins are trusted and may add or delete entries freely

type CycleState struct {
    mx      sync.RWMutex
    storage map[StateKey]StateData
    // if recordPluginMetrics is true, PluginExecutionDuration will be recorded for this cycle.
    recordPluginMetrics bool
}

// StateData is a generic type for arbitrary data stored in CycleState.
type StateData interface {
    // Clone is an interface to make a copy of StateData. For performance reasons,
    // clone should make shallow copies for members (e.g., slices or maps) that are not
    // impacted by PreFilter's optional AddPod/RemovePod methods.
    Clone() StateData
}
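A plugin's own state type only needs that one Clone method. preFilterState below is a hypothetical example (not a real framework type) that copies its slice so a cloned cycle cannot alias the original:

```go
package main

import "fmt"

// StateData mirrors the framework's contract: one Clone method.
type StateData interface{ Clone() StateData }

// preFilterState is an illustrative plugin-owned state type.
type preFilterState struct {
	feasibleNodes []string
}

func (s *preFilterState) Clone() StateData {
	// Copy the slice so the clone does not share a backing array with the original.
	c := &preFilterState{feasibleNodes: make([]string, len(s.feasibleNodes))}
	copy(c.feasibleNodes, s.feasibleNodes)
	return c
}

func main() {
	orig := &preFilterState{feasibleNodes: []string{"node-a"}}
	cl := orig.Clone().(*preFilterState)
	cl.feasibleNodes[0] = "node-b" // mutate the clone only
	fmt.Println(orig.feasibleNodes[0]) // node-a
}
```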

2.4.2 External interface implementation

The accessors themselves are not thread safe; the calling plugin is expected to take the read or write lock first, then read or modify the data

// Read retrieves data with the given "key" from CycleState.
// This function is not thread safe. In multi-threaded code, lock should be
// acquired first.
func (c *CycleState) Read(key StateKey) (StateData, error) {
    if v, ok := c.storage[key]; ok {
        return v, nil
    }
    return nil, errors.New(NotFound)
}

// Write stores the given "val" in CycleState with the given "key".
// This function is not thread safe. In multi-threaded code, lock should be
// acquired first.
func (c *CycleState) Write(key StateKey, val StateData) {
    c.storage[key] = val
}

// Delete deletes data with the given key from CycleState.
// This function is not thread safe. In multi-threaded code, lock should be
// acquired first.
func (c *CycleState) Delete(key StateKey) {
    delete(c.storage, key)
}

// Lock acquires CycleState lock.
func (c *CycleState) Lock() {
    c.mx.Lock()
}

// Unlock releases CycleState lock.
func (c *CycleState) Unlock() {
    c.mx.Unlock()
}

// RLock acquires CycleState read lock.
func (c *CycleState) RLock() {
    c.mx.RLock()
}

// RUnlock releases CycleState read lock.
func (c *CycleState) RUnlock() {
    c.mx.RUnlock()
}

2.5 Waiting pods

Pods whose Permit result is Wait are stored here during the wait phase until they are allowed or rejected

2.5.1 data structure

waitingPodsMap stores a map keyed by pod UID and protects the data with a read-write lock

type waitingPodsMap struct {
    pods map[types.UID]WaitingPod
    mu   sync.RWMutex
}

waitingPod is the waiting instance for a specific pod. Internally, pendingPlugins holds one timer per plugin-defined wait time; externally, the pod's status is delivered through the chan *Status, and access is serialized by the read-write lock

type waitingPod struct {
    pod            *v1.Pod
    pendingPlugins map[string]*time.Timer
    s              chan *Status
    mu             sync.RWMutex
}

2.5.2 Building a waitingPod and its timers

One timer is built per plugin according to its wait time; if any timer expires, the pod is rejected

func newWaitingPod(pod *v1.Pod, pluginsMaxWaitTime map[string]time.Duration) *waitingPod {
    wp := &waitingPod{
        pod: pod,
        s:   make(chan *Status),
    }

    wp.mu.Lock()
    defer wp.mu.Unlock()
    wp.pendingPlugins = make(map[string]*time.Timer, len(pluginsMaxWaitTime))
    // The time.AfterFunc calls wp.Reject which iterates through pendingPlugins map. Acquire the
    // lock here so that time.AfterFunc can only execute after newWaitingPod finishes.
    // One timer per plugin wait time; if any timer expires before every plugin has allowed,
    // the pod is rejected
    for k, v := range pluginsMaxWaitTime {
        plugin, waitTime := k, v
        wp.pendingPlugins[plugin] = time.AfterFunc(waitTime, func() {
            msg := fmt.Sprintf("rejected due to timeout after waiting %v at plugin %v",
                waitTime, plugin)
            wp.Reject(msg)
        })
    }

    return wp
}

2.5.3 Stopping the timers and sending the reject event

If any plugin's timer expires, or any plugin initiates a reject, all timers are stopped and the rejection is broadcast

func (w *waitingPod) Reject(msg string) bool {
    w.mu.RLock()
    defer w.mu.RUnlock()
    // Stop all timers
    for _, timer := range w.pendingPlugins {
        timer.Stop()
    }

    // Send the reject event over the channel (non-blocking)
    select {
    case w.s <- NewStatus(Unschedulable, msg):
        return true
    default:
        return false
    }
}
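The select with a default branch makes the status send non-blocking: on an unbuffered channel like waitingPod.s it succeeds only if a receiver is already blocked waiting, so a second, late decision is simply dropped. A standalone illustration (trySend is a stand-in, not framework code):

```go
package main

import "fmt"

// trySend mirrors the select/default send used by Reject and Allow.
func trySend(ch chan string, msg string) bool {
	select {
	case ch <- msg:
		return true
	default:
		return false
	}
}

func main() {
	unbuf := make(chan string) // like waitingPod.s: no receiver yet, so the send is dropped
	fmt.Println(trySend(unbuf, "reject")) // false

	buf := make(chan string, 1)
	fmt.Println(trySend(buf, "reject")) // true: the buffer accepts the first send
	fmt.Println(trySend(buf, "again"))  // false: buffer full, second send dropped
}
```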

2.5.4 Sending the allow event

Allow only signals success once every plugin has allowed the pod

func (w *waitingPod) Allow(pluginName string) bool {
    w.mu.Lock()
    defer w.mu.Unlock()
    if timer, exist := w.pendingPlugins[pluginName]; exist {
        // Stop the current plugin's timer
        timer.Stop()
        delete(w.pendingPlugins, pluginName)
    }

    // Only signal success status after all plugins have allowed
    if len(w.pendingPlugins) != 0 {
        return true
    }
    select {
    case w.s <- NewStatus(Success, ""): // send the event
        return true
    default:
        return false
    }
}

2.5.5 The wait implementation in the Permit phase

All Permit plugins are run first; if any of them sets the status to Wait, the pod then waits according to the plugins' wait times

func (f *framework) RunPermitPlugins(ctx context.Context, state *CycleState, pod *v1.Pod, nodeName string) (status *Status) {
    startTime := time.Now()
    defer func() {
        metrics.FrameworkExtensionPointDuration.WithLabelValues(permit, status.Code().String()).Observe(metrics.SinceInSeconds(startTime))
    }()
    pluginsWaitTime := make(map[string]time.Duration)
    statusCode := Success
    for _, pl := range f.permitPlugins {
        status, timeout := f.runPermitPlugin(ctx, pl, state, pod, nodeName)
        if !status.IsSuccess() {
            if status.IsUnschedulable() {
                msg := fmt.Sprintf("rejected by %q at permit: %v", pl.Name(), status.Message())
                return NewStatus(status.Code(), msg)
            }
            if status.Code() == Wait {
                // Not allowed to be greater than maxTimeout.
                if timeout > maxTimeout {
                    timeout = maxTimeout
                }
                // Record the current plugin's wait time
                pluginsWaitTime[pl.Name()] = timeout
                statusCode = Wait
            } else {
                msg := fmt.Sprintf("error while running %q permit plugin for pod %q: %v", pl.Name(), pod.Name, status.Message())
                return NewStatus(Error, msg)
            }
        }
    }

    // We now wait for the minimum duration if at least one plugin asked to
    // wait (and no plugin rejected the pod)
    if statusCode == Wait {
        startTime := time.Now()
        // Build a waitingPod from the plugins' wait times
        w := newWaitingPod(pod, pluginsWaitTime)
        // Add it to the waitingPods map
        f.waitingPods.add(w)
        defer f.waitingPods.remove(pod.UID)
        klog.V(4).Infof("waiting for pod %q at permit", pod.Name)
        // Wait for the status message
        s := <-w.s
        metrics.PermitWaitDuration.WithLabelValues(s.Code().String()).Observe(metrics.SinceInSeconds(startTime))
        if !s.IsSuccess() {
            if s.IsUnschedulable() {
                msg := fmt.Sprintf("pod %q rejected while waiting at permit: %v", pod.Name, s.Message())
                return NewStatus(s.Code(), msg)
            }
            msg := fmt.Sprintf("error received while waiting at permit for pod %q: %v", pod.Name, s.Message())
            return NewStatus(Error, msg)
        }
    }

    return nil
}

2.6 Overview of plugin invocation

With plugin registration, in-cycle data storage, and the waiting mechanism covered, what remains is how each plugin type is actually invoked. Outside the scoring stage there is almost no extra logic, and the scoring stage mirrors the priority design covered earlier in this series, so it is not repeated here


2.6.1 RunPreFilterPlugins

The flow is straightforward. Note that if any plugin rejects at this point, scheduling fails immediately

func (f *framework) RunPreFilterPlugins(ctx context.Context, state *CycleState, pod *v1.Pod) (status *Status) {
    startTime := time.Now()
    defer func() {
        metrics.FrameworkExtensionPointDuration.WithLabelValues(preFilter, status.Code().String()).Observe(metrics.SinceInSeconds(startTime))
    }()
    for _, pl := range f.preFilterPlugins {
        status = f.runPreFilterPlugin(ctx, pl, state, pod)
        if !status.IsSuccess() {
            if status.IsUnschedulable() {
                msg := fmt.Sprintf("rejected by %q at prefilter: %v", pl.Name(), status.Message())
                return NewStatus(status.Code(), msg)
            }
            msg := fmt.Sprintf("error while running %q prefilter plugin for pod %q: %v", pl.Name(), pod.Name, status.Message())
            return NewStatus(Error, msg)
        }
    }
    return nil
}

2.6.2 RunFilterPlugins

Similar to the previous one, but the runAllFilters flag decides whether to keep running the remaining plugins after a failure; by default it stops at the first failure

func (f *framework) RunFilterPlugins(
    ctx context.Context,
    state *CycleState,
    pod *v1.Pod,
    nodeInfo *schedulernodeinfo.NodeInfo,
) PluginToStatus {
    var firstFailedStatus *Status
    startTime := time.Now()
    defer func() {
        metrics.FrameworkExtensionPointDuration.WithLabelValues(filter, firstFailedStatus.Code().String()).Observe(metrics.SinceInSeconds(startTime))
    }()
    statuses := make(PluginToStatus)
    for _, pl := range f.filterPlugins {
        pluginStatus := f.runFilterPlugin(ctx, pl, state, pod, nodeInfo)
        if len(statuses) == 0 {
            firstFailedStatus = pluginStatus
        }
        if !pluginStatus.IsSuccess() {
            if !pluginStatus.IsUnschedulable() {
                // Filter plugins are not supposed to return any status other than
                // Success or Unschedulable.
                firstFailedStatus = NewStatus(Error, fmt.Sprintf("running %q filter plugin for pod %q: %v", pl.Name(), pod.Name, pluginStatus.Message()))
                return map[string]*Status{pl.Name(): firstFailedStatus}
            }
            statuses[pl.Name()] = pluginStatus
            if !f.runAllFilters {
                // Exit early; no need to run the remaining plugins
                return statuses
            }
        }
    }

    return statuses
}

Let's call it a day. The scheduler changes are substantial, but it is predictable that ever more scheduling plugins will be concentrated in the framework. This also wraps up the Kubernetes scheduler series. As a Kubernetes novice I found the learning curve steep; fortunately the scheduler's design is only loosely coupled to the other modules


Reference: the official design documentation of the Kubernetes scheduler framework
