linux 公平队列_公平调度(fair-share scheduling)的进程调度算法 | 学步园

89 阅读 0 评论 59 点赞

我是靠谱客的博主生动棒球，这篇文章主要介绍linux 公平队列_公平调度(fair-share scheduling)的进程调度算法 | 学步园，现在分享给大家，希望可以做个参考。

调度程序schedule()：

调度程序schedule()是一个非常频繁地执行的函数，因此要将运行效率放在第一位，函数中使用了很多的goto语句。

前面讲过，对schedule()只能由进程在内核中主动

调用，或者在当前进程从系统空间返回用户空间的前夕被动的

发生，而不能在一个中断服务程序的内部发生。即使一个中断服务程序有调度的要求，也只能通过把当前进程的need_resched字段设为１来表达这种要求，而不能直接调用schedule()。所以，如果在某个中断服务程序内部调用了schedule()，那一定是有问题的，所以转向scheduling_in_interrupt.(kernel/sched.c)

asmlinkage void schedule(void)

509 {

510

struct schedule_data * sched_data;

511 struct task_struct *prev, *next,

*p;

512 struct list_head *tmp;

513 int this_cpu, c;

514

515 if

(!current>

active_mm) BUG();

516 need_resched_back:

517 prev =

current;

518 this_cpu = prev>

processor;

519

520 if

(in_interrupt())

521 goto scheduling_in_interrupt

;

522

523

release_kernel_lock(prev, this_cpu);

524

525 /* Do "administrative" work

here while we don't hold any locks */

526 if (softirq_active(this_cpu) &

softirq_mask(this_cpu))

/*检查是否有内核软中断服务请求在等待，若有，就转入handle_softirq为这些请求服务*/

527

goto handle_softirq;

528

handle_softirq_back:

我们来看一下内核对这种问题的响应：

[schedule()]

686

scheduling_in_interrupt:

687 　　printk("Scheduling in interrupt/n");

688

BUG();

689

return;

内核对此的响应是显示或者在/var/log/messages文件末尾添上一条出错信息，然后执行一个宏操作BUG。

接着往下看schedule()：

如果有内核软中断服务请求在等待，那么就转入

handle_softirq：

[schedule()]

675 handle_softirq:

676

do_softirq();

677 　　　goto handle_softirq_back;

执行softirq队列完毕以后继续往下看：

====================

kernel/sched.c 528 541 ====================

[schedule()]

528 handle_softirq_back:

529

530 /*

531 *

'sched_data' is protected by the fact that we can run

532 * only one process

per CPU.

533 */

534 sched_data = &

aligned_data[this_cpu].schedule_data;

535

536

spin_lock_irq(&runqueue_lock);

537

538

/* move an exhausted RR process to be last.. */

539 if (prev>policy == SCHED_RR)

540 goto move_rr_last;

541

move_rr_back:

指针sched_data指向一个schedule_data数据结构，用来保存供下一次调度时使用的信息。此数据结构的定义如下：

====================

kernel/sched.c 91 101 ====================

91 /*

92 * We align

perCPU

scheduling data on cacheline boundaries,

93 * to prevent cacheline

pingpong.

94 */

95 static union {

96 　　struct schedule_data {

struct task_struct * curr;

98 　　　　cycles_t

last_schedule;

99 　　　}schedule_data;

100 　　　char __pad [SMP_CACHE_BYTES];

101 }

aligned_data [NR_CPUS

]

__cacheline_aligned = {

{{&init_task,0}}};

这里的cycles_t实际上是无符号整数，用来记录调度发生的时间。这个数据结构是为多处理器SMP结构而设的，因此我们不必关心。数组中的第一个元素，即CPU0的schedule_data结构初始化为{&init_task,0}，其余的则全为｛０，０｝。代码中的__cacheline_aligned表示数据结构的起点应与高速缓存中的缓冲线对齐。

下面就要涉及可执行进程队列了，所以先将这个队列锁住(536行)，以防止其他处理器的干扰。从538行开始：如果当前进程prev的调度策略是SCHED_RR，也就是轮换调度，那就要先进行一点特殊的处理(540 : goto move_rr_last;

)。

(对使用SCHED_RR策略的当前进程的处理)

==================== kernel/sched.c 679 685

====================

[schedule()]

679 move_rr_last:

680 if

(!prev>counter) {

681 prev>counter = NICE_TO_TICKS

(prev>nice);

682

move_last_runqueue(prev);

683 }

684 goto move_rr_back;

这里的prev>counter：代表这当前进程的运行时间配额，其数值在每次时钟中断时都要递减(update_process_times()中实现的)。因此，不管一个进程的时间配额有多高，随着运行时间的积累最终总会递减到０。对于调度策略为SCHED_RR的进程，一旦其时间配额降到０，就要从可执行进程队列runqueue

中当前的位置上移动到队列的末尾，同时恢复其最初的时间配额(NICE_TO_TICKS

)，以等待下一次的调度。对于具有相同优先级的进程，调度的时候排在前面的进程优先，所以这使队列中具有相同优先级的其他进程有了优势。

宏操作NICE_TO_TICKS根据系统时钟的精度将进程的优先级别换算成可以运行的时间配额。在kernel/sched.c中定义。

将一个进程的task_struct结构从可执行队列中的当前位置移到队列的末尾是由

move_last_runqueue()完成的(kernel/sched.c)。把进程移到可执行进程队列的末尾意味着：如果队列中没有资格更高的进程，但是有一个资格与之相同的进程存在，那么，这个资格虽然相同而排在前面的进程会被选中。

继续看schedule()：

====================

kernel/sched.c 541 553 ====================

[schedule()]

541 move_rr_back:

542

543 switch (prev>state

) {

544 case

TASK_INTERRUPTIBLE:

545 　　　if (signal_pending(prev)) {

546

prev>state = TASK_RUNNING;

547 　　　　　　break;

548 　　　　}

549

default:

550 　　　del_from_runqueue(prev);

551 case TASK_RUNNING:

552

}

553 prev>need_resched =

当前进程，就是正在执行的进程，当进入schedule()时其状态却不一定是TASK_RUNNING。例如：当前进程如已经在do_exit()中将其状态改成TASK_ZOMBIE，又如当前进程在sys_wait4()中调用schedule()时的状态为TASK_INTERRUPTIBLE。所以，这里的prev>state与其说是当前进程的状态不如说是其意愿。当其意愿既不是继续执行也不是可中断的睡眠时，就要通过del_from_runqueue()把这个进程从可执行队列中撤下来。另一方面，也可以看出TASK_INTERRUPTIBLE和TASK_UNINTERRUPTIBLE两种睡眠状态之间的区别：

前者在进程有信号等待处理时要将其改成TASK_RUNNING，让其处理完这些信号再说，而后者则不受信号的影响。

最后，将prev>need_resched恢复为０，因为所需求的调度已经在进行了。

下面的任务就是要挑选出一个进程来运行了(这一部分是很重要的，通过对就绪进程队列进行扫描)。

====================

kernel/sched.c 555 576 ====================

[schedule()]

555 /*

556 *

this is the scheduler proper:

557 */

558

559 repeat_schedule:

560

561 * Default process to select..

562 */

563 next = idle_task

(this_cpu);

564

c = 1000;

565 if (prev>state ==

TASK_RUNNING

)

566 goto still_running;

567

568

still_running_back:

569 list_for_each

(tmp,

&runqueue_head) {

570 p = list_entry(tmp, struct task_struct,

run_list);

571 if (can_schedule(p, this_cpu)) {

572 int

weight = goodness

(p, this_cpu,

prev>active_mm);

573 if (weight > c

)

574

c = weight, next = p;

575 }

576

}

在这段程序中，next总是指向已知最佳的候选进程，c则是这个进程的综合权值，或者是运行资格。

挑选的过程是从idle进程即0号进程开始，其权值为－1000，这是可能的最低值，表示仅在没有其他进程可以运行时才会让他运行。

然后，遍历可执行队列runqueue中的每个进程(在单CPU系统中can_schedule()的返回值永远是1)，也就是一般操作系统书中所称的就绪进程。为每一个就绪进程通过函数goodness

()计算出他当前所具有的权值，然后与当前的最高值c相比。注意这里的条件：weight > c，

这意味着“先入为大”

。也就是说，如果两个进程有相同的权值的话，排在队列前面的进程胜出，优先运行。

这里还有一个小插曲：如果当前进程的意图是继续运行，那么就要先执行一下still_running(kernel/sched.c)：

==================== kernel/sched.c 670 674

====================

[schedule()]

670 still_running:

671 c =

goodness(prev, this_cpu, prev>active_mm);

672 next = prev;

673

goto

still_running_back;

674

也就是说，如果当前进程想要继续运行，那么在挑选候选进程时以当前进程此刻的权值开始比较。而且这意味着，对于具有相同权值的其他进程来说，当前进程优先。

那么，进程的当前权值是怎样计算的呢？也就是goodness()是怎样执行的呢？

====================

kernel/sched.c 123 187 ====================

[schedule()>goodness()

]

123

124 * This is the function that decides how desirable a process

is..

125 * You can weigh different processes against each other

depending

126 * on what CPU they've run on lately etc to try to handle

cache

127 * and TLB miss penalties.

128 *

129 * Return values:

130 *

1000:never select this

131 * 0: out of time, recalculate counters (but it

might still be

132 * selected)

133 * +ve: "goodness" value (the larger,

the better)

134 * +1000: realtime process, select this.

135

136

137 static inline int goodness(struct task_struct * p, int

this_cpu, struct mm_struct *this_mm)

138 {

139 int weight;

140

141

142 * select the current process after every other

143 * runnable

process, but before the idle thread.

144 * Also, dont trigger a counter

recalculation.

145 */

146 weight = -1

;

147 if

(p>policy & SCHED_YIELD

)

148 goto

out;

149

150 /*

151 * Non RT process normal case first.

152

153 if (p>policy

== SCHED_OTHER

) {

154 /*

155 * Give the process a

firstapproximation goodness value

156 * according to the number of clockticks

it has left.

157 *

158 * Don't do any other calculations if the time slice

159 * over..

160 */

161 weight =

p->counter;

162 if (!weight)

163 goto out;

164

165

#ifdef CONFIG_SMP

166 /* Give a largish advantage to the same processor...

167 /* (this is equivalent to penalizing other processors) */

168 if

(p->processor == this_cpu)

169 weight += PROC_CHANGE_PENALTY;

170

#endif

171

172 /* .. and a slight advantage to the current MM */

173

if (p->mm == this_mm || !p->mm)

174 weight += 1;

175

weight += 20-

p>nice;

176 goto out;

177 }

178

179 /*

180 *

Realtime process, select the first one on the

181 * runqueue (taking

priorities within processes

182 * into account).

183 */

184 weight

= 1000 + p->rt_priority;

185 out:

186 return weight;

187

}

＊首先，如果一个进程通过系统调用sched_yield()明确表示了“礼让”后，就将其权值定位-1。这是很低的权值，一般就绪进程的权值至少是0。

＊对于没有实时要求的进程

，即调度策略为SCHED_OTHER的进程，其权值主要取决于两个因素：一个是剩下的时间配额p->counter

，如果用完了则权值为0。另一个是进程的优先级nice，这是从早期Unix沿用下来的负向优先级

(越负，优先级越高)，其取值范围为19~-20，只有特权用户才能把nice值设置为小于0。所以，综合的权值weight在时间配额尚未用完时基本上是二者之和。此外，如果是内核线程，或者其用户

空间与当前进程的相同，因而无需切换用户空间，则会得到一点小“奖励”，将权值额外加1。

＊对于实时进程，即调度策略为SCHED_FIFO或者SCHED_RR的进程，则另有一种正向的优先级

，那就是实时优先级rt_priority

，而权值为1000 +

p->rt_priority。可见，SCHED_FIFO或者SCHED_RR两种有时间要求的策略赋予进程很高的权值(相对于SCHED_OTHER)。这种进程的权值至少是1000。另一方面，rt_priority的值对于实时进程之间的权值比较也起着重要的作用，其数值也是在sched_setscheduler()中与调度策略一起设置的。

从上面可以看出：对于这两种实时调度策略，一个进程已经运行了多久，即时间配额p->counter

的当前值，对权值的计算不起所用。不过，前面讲到，对于使用SCHED_RR策略地进程，当

p->counter

达到

0时会导致将进程移到队列尾部。

实时进程的nice数值与优先级无关，但是对使用SCHED_RR策略地进程的时间配额大小有关(宏操作NICE_TO_TICKS())。由于实时进程的权值有个很大的基数(1000)，因此当有实时进程就绪时，非实时进程是没有机会运行的。

由此可见，linux内核中对权值的计算是很简单的，但是goodness()函数并不代表linux调度算法的全部，而要与前面讲到的对SCHED_RR进程的特殊处理

、对意欲继续运行的当前进程的特殊处理

‘以及下面要讲到的recalculate

结合起来分析。

上面still_running_back运行结束后，变量c的值有几种可能：一种可能是一个大于0的正数，此时next指向挑选出来的进程；另一种可能是c的值为0

，发生于就绪队列中所有进程的权值都是0的时候。由于除了init进程和调用了sched_yield()的进程以外，每个进程的权值最低为0，所以只要队列中有其他就绪进程存在就不可能为负数。因此，队列中所有其他进程的权值都已经降到0了，说明这些进程的调度策略都是SCHED_OTHER，即系统中当前没有就绪的实时进程，因为如果有策略为SCHED_FIFO或者SCHED_RR的进程存在，其权值至少也有1000。

let`s go

on：回到schedule()

==================== kernel/sched.c 578 580

====================

[schedule()]

578 /* Do we need to

recalculate

counters? */

579 if (!c)

580 goto

recalculate;

如果当前已经选择的进程(权值最高的进程)的权值为0，那就要重新计算各个进程的时间配额。如上所述，这说明系统中当前没有就绪的实时进程。而且，这种情况已经持续了一段时间，否则SCHED_OTHER进程的权值就没有机会消耗到0。

====================

kernel/sched.c 658 669 ====================

[schedule()]

658

recalculate:

659 {

660 struct task_struct *p;

661

spin_unlock_irq(&runqueue_lock);

662

read_lock(&tasklist_lock);

663 for_each_task

(p)

664

p->counter = (p->counter >> 1) +

NICE_TO_TICKS(p->nice);

665

read_unlock(&tasklist_lock);

666

spin_lock_irq(&runqueue_lock);

667 }

668 goto repeat_schedule;

这里所作的运算是将每个进程的当前的时间配额p->counter除以2，再在上面加上由该进程的nice值换算过来的tick数量。宏操作NICE_TO_TICKS的定义在前面已经见过，显然nice值对于非实时进程既表示优先级也决定着时间配额。

注意：这里的for_each_task()

是对所有进程的循环，而并不是仅对就绪进程队列的循环，对于不再就绪进程队列中的非实时进程

，这里得到了提升其时间配额、从而提升其综合权值的机会。不过，这种对综合权值的提升是很有限的，每次重新计算都将原有的时间配额减半，再与NICE_TO_TICKS(p->nice)相加，这样就决定了重新计算以后的综合权值永远也不可能达到NICE_TO_TICKS(p->nice)的两倍。因此，即使经过很长时间的韬光养晦，也不可能达到可与实时进程竞争的地步，所以只是对非实时进程之间的竞争有意义。

至于实时进程，时间配额的增加并不会提升其综合权值，而且对于SCHED_FIFO进程，时间配额就没有什么意义。

重新计算完权值以后，程序转回

repeat_schedule(跳回前面，再次执行挑选进程)处重新挑选。这样，当再次完成对就绪进程队列的扫描时，变量c的值就应该不为0了，此时next指向挑选出来的进程。

至此，已经挑选好进程了(权值最高的进程)。

还没有结束阿？哈哈

进程挑好之后，接下来要做的就是切换的事情了。

[schedule()]

581

582 * from this point on nothing can prevent us from

583 * switching to

the next task, save this fact in

584 * sched_data.

585 */

586 sched_data>curr =

next;

587 #ifdef CONFIG_SMP

.....

590 #endif

591

spin_unlock_irq(&runqueue_lock);

592

593 if (prev == next

)

594 goto same_process;

595

596 #ifdef

CONFIG_SMP

==================== kernel/sched.c 612 657

====================

612 #endif /* CONFIG_SMP */

613

614

kstat.context_swtch++;

615 /*

616 * there are 3 processes which are

affected by a context switch:

617 *

618 * prev == .... ==> (last =>

next)

620 * It's the 'much more previous' 'prev' that is on next's

stack,

621 * but prev is set to (the just run) 'last' process by

switch_to().

622 * This might sound slightly confusing but makes tons of

sense.

623 */

624 prepare_to_switch

();

625

{

626 struct mm_struct *mm = next->mm;

627 struct mm_struct *oldmm

= prev->active_mm;

628 if (!mm) {

629 if (next>active_mm)

BUG();

630 next>active_mm = oldmm;

631

atomic_inc(&oldmm>mm_count);

632 enter_lazy_tlb(oldmm, next,

this_cpu);

633 } else {

634 if (next>active_mm != mm)

BUG();

635 switch_mm(oldmm, mm, next, this_cpu);

636 }

637

638

if (!prev>mm) {

639 prev>active_mm = NULL;

640

mmdrop(oldmm);

641 }

642 }

643

644 /*

645 * This just switches

the register state and the

646 * stack.

647 */

648 switch_to(prev,

next, prev);

649 __schedule_tail(prev);

650

651 same_process:

652

reacquire_kernel_lock(current);

653 if (current>need_resched)

654

goto need_resched_back;

655

656 return;

跳过对SMP结构的条件编译部分。

首先，如果挑选出来的进程next就是当前进程prev，就不用切换，直接跳转到same_process处就返回了。这里的reacquire_kernel_lock()对于i386单CPU结构而言是空语句。前面已经把当前进程的need_resched清0，如果现在又成了非0，则一定是发生了中断并且情况有了变化，所以转回need_resched_back再调度一次。

否则，如果挑选出来的进程next与当前进程prev不同，那就要切换了。对于i386单CPU结构而言，prepare_to_switch()也是空语句。而649行的__schedule_tail()则只是将当前进程prev的task_struct结构中的policy字段里的SCHED_YIELD标志位清成0。所以实际上只剩下了两件事：对用户虚存空间的处理；进程的切换switch_to()。