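Halide Lesson 05: Vectorize, Parallelize, Unroll, and Tile
Note: by default Halide traverses an image in row-major order. x is the inner loop and y is the outer loop, so x varies fastest.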
Func gradient("gradient");
gradient(x, y) = x + y;
gradient.trace_stores();
Prints every value that gradient stores while it is being realized.
gradient.print_loop_nest();
Prints the loop nest that Halide generates for gradient:
produce gradient:
  for y:
    for x:
      gradient(...) = ...
gradient.reorder(y, x);
Reorders the loops so that y becomes the inner loop and x the outer loop. The loop nest is now:
Pseudo-code for the schedule:
produce gradient_col_major:
  for x:
    for y:
      gradient_col_major(...) = ...
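To make the effect of the reorder concrete, the two traversal orders can be emulated in plain C++ without Halide. This is only a minimal sketch: `visit_order` and its `x_innermost` flag are hypothetical names chosen for illustration and are not part of the Halide API.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Emulates a loop nest over a width x height domain and records the (x, y)
// coordinates in the order they are visited.
//   x_innermost == true  -> default schedule:  for y { for x { ... } }
//   x_innermost == false -> after reorder(y, x): for x { for y { ... } }
std::vector<std::pair<int, int>> visit_order(int width, int height, bool x_innermost) {
    assert(width >= 0 && height >= 0);
    std::vector<std::pair<int, int>> order;
    if (x_innermost) {
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                order.emplace_back(x, y);
    } else {
        for (int x = 0; x < width; x++)
            for (int y = 0; y < height; y++)
                order.emplace_back(x, y);
    }
    return order;
}
```

For a 3x2 domain, the second point visited is (1, 0) with the default order and (0, 1) after the reorder. The set of points is identical; only the order changes.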
gradient.split(x, x_outer, x_inner, 2);
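x -> the dimension to split
x_outer -> the outer loop after the split
x_inner -> the inner loop after the split
2 -> the split factor. The inner loop runs from 0 to factor, the outer loop runs from 0 to extent / factor (where extent is the size of the x dimension), and the original index is recovered as x = x_outer * factor + x_inner.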
for (int y = 0; y < 4; y++) {
    for (int x_outer = 0; x_outer < 2; x_outer++) {
        for (int x_inner = 0; x_inner < 2; x_inner++) {
            int x = x_outer * 2 + x_inner;
            printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
        }
    }
}
If the factor does not evenly divide the extent of the dimension, Halide still handles the split correctly. Suppose the factor is 3 and the extent of x is 7:
gradient.split(x, x_outer, x_inner, 3);
Buffer<int> output = gradient.realize(7, 2);
The equivalent loop nest is:
for (int y = 0; y < 2; y++) {
    for (int x_outer = 0; x_outer < 3; x_outer++) { // Now runs from 0 to 2
        for (int x_inner = 0; x_inner < 3; x_inner++) {
            int x = x_outer * 3;
            // Before we add x_inner, make sure we don't
            // evaluate points outside of the 7x2 box. We'll
            // clamp x to be at most 4 (7 minus the split
            // factor).
            if (x > 4) x = 4;
            x += x_inner;
            printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
        }
    }
}
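The clamping in the loop above means that some x values are computed more than once. The following plain C++ sketch counts the evaluations per x for a given extent and factor. `split_eval_counts` is a hypothetical helper for illustration, not a Halide function, and it assumes the clamp-the-base strategy shown above with 0 < factor <= extent.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Emulates split(x, x_outer, x_inner, factor) over [0, extent) and returns
// how many times each x is evaluated. The base of the last block is clamped
// to extent - factor, as in the loop nest above.
std::vector<int> split_eval_counts(int extent, int factor) {
    assert(factor > 0 && factor <= extent);
    std::vector<int> counts(extent, 0);
    int outer_extent = (extent + factor - 1) / factor;  // ceil(extent / factor)
    for (int x_outer = 0; x_outer < outer_extent; x_outer++) {
        // Shift the final block back so it ends exactly at extent - 1.
        int base = std::min(x_outer * factor, extent - factor);
        for (int x_inner = 0; x_inner < factor; x_inner++) {
            counts[base + x_inner]++;
        }
    }
    return counts;
}
```

With extent 7 and factor 3 the counts are 1, 1, 1, 1, 2, 2, 1: x = 4 and x = 5 are each evaluated twice. That is harmless here because gradient is a pure function with no side effects, so recomputing a point gives the same value. When the factor divides the extent (for example 4 and 2), every count is 1.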
gradient.fuse(x, y, fused);
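Fuses two loop variables into a single one. Here x is the inner variable, y is the outer variable, and fused is the new combined variable:
for (int fused = 0; fused < 4*4; fused++) {
    int y = fused / 4;
    int x = fused % 4;
    printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);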
}
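The index arithmetic behind the fused loop can be written as a pair of small functions. This is a sketch in plain C++: `fuse_index` and `unfuse_index` are hypothetical names for illustration, and `x_extent` stands for the size of the inner (x) dimension, which is 4 in the loop above.

```cpp
#include <cassert>
#include <utility>

// Maps (x, y) to the fused index, with x as the inner variable.
int fuse_index(int x, int y, int x_extent) {
    assert(x_extent > 0 && x >= 0 && x < x_extent && y >= 0);
    return y * x_extent + x;
}

// Recovers (x, y) from the fused index, exactly as the loop body does.
std::pair<int, int> unfuse_index(int fused, int x_extent) {
    assert(x_extent > 0 && fused >= 0);
    return {fused % x_extent, fused / x_extent};
}
```

For a 4x4 domain, the point (3, 2) maps to fused index 11, and fused index 11 maps back to (3, 2). Because x is the inner variable, consecutive fused indices walk along a row before moving to the next one.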
gradient.vectorize(x_inner);
Var x_outer, x_inner;
gradient.split(x, x_outer, x_inner, 4);
gradient.vectorize(x_inner);
Alternatively, use gradient.vectorize(x, 4);
which is shorthand for the split followed by vectorize shown above.
The resulting loop nest is:
for (int y = 0; y < 4; y++) {
    for (int x_outer = 0; x_outer < 2; x_outer++) {
        // The loop over x_inner has gone away, and has been
        // replaced by a vectorized version of the
        // expression. On x86 processors, Halide generates SSE
        // for all of this.
        int x_vec[] = {x_outer * 4 + 0,
                       x_outer * 4 + 1,
                       x_outer * 4 + 2,
                       x_outer * 4 + 3};
        int val[] = {x_vec[0] + y,
                     x_vec[1] + y,
                     x_vec[2] + y,
                     x_vec[3] + y};
        printf("Evaluating at <%d, %d, %d, %d>, <%d, %d, %d, %d>:"
               " <%d, %d, %d, %d>\n",
               x_vec[0], x_vec[1], x_vec[2], x_vec[3],
               y, y, y, y,
               val[0], val[1], val[2], val[3]);
    }
}
gradient.unroll(x_inner);
Unrolls the loop over a given dimension.
Var x_outer, x_inner;
gradient.split(x, x_outer, x_inner, 2);
gradient.unroll(x_inner);
The two calls above can be replaced by gradient.unroll(x, 2);
The unrolled loop nest is:
for (int y = 0; y < 4; y++) {
    for (int x_outer = 0; x_outer < 2; x_outer++) {
        // Instead of a for loop over x_inner, we get two
        // copies of the innermost statement.
        {
            int x_inner = 0;
            int x = x_outer * 2 + x_inner;
            printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
        }
        {
            int x_inner = 1;
            int x = x_outer * 2 + x_inner;
            printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
        }
    }
}
Fusing, tiling, and parallelizing.
This section combines three operations: tiling, fusing the tile indices, and parallelizing over the fused index.
// First we'll tile, then we'll fuse the tile indices and
// parallelize across the combination.
Var x_outer, y_outer, x_inner, y_inner, tile_index;
gradient.tile(x, y, x_outer, y_outer, x_inner, y_inner, 4, 4);
gradient.fuse(x_outer, y_outer, tile_index);
gradient.parallel(tile_index);
Buffer<int> output = gradient.realize(8, 8);
The calls can also be chained into a single statement:
gradient
    .tile(x, y, x_outer, y_outer, x_inner, y_inner, 4, 4)
    .fuse(x_outer, y_outer, tile_index)
    .parallel(tile_index);
Buffer<int> output = gradient.realize(8, 8);
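The equivalent loop nest is:
// Still row-major inside each tile (x_inner is the innermost loop).
// In Halide the outermost tile_index loop runs in parallel; it is
// written here as an ordinary sequential loop.
for (int tile_index = 0; tile_index < 4; tile_index++) {
    int y_outer = tile_index / 2;
    int x_outer = tile_index % 2;
    for (int y_inner = 0; y_inner < 4; y_inner++) {
        for (int x_inner = 0; x_inner < 4; x_inner++) {
            int y = y_outer * 4 + y_inner;
            int x = x_outer * 4 + x_inner;
            printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
        }
    }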
}
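The tiled traversal can be generalized and checked in plain C++. The sketch below is sequential and does not use Halide. `tiled_order` is a hypothetical helper for illustration, and it assumes the tile sizes divide the image extents evenly, as they do for the 8x8 image with 4x4 tiles above.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Returns the (x, y) points of a width x height domain in the order a
// tiled-and-fused loop nest visits them: one fused loop over tiles, then
// rows within the tile, then columns within the row.
std::vector<std::pair<int, int>> tiled_order(int width, int height,
                                             int tile_w, int tile_h) {
    assert(tile_w > 0 && tile_h > 0);
    assert(width % tile_w == 0 && height % tile_h == 0);
    int tiles_x = width / tile_w;
    int tiles_y = height / tile_h;
    std::vector<std::pair<int, int>> order;
    for (int tile_index = 0; tile_index < tiles_x * tiles_y; tile_index++) {
        // Undo the fuse: x_outer is the inner of the two fused variables.
        int y_outer = tile_index / tiles_x;
        int x_outer = tile_index % tiles_x;
        for (int y_inner = 0; y_inner < tile_h; y_inner++) {
            for (int x_inner = 0; x_inner < tile_w; x_inner++) {
                order.emplace_back(x_outer * tile_w + x_inner,
                                   y_outer * tile_h + y_inner);
            }
        }
    }
    return order;
}
```

Calling `tiled_order(8, 8, 4, 4)` yields 64 points. The first 16 cover the top-left tile, the next 16 start at (4, 0), and the third tile starts at (0, 4). Since each iteration of the tile_index loop touches a disjoint set of points, the iterations are independent, which is what makes that loop a natural candidate for parallel execution.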
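Open questions
- When vectorizing, what happens if the vector width does not match the SIMD register width, for example a width of 3, 5, or 7?
- How does each of these scheduling operations affect performance?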