GCC对乘法的优化

215 阅读 0 评论 142 点赞

我是靠谱客的博主轻松猫咪，这篇文章主要介绍GCC对乘法的优化，现在分享给大家，希望可以做个参考。

这个标题有点大，我并没有看过GCC的源码，所以也不知道很准确的GCC对乘法的优化。但是我做了几个测试，发现GCC的强大之处，与君共享。
突然想到做这个测试，是因为前段时间有个面试，出了一个很基础的问题，y = x * 63怎么优化。这种是学校的学生都会做的问题： y = (x << 6) - x; 可是这种方法到底能提升多少效率？有数据有真相。
我测试的平台是在Linux 2.6.32，GCC 4.6。
下面一段是测试的代码

#include <sys/time.h>
#include <ctime>
#include <cstdio>
#include <cstdlib>
const int USEC_PER_SEC = 1000000 ;
int main(void)
{
const int count = 1000000 ;
const int factor = 63 ;
printf("test count is %dn", count);
srand(time(NULL));
const int cases = 4096 ;
int init[cases] ;
for (int i = 0 ; i < cases ; i++)
init[i] = rand() % 49999 ;
int tmp ;
struct timeval start, end ;
gettimeofday(&start, NULL) ;
tmp = 0;
for (int i = 0 ; i < count ; i++)
for (int j = 0 ; j < cases ; j++)
tmp += init[j] * factor ;
gettimeofday(&end, NULL) ;
end.tv_sec -= start.tv_sec ;
end.tv_usec -= start.tv_usec ;
if (end.tv_usec < 0)
{
end.tv_sec-- ;
end.tv_usec += USEC_PER_SEC ;
}
printf("result tmp is %dn", tmp) ;
printf("nomal mul: %d.%06dn", end.tv_sec, end.tv_usec) ;
gettimeofday(&start, NULL) ;
tmp = 0;
for (int i = 0 ; i < count ; i++)
{
for (int j = 0 ; j < cases; j++)
{
const int a = init[j] ;
tmp += (a << 6) - a ; // y = x * 63优化的代码
}
}
gettimeofday(&end, NULL) ;
end.tv_sec -= start.tv_sec ;
end.tv_usec -= start.tv_usec ;
if (end.tv_usec < 0)
{
end.tv_sec-- ;
end.tv_usec += USEC_PER_SEC ;
}
printf("result tmp is %dn", tmp) ;
printf("optimized mul: %d.%06dn", end.tv_sec, end.tv_usec) ;
return 0 ;
}

测试的数据很不理想，用 gcc -O0 编译，就是不优化，得出的结果是：

test count is 1000000
result tmp is -1332858560
nomal mul: 8.261260
result tmp is -1332858560
optimized mul: 8.483564

用gcc -O2优化后的结果是：

test count is 1000000
result tmp is -1897458560
nomal mul: 2.133206
result tmp is -1897458560
optimized mul: 2.133081

其实结果已经很明显了，即使是-O0的选项，gcc还是对乘法做了优化。看看-O0对y = x * 63和y = (x << 6) - x生成的汇编代码是怎么样的：

.L6:
.loc 1 29 0
movl
-32(%rbp), %eax
cltq
movl
-16480(%rbp,%rax,4), %edx
movl
%edx, %eax
sall
$6, %eax
; 这里其实是y = x * 63
subl
%edx, %eax
addl
%eax, -44(%rbp)
.loc 1 28 0
addl
$1, -32(%rbp)
.L5:
cmpl
$4095, -32(%rbp)
setle
%al
testb
%al, %al
jne .L6
; 此处略去几千字符
.LBB8:
.loc 1 48 0
movl
-24(%rbp), %eax
cltq
movl
-16480(%rbp,%rax,4), %eax
movl
%eax, -20(%rbp)
.loc 1 49 0
movl
-20(%rbp), %eax
sall
$6, %eax
; 这一段才是y = (x << 6) - x
subl
-20(%rbp), %eax
addl
%eax, -44(%rbp)

很遗憾，即使用-O0，gcc还是对乘法做了优化，可能这种优化太简单了吧。-O2 优化的汇编代码自然不用说，这段是类似的。
通过x * 63这种代码优化，突然想起来，我们经常做的 x * 1000能不能也用这种优化。于是也测试了一把，x * 1000改成加法操作：

y = (x << 10) - (x << 4) - (x << 3) ;

就是

y = x * 1024 - x * 16 - x * 8 ;

下面是不做优化编译的结果：

test count is 1000000
result tmp is -1742412800
nomal mul: 7.867746
result tmp is -1742412800
optimized mul: 10.117456

很明显，跟我期望的结果相反，乘法的CPU指令周期是多少不太清楚，但是看测试结果，已经不能用这种方法优化了。
再看看O2优化的结果：

test count is 1000000
result tmp is 1581241344
nomal mul: 2.138208
result tmp is 1581241344
optimized mul: 2.129256

为什么会这么快？
看看汇编代码吧。

.L4:
movl
(%rax), %esi
addq
$4, %rax
imull
$1000, %esi, %edx ; 直接用乘法指令
addl
%edx, %r14d
cmpq
%rbx, %rax
; 优化后的代码
.L8:
movl
(%rax), %edx
addq
$4, %rax
imull
$1000, %edx, %edx
; 优化后直接改成了乘以1000，好牛X
addl
%edx, %r13d
cmpq
%rax, %rbx