Hash Trick

174 阅读 0 评论 115 点赞

我是靠谱客的博主眼睛大月饼，这篇文章主要介绍Hash Trick，现在分享给大家，希望可以做个参考。

1.介绍

Hashing trick，有时候也叫做feature hashing，在自然语音中已经用作降维的手段。在一般的机器学习任务中，它也可以对categorical feature进行降维。

举个例子，比如你是淘宝的算法工程师，你要做一个退货的预测模型，假设有一个feature是location_id，表示商品的产地。这个是categorical
feature，所以你通常需要做one-hot encoding，把这一列转化为dummy
variable。商品来自全国各市、全球各国，可能这个location_id就有成千上万个数值。转码之后，模型就会增加这一万个dummy变量。这对数据的读取、操作，模型的训练都是极大的挑战。

Hashing trick就是用hashing
function这个小技巧来降维。若location_id都是整数，我们可以对所有的location_id取余，location_id
(mod p)，这个取余函数就是我们使用的hashing
function。很显然进行取余操作之后，我们最多只有p个不同的数值了。在此之上再用one-hot encoding，我们只增加了p列。

location_id location_id (mod 5)

21 1

9126 1

45 0

10 0

1189 4

Hashing trick有三个主要的优点

1.降维程度大

2.计算快速、方便

不要额外的存储空间（额外的参考词典等）

但是，也有些缺点。比如我们观察到上面产地编号9126和21除以5的余数都是1，它们就被放到了一起。在Hashing
trick中，这种冲突和合并是无法避免的。但是根据一些论文和大量业界应用的结果，这种冲突合并对预测模型的表现的影响微乎其微。另一个缺点，因为大量的数值并合并，这使得模型和结果不易interpret。

2.实现

# B. Apply hash trick of the original csv row
# for simplicity, we treat both integer and categorical features as categorical
# INPUT:
#     csv_row: a csv dictionary, ex: {'Lable': '1', 'I1': '357', 'I2': '', ...}
#     D: the max index that we can hash to
# OUTPUT:
#     x: a list of indices that its value is 1
def get_x(csv_row, D):
    x = [0]  # 0 is the index of the bias term
    for key, value in csv_row.items():
        index = int(value + key[1:], 16) % D  # weakest hash ever ;)
        x.append(index)
    return x  # x contains indices of features that have a value of 1

2.1 hash numpy array

def hashArray(arr, modNum = 1000):
    return [hash(x) % modNum for x in arr]
hashArray(np.array(['3', '6']))

output:

[461, 716]

参考：
1.github 参考；
2.hash trick 解释

最后

以上就是眼睛大月饼最近收集整理的关于Hash Trick的全部内容，更多相关Hash内容请搜索靠谱客的其他文章。

本图文内容来源于网友提供，作为学习参考使用，或来自网络收集整理，版权属于原作者所有。

本文分类：Other
浏览次数：174 次浏览
发布日期：2024-08-09 08:25:01
本文链接：https://www.kaopuke.com/article/k-p-k_13_u_7_o_18_f4_12__23_gx.html

Hash Trick

1.介绍

2.实现

2.1 hash numpy array

最后

评论列表共有 0 条评论

发表评论取消回复

Hash Trick

1.介绍

2.实现

2.1 hash numpy array

最后

相关文章

评论列表共有 0 条评论

发表评论 取消回复

发表评论取消回复