I'm 奋斗糖豆, a blogger on 靠谱客. This post shows how to use Spark's TF-IDF implementation to compute the importance of words; I hope it serves as a useful reference.

Computing Word Importance with Spark's TF-IDF Algorithm

This post is a brief walkthrough of how to use Spark's TF-IDF algorithm.

To compute the importance of each word, first split each sentence into words, then convert those words into numeric features.
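The two steps just described (tokenizing, then mapping words to numeric features) can be sketched in plain Python without Spark. This is only an illustration under stated assumptions: the helper names `tokenize` and `hashing_tf` are hypothetical, and `zlib.crc32` is a stand-in hash chosen for the sketch. Spark's `HashingTF` uses a different hash function (MurmurHash3), so the bucket indices produced here will not match Spark's.

```python
import zlib
from collections import Counter


def tokenize(sentence):
    # Lowercase and split on whitespace, similar in spirit to Spark's Tokenizer.
    return sentence.lower().split()


def hashing_tf(words, num_features=20):
    # Hashing trick: map each word to a bucket index in [0, num_features)
    # and count how many words land in each bucket.
    # ASSUMPTION: zlib.crc32 is a stand-in hash for this sketch only;
    # Spark's HashingTF uses MurmurHash3, so its indices will differ.
    counts = Counter(zlib.crc32(w.encode("utf-8")) % num_features for w in words)
    return dict(sorted(counts.items()))


words = tokenize("Hi I heard about Spark")
tf = hashing_tf(words)
print(words)
print(tf)
```

With only 20 buckets, different words can collide in the same bucket, which is why a term-frequency vector may have fewer non-zero entries than the sentence has distinct words.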

In [1]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

# sqlContext is assumed to be provided by the notebook session
sentenceData = sqlContext.createDataFrame([
    (0, "Hi I heard about Spark"),
    (0, "I wish Java could use case classes"),
    (1, "Logistic regression models are neat")
], ["label", "sentence"])

# Split each sentence into lowercase words
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)
wordsData.show(5, False)

# Hash each word into one of 20 buckets and count term frequencies
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
# alternatively, CountVectorizer can also be used to get term frequency vectors
featurizedData.select('rawFeatures', 'label').show(5, False)

# Fit the IDF model on the term-frequency vectors, then rescale them
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
rescaledData.select("features", "label").show(5, False)
+-----+-----------------------------------+------------------------------------------+
|label|sentence                           |words                                     |
+-----+-----------------------------------+------------------------------------------+
|0    |Hi I heard about Spark             |[hi, i, heard, about, spark]              |
|0    |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|
|1    |Logistic regression models are neat|[logistic, regression, models, are, neat] |
+-----+-----------------------------------+------------------------------------------+

+-----------------------------------------+-----+
|rawFeatures                              |label|
+-----------------------------------------+-----+
|(20,[5,6,9],[2.0,1.0,2.0])               |0    |
|(20,[3,5,12,14,18],[2.0,2.0,1.0,1.0,1.0])|0    |
|(20,[5,12,14,18],[1.0,2.0,1.0,1.0])      |1    |
+-----------------------------------------+-----+

+--------------------------------------------------------------------------------------------------------+-----+
|features                                                                                                |label|
+--------------------------------------------------------------------------------------------------------+-----+
|(20,[5,6,9],[0.0,0.6931471805599453,1.3862943611198906])                                                |0    |
|(20,[3,5,12,14,18],[1.3862943611198906,0.0,0.28768207245178085,0.28768207245178085,0.28768207245178085])|0    |
|(20,[5,12,14,18],[0.0,0.5753641449035617,0.28768207245178085,0.28768207245178085])                      |1    |
+--------------------------------------------------------------------------------------------------------+-----+
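To see where the numbers in the features column come from, the values can be recomputed by hand. The sketch below is plain Python, not Spark code. It assumes the smoothed formula given in the Spark MLlib documentation, idf = log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing a given bucket, and it copies the term-frequency vectors from the rawFeatures column above. The variable names are illustrative only.

```python
import math

# Term-frequency vectors copied from the rawFeatures output above,
# written as {bucket index: count}.
raw_features = [
    {5: 2.0, 6: 1.0, 9: 2.0},
    {3: 2.0, 5: 2.0, 12: 1.0, 14: 1.0, 18: 1.0},
    {5: 1.0, 12: 2.0, 14: 1.0, 18: 1.0},
]

num_docs = len(raw_features)

# Document frequency: in how many documents does each bucket appear?
doc_freq = {}
for vec in raw_features:
    for idx in vec:
        doc_freq[idx] = doc_freq.get(idx, 0) + 1

# ASSUMPTION: smoothed IDF as documented for Spark MLlib,
# idf = log((m + 1) / (df + 1)).
idf = {idx: math.log((num_docs + 1) / (df + 1)) for idx, df in doc_freq.items()}

# TF-IDF is the term frequency multiplied by the IDF weight of the bucket.
tfidf = [{idx: tf * idf[idx] for idx, tf in vec.items()} for vec in raw_features]

for vec in tfidf:
    print(vec)
```

Bucket 5 appears in all three documents, so its IDF is log(4/4) = 0 and it contributes nothing to any row. Buckets that appear in only one document (3, 6 and 9) get the largest weight, log(4/2).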
