【Spark】TF-IDF

Overview

TF-IDF

Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus. Denote a term by $t$, a document by $d$, and the corpus by $D$. Term frequency $TF(t, d)$ is the number of times that term $t$ appears in document $d$, while document frequency $DF(t, D)$ is the number of documents that contain term $t$. If we only use term frequency to measure the importance, it is very easy to over-emphasize terms that appear very often but carry little information about the document, e.g., “a”, “the”, and “of”. If a term appears very often across the corpus, it means it doesn’t carry special information about a particular document. Inverse document frequency is a numerical measure of how much information a term provides:

【Thought】 “http” appears very frequently in url1, but since “http” appears frequently across all URLs, that high frequency does not make it important to url1.

$$IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1},$$

where $|D|$ is the total number of documents in the corpus. Since logarithm is used, if a term appears in all documents, its IDF value becomes 0. Note that a smoothing term is applied to avoid dividing by zero for terms outside the corpus. The TF-IDF measure is simply the product of TF and IDF:

$$TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D).$$
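
As a quick worked example with made-up numbers: suppose the corpus has $|D| = 3$ documents and term $t$ appears in exactly one of them, 2 times in document $d$. Taking $\log$ to be the natural logarithm (which is what spark.mllib's implementation uses), we get

$$IDF(t, D) = \log \frac{3 + 1}{1 + 1} = \log 2 \approx 0.693, \qquad TFIDF(t, d, D) = 2 \cdot 0.693 \approx 1.386.$$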

There are several variants on the definition of term frequency and document frequency. In spark.mllib, we separate TF and IDF to make them flexible.

Our implementation of term frequency utilizes the hashing trick. A raw feature is mapped into an index (term) by applying a hash function. Then term frequencies are calculated based on the mapped indices. This approach avoids the need to compute a global term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash collisions, where different raw features may become the same term after hashing. To reduce the chance of collision, we can increase the target feature dimension, i.e., the number of buckets of the hash table. The default feature dimension is $2^{20} = 1,048,576$.

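For intuition, here is a minimal sketch of the hashing trick in plain Scala. The helper name and the use of String.hashCode are illustrative assumptions; Spark's actual implementation differs in its hash function and in how it builds the sparse vector.

// Illustrative sketch only: hash each term into one of numFeatures buckets
// and accumulate counts per bucket, so no global term-to-index map is needed.
def hashingTermFrequencies(doc: Seq[String], numFeatures: Int = 1 << 20): Map[Int, Double] = {
  def bucket(term: String): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw // keep the bucket index non-negative
  }
  doc.groupBy(bucket).map { case (index, terms) => (index, terms.size.toDouble) }
}

// Collisions map different terms to the same bucket; a larger numFeatures makes them rarer.
hashingTermFrequencies(Seq("spark", "tf", "spark")) // almost surely two buckets, with counts 2.0 and 1.0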

Note: spark.mllib doesn’t provide tools for text segmentation (the process of dividing written text into meaningful units, such as words, sentences, or topics). We refer users to the Stanford NLP Group and scalanlp/chalk.

TF and IDF are implemented in HashingTF and IDF. HashingTF takes an RDD[Iterable[_]] as the input. Each record could be an iterable of strings or other types.

Refer to the HashingTF Scala docs for details on the API.

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

val sc: SparkContext = ...
// Load documents (one per line).
val documents: RDD[Seq[String]] = sc.textFile("...").map(_.split(" ").toSeq)
// Hash each document into a sparse term-frequency vector of the default 2^20 dimensions.
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
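
HashingTF also has a local transform overload for a single in-memory document, which is handy for a quick sanity check (the tokens below are made up):

// Hash one document locally; the bucket for "spark" receives a count of 2.0.
val localTF: Vector = hashingTF.transform(Seq("spark", "apache", "spark"))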

While applying HashingTF only needs a single pass over the data, applying IDF needs two passes: the first to compute the IDF vector and the second to scale the term frequencies by IDF.

import org.apache.spark.mllib.feature.IDF

// ... continue from the previous example
tf.cache() // cache the TF vectors; IDF makes two passes over them
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)

spark.mllib’s IDF implementation provides an option for ignoring terms which occur in fewer than a minimum number of documents. In such cases, the IDF for these terms is set to 0. This feature can be used by passing the minDocFreq value to the IDF constructor.

import org.apache.spark.mllib.feature.IDF

// ... continue from the previous example
tf.cache()
// Terms appearing in fewer than minDocFreq (= 2) documents receive an IDF of 0.
val idf = new IDF(minDocFreq = 2).fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)

