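Overview
I. Introduction
1. TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and text mining.
2. TF-IDF is a statistical method for evaluating how important a word is to one document in a document collection or corpus.
3. A word's importance increases with the number of times it appears in the document, but decreases as its frequency across the corpus rises.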
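II. Term Frequency (TF)
Term frequency is the number of times a given word appears in a given document. The count is usually normalized so that it is not biased toward long documents (the same word is likely to have a higher raw count in a long document than in a short one, regardless of whether the word is important).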
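Formula:
$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
n_{i,j}: the number of times term t_i appears in document d_j. The denominator is the sum of the occurrence counts of all terms in document d_j.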
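The normalization described above can be sketched in plain Scala without Spark. This is only an illustration: the object name `TermFrequencySketch`, the function `termFrequency`, and the sample tokens are hypothetical and do not appear in the original listing.

```scala
object TermFrequencySketch {
  /** Normalized term frequency: occurrences of each term divided by the total token count. */
  def termFrequency(tokens: Seq[String]): Map[String, Double] = {
    val total = tokens.size.toDouble
    // Group identical tokens, then divide each group's size by the document length.
    tokens.groupBy(identity).map { case (term, occurrences) =>
      term -> occurrences.size / total
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical four-token document.
    val doc = Seq("spark", "is", "fast", "spark")
    termFrequency(doc).foreach { case (term, tf) => println(s"$term -> $tf") }
  }
}
```

Because every count is divided by the same total, the frequencies of one document always sum to 1, which is what makes long and short documents comparable.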
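III. Inverse Document Frequency (IDF)
IDF is a measure of how much general importance a word carries. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents that contain the word, and then taking the logarithm of the quotient.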
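Formula:
$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
|D|: the total number of documents in the corpus
|\{j : t_i \in d_j\}|: the number of documents that contain term t_i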
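A minimal plain-Scala sketch of this definition follows. The text does not say which logarithm base is used, so the natural logarithm here is an assumption; the object name `InverseDocumentFrequencySketch`, the function name, and the three tiny documents are likewise hypothetical.

```scala
object InverseDocumentFrequencySketch {
  /** idf = log(|D| / number of documents containing the term), natural log (assumed base). */
  def inverseDocumentFrequency(docs: Seq[Seq[String]]): Map[String, Double] = {
    val numDocs = docs.size.toDouble
    // Count each term at most once per document to obtain its document frequency.
    val docFreq = docs.flatMap(_.distinct).groupBy(identity).map { case (term, hits) =>
      term -> hits.size
    }
    docFreq.map { case (term, df) => term -> math.log(numDocs / df) }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical corpus of three documents.
    val corpus = Seq(
      Seq("spark", "sql"),
      Seq("spark", "hadoop"),
      Seq("spark", "sql", "yarn")
    )
    inverseDocumentFrequency(corpus).foreach { case (term, idf) => println(s"$term -> $idf") }
  }
}
```

A term that occurs in every document gets log(1) = 0, so it contributes nothing to the final weight, while a term found in a single document gets the largest value.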
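IV. TF-IDF
Formula: TF-IDF = TF * IDF
Characteristics: A high term frequency within a particular document, combined with a low document frequency of that term across the whole corpus, yields a high TF-IDF weight. TF-IDF therefore tends to filter out common words and keep the important ones.
Idea: If a word or phrase appears with high frequency (TF) in one article and rarely appears in other articles, it is considered to have good power to discriminate between categories and is well suited for classification.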
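To make the filtering effect concrete, here is a small plain-Scala sketch that multiplies TF by IDF for every document of a toy corpus. It assumes the natural logarithm for IDF; `TfIdfSketch`, its helper functions, and the sample documents are hypothetical names and data introduced only for illustration.

```scala
object TfIdfSketch {
  /** Normalized term frequency of one tokenized document. */
  def tf(tokens: Seq[String]): Map[String, Double] = {
    val total = tokens.size.toDouble
    tokens.groupBy(identity).map { case (term, occ) => term -> occ.size / total }
  }

  /** Inverse document frequency over the whole corpus (natural log, an assumption). */
  def idf(docs: Seq[Seq[String]]): Map[String, Double] = {
    val numDocs = docs.size.toDouble
    docs.flatMap(_.distinct).groupBy(identity).map { case (term, hits) =>
      term -> math.log(numDocs / hits.size)
    }
  }

  /** TF-IDF weights, one map per document, in corpus order. */
  def tfIdf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    val idfByTerm = idf(docs)
    docs.map(doc => tf(doc).map { case (term, f) => term -> f * idfByTerm(term) })
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical corpus: "is" occurs everywhere, "security" in one document only.
    val corpus = Seq(
      Seq("spark", "spark", "security", "is"),
      Seq("spark", "is", "fast"),
      Seq("hadoop", "is", "old")
    )
    tfIdf(corpus).head.toSeq.sortWith(_._2 > _._2).foreach(println)
  }
}
```

In the first document "spark" has the highest raw frequency, yet "security" ends up with the larger weight because it appears in no other document, and "is" drops to zero because it appears in all of them.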
V. Code Implementation
package big.data.analyse.tfidf

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

/**
  * Created by zhen on 2019/05/28.
  */
object TF_IDF {
  /**
    * Set the log level
    */
  Logger.getLogger("org").setLevel(Level.WARN)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("TF_IDF")
      .master("local[2]")
      .config("spark.sql.warehouse.dir", "file:///D://warehouse")
      .getOrCreate()
    val sc = spark.sparkContext

    /**
      * Compute TF
      */
    val tf = sc.textFile("src/big/data/analyse/tfidf/TF.txt")
      .map(row => row.replace(",", " ").replace(".", " ").replace("  ", " ")) // data cleaning: drop punctuation, collapse double spaces
      .flatMap(row => row.split(" ")) // split into words
      .map(row => (row, 1.0))
      .reduceByKey(_ + _) // occurrences of each word in the target document

    val tfSize = tf.map(row => row._2).sum() // total number of words

    val tfed = tf.map(row => (row._1, row._2 / tfSize)) // term frequency
    println("TF:")
    tfed.foreach(println)

    /**
      * Compute IDF
      */
    val idf_0 = tf.map(row => (row._1, 1.0)) // each distinct word of TF.txt counts once
    println("Loading IDF1 file data...")
    val idf_1 = sc.textFile("src/big/data/analyse/tfidf/IDF1.txt")
      .map(row => row.replace(",", " ").replace(".", " ").replace("  ", " "))
      .flatMap(row => row.split(" "))
      .map(row => (row, 1.0))
      .reduceByKey(_ + _)
      .map(row => (row._1, 1.0)) // keep only presence, not the count

    println("Loading IDF2 file data...")
    val idf_2 = sc.textFile("src/big/data/analyse/tfidf/IDF2.txt")
      .map(row => row.replace(",", " ").replace(".", " ").replace("  ", " "))
      .flatMap(row => row.split(" "))
      .map(row => (row, 1.0))
      .reduceByKey(_ + _)
      .map(row => (row._1, 1.0))

    /**
      * Merge the corpus data
      */
    val idf = idf_0.union(idf_1).union(idf_2)
      .reduceByKey(_ + _) // number of documents that contain each word
      // 3 is the number of documents in the corpus. Note: this is the raw ratio |D| / df,
      // without the logarithm used in the formula of Section III.
      .map(row => (row._1, 3 / row._2))
    println("IDF:")
    idf.foreach(println)

    /**
      * Join TF with IDF and compute TF-IDF
      */
    println("TF-IDF:")
    tfed.join(idf).map(row => (row._1, "%.4f".format(row._2._1 * row._2._2)))
      .foreach(println)

    spark.stop()
  }
}
VI. Results
TF:
(GraphX,0.011494252873563218) (are,0.011494252873563218) (learning,0.011494252873563218) (Python,0.011494252873563218) (provides,0.011494252873563218) (is,0.022988505747126436) (Please,0.011494252873563218) (higher-level,0.011494252873563218) (general,0.011494252873563218) (Security,0.034482758620689655) (R,0.011494252873563218) (fast,0.011494252873563218) (SQL,0.022988505747126436) (Apache,0.011494252873563218) (Java,0.011494252873563218) (data,0.011494252873563218) (attack,0.011494252873563218) (This,0.011494252873563218) (cluster,0.011494252873563218) (graph,0.011494252873563218) (execution,0.011494252873563218) (MLlib,0.011494252873563218) (Scala,0.011494252873563218) (computing,0.011494252873563218) (downloading,0.011494252873563218) (Streaming,0.011494252873563218) (supports,0.022988505747126436) (engine,0.011494252873563218) (set,0.011494252873563218) (running,0.011494252873563218) (Spark,0.08045977011494253) (you,0.011494252873563218) (Overview,0.011494252873563218) (general-purpose,0.011494252873563218) (rich,0.011494252873563218) (APIs,0.011494252873563218) (vulnerable,0.011494252873563218) (that,0.011494252873563218) (a,0.022988505747126436) (high-level,0.011494252873563218) (processing,0.022988505747126436) (OFF,0.011494252873563218) (before,0.011494252873563218) (including,0.011494252873563218) (could,0.011494252873563218) (optimized,0.011494252873563218) (in,0.022988505747126436) (to,0.011494252873563218) (see,0.011494252873563218) (graphs,0.011494252873563218) (of,0.011494252873563218) (also,0.011494252873563218) (by,0.022988505747126436) (structured,0.011494252873563218) (tools,0.011494252873563218) (It,0.022988505747126436) (for,0.034482758620689655) (mean,0.011494252873563218) (an,0.011494252873563218) (machine,0.011494252873563218) (and,0.06896551724137931) (system,0.011494252873563218) (default,0.022988505747126436)
Loading IDF1 file data...
Loading IDF2 file data...
IDF:
(running,1.5) (For,3.0) (visit,3.0) (The,3.0) (you,1.0) (website,1.5) (than,3.0) (7,3.0)
(PATH,3.0) (that,1.0) (was,1.5) (a,1.0) (main,3.0) (old,3.0) (high-level,1.5) (be,1.5) (quick,3.0) (processing,1.5) (could,1.5) (all,3.0) (augmenting,3.0) (optimized,1.5) (Downloads,3.0) (follow,3.0) (applications,3.0) (classpath,3.0) (structured,1.5) (like,1.5) (along,3.0) (support,3.0) (Spark’s,1.5) (If,3.0) (but,3.0) (and,1.0) (reference,3.0) (1,3.0) (g,3.0) (system,1.5) (your,3.0) (10,3.0) (It’s,3.0) (are,1.0) (learning,1.5) (download,1.5) (its,3.0) (After,3.0) (Building,3.0) (can,1.5) (Security,1.5) (have,3.0) (runs,3.0) (6,3.0) (build,3.0) (0,1.5) (SQL,1.0) (with,1.5) (locally,3.0) (projects,3.0) (their,3.0) (Get,3.0) (UNIX-like,3.0) (This,1.0) (,1.5) (first,3.0) (documentation,3.0) (Since,3.0) (still,3.0) (Downloading,3.0) (packaged,3.0) (better,3.0) (However,3.0) (switch,3.0) (hood,3.0) (Linux,3.0) (Streaming,1.5) (supports,1.5) (PyPI,3.0) ((2,3.0) (vulnerable,1.5) (RDD,3.0) (Dataset,3.0) (package,3.0) (this,3.0) (under,3.0) (Python,1.0) (provides,1.0) (API,1.5) (higher-level,1.5) (introduction,3.0) (Apache,1.5) (will,1.5) (Java,1.0) (2,1.5) (data,1.5) (as,3.0) (YARN,3.0) (installed,3.0) (pointing,3.0) (optimizations,3.0) (get,3.0) (cluster,1.5) (tutorial,3.0) (graph,1.5) (easy,3.0) (execution,1.5) (MLlib,1.5) (We,3.0) (you’d,3.0) (supported,3.0) (downloading,1.5) (shell,3.0) (handful,3.0) (1+,3.0) (Users,3.0) (engine,1.5) (version,1.5) (11,3.0) (set,1.5) (performance,3.0) (rich,1.5) (systems,3.0) (replaced,3.0) (Spark,1.0) (project,3.0) (Overview,1.5) (APIs,1.5) (Mac,3.0) (or,1.5) (popular,3.0) (Support,3.0) (richer,3.0) (downloads,3.0) (OFF,1.5) (future,3.0) (detailed,3.0) (GraphX,1.5) (removed,3.0) (4,3.0) (installation,3.0) (Please,1.5) (is,1.0) (guide,3.0) (recommend,3.0) (R,1.5) (general,1.5) (JAVA_HOME,3.0) (fast,1.5) (include,3.0) (need,3.0) (one,3.0) (attack,1.5) (how,3.0) (uses,3.0) (compatible,3.0) (information,3.0) (we,3.0) (interactive,3.0) (—,3.0) (using,1.5) (Note,1.5) (7+/3,3.0) (java,3.0) (pre-packaged,3.0) (Scala,1.0) (any,1.5) 
(computing,1.5) (variable,3.0) (users,3.0) (from,1.5) (has,3.0) (won’t,3.0) (through,3.0) (at,3.0) (more,3.0) (3,3.0) (versions,3.0) (of,1.0) (tools,1.5) (8+,3.0) (by,1.0) (mean,1.5) (RDDs,3.0) ((e,3.0) (It,1.5) (for,1.0) (To,3.0) (were,3.0) (both,3.0) (an,1.0) (12,3.0) (which,3.0) (machine,1.5) (libraries,3.0) (introduce,3.0) (environment,3.0) ((in,3.0) (programming,3.0) (See,3.0) (use,1.5) (default,1.5) (the,1.5) (write,3.0) (highly,3.0) (release,3.0) (Resilient,3.0) (interface,3.0) (strongly-typed,3.0) (about,3.0) (run,3.0) (general-purpose,1.5) (5,3.0) (Distributed,3.0) (on,3.0) (You,3.0) (source,3.0) (Scala),3.0) (show,3.0) (then,3.0) (before,1.0) (including,1.5) (to,1.0) (in,1.0) (client,3.0) (see,1.5) (HDFS,1.5) (graphs,1.5) (Hadoop’s,3.0) (also,1.5) (“Hadoop,3.0) (binary,3.0) (x),3.0) (free”,3.0) (Maven,3.0) (coordinates,3.0) (Windows,3.0) (deprecated,3.0) (install,3.0) ((RDD),3.0) (4+,3.0) (page,3.0) (OS),3.0) (Hadoop,1.5)
TF-IDF:
(you,0.0115) (that,0.0115) (a,0.0230) (high-level,0.0172) (processing,0.0345) (could,0.0172) (optimized,0.0172) (structured,0.0172) (and,0.0690) (system,0.0172) (are,0.0115) (learning,0.0172) (Security,0.0517) (SQL,0.0230) (This,0.0115) (Streaming,0.0172) (supports,0.0345) (vulnerable,0.0172) (Spark,0.0805) (Overview,0.0172) (APIs,0.0172) (OFF,0.0172) (of,0.0115) (tools,0.0172) (by,0.0230) (mean,0.0172) (It,0.0345) (for,0.0345) (an,0.0115) (machine,0.0172) (default,0.0345) (Python,0.0115) (provides,0.0115) (higher-level,0.0172) (Apache,0.0172) (GraphX,0.0172) (Please,0.0172) (is,0.0230) (R,0.0172) (general,0.0172) (fast,0.0172) (attack,0.0172) (Java,0.0115) (Scala,0.0115) (computing,0.0172) (data,0.0172) (cluster,0.0172) (graph,0.0172) (execution,0.0172) (MLlib,0.0172) (downloading,0.0172) (engine,0.0172) (set,0.0172) (rich,0.0172) (general-purpose,0.0172) (before,0.0115) (including,0.0172) (to,0.0115) (in,0.0230) (see,0.0172) (graphs,0.0172) (also,0.0172)
Process finished with exit code 0
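The IDF values printed above are the raw ratios 1.0, 1.5 and 3.0, because the listing computes 3 / df and does not take the logarithm described in Section III. The sketch below ranks a handful of terms both ways to show what that changes. The TF values (rounded to four decimals) and ratios are copied from the output above; the object name `ResultRankingSketch`, its functions, and the choice of the natural logarithm are assumptions made for this illustration.

```scala
object ResultRankingSketch {
  // (term, TF rounded to 4 decimals, |D| / df), copied from the printed TF and IDF output.
  val sample: Seq[(String, Double, Double)] = Seq(
    ("Spark", 0.0805, 1.0),
    ("and", 0.0690, 1.0),
    ("Security", 0.0345, 1.5),
    ("processing", 0.0230, 1.5),
    ("you", 0.0115, 1.0)
  )

  /** Weight as computed in the listing: TF times the raw ratio. */
  def rankByRatio: Seq[(String, Double)] =
    sample.map { case (term, tf, ratio) => term -> tf * ratio }.sortWith(_._2 > _._2)

  /** Weight with the logarithm from Section III (natural log, an assumption). */
  def rankByLog: Seq[(String, Double)] =
    sample.map { case (term, tf, ratio) => term -> tf * math.log(ratio) }.sortWith(_._2 > _._2)

  def main(args: Array[String]): Unit = {
    println("Raw ratio: " + rankByRatio.map(_._1).mkString(", "))
    println("With log:  " + rankByLog.map(_._1).mkString(", "))
  }
}
```

With the raw ratio, a word found in all three files keeps its full TF as its weight, so a common word such as "and" stays near the top. With the logarithm, every word that occurs in all three files gets weight zero, and terms such as "Security" and "processing" lead the ranking.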
Reposted from: https://www.cnblogs.com/yszd/p/10939583.html