Counting Word Occurrences in a File with MapReduce


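I'm 负责灰狼, a blogger on 靠谱客. This article, which I collected during recent development work, explains how to use MapReduce to count how many times each word occurs in a file. I found it useful and am sharing it here as a reference.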

Overview

Use MapReduce to count how many times each word occurs in a file.

Environment: Hadoop | Software: Eclipse

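Requirements:

1. Upload the file to be analyzed (at least 10,000 English words) to HDFS.

2. Use MapReduce to count how many times each word occurs in the file.

3. Download the results to the local machine.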
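Before uploading, it can help to confirm that the input file really meets the 10,000-word minimum. The following is a minimal sketch in plain Java, not part of the original walkthrough: the class name `InputSizeCheck` is hypothetical, the file path is passed as an argument, and a "word" is assumed to mean a whitespace-separated token, which matches how the WordCount mapper later splits lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.StringTokenizer;

// Hypothetical helper for illustration only.
class InputSizeCheck {

    // Counts whitespace-separated tokens in a UTF-8 text file.
    static long countWords(Path file) throws IOException {
        long total = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                total += new StringTokenizer(line).countTokens();
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1 || !Files.exists(Paths.get(args[0]))) {
            System.err.println("Usage: InputSizeCheck <existing text file>");
            return;
        }
        long words = countWords(Paths.get(args[0]));
        System.out.println(args[0] + " contains " + words + " words");
        if (words < 10000) {
            System.out.println("Below the 10000-word minimum required by the exercise");
        }
    }
}
```

Because the check uses the same tokenization rule as the job, the total it reports should equal the sum of all counts in the final result.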

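Procedure:

Counting word occurrences in a file with MapReduce

Summary of steps:

1. Create a project in Eclipse.

2. Import the required JAR files into the Eclipse project.

3. Write the HDFSFileExist class and run it.

4. Write the WordCount project and run it.

5. Start Hadoop.

6. Delete the input and output directories in Hadoop.

7. Recreate the input directory and move the files to be analyzed into it.

8. Run the command that performs the word count.

9. Print the word-count results.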
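For the last step, it helps to know what the result file looks like. With the default text output format, each line of a reducer output file (typically named `part-r-00000`) holds a word, a tab character, and its count. Below is a rough sketch of how the downloaded lines could be parsed and the most frequent words printed. The class `ResultParser` and the sample lines are made up for illustration and are not output from the original experiment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only.
class ResultParser {

    // Parses lines of the form "word<TAB>count" into a sorted map.
    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // skip blank or malformed lines
            }
            counts.put(line.substring(0, tab), Integer.parseInt(line.substring(tab + 1).trim()));
        }
        return counts;
    }

    // Returns up to n entries with the highest counts, largest first.
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        // Made-up sample lines in the word<TAB>count layout.
        List<String> sample = Arrays.asList("hadoop\t3", "hello\t5", "world\t2");
        for (Map.Entry<String, Integer> e : top(parse(sample), 2)) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

In practice the lines would come from the downloaded result file, for example via `Files.readAllLines`.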

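Detailed steps:

1. Open Eclipse and set the workspace, which is the directory where project files are saved. Keeping the default is recommended. Click OK to continue, as shown in the figure below:

2. Once Eclipse is open, create a Java project by clicking "File->New->Project->Java Project->Next", as shown in the figure below:

3. 1) In the next dialog, import the JAR files the project needs, as shown in the figure below:

2) In the Libraries tab, click "Add External JARs..." to add the JAR files, as shown in the figure below:

3) After adding the JAR files, click "Finish" at the bottom to complete the creation of the HDFSExample project, as shown in the figure below:

4. Right-click the new project and choose "New->Class" to create the "HDFSFileExist" source file, as shown in the figure below:

5. Enter the code from the figure into the new file, as shown in the figure below:

The code from the figure is reproduced here: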

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// The public class name must match the source file name (HDFSFileExist.java).
public class HDFSFileExist {
    public static void main(String[] args) {
        try {
            // Path to check; a relative name is resolved inside HDFS, not on the local disk.
            String fileName = "test";
            Configuration conf = new Configuration();
            // NameNode address; adjust it to match your own cluster.
            conf.set("fs.defaultFS", "hdfs://localhost:9000");
            conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
            FileSystem fs = FileSystem.get(conf);
            if (fs.exists(new Path(fileName))) {
                System.out.println("File exists");
            } else {
                System.out.println("File does not exist");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

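6. Start Hadoop, then run the file in Eclipse by clicking "Run As->Java Application" in the top toolbar, as shown in the figure below:

7. A dialog appears after clicking; click "OK", as shown in the figure below:

8. After clicking "OK", some warning messages appear, along with a message saying that the file does not exist. These can be ignored; continue with the next steps, as shown in the figure below:

9. Create a directory named myapp to hold the Hadoop application files, using the following commands: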
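The "does not exist" result is expected here. As far as I understand, HDFS resolves a relative path such as `test` against the current user's home directory, which by default is `/user/<username>`, and nothing has been uploaded there yet. The sketch below only illustrates that resolution rule with plain `java.net.URI`; it does not call Hadoop, and the user name `hadoop` in the URI is an assumption.

```java
import java.net.URI;

// Hypothetical illustration of relative-path resolution; not Hadoop API code.
class RelativePathSketch {

    // Resolves a relative name against a home directory URI (which must end with '/').
    static URI resolve(String homeDir, String name) {
        return URI.create(homeDir).resolve(name);
    }

    public static void main(String[] args) {
        // "hadoop" is an assumed user name; substitute your own.
        URI full = resolve("hdfs://localhost:9000/user/hadoop/", "test");
        System.out.println(full); // prints hdfs://localhost:9000/user/hadoop/test
    }
}
```

So the program reports that the file exists only once a file or directory named `test` is present under that home directory.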

cd /usr/local/hadoop
mkdir myapp

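10. Right-click the "HDFSExample" project and select "Export" from the menu that appears, as shown in the figure below:

11. In the dialog that opens, select "Runnable JAR file" and click "Next" to continue, as shown in the figure below:

12. In the next dialog, simply click "Finish", as shown in the figure below:

13. Run MapReduce through Eclipse

1) The screenshots below show the MapReduce setup steps. For the detailed procedure, see http://dblab.xmu.edu.cn/blog/hadoop-build-project-using-eclipse/

14. Create a new WordCount project and import the JAR files the same way as before. The project creation is shown in the figure below:

15. After the code is run, the result is as shown in the figure below:

The code in the figure is as follows: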

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Strip generic Hadoop options and keep only the job's own arguments.
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // Summing is associative and commutative, so the reducer also serves as the combiner.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Every argument except the last is an input path; the last is the output path.
        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    // Sums all counts received for one word and writes (word, total).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Emits (word, 1) for every whitespace-separated token in an input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }
}
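To see what this job computes without starting a cluster, the same map, combine, and reduce logic can be imitated in plain Java. The following is a simplified sketch, not the Hadoop implementation: the class `LocalWordCount` is hypothetical, each input line stands in for one input split, and an in-memory `TreeMap` replaces the shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the WordCount job; for illustration only.
class LocalWordCount {

    // Map phase: split a line on whitespace, as TokenizerMapper does.
    static List<String> map(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            tokens.add(itr.nextToken());
        }
        return tokens;
    }

    // Reduce phase: add partial counts into the running totals, as IntSumReducer does.
    static void reduce(Map<String, Integer> totals, Map<String, Integer> partial) {
        for (Map.Entry<String, Integer> e : partial.entrySet()) {
            totals.merge(e.getKey(), e.getValue(), Integer::sum);
        }
    }

    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            // Combine phase: pre-aggregate the counts of one "split" before merging.
            Map<String, Integer> partial = new TreeMap<>();
            for (String word : map(line)) {
                partial.merge(word, 1, Integer::sum);
            }
            reduce(totals, partial);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Hello world hello", "Hello, Hadoop world");
        System.out.println(count(lines));
    }
}
```

Note that "Hello", "Hello," and "hello" end up as three separate keys, because `StringTokenizer` splits only on whitespace and neither strips punctuation nor changes case. If the exercise needs case-insensitive counts, the tokens would have to be normalized in the mapper first.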

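16. When the run finishes, two directories named input and output appear in the panel on the left (to run the program again, delete output first), as shown in the figure below: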
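The reason output has to be deleted is that the job refuses to start when its output directory already exists. On HDFS this is usually done with `hdfs dfs -rm -r output`. If the job was run in local mode and the directory sits on the local disk, a small helper like the sketch below could do the same. The class `OutputCleaner` is hypothetical, and the directory must be passed explicitly so that nothing is deleted by accident.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for a local-mode run; for illustration only.
class OutputCleaner {

    // Recursively deletes dir if it exists; returns true if something was deleted.
    static boolean deleteIfPresent(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return false;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: OutputCleaner <output directory>");
            return;
        }
        Path output = Paths.get(args[0]);
        System.out.println(deleteIfPresent(output)
                ? "Deleted " + output
                : "Nothing to delete at " + output);
    }
}
```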