MapReduce WordCount Program Explained, with a Summary of Common Errors

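I am 坚定香水, a blogger at 靠谱客 (kaopuke). This article walks through the MapReduce WordCount program and summarizes common errors; I am sharing it here as a reference.

Preface:

In earlier notes we connected Eclipse to Hadoop and took a brief look at how to use FileSystem.

Next comes the core of Hadoop: developing MapReduce programs. WordCount is the entry-level program of MapReduce (abbreviated as MR below) development, and practically everyone who learns MapReduce is expected to know it. I will not repeat the concepts behind WordCount here, since plenty of articles online already cover them.

This post mainly records the problems I ran into, and how I solved them, while writing the WordCount program in Eclipse on Windows.

Preparation:

* Set up Eclipse on Windows (a plugin is needed to connect it to Hadoop; for details see my other article https://blog.csdn.net/qq_26323323/article/details/82936098 )

* Create a Maven project named hadoop, and copy the Hadoop configuration files core-site.xml/mapred-site.xml/hdfs-site.xml/yarn-site.xml from the Linux environment into hadoop/src/main/resources (the MR program needs to load the settings in these files)

* Create a log4j.properties file in hadoop/src/main/resources with the following content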

log4j.rootLogger=DEBUG, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n

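Note: this file is needed because an MR program started from Eclipse produces no logs by default. Once the log4j configuration is loaded with the root logger at DEBUG level, every step the program takes shows up in the log, which makes problems much easier to locate.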

* Start Hadoop as user hxw (the author)

$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh

1. Writing the WordCount program

The full code is below; I will not go through it in detail.

package hadoop.mr;

import java.io.IOException;
import java.net.URI;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        /**
         * Mapper: splits each input line into words and emits (word, 1).
         */
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {

            // 1. Tokenize the line; StringTokenizer splits on whitespace by default
            StringTokenizer token = new StringTokenizer(value.toString());
            while (token.hasMoreTokens()) {
                // 2. Store the current token in the reusable Text object
                word.set(token.nextToken());

                // 3. Emit a key-value pair such as (hadoop, 1)
                context.write(word, ONE);
            }
        }
    }

    public static class WCReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        /**
         * Reducer: sums the counts emitted by the mappers for each word.
         */
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {

            // 1. Add up all counts for this key
            int sum = 0;
            for (IntWritable intWritable : values) {
                sum += intWritable.get();
            }

            result.set(sum);
            // 2. Emit (word, total)
            context.write(key, result);
        }
    }

    private static final String INPUT_PATH = "/user/hadoop/mapreduce/input/";
    private static final String OUTPUT_PATH = "/user/hadoop/mapreduce/output/";
    // Must match fs.defaultFS in core-site.xml
    private static final String HDFS_URI = "hdfs://hadoop:9000";

    public static void main(String[] args) {

        try {
            // 1. Delete the output path first if it already exists
            deleteOutputFile(OUTPUT_PATH);

            // 2. Create the job and set its basic properties
            Job job = Job.getInstance();
            job.setJarByClass(WordCount.class);
            job.setJobName("wordcount");

            // 3. Set the Mapper and Reducer, and the output key/value types
            job.setMapperClass(WCMapper.class);
            job.setReducerClass(WCReduce.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // 4. Set the input and output paths
            FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
            FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));

            // 5. Run the job and exit once it completes
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (Exception e) {
            // Covers IOException as well as the other checked exceptions thrown above
            e.printStackTrace();
        }
    }

    static void deleteOutputFile(String path) throws Exception {
        Configuration conf = new Configuration();
        // Connect to HDFS as user hxw, the user that started Hadoop
        FileSystem fs = FileSystem.get(new URI(HDFS_URI), conf, "hxw");
        Path target = new Path(path);
        if (fs.exists(target)) {
            fs.delete(target, true);
        }
    }
}
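
For readers who want to see what the job computes before running it on a cluster, here is a minimal plain-Java sketch of the same data flow. It is not part of the original program and uses no Hadoop classes; the class name LocalWordCount and the sample lines are made up for illustration. It tokenizes each line the way WCMapper does and sums the counts per word the way WCReduce does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java sketch of the WordCount data flow; no Hadoop classes are involved.
class LocalWordCount {

    // Mimics WCMapper plus WCReduce: tokenize each line, then sum the 1s per word.
    static Map<String, Integer> count(List<String> lines) {
        // TreeMap keeps the words sorted, which makes the printed result stable.
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                // Equivalent to emitting (word, 1) and adding it up in the reducer.
                totals.merge(tokens.nextToken(), 1, Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical input lines, standing in for the files under the input path.
        List<String> lines = Arrays.asList("hello hadoop", "hello mapreduce hadoop");
        System.out.println(count(lines)); // prints {hadoop=2, hello=2, mapreduce=1}
    }
}
```

On a real cluster the same result is spread over the part-r-* files in the output directory, one word and its count per line.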

2. Summary of problems encountered

1) No permission to access HDFS

The error typically looks like this:

org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="":nutch:supergroup:rwxr-xr-x

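Cause: HDFS files and directories have owners and permissions. If the current user lacks the required permission on a file or directory, any operation on it fails.

Solution:

* Use the hdfs dfs -chmod command to change the permissions of the relevant file or directory;

* In a test environment, if you would rather not adjust permissions one by one, you can disable HDFS permission checking altogether by adding the following to hdfs-site.xml (in Hadoop 2.x and later the property is named dfs.permissions.enabled; dfs.permissions is the older, deprecated name)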

	<property>
		<name>dfs.permissions</name>
		<value>false</value>
	</property>
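
To make the error line in this section easier to read, the following is a simplified sketch of a POSIX-style permission check. It is only an illustration: PermissionSketch and isAllowed are hypothetical names, not HDFS code, and the real NameNode check also handles superusers, ACLs and parent directories. The inputs come straight from the log line: user=root, owner nutch, mode rwxr-xr-x, access=WRITE.

```java
// Simplified sketch of a POSIX-style permission check, loosely modeled on the HDFS error above.
class PermissionSketch {

    // mode is a 9-character string such as "rwxr-xr-x"; access is 'r', 'w' or 'x'.
    static boolean isAllowed(String user, boolean userInGroup, String owner, String mode, char access) {
        int offset;
        if (user.equals(owner)) {
            offset = 0; // owner bits
        } else if (userInGroup) {
            offset = 3; // group bits
        } else {
            offset = 6; // other bits
        }
        int index = offset + "rwx".indexOf(access);
        return mode.charAt(index) == access;
    }

    public static void main(String[] args) {
        // root is neither the owner (nutch) nor, by assumption, in supergroup: falls to "r-x".
        System.out.println(isAllowed("root", false, "nutch", "rwxr-xr-x", 'w'));  // prints false
        // The owner has "rwx", so the same write succeeds.
        System.out.println(isAllowed("nutch", false, "nutch", "rwxr-xr-x", 'w')); // prints true
    }
}
```

This is also why the WordCount code passes the user name "hxw" to FileSystem.get: the request is then evaluated as the user who owns the directories rather than as the local Windows user.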

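2) No errors and no logs at runtime, and no job progress in the http://hadoop:8088 UI

One of the more frustrating things I hit was that the program ran without reporting any error, yet no job appeared in the scheduler UI.

Oddly, the same project ran fine when packaged as a jar and executed on Linux.

Only after adding the log4j.properties file with the root logger at DEBUG level (content shown above) did the underlying errors become visible.

So always add a logging configuration file when running the project.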

3) Error when connecting to the ResourceManager

DEBUG [org.apache.hadoop.ipc.Client] - closing ipc connection to 0.0.0.0/0.0.0.0:8032: Connection refused: no further information
java.net.ConnectException: Connection refused: no further information

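Cause: the message shows that creating the connection to 0.0.0.0:8032 failed. Why? Presumably because 0.0.0.0 is not an address the client can actually reach. We never configured this IP and port ourselves, so it must be a default. Looking through the default configuration files on the Hadoop website (core-default.xml, yarn-default.xml and so on), yarn-default.xml contains the following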

yarn.resourcemanager.hostname	0.0.0.0	The hostname of the RM.
yarn.resourcemanager.address	${yarn.resourcemanager.hostname}:8032

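So this IP:port is the ResourceManager address, and it is the connection to the ResourceManager that fails.

The ResourceManager allocates the cluster's resources, and every NodeManager has to communicate with it. yarn.resourcemanager.hostname defaults to 0.0.0.0; changing it to hadoop, the hostname that maps to our Hadoop machine, fixes the problem.

Solution: add the following to yarn-site.xml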

<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop</value>
</property>
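
The way yarn.resourcemanager.address is derived from yarn.resourcemanager.hostname can be sketched in a few lines of plain Java. This is a single-pass illustration of ${...} substitution under my own assumptions, not Hadoop's Configuration class; ConfigExpansionSketch and resolve are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of ${name} substitution in configuration values; not Hadoop's own implementation.
class ConfigExpansionSketch {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces each ${name} in the value of key with the value of name (one pass, no nesting).
    static String resolve(Map<String, String> props, String key) {
        Matcher m = VAR.matcher(props.get(key));
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Unknown variables are left untouched.
            String replacement = props.getOrDefault(m.group(1), m.group(0));
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        // Defaults as listed in yarn-default.xml.
        props.put("yarn.resourcemanager.hostname", "0.0.0.0");
        props.put("yarn.resourcemanager.address", "${yarn.resourcemanager.hostname}:8032");
        System.out.println(resolve(props, "yarn.resourcemanager.address")); // prints 0.0.0.0:8032

        // Overriding only the hostname, as in the yarn-site.xml fix, changes the derived address.
        props.put("yarn.resourcemanager.hostname", "hadoop");
        System.out.println(resolve(props, "yarn.resourcemanager.address")); // prints hadoop:8032
    }
}
```

Because the address is derived from the hostname, overriding the single hostname property is enough; the port 8032 does not need to be repeated.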

4）no job control

The error message is as follows:

Exception message: /bin/bash: line 0: fg: no job control
Stack trace: ExitCodeException exitCode=1: /bin/bash: line 0: fg: no job control

	at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
	at org.apache.hadoop.util.Shell.run(Shell.java:456)
    ...

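Cause: we develop and submit the MR job from Windows while Hadoop is deployed on Linux, and this error is raised for such cross-platform job submissions.

Solution: add the following to mapred-site.xml

This setting allows cross-platform job submission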

	<property>
		<name>mapreduce.app-submission.cross-platform</name>
		<value>true</value>
	</property>
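
As far as I understand it, the "fg: no job control" message appears because a client on Windows writes environment variable references in Windows syntax (such as %JAVA_HOME%) into the container launch command, and bash on the Linux node reads a leading % as a job specifier. With cross-platform submission enabled, a neutral placeholder is written instead and expanded on the node that runs the task. The sketch below only illustrates that idea; EnvReferenceSketch and envRef are hypothetical names, not Hadoop classes.

```java
// Illustrative sketch only: this helper is hypothetical and is not a Hadoop class.
class EnvReferenceSketch {

    // How a reference to an environment variable might be written into the launch command.
    static String envRef(String name, boolean clientIsWindows, boolean crossPlatform) {
        if (crossPlatform) {
            // Neutral placeholder, expanded later on the node that runs the container.
            return "{{" + name + "}}";
        }
        // Without cross-platform mode the client's own shell syntax leaks into the command.
        return clientIsWindows ? "%" + name + "%" : "$" + name;
    }

    public static void main(String[] args) {
        // Windows client, cross-platform disabled: bash cannot make sense of this.
        System.out.println(envRef("JAVA_HOME", true, false) + "/bin/java"); // prints %JAVA_HOME%/bin/java
        // Windows client, cross-platform enabled: the placeholder is resolved on Linux.
        System.out.println(envRef("JAVA_HOME", true, true) + "/bin/java");  // prints {{JAVA_HOME}}/bin/java
    }
}
```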

5）ClassNotFoundException

Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class hadoop.mr.WordCount$WCMapper not found
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
	at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:186)
...

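Cause: this one is harder to explain and requires some understanding of how Hadoop runs jobs. In short, when the program is launched from Eclipse the classes are not packaged in a jar, so setJarByClass finds no jar to ship to the cluster and the task nodes cannot load the Mapper class. For details see this article: https://blog.csdn.net/qq_19648191/article/details/56684268

Solution: add the following to core-site.xml (mapred.jar is the older name of this property; newer Hadoop versions call it mapreduce.job.jar):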

<property>
    <name>mapred.jar</name>
    <value>C:/Users/lucky/Desktop/wc.jar</value>
</property>

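Then, before each run of the WordCount job, export the current project as a jar named wc.jar and place it at the configured location. With that in place the error no longer appears.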
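
Since this fix depends on the exported jar being up to date and in the right place, a small pre-flight check can save a failed submission. The sketch below is an optional addition, not part of the original workflow: JarCheckSketch and containsClass are hypothetical names, and it uses only the JDK's java.util.jar API to confirm that the jar actually contains the Mapper class named in the stack trace.

```java
import java.io.File;
import java.io.IOException;
import java.util.jar.JarFile;

// Pre-flight sketch: verify that the exported jar really contains a given class.
class JarCheckSketch {

    // Returns true if the jar holds the .class entry for the given binary class name.
    static boolean containsClass(File jar, String className) throws IOException {
        String entry = className.replace('.', '/') + ".class";
        try (JarFile jarFile = new JarFile(jar)) {
            return jarFile.getJarEntry(entry) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the path configured in mapred.jar above; pass another path as the first argument.
        File jar = new File(args.length > 0 ? args[0] : "C:/Users/lucky/Desktop/wc.jar");
        if (!jar.isFile()) {
            System.out.println("jar not found: " + jar);
            return;
        }
        System.out.println(containsClass(jar, "hadoop.mr.WordCount$WCMapper"));
    }
}
```

If the check prints false, the jar was most likely exported before the class was compiled or from the wrong project, and the ClassNotFoundException would come back.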

Published: 2024-04-28 20:20:01
Link to this article: https://www.kaopuke.com/article/k-p-k_13_u_23_okf5_12__23_c3.html
