Hadoop Multi-File Output: A Closer Look at MultipleOutputFormat and MultipleOutputs (Part 1)


Overview

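So far, every MapReduce job we have seen writes a single set of output files. In some situations, however, it is more convenient to produce several sets of files, or to split one dataset into several; for example, separating the entries of one log by business line and handing each group to the team that owns it.
Anyone who has used the old API will know that it provides org.apache.hadoop.mapred.lib.MultipleOutputFormat and org.apache.hadoop.mapred.lib.MultipleOutputs. The documentation describes MultipleOutputFormat as follows (the description of MultipleOutputs comes later):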

　　MultipleOutputFormat allowing to write the output data to different output files.

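MultipleOutputFormat writes similar records to the same dataset. Before writing each record, it calls generateFileNameForKeyValue to determine which file the record goes to. Typically we extend MultipleTextOutputFormat and override generateFileNameForKeyValue so that it returns the file name for each output key/value pair. The default implementation of generateFileNameForKeyValue is: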

 
    protected String generateFileNameForKeyValue(K key, V value, String name) {
        return name;
    }
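To make the per-record call easier to picture, here is a minimal, Hadoop-free sketch of the idea: ask for a file name first, then append the record to whatever is open under that name. It is a simplified model, not Hadoop's actual code; the class DispatchSketch, its write and files methods, and the use of StringBuilder as a stand-in for a record writer are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Simplified, Hadoop-free model of per-record file name dispatch (illustrative only). */
class DispatchSketch {
    // Stand-in for the open record writers, keyed by output file name.
    private final Map<String, StringBuilder> files = new TreeMap<>();

    /** Mirrors the default shown above: every record keeps the task's own file name. */
    protected String generateFileNameForKeyValue(String key, String value, String name) {
        return name;
    }

    /** Asks for a file name first, then appends the record to that file's buffer. */
    public void write(String key, String value, String name) {
        String fileName = generateFileNameForKeyValue(key, value, name);
        files.computeIfAbsent(fileName, f -> new StringBuilder())
             .append(value).append('\n');
    }

    public Map<String, StringBuilder> files() {
        return files;
    }

    public static void main(String[] args) {
        DispatchSketch out = new DispatchSketch();
        out.write(null, "first record", "part-00000");
        out.write(null, "second record", "part-00000");
        System.out.println(out.files().keySet()); // [part-00000]
    }
}
```

With the default implementation every record lands in the same file, which is why a subclass has to override the method before anything is split.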

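The default implementation simply returns name unchanged. We can override this method in our own class to define our own output path, for example: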

 
    public static class PartitionFormat
            extends MultipleTextOutputFormat<NullWritable, Text> {
        @Override
        protected String generateFileNameForKeyValue(
                NullWritable key,
                Text value,
                String name) {
            String[] split = value.toString().split(",", -1);
            String country = split[4].substring(1, 3);
            return country + "/" + name;
        }
    }
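The file name logic can be tried without a cluster. The sketch below copies the same three statements into a plain static method; the class name CountryFileName and the method fileNameFor are hypothetical, and the sample record is one of the lines from the test output shown later in this article. Note that the original code does no validation, so a line with fewer than five fields, or a country field shorter than three characters, would throw an exception here as well.

```java
/** Hadoop-free sketch of the file name logic in PartitionFormat (illustrative only). */
class CountryFileName {
    static String fileNameFor(String line, String name) {
        String[] split = line.split(",", -1);
        // Field 4 is quoted in the data, e.g. "VN"; substring(1, 3) keeps the two letters.
        String country = split[4].substring(1, 3);
        return country + "/" + name;
    }

    public static void main(String[] args) {
        String record = "3430490,1969,3350,1965,\"VN\",\"\",597185,6,,73,4,43,,0,,,,,,,,,";
        System.out.println(fileNameFor(record, "part-00001")); // VN/part-00001
    }
}
```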

This way, records with the same country are written to the name file under the same directory. The complete example follows:

 
    package com.wyp;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

    import java.io.IOException;

    /**
     * User: http://www.iteblog.com/
     * Date: 13-11-26
     * Time: 10:02 AM
     */
    public class OutputTest {
        // Map-only job: pass every input line through unchanged, with no key.
        public static class MapClass extends MapReduceBase
                implements Mapper<LongWritable, Text, NullWritable, Text> {
            @Override
            public void map(LongWritable key, Text value,
                            OutputCollector<NullWritable, Text> output,
                            Reporter reporter) throws IOException {
                output.collect(NullWritable.get(), value);
            }
        }

        public static class PartitionFormat
                extends MultipleTextOutputFormat<NullWritable, Text> {
            // Same as above, so omitted here
        }

        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            // Parse generic Hadoop options before building the JobConf,
            // so that they end up in the job configuration.
            String[] remainingArgs =
                    new GenericOptionsParser(conf, args).getRemainingArgs();
            if (remainingArgs.length != 2) {
                System.err.println("Error!");
                System.exit(1);
            }
            JobConf job = new JobConf(conf, OutputTest.class);

            Path in = new Path(remainingArgs[0]);
            Path out = new Path(remainingArgs[1]);
            FileInputFormat.setInputPaths(job, in);
            FileOutputFormat.setOutputPath(job, out);

            job.setJobName("Output");
            job.setMapperClass(MapClass.class);
            job.setInputFormat(TextInputFormat.class);
            job.setOutputFormat(PartitionFormat.class);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            job.setNumReduceTasks(0);

            JobClient.runJob(job);
        }
    }

Package the program above into a jar file (how to package it is not covered here) and run it on Hadoop 2.2.0 (the test data can be downloaded here: http://pan.baidu.com/s/1td8xN):

 
    /home/q/hadoop-2.2.0/bin/hadoop jar \
        /export1/tmp/wyp/OutputText.jar com.wyp.OutputTest \
        /home/wyp/apat63_99.txt \
        /home/wyp/out

Once the program has finished, look in the /home/wyp/out directory to see the result:

 
   
    [wyp@l-datalog5.data.cn1 ~]$ /home/q/hadoop-2.2.0/bin/hadoop fs -ls /home/wyp/out
    ............................. many entries omitted here .............................
    drwxr-xr-x   - wyp  supergroup     0 2013-11-26 14:25 /home/wyp/out/VE
    drwxr-xr-x   - wyp  supergroup     0 2013-11-26 14:25 /home/wyp/out/VG
    drwxr-xr-x   - wyp  supergroup     0 2013-11-26 14:25 /home/wyp/out/VN
    drwxr-xr-x   - wyp  supergroup     0 2013-11-26 14:25 /home/wyp/out/VU
    drwxr-xr-x   - wyp  supergroup     0 2013-11-26 14:25 /home/wyp/out/YE
    ............................. many entries omitted here .............................
    -rw-r--r--   3 wyp  supergroup     0 2013-11-26 14:25 /home/wyp/out/_SUCCESS

    [wyp@l-datalog5.data.cn1 ~]$ /home/q/hadoop-2.2.0/bin/hadoop fs -ls /home/wyp/out/VN
    Found 2 items
    -rw-r--r--  3 wyp supergroup  148 2013-11-26 14:25 /home/wyp/out/VN/part-00000
    -rw-r--r--  3 wyp supergroup  566 2013-11-26 14:25 /home/wyp/out/VN/part-00001

    [wyp@l-datalog5.data.cn1 ~]$ /home/q/hadoop-2.2.0/bin/hadoop fs -cat /home/wyp/out/VN/part-00001
    3430490,1969,3350,1965,"VN","",597185,6,,73,4,43,,0,,,,,,,,,
    3630470,1971,4379,1970,"VN","",,1,,244,5,55,,4,,0.375,,22.5,,,,,
    3654325,1972,4477,1969,"VN","",,1,,554,1,14,,0,,,,,,,,,
    3665081,1972,4526,1970,"VN","",,1,,373,6,66,,1,,0,,3,,,,,
    3772710,1973,5072,1972,"VN","",,1,,4,6,65,,1,,0,,8,,,,,
    3821853,1974,5296,1971,"VN","",,1,,33,6,69,,1,,0,,23,,,,,
    3824277,1974,5310,1970,"VN","",347650,3,,562,1,14,,2,,0.5,,9,,,,0,0
    3918104,1975,5793,1972,"VN","",,1,2,4,6,65,5,0,0.4,,0,,18.2,,,,
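Most of the records in the listing above end in a run of empty fields, which is relevant to the -1 passed to split in PartitionFormat. The sketch below (class and method names are hypothetical) compares the two forms of String.split: with no limit argument Java drops trailing empty strings, while a limit of -1 keeps every field. For the country field at index 4 both forms give the same answer, so the -1 only matters if the trailing columns are read as well.

```java
/** Shows how the limit argument of String.split affects trailing empty fields. */
class SplitLimitDemo {
    static int fieldCount(String line, boolean keepTrailingEmpty) {
        // limit -1: keep every field; no limit argument: trailing empty strings are dropped.
        return keepTrailingEmpty ? line.split(",", -1).length : line.split(",").length;
    }

    public static void main(String[] args) {
        // First record from the listing above.
        String record = "3430490,1969,3350,1965,\"VN\",\"\",597185,6,,73,4,43,,0,,,,,,,,,";
        System.out.println("with -1:    " + fieldCount(record, true));
        System.out.println("without -1: " + fieldCount(record, false));
    }
}
```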
       
 
 

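From the results above we can see that all records with the same country were written to the same directory. MultipleOutputFormat is convenient when you need full control over file and directory names. As you can see, the program above splits the data by row; if we want to split by column instead, MultipleOutputFormat cannot help. That is where MultipleOutputs comes in. MultipleOutputs has existed since quite early versions, so let's first see how the official documentation describes it: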

　　MultipleOutputs creates multiple OutputCollectors. Each OutputCollector can have its own OutputFormat and types for the key/value pair. Your MapReduce program will decide what to output to each OutputCollector.
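As one way to picture "the program decides what to output to each collector", here is a Hadoop-free sketch in which a single record feeds two named outputs with different columns. This is not the MultipleOutputs API: the class NamedCollectorsSketch, its getCollector and map methods, the output names "chrono" and "geo", and the choice of columns are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hadoop-free illustration of several named collectors (not the real API). */
class NamedCollectorsSketch {
    private final Map<String, List<String>> collectors = new LinkedHashMap<>();

    /** Returns the collector registered under the given name, creating it on first use. */
    List<String> getCollector(String namedOutput) {
        return collectors.computeIfAbsent(namedOutput, n -> new ArrayList<>());
    }

    /** Sends different columns of one record to different named outputs. */
    void map(String line) {
        String[] f = line.split(",", -1);
        // "chrono" and "geo" are hypothetical output names chosen for this sketch.
        getCollector("chrono").add(f[0] + "," + f[1] + "," + f[2] + "," + f[3]);
        getCollector("geo").add(f[0] + "," + f[4] + "," + f[5]);
    }

    public static void main(String[] args) {
        NamedCollectorsSketch sketch = new NamedCollectorsSketch();
        sketch.map("3430490,1969,3350,1965,\"VN\",\"\",597185,6,,73,4,43,,0,,,,,,,,,");
        System.out.println(sketch.getCollector("chrono")); // [3430490,1969,3350,1965]
        System.out.println(sketch.getCollector("geo"));    // [3430490,"VN",""]
    }
}
```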

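Because this article is fairly long, it has been split into two parts. For the second part, see "Hadoop Multi-File Output: A Closer Look at MultipleOutputFormat and MultipleOutputs (Part 2)" on this blog. Apologies for any inconvenience.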

When reposting, please credit the source: reposted from 过往记忆 (http://www.iteblog.com/)
Permalink: Hadoop Multi-File Output: A Closer Look at MultipleOutputFormat and MultipleOutputs (Part 1) (http://www.iteblog.com/archives/842)