Bootstrap

linux运行wordcount,Hadoop之本地运行WordCount

本文主要记录在windows搭建Hadoop开发环境并编写一个WordCount的mapreduce在本地环境执行。

主要内容:

1.搭建本地环境

2.编写WordCount并在本地运行

1.搭建本地环境

1.1.解压

去官网下载指定的hadoop版本

hadoop-2.7.3.tar.gz

将下载好的hadoop压缩包解压到任意目录

拷贝winutils.exe 到 hadoop-2.7.3/bin 目录下

1.2 配置环境变量

新建环境变量执行hadoop解压路径

HADOOP_HOME:D:\soft\dev\hadoop-2.7.3

在Path后新增

%HADOOP_HOME%\bin;

2.编写WordCount

输入文件格式如下:

hello java

hello hadoop

输出如下:

hello 2

hadoop 1

java 1

项目目录如下:

762ebf8f9af7

image.png

2.1.引入Maven依赖

org.apache.hadoop

hadoop-client

2.7.3

org.apache.hadoop

hadoop-common

2.7.3

org.apache.hadoop

hadoop-hdfs

2.7.3

2.2.加入log4j.properties配置文件

log4j.appender.console=org.apache.log4j.ConsoleAppender

log4j.appender.console.Target=System.out

log4j.appender.console.layout=org.apache.log4j.PatternLayout

log4j.appender.console.layout.ConversionPattern=%d{ABSOLUTE} %5p %c{1}:%L - %m%n

log4j.rootLogger=INFO, console

2.3.编写Mapper

读取输入文本中的每一行,并切分单词,记录单词的数量并输出,输出类型为Text,IntWritable 例如:java,1

public class WcMapper extends Mapper {

@Override

protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

System.out.println("--->Map-->" + Thread.currentThread().getName());

String[] words = StringUtils.split(value.toString(), ' ');

for (String w : words) {

context.write(new Text(w), new IntWritable(1));

}

}

}

2.4.编写Reducer

接收Mapper的输出结果进行累加并输出结果,接收类型为Mapper的输出类型Text,Iterable 例如:java (1,1),输出类型为 Text,intWritable 例如:java 2

public class WcReducer extends Reducer {

@Override

protected void reduce(Text key, Iterable values, Context context) throws IOException, InterruptedException {

System.out.println("--->Reducer-->" + Thread.currentThread().getName());

int sum = 0;

for (IntWritable i : values) {

sum = sum + i.get();

}

context.write(key, new IntWritable(sum));

}

}

2.5.编写Job

将Mapper和Reducer组装起来封装成功一个Job,作为一个执行单元。计算WordCount就是一个Job。

public class RunWcJob {

public static void main(String[] args) throws Exception {

// 创建本次mr程序的job实例

Configuration conf = new Configuration();

Job job = Job.getInstance(conf);

// 指定本次job运行的主类

job.setJarByClass(RunWcJob.class);

// 指定本次job的具体mapper reducer实现类

job.setMapperClass(WcMapper.class);

job.setReducerClass(WcReducer.class);

// 指定本次job map阶段的输出数据类型

job.setMapOutputKeyClass(Text.class);

job.setMapOutputValueClass(IntWritable.class);

// 指定本次job reduce阶段的输出数据类型 也就是整个mr任务的最终输出类型

job.setOutputKeyClass(Text.class);

job.setOutputValueClass(IntWritable.class);

// 指定本次job待处理数据的目录 和程序执行完输出结果存放的目录

FileInputFormat.setInputPaths(job, "D:\\hadoop\\input");

FileOutputFormat.setOutputPath(job, new Path("D:\\hadoop\\output"));

// 提交本次job

boolean b = job.waitForCompletion(true);

System.exit(b ? 0 : 1);

}

}

在本地文件夹D:\hadoop\input下新建 words.txt,内容为上面给出的输入内容作为输入

同样输出文件夹为output,那么直接运行程序:

可能出现的错误:

java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

原因:

没有拷贝winutils拷贝到hadoop-2.7.3/bin目录下或者没有配置HADOOP_HOME环境变量或者配置HADOOP_HOME环境变量没生效

解决:

1.下载winutils拷贝到hadoop-2.7.3/bin目录下

2.检查环境变量是否配置

3.如果已经配置好环境变量,重启idea或这电脑,有可能是环境变量没生效

Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z

原因:

不太清楚

解决:

拷贝org.apache.hadoop.io.nativeio.NativeIO源码,重写access方法的返回值

762ebf8f9af7

image.png

2.6运行结果

允许如果出现一下信息就表示已经正确执行了。

14:40:01,813 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

14:40:02,058 INFO deprecation:1173 - session.id is deprecated. Instead, use dfs.metrics.session-id

14:40:02,060 INFO JvmMetrics:76 - Initializing JVM Metrics with processName=JobTracker, sessionId=

14:40:02,355 WARN JobResourceUploader:64 - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.

14:40:02,387 WARN JobResourceUploader:171 - No job jar file set. User classes may not be found. See Job or Job#setJar(String).

14:40:02,422 INFO FileInputFormat:283 - Total input paths to process : 1

14:40:02,685 INFO JobSubmitter:198 - number of splits:1

14:40:02,837 INFO JobSubmitter:287 - Submitting tokens for job: job_local866013445_0001

14:40:03,035 INFO Job:1294 - The url to track the job: http://localhost:8080/

14:40:03,042 INFO Job:1339 - Running job: job_local866013445_0001

14:40:03,044 INFO LocalJobRunner:471 - OutputCommitter set in config null

14:40:03,110 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1

14:40:03,115 INFO LocalJobRunner:489 - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

14:40:03,211 INFO LocalJobRunner:448 - Waiting for map tasks

14:40:03,211 INFO LocalJobRunner:224 - Starting task: attempt_local866013445_0001_m_000000_0

14:40:03,238 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1

14:40:03,383 INFO ProcfsBasedProcessTree:192 - ProcfsBasedProcessTree currently is supported only on Linux.

14:40:03,439 INFO Task:612 - Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@4d11cc8c

14:40:03,445 INFO MapTask:756 - Processing split: file:/D:/hadoop/input/words.txt:0+24

14:40:03,509 INFO MapTask:1205 - (EQUATOR) 0 kvi 26214396(104857584)

14:40:03,509 INFO MapTask:998 - mapreduce.task.io.sort.mb: 100

14:40:03,509 INFO MapTask:999 - soft limit at 83886080

14:40:03,509 INFO MapTask:1000 - bufstart = 0; bufvoid = 104857600

14:40:03,510 INFO MapTask:1001 - kvstart = 26214396; length = 6553600

14:40:03,515 INFO MapTask:403 - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer

--->Map-->LocalJobRunner Map Task Executor #0

--->Map-->LocalJobRunner Map Task Executor #0

14:40:03,522 INFO LocalJobRunner:591 -

14:40:03,522 INFO MapTask:1460 - Starting flush of map output

14:40:03,522 INFO MapTask:1482 - Spilling map output

14:40:03,522 INFO MapTask:1483 - bufstart = 0; bufend = 40; bufvoid = 104857600

14:40:03,522 INFO MapTask:1485 - kvstart = 26214396(104857584); kvend = 26214384(104857536); length = 13/6553600

14:40:03,573 INFO MapTask:1667 - Finished spill 0

14:40:03,583 INFO Task:1038 - Task:attempt_local866013445_0001_m_000000_0 is done. And is in the process of committing

14:40:03,589 INFO LocalJobRunner:591 - map

14:40:03,589 INFO Task:1158 - Task 'attempt_local866013445_0001_m_000000_0' done.

14:40:03,589 INFO LocalJobRunner:249 - Finishing task: attempt_local866013445_0001_m_000000_0

14:40:03,590 INFO LocalJobRunner:456 - map task executor complete.

14:40:03,593 INFO LocalJobRunner:448 - Waiting for reduce tasks

14:40:03,593 INFO LocalJobRunner:302 - Starting task: attempt_local866013445_0001_r_000000_0

14:40:03,597 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1

14:40:03,597 INFO ProcfsBasedProcessTree:192 - ProcfsBasedProcessTree currently is supported only on Linux.

14:40:03,627 INFO Task:612 - Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@2ae5eb6

14:40:03,658 INFO ReduceTask:362 - Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@72ddfb0b

14:40:03,686 INFO MergeManagerImpl:197 - MergerManager: memoryLimit=1314232704, maxSingleShuffleLimit=328558176, mergeThreshold=867393600, ioSortFactor=10, memToMemMergeOutputsThreshold=10

14:40:03,688 INFO EventFetcher:61 - attempt_local866013445_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events

14:40:03,720 INFO LocalFetcher:144 - localfetcher#1 about to shuffle output of map attempt_local866013445_0001_m_000000_0 decomp: 50 len: 54 to MEMORY

14:40:03,729 INFO InMemoryMapOutput:100 - Read 50 bytes from map-output for attempt_local866013445_0001_m_000000_0

14:40:03,730 INFO MergeManagerImpl:315 - closeInMemoryFile -> map-output of size: 50, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->50

14:40:03,731 INFO EventFetcher:76 - EventFetcher is interrupted.. Returning

14:40:03,731 INFO LocalJobRunner:591 - 1 / 1 copied.

14:40:03,731 INFO MergeManagerImpl:687 - finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs

14:40:03,744 INFO Merger:606 - Merging 1 sorted segments

14:40:03,744 INFO Merger:705 - Down to the last merge-pass, with 1 segments left of total size: 41 bytes

14:40:03,746 INFO MergeManagerImpl:754 - Merged 1 segments, 50 bytes to disk to satisfy reduce memory limit

14:40:03,748 INFO MergeManagerImpl:784 - Merging 1 files, 54 bytes from disk

14:40:03,748 INFO MergeManagerImpl:799 - Merging 0 segments, 0 bytes from memory into reduce

14:40:03,748 INFO Merger:606 - Merging 1 sorted segments

14:40:03,749 INFO Merger:705 - Down to the last merge-pass, with 1 segments left of total size: 41 bytes

14:40:03,749 INFO LocalJobRunner:591 - 1 / 1 copied.

14:40:03,847 INFO deprecation:1173 - mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords

--->Reducer-->pool-3-thread-1

--->Reducer-->pool-3-thread-1

--->Reducer-->pool-3-thread-1

14:40:03,867 INFO Task:1038 - Task:attempt_local866013445_0001_r_000000_0 is done. And is in the process of committing

14:40:03,868 INFO LocalJobRunner:591 - 1 / 1 copied.

14:40:03,868 INFO Task:1199 - Task attempt_local866013445_0001_r_000000_0 is allowed to commit now

14:40:03,873 INFO FileOutputCommitter:535 - Saved output of task 'attempt_local866013445_0001_r_000000_0' to file:/D:/hadoop/output/_temporary/0/task_local866013445_0001_r_000000

14:40:03,877 INFO LocalJobRunner:591 - reduce > reduce

14:40:03,877 INFO Task:1158 - Task 'attempt_local866013445_0001_r_000000_0' done.

14:40:03,877 INFO LocalJobRunner:325 - Finishing task: attempt_local866013445_0001_r_000000_0

14:40:03,877 INFO LocalJobRunner:456 - reduce task executor complete.

14:40:04,044 INFO Job:1360 - Job job_local866013445_0001 running in uber mode : false

14:40:04,045 INFO Job:1367 - map 100% reduce 100%

14:40:04,045 INFO Job:1378 - Job job_local866013445_0001 completed successfully

14:40:04,050 INFO Job:1385 - Counters: 30

File System Counters

FILE: Number of bytes read=488

FILE: Number of bytes written=566782

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

Map-Reduce Framework

Map input records=2

Map output records=4

Map output bytes=40

Map output materialized bytes=54

Input split bytes=96

Combine input records=0

Combine output records=0

Reduce input groups=3

Reduce shuffle bytes=54

Reduce input records=4

Reduce output records=3

Spilled Records=8

Shuffled Maps =1

Failed Shuffles=0

Merged Map outputs=1

GC time elapsed (ms)=7

Total committed heap usage (bytes)=498073600

Shuffle Errors

BAD_ID=0

CONNECTION=0

IO_ERROR=0

WRONG_LENGTH=0

WRONG_MAP=0

WRONG_REDUCE=0

File Input Format Counters

Bytes Read=24

File Output Format Counters

Bytes Written=36

Process finished with exit code 0

会在D:\hadoop\output输出结果如下:

762ebf8f9af7

image.png

其中part-r-00000的内容如下:

hadoop 1

hello 2

java 1

下一篇我们介绍在集群中运行WordCount,Hadoop之集群运行WordCount

3.参考

;