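This article records how to set up a Hadoop development environment on Windows and how to write a WordCount MapReduce program that runs in the local environment.
Contents:
1. Set up the local environment
2. Write WordCount and run it locally
1. Set up the local environment
1.1. Extract Hadoop
Download the required Hadoop version from the official website: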
hadoop-2.7.3.tar.gz
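Extract the downloaded Hadoop archive to any directory.
Copy winutils.exe into the hadoop-2.7.3/bin directory.
1.2. Configure environment variables
Create a new environment variable that points to the directory where Hadoop was extracted: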
HADOOP_HOME:D:\soft\dev\hadoop-2.7.3
Then append the following to Path:
%HADOOP_HOME%\bin;
2. Write WordCount and run it locally
The input file looks like this:
hello java
hello hadoop
The output is as follows:
hello 2
hadoop 1
java 1
The project layout is as follows:
[Image: project directory structure]
2.1. Add the Maven dependencies
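<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.3</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.3</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.7.3</version>
</dependency>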
2.2. Add the log4j.properties configuration file
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.Target=System.out
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ABSOLUTE} %5p %c{1}:%L - %m%n
log4j.rootLogger=INFO, console
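2.3. Write the Mapper
The Mapper reads each line of the input text, splits it into words, and emits every word with a count of 1. Its output types are Text and IntWritable, for example: (java, 1).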
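import java.io.IOException;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Input: (byte offset of the line, line text). Output: (word, 1).
public class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        System.out.println("--->Map-->" + Thread.currentThread().getName());
        // StringUtils is assumed to be the commons-lang class pulled in by hadoop-common.
        String[] words = StringUtils.split(value.toString(), ' ');
        for (String w : words) {
            context.write(new Text(w), new IntWritable(1));
        }
    }
}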
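2.4. Write the Reducer
The Reducer receives the Mapper output and sums the counts for each word. Its input types follow from the Mapper output types, Text and Iterable<IntWritable>, for example: java (1,1). Its output types are Text and IntWritable, for example: java 2.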
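import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// Input: (word, all counts emitted for that word). Output: (word, total).
public class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        System.out.println("--->Reducer-->" + Thread.currentThread().getName());
        int sum = 0;
        for (IntWritable i : values) {
            sum = sum + i.get();
        }
        context.write(key, new IntWritable(sum));
    }
}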
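To make the hand-off between the two classes concrete, here is a small sketch in plain Java that imitates what happens between WcMapper and WcReducer: it emits a 1 per word, groups the 1s by word in sorted key order, and sums each group. It is only an illustration under simplifying assumptions (everything held in memory, whitespace splitting, and the hypothetical names WordCountSketch, mapAndShuffle and reduce); it is not how Hadoop implements the shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {
    // Map + shuffle: emit a 1 for every word and group the 1s by word.
    // TreeMap keeps the keys sorted, similar to the order the reducer sees.
    static Map<String, List<Integer>> mapAndShuffle(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // blank line
                }
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        return grouped;
    }

    // Reduce: sum the grouped 1s for each word.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int one : e.getValue()) {
                sum += one;
            }
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello java", "hello hadoop");
        Map<String, List<Integer>> grouped = mapAndShuffle(lines);
        System.out.println(grouped);         // {hadoop=[1], hello=[1, 1], java=[1]}
        System.out.println(reduce(grouped)); // {hadoop=1, hello=2, java=1}
    }
}
```

The grouped form corresponds to the "java (1,1)" style input described for the Reducer, and the reduced form corresponds to the final output.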
2.5. Write the Job
A Job assembles the Mapper and the Reducer into a single unit of execution. Computing WordCount is one Job.
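import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class RunWcJob {
    public static void main(String[] args) throws Exception {
        // Create the job instance for this MapReduce program
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // Set the main class of the job
        job.setJarByClass(RunWcJob.class);
        // Set the Mapper and Reducer implementations
        job.setMapperClass(WcMapper.class);
        job.setReducerClass(WcReducer.class);
        // Set the output types of the map phase
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Set the output types of the reduce phase, which are also the final output types of the job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Set the input directory and the output directory (the output directory must not exist yet)
        FileInputFormat.setInputPaths(job, "D:\\hadoop\\input");
        FileOutputFormat.setOutputPath(job, new Path("D:\\hadoop\\output"));
        // Submit the job and wait for it to finish
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}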
In the local folder D:\hadoop\input, create words.txt and fill it with the sample input shown above.
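Preparing the directories can also be done in code. Below is a minimal sketch in plain Java, assuming the D:\hadoop base directory used in RunWcJob; the class LocalJobDirs and its prepare method are hypothetical names introduced for illustration. It also removes a stale output directory, because FileOutputFormat refuses to start a job whose output directory already exists, which matters on a second run.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Comparator;
import java.util.stream.Stream;

class LocalJobDirs {
    // Creates <base>/input/words.txt and deletes <base>/output if it exists.
    static Path prepare(Path base) {
        Path input = base.resolve("input");
        Path output = base.resolve("output");
        try {
            Files.createDirectories(input);
            Path words = input.resolve("words.txt");
            Files.write(words, Arrays.asList("hello java", "hello hadoop"), StandardCharsets.UTF_8);
            if (Files.exists(output)) {
                // Delete children before parents by walking in reverse order.
                try (Stream<Path> walk = Files.walk(output)) {
                    walk.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
                }
            }
            return words;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The default base directory is the one assumed throughout this article.
        Path base = Paths.get(args.length > 0 ? args[0] : "D:\\hadoop");
        System.out.println("Wrote " + prepare(base));
    }
}
```

Deleting the output directory is destructive, so a helper like this is only appropriate for a scratch directory such as the one used here.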
The output folder is D:\hadoop\output. With that in place, run the program directly.
Errors that may occur:
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
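Cause:
winutils.exe was not copied into the hadoop-2.7.3/bin directory, the HADOOP_HOME environment variable is not configured, or the HADOOP_HOME setting has not taken effect.
Solution:
1. Download winutils.exe and copy it into the hadoop-2.7.3/bin directory
2. Check that the environment variable is configured
3. If the environment variable is already configured, restart IDEA or the computer, because the variable may not have taken effect yet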
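The checks above can also be scripted. The following is a minimal sketch in plain Java (no Hadoop classes needed), assuming the layout from section 1; the class name WinutilsCheck and its diagnose method are hypothetical helpers, not part of Hadoop. The literal null at the start of the path in the error message suggests that the Hadoop home directory could not be resolved, which is the case the first branch reports.

```java
import java.io.File;

class WinutilsCheck {
    // Returns a short diagnosis for the given HADOOP_HOME value.
    static String diagnose(String hadoopHome) {
        if (hadoopHome == null || hadoopHome.trim().isEmpty()) {
            return "HADOOP_HOME is not set";
        }
        File winutils = new File(new File(hadoopHome, "bin"), "winutils.exe");
        if (!winutils.isFile()) {
            return "winutils.exe not found at " + winutils.getPath();
        }
        return "OK";
    }

    public static void main(String[] args) {
        // Reads the variable as seen by this JVM, which may differ from
        // the system setting if the IDE was started before the change.
        System.out.println(diagnose(System.getenv("HADOOP_HOME")));
    }
}
```

Running it from the same IDE run configuration as the job shows the environment that the job itself would see.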
Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
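Cause:
Not entirely clear. This error typically indicates that the native Windows library for Hadoop (hadoop.dll) could not be loaded or does not match the Hadoop version in use.
Solution:
Copy the source of org.apache.hadoop.io.nativeio.NativeIO into the project and change the return value of the access method.
[Image: modified access method in NativeIO]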
2.6. Run results
If the following output appears after running, the job has executed correctly.
14:40:01,813 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14:40:02,058 INFO deprecation:1173 - session.id is deprecated. Instead, use dfs.metrics.session-id
14:40:02,060 INFO JvmMetrics:76 - Initializing JVM Metrics with processName=JobTracker, sessionId=
14:40:02,355 WARN JobResourceUploader:64 - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
14:40:02,387 WARN JobResourceUploader:171 - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
14:40:02,422 INFO FileInputFormat:283 - Total input paths to process : 1
14:40:02,685 INFO JobSubmitter:198 - number of splits:1
14:40:02,837 INFO JobSubmitter:287 - Submitting tokens for job: job_local866013445_0001
14:40:03,035 INFO Job:1294 - The url to track the job: http://localhost:8080/
14:40:03,042 INFO Job:1339 - Running job: job_local866013445_0001
14:40:03,044 INFO LocalJobRunner:471 - OutputCommitter set in config null
14:40:03,110 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
14:40:03,115 INFO LocalJobRunner:489 - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
14:40:03,211 INFO LocalJobRunner:448 - Waiting for map tasks
14:40:03,211 INFO LocalJobRunner:224 - Starting task: attempt_local866013445_0001_m_000000_0
14:40:03,238 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
14:40:03,383 INFO ProcfsBasedProcessTree:192 - ProcfsBasedProcessTree currently is supported only on Linux.
14:40:03,439 INFO Task:612 - Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@4d11cc8c
14:40:03,445 INFO MapTask:756 - Processing split: file:/D:/hadoop/input/words.txt:0+24
14:40:03,509 INFO MapTask:1205 - (EQUATOR) 0 kvi 26214396(104857584)
14:40:03,509 INFO MapTask:998 - mapreduce.task.io.sort.mb: 100
14:40:03,509 INFO MapTask:999 - soft limit at 83886080
14:40:03,509 INFO MapTask:1000 - bufstart = 0; bufvoid = 104857600
14:40:03,510 INFO MapTask:1001 - kvstart = 26214396; length = 6553600
14:40:03,515 INFO MapTask:403 - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
--->Map-->LocalJobRunner Map Task Executor #0
--->Map-->LocalJobRunner Map Task Executor #0
14:40:03,522 INFO LocalJobRunner:591 -
14:40:03,522 INFO MapTask:1460 - Starting flush of map output
14:40:03,522 INFO MapTask:1482 - Spilling map output
14:40:03,522 INFO MapTask:1483 - bufstart = 0; bufend = 40; bufvoid = 104857600
14:40:03,522 INFO MapTask:1485 - kvstart = 26214396(104857584); kvend = 26214384(104857536); length = 13/6553600
14:40:03,573 INFO MapTask:1667 - Finished spill 0
14:40:03,583 INFO Task:1038 - Task:attempt_local866013445_0001_m_000000_0 is done. And is in the process of committing
14:40:03,589 INFO LocalJobRunner:591 - map
14:40:03,589 INFO Task:1158 - Task 'attempt_local866013445_0001_m_000000_0' done.
14:40:03,589 INFO LocalJobRunner:249 - Finishing task: attempt_local866013445_0001_m_000000_0
14:40:03,590 INFO LocalJobRunner:456 - map task executor complete.
14:40:03,593 INFO LocalJobRunner:448 - Waiting for reduce tasks
14:40:03,593 INFO LocalJobRunner:302 - Starting task: attempt_local866013445_0001_r_000000_0
14:40:03,597 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
14:40:03,597 INFO ProcfsBasedProcessTree:192 - ProcfsBasedProcessTree currently is supported only on Linux.
14:40:03,627 INFO Task:612 - Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@2ae5eb6
14:40:03,658 INFO ReduceTask:362 - Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@72ddfb0b
14:40:03,686 INFO MergeManagerImpl:197 - MergerManager: memoryLimit=1314232704, maxSingleShuffleLimit=328558176, mergeThreshold=867393600, ioSortFactor=10, memToMemMergeOutputsThreshold=10
14:40:03,688 INFO EventFetcher:61 - attempt_local866013445_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
14:40:03,720 INFO LocalFetcher:144 - localfetcher#1 about to shuffle output of map attempt_local866013445_0001_m_000000_0 decomp: 50 len: 54 to MEMORY
14:40:03,729 INFO InMemoryMapOutput:100 - Read 50 bytes from map-output for attempt_local866013445_0001_m_000000_0
14:40:03,730 INFO MergeManagerImpl:315 - closeInMemoryFile -> map-output of size: 50, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->50
14:40:03,731 INFO EventFetcher:76 - EventFetcher is interrupted.. Returning
14:40:03,731 INFO LocalJobRunner:591 - 1 / 1 copied.
14:40:03,731 INFO MergeManagerImpl:687 - finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
14:40:03,744 INFO Merger:606 - Merging 1 sorted segments
14:40:03,744 INFO Merger:705 - Down to the last merge-pass, with 1 segments left of total size: 41 bytes
14:40:03,746 INFO MergeManagerImpl:754 - Merged 1 segments, 50 bytes to disk to satisfy reduce memory limit
14:40:03,748 INFO MergeManagerImpl:784 - Merging 1 files, 54 bytes from disk
14:40:03,748 INFO MergeManagerImpl:799 - Merging 0 segments, 0 bytes from memory into reduce
14:40:03,748 INFO Merger:606 - Merging 1 sorted segments
14:40:03,749 INFO Merger:705 - Down to the last merge-pass, with 1 segments left of total size: 41 bytes
14:40:03,749 INFO LocalJobRunner:591 - 1 / 1 copied.
14:40:03,847 INFO deprecation:1173 - mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
--->Reducer-->pool-3-thread-1
--->Reducer-->pool-3-thread-1
--->Reducer-->pool-3-thread-1
14:40:03,867 INFO Task:1038 - Task:attempt_local866013445_0001_r_000000_0 is done. And is in the process of committing
14:40:03,868 INFO LocalJobRunner:591 - 1 / 1 copied.
14:40:03,868 INFO Task:1199 - Task attempt_local866013445_0001_r_000000_0 is allowed to commit now
14:40:03,873 INFO FileOutputCommitter:535 - Saved output of task 'attempt_local866013445_0001_r_000000_0' to file:/D:/hadoop/output/_temporary/0/task_local866013445_0001_r_000000
14:40:03,877 INFO LocalJobRunner:591 - reduce > reduce
14:40:03,877 INFO Task:1158 - Task 'attempt_local866013445_0001_r_000000_0' done.
14:40:03,877 INFO LocalJobRunner:325 - Finishing task: attempt_local866013445_0001_r_000000_0
14:40:03,877 INFO LocalJobRunner:456 - reduce task executor complete.
14:40:04,044 INFO Job:1360 - Job job_local866013445_0001 running in uber mode : false
14:40:04,045 INFO Job:1367 - map 100% reduce 100%
14:40:04,045 INFO Job:1378 - Job job_local866013445_0001 completed successfully
14:40:04,050 INFO Job:1385 - Counters: 30
File System Counters
FILE: Number of bytes read=488
FILE: Number of bytes written=566782
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=2
Map output records=4
Map output bytes=40
Map output materialized bytes=54
Input split bytes=96
Combine input records=0
Combine output records=0
Reduce input groups=3
Reduce shuffle bytes=54
Reduce input records=4
Reduce output records=3
Spilled Records=8
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=7
Total committed heap usage (bytes)=498073600
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=24
File Output Format Counters
Bytes Written=36
Process finished with exit code 0
The results are written to D:\hadoop\output:
[Image: contents of the output directory]
The content of part-r-00000 is:
hadoop 1
hello 2
java 1
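To check the result programmatically, the part file can be parsed line by line. The sketch below assumes the job used the default TextOutputFormat, which writes each record as the key, a tab character, then the value; PartFileReader and parse are hypothetical names, and the default path is the one used in this article.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PartFileReader {
    // Parses "word<TAB>count" lines, keeping the order found in the file.
    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            String[] kv = line.split("\t", 2);
            if (kv.length < 2) {
                continue; // skip blank or malformed lines
            }
            counts.put(kv[0], Integer.parseInt(kv[1].trim()));
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "D:\\hadoop\\output\\part-r-00000";
        List<String> lines = Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);
        System.out.println(parse(lines));
    }
}
```

For the sample input, the parsed map should contain hadoop, hello and java with the counts listed above.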
The next article covers running WordCount on a cluster: "Hadoop: Running WordCount on a Cluster".