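Hadoop Streaming is a programming tool that ships with Hadoop. It offers a very flexible interface that lets users write MapReduce jobs in any language, and it is a common way to write MapReduce jobs without the Java API.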
$ ${HADOOP_HOME}/bin/hadoop jar ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
-D <name=value> \ # job property; a generic option, so it must come before the streaming options below
-input <input directory> \ # may be repeated, e.g. -input '/user/foo/dir1' -input '/user/foo/dir2'
-inputformat <input format JavaClassName> \
-output <output directory> \
-outputformat <output format JavaClassName> \
-mapper <mapper executable or JavaClassName> \
-reducer <reducer executable or JavaClassName> \
-combiner <combiner executable or JavaClassName> \
-partitioner <JavaClassName> \
-cmdenv <name=value> \ # environment variable passed to the tasks as a parameter; may be repeated
-file <dependency file> # config files, dictionaries and other dependencies (the generic option -files is preferred)
Property | New name | Meaning | Notes |
mapred.job.name | mapreduce.job.name | Job name | |
mapred.map.tasks | mapreduce.job.maps | Number of map tasks per job | Only a hint; the number of map tasks actually launched cannot be fully controlled |
mapred.reduce.tasks | mapreduce.job.reduces | Number of reduce tasks per job | |
mapred.job.priority | mapreduce.job.priority | Job priority | VERY_LOW, LOW, NORMAL, HIGH, VERY_HIGH |
stream.map.input.field.separator | | Field separator for map input | Default is \t |
stream.reduce.input.field.separator | | Field separator for reduce input | Default is \t |
stream.map.output.field.separator | | Field separator for map output | Default is \t |
stream.reduce.output.field.separator | | Field separator for reduce output | |
stream.num.map.output.key.fields | | Number of fields that make up the key in each map output record | |
stream.num.reduce.output.key.fields | | Number of fields that make up the key in each reduce output record | |
Note: the 2.6.0 Streaming documentation only mentions stream.num.reduce.output.fields and says nothing about stream.num.reduce.output.key.fields; the relationship between the two still needs to be checked.
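How the separator and key-field properties interact is easier to see on a concrete line. The sketch below is a simplified Python model of the splitting rule described in the Hadoop Streaming documentation, not Hadoop's own code. The function `split_key_value` and its parameter names are made up for illustration.

```python
def split_key_value(line, separator="\t", num_key_fields=1):
    """Split one record into (key, value).

    The key is the prefix up to the num_key_fields-th separator and the
    value is everything after it. If the line contains fewer separators
    than that, the whole line is the key and the value is empty.
    """
    line = line.rstrip("\n")
    parts = line.split(separator)
    if len(parts) <= num_key_fields:
        return line, ""
    key = separator.join(parts[:num_key_fields])
    value = separator.join(parts[num_key_fields:])
    return key, value


# Default behaviour: the text before the first tab is the key.
print(split_key_value("hadoop\t1"))
# A "." separator with four key fields, as if
# stream.map.output.field.separator=. and stream.num.map.output.key.fields=4
print(split_key_value("a.b.c.d.e", separator=".", num_key_fields=4))
```

The key matters because it decides how records are sorted and which reducer receives them, so changing the number of key fields changes the grouping that the reducer sees.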
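Hadoop Streaming requires the user's Mapper and Reducer to read their input from standard input (stdin) and write their results to standard output (stdout), which is very similar to the Linux pipe mechanism.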
$ cat <input_file> | <mapper executable> | sort | <reducer executable>
# Streaming example with Python
$ cat <input_file> | python mapper.py | sort | python reducer.py
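The same pipeline can be imitated inside a single Python process, which makes the role of the sort step visible: the reducer only works because lines with equal keys arrive next to each other. This is a minimal sketch under that assumption, using a toy word count; the function names are illustrative and it does not model partitioning across several reducers.

```python
import itertools


def mapper(lines):
    """Emit "<word><TAB>1" for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield "%s\t%d" % (word.lower(), 1)


def reducer(sorted_lines):
    """Sum the counts of adjacent lines that share a key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in itertools.groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%d" % (key, sum(int(value) for _, value in group))


def simulate_streaming(lines):
    """Mimic `cat | mapper | sort | reducer` in one process."""
    return list(reducer(sorted(mapper(lines))))


for record in simulate_streaming(["b a", "a c a"]):
    print(record)
```

Removing the `sorted` call would make the reducer emit the key `a` more than once, which is the same failure a shell pipeline shows when the `sort` stage is left out.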
$ cat input/input_0.txt
Hadoop is the Elephant King!
A yellow and elegant thing.
He never forgets
Useful data, or lets
An extraneous element cling!
$ cat input/input_1.txt
A wonderful king is Hadoop.
The elephant plays well with Sqoop.
But what helps him to thrive
Are Impala, and Hive,
And HDFS in the group.
$ cat input/input_2.txt
Hadoop is an elegant fellow.
An elephant gentle and mellow.
He never gets mad,
Or does anything bad,
Because, at his core, he is yellow.
$ ${HADOOP_HOME}/bin/hadoop fs -mkdir -p /user/<username>/wordcount
$ ${HADOOP_HOME}/bin/hadoop fs -put input/ /user/<username>/wordcount
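#!/usr/bin/env python
# encoding: utf-8
# mapper.py: emit "<word>\t1" for every word read from stdin
import re
import sys

separator_pattern = re.compile(r'[^a-zA-Z0-9]+')

for line in sys.stdin:
    for word in separator_pattern.split(line):
        if word:
            # Lower-case so that "Hadoop" and "hadoop" count as one word
            print('%s\t%d' % (word.lower(), 1))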
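#!/usr/bin/env python
# encoding: utf-8
# reducer.py: sum the counts of consecutive lines that share a key
import sys

last_key = None
last_sum = 0

for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t')
    if last_key is None:
        last_key = key
        last_sum = int(value)
    elif last_key == key:
        last_sum += int(value)
    else:
        # Key changed: emit the finished total and start a new one
        print('%s\t%d' % (last_key, last_sum))
        last_sum = int(value)
        last_key = key

# Emit the total for the final key
if last_key is not None:
    print('%s\t%d' % (last_key, last_sum))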
#!/usr/bin/env python
# encoding: utf-8
# reducer.py, alternative version built on itertools.groupby
import itertools
import sys

# Skip blank lines so that split('\t') always yields a key and a value
stdin_generator = (line for line in sys.stdin if line.strip())
for key, values in itertools.groupby(stdin_generator, key=lambda x: x.split('\t')[0]):
    value_sum = sum(int(i.split('\t')[1]) for i in values)
    print('%s\t%d' % (key, value_sum))
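As noted earlier, the basic Streaming flow resembles a Linux pipeline, so a simple test can be run locally first. A local test only checks that the program logic behaves roughly as expected; job property settings are not exercised this way.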
$ cat input/* | python mapper.py | sort | python reducer.py
a 2
an 3
and 4
anything 1
are 1
at 1
bad 1
because 1
but 1
cling 1
core 1
data 1
does 1
elegant 2
element 1
elephant 3
extraneous 1
fellow 1
forgets 1
gentle 1
gets 1
group 1
hadoop 3
hdfs 1
he 3
helps 1
him 1
his 1
hive 1
impala 1
in 1
is 4
king 2
lets 1
mad 1
mellow 1
never 2
or 2
plays 1
sqoop 1
the 3
thing 1
thrive 1
to 1
useful 1
well 1
what 1
with 1
wonderful 1
yellow 2
#!/usr/bin/env python
# encoding: utf-8
# mapper.py with a custom counter
import re
import sys

separator_pattern = re.compile(r'[^a-zA-Z0-9]+')

def print_counter(group, counter, amount):
    # Streaming updates a counter for each "reporter:counter:..." line on stderr
    sys.stderr.write('reporter:counter:{g},{c},{a}\n'.format(g=group, c=counter, a=amount))

for line in sys.stdin:
    for word in separator_pattern.split(line):
        if word:
            print('%s\t%d' % (word.lower(), 1))
        else:
            # Empty tokens produced by the split are counted instead of emitted
            print_counter('wc', 'empty-word', 1)
How do I update counters in streaming applications?
A streaming process can use the stderr to emit counter information. reporter:counter:<group>,<counter>,<amount> should be sent to stderr to update the counter.
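When a task updates counters very often, one option is to accumulate the increments in memory and write each counter once at the end. The sketch below is one possible way to do that and is not part of Hadoop; `CounterBuffer` is a hypothetical helper, and only the `reporter:counter:<group>,<counter>,<amount>` line format comes from the FAQ quoted above.

```python
import collections
import sys


class CounterBuffer(object):
    """Accumulate counter increments and emit them in one batch."""

    def __init__(self):
        self._counts = collections.Counter()

    def increment(self, group, counter, amount=1):
        self._counts[(group, counter)] += amount

    def format_lines(self):
        """Return one reporter line per counter, in a stable order."""
        return [
            "reporter:counter:%s,%s,%d" % (group, counter, amount)
            for (group, counter), amount in sorted(self._counts.items())
        ]

    def flush(self, stream=None):
        """Write the buffered counters to stderr (or another stream)."""
        stream = sys.stderr if stream is None else stream
        for line in self.format_lines():
            stream.write(line + "\n")
        stream.flush()
        self._counts.clear()


counters = CounterBuffer()
counters.increment("wc", "empty-word")
counters.increment("wc", "empty-word", 2)
counters.increment("wc", "lines")
demo_lines = counters.format_lines()
counters.flush()
```

The trade-off is that increments still held in the buffer are lost if the task dies before `flush` is called, so writing each update immediately, as `print_counter` does, remains the safer default.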
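# Use -files. Note: -D and -files are generic options and must be placed before the streaming options (-input, -output, ...); putting them later makes the command fail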
$ ${HADOOP_HOME}/bin/hadoop jar ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
-D mapred.job.name="streaming_wordcount" \
-D mapred.map.tasks=3 \
-D mapred.reduce.tasks=3 \
-D mapred.job.priority=HIGH \
-files "mapper.py,reducer.py" \
-input /user/<username>/wordcount/input \
-output /user/<username>/wordcount/output \
-mapper "python mapper.py" \
-reducer "python reducer.py"
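# Output. It may differ between versions. The -D options here use the old property names, which produces some warnings at the start; those are omitted here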
packageJobJar: [mapper.py, reducer.py, /tmp/hadoop-unjar707084306300214621/] [] /tmp/streamjob5287904745550112970.jar tmpDir=null
15/09/29 10:35:14 INFO client.RMProxy: Connecting to ResourceManager at xxxxx/x.x.x.x:y
15/09/29 10:35:14 INFO client.RMProxy: Connecting to ResourceManager at xxxxx/x.x.x.x:y
15/09/29 10:35:15 INFO mapred.FileInputFormat: Total input paths to process : 3
15/09/29 10:35:15 INFO mapreduce.JobSubmitter: number of splits:3
15/09/29 10:35:15 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/09/29 10:35:15 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1440570785607_1597
15/09/29 10:35:15 INFO impl.YarnClientImpl: Submitted application application_1440570785607_1597
15/09/29 10:35:15 INFO mapreduce.Job: The url to track the job: http://xxxxx:yyy/proxy/application_1440570785607_1597/
15/09/29 10:35:15 INFO mapreduce.Job: Running job: job_1440570785607_1597
15/09/29 10:37:15 INFO mapreduce.Job: Job job_1440570785607_1597 running in uber mode : false
15/09/29 10:37:15 INFO mapreduce.Job: map 0% reduce 0%
15/09/29 10:42:17 INFO mapreduce.Job: map 33% reduce 0%
15/09/29 10:42:18 INFO mapreduce.Job: map 100% reduce 0%
15/09/29 10:42:23 INFO mapreduce.Job: map 100% reduce 100%
15/09/29 10:42:24 INFO mapreduce.Job: Job job_1440570785607_1597 completed successfully
15/09/29 10:42:24 INFO mapreduce.Job: Counters: 50
File System Counters
FILE: Number of bytes read=689
FILE: Number of bytes written=661855
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=822
HDFS: Number of bytes written=379
HDFS: Number of read operations=18
HDFS: Number of large read operations=0
HDFS: Number of write operations=6
Job Counters
Launched map tasks=3
Launched reduce tasks=3
Rack-local map tasks=3
Total time spent by all maps in occupied slots (ms)=10657
Total time spent by all reduces in occupied slots (ms)=21644
Total time spent by all map tasks (ms)=10657
Total time spent by all reduce tasks (ms)=10822
Total vcore-seconds taken by all map tasks=10657
Total vcore-seconds taken by all reduce tasks=10822
Total megabyte-seconds taken by all map tasks=43651072
Total megabyte-seconds taken by all reduce tasks=88653824
Map-Reduce Framework
Map input records=15
Map output records=72
Map output bytes=527
Map output materialized bytes=725
Input split bytes=423
Combine input records=0
Combine output records=0
Reduce input groups=50
Reduce shuffle bytes=725
Reduce input records=72
Reduce output records=50
Spilled Records=144
Shuffled Maps =9
Failed Shuffles=0
Merged Map outputs=9
GC time elapsed (ms)=72
CPU time spent (ms)=7870
Physical memory (bytes) snapshot=3582062592
Virtual memory (bytes) snapshot=29715922944
Total committed heap usage (bytes)=10709630976
Shuffle Errors
File Input Format Counters
Bytes Read=399
File Output Format Counters
Bytes Written=379
15/09/29 10:42:24 INFO streaming.StreamJob: Output directory: /user/<username>/wordcount/output
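- The url to track the job: http://xxxxx:yyy/proxy/application_1440570785607_1597/ Open this URL to view the job status on a web page
- map 0% reduce 0% Shows the progress of the map and reduce phases
- The final Counters section contains the built-in counters; custom counters can be added to track job-specific status information
- Output directory: /user/<username>/wordcount/output The directory where the results are written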
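The submission command shown earlier places `-D` and `-files` ahead of `-input`, `-output`, `-mapper` and `-reducer`. When jobs are launched from a script, that ordering can be enforced in code. The following is a small sketch with a hypothetical helper, `build_streaming_command`; the paths in the example are placeholders, and it uses the new property names from the table of properties.

```python
def build_streaming_command(hadoop_bin, streaming_jar, properties, files,
                            input_paths, output_path, mapper, reducer):
    """Return the argument list for a Streaming job, generic options first."""
    command = [hadoop_bin, "jar", streaming_jar]
    # Generic options (-D, -files) go before any streaming option.
    for name, value in sorted(properties.items()):
        command += ["-D", "%s=%s" % (name, value)]
    if files:
        command += ["-files", ",".join(files)]
    # Streaming options follow.
    for path in input_paths:
        command += ["-input", path]
    command += ["-output", output_path, "-mapper", mapper, "-reducer", reducer]
    return command


command = build_streaming_command(
    hadoop_bin="/opt/hadoop/bin/hadoop",  # placeholder path
    streaming_jar="/opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar",
    properties={"mapreduce.job.name": "streaming_wordcount",
                "mapreduce.job.reduces": 3},
    files=["mapper.py", "reducer.py"],
    input_paths=["/user/someone/wordcount/input"],  # placeholder path
    output_path="/user/someone/wordcount/output",   # placeholder path
    mapper="python mapper.py",
    reducer="python reducer.py",
)
print(" ".join(command))
```

The returned list can be handed to `subprocess.call(command)` as is; keeping it as a list avoids shell quoting problems with values such as `python mapper.py`.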
$ wget https://www.python.org/ftp/python/2.7.10/Python-2.7.10.tgz
$ tar xzf Python-2.7.10.tgz
$ cd Python-2.7.10
# compile
$ ./configure --prefix=/home/<username>/wordcount/python27
$ make -j
$ make install
# Package the build as python27.tar.gz
$ cd /home/<username>/wordcount/
$ tar czf python27.tar.gz python27/
# Upload it to HDFS
$ ${HADOOP_HOME}/bin/hadoop fs -mkdir -p /tools/
$ ${HADOOP_HOME}/bin/hadoop fs -put python27.tar.gz /tools
# Launch the job using the Python build that was just uploaded
$ ${HADOOP_HOME}/bin/hadoop jar ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
-D mapred.reduce.tasks=3 \
-files "mapper.py,reducer.py" \
-archives "hdfs://xxxxx:9000/tools/python27.tar.gz#py" \
-input /user/<username>/wordcount/input \
-output /user/<username>/wordcount/output \
-mapper "py/python27/bin/python mapper.py" \
-reducer "py/python27/bin/python reducer.py"
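The interpreter path `py/python27/bin/python` combines two things: the name after `#` in `-archives`, and the top-level `python27/` directory stored inside the tarball. The sketch below reproduces that layout locally with a stand-in archive. It is only a model, assuming the archive is unpacked under the link name in the task's working directory; all temporary paths are created just for this demonstration.

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()

# Build a stand-in for python27.tar.gz with one top-level directory, python27/.
source_root = os.path.join(workdir, "build")
os.makedirs(os.path.join(source_root, "python27", "bin"))
with open(os.path.join(source_root, "python27", "bin", "python"), "w") as handle:
    handle.write("placeholder for the real interpreter\n")

archive_path = os.path.join(workdir, "python27.tar.gz")
with tarfile.open(archive_path, "w:gz") as archive:
    archive.add(os.path.join(source_root, "python27"), arcname="python27")

# Model of "-archives ...python27.tar.gz#py": unpack under a directory named py.
task_dir = os.path.join(workdir, "task")
unpack_dir = os.path.join(task_dir, "py")
os.makedirs(unpack_dir)
with tarfile.open(archive_path, "r:gz") as archive:
    archive.extractall(unpack_dir)

interpreter = os.path.join(unpack_dir, "python27", "bin", "python")
print(os.path.relpath(interpreter, task_dir))
```

If the tarball had been created from inside `python27/` instead, there would be no `python27` level and the mapper command would need to be `py/bin/python mapper.py`.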
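Passing several -input options gives a job multiple inputs. In practice, different inputs may need different handling, and the task then has to know the input path to tell which directory or file a record came from. Streaming exposes a set of Configured Parameters that provide this kind of runtime information.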
Name | Type | Description |
mapreduce.job.id | String | The job id |
mapreduce.job.jar | String | job.jar location in job directory |
mapreduce.job.local.dir | String | The job specific shared scratch space |
mapreduce.task.id | String | The task id |
mapreduce.task.attempt.id | String | The task attempt id |
mapreduce.task.is.map | boolean | Is this a map task |
mapreduce.task.partition | int | The id of the task within the job |
mapreduce.map.input.file | String | The filename that the map is reading from |
mapreduce.map.input.start | long | The offset of the start of the map input split |
mapreduce.map.input.length | long | The number of bytes in the map input split |
mapreduce.task.output.dir | String | The task's temporary output directory |
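While a Streaming job runs, the names of these mapreduce parameters are transformed: every dot (.) becomes an underscore (_). For example, mapreduce.job.id becomes mapreduce_job_id. All of these parameters can be read from environment variables.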
import os
input_file = os.environ['mapreduce_map_input_file']
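To act on that value in a multi-input job, a mapper can tag each record with its source. The sketch below assumes the two example input directories `/user/foo/dir1` and `/user/foo/dir2` used earlier; the tag names and helper functions are hypothetical. It reads the variable with `.get` so the script also runs in a local pipe test, where the variable is not set.

```python
import os


def conf_to_env_name(property_name):
    """Turn a job property name into its environment variable name."""
    return property_name.replace(".", "_")


def input_source(environ=None, default=""):
    """Return the path of the file the current map task is reading."""
    environ = os.environ if environ is None else environ
    return environ.get(conf_to_env_name("mapreduce.map.input.file"), default)


def tag_record(line, environ=None):
    """Prefix a record with a tag that identifies its input directory."""
    path = input_source(environ)
    if "/dir1/" in path:
        tag = "A"
    elif "/dir2/" in path:
        tag = "B"
    else:
        tag = "UNKNOWN"
    return "%s\t%s" % (tag, line.rstrip("\n"))


# Simulated task environment; on a cluster the framework sets this variable.
fake_env = {"mapreduce_map_input_file": "hdfs://namenode/user/foo/dir2/part-00000"}
print(tag_record("some record\n", fake_env))
```

Emitting the tag as the first field lets the reducer distinguish the two sources without needing the environment variable, which is only meaningful on the map side.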
- mrjob
- snakebite: an HDFS client implemented in pure Python
- Apache Hadoop MapReduce Streaming
- Hadoop Streaming Programming - Dong Xicheng (董西成)
- Deprecated Properties: mapping between old and new property names