Setting up a PySpark environment on Windows
- Configure Hadoop
    - Download Hadoop and winutils
    - Install Hadoop and set the environment variables (HADOOP_HOME, and add %HADOOP_HOME%\bin to PATH)
    - Overwrite everything under hadoop-2.2.0\bin with the winutils binaries (the stock Apache Hadoop release ships without Windows natives such as winutils.exe)
- Configure Spark
    - Download Spark
    - Install it (the install path must not contain spaces) and set the environment variables (SPARK_HOME)
- Configure PySpark
    - Copy D:\spark-1.6.1\python\pyspark to D:\Program Files\Python\Python27\Lib\site-packages
    - pip install py4j (a quick sanity check of the finished setup is sketched right after this list)
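
To verify the setup, a minimal sketch: it assumes the install paths used in the steps above (adjust them to your machine) and exports the variables from Python before starting a local SparkContext.

import os

# Assumed install locations from the steps above; adjust to your machine.
os.environ.setdefault("HADOOP_HOME", r"D:\hadoop-2.2.0")
os.environ.setdefault("SPARK_HOME", r"D:\spark-1.6.1")
os.environ["PATH"] = os.environ["HADOOP_HOME"] + r"\bin;" + os.environ["PATH"]

from pyspark import SparkContext

sc = SparkContext("local[*]", "setupCheck")
print(sc.version)                       # Spark version, e.g. 1.6.1
print(sc.parallelize(range(10)).sum())  # 45, proves local executors work
sc.stop()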
WordCount example
from pyspark import SparkContext, SparkConf
import sys


def run(input_path, output_path):
    # Note: a master set in code takes precedence over spark-submit's
    # --master flag, so drop setMaster() when submitting to YARN.
    conf = SparkConf() \
        .set("spark.hadoop.validateOutputSpecs", "false") \
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .setAppName("helloWorld") \
        .setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd = sc.textFile(input_path)
    # Split lines into words, count each word, sort by count descending.
    counts = rdd.flatMap(lambda line: line.split(' ')) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b) \
        .sortBy(lambda pair: pair[1], False)
    counts.saveAsTextFile(output_path)
    sc.stop()


if __name__ == "__main__":
    input_path = sys.argv[1]
    output_path = sys.argv[2]
    run(input_path, output_path)

# Variant entry point: if extra Python code is shipped with the job, take its
# path as the first argument and add it to sys.path before running.
#
# if __name__ == "__main__":
#     python_package = sys.argv[1]
#     input_path = sys.argv[2]
#     output_path = sys.argv[3]
#     sys.path.append(python_package)
#     run(input_path, output_path)
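
Before submitting to a cluster, the same pipeline can be sanity-checked against a small in-memory list instead of an HDFS file; a minimal sketch (the test data is made up):

from pyspark import SparkContext

sc = SparkContext("local[*]", "wordCountTest")
lines = sc.parallelize(["a b a", "b a"])  # hypothetical test input
result = lines.flatMap(lambda line: line.split(' ')) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortBy(lambda pair: pair[1], False) \
    .collect()
print(result)  # expected: [('a', 3), ('b', 2)]
sc.stop()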
Submitting PySpark to YARN
/home/hadoop/soft/spark/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 1 \
--executor-memory 1G \
wordCount.py hdfs://artemis-02:9000/tmp/lvxw/learn_pyspark/logs/words hdfs://artemis-02:9000/tmp/lvxw/learn_pyspark/out/result
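
With --deploy-mode cluster the driver runs inside a YARN container, so the job's output and driver logs do not appear in the submitting console; they can be pulled afterwards with the standard YARN CLI (the application ID below is a placeholder):

yarn logs -applicationId application_XXXXXXXXXXXXX_YYYY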