1. Install Maven
1.1 Download Maven
https://maven.apache.org/download.cgi
1.2 Upload and extract
tar -zxvf apache-maven-3.6.3-bin.tar.gz -C /opt
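1.3 Configure MVN_HOME
MVN_HOME must point to the directory Maven was actually extracted to. The value below is from the author's environment; if you extracted to /opt as in step 1.2, use /opt/apache-maven-3.6.3 instead.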
[root@cluster2-slave2 ~]# vim /etc/profile
export MVN_HOME=/data/module/apache-maven-3.6.3
export PATH=$PATH:$MVN_HOME/bin
[root@cluster2-slave2 ~]# source /etc/profile
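If this setup is repeated on several machines, the profile edit can be scripted instead of done by hand in vim. The following is only a minimal sketch: `add_profile_line` is a hypothetical helper name, and it appends a line only when the exact line is not already in the file, so running it twice does not duplicate the exports.

```shell
# Append a line to a profile file only if it is not already present,
# so re-running the setup does not duplicate entries.
# add_profile_line is a hypothetical helper, not part of Maven or the shell.
add_profile_line() {
    file=$1
    line=$2
    # -q: quiet, -x: match the whole line, -F: fixed string (no regex)
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example (run as root; adjust MVN_HOME to your extraction directory):
# add_profile_line /etc/profile 'export MVN_HOME=/data/module/apache-maven-3.6.3'
# add_profile_line /etc/profile 'export PATH=$PATH:$MVN_HOME/bin'
```

The single quotes in the example matter: they keep $PATH and $MVN_HOME unexpanded, so the variables are resolved when the profile is sourced rather than when the line is written.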
1.4 Verify Maven
[root@cluster2-slave2 ~]# mvn -v
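2 Install protobuf-2.5.0.tar.gz
2.1 Download
The version must be exactly 2.5.0.
After extracting the Tez 0.9.1 source (installed in a later step), pom.xml shows that 2.5.0 is the required version.
Hadoop uses Protocol Buffers for communication, so protobuf-2.5.0.tar.gz has to be downloaded and installed.
However, protobuf-2.5.0.tar.gz can no longer be downloaded from the old official site https://code.google.com/p/protobuf/downloads/list
I found a copy on Baidu Netdisk; the download link is:
Link: https://pan.baidu.com/s/1hm7D2_wxIxMKbN9xnlYWuA
Extraction code: haz4
2.2 Upload and extract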
tar -zxvf protobuf-2.5.0.tar.gz
2.3 Run the configure check
cd protobuf-2.5.0/
[root@cluster2-slave2 protobuf-2.5.0]# ./configure
My first configure run failed:
checking whether to enable maintainer-specific portions of Makefiles... yes
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking target system type... x86_64-unknown-linux-gnu
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking for style of include used by make... GNU
checking dependency style of gcc... gcc3
checking for g++... no
checking for c++... no
checking for gpp... no
checking for aCC... no
checking for CC... no
checking for cxx... no
checking for cc++... no
checking for cl.exe... no
checking for FCC... no
checking for KCC... no
checking for RCC... no
checking for xlC_r... no
checking for xlC... no
checking whether we are using the GNU C++ compiler... no
checking whether g++ accepts -g... no
checking dependency style of g++... none
checking how to run the C++ preprocessor... /lib/cpp
configure: error: in `/data/software/protobuf-2.5.0':
configure: error: C++ preprocessor "/lib/cpp" fails sanity check
See `config.log' for more details
Error message:
configure: error: in `/data/software/protobuf-2.5.0':
configure: error: C++ preprocessor "/lib/cpp" fails sanity check
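The root cause is that the C++ compiler and its libraries are missing (configure found no g++ or c++ above). On CentOS, install them with the following commands: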
yum install glibc-headers
yum install gcc-c++
After the installation finishes, run ./configure again; this time the check passes.
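To catch a missing compiler before running ./configure, the required tools can be checked up front. This is only a sketch: `missing_tools` is a hypothetical helper name, and the tool list in the example (gcc, g++, make) is an assumption based on the configure log above, not an official list of protobuf's requirements.

```shell
# Print the name of every listed command that is not on PATH.
# missing_tools is a hypothetical helper name.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example: check the build tools before running ./configure.
# missing=$(missing_tools gcc g++ make)
# [ -z "$missing" ] || echo "install first: $missing"
```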
2.4 make
[root@cluster2-slave2 protobuf-2.5.0]# make
2.5 make install
[root@cluster2-slave2 protobuf-2.5.0]# make install
2.6 Verify protobuf
[root@cluster2-slave2 protobuf-2.5.0]# protoc --version
libprotoc 2.5.0
The installation succeeded.
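Because the Tez build needs exactly 2.5.0, it can help to check the version in a script rather than by eye. The sketch below assumes the output format shown above ("libprotoc 2.5.0"); `protoc_version` and `check_protoc_version` are hypothetical helper names.

```shell
# Extract the version number from `protoc --version` output,
# which has the form "libprotoc 2.5.0". Reads the line on stdin
# and prints its second field.
protoc_version() {
    awk '{print $2}'
}

# Succeed only when the actual version equals the expected one.
# check_protoc_version is a hypothetical helper name.
check_protoc_version() {
    expected=$1
    actual=$2
    [ "$actual" = "$expected" ]
}

# Example:
# actual=$(protoc --version | protoc_version)
# check_protoc_version 2.5.0 "$actual" || echo "need protoc 2.5.0, found $actual"
```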
3 Install Tez
3.1 Download
http://www.apache.org/dyn/closer.lua/tez/0.9.1/
Download the source package and compile it for your own environment.
3.2 Upload and extract
[root@cluster2-slave2 software]# tar -zxvf apache-tez-0.9.1-src.tar.gz -C …/module/
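3.3 Modify pom.xml
There are four changes:
(1) Set hadoop.version to match your Hadoop version
Confirm your own Hadoop version (for example with the hadoop version command)
and change the property accordingly; this environment uses 3.0.0-cdh6.3.2
(2) Add the cloudera repository
(3) Add the cloudera pluginRepository
(4) Comment out modules that are not needed (tez-ext-service-tests and tez-ui in the file below) to reduce downloads and the chance of build errors
The complete pom.xml is as follows: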
<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.apache.tez</groupId>
<artifactId>tez</artifactId>
<packaging>pom</packaging>
<version>0.9.1</version>
<name>tez</name>
<licenses>
<license>
<name>The Apache Software License, Version 2.0</name>
<url>http://www.apache.org/licenses/LICENSE-2.0.txt</url>
</license>
</licenses>
<organization>
<name>Apache Software Foundation</name>
<url>http://www.apache.org</url>
</organization>
<properties>
<maven.test.redirectTestOutputToFile>true</maven.test.redirectTestOutputToFile>
<clover.license>${user.home}/clover.license</clover.license>
<hadoop.version>3.0.0-cdh6.3.2</hadoop.version>
<jetty.version>6.1.26</jetty.version>
<netty.version>3.6.2.Final</netty.version>
<pig.version>0.13.0</pig.version>
<javac.version>1.8</javac.version>
<slf4j.version>1.7.10</slf4j.version>
<enforced.java.version>[${javac.version},)</enforced.java.version>
<distMgmtSnapshotsId>apache.snapshots.https</distMgmtSnapshotsId>
<distMgmtSnapshotsName>Apache Development Snapshot Repository</distMgmtSnapshotsName>
<distMgmtSnapshotsUrl>https://repository.apache.org/content/repositories/snapshots</distMgmtSnapshotsUrl>
<distMgmtStagingId>apache.staging.https</distMgmtStagingId>
<distMgmtStagingName>Apache Release Distribution Repository</distMgmtStagingName>
<distMgmtStagingUrl>https://repository.apache.org/service/local/staging/deploy/maven2</distMgmtStagingUrl>
<failIfNoTests>false</failIfNoTests>
<protobuf.version>2.5.0</protobuf.version>
<protoc.path>${env.PROTOC_PATH}</protoc.path>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<scm.url>scm:git:https://git-wip-us.apache.org/repos/asf/tez.git</scm.url>
<build.time>${maven.build.timestamp}</build.time>
<frontend-maven-plugin.version>1.4</frontend-maven-plugin.version>
<findbugs-maven-plugin.version>3.0.1</findbugs-maven-plugin.version>
<javadoc-maven-plugin.version>2.10.4</javadoc-maven-plugin.version>
<shade-maven-plugin.version>2.4.3</shade-maven-plugin.version>
</properties>
<scm>
<connection>${scm.url}</connection>
</scm>
<distributionManagement>
<repository>
<id>${distMgmtStagingId}</id>
<name>${distMgmtStagingName}</name>
<url>${distMgmtStagingUrl}</url>
</repository>
<snapshotRepository>
<id>${distMgmtSnapshotsId}</id>
<name>${distMgmtSnapshotsName}</name>
<url>${distMgmtSnapshotsUrl}</url>
</snapshotRepository>
</distributionManagement>
<repositories>
<repository>
<id>${distMgmtSnapshotsId}</id>
<name>${distMgmtSnapshotsName}</name>
<url>${distMgmtSnapshotsUrl}</url>
</repository>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
<name>Cloudera Repositories</name>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
</repositories>
<pluginRepositories>
<pluginRepository>
<id>maven2-repository.atlassian</id>
<name>Atlassian Maven Repository</name>
<url>https://maven.atlassian.com/repository/public</url>
<layout>default</layout>
</pluginRepository>
<pluginRepository>
<id>${distMgmtSnapshotsId}</id>
<name>${distMgmtSnapshotsName}</name>
<url>${distMgmtSnapshotsUrl}</url>
<layout>default</layout>
</pluginRepository>
<pluginRepository>
<id>cloudera</id>
<name>Cloudera Repositories</name>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</pluginRepository>
</pluginRepositories>
<dependencyManagement>
<dependencies>
<!-- omitted in the original -->
</dependencies>
</dependencyManagement>
<modules>
<module>hadoop-shim</module>
<module>tez-api</module>
<module>tez-common</module>
<module>tez-runtime-library</module>
<module>tez-runtime-internals</module>
<module>tez-mapreduce</module>
<module>tez-examples</module>
<module>tez-tests</module>
<module>tez-dag</module>
<!--
<module>tez-ext-service-tests</module>
<module>tez-ui</module>
-->
<module>tez-plugins</module>
<module>tez-tools</module>
<module>hadoop-shim-impls</module>
<module>tez-dist</module>
<module>docs</module>
</modules>
<build>
<!-- omitted in the original -->
</build>
<profiles>
<!-- omitted in the original -->
</profiles>
<reporting>
<!-- omitted in the original -->
</reporting>
</project>
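3.4 Build with Maven
Note: javadoc and tests are not needed here, so both can be skipped during the build.
Use the command below to build.
Run it in the directory that contains pom.xml: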
mvn clean package -Dmaven.javadoc.skip=true -Dmaven.test.skip=true
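The build starts.
3.5 Problems encountered
The first build has to download many packages, and a download may fail partway through.
For example, in one run a dependency failed to download and the build reported an error.
After retrying, all dependency jars downloaded successfully.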
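The manual retry can be wrapped in a small loop. This is a sketch only: `retry` and the RETRY_DELAY variable are hypothetical names, not Maven features. Artifacts that were already downloaded stay in the local Maven repository (~/.m2 by default), so each new attempt only has to fetch what is still missing.

```shell
# Run a command up to max_attempts times, stopping at the first success.
# retry is a hypothetical helper; RETRY_DELAY (seconds, default 10) is
# an illustrative variable name.
retry() {
    max_attempts=$1
    shift
    attempt=1
    while :; do
        "$@" && return 0
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying..." >&2
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-10}"
    done
}

# Example, from the directory containing pom.xml:
# retry 3 mvn clean package -Dmaven.javadoc.skip=true -Dmaven.test.skip=true
```

A retry only helps with transient download failures; a genuine compile error will fail the same way on every attempt and has to be fixed in the source.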
The compilation then hit another problem:
a compile error at line 89, in the call to ApplicationReport.newInstance()
/data/module/apache-tez-0.9.1-src/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/client/NotRunningJob.java:[89,29] no s
Fix: call a different overload of ApplicationReport.newInstance().
Original source code:
return ApplicationReport.newInstance(unknownAppId, unknownAttemptId, "N/A",
"N/A", "N/A", "N/A", 0, null, YarnApplicationState.NEW, "N/A", "N/A",
0, 0, FinalApplicationStatus.UNDEFINED, null, "N/A", 0.0f, "TEZ_MRR", null);
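Replace it with the new code (one additional 0 argument is inserted before FinalApplicationStatus.UNDEFINED):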
return ApplicationReport.newInstance(unknownAppId, unknownAttemptId, "N/A",
"N/A", "N/A", "N/A", 0, null, YarnApplicationState.NEW, "N/A", "N/A",
0, 0, 0, FinalApplicationStatus.UNDEFINED, null, "N/A", 0.0f, "TEZ_MRR", null);
vim tez-mapreduce/src/main/java/org/apache/tez/mapreduce/client/NotRunningJob.java
Then continue the build:
mvn clean package -Dmaven.javadoc.skip=true -Dmaven.test.skip=true
3.6 The build succeeds
The build output is under tez-dist/target
Mine is at: /data/module/apache-tez-0.9.1-src/tez-dist/target
3.7 Integrate Tez with HDFS
Create the Tez directory on HDFS:
[root@cluster2-slave2 target]# hadoop fs -mkdir /user/tez
Upload tez-0.9.1.tar.gz to the /user/tez directory:
[root@cluster2-slave2 target]# hadoop fs -put tez-0.9.1.tar.gz /user/tez
3.8 Integrate Tez with Hive
(1) Go to the CDH lib directory
cd /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib
(2) Create the Tez directories
[root@cluster2-slave2 lib]# mkdir -p tez/conf
(3) Create the tez-site.xml file
[root@cluster2-slave2 lib]# cd tez/conf/
[root@cluster2-slave2 conf]# vim tez-site.xml
<configuration>
<property>
<name>tez.lib.uris</name>
<value>${fs.defaultFS}/user/tez/tez-0.9.1.tar.gz</value>
</property>
<property>
<name>tez.use.cluster.hadoop-libs</name>
<value>false</value>
</property>
</configuration>
(4) Copy the contents of tez-0.9.1-minimal into the tez directory,
that is, into the following directory:
/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/tez
[root@cluster2-slave2 tez-0.9.1-minimal]# cp ./*.jar /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/tez
[root@cluster2-slave2 tez-0.9.1-minimal]# cp -r lib /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/tez
(5) Distribute the tez directory to the other nodes
[root@cluster2-slave2 lib]# scp -r tez root@cluster2-slave1:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib
[root@cluster2-slave2 lib]# scp -r tez root@cluster2-master:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib
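With more than a couple of nodes, the two scp commands above generalize to a loop. The following is a sketch under stated assumptions: `distribute_dir` and the DRY_RUN switch are hypothetical names, passwordless root SSH to each host is assumed, and the host names in the example are the two from this cluster.

```shell
# Copy a directory to the same parent path on each host.
# distribute_dir is a hypothetical helper; set DRY_RUN=1 to print the
# scp commands instead of running them.
distribute_dir() {
    src_dir=$1
    dest_parent=$2
    shift 2
    for host in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "scp -r $src_dir root@$host:$dest_parent"
        else
            scp -r "$src_dir" "root@$host:$dest_parent" || return 1
        fi
    done
}

# Example, run from the CDH lib directory:
# CDH_LIB=/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib
# distribute_dir tez "$CDH_LIB" cluster2-slave1 cluster2-master
```

Running once with DRY_RUN=1 shows exactly which commands would be executed before anything is copied.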
3 Configure the Hive environment
3.1 Configure the environment
In the CDH console, find the Hive client configuration and add:
HADOOP_CLASSPATH=/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/tez/conf:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/tez/*:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/tez/lib/*
Then save and deploy the client configuration so that it takes effect.
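The HADOOP_CLASSPATH value above repeats the same long Tez path three times, which makes typos easy. As a sketch, it can be generated from a single directory; `tez_classpath` and TEZ_HOME are illustrative names, not CDH settings.

```shell
# Build the HADOOP_CLASSPATH value from a single Tez install directory:
# the conf directory, the jars in the Tez root, and the jars in lib.
# tez_classpath and TEZ_HOME are illustrative names, not CDH settings.
tez_classpath() {
    tez_home=$1
    printf '%s\n' "$tez_home/conf:$tez_home/*:$tez_home/lib/*"
}

# Example:
# TEZ_HOME=/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/tez
# echo "HADOOP_CLASSPATH=$(tez_classpath "$TEZ_HOME")"
```

The asterisks are printed literally; the JVM expands classpath wildcards itself, so they must not be expanded by the shell.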
3.2 Test the result
The query failed with this error:
23/08/15 19:55:49 ERROR SessionState: DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:1
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1692089386238_0057_2_00, diagnostics=[Vertex vertex_1692089386238_0057_2_00 [Map 1] killed/failed due to:INIT_FAILURE, Fail to create InputInitializerManager, org.apache.tez.dag.api.TezReflectionException: Unable to instantiate class with 1 arguments: org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator
at org.apache.tez.common.ReflectionUtils.getNewInstance(ReflectionUtils.java:71)
at org.apache.tez.common.ReflectionUtils.createClazzInstance(ReflectionUtils.java:89)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$1.run(RootInputInitializerManager.java:152)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$1.run(RootInputInitializerManager.java:148)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
at org.apache.tez.dag.app.dag.RootInputInitializerManager.createInitializer(RootInputInitializerManager.java:148)
at org.apache.tez.dag.app.dag.RootInputInitializerManager.runInputInitializers(RootInputInitializerManager.java:121)
at org.apache.tez.dag.app.dag.impl.VertexImpl.setupInputInitializerManager(VertexImpl.java:4101)
at org.apache.tez.dag.app.dag.impl.VertexImpl.access$3100(VertexImpl.java:205)
at org.apache.tez.dag.app.dag.impl.VertexImpl$InitTransition.handleInitEvent(VertexImpl.java:2912)
at org.apache.tez.dag.app.dag.impl.VertexImpl$InitTransition.transition(VertexImpl.java:2859)
at org.apache.tez.dag.app.dag.impl.VertexImpl$InitTransition.transition(VertexImpl.java:2841)
at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
at org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:59)
at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1939)
at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:204)
at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:2317)
at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:2303)
at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:180)
at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.tez.common.ReflectionUtils.getNewInstance(ReflectionUtils.java:68)
... 25 more
Caused by: java.lang.NullPointerException: hive.llap.daemon.service.hosts must be defined
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:228)
at org.apache.hadoop.hive.llap.registry.impl.LlapRegistryService.getClient(LlapRegistryService.java:60)
at org.apache.hadoop.hive.ql.exec.tez.Utils.getSplitLocationProvider(Utils.java:39)
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.<init>(HiveSplitGenerator.java:125)
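3.3 Troubleshooting
3.3.1 Caused by: java.lang.NullPointerException: hive.llap.daemon.service.hosts must be defined
Fix: disable the LLAP service.
Edit hive-site.xml
and set hive.llap.enabled=false
3.3.2 Exception: Too many counters: 121 max=120
The cause is that Hadoop's default limit on the number of counters is 120.
In the YARN configuration, set mapreduce.job.counters.max=2000
3.3.3 Caused by: java.lang.RuntimeException: Map operator initialization failed
Cause: a map join first loads one table into memory as a cache. If that table is too large, memory runs out and the query fails. A map join normally handles one small table and one large table, and newer Hive versions automatically optimize by caching the small table in memory. If the business logic really has to join two large tables, map join needs to be turned off temporarily.
In the SQL file, set: set hive.auto.convert.join=false;
3.3.4 Hive on Tez fails with IndexOutOfBoundsException: Index: 0, Size: 0
The cause is null values in the left join key.
Option 1: filter out the null values (in the join key)
Option 2: set hive.auto.convert.join=false;
3.3.5 Hive returns different amounts of data with the Tez engine and the MR engine
Causes:
Tez, like MR, enables map join by default. Several parameters are involved:
-- whether to convert joins to map joins automatically; default true
set hive.auto.convert.join=true;
-- size threshold that separates small and large tables for map join
set hive.mapjoin.smalltable.filesize=25000000;
-- when several map joins are merged into one, the maximum total input size; affects Tez; default 10 MB
set hive.auto.convert.join.noconditionaltask.size=10000000;
When the table data exceeds 10 MB, Tez cuts off the excess data, which causes data loss.
The hash algorithms of Hive 3 and Hive 2 also differ, which can cause the mismatch. See https://blog.csdn.net/weixin_38070561/article/details/126895259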
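3.4 Reducing log output
The run above produces far too much log output, which makes it hard to read. The following steps reduce it.
The volume comes from the fact that /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/tez/lib contains the following jars: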
slf4j-api-1.7.10.jar
slf4j-log4j12-1.7.10.jar
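These two jars are present, and the Hive log level is currently INFO.
So set the Hive log level to ERROR in the CDH console,
then restart Hive for the change to take effect.
Alternatively, delete these two jars directly: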
slf4j-api-1.7.10.jar
slf4j-log4j12-1.7.10.jar
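Deleting jars is hard to undo, so a more cautious variant is to move them aside first. The sketch below makes that assumption explicit: `stash_slf4j_jars` is a hypothetical helper name and the backup directory is whatever path you choose. It would have to be repeated on every node the tez directory was copied to.

```shell
# Move the two SLF4J jars out of the Tez lib directory into a backup
# directory instead of deleting them, so the change can be undone.
# stash_slf4j_jars is a hypothetical helper name.
stash_slf4j_jars() {
    lib_dir=$1
    backup_dir=$2
    mkdir -p "$backup_dir" || return 1
    for jar in slf4j-api-1.7.10.jar slf4j-log4j12-1.7.10.jar; do
        # Skip jars that are absent so the function can be re-run safely.
        if [ -f "$lib_dir/$jar" ]; then
            mv "$lib_dir/$jar" "$backup_dir/" || return 1
        fi
    done
}

# Example (the backup path is an arbitrary choice):
# TEZ_LIB=/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/tez/lib
# stash_slf4j_jars "$TEZ_LIB" /root/tez-slf4j-backup
```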
Check the Hive query log again.
At this point YARN also shows the ApplicationType as Tez.
3.5 Enable Tez for HiveServer2 as well
set hive.tez.container.size=3020;
set hive.execution.engine=tez;
select brandid,vipid,sku_id,`time` from pro30049.add_to_cart_dt_partition where vipid=1197032 and `time`>=1588953500000 and `time`<=1589039999000;