These notes mainly record the hands-on steps; the content is based on the public 尚硅谷 Hudi course material and the official Hudi documentation.
For details, see the official docs: https://hudi.apache.org/docs/hoodie_deltastreamer/
Introduction to the DeltaStreamer tool
The HoodieDeltaStreamer utility (part of hudi-utilities-bundle) provides ways to ingest from different sources such as DFS or Kafka, with the following capabilities:
- Exactly-once ingestion of new events from Kafka; incremental imports from the output of Sqoop or HiveIncrementalPuller, or from files under a DFS folder
- Support for json, avro, or custom record types in the ingested data
- Checkpoint management, rollback, and recovery
- Leverages Avro schemas from DFS or the Confluent schema registry
- Support for custom transformations (see the sketch after this list)
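Custom transformations plug in through the --transformer-class option. As a minimal sketch (the class name and the added ingest_ts column are made up for illustration; the target schema would need a matching extra field), a Transformer that stamps each incoming batch with an ingestion time could look like this:
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.utilities.transform.Transformer;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.current_timestamp;

// Hypothetical example: adds an ingest_ts column to every incoming batch.
public class AddIngestTimeTransformer implements Transformer {
    @Override
    public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
                              Dataset<Row> rowDataset, TypedProperties properties) {
        return rowDataset.withColumn("ingest_ts", current_timestamp());
    }
}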
Command
The spark-submit command to launch the tool (here just printing its help) is as follows:
spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
/opt/software/hudi-0.12.1/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.12.1.jar \
--help
Schema Provider and Source configuration options: https://hudi.apache.org/docs/hoodie_deltastreamer
The demo below uses the File Based Schema Provider and JsonKafkaSource as an example.
Test
Start the Kafka cluster and prepare the data
- Start the Kafka cluster and create the topic
/opt/module/kafka/bin/kafka-topics.sh --bootstrap-server m2:9092 --create --topic hudi_test
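Optionally, verify the topic was created before producing (standard kafka-topics options, same broker address as above):
/opt/module/kafka/bin/kafka-topics.sh --bootstrap-server m2:9092 --describe --topic hudi_test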
- Java producer code to send test data to the topic; required Maven dependencies:
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>2.7.2</version>
</dependency>
<!-- fastjson <= 1.2.80 has a known security vulnerability, so use 1.2.83 -->
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.83</version>
</dependency>
import com.alibaba.fastjson.JSONObject;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;
import java.util.Random;

public class TestKafkaProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "192.168.2.102:9092");
        props.put("acks", "-1");                 // wait for all in-sync replicas
        props.put("batch.size", "1048576");
        props.put("linger.ms", "5");
        props.put("compression.type", "snappy");
        props.put("buffer.memory", "33554432");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        Random random = new Random();
        // Send 1000 JSON records whose fields match the source schema
        for (int i = 0; i < 1000; i++) {
            JSONObject model = new JSONObject();
            model.put("userid", i);
            model.put("username", "name" + i);
            model.put("age", 18);
            model.put("partition", random.nextInt(100));
            producer.send(new ProducerRecord<>("hudi_test", model.toJSONString()));
        }
        producer.flush();
        producer.close();
    }
}
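After the producer runs, a quick spot check with the standard console consumer (it stops after 5 messages) confirms the data landed:
/opt/module/kafka/bin/kafka-console-consumer.sh --bootstrap-server m2:9092 --topic hudi_test --from-beginning --max-messages 5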
Prepare the configuration files
- Define the Avro schema files needed (both source and target)
mkdir /opt/test/hudi-props/
vim /opt/test/hudi-props/source-schema-json.avsc
{
  "type": "record",
  "name": "Profiles",
  "fields": [
    { "name": "userid", "type": ["null", "string"], "default": null },
    { "name": "username", "type": ["null", "string"], "default": null },
    { "name": "age", "type": ["null", "string"], "default": null },
    { "name": "partition", "type": ["null", "string"], "default": null }
  ]
}
cp source-schema-json.avsc target-schema-json.avsc
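Optional sanity check: the little helper below (my own, assuming the avro library is on the classpath) verifies the .avsc file parses before handing it to DeltaStreamer:
import org.apache.avro.Schema;
import java.io.File;
import java.io.IOException;

public class SchemaCheck {
    public static void main(String[] args) throws IOException {
        // Schema.Parser throws if the file is not valid Avro schema JSON
        Schema s = new Schema.Parser().parse(new File("/opt/test/hudi-props/source-schema-json.avsc"));
        System.out.println(s.getFullName() + " parsed OK, fields: " + s.getFields().size());
    }
}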
- Copy Hudi's base.properties config
cp /opt/software/hudi-0.12.1/hudi-utilities/src/test/resources/delta-streamer-config/base.properties /opt/test/hudi-props/
- Based on the template provided in the source tree, write your own Kafka source configuration file
cp /opt/software/hudi-0.12.1/hudi-utilities/src/test/resources/delta-streamer-config/kafka-source.properties /opt/test/hudi-props/
vim /opt/test/hudi-props/kafka-source.properties
My finished kafka-source configuration file is shown below.
All paths in this config are local; in practice these files usually live on HDFS.
If you put them on HDFS, you can upload everything under hudi-props in one go: hadoop fs -put /opt/test/hudi-props/ /
Then change paths such as /opt/test/hudi-props/source-schema-json.avsc to
hdfs://m1:8020/hudi-props/source-schema-json.avsc
### include=/opt/test/hudi-props/base.properties

# Key fields, for kafka example
hoodie.datasource.write.recordkey.field=userid
hoodie.datasource.write.partitionpath.field=partition

# schema provider configs
# hoodie.deltastreamer.schemaprovider.registry.url=http://localhost:8081/subjects/impressions-value/versions/latest
hoodie.deltastreamer.schemaprovider.source.schema.file=/opt/test/hudi-props/source-schema-json.avsc
hoodie.deltastreamer.schemaprovider.target.schema.file=/opt/test/hudi-props/target-schema-json.avsc

# Kafka Source
#hoodie.deltastreamer.source.kafka.topic=uber_trips
hoodie.deltastreamer.source.kafka.topic=hudi_test

# Kafka props
bootstrap.servers=m2:9092
auto.offset.reset=earliest
group.id=test-group
schema.registry.url=http://localhost:8081
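If you have a Confluent schema registry running, an alternative to the two schema.file lines (not used in this demo; the subject name below is illustrative, following the pattern in the commented template line) is to pass --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider and set:
hoodie.deltastreamer.schemaprovider.registry.url=http://localhost:8081/subjects/hudi_test-value/versions/latest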
Copy the required Hudi jar into Spark
Put the previously built hudi-utilities-bundle_2.12-0.12.1.jar under Spark's jars directory; otherwise the job fails with errors about missing classes and methods.
cp /opt/software/hudi-0.12.1/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.12.1.jar /opt/module/spark-3.2.2/jars/
Run the import command
spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
/opt/module/spark-3.2.2/jars/hudi-utilities-bundle_2.12-0.12.1.jar \
--props /opt/test/hudi-props/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field userid \
--target-base-path hdfs://m1:8020/tmp/hudi/hudi_test \
--target-table hudi_test \
--op BULK_INSERT \
--table-type MERGE_ON_READ
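The command above performs a one-shot ingest and exits. DeltaStreamer also supports a long-running mode via the --continuous flag; as a sketch, the tail of the same command would become (with --op switched to UPSERT, since BULK_INSERT is meant for one-off loads):
--op UPSERT \
--table-type MERGE_ON_READ \
--continuous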
View the import results
- Start spark-sql
spark-sql \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
- Create the Hudi table against the ingested location
use spark_hudi;
create table hudi_test using hudi location 'hdfs://m1:8020/tmp/hudi/hudi_test';
- Query the Hudi table
select * from hudi_test;
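The producer wrote 1000 records, so a quick row count is an easy cross-check (assuming a single producer run):
select count(*) from hudi_test;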