Reading Parquet files with Flink SQL
1. Add the Maven dependency
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-parquet_2.12</artifactId>
    <!-- Use the full release number that matches your Flink cluster, e.g. 1.11.0 -->
    <version>1.11.0</version>
</dependency>
2. Create a Flink dynamic table bound to the files
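// Imports required at the top of the enclosing class file (Flink 1.11 package names)
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public static void main(String[] args) {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(2);
    final StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

    // Root directory that holds the partitioned Parquet files
    String dataPath = "/Users/klook/Downloads/user_profile/";

    // Register a table backed by the files; the partition columns are declared
    // both in the column list and in PARTITIONED BY
    tableEnv.executeSql("CREATE TABLE `user_profile_data` (\n" +
            "  `device_id` STRING,\n" +
            "  `last_week_click_num` INT,\n" +
            "  `last_month_click_num` INT,\n" +
            "  `last_week_searchpv` INT,\n" +
            "  `year` STRING,\n" +
            "  `month` STRING,\n" +
            "  `day` STRING\n" +
            ") PARTITIONED BY (`year`, `month`, `day`)\n" +
            "WITH (\n" +
            "  'connector' = 'filesystem',\n" +
            "  'path' = '" + dataPath + "',\n" +
            "  'format' = 'parquet'\n" +
            ")");

    // Placeholder query: the original post does not show its SQL, replace with your own
    String sql = "SELECT * FROM user_profile_data";
    final Table activity_base = tableEnv.sqlQuery(sql);

    // Convert the query result into an append-only stream of rows
    final DataStream<Row> activityBaseStream = tableEnv.toAppendStream(activity_base, Row.class);

    // sink2Mongo is the author's own helper for writing to MongoDB; it is not shown in this post
    sink2Mongo(activityBaseStream);

    try {
        env.execute();
    } catch (Exception e) {
        e.printStackTrace();
    }
}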
Once the table is bound to the files, you can query it with SQL or convert the result into a DataStream for further processing.
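The hand-concatenated DDL in section 2 is easy to get wrong (a missing comma, or a partition key that is not in the column list). A minimal sketch of a helper that assembles the same statement could look like the following. ParquetTableDdl and its build method are hypothetical names introduced here for illustration, not part of Flink; the sketch only produces a string that you would pass to tableEnv.executeSql.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper (not a Flink API): builds the filesystem/parquet CREATE TABLE text. */
public class ParquetTableDdl {

    static String build(String table, Map<String, String> columns,
                        List<String> partitionKeys, String path) {
        // Fail early if a partition key is missing from the column list.
        for (String key : partitionKeys) {
            if (!columns.containsKey(key)) {
                throw new IllegalArgumentException("Partition key not declared as a column: " + key);
            }
        }
        // One "`name` TYPE" entry per column, in insertion order.
        String cols = columns.entrySet().stream()
                .map(e -> "  `" + e.getKey() + "` " + e.getValue())
                .collect(Collectors.joining(",\n"));
        String parts = partitionKeys.stream()
                .map(k -> "`" + k + "`")
                .collect(Collectors.joining(", "));
        // Single quotes inside the path are doubled, as in a standard SQL string literal.
        return "CREATE TABLE `" + table + "` (\n" + cols + "\n"
                + ") PARTITIONED BY (" + parts + ") WITH (\n"
                + "  'connector' = 'filesystem',\n"
                + "  'path' = '" + path.replace("'", "''") + "',\n"
                + "  'format' = 'parquet'\n"
                + ")";
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the column order used in the post.
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("device_id", "STRING");
        columns.put("last_week_click_num", "INT");
        columns.put("last_month_click_num", "INT");
        columns.put("last_week_searchpv", "INT");
        columns.put("year", "STRING");
        columns.put("month", "STRING");
        columns.put("day", "STRING");
        System.out.println(build("user_profile_data", columns,
                Arrays.asList("year", "month", "day"),
                "/Users/klook/Downloads/user_profile/"));
    }
}
```

The early check on partition keys is a design choice of this sketch: it surfaces a mismatch between the column list and PARTITIONED BY when the statement is built, before the job is submitted.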
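3. Notes
If the job is submitted as a jar in production and fails with the error below, which says that no factory for 'parquet' can be found, copy the Parquet jar (the flink-parquet artifact from section 1) into the lib directory of the Flink installation.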
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Could not find any factory for identifier 'parquet' that implements 'org.apache.flink.table.factories.FileSystemFormatFactory' in the classpath.
Available factory identifiers are:
csv
json
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:302)
at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:198)
at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:149)
at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:699)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:232)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:916)
at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
Caused by: org.apache.flink.table.api.ValidationException: Could not find any factory for identifier 'parquet' that implements 'org.apache.flink.table.factories.FileSystemFormatFactory' in the classpath.
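To confirm whether the Parquet format classes are visible to the job at all, a small diagnostic along these lines may help. It is only a sketch: FormatFactoryCheck is a hypothetical class name, and the factory class name it probes is taken from the second stack trace in this post (Flink 1.11), so it may differ in other versions.

```java
/** Diagnostic sketch: reports whether a class can be loaded from the current classpath. */
public class FormatFactoryCheck {

    // Class name as it appears in the stack trace later in this post (Flink 1.11).
    static final String PARQUET_FACTORY =
            "org.apache.flink.formats.parquet.ParquetFileSystemFormatFactory";

    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, FormatFactoryCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Either the class is absent or one of its own dependencies is missing.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(PARQUET_FACTORY + " on classpath: " + isOnClasspath(PARQUET_FACTORY));
    }
}
```

If this reports false when run inside the job, the jar is not on the classpath and copying it into lib is the fix described above. If it reports true and the error persists, one possibility worth checking is that the service registration files under META-INF/services were lost while building the fat jar, since Flink discovers format factories through Java's service loader mechanism.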
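If you see the following error, check that the partition fields are present in the SQL statement, that is, each partition field is declared as a column in CREATE TABLE and listed in PARTITIONED BY.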
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.flink.formats.parquet.ParquetFileSystemFormatFactory$ParquetInputFormat.lambda$open$0(ParquetFileSystemFormatFactory.java:171)
at java.util.LinkedHashMap.forEach(LinkedHashMap.java:684)
at org.apache.flink.formats.parquet.ParquetFileSystemFormatFactory$ParquetInputFormat.open(ParquetFileSystemFormatFactory.java:169)
at org.apache.flink.formats.parquet.ParquetFileSystemFormatFactory$ParquetInputFormat.open(ParquetFileSystemFormatFactory.java:128)
at org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:85)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:213)
Command exiting with ret '0'
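The index of -1 in the trace above is consistent with a name lookup that fails: a partition key read from a directory name is searched for in the list of declared columns, is not found, and the resulting -1 is then used as an array index. That reading is an inference from the trace, not something the post confirms. Under that assumption, a pre-flight check could be sketched as follows. PartitionPathCheck is a hypothetical helper that mimics, but is not, Flink's own path handling, and the year=2020/month=08/day=01 layout in the example is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical pre-flight check for partition keys that are not declared as columns. */
public class PartitionPathCheck {

    /** Extracts key=value directory segments from a file path, in path order. */
    static Map<String, String> partitionSpec(String path) {
        Map<String, String> spec = new LinkedHashMap<>();
        for (String segment : path.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) {
                spec.put(segment.substring(0, eq), segment.substring(eq + 1));
            }
        }
        return spec;
    }

    /** Returns the partition keys found in the path that are missing from the declared columns. */
    static List<String> undeclaredKeys(String path, List<String> declaredColumns) {
        List<String> missing = new ArrayList<>();
        for (String key : partitionSpec(path).keySet()) {
            // indexOf returns -1 for an unknown name; used as an array index,
            // that value would raise ArrayIndexOutOfBoundsException: -1.
            if (declaredColumns.indexOf(key) < 0) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // "day" is deliberately left out of the declared columns.
        List<String> columns = Arrays.asList("device_id", "last_week_click_num", "year", "month");
        String file = "/Users/klook/Downloads/user_profile/year=2020/month=08/day=01/part-0.parquet";
        System.out.println(undeclaredKeys(file, columns)); // prints [day]
    }
}
```

An empty result means every partition directory key has a matching column in the table definition.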