Spark SQL----LOAD DATA

1. Description

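The LOAD DATA statement loads data from a user-specified directory or file into a Hive SerDe table. If a directory is specified, all files in that directory are loaded. If a file is specified, only that single file is loaded. The statement also accepts an optional partition specification. When a partition is specified, the data files (when the input source is a directory) or the single file (when the input source is a file) are loaded into that partition of the target table. If the table is cached, the command clears the cached data of the table and of all dependents that refer to it. The cache is lazily refilled the next time the table or one of its dependents is accessed.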

2. Syntax

LOAD DATA [ LOCAL ] INPATH path [ OVERWRITE ] INTO TABLE table_identifier [ partition_spec ]

3. Parameters

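  • path
    A path on the file system. It can be either absolute or relative.
  • table_identifier
    Specifies a table name, optionally qualified with a database name.
    Syntax: [ database_name. ] table_name
  • partition_spec
    An optional parameter that specifies the target partition as a comma-separated list of key-value pairs.
    Syntax: PARTITION ( partition_col_name = partition_col_val [ , … ] )
  • LOCAL
    If specified, INPATH is resolved against the local file system instead of the default file system, which is typically distributed storage.
  • OVERWRITE
    By default, new data is appended to the table. If OVERWRITE is used, the table's existing data is replaced with the new data.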
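
As a rough illustration of how LOCAL and OVERWRITE combine, the sketch below reuses the test_load table created in the examples section. The path '/data/incoming/students' is hypothetical and is assumed to exist on whichever file system the statement resolves it against; it does not come from the original text.

```sql
-- Hypothetical path. LOCAL is omitted, so the path is resolved against the
-- default file system (typically distributed storage).
-- OVERWRITE is omitted, so the loaded data is appended to test_load.
LOAD DATA INPATH '/data/incoming/students' INTO TABLE test_load;

-- With LOCAL, the same path string is resolved against the local file system.
-- With OVERWRITE, the existing contents of test_load are replaced by the loaded data.
LOAD DATA LOCAL INPATH '/data/incoming/students' OVERWRITE INTO TABLE test_load;
```

The two keywords are independent of each other, so all four combinations are valid under the syntax above.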

4. Examples

-- Example without partition specification.
-- Assuming the students table has already been created and populated.
SELECT * FROM students;
+---------+----------------------+----------+
|     name|               address|student_id|
+---------+----------------------+----------+
|Amy Smith|123 Park Ave, San Jose|    111111|
+---------+----------------------+----------+

CREATE TABLE test_load (name VARCHAR(64), address VARCHAR(64), student_id INT) USING HIVE;

-- Assuming the students table is in '/user/hive/warehouse/'
LOAD DATA LOCAL INPATH '/user/hive/warehouse/students' OVERWRITE INTO TABLE test_load;

SELECT * FROM test_load;
+---------+----------------------+----------+
|     name|               address|student_id|
+---------+----------------------+----------+
|Amy Smith|123 Park Ave, San Jose|    111111|
+---------+----------------------+----------+

-- Example with partition specification.
CREATE TABLE test_partition (c1 INT, c2 INT, c3 INT) PARTITIONED BY (c2, c3);

INSERT INTO test_partition PARTITION (c2 = 2, c3 = 3) VALUES (1);

INSERT INTO test_partition PARTITION (c2 = 5, c3 = 6) VALUES (4);

INSERT INTO test_partition PARTITION (c2 = 8, c3 = 9) VALUES (7);

SELECT * FROM test_partition;
+---+---+---+
| c1| c2| c3|
+---+---+---+
|  1|  2|  3|
|  4|  5|  6|
|  7|  8|  9|
+---+---+---+

CREATE TABLE test_load_partition (c1 INT, c2 INT, c3 INT) USING HIVE PARTITIONED BY (c2, c3);

-- Assuming the test_partition table is in '/user/hive/warehouse/'
LOAD DATA LOCAL INPATH '/user/hive/warehouse/test_partition/c2=2/c3=3'
    OVERWRITE INTO TABLE test_load_partition PARTITION (c2=2, c3=3);

SELECT * FROM test_load_partition;
+---+---+---+
| c1| c2| c3|
+---+---+---+
|  1|  2|  3|
+---+---+---+
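
-- Sketch (not part of the original examples): LOAD DATA on a cached table.
-- Assumes the test_load table from the first example exists.
CACHE TABLE test_load;

-- Loading into a cached table clears the cached data of the table
-- and of any dependents that refer to it.
LOAD DATA LOCAL INPATH '/user/hive/warehouse/students' INTO TABLE test_load;

-- The next access lazily refills the cache, so this query sees the loaded data.
SELECT * FROM test_load;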