TRANSFORM
The TRANSFORM
clause is used to specify a Hive-style transform query specification to transform the inputs by running a user-specified command or script.
Spark’s script transform supports two modes:
- Hive support disabled: Spark script transform can run with
spark.sql.catalogImplementation=in-memory
or withoutSparkSession.builder.enableHiveSupport()
. In this case, now Spark only uses the script transform withROW FORMAT DELIMITED
and treats all values passed to the script as strings. - Hive support enabled: When Spark is run with
spark.sql.catalogImplementation=hive
or Spark SQL is started withSparkSession.builder.enableHiveSupport()
, Spark can use the script transform with bothHive SerDe and ROW FORMAT DELIMITED
.
Syntax
SELECT TRANSFORM ( expression [ , ... ] )
[ ROW FORMAT row_format ]
[ RECORDWRITER record_writer_class ]
USING command_or_script [ AS ( [ col_name [ col_type ] ] [ , ... ] ) ]
[ ROW FORMAT row_format ]
[ RECORDREADER record_reader_class ]
Parameters
-
expression
Specifies a combination of one or more values, operators and SQL functions that results in a value.
-
row_format
Specifies the row format for input and output. See HIVE FORMAT for more syntax details.
-
RECORDWRITER
Specifies a fully-qualified class name of a custom RecordWriter. The default value is
org.apache.hadoop.hive.ql.exec.TextRecordWriter
. -
RECORDREADER
Specifies a fully-qualified class name of a custom RecordReader. The default value is
org.apache.hadoop.hive.ql.exec.TextRecordReader
. -
command_or_script
Specifies a command or a path to script to process data.
ROW FORMAT DELIMITED BEHAVIOR
When Spark uses ROW FORMAT DELIMITED format:
-
Spark uses the character
\u0001
as the default field delimiter and this delimiter can be overridden by FIELDSTERMINATED BY
. -
Spark uses the character
\n
as the default line delimiter and this delimiter can be overridden by LINESTERMINATED BY
. -
Spark uses a string \N as the default NULL value in order to differentiate NULL values from the literal string NULL. This delimiter can be overridden by NULL DEFINED AS.
-
Spark casts all columns to STRING and combines columns by tabs before feeding to the user script. For complex types such as
ARRAY/MAP/STRUCT
, Spark uses to_json casts it to an inputJSON
string and usesfrom_json
to convert the result outputJSON
string toARRAY/MAP/STRUCT data
. -
COLLECTION ITEMS TERMINATED BY
andMAP KEYS TERMINATED BY
are delimiters to split complex data such asARRAY/MAP/STRUCT
, Spark usesto_json
andfrom_json
to handle complex data types withJSON
format. SoCOLLECTION ITEMS TERMINATED BY
andMAP KEYS TERMINATED BY
won’t work in default row format. -
The standard output of the user script is treated as tab-separated STRING columns. Any cell containing only a string
\N
is re-interpreted as a literalNULL
value, and then the resultingSTRING
column will be cast to the data types specified incol_type
. -
If the actual number of output columns is less than the number of specified output columns, additional output columns will be filled with
NULL
. For example:output tabs: 1, 2
output columns: A: INT, B INT, C: INT
result:
+---+---+------+
| a| b| c|
+---+---+------+
| 1| 2| NULL|
+---+---+------+ -
If the actual number of output columns is more than the number of specified output columns, the output columns only select the corresponding columns, and the remaining part will be discarded. For example, if the output has three tabs and there are only two output columns:
output tabs: 1, 2, 3
output columns: A: INT, B INT
result:
+---+---+
| a| b|
+---+---+
| 1| 2|
+---+---+ -
If there is no AS clause after USING my_script, the output schema is key: STRING, value: STRING. The key column contains all the characters before the first tab and the value column contains the remaining characters after the first tab. If there are no tabs, Spark returns the NULL value. For example:
output tabs: 1, 2, 3
output columns:
result:
+-----+-------+
| key| value|
+-----+-------+
| 1| 2|
+-----+-------+
output tabs: 1, 2
output columns:
result:
+-----+-------+
| key| value|
+-----+-------+
| 1| NULL|
+-----+-------+
Hive SerDe behavior
When Hive support is enabled and Hive SerDe mode is used:
-
Spark uses the Hive SerDe
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
by default, so columns are cast to STRING and combined by tabs before feeding to the user script. -
All literal
NULL
values are converted to a string\N
in order to differentiate literalNULL
values from the literal stringNULL
. -
The standard output of the user script is treated as tab-separated
STRING
columns, any cell containing only a string\N
is re-interpreted as aNULL
value, and then the resultingSTRING
column will be cast to the data type specified incol_type
. -
If the actual number of output columns is less than the number of specified output columns, additional output columns will be filled with
NULL
. -
If the actual number of output columns is more than the number of specified output columns, the output columns only select the corresponding columns, and the remaining part will be discarded.
-
If there is no AS clause after
USING my_script
, the output schema is key:STRING
, value:STRING
. The key column contains all the characters before the first tab and the value column contains the remaining characters after the first tab. If there is no tab, Spark returns theNULL
value. -
These defaults can be overridden with
ROW FORMAT SERDE
orROW FORMAT DELIMITED
.
Examples
CREATE TABLE person (zip_code INT, name STRING, age INT);
INSERT INTO person VALUES
(94588, 'Zen Hui', 50),
(94588, 'Dan Li', 18),
(94588, 'Anil K', 27),
(94588, 'John V', NULL),
(94511, 'David K', 42),
(94511, 'Aryan B.', 18),
(94511, 'Lalit B.', NULL);
-- With specified output without data type
SELECT TRANSFORM(zip_code, name, age)
USING 'cat' AS (a, b, c)
FROM person
WHERE zip_code > 94511;
+-------+---------+-----+
| a | b| c|
+-------+---------+-----+
| 94588| Anil K| 27|
| 94588| John V| NULL|
| 94588| Zen Hui| 50|
| 94588| Dan Li| 18|
+-------+---------+-----+
-- With specified output with data type
SELECT TRANSFORM(zip_code, name, age)
USING 'cat' AS (a STRING, b STRING, c STRING)
FROM person
WHERE zip_code > 94511;
+-------+---------+-----+
| a | b| c|
+-------+---------+-----+
| 94588| Anil K| 27|
| 94588| John V| NULL|
| 94588| Zen Hui| 50|
| 94588| Dan Li| 18|
+-------+---------+-----+
-- Using ROW FORMAT DELIMITED
SELECT TRANSFORM(name, age)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
NULL DEFINED AS 'NULL'
USING 'cat' AS (name_age string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '@'
LINES TERMINATED BY '\n'
NULL DEFINED AS 'NULL'
FROM person;
+---------------+
| name_age|
+---------------+
| Anil K,27|
| John V,null|
| ryan B.,18|
| David K,42|
| Zen Hui,50|
| Dan Li,18|
| Lalit B.,null|
+---------------+
-- Using Hive Serde
SELECT TRANSFORM(zip_code, name, age)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim' = '\t'
)
USING 'cat' AS (a STRING, b STRING, c STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim' = '\t'
)
FROM person
WHERE zip_code > 94511;
+-------+---------+-----+
| a | b| c|
+-------+---------+-----+
| 94588| Anil K| 27|
| 94588| John V| NULL|
| 94588| Zen Hui| 50|
| 94588| Dan Li| 18|
+-------+---------+-----+
-- Schema-less mode
SELECT TRANSFORM(zip_code, name, age)
USING 'cat'
FROM person
WHERE zip_code > 94500;
+-------+---------------------+
| key| value|
+-------+---------------------+
| 94588| Anil K 27|
| 94588| John V \N|
| 94511| Aryan B. 18|
| 94511| David K 42|
| 94588| Zen Hui 50|
| 94588| Dan Li 18|
| 94511| Lalit B. \N|
+-------+---------------------+