Hive built-in function summary

Time: 2022-01-02

1. Related help operation functions

View built-in functions: show functions;
Display function details: desc function abs;
Display extended function information: desc function extended concat;

2. A general approach to learning the built-in functions

Step 1: read through the output of the show functions command to build an overall impression of what is available
Step 2: use desc function extended function_name to check the detailed usage of a given function
Step 3: if a function has been found through the steps above but its usage is still unclear, search the web for keywords in the format "hive function_name"
Step 4: if no single built-in function meets the need, try combining the functions above

3. Shortcuts for testing built-in functions

The first way: call the function directly, without a FROM clause, for example:

select concat('a','a')

The second way: create a dual table to help us write the complete SQL

1. Create a dual table: create table dual(id string);
2. Load a file containing a single line with one space into the dual table (see the full sequence below)
3. select substr('huangbo',2,3) from dual;
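
The whole sequence looks like this (the local file path is only an example; any one-line file works, and the expected result comment reflects standard Hive behavior):

create table dual(id string);
load data local inpath '/home/bigdata/dual.txt' into table dual;
select substr('huangbo',2,3) from dual;    -- uan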

4. Built-in function list

1、 Relational operation

1. Equality comparison: =
2. Null-safe equality comparison: <=>
3. Inequality comparison: <> and !=
4. Less-than comparison: <
5. Less-than-or-equal comparison: <=
6. Greater-than comparison: >
7. Greater-than-or-equal comparison: >=
8. Interval comparison: between ... and ...
9. Null check: is null
10. Non-null check: is not null
11. LIKE comparison: like
12. Java-style regular expression match: rlike
13. REGEXP operation: regexp
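
A few quick sanity checks, runnable directly in Hive (the results in the comments reflect standard Hive behavior):

select 1 = 1, 2 <> 3, 2 <= 3;           -- true  true  true
select null <=> null;                   -- true (null-safe equality)
select 5 between 1 and 10;              -- true
select 'huangbo' like 'huang%';         -- true
select 'huangbo' rlike '^huang';        -- true
select null is null, 'a' is not null;   -- true  true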

2、 Mathematical operation

1. Addition: +
2. Subtraction: -
3. Multiplication: *
4. Division: /
5. Modulo (remainder): %
6. Bitwise AND: &
7. Bitwise OR: |
8. Bitwise XOR: ^
9. Bitwise NOT (inversion): ~
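
For example (note that / always returns a double in Hive):

select 7 + 2, 7 - 2, 7 * 2, 7 / 2, 7 % 2;   -- 9  5  14  3.5  1
select 5 & 3, 5 | 3, 5 ^ 3, ~5;             -- 1  7  6  -6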

3、 Logical operation

1. Logical AND: and, &&
2. Logical OR: or, ||
3. Logical NOT: not, !
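
For example:

select true and false, true or false, not true;   -- false  true  false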

4、 Composite type constructor

1. map construction: map
2. struct construction: struct
3. named_struct construction: named_struct
4. array construction: array
5. union construction: create_union
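
A few construction examples (the exact output rendering may differ slightly between Hive versions):

select map('a', 1, 'b', 2);                       -- {"a":1,"b":2}
select struct(1, 'huangbo');                      -- {"col1":1,"col2":"huangbo"}
select named_struct('id', 1, 'name', 'huangbo');  -- {"id":1,"name":"huangbo"}
select array(1, 2, 3);                            -- [1,2,3]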

5、 Compound type operator

1. Get elements in array
2. Get the elements in the map
3. Get elements in struct
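
For example (indexing constructed literals directly like this works on recent Hive versions; otherwise read the values from a table column first):

select array('a', 'b', 'c')[0];                        -- a
select map('k1', 1, 'k2', 2)['k2'];                    -- 2
select named_struct('id', 1, 'name', 'huangbo').name;  -- huangbo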

6、 Numerical calculation function

1. Rounding function: round
2. Rounding to a specified precision: round
3. Round-down function: floor
4. Round-up function: ceil
5. Round-up function: ceiling
6. Random number function: rand
7. Natural exponential function: exp
8. Base-10 logarithm function: log10
9. Base-2 logarithm function: log2
10. Logarithm function: log
11. Power function: pow
12. Power function: power
13. Square root function: sqrt
14. Binary function: bin
15. Hexadecimal function: hex
16. Reverse hex function: unhex
17. Base conversion function: conv
18. Absolute value function: abs
19. Positive modulo function: pmod
20. Sine function: sin
21. Arcsine function: asin
22. Cosine function: cos
23. Arccosine function: acos
24. Positive function: positive
25. Negative function: negative
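
Some representative examples:

select round(3.1415926, 2);                  -- 3.14
select floor(3.7), ceil(3.1);                -- 3  4
select rand();                               -- a random double in [0, 1)
select pow(2, 10), sqrt(16);                 -- 1024.0  4.0
select bin(7), hex(255), conv('ff', 16, 10); -- 111  FF  255
select abs(-3.9), pmod(-7, 3);               -- 3.9  2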

7、 Set operation function

1. Map size function: size
2. Array size function: size
3. Check whether an array contains an element: array_contains
4. Get all values in a map: map_values
5. Get all keys in a map: map_keys
6. Array sorting function: sort_array
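
For example:

select size(map('a', 1, 'b', 2));          -- 2
select size(array(1, 2, 3));               -- 3
select array_contains(array(1, 2, 3), 2);  -- true
select map_keys(map('a', 1, 'b', 2));      -- ["a","b"]
select map_values(map('a', 1, 'b', 2));    -- [1,2]
select sort_array(array(3, 1, 2));         -- [1,2,3]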

8、 Type conversion function

1. Binary conversion: binary
2. Cast between primitive types: cast
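
For example:

select cast('2022' as int);   -- 2022
select cast(3.7 as int);      -- 3 (truncates toward zero)
select binary('hive');        -- the string as a binary value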

9、 Date function

1. UNIX timestamp to date function: from_unixtime
2. Get current UNIX timestamp function: unix_timestamp
3. Date to UNIX timestamp function: unix_timestamp
4. Date in a specified format to UNIX timestamp function: unix_timestamp
5. Datetime to date function: to_date
6. Date to year function: year
7. Date to month function: month
8. Date to day function: day
9. Date to hour function: hour
10. Date to minute function: minute
11. Date to second function: second
12. Date to week-of-year function: weekofyear
13. Date difference function: datediff
14. Date addition function: date_add
15. Date subtraction function: date_sub
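
A few examples (the timestamp conversions depend on the session time zone, so their results are not shown):

select unix_timestamp();                            -- current UNIX timestamp
select unix_timestamp('2022-01-02 00:00:00');       -- datetime string to timestamp
select unix_timestamp('02/01/2022', 'dd/MM/yyyy');  -- parse with a custom format
select from_unixtime(978300760);                    -- timestamp back to a datetime string
select to_date('2022-01-02 10:30:00');              -- 2022-01-02
select year('2022-01-02'), month('2022-01-02'), day('2022-01-02');  -- 2022  1  2
select datediff('2022-01-02', '2021-12-31');        -- 2
select date_add('2022-01-02', 7);                   -- 2022-01-09
select date_sub('2022-01-02', 7);                   -- 2021-12-26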

10、 Conditional function

1. If function: if
2. Non-null lookup function: coalesce
3. Conditional expression: case
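
For example:

select if(1 < 2, 'yes', 'no');                  -- yes
select coalesce(null, null, 'fallback');        -- fallback
select case when 1 = 2 then 'a' else 'b' end;   -- b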

11、 String function

1. Character ASCII code function: ascii
2. Base64 encoding function: base64
3. String concatenation function: concat
4. String concatenation with a separator: concat_ws
5. Convert an array to a string: concat_ws
6. Format a number to a string with fixed decimal places: format_number
7. Substring function: substr, substring
8. Substring function: substr, substring
9. String search function: instr
10. String length function: length
11. String search function: locate
12. String formatting function: printf
13. Convert a string to a map: str_to_map
14. Base64 decoding function: unbase64(string str)
15. Convert to uppercase: upper, ucase
16. Convert to lowercase: lower, lcase
17. Trim whitespace: trim
18. Trim leading whitespace: ltrim
19. Trim trailing whitespace: rtrim
20. Regular expression replacement function: regexp_replace
21. Regular expression extraction function: regexp_extract
22. URL parsing function: parse_url
23. JSON parsing function: get_json_object
24. Space string function: space
25. Repeat string function: repeat
26. Left-pad function: lpad
27. Right-pad function: rpad
28. Split string function: split
29. Set lookup function: find_in_set
30. Sentence splitting function: sentences
31. Count the top-k most frequent n-grams after tokenization: ngrams
32. Count the top-k n-grams that co-occur with a specified word after tokenization: context_ngrams
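
Some representative examples:

select concat('huang', 'bo');                           -- huangbo
select concat_ws('-', 'a', 'b', 'c');                   -- a-b-c
select concat_ws(',', array('x', 'y'));                 -- x,y
select substr('huangbo', 2, 3);                         -- uan
select instr('huangbo', 'bo'), length('huangbo');       -- 6  7
select lpad('7', 3, '0'), repeat('ab', 2);              -- 007  abab
select split('a,b,c', ',');                             -- ["a","b","c"]
select regexp_replace('ab-cd', '-', '');                -- abcd
select regexp_extract('key=value', 'key=(.*)', 1);      -- value
select get_json_object('{"movie":"1193"}', '$.movie');  -- 1193
select find_in_set('b', 'a,b,c');                       -- 2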

12、 Mixed function

1. Call a Java method: java_method
2. Call a Java method: reflect
3. Hash value of a string: hash
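
For example (java_method is a synonym for reflect in recent Hive versions):

select reflect('java.lang.Math', 'max', 2, 3);     -- 3
select java_method('java.lang.Math', 'abs', -5);   -- 5
select hash('huangbo');                            -- an int hash code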

13、 XPath parsing XML functions

1. xpath
2. xpath_string
3. xpath_boolean
4. xpath_short, xpath_int, xpath_long
5. xpath_float, xpath_double, xpath_number
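
For example:

select xpath('<a><b>b1</b><b>b2</b></a>', 'a/b/text()');  -- ["b1","b2"]
select xpath_string('<a><b>bb</b><c>cc</c></a>', 'a/b');  -- bb
select xpath_int('<a><b>2</b></a>', 'a/b');               -- 2
select xpath_boolean('<a><b>1</b></a>', 'a/b');           -- true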

14、 Summary statistics function (udaf)

1. Count function: count
2. Sum function: sum
3. Average function: avg
4. Minimum function: min
5. Maximum function: max
6. Population variance of a non-empty set: var_pop
7. Sample variance of a non-empty set: var_samp
8. Population standard deviation function: stddev_pop
9. Sample standard deviation function: stddev_samp
10. Median function: percentile
11. Median function: percentile
12. Approximate median function: percentile_approx
13. Approximate median function: percentile_approx
14. Histogram: histogram_numeric
15. Collect values into a de-duplicated set: collect_set
16. Collect values into a list (duplicates kept): collect_list
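
For example, assuming a hypothetical table score(sid string, grade int):

select count(*), sum(grade), avg(grade), min(grade), max(grade) from score;
select var_pop(grade), var_samp(grade), stddev_pop(grade), stddev_samp(grade) from score;
select percentile(cast(grade as bigint), 0.5), percentile_approx(grade, 0.5) from score;
select collect_set(grade), collect_list(grade) from score;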

15、 Table generating functions (udtf)

1. Expand an array into multiple rows: explode(array)
2. Expand a map into multiple rows: explode(map)
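
For example:

select explode(array(1, 2, 3));        -- three rows: 1, 2, 3
select explode(map('a', 1, 'b', 2));   -- two rows, each with a key column and a value column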

Hive custom function UDF

When the built-in functions provided by Hive cannot meet the needs of business processing, you can consider using user-defined functions.

Function types explained:
UDF (user-defined function): operates on a single data row and produces a single row as output, e.g. mathematical and string functions; equivalent to a map operation: one input, one output
UDAF (user-defined aggregation function): receives multiple input rows and produces one output row, e.g. count, max; equivalent to an aggregation: many inputs, one output
UDTF (user-defined table-generating function): receives one input row and produces multiple output rows, e.g. explode; equivalent to an explode operation: one input, many outputs

A simple UDF example

1. First, develop a simple Java class that extends org.apache.hadoop.hive.ql.exec.UDF and overloads the evaluate method

package com.naixue.hive.udf;

import java.util.HashMap;
import org.apache.hadoop.hive.ql.exec.UDF;

public class ToLowerCase extends UDF {
    // The evaluate method must be public and can be overloaded
    public String evaluate(String field) {
        String result = field.toLowerCase();
        return result;
    }
    // Different logic can be run depending on the parameters passed in
    public int evaluate(int a, int b) {
        return a + b;
    }
}

2. Package it into a jar and upload it to the server
3. Add the jar package to hive’s classpath

hive> add JAR /home/bigdata/hivejar/udf.jar;
hive> list jar;

4. Create a temporary function to associate with the developed class

hive> create temporary function tolowercase as
'com.naixue.hive.udf.ToLowerCase';

5. At this point, you can use custom functions in HQL

select tolowercase(name), age from student;

JSON data parsing UDF development
The existing original JSON data (rating.json) is as follows:

{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
.....
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
{"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"}
{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}

Now you need to import the data into the hive warehouse and finally get the following result:

movie rate timeStamp uid
1193 5 978300760 1

What should I do? (Note: it can be completed with built-in get_json_object or user-defined function)

Transform implementation

Hive's TRANSFORM keyword makes it possible to call self-written scripts from SQL. It is suitable when the required functionality is not available in Hive and you do not want to write a UDF.
Let's walk through an example.
JSON data:

{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}

Requirement: convert the timeStamp value into a day-of-week number
1. First load the rating.json file into a raw Hive table rate_json

create table rate_json(line string) row format delimited;
load data local inpath '/home/bigdata/rating.json' into table rate_json;

2. Create a rate table to store the fields parsed from JSON:

create table rate(movie int, rate int, unixtime int, userid int) row format
delimited fields terminated by '\t';

Parse JSON and store the result in the rate table:

insert into table rate select
get_json_object(line,'$.movie') as movie,
get_json_object(line,'$.rate') as rate,
get_json_object(line,'$.timeStamp') as unixtime,
get_json_object(line,'$.uid') as userid
from rate_json;

3. Use transform + Python to convert unixtime to weekday
First edit a Python script file: weekday_mapper.py

vi weekday_mapper.py

The code is as follows:

#!/bin/python
import sys
import datetime

# Read tab-separated rows from Hive, convert the UNIX timestamp to an
# ISO weekday number, and emit movie, rate, weekday, userid
for line in sys.stdin:
    line = line.strip()
    movie, rate, unixtime, userid = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([movie, rate, str(weekday), userid])

Save the file.
Then, add the file to hive’s classpath:

hive> add file /home/bigdata/weekday_mapper.py;
hive> insert into table lastjsontable select
transform(movie,rate,unixtime,userid)
using 'python weekday_mapper.py' as(movie,rate,weekday,userid) from rate;

Note: the target table lastjsontable referenced in the INSERT above must be created first:

create table lastjsontable(movie int, rate int, weekday int, userid int) row
format delimited fields terminated by '\t';

Finally, check whether the data is correct:

select distinct(weekday) from lastjsontable;

Hive special separator handling

Supplement: Hive's data reading mechanism:
1. First, a concrete implementation class of InputFormat (default: org.apache.hadoop.mapred.TextInputFormat) reads the file data and returns records one by one (these can be physical lines, or whatever counts as a "line" in your logic)
2. Then a concrete implementation class of SerDe (default: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe) splits each record returned above into fields
To learn more about SerDe, see: https://cwiki.apache.org/confluence/display/Hive/SerDe
3. InputFormat and SerDe work together:

HDFS files –> InputFileFormat –> <key, value> –> Deserializer –> Row object
Row object –> Serializer –> <key, value> –> OutputFileFormat –> HDFS files

4. Example test
By default, Hive only supports single-byte field separators in data files. Suppose the separator in the data file multi_delim.txt is a multi-character string, as shown below:

01||huangbo
02||xuzheng
03||wangbaoqiang

Please note: if you use || as the separator, the table will be created without error, but the data will be parsed incorrectly. This is because Hive's default SerDe does not support multi-byte separators; the supported delimiter type is a single char.

Use RegexSerDe to extract fields through regular expressions

Create table:

drop table if exists multi_delim;
create table multi_delim(id string, name string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)\\|\\|(.*)','output.format.string'='%1$s %2$s')
stored as textfile;

multi_delim.txt data is as follows:

01||huangbo
02||xuzheng
03||wangbaoqiang

Import data:

load data local inpath '/home/bigdata/hivedata/multi_delim.txt' into table
multi_delim;

Use MultiDelimitSerDe to handle multi-byte delimiters

Create table:

drop table if exists multi_delim2;
CREATE TABLE multi_delim2 (id STRING, name STRING, city STRING) ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' WITH
SERDEPROPERTIES("field.delim"="^|~");

Import data:

load data local inpath '/home/bigdata/hivedata/multi_delim2.txt' into table
multi_delim2;

multi_delim2.txt data format:

1^|~huangbo^|~beijing
2^|~xuzheng^|~shanghai
3^|~wangbaoqiang^|~tianjin

Query the data:

select id, name, city from multi_delim2;

Query results:

(Result screenshot omitted; the output shows the three rows above split into id, name, and city columns.)

Solve special separator problems with a custom InputFormat

The principle: when the InputFormat reads a row, replace the "multi-byte separator" in the data with Hive's default separator (Ctrl+A, i.e. \x01) or with a substitute single-character separator, so that Hive can split fields by its default single-byte separator during the SerDe step.
The com.naixue.hive.delimit2.BiDelimiterInputFormat code is as follows:

package com.naixue.hive.delimit2;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class BiDelimiterInputFormat extends TextInputFormat {
    @Override
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit genericSplit,
            JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(genericSplit.toString());
        BiRecordReader reader = new BiRecordReader(job, (FileSplit) genericSplit);
        // MyRecordReader reader = new MyRecordReader(job, (FileSplit) genericSplit);
        return reader;
    }
}

The com.naixue.hive.delimit2.BiRecordReader code is as follows:

package com.naixue.hive.delimit2;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Seekable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.Decompressor;
import org.apache.hadoop.io.compress.SplitCompressionInputStream;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
public class BiRecordReader implements RecordReader<LongWritable, Text> {
    private static final Log LOG =
            LogFactory.getLog(LineRecordReader.class.getName());
    private CompressionCodecFactory compressionCodecs = null;
    private long start;
    private long pos;
    private long end;
    private LineReader in;
    int maxLineLength;
    private Seekable filePosition;
    private CompressionCodec codec;
    private Decompressor decompressor;

    /**
     * A class that provides a line reader from an input stream.
     * @deprecated Use {@link org.apache.hadoop.util.LineReader} instead.
     */
    @Deprecated
    public static class LineReader extends org.apache.hadoop.util.LineReader {
        LineReader(InputStream in) {
            super(in);
        }
        LineReader(InputStream in, int bufferSize) {
            super(in, bufferSize);
        }
        public LineReader(InputStream in, Configuration conf)
                throws IOException {
            super(in, conf);
        }
    }

    public BiRecordReader(Configuration job, FileSplit split) throws IOException {
        this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength",
                Integer.MAX_VALUE);
        start = split.getStart();
        end = start + split.getLength();
        final Path file = split.getPath();
        compressionCodecs = new CompressionCodecFactory(job);
        codec = compressionCodecs.getCodec(file);
        // open the file and seek to the start of the split
        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream fileIn = fs.open(split.getPath());
        if (isCompressedInput()) {
            decompressor = CodecPool.getDecompressor(codec);
            if (codec instanceof SplittableCompressionCodec) {
                final SplitCompressionInputStream cIn =
                        ((SplittableCompressionCodec) codec)
                                .createInputStream(fileIn, decompressor, start, end,
                                        SplittableCompressionCodec.READ_MODE.BYBLOCK);
                in = new LineReader(cIn, job);
                start = cIn.getAdjustedStart();
                end = cIn.getAdjustedEnd();
                filePosition = cIn; // take pos from compressed stream
            } else {
                in = new LineReader(codec.createInputStream(fileIn,
                        decompressor), job);
                filePosition = fileIn;
            }
        } else {
            fileIn.seek(start);
            in = new LineReader(fileIn, job);
            filePosition = fileIn;
        }
        // If this is not the first split, we always throw away first record
        // because we always (except the last split) read one extra line in
        // next() method.
        if (start != 0) {
            start += in.readLine(new Text(), 0, maxBytesToConsume(start));
        }
        this.pos = start;
    }

    private boolean isCompressedInput() {
        return (codec != null);
    }

    private int maxBytesToConsume(long pos) {
        return isCompressedInput() ? Integer.MAX_VALUE : (int) Math.min(
                Integer.MAX_VALUE, end - pos);
    }

    private long getFilePosition() throws IOException {
        long retVal;
        if (isCompressedInput() && null != filePosition) {
            retVal = filePosition.getPos();
        } else {
            retVal = pos;
        }
        return retVal;
    }

    public BiRecordReader(InputStream in, long offset, long endOffset,
            int maxLineLength) {
        this.maxLineLength = maxLineLength;
        this.in = new LineReader(in);
        this.start = offset;
        this.pos = offset;
        this.end = endOffset;
        this.filePosition = null;
    }

    public BiRecordReader(InputStream in, long offset, long endOffset,
            Configuration job) throws IOException {
        this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength",
                Integer.MAX_VALUE);
        this.in = new LineReader(in, job);
        this.start = offset;
        this.pos = offset;
        this.end = endOffset;
        this.filePosition = null;
    }

    public LongWritable createKey() {
        return new LongWritable();
    }

    public Text createValue() {
        return new Text();
    }

    /** Read a line. */
    public synchronized boolean next(LongWritable key, Text value)
            throws IOException {
        // We always read one extra line, which lies outside the upper
        // split limit i.e. (end - 1)
        while (getFilePosition() <= end) {
            key.set(pos);
            // Key code: read the line, then replace the two-character
            // separator "||" with the single character "|"
            int newSize = in.readLine(value, maxLineLength,
                    Math.max(maxBytesToConsume(pos), maxLineLength));
            String str = value.toString().replaceAll("\\|\\|", "\\|");
            value.set(str);
            pos += newSize;
            if (newSize == 0) {
                return false;
            }
            if (newSize < maxLineLength) {
                return true;
            }
            // line too long. try again
            LOG.info("Skipped line of size " + newSize + " at pos "
                    + (pos - newSize));
        }
        return false;
    }

    /**
     * Get the progress within the split
     */
    public float getProgress() throws IOException {
        if (start == end) {
            return 0.0f;
        } else {
            return Math.min(1.0f, (getFilePosition() - start)
                    / (float) (end - start));
        }
    }

    public synchronized long getPos() throws IOException {
        return pos;
    }

    public synchronized void close() throws IOException {
        try {
            if (in != null) {
                in.close();
            }
        } finally {
            if (decompressor != null) {
                CodecPool.returnDecompressor(decompressor);
            }
        }
    }
}

Note:
1. All of the code above uses Hadoop's old API (org.apache.hadoop.mapred.*). Package the project, copy the jar into the lib folder of the Hive installation directory, restart Hive, and then create the table with the following statements:

hive> create table new_bi(id string,name string) row format delimited fields
terminated by '|' stored as inputformat
'com.naixue.hive.delimit2.BiDelimiterInputFormat' outputformat
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
hive> load data local inpath '/home/bigdata/bi.dat' into table new_bi;
hive> select * from new_bi;
OK
01 huangbo
02 xuzheng
03 wangbaoqiang

2. When querying this table with HQL, you also need to add the custom jar in Hive so that it is shipped to the map tasks:

hive> add jar /home/bigdata/apps/hive/lib/myinput.jar;
