Big data analysis engine: Presto

Time: 2020-9-19


1、 What is Presto?

  • Background: Hive’s shortcomings and Presto’s origins

Hive uses MapReduce as its underlying computing framework and was designed for batch processing. As data volumes grow, however, even a simple Hive query may take anywhere from several minutes to several hours, which clearly cannot satisfy interactive query requirements. Presto is a distributed SQL query engine designed for high-speed, real-time data analysis. It supports standard ANSI SQL, including complex queries, aggregations, joins, and window functions. Two aspects are worth exploring: first, its architecture, and second, how it achieves the low latency needed for interactive use.

  • What is Presto?

Presto is an open-source distributed SQL query engine suited for interactive analytical queries, supporting data volumes from gigabytes to petabytes. Presto was designed and written to solve the problem of interactive analysis at acceptable processing speed in a business data warehouse of Facebook's scale.

  • What can it do?

Presto supports online queries across Hive, Cassandra, relational databases, and proprietary data stores. A single Presto query can merge data from multiple sources, enabling analysis across an entire organization. Presto targets analysts who expect response times ranging from under a second to a few minutes. Presto ended the dilemma of choosing between a fast but expensive commercial solution and a slow "free" solution that consumes large amounts of hardware.

  • Who is using it?

Facebook uses Presto for interactive queries against multiple internal data stores, including a 300 PB data warehouse. More than 1,000 Facebook employees use Presto every day, executing more than 30,000 queries and scanning more than 1 PB of data. Leading Internet companies, including Airbnb and Dropbox, also use Presto.

2、 Presto’s architecture

Presto is a distributed system running on multiple servers. A complete installation includes one coordinator and multiple workers. The client submits queries from the Presto command-line interface (CLI) to the coordinator. The coordinator parses and analyzes the query, builds the execution plan, and then distributes the processing tasks to the workers.

[Figure: Presto architecture]

The Presto query engine uses a master-slave architecture consisting of one coordinator node, one discovery server node, and multiple worker nodes. The discovery server is usually embedded in the coordinator node. The coordinator is responsible for parsing SQL statements, generating execution plans, and distributing execution tasks to the worker nodes, which perform the actual query work. When a worker node starts, it registers with the discovery server, and the coordinator obtains the list of available worker nodes from the discovery server. If the Hive connector is configured, a Hive Metastore service must be running to provide Hive metadata to Presto, and the worker nodes read the data directly from HDFS.

3、 Install Presto server

  • Installation media
presto-cli-0.217-executable.jar
presto-server-0.217.tar.gz
  • Install and configure Presto server

1. Unzip the installation package

tar -zxvf presto-server-0.217.tar.gz -C ~/training/

2. Create etc directory

cd ~/training/presto-server-0.217/
mkdir etc

3. The etc directory must contain the following configuration files (the resulting layout is sketched after the list)

node.properties: node-specific configuration
jvm.config: command-line options for the JVM
config.properties: configuration of the Presto server
catalog properties: configuration of the data source connectors
log.properties: logging configuration
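
Once these files are created (the catalog subdirectory is added in the catalog properties step below), the etc directory will look roughly like this:

etc/
├── node.properties
├── jvm.config
├── config.properties
├── log.properties
└── catalog/
    └── hive.properties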
  • edit node.properties
#The environment name. All Presto nodes in the same cluster must have the same environment name.
node.environment=production

#A unique identifier for each Presto node. The node.id of every node must be unique, and it must remain the same across Presto restarts and upgrades. If multiple Presto instances are installed on the same machine (i.e., multiple Presto nodes on one host), each of them must have its own unique node.id.
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff

#The location of the data directory (a path on the operating system). Presto stores logs and other data in this directory.
node.data-dir=/root/training/presto-server-0.217/data
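
A fresh unique value for node.id can be generated, for example, with the standard uuidgen utility:

uuidgen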
  • edit jvm.config

Since an OutOfMemoryError leaves the JVM in an inconsistent state, the usual treatment when such an error occurs is to collect a heap dump (for debugging) and then forcibly terminate the process. Presto compiles queries into bytecode, so it generates many classes; therefore we should increase the size of the permanent generation (where class metadata is stored) and allow JVM class unloading.

-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
  • edit config.properties

Configuration of Coordinator

coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://192.168.157.226:8080

Configuration of workers

coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://192.168.157.226:8080

If you want to test on a single machine that acts as both coordinator and worker, use the following configuration:

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://192.168.157.226:8080

Parameter Description:

coordinator: whether this node acts as the coordinator, accepting queries from clients and managing query execution
node-scheduler.include-coordinator: whether the coordinator is also scheduled to run worker tasks
http-server.http.port: the HTTP port the Presto server listens on; Presto uses HTTP for all communication
query.max-memory: the maximum amount of distributed memory a single query may use across the cluster
query.max-memory-per-node: the maximum amount of user memory a single query may use on any one node
query.max-total-memory-per-node: the maximum total (user plus system) memory a single query may use on any one node
discovery-server.enabled: whether to run an embedded discovery server in this Presto instance
discovery.uri: the URI of the discovery server; it must match the coordinator's host and port

  • edit log.properties
#Log level
com.facebook.presto=INFO
  • Configure catalog properties

Presto accesses data through connectors, which are mounted in catalogs. A connector provides all the schemas and tables inside a catalog. For example, the Hive connector maps each Hive database to a schema. So if the Hive connector is mounted as a catalog named hive, and Hive's web database contains a table named clicks, that table can be accessed in Presto as hive.web.clicks. Catalogs are registered by creating a catalog properties file in the etc/catalog directory. For example, to create a connector for the Hive data source, create the file etc/catalog/hive.properties with the contents below, which mounts a Hive connector as the hive catalog.

#Name of the connector; hive-hadoop2 is the Hive connector for Hadoop 2.x
connector.name=hive-hadoop2
 
#Thrift address of the Hive Metastore, as configured in hive-site.xml
hive.metastore.uri=thrift://192.168.157.226:9083

#Paths to the Hadoop configuration files
hive.config.resources=/root/training/hadoop-2.7.3/etc/hadoop/core-site.xml,/root/training/hadoop-2.7.3/etc/hadoop/hdfs-site.xml
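
Once this catalog is registered, Hive tables can be queried by their fully qualified catalog.schema.table names, for example (using the web.clicks table mentioned above):

select * from hive.web.clicks;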

 Note: to access Hive, you must first start Hive's Metastore service: hive --service metastore
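
Because the Metastore must keep running while Presto is in use, it is commonly started in the background, for example:

nohup hive --service metastore > metastore.log 2>&1 &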


4、 Start Presto server

./launcher start
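
The launcher script supports a few other subcommands besides start; for example, running the server in the foreground prints the logs to the console, which is convenient when debugging the configuration:

./launcher run       # run in the foreground, logging to the console
./launcher status    # show whether the server is running
./launcher stop      # stop the server gracefully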

5、 Run Presto CLI

  • Download: presto-cli-0.217-executable.jar
  • Rename the jar file and add execute permission
cp presto-cli-0.217-executable.jar presto 
chmod a+x presto
  • Connecting to Presto server
./presto --server localhost:8080 --catalog hive --schema default

6、 Using Presto

  • Using Presto to operate Hive

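A typical CLI session might look like the sketch below; emp is a hypothetical Hive table in the default schema, used only for illustration:

presto:default> show tables;
presto:default> select * from emp;
presto:default> select deptno, count(*) from emp group by deptno;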

  • Using Presto's Web console (port 8080)


  • Using JDBC to operate Presto

1. Maven dependency to include

<dependency>
    <groupId>com.facebook.presto</groupId>
    <artifactId>presto-jdbc</artifactId>
    <version>0.217</version>
</dependency>

2. JDBC code

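A minimal sketch of querying Presto over JDBC, assuming the server from the CLI example above (localhost:8080, catalog hive, schema default); the emp table and the root user are illustrative assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PrestoJDBCDemo {
    public static void main(String[] args) throws Exception {
        // JDBC URL format: jdbc:presto://host:port/catalog/schema
        String url = "jdbc:presto://localhost:8080/hive/default";

        // Presto requires a user name for every session; no password is needed
        // when the server has no authentication configured.
        try (Connection conn = DriverManager.getConnection(url, "root", null);
             Statement stmt = conn.createStatement();
             // emp is a hypothetical Hive table used only for illustration
             ResultSet rs = stmt.executeQuery("select * from emp")) {
            while (rs.next()) {
                // print the first column of each row
                System.out.println(rs.getString(1));
            }
        }
    }
}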
