1. What is Presto?
- Background: Hive's shortcomings and the motivation for Presto
Hive uses MapReduce as its underlying computing framework and is designed for batch processing. As data volumes grow, however, even a simple query in Hive can take anywhere from several minutes to several hours, which clearly cannot meet the requirements of interactive querying. Presto is a distributed SQL query engine designed for high-speed, real-time data analysis. It supports standard ANSI SQL, including complex queries, aggregations, joins, and window functions. Two points are worth exploring: first, its architecture, and second, how it achieves latency low enough to support interactive use.
- What is Presto?
Presto is an open-source distributed SQL query engine suited to interactive analytical queries over data volumes ranging from gigabytes to petabytes. Presto was designed and written to bring the interactive analysis and processing speed of a commercial data warehouse to an organization of Facebook's scale.
- What can it do?
Presto supports online data queries against Hive, Cassandra, relational databases, and proprietary data stores. A single Presto query can combine data from multiple sources, enabling analysis across an entire organization. Presto targets analysts, who expect response times ranging from under a second to a few minutes. Presto ended the data-analysis dilemma of having to choose between fast but expensive commercial solutions and slow "free" solutions that consume large amounts of hardware.
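As a sketch of such a federated query, the following statement (with hypothetical catalog, schema, and table names) joins a Hive table against a table from a relational database mounted as a second catalog:

```sql
-- Hypothetical example: combine data from two catalogs in one query.
-- Assumes catalogs named "hive" and "mysql" have been configured.
SELECT o.order_id, o.amount, u.email
FROM hive.sales.orders AS o
JOIN mysql.crm.users AS u
  ON o.user_id = u.user_id
WHERE o.order_date >= DATE '2019-01-01';
```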
- Who is using it?
Facebook uses Presto for interactive queries against multiple internal data stores, including a 300 PB data warehouse. More than 1,000 Facebook employees use Presto every day, executing more than 30,000 queries and scanning more than 1 PB of data. Leading Internet companies, including Airbnb and Dropbox, are using Presto.
2. Presto's architecture
Presto is a distributed system running on multiple servers. A complete installation includes one coordinator and multiple workers. The client submits a query from the Presto command-line interface (CLI) to the coordinator. The coordinator parses and analyzes the query, generates the execution plan, and then distributes the processing tasks to the workers.
The Presto query engine has a master-slave architecture composed of a coordinator node, a discovery server node, and multiple worker nodes; the discovery server is usually embedded in the coordinator node. The coordinator is responsible for parsing SQL statements, generating execution plans, and distributing execution tasks to the worker nodes, while the worker nodes perform the actual query execution. When a worker node starts, it registers itself with the discovery server, and the coordinator obtains the list of healthy worker nodes from the discovery server. If the Hive connector is configured, a Hive metastore service must also be available to provide Hive metadata to Presto, and the worker nodes interact with HDFS to read the data.
3. Install Presto server
- Installation media
- Install and configure Presto server
1. Unzip the installation package
tar -zxvf presto-server-0.217.tar.gz -C ~/training/
2. Create etc directory
cd ~/training/presto-server-0.217/
mkdir etc
3. You need to include the following configuration files in the etc directory
node.properties: node configuration information
jvm.config: command-line options for the JVM
config.properties: configuration parameters of the Presto server
catalog/*.properties: configuration parameters of data sources (connectors)
log.properties: log level configuration
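Assuming the installation path used above, the finished etc directory will look roughly like this (the hive.properties catalog file is created in a later step):

```
etc/
├── node.properties
├── jvm.config
├── config.properties
├── log.properties
└── catalog/
    └── hive.properties
```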
- edit node.properties
# Cluster name. All Presto nodes in the same cluster must have the same cluster name.
node.environment=production
# A unique identifier for this Presto node. Every node's node.id must be unique,
# and it must remain the same across Presto restarts and upgrades. If multiple
# Presto instances are installed on one machine, each must have its own node.id.
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
# The location (a path on the operating system) of the data directory.
# Presto stores logs and other data here.
node.data-dir=/root/training/presto-server-0.217/data
- edit jvm.config
Because an OutOfMemoryError can leave the JVM in an inconsistent state, the usual treatment when such an error occurs is to collect a heap dump (for debugging) and then force the process to terminate. Presto compiles queries into bytecode, so it generates many classes; we should therefore increase the size of the permanent generation (where classes are stored) and allow JVM class unloading.
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
- edit config.properties
Configuration of Coordinator
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://192.168.157.226:8080
Configuration of workers
coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://192.168.157.226:8080
If we want to test on a stand-alone machine and configure both coordinator and worker, please use the following configuration:
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://192.168.157.226:8080
- edit log.properties
# Log level
com.facebook.presto=INFO
- Configure catalog properties
Presto accesses data through connectors, which are mounted in catalogs. A connector provides all the schemas and tables inside a catalog. For example, the Hive connector maps each Hive database to a schema. Thus, if the Hive connector is mounted as a catalog named hive, and a table named clicks exists in Hive's web database, that table can be accessed as hive.web.clicks. Catalogs are registered by creating a catalog properties file in the etc/catalog directory. To create a connector for the Hive data source, create a file etc/catalog/hive.properties with the following contents, which mounts a Hive connector as the hive catalog.
# Indicates the Hadoop version of the connector
connector.name=hive-hadoop2
# Metastore address configured in hive-site.xml
hive.metastore.uri=thrift://192.168.157.226:9083
# Hadoop configuration file paths
hive.config.resources=/root/training/hadoop-2.7.3/etc/hadoop/core-site.xml,/root/training/hadoop-2.7.3/etc/hadoop/hdfs-site.xml
Note: to access Hive, you need to start Hive's metastore service first: hive --service metastore
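Once the catalog is registered, the clicks table from the example above can be addressed by its fully qualified name from any Presto client:

```sql
-- Catalog "hive", schema (Hive database) "web", table "clicks"
SELECT count(*) FROM hive.web.clicks;
```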
4. Start Presto server
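The server is controlled through the launcher script shipped in the installation's bin directory; for example (run from ~/training/presto-server-0.217):

```shell
# Start Presto as a background daemon (logs go to var/log/server.log)
bin/launcher start

# Alternatively, run in the foreground for debugging
# bin/launcher run

# Stop the server
bin/launcher stop
```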
5. Run the Presto CLI
- Download: presto-cli-0.217-executable.jar
- Rename the jar and make it executable
cp presto-cli-0.217-executable.jar presto
chmod a+x presto
- Connecting to Presto server
./presto --server localhost:8080 --catalog hive --schema default
6. Using Presto
- Using Presto to operate Hive
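A short session from the CLI started above, assuming the default Hive database and a hypothetical table named emp:

```sql
-- List the Hive databases visible through the hive catalog
SHOW SCHEMAS;

-- List the tables in the current schema (default)
SHOW TABLES;

-- Query a hypothetical Hive table named emp
SELECT deptno, count(*) FROM emp GROUP BY deptno;
```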
- Using Presto's web console (port 8080)
- Using JDBC to operate Presto
1. Maven dependencies to include
<dependency>
    <groupId>com.facebook.presto</groupId>
    <artifactId>presto-jdbc</artifactId>
    <version>0.217</version>
</dependency>
2. JDBC code
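A minimal sketch of the JDBC code, assuming the coordinator address used in config.properties above; the user name is arbitrary when no authentication is configured, and the query is a trivial SELECT 1 so that no Hive table is required:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class PrestoJdbcDemo {
    public static void main(String[] args) throws SQLException {
        // URL format: jdbc:presto://<coordinator-host>:<port>/<catalog>/<schema>
        String url = "jdbc:presto://192.168.157.226:8080/hive/default";

        // Presto requires a user name; the password may be null
        // when no authentication is configured on the server.
        try (Connection conn = DriverManager.getConnection(url, "root", null);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT 1")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1));
            }
        }
    }
}
```

With JDBC 4, the driver class com.facebook.presto.jdbc.PrestoDriver is loaded automatically from the presto-jdbc jar on the classpath, so no explicit Class.forName call is needed.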