GAIA-IR: a parallelized graph query engine on GraphScope

Date: 2022-5-9

In this article, we introduce GAIA-IR, the interactive graph query engine of GraphScope. It supports efficient interactive graph queries written in the Gremlin language. It also abstracts query computation on graphs into a general form, which makes it highly extensible.

Background

Graph queries are an important tool for analyzing massive data. Gremlin [1] is an industry-standard graph query language proposed and maintained by Apache TinkerPop. It is widely used by popular graph databases such as Neo4j [2], OrientDB [3], JanusGraph [4], Microsoft Cosmos DB [5], and Amazon Neptune [6]. GAIA, the graph query engine in GraphScope, is the industry's first open-source system supporting large-scale distributed parallel execution of Gremlin. However, although the flexibility of Gremlin is a significant advantage, we also found several problems in the design and use of the GAIA system.

Existing problems

The GAIA query system has the following main drawbacks:

D1: Gremlin has many operators and several ways of expressing the same semantics. To support this rich operator set, GAIA has to add a corresponding operator in every module, end to end, and the operator implementations may contain redundant computing logic. For example, to view properties, Gremlin offers elementMap(), valueMap(), values(), select().valueMap(), and project().valueMap(), all of which produce similar results:

gremlin> g.V().elementMap() 
==>[id:1,label:person,name:marko,age:29] 
==>[id:2,label:person,name:vadas,age:27]

gremlin> g.V().valueMap('name','age') 
==>[name:[marko],age:[29]] 
==>[name:[vadas],age:[27]]

gremlin> g.V().as('a').select('a').by(valueMap('name', 'age')) 
==>[name:[marko], age:[29]] 
==>[name:[vadas], age:[27]]

gremlin> g.V().as('a').project('a').by(valueMap('name', 'age')) 
==>[a:[name:[marko], age:[29]]] 
==>[a:[name:[vadas], age:[27]]]

To support these similar expressions, GAIA has to define multiple redundant operators and support them in every module, which burdens development and extends poorly.

D2: GAIA is hard to extend to other languages. GAIA is a custom implementation of parallel Gremlin queries, but many other graph query languages are in common use, such as Cypher and GSQL. If we need to support more query languages in the future, extending GAIA to do so is almost impossible.

D3: Gremlin has poor support for complex expressions. For example, the following Gremlin query finds the two-hop neighbors of "a" that satisfy certain conditions on the "age" property:

g.V().as("a").out().as("b").out().as("c")
 .where("c", P.lt("a").or(P.gt("a").and(P.gt("b")))).by("age")

Complex nested filter conditions like the one inside where() are not intuitive and are unfriendly to users.

D4: GAIA has no well-defined Gremlin syntax specification, so it is hard to state which Gremlin operators and operator combinations the current system supports, which is unfriendly to users.

Solution

To solve these problems, we propose GAIA-IR (IR for short), a more general intermediate representation layer that is independent of the query language and describes general graph query semantics. The operators we abstract fall into two categories: relational operators and graph-related operators. Relational operators correspond to the operations of traditional relational databases, such as Projection, Selection, GroupBy, and OrderBy. Graph-related operators are queries specific to graph data, such as vertex lookup and adjacent vertex (edge) lookup. With this language-independent intermediate representation layer, we can address the GAIA problems listed above:

A1: The GAIA-IR layer uses a unified intermediate representation for Gremlin operators with similar semantics. For example, we abstract a Project operator that uniformly represents the various property-fetching operations of Gremlin described in D1.

A2: The GAIA-IR layer is independent of the query language, which makes it easy to support more languages in the future. We only need to translate the operators of each language into the unified intermediate representation of IR, and parallel query execution for that language follows naturally, without designing a distributed parallel implementation for each language.

A3: GAIA-IR also provides rich expression support. For example, compared with the query in D3, the expression support we added to the where() operator is more intuitive:

g.V().as("a").out().as("b").out().as("c")
 .where(expr("@c.age < @a.age || (@c.age > @a.age && @c.age > @b.age)"))
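
The expression string above reads like an ordinary boolean predicate over the "age" property of the three aliased vertices. As a rough illustration only (this is not how GAIA-IR parses or evaluates expressions), the same condition can be written as a plain Python function. The function name and the sample age triples are hypothetical.

```python
def where_expr(a_age: int, b_age: int, c_age: int) -> bool:
    """Plain-Python reading of:
    @c.age < @a.age || (@c.age > @a.age && @c.age > @b.age)
    """
    return c_age < a_age or (c_age > a_age and c_age > b_age)


# Hypothetical (a.age, b.age, c.age) triples standing in for matched paths a -> b -> c.
paths = [(29, 27, 25), (29, 32, 35), (29, 32, 30), (29, 27, 29)]

# Keep only the paths whose ages satisfy the predicate.
kept = [p for p in paths if where_expr(*p)]
print(kept)
```

The nested P.lt(...).or(P.gt(...).and(...)) form in D3 encodes the same condition, but the reader has to reconstruct which alias each comparison refers to.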

A4: GAIA-IR introduces the ANTLR tool to check Gremlin syntax, and it defines the scope of Gremlin operators and operator combinations that the system supports, which is more user-friendly.

IR overall design

Next, we introduce the overall design of GAIA-IR.

Concepts

First, we introduce some basic concepts in IR. IR abstracts the basic computations on graph data and provides a unified, concise, language-independent intermediate representation layer.

IR operators: At present, we abstract the operators (Graph-Relational Algebra) into two types: relational operations and graph-related operations.

  • Relational operations include Projection, Selection, Join, GroupBy, OrderBy, Dedup, Limit, and so on. They are consistent with the operations of traditional relational databases.
  • Graph-related operations include GetV, E(dge)-Join, and P(ath)-Join, which represent vertex property operations, adjacent vertex (edge) operations, and path operations on the graph, respectively.

With these two kinds of operator abstraction, we can express traditional relational operations and also support the query operations unique to graphs. In addition, the set of abstract operators is not tied to any query language, so it can easily be extended to other languages.

Data structure (GRecord): We define the data structure GRecord to represent the input and output of each IR operator. A GRecord is a multi-column structure in which each column has its own alias and value:

  • Alias: similar to an AS alias in SQL. In particular, to fit Gremlin, we additionally provide a special alias, "head", as an anonymous alias that refers to the output of the previous operator, that is, the input of the current operator.
  • Value: there are two types of values: CommonObject (including int, string, int array, string array, etc.) and GraphObject (including Vertex, Edge, and Path).
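
To make the column, alias, and value description concrete, here is a minimal Python sketch of a GRecord-like structure. It is an illustration under our own assumptions, not the actual GAIA-IR data structure: the class layout, the append and get methods, and the choice to store a value under both the anonymous alias and its named alias are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

HEAD = "HEAD"  # anonymous alias: the output of the previous operator


@dataclass
class Vertex:
    """A simplified graph object; Edge and Path are omitted in this sketch."""
    id: int
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class GRecord:
    """Multi-column record: each column maps an alias to a value."""
    columns: Dict[str, Any] = field(default_factory=dict)

    def append(self, value: Any, alias: Optional[str] = None) -> "GRecord":
        """Return a new record whose HEAD column is `value`.

        Assumption of this sketch: a named alias is stored in addition to HEAD.
        """
        cols = dict(self.columns)
        cols[HEAD] = value
        if alias is not None:
            cols[alias] = value
        return GRecord(cols)

    def get(self, alias: Optional[str] = None) -> Any:
        """Look up a column by alias; no alias means the anonymous HEAD column."""
        return self.columns[HEAD if alias is None else alias]


marko = Vertex(1, "person", {"name": "marko", "age": 29})
record = GRecord().append(marko, alias="a")
print(record.get("a").properties["name"])
```

In this sketch, an operator that takes no explicit alias reads record.get(), which mirrors the role of the anonymous alias described above.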

Gremlin query translation example

We translate a Gremlin query into a series of IR operators applied to GRecords, which together implement the Gremlin query semantics. For example, in the query g.V().as('a').select('a').by(valueMap('name', 'age')), the part g.V().as('a') produces the following intermediate results, whose alias is "a" and whose data type is Vertex:

GR1 Vertex { name:[marko], age:[29] }, Alias: “a”
GR2 Vertex { name:[vadas], age:[27] }, Alias: “a”

We then translate select('a').by(valueMap('name', 'age')) into Project("{a.name,a.age}"). Applying Project to GR1 and GR2 above yields the outputs GR1′ and GR2′, which are the vertex properties we need:

GR1′ CommonObject {a.name:[marko], a.age:[29] }
GR2′ CommonObject { a.name:[vadas], a.age:[27] }

 

Similarly, for the Gremlin query g.V().valueMap('name','age'), we only need to change the alias of GR1 and GR2 to the anonymous "head" and translate valueMap('name','age') into Project("{HEAD.name,HEAD.age}") to obtain the same result. In this way, Gremlin operators with the same semantics but different forms are translated into one unified intermediate representation. Moreover, property-fetching operations in other languages, such as SQL, can also be translated intuitively into the IR Project operator. IR thus provides a more concise and general intermediate representation layer that is independent of the query language.
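
The translation described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a record is modeled as a plain dictionary from alias to vertex properties, and the helper names parse_project and project are hypothetical, not part of the GAIA-IR API.

```python
from typing import Any, Dict, List

HEAD = "HEAD"  # anonymous alias, spelled as in Project("{HEAD.name,HEAD.age}")

# Simplified record: alias -> vertex properties.
Record = Dict[str, Dict[str, Any]]


def parse_project(spec: str) -> List[str]:
    """Turn "{a.name,a.age}" into ["a.name", "a.age"]."""
    return [item.strip() for item in spec.strip("{} ").split(",") if item.strip()]


def project(record: Record, spec: str) -> Dict[str, Any]:
    """Fetch each alias.property named in the spec from the record."""
    out: Dict[str, Any] = {}
    for item in parse_project(spec):
        alias, prop = item.split(".", 1)
        out[item] = record[alias][prop]
    return out


# select('a').by(valueMap('name', 'age')) reads the column aliased "a".
gr1_named: Record = {"a": {"name": ["marko"], "age": [29]}}
named_result = project(gr1_named, "{a.name,a.age}")

# valueMap('name', 'age') reads the anonymous HEAD column instead.
gr1_head: Record = {HEAD: {"name": ["marko"], "age": [29]}}
head_result = project(gr1_head, "{HEAD.name,HEAD.age}")

print(named_result)
print(head_result)
```

Both Gremlin forms go through the same project function and return the same property values. Only the alias used to locate the input column differs.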

System architecture

Next, we present the current parallel computing architecture of GAIA-IR for Gremlin, as shown in the figure below.

 

In general, we are compatible with the query methods of the official Gremlin Console and the Gremlin SDK. After the user submits a Gremlin query:

  1. The IR compiler checks the syntax of the query. For a valid query, the IR compiler walks the query syntax tree and, through the IR library API, converts it into a logical plan composed of IR operators. It then calls the IR library API again to generate a physical plan and distributes the physical plan to the distributed dataflow computing framework.
  2. During service startup, the dataflow framework loads the graph data partitions in advance and creates a thread pool for executing computations. After receiving the physical plan distributed by the IR compiler, the IR runtime parses it and builds an execution plan that the engine can run. For each IR operator, the IR runtime also generates a UDF that the engine understands, which implements the computational semantics of that operator. When the computation completes, the IR runtime returns the result to the IR compiler, which parses it further and returns it to the client.
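
The idea that the runtime turns each operator of a plan into an engine-level function can be illustrated with a toy, single-threaded Python sketch. Everything here is an assumption made for illustration: the plan format, the make_udf and execute names, and the three supported operators are hypothetical, and the sketch leaves out partitioning, the thread pool, and the real IR library API.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, List, Tuple

Record = Dict[str, Any]
Operator = Tuple[str, Dict[str, Any]]  # (operator name, arguments)
Udf = Callable[[Iterable[Record]], Iterable[Record]]


def make_udf(op: Operator) -> Udf:
    """Build the function that implements one operator of the plan."""
    name, args = op
    if name == "Select":
        predicate = args["predicate"]
        return lambda records: (r for r in records if predicate(r))
    if name == "Project":
        keys = args["keys"]
        return lambda records: ({k: r[k] for k in keys} for r in records)
    if name == "Limit":
        n = args["n"]
        return lambda records: islice(records, n)
    raise ValueError(f"unsupported operator: {name}")


def execute(plan: List[Operator], source: Iterable[Record]) -> List[Record]:
    """Chain the per-operator functions into one lazy stream and drain it."""
    stream: Iterable[Record] = source
    for op in plan:
        stream = make_udf(op)(stream)
    return list(stream)


people = [
    {"name": "marko", "age": 29},
    {"name": "vadas", "age": 27},
    {"name": "josh", "age": 32},
]
plan: List[Operator] = [
    ("Select", {"predicate": lambda r: r["age"] > 28}),
    ("Project", {"keys": ["name"]}),
    ("Limit", {"n": 1}),
]
result = execute(plan, people)
print(result)
```

The sketch processes every record in one process. A distributed engine would instead run such per-operator functions over the graph data partitions in parallel.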

How to use IR

Having introduced the overall design of GAIA-IR, we now describe how to run queries with the GAIA-IR engine.

Service deployment: In a previous GraphScope article, we described how to deploy GraphScope. GAIA-IR is an important implementation of GIE in GraphScope, and it is started in the same way as GraphScope. Taking a Helm deployment of GraphScope as an example, you can start GAIA-IR by specifying gaia as the engine option during installation. An example installation command follows:

helm repo add graphscope https://graphscope.oss-cn-beijing.aliyuncs.com/charts/
helm install [RELEASE_NAME] --set executor=gaia graphscope/graphscope-store

For more detailed deployment steps, please refer to the official documentation [7].

Gremlin query: Once the service is up, we can query it through the Gremlin server host and port. Taking the Gremlin Console as an example, after the service has started and the data has been imported (see the official documentation [8] for the data import steps), we can query by configuring the Gremlin Console as follows:

  1. First, edit the Gremlin Console configuration file conf/remote.yaml and set the corresponding host and port;
  2. Open the Gremlin Console and connect with the configured remote.yaml to start querying:
gremlin> :remote connect tinkerpop.server conf/remote.yaml
==>Configured localhost/127.0.0.1:8182
gremlin> :remote console
==>All scripts will now be sent to Gremlin Server - [localhost/127.0.0.1:8182] - type ':remote console' to return to local mode
gremlin> g.V().valueMap('name','age') 
==>[name:[marko],age:[29]] 
==>[name:[vadas],age:[27]]

Conclusion

This article briefly described the design motivation and overall architecture of GAIA-IR and how to run queries with the GAIA-IR engine. The current release is available on GitHub in the GAIA-IR directory [9]. As the graph query engine of GraphScope, GAIA-IR provides an efficient parallel implementation of Gremlin queries. On top of the unified intermediate representation of IR, we will also introduce more equivalent transformations and optimized implementations to support important scenarios such as pattern matching. Subsequent articles will cover more technical details, and we will continue to improve the implementation of GAIA-IR. We welcome and look forward to feedback and contributions from the community.

References

[1] Gremlin: http://tinkerpop.apache.org/

[2] Neo4j: https://neo4j.com/

[3] OrientDB: https://www.orientdb.org/

[4] JanusGraph: https://janusgraph.org/

[5] Microsoft Cosmos DB: https://azure.microsoft.com/en-us/services/cosmos-db/

[6] Amazon Neptune: https://aws.amazon.com/neptune/

[7] Official documentation: https://graphscope.io/docs/persistent_graph_store.html

[8] Official documentation: https://graphscope.io/docs/persistent_graph_store.html

[9] GAIA-IR directory: https://github.com/alibaba/GraphScope/tree/main/research/query_service/ir