Talking about Graph Databases and Graph Database Tidbits

Time:2019-10-9

Graph database – Wikipedia: In computer science, a graph database (GDB) is a database that uses graph structures for semantic queries, with nodes, edges, and attributes to represent and store data. The key concept of the system is the graph, which directly relates the data items in the store to a collection of nodes representing the data and edges representing the relationships between the nodes. These relationships allow data in the store to be linked together directly and, in many cases, retrieved with a single operation. Graph databases treat the relationships between data as a priority. Querying relationships in a graph database is fast because they are permanently stored in the database itself. Relationships can also be displayed visually, which makes graph databases very useful for highly interconnected data.

A graph database is a non-relational database created to address the limitations of existing relational databases. The graph model explicitly lays out the dependencies between data nodes, whereas the relational model and other NoSQL database models link data through implicit connections. Graph databases are designed to retrieve, simply and quickly, complex hierarchical structures that are difficult to model in relational systems. Graph databases are similar to the network model databases of the 1970s: both represent general graphs, but network model databases operate at a lower level of abstraction and cannot easily traverse a chain of edges.

The underlying storage mechanisms of graph databases vary. Some rely on a relational engine and store the graph data in tables (although a table is a logical element, so this approach imposes another layer of abstraction between the graph database, the graph database management system, and the physical devices that actually store the data). Others use a key-value store or a document-oriented database for storage, which makes them inherently NoSQL structures. Most graph databases based on non-relational storage engines also add the concept of tags or attributes, which are essentially relationships that hold a pointer to another document. In this way, data elements can be categorized so that they are easy to retrieve together.

Retrieving data from a graph database requires a query language other than SQL. SQL was designed to process data in relational systems, so it cannot handle graph traversal gracefully. As of 2017, there was no universal graph query language comparable to SQL; most languages were limited to one product. However, some standardization work has turned Gremlin, SPARQL, and Cypher into multi-vendor query languages. In addition to a query language interface, some graph databases can also be accessed through an application programming interface (API).

Graph databases are different from graph computing engines. Graph databases are a technology that translates the relational OLTP database, while graph computing engines are used for batch analysis in OLAP. Graph databases attracted considerable attention in the 2000s, owing to the success of major technology companies in using proprietary graph databases and to the introduction of open-source graph databases.

The section above quotes the Wikipedia entry on graph databases to explain what a graph database is. This article collects the scattered pieces of graph database knowledge shared in the Nebula Graph discussion group as a supplement. It is divided into two parts: Little Knowledge and Q&A.

Contents of this article

  • Little Knowledge

    • Opportunities for the Rise of Graph Databases
    • Graph Database Storage Modes: Memory-Based Storage vs. Distributed KV Storage
    • Design of a Graph Database Storage Layer
    • Visualization of Graph Structure and GIS Data
  • Q&A Questions and Answers

    • Design of Compute-Storage Separation in Graph Databases and the Reasons Behind It
    • How to Understand Vertices and Tags in Graph Databases
    • How Nebula Handles ID Conflicts
    • Differences between Nebula Graph and TigerGraph
    • The Significance of Zero Tags in Graph Databases
    • What do you think of the question “Do graph databases need indexes?”
    • Computing, Storage, and Replica Consistency in Knowledge Graph Scenarios

Little Knowledge

The first step in learning about graph databases is to understand what drove their rise.

Opportunities for the Rise of Graph Databases – @Ah Ji

Around 2010, the rise of social network research led to the large-scale application of graph computing.

The hot topics around 2000 were information retrieval and analysis, driven mainly by Google and by the collaborative filtering recommendations used in Amazon’s e-commerce. At that time collaborative filtering was also regarded as a sub-area of information retrieval, and Google’s PageRank was likewise studied mostly within information retrieval. Later, the rise of Twitter and Facebook drove research in network science.

Graph theory and graph algorithms are not new sciences; they have existed for a long time. It is just that over the last 20 years, the development of big data, social networks, e-commerce, and Web 2.0 has given graph computing new uses, while improved hardware computing power and increasingly mature support for distributed computing have made it possible for graph computing to process massive data efficiently.

Having seen what drove the development of graph databases, we next look at graph database storage modes and the design of a graph database storage layer.

Graph Database Storage Modes: Memory-Based Storage vs. Distributed KV Storage – @Bruceleexiaokan

Bruceleexiaokan: Memory-based graph databases have their advantages, especially for large-scale deep traversal and for massively parallel processing (MPP) based on the graph model, where they have a strong advantage. Their access language is more like a programming language than a graph traversal language.

Sherman: All kinds of storage have their own advantages and disadvantages, and each has its own application scenarios, so it’s hard to compare different solutions without scenarios and requirements.

Bruceleexiaokan: Graph databases based on distributed KV storage have drawbacks for large-scale deep traversal and computation, and for graph model support. Graph databases need to be categorized so that we know which kind is being discussed:

  1. Real-time online graph databases,
  2. Offline graph databases,
  3. Graph databases for large-scale mathematical analysis.

For the third kind, memory-based graph-structure schemes have the advantage. Large-scale graph databases of the first and second kinds are mainly based on KV + index.

Design of a Graph Database Storage Layer – @Bruceleexiaokan

Even without a centralized storage cluster, a single cluster still has size limits and should not grow too large. The storage layer abstraction means that the logical mapping from data sets (the different vertices and edges in the graph) to storage clusters is transparent to users. Scenarios with high availability requirements need to consider disaster recovery across two clusters. Data balancing within a single cluster is an intra-cluster matter, whereas data balancing between clusters needs to be designed explicitly. The data transmission channel from offline to online is especially important.
Design Principles:

  • Do not make a single cluster too large;
  • Have a local backup cluster that supports active-active reads;
  • Use the offline-to-online data transmission channel for data migration, backfill, recovery, batch updates, and so on;
  • Abstract data access, so that cluster operation and maintenance is transparent to user access;
  • Do a good job of data replication between clusters across data centers;
  • Aim for a design that scales linearly as investment is added gradually.
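
The principle that data access is abstracted from cluster placement can be sketched in a few lines. This is only an illustration under stated assumptions: the `StorageRouter` class, the data set name, and the cluster names are all hypothetical, and reads are alternated between the primary and backup cluster as one simple way to model active-active reads.

```python
import itertools


class StorageRouter:
    """Hypothetical router: callers name a data set, never a cluster."""

    def __init__(self):
        self._placement = {}  # data set -> (primary cluster, backup cluster)
        self._readers = {}    # data set -> iterator alternating over both clusters

    def place(self, dataset, primary, backup):
        # Record where a data set lives; also used when operations move it.
        self._placement[dataset] = (primary, backup)
        self._readers[dataset] = itertools.cycle([primary, backup])

    def cluster_for_write(self, dataset):
        # Writes go to the primary cluster only (an assumption of this sketch).
        return self._placement[dataset][0]

    def cluster_for_read(self, dataset):
        # Active-active reads: primary and backup take turns serving reads.
        return next(self._readers[dataset])


router = StorageRouter()
router.place("transfer_edges", "cluster-a", "cluster-b")
reads = [router.cluster_for_read("transfer_edges") for _ in range(3)]

# Migrating the data set changes the placement, not the caller's code.
router.place("transfer_edges", "cluster-c", "cluster-d")
write_target = router.cluster_for_write("transfer_edges")
```

Because user code only ever passes the data set name, migration, backfill, or recovery can change the placement table without any change on the access side.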

After learning the basics of storage and its design, we can compare the visualization of graph structure with the visualization of GIS data.

Visualization of Graph Structure and GIS Data – @Space

Visualization of graph structure is quite different from visualization of GIS data in essence.

GIS uses layered + tiled display, whereas a graph structure is itself flat and can only display all the touched data at once. However, GIS practice offers some inspiration: combined with specific business scenarios, could we also do hierarchical sampling? The difficulty with graph sampling is how to retain the connectivity of subgraphs as much as possible while sampling (otherwise the high-level layers may display only isolated vertices, and only the finest-grained layer displays all the data).
Some rough ideas: we can use graph computing to find the connected subgraphs first, and then compute PageRank within each connected subgraph. The PageRank values can be divided into intervals by size, which amounts to layering the graph by PageRank value. To preserve connectivity when switching levels, besides displaying the vertices of the next level (those whose PageRank value falls in the next interval), we also need to display the edges between the vertices sampled from the two levels (which is equivalent to searching for connecting paths within a subgraph). Aggregation would make this even better: if there are many edges, we could first display statistics aggregated by EdgeType and expand them only when the user is interested. That is, the graph database returns aggregated values, from which the front end generates “virtual edges”; as the user expands further, these virtual edges are replaced by the actual detailed edges.
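
The idea of layering by PageRank inside each connected subgraph can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function names are hypothetical, PageRank is a plain power iteration, and the intervals are assumed to be equal-sized groups by rank position rather than fixed value ranges.

```python
from collections import defaultdict


def connected_components(vertices, edges):
    """Group vertices into connected subgraphs, ignoring edge direction."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(comp)
    return components


def pagerank(vertices, edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed edge list."""
    n = len(vertices)
    out = defaultdict(list)
    for u, v in edges:
        out[u].append(v)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        nxt = {v: (1.0 - damping) / n for v in vertices}
        for u in vertices:
            targets = out[u] or vertices  # a vertex with no out-edges spreads evenly
            share = damping * rank[u] / len(targets)
            for v in targets:
                nxt[v] += share
        rank = nxt
    return rank


def assign_levels(vertices, edges, num_levels=3):
    """Level 0 holds the highest-ranked vertices of each connected subgraph."""
    level = {}
    for comp in connected_components(vertices, edges):
        members = set(comp)
        comp_edges = [(u, v) for u, v in edges if u in members and v in members]
        rank = pagerank(comp, comp_edges)
        ordered = sorted(comp, key=lambda v: (-rank[v], str(v)))
        for i, v in enumerate(ordered):
            level[v] = min(i * num_levels // len(ordered), num_levels - 1)
    return level


def visible_subgraph(vertices, edges, level, k):
    """Vertices at level <= k plus the edges whose endpoints are both shown."""
    shown = {v for v in vertices if level[v] <= k}
    return shown, [(u, v) for u, v in edges if u in shown and v in shown]


vertices = ["hub", "a", "b", "c", "x", "y"]
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("x", "y")]
level = assign_levels(vertices, edges, num_levels=2)
shown, shown_edges = visible_subgraph(vertices, edges, level, k=0)
```

Zooming in corresponds to calling `visible_subgraph` with a larger `k`. Note that this sketch keeps only edges between shown vertices; it does not search for connecting paths, so a coarse level can still contain isolated vertices.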

The trick above only addresses the problem of displaying graph data as smoothly as GIS data. Its shortcoming is obvious: hierarchical sampling is costly.

In addition, the display of graph data is not an independent front-end technical problem, but also involves the support of the back-end graph database as follows:

  1. Degree statistics
  2. Aggregation according to EdgeType
  3. When a query encounters a super vertex, truncate the result and return the truncation information to the client
  4. Built-in AP (analytical processing) algorithms, such as PageRank, LPA (label propagation), cycle detection, etc.
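
The first three kinds of back-end support can be sketched with a toy in-memory store. This is only an illustration: the `EdgeStore` class and its method names are hypothetical and do not correspond to any real graph database API.

```python
from collections import Counter, defaultdict


class EdgeStore:
    """Hypothetical in-memory edge store used only for illustration."""

    def __init__(self):
        self._out = defaultdict(list)  # source vertex -> [(edge type, destination)]

    def add_edge(self, src, edge_type, dst):
        self._out[src].append((edge_type, dst))

    def degree(self, src):
        # Degree statistics: number of outgoing edges of a vertex.
        return len(self._out.get(src, []))

    def count_by_edge_type(self, src):
        # Aggregation by EdgeType: the front end can draw one "virtual edge" per type.
        return dict(Counter(t for t, _ in self._out.get(src, [])))

    def neighbors(self, src, limit):
        # Super vertex handling: cut the result off and tell the client about it.
        edges = self._out.get(src, [])
        return {
            "edges": edges[:limit],
            "truncated": len(edges) > limit,
            "total": len(edges),
        }


store = EdgeStore()
for i in range(5):
    store.add_edge("celebrity", "follow", f"user{i}")
store.add_edge("celebrity", "like", "post1")
store.add_edge("celebrity", "like", "post2")

summary = store.count_by_edge_type("celebrity")
page = store.neighbors("celebrity", limit=3)
```

With the `truncated` flag and the total count in the response, the client can show that more edges exist without loading all of them.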

Visualization of graph data also needs to consider the following:
The amount of data the front end can load is limited. Client/server (C/S) visualization tools fare better, while browser/server (B/S) tools can load less in the browser. How to limit the amount of data touched, in business terms, is something the application must consider.
In addition, the name and other tag attributes of vertices and edges do not all need to be displayed on the graph at once during visualization. The first drawing can request only the name from the graph database, and the remaining tag attributes can be requested later, when the user shows interest (click/hover).

Layout problem: the common layouts at present are force-directed, circular, tree, and grid. None of these carries any business semantics. Take the tree layout: which node should be at the top level and which at the next? If this is decided only by edge direction, a single EdgeType displays well, but when multiple EdgeTypes are mixed in one tree, the tree structure of any single EdgeType is broken. Business rules must be introduced to constrain such problems in the different layouts.

Q&A Questions and Answers

Since this Q&A was compiled from the Nebula Graph discussion group, where many people took part, the answers below are attributed to group members by nickname. We do not distinguish official Nebula Graph staff from other group members; this is purely an exchange about graph database technology. If you hold a different opinion on any of the questions below, you are welcome to discuss it in the comment area of this article. Add WeChat: Nebula Graphbot to join the graph database discussion group.

Design of Compute-Storage Separation in Graph Databases and the Reasons Behind It

Question: If computing and storage are separated and data has to move between them, will network bandwidth become the bottleneck? How does Nebula solve this?

Hengzi: Nowadays network cards are all 10 Gigabit (10,000 Mbps), and it is hard to saturate the bandwidth inside an ordinary machine room; usually I/O becomes the bottleneck first.

Bowaz: If it’s a geographically distributed graph database, bandwidth is a performance constraint to consider.

Sherman: Yes, the more popular approaches are “two sites, three centers” or “three sites, five centers”. A distributed graph database involves both a graph part and a distributed-systems part, so this is bound to come up.

Bruceleexiaokan: Because large-scale online graph databases are designed with computation separated from storage, the design of data storage is particularly important. Take financial risk control: logically it is one big graph with hundreds of TB of data. A linearly scalable storage layer is the key to such a graph database.

Question: Why are they all designed with compute-storage separation? What are the important considerations?

Bruceleexiaokan: For risk control, online means inference, and most scenarios are feature computation. The graph traversals are very simple, basically no more than 2-3 hops, but they demand high performance and availability, so separating compute from storage is reasonable for online graph databases. Graph databases built for data analysis are designed differently: what they need more is deep traversal capability. There, storage separation may well be a problem, and the key is how to support large-scale graphs and how to scale up, not scale out.

Tianshi: Storage-compute separation is mostly an adaptation to cloud computing architecture: the storage layer buys space, and the compute layer buys elastic virtual machines.

Wu Min: In the long run, hardware components such as compute, storage, and network do not develop at the same speed; not all of them follow Moore’s law. Separation is better suited to long-term hardware evolution.

Sherman: I think one of the great benefits of the separation of storage and computing is that the storage cluster and computing cluster can be independently scaled up. By adjusting the capacity of different clusters, we can finally achieve the best matching to meet business needs.

How to Understand Vertices and Tags in Graph Databases

Question: How to understand the relationship between Vertex and Tag? Is there a concept of Vertex in Schema? Does a vertex ID correspond to multiple Tags?

Sherman: Explain Vertex, Tag, Edge and their relationship:

Vertex is a vertex identified by a 64-bit ID. A Vertex can be tagged with multiple Tags, each of which defines a set of attributes.

For example, we can have two Tags: Person and Developer. Person defines name, telephone, address and so on. Developer may define familiar programming languages, years of work experience, GitHub account and so on. A Vertex can be tagged with the Person Tag, which means that this Vertex represents a Person and carries the attributes defined in Person. Another Vertex may be tagged with both the Person and Developer Tags, which means that this Vertex is not only a Person but also a Developer.

Sherman: Two vertices can be connected by an Edge, and each Edge has its own type, such as friendship. Each Edge Type can also define a set of attributes. An Edge is generally used to represent a relationship or an action. For example, when Person A transfers money to Person B, there is an edge of type transfer between A and B, and the transfer Edge Type can define a set of attributes such as transfer amount, transfer time and so on.

Sherman: Any two vertices can be connected by edges of several types, or by several edges of the same type. Take transfer: there can be multiple transfers between two Persons, and each transfer is one edge.
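
The Vertex, Tag, and Edge relationship described above can be sketched as a small data structure. This is only an in-memory illustration with hypothetical class and method names; it is not Nebula's API or storage format.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    vid: int                                   # 64-bit vertex ID
    tags: dict = field(default_factory=dict)   # tag name -> attributes of that tag


@dataclass
class Edge:
    src: int
    dst: int
    edge_type: str
    props: dict = field(default_factory=dict)  # attributes defined by the edge type


class Graph:
    def __init__(self):
        self.vertices = {}
        self.edges = []

    def add_tag(self, vid, tag, **attrs):
        # The vertex itself holds no content; all attributes live under a tag.
        vertex = self.vertices.setdefault(vid, Vertex(vid))
        vertex.tags[tag] = attrs

    def add_edge(self, src, dst, edge_type, **props):
        self.edges.append(Edge(src, dst, edge_type, props))

    def edges_between(self, src, dst, edge_type=None):
        return [
            e for e in self.edges
            if e.src == src and e.dst == dst
            and (edge_type is None or e.edge_type == edge_type)
        ]


g = Graph()
g.add_tag(1, "Person", name="A", telephone="555-0100")
g.add_tag(1, "Developer", language="Go", github="a-dev")
g.add_tag(2, "Person", name="B", telephone="555-0101")

g.add_edge(1, 2, "friendship")
g.add_edge(1, 2, "transfer", amount=100, time="2019-10-01")
g.add_edge(1, 2, "transfer", amount=250, time="2019-10-08")

transfers = g.edges_between(1, 2, "transfer")
```

Vertex 1 carries two Tags, and the two transfers between the same pair of vertices are kept as two separate edges of the same type.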

Question: There is a little question about the example. Can Tag be understood as ontology here?

Sherman: As I understand it, an ontology should be the whole knowledge graph, that is, it contains both Vertex and Edge. In Nebula, a Vertex itself contains no content (that is, no attributes). Content is stored in Tags; this “content” corresponds to the concepts in an ontology, and Edges correspond to the relationships in an ontology.

Question: A follow-up: do multiple Tags support hierarchical relationships, such as an organizational structure? Thanks.

In Nebula, you can define dependencies between Tags; in the example above, Developer can depend on Person.

How Nebula Handles ID Conflicts

Question: Suppose we want to build a network of users, merchants, official accounts, and articles; their IDs will collide. Given the rule that a vertex ID can refer to only one vertex, the original IDs cannot be used directly. Is there a way to build this network? Or should we store the ID as a Tag attribute and then build an index on it?

Wu Min: Hash the type together with the original ID and use the result as the VID, then store the original ID as a property.

Sherman: Because business needs vary so much, we decided to leave VID generation to the business side. A VID is a 64-bit integer. In your case, if the original IDs do not fill 64 bits, you can use 2-4 bits to represent the type, which partitions the originally conflicting IDs into separate spaces. If the original IDs are already 64 bits, you can hash them as @Wu Min said and keep the real ID in an attribute.
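
Both suggestions can be sketched in a few lines. The type names, the type codes, the choice of 4 type bits, and the use of BLAKE2b as the hash are all assumptions made for illustration; neither answer specifies them.

```python
import hashlib

TYPE_BITS = 4               # assumption: 4 of the 64 bits encode the entity type
ID_BITS = 64 - TYPE_BITS    # the remaining 60 bits hold the original ID
# Hypothetical type codes; kept below 8 so the VID also fits a signed 64-bit integer.
TYPE_CODES = {"user": 0, "merchant": 1, "official_account": 2, "article": 3}


def vid_from_prefix(entity_type, original_id):
    """Put the type code in the high bits so equal IDs of different types differ."""
    if not 0 <= original_id < (1 << ID_BITS):
        raise ValueError("original ID does not fit in the remaining bits")
    return (TYPE_CODES[entity_type] << ID_BITS) | original_id


def vid_from_hash(entity_type, original_id):
    """Hash type and original ID together into 64 bits (collisions are possible)."""
    key = f"{entity_type}:{original_id}".encode()
    digest = hashlib.blake2b(key, digest_size=8).digest()
    # Interpreting the digest as signed is an assumption about the VID type.
    return int.from_bytes(digest, "big", signed=True)


user_vid = vid_from_prefix("user", 42)
article_vid = vid_from_prefix("article", 42)
hashed_vid = vid_from_hash("article", 2**63 + 5)
```

With the prefix scheme the original ID can be recovered by masking off the high bits. With the hash scheme it cannot, which is why the answers above suggest storing the real ID in an attribute.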

Differences between Nebula Graph and TigerGraph

Question: Folks, we would like to know how Nebula Graph and TigerGraph relate. What are the differences between them?

Sherman: Simply put, TigerGraph is not truly peer-to-peer distributed; it is distributed with central nodes. It distributes the storage of the attributes on vertices and edges, but the relationships of the whole graph must be kept on one machine. At the same time, the whole graph must be loaded into memory at run time, which limits the size of graph it can handle. Once a product’s architecture is set, it is not easy to change; changing it basically amounts to a rewrite.

J. GUARDIAN: Put simply, TigerGraph sacrifices the ability to handle large graphs in exchange for performance, while Nebula solves the graph size problem but gives up some performance in return.

Sherman: Not quite. Of course, this is based on what I learned about TigerGraph some time ago.

The Significance of Zero Tags in Graph Databases

Question: I see that our document says “a vertex must have at least one type of tag”, but I noticed that Neo4j supports zero tags. Does a node without tags use the same common tag when querying? Why support zero tags? What’s the point of doing this?

Sherman: Most of the data sets used to evaluate graph computing performance (such as Graph500 and Twitter) have zero tags, i.e. there are no attribute-based filter conditions. This shows the core performance of a graph engine. In most cases, dynamically pruning the graph through tag filtering shortens the time consumed.
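
The effect of tag filtering on a traversal can be sketched with a small breadth-first search. The function name, the adjacency dictionary, and the tag sets below are hypothetical; the point is only that a filter on tags cuts whole branches before they are expanded.

```python
from collections import deque


def k_hop(adj, tags, start, hops, required_tag=None):
    """Vertices reachable within `hops`; vertices lacking `required_tag` are pruned."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        vertex, depth = frontier.popleft()
        if depth == hops:
            continue
        for nxt in adj.get(vertex, []):
            if nxt in seen:
                continue
            if required_tag is not None and required_tag not in tags.get(nxt, set()):
                continue  # pruned: this vertex and everything behind it is skipped
            seen.add(nxt)
            frontier.append((nxt, depth + 1))
    return seen - {start}


adj = {1: [2, 3], 2: [4], 3: [5, 6]}
tags = {2: {"Person"}, 3: {"Company"}, 4: {"Person"}, 5: {"Person"}, 6: {"Person"}}

unfiltered = k_hop(adj, tags, start=1, hops=2)
persons_only = k_hop(adj, tags, start=1, hops=2, required_tag="Person")
```

Without a filter the traversal touches five vertices; with the Person filter, vertex 3 is rejected at the first hop, so vertices 5 and 6 are never visited.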

What do you think of the question “Do graph databases need indexes?”

Question: What do you think of the question “Do graph databases need indexes?”

Bruceleexiaokan: Ultimately it is a design trade-off. Different data distributions and different access requirements call for different designs, and performance will certainly differ. The best solution is to design a storage access abstraction that retains flexibility in design and implementation, so that it can be optimized for different scenarios.

Storing the adjacency index inline with the node is an optimization: it reduces the number of physical disk block reads, because the index is read and written together with the node. But it has problems in some special scenarios:

  1. If updates are very frequent, it can cause write amplification.
  2. If a single node has an exceptionally large number of edges but accesses only traverse the first few, performance will be worse.

Attribute indexing is a separate question.

Sherman: @Bruceleexiaokan I fully agree. Whether to use an index depends on the scenario, and overusing indexes comes at a cost.

Question: Nebula indexes adjacent nodes, right?

Sherman: It indexes attributes.

Computing, Storage, and Replica Consistency in Knowledge Graph Scenarios

Question: In our knowledge graph business scenario we look up paths between nodes. How efficient is computing this in real time? Or is offline computing recommended? Nebula separates storage and computation, right?

Sherman: Let me share my personal understanding. I think knowledge graph scenarios generally need online queries, because you cannot know in advance what kinds of queries will come. And yes, Nebula separates storage and computing. The biggest advantage is flexible deployment: compute nodes and storage nodes can be scaled independently according to different requirements.

Question: Are the replicas eventually consistent, or strongly consistent?

Wu Min: Replicas are strongly consistent, based on the Raft protocol.

Popular Survey of Graph Databases

The beta version of the Nebula Graph graph database is now online, and a bug-catching campaign for this version is under way. You are welcome to come and catch bugs.

Reference material

  • Graph database – Wikipedia