[Paper notes] LINE: Large-scale Information Network Embedding


LINE: Large-scale Information Network Embedding

arXiv:1503.03578

III. Problem Definition

We formally define the problem of large-scale information network embedding using first-order and second-order proximity. We first define an information network as follows:

Definition 1 (Information Network): An information network is defined as G = (V, E), where V is the set of vertices, each representing a data object, and E is the set of edges between vertices, each representing a relationship between two data objects. Each edge e ∈ E is an ordered pair e = (u, v) and is associated with a weight w[uv] > 0, which indicates the strength of the relationship. If G is undirected, we have (u, v) ≡ (v, u) and w[uv] = w[vu]; if G is directed, we have (u, v) ≢ (v, u) and w[uv] ≢ w[vu].

In practice, an information network can be directed (e.g., a citation network) or undirected (e.g., the social network of users in Facebook). The weight of an edge can be binary or any real number. Note that while negative edge weights are possible, we only consider non-negative weights in this study. For example, in citation networks and social networks the weights are binary, while in co-occurrence networks between different objects w[uv] can take any non-negative value. The edge weights in some networks may diverge, because some objects co-occur many times while others co-occur only a few times.

Embedding an information network into a low-dimensional space is useful in a variety of applications. To embed, the network structure must be preserved. The first intuition is that the local network structure, i.e., the local pairwise proximity between vertices, must be preserved. We define the local network structure as the first-order proximity between vertices:

Definition 2 (First-order Proximity): The first-order proximity in a network is the local pairwise proximity between two vertices. For each pair of vertices linked by an edge (u, v), the weight w[uv] on that edge indicates the first-order proximity between u and v. If no edge is observed between u and v, their first-order proximity is 0.

First-order proximity usually implies the similarity of two nodes in a real-world network. For example, people who are friends with each other in social networks tend to share similar interests; pages linking to each other on the World Wide Web tend to discuss similar topics. Because of this importance, many existing graph embedding algorithms, such as IsoMap, LLE, Laplacian eigenmaps, and graph factorization, aim to preserve the first-order proximity.

However, in a real-world information network, only a small proportion of links are observed, and many others are missing [10]. The first-order proximity of a pair of nodes on a missing link is zero, even though they may be intrinsically very similar to each other. Therefore, first-order proximity alone is not sufficient to preserve the network structure, and it is important to find an alternative notion of proximity to address the sparsity problem. A natural intuition is that vertices that share similar neighbors tend to be similar to each other. For example, in social networks, people who share similar friends tend to have similar interests and become friends; in word co-occurrence networks, words that always co-occur with the same set of words tend to have similar meanings. We therefore define the second-order proximity, which complements the first-order proximity and preserves the network structure.

Definition 3 (Second-order Proximity): The second-order proximity between a pair of vertices (u, v) in a network is the similarity between their neighborhood network structures. Mathematically, let p[u] = (w[u, 1], … , w[u, |V|]) denote the first-order proximity of u to all other vertices; then the second-order proximity between u and v is determined by the similarity between p[u] and p[v]. If no vertex is linked to both u and v, the second-order proximity between u and v is 0.
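The definition leaves the similarity measure between p[u] and p[v] unspecified. As a minimal sketch (using a toy adjacency matrix and cosine similarity as one possible choice of measure, both assumptions for illustration):

```python
import numpy as np

# Toy weighted adjacency matrix: W[u, v] = w[uv] is the first-order
# proximity between vertices u and v (0 if no edge is observed).
W = np.array([
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])

def second_order_proximity(W, u, v):
    """Similarity between neighborhood vectors p[u] and p[v]
    (cosine similarity is one common choice of measure)."""
    pu, pv = W[u], W[v]
    if not pu.any() or not pv.any():
        return 0.0
    return float(pu @ pv / (np.linalg.norm(pu) * np.linalg.norm(pv)))

# Vertices 0 and 1 both link to vertex 2, so their neighborhood
# vectors overlap and their second-order proximity is nonzero.
s01 = second_order_proximity(W, 0, 1)
s03 = second_order_proximity(W, 0, 3)
```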

With the first-order and second-order proximity, we study the problem of network embedding, defined as follows.

Definition 4 (Large-scale Information Network Embedding): Given a large network G = (V, E), the problem of large-scale information network embedding aims to represent every vertex v ∈ V in a low-dimensional space R^d, i.e., to learn a function f[G]: V -> R^d, where d << |V|. In the space R^d, both the first-order and the second-order proximity between vertices are preserved.

Next, we introduce a large-scale network embedding model that preserves both the first-order and the second-order proximity.

IV. LINE: Large-scale Information Network Embedding

An ideal embedding model for real-world information networks must satisfy several requirements: first, it must be able to preserve both the first-order and the second-order proximity between vertices; second, it must scale to very large networks, say millions of vertices and billions of edges; third, it must handle networks with arbitrary types of edges: directed, undirected, and/or weighted. In this section, we present a novel network embedding model called LINE, which satisfies all three requirements.

4.1 Model description

We first describe the LINE model that preserves the first-order proximity and the one that preserves the second-order proximity, and then introduce a simple way to combine the two.

First-order proximity

First-order proximity refers to the local pairwise proximity between vertices in the network. To model the first-order proximity, for each undirected edge (i, j) we define the joint probability between vertex v[i] and vertex v[j] as follows:

p[1](v[i], v[j]) = 1 / (1 + exp(-u[i]^T u[j]))    (1)

where u[i] ∈ R^d is the low-dimensional vector representation of vertex v[i]. Equation (1) defines a distribution p(·,·) over the space V×V, and its empirical probability can be defined as ^p[1](i, j) = w[ij]/W, where W = Σ[(i,j) ∈ E] w[ij]. To preserve the first-order proximity, a straightforward way is to minimize the following objective function:

O[1] = d(^p[1](·,·), p[1](·,·))    (2)

where d(·,·) is the distance between two distributions. We choose to minimize the KL-divergence of the two probability distributions. Replacing d(·,·) with the KL-divergence and omitting some constants, we obtain:

O[1] = -Σ[(i,j) ∈ E] w[ij] log p[1](v[i], v[j])    (3)

Note that the first-order proximity is only applicable to undirected graphs, not directed graphs. By finding the {u[i]}, i = 1 .. |V| that minimize the objective in Eq. (3), we can represent every vertex in the d-dimensional space.
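The first-order objective can be sketched directly from Eqs. (1) and (3); the embedding matrix and edge list below are toy values for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def first_order_objective(U, edges):
    """Eq. (3): O1 = -sum over edges (i, j) of w[ij] * log p1(v_i, v_j),
    where p1(v_i, v_j) = sigmoid(u_i . u_j) as in Eq. (1).
    U holds one d-dimensional row per vertex; edges is a list of
    (i, j, w_ij) triples."""
    O1 = 0.0
    for i, j, w in edges:
        p1 = sigmoid(U[i] @ U[j])
        O1 -= w * np.log(p1)
    return O1

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(4, 8))        # 4 vertices embedded in d = 8
edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.0)]
loss = first_order_objective(U, edges)
```

Minimizing this loss over U (e.g., by stochastic gradient descent) pushes linked vertices toward large inner products.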

Second-order proximity

Second-order proximity applies to both directed and undirected graphs. Given a network, without loss of generality we assume it is directed (an undirected edge can be treated as two directed edges with opposite directions and equal weights). Second-order proximity assumes that vertices sharing many connections to other vertices are similar to each other. In this case, each vertex is also treated as a specific "context", and vertices with similar distributions over the contexts are assumed to be similar. Therefore, each vertex plays two roles: the vertex itself and a specific "context" of other vertices. We introduce two vectors u[i] and u'[i], where u[i] is the representation of v[i] when it is treated as a vertex and u'[i] is the representation of v[i] when it is treated as a specific "context". For each directed edge (i, j), we first define the probability of "context" v[j] being generated by vertex v[i] as:

p[2](v[j] | v[i]) = exp(u'[j]^T u[i]) / Σ[k=1..|V|] exp(u'[k]^T u[i])    (4)

where |V| is the number of vertices or "contexts". For each vertex v[i], Eq. (4) actually defines a conditional distribution p[2](·|v[i]) over the contexts, i.e., the entire set of vertices in the network. As mentioned above, second-order proximity assumes that vertices with similar distributions over the contexts are similar to each other. To preserve the second-order proximity, we should make the conditional distribution of the contexts p[2](·|v[i]) specified by the low-dimensional representation close to the empirical distribution ^p[2](·|v[i]). Therefore, we minimize the following objective function:

O[2] = Σ[i ∈ V] λ[i] d(^p[2](·|v[i]), p[2](·|v[i]))    (5)

where λ[i] represents the importance of vertex i in the network, which can be measured by its degree. The empirical distribution is defined as ^p[2](v[j]|v[i]) = w[ij]/d[i], where d[i] is the out-degree of vertex i. Adopting the KL-divergence as the distance d(·,·), setting λ[i] = d[i], and omitting some constants, we obtain:

O[2] = -Σ[(i,j) ∈ E] w[ij] log p[2](v[j] | v[i])    (6)

By learning {u[i]}, i = 1 .. |V| and {u'[i]}, i = 1 .. |V| that minimize this objective, we can represent each vertex v[i] with a d-dimensional vector u[i].
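As a sketch, the conditional distribution of Eq. (4) and the objective of Eq. (6) can be computed with NumPy; the vertex/context vectors and edge list are toy values, and the full softmax shown here is exactly the expensive summation over all vertices that Section 4.2 replaces with negative sampling:

```python
import numpy as np

def p2_conditional(U, Uc, i):
    """Eq. (4): softmax distribution over contexts for vertex i, using
    vertex vectors U (rows u[i]) and context vectors Uc (rows u'[i])."""
    scores = Uc @ U[i]
    scores -= scores.max()          # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def second_order_objective(U, Uc, edges):
    """Eq. (6): O2 = -sum over edges (i, j) of w[ij] * log p2(v_j | v_i)."""
    O2 = 0.0
    for i, j, w in edges:
        O2 -= w * np.log(p2_conditional(U, Uc, i)[j])
    return O2

rng = np.random.default_rng(1)
U  = rng.normal(scale=0.1, size=(4, 8))   # vertex representations u[i]
Uc = rng.normal(scale=0.1, size=(4, 8))   # context representations u'[i]
edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.0)]
dist = p2_conditional(U, Uc, 0)
loss = second_order_objective(U, Uc, edges)
```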

Combining first-order and second-order proximity

To embed the network by preserving both the first-order and the second-order proximity, a simple and effective approach we find in practice is to train the LINE model that preserves the first-order proximity and the one that preserves the second-order proximity separately, and then concatenate the embedding vectors trained by the two methods for each vertex. A more principled way of combining the two proximities is to jointly train the objective functions (3) and (6), which we leave as future work.

4.2 Model optimization

Optimizing objective (6) is computationally expensive, since computing the conditional probability p[2](·|v[i]) requires a summation over the entire vertex set. To address this problem, we adopt the negative sampling approach proposed in [13], which samples multiple negative edges from some noise distribution for each edge (i, j). More specifically, it specifies the following objective function for each edge (i, j):

log σ(u'[j]^T u[i]) + Σ[n=1..K] E[v[n] ~ P[n](v)][log σ(-u'[n]^T u[i])]    (7)

where σ(x) = 1 / (1 + exp(-x)) is the sigmoid function. The first term models the observed edge, and the second term models the negative edges drawn from the noise distribution, with K the number of negative edges. We set P[n](v) ∝ d[v]^3/4, where d[v] is the out-degree of vertex v.
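A minimal sketch of the per-edge objective (7), assuming toy embeddings and the d[v]^3/4 noise distribution described above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_objective(U, Uc, i, j, out_degree, K, rng):
    """Eq. (7) for one edge (i, j): log sigma(u'_j . u_i) plus K terms
    log sigma(-u'_n . u_i), with each v_n drawn from P_n(v) ~ d_v^(3/4)."""
    Pn = out_degree ** 0.75
    Pn = Pn / Pn.sum()
    obj = np.log(sigmoid(Uc[j] @ U[i]))            # observed edge
    negatives = rng.choice(len(Pn), size=K, p=Pn)  # K noise vertices
    for n in negatives:
        obj += np.log(sigmoid(-(Uc[n] @ U[i])))    # negative edges
    return obj

rng = np.random.default_rng(2)
U  = rng.normal(scale=0.1, size=(4, 8))
Uc = rng.normal(scale=0.1, size=(4, 8))
out_degree = np.array([2.0, 3.0, 4.0, 1.0])
val = negative_sampling_objective(U, Uc, 0, 1, out_degree, K=5, rng=rng)
```

Maximizing this quantity per sampled edge avoids the full softmax of Eq. (4).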

For objective function (3), there exists a trivial solution: u[ik] = ∞ for i = 1, ..., |V| and k = 1, ..., d. To avoid this trivial solution, we can still utilize the negative sampling approach (7) by simply changing u'[j]^T to u[j]^T.

We adopt asynchronous stochastic gradient descent (ASGD) [17] to optimize Eq. (7). In each step, the ASGD algorithm samples a mini-batch of edges and then updates the model parameters. If an edge (i, j) is sampled, the gradient w.r.t. the embedding vector u[i] of vertex i will be calculated as:

∂O[2]/∂u[i] = w[ij] · ∂log p[2](v[j] | v[i]) / ∂u[i]    (8)

Note that the gradient is multiplied by the weight of the edge. This becomes problematic when the weights of edges have a high variance. For example, in a word co-occurrence network, some words co-occur many times (e.g., tens of thousands of times) while others co-occur only a few times. In such networks, the scales of the gradients diverge and it is hard to find a good learning rate: if we select a large learning rate according to the edges with small weights, the gradients on edges with large weights will explode, while if we select a small learning rate according to the edges with large weights, the gradients on edges with small weights will become too small.

Optimization via edge sampling

The intuition for solving the above problem is that if the weights of all edges were equal (e.g., a network with binary edges), there would be no problem choosing an appropriate learning rate. A simple treatment is thus to unfold a weighted edge into multiple binary edges, e.g., an edge with weight w is unfolded into w binary edges. This solves the problem but significantly increases the memory requirement, especially when the edge weights are very large. To resolve this, one can sample from the original edges and treat each sampled edge as a binary edge, with the sampling probability proportional to the original edge weight. With this edge-sampling treatment, the overall objective function remains the same. The problem boils down to how to sample the edges according to their weights.
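Sampling an edge with probability proportional to its weight can be done in O(1) time per draw with the alias method; a minimal sketch (the weights below are toy values chosen to have high variance):

```python
import numpy as np

def build_alias_table(weights):
    """Preprocess the edge weights into an alias table so that each
    weighted draw afterwards costs O(1). Returns (prob, alias)."""
    w = np.asarray(weights, dtype=float)
    n = len(w)
    prob = w * n / w.sum()                 # scaled so the mean is 1
    alias = np.zeros(n, dtype=int)
    small = [i for i in range(n) if prob[i] < 1.0]
    large = [i for i in range(n) if prob[i] >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                       # s's leftover mass goes to l
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    return prob, alias

def sample_edge(prob, alias, rng):
    """Draw an edge index with probability proportional to its weight."""
    k = rng.integers(len(prob))
    return int(k if rng.random() < prob[k] else alias[k])

weights = [1.0, 10.0, 100.0]               # high-variance edge weights
prob, alias = build_alias_table(weights)
rng = np.random.default_rng(3)
draws = np.array([sample_edge(prob, alias, rng) for _ in range(20000)])
# Edge 2 carries ~90% of the total weight, so it should dominate the draws.
```

Each sampled edge is then treated as a binary edge in the SGD update, so all updates see unit weight and a single learning rate suffices.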

4.3 Discussion

We discuss several practical issues of the LINE model.

Low-degree vertices: One practical issue is how to accurately embed vertices with small degrees. Since the number of neighbors of such a vertex is very small, it is hard to accurately infer its representation, especially for the method based on second-order proximity, which heavily depends on the number of "contexts". An intuitive solution is to expand the neighbors of those vertices by adding higher-order neighbors, such as neighbors of neighbors. In this paper, we only consider adding second-order neighbors, i.e., neighbors of neighbors, to each vertex. The weight between vertex i and its second-order neighbor j is measured as:

w[ij] = Σ[k ∈ N(i)] w[ik] · w[kj] / d[k]    (9)

where N(i) is the set of neighbors of vertex i and d[k] is the out-degree of vertex k.

In practice, one may add only a subset of the vertices {j} that have the largest proximity to the low-degree vertex i.
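A minimal sketch of these second-order neighbor weights (w[ik] · w[kj] / d[k], summed over the common neighbors k) on a toy directed graph:

```python
import numpy as np

# Toy weighted adjacency: W[i, k] = w[ik]; out-degree d[k] = row sum of k.
W = np.array([
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])

def second_order_neighbor_weights(W, i):
    """Weight between vertex i and each second-order neighbor j:
    sum over direct neighbors k of i of w[ik] * w[kj] / d[k]."""
    d = W.sum(axis=1)
    weights = np.zeros(W.shape[0])
    for k in np.nonzero(W[i])[0]:       # direct neighbors of i
        if d[k] > 0:
            weights += W[i, k] * W[k] / d[k]
    return weights

w2 = second_order_neighbor_weights(W, 0)
# Vertex 0's only neighbor is 1 (w[01] = 2, d[1] = 4), so
# w2[2] = 2 * 1/4 = 0.5 and w2[3] = 2 * 3/4 = 1.5.
```

Keeping only the largest entries of w2 then gives the subset of second-order neighbors to add for a low-degree vertex.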

New vertices: Another practical issue is how to find the representation of a newly arrived vertex. For a new vertex i, if its connections to existing vertices are known, we can obtain the empirical distributions ^p[1](·, v[i]) and ^p[2](·|v[i]) over the existing vertices. According to objective functions (3) and (6), a straightforward way to obtain the embedding of the new vertex is to minimize either of the following objective functions:

-Σ[j ∈ N(i)] w[ji] log p[1](v[j], v[i])  or  -Σ[j ∈ N(i)] w[ji] log p[2](v[j] | v[i])    (10)

We update the embedding of the new vertex while keeping the embeddings of the existing vertices fixed. If no connections between the new vertex and existing vertices are observed, we must resort to other information, such as the textual information of the vertices, which we leave as future work.
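As a sketch of this procedure for the second-order form of (10), one can fit only the new vertex's vector by gradient ascent against the fixed context vectors of the existing vertices; the negative-sampling form with a single uniform noise sample per edge and all numeric values below are simplifying assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def embed_new_vertex(Uc, neighbors, d, steps=200, lr=0.5, rng=None):
    """Fit u_i for a new vertex i against fixed context vectors Uc,
    maximizing a negative-sampling surrogate of the second-order form
    of (10); neighbors is a list of (j, w_ji) pairs to existing vertices."""
    rng = rng if rng is not None else np.random.default_rng(0)
    u = rng.normal(scale=0.1, size=d)
    n = len(Uc)
    for _ in range(steps):
        grad = np.zeros(d)
        for j, w in neighbors:
            grad += w * (1.0 - sigmoid(Uc[j] @ u)) * Uc[j]   # pull toward observed contexts
            k = rng.integers(n)                              # one uniform noise sample
            grad -= w * sigmoid(Uc[k] @ u) * Uc[k]           # push away from noise
        u += lr * grad       # ascent: only the new vertex's vector moves
    return u

rng = np.random.default_rng(4)
Uc = rng.normal(size=(5, 8))     # fixed context vectors of existing vertices
u_new = embed_new_vertex(Uc, [(0, 1.0), (1, 2.0)], d=8, rng=rng)
```

Only u_new is updated; the existing embeddings Uc are never touched, as the text requires.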