## LINE: Large-scale Information Network Embedding

arXiv: 1503.03578

## III. Problem Definition

We formally define the problem of large-scale information network embedding using first-order and second-order proximity. We first define an information network as follows:

Definition 1 (Information Network): An information network is defined as `G = (V, E)`, where `V` is the set of vertices, each representing a data object, and `E` is the set of edges between vertices, each representing a relationship between two data objects. Each edge `e ∈ E` is an ordered pair `e = (u, v)` and is associated with a weight `w[uv] > 0`, which indicates the strength of the relationship. If `G` is undirected, we have `(u, v) ≡ (v, u)` and `w[uv] = w[vu]`; if `G` is directed, we have `(u, v) ≢ (v, u)` and `w[uv] ≢ w[vu]`.

In practice, an information network can be directed (for example, a citation network) or undirected (for example, the social network of users on Facebook). The weight of an edge can be binary or any real number. Note that while negative edge weights are possible, we only consider non-negative weights in this study. For example, in citation networks and social networks, `w[uv]` takes binary values; in co-occurrence networks between different objects, `w[uv]` can take any non-negative value. Weights in some networks can vary widely, because some objects co-occur many times while others co-occur only a few times.
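As a rough illustration of Definition 1, the sketch below stores a network as a nested dictionary of edge weights. The helper name `build_network` and the toy edges are assumptions made for this example, not part of the original text.

```python
from collections import defaultdict

def build_network(edges, directed=True):
    """Build {u: {v: w_uv}} from (u, v, weight) triples; every weight must be > 0."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        if w <= 0:
            raise ValueError(f"edge ({u}, {v}) needs a weight > 0, got {w}")
        adj[u][v] = w
        if not directed:
            # Undirected case: (u, v) and (v, u) are the same edge with the same weight.
            adj[v][u] = w
    return {u: dict(nbrs) for u, nbrs in adj.items()}

# Directed network with binary weights (citation-style toy data).
citations = build_network([("A", "B", 1), ("C", "B", 1)], directed=True)

# Undirected network with real-valued weights (co-occurrence-style toy data).
cooccur = build_network([("cat", "dog", 12.0), ("cat", "quark", 1.0)], directed=False)
```

In the directed case only the source vertex stores the edge, so `"B"` has no entry in `citations`; in the undirected case both endpoints store the same weight.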

Embedding an information network into a low-dimensional space is useful in a variety of applications. To perform the embedding, the network structure must be preserved. The first intuition is that the local network structure, i.e., the local pairwise proximity between vertices, must be preserved. We define the local network structure as the first-order proximity between vertices:

Definition 2 (First-order Proximity): The first-order proximity in a network is the local pairwise proximity between two vertices. For each pair of vertices linked by an edge `(u, v)`, the weight on that edge, `w[uv]`, indicates the first-order proximity between `u` and `v`. If no edge is observed between `u` and `v`, their first-order proximity is 0.

First-order proximity usually implies the similarity of two nodes in a real-world network. For example, people who are friends with each other on a social network tend to share similar interests; pages that link to each other on the World Wide Web tend to discuss similar topics. Because of this importance, many existing graph embedding algorithms, such as Isomap, LLE, Laplacian eigenmaps, and graph factorization, aim to preserve the first-order proximity.

However, in a real-world information network, only a small proportion of the links are observed, and many others are missing [10]. The first-order proximity of a pair of nodes on a missing link is zero, even if the nodes are intrinsically very similar to each other. Therefore, first-order proximity alone is not sufficient to preserve the network structure, and it is important to find an alternative notion of proximity that addresses the sparsity problem. A natural intuition is that vertices that share similar neighbors tend to be similar to each other. For example, in social networks, people who share similar friends often have similar interests and become friends; in word co-occurrence networks, words that always co-occur with the same set of words often have similar meanings. We therefore define the second-order proximity, which complements the first-order proximity and preserves the network structure.

Definition 3 (Second-order Proximity): The second-order proximity between a pair of vertices `(u, v)` in a network is the similarity between their neighborhood network structures. Mathematically, let `p[u] = (w[u, 1], … , w[u, |V|])` denote the first-order proximity of `u` with all the other vertices; then the second-order proximity between `u` and `v` is determined by the similarity between `p[u]` and `p[v]`. If no vertex is linked to both `u` and `v`, the second-order proximity between `u` and `v` is 0.
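To make Definitions 2 and 3 concrete, here is a small sketch on a toy undirected graph. The text only says that second-order proximity is determined by the similarity between `p[u]` and `p[v]`; using cosine similarity for that is an assumption of this example, as are the names `first_order` and `second_order`.

```python
import math

VERTICES = [1, 2, 3, 4]
# Toy undirected graph: each edge is stored once and looked up in both directions.
W = {(1, 3): 1.0, (1, 4): 2.0, (2, 3): 1.0, (2, 4): 2.0}

def weight(u, v):
    """Edge weight w_uv, or 0.0 when no edge is observed."""
    return W.get((u, v), W.get((v, u), 0.0))

def first_order(u, v):
    """First-order proximity: the weight of the edge between u and v."""
    return weight(u, v)

def second_order(u, v):
    """Second-order proximity as cosine similarity of p_u and p_v (assumed measure)."""
    p_u = [weight(u, x) for x in VERTICES]
    p_v = [weight(v, x) for x in VERTICES]
    dot = sum(a * b for a, b in zip(p_u, p_v))
    norm = math.sqrt(sum(a * a for a in p_u)) * math.sqrt(sum(b * b for b in p_v))
    return dot / norm if norm else 0.0
```

Vertices 1 and 2 share no edge, so their first-order proximity is 0, yet they have identical neighborhoods and therefore a high second-order proximity. Vertices 1 and 3 are directly linked but share no neighbor, so their second-order proximity is 0.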

We investigate both first-order and second-order proximity for network embedding, which is defined as follows.

Definition 4 (Large-scale Information Network Embedding): Given a large network `G = (V, E)`, the problem of large-scale information network embedding aims to represent each vertex `v ∈ V` in a low-dimensional space `R^d`, i.e., to learn a function `f[G]: V -> R^d`, where `d << |V|`. In the space `R^d`, both the first-order and second-order proximity between vertices are preserved.

Next, we introduce a large-scale network embedding model that preserves both first-order and second-order proximity.

## IV. LINE: Large-scale Information Network Embedding

A desirable embedding model for real-world information networks must satisfy several requirements: first, it must preserve both the first-order and second-order proximity between vertices; second, it must scale to very large networks, with millions of vertices and billions of edges; third, it must handle networks with arbitrary types of edges: directed or undirected, weighted or unweighted. In this section, we present a new network embedding model called "LINE", which satisfies all three requirements.

### 4.1 Model Description

We first describe how the LINE model preserves the first-order and the second-order proximity separately, and then introduce a simple way to combine the two.

#### First-order proximity

First-order proximity refers to the local pairwise proximity between vertices in the network. To model it, for each undirected edge `(i, j)` we define the joint probability between vertices `v[i]` and `v[j]` as follows:

$$p_1(v_i, v_j) = \frac{1}{1 + \exp(-u_i^{T} \cdot u_j)} \tag{1}$$

where `u[i] ∈ R^d` is the low-dimensional vector representation of vertex `v[i]`. Formula (1) defines a distribution `p(·,·)` over the space `V×V`, and its empirical probability can be defined as `^p[1](i, j) = w[ij]/W`, where `W = Σw[ij], (i, j) ∈ E`. To preserve the first-order proximity, a straightforward approach is to minimize the following objective function:

$$O_1 = d\big(\hat{p}_1(\cdot,\cdot),\; p_1(\cdot,\cdot)\big) \tag{2}$$

where `d(·,·)` is the distance between two distributions. We choose to minimize the KL divergence of the two probability distributions. Replacing `d(·,·)` with the KL divergence and omitting some constants, we get:

$$O_1 = -\sum_{(i,j) \in E} w_{ij} \log p_1(v_i, v_j) \tag{3}$$

Note that the first-order proximity applies only to undirected graphs, not to directed graphs. By finding the `{u[i]}, i = 1 .. |V|` that minimize the objective in formula (3), we can represent every vertex in the `d`-dimensional space.
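The first-order objective can be sketched in a few lines. This assumes that formula (1) is the sigmoid of the inner product of the two vertex vectors and that formula (3) is the weighted negative log-likelihood over edges, as in the LINE paper; the function names and toy embeddings are illustrative only.

```python
import math

def p1(u_i, u_j):
    """Formula (1), as assumed here: sigmoid of the inner product u_i . u_j."""
    dot = sum(a * b for a, b in zip(u_i, u_j))
    return 1.0 / (1.0 + math.exp(-dot))

def objective_o1(edges, emb):
    """Formula (3), as assumed here: -sum of w_ij * log p1(v_i, v_j) over edges."""
    return -sum(w * math.log(p1(emb[i], emb[j])) for i, j, w in edges)

edges = [(0, 1, 3.0), (1, 2, 1.0)]  # (i, j, w_ij) for an undirected toy graph
aligned = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [1.0, 0.0]}
opposed = {0: [1.0, 0.0], 1: [-1.0, 0.0], 2: [1.0, 0.0]}

loss_aligned = objective_o1(edges, aligned)
loss_opposed = objective_o1(edges, opposed)
```

Embeddings that point linked vertices in the same direction give a lower objective than embeddings that point them in opposite directions, which is the behavior the first-order objective is meant to reward.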

#### Second-order proximity

Second-order proximity applies to both directed and undirected graphs. Given a network, without loss of generality, we assume it is directed (an undirected edge can be considered as two directed edges with opposite directions and equal weights). Second-order proximity assumes that vertices sharing many connections to other vertices are similar to each other. In this case, each vertex is also treated as a specific "context", and vertices with similar distributions over the contexts are assumed to be similar. Therefore, each vertex plays two roles: the vertex itself and a specific "context" of other vertices. We introduce two vectors `u[i]` and `u'[i]`, where `u[i]` is the representation of `v[i]` when it is treated as a vertex, and `u'[i]` is the representation of `v[i]` when it is treated as a specific "context". For each directed edge `(i, j)`, we first define the probability of "context" `v[j]` being generated by vertex `v[i]` as:

$$p_2(v_j \mid v_i) = \frac{\exp(u_j'^{T} \cdot u_i)}{\sum_{k=1}^{|V|} \exp(u_k'^{T} \cdot u_i)} \tag{4}$$

where `|V|` is the number of vertices or contexts. For each vertex `v[i]`, formula (4) actually defines a conditional distribution `p[2](·|v[i])` over the contexts (i.e., the entire set of vertices in the network). As mentioned above, second-order proximity assumes that vertices with similar distributions over the contexts are similar to each other. To preserve the second-order proximity, we should make the conditional distribution of the contexts `p[2](·|v[i])`, specified by the low-dimensional representation, close to the empirical distribution `^p[2](·|v[i])`. Therefore, we minimize the following objective function:

$$O_2 = \sum_{i \in V} \lambda_i \, d\big(\hat{p}_2(\cdot \mid v_i),\; p_2(\cdot \mid v_i)\big) \tag{5}$$

where `d(·,·)` is the distance between two distributions and `λ[i]` represents the importance of vertex `v[i]` in the network. The empirical distribution is defined as `^p[2](v[j]|v[i]) = w[ij]/d[i]`, where `d[i]` is the out-degree of vertex `v[i]`. Setting `λ[i] = d[i]`, using the KL divergence as the distance, and omitting some constants, we get:

$$O_2 = -\sum_{(i,j) \in E} w_{ij} \log p_2(v_j \mid v_i) \tag{6}$$

By learning the `{u[i]}, i = 1 .. |V|` and `{u'[i]}, i = 1 .. |V|` that minimize this objective, we can represent each vertex `v[i]` with a `d`-dimensional vector `u[i]`.
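A minimal sketch of the second-order model follows, assuming that formula (4) is a softmax over all contexts of the inner product between a context vector and the vertex vector, and that the KL-based objective is the weighted negative log-likelihood over directed edges, as in the LINE paper. The names `p2` and `objective_o2` and the toy vectors are illustrative. Vertices are indexed `0 .. |V|-1`.

```python
import math

def p2(j, i, vertex_emb, context_emb):
    """Assumed formula (4): softmax over all contexts k of u'_k . u_i, evaluated at j."""
    scores = [sum(a * b for a, b in zip(ctx, vertex_emb[i])) for ctx in context_emb]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    return exps[j] / sum(exps)

def objective_o2(edges, vertex_emb, context_emb):
    """Assumed KL-based objective: -sum of w_ij * log p2(v_j | v_i) over directed edges."""
    return -sum(w * math.log(p2(j, i, vertex_emb, context_emb)) for i, j, w in edges)

vertex_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # u_i
context_emb = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5]]   # u'_i
edges = [(0, 2, 1.0)]                                # directed edge 0 -> 2

total = sum(p2(j, 0, vertex_emb, context_emb) for j in range(3))
loss = objective_o2(edges, vertex_emb, context_emb)
```

Note that every call to `p2` touches all `|V|` context vectors. That cost is what motivates approximations such as negative sampling.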

#### Combining first-order and second-order proximity

To embed a network while preserving both the first-order and the second-order proximity, a simple and effective approach we found in practice is to train the LINE model twice, once preserving the first-order proximity and once preserving the second-order proximity, and then to concatenate, for each vertex, the embeddings trained by the two methods. A more principled way to combine the two kinds of proximity is to train the objective functions (3) and (6) jointly, which we leave as future work.

### 4.2 Model Optimization

Optimizing objective (6) is computationally expensive, because computing the conditional probability `p[2](·|v[i])` requires summing over the entire set of vertices. To address this problem, we adopt the negative sampling approach proposed in [13], which samples multiple negative edges from some noise distribution for each edge `(i, j)`. More specifically, it specifies the following objective function for each edge `(i, j)`:

$$\log \sigma(u_j'^{T} \cdot u_i) + \sum_{m=1}^{K} E_{v_n \sim P_n(v)} \big[\log \sigma(-u_n'^{T} \cdot u_i)\big] \tag{7}$$

where `σ(x) = 1 / (1 + exp(-x))` is the sigmoid function. The first term models the observed edge, and the second term models the negative edges drawn from the noise distribution; `K` is the number of negative edges. We set `P[n](v) ∝ d[v]^(3/4)`, where `d[v]` is the out-degree of vertex `v`.
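The negative-sampling objective for one edge can be sketched as follows. This assumes that formula (7) is the log-sigmoid of the observed edge's score plus `K` log-sigmoid terms for negative contexts drawn from the noise distribution, as in the LINE paper. The expectation is approximated here by drawing `K` samples, and all function names and toy values are illustrative.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def noise_distribution(out_degree):
    """P_n(v) proportional to d_v^(3/4), normalized to sum to 1."""
    raw = {v: d ** 0.75 for v, d in out_degree.items()}
    total = sum(raw.values())
    return {v: r / total for v, r in raw.items()}

def edge_objective(i, j, vertex_emb, context_emb, noise, K, rng):
    """Sampled estimate of the per-edge objective: one positive term, K negative terms."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    value = math.log(sigmoid(dot(context_emb[j], vertex_emb[i])))
    vertices = list(noise)
    negatives = rng.choices(vertices, weights=[noise[v] for v in vertices], k=K)
    for n in negatives:
        value += math.log(sigmoid(-dot(context_emb[n], vertex_emb[i])))
    return value

noise = noise_distribution({"a": 16, "b": 1})  # 16^(3/4) = 8, so P_n(a) = 8/9
zeros = {"a": [0.0, 0.0], "b": [0.0, 0.0]}
value_at_zero = edge_objective("a", "b", zeros, zeros, noise, K=5, rng=random.Random(0))
```

Each edge now costs `K + 1` inner products instead of `|V|`. With all-zero embeddings every term equals `log 0.5`, regardless of which negatives are drawn.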

For the objective function (3), there is a trivial solution: `u[ik] = ∞` for `i = 1, ..., |V|` and `k = 1, ..., d`. To avoid this trivial solution, we can still use the negative sampling approach (7) by simply changing `u'[j]^T` to `u[j]^T`.

We use the asynchronous stochastic gradient algorithm (ASGD) [17] to optimize formula (7). In each step, ASGD samples a mini-batch of edges and then updates the model parameters. If an edge `(i, j)` is sampled, the gradient with respect to the embedding vector `u[i]` of vertex `i` is calculated as:

$$\frac{\partial O_2}{\partial u_i} = -\,w_{ij} \cdot \frac{\partial \log p_2(v_j \mid v_i)}{\partial u_i} \tag{8}$$

Note that the gradient is multiplied by the weight of the edge. This becomes a problem when the edge weights have a high variance. For example, in a word co-occurrence network, some words co-occur many times (e.g., tens of thousands of times), while others co-occur only a few times. In such networks the scales of the gradients diverge, and it is hard to find a good learning rate. If we select a large learning rate according to the edges with small weights, the gradients on edges with large weights will explode; if we select a small learning rate according to the edges with large weights, the gradients on edges with small weights will become too small.
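The scale problem can be seen in a small worked example. The sketch below takes only the positive term `log σ(u'_j · u_i)` of the negative-sampling objective, scales it by the edge weight, and compares gradient norms for a rarely and a frequently co-occurring pair. The function name, vectors, and weights are assumptions for illustration.

```python
import math

def weighted_grad_positive_term(w_ij, u_i, ctx_j):
    """Gradient w.r.t. u_i of w_ij * log sigmoid(u'_j . u_i).

    Since d/dx log sigmoid(x) = 1 - sigmoid(x), the gradient is
    w_ij * (1 - sigmoid(x)) * u'_j with x = u'_j . u_i.
    """
    x = sum(a * b for a, b in zip(ctx_j, u_i))
    coeff = w_ij * (1.0 - 1.0 / (1.0 + math.exp(-x)))
    return [coeff * c for c in ctx_j]

def norm(v):
    return math.sqrt(sum(a * a for a in v))

u_i, ctx_j = [0.1, -0.2], [0.3, 0.4]
rare = weighted_grad_positive_term(3.0, u_i, ctx_j)          # pair seen 3 times
frequent = weighted_grad_positive_term(30000.0, u_i, ctx_j)  # pair seen 30000 times
step_ratio = norm(frequent) / norm(rare)
```

With identical vectors, the two gradients differ by exactly the ratio of the edge weights (10000 here), so no single learning rate gives comparable step sizes for both edges.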

#### Optimization via edge sampling

The intuition for solving the above problem is that if all edges had equal weights (for example, a network with binary edges), choosing an appropriate learning rate would not be a problem. A simple treatment is therefore to unfold a weighted edge into multiple binary edges; for example, an edge with weight `w` is unfolded into `w` binary edges. This solves the problem but significantly increases the memory requirement, especially when edge weights are very large. To avoid this, one can sample from the original edges and treat the sampled edges as binary edges, with sampling probabilities proportional to the original edge weights. With this edge-sampling treatment, the overall objective function remains the same. The problem then boils down to how to sample the edges according to their weights.
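One simple way to draw edges with probability proportional to their weights is to precompute cumulative weights and binary-search a uniform random number into them. This technique is chosen here for brevity and is not necessarily the sampler the authors use; the class name `EdgeSampler` and the toy edges are assumptions.

```python
import bisect
import itertools
import random

class EdgeSampler:
    """Draw edges with probability proportional to weight (cumulative sums + bisect)."""

    def __init__(self, edges):
        self.pairs = [(i, j) for i, j, _ in edges]
        self.cumulative = list(itertools.accumulate(w for _, _, w in edges))

    def sample(self, rng):
        r = rng.random() * self.cumulative[-1]
        # First position whose cumulative weight exceeds r; clamp as a float-safety guard.
        index = min(bisect.bisect_right(self.cumulative, r), len(self.pairs) - 1)
        return self.pairs[index]

sampler = EdgeSampler([("the", "of", 9.0), ("quark", "boson", 1.0)])
rng = random.Random(42)
draws = [sampler.sample(rng) for _ in range(20000)]
heavy_fraction = draws.count(("the", "of")) / len(draws)
```

Each sampled edge is then treated as a binary edge in the gradient update, so the weight no longer scales the gradient. Memory stays linear in the number of original edges, and each draw costs `O(log |E|)`.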

### 4.3 Discussion

We discuss several practical issues of the LINE model.

Low-degree vertices: One practical issue is how to accurately embed vertices with small degrees. Because such a vertex has very few neighbors, it is hard to infer its representation accurately, especially with methods based on the second-order proximity, which rely heavily on the number of "contexts". An intuitive solution is to expand the neighbors of those vertices by adding higher-order neighbors, such as neighbors of neighbors. In this paper, we only consider adding second-order neighbors, i.e., neighbors of neighbors, to each vertex. The weight between vertex `i` and its second-order neighbor `j` is measured as:

$$w_{ij} = \sum_{k \in N(i)} w_{ik} \frac{w_{kj}}{d_k} \tag{9}$$

where `N(i)` is the set of neighbors of `i` and `d[k]` is the out-degree of vertex `k`. In practice, one can add only a subset of the vertices `{j}` that have the largest proximity `w[ij]` with the low-degree vertex `i`.
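The neighbor expansion can be sketched as follows, assuming the weight to a second-order neighbor `j` is the sum over shared neighbors `k` of `w_ik * w_kj / d_k`, with `d_k` taken as the total outgoing weight of `k`, as in the LINE paper. Skipping `i` itself and its existing neighbors, and the `top_n` cutoff, are choices made for this example.

```python
def second_order_neighbors(i, adj, top_n=None):
    """Rank candidate second-order neighbors j of i by sum_k w_ik * w_kj / d_k."""
    candidates = {}
    for k, w_ik in adj.get(i, {}).items():
        out_edges = adj.get(k, {})
        d_k = sum(out_edges.values())  # out-degree of k as total outgoing weight
        for j, w_kj in out_edges.items():
            if j == i or j in adj[i]:
                continue  # assumption: skip i itself and vertices already adjacent to i
            candidates[j] = candidates.get(j, 0.0) + w_ik * w_kj / d_k
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    return ranked if top_n is None else ranked[:top_n]

# Toy network: "i" has one neighbor "a", which also links to "x" and "y".
adj = {
    "i": {"a": 2.0},
    "a": {"i": 2.0, "x": 1.0, "y": 3.0},
    "x": {"a": 1.0},
    "y": {"a": 3.0},
}
expanded = second_order_neighbors("i", adj)
strongest = second_order_neighbors("i", adj, top_n=1)
```

Here `d_a = 6`, so `"y"` receives weight `2 * 3 / 6 = 1` and `"x"` receives `2 * 1 / 6 = 1/3`; with `top_n=1` only `"y"` would be added as a new neighbor of `"i"`.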

New vertices: Another practical issue is how to find the representation of newly arrived vertices. For a new vertex `i`, if its connections to the existing vertices are known, we can obtain the empirical distributions `^p[1](·, v[i])` and `^p[2](·|v[i])` over the existing vertices. According to the objective functions (3) and (6), a straightforward way to obtain the embedding of the new vertex is to minimize either one of the following objective functions:

$$-\sum_{j \in N(i)} w_{ji} \log p_1(v_j, v_i), \quad \text{or} \quad -\sum_{j \in N(i)} w_{ji} \log p_2(v_j \mid v_i) \tag{10}$$

by updating the embedding of the new vertex while keeping the embeddings of the existing vertices fixed. If no connections between the new vertex and the existing vertices are observed, we must resort to other information, such as the textual information of the vertices, which we leave as future work.
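Below is a minimal sketch of embedding a new vertex with the first-order variant, assuming the objective is `-Σ w_ji log σ(u_j · u_i)` over the new vertex's known neighbors. Plain gradient descent with a fixed number of steps is a simplification made here: the objective keeps decreasing as the vector grows, so the step budget acts as early stopping. All names and values are illustrative.

```python
import math

def new_vertex_loss(neighbors, emb, u_new):
    """-sum_j w_ji * log sigmoid(u_j . u_new) over the known neighbors j."""
    total = 0.0
    for j, w in neighbors.items():
        x = sum(a * b for a, b in zip(emb[j], u_new))
        total -= w * math.log(1.0 / (1.0 + math.exp(-x)))
    return total

def embed_new_vertex(neighbors, emb, dim, steps=100, lr=0.1):
    """Gradient descent on the new vertex's vector only; emb is never modified."""
    u_new = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for j, w in neighbors.items():
            x = sum(a * b for a, b in zip(emb[j], u_new))
            coeff = -w * (1.0 - 1.0 / (1.0 + math.exp(-x)))  # d/dx of -w * log sigmoid(x)
            for t in range(dim):
                grad[t] += coeff * emb[j][t]
        u_new = [u - lr * g for u, g in zip(u_new, grad)]
    return u_new

existing = {"a": [1.0, 0.0], "b": [0.0, 1.0]}  # trained embeddings, kept fixed
links = {"a": 2.0}                             # new vertex is linked to "a" with weight 2
u_new = embed_new_vertex(links, existing, dim=2)
loss_before = new_vertex_loss(links, existing, [0.0, 0.0])
loss_after = new_vertex_loss(links, existing, u_new)
```

The new vector moves toward its only neighbor `"a"` and stays orthogonal to the unrelated vertex `"b"`, while the existing embeddings are left untouched.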