## DeepWalk: Online Learning of Social Representations

arXiv: 1403.6652

## II. Problem definition

We consider the problem of classifying members of a social network into one or more categories. More formally, let `G = (V, E)`, where `V` are the members of the network and `E` are its edges, `E ⊆ (V×V)`. Let `G[L] = (V, E, X, Y)` be a partially labeled social network, with attributes `X ∈ R^{|V|×S}`, where `S` is the size of the feature space for each attribute vector, and `Y ∈ R^{|V|×|Y|}`, where `Y` is the set of labels.

In a traditional machine learning classification setting, our goal is to learn a hypothesis `H` that maps elements of `X` to the label set `Y`. In our case, we can also exploit significant information about the structure of `G` embedded in the examples to achieve superior performance.

In the literature, this is known as the relational classification (or collective classification) problem [37]. Traditional approaches to relational classification pose the problem as inference in an undirected Markov network, and then use iterative approximate inference algorithms (such as the iterative classification algorithm [31], Gibbs sampling [14], or label relaxation [18]) to compute the posterior distribution of labels given the network structure.

We propose a different approach to capture the network topology information. Instead of mixing the label space into the feature space, we propose an unsupervised method which learns features that capture the graph structure, independent of the labels' distribution.

This separation between the structural representation and the labeling task avoids cascading errors, which can occur in iterative methods [33]. Moreover, the same representation can be used for multiple classification problems concerning that network.

Our goal is to learn `X[E] ∈ R^{|V|×d}`, where `d` is a small number of latent dimensions. These low-dimensional representations are distributed; that is, each social phenomenon is expressed by a subset of the dimensions, and each dimension contributes to a subset of the social concepts expressed by the space.

Using these structural features, we augment the attribute space to help the classification decision. These features are general, and can be used with any classification algorithm, including iterative methods. However, we believe that the greatest utility of these features lies in their easy integration with simple machine learning algorithms. They scale appropriately in real-world networks, as we will show in Section 6.

## III. Learning social representations

We want to learn social representations with the following characteristics:

- Adaptability – real social networks are constantly evolving; new social relations should not require repeating the learning process all over again.
- Community aware – the distance between latent dimensions should represent a metric for evaluating social similarity between the corresponding members of the network. This allows generalization in networks with homophily.
- Low dimensional – when labeled data is scarce, low-dimensional models generalize better, and speed up convergence and inference.
- Continuous – we require latent representations to model partial community membership in continuous space. In addition to providing a nuanced view of community membership, a continuous representation allows smooth decision boundaries between communities, which enables more robust classification.

To satisfy these requirements, our method learns vertex representations from a stream of short random walks, using optimization techniques originally designed for language modeling. Here, we review the basics of both random walks and language modeling, and describe how their combination satisfies our requirements.

### 3.1 Random walks

We denote a random walk rooted at vertex `v[i]` as `W[v[i]]`. It is a stochastic process with random variables `W1[v[i]], W2[v[i]], ..., Wk[v[i]]`, such that `W(k+1)[v[i]]` is a vertex chosen at random from the neighbors of vertex `v[k]`. Random walks have been used as a similarity measure for a variety of problems in content recommendation [11] and community detection [1]. They are also the foundation of a class of output-sensitive algorithms which use them to compute local community structure information in time sublinear to the size of the input graph [38].
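As an illustration, such a uniform random walk can be sketched in a few lines of Python (the adjacency-dict representation and the function name are ours, not part of the paper's implementation):

```python
import random

def random_walk(graph, root, t, rng=random.Random(0)):
    """Sample a uniform random walk of length t rooted at `root`.
    `graph` is an adjacency dict {vertex: [neighbors]} -- an
    illustrative representation, not the paper's own."""
    walk = [root]
    while len(walk) < t:
        neighbors = graph[walk[-1]]
        if not neighbors:          # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbors))
    return walk

# Toy graph: a cycle on four vertices.
G = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
walk = random_walk(G, root=0, t=5)
```

Every step moves to a uniformly chosen neighbor of the last vertex, which is exactly the stochastic process `W[v[i]]` above.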

### 3.2 Connection: power laws

Now we need a suitable method to capture this information. If the degree distribution of a connected graph follows a power law (i.e. it is scale-free), we observe that the frequency with which vertices appear in short random walks will also follow a power-law distribution.

Word frequency in natural language follows a similar distribution, and techniques from language modeling account for this distributional behavior. To emphasize this similarity, we show two different power-law distributions in Figure 2. The first comes from a series of short random walks on a scale-free graph, and the second comes from the text of 100,000 articles from English Wikipedia.

A core contribution of our work is the idea that techniques used to model natural language (where the symbol frequency follows a power-law distribution, or Zipf's law) can be re-purposed to model community structure in networks. We review the growing body of work in language modeling in the rest of this section, and show how to transform it to learn vertex representations that satisfy our criteria.

### 3.3 Language modeling

The goal of language modeling is to estimate the likelihood of a specific sequence of words appearing in a corpus. More formally, given a sequence of words `w[0], w[1], ..., w[n]`, where `w[i] ∈ V` (`V` is the vocabulary), we would like to maximize `Pr(w[n] | w[0], w[1], ..., w[n-1])` over all the training corpus.

Recent work on representation learning has focused on the use of probabilistic neural networks to construct general representations of words, which extend the scope of language modeling beyond its original goal.

In this work, we present a generalization of language modeling to explore the graph through a stream of short random walks. These walks can be thought of as short sentences and phrases in a special language; the direct analog is to estimate the likelihood of observing vertex `v[i]` given all the previous vertices visited so far in the random walk.

Our goal is to learn a latent representation, not only a probability distribution of node co-occurrences, so we introduce a mapping function `Φ: v ∈ V -> R^{|V|×d}`. This mapping `Φ` represents the latent social representation associated with each vertex `v` in the graph. (In practice, we represent `Φ` by a `|V|×d` matrix of free parameters, which will serve later on as our `X[E]`.) The problem then is to estimate the likelihood:

`Pr(v[i] | (Φ(v[1]), Φ(v[2]), ..., Φ(v[i-1])))`    (1)

However, as the walk length grows, computing this objective function becomes infeasible.

Recent relaxations in language modeling [26,27] turn the prediction problem around. First, instead of using the context to predict a missing word, they use one word to predict the context. Second, the context is composed of the words appearing to both the right and left of the given word. Finally, the order constraint on the problem is removed; instead, the model is required to maximize the probability of any word appearing in the context without knowledge of its offset from the given word.

In terms of vertex representation modeling, this yields the optimization problem:

minimize over `Φ`:  `− log Pr({v[i−w], ..., v[i+w]} \ v[i] | Φ(v[i]))`    (2)

We find these relaxations particularly desirable for social representation learning. First, the order-independence assumption better captures the sense of nearness provided by random walks. Moreover, this relaxation is quite useful for speeding up the training time by building small models, as one vertex is given at a time.

Solving the optimization problem from Equation (2) builds representations that capture the shared similarities in local graph structure between vertices. Vertices which have similar neighborhoods will acquire similar representations (encoding co-citation similarity), allowing generalization on machine learning tasks.
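To make the relaxed objective concrete, the (vertex, context) training pairs it is optimized over can be generated as follows (a hypothetical helper of our own, not code from the paper):

```python
def window_pairs(walk, w):
    """Yield (target, context) vertex pairs from a walk: each vertex is
    paired with every other vertex at most w positions away, with no
    ordering or offset information kept (the relaxation described above)."""
    for j, target in enumerate(walk):
        lo, hi = max(0, j - w), min(len(walk), j + w + 1)
        for k in range(lo, hi):
            if k != j:
                yield (target, walk[k])

pairs = list(window_pairs([1, 2, 3, 4], w=1))
# each interior vertex pairs with both its neighbors, endpoints with one
```

Note that `(2, 1)` and `(2, 3)` are both emitted for the same target, while their left/right offsets are discarded, matching the order-independence assumption.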

By combining truncated random walks with neural language models, we formulate a method which satisfies all of our desired properties. This method generates representations of social networks that are low-dimensional and exist in a continuous vector space. Its representations encode latent forms of community membership, and because the method outputs useful intermediate representations, it can adapt to changing network topology.

## IV. Method

In this section, we will discuss the main components of the algorithm. We also introduce several variants of our method and discuss their advantages.

### 4.1 Overview

As in any language modeling algorithm, the only required input is a corpus and a vocabulary `V`. DeepWalk takes a set of short truncated random walks as its corpus, and the graph vertices as its vocabulary (`V = V`). While it is beneficial to know `V` and the frequency distribution of vertices in the random walks ahead of training, it is not necessary for the algorithm to work, as we will show in Section 4.2.2.

### 4.2 Algorithm: DeepWalk

The algorithm consists of two main components: first a random walk generator, and second an update procedure.

The random walk generator takes a graph `G` and samples uniformly a random vertex `v[i]` as the root of a random walk `W[v[i]]`. A walk samples uniformly from the neighbors of the last vertex visited until the maximum length `t` is reached. While we set the length of our random walks in the experiments to be fixed, there is no restriction requiring random walks to be of the same length. These walks could have restarts (i.e. a teleportation probability of returning to their root), but our preliminary results did not show any advantage of using restarts. In practice, our implementation specifies a number `γ` of random walks of length `t` to start at each vertex.

```
Algorithm 1 DeepWalk(G, w, d, γ, t)
-----------------------------------
Input: graph G(V, E)
window size w
embedding size d
walks per vertex γ
walk length t
Output: matrix of vertex representations Φ ∈ R^{|V|×d}
1: Initialization: Sample Φ from U[|V|×d]
2: Build a binary Tree T from V
3: for i = 0 to γ do
4: O = Shuffle(V)
5: for each vi ∈ O do
6: W[v[i]] = RandomWalk(G, v[i], t)
7: SkipGram(Φ, W[v[i]], w)
8: end for
9: end for
```

Lines 3-9 of Algorithm 1 show the core of our approach. The outer loop specifies the number of times, `γ`, that we should start random walks at each vertex. We think of each iteration as making a pass over the data, and sample one walk per node during this pass. At the start of each pass we generate a random ordering in which to traverse the vertices. This is not strictly required, but is well known to speed up the convergence of stochastic gradient descent.

In the inner loop, we iterate over all the vertices of the graph. For each vertex `v[i]`, we generate a random walk `|W[v[i]]| = t`, and then use it to update our representations (line 7). We use the SkipGram algorithm [26] to update these representations in accordance with the objective function in Equation (2).
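As a sketch, lines 3-9 of Algorithm 1 can be written out as corpus generation alone (names are ours; in the actual algorithm, each walk is consumed by SkipGram on line 7 as it is generated, rather than collected):

```python
import random

def deepwalk_corpus(graph, gamma, t, seed=0):
    """Generate the walk corpus of Algorithm 1 (lines 3-9): gamma
    passes over a freshly shuffled vertex order, one walk of length t
    per vertex. The SkipGram update (line 7) is omitted in this sketch."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(gamma):
        order = list(graph)
        rng.shuffle(order)                 # random traversal order (line 4)
        for v in order:
            walk = [v]                     # RandomWalk(G, v, t) (line 6)
            while len(walk) < t:
                walk.append(rng.choice(graph[walk[-1]]))
            corpus.append(walk)
    return corpus

G = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
corpus = deepwalk_corpus(G, gamma=2, t=4)
```

Each pass samples exactly one walk per vertex, so the corpus contains `γ·|V|` walks in total.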

#### 4.2.1 SkipGram

SkipGram is a language model that maximizes the co-occurrence probability among the words that appear within a window `w` in a sentence [26].

```
Algorithm 2 SkipGram(Φ, W[v[i]], w)
-----------------------------------
1: for each v[j] ∈ W[v[i]] do
2: for each u[k] ∈ W[v[i]][j − w : j + w] do
3: J(Φ) = − log Pr(u[k] | Φ(v[j]))
4: Φ = Φ − α * ∂J / ∂Φ
5: end for
6: end for
```

Algorithm 2 iterates over all possible collocations in a random walk that appear within the window `w` (lines 1-2). For each, we map each vertex `v[j]` to its current representation vector `Φ(v[j]) ∈ R^d` (see Figure 3B). Given the representation of `v[j]`, we would like to maximize the probability of its neighbors in the walk (line 3). We can learn such a posterior distribution using several choices of classifiers. For example, modeling this problem with logistic regression would result in a huge number of labels, equal to `|V|`, which could be in the millions or billions. Such models require vast computational resources, possibly spanning entire computer clusters [3]. To speed up training, hierarchical softmax [29,30] can be used to approximate the probability distribution.
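To see why line 3 is expensive with a flat softmax, here is a minimal sketch of the `O(|V|)` posterior computation that hierarchical softmax is meant to avoid (illustrative names and a toy 3-vertex model, not the paper's code):

```python
import math

def softmax_prob(phi, target_vec, u_k):
    """Naive Pr(u[k] | Φ(v[j])): a softmax over dot products against
    every vertex vector in `phi` (a dict vertex -> list[float]).
    Each call touches all |V| vectors -- the O(|V|) partition
    function that hierarchical softmax replaces."""
    scores = {u: sum(a * b for a, b in zip(target_vec, vec))
              for u, vec in phi.items()}
    m = max(scores.values())                     # subtract max for stability
    exps = {u: math.exp(s - m) for u, s in scores.items()}
    z = sum(exps.values())                       # partition function: O(|V|)
    return exps[u_k] / z

phi = {0: [0.1, 0.2], 1: [0.3, -0.1], 2: [-0.2, 0.4]}
p = sum(softmax_prob(phi, phi[0], u) for u in phi)   # probabilities sum to 1
```

With `|V|` in the millions, the normalization sum alone dominates every single update, which motivates the tree-based approximation below.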

#### 4.2.2 Hierarchical softmax

Given `u[k] ∈ V`, calculating `Pr(u[k] | Φ(v[j]))` directly is not feasible: computing the partition function (normalization factor) is expensive. If we assign the vertices to the leaves of a binary tree, the prediction problem turns into maximizing the probability of a specific path in the tree (see Figure 3C). If the path to vertex `u[k]` is identified by a sequence of tree nodes `(b[0], b[1], ..., b[⌈log|V|⌉])`, with `b[0] = root` and `b[⌈log|V|⌉] = u[k]`, then:

`Pr(u[k] | Φ(v[j])) = Π(l=1..⌈log|V|⌉) Pr(b[l] | Φ(v[j]))`

Now, `Pr(b[l] | Φ(v[j]))` can be modeled by a binary classifier assigned to the parent of node `b[l]`. This reduces the computational complexity of calculating `Pr(u[k] | Φ(v[j]))` from `O(|V|)` to `O(log|V|)`.

We can speed up the training process further by assigning shorter paths to the vertices that occur frequently in the random walks. Huffman coding is used to reduce the access time of frequent elements in the tree.
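The path-probability factorization can be sketched for a complete binary tree as follows (a toy heap-style indexing of our own; an actual implementation would use the Huffman tree described above):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hs_prob(leaf, depth, node_vecs, phi_v):
    """Hierarchical-softmax sketch: Pr(u[k] | Φ(v[j])) as a product of
    binary decisions along the root-to-leaf path of a complete binary
    tree with 2**depth leaves. Each internal node b[l] owns a vector in
    `node_vecs`; branching left uses p, branching right uses 1 - p.
    Costs O(log|V|) sigmoids instead of an O(|V|) partition function."""
    prob, node = 1.0, 0                      # node 0 is the root b[0]
    for l in range(depth):
        bit = (leaf >> (depth - 1 - l)) & 1  # next branch: 0 = left, 1 = right
        p_left = sigmoid(sum(a * b for a, b in zip(node_vecs[node], phi_v)))
        prob *= p_left if bit == 0 else (1.0 - p_left)
        node = 2 * node + 1 + bit            # heap-style child index
    return prob

depth = 2                                    # |V| = 4 leaves, 3 internal nodes
node_vecs = {0: [0.2, -0.1], 1: [0.5, 0.3], 2: [-0.4, 0.1]}
phi_v = [1.0, 0.5]
total = sum(hs_prob(u, depth, node_vecs, phi_v) for u in range(4))
```

Because each internal node splits its probability mass between its two children, the leaf probabilities sum to 1 without ever normalizing over the whole vocabulary.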

#### 4.2.3 Optimization

The model parameter set is `{Φ, T}`, where the size of each is `O(d|V|)`. Stochastic gradient descent (SGD) [4] is used to optimize these parameters (line 4, Algorithm 2). The derivatives are estimated using the back-propagation algorithm. The learning rate `α` for SGD is initially set to 2.5% at the beginning of training, and then decreased linearly with the number of vertices seen so far.
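A minimal sketch of this decay schedule (the initial value 0.025 matches the 2.5% above; the lower bound is our own safeguard, not from the paper):

```python
def learning_rate(seen, total, alpha0=0.025, floor=1e-4):
    """Linearly decay the SGD step size with the number of training
    vertices seen so far, never dropping below `floor` (our own
    assumption to keep late updates nonzero)."""
    return max(floor, alpha0 * (1.0 - seen / total))

rates = [learning_rate(s, 100) for s in (0, 50, 100)]
```

The rate starts at `alpha0` and falls linearly toward the floor as training progresses.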

### 4.3 Parallelization

As shown in Figure 2, the frequency distribution of vertices in the random walks of a social network, like that of words in natural language, follows a power law. This produces a long tail of infrequent vertices, so the updates that affect `Φ` will be sparse in nature. This allows us to use an asynchronous version of stochastic gradient descent (ASGD) in the multi-worker case. Given that our updates are sparse, and that we do not acquire a lock to access the model's shared parameters, ASGD will achieve an optimal rate of convergence [36]. While we run our experiments on one machine using multiple threads, this technique has been demonstrated to be highly scalable, and can be used in very large scale machine learning [8]. Figure 4 presents the effects of parallelizing DeepWalk. It shows that processing speed on the BlogCatalog and Flickr networks scales consistently as we increase the number of workers to 8 (Figure 4a). It also shows that there is no loss of predictive performance relative to running DeepWalk serially (Figure 4b).

### 4.4 Algorithm variants

Here we discuss some variants of our proposed method which we believe may be of interest.

#### 4.4.1 Streaming

One interesting variant of this method is a streaming approach, which can be implemented without knowledge of the entire graph. In this variant, small walks from the graph are passed directly to the representation learning code, and the model is updated directly. Some modifications to the learning process are also necessary. First, using a decaying learning rate is no longer possible; instead, we can initialize the learning rate `α` to a small constant value. This will take longer to learn, but may be worth it in some applications. Second, we cannot necessarily build a tree of parameters in advance. If the cardinality of `V` is known (or can be bounded), we can build the hierarchical softmax tree for that maximum value, and assign vertices to one of the remaining leaves when they are first seen. If we have the ability to estimate vertex frequency a priori, we can still use Huffman coding to decrease frequent element access times.

#### 4.4.2 Non-random walks

Some graphs are created as a by-product of agents interacting with a sequence of elements (e.g. users' navigation of pages on a website). When a graph is created by such a stream of non-random walks, we can use this process to feed the modeling phase directly. Graphs sampled in this way will capture not only information related to the network structure, but also the frequency at which paths are traversed.

In our view, this variant also subsumes language modeling. Sentences can be viewed as purposed walks through an appropriately designed language network, and language models like SkipGram are designed to capture this behavior.

This approach can be combined with the streaming variant (Section 4.4.1) to train features on a continuously evolving network, without ever explicitly constructing the entire graph. Maintaining representations with this technique could enable web-scale classification without the hassle of dealing with a web-scale graph.