Swiftorial Logo
Home
Swift Lessons
Matchups
CodeSnaps
Tutorials
Career
Resources

Link Prediction Tutorial

Introduction

Link Prediction is a fundamental problem in graph theory and machine learning. It involves predicting the existence of a link between two nodes in a graph. This technique is widely used in various applications such as social networks, recommender systems, and biological networks.

Basics of Graph Theory

A graph is a collection of nodes (or vertices) and edges (or links) that connect pairs of nodes. In the context of link prediction, the graph can be represented as G = (V, E) where V is the set of nodes and E is the set of edges.

Problem Definition

The goal of link prediction is to estimate the likelihood of the existence of a link between two nodes. Formally, given a graph G = (V, E), the task is to predict whether an edge exists between nodes u and v, where u, v ∈ V and (u, v) ∉ E.

Approaches to Link Prediction

There are several approaches to link prediction:

  • Heuristic Methods
  • Probabilistic Models
  • Machine Learning-based Methods
  • Graph Embedding-based Methods

Heuristic Methods

Heuristic methods use simple rules based on node and edge properties. Common heuristics include:

  • Common Neighbors: The number of common neighbors shared by two nodes.
  • Jaccard Coefficient: The ratio of the number of common neighbors to the total number of neighbors.
  • Adamic/Adar Index: A measure that assigns higher scores to pairs of nodes with rarer common neighbors.
  • Preferential Attachment: The product of the degrees of the two nodes.

Example: Common Neighbors

Consider the following graph:

Graph G = (V, E) where

V = {A, B, C, D}

E = {(A, B), (B, C), (C, D)}

To predict if there is a link between A and C using the Common Neighbors heuristic:

Common neighbors of A and C: {B}

Number of common neighbors: 1

Probabilistic Models

Probabilistic models estimate the likelihood of an edge based on the probability distributions of the graph. Popular models include:

  • Stochastic Block Model (SBM)
  • Exponential Random Graph Models (ERGM)

Machine Learning-based Methods

These methods utilize machine learning algorithms to learn the patterns of link formation. The process involves:

  • Feature Extraction: Extracting features from the graph such as node degrees, clustering coefficients, etc.
  • Training: Using these features to train a machine learning model (e.g., logistic regression, random forests).
  • Prediction: Using the trained model to predict the existence of links.

Example: Logistic Regression for Link Prediction

Step 1: Feature Extraction

For each pair of nodes (u, v), extract features such as:

  • Degree of node u
  • Degree of node v
  • Number of common neighbors

Step 2: Training

Use a logistic regression model to train on the extracted features.

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

Step 3: Prediction

Predict the existence of links using the trained model.

predictions = model.predict(X_test)

Graph Embedding-based Methods

Graph embedding methods map nodes to a lower-dimensional vector space. These embeddings can then be used to predict links. Popular techniques include:

  • Node2Vec
  • DeepWalk
  • Graph Convolutional Networks (GCNs)

Example: Node2Vec Embedding

Step 1: Generate Node Embeddings

Use Node2Vec to generate embeddings for each node.

from node2vec import Node2Vec
node2vec = Node2Vec(graph, dimensions=64, walk_length=30, num_walks=200, workers=4)
model = node2vec.fit(window=10, min_count=1, batch_words=4)

Step 2: Extract Embeddings

Extract embeddings for node pairs and use them as features for a classifier.

embeddings = model.wv

Step 3: Train Classifier

Train a classifier (e.g., logistic regression) on the embeddings to predict links.

Conclusion

Link prediction is a crucial task in graph-based learning, with applications ranging from social network analysis to bioinformatics. By understanding and implementing various methods, such as heuristic approaches, probabilistic models, machine learning-based techniques, and graph embedding methods, one can effectively predict the presence of links in a graph.