Link prediction metrics
This document summarizes the initial link prediction metrics supported by relationalstats.linkprediction.proxfun_full.
Notation
: graph. : node pair to score. : neighborhood of node . : degree of node . : adjacency matrix. : number of walks of length from to .
For the first release, the recommended default is directed=False, which scores pairs on an undirected version of the graph.
Metric families
| Mechanism | Metrics |
|---|---|
| Triadic closure | common_neighbors, jaccard, adamic_adar |
| Popularity / degree | preferential_attachment, degree |
| Structural similarity | salton, sorensen, lhn_local |
| Hub dynamics | hub_promoted, hub_depressed |
| Diffusion / flow | resource_allocation |
| Geodesic proximity | shortest_path |
| Extended short paths | local_path |
| Global centrality | katz |
| Stochastic diffusion | rwr |
| Random-walk global metric | act |
Common neighbors
Counts the number of shared neighbors between two nodes.
Jaccard coefficient
Measures neighborhood overlap normalized by neighborhood union.
Adamic-Adar index
Penalizes shared neighbors with high degree.
Preferential attachment
Scores pairs highly when both nodes have high degree.
Resource allocation index
Similar to Adamic-Adar, but with a stronger penalty for high-degree shared neighbors.
Salton index / cosine similarity
Cosine-like structural similarity between node neighborhoods.
Sorensen index
Another normalized neighborhood-overlap score.
Hub promoted index
Promotes similarity when the smaller-degree node shares many of its neighbors.
Hub depressed index
Penalizes similarity when one of the nodes is a high-degree hub.
Leicht-Holme-Newman local index
Reduces degree-driven bias in common-neighbor scores.
Shortest path
Lower values indicate closer nodes. Disconnected nodes receive infinite distance in the current implementation.
Local path index
The default value is:
This combines length-2 and length-3 walks.
Katz index
Matrix form:
The default value is:
Katz is a global metric and can be expensive for large graphs.
Random walk with restart
Random walk with restart estimates the diffusion probability of reaching a node from a source while repeatedly returning to the source.
The current implementation uses:
where:
is the row-normalized transition matrix. is the restart probability.
The default value is:
RWR is a global matrix-based metric and can be expensive for large graphs.
Degree features
These features expose the individual degree of each node in the pair.
Average commute time score
Average commute time is based on the Moore-Penrose pseudoinverse of the graph Laplacian.
The implemented score is:
Disconnected pairs receive score 0.0.
This metric requires dense linear algebra and is expensive for large graphs.
Division by zero policy
When a denominator is zero, the score is set to 0.0.
This affects metrics such as:
jaccardsaltonsorensenhub_promotedhub_depressedlhn_localact
Scalability notes
Sparse matrix operations are used where possible for local metrics.
The following metrics are potentially expensive:
katzrwract
These should be used carefully on large graphs.