Node classification with HashGNN
This Jupyter notebook is hosted here in the Neo4j Graph Data Science Client Github repository.
The notebook exemplifies how to use the graphdatascience library to:
- Import an IMDB dataset with Movie, Actor and Director nodes directly into GDS using a convenience data loader
- Configure a node classification pipeline with HashGNN embeddings for predicting the genre of Movie nodes
- Train the pipeline with autotuning and inspect the results
- Make predictions for movie nodes without a specified genre
1. Prerequisites
Running this notebook requires a Neo4j database server with a recent version (2.3 or newer) of the Neo4j Graph Data Science library (GDS) plugin installed. We recommend using Neo4j Desktop with GDS, or AuraDS.
Additionally, the version of the graphdatascience library used must be 1.6 or newer.
# Install necessary Python dependencies
%pip install graphdatascience
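Optionally, we can check that the installed client version meets the 1.6 requirement mentioned above. This is just a convenience check using the Python standard library, not something the rest of the notebook depends on.
# Optional: verify that the installed graphdatascience client is version 1.6 or newer
from importlib.metadata import version
print(f"graphdatascience client version: {version('graphdatascience')}")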
2. Setup
We start by importing our dependencies and setting up our GDS client connection to the database.
# Import our dependencies
import os
from graphdatascience import GraphDataScience
# Get Neo4j DB URI, credentials and name from environment if applicable
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
NEO4J_AUTH = None
NEO4J_DB = os.environ.get("NEO4J_DB", "neo4j")
if os.environ.get("NEO4J_USER") and os.environ.get("NEO4J_PASSWORD"):
NEO4J_AUTH = (
os.environ.get("NEO4J_USER"),
os.environ.get("NEO4J_PASSWORD"),
)
gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH, database=NEO4J_DB)
from graphdatascience.server_version.server_version import ServerVersion
assert gds.server_version() >= ServerVersion(2, 3, 0)
3. Loading the IMDB dataset
Next we use the graphdatascience built-in IMDB loader to get data into our GDS server. This should give us a graph with Movie, Actor and Director nodes, connected by ACTED_IN and DIRECTED_IN relationships.
Note that in a “real world scenario”, we would probably project our own data from a Neo4j database into GDS instead, or use gds.graph.construct to create a graph from our own client-side data.
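As a purely illustrative sketch (not run as part of this notebook), constructing a graph from client-side pandas DataFrames might look something like the following; the graph name and the tiny example data are made up, and we leave the construct call commented out so it does not interfere with the loading below.
# Illustrative only: build a small GDS graph from client-side pandas DataFrames
import pandas as pd
example_nodes = pd.DataFrame(
    {
        "nodeId": [0, 1, 2],
        "labels": [["Movie"], ["Actor"], ["Director"]],
    }
)
example_rels = pd.DataFrame(
    {
        "sourceNodeId": [1, 2],
        "targetNodeId": [0, 0],
        "relationshipType": ["ACTED_IN", "DIRECTED_IN"],
    }
)
# G_example = gds.graph.construct("example-graph", example_nodes, example_rels)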
# Run the loading to obtain a `Graph` object representing our GDS projection
G = gds.graph.load_imdb()
Let’s inspect our graph to see what it contains.
print(f"Overview of G: {G}")
print(f"Node labels in G: {G.node_labels()}")
print(f"Relationship types in G: {G.relationship_types()}")
It looks as expected, though we notice that some nodes have the UnclassifiedMovie label. Indeed, these are the nodes whose genre we wish to predict with a node classification model. Let’s look at the node properties present on the various node labels to see this more clearly.
print(f"Node properties per label:\n{G.node_properties()}")
So we see that the Movie nodes have the genre property, which means that we can use these nodes when training our model later. The UnclassifiedMovie nodes, as expected, do not have the genre property, which is exactly what we want to predict.
Additionally, we notice that all nodes have a plot_keywords property. This is a binary “bag-of-words” type feature vector representing which out of 1256 plot keywords describe a certain node. These feature vectors will be used as input to our HashGNN node embedding algorithm later.
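As an optional sanity check, we can stream a few of these vectors back to the client and confirm their shape. The snippet below assumes the gds.graph.nodeProperties.stream endpoint available in recent client versions, with its default long-format result containing a propertyValue column.
# Optional sanity check: stream `plot_keywords` for `Movie` nodes and inspect one vector
keywords_df = gds.graph.nodeProperties.stream(G, ["plot_keywords"], ["Movie"])
first_vector = keywords_df["propertyValue"].iloc[0]
print(f"Length of first plot_keywords vector: {len(first_vector)}")
print(f"First few entries (expected to be binary): {list(first_vector[:10])}")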
4. Configuring the training pipeline
Now that we have loaded and understood the data we want to analyze, we can move on to look at the tools for actually making the aforementioned genre predictions of the UnclassifiedMovie nodes.
Since we want to predict discrete valued properties of nodes, we will use a node classification pipeline.
# Create an empty node classification pipeline
pipe, _ = gds.beta.pipeline.nodeClassification.create("genre-predictor-pipe")
To be able to compare our accuracy score to the current state of the art methods on this dataset, we want to use the same test set size as in the Graph Transformer Networks paper (NeurIPS 2019). We configure our pipeline accordingly.
# Set the test set size to be 79.6 % of the entire set of `Movie` nodes
_ = pipe.configureSplit(testFraction=0.796)
Please note that we would get a much better model by using a more standard train-test split, like 80/20 or so. And typically this would be the way to go for real use cases.
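For reference, such a more conventional configuration would look like the commented-out line below; we keep the paper's split here for comparability.
# Not used in this notebook: a more standard 80/20 train-test split would be configured like this
# _ = pipe.configureSplit(testFraction=0.2)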
5. The HashGNN node embedding algorithm
As the last part of the training pipeline, there will be an ML training
algorithm. If we use the plot_keywords
directly as our feature input
to the ML algorithm, we will not utilize any of the relationship data we
have in our graph. Since relationships would likely enrich our features
with more valuable information, we will use a node embedding algorithm
which takes relationships into account, and use its output as input to
the ML training algorithm.
In this case we will use the HashGNN node embedding algorithm which is new in GDS 2.3. Contrary to what the name suggests, HashGNN is not a supervised neural learning model. It is in fact an unsupervised algorithm. Its name comes from the fact that the algorithm design is inspired by that of graph neural networks, in that it does message passing interleaved with transformations on each node. But instead of doing neural transformations like most GNNs, its transformations are done by locality sensitive min-hashing. Since the hash functions used are randomly chosen independent of the input data, there is no need for training.
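To build some intuition, here is a highly simplified toy sketch of that idea in plain Python. It is not the GDS implementation: binary feature vectors are represented as sets of feature indices, the choice between "own" and "neighbor" features is made with a plain random draw rather than the hash-based scheme GDS uses, and all names in the snippet are made up for illustration.
# Toy illustration only: min-hash based message passing in the spirit of HashGNN
import random

def toy_hashgnn(features, neighbors, iterations=2, embedding_density=8, neighbor_influence=0.7, seed=41):
    rng = random.Random(seed)
    salts = [rng.random() for _ in range(embedding_density)]  # one random hash function per sampled position
    current = {node: set(feats) for node, feats in features.items()}
    for _ in range(iterations):
        updated = {}
        for node, own_feats in current.items():
            neighbor_feats = set().union(*(current[n] for n in neighbors.get(node, [])))
            new_feats = set()
            for salt in salts:
                # Sample either from the node's own features or from its neighborhood (message passing),
                # then keep the feature with the smallest hash value (locality-sensitive min-hashing)
                source = neighbor_feats if (neighbor_feats and rng.random() < neighbor_influence) else own_feats
                if source:
                    new_feats.add(min(source, key=lambda f: hash((salt, f))))
            updated[node] = new_feats
        current = updated
    return current

# Two tiny "movies" sharing an "actor" end up exchanging some of their keyword features
print(toy_hashgnn({"m1": {0, 3}, "m2": {5}, "a1": {7}}, {"m1": ["a1"], "m2": ["a1"], "a1": ["m1", "m2"]}))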
We will give HashGNN the plot_keywords node properties as input, and it will output new feature vectors for each node, enriched by message passing over relationships. Since the plot_keywords vectors are already binary, we don’t have to do any binarization of the input.
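For illustration only: if our input features had not been binary, the HashGNN documentation describes a binarizeFeatures option that rounds them into binary features. A hypothetical node property step using it, with made-up property name and parameter values and not used in this notebook, could look like this:
# Hypothetical example (not run here): binarize non-binary input features inside the HashGNN step
# _ = pipe.addNodeProperty(
#     "beta.hashgnn",
#     mutateProperty="embedding",
#     iterations=4,
#     embeddingDensity=512,
#     featureProperties=["some_dense_feature"],
#     binarizeFeatures={"dimension": 1024, "threshold": 32},
# )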
Since we have multiple node labels and relationships, we make sure to enable the heterogeneous capabilities of HashGNN by setting heterogeneous=True. Notably we also declare that we want to include all kinds of nodes, not only the Movie nodes we will train on, by explicitly specifying the contextNodeLabels.
Please see the HashGNN documentation for more on this algorithm.
# Add a HashGNN node property step to the pipeline
_ = pipe.addNodeProperty(
    "beta.hashgnn",
    mutateProperty="embedding",
    iterations=4,
    heterogeneous=True,
    embeddingDensity=512,
    neighborInfluence=0.7,
    featureProperties=["plot_keywords"],
    randomSeed=41,
    contextNodeLabels=G.node_labels(),
)
# Set the embeddings vectors produced by HashGNN as feature input to our ML algorithm
_ = pipe.selectFeatures("embedding")
6. Setting up autotuning
It is time to set up the ML algorithms for the training part of the pipeline.
In this example we will add logistic regression and random forest algorithms as candidates for the final model. Each candidate will be evaluated by the pipeline, and the best one, according to our specified metric, will be chosen.
It is hard to know how much regularization we need so as not to overfit our models on the training dataset, and for this reason we will use the autotuning capabilities of GDS to help us out. The autotuning algorithm will try out several values for the regularization parameters penalty (of logistic regression) and minSplitSize (of random forest) and choose the best ones it finds.
Please see the GDS manual to learn more about autotuning, logistic regression and random forest.
# Add logistic regression as a candidate ML algorithm for the training
# Provide an interval for the `penalty` parameter to enable autotuning for it
_ = pipe.addLogisticRegression(penalty=(0.1, 1.0), maxEpochs=1000, patience=5, tolerance=0.0001, learningRate=0.01)
# Add random forest as a candidate ML algorithm for the training
# Provide an interval for the `minSplitSize` parameter to enable autotuning for it
_ = pipe.addRandomForest(minSplitSize=(2, 100), criterion="ENTROPY")
7. Training the pipeline
The configuration is done, and we are now ready to kick off the training of our pipeline and see what results we get.
In our training call, we provide what node label and property we want the training to target, as well as the metric that will determine how the best model candidate is chosen.
# Call train on our pipeline object to run the entire training pipeline and produce a model
model, _ = pipe.train(
    G,
    modelName="genre-predictor-model",
    targetNodeLabels=["Movie"],
    targetProperty="genre",
    metrics=["F1_MACRO"],
    randomSeed=42,
)
Let’s inspect the model that was created by the training pipeline.
print(f"Accuracy scores of trained model:\n{model.metrics()['F1_MACRO']}")
print(f"Winning ML algorithm candidate config:\n{model.best_parameters()}")
As we can see, the best ML algorithm configuration found by the autotuning was logistic regression with penalty=0.159748.
Further, we note that the test set F1 score is 0.59118347, which is really good when compared to the scores of other algorithms on this dataset in the literature. More on this in the Conclusion section below.
8. Making new predictions
We can now use the model produced by our training pipeline to predict genres of the UnclassifiedMovie nodes.
# Predict `genre` for `UnclassifiedMovie` nodes and stream the results
predictions = model.predict_stream(G, targetNodeLabels=["UnclassifiedMovie"], includePredictedProbabilities=True)
print(f"First predictions of unclassified movie nodes:\n{predictions.head()}")
In this case we streamed the prediction results back to our client application, but we could for example also have mutated our GDS graph represented by G by calling model.predict_mutate instead.
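As a sketch of that alternative (not run here, and with made-up property names), a mutate call could look like the following:
# Alternative (not run in this notebook): write predictions back to the in-memory graph `G`
# model.predict_mutate(
#     G,
#     targetNodeLabels=["UnclassifiedMovie"],
#     mutateProperty="predictedGenre",
#     predictedProbabilityProperty="predictedGenreProbabilities",
# )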
9. Cleaning up
Optionally we can now clean up our GDS state, to free up memory for other tasks.
# Drop the GDS graph represented by `G` from the GDS graph catalog
_ = G.drop()
# Drop the GDS training pipeline represented by `pipe` from the GDS pipeline catalog
_ = pipe.drop()
# Drop the GDS model represented by `model` from the GDS model catalog
_ = model.drop()
10. Conclusion
By using only the GDS library and its client, we were able to train a node classification model using the sophisticated HashGNN node embedding algorithm and logistic regression. Our logistic regression configuration was automatically chosen as the best candidate among a number of other algorithms (like random forest with various configurations) through a process of autotuning. We were able to achieve this with very little code, and with very good scores.
Though we used a convenience method of the graphdatascience library to load an IMDB dataset into GDS, it would be very easy to replace this part with something like a projection from a Neo4j database to create a more realistic production workflow.
10.1. Comparison with other methods
As mentioned we tried to mimic the setup of the benchmarks in the NeurIPS paper Graph Transformer Networks, in order to compare with the current state of the art methods. A difference from this paper is that they have a predefined train-test set split, whereas we just generate a split (with the same size) uniformly at random within our training pipeline. However, we have no reason to think that the predefined split in the paper was not also generated uniformly at random. Additionally, they use length 64 float embeddings (64 * 32 = 2048 bits), whereas we use length 1256 bit embeddings with HashGNN.
The scores they observe are the following:
| Algorithm    | Test set F1 score (%) |
|--------------|-----------------------|
| DeepWalk     | 32.08                 |
| metapath2vec | 35.21                 |
| GCN          | 56.89                 |
| GAT          | 58.14                 |
| HAN          | 56.77                 |
| GTN          | 60.92                 |
In light of this, it is indeed very impressive that we get a test set F1 score of 59.11 % with HashGNN and logistic regression. Especially considering that:
- we use fewer bits to represent the embeddings (1256 vs 2048)
- we use dramatically fewer training parameters in our gradient descent compared to the deep learning models above
- HashGNN is an unsupervised algorithm
- HashGNN runs a lot faster (even without a GPU) and requires a lot less memory