This application uses RedisGraph. The size of the graph depends on two factors:
For a single blog, this graph will be relatively small.
The graph can be stored under any key. Multiple blogs can share the same key or use separate keys, depending on preference.
The graph contains BlogPosting nodes, NamedEntity nodes with a text property, and uses relationships between an article and a named entity, each with a count property recording how many times the entity occurs in the article.

This module uses the pypropgraph Python library to load the graph. The library uses MERGE to avoid duplication, so as long as the URLs of the articles and the named entities remain the same, running ingest more than once should produce the same graph.
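The effect of MERGE on repeated ingests can be illustrated with a minimal sketch that builds Cypher statements by hand. This is not the pypropgraph API; the helper names and the url property on BlogPosting are assumptions made for illustration. Only the labels, the uses relationship, and the text and count properties come from the description above.

```python
def cypher_string(value: str) -> str:
    """Quote a Python string as a single-quoted Cypher string literal."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def merge_statements(url: str, entity_counts: dict) -> list:
    """Build MERGE statements for one article and its named entities.

    Assumption: articles are identified by a `url` property. MERGE matches
    an existing node or relationship before creating one, so running the
    same statements twice leaves the graph unchanged.
    """
    article = f"(a:BlogPosting {{url: {cypher_string(url)}}})"
    statements = [f"MERGE {article}"]
    # Sort so that repeated runs emit the statements in the same order.
    for text, count in sorted(entity_counts.items()):
        statements.append(
            f"MERGE {article} "
            f"MERGE (e:NamedEntity {{text: {cypher_string(text)}}}) "
            f"MERGE (a)-[u:uses]->(e) "
            f"SET u.count = {int(count)}"
        )
    return statements


if __name__ == "__main__":
    for statement in merge_statements("https://example.com/post", {"Redis": 3}):
        print(statement)
```

Each statement could be sent to a graph key with the GRAPH.QUERY command. Because the count is applied with SET rather than incremented, a second ingest overwrites it with the same value instead of doubling it.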
The main scalability challenge is the number of named entities. While the spaCy NER model detects named entities robustly, it has trouble with entity word boundaries. As a result, the same named entity sometimes appears with extra words before or after it, such as stop words or other qualifiers, and each variant becomes a separate entity in the graph. Pruning the named entities with additional algorithms should improve scalability when processing a large corpus.
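One possible pruning step, sketched here under stated assumptions, is to strip stop words from the edges of each entity and merge the counts of variants that collapse to the same text. The word list and function names are hypothetical, and this project does not necessarily prune this way.

```python
from collections import Counter

# Assumption: a small illustrative stop-word list, not the one spaCy ships.
EDGE_WORDS = frozenset({"the", "a", "an", "of", "and", "in", "on", "for", "to"})


def normalize_entity(text: str) -> str:
    """Strip stop words from the start and end of an entity; keep interior ones."""
    words = text.split()
    while words and words[0].lower() in EDGE_WORDS:
        words.pop(0)
    while words and words[-1].lower() in EDGE_WORDS:
        words.pop()
    return " ".join(words)


def prune_entities(counts: dict) -> dict:
    """Merge the counts of entities that normalize to the same text."""
    pruned = Counter()
    for text, count in counts.items():
        key = normalize_entity(text)
        if key:  # drop entities made up entirely of stop words
            pruned[key] += count
    return dict(pruned)


if __name__ == "__main__":
    print(prune_entities({"the Redis": 2, "Redis": 3, "Redis and": 1}))
```

Interior stop words are kept on purpose, so an entity such as "University of Oxford" is not broken apart.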
This project relies on RedisGraph, and the demo relies on several indexes. These indexes are created by a separate program, setupdb.py, which can be run against any graph key.
You can run a local RedisGraph instance with Docker:
docker run -p 6379:6379 redislabs/redisgraph:latest
Then set up the indexes for each graph key:
python demo/setupdb.py milowski.com
The setup program also accepts multiple graph keys:
python demo/setupdb.py milowski.com redislabs-blog
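As a rough sketch of what per-key index setup might look like, the following builds one GRAPH.QUERY command per graph key and index. The actual indexes created by setupdb.py are not listed here, so the index on the NamedEntity text property is an assumption, as are the names INDEXES and index_commands.

```python
# Assumption: a single index on NamedEntity.text; the real setupdb.py
# may create different or additional indexes.
INDEXES = [("NamedEntity", "text")]


def index_commands(graph_keys, indexes=INDEXES):
    """Return one GRAPH.QUERY command tuple per (graph key, index) pair."""
    return [
        ("GRAPH.QUERY", key, f"CREATE INDEX ON :{label}({prop})")
        for key in graph_keys
        for label, prop in indexes
    ]


if __name__ == "__main__":
    for command in index_commands(["milowski.com", "redislabs-blog"]):
        print(" ".join(command))
```

With the redis-py client, each tuple could be sent to the server using `redis.Redis().execute_command(*command)`.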