Index Free Adjacency or Hybrid Indexes for Graph Databases


本站和网页 https://www.arangodb.com/2016/04/index-free-adjacency-hybrid-indexes-graph-databases/ 的作者无关,不对其内容负责。快照谨为网络故障时之索引,不代表被搜索网站的即时页面。

Index Free Adjacency or Hybrid Indexes for Graph Databases
Products
Managed Services
ArangoGraph Insights PlatformOur next-gen graph data and analytics platform, built on the ArangoDB Graph Database
ArangoGraph Enterprise
ArangoGraph Professional
ArangoGraphML
Self-Managed
ArangoDB Graph DatabaseGraph, integrated search engine, and JSON support, via a single query language.
Enterprise Edition
Community Edition
Compare Features 
Solutions
Proven and Flexible
Use CasesKnowledge Graph, Fraud Detection, KYC and more
Case StudiesCompanies using ArangoDB around the globe
WhitepapersGo Deeper on ArangoDB Capabilities
ComparisonsHow ArangoDB compares to other market leaders
Projects and IntegrationsShare and discover official and community projects
BI ConnectorsConnect Tableau, Qlik, PowerBI, Custom BI
Capabilities
Graph MLThe future of data science
Graph DatabasePowerful graph features at a glance
Document StoreRich document capabilities in ArangoDB
Search EngineIntegrated full-text search engine
Multi-Model DatabaseBenefits of three data models under one roof
Performance
Enterprise
Scale Safe & Secure
Graphs at ScaleOptimal performance for distributed graphs
Documents at ScaleFast join operations against distributed data
Search at ScaleSearch terabytes of data
High availabilityBusiness Continuity and Disaster Recovery
ComplianceGDPR & CCPA with data masking, auditing
SecurityEncryption 360, LDAP, Auditing and more
Learn
Go Deeper
ArangoDB UniversityInteractive courses powered by ArangoGraph
Training CenterTutorials on features and database functionalities
CertificationArangoDB Certified Professional Exam
DocumentationManual, AQL, HTTP API, Drivers, Oasis
Stay Informed
BlogLatest news from ArangoDB
ResourcesCourses, white papers, videos
EventsWhere to meet the ArangoDB team
CommunityGet involved with the open-source community
Cloud
Download
Sign up for ArangoGraph Insights Platform
Before signing up, please accept our terms & conditions and privacy policy.
What to expect after you signupYou can try out ArangoDB Cloud FREE for 14 days. No credit card required and you are not obligated to keep using ArangoDB Cloud.
At the end of your free trial, enter your credit card details to continue using ArangoDB Cloud.
If you decide that ArangoDB Cloud is not (yet) for you, you can simply leave and come back later.
ACCEPT
Index Free Adjacency or Hybrid Indexes for Graph Databases
April 18, 201612Architecture, GraphsTags: Graph
Some graph database vendors propagandize index-free adjacency for the implementation of graph models. There has been some discussion on Wikipedia about what makes a database a graph database. These vendors tried to push the definition of index-free adjacency as foundation of graph databases, but were stopped by the community.
Index Free Adjacency or Hybrid Indexes for Graph Databases
Therefore a graph database remains “a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data” – independent of the way the data is stored internally. It’s really the model and the implemented algorithms that matter.
Interested in trying out ArangoDB? Fire up your database in just a few clicks with ArangoDB Oasis: the Cloud Service for ArangoDB. Start your free 14-day trial here.
Example Graph with just small super-nodes
New to multi-model and graphs? Check out our free ArangoDB Graph Course.
Still, the question remains if index-free adjacency as an implementation detail is a suitable concept for a graph database. The idea is that each node contains a list of pointers of its edges, therefore avoiding look-ups. In a distributed world, it’s clear that such a definition is meaningless, as the edges can live on different servers and there is no such thing as a “pointer” of an edge. In this case, all potential improvements discussed below are negligible in comparison to the communication latency, so much cleverer algorithms are needed in the distributed case.
So, let’s concentrate on the single server case and dive into the computer science behind it.
The key message of index-free adjacency is, that the complexity to traverse the whole graph is O(n), where n is the number of nodes. In contrast, using any index will have complexity O(n log n). While this sounds plausible at first, it is simply wrong. Using a novel index, which combines hashes with linked-list, it is possible to gain the same complexity O(n) when traversing the whole graph. However, index-free adjacency has some severe pitfalls. It has drawbacks when there exist super-nodes. Access can be done fast, but deleting or modifying edges of such nodes becomes a nightmare. If you have a typical social network, then celebrities will have millions of followers. If you hit such a node, modifying operations suddenly become too complex to handle without indexes.
Can one do better? Yes, by using our novel hybrid index, one gets the best of both worlds. By combining a hash-index with a linked-list we were able to achieve the same access speed as an index-free implementation, while being way more efficient when deleting or modifying, especially when dealing with edges of super-nodes. Let’s explore the complexity analysis behind this approach.
If you store the vertices at each node as list of direct pointers, then traversing all neighbors has complexity O(k), if a vertex has k edges. Note that this is the best possible complexity because O(k) is the size of the answer. Deleting a single edge also has the same complexity of O(k) (assuming a doubly linked list), which is much worse than optimal. Furthermore, usually one will want to be able to traverse edges in both directions, which makes it necessary to store direct pointers on both vertices that are incident with an edge. A consequence of this is that deleting a supernode is even worse: To remove all incident edges one has to visit every adjacent vertex – and perform a potentially expensive removal operation for each of them.
Using indexes naively will not remedy the problem, as the complexity of traversing the edges raises to O(log E) + O(k) where E is the total number of edges. The choice of a hybrid index between hash and linked-list, solves this dilemma. We store all edges in a big hash table, and at the same time we put for each vertex V all those edges that are incident with V into a doubly linked list. Details of the implementation can be found in our github repository.
This allows to proceed as follows: Finding the starting point for the linked list of a vertex is a single hash lookup by the vertex key, which is O(1) with a very good constant. If a vertex has k neighbors, traversing all its neighbors has complexity O(k) because of the linked-list aspect of the index, we simply have to run through the linked list once we have found the start in O(1), but O(1) + O(k) = O(k). For single-edge modifications and single-edge deletions we first look up by the vertex key, in case the edge is the first in the linked list of the vertex, and if this fails to find the edge, we do one further lookup to find the edge by its own key. Thus, we can find the edge in O(1) and then modify or delete it. This is only possible, because the edge index is both a big hash table for all edges and a linked list for each vertex’s neighbours. Overall, the complexity is now O(k) for neighbors – as good as theoretically possible – and O(1) for single-edge modifications – again the best possible result.
ArangoDB Hybrid Hash/Linked-List – Index
Because of the much better complexity for modifications, it is unnecessary to investigate the involved constants. For the traversal case it is worthwhile to do so, though. So, how large are the speedups promised by direct pointers – at least when no modifications of the graph are involved? If the graph is unrealistically small (and fits into the L2 cache), then it is plausible that pointers will be faster. However, the potential savings with a direct pointer approach is negligible, since the direct pointer also has to fetch at least one cache line. Even this small benefit is immediately lost again if one does not use C++ or a similar low-level language.
On the other hand, if you have a large graph that does not completely fit into memory, a drawback of index-free adjacency becomes apparent. If you look at the iconic graph query, namely a pattern matching in a large graph, an index might even be faster, because you do not need to look at these vertices at all. You only need the index itself, which is much smaller than the complete node data. Therefore it is much more memory and cache friendly, which is of most importance with nowadays’ technology.
Conclusion
By selecting the right index, it is possible to reach the same read complexity with superior write complexity.
In addition, using an index keeps the payload out of the main memory. This makes the algorithms much more memory and cache friendly.
If you are interested in performance of such a solution have a look here.
April 18, 2016
Author Claudius WeinbergerClaudius studied economics with business informatics as key aspect at the University of Cologne. Together with his co-founder, he builds databases for more than 20 years; from in-memory to mostly memory databases and from K/V stores over multi-dimensional cubes to graph databases. His responsibility was mostly the product and project management. Since 2012 he is the CEO of ArangoDB.
2 Comments
Posted by Rasmus Schultz
on April 19, 2016 at 8:04 am
So does ArangoDB already use this type of index, or is this part of your plan to remedy the “mostly memory” performance aspect? 🙂
Posted by Claudius Weinberger
on May 17, 2016 at 10:05 pm
We use it since day one.
Leave a ReplyYour email address will not be published. Required fields are marked *Comment * Name *
Email *
Website
Get the latest tutorials, blog posts and news:
TagsAPI
AQL
arangoml
arangosearch
arangosh
Benchmark
Cloud
Cluster
Community
data science
Documentation
driver
Enterprise
Foxx
Fraud Detection
geo
GitHub
Graph
HTTP
Java
JavaScript
Kubernetes
Machine Learning
Managed Service
ML
monitoring
multi-model
networkx
Newsletter
NoSQL
open-source
Performance
PHP
plugin
Pregel
python
pytorch
Release
RocksDB
Ruby
storage
testing
Tutorial
WebUI
Windows
Quick LinksQuick Start
Roadmap
Language Drivers
Case Studies
BI Connectors
Certification Exam
InfoSubscriptions
Community
Events
Online Meetup
Logo/Resources
About UsAbout Us
Jobs
Imprint
Contact
News
SocialTwitter
LinkedIn
Stackoverflow
GitHub
Slack
Google Groups
Youtube
Star ArangoDB on GitHub
GitHub