Key-value, document-oriented, column family, graph, relational... Today we seem to have as many kinds of databases as there are kinds of data. While this may make choosing a database harder, it makes choosing the right database easier. Of course, that does require doing your homework. You’ve got to know your databases.
One of the least-understood types of databases out there is the graph database. Designed for working with highly interconnected data, a graph database might be described as more “relational” than a relational database. Graph databases shine when the goal is to capture complex relationships in vast webs of information.
Here is a closer look at what graph databases are, why they’re unlike other databases, and what kinds of data problems they’re built to solve.
Graph database vs. relational database
In a traditional relational or SQL database, the data is organized into tables. Each table records data in a specific format with a fixed number of columns, each column with its own data type (integer, time/date, freeform text, etc.).
This model works best when you’re dealing mainly with data from any one table. It also doesn’t work too badly when you’re aggregating data stored across multiple tables. But that behavior has some notable limits.
Consider a music database, with albums, bands, labels, and performers. If you want to report all the performers that were featured on this album by that band released on these labels—four different tables—you have to explicitly describe those relationships. With a relational database, you accomplish this by way of new data columns (for one-to-one or one-to-many relationships), or new tables (for many-to-many relationships).
This is practical as long as you’re managing a modest number of relationships. If you’re dealing with millions or even billions of relationships—friends of friends of friends, for instance—those queries don’t scale well.
In short, if the relationships between data, not the data itself, are your main concern, then a different kind of database—a graph database—is in order.
Graph database features
The term “graph” comes from the use of the word in mathematics. There it’s used to describe a collection of nodes (or vertices), each containing information (properties), and with labeled relationships (or edges) between the nodes.
A social network is a good example of a graph. The people in the network would be the nodes, the attributes of each person (such as name, age, and so on) would be properties, and the lines connecting the people (with labels such as “friend” or “mother” or “supervisor”) would indicate their relationship.
In a conventional database, queries about relationships can take a long time to process. This is because relationships are implemented with foreign keys and queried by joining tables. As any SQL DBA can tell you, performing joins is expensive, especially when you must sort through large numbers of objects—or, worse, when you must join multiple tables to perform the sorts of indirect (e.g. “friend of a friend”) queries that graph databases excel at.
Graph databases work by storing the relationships along with the data. Because related nodes are physically linked in the database, accessing those relationships is as immediate as accessing the data itself. In other words, instead of calculating the relationship as relational databases must do, graph databases simply read the relationship from storage. Satisfying queries is a simple matter of walking, or “traversing,” the graph.
A graph database not only stores the relationships between objects in a native way, making queries about relationships fast and easy, but allows you to include different kinds of objects and different kinds of relationships in the graph. Like other NoSQL databases, a graph database is schema-less. Thus, in terms of performance and flexibility, graph databases hew closer to document databases or key-value stores than they do relational or table-oriented databases.
Graph database use cases
Graph databases work best when the data you’re working with is highly connected and should be represented by how it links or refers to other data, typically by way of many-to-many relationships.
Again, a social network is a useful example. Graph databases reduce the amount of work needed to construct and display the data views found in social networks, such as activity feeds, or determining whether or not you might know a given person due to their proximity to other friends you have in the network.
Another application for graph databases is finding patterns of connection in graph data that would be difficult to tease out via other data representations. Fraud detection systems use graph databases to bring to light relationships between entities that might otherwise have been hard to notice.
Similarly, graph databases are a natural fit for applications that manage the relationships or interdependencies between entities. You will often find graph databases behind recommendation engines, content and asset management systems, identity and access management systems, and regulatory compliance and risk management solutions.
Graph database queries
Graph databases—like other NoSQL databases—typically use their own custom query methodology instead of SQL.
One commonly used graph query language is Cypher, originally developed for the Neo4j graph database. Since late 2015 Cypher has been developed as a separate open source project, and a number of other vendors have adopted it as a query system for their products (e.g., SAP HANA).
Here is an example of a Cypher query that returns a search result for everyone who is a friend of Scott:
MATCH (a:Person {name:’Scott’})-[:FRIENDOF]->(b)
RETURN b
The arrow symbol (->) is used in Cypher queries to represent a directed relationship in the graph.
Another common graph query language, Gremlin, was devised for the Apache TinkerPop graph computing framework. Gremlin syntax is similar to that used by some languages’ ORM database access libraries.
Here is an example of a “friends of Scott” query in Gremlin:
g.V().has(“name”,”Scott”).out(“friendof”)
Many graph databases have support for Gremlin by way of a library, either built-in or third-party.
Yet another query language is SPARQL. It was originally developed by the W3C to query data stored in the Resource Description Framework (RDF) format for metadata. In other words, SPARQL wasn’t devised for graph database searches, but can be used for them. On the whole, Cypher and Gremlin have been more broadly adopted.
SPARQL queries have some elements reminiscent of SQL, namely SELECT and WHERE clauses, but the rest of the syntax is radically dissimilar. Don’t think of SPARQL as being related to SQL at all, or for that matter to other graph query languages.
Popular graph databases
Because graph databases serve a relatively niche use case, there aren’t nearly as many of them as there are relational databases. On the plus side, that makes the standout products easier to identify and discuss.
Neo4j
Neo4j is easily the most mature (11 years and counting) and best-known of the graph databases for general use. Unlike previous graph database products, it doesn’t use a SQL back-end. Neo4j is a native graph database that was engineered from the inside out to support large graph structures, as in queries that return hundreds of thousands of relations and more.
Neo4j comes in both free open-source and for-pay enterprise editions, with the latter having no restrictions on the size of a dataset (among other features). You can also experiment with Neo4j online by way of its Sandbox, which includes some sample datasets to practice with.
Microsoft Azure Cosmos DB
The Azure Cosmos DB cloud database is an ambitious project. It’s intended to emulate multiple kinds of databases—conventional tables, document-oriented, column family, and graph—all through a single, unified service with a consistent set of APIs.
To that end, a graph database is just one of the various modes Cosmos DB can operate in. It uses the Gremlin query language and API for graph-type queries, and supports the Gremlin console created for Apache TinkerPop as another interface.
Another big selling point of Cosmos DB is that indexing, scaling, and geo-replication are handled automatically in the Azure cloud, without any knob-twiddling on your end. It isn’t clear yet how Microsoft’s all-in-one architecture measures up to native graph databases in terms of performance, but Cosmos DB certainly offers a useful combination of flexibility and scale.
JanusGraph
JanusGraph was forked from the TitanDB project, and is now under the governance of the Linux Foundation. It uses any of a number of supported back ends—Apache Cassandra, Apache HBase, Google Cloud Bigtable, Oracle BerkeleyDB—to store graph data, supports the Gremlin query language (as well as other elements from the Apache TinkerPop stack), and can also incorporate full-text search by way of the Apache Solr, Apache Lucene, or Elasticsearch projects.
IBM, one of the JanusGraph project’s supporters, offers a hosted version of JanusGraph on IBM Cloud, called Compose for JanusGraph. Like Azure Cosmos DB, Compose for JanusGraph provides autoscaling and high availability, with pricing based on resource usage.
https://www.infoworld.com
No comments:
Post a Comment