MDM and Graph
by Salah Kamel | April 22, 2018
The true role graph databases will play in the future of enterprise data management is nuanced. A graph database certainly adds value through its ability to give access to poorly attributed data (nodes with few properties) and to highly scattered and volatile relationships.
However, their role is most beneficial to analytics rather than operational data management. Advanced operational and transactional capabilities, as well as massive set-based processing, will continue to rely on relational databases.
The value of graph databases lies not in their ability to replace your MDM, but rather in their ability to complement your MDM and data governance applications when needed. Attendees of this session will learn how graph can complement modern, intelligent MDM solutions and solidify their understanding of how graph databases can be helpful when used in conjunction with a modern MDM tool.
Master Data Management (MDM) is a discipline that was introduced over a decade ago as the foundation for building a sustainable data governance strategy for customer, product, asset, financial, location, and reference data in an ever-growing ecosystem of business applications. Big data, predictive analytics, IoT, and machine learning algorithms have set the stage for a new way of collecting and analyzing data and of turning past insights into prediction models. They enable transforming extremely large volumes of data into actionable outcomes to drive better revenue, mitigate risk, and optimize costs. MDM has served, and still serves, the analytics world by feeding it with the most accurate, consistent and complete representation of the key master data sets. MDM delivers the globally unique, self-explanatory representation of an enterprise’s customers, products, and other foundational entities. With that, it has enabled the unambiguous interpretation of transactional data, events, and social observations - such as sales transactions, call center tickets, and Twitter feeds - by blending such volatile data with highly reliable, enriched and standardized references to drive appropriate decisions.
Growing complexity of relationships
With the rise of multi-channel interactions in the current digital world, the number of relationships between the core entities of the enterprise has grown exponentially. The traditional perceived simplicity of those relationships, such as “My customer has ordered my product at this date,” is getting outdated. More interactions and links can - and should - be derived and explored to build sustainable data-driven value-focused applications for the enterprise. To illustrate that, think about all the interactions that you, as a potential customer, can have with a B2C enterprise. You might order a product, tweet positively or negatively about it, comment on Facebook, call the service desk about it, recommend it to a friend, get it shipped to your place, or order additional compatible accessories.
All these interactions and observations are, or can be represented as, relationships between you as a customer or a potential buyer and the product(s) offered. Surfacing these relationships, whether explicit, implicit or derived, to the appropriate business stakeholders across various functions in the enterprise, such as the marketing department or the supply chain department, can provide tremendous value to optimize and sharpen the existing business processes. These processes would benefit from the analysis of past observations to reduce costs or mitigate risks, and they would also likely benefit from predictive analytical models to guide appropriate business decisions resulting in increased revenue.
The most expressive way we have today to define these exponentially complex relationships is through a graph: not the graph database technology, but the semantic relationships captured by the “graph” abstraction.
The simplest definition of a graph is “a set of nodes connected by edges”. As such, the internet is a graph of web pages (nodes) connected by edges (hyperlinks). A relational data model with entities and relationships is also a graph where instances (actual records defining an object) are inter-connected to other instances. So graphs exist regardless of the technology used to manipulate them.
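To make the point concrete, here is a minimal Python sketch, under stated assumptions, of how ordinary relational rows already form a graph. The tables, column names and sample records are hypothetical and exist only for illustration; no graph database is involved.

```python
from collections import defaultdict

# Hypothetical relational data: two entity tables and one relationship table.
customers = {1: "Jane H.", 2: "Joe K."}
products = {10: "ZW Watch", 11: "Watch Strap"}
orders = [
    {"customer_id": 1, "product_id": 10},
    {"customer_id": 2, "product_id": 10},
    {"customer_id": 1, "product_id": 11},
]


def build_graph(order_rows):
    """Turn foreign-key references into an undirected adjacency map.

    Each record becomes a node keyed by (entity name, primary key);
    each order row becomes an edge between a customer and a product.
    """
    graph = defaultdict(set)
    for row in order_rows:
        customer = ("customer", row["customer_id"])
        product = ("product", row["product_id"])
        graph[customer].add(product)
        graph[product].add(customer)
    return graph


graph = build_graph(orders)

# Customers connected to the "ZW Watch" node (product 10).
neighbors_of_watch = sorted(customers[key] for _, key in graph[("product", 10)])
print(neighbors_of_watch)
```

The adjacency map is only one possible in-memory representation; the same nodes and edges stay in the relational tables the whole time.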
There is an ongoing confusion between the technology for storing graph data (namely Graph Databases), the performance for executing graph-based queries, and the completeness of the data modeling features and query semantics for Graph.
A graph database is a good technology to get fast access to “poorly attributed” data (that is, nodes with few properties) and highly scattered and volatile relationships. It stores nodes and edges as a large index to give good performance for graph-based query access.
Unfortunately, graph databases have been primarily designed for analytics rather than operational data management. The advanced capabilities required by a Master Data Management initiative, such as transactional workflows or massive set-based processing, still require plain old relational databases. Relational databases continue to rule in this category with more than forty years of continuous innovation.
The excitement about graph databases is analogous to the one we had in the past for hierarchical databases in the 1980s, columnar databases in the early 2000s, indexed document-based databases optimized for search, or more recent name-value pair NoSQL databases. Each of these technologies has valid proof points and applications for a certain type of query tailored for certain use cases. For example, using a search engine such as Apache Solr is a good approach for indexing large databases of documents or records and providing fast query results for text-based search. However, everyone would agree that using a search index as a data management application is not appropriate. ERPs, CRMs, MDMs, and other data-driven applications, most likely designed on top of relational databases, export their data on a regular basis to feed the search index and seamlessly integrate with it. They all provide reasonably optimized out-of-the-box search or filtering features hitting the relational database to serve daily operational requirements, and they integrate with search engines for more optimized text search.
This analogy is valid for graph databases as well. Data-driven applications such as Master Data Management applications should primarily focus on solving the critical operational challenges of data governance, data lifecycle, approvals, matching and merging, data policies and records housekeeping. To serve such operational needs first, they must provide a reasonably optimized graph-based query engine based on the semantics of the application, with advanced graph data visualization. An example of these operational graph queries could be “Give me all products that are close enough (3 degrees of depth) to product ‘XYZ’ by exploring relationships that include similar product families and similar customer profiles that bought ‘XYZ’ in the past year”. A data champion can then use the result of such a query to run a data review campaign on such products to ensure that their descriptions match the enterprise’s Search Engine Optimization (SEO) recommendations.
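A query of this kind can be sketched as a depth-limited breadth-first search over typed edges. The Python below is an illustration only: the edge list, the relationship names and the `within_depth` helper are hypothetical, and the "past year" filter from the example is left out to keep the sketch short.

```python
from collections import deque

# Hypothetical typed edges: (source, relationship, target).
EDGES = [
    ("XYZ", "same_family", "P1"),
    ("P1", "same_family", "P2"),
    ("P2", "bought_by_similar_customers", "P3"),
    ("P3", "same_family", "P4"),
    ("XYZ", "shipped_with", "P5"),
]


def within_depth(edges, start, max_depth, allowed):
    """Return {node: depth} for nodes reachable from `start` in at most
    `max_depth` hops, following only the `allowed` relationship types."""
    adjacency = {}
    for source, relationship, target in edges:
        if relationship in allowed:
            # Treat each relationship as traversable in both directions.
            adjacency.setdefault(source, set()).add(target)
            adjacency.setdefault(target, set()).add(source)

    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbor in adjacency.get(node, ()):
            if neighbor not in depths:
                depths[neighbor] = depths[node] + 1
                queue.append(neighbor)

    del depths[start]  # the starting product is not part of the answer
    return depths


close_products = within_depth(
    EDGES, "XYZ", 3, {"same_family", "bought_by_similar_customers"}
)
print(close_products)
```

In this sample data, P4 sits four hops away and P5 is reachable only through a relationship type that the query does not explore, so neither appears in the result.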
For more complex graph analytics requirements where graph query performance is important, where the number of edges runs into the billions, and where the data of the graph itself is not fully managed in the MDM application, it is a good practice to have a graph database complement the MDM and other operational applications. The more graph semantics the MDM applications have, the easier the integration with the graph database will be. In this scenario, the graph database will act as the relationship warehouse for all links managed in the MDM (explicit, implicit and derived) and complement these with additional relationships gathered from other areas outside of the scope of the MDM. The graph database can benefit from the MDM’s fully attributed entities and link to them to display more detailed information about nodes and edges.
The Semantics of Relationships Matter
With the explosion of the potential links between objects in the Analytics and Master Data Management worlds, it is becoming critical for enterprises to introduce a clear definition of each of these relationships. Such definitions have to be modelled and governed in an agile enough environment. They should be semantically described, enhanced, completed and changed by and with the business stakeholders according to their business requirements. For example, let’s consider two nodes in a graph, one representing an instance of a product (“ZW Watch”) and the other an instance of a customer (“Jane H.”), connected by a link. Without any other context, such a graph provides no information beyond “Jane H. is linked to ZW Watch”. Is it because Jane bought that product, or is it because it was shipped to her?
By adding the semantics to each of these relationships between an instance of a customer and an instance of a product, the graph becomes much more meaningful in the context of the business function about to consume this information. Semantics naturally drive self-learning assessments of the interactions and observations between master data elements.
Intelligent Master Data Management tools have to be able to cope with such complexity in a precise and agile manner. They must provide the appropriate modelling flexibility for defining, creating, maintaining and evolving such relationships. They should provide support for:
- Explicit relationships: for example those that exist between a “household” and a “physical person” (“member-of”). Such relationships are usually immutably predetermined as One-To-Many or Many-To-Many relationships in typical ER-design style.
- Implicit relationships: for example those that are derived from functional dependencies or transitivity such as if a “product” belongs to this “subfamily” and the “subfamily” belongs to this “family” then the “product” belongs to that same “family”.
- Derived relationships: for example those that can be aggregated or calculated according to external factors or fuzzy inference rules such as “Jane, being the friend of Joe who bought a plastic watch 5 days ago, might be interested in similar products”. These relationships are volatile and highly contextualized by business use cases.
All these relationships, holistically put together, represent the “edges” (links) between “nodes” (instances of objects) in a graph. The more you attach semantics to these edges and nodes, the better outcomes you get from the global interpretation of your analytics and predictions.
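The three kinds of relationships can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular MDM tool implements them: the `Edge` type, the relationship names and both helper functions are hypothetical, and the derived rule ignores the recency and product-similarity parts of the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str
    relationship: str
    target: str
    kind: str  # "explicit", "implicit" or "derived"


# Explicit relationships, as they would be declared in the data model.
explicit = {
    Edge("plastic watch", "belongs_to", "sport watches", "explicit"),
    Edge("sport watches", "belongs_to", "watches", "explicit"),
    Edge("Jane", "friend_of", "Joe", "explicit"),
    Edge("Joe", "bought", "plastic watch", "explicit"),
}


def implicit_edges(edges):
    """Infer belongs_to edges by transitivity (product -> subfamily -> family)."""
    parents = {}
    for edge in edges:
        if edge.relationship == "belongs_to":
            parents.setdefault(edge.source, set()).add(edge.target)

    inferred = set()
    for start, direct in parents.items():
        seen, stack = set(), list(direct)
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, ()))
        # Keep only ancestors that are not already explicit parents.
        for ancestor in seen - direct:
            inferred.add(Edge(start, "belongs_to", ancestor, "implicit"))
    return inferred


def derived_edges(edges):
    """Simplified rule: a friend of a buyer might be interested in the product."""
    friends = [(e.source, e.target) for e in edges if e.relationship == "friend_of"]
    bought = [(e.source, e.target) for e in edges if e.relationship == "bought"]
    return {
        Edge(person, "might_be_interested_in", product, "derived")
        for person, friend in friends
        for buyer, product in bought
        if buyer == friend
    }


all_edges = explicit | implicit_edges(explicit) | derived_edges(explicit)
for edge in sorted(all_edges, key=lambda e: (e.kind, e.source, e.target)):
    print(edge.kind, ":", edge.source, edge.relationship, edge.target)
```

Tagging each edge with its kind keeps the provenance of a link visible to whoever consumes the graph, which matters because derived edges are volatile while explicit ones are governed.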
In short, graphs exist independently from the technology that is used to manipulate them. As graph analysis is becoming more relevant in MDM initiatives, depending on the business cases, it is important to rate graph capabilities in your toolset with regard to meaningful, unambiguous semantics to avoid misinterpretation of your data.
Your MDM tool must allow defining explicit, implicit and derived relationships and entities. But these need to be queried using powerful graph-query languages to obtain the best outcomes. Again, this does not necessitate a graph database.
Powerful Graph Query Language
Set-based query languages such as SQL have often failed to express complex queries on top of graphs. The grammar of these languages forces the writer of a query to explicitly define the paths to explore (joins in SQL) to retrieve a set of interconnected objects with their relationships. Although SQL is a very powerful language for defining extremely complex filtering rules to obtain a set of records, its core weakness resides in the lack of understanding of the semantics of the underlying relationships for dynamically building such queries. In short, SQL is cumbersome and not optimized for graph queries.
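As an illustration of how every hop has to be spelled out as a join, here is a small Python sketch using the standard `sqlite3` module. The schema, table names and sample rows are hypothetical; the query asks for the products bought by the other members of Jane's household.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE household_member (household_id INTEGER, person_id INTEGER);
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchase (person_id INTEGER, product_id INTEGER);

    INSERT INTO person VALUES (1, 'Jane'), (2, 'Joe'), (3, 'Ann');
    INSERT INTO household_member VALUES (100, 1), (100, 2), (200, 3);
    INSERT INTO product VALUES (10, 'ZW Watch'), (11, 'Watch Strap'), (12, 'Wall Clock');
    INSERT INTO purchase VALUES (1, 10), (2, 11), (3, 12);
    """
)

# Each hop in the graph (person -> household -> other member -> purchase
# -> product) has to be written out by hand as a join.
QUERY = """
SELECT DISTINCT p.name
FROM person AS me
JOIN household_member AS hm1 ON hm1.person_id = me.id
JOIN household_member AS hm2 ON hm2.household_id = hm1.household_id
                            AND hm2.person_id <> me.id
JOIN purchase AS pu ON pu.person_id = hm2.person_id
JOIN product AS p ON p.id = pu.product_id
WHERE me.name = ?
ORDER BY p.name
"""

rows = [name for (name,) in conn.execute(QUERY, ("Jane",))]
print(rows)
```

Adding one more hop, for example friends of household members, means adding more joins by hand, which is the cumbersome part described above.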
Graph query languages such as Neo4j’s Cypher define a new set of grammatical constructs specifically focused on graph queries but they introduce a new layer of complexity with a significant ramp-up time, even for experienced data stewards.
Specialized query languages such as SemQL combine the best of both worlds to provide a powerful, semantically complete query language that extends the capabilities of standard SQL to provide context-aware graph-based queries by introspecting the relationships defined in the data model.
The example above shows the same query using a graph language and SemQL to obtain the list of restaurants in New York serving Sushi that my friends liked.
The expressivity of the query language you plan to use for analyzing your graph is critical. Stopping at the query language is not enough. Once you have your data in hand, just as important is how to visualize it to communicate your findings to others.
Inspecting “neighborhoods” of a given object has always been a challenge in graph analytics. Most visualization tools do not yet offer a comprehensive graph visualization engine that can start from a node and explore N levels of depth regardless of the relationships between nodes. Graph visualization combined with graph query and graph semantics is fundamental for data discovery, data analysis, and relationship mining.
The example below shows five levels of depth across all known relationships surrounding a given product.
Zooming in and out of such a diagram, choosing the relationships that matter to my business case, finding new areas of interest and jumping from a node to another can help me, as a business user, not only to understand the impact analysis on my data for regulatory compliance use cases, but also to discover new opportunities, for example, for upsell or cross-sell.
Applying graph theory to Master Data Management initiatives provides better insights. It prepares enterprises for advanced analytics, leveraging intelligent algorithms.
Graph Databases, as a technology, should be used where appropriate to your analytics use cases. Graph Databases will probably not replace your operational applications; they will likely complement your MDM and data governance applications.
Special attention should be paid when designing your MDM graph strategy:
- The semantics of the relationships, the flexibility, and the agility of the data modeling matter.
- The graph query language exposed to your data champions has to be powerful and concise.
- The visualization capabilities and the user experience to analyze your graph results are equally critical.