Data management is simple ... once you have the big picture!
This post provides an overview of the classification of data, describes the various categories of data (reporting, transactional, master, reference and metadata) and explains why Master and Reference Data have a critical position in this organization.
Categories of Data
The following classification of data is commonly agreed upon in the data management field. We have seen it in the past in the form of a stack, a pyramid, or even a diamond as represented below; regardless of the shape, the list of items remains the same.
Let's have a look at these various data categories.
Transactional data describes business events. It is the largest volume of data in the enterprise. Examples of business events include:
- buying products from suppliers,
- selling products to customers,
- shipping items to customer sites,
- hiring employees, managing their vacations or changing their positions.
You manage transactional data every day! It makes the enterprise world spin.
Transactional data is typically handled in operational applications, known by acronyms such as CRM, ERP, SCM, and HR.
Master Data is key business information that supports the transactions.
Master Data describes the customers, products, parts, employees, materials, suppliers, sites, etc. involved in the transactions. It is commonly referred to as Places (locations, geography, sites, etc.), Parties (persons, customers, suppliers, employees, etc.) and Things (products, items, material, vehicles, etc.).
Master data already exists and is used in the operational systems, with some issues. Master data in these systems is:
- Not high-quality data,
- Scattered and duplicated,
- Not truly managed.
Master Data is usually authored and used in the normal course of operations by existing business processes. Unfortunately, these operational business processes are tailored for an "application-specific" use of this master data. They therefore fail to meet the overall enterprise requirement: master data that is shared across applications, with high quality standards and common governance.
Reference data is data that is referenced and shared by a number of systems. Most reference data refers to concepts that either impact business processes - e.g. order status (CREATED | APPROVED | REJECTED | etc.) - or are used as an additional standardized semantic that further clarifies the interpretation of a data record - e.g. employee job position (JUNIOR | SENIOR | VP | etc.).
Some of the reference data can be universal and/or standardized (e.g. Countries - ISO 3166-1). Other reference data may be "agreed on" within the enterprise (customer status), or within a given business domain (product classifications).
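As a minimal sketch of how reference data constrains other records, the snippet below models the order status list as an enumeration and checks an incoming order against it. The names OrderStatus, COUNTRIES and validate_order are hypothetical, and the country table is only a small illustrative subset of ISO 3166-1 alpha-2 codes, not the full standard.

```python
from enum import Enum


class OrderStatus(Enum):
    """Enterprise-agreed order statuses (values taken from the example above)."""
    CREATED = "CREATED"
    APPROVED = "APPROVED"
    REJECTED = "REJECTED"


# Illustrative subset of ISO 3166-1 alpha-2 codes (assumption: a real system
# would load the complete, maintained list from a shared reference store).
COUNTRIES = {"FR": "France", "DE": "Germany", "US": "United States of America"}


def validate_order(order: dict) -> list[str]:
    """Return the list of reference-data violations found in an order."""
    errors = []
    if order.get("status") not in OrderStatus.__members__:
        errors.append(f"unknown order status: {order.get('status')!r}")
    if order.get("country") not in COUNTRIES:
        errors.append(f"unknown country code: {order.get('country')!r}")
    return errors
```

Because every application validates against the same shared lists, a status or country code means the same thing wherever the order travels.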
Reference Data is frequently considered a subset of master data. The full name for this data category is Master Reference Data.
Reporting data is, in short, data organized for the purpose of reporting and business intelligence. Data for operational reporting as well as data for enterprise (highly aggregated) reporting belongs in this category.
Reporting data is created from the transactional data, master data and master reference data.
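To make this dependency concrete, here is a small sketch that derives a report (revenue per country) from transactional data (orders) joined with master data (customers). The sample records and the revenue_by_country function are invented for illustration.

```python
from collections import defaultdict

# Master data: who the customers are (hypothetical sample).
customers = {
    1: {"name": "Acme", "country": "FR"},
    2: {"name": "Globex", "country": "US"},
}

# Transactional data: business events that reference the master data.
orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 2, "amount": 80.0},
    {"customer_id": 1, "amount": 30.0},
]


def revenue_by_country(orders, customers):
    """Aggregate order amounts by the country of the ordering customer."""
    totals = defaultdict(float)
    for order in orders:
        country = customers[order["customer_id"]]["country"]
        totals[country] += order["amount"]
    return dict(totals)


report = revenue_by_country(orders, customers)
```

If a customer record carries the wrong country, the report is wrong too, even though every order was recorded correctly.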
Metadata is data that describes other data; it is the underlying definition or description of data. Examples of metadata include the properties of a media file: its size, type, resolution, author, and create date. Software applications, documents, spreadsheets, and web pages are all examples that typically have associated metadata. Master data, reference data, and log data all have related metadata.
Big data has many different definitions, but the most common is from Gartner’s Doug Laney, who characterized “big data” by three Vs: volume, variety, and velocity. By its very nature, big data cannot be effectively maintained with traditional technology. Quite simply, it combines several types of data: log data, transactional data, reference data, and master data.
Unstructured Data is data that does not have a predefined structure. This type of data is mainly text. For example, a PDF document falls into this category. Techniques such as text mining can extract relevant, structured data from unstructured documents.
So, what's the problem with Master Data?
As mentioned earlier, master data is often authored (created) and used in operational systems, but is not always accurate and complete enough to fit all purposes.
For example, a phone ordering process (or application) would probably go beyond gathering only order-related data. The billing and shipping addresses of the party placing the order would also be provided. But the email address, since it is not relevant to this process, would probably not be captured. A web registration process would focus on the quality of the email address, but would not guarantee the quality of the phone number, and so on. Data entered in these applications is tailored for each application-specific scenario and usage. But at the enterprise level, such customer master data should include accurate billing and shipping addresses as well as a valid email address and phone number.
In the organization of data, transactional and reporting data rely on master (and reference) data. As a consequence, "bad master data" translates directly into untrustworthy reports and operational inefficiency.
Now, imagine a database hosting customer (or product, employee, site) records with:
- All the relevant information (aggregated from the various operational sources),
- Only valid information (no incorrect addresses or bouncing emails),
- No duplicates.
This database would be Golden Data.
Golden data is a cleansed, de-duplicated, consolidated, validated version of the original master data. Some people call it the "Single Version of The Truth" or "360° Customer View" (needless to say that Capital Letters are a must here).
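A minimal sketch of this consolidation is shown below, reusing the ordering and web registration scenario described earlier. Everything here is an assumption made for illustration: the field names, the matching rule (a normalized name), the email check, and the "first valid value wins" policy. Real matching and survivorship rules are usually far more elaborate.

```python
import re

# Deliberately loose email shape check (assumption: not a full validator).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def match_key(record: dict) -> str:
    """Hypothetical matching rule: same name, ignoring case and extra spaces."""
    return " ".join(record["name"].lower().split())


def is_valid(field: str, value: str) -> bool:
    """Reject empty values, and emails that do not look like emails."""
    if not value:
        return False
    if field == "email":
        return EMAIL_RE.match(value) is not None
    return True


def build_golden_records(sources: list[dict]) -> list[dict]:
    """De-duplicate on match_key and keep the first valid value per field."""
    golden: dict[str, dict] = {}
    for record in sources:
        target = golden.setdefault(match_key(record), {})
        for field, value in record.items():
            if field not in target and is_valid(field, value):
                target[field] = value
    return list(golden.values())


# The ordering application knows the address and phone, but not the email.
ordering = {"name": "Jane Doe", "shipping_address": "1 Main St",
            "phone": "555-0100", "email": ""}
# The web registration knows the email, but not the phone.
web = {"name": "jane  doe", "email": "jane@example.com", "phone": ""}

golden = build_golden_records([ordering, web])
```

The two source records collapse into a single customer that carries the address and phone from the ordering application and the email from the web registration.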
As you may imagine, this golden data has tremendous value for applications (BI, operational or others). It also reveals other challenges that will be discussed in future posts.