Master Data Management (MDM)
Craig Martin
January 22, 2021
Every organization struggles to manage the myriad data that permeate it. From small mom-and-pop entities to large conglomerates, the challenge is the same: What do I need to know and understand to run the business? How is that information tracked and preserved? What is most useful in supporting the business, and what is unnecessary and can be disposed of? As the number of systems (comprising applications, databases, files, and even paper) used by the business grows, complexity increases, sometimes exponentially. We all know that within most, if not all, systems, data errors exist. The true accuracy of the data is not always known, particularly by those who rely on that data to run their business. So how do we sleep at night knowing the decisions we make might be based on faulty information?
It is helpful to start by identifying the data elements used within the organization, beginning with the designated "System of Record," also known as the "SOR." An SOR contains the primary set of data that is relied on by both users and other systems. The data contained therein should represent the truth. There is usually one such system, but there may be several.
Designate data fields or elements:
1) that are essential for use by the intended audience(s) (owners, managers, operational staff, and technology staff).
2) that are essential for system operations (database keys, time stamps, logs, etc.)
3) that are not essential and are extraneous in nature (temporary values, nice-to-have data kept just in case it can be used in the future, etc.).
Next, create a list of data facts. Data facts are unique data elements that are true across the entire organization, providing a clear picture of what the audience needs. For each of the essential data elements identified in #1, define what the data represents.
For example, system 1 has a date field named “appdate” and system 2 has a date field named “applicationdate”. The definition of both is “the application date”. This definition would become a data fact, and given a name, such as “Application Date”. In the real world it gets more complex than this example, but the process of defining the data elements will weed out these differently named data elements.
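The example above can be sketched as a small lookup that resolves each system's local element name to one shared data fact. The field names ("appdate", "applicationdate") come from the article's example; the reference number "DF-001" and the structure itself are hypothetical.

```python
# Hypothetical sketch: two systems expose the same application date under
# different element names; both map to a single data fact.

DATA_FACTS = {
    "DF-001": "Application Date",
}

# (system, element name) -> data fact reference number
ELEMENT_TO_FACT = {
    ("system1", "appdate"): "DF-001",
    ("system2", "applicationdate"): "DF-001",
}

def fact_for(system: str, element: str) -> str:
    """Resolve a system-specific element name to its data fact name."""
    ref = ELEMENT_TO_FACT[(system, element)]
    return DATA_FACTS[ref]

print(fact_for("system1", "appdate"))          # Application Date
print(fact_for("system2", "applicationdate"))  # Application Date
```

Once both elements resolve to the same fact, the different names no longer matter to anyone consuming the data fact list.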
As the data fact list is being created, it is important to identify the mapping of the data elements within the data dictionary for each system with the data fact identified. This will greatly aid understanding of the various systems and how they relate to one another.
The data fact list should include:
- Data fact reference number – an arbitrary value to track the data fact. It should be unique.
- Data fact name – the name given to the data fact. It should not be abbreviated, should convey clearly what it is, but not be excessively long. It should be unique.
- Data fact definition – the definition of the data fact. It can be any length but should be succinct enough to convey the meaning of the data fact.
- Data attribute – is the value a date, number, text, or image?
- Data required – is the data required, optional, or required under certain conditions (define those conditions)?
- Size and format – depending on the data attribute, what are the minimum and maximum lengths, how large can the numbers be, are there decimal positions expected, or are there date/time requirements?
- Range boundaries – what are the minimum and maximum values expected, if applicable.
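One way to keep data fact entries consistent is to give each row of the list a fixed shape. The sketch below mirrors the bullets above; the field names and the sample values are illustrative, not prescribed by any particular tool.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataFact:
    """One row of the data fact list (field names mirror the bullets above)."""
    reference_number: str          # arbitrary unique tracking value
    name: str                      # unique, unabbreviated, clear
    definition: str                # succinct meaning of the fact
    attribute: str                 # "date", "number", "text", or "image"
    required: str                  # "required", "optional", or a condition
    min_length: Optional[int] = None   # size/format, if applicable
    max_length: Optional[int] = None
    min_value: Optional[float] = None  # range boundaries, if applicable
    max_value: Optional[float] = None

app_date = DataFact(
    reference_number="DF-001",
    name="Application Date",
    definition="The date the application was submitted.",
    attribute="date",
    required="required",
)
```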
The data dictionary is system specific and should include all the objects and data elements which support the process, even though those elements might not be considered for inclusion in the data fact list. Typically, there are ways to get the data for the data dictionary directly out of the system (particularly databases), so it doesn’t require much manual effort.
Every situation may be a little different, so some guidelines for what to include may help.
- Always include the data elements under an object. It is difficult to add these later if needed.
- Always include objects that have data elements where values are added, modified, or deleted by users.
- May exclude objects that support system processes, are backups of data, or are not relevant to the data set.
Data Dictionaries should include:
- Level 1: System designation.
- Entity Order Reference – must be unique across all systems.
- Name – the name of the system.
- Status – active, retired, etc.
- Type – application, database, spreadsheet, file system, etc.
- Description – brief description including its purpose.
- Location – the physical location(s).
- Environment – production, test, development, etc.
- Level 2: For each level 1 system – the tables, folders, or category names which organize groups of fields or data elements.
- Entity Order Reference from level 1.
- Object Order Reference – must be unique under each entity order reference.
- Name – the name of the object.
- Status – active, retired, etc.
- Type – table, view, file folder, etc.
- Description – a brief description.
- Comment – any notes or comments that add value, such as relationships to other objects.
- Level 3: For each level 2 designation – the data fields or element names which contain specific data values.
- Object Order Reference from level 2.
- Element Order Reference – must be unique under each object order reference.
- Name – the name of the element.
- Status – active, retired, etc.
- Type – column, field, file, etc.
- Expectation – required or not or conditional on some other event.
- Default – the default value if not entered.
- Description – a brief description.
- Comment – any notes or comments that add value.
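The three levels above nest naturally: a system holds objects, and each object holds elements. A minimal sketch, with invented system, object, and element names:

```python
# Hypothetical three-level data dictionary entry. All names and values
# are illustrative; only the field structure follows the levels above.
data_dictionary = {
    "entity_order_ref": 1,                 # Level 1: the system
    "name": "LoanApp",
    "status": "active",
    "type": "application",
    "description": "Loan origination system.",
    "location": "HQ data center",
    "environment": "production",
    "objects": [
        {
            "object_order_ref": 1,         # Level 2: table/folder
            "name": "applications",
            "status": "active",
            "type": "table",
            "description": "One row per loan application.",
            "comment": "Related to the applicants table.",
            "elements": [
                {
                    "element_order_ref": 1,  # Level 3: field
                    "name": "appdate",
                    "status": "active",
                    "type": "column",
                    "expectation": "required",
                    "default": None,
                    "description": "The application date.",
                    "comment": "",
                },
            ],
        },
    ],
}
```

Because order references are only unique within their parent, an element is addressed by the full path (entity, object, element), not by its element reference alone.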
The data mapping list is simply a list of data elements that belong to the data fact definition. Element names may be different, but the data values within the element fit the data fact definition.
Data Mapping Lists should include:
- Data fact reference number
- Data fact name
- Entity order reference number
- System name
- Object order reference number
- Object name
- Element order reference number
- Element name
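Each mapping row ties one data fact to one concrete element in a system's data dictionary. A sketch, reusing the article's example element names (the object names and reference numbers are invented):

```python
# Hypothetical data mapping list: one data fact, two systems.
mapping_list = [
    {
        "data_fact_ref": "DF-001",
        "data_fact_name": "Application Date",
        "entity_order_ref": 1,
        "system_name": "system1",
        "object_order_ref": 1,
        "object_name": "applications",
        "element_order_ref": 1,
        "element_name": "appdate",
    },
    {
        "data_fact_ref": "DF-001",
        "data_fact_name": "Application Date",
        "entity_order_ref": 2,
        "system_name": "system2",
        "object_order_ref": 1,
        "object_name": "apps",
        "element_order_ref": 3,
        "element_name": "applicationdate",
    },
]

# Every element mapped to one fact, regardless of its local name:
elements = [(r["system_name"], r["element_name"])
            for r in mapping_list if r["data_fact_ref"] == "DF-001"]
```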
Once the “SOR” data facts have been identified and documented, then repeat the process for the other systems that have valuable information that may not be available on the SOR system. When data facts are the same on both systems, don’t create a new data fact. Just reference the data fact from the initial system.
Data quality needs to be focused on the most useful and important data. It is typically a fool’s errand to try to clean up all data within an organization, as the time and dollar commitment is too great. Cleanup of data should be prioritized based on the greatest need. Over time additional work can be done in other areas until the desired outcome is achieved.
It can be helpful to create an audience summary list to further identify what data needs to be addressed first. There is overlap between the various audiences so data facts will be duplicated across audiences in this list. But the granularity will usually be different as management typically likes to see summaries, unless a problem is detected, and only then are details needed for those records. Technology users are interested in different sets of data elements altogether – more about the system control data values – less about user entered data values.
Audience Summary List should include:
- Audience name – Management, Operations, Technology, etc.
- Data fact reference number
- Data fact name
- Granularity – how much detail is needed. Example values are “Detail”, “Total By x”, “Total Within Range.”
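Since the same data fact recurs across audiences at different granularity, the summary list is easy to query by audience. A small sketch with hypothetical rows:

```python
# Hypothetical audience summary rows: one data fact, two audiences,
# different granularity, as described above.
audience_summary = [
    {"audience": "Management", "fact_ref": "DF-001",
     "fact_name": "Application Date", "granularity": "Total By Month"},
    {"audience": "Operations", "fact_ref": "DF-001",
     "fact_name": "Application Date", "granularity": "Detail"},
]

def facts_for(audience: str):
    """List the (fact name, granularity) pairs one audience cares about."""
    return [(r["fact_name"], r["granularity"])
            for r in audience_summary if r["audience"] == audience]
```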
Next, organize what chunk of data will be cleaned up by:
1) identifying the primary audience which needs the greatest attention.
2) identifying the data facts that are of the greatest concern.
3) identifying the primary system which contains the data elements tied to this audience and data fact list. Less ideally, this may include several systems which need to be evaluated and cleaned up together.
Next, determine how the data stacks up against the selected data facts. There are many ways to do this, some very labor intensive, and others which use tools to assist in the comparisons. A review of these is outside the scope of this article. The following are things to look for in the comparison:
1) any missing expected data.
2) any inaccurate data. This is a tough one to identify without comparing the data to its source, which is very time consuming. Sometimes it is useful to compare to other data elements and look for relationships that seem odd. Additionally, if the same data exists on another system, a cross-check could be done between those systems to look for differences. In either case, only these flagged records would be compared back to the source, instead of all of them.
3) any data values which don’t fit the data fact requirements.
4) any duplicated data.
5) any partially duplicated data across multiple records (such as multiple contact records with a slightly different name but the same address).
6) any data outside a range of values (such as old records no longer needed that can be archived, or dates in the next century that can’t be right, etc.).
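Several of the checks above are mechanical once the data facts are defined. A minimal sketch of points 1, 4, and 6, using invented records checked against the hypothetical "Application Date" fact:

```python
from collections import Counter
from datetime import date

# Hypothetical records; ids and dates are invented for illustration.
records = [
    {"id": 1, "appdate": date(2020, 5, 1)},
    {"id": 2, "appdate": None},               # missing expected data
    {"id": 3, "appdate": date(2120, 5, 1)},   # a date in the next century
    {"id": 4, "appdate": date(2020, 5, 1)},
    {"id": 5, "appdate": date(2020, 5, 1)},
]

# 1) missing expected data
missing = [r["id"] for r in records if r["appdate"] is None]

# 6) values outside a plausible range (assumed boundaries)
out_of_range = [r["id"] for r in records if r["appdate"] is not None
                and not (date(1990, 1, 1) <= r["appdate"] <= date.today())]

# 4) duplicated values appearing on multiple records
counts = Counter(r["appdate"] for r in records if r["appdate"] is not None)
duplicates = [value for value, n in counts.items() if n > 1]
```

Real cleanup work usually layers fuzzy matching (for point 5) and cross-system comparison (point 2) on top of simple rules like these.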
Finally, make the corrections to the data through the means best suited to the system and staff expertise. Repeat the cleanup process by selecting another chunk as time and money allow. If that can't happen right away, at least you can feel good about the cleanup that was done. It is an important step in improving data quality, with the added benefit that next time it will be easier, because cleaning up data for one area typically flows into the other areas.
Don’t forget to implement business or system processes that will help prevent the occurrence of errors in the first place. But that is another story entirely.