Skip to main content
photo
 
Data Quality Tools

Pam Neely, Graduate Assistant, Center for Technology in Government

Pam Neely gave a presentation on how data problems can be mapped to features of data quality tools. She began by explaining the basic process of building a data warehouse. Data usually comes from one or several source databases, migrates to an intermediate storage area where the data is transformed, then it migrates to the data warehouse. Along the way, there are tools— auditing tools, cleansing tools and migration tools— to help ensure that the data is clean and accurate prior to being used to populate the data warehouse:

Auditing Tools, Cleansing Tools, and Migration Tools

 

Auditing tools are used at the source, cleansing tools are used in the intermediate stage, and migration tools are used to move the data from the source to the staging area and then to the data warehouse. The data auditing tools ensure accuracy and correctness at the source of the data; they also compare the data to a set of business rules to perform validation checks. Data cleansing tools are used in the intermediate stage. They cleanse data by breaking the records into atomic units, standardizing the elements, matching records against each other to check for duplication, and consolidating data. Finally, data migration tools extract data from a source database, move it to the intermediate staging area, map data to the data warehouse schema, and move it into the warehouse. Pam presented a matrix mapping data quality problems to a selected set of data quality tools.

Table 2: Data Quality Matrix
Problems
 
Features
 
Tools
 
Auditing Tools
 
  
Is your data complete and valid?
 
Data examination-determines quality of data, patterns within it, and number of different fields used
 
WizSoft-WizRule
Vality-Integrity
 
How well does the data reflect the business rules? Do you have missing values, illegal values, inconsistent values, invalid relationships?
 
Compare to business rules and assess data for consistency and completeness against rules
 
Prism Solutions, Inc.-Prism
Quality Manager
WizSoft-WizRule
Vality-Integrity
 
Are you using sources that do not comply to your business rules?
 
Data reengineering-examining the data to determine what the business rules are
 
WizSoft-WizRule
Vality-Integrity
 
Cleansing Tools
 
  
Does your data need to be broken up between source and data warehouse?
 
Data parsing (elementizing)-context and destination of each component of each field
 
Trillium Software-Parser
i.d. Centric-DataRight
 
Does your data have abbreviations that should be changed to insure consistency?
 
Data standardizing-converting data elements to forms that are standard throughout the DW
 
Trillium Software-Parser
i.d. Centric-DataRight
 
Is your data correct?
 
Data correction and verification-matches data against known lists (addresses, product lists, customer lists)
 
Trillium Software-Parser Trillium Software-GeoCoder i.d. Centric-ACE. Clear I.D. Library Group 1 - NADIS
 
Is there redundancy in your data?
 
Record matching-determines whether two records represent data on the same object
 
Trillium Software-Matcher Innovative Systems-Match i.d. Centric-Match/Consolidation Group 1-Merge/Purge Plus
 
Are there multiple versions of company names in your database?
 
Record matching-based on user specified fields such as tax ID
 
Innovative Systems-Corp.-Match
 
Is your data consistent prior to entering data warehouse?
 
Transform data-“1” for male, “2” for female becomes “M” & “F”-ensures consistent mapping between source systems and data warehouse
 
Vality-Integrity
i.d. Centric-Match/Consolidation
 
Do you have information in free form fields that differs between databases?
 
Data reengineering-examining the data to determine what the business rules are
 
Vality-Integrity
 
Do you multiple individuals in the same household that need to be grouped together?
 
Householding-combining individual records that have same address
 
i.d. Centric-
Match/Consolidation Trillium Software-Matcher
 
Does your data contain atypical words-such as industry specific words, ethnic or hyphenated names?
 
Data parsing combined with data verification – comparison to industry specific lists
 
i.d. Centric-ACE, Clear I.D.
 
Migration and Other Tools
 
  
Do you have multiple formats to be accessed-relational dbs, flat files, etc?
 
Access the data then map it to the dw schema
 
Enterprise/Integrator by Carleton.
 
Do you have free form text that needs to be indexed, classified, other?
 
Text mining-extracts meaning and relevance from large amounts of information
 
Semio-SemioMap
 
Have the rules established during the data cleansing steps been reflected in the metadata?
 
Documenting-documenting the results of the data cleansing steps in the metadata
 
 
Is data Y2K compliant?
 
  
Is the quality of the data poor and people don’t care because they have adjusted to it?
 
  

Finally, Pam mentioned a few important questions that data quality tools cannot address:

(1) Have the users of the source database adjusted to poor quality of the database and developed “work-arounds”?

(2) Is there conflicting information that can’t be compared to a known resource?

(3) Do you have a lot of soft data that needs to be placed in your data warehouse?