Monday, November 21, 2011

Data quality problems and mapping their relation to cleansing tools


A data quality problem has two components:
  • Cleansing records to ensure data validity, accuracy and completeness.
  • Determining a unique individual.
These two components can make a data quality problem unwieldy. There is a functional dimension to the data quality issue and a technical approach to remediation. Data quality tools aid technical remediation, while an understanding of the functional dimension helps you harness the capabilities of these solutions.
There are many tools on the market to resolve data quality problems. Some are very elaborate, while others are minimalist. Even within a comprehensive tool, certain modules may be required and others may not. Since the capabilities of data quality tools differ, so do their prices. Which functions matter in a data quality tool depends on the organization’s needs. In this context, you can map your data quality problems to the tool capabilities that address them in two steps.
Step 1: Identify functional tool capability challenges which create data quality problems
As part of the first step in mapping issues to the data quality tool's deficiencies, you have to take a look at the following aspects.
Reporting: How bad is it?
These issues are generally related to data profiling at the column level to generate statistics such as the range of values, the row count for each value and the NULL count.
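To make profiling concrete, here is a minimal sketch in plain Python (the list-of-dicts record layout and the ‘gender’ column are illustrative assumptions, not any particular tool’s format):

    from collections import Counter

    def profile_column(records, column):
        """Column-level profile: range of values, row count per value, NULL count."""
        values = [r.get(column) for r in records]
        non_null = [v for v in values if v is not None]
        stats = {
            "row_count": len(values),
            "null_count": len(values) - len(non_null),
            "value_counts": Counter(non_null),   # row count against each value
        }
        if non_null:
            stats["min"], stats["max"] = min(non_null), max(non_null)  # value range
        return stats

    records = [{"gender": "M"}, {"gender": "Male"}, {"gender": None}, {"gender": "M"}]
    print(profile_column(records, "gender"))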
Rules: What can we do about it now?
On this front, you have to look at the creation of a rule library that defines the various transformation rules used to correct the data.
Transformation: How can we fix it?
Here we are talking about the tool’s functionality to ‘correct’ or ‘standardize’ data. For example (a code sketch follows the list):
  • If the gender field contains ‘M,’ ‘Males,’ or ‘Male,’ then set the target value to ‘M.’
  • If the pin code is ‘90000,’ then set it to null.
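As a minimal sketch of a rule library applying the two example rules above (the dict-per-record layout and the rule representation are assumptions for illustration, not a specific tool’s API):

    # A tiny rule library: each rule pairs a condition with a corrective action.
    RULES = [
        # Standardize common gender variants to 'M'.
        (lambda r: r.get("gender") in {"M", "Males", "Male"},
         lambda r: r.update(gender="M")),
        # Treat the all-zeros placeholder pin code as missing.
        (lambda r: r.get("pin_code") == "90000",
         lambda r: r.update(pin_code=None)),
    ]

    def apply_rules(record, rules=RULES):
        """Apply every matching transformation rule to a record, in order."""
        for condition, action in rules:
            if condition(record):
                action(record)
        return record

    print(apply_rules({"gender": "Males", "pin_code": "90000"}))
    # {'gender': 'M', 'pin_code': None}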
Other functional capabilities to evaluate are:
Data cleansing: Basic data cleansing functions (sketched in code after the list) like:
  • Left and right justification   
  • Trimming blanks and spaces
  • Case conversion
  • Range checks
  • Uniqueness
  • Patterns
  • Date/time validation and reformatting
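A minimal sketch of a few of these basic functions in plain Python (the six-digit pin-code pattern and the date formats are illustrative assumptions):

    import re
    from datetime import datetime

    def cleanse_name(value):
        """Trim blanks and normalize case: '  jOHN  ' -> 'John'."""
        return value.strip().title()

    def in_range(value, low, high):
        """Range check: is the value within [low, high]?"""
        return low <= value <= high

    def matches_pattern(value, pattern=r"\d{6}"):
        """Pattern check, e.g. a six-digit pin code."""
        return re.fullmatch(pattern, value) is not None

    def reformat_date(value, in_fmt="%d/%m/%Y", out_fmt="%Y-%m-%d"):
        """Validate a date string and reformat it; raises ValueError if invalid."""
        return datetime.strptime(value, in_fmt).strftime(out_fmt)

    print(cleanse_name("  jOHN  "))     # John
    print(in_range(42, 0, 120))         # True
    print(matches_pattern("110001"))    # True
    print(reformat_date("21/11/2011"))  # 2011-11-21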
Data enrichment/appends: In addition to standard cleansing, the ability to ‘enrich’ records by appending data acquired from internal or external sources, for example to fill in identified missing data elements.
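A minimal enrichment sketch, assuming a hypothetical pin-code-to-city reference table (in practice this would come from an internal master or an external postal file):

    # Hypothetical reference data: pin code -> city.
    PIN_TO_CITY = {"110001": "New Delhi", "400001": "Mumbai"}

    def enrich(record, reference=PIN_TO_CITY):
        """Fill in a missing 'city' attribute by looking up the pin code."""
        if record.get("city") is None:
            record["city"] = reference.get(record.get("pin_code"))
        return record

    print(enrich({"name": "A. Kumar", "pin_code": "110001", "city": None}))
    # {'name': 'A. Kumar', 'pin_code': '110001', 'city': 'New Delhi'}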
Address standardization and cleansing: It’s essential that your data quality tool has address cleansing functionality to standardize addresses. Examples of these capabilities (sketched in code after the list) include:
  • Fixing spelling
  • Identifying and standardizing address components
  • Validating addresses against post-office address records
  • Validating postal codes
  • Validating street names
  • Standardizing street suffixes to consistent, valid forms
  • Validating house numbers
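A minimal sketch of two of these capabilities, suffix standardization and postal-code validation (the suffix table and the six-digit format are illustrative assumptions; production tools use full postal reference files):

    import re

    # Hypothetical lookup of street-suffix variants to a canonical form.
    SUFFIXES = {"st": "Street", "st.": "Street", "rd": "Road", "rd.": "Road",
                "ave": "Avenue", "ave.": "Avenue"}

    def standardize_address(line):
        """Normalize whitespace and case, and standardize street suffixes."""
        words = line.strip().split()
        return " ".join(SUFFIXES.get(w.lower(), w.title()) for w in words)

    def valid_pin_code(code):
        """Validate a postal code against a simple six-digit format."""
        return re.fullmatch(r"\d{6}", code) is not None

    print(standardize_address("12  gandhi rd."))  # 12 Gandhi Road
    print(valid_pin_code("110001"))               # True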
Data merging: Data often needs to be obtained from multiple input sources and ‘merged.’ When a data attribute is duplicated across input sources, a user-specified business rule is needed to identify the ‘correct’ value. For example, the rule may specify: “use the pin code from Source-A when available; however, when it is null, use the pin code from Source-B.”
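That example rule is essentially a coalesce; a minimal sketch, with the record layout assumed:

    def merged_pin_code(source_a, source_b):
        """Survivorship rule: prefer Source-A's pin code; if null, use Source-B's."""
        # 'or' also treats an empty string as missing, which is usually desirable here.
        return source_a.get("pin_code") or source_b.get("pin_code")

    a = {"customer_id": 1, "pin_code": None}
    b = {"customer_id": 1, "pin_code": "400001"}
    print(merged_pin_code(a, b))   # 400001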
Customer de-duplication and data merging: Data about a particular customer may exist in multiple input data sources, and each source may use a different customer identifier. The tool must link or merge data from multiple sources by identifying that records refer to the same customer, based on user-specified business rules: for example, same address, same gender and same last name.
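A minimal sketch of rule-based matching on those three attributes (the field names and normalization choices are assumptions):

    def match_key(record):
        """Comparison key from the business rule: address + gender + last name."""
        return (record["address"].strip().lower(),
                record["gender"].strip().upper(),
                record["last_name"].strip().lower())

    def deduplicate(records):
        """Group records from multiple sources that refer to the same customer."""
        groups = {}
        for r in records:
            groups.setdefault(match_key(r), []).append(r)
        return list(groups.values())

    rows = [
        {"source": "A", "address": "12 Gandhi Road", "gender": "M", "last_name": "Kumar"},
        {"source": "B", "address": "12 gandhi road ", "gender": "m", "last_name": "KUMAR"},
    ]
    print(len(deduplicate(rows)))  # 1 -> both rows linked to one customer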
Incremental processing: The ability to quickly add or reload values as the source data changes, and to track source-to-target record linkage.
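One common way to implement this is a watermark plus a lineage map; a minimal sketch, where the versioning scheme and target-key format are assumptions:

    # Process only rows newer than the last watermark, and keep a
    # source-to-target linkage map for traceability.
    state = {"watermark": 0, "lineage": {}}   # lineage: source id -> target id

    def incremental_load(source_rows, state):
        """Load only rows newer than the last watermark; record linkage."""
        new_mark = state["watermark"]
        for row in source_rows:
            if row["version"] > state["watermark"]:
                state["lineage"][row["id"]] = f"T-{row['id']}"  # assumed key scheme
                new_mark = max(new_mark, row["version"])
        state["watermark"] = new_mark
        return state

    incremental_load([{"id": 101, "version": 1}, {"id": 102, "version": 2}], state)
    print(state)   # watermark advanced to 2; lineage holds both rows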
Step 2: Reconciliation 
As part of the reconciliation process, we have to check whether the measures we undertook deliver the right results. Testing and validation are the ways in which we can ascertain the efficacy of our corrective measures.
Testing: Functionality to test rules on sample sets of data.
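A minimal sketch of such a test, checking the gender-standardization rule from Step 1 against a small hand-picked sample:

    def standardize_gender(value):
        """The rule under test: map common gender variants to 'M'."""
        return "M" if value in {"M", "Males", "Male"} else value

    # Expected outputs for a sample set of input values.
    samples = {"Male": "M", "Males": "M", "M": "M", "F": "F"}
    for raw, expected in samples.items():
        assert standardize_gender(raw) == expected, f"rule failed for {raw!r}"
    print("all sample checks passed")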
Validation: Providing statistics about each data load (sketched in code after the list). These include areas like:
  • Number of records processed
  • Hash totals and checksum totals
  • Number of times a rule was invoked
  • Post-cleanse ‘data visualization’
  • Audits of cleansed and transformed data partitions
  • Generation and validation of data load statistics
  • Creation and population of error and exception tables
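A minimal sketch of collecting a few of these load statistics (the mandatory-field check and the checksum scheme are illustrative assumptions):

    import hashlib
    from collections import Counter

    def validate_load(records, rule_log):
        """Summarize a load: record count, checksum, rule invocations, exceptions."""
        return {
            "records_processed": len(records),
            # A simple checksum over the load, for reconciliation between runs.
            "checksum": hashlib.md5(repr(sorted(map(str, records))).encode()).hexdigest(),
            "rule_invocations": Counter(rule_log),  # rule name -> times invoked
            # Error/exception 'table': records failing a mandatory-field check.
            "exceptions": [r for r in records if r.get("pin_code") is None],
        }

    records = [{"pin_code": "110001"}, {"pin_code": None}]
    print(validate_load(records, rule_log=["null_pin_code"]))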
The business need for a data quality tool stems from the commercial value (or use) of the data. If data quality problems hamper the use of that data, then the data creates more nuisance than business value. In this context, one-time cleansing is not enough. The ongoing use of a data quality tool and its associated processes (which the tool forces your organization to adopt) leads to a sustained reduction of data quality issues across the organization.
