Monday, November 21, 2011

Data quality problems and mapping their relation to cleansing tools


There are two components to the data quality problem:
  • Cleansing records to ensure data validity, accuracy and completeness.
  • Determining a unique individual.
This dual nature makes the data quality problem tricky. There is a functional dimension to the data quality issue and a technical approach to remediation. Data quality tools aid technical remediation, while an understanding of the functional dimension helps you harness the capabilities of these solutions.
There are many tools available in the market to resolve data quality problems. Some of these tools are very elaborate, while others are minimalist. Even in a comprehensive tool, certain modules may be required and others may not. Since the capabilities of data quality tools differ, so do their prices. The functions sought in a data quality tool depend on the organization’s needs. In this context, you can map several of your data quality problems to gaps in the tool’s capabilities in two steps.
Step 1: Identify functional tool capability challenges which create data quality problems
As part of the first step in mapping issues to the data quality tool's deficiencies, you have to take a look at the following aspects.
Reporting: How bad is it?
These issues are generally related to data profiling at the column level to generate statistics about aspects such as the range of values, row count against each value and NULL count.
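As an illustration, column-level profiling statistics like these can be sketched directly in SQL. This is a minimal example; the Customer table and its Gender and PinCode columns are hypothetical stand-ins for your own source data:

    -- Row count against each distinct value of a column (NULLs group together).
    SELECT Gender AS ColumnValue, COUNT(*) AS RowCnt
    FROM Customer
    GROUP BY Gender
    ORDER BY RowCnt DESC;

    -- Range of values and NULL count for a single column.
    SELECT MIN(PinCode) AS MinValue,
           MAX(PinCode) AS MaxValue,
           SUM(CASE WHEN PinCode IS NULL THEN 1 ELSE 0 END) AS NullCnt,
           COUNT(*) AS TotalRows
    FROM Customer;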
Rules: What can we do about it now?
On this front, you have to look at the creation of a rule library that defines the various transformation rules used to correct the data.
Transformation: How can we fix it?
Here we are talking about the tool’s functionality to ‘correct’ or ‘standardize’ data. For example (sketched in T-SQL after the list),
  • If the gender field contains ‘M,’ ‘Males,’ or ‘Male,’ then set the target value to ‘M.’
  • If the pin code (zip code) is the invalid value ‘90000,’ then set it to null.
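A minimal T-SQL sketch of these two rules, assuming a hypothetical staging table Customer_Stage with Gender and PinCode columns:

    -- Standardize gender values and null out the invalid pin code before the
    -- target load; the table and column names are illustrative.
    UPDATE Customer_Stage
    SET Gender = CASE WHEN Gender IN ('M', 'Male', 'Males') THEN 'M'
                      ELSE Gender END,
        PinCode = CASE WHEN PinCode = '90000' THEN NULL
                       ELSE PinCode END;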
Other functional requirements that may be at fault are:
Data cleansing: Basic data cleansing functions like the following (a few of them are sketched in T-SQL after the list):
  • Left and right justification   
  • Trimming blanks and spaces
  • Case conversion
  • Range checks
  • Uniqueness
  • Patterns
  • Date/time validation and reformatting
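A few of these functions sketched in T-SQL, again against a hypothetical Customer_Stage table; the column names are assumptions for illustration:

    -- Trim blanks and convert case in one pass.
    UPDATE Customer_Stage
    SET LastName = UPPER(LTRIM(RTRIM(LastName)));

    -- Range check: flag ages outside a plausible range.
    SELECT * FROM Customer_Stage WHERE Age NOT BETWEEN 0 AND 120;

    -- Pattern check: pin codes that are not exactly six digits.
    SELECT * FROM Customer_Stage
    WHERE PinCode NOT LIKE '[0-9][0-9][0-9][0-9][0-9][0-9]';

    -- Date validation: text dates that do not parse as valid dates.
    SELECT * FROM Customer_Stage WHERE ISDATE(BirthDateText) = 0;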
Data enrichment/appends: In addition to standard cleansing, the ability to ‘enrich’ or append data acquired from internal/external sources, such as filling in identified missing data elements.
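As a sketch, enrichment can be as simple as a lookup against a reference source; the PinCode_Master table here is a hypothetical external reference used to fill in missing city names:

    -- Append a missing City by looking it up from a reference table
    -- keyed on pin code.
    UPDATE s
    SET s.City = m.City
    FROM Customer_Stage AS s
    JOIN PinCode_Master AS m ON m.PinCode = s.PinCode
    WHERE s.City IS NULL;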
Address standardization and cleansing: It’s essential that your data quality tool has address cleansing functionality to standardize addresses (a partial sketch follows the list). Examples of these capabilities include:
  • Fixing spelling errors
  • Identification and standardization of address components
  • Validation of addresses against post office records
  • Validation of postal codes
  • Validation of street names
  • Standardization to consistent, valid street suffixes
  • Validation of house numbers
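Full postal validation normally relies on the vendor’s reference data, but suffix standardization can be sketched with a simple lookup; Address_Stage and Suffix_Lookup are hypothetical tables:

    -- Standardize street suffixes ('St.', 'Str' -> 'Street') via a lookup table.
    UPDATE a
    SET a.StreetSuffix = l.StandardSuffix
    FROM Address_Stage AS a
    JOIN Suffix_Lookup AS l ON l.RawSuffix = a.StreetSuffix;

    -- Flag postal codes that fail a basic six-digit format check.
    SELECT * FROM Address_Stage
    WHERE PostalCode NOT LIKE '[0-9][0-9][0-9][0-9][0-9][0-9]';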
Data merging: Data often needs to be obtained from multiple input sources and ‘merged.’ When a data attribute is duplicated in the input sources, a user-specified business rule needs to be used to identify the ‘correct’ value. For example, the rule may specify “use the pin code from Source-A when available; when it is null, use the pin code from Source-B.”
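This precedence rule maps naturally to COALESCE; SourceA and SourceB are hypothetical staging tables:

    -- Source-A's pin code wins when it is not null; otherwise Source-B's is used.
    SELECT a.CustomerID,
           COALESCE(a.PinCode, b.PinCode) AS PinCode
    FROM SourceA AS a
    LEFT JOIN SourceB AS b ON b.CustomerID = a.CustomerID;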
Customer de-duplication and data merging: Data about a particular customer may exist in multiple input data sources, and each data source may have a different customer identifier. It is necessary to link or merge data from multiple data sources by identifying, based on user-specified business rules, that the records refer to the same customer: for example, same address, same gender and same last name.
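A minimal sketch of such a matching rule, assuming a hypothetical combined table Customer_AllSources; the ORDER BY expresses an illustrative survivorship rule:

    -- Link records that share address, gender and last name; keep one
    -- survivor per matched group.
    WITH Matched AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY Address, Gender, LastName
                   ORDER BY SourceSystem, CustomerID) AS rn
        FROM Customer_AllSources
    )
    SELECT * FROM Matched WHERE rn = 1;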
Incremental processing: The ability to quickly add or reload values as the source data changes, and the ability to track source-to-target record linkage.
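A sketch of incremental processing using T-SQL’s MERGE (SQL Server 2008 and later); the tables and the SourceCustomerID linkage column are hypothetical:

    -- Add or update target rows as the source changes, keeping the source
    -- key for source-to-target record linkage.
    MERGE Customer_Target AS t
    USING Customer_Stage AS s
        ON t.SourceCustomerID = s.CustomerID
    WHEN MATCHED THEN
        UPDATE SET t.PinCode = s.PinCode, t.LoadDate = GETDATE()
    WHEN NOT MATCHED THEN
        INSERT (SourceCustomerID, PinCode, LoadDate)
        VALUES (s.CustomerID, s.PinCode, GETDATE());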
Step 2: Reconciliation 
As part of the reconciliation process, we have to check whether the measures we undertook deliver the right results. Testing and validation are the ways in which we can ascertain the efficacy of our corrective measures.
Testing: Functionality to test rules on sample sets of data.
Validation: Providing statistics about each data load (a minimal sketch follows the list). These include areas like:
  • Number of records processed
  • Hash totals and checksum totals
  • Number of times each rule was invoked
  • Post-cleanse ‘data visualization’
  • Audit of the cleansed and transformed data partition
  • Generation and validation of data load statistics
  • Creation and population of error and exception tables
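A minimal sketch of two of these validation outputs; Customer_Target and Customer_Exceptions are hypothetical tables:

    -- Records processed and a checksum control total for the load.
    SELECT COUNT(*) AS RecordsProcessed,
           CHECKSUM_AGG(CHECKSUM(*)) AS LoadChecksum
    FROM Customer_Target;

    -- Populate an exception table with rows that failed a rule.
    INSERT INTO Customer_Exceptions (CustomerID, FailedRule, LoadDate)
    SELECT CustomerID, 'Invalid pin code', GETDATE()
    FROM Customer_Stage
    WHERE PinCode NOT LIKE '[0-9][0-9][0-9][0-9][0-9][0-9]';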
The business need for a data quality tool is the commercial value (or use) of the data. If data quality problems hamper the use of that data, then it creates nuisance rather than business value. In this context, one-time cleansing is not enough. The use of a data quality tool and the associated processes (which the tool forces your organization to adopt) leads to a reduction of data quality issues across the organization.

Friday, November 4, 2011

Types of reports in SSRS

  • Parameterized Reports: A parameterized report uses input values to complete report or data processing. The parameter is typically applied in the dataset query’s WHERE clause: with an IN condition, SSRS can present the parameter as a multi-value drop-down list, while with an equality condition the report takes a single value as input and displays data according to the selection (a minimal sketch of both query styles appears after this list).
  • Linked Reports: A linked report is a report server item that provides an access point to an existing report. Conceptually, it is similar to a program shortcut that you use to run a program or open a file. A linked report is derived from an existing report and retains the original’s report definition. A linked report always inherits the report layout and data source properties of the original report. All other properties and settings can be different from those of the original report, including security, parameters, location, subscriptions, and schedules.
  • Snapshot Reports: A report snapshot is a report that contains layout information and query results that were retrieved at a specific point in time. Unlike on-demand reports, which get up-to-date query results when you select the report, report snapshots are processed on a schedule and then saved to a report server. When you select a report snapshot for viewing, the report server retrieves the stored report from the report server database and shows the data and layout that were current for the report at the time the snapshot was created. Report snapshots serve three purposes:
    • Report history. By creating a series of report snapshots, you can build a history of a report that shows how data changes over time.
    • Consistency. Use report snapshots when you want to provide consistent results for multiple users who must work with identical sets of data. With volatile data, an on-demand report can produce different results from one minute to the next. A report snapshot, by contrast, allows you to make valid comparisons against other reports or analytical tools that contain data from the same point in time.
    • Performance. By scheduling large reports to run during off-peak hours, you can reduce processing impact on the report server during core business hours.
  • Cached Report: A cached report is a saved copy of a processed report. Cached reports are used to improve performance by reducing the number of processing requests to the report processor and by reducing the time required to retrieve large reports. They have a mandatory expiration period, usually in minutes.
  • Click through reports: A clickthrough report is a report that displays related data from a report model when you click the interactive data contained within your model-based report. These reports are generated by the report server based on the information contained within the report model. The person who created the model determines which fields are interactive and which fields are returned when a clickthrough report is opened. These field settings cannot be changed in the report authoring tools.
  • Drilldown reports: Drilldown reports initially hide complexity and enable the user to toggle conditionally hidden report items to control how much detail data they want to see. Drilldown reports must retrieve all possible data that can be shown in the report.
  • Drill through reports: Drillthrough reports are standard reports that are accessed through a hyperlink on a text box in the original report. Drillthrough reports work with a main report and are the target of a drillthrough action for a report item such as placeholder text or a chart. The main report displays summary information, for example in a matrix or chart. Actions defined in the matrix or chart provide drillthrough links to reports that display greater details based on the aggregate in the main report. Drillthrough reports can be filtered by parameters, but they do not have to be. Drillthrough reports differ from subreports in that the report does not display within the original report, but opens separately.
  • Sub reports: A subreport is a report that displays another report inside the body of a main report. Conceptually, a subreport is similar to a frame in a Web page. It is used to embed a report within a report. Any report can be used as a subreport. The subreport can use different data sources than the main report.
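Referring back to the parameterized reports item above, here is a minimal T-SQL sketch of the two dataset query styles. The Sales table, its columns, and the @Region parameter are hypothetical stand-ins for your own report dataset:

    -- Hypothetical SSRS dataset queries using the report parameter @Region.

    -- Single-value parameter: the equality condition takes one value as input.
    SELECT OrderID, Region, Amount
    FROM Sales
    WHERE Region = @Region;

    -- Multi-value parameter: with an IN condition, SSRS can present @Region
    -- as a multi-select drop-down list and expands the selected values.
    SELECT OrderID, Region, Amount
    FROM Sales
    WHERE Region IN (@Region);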