Monday, November 21, 2011

Data quality problems and mapping their relation to cleansing tools


There are two components to the data quality problem:
  • Cleansing records to ensure data validity, accuracy and completeness.
  • Determining a unique individual.
Together, these two components make the data quality problem tricky to pin down. There is a functional dimension to the data quality issue and a technical approach to remediation. Data quality tools aid the technical remediation, while the functional dimension helps you harness the capabilities of these solutions. 
There are many tools available in the market to resolve data quality problems. Some of these tools are very elaborate, while others are minimalist. Even in a comprehensive tool, certain modules may be required and others may not. Since the capabilities of data quality tools differ, their prices differ. The functions which are sought after in a data quality tool depend on the organization’s needs. In this context, you can map several of your data quality problems to issues with the tool’s capabilities in two steps.
Step 1: Identify functional tool capability challenges which create data quality problems
As part of the first step in mapping issues to the data quality tool's deficiencies, you have to take a look at the following aspects.
Reporting: How bad is it?
These issues are generally related to data profiling at the column level to generate statistics about aspects such as the range of values, row count against each value and NULL count.
Rules: What can we do about it now?
On this front you have to look at creation of a rule library to define various transformation rules in order to correct the data.
Transformation: how can we fix it?
Here we are talking about the tool’s functionality to ‘correct’ or ‘standardize’ data. For example,
  • If the gender field contains ‘M,’ ‘Males,’ or ‘Male,’ then set the target value to ‘M.’
  • If the zip code is ‘90000,’ then set the pin code to null.
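A minimal T-SQL sketch of such transformation rules, assuming a hypothetical staging table stg_Customer with Gender, ZipCode and PinCode columns:
  -- Standardize the gender field to 'M'
  update stg_Customer
  set Gender = 'M'
  where Gender in ('M', 'Male', 'Males');

  -- Null out the pin code when the zip code is the dummy value '90000'
  update stg_Customer
  set PinCode = null
  where ZipCode = '90000';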
Other functional requirements at fault are:
Data cleansing: Basic data cleansing functions (a short T-SQL sketch follows this list), such as:
  • Left and right justification   
  • Trimming blanks and spaces
  • Case conversion
  • Range checks
  • Uniqueness
  • Patterns
  • Date/time validation and reformatting
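Most of these cleansing functions map onto ordinary T-SQL expressions; a hedged sketch against a hypothetical stg_Customer table:
  select
    ltrim(rtrim(CustomerName)) as CustomerName,                           -- trim blanks and spaces
    upper(CountryCode) as CountryCode,                                    -- case conversion
    case when Age between 0 and 120 then Age else null end as Age,        -- range check
    case when isdate(BirthDateText) = 1
         then convert(date, BirthDateText) else null end as BirthDate     -- date validation and reformatting
  from stg_Customer;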
Data enrichment/appends: In addition to standard cleansing, the ability to ‘enrich’ or append data acquired from internal/external sources, for example to fill in identified missing data elements.
Address standardization and cleansing: It’s essential that your data quality tool has address cleansing functionality to standardize addresses. Examples of these capabilities include:
  • Fixing spelling
  • Identification and standardization of address components
  • Validation of address against post-office addresses
  • Validating postal codes
  • Validation of street names
  • Consistent and valid street suffixes
  • Valid house numbers
Data merging: Data often needs to be obtained from multiple input sources and ‘merged.’ When a data attribute is duplicated across input sources, a user-specified business rule is needed to identify the ‘correct’ value. For example, the rule may specify: “use the pin code from Source-A when available; when it is null, use the pin code from Source-B”.
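A hedged sketch of that merge rule in T-SQL, assuming hypothetical SourceA and SourceB tables keyed on CustomerID:
  select a.CustomerID,
         coalesce(a.PinCode, b.PinCode) as PinCode  -- Source-A's pin code when available, otherwise Source-B's
  from SourceA a
  left join SourceB b on b.CustomerID = a.CustomerID;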
Customer de-duplication and data merging: Data about a particular customer may exist in multiple input data sources, and each data source may have a different customer identifier. It is necessary to link or merge data from multiple data sources by identifying that the records refer to the same customer (based on user-specified business rules), for example the same address, same gender and same last name.
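A minimal sketch of such a matching rule, assuming a hypothetical consolidated staging table stg_Customer:
  -- Candidate duplicates: records sharing the same address, gender and last name
  select Address, Gender, LastName, count(*) as Occurrences
  from stg_Customer
  group by Address, Gender, LastName
  having count(*) > 1;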
Incremental processing: Ability to quickly add or reload values as the source data changes, and ability to track source to target record linkage.
Step 2: Reconciliation 
As part of the reconciliation process, we have to check whether the measures we undertook deliver the right results. Testing and validation are the ways in which we can ascertain the efficacy of our corrective measures.
Testing: Functionality to test rules on sample sets of data.
Validation: Providing statistics about each data load. These include areas like:
  • Number of records processed
  • Generate hash total and checksum total
  • Number of times a rule was invoked
  • Post-cleanse ‘data visualization’
  • Audit cleansed and transformed data partition
  • Generate and validate statistics of data load
  • Create and populate error and exception tables
The business need for a data quality tool is the commercial value (or use) of the data. If data quality problems hamper the use of that data, then it creates more nuisance than business value. In this context, a one-time cleanup is not enough. The use of a data quality tool and the associated processes (which the tool forces your organization to adopt) leads to a reduction of data quality issues across the organization.

Friday, November 4, 2011

Types of reports in SSRS

  • Parameterized Reports: A parameterized report accepts input values that are used in the report’s dataset query. When the parameter is used inside an IN condition in the WHERE clause it can be presented as a (multi-value) drop-down list, whereas with an equality condition it takes a single value as input and the report is displayed according to the selected parameter (a sketch of a parameterized dataset query follows this list).
  • Linked Reports: A linked report is a report server item that provides an access point to an existing report. Conceptually, it is similar to a program shortcut that you use to run a program or open a file.
  • A linked report is derived from an existing report and retains the original's report definition. A linked report always inherits report layout and data source properties of the original report. All other properties and settings can be different from those of the original report, including security, parameters, location, subscriptions, and schedules.
  • Snapshot Reports: A report snapshot is a report that contains layout information and query results that were retrieved at a specific point in time. Unlike on-demand reports, which get up-to-date query results when you select the report, report snapshots are processed on a schedule and then saved to a report server. When you select a report snapshot for viewing, the report server retrieves the stored report from the report server database and shows the data and layout that were current for the report at the time the snapshot was created. Report snapshots serve three purposes:
    • Report history. By creating a series of report snapshots, you can build a history of a report that shows how data changes over time.
    • Consistency. Use report snapshots when you want to provide consistent results for multiple users who must work with identical sets of data. With volatile data, an on-demand report can produce different results from one minute to the next. A report snapshot, by contrast, allows you to make valid comparisons against other reports or analytical tools that contain data from the same point in time.
    • Performance. By scheduling large reports to run during off-peak hours, you can reduce processing impact on the report server during core business hours.
  • Cached Report: A cached report is a saved copy of a processed report. Cached reports are used to improve performance by reducing the number of processing requests to the report processor and by reducing the time required to retrieve large reports. Cached reports have an expiration period, typically specified in minutes or driven by a schedule.
  • Click through reports: A clickthrough report is a report that displays related data from a report model when you click the interactive data contained within your model-based report. These reports are generated by the report server based on the information contained within the report model. The person who created the model determines which fields are interactive and which fields are returned when a clickthrough report is opened. These field settings cannot be changed in the report authoring tools.
  • Drilldown reports: Drilldown reports initially hide complexity and enable the user to toggle conditionally hidden report items to control how much detail data they want to see. Drilldown reports must retrieve all possible data that can be shown in the report.
  • Drill through reports: Drillthrough reports are standard reports that are accessed through a hyperlink on a text box in the original report. Drillthrough reports work with a main report and are the target of a drillthrough action for a report item such as placeholder text or a chart. The main report displays summary information, for example in a matrix or chart. Actions defined in the matrix or chart provide drillthrough links to reports that display greater details based on the aggregate in the main report. Drillthrough reports can be filtered by parameters, but they do not have to be. Drillthrough reports differ from subreports in that the report does not display within the original report, but opens separately.
  • Sub reports: A subreport is a report that displays another report inside the body of a main report. Conceptually, a subreport is similar to a frame in a Web page. It is used to embed a report within a report. Any report can be used as a subreport. The subreport can use different data sources than the main report. 
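A hedged sketch of the dataset query behind a parameterized report, assuming a hypothetical FactSales table and a @Region report parameter:
  -- Single-value parameter used with an equality condition
  select OrderID, OrderDate, SalesAmount
  from FactSales
  where RegionKey = @Region;

  -- Multi-value parameter used with IN, presented as a drop-down list in the report
  select OrderID, OrderDate, SalesAmount
  from FactSales
  where RegionKey in (@Region);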

Monday, October 31, 2011

Interview Questions on DW design


    1. Question: How do you implement Slowly Changing Dimension type 2? I am not looking for the definition, but the practical implementation e.g. table structure, ETL/loading. {M}
    Answer: Create the dimension table as normal, i.e. first the dim key column as an integer, then the attributes as varchar (or varchar2 if you use Oracle). Then I’d create 3 additional columns: IsCurrent flag, “Valid From” and “Valid To” (they are datetime columns). With regards to the ETL, I’d check first if the row already exists by comparing the natural key. If it exists then “expire the row” and insert a new row. Set the “Valid From” date to today’s date or the current date time.
    An experienced candidate (particularly DW ETL developer) will not set the “Valid From” date to the current date time, but to the time when the ETL started. This is so that all the rows in the same load will have the same Valid From, which is 1 millisecond after the expiry time of the previous version thus avoiding issue with ETL workflows that run across midnight.
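    A minimal T-SQL sketch of the structure and load described above; the table, column and variable names are hypothetical:
    create table DimCustomer
    ( CustomerKey  int identity(1,1) primary key,   -- surrogate dim key
      CustomerCode varchar(20),                     -- natural key
      CustomerName varchar(100),
      IsCurrent    bit,
      ValidFrom    datetime,
      ValidTo      datetime );
    -- Expire the current version of a changed row (simplified: ValidTo = ETL start time)
    update DimCustomer
    set IsCurrent = 0, ValidTo = @EtlStartTime
    where CustomerCode = @CustomerCode and IsCurrent = 1;
    -- Insert the new version, valid from the ETL start time
    insert into DimCustomer (CustomerCode, CustomerName, IsCurrent, ValidFrom, ValidTo)
    values (@CustomerCode, @CustomerName, 1, @EtlStartTime, '9999-12-31');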
    Purpose: SCD 2 is one of the first things we learn in data warehousing; it is considered fundamental. The purpose of this question is to separate the quality candidates from the ones who are bluffing. If the candidate cannot answer this question you should worry.
    2. Question: How do you index a fact table? And explain why. {H}
    Answer: Index all the dim key columns, individually, non clustered (SQL Server) or bitmap (Oracle). The dim key columns are used to join to the dimension tables, so if they are indexed the join will be faster. An exceptional candidate will suggest 3 additional things: a) index the fact key separately, b) consider creating a covering index in the right order on the combination of dim keys, and c) if the fact table is partitioned the partitioning key must be included in all indexes.
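    A hedged sketch in SQL Server syntax, assuming a hypothetical FactSales table:
    -- Non-clustered index on each dim key, individually
    create nonclustered index IX_FactSales_DateKey on FactSales (DateKey);
    create nonclustered index IX_FactSales_CustomerKey on FactSales (CustomerKey);
    create nonclustered index IX_FactSales_ProductKey on FactSales (ProductKey);
    -- Optionally, index the fact key separately
    create unique nonclustered index IX_FactSales_FactKey on FactSales (FactSalesKey);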
    Purpose: Many people know data warehousing only in theory or only in logical data model. This question is designed to separate those who have actually built a data warehouse and those who haven’t.
    3. Question: In the source system, your customer record changes like this: customer1 and customer2 now becomes one company called customer99. Explain a) impact to the customer dim (SCD1), b) impact to the fact tables. {M}
    Answer: In the customer dim we update the customer1 row, changing it to customer99 (remember that it is SCD1). We do soft delete on the customer2 row by updating the IsActive flag column (hard delete is not recommended). On the fact table we find the Surrogate Key for customer1 and 2 and update it with customer99’s SK.
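    A minimal sketch of those updates, with hypothetical surrogate key variables for customer1 and customer2:
    -- SCD1: overwrite customer1's row so that it now represents customer99
    update DimCustomer
    set CustomerCode = 'CUST99', CustomerName = 'Customer 99'
    where CustomerKey = @Customer1Key;
    -- Soft delete customer2's row
    update DimCustomer
    set IsActive = 0
    where CustomerKey = @Customer2Key;
    -- Re-point customer2's fact rows to the surviving surrogate key (customer1's row, now customer99)
    update FactSales
    set CustomerKey = @Customer1Key
    where CustomerKey = @Customer2Key;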
    Purpose: This is a common problem that everybody in data warehousing encounters. By asking this question we will know if the candidate has enough experience in data warehousing. If they have not come across this (probably they are new in DW), we want to know if they have the capability to deal with it or not.
    4. Question: What are the differences between Kimball approach and Inmon’s? Which one is better and why? {L}
    Answer: if you are looking for a junior role e.g. a developer, then the expected answer is: in Kimball we do dimension modelling, i.e. fact and dim tables whereas in Inmon’s we do CIF, i.e. EDW in normalised form and we then create a DM/DDS from the EDW. Junior candidates usually prefer Kimball, because of query performance and flexibility, or because that’s the only one they know; which is fine. But if you are interviewing for a senior role e.g. senior data architect then they need to say that the approach depends on the situation. Both Kimball & Inmon’s approaches have advantages and disadvantages. I explained some of the main reasons of having a normalised DW here.
    Purpose: a) to see if the candidate understands the core principles of data warehousing or whether they just “know the skin”, b) to find out if the candidate is open minded, i.e. the solution depends on what we are trying to achieve (there’s no right or wrong answer) or if they are blindly using Kimball for every situation.
    5. Question: Suppose a fact row has unknown dim keys, do you load that row or not? Can you explain the advantage/disadvantages? {M}
    Answer: We need to load that row so that the total of the measure/fact is correct. To enable us to load the row, we need to either set the unknown dim key to 0 or the dim key of the newly created dim rows. We can also not load that row (so the total of the measure will be different from the source system) if the business requirement prefer it. In this case we load the fact row to a quarantine area complete with error processing, DQ indicator and audit log. On the next day, after we receive the dim row, we load the fact row. This is commonly known as Late Arriving Dimension Rows and there are many sources for further information; one of the best is Bob Becker’s article here in 2006. Others refer to this as Early Arriving Fact Row, which Ralph Kimball explained here in 2004.
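    A hedged sketch of loading such a fact row with the unknown dim key set to 0, assuming hypothetical staging and dimension tables:
    insert into FactSales (CustomerKey, DateKey, SalesAmount)
    select coalesce(d.CustomerKey, 0),   -- 0 = the 'unknown' row in DimCustomer
           s.DateKey,
           s.SalesAmount
    from stg_Sales s
    left join DimCustomer d on d.CustomerCode = s.CustomerCode;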
    Purpose: again this is a common problem that we encounter in regular basis in data warehousing. With this question we want to see if the candidate’s experience level is up to the expectation or not.
    6. Question: Please tell me your experience on your last 3 data warehouse projects. What were your roles in those projects? What were the issues and how did you solve them? {L}
    Answer: There’s no wrong or right answer here. With this question you are looking for a) whether they have done similar things to your current project, b) whether they have done the same role as the role you are offering, c) whether they have faced the same issues as your current DW project.
    Purpose: Some of the reasons why we pay more to certain candidates compared to others are: a) they have done it before so they can deliver quicker than those who haven’t, b) they come from our competitors so we would know what’s happening there and we can make a better system than theirs, c) they have solved similar issues so we could “borrow their techniques”.
    7. Question: What are the advantages of having a normalised DW compared to dimensional DW? What are the advantages of dimensional DW compared to normalised DW? {M}
    Answer: For advantages of having a normalised DW see here and here. The advantages of dimensional DW are: a) flexibility, e.g. we can accommodate changes in the requirements with minimal changes on the data model, b) performance, e.g. you can query it faster than normalised model, c) it’s quicker and simpler to develop than normalised DW and easier to maintain.
    Purpose: to see if the candidate has seen “the other side of the coin”. Many people in data warehousing only knows Kimball/dimensional. Second purpose of this question is to check if the candidate understands the benefit of dimensional modelling, which is a fundamental understanding in data warehousing.
    8. Question: What is 3rd normal form? {L} Give me an example of a situation where the tables are not in 3rd NF, then make it 3rd NF. {M}
    Answer: No non-key column is transitively dependent on the PK. For example, column2 depends on column1 (the PK) and column3 depends on column2; in this case column3 is “transitively dependent” on column1. To make it 3rd NF we need to split it into 2 tables: table1 which has column1 & column2, and table2 which has column2 and column3.
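    A small sketch of that split, using a hypothetical order-line table where ProductName depends on ProductID rather than on the order line's key:
    -- Not in 3NF: ProductName is transitively dependent on OrderLineID via ProductID
    create table OrderLine
    ( OrderLineID int primary key,
      ProductID   int,
      ProductName varchar(100),
      Quantity    int );
    -- In 3NF: the transitive dependency is moved to its own table
    create table OrderLine3NF
    ( OrderLineID int primary key,
      ProductID   int,
      Quantity    int );
    create table Product
    ( ProductID   int primary key,
      ProductName varchar(100) );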
    Purpose: A lot of people talk about “3rd normal form” but they don’t know what it means. This is to test if the candidate is one of those people. If they can’t answer 3rd NF, ask 2nd NF. If they can’t answer 2nd NF, ask 1st NF.
    9. Question: Tell me how to design a data warehouse, i.e. what are the steps of doing dimensional modelling? {M}
    Answer: There are many ways, but it should not be too far from this order: 1. Understand the business process, 2. Declare the grain of the fact table, 3. Create the dimension tables including attributes, 4. Add the measures to the fact tables (from Kimball’s Toolkit book chapter 2). Step 3 and 4 could be reversed (add the fact first, then create the dims), but step 1 & 2 must be done in that order. Understanding the business process must always be the first, and declaring the grain must always be the second.
    Purpose: This question is for data architect or data warehouse architect to see if they can do their job. It’s not a question for an ETL, report or cube developer.
    10. Question: How do you join 2 fact tables? {H}
    Answer: It’s a trap question. You don’t usually join 2 fact tables especially if they have different grain. When designing a dimensional model, you include all the necessary measures into the same fact table. If the measure you need is located on another fact table, then there’s something wrong with the design. You need to add that measure to the fact table you are working with. But what if the measure has a different grain? Then you add the lower grain measure to the higher grain fact table. What if the fact table you are working with has a lower grain? Then you need to get the business logic for allocating the measure.
    It is possible to join 2 fact tables, i.e. using the common dim keys. But the performance is usually horrible, hence people don’t do this in practice, except for small fact tables (<100k rows). For example: if FactTable1 has dim1key, dim2key, dimkey3 and FactTable2 has dim1key and dim2key then you could join them like this:
    select f2.dim1key, f2.dim2key, f1.measure1, f2.measure2
    from
    ( select dim1key, dim2key, sum(measure1) as measure1
      from FactTable1
      group by dim1key, dim2key
    ) f1
    join FactTable2 f2
      on f1.dim1key = f2.dim1key and f1.dim2key = f2.dim2key
    So if we don’t join 2 fact tables that way, how do we do it? The answer is using the fact key column. It is a good practice (especially in SQL Server because of the concept of cluster index) to have a fact key column to enable us to identify rows on the fact table. The performance would be much better (than joining on dim keys), but you need to plan this in advance as you need to include the fact key column on the other fact table.
    select f2.dim1key, f2.dim2key, f1.measure1, f2.measure2
    from FactTable1 f1
    join FactTable2 f2
      on f2.fact1key = f1.factkey
    I implemented this technique originally for self-joining, but then expanded the usage to join to other fact tables. But this must be used on an exception basis rather than as the norm.
    Purpose: not to trap the candidate, of course, but to see if they have experience dealing with a problem which doesn’t happen every day.
    11. Question: How do you index a dimension table? {L}
    Answer: clustered index on the dim key, and non clustered index (individual) on attribute columns which are used on the query’s “where clause”.
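    A brief sketch in SQL Server syntax, with hypothetical names:
    -- Clustered index on the surrogate dim key
    create clustered index IX_DimCustomer_CustomerKey on DimCustomer (CustomerKey);
    -- Non-clustered indexes on attributes commonly used in WHERE clauses
    create nonclustered index IX_DimCustomer_Country on DimCustomer (Country);
    create nonclustered index IX_DimCustomer_Segment on DimCustomer (CustomerSegment);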
    Purpose: this question is critical to ask if you are looking for a Data Warehouse Architect (DWA) or a Data Architect (DA). Many DWAs and DAs only know the logical data model; many of them don’t know how to index. They don’t know how different the physical tables are in Oracle compared to Teradata. This question is not essential if you are looking for a report or ETL developer. It’s good for them to know, but it’s not essential.
    12. Question: Tell me what you know about William Inmon? {L} Alternatively: Ralph Kimball.
    Answer: He was the one who introduced the concept of data warehousing. Arguably Barry Devlin was the first one, but he’s not as popular as Inmon. If you ask who is Barry Devlin or who is Claudia Imhoff 99.9% of the candidates wouldn’t know. But every decent practitioner in data warehousing should know about Inmon and Kimball.
    Purpose: to test if the candidate is a decent practitioner in data warehousing or not. You’ll be surprised (especially if you are interviewing a report developer) how many candidates don’t know the answer. If someone is applying for a BI architect role and has never heard of Inmon, you should worry.
    13. Question: How do we build a real time data warehouse? {H}
    Answer: if the candidate asks “Do you mean real time or near real time” it may indicate that they have a good amount of experience dealing with this in the past. There are two ways we build a real time data warehouse (and this is applicable for both Normalised DW and Dimensional DW):
    a) By storing previous periods’ data in the warehouse then putting a view on top of it pointing to the source system’s current period data. “Current period” is usually 1 day in DW, but in some industries e.g. online trading and ecommerce, it is 1 hour.
    b) By storing previous periods’ data in the warehouse then use some kind of synchronous mechanism to propagate current period’s data. An example of synchronous data propagation mechanism is SQL Server 2008’s Change Tracking or the old school’s trigger.
    Near real time DW is built using asynchronous data propagation mechanism, aka mini batch (2-5 mins frequency) or micro batch (30s – 1.5 mins frequency).
    Purpose: to test if the candidate understands complex, non-traditional mechanisms and follows the latest trends. Real time DW was considered impossible 5 years ago and has only been developed in the last 5 years. If the DW is normalised it’s easier to make it real time than if the DW is dimensional, as there’s a dim key lookup involved.
    14. Question: What is the difference between a data mart and a data warehouse? {L}
    Answer: Most candidates will answer that one is big and the other is small. Some good candidates (particularly Kimball practitioners) will say that a data mart is one star, whereas a DW is a collection of all the stars. An excellent candidate will give all the above answers, plus they will say that a DW could be the normalised model that stores the EDW, whereas a DM is the dimensional model containing 1-4 stars for a specific department (both relational DB and multidimensional DB).
    Purpose: The question has 3 different levels of answer, so we can see how deep the candidate’s knowledge in data warehousing.
    15. Question: What is the purpose of having a multidimensional database? {L}
    Answer: Many candidates don’t know what a multidimensional database (MDB) is. They have heard about OLAP, but not MDB. So if the candidate looks puzzled, help them by saying “an MDB is an OLAP database”. Many will say “Oh… I see” but actually they are still puzzled so it will take a good few moments before they are back to earth again. So ask again: “What is the purpose of having an OLAP database?” The answer is performance and easier data exploration. An MDB (aka cube) is a hundred times faster than relational DB for returning an aggregate. An MDB will be very easy to navigate, drilling up and down the hierarchies and across attributes, exploring the data.
    Purpose: This question is irrelevant to report or ETL developer, but a must for a cube developer and DWA/DA. Every decent cube developer (SSAS, Hyperion, Cognos) should be able to answer the question as it’s their bread and butter.
    16. Question: Why do you need a staging area? {M}
    Answer: Because:
    a) Some data transformations/manipulations from source system to DWH can’t be done on the fly, but requires several stages and therefore needs to “be landed on disk first”
    b) The time to extract data from the source system is limited (e.g. we were only given 1 hour window) so we just “get everything we need out first and process later”
    c) For traceability and consistency, i.e. some data transform are simple and some are complex but for consistency we put all of them on stage first, then pick them up from stage for further processing
    d) Some data is required by more than one part of the warehouse (e.g. ODS and DDS) and we want to minimise the impact on the source system’s workload. So rather than reading twice from the source system, we “land” the data on the staging area and then both the ODS and the DDS read the data from staging.
    Purpose: This question is intended more for an ETL developer than a report/cube developer. Obviously a data architect needs to know this too.
    17. Question: How do you decide that you need to keep it as 1 dimension or split it into 2 dimensions? Take for example dim product: there are attributes which are at product code level and there are attributes which are at product group level. Should we keep them all in 1 dimension (product) or split them into 2 dimensions (product and product group)? {H}
    Answer: Depends on how they are going to be used, as I explained in my article “One or two dimensions”.
    Purpose: To test if the candidate is conversant in dimensional modelling. This question especially is relevant for data architects and cube developers and less relevant for a report or ETL developer.
    18. Question: Fact table columns are usually numeric. In what case does a fact table have a varchar column? {M}
    Answer: degenerate dimension
    Purpose: to check if the candidate has ever involved in detailed design of warehouse tables. Follow up with question 19.
    19. Question: What kind of dimension is a “degenerate dimension”?  Give me an example. {L}
    Answer: A “dimension” which stays in the fact table. It is usually the reference number of the transaction. For example: Transaction ID, payment ref and order ID
    Purpose: Just another question to test the fundamentals.
    20. Question: What is snowflaking? What are the advantages and disadvantages? {M}
    Answer: In dimensional modelling, snowflaking is breaking a dimension into several tables by normalising it. The advantages are: a) performance when processing dimensions in SSAS, b) flexibility if the sub-dimension is used in several places, e.g. city is used in dim customer and dim supplier (or in an insurance DW: dim policy holder and dim broker), c) one place to update, and d) the DW load is quicker as there is less duplication of data. The disadvantages are: a) more difficulty in “navigating the star*”, i.e. you need to join a few more tables, b) worse “sum group by*” query performance (compared to a “pure star*”), c) less flexibility in accommodating requirements, i.e. the city attributes for dim supplier have to be the same as the city attributes for dim customer, and d) a more involved DW load, as the city data has to be integrated rather than simply duplicated.
    *: a “star” is a fact table with all its dimensions, “navigating” means browsing/querying, “sum group by” is a SQL select statement with a “group by” clause, pure star is a fact table with all its dimensions and none of the dims are snow-flaked.
    Purpose: Snowflaking is one of the classic debates in the dimensional modelling community. It is useful to check if the candidate understands the reasons rather than just “following blindly”. This question is applicable particularly for data architects and OLAP designers. If their answers are way off then you should worry. But it is also relevant to ETL and report developers as they will be populating and querying the structure.

Thursday, October 27, 2011

Cube Objects with design in SSAS

  • Cube: edit measures or add a new measure group.
  • Dimensions: add a new dimension or edit another dimension object.
  • Calculations: A calculation is a Multidimensional Expressions (MDX) expression or script that is used to define a calculated member, a named set, or a scoped assignment in a cube in Microsoft SQL Server Analysis Services. Calculations let you add objects that are defined not by the data of the cube, but by expressions that can reference other parts of the cube, other cubes, or even information outside the Analysis Services database. Calculations let you extend the capabilities of a cube, adding flexibility and power to business intelligence applications.
    • Named Set: A named set is a CREATE SET MDX statement expression that returns a set.
    • Calculated Member: A calculated member is a member whose value is calculated at run time using a Multidimensional Expressions (MDX) expression that you specify when you define the calculated member. A calculated member is available to business intelligence applications just like any other member. Calculated members do not increase the size of the cube because only the definitions are stored in the cube; values are calculated in memory as required to answer a query.
    • Script Command: A script command is an MDX script, included as part of the definition of the cube. Script commands let you perform almost any action that is supported by MDX on a cube, such as scoping a calculation to apply to only part of the cube. In SQL Server Analysis Services, MDX scripts can apply either to the whole cube or to specific sections of the cube, at specific points throughout the execution of the script. The default script command, which is the CALCULATE statement, populates cells in the cube with aggregated data based on the default scope.
  • KPIs: In business terminology, a Key Performance Indicator (KPI) is a quantifiable measurement for gauging business success. In Analysis Services, a KPI is a collection of calculations that are associated with a measure group in a cube that are used to evaluate business success. 
  • Actions: Actions can be of different types and they have to be created accordingly. Actions can be:
    • Drillthrough actions, which return the set of rows that represents the underlying data of the selected cells of the cube where the action occurs.
    • Reporting actions, which return a report from Reporting Services that is associated with the selected section of the cube where the action occurs.
    • Standard actions, which return the action element (URL, HTML, DataSet, RowSet, and other elements) that is associated with the selected section of the cube where the action occurs.
  • Partitions: A partition is a container for a portion of the measure group data. Partitions are not seen from MDX queries; all queries reflect the whole content of the measure group, regardless of how many partitions are defined for the measure group. The data content of a partition is defined by the query bindings of the partition, and by the slicing expression.
    • Remote Partition: The data of a remote partition is stored on a different instance of Microsoft SQL Server Analysis Services than the instance that contains the definitions (metadata) of the partition and its parent cube. A remote partition is administered on the same instance of Analysis Services where the partition and its parent cube are defined.
    • Storage Mode: The storage mode of a partition affects the query and processing performance, storage requirements, and storage locations of the partition and its parent measure group and cube. The choice of storage mode also affects processing choices. A partition can use one of three basic storage modes:
      • Multidimensional OLAP (MOLAP)
      • Relational OLAP (ROLAP)
      • Hybrid OLAP (HOLAP)
    • Write Enabled Partitions: The data in a cube is generally read-only. However, for certain scenarios, you may want to write-enable a partition. Write-enabled partitions are used to enable business users to explore scenarios by changing cell values and analyzing the effects of the changes on cube data. When you write-enable a partition, client applications can record changes to the data in the partition. These changes, known as writeback data, are stored in a separate table and do not overwrite any existing data in a measure group. However, they are incorporated into query results as if they are part of the cube data.
  • Aggregation: An Aggregation Design object defines a set of aggregation definitions that can be shared across multiple partitions. An Aggregation object represents the summarization of measure group data at a certain granularity of the dimensions. A simple Aggregation object is composed of basic information and dimensions. Basic information includes the name of the aggregation, the ID, annotations, and a description. The dimensions are a collection of AggregationDimension objects that contain the list of granularity attributes for the dimension.
  • Perspective: A perspective is a definition that allows users to see a cube in a simpler way. A perspective is a subset of the features of a cube. A perspective enables administrators to create views of a cube, helping users to focus on the most relevant data for them. A perspective contains subsets of all objects from a cube. A perspective cannot include elements that are not defined in the parent cube.
  • Translation: A translation is a simple mechanism to change the displayed labels and captions from one language to another. Each translation is defined as a pair of values: a string with the translated text, and a number with the language ID. Translations are available for all objects in Analysis Services. Dimensions can also have the attribute values translated. 
  • Browser: browse the processed cube’s data to verify the design.

Error Handling in SSIS


  1. Create control flow by using the control flow designer.
  2. Configure transaction handling for packages, containers, and tasks.
    1. Enable Distributed Transaction Coordinator (DTC)
    2. Lock windows from view menu
    3. Set the TransactionOption property
  3. Set checkpoints to define restart points.
    1. Set the SaveCheckpoints property to TRUE
    2. Set the CheckpointFileName property to a valid path for the checkpoint file
    3. Set the CheckpointUsage property to IfExists
  4. Create event handlers.
  5. Implement error handling. (error redirection)
    1. Handle errors by configuring data flow paths.
    2. Handle errors by configuring control flow paths.
  6. Debug packages.
    1. Debug progress reporting.
    2. View intermediate results by using debug windows.
    3. Examine the package state by setting breakpoints.

Wednesday, October 26, 2011

New in SQL Server 2008


T-SQL:
  • Transparent Data Encryption
  • Enhanced Database Mirroring
  • Automatic Recovery of data pages
  • Data Compression
  • ADO.Net Data Services
  • Date/Time
  • Spatial Data Type
  • Partitioned Table Parallelism
  • Grouping Sets
  • CDC and CTEs
SQL Server Integration Services (SSIS) 2008
With version 2008 a more scalable and efficient architecture for extraction, transformation, and loading (ETL) is offered to the customers. Some of the new features include the following.
  • SSIS pipeline improvements: In SQL Server 2005, the SSIS pipeline execution engine did not scale well on multi-processor machines with more than two processors or CPUs. The Integration Services design in SQL Server 2008 improves the scalability of the runtime across multiple processors.
  • SSIS data profiling: The new data profiling feature in SSIS 2008 identifies patterns in data. This feature generates statistics that can be viewed by the stand-alone Data Profile Viewer.
  • Lookups: Performance is improved, access to new data sources including XML is added, and a new TXLookup transformation component is introduced.
  • The new MERGE SQL statement allows you to insert, update, or delete data based on certain join conditions in the same statement, where in previous versions of SQL Server you had to create separate statements (a sketch follows this list).
  • CDC (Change Data Capture) is used to capture insert, update and delete activity on a SQL table using the transaction log, placing this data in a separate relational table (an example of enabling CDC also follows this list).
  • Data Warehousing: It’s now easier to manage large tables more effectively by transparently breaking them into manageable blocks of data. Also, with the Resource Governor you can now assign limits on CPU and memory use for each workload. And finally, with native data compression techniques disk I/O can be dramatically reduced, increasing overall performance.
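A hedged sketch of the MERGE syntax, using hypothetical target and source tables:
  merge dbo.DimCustomer as target
  using stg.Customer as source
    on target.CustomerCode = source.CustomerCode
  when matched then
    update set target.CustomerName = source.CustomerName
  when not matched by target then
    insert (CustomerCode, CustomerName)
    values (source.CustomerCode, source.CustomerName)
  when not matched by source then
    delete;

Enabling CDC uses two system stored procedures; a minimal sketch (the database, schema and table names are hypothetical):
  use SalesDB;
  exec sys.sp_cdc_enable_db;

  exec sys.sp_cdc_enable_table
       @source_schema = N'dbo',
       @source_name   = N'Orders',
       @role_name     = null;   -- no gating role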
SQL Server Reporting Services (SSRS) 2008
In SSRS 2008, a report server is implemented as a Windows-based service that hosts the Report Manager, the Report Server Web service, and the background processing feature areas. This decreases the dependence on IIS, and consolidating the server applications into a single service reduces the configuration and maintenance tasks.
  • The report server has new management features to set a memory threshold for background operations and performance counters for monitoring service activity.
  • SSRS 2008 supports two modes of deployment for a report server: native mode and SharePoint integrated mode. In native mode, a report server is a stand-alone application server that provides viewing, management, and delivery of reports and report models.
  • In the SharePoint integrated mode, a report server becomes part of a SharePoint Web application. Users of Windows SharePoint Services 3.0 can store reports in SharePoint libraries and access them from SharePoint sites that are used for accessing other business documents.
  • Office Word 2007 rendering extension can be used to export a report to a Word document and Office Excel renderer has been enhanced to support features like nested data regions and sub-reports.
Report Designer improvements:
  • Entity hierarchies:  Provides a flattened analytic-style metadata browser that presents all entities as a flattened list.
  • Live data in design view: Allows display of live data by using simple iteration of design-time elements.
  • Instances in metadata browser: Extends the metadata browser to include instance data.
  • Filtering on the design surface: Adds UI elements for defining basic filter conditions directly on the design surface.
  • Multiple data sources: Builds reports against more than one data source and has more than one data region for each report.
  • Office12 ribbon: Provides UI parity with Microsoft Office 2007 products.
  • Conditional formatting: Provides conditional formatting, which is a recommended customer feature.
SQL Server Analysis Services (SSAS) 2008
In SSAS 2008, Multidimensional Expressions (MDX) query performance has improved because Analysis Services can deal better with cube space. SSAS 2008 divides the space to separate calculated members, regular members, and empty space to better evaluate cells that need to be included in calculations.
  • The new MOLAP-enabled write back capabilities in SSAS 2008 remove the need to query relational online analytical processing (ROLAP) partitions. This provides you with enhanced write back scenarios from within analytical applications without affecting the traditional online analytical processing (OLAP) performance.
  • A single read-only copy of an SSAS 2008 database can be shared between many SQL Servers through a virtual IP address. This creates a highly scalable deployment option for an Analysis Services solution.
  • With SQL Server 2008 backup compression, less storage is required to keep backups online. The backups also run significantly faster because less disk I/O is required.
  • In SSAS 2008, data mining models have been enhanced by appending a new algorithm to the Microsoft Time Series algorithm. This improves the accuracy and stability of predictions in the data mining models. The new algorithm is based on the Auto Regressive Integrated Moving Average (ARIMA) algorithm, and provides better long-term predictions than the Auto Regression Trees with Cross Predict (ARTxp) algorithm that has been used in SQL Server 2005 Analysis Services (SSAS 2005).
  • You can use the Analysis Services personalization extensions to create new Analysis Services objects and functionality, and then provide these objects and functionality dynamically in the context of the user session.
  • SQL Server 2008 generates Analysis Management Objects (AMO) warnings spanning all objects to alert you when you violate any of the design best practices or make logical errors in database design.
  • The improvements in designing dimensions include features such as the Attribute Relationship designer, a simplified and enhanced Dimension wizard, and the Key Columns dialog box.


Commonly used DMFs and DMVs in SQL Server


Dynamic management views and functions return server state information that can be used to monitor the health of a server instance, diagnose problems, and tune performance.

CDC related DMVs
sys.dm_cdc_errors:
     Contains error information for the previous 32 sessions.

Index related DMVs and DMFs
sys.dm_db_index_usage_stats:
    Returns counts of different types of index operations and the time each type of operation was last performed.
sys.dm_db_index_physical_stats:
    Returns size and fragmentation information for the data and indexes of the specified table or view. For an index, one row is returned for each level of the B-tree in each partition. 
sys.dm_db_missing_index_columns:
       Returns information about database table columns that are missing an index, excluding spatial indexes. sys.dm_db_missing_index_columns is a dynamic management function.
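A hedged example of querying these index DMVs/DMFs (the table name is a placeholder):
  -- How often each index in the current database is used for seeks, scans, lookups and updates
  select object_name(s.object_id) as TableName, i.name as IndexName,
         s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
  from sys.dm_db_index_usage_stats s
  join sys.indexes i on i.object_id = s.object_id and i.index_id = s.index_id
  where s.database_id = db_id();

  -- Fragmentation of a specific table's indexes
  select index_id, avg_fragmentation_in_percent, page_count
  from sys.dm_db_index_physical_stats(db_id(), object_id('dbo.FactSales'), null, null, 'LIMITED');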

Transaction related DMVs and DMFs
sys.dm_tran_active_transactions:
    Returns information about transactions for the instance of SQL Server.
sys.dm_tran_locks:
     Returns information about currently active lock manager resources in SQL Server 2008 R2. Each row represents a currently active request to the lock manager for a lock that has been granted or is waiting to be granted.
sys.dm_tran_session_transactions:
     Returns correlation information for associated transactions and sessions.
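A small sketch joining the session and active-transaction DMVs to see which session owns which open transaction:
  select st.session_id,
         tx.transaction_id,
         tx.name as TransactionName,
         tx.transaction_begin_time
  from sys.dm_tran_session_transactions st
  join sys.dm_tran_active_transactions tx
    on tx.transaction_id = st.transaction_id;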