Monday, October 31, 2011

Interview Questions on DW design


    1. Question: How do you implement Slowly Changing Dimension type 2? I am not looking for the definition, but the practical implementation e.g. table structure, ETL/loading. {M}
    Answer: Create the dimension table as normal, i.e. first the dim key column as an integer, then the attributes as varchar (or varchar2 if you use Oracle). Then I’d create 3 additional columns: an IsCurrent flag, and “Valid From” and “Valid To” datetime columns. With regards to the ETL, I’d first check whether the row already exists by comparing the natural key. If it exists, “expire” the existing row and insert a new row, setting the “Valid From” date to today’s date or the current date time.
    An experienced candidate (particularly a DW ETL developer) will not set the “Valid From” date to the current date time, but to the time when the ETL started. This is so that all the rows in the same load will have the same Valid From, which is 1 millisecond after the expiry time of the previous version, thus avoiding issues with ETL workflows that run across midnight.
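    As an illustration, here is a minimal T-SQL sketch of the expire-and-insert step, assuming a hypothetical DimCustomer dimension with natural key CustomerCode and a staging table stg.Customer (table and column names are illustrative, not a definitive implementation):
    declare @LoadStart datetime = '2011-10-31 02:00:00';   -- the time the ETL batch started
    -- step 1: expire the current version of customers whose attributes have changed
    update d
    set    d.IsCurrent = 0,
           d.ValidTo   = dateadd(ms, -3, @LoadStart)       -- just before the new version starts (datetime precision is ~3 ms)
    from   dbo.DimCustomer d
    join   stg.Customer s on s.CustomerCode = d.CustomerCode
    where  d.IsCurrent = 1
      and  (d.CustomerName <> s.CustomerName or d.City <> s.City);
    -- step 2: insert a new current version for any source row that no longer has a current row in the dimension
    insert into dbo.DimCustomer (CustomerCode, CustomerName, City, IsCurrent, ValidFrom, ValidTo)
    select s.CustomerCode, s.CustomerName, s.City, 1, @LoadStart, '9999-12-31'
    from   stg.Customer s
    where  not exists ( select 1 from dbo.DimCustomer d
                        where  d.CustomerCode = s.CustomerCode and d.IsCurrent = 1 );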
    Purpose: SCD 2 is one of the first things that we learn in data warehousing. It is considered basic/fundamental knowledge. The purpose of this question is to separate the quality candidates from the ones who are bluffing. If the candidate cannot answer this question, you should worry.
    2. Question: How do you index a fact table? And explain why. {H}
    Answer: Index all the dim key columns, individually, non clustered (SQL Server) or bitmap (Oracle). The dim key columns are used to join to the dimension tables, so if they are indexed the join will be faster. An exceptional candidate will suggest 3 additional things: a) index the fact key separately, b) consider creating a covering index in the right order on the combination of dim keys, and c) if the fact table is partitioned the partitioning key must be included in all indexes.
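    For example, on SQL Server (assuming a hypothetical FactSales with three dim keys and two measures), the answer above roughly translates to the sketch below:
    create nonclustered index IX_FactSales_DateKey     on dbo.FactSales (DateKey);
    create nonclustered index IX_FactSales_CustomerKey on dbo.FactSales (CustomerKey);
    create nonclustered index IX_FactSales_ProductKey  on dbo.FactSales (ProductKey);
    -- optional covering index: dim keys in the order most queries filter on,
    -- with the measures as included columns
    create nonclustered index IX_FactSales_Covering
    on dbo.FactSales (DateKey, CustomerKey, ProductKey)
    include (SalesAmount, SalesQty);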
    Purpose: Many people know data warehousing only in theory or only in logical data model. This question is designed to separate those who have actually built a data warehouse and those who haven’t.
    3. Question: In the source system, your customer record changes like this: customer1 and customer2 now become one company called customer99. Explain a) the impact to the customer dim (SCD1), b) the impact to the fact tables. {M}
    Answer: In the customer dim we update the customer1 row, changing it to customer99 (remember that it is SCD1). We do soft delete on the customer2 row by updating the IsActive flag column (hard delete is not recommended). On the fact table we find the Surrogate Key for customer1 and 2 and update it with customer99’s SK.
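    A sketch of those updates, assuming hypothetical natural keys CUST01, CUST02 and CUST99 and a FactSales fact table:
    -- SCD 1: overwrite customer1's row so it becomes customer99 (same surrogate key)
    update dbo.DimCustomer set CustomerCode = 'CUST99', CustomerName = 'Customer 99'
    where  CustomerCode = 'CUST01';
    -- soft delete customer2's row
    update dbo.DimCustomer set IsActive = 0 where CustomerCode = 'CUST02';
    -- repoint customer2's fact rows to customer99's surrogate key
    -- (customer1's fact rows already carry the right key because its row was overwritten in place)
    update f
    set    f.CustomerKey = k.CustomerKey
    from   dbo.FactSales f
    join   dbo.DimCustomer k on k.CustomerCode = 'CUST99'
    where  f.CustomerKey in (select CustomerKey from dbo.DimCustomer where CustomerCode = 'CUST02');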
    Purpose: This is a common problem that everybody in data warehousing encounters. By asking this question we will know if the candidate has enough experience in data warehousing. If they have not come across this (they are probably new to DW), we want to know if they have the capability to deal with it or not.
    4. Question: What are the differences between Kimball approach and Inmon’s? Which one is better and why? {L}
    Answer: If you are looking for a junior role e.g. a developer, then the expected answer is: in Kimball we do dimensional modelling, i.e. fact and dim tables, whereas in Inmon’s we do CIF, i.e. an EDW in normalised form and we then create a DM/DDS from the EDW. Junior candidates usually prefer Kimball, because of query performance and flexibility, or because that’s the only one they know; which is fine. But if you are interviewing for a senior role e.g. a senior data architect, then they need to say that the approach depends on the situation. Both Kimball’s and Inmon’s approaches have advantages and disadvantages. I explained some of the main reasons of having a normalised DW here.
    Purpose: a) to see if the candidate understands the core principles of data warehousing or whether they just “know the skin”, b) to find out if the candidate is open minded, i.e. the solution depends on what we are trying to achieve (there is no right or wrong answer), or if they are blindly using Kimball for every situation.
    5. Question: Suppose a fact row has unknown dim keys, do you load that row or not? Can you explain the advantage/disadvantages? {M}
    Answer: We need to load that row so that the total of the measure/fact is correct. To enable us to load the row, we need to either set the unknown dim key to 0 (the unknown member) or create the dim row on the fly and use its dim key. We can also choose not to load that row (so the total of the measure will be different from the source system) if the business requirements prefer it. In this case we load the fact row into a quarantine area complete with error processing, a DQ indicator and an audit log. On the next day, after we receive the dim row, we load the fact row. This is commonly known as Late Arriving Dimension Rows and there are many sources for further information; one of the best is Bob Becker’s article here in 2006. Others refer to this as Early Arriving Fact Rows, which Ralph Kimball explained here in 2004.
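    A sketch of the first option (mapping unknown keys to the 0 “unknown” member during the fact load), with hypothetical staging and dimension tables:
    insert into dbo.FactSales (DateKey, CustomerKey, ProductKey, SalesAmount)
    select s.DateKey,
           isnull(c.CustomerKey, 0) as CustomerKey,   -- 0 = unknown member
           isnull(p.ProductKey, 0)  as ProductKey,
           s.SalesAmount
    from   stg.Sales s
    left join dbo.DimCustomer c on c.CustomerCode = s.CustomerCode and c.IsCurrent = 1
    left join dbo.DimProduct  p on p.ProductCode  = s.ProductCode;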
    Purpose: again, this is a common problem that we encounter on a regular basis in data warehousing. With this question we want to see if the candidate’s experience level is up to expectation or not.
    6. Question: Please tell me your experience on your last 3 data warehouse projects. What were your roles in those projects? What were the issues and how did you solve them? {L}
    Answer: There’s no wrong or right answer here. With this question you are looking for a) whether they have done similar things to your current project, b) whether they have done the same role as the role you are offering, c) whether they faced the same issues as your current DW project.
    Purpose: Some of the reasons why we pay more to certain candidates compared to others are: a) they have done it before so they can deliver quicker than those who haven’t, b) they come from our competitors so we would know what’s happening there and we can make a better system than theirs, c) they have solved similar issues so we could “borrow their techniques”.
    7. Question: What are the advantages of having a normalised DW compared to dimensional DW? What are the advantages of dimensional DW compared to normalised DW? {M}
    Answer: For the advantages of having a normalised DW see here and here. The advantages of a dimensional DW are: a) flexibility, e.g. we can accommodate changes in the requirements with minimal changes to the data model, b) performance, e.g. you can query it faster than a normalised model, c) it’s quicker and simpler to develop than a normalised DW and easier to maintain.
    Purpose: to see if the candidate has seen “the other side of the coin”. Many people in data warehousing only know Kimball/dimensional. The second purpose of this question is to check if the candidate understands the benefit of dimensional modelling, which is a fundamental understanding in data warehousing.
    8. Question: What is 3rd normal form? {L} Give me an example of a situation where the tables are not in 3rd NF, then make it 3rd NF. {M}
    Answer: No column is transitively dependent on the PK. For example, column1 is dependent on column2 and column2 is dependent on column3, the PK. In this case column1 is “transitively dependent” on column3. To make it 3rd NF we need to split it into 2 tables: table1 which has column1 & column2 and table2 which has column2 and column3.
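    A small hypothetical example in T-SQL: an order table where CustomerCity depends on CustomerID, which depends on OrderID (the PK).
    -- before: not in 3rd NF, CustomerCity is transitively dependent on OrderID
    create table OrderNot3NF
    ( OrderID      int primary key,
      CustomerID   int,
      CustomerCity varchar(50) );
    -- after: split into 2 tables, each column now depends only on its own PK
    create table OrderHeader
    ( OrderID    int primary key,
      CustomerID int );
    create table Customer
    ( CustomerID   int primary key,
      CustomerCity varchar(50) );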
    Purpose: A lot of people talk about “3rd normal form” but they don’t know what it means. This is to test if the candidate is one of those people. If they can’t answer 3rd NF, ask 2nd NF. If they can’t answer 2nd NF, ask 1st NF.
    9. Question: Tell me how to design a data warehouse, i.e. what are the steps of doing dimensional modelling? {M}
    Answer: There are many ways, but it should not be too far from this order: 1. Understand the business process, 2. Declare the grain of the fact table, 3. Create the dimension tables including attributes, 4. Add the measures to the fact tables (from Kimball’s Toolkit book, chapter 2). Steps 3 and 4 could be reversed (add the facts first, then create the dims), but steps 1 & 2 must be done in that order. Understanding the business process must always be first, and declaring the grain must always be second.
    Purpose: This question is for data architect or data warehouse architect to see if they can do their job. It’s not a question for an ETL, report or cube developer.
    10. Question: How do you join 2 fact tables? {H}
    Answer: It’s a trap question. You don’t usually join 2 fact tables especially if they have different grain. When designing a dimensional model, you include all the necessary measures into the same fact table. If the measure you need is located on another fact table, then there’s something wrong with the design. You need to add that measure to the fact table you are working with. But what if the measure has a different grain? Then you add the lower grain measure to the higher grain fact table. What if the fact table you are working with has a lower grain? Then you need to get the business logic for allocating the measure.
    It is possible to join 2 fact tables, i.e. using the common dim keys, but the performance is usually horrible, hence people don’t do this in practice, except for small fact tables (<100k rows). For example: if FactTable1 has dim1key, dim2key, dim3key and FactTable2 has dim1key and dim2key, then you could join them like this:
    select f2.dim1key, f2.dim2key, f1.measure1, f2.measure2
    from
    ( select dim1key, dim2key, sum(measure1) as measure1
      from FactTable1
      group by dim1key, dim2key
    ) f1
    join FactTable2 f2
    on f1.dim1key = f2.dim1key and f1.dim2key = f2.dim2key
    So if we don’t join 2 fact tables that way, how do we do it? The answer is using the fact key column. It is a good practice (especially in SQL Server, because of the concept of the clustered index) to have a fact key column to enable us to identify rows on the fact table. The performance would be much better (than joining on dim keys), but you need to plan this in advance as you need to include the fact key column on the other fact table.
    select f2.dim1key, f2.dim2key, f1.measure1, f2.measure2
    from FactTable1 f1
    join FactTable2 f2
    on f2.fact1key = f1.factkey
    I originally implemented this technique for self joins, but then expanded its usage to join to other fact tables. It must be used on an exception basis rather than as the norm.
    Purpose: not to trap the candidate, of course, but to see if they have experience dealing with a problem which doesn’t happen every day.
    11. Question: How do you index a dimension table? {L}
    Answer: a clustered index on the dim key, and non clustered indexes (individual) on the attribute columns which are used in the query’s WHERE clause.
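    A SQL Server sketch, assuming a hypothetical DimProduct whose ProductCode and ProductGroup attributes appear in WHERE clauses:
    create table dbo.DimProduct
    ( ProductKey   int not null,
      ProductCode  varchar(20),
      ProductName  varchar(100),
      ProductGroup varchar(50),
      constraint PK_DimProduct primary key clustered (ProductKey) );   -- clustered on the dim key
    create nonclustered index IX_DimProduct_ProductCode  on dbo.DimProduct (ProductCode);
    create nonclustered index IX_DimProduct_ProductGroup on dbo.DimProduct (ProductGroup);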
    Purpose: this question is critical to ask if you are looking for a Data Warehouse Architect (DWA) or a Data Architect (DA). Many DWAs and DAs only know the logical data model; many of them don’t know how to index, or how different the physical tables are in Oracle compared to Teradata. This question is not essential if you are looking for a report or ETL developer. It’s good for them to know, but it’s not essential.
    12. Question: Tell me what you know about William Inmon? {L} Alternatively: Ralph Kimball.
    Answer: He was the one who introduced the concept of data warehousing. Arguably Barry Devlin was the first one, but he’s not as popular as Inmon. If you ask who Barry Devlin or Claudia Imhoff is, 99.9% of candidates wouldn’t know. But every decent practitioner in data warehousing should know about Inmon and Kimball.
    Purpose: to test if the candidate is a decent practitioner in data warehousing or not. You’ll be surprised (especially if you are interviewing a report developer) how many candidates don’t know the answer. If someone is applying for a BI architect role and has never heard of Inmon, you should worry.
    13. Question: How do we build a real time data warehouse? {H}
    Answer: If the candidate asks “Do you mean real time or near real time?”, it may indicate that they have a good amount of experience dealing with this in the past. There are two ways we build a real time data warehouse (and this is applicable for both a normalised DW and a dimensional DW):
    a) By storing previous periods’ data in the warehouse then putting a view on top of it pointing to the source system’s current period data. “Current period” is usually 1 day in DW, but in some industries e.g. online trading and ecommerce, it is 1 hour.
    b) By storing previous periods’ data in the warehouse and then using some kind of synchronous mechanism to propagate the current period’s data. Examples of synchronous data propagation mechanisms are SQL Server 2008’s Change Tracking or old-school triggers.
    Near real time DW is built using asynchronous data propagation mechanism, aka mini batch (2-5 mins frequency) or micro batch (30s – 1.5 mins frequency).
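    A minimal sketch of approach (a), assuming a hypothetical FactSales, a linked server called SourceServer and a DimDate with a FullDate column (all names are illustrative):
    create view dbo.FactSalesRealTime
    as
    select DateKey, CustomerKey, ProductKey, SalesAmount
    from   dbo.FactSales                                   -- previous periods, loaded by the batch
    union all
    select d.DateKey, c.CustomerKey, p.ProductKey, s.SalesAmount
    from   SourceServer.SalesDB.dbo.Sales s                -- current period, straight from the source system
    join   dbo.DimDate     d on d.FullDate     = cast(s.SalesDateTime as date)
    join   dbo.DimCustomer c on c.CustomerCode = s.CustomerCode and c.IsCurrent = 1
    join   dbo.DimProduct  p on p.ProductCode  = s.ProductCode
    where  s.SalesDateTime >= cast(getdate() as date);     -- today's rows only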
    Purpose: to test if the candidate understands complex, non-traditional mechanisms and follows the latest trends. Real time DW was considered impossible 5 years ago and has only been developed in the last 5 years. If the DW is normalised it’s easier to make it real time than if the DW is dimensional, as there’s a dim key lookup involved.
    14. Question: What is the difference between a data mart and a data warehouse? {L}
    Answer: Most candidates will answer that one is big and the other is small. Some good candidates (particularly Kimball practitioners) will say that a data mart is one star, whereas a DW is a collection of all the stars. An excellent candidate will say all the above, plus they will say that a DW could be the normalised model that stores the EDW, whereas a DM is the dimensional model containing 1-4 stars for a specific department (both relational DB and multidimensional DB).
    Purpose: The question has 3 different levels of answer, so we can see how deep the candidate’s knowledge of data warehousing is.
    15. Question: What is the purpose of having a multidimensional database? {L}
    Answer: Many candidates don’t know what a multidimensional database (MDB) is. They have heard about OLAP, but not MDB. So if the candidate looks puzzled, help them by saying “an MDB is an OLAP database”. Many will say “Oh… I see” but actually they are still puzzled, so it will take a good few moments before they are back to earth again. So ask again: “What is the purpose of having an OLAP database?” The answer is performance and easier data exploration. An MDB (aka a cube) is a hundred times faster than a relational DB for returning an aggregate. An MDB is also very easy to navigate, drilling up and down the hierarchies and across attributes, exploring the data.
    Purpose: This question is irrelevant to report or ETL developer, but a must for a cube developer and DWA/DA. Every decent cube developer (SSAS, Hyperion, Cognos) should be able to answer the question as it’s their bread and butter.
    16. Question: Why do you need a staging area? {M}
    Answer: Because:
    a) Some data transformations/manipulations from source system to DWH can’t be done on the fly, but requires several stages and therefore needs to “be landed on disk first”
    b) The time to extract data from the source system is limited (e.g. we are only given a 1-hour window), so we just “get everything we need out first and process later”
    c) For traceability and consistency, i.e. some data transformations are simple and some are complex, but for consistency we put all of them in the staging area first, then pick them up from staging for further processing
    d) Some data is required by more than one part of the warehouse (e.g. the ODS and the DDS) and we want to minimise the impact on the source system’s workload. So rather than reading twice from the source system, we “land” the data in the staging area, then both the ODS and the DDS read the data from staging.
    Purpose: This question is intended more for an ETL developer than a report/cube developer. Obviously a data architect needs to know this too.
    17. Question: How do you decide whether to keep something as 1 dimension or split it into 2 dimensions? Take for example dim product: there are attributes which are at product code level and there are attributes which are at product group level. Should we keep them all in 1 dimension (product) or split them into 2 dimensions (product and product group)? {H}
    Answer: Depends on how they are going to be used, as I explained in my article “One or two dimensions”.
    Purpose: To test if the candidate is conversant in dimensional modelling. This question especially is relevant for data architects and cube developers and less relevant for a report or ETL developer.
    18. Question: Fact table columns are usually numeric. In what case does a fact table have a varchar column? {M}
    Answer: a degenerate dimension.
    Purpose: to check if the candidate has ever been involved in the detailed design of warehouse tables. Follow up with question 19.
    19. Question: What kind of dimension is a “degenerate dimension”?  Give me an example. {L}
    Answer: A “dimension” which stays in the fact table. It is usually the reference number of the transaction. For example: transaction ID, payment reference and order ID.
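    For example (a hypothetical table), the order number simply stays on the fact table as a varchar column with no dimension table behind it:
    create table dbo.FactSalesOrder
    ( DateKey     int,
      CustomerKey int,
      ProductKey  int,
      OrderNumber varchar(20),    -- degenerate dimension: the transaction reference
      OrderAmount decimal(18,2) );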
    Purpose: Just another question to test the fundamentals.
    20. Question: What is snowflaking? What are the advantages and disadvantages? {M}
    Answer: In dimensional modelling, snowflaking is breaking a dimension into several tables by normalising it. The advantages are: a) performance when processing dimensions in SSAS, b) flexibility if the sub dim is used in several places e.g. city is used in dim customer and dim supplier (or in an insurance DW: dim policy holder and dim broker), c) one place to update, and d) the DW load is quicker as there is less duplication of data. The disadvantages are: a) it is more difficult to “navigate the star*”, i.e. you need to join a few more tables, b) worse “sum group by*” query performance (compared to a “pure star*”), c) it is less flexible in accommodating requirements, i.e. the city attributes for dim supplier have to be the same as the city attributes for dim customer, and d) the DW load is more complex as you have to integrate the city data.
    *: a “star” is a fact table with all its dimensions, “navigating” means browsing/querying, a “sum group by” is a SQL select statement with a “group by” clause, and a “pure star” is a fact table with all its dimensions where none of the dims are snowflaked.
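    A hypothetical sketch of the snowflake: the city attributes are moved out into their own table and shared by dim customer and dim supplier, at the cost of an extra join when navigating the star:
    create table dbo.DimCity
    ( CityKey  int primary key,
      CityName varchar(50),
      Country  varchar(50) );
    create table dbo.DimCustomer
    ( CustomerKey  int primary key,
      CustomerName varchar(100),
      CityKey      int references dbo.DimCity (CityKey) );
    create table dbo.DimSupplier
    ( SupplierKey  int primary key,
      SupplierName varchar(100),
      CityKey      int references dbo.DimCity (CityKey) );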
    Purpose: Snowflaking is one of the classic debates in the dimensional modelling community. It is useful to check if the candidate understands the reasons rather than just “following blindly”. This question is applicable particularly for data architects and OLAP designers. If their answers are way off then you should worry. But it is also relevant to ETL and report developers as they will be populating and querying the structure.

Thursday, October 27, 2011

Cube Objects with design in SSAS

  • Cube: edit measures or add new measure groups.
  • Dimensions: add a new dimension or another dimension object.
  • Calculations: A calculation is a Multidimensional Expressions (MDX) expression or script that is used to define a calculated member, a named set, or a scoped assignment in a cube in Microsoft SQL Server Analysis Services. Calculations let you add objects that are defined not by the data of the cube, but by expressions that can reference other parts of the cube, other cubes, or even information outside the Analysis Services database. Calculations let you extend the capabilities of a cube, adding flexibility and power to business intelligence applications.
    • Named Set: A named set is a CREATE SET MDX statement expression that returns a set.
    • Calculated Member: A calculated member is a member whose value is calculated at run time using a Multidimensional Expressions (MDX) expression that you specify when you define the calculated member. A calculated member is available to business intelligence applications just like any other member. Calculated members do not increase the size of the cube because only the definitions are stored in the cube; values are calculated in memory as required to answer a query.
    • Script Command: A script command is an MDX script, included as part of the definition of the cube. Script commands let you perform almost any action that is supported by MDX on a cube, such as scoping a calculation to apply to only part of the cube. In SQL Server Analysis Services, MDX scripts can apply either to the whole cube or to specific sections of the cube, at specific points throughout the execution of the script. The default script command, which is the CALCULATE statement, populates cells in the cube with aggregated data based on the default scope.
  • KPIs: In business terminology, a Key Performance Indicator (KPI) is a quantifiable measurement for gauging business success. In Analysis Services, a KPI is a collection of calculations that are associated with a measure group in a cube that are used to evaluate business success. 
  • Actions: Actions can be of different types and they have to be created accordingly. Actions can be:
    • Drillthrough actions, which return the set of rows that represents the underlying data of the selected cells of the cube where the action occurs.
    • Reporting actions, which return a report from Reporting Services that is associated with the selected section of the cube where the action occurs.
    • Standard actions, which return the action element (URL, HTML, DataSet, RowSet, and other elements) that is associated with the selected section of the cube where the action occurs.
  • Partitions: A partition is a container for a portion of the measure group data. Partitions are not seen from MDX queries; all queries reflect the whole content of the measure group, regardless of how many partitions are defined for the measure group. The data content of a partition is defined by the query bindings of the partition, and by the slicing expression.
    • Remote Partition: The data of a remote partition is stored on a different instance of Microsoft SQL Server Analysis Services than the instance that contains the definitions (metadata) of the partition and its parent cube. A remote partition is administered on the same instance of Analysis Services where the partition and its parent cube are defined.
    • Storage Mode: The storage mode of a partition affects the query and processing performance, storage requirements, and storage locations of the partition and its parent measure group and cube. The choice of storage mode also affects processing choices. A partition can use one of three basic storage modes:
      • Multidimensional OLAP (MOLAP)
      • Relational OLAP (ROLAP)
      • Hybrid OLAP (HOLAP)
    • Write Enabled Partitions: The data in a cube is generally read-only. However, for certain scenarios, you may want to write-enable a partition. Write-enabled partitions are used to enable business users to explore scenarios by changing cell values and analyzing the effects of the changes on cube data. When you write-enable a partition, client applications can record changes to the data in the partition. These changes, known as writeback data, are stored in a separate table and do not overwrite any existing data in a measure group. However, they are incorporated into query results as if they are part of the cube data.
  • Aggregation: An Aggregation Design object defines a set of aggregation definitions that can be shared across multiple partitions. An Aggregation object represents the summarization of measure group data at certain granularity of the dimensions. A simple Aggregation object is composed of: basic information and dimensions. Basic information includes the name of the aggregation, the ID, annotations, and a description. The dimensions are a collection of AggregationDimension objects that contain the list of granularity attributes for the dimension.
  • Perspective: A perspective is a definition that allows users to see a cube in a simpler way. A perspective is a subset of the features of a cube. A perspective enables administrators to create views of a cube, helping users to focus on the most relevant data for them. A perspective contains subsets of all objects from a cube. A perspective cannot include elements that are not defined in the parent cube.
  • Translation: A translation is a simple mechanism to change the displayed labels and captions from one language to another. Each translation is defined as a pair of values: a string with the translated text, and a number with the language ID. Translations are available for all objects in Analysis Services. Dimensions can also have the attribute values translated. 
  • Browser: browse the cube data once the cube is processed.

Error Handling in SSIS


  1. Create control flow by using the control flow designer.
  2. Configure transaction handling for packages, containers, and tasks.
    1. Enable Distributed Transaction Coordinator (DTC)
    2. Lock windows from view menu
    3. Set transaction option
  3. Set checkpoints to define restart points.
    1. Set the SaveCheckpoints package property to True.
    2. Set the CheckpointFileName property to a valid path for the checkpoint file.
    3. Set CheckpointUsage to IfExists.
  4. Create event handlers.
  5. Implement error handling. (error redirection)
    1. Handle errors by configuring data flow paths.
    2. Handle errors by configuring control flow paths.
  6. Debug packages.
    1. Debug progress reporting.
    2. View intermediate results by using debug windows.
    3. Examine the package state by setting breakpoints.

Wednesday, October 26, 2011

New in SQL Server 2008


T-SQL:
  • Transparent Data Encryption
  • Enhanced Database Mirroring
  • Automatic Recovery of data pages
  • Data Compression
  • ADO.Net Data Services
  • New Date/Time data types
  • Spatial Data Type
  • Partitioned Table Parallelism
  • Grouping Sets
  • CDC (Change Data Capture)
SQL Server Integration Services (SSIS) 2008
With version 2008, a more scalable and efficient architecture for extraction, transformation, and loading (ETL) is offered to customers. Some of the new features include the following.
  • SSIS pipeline improvements: In SQL Server 2005, the SSIS pipeline execution engine did not scale up on multi-processor machines with more than two processors or CPUs. The Integration Services design in SQL Server 2008 improves the scalability of the runtime across multiple processors.
  • SSIS data profiling: The new data profiling feature in SSIS 2008 identifies patterns in data. This feature generates statistics that can be viewed by the stand-alone Data Profile Viewer.
  • Lookups: performance is improved, access to new data sources (including XML) is added, and a new TXLookup transformation component is introduced.
  • The new MERGE SQL statement allows you to insert, update, or delete data based on certain join conditions in the same statement, whereas in previous versions of SQL Server you had to write separate statements (a sketch is shown after this list).
  • CDC (Change Data Capture) is used to capture insert, update and delete activity on a SQL table using the transaction log, placing the captured data in a separate relational table.
  • Data Warehousing: It’s now easier to manage large tables more effectively by transparently breaking them into manageable blocks of data. Also, with the Resource Governor you can now assign limits on CPU and memory use for each workload. And finally, with native data compression, disk I/O can be dramatically reduced, increasing overall performance.
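As an illustration of the new MERGE statement mentioned above, here is a minimal sketch (the dimension and staging table names are hypothetical) that upserts a product dimension in a single statement:
-- insert new products, update changed ones, in one statement
merge dbo.DimProduct as target
using stg.Product    as source
   on target.ProductCode = source.ProductCode
when matched and target.ProductName <> source.ProductName then
    update set target.ProductName = source.ProductName
when not matched by target then
    insert (ProductCode, ProductName)
    values (source.ProductCode, source.ProductName);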
SQL Server Reporting Services (SSRS) 2008
In SSRS 2008, a report server is implemented as a Windows-based service that hosts the Report Manager, the Report Server Web service, and the background processing feature areas. This decreases the dependence on IIS, and consolidating the server applications into a single service reduces the configuration and maintenance tasks.
  • The report server has new management features to set a memory threshold for background operations and performance counters for monitoring service activity.
  • SSRS 2008 supports two modes of deployment for the report server: native mode and SharePoint integrated mode. In native mode, a report server is a stand-alone application server that provides viewing, management, and delivery of reports and report models.
  • In the SharePoint integrated mode, a report server becomes part of a SharePoint Web application. Users of Windows SharePoint Services 3.0 can store reports in SharePoint libraries and access them from SharePoint sites that are used for accessing other business documents.
  • The Office Word 2007 rendering extension can be used to export a report to a Word document, and the Office Excel renderer has been enhanced to support features like nested data regions and sub-reports.
Report Designer improvements:
  • Entity hierarchies:  Provides a flattened analytic-style metadata browser that presents all entities as a flattened list.
  • Live data in design view: Allows display of live data by using simple iteration of design-time elements.
  • Instances in metadata browser: Extends the metadata browser to include instance data.
  • Filtering on the design surface: Adds UI elements for defining basic filter conditions directly on the design surface.
  • Multiple data sources: Builds reports against more than one data source and has more than one data region for each report.
  • Office12 ribbon: Provides UI parity with Microsoft Office 2007 products.
  • Conditional formatting: Provides conditional formatting, which is a recommended customer feature.
SQL Server Analysis Services (SSAS) 2008
In SSAS 2008, Multidimensional Expressions (MDX) query performance has improved because Analysis Services can deal better with cube space. SSAS 2008 divides the space to separate calculated members, regular members, and empty space to better evaluate cells that need to be included in calculations.
  • The new MOLAP-enabled write back capabilities in SSAS 2008 remove the need to query relational online analytical processing (ROLAP) partitions. This provides you with enhanced write back scenarios from within analytical applications without affecting the traditional online analytical processing (OLAP) performance.
  • A single read-only copy of SSAS 2008 database can be shared between many SQL Servers through a virtual IP address. This creates a highly scalable deployment option for an Analysis Services solution.
  • With SQL Server 2008 backup compression, less storage is required to keep backups online. The backups also run significantly faster because less disk I/O is required.
  • In SSAS 2008, data mining models have been enhanced by appending a new algorithm to the Microsoft Time Series algorithm. This improves the accuracy and stability of predictions in the data mining models. The new algorithm is based on the Auto Regressive Integrated Moving Average (ARIMA) algorithm, and provides better long-term predictions than the Auto Regression Trees with Cross Predict (ARTxp) algorithm that has been used in SQL Server 2005 Analysis Services (SSAS 2005).
  • You can use the Analysis Services personalization extensions to create new Analysis Services objects and functionality, and then provide these objects and functionality dynamically in the context of the user session.
  • SQL Server 2008 generates Analysis Management Objects (AMO) warnings spanning all objects to alert you when you violate any of the design best practices or make logical errors in database design.
  • The improvements in designing dimensions include features such as the Attribute Relationship designer, a simplified and enhanced Dimension wizard, and the Key Columns dialog box.


Mostly used DMFs and DMVs in SQL Server


Dynamic management views and functions return server state information that can be used to monitor the health of a server instance, diagnose problems, and tune performance.

CDC related DMVs
Sys.dm_cdc_errors
     Contains error information for the previous 32 sessions.

Index related DMVs and DMFs
Sys.dm_db_index_usage_stats:
    Returns counts of different types of index operations and the time each type of operation was last performed.
Sys.dm_db_index_physical_stats:
    Returns size and fragmentation information for the data and indexes of the specified table or view. For an index, one row is returned for each level of the B-tree in each partition. 
Sys.dm_db_missing_index_columns:
       Returns information about database table columns that are missing an index, excluding spatial indexes. sys.dm_db_missing_index_columns is a dynamic management function.
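Two example queries against the index DMVs above (the 30% fragmentation threshold is just an illustrative cut-off):
-- how often each index is read (seeks/scans/lookups) versus written (updates)
select object_name(s.object_id) as table_name,
       i.name                   as index_name,
       s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
from   sys.dm_db_index_usage_stats s
join   sys.indexes i on i.object_id = s.object_id and i.index_id = s.index_id
where  s.database_id = db_id()
order by s.user_updates desc;

-- fragmentation of indexes in the current database
select object_name(ps.object_id) as table_name,
       ps.index_id, ps.avg_fragmentation_in_percent, ps.page_count
from   sys.dm_db_index_physical_stats(db_id(), null, null, null, 'LIMITED') ps
where  ps.avg_fragmentation_in_percent > 30;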

Transaction related DMVs and DMFs
sys.dm_tran_active_transactions:
    Returns information about transactions for the instance of SQL Server.
sys.dm_tran_locks:
     Returns information about currently active lock manager resources in SQL Server 2008 R2. Each row represents a currently active request to the lock manager for a lock that has been granted or is waiting to be granted.
Sys.dm_tran_session_transactions:
     Returns correlation information for associated transactions and sessions.

Slowly changing Dimensions

The Slowly Changing Dimension transformation coordinates the updating and inserting of records in data warehouse dimension tables. The Slowly Changing Dimension transformation provides the following functionality for managing slowly changing dimensions:
  1. Matching incoming rows with rows in the lookup table to identify new and existing rows.
  2. Identifying incoming rows that contain changes when changes are not permitted.
  3. Identifying inferred member records that require updating.
  4. Identifying incoming rows that contain historical changes that require insertion of new records and the updating of expired records.
  5. Detecting incoming rows that contain changes that require the updating of existing records, including expired ones.
Output Types:
  • Type 0: Fixed Attribute Output
  • Type 1: Changing Attributes Updates Output
  • Type 2: Historical Attributes Inserts Output
  • Type 3: create a new column for every change
  • Plus the New Output, Unchanged Output and Inferred Member Updates Output
Type 0: does not allow any record in the dimension to be updated.
Type 1: simply updates the record in place; there is no trace of the old value.
Type 2: adds a new record to the dimension table for each change, using a flag value or timestamp to identify the current version.
Type 3: creates a new column for each change. This type is not suitable for attributes that change more than once or twice.
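A minimal T-SQL sketch of the Type 1 and Type 3 behaviour described above (a Type 2 load is sketched in the SCD interview question earlier; DimCustomer and the column names are hypothetical):
-- Type 1: overwrite in place, no history kept
update dbo.DimCustomer set City = 'Leeds' where CustomerCode = 'CUST01';

-- Type 3: keep the previous value in an extra column
alter table dbo.DimCustomer add PreviousCity varchar(50);
update dbo.DimCustomer
set    PreviousCity = City,
       City         = 'Leeds'
where  CustomerCode = 'CUST01';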

Tuesday, October 25, 2011

Optimizing SSIS Packages

  1. Try to perform your data flows in bulk mode instead of row by row. 
  2. Do not sort within Integration Services unless it is absolutely necessary. 
  3. There are certain times when using Transact-SQL will be faster than processing the data in SSIS, for example:
  4. Aggregation calculations such as GROUP BY and SUM are typically calculated faster in Transact-SQL than by in-memory calculations in the pipeline (a source-query sketch is shown after this list).
  5. Delta detection, i.e. changing existing rows in the target table instead of reloading the whole table, is another case that is often faster in Transact-SQL.
  6. Use partitioning on your target table. This way you will be able to run multiple versions of the same package, in parallel, that insert data into different partitions of the same table. 
  7. Make data types as narrow as possible so you will allocate less memory for your transformation. 
  8. Do not perform excessive casting of data types – it will only degrade performance. 
  9. Use the NOLOCK or TABLOCK hints to remove locking overhead. 
  10. If possible, perform your datetime conversions at your source or target databases 
  11. Use a commit size of <5000 to avoid lock escalation when inserting. 
  12. Heap inserts are typically faster than using a clustered index. 
  13. Set checkpoints so that after a failure, execution can restart from the checkpoint.
  14. Minimise the use of blocking and partially blocking transformations.
  15. Use fast-load features on destinations and tune the buffers for optimum buffering.
  16. Fast parse: using fast parse can drastically improve the performance of your package when using a flat file source or a data conversion transformation, by not validating the columns that you specify.
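A sketch of the source-query idea from points 3-5, 9 and 10 above (table and column names are hypothetical): the aggregation, the datetime conversion and the NOLOCK hint are all pushed into the source query instead of being done in the pipeline.
select cast(SalesDateTime as date) as SalesDate,     -- datetime conversion at the source
       ProductCode,
       sum(SalesAmount)            as SalesAmount    -- aggregation at the source, not in an Aggregate transform
from   dbo.SalesTransaction with (nolock)            -- remove locking overhead on the read
group by cast(SalesDateTime as date), ProductCode;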

T-SQL Queries Optimization


  1. Create indexes on key columns such as foreign keys.
  2. SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED at the top of the stored procedure, where dirty reads are acceptable.
  3. Use transactions when appropriate.
  4. Avoid temporary tables if possible.
  5. Always profile before and after each change.
  6. Look for every possible way to reduce the number of round trips to the server.
  7. Avoid index and join hints.
  8. Avoid functions in the WHERE clause, as they prevent index seeks (see the sketch at the end of this list).
  9. Every SQL query is broken down into a series of execution steps called operators. Each operator performs a basic operation such as insertion, search, scan, update or aggregation. There are two kinds of operators: logical operators and physical operators.
  10. Create a covering index instead of a composite non-clustered index where appropriate.
  11. Use unique columns and the preferred (narrowest) data types.
  12. The execution plan describes the sequence of operations, physical and logical, that SQL Server will perform in order to fulfill the query and produce the desired resultset. 
  13. Use stored procedures instead of separate ad hoc statements.
  14. Do not use SELECT *.
  15. Avoid COMPUTE BY and CROSS JOIN.
  16. Never use the LIKE operator when there is an exact match.
  17. Make use of variables.
  18. Use two-part naming conventions (schema.object).
  19. Partition large tables and indexes.
  20. Use UNION ALL instead of UNION where duplicates do not need to be removed.
  21. Investigate slow-running queries using the execution plan.
  22. Use SQL Profiler to capture trace files and the Database Engine Tuning Advisor (DTA) for tuning advice.
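A sketch illustrating points 8 and 10 above, with a hypothetical Orders table: rewrite the function on the column as a range predicate so an index seek is possible, then cover the query with a non-clustered index.
-- instead of: where year(OrderDate) = 2011  (not sargable, forces a scan)
select OrderID, CustomerID, OrderAmount
from   dbo.Orders
where  OrderDate >= '20110101' and OrderDate < '20120101';

-- covering index: the query above can be answered entirely from this index
create nonclustered index IX_Orders_OrderDate
on dbo.Orders (OrderDate)
include (OrderID, CustomerID, OrderAmount);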