Mar 26

How to move data or a table from one data file to another in the same filegroup! – Part 2 of 2

This post is a continuation of the previous post, How to move data or a table from one data file to another in the same filegroup! – Part 1 of 2, and will show you how SQL Server handles data when a filegroup has more than one data file. I will show you what happens when someone adds a second data file because they ran out of space on the first. I have seen this done when a disk is out of space, so someone adds a second file to the filegroup on a different disk. In this demo, we will use the same disk, but limit the size and growth of the first data file.

To start, here is a section from BOL on Database Files and Filegroups:

For example, three files, Data1.ndf, Data2.ndf, and Data3.ndf, can be created on three disk drives, respectively, and assigned to the filegroup fgroup1. A table can then be created specifically on the filegroup fgroup1. Queries for data from the table will be spread across the three disks; this will improve performance. The same performance improvement can be accomplished by using a single file created on a RAID (redundant array of independent disks) stripe set. However, files and filegroups let you easily add new files to new disks.

Now I will show you how you do not get to decide which file within the group the data will go to.

Demo 1 – Adding a file to a full database in order to give some free space

First we will create a database with one data file in the PRIMARY filegroup and check the space of the table:

USE [master];
GO

--Check for existence of database and drop if exists
IF EXISTS (
 SELECT 1
 FROM sys.databases
 WHERE NAME = 'TestDataFileAllocation'
 )
 DROP DATABASE TestDataFileAllocation;
GO

--Create a database with 1 data file in the PRIMARY filegroup
CREATE DATABASE TestDataFileAllocation ON PRIMARY (
 NAME = N'TestDataFileAllocation'
 ,FILENAME = N'F:\DATA\TestDataFileAllocation.mdf'
 ,SIZE = 10 MB
 ,MAXSIZE = 100 MB
 ,FILEGROWTH = 0
 ) LOG ON (
 NAME = N'TestDataFileAllocation_log'
 ,FILENAME = N'F:\Log\TestDataFileAllocation_log.ldf'
 ,SIZE = 1024 MB
 ,MAXSIZE = UNLIMITED
 ,FILEGROWTH = 1
 );
GO

--Switch database context to new database
USE TestDataFileAllocation;
GO

--Create a table for some data
CREATE TABLE SomeData (
 idCol INT IDENTITY(1, 1)
 ,NAME VARCHAR(MAX)
 ,number INT
 );
GO

--Add a clustered primary key to our new table
ALTER TABLE SomeData ADD CONSTRAINT PK_idCol PRIMARY KEY CLUSTERED (idCol);
GO

--Look at the file allocation using system tables and DMVs
SELECT OBJECT_NAME(p.object_id) AS tableName
 ,df.file_id
 ,df.name AS file_logical_name
 ,df.physical_name AS file_physical_name
 ,CAST(df.size AS BIGINT)*8/1024 AS file_size_MB
 ,CAST(df.max_size AS BIGINT)*8/1024 AS file_max_size_MB
 ,df.growth AS file_autogrow_ind
 ,fsu.total_page_count AS file_total_page_count
 ,fsu.allocated_extent_page_count AS file_allocated_extent_page_count
 ,fsu.unallocated_extent_page_count AS file_unallocated_extent_page_count
 ,au.total_pages AS table_allocated_pages
 ,au.used_pages AS table_used_pages
 ,au.data_pages AS table_data_pages
 ,p.rows AS table_rows
FROM sys.allocation_units au
INNER JOIN sys.database_files df ON au.data_space_id = df.data_space_id
INNER JOIN sys.partitions p ON au.container_id = p.hobt_id
INNER JOIN sys.dm_db_file_space_usage fsu ON df.file_id = fsu.file_id
WHERE au.type IN (
 1
 ,3
 )
 AND OBJECT_NAME(p.object_id) = 'SomeData';
GO

Notice the results of the allocation on the file:

[Screenshot: NoData – allocation results for the empty table]

So far, 328 pages are allocated while 952 are unallocated for the file itself. This particular table is currently empty and has 0 pages. Now let's add some data to the table and see what happens.

--Insert data into the SomeData table
INSERT INTO SomeData (
 NAME
 ,number
 )
VALUES (
 REPLICATE('sqlknowitall', 1000)
 ,10000
 );
GO

And using the same query to look at the data as we used earlier:

[Screenshot: 1Row – allocation results after inserting one row]

Now we have 1 row in the table, and we can see that nothing has changed on the file itself, but the table has added 1 data page. So, not much has changed… However, what happens if we add so much data that the file runs out of space?

--Insert data into the SomeData table and max out the file
INSERT INTO SomeData (
 NAME
 ,number
 )
VALUES (
 REPLICATE('sqlknowitall', 1000)
 ,10000
 );
GO 10000

Run the query to see what happened:

[Screenshot: MaxData – allocation results with the first file full]

The file is now full. It has no more unallocated pages, and the table itself has 953 allocated and used pages. To allow more data into this table, we need to either grow the file or add a new one. For the sake of this article and the point I am trying to make, we will add a second file to the filegroup.

--Add a second file to the filegroup
ALTER DATABASE TestDataFileAllocation ADD FILE (
 NAME = TestDataFileAllocation_2
 ,FILENAME = N'F:\DATA\TestDataFileAllocation_2.ndf'
 ,SIZE = 1024 MB
 ,MAXSIZE = UNLIMITED
 ,FILEGROWTH = 512 MB
 ) TO FILEGROUP [PRIMARY];
GO

Look at the allocation:

[Screenshot: SecondFile – allocation results after adding the second file]

We now see the new file with 8 pages allocated. The only reason table data shows up against this new file is that the DMVs don’t tell us which file the table resides on, only which filegroup. However, since we only have 1 table and have made no changes to it, we know that all of its data still resides on the first file.
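If you want to verify this for yourself, you can look at the actual page allocations. Here is a quick sketch (not part of the original demo) that uses the undocumented sys.dm_db_database_page_allocations function available in SQL Server 2012 and later, so treat it as a diagnostic aid rather than something to run on a busy production system:

--Count the table's allocated pages per data file (undocumented DMF, SQL Server 2012+)
SELECT allocated_page_file_id AS file_id
 ,COUNT(*) AS page_count
FROM sys.dm_db_database_page_allocations(DB_ID(), OBJECT_ID('SomeData'), NULL, NULL, 'LIMITED')
WHERE is_allocated = 1
GROUP BY allocated_page_file_id;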

The suggestion was to drop and recreate the clustered index to move the table. Well, let’s see what that does:

--Drop the clustered index
ALTER TABLE SomeData
DROP CONSTRAINT PK_idCol;
GO
--Recreate the same clustered index
ALTER TABLE SomeData ADD CONSTRAINT PK_idCol PRIMARY KEY CLUSTERED (idCol);
GO

Results:
[Screenshot: SecondFileRebuildCI – allocation results after dropping and recreating the clustered index]

Hmm… Well, some of the data moved to the new file. But how much? There are more rigorous ways to determine this, but to keep it simple I will just compare the starting and ending allocations. Before we had any data in the table, the first file showed 328 allocated pages. Before we had any data in the second file, it had 8 allocated pages. After the drop and recreate of the clustered index, the first file now has 336 pages allocated while the second has 968. We can see that not all of the data has moved off of the first file, but a majority did. According to Paul Randal:

SQL Server’s allocation system uses a round-robin mechanism where allocations are spread over all the data files in a filegroup – an allocation from file 1, then an allocation from file 2, and so on, then back to file 1 again. Combined with that, it also uses an algorithm called proportional fill that aims to allocate data from the files according to how much free space they have relative to the other files in the filegroup.
The basic premise of proportional fill is that each file has a weighting, where files with less free space will have a higher weighting, and files with lots of free space will have a low weighting. The files with the lower weightings will be allocated from more frequently – i.e. those files with more free space will be allocated from the most.
This means that if you add a new data file to a filegroup that has mostly full data files, the proportional fill weightings will be such that the new file will be the one where allocations come from until it fills up to the same level as the older files. The new file in essence becomes an allocation hot spot.

In closing, you cannot decide how data moves between files within the same filegroup. SQL Server uses its own algorithm for this and it is out of our control. However, what if you decided that having 2 files was hurting performance or was too hard to manage? Can you somehow get rid of 1 of them? Yes, you can… but ONLY if the file you are trying to remove is not the primary data file (the first file created with the database).

--Empty file 1 and remove it
CHECKPOINT
GO

DBCC SHRINKFILE (TestDataFileAllocation, EMPTYFILE);
GO

ALTER DATABASE TestDataFileAllocation REMOVE FILE TestDataFileAllocation;
GO 

This will result in an error:

[Screenshot: EmptyFileError – error returned when attempting to remove the first file]

The only thing we can do in this case is move our first file to a larger drive (if that was the issue to begin with, who knows) and move everything from file 2 to the original after growing the file to accommodate the data.

--Grow our first file and move everything there to get rid of second file
USE master;
GO
ALTER DATABASE TestDataFileAllocation
MODIFY FILE
(NAME = TestDataFileAllocation,
SIZE = 1024MB);
GO

USE TestDataFileAllocation;
GO

CHECKPOINT;
GO

DBCC SHRINKFILE (TestDataFileAllocation_2, EMPTYFILE);
GO

ALTER DATABASE TestDataFileAllocation REMOVE FILE TestDataFileAllocation_2;
GO

Just double-checking with our allocation query:

[Screenshot: FinalFile – allocation results after removing the second file]

Tada!

Mar 20

How to move data or a table from one data file to another in the same filegroup! – Part 1 of 2

Sorry! This isn’t something you can do. I wanted to title this post in a way that would draw people to it for my rant. I am ranting because I had a friend who went to an interview where he was asked how you move data from one file to another. He didn’t know, so he asked me…

Friend: Do you know how to move data from one file to another?

Me: You mean between filegroups. Yes.

Friend: No, between files in the same filegroup.

Me: You don’t. Why?

Friend: That’s what I thought too, but I learned at an interview today.

Me (laughing): Really? Can you share this knowledge with the rest of the SQL Server community? Well… start with me.

Friend: You rebuild the clustered index on the other file or you limit the growth of the file you want to move off of and then rebuild the clustered index so that it moves to the other file.

Me: They told you this? Have you ever tried to create an object on a file?

Friend: No.

Me: Did you test it out? Because, it can’t be done…

Friend: No, but…

Let me tell you, friends… as far as I know, this cannot be done. Whoever asked this question doesn’t understand the file structure in SQL Server or how data gets allocated. This conversation made me SOOOO mad, because it means there are people out there asking bad interview questions, or at the very least explaining their questions poorly. If you are interviewing someone, PLEASE make sure you know the correct answer to your own questions and make sure you are right. In the next post, I will post the DDL and DML to show this.

Mar 02

My Experience on SQL Cruise Caribbean 2015


This year I had the amazing opportunity to attend SQL Cruise Caribbean 2015. I found out about it from a blog post by David Klee (@kleegeek) (who posted his recap of the cruise here) and immediately talked to my boss about it. I was lucky because he had seen references to this in the past and had asked his boss, at the time, for permission to attend. However, he was shot down on the spot and laughed at. I imagine it went something like this…

DBA Manager: Good morning Director. I came across this amazing opportunity to learn SQL and network with some of the greatest minds in the field. The presenters are SQL Server MVPs, founders of PASS, and real contributors to the community!

Director: What’s PASS? Sounds good, how much does it cost and where is it?

DBA Manager: Well, it is really inexpensive and takes place on a cruise sh–

Director (cutting off DBA Manager, laughing): ARE YOU JOKING? I’m not paying for you to go on a cruise!

I’ll admit, when I saw the words “SQL Cruise Caribbean” I was thinking about sipping on a mojito while enjoying the sun on the pool deck. I was not thinking about all of the things I could learn and the people I could learn from.

I did a quick write-up for my boss that focused on the cost and the value of this experience. I told him about all of the knowledge I could put to use when I returned from Jes Borland's sessions on restore methods and automation. The skills I could use to improve our infrastructure and virtualization configuration that I would learn from David Klee. The questions I would ask Grant Fritchey and Kevin Kline about execution plans and performance tuning. I really sold this thing, but I really didn't understand the true value until I arrived at the Bon Voyage event in Ft. Lauderdale the night before embarkation.

I was a little shy at first because I really didn't know anybody (except David Klee) and was even a bit nervous about meeting some of the greats. I mean, I read these people's books and blog posts on a daily basis. They are like celebrities to a SQL geek like myself! Tim and Amy Ford introduced themselves right away and made my wife and me feel very comfortable. We sat down and started talking to people right away. I could see that some of the people had met before this and had been on previous cruises. They seemed like a family: catching up on the events of the year, talking about future endeavors, and (of course) plenty of SQL talk.

Training started immediately aboard the ship on Saturday, even before we left the port. Jeff Lehmann from Slalom Consulting kicked off the trip by teaching us about the offerings of the Amazon Cloud and how it could benefit SQL Server implementations. Almost immediately, I realized that this training was going to be different from any other. The room was small and people were asking lots of questions. Nobody, even the “experts”, held back. This was a group that would share knowledge, test theories, question standards and best practices, and do so all with the selfless intent of making the SQL community stronger. Not only in their knowledge of technology, but also in terms of professional advancement.

I spent 40 hours of scheduled time with this community throughout the week. I spent a total of over 50 hours combined with this new family of mine. I gained months of knowledge in a week. I gained years of networking within a week. I made friends. My wife made friends. I was inspired: to write, to present, to volunteer, to teach, and to inspire others!

During the trip, I made my deposit with the cruise line for the future. I returned from the cruise on February 14th and signed up for SQL Cruise Caribbean 2016 on February 28th because I know that this yearly training/networking is not only beneficial to my company, but also for my personal career.

So, what will your company get out of this? Why should they pay for it? The first reason is that the training you will receive is hands-on and by far the best training I have ever received. It is a small group, which allows very pointed and specific questions to be asked. Even more specific questions can be asked during office hours and in normal conversation outside of the formal sessions. The second is the cost of this training. It is significantly less than any training I have seen, especially when you look at the quality of the training and the ability to tailor a lot of it to your own specific needs. Lastly, the networking is invaluable to you and your company. When you bring in a consultant, you typically have to get an MSA signed, an SOW put together, and a minimum number of hours contracted for. However, after this training you will have a family that you can pose issues and questions to. These people LOVE a puzzle and also love to help you, a friend, solve them and learn from them. Ultimately, this does not only help you, but also your company.

In closing, this was the most valuable experience of my career and I intend to attend every year regardless of whether my company pays for it. Thanks to all those who made it a memorable experience. Also, thanks to all of the sponsors who make this a great value and give us the coveted swag: Amazon Web Services, redgate, SQLSentry, and SQLAgentman.

Apr 26

Determining a setting for “Cost Threshold for Parallelism”

I was asked to share this after a discussion at SQLSaturday Chicago by a presenter, David Klee. So, thank you David for getting me to finally post something!

A while ago, I was tasked with researching the Max Degree of Parallelism and Cost Threshold for Parallelism settings for my company's SQL Servers. While trying to find recommendations, I found that choosing a Cost Threshold for Parallelism was mostly done by picking a number and testing. I could not find any objective way of even arriving at a starting number. I couldn't just settle for “start at 25 or 30 and play with it to see how your server responds.” It was just too random for me, and I knew there had to be a better way.

My boss had given me a script, some time before this, that returned a list of parallel query plans and their subtree costs that I thought could be used to gather some data for me.

;WITH XMLNAMESPACES
(DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
	query_plan AS CompleteQueryPlan
	, n.value('(@StatementText)[1]', 'VARCHAR(4000)') AS StatementText
	, n.value('(@StatementOptmLevel)[1]', 'VARCHAR(25)') AS StatementOptimizationLevel
	, n.value('(@StatementSubTreeCost)[1]', 'VARCHAR(128)') AS StatementSubTreeCost
	, n.query('.') AS ParallelSubTreeXML
	, cp.usecounts
	, cp.size_in_bytes
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(plan_handle) AS qp
CROSS APPLY query_plan.nodes('/ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple') AS qn(n)
WHERE n.query('.').exist('//RelOp[@PhysicalOp="Parallelism"]') = 1;

I then took this script and modified it to simplify it for what I needed. All I really cared about were the costs, so I removed all other columns. I also thought it would be much easier to work with these numbers in a temp table, so I did that. The script below is now what I use to help me decide what to set my Cost Threshold for Parallelism to. It gathers all of the costs, inserts them into a temp table, retrieves the data (not always needed and sometimes too large to do anything with, so comment out if not needed), and calculates 3 important statistics: mean, median, and mode.

CREATE TABLE #SubtreeCost(StatementSubtreeCost DECIMAL(18,2));

;WITH XMLNAMESPACES
(DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
INSERT INTO #SubtreeCost
SELECT
	CAST(n.value('(@StatementSubTreeCost)[1]', 'VARCHAR(128)') AS DECIMAL(18,2))
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(plan_handle) AS qp
CROSS APPLY query_plan.nodes('/ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple') AS qn(n)
WHERE n.query('.').exist('//RelOp[@PhysicalOp="Parallelism"]') = 1;

SELECT StatementSubtreeCost
FROM #SubtreeCost
ORDER BY 1;

SELECT AVG(StatementSubtreeCost) AS AverageSubtreeCost
FROM #SubtreeCost;

SELECT (
	(SELECT TOP 1 StatementSubtreeCost
	FROM
		(
		SELECT TOP 50 PERCENT StatementSubtreeCost
		FROM #SubtreeCost
		ORDER BY StatementSubtreeCost ASC
		) AS A
	ORDER BY StatementSubtreeCost DESC
	)
	+
	(SELECT TOP 1 StatementSubtreeCost
	FROM
		(
		SELECT TOP 50 PERCENT StatementSubtreeCost
		FROM #SubtreeCost
		ORDER BY StatementSubtreeCost DESC
		) AS A
	ORDER BY StatementSubtreeCost ASC)
	) / 2 AS MEDIAN; --parentheses around the sum so the division by 2 applies to the whole sum, not just the second value

SELECT TOP 1 StatementSubtreeCost AS MODE
FROM   #SubtreeCost
GROUP  BY StatementSubtreeCost
ORDER  BY COUNT(1) DESC;

DROP TABLE #SubtreeCost;

So what do I do with these numbers? In my case, I am trying to get “about” 50% of the queries below the threshold and 50% above. This way, I split the number of queries eligible for parallelism roughly in half. This is not going to guarantee me the best performance, but I figured it was the best objective way to come up with a good starting point.

If all of my statistics are very close, I just set the Cost Threshold for Parallelism to about that number; averaging the 3 and rounding works. In many of my cases, this was between 25 and 30. If the numbers are different, i.e. a few very large costs skew the average up but the median and mode are close, then I will use something between the median and mode.
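Once you have a starting number, applying it is simple. Here is a quick sketch with sp_configure (the value 30 below is just an example, not a recommendation):

--Apply a starting value for Cost Threshold for Parallelism (30 is only an example)
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 30;
RECONFIGURE;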

Although this method does not spit out 1 number that will magically solve all of your problems, nor is it completely objective… It is the most objective way I have found to come up with a starting point. If you have any thoughts on this or would like to share your own experiences, please feel free to send me an email and I will respond as soon as I can.

Mar 01

Existing Instance Not Available for Upgrade/Repair/Remove

Yesterday we ran into an issue when upgrading one of our SQL Server 2008 R2 instances to SQL Server 2012.  We ran the install as normal with the upgrade option.  When it came time to choose the instance and features to upgrade, the only options were the shared features.  What happened to our Database Engine instance?

Long story short… We found that a registry key containing the instance name was duplicated in 2 locations:

  • HKLM\Software\Microsoft\Microsoft SQL Server\Instance Names\SQL
  • HKLM\Software\Microsoft\Microsoft SQL Server\Instance Names\RS

The solution? Since we no longer had Reporting Services installed on this server, I took a backup of the registry and then deleted the RS key. Restarted the install and… VOILA!

Apr 04

Window Functions! RANK(), DENSE_RANK(), ROW_NUMBER(), NTILE(n)

Microsoft introduced ranking functions (also known as window functions) in SQL Server 2005. What are window functions exactly? First, they are awesome. Second, they are functions that can be applied to a group of rows as specified by the window descriptor (I. Ben-Gan, 2011). Itzik Ben-Gan has a great series on how to use these and the new functions in SQL Server 2012. He also wrote an entire book on them! That's how valuable they are…

So, in layman's terms… We can apply a function to a group of rows as if they were the only results returned. This is done using, as mentioned above, a window descriptor. A sample from MSDN using the AdventureWorks2008R2 database is below:

USE AdventureWorks2008R2;
GO 
SELECT p.FirstName, p.LastName
  ,ROW_NUMBER() OVER (ORDER BY a.PostalCode) AS 'Row Number'
  ,RANK() OVER (ORDER BY a.PostalCode) AS 'Rank'
  ,DENSE_RANK() OVER (ORDER BY a.PostalCode) AS 'Dense Rank'
  ,NTILE(4) OVER (ORDER BY a.PostalCode) AS 'Quartile'
  ,s.SalesYTD, a.PostalCode
FROM Sales.SalesPerson s
INNER JOIN Person.Person p
  ON s.BusinessEntityID = p.BusinessEntityID
INNER JOIN Person.Address a
  ON a.AddressID = p.BusinessEntityID
WHERE s.TerritoryID IS NOT NULL
  AND s.SalesYTD <> 0;

Ok, so what does this do? Let's talk about the common clause of each function above: OVER (ORDER BY a.PostalCode). This is the window descriptor; it describes the set of rows to apply the function to and how to apply it. In this example it applies the function to all rows, but in the order of the PostalCode column. In addition to the ORDER BY, a PARTITION BY can precede it to tell the function which group of rows to apply the function to. In the case of sales, you may want to use the functions to rank ALL of the sales reps within each territory. To do this, the above code can be modified to include a PARTITION BY clause within each window descriptor:

USE AdventureWorks2008R2;
GO 
SELECT s.TerritoryID --added to show partition by TerritoryID
 , p.FirstName, p.LastName
 ,ROW_NUMBER() OVER (PARTITION BY s.TerritoryID ORDER BY a.PostalCode) AS 'Row Number'
 ,RANK() OVER (PARTITION BY s.TerritoryID ORDER BY a.PostalCode) AS 'Rank'
 ,DENSE_RANK() OVER (PARTITION BY s.TerritoryID ORDER BY a.PostalCode) AS 'Dense Rank'
 ,NTILE(4) OVER (PARTITION BY s.TerritoryID ORDER BY a.PostalCode) AS 'Quartile'
 ,s.SalesYTD, a.PostalCode
FROM Sales.SalesPerson s
INNER JOIN Person.Person p
 ON s.BusinessEntityID = p.BusinessEntityID
INNER JOIN Person.Address a
 ON a.AddressID = p.BusinessEntityID
WHERE s.TerritoryID IS NOT NULL
 AND s.SalesYTD <> 0;

So what is the difference between each of the ranking functions?

RANK()

According to BOL: RANK() “returns the rank of each row within the partition of a result set. The rank of a row is one plus the number of ranks that come before the row in question.” This means that rank numbers will be assigned to each row in the window descriptor based on the ORDER BY value. Equal values will receive the same rank number, but the rank number will still increase for each row in the set. This is why you will see only 1 and 6 in the first example.

DENSE_RANK()

According to BOL: DENSE_RANK() “returns the rank of rows within the partition of a result set, without any gaps in the ranking. The rank of a row is one plus the number of distinct ranks that come before the row in question.” This means that rank numbers will be assigned to each row in the window descriptor based on the ORDER BY value (just as in RANK). Equal values will receive the same rank number (just as in RANK), but the rank number will NOT skip values after a tie. This is why you will see only 1 and 2 in the first example.

ROW_NUMBER()

According to BOL: ROW_NUMBER() “returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition.” So, this is simple numbering from 1 to n of the rows defined in the window descriptor. If there is a PARTITION BY clause, the numbering restarts at 1 for each partition and keeps incrementing to the end of that partition in the order defined.

NTILE(n)

According to BOL: NTILE(n) “Distributes the rows in an ordered partition into a specified number of groups. The groups are numbered, starting at one. For each row, NTILE returns the number of the group to which the row belongs.” In the above example, we told the function to split the partition into 4 groups. The function will then assign a number 1-4 to the members of the group based on the ordering in the window descriptor. If there are fewer rows than n, the group numbers increase until the rows run out, so some of the n group numbers will simply not appear in the result set.
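To see how the four functions treat ties differently, here is a small self-contained sketch (the value list is made up purely for illustration):

--Compare the four ranking functions on a throwaway list of values
SELECT val
 ,ROW_NUMBER() OVER (ORDER BY val) AS 'Row Number'
 ,RANK() OVER (ORDER BY val) AS 'Rank'
 ,DENSE_RANK() OVER (ORDER BY val) AS 'Dense Rank'
 ,NTILE(2) OVER (ORDER BY val) AS 'Half'
FROM (VALUES (10), (10), (20), (30), (30), (40)) AS t(val);
--Row Number: 1,2,3,4,5,6 | Rank: 1,1,3,4,4,6 | Dense Rank: 1,1,2,3,3,4 | Half: 1,1,1,2,2,2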

A Great Utility of ROW_NUMBER()

Deleting true duplicates is something we have all experienced at one point or another.  The problem is that we have no easy way to tell SQL Server which of 1 or more duplicate rows to delete. Typical solutions have involved a cursor or many nested cursors to achieve this. Personally, if I can find a way to avoid a cursor I will use it.  Below is the script that I use to identify and delete duplicate rows:

WITH RowNumbers AS
(
SELECT columnA, columnB, columnC, columnD
  ,ROW_NUMBER()
    OVER (PARTITION BY  columnA, columnB, columnC, columnD ORDER BY columnA, columnB, columnC, columnD) AS rn
FROM tableA
)
DELETE FROM RowNumbers WHERE rn > 1

This is setting up my window descriptor to partition the groups based on all columns in this table so that true duplicates are in their own grouping. It then uses the ROW_NUMBER() function to assign a row number to each row in the group. If there are duplicates, there will be rows with a row number greater than 1.  If there are no duplicates of a row, it will be assigned a row number of 1.  Finally, delete all rows with a row number greater than 1! This script also uses a CTE or Common Table Expression to produce the result set that we will use for the delete.  CTEs will be discussed in a future post.
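If you want to see the pattern in action before pointing it at a real table, here is a disposable sketch (the temp table and its columns are made up for illustration):

--Demo: delete true duplicates from a throwaway temp table
CREATE TABLE #dupes (columnA INT, columnB VARCHAR(10));
INSERT INTO #dupes VALUES (1, 'a'), (1, 'a'), (1, 'a'), (2, 'b');

WITH RowNumbers AS
(
SELECT columnA, columnB
  ,ROW_NUMBER()
    OVER (PARTITION BY columnA, columnB ORDER BY columnA, columnB) AS rn
FROM #dupes
)
DELETE FROM RowNumbers WHERE rn > 1;

SELECT * FROM #dupes; --one (1, 'a') row and the (2, 'b') row remain
DROP TABLE #dupes;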

Jan 18

Generating Consecutive Numbers Within a Given Range

A couple of months ago I was working on a project where we were revamping an existing system. One piece of the system was a call to a stored procedure that generated a number in sequence and assigned it to a user who was accessing the application. It was very similar to taking a number and waiting in line. It was also similar in that we were required to use a fixed range of numbers: in our case, start at 100000 and roll over back to 100000 once we reached 899999.

When I began this project, an existing stored procedure was accomplishing this in a peculiar way. We only noticed a problem because of newly required load testing. When pushed to a high number of calls to this stored procedure, the calls started failing. We found it was because the existing script used the SERIALIZABLE isolation level: it first grabbed the MAX value from a 1 row/1 column table and checked whether it was less than 899999. If it was, it added 1 to it, updated the table with the new value, and passed the value back to the application. If it was equal to 899999, it updated the value in the table to 100000. This isolation level created deadlocks when the stored procedure was called many times in quick succession, which caused it to fail. Subsequently, the application then failed as well. (I will cover isolation levels in more detail in a future post.) This method also did not keep a history of the numbers, which made it very difficult to trace a number back to a particular user, vendor, or time.

The Solution!

I wanted a solution that would not cause deadlocks, but that would also guarantee a unique number each time the stored procedure was called. I also wanted to store a history of when these numbers were generated, yet somehow be able to identify numbers after rollovers in the range.  I knew that generating sequential numbers in a range this large would not rollover for many years, so as long as the numbers could be generated sequentially I would not have any problems rolling over.

We decided to use a log table with an identity column, and a date column to start.

CREATE TABLE numberLog(
gNumber int IDENTITY(1,1) NOT NULL,
gTime datetime NOT NULL,
CONSTRAINT [PK_numberLog] PRIMARY KEY CLUSTERED
(
gNumber ASC
) ON [PRIMARY]
) ON [PRIMARY]

This let SQL Server generate the number sequentially without any chance of repeating the same number. Skipping numbers was also not an issue. So how did we limit the range?

CREATE PROC generateNumber
AS
BEGIN
INSERT INTO numberLog(gTime)
SELECT GETDATE()
SELECT scope_identity() % 800000 + 100000
END

This stored procedure will maintain the identity in the table while using the newly created id to return the proper number to the application.  Since the application itself never needs to return to this table, it was the perfect solution!

How does it work?

The table is set to auto-increment the first column upon insertion. The SQL Server Database Engine ensures that this value is unique and incremented sequentially upon insertion. So, when the stored procedure is called to generate a number, it first inserts a row into our table. Upon insertion, the identity is generated and the current datetime is inserted with it by using the GETDATE() function. Finally, the identity value is selected using the SCOPE_IDENTITY() function, which returns the generated identity only from the scope of this session/stored procedure. That identity is then converted using the % and + operators to give us the number required by the application. The % operator gives us the integer remainder of the identity number divided by 800000. Below is a sample of what happens with different numbers using the % operator.

SELECT 0 % 800000 --0
SELECT 1 % 800000 --1
SELECT 799999 % 800000 --799999
SELECT 800000 % 800000 --0
SELECT 800001 % 800000 --1

The great thing about this method is that we now have a log of generated numbers that never rolls over, and by using the generated number in conjunction with the formula we can solve for the possible identity numbers. Knowing the approximate date that the number was generated will give us the specific identity. Again, it was not a requirement to store the generated numbers or the dates, but I thought it was safer to do so and it may make my life easier if they decide to do a lookup.
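For example, here is a rough sketch of that reverse lookup (the @generated value is just a made-up example):

--Reverse lookup: find the logged identity rows that could have produced a given number
DECLARE @generated INT = 100001; --hypothetical number handed back to the application
SELECT gNumber, gTime
FROM numberLog
WHERE gNumber % 800000 = @generated - 100000
ORDER BY gTime;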

Jan 10

Capturing Data Modifications – The OUTPUT Clause

There are several reasons you may want to capture data that has been inserted, updated, deleted, or merged (MERGE was introduced in SQL Server 2008). The OUTPUT clause that SQL Server provides is extremely helpful in these situations. I have been tasked on several occasions to “merge” data from a staging table to a production table. Whether or not it is needed, I like to build a table that logs all of the changes made by this operation. This is especially helpful for debugging, but also allows you to run reports or roll back any changes to the data. There are many different ways to capture data changes, and based on your needs this may not be the proper choice (it does not capture all changes to data in the table as a trigger would). However, if you are just trying to capture changes done by a stored procedure or a specific block of script, this may be the best route. I will not discuss the MERGE statement in detail here, but I will show you how to incorporate the OUTPUT clause into it.

Some Test Data

The following script will create 3 tables for demonstrating this. The first is a list of current employees and some of their information. The second is a staging table. The third table is for tracking the changes.

USE tempdb --safe place for testing, but you could always create a new database

--Create application table
SELECT 'AAA' AS empId, 'Albert' AS firstName, 'Anderson' AS lastName,
    '2008-12-01' AS hireDate
INTO currEmployees
UNION ALL
SELECT 'BBB', 'Beth', 'Beethoven', '2009-05-31'

--Create and populate staging table
SELECT 'AAA' AS empId, 'Albert' AS firstName, 'Andersen' AS lastName,
    '2008-12-01' AS hireDate
INTO empStaging
UNION ALL
SELECT 'CCC', 'Chet', 'Carlsberg', '2012-01-10'

--Create change log table
CREATE TABLE empChangeLog (id INT IDENTITY(1,1), changeDate DATETIME,
    changeType VARCHAR(10), empId CHAR(3), firstName VARCHAR(50),
    lastName VARCHAR(50), hireDate DATETIME, newEmpId CHAR(3),
    newFirstName VARCHAR(50), newLastName VARCHAR(50), newHireDate DATETIME)


Merging the Data

There are 2 ways to merge this data.  The most common way before SQL Server 2008 was to use 3 separate statements. This would be an insert, update, and delete statement:

DELETE currEmployees
FROM currEmployees e
LEFT JOIN empStaging s
 ON e.empId = s.empId
WHERE s.empId IS NULL

INSERT INTO currEmployees(empId, firstName, lastName, hireDate)
SELECT s.empId, s.firstName, s.lastName, s.hireDate
FROM empStaging s
LEFT JOIN currEmployees e
 ON s.empId = e.empId
WHERE e.empId IS NULL

UPDATE currEmployees
SET firstName = s.firstName, lastName = s.lastName, hireDate = s.hireDate
FROM empStaging s
INNER JOIN currEmployees e
 ON s.empId = e.empId
WHERE s.firstName <> e.firstName
 OR s.lastName <> e.lastName
 OR s.hireDate <> e.hireDate


You see, the problem here is that once you delete, insert, or update the rows, you no longer have the old values. All you have are the new values. So how do you retain the old and new values to insert into a change table? You could load them all into a temporary table before changing them so that you still have them, but that is a waste of resources. This is where you can add in the OUTPUT clause:

DECLARE @runtime DATETIME --declare a constant runtime for this whole block
SET @runtime = GETDATE() --initialize runtime of block

DELETE currEmployees
OUTPUT @runtime, 'DELETE', DELETED.empId, DELETED.firstName, DELETED.lastName,
 DELETED.hireDate, NULL, NULL, NULL, NULL
INTO empChangeLog
FROM currEmployees e
LEFT JOIN empStaging s
 ON e.empId = s.empId
WHERE s.empId IS NULL

INSERT INTO currEmployees(empId, firstName, lastName, hireDate)
OUTPUT @runtime, 'INSERT', NULL, NULL, NULL, NULL,
 INSERTED.empId, INSERTED.firstName, INSERTED.lastName, INSERTED.hireDate
INTO empChangeLog
SELECT s.empId, s.firstName, s.lastName, s.hireDate
FROM empStaging s
LEFT JOIN currEmployees e
 ON s.empId = e.empId
WHERE e.empId IS NULL

UPDATE currEmployees
SET firstName = s.firstName, lastName = s.lastName, hireDate = s.hireDate
OUTPUT @runtime, 'UPDATE', DELETED.empId, DELETED.firstName, DELETED.lastName, DELETED.hireDate,
 INSERTED.empId, INSERTED.firstName, INSERTED.lastName, INSERTED.hireDate
INTO empChangeLog
FROM empStaging s
INNER JOIN currEmployees e
 ON s.empId = e.empId
WHERE s.firstName <> e.firstName
 OR s.lastName <> e.lastName
 OR s.hireDate <> e.hireDate


The delete and insert are pretty straightforward, but what you will notice with the update is that it exposes both a deleted and an inserted set of values. The OUTPUT clause treats an update as a delete of the old row values plus an insert of the new ones, so DELETED gives you the before image and INSERTED gives you the after image. One final way to do this is with the MERGE statement introduced in SQL Server 2008:

--
--
DECLARE @runtime DATETIME

SET @runtime = GETDATE()

MERGE currEmployees AS e --target
USING empStaging AS s --source
    ON e.empId = s.empId
WHEN NOT MATCHED BY TARGET
    THEN INSERT(empId, firstName, lastName, hireDate) VALUES(s.empId, s.firstName, s.lastName, s.hireDate)
WHEN MATCHED AND (e.firstName <> s.firstName OR e.lastName <> s.lastName OR e.hireDate <> s.hireDate)
    THEN UPDATE SET e.firstName = s.firstName, e.lastName = s.lastName, e.hireDate = s.hireDate
WHEN NOT MATCHED BY SOURCE
    THEN DELETE
--Old values (DELETED) go to the empId columns and new values (INSERTED) to the new* columns, matching the three statements above
OUTPUT @runtime, $action, DELETED.empId, DELETED.firstName, DELETED.lastName, DELETED.hireDate,
    INSERTED.empId, INSERTED.firstName, INSERTED.lastName, INSERTED.hireDate
INTO empChangeLog; --a MERGE statement must be terminated with a semicolon


And that is how you use the OUTPUT clause to capture data changes!
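If you want to sanity-check what was captured, a simple readback of the change log table we created above does the trick:

--Review what the change log captured
SELECT changeDate, changeType, empId, lastName, hireDate, newEmpId, newLastName, newHireDate
FROM empChangeLog
ORDER BY changeDate, changeType;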

NOTE: Be sure to drop the 3 tables from tempdb that we created for testing:

USE tempdb

DROP TABLE currEmployees
DROP TABLE empStaging
DROP TABLE empChangeLog


Dec 21

Assuming order of data without ORDER BY

Why are my results out of order?

Many times I have seen posts in forums asking why their data is not being returned in the order of the clustered index. Similar questions have been brought up about the order of data from a view when the view includes an ORDER BY clause (which it does not really allow except in a specific case). The fact is that you are NEVER… Let me repeat that… NEVER guaranteed the order of your result set without the ORDER BY clause in the main query. I am not going to get into the details of “why” in this post, as it can be quite detailed and confusing when you get down to how rows of data are stored on an individual page and how SQL Server retrieves the data. What I WILL say is that when SQL Server returns results, it does so in the way that it determines most efficient at run time. This is often in the order of a clustered index or the order of a view or subquery, but not always. When your applications or BI depend on the order of this data, you cannot afford to rely on luck.

I have had people tell me, “This view has returned in the proper order every time since we scripted it. Why is it all of a sudden returning a different order? You must have done something to the database or the view.” Let me ask you this: if I show you a quarter and display both the heads and tails sides to you, then flip it 100 times and it comes up heads 100 out of 100 times, do you now assume the quarter will come up heads 100% of the time? I hope not…

Below are 3 situations that demonstrate the wrong way of assuming order of results. Now, these may return the results in the same order each time you test them. However, SQL Server may someday decide that it is more efficient to return those in another way and you will get a different order.

No ORDER BY in query

Many people assume that the result set will always be returned in the order of the clustered index when an ORDER BY is not present in the query. This is not true. You can try to prove me wrong by running a query 100 times and showing me the result set, but then I refer you to my quarter. To guarantee the order of your results, you have to run a query like the one below:
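For illustration, here is a minimal sketch (dbo.Orders and its columns are just hypothetical stand-ins):

--An explicit ORDER BY on the outer query is the only way to guarantee the order
SELECT OrderID, OrderDate
FROM dbo.Orders
ORDER BY OrderID;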

ORDER BY in subquery

This one is a simple subquery that uses an ORDER BY (with a TOP, so SQL Server allows it) while the main query does not. This will not guarantee that the results come back in the same order as the subquery:
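Again, a hypothetical sketch using the same made-up dbo.Orders table:

--The ORDER BY here only affects which rows the TOP picks, not the order of the final result
SELECT OrderID, OrderDate
FROM (
 SELECT TOP 10 OrderID, OrderDate
 FROM dbo.Orders
 ORDER BY OrderDate DESC
 ) AS recentOrders;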

To fix this you would change the query to:
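Continuing the same hypothetical example:

--Repeat the ORDER BY on the outer query to guarantee the final order
SELECT OrderID, OrderDate
FROM (
 SELECT TOP 10 OrderID, OrderDate
 FROM dbo.Orders
 ORDER BY OrderDate DESC
 ) AS recentOrders
ORDER BY OrderDate DESC;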

ORDER BY on a view

First of all, SQL Server will give you an error if you try to use an ORDER BY without a TOP clause:

Msg 1033, Level 15, State 1, Procedure viewName, Line 5
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP or FOR XML is also specified.

I can only assume this is because SQL Server is trying to tell you that an ORDER BY in a view does not guarantee you the order of results. I have seen developers ignore the purpose of this error and do the following:
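The workaround usually looks something like this (viewName is from the error above; the table and columns are hypothetical stand-ins):

--Adding TOP 100 PERCENT lets the view compile, but it does NOT guarantee the order of results
CREATE VIEW viewName
AS
SELECT TOP 100 PERCENT OrderID, OrderDate
FROM dbo.Orders
ORDER BY OrderDate DESC;
GO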

This will compile and create the view, but when you SELECT from this view without an ORDER BY, you will not be guaranteed the order of results. Why does SQL Server let you compile the view in this case? Let's say you have 100 students and would like the view to only return the top 10 with regard to their grades. Using a TOP 10 with the ORDER BY will sort the results of the query and return the top 10. However, the top 10 will not be guaranteed to be in any particular order.

Summary

When you need to guarantee the order of your result set, always make sure that the ORDER BY is present and is part of the main query. Never assume the order of your result set without this.

Dec 19

Windows Requires Restart: Keeps failing on install of SQL Server 2008 even after restart

For my first content related blog post I wanted to start with something simple and something that is encountered upon installation of SQL Server. I encountered this problem a number of times on desktop machines where a number of peripherals may already be installed.

The issue is that when SQL Server is going through its prerequisite checks, it says that Windows requires a restart. So logically, you restart Windows and begin the process again, only to find that the checks require a restart of Windows again! If you are like me, you restart Windows one more time and experience the same result (I believe it was Einstein who said that insanity is doing the same thing over and over again and expecting different results). So, how do we resolve this issue? It is simple really, but requires editing the registry using regedit. If you are new to editing registry entries, please back up the registry and make sure you know how to restore it before using the steps outlined below.

  1. Start regedit.exe by typing regedit into your Windows search, or navigate to c:\Windows and double-click on regedit.exe.
  2. Navigate to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager and locate the PendingFileRenameOperations entry.
  3. Right-click on the PendingFileRenameOperations entry and click on Modify.
  4. Delete the data in the value box and click ok.
  5. Close regedit and restart the machine.
  6. Start installation of SQL Server again.

This should resolve your issue, but I have heard of instances where the value appears again after restart (this has not happened to me). In this case, you may try deleting the data in the registry key and then starting the install process without restarting the machine, or looking at the same PendingFileRenameOperations entry in ControlSet001 or ControlSet002 instead of CurrentControlSet. NOTE: These last 2 methods are untested by me, as I have not needed them.

You can also install SQL Server from the command prompt with parameters, one of which will skip the prerequisite check.  I don’t advise this as there may be other prerequisites that are vital to the proper installation.

-SQL Know-It-All
