computer data code assessment

The following is an excerpt from Chapter 22 of Jeffrey Garbus’ book Transact SQL which can be purchased here

SQL Code reviews, are much like changing accountants!  HUH?  Have you ever changed accountants? The first thing the new accountant does is explain to you, in detail, how stupid your previous accountant was, how much the old one cost you, and how much the new one is going to save you.

There are many reasons why SQL code can be deemed poor, but the primary determining factor for declaring code to be bad is the answer to one of two questions:

1. Performance:

Do my SQL statements retrieve data in the most efficient method possible?

2. Maintainability:

Can the next person who comes along understand what I did well enough to be able to easily modify the SQL code?

Regarding performance, I’d like to share a related story. I was teaching data modeling to a group of recent Ivy league grads at a major New York financial institution. This company recruits from the top 10 percent of the Ivies. This is a group of “kids” who have excelled academically for about 20 years. They are smart, analytical, and critical thinkers. I was talking about the assignment and made the following statement: “There may be more than one right answer. There is ‘meets requirement’ and ‘doesn’t meet requirements’; anything else is moot.”

What an upset group!

If you’re interested in right or wrong, you’ve chosen a tough field. There are many ways to answer almost any question with SQL. Many approaches may yield exactly the same answer, same amount of I/O, same everything… in which case, we ask the question What’s easiest to maintain? Statistically, it costs your company seven times more to maintain SQL code over time than it costs them for you to write it. See if you can make it six or five.

This chapter will focus on some common performance issues, as well as some common maintenance issues.

When you ask yourself, “What has made me crazy about somebody else’s SQL code?” You want to be able to say, “I wouldn’t do that to the next person.”

Why Is That There?

Let’s start with the basics: comment your code. Often SQL code that is considered to be “bad” is simply the result of necessity (often a tight deadline) combined with poor commenting practices. Every coder knows that the comments are a “best practice,” but few coders consistently employ those best practices. Consider the following code:

Select make,

sum((1sign(abs(2006datepart(yy,salesdate))))*sales)

‘2006’,

sum((1sign(abs(2007datepart(yy,salesdate)))) * sales)

‘2007’,

sum((1sign(abs(2008datepart(yy,salesdate)))) * sales)

‘2008’,

sum(sales) total

from car_sales

group by make

Now, consider the same code with comments included:

/*

Because the case statement was adding too much overhead, we rewrote the function with a characteristic function, which knocked 25% off the execution time.

The purpose of the characteristic function is to make the function evaluate to 1 if true, 0 if not true. We want to sum only values for the correct year. So, for the 2008 year, we:

1) Take datepart (yy,saledate)

To isolate the year part of the date

2) Subtract this from 2008: (2008 – datepart (yy,saledate)))

To create a single value result, so that we get one result for a year of 2008 and another result for anything else

3) Take the absolute value,abs((2008datepart (yy,saledate)))

So that we don’t care whether this is greater than or less than 2008, as long as it’s not equal

4) Take the sign of this value (sign returns + 1, -1 or 0 based upon whether the number is positive, negative, or zero: sign(abs((2008datepart(yy,saledate))))

Now we have a value that evaluates to 0 if it is 2008, 1 if it isn’t, which is the opposite of what we’re looking for, so…

5) Subtract that value from 1, which will change the flag value…: (1-sign(abs((2008datgepart(yy,saledate)))))

And now we have a function that yields as expected

That may not be a lot easier to understand, but at least you know there was a method behind the madness.

Coding Standards

Coding standards should be established at the beginning of any project; while the true value of coding standards is often not realized until working with a team of developers, the practice is invaluable in debugging code (particularly SQL code that has aged over a period of time). Good coding standards involve things like standard naming conventions, comment structures, error handling, debug syntax, block formatting, and keyword case.

Bad SQL code misuses coding standards.

Here’s another personal platitude: It doesn’t matter what your coding standard is, as long as you have one.

SELECT *

One of the most common poor “practices” For SQL coding is to use the asterisk shortcut rather than specifying a specific list of columns to be returned from a SELECT statement:

SELECT * FROM [AdventureWorks].[Person].[Contact]

Although the use of the asterisk certainly saves keystrokes, there are several reasons to avoid the use of it in a production environment. First, the use of column names assures that the result sets will not change if the underlying schema is altered; if a column is removed, the code will break. If you use SELECT *, and the schema is changed to add or remove a column, the result set will be altered as well and dependent applications my break. Even if every column from every table involved in the SELECT statement is being returned, preserving the expected structure (and avoiding brittle code) is essential.

Second, using SELECT* may also have an impact on network or disk I/O; if unnecessary columns are being retrieved, then more data than necessary is being read from the hard disk and pushed across the wire. While there may be some queries that require every column to be returned, the majority of SQL queries will not, and the database engine will perform best when it does only the work that is required. Finally (and related to the second point), focusing the SELECT statement to return only the required columns may benefit the query optimizer; if a table has multiple indexes and your SELECT statement with specific column names is covered by one of those indexes, the optimizer will probably use that index. If the SELECT statement all columns from a table (as in SELECT*), the optimizer may not choose the most efficient index option.

Write it like this:

SELECT [ContactID]

,[NameStyle]

,[Title]

,[FirstName]

,[MiddleName]

,[LastName]

,[Suffix]

,[EmailAddress]

,[EmailPromotion]

,[Phone]

,[PasswordHash]

,[PasswordSalt]

,[AdditionalContactInfo]

,[rowguid]

,[ModifiedDate]

FROM [AdventureWorks].[Person].[Contact]

GO

Not only is this an easy-to-read format, but the addition of columns to the table will have no impact on your code.

PARSING

Consider the following T-SQL Statement:

SELECT LEFT(Name, CHARINDEX(‘ ‘, NAME 1) as NameFirst,

REVERSE(LEFT (REVERSE(Name), CHARINDEX ( ‘ ‘, REVERSE(NAME) 1) AS NameLast

FROM Authors

It’s a simple SELECT statement that returns the names of all the authors in the Authors table; however, the Authors table stores the Author’s complete name in a single column, violating the basic rule of atomicity for database design. In order to determine the last name of the author, the name column has to be parsed.

Parsing not only adds performance considerations, it may also return bad data if the actual data doesn’t match the expectations of the parsing statement. In this example, the code will take a string that contains a space and splits it into two pieces.

What happens with the following values?

Stuart R. Ainsworth

Stuart Ainsworth

Ainsworth, Stuart

Mary Jones Powell (assuming that Jones Powell is the last name)

Or what about

John MC BRIDE

Parsing may also lead to poor performance if the parsing is necessary for the searching via a WHERE clause or inclusion in a JOIN. If the required elements are not at the beginning of the value, the SQL optimizer will not be able to use indexes on their column to return the appropriate data, as noted in the following example:

–Create a table of products for our sample, Including a unique index on the SmartLabel column

CREATE TABLE Products (ProductID int IDENTITY(1,1) PRIMARY KEY,

SmartLabel (Char (10))

GO

CREATE UNIQUE NONCLUSTERED INDEX ux_Prodcuts_SmartLabel On Products (SmartLabel)

GO

/*

Insert a list of products, using SmartLabel syntax.

Note that the first two letters represent a size, the next three letters represent a batch, and the final three letters represent the actual part number within each size and batch.

*/

INSERT TWO Products (SmartLabel)

SELECT ‘SM-KIT-123’

UNION ALL

SELECT ‘SM-KIT-456’

UNION ALL

SELECT ‘SM-KIT-789’

UNION ALL

SELECT ‘LG-KIT-123’

UNION ALL

SELECT ‘LG-KIT-456’

UNION ALL

SELECT ‘LG-KIT-789’

UNION ALL

SELECT ‘SM-MAT-123’

UNION ALL

SELECT ‘SM-MAT-456’

UNION ALL

SELECT ‘SM-MAT-789’

UNION ALL

SELECT ‘LG-POV-123’

UNION ALL

SELECT ‘LG-POV-456’

*/

–Select a specific part; note the index seek in the execution plan

SELECT ProductID, SmartLabel

FROM Products

WHERE SmartLabel = ‘SM-MAT-123’

–Select a specific size; note the index seek in the execution plan

SELECT ProductID, SmartLabel

FROM Products

WHERE SmartLabel LIKE ‘%MAT%’

Over Normalization

In a SQL database, the table is the most basic element; it’s the data storage mechanism that is used to hold information about the entities being stored in the database. In most cases, the table represents the entity (e.g. a Customers table holds basic information about Customers, A Vendors table represent the Vendor entities, etc). Columns in a table represent various attributes about that entity; Customers will have a name, as will Vendors. Items sold will also have names, as do their components. Each of the entities has an attribute (name), but those attributes have no relationship to each other.

Over normalization is an inappropriate design pattern that leads to organizing table structures by attribute rather than by entity; this design pattern will often lead to complex SQL statements, like so:

SELECT n1.Name as Customer,

N2.Name as part,

N3.Name as Vendor

FROM Sales s

Join Names n1 ON s.CustomerID = n1.ID

JOIN Names n2 ON S. PartID = n2.ID

JOIN Names n3.On S. VendorID = n3.ID

Note that in order to determine basic information about the relationship between Customers, Parts, and Vendors (even though that information is stored in a single table), we are forced to include a reference to a single table (Names) multiple times. Ignoring the fact that this jeopardizes the readability of the code, multiple references to this single table will incur a significant performance hit as the database size increases.

SQL Injections

SQL injection is a common attack point from outside the application. We don’t want to turn this into a lesson on creating openings or on sealing them, as there’s already plenty of literature on that, so let’s simply recommend that you clean output prior to passing it into the applications, and be aware that there some folks out there who will hack your system just for laughs.

System Tables

We’ve recently seen a spate of examples of folks using system tables to store schema description information. This is not in and of itself a bad idea, but then there are already heavily used database queries that constantly query the system tables.

System tables are already heavily used; we don’t recommend adding to the load.

Cursors

Where possible, avoid Cursors, (we have more in our books, courses, and webinars on this topic)

Multiple Passes Through Data

Avoid multiple passes through data where possible. I’ve seen way too many routines that transfer data back and forth through temporary tables before returning data bring data back with a single pass, only return data you need, and filter data prior to bringing it out of the database, not after.

Working Too hard

Finally, keep it simple. If your SQL code starts to look or feel clunky, it probably is. Take a few minutes, do something else, and come back to it. If you can’t come up with a better solution, phone a friend.

Summary

Make the next person’s life easy when you can. The next person might be you