The following is an excerpt from Chapter 18 of the book Mining New Gold - Managing Your Business Data by Penny, Jeffrey, and Gillian Garbus.
One of the biggest nightmares a business owner can face is the possibility of getting sued. The issue becomes even worse when the damages claimed are for millions of dollars. Yet, this is the type of scenario that executives can encounter if they fall victim to a database breach.
Case in point: We recently received a call from a business partner who lamented, “We have a customer, a large hospital, who complained of database performance issues. We spent $3 million of their money upgrading systems, but the performance remains the same. The CIO is worried about his job, and we are worried we are going to be sued for the $3 million. Can you help us?”
Avoiding a Database Crisis
This one incident indicates a host of problems. First and foremost, to solve a dilemma, you have to first identify it. This premise sounds simple, but I assure you, it is much easier in theory than in practice.
For example, it’s very common to throw hardware at a problem. But if hardware is not what caused the complication in the first place, adding more of it won’t solve anything. That seems obvious in hindsight, which is why it’s crucial to understand what is really involved in making changes to production database environments.
While we were onsite at a large health insurance company a few years ago, the firm had a relatively small window of six weeks to attempt a database upgrade. At an organization of any size, a database upgrade is something you test before putting it into production.
At this specific customer site, a performance issue was uncovered in the new software. A process that should’ve taken two minutes took 23 minutes. In our industry, this is unacceptable. Therefore, it had to be solved so that they could internally certify and upgrade the software. If they weren’t able to do this on time, they’d have to wait almost a year for the next rollout window.
In came the experts to save the day: Soaring Eagle Consulting. We figured out there was another option besides chasing the software bug: the code was poorly written. A small change in code, and the two-minute job could run in 2 milliseconds.
Pleased with this finding, we brought it to the customer’s attention, expecting a pat on the back, only to have the customer tell us the solution was unacceptable. With a six-week window, I had expected this would be an easy code change and was surprised to see my solution rejected. After all, it was about 60,000 times faster than the original two-minute job, and roughly 600,000 times faster than the 23-minute problem they were living with. Yet the customer said, “It takes us eight weeks to review and roll out any changes. Try again.”
The message here is for IT executives: It’s easier, politically and technologically, to make a hardware change than a software change. CIOs often throw money at a problem to make it go away because it seems more cost-effective to buy an immediate resolution than to identify the root cause and fix it.
The problem with this approach is that if hardware wasn’t the cause, then faster CPUs, more memory, or faster storage access times aren’t the cure. That hospital found out the hard way. Sometimes you have no alternative to fixing code or tuning the database for performance.
This is one place where there is no substitute for experience and expertise; note that those are two separate, mandatory requirements.
Necessary Tools to Optimize Database Performance
Once you’ve identified the root cause of the problem, you’ll need tools that enable you to measure what’s happening within your system.
There are dozens of tools of this type, and most of them are virtually useless. I’ve lost count of the number of times I’ve been brought in to solve a performance problem, only to be proudly shown the tools the shop has purchased, which turned out to be futile.
For the sake of brevity, we will not be listing those tools here. Also, we don’t want to catch flak from vendors. Additionally, while we resell the few tools we believe are effective, our picks change over time.
That said, your tool needs to do many things:
1. Measure utilization of resources over time: CPU, memory (a separate resource from CPU), storage bandwidth, and network bandwidth. That list of four is a simplification, as there are multiple metrics tracked for each, but it’s a starting list.
2. Measure the impact of database activity. This includes the ability to identify which queries are running and what they are doing. “What they are doing” in this context does not mean what report they are running, but what resources they are using and how they are using them. For example, it’s important to know if a query is CPU-bound (limited by CPU), memory-bound, or blocked by another process that holds locks on the data it needs.
3. Measure the impact of other activities as they interact. “Is my report slow because lots of folks are running it, because it’s being blocked, or because I simply can’t push that much data over the Internet?”
4. Graphically and easily identify which queries are taking up the most resources.
5. Graphically and easily identify which queries are taking up the most elapsed time.
6. Graphically and easily identify which queries are running most frequently.
Note: There’s a difference between one query running for an hour, one query running once per second for an hour, and one query running 1,000 times/second for an hour.
7. Graphically and easily identify which logins are involved in the above, critical in tracking down who did what and when.
There are other factors, but this is a solid start. If your tool can’t do these things, it’ll likely gather dust on a shelf.
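Even before you buy anything, most DBMSs expose much of this data themselves. As a minimal sketch, assuming Microsoft SQL Server, the built-in dynamic management views can rank cached queries by the measures in items 4 through 6 above:

-- Top 20 statements by cumulative CPU, with elapsed time and execution count,
-- pulled from SQL Server's plan cache (statistics reset when the cache is flushed).
SELECT TOP (20)
    qs.execution_count,
    qs.total_worker_time  / 1000 AS total_cpu_ms,
    qs.total_elapsed_time / 1000 AS total_elapsed_ms,
    qs.total_logical_reads,
    SUBSTRING(st.text,
              qs.statement_start_offset / 2 + 1,
              (CASE WHEN qs.statement_end_offset = -1
                    THEN DATALENGTH(st.text)
                    ELSE qs.statement_end_offset END
               - qs.statement_start_offset) / 2 + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;   -- swap in total_elapsed_time or execution_count

For what is running right now, including which login issued each query and who is blocking whom (item 7), sys.dm_exec_requests joined to sys.dm_exec_sessions gives the live picture; a good commercial tool records the same information over time and puts a graph on top of it.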
Fixing the Underlying Problems
Once you’ve identified the underlying problem impeding your database performance, that’s where you target resources, whether it be buying hardware or something else. As of now, the most common hardware purchases we recommend are faster disk, for when we have positively quantified an I/O bandwidth issue, and memory, also when we’ve quantified an issue. Most folks overbuy CPU, as it’s a commodity today, even with DBMS vendors pricing their wares by CPU.
Before doing either of those, though, we look to tune the application. It’s not unusual, with a few relatively minor and simple changes, to reduce CPU from 95 percent to 16 percent (a recent success), or memory requirements to a tiny percentage of what they were, or disk requests from 100 trillion/hour to 3,000/hour. We are not making that one up either.
Note that research organizations that have quantified mean time to resolution, or MTTR, for performance issues put it at roughly 80 percent identifying the problem and 20 percent resolving it. It’s another reason to invest in a tool to track down issues.
How do we do this? There are three primary tasks: query tuning, architecture, and index selection. Each of these steps is discussed in detail below.
Query tuning
What happens in real life (that is, outside the classroom) is that software developers are good at creating logical, sensible, procedural code that is relatively easy to maintain. This last part is important. CIOs and CTOs will tell you that the average cost of maintaining an application is seven times the cost of developing it. Restated, if you spend $1 million building a software application, you’ll spend $7 million maintaining it before the application gets replaced. The easier the code is to maintain, the lower you can keep that number, because business rules and needs always change.
Here’s a typical problem: The people who write your applications generally do not understand database performance, and as a result, they often create artistic examples of sophistry: code that looks logical but is flawed because its authors don’t understand how the underlying technology works.
Solution: have your DBAs review the code before any data manipulation language (DML) accesses anything in your production database applications.
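As an illustration of the kind of thing such a review catches, here is a minimal sketch, assuming SQL Server and a hypothetical orders table with an index on order_date. The first query is perfectly logical, but wrapping the indexed column in a function forces the DBMS to compute that function for every row; the rewrite lets it seek straight to the rows it needs.

-- Logical-looking but slow: the function on order_date prevents an index seek,
-- so every row in the table gets read and evaluated.
SELECT order_id, total
FROM   orders
WHERE  DATEDIFF(DAY, order_date, GETDATE()) <= 30;

-- The rewrite compares the bare column to a precomputed cutoff date, so the
-- optimizer can use the index on order_date to jump directly to the last 30 days.
SELECT order_id, total
FROM   orders
WHERE  order_date >= DATEADD(DAY, -30, CAST(GETDATE() AS DATE));

Both versions read sensibly to a developer; only the second one reads sensibly to the optimizer.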
Architecture
The architecture of your DBMS is hardware and application dependent, but there are architectural decisions that can drag performance down. Here’s a short list.
High availability is a buzz phrase that has made the rounds, and it means that if your primary DBMS fails, you have another one to back it up, often automatically. For example, if the primary server is in Miami, and Miami gets hit by a hurricane, are your users in Chicago going to be happy that, because of a hurricane that perhaps cut off power or an Internet line, they’ll be unable to do their jobs?
There are ways you can avoid this issue, many of them popular and successful today. Some of them are being done, though, with an eye toward perfection rather than performance.
As an illustration, say you have a primary data center in Miami, with a failover in Chicago. It’s reasonable to assume that a hurricane hitting Miami won’t take out the Chicago data center, and that an ice storm taking out the Chicago data center will have no effect on Miami. Good decision so far.
While having data in two different sites is ideal, have you considered the speed of light? How is that relevant, you ask?
Well, pushing information across a fast line from Miami to Chicago might take 30ms. The signal coming back that says “Got it!” takes the same 30ms; a 60ms round trip doesn’t seem like a lot. In fact, as an exercise for the reader, wait 60ms before you read this next sentence. Granted, 60ms is a bit below our threshold of awareness, but multiply it by thousands of processes per second: at 1,000 synchronous commits per second, that is 60 cumulative seconds of waiting every second, which means dozens of sessions are sitting in the queue at any given moment. That backlog can end up affecting performance on the primary server.
The above example assumed synchronous data commitment; in other words, a change made at one site could not complete before the change was made permanent at the remote site.
What, though, if we chose asynchronous data commitment? With asynchronous commitment, we make changes permanent on the primary without regard to whether the secondary is up, and trust the changes to make it to the secondary. That means there can be a lag between the primary and the secondary, so if the primary suddenly becomes unavailable and you fail over, roughly 30ms worth of committed data can be lost.
For some places and times, this is no issue. Users will know there was a failover, and make sure whatever they were working on made it. For others, though, that loss is unacceptable. For example, if you are transferring money from a checking to a savings account, and an automatic withdrawal fails because of that, your payments will be late, or you won’t have available cash when you need it. There would be lots of unhappy campers, to put it mildly.
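In many DBMSs this is a configuration choice rather than a rewrite. As a minimal sketch, assuming a SQL Server Always On availability group, with a hypothetical group name and Chicago replica name, the same pair of servers can be switched between the two behaviors:

-- Synchronous commit: every transaction on the Miami primary waits for the
-- Chicago replica to harden the change (the 60ms round trip above), but a
-- failover loses no committed data.
ALTER AVAILABILITY GROUP [SalesAG]
    MODIFY REPLICA ON 'CHI-SQL01'
    WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);

-- Asynchronous commit: the primary no longer waits, so the latency penalty
-- disappears, but transactions committed in roughly the last 30ms can be lost
-- on failover. Automatic failover is not allowed in this mode, so failover is
-- set to manual first.
ALTER AVAILABILITY GROUP [SalesAG]
    MODIFY REPLICA ON 'CHI-SQL01'
    WITH (FAILOVER_MODE = MANUAL);
ALTER AVAILABILITY GROUP [SalesAG]
    MODIFY REPLICA ON 'CHI-SQL01'
    WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);

Which mode is right depends entirely on how much data your business can afford to lose versus how much latency it can afford to add.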
Indexing
Indexing is the lifeblood of the database performance expert.
We hope you’re old enough to remember phone books, as we haven’t come up with a better metaphor for database b-tree indexes. (Despite the name, the “b” doesn’t stand for binary; these indexes are balanced, multi-way trees, with each level branching out to many pages rather than two.)
So, think about a phone book. It’s an index organized by last name, first name, middle name, which allows you to look up an address and telephone number.
This phone book structure is key to getting information quickly from large databases.
Consider a table with 100 million rows of data in it. This is not the least bit unusual anymore. A 100-million-row table used to be considered large, but we have customers with 20 employees whose tables are far larger, on the order of 10 to 100 times bigger. So, how do you want to find your address and phone number? By checking every row? Or by using the phone book?
We would choose the phone book most of the time (not all the time; this is where database performance experts earn their money), as we can get to the information we want with just a few (8-10) page requests (a page is the basic unit of I/O, or input/output), rather than millions.
That telephone book (that is, the index) is going to need to be constructed to meet the needs of each query that runs against the database. Stated technically, any query the DBMS needs to optimize and process will need a matching index to avoid a table scan, the systematic reading of the entire table.
Matching indexes to queries is both a science and an art, and it can have a dramatic effect on query performance; we have reduced a 48-hour query to 23 seconds by modifying an index.
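To make the phone-book metaphor concrete, here is a minimal sketch, assuming SQL Server and a hypothetical customers table:

-- A hypothetical 100-million-row table queried like a phone book.
CREATE TABLE customers (
    customer_id  BIGINT       NOT NULL PRIMARY KEY,
    last_name    VARCHAR(60)  NOT NULL,
    first_name   VARCHAR(60)  NOT NULL,
    middle_name  VARCHAR(60)  NULL,
    phone        VARCHAR(20)  NULL,
    address      VARCHAR(200) NULL
);

-- Without a supporting index, this lookup scans all 100 million rows.
SELECT phone, address
FROM   customers
WHERE  last_name = 'Garbus' AND first_name = 'Penny';

-- The "phone book": a b-tree index ordered by last name, then first name,
-- then middle name, carrying the columns the query needs. The same lookup
-- now touches only a handful of index pages instead of the whole table.
CREATE NONCLUSTERED INDEX ix_customers_name
    ON customers (last_name, first_name, middle_name)
    INCLUDE (phone, address);

Because the INCLUDE columns cover the query, the lookup never has to touch the base table at all; whether that is worth the cost of maintaining the index on every insert and update is exactly the kind of judgment call the performance expert gets paid for.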
Case Study
We recently went into a conglomerate to work on database performance. While there, we were asked to work briefly on another system, which had 3,000 users downloading data every night from main servers to their handheld devices. This was taking 30 minutes, and in some cases failing because of timeouts or lost connectivity.
We made index selections, which immediately went into test. This was a small part of what we did that week, but we found out later it had an impact. We were asked for a follow-up call a few weeks later, which we thought was going to be with a few members of their technical staff asking questions about our report.
There were 27 people on the call, all wanting to say thank you.
Apparently, they rolled out changes for the 3,000 users without telling them performance would improve, as they didn’t want to set expectations and then have them dashed. Once we worked our magic, though, the job dropped to well under two minutes. About 3,000 users called that night to ask if something went wrong. It was the best 3,000-call night they’d ever had.
Key Takeaways
There are a lot of factors in database performance, some of which we’ve discussed here. Physical resources are a component, but over the years, we’ve seen hundreds of situations where throwing hardware at a problem not only didn’t solve it, but sometimes masked it and created other problems, often more urgent.
Get expertise early, when you expect your business to grow or your data volumes to change dramatically.