Once in a while I get requests for some information about reading and analyzing an AWR report. I have been thinking for a long time about writing such a post, but always postponed it as it is a very tricky topic. The AWR (or statspack for that matter) report is huge and contains so much information that it’s easy to get lost. It also requires a lot of knowledge about the database and the different mechanisms so it’s very difficult to explain all of this in a blog post (or even a series of posts). In this post I’ll try to start from the beginning, explaining a little bit about the AWR report and the analysis process and we’ll see where it takes us.
In this post I will talk about AWR reports, but statspack reports contain basically the same information. There are some significant differences between AWR and statspack, but the reports themselves are quite similar, so whenever I write "AWR report" I mean a statspack report as well. If I mention something that doesn't exist in statspack, I'll say so explicitly.
What is an AWR Report?
An AWR report is basically a huge list of everything that happened in the database between 2 points in time. It is generated for a pair of snapshots and lists everything that happened in between. The report is huge, and Oracle keeps adding more information and more details to it with every version.
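For reference, this is how such a report is usually generated. The dbid, instance number and snapshot IDs below (1234567890, 1, 100, 101) are placeholders you would replace with your own values; statspack users run @?/rdbms/admin/spreport.sql instead.

```sql
-- Interactive way: run the supplied script from SQL*Plus and answer
-- the prompts (report type, number of days, begin/end snapshot IDs):
@?/rdbms/admin/awrrpt.sql

-- Non-interactive way: call DBMS_WORKLOAD_REPOSITORY directly.
-- Get the dbid from V$DATABASE, the instance number from V$INSTANCE,
-- and the snapshot IDs from DBA_HIST_SNAPSHOT:
SELECT output
FROM   TABLE(DBMS_WORKLOAD_REPOSITORY.AWR_REPORT_TEXT(1234567890, 1, 100, 101));
```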
One important thing to remember is that the report is an overview, and its goal is to provide information about overall database activity. It is very difficult to analyze a specific application flow from it, and it is not meant for tuning a single SQL statement; there are different tools for those tasks. We generate and analyze an AWR report when we want to see the overall database behavior, or when we feel that the database doesn't perform well in general.
The beginning of the report is very important, as it's an overview of the database. The first thing I do when I get an AWR report from a customer is to check the database version and platform, whether it's RAC, and the duration of the report.
The report uses a lot of totals and averages (mostly per second and per transaction). For example: how many commits happened during the report window, how much IO, how much redo and so on. If the report duration is too long, we lose some of this information. A 24-hour report can show 10 SQL executions per second (about 860,000 in total). That's really not a lot, but what if all 860,000 executions happened in a single hour and the database was completely idle during the other 23? That's 240 executions per second, which is a lot more than 10. I usually prefer an AWR report covering no more than 1-2 hours (30 minutes to 1 hour is usually good).
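To pick a reasonably short window, it helps to look at the available snapshots first; a minimal sketch (the 30-minute interval is just an example):

```sql
-- List recent snapshots so you can pick a begin/end pair that
-- brackets the interesting period:
SELECT snap_id, begin_interval_time, end_interval_time
FROM   dba_hist_snapshot
ORDER  BY snap_id;

-- If the automatic snapshot interval is too coarse, it can be
-- shortened (the value is in minutes; the default is 60):
EXEC DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS(interval => 30);
```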
DB time counts all the time that sessions spent either waiting for something (so, not idle time) or running on CPU. It often confuses people to see that DB time is higher than the elapsed time of the report, but this happens quite often and even makes sense.
For a start, in a multi-CPU system, several operations can happen at the same time. So if during 1 second, 2 different sessions are running on different CPUs doing something, Oracle counts that as 2 seconds of DB time. Another scenario is when a session locks a row and 5 other sessions are waiting for this row for 1 second at the same time. In this case the DB time will be 5 seconds, even though only 1 second has passed.
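The counter behind this accounting is visible in the time model views; the DB time in an AWR report is simply the difference in this counter between the two snapshots:

```sql
-- Cumulative since instance startup, in microseconds; an AWR report
-- shows the delta between the begin and end snapshots:
SELECT stat_name, ROUND(value / 1e6) AS seconds
FROM   v$sys_time_model
WHERE  stat_name IN ('DB time', 'DB CPU');
```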
In the Load Profile table we see the activity of the database during the report window. Note that these are averages (per second and per transaction) and don't necessarily reflect the real numbers (the maximum might be much higher than the average).
Also, it will take some time to understand if these numbers are high or low, and it completely depends on the specific database. Is 100MB redo per second a lot? Is it OK to have 200 SQL executions per second? The key is to know your database and compare these statistics between reports from different time windows.
Another comment here is that these numbers might be misleading. I once saw 2 reports from the same database: one showed 200MB of redo and 500 SQL executions per second, the other 100MB of redo and 200 SQL executions per second. Does that mean the database was more idle during the 2nd report? It might look this way, but in fact the server and storage were much more loaded during the 2nd report, and that was why the database simply couldn't manage to do as much work as it needed to.
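For exactly this kind of side-by-side comparison, Oracle also ships a "Compare Periods" report that prints the statistics of two snapshot windows next to each other, so differences like the one above stand out immediately:

```sql
-- Prompts for two begin/end snapshot pairs and produces a
-- diff-style report comparing the two periods:
@?/rdbms/admin/awrddrpt.sql
```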
I think that's enough for today's post. In the next one I'll discuss the next step: understanding what takes time and what the database is doing, and identifying potential problems.