I had a bad night last Thursday.
After patching two test RAC databases and one production RAC with 180417 DB Bundle Patch (and some one-offs), I got to the point where it was time to update the most critical RAC system.
We were really looking forward to this as we had hit a few bugs that this DBBP and one-offs should fix. But boy, did that go wrong…
It was scheduled for evening/night and I got approval to start at 7pm. This is a 5-node RAC (12.1 without any patches) and it was supposed to take a bit more than an hour per node, so I was looking at about 7 hours.
When I started, everything went really smoothly. I’ve done it a few times before, on similar systems, so I just hoped I wouldn’t run into any issues while patching.
At about 3am, while I was patching the last node, the customers called to complain that nothing seems to be working. The patches are rolling patches, so with careful planning we promised a “no down-time” maintenance window. But they claimed that EVERYTHING is down.
We started investigating and saw that the load average on the servers had skyrocketed with 100% CPU (and these are not small servers), while sessions were stuck on library cache latches and mutexes.
We decided to finished the patch and see if things calm down. They didn’t.
After a while I opened an SR and in parallel, contacted Mike Dietrich. Mike tried to help but unfortunately, he wasn’t next to a computer to see the SR and trace files, so his assistance was limited. He did give me moral support, which was helpful (and actually suggested a real solution as well which we didn’t implement, I’ll explain soon).
In the meantime, we decided to move to DR, as this system was completely useless. The guys (IT, developers, etc.) did a great job and the system was up and running on the DR side (it wasn’t perfect, but it worked).
After another couple of hours, the solution came from Oracle support. Since this was a library cache issue (which is hard to investigate) and with Oracle support’s reputation, I didn’t expect a fix from them so quickly, but it didn’t take long for the engineer to realize that it’s probably an unpublished bug (25058954) related to SQL Plan Directives and he gave me instructions to turn it off. After the fix our load test went from 8 executions per minute to more than 20,000!
The funny thing is that while talking to Mike he suggested a fix (disabling adaptive features, you can read more about it here), but then he realized that in our database the optimizer_features_enable is set to 18.104.22.168, which means that these features should be off anyway, so we ended up not doing that. The solution we used eventually was to disable SQL Plan Directive, which is a 12.1 feature and it should have been disabled, but for some reason it wasn’t. The fix that Mike suggested should actually have worked, if he hadn’t ruled it out because of the OFE parameter. Later, in the SR, the engineer admitted that he hadn’t noticed the OFE setting, otherwise he wouldn’t have suggested it, thinking that it was already off (as Mike did).
The frustrating part is that I don’t know how we could have avoided this. It’s the same system with the same code, running on the same environment as the test and the other production (give or take, the critical one has more nodes and stronger servers, and more load, but same OS, same versions and everything). We installed the latest DBBP (when we started rolling this out, it was before July patches were published). We used the DBBP and not the PSU as it has more bug fixes (some of which we needed), and still a degradation bug brought a production system down
At least Oracle support saved the day by providing an accurate fix quickly and I’m thankful for that.