Bug When Patching Production

I had a bad night last Thursday.

After patching two test RAC databases and one production RAC with 180417 DB Bundle Patch (and some one-offs), I got to the point where it was time to update the most critical RAC system.

We were really looking forward to this as we had hit a few bugs that this DBBP and one-offs should fix. But boy, did that go wrong…

It was scheduled for evening/night and I got approval to start at 7pm. This is a 5-node RAC (12.1 without any patches) and it was supposed to take a bit more than an hour per node, so I was looking at about 7 hours.

When I started, everything went really smoothly. I’ve done it a few times before, on similar systems, so I just hoped I wouldn’t run into any issues while patching.

At about 3am, while I was patching the last node, the customers called to complain that nothing seemed to be working. The patches are rolling patches, so with careful planning we had promised a “no-downtime” maintenance window, but they claimed that EVERYTHING was down.

We started investigating and saw that the load average on the servers had skyrocketed, with 100% CPU utilization (and these are not small servers), while sessions were stuck on library cache latches and mutexes.
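As a rough illustration (this is a generic triage sketch, not necessarily the exact query we ran that night), grouping active sessions by wait event is a quick way to spot this kind of library cache storm:

```sql
-- Sketch: count non-idle active sessions by wait event.
-- In an incident like this, events such as "library cache: mutex X"
-- and "library cache lock" would dominate the top of the list.
SELECT event, COUNT(*) AS sessions
FROM   v$session
WHERE  status = 'ACTIVE'
AND    wait_class <> 'Idle'
GROUP  BY event
ORDER  BY sessions DESC;
```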

We decided to finish the patch and see if things calmed down. They didn’t.

After a while I opened an SR and, in parallel, contacted Mike Dietrich. Mike tried to help but unfortunately, he wasn’t next to a computer to see the SR and trace files, so his assistance was limited. He did give me moral support, which was helpful (and he actually suggested a real solution as well, which we didn’t implement; I’ll explain shortly).

In the meantime, we decided to move to DR, as this system was completely useless. The guys (IT, developers, etc.) did a great job and the system was up and running on the DR side (it wasn’t perfect, but it worked).

After another couple of hours, the solution came from Oracle support. Since this was a library cache issue (which is hard to investigate), and given Oracle support’s reputation, I didn’t expect a fix from them so quickly, but it didn’t take long for the engineer to realize that it was probably an unpublished bug (25058954) related to SQL Plan Directives, and he gave me instructions to turn them off. After the fix, our load test went from 8 executions per minute to more than 20,000!

The funny thing is that while talking to Mike, he suggested a fix (disabling the adaptive features, you can read more about it here), but then he realized that in our database optimizer_features_enable is set to 11.2.0.2, which means that these features should be off anyway, so we ended up not doing that. The solution we used eventually was to disable SQL Plan Directives, a 12.1 feature that should have been disabled by the OFE setting, but for some reason wasn’t. The fix that Mike suggested would actually have worked, had he not ruled it out because of the OFE parameter. Later, in the SR, the engineer admitted that he hadn’t noticed the OFE setting; otherwise he wouldn’t have suggested his fix either, thinking (as Mike did) that the feature was already off.
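For reference, here is a sketch of how you would check these settings and see whether directives are being generated. The underscore parameter shown is the one commonly cited in MOS guidance for switching SQL Plan Directive usage off in 12.1; I’m including it only as an illustration, and hidden parameters should only ever be set under the guidance of Oracle Support:

```sql
-- Check the settings that were relevant in our case
SHOW PARAMETER optimizer_features_enable
SHOW PARAMETER optimizer_adaptive_features

-- See whether SQL Plan Directives have been generated despite the OFE setting
SELECT state, COUNT(*)
FROM   dba_sql_plan_directives
GROUP  BY state;

-- Illustration only: the hidden parameter commonly referenced for
-- disabling SQL Plan Directive usage in 12.1. Set it only with
-- Oracle Support's blessing, as with any underscore parameter.
ALTER SYSTEM SET "_optimizer_dsdir_usage_control" = 0;
```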

The frustrating part is that I don’t know how we could have avoided this. It’s the same system with the same code, running on the same environment as the test systems and the other production (give or take: the critical one has more nodes, stronger servers, and more load, but the same OS, the same versions, and everything else). We installed the latest DBBP (when we started rolling this out, it was before the July patches were published). We used the DBBP and not the PSU as it has more bug fixes (some of which we needed), and still a degradation bug brought a production system down.

At least Oracle support saved the day by providing an accurate fix quickly and I’m thankful for that.


3 thoughts on “Bug When Patching Production”

  1. Dear Liron,

    – Please note the difference between PSU and DBBP (BP). The DBBP is NOT a PSU.

    – I cannot understand why anybody would run a critical production system without any patches. I would never put a system into production without waiting at least one quarter for the first BP.

    – Release 12.1.0.2 is an unfortunate child. So many bugs, so many attempts to fix them, then more bugs. In my opinion, it is important to follow note 2312911.1 very closely. In addition, my recommendation is to completely disable SQL Plan Directives, because they are not used when following MOS 2312911.1 anyway, but are still generated.

    You can find a list of the most critical bugs I encountered with 12.1.0.2 at http://www.ora-solutions.net/web/resources/patch-recommendations/patch-recommendation-oracle-database-12-1-0-2/

    Unfortunately, at this point there are still unmerged conflicts and issues with the April DBBP and other critical bugfixes. The same goes for the July DBBP.

    Be sure to check out these notes/links:
    – 2358436.1
    http://www.ora-solutions.net/web/2018/06/11/oracle-database-proactive-bundle-12-1-0-2-april-2018-changes-datapump-file-format-number/
    – Alert – DataPump Victim: LOB Regression Introduced in 12.1.0.2.180417DBBP Generates ORA-03106 When Importing (IMPDP) Package Bodies (Doc ID 2424272.1)
    – Tablespace Monitoring in Cloud Control no longer works after applying DBBP April 2018 on 12.1.0.2. The bug is still in status 11: TABLESPACE_SIZE IN DBA_TABLESPACE_USAGE_METRICS INCORRECT AFTER 27338041 (Bug ID 28001649)

    Good luck anyway.

    Regards,
    Martin


    • Hi Martin,
      Thanks for the response.
      – I know the difference; sorry for the confusing phrasing, I will fix it in the post.
      – As a consultant who started working with this environment a few months ago, it wasn’t my call. I wanted to install patches a while ago, but didn’t get a maintenance window.
      – Thanks for note 2312911.1, it is important. However, as Mike and the support engineer said, because my OFE is 11.2 it shouldn’t affect the system. Also, this didn’t happen in any other environment.

      Thanks again for all the information, it is really valuable.
      Liron


  2. A Bundle Patch requires more testing than a Patch Set Update.
    This happens all the time because testing is not an apples-to-apples comparison.
    It is hard to avoid unless there is an exact copy of PROD and RAT.
    There are all too many things to check.
    I have encountered checksum mismatches for the Oracle home since the January 2017 patching.

