Diagnosing Listener Issues in RAC

I’ve written a few “diagnostics” posts in the past (you can see them all under the “Troubleshooting” category). People really like this type of post, and I have received many comments about them (good ones…), so here is another case I ran into not too long ago.

I’d like to thank Kadhir Velavan, who brought this to my attention and worked on the issue with me.

SCAN Listeners Error

This issue happened on a RAC database, version 12.1. This RAC database has two instances, one on server1 and one on server2. While we were working on this, server1 had two of the SCAN listeners (SCAN1 and SCAN3) while server2 had the last one (SCAN2).

The issue started when we saw that “crsctl status res -t” on server1 showed both SCAN listeners in the “Not All Endpoints Registered,STABLE” state. Checking the SCAN listener logs showed “Address already in use”.

The message is fairly self-explanatory: some other process is already listening on the same IP and port. Remember that only the combination of IP and port must be unique. If the server has multiple IP addresses, processes can listen on the same IP and different ports, or on different IPs and the same port, but two processes cannot listen on the same IP and the same port. So we had to find the other process. The netstat command can do that: look for sockets in the listening state (shown as LISTEN on Linux) on the IP and port in question. In this case we looked for the SCAN1 and SCAN3 IP addresses, and surprisingly the process listening on them was the local listener. Why was it listening on the SCAN IPs? Good question.
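
The uniqueness rule is easy to reproduce outside Oracle. The following is a minimal Python sketch, not anything the listener itself runs: `try_listen` is a hypothetical helper, and the loopback address stands in for a real server IP.

```python
import errno
import socket


def try_listen(ip, port):
    """Try to listen on (ip, port); return (socket, None) or (None, errno)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((ip, port))
        sock.listen(1)
        return sock, None
    except OSError as exc:
        sock.close()
        return None, exc.errno


# First listener: port 0 asks the OS to pick any free port on this IP.
first, _ = try_listen("127.0.0.1", 0)
port = first.getsockname()[1]

# Same IP and same port: refused with "Address already in use".
second, err_same = try_listen("127.0.0.1", port)
print("same IP, same port rejected:", err_same == errno.EADDRINUSE)

# Same IP but a different port: allowed.
third, err_other = try_listen("127.0.0.1", 0)
print("same IP, other port accepted:", err_other is None)

for sock in (first, third):
    sock.close()
```

The second attempt fails with the same operating system error that ended up in the SCAN listener logs.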

At this point we went to check whether MOS had something interesting to say. MOS is really great when you have a well-defined message or error to search for, so I logged in and searched for the “Not All Endpoints Registered,STABLE” message. I found a note about listeners configured manually in listener.ora, but this was not the case here. Then I found note 1667873.1, which explains that this can happen because of an incorrect hosts file configuration. And indeed this was the case.

If you are working with Linux/Unix (I don’t know how Windows behaves), duplicate entries in /etc/hosts are illegal. They usually don’t cause any problems, but sometimes they do (and RAC is very sensitive about host naming and IPs).

What does a duplicate entry mean? The hosts file is responsible for translating hostnames to IP addresses. It doesn’t care if many hostnames are translated to the same IP (on the same line or on different ones), but when the same hostname is translated to different IPs, it’s a problem.

I’ve seen quite a lot of servers where the real hostname is configured in the 127.0.0.1 line as well as the real IP in /etc/hosts. For example:

127.0.0.1 localhost localhost.localdomain server1 server1.mydomain
10.10.10.10 server1 server1.mydomain

This duplicate translation (server1 is translated to both 127.0.0.1 and 10.10.10.10) is illegal.
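
A check for this kind of duplicate is easy to script. Below is a minimal Python sketch, assuming the standard hosts layout (an IP followed by one or more names, with `#` starting a comment); `find_duplicate_hostnames` is a hypothetical helper name, and the sample text is the example above.

```python
from collections import defaultdict


def find_duplicate_hostnames(hosts_text):
    """Return {hostname: sorted IPs} for names that map to more than one IP."""
    ips_by_name = defaultdict(set)
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            ips_by_name[name.lower()].add(ip)
    return {n: sorted(ips) for n, ips in ips_by_name.items() if len(ips) > 1}


sample = """\
127.0.0.1 localhost localhost.localdomain server1 server1.mydomain
10.10.10.10 server1 server1.mydomain
"""

print(find_duplicate_hostnames(sample))
# {'server1': ['10.10.10.10', '127.0.0.1'], 'server1.mydomain': ['10.10.10.10', '127.0.0.1']}
```

To check a real server, pass in the contents of /etc/hosts instead of `sample`. Note that this simple version also reports a name listed under both an IPv4 and an IPv6 address (for example localhost under 127.0.0.1 and ::1), which is normally intentional.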

So this was the case here. The solution was to remove the server name from 127.0.0.1 and restart the listeners (local and SCAN). Problem solved, right? Hmmm, not quite.

Connection Problem

After we removed the duplicate hostname from /etc/hosts, we restarted the listeners and the message indeed disappeared. But then we realized that clients could not connect to the database.

Clients who tried to connect to the database got “ORA-12520: TNS:listener could not find available handler for requested type of server”. So fixing the SCAN listeners issue introduced a new problem.

We checked and saw that everything was configured correctly, on both the client side and the server side. We tried to ping the SCAN name and tnsping the TNS entry, and both worked, so we needed to work out where the problem was.

When running tnsping, the client goes to the host and port (as they appear in the tnsnames.ora file) and tries to connect to a listener. If a listener responds, we see an “OK” message and the time it took. If not, we get an error. However, tnsping doesn’t try to connect to the database, only to the listener. The fact that tnsping succeeded means that we could reach the SCAN listener, but clearly there was a problem further down the road.

At that point we enabled SQL*Net tracing on the client side (according to MOS note 395525.1). Looking at the trace file immediately solved the mystery. When connecting to RAC, we first reach the SCAN listener, but the SCAN listener doesn’t connect us to the database. It is simply a mechanism created for simplicity, management and load balancing. To continue the connection, the SCAN listener redirects us to one of the local listeners, and the local listener is the one that actually connects us to the database. This redirection is based on the hostname, so our client got a redirection message pointing to the VIP name (server1-vip). When we checked, the client couldn’t resolve the name server1-vip, so the connection failed with an error. The solution was to fix the DNS (or the local client hosts file) so that the client could resolve the VIP hostnames of the RAC servers.

Final Note

As a final note, I’d like to explain why fixing the SCAN listener issue caused the connection problem. As I said, this is the flow when connecting to a RAC database:

  1. The client has the SCAN name, port and service name configured in the tnsnames.ora
  2. The client first resolves the SCAN name to an IP address
  3. With the IP and port it connects to the SCAN listener and asks to connect to the specific database (service)
  4. The SCAN listener knows which services run on which nodes and redirects the client to one of the local listeners (using the VIP name and port)
  5. The client then resolves the VIP name to an IP address
  6. With the IP and port of the local listener, the client connects to the listener and asks to connect to the specific database (service)
  7. The local listener creates a server process for this connection and gets out of the picture
  8. Now the client talks directly with the server process

In our case, the problem we had was at step 5. That’s why we couldn’t connect to the database, but tnsping worked.
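
Steps 2 and 5 both depend on name resolution on the client, so it is worth checking every name the client may be handed, not only the SCAN name. Here is a minimal Python sketch of such a check, to be run on the client machine. `check_resolution` is a hypothetical helper, and the name `rac-scan` and the address 10.10.10.20 are invented for the demo; only server1-vip comes from this case. The stub resolver imitates our client, which knew the SCAN name but not the VIP name.

```python
import socket


def check_resolution(hostnames, resolver=socket.gethostbyname):
    """Map each hostname to its resolved IP, or to None if resolution fails."""
    results = {}
    for name in hostnames:
        try:
            results[name] = resolver(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = None
    return results


# Stub resolver for the demo: knows the (hypothetical) SCAN name only.
KNOWN = {"rac-scan": "10.10.10.20"}


def stub_resolver(name):
    if name not in KNOWN:
        raise socket.gaierror(socket.EAI_NONAME, "Name or service not known")
    return KNOWN[name]


print(check_resolution(["rac-scan", "server1-vip"], resolver=stub_resolver))
# {'rac-scan': '10.10.10.20', 'server1-vip': None}
```

Calling `check_resolution` without the `resolver` argument uses the client's real resolver (hosts file and DNS). Any name that comes back as None is one the client cannot follow a redirect to.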

Before we fixed the initial problem, the hosts configuration caused the local listener to listen on all IP addresses of the server (including the SCAN addresses). So when the client tried to connect to the SCAN listener (step 3), it actually reached the local listener, which simply connected it to the database. No redirection was needed (as this was not the SCAN listener), so the client didn’t need to resolve the server1-vip address.
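
One way a single process can end up holding every IP on a port is a wildcard bind. Whether the Oracle listener did exactly this internally is an assumption on my part, but the effect at the socket level is the same. In this minimal Python sketch (tested reasoning applies to Linux), a socket bound to 0.0.0.0 stands in for the local listener and the loopback address stands in for a SCAN IP:

```python
import errno
import socket

# Stand-in for the local listener: bound to every address on the host.
wildcard = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
wildcard.bind(("0.0.0.0", 0))  # port 0: let the OS choose a free port
wildcard.listen(1)
port = wildcard.getsockname()[1]

# Stand-in for a SCAN listener: wants one specific IP on the same port.
specific = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    specific.bind(("127.0.0.1", port))
    bind_error = None
except OSError as exc:
    bind_error = exc.errno
finally:
    specific.close()
wildcard.close()

print("specific bind blocked:", bind_error == errno.EADDRINUSE)
```

The wildcard socket also accepts connections addressed to the specific IP, which matches what we saw: clients aiming at the SCAN address were served by the local listener.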
