Quite some time ago, we encountered a very strange Oracle problem at work:
ORA-04030: out of process memory when trying to allocate <number> bytes
Initially, the problem was intermittent, but its frequency increased. Soon enough, you could generate the ORA-04030 error almost on command by clicking through our primary site half a dozen times. The standard troubleshooting steps took place:
- Confirm no database changes since the last known good state
- Check the load and performance statistics on the Oracle RAC nodes
- Check the Oracle log files to see if there was anything obvious going wrong
- Check the resources on the Oracle RAC nodes (see the sketch after this list)
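ORA-04030 means an Oracle server process could not allocate more process (PGA) memory, so one check worth scripting is per-instance PGA usage across the RAC nodes. Below is a minimal sketch of that kind of check, assuming Python with the cx_Oracle driver and a hypothetical DSN and monitoring account; the gv$process columns it reads are standard, but everything else is illustrative rather than what we actually ran at the time.

```python
# Minimal sketch: report per-instance PGA usage on a RAC cluster.
# Assumes the cx_Oracle driver and hypothetical credentials/DSN;
# gv$process exposes PGA columns for every instance, so one connection suffices.
import cx_Oracle

DSN = "dbhost:1521/PRODSVC"           # hypothetical service name
USER, PASSWORD = "monitor", "secret"  # hypothetical monitoring account

QUERY = """
    SELECT inst_id,
           COUNT(*)                                 AS processes,
           ROUND(SUM(pga_alloc_mem) / 1024 / 1024)  AS pga_alloc_mb,
           ROUND(MAX(pga_max_mem)   / 1024 / 1024)  AS worst_process_mb
      FROM gv$process
     GROUP BY inst_id
     ORDER BY inst_id
"""

def report_pga_usage():
    # One row per RAC node: process count, total allocated PGA, largest process.
    with cx_Oracle.connect(USER, PASSWORD, DSN) as conn:
        cur = conn.cursor()
        cur.execute(QUERY)
        for inst_id, procs, alloc_mb, worst_mb in cur:
            print(f"node {inst_id}: {procs} processes, "
                  f"{alloc_mb} MB PGA allocated, "
                  f"largest single process {worst_mb} MB")

if __name__ == "__main__":
    report_pga_usage()
```

Had something along these lines been polling at the time, the per-node breakdown might have pointed at the offending node much sooner than it took us.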
After all of those options were worked through, we immediately moved on to the application:
- Confirm no application changes since the last known good state
- Check resources on the web servers
- Check web server log files for anything obvious
There were in fact some application changes, so they were rolled back immediately. Unfortunately, that didn't restore the site to an error-free state. The hunt for anything else that might have changed continued, and we kept drawing blanks. At this stage, a support request was logged through Oracle Metalink to try to resolve the error.
Since the service we were providing was so fundamentally broken, the next thing on the list was to cycle the servers:
- The IIS services were restarted
- The web servers themselves were rebooted
- The Oracle RAC nodes were rebooted
The important thing we didn't notice or think of immediately is that it could have been just one of the Oracle RAC nodes causing the ORA-04030 problem. When the nodes were rebooted, they were cycled too close together for us to notice whether anything had changed. Shortly thereafter, the servers were shut down one node at a time, and continued testing revealed that an individual node was causing the problem.
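That node-by-node isolation was done by hand; a small script tallying ORA-04030 entries in each node's alert log would have surfaced the same answer more quickly. Here's a rough sketch, assuming the text-format alert logs have already been copied locally (the paths below are hypothetical) and that the error string appears verbatim in them:

```python
# Rough sketch: count ORA-04030 occurrences per RAC node's alert log.
# The log paths are hypothetical, and the logs are assumed to be plain-text
# copies pulled from each node (e.g. alert_<SID>.log).
from pathlib import Path

ALERT_LOGS = {
    "node1": Path("logs/alert_PROD1.log"),  # hypothetical paths
    "node2": Path("logs/alert_PROD2.log"),
}

def count_ora04030(log_path: Path) -> int:
    """Count lines mentioning ORA-04030 in one alert log."""
    if not log_path.exists():
        return 0
    with log_path.open(errors="replace") as log:
        return sum(1 for line in log if "ORA-04030" in line)

for node, path in ALERT_LOGS.items():
    print(f"{node}: {count_ora04030(path)} ORA-04030 entries")
```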
Now that services were restored (though slightly degraded), time was with us rather than against us. It seemed quite reasonable that the problem was related to the physical memory in the server. Since the server used ECC memory, any defects in the RAM modules should have been highlighted by the POST tests when the boxes were rebooted. Unfortunately, after rebooting the server again, there were no POST error messages alerting us to any such fault.
While waiting for Oracle technical support to come back to us with a possible cause or solution, the physical memory was swapped out for an identical set from another server. To test the server, it was joined back into the cluster to see if the error could be regenerated. Of course, even though the server hadn't reported any errors with the memory, replacing it seemed to resolve the error. In this instance, nothing Oracle technical support mentioned gave us any real help, and after the problem had seemingly been nipped in the bud, the ticket was closed.
Since we were convinced it had to have been a physical memory fault (given the apparent solution), the suspect memory was put through its paces with a series of grueling memory test utilities for days on end. After all that testing, not a single error was reported. Go figure.
The moral of this story is simple:
When troubleshooting a technical problem, don't dismiss a possible cause just because something else suggests it isn't the problem; confirm or double-check it yourself.