-
Notifications
You must be signed in to change notification settings - Fork 147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NPE in ConnectorXAResource.getResourceHandle causes connection pool problems #24805
Comments
Testing using the fix of: payara/Payara#3344 solves the issue when applied to 7.0.10 codeline in class ConnectorXAResource: No exceptions. Integration test succeeds. Postgres connection pool connections are all state: 'idle' after the test, validated using postgresql query: NOTE I did not apply the fix of payara/Payara#769 to ConnectorXAResource, but it might be needed in the future in Glassfish as well. I am only testing working connections and transactions at this moment, no negative testing which might show issue 769 at this moment. |
I reviewed that fix back then and tried to find some root cause, but it's illusive indeed. As the simple NPE guard has been in use for years now, maybe we should just apply it too. Although I still like to find that actual cause. |
I have been able to reproduce it with: After 1946 times Tips where to look or focus at are welcome. One pattern I see is that all records before the failure show a not null value for nonXAResource, like: Example line 572522 differs from the ones before it:
Why is nonXAResource null in this case? The only other situation where I see delistResource with nonXAResource=null is at the very beginning of server startup, but then the jtsTx is set. Right after
Another thing I see is:
is called for transaction |
Enabled more debugging and added some debug info myself. Enabled:
When comparing the successful transaction before the failing one I see:
Changed toString of com.sun.enterprise.resource.ResourceHandle to list the state:
In the PoolManagerImpl.getResource(ResourceSpec, ResourceAllocator, ClientSecurityInfo) method I added logging:
Which shows in the successful situation
and in the failing situation:
The state (Note to myself: there are other places in the log with the Also the unregister is called in the failing situation:
and then the NPE exception occurs. And this relates to:
I see more calls to unregister than to register, example:
|
Based on what I can make of my logging, versus the code I suspect the problem is related to state of one transaction influencing the next transaction. I added class instance ids in the logging for the ResourceState object. Below my notes from the failing situation (perhaps I will later add it with the log file) but just to give an idea of what I see:
I wonder: should ResourceState enlisted not be altered to false in: Same for: Also the logic in 'com.sun.enterprise.transaction.JavaEETransactionManagerSimplified.delistComponentResources(ComponentInvocation, boolean)':
but it is not updating the handle state there if the delistResource succeeded. The delistResource success/false result it not even looked at. To test this I have a dependency problem: jta module does not know about the connectors-runtime com.sun.enterprise.resource.ResourceHandle, it only has com.sun.enterprise.transaction.spi.TransactionalResource available with isEnlisted. Another bit I do not understand, the interface has:
as if this was created before Resources could be pooled / reused ? |
One workaround "pleister" that seems to work is:
Result:
This is a patch up / repair. Not a clean fix. Just like preventing the NPE in Connector was not a clean fix. This could be related to issue:
as I have seen in Payara in: Note: is see the changes made in Payara, but Glassfish still contains the old implementation. Thread safeness of LocalTxConnectionEventListener and closeConnection, which I logged in Payara issue: 3029 in the past was fixed by: |
Stress tested my "fix" by running the same test I used to reproduce this issue multiple times.
which is equivalent to my logged Payara issue:
was not thread safe. |
Update: when using the related fixes mentioned in this issue I still run into issues like I still suspect there are some threading issues in the code, but I have not been able to proof it in the last few months. I do have a workaround, that avoids having to implement the fix mentioned in: The workaround for me is:
I still want to keep this issue open, and I still do not want to implement the fix mentioned in #24805 (comment) at this moment in time I did local testing with the fix:
to see a warning. |
If you want to try with virtual threads if the issue is related to thread locals not properly cleaned, you can try with this experimental virtual threads pool for Grizzly/GlassFish: https://github.com/OmniFish-EE/glassfish-grizzly-virtual-threads-pool |
There's also Thread Local cleaner component: glassfish/appserver/web/war-util/src/main/java/org/glassfish/web/loader/ReferenceCleaner.java Line 111 in 832cf83
|
Just to clarify the terms used in the source code, correct me if I'm wrong:
If I understand correctly, the problem is that sometimes a connection is marked as enlisted and stays enlisted even after it's put back to the pool (it's closed and free). When reused, the connection is not enlisted with the transaction because it's already marked as enlisted. This means that the It seems that a resource is delisted only in The question now is: Why |
@escay , can you see messages "JavaEETransactionImpl.registerSynchronization" ? They should be visible if you set |
In the log file above: the JavaEETransactionImpl.registerSynchronization is logged:
Note that ThreadID=77 seems to be reused. I have made some attempts to write some code that recognizes the already "marked as enlisted and stays enlisted" situation when the resource (which indeed represents a DB connection) is retreived from the database connection pool. But I have failed so far. I would like to turn the situation around: already figure out that something is wrong at the start, to understand all expected states I started with issue 24900: document the code to understand it myself. In commit bb41ac5#diff-21f318ff963fcb278c1231bc78adad08cb850c3527d071d5c4753e101633ab5d -> PoolManagerImplTest.java (maybe not related), I have added some unit test code and TODOs in PoolManagerImpl where I do not trust or not understand the error handling code:
|
I suppose, the importance of this issue is underestimated, because of described problem exists not only for PostgreSQL, but also for Oracle (v 19.21) and MS SQL (Microsoft SQL Server 2019). |
I can confirm I can reproduce the issue both on PostgreSQL and Oracle without code changes, using the same application. |
Environment Details
Problem Description
I am running integration tests against a large ear file. During the tests database connection pool exceptions occur.
These tests have been running against Payara 5 + 6 and Weblogic 12 and 14. Now I am testing against Glassfish 7.
I ran into the same issue which I logged in the past for Payara 5:
payara/Payara#3025
Which was fixed / improved in:
payara/Payara#3344
Note: exact cause was never found. This fix was more preventing the npe.
Now I am testing in Glassfish 7 and it shows a similar stacktrace and debugging shows same point as error.
The bean code is:
and database code is:
with:
One thing special: the TaskMainInfBean is looked up via jndi via "java:global"
Steps to reproduce
Start Glassfish 7.
Deploy ear file.
Run integration test, which makes REST api calls to test code in the ear file.
Impact of Issue
The test itself is inserting new records in the database. The connection pool is 25 connections. The test assumes the inserts into the database work and should succeed. The rest calls made to the server are not in parallel, only one call at a time: not expecting any connection pool sizing issues.
Exepectation was: test succeeds, but fails due to above exception.
Extra problem: Due to exceptions the connections from the connection pool get lost / are not immediately available again. Without ConnectionLeakReclaim it seems they stay 'lost'.
With ConnectionLeakReclaim=true and ConnectionLeakTimeoutInSeconds=60 you see reclaims:
Investigation
What I will do:
The text was updated successfully, but these errors were encountered: