See the attachment for several thread dumps
taken during J2SE parsing, which started
automatically when a new project was created. The
thread dumps show that the AWT thread is blocked in
NBMDRepositoryImpl.beginTrans, waiting for the
exclusive mutex in order to read information from MDR.
This is unfortunately an MDR design flaw which
can show up any time a time-consuming MDR
database modification is in progress. See issue
42479 for another occurrence.
MDR must be able to provide information without
blocking readers while a writer is holding the
write mutex to MDR. The exclusive mutex as
implemented today is evil.
I would like to stress that this occurrence of AWT blocking
is a regression relative to the previous refactoring builds. This shows how
evil this is. It really can show up somewhere else again at any time.
Created attachment 15397
Several thread dumps during the time IDE is frozen while J2SE is scanned
I would not be that fatalistic. It can be called a Swing/AWT design flaw
as much as an MDR design flaw. This problem was known from the beginning
and was discussed with Jesse and other folks when working on the build
system integration. The cure is not the mutex you suggest (that would
require a lot of code/work to implement and months to stabilize), but
making sure that modules make calls to MDR in the event thread only when
they know what they are doing. Moreover, the ExclusiveMutex is now able
to recognize that an AWT thread is waiting for it: it gives that thread
higher priority when granting access to the mutex, and it also lets
long-running tasks such as scanning pause their transaction for a moment
(if the nature of the task allows it), let AWT/Swing do its job, and
then reacquire the lock. We are using this support when scanning, but
there seems to be a regression in this mechanism, which we will look at
and try to fix for the next build.
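
For readers unfamiliar with the mechanism described above, here is a minimal sketch of a lock whose long-running holder pauses at safe points to let a latency-sensitive thread in. It is not the real org.netbeans.modules.javacore.ExclusiveMutex; the class YieldingLock and all of its method names are hypothetical, and it assumes the background task has points at which it can safely release the lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch; NOT the real ExclusiveMutex from javacore. */
final class YieldingLock {
    // Fair lock: after a release, the longest-waiting queued thread acquires next.
    private final ReentrantLock lock = new ReentrantLock(true);
    // Number of latency-sensitive threads that have asked for the lock.
    private final AtomicInteger priorityWaiters = new AtomicInteger();

    /** Used by long-running background work such as scanning. */
    void lockBackground() { lock.lock(); }

    /** Used by a latency-sensitive thread, for example the AWT event thread. */
    void lockPriority() {
        priorityWaiters.incrementAndGet();
        try {
            lock.lock();
        } finally {
            priorityWaiters.decrementAndGet();
        }
    }

    void unlock() { lock.unlock(); }

    /**
     * Called by the background holder between units of work. If a priority
     * thread is waiting, release the lock and queue up for it again.
     * Returns true if the lock was released and reacquired.
     */
    boolean yieldIfRequested() {
        if (priorityWaiters.get() == 0) return false;
        lock.unlock();
        lock.lock(); // with a fair lock, an already queued waiter goes first
        return true;
    }

    /** Demo: a scanner thread holds the lock; the caller plays the event thread. */
    static int demo() {
        YieldingLock mutex = new YieldingLock();
        CountDownLatch scanning = new CountDownLatch(1);
        AtomicBoolean done = new AtomicBoolean();
        AtomicInteger yields = new AtomicInteger();
        Thread scanner = new Thread(() -> {
            mutex.lockBackground();
            try {
                scanning.countDown();
                while (!done.get()) { // each iteration stands for one unit of scanning
                    if (mutex.yieldIfRequested()) yields.incrementAndGet();
                }
            } finally {
                mutex.unlock();
            }
        });
        scanner.start();
        try {
            scanning.await();
            mutex.lockPriority(); // blocks only until the scanner reaches a yield point
            mutex.unlock();
            done.set(true);
            scanner.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return yields.get(); // how many times the scanner stepped aside
    }
}
```

The weakness of this approach is visible in the sketch: the priority thread still waits for as long as one unit of background work takes, so it only helps if those units are short.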
"The cure to this is not the mutex you suggest (that would require a lot of code/work to
implement and months to stabilize), but making sure that modules make calls to MDR in
event thread only when they know what they are doing."
I truly do not get why, when designing things in NetBeans, we *start* from the proposition
that potentially long running operations must be synchronous - it leads to massively deep
stack traces (with each additional stack frame an additional opportunity for someone to
cause a deadlock). Operating systems have been using message-based
queues for years to decouple such things with quite a bit of success, yet we treat the
natural case as being synchronous operation for everything, no matter how complex, and
invokeLater() as a sort of dirty hack you do when trapped in a corner, rather than using
event/message queues as designed. This gets us into troubles like this.
This also goes to our typical approach to threading which is, whatever thread asks for a
thing first wins (consider the java parser). That's more like a threading circus than a
My point is that, perhaps MDR queries simply *should not be synchronous, period.* That
is, when you want some information, you queue a request, and MDR gets back to you
when that information is ready. If somebody *really* needs the information
synchronously, they can do some equivalent of wait(myQuery), and it's their responsibility
not to do that when asking for something that will likely take a long time, and their
responsibility to design their code not to expect that all queries take 0 time.
I think no synchronous solution can ever work 100% unless you can prove a maximum
duration for queries.
"enables long running tasks such as scanning to pause their transaction for a moment (if
the nature of the task allows it), let AWT/swing do its job"
This is *almost* what I'm talking about, except that this should be the standard case, not
the exceptional case, and should not involve any kind of black magic or hacks. Queueing
work in logical serial units should be the norm, not the exception.
To calm down a bit, what I would expect from anything that can take a potentially infinite
amount of time is:
- There is a message queue to which queries are posted
- There is/are dedicated thread(s) which will do the work
- There is a notification callback when a query's result is available, with failure,
cancellation and timeout semantics
- No code should be designed to assume a query is instantaneous. Code that must block
until a result is available is thus *forced* to post some UI to indicate that it is waiting for a
result, but the AWT event queue does not need to be blocked.
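
The list above could be sketched roughly as follows. This is only an illustration of the shape of such an API, not a proposal for MDR's actual interface: QueryQueue and post are hypothetical names, a single worker thread is assumed, and java.util.concurrent.CompletableFuture stands in for the notification callback.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical sketch of a query queue; not an existing MDR or NetBeans API. */
final class QueryQueue implements AutoCloseable {
    // Dedicated thread that does the work; queries run in the order posted.
    private final ExecutorService worker =
            Executors.newSingleThreadExecutor(daemon("query-worker"));
    // Separate thread that enforces timeouts.
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(daemon("query-timer"));

    private static ThreadFactory daemon(String name) {
        return r -> {
            Thread t = new Thread(r, name);
            t.setDaemon(true);
            return t;
        };
    }

    /** Queues a query and returns immediately; the future reports the outcome. */
    <T> CompletableFuture<T> post(Supplier<T> query, long timeout, TimeUnit unit) {
        CompletableFuture<T> result = new CompletableFuture<>();
        Future<?> running = worker.submit(() -> {
            try {
                result.complete(query.get());       // success
            } catch (Throwable t) {
                result.completeExceptionally(t);    // failure
            }
        });
        ScheduledFuture<?> timeoutTask = timer.schedule(() -> {
            // Timeout: fail the result and interrupt the query if it is still running.
            if (result.completeExceptionally(new TimeoutException("query timed out"))) {
                running.cancel(true);
            }
        }, timeout, unit);
        result.whenComplete((value, error) -> {
            timeoutTask.cancel(false);
            // Cancellation by the caller: stop the underlying work as well.
            if (result.isCancelled()) running.cancel(true);
        });
        return result;
    }

    @Override
    public void close() {
        worker.shutdownNow();
        timer.shutdownNow();
    }
}
```

A UI client would attach a callback with whenComplete and hop back to the event thread with SwingUtilities.invokeLater before touching components; a client that really must block can call join() on the returned future, and then it is that client's responsibility to show some waiting UI.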
After all that, *if* MDR can determine *for sure* that a query can complete in <threshold>
time *before* it runs, then as an optimization, it may perform the query synchronously, to
avoid context switching costs, but that's an optimization you don't do until everything else
is rock solid.
What scares me is that all this seems pretty obvious and basic, and it sounds like we're not
terribly close to such a design. Martin, I hope you can correct me.
For the particular stack trace in this exception, it is trying to populate the combo box in
the editor toolbar. There is no particular reason that either:
- a. the combo's contents must be exactly accurate before it ever appears on the screen -
populating it could be invoke-latered
- b. it needs to know its full contents - it's a combo box. Unless its popup is open, it
does not need to know the full contents of its list, it only needs to know the selected item
it should display, and if there is at least one additional item so the popup button should
be enabled. So possibly an optimization to request from Explorer - NodeListModel should
resolve its contents lazily. What it's doing now is silly.
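
To illustrate option b, here is a minimal sketch of a combo box model that knows its selected item up front and resolves the full list only when something asks for it. LazyComboModel is a hypothetical class, not the existing NodeListModel, and the loader is assumed to be whatever code currently walks the source hierarchy. Note also that a JComboBox may query the model when computing its preferred size, so in practice a prototype display value would probably need to be set for the laziness to survive.

```java
import java.util.List;
import java.util.function.Supplier;
import javax.swing.AbstractListModel;
import javax.swing.ComboBoxModel;

/** Hypothetical sketch of a lazily populated combo model; not NodeListModel. */
final class LazyComboModel<E> extends AbstractListModel<E> implements ComboBoxModel<E> {
    private final Supplier<List<E>> loader; // potentially expensive query
    private List<E> items;                  // null until the list is first needed
    private Object selected;

    LazyComboModel(E initiallySelected, Supplier<List<E>> loader) {
        this.selected = initiallySelected;
        this.loader = loader;
    }

    /** Runs the loader on first use only. */
    private List<E> items() {
        if (items == null) items = loader.get();
        return items;
    }

    boolean isLoaded() { return items != null; }

    // The selected item is available without touching the loader.
    @Override public Object getSelectedItem() { return selected; }

    @Override public void setSelectedItem(Object item) {
        selected = item;
        fireContentsChanged(this, -1, -1); // same convention as DefaultComboBoxModel
    }

    // Only list access (for example opening the popup) triggers the load.
    @Override public int getSize() { return items().size(); }

    @Override public E getElementAt(int index) { return items().get(index); }
}
```

This only defers the cost to the moment the list is first needed; if the loader itself can block, it would still have to run off the event thread as in option a.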
*** Issue 44089 has been marked as a duplicate of this issue. ***
Blocking of AWT thread during scanning is now fixed:
Checking in src/org/netbeans/modules/javacore/ExclusiveMutex.java;
new revision: 184.108.40.206.2.18; previous revision: 220.127.116.11.2.17
Checking in src/org/netbeans/modules/javacore/FileScanner.java;
new revision: 18.104.22.168.2.12; previous revision: 22.214.171.124.2.11
Thanks for your summary, Tim. Moving from synchronous to asynchronous
is not possible for javacore, since it is based on JMI (all the APIs
are generated from a model), which does not support asynchronous calls.
So what we are doing currently is transforming the clients of the JMI
API to call it asynchronously (or at least making sure it is not called
in the AWT thread). We may provide utility classes/methods in the future
to make this easier for clients, so that a single request processor
could be used to schedule the calls, etc. But purely moving to this
approach without moving the calls to the source hierarchy out of AWT
would not help anyway, for backward compatibility reasons (since the
old src API is synchronous).
So for now we are trying to fix exactly the problem you identified
with populating the editor drop-down, by moving the data-collecting
code to a different thread.
"Moving from synchronous to ansynchronous is not possible for javacore since it is based
on JMI (all the APIs are generated from a model) that does not support asynchronous calls."
This suggests to me that either we are stretching JMI beyond its design limitations, or it is
simply not well enough designed to actually solve the problem it's supposed to solve.
The utility/helper approach sounds like a very, very good idea - no reason that couldn't be
wrappered on top of JMI (preferably along with deprecating any other avenues of access to
Let java/srcmodel be blocking, and deprecated - deadlocks and hangs are good
encouragement to stop using deprecated calls.
Verified fixed in trunk.
Just did a cvs update and a clean build, and got a long delay after the first paint of the
main window, when opening with a userdir which had openide open as a project, and a
few files open in the editor. Stack dump looks like exactly the same thing going on as
"AWT-EventQueue-1" prio=5 tid=0x00587920 nid=0x1ebe200 in Object.wait()
at java.lang.Object.wait(Native Method)
- locked <0x6283ab58> (a org.netbeans.modules.javacore.ExclusiveMutex)
- locked <0x6227d858> (a java.awt.Component$AWTTreeLock)
- locked <0x6227d858> (a java.awt.Component$AWTTreeLock)
*** Issue 44556 has been marked as a duplicate of this issue. ***
Moved to new subcomponent java/javacore.
Should not happen anymore - should be fixed by the fix for issue 45077 (we
changed the way files are scanned). There can still be a delay
if you delete the storage files (or the storage files are corrupted)
and you start the IDE with some files open in the editor. But this
should be a very rare case and it should not take too long.
Reorganization of java component