This Bugzilla instance is a read-only archive of historic NetBeans bug reports. To report a bug in NetBeans please follow the project's instructions for reporting issues.

Bug 191323 - The built-in editor damages file contents due to lack of notification about error recovery in decoder
Summary: The built-in editor damages file contents due to lack of notification about e...
Status: RESOLVED FIXED
Alias: None
Product: platform
Classification: Unclassified
Component: Text (show other bugs)
Version: 6.x
Hardware: All All
: P2 normal (vote)
Assignee: David Strupl
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-10-25 20:25 UTC by Thomas Preisler
Modified: 2010-12-09 06:15 UTC (History)
5 users (show)

See Also:
Issue Type: DEFECT
Exception Reporter:


Attachments
sample file in win1252 which will be broken by IDE (128 bytes, text/plain)
2010-10-26 09:45 UTC, Vladimir Voskresensky
Details
broken content (135 bytes, text/plain)
2010-10-26 09:53 UTC, Vladimir Voskresensky
Details
simplest check impl (3.04 KB, patch)
2010-11-02 21:32 UTC, Vladimir Voskresensky
Details | Diff
example how gedit tracks such problems (78.23 KB, image/png)
2010-11-06 10:32 UTC, Vladimir Voskresensky
Details
Slightly refined Vladimir's patch (7.10 KB, patch)
2010-11-26 13:08 UTC, David Strupl
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Thomas Preisler 2010-10-25 20:25:36 UTC
From CR 6992232:

We are trying to use Oracle Solaris Studio IDE in our daily work, so we created a
project in "mpmt" workspace, which allows to build Performance Analyzer tools.
Unfortunately it appears that the built-in editor damages file contents in case
there are French or other non-ASCII characters, e.g. copyright character.
Here is a pointer to the file that can be used to reproduce the problem:

<attached>

It has a French copyright message. If you create a project, copy this file
to your project and try to edit it (edit one line and save), you will find out
that saved version is very different from the original version (like 11 lines 
are different), and there is no way to roll back these changes using the IDE.
This is unacceptable. Note, the damage is done silently, there is no warning.
*** (#1 of 4): 2010-10-14 14:33:12 PDT nikolay

confirmed. Adding this file to project and modifying it corrupts french symbols.
The issue is that file has windows-1252 encoding while project by default uses UTF-8 encoding for files
*** (#2 of 4): 2010-10-15 02:20:36 PDT vladimir
From Vladimir V:
It is not OSS bug.
Whether it is cnd or nb bug I don't know...
1) nb is project oriented and each project declares only one encoding for all files.
2) nb opens file in encoding of project, because we implement MakeProjectEncodingQueryImpl

I think, it's better if we can auto detect encoding based on file content like in GEdit or other editors/browsers...
*** (#3 of 4): 2010-10-25 13:05:37 PDT thomas.preisler@oracle.com
Comment 1 Vladimir Voskresensky 2010-10-26 08:21:59 UTC
I agree with Nik:
- user should be warned about usage of encoding which is different from used in file
Comment 2 Vladimir Voskresensky 2010-10-26 09:35:01 UTC
can someone from platform evaluate this, please?
Comment 3 Vladimir Voskresensky 2010-10-26 09:41:32 UTC
If I understand correctly we should somehow understand that during decoding "replacement" have happened.
Can it be done in FileEncodingQuery$ProxyCharset or ProxyDecoder?
Comment 4 Vladimir Voskresensky 2010-10-26 09:45:27 UTC
Created attachment 102639 [details]
sample file in win1252 which will be broken by IDE
Comment 5 Vladimir Voskresensky 2010-10-26 09:53:07 UTC
Created attachment 102640 [details]
broken content

file content after pressing space and then Save file (used encoding is default UTF-8)
Comment 6 Jaroslav Tulach 2010-10-29 17:15:01 UTC
As far as I can tell, the system queries FileEncodingQuery for encoding and uses it. Use can specify encoding per project. I am not sure what is wrong with that, but passing to editor guys, as this is editor related, isn't it?
Comment 7 David Strupl 2010-11-01 09:05:58 UTC
Vladimir: just a quick question before we start: can you return the correct encoding from your implementation of FileEncodingQuery. I mean from method MakeProjectQueryEncodingImpl.getEncoding(FileObject)? Or do you return the correct encoding from there?
Comment 8 Vladimir Voskresensky 2010-11-01 18:17:58 UTC
(In reply to comment #6)
> As far as I can tell, the system queries FileEncodingQuery for encoding and
> uses it. 
Right

> User can specify encoding per project. 
Right. But it doesn't support user's use case:
- User has encoding specified for the project
- user add "existing" file to project and this file potentially can be in different encoding
- user opens file (no notification that decoder was in recovery mode to decode file) and modify one line, then save. Again no any notifications/warnings.

What's wrong? 
Without any notification user obtained damaged file content.
Have a look at other editors (i.e. gedit):
1) It tries to auto detect encoding and successful in most cases
2) If you open file (in encoding A), then insert special characters and then try to save it and encoding A doesn't support symbols => gedit notify user about that

That's user friendly, isn't it?
Otherwise without VCS user will lost file content without possibility of recovering.

> I am not sure what is wrong with 
> that, but passing to editor guys, as this is editor related, isn't it?
Yes, it's editor related, but when I debugged I've seen that decoder is hidden into FileEncodingQuery$ProxyCharset && ProxyDecoder and there are no any callbacks to editor system that decoder failed and recovered from error.
If such callback exists => we can notify user
Comment 9 Vladimir Voskresensky 2010-11-01 18:19:31 UTC
(In reply to comment #7)
> Vladimir: just a quick question before we start: can you return the correct
> encoding from your implementation of FileEncodingQuery. I mean from method
> MakeProjectQueryEncodingImpl.getEncoding(FileObject)? Or do you return the
> correct encoding from there?
No, we return only encoding specified on project level. We doesn't not read file to auto detect encoding.
And user using "Add existing file" action can potentially add file with any encoding.
Comment 10 David Strupl 2010-11-02 09:34:44 UTC
We had a discussion in the editor team and I am passing the issue back to the platform. In editor we don't do anything with the file encoding - it is part of the platform code - either platform/text or platform/loaders (putting it as enhancement request to the platform/text category).

BTW the user can still decide not to save the file if (s)he can see in the editor some screwed characters.

Also this is a feature request - it works as designed (especially in the editor area): the encoding is per/project setting and we honor the setting. Auto detection is clearly an additional feature.
Comment 11 Vladimir Voskresensky 2010-11-02 11:06:52 UTC
based on your *new* summary I agree, it is P3 ENH, but your new description is not what this bug is about.
It is really P2, because it corrupts user file content without any possibility to be realized by user (because no notifications are given about error recovery done by decoder)

David, please, do not change priority and status of defect based on description which you introduced yourself, you can file another P3 ENH about auto detection if you'd like :-)
Comment 12 Vladimir Voskresensky 2010-11-02 11:26:17 UTC
(In reply to comment #10)
> BTW the user can still decide not to save the file if (s)he can see in the
> editor some screwed characters.
The reality is that user can *not* realize that. 
Nobody pays attention to non ascii symbols in copyrights part of files.
So user (see CR6992232 against Studio) just modified shell script's ascii part of file and have not seen in editor copyright part at all
Comment 13 David Strupl 2010-11-02 12:47:23 UTC
Vladimir, with your new description I am closing this report as invalid. This is simply not true - the editor does not damage anything. It merely opens the file in the encoding specified in the project settings and after the user hits save (see: active user's action) it saves it in the specified encoding.

Please read Thomas' comment: he clearly wants auto detection, quote:

> I think, it's better if we can auto detect encoding based on file content like
> in GEdit or other editors/browsers...
> *** (#3 of 4): 2010-10-25 13:05:37 PDT thomas.preisler@oracle.com

Which was the part that I was referring to when changing the subject line (and type of the issue).

The end result:

 1. I am not going to file a new issue for autodection because I am not requesting the feature.
 2. I will keep this issue closed as either invalid or wontfix because we will not fix it

I understand that you don't agree with that - and the only way is that you will talk to your manager to talk to my manager to decide. From now on I am waiting for my manager to tell me the decision. Please don't change status of my issues without my agreement as I also don't change yours (read cnd) if I know that you don't agree.

Here goes another proposal for breaking the negotiation deadlock: if you attach a patch to this issue I am ready after consulting with Jarda to integrate such a patch - if you identify the place where we should add you some API addition (hook) that you can implement to add the autodection (ha, I am writing again about the auto detection because I totally don't understand how else would you like to solve this problem).
Comment 14 Vladimir Voskresensky 2010-11-02 13:07:37 UTC
David. Let me provide dirty patch what can be done and we'll think together how to beautify it. OK?
Comment 15 David Strupl 2010-11-02 13:17:48 UTC
Ok, I agree, let's talk about possible improvements instead of fighting about the issue title (or type).
Comment 16 Vladimir Voskresensky 2010-11-02 14:36:21 UTC
(In reply to comment #13)> 
> Please read Thomas' comment: he clearly wants auto detection, quote:
> 
> > I think, it's better if we can auto detect encoding based on file content like
> > in GEdit or other editors/browsers...
It is mine comment inserted by Thomas :-)

> a patch to this issue I am ready after consulting with Jarda to integrate such
> a patch - if you identify the place where we should add you some API addition
> (hook) that you can implement to add the autodection (ha, I am writing again
> about the auto detection 
about autodection I've just filed
http://netbeans.org/bugzilla/show_bug.cgi?id=191564
:-)
May be you are right... and it's easier just to implement it in NB.

>because I totally don't understand how else would you
> like to solve this problem).
My idea was:
- use encoding as it is now, but not in two phases: 
1) without recovery to catch the issue and save it somewhere
2) if 1 failed => run with recovery
3) let editor check "recovery" state and put it as red highlight in editor status bar to catch user's attention
what do you think?
Comment 17 Vladimir Voskresensky 2010-11-02 14:37:37 UTC
(In reply to comment #16)
> My idea was:
> - use encoding as it is now, but not in two phases: 
sorry, should be: "but in three phases"
Comment 18 David Strupl 2010-11-02 15:25:59 UTC
The only problem with this approach is that it still has the original problem: using different encoding can still pass without any exception but the characters will be interpreted differently in the other encoding and saving the modified file might "damage" the file the same way as in original report. Only if you will have good luck the exception about invalid characters might be shown to the user.

The correct solution is to solve 191564 and warn the user about incorrect encoding before opening the file in the editor. And only after the user explicitly agrees to open the file to allow saving such a file.

BTW there is also this ENH
http://netbeans.org/bugzilla/show_bug.cgi?id=190236
which might be solved as well with 191564.
Comment 19 Vladimir Voskresensky 2010-11-02 21:32:39 UTC
Created attachment 102762 [details]
simplest check impl

Proposed approach allows at least to check the simplest cases when FileEncodingQuery returned inappropriate encoding. You can check it trying to open attached files from Favorites
Comment 20 David Strupl 2010-11-02 22:46:27 UTC
Jardo, Tomasi, what do you think about the attached patch? I know it should be rewritten so that the file is not opened twice by the reader but my question is about
  a. the principle of the patch (try to decode and if fails notify the user)
  b. the place from where it should be invoked

I am reopening the report but marking it as P3 scheduled for NB 7.0.

Oh and BTW as the patch suggests this problem falls into platform/text which still doesn't have real owner. But I might refrain from forwarding this to the management this time ;-)

And a note for Vladimir - my point still holds (about opening without exception but still saving "wrong" content).
Comment 21 Jaroslav Tulach 2010-11-03 15:26:02 UTC
I guess there is no problem in opening files with wrong encoding. The problem arises when people start to edit such files and want to save them. Thus I'd rather let user open and view files that contain \uFFFD, just made them read only or warned the user before first modification.
Comment 22 Vladimir Voskresensky 2010-11-03 16:08:39 UTC
(In reply to comment #21)
> I guess there is no problem in opening files with wrong encoding. 
What about confirmation dialog with combobox to select another encoding for opening?
It would be more user friendly in case of detected issues with "default" decoding (if we would not have auto detection implemented see issue #191564).

> The problem
> arises when people start to edit such files and want to save them. Thus I'd
> rather let user open and view files that contain \uFFFD, just made them read
> only or warned the user before first modification.
If you already opened in bad encoding => information is lost for some characters
Comment 23 Vladimir Voskresensky 2010-11-06 10:32:40 UTC
Created attachment 102829 [details]
example how gedit tracks such problems
Comment 24 Vladimir Voskresensky 2010-11-12 09:06:46 UTC
any progress?
Comment 25 Leonid Lenyashin 2010-11-22 14:57:58 UTC
In Studio this issue caused legal information corruption, which is considered as a serious issue by our QE and Program Management.
File has been quietly changed 200 lines above the place in the file there user changed a couple of symbols. So the issue was really difficult to catch.
Studio will appreciate a timely fix for the issue. Please note that if there could be a warning to user that encoding might be wrong and a possibility to change it manually it would a good enough resolution here.
Comment 26 David Strupl 2010-11-26 13:08:31 UTC
Created attachment 103370 [details]
Slightly refined Vladimir's patch

I have made Vladimir's patch almost applicable to the current trunk. What it does: it checks only first 1k of characters whether they can be decoded using the given encoding. If they cannot the user is asked whether he really wants to proceed. I guess this would solve the original problem in cnd. Do we want to apply this patch? I hope it is harmless and can be applied. Although it is not ideal solution it gets rid of the bug ;-). Waiting for your comments.
Comment 27 Vladimir Voskresensky 2010-11-26 16:15:33 UTC
David, I'm fine with such approach.
Comment 28 Jaroslav Tulach 2010-11-30 09:22:01 UTC
I don't recommend to integrate into openide.text without a unit test. The code is  messy, without long term owner and it remains assembled together just by the force of existing tests. Without them it fails apart. Test to simulate this new trick would be more than handy for anyone who has to touch the code in future.

Otherwise the approach sounds reasonable.

Btw. I can see use of WeakSet in the patch - did not Vladimir promised to donate cnd's improved implementation to the platform during last release?
Comment 29 Vladimir Voskresensky 2010-11-30 11:06:47 UTC
(In reply to comment #28)

> 
> Otherwise the approach sounds reasonable.
Btw, we can easily enhance the approach by providing dialog where user can choose encoding (like other editor do).
http://netbeans.org/bugzilla/attachment.cgi?id=102829

I.e. we can simply introduce
class UserQuestionException2 extends UserQuestionException
which will be responsible for providing not just message, but the panel which will be shown in questionable dialog. Putting combobox with encodings on this panel will solve the problem even more user friendly.
Then just remember selected encoding and use it for "saving" operation as well.

> 
> Btw. I can see use of WeakSet in the patch - did not Vladimir promised to
> donate cnd's improved implementation to the platform during last release?
Sure, I should do it!
Do we have issue for that? Please, assign it to me.
Comment 30 Jaroslav Tulach 2010-12-01 18:44:16 UTC
(In reply to comment #29)
> (In reply to comment #28)
> I.e. we can simply introduce
> class UserQuestionException2 extends UserQuestionException
> which will be responsible for providing not just message, but the panel which
> will be shown in questionable dialog. 

Enough to extend the existing class and add new method getMessageComponent() I think.
Comment 31 Vladimir Voskresensky 2010-12-01 20:09:02 UTC
(In reply to comment #30)
> (In reply to comment #29)
> > (In reply to comment #28)
> > I.e. we can simply introduce
> > class UserQuestionException2 extends UserQuestionException
> > which will be responsible for providing not just message, but the panel which
> > will be shown in questionable dialog. 
> 
> Enough to extend the existing class and add new method getMessageComponent() I
> think.
If it's compatible change... than I agree.
Comment 32 Vladimir Voskresensky 2010-12-01 20:51:25 UTC
btw, if we allow user to select encoding => it should be remembered for files which were left open before closing ide, because files will be auto opened on the next ide start and should not corrupt as well.
+
there are quite a few places in ide where UserQuestionException is catched and confirmed with catch block without dialog => we need to mark such files opened with default encoding (and broken content) anyway
Comment 33 Jesse Glick 2010-12-01 21:18:46 UTC
(In reply to comment #32)
> there are quite a few places in ide where UserQuestionException is [caught] and
> confirmed with catch block without dialog

See bug #57748. UQE used to be used by some VCS supports, but AFAIK is now used just by CloneableEditor to handle big files. Since UQE is onerous to use and rarely used correctly, would seem preferable to handle this kind of thing directly in the editor component. Not sure exactly how.
Comment 34 David Strupl 2010-12-07 09:04:36 UTC
Hello everyone! Thanks a lot for your valuable feedback. As there were no more patches attached to this issue I took the liberty to apply the latest patch here

http://hg.netbeans.org/jet-main/rev/49a91224e3b4

and if you have some more ideas please create separate issues preferably with patches. Marking the problem as fixed - thanks mainly to Vladimir since he proposed the original patch.
Comment 35 Jan Lahoda 2010-12-07 17:33:44 UTC
I had to invert the detection logic, so that it reports incorrect charset only if the decoding really fails. The original version claimed that an empty file cannot be decoded, which does not make sense, IMO.
http://hg.netbeans.org/jet-main/rev/0a67fb4c2fe0

BTW: did someone check what will hapen if a multibyte character will span across the first 1024 bytes boundary?
Comment 36 Quality Engineering 2010-12-09 06:15:44 UTC
Integrated into 'main-golden', will be available in build *201012090001* on http://bits.netbeans.org/dev/nightly/ (upload may still be in progress)
Changeset: http://hg.netbeans.org/main/rev/0a67fb4c2fe0
User: Jan Lahoda <jlahoda@netbeans.org>
Log: #191323: inverting the "unable to decode file" logic - claim that file cannot be decoded only if CharacterCodingException is thrown, otherwise claim its OK (the original refused to open empty files).