196707 – Changes are made to unedited French characters while saving a file

Status:	RESOLVED FIXED

Alias:	None

Product:	platform
Classification:	Unclassified
Component:	Text
Version:	7.0
Hardware:	All All

Importance:	P2 normal with 1 vote
Assignee:	Vladimir Voskresensky

URL:
Keywords:	I18N

Depends on:
Blocks:

Reported:	2011-03-15 06:50 UTC by PrakharMathur
Modified:	2011-04-01 14:07 UTC
CC List:	2 users

See Also:
Issue Type:	DEFECT
Exception Reporter:

Attachments
Proposed patch. (3.46 KB, patch) 2011-03-29 01:01 UTC, Jan Lahoda

Description PrakharMathur 2011-03-15 06:50:13 UTC

Product Version = NetBeans IDE 7.0 Beta 2 (Build 201102140001) and NetBeans IDE 6.9.1 (Build 201011082200)
Operating System = Windows XP version 5.1 running on x86
Java; VM; Vendor = 1.6.0_17
Runtime = Java HotSpot(TM) Client VM 14.3-b01

I have been trying to save a C++ file that has comments in French, but on save NetBeans changes some unedited French characters to junk characters, and the change cannot even be reverted with Undo.

E.g.

Original Comment: Cette fonction recupère et gère les
Changed Comment: Cette fonction recupï¿½re et gï¿½re les

So it is saving junk(ï¿½) in place of "è".

I have seen similar behavior with NetBeans IDE 6.9.1 (Build 201011082200).
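
The junk sequence above is what one would expect when an accented byte is replaced during a lossy decode and then written back. A minimal sketch that reproduces it outside the IDE, assuming the file is stored in ISO-8859-1 and opened as UTF-8 (neither encoding is stated in this report, and the class and method names are illustrative, not NetBeans code):

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Simulates the assumed sequence of events:
    // 1. the file on disk holds ISO-8859-1 bytes,
    // 2. the editor decodes them as UTF-8 (malformed bytes become U+FFFD),
    // 3. the editor saves the text as UTF-8,
    // 4. the saved bytes are viewed in a single-byte Latin encoding.
    public static String roundTrip(String original) {
        byte[] onDisk = original.getBytes(StandardCharsets.ISO_8859_1);
        String inEditor = new String(onDisk, StandardCharsets.UTF_8);
        byte[] saved = inEditor.getBytes(StandardCharsets.UTF_8);
        return new String(saved, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // \u00E8 is e-grave; the escape keeps this source file pure ASCII.
        System.out.println(roundTrip("recup\u00E8re"));
    }
}
```

Under these assumptions the single e-grave byte comes back as the three characters U+00EF, U+00BF, U+00BD, which is the junk reported above. The original byte is gone once U+FFFD has been substituted, which fits the observation that the damage cannot be undone after saving.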

Comment 1 Jan Becicka 2011-03-17 13:26:30 UTC

Not a stopper.
The IDE tries to autodetect the encoding of a file by reading its first 1024 chars.
Workaround:
Add a French comment at the very beginning of the file, e.g. a French copyright header.

See also issue 191323 and issue 193476

Comment 2 David Strupl 2011-03-25 17:31:24 UTC

More details about what Jan said: the IDE actually checks whether the first 1024 bytes of the file match the encoding set as a property of the project. If they do not match, a warning is shown. The easy workaround is to set the project encoding to the encoding of the edited files. Closing the report as invalid.

Comment 3 Vladimir Voskresensky 2011-03-26 17:27:15 UTC

The 1024-byte check is incorrect. We always have to check the full file content.
See reopened P2 CR#6992232 and probably issue #196945 as well
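
A prefix-only check can go wrong in both directions, as the sketch below illustrates. It is not the NetBeans implementation: `canDecode`, the sample data, and the offsets are hypothetical, built only on the standard `CharsetDecoder` API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PrefixCheckDemo {
    // True if the first `limit` bytes of `data` decode cleanly with `charset`.
    static boolean canDecode(byte[] data, int limit, Charset charset) {
        int n = Math.min(limit, data.length);
        try {
            // decode() on a fresh decoder throws on malformed input instead of replacing it.
            charset.newDecoder().decode(ByteBuffer.wrap(data, 0, n));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // 2048 ASCII bytes with a single ISO-8859-1 e-grave (0xE8) at offset 1500.
    static byte[] lateAccent() {
        byte[] data = new byte[2048];
        Arrays.fill(data, (byte) 'a');
        data[1500] = (byte) 0xE8;
        return data;
    }

    // Valid UTF-8 whose two-byte e-grave (0xC3 0xA8) straddles offset 1024.
    static byte[] splitAtBoundary() {
        byte[] data = new byte[2048];
        Arrays.fill(data, (byte) 'a');
        data[1023] = (byte) 0xC3;
        data[1024] = (byte) 0xA8;
        return data;
    }

    public static void main(String[] args) {
        Charset utf8 = StandardCharsets.UTF_8;
        byte[] late = lateAccent();
        System.out.println(canDecode(late, 1024, utf8));        // true: the prefix is pure ASCII
        System.out.println(canDecode(late, late.length, utf8)); // false: 0xE8 followed by 'a' is malformed
        byte[] split = splitAtBoundary();
        System.out.println(canDecode(split, 1024, utf8));         // false: the prefix ends mid-character
        System.out.println(canDecode(split, split.length, utf8)); // true: the whole file is valid UTF-8
    }
}
```

The first pair shows a mismatch that the prefix check misses, so the file would be opened and later saved with corrupted characters. The second pair shows a valid UTF-8 file being rejected only because the cut-off falls inside a multi-byte character.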

Comment 4 Vladimir Voskresensky 2011-03-28 06:09:24 UTC

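A more correct version of the encoding check, to prevent false positives:

    import java.io.BufferedInputStream;
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.util.logging.Level;
    import org.openide.filesystems.FileObject;

    // ERR is the enclosing class's java.util.logging.Logger (not shown here).
    private static boolean checkIfCharsetCanDecodeFile(FileObject fo, Charset charset) {
        try {
            int BUF_SIZE = 1024*4;

            BufferedInputStream input = new BufferedInputStream(fo.getInputStream(), BUF_SIZE);
            try {
                // A fresh decoder reports malformed input instead of replacing it.
                CharsetDecoder decoder = charset.newDecoder();
                decoder.reset();
                try {
                    // Decode the stream opened above rather than opening a second one.
                    BufferedReader reader = new BufferedReader(new InputStreamReader(input, decoder), BUF_SIZE);
                    char[] buf = new char[BUF_SIZE];
                    // Read the whole file; the decoded text itself is discarded.
                    while (reader.read(buf) > 0) {}
                    reader.close();
                } catch (CharacterCodingException e) {
                    ERR.log(Level.FINE, "Encoding problem using " + charset, e); // NOI18N
                    return false;
                } catch (IllegalStateException e) {
                    // A CODING_END state complaint is tolerated; anything else is an encoding problem.
                    if (!e.getMessage().contains("CODING_END")) {
                        ERR.log(Level.FINE, "Encoding problem using " + charset, e); // NOI18N
                        return false;
                    }
                }
            } finally {
                input.close();
            }
        } catch (IOException ex) {
            // Other I/O failures are logged but not treated as an encoding mismatch.
            ERR.log(Level.FINE, "Encoding problem using " + charset, ex); // NOI18N
        }
        return true;
    }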

Comment 5 Vladimir Voskresensky 2011-03-28 06:09:46 UTC

In fact, data loss is P1...

Comment 6 Vladimir Voskresensky 2011-03-28 18:47:46 UTC

PrakharMathur, can you attach your file, please?

Thanks,
Vladimir.

Comment 7 Vladimir Voskresensky 2011-03-28 18:53:15 UTC

http://hg.netbeans.org/cnd-main/rev/1b8d3b85a037

Comment 8 Jan Lahoda 2011-03-29 01:01:30 UTC

I am sorry, but the above fix does not seem good to me: it means that every file will be read twice, even though the encoding will be fine for the vast majority of them. I would prefer the overhead to be near zero in such cases. I will attach a patch that tries to achieve that.

Comment 9 Jan Lahoda 2011-03-29 01:01:57 UTC

Created attachment 107352 [details]
Proposed patch.

Comment 10 Vladimir Voskresensky 2011-03-29 06:19:39 UTC

Jan, I'm fine with your patch and have only one comment:
What would be internal state of kit which read part of file, then encoding exception was thrown and user said "No"?

Comment 11 Quality Engineering 2011-03-29 08:42:44 UTC

Integrated into 'main-golden', will be available in build *201103290400* on http://bits.netbeans.org/dev/nightly/ (upload may still be in progress)
Changeset: http://hg.netbeans.org/main/rev/1b8d3b85a037
User: Vladimir Voskresensky <vv159170@netbeans.org>
Log: fixed #196945 -  [70cat] org.openide.text.DataEditorSupport$1: The file cannot be safely opened with encoding UTF-8. Do you want to continue opening it?
-- check content of whole file, otherwise we incorrectly reject UTF-8 and also can corrupt user's file without notification (issue #196707 -  Changes are made to unedited French characters while saving a file)

Comment 12 Jan Lahoda 2011-03-31 18:50:19 UTC

(Sorry for the late response, I have a lot of meetings this week.)

(In reply to comment #10)
> Jan, I'm fine with your patch and have only one comment:
> What would be internal state of kit which read part of file, then encoding
> exception was thrown and user said "No"?

I would expect fewer problems with the do-not-open path, as the document will be thrown away in that case. I am not sure that will be the case for the open-anyway path (that's why the patch tries to remove the document's content). The kit itself should not be holding any state (there is just one instance of a kit per mimetype in the IDE). I am a bit worried about the internal state of the document, though.

Your patch is of course much safer in that far fewer things can go wrong. But reading each file twice is not acceptable, IMO, especially given that in almost all cases the encoding will be reasonable.
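
The trade-off discussed here can be sketched as a single strict pass: decode while loading, and only take a slow path when decoding actually fails. This is not the attached patch, whose contents are not shown in this report; `SinglePassLoad` and `loadStrict` are hypothetical names, and returning null to signal a failed decode is an assumption made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

final class SinglePassLoad {
    // Loads the text in one pass. Returns null if the bytes are not valid in
    // `charset`, so the caller can ask the user before reopening leniently.
    static String loadStrict(InputStream in, Charset charset) {
        CharsetDecoder strict = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        StringBuilder content = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, strict)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                content.append(buf, 0, n);
            }
            return content.toString();
        } catch (CharacterCodingException e) {
            // The partially decoded text in `content` is discarded here.
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape a well-encoded file is read exactly once, and the extra work (a prompt, then a second lenient read) happens only in the rare failing case. The cost is the one raised in this thread: whatever was partially loaded before the failure has to be thrown away cleanly.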

Comment 13 Vladimir Voskresensky 2011-04-01 11:05:18 UTC

Jan, so what should we push?
Now trunk has your variant.
I've tested it with the files that caused problems, and everything works as expected.
I've asked the editor team about integration in issue #196945.

Comment 14 Jan Lahoda 2011-04-01 14:07:14 UTC

If needed, I would go with my variant.