225597 – Warn when using BOM in UTF-8 in XMLs

Status:	NEW

Alias:	None

Product:	xml
Classification:	Unclassified
Component:	Validation
Version:	7.4
Hardware:	PC Windows 7

Importance:	P2 normal with 1 vote
Assignee:	Svata Dedic

Keywords:	REGRESSION

Reported:	2013-01-31 12:54 UTC by lokad
Modified:	2015-04-07 09:43 UTC
CC List:	2 users

Issue Type:	ENHANCEMENT

Description lokad 2013-01-31 12:54:11 UTC

It seems that bug #83321 has resurfaced. Trying to validate an XML schema file (right click | Validate XML) that starts with the 3-byte UTF-8 BOM yields:


XML validation started.

Checking file:/[path-to-file]/[file].xsd...
Content is not allowed in prolog. [1] 
XML validation finished.


Clicking on <Content is not allowed in prolog. [1]> opens the XML file and highlights the first line of the file.

Other UTF encodings with or without BOM do work.

The exact NetBeans version used is NetBeans IDE 7.3 RC1 (Build 201301240957)

Comment 1 lokad 2013-02-01 09:48:57 UTC

Another observation: if no encoding is specified (e.g. only <?xml version="1.0"?>) then the validation also works for UTF-8 with BOM.

Comment 2 lokad 2013-02-01 11:01:36 UTC

My previous comment actually is not true.

So here are some results:

              | encoding specified | no encoding specified
--------------+--------------------+-----------------------
UTF-8         | OK                 | OK
UTF-8     BOM | 1)                 | 1)
UTF-16 LE BOM | OK                 | 2) #)
UTF-16 BE BOM | OK                 | 2) #)
[UTF-16 LE    | 1) *)              | 3) +)                ]
[UTF-16 BE    | OK                 | 1) +)                ]


1) "Content is not allowed in prolog."
2) "Premature end of file."
3) "The markup in the document preceding the root element must be well-formed."
*) Garbage when opened in NB (wrong encoding detected)
#) Nothing when opened in NB
+) Probably detected as UTF-8 (spaces between characters)

file contents:
<?xml version="1.0" encoding="$encoding"?>
<root/>

<?xml version="1.0"?>
<root/>

I am of course unsure which of these combinations should have worked ...
If I understand the W3C requirements correctly, UTF-8 with or without a BOM and UTF-16 with a BOM have to be understood. UTF-16 without a BOM is illegal.
Only documents that are not encoded in UTF-8 or UTF-16 seem to be required to declare their encoding correctly. [http://www.w3.org/TR/REC-xml/#charencoding]

Comment 3 Svata Dedic 2013-04-02 13:37:08 UTC

Hm, at least when the encoding is not specified, the file does not even open correctly (the BOM is displayed). The defect has been present since NetBeans 7.1.2; I cannot pinpoint the changeset that changed the behaviour.

Anyway, EncodingUtil.doDetectEncoding attempts to autodetect the encoding and then reads the document's declared encoding. If the document does NOT declare anything, the autodetected encoding (e.g. UTF-8 detected from the presence of a BOM) is thrown away and null is returned. That causes the next encoding in the queue (the project default, ISO-8859-1 in my case) to step in and interpret the BOM as regular text.
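
To illustrate the lookup order described above, here is a minimal sketch in which the BOM-detected encoding is kept as a fallback instead of being discarded. It is not the actual EncodingUtil code: the class and method names (EncodingChoice, sniffBom, declaredEncoding, chooseEncoding) are made up for this example, and the declaration is read with a simple regular expression that only works for ASCII-compatible encodings.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not NetBeans code).
public class EncodingChoice {
    // Matches encoding="..." inside an XML declaration at the very start of the text.
    private static final Pattern DECL = Pattern.compile(
        "^<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Returns the encoding implied by a leading BOM, or null if none is found.
     *  UTF-32 BOMs are deliberately ignored in this sketch. */
    public static String sniffBom(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }

    /** Returns the encoding named in the XML declaration, or null if there is none.
     *  UTF-16 input never matches here, so the caller falls back to the BOM. */
    public static String declaredEncoding(byte[] head) {
        int skip = "UTF-8".equals(sniffBom(head)) ? 3 : 0;
        String text = new String(head, skip, head.length - skip,
                                 StandardCharsets.ISO_8859_1);
        Matcher m = DECL.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    /** Declared encoding wins; otherwise keep the BOM result instead of returning null. */
    public static String chooseEncoding(byte[] head) {
        String declared = declaredEncoding(head);
        if (declared != null) {
            return declared;
        }
        return sniffBom(head); // still null when there is neither declaration nor BOM
    }
}
```

With this ordering, a null result only means "no declaration and no BOM", so the project default would be consulted only in that case.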

Comment 4 Svata Dedic 2013-04-02 14:15:56 UTC

Although I was able to fix the charset detection, the UTF-8 encoded file is still not read correctly. The Java I/O libraries do not handle UTF-8 with a BOM correctly - see
http://bugs.sun.com/view_bug.do?bug_id=4508058
http://bugs.sun.com/view_bug.do?bug_id=6378911
http://en.wikipedia.org/wiki/Byte-order_mark#UTF-8
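
For illustration, the decoder behaviour described in those reports can be reproduced with a few lines: Java's UTF-8 decoder passes the BOM through as the character U+FEFF. The sketch below also shows one commonly used way to skip the BOM before decoding. The class and method names (Utf8BomDemo, skipUtf8Bom) are hypothetical and this is not what NetBeans does.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
public class Utf8BomDemo {

    /** Wraps the stream so that a leading UTF-8 BOM (EF BB BF), if present, is skipped.
     *  Any other leading bytes are pushed back and remain readable. */
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) {                      // read up to three bytes
            int r = pin.read(head, n, 3 - n);
            if (r < 0) {
                break;                       // stream shorter than three bytes
            }
            n += r;
        }
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            pin.unread(head, 0, n);          // not a BOM: put the bytes back
        }
        return pin;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'r', '/', '>'};

        // Plain decoding keeps the BOM as the first character (U+FEFF).
        String raw = new String(data, StandardCharsets.UTF_8);
        System.out.println("first char is U+FEFF: " + (raw.charAt(0) == '\uFEFF'));

        // After wrapping, the first byte read is '<'.
        InputStream in = skipUtf8Bom(new ByteArrayInputStream(data));
        System.out.println("first byte after skipping: " + (char) in.read());
    }
}
```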

Sadly, the net result of the evaluation is that NetBeans XML support should warn if a document contains a BOM sequence at the start; even if NB worked around this JDK defect, JAXP would not parse the XML correctly at application runtime.

I'll commit the encoding detection fix; it does no harm and improves the code's correctness. However, I have to mark the issue as an enhancement to report a JDK-unsupported feature rather than provide a fix for the use case, sorry.

Comment 5 Svata Dedic 2013-04-02 14:58:41 UTC

Encoding detection improved by http://hg.netbeans.org/jet-main/rev/6bf6bd1eac3f

Comment 6 rkraneis 2014-10-10 08:08:08 UTC

Hi Svata,
I just want to confirm the current status in NB 8.0.1 (and thanks for the encoding-detection-fix):

              | encoding specified | no encoding specified
--------------+--------------------+-----------------------
UTF-8         | OK                 | OK
UTF-8     BOM | 1)                 | 1)
UTF-16 LE BOM | OK                 | OK
UTF-16 BE BOM | OK                 | OK
1) "Content is not allowed in prolog."

Would it be a good idea to warn the user when an XML file with a UTF-8 BOM is detected? Maybe as a configurable hint?

Regards,
René
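
As a rough sketch of what such a check could look like (not an existing NetBeans hint; the class name BomWarning, the method checkForUtf8Bom and the message text are assumptions made for this example):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical check, for illustration only.
public class BomWarning {

    /** Returns a warning message if the file starts with the UTF-8 BOM (EF BB BF),
     *  otherwise an empty Optional. Only the first three bytes are read. */
    public static Optional<String> checkForUtf8Bom(Path file) throws IOException {
        byte[] head = new byte[3];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (n < 3) {
                int r = in.read(head, n, 3 - n);
                if (r < 0) {
                    break;                   // file shorter than three bytes
                }
                n += r;
            }
        }
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom) {
            return Optional.empty();
        }
        return Optional.of(file + ": starts with a UTF-8 byte order mark; "
                + "validation may report \"Content is not allowed in prolog.\"");
    }
}
```

A hint built on something like this could be switched on or off in the editor options, so users who rely on BOMs are not bothered by it.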