Bug 225597 - Validating an XSD with UTF-8 BOM fails.
Status: NEW
Product: xml
Classification: Unclassified
Component: Validation
Version: 7.4
Hardware: PC Windows 7
Priority: P3
Target Milestone: TBD
Assigned To: Svata Dedic
QA Contact: issues@xml
Whiteboard: jdk_bug_4508058, jdk_bug_6378911
Keywords: REGRESSION
Depends on:
Blocks:
Reported: 2013-01-31 12:54 UTC by lokad
Modified: 2013-04-02 14:58 UTC (History)
0 users

See Also:
Issue Type: ENHANCEMENT
Description lokad 2013-01-31 12:54:11 UTC
It seems that this bug has resurfaced (#83321). Validating an XML schema file (right click | Validate XML) that starts with the 3-byte UTF-8 BOM yields:


XML validation started.

Checking file:/[path-to-file]/[file].xsd...
Content is not allowed in prolog. [1] 
XML validation finished.


Clicking on <Content is not allowed in prolog. [1]> opens the XML file and highlights the first line of the file.

Other UTF encodings with or without BOM do work.

The exact NetBeans version used is NetBeans IDE 7.3 RC1 (Build 201301240957)
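
For reference, one way to get the same message from the JDK's built-in parser is to hand it a character stream whose first character is U+FEFF, i.e. a BOM that was decoded as text instead of being skipped. The sketch below only demonstrates that behaviour; the class name BomInProlog and the helper parseError are made up for illustration, and whether NetBeans reaches the parser through this path is an assumption.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class BomInProlog {
    /** Parses the given text as a character stream; returns the parse error, or null on success. */
    static String parseError(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            // Line and column point at the offending character.
            return e.getLineNumber() + ":" + e.getColumnNumber() + " " + e.getMessage();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String doc = "<?xml version=\"1.0\"?>\n<root/>\n";
        System.out.println("without BOM: " + parseError(doc));
        // A decoded BOM is the character U+FEFF in front of the XML declaration.
        System.out.println("with BOM:    " + parseError("\uFEFF" + doc));
    }
}
```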
Comment 1 lokad 2013-02-01 09:48:57 UTC
Another observation: if no encoding is specified (e.g. only <?xml version="1.0"?>) then the validation also works for UTF-8 with BOM.
Comment 2 lokad 2013-02-01 11:01:36 UTC
My previous comment is actually not true.

So here are some results:

              | encoding specified | no encoding specified
--------------+--------------------+-----------------------
UTF-8         | OK                 | OK
UTF-8     BOM | 1)                 | 1)
UTF-16 LE BOM | OK                 | 2) #)
UTF-16 BE BOM | OK                 | 2) #)
[UTF-16 LE    | 1) *)              | 3) +)                ]
[UTF-16 BE    | OK                 | 1) +)                ]


1) "Content is not allowed in prolog."
2) "Premature end of file."
3) "The markup in the document preceding the root element must be well-formed."
*) Garbage when opened in NB (wrong encoding detected)
#) Nothing when opened in NB
+) Probably detected as UTF-8 (spaces between characters)

file contents:
<?xml version="1.0" encoding="$encoding"?>
<root/>

<?xml version="1.0"?>
<root/>
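
A minimal sketch of how test files like the ones in the table can be produced, in case someone wants to repeat the experiment. The class name BomTestFiles and the method build are hypothetical; the document text is the one shown above, and the BOM is obtained by encoding U+FEFF in the target charset.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomTestFiles {
    /**
     * Encodes the sample document in the given charset.
     * declaredEncoding == null produces the variant without an encoding declaration.
     */
    static byte[] build(Charset charset, boolean withBom, String declaredEncoding) {
        String decl = declaredEncoding == null
                ? "<?xml version=\"1.0\"?>"
                : "<?xml version=\"1.0\" encoding=\"" + declaredEncoding + "\"?>";
        String text = decl + "\n<root/>\n";
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (withBom) {
            // U+FEFF encoded in the target charset is that charset's byte order mark:
            // EF BB BF for UTF-8, FF FE for UTF-16LE, FE FF for UTF-16BE.
            byte[] bom = "\uFEFF".getBytes(charset);
            out.write(bom, 0, bom.length);
        }
        byte[] body = text.getBytes(charset);
        out.write(body, 0, body.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Charset[] charsets = {
            StandardCharsets.UTF_8, StandardCharsets.UTF_16LE, StandardCharsets.UTF_16BE
        };
        for (Charset cs : charsets) {
            for (boolean bom : new boolean[] {false, true}) {
                byte[] declared = build(cs, bom, cs.name());
                byte[] undeclared = build(cs, bom, null);
                System.out.println(cs.name() + (bom ? " BOM" : "") + ": "
                        + declared.length + " / " + undeclared.length + " bytes");
            }
        }
    }
}
```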

I am of course unsure which of these combinations should have worked.
If I understand the W3C requirements correctly, UTF-8 with or without BOM and UTF-16 with BOM have to be understood; UTF-16 without BOM is illegal.
Only documents not encoded in UTF-8 or UTF-16 seem to be required to provide correct encoding information. [http://www.w3.org/TR/REC-xml/#charencoding]
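
As an illustration of what "understanding" the BOM means here, a sketch of BOM-based detection for the three cases in the table. BomSniffer and detect are hypothetical names; only UTF-8 and UTF-16 are covered, so a UTF-32 BOM (which also starts with FF FE in its little-endian form) is deliberately not distinguished.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    /** Returns the charset implied by a leading BOM, or null if the bytes do not start with one. */
    static Charset detect(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return StandardCharsets.UTF_16BE;
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return StandardCharsets.UTF_16LE;
            }
        }
        // No BOM: the caller has to fall back to the declaration or a default.
        return null;
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?'};
        byte[] noBom = {'<', '?', 'x', 'm', 'l'};
        System.out.println(detect(utf8WithBom));
        System.out.println(detect(noBom));
    }
}
```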
Comment 3 Svata Dedic 2013-04-02 13:37:08 UTC
Hm, at least when the encoding is not specified, the file does not even open correctly (the BOM is displayed). The defect has been present since NetBeans 7.1.2; I cannot pinpoint the changeset that changed the behaviour.

Anyway, EncodingUtil.doDetectEncoding attempts to autodetect the encoding and then reads the document's declared encoding. If the document does NOT declare anything, the autodetected encoding (e.g. UTF-8 detected from the presence of a BOM) is thrown away and null is returned. That causes the next encoding in the queue (the project default, ISO-8859-1 in my case) to step in and interpret the BOM as regular text.
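
The difference can be sketched as follows. This is not the NetBeans source: the class EncodingChoice and both methods are hypothetical, and the "fixed" variant is only an assumption about what keeping the autodetected encoding would look like.

```java
public class EncodingChoice {
    /** Behaviour described above: the autodetected value is ignored when nothing is declared. */
    static String declaredOnly(String autodetected, String declared) {
        return declared; // null when the document has no encoding declaration
    }

    /** Assumed fix: fall back to the autodetected value instead of returning null. */
    static String declaredOrDetected(String autodetected, String declared) {
        return declared != null ? declared : autodetected;
    }

    public static void main(String[] args) {
        // Document with a UTF-8 BOM and no encoding declaration.
        String detected = "UTF-8";
        String declared = null;
        // null here lets the next encoding in the queue (e.g. the project default) take over.
        System.out.println(declaredOnly(detected, declared));
        System.out.println(declaredOrDetected(detected, declared));
    }
}
```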
Comment 4 Svata Dedic 2013-04-02 14:15:56 UTC
Although I was able to fix the charset detection, the UTF-8 encoded file is still not read correctly. The Java I/O libraries do not handle a UTF-8 BOM correctly - see
http://bugs.sun.com/view_bug.do?bug_id=4508058
http://bugs.sun.com/view_bug.do?bug_id=6378911
http://en.wikipedia.org/wiki/Byte-order_mark#UTF-8
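
To make the JDK behaviour concrete: an InputStreamReader using UTF-8 passes the BOM through as the character U+FEFF, so the caller has to skip it. A minimal sketch of such a workaround follows; Utf8BomReader, open and firstChar are hypothetical names and this is not the code that was committed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class Utf8BomReader {
    /** Opens a UTF-8 reader and drops a leading EF BB BF if present. */
    static Reader open(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) {
            int r = pb.read(head, n, 3 - n);
            if (r < 0) {
                break; // stream shorter than a BOM
            }
            n += r;
        }
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            pb.unread(head, 0, n); // not a BOM: put the bytes back
        }
        return new InputStreamReader(pb, StandardCharsets.UTF_8);
    }

    /** First decoded character, with or without BOM stripping (for comparison). */
    static int firstChar(byte[] data, boolean strip) {
        try {
            InputStream in = new ByteArrayInputStream(data);
            Reader reader = strip ? open(in) : new InputStreamReader(in, StandardCharsets.UTF_8);
            return reader.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'r', '/', '>'};
        System.out.println(Integer.toHexString(firstChar(data, false)));
        System.out.println(Integer.toHexString(firstChar(data, true)));
    }
}
```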

Sadly, the net result of the evaluation is that NetBeans XML support should warn if a document contains a BOM sequence at the start; even if NB worked around this JDK defect, JAXP would not parse the XML correctly at application runtime.

I'll commit the encoding detection fix; it does no harm and improves the code's correctness. However, I have to mark the issue as an enhancement (reporting a feature the JDK does not support) rather than provide a fix for the use case, sorry.
Comment 5 Svata Dedic 2013-04-02 14:58:41 UTC
encoding detection improved by http://hg.netbeans.org/jet-main/rev/6bf6bd1eac3f

