164820 – Cannot display or paste supplementary characters

Summary: Cannot display or paste supplementary characters

Status:	RESOLVED FIXED

Alias:	None

Product:	editor
Classification:	Unclassified
Component:	-- Other --
Version:	6.x
Hardware:	All All

Importance:	P3 blocker
Assignee:	Vitezslav Stejskal

URL:
Keywords:	I18N, NETFIX

Depends on:
Blocks:

Reported:	2009-05-10 11:12 UTC by johnsonlau
Modified:	2010-09-02 03:56 UTC
CC List:	2 users

See Also:
Issue Type:	ENHANCEMENT
Exception Reporter:

Attachments
SMP characters become undisplayable in netbeans (236.48 KB, image/png) 2009-05-10 11:17 UTC, johnsonlau
SMP characters display well in Notepad (Windows Vista) (66.79 KB, image/png) 2009-05-10 11:19 UTC, johnsonlau
SMP characters in nb (half the character can be selected) (28.11 KB, image/png) 2009-05-10 11:32 UTC, johnsonlau
BMP characters display well on Windows Vista (21.08 KB, image/png) 2009-05-11 14:59 UTC, johnsonlau
BMP characters display well on Ubuntu 9.04 (20.86 KB, image/png) 2009-05-11 15:09 UTC, johnsonlau
issue #17356: counting should be 79 according to the position, but only 56. (18.02 KB, image/png) 2009-05-11 15:21 UTC, johnsonlau
issue #17356: counting should be 79 according to the position, but only 56. (full) (87.28 KB, image/png) 2009-05-11 15:26 UTC, johnsonlau
modification diff on supplementary character issue (5.09 KB, patch) 2009-09-08 16:57 UTC, johnsonlau
JDK bug #6877495 patch (3.51 KB, patch) 2010-09-02 03:56 UTC, johnsonlau
Description johnsonlau 2009-05-10 11:12:06 UTC

I found that NetBeans doesn't support SMP characters.

If you copy text from another application that supports SMP characters, such as Firefox, and paste it into NetBeans,
the paste stops where the SMP characters begin.

When you use Notepad to add the SMP characters to a file manually, NetBeans cannot display them properly.
The font is set to SimSun on Vista, and the characters display properly in Notepad.

I didn't find a way to input these characters, and don't know whether IME input works.

Confirmed on both Windows Vista and Ubuntu 9.04.

Product Version: NetBeans IDE Dev (Build 200905080201)
Java: 1.6.0_13; Java HotSpot(TM) Client VM 11.3-b02

This link might help you copy and paste an SMP character.
http://zh.wikibooks.org/w/index.php?title=Unicode/20000-20FFF

Comment 1 johnsonlau 2009-05-10 11:17:16 UTC

Created attachment 81857 [details]
SMP characters become undisplayable in netbeans

Comment 2 johnsonlau 2009-05-10 11:19:10 UTC

Created attachment 81858 [details]
SMP characters display well in Notepad (Windows Vista)

Comment 3 johnsonlau 2009-05-10 11:31:22 UTC

Sorry for the wrong attachment. I've confirmed that the display problem was due to my font setting
(the font should be SimSun-ExtB for the characters to display properly).

But this brings up a new problem.
When I use the keyboard to navigate the file, an SMP character is actually treated as 2 characters,
so the user can select half of the character, which results in wrong output.

After I set the right font in NB,
I can copy and paste text containing SMP characters inside NB.
But if I do the same thing from Notepad to NB, the paste still fails at the point where the SMP characters start.

Comment 4 johnsonlau 2009-05-10 11:32:42 UTC

Created attachment 81859 [details]
SMP characters in nb (half the character can be selected)

Comment 5 Vitezslav Stejskal 2009-05-11 12:31:20 UTC

Basically a duplicate of issue #17356. The NetBeans editor currently does not support multibyte characters, sorry.

*** This issue has been marked as a duplicate of 17356 ***

Comment 6 johnsonlau 2009-05-11 14:57:14 UTC

Sorry, I cannot agree with your evaluation.

NetBeans does handle multi-byte characters in the Basic Multilingual Plane (BMP), such as Chinese, Japanese and Korean, well.
(See the screenshot I newly attached.)
It treats each such character as two columns wide, displays it well when you choose the right fixed-width font,
and never breaks the character into two pieces.

But supplementary characters, introduced in Unicode 3.1, need special treatment according to Sun.
(http://java.sun.com/developer/technicalArticles/Intl/Supplementary/)

AFAIK, Korean characters have all been encoded in the BMP, and none in the supplementary planes yet.
The supplementary characters introduced so far are mostly rarely used Chinese characters (or Kanji, or Hanja, whatever
you like to call them), which are found in people's names and ancient texts.
Please don't mix up the two problems. (And I think issue #17356 should already be solved.)

Supplementary characters cannot be represented in a single char on the Java platform.
I think the editor treats every char as a character, which makes it possible to select half of an SIP character.
The platform is able to determine the column width a character takes, so display is not a problem.
I don't know how NetBeans handles the clipboard: a BMP character won't stop a paste, only a supplementary character does.
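
A minimal sketch of the point about char, using only standard java.lang APIs: U+20026 occupies two char values in a Java String, so code that advances one char at a time can stop between the two halves. The class name SurrogateDemo is made up for illustration.

```java
final class SurrogateDemo {
    /** Number of code points in s, as opposed to UTF-16 code units (chars). */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u4E2D";                                 // U+4E2D, a BMP character
        String supp = new String(Character.toChars(0x20026));  // U+20026, a supplementary character

        System.out.println(bmp.length() + " char(s), " + codePoints(bmp) + " code point(s)");
        System.out.println(supp.length() + " char(s), " + codePoints(supp) + " code point(s)");

        // Each half of the pair is a lone surrogate, not a character on its own.
        System.out.println(Character.isHighSurrogate(supp.charAt(0)));
        System.out.println(Character.isLowSurrogate(supp.charAt(1)));
    }
}
```

An editor that counts positions with length() sees two positions inside the supplementary character, while codePointCount() sees one.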

Comment 7 johnsonlau 2009-05-11 14:59:10 UTC

Created attachment 81901 [details]
BMP characters display well on Windows Vista

Comment 8 johnsonlau 2009-05-11 15:09:09 UTC

Created attachment 81903 [details]
BMP characters display well on Ubuntu 9.04

Comment 9 johnsonlau 2009-05-11 15:19:52 UTC

Sorry for my conclusion about #17356.
I reconsidered it and realized that #17356 concerns NetBeans's character counting algorithm.
I have posted an attachment related to that issue.
You can see from the screenshot that it's a different issue.

This issue is due to wrong handling of character sequences in NetBeans.

Comment 10 johnsonlau 2009-05-11 15:21:36 UTC

Created attachment 81904 [details]
issue #17356: counting should be 79 according to the position, but only 56.

Comment 11 johnsonlau 2009-05-11 15:26:15 UTC

Created attachment 81905 [details]
issue #17356: counting should be 79 according to the position, but only 56. (full)

Comment 12 Vitezslav Stejskal 2009-05-12 15:52:41 UTC

Thanks for all the information.

Comment 13 johnsonlau 2009-08-29 08:07:01 UTC

I did some tests, and this is likely a problem in the JDK, not NetBeans.
I found that the JDK's default JTextField and JTextArea have the same problem.
The editor component in NetBeans seems to inherit directly from Swing's JTextComponent, the common parent of both JTextField
and JTextArea.

With the default Win32 L&F, two methods in javax.swing.plaf.basic.BasicTextUI#TextTransferHandler,
getImportFlavor() and handleReaderImport(), handle the paste (Ctrl+V) action.
I used the SMP character U+20026 (\uD840\uDC26, a legal character) for testing.
(see http://www.fileformat.info/info/unicode/char/20026/index.htm)

I copied this character from Windows 7's Notepad to the clipboard, and checked that Word accepts it when pasting.
Then I accessed the clipboard directly with:

Clipboard clipboard = Toolkit.getDefaultToolkit().getSystemClipboard();
Transferable t = clipboard.getContents(null);
String value = (String) (t.getTransferData(DataFlavor.stringFlavor));

And I got the correct result in value: a character with \uD840\uDC26.
Then I stepped into BasicTextUI#TextTransferHandler#getImportFlavor(), and found that it uses the first DataFlavor with a
"text/plain" MIME type to "read" the characters, rather than the default string flavor.

            for (int i = 0; i < flavors.length; i++) {
                String mime = flavors[i].getMimeType();
                if (mime.startsWith("text/plain")) {
                    return flavors[i];
                } else if (refFlavor == null && mime.startsWith("application/x-java-jvm-local-objectref")
                                             && flavors[i].getRepresentationClass() == java.lang.String.class) {
                    refFlavor = flavors[i];
                } else if (stringFlavor == null && flavors[i].equals(DataFlavor.stringFlavor)) {
                    stringFlavor = flavors[i];
                }
            }

Then it uses DataFlavor#getReaderForText() to get a java.io.Reader object to read the characters.
But on my system, the first text/plain DataFlavor cannot read SMP characters correctly.
These are all the DataFlavors on my system. (The first 2 chars are displayed if it is a text/plain DataFlavor.)
Java: 1.6.0_18 debug; Java HotSpot(TM) Client VM 14.2-b01
System: Windows 7 version 6.1 running on x86; GBK; zh_CN (nb)

mime: application/x-java-serialized-object; class=java.lang.String, 
mime: text/plain; class=java.io.Reader; charset=Unicode, java.io.InputStreamReader@1e779a1: 0, 0, -1
mime: text/plain; class=java.lang.String; charset=Unicode, : 55360, 56358, 2
mime: text/plain; class=java.nio.CharBuffer; charset=Unicode, : 55360, 56358, 2
mime: text/plain; class="[C"; charset=Unicode, [C@b8d09d: 55360, 56358, 2
mime: text/plain; class=java.io.InputStream; charset=unicode, java.io.StringReader@187f9f1: 55360, 56358, 2
mime: text/plain; class=java.nio.ByteBuffer; charset=UTF-16, java.nio.HeapByteBuffer[pos=0 lim=6 cap=6]: 55360, 56358, 2
mime: text/plain; class="[B"; charset=UTF-16, [B@2a5ab9: 55360, 56358, 2
mime: text/plain; class=java.io.InputStream; charset=UTF-8, sun.awt.datatransfer.DataTransferer$ReencodingInputStream@aa2ef2: 55360, 56358, -1
mime: text/plain; class=java.nio.ByteBuffer; charset=UTF-8, java.nio.HeapByteBuffer[pos=0 lim=4 cap=4]: 55360, 56358, 2
mime: text/plain; class="[B"; charset=UTF-8, [B@f052d5: 55360, 56358, 2
mime: text/plain; class=java.io.InputStream; charset=UTF-16BE, sun.awt.datatransfer.DataTransferer$ReencodingInputStream@1c87093: 55360, 56358, -1
mime: text/plain; class=java.nio.ByteBuffer; charset=UTF-16BE, java.nio.HeapByteBuffer[pos=0 lim=4 cap=4]: 55360, 56358, 2
mime: text/plain; class="[B"; charset=UTF-16BE, [B@56c3cf: 55360, 56358, 2
mime: text/plain; class=java.io.InputStream; charset=UTF-16LE, sun.awt.datatransfer.DataTransferer$ReencodingInputStream@f81402: 55360, 56358, -1
mime: text/plain; class=java.nio.ByteBuffer; charset=UTF-16LE, java.nio.HeapByteBuffer[pos=0 lim=4 cap=4]: 55360, 56358, 2
mime: text/plain; class="[B"; charset=UTF-16LE, [B@e9b4bb: 55360, 56358, 2
mime: text/plain; class=java.io.InputStream; charset=ISO-8859-1, sun.awt.datatransfer.DataTransferer$ReencodingInputStream@189b939: 55360, 56358, -1
mime: text/plain; class=java.nio.ByteBuffer; charset=ISO-8859-1, java.nio.HeapByteBuffer[pos=0 lim=1 cap=1]: 63, 56358, 1
mime: text/plain; class="[B"; charset=ISO-8859-1, [B@df824a: 63, 56358, 1
mime: text/plain; class=java.io.InputStream; charset=US-ASCII, sun.awt.datatransfer.DataTransferer$ReencodingInputStream@10e9df: 63, 56358, -1
mime: text/plain; class=java.nio.ByteBuffer; charset=US-ASCII, java.nio.HeapByteBuffer[pos=0 lim=1 cap=1]: 63, 56358, 1
mime: text/plain; class="[B"; charset=US-ASCII, [B@6a2f81: 63, 56358, 1
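
For reading the dump: 55360 and 56358 are the decimal values of 0xD840 and 0xDC26, the surrogate pair for U+20026, and 63 is '?'. The sketch below uses only java.lang and java.nio.charset, with a made-up class name, and reproduces those values together with the buffer sizes (lim=) listed for the byte-based flavors. It does not touch the clipboard, so it says nothing about why the first Reader flavor returns 0, 0.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingSizes {
    /** Encoded length in bytes; unmappable characters are replaced by '?' (63). */
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x20026));

        // Prints "55360, 56358": the high and low surrogate as decimal char values.
        System.out.println((int) s.charAt(0) + ", " + (int) s.charAt(1));

        Charset[] charsets = {
            StandardCharsets.UTF_16,     // 6 bytes: 2-byte byte-order mark + 4
            StandardCharsets.UTF_8,      // 4 bytes
            StandardCharsets.UTF_16BE,   // 4 bytes
            StandardCharsets.UTF_16LE,   // 4 bytes
            StandardCharsets.ISO_8859_1, // 1 byte: a single '?'
            StandardCharsets.US_ASCII    // 1 byte: a single '?'
        };
        for (Charset cs : charsets) {
            System.out.println(cs.name() + ": " + encodedLength(s, cs) + " byte(s)");
        }
    }
}
```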

This DataFlavor cannot deliver the correct SMP character, so the SMP character cannot be pasted into a Java application.

I think this is a bug in the JDK.
Since I don't have much experience with Swing, could somebody else confirm this conclusion?
Thanks.

Comment 14 johnsonlau 2009-08-29 19:50:07 UTC

About the SMP character selection issue:
I found that the position counting algorithm in View#getNextVisualPositionFrom() causes this problem.

Generally, NetBeans uses the DrawEngineLineView implementation.
As a temporary solution, I overrode View#getNextVisualPositionFrom() and provided my own counting algorithm to prevent
the low surrogate code unit from being selected.

	@Override
	public int getNextVisualPositionFrom(int pos, Bias b, Shape a, int direction, Bias[] biasRet) throws BadLocationException {
		
		switch (direction) {
		case WEST:
			if (pos == -1) {
				pos = Math.max(0, getEndOffset() - 1);
			} else {
				pos = Math.max(0, pos - 1);
			}

			if (pos != 0) {
				char[] chars = getDocument().getText(pos, 1).toCharArray();
				if (chars.length > 0 && Character.isLowSurrogate(chars[0])) {
					// SMP character
					return super.getNextVisualPositionFrom(pos, b, a, direction, biasRet);
				}
			}
			break;

		case EAST:
			int length = getDocument().getLength();
			if (pos == -1) {
				pos = getStartOffset();
			} else {
				pos = Math.min(pos + 1, length);
			}

			if (pos != length) {
				char[] chars = getDocument().getText(pos, 1).toCharArray();
				if (chars.length > 0 && Character.isLowSurrogate(chars[0])) {
					// SMP character
					return super.getNextVisualPositionFrom(pos, b, a, direction, biasRet);
				}
			}
			break;

		default:
			pos = super.getNextVisualPositionFrom(pos, b, a, direction, biasRet);
		}

		return pos;
	}
	
After doing this, NetBeans behaves well when selecting SMP characters.
Certainly, this algorithm might need more consideration.
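
The same idea can be expressed without the Swing View API. The sketch below is an illustration under assumptions, not the code in the attached patch: the class and method names are hypothetical, and it works on a plain CharSequence instead of a Document. It moves a caret offset one step left or right and never returns an offset that falls between a high and a low surrogate.

```java
final class CaretStep {
    /** True if offset lies between the two halves of a surrogate pair. */
    static boolean splitsPair(CharSequence text, int offset) {
        return offset > 0 && offset < text.length()
                && Character.isHighSurrogate(text.charAt(offset - 1))
                && Character.isLowSurrogate(text.charAt(offset));
    }

    /** Next caret offset to the right, stepping over a whole surrogate pair. */
    static int east(CharSequence text, int offset) {
        int next = Math.min(offset + 1, text.length());
        return splitsPair(text, next) ? next + 1 : next;
    }

    /** Next caret offset to the left, stepping over a whole surrogate pair. */
    static int west(CharSequence text, int offset) {
        int prev = Math.max(offset - 1, 0);
        return splitsPair(text, prev) ? prev - 1 : prev;
    }

    public static void main(String[] args) {
        // 'a', high surrogate, low surrogate, 'b': 4 chars, 3 code points
        String text = "a" + new String(Character.toChars(0x20026)) + "b";
        System.out.println(east(text, 1)); // 3: offset 2 would split the pair
        System.out.println(west(text, 3)); // 1
    }
}
```

When the starting offset is already on a code point boundary, String#offsetByCodePoints(offset, 1) and String#offsetByCodePoints(offset, -1) give the same stepping for a String, although unlike the sketch they throw at the ends of the text instead of clamping.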

Because JDK's View#getNextVisualPositionFrom() does not take SMP characters into account,
it might be more appropriate to file the selection issue against the JDK directly.
But considering both JDK5 and JDK6, what should NetBeans do?
Sun seems unwilling to release a new patched version of JDK5.

Comment 15 Vitezslav Stejskal 2009-08-31 11:36:53 UTC

Thanks for the analysis. The patch for View#getNextVisualPositionFrom looks ok to me. If I understand it correctly this
is going to fix the selection part of this issue, right? The other part about copy-pasting SMP characters is still going
to be broken. Thanks

Mila, could you please have a look at the patch as well? If there are no objections I'll apply the patch to
DrawEngineView as a workaround. I agree with johnsonlau that the fix should ultimately be done in JDK. Jirka, could you
guys please file an issue against JDK? Please refer to this issue, which contains detailed descriptions. And also please
add the link to the JDK bug here so that johnsonlau can follow conversation there. Thanks a lot.

Comment 16 Jiri Prox 2009-08-31 15:19:33 UTC

I've filed new issue to the JDK team, its number is 6877495, but the URL where it can be viewed is not public, so I
will not paste it here.

I'll try to transfer any further questions/comments from the JDK team here, so the conversation can continue in this issue.

Comment 17 johnsonlau 2009-08-31 15:46:49 UTC

Thanks for the attention, vstejskal.
There is no misunderstanding in your description.

The selection issue will be partially solved by the patch I provided.
But I cannot determine where the real cause of the paste issue is: the AWT Clipboard or just BasicTextUI#TextTransferHandler.
Interestingly, you can copy and paste an SMP character
if it is displayed and copied by the Java application itself
(use jTextField1.setText("\uD840\uDC26") to display the SMP character,
then copy and paste it within the Java application);
that works totally fine.
But it fails when copying from an outside application, like Notepad or Word, to a Java application.
The DataFlavors are different in the two scenarios.

Changing the logic in BasicTextUI#TextTransferHandler#getImportFlavor() did solve the problem.
Since I don't know the JDK well,
and maybe there is something wrong with the AWT Clipboard,
I cannot easily tell whether this is a workaround or a real bug fix.
I think I should leave this to the JDK team now and keep an eye on their progress.
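
The thread does not record what the changed logic looked like. One possible variant, shown purely as an assumption, is to prefer DataFlavor.stringFlavor whenever it is offered and to fall back to the first text/plain flavor otherwise, since reading the clipboard through stringFlavor returned the correct surrogate pair in the test in comment 13. The class and method names below are hypothetical, and this is not the code from the JDK patch attached to this issue.

```java
import java.awt.datatransfer.DataFlavor;

final class ImportFlavorSketch {
    /** Picks stringFlavor if present, else the first text/plain flavor, else null. */
    static DataFlavor choose(DataFlavor[] flavors) {
        DataFlavor plainText = null;
        for (DataFlavor flavor : flavors) {
            if (DataFlavor.stringFlavor.equals(flavor)) {
                return flavor; // delivered as a java.lang.String
            }
            if (plainText == null && flavor.getMimeType().startsWith("text/plain")) {
                plainText = flavor;
            }
        }
        return plainText;
    }

    /** Builds a flavor from a MIME string, hiding the checked exception. */
    static DataFlavor flavor(String mimeType) {
        try {
            return new DataFlavor(mimeType);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        DataFlavor reader = flavor("text/plain; class=java.io.Reader; charset=Unicode");
        DataFlavor chosen = choose(new DataFlavor[] { reader, DataFlavor.stringFlavor });
        System.out.println(chosen.getMimeType());
    }
}
```

In a real transfer handler the array would come from Transferable#getTransferDataFlavors(). Whether such a reordering is a fix or only a workaround is exactly the open question above.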

About the selection issue, just hold on a second please.
I think more things need to be done together to make NetBeans comfortable with SMP characters.

1) It looks weird when your cursor jumps from a line with BMP characters to the line above with SMP characters
(findBestSpan might need fixing too).
2) You can still use the mouse to separate the high surrogate and low surrogate of an SMP character.

And I modified the algorithm here to make it more compatible with the JDK, no matter how the JDK calculates positions.
Please review this and decide.

	public int getNextVisualPositionFrom(int pos, Bias b, Shape a, int direction, Bias[] biasRet) throws BadLocationException {
		
		switch (direction) {
		case WEST:
			{
				pos = super.getNextVisualPositionFrom(pos, b, a, direction, biasRet);
				char[] chars = getDocument().getText(pos, 1).toCharArray();
				if (chars.length > 0 && Character.isLowSurrogate(chars[0])) {
					// SMP character
					return super.getNextVisualPositionFrom(pos, b, a, direction, biasRet);
				}
			}
			break;

		case EAST:
			{
				pos = super.getNextVisualPositionFrom(pos, b, a, direction, biasRet);
				char[] chars = getDocument().getText(pos, 1).toCharArray();
				if (chars.length > 0 && Character.isLowSurrogate(chars[0])) {
					// SMP character
					return super.getNextVisualPositionFrom(pos, b, a, direction, biasRet);
				}
			}
			break;

		default:
			pos = super.getNextVisualPositionFrom(pos, b, a, direction, biasRet);
			break;
		}

		return pos;
	}

Comment 18 Vitezslav Stejskal 2009-08-31 16:36:25 UTC

Re. "I've filed new issue to the JDK team, its number is 6877495..." - Thanks Jirka
Re. the new patch - It looks better of course. Thanks
Re. the additional two problems with navigation - I'm not sure how to fix them, sorry.

Comment 19 johnsonlau 2009-09-08 16:56:05 UTC

Hi all.

I have attached my modifications for this issue.
I have tested some cases, including navigation with both the arrow keys and the mouse,
and text selection on supplementary characters, and they are no longer broken into 2 pieces.
I'm new to NetBeans development.
Would somebody please review it and merge it into the source repo? Thanks.
The copy/paste issue is still waiting for the JDK team's response.

Since the issue filed against the JDK is not public,
I would like to add a few more things here that are connected to fixing this issue.

The fontconfig.properties settings released with the JDK (at least up to 1.6.0_18)
do not support displaying supplementary characters.

Some Chinese characters were introduced as supplementary characters.
AFAIK, on Windows Vista and Windows 7, supplementary characters are provided in fonts separate from the BMP ones.
For example, in a Simplified Chinese environment, as used by most mainland China users,
SimSun is for BMP characters and SimSun-ExtB is for supplementary characters.
(Also MingLiU / MingLiU-ExtB for Taiwan users, MingLiU_HKSCS / MingLiU_HKSCS-ExtB for Hong Kong users).

Some ethnic minority scripts, such as Yi syllables, were introduced too.
On Windows 7, these characters can be found in the Microsoft Yi Baiti font.

Java requires a single Unicode font containing all sorts of characters for text to display well.
If a character is missing from the font, it cannot be displayed properly.
In contrast, most operating systems use a smarter approach (called font linking?).
They fall back to another linked font if the current font doesn't contain the character,
which makes it easier for an application to render its output,
regardless of the font that the user actually chooses.

Besides, it doesn't seem possible today to find a font that contains all Unicode characters,
so I think font linking is a perfect way to solve this problem,
and extremely helpful for i18n or complex desktop applications.
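
The fallback idea can be sketched independently of any real font API. In the hypothetical class below, each "font" is just a name plus a predicate saying which code points it covers (in a Swing application, java.awt.Font#canDisplay(int) could play that role), and the text is split into runs, each assigned to the first font in the chain that covers it. The coverage ranges are deliberately simplified for the example and do not describe the real SimSun fonts.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntPredicate;

final class FontFallbackSketch {
    /** Splits text into "fontName:run" pieces, using the first font that covers each code point. */
    static List<String> assignRuns(String text, Map<String, IntPredicate> chain) {
        List<String> runs = new ArrayList<>();
        String currentFont = null;
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);   // read a whole code point, not a char
            String font = "(none)";         // used when no font in the chain covers cp
            for (Map.Entry<String, IntPredicate> e : chain.entrySet()) {
                if (e.getValue().test(cp)) {
                    font = e.getKey();
                    break;
                }
            }
            if (!font.equals(currentFont) && run.length() > 0) {
                runs.add(currentFont + ":" + run); // font changed: close the previous run
                run.setLength(0);
            }
            currentFont = font;
            run.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (run.length() > 0) {
            runs.add(currentFont + ":" + run);
        }
        return runs;
    }

    public static void main(String[] args) {
        Map<String, IntPredicate> chain = new LinkedHashMap<>();
        chain.put("SimSun", cp -> cp <= 0xFFFF);        // simplified: BMP only
        chain.put("SimSun-ExtB", cp -> cp >= 0x20000);  // simplified: plane 2 and above
        String text = "\u4E2D" + new String(Character.toChars(0x20026)) + "A";
        System.out.println(assignRuns(text, chain).size() + " runs");
    }
}
```

The point of the design is that the user-selected font stays first in the chain, and other fonts are consulted only for the code points it lacks.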

On Windows, Java also provides five predefined logical fonts, each composed of several physical fonts.
But they can be neither controlled nor added to or modified by the application.
And they affect all applications, which hampers UI customization or personalization for a single application.

Eclipse/SWT uses native controls, so applications built on SWT still look good on a Chinese OS,
since Windows eventually falls back to SimSun no matter what font you choose,
even Verdana or Consolas, which contain little other than Latin.
But I have to put up with the bad looks of Swing applications, otherwise undisplayable characters would appear.
Java doesn't do a fallback.

This issue was submitted to Sun over 3 years ago
(http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6407157),
but there has been no further response so far.

Hopefully Sun will take this issue seriously.

Comment 20 johnsonlau 2009-09-08 16:57:22 UTC

Created attachment 87300 [details]
modification diff on supplementary character issue

Comment 21 Vitezslav Stejskal 2009-09-09 08:18:27 UTC

Thanks for the patch. Mila and I will review it and apply it if all is ok. Jirka, could you please track this as NetFIX?
I'd just like you to kick me if I forget and don't act in a timely manner. Plus, if you guys have giveaway stuff for
NetFIX participants, johnsonlau definitely deserves to be thanked for all his hard work on this issue. Thanks

Comment 22 Jiri Kovalsky 2009-09-09 13:07:50 UTC

Not a problem. Adding the NETFIX keyword, and added to the NetFIX Pool [1] as an already Patched issue. Thanks for the notice Vito!

[1] http://wiki.netbeans.org/NetFIXIssues

Comment 23 Miloslav Metelka 2009-09-09 13:41:07 UTC

BTW we may possibly simplify
char[] chars = document().getText(pos, 1)...
to
char c = DocumentUtilities.getText(document()).charAt(pos);

Comment 24 Miloslav Metelka 2009-09-11 08:16:36 UTC

I have integrated the patch, in which I only changed doc.getText() -> DocumentUtilities.getText()...
Two questions:
1) Did you encounter a real case where a Mark would be inserted at an offset between the high and low surrogate chars, or is the code just there in case it
happens? Does this code e.g. fix caret movements over the high/low surrogate chars?
IMHO shifting the offset inside the code that inserts the mark is relatively late; ideally it should be done by the callers instead. With the integrated code it might
be that doc.createPosition(offset).getOffset() != offset, which is a bit weird.
2) The patch against Utilities.java just fixes getPositionAbove(). Does the present code suit your needs or is it also necessary to patch getPositionBelow()?
http://hg.netbeans.org/jet-main/rev/a6bbed8c0441

Comment 25 johnsonlau 2009-09-11 16:07:51 UTC

Hi, mmetelka. Thanks for your hard work.

Re 1) This fixes the selection and cursor position issue when the mouse is used.
It's important to keep the consistency and integrity of a string,
so I think it is better practice to split a document by code point rather than by single char.
Positioning should also follow code points rather than offsets that are meaningless in Unicode,
which might be really confusing, as you said.

Re 2) getPositionBelow() already has similar logic. It was fixed by #70254.
I don't know why issue #70254 left getPositionAbove() unfixed.
I tested and found that getPositionBelow() behaves as I expected,
so I decided to fix getPositionAbove() with similar logic.

Comment 26 Quality Engineering 2009-09-13 21:05:07 UTC

Integrated into 'main-golden', will be available in build *200909131354* on http://bits.netbeans.org/dev/nightly/ (upload may still be in progress)
Changeset: http://hg.netbeans.org/main-golden/rev/a6bbed8c0441
User: Miloslav Metelka <mmetelka@netbeans.org>
Log: #164820 - Cannot display or paste supplementary characters.

Comment 27 johnsonlau 2010-09-02 03:56:19 UTC

Created attachment 101816 [details]
JDK bug #6877495 patch

A patch for JDK bug #6877495.
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6877495

Verified on Ubuntu 10.04 + OpenJDK 6b20.
With the patch applied, supplementary characters can be pasted from a native GTK application (like gedit) into a Java application and vice versa.