17356 – I18N - Multi-bytes (CJK) characters should be counted by 2 columns.

Summary: I18N - Multi-bytes (CJK) characters should be counted by 2 columns.

Status:	RESOLVED FIXED

Alias:	None

Product:	editor
Classification:	Unclassified
Component:	-- Other --
Version:	3.x
Hardware:	PC Windows ME/2000

Importance:	P3 blocker
Assignee:	Vitezslav Stejskal

URL:
Keywords:	I18N, NETFIX

Duplicates (1):	19310
Depends on:
Blocks:

Reported:	2001-11-07 01:16 UTC by Jo, Okgeun
Modified:	2009-12-05 14:35 UTC
CC List:	4 users

See Also:
Issue Type:	ENHANCEMENT
Exception Reporter:

Attachments
#17356 patch (need reviewed and further mod) (15.44 KB, patch) 2009-09-09 17:44 UTC, johnsonlau
counting result comparison (Tibetan ch line, non-Tibetan line, and Notepad) (44.77 KB, image/png) 2009-09-09 17:45 UTC, johnsonlau
Counting algorithm demo on CJK / Half-width KATAKANA / Latin mixed (17.47 KB, image/png) 2009-09-09 17:46 UTC, johnsonlau
Screenshot: Notepad counts differently on bmp and supplementary characters. (68.03 KB, image/png) 2009-09-10 16:38 UTC, johnsonlau
Screenshot: Eclipse behaves the same with notepad. (109.53 KB, image/png) 2009-09-10 16:39 UTC, johnsonlau
Character samples (363 bytes, text/plain) 2009-09-27 16:51 UTC, johnsonlau
Some supplementary characters (CJK characters) (21 bytes, text/plain) 2009-09-27 17:01 UTC, johnsonlau
6x83 Character samples (998 bytes, text/plain) 2009-09-27 17:23 UTC, johnsonlau

Description Jo, Okgeun 2001-11-07 01:16:29 UTC

Multibyte characters such as Korean ones occupy 2 columns per character.
The IDE source editors don't account for this. They increase the column counter by 1
regardless of the character being typed.

In summary, the Editor module should take into account that multibyte characters
such as Korean ones are 2 columns wide. (Chinese and Japanese characters
are probably the same.)

Comment 1 Miloslav Metelka 2001-11-07 16:29:18 UTC

Are there any other editors that count the Korean and Japanese chars
as 2 columns?

Comment 2 Jan Chalupa 2001-11-27 12:28:12 UTC

Target milestone -> 3.3.1.

Comment 3 Keiichi Oono 2002-02-14 07:11:37 UTC

Added the "I18N" prefix to the Summary, and "jf4jbug@netbeans.org" to cc, for
tracking this bug.

Comment 4 Keiichi Oono 2002-02-14 08:11:16 UTC

In C, we have wscol() to get the column width of wide characters.
This function returns the screen display width. Please type on
Solaris:
   % man wscol
Please note that the column count is not equal to the byte count (an
alphanumeric is 1 column, but it's 2 bytes in Unicode). And not all CJK
characters are 2 columns; some of them are 1 column. So if Java has no
method like C's wscol(), I guess this is difficult to implement.
Do you have any information on how to get the column width?

For wide characters, the current editor shows the "number of characters"
instead of the number of columns.
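
To illustrate the point that bytes, characters and columns are three different counts, here is a minimal Java sketch. The class and method names are made up for this example, and the column widths in the comments are the conventional terminal widths, not something Java reports.

```java
import java.nio.charset.StandardCharsets;

final class BytesVersusColumns {
    /** Number of bytes the text occupies when encoded as UTF-16 (big-endian, no BOM). */
    static int utf16Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String latin = "A";          // U+0041, conventionally 1 column
        String hangul = "\uD55C";    // U+D55C HANGUL SYLLABLE HAN, conventionally 2 columns
        String halfKana = "\uFF71";  // U+FF71 HALFWIDTH KATAKANA LETTER A, conventionally 1 column
        for (String s : new String[] {latin, hangul, halfKana}) {
            System.out.println(s.length() + " char(s), " + utf16Bytes(s) + " byte(s)");
        }
    }
}
```

All three strings report 1 char and 2 bytes, so neither count can distinguish a 1-column character from a 2-column one.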

Comment 5 Miloslav Metelka 2002-04-12 13:40:08 UTC

*** Issue 19310 has been marked as a duplicate of this issue. ***

Comment 6 Marek Grummich 2002-07-22 12:11:14 UTC

Set target milestone to TBD

Comment 7 Marek Grummich 2002-07-22 12:15:05 UTC

Set target milestone to TBD

Comment 8 Miloslav Metelka 2004-11-01 10:32:15 UTC

I searched the internet for "java wscol" and for "java wide character
width" but found nothing really useful. So, regrettably, there seems to
be no way to obtain this information in Java at this time, and I am
changing this issue to an enhancement. If you are aware of a way to get
this information in Java, please update the issue. Thanks.

Comment 9 Vitezslav Stejskal 2009-05-11 12:31:08 UTC

*** Issue 164820 has been marked as a duplicate of this issue. ***

Comment 10 Vitezslav Stejskal 2009-05-12 15:52:58 UTC

See also issue #164820.

Comment 11 johnsonlau 2009-09-09 17:42:28 UTC

Thanks to the wcwidth/wcswidth implementation provided by Markus Kuhn,
I was able to port his C implementation to Java.
http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

I think it helps with counting the column width taken up by characters.
I have attached the Java version of wcwidth/wcswidth and a patch here.
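
For readers without the attachment, the general idea can be sketched as below. This is not the attached patch: it is a simplified illustration that checks only the East Asian wide ranges listed in Markus Kuhn's wcwidth.c and treats everything else as 1 column. It ignores zero-width combining characters and control characters, which the real wcwidth handles. The class and method names are hypothetical.

```java
final class ColumnWidthSketch {
    /** Returns 2 for code points in the listed East Asian wide ranges, otherwise 1. */
    static int widthOfCodePoint(int cp) {
        boolean wide =
               (cp >= 0x1100 && cp <= 0x115F)                    // Hangul Jamo initial consonants
            || cp == 0x2329 || cp == 0x232A                      // angle brackets
            || (cp >= 0x2E80 && cp <= 0xA4CF && cp != 0x303F)    // CJK radicals through Yi
            || (cp >= 0xAC00 && cp <= 0xD7A3)                    // Hangul syllables
            || (cp >= 0xF900 && cp <= 0xFAFF)                    // CJK compatibility ideographs
            || (cp >= 0xFE10 && cp <= 0xFE19)                    // vertical forms
            || (cp >= 0xFE30 && cp <= 0xFE6F)                    // CJK compatibility forms
            || (cp >= 0xFF00 && cp <= 0xFF60)                    // fullwidth forms
            || (cp >= 0xFFE0 && cp <= 0xFFE6)                    // fullwidth signs
            || (cp >= 0x20000 && cp <= 0x2FFFD)                  // supplementary ideographic plane
            || (cp >= 0x30000 && cp <= 0x3FFFD);                 // tertiary ideographic plane
        return wide ? 2 : 1;
    }

    /** Sums the widths of all code points in the text, stepping over surrogate pairs. */
    static int widthOfText(String text) {
        int columns = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            columns += widthOfCodePoint(cp);
            i += Character.charCount(cp);
        }
        return columns;
    }
}
```

With this sketch, `widthOfText("a\uD55Cb")` gives 4 (Latin, Hangul, Latin), while a half-width Katakana character such as U+FF71 falls outside the fullwidth ranges and counts as 1.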

I have one doubt about my patch.
+ if (Character.isHighSurrogate(buffer[offset])) {
+ codePoint = Character.toCodePoint(buffer[offset], buffer[offset + 1]);
Should offset+1 be tested to make sure it lies within the range (offset, offset2)?
Normally a low surrogate follows a high surrogate.
But if the buffer range is cut in the wrong place, or the offset2 parameter is wrong,
an IndexOutOfBoundsException might be thrown.
I think it might be better to throw an exception that reports the error explicitly.
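
One possible way to address this doubt is sketched below, under the assumption that `limit` plays the role of offset2 (exclusive end of the valid range). The lenient variant relies on the standard `Character.codePointAt(char[], int, int)`, which never reads at or beyond the limit. The strict variant throws an explicit exception instead. The class and method names are hypothetical and are not taken from the patch.

```java
final class SurrogateSafeDecode {
    /**
     * Lenient variant: decodes the code point starting at offset without reading
     * at or beyond limit. A high surrogate whose partner is cut off or missing
     * is returned as-is.
     */
    static int codePointWithin(char[] buffer, int offset, int limit) {
        return Character.codePointAt(buffer, offset, limit);
    }

    /** Strict variant: rejects a high surrogate whose low surrogate is not inside the range. */
    static int codePointStrict(char[] buffer, int offset, int limit) {
        char first = buffer[offset];
        if (Character.isHighSurrogate(first)) {
            if (offset + 1 >= limit || !Character.isLowSurrogate(buffer[offset + 1])) {
                throw new IllegalArgumentException("Unpaired high surrogate at index " + offset);
            }
            return Character.toCodePoint(first, buffer[offset + 1]);
        }
        return first;
    }
}
```

Which variant fits better depends on whether a badly cut range should be tolerated (lenient) or reported as a caller error (strict).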

Also, Tibetan characters seem to be displayed improperly in Java.
According to the algorithm, Tibetan characters are counted as 1 column wide, like Latin characters,
not 2 columns wide like CJK characters.
When I use Microsoft Himalaya, a font with Tibetan support,
to display text containing Tibetan characters,
Java displays them 2 columns wide, which makes the algorithm's result look weird.
Windows 7's Notepad counts and displays Tibetan characters as 1 column wide,
so it has no such problem.
In the screenshot, you can see that the counting algorithm gets the same result as Notepad on that line,
but differs on the other line, which contains no such characters.
Could someone give me an explanation?

And this algorithm calculates some characters like Yi SYLLABLE differently from Notepad.
I'm still working on these points.
But it works great for most CJK characters, half-width Katakana characters and Latin characters.

Comment 12 johnsonlau 2009-09-09 17:44:13 UTC

Created attachment 87383 [details]
#17356 patch (need reviewed and further mod)

Comment 13 johnsonlau 2009-09-09 17:45:13 UTC

Created attachment 87384 [details]
counting result comparison (Tibetan ch line, non-Tibetan line, and Notepad)

Comment 14 johnsonlau 2009-09-09 17:46:05 UTC

Created attachment 87385 [details]
Counting algorithm demo on CJK / Half-width KATAKANA / Latin mixed

Comment 15 johnsonlau 2009-09-10 16:35:42 UTC

Sorry, I was wrong: Windows 7's Notepad does not count characters by their column widths.
Like NetBeans currently, it counts only the number of characters.

Supplementary characters are counted as 2 columns
only because, on Windows,
wchar_t is 2 bytes and a supplementary character is represented as a pair of wchar_t values,
so it is counted as 2 characters.
Eclipse does the same thing.
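
The same effect can be reproduced in Java, whose `char` is also a 16-bit UTF-16 code unit. The minimal sketch below (class and method names are made up for this example) contrasts counting code units with counting code points.

```java
final class CodeUnitsVersusCodePoints {
    /** Counts UTF-16 code units, which is what a naive "character" counter sees. */
    static int codeUnits(String text) {
        return text.length();
    }

    /** Counts Unicode code points, treating each surrogate pair as one character. */
    static int codePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        String bmp = "\u4E2D";                  // U+4E2D, a CJK ideograph in the BMP
        String supplementary = "\uD840\uDC00";  // U+20000, a CJK ideograph in a supplementary plane
        System.out.println(codeUnits(bmp) + " unit(s), " + codePoints(bmp) + " code point(s)");
        System.out.println(codeUnits(supplementary) + " unit(s), "
                + codePoints(supplementary) + " code point(s)");
    }
}
```

The BMP ideograph is 1 unit and 1 code point, while the supplementary one is 2 units and 1 code point. A counter based on code units therefore shows 2 for it by accident, not because it measured display width.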

I confirmed that CJK characters in the BMP and in the supplementary planes are counted differently in both Notepad and Eclipse.

As for the counting difference between lines with and without Tibetan characters,
it might be caused by the proportional font.
A proportional font would certainly make the algorithm's results look weird, even though they are reasonable.

Comment 16 johnsonlau 2009-09-10 16:38:39 UTC

Created attachment 87446 [details]
Screenshot: Notepad counts differently on bmp and supplementary characters.

Comment 17 johnsonlau 2009-09-10 16:39:19 UTC

Created attachment 87447 [details]
Screenshot: Eclipse behaves the same with notepad.

Comment 18 Vitezslav Stejskal 2009-09-11 09:25:18 UTC

Maybe another NetFIX candidate.

Comment 19 Jiri Kovalsky 2009-09-11 10:53:32 UTC

Good idea Vito. I have added this candidate to the NetFIX Pool [1]. Thanks.

[1] http://wiki.netbeans.org/NetFIXIssues

Comment 20 Michel Graciano 2009-09-24 18:00:25 UTC

Could anyone attach a file with this kind of characters for future tests?

Comment 21 johnsonlau 2009-09-27 16:51:07 UTC

Created attachment 88405 [details]
Character samples

Comment 22 johnsonlau 2009-09-27 17:01:00 UTC

Hi hmichel, I've attached a file with some character samples.
I hope it is helpful.

The file contains
1) Latin,
2) Chinese characters,
3) Japanese Kanji, KATAKANA, HIRAGANA and half-width KATAKANA characters,
4) Korean Hanja and Hangul characters (I barely know Korean. The Korean characters are copied from Wikipedia).
It should cover most cases of interest.

I have also provided a file with some supplementary characters.

Comment 23 johnsonlau 2009-09-27 17:01:45 UTC

Created attachment 88406 [details]
Some supplementary characters (CJK characters)

Comment 24 johnsonlau 2009-09-27 17:22:26 UTC

And this is a text file in which every line ends at exactly 83 columns (6x83).

Comment 25 johnsonlau 2009-09-27 17:23:19 UTC

Created attachment 88409 [details]
6x83 Character samples

Comment 26 Jiri Kovalsky 2009-12-03 15:14:23 UTC

Vito, can you please review Lau's latest patch and integrate it if you find it safe? Thanks a lot!

Comment 27 Vitezslav Stejskal 2009-12-03 23:14:01 UTC

Sorry for the delay - http://hg.netbeans.org/jet-main/rev/f17598cd4963

Comment 28 Jiri Kovalsky 2009-12-04 05:30:10 UTC

Great job guys. Thanks!

Comment 29 Quality Engineering 2009-12-05 14:35:43 UTC

Integrated into 'main-golden', will be available in build *200912051400* on http://bits.netbeans.org/dev/nightly/ (upload may still be in progress)
Changeset: http://hg.netbeans.org/main/rev/f17598cd4963
User: Vita Stejskal <vstejskal@netbeans.org>
Log: #17356: applying johnsonlau's patch