Initial import of linebreak
parent 56a5d7f63f
commit 74ca5511d7
34 changed files with 6889 additions and 0 deletions
8
linebreak/linebreak/AUTHORS
Normal file
@@ -0,0 +1,8 @@
Wu Yongwei. Designed and implemented liblinebreak.

Nikolay Pultsin. Put forward the original requirements on liblinebreak,
performed tests, and made a lot of suggestions on the initial versions.

Thomas Klausner. Autoconfiscated and libtoolized liblinebreak.

Tom Hacohen. Added word boundaries support.
32
linebreak/linebreak/CVS/Entries
Normal file
@@ -0,0 +1,32 @@
/AUTHORS/1.2/Wed Jan 18 14:26:13 2012//
/ChangeLog/1.78/Sat Aug 11 07:35:23 2012//
/Doxyfile/1.7/Sat Aug 11 06:55:18 2012//
/LICENCE/1.4/Sat Aug 11 07:35:23 2012//
/LineBreak1.sed/1.2/Sun Dec 7 10:54:37 2008//
/LineBreak2.sed/1.2/Sun Dec 7 10:54:37 2008//
/Makefile.am/1.8/Sat Aug 11 06:55:18 2012//
/Makefile.gcc/1.4/Thu Jan 19 14:03:34 2012//
/Makefile.msvc/1.5/Sat Aug 11 05:57:50 2012//
/NEWS/1.7/Sat Aug 11 06:55:18 2012//
/README/1.8/Sat Aug 11 06:55:18 2012//
/bootstrap/1.1/Fri Dec 12 12:01:39 2008//
/configure.ac/1.6/Sat Aug 11 06:55:18 2012//
/filter_dup.c/1.1/Sat Feb 23 11:53:28 2008//
/libunibreak.pc.in/1.1/Sat Aug 11 06:55:18 2012//
/linebreak.c/1.25/Sat May 7 19:55:10 2011//
/linebreak.h/1.14/Sat May 7 19:55:10 2011//
/linebreakdata.c/1.5/Sat May 7 19:40:20 2011//
/linebreakdata1.tmpl/1.1/Sat Feb 23 11:53:28 2008//
/linebreakdata2.tmpl/1.2/Sun Mar 2 07:30:43 2008//
/linebreakdata3.tmpl/1.1/Sat Feb 23 11:53:28 2008//
/linebreakdef.c/1.12/Sat May 7 19:55:10 2011//
/linebreakdef.h/1.12/Sat May 7 19:55:10 2011//
/purge/1.1/Fri Dec 12 12:01:39 2008//
/sort_numeric_hex.py/1.2/Wed Jan 18 14:26:13 2012//
/wordbreak.c/1.3/Sat Feb 4 14:32:57 2012//
/wordbreak.h/1.4/Sat Feb 4 14:32:58 2012//
/wordbreakdata.c/1.2/Wed Jan 18 14:26:13 2012//
/wordbreakdata1.tmpl/1.2/Wed Jan 18 14:26:13 2012//
/wordbreakdata2.tmpl/1.2/Wed Jan 18 14:26:13 2012//
/wordbreakdef.h/1.2/Wed Jan 18 14:26:13 2012//
D
1
linebreak/linebreak/CVS/Repository
Normal file
@@ -0,0 +1 @@
common/tools/linebreak
1
linebreak/linebreak/CVS/Root
Normal file
@@ -0,0 +1 @@
:pserver:anonymous@vimgadgets.cvs.sourceforge.net:/cvsroot/vimgadgets
512
linebreak/linebreak/ChangeLog
Normal file
|
@ -0,0 +1,512 @@
|
||||||
|
2012-08-11 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* LICENCE: Add copyright information about Tom Hacohen.
|
||||||
|
|
||||||
|
2012-08-11 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* configure.ac (AC_INIT): Change the library name and version to
|
||||||
|
`libunibreak' and `1.0'.
|
||||||
|
(AC_PROG_LN_S): New macro.
|
||||||
|
(AC_OUTPUT): Change to `libunibreak.pc'.
|
||||||
|
* Doxyfile: (PROJECT_NAME): Change to `libunibreak'.
|
||||||
|
(PROJECT_NUMBER): Change to `1.0'.
|
||||||
|
* Makefile.am (lib_LTLIBRARIES): Change to `libunibreak.la'.
|
||||||
|
(pkgconfig_DATA): Change to `libunibreak.pc'.
|
||||||
|
(libunibreak_la_LDFLAGS): Reset the version to `1:0'.
|
||||||
|
(install-exec-hook): Replace the static library liblinebreak.a with
|
||||||
|
a symlink to libunibreak.a.
|
||||||
|
* NEWS: Add information about libunibreak 1.0.
|
||||||
|
* README: Change the library name, and add information about word
|
||||||
|
break.
|
||||||
|
|
||||||
|
2012-08-11 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* Makefile.msvc: Change the library name to `libunibreak', and the
|
||||||
|
output library to `unibreak.lib'.
|
||||||
|
|
||||||
|
2012-02-04 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* wordbreak.h (WORDBREAK_INSIDEACHAR): Change from
|
||||||
|
WORDBREAK_INSIDECHAR.
|
||||||
|
* wordbreak.c (set_brks_to): Change `WORDBREAK_INSIDECHAR' to
|
||||||
|
`WORDBREAK_INSIDEACHAR'.
|
||||||
|
|
||||||
|
2012-01-19 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* wordbreak.h: Change angle brackets to quotation marks (which
|
||||||
|
caused build errors).
|
||||||
|
|
||||||
|
2012-01-19 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* Makefile.gcc (CFILES): Add wordbreak.c.
|
||||||
|
(WordBreakProperty.txt): New target.
|
||||||
|
(wordbreakdata): New target.
|
||||||
|
|
||||||
|
2012-01-19 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* Makefile.am (liblinebreak_la_SOURCES): Remove wordbreakdata.c.
|
||||||
|
(EXTRA_DIST): Add wordbreakdata.c, wordbreakdata1.tmpl, and
|
||||||
|
wordbreakdata2.tmpl.
|
||||||
|
|
||||||
|
2012-01-19 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* Makefile.msvc: Add wordbreak files.
|
||||||
|
|
||||||
|
2012-01-18 Tom Hacohen <tom@stosb.com>
|
||||||
|
|
||||||
|
Add word breaking support.
|
||||||
|
* AUTHORS: Add `Tom Hacohen'.
|
||||||
|
* Makefile.am (include_HEADERS): Add header files for word breaking.
|
||||||
|
(liblinebreak_la_SOURCES): Add source files for word breaking.
|
||||||
|
(sort_numeric_hex.py): Add `sort_numeric_hex.py'.
|
||||||
|
(distclean-local): Clean also `WordBreakData.txt'.
|
||||||
|
(WordBreakProperty.txt): New target.
|
||||||
|
(wordbreakdata): New target.
|
||||||
|
* sort_numeric_hex.py: New file.
|
||||||
|
* wordbreak.c: New file.
|
||||||
|
* wordbreak.h: New file.
|
||||||
|
* wordbreakdef.h: New file.
|
||||||
|
* wordbreakdata.c: New file.
|
||||||
|
* wordbreakdata1.tmpl: New file.
|
||||||
|
* wordbreakdata2.tmpl: New file.
|
||||||
|
|
||||||
|
2011-05-17 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
Add support for pkg-config (thanks to Tom Hacohen).
|
||||||
|
* liblinebreak.pc.in: New file.
|
||||||
|
* configure.ac (AC_OUTPUT): Add `liblinebreak.pc'.
|
||||||
|
* Makefile.am (pkgconfig_DATA): Set to `liblinebreak.pc'.
|
||||||
|
(pkgconfigdir): Set to `$(libdir)/pkgconfig'.
|
||||||
|
|
||||||
|
2011-05-07 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* README: Update the reference to UAX #14-26, for Unicode 6.0.0.
|
||||||
|
|
||||||
|
2011-05-07 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* configure.ac (AC_INIT): Increase the version to 2.1.
|
||||||
|
* Makefile.am (liblinebreak_la_LDFLAGS): Set the version-info to
|
||||||
|
`2:1'.
|
||||||
|
|
||||||
|
2011-05-07 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* LICENCE: Update the copyright year.
|
||||||
|
|
||||||
|
2011-05-07 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
Update for the 2.1 release.
|
||||||
|
* Doxyfile (PROJECT_NUMBER): Set to `2.1'.
|
||||||
|
* NEWS: Add information about the 2.1 release.
|
||||||
|
* linebreak.h (LINEBREAK_VERSION): Set to `0x0201'.
|
||||||
|
* linebreak.h: Update comments.
|
||||||
|
* linebreak.c: Ditto.
|
||||||
|
* linebreakdef.h: Ditto.
|
||||||
|
* linebreakdef.c: Ditto.
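Reviewer note: the entry above sets `LINEBREAK_VERSION` to `0x0201` for release 2.1. A minimal sketch of how a caller might unpack such a value, assuming (the ChangeLog does not state this) that the high byte carries the major number and the low byte the minor. `SKETCH_VERSION` and both helper names are hypothetical and are not part of the library.

```c
#include <assert.h>

/* The value the ChangeLog gives for release 2.1. */
#define SKETCH_VERSION 0x0201

/* Assumed layout: 0xMMmm, with MM = major and mm = minor. */
int version_major_sketch(int version)
{
    assert(version >= 0);
    return (version >> 8) & 0xFF;
}

int version_minor_sketch(int version)
{
    assert(version >= 0);
    return version & 0xFF;
}
```

Under that assumption, `version_major_sketch(SKETCH_VERSION)` yields 2 and `version_minor_sketch(SKETCH_VERSION)` yields 1.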
|
||||||
|
|
||||||
|
2011-05-07 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* linebreakdata.c: Regenerate from LineBreak-6.0.0.txt.
|
||||||
|
|
||||||
|
2011-05-07 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* linebreak.c (set_linebreaks): Fix the assertion failure when
|
||||||
|
U+FFFC (OBJECT REPLACEMENT CHARACTER) appears at the beginning of a
|
||||||
|
line (thanks to Tom Hacohen).
|
||||||
|
|
||||||
|
2010-01-03 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* LICENCE: Update the copyright year.
|
||||||
|
|
||||||
|
2010-01-03 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* NEWS: Add information about the 2.0 release.
|
||||||
|
|
||||||
|
2010-01-03 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* Doxyfile (PROJECT_NUMBER): Set to `2.0'.
|
||||||
|
(HAVE_DOT): Set to `YES'.
|
||||||
|
|
||||||
|
2010-01-03 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* linebreak.c: Update the version number in comment to 2.0.
|
||||||
|
* linebreak.h: Ditto.
|
||||||
|
* linebreakdef.c: Ditto.
|
||||||
|
* linebreakdef.h: Ditto.
|
||||||
|
|
||||||
|
2009-12-17 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
Change the values of enum BreakAction to the same length.
|
||||||
|
* linebreak.c (DIRECT_BRK): Rename to DIR_BRK.
|
||||||
|
(INDIRECT_BRK): Rename to IND_BRK.
|
||||||
|
(CM_INDIRECT_BRK): Rename to CMI_BRK.
|
||||||
|
(CM_PROHIBITED_BRK): Rename to CMP_BRK.
|
||||||
|
(PROHIBITED_BRK): Rename to PRH_BRK.
|
||||||
|
|
||||||
|
2009-11-29 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* Doxyfile (TAB_SIZE): Set to the correct size `4', as used in the
|
||||||
|
source files.
|
||||||
|
|
||||||
|
2009-11-29 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
Update files according to UAX #14-24, for Unicode 5.2.0.
|
||||||
|
* linebreak.c: Update comments about UAX #14.
|
||||||
|
* linebreak.h: Ditto.
|
||||||
|
* linebreakdef.c: Ditto.
|
||||||
|
* linebreakdef.h: Ditto.
|
||||||
|
(LBP_CP): New enumerator for the new `CP' class as defined in
|
||||||
|
UAX #14-24.
|
||||||
|
* linebreak.c (baTable): Update for the new class `CP'.
|
||||||
|
* linebreakdata.c: Regenerate from LineBreak-5.2.0.txt.
|
||||||
|
* README: Update the reference to UAX #14-24, for Unicode 5.2.0.
|
||||||
|
|
||||||
|
2009-05-03 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* NEWS: Add information about the 1.2 release.
|
||||||
|
|
||||||
|
2009-04-30 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
Optimize the Doxygen output.
|
||||||
|
* linebreak.c (lb_prop_index): Adjust its definition format
|
||||||
|
slightly.
|
||||||
|
|
||||||
|
2009-04-30 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* Doxyfile (USE_WINDOWS_ENCODING): Remove obsolete tag.
|
||||||
|
(DETAILS_AT_TOP): Ditto.
|
||||||
|
(MAX_DOT_GRAPH_WIDTH): Ditto.
|
||||||
|
(MAX_DOT_GRAPH_HEIGHT): Ditto.
|
||||||
|
(REFERENCED_BY_RELATION): Set to `NO'.
|
||||||
|
(REFERENCES_RELATION): Ditto.
|
||||||
|
(EXCLUDE): Add `filter_dup.c'.
|
||||||
|
|
||||||
|
2009-04-28 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* linebreak.c (lb_get_next_char_utf8): Fix the issue that the index
|
||||||
|
can point to the middle of a UTF-8 sequence if End of String (EOS)
|
||||||
|
is encountered prematurely (thanks to Nikolay Pultsin and Rick Xu).
|
||||||
|
(lb_get_next_char_utf16): Fix the issue that the index can point to
|
||||||
|
the middle of a UTF-16 surrogate pair if EOS is encountered
|
||||||
|
prematurely.
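Reviewer note: the 2009-04-28 fix is about where the caller's index ends up when the buffer stops in the middle of a multi-byte sequence. The following is a minimal sketch of that idea, not the library's code: `next_char_utf8_sketch` and `SKETCH_EOS` are hypothetical names, and validation of continuation bytes is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical end-of-string marker for this sketch. */
#define SKETCH_EOS 0xFFFFFFFFu

/*
 * Decode one code point from s[*ip .. len).  If the buffer ends before the
 * sequence is complete, return SKETCH_EOS and leave *ip untouched, so the
 * index never points into the middle of a sequence.
 */
uint32_t next_char_utf8_sketch(const unsigned char *s, size_t len, size_t *ip)
{
    unsigned char lead;
    size_t need;            /* total bytes in this sequence */
    size_t k;
    uint32_t cp;

    assert(s != NULL && ip != NULL && *ip <= len);
    if (*ip == len)
        return SKETCH_EOS;

    lead = s[*ip];
    if (lead < 0x80) {
        need = 1; cp = lead;
    } else if ((lead & 0xE0) == 0xC0) {
        need = 2; cp = lead & 0x1F;
    } else if ((lead & 0xF0) == 0xE0) {
        need = 3; cp = lead & 0x0F;
    } else if ((lead & 0xF8) == 0xF0) {
        need = 4; cp = lead & 0x07;
    } else {
        *ip += 1;           /* stray byte: pass it through unchanged */
        return lead;
    }

    if (len - *ip < need)
        return SKETCH_EOS;  /* truncated sequence: do not advance */

    for (k = 1; k < need; k++)
        cp = (cp << 6) | (uint32_t)(s[*ip + k] & 0x3F);
    *ip += need;
    return cp;
}
```

The point of the design is that a caller can treat `SKETCH_EOS` as "stop here" and still trust the index as the start of whatever bytes remain.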
|
||||||
|
|
||||||
|
2009-04-20 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* linebreakdef.c (lb_prop_English): Remove the specialization of
|
||||||
|
right single quotation mark as closing punctuation mark, because it
|
||||||
|
can be used as apostrophe.
|
||||||
|
(lb_prop_Spanish): Ditto.
|
||||||
|
(lb_prop_French): Ditto.
|
||||||
|
|
||||||
|
2009-04-09 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* Makefile.msvc: Make the `clean' target work on MSVC versions other
|
||||||
|
than 6.0; do not use precompiled header.
|
||||||
|
|
||||||
|
2009-03-07 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* linebreak.h: Correct the wrong date in the documentation comment.
|
||||||
|
* linebreakdef.h: Ditto.
|
||||||
|
|
||||||
|
2009-02-10 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* configure.ac (AC_INIT): Increase the version to 2.0.
|
||||||
|
* Makefile.am (liblinebreak_la_LDFLAGS): Set the version-info to
|
||||||
|
`2:0'.
|
||||||
|
|
||||||
|
2009-02-10 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* linebreak.h (LINEBREAK_VERSION): New macro.
|
||||||
|
(linebreak_version): New global constant declaration.
|
||||||
|
* linebreak.c (linebreak_version): New global constant definition.
|
||||||
|
|
||||||
|
2009-02-10 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
Reduce namespace pollution.
|
||||||
|
* linebreak.c (get_lb_prop_lang): Mark as static.
|
||||||
|
(get_next_char_utf8): Rename to lb_get_next_char_utf8.
|
||||||
|
(get_next_char_utf16): Rename to lb_get_next_char_utf16.
|
||||||
|
(get_next_char_utf32): Rename to lb_get_next_char_utf32.
|
||||||
|
(is_breakable): Rename to is_line_breakable.
|
||||||
|
* linebreak.h (get_next_char_utf8): Remove the function prototype
|
||||||
|
declaration.
|
||||||
|
(get_next_char_utf16): Ditto.
|
||||||
|
(get_next_char_utf32): Ditto.
|
||||||
|
(is_breakable): Rename to is_line_breakable.
|
||||||
|
* linebreakdef.h (lb_get_next_char_utf8): Add the function prototype
|
||||||
|
declaration.
|
||||||
|
(lb_get_next_char_utf16): Ditto.
|
||||||
|
(lb_get_next_char_utf32): Ditto.
|
||||||
|
|
||||||
|
2009-02-06 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* NEWS: Add information about the 1.1 release.
|
||||||
|
|
||||||
|
2009-01-02 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* Makefile.am (EXTRA_DIST): Add the missing `LICENCE' file.
|
||||||
|
|
||||||
|
2008-12-31 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* linebreak.c: Update the version number in comment to 1.0.
|
||||||
|
* linebreak.h: Ditto.
|
||||||
|
* linebreakdef.c: Ditto.
|
||||||
|
* linebreakdef.h: Ditto.
|
||||||
|
|
||||||
|
2008-12-31 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* NEWS: Update for the 1.0 release.
|
||||||
|
|
||||||
|
2008-12-31 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* README: Correct two typos.
|
||||||
|
|
||||||
|
2008-12-31 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* README: Add the online URL reference.
|
||||||
|
|
||||||
|
2008-12-30 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* README: Update the reference to UAX #14-22, for Unicode 5.1.0.
|
||||||
|
|
||||||
|
2008-12-13 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
Update files according to UAX #14-22, for Unicode 5.1.0.
|
||||||
|
* linebreak.c (baTable): Update according to Table 2 of UAX #14-22.
|
||||||
|
* linebreakdef.c (lb_prop_Spanish): Remove the unnecessary
|
||||||
|
customization for inverted marks in Spanish.
|
||||||
|
* linebreakdata.c: Regenerate from LineBreak-5.1.0.txt.
|
||||||
|
* linebreak.h: Update comment only.
|
||||||
|
* linebreakdef.h: Ditto.
|
||||||
|
|
||||||
|
2008-12-12 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* README: Update for the new build methods and better readability.
|
||||||
|
|
||||||
|
2008-12-12 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* Makefile.msvc: Correct the inconsistent naming in the output
|
||||||
|
message.
|
||||||
|
|
||||||
|
2008-12-12 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* configure.ac (AM_INIT_AUTOMAKE): Mark `foreign'.
|
||||||
|
* bootstrap: New file.
|
||||||
|
* purge: New file.
|
||||||
|
* Makefile.gcc (purge): Remove this target.
|
||||||
|
|
||||||
|
2008-12-10 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* NEWS: New file.
|
||||||
|
|
||||||
|
2008-12-10 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* AUTHORS: New file.
|
||||||
|
|
||||||
|
2008-12-10 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* Makefile.gcc (purge): New phony target to purge files generated by
|
||||||
|
autoconfiscation.
|
||||||
|
|
||||||
|
2008-12-10 Thomas Klausner <tk@giga.or.at>
|
||||||
|
|
||||||
|
* configure.ac: New file.
|
||||||
|
* Makefile.am: New file.
|
||||||
|
|
||||||
|
2008-12-10 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* Doxyfile (OUTPUT_DIRECTORY): Set to `doc'.
|
||||||
|
(ALPHABETICAL_INDEX): Set to `YES'.
|
||||||
|
|
||||||
|
2008-12-09 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* Makefile.msvc: New file.
|
||||||
|
|
||||||
|
2008-12-09 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* Makefile: Remove (to become Makefile.gcc).
|
||||||
|
* Makefile.gcc: New file (was Makefile).
|
||||||
|
|
||||||
|
2008-12-07 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* linebreak.c: Adjust the comment that refers to Unicode Annex 14.
|
||||||
|
* linebreak.h: Ditto.
|
||||||
|
* linebreakdef.c: Ditto.
|
||||||
|
* linebreakdef.h: Ditto.
|
||||||
|
|
||||||
|
2008-12-07 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
Use only POSIX basic regexp to ensure maximum portability (issues
|
||||||
|
have been found on Mac OS X, where GNU extensions do not work).
|
||||||
|
* LineBreak1.sed: Replace `[:xdigit:]' with `0-9A-F', and `\+' with
|
||||||
|
`\{1,\}'.
|
||||||
|
* LineBreak2.sed: Ditto.
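Reviewer note: the interval form `\{1,\}` used in the entry above is the POSIX basic regular expression spelling of "one or more". A small sketch of checking such a pattern from C with the POSIX `<regex.h>` interface (so it assumes a POSIX system); `matches_bre_sketch` is a hypothetical helper, not something in this commit.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if `text` matches the POSIX basic regular expression `bre`. */
int matches_bre_sketch(const char *bre, const char *text)
{
    regex_t re;
    int rc;

    assert(bre != NULL && text != NULL);
    /* No REG_EXTENDED flag, so the pattern is read with basic syntax. */
    if (regcomp(&re, bre, REG_NOSUB) != 0)
        return 0;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, `matches_bre_sketch("^[0-9A-F]\\{1,\\};", "0041;AL")` should report a match, while the same pattern against `";AL"` should not, because at least one hex digit is required before the semicolon.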
|
||||||
|
|
||||||
|
2008-12-07 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* Makefile: Replace `*.exe' with `filter_dup$(EXEEXT)', since the
|
||||||
|
extension `.exe' is specific to Windows.
|
||||||
|
|
||||||
|
2008-04-20 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
Add README and LICENCE files, as well as a Doxyfile to generate
|
||||||
|
documents.
|
||||||
|
* README: New file.
|
||||||
|
* LICENCE: New file.
|
||||||
|
* Doxyfile: New file.
|
||||||
|
* Makefile (doc): Add new phony target.
|
||||||
|
|
||||||
|
2008-04-04 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
Remove the English override for plus sign: it is better treated in
|
||||||
|
the text breaking program (see ../breaktext/ for an example).
|
||||||
|
* linebreakdef.c (lb_prop_English): Remove the line for plus sign.
|
||||||
|
|
||||||
|
2008-03-29 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* Makefile: Correct the dependency-making rules when OLDGCC=Y.
|
||||||
|
|
||||||
|
2008-03-23 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* Makefile (clean): Do not remove *.exe and tags here.
|
||||||
|
(distclean): Remove *.exe and tags.
|
||||||
|
|
||||||
|
2008-03-23 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
Remove the English override for solidus: it is better treated in the
|
||||||
|
text breaking program (see ../breaktext/ for an example).
|
||||||
|
* linebreakdef.c (lb_prop_English): Remove the line for solidus.
|
||||||
|
|
||||||
|
2008-03-16 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
Rename init_linebreak_prop_index to init_linebreak for future
|
||||||
|
safety; make visible certain functions that are potentially useful.
|
||||||
|
* linebreak.c (init_linebreak_prop_index): Rename to init_linebreak.
|
||||||
|
(get_next_char_t): Move to linebreakdef.h.
|
||||||
|
(get_next_char_utf8): Make non-static.
|
||||||
|
(get_next_char_utf16): Ditto.
|
||||||
|
(get_next_char_utf32): Ditto.
|
||||||
|
(set_linebreaks): Ditto.
|
||||||
|
* linebreak.h (init_linebreak_prop_index): Rename to init_linebreak.
|
||||||
|
(get_next_char_utf8): Add the function prototype.
|
||||||
|
(get_next_char_utf16): Ditto.
|
||||||
|
(get_next_char_utf32): Ditto.
|
||||||
|
* linebreakdef.h (get_next_char_t): Add the typedef.
|
||||||
|
(set_linebreaks): Add the function prototype.
|
||||||
|
|
||||||
|
2008-03-16 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* Makefile (OLDGCC): Add support for GCC 2.95.3 (when OLDGCC=Y).
|
||||||
|
|
||||||
|
2008-03-15 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* linebreak.c (set_linebreaks): Fix a bug that `==' was wrongly used
|
||||||
|
for `='.
|
||||||
|
|
||||||
|
2008-03-05 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
Improve the performance by reducing the look-ups of the
|
||||||
|
language-specific line breaking properties array from the language
|
||||||
|
name (thanks to Nikolay Pultsin).
|
||||||
|
* linebreak.c (get_lb_prop_lang): New function.
|
||||||
|
(get_char_lb_class_lang): Change the second parameter from the
|
||||||
|
language name to the line breaking properties array.
|
||||||
|
(set_linebreaks): Look up the language-specific line breaking
|
||||||
|
properties array from the language name only once in one function
|
||||||
|
call.
|
||||||
|
|
||||||
|
2008-03-03 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
Make minor adjustments in code and comments.
|
||||||
|
* linebreak.c: Adjust the doc comments.
|
||||||
|
(init_linebreak_prop_index): Modify a conditional to make it more
|
||||||
|
robust and consistent.
|
||||||
|
* linebreakdef.c (lb_prop_lang_map): Replace the pointer
|
||||||
|
lb_prop_default with NULL, since the value is never used.
|
||||||
|
|
||||||
|
2008-03-03 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
Accelerate get_char_lb_class for invalid Unicode code points.
|
||||||
|
* linebreak.c (get_char_lb_class): Adjust the conditionals so that
|
||||||
|
getting the line breaking class for an invalid code point is much
|
||||||
|
faster, which requires the array of line breaking properties be
|
||||||
|
sorted.
|
||||||
|
* linebreakdef.h: Adjust a comment that the array of line break
|
||||||
|
properties must be sorted.
|
||||||
|
|
||||||
|
2008-03-02 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
Change the values of enum BreakAction to more complete forms.
|
||||||
|
* linebreak.c (INDRCT_BRK): Rename to INDIRECT_BRK.
|
||||||
|
(CM_INDRCT_BRK): Rename to CM_INDIRECT_BRK.
|
||||||
|
(CM_PROHIBTD_BRK): Rename to CM_PROHIBITED_BRK.
|
||||||
|
(PROHIBTD_BRK): Rename to PROHIBITED_BRK.
|
||||||
|
|
||||||
|
2008-03-02 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
Implement a two-stage search in get_char_lb_class_default to
|
||||||
|
accelerate the overall performance, especially for non-Latin
|
||||||
|
languages.
|
||||||
|
* linebreak.c (LINEBREAK_INDEX_SIZE): New constant macro.
|
||||||
|
(struct LineBreakPropertiesIndex): New struct.
|
||||||
|
(lb_prop_index): New static variable.
|
||||||
|
(init_linebreak_prop_index): New function.
|
||||||
|
(get_char_lb_class_default): New function.
|
||||||
|
(get_char_lb_class_lang): Use get_char_lb_class_default.
|
||||||
|
* linebreak.h: Detect C++ and add extern "C" guard if necessary.
|
||||||
|
(init_linebreak_prop_index): Add the prototype declaration.
|
||||||
|
* linebreakdef.h: Adjust a comment.
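Reviewer note: the two-stage search described in this entry can be pictured as a small index over a sorted range table: first pick the chunk that can contain the code point, then scan only that chunk. The sketch below illustrates the technique under stated assumptions. The table contents, the chunk count, and every identifier are hypothetical; the real `lb_prop_index` and `get_char_lb_class_default` are not shown in this diff.

```c
#include <assert.h>
#include <stddef.h>

enum { CLS_XX, CLS_AL, CLS_NU, CLS_ID };   /* toy classes for the sketch */

struct range { unsigned long start, end; int cls; };

/* Toy property table; it must be sorted by code point. */
static const struct range table[] = {
    { 0x0030, 0x0039, CLS_NU },
    { 0x0041, 0x005A, CLS_AL },
    { 0x0061, 0x007A, CLS_AL },
    { 0x4E00, 0x9FFF, CLS_ID },
};
#define TABLE_LEN (sizeof table / sizeof table[0])
#define INDEX_SIZE 2

/* Stage-one index: one entry per chunk of the table. */
struct index_entry { unsigned long end; const struct range *first; size_t count; };
static struct index_entry chunk_index[INDEX_SIZE];

void init_index_sketch(void)
{
    size_t step = (TABLE_LEN + INDEX_SIZE - 1) / INDEX_SIZE;
    size_t i;

    for (i = 0; i < INDEX_SIZE; i++) {
        size_t first = i * step;
        size_t count = 0;

        if (first < TABLE_LEN)
            count = (TABLE_LEN - first < step) ? TABLE_LEN - first : step;
        chunk_index[i].first = (count > 0) ? &table[first] : NULL;
        chunk_index[i].count = count;
        chunk_index[i].end = (count > 0) ? table[first + count - 1].end : 0;
    }
}

int lookup_class_sketch(unsigned long ch)
{
    size_t i, j;

    assert(chunk_index[0].count > 0);      /* init_index_sketch() was called */
    for (i = 0; i < INDEX_SIZE; i++) {
        if (chunk_index[i].count == 0 || ch > chunk_index[i].end)
            continue;                      /* stage one: skip whole chunk */
        for (j = 0; j < chunk_index[i].count; j++) {
            const struct range *r = &chunk_index[i].first[j];
            if (ch < r->start)
                break;                     /* sorted: nothing later can match */
            if (ch <= r->end)
                return r->cls;
        }
        break;                             /* only one chunk can contain ch */
    }
    return CLS_XX;
}
```

The early `break` on `ch < r->start` is also why a sorted table makes lookups of unassigned code points cheap, which is the requirement the 2008-03-03 entry records.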
|
||||||
|
|
||||||
|
2008-03-02 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
Split/refactor the code; add (doc) comments.
|
||||||
|
* Makefile (CFILES): Add linebreakdata.c and linebreakdef.c.
|
||||||
|
* linebreak.c: Add and adjust comments.
|
||||||
|
(linebreakdef.h): Add include file.
|
||||||
|
(linebreakdata.c): Remove include file.
|
||||||
|
(EOS): Remove (now in linebreakdef.h).
|
||||||
|
(enum LineBreakClass): Ditto.
|
||||||
|
(struct LineBreakProperties): Ditto.
|
||||||
|
(lbpEnglish): Remove (now in linebreakdef.c as lb_prop_English).
|
||||||
|
(lbpGerman): Remove (now in linebreakdef.c as lb_prop_German).
|
||||||
|
(lbpSpanish): Remove (now in linebreakdef.c as lb_prop_Spanish).
|
||||||
|
(lbpFrench): Remove (now in linebreakdef.c as lb_prop_French).
|
||||||
|
(lbpRussian): Remove (now in linebreakdef.c as lb_prop_Russian).
|
||||||
|
(lbpChinese): Remove (now in linebreakdef.c as lb_prop_Chinese).
|
||||||
|
(struct LineBreakPropertiesLang): Remove (now in linebreakdef.h).
|
||||||
|
(lbpLangs): Remove (now in linebreakdef.c as lb_prop_lang_map).
|
||||||
|
(get_next_char_utf16): Make sure memory access not go beyond len.
|
||||||
|
* linebreak.h: Add copyright information and adjust comments.
|
||||||
|
(stddef.h): Add include file.
|
||||||
|
* linebreakdata.c (linebreak.h): Add include file.
|
||||||
|
(linebreakdef.h): Add include file.
|
||||||
|
(lbpDefault): Make global and rename to lb_prop_default.
|
||||||
|
* linebreakdata2.tmpl: Add two include files, a comment line, and
|
||||||
|
remove `static'.
|
||||||
|
* linebreakdef.c: New file.
|
||||||
|
* linebreakdef.h: New file.
|
||||||
|
|
||||||
|
2008-02-26 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* linebreak.c (lbpSpanish): New array for Spanish-specific data.
|
||||||
|
(lbpLangs): Update the index array for Spanish.
|
||||||
|
(resolve_lb_class): Resolve AmbIguous class to IDeographic in
|
||||||
|
Chinese, Japanese, and Korean.
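Reviewer note: the `resolve_lb_class` change above makes the ambiguous class language-dependent. A minimal sketch of that rule follows. It assumes the language is passed as a tag beginning with `zh`, `ja`, or `ko`, and that ambiguous characters fall back to the alphabetic class otherwise (the UAX #14 default); the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy subset of line breaking classes for this sketch. */
enum lb_class_sketch { SK_AL, SK_AI, SK_ID };

enum lb_class_sketch resolve_class_sketch(enum lb_class_sketch cls,
                                          const char *lang)
{
    assert(cls == SK_AL || cls == SK_AI || cls == SK_ID);
    if (cls != SK_AI)
        return cls;                 /* only the ambiguous class is resolved */
    if (lang != NULL &&
        (strncmp(lang, "zh", 2) == 0 ||     /* Chinese */
         strncmp(lang, "ja", 2) == 0 ||     /* Japanese */
         strncmp(lang, "ko", 2) == 0))      /* Korean */
        return SK_ID;               /* treat as ideographic in CJK text */
    return SK_AL;                   /* assumed default: alphabetic */
}
```

Comparing only the first two characters lets tags such as `zh_CN` resolve the same way as plain `zh`.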
|
||||||
|
|
||||||
|
2008-02-26 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
* Makefile (LineBreak.txt): Add new rule to retrieve it from the Web
|
||||||
|
if it is not already there.
|
||||||
|
|
||||||
|
2008-02-23 Wu Yongwei <wuyongwei@gmail.com>
|
||||||
|
|
||||||
|
Add files for linebreak.
|
||||||
|
* LineBreak1.sed: New file.
|
||||||
|
* LineBreak2.sed: New file.
|
||||||
|
* Makefile: New file.
|
||||||
|
* filter_dup.c: New file.
|
||||||
|
* linebreak.c: New file.
|
||||||
|
* linebreak.h: New file.
|
||||||
|
* linebreakdata.c: New file.
|
||||||
|
* linebreakdata1.tmpl: New file.
|
||||||
|
* linebreakdata2.tmpl: New file.
|
||||||
|
* linebreakdata3.tmpl: New file.
|
1219
linebreak/linebreak/Doxyfile
Normal file
File diff suppressed because it is too large
19
linebreak/linebreak/LICENCE
Normal file
@@ -0,0 +1,19 @@
Copyright (C) 2008-2012 Wu Yongwei <wuyongwei at gmail dot com>
Copyright (C) 2012 Tom Hacohen <tom dot hacohen at samsung dot com>

This software is provided 'as-is', without any express or implied
warranty. In no event will the author be held liable for any damages
arising from the use of this software.

Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:

1. The origin of this software must not be misrepresented; you must not
   claim that you wrote the original software. If you use this software
   in a product, an acknowledgement in the product documentation would
   be appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not
   be misrepresented as being the original software.
3. This notice may not be removed or altered from any source
   distribution.
1
linebreak/linebreak/LineBreak1.sed
Normal file
@@ -0,0 +1 @@
s/\(^[0-9A-F.]\{1,\};[A-Z][A-Z0-9]\) #.*/\1/p
2
linebreak/linebreak/LineBreak2.sed
Normal file
@@ -0,0 +1,2 @@
s/^\([0-9A-F]\{1,\}\);/\1..\1;/
s/^\([0-9A-F]\{1,\}\)\.\.\([0-9A-F]\{1,\}\);\([A-Z][A-Z0-9]\)/ { 0x\1, 0x\2, LBP_\3 },/
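Reviewer note: taken together, the two sed scripts turn a `LineBreak.txt` data line into a C initializer: the first keeps only `code;CLASS` lines and drops the trailing comment, the second widens a single code point to a range and emits the `{ 0x..., 0x..., LBP_.. },` form. The sketch below mirrors that transformation for one line in C. It is an illustration only: `format_lb_entry_sketch` is a hypothetical name, leading whitespace in the output is omitted, hex values are re-printed with at least four digits rather than copied textually, and `sscanf` is more lenient than the sed patterns (for instance it accepts lowercase hex and does not require the ` #` comment).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Convert "0041..005A;AL # comment" or "000A;LF # comment" into
 * "{ 0x0041, 0x005A, LBP_AL }," in `out`.  Returns 1 on success and
 * 0 if the line is not a property line (for example a '#' comment).
 */
int format_lb_entry_sketch(const char *line, char *out, size_t out_size)
{
    unsigned long first, last;
    char cls[3];

    assert(line != NULL && out != NULL);
    if (sscanf(line, "%lX..%lX;%2[A-Z0-9]", &first, &last, cls) != 3) {
        /* Not a range: try a single code point and widen it. */
        if (sscanf(line, "%lX;%2[A-Z0-9]", &first, cls) != 2)
            return 0;
        last = first;
    }
    /* The sed pattern requires [A-Z][A-Z0-9]: two characters, letter first. */
    if (strlen(cls) != 2 || cls[0] < 'A' || cls[0] > 'Z')
        return 0;
    snprintf(out, out_size, "{ 0x%04lX, 0x%04lX, LBP_%s },", first, last, cls);
    return 1;
}
```

Header comment lines in the data file start with `#`, so they fail the first conversion and are skipped, which matches the effect of running the first script with `sed -n`.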
63
linebreak/linebreak/Makefile.am
Normal file
@@ -0,0 +1,63 @@
#noinst_PROGRAMS = filter_dup
include_HEADERS = linebreak.h linebreakdef.h wordbreak.h wordbreakdef.h
lib_LTLIBRARIES = libunibreak.la
pkgconfig_DATA = libunibreak.pc
pkgconfigdir = ${libdir}/pkgconfig

libunibreak_la_LDFLAGS = -no-undefined -version-info 1:0
libunibreak_la_SOURCES = \
	linebreak.c \
	linebreakdata.c \
	linebreakdef.c \
	wordbreak.c

EXTRA_DIST = \
	LineBreak1.sed \
	LineBreak2.sed \
	linebreakdata1.tmpl \
	linebreakdata2.tmpl \
	linebreakdata3.tmpl \
	wordbreakdata1.tmpl \
	wordbreakdata2.tmpl \
	wordbreakdata.c \
	LICENCE \
	Doxyfile \
	Makefile.gcc \
	Makefile.msvc \
	doc \
	sort_numeric_hex.py

install-exec-hook:
	rm -f ${libdir}/liblinebreak.a
	${LN_S} ${libdir}/libunibreak.a ${libdir}/liblinebreak.a

distclean-local:
	rm -f LineBreak.txt WordBreakData.txt filter_dup${EXEEXT}

doc:
	cd ${top_srcdir} && doxygen

LineBreak.txt:
	wget http://unicode.org/Public/UNIDATA/LineBreak.txt

WordBreakProperty.txt:
	wget http://www.unicode.org/Public/UNIDATA/auxiliary/WordBreakProperty.txt

linebreakdata: ${builddir}/filter_dup LineBreak.txt
	sed -n -f ${top_srcdir}/LineBreak1.sed LineBreak.txt > tmp.txt
	sed -f ${top_srcdir}/LineBreak2.sed tmp.txt | ${builddir}/filter_dup > tmp.c
	head -2 LineBreak.txt > tmp.txt
	cat ${top_srcdir}/linebreakdata1.tmpl tmp.txt ${top_srcdir}/linebreakdata2.tmpl tmp.c ${top_srcdir}/linebreakdata3.tmpl > ${top_srcdir}/linebreakdata.c
	rm tmp.txt tmp.c

wordbreakdata: WordBreakProperty.txt
	sed -E -n 's/(^[0-9A-F.]+)/\1/p' WordBreakProperty.txt > tmp2.txt
	sed -E -i.bak 's/^([0-9A-F]+) +/\1..\1/' tmp2.txt
	${top_srcdir}/sort_numeric_hex.py tmp2.txt > tmp.txt
	rm tmp2.txt tmp2.txt.bak
	sed -E -i.bak -n 's/^([0-9A-F]+)..([0-9A-F]+) *; *([A-Za-z]+).*/'$$'\t''{0x\1, 0x\2, WBP_\3},/p' tmp.txt
	echo "/* The content of this file is generated from:" > ${top_srcdir}/wordbreakdata.c
	head -2 WordBreakProperty.txt >> ${top_srcdir}/wordbreakdata.c
	echo "*/" >> ${top_srcdir}/wordbreakdata.c
	cat ${top_srcdir}/wordbreakdata1.tmpl tmp.txt ${top_srcdir}/wordbreakdata2.tmpl >> ${top_srcdir}/wordbreakdata.c
	rm tmp.txt tmp.txt.bak
177
linebreak/linebreak/Makefile.gcc
Normal file
@@ -0,0 +1,177 @@
# Windows/Cygwin support
ifdef windir
WINDOWS := 1
CYGWIN := 0
else
ifdef WINDIR
WINDOWS := 1
CYGWIN := 1
else
WINDOWS := 0
endif
endif
ifeq ($(WINDOWS),1)
EXEEXT := .exe
DLLEXT := .dll
DEVNUL := nul
ifeq ($(CYGWIN),1)
PATHSEP := /
else
PATHSEP := $(strip \ )
endif
else
EXEEXT :=
DLLEXT := .so
DEVNUL := /dev/null
PATHSEP := /
endif

CFG ?= Debug
ifeq ($(CFG),Debug)
all: debug
else
all: release
endif

OLDGCC ?= N

DEBUG := DebugDir
RELEASE := ReleaseDir

$(DEBUG)/%.o: %.c
	$(CC) $(CFLAGS) $(CPPFLAGS) $(DBGFLAGS) $(TARGET_ARCH) -c -o $@ $<

$(RELEASE)/%.o: %.c
	$(CC) $(CFLAGS) $(CPPFLAGS) $(RELFLAGS) $(TARGET_ARCH) -c -o $@ $<

$(DEBUG)/%.o: %.cpp
	$(CXX) $(CXXFLAGS) $(CPPFLAGS) $(DBGFLAGS) $(TARGET_ARCH) -c -o $@ $<

$(RELEASE)/%.o: %.cpp
	$(CXX) $(CXXFLAGS) $(CPPFLAGS) $(RELFLAGS) $(TARGET_ARCH) -c -o $@ $<

ifeq ($(OLDGCC),N)

$(DEBUG)/%.dep: %.c
	$(CC) -MM -MT $(patsubst %.dep,%.o,$@) $(CFLAGS) $(CPPFLAGS) $(DBGFLAGS) $(TARGET_ARCH) -o $@ $<

$(RELEASE)/%.dep: %.c
	$(CC) -MM -MT $(patsubst %.dep,%.o,$@) $(CFLAGS) $(CPPFLAGS) $(RELFLAGS) $(TARGET_ARCH) -o $@ $<

$(DEBUG)/%.dep: %.cpp
	$(CXX) -MM -MT $(patsubst %.dep,%.o,$@) $(CXXFLAGS) $(CPPFLAGS) $(DBGFLAGS) $(TARGET_ARCH) -o $@ $<

$(RELEASE)/%.dep: %.cpp
	$(CXX) -MM -MT $(patsubst %.dep,%.o,$@) $(CXXFLAGS) $(CPPFLAGS) $(RELFLAGS) $(TARGET_ARCH) -o $@ $<

else

$(DEBUG)/%.dep: %.c
	$(CC) -MM $(CFLAGS) $(CPPFLAGS) $(DBGFLAGS) $(TARGET_ARCH) $< | sed "s!^!$(DEBUG)/!" > $@

$(RELEASE)/%.dep: %.c
	$(CC) -MM $(CFLAGS) $(CPPFLAGS) $(RELFLAGS) $(TARGET_ARCH) $< | sed "s!^!$(RELEASE)/!" > $@

$(DEBUG)/%.dep: %.cpp
	$(CXX) -MM $(CXXFLAGS) $(CPPFLAGS) $(DBGFLAGS) $(TARGET_ARCH) $< | sed "s!^!$(DEBUG)/!" > $@

$(RELEASE)/%.dep: %.cpp
	$(CXX) -MM $(CXXFLAGS) $(CPPFLAGS) $(RELFLAGS) $(TARGET_ARCH) $< | sed "s!^!$(RELEASE)/!" > $@

endif

CC = gcc
CXX = g++
AR = ar
LD = $(CXX) $(CXXFLAGS) $(TARGET_ARCH)

INCLUDE = -I. $(patsubst %,-I%,$(VPATH))
CFLAGS = -W -Wall $(INCLUDE)
CXXFLAGS = $(CFLAGS)
DBGFLAGS = -D_DEBUG -g
RELFLAGS = -DNDEBUG -O2
CPPFLAGS =

ifeq ($(OLDGCC),N)
CFLAGS += -fmessage-length=0
endif

HFILES = $(wildcard $(patsubst -I%,%/*.h,$(INCLUDE)))
OBJFILES = $(CFILES:.c=.o) $(CXXFILES:.cpp=.o)

DEBUG_OBJS = $(patsubst %.o,$(DEBUG)/%.o,$(OBJFILES))
RELEASE_OBJS = $(patsubst %.o,$(RELEASE)/%.o,$(OBJFILES))

DEBUG_DEPS = $(patsubst %.o,%.dep,$(DEBUG_OBJS))
RELEASE_DEPS = $(patsubst %.o,%.dep,$(RELEASE_OBJS))

CFILES := linebreak.c linebreakdata.c linebreakdef.c wordbreak.c
CXXFILES :=

LIBS :=

TARGET = liblinebreak.a
DEBUG_TARGET = $(patsubst %,$(DEBUG)/%,$(TARGET))
RELEASE_TARGET = $(patsubst %,$(RELEASE)/%,$(TARGET))

debug: $(DEBUG) $(DEBUG_TARGET)

release: $(RELEASE) $(RELEASE_TARGET)



$(DEBUG):
	mkdir $(DEBUG)

$(RELEASE):
	mkdir $(RELEASE)

$(DEBUG_TARGET): $(DEBUG_DEPS) $(DEBUG_OBJS)
	$(AR) -r $(DEBUG_TARGET) $(DEBUG_OBJS)

$(RELEASE_TARGET): $(RELEASE_DEPS) $(RELEASE_OBJS)
	$(AR) -r $(RELEASE_TARGET) $(RELEASE_OBJS)
|
||||||
|
|
||||||
|
doc:
|
||||||
|
doxygen
|
||||||
|
|
||||||
|
linebreakdata: filter_dup$(EXEEXT) LineBreak.txt
|
||||||
|
sed -n -f LineBreak1.sed LineBreak.txt > tmp.txt
|
||||||
|
sed -f LineBreak2.sed tmp.txt | .$(PATHSEP)filter_dup > tmp.c
|
||||||
|
head -2 LineBreak.txt > tmp.txt
|
||||||
|
cat linebreakdata1.tmpl tmp.txt linebreakdata2.tmpl tmp.c linebreakdata3.tmpl > linebreakdata.c
|
||||||
|
$(RM) tmp.txt tmp.c
|
||||||
|
|
||||||
|
wordbreakdata: WordBreakProperty.txt
|
||||||
|
sed -E -n 's/(^[0-9A-F.]+)/\1/p' WordBreakProperty.txt > tmp2.txt
|
||||||
|
sed -E -i.bak 's/^([0-9A-F]+) +/\1..\1/' tmp2.txt
|
||||||
|
./sort_numeric_hex.py tmp2.txt > tmp.txt
|
||||||
|
rm tmp2.txt tmp2.txt.bak
|
||||||
|
sed -E -i.bak -n 's/^([0-9A-F]+)..([0-9A-F]+) *; *([A-Za-z]+).*/'$$'\t''{0x\1, 0x\2, WBP_\3},/p' tmp.txt
|
||||||
|
echo "/* The content of this file is generated from:" > wordbreakdata.c
|
||||||
|
head -2 WordBreakProperty.txt >> wordbreakdata.c
|
||||||
|
echo "*/" >> wordbreakdata.c
|
||||||
|
cat wordbreakdata1.tmpl tmp.txt wordbreakdata2.tmpl >> wordbreakdata.c
|
||||||
|
rm tmp.txt tmp.txt.bak
|
||||||
|
|
||||||
|
filter_dup$(EXEEXT): filter_dup.c
|
||||||
|
gcc -O2 -o filter_dup$(EXEEXT) $<
|
||||||
|
|
||||||
|
LineBreak.txt:
|
||||||
|
wget http://unicode.org/Public/UNIDATA/LineBreak.txt
|
||||||
|
|
||||||
|
WordBreakProperty.txt:
|
||||||
|
wget http://www.unicode.org/Public/UNIDATA/auxiliary/WordBreakProperty.txt
|
||||||
|
|
||||||
|
.PHONY: all debug release clean distclean doc linebreakdata wordbreakdata
|
||||||
|
|
||||||
|
clean:
|
||||||
|
$(RM) $(DEBUG)/*.o $(DEBUG)/*.dep $(DEBUG_TARGET)
|
||||||
|
$(RM) $(RELEASE)/*.o $(RELEASE)/*.dep $(RELEASE_TARGET)
|
||||||
|
|
||||||
|
distclean: clean
|
||||||
|
$(RM) $(DEBUG)/* $(RELEASE)/* filter_dup$(EXEEXT) tags LineBreak.txt
|
||||||
|
-rmdir $(DEBUG) 2> $(DEVNUL)
|
||||||
|
-rmdir $(RELEASE) 2> $(DEVNUL)
|
||||||
|
|
||||||
|
-include $(wildcard $(DEBUG)/*.dep) $(wildcard $(RELEASE)/*.dep)
|
189
linebreak/linebreak/Makefile.msvc
Normal file
|
@ -0,0 +1,189 @@
|
||||||
|
# Makefile for Microsoft Visual C++ and NMAKE
|
||||||
|
|
||||||
|
!IF "$(CFG)" == ""
|
||||||
|
CFG=libunibreak - Win32 Debug
|
||||||
|
!MESSAGE No configuration specified. Defaulting to libunibreak - Win32 Debug.
|
||||||
|
!ENDIF
|
||||||
|
|
||||||
|
!IF "$(CFG)" != "libunibreak - Win32 Release" && "$(CFG)" != "libunibreak - Win32 Debug"
|
||||||
|
!MESSAGE Invalid configuration "$(CFG)" specified.
|
||||||
|
!MESSAGE You can specify a configuration when running NMAKE
|
||||||
|
!MESSAGE by defining the macro CFG on the command line. For example:
|
||||||
|
!MESSAGE
|
||||||
|
!MESSAGE NMAKE /f Makefile.msvc CFG="libunibreak - Win32 Debug"
|
||||||
|
!MESSAGE
|
||||||
|
!MESSAGE Possible choices for configuration are:
|
||||||
|
!MESSAGE
|
||||||
|
!MESSAGE "libunibreak - Win32 Release" (based on "Win32 (x86) Static Library")
|
||||||
|
!MESSAGE "libunibreak - Win32 Debug" (based on "Win32 (x86) Static Library")
|
||||||
|
!MESSAGE
|
||||||
|
!ERROR An invalid configuration is specified.
|
||||||
|
!ENDIF
|
||||||
|
|
||||||
|
!IF "$(OS)" == "Windows_NT"
|
||||||
|
NULL=
|
||||||
|
!ELSE
|
||||||
|
NULL=nul
|
||||||
|
!ENDIF
|
||||||
|
|
||||||
|
CPP=cl.exe
|
||||||
|
RSC=rc.exe
|
||||||
|
|
||||||
|
!IF "$(CFG)" == "libunibreak - Win32 Release"
|
||||||
|
|
||||||
|
OUTDIR=.\Release
|
||||||
|
INTDIR=.\Release
|
||||||
|
# Begin Custom Macros
|
||||||
|
OutDir=.\Release
|
||||||
|
# End Custom Macros
|
||||||
|
|
||||||
|
ALL : "$(OUTDIR)\unibreak.lib"
|
||||||
|
|
||||||
|
|
||||||
|
CLEAN :
|
||||||
|
-@erase "$(INTDIR)\linebreak.obj"
|
||||||
|
-@erase "$(INTDIR)\linebreakdata.obj"
|
||||||
|
-@erase "$(INTDIR)\linebreakdef.obj"
|
||||||
|
-@erase "$(INTDIR)\wordbreak.obj"
|
||||||
|
-@erase "$(INTDIR)\vc*.idb"
|
||||||
|
-@erase "$(OUTDIR)\unibreak.lib"
|
||||||
|
|
||||||
|
"$(OUTDIR)" :
|
||||||
|
if not exist "$(OUTDIR)/$(NULL)" mkdir "$(OUTDIR)"
|
||||||
|
|
||||||
|
CPP_PROJ=/nologo /ML /W3 /GX /O2 /D "WIN32" /D "NDEBUG" /D "_MBCS" /D "_LIB" /Fo"$(INTDIR)\\" /Fd"$(INTDIR)\\" /FD /c
|
||||||
|
BSC32=bscmake.exe
|
||||||
|
BSC32_FLAGS=/nologo /o"$(OUTDIR)\unibreak.bsc"
|
||||||
|
BSC32_SBRS= \
|
||||||
|
|
||||||
|
LIB32=link.exe -lib
|
||||||
|
LIB32_FLAGS=/nologo /out:"$(OUTDIR)\unibreak.lib"
|
||||||
|
LIB32_OBJS= \
|
||||||
|
"$(INTDIR)\linebreak.obj" \
|
||||||
|
"$(INTDIR)\linebreakdata.obj" \
|
||||||
|
"$(INTDIR)\linebreakdef.obj" \
|
||||||
|
"$(INTDIR)\wordbreak.obj"
|
||||||
|
|
||||||
|
"$(OUTDIR)\unibreak.lib" : "$(OUTDIR)" $(DEF_FILE) $(LIB32_OBJS)
|
||||||
|
$(LIB32) @<<
|
||||||
|
$(LIB32_FLAGS) $(DEF_FLAGS) $(LIB32_OBJS)
|
||||||
|
<<
|
||||||
|
|
||||||
|
!ELSEIF "$(CFG)" == "libunibreak - Win32 Debug"
|
||||||
|
|
||||||
|
OUTDIR=.\Debug
|
||||||
|
INTDIR=.\Debug
|
||||||
|
# Begin Custom Macros
|
||||||
|
OutDir=.\Debug
|
||||||
|
# End Custom Macros
|
||||||
|
|
||||||
|
ALL : "$(OUTDIR)\unibreak.lib"
|
||||||
|
|
||||||
|
|
||||||
|
CLEAN :
|
||||||
|
-@erase "$(INTDIR)\linebreak.obj"
|
||||||
|
-@erase "$(INTDIR)\linebreakdata.obj"
|
||||||
|
-@erase "$(INTDIR)\linebreakdef.obj"
|
||||||
|
-@erase "$(INTDIR)\wordbreak.obj"
|
||||||
|
-@erase "$(INTDIR)\vc*.idb"
|
||||||
|
-@erase "$(INTDIR)\vc*.pdb"
|
||||||
|
-@erase "$(OUTDIR)\unibreak.lib"
|
||||||
|
|
||||||
|
"$(OUTDIR)" :
|
||||||
|
if not exist "$(OUTDIR)/$(NULL)" mkdir "$(OUTDIR)"
|
||||||
|
|
||||||
|
CPP_PROJ=/nologo /MLd /W3 /Gm /GX /ZI /Od /D "WIN32" /D "_DEBUG" /D "_MBCS" /D "_LIB" /Fo"$(INTDIR)\\" /Fd"$(INTDIR)\\" /FD /GZ /c
|
||||||
|
BSC32=bscmake.exe
|
||||||
|
BSC32_FLAGS=/nologo /o"$(OUTDIR)\unibreak.bsc"
|
||||||
|
BSC32_SBRS= \
|
||||||
|
|
||||||
|
LIB32=link.exe -lib
|
||||||
|
LIB32_FLAGS=/nologo /out:"$(OUTDIR)\unibreak.lib"
|
||||||
|
LIB32_OBJS= \
|
||||||
|
"$(INTDIR)\linebreak.obj" \
|
||||||
|
"$(INTDIR)\linebreakdata.obj" \
|
||||||
|
"$(INTDIR)\linebreakdef.obj" \
|
||||||
|
"$(INTDIR)\wordbreak.obj"
|
||||||
|
|
||||||
|
"$(OUTDIR)\unibreak.lib" : "$(OUTDIR)" $(DEF_FILE) $(LIB32_OBJS)
|
||||||
|
$(LIB32) @<<
|
||||||
|
$(LIB32_FLAGS) $(DEF_FLAGS) $(LIB32_OBJS)
|
||||||
|
<<
|
||||||
|
|
||||||
|
!ENDIF
|
||||||
|
|
||||||
|
.c{$(INTDIR)}.obj::
|
||||||
|
$(CPP) @<<
|
||||||
|
$(CPP_PROJ) $<
|
||||||
|
<<
|
||||||
|
|
||||||
|
.cpp{$(INTDIR)}.obj::
|
||||||
|
$(CPP) @<<
|
||||||
|
$(CPP_PROJ) $<
|
||||||
|
<<
|
||||||
|
|
||||||
|
.cxx{$(INTDIR)}.obj::
|
||||||
|
$(CPP) @<<
|
||||||
|
$(CPP_PROJ) $<
|
||||||
|
<<
|
||||||
|
|
||||||
|
.c{$(INTDIR)}.sbr::
|
||||||
|
$(CPP) @<<
|
||||||
|
$(CPP_PROJ) $<
|
||||||
|
<<
|
||||||
|
|
||||||
|
.cpp{$(INTDIR)}.sbr::
|
||||||
|
$(CPP) @<<
|
||||||
|
$(CPP_PROJ) $<
|
||||||
|
<<
|
||||||
|
|
||||||
|
.cxx{$(INTDIR)}.sbr::
|
||||||
|
$(CPP) @<<
|
||||||
|
$(CPP_PROJ) $<
|
||||||
|
<<
|
||||||
|
|
||||||
|
|
||||||
|
.\linebreak.c : \
|
||||||
|
".\linebreak.h"\
|
||||||
|
".\linebreakdef.h"\
|
||||||
|
|
||||||
|
.\linebreakdata.c : \
|
||||||
|
".\linebreak.h"\
|
||||||
|
".\linebreakdef.h"\
|
||||||
|
|
||||||
|
.\linebreakdef.c : \
|
||||||
|
".\linebreak.h"\
|
||||||
|
".\linebreakdef.h"\
|
||||||
|
|
||||||
|
.\wordbreak.c : \
|
||||||
|
".\linebreak.h"\
|
||||||
|
".\linebreakdef.h"\
|
||||||
|
".\wordbreak.h"\
|
||||||
|
".\wordbreakdef.h"\
|
||||||
|
".\wordbreakdata.c"\
|
||||||
|
|
||||||
|
|
||||||
|
!IF "$(CFG)" == "libunibreak - Win32 Release" || "$(CFG)" == "libunibreak - Win32 Debug"
|
||||||
|
SOURCE=.\linebreak.c
|
||||||
|
|
||||||
|
"$(INTDIR)\linebreak.obj" : $(SOURCE) "$(INTDIR)"
|
||||||
|
|
||||||
|
|
||||||
|
SOURCE=.\linebreakdata.c
|
||||||
|
|
||||||
|
"$(INTDIR)\linebreakdata.obj" : $(SOURCE) "$(INTDIR)"
|
||||||
|
|
||||||
|
|
||||||
|
SOURCE=.\linebreakdef.c
|
||||||
|
|
||||||
|
"$(INTDIR)\linebreakdef.obj" : $(SOURCE) "$(INTDIR)"
|
||||||
|
|
||||||
|
|
||||||
|
SOURCE=.\wordbreak.c
|
||||||
|
|
||||||
|
"$(INTDIR)\wordbreak.obj" : $(SOURCE) "$(INTDIR)"
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
!ENDIF
|
||||||
|
|
49
linebreak/linebreak/NEWS
Normal file
|
@ -0,0 +1,49 @@
|
||||||
|
New in libunibreak 1.0
|
||||||
|
|
||||||
|
- Add word breaking support
|
||||||
|
- Change the library name to "libunibreak", while keeping maximum compatibility
|
||||||
|
- Add pkg-config support
|
||||||
|
|
||||||
|
New in liblinebreak 2.1
|
||||||
|
|
||||||
|
- Update the data according to LineBreak-6.0.0.txt
|
||||||
|
- Fix the bug that an assertion in code can fail if U+FFFC is
|
||||||
|
encountered at the beginning of a line
|
||||||
|
|
||||||
|
New in liblinebreak 2.0
|
||||||
|
|
||||||
|
- Update the algorithm and data according to UAX #14-24 and
|
||||||
|
LineBreak-5.2.0.txt
|
||||||
|
- Rename some functions to reduce namespace pollution
|
||||||
|
- Make Doxygen documentation better
|
||||||
|
|
||||||
|
New in liblinebreak 1.2
|
||||||
|
|
||||||
|
- Fix the bug that an assertion in code can fail if an invalid UTF-8 or
|
||||||
|
UTF-16 sequence is encountered near the end of input
|
||||||
|
- Remove the specialization of right single quotation mark as closing
|
||||||
|
punctuation mark in English, French, and Spanish, because it can be
|
||||||
|
used as apostrophe
|
||||||
|
- Make Doxygen documentation better
|
||||||
|
|
||||||
|
New in liblinebreak 1.1
|
||||||
|
|
||||||
|
- Make get_lb_prop_lang static and not an exported symbol
|
||||||
|
- Define is_line_breakable to alias to is_breakable
|
||||||
|
- Declare get_next_char_utf* will be changed to lb_get_next_char_utf*
|
||||||
|
- Move the declarations of get_next_char_utf* from linebreak.h to
|
||||||
|
linebreakdef.h
|
||||||
|
- Add the function documentation comments to the header files
|
||||||
|
|
||||||
|
New in liblinebreak 1.0
|
||||||
|
|
||||||
|
- Update the line breaking data according to UAX #14-22 and
|
||||||
|
LineBreak-5.1.0.txt
|
||||||
|
- Add autoconfiscation support (./configure, make, make install)
|
||||||
|
- Add Makefile for MSVC
|
||||||
|
|
||||||
|
First public release (0.9.6, or 20080421)
|
||||||
|
|
||||||
|
- Implement line breaking algorithm according to UAX #14-19
|
||||||
|
- Line breaking data is generated from LineBreak-5.0.0.txt
|
||||||
|
- Makefile only supports GCC
|
88
linebreak/linebreak/README
Normal file
|
@ -0,0 +1,88 @@
|
||||||
|
L I B U N I B R E A K
|
||||||
|
=====================
|
||||||
|
|
||||||
|
Overview
|
||||||
|
--------
|
||||||
|
|
||||||
|
This is the README file for libunibreak, an implementation of the line
|
||||||
|
breaking and word breaking algorithms as described in Unicode
|
||||||
|
Standard Annex 14 and Unicode Standard Annex 29, available at
|
||||||
|
<URL:http://www.unicode.org/reports/tr14/tr14-26.html>
|
||||||
|
<URL:http://www.unicode.org/reports/tr29/tr29-17.html>
|
||||||
|
|
||||||
|
Check this URL for up-to-date information:
|
||||||
|
<URL:http://vimgadgets.sourceforge.net/libunibreak/>
|
||||||
|
|
||||||
|
|
||||||
|
Licence
|
||||||
|
-------
|
||||||
|
|
||||||
|
This library is released under an open-source licence, the zlib/libpng
|
||||||
|
licence. Please check the file LICENCE for details.
|
||||||
|
|
||||||
|
Apart from using the algorithm, part of the code is derived from the
|
||||||
|
data provided under
|
||||||
|
<URL:http://www.unicode.org/Public/>
|
||||||
|
|
||||||
|
And the Unicode Terms of Use may apply:
|
||||||
|
<URL:http://www.unicode.org/copyright.html>
|
||||||
|
|
||||||
|
|
||||||
|
Installation
|
||||||
|
------------
|
||||||
|
|
||||||
|
There are three ways to build the library:
|
||||||
|
|
||||||
|
1) On *NIX systems supported by the autoconfiscation tools, do the
|
||||||
|
normal
|
||||||
|
|
||||||
|
./configure
|
||||||
|
make
|
||||||
|
sudo make install
|
||||||
|
|
||||||
|
to build and install both the dynamic and static libraries. In
|
||||||
|
addition, one may
|
||||||
|
|
||||||
|
- type `make doc' to generate the doxygen documentation;
|
||||||
|
- type `make linebreakdata' to regenerate linebreakdata.c from
|
||||||
|
LineBreak.txt; or
|
||||||
|
- type `make wordbreakdata' to regenerate wordbreakdata.c from
|
||||||
|
WordBreakProperty.txt.
|
||||||
|
|
||||||
|
2) On systems where GCC and Binutils are supported, one can type
|
||||||
|
|
||||||
|
cp -p Makefile.gcc Makefile
|
||||||
|
make
|
||||||
|
|
||||||
|
to build the static library. In addition, one may
|
||||||
|
|
||||||
|
- type `make debug' or `make release' to explicitly generate the
|
||||||
|
debug or release build;
|
||||||
|
- type `make doc' to generate the doxygen documentation;
|
||||||
|
- type `make linebreakdata' to regenerate linebreakdata.c from
|
||||||
|
LineBreak.txt; or
|
||||||
|
- type `make wordbreakdata' to regenerate wordbreakdata.c from
|
||||||
|
WordBreakProperty.txt.
|
||||||
|
|
||||||
|
3) On Windows, apart from using method 1 (Cygwin/MSYS) and method 2
|
||||||
|
(MinGW), MSVC can also be used. Type
|
||||||
|
|
||||||
|
nmake -f Makefile.msvc
|
||||||
|
|
||||||
|
to build the static library. By default the debug version is built.
|
||||||
|
To build the release version, type
|
||||||
|
|
||||||
|
nmake -f Makefile.msvc CFG="libunibreak - Win32 Release"
|
||||||
|
|
||||||
|
|
||||||
|
Documentation
|
||||||
|
-------------
|
||||||
|
|
||||||
|
Check the generated documents doc/html/linebreak_8h.html and
|
||||||
|
doc/html/wordbreak_8h.html in the downloaded file for the public
|
||||||
|
interfaces exposed to applications.
|
||||||
|
|
||||||
|
|
||||||
|
$Id: README,v 1.8 2012/08/11 06:55:18 adah Exp $
|
||||||
|
|
||||||
|
vim:autoindent:expandtab:formatoptions=tcqlmn:textwidth=72:
|
6
linebreak/linebreak/bootstrap
Executable file
|
@ -0,0 +1,6 @@
|
||||||
|
#! /bin/sh
|
||||||
|
aclocal && \
|
||||||
|
autoheader && \
|
||||||
|
autoconf && \
|
||||||
|
libtoolize && \
|
||||||
|
automake --add-missing
|
12
linebreak/linebreak/configure.ac
Normal file
|
@ -0,0 +1,12 @@
|
||||||
|
AC_PREREQ(2.57)
|
||||||
|
AC_INIT([libunibreak],[1.0],[wuyongwei@gmail.com])
|
||||||
|
AC_CONFIG_SRCDIR([linebreak.c])
|
||||||
|
AC_CONFIG_HEADERS([config.h])
|
||||||
|
AM_INIT_AUTOMAKE([foreign])
|
||||||
|
|
||||||
|
AC_PROG_CC
|
||||||
|
AC_PROG_LN_S
|
||||||
|
AC_EXEEXT
|
||||||
|
AM_PROG_LIBTOOL
|
||||||
|
AC_CONFIG_FILES([Makefile])
|
||||||
|
AC_OUTPUT([libunibreak.pc])
|
47
linebreak/linebreak/filter_dup.c
Normal file
|
@ -0,0 +1,47 @@
|
||||||
|
#include <stdio.h>
|
||||||
|
#include <string.h>
|
||||||
|
|
||||||
|
int main()
|
||||||
|
{
|
||||||
|
char s[80];
|
||||||
|
char beg[16];
|
||||||
|
char end[16];
|
||||||
|
char prop[16];
|
||||||
|
char lastbeg[16];
|
||||||
|
char lastend[16];
|
||||||
|
char lastprop[16];
|
||||||
|
lastprop[0] = 0;
|
||||||
|
for (;;)
|
||||||
|
{
|
||||||
|
if (fgets(s, sizeof s, stdin) == NULL)
|
||||||
|
break;
|
||||||
|
if (strstr(s, "LBP_") == NULL || strstr(s, "LBP_Undef") != NULL)
|
||||||
|
{
|
||||||
|
if (lastprop[0])
|
||||||
|
{
|
||||||
|
printf("\t{ %s %s %s },\n", lastbeg, lastend, lastprop);
|
||||||
|
lastprop[0] = 0;
|
||||||
|
}
|
||||||
|
printf("%s", s);
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
sscanf(s, "\t{ %15s %15s %15s }", beg, end, prop);
|
||||||
|
/*printf("==>\t{ \"%s\" \"%s\" \"%s\" },\n", beg, end, prop);*/
|
||||||
|
if (lastprop[0] && strcmp(lastprop, prop) != 0)
|
||||||
|
{
|
||||||
|
printf("\t{ %s %s %s },\n", lastbeg, lastend, lastprop);
|
||||||
|
lastprop[0] = 0;
|
||||||
|
}
|
||||||
|
if (lastprop[0] == 0)
|
||||||
|
{
|
||||||
|
strcpy(lastbeg, beg);
|
||||||
|
strcpy(lastprop, prop);
|
||||||
|
}
|
||||||
|
strcpy(lastend, end);
|
||||||
|
}
|
||||||
|
if (lastprop[0])
|
||||||
|
{
|
||||||
|
printf("\t{ %s %s %s },\n", lastbeg, lastend, lastprop);
|
||||||
|
}
|
||||||
|
return 0;
|
||||||
|
}
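The filter above reads table rows from standard input and collapses consecutive rows that carry the same property into one row, keeping the first row's start and the last row's end. The following is a minimal sketch of that merging step on an in-memory array rather than on text lines. The `struct Range` type and the `merge_ranges` function are hypothetical names introduced for illustration only; they are not part of the library. Like the filter, the sketch does not check whether merged ranges are actually contiguous.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical row type for this sketch; filter_dup.c itself works on
 * text lines of the form "\t{ start, end, prop }". */
struct Range
{
    unsigned long start;    /* first code point of the row */
    unsigned long end;      /* last code point of the row */
    const char *prop;       /* property name, e.g. "LBP_AL" */
};

/* Merges consecutive rows that share the same property, in place.
 * Returns the number of rows that remain. */
static size_t merge_ranges(struct Range *rows, size_t n)
{
    size_t out = 0;
    size_t i;

    assert(rows != NULL || n == 0);
    for (i = 0; i < n; ++i)
    {
        if (out > 0 && strcmp(rows[out - 1].prop, rows[i].prop) == 0)
            rows[out - 1].end = rows[i].end;    /* extend the open run */
        else
            rows[out++] = rows[i];              /* start a new run */
    }
    return out;
}
```

Called on an array whose first two rows share a property, the function would leave a single row spanning from the first row's start to the second row's end, followed by the remaining rows unchanged.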
|
11
linebreak/linebreak/libunibreak.pc.in
Normal file
|
@ -0,0 +1,11 @@
|
||||||
|
libunibreak:
|
||||||
|
prefix=@prefix@
|
||||||
|
exec_prefix=@exec_prefix@
|
||||||
|
libdir=@libdir@
|
||||||
|
includedir=@includedir@
|
||||||
|
|
||||||
|
Name: libunibreak
|
||||||
|
Description: Library to implement Unicode algorithms for line and word breaking
|
||||||
|
Version: @VERSION@
|
||||||
|
Libs: -L${libdir} -lunibreak
|
||||||
|
Cflags: -I${includedir}
|
737
linebreak/linebreak/linebreak.c
Normal file
|
@ -0,0 +1,737 @@
|
||||||
|
/* vim: set tabstop=4 shiftwidth=4: */
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Line breaking in a Unicode sequence. Designed to be used in a
|
||||||
|
* generic text renderer.
|
||||||
|
*
|
||||||
|
* Copyright (C) 2008-2011 Wu Yongwei <wuyongwei at gmail dot com>
|
||||||
|
*
|
||||||
|
* This software is provided 'as-is', without any express or implied
|
||||||
|
* warranty. In no event will the author be held liable for any damages
|
||||||
|
* arising from the use of this software.
|
||||||
|
*
|
||||||
|
* Permission is granted to anyone to use this software for any purpose,
|
||||||
|
* including commercial applications, and to alter it and redistribute
|
||||||
|
* it freely, subject to the following restrictions:
|
||||||
|
*
|
||||||
|
* 1. The origin of this software must not be misrepresented; you must
|
||||||
|
* not claim that you wrote the original software. If you use this
|
||||||
|
* software in a product, an acknowledgement in the product
|
||||||
|
* documentation would be appreciated but is not required.
|
||||||
|
* 2. Altered source versions must be plainly marked as such, and must
|
||||||
|
* not be misrepresented as being the original software.
|
||||||
|
* 3. This notice may not be removed or altered from any source
|
||||||
|
* distribution.
|
||||||
|
*
|
||||||
|
* The main reference is Unicode Standard Annex 14 (UAX #14):
|
||||||
|
* <URL:http://www.unicode.org/reports/tr14/>
|
||||||
|
*
|
||||||
|
* When this library was designed, this annex was at Revision 19, for
|
||||||
|
* Unicode 5.0.0:
|
||||||
|
* <URL:http://www.unicode.org/reports/tr14/tr14-19.html>
|
||||||
|
*
|
||||||
|
* This library has been updated according to Revision 26, for
|
||||||
|
* Unicode 6.0.0:
|
||||||
|
* <URL:http://www.unicode.org/reports/tr14/tr14-26.html>
|
||||||
|
*
|
||||||
|
* The Unicode Terms of Use are available at
|
||||||
|
* <URL:http://www.unicode.org/copyright.html>
|
||||||
|
*/
|
||||||
|
|
||||||
|
/**
|
||||||
|
* @file linebreak.c
|
||||||
|
*
|
||||||
|
* Implementation of the line breaking algorithm as described in Unicode
|
||||||
|
* Standard Annex 14.
|
||||||
|
*
|
||||||
|
* @version 2.1, 2011/05/07
|
||||||
|
* @author Wu Yongwei
|
||||||
|
*/
|
||||||
|
|
||||||
|
#include <assert.h>
|
||||||
|
#include <stddef.h>
|
||||||
|
#include <string.h>
|
||||||
|
#include "linebreak.h"
|
||||||
|
#include "linebreakdef.h"
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Size of the second-level index to the line breaking properties.
|
||||||
|
*/
|
||||||
|
#define LINEBREAK_INDEX_SIZE 40
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Version number of the library.
|
||||||
|
*/
|
||||||
|
const int linebreak_version = LINEBREAK_VERSION;
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Enumeration of break actions. They are used in the break action
|
||||||
|
* pair table below.
|
||||||
|
*/
|
||||||
|
enum BreakAction
|
||||||
|
{
|
||||||
|
DIR_BRK, /**< Direct break opportunity */
|
||||||
|
IND_BRK, /**< Indirect break opportunity */
|
||||||
|
CMI_BRK, /**< Indirect break opportunity for combining marks */
|
||||||
|
CMP_BRK, /**< Prohibited break for combining marks */
|
||||||
|
PRH_BRK /**< Prohibited break */
|
||||||
|
};
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Break action pair table. This is a direct mapping of Table 2 of
|
||||||
|
* Unicode Standard Annex 14, Revision 24.
|
||||||
|
*/
|
||||||
|
static enum BreakAction baTable[LBP_JT][LBP_JT] = {
|
||||||
|
{ /* OP */
|
||||||
|
PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, CMP_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK },
|
||||||
|
{ /* CL */
|
||||||
|
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, PRH_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
|
||||||
|
DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
|
||||||
|
{ /* CP */
|
||||||
|
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, PRH_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, DIR_BRK,
|
||||||
|
DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
|
||||||
|
{ /* QU */
|
||||||
|
PRH_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK,
|
||||||
|
IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK },
|
||||||
|
{ /* GL */
|
||||||
|
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK,
|
||||||
|
IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK },
|
||||||
|
{ /* NS */
|
||||||
|
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
|
||||||
|
DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
|
||||||
|
{ /* EX */
|
||||||
|
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
|
||||||
|
DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
|
||||||
|
{ /* SY */
|
||||||
|
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, IND_BRK, DIR_BRK, DIR_BRK,
|
||||||
|
DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
|
||||||
|
{ /* IS */
|
||||||
|
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, DIR_BRK,
|
||||||
|
DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
|
||||||
|
{ /* PR */
|
||||||
|
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, IND_BRK,
|
||||||
|
DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK },
|
||||||
|
{ /* PO */
|
||||||
|
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, DIR_BRK,
|
||||||
|
DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
|
||||||
|
{ /* NU */
|
||||||
|
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, DIR_BRK,
|
||||||
|
IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
|
||||||
|
{ /* AL */
|
||||||
|
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, DIR_BRK,
|
||||||
|
IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
|
||||||
|
{ /* ID */
|
||||||
|
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, DIR_BRK, IND_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
|
||||||
|
IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
|
||||||
|
{ /* IN */
|
||||||
|
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
|
||||||
|
IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
|
||||||
|
{ /* HY */
|
||||||
|
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, DIR_BRK, IND_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, IND_BRK, DIR_BRK, DIR_BRK,
|
||||||
|
DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
|
||||||
|
{ /* BA */
|
||||||
|
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, DIR_BRK, IND_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
|
||||||
|
DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
|
||||||
|
{ /* BB */
|
||||||
|
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK,
|
||||||
|
IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK },
|
||||||
|
{ /* B2 */
|
||||||
|
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
|
||||||
|
DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, PRH_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
|
||||||
|
{ /* ZW */
|
||||||
|
DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
|
||||||
|
DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
|
||||||
|
DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, PRH_BRK, DIR_BRK,
|
||||||
|
DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
|
||||||
|
{ /* CM */
|
||||||
|
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, DIR_BRK,
|
||||||
|
IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
|
||||||
|
{ /* WJ */
|
||||||
|
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK,
|
||||||
|
IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK },
|
||||||
|
{ /* H2 */
|
||||||
|
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, DIR_BRK, IND_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
|
||||||
|
IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, IND_BRK, IND_BRK },
|
||||||
|
{ /* H3 */
|
||||||
|
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, DIR_BRK, IND_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
|
||||||
|
IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, IND_BRK },
|
||||||
|
{ /* JL */
|
||||||
|
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, DIR_BRK, IND_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
|
||||||
|
IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, DIR_BRK },
|
||||||
|
{ /* JV */
|
||||||
|
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, DIR_BRK, IND_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
|
||||||
|
IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, IND_BRK, IND_BRK },
|
||||||
|
{ /* JT */
|
||||||
|
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
|
||||||
|
PRH_BRK, PRH_BRK, DIR_BRK, IND_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
|
||||||
|
IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
|
||||||
|
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, IND_BRK }
|
||||||
|
};
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Struct for the second-level index to the line breaking properties.
|
||||||
|
*/
|
||||||
|
struct LineBreakPropertiesIndex
|
||||||
|
{
|
||||||
|
utf32_t end; /**< End code point */
|
||||||
|
struct LineBreakProperties *lbp;/**< Pointer to line breaking properties */
|
||||||
|
};
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Second-level index to the line breaking properties.
|
||||||
|
*/
|
||||||
|
static struct LineBreakPropertiesIndex lb_prop_index[LINEBREAK_INDEX_SIZE] =
|
||||||
|
{
|
||||||
|
{ 0xFFFFFFFF, lb_prop_default }
|
||||||
|
};
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Initializes the second-level index to the line breaking properties.
|
||||||
|
* If it is not called, the performance of #get_char_lb_class_lang (and
|
||||||
|
* thus the main functionality) can be pretty bad, especially for big
|
||||||
|
* code points like those of Chinese.
|
||||||
|
*/
|
||||||
|
void init_linebreak(void)
|
||||||
|
{
|
||||||
|
size_t i;
|
||||||
|
size_t iPropDefault;
|
||||||
|
size_t len;
|
||||||
|
size_t step;
|
||||||
|
|
||||||
|
len = 0;
|
||||||
|
while (lb_prop_default[len].prop != LBP_Undefined)
|
||||||
|
++len;
|
||||||
|
step = len / LINEBREAK_INDEX_SIZE;
|
||||||
|
iPropDefault = 0;
|
||||||
|
for (i = 0; i < LINEBREAK_INDEX_SIZE; ++i)
|
||||||
|
{
|
||||||
|
lb_prop_index[i].lbp = lb_prop_default + iPropDefault;
|
||||||
|
iPropDefault += step;
|
||||||
|
lb_prop_index[i].end = lb_prop_default[iPropDefault].start - 1;
|
||||||
|
}
|
||||||
|
lb_prop_index[--i].end = 0xFFFFFFFF;
|
||||||
|
}
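The two-level index built by init_linebreak can be illustrated with a much smaller table. The following is only a sketch: the `Sk*` names, the four-entry range table and the index size of 2 are hypothetical stand-ins, not the library's real data or its LINEBREAK_INDEX_SIZE.

```c
#include <assert.h>
#include <stddef.h>

#define SK_INDEX_SIZE 2   /* hypothetical; not the library's real index size */

struct SkRange { unsigned int start, end; int prop; };  /* prop 0 = sentinel */
struct SkIndex { unsigned int end; const struct SkRange *first; };

/* Hypothetical sorted range table, terminated by a sentinel entry. */
static const struct SkRange sk_ranges[] = {
    { 0x0000, 0x001F, 1 }, { 0x0020, 0x0020, 2 },
    { 0x0021, 0x007F, 3 }, { 0x0080, 0x10FFFF, 4 },
    { 0xFFFFFFFF, 0xFFFFFFFF, 0 }
};
static struct SkIndex sk_index[SK_INDEX_SIZE];

/* Splits the table into SK_INDEX_SIZE buckets of equal entry count. */
static void sk_init(void)
{
    size_t len = 0, step, pos = 0, i;
    while (sk_ranges[len].prop != 0)
        ++len;
    step = len / SK_INDEX_SIZE;
    for (i = 0; i < SK_INDEX_SIZE; ++i) {
        sk_index[i].first = sk_ranges + pos;
        pos += step;
        /* A bucket ends just before the first range of the next bucket. */
        sk_index[i].end = sk_ranges[pos].start - 1;
    }
    sk_index[SK_INDEX_SIZE - 1].end = 0xFFFFFFFF;
}

/* Picks the bucket first, then scans only the ranges from there on. */
static int sk_class(unsigned int ch)
{
    size_t i = 0;
    const struct SkRange *r;
    assert(sk_index[SK_INDEX_SIZE - 1].end == 0xFFFFFFFF);  /* sk_init ran */
    while (ch > sk_index[i].end)
        ++i;
    for (r = sk_index[i].first; r->prop != 0 && ch >= r->start; ++r)
        if (ch <= r->end)
            return r->prop;
    return -1;
}
```

With this layout a lookup for a high code point skips the leading buckets instead of walking every range from the start of the table, which is the effect the comment on init_linebreak describes.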
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Gets the language-specific line breaking properties.
|
||||||
|
*
|
||||||
|
* @param lang language of the text
|
||||||
|
* @return pointer to the language-specific line breaking
|
||||||
|
* properties array if found; \c NULL otherwise
|
||||||
|
*/
|
||||||
|
static struct LineBreakProperties *get_lb_prop_lang(const char *lang)
|
||||||
|
{
|
||||||
|
struct LineBreakPropertiesLang *lbplIter;
|
||||||
|
if (lang != NULL)
|
||||||
|
{
|
||||||
|
for (lbplIter = lb_prop_lang_map; lbplIter->lang != NULL; ++lbplIter)
|
||||||
|
{
|
||||||
|
if (strncmp(lang, lbplIter->lang, lbplIter->namelen) == 0)
|
||||||
|
{
|
||||||
|
return lbplIter->lbp;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return NULL;
|
||||||
|
}
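Because the comparison uses strncmp with the stored name length, a language tag only has to start with a known prefix. A minimal sketch of that matching rule, using a hypothetical `sk_lang_map` with integer IDs in place of the real property arrays:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct LineBreakPropertiesLang. */
struct SkLang { const char *lang; size_t namelen; int id; };

static const struct SkLang sk_lang_map[] = {
    { "en", 2, 1 }, { "de", 2, 2 }, { "zh", 2, 3 }, { NULL, 0, 0 }
};

/* Returns the ID of the first entry whose name is a prefix of lang,
 * or 0 when lang is NULL or nothing matches. */
static int sk_find_lang(const char *lang)
{
    const struct SkLang *it;
    if (lang == NULL)
        return 0;
    for (it = sk_lang_map; it->lang != NULL; ++it)
        if (strncmp(lang, it->lang, it->namelen) == 0)
            return it->id;
    return 0;
}
```

Under these assumptions, tags such as "en_US" or "zh-TW" resolve to the "en" and "zh" entries, while "ja" finds nothing and the caller falls back to the default data.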
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Gets the line breaking class of a character from a line breaking
|
||||||
|
* properties array.
|
||||||
|
*
|
||||||
|
* @param ch character to check
|
||||||
|
* @param lbp pointer to the line breaking properties array
|
||||||
|
* @return the line breaking class if found; \c LBP_XX otherwise
|
||||||
|
*/
|
||||||
|
static enum LineBreakClass get_char_lb_class(
|
||||||
|
utf32_t ch,
|
||||||
|
struct LineBreakProperties *lbp)
|
||||||
|
{
|
||||||
|
while (lbp->prop != LBP_Undefined && ch >= lbp->start)
|
||||||
|
{
|
||||||
|
if (ch <= lbp->end)
|
||||||
|
return lbp->prop;
|
||||||
|
++lbp;
|
||||||
|
}
|
||||||
|
return LBP_XX;
|
||||||
|
}
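The early exit on `ch >= lbp->start` only works because the array is sorted. A small sketch of the same scan over a hypothetical three-entry table (the `Sk*` names and the reduced enum are illustrative, not the library's types):

```c
#include <assert.h>

enum SketchClass { SK_Undefined, SK_OP, SK_CL, SK_XX };

struct SketchRange { unsigned int start, end; enum SketchClass prop; };

/* Hypothetical sorted table ending in an SK_Undefined sentinel. */
static const struct SketchRange sketch_table[] = {
    { 0x2018, 0x2018, SK_OP },
    { 0x201C, 0x201C, SK_OP },
    { 0x201D, 0x201D, SK_CL },
    { 0, 0, SK_Undefined }
};

/* Scans ranges in order; stops at the sentinel or once ch lies below
 * the start of the current range, since later ranges start higher. */
static enum SketchClass sketch_lookup(unsigned int ch,
                                      const struct SketchRange *r)
{
    assert(r != 0);
    while (r->prop != SK_Undefined && ch >= r->start) {
        if (ch <= r->end)
            return r->prop;
        ++r;
    }
    return SK_XX;
}
```

A code point that falls into a gap between ranges (U+2019 here) stops the loop at the next higher range and yields the "unknown" class, which lets the caller try another table.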
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Gets the line breaking class of a character from the default line
|
||||||
|
* breaking properties array.
|
||||||
|
*
|
||||||
|
* @param ch character to check
|
||||||
|
* @return the line breaking class if found; \c LBP_XX otherwise
|
||||||
|
*/
|
||||||
|
static enum LineBreakClass get_char_lb_class_default(
|
||||||
|
utf32_t ch)
|
||||||
|
{
|
||||||
|
size_t i = 0;
|
||||||
|
while (ch > lb_prop_index[i].end)
|
||||||
|
++i;
|
||||||
|
assert(i < LINEBREAK_INDEX_SIZE);
|
||||||
|
return get_char_lb_class(ch, lb_prop_index[i].lbp);
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Gets the line breaking class of a character for a specific
|
||||||
|
* language. This function will check the language-specific data first,
|
||||||
|
* and then the default data if there is no language-specific property
|
||||||
|
* available for the character.
|
||||||
|
*
|
||||||
|
* @param ch character to check
|
||||||
|
* @param lbpLang pointer to the language-specific line breaking
|
||||||
|
* properties array
|
||||||
|
* @return the line breaking class if found; \c LBP_XX
|
||||||
|
* otherwise
|
||||||
|
*/
|
||||||
|
static enum LineBreakClass get_char_lb_class_lang(
|
||||||
|
utf32_t ch,
|
||||||
|
struct LineBreakProperties *lbpLang)
|
||||||
|
{
|
||||||
|
enum LineBreakClass lbcResult;
|
||||||
|
|
||||||
|
/* Find the language-specific line breaking class for a character */
|
||||||
|
if (lbpLang)
|
||||||
|
{
|
||||||
|
lbcResult = get_char_lb_class(ch, lbpLang);
|
||||||
|
if (lbcResult != LBP_XX)
|
||||||
|
return lbcResult;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Find the generic (default) line breaking class, if no
|
||||||
|
* language context is provided, or language-specific data are not
|
||||||
|
* available for the specific character in the specified language */
|
||||||
|
return get_char_lb_class_default(ch);
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Resolves the line breaking class for certain ambiguous or complicated
|
||||||
|
* characters. They are treated in a simplistic way in this
|
||||||
|
* implementation.
|
||||||
|
*
|
||||||
|
* @param lbc line breaking class to resolve
|
||||||
|
* @param lang language of the text
|
||||||
|
* @return the resolved line breaking class
|
||||||
|
*/
|
||||||
|
static enum LineBreakClass resolve_lb_class(
|
||||||
|
enum LineBreakClass lbc,
|
||||||
|
const char *lang)
|
||||||
|
{
|
||||||
|
switch (lbc)
|
||||||
|
{
|
||||||
|
case LBP_AI:
|
||||||
|
if (lang != NULL &&
|
||||||
|
(strncmp(lang, "zh", 2) == 0 || /* Chinese */
|
||||||
|
strncmp(lang, "ja", 2) == 0 || /* Japanese */
|
||||||
|
strncmp(lang, "ko", 2) == 0)) /* Korean */
|
||||||
|
{
|
||||||
|
return LBP_ID;
|
||||||
|
}
|
||||||
|
/* Fall through */
|
||||||
|
case LBP_SA:
|
||||||
|
case LBP_SG:
|
||||||
|
case LBP_XX:
|
||||||
|
return LBP_AL;
|
||||||
|
default:
|
||||||
|
return lbc;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Gets the next Unicode character in a UTF-8 sequence. The index will
|
||||||
|
* be advanced to the next complete character, unless the end of string
|
||||||
|
* is reached in the middle of a UTF-8 sequence.
|
||||||
|
*
|
||||||
|
* @param[in] s input UTF-8 string
|
||||||
|
* @param[in] len length of the string in bytes
|
||||||
|
* @param[in,out] ip pointer to the index
|
||||||
|
* @return the Unicode character beginning at the index; or
|
||||||
|
* #EOS if end of input is encountered
|
||||||
|
*/
|
||||||
|
utf32_t lb_get_next_char_utf8(
|
||||||
|
const utf8_t *s,
|
||||||
|
size_t len,
|
||||||
|
size_t *ip)
|
||||||
|
{
|
||||||
|
utf8_t ch;
|
||||||
|
utf32_t res;
|
||||||
|
|
||||||
|
assert(*ip <= len);
|
||||||
|
if (*ip == len)
|
||||||
|
return EOS;
|
||||||
|
ch = s[*ip];
|
||||||
|
|
||||||
|
if (ch < 0xC2 || ch > 0xF4)
|
||||||
|
{ /* One-byte sequence, tail (should not occur), or invalid */
|
||||||
|
*ip += 1;
|
||||||
|
return ch;
|
||||||
|
}
|
||||||
|
else if (ch < 0xE0)
|
||||||
|
{ /* Two-byte sequence */
|
||||||
|
if (*ip + 2 > len)
|
||||||
|
return EOS;
|
||||||
|
res = ((ch & 0x1F) << 6) + (s[*ip + 1] & 0x3F);
|
||||||
|
*ip += 2;
|
||||||
|
return res;
|
||||||
|
}
|
||||||
|
else if (ch < 0xF0)
|
||||||
|
{ /* Three-byte sequence */
|
||||||
|
if (*ip + 3 > len)
|
||||||
|
return EOS;
|
||||||
|
res = ((ch & 0x0F) << 12) +
|
||||||
|
((s[*ip + 1] & 0x3F) << 6) +
|
||||||
|
((s[*ip + 2] & 0x3F));
|
||||||
|
*ip += 3;
|
||||||
|
return res;
|
||||||
|
}
|
||||||
|
else
|
||||||
|
{ /* Four-byte sequence */
|
||||||
|
if (*ip + 4 > len)
|
||||||
|
return EOS;
|
||||||
|
res = ((ch & 0x07) << 18) +
|
||||||
|
((s[*ip + 1] & 0x3F) << 12) +
|
||||||
|
((s[*ip + 2] & 0x3F) << 6) +
|
||||||
|
((s[*ip + 3] & 0x3F));
|
||||||
|
*ip += 4;
|
||||||
|
return res;
|
||||||
|
}
|
||||||
|
}
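The decoding steps above can be condensed into a loop over the continuation bytes. This is a standalone sketch, not the library function: `sketch_next_utf8` and `SKETCH_EOS` are hypothetical names, and like the original it does not validate continuation bytes.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_EOS 0xFFFFu  /* hypothetical stand-in for the library's EOS */

/* Decodes one code point starting at s[*ip] and advances *ip.
 * Returns SKETCH_EOS at end of input or on a truncated sequence,
 * in which case the index is left untouched. */
static unsigned int sketch_next_utf8(const unsigned char *s, size_t len,
                                     size_t *ip)
{
    unsigned char ch;
    unsigned int res;
    size_t need, k;

    assert(*ip <= len);
    if (*ip == len)
        return SKETCH_EOS;
    ch = s[*ip];

    if (ch < 0xC2 || ch > 0xF4) {   /* ASCII, stray tail byte, or invalid */
        *ip += 1;
        return ch;
    }
    if (ch < 0xE0)      { need = 2; res = ch & 0x1Fu; }
    else if (ch < 0xF0) { need = 3; res = ch & 0x0Fu; }
    else                { need = 4; res = ch & 0x07u; }

    if (*ip + need > len)
        return SKETCH_EOS;
    for (k = 1; k < need; ++k)      /* append 6 payload bits per tail byte */
        res = (res << 6) | (s[*ip + k] & 0x3Fu);
    *ip += need;
    return res;
}
```

Calling it repeatedly on the bytes 41 C3 A9 E4 B8 AD F0 9F 98 80 would be expected to yield U+0041, U+00E9, U+4E2D and U+1F600 before the end-of-string marker.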
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Gets the next Unicode character in a UTF-16 sequence. The index will
|
||||||
|
* be advanced to the next complete character, unless the end of string
|
||||||
|
* is reached in the middle of a UTF-16 surrogate pair.
|
||||||
|
*
|
||||||
|
* @param[in] s input UTF-16 string
|
||||||
|
* @param[in] len length of the string in words
|
||||||
|
* @param[in,out] ip pointer to the index
|
||||||
|
* @return the Unicode character beginning at the index; or
|
||||||
|
* #EOS if end of input is encountered
|
||||||
|
*/
|
||||||
|
utf32_t lb_get_next_char_utf16(
|
||||||
|
const utf16_t *s,
|
||||||
|
size_t len,
|
||||||
|
size_t *ip)
|
||||||
|
{
|
||||||
|
utf16_t ch;
|
||||||
|
|
||||||
|
assert(*ip <= len);
|
||||||
|
if (*ip == len)
|
||||||
|
return EOS;
|
||||||
|
ch = s[(*ip)++];
|
||||||
|
|
||||||
|
if (ch < 0xD800 || ch > 0xDBFF)
|
||||||
|
{ /* If the character is not a high surrogate */
|
||||||
|
return ch;
|
||||||
|
}
|
||||||
|
if (*ip == len)
|
||||||
|
{ /* If the input ends here (an error) */
|
||||||
|
--(*ip);
|
||||||
|
return EOS;
|
||||||
|
}
|
||||||
|
if (s[*ip] < 0xDC00 || s[*ip] > 0xDFFF)
|
||||||
|
{ /* If the next character is not the low surrogate (an error) */
|
||||||
|
return ch;
|
||||||
|
}
|
||||||
|
/* Return the constructed character and advance the index again */
|
||||||
|
return (((utf32_t)ch & 0x3FF) << 10) + (s[(*ip)++] & 0x3FF) + 0x10000;
|
||||||
|
}
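The arithmetic in the final return statement can be checked in isolation. A minimal sketch, with `sketch_combine_surrogates` as a hypothetical helper name:

```c
#include <assert.h>

/* Combines a high/low surrogate pair into one code point: the low 10 bits
 * of each half are concatenated and offset by 0x10000. */
static unsigned int sketch_combine_surrogates(unsigned short hi,
                                              unsigned short lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF);
    assert(lo >= 0xDC00 && lo <= 0xDFFF);
    return (((unsigned int)hi & 0x3FFu) << 10) + (lo & 0x3FFu) + 0x10000u;
}
```

For the pair D83D DE00 this gives (0x3D << 10) + 0x200 + 0x10000 = 0x1F600; the extremes D800 DC00 and DBFF DFFF map to U+10000 and U+10FFFF.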
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Gets the next Unicode character in a UTF-32 sequence. The index will
|
||||||
|
* be advanced to the next character.
|
||||||
|
*
|
||||||
|
* @param[in] s input UTF-32 string
|
||||||
|
* @param[in] len length of the string in dwords
|
||||||
|
* @param[in,out] ip pointer to the index
|
||||||
|
* @return the Unicode character beginning at the index; or
|
||||||
|
* #EOS if end of input is encountered
|
||||||
|
*/
|
||||||
|
utf32_t lb_get_next_char_utf32(
|
||||||
|
const utf32_t *s,
|
||||||
|
size_t len,
|
||||||
|
size_t *ip)
|
||||||
|
{
|
||||||
|
assert(*ip <= len);
|
||||||
|
if (*ip == len)
|
||||||
|
return EOS;
|
||||||
|
return s[(*ip)++];
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Sets the line breaking information for a generic input string.
|
||||||
|
*
|
||||||
|
* @param[in] s input string
|
||||||
|
* @param[in] len length of the input
|
||||||
|
* @param[in] lang language of the input
|
||||||
|
* @param[out] brks pointer to the output breaking data,
|
||||||
|
* containing #LINEBREAK_MUSTBREAK,
|
||||||
|
* #LINEBREAK_ALLOWBREAK, #LINEBREAK_NOBREAK,
|
||||||
|
* or #LINEBREAK_INSIDEACHAR
|
||||||
|
* @param[in] get_next_char function to get the next UTF-32 character
|
||||||
|
*/
|
||||||
|
void set_linebreaks(
|
||||||
|
const void *s,
|
||||||
|
size_t len,
|
||||||
|
const char *lang,
|
||||||
|
char *brks,
|
||||||
|
get_next_char_t get_next_char)
|
||||||
|
{
|
||||||
|
utf32_t ch;
|
||||||
|
enum LineBreakClass lbcCur;
|
||||||
|
enum LineBreakClass lbcNew;
|
||||||
|
enum LineBreakClass lbcLast;
|
||||||
|
struct LineBreakProperties *lbpLang;
|
||||||
|
size_t posCur = 0;
|
||||||
|
size_t posLast = 0;
|
||||||
|
|
||||||
|
--posLast; /* To be ++'d later */
|
||||||
|
ch = get_next_char(s, len, &posCur);
|
||||||
|
if (ch == EOS)
|
||||||
|
return;
|
||||||
|
lbpLang = get_lb_prop_lang(lang);
|
||||||
|
lbcCur = resolve_lb_class(get_char_lb_class_lang(ch, lbpLang), lang);
|
||||||
|
lbcNew = LBP_Undefined;
|
||||||
|
|
||||||
|
nextline:
|
||||||
|
|
||||||
|
/* Special treatment for the first character */
|
||||||
|
switch (lbcCur)
|
||||||
|
{
|
||||||
|
case LBP_LF:
|
||||||
|
case LBP_NL:
|
||||||
|
lbcCur = LBP_BK;
|
||||||
|
break;
|
||||||
|
case LBP_CB:
|
||||||
|
lbcCur = LBP_BA;
|
||||||
|
break;
|
||||||
|
case LBP_SP:
|
||||||
|
lbcCur = LBP_WJ;
|
||||||
|
break;
|
||||||
|
default:
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Process a line till an explicit break or end of string */
|
||||||
|
for (;;)
|
||||||
|
{
|
||||||
|
for (++posLast; posLast < posCur - 1; ++posLast)
|
||||||
|
{
|
||||||
|
brks[posLast] = LINEBREAK_INSIDEACHAR;
|
||||||
|
}
|
||||||
|
assert(posLast == posCur - 1);
|
||||||
|
lbcLast = lbcNew;
|
||||||
|
ch = get_next_char(s, len, &posCur);
|
||||||
|
if (ch == EOS)
|
||||||
|
break;
|
||||||
|
lbcNew = get_char_lb_class_lang(ch, lbpLang);
|
||||||
|
if (lbcCur == LBP_BK || (lbcCur == LBP_CR && lbcNew != LBP_LF))
|
||||||
|
{
|
||||||
|
brks[posLast] = LINEBREAK_MUSTBREAK;
|
||||||
|
lbcCur = resolve_lb_class(lbcNew, lang);
|
||||||
|
goto nextline;
|
||||||
|
}
|
||||||
|
|
||||||
|
switch (lbcNew)
|
||||||
|
{
|
||||||
|
case LBP_SP:
|
||||||
|
brks[posLast] = LINEBREAK_NOBREAK;
|
||||||
|
continue;
|
||||||
|
case LBP_BK:
|
||||||
|
case LBP_LF:
|
||||||
|
case LBP_NL:
|
||||||
|
brks[posLast] = LINEBREAK_NOBREAK;
|
||||||
|
lbcCur = LBP_BK;
|
||||||
|
continue;
|
||||||
|
case LBP_CR:
|
||||||
|
brks[posLast] = LINEBREAK_NOBREAK;
|
||||||
|
lbcCur = LBP_CR;
|
||||||
|
continue;
|
||||||
|
case LBP_CB:
|
||||||
|
brks[posLast] = LINEBREAK_ALLOWBREAK;
|
||||||
|
lbcCur = LBP_BA;
|
||||||
|
continue;
|
||||||
|
default:
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
|
||||||
|
lbcNew = resolve_lb_class(lbcNew, lang);
|
||||||
|
|
||||||
|
assert(lbcCur <= LBP_JT);
|
||||||
|
assert(lbcNew <= LBP_JT);
|
||||||
|
switch (baTable[lbcCur - 1][lbcNew - 1])
|
||||||
|
{
|
||||||
|
case DIR_BRK:
|
||||||
|
brks[posLast] = LINEBREAK_ALLOWBREAK;
|
||||||
|
break;
|
||||||
|
case CMI_BRK:
|
||||||
|
case IND_BRK:
|
||||||
|
if (lbcLast == LBP_SP)
|
||||||
|
{
|
||||||
|
brks[posLast] = LINEBREAK_ALLOWBREAK;
|
||||||
|
}
|
||||||
|
else
|
||||||
|
{
|
||||||
|
brks[posLast] = LINEBREAK_NOBREAK;
|
||||||
|
}
|
||||||
|
break;
|
||||||
|
case CMP_BRK:
|
||||||
|
brks[posLast] = LINEBREAK_NOBREAK;
|
||||||
|
if (lbcLast != LBP_SP)
|
||||||
|
continue;
|
||||||
|
break;
|
||||||
|
case PRH_BRK:
|
||||||
|
brks[posLast] = LINEBREAK_NOBREAK;
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
|
||||||
|
lbcCur = lbcNew;
|
||||||
|
}
|
||||||
|
|
||||||
|
assert(posLast == posCur - 1 && posCur <= len);
|
||||||
|
/* Break after the last character */
|
||||||
|
brks[posLast] = LINEBREAK_MUSTBREAK;
|
||||||
|
/* When the input contains incomplete sequences */
|
||||||
|
while (posCur < len)
|
||||||
|
{
|
||||||
|
brks[posCur++] = LINEBREAK_INSIDEACHAR;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Sets the line breaking information for a UTF-8 input string.
|
||||||
|
*
|
||||||
|
* @param[in] s input UTF-8 string
|
||||||
|
* @param[in] len length of the input
|
||||||
|
* @param[in] lang language of the input
|
||||||
|
* @param[out] brks pointer to the output breaking data, containing
|
||||||
|
* #LINEBREAK_MUSTBREAK, #LINEBREAK_ALLOWBREAK,
|
||||||
|
* #LINEBREAK_NOBREAK, or #LINEBREAK_INSIDEACHAR
|
||||||
|
*/
|
||||||
|
void set_linebreaks_utf8(
|
||||||
|
const utf8_t *s,
|
||||||
|
size_t len,
|
||||||
|
const char *lang,
|
||||||
|
char *brks)
|
||||||
|
{
|
||||||
|
set_linebreaks(s, len, lang, brks,
|
||||||
|
(get_next_char_t)lb_get_next_char_utf8);
|
||||||
|
}
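A caller typically consumes the `brks` array to decide where a line may end. The following sketch assumes that `brks[i]` describes the position after code unit `i`, as in the functions above; `sk_first_line_len` is a hypothetical helper, the `SK_*` constants mirror the LINEBREAK_* values, and the width is counted in code units rather than display columns.

```c
#include <assert.h>
#include <stddef.h>

/* Values mirror the LINEBREAK_* macros declared in linebreak.h. */
enum { SK_MUSTBREAK = 0, SK_ALLOWBREAK = 1, SK_NOBREAK = 2, SK_INSIDEACHAR = 3 };

/* Returns how many code units to place on the first line so that it is at
 * most `width` units long: up to a mandatory break if one comes first,
 * otherwise up to the last allowed break. Returns 0 when no break
 * opportunity exists within `width`. */
static size_t sk_first_line_len(const char *brks, size_t len, size_t width)
{
    size_t i, last_ok = 0;
    assert(brks != 0);
    for (i = 0; i < len && i < width; ++i) {
        if (brks[i] == SK_MUSTBREAK)
            return i + 1;
        if (brks[i] == SK_ALLOWBREAK)
            last_ok = i + 1;
    }
    return last_ok;
}
```

For the text "ab cd ef" with hand-written break data {2, 2, 1, 2, 2, 1, 2, 0}, a width of 5 gives a first line of 3 units ("ab "), and a width of 8 reaches the mandatory break at the end.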
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Sets the line breaking information for a UTF-16 input string.
|
||||||
|
*
|
||||||
|
* @param[in] s input UTF-16 string
|
||||||
|
* @param[in] len length of the input
|
||||||
|
* @param[in] lang language of the input
|
||||||
|
* @param[out] brks pointer to the output breaking data, containing
|
||||||
|
* #LINEBREAK_MUSTBREAK, #LINEBREAK_ALLOWBREAK,
|
||||||
|
* #LINEBREAK_NOBREAK, or #LINEBREAK_INSIDEACHAR
|
||||||
|
*/
|
||||||
|
void set_linebreaks_utf16(
|
||||||
|
const utf16_t *s,
|
||||||
|
size_t len,
|
||||||
|
const char *lang,
|
||||||
|
char *brks)
|
||||||
|
{
|
||||||
|
set_linebreaks(s, len, lang, brks,
|
||||||
|
(get_next_char_t)lb_get_next_char_utf16);
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Sets the line breaking information for a UTF-32 input string.
|
||||||
|
*
|
||||||
|
* @param[in] s input UTF-32 string
|
||||||
|
* @param[in] len length of the input
|
||||||
|
* @param[in] lang language of the input
|
||||||
|
* @param[out] brks pointer to the output breaking data, containing
|
||||||
|
* #LINEBREAK_MUSTBREAK, #LINEBREAK_ALLOWBREAK,
|
||||||
|
* #LINEBREAK_NOBREAK, or #LINEBREAK_INSIDEACHAR
|
||||||
|
*/
|
||||||
|
void set_linebreaks_utf32(
|
||||||
|
const utf32_t *s,
|
||||||
|
size_t len,
|
||||||
|
const char *lang,
|
||||||
|
char *brks)
|
||||||
|
{
|
||||||
|
set_linebreaks(s, len, lang, brks,
|
||||||
|
(get_next_char_t)lb_get_next_char_utf32);
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Tells whether a line break can occur between two Unicode characters.
|
||||||
|
* This is a wrapper function to expose a simple interface. Generally
|
||||||
|
* speaking, it is better to use #set_linebreaks_utf32 instead, since
|
||||||
|
* this wrapper sees only two characters and cannot correctly process
|
||||||
|
* complicated cases involving combining marks, spaces, etc.
|
||||||
|
*
|
||||||
|
* @param char1 the first Unicode character
|
||||||
|
* @param char2 the second Unicode character
|
||||||
|
* @param lang language of the input
|
||||||
|
* @return one of #LINEBREAK_MUSTBREAK, #LINEBREAK_ALLOWBREAK,
|
||||||
|
* #LINEBREAK_NOBREAK, or #LINEBREAK_INSIDEACHAR
|
||||||
|
*/
|
||||||
|
int is_line_breakable(
|
||||||
|
utf32_t char1,
|
||||||
|
utf32_t char2,
|
||||||
|
const char* lang)
|
||||||
|
{
|
||||||
|
utf32_t s[2];
|
||||||
|
char brks[2];
|
||||||
|
s[0] = char1;
|
||||||
|
s[1] = char2;
|
||||||
|
set_linebreaks_utf32(s, 2, lang, brks);
|
||||||
|
return brks[0];
|
||||||
|
}
|
87
linebreak/linebreak/linebreak.h
Normal file
|
@ -0,0 +1,87 @@
|
||||||
|
/* vim: set tabstop=4 shiftwidth=4: */
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Line breaking in a Unicode sequence. Designed to be used in a
|
||||||
|
* generic text renderer.
|
||||||
|
*
|
||||||
|
* Copyright (C) 2008-2011 Wu Yongwei <wuyongwei at gmail dot com>
|
||||||
|
*
|
||||||
|
* This software is provided 'as-is', without any express or implied
|
||||||
|
* warranty. In no event will the author be held liable for any damages
|
||||||
|
* arising from the use of this software.
|
||||||
|
*
|
||||||
|
* Permission is granted to anyone to use this software for any purpose,
|
||||||
|
* including commercial applications, and to alter it and redistribute
|
||||||
|
* it freely, subject to the following restrictions:
|
||||||
|
*
|
||||||
|
* 1. The origin of this software must not be misrepresented; you must
|
||||||
|
* not claim that you wrote the original software. If you use this
|
||||||
|
* software in a product, an acknowledgement in the product
|
||||||
|
* documentation would be appreciated but is not required.
|
||||||
|
* 2. Altered source versions must be plainly marked as such, and must
|
||||||
|
* not be misrepresented as being the original software.
|
||||||
|
* 3. This notice may not be removed or altered from any source
|
||||||
|
* distribution.
|
||||||
|
*
|
||||||
|
* The main reference is Unicode Standard Annex 14 (UAX #14):
|
||||||
|
* <URL:http://www.unicode.org/reports/tr14/>
|
||||||
|
*
|
||||||
|
* When this library was designed, this annex was at Revision 19, for
|
||||||
|
* Unicode 5.0.0:
|
||||||
|
* <URL:http://www.unicode.org/reports/tr14/tr14-19.html>
|
||||||
|
*
|
||||||
|
* This library has been updated according to Revision 26, for
|
||||||
|
* Unicode 6.0.0:
|
||||||
|
* <URL:http://www.unicode.org/reports/tr14/tr14-26.html>
|
||||||
|
*
|
||||||
|
* The Unicode Terms of Use are available at
|
||||||
|
* <URL:http://www.unicode.org/copyright.html>
|
||||||
|
*/
|
||||||
|
|
||||||
|
/**
|
||||||
|
* @file linebreak.h
|
||||||
|
*
|
||||||
|
* Header file for the line breaking algorithm.
|
||||||
|
*
|
||||||
|
* @version 2.1, 2011/05/07
|
||||||
|
* @author Wu Yongwei
|
||||||
|
*/
|
||||||
|
|
||||||
|
#ifndef LINEBREAK_H
|
||||||
|
#define LINEBREAK_H
|
||||||
|
|
||||||
|
#include <stddef.h>
|
||||||
|
|
||||||
|
#ifdef __cplusplus
|
||||||
|
extern "C" {
|
||||||
|
#endif
|
||||||
|
|
||||||
|
#define LINEBREAK_VERSION 0x0201 /**< Version of the library linebreak */
|
||||||
|
extern const int linebreak_version;
|
||||||
|
|
||||||
|
#ifndef LINEBREAK_UTF_TYPES_DEFINED
|
||||||
|
#define LINEBREAK_UTF_TYPES_DEFINED
|
||||||
|
typedef unsigned char utf8_t; /**< Type for UTF-8 data points */
|
||||||
|
typedef unsigned short utf16_t; /**< Type for UTF-16 data points */
|
||||||
|
typedef unsigned int utf32_t; /**< Type for UTF-32 data points */
|
||||||
|
#endif
|
||||||
|
|
||||||
|
#define LINEBREAK_MUSTBREAK 0 /**< Break is mandatory */
|
||||||
|
#define LINEBREAK_ALLOWBREAK 1 /**< Break is allowed */
|
||||||
|
#define LINEBREAK_NOBREAK 2 /**< No break is possible */
|
||||||
|
#define LINEBREAK_INSIDEACHAR 3 /**< A UTF-8/16 sequence is unfinished */
|
||||||
|
|
||||||
|
void init_linebreak(void);
|
||||||
|
void set_linebreaks_utf8(
|
||||||
|
const utf8_t *s, size_t len, const char* lang, char *brks);
|
||||||
|
void set_linebreaks_utf16(
|
||||||
|
const utf16_t *s, size_t len, const char* lang, char *brks);
|
||||||
|
void set_linebreaks_utf32(
|
||||||
|
const utf32_t *s, size_t len, const char* lang, char *brks);
|
||||||
|
int is_line_breakable(utf32_t char1, utf32_t char2, const char* lang);
|
||||||
|
|
||||||
|
#ifdef __cplusplus
|
||||||
|
}
|
||||||
|
#endif
|
||||||
|
|
||||||
|
#endif /* LINEBREAK_H */
|
1868
linebreak/linebreak/linebreakdata.c
Normal file
File diff suppressed because it is too large
1
linebreak/linebreak/linebreakdata1.tmpl
Normal file
|
@ -0,0 +1 @@
|
||||||
|
/* The content of this file is generated from:
|
7
linebreak/linebreak/linebreakdata2.tmpl
Normal file
|
@ -0,0 +1,7 @@
|
||||||
|
*/
|
||||||
|
|
||||||
|
#include "linebreak.h"
|
||||||
|
#include "linebreakdef.h"
|
||||||
|
|
||||||
|
/** Default line breaking properties as from the Unicode Web site. */
|
||||||
|
struct LineBreakProperties lb_prop_default[] = {
|
2
linebreak/linebreak/linebreakdata3.tmpl
Normal file
|
@ -0,0 +1,2 @@
|
||||||
|
{ 0xFFFFFFFF, 0xFFFFFFFF, LBP_Undefined }
|
||||||
|
};
|
139
linebreak/linebreak/linebreakdef.c
Normal file
|
@ -0,0 +1,139 @@
|
||||||
|
/* vim: set tabstop=4 shiftwidth=4: */
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Line breaking in a Unicode sequence. Designed to be used in a
|
||||||
|
* generic text renderer.
|
||||||
|
*
|
||||||
|
* Copyright (C) 2008-2011 Wu Yongwei <wuyongwei at gmail dot com>
|
||||||
|
*
|
||||||
|
* This software is provided 'as-is', without any express or implied
|
||||||
|
* warranty. In no event will the author be held liable for any damages
|
||||||
|
* arising from the use of this software.
|
||||||
|
*
|
||||||
|
* Permission is granted to anyone to use this software for any purpose,
|
||||||
|
* including commercial applications, and to alter it and redistribute
|
||||||
|
* it freely, subject to the following restrictions:
|
||||||
|
*
|
||||||
|
* 1. The origin of this software must not be misrepresented; you must
|
||||||
|
* not claim that you wrote the original software. If you use this
|
||||||
|
* software in a product, an acknowledgement in the product
|
||||||
|
* documentation would be appreciated but is not required.
|
||||||
|
* 2. Altered source versions must be plainly marked as such, and must
|
||||||
|
* not be misrepresented as being the original software.
|
||||||
|
* 3. This notice may not be removed or altered from any source
|
||||||
|
* distribution.
|
||||||
|
*
|
||||||
|
* The main reference is Unicode Standard Annex 14 (UAX #14):
|
||||||
|
* <URL:http://www.unicode.org/reports/tr14/>
|
||||||
|
*
|
||||||
|
* When this library was designed, this annex was at Revision 19, for
|
||||||
|
* Unicode 5.0.0:
|
||||||
|
* <URL:http://www.unicode.org/reports/tr14/tr14-19.html>
|
||||||
|
*
|
||||||
|
* This library has been updated according to Revision 26, for
|
||||||
|
* Unicode 6.0.0:
|
||||||
|
* <URL:http://www.unicode.org/reports/tr14/tr14-26.html>
|
||||||
|
*
|
||||||
|
* The Unicode Terms of Use are available at
|
||||||
|
* <URL:http://www.unicode.org/copyright.html>
|
||||||
|
*/
|
||||||
|
|
||||||
|
/**
|
||||||
|
* @file linebreakdef.c
|
||||||
|
*
|
||||||
|
* Definition of language-specific data.
|
||||||
|
*
|
||||||
|
* @version 2.1, 2011/05/07
|
||||||
|
* @author Wu Yongwei
|
||||||
|
*/
|
||||||
|
|
||||||
|
#include "linebreak.h"
|
||||||
|
#include "linebreakdef.h"
|
||||||
|
|
||||||
|
/**
|
||||||
|
* English-specific data over the default Unicode rules.
|
||||||
|
*/
|
||||||
|
static struct LineBreakProperties lb_prop_English[] = {
|
||||||
|
{ 0x2018, 0x2018, LBP_OP }, /* Left single quotation mark: opening */
|
||||||
|
{ 0x201C, 0x201C, LBP_OP }, /* Left double quotation mark: opening */
|
||||||
|
{ 0x201D, 0x201D, LBP_CL }, /* Right double quotation mark: closing */
|
||||||
|
{ 0, 0, LBP_Undefined }
|
||||||
|
};
|
||||||
|
|
||||||
|
/**
|
||||||
|
* German-specific data over the default Unicode rules.
|
||||||
|
*/
|
||||||
|
static struct LineBreakProperties lb_prop_German[] = {
|
||||||
|
{ 0x00AB, 0x00AB, LBP_CL }, /* Left double angle quotation mark: closing */
|
||||||
|
{ 0x00BB, 0x00BB, LBP_OP }, /* Right double angle quotation mark: opening */
|
||||||
|
{ 0x2018, 0x2018, LBP_CL }, /* Left single quotation mark: closing */
|
||||||
|
{ 0x201C, 0x201C, LBP_CL }, /* Left double quotation mark: closing */
|
||||||
|
{ 0x2039, 0x2039, LBP_CL }, /* Left single angle quotation mark: closing */
|
||||||
|
{ 0x203A, 0x203A, LBP_OP }, /* Right single angle quotation mark: opening */
|
||||||
|
{ 0, 0, LBP_Undefined }
|
||||||
|
};
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Spanish-specific data over the default Unicode rules.
|
||||||
|
*/
|
||||||
|
static struct LineBreakProperties lb_prop_Spanish[] = {
|
||||||
|
{ 0x00AB, 0x00AB, LBP_OP }, /* Left double angle quotation mark: opening */
|
||||||
|
{ 0x00BB, 0x00BB, LBP_CL }, /* Right double angle quotation mark: closing */
|
||||||
|
{ 0x2018, 0x2018, LBP_OP }, /* Left single quotation mark: opening */
|
||||||
|
{ 0x201C, 0x201C, LBP_OP }, /* Left double quotation mark: opening */
|
||||||
|
{ 0x201D, 0x201D, LBP_CL }, /* Right double quotation mark: closing */
|
||||||
|
{ 0x2039, 0x2039, LBP_OP }, /* Left single angle quotation mark: opening */
|
||||||
|
{ 0x203A, 0x203A, LBP_CL }, /* Right single angle quotation mark: closing */
|
||||||
|
{ 0, 0, LBP_Undefined }
|
||||||
|
};
|
||||||
|
|
||||||
|
/**
|
||||||
|
* French-specific data over the default Unicode rules.
|
||||||
|
*/
|
||||||
|
static struct LineBreakProperties lb_prop_French[] = {
|
||||||
|
{ 0x00AB, 0x00AB, LBP_OP }, /* Left double angle quotation mark: opening */
|
||||||
|
{ 0x00BB, 0x00BB, LBP_CL }, /* Right double angle quotation mark: closing */
|
||||||
|
{ 0x2018, 0x2018, LBP_OP }, /* Left single quotation mark: opening */
|
||||||
|
{ 0x201C, 0x201C, LBP_OP }, /* Left double quotation mark: opening */
|
||||||
|
{ 0x201D, 0x201D, LBP_CL }, /* Right double quotation mark: closing */
|
||||||
|
{ 0x2039, 0x2039, LBP_OP }, /* Left single angle quotation mark: opening */
|
||||||
|
{ 0x203A, 0x203A, LBP_CL }, /* Right single angle quotation mark: closing */
|
||||||
|
{ 0, 0, LBP_Undefined }
|
||||||
|
};
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Russian-specific data over the default Unicode rules.
|
||||||
|
*/
|
||||||
|
static struct LineBreakProperties lb_prop_Russian[] = {
|
||||||
|
{ 0x00AB, 0x00AB, LBP_OP }, /* Left double angle quotation mark: opening */
|
||||||
|
{ 0x00BB, 0x00BB, LBP_CL }, /* Right double angle quotation mark: closing */
|
||||||
|
{ 0x201C, 0x201C, LBP_CL }, /* Left double quotation mark: closing */
|
||||||
|
{ 0, 0, LBP_Undefined }
|
||||||
|
};
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Chinese-specific data over the default Unicode rules.
|
||||||
|
*/
|
||||||
|
static struct LineBreakProperties lb_prop_Chinese[] = {
|
||||||
|
{ 0x2018, 0x2018, LBP_OP }, /* Left single quotation mark: opening */
|
||||||
|
{ 0x2019, 0x2019, LBP_CL }, /* Right single quotation mark: closing */
|
||||||
|
{ 0x201C, 0x201C, LBP_OP }, /* Left double quotation mark: opening */
|
||||||
|
{ 0x201D, 0x201D, LBP_CL }, /* Right double quotation mark: closing */
|
||||||
|
{ 0, 0, LBP_Undefined }
|
||||||
|
};
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Association data of language-specific line breaking properties with
|
||||||
|
* language names. This is the definition for the static data in this
|
||||||
|
* file. If you want more flexibility, or do not need the data here,
|
||||||
|
* you may want to redefine \e lb_prop_lang_map in your C source file.
|
||||||
|
*/
|
||||||
|
struct LineBreakPropertiesLang lb_prop_lang_map[] = {
|
||||||
|
{ "en", 2, lb_prop_English },
|
||||||
|
{ "de", 2, lb_prop_German },
|
||||||
|
{ "es", 2, lb_prop_Spanish },
|
||||||
|
{ "fr", 2, lb_prop_French },
|
||||||
|
{ "ru", 2, lb_prop_Russian },
|
||||||
|
{ "zh", 2, lb_prop_Chinese },
|
||||||
|
{ NULL, 0, NULL }
|
||||||
|
};
|
149
linebreak/linebreak/linebreakdef.h
Normal file
|
@ -0,0 +1,149 @@
|
||||||
|
/* vim: set tabstop=4 shiftwidth=4: */
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Line breaking in a Unicode sequence. Designed to be used in a
|
||||||
|
* generic text renderer.
|
||||||
|
*
|
||||||
|
* Copyright (C) 2008-2011 Wu Yongwei <wuyongwei at gmail dot com>
|
||||||
|
*
|
||||||
|
* This software is provided 'as-is', without any express or implied
|
||||||
|
* warranty. In no event will the author be held liable for any damages
|
||||||
|
* arising from the use of this software.
|
||||||
|
*
|
||||||
|
* Permission is granted to anyone to use this software for any purpose,
|
||||||
|
* including commercial applications, and to alter it and redistribute
|
||||||
|
* it freely, subject to the following restrictions:
|
||||||
|
*
|
||||||
|
* 1. The origin of this software must not be misrepresented; you must
|
||||||
|
* not claim that you wrote the original software. If you use this
|
||||||
|
* software in a product, an acknowledgement in the product
|
||||||
|
* documentation would be appreciated but is not required.
|
||||||
|
* 2. Altered source versions must be plainly marked as such, and must
|
||||||
|
* not be misrepresented as being the original software.
|
||||||
|
* 3. This notice may not be removed or altered from any source
|
||||||
|
* distribution.
|
||||||
|
*
|
||||||
|
* The main reference is Unicode Standard Annex 14 (UAX #14):
|
||||||
|
* <URL:http://www.unicode.org/reports/tr14/>
|
||||||
|
*
|
||||||
|
* When this library was designed, this annex was at Revision 19, for
|
||||||
|
* Unicode 5.0.0:
|
||||||
|
* <URL:http://www.unicode.org/reports/tr14/tr14-19.html>
|
||||||
|
*
|
||||||
|
* This library has been updated according to Revision 26, for
|
||||||
|
* Unicode 6.0.0:
|
||||||
|
* <URL:http://www.unicode.org/reports/tr14/tr14-26.html>
|
||||||
|
*
|
||||||
|
* The Unicode Terms of Use are available at
|
||||||
|
* <URL:http://www.unicode.org/copyright.html>
|
||||||
|
*/
|
||||||
|
|
||||||
|
/**
|
||||||
|
* @file linebreakdef.h
|
||||||
|
*
|
||||||
|
* Definitions of internal data structures, declarations of global
|
||||||
|
* variables, and function prototypes for the line breaking algorithm.
|
||||||
|
*
|
||||||
|
* @version 2.1, 2011/05/07
|
||||||
|
* @author Wu Yongwei
|
||||||
|
*/
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Constant value to mark the end of string. It is not a valid Unicode
|
||||||
|
* character.
|
||||||
|
*/
|
||||||
|
#define EOS 0xFFFF
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Line break classes. This is a direct mapping of Table 1 of Unicode
|
||||||
|
* Standard Annex 14, Revision 26.
|
||||||
|
*/
|
||||||
|
enum LineBreakClass
|
||||||
|
{
|
||||||
|
/* This is used to signal an error condition. */
|
||||||
|
LBP_Undefined, /**< Undefined */
|
||||||
|
|
||||||
|
/* The following break classes are treated in the pair table. */
|
||||||
|
LBP_OP, /**< Opening punctuation */
|
||||||
|
LBP_CL, /**< Closing punctuation */
|
||||||
|
LBP_CP, /**< Closing parenthesis */
|
||||||
|
LBP_QU, /**< Ambiguous quotation */
|
||||||
|
LBP_GL, /**< Glue */
|
||||||
|
LBP_NS, /**< Non-starters */
|
||||||
|
LBP_EX, /**< Exclamation/Interrogation */
|
||||||
|
LBP_SY, /**< Symbols allowing break after */
|
||||||
|
LBP_IS, /**< Infix separator */
|
||||||
|
LBP_PR, /**< Prefix */
|
||||||
|
LBP_PO, /**< Postfix */
|
||||||
|
LBP_NU, /**< Numeric */
|
||||||
|
LBP_AL, /**< Alphabetic */
|
||||||
|
LBP_ID, /**< Ideographic */
|
||||||
|
LBP_IN, /**< Inseparable characters */
|
||||||
|
LBP_HY, /**< Hyphen */
|
||||||
|
LBP_BA, /**< Break after */
|
||||||
|
LBP_BB, /**< Break before */
|
||||||
|
LBP_B2, /**< Break on either side (but not pair) */
|
||||||
|
LBP_ZW, /**< Zero-width space */
|
||||||
|
LBP_CM, /**< Combining marks */
|
||||||
|
LBP_WJ, /**< Word joiner */
|
||||||
|
LBP_H2, /**< Hangul LV */
|
||||||
|
LBP_H3, /**< Hangul LVT */
|
||||||
|
LBP_JL, /**< Hangul L Jamo */
|
||||||
|
LBP_JV, /**< Hangul V Jamo */
|
||||||
|
LBP_JT, /**< Hangul T Jamo */
|
||||||
|
|
||||||
|
/* The following break classes are not treated in the pair table */
|
||||||
|
LBP_AI, /**< Ambiguous (alphabetic or ideograph) */
|
||||||
|
LBP_BK, /**< Break (mandatory) */
|
||||||
|
LBP_CB, /**< Contingent break */
|
||||||
|
LBP_CR, /**< Carriage return */
|
||||||
|
LBP_LF, /**< Line feed */
|
||||||
|
LBP_NL, /**< Next line */
|
||||||
|
LBP_SA, /**< South-East Asian */
|
||||||
|
LBP_SG, /**< Surrogates */
|
||||||
|
LBP_SP, /**< Space */
|
||||||
|
LBP_XX /**< Unknown */
|
||||||
|
};
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Struct for entries of line break properties. The array of the
|
||||||
|
* entries \e must be sorted.
|
||||||
|
*/
|
||||||
|
struct LineBreakProperties
|
||||||
|
{
|
||||||
|
utf32_t start; /**< Starting code point */
|
||||||
|
utf32_t end; /**< End code point */
|
||||||
|
enum LineBreakClass prop; /**< The line breaking property */
|
||||||
|
};
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Struct for association of language-specific line breaking properties
|
||||||
|
* with language names.
|
||||||
|
*/
|
||||||
|
struct LineBreakPropertiesLang
|
||||||
|
{
|
||||||
|
const char *lang; /**< Language name */
|
||||||
|
size_t namelen; /**< Length of name to match */
|
||||||
|
struct LineBreakProperties *lbp; /**< Pointer to associated data */
|
||||||
|
};
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Abstract function interface for #lb_get_next_char_utf8,
|
||||||
|
* #lb_get_next_char_utf16, and #lb_get_next_char_utf32.
|
||||||
|
*/
|
||||||
|
typedef utf32_t (*get_next_char_t)(const void *, size_t, size_t *);
|
||||||
|
|
||||||
|
/* Declarations */
|
||||||
|
extern struct LineBreakProperties lb_prop_default[];
|
||||||
|
extern struct LineBreakPropertiesLang lb_prop_lang_map[];
|
||||||
|
|
||||||
|
/* Function prototypes */
|
||||||
|
utf32_t lb_get_next_char_utf8(const utf8_t *s, size_t len, size_t *ip);
|
||||||
|
utf32_t lb_get_next_char_utf16(const utf16_t *s, size_t len, size_t *ip);
|
||||||
|
utf32_t lb_get_next_char_utf32(const utf32_t *s, size_t len, size_t *ip);
|
||||||
|
void set_linebreaks(
|
||||||
|
const void *s,
|
||||||
|
size_t len,
|
||||||
|
const char *lang,
|
||||||
|
char *brks,
|
||||||
|
get_next_char_t get_next_char);
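The `get_next_char_t` interface above takes a buffer, its length in code units, and an in/out index, and returns one UTF-32 code point per call. Below is a minimal sketch of a UTF-8 decoder with that shape. It is not the library's `lb_get_next_char_utf8`: the name `sketch_next_char_utf8`, the local typedefs, and the choice to return `EOS` on truncated or malformed input are assumptions made for illustration, and overlong encodings are not rejected.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char utf8_t;   /* assumption: stands in for the typedef in linebreak.h */
typedef unsigned int  utf32_t;  /* assumption: stands in for the typedef in linebreak.h */

#define EOS 0xFFFF              /* end-of-string marker, same value as above */

/* Hypothetical decoder: returns the code point starting at s[*ip] and
 * advances *ip past it. Returns EOS (leaving *ip unchanged) when the
 * input is exhausted, truncated, or malformed. */
utf32_t sketch_next_char_utf8(const utf8_t *s, size_t len, size_t *ip)
{
    utf32_t ch;
    size_t extra;   /* number of continuation bytes expected */
    size_t i;

    if (*ip >= len)
        return EOS;

    ch = s[*ip];
    if (ch < 0x80) {
        extra = 0;                          /* single-byte (ASCII) */
    } else if ((ch & 0xE0) == 0xC0) {
        extra = 1; ch &= 0x1F;              /* 2-byte sequence */
    } else if ((ch & 0xF0) == 0xE0) {
        extra = 2; ch &= 0x0F;              /* 3-byte sequence */
    } else if ((ch & 0xF8) == 0xF0) {
        extra = 3; ch &= 0x07;              /* 4-byte sequence */
    } else {
        return EOS;                         /* stray continuation or invalid lead byte */
    }

    if (*ip + extra >= len)
        return EOS;                         /* sequence runs past the buffer */

    for (i = 1; i <= extra; ++i) {
        utf8_t c = s[*ip + i];
        if ((c & 0xC0) != 0x80)
            return EOS;                     /* not a continuation byte */
        ch = (ch << 6) | (utf32_t)(c & 0x3F);
    }

    *ip += extra + 1;
    return ch;
}
```

A caller would start with an index of 0 and call the function repeatedly until it returns `EOS`, which is the loop shape that `set_linebreaks` and `set_wordbreaks` rely on.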
|
2
linebreak/linebreak/purge
Executable file
|
@ -0,0 +1,2 @@
|
||||||
|
#! /bin/sh
|
||||||
|
rm -rf Makefile.in aclocal.m4 autom4te.cache/ config.guess config.h.in config.sub configure depcomp doc/ install-sh ltmain.sh missing
|
6
linebreak/linebreak/sort_numeric_hex.py
Executable file
|
@ -0,0 +1,6 @@
|
||||||
|
#!/usr/bin/env python
|
||||||
|
import sys
|
||||||
|
|
||||||
|
lines = open(sys.argv[1]).readlines()
|
||||||
|
lines_out = sorted(lines, key=lambda line: int(line.split("..")[0], 16))
|
||||||
|
sys.stdout.writelines(lines_out)
|
437
linebreak/linebreak/wordbreak.c
Normal file
|
@ -0,0 +1,437 @@
|
||||||
|
/* vim: set tabstop=4 shiftwidth=4: */
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Word breaking in a Unicode sequence. Designed to be used in a
|
||||||
|
* generic text renderer.
|
||||||
|
*
|
||||||
|
* Copyright (C) 2012 Tom Hacohen <tom@stosb.com>
|
||||||
|
*
|
||||||
|
* This software is provided 'as-is', without any express or implied
|
||||||
|
* warranty. In no event will the author be held liable for any damages
|
||||||
|
* arising from the use of this software.
|
||||||
|
*
|
||||||
|
* Permission is granted to anyone to use this software for any purpose,
|
||||||
|
* including commercial applications, and to alter it and redistribute
|
||||||
|
* it freely, subject to the following restrictions:
|
||||||
|
*
|
||||||
|
* 1. The origin of this software must not be misrepresented; you must
|
||||||
|
* not claim that you wrote the original software. If you use this
|
||||||
|
* software in a product, an acknowledgement in the product
|
||||||
|
* documentation would be appreciated but is not required.
|
||||||
|
* 2. Altered source versions must be plainly marked as such, and must
|
||||||
|
* not be misrepresented as being the original software.
|
||||||
|
* 3. This notice may not be removed or altered from any source
|
||||||
|
* distribution.
|
||||||
|
*
|
||||||
|
* The main reference is Unicode Standard Annex 29 (UAX #29):
|
||||||
|
* <URL:http://unicode.org/reports/tr29>
|
||||||
|
*
|
||||||
|
* When this library was designed, this annex was at Revision 17, for
|
||||||
|
* Unicode 6.0.0:
|
||||||
|
* <URL:http://www.unicode.org/reports/tr29/tr29-17.html>
|
||||||
|
*
|
||||||
|
* The Unicode Terms of Use are available at
|
||||||
|
* <URL:http://www.unicode.org/copyright.html>
|
||||||
|
*/
|
||||||
|
|
||||||
|
/**
|
||||||
|
* @file wordbreak.c
|
||||||
|
*
|
||||||
|
* Implementation of the word breaking algorithm as described in Unicode
|
||||||
|
* Standard Annex 29.
|
||||||
|
*
|
||||||
|
* @version 2.2, 2012/02/04
|
||||||
|
* @author Tom Hacohen
|
||||||
|
*/
|
||||||
|
|
||||||
|
#include <assert.h>
|
||||||
|
#include <stddef.h>
|
||||||
|
#include <string.h>
|
||||||
|
#include "linebreak.h"
|
||||||
|
#include "linebreakdef.h"
|
||||||
|
|
||||||
|
#include "wordbreak.h"
|
||||||
|
#include "wordbreakdata.c"
|
||||||
|
|
||||||
|
#define ARRAY_LEN(x) (sizeof(x) / sizeof(x[0]))
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Initializes the wordbreak internals. It currently does nothing, but
|
||||||
|
* it may in the future.
|
||||||
|
*/
|
||||||
|
void init_wordbreak(void)
|
||||||
|
{
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Gets the word breaking class of a character.
|
||||||
|
*
|
||||||
|
* @param ch character to check
|
||||||
|
* @param wbp pointer to the word breaking properties array
|
||||||
|
* @param len size of the wbp array in number of items
|
||||||
|
* @return the word breaking class if found; \c WBP_Any otherwise
|
||||||
|
*/
|
||||||
|
static enum WordBreakClass get_char_wb_class(
|
||||||
|
utf32_t ch,
|
||||||
|
struct WordBreakProperties *wbp,
|
||||||
|
size_t len)
|
||||||
|
{
|
||||||
|
int min = 0;
|
||||||
|
int max = len - 1;
|
||||||
|
int mid;
|
||||||
|
|
||||||
|
do
|
||||||
|
{
|
||||||
|
mid = (min + max) / 2;
|
||||||
|
|
||||||
|
if (ch < wbp[mid].start)
|
||||||
|
max = mid - 1;
|
||||||
|
else if (ch > wbp[mid].end)
|
||||||
|
min = mid + 1;
|
||||||
|
else
|
||||||
|
return wbp[mid].prop;
|
||||||
|
}
|
||||||
|
while (min <= max);
|
||||||
|
|
||||||
|
return WBP_Any;
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Sets the word break types to a specific value in a range.
|
||||||
|
*
|
||||||
|
* It sets the inside chars to #WORDBREAK_INSIDEACHAR and the rest to brkType.
|
||||||
|
* Assumes \a brks is initialized - all the cells with #WORDBREAK_NOBREAK are
|
||||||
|
* cells that we really don't want to break after.
|
||||||
|
*
|
||||||
|
* @param[in] s input string
|
||||||
|
* @param[out] brks breaks array to fill
|
||||||
|
* @param[in] posStart start position
|
||||||
|
* @param[in] posEnd end position (exclusive)
|
||||||
|
* @param[in] len length of the string
|
||||||
|
* @param[in] brkType breaks type to use
|
||||||
|
* @param[in] get_next_char function to get the next UTF-32 character
|
||||||
|
*/
|
||||||
|
static void set_brks_to(
|
||||||
|
const void *s,
|
||||||
|
char *brks,
|
||||||
|
size_t posStart,
|
||||||
|
size_t posEnd,
|
||||||
|
size_t len,
|
||||||
|
char brkType,
|
||||||
|
get_next_char_t get_next_char)
|
||||||
|
{
|
||||||
|
size_t posNext = posStart;
|
||||||
|
while (posNext < posEnd)
|
||||||
|
{
|
||||||
|
utf32_t ch;
|
||||||
|
ch = get_next_char(s, len, &posNext);
|
||||||
|
assert(ch != EOS);
|
||||||
|
for (; posStart < posNext - 1; ++posStart)
|
||||||
|
brks[posStart] = WORDBREAK_INSIDEACHAR;
|
||||||
|
assert(posStart == posNext - 1);
|
||||||
|
|
||||||
|
/* Only set it if the cell was not already marked as no-break. */
|
||||||
|
if (brks[posStart] != WORDBREAK_NOBREAK)
|
||||||
|
brks[posStart] = brkType;
|
||||||
|
posStart = posNext;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Checks to see if the class is newline, CR, or LF (rules WB3a and b). */
|
||||||
|
#define IS_WB3ab(cls) ((cls == WBP_Newline) || (cls == WBP_CR) || \
|
||||||
|
(cls == WBP_LF))
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Sets the word breaking information for a generic input string.
|
||||||
|
*
|
||||||
|
* @param[in] s input string
|
||||||
|
* @param[in] len length of the input
|
||||||
|
* @param[in] lang language of the input
|
||||||
|
* @param[out] brks pointer to the output breaking data, containing
|
||||||
|
* #WORDBREAK_BREAK, #WORDBREAK_NOBREAK, or
|
||||||
|
* #WORDBREAK_INSIDEACHAR
|
||||||
|
* @param[in] get_next_char function to get the next UTF-32 character
|
||||||
|
*/
|
||||||
|
static void set_wordbreaks(
|
||||||
|
const void *s,
|
||||||
|
size_t len,
|
||||||
|
const char *lang,
|
||||||
|
char *brks,
|
||||||
|
get_next_char_t get_next_char)
|
||||||
|
{
|
||||||
|
enum WordBreakClass wbcLast = WBP_Undefined;
|
||||||
|
/* wbcSeqStart is the class that started the current sequence.
|
||||||
|
* WBP_Undefined is a special case that means "start of text" (sot).
|
||||||
|
* This value is the class that is at the start of the current rule
|
||||||
|
* matching sequence. For example, in case of Numeric+MidNum+Numeric
|
||||||
|
* it'll be Numeric all the way.
|
||||||
|
*/
|
||||||
|
enum WordBreakClass wbcSeqStart = WBP_Undefined;
|
||||||
|
utf32_t ch;
|
||||||
|
size_t posNext = 0;
|
||||||
|
size_t posCur = 0;
|
||||||
|
size_t posLast = 0;
|
||||||
|
|
||||||
|
/* TODO: Language-specific specialization. */
|
||||||
|
(void) lang;
|
||||||
|
|
||||||
|
/* Init brks. */
|
||||||
|
memset(brks, WORDBREAK_BREAK, len);
|
||||||
|
|
||||||
|
ch = get_next_char(s, len, &posNext);
|
||||||
|
|
||||||
|
while (ch != EOS)
|
||||||
|
{
|
||||||
|
enum WordBreakClass wbcCur;
|
||||||
|
wbcCur = get_char_wb_class(ch, wb_prop_default,
|
||||||
|
ARRAY_LEN(wb_prop_default));
|
||||||
|
|
||||||
|
switch (wbcCur)
|
||||||
|
{
|
||||||
|
case WBP_CR:
|
||||||
|
/* WB3b */
|
||||||
|
set_brks_to(s, brks, posLast, posCur, len,
|
||||||
|
WORDBREAK_BREAK, get_next_char);
|
||||||
|
wbcSeqStart = wbcCur;
|
||||||
|
posLast = posCur;
|
||||||
|
break;
|
||||||
|
|
||||||
|
case WBP_LF:
|
||||||
|
if (wbcSeqStart == WBP_CR) /* WB3 */
|
||||||
|
{
|
||||||
|
set_brks_to(s, brks, posLast, posCur, len,
|
||||||
|
WORDBREAK_NOBREAK, get_next_char);
|
||||||
|
wbcSeqStart = wbcCur;
|
||||||
|
posLast = posCur;
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
/* Fall through */
|
||||||
|
|
||||||
|
case WBP_Newline:
|
||||||
|
/* WB3a,3b */
|
||||||
|
set_brks_to(s, brks, posLast, posCur, len,
|
||||||
|
WORDBREAK_BREAK, get_next_char);
|
||||||
|
wbcSeqStart = wbcCur;
|
||||||
|
posLast = posCur;
|
||||||
|
break;
|
||||||
|
|
||||||
|
case WBP_Extend:
|
||||||
|
case WBP_Format:
|
||||||
|
/* WB4 - If not the first char/after a newline (WB3a,3b), skip
|
||||||
|
* this class, set it to be the same as the prev, and mark
|
||||||
|
* brks not to break before them. */
|
||||||
|
if ((wbcSeqStart == WBP_Undefined) || IS_WB3ab(wbcSeqStart))
|
||||||
|
{
|
||||||
|
set_brks_to(s, brks, posLast, posCur, len,
|
||||||
|
WORDBREAK_BREAK, get_next_char);
|
||||||
|
wbcSeqStart = wbcCur;
|
||||||
|
}
|
||||||
|
else
|
||||||
|
{
|
||||||
|
/* It's surely not the first */
|
||||||
|
brks[posCur - 1] = WORDBREAK_NOBREAK;
|
||||||
|
/* "inherit" the previous class. */
|
||||||
|
wbcCur = wbcLast;
|
||||||
|
}
|
||||||
|
break;
|
||||||
|
|
||||||
|
case WBP_Katakana:
|
||||||
|
if ((wbcSeqStart == WBP_Katakana) || /* WB13 */
|
||||||
|
(wbcSeqStart == WBP_ExtendNumLet)) /* WB13b */
|
||||||
|
{
|
||||||
|
set_brks_to(s, brks, posLast, posCur, len,
|
||||||
|
WORDBREAK_NOBREAK, get_next_char);
|
||||||
|
}
|
||||||
|
/* No rule found, reset */
|
||||||
|
else
|
||||||
|
{
|
||||||
|
set_brks_to(s, brks, posLast, posCur, len,
|
||||||
|
WORDBREAK_BREAK, get_next_char);
|
||||||
|
}
|
||||||
|
wbcSeqStart = wbcCur;
|
||||||
|
posLast = posCur;
|
||||||
|
break;
|
||||||
|
|
||||||
|
case WBP_ALetter:
|
||||||
|
if ((wbcSeqStart == WBP_ALetter) || /* WB5,6,7 */
|
||||||
|
(wbcLast == WBP_Numeric) || /* WB10 */
|
||||||
|
(wbcSeqStart == WBP_ExtendNumLet)) /* WB13b */
|
||||||
|
{
|
||||||
|
set_brks_to(s, brks, posLast, posCur, len,
|
||||||
|
WORDBREAK_NOBREAK, get_next_char);
|
||||||
|
}
|
||||||
|
/* No rule found, reset */
|
||||||
|
else
|
||||||
|
{
|
||||||
|
set_brks_to(s, brks, posLast, posCur, len,
|
||||||
|
WORDBREAK_BREAK, get_next_char);
|
||||||
|
}
|
||||||
|
wbcSeqStart = wbcCur;
|
||||||
|
posLast = posCur;
|
||||||
|
break;
|
||||||
|
|
||||||
|
case WBP_MidNumLet:
|
||||||
|
if ((wbcLast == WBP_ALetter) || /* WB6,7 */
|
||||||
|
(wbcLast == WBP_Numeric)) /* WB11,12 */
|
||||||
|
{
|
||||||
|
/* Go on */
|
||||||
|
}
|
||||||
|
else
|
||||||
|
{
|
||||||
|
set_brks_to(s, brks, posLast, posCur, len,
|
||||||
|
WORDBREAK_BREAK, get_next_char);
|
||||||
|
wbcSeqStart = wbcCur;
|
||||||
|
posLast = posCur;
|
||||||
|
}
|
||||||
|
break;
|
||||||
|
|
||||||
|
case WBP_MidLetter:
|
||||||
|
if (wbcLast == WBP_ALetter) /* WB6,7 */
|
||||||
|
{
|
||||||
|
/* Go on */
|
||||||
|
}
|
||||||
|
else
|
||||||
|
{
|
||||||
|
set_brks_to(s, brks, posLast, posCur, len,
|
||||||
|
WORDBREAK_BREAK, get_next_char);
|
||||||
|
wbcSeqStart = wbcCur;
|
||||||
|
posLast = posCur;
|
||||||
|
}
|
||||||
|
break;
|
||||||
|
|
||||||
|
case WBP_MidNum:
|
||||||
|
if (wbcLast == WBP_Numeric) /* WB11,12 */
|
||||||
|
{
|
||||||
|
/* Go on */
|
||||||
|
}
|
||||||
|
else
|
||||||
|
{
|
||||||
|
set_brks_to(s, brks, posLast, posCur, len,
|
||||||
|
WORDBREAK_BREAK, get_next_char);
|
||||||
|
wbcSeqStart = wbcCur;
|
||||||
|
posLast = posCur;
|
||||||
|
}
|
||||||
|
break;
|
||||||
|
|
||||||
|
case WBP_Numeric:
|
||||||
|
if ((wbcSeqStart == WBP_Numeric) || /* WB8,11,12 */
|
||||||
|
(wbcLast == WBP_ALetter) || /* WB9 */
|
||||||
|
(wbcSeqStart == WBP_ExtendNumLet)) /* WB13b */
|
||||||
|
{
|
||||||
|
set_brks_to(s, brks, posLast, posCur, len,
|
||||||
|
WORDBREAK_NOBREAK, get_next_char);
|
||||||
|
}
|
||||||
|
/* No rule found, reset */
|
||||||
|
else
|
||||||
|
{
|
||||||
|
set_brks_to(s, brks, posLast, posCur, len,
|
||||||
|
WORDBREAK_BREAK, get_next_char);
|
||||||
|
}
|
||||||
|
wbcSeqStart = wbcCur;
|
||||||
|
posLast = posCur;
|
||||||
|
break;
|
||||||
|
|
||||||
|
case WBP_ExtendNumLet:
|
||||||
|
/* WB13a,13b */
|
||||||
|
if ((wbcSeqStart == wbcLast) &&
|
||||||
|
((wbcLast == WBP_ALetter) ||
|
||||||
|
(wbcLast == WBP_Numeric) ||
|
||||||
|
(wbcLast == WBP_Katakana) ||
|
||||||
|
(wbcLast == WBP_ExtendNumLet)))
|
||||||
|
{
|
||||||
|
set_brks_to(s, brks, posLast, posCur, len,
|
||||||
|
WORDBREAK_NOBREAK, get_next_char);
|
||||||
|
}
|
||||||
|
/* No rule found, reset */
|
||||||
|
else
|
||||||
|
{
|
||||||
|
set_brks_to(s, brks, posLast, posCur, len,
|
||||||
|
WORDBREAK_BREAK, get_next_char);
|
||||||
|
}
|
||||||
|
wbcSeqStart = wbcCur;
|
||||||
|
posLast = posCur;
|
||||||
|
break;
|
||||||
|
|
||||||
|
case WBP_Any:
|
||||||
|
/* Allow breaks and reset */
|
||||||
|
set_brks_to(s, brks, posLast, posCur, len,
|
||||||
|
WORDBREAK_BREAK, get_next_char);
|
||||||
|
wbcSeqStart = wbcCur;
|
||||||
|
posLast = posCur;
|
||||||
|
break;
|
||||||
|
|
||||||
|
default:
|
||||||
|
/* Error, should never get here! */
|
||||||
|
assert(0);
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
|
||||||
|
wbcLast = wbcCur;
|
||||||
|
posCur = posNext;
|
||||||
|
ch = get_next_char(s, len, &posNext);
|
||||||
|
}
|
||||||
|
|
||||||
|
/* WB2 */
|
||||||
|
set_brks_to(s, brks, posLast, posNext, len,
|
||||||
|
WORDBREAK_BREAK, get_next_char);
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Sets the word breaking information for a UTF-8 input string.
|
||||||
|
*
|
||||||
|
* @param[in] s input UTF-8 string
|
||||||
|
* @param[in] len length of the input
|
||||||
|
* @param[in] lang language of the input
|
||||||
|
* @param[out] brks pointer to the output breaking data, containing
|
||||||
|
* #WORDBREAK_BREAK, #WORDBREAK_NOBREAK, or
|
||||||
|
* #WORDBREAK_INSIDEACHAR
|
||||||
|
*/
|
||||||
|
void set_wordbreaks_utf8(
|
||||||
|
const utf8_t *s,
|
||||||
|
size_t len,
|
||||||
|
const char *lang,
|
||||||
|
char *brks)
|
||||||
|
{
|
||||||
|
set_wordbreaks(s, len, lang, brks,
|
||||||
|
(get_next_char_t)lb_get_next_char_utf8);
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Sets the word breaking information for a UTF-16 input string.
|
||||||
|
*
|
||||||
|
* @param[in] s input UTF-16 string
|
||||||
|
* @param[in] len length of the input
|
||||||
|
* @param[in] lang language of the input
|
||||||
|
* @param[out] brks pointer to the output breaking data, containing
|
||||||
|
* #WORDBREAK_BREAK, #WORDBREAK_NOBREAK, or
|
||||||
|
* #WORDBREAK_INSIDEACHAR
|
||||||
|
*/
|
||||||
|
void set_wordbreaks_utf16(
|
||||||
|
const utf16_t *s,
|
||||||
|
size_t len,
|
||||||
|
const char *lang,
|
||||||
|
char *brks)
|
||||||
|
{
|
||||||
|
set_wordbreaks(s, len, lang, brks,
|
||||||
|
(get_next_char_t)lb_get_next_char_utf16);
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Sets the word breaking information for a UTF-32 input string.
|
||||||
|
*
|
||||||
|
* @param[in] s input UTF-32 string
|
||||||
|
* @param[in] len length of the input
|
||||||
|
* @param[in] lang language of the input
|
||||||
|
* @param[out] brks pointer to the output breaking data, containing
|
||||||
|
* #WORDBREAK_BREAK, #WORDBREAK_NOBREAK, or
|
||||||
|
* #WORDBREAK_INSIDEACHAR
|
||||||
|
*/
|
||||||
|
void set_wordbreaks_utf32(
|
||||||
|
const utf32_t *s,
|
||||||
|
size_t len,
|
||||||
|
const char *lang,
|
||||||
|
char *brks)
|
||||||
|
{
|
||||||
|
set_wordbreaks(s, len, lang, brks,
|
||||||
|
(get_next_char_t)lb_get_next_char_utf32);
|
||||||
|
}
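The class lookup in `get_char_wb_class` is a binary search over a table of sorted, non-overlapping code point ranges. The sketch below shows the same technique in isolation. The names `SketchClass`, `SketchRange`, `sketch_table`, and `sketch_lookup` are hypothetical, and the table is a small excerpt of the ASCII entries from `wordbreakdata.c`. Unlike the original, it uses half-open `size_t` bounds, so an empty table needs no special handling.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int utf32_t;   /* assumption: stands in for the typedef in linebreak.h */

/* Hypothetical, reduced set of classes for illustration only. */
enum SketchClass { SK_Any, SK_LF, SK_Newline, SK_CR, SK_Numeric, SK_ALetter };

struct SketchRange {
    utf32_t start;              /* first code point of the range */
    utf32_t end;                /* last code point of the range (inclusive) */
    enum SketchClass prop;      /* class shared by the whole range */
};

/* Must be sorted by start, with no overlapping ranges. */
const struct SketchRange sketch_table[] = {
    {0x000A, 0x000A, SK_LF},
    {0x000B, 0x000C, SK_Newline},
    {0x000D, 0x000D, SK_CR},
    {0x0030, 0x0039, SK_Numeric},
    {0x0041, 0x005A, SK_ALetter},
    {0x0061, 0x007A, SK_ALetter},
};

/* Binary search over the half-open index interval [lo, hi).
 * Returns SK_Any when no range contains ch. */
enum SketchClass sketch_lookup(utf32_t ch, const struct SketchRange *tbl, size_t len)
{
    size_t lo = 0;
    size_t hi = len;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (ch < tbl[mid].start)
            hi = mid;               /* ch lies before this range */
        else if (ch > tbl[mid].end)
            lo = mid + 1;           /* ch lies after this range */
        else
            return tbl[mid].prop;   /* start <= ch <= end */
    }
    return SK_Any;
}
```

Because each entry covers an inclusive range, a miss falls between two neighbouring ranges and maps to the catch-all class, which mirrors how the original returns `WBP_Any`.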
|
72
linebreak/linebreak/wordbreak.h
Normal file
|
@ -0,0 +1,72 @@
|
||||||
|
/* vim: set tabstop=4 shiftwidth=4: */
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Word breaking in a Unicode sequence. Designed to be used in a
|
||||||
|
* generic text renderer.
|
||||||
|
*
|
||||||
|
* Copyright (C) 2012 Tom Hacohen <tom@stosb.com>
|
||||||
|
*
|
||||||
|
* This software is provided 'as-is', without any express or implied
|
||||||
|
* warranty. In no event will the author be held liable for any damages
|
||||||
|
* arising from the use of this software.
|
||||||
|
*
|
||||||
|
* Permission is granted to anyone to use this software for any purpose,
|
||||||
|
* including commercial applications, and to alter it and redistribute
|
||||||
|
* it freely, subject to the following restrictions:
|
||||||
|
*
|
||||||
|
* 1. The origin of this software must not be misrepresented; you must
|
||||||
|
* not claim that you wrote the original software. If you use this
|
||||||
|
* software in a product, an acknowledgement in the product
|
||||||
|
* documentation would be appreciated but is not required.
|
||||||
|
* 2. Altered source versions must be plainly marked as such, and must
|
||||||
|
* not be misrepresented as being the original software.
|
||||||
|
* 3. This notice may not be removed or altered from any source
|
||||||
|
* distribution.
|
||||||
|
*
|
||||||
|
* The main reference is Unicode Standard Annex 29 (UAX #29):
|
||||||
|
* <URL:http://unicode.org/reports/tr29>
|
||||||
|
*
|
||||||
|
* When this library was designed, this annex was at Revision 17, for
|
||||||
|
* Unicode 6.0.0:
|
||||||
|
* <URL:http://www.unicode.org/reports/tr29/tr29-17.html>
|
||||||
|
*
|
||||||
|
* The Unicode Terms of Use are available at
|
||||||
|
* <URL:http://www.unicode.org/copyright.html>
|
||||||
|
*/
|
||||||
|
|
||||||
|
/**
|
||||||
|
* @file wordbreak.h
|
||||||
|
*
|
||||||
|
* Header file for the word breaking (segmentation) algorithm.
|
||||||
|
*
|
||||||
|
* @version 2.2, 2012/02/04
|
||||||
|
* @author Tom Hacohen
|
||||||
|
*/
|
||||||
|
|
||||||
|
#ifndef WORDBREAK_H
|
||||||
|
#define WORDBREAK_H
|
||||||
|
|
||||||
|
#include <stddef.h>
|
||||||
|
#include "linebreak.h"
|
||||||
|
|
||||||
|
#ifdef __cplusplus
|
||||||
|
extern "C" {
|
||||||
|
#endif
|
||||||
|
|
||||||
|
#define WORDBREAK_BREAK 0 /**< Break is allowed */
|
||||||
|
#define WORDBREAK_NOBREAK 1 /**< No break is allowed */
|
||||||
|
#define WORDBREAK_INSIDEACHAR 2 /**< A UTF-8/16 sequence is unfinished */
|
||||||
|
|
||||||
|
void init_wordbreak(void);
|
||||||
|
void set_wordbreaks_utf8(
|
||||||
|
const utf8_t *s, size_t len, const char* lang, char *brks);
|
||||||
|
void set_wordbreaks_utf16(
|
||||||
|
const utf16_t *s, size_t len, const char* lang, char *brks);
|
||||||
|
void set_wordbreaks_utf32(
|
||||||
|
const utf32_t *s, size_t len, const char* lang, char *brks);
|
||||||
|
|
||||||
|
#ifdef __cplusplus
|
||||||
|
}
|
||||||
|
#endif
|
||||||
|
|
||||||
|
#endif
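The functions declared in this header fill a `brks` array with one entry per code unit, where `WORDBREAK_BREAK` marks a position after which a break is allowed. The sketch below shows one way a caller might turn such an array into segment end offsets. The helper `sketch_segment_ends` is hypothetical and not part of the library, and the constants are redefined locally only so the fragment stands alone.

```c
#include <assert.h>
#include <stddef.h>

/* Same values as in wordbreak.h, repeated so the sketch is self-contained. */
#define WORDBREAK_BREAK        0
#define WORDBREAK_NOBREAK      1
#define WORDBREAK_INSIDEACHAR  2

/* Hypothetical helper: scans brks[0..len) and stores the exclusive end
 * offset of each segment in ends[], writing at most max_ends entries.
 * Returns the total number of segments found, which may exceed max_ends. */
size_t sketch_segment_ends(const char *brks, size_t len,
                           size_t *ends, size_t max_ends)
{
    size_t i;
    size_t n = 0;

    for (i = 0; i < len; ++i) {
        /* NOBREAK and INSIDEACHAR both mean "keep going". */
        if (brks[i] == WORDBREAK_BREAK) {
            if (n < max_ends)
                ends[n] = i + 1;    /* segment ends just after unit i */
            ++n;
        }
    }
    return n;
}
```

For the five-byte input "ab cd", a hand-written array `{NOBREAK, BREAK, BREAK, NOBREAK, BREAK}` yields segment ends 2, 3, and 5, that is "ab", the space, and "cd". Whitespace comes back as its own segment, so a caller that only wants words would need to filter those out.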
|
860
linebreak/linebreak/wordbreakdata.c
Normal file
|
@ -0,0 +1,860 @@
|
||||||
|
/* The content of this file is generated from:
|
||||||
|
# WordBreakProperty-6.0.0.txt
|
||||||
|
# Date: 2010-08-19, 00:48:48 GMT [MD]
|
||||||
|
*/
|
||||||
|
|
||||||
|
#include "linebreak.h"
|
||||||
|
#include "wordbreakdef.h"
|
||||||
|
|
||||||
|
static struct WordBreakProperties wb_prop_default[] = {
|
||||||
|
{0x000A, 0x000A, WBP_LF},
|
||||||
|
{0x000B, 0x000C, WBP_Newline},
|
||||||
|
{0x000D, 0x000D, WBP_CR},
|
||||||
|
{0x0027, 0x0027, WBP_MidNumLet},
|
||||||
|
{0x002C, 0x002C, WBP_MidNum},
|
||||||
|
{0x002E, 0x002E, WBP_MidNumLet},
|
||||||
|
{0x0030, 0x0039, WBP_Numeric},
|
||||||
|
{0x003A, 0x003A, WBP_MidLetter},
|
||||||
|
{0x003B, 0x003B, WBP_MidNum},
|
||||||
|
{0x0041, 0x005A, WBP_ALetter},
|
||||||
|
{0x005F, 0x005F, WBP_ExtendNumLet},
|
||||||
|
{0x0061, 0x007A, WBP_ALetter},
|
||||||
|
{0x0085, 0x0085, WBP_Newline},
|
||||||
|
{0x00AA, 0x00AA, WBP_ALetter},
|
||||||
|
{0x00AD, 0x00AD, WBP_Format},
|
||||||
|
{0x00B5, 0x00B5, WBP_ALetter},
|
||||||
|
{0x00B7, 0x00B7, WBP_MidLetter},
|
||||||
|
{0x00BA, 0x00BA, WBP_ALetter},
|
||||||
|
{0x00C0, 0x00D6, WBP_ALetter},
|
||||||
|
{0x00D8, 0x00F6, WBP_ALetter},
|
||||||
|
{0x00F8, 0x01BA, WBP_ALetter},
|
||||||
|
{0x01BB, 0x01BB, WBP_ALetter},
|
||||||
|
{0x01BC, 0x01BF, WBP_ALetter},
|
||||||
|
{0x01C0, 0x01C3, WBP_ALetter},
|
||||||
|
{0x01C4, 0x0293, WBP_ALetter},
|
||||||
|
{0x0294, 0x0294, WBP_ALetter},
|
||||||
|
{0x0295, 0x02AF, WBP_ALetter},
|
||||||
|
{0x02B0, 0x02C1, WBP_ALetter},
|
||||||
|
{0x02C6, 0x02D1, WBP_ALetter},
|
||||||
|
{0x02E0, 0x02E4, WBP_ALetter},
|
||||||
|
{0x02EC, 0x02EC, WBP_ALetter},
|
||||||
|
{0x02EE, 0x02EE, WBP_ALetter},
|
||||||
|
{0x0300, 0x036F, WBP_Extend},
|
||||||
|
{0x0370, 0x0373, WBP_ALetter},
|
||||||
|
{0x0374, 0x0374, WBP_ALetter},
|
||||||
|
{0x0376, 0x0377, WBP_ALetter},
|
||||||
|
{0x037A, 0x037A, WBP_ALetter},
|
||||||
|
{0x037B, 0x037D, WBP_ALetter},
|
||||||
|
{0x037E, 0x037E, WBP_MidNum},
|
||||||
|
{0x0386, 0x0386, WBP_ALetter},
|
||||||
|
{0x0387, 0x0387, WBP_MidLetter},
|
||||||
|
{0x0388, 0x038A, WBP_ALetter},
|
||||||
|
{0x038C, 0x038C, WBP_ALetter},
|
||||||
|
{0x038E, 0x03A1, WBP_ALetter},
|
||||||
|
{0x03A3, 0x03F5, WBP_ALetter},
|
||||||
|
{0x03F7, 0x0481, WBP_ALetter},
|
||||||
|
{0x0483, 0x0487, WBP_Extend},
|
||||||
|
{0x0488, 0x0489, WBP_Extend},
|
||||||
|
{0x048A, 0x0527, WBP_ALetter},
|
||||||
|
{0x0531, 0x0556, WBP_ALetter},
|
||||||
|
{0x0559, 0x0559, WBP_ALetter},
|
||||||
|
{0x0561, 0x0587, WBP_ALetter},
|
||||||
|
{0x0589, 0x0589, WBP_MidNum},
|
||||||
|
{0x0591, 0x05BD, WBP_Extend},
|
||||||
|
{0x05BF, 0x05BF, WBP_Extend},
|
||||||
|
{0x05C1, 0x05C2, WBP_Extend},
|
||||||
|
{0x05C4, 0x05C5, WBP_Extend},
|
||||||
|
{0x05C7, 0x05C7, WBP_Extend},
|
||||||
|
{0x05D0, 0x05EA, WBP_ALetter},
|
||||||
|
{0x05F0, 0x05F2, WBP_ALetter},
|
||||||
|
{0x05F3, 0x05F3, WBP_ALetter},
|
||||||
|
{0x05F4, 0x05F4, WBP_MidLetter},
|
||||||
|
{0x0600, 0x0603, WBP_Format},
|
||||||
|
{0x060C, 0x060D, WBP_MidNum},
|
||||||
|
{0x0610, 0x061A, WBP_Extend},
|
||||||
|
{0x0620, 0x063F, WBP_ALetter},
|
||||||
|
{0x0640, 0x0640, WBP_ALetter},
|
||||||
|
{0x0641, 0x064A, WBP_ALetter},
|
||||||
|
{0x064B, 0x065F, WBP_Extend},
|
||||||
|
{0x0660, 0x0669, WBP_Numeric},
|
||||||
|
{0x066B, 0x066B, WBP_Numeric},
|
||||||
|
{0x066C, 0x066C, WBP_MidNum},
|
||||||
|
{0x066E, 0x066F, WBP_ALetter},
|
||||||
|
{0x0670, 0x0670, WBP_Extend},
|
||||||
|
{0x0671, 0x06D3, WBP_ALetter},
|
||||||
|
{0x06D5, 0x06D5, WBP_ALetter},
|
||||||
|
{0x06D6, 0x06DC, WBP_Extend},
|
||||||
|
{0x06DD, 0x06DD, WBP_Format},
|
||||||
|
{0x06DF, 0x06E4, WBP_Extend},
|
||||||
|
{0x06E5, 0x06E6, WBP_ALetter},
|
||||||
|
{0x06E7, 0x06E8, WBP_Extend},
|
||||||
|
{0x06EA, 0x06ED, WBP_Extend},
|
||||||
|
{0x06EE, 0x06EF, WBP_ALetter},
|
||||||
|
{0x06F0, 0x06F9, WBP_Numeric},
|
||||||
|
{0x06FA, 0x06FC, WBP_ALetter},
|
||||||
|
{0x06FF, 0x06FF, WBP_ALetter},
|
||||||
|
{0x070F, 0x070F, WBP_Format},
|
||||||
|
{0x0710, 0x0710, WBP_ALetter},
|
||||||
|
{0x0711, 0x0711, WBP_Extend},
|
||||||
|
{0x0712, 0x072F, WBP_ALetter},
|
||||||
|
{0x0730, 0x074A, WBP_Extend},
|
||||||
|
{0x074D, 0x07A5, WBP_ALetter},
|
||||||
|
{0x07A6, 0x07B0, WBP_Extend},
|
||||||
|
{0x07B1, 0x07B1, WBP_ALetter},
|
||||||
|
{0x07C0, 0x07C9, WBP_Numeric},
|
||||||
|
{0x07CA, 0x07EA, WBP_ALetter},
|
||||||
|
{0x07EB, 0x07F3, WBP_Extend},
|
||||||
|
{0x07F4, 0x07F5, WBP_ALetter},
|
||||||
|
{0x07F8, 0x07F8, WBP_MidNum},
|
||||||
|
{0x07FA, 0x07FA, WBP_ALetter},
|
||||||
|
{0x0800, 0x0815, WBP_ALetter},
|
||||||
|
{0x0816, 0x0819, WBP_Extend},
|
||||||
|
{0x081A, 0x081A, WBP_ALetter},
|
||||||
|
{0x081B, 0x0823, WBP_Extend},
|
||||||
|
{0x0824, 0x0824, WBP_ALetter},
|
||||||
|
{0x0825, 0x0827, WBP_Extend},
|
||||||
|
{0x0828, 0x0828, WBP_ALetter},
|
||||||
|
{0x0829, 0x082D, WBP_Extend},
|
||||||
|
{0x0840, 0x0858, WBP_ALetter},
|
||||||
|
{0x0859, 0x085B, WBP_Extend},
|
||||||
|
{0x0900, 0x0902, WBP_Extend},
|
||||||
|
{0x0903, 0x0903, WBP_Extend},
|
||||||
|
{0x0904, 0x0939, WBP_ALetter},
|
||||||
|
{0x093A, 0x093A, WBP_Extend},
|
||||||
|
{0x093B, 0x093B, WBP_Extend},
|
||||||
|
{0x093C, 0x093C, WBP_Extend},
|
||||||
|
{0x093D, 0x093D, WBP_ALetter},
|
||||||
|
{0x093E, 0x0940, WBP_Extend},
|
||||||
|
{0x0941, 0x0948, WBP_Extend},
|
||||||
|
{0x0949, 0x094C, WBP_Extend},
|
||||||
|
{0x094D, 0x094D, WBP_Extend},
|
||||||
|
{0x094E, 0x094F, WBP_Extend},
|
||||||
|
{0x0950, 0x0950, WBP_ALetter},
|
||||||
|
{0x0951, 0x0957, WBP_Extend},
|
||||||
|
{0x0958, 0x0961, WBP_ALetter},
|
||||||
|
{0x0962, 0x0963, WBP_Extend},
|
||||||
|
{0x0966, 0x096F, WBP_Numeric},
|
||||||
|
{0x0971, 0x0971, WBP_ALetter},
|
||||||
|
{0x0972, 0x0977, WBP_ALetter},
|
||||||
|
{0x0979, 0x097F, WBP_ALetter},
|
||||||
|
{0x0981, 0x0981, WBP_Extend},
|
||||||
|
{0x0982, 0x0983, WBP_Extend},
|
||||||
|
{0x0985, 0x098C, WBP_ALetter},
|
||||||
|
{0x098F, 0x0990, WBP_ALetter},
|
||||||
|
{0x0993, 0x09A8, WBP_ALetter},
|
||||||
|
{0x09AA, 0x09B0, WBP_ALetter},
|
||||||
|
{0x09B2, 0x09B2, WBP_ALetter},
|
||||||
|
{0x09B6, 0x09B9, WBP_ALetter},
|
||||||
|
{0x09BC, 0x09BC, WBP_Extend},
|
||||||
|
{0x09BD, 0x09BD, WBP_ALetter},
|
||||||
|
{0x09BE, 0x09C0, WBP_Extend},
|
||||||
|
{0x09C1, 0x09C4, WBP_Extend},
|
||||||
|
{0x09C7, 0x09C8, WBP_Extend},
|
||||||
|
{0x09CB, 0x09CC, WBP_Extend},
|
||||||
|
{0x09CD, 0x09CD, WBP_Extend},
|
||||||
|
{0x09CE, 0x09CE, WBP_ALetter},
|
||||||
|
{0x09D7, 0x09D7, WBP_Extend},
|
||||||
|
{0x09DC, 0x09DD, WBP_ALetter},
|
||||||
|
{0x09DF, 0x09E1, WBP_ALetter},
|
||||||
|
{0x09E2, 0x09E3, WBP_Extend},
|
||||||
|
{0x09E6, 0x09EF, WBP_Numeric},
|
||||||
|
{0x09F0, 0x09F1, WBP_ALetter},
|
||||||
|
{0x0A01, 0x0A02, WBP_Extend},
|
||||||
|
{0x0A03, 0x0A03, WBP_Extend},
|
||||||
|
{0x0A05, 0x0A0A, WBP_ALetter},
|
||||||
|
{0x0A0F, 0x0A10, WBP_ALetter},
|
||||||
|
{0x0A13, 0x0A28, WBP_ALetter},
|
||||||
|
{0x0A2A, 0x0A30, WBP_ALetter},
|
||||||
|
{0x0A32, 0x0A33, WBP_ALetter},
|
||||||
|
{0x0A35, 0x0A36, WBP_ALetter},
|
||||||
|
{0x0A38, 0x0A39, WBP_ALetter},
|
||||||
|
{0x0A3C, 0x0A3C, WBP_Extend},
|
||||||
|
{0x0A3E, 0x0A40, WBP_Extend},
|
||||||
|
{0x0A41, 0x0A42, WBP_Extend},
|
||||||
|
{0x0A47, 0x0A48, WBP_Extend},
|
||||||
|
{0x0A4B, 0x0A4D, WBP_Extend},
|
||||||
|
{0x0A51, 0x0A51, WBP_Extend},
|
||||||
|
{0x0A59, 0x0A5C, WBP_ALetter},
|
||||||
|
{0x0A5E, 0x0A5E, WBP_ALetter},
|
||||||
|
{0x0A66, 0x0A6F, WBP_Numeric},
|
||||||
|
{0x0A70, 0x0A71, WBP_Extend},
|
||||||
|
{0x0A72, 0x0A74, WBP_ALetter},
|
||||||
|
{0x0A75, 0x0A75, WBP_Extend},
|
||||||
|
{0x0A81, 0x0A82, WBP_Extend},
|
||||||
|
{0x0A83, 0x0A83, WBP_Extend},
|
||||||
|
{0x0A85, 0x0A8D, WBP_ALetter},
|
||||||
|
{0x0A8F, 0x0A91, WBP_ALetter},
|
||||||
|
{0x0A93, 0x0AA8, WBP_ALetter},
|
||||||
|
{0x0AAA, 0x0AB0, WBP_ALetter},
|
||||||
|
{0x0AB2, 0x0AB3, WBP_ALetter},
|
||||||
|
{0x0AB5, 0x0AB9, WBP_ALetter},
|
||||||
|
{0x0ABC, 0x0ABC, WBP_Extend},
|
||||||
|
{0x0ABD, 0x0ABD, WBP_ALetter},
|
||||||
|
{0x0ABE, 0x0AC0, WBP_Extend},
|
||||||
|
{0x0AC1, 0x0AC5, WBP_Extend},
|
||||||
|
{0x0AC7, 0x0AC8, WBP_Extend},
|
||||||
|
{0x0AC9, 0x0AC9, WBP_Extend},
|
||||||
|
{0x0ACB, 0x0ACC, WBP_Extend},
|
||||||
|
{0x0ACD, 0x0ACD, WBP_Extend},
|
||||||
|
{0x0AD0, 0x0AD0, WBP_ALetter},
|
||||||
|
{0x0AE0, 0x0AE1, WBP_ALetter},
|
||||||
|
{0x0AE2, 0x0AE3, WBP_Extend},
|
||||||
|
{0x0AE6, 0x0AEF, WBP_Numeric},
|
||||||
|
{0x0B01, 0x0B01, WBP_Extend},
|
||||||
|
{0x0B02, 0x0B03, WBP_Extend},
|
||||||
|
{0x0B05, 0x0B0C, WBP_ALetter},
|
||||||
|
{0x0B0F, 0x0B10, WBP_ALetter},
|
||||||
|
{0x0B13, 0x0B28, WBP_ALetter},
|
||||||
|
{0x0B2A, 0x0B30, WBP_ALetter},
|
||||||
|
{0x0B32, 0x0B33, WBP_ALetter},
|
||||||
|
{0x0B35, 0x0B39, WBP_ALetter},
|
||||||
|
{0x0B3C, 0x0B3C, WBP_Extend},
|
||||||
|
{0x0B3D, 0x0B3D, WBP_ALetter},
|
||||||
|
{0x0B3E, 0x0B3E, WBP_Extend},
|
||||||
|
{0x0B3F, 0x0B3F, WBP_Extend},
|
||||||
|
{0x0B40, 0x0B40, WBP_Extend},
|
||||||
|
{0x0B41, 0x0B44, WBP_Extend},
|
||||||
|
{0x0B47, 0x0B48, WBP_Extend},
|
||||||
|
{0x0B4B, 0x0B4C, WBP_Extend},
|
||||||
|
{0x0B4D, 0x0B4D, WBP_Extend},
|
||||||
|
{0x0B56, 0x0B56, WBP_Extend},
|
||||||
|
{0x0B57, 0x0B57, WBP_Extend},
|
||||||
|
{0x0B5C, 0x0B5D, WBP_ALetter},
|
||||||
|
{0x0B5F, 0x0B61, WBP_ALetter},
|
||||||
|
{0x0B62, 0x0B63, WBP_Extend},
|
||||||
|
{0x0B66, 0x0B6F, WBP_Numeric},
|
||||||
|
{0x0B71, 0x0B71, WBP_ALetter},
|
||||||
|
{0x0B82, 0x0B82, WBP_Extend},
|
||||||
|
{0x0B83, 0x0B83, WBP_ALetter},
|
||||||
|
{0x0B85, 0x0B8A, WBP_ALetter},
|
||||||
|
{0x0B8E, 0x0B90, WBP_ALetter},
|
||||||
|
{0x0B92, 0x0B95, WBP_ALetter},
|
||||||
|
{0x0B99, 0x0B9A, WBP_ALetter},
|
||||||
|
{0x0B9C, 0x0B9C, WBP_ALetter},
|
||||||
|
{0x0B9E, 0x0B9F, WBP_ALetter},
|
||||||
|
{0x0BA3, 0x0BA4, WBP_ALetter},
|
||||||
|
{0x0BA8, 0x0BAA, WBP_ALetter},
|
||||||
|
{0x0BAE, 0x0BB9, WBP_ALetter},
|
||||||
|
{0x0BBE, 0x0BBF, WBP_Extend},
|
||||||
|
{0x0BC0, 0x0BC0, WBP_Extend},
|
||||||
|
{0x0BC1, 0x0BC2, WBP_Extend},
|
||||||
|
{0x0BC6, 0x0BC8, WBP_Extend},
|
||||||
|
{0x0BCA, 0x0BCC, WBP_Extend},
|
||||||
|
{0x0BCD, 0x0BCD, WBP_Extend},
|
||||||
|
{0x0BD0, 0x0BD0, WBP_ALetter},
|
||||||
|
{0x0BD7, 0x0BD7, WBP_Extend},
|
||||||
|
{0x0BE6, 0x0BEF, WBP_Numeric},
|
||||||
|
{0x0C01, 0x0C03, WBP_Extend},
|
||||||
|
{0x0C05, 0x0C0C, WBP_ALetter},
|
||||||
|
{0x0C0E, 0x0C10, WBP_ALetter},
|
||||||
|
{0x0C12, 0x0C28, WBP_ALetter},
|
||||||
|
{0x0C2A, 0x0C33, WBP_ALetter},
|
||||||
|
{0x0C35, 0x0C39, WBP_ALetter},
|
||||||
|
{0x0C3D, 0x0C3D, WBP_ALetter},
|
||||||
|
{0x0C3E, 0x0C40, WBP_Extend},
|
||||||
|
{0x0C41, 0x0C44, WBP_Extend},
|
||||||
|
{0x0C46, 0x0C48, WBP_Extend},
|
||||||
|
{0x0C4A, 0x0C4D, WBP_Extend},
|
||||||
|
{0x0C55, 0x0C56, WBP_Extend},
|
||||||
|
{0x0C58, 0x0C59, WBP_ALetter},
|
||||||
|
{0x0C60, 0x0C61, WBP_ALetter},
|
||||||
|
{0x0C62, 0x0C63, WBP_Extend},
|
||||||
|
{0x0C66, 0x0C6F, WBP_Numeric},
|
||||||
|
{0x0C82, 0x0C83, WBP_Extend},
|
||||||
|
{0x0C85, 0x0C8C, WBP_ALetter},
|
||||||
|
{0x0C8E, 0x0C90, WBP_ALetter},
|
||||||
|
{0x0C92, 0x0CA8, WBP_ALetter},
|
||||||
|
{0x0CAA, 0x0CB3, WBP_ALetter},
|
||||||
|
{0x0CB5, 0x0CB9, WBP_ALetter},
|
||||||
|
{0x0CBC, 0x0CBC, WBP_Extend},
|
||||||
|
{0x0CBD, 0x0CBD, WBP_ALetter},
|
||||||
|
{0x0CBE, 0x0CBE, WBP_Extend},
|
||||||
|
{0x0CBF, 0x0CBF, WBP_Extend},
|
||||||
|
{0x0CC0, 0x0CC4, WBP_Extend},
|
||||||
|
{0x0CC6, 0x0CC6, WBP_Extend},
|
||||||
|
{0x0CC7, 0x0CC8, WBP_Extend},
|
||||||
|
{0x0CCA, 0x0CCB, WBP_Extend},
|
||||||
|
{0x0CCC, 0x0CCD, WBP_Extend},
|
||||||
|
{0x0CD5, 0x0CD6, WBP_Extend},
|
||||||
|
{0x0CDE, 0x0CDE, WBP_ALetter},
|
||||||
|
{0x0CE0, 0x0CE1, WBP_ALetter},
|
||||||
|
{0x0CE2, 0x0CE3, WBP_Extend},
|
||||||
|
{0x0CE6, 0x0CEF, WBP_Numeric},
|
||||||
|
{0x0CF1, 0x0CF2, WBP_ALetter},
|
||||||
|
{0x0D02, 0x0D03, WBP_Extend},
|
||||||
|
{0x0D05, 0x0D0C, WBP_ALetter},
|
||||||
|
{0x0D0E, 0x0D10, WBP_ALetter},
|
||||||
|
{0x0D12, 0x0D3A, WBP_ALetter},
|
||||||
|
{0x0D3D, 0x0D3D, WBP_ALetter},
|
||||||
|
{0x0D3E, 0x0D40, WBP_Extend},
|
||||||
|
{0x0D41, 0x0D44, WBP_Extend},
|
||||||
|
{0x0D46, 0x0D48, WBP_Extend},
|
||||||
|
{0x0D4A, 0x0D4C, WBP_Extend},
|
||||||
|
{0x0D4D, 0x0D4D, WBP_Extend},
|
||||||
|
{0x0D4E, 0x0D4E, WBP_ALetter},
|
||||||
|
{0x0D57, 0x0D57, WBP_Extend},
|
||||||
|
{0x0D60, 0x0D61, WBP_ALetter},
|
||||||
|
{0x0D62, 0x0D63, WBP_Extend},
|
||||||
|
{0x0D66, 0x0D6F, WBP_Numeric},
|
||||||
|
{0x0D7A, 0x0D7F, WBP_ALetter},
|
||||||
|
{0x0D82, 0x0D83, WBP_Extend},
|
||||||
|
{0x0D85, 0x0D96, WBP_ALetter},
|
||||||
|
{0x0D9A, 0x0DB1, WBP_ALetter},
|
||||||
|
{0x0DB3, 0x0DBB, WBP_ALetter},
|
||||||
|
{0x0DBD, 0x0DBD, WBP_ALetter},
|
||||||
|
{0x0DC0, 0x0DC6, WBP_ALetter},
|
||||||
|
{0x0DCA, 0x0DCA, WBP_Extend},
|
||||||
|
{0x0DCF, 0x0DD1, WBP_Extend},
|
||||||
|
{0x0DD2, 0x0DD4, WBP_Extend},
|
||||||
|
{0x0DD6, 0x0DD6, WBP_Extend},
|
||||||
|
{0x0DD8, 0x0DDF, WBP_Extend},
|
||||||
|
{0x0DF2, 0x0DF3, WBP_Extend},
|
||||||
|
{0x0E31, 0x0E31, WBP_Extend},
|
||||||
|
{0x0E34, 0x0E3A, WBP_Extend},
|
||||||
|
{0x0E47, 0x0E4E, WBP_Extend},
|
||||||
|
{0x0E50, 0x0E59, WBP_Numeric},
|
||||||
|
{0x0EB1, 0x0EB1, WBP_Extend},
|
||||||
|
{0x0EB4, 0x0EB9, WBP_Extend},
|
||||||
|
{0x0EBB, 0x0EBC, WBP_Extend},
|
||||||
|
{0x0EC8, 0x0ECD, WBP_Extend},
|
||||||
|
{0x0ED0, 0x0ED9, WBP_Numeric},
|
||||||
|
{0x0F00, 0x0F00, WBP_ALetter},
|
||||||
|
{0x0F18, 0x0F19, WBP_Extend},
|
||||||
|
{0x0F20, 0x0F29, WBP_Numeric},
|
||||||
|
{0x0F35, 0x0F35, WBP_Extend},
|
||||||
|
{0x0F37, 0x0F37, WBP_Extend},
|
||||||
|
{0x0F39, 0x0F39, WBP_Extend},
|
||||||
|
{0x0F3E, 0x0F3F, WBP_Extend},
|
||||||
|
{0x0F40, 0x0F47, WBP_ALetter},
|
||||||
|
{0x0F49, 0x0F6C, WBP_ALetter},
|
||||||
|
{0x0F71, 0x0F7E, WBP_Extend},
|
||||||
|
{0x0F7F, 0x0F7F, WBP_Extend},
|
||||||
|
{0x0F80, 0x0F84, WBP_Extend},
|
||||||
|
{0x0F86, 0x0F87, WBP_Extend},
|
||||||
|
{0x0F88, 0x0F8C, WBP_ALetter},
|
||||||
|
{0x0F8D, 0x0F97, WBP_Extend},
|
||||||
|
{0x0F99, 0x0FBC, WBP_Extend},
|
||||||
|
{0x0FC6, 0x0FC6, WBP_Extend},
|
||||||
|
{0x102B, 0x102C, WBP_Extend},
|
||||||
|
{0x102D, 0x1030, WBP_Extend},
|
||||||
|
{0x1031, 0x1031, WBP_Extend},
|
||||||
|
{0x1032, 0x1037, WBP_Extend},
|
||||||
|
{0x1038, 0x1038, WBP_Extend},
|
||||||
|
{0x1039, 0x103A, WBP_Extend},
|
||||||
|
{0x103B, 0x103C, WBP_Extend},
|
||||||
|
{0x103D, 0x103E, WBP_Extend},
|
||||||
|
{0x1040, 0x1049, WBP_Numeric},
|
||||||
|
{0x1056, 0x1057, WBP_Extend},
|
||||||
|
{0x1058, 0x1059, WBP_Extend},
|
||||||
|
{0x105E, 0x1060, WBP_Extend},
|
||||||
|
{0x1062, 0x1064, WBP_Extend},
|
||||||
|
{0x1067, 0x106D, WBP_Extend},
|
||||||
|
{0x1071, 0x1074, WBP_Extend},
|
||||||
|
{0x1082, 0x1082, WBP_Extend},
|
||||||
|
{0x1083, 0x1084, WBP_Extend},
|
||||||
|
{0x1085, 0x1086, WBP_Extend},
|
||||||
|
{0x1087, 0x108C, WBP_Extend},
|
||||||
|
{0x108D, 0x108D, WBP_Extend},
|
||||||
|
{0x108F, 0x108F, WBP_Extend},
|
||||||
|
{0x1090, 0x1099, WBP_Numeric},
|
||||||
|
{0x109A, 0x109C, WBP_Extend},
|
||||||
|
{0x109D, 0x109D, WBP_Extend},
|
||||||
|
{0x10A0, 0x10C5, WBP_ALetter},
|
||||||
|
{0x10D0, 0x10FA, WBP_ALetter},
|
||||||
|
{0x10FC, 0x10FC, WBP_ALetter},
|
||||||
|
{0x1100, 0x1248, WBP_ALetter},
|
||||||
|
{0x124A, 0x124D, WBP_ALetter},
|
||||||
|
{0x1250, 0x1256, WBP_ALetter},
|
||||||
|
{0x1258, 0x1258, WBP_ALetter},
|
||||||
|
{0x125A, 0x125D, WBP_ALetter},
|
||||||
|
{0x1260, 0x1288, WBP_ALetter},
|
||||||
|
{0x128A, 0x128D, WBP_ALetter},
|
||||||
|
{0x1290, 0x12B0, WBP_ALetter},
|
||||||
|
{0x12B2, 0x12B5, WBP_ALetter},
|
||||||
|
{0x12B8, 0x12BE, WBP_ALetter},
|
||||||
|
{0x12C0, 0x12C0, WBP_ALetter},
|
||||||
|
{0x12C2, 0x12C5, WBP_ALetter},
|
||||||
|
{0x12C8, 0x12D6, WBP_ALetter},
|
||||||
|
{0x12D8, 0x1310, WBP_ALetter},
|
||||||
|
{0x1312, 0x1315, WBP_ALetter},
|
||||||
|
{0x1318, 0x135A, WBP_ALetter},
|
||||||
|
{0x135D, 0x135F, WBP_Extend},
|
||||||
|
{0x1380, 0x138F, WBP_ALetter},
|
||||||
|
{0x13A0, 0x13F4, WBP_ALetter},
|
||||||
|
{0x1401, 0x166C, WBP_ALetter},
|
||||||
|
{0x166F, 0x167F, WBP_ALetter},
|
||||||
|
{0x1681, 0x169A, WBP_ALetter},
|
||||||
|
{0x16A0, 0x16EA, WBP_ALetter},
|
||||||
|
{0x16EE, 0x16F0, WBP_ALetter},
|
||||||
|
{0x1700, 0x170C, WBP_ALetter},
|
||||||
|
{0x170E, 0x1711, WBP_ALetter},
|
||||||
|
{0x1712, 0x1714, WBP_Extend},
|
||||||
|
{0x1720, 0x1731, WBP_ALetter},
|
||||||
|
{0x1732, 0x1734, WBP_Extend},
|
||||||
|
{0x1740, 0x1751, WBP_ALetter},
|
||||||
|
{0x1752, 0x1753, WBP_Extend},
|
||||||
|
{0x1760, 0x176C, WBP_ALetter},
|
||||||
|
{0x176E, 0x1770, WBP_ALetter},
|
||||||
|
{0x1772, 0x1773, WBP_Extend},
|
||||||
|
{0x17B4, 0x17B5, WBP_Format},
|
||||||
|
{0x17B6, 0x17B6, WBP_Extend},
|
||||||
|
{0x17B7, 0x17BD, WBP_Extend},
|
||||||
|
{0x17BE, 0x17C5, WBP_Extend},
|
||||||
|
{0x17C6, 0x17C6, WBP_Extend},
|
||||||
|
{0x17C7, 0x17C8, WBP_Extend},
|
||||||
|
{0x17C9, 0x17D3, WBP_Extend},
|
||||||
|
{0x17DD, 0x17DD, WBP_Extend},
|
||||||
|
{0x17E0, 0x17E9, WBP_Numeric},
|
||||||
|
{0x180B, 0x180D, WBP_Extend},
|
||||||
|
{0x1810, 0x1819, WBP_Numeric},
|
||||||
|
{0x1820, 0x1842, WBP_ALetter},
|
||||||
|
{0x1843, 0x1843, WBP_ALetter},
|
||||||
|
{0x1844, 0x1877, WBP_ALetter},
|
||||||
|
{0x1880, 0x18A8, WBP_ALetter},
|
||||||
|
{0x18A9, 0x18A9, WBP_Extend},
|
||||||
|
{0x18AA, 0x18AA, WBP_ALetter},
|
||||||
|
{0x18B0, 0x18F5, WBP_ALetter},
|
||||||
|
{0x1900, 0x191C, WBP_ALetter},
|
||||||
|
{0x1920, 0x1922, WBP_Extend},
|
||||||
|
{0x1923, 0x1926, WBP_Extend},
|
||||||
|
{0x1927, 0x1928, WBP_Extend},
|
||||||
|
{0x1929, 0x192B, WBP_Extend},
|
||||||
|
{0x1930, 0x1931, WBP_Extend},
|
||||||
|
{0x1932, 0x1932, WBP_Extend},
|
||||||
|
{0x1933, 0x1938, WBP_Extend},
|
||||||
|
{0x1939, 0x193B, WBP_Extend},
|
||||||
|
{0x1946, 0x194F, WBP_Numeric},
|
||||||
|
{0x19B0, 0x19C0, WBP_Extend},
|
||||||
|
{0x19C8, 0x19C9, WBP_Extend},
|
||||||
|
{0x19D0, 0x19D9, WBP_Numeric},
|
||||||
|
{0x1A00, 0x1A16, WBP_ALetter},
|
||||||
|
{0x1A17, 0x1A18, WBP_Extend},
|
||||||
|
{0x1A19, 0x1A1B, WBP_Extend},
|
||||||
|
{0x1A55, 0x1A55, WBP_Extend},
|
||||||
|
{0x1A56, 0x1A56, WBP_Extend},
|
||||||
|
{0x1A57, 0x1A57, WBP_Extend},
|
||||||
|
{0x1A58, 0x1A5E, WBP_Extend},
|
||||||
|
{0x1A60, 0x1A60, WBP_Extend},
|
||||||
|
{0x1A61, 0x1A61, WBP_Extend},
|
||||||
|
{0x1A62, 0x1A62, WBP_Extend},
|
||||||
|
{0x1A63, 0x1A64, WBP_Extend},
|
||||||
|
{0x1A65, 0x1A6C, WBP_Extend},
|
||||||
|
{0x1A6D, 0x1A72, WBP_Extend},
|
||||||
|
{0x1A73, 0x1A7C, WBP_Extend},
|
||||||
|
{0x1A7F, 0x1A7F, WBP_Extend},
|
||||||
|
{0x1A80, 0x1A89, WBP_Numeric},
|
||||||
|
{0x1A90, 0x1A99, WBP_Numeric},
|
||||||
|
{0x1B00, 0x1B03, WBP_Extend},
|
||||||
|
{0x1B04, 0x1B04, WBP_Extend},
|
||||||
|
{0x1B05, 0x1B33, WBP_ALetter},
|
||||||
|
{0x1B34, 0x1B34, WBP_Extend},
|
||||||
|
{0x1B35, 0x1B35, WBP_Extend},
|
||||||
|
{0x1B36, 0x1B3A, WBP_Extend},
|
||||||
|
{0x1B3B, 0x1B3B, WBP_Extend},
|
||||||
|
{0x1B3C, 0x1B3C, WBP_Extend},
|
||||||
|
{0x1B3D, 0x1B41, WBP_Extend},
|
||||||
|
{0x1B42, 0x1B42, WBP_Extend},
|
||||||
|
{0x1B43, 0x1B44, WBP_Extend},
|
||||||
|
{0x1B45, 0x1B4B, WBP_ALetter},
|
||||||
|
{0x1B50, 0x1B59, WBP_Numeric},
|
||||||
|
{0x1B6B, 0x1B73, WBP_Extend},
|
||||||
|
{0x1B80, 0x1B81, WBP_Extend},
|
||||||
|
{0x1B82, 0x1B82, WBP_Extend},
|
||||||
|
{0x1B83, 0x1BA0, WBP_ALetter},
|
||||||
|
{0x1BA1, 0x1BA1, WBP_Extend},
|
||||||
|
{0x1BA2, 0x1BA5, WBP_Extend},
|
||||||
|
{0x1BA6, 0x1BA7, WBP_Extend},
|
||||||
|
{0x1BA8, 0x1BA9, WBP_Extend},
|
||||||
|
{0x1BAA, 0x1BAA, WBP_Extend},
|
||||||
|
{0x1BAE, 0x1BAF, WBP_ALetter},
|
||||||
|
{0x1BB0, 0x1BB9, WBP_Numeric},
|
||||||
|
{0x1BC0, 0x1BE5, WBP_ALetter},
|
||||||
|
{0x1BE6, 0x1BE6, WBP_Extend},
|
||||||
|
{0x1BE7, 0x1BE7, WBP_Extend},
|
||||||
|
{0x1BE8, 0x1BE9, WBP_Extend},
|
||||||
|
{0x1BEA, 0x1BEC, WBP_Extend},
|
||||||
|
{0x1BED, 0x1BED, WBP_Extend},
|
||||||
|
{0x1BEE, 0x1BEE, WBP_Extend},
|
||||||
|
{0x1BEF, 0x1BF1, WBP_Extend},
|
||||||
|
{0x1BF2, 0x1BF3, WBP_Extend},
|
||||||
|
{0x1C00, 0x1C23, WBP_ALetter},
|
||||||
|
{0x1C24, 0x1C2B, WBP_Extend},
|
||||||
|
{0x1C2C, 0x1C33, WBP_Extend},
|
||||||
|
{0x1C34, 0x1C35, WBP_Extend},
|
||||||
|
{0x1C36, 0x1C37, WBP_Extend},
|
||||||
|
{0x1C40, 0x1C49, WBP_Numeric},
|
||||||
|
{0x1C4D, 0x1C4F, WBP_ALetter},
|
||||||
|
{0x1C50, 0x1C59, WBP_Numeric},
|
||||||
|
{0x1C5A, 0x1C77, WBP_ALetter},
|
||||||
|
{0x1C78, 0x1C7D, WBP_ALetter},
|
||||||
|
{0x1CD0, 0x1CD2, WBP_Extend},
|
||||||
|
{0x1CD4, 0x1CE0, WBP_Extend},
|
||||||
|
{0x1CE1, 0x1CE1, WBP_Extend},
|
||||||
|
{0x1CE2, 0x1CE8, WBP_Extend},
|
||||||
|
{0x1CE9, 0x1CEC, WBP_ALetter},
|
||||||
|
{0x1CED, 0x1CED, WBP_Extend},
|
||||||
|
{0x1CEE, 0x1CF1, WBP_ALetter},
|
||||||
|
{0x1CF2, 0x1CF2, WBP_Extend},
|
||||||
|
{0x1D00, 0x1D2B, WBP_ALetter},
|
||||||
|
{0x1D2C, 0x1D61, WBP_ALetter},
|
||||||
|
{0x1D62, 0x1D77, WBP_ALetter},
|
||||||
|
{0x1D78, 0x1D78, WBP_ALetter},
|
||||||
|
{0x1D79, 0x1D9A, WBP_ALetter},
|
||||||
|
{0x1D9B, 0x1DBF, WBP_ALetter},
|
||||||
|
{0x1DC0, 0x1DE6, WBP_Extend},
|
||||||
|
{0x1DFC, 0x1DFF, WBP_Extend},
|
||||||
|
{0x1E00, 0x1F15, WBP_ALetter},
|
||||||
|
{0x1F18, 0x1F1D, WBP_ALetter},
|
||||||
|
{0x1F20, 0x1F45, WBP_ALetter},
|
||||||
|
{0x1F48, 0x1F4D, WBP_ALetter},
|
||||||
|
{0x1F50, 0x1F57, WBP_ALetter},
|
||||||
|
{0x1F59, 0x1F59, WBP_ALetter},
|
||||||
|
{0x1F5B, 0x1F5B, WBP_ALetter},
|
||||||
|
{0x1F5D, 0x1F5D, WBP_ALetter},
|
||||||
|
{0x1F5F, 0x1F7D, WBP_ALetter},
|
||||||
|
{0x1F80, 0x1FB4, WBP_ALetter},
|
||||||
|
{0x1FB6, 0x1FBC, WBP_ALetter},
|
||||||
|
{0x1FBE, 0x1FBE, WBP_ALetter},
|
||||||
|
{0x1FC2, 0x1FC4, WBP_ALetter},
|
||||||
|
{0x1FC6, 0x1FCC, WBP_ALetter},
|
||||||
|
{0x1FD0, 0x1FD3, WBP_ALetter},
|
||||||
|
{0x1FD6, 0x1FDB, WBP_ALetter},
|
||||||
|
{0x1FE0, 0x1FEC, WBP_ALetter},
|
||||||
|
{0x1FF2, 0x1FF4, WBP_ALetter},
|
||||||
|
{0x1FF6, 0x1FFC, WBP_ALetter},
|
||||||
|
{0x200C, 0x200D, WBP_Extend},
|
||||||
|
{0x200E, 0x200F, WBP_Format},
|
||||||
|
{0x2018, 0x2018, WBP_MidNumLet},
|
||||||
|
{0x2019, 0x2019, WBP_MidNumLet},
|
||||||
|
{0x2024, 0x2024, WBP_MidNumLet},
|
||||||
|
{0x2027, 0x2027, WBP_MidLetter},
|
||||||
|
{0x2028, 0x2028, WBP_Newline},
|
||||||
|
{0x2029, 0x2029, WBP_Newline},
|
||||||
|
{0x202A, 0x202E, WBP_Format},
|
||||||
|
{0x203F, 0x2040, WBP_ExtendNumLet},
|
||||||
|
{0x2044, 0x2044, WBP_MidNum},
|
||||||
|
{0x2054, 0x2054, WBP_ExtendNumLet},
|
||||||
|
{0x2060, 0x2064, WBP_Format},
|
||||||
|
{0x206A, 0x206F, WBP_Format},
|
||||||
|
{0x2071, 0x2071, WBP_ALetter},
|
||||||
|
{0x207F, 0x207F, WBP_ALetter},
|
||||||
|
{0x2090, 0x209C, WBP_ALetter},
|
||||||
|
{0x20D0, 0x20DC, WBP_Extend},
|
||||||
|
{0x20DD, 0x20E0, WBP_Extend},
|
||||||
|
{0x20E1, 0x20E1, WBP_Extend},
|
||||||
|
{0x20E2, 0x20E4, WBP_Extend},
|
||||||
|
{0x20E5, 0x20F0, WBP_Extend},
|
||||||
|
{0x2102, 0x2102, WBP_ALetter},
|
||||||
|
{0x2107, 0x2107, WBP_ALetter},
|
||||||
|
{0x210A, 0x2113, WBP_ALetter},
|
||||||
|
{0x2115, 0x2115, WBP_ALetter},
|
||||||
|
{0x2119, 0x211D, WBP_ALetter},
|
||||||
|
{0x2124, 0x2124, WBP_ALetter},
|
||||||
|
{0x2126, 0x2126, WBP_ALetter},
|
||||||
|
{0x2128, 0x2128, WBP_ALetter},
|
||||||
|
{0x212A, 0x212D, WBP_ALetter},
|
||||||
|
{0x212F, 0x2134, WBP_ALetter},
|
||||||
|
{0x2135, 0x2138, WBP_ALetter},
|
||||||
|
{0x2139, 0x2139, WBP_ALetter},
|
||||||
|
{0x213C, 0x213F, WBP_ALetter},
|
||||||
|
{0x2145, 0x2149, WBP_ALetter},
|
||||||
|
{0x214E, 0x214E, WBP_ALetter},
|
||||||
|
{0x2160, 0x2182, WBP_ALetter},
|
||||||
|
{0x2183, 0x2184, WBP_ALetter},
|
||||||
|
{0x2185, 0x2188, WBP_ALetter},
|
||||||
|
{0x24B6, 0x24E9, WBP_ALetter},
|
||||||
|
{0x2C00, 0x2C2E, WBP_ALetter},
|
||||||
|
{0x2C30, 0x2C5E, WBP_ALetter},
|
||||||
|
{0x2C60, 0x2C7C, WBP_ALetter},
|
||||||
|
{0x2C7D, 0x2C7D, WBP_ALetter},
|
||||||
|
{0x2C7E, 0x2CE4, WBP_ALetter},
|
||||||
|
{0x2CEB, 0x2CEE, WBP_ALetter},
|
||||||
|
{0x2CEF, 0x2CF1, WBP_Extend},
|
||||||
|
{0x2D00, 0x2D25, WBP_ALetter},
|
||||||
|
{0x2D30, 0x2D65, WBP_ALetter},
|
||||||
|
{0x2D6F, 0x2D6F, WBP_ALetter},
|
||||||
|
{0x2D7F, 0x2D7F, WBP_Extend},
|
||||||
|
{0x2D80, 0x2D96, WBP_ALetter},
|
||||||
|
{0x2DA0, 0x2DA6, WBP_ALetter},
|
||||||
|
{0x2DA8, 0x2DAE, WBP_ALetter},
|
||||||
|
{0x2DB0, 0x2DB6, WBP_ALetter},
|
||||||
|
{0x2DB8, 0x2DBE, WBP_ALetter},
|
||||||
|
{0x2DC0, 0x2DC6, WBP_ALetter},
|
||||||
|
{0x2DC8, 0x2DCE, WBP_ALetter},
|
||||||
|
{0x2DD0, 0x2DD6, WBP_ALetter},
|
||||||
|
{0x2DD8, 0x2DDE, WBP_ALetter},
|
||||||
|
{0x2DE0, 0x2DFF, WBP_Extend},
|
||||||
|
{0x2E2F, 0x2E2F, WBP_ALetter},
|
||||||
|
{0x3005, 0x3005, WBP_ALetter},
|
||||||
|
{0x302A, 0x302F, WBP_Extend},
|
||||||
|
{0x3031, 0x3035, WBP_Katakana},
|
||||||
|
{0x303B, 0x303B, WBP_ALetter},
|
||||||
|
{0x303C, 0x303C, WBP_ALetter},
|
||||||
|
{0x3099, 0x309A, WBP_Extend},
|
||||||
|
{0x309B, 0x309C, WBP_Katakana},
|
||||||
|
{0x30A0, 0x30A0, WBP_Katakana},
|
||||||
|
{0x30A1, 0x30FA, WBP_Katakana},
|
||||||
|
{0x30FC, 0x30FE, WBP_Katakana},
|
||||||
|
{0x30FF, 0x30FF, WBP_Katakana},
|
||||||
|
{0x3105, 0x312D, WBP_ALetter},
|
||||||
|
{0x3131, 0x318E, WBP_ALetter},
|
||||||
|
{0x31A0, 0x31BA, WBP_ALetter},
|
||||||
|
{0x31F0, 0x31FF, WBP_Katakana},
|
||||||
|
{0x32D0, 0x32FE, WBP_Katakana},
|
||||||
|
{0x3300, 0x3357, WBP_Katakana},
|
||||||
|
{0xA000, 0xA014, WBP_ALetter},
|
||||||
|
{0xA015, 0xA015, WBP_ALetter},
|
||||||
|
{0xA016, 0xA48C, WBP_ALetter},
|
||||||
|
{0xA4D0, 0xA4F7, WBP_ALetter},
|
||||||
|
{0xA4F8, 0xA4FD, WBP_ALetter},
|
||||||
|
{0xA500, 0xA60B, WBP_ALetter},
|
||||||
|
{0xA60C, 0xA60C, WBP_ALetter},
|
||||||
|
{0xA610, 0xA61F, WBP_ALetter},
|
||||||
|
{0xA620, 0xA629, WBP_Numeric},
|
||||||
|
{0xA62A, 0xA62B, WBP_ALetter},
|
||||||
|
{0xA640, 0xA66D, WBP_ALetter},
|
||||||
|
{0xA66E, 0xA66E, WBP_ALetter},
|
||||||
|
{0xA66F, 0xA66F, WBP_Extend},
|
||||||
|
{0xA670, 0xA672, WBP_Extend},
|
||||||
|
{0xA67C, 0xA67D, WBP_Extend},
|
||||||
|
{0xA67F, 0xA67F, WBP_ALetter},
|
||||||
|
{0xA680, 0xA697, WBP_ALetter},
|
||||||
|
{0xA6A0, 0xA6E5, WBP_ALetter},
|
||||||
|
{0xA6E6, 0xA6EF, WBP_ALetter},
|
||||||
|
{0xA6F0, 0xA6F1, WBP_Extend},
|
||||||
|
{0xA717, 0xA71F, WBP_ALetter},
|
||||||
|
{0xA722, 0xA76F, WBP_ALetter},
|
||||||
|
{0xA770, 0xA770, WBP_ALetter},
|
||||||
|
{0xA771, 0xA787, WBP_ALetter},
|
||||||
|
{0xA788, 0xA788, WBP_ALetter},
|
||||||
|
{0xA78B, 0xA78E, WBP_ALetter},
|
||||||
|
{0xA790, 0xA791, WBP_ALetter},
|
||||||
|
{0xA7A0, 0xA7A9, WBP_ALetter},
|
||||||
|
{0xA7FA, 0xA7FA, WBP_ALetter},
|
||||||
|
{0xA7FB, 0xA801, WBP_ALetter},
|
||||||
|
{0xA802, 0xA802, WBP_Extend},
|
||||||
|
{0xA803, 0xA805, WBP_ALetter},
|
||||||
|
{0xA806, 0xA806, WBP_Extend},
|
||||||
|
{0xA807, 0xA80A, WBP_ALetter},
|
||||||
|
{0xA80B, 0xA80B, WBP_Extend},
|
||||||
|
{0xA80C, 0xA822, WBP_ALetter},
|
||||||
|
{0xA823, 0xA824, WBP_Extend},
|
||||||
|
{0xA825, 0xA826, WBP_Extend},
|
||||||
|
{0xA827, 0xA827, WBP_Extend},
|
||||||
|
{0xA840, 0xA873, WBP_ALetter},
|
||||||
|
{0xA880, 0xA881, WBP_Extend},
|
||||||
|
{0xA882, 0xA8B3, WBP_ALetter},
|
||||||
|
{0xA8B4, 0xA8C3, WBP_Extend},
|
||||||
|
{0xA8C4, 0xA8C4, WBP_Extend},
|
||||||
|
{0xA8D0, 0xA8D9, WBP_Numeric},
|
||||||
|
{0xA8E0, 0xA8F1, WBP_Extend},
|
||||||
|
{0xA8F2, 0xA8F7, WBP_ALetter},
|
||||||
|
{0xA8FB, 0xA8FB, WBP_ALetter},
|
||||||
|
{0xA900, 0xA909, WBP_Numeric},
|
||||||
|
{0xA90A, 0xA925, WBP_ALetter},
|
||||||
|
{0xA926, 0xA92D, WBP_Extend},
|
||||||
|
{0xA930, 0xA946, WBP_ALetter},
|
||||||
|
{0xA947, 0xA951, WBP_Extend},
|
||||||
|
{0xA952, 0xA953, WBP_Extend},
|
||||||
|
{0xA960, 0xA97C, WBP_ALetter},
|
||||||
|
{0xA980, 0xA982, WBP_Extend},
|
||||||
|
{0xA983, 0xA983, WBP_Extend},
|
||||||
|
{0xA984, 0xA9B2, WBP_ALetter},
|
||||||
|
{0xA9B3, 0xA9B3, WBP_Extend},
|
||||||
|
{0xA9B4, 0xA9B5, WBP_Extend},
|
||||||
|
{0xA9B6, 0xA9B9, WBP_Extend},
|
||||||
|
{0xA9BA, 0xA9BB, WBP_Extend},
|
||||||
|
{0xA9BC, 0xA9BC, WBP_Extend},
|
||||||
|
{0xA9BD, 0xA9C0, WBP_Extend},
|
||||||
|
{0xA9CF, 0xA9CF, WBP_ALetter},
|
||||||
|
{0xA9D0, 0xA9D9, WBP_Numeric},
|
||||||
|
{0xAA00, 0xAA28, WBP_ALetter},
|
||||||
|
{0xAA29, 0xAA2E, WBP_Extend},
|
||||||
|
{0xAA2F, 0xAA30, WBP_Extend},
|
||||||
|
{0xAA31, 0xAA32, WBP_Extend},
|
||||||
|
{0xAA33, 0xAA34, WBP_Extend},
|
||||||
|
{0xAA35, 0xAA36, WBP_Extend},
|
||||||
|
{0xAA40, 0xAA42, WBP_ALetter},
|
||||||
|
{0xAA43, 0xAA43, WBP_Extend},
|
||||||
|
{0xAA44, 0xAA4B, WBP_ALetter},
|
||||||
|
{0xAA4C, 0xAA4C, WBP_Extend},
|
||||||
|
{0xAA4D, 0xAA4D, WBP_Extend},
|
||||||
|
{0xAA50, 0xAA59, WBP_Numeric},
|
||||||
|
{0xAA7B, 0xAA7B, WBP_Extend},
|
||||||
|
{0xAAB0, 0xAAB0, WBP_Extend},
|
||||||
|
{0xAAB2, 0xAAB4, WBP_Extend},
|
||||||
|
{0xAAB7, 0xAAB8, WBP_Extend},
|
||||||
|
{0xAABE, 0xAABF, WBP_Extend},
|
||||||
|
{0xAAC1, 0xAAC1, WBP_Extend},
|
||||||
|
{0xAB01, 0xAB06, WBP_ALetter},
|
||||||
|
{0xAB09, 0xAB0E, WBP_ALetter},
|
||||||
|
{0xAB11, 0xAB16, WBP_ALetter},
|
||||||
|
{0xAB20, 0xAB26, WBP_ALetter},
|
||||||
|
{0xAB28, 0xAB2E, WBP_ALetter},
|
||||||
|
{0xABC0, 0xABE2, WBP_ALetter},
|
||||||
|
{0xABE3, 0xABE4, WBP_Extend},
|
||||||
|
{0xABE5, 0xABE5, WBP_Extend},
|
||||||
|
{0xABE6, 0xABE7, WBP_Extend},
|
||||||
|
{0xABE8, 0xABE8, WBP_Extend},
|
||||||
|
{0xABE9, 0xABEA, WBP_Extend},
|
||||||
|
{0xABEC, 0xABEC, WBP_Extend},
|
||||||
|
{0xABED, 0xABED, WBP_Extend},
|
||||||
|
{0xABF0, 0xABF9, WBP_Numeric},
|
||||||
|
{0xAC00, 0xD7A3, WBP_ALetter},
|
||||||
|
{0xD7B0, 0xD7C6, WBP_ALetter},
|
||||||
|
{0xD7CB, 0xD7FB, WBP_ALetter},
|
||||||
|
{0xFB00, 0xFB06, WBP_ALetter},
|
||||||
|
{0xFB13, 0xFB17, WBP_ALetter},
|
||||||
|
{0xFB1D, 0xFB1D, WBP_ALetter},
|
||||||
|
{0xFB1E, 0xFB1E, WBP_Extend},
|
||||||
|
{0xFB1F, 0xFB28, WBP_ALetter},
|
||||||
|
{0xFB2A, 0xFB36, WBP_ALetter},
|
||||||
|
{0xFB38, 0xFB3C, WBP_ALetter},
|
||||||
|
{0xFB3E, 0xFB3E, WBP_ALetter},
|
||||||
|
{0xFB40, 0xFB41, WBP_ALetter},
|
||||||
|
{0xFB43, 0xFB44, WBP_ALetter},
|
||||||
|
{0xFB46, 0xFBB1, WBP_ALetter},
|
||||||
|
{0xFBD3, 0xFD3D, WBP_ALetter},
|
||||||
|
{0xFD50, 0xFD8F, WBP_ALetter},
|
||||||
|
{0xFD92, 0xFDC7, WBP_ALetter},
|
||||||
|
{0xFDF0, 0xFDFB, WBP_ALetter},
|
||||||
|
{0xFE00, 0xFE0F, WBP_Extend},
|
||||||
|
{0xFE10, 0xFE10, WBP_MidNum},
|
||||||
|
{0xFE13, 0xFE13, WBP_MidLetter},
|
||||||
|
{0xFE14, 0xFE14, WBP_MidNum},
|
||||||
|
{0xFE20, 0xFE26, WBP_Extend},
|
||||||
|
{0xFE33, 0xFE34, WBP_ExtendNumLet},
|
||||||
|
{0xFE4D, 0xFE4F, WBP_ExtendNumLet},
|
||||||
|
{0xFE50, 0xFE50, WBP_MidNum},
|
||||||
|
{0xFE52, 0xFE52, WBP_MidNumLet},
|
||||||
|
{0xFE54, 0xFE54, WBP_MidNum},
|
||||||
|
{0xFE55, 0xFE55, WBP_MidLetter},
|
||||||
|
{0xFE70, 0xFE74, WBP_ALetter},
|
||||||
|
{0xFE76, 0xFEFC, WBP_ALetter},
|
||||||
|
{0xFEFF, 0xFEFF, WBP_Format},
|
||||||
|
{0xFF07, 0xFF07, WBP_MidNumLet},
|
||||||
|
{0xFF0C, 0xFF0C, WBP_MidNum},
|
||||||
|
{0xFF0E, 0xFF0E, WBP_MidNumLet},
|
||||||
|
{0xFF1A, 0xFF1A, WBP_MidLetter},
|
||||||
|
{0xFF1B, 0xFF1B, WBP_MidNum},
|
||||||
|
{0xFF21, 0xFF3A, WBP_ALetter},
|
||||||
|
{0xFF3F, 0xFF3F, WBP_ExtendNumLet},
|
||||||
|
{0xFF41, 0xFF5A, WBP_ALetter},
|
||||||
|
{0xFF66, 0xFF6F, WBP_Katakana},
|
||||||
|
{0xFF70, 0xFF70, WBP_Katakana},
|
||||||
|
{0xFF71, 0xFF9D, WBP_Katakana},
|
||||||
|
{0xFF9E, 0xFF9F, WBP_Extend},
|
||||||
|
{0xFFA0, 0xFFBE, WBP_ALetter},
|
||||||
|
{0xFFC2, 0xFFC7, WBP_ALetter},
|
||||||
|
{0xFFCA, 0xFFCF, WBP_ALetter},
|
||||||
|
{0xFFD2, 0xFFD7, WBP_ALetter},
|
||||||
|
{0xFFDA, 0xFFDC, WBP_ALetter},
|
||||||
|
{0xFFF9, 0xFFFB, WBP_Format},
|
||||||
|
{0x10000, 0x1000B, WBP_ALetter},
|
||||||
|
{0x1000D, 0x10026, WBP_ALetter},
|
||||||
|
{0x10028, 0x1003A, WBP_ALetter},
|
||||||
|
{0x1003C, 0x1003D, WBP_ALetter},
|
||||||
|
{0x1003F, 0x1004D, WBP_ALetter},
|
||||||
|
{0x10050, 0x1005D, WBP_ALetter},
|
||||||
|
{0x10080, 0x100FA, WBP_ALetter},
|
||||||
|
{0x10140, 0x10174, WBP_ALetter},
|
||||||
|
{0x101FD, 0x101FD, WBP_Extend},
|
||||||
|
{0x10280, 0x1029C, WBP_ALetter},
|
||||||
|
{0x102A0, 0x102D0, WBP_ALetter},
|
||||||
|
{0x10300, 0x1031E, WBP_ALetter},
|
||||||
|
{0x10330, 0x10340, WBP_ALetter},
|
||||||
|
{0x10341, 0x10341, WBP_ALetter},
|
||||||
|
{0x10342, 0x10349, WBP_ALetter},
|
||||||
|
{0x1034A, 0x1034A, WBP_ALetter},
|
||||||
|
{0x10380, 0x1039D, WBP_ALetter},
|
||||||
|
{0x103A0, 0x103C3, WBP_ALetter},
|
||||||
|
{0x103C8, 0x103CF, WBP_ALetter},
|
||||||
|
{0x103D1, 0x103D5, WBP_ALetter},
|
||||||
|
{0x10400, 0x1044F, WBP_ALetter},
|
||||||
|
{0x10450, 0x1049D, WBP_ALetter},
|
||||||
|
{0x104A0, 0x104A9, WBP_Numeric},
|
||||||
|
{0x10800, 0x10805, WBP_ALetter},
|
||||||
|
{0x10808, 0x10808, WBP_ALetter},
|
||||||
|
{0x1080A, 0x10835, WBP_ALetter},
|
||||||
|
{0x10837, 0x10838, WBP_ALetter},
|
||||||
|
{0x1083C, 0x1083C, WBP_ALetter},
|
||||||
|
{0x1083F, 0x10855, WBP_ALetter},
|
||||||
|
{0x10900, 0x10915, WBP_ALetter},
|
||||||
|
{0x10920, 0x10939, WBP_ALetter},
|
||||||
|
{0x10A00, 0x10A00, WBP_ALetter},
|
||||||
|
{0x10A01, 0x10A03, WBP_Extend},
|
||||||
|
{0x10A05, 0x10A06, WBP_Extend},
|
||||||
|
{0x10A0C, 0x10A0F, WBP_Extend},
|
||||||
|
{0x10A10, 0x10A13, WBP_ALetter},
|
||||||
|
{0x10A15, 0x10A17, WBP_ALetter},
|
||||||
|
{0x10A19, 0x10A33, WBP_ALetter},
|
||||||
|
{0x10A38, 0x10A3A, WBP_Extend},
|
||||||
|
{0x10A3F, 0x10A3F, WBP_Extend},
|
||||||
|
{0x10A60, 0x10A7C, WBP_ALetter},
|
||||||
|
{0x10B00, 0x10B35, WBP_ALetter},
|
||||||
|
{0x10B40, 0x10B55, WBP_ALetter},
|
||||||
|
{0x10B60, 0x10B72, WBP_ALetter},
|
||||||
|
{0x10C00, 0x10C48, WBP_ALetter},
|
||||||
|
{0x11000, 0x11000, WBP_Extend},
|
||||||
|
{0x11001, 0x11001, WBP_Extend},
|
||||||
|
{0x11002, 0x11002, WBP_Extend},
|
||||||
|
{0x11003, 0x11037, WBP_ALetter},
|
||||||
|
{0x11038, 0x11046, WBP_Extend},
|
||||||
|
{0x11066, 0x1106F, WBP_Numeric},
|
||||||
|
{0x11080, 0x11081, WBP_Extend},
|
||||||
|
{0x11082, 0x11082, WBP_Extend},
|
||||||
|
{0x11083, 0x110AF, WBP_ALetter},
|
||||||
|
{0x110B0, 0x110B2, WBP_Extend},
|
||||||
|
{0x110B3, 0x110B6, WBP_Extend},
|
||||||
|
{0x110B7, 0x110B8, WBP_Extend},
|
||||||
|
{0x110B9, 0x110BA, WBP_Extend},
|
||||||
|
{0x110BD, 0x110BD, WBP_Format},
|
||||||
|
{0x12000, 0x1236E, WBP_ALetter},
|
||||||
|
{0x12400, 0x12462, WBP_ALetter},
|
||||||
|
{0x13000, 0x1342E, WBP_ALetter},
|
||||||
|
{0x16800, 0x16A38, WBP_ALetter},
|
||||||
|
{0x1B000, 0x1B000, WBP_Katakana},
|
||||||
|
{0x1D165, 0x1D166, WBP_Extend},
|
||||||
|
{0x1D167, 0x1D169, WBP_Extend},
|
||||||
|
{0x1D16D, 0x1D172, WBP_Extend},
|
||||||
|
{0x1D173, 0x1D17A, WBP_Format},
|
||||||
|
{0x1D17B, 0x1D182, WBP_Extend},
|
||||||
|
{0x1D185, 0x1D18B, WBP_Extend},
|
||||||
|
{0x1D1AA, 0x1D1AD, WBP_Extend},
|
||||||
|
{0x1D242, 0x1D244, WBP_Extend},
|
||||||
|
{0x1D400, 0x1D454, WBP_ALetter},
|
||||||
|
{0x1D456, 0x1D49C, WBP_ALetter},
|
||||||
|
{0x1D49E, 0x1D49F, WBP_ALetter},
|
||||||
|
{0x1D4A2, 0x1D4A2, WBP_ALetter},
|
||||||
|
{0x1D4A5, 0x1D4A6, WBP_ALetter},
|
||||||
|
{0x1D4A9, 0x1D4AC, WBP_ALetter},
|
||||||
|
{0x1D4AE, 0x1D4B9, WBP_ALetter},
|
||||||
|
{0x1D4BB, 0x1D4BB, WBP_ALetter},
|
||||||
|
{0x1D4BD, 0x1D4C3, WBP_ALetter},
|
||||||
|
{0x1D4C5, 0x1D505, WBP_ALetter},
|
||||||
|
{0x1D507, 0x1D50A, WBP_ALetter},
|
||||||
|
{0x1D50D, 0x1D514, WBP_ALetter},
|
||||||
|
{0x1D516, 0x1D51C, WBP_ALetter},
|
||||||
|
{0x1D51E, 0x1D539, WBP_ALetter},
|
||||||
|
{0x1D53B, 0x1D53E, WBP_ALetter},
|
||||||
|
{0x1D540, 0x1D544, WBP_ALetter},
|
||||||
|
{0x1D546, 0x1D546, WBP_ALetter},
|
||||||
|
{0x1D54A, 0x1D550, WBP_ALetter},
|
||||||
|
{0x1D552, 0x1D6A5, WBP_ALetter},
|
||||||
|
{0x1D6A8, 0x1D6C0, WBP_ALetter},
|
||||||
|
{0x1D6C2, 0x1D6DA, WBP_ALetter},
|
||||||
|
{0x1D6DC, 0x1D6FA, WBP_ALetter},
|
||||||
|
{0x1D6FC, 0x1D714, WBP_ALetter},
|
||||||
|
{0x1D716, 0x1D734, WBP_ALetter},
|
||||||
|
{0x1D736, 0x1D74E, WBP_ALetter},
|
||||||
|
{0x1D750, 0x1D76E, WBP_ALetter},
|
||||||
|
{0x1D770, 0x1D788, WBP_ALetter},
|
||||||
|
{0x1D78A, 0x1D7A8, WBP_ALetter},
|
||||||
|
{0x1D7AA, 0x1D7C2, WBP_ALetter},
|
||||||
|
{0x1D7C4, 0x1D7CB, WBP_ALetter},
|
||||||
|
{0x1D7CE, 0x1D7FF, WBP_Numeric},
|
||||||
|
{0xE0001, 0xE0001, WBP_Format},
|
||||||
|
{0xE0020, 0xE007F, WBP_Format},
|
||||||
|
{0xE0100, 0xE01EF, WBP_Extend},
|
||||||
|
{0xFFFFFFFF, 0xFFFFFFFF, WBP_Undefined}
|
||||||
|
};
|
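The table above ends with a `{0xFFFFFFFF, 0xFFFFFFFF, WBP_Undefined}` row, which suggests its length can be found by scanning for that sentinel instead of storing a separate count. The following is a minimal sketch of that idea, together with a check that the ranges are ascending and disjoint. The helper names `wb_table_length` and `wb_table_is_sorted` are hypothetical, the `utf32_t` typedef is assumed to match the one in `linebreak.h`, and the enum is a trimmed copy used only for illustration; this is not necessarily how the library itself walks the table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t utf32_t;   /* assumption: mirrors the typedef in linebreak.h */

/* Trimmed copy of the word break classes, for illustration only. */
enum WordBreakClass { WBP_Undefined, WBP_Extend, WBP_ALetter, WBP_Numeric };

struct WordBreakProperties
{
    utf32_t start;              /* first code point of the range */
    utf32_t end;                /* last code point of the range (inclusive) */
    enum WordBreakClass prop;   /* class shared by the whole range */
};

/* Hypothetical helper: count the entries that precede the sentinel row. */
static size_t wb_table_length(const struct WordBreakProperties *table)
{
    size_t n = 0;
    assert(table != NULL);
    while (table[n].prop != WBP_Undefined)
        ++n;
    return n;
}

/* Hypothetical helper: true if every range is well formed and the ranges
 * are in ascending order without overlapping. */
static bool wb_table_is_sorted(const struct WordBreakProperties *table,
                               size_t n)
{
    for (size_t i = 0; i < n; ++i)
    {
        if (table[i].start > table[i].end)
            return false;
        if (i > 0 && table[i - 1].end >= table[i].start)
            return false;
    }
    return true;
}

/* A few rows copied from the table above, followed by the sentinel. */
static const struct WordBreakProperties sample_table[] = {
    {0x0B5F, 0x0B61, WBP_ALetter},
    {0x0B62, 0x0B63, WBP_Extend},
    {0x0B66, 0x0B6F, WBP_Numeric},
    {0xFFFFFFFF, 0xFFFFFFFF, WBP_Undefined}
};
```

A check like this could run once at start-up or in a unit test, so that a regenerated table that breaks the ordering is caught before any lookup relies on it.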
5
linebreak/linebreak/wordbreakdata1.tmpl
Normal file
|
@ -0,0 +1,5 @@
|
||||||
|
|
||||||
|
#include "linebreak.h"
|
||||||
|
#include "wordbreakdef.h"
|
||||||
|
|
||||||
|
static struct WordBreakProperties wb_prop_default[] = {
|
2
linebreak/linebreak/wordbreakdata2.tmpl
Normal file
|
@ -0,0 +1,2 @@
|
||||||
|
{0xFFFFFFFF, 0xFFFFFFFF, WBP_Undefined}
|
||||||
|
};
|
78
linebreak/linebreak/wordbreakdef.h
Normal file
|
@ -0,0 +1,78 @@
|
||||||
|
/* vim: set tabstop=4 shiftwidth=4: */
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Word breaking in a Unicode sequence. Designed to be used in a
|
||||||
|
* generic text renderer.
|
||||||
|
*
|
||||||
|
* Copyright (C) 2012 Tom Hacohen <tom@stosb.com>
|
||||||
|
*
|
||||||
|
* This software is provided 'as-is', without any express or implied
|
||||||
|
* warranty. In no event will the author be held liable for any damages
|
||||||
|
* arising from the use of this software.
|
||||||
|
*
|
||||||
|
* Permission is granted to anyone to use this software for any purpose,
|
||||||
|
* including commercial applications, and to alter it and redistribute
|
||||||
|
* it freely, subject to the following restrictions:
|
||||||
|
*
|
||||||
|
* 1. The origin of this software must not be misrepresented; you must
|
||||||
|
* not claim that you wrote the original software. If you use this
|
||||||
|
* software in a product, an acknowledgement in the product
|
||||||
|
* documentation would be appreciated but is not required.
|
||||||
|
* 2. Altered source versions must be plainly marked as such, and must
|
||||||
|
* not be misrepresented as being the original software.
|
||||||
|
* 3. This notice may not be removed or altered from any source
|
||||||
|
* distribution.
|
||||||
|
*
|
||||||
|
* The main reference is Unicode Standard Annex 29 (UAX #29):
|
||||||
|
* <URL:http://unicode.org/reports/tr29>
|
||||||
|
*
|
||||||
|
* When this library was designed, this annex was at Revision 17, for
|
||||||
|
* Unicode 6.0.0:
|
||||||
|
* <URL:http://www.unicode.org/reports/tr29/tr29-17.html>
|
||||||
|
*
|
||||||
|
* The Unicode Terms of Use are available at
|
||||||
|
* <URL:http://www.unicode.org/copyright.html>
|
||||||
|
*/
|
||||||
|
|
||||||
|
/**
|
||||||
|
* @file wordbreakdef.h
|
||||||
|
*
|
||||||
|
* Definitions of internal data structures, declarations of global
|
||||||
|
* variables, and function prototypes for the word breaking algorithm.
|
||||||
|
*
|
||||||
|
* @version 2.1, 2012/01/18
|
||||||
|
* @author Tom Hacohen
|
||||||
|
*/
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Word break classes. This is a direct mapping of Table 3 of Unicode
|
||||||
|
* Standard Annex 29, Revision 17.
|
||||||
|
*/
|
||||||
|
enum WordBreakClass
|
||||||
|
{
|
||||||
|
WBP_Undefined,
|
||||||
|
WBP_CR,
|
||||||
|
WBP_LF,
|
||||||
|
WBP_Newline,
|
||||||
|
WBP_Extend,
|
||||||
|
WBP_Format,
|
||||||
|
WBP_Katakana,
|
||||||
|
WBP_ALetter,
|
||||||
|
WBP_MidNumLet,
|
||||||
|
WBP_MidLetter,
|
||||||
|
WBP_MidNum,
|
||||||
|
WBP_Numeric,
|
||||||
|
WBP_ExtendNumLet,
|
||||||
|
WBP_Any
|
||||||
|
};
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Struct for entries of word break properties. The array of the
|
||||||
|
* entries \e must be sorted.
|
||||||
|
*/
|
||||||
|
struct WordBreakProperties
|
||||||
|
{
|
||||||
|
utf32_t start; /**< Starting code point */
|
||||||
|
utf32_t end; /**< End code point */
|
||||||
|
enum WordBreakClass prop; /**< The word breaking property */
|
||||||
|
};
|
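The comment on `struct WordBreakProperties` requires the array to be sorted. One reason such a requirement is useful is that a sorted array of disjoint ranges can be searched in logarithmic time. The sketch below shows a binary search over ranges under stated assumptions: the function name `wb_lookup` is hypothetical, `utf32_t` is assumed to be a 32-bit unsigned type, the enum is a trimmed copy, and returning `WBP_Any` for code points not covered by any row is a choice made for this example, not a confirmed behaviour of the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t utf32_t;   /* assumption: mirrors the typedef in linebreak.h */

/* Trimmed copy of the word break classes, for illustration only. */
enum WordBreakClass
{
    WBP_Undefined,
    WBP_Extend,
    WBP_ALetter,
    WBP_Numeric,
    WBP_Any
};

struct WordBreakProperties
{
    utf32_t start;              /* first code point of the range */
    utf32_t end;                /* last code point of the range (inclusive) */
    enum WordBreakClass prop;   /* class shared by the whole range */
};

/* A few rows taken from the generated table; sorted and non-overlapping. */
static const struct WordBreakProperties sample_props[] = {
    {0x0E31, 0x0E31, WBP_Extend},
    {0x0E34, 0x0E3A, WBP_Extend},
    {0x0E47, 0x0E4E, WBP_Extend},
    {0x0E50, 0x0E59, WBP_Numeric},
    {0x0F00, 0x0F00, WBP_ALetter}
};

/* Hypothetical lookup: binary search over the half-open index range
 * [lo, hi). Each step discards the half whose ranges cannot contain ch. */
static enum WordBreakClass wb_lookup(utf32_t ch,
                                     const struct WordBreakProperties *table,
                                     size_t len)
{
    size_t lo = 0;
    size_t hi = len;
    assert(table != NULL || len == 0);
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;
        if (ch < table[mid].start)
            hi = mid;               /* ch lies before this range */
        else if (ch > table[mid].end)
            lo = mid + 1;           /* ch lies after this range */
        else
            return table[mid].prop; /* start <= ch <= end */
    }
    return WBP_Any;                 /* assumption: fallback for unlisted code points */
}
```

For instance, `wb_lookup(0x0E55, sample_props, 5)` lands in the `0x0E50..0x0E59` row, while `0x0E32` falls in the gap between the first two rows and gets the fallback class. If the array were not sorted, the comparisons against `start` and `end` could discard the half that actually contains the code point, which is why the ordering requirement matters.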