Initial import of linebreak

Slava Monich 2015-05-27 00:00:50 +03:00
parent 56a5d7f63f
commit 74ca5511d7
34 changed files with 6889 additions and 0 deletions

@ -0,0 +1,8 @@
Wu Yongwei. Designed and implemented liblinebreak.
Nikolay Pultsin. Put forward the original requirements on liblinebreak,
performed tests, and made a lot of suggestions on the initial versions.
Thomas Klausner. Autoconfiscated and libtoolized liblinebreak.
Tom Hacohen. Added word boundaries support.

@ -0,0 +1,32 @@
/AUTHORS/1.2/Wed Jan 18 14:26:13 2012//
/ChangeLog/1.78/Sat Aug 11 07:35:23 2012//
/Doxyfile/1.7/Sat Aug 11 06:55:18 2012//
/LICENCE/1.4/Sat Aug 11 07:35:23 2012//
/LineBreak1.sed/1.2/Sun Dec 7 10:54:37 2008//
/LineBreak2.sed/1.2/Sun Dec 7 10:54:37 2008//
/Makefile.am/1.8/Sat Aug 11 06:55:18 2012//
/Makefile.gcc/1.4/Thu Jan 19 14:03:34 2012//
/Makefile.msvc/1.5/Sat Aug 11 05:57:50 2012//
/NEWS/1.7/Sat Aug 11 06:55:18 2012//
/README/1.8/Sat Aug 11 06:55:18 2012//
/bootstrap/1.1/Fri Dec 12 12:01:39 2008//
/configure.ac/1.6/Sat Aug 11 06:55:18 2012//
/filter_dup.c/1.1/Sat Feb 23 11:53:28 2008//
/libunibreak.pc.in/1.1/Sat Aug 11 06:55:18 2012//
/linebreak.c/1.25/Sat May 7 19:55:10 2011//
/linebreak.h/1.14/Sat May 7 19:55:10 2011//
/linebreakdata.c/1.5/Sat May 7 19:40:20 2011//
/linebreakdata1.tmpl/1.1/Sat Feb 23 11:53:28 2008//
/linebreakdata2.tmpl/1.2/Sun Mar 2 07:30:43 2008//
/linebreakdata3.tmpl/1.1/Sat Feb 23 11:53:28 2008//
/linebreakdef.c/1.12/Sat May 7 19:55:10 2011//
/linebreakdef.h/1.12/Sat May 7 19:55:10 2011//
/purge/1.1/Fri Dec 12 12:01:39 2008//
/sort_numeric_hex.py/1.2/Wed Jan 18 14:26:13 2012//
/wordbreak.c/1.3/Sat Feb 4 14:32:57 2012//
/wordbreak.h/1.4/Sat Feb 4 14:32:58 2012//
/wordbreakdata.c/1.2/Wed Jan 18 14:26:13 2012//
/wordbreakdata1.tmpl/1.2/Wed Jan 18 14:26:13 2012//
/wordbreakdata2.tmpl/1.2/Wed Jan 18 14:26:13 2012//
/wordbreakdef.h/1.2/Wed Jan 18 14:26:13 2012//
D

@ -0,0 +1 @@
common/tools/linebreak

@ -0,0 +1 @@
:pserver:anonymous@vimgadgets.cvs.sourceforge.net:/cvsroot/vimgadgets

@ -0,0 +1,512 @@
2012-08-11 Wu Yongwei <wuyongwei@gmail.com>
* LICENCE: Add copyright information about Tom Hacohen.
2012-08-11 Wu Yongwei <wuyongwei@gmail.com>
* configure.ac (AC_INIT): Change the library name and version to
`libunibreak' and `1.0'.
(AC_PROG_LN_S): New macro.
(AC_OUTPUT): Change to `libunibreak.pc'.
* Doxyfile: (PROJECT_NAME): Change to `libunibreak'.
(PROJECT_NUMBER): Change to `1.0'.
* Makefile.am (lib_LTLIBRARIES): Change to `libunibreak.la'.
(pkgconfig_DATA): Change to `libunibreak.pc'.
(libunibreak_la_LDFLAGS): Reset the version to `1:0'.
(install-exec-hook): Replace the static library liblinebreak.a with
a symlink to libunibreak.a.
* NEWS: Add information about libunibreak 1.0.
* README: Change the library name, and add information about word
break.
2012-08-11 Wu Yongwei <wuyongwei@gmail.com>
* Makefile.msvc: Change the library name to `libunibreak', and the
output library to `unibreak.lib'.
2012-02-04 Wu Yongwei <wuyongwei@gmail.com>
* wordbreak.h (WORDBREAK_INSIDEACHAR): Change from
WORDBREAK_INSIDECHAR.
* wordbreak.c (set_brks_to): Change `WORDBREAK_INSIDECHAR' to
`WORDBREAK_INSIDEACHAR'.
2012-01-19 Wu Yongwei <wuyongwei@gmail.com>
* wordbreak.h: Change angle brackets to quotation marks (which
caused build errors).
2012-01-19 Wu Yongwei <wuyongwei@gmail.com>
* Makefile.gcc (CFILES): Add wordbreak.c.
(WordBreakProperty.txt): New target.
(wordbreakdata): New target.
2012-01-19 Wu Yongwei <wuyongwei@gmail.com>
* Makefile.am (liblinebreak_la_SOURCES): Remove wordbreakdata.c.
(EXTRA_DIST): Add wordbreakdata.c, wordbreakdata1.tmpl, and
wordbreakdata2.tmpl.
2012-01-19 Wu Yongwei <wuyongwei@gmail.com>
* Makefile.msvc: Add wordbreak files.
2012-01-18 Tom Hacohen <tom@stosb.com>
Add word breaking support.
* AUTHORS: Add `Tom Hacohen'.
* Makefile.am (include_HEADERS): Add header files for word breaking.
(liblinebreak_la_SOURCES): Add source files for word breaking.
(sort_numeric_hex.py): Add `sort_numeric_hex.py'.
(distclean-local): Clean also `WordBreakData.txt'.
(WordBreakProperty.txt): New target.
(wordbreakdata): New target.
* sort_numeric_hex.py: New file.
* wordbreak.c: New file.
* wordbreak.h: New file.
* wordbreakdef.h: New file.
* wordbreakdata.c: New file.
* wordbreakdata1.tmpl: New file.
* wordbreakdata2.tmpl: New file.
2011-05-17 Wu Yongwei <wuyongwei@gmail.com>
Add support for pkg-config (thanks to Tom Hacohen).
* liblinebreak.pc.in: New file.
* configure.ac (AC_OUTPUT): Add `liblinebreak.pc'.
* Makefile.am (pkgconfig_DATA): Set to `liblinebreak.pc'.
(pkgconfigdir): Set to `$(libdir)/pkgconfig'.
2011-05-07 Wu Yongwei <wuyongwei@gmail.com>
* README: Update the reference to UAX #14-26, for Unicode 6.0.0.
2011-05-07 Wu Yongwei <wuyongwei@gmail.com>
* configure.ac (AC_INIT): Increase the version to 2.1.
* Makefile.am (liblinebreak_la_LDFLAGS): Set the version-info to
`2:1'.
2011-05-07 Wu Yongwei <wuyongwei@gmail.com>
* LICENCE: Update the copyright year.
2011-05-07 Wu Yongwei <wuyongwei@gmail.com>
Update for the 2.1 release.
* Doxyfile (PROJECT_NUMBER): Set to `2.1'.
* NEWS: Add information about the 2.1 release.
* linebreak.h (LINEBREAK_VERSION): Set to `0x0201'.
* linebreak.h: Update comments.
* linebreak.c: Ditto.
* linebreakdef.h: Ditto.
* linebreakdef.c: Ditto.
2011-05-07 Wu Yongwei <wuyongwei@gmail.com>
* linebreakdata.c: Regenerate from LineBreak-6.0.0.txt.
2011-05-07 Wu Yongwei <wuyongwei@gmail.com>
* linebreak.c (set_linebreaks): Fix the assertion failure when
U+FFFC (OBJECT REPLACEMENT CHARACTER) appears at the beginning of a
line (thanks to Tom Hacohen).
2010-01-03 Wu Yongwei <wuyongwei@gmail.com>
* LICENCE: Update the copyright year.
2010-01-03 Wu Yongwei <wuyongwei@gmail.com>
* NEWS: Add information about the 2.0 release.
2010-01-03 Wu Yongwei <wuyongwei@gmail.com>
* Doxyfile (PROJECT_NUMBER): Set to `2.0'.
(HAVE_DOT): Set to `YES'.
2010-01-03 Wu Yongwei <wuyongwei@gmail.com>
* linebreak.c: Update the version number in comment to 2.0.
* linebreak.h: Ditto.
* linebreakdef.c: Ditto.
* linebreakdef.h: Ditto.
2009-12-17 Wu Yongwei <wuyongwei@gmail.com>
Change the values of enum BreakAction to the same length.
* linebreak.c (DIRECT_BRK): Rename to DIR_BRK.
(INDIRECT_BRK): Rename to IND_BRK.
(CM_INDIRECT_BRK): Rename to CMI_BRK.
(CM_PROHIBITED_BRK): Rename to CMP_BRK.
(PROHIBITED_BRK): Rename to PRH_BRK.
2009-11-29 Wu Yongwei <wuyongwei@gmail.com>
* Doxyfile (TAB_SIZE): Set to the correct size `4', as used in the
source files.
2009-11-29 Wu Yongwei <wuyongwei@gmail.com>
Update files according to UAX #14-24, for Unicode 5.2.0.
* linebreak.c: Update comments about UAX #14.
* linebreak.h: Ditto.
* linebreakdef.c: Ditto.
* linebreakdef.h: Ditto.
(LBP_CP): New enumerator for the new `CP' class as defined in
UAX #14-24.
* linebreak.c (baTable): Update for the new class `CP'.
* linebreakdata.c: Regenerate from LineBreak-5.2.0.txt.
* README: Update the reference to UAX #14-24, for Unicode 5.2.0.
2009-05-03 Wu Yongwei <wuyongwei@gmail.com>
* NEWS: Add information about the 1.2 release.
2009-04-30 Wu Yongwei <wuyongwei@gmail.com>
Optimize the Doxygen output.
* linebreak.c (lb_prop_index): Adjust its definition format
slightly.
2009-04-30 Wu Yongwei <wuyongwei@gmail.com>
* Doxyfile (USE_WINDOWS_ENCODING): Remove obsolete tag.
(DETAILS_AT_TOP): Ditto.
(MAX_DOT_GRAPH_WIDTH): Ditto.
(MAX_DOT_GRAPH_HEIGHT): Ditto.
(REFERENCED_BY_RELATION): Set to `NO'.
(REFERENCES_RELATION): Ditto.
(EXCLUDE): Add `filter_dup.c'.
2009-04-28 Wu Yongwei <wuyongwei@gmail.com>
* linebreak.c (lb_get_next_char_utf8): Fix the issue that the index
can point to the middle of a UTF-8 sequence if End of String (EOS)
is encountered prematurely (thanks to Nikolay Pultsin and Rick Xu).
(lb_get_next_char_utf16): Fix the issue that the index can point to
the middle of a UTF-16 surrogate pair if EOS is encountered
prematurely.
2009-04-20 Wu Yongwei <wuyongwei@gmail.com>
* linebreakdef.c (lb_prop_English): Remove the specialization of
right single quotation mark as closing punctuation mark, because it
can be used as apostrophe.
(lb_prop_Spanish): Ditto.
(lb_prop_French): Ditto.
2009-04-09 Wu Yongwei <wuyongwei@gmail.com>
* Makefile.msvc: Make the `clean' target work on MSVC versions other
than 6.0; do not use precompiled header.
2009-03-07 Wu Yongwei <wuyongwei@gmail.com>
* linebreak.h: Correct the wrong date in the documentation comment.
* linebreakdef.h: Ditto.
2009-02-10 Wu Yongwei <wuyongwei@gmail.com>
* configure.ac (AC_INIT): Increase the version to 2.0.
* Makefile.am (liblinebreak_la_LDFLAGS): Set the version-info to
`2:0'.
2009-02-10 Wu Yongwei <wuyongwei@gmail.com>
* linebreak.h (LINEBREAK_VERSION): New macro.
(linebreak_version): New global constant declaration.
* linebreak.c (linebreak_version): New global constant definition.
2009-02-10 Wu Yongwei <wuyongwei@gmail.com>
Reduce namespace pollution.
* linebreak.c (get_lb_prop_lang): Mark as static.
(get_next_char_utf8): Rename to lb_get_next_char_utf8.
(get_next_char_utf16): Rename to lb_get_next_char_utf16.
(get_next_char_utf32): Rename to lb_get_next_char_utf32.
(is_breakable): Rename to is_line_breakable.
* linebreak.h (get_next_char_utf8): Remove the function prototype
declaration.
(get_next_char_utf16): Ditto.
(get_next_char_utf32): Ditto.
(is_breakable): Rename to is_line_breakable.
* linebreakdef.h (lb_get_next_char_utf8): Add the function prototype
declaration.
(lb_get_next_char_utf16): Ditto.
(lb_get_next_char_utf32): Ditto.
2009-02-06 Wu Yongwei <wuyongwei@gmail.com>
* NEWS: Add information about the 1.1 release.
2009-01-02 Wu Yongwei <wuyongwei@gmail.com>
* Makefile.am (EXTRA_DIST): Add the missing `LICENCE' file.
2008-12-31 Wu Yongwei <wuyongwei@gmail.com>
* linebreak.c: Update the version number in comment to 1.0.
* linebreak.h: Ditto.
* linebreakdef.c: Ditto.
* linebreakdef.h: Ditto.
2008-12-31 Wu Yongwei <wuyongwei@gmail.com>
* NEWS: Update for the 1.0 release.
2008-12-31 Wu Yongwei <wuyongwei@gmail.com>
* README: Correct two typos.
2008-12-31 Wu Yongwei <wuyongwei@gmail.com>
* README: Add the online URL reference.
2008-12-30 Wu Yongwei <wuyongwei@gmail.com>
* README: Update the reference to UAX #14-22, for Unicode 5.1.0.
2008-12-13 Wu Yongwei <wuyongwei@gmail.com>
Update files according to UAX #14-22, for Unicode 5.1.0.
* linebreak.c (baTable): Update according to Table 2 of UAX #14-22.
* linebreakdef.c (lb_prop_Spanish): Remove the unnecessary
customization for inverted marks in Spanish.
* linebreakdata.c: Regenerate from LineBreak-5.1.0.txt.
* linebreak.h: Update comment only.
* linebreakdef.h: Ditto.
2008-12-12 Wu Yongwei <wuyongwei@gmail.com>
* README: Update for the new build methods and better readability.
2008-12-12 Wu Yongwei <wuyongwei@gmail.com>
* Makefile.msvc: Correct the inconsistent naming in the output
message.
2008-12-12 Wu Yongwei <wuyongwei@gmail.com>
* configure.ac (AM_INIT_AUTOMAKE): Mark `foreign'.
* bootstrap: New file.
* purge: New file.
* Makefile.gcc (purge): Remove this target.
2008-12-10 Wu Yongwei <wuyongwei@gmail.com>
* NEWS: New file.
2008-12-10 Wu Yongwei <wuyongwei@gmail.com>
* AUTHORS: New file.
2008-12-10 Wu Yongwei <wuyongwei@gmail.com>
* Makefile.gcc (purge): New phony target to purge files generated by
autoconfiscation.
2008-12-10 Thomas Klausner <tk@giga.or.at>
* configure.ac: New file.
* Makefile.am: New file.
2008-12-10 Wu Yongwei <wuyongwei@gmail.com>
* Doxyfile (OUTPUT_DIRECTORY): Set to `doc'.
(ALPHABETICAL_INDEX): Set to `YES'.
2008-12-09 Wu Yongwei <wuyongwei@gmail.com>
* Makefile.msvc: New file.
2008-12-09 Wu Yongwei <wuyongwei@gmail.com>
* Makefile: Remove (to become Makefile.gcc).
* Makefile.gcc: New file (was Makefile).
2008-12-07 Wu Yongwei <wuyongwei@gmail.com>
* linebreak.c: Adjust the comment that refers to Unicode Annex 14.
* linebreak.h: Ditto.
* linebreakdef.c: Ditto.
* linebreakdef.h: Ditto.
2008-12-07 Wu Yongwei <wuyongwei@gmail.com>
Use only POSIX basic regexp to ensure maximum portability (issues
have been found on Mac OS X, where GNU extensions do not work).
* LineBreak1.sed: Replace `[:xdigit:]' with `0-9A-F', and `\+' with
`\{1,\}'.
* LineBreak2.sed: Ditto.
2008-12-07 Wu Yongwei <wuyongwei@gmail.com>
* Makefile: Replace `*.exe' with `filter_dup$(EXEEXT)', since the
extension `.exe' is specific to Windows.
2008-04-20 Wu Yongwei <wuyongwei@gmail.com>
Add README and LICENCE files, as well as a Doxyfile to generate
documents.
* README: New file.
* LICENCE: New file.
* Doxyfile: New file.
* Makefile (doc): Add new phony target.
2008-04-04 Wu Yongwei <wuyongwei@gmail.com>
Remove the English override for plus sign: it is better treated in
the text breaking program (see ../breaktext/ for an example).
* linebreakdef.c (lb_prop_English): Remove the line for plus sign.
2008-03-29 Wu Yongwei <wuyongwei@gmail.com>
* Makefile: Correct the dependency-making rules when OLDGCC=Y.
2008-03-23 Wu Yongwei <wuyongwei@gmail.com>
* Makefile (clean): Do not remove *.exe and tags here.
(distclean): Remove *.exe and tags.
2008-03-23 Wu Yongwei <wuyongwei@gmail.com>
Remove the English override for solidus: it is better treated in the
text breaking program (see ../breaktext/ for an example).
* linebreakdef.c (lb_prop_English): Remove the line for solidus.
2008-03-16 Wu Yongwei <wuyongwei@gmail.com>
Rename init_linebreak_prop_index to init_linebreak for future
safety; make visible certain functions that are potentially useful.
* linebreak.c (init_linebreak_prop_index): Rename to init_linebreak.
(get_next_char_t): Move to linebreakdef.h.
(get_next_char_utf8): Make non-static.
(get_next_char_utf16): Ditto.
(get_next_char_utf32): Ditto.
(set_linebreaks): Ditto.
* linebreak.h (init_linebreak_prop_index): Rename to init_linebreak.
(get_next_char_utf8): Add the function prototype.
(get_next_char_utf16): Ditto.
(get_next_char_utf32): Ditto.
* linebreakdef.h (get_next_char_t): Add the typedef.
(set_linebreaks): Add the function prototype.
2008-03-16 Wu Yongwei <wuyongwei@gmail.com>
* Makefile (OLDGCC): Add support for GCC 2.95.3 (when OLDGCC=Y).
2008-03-15 Wu Yongwei <wuyongwei@gmail.com>
* linebreak.c (set_linebreaks): Fix a bug that `==' was wrongly used
for `='.
2008-03-05 Wu Yongwei <wuyongwei@gmail.com>
Improve the performance by reducing the look-ups of the
language-specific line breaking properties array from the language
name (thanks to Nikolay Pultsin).
* linebreak.c (get_lb_prop_lang): New function.
(get_char_lb_class_lang): Change the second parameter from the
language name to the line breaking properties array.
(set_linebreaks): Look up the language-specific line breaking
properties array from the language name only once in one function
call.
2008-03-03 Wu Yongwei <wuyongwei@gmail.com>
Make minor adjustments in code and comments.
* linebreak.c: Adjust the doc comments.
(init_linebreak_prop_index): Modify a conditional to make it more
robust and consistent.
* linebreakdef.c (lb_prop_lang_map): Replace the pointer
lb_prop_default with NULL, since the value is never used.
2008-03-03 Wu Yongwei <wuyongwei@gmail.com>
Accelerate get_char_lb_class for invalid Unicode code points.
* linebreak.c (get_char_lb_class): Adjust the conditionals so that
getting the line breaking class for an invalid code point is much
faster, which requires the array of line breaking properties be
sorted.
* linebreakdef.h: Adjust a comment that the array of line break
properties must be sorted.
2008-03-02 Wu Yongwei <wuyongwei@gmail.com>
Change the values of enum BreakAction to more complete forms.
* linebreak.c (INDRCT_BRK): Rename to INDIRECT_BRK.
(CM_INDRCT_BRK): Rename to CM_INDIRECT_BRK.
(CM_PROHIBTD_BRK): Rename to CM_PROHIBITED_BRK.
(PROHIBTD_BRK): Rename to PROHIBITED_BRK.
2008-03-02 Wu Yongwei <wuyongwei@gmail.com>
Implement a two-stage search in get_char_lb_class_default to
accelerate the overall performance, especially for non-Latin
languages.
* linebreak.c (LINEBREAK_INDEX_SIZE): New constant macro.
(struct LineBreakPropertiesIndex): New struct.
(lb_prop_index): New static variable.
(init_linebreak_prop_index): New function.
(get_char_lb_class_default): New function.
(get_char_lb_class_lang): Use get_char_lb_class_default.
* linebreak.h: Detect C++ and add extern "C" guard if necessary.
(init_linebreak_prop_index): Add the prototype declaration.
* linebreakdef.h: Adjust a comment.
2008-03-02 Wu Yongwei <wuyongwei@gmail.com>
Split/refactor the code; add (doc) comments.
* Makefile (CFILES): Add linebreakdata.c and linebreakdef.c.
* linebreak.c: Add and adjust comments.
(linebreakdef.h): Add include file.
(linebreakdata.c): Remove include file.
(EOS): Remove (now in linebreakdef.h).
(enum LineBreakClass): Ditto.
(struct LineBreakProperties): Ditto.
(lbpEnglish): Remove (now in linebreakdef.c as lb_prop_English).
(lbpGerman): Remove (now in linebreakdef.c as lb_prop_German).
(lbpSpanish): Remove (now in linebreakdef.c as lb_prop_Spanish).
(lbpFrench): Remove (now in linebreakdef.c as lb_prop_French).
(lbpRussian): Remove (now in linebreakdef.c as lb_prop_Russian).
(lbpChinese): Remove (now in linebreakdef.c as lb_prop_Chinese).
(struct LineBreakPropertiesLang): Remove (now in linebreakdef.h).
(lbpLangs): Remove (now in linebreakdef.c as lb_prop_lang_map).
(get_next_char_utf16): Make sure memory access does not go beyond len.
* linebreak.h: Add copyright information and adjust comments.
(stddef.h): Add include file.
* linebreakdata.c (linebreak.h): Add include file.
(linebreakdef.h): Add include file.
(lbpDefault): Make global and rename to lb_prop_default.
* linebreakdata2.tmpl: Add two include files, a comment line, and
remove `static'.
* linebreakdef.c: New file.
* linebreakdef.h: New file.
2008-02-26 Wu Yongwei <wuyongwei@gmail.com>
* linebreak.c (lbpSpanish): New array for Spanish-specific data.
(lbpLangs): Update the index array for Spanish.
(resolve_lb_class): Resolve AmbIguous class to IDeographic in
Chinese, Japanese, and Korean.
2008-02-26 Wu Yongwei <wuyongwei@gmail.com>
* Makefile (LineBreak.txt): Add new rule to retrieve it from the Web
if it is not already there.
2008-02-23 Wu Yongwei <wuyongwei@gmail.com>
Add files for linebreak.
* LineBreak1.sed: New file.
* LineBreak2.sed: New file.
* Makefile: New file.
* filter_dup.c: New file.
* linebreak.c: New file.
* linebreak.h: New file.
* linebreakdata.c: New file.
* linebreakdata1.tmpl: New file.
* linebreakdata2.tmpl: New file.
* linebreakdata3.tmpl: New file.

linebreak/linebreak/Doxyfile (new file, 1219 lines)

File diff suppressed because it is too large.

@ -0,0 +1,19 @@
Copyright (C) 2008-2012 Wu Yongwei <wuyongwei at gmail dot com>
Copyright (C) 2012 Tom Hacohen <tom dot hacohen at samsung dot com>
This software is provided 'as-is', without any express or implied
warranty. In no event will the author be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgement in the product documentation would
be appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not
be misrepresented as being the original software.
3. This notice may not be removed or altered from any source
distribution.

@ -0,0 +1 @@
s/\(^[0-9A-F.]\{1,\};[A-Z][A-Z0-9]\) #.*/\1/p

@ -0,0 +1,2 @@
s/^\([0-9A-F]\{1,\}\);/\1..\1;/
s/^\([0-9A-F]\{1,\}\)\.\.\([0-9A-F]\{1,\}\);\([A-Z][A-Z0-9]\)/ { 0x\1, 0x\2, LBP_\3 },/
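
As a rough illustration of what the two sed scripts above do to each data line of LineBreak.txt, here is a Python sketch. It is not part of this commit; the function name to_initializer is hypothetical, and the leading indentation that LineBreak2.sed emits is omitted. It keeps only lines of the form `CODE;XX # comment` or `FIRST..LAST;XX # comment`, expands a single code point into a one-element range, and formats the result as a C initializer.

```python
import re

# Matches "0041..005A;AL # ..." or "000A;LF # ...": hex code point(s),
# a two-character class, then the trailing comment that sed strips.
_ENTRY = re.compile(r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))?;([A-Z][A-Z0-9]) #")


def to_initializer(line):
    """Return the C initializer for one LineBreak.txt line, or None."""
    m = _ENTRY.match(line)
    if m is None:
        return None  # comment or blank line: sed -n prints nothing
    first = m.group(1)
    last = m.group(2) or first  # single code point becomes FIRST..FIRST
    return "{ 0x%s, 0x%s, LBP_%s }," % (first, last, m.group(3))


if __name__ == "__main__":
    for sample in ("000A;LF # <control>", "0041..005A;AL # LATIN CAPITAL"):
        print(to_initializer(sample))
```

In the real pipeline the output of the second script is additionally piped through filter_dup, which this sketch does not model.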

@ -0,0 +1,63 @@
#noinst_PROGRAMS = filter_dup
include_HEADERS = linebreak.h linebreakdef.h wordbreak.h wordbreakdef.h
lib_LTLIBRARIES = libunibreak.la
pkgconfig_DATA = libunibreak.pc
pkgconfigdir = ${libdir}/pkgconfig
libunibreak_la_LDFLAGS = -no-undefined -version-info 1:0
libunibreak_la_SOURCES = \
linebreak.c \
linebreakdata.c \
linebreakdef.c \
wordbreak.c
EXTRA_DIST = \
LineBreak1.sed \
LineBreak2.sed \
linebreakdata1.tmpl \
linebreakdata2.tmpl \
linebreakdata3.tmpl \
wordbreakdata1.tmpl \
wordbreakdata2.tmpl \
wordbreakdata.c \
LICENCE \
Doxyfile \
Makefile.gcc \
Makefile.msvc \
doc \
sort_numeric_hex.py
install-exec-hook:
rm -f ${libdir}/liblinebreak.a
${LN_S} ${libdir}/libunibreak.a ${libdir}/liblinebreak.a
distclean-local:
rm -f LineBreak.txt WordBreakProperty.txt filter_dup${EXEEXT}
doc:
cd ${top_srcdir} && doxygen
LineBreak.txt:
wget http://unicode.org/Public/UNIDATA/LineBreak.txt
WordBreakProperty.txt:
wget http://www.unicode.org/Public/UNIDATA/auxiliary/WordBreakProperty.txt
linebreakdata: ${builddir}/filter_dup LineBreak.txt
sed -n -f ${top_srcdir}/LineBreak1.sed LineBreak.txt > tmp.txt
sed -f ${top_srcdir}/LineBreak2.sed tmp.txt | ${builddir}/filter_dup > tmp.c
head -2 LineBreak.txt > tmp.txt
cat ${top_srcdir}/linebreakdata1.tmpl tmp.txt ${top_srcdir}/linebreakdata2.tmpl tmp.c ${top_srcdir}/linebreakdata3.tmpl > ${top_srcdir}/linebreakdata.c
rm tmp.txt tmp.c
wordbreakdata: WordBreakProperty.txt
sed -E -n 's/(^[0-9A-F.]+)/\1/p' WordBreakProperty.txt > tmp2.txt
sed -E -i.bak 's/^([0-9A-F]+) +/\1..\1/' tmp2.txt
${top_srcdir}/sort_numeric_hex.py tmp2.txt > tmp.txt
rm tmp2.txt tmp2.txt.bak
sed -E -i.bak -n 's/^([0-9A-F]+)..([0-9A-F]+) *; *([A-Za-z]+).*/'$$'\t''{0x\1, 0x\2, WBP_\3},/p' tmp.txt
echo "/* The content of this file is generated from:" > ${top_srcdir}/wordbreakdata.c
head -2 WordBreakProperty.txt >> ${top_srcdir}/wordbreakdata.c
echo "*/" >> ${top_srcdir}/wordbreakdata.c
cat ${top_srcdir}/wordbreakdata1.tmpl tmp.txt ${top_srcdir}/wordbreakdata2.tmpl >> ${top_srcdir}/wordbreakdata.c
rm tmp.txt tmp.txt.bak
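
The wordbreakdata rule above is dense, so here is a rough Python sketch of its net effect on the data lines of WordBreakProperty.txt. It is not part of this commit. The function name wordbreak_entries is hypothetical, and the sketch assumes that sort_numeric_hex.py (whose source is not shown here) orders entries numerically by their first code point.

```python
import re

# Matches "0041..005A    ; ALetter # ..." or "000D          ; CR # ...".
_LINE = re.compile(r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))? *; *([A-Za-z]+)")


def wordbreak_entries(lines):
    """Return sorted C initializer lines for the given data lines."""
    ranges = []
    for line in lines:
        m = _LINE.match(line)
        if m is None:
            continue  # comments and blank lines are dropped
        first = m.group(1)
        last = m.group(2) or first  # single code point becomes a range
        ranges.append((int(first, 16), first, last, m.group(3)))
    # Assumed behaviour of sort_numeric_hex.py: numeric order of range start.
    ranges.sort(key=lambda r: r[0])
    return ["\t{0x%s, 0x%s, WBP_%s}," % (f, l, p) for _, f, l, p in ranges]


if __name__ == "__main__":
    sample = [
        "# WordBreakProperty sample",
        "0041..005A    ; ALetter # L&",
        "000D          ; CR # Cc",
    ]
    print("\n".join(wordbreak_entries(sample)))
```

The rule then wraps these lines between wordbreakdata1.tmpl and wordbreakdata2.tmpl to produce wordbreakdata.c.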

@ -0,0 +1,177 @@
# Windows/Cygwin support
ifdef windir
WINDOWS := 1
CYGWIN := 0
else
ifdef WINDIR
WINDOWS := 1
CYGWIN := 1
else
WINDOWS := 0
endif
endif
ifeq ($(WINDOWS),1)
EXEEXT := .exe
DLLEXT := .dll
DEVNUL := nul
ifeq ($(CYGWIN),1)
PATHSEP := /
else
PATHSEP := $(strip \ )
endif
else
EXEEXT :=
DLLEXT := .so
DEVNUL := /dev/null
PATHSEP := /
endif
CFG ?= Debug
ifeq ($(CFG),Debug)
all: debug
else
all: release
endif
OLDGCC ?= N
DEBUG := DebugDir
RELEASE := ReleaseDir
$(DEBUG)/%.o: %.c
$(CC) $(CFLAGS) $(CPPFLAGS) $(DBGFLAGS) $(TARGET_ARCH) -c -o $@ $<
$(RELEASE)/%.o: %.c
$(CC) $(CFLAGS) $(CPPFLAGS) $(RELFLAGS) $(TARGET_ARCH) -c -o $@ $<
$(DEBUG)/%.o: %.cpp
$(CXX) $(CXXFLAGS) $(CPPFLAGS) $(DBGFLAGS) $(TARGET_ARCH) -c -o $@ $<
$(RELEASE)/%.o: %.cpp
$(CXX) $(CXXFLAGS) $(CPPFLAGS) $(RELFLAGS) $(TARGET_ARCH) -c -o $@ $<
ifeq ($(OLDGCC),N)
$(DEBUG)/%.dep: %.c
$(CC) -MM -MT $(patsubst %.dep,%.o,$@) $(CFLAGS) $(CPPFLAGS) $(DBGFLAGS) $(TARGET_ARCH) -o $@ $<
$(RELEASE)/%.dep: %.c
$(CC) -MM -MT $(patsubst %.dep,%.o,$@) $(CFLAGS) $(CPPFLAGS) $(RELFLAGS) $(TARGET_ARCH) -o $@ $<
$(DEBUG)/%.dep: %.cpp
$(CXX) -MM -MT $(patsubst %.dep,%.o,$@) $(CXXFLAGS) $(CPPFLAGS) $(DBGFLAGS) $(TARGET_ARCH) -o $@ $<
$(RELEASE)/%.dep: %.cpp
$(CXX) -MM -MT $(patsubst %.dep,%.o,$@) $(CXXFLAGS) $(CPPFLAGS) $(RELFLAGS) $(TARGET_ARCH) -o $@ $<
else
$(DEBUG)/%.dep: %.c
$(CC) -MM $(CFLAGS) $(CPPFLAGS) $(DBGFLAGS) $(TARGET_ARCH) $< | sed "s!^!$(DEBUG)/!" > $@
$(RELEASE)/%.dep: %.c
$(CC) -MM $(CFLAGS) $(CPPFLAGS) $(RELFLAGS) $(TARGET_ARCH) $< | sed "s!^!$(RELEASE)/!" > $@
$(DEBUG)/%.dep: %.cpp
$(CXX) -MM $(CXXFLAGS) $(CPPFLAGS) $(DBGFLAGS) $(TARGET_ARCH) $< | sed "s!^!$(DEBUG)/!" > $@
$(RELEASE)/%.dep: %.cpp
$(CXX) -MM $(CXXFLAGS) $(CPPFLAGS) $(RELFLAGS) $(TARGET_ARCH) $< | sed "s!^!$(RELEASE)/!" > $@
endif
CC = gcc
CXX = g++
AR = ar
LD = $(CXX) $(CXXFLAGS) $(TARGET_ARCH)
INCLUDE = -I. $(patsubst %,-I%,$(VPATH))
CFLAGS = -W -Wall $(INCLUDE)
CXXFLAGS = $(CFLAGS)
DBGFLAGS = -D_DEBUG -g
RELFLAGS = -DNDEBUG -O2
CPPFLAGS =
ifeq ($(OLDGCC),N)
CFLAGS += -fmessage-length=0
endif
HFILES = $(wildcard $(patsubst -I%,%/*.h,$(INCLUDE)))
OBJFILES = $(CFILES:.c=.o) $(CXXFILES:.cpp=.o)
DEBUG_OBJS = $(patsubst %.o,$(DEBUG)/%.o,$(OBJFILES))
RELEASE_OBJS = $(patsubst %.o,$(RELEASE)/%.o,$(OBJFILES))
DEBUG_DEPS = $(patsubst %.o,%.dep,$(DEBUG_OBJS))
RELEASE_DEPS = $(patsubst %.o,%.dep,$(RELEASE_OBJS))
CFILES := linebreak.c linebreakdata.c linebreakdef.c wordbreak.c
CXXFILES :=
LIBS :=
TARGET = liblinebreak.a
DEBUG_TARGET = $(patsubst %,$(DEBUG)/%,$(TARGET))
RELEASE_TARGET = $(patsubst %,$(RELEASE)/%,$(TARGET))
debug: $(DEBUG) $(DEBUG_TARGET)
release: $(RELEASE) $(RELEASE_TARGET)
$(DEBUG):
mkdir $(DEBUG)
$(RELEASE):
mkdir $(RELEASE)
$(DEBUG_TARGET): $(DEBUG_DEPS) $(DEBUG_OBJS)
$(AR) -r $(DEBUG_TARGET) $(DEBUG_OBJS)
$(RELEASE_TARGET): $(RELEASE_DEPS) $(RELEASE_OBJS)
$(AR) -r $(RELEASE_TARGET) $(RELEASE_OBJS)
doc:
doxygen
linebreakdata: filter_dup$(EXEEXT) LineBreak.txt
sed -n -f LineBreak1.sed LineBreak.txt > tmp.txt
sed -f LineBreak2.sed tmp.txt | .$(PATHSEP)filter_dup > tmp.c
head -2 LineBreak.txt > tmp.txt
cat linebreakdata1.tmpl tmp.txt linebreakdata2.tmpl tmp.c linebreakdata3.tmpl > linebreakdata.c
$(RM) tmp.txt tmp.c
wordbreakdata: WordBreakProperty.txt
sed -E -n 's/(^[0-9A-F.]+)/\1/p' WordBreakProperty.txt > tmp2.txt
sed -E -i.bak 's/^([0-9A-F]+) +/\1..\1/' tmp2.txt
./sort_numeric_hex.py tmp2.txt > tmp.txt
rm tmp2.txt tmp2.txt.bak
sed -E -i.bak -n 's/^([0-9A-F]+)..([0-9A-F]+) *; *([A-Za-z]+).*/'$$'\t''{0x\1, 0x\2, WBP_\3},/p' tmp.txt
echo "/* The content of this file is generated from:" > wordbreakdata.c
head -2 WordBreakProperty.txt >> wordbreakdata.c
echo "*/" >> wordbreakdata.c
cat wordbreakdata1.tmpl tmp.txt wordbreakdata2.tmpl >> wordbreakdata.c
rm tmp.txt tmp.txt.bak
filter_dup$(EXEEXT): filter_dup.c
gcc -O2 -o filter_dup$(EXEEXT) $<
LineBreak.txt:
wget http://unicode.org/Public/UNIDATA/LineBreak.txt
WordBreakProperty.txt:
wget http://www.unicode.org/Public/UNIDATA/auxiliary/WordBreakProperty.txt
.PHONY: all debug release clean distclean doc linebreakdata wordbreakdata
clean:
$(RM) $(DEBUG)/*.o $(DEBUG)/*.dep $(DEBUG_TARGET)
$(RM) $(RELEASE)/*.o $(RELEASE)/*.dep $(RELEASE_TARGET)
distclean: clean
$(RM) $(DEBUG)/* $(RELEASE)/* filter_dup$(EXEEXT) tags LineBreak.txt
-rmdir $(DEBUG) 2> $(DEVNUL)
-rmdir $(RELEASE) 2> $(DEVNUL)
-include $(wildcard $(DEBUG)/*.dep) $(wildcard $(RELEASE)/*.dep)

@ -0,0 +1,189 @@
# Makefile for Microsoft Visual C++ and NMAKE
!IF "$(CFG)" == ""
CFG=libunibreak - Win32 Debug
!MESSAGE No configuration specified. Defaulting to libunibreak - Win32 Debug.
!ENDIF
!IF "$(CFG)" != "libunibreak - Win32 Release" && "$(CFG)" != "libunibreak - Win32 Debug"
!MESSAGE Invalid configuration "$(CFG)" specified.
!MESSAGE You can specify a configuration when running NMAKE
!MESSAGE by defining the macro CFG on the command line. For example:
!MESSAGE
!MESSAGE NMAKE /f Makefile.msvc CFG="libunibreak - Win32 Debug"
!MESSAGE
!MESSAGE Possible choices for configuration are:
!MESSAGE
!MESSAGE "libunibreak - Win32 Release" (based on "Win32 (x86) Static Library")
!MESSAGE "libunibreak - Win32 Debug" (based on "Win32 (x86) Static Library")
!MESSAGE
!ERROR An invalid configuration is specified.
!ENDIF
!IF "$(OS)" == "Windows_NT"
NULL=
!ELSE
NULL=nul
!ENDIF
CPP=cl.exe
RSC=rc.exe
!IF "$(CFG)" == "libunibreak - Win32 Release"
OUTDIR=.\Release
INTDIR=.\Release
# Begin Custom Macros
OutDir=.\Release
# End Custom Macros
ALL : "$(OUTDIR)\unibreak.lib"
CLEAN :
-@erase "$(INTDIR)\linebreak.obj"
-@erase "$(INTDIR)\linebreakdata.obj"
-@erase "$(INTDIR)\linebreakdef.obj"
-@erase "$(INTDIR)\wordbreak.obj"
-@erase "$(INTDIR)\vc*.idb"
-@erase "$(OUTDIR)\unibreak.lib"
"$(OUTDIR)" :
if not exist "$(OUTDIR)/$(NULL)" mkdir "$(OUTDIR)"
CPP_PROJ=/nologo /ML /W3 /GX /O2 /D "WIN32" /D "NDEBUG" /D "_MBCS" /D "_LIB" /Fo"$(INTDIR)\\" /Fd"$(INTDIR)\\" /FD /c
BSC32=bscmake.exe
BSC32_FLAGS=/nologo /o"$(OUTDIR)\unibreak.bsc"
BSC32_SBRS= \
LIB32=link.exe -lib
LIB32_FLAGS=/nologo /out:"$(OUTDIR)\unibreak.lib"
LIB32_OBJS= \
"$(INTDIR)\linebreak.obj" \
"$(INTDIR)\linebreakdata.obj" \
"$(INTDIR)\linebreakdef.obj" \
"$(INTDIR)\wordbreak.obj"
"$(OUTDIR)\unibreak.lib" : "$(OUTDIR)" $(DEF_FILE) $(LIB32_OBJS)
$(LIB32) @<<
$(LIB32_FLAGS) $(DEF_FLAGS) $(LIB32_OBJS)
<<
!ELSEIF "$(CFG)" == "libunibreak - Win32 Debug"
OUTDIR=.\Debug
INTDIR=.\Debug
# Begin Custom Macros
OutDir=.\Debug
# End Custom Macros
ALL : "$(OUTDIR)\unibreak.lib"
CLEAN :
-@erase "$(INTDIR)\linebreak.obj"
-@erase "$(INTDIR)\linebreakdata.obj"
-@erase "$(INTDIR)\linebreakdef.obj"
-@erase "$(INTDIR)\wordbreak.obj"
-@erase "$(INTDIR)\vc*.idb"
-@erase "$(INTDIR)\vc*.pdb"
-@erase "$(OUTDIR)\unibreak.lib"
"$(OUTDIR)" :
if not exist "$(OUTDIR)/$(NULL)" mkdir "$(OUTDIR)"
CPP_PROJ=/nologo /MLd /W3 /Gm /GX /ZI /Od /D "WIN32" /D "_DEBUG" /D "_MBCS" /D "_LIB" /Fo"$(INTDIR)\\" /Fd"$(INTDIR)\\" /FD /GZ /c
BSC32=bscmake.exe
BSC32_FLAGS=/nologo /o"$(OUTDIR)\unibreak.bsc"
BSC32_SBRS= \
LIB32=link.exe -lib
LIB32_FLAGS=/nologo /out:"$(OUTDIR)\unibreak.lib"
LIB32_OBJS= \
"$(INTDIR)\linebreak.obj" \
"$(INTDIR)\linebreakdata.obj" \
"$(INTDIR)\linebreakdef.obj" \
"$(INTDIR)\wordbreak.obj"
"$(OUTDIR)\unibreak.lib" : "$(OUTDIR)" $(DEF_FILE) $(LIB32_OBJS)
$(LIB32) @<<
$(LIB32_FLAGS) $(DEF_FLAGS) $(LIB32_OBJS)
<<
!ENDIF
.c{$(INTDIR)}.obj::
$(CPP) @<<
$(CPP_PROJ) $<
<<
.cpp{$(INTDIR)}.obj::
$(CPP) @<<
$(CPP_PROJ) $<
<<
.cxx{$(INTDIR)}.obj::
$(CPP) @<<
$(CPP_PROJ) $<
<<
.c{$(INTDIR)}.sbr::
$(CPP) @<<
$(CPP_PROJ) $<
<<
.cpp{$(INTDIR)}.sbr::
$(CPP) @<<
$(CPP_PROJ) $<
<<
.cxx{$(INTDIR)}.sbr::
$(CPP) @<<
$(CPP_PROJ) $<
<<
.\linebreak.c : \
".\linebreak.h"\
".\linebreakdef.h"\
.\linebreakdata.c : \
".\linebreak.h"\
".\linebreakdef.h"\
.\linebreakdef.c : \
".\linebreak.h"\
".\linebreakdef.h"\
.\wordbreak.c : \
".\linebreak.h"\
".\linebreakdef.h"\
".\wordbreak.h"\
".\wordbreakdef.h"\
".\wordbreakdata.c"\
!IF "$(CFG)" == "libunibreak - Win32 Release" || "$(CFG)" == "libunibreak - Win32 Debug"
SOURCE=.\linebreak.c
"$(INTDIR)\linebreak.obj" : $(SOURCE) "$(INTDIR)"
SOURCE=.\linebreakdata.c
"$(INTDIR)\linebreakdata.obj" : $(SOURCE) "$(INTDIR)"
SOURCE=.\linebreakdef.c
"$(INTDIR)\linebreakdef.obj" : $(SOURCE) "$(INTDIR)"
SOURCE=.\wordbreak.c
"$(INTDIR)\wordbreak.obj" : $(SOURCE) "$(INTDIR)"
!ENDIF

linebreak/linebreak/NEWS (new file, 49 lines)

@ -0,0 +1,49 @@
New in libunibreak 1.0
- Add word breaking support
- Change the library name to "libunibreak", while keeping maximum compatibility
- Add pkg-config support
New in liblinebreak 2.1
- Update the data according to LineBreak-6.0.0.txt
- Fix the bug that an assertion in code can fail if U+FFFC is
encountered at the beginning of a line
New in liblinebreak 2.0
- Update the algorithm and data according to UAX #14-24 and
LineBreak-5.2.0.txt
- Rename some functions to reduce namespace pollution
- Make Doxygen documentation better
New in liblinebreak 1.2
- Fix the bug that an assertion in code can fail if an invalid UTF-8 or
UTF-16 sequence is encountered near the end of input
- Remove the specialization of right single quotation mark as closing
punctuation mark in English, French, and Spanish, because it can be
used as apostrophe
- Make Doxygen documentation better
New in liblinebreak 1.1
- Make get_lb_prop_lang static and not an exported symbol
- Define is_line_breakable as an alias of is_breakable
- Announce that get_next_char_utf* will be renamed to lb_get_next_char_utf*
- Move the declarations of get_next_char_utf* from linebreak.h to
linebreakdef.h
- Add the function documentation comments to the header files
New in liblinebreak 1.0
- Update the line breaking data according to UAX #14-22 and
LineBreak-5.1.0.txt
- Add autoconfiscation support (./configure, make, make install)
- Add Makefile for MSVC
First public release (0.9.6, or 20080421)
- Implement line breaking algorithm according to UAX #14-19
- Line breaking data is generated from LineBreak-5.0.0.txt
- Makefile only supports GCC

@ -0,0 +1,88 @@
L I B U N I B R E A K
=====================
Overview
--------
This is the README file for libunibreak, an implementation of the line
breaking and word breaking algorithms as described in Unicode
Standard Annex 14 and Unicode Standard Annex 29, available at
<URL:http://www.unicode.org/reports/tr14/tr14-26.html>
<URL:http://www.unicode.org/reports/tr29/tr29-17.html>
Check this URL for up-to-date information:
<URL:http://vimgadgets.sourceforge.net/libunibreak/>
Licence
-------
This library is released under an open-source licence, the zlib/libpng
licence. Please check the file LICENCE for details.
Apart from using the algorithm, part of the code is derived from the
data provided under
<URL:http://www.unicode.org/Public/>
And the Unicode Terms of Use may apply:
<URL:http://www.unicode.org/copyright.html>
Installation
------------
There are three ways to build the library:
1) On *NIX systems supported by the autoconfiscation tools, do the
normal
./configure
make
sudo make install
to build and install both the dynamic and static libraries. In
addition, one may
- type `make doc' to generate the doxygen documentation;
- type `make linebreakdata' to regenerate linebreakdata.c from
LineBreak.txt; or
- type `make wordbreakdata' to regenerate wordbreakdata.c from
WordBreakProperty.txt.
2) On systems where GCC and Binutils are supported, one can type
cp -p Makefile.gcc Makefile
make
to build the static library. In addition, one may
- type `make debug' or `make release' to explicitly generate the
debug or release build;
- type `make doc' to generate the doxygen documentation;
- type `make linebreakdata' to regenerate linebreakdata.c from
LineBreak.txt; or
- type `make wordbreakdata' to regenerate wordbreakdata.c from
WordBreakProperty.txt.
3) On Windows, apart from using method 1 (Cygwin/MSYS) and method 2
(MinGW), MSVC can also be used. Type
nmake -f Makefile.msvc
to build the static library. By default the debug version is built.
To build the release version, type
nmake -f Makefile.msvc CFG="libunibreak - Win32 Release"
Documentation
-------------
Check the generated documents doc/html/linebreak_8h.html and
doc/html/wordbreak_8h.html in the downloaded file for the public
interfaces exposed to applications.
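As a rough illustration of how an application might consume the output,
the sketch below picks a wrap position from a brks array such as the one
filled in by set_linebreaks_utf8. It is not part of libunibreak: the
helper name wrap_line_length is hypothetical, and the LINEBREAK_* values
are copied from linebreak.h only so that the fragment compiles on its
own.

```c
#include <assert.h>
#include <stddef.h>

/* Values as defined in linebreak.h; repeated here only so that this
 * sketch compiles without the library headers. */
#ifndef LINEBREAK_MUSTBREAK
#define LINEBREAK_MUSTBREAK   0
#define LINEBREAK_ALLOWBREAK  1
#define LINEBREAK_NOBREAK     2
#define LINEBREAK_INSIDEACHAR 3
#endif

/*
 * Hypothetical helper, not a libunibreak function.  brks[i] describes
 * the break opportunity after code unit i.  Returns how many code units
 * to place on the current line when at most max_units of them fit.
 */
size_t wrap_line_length(const char *brks, size_t len, size_t max_units)
{
    size_t limit = len < max_units ? len : max_units;
    size_t best = 0;    /* 0 means no allowed break seen yet */
    size_t n;
    size_t i;

    assert(brks != NULL);
    for (i = 0; i < limit; ++i)
    {
        if (brks[i] == LINEBREAK_MUSTBREAK)
            return i + 1;           /* Mandatory break ends the line */
        if (brks[i] == LINEBREAK_ALLOWBREAK)
            best = i + 1;           /* Remember the latest opportunity */
    }
    if (best != 0)
        return best;

    /* No opportunity fits: break at the limit, but back off so that a
     * multi-unit character is not split if that can be avoided */
    n = limit;
    while (n > 0 && brks[n - 1] == LINEBREAK_INSIDEACHAR)
        --n;
    return n != 0 ? n : limit;
}
```

For the five-byte text "ab cd", the library should fill brks with
NOBREAK, NOBREAK, ALLOWBREAK, NOBREAK, MUSTBREAK. With room for four
units the helper then returns 3, so the line ends after the space.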
$Id: README,v 1.8 2012/08/11 06:55:18 adah Exp $
vim:autoindent:expandtab:formatoptions=tcqlmn:textwidth=72:

6
linebreak/linebreak/bootstrap Executable file
View file

@ -0,0 +1,6 @@
#! /bin/sh
aclocal && \
autoheader && \
autoconf && \
libtoolize && \
automake --add-missing

View file

@ -0,0 +1,12 @@
AC_PREREQ(2.57)
AC_INIT([libunibreak],[1.0],[wuyongwei@gmail.com])
AC_CONFIG_SRCDIR([linebreak.c])
AC_CONFIG_HEADERS([config.h])
AM_INIT_AUTOMAKE([foreign])
AC_PROG_CC
AC_PROG_LN_S
AC_EXEEXT
AM_PROG_LIBTOOL
AC_CONFIG_FILES([Makefile])
AC_OUTPUT([libunibreak.pc])

View file

@ -0,0 +1,47 @@
#include <stdio.h>
#include <string.h>
int main(void)
{
char s[80];
char beg[16];
char end[16];
char prop[16];
char lastbeg[16];
char lastend[16];
char lastprop[16];
lastprop[0] = 0;
for (;;)
{
if (fgets(s, sizeof s, stdin) == NULL)
break;
if (strstr(s, "LBP_") == NULL || strstr(s, "LBP_Undef") != NULL)
{
if (lastprop[0])
{
printf("\t{ %s %s %s },\n", lastbeg, lastend, lastprop);
lastprop[0] = 0;
}
printf("%s", s);
continue;
}
/* Field widths keep each token within its 16-byte buffer */
sscanf(s, "\t{ %15s %15s %15s }", beg, end, prop);
/*printf("==>\t{ \"%s\" \"%s\" \"%s\" },\n", beg, end, prop);*/
if (lastprop[0] && strcmp(lastprop, prop) != 0)
{
printf("\t{ %s %s %s },\n", lastbeg, lastend, lastprop);
lastprop[0] = 0;
}
if (lastprop[0] == 0)
{
strcpy(lastbeg, beg);
strcpy(lastprop, prop);
}
strcpy(lastend, end);
}
if (lastprop[0])
{
printf("\t{ %s %s %s },\n", lastbeg, lastend, lastprop);
}
return 0;
}

View file

@ -0,0 +1,11 @@
libunibreak:
prefix=@prefix@
exec_prefix=@exec_prefix@
libdir=@libdir@
includedir=@includedir@
Name: libunibreak
Description: Library to implement Unicode algorithms for line and word breaking
Version: @VERSION@
Libs: -L${libdir} -lunibreak
Cflags: -I${includedir}

View file

@ -0,0 +1,737 @@
/* vim: set tabstop=4 shiftwidth=4: */
/*
* Line breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2008-2011 Wu Yongwei <wuyongwei at gmail dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute
* it freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must
* not claim that you wrote the original software. If you use this
* software in a product, an acknowledgement in the product
* documentation would be appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must
* not be misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source
* distribution.
*
* The main reference is Unicode Standard Annex 14 (UAX #14):
* <URL:http://www.unicode.org/reports/tr14/>
*
* When this library was designed, this annex was at Revision 19, for
* Unicode 5.0.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-19.html>
*
* This library has been updated according to Revision 26, for
* Unicode 6.0.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-26.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
*/
/**
* @file linebreak.c
*
* Implementation of the line breaking algorithm as described in Unicode
* Standard Annex 14.
*
* @version 2.1, 2011/05/07
* @author Wu Yongwei
*/
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include "linebreak.h"
#include "linebreakdef.h"
/**
* Size of the second-level index to the line breaking properties.
*/
#define LINEBREAK_INDEX_SIZE 40
/**
* Version number of the library.
*/
const int linebreak_version = LINEBREAK_VERSION;
/**
* Enumeration of break actions. They are used in the break action
* pair table below.
*/
enum BreakAction
{
DIR_BRK, /**< Direct break opportunity */
IND_BRK, /**< Indirect break opportunity */
CMI_BRK, /**< Indirect break opportunity for combining marks */
CMP_BRK, /**< Prohibited break for combining marks */
PRH_BRK /**< Prohibited break */
};
/**
* Break action pair table. This is a direct mapping of Table 2 of
* Unicode Standard Annex 14, Revision 24.
*/
static enum BreakAction baTable[LBP_JT][LBP_JT] = {
{ /* OP */
PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, CMP_BRK,
PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK, PRH_BRK },
{ /* CL */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, PRH_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
{ /* CP */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, PRH_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, DIR_BRK,
DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
{ /* QU */
PRH_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK,
IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK },
{ /* GL */
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK,
IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK },
{ /* NS */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
{ /* EX */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
{ /* SY */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, IND_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
{ /* IS */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, DIR_BRK,
DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
{ /* PR */
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, IND_BRK,
DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK },
{ /* PO */
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, DIR_BRK,
DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
{ /* NU */
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, DIR_BRK,
IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
{ /* AL */
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, DIR_BRK,
IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
{ /* ID */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, IND_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
{ /* IN */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
{ /* HY */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, DIR_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, IND_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
{ /* BA */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, DIR_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
{ /* BB */
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK,
IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK },
{ /* B2 */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, IND_BRK, IND_BRK, DIR_BRK, PRH_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
{ /* ZW */
DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, PRH_BRK, DIR_BRK,
DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
{ /* CM */
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, DIR_BRK, IND_BRK, IND_BRK, DIR_BRK,
IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK },
{ /* WJ */
IND_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK,
IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK },
{ /* H2 */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, IND_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, IND_BRK, IND_BRK },
{ /* H3 */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, IND_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, IND_BRK },
{ /* JL */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, IND_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, IND_BRK, IND_BRK, IND_BRK, IND_BRK, DIR_BRK },
{ /* JV */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, IND_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, IND_BRK, IND_BRK },
{ /* JT */
DIR_BRK, PRH_BRK, PRH_BRK, IND_BRK, IND_BRK, IND_BRK, PRH_BRK,
PRH_BRK, PRH_BRK, DIR_BRK, IND_BRK, DIR_BRK, DIR_BRK, DIR_BRK,
IND_BRK, IND_BRK, IND_BRK, DIR_BRK, DIR_BRK, PRH_BRK, CMI_BRK,
PRH_BRK, DIR_BRK, DIR_BRK, DIR_BRK, DIR_BRK, IND_BRK }
};
/**
* Struct for the second-level index to the line breaking properties.
*/
struct LineBreakPropertiesIndex
{
utf32_t end; /**< End code point */
struct LineBreakProperties *lbp;/**< Pointer to line breaking properties */
};
/**
* Second-level index to the line breaking properties.
*/
static struct LineBreakPropertiesIndex lb_prop_index[LINEBREAK_INDEX_SIZE] =
{
{ 0xFFFFFFFF, lb_prop_default }
};
/**
* Initializes the second-level index to the line breaking properties.
* If it is not called, the performance of #get_char_lb_class_lang (and
* thus the main functionality) can be pretty bad, especially for big
* code points like those of Chinese.
*/
void init_linebreak(void)
{
size_t i;
size_t iPropDefault;
size_t len;
size_t step;
len = 0;
while (lb_prop_default[len].prop != LBP_Undefined)
++len;
step = len / LINEBREAK_INDEX_SIZE;
iPropDefault = 0;
for (i = 0; i < LINEBREAK_INDEX_SIZE; ++i)
{
lb_prop_index[i].lbp = lb_prop_default + iPropDefault;
iPropDefault += step;
lb_prop_index[i].end = lb_prop_default[iPropDefault].start - 1;
}
lb_prop_index[--i].end = 0xFFFFFFFF;
}
/**
* Gets the language-specific line breaking properties.
*
* @param lang language of the text
* @return pointer to the language-specific line breaking
* properties array if found; \c NULL otherwise
*/
static struct LineBreakProperties *get_lb_prop_lang(const char *lang)
{
struct LineBreakPropertiesLang *lbplIter;
if (lang != NULL)
{
for (lbplIter = lb_prop_lang_map; lbplIter->lang != NULL; ++lbplIter)
{
if (strncmp(lang, lbplIter->lang, lbplIter->namelen) == 0)
{
return lbplIter->lbp;
}
}
}
return NULL;
}
/**
* Gets the line breaking class of a character from a line breaking
* properties array.
*
* @param ch character to check
* @param lbp pointer to the line breaking properties array
* @return the line breaking class if found; \c LBP_XX otherwise
*/
static enum LineBreakClass get_char_lb_class(
utf32_t ch,
struct LineBreakProperties *lbp)
{
while (lbp->prop != LBP_Undefined && ch >= lbp->start)
{
if (ch <= lbp->end)
return lbp->prop;
++lbp;
}
return LBP_XX;
}
/**
* Gets the line breaking class of a character from the default line
* breaking properties array.
*
* @param ch character to check
* @return the line breaking class if found; \c LBP_XX otherwise
*/
static enum LineBreakClass get_char_lb_class_default(
utf32_t ch)
{
size_t i = 0;
while (ch > lb_prop_index[i].end)
++i;
assert(i < LINEBREAK_INDEX_SIZE);
return get_char_lb_class(ch, lb_prop_index[i].lbp);
}
/**
* Gets the line breaking class of a character for a specific
* language. This function will check the language-specific data first,
* and then the default data if there is no language-specific property
* available for the character.
*
* @param ch character to check
* @param lbpLang pointer to the language-specific line breaking
* properties array
* @return the line breaking class if found; \c LBP_XX
* otherwise
*/
static enum LineBreakClass get_char_lb_class_lang(
utf32_t ch,
struct LineBreakProperties *lbpLang)
{
enum LineBreakClass lbcResult;
/* Find the language-specific line breaking class for a character */
if (lbpLang)
{
lbcResult = get_char_lb_class(ch, lbpLang);
if (lbcResult != LBP_XX)
return lbcResult;
}
/* Fall back to the default (language-independent) line breaking
* class if no language context is provided, or if language-specific
* data are not available for the character in the specified language */
return get_char_lb_class_default(ch);
}
/**
* Resolves the line breaking class for certain ambiguous or complicated
* characters. They are treated in a simplistic way in this
* implementation.
*
* @param lbc line breaking class to resolve
* @param lang language of the text
* @return the resolved line breaking class
*/
static enum LineBreakClass resolve_lb_class(
enum LineBreakClass lbc,
const char *lang)
{
switch (lbc)
{
case LBP_AI:
if (lang != NULL &&
(strncmp(lang, "zh", 2) == 0 || /* Chinese */
strncmp(lang, "ja", 2) == 0 || /* Japanese */
strncmp(lang, "ko", 2) == 0)) /* Korean */
{
return LBP_ID;
}
/* Fall through */
case LBP_SA:
case LBP_SG:
case LBP_XX:
return LBP_AL;
default:
return lbc;
}
}
/**
* Gets the next Unicode character in a UTF-8 sequence. The index will
* be advanced to the next complete character, unless the end of string
* is reached in the middle of a UTF-8 sequence.
*
* @param[in] s input UTF-8 string
* @param[in] len length of the string in bytes
* @param[in,out] ip pointer to the index
* @return the Unicode character beginning at the index; or
* #EOS if end of input is encountered
*/
utf32_t lb_get_next_char_utf8(
const utf8_t *s,
size_t len,
size_t *ip)
{
utf8_t ch;
utf32_t res;
assert(*ip <= len);
if (*ip == len)
return EOS;
ch = s[*ip];
if (ch < 0xC2 || ch > 0xF4)
{ /* One-byte sequence, tail (should not occur), or invalid */
*ip += 1;
return ch;
}
else if (ch < 0xE0)
{ /* Two-byte sequence */
if (*ip + 2 > len)
return EOS;
res = ((ch & 0x1F) << 6) + (s[*ip + 1] & 0x3F);
*ip += 2;
return res;
}
else if (ch < 0xF0)
{ /* Three-byte sequence */
if (*ip + 3 > len)
return EOS;
res = ((ch & 0x0F) << 12) +
((s[*ip + 1] & 0x3F) << 6) +
((s[*ip + 2] & 0x3F));
*ip += 3;
return res;
}
else
{ /* Four-byte sequence */
if (*ip + 4 > len)
return EOS;
res = ((ch & 0x07) << 18) +
((s[*ip + 1] & 0x3F) << 12) +
((s[*ip + 2] & 0x3F) << 6) +
((s[*ip + 3] & 0x3F));
*ip += 4;
return res;
}
}
/**
* Gets the next Unicode character in a UTF-16 sequence. The index will
* be advanced to the next complete character, unless the end of string
* is reached in the middle of a UTF-16 surrogate pair.
*
* @param[in] s input UTF-16 string
* @param[in] len length of the string in words
* @param[in,out] ip pointer to the index
* @return the Unicode character beginning at the index; or
* #EOS if end of input is encountered
*/
utf32_t lb_get_next_char_utf16(
const utf16_t *s,
size_t len,
size_t *ip)
{
utf16_t ch;
assert(*ip <= len);
if (*ip == len)
return EOS;
ch = s[(*ip)++];
if (ch < 0xD800 || ch > 0xDBFF)
{ /* If the character is not a high surrogate */
return ch;
}
if (*ip == len)
{ /* If the input ends here (an error) */
--(*ip);
return EOS;
}
if (s[*ip] < 0xDC00 || s[*ip] > 0xDFFF)
{ /* If the next character is not the low surrogate (an error) */
return ch;
}
/* Return the constructed character and advance the index again */
return (((utf32_t)ch & 0x3FF) << 10) + (s[(*ip)++] & 0x3FF) + 0x10000;
}
/**
* Gets the next Unicode character in a UTF-32 sequence. The index will
* be advanced to the next character.
*
* @param[in] s input UTF-32 string
* @param[in] len length of the string in dwords
* @param[in,out] ip pointer to the index
* @return the Unicode character beginning at the index; or
* #EOS if end of input is encountered
*/
utf32_t lb_get_next_char_utf32(
const utf32_t *s,
size_t len,
size_t *ip)
{
assert(*ip <= len);
if (*ip == len)
return EOS;
return s[(*ip)++];
}
/**
* Sets the line breaking information for a generic input string.
*
* @param[in] s input string
* @param[in] len length of the input
* @param[in] lang language of the input
* @param[out] brks pointer to the output breaking data,
* containing #LINEBREAK_MUSTBREAK,
* #LINEBREAK_ALLOWBREAK, #LINEBREAK_NOBREAK,
* or #LINEBREAK_INSIDEACHAR
* @param[in] get_next_char function to get the next UTF-32 character
*/
void set_linebreaks(
const void *s,
size_t len,
const char *lang,
char *brks,
get_next_char_t get_next_char)
{
utf32_t ch;
enum LineBreakClass lbcCur;
enum LineBreakClass lbcNew;
enum LineBreakClass lbcLast;
struct LineBreakProperties *lbpLang;
size_t posCur = 0;
size_t posLast = 0;
--posLast; /* To be ++'d later */
ch = get_next_char(s, len, &posCur);
if (ch == EOS)
return;
lbpLang = get_lb_prop_lang(lang);
lbcCur = resolve_lb_class(get_char_lb_class_lang(ch, lbpLang), lang);
lbcNew = LBP_Undefined;
nextline:
/* Special treatment for the first character */
switch (lbcCur)
{
case LBP_LF:
case LBP_NL:
lbcCur = LBP_BK;
break;
case LBP_CB:
lbcCur = LBP_BA;
break;
case LBP_SP:
lbcCur = LBP_WJ;
break;
default:
break;
}
/* Process a line till an explicit break or end of string */
for (;;)
{
for (++posLast; posLast < posCur - 1; ++posLast)
{
brks[posLast] = LINEBREAK_INSIDEACHAR;
}
assert(posLast == posCur - 1);
lbcLast = lbcNew;
ch = get_next_char(s, len, &posCur);
if (ch == EOS)
break;
lbcNew = get_char_lb_class_lang(ch, lbpLang);
if (lbcCur == LBP_BK || (lbcCur == LBP_CR && lbcNew != LBP_LF))
{
brks[posLast] = LINEBREAK_MUSTBREAK;
lbcCur = resolve_lb_class(lbcNew, lang);
goto nextline;
}
switch (lbcNew)
{
case LBP_SP:
brks[posLast] = LINEBREAK_NOBREAK;
continue;
case LBP_BK:
case LBP_LF:
case LBP_NL:
brks[posLast] = LINEBREAK_NOBREAK;
lbcCur = LBP_BK;
continue;
case LBP_CR:
brks[posLast] = LINEBREAK_NOBREAK;
lbcCur = LBP_CR;
continue;
case LBP_CB:
brks[posLast] = LINEBREAK_ALLOWBREAK;
lbcCur = LBP_BA;
continue;
default:
break;
}
lbcNew = resolve_lb_class(lbcNew, lang);
assert(lbcCur <= LBP_JT);
assert(lbcNew <= LBP_JT);
switch (baTable[lbcCur - 1][lbcNew - 1])
{
case DIR_BRK:
brks[posLast] = LINEBREAK_ALLOWBREAK;
break;
case CMI_BRK:
case IND_BRK:
if (lbcLast == LBP_SP)
{
brks[posLast] = LINEBREAK_ALLOWBREAK;
}
else
{
brks[posLast] = LINEBREAK_NOBREAK;
}
break;
case CMP_BRK:
brks[posLast] = LINEBREAK_NOBREAK;
if (lbcLast != LBP_SP)
continue;
break;
case PRH_BRK:
brks[posLast] = LINEBREAK_NOBREAK;
break;
}
lbcCur = lbcNew;
}
assert(posLast == posCur - 1 && posCur <= len);
/* Break after the last character */
brks[posLast] = LINEBREAK_MUSTBREAK;
/* When the input contains incomplete sequences */
while (posCur < len)
{
brks[posCur++] = LINEBREAK_INSIDEACHAR;
}
}
/**
* Sets the line breaking information for a UTF-8 input string.
*
* @param[in] s input UTF-8 string
* @param[in] len length of the input
* @param[in] lang language of the input
* @param[out] brks pointer to the output breaking data, containing
* #LINEBREAK_MUSTBREAK, #LINEBREAK_ALLOWBREAK,
* #LINEBREAK_NOBREAK, or #LINEBREAK_INSIDEACHAR
*/
void set_linebreaks_utf8(
const utf8_t *s,
size_t len,
const char *lang,
char *brks)
{
set_linebreaks(s, len, lang, brks,
(get_next_char_t)lb_get_next_char_utf8);
}
/**
* Sets the line breaking information for a UTF-16 input string.
*
* @param[in] s input UTF-16 string
* @param[in] len length of the input
* @param[in] lang language of the input
* @param[out] brks pointer to the output breaking data, containing
* #LINEBREAK_MUSTBREAK, #LINEBREAK_ALLOWBREAK,
* #LINEBREAK_NOBREAK, or #LINEBREAK_INSIDEACHAR
*/
void set_linebreaks_utf16(
const utf16_t *s,
size_t len,
const char *lang,
char *brks)
{
set_linebreaks(s, len, lang, brks,
(get_next_char_t)lb_get_next_char_utf16);
}
/**
* Sets the line breaking information for a UTF-32 input string.
*
* @param[in] s input UTF-32 string
* @param[in] len length of the input
* @param[in] lang language of the input
* @param[out] brks pointer to the output breaking data, containing
* #LINEBREAK_MUSTBREAK, #LINEBREAK_ALLOWBREAK,
* #LINEBREAK_NOBREAK, or #LINEBREAK_INSIDEACHAR
*/
void set_linebreaks_utf32(
const utf32_t *s,
size_t len,
const char *lang,
char *brks)
{
set_linebreaks(s, len, lang, brks,
(get_next_char_t)lb_get_next_char_utf32);
}
/**
* Tells whether a line break can occur between two Unicode characters.
* This is a wrapper function to expose a simple interface. Generally
* speaking, it is better to use #set_linebreaks_utf32 instead, since
* this function sees only two characters and therefore cannot correctly
* process complicated cases involving combining marks, spaces, etc.
*
* @param char1 the first Unicode character
* @param char2 the second Unicode character
* @param lang language of the input
* @return one of #LINEBREAK_MUSTBREAK, #LINEBREAK_ALLOWBREAK,
* #LINEBREAK_NOBREAK, or #LINEBREAK_INSIDEACHAR
*/
int is_line_breakable(
utf32_t char1,
utf32_t char2,
const char* lang)
{
utf32_t s[2];
char brks[2];
s[0] = char1;
s[1] = char2;
set_linebreaks_utf32(s, 2, lang, brks);
return brks[0];
}
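
The switch on baTable in set_linebreaks above can be summarised in
isolation. The following sketch is not part of the library:
resolve_break_action is a hypothetical name, and the enum and
LINEBREAK_* values are repeated from linebreak.c and linebreak.h so that
the fragment compiles alone. It covers only the value written to brks,
not the handling of lbcCur for combining marks.

```c
#include <assert.h>

/* Same enumerators, in the same order, as enum BreakAction in
 * linebreak.c */
enum BreakAction
{
    DIR_BRK, IND_BRK, CMI_BRK, CMP_BRK, PRH_BRK
};

/* Same values as in linebreak.h */
#ifndef LINEBREAK_ALLOWBREAK
#define LINEBREAK_ALLOWBREAK 1
#define LINEBREAK_NOBREAK    2
#endif

/*
 * Hypothetical helper: maps a pair-table action to the value stored in
 * brks.  prev_was_space is non-zero when the character before the new
 * one was a space (lbcLast == LBP_SP in set_linebreaks).
 */
int resolve_break_action(enum BreakAction action, int prev_was_space)
{
    switch (action)
    {
    case DIR_BRK:
        return LINEBREAK_ALLOWBREAK;
    case IND_BRK:
    case CMI_BRK:
        /* Indirect opportunities need intervening spaces */
        return prev_was_space ? LINEBREAK_ALLOWBREAK : LINEBREAK_NOBREAK;
    case CMP_BRK:
    case PRH_BRK:
        return LINEBREAK_NOBREAK;
    }
    assert(0 && "unknown break action");
    return LINEBREAK_NOBREAK;
}
```

Because is_line_breakable passes only two characters, lbcLast is never
LBP_SP when the table is consulted, so an indirect break opportunity is
reported there as LINEBREAK_NOBREAK.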

View file

@ -0,0 +1,87 @@
/* vim: set tabstop=4 shiftwidth=4: */
/*
* Line breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2008-2011 Wu Yongwei <wuyongwei at gmail dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute
* it freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must
* not claim that you wrote the original software. If you use this
* software in a product, an acknowledgement in the product
* documentation would be appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must
* not be misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source
* distribution.
*
* The main reference is Unicode Standard Annex 14 (UAX #14):
* <URL:http://www.unicode.org/reports/tr14/>
*
* When this library was designed, this annex was at Revision 19, for
* Unicode 5.0.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-19.html>
*
* This library has been updated according to Revision 26, for
* Unicode 6.0.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-26.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
*/
/**
* @file linebreak.h
*
* Header file for the line breaking algorithm.
*
* @version 2.1, 2011/05/07
* @author Wu Yongwei
*/
#ifndef LINEBREAK_H
#define LINEBREAK_H
#include <stddef.h>
#ifdef __cplusplus
extern "C" {
#endif
#define LINEBREAK_VERSION 0x0201 /**< Version of the library linebreak */
extern const int linebreak_version;
#ifndef LINEBREAK_UTF_TYPES_DEFINED
#define LINEBREAK_UTF_TYPES_DEFINED
typedef unsigned char utf8_t; /**< Type for UTF-8 data points */
typedef unsigned short utf16_t; /**< Type for UTF-16 data points */
typedef unsigned int utf32_t; /**< Type for UTF-32 data points */
#endif
#define LINEBREAK_MUSTBREAK 0 /**< Break is mandatory */
#define LINEBREAK_ALLOWBREAK 1 /**< Break is allowed */
#define LINEBREAK_NOBREAK 2 /**< No break is possible */
#define LINEBREAK_INSIDEACHAR 3 /**< A UTF-8/16 sequence is unfinished */
void init_linebreak(void);
void set_linebreaks_utf8(
const utf8_t *s, size_t len, const char* lang, char *brks);
void set_linebreaks_utf16(
const utf16_t *s, size_t len, const char* lang, char *brks);
void set_linebreaks_utf32(
const utf32_t *s, size_t len, const char* lang, char *brks);
int is_line_breakable(utf32_t char1, utf32_t char2, const char* lang);
#ifdef __cplusplus
}
#endif
#endif /* LINEBREAK_H */
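
LINEBREAK_INSIDEACHAR marks code units after which a character is still
unfinished. For UTF-8 input, the number of units a character occupies
follows from its lead byte. The sketch below uses the same thresholds as
lb_get_next_char_utf8 in linebreak.c; the names utf8_seq_len and
utf8_inside_units are hypothetical and are not declared by this header.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence that
 * starts with the given lead byte, using the thresholds found in
 * lb_get_next_char_utf8 */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0xC2 || lead > 0xF4)
        return 1;   /* ASCII, stray tail byte, or invalid lead byte */
    else if (lead < 0xE0)
        return 2;
    else if (lead < 0xF0)
        return 3;
    else
        return 4;
}

/* Hypothetical helper: number of LINEBREAK_INSIDEACHAR entries that a
 * complete character with this lead byte is expected to produce */
size_t utf8_inside_units(unsigned char lead)
{
    size_t n = utf8_seq_len(lead);
    assert(n >= 1 && n <= 4);
    return n - 1;
}
```

A three-byte character therefore yields two LINEBREAK_INSIDEACHAR
entries followed by one entry carrying the actual break information.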

File diff suppressed because it is too large

View file

@ -0,0 +1 @@
/* The content of this file is generated from:

View file

@ -0,0 +1,7 @@
*/
#include "linebreak.h"
#include "linebreakdef.h"
/** Default line breaking properties as from the Unicode Web site. */
struct LineBreakProperties lb_prop_default[] = {

View file

@ -0,0 +1,2 @@
{ 0xFFFFFFFF, 0xFFFFFFFF, LBP_Undefined }
};

View file

@ -0,0 +1,139 @@
/* vim: set tabstop=4 shiftwidth=4: */
/*
* Line breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2008-2011 Wu Yongwei <wuyongwei at gmail dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute
* it freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must
* not claim that you wrote the original software. If you use this
* software in a product, an acknowledgement in the product
* documentation would be appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must
* not be misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source
* distribution.
*
* The main reference is Unicode Standard Annex 14 (UAX #14):
* <URL:http://www.unicode.org/reports/tr14/>
*
* When this library was designed, this annex was at Revision 19, for
* Unicode 5.0.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-19.html>
*
* This library has been updated according to Revision 26, for
* Unicode 6.0.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-26.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
*/
/**
* @file linebreakdef.c
*
* Definition of language-specific data.
*
* @version 2.1, 2011/05/07
* @author Wu Yongwei
*/
#include "linebreak.h"
#include "linebreakdef.h"
/**
* English-specific data over the default Unicode rules.
*/
static struct LineBreakProperties lb_prop_English[] = {
{ 0x2018, 0x2018, LBP_OP }, /* Left single quotation mark: opening */
{ 0x201C, 0x201C, LBP_OP }, /* Left double quotation mark: opening */
{ 0x201D, 0x201D, LBP_CL }, /* Right double quotation mark: closing */
{ 0, 0, LBP_Undefined }
};
/**
* German-specific data over the default Unicode rules.
*/
static struct LineBreakProperties lb_prop_German[] = {
{ 0x00AB, 0x00AB, LBP_CL }, /* Left double angle quotation mark: closing */
{ 0x00BB, 0x00BB, LBP_OP }, /* Right double angle quotation mark: opening */
{ 0x2018, 0x2018, LBP_CL }, /* Left single quotation mark: closing */
{ 0x201C, 0x201C, LBP_CL }, /* Left double quotation mark: closing */
{ 0x2039, 0x2039, LBP_CL }, /* Left single angle quotation mark: closing */
{ 0x203A, 0x203A, LBP_OP }, /* Right single angle quotation mark: opening */
{ 0, 0, LBP_Undefined }
};
/**
* Spanish-specific data over the default Unicode rules.
*/
static struct LineBreakProperties lb_prop_Spanish[] = {
{ 0x00AB, 0x00AB, LBP_OP }, /* Left double angle quotation mark: opening */
{ 0x00BB, 0x00BB, LBP_CL }, /* Right double angle quotation mark: closing */
{ 0x2018, 0x2018, LBP_OP }, /* Left single quotation mark: opening */
{ 0x201C, 0x201C, LBP_OP }, /* Left double quotation mark: opening */
{ 0x201D, 0x201D, LBP_CL }, /* Right double quotation mark: closing */
{ 0x2039, 0x2039, LBP_OP }, /* Left single angle quotation mark: opening */
{ 0x203A, 0x203A, LBP_CL }, /* Right single angle quotation mark: closing */
{ 0, 0, LBP_Undefined }
};
/**
* French-specific data over the default Unicode rules.
*/
static struct LineBreakProperties lb_prop_French[] = {
{ 0x00AB, 0x00AB, LBP_OP }, /* Left double angle quotation mark: opening */
{ 0x00BB, 0x00BB, LBP_CL }, /* Right double angle quotation mark: closing */
{ 0x2018, 0x2018, LBP_OP }, /* Left single quotation mark: opening */
{ 0x201C, 0x201C, LBP_OP }, /* Left double quotation mark: opening */
{ 0x201D, 0x201D, LBP_CL }, /* Right double quotation mark: closing */
{ 0x2039, 0x2039, LBP_OP }, /* Left single angle quotation mark: opening */
{ 0x203A, 0x203A, LBP_CL }, /* Right single angle quotation mark: closing */
{ 0, 0, LBP_Undefined }
};
/**
* Russian-specific data over the default Unicode rules.
*/
static struct LineBreakProperties lb_prop_Russian[] = {
{ 0x00AB, 0x00AB, LBP_OP }, /* Left double angle quotation mark: opening */
{ 0x00BB, 0x00BB, LBP_CL }, /* Right double angle quotation mark: closing */
{ 0x201C, 0x201C, LBP_CL }, /* Left double quotation mark: closing */
{ 0, 0, LBP_Undefined }
};
/**
* Chinese-specific data over the default Unicode rules.
*/
static struct LineBreakProperties lb_prop_Chinese[] = {
{ 0x2018, 0x2018, LBP_OP }, /* Left single quotation mark: opening */
{ 0x2019, 0x2019, LBP_CL }, /* Right single quotation mark: closing */
{ 0x201C, 0x201C, LBP_OP }, /* Left double quotation mark: opening */
{ 0x201D, 0x201D, LBP_CL }, /* Right double quotation mark: closing */
{ 0, 0, LBP_Undefined }
};
/**
* Association data of language-specific line breaking properties with
* language names. This is the definition for the static data in this
* file. If you want more flexibility, or do not need the data here,
* you may want to redefine \e lb_prop_lang_map in your C source file.
*/
struct LineBreakPropertiesLang lb_prop_lang_map[] = {
{ "en", 2, lb_prop_English },
{ "de", 2, lb_prop_German },
{ "es", 2, lb_prop_Spanish },
{ "fr", 2, lb_prop_French },
{ "ru", 2, lb_prop_Russian },
{ "zh", 2, lb_prop_Chinese },
{ NULL, 0, NULL }
};
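
The language map above pairs a language tag with a table of overrides that ends in an `LBP_Undefined` sentinel. As a minimal sketch of how such a map could be consulted, the block below uses reduced stand-in types and two tables copied from the Russian and Chinese data. All `demo_`/`Demo`/`DEMO_` names are hypothetical, and matching only the first `namelen` characters of the tag is an assumption based on the "Length of name to match" field, not a statement of what the library does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned int utf32_t;

/* Reduced stand-ins for the real types (hypothetical names). */
enum DemoLBClass { DEMO_LBP_Undefined, DEMO_LBP_OP, DEMO_LBP_CL };

struct DemoLBProps {
    utf32_t start;
    utf32_t end;
    enum DemoLBClass prop;
};

struct DemoLBLang {
    const char *lang;
    size_t namelen;
    const struct DemoLBProps *lbp;
};

/* Rows copied from lb_prop_Russian and lb_prop_Chinese. */
static const struct DemoLBProps demo_russian[] = {
    { 0x00AB, 0x00AB, DEMO_LBP_OP },
    { 0x00BB, 0x00BB, DEMO_LBP_CL },
    { 0x201C, 0x201C, DEMO_LBP_CL },
    { 0, 0, DEMO_LBP_Undefined }
};

static const struct DemoLBProps demo_chinese[] = {
    { 0x2018, 0x2018, DEMO_LBP_OP },
    { 0x2019, 0x2019, DEMO_LBP_CL },
    { 0x201C, 0x201C, DEMO_LBP_OP },
    { 0x201D, 0x201D, DEMO_LBP_CL },
    { 0, 0, DEMO_LBP_Undefined }
};

static const struct DemoLBLang demo_lang_map[] = {
    { "ru", 2, demo_russian },
    { "zh", 2, demo_chinese },
    { NULL, 0, NULL }
};

/* Returns the override table whose tag prefixes `lang`, or NULL.
 * Prefix matching on namelen is an assumption of this sketch. */
const struct DemoLBProps *demo_find_lang(const char *lang)
{
    const struct DemoLBLang *e;
    if (lang == NULL)
        return NULL;
    for (e = demo_lang_map; e->lang != NULL; ++e)
    {
        assert(e->namelen <= strlen(e->lang));
        if (strncmp(lang, e->lang, e->namelen) == 0)
            return e->lbp;
    }
    return NULL;
}

/* Scans one override table up to its sentinel row. */
enum DemoLBClass demo_lang_class(const struct DemoLBProps *lbp, utf32_t ch)
{
    for (; lbp != NULL && lbp->prop != DEMO_LBP_Undefined; ++lbp)
    {
        if (ch >= lbp->start && ch <= lbp->end)
            return lbp->prop;
    }
    return DEMO_LBP_Undefined;
}
```

With these tables, a tag such as "zh-CN" selects the Chinese overrides, and a code point that has no override yields `DEMO_LBP_Undefined`, which a caller would treat as "fall back to the default table".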

View file

@ -0,0 +1,149 @@
/* vim: set tabstop=4 shiftwidth=4: */
/*
* Line breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2008-2011 Wu Yongwei <wuyongwei at gmail dot com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute
* it freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must
* not claim that you wrote the original software. If you use this
* software in a product, an acknowledgement in the product
* documentation would be appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must
* not be misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source
* distribution.
*
* The main reference is Unicode Standard Annex 14 (UAX #14):
* <URL:http://www.unicode.org/reports/tr14/>
*
* When this library was designed, this annex was at Revision 19, for
* Unicode 5.0.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-19.html>
*
* This library has been updated according to Revision 26, for
* Unicode 6.0.0:
* <URL:http://www.unicode.org/reports/tr14/tr14-26.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
*/
/**
* @file linebreakdef.h
*
* Definitions of internal data structures, declarations of global
* variables, and function prototypes for the line breaking algorithm.
*
* @version 2.1, 2011/05/07
* @author Wu Yongwei
*/
/**
* Constant value to mark the end of string. It is not a valid Unicode
* character.
*/
#define EOS 0xFFFF
/**
* Line break classes. This is a direct mapping of Table 1 of Unicode
* Standard Annex 14, Revision 26.
*/
enum LineBreakClass
{
/* This is used to signal an error condition. */
LBP_Undefined, /**< Undefined */
/* The following break classes are treated in the pair table. */
LBP_OP, /**< Opening punctuation */
LBP_CL, /**< Closing punctuation */
LBP_CP, /**< Closing parenthesis */
LBP_QU, /**< Ambiguous quotation */
LBP_GL, /**< Glue */
LBP_NS, /**< Non-starters */
LBP_EX, /**< Exclamation/Interrogation */
LBP_SY, /**< Symbols allowing break after */
LBP_IS, /**< Infix separator */
LBP_PR, /**< Prefix */
LBP_PO, /**< Postfix */
LBP_NU, /**< Numeric */
LBP_AL, /**< Alphabetic */
LBP_ID, /**< Ideographic */
LBP_IN, /**< Inseparable characters */
LBP_HY, /**< Hyphen */
LBP_BA, /**< Break after */
LBP_BB, /**< Break before */
LBP_B2, /**< Break on either side (but not pair) */
LBP_ZW, /**< Zero-width space */
LBP_CM, /**< Combining marks */
LBP_WJ, /**< Word joiner */
LBP_H2, /**< Hangul LV */
LBP_H3, /**< Hangul LVT */
LBP_JL, /**< Hangul L Jamo */
LBP_JV, /**< Hangul V Jamo */
LBP_JT, /**< Hangul T Jamo */
/* The following break classes are not treated in the pair table */
LBP_AI, /**< Ambiguous (alphabetic or ideograph) */
LBP_BK, /**< Break (mandatory) */
LBP_CB, /**< Contingent break */
LBP_CR, /**< Carriage return */
LBP_LF, /**< Line feed */
LBP_NL, /**< Next line */
LBP_SA, /**< South-East Asian */
LBP_SG, /**< Surrogates */
LBP_SP, /**< Space */
LBP_XX /**< Unknown */
};
/**
* Struct for entries of line break properties. The array of the
* entries \e must be sorted.
*/
struct LineBreakProperties
{
utf32_t start; /**< Starting code point */
utf32_t end; /**< End code point */
enum LineBreakClass prop; /**< The line breaking property */
};
/**
* Struct for association of language-specific line breaking properties
* with language names.
*/
struct LineBreakPropertiesLang
{
const char *lang; /**< Language name */
size_t namelen; /**< Length of name to match */
struct LineBreakProperties *lbp; /**< Pointer to associated data */
};
/**
* Abstract function interface for #lb_get_next_char_utf8,
* #lb_get_next_char_utf16, and #lb_get_next_char_utf32.
*/
typedef utf32_t (*get_next_char_t)(const void *, size_t, size_t *);
/* Declarations */
extern struct LineBreakProperties lb_prop_default[];
extern struct LineBreakPropertiesLang lb_prop_lang_map[];
/* Function Prototype */
utf32_t lb_get_next_char_utf8(const utf8_t *s, size_t len, size_t *ip);
utf32_t lb_get_next_char_utf16(const utf16_t *s, size_t len, size_t *ip);
utf32_t lb_get_next_char_utf32(const utf32_t *s, size_t len, size_t *ip);
void set_linebreaks(
const void *s,
size_t len,
const char *lang,
char *brks,
get_next_char_t get_next_char);
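
The header only declares the `get_next_char_t` interface, so the following is a minimal sketch of a callback with that signature for UTF-32 input, plus a loop that drives it until `EOS`. The `demo_` names are hypothetical, and the body (return the element at `*ip`, advance `*ip`, return the end marker once `*ip` reaches `len`) is an assumption about the contract rather than the library's own code.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int utf32_t;

/* Same value as EOS in linebreakdef.h; renamed to keep the sketch standalone. */
#define DEMO_EOS 0xFFFF

/* Same shape as get_next_char_t. */
typedef utf32_t (*demo_get_next_char_t)(const void *, size_t, size_t *);

/* Reads one UTF-32 unit at *ip and advances *ip (assumed contract). */
utf32_t demo_next_utf32(const void *s, size_t len, size_t *ip)
{
    const utf32_t *p = (const utf32_t *)s;
    assert(ip != NULL);
    if (*ip >= len)
        return DEMO_EOS;
    return p[(*ip)++];
}

/* Counts characters by calling the getter until it reports the end. */
size_t demo_count_chars(const void *s, size_t len, demo_get_next_char_t next)
{
    size_t pos = 0;
    size_t count = 0;
    while (next(s, len, &pos) != DEMO_EOS)
        ++count;
    return count;
}
```

The sketch takes `const void *` directly so that the function already has the callback type and needs no function-pointer cast at the call site.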

2
linebreak/linebreak/purge Executable file
View file

@ -0,0 +1,2 @@
#! /bin/sh
rm -rf Makefile.in aclocal.m4 autom4te.cache/ config.guess config.h.in config.sub configure depcomp doc/ install-sh ltmain.sh missing

View file

@ -0,0 +1,6 @@
#!/usr/bin/env python
import sys
lines = open(sys.argv[1]).readlines()
lines_out = sorted(lines, key=lambda line: int(line.split("..")[0], 16))
sys.stdout.writelines(lines_out)

View file

@ -0,0 +1,437 @@
/* vim: set tabstop=4 shiftwidth=4: */
/*
* Word breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2012 Tom Hacohen <tom@stosb.com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute
* it freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must
* not claim that you wrote the original software. If you use this
* software in a product, an acknowledgement in the product
* documentation would be appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must
* not be misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source
* distribution.
*
* The main reference is Unicode Standard Annex 29 (UAX #29):
* <URL:http://unicode.org/reports/tr29>
*
* When this library was designed, this annex was at Revision 17, for
* Unicode 6.0.0:
* <URL:http://www.unicode.org/reports/tr29/tr29-17.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
*/
/**
* @file wordbreak.c
*
* Implementation of the word breaking algorithm as described in Unicode
* Standard Annex 29.
*
* @version 2.2, 2012/02/04
* @author Tom Hacohen
*/
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include "linebreak.h"
#include "linebreakdef.h"
#include "wordbreak.h"
#include "wordbreakdata.c"
#define ARRAY_LEN(x) (sizeof(x) / sizeof(x[0]))
/**
* Initializes the wordbreak internals. It currently does nothing, but
* it may in the future.
*/
void init_wordbreak(void)
{
}
/**
* Gets the word breaking class of a character.
*
* @param ch character to check
* @param wbp pointer to the wbp breaking properties array
* @param len size of the wbp array in number of items
* @return the word breaking class if found; \c WBP_Any otherwise
*/
static enum WordBreakClass get_char_wb_class(
utf32_t ch,
struct WordBreakProperties *wbp,
size_t len)
{
int min = 0;
int max = len - 1;
int mid;
do
{
mid = (min + max) / 2;
if (ch < wbp[mid].start)
max = mid - 1;
else if (ch > wbp[mid].end)
min = mid + 1;
else
return wbp[mid].prop;
}
while (min <= max);
return WBP_Any;
}
/**
* Sets the word break types to a specific value in a range.
*
* It sets the inside chars to #WORDBREAK_INSIDEACHAR and the rest to brkType.
* Assumes \a brks is initialized - all the cells with #WORDBREAK_NOBREAK are
* cells that we really don't want to break after.
*
* @param[in] s input string
* @param[out] brks breaks array to fill
* @param[in] posStart start position
* @param[in] posEnd end position (exclusive)
* @param[in] len length of the string
* @param[in] brkType breaks type to use
* @param[in] get_next_char function to get the next UTF-32 character
*/
static void set_brks_to(
const void *s,
char *brks,
size_t posStart,
size_t posEnd,
size_t len,
char brkType,
get_next_char_t get_next_char)
{
size_t posNext = posStart;
while (posNext < posEnd)
{
utf32_t ch;
ch = get_next_char(s, len, &posNext);
assert(ch != EOS);
for (; posStart < posNext - 1; ++posStart)
brks[posStart] = WORDBREAK_INSIDEACHAR;
assert(posStart == posNext - 1);
/* Only set it if we haven't set it not to break before. */
if (brks[posStart] != WORDBREAK_NOBREAK)
brks[posStart] = brkType;
posStart = posNext;
}
}
/* Checks to see if the class is newline, CR, or LF (rules WB3a and b). */
#define IS_WB3ab(cls) ((cls == WBP_Newline) || (cls == WBP_CR) || \
(cls == WBP_LF))
/**
* Sets the word breaking information for a generic input string.
*
* @param[in] s input string
* @param[in] len length of the input
* @param[in] lang language of the input
* @param[out] brks pointer to the output breaking data, containing
* #WORDBREAK_BREAK, #WORDBREAK_NOBREAK, or
* #WORDBREAK_INSIDEACHAR
* @param[in] get_next_char function to get the next UTF-32 character
*/
static void set_wordbreaks(
const void *s,
size_t len,
const char *lang,
char *brks,
get_next_char_t get_next_char)
{
enum WordBreakClass wbcLast = WBP_Undefined;
/* wbcSeqStart is the class that started the current sequence.
* WBP_Undefined is a special case that means "start of text".
* This value is the class that is at the start of the current rule
* matching sequence. For example, in case of Numeric+MidNum+Numeric
* it'll be Numeric all the way.
*/
enum WordBreakClass wbcSeqStart = WBP_Undefined;
utf32_t ch;
size_t posNext = 0;
size_t posCur = 0;
size_t posLast = 0;
/* TODO: Language-specific specialization. */
(void) lang;
/* Init brks. */
memset(brks, WORDBREAK_BREAK, len);
ch = get_next_char(s, len, &posNext);
while (ch != EOS)
{
enum WordBreakClass wbcCur;
wbcCur = get_char_wb_class(ch, wb_prop_default,
ARRAY_LEN(wb_prop_default));
switch (wbcCur)
{
case WBP_CR:
/* WB3b */
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_BREAK, get_next_char);
wbcSeqStart = wbcCur;
posLast = posCur;
break;
case WBP_LF:
if (wbcSeqStart == WBP_CR) /* WB3 */
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_NOBREAK, get_next_char);
wbcSeqStart = wbcCur;
posLast = posCur;
break;
}
/* Fall through */
case WBP_Newline:
/* WB3a,3b */
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_BREAK, get_next_char);
wbcSeqStart = wbcCur;
posLast = posCur;
break;
case WBP_Extend:
case WBP_Format:
/* WB4 - If not the first char/after a newline (WB3a,3b), skip
* this class, set it to be the same as the prev, and mark
* brks not to break before them. */
if ((wbcSeqStart == WBP_Undefined) || IS_WB3ab(wbcSeqStart))
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_BREAK, get_next_char);
wbcSeqStart = wbcCur;
}
else
{
/* It's surely not the first */
brks[posCur - 1] = WORDBREAK_NOBREAK;
/* "inherit" the previous class. */
wbcCur = wbcLast;
}
break;
case WBP_Katakana:
if ((wbcSeqStart == WBP_Katakana) || /* WB13 */
(wbcSeqStart == WBP_ExtendNumLet)) /* WB13b */
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_NOBREAK, get_next_char);
}
/* No rule found, reset */
else
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_BREAK, get_next_char);
}
wbcSeqStart = wbcCur;
posLast = posCur;
break;
case WBP_ALetter:
if ((wbcSeqStart == WBP_ALetter) || /* WB5,6,7 */
(wbcLast == WBP_Numeric) || /* WB10 */
(wbcSeqStart == WBP_ExtendNumLet)) /* WB13b */
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_NOBREAK, get_next_char);
}
/* No rule found, reset */
else
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_BREAK, get_next_char);
}
wbcSeqStart = wbcCur;
posLast = posCur;
break;
case WBP_MidNumLet:
if ((wbcLast == WBP_ALetter) || /* WB6,7 */
(wbcLast == WBP_Numeric)) /* WB11,12 */
{
/* Go on */
}
else
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_BREAK, get_next_char);
wbcSeqStart = wbcCur;
posLast = posCur;
}
break;
case WBP_MidLetter:
if (wbcLast == WBP_ALetter) /* WB6,7 */
{
/* Go on */
}
else
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_BREAK, get_next_char);
wbcSeqStart = wbcCur;
posLast = posCur;
}
break;
case WBP_MidNum:
if (wbcLast == WBP_Numeric) /* WB11,12 */
{
/* Go on */
}
else
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_BREAK, get_next_char);
wbcSeqStart = wbcCur;
posLast = posCur;
}
break;
case WBP_Numeric:
if ((wbcSeqStart == WBP_Numeric) || /* WB8,11,12 */
(wbcLast == WBP_ALetter) || /* WB9 */
(wbcSeqStart == WBP_ExtendNumLet)) /* WB13b */
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_NOBREAK, get_next_char);
}
/* No rule found, reset */
else
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_BREAK, get_next_char);
}
wbcSeqStart = wbcCur;
posLast = posCur;
break;
case WBP_ExtendNumLet:
/* WB13a,13b */
if ((wbcSeqStart == wbcLast) &&
((wbcLast == WBP_ALetter) ||
(wbcLast == WBP_Numeric) ||
(wbcLast == WBP_Katakana) ||
(wbcLast == WBP_ExtendNumLet)))
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_NOBREAK, get_next_char);
}
/* No rule found, reset */
else
{
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_BREAK, get_next_char);
}
wbcSeqStart = wbcCur;
posLast = posCur;
break;
case WBP_Any:
/* Allow breaks and reset */
set_brks_to(s, brks, posLast, posCur, len,
WORDBREAK_BREAK, get_next_char);
wbcSeqStart = wbcCur;
posLast = posCur;
break;
default:
/* Error, should never get here! */
assert(0);
break;
}
wbcLast = wbcCur;
posCur = posNext;
ch = get_next_char(s, len, &posNext);
}
/* WB2 */
set_brks_to(s, brks, posLast, posNext, len,
WORDBREAK_BREAK, get_next_char);
}
/**
* Sets the word breaking information for a UTF-8 input string.
*
* @param[in] s input UTF-8 string
* @param[in] len length of the input
* @param[in] lang language of the input
* @param[out] brks pointer to the output breaking data, containing
* #WORDBREAK_BREAK, #WORDBREAK_NOBREAK, or
* #WORDBREAK_INSIDEACHAR
*/
void set_wordbreaks_utf8(
const utf8_t *s,
size_t len,
const char *lang,
char *brks)
{
set_wordbreaks(s, len, lang, brks,
(get_next_char_t)lb_get_next_char_utf8);
}
/**
* Sets the word breaking information for a UTF-16 input string.
*
* @param[in] s input UTF-16 string
* @param[in] len length of the input
* @param[in] lang language of the input
* @param[out] brks pointer to the output breaking data, containing
* #WORDBREAK_BREAK, #WORDBREAK_NOBREAK, or
* #WORDBREAK_INSIDEACHAR
*/
void set_wordbreaks_utf16(
const utf16_t *s,
size_t len,
const char *lang,
char *brks)
{
set_wordbreaks(s, len, lang, brks,
(get_next_char_t)lb_get_next_char_utf16);
}
/**
* Sets the word breaking information for a UTF-32 input string.
*
* @param[in] s input UTF-32 string
* @param[in] len length of the input
* @param[in] lang language of the input
* @param[out] brks pointer to the output breaking data, containing
* #WORDBREAK_BREAK, #WORDBREAK_NOBREAK, or
* #WORDBREAK_INSIDEACHAR
*/
void set_wordbreaks_utf32(
const utf32_t *s,
size_t len,
const char *lang,
char *brks)
{
set_wordbreaks(s, len, lang, brks,
(get_next_char_t)lb_get_next_char_utf32);
}
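
The class lookup in `get_char_wb_class` is a binary search over ranges sorted by code point. The block below is a standalone sketch of that technique, not the library's code: it uses a half-open interval with `size_t` indices instead of the `int` `do`/`while` form above, and a hypothetical reduced enum and table (`Demo`/`demo_`/`DEMO_` names) with a few rows copied from `wb_prop_default`.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int utf32_t;

/* Reduced stand-in for enum WordBreakClass (hypothetical subset). */
enum DemoWBClass {
    DEMO_WBP_Any,
    DEMO_WBP_LF,
    DEMO_WBP_Newline,
    DEMO_WBP_CR,
    DEMO_WBP_Numeric,
    DEMO_WBP_ALetter
};

struct DemoWBRange {
    utf32_t start;
    utf32_t end;
    enum DemoWBClass prop;
};

/* A few rows copied from wb_prop_default; must stay sorted. */
static const struct DemoWBRange demo_ranges[] = {
    { 0x000A, 0x000A, DEMO_WBP_LF },
    { 0x000B, 0x000C, DEMO_WBP_Newline },
    { 0x000D, 0x000D, DEMO_WBP_CR },
    { 0x0030, 0x0039, DEMO_WBP_Numeric },
    { 0x0041, 0x005A, DEMO_WBP_ALetter },
    { 0x0061, 0x007A, DEMO_WBP_ALetter }
};

#define DEMO_RANGE_COUNT (sizeof(demo_ranges) / sizeof(demo_ranges[0]))

/* Binary search over [lo, hi); returns DEMO_WBP_Any when no range matches. */
enum DemoWBClass demo_lookup(utf32_t ch, const struct DemoWBRange *tbl, size_t n)
{
    size_t lo = 0;
    size_t hi = n;
    assert(tbl != NULL || n == 0);
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;
        if (ch < tbl[mid].start)
            hi = mid;
        else if (ch > tbl[mid].end)
            lo = mid + 1;
        else
            return tbl[mid].prop;
    }
    return DEMO_WBP_Any;
}
```

The half-open form never evaluates `mid - 1`, so it also handles an empty table without relying on a signed index.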

View file

@ -0,0 +1,72 @@
/* vim: set tabstop=4 shiftwidth=4: */
/*
* Word breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2012 Tom Hacohen <tom@stosb.com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute
* it freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must
* not claim that you wrote the original software. If you use this
* software in a product, an acknowledgement in the product
* documentation would be appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must
* not be misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source
* distribution.
*
* The main reference is Unicode Standard Annex 29 (UAX #29):
* <URL:http://unicode.org/reports/tr29>
*
* When this library was designed, this annex was at Revision 17, for
* Unicode 6.0.0:
* <URL:http://www.unicode.org/reports/tr29/tr29-17.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
*/
/**
* @file wordbreak.h
*
* Header file for the word breaking (segmentation) algorithm.
*
* @version 2.2, 2012/02/04
* @author Tom Hacohen
*/
#ifndef WORDBREAK_H
#define WORDBREAK_H
#include <stddef.h>
#include "linebreak.h"
#ifdef __cplusplus
extern "C" {
#endif
#define WORDBREAK_BREAK 0 /**< Break is allowed */
#define WORDBREAK_NOBREAK 1 /**< No break is allowed */
#define WORDBREAK_INSIDEACHAR 2 /**< A UTF-8/16 sequence is unfinished */
void init_wordbreak(void);
void set_wordbreaks_utf8(
const utf8_t *s, size_t len, const char* lang, char *brks);
void set_wordbreaks_utf16(
const utf16_t *s, size_t len, const char* lang, char *brks);
void set_wordbreaks_utf32(
const utf32_t *s, size_t len, const char* lang, char *brks);
#ifdef __cplusplus
}
#endif
#endif
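
A caller of the `set_wordbreaks_*` functions receives one flag per code unit in `brks`. As a minimal sketch of how that array might be consumed, the helpers below treat `WORDBREAK_BREAK` at index `i` as "a break is allowed after unit `i`", which is how `set_brks_to` in wordbreak.c describes the cells. The `demo_` names are hypothetical and the constants are redefined locally with the header's values so the block compiles without the library.

```c
#include <assert.h>
#include <stddef.h>

/* Same values as in wordbreak.h, renamed for a standalone sketch. */
#define DEMO_WORDBREAK_BREAK 0
#define DEMO_WORDBREAK_NOBREAK 1
#define DEMO_WORDBREAK_INSIDEACHAR 2

/* Number of segments: one per position where a break is allowed. */
size_t demo_count_segments(const char *brks, size_t len)
{
    size_t i;
    size_t count = 0;
    assert(brks != NULL || len == 0);
    for (i = 0; i < len; ++i)
    {
        if (brks[i] == DEMO_WORDBREAK_BREAK)
            ++count;
    }
    return count;
}

/* Exclusive end index of the segment that begins at `start`. */
size_t demo_segment_end(const char *brks, size_t len, size_t start)
{
    size_t i = start;
    assert(brks != NULL || len == 0);
    while (i < len && brks[i] != DEMO_WORDBREAK_BREAK)
        ++i;
    return (i < len) ? i + 1 : len;
}
```

For a hand-written flag array `{1, 0, 0, 1, 0}` over the five bytes of "hi yo", these helpers report three segments: bytes 0 to 1, byte 2, and bytes 3 to 4.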

View file

@ -0,0 +1,860 @@
/* The content of this file is generated from:
# WordBreakProperty-6.0.0.txt
# Date: 2010-08-19, 00:48:48 GMT [MD]
*/
#include "linebreak.h"
#include "wordbreakdef.h"
static struct WordBreakProperties wb_prop_default[] = {
{0x000A, 0x000A, WBP_LF},
{0x000B, 0x000C, WBP_Newline},
{0x000D, 0x000D, WBP_CR},
{0x0027, 0x0027, WBP_MidNumLet},
{0x002C, 0x002C, WBP_MidNum},
{0x002E, 0x002E, WBP_MidNumLet},
{0x0030, 0x0039, WBP_Numeric},
{0x003A, 0x003A, WBP_MidLetter},
{0x003B, 0x003B, WBP_MidNum},
{0x0041, 0x005A, WBP_ALetter},
{0x005F, 0x005F, WBP_ExtendNumLet},
{0x0061, 0x007A, WBP_ALetter},
{0x0085, 0x0085, WBP_Newline},
{0x00AA, 0x00AA, WBP_ALetter},
{0x00AD, 0x00AD, WBP_Format},
{0x00B5, 0x00B5, WBP_ALetter},
{0x00B7, 0x00B7, WBP_MidLetter},
{0x00BA, 0x00BA, WBP_ALetter},
{0x00C0, 0x00D6, WBP_ALetter},
{0x00D8, 0x00F6, WBP_ALetter},
{0x00F8, 0x01BA, WBP_ALetter},
{0x01BB, 0x01BB, WBP_ALetter},
{0x01BC, 0x01BF, WBP_ALetter},
{0x01C0, 0x01C3, WBP_ALetter},
{0x01C4, 0x0293, WBP_ALetter},
{0x0294, 0x0294, WBP_ALetter},
{0x0295, 0x02AF, WBP_ALetter},
{0x02B0, 0x02C1, WBP_ALetter},
{0x02C6, 0x02D1, WBP_ALetter},
{0x02E0, 0x02E4, WBP_ALetter},
{0x02EC, 0x02EC, WBP_ALetter},
{0x02EE, 0x02EE, WBP_ALetter},
{0x0300, 0x036F, WBP_Extend},
{0x0370, 0x0373, WBP_ALetter},
{0x0374, 0x0374, WBP_ALetter},
{0x0376, 0x0377, WBP_ALetter},
{0x037A, 0x037A, WBP_ALetter},
{0x037B, 0x037D, WBP_ALetter},
{0x037E, 0x037E, WBP_MidNum},
{0x0386, 0x0386, WBP_ALetter},
{0x0387, 0x0387, WBP_MidLetter},
{0x0388, 0x038A, WBP_ALetter},
{0x038C, 0x038C, WBP_ALetter},
{0x038E, 0x03A1, WBP_ALetter},
{0x03A3, 0x03F5, WBP_ALetter},
{0x03F7, 0x0481, WBP_ALetter},
{0x0483, 0x0487, WBP_Extend},
{0x0488, 0x0489, WBP_Extend},
{0x048A, 0x0527, WBP_ALetter},
{0x0531, 0x0556, WBP_ALetter},
{0x0559, 0x0559, WBP_ALetter},
{0x0561, 0x0587, WBP_ALetter},
{0x0589, 0x0589, WBP_MidNum},
{0x0591, 0x05BD, WBP_Extend},
{0x05BF, 0x05BF, WBP_Extend},
{0x05C1, 0x05C2, WBP_Extend},
{0x05C4, 0x05C5, WBP_Extend},
{0x05C7, 0x05C7, WBP_Extend},
{0x05D0, 0x05EA, WBP_ALetter},
{0x05F0, 0x05F2, WBP_ALetter},
{0x05F3, 0x05F3, WBP_ALetter},
{0x05F4, 0x05F4, WBP_MidLetter},
{0x0600, 0x0603, WBP_Format},
{0x060C, 0x060D, WBP_MidNum},
{0x0610, 0x061A, WBP_Extend},
{0x0620, 0x063F, WBP_ALetter},
{0x0640, 0x0640, WBP_ALetter},
{0x0641, 0x064A, WBP_ALetter},
{0x064B, 0x065F, WBP_Extend},
{0x0660, 0x0669, WBP_Numeric},
{0x066B, 0x066B, WBP_Numeric},
{0x066C, 0x066C, WBP_MidNum},
{0x066E, 0x066F, WBP_ALetter},
{0x0670, 0x0670, WBP_Extend},
{0x0671, 0x06D3, WBP_ALetter},
{0x06D5, 0x06D5, WBP_ALetter},
{0x06D6, 0x06DC, WBP_Extend},
{0x06DD, 0x06DD, WBP_Format},
{0x06DF, 0x06E4, WBP_Extend},
{0x06E5, 0x06E6, WBP_ALetter},
{0x06E7, 0x06E8, WBP_Extend},
{0x06EA, 0x06ED, WBP_Extend},
{0x06EE, 0x06EF, WBP_ALetter},
{0x06F0, 0x06F9, WBP_Numeric},
{0x06FA, 0x06FC, WBP_ALetter},
{0x06FF, 0x06FF, WBP_ALetter},
{0x070F, 0x070F, WBP_Format},
{0x0710, 0x0710, WBP_ALetter},
{0x0711, 0x0711, WBP_Extend},
{0x0712, 0x072F, WBP_ALetter},
{0x0730, 0x074A, WBP_Extend},
{0x074D, 0x07A5, WBP_ALetter},
{0x07A6, 0x07B0, WBP_Extend},
{0x07B1, 0x07B1, WBP_ALetter},
{0x07C0, 0x07C9, WBP_Numeric},
{0x07CA, 0x07EA, WBP_ALetter},
{0x07EB, 0x07F3, WBP_Extend},
{0x07F4, 0x07F5, WBP_ALetter},
{0x07F8, 0x07F8, WBP_MidNum},
{0x07FA, 0x07FA, WBP_ALetter},
{0x0800, 0x0815, WBP_ALetter},
{0x0816, 0x0819, WBP_Extend},
{0x081A, 0x081A, WBP_ALetter},
{0x081B, 0x0823, WBP_Extend},
{0x0824, 0x0824, WBP_ALetter},
{0x0825, 0x0827, WBP_Extend},
{0x0828, 0x0828, WBP_ALetter},
{0x0829, 0x082D, WBP_Extend},
{0x0840, 0x0858, WBP_ALetter},
{0x0859, 0x085B, WBP_Extend},
{0x0900, 0x0902, WBP_Extend},
{0x0903, 0x0903, WBP_Extend},
{0x0904, 0x0939, WBP_ALetter},
{0x093A, 0x093A, WBP_Extend},
{0x093B, 0x093B, WBP_Extend},
{0x093C, 0x093C, WBP_Extend},
{0x093D, 0x093D, WBP_ALetter},
{0x093E, 0x0940, WBP_Extend},
{0x0941, 0x0948, WBP_Extend},
{0x0949, 0x094C, WBP_Extend},
{0x094D, 0x094D, WBP_Extend},
{0x094E, 0x094F, WBP_Extend},
{0x0950, 0x0950, WBP_ALetter},
{0x0951, 0x0957, WBP_Extend},
{0x0958, 0x0961, WBP_ALetter},
{0x0962, 0x0963, WBP_Extend},
{0x0966, 0x096F, WBP_Numeric},
{0x0971, 0x0971, WBP_ALetter},
{0x0972, 0x0977, WBP_ALetter},
{0x0979, 0x097F, WBP_ALetter},
{0x0981, 0x0981, WBP_Extend},
{0x0982, 0x0983, WBP_Extend},
{0x0985, 0x098C, WBP_ALetter},
{0x098F, 0x0990, WBP_ALetter},
{0x0993, 0x09A8, WBP_ALetter},
{0x09AA, 0x09B0, WBP_ALetter},
{0x09B2, 0x09B2, WBP_ALetter},
{0x09B6, 0x09B9, WBP_ALetter},
{0x09BC, 0x09BC, WBP_Extend},
{0x09BD, 0x09BD, WBP_ALetter},
{0x09BE, 0x09C0, WBP_Extend},
{0x09C1, 0x09C4, WBP_Extend},
{0x09C7, 0x09C8, WBP_Extend},
{0x09CB, 0x09CC, WBP_Extend},
{0x09CD, 0x09CD, WBP_Extend},
{0x09CE, 0x09CE, WBP_ALetter},
{0x09D7, 0x09D7, WBP_Extend},
{0x09DC, 0x09DD, WBP_ALetter},
{0x09DF, 0x09E1, WBP_ALetter},
{0x09E2, 0x09E3, WBP_Extend},
{0x09E6, 0x09EF, WBP_Numeric},
{0x09F0, 0x09F1, WBP_ALetter},
{0x0A01, 0x0A02, WBP_Extend},
{0x0A03, 0x0A03, WBP_Extend},
{0x0A05, 0x0A0A, WBP_ALetter},
{0x0A0F, 0x0A10, WBP_ALetter},
{0x0A13, 0x0A28, WBP_ALetter},
{0x0A2A, 0x0A30, WBP_ALetter},
{0x0A32, 0x0A33, WBP_ALetter},
{0x0A35, 0x0A36, WBP_ALetter},
{0x0A38, 0x0A39, WBP_ALetter},
{0x0A3C, 0x0A3C, WBP_Extend},
{0x0A3E, 0x0A40, WBP_Extend},
{0x0A41, 0x0A42, WBP_Extend},
{0x0A47, 0x0A48, WBP_Extend},
{0x0A4B, 0x0A4D, WBP_Extend},
{0x0A51, 0x0A51, WBP_Extend},
{0x0A59, 0x0A5C, WBP_ALetter},
{0x0A5E, 0x0A5E, WBP_ALetter},
{0x0A66, 0x0A6F, WBP_Numeric},
{0x0A70, 0x0A71, WBP_Extend},
{0x0A72, 0x0A74, WBP_ALetter},
{0x0A75, 0x0A75, WBP_Extend},
{0x0A81, 0x0A82, WBP_Extend},
{0x0A83, 0x0A83, WBP_Extend},
{0x0A85, 0x0A8D, WBP_ALetter},
{0x0A8F, 0x0A91, WBP_ALetter},
{0x0A93, 0x0AA8, WBP_ALetter},
{0x0AAA, 0x0AB0, WBP_ALetter},
{0x0AB2, 0x0AB3, WBP_ALetter},
{0x0AB5, 0x0AB9, WBP_ALetter},
{0x0ABC, 0x0ABC, WBP_Extend},
{0x0ABD, 0x0ABD, WBP_ALetter},
{0x0ABE, 0x0AC0, WBP_Extend},
{0x0AC1, 0x0AC5, WBP_Extend},
{0x0AC7, 0x0AC8, WBP_Extend},
{0x0AC9, 0x0AC9, WBP_Extend},
{0x0ACB, 0x0ACC, WBP_Extend},
{0x0ACD, 0x0ACD, WBP_Extend},
{0x0AD0, 0x0AD0, WBP_ALetter},
{0x0AE0, 0x0AE1, WBP_ALetter},
{0x0AE2, 0x0AE3, WBP_Extend},
{0x0AE6, 0x0AEF, WBP_Numeric},
{0x0B01, 0x0B01, WBP_Extend},
{0x0B02, 0x0B03, WBP_Extend},
{0x0B05, 0x0B0C, WBP_ALetter},
{0x0B0F, 0x0B10, WBP_ALetter},
{0x0B13, 0x0B28, WBP_ALetter},
{0x0B2A, 0x0B30, WBP_ALetter},
{0x0B32, 0x0B33, WBP_ALetter},
{0x0B35, 0x0B39, WBP_ALetter},
{0x0B3C, 0x0B3C, WBP_Extend},
{0x0B3D, 0x0B3D, WBP_ALetter},
{0x0B3E, 0x0B3E, WBP_Extend},
{0x0B3F, 0x0B3F, WBP_Extend},
{0x0B40, 0x0B40, WBP_Extend},
{0x0B41, 0x0B44, WBP_Extend},
{0x0B47, 0x0B48, WBP_Extend},
{0x0B4B, 0x0B4C, WBP_Extend},
{0x0B4D, 0x0B4D, WBP_Extend},
{0x0B56, 0x0B56, WBP_Extend},
{0x0B57, 0x0B57, WBP_Extend},
{0x0B5C, 0x0B5D, WBP_ALetter},
{0x0B5F, 0x0B61, WBP_ALetter},
{0x0B62, 0x0B63, WBP_Extend},
{0x0B66, 0x0B6F, WBP_Numeric},
{0x0B71, 0x0B71, WBP_ALetter},
{0x0B82, 0x0B82, WBP_Extend},
{0x0B83, 0x0B83, WBP_ALetter},
{0x0B85, 0x0B8A, WBP_ALetter},
{0x0B8E, 0x0B90, WBP_ALetter},
{0x0B92, 0x0B95, WBP_ALetter},
{0x0B99, 0x0B9A, WBP_ALetter},
{0x0B9C, 0x0B9C, WBP_ALetter},
{0x0B9E, 0x0B9F, WBP_ALetter},
{0x0BA3, 0x0BA4, WBP_ALetter},
{0x0BA8, 0x0BAA, WBP_ALetter},
{0x0BAE, 0x0BB9, WBP_ALetter},
{0x0BBE, 0x0BBF, WBP_Extend},
{0x0BC0, 0x0BC0, WBP_Extend},
{0x0BC1, 0x0BC2, WBP_Extend},
{0x0BC6, 0x0BC8, WBP_Extend},
{0x0BCA, 0x0BCC, WBP_Extend},
{0x0BCD, 0x0BCD, WBP_Extend},
{0x0BD0, 0x0BD0, WBP_ALetter},
{0x0BD7, 0x0BD7, WBP_Extend},
{0x0BE6, 0x0BEF, WBP_Numeric},
{0x0C01, 0x0C03, WBP_Extend},
{0x0C05, 0x0C0C, WBP_ALetter},
{0x0C0E, 0x0C10, WBP_ALetter},
{0x0C12, 0x0C28, WBP_ALetter},
{0x0C2A, 0x0C33, WBP_ALetter},
{0x0C35, 0x0C39, WBP_ALetter},
{0x0C3D, 0x0C3D, WBP_ALetter},
{0x0C3E, 0x0C40, WBP_Extend},
{0x0C41, 0x0C44, WBP_Extend},
{0x0C46, 0x0C48, WBP_Extend},
{0x0C4A, 0x0C4D, WBP_Extend},
{0x0C55, 0x0C56, WBP_Extend},
{0x0C58, 0x0C59, WBP_ALetter},
{0x0C60, 0x0C61, WBP_ALetter},
{0x0C62, 0x0C63, WBP_Extend},
{0x0C66, 0x0C6F, WBP_Numeric},
{0x0C82, 0x0C83, WBP_Extend},
{0x0C85, 0x0C8C, WBP_ALetter},
{0x0C8E, 0x0C90, WBP_ALetter},
{0x0C92, 0x0CA8, WBP_ALetter},
{0x0CAA, 0x0CB3, WBP_ALetter},
{0x0CB5, 0x0CB9, WBP_ALetter},
{0x0CBC, 0x0CBC, WBP_Extend},
{0x0CBD, 0x0CBD, WBP_ALetter},
{0x0CBE, 0x0CBE, WBP_Extend},
{0x0CBF, 0x0CBF, WBP_Extend},
{0x0CC0, 0x0CC4, WBP_Extend},
{0x0CC6, 0x0CC6, WBP_Extend},
{0x0CC7, 0x0CC8, WBP_Extend},
{0x0CCA, 0x0CCB, WBP_Extend},
{0x0CCC, 0x0CCD, WBP_Extend},
{0x0CD5, 0x0CD6, WBP_Extend},
{0x0CDE, 0x0CDE, WBP_ALetter},
{0x0CE0, 0x0CE1, WBP_ALetter},
{0x0CE2, 0x0CE3, WBP_Extend},
{0x0CE6, 0x0CEF, WBP_Numeric},
{0x0CF1, 0x0CF2, WBP_ALetter},
{0x0D02, 0x0D03, WBP_Extend},
{0x0D05, 0x0D0C, WBP_ALetter},
{0x0D0E, 0x0D10, WBP_ALetter},
{0x0D12, 0x0D3A, WBP_ALetter},
{0x0D3D, 0x0D3D, WBP_ALetter},
{0x0D3E, 0x0D40, WBP_Extend},
{0x0D41, 0x0D44, WBP_Extend},
{0x0D46, 0x0D48, WBP_Extend},
{0x0D4A, 0x0D4C, WBP_Extend},
{0x0D4D, 0x0D4D, WBP_Extend},
{0x0D4E, 0x0D4E, WBP_ALetter},
{0x0D57, 0x0D57, WBP_Extend},
{0x0D60, 0x0D61, WBP_ALetter},
{0x0D62, 0x0D63, WBP_Extend},
{0x0D66, 0x0D6F, WBP_Numeric},
{0x0D7A, 0x0D7F, WBP_ALetter},
{0x0D82, 0x0D83, WBP_Extend},
{0x0D85, 0x0D96, WBP_ALetter},
{0x0D9A, 0x0DB1, WBP_ALetter},
{0x0DB3, 0x0DBB, WBP_ALetter},
{0x0DBD, 0x0DBD, WBP_ALetter},
{0x0DC0, 0x0DC6, WBP_ALetter},
{0x0DCA, 0x0DCA, WBP_Extend},
{0x0DCF, 0x0DD1, WBP_Extend},
{0x0DD2, 0x0DD4, WBP_Extend},
{0x0DD6, 0x0DD6, WBP_Extend},
{0x0DD8, 0x0DDF, WBP_Extend},
{0x0DF2, 0x0DF3, WBP_Extend},
{0x0E31, 0x0E31, WBP_Extend},
{0x0E34, 0x0E3A, WBP_Extend},
{0x0E47, 0x0E4E, WBP_Extend},
{0x0E50, 0x0E59, WBP_Numeric},
{0x0EB1, 0x0EB1, WBP_Extend},
{0x0EB4, 0x0EB9, WBP_Extend},
{0x0EBB, 0x0EBC, WBP_Extend},
{0x0EC8, 0x0ECD, WBP_Extend},
{0x0ED0, 0x0ED9, WBP_Numeric},
{0x0F00, 0x0F00, WBP_ALetter},
{0x0F18, 0x0F19, WBP_Extend},
{0x0F20, 0x0F29, WBP_Numeric},
{0x0F35, 0x0F35, WBP_Extend},
{0x0F37, 0x0F37, WBP_Extend},
{0x0F39, 0x0F39, WBP_Extend},
{0x0F3E, 0x0F3F, WBP_Extend},
{0x0F40, 0x0F47, WBP_ALetter},
{0x0F49, 0x0F6C, WBP_ALetter},
{0x0F71, 0x0F7E, WBP_Extend},
{0x0F7F, 0x0F7F, WBP_Extend},
{0x0F80, 0x0F84, WBP_Extend},
{0x0F86, 0x0F87, WBP_Extend},
{0x0F88, 0x0F8C, WBP_ALetter},
{0x0F8D, 0x0F97, WBP_Extend},
{0x0F99, 0x0FBC, WBP_Extend},
{0x0FC6, 0x0FC6, WBP_Extend},
{0x102B, 0x102C, WBP_Extend},
{0x102D, 0x1030, WBP_Extend},
{0x1031, 0x1031, WBP_Extend},
{0x1032, 0x1037, WBP_Extend},
{0x1038, 0x1038, WBP_Extend},
{0x1039, 0x103A, WBP_Extend},
{0x103B, 0x103C, WBP_Extend},
{0x103D, 0x103E, WBP_Extend},
{0x1040, 0x1049, WBP_Numeric},
{0x1056, 0x1057, WBP_Extend},
{0x1058, 0x1059, WBP_Extend},
{0x105E, 0x1060, WBP_Extend},
{0x1062, 0x1064, WBP_Extend},
{0x1067, 0x106D, WBP_Extend},
{0x1071, 0x1074, WBP_Extend},
{0x1082, 0x1082, WBP_Extend},
{0x1083, 0x1084, WBP_Extend},
{0x1085, 0x1086, WBP_Extend},
{0x1087, 0x108C, WBP_Extend},
{0x108D, 0x108D, WBP_Extend},
{0x108F, 0x108F, WBP_Extend},
{0x1090, 0x1099, WBP_Numeric},
{0x109A, 0x109C, WBP_Extend},
{0x109D, 0x109D, WBP_Extend},
{0x10A0, 0x10C5, WBP_ALetter},
{0x10D0, 0x10FA, WBP_ALetter},
{0x10FC, 0x10FC, WBP_ALetter},
{0x1100, 0x1248, WBP_ALetter},
{0x124A, 0x124D, WBP_ALetter},
{0x1250, 0x1256, WBP_ALetter},
{0x1258, 0x1258, WBP_ALetter},
{0x125A, 0x125D, WBP_ALetter},
{0x1260, 0x1288, WBP_ALetter},
{0x128A, 0x128D, WBP_ALetter},
{0x1290, 0x12B0, WBP_ALetter},
{0x12B2, 0x12B5, WBP_ALetter},
{0x12B8, 0x12BE, WBP_ALetter},
{0x12C0, 0x12C0, WBP_ALetter},
{0x12C2, 0x12C5, WBP_ALetter},
{0x12C8, 0x12D6, WBP_ALetter},
{0x12D8, 0x1310, WBP_ALetter},
{0x1312, 0x1315, WBP_ALetter},
{0x1318, 0x135A, WBP_ALetter},
{0x135D, 0x135F, WBP_Extend},
{0x1380, 0x138F, WBP_ALetter},
{0x13A0, 0x13F4, WBP_ALetter},
{0x1401, 0x166C, WBP_ALetter},
{0x166F, 0x167F, WBP_ALetter},
{0x1681, 0x169A, WBP_ALetter},
{0x16A0, 0x16EA, WBP_ALetter},
{0x16EE, 0x16F0, WBP_ALetter},
{0x1700, 0x170C, WBP_ALetter},
{0x170E, 0x1711, WBP_ALetter},
{0x1712, 0x1714, WBP_Extend},
{0x1720, 0x1731, WBP_ALetter},
{0x1732, 0x1734, WBP_Extend},
{0x1740, 0x1751, WBP_ALetter},
{0x1752, 0x1753, WBP_Extend},
{0x1760, 0x176C, WBP_ALetter},
{0x176E, 0x1770, WBP_ALetter},
{0x1772, 0x1773, WBP_Extend},
{0x17B4, 0x17B5, WBP_Format},
{0x17B6, 0x17B6, WBP_Extend},
{0x17B7, 0x17BD, WBP_Extend},
{0x17BE, 0x17C5, WBP_Extend},
{0x17C6, 0x17C6, WBP_Extend},
{0x17C7, 0x17C8, WBP_Extend},
{0x17C9, 0x17D3, WBP_Extend},
{0x17DD, 0x17DD, WBP_Extend},
{0x17E0, 0x17E9, WBP_Numeric},
{0x180B, 0x180D, WBP_Extend},
{0x1810, 0x1819, WBP_Numeric},
{0x1820, 0x1842, WBP_ALetter},
{0x1843, 0x1843, WBP_ALetter},
{0x1844, 0x1877, WBP_ALetter},
{0x1880, 0x18A8, WBP_ALetter},
{0x18A9, 0x18A9, WBP_Extend},
{0x18AA, 0x18AA, WBP_ALetter},
{0x18B0, 0x18F5, WBP_ALetter},
{0x1900, 0x191C, WBP_ALetter},
{0x1920, 0x1922, WBP_Extend},
{0x1923, 0x1926, WBP_Extend},
{0x1927, 0x1928, WBP_Extend},
{0x1929, 0x192B, WBP_Extend},
{0x1930, 0x1931, WBP_Extend},
{0x1932, 0x1932, WBP_Extend},
{0x1933, 0x1938, WBP_Extend},
{0x1939, 0x193B, WBP_Extend},
{0x1946, 0x194F, WBP_Numeric},
{0x19B0, 0x19C0, WBP_Extend},
{0x19C8, 0x19C9, WBP_Extend},
{0x19D0, 0x19D9, WBP_Numeric},
{0x1A00, 0x1A16, WBP_ALetter},
{0x1A17, 0x1A18, WBP_Extend},
{0x1A19, 0x1A1B, WBP_Extend},
{0x1A55, 0x1A55, WBP_Extend},
{0x1A56, 0x1A56, WBP_Extend},
{0x1A57, 0x1A57, WBP_Extend},
{0x1A58, 0x1A5E, WBP_Extend},
{0x1A60, 0x1A60, WBP_Extend},
{0x1A61, 0x1A61, WBP_Extend},
{0x1A62, 0x1A62, WBP_Extend},
{0x1A63, 0x1A64, WBP_Extend},
{0x1A65, 0x1A6C, WBP_Extend},
{0x1A6D, 0x1A72, WBP_Extend},
{0x1A73, 0x1A7C, WBP_Extend},
{0x1A7F, 0x1A7F, WBP_Extend},
{0x1A80, 0x1A89, WBP_Numeric},
{0x1A90, 0x1A99, WBP_Numeric},
{0x1B00, 0x1B03, WBP_Extend},
{0x1B04, 0x1B04, WBP_Extend},
{0x1B05, 0x1B33, WBP_ALetter},
{0x1B34, 0x1B34, WBP_Extend},
{0x1B35, 0x1B35, WBP_Extend},
{0x1B36, 0x1B3A, WBP_Extend},
{0x1B3B, 0x1B3B, WBP_Extend},
{0x1B3C, 0x1B3C, WBP_Extend},
{0x1B3D, 0x1B41, WBP_Extend},
{0x1B42, 0x1B42, WBP_Extend},
{0x1B43, 0x1B44, WBP_Extend},
{0x1B45, 0x1B4B, WBP_ALetter},
{0x1B50, 0x1B59, WBP_Numeric},
{0x1B6B, 0x1B73, WBP_Extend},
{0x1B80, 0x1B81, WBP_Extend},
{0x1B82, 0x1B82, WBP_Extend},
{0x1B83, 0x1BA0, WBP_ALetter},
{0x1BA1, 0x1BA1, WBP_Extend},
{0x1BA2, 0x1BA5, WBP_Extend},
{0x1BA6, 0x1BA7, WBP_Extend},
{0x1BA8, 0x1BA9, WBP_Extend},
{0x1BAA, 0x1BAA, WBP_Extend},
{0x1BAE, 0x1BAF, WBP_ALetter},
{0x1BB0, 0x1BB9, WBP_Numeric},
{0x1BC0, 0x1BE5, WBP_ALetter},
{0x1BE6, 0x1BE6, WBP_Extend},
{0x1BE7, 0x1BE7, WBP_Extend},
{0x1BE8, 0x1BE9, WBP_Extend},
{0x1BEA, 0x1BEC, WBP_Extend},
{0x1BED, 0x1BED, WBP_Extend},
{0x1BEE, 0x1BEE, WBP_Extend},
{0x1BEF, 0x1BF1, WBP_Extend},
{0x1BF2, 0x1BF3, WBP_Extend},
{0x1C00, 0x1C23, WBP_ALetter},
{0x1C24, 0x1C2B, WBP_Extend},
{0x1C2C, 0x1C33, WBP_Extend},
{0x1C34, 0x1C35, WBP_Extend},
{0x1C36, 0x1C37, WBP_Extend},
{0x1C40, 0x1C49, WBP_Numeric},
{0x1C4D, 0x1C4F, WBP_ALetter},
{0x1C50, 0x1C59, WBP_Numeric},
{0x1C5A, 0x1C77, WBP_ALetter},
{0x1C78, 0x1C7D, WBP_ALetter},
{0x1CD0, 0x1CD2, WBP_Extend},
{0x1CD4, 0x1CE0, WBP_Extend},
{0x1CE1, 0x1CE1, WBP_Extend},
{0x1CE2, 0x1CE8, WBP_Extend},
{0x1CE9, 0x1CEC, WBP_ALetter},
{0x1CED, 0x1CED, WBP_Extend},
{0x1CEE, 0x1CF1, WBP_ALetter},
{0x1CF2, 0x1CF2, WBP_Extend},
{0x1D00, 0x1D2B, WBP_ALetter},
{0x1D2C, 0x1D61, WBP_ALetter},
{0x1D62, 0x1D77, WBP_ALetter},
{0x1D78, 0x1D78, WBP_ALetter},
{0x1D79, 0x1D9A, WBP_ALetter},
{0x1D9B, 0x1DBF, WBP_ALetter},
{0x1DC0, 0x1DE6, WBP_Extend},
{0x1DFC, 0x1DFF, WBP_Extend},
{0x1E00, 0x1F15, WBP_ALetter},
{0x1F18, 0x1F1D, WBP_ALetter},
{0x1F20, 0x1F45, WBP_ALetter},
{0x1F48, 0x1F4D, WBP_ALetter},
{0x1F50, 0x1F57, WBP_ALetter},
{0x1F59, 0x1F59, WBP_ALetter},
{0x1F5B, 0x1F5B, WBP_ALetter},
{0x1F5D, 0x1F5D, WBP_ALetter},
{0x1F5F, 0x1F7D, WBP_ALetter},
{0x1F80, 0x1FB4, WBP_ALetter},
{0x1FB6, 0x1FBC, WBP_ALetter},
{0x1FBE, 0x1FBE, WBP_ALetter},
{0x1FC2, 0x1FC4, WBP_ALetter},
{0x1FC6, 0x1FCC, WBP_ALetter},
{0x1FD0, 0x1FD3, WBP_ALetter},
{0x1FD6, 0x1FDB, WBP_ALetter},
{0x1FE0, 0x1FEC, WBP_ALetter},
{0x1FF2, 0x1FF4, WBP_ALetter},
{0x1FF6, 0x1FFC, WBP_ALetter},
{0x200C, 0x200D, WBP_Extend},
{0x200E, 0x200F, WBP_Format},
{0x2018, 0x2018, WBP_MidNumLet},
{0x2019, 0x2019, WBP_MidNumLet},
{0x2024, 0x2024, WBP_MidNumLet},
{0x2027, 0x2027, WBP_MidLetter},
{0x2028, 0x2028, WBP_Newline},
{0x2029, 0x2029, WBP_Newline},
{0x202A, 0x202E, WBP_Format},
{0x203F, 0x2040, WBP_ExtendNumLet},
{0x2044, 0x2044, WBP_MidNum},
{0x2054, 0x2054, WBP_ExtendNumLet},
{0x2060, 0x2064, WBP_Format},
{0x206A, 0x206F, WBP_Format},
{0x2071, 0x2071, WBP_ALetter},
{0x207F, 0x207F, WBP_ALetter},
{0x2090, 0x209C, WBP_ALetter},
{0x20D0, 0x20DC, WBP_Extend},
{0x20DD, 0x20E0, WBP_Extend},
{0x20E1, 0x20E1, WBP_Extend},
{0x20E2, 0x20E4, WBP_Extend},
{0x20E5, 0x20F0, WBP_Extend},
{0x2102, 0x2102, WBP_ALetter},
{0x2107, 0x2107, WBP_ALetter},
{0x210A, 0x2113, WBP_ALetter},
{0x2115, 0x2115, WBP_ALetter},
{0x2119, 0x211D, WBP_ALetter},
{0x2124, 0x2124, WBP_ALetter},
{0x2126, 0x2126, WBP_ALetter},
{0x2128, 0x2128, WBP_ALetter},
{0x212A, 0x212D, WBP_ALetter},
{0x212F, 0x2134, WBP_ALetter},
{0x2135, 0x2138, WBP_ALetter},
{0x2139, 0x2139, WBP_ALetter},
{0x213C, 0x213F, WBP_ALetter},
{0x2145, 0x2149, WBP_ALetter},
{0x214E, 0x214E, WBP_ALetter},
{0x2160, 0x2182, WBP_ALetter},
{0x2183, 0x2184, WBP_ALetter},
{0x2185, 0x2188, WBP_ALetter},
{0x24B6, 0x24E9, WBP_ALetter},
{0x2C00, 0x2C2E, WBP_ALetter},
{0x2C30, 0x2C5E, WBP_ALetter},
{0x2C60, 0x2C7C, WBP_ALetter},
{0x2C7D, 0x2C7D, WBP_ALetter},
{0x2C7E, 0x2CE4, WBP_ALetter},
{0x2CEB, 0x2CEE, WBP_ALetter},
{0x2CEF, 0x2CF1, WBP_Extend},
{0x2D00, 0x2D25, WBP_ALetter},
{0x2D30, 0x2D65, WBP_ALetter},
{0x2D6F, 0x2D6F, WBP_ALetter},
{0x2D7F, 0x2D7F, WBP_Extend},
{0x2D80, 0x2D96, WBP_ALetter},
{0x2DA0, 0x2DA6, WBP_ALetter},
{0x2DA8, 0x2DAE, WBP_ALetter},
{0x2DB0, 0x2DB6, WBP_ALetter},
{0x2DB8, 0x2DBE, WBP_ALetter},
{0x2DC0, 0x2DC6, WBP_ALetter},
{0x2DC8, 0x2DCE, WBP_ALetter},
{0x2DD0, 0x2DD6, WBP_ALetter},
{0x2DD8, 0x2DDE, WBP_ALetter},
{0x2DE0, 0x2DFF, WBP_Extend},
{0x2E2F, 0x2E2F, WBP_ALetter},
{0x3005, 0x3005, WBP_ALetter},
{0x302A, 0x302F, WBP_Extend},
{0x3031, 0x3035, WBP_Katakana},
{0x303B, 0x303B, WBP_ALetter},
{0x303C, 0x303C, WBP_ALetter},
{0x3099, 0x309A, WBP_Extend},
{0x309B, 0x309C, WBP_Katakana},
{0x30A0, 0x30A0, WBP_Katakana},
{0x30A1, 0x30FA, WBP_Katakana},
{0x30FC, 0x30FE, WBP_Katakana},
{0x30FF, 0x30FF, WBP_Katakana},
{0x3105, 0x312D, WBP_ALetter},
{0x3131, 0x318E, WBP_ALetter},
{0x31A0, 0x31BA, WBP_ALetter},
{0x31F0, 0x31FF, WBP_Katakana},
{0x32D0, 0x32FE, WBP_Katakana},
{0x3300, 0x3357, WBP_Katakana},
{0xA000, 0xA014, WBP_ALetter},
{0xA015, 0xA015, WBP_ALetter},
{0xA016, 0xA48C, WBP_ALetter},
{0xA4D0, 0xA4F7, WBP_ALetter},
{0xA4F8, 0xA4FD, WBP_ALetter},
{0xA500, 0xA60B, WBP_ALetter},
{0xA60C, 0xA60C, WBP_ALetter},
{0xA610, 0xA61F, WBP_ALetter},
{0xA620, 0xA629, WBP_Numeric},
{0xA62A, 0xA62B, WBP_ALetter},
{0xA640, 0xA66D, WBP_ALetter},
{0xA66E, 0xA66E, WBP_ALetter},
{0xA66F, 0xA66F, WBP_Extend},
{0xA670, 0xA672, WBP_Extend},
{0xA67C, 0xA67D, WBP_Extend},
{0xA67F, 0xA67F, WBP_ALetter},
{0xA680, 0xA697, WBP_ALetter},
{0xA6A0, 0xA6E5, WBP_ALetter},
{0xA6E6, 0xA6EF, WBP_ALetter},
{0xA6F0, 0xA6F1, WBP_Extend},
{0xA717, 0xA71F, WBP_ALetter},
{0xA722, 0xA76F, WBP_ALetter},
{0xA770, 0xA770, WBP_ALetter},
{0xA771, 0xA787, WBP_ALetter},
{0xA788, 0xA788, WBP_ALetter},
{0xA78B, 0xA78E, WBP_ALetter},
{0xA790, 0xA791, WBP_ALetter},
{0xA7A0, 0xA7A9, WBP_ALetter},
{0xA7FA, 0xA7FA, WBP_ALetter},
{0xA7FB, 0xA801, WBP_ALetter},
{0xA802, 0xA802, WBP_Extend},
{0xA803, 0xA805, WBP_ALetter},
{0xA806, 0xA806, WBP_Extend},
{0xA807, 0xA80A, WBP_ALetter},
{0xA80B, 0xA80B, WBP_Extend},
{0xA80C, 0xA822, WBP_ALetter},
{0xA823, 0xA824, WBP_Extend},
{0xA825, 0xA826, WBP_Extend},
{0xA827, 0xA827, WBP_Extend},
{0xA840, 0xA873, WBP_ALetter},
{0xA880, 0xA881, WBP_Extend},
{0xA882, 0xA8B3, WBP_ALetter},
{0xA8B4, 0xA8C3, WBP_Extend},
{0xA8C4, 0xA8C4, WBP_Extend},
{0xA8D0, 0xA8D9, WBP_Numeric},
{0xA8E0, 0xA8F1, WBP_Extend},
{0xA8F2, 0xA8F7, WBP_ALetter},
{0xA8FB, 0xA8FB, WBP_ALetter},
{0xA900, 0xA909, WBP_Numeric},
{0xA90A, 0xA925, WBP_ALetter},
{0xA926, 0xA92D, WBP_Extend},
{0xA930, 0xA946, WBP_ALetter},
{0xA947, 0xA951, WBP_Extend},
{0xA952, 0xA953, WBP_Extend},
{0xA960, 0xA97C, WBP_ALetter},
{0xA980, 0xA982, WBP_Extend},
{0xA983, 0xA983, WBP_Extend},
{0xA984, 0xA9B2, WBP_ALetter},
{0xA9B3, 0xA9B3, WBP_Extend},
{0xA9B4, 0xA9B5, WBP_Extend},
{0xA9B6, 0xA9B9, WBP_Extend},
{0xA9BA, 0xA9BB, WBP_Extend},
{0xA9BC, 0xA9BC, WBP_Extend},
{0xA9BD, 0xA9C0, WBP_Extend},
{0xA9CF, 0xA9CF, WBP_ALetter},
{0xA9D0, 0xA9D9, WBP_Numeric},
{0xAA00, 0xAA28, WBP_ALetter},
{0xAA29, 0xAA2E, WBP_Extend},
{0xAA2F, 0xAA30, WBP_Extend},
{0xAA31, 0xAA32, WBP_Extend},
{0xAA33, 0xAA34, WBP_Extend},
{0xAA35, 0xAA36, WBP_Extend},
{0xAA40, 0xAA42, WBP_ALetter},
{0xAA43, 0xAA43, WBP_Extend},
{0xAA44, 0xAA4B, WBP_ALetter},
{0xAA4C, 0xAA4C, WBP_Extend},
{0xAA4D, 0xAA4D, WBP_Extend},
{0xAA50, 0xAA59, WBP_Numeric},
{0xAA7B, 0xAA7B, WBP_Extend},
{0xAAB0, 0xAAB0, WBP_Extend},
{0xAAB2, 0xAAB4, WBP_Extend},
{0xAAB7, 0xAAB8, WBP_Extend},
{0xAABE, 0xAABF, WBP_Extend},
{0xAAC1, 0xAAC1, WBP_Extend},
{0xAB01, 0xAB06, WBP_ALetter},
{0xAB09, 0xAB0E, WBP_ALetter},
{0xAB11, 0xAB16, WBP_ALetter},
{0xAB20, 0xAB26, WBP_ALetter},
{0xAB28, 0xAB2E, WBP_ALetter},
{0xABC0, 0xABE2, WBP_ALetter},
{0xABE3, 0xABE4, WBP_Extend},
{0xABE5, 0xABE5, WBP_Extend},
{0xABE6, 0xABE7, WBP_Extend},
{0xABE8, 0xABE8, WBP_Extend},
{0xABE9, 0xABEA, WBP_Extend},
{0xABEC, 0xABEC, WBP_Extend},
{0xABED, 0xABED, WBP_Extend},
{0xABF0, 0xABF9, WBP_Numeric},
{0xAC00, 0xD7A3, WBP_ALetter},
{0xD7B0, 0xD7C6, WBP_ALetter},
{0xD7CB, 0xD7FB, WBP_ALetter},
{0xFB00, 0xFB06, WBP_ALetter},
{0xFB13, 0xFB17, WBP_ALetter},
{0xFB1D, 0xFB1D, WBP_ALetter},
{0xFB1E, 0xFB1E, WBP_Extend},
{0xFB1F, 0xFB28, WBP_ALetter},
{0xFB2A, 0xFB36, WBP_ALetter},
{0xFB38, 0xFB3C, WBP_ALetter},
{0xFB3E, 0xFB3E, WBP_ALetter},
{0xFB40, 0xFB41, WBP_ALetter},
{0xFB43, 0xFB44, WBP_ALetter},
{0xFB46, 0xFBB1, WBP_ALetter},
{0xFBD3, 0xFD3D, WBP_ALetter},
{0xFD50, 0xFD8F, WBP_ALetter},
{0xFD92, 0xFDC7, WBP_ALetter},
{0xFDF0, 0xFDFB, WBP_ALetter},
{0xFE00, 0xFE0F, WBP_Extend},
{0xFE10, 0xFE10, WBP_MidNum},
{0xFE13, 0xFE13, WBP_MidLetter},
{0xFE14, 0xFE14, WBP_MidNum},
{0xFE20, 0xFE26, WBP_Extend},
{0xFE33, 0xFE34, WBP_ExtendNumLet},
{0xFE4D, 0xFE4F, WBP_ExtendNumLet},
{0xFE50, 0xFE50, WBP_MidNum},
{0xFE52, 0xFE52, WBP_MidNumLet},
{0xFE54, 0xFE54, WBP_MidNum},
{0xFE55, 0xFE55, WBP_MidLetter},
{0xFE70, 0xFE74, WBP_ALetter},
{0xFE76, 0xFEFC, WBP_ALetter},
{0xFEFF, 0xFEFF, WBP_Format},
{0xFF07, 0xFF07, WBP_MidNumLet},
{0xFF0C, 0xFF0C, WBP_MidNum},
{0xFF0E, 0xFF0E, WBP_MidNumLet},
{0xFF1A, 0xFF1A, WBP_MidLetter},
{0xFF1B, 0xFF1B, WBP_MidNum},
{0xFF21, 0xFF3A, WBP_ALetter},
{0xFF3F, 0xFF3F, WBP_ExtendNumLet},
{0xFF41, 0xFF5A, WBP_ALetter},
{0xFF66, 0xFF6F, WBP_Katakana},
{0xFF70, 0xFF70, WBP_Katakana},
{0xFF71, 0xFF9D, WBP_Katakana},
{0xFF9E, 0xFF9F, WBP_Extend},
{0xFFA0, 0xFFBE, WBP_ALetter},
{0xFFC2, 0xFFC7, WBP_ALetter},
{0xFFCA, 0xFFCF, WBP_ALetter},
{0xFFD2, 0xFFD7, WBP_ALetter},
{0xFFDA, 0xFFDC, WBP_ALetter},
{0xFFF9, 0xFFFB, WBP_Format},
{0x10000, 0x1000B, WBP_ALetter},
{0x1000D, 0x10026, WBP_ALetter},
{0x10028, 0x1003A, WBP_ALetter},
{0x1003C, 0x1003D, WBP_ALetter},
{0x1003F, 0x1004D, WBP_ALetter},
{0x10050, 0x1005D, WBP_ALetter},
{0x10080, 0x100FA, WBP_ALetter},
{0x10140, 0x10174, WBP_ALetter},
{0x101FD, 0x101FD, WBP_Extend},
{0x10280, 0x1029C, WBP_ALetter},
{0x102A0, 0x102D0, WBP_ALetter},
{0x10300, 0x1031E, WBP_ALetter},
{0x10330, 0x10340, WBP_ALetter},
{0x10341, 0x10341, WBP_ALetter},
{0x10342, 0x10349, WBP_ALetter},
{0x1034A, 0x1034A, WBP_ALetter},
{0x10380, 0x1039D, WBP_ALetter},
{0x103A0, 0x103C3, WBP_ALetter},
{0x103C8, 0x103CF, WBP_ALetter},
{0x103D1, 0x103D5, WBP_ALetter},
{0x10400, 0x1044F, WBP_ALetter},
{0x10450, 0x1049D, WBP_ALetter},
{0x104A0, 0x104A9, WBP_Numeric},
{0x10800, 0x10805, WBP_ALetter},
{0x10808, 0x10808, WBP_ALetter},
{0x1080A, 0x10835, WBP_ALetter},
{0x10837, 0x10838, WBP_ALetter},
{0x1083C, 0x1083C, WBP_ALetter},
{0x1083F, 0x10855, WBP_ALetter},
{0x10900, 0x10915, WBP_ALetter},
{0x10920, 0x10939, WBP_ALetter},
{0x10A00, 0x10A00, WBP_ALetter},
{0x10A01, 0x10A03, WBP_Extend},
{0x10A05, 0x10A06, WBP_Extend},
{0x10A0C, 0x10A0F, WBP_Extend},
{0x10A10, 0x10A13, WBP_ALetter},
{0x10A15, 0x10A17, WBP_ALetter},
{0x10A19, 0x10A33, WBP_ALetter},
{0x10A38, 0x10A3A, WBP_Extend},
{0x10A3F, 0x10A3F, WBP_Extend},
{0x10A60, 0x10A7C, WBP_ALetter},
{0x10B00, 0x10B35, WBP_ALetter},
{0x10B40, 0x10B55, WBP_ALetter},
{0x10B60, 0x10B72, WBP_ALetter},
{0x10C00, 0x10C48, WBP_ALetter},
{0x11000, 0x11000, WBP_Extend},
{0x11001, 0x11001, WBP_Extend},
{0x11002, 0x11002, WBP_Extend},
{0x11003, 0x11037, WBP_ALetter},
{0x11038, 0x11046, WBP_Extend},
{0x11066, 0x1106F, WBP_Numeric},
{0x11080, 0x11081, WBP_Extend},
{0x11082, 0x11082, WBP_Extend},
{0x11083, 0x110AF, WBP_ALetter},
{0x110B0, 0x110B2, WBP_Extend},
{0x110B3, 0x110B6, WBP_Extend},
{0x110B7, 0x110B8, WBP_Extend},
{0x110B9, 0x110BA, WBP_Extend},
{0x110BD, 0x110BD, WBP_Format},
{0x12000, 0x1236E, WBP_ALetter},
{0x12400, 0x12462, WBP_ALetter},
{0x13000, 0x1342E, WBP_ALetter},
{0x16800, 0x16A38, WBP_ALetter},
{0x1B000, 0x1B000, WBP_Katakana},
{0x1D165, 0x1D166, WBP_Extend},
{0x1D167, 0x1D169, WBP_Extend},
{0x1D16D, 0x1D172, WBP_Extend},
{0x1D173, 0x1D17A, WBP_Format},
{0x1D17B, 0x1D182, WBP_Extend},
{0x1D185, 0x1D18B, WBP_Extend},
{0x1D1AA, 0x1D1AD, WBP_Extend},
{0x1D242, 0x1D244, WBP_Extend},
{0x1D400, 0x1D454, WBP_ALetter},
{0x1D456, 0x1D49C, WBP_ALetter},
{0x1D49E, 0x1D49F, WBP_ALetter},
{0x1D4A2, 0x1D4A2, WBP_ALetter},
{0x1D4A5, 0x1D4A6, WBP_ALetter},
{0x1D4A9, 0x1D4AC, WBP_ALetter},
{0x1D4AE, 0x1D4B9, WBP_ALetter},
{0x1D4BB, 0x1D4BB, WBP_ALetter},
{0x1D4BD, 0x1D4C3, WBP_ALetter},
{0x1D4C5, 0x1D505, WBP_ALetter},
{0x1D507, 0x1D50A, WBP_ALetter},
{0x1D50D, 0x1D514, WBP_ALetter},
{0x1D516, 0x1D51C, WBP_ALetter},
{0x1D51E, 0x1D539, WBP_ALetter},
{0x1D53B, 0x1D53E, WBP_ALetter},
{0x1D540, 0x1D544, WBP_ALetter},
{0x1D546, 0x1D546, WBP_ALetter},
{0x1D54A, 0x1D550, WBP_ALetter},
{0x1D552, 0x1D6A5, WBP_ALetter},
{0x1D6A8, 0x1D6C0, WBP_ALetter},
{0x1D6C2, 0x1D6DA, WBP_ALetter},
{0x1D6DC, 0x1D6FA, WBP_ALetter},
{0x1D6FC, 0x1D714, WBP_ALetter},
{0x1D716, 0x1D734, WBP_ALetter},
{0x1D736, 0x1D74E, WBP_ALetter},
{0x1D750, 0x1D76E, WBP_ALetter},
{0x1D770, 0x1D788, WBP_ALetter},
{0x1D78A, 0x1D7A8, WBP_ALetter},
{0x1D7AA, 0x1D7C2, WBP_ALetter},
{0x1D7C4, 0x1D7CB, WBP_ALetter},
{0x1D7CE, 0x1D7FF, WBP_Numeric},
{0xE0001, 0xE0001, WBP_Format},
{0xE0020, 0xE007F, WBP_Format},
{0xE0100, 0xE01EF, WBP_Extend},
{0xFFFFFFFF, 0xFFFFFFFF, WBP_Undefined}
};

View file

@ -0,0 +1,5 @@
#include "linebreak.h"
#include "wordbreakdef.h"
static struct WordBreakProperties wb_prop_default[] = {

View file

@ -0,0 +1,2 @@
{0xFFFFFFFF, 0xFFFFFFFF, WBP_Undefined}
};

View file

@ -0,0 +1,78 @@
/* vim: set tabstop=4 shiftwidth=4: */
/*
* Word breaking in a Unicode sequence. Designed to be used in a
* generic text renderer.
*
* Copyright (C) 2012 Tom Hacohen <tom@stosb.com>
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the author be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute
* it freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must
* not claim that you wrote the original software. If you use this
* software in a product, an acknowledgement in the product
* documentation would be appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must
* not be misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source
* distribution.
*
* The main reference is Unicode Standard Annex 29 (UAX #29):
* <URL:http://unicode.org/reports/tr29>
*
* When this library was designed, this annex was at Revision 17, for
* Unicode 6.0.0:
* <URL:http://www.unicode.org/reports/tr29/tr29-17.html>
*
* The Unicode Terms of Use are available at
* <URL:http://www.unicode.org/copyright.html>
*/
/**
* @file wordbreakdef.h
*
* Definitions of internal data structures, declarations of global
* variables, and function prototypes for the word breaking algorithm.
*
* @version 2.1, 2012/01/18
* @author Tom Hacohen
*/
/**
* Word break classes. This is a direct mapping of Table 3 of Unicode
* Standard Annex 29, Revision 17.
*/
enum WordBreakClass
{
WBP_Undefined,
WBP_CR,
WBP_LF,
WBP_Newline,
WBP_Extend,
WBP_Format,
WBP_Katakana,
WBP_ALetter,
WBP_MidNumLet,
WBP_MidLetter,
WBP_MidNum,
WBP_Numeric,
WBP_ExtendNumLet,
WBP_Any
};
/**
* Struct for entries of word break properties. The array of the
* entries \e must be sorted in ascending order of code point.
*/
struct WordBreakProperties
{
utf32_t start; /**< Starting code point */
utf32_t end; /**< End code point */
enum WordBreakClass prop; /**< The word breaking property */
};
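/*
 * Illustration only, not part of the library: a minimal sketch of why the
 * array of entries must be sorted. With ascending, non-overlapping ranges,
 * the class of a code point can be found by binary search. All names here
 * (demo_utf32_t, DemoClass, DemoRange, demo_table, demo_lookup) are
 * hypothetical stand-ins for utf32_t, WordBreakClass, and
 * WordBreakProperties. Returning DEMO_Any for an unlisted code point is an
 * assumption of this sketch.
 */

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for utf32_t, which linebreak.h provides. */
typedef uint32_t demo_utf32_t;

/* Reduced stand-in for enum WordBreakClass. */
enum DemoClass
{
    DEMO_Undefined,
    DEMO_Extend,
    DEMO_ALetter,
    DEMO_Numeric,
    DEMO_Any
};

/* Same layout as struct WordBreakProperties: an inclusive range and its class. */
struct DemoRange
{
    demo_utf32_t start;   /* first code point of the range */
    demo_utf32_t end;     /* last code point of the range (inclusive) */
    enum DemoClass prop;  /* class shared by every code point in the range */
};

/* Three ranges taken from the generated table, in ascending order,
 * followed by the same kind of sentinel entry the table ends with. */
static const struct DemoRange demo_table[] = {
    {0x1318, 0x135A, DEMO_ALetter},
    {0x135D, 0x135F, DEMO_Extend},
    {0x17E0, 0x17E9, DEMO_Numeric},
    {0xFFFFFFFF, 0xFFFFFFFF, DEMO_Undefined}
};

/* Binary search over the first len entries of tab (sentinel excluded).
 * Returns the class of the range containing ch, or DEMO_Any if no
 * range contains it. */
static enum DemoClass demo_lookup(const struct DemoRange *tab, size_t len,
                                  demo_utf32_t ch)
{
    size_t lo = 0;
    size_t hi = len;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (ch < tab[mid].start)
            hi = mid;             /* ch lies before this range */
        else if (ch > tab[mid].end)
            lo = mid + 1;         /* ch lies after this range */
        else
            return tab[mid].prop; /* start <= ch <= end */
    }
    return DEMO_Any;
}
```

/*
 * Usage: demo_lookup(demo_table, 3, 0x135E) yields DEMO_Extend, while
 * 0x135B falls in the gap between the first two ranges and yields DEMO_Any.
 * If the entries were not sorted, the halving step could discard the half
 * that holds the matching range.
 */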