Name: Unihan database
Unicode version: 5.0.0
Table version: 1.1
Date: 7 July 2006

Copyright © 1996-2006 Unicode, Inc. All rights reserved.

For terms of use, see <http://www.unicode.org/terms_of_use.html>

Format information:

Each line of this file consists of three tab-separated fields.
The first is the Unicode scalar value as U+[x]xxxx (that is, there are
    either four or five hex digits)
The second is a tag indicating the type of information in the third field
The third is the line's value (in UTF-8)
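
As a rough illustration of the three-field layout described above, the
following Python sketch splits one data line into its parts. The
function name parse_unihan_line and the sample line are assumptions
made for this example, and the hex digits are assumed to be uppercase.

```python
import re

# First field: "U+" followed by four or five hex digits
# (uppercase is an assumption of this sketch).
CODE_POINT = re.compile(r"U\+([0-9A-F]{4,5})")

def parse_unihan_line(line):
    """Split one data line into (code point, tag, value)."""
    scalar, tag, value = line.rstrip("\n").split("\t", 2)
    match = CODE_POINT.fullmatch(scalar)
    if match is None:
        raise ValueError("unexpected scalar value: " + scalar)
    return int(match.group(1), 16), tag, value

# Sample line written for illustration only.
sample = "U+62FE\tkAccountingNumeric\t10\n"
code_point, tag, value = parse_unihan_line(sample)
```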

We give below a list of the tags in alphabetical order. For each tag,
we give additional information, such as its formal status in the standard,
a general category to which its data belongs, the separator (if any)
between individual subvalues, a regular expression indicating the
format of each subvalue, the version of Unicode in which the data were
originally introduced, and a description of the data associated with the
tag.

Regular expressions are based on standard Perl 5.8.6 syntax and may
require modification for use with other regular expression engines.
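
For instance, a Syntax pattern might be applied from Python roughly as
follows. This is only a sketch: the helper value_is_valid is a
hypothetical name, and it assumes that a pattern is meant to match a
whole subvalue, so re.fullmatch is used rather than a search.

```python
import re

def value_is_valid(syntax, value, separator=" "):
    """Check every subvalue of a field against its Syntax pattern."""
    pattern = re.compile(syntax)
    return all(pattern.fullmatch(part) is not None
               for part in value.split(separator))

# The kCowles Syntax pattern, copied from its entry in this file.
KCOWLES = r"[0-9]{1,4}(\.[0-9]{1,2})?"

ok = value_is_valid(KCOWLES, "2964 3197")   # two well-formed indices
bad = value_is_valid(KCOWLES, "12345")      # five digits: too long
```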

Unless otherwise noted, the order of subvalues within a single
value field is not significant.

Note that only the description is present for every tag value.

See also <http://www.unicode.org/Public/UNIDATA/Unihan.html>

###############################################################################

Tag:    kAccountingNumeric
Status:    Informative
Category:    Numeric Values
Separator:    space
Syntax:    [0-9]+
Introduced:    3.2

The value of the character when used in the writing of accounting
numerals.

Accounting numerals are used in East Asia to prevent fraud. Because
a number like ten (十) is easily turned into one thousand (千) with
a stroke of a brush, monetary documents will often use an
accounting form of the numeral (such as 拾 for ten) in its place.

The three numeric-value fields should have no overlap; that is, characters
with a kAccountingNumeric value should not have a kPrimaryNumeric
or kOtherNumeric value as well.
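
The no-overlap rule above could be checked along these lines. The
sketch assumes the records have already been parsed into (code point,
tag, value) tuples; the function name find_numeric_overlaps is
hypothetical.

```python
NUMERIC_TAGS = {"kAccountingNumeric", "kPrimaryNumeric", "kOtherNumeric"}

def find_numeric_overlaps(records):
    """Return code points that carry more than one numeric-value tag."""
    seen = {}
    for code_point, tag, _value in records:
        if tag in NUMERIC_TAGS:
            seen.setdefault(code_point, set()).add(tag)
    return sorted(cp for cp, tags in seen.items() if len(tags) > 1)

# Illustrative records only; the second list adds a deliberate clash.
clean = [(0x62FE, "kAccountingNumeric", "10"),
         (0x5341, "kPrimaryNumeric", "10")]
clashing = clean + [(0x62FE, "kOtherNumeric", "10")]
```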

###############################################################################

Tag:    kBigFive
Status:    Provisional
Category:    Other Mappings
Separator:    space
Syntax:    [0-9A-F]{4}

The Big Five mapping for this character in hex; note that this does
not cover any of the Big Five extensions in common use, including
the ETEN extensions.

###############################################################################

Tag:    kCCCII
Status:    Provisional
Category:    Other Mappings
Separator:    space
Syntax:    [0-9A-F]{6}

The CCCII mapping for this character in hex.

###############################################################################

Tag:    kCNS1986
Status:    Provisional
Category:    Other Mappings
Separator:    space
Syntax:    [12E]-[0-9A-F]{4}

The CNS 11643-1986 mapping for this character in hex.

###############################################################################

Tag:    kCNS1992
Status:    Provisional
Category:    Other Mappings
Separator:    space
Syntax:    [123]-[0-9A-F]{4}

The CNS 11643-1992 mapping for this character in hex.

###############################################################################

Tag:    kCangjie
Status:    Provisional
Category:    Dictionary-like Data
Separator:    space
Syntax:    [A-Z]+
Introduced:    3.1.1

The cangjie input code for the character. This incorporates
data from the file cangjie-table.b5 by Christian Wittern.

###############################################################################

Tag:    kCantonese
Status:    Provisional
Category:    Dictionary-like Data
Separator:    space
Syntax:    [a-z]+[1-6]

The Cantonese pronunciation(s) for this character using the
jyutping romanization.

A full description of jyutping can be found at <http://cpct92.cityu.edu.hk/lshk/Jyutping/Jyutping.htm>.
The main differences between jyutping and the Yale romanization
previously used are:

1) Jyutping always uses tone numbers and does not distinguish
the high falling and high level tones.

2) Jyutping always writes a long a as "aa".

3) Jyutping uses "oe" and "eo" for the Yale "eu" vowel.

4) Jyutping uses "c" instead of "ch", "z" instead of "j",
and "j" instead of "y" as initials.

5) A non-null initial is always explicitly written (thus
"jyut" in jyutping instead of Yale's "yut").

Cantonese pronunciations are sorted alphabetically, not in
order of frequency.
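
A minimal sketch of how a kCantonese value might be handled, based
only on the Syntax pattern and the sorting note above. The function
names are hypothetical.

```python
import re

# Lowercase letters followed by a single tone digit 1-6.
JYUTPING = re.compile(r"([a-z]+)([1-6])")

def split_jyutping(reading):
    """Split one jyutping reading into (syllable letters, tone number)."""
    match = JYUTPING.fullmatch(reading)
    if match is None:
        raise ValueError("not a jyutping reading: " + reading)
    return match.group(1), int(match.group(2))

def is_alphabetical(value):
    """True if the space-separated readings are in alphabetical order."""
    readings = value.split(" ")
    return readings == sorted(readings)
```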

N.B., the Hong Kong dialect of Cantonese is in the process of dropping
initial NG- before non-null finals. Any word with an initial NG-
may actually be pronounced without it, depending on the speaker and
circumstances. Many words with a null initial may similarly be pronounced
with an initial NG-. Similarly, many speakers use an initial
L- for words previously pronounced with an initial N-.

Cantonese data are derived from the following sources:

Casey, G. Hugh, S.J. Ten Thousand Characters: An Analytic
Dictionary. Hong Kong: Kelley and Walsh, 1980 (kPhonetic).

Cheung Kwan-hin and Robert S. Bauer, The Representation of Cantonese
with Chinese Characters, Journal of Chinese Linguistics Monograph
Series Number 18, 2002.

Roy T. Cowles, A Pocket Dictionary of Cantonese, Hong Kong:
University Press, 1999 (kCowles).

Sidney Lau, A Practical Cantonese-English Dictionary, Hong
Kong: Government Printer, 1977 (kLau).

Bernard F. Meyer and Theodore F. Wempe, Student's Cantonese-English
Dictionary, Maryknoll, New York: Catholic Foreign Mission
Society of America, 1947 (kMeyerWempe).

饒秉才, ed. 廣州音字典, Hong Kong: Joint Publishing (H.K.) Co., Ltd.,
1989.

中華新字典, Hong Kong: 中華書局, 1987.

黃港生, ed. 商務新詞典, Hong Kong: The Commercial Press, 1991.

朗文初級中文詞典, Hong Kong: Longman, 2001.

The jyutping phrase box from the Linguistic Society of Hong Kong,
<http://cpct92.cityu.edu.hk/lshk/Jyutping/>. The copyright of the
Jyutping phrase box belongs to the Linguistic Society of Hong Kong. 
We would like to thank the Jyutping Group of the Linguistic Society
of Hong Kong for permission to use the electronic file in our research
and/or product development. Note that the inclusion of the phrase
box in the Unihan database requires any product developed
using the kCantonese field to include this acknowledgment.

###############################################################################

Tag:    kCheungBauer
Status:    Provisional
Category:    Dictionary-like Data
Separator:    NA
Introduced:    5.0

Data regarding the character in Cheung Kwan-hin and Robert S. Bauer,
_The Representation of Cantonese with Chinese Characters_, Journal
of Chinese Linguistics, Monograph Series Number 18, 2002. The data
consist of three pieces, separated by semicolons: (1) the character's
radical-stroke index as a three-digit radical, slash, two-digit stroke
count; (2) the character's cangjie input code (if any); and (3) a
comma-separated list of Cantonese readings using the jyutping
romanization in alphabetical order.
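
The three-piece structure could be unpacked roughly as follows. The
function name parse_cheung_bauer and the sample value are assumptions
made for illustration; whitespace around the separators is stripped
defensively, since the description does not say whether any occurs.

```python
def parse_cheung_bauer(value):
    """Split a kCheungBauer value into its three described pieces."""
    rad_stroke, cangjie, readings = (p.strip() for p in value.split(";"))
    radical, strokes = rad_stroke.split("/")
    return {
        "radical": int(radical),          # three-digit radical number
        "strokes": int(strokes),          # two-digit stroke count
        "cangjie": cangjie or None,       # may be absent
        "readings": [r.strip() for r in readings.split(",") if r.strip()],
    }

# Made-up sample value, not taken from the database.
entry = parse_cheung_bauer("030/08;RMAM;gaa1,gaa3")
```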

###############################################################################

Tag:    kCheungBauerIndex
Status:    Provisional
Category:    Dictionary Indices
Separator:    space
Syntax:    [0-9]{3}\.[0-9][0-9]{2}
Introduced:    5.0

The position of the character in Cheung Kwan-hin and Robert S. Bauer,
_The Representation of Cantonese with Chinese Characters_, Journal
of Chinese Linguistics, Monograph Series Number 18, 2002. The format
is a three-digit page number followed by a two-digit position
number, separated by a period.

###############################################################################

Tag:    kCihaiT
Status:    Provisional
Category:    Dictionary-like Data
Separator:    space
Syntax:    [1-9][0-9]{0,3}\.[0-9]{3}
Introduced:    3.2

The position of this character in the Cihai (辭海) dictionary, single
volume edition, published in Hong Kong by the Zhonghua Bookstore,
1983 (reprint of the 1947 edition), ISBN 962-231-005-2.

The position is indicated by a decimal number. The digits to the
left of the decimal are the page number. The first digit after the
decimal is the row on the page, and the remaining two digits
after the decimal are the position on the row.
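
Read that way, a position could be decoded with a sketch like the one
below. The function name parse_cihai and the sample position are
assumptions made for illustration.

```python
def parse_cihai(position):
    """Decode a kCihaiT position into (page, row, position on row)."""
    page, rest = position.split(".")
    # First digit after the decimal is the row; the other two are the
    # position on that row.
    return int(page), int(rest[0]), int(rest[1:])

# Made-up sample: page 1187, row 3, fifth position on the row.
decoded = parse_cihai("1187.305")
```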

###############################################################################

Tag:    kCompatibilityVariant
Status:    Normative
Category:    Variants
Separator:    space
Syntax:    U\+2?[0-9A-F]{4}
Introduced:    3.2

The compatibility decomposition for this ideograph, derived
from the UnicodeData.txt file.

###############################################################################

Tag:    kCowles
Status:    Provisional
Category:    Dictionary Indices
Separator:    space
Syntax:    [0-9]{1,4}(\.[0-9]{1,2})?
Introduced:    3.1.1

The index or indices of this character in Roy T. Cowles,
A Pocket Dictionary of Cantonese, Hong Kong: University Press,
1999.

The Cowles indices are numerical, usually integers but occasionally
fractional where a character was added after the original indices
were determined. Cowles is missing indices 1222 and 4949, and four
characters in Cowles are part of Unicode's "Hangzhou" numeral
set: 2964 (U+3025), 3197 (U+3028), 3574 (U+3023), and 4720
(U+3027).

Approximately 100 characters from Cowles which are not currently
encoded are being submitted to the IRG by Unicode for inclusion
in future versions of the standard.

###############################################################################

Tag:    kDaeJaweon
Status:    Provisional
Category:    Dictionary Indices
Separator:    space
Syntax:    [0-9]{4}\.[0-9]{2}[0158]

The position of this character in the Dae Jaweon (Korean) dictionary
used in the four-dictionary sorting algorithm. The position is in
the form "page.position" with the final digit in the position being
"0" for characters actually in the dictionary and "1" for characters
not found in the dictionary and assigned a "virtual" position
in the dictionary.

Thus, "1187.060" indicates the sixth character on page 1187. A character
not in this dictionary but assigned a position between the
6th and 7th characters on page 1187 for sorting purposes
would have the code "1187.061".
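
The "page.position" form can be taken apart as in the sketch below.
The function name is hypothetical, and only the final digits "0" and
"1" are interpreted, since those are the only ones the description
explains.

```python
def parse_page_position(value):
    """Decode "page.position" into (page, position, is_virtual)."""
    page, rest = value.split(".")
    position = int(rest[:2])      # two-digit position on the page
    is_virtual = rest[2] == "1"   # "0" = in the dictionary, "1" = virtual
    return int(page), position, is_virtual

actual = parse_page_position("1187.060")
virtual = parse_page_position("1187.061")
```

Sorting the raw strings for one dictionary would place "1187.061"
immediately after "1187.060", which matches the intent described
above.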

The edition used is the first edition, published in Seoul
by Samseong Publishing Co., Ltd., 1988.

###############################################################################

Tag:    kDefinition
Status:    Provisional
Category:    Dictionary-like Data
Separator:    space
Syntax:    See Description

An English definition for this character. Definitions are for modern
written Chinese and are usually (but not always) the same as the
definition in other Chinese dialects or non-Chinese languages. In
some cases, synonyms are indicated. Fuller variant information
can be found using the various variant fields.

Definitions specific to non-Chinese languages or Chinese
dialects other than modern Mandarin are marked, e.g., (Cant.)
or (J).

Major definitions are separated by semicolons, and minor definitions
by commas. Any valid Unicode character (except for tab, double-quote,
and any line break character) may be used within the definition
field.
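
A naive sketch of splitting a definition into major and minor senses
follows. The function name and the sample text are made up for
illustration, and the sketch assumes that semicolons and commas occur
only as separators, which may not hold for every definition (for
example, a comma inside a parenthetical note).

```python
def split_definition(definition):
    """Return a list of major definitions, each a list of minor ones."""
    return [
        [minor.strip() for minor in major.split(",")]
        for major in definition.split(";")
    ]

# Made-up sample definition.
senses = split_definition("pick up, collect; ten")
```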

###############################################################################

Tag:    kEACC
Status:    Provisional
Category:    Other Mappings
Separator:    space
Syntax:    [0-9A-F]{6}

The EACC mapping for this character in hex.

###############################################################################

Tag:    kFenn
Status:    Provisional
Category:    Dictionary-like Data
Separator:    space
Syntax:    [0-9]+a?[A-KP*]
Introduced:    3.1.1

Data on the character from The Five Thousand Dictionary (aka Fenn's
Chinese-English Pocket Dictionary) by Courtenay H. Fenn,
Cambridge, Mass.: Harvard University Press, 1979.

The data here consists of a decimal number followed by a letter A
through K, the letter P, or an asterisk. The decimal number gives
the Soothill number for the character's phonetic, and the letter
is a rough frequency indication, with A indicating the 500
most common ideographs, B the next five hundred, and so on.

P is used by Fenn to indicate a rare character included in
the dictionary only because it is the phonetic element in
other characters.

An asterisk is used instead of a letter in the final position to
indicate a character which belongs to one of Soothill's phonetic
groups but is not found in Fenn's dictionary.

Characters which have a frequency letter but no Soothill
phonetic group are assigned group 0.
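
The value could be decoded roughly as follows. This sketch assumes
that "and so on" means each successive letter covers the next 500
ideographs; the optional "a" allowed by the Syntax pattern is not
explained above, so it is only reported, not interpreted. The function
name parse_fenn and the sample values are hypothetical.

```python
import re

FENN = re.compile(r"([0-9]+)(a?)([A-KP*])")

def parse_fenn(value):
    """Decode a kFenn value into (group, has_a, code, frequency band)."""
    match = FENN.fullmatch(value)
    if match is None:
        raise ValueError("not a kFenn value: " + value)
    number, suffix, code = match.groups()
    if code in ("P", "*"):
        band = None                   # no frequency band for P or *
    else:
        index = ord(code) - ord("A")  # A -> 0, B -> 1, ...
        band = (index * 500 + 1, (index + 1) * 500)
    return int(number), suffix == "a", code, band
```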

###############################################################################

Tag:    kFennIndex
Status:    Provisional
Category:    Dictionary Indices
Separator:    space
Syntax:    [1-9]{3}\.[01][0-9]

The position of this character in _Fenn's Chinese-English Pocket
Dictionary_ by Courtenay H. Fenn, Cambridge, Mass.: Harvard University
Press, 1942. The position is indicated by a three-digit page
number followed by a period and a two-digit position on the
page.

###############################################################################

Tag:    kFourCornerCode
Status:    Provisional
Category:    Dictionary-like Data
Separator:    space
Syntax:    [0-9]{4}(\.[0-9])?
Introduced:    5.0

The four-corner code(s) for the character. This data is derived from
data provided in the public domain by Hartmut Bohn, Urs App,
and Christian Wittern.

The four-corner system assigns each character a code of four digits,
each from 0 through 9. Each digit is derived from the "shape" of one
of the four corners of the character (upper-left, upper-right,
lower-left, lower-right).
An optional fifth digit can be used to further distinguish characters;
the fifth digit is derived from the shape in the character's
center or region immediately to the left of the fourth corner.

The four-corner system is now used only rarely. Full descriptions
are available online, e.g., at <http://en.wikipedia.org/wiki/Four_corner_input>.

Values in this field consist of four decimal digits, optionally
followed by a period and fifth digit for a five-digit form.
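
A small sketch of reading such a value is shown below. The function
name and sample codes are made up for illustration; the corner order
follows the description above.

```python
def parse_four_corner(code):
    """Return ((UL, UR, LL, LR), fifth digit or None)."""
    main, _, fifth = code.partition(".")
    corners = tuple(int(digit) for digit in main)
    return corners, (int(fifth) if fifth else None)

# Made-up sample codes.
with_fifth = parse_four_corner("1234.5")
without_fifth = parse_four_corner("0010")
```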

###############################################################################

Tag:    kFrequency
Status:    Provisional
Category:    Dictionary-like Data
Separator:    space
Syntax:    [1-5]
Introduced:    3.2

A rough frequency measurement for the character based on analysis
of traditional Chinese USENET postings; characters with a kFrequency
of 1 are the most common, those with a kFrequency of 2 are
less common, and so on, through a kFrequency of 5.

###############################################################################

Tag:    kGB0
Status:    Provisional
Category:    Other Mappings
Separator:    space
Syntax:    [0-9A-F]{4}

The GB 2312-80 mapping for this character in ku/ten form.

###############################################################################

Tag:    kGB1
Status:    Provisional
Category:    Other Mappings
Separator:    space
Syntax:    [0-9A-F]{4}

The GB 12345-90 mapping for this character in ku/ten form.

###############################################################################

Tag:    kGB3
Status:    Provisional
Category:    Other Mappings
Separator:    space
Syntax:    [0-9A-F]{4}

The GB 7589-87 mapping for this character in ku/ten form.

###############################################################################

Tag:    kGB5
Status:    Provisional
Category:    Other Mappings
Separator:    space
Syntax:    [0-9A-F]{4}

The GB 7590-87 mapping for this character in ku/ten form.

###############################################################################

Tag:    kGB7
Status:    Provisional
Category:    Other Mappings
Separator:    space
Syntax:    [0-9A-F]{4}

The GB 8565-89 mapping for this character in ku/ten form.

###############################################################################

Tag:    kGB8
Status:    Provisional
Category:    Other Mappings
Separator:    space
Syntax:    [0-9]{4}

The GB 8565-89 mapping for this character in ku/ten form.

###############################################################################

Tag:    kGSR
Status:    Provisional
Category:    Dictionary Indices
Separator:    space
Syntax:    [0-9]{4}[a-vx-z]\'*
Introduced:    4.0.1

The position of this character in Bernhard Karlgren's Grammata
Serica Recensa (1957).

This dataset contains a total of 7,403 records. References are given
in the form DDDDa('), where "DDDD" is a set number in the range [0001..1260]
zero-padded to 4 digits, "a" is a letter in the range [a..z] (excluding
"w"), optionally followed by one or more apostrophes ('). The data from which
this mapping table is extracted contains a total of 10,023
references. References to inscriptional forms have been omitted.
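
A reference in this form might be decoded as in the following sketch,
which follows the Syntax pattern given for this tag. The function name
parse_gsr is hypothetical; the sample references are taken from the
release notes below.

```python
import re

# Four digits, one letter a-z except "w", then zero or more apostrophes.
GSR = re.compile(r"([0-9]{4})([a-vx-z])('*)")

def parse_gsr(reference):
    """Decode a GSR reference into (set number, letter, apostrophes)."""
    match = GSR.fullmatch(reference)
    if match is None:
        raise ValueError("not a GSR reference: " + reference)
    number, letter, marks = match.groups()
    return int(number), letter, len(marks)

plain = parse_gsr("0059k")
primed = parse_gsr("0566g'")
```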

Release notes

22-Dec-2003: Initial release. The following 32 references are to
unencoded forms: 0059k, 0069y, 0079d, 0275b, 0286a, 0289a, 0289f,
0293a, 0325a, 0389o, 0391h, 0392s, 0468h, 0480a, 0516a, 0526o, 0566g',
0642y, 0661a, 0739i, 0775b, 0837h, 0893r, 0969a, 0969e, 1019e, 1062b,
1112d, 1124l, 1129c', 1144a, 1144b. In some cases a variant mapping
has been substituted in the mapping table, in other cases
the reference is omitted.

Bibliographic information

Karlgren, Klas Bernhard Johannes 高本漢 (1889–1978): 2000. Grammata
Serica Recensa Electronica. Electronic version of GSR, including
indices, syllable canon, & images of the original Karlgren (1957)
text. Prepared for the STEDT Project by Richard Cook; based in part
on work by Tor Ulving & Ferenc Tafferner (see below), used
by permission. Berkeley: University of California. <http://stedt.berkeley.edu/>

Karlgren 1957. Grammata Serica Recensa. First published in the Bulletin
of the Museum of Far Eastern Antiquities (BMFEA) No. 29, Stockholm,
Sweden. Reprinted by Elanders Boktryckeri Aktiebolag, Kungsbacka,
[1972]. Reprinted also by SMC Publishing Inc., Taipei, Taiwan,
ROC, [1996]. ISBN: 957-638-269-6.

Karlgren 1940. Grammata Serica: Script and Phonetics in Chinese and
Sino-Japanese 《中日漢字形聲論》Zhong-Ri Hanzi Xingsheng Lun [A study of Sino-Japanese
semantic-phonetic compound characters:] BMFEA No. 12. Reprinted,
Taipei: Ch'eng-Wen Publishing Company, [1966].

Ulving, Tor: 1997. Dictionary of Old and Middle Chinese: Bernhard
Karlgren's Grammata Serica Recensa Alphabetically Arranged. With
Ferenc Tafferner. Göteborg, Sweden: Acta Universitatis Gothoburgensis.
Orientalia Gothoburgensia, 11. ISBN: 91-7346-294-2.

###############################################################################

Tag:    kGradeLevel
Status:    Provisional
Category:    Dictionary-like Data
Separator:    space
Syntax:    [1-6]
Introduced:    3.2

The primary grade in the Hong Kong school system by which a student
is expected to know the character; this data is derived from
朗文初級中文詞典, Hong Kong: Longman, 2001.

###############################################################################

Tag:    kHDZRadBreak
Status:    Provisional
Category:    Dictionary-like Data
Separator:    NA
Syntax:    [\x{2F00}-\x{2FD5}]\[U\+2?[0-9A-F]{4}\]:[1-8][0-9]{4}\.[0-9]{2}[012]
Introduced:    4.1

Indicates that 《漢語大字典》 Hanyu Da Zidian has a radical break beginning
at this character's position. The field consists of the radical (with
its Unicode code point), a colon, and then the Hanyu Da Zidian
position as in the kHanYu field.
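
One possible reading of this layout in Python is sketched below. It
assumes the code point is written in square brackets after the radical
character (for example "[U+2F00]"), and it uses Python's \uXXXX escape
where Perl would use \x{XXXX}. The function name and the sample value
are made up for illustration.

```python
import re

RAD_BREAK = re.compile(
    r"([\u2F00-\u2FD5])"                    # the radical character
    r"\[U\+(2?[0-9A-F]{4})\]"               # its code point in brackets
    r":([1-8][0-9]{4}\.[0-9]{2}[012])"      # Hanyu Da Zidian position
)

def parse_rad_break(value):
    """Decode a kHDZRadBreak value into (radical, code point, position)."""
    match = RAD_BREAK.fullmatch(value)
    if match is None:
        raise ValueError("not a kHDZRadBreak value: " + value)
    radical, code, position = match.groups()
    return radical, int(code, 16), position

# Made-up sample value built from the first radical in the range.
sample = "\u2F00[U+2F00]:10001.010"
radical, code_point, position = parse_rad_break(sample)
```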

###############################################################################

Tag:    kHKGlyph
Status:    Provisional
Category:    Dictionary-like Data
Separator:    space
Syntax:    [0-9]{4}
Introduced:    3.1.1

The index of the character in 常用字字形表 (二零零零年修訂本),香港: 香港教育學院, 2000,
ISBN 962-949-040-4. This publication gives the "proper" shapes for
4759 characters as used in the Hong Kong school system. The
index is an integer, zero-padded to four digits.

###############################################################################

Tag:    kHKSCS
Status:    Provisional
Category:    Other Mappings
Separator:    space
Syntax:    [0-9A-F]{4}
Introduced:    3.1.1

Mappings to the Big Five extended code points used for the
Hong Kong Supplementary Character Set.

###############################################################################

Tag:    kHanYu
Status:    Provisional
Category:    Dictionary Indices
Separator:    space
Syntax:    [1-8][0-9]{4}\.[0-9]{2}[0-3]

The position of this character in the Hanyu Da Zidian (HDZ)
Chinese character dictionary (bibliographic information below).

The character references are given in the form "ABCDE.XYZ", in which:
"A" is the volume number [1..8]; "BCDE" is the zero-padded page number
[0001..4809]; "XY" is the zero-padded number of the character on
the page [01..32]; "Z" is "0" for a character actually in the dictionary,
and greater than 0 for a character assigned a "virtual" position
in the dictionary. For example, 53024.060 indicates an actual HDZ
character, the 6th character on Page 3,024 of Volume 5 (i.e. 籉).
Note that the Volume 8 "BCDE" references are in the range [0008..0044]
inclusive, referring to the pagination of the "Appendix of
Addendum" at the end of that volume (beginning after p. 5746).

The first character assigned a given virtual position has an index
ending in 1; the second assigned the same virtual position
has an index ending in 2; and so on.
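
The "ABCDE.XYZ" form might be decoded as in this sketch. The function
name parse_hanyu is hypothetical; the sample reference is the one used
in the explanation above.

```python
def parse_hanyu(reference):
    """Decode "ABCDE.XYZ" into volume, page, position, virtual index."""
    left, right = reference.split(".")
    return {
        "volume": int(left[0]),       # A
        "page": int(left[1:]),        # BCDE, zero-padded
        "position": int(right[:2]),   # XY, zero-padded
        "virtual": int(right[2]),     # Z: 0 = actual, >0 = virtual
    }

decoded = parse_hanyu("53024.060")
```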

Release information

This data set contains a total of 56097 records, 54728 of which are
actual HDZ character references (positions are given for all HDZ
head entries, including source-internal unifications), and
1369 of which are virtual character positions (see note below).

All 55817 HDZ references in this data set are unique. Because of
IRG source-internal unifications, a given UCS-4 Scalar Value (USV)
may have more than one HDZ reference. Source-internal unifications
are of two types: (1) unifications of graphical variants;
(2) unifications of duplicate head entries.

The proofing of all references was done primarily on the basis of
cross-checks of three versions of the reference data: (1) the original
print source; (2) the "kIRGHanyuDaZidian" field of Unihan.txt (release
3.1.1d1); (3) "HDZ.txt", originally produced and proofed for Academia
Sinica's Institute of Information Technology (Document Processing
Laboratory). In addition, the data was checked against the "kHanYu"
and "kAlternateHanYu" fields of Unihan.txt (release 3.1.1d1),
which the present data set supersedes.

String value, string length, compound key, field count, and page
total validations were all performed. Altogether, 578 omissions/
errors in source (2) were identified/corrected. Any remaining errors
will likely relate to virtual positions, or to the ordering of actual
characters within a given page. It is unlikely that errors across
page breaks remain. Possible future deunifications of source-internal
unifications will necessitate update of USV for some references.
Under no circumstances should the source-internal unification
(duplicate USV) mappings be removed from this data set.

Note: Source (3) contributed only actual HDZ character references
to the proofing process, while source (2) contributed all virtual
positions. It seems that the compilers of source (2) usually assigned
virtual positions based on stroke count, though occasionally the
virtual position brings the virtual character together with the
actual HDZ character of which it is a variant, without regard
to actual stroke count.

Bibliographic information for the print source:

<Hanyu Da Zidian> ['Great Chinese Character Dictionary' (in 8 Volumes)].
XU Zhongshu (Editor in Chief). Wuhan, Hubei Province (PRC): Hubei
and Sichuan Dictionary Publishing Collectives, 1986-1990.
ISBN: 7-5403-0030-2/H.16.

《漢語大字典》。許力以主任,徐中舒主編,(漢語大字典工作委員會)。武漢:四川辭書出版社,湖北辭書出版社,1986-1990.
ISBN: 7-5403-0030-2/H.16.

###############################################################################

Tag:    kHangul
Status:    Provisional
Category:    Dictionary-like Data
Separator:    space
Introduced:    5.0

The modern Korean pronunciation(s) for this character in
Hangul.

###############################################################################

Tag:    kHanyuPinlu
Status:    Provisional
Category:    Dictionary Indices
Separator:    space
Syntax:    [a-zü]+[1-5]\([0-9]+\)
Introduced:    4.0.1

The Pronunciations and Frequencies of this character, based in part
on those appearing in 《現代漢語頻率詞典》 <Xiandai Hanyu Pinlu Cidian> (XDHYPLCD)
[Modern Standard Beijing Chinese Frequency Dictionary] (complete
bibliographic information below).

Data Format

This dataset contains a total of 3800 records. Each entry
consists of two pieces of data.

The Hanyu Pinyin (HYPY) pronunciation(s) of the character, with numeric
tone marks (1-5, where 5 indicates the "neutral tone") immediately
following each alphabetic string.

Immediately following the numeric tone mark, a numeric string appears
in parentheses: e.g. in "a1(392)" the numeric string "392" indicates
the sum total of the frequencies of the pronunciations of
the character as given in XDHYPLCD.

Where more than one pronunciation exists, these are sorted
by descending frequency, and the list elements are "comma
+ space" delimited.
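
An entry in this format could be read with a sketch like the one
below. The function name parse_pinlu is hypothetical, and the second
and third sample values are made up. Because the sketch scans for each
"reading(frequency)" pair, it does not depend on the exact delimiter
between list elements.

```python
import re

# Letters (including u-umlaut, U+00FC), a tone digit 1-5, and a
# frequency in parentheses.
PINLU = re.compile(r"([a-z\u00fc]+[1-5])\(([0-9]+)\)")

def parse_pinlu(value):
    """Return a list of (pinyin reading, frequency) pairs."""
    return [(reading, int(freq)) for reading, freq in PINLU.findall(value)]

single = parse_pinlu("a1(392)")
several = parse_pinlu("a1(392), a5(10)")
```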

Release Information

The XDHYPLCD data here for Modern Standard Chinese (Putonghua) cuts
across 4 genres ("News," "Scientific," "Colloquial," and "Literature"),
and was derived from a 440799 character corpus. See that
text for additional information.

The 8548 entries (8586 with variant writings) from pp. 491-656 of
XDHYPLCD were input by hand and proof-read from 1994/08/04
to 1995/03/22 by Richard Cook.

Current Release Date above reflects date of last proofing.

HYPY transcription for the data in this release was semiautomated
and hand-corrected in 1995, based in part on data provided
by Ross Paterson (Department of Computing, Imperial College,
London).

Tom Bishop <http://www.wenlin.com> is also due thanks for
early assistance in proof-reading this data.

The character set used for this digitization of XDHYPLCD (a
"simplified" mainland PRC text) was (Mac OS 7-9) GB 2312-80
(plus 嗐).

These data were converted to Big5 (plus 腈), and both GB and Big5
versions were separately converted to Unicode 4.0, and then merged,
resulting in the 3800 records in the current release. Frequency data
for simplified polysyllabic words has been employed to generate
both simplified and traditional character frequencies.

Bibliographic information for the primary print source

《現代漢語頻率詞典》,北京語言學院語言教學研究所編著。

<Xiandai Hanyu Pinlu Cidian> = XDHYPLCD First edition 1986/6,
2nd printing 1990/4. ISBN 7-5619-0094-5/H.67.

###############################################################################

Tag:    kIBMJapan
Status:    Provisional
Category:    Other Mappings
Separator:    space
Syntax:    F[ABC][0-9A-F]{2}

The IBM Japanese mapping for this character in hexadecimal.

###############################################################################

Tag:    kIICore
Status:    Normative
Category:    IRG Sources
Separator:    space
Syntax:    [1-9]\.[1-9]
Introduced:    4.1

Indicates that a character is in IICore, the IRG-produced
minimal set of required ideographs for East Asian use.

Each individual value in this field is either P (for preliminary,
meaning it has been approved by the IRG but not by WG2),
or the ISO/IEC 10646 subset identifier for the subset(s)
containing this character.

###############################################################################

Tag:    kIRGDaeJaweon
Status:    Provisional
Category:    Dictionary Indices
Separator:    space
Syntax:    [0-9]{4}\.[0-9]{2}[01]|0000\.555
Introduced:    3.0

The position of this character in the Dae Jaweon (Korean) dictionary
used in the four-dictionary sorting algorithm. The position is in
the form "page.position" with the final digit in the position being
"0" for characters actually in the dictionary and "1" for characters
not found in the dictionary and assigned a "virtual" position
in the dictionary.

Thus, "1187.060" indicates the sixth character on page 1187. A character
not in this dictionary but assigned a position between the
6th and 7th characters on page 1187 for sorting purposes
would have the code "1187.061".

This field represents the official position of the character within
the Dae Jaweon dictionary as used by the IRG in the four-dictionary
sorting algorithm.

The edition used is the first edition, published in Seoul
by Samseong Publishing Co., Ltd., 1988.

###############################################################################

Tag:    kIRGDaiKanwaZiten
Status:    Provisional
Category:    Dictionary Indices
Separator:    space
Syntax:    [0-9]{5}\'?
Introduced:    3.0

The index of this character in the Dai Kanwa Ziten, aka Morohashi
dictionary (Japanese) used in the four-dictionary sorting
algorithm.

This field represents the official position of the character within
the DaiKanwa dictionary as used by the IRG in the four-dictionary
sorting algorithm. The edition used is the revised edition,
published in Tokyo by Taishuukan Shoten, 1986.

###############################################################################

Tag:    kIRGHanyuDaZidian
Status:    Provisional
Category:    Dictionary Indices
Separator:    space
Syntax:    [1-8][0-9]{4}\.[0-3][0-9][01]
Introduced:    3.0

The position of this character in the Hanyu Da Zidian (PRC) dictionary
used in the four-dictionary sorting algorithm. The position is in
the form "volume page.position" with the final digit in the position
being "0" for characters actually in the dictionary and "1" for characters
not found in the dictionary and assigned a "virtual" position
in the dictionary.

Thus, "32264.080" indicates the eighth character on page 2264 in
volume 3. A character not in this dictionary but assigned a position
between the 8th and 9th characters on this page for sorting
purposes would have the code "32264.081".

This field represents the official position of the character within
the Hanyu Da Zidian dictionary as used by the IRG in the
four-dictionary sorting algorithm.

The edition of the Hanyu Da Zidian used is the first edition,
published in Chengdu by Sichuan Cishu Publishing, 1986.

###############################################################################

Tag:    kIRGKangXi
Status:    Provisional
Category:    Dictionary Indices
Separator:    space
Syntax:    [01][0-9]{3}\.[0-7][0-9][01]
Introduced:    3.0

The position of this character in the KangXi dictionary used in the
four-dictionary sorting algorithm. The position is in the form "page.position"
with the final digit in the position being "0" for characters actually
in the dictionary and "1" for characters not found in the
dictionary and assigned a "virtual" position in the dictionary.

Thus, "1187.060" indicates the sixth character on page 1187. A character
not in this dictionary but assigned a position between the
6th and 7th characters on page 1187 for sorting purposes
would have the code "1187.061".

This field represents the official position of the character within
the KangXi dictionary as used by the IRG in the four-dictionary sorting
algorithm. The edition of the KangXi dictionary used is the
7th edition published by Zhonghua Bookstore in Beijing, 1989.

###############################################################################

Tag:    kIRG_GSource
Status:    Normative
Category:    IRG Sources
Separator:    space
Syntax:    (4K|BK|CH|CY|FZ(_BK)?|HC|HZ|KX|[0135789ES]-[0-9A-F]{4})
Introduced:    3.0

The IRG "G" source mapping for this character in hex. The IRG G source
consists of data from the following national standards, publications,
and lists from the People's Republic of China and Singapore. The
versions of the standards used are those provided by the PRC to the
IRG and may not always reflect published versions of the
standards generally available.

4K Siku Quanshu

BK Chinese Encyclopedia

CH The Ci Hai (PRC edition)

CY The Ci Yuan

FZ and FZ_BK Founder Press System

G0 GB2312-80

G1 GB12345-90 with 58 Hong Kong and 92 Korean "Idu" characters

G3 GB7589-87 unsimplified forms

G5 GB7590-87 unsimplified forms

G7 General Purpose Hanzi List for Modern Chinese Language,
and General List of Simplified Hanzi

GS Singapore characters

G8 GB8685-88

GE GB16500-95

HC The Hanyu Da Cidian

HZ The Hanyu Da Zidian

KX The KangXi dictionary

###############################################################################

Tag:    kIRG_HSource
Status:    Normative
Category:    IRG Sources
Separator:    N/A
Syntax:    [0-9A-F]{4}
Introduced:    3.1

The IRG "H" source mapping for this character in hex. The
IRG "H" source consists of data from the Hong Kong Supplementary
Character Set.

###############################################################################

Tag:    kIRG_JSource
Status:    Normative
Category:    IRG Sources
Separator:    space
Syntax:    ([0134A]|3A)-[0-9A-F]{4}
Introduced:    3

The IRG "J" source mapping for this character in hex. The IRG
J source consists of data from the following national standards
and lists from Japan.

J0 JIS X 0208:1990

J1 JIS X 0212:1990

J3 JIS X 0213:2000

J4 JIS X 0213:2000

JA Unified Japanese IT Vendors Contemporary Ideographs, 1993

J3A JIS X 0213:2004 level-3

###############################################################################

Tag:    kIRG_KPSource
Status:    Normative
Category:    IRG Sources
Separator:    N/A
Syntax:    KP[01]-[0-9A-F]{4}
Introduced:    3.1.1

The IRG "KP" source mapping for this character in hex. The IRG "KP"
source consists of data from the following national standards
and lists from the Democratic People's Republic of Korea
(North Korea).

KP0 KPS 9566-97

KP1 KPS 10721-2000

###############################################################################

Tag:    kIRG_KSource
Status:    Normative
Category:    IRG Sources
Separator:    N/A
Syntax:    [01234]-[0-9A-F]{4}
Introduced:    3

The IRG "K" source mapping for this character in hex. The IRG "K"
source consists of data from the following national standards
and lists from the Republic of Korea (South Korea).

K0 KS C 5601-1987

K1 KS C 5657-1991

K2 PKS C 5700-1 1994

K3 PKS C 5700-2 1994

K4 PKS 5700-3:1998

Note that the K4 source is expressed in hexadecimal, but
unlike the other sources, it is not organized in row/column.

###############################################################################

Tag:    kIRG_TSource
Status:    Normative
Category:    IRG Sources
Separator:    N/A
Syntax:    [1-7F]-[0-9A-F]{4}
Introduced:    3

The IRG "T" source mapping for this character in hex. The IRG "T"
source consists of data from the following national standards
and lists from the Republic of China (Taiwan).

T1 CNS 11643-1992, plane 1

T2 CNS 11643-1992, plane 2

T3 CNS 11643-1992, plane 3 (with some additional characters)

T4 CNS 11643-1992, plane 4

T5 CNS 11643-1992, plane 5

T6 CNS 11643-1992, plane 6

T7 CNS 11643-1992, plane 7

TF CNS 11643-1992, plane 15

###############################################################################

Tag:    kIRG_USource
Status:    Normative
Category:    IRG Sources
Separator:    space
Syntax:    U\+2?[0-9A-F]{4}
Introduced:    4.0.1

The IRG "U" source mapping for this character. Currently, the IRG
U source is limited to a small number of characters in the
CJK Compatibility Ideographs block, where the value is the
Unicode code point.

###############################################################################

Tag:    kIRG_VSource
Status:    Normative
Category:    IRG Sources
Separator:    space
Syntax:    [0123]-[0-9A-F]{4}
Introduced:    3

The IRG "V" source mapping for this character in hex. The IRG
V source consists of data from the following national standards
and lists from Vietnam.

V0 TCVN 5773:1993

V1 VHN 01:1998

V2 VHN 02:1998

V3 TCVN 6056:1995

###############################################################################

Tag:    kJIS0213
Status:    Provisional
Category:    Other Mappings
Separator:    space
Syntax:    [12],[0-9]{2},[0-9]{1,2}
Introduced:    3.1.1

The JIS X 0213-2000 mapping for this character in men,ku,ten
(plane, row, cell) form.

###############################################################################

Tag:    kJapaneseKun
Status:    Provisional
Category:    Dictionary-like Data
Separator:    space
Syntax:    [A-Z]+

The Japanese pronunciation(s) of this character.

###############################################################################

Tag:    kJapaneseOn
Status:    Provisional
Category:    Dictionary-like Data
Separator:    space
Syntax:    [A-Z]+

The Sino-Japanese pronunciation(s) of this character.

###############################################################################

Tag:    kJis0
Status:    Provisional
Category:    Other Mappings
Separator:    space
Syntax:    [0-9]{4}

The JIS X 0208-1990 mapping for this character in ku/ten
form.

###############################################################################

Tag:    kJis1
Status:    Provisional
Category:    Other Mappings
Separator:    space
Syntax:    [0-9]{4}

The JIS X 0212-1990 mapping for this character in ku/ten
form.

###############################################################################

Tag:    kKPS0
Status:    Provisional
Category:    Other Mappings
Separator:    space
Syntax:    [0-9A-F]{4}
Introduced:    3.1.1

The KPS 9566-97 mapping for this character in hexadecimal
form.

###############################################################################

Tag:    kKPS1
Status:    Provisional
Category:    Other Mappings
Separator:    space
Syntax:    [0-9A-F]{4}
Introduced:    3.1.1

The KPS 10721-2000 mapping for this character in hexadecimal
form.

###############################################################################

Tag:    kKSC0
Status:    Provisional
Category:    Other Mappings
Separator:    space
Syntax:    [0-9]{4}

The KS X 1001:1992 (KS C 5601-1989) mapping for this character
in ku/ten form.

###############################################################################

Tag:    kKSC1
Status:    Provisional
Category:    Other Mappings
Separator:    space
Syntax:    [0-9]{4}

The KS X 1002:1991 (KS C 5657-1991) mapping for this character
in ku/ten form.

###############################################################################

Tag:    kKangXi
Status:    Provisional
Category:    Dictionary Indices
Separator:    space
Syntax:    [0-9]{4}\.[0-9]{2}[01]

The position of this character in the KangXi dictionary used in the
four-dictionary sorting algorithm. The position is in the form "page.position"
with the final digit in the position being "0" for characters actually
in the dictionary and "1" for characters not found in the
dictionary and assigned a "virtual" position in the dictionary.

Thus, "1187.060" indicates the sixth character on page 1187. A character
not in this dictionary but assigned a position between the
6th and 7th characters on page 1187 for sorting purposes
would have the code "1187.061".

The edition of the KangXi dictionary used is the 7th edition
published by Zhonghua Bookstore in Beijing, 1989.

###############################################################################

Tag:    kKarlgren
Status:    Provisional
Category:    Dictionary Indices
Separator:    space
Syntax:    [1-9][0-9]{0,3}[A*]?
Introduced:    3.1.1

The index of this character in _Analytic Dictionary of Chinese
and Sino-Japanese_ by Bernhard Karlgren, New York: Dover
Publications, Inc., 1974.

If the index is followed by an asterisk (*), then the index is an
interpolated one, indicating where the character would be found if
it were to have been included in the dictionary. Note that while
the index itself is usually an integer, there are some cases
where it is an integer followed by an "A".

###############################################################################

Tag:    kKorean
Status:    Provisional
Category:    Dictionary-like Data
Separator:    space
Syntax:    [A-Z]+

The Korean pronunciation(s) of this character, using the Yale romanization
system. (See <http://www.coffeesigns.com/Resources/romanization/korean.asp>
for a comparison of the various Korean romanization systems.)

###############################################################################

Tag:    kLau
Status:    Provisional
Category:    Dictionary Indices
Separator:    space
Syntax:    [1-9][0-9]{0,3}
Introduced:    3.1.1

The index of this character in A Practical Cantonese-English
Dictionary by Sidney Lau, Hong Kong: The Government Printer,
1977.

The index consists of an integer. Missing indices indicate unencoded
characters which are being submitted to the IRG for inclusion
in future versions of the standard.

###############################################################################

Tag:    kMainlandTelegraph
Status:    Provisional
Category:    Other Mappings
Separator:    space
Syntax:    [0-9]{4}

The PRC telegraph code for this character, derived from "Kanzi denpou
koudo henkan-hyou" ("Chinese character telegraph code conversion
table"), Lin Jinyi, KDD Engineering and Consulting, Tokyo,
1984.

###############################################################################

Tag:    kMandarin
Status:    Provisional
Category:    Dictionary-like Data
Separator:    space
Syntax:    [A-ZÜ]+[1-5]

The Mandarin pronunciation(s) for this character in pinyin;
Mandarin pronunciations are sorted in order of frequency,
not alphabetically.
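
Because the order of subvalues is significant for this field, a
consumer would normally keep the readings in the order given. The
Python sketch below shows one way to do that. The function name
parse_mandarin and the sample value are made up for illustration
and are not quoted from the data.

```python
import re

# Pattern from the Syntax line for kMandarin, with groups added.
_READING = re.compile(r"([A-ZÜ]+)([1-5])")

def parse_mandarin(value):
    """Return [(syllable, tone), ...] in the order given (most frequent first)."""
    readings = []
    for item in value.split():
        m = _READING.fullmatch(item)
        if m is None:
            raise ValueError(f"not a kMandarin reading: {item!r}")
        readings.append((m.group(1), int(m.group(2))))
    return readings

readings = parse_mandarin("XING2 HANG2")  # illustrative value only
print(readings[0])  # ('XING', 2), the first and so most frequent reading
```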

###############################################################################

Tag:    kMatthews
Status:    Provisional
Category:    Dictionary Indices
Separator:    space
Syntax:    [0-9]{1,4}(a|\.5)?

The index of this character in Mathews' Chinese-English Dictionary
by Robert H. Mathews, Cambridge: Harvard University Press,
1975.

Note that the field name is kMatthews instead of kMathews to maintain
compatibility with earlier versions of this file, where it
was inadvertently misspelled.

###############################################################################

Tag:    kMeyerWempe
Status:    Provisional
Category:    Dictionary Indices
Separator:    space
Syntax:    [1-9][0-9]{0,3}[a-t*]?
Introduced:    3.1

The index of this character in the Student's Cantonese-English Dictionary
by Bernard F. Meyer and Theodore F. Wempe (3rd edition, 1947). The
index is an integer, optionally followed by a lower-case Latin letter
if the listing is in a subsidiary entry and not a main one. In some
cases where the character is found in the radical-stroke index, but
not in the main body of the dictionary, the integer is followed
by an asterisk (e.g., U+50E5, which is listed as 736* as
well as 1185a).
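
The index structure described above can be sketched in Python as
follows. The function name parse_meyer_wempe and the tuple layout
are assumptions made for this example; the sample value combines the
two indices mentioned for U+50E5.

```python
import re

# Pattern from the Syntax line for kMeyerWempe, with groups added.
_INDEX = re.compile(r"([1-9][0-9]{0,3})([a-t*]?)")

def parse_meyer_wempe(value):
    """Return [(number, subentry_letter_or_None, index_only), ...]."""
    entries = []
    for item in value.split():
        m = _INDEX.fullmatch(item)
        if m is None:
            raise ValueError(f"not a kMeyerWempe index: {item!r}")
        number, suffix = m.groups()
        # A letter marks a subsidiary entry; "*" marks a character found
        # only in the radical-stroke index.
        letter = suffix if suffix.isalpha() else None
        entries.append((int(number), letter, suffix == "*"))
    return entries

print(parse_meyer_wempe("736* 1185a"))
# [(736, None, True), (1185, 'a', False)]
```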

###############################################################################

Tag:    kMorohashi
Status:    Provisional
Category:    Dictionary Indices
Separator:    space
Syntax:    [0-9]{5}'?

The index of this character in the Dai Kanwa Ziten, aka Morohashi
dictionary (Japanese) used in the four-dictionary sorting
algorithm.

The edition used is the revised edition, published in Tokyo
by Taishuukan Shoten, 1986.

###############################################################################

Tag:    kNelson
Status:    Provisional
Category:    Dictionary Indices
Separator:    space
Syntax:    [0-9]{4}

The index of this character in The Modern Reader's Japanese-English
Character Dictionary by Andrew Nathaniel Nelson, Rutland,
Vermont: Charles E. Tuttle Company, 1974.

###############################################################################

Tag:    kOtherNumeric
Status:    Informative
Category:    Numeric Values
Separator:    space
Syntax:    [0-9]+
Introduced:    3.2

The numeric value for the character in certain unusual, specialized
contexts.

The three numeric-value fields should have no overlap; that is, characters
with a kOtherNumeric value should not have a kAccountingNumeric
or kPrimaryNumeric value as well.

###############################################################################

Tag:    kPhonetic
Status:    Provisional
Category:    Dictionary-like Data
Separator:    space
Syntax:    [1-9][0-9]{0,3}[A-D]?\*?
Introduced:    3.1

The phonetic index for the character from Ten Thousand Characters:
An Analytic Dictionary by G. Hugh Casey, S.J. Hong Kong:
Kelley and Walsh, 1980.

###############################################################################

Tag:    kPrimaryNumeric
Status:    Informative
Category:    Numeric Values
Separator:    space
Syntax:    [0-9]+
Introduced:    3.2

The value of the character when used in the writing of numbers
in the standard fashion.

The three numeric-value fields should have no overlap; that is, characters
with a kPrimaryNumeric value should not have a kAccountingNumeric
or kOtherNumeric value as well.
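
The no-overlap rule for the three numeric-value fields lends itself
to a mechanical check. The Python sketch below assumes the
three-field, tab-separated line format described at the top of this
file. The function name find_numeric_overlaps is hypothetical, and
the sample lines are invented (the third is deliberately wrong) to
show a violation being reported.

```python
from collections import defaultdict

NUMERIC_TAGS = {"kAccountingNumeric", "kOtherNumeric", "kPrimaryNumeric"}

def find_numeric_overlaps(lines):
    """Map each code point carrying more than one numeric tag to its tags."""
    seen = defaultdict(set)
    for line in lines:
        # Data lines start with the scalar value; skip everything else.
        if not line.startswith("U+"):
            continue
        code_point, tag, _value = line.rstrip("\n").split("\t", 2)
        if tag in NUMERIC_TAGS:
            seen[code_point].add(tag)
    return {cp: sorted(tags) for cp, tags in seen.items() if len(tags) > 1}

sample = [
    "U+4E00\tkPrimaryNumeric\t1",
    "U+58F9\tkAccountingNumeric\t1",
    "U+58F9\tkOtherNumeric\t1",  # invented overlap, for demonstration
]
print(find_numeric_overlaps(sample))
# {'U+58F9': ['kAccountingNumeric', 'kOtherNumeric']}
```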

###############################################################################

Tag:    kPseudoGB1
Status:    Provisional
Category:    Other Mappings
Separator:    space
Syntax:    [0-9]{4}

A "GB 12345-90" code point assigned this character for the purposes
of including it within Unihan. Pseudo-GB1 codes were used to provide
official code points for characters not already in national
standards, such as characters used to write Cantonese, and
so on.

###############################################################################

Tag:    kRSAdobe_Japan1_6
Status:    Provisional
Category:    Radical-Stroke Counts
Separator:    space
Syntax:    [CV]\+[0-9]{1,5}\+[1-9][0-9]{0,2}\.[1-9][0-9]?\.[0-9]{1,2}
Introduced:    4.1

Information on the glyphs in Adobe-Japan1-6 as contributed by Adobe.
The value consists of a number of space-separated entries.
Each entry consists of three pieces of information separated
by a plus sign:

1) C or V. "C" indicates that the Unicode code point maps directly
to the Adobe-Japan1-6 CID that appears after it, and "V"
indicates that it is considered a variant form, and thus
not directly encoded.

2) The Adobe-Japan1-6 CID.

3) Radical-stroke data for the indicated Adobe-Japan1-6 CID. The
radical-stroke data consists of three pieces separated by periods:
the KangXi radical (1-214), the number of strokes in the form the
radical takes in the glyph, and the number of strokes in the residue.
The standard Unicode radical-stroke form can be obtained by omitting
the second value, and the total strokes in the glyph from
adding the second and third values.
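
The derivation in item 3 can be made concrete with a short Python
sketch. The names AdobeEntry and parse_rs_adobe are hypothetical,
and the sample value is invented to match the syntax rather than
taken from the data.

```python
import re
from typing import NamedTuple

# Pattern from the Syntax line for kRSAdobe_Japan1_6, with groups added.
_ENTRY = re.compile(
    r"([CV])\+([0-9]{1,5})\+([1-9][0-9]{0,2})\.([1-9][0-9]?)\.([0-9]{1,2})"
)

class AdobeEntry(NamedTuple):
    direct: bool           # True for "C", False for "V"
    cid: int               # Adobe-Japan1-6 CID
    radical: int           # KangXi radical number
    radical_strokes: int   # strokes in the radical as it appears in the glyph
    residual_strokes: int  # strokes in the residue

    @property
    def rs_unicode(self):
        # Standard radical-stroke form: omit the second value.
        return f"{self.radical}.{self.residual_strokes}"

    @property
    def total_strokes(self):
        # Total strokes: add the second and third values.
        return self.radical_strokes + self.residual_strokes

def parse_rs_adobe(value):
    entries = []
    for item in value.split():
        m = _ENTRY.fullmatch(item)
        if m is None:
            raise ValueError(f"not a kRSAdobe_Japan1_6 entry: {item!r}")
        kind, cid, radical, rad_strokes, residue = m.groups()
        entries.append(AdobeEntry(kind == "C", int(cid), int(radical),
                                  int(rad_strokes), int(residue)))
    return entries

entry = parse_rs_adobe("C+1234+9.2.4")[0]  # invented value
print(entry.rs_unicode, entry.total_strokes)  # 9.4 6
```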

###############################################################################

Tag:    kRSJapanese
Status:    Provisional
Category:    Radical-Stroke Counts
Separator:    space
Syntax:    [0-9]{1,3}\.[0-9]{1,2}

A Japanese radical/stroke count for this character in the form "radical.additional
strokes". A ' after the radical indicates the simplified
version of the given radical.

###############################################################################

Tag:    kRSKanWa
Status:    Provisional
Category:    Radical-Stroke Counts
Separator:    space
Syntax:    [0-9]{1,3}\.[0-9]{1,2}

A Morohashi radical/stroke count for this character in the form "radical.additional
strokes". A ' after the radical indicates the simplified
version of the given radical.

###############################################################################

Tag:    kRSKangXi
Status:    Provisional
Category:    Radical-Stroke Counts
Separator:    space
Syntax:    [0-9]{1,3}\.[0-9]{1,2}

The KangXi radical/stroke count for this character consistent with
the value of the kKangXi field in the form "radical.additional
strokes". A ' after the radical indicates the simplified
version of the given radical.

###############################################################################

Tag:    kRSKorean
Status:    Provisional
Category:    Radical-Stroke Counts
Separator:    space
Syntax:    [0-9]{1,3}\.[0-9]{1,2}

A Korean radical/stroke count for this character in the form "radical.additional
strokes". A ' after the radical indicates the simplified
version of the given radical.

###############################################################################

Tag:    kRSUnicode
Status:    Informative
Category:    Radical-Stroke Counts
Separator:    space
Syntax:    [0-9]{1,3}\'?\.[0-9]{1,2}

A standard radical/stroke count for this character in the form "radical.additional
strokes". A ' after the radical indicates the simplified
version of the given radical.

This field is used for additional radical-stroke indices where either
a character may be reasonably classified under more than
one radical, or alternate stroke count algorithms may provide
different stroke counts.

The first value is intended to reflect the same radical as the kRSKangXi
field and the stroke count of the glyph used to print the
character within the Unicode Standard.
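
As a rough illustration, each subvalue can be decomposed into the
radical number, the simplified-radical marker, and the count of
additional strokes. This Python sketch uses a hypothetical function
name, parse_rs_unicode, and an invented sample value.

```python
import re

# Pattern from the Syntax line for kRSUnicode, with groups added.
_RS = re.compile(r"([0-9]{1,3})('?)\.([0-9]{1,2})")

def parse_rs_unicode(value):
    """Return [(radical, is_simplified_radical, additional_strokes), ...]."""
    result = []
    for item in value.split():
        m = _RS.fullmatch(item)
        if m is None:
            raise ValueError(f"not a kRSUnicode subvalue: {item!r}")
        result.append((int(m.group(1)), m.group(2) == "'", int(m.group(3))))
    return result

# The first subvalue is the one intended to agree with kRSKangXi.
print(parse_rs_unicode("120'.3 85.6"))  # [(120, True, 3), (85, False, 6)]
```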

###############################################################################

Tag:    kSBGY
Status:    Provisional
Category:    Dictionary Indices
Separator:    space
Syntax:    [0-9]{3}\.[0-9]{2}
Introduced:    3.2

The position of this character in the Song Ben Guang Yun (SBGY)
Medieval Chinese character dictionary (bibliographic and
general information below).

The 25334 character references are given in the form "ABC.XY", in
which: "ABC" is the zero-padded page number [004..546]; "XY" is the
zero-padded number of the character on the page [01..73]. For example,
364.38 indicates the 38th character on Page 364 (i.e. 澍). Where a
given Unicode Scalar Value (USV) has more than one reference,
these are space-delimited.
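
A reference in this form can be decoded and range-checked against
the page and position limits stated above. The following Python
sketch is illustrative only; the function name parse_sbgy is an
assumption, not part of the data set.

```python
import re

_SBGY = re.compile(r"([0-9]{3})\.([0-9]{2})")

def parse_sbgy(value):
    """Return [(page, position), ...] for a space-delimited kSBGY value."""
    refs = []
    for item in value.split():
        m = _SBGY.fullmatch(item)
        if m is None:
            raise ValueError(f"not a kSBGY reference: {item!r}")
        page, position = int(m.group(1)), int(m.group(2))
        # Limits as documented: pages 004..546, positions 01..73.
        if not (4 <= page <= 546 and 1 <= position <= 73):
            raise ValueError(f"kSBGY reference out of range: {item!r}")
        refs.append((page, position))
    return refs

print(parse_sbgy("364.38"))  # [(364, 38)]
```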

-- Release information (20031005) --

This release corrects several mappings.

-- Release information (20020310) --

This data set contains a total of 25334 references, for 19572
different hanzi (up from 25330 and 19511 in the previous
release).

This release of the kSBGY data fixes a number of mappings, based
on extensive work done since the initial release (compare the initial
release counts given below). See the end of this header for
additional information.

-- Initial release information (20020310) --

The original data was input under the direction of Prof. LUO Fengzhu
at Taiwan Taoyuanxian Yuan Zhi University (see below) using an early
version of the Big5-based CDP encoding scheme developed at Academia
Sinica. During 2000-2002 this raw data was processed and revised
by Richard Cook as follows: the data was converted to Unicode encoding
using his revised kHanYu mapping tables (first provided to the Unicode
Consortium for the Unihan.txt release 3.1.1d1) and also using several
other mapping tables developed specifically for this project; the
kSBGY indices were generated based on hand-counts of all page
totals; numerous indexing errors were corrected; and the
data underwent final proofing.

-- About the print sources --

The SBGY text, which dates to the beginning of the Song Dynasty (c.
1008, edited by 陳彭年 CHEN Pengnian et al.) is an enlargement of an
earlier text known as 《切韻》 Qie Yun (dated to c. 601, edited by 陸法言
LU Fayan). With 25,330 head entries, this large early lexicon is
important in part for the information which it provides for historical
Chinese phonology. The GY dictionary employs a Chinese transcription
method (known as 反切) to give pronunciations for each of its
head entries. In addition, each syllable is also given a
brief gloss.

It must be emphasized that the mapping of a particular SBGY glyph
to a single USV may in some cases be merely an approximation or may
have required the choice of a "best possible glyph" (out of those
available in the Unicode repertoire). This indexing data in conjunction
with the print sources will be useful for evaluating the degree of
distinctive variation in the character forms appearing in this text,
and future proofing of this data may reveal additional Chinese
glyphs for IRG encoding.

-- Bibliographic information on the print sources --

《宋本廣韻》 <<Song Ben Guang Yun>> ['Song Dynasty edition of the
Guang Yun Rhyming Dictionary'], edited by 陳彭年 CHEN Pengnian
et al. (c. 1008).

Two modern editions of this work were consulted in building
the kSBGY indices:

《新校正切宋本廣韻》。台灣黎明文化事業公司 出版,林尹校訂1976 年出版。[This was the edition used
by Prof. LUO (台灣桃園縣元智大學中語系羅鳳珠), and in the subsequent revision,
conversion, indexing and proofing.]

《新校互註‧宋本廣韻》。香港中文大學,余迺永 1993, 2000 年出版。ISBN: 962-201-413-5; 7-5326-0685-6.
[Textual problems were resolved on the basis of this extensively
annotated modern edition of the text.]

-- Additional Information --

For further information on this index data and the databases
from which it is excerpted, see:

Cook, Richard S. 2003. 《說文解字‧電子版》 Shuo Wen Jie Zi - Dianzi Ban: Digital
Recension of the Eastern Han Chinese Grammaticon. PhD Dissertation.
Department of Linguistics. Berkeley: University of California.

###############################################################################

Tag:    kSemanticVariant
Status:    Provisional
Category:    Variants
Separator:    space
Syntax:    U\+2?[0-9A-F]{4}(<k[A-Za-z:]+(,k[A-Za-z]+)*)?

The Unicode value for a semantic variant for this character. A semantic
variant is an x- or y-variant with similar or identical meaning
which can generally be used in place of the indicated character.

The basic syntax is a Unicode scalar value. It may optionally be
followed by additional data. The additional data is separated from
the Unicode scalar value by a less-than sign (<), and may be subdivided
itself into substrings by commas, each of which may be divided into
two pieces by a colon. The additional data consists of a series of
field tags for another field in the Unihan database indicating the
source of the information. If subdivided, the final piece is a string
consisting of the letters T (for tóng, U+540C 同), B (for bù,
U+4E0D 不), or Z (for zhèng, U+6B63 正).

T is used if the indicated source explicitly indicates the
two are the same (e.g., by saying that the one character
is "the same as" the other).

B is used if the source explicitly indicates that the two
are used improperly one for the other.

Z is used if the source explicitly indicates that the given character
is the preferred form. Thus, the Hanyu Da Zidian indicates that
U+5231 刱 and U+5275 創 are semantic variants and that U+5275
創 is the preferred form.
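
The layered separators described above (less-than sign, then commas,
then an optional colon) can be unpacked as in the following Python
sketch. The function name parse_semantic_variant is hypothetical,
and the sample subvalue is constructed to fit the description; it is
not quoted from the data.

```python
def parse_semantic_variant(item):
    """Split one subvalue into (target, [(source_tag, flags), ...])."""
    # Everything before "<" is the Unicode scalar value of the variant.
    target, _, extra = item.partition("<")
    sources = []
    if extra:
        for piece in extra.split(","):
            # An optional colon separates the source tag from T/B/Z flags.
            tag, _, flags = piece.partition(":")
            sources.append((tag, flags))
    return target, sources

print(parse_semantic_variant("U+5275<kHanYu:TZ,kMatthews"))
# ('U+5275', [('kHanYu', 'TZ'), ('kMatthews', '')])
```

A full value would first be split on spaces, with each resulting
subvalue passed to the function in turn.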

###############################################################################

Tag:    kSimplifiedVariant
Status:    Provisional
Category:    Variants
Separator:    space
Syntax:    U\+2?[0-9A-F]{4}

The Unicode value for the simplified Chinese variant for
this character (if any).

Note that a character can be *both* a traditional Chinese character
in its own right *and* the simplified variant for other characters
(e.g., U+53F0).

In such a case, the character is listed as its own simplified variant
and one of its own traditional variants. This distinguishes this
from the case where the character is not the simplified form
for any character (e.g., U+4E95).
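
The self-listing convention can be tested for directly. The Python
sketch below assumes the variant fields have already been loaded
into dictionaries mapping a code point to a set of code points. The
function name and the sample dictionaries are illustrative only and
do not reproduce the actual data.

```python
def is_traditional_and_simplified(code_point, simplified, traditional):
    """True if the character lists itself in both variant fields."""
    return (code_point in simplified.get(code_point, ())
            and code_point in traditional.get(code_point, ()))

# Illustrative data only: U+53F0 lists itself in both fields,
# while U+4E95 has no entry in either.
simplified = {"U+53F0": {"U+53F0"}}
traditional = {"U+53F0": {"U+53F0", "U+81FA"}}

print(is_traditional_and_simplified("U+53F0", simplified, traditional))  # True
print(is_traditional_and_simplified("U+4E95", simplified, traditional))  # False
```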

Much of the data on simplified and traditional variants
was supplied by Wenlin <http://www.wenlin.com>.

###############################################################################

Tag:    kSpecializedSemanticVariant
Status:    Provisional
Category:    Variants
Separator:    space
Syntax:    U\+2?[0-9A-F]{4}(<k[A-Za-z]+(,k[A-Za-z]+)*)?

The Unicode value for a specialized semantic variant for
this character. The syntax is the same as for the kSemanticVariant
field.

A specialized semantic variant is an x- or y-variant with
similar or identical meaning only in certain contexts (such
as accountants' numerals).

###############################################################################

Tag:    kTaiwanTelegraph
Status:    Provisional
Category:    Other Mappings
Separator:    space
Syntax:    [0-9]{4}

The Taiwanese telegraph code for this character, derived from "Kanzi
denpou koudo henkan-hyou" ("Chinese character telegraph code
conversion table"), Lin Jinyi, KDD Engineering and Consulting,
Tokyo, 1984.

###############################################################################

Tag:    kTang
Status:    Provisional
Category:    Dictionary-like Data
Separator:    space
Syntax:    \*?[A-Za-z()\x{E6}\x{251}\x{259}\x{25B}\x{300}\x{30C}]+

The Tang dynasty pronunciation(s) of this character, derived from
or consistent with _T'ang Poetic Vocabulary_ by Hugh M. Stimson,
Far Eastern Publications, Yale Univ. 1976.

###############################################################################

Tag:    kTotalStrokes
Status:    Provisional
Category:    Dictionary-like Data
Separator:    space
Syntax:    [1-9][0-9]{0,2}
Introduced:    3.1

The total number of strokes in the character (including the
radical). This value is for the character as drawn in the
Unicode charts.

###############################################################################

Tag:    kTraditionalVariant
Status:    Provisional
Category:    Variants
Separator:    space
Syntax:    U\+2?[0-9A-F]{4}

The Unicode value(s) for the traditional Chinese variant(s)
for this character.

Note that a character can be *both* a traditional Chinese character
in its own right *and* the simplified variant for other characters
(e.g., 台 U+53F0).

In such a case, the character is listed as its own simplified variant
and one of its own traditional variants. This distinguishes this
from the case where the character is not the simplified form
for any character (e.g., 井 U+4E95).

Much of the data on simplified and traditional variants
was supplied by Wenlin Institute, Inc. <http://www.wenlin.com>.

###############################################################################

Tag:    kVietnamese
Status:    Provisional
Category:    Dictionary-like Data
Separator:    space
Syntax:    [A-Za-z\x{E0}-\x{1B0}\x{1EA1}-\x{1EF9}]+
Introduced:    3.1.1

The character's pronunciation(s) in Quốc ngữ.

###############################################################################

Tag:    kXerox
Status:    Provisional
Category:    Other Mappings
Separator:    space
Syntax:    [0-9]{3}:[0-9]{3}

The Xerox code for this character.

###############################################################################

Tag:    kZVariant
Status:    Provisional
Category:    Variants
Separator:    space
Syntax:    U\+2?[0-9A-F]{4}(:k[A-Za-z]+)?

The Unicode value(s) for known z-variants of this character.

###############################################################################

BEGIN Valid UniHan Ranges for this release (5.0):
U+3400..U+4DB5 : CJK Unified Ideographs Extension A
U+4E00..U+9FA5 : CJK Unified Ideographs
U+9FA6..U+9FBB : CJK Unified Ideographs (4.1)
U+F900..U+FA2D : CJK Compatibility Ideographs (a)
U+FA30..U+FA6A : CJK Compatibility Ideographs (b)
U+FA70..U+FAD9 : CJK Compatibility Ideographs (4.1)
U+20000..U+2A6D6 : CJK Unified Ideographs Extension B
U+2F800..U+2FA1D : CJK Compatibility Supplement
END Valid UniHan Ranges for this release (5.0)
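
A consumer of this file may want to confirm that a scalar value in
the first field falls inside one of the ranges listed above. The
following Python sketch hard-codes those ranges. The names
VALID_RANGES and is_valid_unihan are hypothetical, and the table
would need updating for any other release.

```python
import re

# Ranges transcribed from the list above (release 5.0), inclusive.
VALID_RANGES = [
    (0x3400, 0x4DB5),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FA5),    # CJK Unified Ideographs
    (0x9FA6, 0x9FBB),    # CJK Unified Ideographs (4.1)
    (0xF900, 0xFA2D),    # CJK Compatibility Ideographs (a)
    (0xFA30, 0xFA6A),    # CJK Compatibility Ideographs (b)
    (0xFA70, 0xFAD9),    # CJK Compatibility Ideographs (4.1)
    (0x20000, 0x2A6D6),  # CJK Unified Ideographs Extension B
    (0x2F800, 0x2FA1D),  # CJK Compatibility Supplement
]

_SCALAR = re.compile(r"U\+([0-9A-F]{4,5})")

def is_valid_unihan(scalar):
    """True if 'U+xxxx' or 'U+xxxxx' lies in a listed range."""
    m = _SCALAR.fullmatch(scalar)
    if m is None:
        return False
    code_point = int(m.group(1), 16)
    return any(low <= code_point <= high for low, high in VALID_RANGES)

print(is_valid_unihan("U+4E00"))  # True
print(is_valid_unihan("U+9FBC"))  # False, one past the last 4.1 addition
```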

###############################################################################

ACCURACY OF THE DATA:

Not all of these fields have been checked and proofed as carefully as some
    others have been. Please report errata, corrections, and additions at
    <http://www.unicode.org/unicode/reporting.html>.

The following fields may be taken as completely accurate and their values are
    *normative* parts of Unicode and ISO/IEC 10646-1 and -2:

kIRG_GSource, kIRG_TSource, kIRG_JSource, kIRG_KSource, kIRG_KPSource, kIRG_VSource,
    and kIICore

The IRG dictionary fields have also been extensively proofed by IRG experts and may
    be taken as accurate.

The following fields have been extensively proofed by experts world-wide and may be
    taken as accurate:

kBigFive, kCNS1986, kGB0, kGB1, kGB3, kGB5, kGB7, kGB8, kJis0, kJis1, kJIS0213,
    kKSC0, kKSC1, kPseudoGB1, kCCCII, kCNS1992, kDaeJaweon, kHanYu, kIBMJapan,
    kKangXi, kMatthews, kMorohashi, kNelson, kXerox

The remaining fields have not been as extensively proofed and their values should be
    taken as provisional.

RELEASE NOTES:

5.0        The kCheungBauer, kCheungBauerIndex, kFourCornerCode, and kHangul fields were added.

4.1        The kPhonetic data was regenerated to include multiple entries for individual
        characters. Duplicate entries were removed from the kMandarin and kCantonese
        fields. All fields are now complete. The kFenn field had substantial new
        data added. The kFennIndex field was added. The latest data sets for kSBGY
        and kHanYu were included. The kAlternateKangXi and kAlternateMorohashi
        fields were dropped. The syntax of the kSemanticVariant and
        kSpecializedSemanticVariant fields was extended to include source information.
        The data in these two fields were substantially extended. The Cantonese field
        has been changed to use jyutping instead of Yale romanization. Preliminary
        data for new characters has been added. The various kIRG* fields have
        had their values resynchronized with data in ISO/IEC 10646. Numerous other
        individual corrections and additions were made. The header has been
        restructured and expanded, in preparation for moving the field
        descriptions into a separate document. The kRSAdobe_Japan1_6 field was
        added. The Cantonese readings have been extended and corrected using
        data from the Hong Kong Linguistic Society and Hong Kong Polytechnic
        University.    The kIICore field was added.

4.0.1    In addition to numerous small changes and corrections, the kMandarin field
        has been regenerated from earlier versions of the data with later corrections
        re-inserted. This was required because of a script error which incorrectly
        assigned readings to various characters. The order of the kMandarin field
        has been restored to frequency order. There have been substantial updates
        and corrections to the kCantonese, kCihaiT, kCowles, kDefinition, kGradeLevel,
        kHKGlyph, kLau, kMeyerWempe, and kVietnamese fields. (The kCihaiT, kCowles,
        kGradeLevel, and kLau fields are now complete.) The kHanyuPinlu, kIRG_USource,
        and kGSR fields have been added.

KNOWN ERRORS:

The Japanese and Korean readings need to be normalized. The variant fields need
    to be extended.


END OF FILE