IMPORTANT: The newest version, 0.40, implements Normalization Form C (NFC, Canonical Composition) and Normalization Form D (NFD, Canonical Decomposition), including Hangul.
Charlint is a character normalization/checking tool written in Perl. Among other things, it implements Normalization Form C of Unicode TR 15.
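As an illustration of what the two normalization forms do, here is a minimal sketch using Python's standard unicodedata module rather than charlint itself. It assumes the Unicode data bundled with the Python interpreter, which is newer than the 3.0 data charlint reads; the variable names are chosen for this example only.

```python
import unicodedata

# "e with acute" as one precomposed code point (U+00E9)
composed = "\u00e9"
# The same letter as "e" followed by COMBINING ACUTE ACCENT (U+0301)
decomposed = "e\u0301"

# NFC composes the base letter and the accent into one code point.
nfc = unicodedata.normalize("NFC", decomposed)
# NFD splits the precomposed letter into base letter plus accent.
nfd = unicodedata.normalize("NFD", composed)

# Hangul syllable GA (U+AC00) corresponds to two conjoining jamo,
# CHOSEONG KIYEOK (U+1100) and JUNGSEONG A (U+1161).
hangul_nfd = unicodedata.normalize("NFD", "\uac00")
hangul_nfc = unicodedata.normalize("NFC", "\u1100\u1161")

print([hex(ord(ch)) for ch in nfc])
print([hex(ord(ch)) for ch in nfd])
print([hex(ord(ch)) for ch in hangul_nfd])
print([hex(ord(ch)) for ch in hangul_nfc])
```

Both spellings of the accented letter look identical on screen but differ byte for byte, which is why a normalizing filter is useful before comparing or storing text.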
Charlint, aka 'Charlie', is written in Perl 5. You can get the source from http://www.w3.org/International/charlint/charlint.pl. Charlint is covered by the W3C software licence. To install charlint, make sure that Perl 5 is installed and that you have downloaded both the Perl source and an appropriate character data file. Please send error reports or comments to duerst@w3.org; for announcements and public discussion, please see the Winter mailing list (www-international@w3.org).
Charlint needs information on characters in order to work correctly. To indicate the file you want to use, please use the -f option. The currently recommended character data file is available from ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData-Latest.txt. Composition exclusions are currently hard-coded and are based on ftp://ftp.unicode.org/Public/UNIDATA/CompositionExclusions.txt [Final 3.0.0 version of 10 Sept 1999 for both files; identical with the versions available on the CD-ROM provided with The Unicode Standard, Version 3.0]. Additional information on these and other files can be found at http://www.unicode.org/unicode/onlinedat/online.html.
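To show what the character data file contributes, here is a minimal Python sketch that reads the two properties canonical normalization depends on: the canonical combining class (field 3) and the decomposition mapping (field 5) of each semicolon-separated record. The function name, the dictionaries, and the three sample records are made up for this example; this is not how charlint.pl itself is written.

```python
def parse_unicode_data(lines):
    """Collect combining classes and canonical decompositions (illustrative only)."""
    comb_class = {}    # code point -> non-zero canonical combining class
    canon_decomp = {}  # code point -> list of code points it decomposes to
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code_point = int(fields[0], 16)
        ccc = int(fields[3])
        if ccc != 0:
            comb_class[code_point] = ccc
        mapping = fields[5]
        # A leading <tag> marks a compatibility mapping, which NFC and NFD ignore.
        if mapping and not mapping.startswith("<"):
            canon_decomp[code_point] = [int(h, 16) for h in mapping.split()]
    return comb_class, canon_decomp


# Sample records in the UnicodeData.txt layout (not read from the real file).
SAMPLE = [
    "00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;",
    "00C0;LATIN CAPITAL LETTER A WITH GRAVE;Lu;0;L;0041 0300;;;;N;LATIN CAPITAL LETTER A GRAVE;;;00E0;",
    "0300;COMBINING GRAVE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING GRAVE;;;;",
]

comb_class, canon_decomp = parse_unicode_data(SAMPLE)
print(comb_class)
print(canon_decomp)
```

Hangul syllables appear in the data file only as a first/last range, so their decompositions are not listed there and have to be computed arithmetically.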
Charlint is a Perl script that works as a simple filter. It uses UTF-8 for both input and output. Its behaviour can be fine-tuned with various options. A list of options such as the one below can be obtained by running charlint -h.
(Options prefixed by # are currently not available.)
-b: Remove initial 'Byte Order Mark'
-B: Suppress warning about initial 'Byte Order Mark'
-C: Do not normalize
-d: Debug: Thoroughly check character data table input
-D: Leave after reading in character data
-e: # Remove undefined codepoints
-E: Do not warn about undefined codepoints
-f file: Read data from file (please use newest V3.0 beta datafiles)
-h: Prints out this short description
-k: # Warn about compatibility codepoints
-K: # Normalize out compatibility codepoints
-n: Accept &#ddddd; and &#xhhhh; on input (beware of <![CDATA[, <SCRIPT>, <STYLE>)
-N: Produce &#xhhhh; on output
-o: Print out 'unprintable' bytes as \octal
-p: # Remove stuff in private zone
-P: Suppress checking private zone
-u: # Fix UTF-8 (convert or remove)
-U: Suppress checking correctness of UTF-8
-v: Print version
-x: Do decomposition only
-X: Don't do decomposition (assume input is decomposed)
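The filter behaviour and a few of the options above can be sketched in Python as follows. This is a rough model, not charlint's code: the function names and keyword arguments are hypothetical, the keyword arguments only loosely mirror -b, -n and -N, and writing every non-ASCII character as an upper-case hexadecimal reference on output is an assumption about what -N does.

```python
import re
import unicodedata

# Matches decimal (&#233;) and hexadecimal (&#xE9;) numeric character references.
NCR_PATTERN = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def decode_ncrs(text):
    """Replace numeric character references with the characters they name."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Leave out-of-range values and surrogates untouched rather than guess.
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        return chr(code_point)
    return NCR_PATTERN.sub(replace, text)


def encode_ncrs(text):
    """Write every non-ASCII character as a hexadecimal reference (assumed policy)."""
    return "".join(ch if ord(ch) < 0x80 else "&#x%X;" % ord(ch) for ch in text)


def normalize_bytes(data, strip_bom=False, ncr_in=False, ncr_out=False):
    """Hypothetical filter: UTF-8 bytes in, NFC-normalized UTF-8 bytes out."""
    text = data.decode("utf-8")  # raises UnicodeDecodeError on malformed UTF-8
    if strip_bom and text.startswith("\ufeff"):
        text = text[1:]
    if ncr_in:
        text = decode_ncrs(text)
    text = unicodedata.normalize("NFC", text)
    if ncr_out:
        text = encode_ncrs(text)
    return text.encode("utf-8")


# "e" plus a combining acute accent written as a reference composes to one character.
print(normalize_bytes(b"e&#x301;", ncr_in=True, ncr_out=True))
```

Decoding references before normalizing matters here: a combining mark hidden inside a reference would otherwise never be composed with the letter in front of it. The warning about <![CDATA[, <SCRIPT> and <STYLE> applies to this sketch as well, since a plain regular expression cannot tell where references are literal text.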
# 2000/08/03: 0.40, added Hangul support and did quite some testing MJD
# 2000/08/02: 0.37, added -x and -X for decomposition MJD
# 2000/07/27: 0.36, fixed a bug for non-starter decompositions MJD
# 2000/07/24: 0.35, adapted exclusions to 3.0.0 final (+Tibetan) MJD
# 2000/07/24: 0.34, $chClass = $CombClass{ch}; should read $chClass = $CombClass{$ch};
#                   implemented -C MJD
# 1999/08/16: 0.33, updated for second version of 3.0.0.beta MJD
# 1999/07/01: 0.32, adapted surrogates/exclusions to 3.0.0.beta MJD
# 1999/06/25: 0.31, fixed reordering bug, going public MJD
# 1999/06/23: 0.30, preparation for W3C member test, without Hangul MJD
We have just released the first version of charlint. There are many things we plan to add in the future:
Your help (bug reports, patches, ideas, test cases) is welcome.