Charlint - A Character Normalization Tool

IMPORTANT: The newest version, 0.40, implements Normalization Form C (NFC, Canonical Composition) and Normalization Form D (NFD, Canonical Decomposition), including Hangul.

Charlint is a character normalization/checking tool written in Perl. Among other things, it implements Normalization Form C of Unicode TR 15.
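To make the two normalization forms concrete, here is a minimal sketch in Python rather than Perl. It does not use charlint itself. It relies on Python's standard unicodedata module, which follows a later Unicode version than 3.0, so results for characters added after 3.0 may differ from what charlint produces. The helper names to_nfc and to_nfd are chosen for illustration only.

```python
import unicodedata

def to_nfc(text: str) -> str:
    """Canonical decomposition followed by canonical composition (NFC)."""
    return unicodedata.normalize("NFC", text)

def to_nfd(text: str) -> str:
    """Canonical decomposition only (NFD)."""
    return unicodedata.normalize("NFD", text)

# Latin example: 'e' followed by COMBINING ACUTE ACCENT vs. the precomposed letter.
decomposed = "e\u0301"
composed = "\u00e9"

# Hangul example: one precomposed syllable vs. its three conjoining jamo.
hangul = "\ud55c"
hangul_jamo = "\u1112\u1161\u11ab"

print(to_nfc(decomposed) == composed)
print(to_nfd(composed) == decomposed)
print(to_nfd(hangul) == hangul_jamo)
print(to_nfc(hangul_jamo) == hangul)
```

Both representations of each string display identically, which is why a normalizing filter is useful before comparing or indexing text.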

Perl Source and Installation

Charlint, aka 'Charlie', is written in Perl 5. You can get the source from http://www.w3.org/International/charlint/charlint.pl. Charlint is covered by the W3C software licence. To install charlint, please make sure that you have installed Perl 5, downloaded an appropriate character data file, and downloaded the Perl source. Please send error reports or comments to duerst@w3.org; for announcements and public discussion, please see the Winter mailing list (www-international@w3.org).

Recommended Character Data Files

Charlint needs information on characters in order to work correctly. To indicate the file you want to use, please use the -f option. The currently recommended character data file is available from ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData-Latest.txt. Composition exclusions are currently hard-coded and are based on ftp://ftp.unicode.org/Public/UNIDATA/CompositionExclusions.txt [Final 3.0.0 version of 10 Sept 1999 for both files; identical with the versions available on the CD-ROM provided with The Unicode Standard, Version 3.0]. Additional information on these and other files can be found at http://www.unicode.org/unicode/onlinedat/online.html.
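As an illustration of the kind of information such a data file carries, the following Python sketch parses one semicolon-separated record in the UnicodeData format and extracts the fields that matter for normalization: the canonical combining class (field 3) and the decomposition mapping (field 5). This is not charlint's own parser. The names CharInfo and parse_unicode_data_line are hypothetical, and the sample records are written in the style of the file rather than copied from a specific release.

```python
from typing import NamedTuple, Optional, Tuple

class CharInfo(NamedTuple):
    codepoint: int
    name: str
    combining_class: int
    decomp_tag: Optional[str]        # e.g. "<compat>"; None means canonical
    decomposition: Tuple[int, ...]   # empty if the character does not decompose

def parse_unicode_data_line(line: str) -> CharInfo:
    """Parse one record of a UnicodeData-style file (hypothetical helper)."""
    fields = line.rstrip("\n").split(";")
    parts = fields[5].split()
    tag = None
    # A leading <...> token marks a compatibility decomposition.
    if parts and parts[0].startswith("<"):
        tag = parts.pop(0)
    return CharInfo(
        codepoint=int(fields[0], 16),
        name=fields[1],
        combining_class=int(fields[3]),
        decomp_tag=tag,
        decomposition=tuple(int(p, 16) for p in parts),
    )

# Sample records in the style of the data file (illustrative).
e_acute = parse_unicode_data_line(
    "00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;"
    "LATIN SMALL LETTER E ACUTE;;00C9;;00C9"
)
acute = parse_unicode_data_line(
    "0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;"
)
nbsp = parse_unicode_data_line(
    "00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;"
)
```

Only canonical mappings (those without a tag) take part in NFC and NFD; tagged mappings such as the one for NO-BREAK SPACE are compatibility decompositions.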

How to use charlint

Charlint is a Perl script that works as a simple filter. It uses UTF-8 both for input and for output. Behaviour can be fine-tuned with various options. A list of options such as the one below can be obtained by using charlint -h.
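The filter model described above (UTF-8 in, normalized UTF-8 out) can be sketched in a few lines of Python. This is an analogy, not charlint's implementation: it normalizes the whole input at once using the standard unicodedata module, and the function name normalize_stream is an assumption made for this example.

```python
import io
import unicodedata

def normalize_stream(src, dst, form="NFC"):
    """Read UTF-8 bytes from src and write normalized UTF-8 bytes to dst.

    Malformed UTF-8 raises UnicodeDecodeError; this sketch makes no attempt
    to repair or report it in more detail.
    """
    text = src.read().decode("utf-8")
    dst.write(unicodedata.normalize(form, text).encode("utf-8"))

# Demonstration with in-memory streams: 'u' + COMBINING DIAERESIS becomes U+00FC.
source = io.BytesIO("Du\u0308rst\n".encode("utf-8"))
sink = io.BytesIO()
normalize_stream(source, sink)
result = sink.getvalue()
```

To use the sketch as a command-line filter, one would pass sys.stdin.buffer and sys.stdout.buffer as src and dst.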

(options prefixed by # are currently not available)
-b: Remove initial 'Byte Order Mark'
-B: Suppress warning about initial 'Byte Order Mark'
-C: Do not normalize
-d: Debug: Thoroughly check character data table input
-D: Leave after reading in character data
-e: # remove undefined codepoints
-E: Do not warn about undefined codepoints
-f file: Read data from file
         (please use newest V3.0 beta datafiles)
-h: Prints out this short description
-k: # Warn about compatibility codepoints
-K: # Normalize out compatibility codepoints
-n: Accept &#ddddd; and &#xhhhh; on input
        (beware of <![CDATA[, <SCRIPT>, <STYLE>)
-N: Produce &#xhhhh; on output
-o: Print out 'unprintable' bytes as \octal
-p: # Remove stuff in private zone
-P: Suppress checking private zone
-u: # Fix UTF-8 (convert or remove)
-U: Suppress checking correctness of UTF-8
-v: Print version
-x: Do decomposition only
-X: Don't do decomposition (assume input is decomposed)
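
The -n and -N options deal with numeric character references. The Python sketch below shows one way such a conversion could work; it is not taken from the charlint source. The function names are hypothetical, and the choice to emit references for every non-ASCII character in produce_ncrs is an assumption, since the option list does not say which characters -N converts.

```python
import re

# Matches decimal (&#ddddd;) and hexadecimal (&#xhhhh;) character references.
_NCR = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

def expand_ncrs(text: str) -> str:
    """Replace numeric character references with the characters they denote."""
    def repl(match):
        cp = int(match.group(1), 16) if match.group(1) else int(match.group(2))
        # Leave out-of-range values and surrogates untouched.
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            return match.group(0)
        return chr(cp)
    return _NCR.sub(repl, text)

def produce_ncrs(text: str) -> str:
    """Write every non-ASCII character as &#xhhhh; (assumed policy)."""
    return "".join(ch if ord(ch) < 0x80 else "&#x%X;" % ord(ch) for ch in text)

expanded = expand_ncrs("caf&#233; and caf&#xE9;")
escaped = produce_ncrs("caf\u00e9")
```

A blind substitution like this also rewrites references inside <![CDATA[ sections and <SCRIPT> or <STYLE> elements, where they are not character references at all, which is the reason for the warning attached to -n.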

Version History

# 2000/08/03: 0.40, added Hangul support and did quite some testing  MJD
# 2000/08/02: 0.37, added -x and -X for decomposition                MJD
# 2000/07/27: 0.36, fixed a bug for non-starter decompositions       MJD
# 2000/07/24: 0.35, adapted exclusions to 3.0.0 final (+Tibetan)     MJD
# 2000/07/24: 0.34, $chClass = $CombClass{ch}; should read $chClass = $CombClass{$ch};
#                   implemented -C                                   MJD
# 1999/08/16: 0.33, updated for second version of 3.0.0.beta         MJD
# 1999/07/01: 0.32, adapted surrogates/exclusions to 3.0.0.beta      MJD
# 1999/06/25: 0.31, fixed reordering bug, going public               MJD
# 1999/06/23: 0.30, preparation for W3C member test, without Hangul  MJD

Background

Character Model for the World Wide Web (W3C Working Draft)
Unicode Technical Report #15 (Version 18 part of Unicode V 3.0)
W3C Open Source Releases

Future Plans

We have just released the first version of charlint. There are many things we plan to add in the future:

Hangul syllable normalization (Done in version 0.40)
Removal of undefined codepoints and codepoints in the private zone
Removal/fix of incorrect UTF-8
Compatibility character detection or removal
Detection or removal of characters not suitable for markup

Your help (bug reports, patches, ideas, test cases) is welcome.

Martin Dürst
Webmaster
last revised $Date: 2000/08/03 08:45:47 $ by $Author: duerst $

Copyright © 1997 W3C (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.