W3C

Charlint - A Character Normalization Tool


Perl Source and Installation

Charlint is written in Perl 5. You can get the source from http://www.w3.org/International/charlint/charlint.pl. Charlint is covered by the W3C software licence. To install charlint, make sure that you have installed Perl 5, downloaded an appropriate character data file, and downloaded the Perl source. Please send error reports or comments to duerst@w3.org; for announcements and public discussion, please see the Winter mailing list (www-international@w3.org).

Recommended Character Data Files

Charlint needs information on characters in order to work correctly. Use the -f option to specify the character data file. The currently recommended character data file is available from ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.beta.txt. Composition exclusions are currently hard-coded and are based on ftp://ftp.unicode.org/Public/3.0-Update/CompositionExclusions-1.beta.txt. Additional information on these and other files can be found at http://www.unicode.org/unicode/standard/versions/Unicode3.0-beta.html. Please note that this data file is a beta version; the beta test runs until 15 August 1999. Using charlint is one way to test this data file. Please send any comments on the data file to errata@unicode.org.
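
For orientation, the following is a minimal sketch of how one record of a UnicodeData-style file can be read. It is written in Python for illustration and is not charlint's own Perl code. It assumes the usual semicolon-separated layout (code point, name, general category, canonical combining class, bidirectional class, decomposition mapping, and further fields that are ignored here). The names `CharRecord` and `parse_unicode_data_line` are hypothetical, chosen for this example only.

```python
from typing import NamedTuple, Optional, Tuple


class CharRecord(NamedTuple):
    """The subset of one UnicodeData record that matters for normalization."""
    codepoint: int
    name: str
    category: str
    combining_class: int
    compat_tag: Optional[str]        # e.g. "<noBreak>"; None for a canonical mapping
    decomposition: Tuple[int, ...]   # empty if the character does not decompose


def parse_unicode_data_line(line: str) -> CharRecord:
    """Split one semicolon-separated record into typed fields (assumed layout)."""
    fields = line.rstrip("\n").split(";")
    decomp = fields[5].split()
    tag = None
    # A leading <...> tag marks a compatibility (non-canonical) decomposition.
    if decomp and decomp[0].startswith("<"):
        tag = decomp.pop(0)
    return CharRecord(
        codepoint=int(fields[0], 16),
        name=fields[1],
        category=fields[2],
        combining_class=int(fields[3]),
        compat_tag=tag,
        decomposition=tuple(int(cp, 16) for cp in decomp),
    )


# Two sample records in the assumed format.
A_GRAVE = parse_unicode_data_line(
    "00C0;LATIN CAPITAL LETTER A WITH GRAVE;Lu;0;L;0041 0300;;;;N;"
    "LATIN CAPITAL LETTER A GRAVE;;;00E0;"
)
NBSP = parse_unicode_data_line(
    "00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;"
)
```

In this sketch the canonical decomposition of U+00C0 comes out as the pair (U+0041, U+0300), while U+00A0 carries a compatibility tag, so a canonical normalizer would leave it alone.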

How to use charlint

Charlint is a Perl script that works as a simple filter. It uses UTF-8 for both input and output. Behaviour can be fine-tuned with various options. A list of options like the one below can be obtained by running charlint -h.

(options prefixed by # are currently not available)
-b: Remove initial 'Byte Order Mark'
-B: Suppress warning about initial 'Byte Order Mark'
-d: Debug: Thoroughly check character data table input
-D: Leave after reading in character data
-e: # remove undefined codepoints
-E: Do not warn about undefined codepoints
-f file: Read data from file
         (please use newest V3.0 beta datafiles)
-C: # Do not normalize
-h: Prints out this short description
-n: Accept &#ddddd; and &#xhhhh; on input
        (beware of <![CDATA[, <SCRIPT>, <STYLE>)
-N: Produce &#xhh; on output
-o: Print out 'unprintable' bytes as octal
-p: # Remove stuff in private zone
-P: Suppress checking private zone
-u: # Fix UTF-8 (convert or remove)
-U: Suppress checking correctness of UTF-8
-v: Print version
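
To make the filter behaviour described by the options above more concrete, here is a rough Python sketch of a UTF-8 in, UTF-8 out pipeline. It is not charlint's implementation, and several points are assumptions: that the normalization performed corresponds to Unicode Normalization Form C, that -N writes every non-ASCII character as an uppercase hexadecimal reference, and that malformed UTF-8 is simply rejected. The function and parameter names (`lint`, `strip_bom`, `accept_ncrs`, `emit_ncrs`) are hypothetical. Python's `unicodedata` module uses the character data bundled with the interpreter, not the file given with -f.

```python
import re
import unicodedata

BOM = "\ufeff"
# Matches &#ddddd; (decimal) and &#xhhhh; (hexadecimal) character references.
NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def expand_ncrs(text: str) -> str:
    """Replace numeric character references with their characters (cf. -n)."""
    def repl(match):
        hex_digits, dec_digits = match.groups()
        value = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Leave out-of-range and surrogate references untouched.
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            return match.group(0)
        return chr(value)
    return NCR.sub(repl, text)


def to_ncrs(text: str) -> str:
    """Write every non-ASCII character as &#xhh; (assumed reading of -N)."""
    return "".join(ch if ord(ch) < 0x80 else f"&#x{ord(ch):X};" for ch in text)


def lint(data: bytes, strip_bom: bool = False,
         accept_ncrs: bool = False, emit_ncrs: bool = False) -> bytes:
    """Decode UTF-8, optionally handle BOM and references, normalize, re-encode."""
    text = data.decode("utf-8")  # raises UnicodeDecodeError on malformed input
    if strip_bom and text.startswith(BOM):
        text = text[len(BOM):]
    if accept_ncrs:
        text = expand_ncrs(text)
    text = unicodedata.normalize("NFC", text)  # assumption: NFC
    if emit_ncrs:
        text = to_ncrs(text)
    return text.encode("utf-8")
```

Note that this sketch expands references everywhere in the input, which is exactly why the -n option warns about <![CDATA[, <SCRIPT>, and <STYLE>: a reference inside such a section is literal text and should not be expanded.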

Version History

1999/06/23: 0.30, preparation for W3C member test, without Hangul (MJD)
1999/06/25: 0.31, fixed reordering bug, going public (MJD)
1999/07/01: 0.32, adapted surrogates/exclusions to 3.0.0.beta (MJD)

Future Plans

We have just released the first version of charlint. There are many things we plan to add in the future:


Martin Dürst
Webmaster
Last revised 1999/07/01 08:56:23 by duerst

Copyright © 1997 W3C (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply. Your interactions with this site are in accordance with our public and Member privacy statements.