<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<HTML>
<HEAD>
  <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
  <META http-equiv="Content-Style-Type" content="text/css">
  <!--BASE href="http://www.w3.org/Consortium/Translation/"-->
  <!--LINK rel="stylesheet" href="../i18n.css"-->
  <STYLE type="text/css">
  <!--
    H1.title {text-align: center }
    P.toolbar { text-align: center }
    DIV.deliverable { margin-left: 2em;
                margin-right: 2em }
    P.note { margin-left: 10%;
                margin-right: 10%;
                color: green }
    TH { text-align: left }
    TH, TD { padding: 2px }
    .external { font-style: italic }
  -->
  </STYLE>
  <TITLE>Charlint - A Character Normalization Tool</TITLE>
  <LINK rel="stylesheet" type="text/css" href="../../StyleSheets/base.css">
</HEAD>
<BODY bgcolor="#FFFFFF" text="#000000">
<P>
<A href="/"><IMG border="0" src="/Icons/WWW/w3c_home" alt="W3C" width="72"
    height="48"></A>
<A href="/International"><IMG src="/Icons/WWW/i18n-alt" alt="International"
    width="72" height="48" border="0"></A>
<H1>
  Charlint - A Character Normalization Tool
</H1>
<P>
<A href="#Perl">Perl source</A> | <A href="#Recommended">Recommended Data
Files</A> | <A href="#How">How to use</A> | <A href="#Future">Future Plans</A>
| <A href="#Background">Background</A> | <A href="#Version">Version History</A>
<P>
Charlint is a character normalization and checking tool written in Perl. Among
other things, it implements Normalization Form C of
<A href="http://www.unicode.org/unicode/reports/tr15/">Unicode TR 15</A>.
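<P>
As a rough illustration of what Normalization Form C does, the following Python
sketch uses the standard <CODE>unicodedata</CODE> module. It is not charlint's own
code, which is written in Perl. The helper name <CODE>to_nfc</CODE> is made up for
this example. Python ships its own, newer character data, so results for recently
added characters may differ from what charlint produces with the data file it is
given.
<PRE>
```python
import unicodedata


def to_nfc(text):
    """Return text in Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


# 'e' followed by COMBINING ACUTE ACCENT composes to a single character.
composed = to_nfc("e\u0301")
print(["U+%04X" % ord(ch) for ch in composed])

# Combining marks are first put into canonical order (dot below before
# acute), then composed as far as precomposed characters exist.
reordered = to_nfc("a\u0301\u0323")
print(["U+%04X" % ord(ch) for ch in reordered])
```
</PRE>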
<H3>
  <A name="Perl">Perl Source</A> and Installation
</H3>
<P>
Charlint, aka 'Charlie', is written in
<A href="http://www.perl.com/pace/pub/perldocs/latest.html">Perl 5</A>. You
can get the source from
<A href="http://www.w3.org/International/charlint/charlint.pl">http://www.w3.org/International/charlint/charlint.pl</A>.
Charlint is covered by the
<A href="http://www.w3.org/Consortium/Legal/copyright-software.html">W3C
software licence</A>. To install charlint, please make sure you have installed
<A href="http://www.perl.com/pace/pub/perldocs/latest.html">Perl 5</A>, you
have downloaded an appropriate character data file, and you have downloaded
the <A href="http://www.w3.org/International/charlint/charlint.pl">Perl
source</A>. Please send error reports or comments to
<A href="mailto:duerst@w3.org">duerst@w3.org</A>; for announcements and public
discussion please see the Winter mailing list (www-international@w3.org).
<H3>
  <A name="Recommended">Recommended Character Data Files</A>
</H3>
<P>
Charlint needs information on characters in order to work correctly. To indicate
the file you want to use, please use the -f option. The currently recommended
character data file is available from
<A href="ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.beta.txt">ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.beta.txt</A>.
Composition exclusions are currently hard-coded and are based on
<A href="ftp://ftp.unicode.org/Public/3.0-Update/CompositionExclusions-1.beta.txt">ftp://ftp.unicode.org/Public/3.0-Update/CompositionExclusions-1.beta.txt</A>
[version of 04 Aug 1999 for both files]. Additional information on these
and other files can be found at
<A href="http://www.unicode.org/unicode/standard/versions/Unicode3.0-beta.html">http://www.unicode.org/unicode/standard/versions/Unicode3.0-beta.html</A>.
Please note that this data file is a beta version; the beta test will last
up to 15 August 1999. Using charlint is one way to test this data file. Please
send any comments on <EM>the data file</EM> to
<A href="mailto:errata@unicode.org">errata@unicode.org</A>.
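<P>
The character data file is a plain text table with one character per line and
semicolon-separated fields: field 0 is the code point, field 3 the canonical
combining class, and field 5 the decomposition mapping. The Python sketch below
shows one way such a file could be read and turned into a composition table that
honours an exclusion list. It does not describe how charlint itself reads the
file. The sample lines are abridged illustrations of the format, and the names
<CODE>parse_unicode_data</CODE>, <CODE>composition_pairs</CODE> and
<CODE>EXCLUDED</CODE> are hypothetical.
<PRE>
```python
import string

# Illustrative lines in the semicolon-separated character data format.
SAMPLE = """\
00C0;LATIN CAPITAL LETTER A WITH GRAVE;Lu;0;L;0041 0300;;;;N;LATIN CAPITAL LETTER A GRAVE;;;00E0;
0300;COMBINING GRAVE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING GRAVE;;;;
0958;DEVANAGARI LETTER QA;Lo;0;L;0915 093C;;;;N;;;;;
"""

# Assumed, one-entry subset of the composition exclusions.
EXCLUDED = {0x0958}


def parse_unicode_data(text):
    """Return (combining_class, decomposition) dicts keyed by code point."""
    combining_class = {}
    decomposition = {}
    for line in text.splitlines():
        fields = line.split(";")
        if len(fields) != 15:
            continue  # not a well-formed data line
        code_point = int(fields[0], 16)
        combining_class[code_point] = int(fields[3])
        mapping = fields[5]
        # Tagged mappings are compatibility decompositions; keep only
        # plain lists of hexadecimal code points (canonical mappings).
        if mapping and all(ch in string.hexdigits + " " for ch in mapping):
            decomposition[code_point] = tuple(
                int(part, 16) for part in mapping.split()
            )
    return combining_class, decomposition


def composition_pairs(decomposition, excluded):
    """Map (first, second) pairs to their composite, skipping exclusions."""
    return {
        pair: code_point
        for code_point, pair in decomposition.items()
        if len(pair) == 2 and code_point not in excluded
    }


ccc, decomp = parse_unicode_data(SAMPLE)
pairs = composition_pairs(decomp, EXCLUDED)
print(pairs)
```
</PRE>
<P>
A complete table would need further rules, for example for decompositions that
start with a combining character.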
<H3>
  <A name="How">How to use charlint</A>
</H3>
<P>
Charlint is a Perl script that works as a simple filter. It uses UTF-8 both
for input and for output. Behaviour can be fine-tuned with various options.
A list of options such as the one below can be obtained by using <KBD>charlint
-h</KBD>.
<PRE>(options prefixed by # are currently not available)
-b: Remove initial 'Byte Order Mark'
-B: Supress warning about initial 'Byte Order Mark'
-d: Debug: Thoroughly check character data table input
-D: Leave after reading in character data
-e: # remove undefined codepoints
-E: Do not warn about undefined codepoints
-f file: Read data from file
         (please use newest V3.0 beta datafiles)
-C: # Do not normalize
-h: Prints out this short description
-n: Accept &amp;#ddddd; and &amp;#xhhhh; on input
        (beware of &lt;![CDATA[, &lt;SCRIPT&gt;, &lt;STYLE&gt;)
-N: Produce &amp;#xhh; on output
-o: Print out 'unprintable' bytes as octal
-p: # Remove stuff in private zone
-P: Supress checking private zone
-u: # Fix UTF-8 (convert or remove)
-U: Supress checking correctness of UTF-8
-v: Print version
</PRE>
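<P>
To make the kind of checking behind some of these options more concrete, here is
a small Python sketch of three checks: UTF-8 correctness, an initial byte order
mark, and characters in the private use area of the Basic Multilingual Plane. It
is only loosely analogous to what the -b, -B, -P and -U options control. The
function name <CODE>check_bytes</CODE>, its parameter and the wording of the
warnings are assumptions for this example, and charlint's actual behaviour and
messages may differ.
<PRE>
```python
# Private use area of the Basic Multilingual Plane, U+E000..U+F8FF.
PRIVATE_USE = range(0xE000, 0xF900)


def check_bytes(data, strip_bom=False):
    """Check a byte string and return (text, warnings).

    text is None when the input is not correct UTF-8.
    """
    warnings = []
    try:
        text = data.decode("utf-8")  # strict: rejects malformed sequences
    except UnicodeDecodeError as err:
        return None, ["invalid UTF-8 at byte offset %d" % err.start]

    # An initial U+FEFF is either removed or reported.
    if text.startswith("\ufeff"):
        if strip_bom:
            text = text[1:]
        else:
            warnings.append("initial byte order mark")

    for position, ch in enumerate(text):
        if ord(ch) in PRIVATE_USE:
            warnings.append(
                "private use character U+%04X at position %d"
                % (ord(ch), position)
            )
    return text, warnings


print(check_bytes(b"\xef\xbb\xbfabc", strip_bom=True))
print(check_bytes(b"\xc3\x28"))
```
</PRE>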
<H3>
  <A name="Version">Version History</A>
</H3>
<PRE># 1999/06/23: 0.30, preparation for W3C member test, without Hangul  MJD
# 1999/06/25: 0.31, fixed reordering bug, going public               MJD
# 1999/07/01: 0.32, adapted surrogates/exclusions to 3.0.0.beta      MJD
# 1999/08/16: 0.33, updated for second version of 3.0.0.beta         MJD
</PRE>
<H3>
  <A name="Background">Background</A>
</H3>
<UL>
  <LI>
    <A href="http://www.w3.org/TR/WD-charmod">Character Model for the World Wide
    Web</A> (W3C Working Draft)
  <LI>
    <A href="http://www.unicode.org/unicode/reports/tr15/">Unicode Technical
    Report #15</A> (Version 14 approved subject to various changes)
  <LI>
    <A href="http://www.w3.org/Status">W3C Open Source Releases</A>
</UL>
<H3>
  <A name="Future">Future Plans</A>
</H3>
<P>
We have just released the first version of charlint. There are many things
we plan to add in the future:
<UL>
  <LI>
    Hangul syllable normalization
  <LI>
    Removal of undefined codepoints and codepoints in the private zone
  <LI>
    Removal/fix of incorrect UTF-8
  <LI>
    Use of V3.0 decomposition data to automatically detect future
    precomposites
  <LI>
    Compatibility character detection or removal
  <LI>
    Detection or removal of characters not suitable for markup
</UL>
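<P>
Hangul syllable normalization, the first item above, does not need table data
because precomposed syllables can be computed arithmetically from their
conjoining jamo. The Python sketch below shows the composition half of that
arithmetic, using the constants given in the Unicode Standard. It is an
illustration only, not a planned charlint implementation, and the function name
<CODE>compose_hangul</CODE> is made up for this example.
<PRE>
```python
# Constants for Hangul syllable arithmetic from the Unicode Standard.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
S_COUNT = L_COUNT * V_COUNT * T_COUNT


def compose_hangul(text):
    """Compose leading/vowel/trailing jamo into precomposed syllables."""
    out = []
    for ch in text:
        code_point = ord(ch)
        if out:
            last = ord(out[-1])
            # Leading consonant followed by vowel gives an LV syllable.
            l_index = last - L_BASE
            v_index = code_point - V_BASE
            if l_index in range(L_COUNT) and v_index in range(V_COUNT):
                out[-1] = chr(
                    S_BASE + (l_index * V_COUNT + v_index) * T_COUNT
                )
                continue
            # LV syllable followed by trailing consonant gives LVT.
            s_index = last - S_BASE
            t_index = code_point - T_BASE
            if (
                s_index in range(S_COUNT)
                and s_index % T_COUNT == 0
                and t_index in range(1, T_COUNT)
            ):
                out[-1] = chr(last + t_index)
                continue
        out.append(ch)
    return "".join(out)


syllable = compose_hangul("\u1112\u1161\u11ab")
print("U+%04X" % ord(syllable))
```
</PRE>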
<P>
Your help (bug reports, patches, ideas, test cases) is welcome.
<P>
<P>
  <HR>
<ADDRESS>
  <A href="mailto:duerst@w3.org">Martin D&uuml;rst</A> <BR>
  <A href="/Help/Webmaster.html">Webmaster</A> <BR>
  last revised $Date: 1999/09/29 09:03:18 $ by $Author: duerst $
</ADDRESS>
<P class="policyfooter">
<SMALL><A href="/Consortium/Legal/ipr-notice.html#Copyright">Copyright</A>
&nbsp;&copy;&nbsp; 1997 <A href="http://www.w3.org">W3C</A>
(<A href="http://www.lcs.mit.edu">MIT</A>,
<A href="http://www.inria.fr/">INRIA</A>,
<A href="http://www.keio.ac.jp/">Keio</A> ), All Rights Reserved. W3C
<A href="/Consortium/Legal/ipr-notice.html#Legal Disclaimer">liability,</A>
<A href="/Consortium/Legal/ipr-notice.html#W3C Trademarks">trademark</A>,
<A href="/Consortium/Legal/copyright-documents.html">document use </A>and
<A href="/Consortium/Legal/copyright-software.html">software licensing
</A>rules apply. Your interactions with this site are in accordance with
our <A href="/Consortium/Legal/privacy-statement.html#Public">public</A>
and <A href="/Consortium/Legal/privacy-statement.html#Members">Member</A>
privacy statements.</SMALL>
</BODY></HTML>
