Annotation of libwww/Library/src/HTParse.html, revision 2.23

2.14      frystyk     1: <HTML>
                      2: <HEAD>
2.23    ! frystyk     3: <TITLE>URI Parsing</TITLE>
        !             4: <!-- Changed by: Henrik Frystyk Nielsen, 14-Aug-1995 -->
2.6       timbl       5: <NEXTID N="1">
2.14      frystyk     6: </HEAD>
2.2       timbl       7: <BODY>
2.14      frystyk     8: 
2.9       frystyk     9: <H1>HTParse</H1>
                     10: 
2.14      frystyk    11: <PRE>
                     12: /*
2.20      frystyk    13: **     (c) COPYRIGHT MIT 1995.
2.14      frystyk    14: **     Please first read the full copyright statement in the file COPYRIGH.
                     15: */
                     16: </PRE>
                     17: 
                     18: This module contains code to parse URIs and various related things such as:
2.9       frystyk    19: 
                     20: <UL>
2.17      frystyk    21: <LI><A HREF="#parse">Parse a URI for tokens</A>
                     22: <LI><A HREF="#canon">Canonicalization of URIs</A>
                     23: <LI><A HREF="#secure">Search a URI for illegal characters in order to prevent security holes</A>
2.9       frystyk    24: </UL>
                     25: 
2.14      frystyk    26: This module is implemented by <A HREF="HTParse.c">HTParse.c</A>, and it is
                     27: a part of the <A
2.22      frystyk    28: HREF="http://www.w3.org/hypertext/WWW/Library/">
                     29: W3C Reference Library</A>.
2.9       frystyk    30: 
                     31: <PRE>
                     32: #ifndef HTPARSE_H
2.2       timbl      33: #define HTPARSE_H
2.16      frystyk    34: 
2.13      frystyk    35: #include "HTEscape.h"
2.9       frystyk    36: </PRE>
2.2       timbl      37: 
2.17      frystyk    38: <A NAME="parse"><H2>Parsing URIs</H2></A>
2.9       frystyk    39: 
2.17      frystyk    40: These functions can be used to extract information from a URI.
2.9       frystyk    41: 
2.17      frystyk    42: <H3>Parse a URI relative to another URI</H3>
2.9       frystyk    43: 
2.17      frystyk    44: This returns those parts of a name which are given (and requested)
                     45: substituting bits from the related name where necessary. The
                     46: <CODE>aName</CODE> argument is the (possibly relative) URI to be
                     47: parsed, the <CODE>relatedName</CODE> is the URI which the
                     48: <CODE>aName</CODE> is to be parsed relative to. Passing an empty string as
                     49: <CODE>relatedName</CODE> means that <CODE>aName</CODE> is an absolute URI. The
                     50: following are flag bits which may be OR'ed together to form a number
2.9       frystyk    51: to give the 'wanted' argument to HTParse.
                     52: 
                     53: <PRE>
                     54: #define PARSE_ACCESS           16
2.1       timbl      55: #define PARSE_HOST              8
                     56: #define PARSE_PATH              4
                     57: #define PARSE_ANCHOR            2
                     58: #define PARSE_PUNCTUATION       1
                     59: #define PARSE_ALL              31
2.9       frystyk    60: </PRE>
2.1       timbl      61: 
2.17      frystyk    62: where the format of a URI is as follows:
                     63: 
                     64: <PRE>
                     65: /*
                     66:        ACCESS :// HOST / PATH # ANCHOR
                     67: */
                     68: </PRE>
                     69: 
                     70: <CODE>PUNCTUATION</CODE> means any delimiter like '/', ':', '#'
                     71: between the tokens above.
                     72: 
                     73: The string returned by the function must be freed by the caller.
2.2       timbl      74: 
                     75: <PRE>
2.13      frystyk    76: extern char * HTParse  PARAMS((        const char * aName,
2.9       frystyk    77:                                const char * relatedName,
                     78:                                int wanted));
                     79: </PRE>
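
As a minimal sketch of how the flag bits combine into a 'wanted' value, the fragment below repeats the definitions from this header and ORs three of them together. The helpers wanted_has() and wanted_scheme_and_host() are hypothetical names used only for illustration; they are not part of HTParse.h, and the fragment does not call HTParse itself.

```c
#include <assert.h>

/* Flag values copied from the definitions in this header. */
#define PARSE_ACCESS           16
#define PARSE_HOST              8
#define PARSE_PATH              4
#define PARSE_ANCHOR            2
#define PARSE_PUNCTUATION       1
#define PARSE_ALL              31

/* Hypothetical helper (not part of HTParse.h): returns 1 if every bit
   of 'flag' is set in the 'wanted' mask, 0 otherwise. */
static int wanted_has(int wanted, int flag)
{
    assert(flag > 0 && flag <= PARSE_ALL);
    return (wanted & flag) == flag;
}

/* Hypothetical helper: builds a mask that asks for the access scheme,
   the host, and the delimiters between them, but not path or anchor. */
static int wanted_scheme_and_host(void)
{
    return PARSE_ACCESS | PARSE_HOST | PARSE_PUNCTUATION;
}
```

A caller would pass such a mask as the third argument of HTParse and free the returned string afterwards, as described above.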
2.2       timbl      80: 
2.17      frystyk    81: <H3>Create a Relative (Partial) URI</H3>
                     82: 
                     83: This function creates and returns a string which gives an expression
                     84: of one address as related to another.  Where there is no relation, an
                     85: absolute address is returned.
                     86: 
                     87: <H3>On entry,</H3>Both names must be absolute, fully
                     88: qualified names of nodes (no anchor
                     89: bits)
                     90: <H3>On exit,</H3>The return result points to a newly
                     91: allocated name which, if parsed by
                     92: HTParse relative to relatedName,
                     93: will yield aName. The caller is responsible
                     94: for freeing the resulting name later.
                     95: 
                     96: <PRE>
                     97: extern char * HTRelative PARAMS((const char * aName, const char *relatedName));
                     98: </PRE>
                     99: 
                    100: <A NAME="canon"><H2>Canonicalization</H2></A>
                    101: 
                    102: Canonicalization of URIs is a difficult job, but doing it well saves many
                    103: redundant downloads and duplicate entries in the cache.
                    104: 
                    105: <H3>Canonicalize the Path Part of a URI</H3>
2.1       timbl     106: 
2.9       frystyk   107: A URI is allowed to contain the sequence "xxx/../", which may be
                    108: replaced by "", and the sequence "/./", which may be replaced by "/".
                    109: Simplification helps us recognize duplicate URIs. Thus, the following
                    110: transformations are done:
                    111: 
                    112: <UL>
                    113: <LI> /etc/junk/../fred         becomes /etc/fred
                    114: <LI> /etc/junk/./fred  becomes /etc/junk/fred
                    115: </UL>
                    116: 
                    117: but we should NOT change
                    118: <UL>
                    119: <LI> http://fred.xxx.edu/../.. or
                    120: <LI> ../../albert.html
                    121: </UL>
                    122: 
                    123: In the same manner, the following prefixes are preserved:
                    124: 
                    125: <UL>
                    126: <LI> ./<etc>
                    127: <LI> //<etc>
                    128: </UL>
                    129: 
                    130: In order to avoid empty URIs the following URIs become:
                    131: 
                    132: <UL>
                    133: <LI> /fred/..          becomes /fred/..
                    134: <LI> /fred/././..              becomes /fred/..
                    135: <LI> /fred/.././junk/.././     becomes /fred/..
                    136: </UL>
                    137: 
                    138: If more than one set of `://' is found (several proxies in cascade) then
                    139: only the part after the last `://' is simplified.
                    140: 
                    141: <PRE>
2.19      frystyk   142: extern char *HTSimplify PARAMS((char **filename));
2.2       timbl     143: </PRE>
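
The two basic rewrites listed above ("/./" to "/" and "xxx/../" to "") can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code in HTParse.c: sketch_simplify_path() is a hypothetical name, it takes a bare path rather than a full URI, and it deliberately ignores the `://' handling and the trailing ".." cases described above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch, not the library's HTSimplify: rewrites 'path'
   in place, removing "/./" and "/xxx/../" sequences. It assumes a
   bare path (no "://") and leaves leading or trailing ".." alone. */
static void sketch_simplify_path(char *path)
{
    char *p = path;
    assert(path != NULL);
    while (*p) {
        if (p[0] == '/' && p[1] == '.' && p[2] == '/') {
            /* "/./" -> "/": drop the two characters "/." */
            memmove(p, p + 2, strlen(p + 2) + 1);
        } else if (p[0] == '/' && p[1] == '.' && p[2] == '.' && p[3] == '/') {
            /* Walk back to the first character of the preceding segment. */
            char *q = p;
            while (q > path && q[-1] != '/')
                q--;
            /* Only collapse when that segment follows a '/' and is a
               real name, not empty and not ".." itself. */
            if (q > path && q < p &&
                !(p - q == 2 && q[0] == '.' && q[1] == '.')) {
                memmove(q, p + 4, strlen(p + 4) + 1);
                p = q - 1;      /* re-examine from the preceding '/' */
            } else {
                p++;
            }
        } else {
            p++;
        }
    }
}
```

Because the sketch only collapses a segment that is preceded by a slash, a relative name such as ../../albert.html passes through unchanged.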
2.1       timbl     144: 
2.17      frystyk   145: <H3>Canonicalize the DNS part of a URI</H3>
2.2       timbl     146: 
2.9       frystyk   147: This function expands the host name of the URI from a local name to a
2.11      frystyk   148: full domain name and converts the host name to lower case. The
                    149: advantage of doing this is that we only have one entry in the host
                    150: cache and one entry in the document cache.
2.6       timbl     151: 
2.9       frystyk   152: <PRE>
2.13      frystyk   153: extern char *HTCanon PARAMS (( char ** filename,
2.9       frystyk   154:                                char *  host));
2.6       timbl     155: </PRE>
2.9       frystyk   156: 
2.17      frystyk   157: <A NAME="secure"><H2>Prevent Security Holes</H2></A>
                    158: 
                    159: In many telnet-like protocols, it can be very dangerous to allow the
                    160: full ASCII character set in a URI. Therefore we have to strip out
                    161: the dangerous characters.
2.8       luotonen  162: 
                    163: <CODE>HTCleanTelnetString()</CODE> makes sure that the given string
                    164: doesn't contain characters that could cause security holes, such as
                    165: newlines in ftp, gopher, news or telnet URLs. More specifically, it
                    166: allows only character codes in the hexadecimal ranges 20-7E and A0-FE,
                    167: inclusive.
                    168: <DL>
                    169: <DT> <CODE>str</CODE>
                    170: <DD> the string that is *modified* if necessary.  The string will be
                    171:      truncated at the first illegal character that is encountered.
                    172: <DT>returns
                    173: <DD> YES, if the string was modified.
                    174:      NO, otherwise.
                    175: </DL>
2.9       frystyk   176: 
2.8       luotonen  177: <PRE>
2.13      frystyk   178: extern BOOL HTCleanTelnetString PARAMS((char * str));
2.8       luotonen  179: </PRE>
                    180: 
                    181: <PRE>
2.6       timbl     182: #endif /* HTPARSE_H */
2.9       frystyk   183: </PRE>
2.2       timbl     184: 
2.9       frystyk   185: End of HTParse Module
                    186: </BODY>
                    187: </HTML>
2.2       timbl     188: 
