Annotation of libwww/Library/src/HTParse.html, revision 2.28

2.14      frystyk     1: <HTML>
                      2: <HEAD>
2.23      frystyk     3: <TITLE>URI Parsing</TITLE>
2.26      frystyk     4: <!-- Changed by: Henrik Frystyk Nielsen, 11-Nov-1995 -->
2.6       timbl       5: <NEXTID N="1">
2.14      frystyk     6: </HEAD>
2.2       timbl       7: <BODY>
2.14      frystyk     8: 
2.26      frystyk     9: <H1>URI Parsing</H1>
2.9       frystyk    10: 
2.14      frystyk    11: <PRE>
                     12: /*
2.20      frystyk    13: **     (c) COPYRIGHT MIT 1995.
2.14      frystyk    14: **     Please first read the full copyright statement in the file COPYRIGH.
                     15: */
                     16: </PRE>
                     17: 
                     18: This module contains code to parse URIs and various related things such as:
2.9       frystyk    19: 
                     20: <UL>
2.17      frystyk    21: <LI><A HREF="#parse">Parse a URI for tokens</A>
                     22: <LI><A HREF="#canon">Canonicalization of URIs</A>
                     23: <LI><A HREF="#secure">Search a URI for illegal characters in order to prevent security holes</A>
2.9       frystyk    24: </UL>
                     25: 
2.26      frystyk    26: This module is implemented by <A HREF="HTParse.c">HTParse.c</A>, and
                     27: it is a part of the <A HREF="http://www.w3.org/pub/WWW/Library/"> W3C
                     28: Reference Library</A>.
2.9       frystyk    29: 
                     30: <PRE>
                     31: #ifndef HTPARSE_H
2.2       timbl      32: #define HTPARSE_H
2.16      frystyk    33: 
2.13      frystyk    34: #include "HTEscape.h"
2.9       frystyk    35: </PRE>
2.2       timbl      36: 
2.17      frystyk    37: <A NAME="parse"><H2>Parsing URIs</H2></A>
2.9       frystyk    38: 
2.17      frystyk    39: These functions can be used to extract information from a URI.
2.9       frystyk    40: 
2.17      frystyk    41: <H3>Parse a URI relative to another URI</H3>
2.9       frystyk    42: 
2.17      frystyk    43: This returns those parts of a name which are given (and requested),
                     44: substituting bits from the related name where necessary. The
                     45: <CODE>aName</CODE> argument is the (possibly relative) URI to be
                     46: parsed, and <CODE>relatedName</CODE> is the URI which
                     47: <CODE>aName</CODE> is to be parsed relative to. Passing an empty
                     48: string as <CODE>relatedName</CODE> means that <CODE>aName</CODE> is
                     49: an absolute URI. The following flag bits may be OR'ed together to form
2.9       frystyk    50: the <CODE>wanted</CODE> argument to <CODE>HTParse</CODE>.
                     51: 
                     52: <PRE>
                     53: #define PARSE_ACCESS           16
2.1       timbl      54: #define PARSE_HOST              8
                     55: #define PARSE_PATH              4
                     56: #define PARSE_ANCHOR            2
                     57: #define PARSE_PUNCTUATION       1
                     58: #define PARSE_ALL              31
2.9       frystyk    59: </PRE>
2.1       timbl      60: 
2.17      frystyk    61: where the format of a URI is as follows:
                     62: 
                     63: <PRE>
                     64: /*
                     65:        ACCESS :// HOST / PATH # ANCHOR
                     66: */
                     67: </PRE>
                     68: 
                     69: <CODE>PUNCTUATION</CODE> means any delimiter like '/', ':', '#'
                     70: between the tokens above.
                     71: 
                     72: The string returned by the function must be freed by the caller.
2.2       timbl      73: 
                     74: <PRE>
2.28    ! frystyk    75: extern char * HTParse  (const char * aName, const char * relatedName,
2.26      frystyk    76:                        int wanted);
2.9       frystyk    77: </PRE>
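
As an illustration of how the flag bits combine, here is a minimal sketch. The flag values are copied from the definitions above; the helper <CODE>describe_wanted</CODE> is hypothetical and not part of this module. It only names the parts that a given mask selects.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Flag values copied from the definitions in this module. */
#define PARSE_ACCESS           16
#define PARSE_HOST              8
#define PARSE_PATH              4
#define PARSE_ANCHOR            2
#define PARSE_PUNCTUATION       1
#define PARSE_ALL              31

/* Hypothetical helper (an assumption, not a libwww function): write the
   names of the parts selected by 'wanted' into buf, separated by spaces. */
static const char *describe_wanted(int wanted, char *buf, size_t size)
{
    static const struct { int bit; const char *name; } parts[] = {
        { PARSE_ACCESS, "access" },
        { PARSE_HOST, "host" },
        { PARSE_PATH, "path" },
        { PARSE_ANCHOR, "anchor" },
        { PARSE_PUNCTUATION, "punctuation" }
    };
    size_t i;

    assert(buf != NULL && size > 0);
    buf[0] = '\0';
    for (i = 0; i < sizeof parts / sizeof parts[0]; i++) {
        if (wanted & parts[i].bit) {
            /* strncat appends at most the remaining space, then a NUL */
            if (buf[0] != '\0')
                strncat(buf, " ", size - strlen(buf) - 1);
            strncat(buf, parts[i].name, size - strlen(buf) - 1);
        }
    }
    return buf;
}
```

With the library linked in, a mask such as <CODE>PARSE_ACCESS | PARSE_HOST | PARSE_PUNCTUATION</CODE> passed as <CODE>wanted</CODE> asks for the access scheme and the host together with their delimiters.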
2.2       timbl      78: 
2.17      frystyk    79: <H3>Create a Relative (Partial) URI</H3>
                     80: 
                     81: This function creates and returns a string which gives an expression
                     82: of one address as related to another.  Where there is no relation, an
                     83: absolute address is returned.
                     84: 
                     85: <H3>On entry,</H3>Both names must be absolute, fully
                     86: qualified names of nodes (no anchor
                     87: bits)
                     88: <H3>On exit,</H3>The return result points to a newly
                     89: allocated name which, if parsed by
                     90: HTParse relative to relatedName,
                     91: will yield aName. The caller is responsible
                     92: for freeing the resulting name later.
                     93: 
                     94: <PRE>
2.28    ! frystyk    95: extern char * HTRelative (const char * aName, const char *relatedName);
2.17      frystyk    96: </PRE>
                     97: 
                     98: <A NAME="canon"><H2>Canonicalization</H2></A>
                     99: 
                    100: Canonicalization of URIs is a difficult job, but it saves a lot of
2.24      frystyk   101: downloads and double entries in the cache if we do a good job. A URI
                    102: is allowed to contain the sequence "xxx/../", which may be replaced by "",
                    103: and the sequence "/./", which may be replaced by "/".  Simplification
                    104: helps us recognize duplicate URIs. Thus, the following transformations
                    105: are done:
2.9       frystyk   106: 
                    107: <UL>
                    108: <LI> /etc/junk/../fred         becomes /etc/fred
                    109: <LI> /etc/junk/./fred  becomes /etc/junk/fred
                    110: </UL>
                    111: 
                    112: but we should NOT change
                    113: <UL>
                    114: <LI> http://fred.xxx.edu/../.. or
                    115: <LI> ../../albert.html
                    116: </UL>
                    117: 
                    118: In the same manner, the following prefixes are preserved:
                    119: 
                    120: <UL>
                    121: <LI> ./&lt;etc&gt;
                    122: <LI> //&lt;etc&gt;
                    123: </UL>
                    124: 
                    125: In order to avoid empty URIs the following URIs become:
                    126: 
                    127: <UL>
2.26      frystyk   128: <LI> /fred/..                  becomes /fred/..
2.9       frystyk   129: <LI> /fred/././..              becomes /fred/..
                    130: <LI> /fred/.././junk/.././     becomes /fred/..
                    131: </UL>
                    132: 
                    133: If more than one set of `://' is found (several proxies in cascade) then
                    134: only the part after the last `://' is simplified.
                    135: 
                    136: <PRE>
2.26      frystyk   137: extern char *HTSimplify (char **filename);
2.6       timbl     138: </PRE>
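
The two basic rewrites can be sketched as follows. This is not the code in HTParse.c: <CODE>simplify_path</CODE> is a hypothetical helper that assumes it is given only the path part of a URI, and it deliberately leaves out the host protection, the preserved prefixes, the empty-URI cases and the `://' handling described above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch (not the libwww implementation): simplify a path
   in place by collapsing "/./" and "xxx/../" sequences. */
static void simplify_path(char *path)
{
    char *p;

    assert(path != NULL);

    /* Collapse every "/./" into "/". */
    while ((p = strstr(path, "/./")) != NULL)
        memmove(p, p + 2, strlen(p + 2) + 1);

    /* Collapse every "xxx/../" into "", unless xxx is empty or "..". */
    p = path;
    while ((p = strstr(p, "/../")) != NULL) {
        char *seg = p;                  /* p is the '/' in front of ".." */
        while (seg > path && seg[-1] != '/')
            seg--;                      /* seg is the start of "xxx" */
        if (seg == p || (p - seg == 2 && seg[0] == '.' && seg[1] == '.')) {
            p += 3;                     /* nothing to remove, keep scanning */
            continue;
        }
        memmove(seg, p + 4, strlen(p + 4) + 1);
        p = (seg > path) ? seg - 1 : path;  /* rescan from the previous '/' */
    }
}
```

Because the sketch knows nothing about hosts, it must not be run on a full URI such as http://fred.xxx.edu/../.., which the real function leaves unchanged.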
2.9       frystyk   139: 
2.17      frystyk   140: <A NAME="secure"><H2>Prevent Security Holes</H2></A>
                    141: 
                    142: In many telnet-like protocols, it can be very dangerous to allow the
                    143: full ASCII character set in a URI. Therefore we have to strip
                    144: the dangerous characters out.
2.8       luotonen  145: 
                    146: <CODE>HTCleanTelnetString()</CODE> makes sure that the given string
                    147: doesn't contain characters that could cause security holes, such as
                    148: newlines in ftp, gopher, news or telnet URLs. More specifically, it
                    149: allows every character in the hexadecimal ranges 20-7E and A0-FE,
                    150: inclusive.
                    151: <DL>
                    152: <DT> <CODE>str</CODE>
                    153: <DD> the string that is *modified* if necessary.  The string will be
                    154:      truncated at the first illegal character that is encountered.
                    155: <DT>returns
                    156: <DD> YES, if the string was modified.
                    157:      NO, otherwise.
                    158: </DL>
2.9       frystyk   159: 
2.8       luotonen  160: <PRE>
2.26      frystyk   161: extern BOOL HTCleanTelnetString (char * str);
2.8       luotonen  162: </PRE>
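
The check described above can be sketched as follows. This is a stand-in under stated assumptions, not the code in HTParse.c: the name <CODE>clean_telnet_string_sketch</CODE> is hypothetical, and plain <CODE>int</CODE> values 1 and 0 stand in for YES and NO.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch (not the libwww implementation): truncate str at
   the first character outside the ranges 0x20-0x7E and 0xA0-0xFE.
   Returns 1 if the string was modified and 0 otherwise. */
static int clean_telnet_string_sketch(char *str)
{
    unsigned char *p;

    assert(str != NULL);
    for (p = (unsigned char *) str; *p != '\0'; p++) {
        int allowed = (*p >= 0x20 && *p <= 0x7E) ||
                      (*p >= 0xA0 && *p <= 0xFE);
        if (!allowed) {
            *p = '\0';          /* cut the string at the illegal character */
            return 1;           /* modified */
        }
    }
    return 0;                   /* left untouched */
}
```

The cast to <CODE>unsigned char</CODE> matters here: with a signed <CODE>char</CODE>, bytes in the A0-FE range would compare as negative and be rejected.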
                    163: 
                    164: <PRE>
2.6       timbl     165: #endif /* HTPARSE_H */
2.9       frystyk   166: </PRE>
2.2       timbl     167: 
2.9       frystyk   168: End of HTParse Module
                    169: </BODY>
                    170: </HTML>
2.2       timbl     171: 
