Annotation of libwww/Library/src/HTParse.html, revision 2.33

2.14      frystyk     1: <HTML>
                      2: <HEAD>
2.31      frystyk     3:   <TITLE>W3C Reference Library libwww URIs</TITLE>
2.33    ! frystyk     4: <!-- Changed by: Henrik Frystyk Nielsen, 23-Jun-1996 -->
2.14      frystyk     5: </HEAD>
2.2       timbl       6: <BODY>
2.31      frystyk     7: <H1>
                      8:   URI Parsing
                      9: </H1>
2.14      frystyk    10: <PRE>
                     11: /*
2.20      frystyk    12: **     (c) COPYRIGHT MIT 1995.
2.14      frystyk    13: **     Please first read the full copyright statement in the file COPYRIGH.
                     14: */
                     15: </PRE>
2.31      frystyk    16: <P>
2.14      frystyk    17: This module contains code to parse URIs and various related things such as:
2.9       frystyk    18: <UL>
2.31      frystyk    19:   <LI>
                     20:     <A HREF="#parse">Parse a URI for tokens</A>
                     21:   <LI>
                     22:     <A HREF="#canon">Canonicalization of URIs</A>
                     23:   <LI>
                     24:     <A HREF="#secure">Search a URI for illegal characters in order to prevent
                     25:     security holes</A>
2.9       frystyk    26: </UL>
2.31      frystyk    27: <P>
                     28: This module is implemented by <A HREF="HTParse.c">HTParse.c</A>, and it is
                     29: a part of the <A HREF="http://www.w3.org/pub/WWW/Library/"> W3C Reference
                     30: Library</A>.
2.9       frystyk    31: <PRE>
                     32: #ifndef HTPARSE_H
2.2       timbl      33: #define HTPARSE_H
2.16      frystyk    34: 
2.13      frystyk    35: #include "HTEscape.h"
2.9       frystyk    36: </PRE>
2.31      frystyk    37: <H2>
                     38:   Parsing URIs
                     39: </H2>
                     40: <P>
2.17      frystyk    41: These functions can be used to extract information from a URI.
2.31      frystyk    42: <H3>
                     43:   Parse a URI relative to another URI
                     44: </H3>
                     45: <P>
                     46: This returns those parts of a name which are given (and requested) substituting
                     47: bits from the related name where necessary. The <CODE>aName</CODE> argument
                     48: is the (possibly relative) URI to be parsed, the <CODE>relatedName</CODE>
                     49: is the URI which the <CODE>aName</CODE> is to be parsed relative to. Passing
                     50: an empty string means that the <CODE>aName</CODE> is an absolute URI. The
                     51: following are flag bits which may be OR'ed together to form a number to give
2.32      frystyk    52: the 'wanted' argument to HTParse. As an example we have the URL:
                     53: "<CODE>http://www.w3.org/pub/WWW/TheProject.html#news</CODE>"
2.9       frystyk    54: <PRE>
2.32      frystyk    55: #define PARSE_ACCESS           16              /* Access scheme, e.g. "http" */
                     56: #define PARSE_HOST              8              /* Host name, e.g. "www.w3.org" */
                     57: #define PARSE_PATH              4              /* URL Path, e.g. "pub/WWW/TheProject.html" */
                     58: #define PARSE_ANCHOR            2              /* Fragment identifier, e.g. "news" */
                     59: #define PARSE_PUNCTUATION       1              /* Include delimiters, e.g. "/" and ":" */
2.1       timbl      60: #define PARSE_ALL              31
2.9       frystyk    61: </PRE>
2.31      frystyk    62: <P>
2.32      frystyk    63: where the format of a URI is as follows: "<CODE>ACCESS :// HOST / PATH #
                     64: ANCHOR</CODE>"
2.31      frystyk    65: <P>
                     66: <CODE>PUNCTUATION</CODE> means any delimiter like '/', ':', '#' between the
                     67: tokens above. The string returned by the function must be freed by the caller.
2.2       timbl      68: <PRE>
2.28      frystyk    69: extern char * HTParse  (const char * aName, const char * relatedName,
2.26      frystyk    70:                        int wanted);
2.9       frystyk    71: </PRE>
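<P>
As a minimal sketch of how the flag bits above combine into a 'wanted' value,
the following standalone C fragment repeats the flag definitions and adds two
helpers. The helper names <CODE>wanted_scheme_and_host</CODE> and
<CODE>wanted_includes</CODE> are hypothetical and are not part of the library;
the fragment does not call <CODE>HTParse</CODE> itself.

```c
/* Flag values copied from the definitions above. */
#define PARSE_ACCESS       16
#define PARSE_HOST          8
#define PARSE_PATH          4
#define PARSE_ANCHOR        2
#define PARSE_PUNCTUATION   1
#define PARSE_ALL          31

/* Hypothetical helper (not in libwww): build a 'wanted' mask that asks
** for the access scheme and the host, including the delimiters. */
int wanted_scheme_and_host(void)
{
    return PARSE_ACCESS | PARSE_HOST | PARSE_PUNCTUATION;
}

/* Hypothetical helper (not in libwww): nonzero if 'flag' is set in 'wanted'. */
int wanted_includes(int wanted, int flag)
{
    return (wanted & flag) != 0;
}
```

<P>
A mask built this way would be passed as the third argument of
<CODE>HTParse</CODE>; as stated above, the returned string must then be freed
by the caller.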
2.31      frystyk    72: <H3>
                     73:   Create a Relative (Partial) URI
                     74: </H3>
                     75: <P>
                     76: This function creates and returns a string which gives an expression of one
                     77: address as related to another. Where there is no relation, an absolute address
                     78: is returned.
                     79: <H3>
                     80:   On entry,
                     81: </H3>
                     82: <P>
                     83: Both names must be absolute, fully qualified names of nodes (no anchor bits).
                     84: <H3>
                     85:   On exit,
                     86: </H3>
                     87: <P>
                     88: The return result points to a newly allocated name which, if parsed by HTParse
                     89: relative to relatedName, will yield aName. The caller is responsible for
                     90: freeing the resulting name later.
2.17      frystyk    91: <PRE>
2.28      frystyk    92: extern char * HTRelative (const char * aName, const char *relatedName);
2.17      frystyk    93: </PRE>
2.31      frystyk    94: <H2>
                     95:   Canonicalization
                     96: </H2>
                     97: <P>
                     98: Canonicalization of URIs is a difficult job, but doing it well saves many
                     99: downloads and duplicate entries in the cache. A URI is allowed to
                    100: contain the sequence "xxx/../", which may be replaced by "", and the sequence
                    101: "/./", which may be replaced by "/". Simplification helps us recognize duplicate
                    102: URIs. Thus, the following transformations are done:
2.9       frystyk   103: <UL>
2.31      frystyk   104:   <LI>
                    105:     /etc/junk/../fred becomes /etc/fred
                    106:   <LI>
                    107:     /etc/junk/./fred becomes /etc/junk/fred
2.9       frystyk   108: </UL>
2.31      frystyk   109: <P>
2.9       frystyk   110: but we should NOT change
                    111: <UL>
2.31      frystyk   112:   <LI>
                    113:     http://fred.xxx.edu/../.. or
                    114:   <LI>
                    115:     ../../albert.html
2.9       frystyk   116: </UL>
2.31      frystyk   117: <P>
2.9       frystyk   118: In the same manner, the following prefixes are preserved:
                    119: <UL>
2.31      frystyk   120:   <LI>
2.32      frystyk   121:     ./&lt;etc&gt;
2.31      frystyk   122:   <LI>
2.32      frystyk   123:     //&lt;etc&gt;
2.9       frystyk   124: </UL>
2.31      frystyk   125: <P>
2.9       frystyk   126: In order to avoid empty URIs, the following URIs are simplified as shown:
                    127: <UL>
2.31      frystyk   128:   <LI>
                    129:     /fred/.. becomes /fred/..
                    130:   <LI>
                    131:     /fred/././.. becomes /fred/..
                    132:   <LI>
                    133:     /fred/.././junk/.././ becomes /fred/..
2.9       frystyk   134: </UL>
2.31      frystyk   135: <P>
2.9       frystyk   136: If more than one set of `://' is found (several proxies in cascade) then
                    137: only the part after the last `://' is simplified.
                    138: <PRE>
2.26      frystyk   139: extern char *HTSimplify (char **filename);
2.6       timbl     140: </PRE>
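<P>
The two basic transformations described above can be sketched as follows. This
is an illustration only, not the code of <CODE>HTSimplify</CODE>: the function
name <CODE>sketch_simplify</CODE> is hypothetical, it expects just the path
part of a URI (beginning with "/"), and it does not reproduce the safeguards
against empty URIs or the handling of `://' described above.

```c
#include <string.h>

/* Hypothetical sketch (not HTSimplify): collapse "/./" to "/" and remove
** "segment/../" pairs in place. 'path' must be a writable string holding
** only the path part of a URI. */
void sketch_simplify(char *path)
{
    char *p = path;
    while (*p) {
        if (strncmp(p, "/./", 3) == 0) {
            /* Drop "/." and re-examine the same position. */
            memmove(p, p + 2, strlen(p + 2) + 1);
        } else if (strncmp(p, "/../", 4) == 0) {
            /* Find the start of the segment that precedes "/../". */
            char *q = p;
            while (q > path && q[-1] != '/') q--;
            if (q == p || (p - q == 2 && q[0] == '.' && q[1] == '.')) {
                /* No real segment to remove (leading or repeated ".."). */
                p++;
            } else {
                /* Remove "segment/../" and back up to the preceding '/'. */
                memmove(q, p + 4, strlen(p + 4) + 1);
                p = (q > path) ? q - 1 : path;
            }
        } else {
            p++;
        }
    }
}
```

<P>
The sketch works in place on a caller-owned buffer, which keeps it free of
memory allocation; the real function takes a <CODE>char **</CODE> instead.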
2.31      frystyk   141: <H2>
                    142:   Prevent Security Holes
                    143: </H2>
                    144: <P>
                    145: In many telnet-like protocols, it can be very dangerous to allow the full ASCII
                    146: character set in a URI, so unsafe characters have to be stripped out.
                    147: <CODE>HTCleanTelnetString()</CODE> makes sure that the given string doesn't
                    148: contain characters that could cause security holes, such as newlines in ftp,
                    149: gopher, news or telnet URLs. More specifically, it allows every character in the
                    150: hexadecimal ranges 20-7E and A0-FE, inclusive.
2.8       luotonen  151: <DL>
2.31      frystyk   152:   <DT>
                    153:     <CODE>str</CODE>
                    154:   <DD>
                    155:     the string that is *modified* if necessary. The string will be truncated
                    156:     at the first illegal character that is encountered.
                    157:   <DT>
                    158:     returns
                    159:   <DD>
                    160:     YES, if the string was modified. NO, otherwise.
2.8       luotonen  161: </DL>
                    162: <PRE>
2.26      frystyk   163: extern BOOL HTCleanTelnetString (char * str);
2.8       luotonen  164: </PRE>
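<P>
The rule described above (allow hexadecimal 20-7E and A0-FE, truncate at the
first other character) can be sketched as a standalone function. The name
<CODE>sketch_clean_telnet_string</CODE> is hypothetical, and the sketch returns
a plain <CODE>int</CODE> (1 or 0) rather than the library's <CODE>BOOL</CODE>;
it is an assumption about the behavior as documented here, not the library
source.

```c
/* Hypothetical stand-in for HTCleanTelnetString, written from the
** description above. Returns 1 if the string was truncated, 0 otherwise. */
int sketch_clean_telnet_string(char *str)
{
    unsigned char *p;
    if (!str) return 0;
    for (p = (unsigned char *) str; *p; p++) {
        int allowed = (*p >= 0x20 && *p <= 0x7E) ||
                      (*p >= 0xA0 && *p <= 0xFE);
        if (!allowed) {
            *p = '\0';          /* Truncate at the first illegal character. */
            return 1;
        }
    }
    return 0;
}
```

<P>
Casting to <CODE>unsigned char</CODE> keeps the range comparisons correct on
platforms where plain <CODE>char</CODE> is signed.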
                    165: <PRE>
2.6       timbl     166: #endif /* HTPARSE_H */
2.9       frystyk   167: </PRE>
2.31      frystyk   168: <P>
                    169:   <HR>
2.30      frystyk   170: <ADDRESS>
2.33    ! frystyk   171:   @(#) $Id: HTParse.html,v 2.32 1996/06/01 17:46:53 frystyk Exp $
2.30      frystyk   172: </ADDRESS>
2.31      frystyk   173: </BODY></HTML>
