Annotation of libwww/Library/src/HTParse.html, revision 2.33
2.14 frystyk 1: <HTML>
2: <HEAD>
2.31 frystyk 3: <TITLE>W3C Reference Library libwww URIs</TITLE>
2.33 ! frystyk 4: <!-- Changed by: Henrik Frystyk Nielsen, 23-Jun-1996 -->
2.14 frystyk 5: </HEAD>
2.2 timbl 6: <BODY>
2.31 frystyk 7: <H1>
8: URI Parsing
9: </H1>
2.14 frystyk 10: <PRE>
11: /*
2.20 frystyk 12: ** (c) COPYRIGHT MIT 1995.
2.14 frystyk 13: ** Please first read the full copyright statement in the file COPYRIGH.
14: */
15: </PRE>
2.31 frystyk 16: <P>
2.14 frystyk 17: This module contains code to parse URIs and various related things such as:
2.9 frystyk 18: <UL>
2.31 frystyk 19: <LI>
20: <A HREF="#parse">Parse a URI for tokens</A>
21: <LI>
22: <A HREF="#canon">Canonicalization of URIs</A>
23: <LI>
24: <A HREF="#secure">Search a URI for illegal characters in order to prevent
25: security holes</A>
2.9 frystyk 26: </UL>
2.31 frystyk 27: <P>
28: This module is implemented by <A HREF="HTParse.c">HTParse.c</A>, and it is
29: a part of the <A HREF="http://www.w3.org/pub/WWW/Library/"> W3C Reference
30: Library</A>.
2.9 frystyk 31: <PRE>
32: #ifndef HTPARSE_H
2.2 timbl 33: #define HTPARSE_H
2.16 frystyk 34:
2.13 frystyk 35: #include "HTEscape.h"
2.9 frystyk 36: </PRE>
2.31 frystyk 37: <H2>
38: Parsing URIs
39: </H2>
40: <P>
2.17 frystyk 41: These functions can be used to extract information from a URI.
2.31 frystyk 42: <H3>
43: Parse a URI relative to another URI
44: </H3>
45: <P>
46: This returns those parts of a name which are given (and requested) substituting
47: bits from the related name where necessary. The <CODE>aName</CODE> argument
48: is the (possibly relative) URI to be parsed, the <CODE>relatedName</CODE>
49: is the URI that <CODE>aName</CODE> is parsed relative to. Passing an empty
50: <CODE>relatedName</CODE> means that <CODE>aName</CODE> is an absolute URI. The
51: following are flag bits which may be OR'ed together to form a number to give
2.32 frystyk 52: the 'wanted' argument to HTParse. As an example we have the URL:
53: "<CODE>http://www.w3.org/pub/WWW/TheProject.html#news</CODE>"
2.9 frystyk 54: <PRE>
2.32 frystyk 55: #define PARSE_ACCESS 16 /* Access scheme, e.g. "HTTP" */
56: #define PARSE_HOST 8 /* Host name, e.g. "www.w3.org" */
57: #define PARSE_PATH 4 /* URL Path, e.g. "pub/WWW/TheProject.html" */
58: #define PARSE_ANCHOR 2 /* Fragment identifier, e.g. "news" */
59: #define PARSE_PUNCTUATION 1 /* Include delimiters, e.g. "/" and ":" */
2.1 timbl 60: #define PARSE_ALL 31
2.9 frystyk 61: </PRE>
2.31 frystyk 62: <P>
2.32 frystyk 63: where the format of a URI is as follows: "<CODE>ACCESS :// HOST / PATH #
64: ANCHOR</CODE>"
2.31 frystyk 65: <P>
66: <CODE>PUNCTUATION</CODE> means any delimiter like '/', ':', '#' between the
67: tokens above. The string returned by the function must be freed by the caller.
2.2 timbl 68: <PRE>
2.28 frystyk 69: extern char * HTParse (const char * aName, const char * relatedName,
2.26 frystyk 70: int wanted);
2.9 frystyk 71: </PRE>
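<P>
The flag bits above are combined with bitwise OR to select the components
that should be returned. The following is a minimal, self-contained sketch of
that combination only; <CODE>describe_wanted</CODE> is a hypothetical helper
written for this illustration and is not part of libwww.

```c
#include <assert.h>
#include <string.h>

/* Flag values copied from the definitions above. */
#define PARSE_ACCESS       16
#define PARSE_HOST          8
#define PARSE_PATH          4
#define PARSE_ANCHOR        2
#define PARSE_PUNCTUATION   1
#define PARSE_ALL          31

/* Hypothetical helper (not part of libwww): writes the names of the
** components selected by 'wanted' into buf, separated by '|'. */
static const char *describe_wanted(int wanted, char *buf, size_t size)
{
    static const struct { int bit; const char *name; } parts[] = {
        { PARSE_ACCESS,      "ACCESS" },
        { PARSE_HOST,        "HOST" },
        { PARSE_PATH,        "PATH" },
        { PARSE_ANCHOR,      "ANCHOR" },
        { PARSE_PUNCTUATION, "PUNCTUATION" }
    };
    size_t i;

    assert(buf != NULL && size > 0);
    buf[0] = '\0';
    for (i = 0; i < sizeof parts / sizeof parts[0]; i++) {
        if (wanted & parts[i].bit) {
            /* Append a separator before every name except the first. */
            if (buf[0] != '\0')
                strncat(buf, "|", size - strlen(buf) - 1);
            strncat(buf, parts[i].name, size - strlen(buf) - 1);
        }
    }
    return buf;
}
```

<P>
With libwww linked in, a call such as
<CODE>HTParse(url, "", PARSE_HOST | PARSE_PATH | PARSE_PUNCTUATION)</CODE>
would request the host and path together with their delimiters; as stated
above, the returned string must be freed by the caller.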
2.31 frystyk 72: <H3>
73: Create a Relative (Partial) URI
74: </H3>
75: <P>
76: This function creates and returns a string which gives an expression of one
77: address as related to another. Where there is no relation, an absolute address
78: is returned.
79: <H3>
80: On entry,
81: </H3>
82: <P>
83: Both names must be absolute, fully qualified names of nodes (no anchor bits).
84: <H3>
85: On exit,
86: </H3>
87: <P>
88: The return result points to a newly allocated name which, if parsed by HTParse
89: relative to relatedName, will yield aName. The caller is responsible for
90: freeing the resulting name later.
2.17 frystyk 91: <PRE>
2.28 frystyk 92: extern char * HTRelative (const char * aName, const char *relatedName);
2.17 frystyk 93: </PRE>
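<P>
To make the entry and exit conditions concrete, here is a simplified sketch of
one possible way to derive a relative name from two absolute ones. It is not
the code in HTParse.c; <CODE>relative_sketch</CODE> and
<CODE>dup_string</CODE> are hypothetical names, and the sketch assumes both
arguments look like <CODE>scheme://host/path</CODE> without anchors.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'ed copy of s, or NULL. */
static char *dup_string(const char *s)
{
    char *copy = malloc(strlen(s) + 1);
    if (copy != NULL)
        strcpy(copy, s);
    return copy;
}

/* Hypothetical sketch, not libwww's HTRelative. The caller frees the result. */
static char *relative_sketch(const char *aName, const char *relatedName)
{
    const char *p = aName, *q = relatedName;
    const char *last_slash = NULL;   /* last '/' shared by both names */
    const char *path = NULL;         /* third shared '/', start of the path */
    int slashes = 0, levels = 0, i;
    char *result;

    assert(aName != NULL && relatedName != NULL);

    /* Walk the common prefix, remembering the slashes seen in it. */
    for (; *p != '\0' && *p == *q; p++, q++) {
        if (*p == '/') {
            last_slash = p;
            if (++slashes == 3)
                path = p;
        }
    }
    if (slashes < 3)                 /* scheme or host differs: no relation */
        return dup_string(aName);
    if (slashes == 3)                /* same host only: use the rooted path */
        return dup_string(path);

    /* Each remaining '/' in relatedName is one directory level to climb. */
    for (; *q != '\0' && *q != '#' && *q != '?'; q++)
        if (*q == '/')
            levels++;

    result = malloc(3 * (size_t) levels + strlen(last_slash + 1) + 1);
    if (result == NULL)
        return NULL;
    result[0] = '\0';
    for (i = 0; i < levels; i++)
        strcat(result, "../");
    strcat(result, last_slash + 1);
    return result;
}
```

<P>
Under these assumptions, the sketch expresses
<CODE>http://www.w3.org/pub/WWW/TheProject.html</CODE> relative to
<CODE>http://www.w3.org/pub/WWW/Library/Overview.html</CODE> as
<CODE>../TheProject.html</CODE>, and falls back to a copy of the absolute
name when the hosts differ.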
2.31 frystyk 94: <H2>
95: Canonicalization
96: </H2>
97: <P>
98: Canonicalization of URIs is a difficult job, but it saves a lot of downloads
99: and double entries in the cache if we do a good job. A URI is allowed to
100: contain the sequence xxx/../ which may be replaced by "", and the sequence
101: "/./" which may be replaced by "/". Simplification helps us recognize duplicate
102: URIs. Thus, the following transformations are done:
2.9 frystyk 103: <UL>
2.31 frystyk 104: <LI>
105: /etc/junk/../fred becomes /etc/fred
106: <LI>
107: /etc/junk/./fred becomes /etc/junk/fred
2.9 frystyk 108: </UL>
2.31 frystyk 109: <P>
2.9 frystyk 110: but we should NOT change
111: <UL>
2.31 frystyk 112: <LI>
113: http://fred.xxx.edu/../.. or
114: <LI>
115: ../../albert.html
2.9 frystyk 116: </UL>
2.31 frystyk 117: <P>
2.9 frystyk 118: In the same manner, the following prefixes are preserved:
119: <UL>
2.31 frystyk 120: <LI>
2.32 frystyk 121: ./<etc>
2.31 frystyk 122: <LI>
2.32 frystyk 123: //<etc>
2.9 frystyk 124: </UL>
2.31 frystyk 125: <P>
2.9 frystyk 126: In order to avoid empty URIs the following URIs become:
127: <UL>
2.31 frystyk 128: <LI>
129: /fred/.. becomes /fred/..
130: <LI>
131: /fred/././.. becomes /fred/..
132: <LI>
133: /fred/.././junk/.././ becomes /fred/..
2.9 frystyk 134: </UL>
2.31 frystyk 135: <P>
2.9 frystyk 136: If more than one set of `://' is found (several proxies in cascade) then
137: only the part after the last `://' is simplified.
138: <PRE>
2.26 frystyk 139: extern char *HTSimplify (char **filename);
2.6 timbl 140: </PRE>
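<P>
The two basic transformations described above can be sketched as follows.
This is an illustration only, not the code behind <CODE>HTSimplify</CODE>:
<CODE>simplify_path_sketch</CODE> is a hypothetical name, it assumes its
argument is just the path part of a URI (no scheme or host), and it does not
reproduce the `://' handling or the safeguards against empty URIs listed
above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch, not libwww's HTSimplify: simplifies a path in place
** and returns it. Leading "./" and "../" segments are left untouched. */
static char *simplify_path_sketch(char *path)
{
    char *p;

    assert(path != NULL);

    /* Pass 1: collapse every "/./" into "/". */
    while ((p = strstr(path, "/./")) != NULL)
        memmove(p, p + 2, strlen(p + 2) + 1);

    /* Pass 2: remove "segment/../" unless the segment is empty, "." or "..". */
    p = path;
    while ((p = strstr(p, "/../")) != NULL) {
        char *seg = p;               /* first character of the segment before p */
        size_t len;

        while (seg > path && seg[-1] != '/')
            seg--;
        len = (size_t) (p - seg);

        if (len == 0 ||
            (len == 1 && seg[0] == '.') ||
            (len == 2 && seg[0] == '.' && seg[1] == '.')) {
            p += 3;                  /* nothing to cancel; resume at the next '/' */
            continue;
        }
        /* Drop "segment/../" and rescan from the slash before it, since the
        ** removal may have exposed another "segment/../" pair. */
        memmove(seg, p + 4, strlen(p + 4) + 1);
        p = (seg > path) ? seg - 1 : path;
    }
    return path;
}
```

<P>
Applied to the examples above, the sketch turns /etc/junk/../fred into
/etc/fred and /etc/junk/./fred into /etc/junk/fred, and leaves
../../albert.html unchanged.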
2.31 frystyk 141: <H2>
142: Prevent Security Holes
143: </H2>
144: <P>
145: In many telnet-like protocols, it can be very dangerous to allow the full ASCII
146: character set in a URI, so unsafe characters have to be stripped out.
147: <CODE>HTCleanTelnetString()</CODE> makes sure that the given string doesn't
148: contain characters that could cause security holes, such as newlines in ftp,
149: gopher, news or telnet URLs; more specifically, it allows only characters in
150: the hexadecimal ranges 20-7E and A0-FE, inclusive.
2.8 luotonen 151: <DL>
2.31 frystyk 152: <DT>
153: <CODE>str</CODE>
154: <DD>
155: the string that is *modified* if necessary. The string will be truncated
156: at the first illegal character that is encountered.
157: <DT>
158: returns
159: <DD>
160: YES, if the string was modified. NO, otherwise.
2.8 luotonen 161: </DL>
162: <PRE>
2.26 frystyk 163: extern BOOL HTCleanTelnetString (char * str);
2.8 luotonen 164: </PRE>
165: <PRE>
2.6 timbl 166: #endif /* HTPARSE_H */
2.9 frystyk 167: </PRE>
2.31 frystyk 168: <P>
169: <HR>
2.30 frystyk 170: <ADDRESS>
2.33 ! frystyk 171: @(#) $Id: HTParse.html,v 2.32 1996/06/01 17:46:53 frystyk Exp $
2.30 frystyk 172: </ADDRESS>
2.31 frystyk 173: </BODY></HTML>