Annotation of libwww/Library/src/HTParse.html, revision 2.28
2.14 frystyk 1: <HTML>
2: <HEAD>
2.23 frystyk 3: <TITLE>URI Parsing</TITLE>
2.26 frystyk 4: <!-- Changed by: Henrik Frystyk Nielsen, 11-Nov-1995 -->
2.6 timbl 5: <NEXTID N="1">
2.14 frystyk 6: </HEAD>
2.2 timbl 7: <BODY>
2.14 frystyk 8:
2.26 frystyk 9: <H1>URI Parsing</H1>
2.9 frystyk 10:
2.14 frystyk 11: <PRE>
12: /*
2.20 frystyk 13: ** (c) COPYRIGHT MIT 1995.
2.14 frystyk 14: ** Please first read the full copyright statement in the file COPYRIGH.
15: */
16: </PRE>
17:
18: This module contains code to parse URIs and various related things such as:
2.9 frystyk 19:
20: <UL>
2.17 frystyk 21: <LI><A HREF="#parse">Parse a URI for tokens</A>
22: <LI><A HREF="#canon">Canonicalization of URIs</A>
23: <LI><A HREF="#secure">Search a URI for illegal characters in order to prevent security holes</A>
2.9 frystyk 24: </UL>
25:
2.26 frystyk 26: This module is implemented by <A HREF="HTParse.c">HTParse.c</A>, and
27: it is a part of the <A HREF="http://www.w3.org/pub/WWW/Library/"> W3C
28: Reference Library</A>.
2.9 frystyk 29:
30: <PRE>
31: #ifndef HTPARSE_H
2.2 timbl 32: #define HTPARSE_H
2.16 frystyk 33:
2.13 frystyk 34: #include "HTEscape.h"
2.9 frystyk 35: </PRE>
2.2 timbl 36:
2.17 frystyk 37: <A NAME="parse"><H2>Parsing URIs</H2></A>
2.9 frystyk 38:
2.17 frystyk 39: These functions extract information from a URI.
2.9 frystyk 40:
2.17 frystyk 41: <H3>Parse a URI relative to another URI</H3>
2.9 frystyk 42:
2.17 frystyk 43: This function returns the requested parts of a URI, taking any part
44: missing from <CODE>aName</CODE> from the related URI. The
45: <CODE>aName</CODE> argument is the (possibly relative) URI to be
46: parsed, and <CODE>relatedName</CODE> is the URI that
47: <CODE>aName</CODE> is parsed relative to. Passing an empty string as
48: <CODE>relatedName</CODE> means that <CODE>aName</CODE> is an absolute URI. The
49: following flag bits may be OR'ed together to form the
2.9 frystyk 50: <CODE>wanted</CODE> argument to <CODE>HTParse</CODE>.
51:
52: <PRE>
53: #define PARSE_ACCESS 16
2.1 timbl 54: #define PARSE_HOST 8
55: #define PARSE_PATH 4
56: #define PARSE_ANCHOR 2
57: #define PARSE_PUNCTUATION 1
58: #define PARSE_ALL 31
2.9 frystyk 59: </PRE>
2.1 timbl 60:
2.17 frystyk 61: where the format of a URI is as follows:
62:
63: <PRE>
64: /*
65: ACCESS :// HOST / PATH # ANCHOR
66: */
67: </PRE>
68:
69: <CODE>PUNCTUATION</CODE> means any delimiter like '/', ':', '#'
70: between the tokens above.
71:
72: The string returned by the function must be freed by the caller.
2.2 timbl 73:
74: <PRE>
2.28 ! frystyk 75: extern char * HTParse (const char * aName, const char * relatedName,
2.26 frystyk 76: int wanted);
2.9 frystyk 77: </PRE>
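As a minimal sketch of how the flag bits combine, the fragment below copies the flag values from the header above and adds a hypothetical helper, <CODE>describe_wanted</CODE>, which is not part of the library. It only lists which URI parts a given <CODE>wanted</CODE> value selects; it does not parse anything.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Flag values copied from the header above. */
#define PARSE_ACCESS       16
#define PARSE_HOST          8
#define PARSE_PATH          4
#define PARSE_ANCHOR        2
#define PARSE_PUNCTUATION   1
#define PARSE_ALL          31

/* Hypothetical helper, not part of the library: write the names of the
** parts selected by `wanted` into buf, separated by single spaces. */
static void describe_wanted(int wanted, char *buf, size_t size)
{
    static const struct { int bit; const char *name; } parts[] = {
        { PARSE_ACCESS,      "access" },
        { PARSE_HOST,        "host" },
        { PARSE_PATH,        "path" },
        { PARSE_ANCHOR,      "anchor" },
        { PARSE_PUNCTUATION, "punctuation" }
    };
    size_t i;

    assert(buf != NULL && size > 0);
    buf[0] = '\0';
    for (i = 0; i < sizeof parts / sizeof parts[0]; i++) {
        if ((wanted & parts[i].bit) == 0)
            continue;                        /* this part was not requested */
        if (buf[0] != '\0')
            strncat(buf, " ", size - strlen(buf) - 1);
        strncat(buf, parts[i].name, size - strlen(buf) - 1);
    }
}
```

A caller that wants only the scheme and host, with their delimiters, would pass <CODE>PARSE_ACCESS | PARSE_HOST | PARSE_PUNCTUATION</CODE> as the <CODE>wanted</CODE> argument; <CODE>PARSE_ALL</CODE> is simply all five bits set.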
2.2 timbl 78:
2.17 frystyk 79: <H3>Create a Relative (Partial) URI</H3>
80:
81: This function creates and returns a string which gives an expression
82: of one address as related to another. Where there is no relation, an
83: absolute address is returned.
84:
85: <H3>On entry,</H3>Both names must be absolute, fully
86: qualified names of nodes (no anchor
87: bits)
88: <H3>On exit,</H3>The return result points to a newly
89: allocated name which, if parsed by
90: HTParse relative to relatedName,
91: will yield aName. The caller is responsible
92: for freeing the resulting name later.
93:
94: <PRE>
2.28 ! frystyk 95: extern char * HTRelative (const char * aName, const char *relatedName);
2.17 frystyk 96: </PRE>
97:
98: <A NAME="canon"><H2>Canonicalization</H2></A>
99:
100: Canonicalization of URIs is a difficult job, but it saves a lot of
2.24 frystyk 101: downloads and duplicate entries in the cache if we do a good job. A URI
102: is allowed to contain the sequence "xxx/../", which may be replaced by "",
103: and the sequence "/./", which may be replaced by "/". Simplification
104: helps us recognize duplicate URIs. Thus, the following transformations
105: are done:
2.9 frystyk 106:
107: <UL>
108: <LI> /etc/junk/../fred becomes /etc/fred
109: <LI> /etc/junk/./fred becomes /etc/junk/fred
110: </UL>
111:
112: but we should NOT change
113: <UL>
114: <LI> http://fred.xxx.edu/../.. or
115: <LI> ../../albert.html
116: </UL>
117:
118: In the same manner, the following prefixes are preserved:
119:
120: <UL>
121: <LI> ./&lt;etc&gt;
122: <LI> //&lt;etc&gt;
123: </UL>
124:
125: In order to avoid empty URIs the following URIs become:
126:
127: <UL>
2.26 frystyk 128: <LI> /fred/.. becomes /fred/..
2.9 frystyk 129: <LI> /fred/././.. becomes /fred/..
130: <LI> /fred/.././junk/.././ becomes /fred/..
131: </UL>
132:
133: If more than one set of `://' is found (several proxies in cascade) then
134: only the part after the last `://' is simplified.
135:
136: <PRE>
2.26 frystyk 137: extern char *HTSimplify (char **filename);
2.6 timbl 138: </PRE>
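The basic rules above can be illustrated with a small stand-alone sketch. This is not the library's <CODE>HTSimplify</CODE>: the function name <CODE>simplify_path_sketch</CODE> is hypothetical, it edits the string in place, and it only covers the "/./" and "xxx/../" rewrites while leaving a leading "../" and the segment directly after "//" alone. It does not reproduce the empty-URI handling or the treatment of several `://' in cascade.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical illustration, not the library routine: collapse "/./" to "/"
** and remove "xxx/../" when xxx is an ordinary segment preceded by a single
** slash. The string is modified in place. */
static void simplify_path_sketch(char *path)
{
    char *p = path;
    char *q;
    int is_dotdot, after_double_slash;

    assert(path != NULL);
    while (*p != '\0') {
        if (p[0] == '/' && p[1] == '.' && p[2] == '/') {
            /* "/./" becomes "/": drop the two characters "/." */
            memmove(p, p + 2, strlen(p + 2) + 1);
        } else if (p[0] == '/' && p[1] == '.' && p[2] == '.' && p[3] == '/') {
            /* Walk back to the first character of the preceding segment. */
            q = p;
            while (q > path && q[-1] != '/')
                q--;
            is_dotdot = (p - q == 2 && q[0] == '.' && q[1] == '.');
            after_double_slash = (q - path >= 2 && q[-2] == '/');
            if (q < p && q > path && !is_dotdot && !after_double_slash) {
                /* Remove "xxx/../" and rescan from the slash before it. */
                memmove(q, p + 4, strlen(p + 4) + 1);
                p = q - 1;
            } else {
                p++;    /* leading "..", host part, or empty segment: keep */
            }
        } else {
            p++;
        }
    }
}
```

With this sketch, /etc/junk/../fred becomes /etc/fred and /etc/junk/./fred becomes /etc/junk/fred, while http://fred.xxx.edu/../.. and ../../albert.html are left unchanged.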
2.9 frystyk 139:
2.17 frystyk 140: <A NAME="secure"><H2>Prevent Security Holes</H2></A>
141:
142: In many telnet-like protocols, it can be very dangerous to allow the
143: full ASCII character set in a URI. Therefore we have to strip the
144: dangerous characters out.
2.8 luotonen 145:
146: <CODE>HTCleanTelnetString()</CODE> makes sure that the given string
147: doesn't contain characters that could cause security holes, such as
148: newlines in ftp, gopher, news or telnet URLs. More specifically, it
149: allows every character in the hexadecimal ranges 20-7E and A0-FE,
150: inclusive.
151: <DL>
152: <DT> <CODE>str</CODE>
153: <DD> the string that is *modified* if necessary. The string will be
154: truncated at the first illegal character that is encountered.
155: <DT>returns
156: <DD> YES, if the string was modified.
157: NO, otherwise.
158: </DL>
2.9 frystyk 159:
2.8 luotonen 160: <PRE>
2.26 frystyk 161: extern BOOL HTCleanTelnetString (char * str);
2.8 luotonen 162: </PRE>
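The documented behaviour can be sketched in a few lines. The function below is a hypothetical stand-in named <CODE>clean_telnet_string_sketch</CODE>, not the library's implementation; it returns a plain <CODE>int</CODE> (1 for YES, 0 for NO) instead of the library's <CODE>BOOL</CODE> so that it compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration of the behaviour described above: truncate str
** at the first character outside the hexadecimal ranges 20-7E and A0-FE.
** Returns 1 if the string was modified, 0 otherwise. */
static int clean_telnet_string_sketch(char *str)
{
    unsigned char *p;

    assert(str != NULL);
    for (p = (unsigned char *) str; *p != '\0'; p++) {
        int allowed = (*p >= 0x20 && *p <= 0x7E) ||
                      (*p >= 0xA0 && *p <= 0xFE);
        if (!allowed) {
            *p = '\0';      /* cut the string at the first illegal character */
            return 1;
        }
    }
    return 0;
}
```

Reading the bytes as <CODE>unsigned char</CODE> matters here: on platforms where plain <CODE>char</CODE> is signed, the values A0-FE would otherwise compare as negative numbers and be rejected.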
163:
164: <PRE>
2.6 timbl 165: #endif /* HTPARSE_H */
2.9 frystyk 166: </PRE>
2.2 timbl 167:
2.9 frystyk 168: End of HTParse Module
169: </BODY>
170: </HTML>
2.2 timbl 171: