Annotation of libwww/Library/src/HTParse.html, revision 2.23
2.14 frystyk 1: <HTML>
2: <HEAD>
2.23 ! frystyk 3: <TITLE>URI Parsing</TITLE>
! 4: <!-- Changed by: Henrik Frystyk Nielsen, 14-Aug-1995 -->
2.6 timbl 5: <NEXTID N="1">
2.14 frystyk 6: </HEAD>
2.2 timbl 7: <BODY>
2.14 frystyk 8:
2.9 frystyk 9: <H1>HTParse</H1>
10:
2.14 frystyk 11: <PRE>
12: /*
2.20 frystyk 13: ** (c) COPYRIGHT MIT 1995.
2.14 frystyk 14: ** Please first read the full copyright statement in the file COPYRIGH.
15: */
16: </PRE>
17:
18: This module contains code to parse URIs and various related things such as:
2.9 frystyk 19:
20: <UL>
2.17 frystyk 21: <LI><A HREF="#parse">Parse a URI for tokens</A>
22: <LI><A HREF="#canon">Canonicalization of URIs</A>
23: <LI><A HREF="#secure">Search a URI for illegal characters in order to prevent security holes</A>
2.9 frystyk 24: </UL>
25:
2.14 frystyk 26: This module is implemented by <A HREF="HTParse.c">HTParse.c</A>, and it is
27: a part of the <A
2.22 frystyk 28: HREF="http://www.w3.org/hypertext/WWW/Library/">
29: W3C Reference Library</A>.
2.9 frystyk 30:
31: <PRE>
32: #ifndef HTPARSE_H
2.2 timbl 33: #define HTPARSE_H
2.16 frystyk 34:
2.13 frystyk 35: #include "HTEscape.h"
2.9 frystyk 36: </PRE>
2.2 timbl 37:
2.17 frystyk 38: <A NAME="parse"><H2>Parsing URIs</H2></A>
2.9 frystyk 39:
2.17 frystyk 40: These functions can be used to extract information from a URI.
2.9 frystyk 41:
2.17 frystyk 42: <H3>Parse a URI relative to another URI</H3>
2.9 frystyk 43:
2.17 frystyk 44: This returns those parts of a name which are given (and requested)
45: substituting bits from the related name where necessary. The
46: <CODE>aName</CODE> argument is the (possibly relative) URI to be
47: parsed, the <CODE>relatedName</CODE> is the URI which the
48: <CODE>aName</CODE> is to be parsed relative to. Passing an empty
49: string as <CODE>relatedName</CODE> means that <CODE>aName</CODE> is an absolute URI. The
50: following flag bits may be OR'ed together to form the
2.9 frystyk 51: <CODE>wanted</CODE> argument to <CODE>HTParse</CODE>.
52:
53: <PRE>
54: #define PARSE_ACCESS 16
2.1 timbl 55: #define PARSE_HOST 8
56: #define PARSE_PATH 4
57: #define PARSE_ANCHOR 2
58: #define PARSE_PUNCTUATION 1
59: #define PARSE_ALL 31
2.9 frystyk 60: </PRE>
2.1 timbl 61:
2.17 frystyk 62: where the format of a URI is as follows:
63:
64: <PRE>
65: /*
66: ACCESS :// HOST / PATH # ANCHOR
67: */
68: </PRE>
69:
70: <CODE>PUNCTUATION</CODE> means any delimiter like '/', ':', '#'
71: between the tokens above.
72:
73: The string returned by the function must be freed by the caller.
2.2 timbl 74:
75: <PRE>
2.13 frystyk 76: extern char * HTParse PARAMS(( const char * aName,
2.9 frystyk 77: const char * relatedName,
78: int wanted));
79: </PRE>
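The flag bits above combine with a bitwise OR. The following is a minimal, self-contained sketch of how a <CODE>wanted</CODE> mask can be built and tested. It repeats the constants from this header so that it compiles on its own; the helper <CODE>wants()</CODE> is hypothetical and is not part of the library.

```c
#include <stdio.h>

/* Flag bits, repeated from the header so the sketch stands alone. */
#define PARSE_ACCESS       16
#define PARSE_HOST          8
#define PARSE_PATH          4
#define PARSE_ANCHOR        2
#define PARSE_PUNCTUATION   1
#define PARSE_ALL          31

/* Hypothetical helper (not a library function): returns 1 when every
 * bit of 'part' is set in the 'wanted' mask, 0 otherwise. */
static int wants(int wanted, int part)
{
    return (wanted & part) == part;
}

/* A caller asking for host and path, with delimiters, would pass
 *     PARSE_HOST | PARSE_PATH | PARSE_PUNCTUATION
 * as the third argument of HTParse and free the returned string. */
```

Note that <CODE>PARSE_ALL</CODE> is simply the OR of the five individual bits (16 + 8 + 4 + 2 + 1 = 31).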
2.2 timbl 80:
2.17 frystyk 81: <H3>Create a Relative (Partial) URI</H3>
82:
83: This function creates and returns a string which gives an expression
84: of one address as related to another. Where there is no relation, an
85: absolute address is returned.
86:
87: <H3>On entry,</H3>Both names must be absolute, fully
88: qualified names of nodes (no anchor
89: bits)
90: <H3>On exit,</H3>The return result points to a newly
91: allocated name which, if parsed by
92: HTParse relative to relatedName,
93: will yield aName. The caller is responsible
94: for freeing the resulting name later.
95:
96: <PRE>
97: extern char * HTRelative PARAMS((const char * aName, const char *relatedName));
98: </PRE>
99:
100: <A NAME="canon"><H2>Canonicalization</H2></A>
101:
102: Canonicalization of URIs is a difficult job, but doing it well avoids
103: many redundant downloads and duplicate entries in the cache.
104:
105: <H3>Canonicalize the Path Part of a URI</H3>
2.1 timbl 106:
2.9 frystyk 107: A URI is allowed to contain the sequence "xxx/../", which may be
108: replaced by "", and the sequence "/./", which may be replaced by "/".
109: Simplification helps us recognize duplicate URIs. Thus, the following
110: transformations are done:
111:
112: <UL>
113: <LI> /etc/junk/../fred becomes /etc/fred
114: <LI> /etc/junk/./fred becomes /etc/junk/fred
115: </UL>
116:
117: but we should NOT change
118: <UL>
119: <LI> http://fred.xxx.edu/../.. or
120: <LI> ../../albert.html
121: </UL>
122:
123: In the same manner, the following prefixes are preserved:
124:
125: <UL>
126: <LI> ./<etc>
127: <LI> //<etc>
128: </UL>
129:
130: In order to avoid producing empty URIs, the following URIs are simplified as shown:
131:
132: <UL>
133: <LI> /fred/.. becomes /fred/..
134: <LI> /fred/././.. becomes /fred/..
135: <LI> /fred/.././junk/.././ becomes /fred/..
136: </UL>
137:
138: If more than one set of `://' is found (several proxies in cascade) then
139: only the part after the last `://' is simplified.
140:
141: <PRE>
2.19 frystyk 142: extern char *HTSimplify PARAMS((char **filename));
2.2 timbl 143: </PRE>
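The two basic rewrites described above can be sketched as follows. This is an illustration only, not the code in HTParse.c: the function name <CODE>simplify_path()</CODE> is hypothetical, it works on a bare path rather than a full URI (so it does not look for `://'), and it does not reproduce every special case listed above, in particular the guard against empty results.

```c
#include <string.h>

/* Hypothetical sketch: simplify a path in place.
 * Assumes 'path' is a writable, NUL-terminated path with no scheme or host. */
static void simplify_path(char *path)
{
    char *p;

    /* Replace every "/./" by "/". */
    while ((p = strstr(path, "/./")) != NULL)
        memmove(p, p + 2, strlen(p + 2) + 1);

    /* Replace "xxx/../" by "" unless xxx is empty, "." or "..". */
    p = path;
    while ((p = strstr(p, "/../")) != NULL) {
        char *seg = p;                  /* p is the '/' in front of ".." */
        while (seg > path && seg[-1] != '/')
            seg--;                      /* seg is the start of segment xxx */
        if (seg == p ||
            (p - seg == 1 && seg[0] == '.') ||
            (p - seg == 2 && seg[0] == '.' && seg[1] == '.')) {
            p += 3;                     /* nothing to cancel, move on */
            continue;
        }
        memmove(seg, p + 4, strlen(p + 4) + 1);
        p = (seg > path) ? seg - 1 : path;  /* rescan from the previous '/' */
    }
}
```

With this sketch, "/etc/junk/../fred" becomes "/etc/fred" and "/etc/junk/./fred" becomes "/etc/junk/fred", while "../../albert.html" and a trailing "/fred/.." are left unchanged.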
2.1 timbl 144:
2.17 frystyk 145: <H3>Canonicalize the DNS part of a URI</H3>
2.2 timbl 146:
2.9 frystyk 147: This function expands the host name of the URI from a local name to a
2.11 frystyk 148: full domain name and converts the host name to lower case. The
149: advantage of doing this is that we only have one entry in the host
150: cache and one entry in the document cache.
2.6 timbl 151:
2.9 frystyk 152: <PRE>
2.13 frystyk 153: extern char *HTCanon PARAMS (( char ** filename,
2.9 frystyk 154: char * host));
2.6 timbl 155: </PRE>
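The lower-casing half of this step can be sketched as follows. The function name <CODE>lowercase_host()</CODE> is hypothetical, and the sketch assumes that the host is everything between the first `://' and the next '/' (no user information in front of the host). Expanding a local name to a full domain name is not shown.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical sketch: lower-case the host part of a URI in place.
 * Assumes the host runs from just after the first "://" up to the
 * next '/' or the end of the string. Does nothing if "://" is absent. */
static void lowercase_host(char *uri)
{
    char *p = strstr(uri, "://");

    if (p == NULL)
        return;
    for (p += 3; *p != '\0' && *p != '/'; p++)
        *p = (char) tolower((unsigned char) *p);
}
```

Only the host is touched: the access scheme and the path keep their original case, since paths are generally case sensitive.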
2.9 frystyk 156:
2.17 frystyk 157: <A NAME="secure"><H2>Prevent Security Holes</H2></A>
158:
159: In many telnet-like protocols, it can be very dangerous to allow the
160: full ASCII character set in a URI. Therefore we have to strip
161: the dangerous characters out.
2.8 luotonen 162:
163: <CODE>HTCleanTelnetString()</CODE> makes sure that the given string
164: doesn't contain characters that could cause security holes, such as
165: newlines in ftp, gopher, news or telnet URLs. More specifically, it
166: allows every character in the hexadecimal ranges 20-7E and A0-FE,
167: inclusive.
168: <DL>
169: <DT> <CODE>str</CODE>
170: <DD> the string that is *modified* if necessary. The string will be
171: truncated at the first illegal character that is encountered.
172: <DT>returns
173: <DD> YES, if the string was modified.
174: NO, otherwise.
175: </DL>
2.9 frystyk 176:
2.8 luotonen 177: <PRE>
2.13 frystyk 178: extern BOOL HTCleanTelnetString PARAMS((char * str));
2.8 luotonen 179: </PRE>
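The behaviour described above (truncate at the first character outside the ranges 20-7E and A0-FE, and report whether anything changed) can be sketched as follows. This is an illustration under those stated rules, not the code in HTParse.c; the name <CODE>clean_telnet_string()</CODE> is hypothetical, and it returns a plain int (1 or 0) in place of the library's BOOL.

```c
#include <stddef.h>

/* Hypothetical sketch: truncate 'str' at the first byte that is not in
 * 0x20-0x7E or 0xA0-0xFE. Returns 1 if the string was modified, 0 if not. */
static int clean_telnet_string(char *str)
{
    unsigned char *p;

    if (str == NULL)
        return 0;
    for (p = (unsigned char *) str; *p != '\0'; p++) {
        int allowed = (*p >= 0x20 && *p <= 0x7E) ||
                      (*p >= 0xA0 && *p <= 0xFE);
        if (!allowed) {
            *p = '\0';      /* cut the string at the illegal character */
            return 1;
        }
    }
    return 0;
}
```

For example, a host string with an embedded carriage return and line feed followed by a protocol command is cut just before the carriage return, so the smuggled command never reaches the server.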
180:
181: <PRE>
2.6 timbl 182: #endif /* HTPARSE_H */
2.9 frystyk 183: </PRE>
2.2 timbl 184:
2.9 frystyk 185: End of HTParse Module
186: </BODY>
187: </HTML>
2.2 timbl 188: