<HTML>
<HEAD>
<!-- Changed by: Henrik Frystyk Nielsen, 23-Jun-1996 -->
<TITLE>W3C Sample Code Library libwww URI Management</TITLE>
</HEAD>
<BODY>
<H1>
URI Management
</H1>
<PRE>
/*
** (c) COPYRIGHT MIT 1995.
** Please first read the full copyright statement in the file COPYRIGH.
*/
</PRE>
<P>
This module contains code to parse URIs and various related things such as:
<UL>
<LI>
<A HREF="#parse">Parse a URI for tokens</A>
<LI>
<A HREF="#canon">Canonicalization of URIs</A>
<LI>
<A HREF="#secure">Search a URI for illegal characters in order to prevent
security holes</A>
</UL>
<P>
This module is implemented by <A HREF="HTParse.c">HTParse.c</A>, and it is
a part of the <A HREF="http://www.w3.org/Library/"> W3C Sample Code
Library</A>.
<PRE>
#ifndef HTPARSE_H
#define HTPARSE_H
#include "HTEscape.h"
#ifdef __cplusplus
extern "C" {
#endif
</PRE>
<H2>
<A NAME="Parsing">Parsing URIs</A>
</H2>
<P>
These functions can be used to get information in a URI.
<H3>
Parse a URI relative to another URI
</H3>
<P>
This returns those parts of a name which are given (and requested) substituting
bits from the related name where necessary. The <CODE>aName</CODE> argument
is the (possibly relative) URI to be parsed, the <CODE>relatedName</CODE>
is the URI which the <CODE>aName</CODE> is to be parsed relative to. Passing
an empty string means that the <CODE>aName</CODE> is an absolute URI. The
following are flag bits which may be OR'ed together to form a number to give
the 'wanted' argument to HTParse. As an example we have the URL:
"<CODE>/TheProject.html#news</CODE>"
<PRE>
#define PARSE_ACCESS 16 /* Access scheme, e.g. "HTTP" */
#define PARSE_HOST 8 /* Host name, e.g. "www.w3.org" */
#define PARSE_PATH 4 /* URL Path, e.g. "pub/WWW/TheProject.html" */
#define PARSE_VIEW 2 /* Fragment identifier, e.g. "news" */
#define PARSE_FRAGMENT PARSE_VIEW
#define PARSE_ANCHOR PARSE_VIEW
#define PARSE_PUNCTUATION 1 /* Include delimiters, e.g, "/" and ":" */
#define PARSE_ALL 31
</PRE>
<P>
where the format of a URI is as follows: "<CODE>ACCESS :// HOST / PATH #
ANCHOR</CODE>"
<P>
<CODE>PUNCTUATION</CODE> means any delimiter like '/', ':', '#' between the
tokens above. The string returned by the function must be freed by the caller.
<PRE>
extern char * HTParse (const char * aName, const char * relatedName,
int wanted);
</PRE>
<H3>
Create a Relative (Partial) URI
</H3>
<P>
This function creates and returns a string which gives an expression of one
address as related to another. Where there is no relation, an absolute address
is retured.
<DL>
<DT>
On entry,
<DD>
Both names must be absolute, fully qualified names of nodes (no anchor bits)
<DT>
On exit,
<DD>
The return result points to a newly allocated name which, if parsed by HTParse
relative to relatedName, will yield aName. The caller is responsible for
freeing the resulting name later.
</DL>
<PRE>
extern char * HTRelative (const char * aName, const char *relatedName);
</PRE>
<H2>
<A NAME="absrel">Is a URL Relative or Absolute?</A>
</H2>
<P>
Search the URL and determine whether it is a relative or absolute URL. We
check to see if there is a ":" before any "/", "?", and "#". If this is the
case then we say it is absolute. Otherwise we say it is relative.
<PRE>
extern BOOL HTURL_isAbsolute (const char * url);
</PRE>
<H2>
<A NAME="URL">URL Canonicalization</A>
</H2>
<P>
Canonicalization of URIs is a difficult job, but it saves a lot of down loads
and double entries in the cache if we do a good job. A URI is allowed to
contain the seqeunce xxx/../ which may be replaced by "" , and the seqeunce
"/./" which may be replaced by "/". Simplification helps us recognize duplicate
URIs. Thus, the following transformations are done:
<UL>
<LI>
/etc/junk/../fred becomes /etc/fred
<LI>
/etc/junk/./fred becomes /etc/junk/fred
</UL>
<P>
but we should NOT change
<UL>
<LI>
http://fred.xxx.edu/../.. or
<LI>
../../albert.html
</UL>
<P>
In the same manner, the following prefixed are preserved:
<UL>
<LI>
./<etc>
<LI>
//<etc>
</UL>
<P>
In order to avoid empty URIs the following URIs become:
<UL>
<LI>
/fred/.. becomes /fred/..
<LI>
/fred/././.. becomes /fred/..
<LI>
/fred/.././junk/.././ becomes /fred/..
</UL>
<P>
If more than one set of `://' is found (several proxies in cascade) then
only the part after the last `://' is simplified.
<PRE>
extern char *HTSimplify (char **filename);
</PRE>
<H2>
<A NAME="sec">Prevent Security Holes</A>
</H2>
<P>
In many telnet like protocols, it can be very dangerous to allow a full ASCII
character set to be in a URI. Therefore we have to strip them out.
<CODE>HTCleanTelnetString()</CODE> makes sure that the given string doesn't
contain characters that could cause security holes, such as newlines in ftp,
gopher, news or telnet URLs; more specifically: allows everything between
hexadesimal ASCII 20-7E, and also A0-FE, inclusive.
<DL>
<DT>
<CODE>str</CODE>
<DD>
the string that is *modified* if necessary. The string will be truncated
at the first illegal character that is encountered.
<DT>
returns
<DD>
YES, if the string was modified. NO, otherwise.
</DL>
<PRE>
extern BOOL HTCleanTelnetString (char * str);
</PRE>
<PRE>
#ifdef __cplusplus
}
#endif
#endif /* HTPARSE_H */
</PRE>
<P>
<HR>
<ADDRESS>
@(#) $Id: HTParse.html,v 2.40 2005/11/11 14:03:15 vbancrof Exp $
</ADDRESS>
</BODY></HTML>
Webmaster