Revision 2.40
Fri Nov 11 14:03:15 2005 UTC by vbancrof
Branches: MAIN
CVS tags: candidate-5-4-1, HEAD

add extern c and HTFile_dirent_buf_size

<HTML>
<HEAD>
  <TITLE>W3C Sample Code Library libwww URI Management</TITLE>
</HEAD>
<BODY>
<H1>
  URI Management
</H1>
<PRE>
/*
**	(c) COPYRIGHT MIT 1995.
**	Please first read the full copyright statement in the file COPYRIGH.
*/
</PRE>
<P>
This module contains code to parse URIs and various related things such as:
<UL>
  <LI>
    <A HREF="#parse">Parse a URI for tokens</A>
  <LI>
    <A HREF="#canon">Canonicalization of URIs</A>
  <LI>
    <A HREF="#secure">Search a URI for illegal characters in order to prevent
    security holes</A>
</UL>
<P>
This module is implemented by <A HREF="HTParse.c">HTParse.c</A>, and it is
a part of the <A HREF="http://www.w3.org/Library/"> W3C Sample Code
Library</A>.
<PRE>
#ifndef HTPARSE_H
#define HTPARSE_H

#include "HTEscape.h"

#ifdef __cplusplus
extern "C" {
#endif
</PRE>
<H2>
  <A NAME="Parsing">Parsing URIs</A>
</H2>
<P>
These functions can be used to extract information from a URI.
<H3>
  Parse a URI relative to another URI
</H3>
<P>
This returns those parts of a name which are given (and requested),
substituting bits from the related name where necessary. The
<CODE>aName</CODE> argument is the (possibly relative) URI to be parsed, and
<CODE>relatedName</CODE> is the URI which <CODE>aName</CODE> is to be parsed
relative to. Passing an empty string as <CODE>relatedName</CODE> means that
<CODE>aName</CODE> is an absolute URI. The following flag bits may be OR'ed
together to form the <CODE>wanted</CODE> argument to <CODE>HTParse</CODE>.
As an example we have the URL: "<CODE>/TheProject.html#news</CODE>"
<PRE>
#define PARSE_ACCESS        16  /* Access scheme, e.g. "HTTP" */
#define PARSE_HOST           8  /* Host name, e.g. "www.w3.org" */
#define PARSE_PATH           4  /* URL Path, e.g. "pub/WWW/TheProject.html" */
#define PARSE_VIEW           2  /* Fragment identifier, e.g.
"news" */ #define PARSE_FRAGMENT PARSE_VIEW #define PARSE_ANCHOR PARSE_VIEW #define PARSE_PUNCTUATION 1 /* Include delimiters, e.g, "/" and ":" */ #define PARSE_ALL 31 </PRE> <P> where the format of a URI is as follows: "<CODE>ACCESS :// HOST / PATH # ANCHOR</CODE>" <P> <CODE>PUNCTUATION</CODE> means any delimiter like '/', ':', '#' between the tokens above. The string returned by the function must be freed by the caller. <PRE> extern char * HTParse (const char * aName, const char * relatedName, int wanted); </PRE> <H3> Create a Relative (Partial) URI </H3> <P> This function creates and returns a string which gives an expression of one address as related to another. Where there is no relation, an absolute address is retured. <DL> <DT> On entry, <DD> Both names must be absolute, fully qualified names of nodes (no anchor bits) <DT> On exit, <DD> The return result points to a newly allocated name which, if parsed by HTParse relative to relatedName, will yield aName. The caller is responsible for freeing the resulting name later. </DL> <PRE> extern char * HTRelative (const char * aName, const char *relatedName); </PRE> <H2> <A NAME="absrel">Is a URL Relative or Absolute?</A> </H2> <P> Search the URL and determine whether it is a relative or absolute URL. We check to see if there is a ":" before any "/", "?", and "#". If this is the case then we say it is absolute. Otherwise we say it is relative. <PRE> extern BOOL HTURL_isAbsolute (const char * url); </PRE> <H2> <A NAME="URL">URL Canonicalization</A> </H2> <P> Canonicalization of URIs is a difficult job, but it saves a lot of down loads and double entries in the cache if we do a good job. A URI is allowed to contain the seqeunce xxx/../ which may be replaced by "" , and the seqeunce "/./" which may be replaced by "/". Simplification helps us recognize duplicate URIs. 
Thus, the following transformations are done:
<UL>
  <LI>
    /etc/junk/../fred becomes /etc/fred
  <LI>
    /etc/junk/./fred becomes /etc/junk/fred
</UL>
<P>
but we should NOT change
<UL>
  <LI>
    http://fred.xxx.edu/../.. or
  <LI>
    ../../albert.html
</UL>
<P>
In the same manner, the following prefixes are preserved:
<UL>
  <LI>
    ./&lt;etc&gt;
  <LI>
    //&lt;etc&gt;
</UL>
<P>
In order to avoid empty URIs the following URIs become:
<UL>
  <LI>
    /fred/.. becomes /fred/..
  <LI>
    /fred/././.. becomes /fred/..
  <LI>
    /fred/.././junk/.././ becomes /fred/..
</UL>
<P>
If more than one set of `://' is found (several proxies in cascade) then
only the part after the last `://' is simplified.
<PRE>
extern char *HTSimplify (char **filename);
</PRE>
<H2>
  <A NAME="sec">Prevent Security Holes</A>
</H2>
<P>
In many telnet-like protocols, it can be very dangerous to allow the full
ASCII character set in a URI, so unsafe characters have to be stripped out.
<CODE>HTCleanTelnetString()</CODE> makes sure that the given string doesn't
contain characters that could cause security holes, such as newlines in ftp,
gopher, news or telnet URLs. More specifically, it allows only characters in
the hexadecimal ranges 20-7E and A0-FE, inclusive.
<DL>
  <DT>
    <CODE>str</CODE>
  <DD>
    the string that is *modified* if necessary. The string will be truncated
    at the first illegal character that is encountered.
  <DT>
    returns
  <DD>
    YES, if the string was modified. NO, otherwise.
</DL>
<PRE>
extern BOOL HTCleanTelnetString (char * str);
</PRE>
<PRE>
#ifdef __cplusplus
}
#endif

#endif  /* HTPARSE_H */
</PRE>
<P>
<HR>
<ADDRESS>
  @(#) $Id: HTParse.html,v 2.40 2005/11/11 14:03:15 vbancrof Exp $
</ADDRESS>
</BODY></HTML>