File:  [Public] / libwww / Library / src / HTParse.html
Revision 2.40: download - view: text, annotated - select for diffs
Fri Nov 11 14:03:15 2005 UTC (18 years, 6 months ago) by vbancrof
Branches: MAIN
CVS tags: candidate-5-4-1, HEAD
add extern c and HTFile_dirent_buf_size

<HTML>
<HEAD>
  <!-- Changed by: Henrik Frystyk Nielsen, 23-Jun-1996 -->
  <TITLE>W3C Sample Code Library libwww URI Management</TITLE>
</HEAD>
<BODY>
<H1>
  URI Management
</H1>
<PRE>
/*
**	(c) COPYRIGHT MIT 1995.
**	Please first read the full copyright statement in the file COPYRIGH.
*/
</PRE>
<P>
This module contains code to parse URIs and various related things such as:
<UL>
  <LI>
    <A HREF="#parse">Parse a URI for tokens</A>
  <LI>
    <A HREF="#canon">Canonicalization of URIs</A>
  <LI>
    <A HREF="#secure">Search a URI for illegal characters in order to prevent
    security holes</A>
</UL>
<P>
This module is implemented by <A HREF="HTParse.c">HTParse.c</A>, and it is
a part of the <A HREF="http://www.w3.org/Library/"> W3C Sample Code
Library</A>.
<PRE>
#ifndef HTPARSE_H
#define HTPARSE_H

#include "HTEscape.h"

#ifdef __cplusplus
extern "C" { 
#endif 
</PRE>
<H2>
  <A NAME="Parsing">Parsing URIs</A>
</H2>
<P>
These functions can be used to get information in a URI.
<H3>
  Parse a URI relative to another URI
</H3>
<P>
This returns those parts of a name which are given (and requested) substituting
bits from the related name where necessary. The <CODE>aName</CODE> argument
is the (possibly relative) URI to be parsed, the <CODE>relatedName</CODE>
is the URI which the <CODE>aName</CODE> is to be parsed relative to. Passing
an empty string means that the <CODE>aName</CODE> is an absolute URI. The
following are flag bits which may be OR'ed together to form a number to give
the 'wanted' argument to HTParse. As an example we have the URL:
"<CODE>/TheProject.html#news</CODE>"
<PRE>
#define PARSE_ACCESS		16	/* Access scheme, e.g. "HTTP" */
#define PARSE_HOST		 8	/* Host name, e.g. "www.w3.org" */
#define PARSE_PATH		 4	/* URL Path, e.g. "pub/WWW/TheProject.html" */

#define PARSE_VIEW               2      /* Fragment identifier, e.g. "news" */
#define PARSE_FRAGMENT           PARSE_VIEW
#define PARSE_ANCHOR		 PARSE_VIEW

#define PARSE_PUNCTUATION	 1	/* Include delimiters, e.g, "/" and ":" */
#define PARSE_ALL		31
</PRE>
<P>
where the format of a URI is as follows: "<CODE>ACCESS :// HOST / PATH #
ANCHOR</CODE>"
<P>
<CODE>PUNCTUATION</CODE> means any delimiter like '/', ':', '#' between the
tokens above. The string returned by the function must be freed by the caller.
<PRE>
extern char * HTParse  (const char * aName, const char * relatedName,
			int wanted);
</PRE>
<H3>
  Create a Relative (Partial) URI
</H3>
<P>
This function creates and returns a string which gives an expression of one
address as related to another. Where there is no relation, an absolute address
is retured.
<DL>
  <DT>
    On entry,
  <DD>
    Both names must be absolute, fully qualified names of nodes (no anchor bits)
  <DT>
    On exit,
  <DD>
    The return result points to a newly allocated name which, if parsed by HTParse
    relative to relatedName, will yield aName. The caller is responsible for
    freeing the resulting name later.
</DL>
<PRE>
extern char * HTRelative (const char * aName, const char *relatedName);
</PRE>
<H2>
  <A NAME="absrel">Is a URL Relative or Absolute?</A>
</H2>
<P>
Search the URL and determine whether it is a relative or absolute URL. We
check to see if there is a ":" before any "/", "?", and "#". If this is the
case then we say it is absolute. Otherwise we say it is relative.
<PRE>
extern BOOL HTURL_isAbsolute (const char * url);
</PRE>
<H2>
  <A NAME="URL">URL Canonicalization</A>
</H2>
<P>
Canonicalization of URIs is a difficult job, but it saves a lot of down loads
and double entries in the cache if we do a good job. A URI is allowed to
contain the seqeunce xxx/../ which may be replaced by "" , and the seqeunce
"/./" which may be replaced by "/". Simplification helps us recognize duplicate
URIs. Thus, the following transformations are done:
<UL>
  <LI>
    /etc/junk/../fred becomes /etc/fred
  <LI>
    /etc/junk/./fred becomes /etc/junk/fred
</UL>
<P>
but we should NOT change
<UL>
  <LI>
    http://fred.xxx.edu/../.. or
  <LI>
    ../../albert.html
</UL>
<P>
In the same manner, the following prefixed are preserved:
<UL>
  <LI>
    ./&lt;etc&gt;
  <LI>
    //&lt;etc&gt;
</UL>
<P>
In order to avoid empty URIs the following URIs become:
<UL>
  <LI>
    /fred/.. becomes /fred/..
  <LI>
    /fred/././.. becomes /fred/..
  <LI>
    /fred/.././junk/.././ becomes /fred/..
</UL>
<P>
If more than one set of `://' is found (several proxies in cascade) then
only the part after the last `://' is simplified.
<PRE>
extern char *HTSimplify (char **filename);
</PRE>
<H2>
  <A NAME="sec">Prevent Security Holes</A>
</H2>
<P>
In many telnet like protocols, it can be very dangerous to allow a full ASCII
character set to be in a URI. Therefore we have to strip them out.
<CODE>HTCleanTelnetString()</CODE> makes sure that the given string doesn't
contain characters that could cause security holes, such as newlines in ftp,
gopher, news or telnet URLs; more specifically: allows everything between
hexadesimal ASCII 20-7E, and also A0-FE, inclusive.
<DL>
  <DT>
    <CODE>str</CODE>
  <DD>
    the string that is *modified* if necessary. The string will be truncated
    at the first illegal character that is encountered.
  <DT>
    returns
  <DD>
    YES, if the string was modified. NO, otherwise.
</DL>
<PRE>
extern BOOL HTCleanTelnetString (char * str);
</PRE>
<PRE>
#ifdef __cplusplus
}
#endif

#endif	/* HTPARSE_H */
</PRE>
<P>
  <HR>
<ADDRESS>
  @(#) $Id: HTParse.html,v 2.40 2005/11/11 14:03:15 vbancrof Exp $
</ADDRESS>
</BODY></HTML>

Webmaster