SGML Fix

I have a proposition to improve a little the SGML module. A small patch I send here improves the error recovery in parsing ill-formed HTML documents. I have a proposition to improve a little the SGML module. A small patch I send here improves the error recovery in parsing ill-formed HTML documents.

The end_element function in sgml.c hanles ill-nested tags pretty well, ignoring bad end tags or assuming corresponding start tags. The rule used now states that if we encounter an ending tag without the corresponding start tag on the top of the element=20 stack, we do one of the following:

  1. If we are on the last level (that is there is only <HTML> on the element stack), we ignore the bad ending tag.
  2. If there are some other elements (start tags) on the stack, we assume the corresponding end tags (except </HTML>), free the stack from those elements, and=20 then check again the tag which caused the whole problem against the rules 1 and 2.
In some cases this algorithm doesn't work well. Consider the example:
	<HTML>
	<BODY>
	<UL>
	<LI> <A...> <B> ... </A> </I> </B>
	<LI> ...
	</UL>
	</BODY>
	</HTML>
There are two errors in this example:
  1. <A> and <B> are overlapping
  2. There is an </I> without any <I>
According to the rules given above, this is parsed exactly as shown below:
	<HTML>
	<BODY>
	<UL>
	<LI> <A...> <B> ... </B> </A>
	</UL>
	</BODY>
	<LI> ...
	</HTML>
This is because:
  1. When the parser encounters </A> (line 4 of the original example) it assumes </B> and then goes to </A>, which is now nested perfectly.
  2. Next it encounters </I>, so it assumes all open start tags (except <HTML>); this gives </UL> </BODY>.
  3. Again with </I>, the parser is now on the last level (only <HTML> hasn't been closed), so it ignores </I>.
  4. Now the parser encounters <LI>. Note that it's a <LI> outside <UL>! What=20 happen next depends on the quality of the HTML module (my version of html.c was crashing until I corrected it).
The problem is that too many ending tags are assumed. I propose to ignore all the ill-nested end tags which doesn't have a corresponding opening tag on the element stack (on the whole stack, not only on its top). Considering the example, after seeing=20 </I> we simply ignore it (and not assume all pending tags and then ignore it).

So, after this modification the example is parsed as:

	<HTML>
	<BODY>
	<UL>
	<LI> <A...> <B> ... </B> </A>
	<LI> ...
	</UL>
	</BODY>
	</HTML>
Here's the patch:
/*
**	Helper function to check if the tag is on the stack
*/
PRIVATE BOOL lookup_element_stack (HTElement* stack, HTTag *tag)
{
    HTElement* elem;
    for (elem = stack; elem != NULL; elem = elem->next)
    {
        if (elem->tag == tag)  return TRUE;
    }
    return FALSE;
}

/* 
** Modified end_element function
** Only one line is added, it's marked with <==
*/
PRIVATE void end_element (HTStream * context, HTTag * old_tag)
{
    if (SGML_TRACE) TTYPrint(TDEST, "SGML: End   </%s>\n", old_tag->name);
    if (old_tag->contents == SGML_EMPTY) {
        if (SGML_TRACE) TTYPrint(TDEST,"SGML: Illegal end tag </%s> found.\n",
                old_tag->name);
        return;
    }
    while (context->element_stack)      {/* Loop is error path only */
        HTElement * N = context->element_stack;
        HTTag * t = N->tag;

        if (old_tag != t) {             /* Mismatch: syntax error */
            if (context->element_stack->next    /* This is not the last level */
                    && lookup_element_stack(context->element_stack, old_tag)) {  /* <== */
                if (SGML_TRACE) TTYPrint(TDEST,
                "SGML: Found </%s> when expecting </%s>. </%s> assumed.\n",
                    old_tag->name, t->name, t->name);
            } else {                    /* last level */
                if (SGML_TRACE) TTYPrint(TDEST,
                    "SGML: Found </%s> when expecting </%s>. </%s> Ignored.\n",
                    old_tag->name, t->name, old_tag->name);
                return;                 /* Ignore */
            }
        }

        context->element_stack = N->next;               /* Remove from stack */
        free(N);
        (*context->actions->end_element)(context->target,
                 t - context->dtd->tags);
        if (old_tag == t) return;  /* Correct sequence */

        /* Syntax error path only */

    }
    if (SGML_TRACE) TTYPrint(TDEST,
        "SGML: Extra end tag </%s> found and ignored.\n", old_tag->name);
}

Tha't all for now. I have also another proposition concerning SGML module (better handling of <P> tags), but in my opinion it will need some discussion, and I don't know if you are interested in this.


Maciej Puzio, puzio@laser.mimuw.edu.pl