One of the feature requests we hear most often is to incorporate a search engine into Hypermail. Because everyone has his or her own favorite search engine, and because any engine we might pick would likely turn out to be unusable for some archives, we haven't built a search engine into Hypermail, and we probably won't.
But hypermail's page customizations make it easy to integrate your own search engine into your hypermail archives.
For our example, we're going to put a form box on the top and bottom of every index page, and we'll use the swish-e search engine. We'll show a typical PHP script and a typical Perl script that can function as a glue layer between the web and the search engine. There ought to be enough information here to get you started regardless of what search engine and scripting language you choose for your site.
Let's begin by modifying the header and footer hypermail puts on each index file.
(1) Create a file called indexheader.hyp containing the following:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>%l: %s</title>
<meta http-equiv="Content-Type" content="text/html;CHARSET=iso-8859-1" />
<meta name="generator" content="%p %v, see %h" />
<meta name="Subject" content="%s" />
<meta name="Date" content="(nil)" />
<link rev="made" href="mailto:%m" />
<style type="text/css">
body {color: black; background: #ffffff}
h1.center {text-align: center}
div.center {text-align: center}
.quotelev1 {color : #990099}
.quotelev2 {color : #ff7700}
.quotelev3 {color : #007799}
.quotelev4 {color : #95c500}
.headers {background : #e0e0d0}
.links {background : #f8f8e0}
</style>
</head>
<body>
<h1 class="center">%l<br />%s</h1>
<!-- Your custom code goes below this line -->
<div class="center">
<form action="http://url/of/search.php" method="post"
      enctype="application/x-www-form-urlencoded">
<div>
<input type="hidden" name="db" value="name.of.index" />
<p class="headers">
<input type="text" name="str" size="20" />
<input type="submit" value="Search" />
</p>
</div>
</form>
</div>
Notice the HTML comment about two-thirds of the way down. Everything above that line is standard, and ought to be very similar to the default header. The custom code (below the comment) adds the features you want; in this case, a search form.
You will, of course, substitute the URL of your own search.php as the form's action, and the name of your own search index for "name.of.index". See the bottom of this page for an example of creating a search index file.
If you like the general appearance of the web pages Hypermail creates and don't want to spend a lot of time playing with the HTML, you might want to look at the source of "index.html" in the directory for which you're installing a search engine and just copy the appropriate lines (e.g., the lines above the first <hr />) to indexheader.hyp.
(2) Create a file called indexfooter.hyp containing the following:
<div class="center">
<form action="/url/of/search.php" method="post"
      enctype="application/x-www-form-urlencoded">
<div>
<input type="hidden" name="db" value="name.of.index" />
<p class="headers">
<input type="text" name="str" size="20" />
<input type="submit" value="Search" />
</p>
</div>
</form>
</div>
<!-- Your custom code goes above this line -->
<hr />
<p><small><em>
This archive was generated by <a href="%h">%p %v</a> : %g
</em></small></p>
</body>
</html>
Since this is a footer file, the common code will be at the end, and you'll put your custom code above the comment.
You may have noted what seem to be superfluous <div> elements in the HTML in the header and footer file. The W3C validator likes them, and they do no harm.
(3) Modify the .hmrc file to point to your custom header and footer:
# ihtmlheaderfile = [ path to index header template file | NONE ]
#
# Set this to the path to the Index header template file containing
# valid HTML statements and substitution cookies for runtime expansion.
# This will be included at the top of every index page.
ihtmlheaderfile = /path/to/indexheader.hyp

# ihtmlfooterfile = [ path to index footer template file | NONE ]
#
# Set this to the path to the Index footer template file containing
# valid HTML statements and substitution cookies for runtime expansion.
# This will be included at the bottom of every index page.
ihtmlfooterfile = /path/to/indexfooter.hyp
You'll call hypermail using this .hmrc file to create your archive or add a message:
hypermail -c /path/to/.hmrc [...]
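For instance, a full run over an existing mbox might look something like the command below. This is only a sketch: the paths, the mailbox name, and the label are placeholders, and the exact set of options may differ slightly between hypermail versions, so check your man page.

# Build (or rebuild) the archive from an mbox file, using the
# customized header/footer settings in .hmrc.  All paths and the
# label are examples only; adjust them for your site.
hypermail -c /path/to/.hmrc \
          -m /path/to/model-rr.mbox \
          -d /path/to/document_root/model-rr-archive \
          -l "Model Railroading List"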
Here's a sample PHP script that performs the minimum functionality required for "search.php":
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Search results</title>
<meta http-equiv="Content-Type" content="text/html;CHARSET=iso-8859-1" />
</head>
<body>
<div>
<?php
// Very simple search using swish-e (see http://swish-e.org/)
// copyright 2003 by Bob Crispen <http://www.crispen.org/>
// May be distributed without restriction for any purpose.
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
// GNU General Public License for more details.

// Set these values to the appropriate ones for your system
$db_dir = "/path/to/swish-e-databases/"; // Directory where you keep
                                         // the databases and indexes
                                         // that swish-e generates
$swishe = "/usr/local/bin/swish-e";      // swish-e executable

// Get arguments from call
//    "str" -- String to search for
//    "db"  -- swish-e Database name (e.g., "site.index")
$str = $_POST["str"];
$db  = $_POST["db"];

// This script is probably called from a form similar to the one
// below.  If you call it without arguments, all it'll do is print
// the form.
if (($str != "") && ($db != "")) {
    // Prevent shell execution exploit
    $searchfor = escapeshellarg($str);
    $database  = escapeshellarg($db_dir . $db);

    // HTML-safe copy of the search string for display
    $display = htmlspecialchars($str);

    // Do the search
    $result = `$swishe -H 0 -d '\t' -w $searchfor -f $database`;

    // Turn the results into an array
    $results = explode("\n", $result);

    // See how many results we have.  Ignore the final blank line.
    $count = count($results) - 1;
    if ($count > 0) {
        print("Search results for <strong>$display</strong>:\n");
        print("<ul>\n");
        for ($i = 0; $i < $count; $i++) {
            list($score, $url, $title, $len) = explode("\t", $results[$i], 4);
            print("<li> <a href=\"$url\">$title</a> [$score]</li>\n");
        }
        print("</ul>\n");
    } else {
        print("Sorry, <strong>$display</strong> not found\n");
    }
}

// Print a new form so they can continue searching
print <<<EOT
<form action="{$_SERVER['PHP_SELF']}" method="post"
      enctype="application/x-www-form-urlencoded">
<div>
<input type="hidden" name="db" value="$db" />
<input type="text" name="str" size="20" />
<input type="submit" value="Search" />
</div>
</form>
</div>
EOT;
?>
</body>
</html>
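Once search.php is installed, you can sanity-check it from the command line before wiring it into your index pages. The host, path, index name, and search term below are placeholders; curl's -d option sends an ordinary urlencoded POST, which is exactly what the form produces:

# Hypothetical example: POST a query to the script, just as the
# form on the index pages would.  Substitute your own host, path,
# index name, and search term.
curl -d "db=name.of.index" -d "str=caboose" http://www.yoursite.com/search.php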
If you prefer Perl, or if your web server doesn't offer PHP, then you could modify these lines in indexheader.hyp and indexfooter.hyp:
<form action="/url/of/search.php" method="post" enctype="application/x-www-form-urlencoded">
to
<form action="/url/of/search.pl" method="post" enctype="application/x-www-form-urlencoded">
Here's a sample script contributed by Perl maven Greg Bacon that performs the minimum functionality required for "search.pl":
#! /usr/local/bin/perl -T

# Perl implementation of Bob Crispen's PHP swish-e search available
# at <URL:http://www.crispen.org/doc/hypermail/archive_search.html>
# Copyright 2003 Greg Bacon.
#
# Very simple search using swish-e (see http://swish-e.org/)
# copyright 2003 by Bob Crispen <http://www.crispen.org/>
# May be distributed without restriction for any purpose.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.

use warnings;
use strict;

my @warnings;

BEGIN {
    $SIG{__WARN__} = sub { push @warnings, @_ };

    print <<'EOHeader';
Content-type: text/html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Search results</title>
<meta http-equiv="Content-Type" content="text/html;CHARSET=iso-8859-1" />
</head>
<body>
<div>
EOHeader

    $SIG{__DIE__} = sub {
        print "<h1>Error!</h1>\n",
              map { "<pre>$_</pre>\n" } @_;
    };
}

use CGI qw/ :standard /;
use HTML::Entities;

# keep the taint checker happy
$ENV{PATH} = "/bin:/usr/bin:/usr/local/bin";

# ----------------------- Configuration ------------------------

# Directory where you keep the databases and
# indexes that swish-e generates
my $db_dir = "/path/to/index/";

# swish-e executable
my $swishe = "/path/to/bin/swish-e";

# -------------------- End of Configuration --------------------

sub drop_privs {
    my @temp = ($>, $));
    my $orig_uid = $<;
    my $orig_gid = $(;

    # set effective user and group id to real
    $> = $<;
    $) = $(;

    # Drop privileges
    $< = $orig_uid;
    $( = $orig_gid;

    # Make sure privs are really gone
    ($>, $)) = @temp;

    die "FATAL: can't drop privileges"
        unless $< == $> && $( eq $);
}

sub search {
    my $str = shift;
    my $db  = shift;

    return unless $str && $db;

    my $database = $db_dir . $db;

    my $pid = open SWISHE, "-|";
    unless (defined $pid) {
        warn "failed fork: $!";
        return;
    }

    my @results;
    if ($pid) {
        # parent
        while (<SWISHE>) {
            chomp;

            # swish-e output is four TAB-separated fields, so
            # we assume lines with three TABs are from swish-e
            # and all other non-blank lines are warnings
            if (/\t.*\t.*\t/) {
                push @results, [ split /\t/, $_, 4 ];
            }
            else {
                push @warnings, $_ if /\S/;
            }
        }

        close SWISHE
            or warn $! ? "Error closing $swishe pipe: $!"
                       : "Exit status $? from $swishe";
    }
    else {
        # child
        local $SIG{__WARN__} = sub { print @_ };
        local $SIG{__DIE__}  = sub { print @_; exit 1 };

        # 2>&1
        open STDERR, ">&STDOUT" or warn "WARNING: dup STDOUT: $!";

        drop_privs;

        # Do the search
        no warnings;
        exec $swishe, '-H', 0, '-d', '\t', '-w', $str, '-f', $database
            or die "FATAL: exec $swishe: $!";
        exit 1;
    }

    my $searchfor = encode_entities $str;

    if (@results) {
        print "Search results for <strong>$searchfor</strong>:\n",
              "<ul>\n";

        for (@results) {
            my($score, $url, $title, $len) = @$_;
            print qq{ <li> <a href="$url">$title</a> [$score]</li>\n};
        }

        print "</ul>\n";
    }
    else {
        print("Sorry, <strong>$searchfor</strong> not found\n");
    }
}

## main

# URL of this page
my $me = url -full => 1;

# Get arguments from call
#    "str" -- String to search for
#    "db"  -- swish-e Database name (e.g., "site.index")
my $str = param "str";
my $db  = param "db";

# This script is probably called from a form similar to the one
# below.  If you call it without arguments, all it'll do is print
# the form.
my @results = search $str, $db;

# Print a new form so they can continue searching
print startform,
      "<div>\n",
      hidden(db => $db),
      textfield(-name => 'str', -size => 20),
      submit('Search'),
      "</div>\n",
      end_form, "\n";

if (@warnings) {
    my $messages = @warnings == 1 ? "message" : "messages";

    print <<EOWarningsHead;
<p><hr>
<h1>Warnings</h1>

<em>Warning $messages:</em>
<ul>
EOWarningsHead

    for (@warnings) {
        print "  <li><code>$_</code></li>\n";
    }

    print "</ul>\n";
}

print <<EOFooter;
</div>
</body>
</html>
EOFooter
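Because search.pl runs as a CGI script, it has to live somewhere your web server will execute it and be marked executable. A minimal installation might look like the commands below; the cgi-bin path is only an example, so check your server's configuration for the right location.

# Hypothetical paths -- adjust for your server layout
cp search.pl /path/to/cgi-bin/search.pl
chmod 755 /path/to/cgi-bin/search.pl
perl -cT /path/to/cgi-bin/search.pl    # syntax-check it with taint mode on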
Swish-e, like many other search engines, requires you to generate a search index before you perform any searches. Here's a .conf file for swish-e that might generate a reasonable search index for a hypermail archive of messages about model railroading:
IndexDir /path/to/model-rr-archive
IndexFile /path/to/model-rr.index
IndexReport 3
IndexOnly .html
ReplaceRules replace "/path/to/document_root/" "http://www.yoursite.com/"
FileRules filename is attachment.html
FileRules filename is author.html
FileRules filename is date.html
FileRules filename is index.html
FileRules filename is subject.html
FileRules filename is thread.html
This tells swish-e to collect data from all the HTML files in your model railroading archive except the index pages that hypermail generates. That way, all the search results will point directly to the messages themselves.
When you want to build your search index, you'll call swish-e something like this:
swish-e -v 3 -c /path/to/model-rr.conf > /path/to/model-rr.report 2>&1
Many people do this either in a cron job during off-peak hours or through a CGI script that they call whenever they update their archive.
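For the cron approach, a crontab entry along these lines would rebuild the index every night; the schedule, paths, and report file are illustrative only, and the command itself is just the one shown above.

# Hypothetical crontab entry: rebuild the swish-e index nightly at 3:15 a.m.
15 3 * * * /usr/local/bin/swish-e -v 3 -c /path/to/model-rr.conf > /path/to/model-rr.report 2>&1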
You might want to add search logging, page control (so you only print, say, 20 results at a time), some nice CSS, and all sorts of other things to your script. And if you'd prefer to write your script in Python, sh, or Ada, you can do that too. From here on, it's up to you.