<!-- Revision 1.2, Sat May 2 17:23:42 1998 UTC, by veillard: Updated scripts, added Coda mirror, published the mirroring paper, Daniel. -->

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>
Linux Packages Metadata Mirroring Proposal</title>
<meta name="GENERATOR" content="amaya V1.3">
</head>
<body bgcolor="#ffffff">

<h1 align=center>Linux Packages Metadata Mirroring Proposal</h1>
<p>
What does this title mean? Simply that a huge amount of precompiled software
is currently freely available for Linux, but it is hard to find the package
one needs:</p>
<ol>
<li>
It is difficult to locate the actual package that fulfills the needs of the
user (for example, "I want a graphics editor which supports the PNG format").
<li>
Once the program has been found, getting the binary version for a given system
setup (distribution version, architecture, etc.) is usually hard.
<li>
Getting that binary from the nearest server, to minimize the download time,
again proves difficult.
<li>
Even when the correct package has been found, it is sometimes impossible to
install because of missing dependencies, and the quest continues!
</ol>
<p>
I suggest here a mechanism to help find the packages needed. It is based on
the propagation of metadata (structured, machine-readable information) about
the available binary packages. It builds on the lessons learned from building
the <a href="http://rufus.w3.org/linux/RPM/">RPM database on rufus.w3.org</a>,
developing the <a href="http://rufus.w3.org/linux/rpm2html/">rpm2html</a>
program, the work done on <a href="http://www.w3.org/Metadata/">Metadata
at W3C</a>, and discussions with <a href="mailto:jim@jimpick.com">Jim Pick</a>
(Debian) and <a href="mailto:marc@redhat.com">Marc Ewing</a> (Red Hat).</p>
<p>
<img src="mirroring.gif" alt=" mirroring.gif "></p>
<p>
The picture above illustrates the four steps needed: to create, centralize,
propagate and expose package metadata.</p>

<h3>Extracting Metadata from binary packages</h3>
<p>
The basic idea is to extract useful information about a package, such as the
application name, revision, author and dependencies, and to save it in a
predefined format that can be parsed automatically. This is precisely <a
href="http://www.w3.org/Metadata/">Metadata</a> (data about data), and I
suggest using <a href="http://www.w3.org/TR/WD-rdf-syntax/">RDF</a>, the
metadata encoding proposed by <a href="http://www.w3.org/">W3C</a>, which is
based on <a href="http://www.w3.org/XML/">XML</a>. Ideally
the metadata description should be independent of the package format; in
practice it may not be. For example, package dependencies are more
sophisticated in <a href="http://www.debian.org/">Debian</a> packages than in
<a href="http://www.rpm.org/">RPM</a> ones. Here is an example of an RDF file
describing the RPM package "rpm2html-0.90-1.i386.rpm":</p>
<pre>&lt;?xml version="1.0"?>
&lt;?namespace href ="http://www.w3.org/TR/WD-rdf-syntax#/" AS = "RDF"?>
&lt;?namespace href ="http://www.rpm.org/" AS = "RPM"?>
&lt;RDF:RDF>
 &lt;RDF:Description RDF:HREF="ftp://ftp.redhat.com/pub/contrib/i386/rpm2html-0.90-1.i386.rpm">
  &lt;RPM:Name>rpm2html&lt;/RPM:Name>
  &lt;RPM:Version>0.90&lt;/RPM:Version>
  &lt;RPM:Release>1&lt;/RPM:Release>
  &lt;RPM:Distribution>Unknown&lt;/RPM:Distribution>
  &lt;RPM:Vendor>Daniel Veillard&lt;/RPM:Vendor>
  &lt;RPM:Size>13244&lt;/RPM:Size>
  &lt;RPM:URL>http://rufus.w3.org/linux/rpm2html/&lt;/RPM:URL>
  &lt;RPM:BuildDate>Sun Mar 29 19:44:53 EST 1998&lt;/RPM:BuildDate>
  &lt;RPM:BuildHost>rufus.w3.org&lt;/RPM:BuildHost>
  &lt;RPM:Group>X11/Applications&lt;/RPM:Group>
  &lt;RPM:Packager>Daniel Veillard&lt;/RPM:Packager>
  &lt;RPM:Summary>Translates rpm database into html info&lt;/RPM:Summary>
  &lt;RPM:Sources>ftp://ftp.redhat.com/pub/contrib/SRPM/rpm2html-0.90-1.src.rpm&lt;/RPM:Sources>
  &lt;RPM:Description>
Rpm2html tries to solve 2 big problems one face when
grabbing a RPM package from a mirror on the net and trying to
install it:

   - it gives more information than just the filename before
     installing the package.
   - it tries to solve the dependancy problem by analyzing all
     the Provides and Requires of the set of RPMs. It shows the
     cross references by the way of hypertext links.
  &lt;/RPM:Description>
  &lt;RPM:Provides>
     &lt;RDF:Bag>
       &lt;RPM:Resource>rpm2html&lt;/RPM:Resource>
     &lt;/RDF:Bag>
  &lt;/RPM:Provides>
  &lt;RPM:Requires>
     &lt;RDF:Bag>
       &lt;RPM:Resource>libz.so.1&lt;/RPM:Resource>
       &lt;RPM:Resource>libdb.so.2&lt;/RPM:Resource>
       &lt;RPM:Resource>libc.so.6&lt;/RPM:Resource>
       &lt;RPM:Resource>ld-linux.so.2&lt;/RPM:Resource>
     &lt;/RDF:Bag>
  &lt;/RPM:Requires>
  &lt;RPM:Files>
/etc/rpm2html.config
/usr/bin
/usr/bin/rpm2html
/usr/doc/rpm2html-0.85
/usr/doc/rpm2html-0.85/CHANGES
/usr/doc/rpm2html-0.85/COPYING
/usr/doc/rpm2html-0.85/PRINCIPLES
/usr/doc/rpm2html-0.85/README
/usr/doc/rpm2html-0.85/TODO
/usr/doc/rpm2html-0.85/config.small
/usr/man/man1/rpm2html.1
/usr/share/rpm2html/msg.de
/usr/share/rpm2html/msg.es
/usr/share/rpm2html/msg.fr
  &lt;/RPM:Files>
 &lt;/RDF:Description>
&lt;/RDF:RDF>
</pre>
<p>
While this description is definitely not suitable for a human reader, it can be
parsed easily (basic RDF support is already present in Mozilla, for example),
and numerous tools can take advantage of the metadata to process the associated
data.</p>
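<p>
To illustrate how little is needed to read such a description, here is a
minimal Python sketch. It is an assumption of this illustration, not part of
rpm2html: the function name <code>parse_package</code> and the shape of the
returned dictionary are hypothetical. The prefixes are treated as plain text
in element names, because the namespace processing instructions of this RDF
draft are not understood by namespace-aware parsers.</p>

```python
import xml.parsers.expat

# Shortened copy of the description above (only a few fields kept).
SAMPLE = """<?xml version="1.0"?>
<?namespace href ="http://www.w3.org/TR/WD-rdf-syntax#/" AS = "RDF"?>
<?namespace href ="http://www.rpm.org/" AS = "RPM"?>
<RDF:RDF>
 <RDF:Description RDF:HREF="ftp://ftp.redhat.com/pub/contrib/i386/rpm2html-0.90-1.i386.rpm">
  <RPM:Name>rpm2html</RPM:Name>
  <RPM:Version>0.90</RPM:Version>
  <RPM:Requires>
     <RDF:Bag>
       <RPM:Resource>libz.so.1</RPM:Resource>
       <RPM:Resource>libc.so.6</RPM:Resource>
     </RDF:Bag>
  </RPM:Requires>
 </RDF:Description>
</RDF:RDF>
"""


def parse_package(rdf_text):
    """Return a dict such as {'href': ..., 'Name': ..., 'Requires': [...]}."""
    package = {}
    stack = []  # names of the currently open elements
    text = []   # character data seen since the last tag

    def start(name, attrs):
        stack.append(name)
        text.clear()
        if name == "RDF:Description":
            package["href"] = attrs.get("RDF:HREF")

    def end(name):
        value = "".join(text).strip()
        text.clear()
        stack.pop()
        if name == "RPM:Resource":
            # The enclosing RPM:Provides / RPM:Requires element sits two
            # levels up, above the RDF:Bag container.
            owner = stack[-2].split(":", 1)[1]
            package.setdefault(owner, []).append(value)
        elif name.startswith("RPM:") and value:
            package[name.split(":", 1)[1]] = value

    # No namespace separator is given, so "RPM:Name" stays a plain name.
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = text.append
    parser.Parse(rdf_text, True)
    return package


if __name__ == "__main__":
    print(parse_package(SAMPLE))
```

<p>
The sketch handles a single <code>RDF:Description</code> per file; a file
describing several packages would need one dictionary per description.</p>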
<p>
As has been demonstrated for both RPM and Debian packages, this metadata can
easily be generated from the packages themselves, and it is usually much
smaller than the binary packages.</p>
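<p>
The generation step can be sketched as follows. This is only an illustration
under stated assumptions: the package fields are supposed to have been read
already from the package header by some other tool, and the function name
<code>package_to_rdf</code> and its parameters are hypothetical. The output
follows the layout of the example above.</p>

```python
from xml.sax.saxutils import escape, quoteattr


def package_to_rdf(url, fields, provides=(), requires=()):
    """Serialize already-extracted package fields as an RDF description.

    `fields` maps a field name (assumed to be a valid element name, such as
    "Name" or "Version") to its value. Reading the fields from the binary
    package is outside the scope of this sketch.
    """
    lines = [
        '<?xml version="1.0"?>',
        '<?namespace href ="http://www.w3.org/TR/WD-rdf-syntax#/" AS = "RDF"?>',
        '<?namespace href ="http://www.rpm.org/" AS = "RPM"?>',
        "<RDF:RDF>",
        " <RDF:Description RDF:HREF=%s>" % quoteattr(url),
    ]
    for name, value in fields.items():
        # escape() protects the markup characters &, < and > in the values.
        lines.append("  <RPM:%s>%s</RPM:%s>" % (name, escape(str(value)), name))
    for tag, resources in (("Provides", provides), ("Requires", requires)):
        if not resources:
            continue
        lines.append("  <RPM:%s>" % tag)
        lines.append("     <RDF:Bag>")
        for resource in resources:
            lines.append("       <RPM:Resource>%s</RPM:Resource>" % escape(resource))
        lines.append("     </RDF:Bag>")
        lines.append("  </RPM:%s>" % tag)
    lines.append(" </RDF:Description>")
    lines.append("</RDF:RDF>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(package_to_rdf(
        "ftp://ftp.redhat.com/pub/contrib/i386/rpm2html-0.90-1.i386.rpm",
        {"Name": "rpm2html", "Version": "0.90", "Release": "1"},
        provides=["rpm2html"],
        requires=["libz.so.1", "libc.so.6"],
    ))
```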

<h3>Centralizing the Metadata</h3>
<p>
Since no single entity can be expected to generate the metadata for all the
available Linux binary packages, some sort of distributed work is needed. For
example, each maintainer of a Linux distribution or of a set of packages can
extract the metadata and make it available along with the packages.</p>
<p>
The next step is to centralize this metadata to build a database that is as
complete as possible. This can be done by mirroring the metadata provided by
the various maintainers. The key point is that, for each package, the metadata
has to be generated only once and uploaded only once to the repository.</p>
<p>
Why centralize? Simply because the metadata for a single package is not very
useful on its own; the cross references one can obtain by gathering the
metadata and following the references between packages are usually far more
useful.</p>
<p>
Moreover, the bigger the database, the higher the probability of being able to
answer a request from the associated data. Centralizing also makes spreading
the data much easier.</p>
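<p>
The value of gathering the metadata can be shown with a small Python sketch
that computes cross references between packages coming from different
maintainers. It assumes that each description has already been loaded into a
dictionary with <code>href</code>, <code>Provides</code> and
<code>Requires</code> entries; the function names and the
<code>mirror.example.org</code> URL are hypothetical.</p>

```python
from collections import defaultdict


def build_provides_index(packages):
    """Map each provided resource to the URLs of the packages providing it."""
    index = defaultdict(list)
    for pkg in packages:
        for resource in pkg.get("Provides", []):
            index[resource].append(pkg["href"])
    return index


def cross_references(packages):
    """For each package, map every required resource to its providers.

    A resource with an empty provider list is not satisfied by any package
    known to the central database.
    """
    index = build_provides_index(packages)
    return {
        pkg["href"]: {res: list(index.get(res, [])) for res in pkg.get("Requires", [])}
        for pkg in packages
    }


if __name__ == "__main__":
    packages = [
        {   # taken from the example description
            "href": "ftp://ftp.redhat.com/pub/contrib/i386/rpm2html-0.90-1.i386.rpm",
            "Provides": ["rpm2html"],
            "Requires": ["libz.so.1", "libdb.so.2"],
        },
        {   # hypothetical package published by another maintainer
            "href": "ftp://mirror.example.org/i386/zlib.rpm",
            "Provides": ["libz.so.1"],
            "Requires": [],
        },
    ]
    for href, needs in cross_references(packages).items():
        print(href, needs)
```

<p>
With only one maintainer's metadata, the <code>libz.so.1</code> requirement
above would stay unresolved; it is the merged database that makes the link
visible.</p>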

<h3>Propagating the Metadata</h3>
<p>
Once the metadata has been gathered in a single place, setting up a mirroring
scheme is easy and propagates the data efficiently to sites near the final
user. This is the basic mechanism used for FTP mirrors; it is well known and
quite effective. The goal is to offer services based on this metadata and to
install them as close as possible to the final user.</p>
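<p>
One possible way for a mirror to decide what to transfer is sketched below in
Python. This proposal does not specify the mirroring protocol, so everything
here is an assumption made for illustration: the manifest (a mapping from a
metadata file path to a digest of its content), the choice of SHA-256, the
function names and the file names.</p>

```python
import hashlib


def digest(data):
    """Hex digest of a metadata file's bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def plan_sync(master, local):
    """Compare two manifests mapping a relative path to a content digest.

    Returns (to_fetch, to_delete): the paths that are new or changed on the
    master site, and the paths that no longer exist there.
    """
    to_fetch = sorted(path for path, d in master.items() if local.get(path) != d)
    to_delete = sorted(path for path in local if path not in master)
    return to_fetch, to_delete


if __name__ == "__main__":
    # Hypothetical file names and contents.
    master = {
        "i386/rpm2html-0.90-1.rdf": digest(b"new description"),
        "i386/zlib.rdf": digest(b"unchanged"),
    }
    local = {
        "i386/rpm2html-0.90-1.rdf": digest(b"old description"),
        "i386/zlib.rdf": digest(b"unchanged"),
        "i386/obsolete.rdf": digest(b"removed upstream"),
    }
    fetch, delete = plan_sync(master, local)
    print("fetch:", fetch)
    print("delete:", delete)
```

<p>
Because the metadata files are small compared with the binary packages, a
mirror following such a plan transfers very little data on each update.</p>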

<h3>Exposing the Metadata</h3>
<p>
Once the metadata is available, a large number of tools can be built to expose
and use its content. A basic idea is to build directories for searching and
locating binary packages. For example, the rpm2html tool is being
modified to accept RDF metadata as input instead of the binary packages. This
allows the maintainer of such a database to point to a nearby mirror of the
binary packages. Many other tools can be built on the metadata, such as
automatic package checkers, smart installers that follow the metadata to
retrieve and install the latest packages together with the correct
dependencies, and tools for easier management of clusters.</p>
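<p>
As an illustration of the "smart installer" idea, the following Python sketch
computes an installation order from the Provides and Requires information. It
is not an existing tool: the function name <code>resolve</code>, the
dictionary layout and the <code>zlib</code> sample package are hypothetical,
and the first package found providing a resource is simply assumed to be the
right one.</p>

```python
def resolve(target, packages, installed=frozenset()):
    """Return (order, missing) for installing the package named `target`.

    `order` lists package names with dependencies first; `missing` lists the
    required resources that no known package provides and that are not
    already in `installed`.
    """
    providers = {}
    for pkg in packages:
        for resource in pkg["Provides"]:
            providers.setdefault(resource, pkg)  # first provider wins
    by_name = {pkg["Name"]: pkg for pkg in packages}

    order, seen, missing = [], set(), set()

    def visit(pkg):
        if pkg["Name"] in seen:  # also stops on circular dependencies
            return
        seen.add(pkg["Name"])
        for resource in pkg["Requires"]:
            if resource in installed:
                continue
            provider = providers.get(resource)
            if provider is None:
                missing.add(resource)
            else:
                visit(provider)
        order.append(pkg["Name"])  # appended after its dependencies

    visit(by_name[target])
    return order, sorted(missing)


if __name__ == "__main__":
    packages = [
        {"Name": "rpm2html", "Provides": ["rpm2html"],
         "Requires": ["libz.so.1", "libc.so.6"]},
        {"Name": "zlib", "Provides": ["libz.so.1"],
         "Requires": ["libc.so.6"]},
    ]
    print(resolve("rpm2html", packages, installed={"libc.so.6"}))
```

<p>
A real installer would also have to compare versions and choose between
several providers, which this sketch leaves out.</p>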
<address>
<p>
<a href="mailto:veillard@w3.org">Daniel Veillard</a> </p>
</address>
<p>
$Id: mirroring.html,v 1.2 1998/05/02 17:23:42 veillard Exp $</p>

</body>
</html>
