This is a short overview of the query facilities added to RDFLib. These are based on the July 2005 version
of the SPARQL draft
worked on at the W3C. For a lack of a better word, I refer to this
implementation as sparql-p.
Thanks to the work of Daniel Krech and mainly Michel Pelletier,
sparql-p is now fully integrated with the newer versions of
RDFLib (version
2.2.2 or later), whereas earlier versions were distributed as separate
packages. This integration has led to some minor adjustments in class naming
and structure compared to earlier versions. If you are looking for the
documentation of the separate package, please refer to an earlier
version of this document. Be warned, though, that the
earlier versions are now deprecated in favour of RDFLib 2.2.2 or later.
The SPARQL draft describes its facilities in terms of a query language. A
full SPARQL implementation should include a parser of that language mapping
on the underlying implementation. sparql-p does not
include such parser yet, only the underlying SPARQL engine and its API. The
description below shows how the mapping works. This also means that the API
is not the full implementation of SPARQL: some of the features should be left
to the parser that could use this API. This is the case, for example, of
named Graphs facilities that could be mapped using RDFLib Graph
instances: all query is performed on such an instance in the first
place! In any case, the implementation of sparql-p covers (I
believe) the most frequently used cases of SPARQL.
The SPARQL facilities ar based on a wrapper class called
SPARQLGraph around the basic Graph object defined
by RDFLib. Ie, all programs using sparql-p should be of the
form:
from rdflib.sparql import sparqlGraph sparqlGr = sparqlGraph.SPARQLGraph()
the sparqlGr object thus created has the same methods as a
rdflib.Graph type object would have, extended with the
sparql-p facilities. An alternative way of creating the
sparql-p graph is to use an existing rdflib.Graph
instance:
sparqlGr = sparqlGraph.SPARQLGraph(graph=myExistingGraph)
The basic SPARQL construct is as follows (using the query syntax of the SPARQL document):
SELECT ?a,?b,?c
WHERE { ?a P ?x .
Q ?b ?a .
?x R ?c
}
...
The meaning of this construction is simple: the '?a', '?b', etc, symbols
(referred to as 'unbound' symbols) are queried with the constraint that the
tuples listed in the WHERE clause are 'true', i.e., part of the
triple store. This functionality is translated into a Python method as:
from rdflib.sparql import GraphPattern
select = ("?a","?b","?c")
where = GraphPattern([("?a",P,"?x"),(Q,"?b","?a"),("?x",R,"?c")])
result = sparqlGr.query(select,where)
where result is a list of tuples, each giving possible
binding combinations for "?a", "?b", and "?c", respectively. P,
Q, R, etc, must be the rdflib incarnations of RDF
resources, i.e., URIRef-s, Literals, etc. The
object of each pattern can also be one of the following Python types:
integer, long, float,
string, unicode, datetime.date,
datetime.time, datetime.datetime; these are
transformed into a Literal with the corresponding XML Schema
datatype on the fly. This allows coding in the form:
select = ("?a","?b","?c")
where = GraphPattern([("?a",P,"?x"),(Q,"?b","?a"),("?x",R,"?c"),("?x",S,"Some Literal Here"),("?x",R,43)])
result = sparqlGr.query(select,where)
Note that the SPARQL draft mandates datetime only, not separate date and time, but it was obvious to add this into the Python implementation (and useful in practice). See also the note above on literals, as well as the additional section on datatypes.
As a further convenience to the user, if select consists of a
single entry, it is not necessary to use a tuple and just giving the string
value will do. Similarly, if the where consists of one single
tuple, the array construction may be skipped, and the single tuple is
accepted as an argument. Finally, if select consists of one
entry, result is a list of the values rather than tuples of
(single) values.
The GraphPattern class instance can be built up gradually via
the addPattern and addPatterns methods (the former
takes one tuple, the latter a list of tuples).
The draft describes nested patterns, too, but it also draws the attention to the fact that nested patterns can be turned into regular patterns by possibly repeating some patterns. In other words, nested patterns can be handled by a parser and is therefore not implemented on this API level.
The rdflib.sparql.SPARQLError exception is raised (beyond the
possible exceptions raised by rdflib) if there are inconsistencies in the
select or where clauses (e.g., the tuples do not
have the correct length or there are incorrect data in the tuples
themselves).
SPARQL makes it possible to constrain values through operators, like:
SELECT ?a,?b,?c
WHERE { ?a P ?x .
Q ?b ?a .
?x R ?c .
FILTER ?x < 10
}
...
The draft also refers to the fact that application specific functions can
also be used in the 'FILTER' part. There are two ways to translate this
feature into sparql-p (see below for a further discussion).
This version is based on constraints that refer to the whole binding of
the pattern and is therefore executed against the full binding once
available. Here is how it looks in sparql-p:
select = ("?a","?b","?c")
where = GraphPattern([("?a",P,"?x"),(Q,"?b","?a"),("?x",R,"?c"),("?x",S,"Some Literal Here")])
where.addConstraint(func1)
where.addConstraints([func2,func3,...])
result = sparqlGr.query(select,where)
Each function in the constraints is of the form:
def func1(bindingDir) :
....
return True # or False
where bindingDir is a dictionary of the possible binding, ie,
of the form {"?a" : Resource1, "?b" : Resource2, ...}. Adding
several constraints (in a list or via a series of addConstraint
methods) is equivalent to a logical conjunction of the individual
constraints.
As an extra help to operator writers, the bindingDir also
includes a special entry referring to the SPARQLGraph instance
in use via a special key:
from rdflib.sparql import graphKey
graph = bindingDir[graphKey]
This construction, ie, the global constraint, is the faithful representation of the SPARQL spec. Note that a number of operator methods are available to make the construction of the global constraints easier, see the separate section on that.
This version is based on a constraint that can be imposed on one specific (bound) pattern only. This is achieved by adding a fourth element to the tuple representing the pattern, e.g.:
select = ("?a","?b","?c")
where = GraphPattern([("?a",P,"?x",func),(Q,"?b","?a"),("?x",R,"?c"),("?x",S,"Some Literal Here")])
result = sparqlGr.query(select,where)
where func is a function with three arguments (the bound
version of the "?a", P, "?x" in the
example).
Functionally, the global constraint is a 'superset' of the per pattern constraint; in other words, anything that can be expressed by per pattern constraints can be achieved by global constraints. E.g., a method above can be expressed in two different ways:
select = ("?a","?b","?c")
where = GraphPattern([("?a",P,"?x"),(Q,"?b","?a"),("?x",R,"?c"),("?x",S,"Some Literal Here")])
where.addConstraint(lambda binding: int(binding["?x"]) < 10)
result = sparqlGr.query(select,where)
or:
select = ("?a","?b","?c")
where = GraphPattern([("?a",P,"?x",lambda a,P,x: int(x) < 10),(Q,"?b","?a"),("?x",R,"?c"),("?x",S,"Some Literal Here")])
result = sparqlGr.query(select,where)
However, the second version may be much more efficient. The search is 'cut' in the by the constraint, ie the binding tree is not (unnecessarily) expanded further, whereas a full binding tree must be generated for a global constraint (see the notes on the implementation below). For large triple stores and/or large patterns this may be a significant difference. A parser may optimize by generating per-pattern constraints in some cases to make use of this optimization, hence this alternative.
A slight variation of the basic scheme could be described as:
SELECT ?a,?b,?c
WHERE { { ?a P ?x . Q ?b ?a } UNION { S ?b ?a. ?x R ?c } }
...
(I hope my understanding is correct that...) the meaning a logical 'or' on
one of the clauses. This is expressed in sparql-p by allowing
the query method to accept a list of graph patterns, too, instead of single
patterns only:
select = ("?a","?b","?c")
where1 = GraphPattern([("?a",P,"?x"),(Q,"?b","?a")])
where1 = GraphPattern([(S,"?b","?a"),("?x",R,"?c")])
result = sparqlGr.query(select,[where1,where2])
The two queries are evaluated separately, and the concatenation of the results is returned.
Another variation on the basic query is the usage of 'optional' clauses:
SELECT ?a,?b,?c,?d
WHERE { ?a P ?x .
Q ?b ?a .
?x R ?c .
OPTIONAL { ?x S ?d. ... }
}
What this means is that if the fourth tuple (with ?x already bound) is not in the triple store, that should not invalidate the possible bindings of '?a', '?b', and '?c'; instead, the '?d' unbound variable should be set to a null value, but the remaining bindings should be returned. In other words first the following query is performed:
SELECT ?a,?b,?c
WHERE { ?a P ?x .
Q ?b ?a .
?x R ?c
}
then, for each possible bindings, a second query is done:
SELECT ?d
WHERE { X S ?d }
where X stands for a possible binding of ?x.
The sparql-p expression of this facility is based on the
creation of a separate graph pattern for the optional clause:
select = ("?a","?b","?c","?d")
where = GraphPattern([("?a",P,"?x"),[(Q,"?b","?a"),("?x",R,"?c")])
opt = GraphPattern([("?x",S,"?d")])
result = sparqlGr.query(select,where,opt)
and the (possible) unbound "?d" is set to None
in the return value. Just as for the 'main' pattern, the third argument of
the call can be a list of graph patterns (for several OPTIONAL clauses)
evaluated separately. Each of the OPTIONAL clauses can have their global
constraints.
The SPARQL draft includes several 'Query forms', which is a term to
control how the query results are returned to the caller. In the case of
sparql-p this is implemented via a separate Python class, called
Query. All query results yield, in fact, an instance of that
class, and various methods on that class are defined corresponding to the
SPARQL Query Forms. The SPARQLGraph.queryObject method can be
invoked instead of SPARQLGraph.query to return an instance of
such object. (In fact, the SPARQLGraph.query method, used in all
previous examples, is simply a convenient shorthand, see below.)
The SELECT SPARQL query forms are used to retrieve the query results.
Corresponding to the draft, the Query class has a
select method, with two (keyword) arguments:
distinct (with possible values True and
False) and limit (which is either a positive
integer or None). For example:
select = ("?a","?b","?c","?d")
where = GraphPattern([("?a",P,"?x"),[(Q,"?b","?a"),("?x",R,"?c")])
opt = GraphPattern([("?x",S,"?d")])
resultObject = sparqlGr.queryObject(where,opt)
result = resultObject.select(select,distinct=True,limit=5)
returns the first 5 query results, all distinct from one another. The
default for distinct is set True and the
limit is None. Ie, the
SPARQLGraph.query is, in fact, a shorthand for
SPARQLGraph.queryObject(where,...).select(select) (it is
probably the most widespread use of select hence this shorthand
method).
Note that it is possible to use the same class instance returned by
queryObject to run different selections (though the SPARQL draft
does not make this distinction); in other words, running the
select method does not change any internal variable of the
class.
The construct method can be invoked either with an explicit Graph Pattern
or without (the latter corresponds to the CONSTRUCT * of the
draft, the former to the case when a separate CONSTUCT pattern
is defined). In both cases, a separate SPARQLGraph instance is
returned containing the constructed triples. For example, the construction in
the draft:
CONSTRUCT { <http://example.org/person#Alice> FN ?name }
WHERE { ?x nm ?name }
corresponds to the sparql-p construction:
where = GraphPattern([("?x",nm,"?name"])
constructPattern = GraphPattern([(URIRef("http://example.org/person#Alice"),FN,"?name")])
resultObject = sparqlGr.queryObject(where)
result = resultObject.construct(constructPattern)
whereas the example:
CONSTRUCT * WHERE (?x N ?name)
corresponds to:
where = GraphPattern([("?x",N,"?name"])
resultObject = sparqlGr.queryObject(where)
result = resultObject.construct() # or resultObject.construct(None)
The current draft is pretty vague as to what this facility is (and leaves
is to the implementor). What SPARQLGraph implements is a form of
clustering. The Query.describe() method has a seed
argument (to serve as a seed for clustering) and two keyword arguments,
forward and backward, each a boolean. What it
means:
forward=True and backward=False generates a triple store
with a transitive closure for each result of the query and the seed:
take, recursively, all the properties and objects that start by a
specific resource.forward=False and backward=True the same as
forward but in the 'other direction'.forward=True and backward=True combines the two into one
triple store.forward=False and backward=False returns and empty triple
store.The SPARQL draft refers to an ASK query form., which simply
says whether the set of patterns represent a non-empty subgraph. This is done
by:
resultObject = sparqlGr.queryObject(where)
result = resultObject.ask()
The ask method returns False or True (whether the resulting
subgraph is empty or not, respectively).
The current implementation does not (yet) do a full implementation of all
the datatypes with the precise lexical representation as defined in the XML
Schema Datatype document (and referred to in the SPARQL document). In theory,
these should be taken care of by the underlying RDFLib layer when parsing
strings into datatypes, but it does not happen yet. sparql-p
does a partial conversion to have the vast majority of queries running
properly, but there are some restrictions:
string:integer, float, long:double:decimal:date:Literal-s as date.time:Literal-s as time.dateTime:Literal as a
dateTime.These mappings are used when a typed literal value is specified in a Graph
pattern, and a Literal instance is generated on-the-fly: the
Literal instance uses these lexical representations and the
corresponding XML Schema datatype are stored. When comparing values coming
from an RDF data and parsed by RDFLib, these lexical representations are
pre-supposed when comparing Literal instances.
SPARQL defines a number of possible operators for the AND clause. It is
not obvious at this point which of those should be left to a parser and which
of those should be implemented by the engine. sparql-p provides
a number of methods that can be used to create an elementary operator and
that can also be used in the AND clause. More complex constructions can be
done using Python’s lambda function, for example.
The available binary operator functions are: lt (for less
than), le (for less or equal), gt (for greater
than), ge (for greater or equal), and eq (for
equal). Each of these operator methods take two parameters, which are both
either a query string or a typed value, and each of these operators return a
function that can be plugged into a global constraint. (All these
methods should be imported from the
rdflib.sparql.sparqlOperators module.) For example, to add the
constraint:
FILTER ?m < 42
one can use:
constraints = [ lt("?m",42) ]
For the more complex case of the form:
FILTER ?m < 42 || ?n > 56
the lambda construction can be used:
constraints = [ lambda binding: lt("?m",42)(binding) or gt("?n",56)(binding) ]
The complicated case of how values of different types compare is left
completely to Python for the time being. If a comparison does not make sense,
the return value is False. When the Working Group gets to an
equilibrium point on this issue, this should be compared to what Python
does but this is currently a matter of debate in the group, too.
The module also offers a special operator called
isOnCollection that can be used as a global constraint to check
whether a resource is on a collection or not.
The SPARQL document also defines a
number of special operators. The following of those operators are
implemented: bound, isURI, isBlank,
isLiteral, str, lang,
datatype. For example:
pattern.addConstraint(isURI("?mbox"))
adds a constraint that the value bound to ?mbox must
be a real URI (as opposed to a literal), or
pattern.addConstraint( lambda binding: datatype("?d")(binding) == "http://www.myexampledatatype.org" )
checks whether the datatype of a bound resource is of a specific URI.
Whether this set of elementary operators is enough or not for the complete implementation of SPARQL is not yet clear. I presume the final answer will come when somebody writes a parser to the query language...
TheThe sparqlOperator module in the
package includes some methods that might be useful in creating more complex
constraint methods, such as getLiteralValue (to return the value
of a Literal, possibly making on-the-fly conversion for the known datatypes),
or getValue (to create a 'retrieval' method to return either the
original Resource or a bound resource in case of a query string parameter).
Look at the detailed method description for details.
The implementation of SPARQL is based on an expansion tree. Each layer in
the tree takes care of a statement in the WHERE clause, starting
by the first for the first layer, then the second statement for the second
layer, etc. Once the full expansion is done, the results for
SELECT are collected by visiting the leaves. In more details:
The root of the tree is created by handing over the full list of
statements, and a dictionary with the variable bindings. Initially, this
dictionary looks like {"?x": None, "?a" : None, ...}. The node
picks the first tuple in the where, replaces all unbound
variables by None and makes a RDFLib query to the triple store.
The result are all tuples in the triple store that conform to the pattern
expressed by the first where pattern tuple. For each of
those a child node is created, by handing over the rest of the triples
in the where clause, and a binding where some of the
None values are replaced by 'real' RDF resources. The children
follow the same process recursively. There are several ways for the recursion
to stop:
where pattern to handle, no tuples
are found in the triple store in the process. This means that the
corresponding branch does not produce valid results. (In the
implementation, such a node is marked as 'clashed'). The same happens if,
though a tuple is found, that tuple is rejected by the constraint
function assigned to this tuple (the "per-tuple" constraint).where clause, there are still unbound variablesThe third type of leaf contains a valid, possible query result for the
unbound variables. Once the expansion is complete, the collection of the
results becomes obvious: successful leaves are visited to return their
results as the binding for the variables appearing in the select
clause; the non-leaf nodes simply collect and combine the results of their
children.
The implementation of the 'optional' feature
follows the semantic description. A pre-processing step separates the
'regular' and 'optional' select and where clauses.
First a regular expansion is done; then, separate optional expansions (for
each optional clauses) are attached to each successful leaf node
(obviously, by binding all variables that can be bound on that level). The
collection of the result follows the same mechanism except that if the
optional expansion tree yields no results, the real result tuples are padded
by the necessary number of None-s.