This article describes how glitter works, and why. At this stage of glitter history, feedback and feature requests are most welcome!
The glitter package helps writing SPARQL queries by implementing an internal domain-specific language: with glitter, you write code that mostly looks like R code, and end up with a SPARQL query. For instance:
library("glitter")
query <- spq_init() %>%
spq_add("?item wdt:P31 wd:Q13442814") %>%
spq_add("?item rdfs:label ?itemTitle") %>%
spq_filter(str_detect(str_to_lower(itemTitle), 'wikidata')) %>%
spq_filter(lang(itemTitle) == "en") %>%
spq_head(n = 5)
query
#> PREFIX wd: <http://www.wikidata.org/entity/>
#> PREFIX wdt: <http://www.wikidata.org/prop/direct/>
#> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
#> SELECT ?item ?itemTitle
#> WHERE {
#>
#> ?item wdt:P31 wd:Q13442814.
#> ?item rdfs:label ?itemTitle.
#> FILTER(REGEX(LCASE(?itemTitle),"wikidata"))
#> FILTER(lang(?itemTitle)="en")
#> }
#>
#> LIMIT 5
The R code should therefore be easier to write, and read. The function names and syntax are meant to remind of the tidyverse, and of base R.
Code using glitter will feature:
- a call to
spq_init()
; - most often, calls to
spq_add()
to add SPARQL triple patterns to the query; - potentially, calls to
spq_set()
to define helper values likespq_set(species = c('wd:Q144','wd:Q146', 'wd:Q780'), mayorcode = "wd:Q30185")
; - potentially, calls to
spq_filter()
,spq_select()
to filter results; - potentially, calls to
spq_arrange()
,spq_head()
,spq_offset()
to order and trim results; - lastly a call to
spq_perform()
to send the query and return the results.
glitter query object
The query object is a list with elements such as the variables (vars), filters, etc. Later we might make it an actual class, maybe an R6 one?
It is built by the different calls to spq_
functions.
The SPARQL query string is assembled by spq_assemble()
.
Later we might add some linting at that stage.
glitter tooling
Under the hood, glitter uses
- the rlang package by Lionel Henry
and Hadley Wickham to encapsulate
...
arguments into quosures before handling them. Useful references were the Metaprogramming section of the Advanced R book by Hadley Wickham as well as the documentation of the rlang package. - the xmlparsedata package by Gábor Csárdi to parse the R looking code, and the xml2 package by Hadley Wickham, Jim Hester and Jeroen Ooms, to transform that code into SPARQL code with XPath. Useful references were the documentation of these two packages as well as searching for XPath query examples via a search engine.
- the httr2 package by Hadley Wickham package to perform queries. We learnt sending such queries thanks to the source code of the WikidataQueryServiceR package by Mikhail Popov.
More details in the next sections.
spq_add()
spq_add()
works differently from the other
spq_
functions because it looks closer to SPARQL.
Clearly something like
spq_add(query, "?item wdt:P31 wd:Q13442814")
does not look
like R code. The motivation for this is:
- easier copy-pasting from SPARQL query examples;
- the ability to keep the SPARQL concepts.
Triple patterns are parsed by decompose_triples
that
uses string manipulation.
Now, if one wants to go full DSL, it is possible, via
spq_filter()
and spq_mutate()
.
The triple pattern in
spq_add(query, "?item wdt:P31 wd:Q13442814")
means finding
items that are an instance of (“wdt:P31”) of a scholarly article
(“wd:Q13442814”). With glitter, you can also write it
spq_init() %>%
spq_filter(item == wdt::P31(wd::Q13442814))
This looks more like a normal tidyverse pipeline. Note that the
namespacing here is done the R way i.e. wdt::P31
as opposed
to “wdt:P31”.
Similary,
adds a variable that is “wdt:P1843” of Sonchus oleraceus (“wd:Q331676”). It can be written:
spq_init() %>%
spq_mutate(statement = wdt::P1843(wd::Q331676))
Other spq_
functions
The other spq_
functions spq_arrange()
,
spq_select()
, spq_mutate()
,
spq_mutate()
, spq_filter()
,
spq_summarize()
are the core of the DSL.
They have ...
as arguments where three different things
can be passed:
- R-looking snippets (the DSL), for instance
spq_filter(query, lang(itemTitle) == "en")
; - R-looking snippets as string (for programmatic use), for instance
spq_filter(query, 'lang(itemTitle) == "en"')
; - SPARQL snippets escaped with
spq()
(for copy-pasting from SPARQL examples), for instancespq_filter(query, spq('lang(?itemTitle)="en"'))
.
The names of their other arguments starts with a dot to prevent name clashes.
How do we differentiate these three things that users can pass?
- First, arguments are transformed into quosures with
rlang::enquos(...)
. - Each of them is passed to a function like
spq_treat_argument()
. “Like” as the more complex behavior ofspq_filter()
andspq_mutate()
, that can accept R-looking snippets that will be translated to either triple patterns or not, warrants a bit more logic. - In
spq_treat_argument()
we try evaluating the argument viarlang::eval_tidy()
.- If the result is of the class
spq
, then it means we can use the string as is, it was SPARQL. - If the result is another character, it is the code to be translated
(example:
'lang(itemTitle) == "en"'
). - If it is not, we need to get the text of the expression via
rlang::expr_text(arg) %>% stringr::str_replace("^~", "")
(example: the user wrotelang(itemTitle) == "en"
).
- If the result is of the class
- In the two latter cases, we have R-looking code to translate into a
SPARQL snippet. We first parse the code with xmlparsedata and xml2. Then
a series of transformations supported by XPath happen. The glitter
package contains a
tibble
calledall_correspondences
:
head(glitter::all_correspondences)
#> # A tibble: 6 × 3
#> R SPARQL args
#> <chr> <chr> <list>
#> 1 n COUNT <list [0]>
#> 2 sum SUM <list [0]>
#> 3 mean AVG <list [0]>
#> 4 min MIN <list [0]>
#> 5 max MAX <list [0]>
#> 6 sample SAMPLE <list [0]>
So all instances of n(blabla)
become
COUNT(blabla)
. We also transform argument names. Look at
the “SELECT” statement below, the str_c()
function becomes
GROUP_CONCAT()
and its argument SEPARATOR
.
Also note that the argument comes after a colon, not a comma like in
R.
spq_init() %>%
spq_summarise(authors = str_c(name, sep = ', '))
Later, we need to document these correspondences better, and we need to stress test the DSL with more cases using arguments.
Special case of spq_filter()
and
spq_mutate()
spq_filter()
receives R-looking fragments that are
translated into SPARQL snippets for FILTER… or triple patterns.
spq_mutate()
receives R-looking fragments that are
translated into SPARQL snippets for SELECT… or triple patterns.
At the moment the detection of which is which is based on
::
: if the R-looking fragment contains ::
, we
assume it will become a triple pattern. Later, we need to make this more
robust as the function spq_set()
makes it easier to create
synonyms for any subject/verb/object via SPARQL VALUES.
When we assume spq_filter()
/spq_mutate()
has received an R-looking fragment meant to be translated to a triple
pattern, it is parsed so, not forgetting the order is not the same in
the two cases:
-
spq_mutate(object = verb(subject))
; -
spq_filter(subject == verb(object))
.
R CMD check hack
The examples using something like
spq_init() %>%
spq_filter(item == wdt::P31(wd::Q13442814))
got flagged as if wdt were a dependency to be stated. This is understandable. To bypass it these examples are not examples, they are R chunks in a section called “Some examples”. This means they aren’t checked. Thankfully we have similar code in the real tests!