glitter design and internals

This article describes how glitter works, and why. At this stage of glitter history, feedback and feature requests are most welcome!

library(glitter)

The glitter package helps writing SPARQL queries by implementing an internal domain-specific language: with glitter, you write code that mostly looks like R code, and end up with a SPARQL query. For instance:

library("glitter")
query <- spq_init() %>%
  spq_add("?item wdt:P31 wd:Q13442814") %>%
  spq_add("?item rdfs:label ?itemTitle") %>%
  spq_filter(str_detect(str_to_lower(itemTitle), 'wikidata')) %>%
  spq_filter(lang(itemTitle) == "en") %>%
  spq_head(n = 5)

query
#> PREFIX wd: <http://www.wikidata.org/entity/>
#> PREFIX wdt: <http://www.wikidata.org/prop/direct/>
#> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
#> SELECT ?item ?itemTitle
#> WHERE {
#> 
#> ?item wdt:P31 wd:Q13442814.
#> ?item rdfs:label ?itemTitle.
#> FILTER(REGEX(LCASE(?itemTitle),"wikidata"))
#> FILTER(lang(?itemTitle)="en")
#> }
#> 
#> LIMIT 5

The R code should therefore be easier to write, and read. The function names and syntax are meant to remind of the tidyverse, and of base R.

Code using glitter will feature:

a call to spq_init();
most often, calls to spq_add() to add SPARQL triple patterns to the query;
potentially, calls to spq_set() to define helper values like spq_set(species = c('wd:Q144','wd:Q146', 'wd:Q780'), mayorcode = "wd:Q30185");
potentially, calls to spq_filter(), spq_select() to filter results;
potentially, calls to spq_arrange(), spq_head(), spq_offset() to order and trim results;
lastly a call to spq_perform() to send the query and return the results.

glitter query object

The query object is a list with elements such as the variables (vars), filters, etc. Later we might make it an actual class, maybe an R6 one?

It is built by the different calls to spq_ functions. The SPARQL query string is assembled by spq_assemble(). Later we might add some linting at that stage.

glitter tooling

Under the hood, glitter uses

the rlang package by Lionel Henry and Hadley Wickham to encapsulate ... arguments into quosures before handling them. Useful references were the Metaprogramming section of the Advanced R book by Hadley Wickham as well as the documentation of the rlang package.
the xmlparsedata package by Gábor Csárdi to parse the R looking code, and the xml2 package by Hadley Wickham, Jim Hester and Jeroen Ooms, to transform that code into SPARQL code with XPath. Useful references were the documentation of these two packages as well as searching for XPath query examples via a search engine.
the httr2 package by Hadley Wickham package to perform queries. We learnt sending such queries thanks to the source code of the WikidataQueryServiceR package by Mikhail Popov.

More details in the next sections.

spq_add()

spq_add() works differently from the other spq_ functions because it looks closer to SPARQL.

Clearly something like spq_add(query, "?item wdt:P31 wd:Q13442814") does not look like R code. The motivation for this is:

easier copy-pasting from SPARQL query examples;
the ability to keep the SPARQL concepts.

Triple patterns are parsed by decompose_triples that uses string manipulation.

Now, if one wants to go full DSL, it is possible, via spq_filter() and spq_mutate().

The triple pattern in spq_add(query, "?item wdt:P31 wd:Q13442814") means finding items that are an instance of (“wdt:P31”) of a scholarly article (“wd:Q13442814”). With glitter, you can also write it

spq_init() %>%
  spq_filter(item == wdt::P31(wd::Q13442814))

This looks more like a normal tidyverse pipeline. Note that the namespacing here is done the R way i.e. wdt::P31 as opposed to “wdt:P31”.

Similary,

spq_init() %>%
  spq_add("wd:Q331676 wdt:P1843 ?statement")

adds a variable that is “wdt:P1843” of Sonchus oleraceus (“wd:Q331676”). It can be written:

spq_init() %>%
  spq_mutate(statement = wdt::P1843(wd::Q331676))

Other `spq_` functions

The other spq_ functions spq_arrange(), spq_select(), spq_mutate(), spq_mutate() , spq_filter(), spq_summarize() are the core of the DSL.

They have ... as arguments where three different things can be passed:

R-looking snippets (the DSL), for instance spq_filter(query, lang(itemTitle) == "en");
R-looking snippets as string (for programmatic use), for instance spq_filter(query, 'lang(itemTitle) == "en"');
SPARQL snippets escaped with spq() (for copy-pasting from SPARQL examples), for instance spq_filter(query, spq('lang(?itemTitle)="en"')).

The names of their other arguments starts with a dot to prevent name clashes.

How do we differentiate these three things that users can pass?

First, arguments are transformed into quosures with rlang::enquos(...).
Each of them is passed to a function like spq_treat_argument(). “Like” as the more complex behavior of spq_filter() and spq_mutate(), that can accept R-looking snippets that will be translated to either triple patterns or not, warrants a bit more logic.
In spq_treat_argument() we try evaluating the argument via rlang::eval_tidy().
- If the result is of the class spq, then it means we can use the string as is, it was SPARQL.
- If the result is another character, it is the code to be translated (example: 'lang(itemTitle) == "en"').
- If it is not, we need to get the text of the expression via rlang::expr_text(arg) %>% stringr::str_replace("^~", "") (example: the user wrote lang(itemTitle) == "en").
In the two latter cases, we have R-looking code to translate into a SPARQL snippet. We first parse the code with xmlparsedata and xml2. Then a series of transformations supported by XPath happen. The glitter package contains a tibble called all_correspondences:

head(glitter::all_correspondences)
#> # A tibble: 6 × 3
#>   R      SPARQL args      
#>   <chr>  <chr>  <list>    
#> 1 n      COUNT  <list [0]>
#> 2 sum    SUM    <list [0]>
#> 3 mean   AVG    <list [0]>
#> 4 min    MIN    <list [0]>
#> 5 max    MAX    <list [0]>
#> 6 sample SAMPLE <list [0]>

So all instances of n(blabla) become COUNT(blabla). We also transform argument names. Look at the “SELECT” statement below, the str_c() function becomes GROUP_CONCAT() and its argument SEPARATOR. Also note that the argument comes after a colon, not a comma like in R.

spq_init() %>%
  spq_summarise(authors = str_c(name, sep = ', '))

Later, we need to document these correspondences better, and we need to stress test the DSL with more cases using arguments.

Special case of `spq_filter()` and `spq_mutate()`

spq_filter() receives R-looking fragments that are translated into SPARQL snippets for FILTER… or triple patterns. spq_mutate() receives R-looking fragments that are translated into SPARQL snippets for SELECT… or triple patterns.

At the moment the detection of which is which is based on ::: if the R-looking fragment contains ::, we assume it will become a triple pattern. Later, we need to make this more robust as the function spq_set() makes it easier to create synonyms for any subject/verb/object via SPARQL VALUES.

When we assume spq_filter()/spq_mutate() has received an R-looking fragment meant to be translated to a triple pattern, it is parsed so, not forgetting the order is not the same in the two cases:

spq_mutate(object = verb(subject));
spq_filter(subject == verb(object)).

R CMD check hack

The examples using something like

spq_init() %>%
  spq_filter(item == wdt::P31(wd::Q13442814))

got flagged as if wdt were a dependency to be stated. This is understandable. To bypass it these examples are not examples, they are R chunks in a section called “Some examples”. This means they aren’t checked. Thankfully we have similar code in the real tests!

Future work

The issue tracker of glitter is quite representative of future work, as well as all sentences starting with “Later” in this article. As stated at the very beginning of this article, your ideas and comments are welcome.