This first vignette shows how to use glitter to extract data from the Wikidata SPARQL endpoint. We imagine here a case study in which one is interested in the Wikidata items available regarding the Lyon metro network.
Find items and properties to build your query
To find the identifiers of items and properties of interest for a particular case study, you can:
- browse Wikidata
- use package WikidataR (functions WikidataR::find_item() and WikidataR::find_property()).
Here, we will explore the second option.
Let’s try and find the Wikidata identifier for the Lyon metro network:
WikidataR::find_item("Metro Lyon")
#>
#> Wikidata item search
#>
#> Number of results: 1
#>
#> Results:
#> 1 Lyon Metro (Q1552) - public transportation network in Lyon, France
So you’d be interested, for instance, in all the subway stations that are part of this network. Let’s try and find the property identifier that corresponds to this notion:
WikidataR::find_property("part of")
#>
#> Wikidata property search
#>
#> Number of results: 10
#>
#> Results:
#> 1 part of (P361) - object of which the subject is a part (if this subject is already part of object A which is a part of object B, then please only make the subject part of object A), inverse property of "has part" (P527, see also "has parts of the class" (P2670))
#> 2 parent organization (P749) - parent organization of an organization, opposite of subsidiaries (P355)
#> 3 published in (P1433) - larger work that a given work was published in, like a book, journal or music album
#> 4 constellation (P59) - the area of the celestial sphere of which the subject is a part (from a scientific standpoint, not an astrological one)
#> 5 on focus list of Wikimedia project (P5008) - property to indicate that an item is of particular interest for a Wikimedia project. This property does not add notability. Items should not be created with this property if they are not notable for Wikidata. See also P6104, P972, P2354.
#> 6 part of the series (P179) - series which contains the subject
#> 7 member of sports team (P54) - sports teams or clubs that the subject represents or represented
#> 8 transport network (P16) - network the infrastructure is a part of
#> 9 partially coincident with (P1382) - object that partially overlaps with the subject in its instances, parts, or members
#> 10 diaspora (P3833) - diaspora that a cultural group belongs to
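Among these results, “transport network” (P16, “network the infrastructure is a part of”) is the property that fits our case best. So you’re looking for all the items whose transport network (“wdt:P16”) is the Lyon metro network (“wd:Q1552”).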
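To make the link with SPARQL explicit, here is a minimal sketch of the query text that such a triple pattern amounts to. The helper build_query() is hypothetical (it is not part of glitter or WikidataR), and the exact text that glitter generates may differ. The wd: and wdt: prefixes are predefined on the Wikidata Query Service, so the query needs no PREFIX declarations there.

```r
# Hypothetical helper, for illustration only (not a glitter function):
# builds the SPARQL text for a single "?variable wdt:property wd:item" pattern.
build_query = function(variable, property, item) {
  sprintf("SELECT ?%s WHERE { ?%s wdt:%s wd:%s . }",
          variable, variable, property, item)
}

# "?items" is the unknown, P16 the property, Q1552 the Lyon metro network
query = build_query("items", "P16", "Q1552")
cat(query, "\n")
```

The leading “?” marks the unknown of the query, while the property and the item are fixed by their identifiers.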
Use glitter functions to start exploring data
The glitter functions can now be used to start exploring data.
We’re looking for items (the “unknown” in our query below, hence the use of a “?”) which are part of the Lyon metro network:
library(glitter)
library(dplyr) # provides %>% and the data-wrangling verbs used below
stations = spq_init() %>%
  spq_add("?items wdt:P16 wd:Q1552") %>%
  spq_perform()
head(stations)
#> # A tibble: 6 × 1
#> items
#> <chr>
#> 1 http://www.wikidata.org/entity/Q2944
#> 2 http://www.wikidata.org/entity/Q2965
#> 3 http://www.wikidata.org/entity/Q2969
#> 4 http://www.wikidata.org/entity/Q2976
#> 5 http://www.wikidata.org/entity/Q5298
#> 6 http://www.wikidata.org/entity/Q599865
To also get the labels for stations, we can use spq_label():
stations = spq_init() %>%
  spq_add("?items wdt:P16 wd:Q1552") %>%
  spq_label(items) %>%
  spq_perform()
head(stations)
#> # A tibble: 6 × 2
#> items items_label
#> <chr> <chr>
#> 1 http://www.wikidata.org/entity/Q599865 Place Guichard - Bourse du Travail
#> 2 http://www.wikidata.org/entity/Q613893 Hôtel de Ville - Louis Pradel
#> 3 http://www.wikidata.org/entity/Q776088 Cordeliers
#> 4 http://www.wikidata.org/entity/Q2944 Lyon Metro Line A
#> 5 http://www.wikidata.org/entity/Q2965 Lyon Metro Line B
#> 6 http://www.wikidata.org/entity/Q2969 Lyon Metro Line C
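The items column holds full entity URIs rather than bare identifiers. As a minimal base-R sketch (the two values are copied from the output above, and the variable names are only illustrative), the Q identifiers can be recovered by stripping the common URI prefix:

```r
# Two entity URIs copied from the query result above
uris = c("http://www.wikidata.org/entity/Q2944",
         "http://www.wikidata.org/entity/Q599865")

# Remove the entity prefix to keep only the Q identifier
qids = sub("^http://www\\.wikidata\\.org/entity/", "", uris)

# Prefixed form, as used inside triple patterns
prefixed = paste0("wd:", qids)
print(prefixed)
```

The prefixed form is the one used in the triple patterns passed to spq_add().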
Labelling
The query above, with spq_label(items), will return a table comprising both items (with the Wikidata identifiers) and items_label (with the human-readable label corresponding to these items).
If the Wikidata unique identifier is not particularly useful, one can use the argument .overwrite = TRUE so that only labels will be returned, under the shorter name items:
stations = spq_init() %>%
  spq_add("?items wdt:P16 wd:Q1552") %>%
  spq_label(items, .overwrite = TRUE) %>%
  spq_perform()
head(stations)
#> # A tibble: 6 × 1
#> items
#> <chr>
#> 1 Place Guichard - Bourse du Travail
#> 2 Hôtel de Ville - Louis Pradel
#> 3 Cordeliers
#> 4 Lyon Metro Line A
#> 5 Lyon Metro Line B
#> 6 Lyon Metro Line C
Detail query
Add another triple pattern
As it turns out, we currently get 48 items, which correspond not only to stations but also to other types of items such as metro lines. Let’s have a look at the item “Place Guichard - Bourse du Travail” (“wd:Q599865”), which we know corresponds to a station. We can do that e.g. through the Wikidata URL associated with this item.
The item page shows that it is an instance of (“wdt:P31”) the station class “wd:Q928830”. Hence, adding a triple pattern on the property “wdt:P31” (“is an instance of”) should enable us to collect specifically stations (“wd:Q928830”) instead of any part of the Lyon metro network.
stations = spq_init() %>%
  spq_add("?station wdt:P16 wd:Q1552") %>%
  spq_add("?station wdt:P31 wd:Q928830") %>% # added instruction
  spq_label(station, .overwrite = TRUE) %>%
  spq_perform()
dim(stations)
#> [1] 41 1
head(stations)
#> # A tibble: 6 × 1
#> station
#> <chr>
#> 1 Place Guichard - Bourse du Travail
#> 2 Hôtel de Ville - Louis Pradel
#> 3 Cordeliers
#> 4 Gare de Vénissieux
#> 5 Stade de Gerland
#> 6 Saxe - Gambetta
Get coordinates
If we want to get the geographical coordinates of these stations (property “wdt:P625”), we can proceed this way:
stations_coords = spq_init() %>%
  spq_add("?station wdt:P16 wd:Q1552") %>%
  spq_add("?station wdt:P31 wd:Q928830") %>%
  spq_add("?station wdt:P625 ?coords") %>% # added instruction
  spq_label(station, .overwrite = TRUE) %>%
  spq_perform()
dim(stations_coords)
#> [1] 41 2
head(stations_coords)
#> # A tibble: 6 × 2
#> coords station
#> <chr> <chr>
#> 1 Point(4.814544444 45.716669444) Gare d'Oullins
#> 2 Point(4.847308333 45.759261111) Place Guichard - Bourse du Travail
#> 3 Point(4.836022222 45.767377777) Hôtel de Ville - Louis Pradel
#> 4 Point(4.835894444 45.763511111) Cordeliers
#> 5 Point(4.88804 45.7058) Gare de Vénissieux
#> 6 Point(4.83084 45.72673) Stade de Gerland
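Each coords value is a WKT string of the form Point(longitude latitude). When plain numeric columns are enough, a minimal base-R sketch such as the following can split these strings by hand (the two values are copied from the output above; the variable names are only illustrative):

```r
# Two WKT strings copied from the query result above
coords = c("Point(4.814544444 45.716669444)",
           "Point(4.88804 45.7058)")

# Drop the leading "Point(" and the trailing ")"
inner = gsub("^Point\\(|\\)$", "", coords)

# Split "longitude latitude" on the space and convert to numbers
parts = strsplit(inner, " ", fixed = TRUE)
lon = as.numeric(vapply(parts, `[`, character(1), 1))
lat = as.numeric(vapply(parts, `[`, character(1), 2))
print(cbind(lon, lat))
```

Note the order: WKT stores x (longitude) first and y (latitude) second.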
The stations_coords tibble can be transformed into an sf (simple features) object using package sf:
stations_sf = sf::st_as_sf(stations_coords, wkt = "coords")
head(stations_sf)
#> Simple feature collection with 6 features and 1 field
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 4.814544 ymin: 45.7058 xmax: 4.88804 ymax: 45.76738
#> CRS: NA
#> # A tibble: 6 × 2
#> coords station
#> <POINT> <chr>
#> 1 (4.814544 45.71667) Gare d'Oullins
#> 2 (4.847308 45.75926) Place Guichard - Bourse du Travail
#> 3 (4.836022 45.76738) Hôtel de Ville - Louis Pradel
#> 4 (4.835894 45.76351) Cordeliers
#> 5 (4.88804 45.7058) Gare de Vénissieux
#> 6 (4.83084 45.72673) Stade de Gerland
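The printed header reports CRS: NA, meaning that no coordinate reference system is attached to the geometries. Wikidata coordinates for places on Earth are normally expressed in WGS 84 (EPSG:4326), so, assuming this holds for these stations, a minimal sketch for declaring it could look like the following. The toy table and its values are made up for illustration.

```r
library(sf)

# Toy table mimicking the structure of the query result (made-up values)
toy = data.frame(station = c("A", "B"),
                 coords = c("Point(4.8 45.7)", "Point(4.9 45.8)"))

# Parse the WKT column into point geometries
toy_sf = st_as_sf(toy, wkt = "coords")

# Declare (not transform) the coordinate reference system, assumed WGS 84
toy_sf = st_set_crs(toy_sf, 4326)
print(st_crs(toy_sf)$epsg)
```

st_set_crs() only labels the existing coordinates; it does not reproject them.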
The stations_sf object can then easily be used with, for instance, package leaflet:
library(leaflet)
leaflet(stations_sf) %>%
  addTiles() %>%
  addCircles(popup = ~station)
Add property qualifiers
Now, we would like to view not only the stations but also the connecting lines. One property is of particular interest in this respect: P197, which indicates which other stations a station is connected to. To form connecting lines, this information about the connections to other stations needs to be complemented with the line and direction of each connection. Hence, we are interested not only in the values of property P197, but also in the property qualifiers corresponding to the connecting line (P81) and direction (P5051).
We can thus complete our query this way, using the prefix p: to reach the statement node, ps: to get the value of the statement and pq: to get its qualifiers:
stations_adjacency = spq_init() %>%
  spq_add("?station wdt:P16 wd:Q1552") %>%
  spq_add("?station wdt:P31 wd:Q928830") %>%
  spq_add("?station wdt:P625 ?coords") %>%
  spq_add("?station p:P197 ?statement") %>%      # added instruction
  spq_add("?statement ps:P197 ?adjacent") %>%    # added instruction
  spq_add("?statement pq:P81 ?line") %>%         # added instruction
  spq_add("?statement pq:P5051 ?direction") %>%  # added instruction
  spq_label("station", "adjacent", "line", "direction", .overwrite = TRUE) %>%
  spq_select(-statement) %>%
  spq_perform() %>%
  na.omit() %>%
  select(coords, station, adjacent, line, direction)
head(stations_adjacency)
#> # A tibble: 6 × 5
#> coords station adjacent line direction
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Point(4.858061111 45.761733333) Gare Part-Dieu - Viv… Place G… Lyon… ""
#> 2 Point(4.83366 45.7314) Debourg Stade d… Lyon… ""
#> 3 Point(4.847308333 45.759261111) Place Guichard - Bou… Saxe - … Lyon… ""
#> 4 Point(4.859433 45.767164) Brotteaux Gare Pa… Lyon… ""
#> 5 Point(4.863119 45.770539) Charpennes - Charles… Brottea… Lyon… ""
#> 6 Point(4.8463 45.7543) Saxe - Gambetta Jean Ma… Lyon… ""
Now, we would like to put the stations in the right order so that we will be able to form the connecting lines.
This data-wrangling part is a bit tricky, though not because of any glitter-related operation.
We define a function form_line() which will put the rows in the table of stations in the correct order.
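form_line = function(adjacencies, direction) {
  N = nrow(adjacencies)
  num = rep(NA, N)
  # start from the row whose adjacent station is the terminus (the direction)
  ind = which(adjacencies$adjacent == direction)
  i = N
  num[ind] = i
  # walk back along the line, giving each row its rank
  while (i > 1) {
    # row whose adjacent station is the station of the current row
    indnew = which(adjacencies$adjacent == adjacencies$station[ind])
    ind = indnew
    i = i - 1
    num[ind] = i
  }
  # sort the rows by rank
  adjacencies = adjacencies %>%
    mutate(num = num) %>%
    arrange(num)
  # ordered station names, ending with the terminus
  adjacencies = c(adjacencies$station, direction)
  return(adjacencies)
}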
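To see the ordering logic on something small, here is a sketch on a made-up line A, B, C, D travelled towards the terminus D. The function form_line_base() is a hypothetical base-R restatement of the same backward walk, written without dplyr so that it runs on its own; it is not the function used in the rest of this vignette.

```r
# Hypothetical base-R restatement of the ordering logic (illustration only)
form_line_base = function(adjacencies, direction) {
  n = nrow(adjacencies)
  num = rep(NA, n)
  # row whose adjacent station is the terminus gets the last rank
  ind = which(adjacencies$adjacent == direction)
  i = n
  num[ind] = i
  # walk back: find the row leading to the station of the current row
  while (i > 1) {
    ind = which(adjacencies$adjacent == adjacencies$station[ind])
    i = i - 1
    num[ind] = i
  }
  # stations sorted by rank, followed by the terminus
  c(adjacencies$station[order(num)], direction)
}

# Made-up adjacency table, rows deliberately shuffled
toy = data.frame(station = c("B", "A", "C"),
                 adjacent = c("C", "B", "D"),
                 stringsAsFactors = FALSE)

ordered = form_line_base(toy, "D")
print(ordered)
```

The rows are ranked from the terminus backwards, which is why the table can arrive in any order.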
Now let’s apply this function to all possible lines and directions. Making full use of the tidyverse, we can apply this function iteratively without dropping the table-like structure of our data, using a combination of tidyr::nest() and purrr::map().
stations_lines = stations_adjacency %>%
  sf::st_drop_geometry() %>%       # ensure a regular tibble (no sf geometry)
  group_by(direction, line) %>%
  na.omit() %>%
  tidyr::nest(.key = "adj") %>%    # have nested "adj" table for each direction-line
  mutate(station = purrr::map(.x = adj, .y = direction,
                              ~form_line(.x, .y))) %>%
  tidyr::unnest(cols = "station") %>%
  ungroup()
We use left_join() to complete the table ordering the stations into lines with the coordinates of stations:
stations_lines = stations_lines %>%
  left_join(unique(stations_coords), # get corresponding coordinates
            by = c("station")) %>%
  na.omit()
head(stations_lines)
#> # A tibble: 6 × 5
#> line direction adj station coords
#> <chr> <chr> <list> <chr> <chr>
#> 1 Lyon Metro Line B "" <tibble [11 × 3]> Charpennes - Charles Her… Point…
#> 2 Lyon Metro Line B "" <tibble [11 × 3]> Brotteaux Point…
#> 3 Lyon Metro Line B "" <tibble [11 × 3]> Gare Part-Dieu - Vivier … Point…
#> 4 Lyon Metro Line B "" <tibble [11 × 3]> Place Guichard - Bourse … Point…
#> 5 Lyon Metro Line B "" <tibble [11 × 3]> Saxe - Gambetta Point…
#> 6 Lyon Metro Line B "" <tibble [11 × 3]> Jean Macé Point…
stations_lines is now a table which is properly formatted to be transformed into an sf lines object (the stations are in the right order for each line-direction, and the associated coordinates are provided in the table as WKT strings):
stations_lines_sf = stations_lines %>%
  sf::st_as_sf(wkt = "coords") %>%
  group_by(direction, line) %>%
  summarise(do_union = FALSE) %>%  # for each group, and keeping order of points,
  sf::st_cast("LINESTRING")        # form a linestring geometry
stations_lines_sf
#> Simple feature collection with 9 features and 2 fields
#> Geometry type: LINESTRING
#> Dimension: XY
#> Bounding box: xmin: 4.804185 ymin: 45.7016 xmax: 4.921998 ymax: 45.7855
#> CRS: NA
#> # A tibble: 9 × 3
#> # Groups: direction [9]
#> direction line coords
#> <chr> <chr> <LINESTRING>
#> 1 "" Lyon Metro Line B (4.863119 45.77054, 4.85…
#> 2 "Charpennes - Charles Hernu" Lyon Metro Line B (4.805357 45.71427, 4.80…
#> 3 "Cuire" Lyon Metro Line C (4.836022 45.76738, 4.83…
#> 4 "Gare de Vaise" Lyon Metro Line D (4.88804 45.7058, 4.8875…
#> 5 "Gare de Vénissieux" Lyon Metro Line D (4.80421 45.7794, 4.8054…
#> 6 "Hôtel de Ville - Louis Pradel" Lyon Metro Line C (4.83293 45.7855, 4.8275…
#> 7 "Perrache" Lyon Metro Line A (4.921998 45.76125, 4.90…
#> 8 "Saint-Just" Funiculaire de Sain… (4.82622 45.76)
#> 9 "Vaulx-en-Velin - La Soie" Lyon Metro Line A (4.829182 45.75302, 4.83…
We can now use this new object to display the Lyon metro lines on a leaflet map:
factpal <- colorFactor(topo.colors(8),
                       unique(stations_lines$line))
leaflet(data = stations_sf) %>%
  addTiles() %>%
  addCircles(popup = ~station) %>%
  addPolylines(data = stations_lines_sf,
               color = ~factpal(line), popup = ~line)