
This first vignette shows how to use glitter to extract data from the Wikidata SPARQL endpoint. As a case study, we look at the Wikidata items related to the Lyon metro network.

Find items and properties to build your query

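To find the identifiers of items and properties of interest for a particular case study, you can search Wikidata directly or use the WikidataR package (WikidataR::find_item() and WikidataR::find_property()), as shown below.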

Let’s try and find the Wikidata identifier for the Lyon metro network:

WikidataR::find_item("Metro Lyon")
#> 
#>  Wikidata item search
#> 
#> Number of results:    1 
#> 
#> Results:
#> 1     Lyon Metro (Q1552) - public transportation network in Lyon, France

So you’d be interested, for instance, in all the subway stations that are part of this network. Let’s try and find the property identifier that corresponds to this notion:

WikidataR::find_property("part of")
#> 
#>  Wikidata property search
#> 
#> Number of results:    10 
#> 
#> Results:
#> 1     part of (P361) - object of which the subject is a part (if this subject is already part of object A which is a part of object B, then please only make the subject part of object A), inverse property of "has part" (P527, see also "has parts of the class" (P2670)) 
#> 2     parent organization (P749) - parent organization of an organization, opposite of subsidiaries (P355) 
#> 3     published in (P1433) - larger work that a given work was published in, like a book, journal or music album 
#> 4     constellation (P59) - the area of the celestial sphere of which the subject is a part (from a scientific standpoint, not an astrological one) 
#> 5     on focus list of Wikimedia project (P5008) - property to indicate that an item is of particular interest for a Wikimedia project. This property does not add notability. Items should not be created with this property if they are not notable for Wikidata. See also P6104, P972, P2354. 
#> 6     part of the series (P179) - series which contains the subject 
#> 7     member of sports team (P54) - sports teams or clubs that the subject represents or represented 
#> 8     transport network (P16) - network the infrastructure is a part of 
#> 9     partially coincident with (P1382) - object that partially overlaps with the subject in its instances, parts, or members 
#> 10    diaspora (P3833) - diaspora that a cultural group belongs to

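So you’re looking for all the stations whose transport network (“wdt:P16”) is the Lyon metro network (“wd:Q1552”). Note that the query relies on “transport network” (P16, result 8 above) rather than on the more generic “part of” (P361).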

Use glitter functions to start exploring data

The glitter functions can now be used to start exploring data.

We’re looking for items (the “unknown” in our query below, hence the use of a “?”) which are part of the Lyon metro network:

stations = spq_init() %>% 
  spq_add("?items wdt:P16 wd:Q1552") %>% 
  spq_perform()

head(stations)
#> # A tibble: 6 × 1
#>   items                                 
#>   <chr>                                 
#> 1 http://www.wikidata.org/entity/Q2944  
#> 2 http://www.wikidata.org/entity/Q2965  
#> 3 http://www.wikidata.org/entity/Q2969  
#> 4 http://www.wikidata.org/entity/Q2976  
#> 5 http://www.wikidata.org/entity/Q5298  
#> 6 http://www.wikidata.org/entity/Q599865
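
The items column holds full entity URIs. When only the short identifiers (such as Q2944) are needed, the common prefix can be stripped. Here is a minimal sketch in base R, assuming every value starts with http://www.wikidata.org/entity/ as in the output above; the helper name entity_id() is hypothetical and not part of glitter:

```r
# Hypothetical helper (not part of glitter): keep only the identifier
# found at the end of a Wikidata entity URI.
entity_id = function(uri) {
  sub("^http://www\\.wikidata\\.org/entity/", "", uri)
}

# Two URIs copied from the output above.
uris = c("http://www.wikidata.org/entity/Q2944",
         "http://www.wikidata.org/entity/Q599865")
ids = entity_id(uris)
ids
```

Values that do not start with the expected prefix are returned unchanged, so the assumption is worth checking on real results.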

To also get the labels for stations, we can use spq_label():

stations = spq_init() %>% 
  spq_add("?items wdt:P16 wd:Q1552") %>% 
  spq_label(items) %>% 
  spq_perform()

head(stations)
#> # A tibble: 6 × 2
#>   items                                  items_label                       
#>   <chr>                                  <chr>                             
#> 1 http://www.wikidata.org/entity/Q599865 Place Guichard - Bourse du Travail
#> 2 http://www.wikidata.org/entity/Q613893 Hôtel de Ville - Louis Pradel     
#> 3 http://www.wikidata.org/entity/Q776088 Cordeliers                        
#> 4 http://www.wikidata.org/entity/Q2944   Lyon Metro Line A                 
#> 5 http://www.wikidata.org/entity/Q2965   Lyon Metro Line B                 
#> 6 http://www.wikidata.org/entity/Q2969   Lyon Metro Line C

Labelling

The query above, with spq_label(items), will return a table comprising both items (with the Wikidata identifiers) and items_label (with the human-readable label corresponding to these items).

If the Wikidata unique identifier is not particularly useful, one can use the argument .overwrite = TRUE so that only labels will be returned, under the shorter name items:

stations = spq_init() %>% 
  spq_add("?items wdt:P16 wd:Q1552") %>% 
  spq_label(items, .overwrite = TRUE) %>% 
  spq_perform()

head(stations)
#> # A tibble: 6 × 1
#>   items                             
#>   <chr>                             
#> 1 Place Guichard - Bourse du Travail
#> 2 Hôtel de Ville - Louis Pradel     
#> 3 Cordeliers                        
#> 4 Lyon Metro Line A                 
#> 5 Lyon Metro Line B                 
#> 6 Lyon Metro Line C

Detail query

Add another triple pattern

As it turns out, we currently get 48 items, which correspond not only to stations but also to other types of items such as metro lines. Let’s have a look at the item “Place Guichard - Bourse du Travail” (“wd:Q599865”), which we know corresponds to a station. We can do that, for example, through the Wikidata URL associated with this item.

Based on that item, the property “wdt:P31” (“instance of”) should enable us to select stations specifically (“wd:Q928830”) instead of any part of the Lyon metro network.

stations = spq_init() %>% 
  spq_add("?station wdt:P16 wd:Q1552") %>% 
  spq_add("?station wdt:P31 wd:Q928830") %>%   # added instruction
  spq_label(station, .overwrite = TRUE) %>% 
  spq_perform()

dim(stations)
#> [1] 41  1
head(stations)
#> # A tibble: 6 × 1
#>   station                           
#>   <chr>                             
#> 1 Place Guichard - Bourse du Travail
#> 2 Hôtel de Ville - Louis Pradel     
#> 3 Cordeliers                        
#> 4 Gare de Vénissieux                
#> 5 Stade de Gerland                  
#> 6 Saxe - Gambetta
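
Each spq_add() call above adds a triple pattern on the same ?station variable, so a result must satisfy both patterns. The sketch below mimics that logic in base R on a tiny hand-made table of triples. The table is illustrative only (Q_line is a made-up placeholder, not a real Wikidata identifier), and this is not how glitter or the SPARQL endpoint actually evaluates the query:

```r
# Toy triples, written by hand for illustration (not queried from Wikidata).
triples = data.frame(
  subject   = c("Q599865", "Q599865", "Q2944", "Q2944"),
  predicate = c("P16",     "P31",     "P16",   "P31"),
  object    = c("Q1552",   "Q928830", "Q1552", "Q_line"),  # Q_line: made-up placeholder
  stringsAsFactors = FALSE
)

# Subjects matching a single pattern "?s <predicate> <object>".
match_pattern = function(triples, predicate, object) {
  triples$subject[triples$predicate == predicate & triples$object == object]
}

in_network = match_pattern(triples, "P16", "Q1552")    # station and line
is_station = match_pattern(triples, "P31", "Q928830")  # station only

# Two patterns on the same variable behave like a logical AND.
both = intersect(in_network, is_station)
both
```

Only the subject that matches both patterns is kept, which is why the second spq_add() call reduces the result from 48 to 41 rows in the real query.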

Get coordinates

If we want to get the geographical coordinates of these stations (property “wdt:P625”), we can proceed this way:

stations_coords = spq_init() %>% 
  spq_add("?station wdt:P16 wd:Q1552") %>% 
  spq_add("?station wdt:P31 wd:Q928830") %>%
  spq_add("?station wdt:P625 ?coords") %>%      # added instruction
  spq_label(station, .overwrite = TRUE) %>% 
  spq_perform()

dim(stations_coords)
#> [1] 41  2
head(stations_coords)
#> # A tibble: 6 × 2
#>   coords                          station                           
#>   <chr>                           <chr>                             
#> 1 Point(4.814544444 45.716669444) Gare d'Oullins                    
#> 2 Point(4.847308333 45.759261111) Place Guichard - Bourse du Travail
#> 3 Point(4.836022222 45.767377777) Hôtel de Ville - Louis Pradel     
#> 4 Point(4.835894444 45.763511111) Cordeliers                        
#> 5 Point(4.88804 45.7058)          Gare de Vénissieux                
#> 6 Point(4.83084 45.72673)         Stade de Gerland

This tibble can be transformed into a simple features (sf) object using package sf:

stations_sf = st_as_sf(stations_coords, wkt = "coords")
head(stations_sf)
#> Simple feature collection with 6 features and 1 field
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 4.814544 ymin: 45.7058 xmax: 4.88804 ymax: 45.76738
#> CRS:           NA
#> # A tibble: 6 × 2
#>                coords station                           
#>               <POINT> <chr>                             
#> 1 (4.814544 45.71667) Gare d'Oullins                    
#> 2 (4.847308 45.75926) Place Guichard - Bourse du Travail
#> 3 (4.836022 45.76738) Hôtel de Ville - Louis Pradel     
#> 4 (4.835894 45.76351) Cordeliers                        
#> 5   (4.88804 45.7058) Gare de Vénissieux                
#> 6  (4.83084 45.72673) Stade de Gerland

The resulting object can then be used directly with, for instance, package leaflet:

leaflet(stations_sf) %>%
  addTiles() %>%
  addCircles(popup = ~station)
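
As an aside, when only numeric longitude and latitude are needed, the WKT strings in the coords column can also be parsed without sf. Here is a minimal sketch in base R, assuming every value follows the "Point(lon lat)" pattern seen in the output above; parse_wkt_point() is a hypothetical helper, not a function from glitter or sf:

```r
# Hypothetical helper: split "Point(lon lat)" strings into numeric columns.
# Assumes every value follows exactly that pattern.
parse_wkt_point = function(wkt) {
  inner = sub("^Point\\((.*)\\)$", "\\1", wkt)   # keep "lon lat"
  parts = strsplit(inner, " ", fixed = TRUE)     # list of c(lon, lat)
  data.frame(
    lon = as.numeric(vapply(parts, `[`, character(1), 1)),
    lat = as.numeric(vapply(parts, `[`, character(1), 2))
  )
}

# Two values copied from the coords column above.
xy = parse_wkt_point(c("Point(4.88804 45.7058)", "Point(4.83084 45.72673)"))
xy
```

For anything beyond a quick look at the numbers, the sf route shown above remains the more robust option, since it handles other geometry types and coordinate reference systems.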

Add property qualifiers

Now, we would like to view not only the stations but also the connecting lines. One property is of particular interest in this respect: P197, which indicates which other stations a station is connected to. To form connecting lines, this information about adjacent stations needs to be complemented by the line and direction of each connection. Hence, we are interested not only in the values of property P197, but also in its qualifiers for the connecting line (P81) and direction (P5051).

We can thus complete our query this way:

stations_adjacency = spq_init() %>% 
  spq_add("?station wdt:P16 wd:Q1552") %>% 
  spq_add("?station wdt:P31 wd:Q928830") %>%
  spq_add("?station wdt:P625 ?coords") %>%
  spq_add("?station p:P197 ?statement") %>%          # added instruction
  spq_add("?statement ps:P197 ?adjacent") %>%        # added instruction
  spq_add("?statement pq:P81 ?line") %>%             # added instruction
  spq_add("?statement pq:P5051 ?direction") %>%      # added instruction
  spq_label("station", "adjacent", "line", "direction", .overwrite = TRUE) %>% 
  spq_select(-statement) %>% 
  spq_perform() %>% 
  na.omit() %>% 
  select(coords,station,adjacent,line,direction)

head(stations_adjacency)
#> # A tibble: 6 × 5
#>   coords                          station               adjacent line  direction
#>   <chr>                           <chr>                 <chr>    <chr> <chr>    
#> 1 Point(4.858061111 45.761733333) Gare Part-Dieu - Viv… Place G… Lyon… ""       
#> 2 Point(4.83366 45.7314)          Debourg               Stade d… Lyon… ""       
#> 3 Point(4.847308333 45.759261111) Place Guichard - Bou… Saxe - … Lyon… ""       
#> 4 Point(4.859433 45.767164)       Brotteaux             Gare Pa… Lyon… ""       
#> 5 Point(4.863119 45.770539)       Charpennes - Charles… Brottea… Lyon… ""       
#> 6 Point(4.8463 45.7543)           Saxe - Gambetta       Jean Ma… Lyon… ""
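
The query above mixes several prefixes. In the Wikidata RDF model, wdt: points directly to the value of a property, p: points to the statement node, and ps: and pq: read the main value and the qualifiers of that statement, respectively. The sketch below expands these prefixes into their full URIs in base R; expand_prefix() is a hypothetical helper written for this illustration, not a glitter function:

```r
# Wikidata namespaces for the prefixes used in this vignette.
wikidata_prefixes = c(
  wd  = "http://www.wikidata.org/entity/",
  wdt = "http://www.wikidata.org/prop/direct/",
  p   = "http://www.wikidata.org/prop/",
  ps  = "http://www.wikidata.org/prop/statement/",
  pq  = "http://www.wikidata.org/prop/qualifier/"
)

# Hypothetical helper: expand a single "prefix:identifier" term.
expand_prefix = function(term) {
  pieces = strsplit(term, ":", fixed = TRUE)[[1]]
  paste0(wikidata_prefixes[[pieces[1]]], pieces[2])
}

expand_prefix("p:P197")   # statement node for P197
expand_prefix("pq:P81")   # qualifier P81 on that statement
```

This is why the query first binds ?statement through p:P197 and only then reads ps:P197, pq:P81 and pq:P5051 from that same statement.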

Now, we would like to put the stations in the right order so that we will be able to form the connecting lines.

This data-wrangling part is a bit tricky, though the difficulty does not come from any glitter-related operation.

We define a function form_line() which will put the rows in the table of stations in the correct order.

form_line = function(adjacencies, direction) {
  N = nrow(adjacencies)
  num = rep(NA,N)
  ind = which(adjacencies$adjacent == direction)
  i = N
  num[ind] = i
  while (i>1) {
    indnew = which(adjacencies$adjacent == adjacencies$station[ind])
    ind = indnew
    i = i-1
    num[ind] = i
  }
  adjacencies = adjacencies %>% 
    mutate(num = num) %>%
    arrange(num) 
  adjacencies = c(adjacencies$station, direction)
  return(adjacencies)
}
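
To see what this function does, it can help to run the same logic on a toy line. The sketch below restates it in base R (order() replaces mutate() and arrange(), so dplyr is not needed) under the name form_line_base(), which is hypothetical. The toy table is made up and assumes each station has exactly one predecessor in the given direction:

```r
# Base-R restatement of form_line(), for illustration only.
form_line_base = function(adjacencies, direction) {
  N = nrow(adjacencies)
  num = rep(NA, N)
  ind = which(adjacencies$adjacent == direction)  # row leading into the terminus
  i = N
  num[ind] = i
  while (i > 1) {
    # find the row whose "adjacent" is the station of the row just numbered
    ind = which(adjacencies$adjacent == adjacencies$station[ind])
    i = i - 1
    num[ind] = i
  }
  # stations sorted by their rank, followed by the terminus itself
  c(adjacencies$station[order(num)], direction)
}

# Made-up line A -> B -> C -> D, with rows deliberately shuffled.
toy = data.frame(
  station  = c("B", "A", "C"),
  adjacent = c("C", "B", "D"),
  stringsAsFactors = FALSE
)
ordered = form_line_base(toy, direction = "D")
ordered
```

The function walks backwards from the terminus: it numbers the row that leads into "D" last, then the row that leads into that station, and so on until the first station of the line.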

Now let’s apply this function to every combination of line and direction. Making full use of the tidyverse, we can apply this function iteratively without dropping the table-like structure of our data, using a combination of tidyr::nest() and purrr::map().

stations_lines = stations_adjacency %>% 
  sf::st_drop_geometry() %>% # make this a regular tibble, not sf
  group_by(direction,line) %>% 
  na.omit() %>% 
  tidyr::nest(.key = "adj") %>% # have nested "adj" table for each direction-line
  mutate(station = purrr::map(.x = adj, .y = direction,
                            ~form_line(.x,.y))) %>% 
  tidyr::unnest(cols = "station") %>% 
  ungroup()

We use left_join() to add the station coordinates to the table that orders the stations along each line:

stations_lines=stations_lines %>% 
  left_join(unique(stations_coords), # get corresponding coordinates
            by=c("station")) %>%
  na.omit()
head(stations_lines)
#> # A tibble: 6 × 5
#>   line              direction adj               station                   coords
#>   <chr>             <chr>     <list>            <chr>                     <chr> 
#> 1 Lyon Metro Line B ""        <tibble [11 × 3]> Charpennes - Charles Her… Point…
#> 2 Lyon Metro Line B ""        <tibble [11 × 3]> Brotteaux                 Point…
#> 3 Lyon Metro Line B ""        <tibble [11 × 3]> Gare Part-Dieu - Vivier … Point…
#> 4 Lyon Metro Line B ""        <tibble [11 × 3]> Place Guichard - Bourse … Point…
#> 5 Lyon Metro Line B ""        <tibble [11 × 3]> Saxe - Gambetta           Point…
#> 6 Lyon Metro Line B ""        <tibble [11 × 3]> Jean Macé                 Point…

stations_lines is now a table that is properly formatted to be transformed into an sf lines object (the stations are in the right order for each line-direction, and the associated coordinates are provided as WKT strings in the coords column):

stations_lines_sf=stations_lines %>% 
  sf::st_as_sf(wkt="coords") %>% 
  group_by(direction,line) %>% 
  summarise(do_union = FALSE) %>%   # for each group, and keeping order of points,
  sf::st_cast("LINESTRING")       # form a linestring geometry
stations_lines_sf
#> Simple feature collection with 9 features and 2 fields
#> Geometry type: LINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: 4.804185 ymin: 45.7016 xmax: 4.921998 ymax: 45.7855
#> CRS:           NA
#> # A tibble: 9 × 3
#> # Groups:   direction [9]
#>   direction                       line                                    coords
#>   <chr>                           <chr>                             <LINESTRING>
#> 1 ""                              Lyon Metro Line B    (4.863119 45.77054, 4.85…
#> 2 "Charpennes - Charles Hernu"    Lyon Metro Line B    (4.805357 45.71427, 4.80…
#> 3 "Cuire"                         Lyon Metro Line C    (4.836022 45.76738, 4.83…
#> 4 "Gare de Vaise"                 Lyon Metro Line D    (4.88804 45.7058, 4.8875…
#> 5 "Gare de Vénissieux"            Lyon Metro Line D    (4.80421 45.7794, 4.8054…
#> 6 "Hôtel de Ville - Louis Pradel" Lyon Metro Line C    (4.83293 45.7855, 4.8275…
#> 7 "Perrache"                      Lyon Metro Line A    (4.921998 45.76125, 4.90…
#> 8 "Saint-Just"                    Funiculaire de Sain…           (4.82622 45.76)
#> 9 "Vaulx-en-Velin - La Soie"      Lyon Metro Line A    (4.829182 45.75302, 4.83…

We can now use this new object to display the Lyon metro lines on a leaflet map:

factpal <- colorFactor(topo.colors(8),
                       unique(stations_lines$line))
leaflet(data=stations_sf) %>%
  addTiles() %>%
  addCircles(popup=~station) %>% 
  addPolylines(data=stations_lines_sf,
               color=~factpal(line), popup=~line)