4  Case studies

4.1 River geomorphology example

4.1.1 Description of the dataset

Let’s consider a qualitative geomorphological descriptor of a river. This indicator has been built through a hierarchical clustering analysis based on several quantitative variables, namely:

  • f.slope: log(slope)
  • f.sinuosity: log(sinuosity -1)
  • f.vall_bott_W: log(valley bottom width)
  • f.wat_cov: logit(area under water divided by area of the active channel)
  • f.act_chan_cov: log(area of the active channel divided by catchment area)
  • f.act_chan_vW: log(interquartile range of active channel width)

The individuals are homogeneous reaches with an average length of 1.7 km.
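For reference, these descriptors could be derived from raw reach measurements along the following lines. This is only a minimal sketch: the raw variable names below are hypothetical, and the dataset loaded next already contains the transformed values.

logit=function(p) log(p/(1-p))
# Hypothetical raw measures per reach; all argument names are assumptions
make_descriptors=function(slope, sinuosity, valley_bottom_width,
                          water_area, active_channel_area,
                          catchment_area, act_chan_width_iqr){
  data.frame(f.slope       = log(slope),
             f.sinuosity   = log(sinuosity-1),
             f.vall_bott_W = log(valley_bottom_width),
             f.wat_cov     = logit(water_area/active_channel_area),
             f.act_chan_cov= log(active_channel_area/catchment_area),
             f.act_chan_vW = log(act_chan_width_iqr))
}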

data_river <- read.csv("data/data_river.csv", sep=";")
attach(data_river)
x=1:nrow(data_river)
require(ade4)
Loading required package: ade4
data=data_river[,2:ncol(data_river)]
datascaled=scale(data,center=TRUE)
#boxplot(datascaled)
monACP=dudi.pca(data,scannf=FALSE,nf=2)

tree=hclust(dist(datascaled), method="ward.D")
lcol2=c("darkorchid","palegreen")
lcol5=c("slateblue2","maroon1",
       "gold1","darkolivegreen1",
       "mediumaquamarine")
#plot(tree)

grouping5=as.factor(cutree(tree,k=5))
grouping2=as.factor(cutree(tree,k=2))

series5=paste0("class",grouping5)
series2=paste0("class",grouping2)

Figure 4.1: Description of classes. A Principal Component Analysis has been performed to help with the description.

Figure 4.1 shows the quantitative geomorphological features of the 5 classes defined through hierarchical clustering.

Figure 4.2: Description of classes

Figure 4.2 displays the sequence of classes along the river.

4.1.2 Fit of a HMM

We hypothesize that there are actually two hidden states (let’s call them “A” and “B”) behind the observed series of classes. The proportions of classes in the observation series are:

myprop=round(table(series5)/length(series5),2)
print(myprop)
series5
class1 class2 class3 class4 class5 
  0.25   0.11   0.29   0.25   0.09 
myprop=as.vector(myprop)

We initialize the emission matrix and transition matrix:

M_E=matrix(c(myprop[1]+0.05,myprop[1]-0.05,
             myprop[2]-0.05,myprop[2]+0.05,
             myprop[3]+0.05,myprop[3]-0.05,
             myprop[4]-0.05,myprop[4]+0.05,
             myprop[5]+0.05,myprop[5]-0.05),nrow=2)
rownames(M_E)=c("A","B")
colnames(M_E)=paste0("class",1:5)
M_E
  class1 class2 class3 class4 class5
A    0.3   0.06   0.34    0.2   0.14
B    0.2   0.16   0.24    0.3   0.04
M_T=matrix(c(.9,.1,.1,.9),2)
rownames(M_T)=c("A","B")
colnames(M_T)=c("A","B")
M_T
    A   B
A 0.9 0.1
B 0.1 0.9

Thus we can initialize the HMM:

library(HMM)
initial_HMM=initHMM(States=c("A","B"),
                    Symbols=paste0("class",1:5),
                    transProbs=M_T,
                    emissionProbs=M_E)

We can now fit the HMM model using this initial parameterization.

hmm_fit= baumWelch(initial_HMM,series5,maxIterations=100)$hmm
colstates=c("cadetblue3","sienna2")


round(hmm_fit$emissionProbs,2)
      symbols
states class1 class2 class3 class4 class5
     A    0.6   0.13   0.03   0.06   0.18
     B    0.0   0.10   0.48   0.40   0.02
round(hmm_fit$transProbs,2)
    to
from    A    B
   A 0.93 0.07
   B 0.04 0.96
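As a side note (not part of the original analysis), the fitted transition matrix can be read in terms of sojourn times: in an HMM, the time spent in a state with self-transition probability p follows a geometric distribution with mean 1/(1-p).

# Expected number of consecutive reaches spent in each state,
# computed from the fitted self-transition probabilities
p_stay=diag(hmm_fit$transProbs)
round(1/(1-p_stay),1)
# with the rounded values above, roughly 14 reaches in state A and 25 in state B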

Figure 4.3 illustrates the differences between states A and B: A is characterized by a frequent occurrence of classes 1, 2 and 5, while B is characterized by a frequent occurrence of classes 3 and 4.

layout(matrix(1:2,nrow=1))
par(las=1)
mosaicplot(hmm_fit$emissionProbs, col=lcol5)
mosaicplot(hmm_fit$transProbs, col=colstates)

Figure 4.3: Fitted parameters of the HMM.

Now, we use the fitted HMM to infer the hidden states series:

seriesHMM=as.factor(viterbi(hmm_fit,series5))
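The Viterbi path is the single most probable state sequence. As a complement (not used in the original analysis), the HMM package also provides posterior state probabilities through the forward-backward algorithm:

# Posterior probability of each state at each position along the river;
# one row per state, one column per reach, columns sum to 1
post=posterior(hmm_fit,series5)
round(post[,1:5],2)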

Figure 4.4: 1) observation series (classification into 5 clusters based on quantitative variables) 2) hidden states series (classification into two states based on observation series consisting of 5 possible clusters).

Figure 4.4 displays the observation series and the estimated state series.

4.1.3 Discussion: HMM segmentation vs clustering

An HMM is primarily designed to describe the sequence of hidden states underlying an observed series. In doing so, it cuts a qualitative signal into homogeneous segments.

In this example, the data at hand have first been used to define 5 classes, and an HMM has then been fitted to define 2 states based on these 5 classes. One might wonder why we do not define 2 classes directly.

Actually, these two methods, although they seemingly produce the same kind of output (i.e. a 2-class segmentation of the river), do not proceed according to the same principles.

Hierarchical clustering makes no direct use of the geographical information in the data. When two successive reaches fall into the same class, it is only because nearby reaches tend to have similar geomorphological descriptions.

An HMM, on the other hand, uses both the geomorphological descriptors AND the geographical information (i.e. the location of reaches in the sequence) to determine which state each reach most likely results from.

To illustrate this, we use the quantitative data at hand to define a 2-class description based on hierarchical clustering. Figure 4.5 shows the characteristics of these 2 classes as well as the characteristics of reaches presumably occurring in states A and B.

Figure 4.5: Comparison of 1) classes (output of hierarchical clustering with 2 classes) and 2) hidden states as estimated based on a HMM model.

Figure 4.5 illustrates how close, in terms of descriptors, the classes inferred through hierarchical clustering (based on several quantitative variables) are to the states estimated through the HMM fit (based on the 5-class descriptor).
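This closeness can also be checked numerically (a quick complement to the figure) by cross-tabulating the 2-class clustering against the estimated state path:

# Cross-tabulation of the 2-class hierarchical clustering and the HMM states;
# a strongly dominant diagonal (up to label switching) indicates close agreement
table(series2,seriesHMM)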

On the other hand, Figure 4.6 shows that there are some differences between the segments defined by projecting the 2-class categorization along the river (hierarchical clustering) and those defined by the most probable state path (HMM). By taking into account the geographical succession of reaches, and not only their geomorphological features, the HMM path provides a less fragmented segmentation.
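This lower fragmentation can be quantified by counting the homogeneous runs along the river in each segmentation:

# Number of homogeneous segments (runs) in each 2-class segmentation;
# the HMM path is expected to yield fewer, longer segments
length(rle(series2)$lengths)
length(rle(as.character(seriesHMM))$lengths)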

Figure 4.6: Comparison of 1) observation series (classification into 5 clusters based on quantitative variables) 2) observation series (classification into 2 clusters based on quantitative variables) 3) hidden states series (classification into two states based on observation series consisting of 5 possible clusters).

4.2 Textual analysis example

Let’s consider an extract of a Sherlock Holmes novel.

text=scan("data/sherlock.txt", what="character", sep=" ")
print(length(text))
[1] 776
print(text[1:30])
 [1] "To"           "Sherlock"     "Holmes"       "she"          "is"          
 [6] "always"       "the"          "woman"        "."            "I"           
[11] "have"         "seldom"       "heard"        "him"          "mention"     
[16] "her"          "under"        "any"          "other"        "name"        
[21] "."            "In"           "his"          "eyes"         "she"         
[26] "eclipses"     "and"          "predominates" "the"          "whole"       

We are interested in distinguishing the parts where the focus is mainly on Sherlock Holmes from the parts where the attention is mainly on the narrator, Dr Watson. We want to do this for all Sherlock Holmes novels, so the distinction should be made automatically (though here we only use a short extract as an example).

We categorize words into 3 categories:

  • a neutral category (outcome “blah”)
  • a Sherlock Holmes-centered category (outcome “he”)
  • a John Watson-centered category (outcome “I”)
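The code building this categorization is not shown here; a minimal sketch could look as follows (the exact word lists are assumptions made for illustration):

# Hypothetical word lists defining the three categories
holmes_words=c("Sherlock","Holmes","he","He","him","his","His")
watson_words=c("I","my","My","me","Me")
observations=rep("blah",length(text))
observations[text %in% holmes_words]="he"
observations[text %in% watson_words]="I"
head(data.frame(text,observations),30)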
           text observations
1            To         blah
2      Sherlock           he
3        Holmes           he
4           she         blah
5            is         blah
6        always         blah
7           the         blah
8         woman         blah
9             .         blah
10            I            I
11         have         blah
12       seldom         blah
13        heard         blah
14          him           he
15      mention         blah
16          her         blah
17        under         blah
18          any         blah
19        other         blah
20         name         blah
21            .         blah
22           In         blah
23          his           he
24         eyes         blah
25          she         blah
26     eclipses         blah
27          and         blah
28 predominates         blah
29          the         blah
30        whole         blah

Figure 4.7 shows what the sequence of outcomes looks like for our example.

Figure 4.7: The observation series: each point corresponds to a word in a text. Blue points correspond to an occurrence of an ‘I’ outcome, while red points correspond to an occurrence of a ‘he’ outcome.
myprop=table(observations)
myprop=myprop/sum(myprop)
print(myprop)
observations
      blah         he          I 
0.90592784 0.06185567 0.03221649 
myprop=as.vector(myprop)

We initialize the emission matrix stating that the “he” outcome should be more frequent in the Holmes-centered parts, while the “I” outcome should be more frequent in the Watson-centered parts.

M_E=matrix(c(myprop[1], myprop[1],
             myprop[2]+0.01, myprop[2]-0.01,
             myprop[3]-0.01, myprop[3]+0.01),
             nrow=2)
rownames(M_E)=c("Holmes","Watson")
colnames(M_E)=c("blah","he","I")

We initialize the transition matrix stating that a transition from one state to the other happens about once every 100 words, i.e. a self-transition probability of 0.99 (so each state is expected to last 1/(1-0.99) = 100 words):

M_T=matrix(c(0.99,0.01,0.01,0.99),2)
rownames(M_T)=c("Holmes","Watson")
colnames(M_T)=c("Holmes","Watson")

The initial HMM is:

initial_HMM=initHMM(States=c("Holmes","Watson"),
                    Symbols=c("blah","he","I"),
                    transProbs=M_T,
                    emissionProbs=M_E)
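As an optional sanity check (not in the original text), we can simulate a series from this initial model with simHMM from the HMM package and verify that state runs indeed last about a hundred words:

# Simulate 1000 steps: sim$states is the hidden path, sim$observation the symbols
set.seed(42)
sim=simHMM(initial_HMM,1000)
rle(sim$states)$lengths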

We can now fit the HMM model using this initial parameterization.

hmm_fit= baumWelch(initial_HMM,observations,maxIterations=100)$hmm
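Note in passing that baumWelch also returns, besides the fitted model, the vector of differences computed at each iteration, which can be used to check that maxIterations was large enough (a hedged aside; the fit above only kept the $hmm element):

# Per-iteration changes should shrink towards zero upon convergence
bw=baumWelch(initial_HMM,observations,maxIterations=100)
tail(bw$difference)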

col_outcomes=c("grey","red","blue")
col_states=c("pink2","skyblue3")

print(round(hmm_fit$emissionProbs,2))
        symbols
states   blah   he    I
  Holmes 0.91 0.07 0.02
  Watson 0.90 0.00 0.10
print(round(hmm_fit$transProbs,2))
        to
from     Holmes Watson
  Holmes   1.00   0.00
  Watson   0.01   0.99

The characteristics of states “Holmes” and “Watson” are displayed in Figure 4.8. Watson (outcome “I”), as the narrator, never completely disappears from the text, even in Holmes-centered parts, while the Watson-centered parts are characterized by a null occurrence of the “he” outcome.

Figure 4.8: Fitted parameters of the HMM.

Now, we use the fitted HMM to infer the hidden states series:

states_speech=as.factor(viterbi(hmm_fit,observations))

Figure 4.9: Observation series and state series.

Figure 4.9 displays the observation series and the estimated state series. Only one part of the text appears to be Watson-centered (according to how Watson-centeredness is defined here, i.e. the total disappearance of “he” outcomes).

print(text[states_speech=="Watson"])
 [1] "I"          "merely"     "shared"     "with"       "all"       
 [6] "the"        "readers"    "of"         "the"        "daily"     
[11] "press"      ","          "I"          "knew"       "little"    
[16] "of"         "my"         "former"     ""           "friend"    
[21] "and"        "companion"  "."          "One"        "night"     
[26] "-"          "it"         "was"        "on"         "the"       
[31] "twentieth"  "of"         "March"      ","          "1888"      
[36] "-"          "I"          "was"        "returning"  "from"      
[41] "a"          "journey"    "to"         "a"          "patient"   
[46] "("          "for"        "I"          "had"        "now"       
[51] "returned"   "to"         "civil"      "practice"   ")"         
[56] ","          "when"       "my"         "way"        "led"       
[61] "me"         "through"    "Baker"      "Street"     "."         
[66] "As"         "I"          "passed"     "the"        "well"      
[71] "-"          "remembered" "door"       ","          "which"     
[76] "must"       "always"     "be"         "associated" "in"        
[81] "my"         "mind"       "with"       "my"         "wooing"    
[86] ","          "and"        "with"       "the"        "dark"      
[91] "incidents"  "of"         "the"        "Study"      "in"        
[96] "Scarlet"    ","          "I"