Getting started

What is the RAKE algorithm?

The Rapid Automatic Keyword Extraction (RAKE) algorithm was first described in Rose et al. as a way to quickly extract keywords from documents. The algorithm involves two main steps:

1. Candidate keywords are identified. A candidate keyword is any set of contiguous words (i.e., any n-gram) that doesn’t contain a phrase delimiter or a stop word.¹ A phrase delimiter is a punctuation character that marks the beginning or end of a phrase (e.g., a period or a comma). Splitting up text based on phrase delimiters/stop words is the essential idea behind RAKE. According to the authors:

“RAKE is based on our observation that keywords frequently contain multiple words but rarely contain standard punctuation or stop words, such as the function words and, the, and of, or other words with minimal lexical meaning.”

In addition to using stop words and phrase delimiters to identify candidate keywords, slowrake() also allows you to use a word’s part-of-speech (POS) to mark it as a potential delimiter. For example, most keywords don’t contain verbs, so you may want to treat verbs as phrase delimiters. You can use slowrake()’s stop_pos parameter to choose which parts-of-speech to exclude from your candidate keywords.

2. Keywords get scored. A keyword’s score (i.e., its degree of “keywordness”) is the sum of its member word scores. For example, the score for the keyword “dog leash” is calculated by adding the score for the word “dog” to the score for the word “leash.” A member word’s score is equal to its degree/frequency, where degree is the number of words that the word co-occurs with across all candidate keywords (counting the word itself, so each appearance in a three-word keyword adds three to its degree), and frequency is the total number of times that the word occurs overall (i.e., including keywords that only have one member word, like “dog”).
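The two steps above can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not slowraker's implementation: the function name toy_rake(), its tiny stop-word list, and the punctuation set are all made up for the example, and it skips stemming and part-of-speech tagging entirely.

```r
# A toy RAKE in base R (illustrative only; this is not slowraker's code).
toy_rake <- function(txt, stop_words = c("a", "the", "of", "and", "is", "on")) {
  # Step 1: cut the text into phrases at punctuation, then cut each phrase
  # into candidate keywords at stop words.
  phrases <- unlist(strsplit(tolower(txt), "[.,;:!?]"))
  candidates <- list()
  for (phrase in phrases) {
    words <- unlist(strsplit(trimws(phrase), "\\s+"))
    words <- words[nzchar(words)]
    if (length(words) == 0) next
    is_stop <- words %in% stop_words
    # cumsum() gives each run of non-stop words between stop words its own id
    run_id <- cumsum(is_stop)
    for (run in split(words[!is_stop], run_id[!is_stop])) {
      candidates[[length(candidates) + 1]] <- run
    }
  }

  # Step 2: score each word as degree / frequency, then sum over member words.
  all_words <- unlist(candidates)
  n_words <- lengths(candidates)
  freq <- tapply(rep(1, length(all_words)), all_words, sum)
  # Degree: total length of the candidate keywords that contain the word
  degree <- tapply(rep(n_words, n_words), all_words, sum)
  word_score <- setNames(as.numeric(degree) / as.numeric(freq), names(degree))

  keyword <- vapply(candidates, paste, character(1), collapse = " ")
  score <- vapply(candidates, function(w) sum(word_score[w]), numeric(1))
  res <- unique(data.frame(
    keyword = keyword,
    freq = as.numeric(table(keyword)[keyword]),
    score = score,
    stringsAsFactors = FALSE
  ))
  res[order(res$score, decreasing = TRUE), ]
}

toy_rake("The dog leash is on the dog. A leash, and a dog.")
```

In this toy input the candidates are “dog leash”, “dog”, “leash”, and “dog”. The word “dog” has degree 4 and frequency 3, and “leash” has degree 3 and frequency 2, so “dog leash” scores 4/3 + 3/2 and ranks first.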

See Rose et al. for more details on how RAKE works.

Examples

RAKE is completely unsupervised (it needs no labeled training data), so it’s relatively quick to get started with. Let’s take a look at a few examples that demonstrate slowrake()’s parameters.

library(slowraker)

txt <- "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types of systems and systems of mixed types."

Use the default settings:

slowrake(txt)[[1]]
#>                         keyword freq score                    stem
#> 1  linear diophantine equations    1   8.5 linear diophantin equat
#> 2        minimal supporting set    1   6.8       minim support set
#> 3            linear constraints    1   4.5       linear constraint
#> 4               natural numbers    1   4.0            natur number
#> 5         nonstrict inequations    1   4.0         nonstrict inequ
#> 6            strict inequations    1   4.0            strict inequ
#> 7                   minimal set    1   3.8               minim set
#> 8                   mixed types    1   3.3                mix type
#> 9                       minimal    1   2.0                   minim
#> 10                          set    1   1.8                     set
#> 11                         sets    1   1.8                     set
#> 12                        types    2   1.3                    type
#> 13                   algorithms    2   1.0               algorithm
#> 14                compatibility    2   1.0                  compat
#> 15                   components    1   1.0                  compon
#> 16                 construction    1   1.0               construct
#> 17                     criteria    2   1.0                criteria
#> 18                    solutions    3   1.0                   solut
#> 19                       system    1   1.0                  system
#> 20                      systems    4   1.0                  system
#> 21                        upper    1   1.0                   upper

Don’t stem keywords before scoring them:

slowrake(txt, stem = FALSE)[[1]]
#>                         keyword freq score
#> 1  linear diophantine equations    1   8.5
#> 2        minimal supporting set    1   7.0
#> 3            linear constraints    1   4.5
#> 4                   minimal set    1   4.0
#> 5               natural numbers    1   4.0
#> 6         nonstrict inequations    1   4.0
#> 7            strict inequations    1   4.0
#> 8                   mixed types    1   3.3
#> 9                       minimal    1   2.0
#> 10                          set    1   2.0
#> 11                        types    2   1.3
#> 12                   algorithms    2   1.0
#> 13                compatibility    2   1.0
#> 14                   components    1   1.0
#> 15                 construction    1   1.0
#> 16                     criteria    2   1.0
#> 17                         sets    1   1.0
#> 18                    solutions    3   1.0
#> 19                       system    1   1.0
#> 20                      systems    4   1.0
#> 21                        upper    1   1.0
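The score for “set” is 2.0 here but 1.8 with the default settings, and that difference can be reproduced by hand. The sketch below assumes, consistent with the outputs above, that a word's degree is the summed length of the candidate keywords containing it. The vectors are typed in from those outputs rather than computed by slowraker.

```r
# Lengths of the candidate keywords that contain "set" when stem = FALSE:
# "minimal supporting set" (3), "minimal set" (2), "set" (1)
unstemmed <- c(3, 2, 1)
score_unstemmed <- sum(unstemmed) / length(unstemmed)  # degree / frequency
score_unstemmed
#> [1] 2

# With stemming, "sets" shares the stem "set", so it joins the same group
stemmed <- c(unstemmed, 1)
score_stemmed <- sum(stemmed) / length(stemmed)
round(score_stemmed, 1)
#> [1] 1.8
```

The same effect explains why “minimal supporting set” drops from 7.0 to 6.8 when stemming is on: its member word “set” contributes 1.75 instead of 2.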

Add the word “diophantine” to the default set of stop words (default set = slowraker::smart_words):

slowrake(txt, stop_words = c(smart_words, "diophantine"))[[1]]
#>                   keyword freq score              stem
#> 1  minimal supporting set    1   6.8 minim support set
#> 2         natural numbers    1   4.0      natur number
#> 3   nonstrict inequations    1   4.0   nonstrict inequ
#> 4      strict inequations    1   4.0      strict inequ
#> 5             minimal set    1   3.8         minim set
#> 6      linear constraints    1   3.5 linear constraint
#> 7             mixed types    1   3.3          mix type
#> 8                 minimal    1   2.0             minim
#> 9                     set    1   1.8               set
#> 10                   sets    1   1.8               set
#> 11                 linear    1   1.5            linear
#> 12                  types    2   1.3              type
#> 13             algorithms    2   1.0         algorithm
#> 14          compatibility    2   1.0            compat
#> 15             components    1   1.0            compon
#> 16           construction    1   1.0         construct
#> 17               criteria    2   1.0          criteria
#> 18              equations    1   1.0             equat
#> 19              solutions    3   1.0             solut
#> 20                 system    1   1.0            system
#> 21                systems    4   1.0            system
#> 22                  upper    1   1.0             upper

Don’t use a word’s part-of-speech to determine if it’s a stop word:

slowrake(txt, stop_pos = NULL)[[1]]
#>                         keyword freq score                    stem
#> 1  linear diophantine equations    1   8.5 linear diophantin equat
#> 2       minimal generating sets    1   7.9         minim gener set
#> 3        minimal supporting set    1   7.9       minim support set
#> 4                   minimal set    1   4.9               minim set
#> 5            linear constraints    1   4.5       linear constraint
#> 6               natural numbers    1   4.0            natur number
#> 7         nonstrict inequations    1   4.0         nonstrict inequ
#> 8            strict inequations    1   4.0            strict inequ
#> 9                  upper bounds    1   4.0             upper bound
#> 10                  mixed types    1   3.7                mix type
#> 11             considered types    1   3.2             consid type
#> 12                          set    1   2.2                     set
#> 13                        types    1   1.7                    type
#> 14                   considered    1   1.5                  consid
#> 15                   algorithms    2   1.0               algorithm
#> 16                compatibility    2   1.0                  compat
#> 17                   components    1   1.0                  compon
#> 18                 constructing    1   1.0               construct
#> 19                 construction    1   1.0               construct
#> 20                     criteria    2   1.0                criteria
#> 21                    solutions    3   1.0                   solut
#> 22                      solving    1   1.0                    solv
#> 23                       system    1   1.0                  system
#> 24                      systems    4   1.0                  system

Consider any word that isn’t a noun to be a stop word:

slowrake(txt, stop_pos = pos_tags$tag[!grepl("^N", pos_tags$tag)])[[1]]
#>          keyword freq score       stem
#> 1     algorithms    2     1  algorithm
#> 2  compatibility    2     1     compat
#> 3     components    1     1     compon
#> 4    constraints    1     1 constraint
#> 5   construction    1     1  construct
#> 6       criteria    2     1   criteria
#> 7      equations    1     1      equat
#> 8    inequations    2     1      inequ
#> 9        numbers    1     1     number
#> 10           set    3     1        set
#> 11          sets    1     1        set
#> 12     solutions    3     1      solut
#> 13        system    1     1     system
#> 14       systems    4     1     system
#> 15         types    3     1       type
#> 16         upper    1     1      upper
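The stop_pos expression above selects every tag that does not start with “N”. A minimal sketch of that filter, run on a hand-typed vector of Penn Treebank-style tags (a hypothetical stand-in for pos_tags$tag, whose full contents are not shown here):

```r
# Hypothetical stand-in for pos_tags$tag
tags <- c("NN", "NNS", "NNP", "JJ", "VB", "VBG", "RB")

# grepl("^N", ...) is TRUE for tags that begin with "N" (the noun tags);
# negating it keeps every other tag to pass along as stop_pos
stop_pos <- tags[!grepl("^N", tags)]
stop_pos  # the four non-noun tags: JJ, VB, VBG, RB
```

Because every non-noun word then acts as a delimiter, adjacent nouns are the only way to form a multi-word keyword, which is why the output above contains only single nouns for this text.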

List the keywords that occur most frequently (freq):

res <- slowrake(txt)[[1]]
res2 <- aggregate(freq ~ keyword + stem, data = res, FUN = sum)
res2[order(res2$freq, decreasing = TRUE), ]
#>                         keyword                    stem freq
#> 19                      systems                  system    4
#> 16                    solutions                   solut    3
#> 1                    algorithms               algorithm    2
#> 2                 compatibility                  compat    2
#> 5                      criteria                criteria    2
#> 20                        types                    type    2
#> 3                    components                  compon    1
#> 4                  construction               construct    1
#> 6            linear constraints       linear constraint    1
#> 7  linear diophantine equations linear diophantin equat    1
#> 8                       minimal                   minim    1
#> 9                   minimal set               minim set    1
#> 10       minimal supporting set       minim support set    1
#> 11                  mixed types                mix type    1
#> 12              natural numbers            natur number    1
#> 13        nonstrict inequations         nonstrict inequ    1
#> 14                          set                     set    1
#> 15                         sets                     set    1
#> 17           strict inequations            strict inequ    1
#> 18                       system                  system    1
#> 21                        upper                   upper    1
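In the output above, “system” and “systems” stay on separate rows because the grouping uses both keyword and stem. If the goal is to count word forms together, one option is to group by stem alone. The sketch below uses a small hand-built data frame (the name res_small is made up) holding a few rows copied from the output above, so it runs without slowraker.

```r
# A few rows copied from the slowrake() output above
res_small <- data.frame(
  keyword = c("systems", "system", "solutions", "set", "sets"),
  freq    = c(4, 1, 3, 1, 1),
  stem    = c("system", "system", "solut", "set", "set"),
  stringsAsFactors = FALSE
)

# Sum frequencies over every keyword that shares a stem
by_stem <- aggregate(freq ~ stem, data = res_small, FUN = sum)
by_stem[order(by_stem$freq, decreasing = TRUE), ]
```

Here the stem “system” ends up with a combined frequency of 5 and “set” with 2.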

Run RAKE on a vector of documents instead of just one document:

slowrake(txt = dog_pubs$abstract[1:10])
#>   |======================================================================| 100%
#> 
#> # A rakelist containing 10 data frames:
#>  $ :'data.frame':    61 obs. of  4 variables:
#>   ..$ keyword:"assistance dog identification tags" ...
#>   ..$ freq   :1 1 ...
#>   ..$ score  :11 ...
#>   ..$ stem   :"assist dog identif tag" ...
#>  $ :'data.frame':    88 obs. of  4 variables:
#>   ..$ keyword:"current dog suitability assessments focus" ...
#>   ..$ freq   :1 1 ...
#>   ..$ score  :21 ...
#>   ..$ stem   :"current dog suitabl assess focu" ...
#> #...With 8 more data frames.
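Because the result holds one data frame per document, it can be flattened into a single table with base R. The sketch below uses a hand-built list as a stand-in for slowrake()'s return value: the name fake_rakelist, its keywords, and its scores are made up, and the doc_id column is an assumption for illustration.

```r
# Hypothetical stand-in for a rakelist: one data frame per document
fake_rakelist <- list(
  data.frame(keyword = c("dog tags", "dog"), freq = c(1, 2),
             score = c(4, 1.5), stringsAsFactors = FALSE),
  data.frame(keyword = "guide dog", freq = 1,
             score = 4, stringsAsFactors = FALSE)
)

# Tag each data frame with its position in the list, then stack the results
tagged <- Map(
  function(df, id) data.frame(doc_id = id, df, stringsAsFactors = FALSE),
  fake_rakelist, seq_along(fake_rakelist)
)
all_keywords <- do.call(rbind, tagged)
all_keywords
```

The doc_id column records which document each keyword came from, so the combined table can still be filtered or aggregated per document.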

¹ Technically, the original version of RAKE allows some keywords to contain stop words, but slowrake() doesn’t support this.
