---
title: "Functionality of the fastText R package"
author: "Lampros Mouselimis"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Functionality of the fastText R package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
This vignette explains the functionality of the fastText R package. This R package is an interface to the [fasttext library](https://github.com/facebookresearch/fastText) for efficient learning of word representations and sentence classification. The following functions are included,

| Function                             | Description                                                                   |
| :----------------------------------: | :---------------------------------------------------------------------------: |
|   **fasttext_interface**             | Interface for the fasttext library                                            |
|   **plot_progress_logs**             | Plot the progress of loss, learning-rate and word-counts                      |
|   **printAnalogiesUsage**            | Print usage information when the command equals 'analogies'                   |
|   **printDumpUsage**                 | Print usage information when the command equals 'dump'                        |
|   **printNNUsage**                   | Print usage information when the command equals 'nn'                          |
|   **printPredictUsage**              | Print usage information when the command equals 'predict' or 'predict-prob'   |
|   **printPrintNgramsUsage**          | Print usage information when the command equals 'print-ngrams'                |
|   **printPrintSentenceVectorsUsage** | Print usage information when the command equals 'print-sentence-vectors'      |
|   **printPrintWordVectorsUsage**     | Print usage information when the command equals 'print-word-vectors'          |
|   **printQuantizeUsage**             | Print usage information when the command equals 'quantize'                    |
|   **printTestLabelUsage**            | Print usage information when the command equals 'test-label'                  |
|   **printTestUsage**                 | Print usage information when the command equals 'test'                        |
|   **printUsage**                     | Print usage information for all parameters                                    |
|   **print_parameters**               | Print the parameters for a specific command                                   |

I'll explain each function separately based on example data. More information can be found in the package documentation.
#### print_parameters 
This function prints the default parameters for a specific 'command', for instance *supervised*, *skipgram* or *cbow*,
```R
library(fastText)
print_parameters(command = "supervised")
Empty input or output path.
The following arguments are mandatory:
  -input              training file path
  -output             output file path
The following arguments are optional:
  -verbose            verbosity level [2]
The following arguments for the dictionary are optional:
  -minCount           minimal number of word occurences [1]
  -minCountLabel      minimal number of label occurences [0]
  -wordNgrams         max length of word ngram [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [0]
  -maxn               max length of char ngram [0]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]
The following arguments for training are optional:
  -lr                 learning rate [0.1]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax, one-vs-all} [softmax]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [false]
The following arguments for quantization are optional:
  -cutoff             number of words and ngrams to retain [0]
  -retrain            whether embeddings are finetuned if a cutoff is applied [false]
  -qnorm              whether the norm is quantized separately [false]
  -qout               whether the classifier is quantized [false]
  -dsub               size of each sub-vector [2]
Error in give_args_fasttext(args = c("fasttext", command)) : 
  EXIT_FAILURE -- args.cc file -- Args::parseArgs function
```
#### Print Usage Functions
Each function whose name contains both *print* and *Usage* prints the usage information for the corresponding command. For instance,
```R
printPredictUsage()
usage: fasttext predict[-prob] <model> <test-data> [<k>] [<th>]
  <model>      model filename
  <test-data>  test data filename (if -, read from stdin)
  <k>          (optional; 1 by default) predict top k labels
  <th>         (optional; 0.0 by default) probability threshold
```
  
	
	
#### fasttext_interface
 
This function allows the user to run the various methods included in the [fasttext library](https://github.com/facebookresearch/fastText) from within R. The data used in the following code snippets can be downloaded as a .zip file (named **fastText_data**) from my [Github repository](https://github.com/mlampros/DataSets). The user should unzip the file and make the extracted folder the working directory (using the base R function *setwd()*) before running the following code chunks.
 
	
```R
setwd('fastText_data')                  # make the extracted data the default directory
#------
# cbow
#------
library(fastText)
list_params = list(command = 'cbow', 
                   lr = 0.1, 
                   dim = 50,
                   input = "example_text.txt",
                   output = file.path(tempdir(), 'word_vectors'), 
                   verbose = 2, 
                   thread = 1)
res = fasttext_interface(list_params, 
                         path_output = file.path(tempdir(), 'cbow_logs.txt'),
                         MilliSecs = 5,
                         remove_previous_file = TRUE,
                         print_process_time = TRUE)
Read 0M words
Number of words:  8
Number of labels: 0
Progress: 100.0% words/sec/thread:    2933 lr:  0.000000 loss:  4.060542 ETA:   0h 0m
time to complete : 3.542332 secs 
```
 
**The output is saved in the *tempdir()* folder for illustration purposes only. Users are advised to specify their own folder.** 
 
```R
#-----------
# supervised
#-----------
list_params = list(command = 'supervised', 
                   lr = 0.1,
                   dim = 50,
                   input = file.path("cooking.stackexchange", "cooking.train"),
                   output = file.path(tempdir(), 'model_cooking'), 
                   verbose = 2, 
                   thread = 4)
res = fasttext_interface(list_params, 
                         path_output = file.path(tempdir(), 'sup_logs.txt'),
                         MilliSecs = 5,
                         remove_previous_file = TRUE,
                         print_process_time = TRUE)
Read 0M words
Number of words:  14543
Number of labels: 735
Progress: 100.0% words/sec/thread:   63282 lr:  0.000000 loss: 10.049338 ETA:   0h 0m
time to complete : 3.449003 secs 
```
 
The user can also plot the progress of *loss*, *learning-rate* and *word-counts*,
 
```R
res = plot_progress_logs(path = file.path(tempdir(), 'sup_logs.txt'), 
                         plot = TRUE)
dim(res)
```
 

 
The verbosity of the logs file depends on two parameters,
* *'MilliSecs'* (a higher value leads to fewer log lines in the file) and
* *'thread'* (a value greater than 1 leads to fewer log lines in the file).
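
To illustrate what a single log line carries, the following minimal sketch extracts the values from one progress line using base R. It assumes that the lines of the logs file have the same layout as the console output shown above; `extract_value` is a hypothetical helper and not part of the fastText package.

```R
# one progress line, copied from the console output of the 'supervised' run
log_line = "Progress: 100.0% words/sec/thread:   63282 lr:  0.000000 loss: 10.049338 ETA:   0h 0m"

# hypothetical helper: return the number that follows a key such as 'lr:'
extract_value = function(line, key) {
  pattern = paste0(key, "\\s*([0-9.]+)")
  m = regmatches(line, regexec(pattern, line))[[1]]
  as.numeric(m[2])                       # m[1] is the full match, m[2] the captured number
}

progress = data.frame(progress_pct     = extract_value(log_line, "Progress:"),
                      words_sec_thread = extract_value(log_line, "words/sec/thread:"),
                      lr               = extract_value(log_line, "lr:"),
                      loss             = extract_value(log_line, "loss:"))
print(progress)
```

Applying the helper to every line of a logs file (for instance with *lapply()*) would give one row per logged step, which is the kind of data a progress plot needs.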
 
The *'predict'* command can then be used to predict the labels of new data based on the saved model,
 
```R
#-------------------
# 'predict' function
#-------------------
list_params = list(command = 'predict',
                   model = file.path(tempdir(), 'model_cooking.bin'),
                   test_data = file.path('cooking.stackexchange', 'cooking.valid'), 
                   k = 1,
                   th = 0.0)
res = fasttext_interface(list_params, 
                         path_output = file.path(tempdir(), 'predict_valid.txt'))
```
 
The output predictions have the form `__label__food-safety`, with one predicted label per line (the number of lines of the output must match the number of lines of the input data). With the *'predict-prob'* command the user can obtain the probabilities of the labels as well,
 
```R
#------------------------
# 'predict-prob' function
#------------------------
list_params = list(command = 'predict-prob',
                   model = file.path(tempdir(), 'model_cooking.bin'),
                   test_data = file.path('cooking.stackexchange', 'cooking.valid'), 
                   k = 1,
                   th = 0.0)
res = fasttext_interface(list_params, 
                         path_output = file.path(tempdir(), 'predict_valid_prob.txt'))
```
 
Using *'predict-prob'* the output predictions have the form `__label__baking 0.0282927`.
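
Predictions of this form are easy to load back into R. The following is a minimal sketch that assumes *k = 1* (one label and one probability per line). The first line is the example shown above, whereas the second line is made up for illustration.

```R
# lines as they would appear in the 'predict-prob' output file (second line is hypothetical)
pred_lines = c("__label__baking 0.0282927",
               "__label__food-safety 0.0251")

# split each line into a label and a probability column
preds = read.table(text = pred_lines,
                   col.names = c("label", "prob"),
                   quote = "", comment.char = "",
                   stringsAsFactors = FALSE)

# drop the '__label__' prefix to obtain the plain label names
preds$label = sub("^__label__", "", preds$label)
print(preds)
```

To read an actual output file, `text = pred_lines` can be replaced with the path to that file as the first argument.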
 
Once the model has been trained, the user can evaluate it by computing the precision and recall at 'k' on a test set. The 'test' command just prints the metrics in the R session,
 
```R
#----------------
# 'test' function
#----------------
list_params = list(command = 'test',
                   model = file.path(tempdir(), 'model_cooking.bin'),
                   test_data = file.path('cooking.stackexchange', 'cooking.valid'),
                   k = 1,
                   th = 0.0)
res = fasttext_interface(list_params)
N	3000
P@1	0.138
R@1	0.060
```
 
whereas the 'test-label' command allows the user to save,
 
```R
#----------------------
# 'test-label' function
#----------------------
list_params = list(command = 'test-label',
                   model = file.path(tempdir(), 'model_cooking.bin'),
                   test_data = file.path('cooking.stackexchange', 'cooking.valid'),
                   k = 1,
                   th = 0.0)
res = fasttext_interface(list_params, 
                         path_output = file.path(tempdir(), 'test_label_valid.txt'))
```
 
the output to the *'test_label_valid.txt'* file, which includes the 'F1-Score', 'Precision' and 'Recall' for each unique label in the data set (*'cooking.stackexchange.txt'*). That means the number of rows of *'test_label_valid.txt'* must be equal to the number of unique labels in the *'cooking.stackexchange.txt'* data set. This can be verified using the following code snippet,
 
```R
st_dat = read.delim(file.path("cooking.stackexchange", "cooking.stackexchange.txt"), 
                    stringsAsFactors = FALSE)
# for every row keep the whitespace-separated tokens that start with '__label__'
res_stackexch = unlist(lapply(seq_len(nrow(st_dat)), function(y) {
  tokens = strsplit(st_dat[y, ], " ")[[1]]
  tokens[substr(tokens, 1, 9) == "__label__"]
}))
test_label_valid = read.table(file.path(tempdir(), 'test_label_valid.txt'), 
                              quote="\"", comment.char="")
# number of unique labels of data equal to the rows of the 'test_label_valid.txt' file
length(unique(res_stackexch)) == nrow(test_label_valid)             
[1] TRUE
head(test_label_valid)
        V1 V2       V3        V4 V5       V6     V7 V8       V9                    V10
1 F1-Score  : 0.234244 Precision  : 0.139535 Recall  : 0.729167        __label__baking
2 F1-Score  : 0.227746 Precision  : 0.132571 Recall  : 0.807377   __label__food-safety
3 F1-Score  : 0.058824 Precision  : 0.750000 Recall  : 0.030612 __label__substitutions
4 F1-Score  : 0.000000 Precision  : -------- Recall  : 0.000000     __label__equipment
5 F1-Score  : 0.017699 Precision  : 1.000000 Recall  : 0.008929         __label__bread
6 F1-Score  : 0.000000 Precision  : -------- Recall  : 0.000000       __label__chicken
.....
```
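
As a quick sanity check of the three metrics, the 'F1-Score' column can be recomputed from the 'Precision' and 'Recall' columns. The sketch below assumes the usual definition of F1 as the harmonic mean of precision and recall, and uses the first three rows of the output above. Small differences in the last digit are expected because the inputs are already rounded.

```R
# values copied from the first three rows of 'test_label_valid.txt'
precision   = c(baking = 0.139535, `food-safety` = 0.132571, substitutions = 0.750000)
recall      = c(baking = 0.729167, `food-safety` = 0.807377, substitutions = 0.030612)
reported_f1 = c(baking = 0.234244, `food-safety` = 0.227746, substitutions = 0.058824)

# harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 6))
print(max(abs(f1 - reported_f1)))     # largest deviation from the reported scores
```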
 
The user can also *'quantize'* a supervised model to reduce its memory usage with the following command,
 
```R
#---------------------
# 'quantize' function
#---------------------
list_params = list(command = 'quantize',
                   input = file.path(tempdir(), 'model_cooking.bin'),
                   output = file.path(tempdir(), 'model_cooking')) 
res = fasttext_interface(list_params)
print(list.files(tempdir(), pattern = '.ftz'))
[1] "model_cooking.ftz"
```
 
The quantize function is currently (as of 01/02/2019) [single-threaded](https://github.com/facebookresearch/fastText/issues/353#issuecomment-342501742).
 
Based on the *'queries.txt'* text file, the user can save the word vectors to a file (one vector per line) using the following command,
 
```R
#----------------------------
# print-word-vectors function
#----------------------------
list_params = list(command = 'print-word-vectors',
                   model = file.path(tempdir(), 'model_cooking.bin'))
res = fasttext_interface(list_params,
                         path_input = 'queries.txt',
                         path_output = file.path(tempdir(), 'word_vecs_queries.txt'))
```
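
A file with one vector per line can be turned into a numeric matrix with a few lines of base R. The sketch below assumes that each line consists of the word followed by its vector values, separated by whitespace. The two lines and their values are made up for illustration (3 dimensions instead of the 50 used above).

```R
# hypothetical content of a word-vectors file: a word followed by its values
vec_lines = c("sauce 0.12 -0.30 0.05",
              "cheese -0.01 0.22 0.40")

# split every line on whitespace: first token is the word, the rest is the vector
parts = strsplit(trimws(vec_lines), "\\s+")
word_vecs = do.call(rbind, lapply(parts, function(p) as.numeric(p[-1])))
rownames(word_vecs) = sapply(parts, `[`, 1)

print(dim(word_vecs))
print(word_vecs["cheese", ])
```

For a real file, `vec_lines` would be the result of *readLines()* on the output path.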
 
To compute vector representations of sentences or paragraphs use the following command,
 
```R
#--------------------------------
# print-sentence-vectors function
#--------------------------------
library(fastText)
list_params = list(command = 'print-sentence-vectors',
                   model = file.path(tempdir(), 'model_cooking.bin'))
res = fasttext_interface(list_params,
                         path_input = 'text_sentence.txt',
                         path_output = file.path(tempdir(), 'word_sentence_queries.txt'))
```
 
Be aware that for *'print-sentence-vectors'* the input file (*'text_sentence.txt'*) should consist of sentences of the following form,
 
```R
How much does potato starch affect a cheese sauce recipe</s>
Dangerous pathogens capable of growing in acidic environments</s>
How do I cover up the white spots on my cast iron stove</s>
How do I cover up the white spots on my cast iron stove</s>
Michelin Three Star Restaurant but if the chef is not there</s>
......
```
 
Each line should therefore end in **EOS (end of sentence)**. Moreover, if a **newline** exists at the end of the file, the function will return an additional vector. Thus the user should **make sure** that the input file does **not** include empty lines at the end of the file.
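
One way to guard against trailing empty lines is to clean the input file before passing it to *fasttext_interface*. The following is a minimal sketch in base R. The file names and the two placeholder sentences are hypothetical, and the cleaned file is written without a newline after the last line, in line with the remark above.

```R
# hypothetical raw input with two empty lines at the end
path_in = file.path(tempdir(), "text_sentence_raw.txt")
writeLines(c("first sentence", "second sentence", "", ""), path_in)

lines = readLines(path_in)

# position of the last non-empty line; everything after it is dropped
non_empty = which(nzchar(trimws(lines)))
last = if (length(non_empty) > 0) max(non_empty) else 0
lines = lines[seq_len(last)]

# write the cleaned sentences back without a newline after the last line
path_out = file.path(tempdir(), "text_sentence_clean.txt")
cat(paste(lines, collapse = "\n"), file = path_out)

print(length(lines))
```

The cleaned file (`path_out`) could then be used as the *path_input* of the *'print-sentence-vectors'* command.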
 
The *'print-ngrams'* command prints the n-grams of a word in the R session or saves them to a file. First, however, the user should train and save a model with n-grams enabled (*minn*, *maxn* parameters),
 
```R
#----------------------------------------
# 'skipgram' function with n-gram enabled
#----------------------------------------
list_params = list(command = 'skipgram', 
                   lr = 0.1,
                   dim = 50,
                   input = "example_text.txt",
                   output = file.path(tempdir(), 'word_vectors'),
                   verbose = 2, 
                   thread = 1,
                   minn = 2, 
                   maxn = 2)
res = fasttext_interface(list_params, 
                         path_output = file.path(tempdir(), 'skipgram_logs.txt'),
                         MilliSecs = 5)
#-----------------------
# 'print-ngrams' function
#-----------------------
list_params = list(command = 'print-ngrams',
                   model = file.path(tempdir(), 'word_vectors.bin'),
                   word = 'word')
# save output to file
res = fasttext_interface(list_params, 
                         path_output = file.path(tempdir(), 'ngrams.txt'))
                         
#  print output to console
res = fasttext_interface(list_params, 
                         path_output = "")      
# truncated output for the 'word' query
#--------------------------------------
 0.00749 0.00720 0.01171 0.01258  ......
```
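
To make the role of *minn* and *maxn* more concrete, the following simplified sketch lists the character n-grams of a word after adding the boundary symbols `<` and `>`. It is only an illustration written in base R and not the implementation of the fasttext library; `char_ngrams` is a hypothetical helper.

```R
# hypothetical helper: character n-grams of a word for lengths minn..maxn
char_ngrams = function(word, minn, maxn) {
  stopifnot(minn >= 1, minn <= maxn)
  padded = paste0("<", word, ">")        # add the word-boundary symbols
  len = nchar(padded)
  out = character(0)
  for (n in minn:maxn) {
    if (n > len) next                    # n-gram longer than the padded word
    starts = seq_len(len - n + 1)
    out = c(out, substring(padded, starts, starts + n - 1))
  }
  out
}

# same settings as in the 'skipgram' call above (minn = 2, maxn = 2)
print(char_ngrams("word", minn = 2, maxn = 2))
```

With *minn = 2* and *maxn = 2* the query 'word' is split into five bigrams, each of which gets its own vector in the output of *'print-ngrams'*.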
 
The *'nn'* command returns the nearest neighbors for a specific word based on the input model,
 
```R
#--------------
# 'nn' function
#--------------
list_params = list(command = 'nn',
                   model = file.path(tempdir(), 'model_cooking.bin'),
                   k = 5,
                   query_word = 'sauce')
res = fasttext_interface(list_params, 
                         path_output = file.path(tempdir(), 'nearest.txt'))
# 'nearest.txt'
#--------------
rice 0.804595
 0.799858
Vide 0.78893
store 0.788918
cheese 0.785977
```
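
The idea behind a nearest-neighbors query can be sketched with a few toy vectors. The example below ranks words by cosine similarity to the query word. The vectors are made up, `nearest` is a hypothetical helper, and the use of cosine similarity is an assumption of this sketch rather than a description of the library internals.

```R
# toy 3-dimensional word vectors (made up for illustration)
toy_vecs = rbind(sauce  = c(1.0, 0.0, 1.0),
                 gravy  = c(0.9, 0.1, 1.1),
                 bread  = c(0.0, 1.0, 0.0),
                 cheese = c(1.0, 0.5, 0.5))

cosine = function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# hypothetical helper: the k words most similar to 'query' (the query itself is excluded)
nearest = function(vecs, query, k) {
  others = setdiff(rownames(vecs), query)
  sims = sapply(others, function(w) cosine(vecs[query, ], vecs[w, ]))
  head(sort(sims, decreasing = TRUE), k)
}

res_nn = nearest(toy_vecs, query = "sauce", k = 2)
print(round(res_nn, 4))
```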
 
The *'analogies'* command works for triplets of words (separated by whitespace) and returns 'k' rows for each line (triplet) of the input file (separated by an empty line),
 
```R
#---------------------
# 'analogies' function
#---------------------
list_params = list(command = 'analogies',
                   model = file.path(tempdir(), 'model_cooking.bin'),
                   k = 5)
res = fasttext_interface(list_params, 
                         path_input = 'analogy_queries.txt', 
                         path_output = file.path(tempdir(), 'analogies_output.txt'))
# 'analogies_output.txt'
#-----------------------
batter 0.857213
I 0.854491
recipe? 0.851498
substituted 0.845269
flour 0.842508
covered 0.808651
calls 0.801348
fresh 0.800051
cold 0.797468
always 0.793695
.............
```
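
An analogy query for a triplet A, B, C is commonly answered by searching for the words closest to the vector A - B + C. The following toy sketch shows this arithmetic in base R. The vectors are made up, `analogy` is a hypothetical helper, and both the vector arithmetic and the cosine ranking are assumptions of the sketch.

```R
# toy 3-dimensional word vectors (made up for illustration)
toy_vecs = rbind(germany = c(1.0, 0.0, 0.0),
                 berlin  = c(1.0, 0.0, 1.0),
                 france  = c(0.0, 1.0, 0.0),
                 paris   = c(0.0, 1.0, 1.0),
                 rome    = c(0.5, 0.5, 1.0))

cosine = function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# hypothetical helper: rank the remaining words by similarity to w1 - w2 + w3
analogy = function(vecs, w1, w2, w3, k) {
  target = vecs[w1, ] - vecs[w2, ] + vecs[w3, ]
  candidates = setdiff(rownames(vecs), c(w1, w2, w3))
  sims = sapply(candidates, function(w) cosine(target, vecs[w, ]))
  head(sort(sims, decreasing = TRUE), k)
}

res_analogy = analogy(toy_vecs, "berlin", "germany", "france", k = 2)
print(round(res_analogy, 4))
```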
 
Finally, the *'dump'* command takes as *'option'* one of *'args'*, *'dict'*, *'input'* or *'output'* and dumps the corresponding data to a text file,
 
```R
#--------------
# dump function
#--------------
list_params = list(command = 'dump',
                   model = file.path(tempdir(), 'model_cooking.bin'),
                   option = 'args')
res = fasttext_interface(list_params, 
                         path_output = file.path(tempdir(), 'dump_data.txt'), 
                         remove_previous_file = TRUE)
                         
# 'dump_data.txt'
#----------------
dim 50
ws 5
epoch 5
minCount 1
neg 5
wordNgrams 1
loss softmax
model sup
bucket 0
minn 0
maxn 0
lrUpdateRate 100
t 0.00010
```
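
The dumped 'args' are plain 'name value' pairs, so they can be read back into a named vector for further use in R. A minimal sketch, using a few of the lines shown above:

```R
# a few lines copied from 'dump_data.txt'
dump_lines = c("dim 50", "ws 5", "epoch 5", "loss softmax", "t 0.00010")

# first token is the parameter name, second token its value (kept as character)
parts = strsplit(dump_lines, " ", fixed = TRUE)
args_dump = setNames(sapply(parts, `[`, 2), sapply(parts, `[`, 1))

print(args_dump[["loss"]])
print(as.integer(args_dump[["dim"]]))
print(as.numeric(args_dump[["t"]]))
```

The values are kept as character strings because the parameters have mixed types, so each one is converted only when it is needed.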
	