| Title: | Text Analysis for All | 
| Version: | 0.4.0 | 
| Description: | An R 'shiny' app designed for diverse text analysis tasks, offering a wide range of methodologies tailored to Natural Language Processing (NLP) needs. It is a versatile, general-purpose tool for analyzing textual data. 'tall' features a comprehensive workflow, including data cleaning, preprocessing, statistical analysis, and visualization, all integrated for effective text analysis. | 
| License: | MIT + file LICENSE | 
| Encoding: | UTF-8 | 
| RoxygenNote: | 7.3.3 | 
| URL: | https://github.com/massimoaria/tall, https://www.k-synth.com/tall/ | 
| BugReports: | https://github.com/massimoaria/tall/issues | 
| Depends: | R (≥ 3.5.0), shiny | 
| Imports: | base64enc, ca, chromote, curl (≥ 6.3.0), doParallel, dplyr (≥ 1.1.0), DT, fontawesome, ggraph, graphics, httr2, igraph, jsonlite, later, openxlsx, pagedown, parallel, pdftools (≥ 3.6.0), plotly, promises, purrr, Rcpp (≥ 1.0.3), readr, readtext, readxl, rlang, RSpectra, shinycssloaders (≥ 1.1.0), shinydashboardPlus, shinyFiles, shinyjs, shinyWidgets, sparkline, stringr, strucchange, textrank, tidygraph, tidyr, topicmodels, udpipe, umap, visNetwork, word2vec | 
| LazyData: | true | 
| LinkingTo: | Rcpp | 
| NeedsCompilation: | yes | 
| Packaged: | 2025-10-23 16:00:34 UTC; massimoaria | 
| Author: | Massimo Aria [aut, cre, cph] (0000-0002-8517-9411), Maria Spano | 
| Maintainer: | Massimo Aria <aria@unina.it> | 
| Repository: | CRAN | 
| Date/Publication: | 2025-10-23 16:10:02 UTC | 
Lemmatized Text of Moby-Dick (Chapters 1-10)
Description
This dataset contains the lemmatized version of the first 10 chapters of the novel Moby-Dick by Herman Melville. The data is structured as a dataframe with multiple linguistic annotations.
Usage
data(mobydick)
Format
A data frame with multiple rows and 26 columns:
- doc_id (Character): Unique document identifier 
- paragraph_id (Integer): Paragraph index within the document 
- sentence_id (Integer): Sentence index within the paragraph 
- sentence (Character): Original sentence text 
- start (Integer): Start position of the token in the sentence 
- end (Integer): End position of the token in the sentence 
- term_id (Integer): Unique term identifier 
- token_id (Integer): Token index in the sentence 
- token (Character): Original token (word) 
- lemma (Character): Lemmatized form of the token 
- upos (Character): Universal POS tag 
- xpos (Character): Language-specific POS tag 
- feats (Character): Morphological features 
- head_token_id (Integer): Head token in dependency tree 
- dep_rel (Character): Dependency relation label 
- deps (Character): Enhanced dependency relations 
- misc (Character): Additional information 
- folder (Character): Folder containing the document 
- split_word (Character): The word used to separate the chapters in the original book 
- filename (Character): Source file name 
- doc_selected (Logical): Whether the document is selected 
- POSSelected (Logical): Whether POS was selected 
- sentence_hl (Character): Highlighted sentence 
- docSelected (Logical): Whether the document was manually selected 
- noHapax (Logical): Whether hapax legomena were removed 
- noSingleChar (Logical): Whether single-character words were removed 
- lemma_original_nomultiwords (Character): Lemmatized form without multi-word units 
Source
Extracted and processed from the text of Moby-Dick by Herman Melville.
Examples
data(mobydick)
head(mobydick)
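As a quick sketch of how the annotation columns can be used (assuming the dplyr package is available; not part of the package's own examples), the POS and lemma columns support simple frequency tables:

```r
library(dplyr)
data(mobydick)

# Frequency of each universal POS tag
mobydick %>%
  count(upos, sort = TRUE) %>%
  head(5)

# Most frequent lemmas among nouns only
mobydick %>%
  filter(upos == "NOUN") %>%
  count(lemma, sort = TRUE) %>%
  head(10)
```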
Plot Terms by Cluster
Description
This function creates a horizontal bar plot to visualize the most significant terms for each cluster, based on their Chi-squared statistics.
Usage
reinPlot(terms, nPlot = 10)
Arguments
| terms | A data frame containing terms and their associated statistics, such as Chi-squared values, generated by the term_per_cluster function. | 
| nPlot | Integer. The number of top terms to plot for each sign (positive and negative). Default is 10. | 
Details
The function organizes the input data by Chi-squared values and selects the top terms for each sign. The plot uses different colors for positive and negative terms, with hover tooltips providing detailed information.
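The selection step described above can be sketched with dplyr (a hypothetical reimplementation, not the function's actual code; the column names chi_square and sign are assumptions about the structure of the terms data frame):

```r
library(dplyr)

# Hypothetical sketch: for each sign ("positive" / "negative"),
# keep the nPlot terms with the largest absolute Chi-squared value.
select_top_terms <- function(terms, nPlot = 10) {
  terms %>%
    group_by(sign) %>%
    slice_max(abs(chi_square), n = nPlot) %>%  # strongest associations first
    ungroup() %>%
    arrange(sign, desc(chi_square))
}
```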
Value
An interactive horizontal bar plot (using plotly) displaying the top terms for each cluster. The plot includes:
- Bars representing the Chi-squared values of terms. 
- Hover information displaying the term and its Chi-squared value. 
Examples
## Not run: 
data(mobydick)
res <- reinert(
  x = mobydick,
  k = 10,
  term = "token",
  segment_size = 40,
  min_segment_size = 5,
  min_split_members = 10,
  cc_test = 0.3,
  tsj = 3
)
tc <- term_per_cluster(res, cutree = NULL, k = 1, negative = FALSE)
fig <- reinPlot(tc$terms, nPlot = 10)
## End(Not run)
Summarize Reinert Clustering Results
Description
This function summarizes the results of the Reinert clustering algorithm, including the most frequent documents and significant terms for each cluster.
The input is the result returned by the term_per_cluster function.
Usage
reinSummary(tc, n = 10)
Arguments
| tc | A list returned by the term_per_cluster function. | 
| n | Integer. The number of top terms (based on Chi-squared value) to include in the summary for each cluster and sign. Default is 10. | 
Details
This function performs the following steps:
- Extracts the most frequent document for each cluster. 
- Summarizes the number of documents per cluster. 
- Selects the top n terms for each cluster, separated by positive and negative signs. 
- Combines the terms and segment information into a final summary table. 
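The steps above can be sketched as follows (a hypothetical reimplementation with assumed column names such as cluster, doc_id, sign, chi_square, and term; not the package's actual code):

```r
library(dplyr)

# Hypothetical sketch of the summary construction from a term_per_cluster result.
summarize_clusters <- function(tc, n = 10) {
  # Most frequent document and document count per cluster
  docs <- tc$segments %>%
    group_by(cluster) %>%
    summarise(
      most_frequent_doc = names(which.max(table(doc_id))),
      n_documents = n_distinct(doc_id)
    )
  # Top-n terms per cluster and sign, concatenated into one string
  terms <- tc$terms %>%
    group_by(cluster, sign) %>%
    slice_max(chi_square, n = n) %>%
    summarise(terms = paste(term, collapse = "; "), .groups = "drop")
  left_join(docs, terms, by = "cluster")
}
```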
Value
A data frame summarizing the clustering results. The table includes:
- cluster: The cluster ID. 
- Positive terms: The top n positive terms for each cluster, concatenated into a single string. 
- Negative terms: The top n negative terms for each cluster, concatenated into a single string. 
- Most frequent document: The document ID that appears most frequently in each cluster. 
- N. of Documents per Cluster: The number of documents in each cluster. 
Examples
data(mobydick)
res <- reinert(
  x = mobydick,
  k = 10,
  term = "token",
  segment_size = 40,
  min_segment_size = 5,
  min_split_members = 10,
  cc_test = 0.3,
  tsj = 3
)
tc <- term_per_cluster(res, cutree = NULL, k = 1:10, negative = FALSE)
S <- reinSummary(tc, n = 10)
head(S, 10)
Segment clustering based on the Reinert method - Simple clustering
Description
Segment clustering based on the Reinert method - Simple clustering
Usage
reinert(
  x,
  k = 10,
  term = "token",
  segment_size = 40,
  min_segment_size = 3,
  min_split_members = 5,
  cc_test = 0.3,
  tsj = 3
)
Arguments
| x | tall data frame of documents | 
| k | maximum number of clusters to compute | 
| term | indicates the type of form "lemma" or "token". Default value is term = "lemma". | 
| segment_size | Number of forms per segment. Default value is segment_size = 40. | 
| min_segment_size | Minimum number of forms per segment. Default value is min_segment_size = 3. | 
| min_split_members | Minimum number of segments in a cluster. | 
| cc_test | Contingency coefficient value for feature selection. | 
| tsj | Minimum frequency value for feature selection. | 
Details
See the references for original articles on the method. Special thanks to the authors of the rainette package (https://github.com/juba/rainette) for inspiring the coding approach used in this function.
Value
An object inheriting from both classes hclust and reinert_tall.
References
- Reinert M, Une méthode de classification descendante hiérarchique: application à l'analyse lexicale par contexte, Cahiers de l'analyse des données, Volume 8, Numéro 2, 1983. https://www.numdam.org/item/?id=CAD_1983__8_2_187_0 
- Reinert M., Alceste une méthodologie d'analyse des données textuelles et une application: Aurelia De Gerard De Nerval, Bulletin de Méthodologie Sociologique, Volume 26, Numéro 1, 1990. doi:10.1177/075910639002600103 
- Barnier J., Privé F., rainette: The Reinert Method for Textual Data Clustering, 2023, doi:10.32614/CRAN.package.rainette 
Examples
data(mobydick)
res <- reinert(
  x = mobydick,
  k = 10,
  term = "token",
  segment_size = 40,
  min_segment_size = 5,
  min_split_members = 10,
  cc_test = 0.3,
  tsj = 3
)
TALL UI
Description
tall performs text analysis for all.
Usage
tall(
  host = "127.0.0.1",
  port = NULL,
  launch.browser = TRUE,
  maxUploadSize = 1000
)
Arguments
| host | The IPv4 address that the application should listen on. Defaults to the shiny.host option, if set, or "127.0.0.1" if not. | 
| port | The TCP port that the application should listen on. If the port is not specified and the shiny.port option is set (with options(shiny.port = XX)), that port will be used; otherwise, a random port is chosen. | 
| launch.browser | If TRUE, the system's default web browser will be launched automatically after the app is started. Defaults to TRUE in interactive sessions only. The value of this parameter can also be a function to call with the application's URL. | 
| maxUploadSize | Integer. The maximum upload file size in megabytes. Default value is 1000. | 
Value
No return value, called for side effects.
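A minimal launch example (wrapped in the usual not-run guard, since it starts an interactive Shiny session):

```r
## Not run: 
library(tall)

# Launch the app on localhost with a fixed port and a 2 GB upload limit
tall(host = "127.0.0.1", port = 8080, maxUploadSize = 2000)
## End(Not run)
```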
Extract Terms and Segments for Document Clusters
Description
This function processes the results of a document clustering algorithm based on the Reinert method. It computes the terms and their significance for each cluster, as well as the associated document segments.
Usage
term_per_cluster(res, cutree = NULL, k = 1, negative = TRUE)
Arguments
| res | A list containing the results of the Reinert clustering algorithm, as returned by the reinert function. | 
| cutree | A custom cutree structure. If NULL (the default), the clustering structure stored in res is used. | 
| k | A vector of integers specifying the clusters to analyze. Default is 1. | 
| negative | Logical. If TRUE, terms negatively associated with a cluster are included in the results. Default is TRUE. | 
Details
The function integrates document-term matrix rows for missing segments, calculates term statistics for each cluster,
and filters terms based on their significance. Terms can be excluded based on their significance (signExcluded).
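As background on the statistic involved, the Chi-squared value used to rank terms measures the association between a term and a cluster via a 2x2 contingency table of segment counts (a generic sketch of the idea, not the package's internal code):

```r
# Generic sketch: Chi-squared association between a term and a cluster.
# n11: segments in the cluster containing the term
# n12: segments in the cluster without the term
# n21: segments outside the cluster containing the term
# n22: segments outside the cluster without the term
term_chisq <- function(n11, n12, n21, n22) {
  tab <- matrix(c(n11, n12, n21, n22), nrow = 2, byrow = TRUE)
  unname(chisq.test(tab, correct = FALSE)$statistic)
}

term_chisq(30, 70, 20, 380)  # larger values indicate stronger association
```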
Value
A list with the following components:
| terms | A data frame of significant terms for each cluster. | 
| segments | A data frame of document segments associated with each cluster. | 
Examples
data(mobydick)
res <- reinert(
  x = mobydick,
  k = 10,
  term = "token",
  segment_size = 40,
  min_segment_size = 5,
  min_split_members = 10,
  cc_test = 0.3,
  tsj = 3
)
tc <- term_per_cluster(res, cutree = NULL, k = 1:10, negative = FALSE)
head(tc$segments, 10)
head(tc$terms, 10)