---
title: "The Means Function"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{The Means Function}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
The `proc_means()` function simulates a SASĀ® PROC MEANS procedure. It is used to
generate summary statistics on numeric variables. The function is both
interactive and returns datasets.
## Create Sample Data
Let's again create some sample data. This sample data is identical to that
created for the `proc_freq()` tutorial, but has an additional grouping
variable "g":
```{r eval=FALSE, echo=TRUE}
# Create sample data
dat <- read.table(header = TRUE,
text = 'x y z g
6 A 60 P
6 A 70 P
2 A 100 P
2 B 10 P
3 B 67 Q
2 C 81 Q
3 C 63 Q
5 C 55 Q')
# View sample data
dat
x y z g
1 6 A 60 P
2 6 A 70 P
3 2 A 100 P
4 2 B 10 P
5 3 B 67 Q
6 2 C 81 Q
7 3 C 63 Q
8 5 C 55 Q
```
## Get Summary Statistics
If no parameters are specified, the `proc_means()` function will calculate
N, Means, Standard Deviation, Minimum, and Maximum on all numeric variables.
Note that the `options` statement has been added to pass CRAN checks.
When you are running code samples, this statement may be omitted.
```{r eval=FALSE, echo=TRUE}
# Turn off printing for CRAN checks
options("procs.print" = FALSE)
# No parameters
proc_means(dat)
```
## Selected Variables
If you don't want statistics on all numeric variables, you may specify
variables on the `var` parameter:
```{r eval=FALSE, echo=TRUE}
# Specific variable
proc_means(dat, var = x)
```
## Statistics Options
The `proc_means()` function has a `stats` parameter that allows you to control
which statistics are generated. There are many statistics keywords. Here is
a sample of some of the most frequently used keywords:
| Keyword | Description |
|---|---|
| N | Number of Observations |
| NMISS | Number of missing observations |
| MEAN | Arithmetic mean |
| STD | Standard Deviation |
| MIN | Minimum |
| MAX | Maximum |
| SUM | Sum of observations |
| MEDIAN | 50th percentile |
| P1 | 1st percentile |
| P5 | 5st percentile |
| P10 | 10th percentile |
| P90 | 90th percentile |
| P95 | 95th percentile |
| P99 | 99th percentile |
| Q1 | First Quartile |
| Q3 | Third Quartile |
## Output Datasets
Similar to the `proc_freq()` function, `proc_means()` can return datasets.
There are three options: "out", "report", and "none". The
"out" option returns datasets meant for further manipulation and analysis, and
is the default.
The "report" keyword requests the exact datasets used in the interactive
report. Specifying either one of these options will cause the function
to return data.
Here is an example that shows the difference in the "report" and "out" options:
```{r eval=FALSE, echo=TRUE}
# Output dataset using "report" option
res1 <- proc_means(dat,
stats = v(median, sum, q1, q3),
output = report)
# View results
res1
# VAR MEDIAN SUM Q1 Q3
# 1 x 3 29 2.0 5.5
# 2 z 65 506 57.5 75.5
# Output dataset using "out" option
res2 <- proc_means(dat,
stats = v(median, sum, q1, q3),
output = out)
# View results
res2
# TYPE FREQ VAR MEDIAN SUM Q1 Q3
# 1 0 8 x 3 29 2.0 5.5
# 2 0 8 z 65 506 57.5 75.5
```
As can be seen in the above example, the "out" dataset includes additional
variables for TYPE and FREQ. These additional variables can be
turned off with options:
```{r eval=FALSE, echo=TRUE}
# Turn off TYPE and FREQ variables
res3 <- proc_means(dat,
stats = v(median, sum, q1, q3),
output = all,
options = v(notype, nofreq))
# View results
res3
# VAR MEDIAN SUM Q1 Q3
# 1 x 3 29 2.0 5.5
# 2 z 65 506 57.5 75.5
```
## Grouping
The `proc_means()` function provides two grouping parameters: `class` and `by`.
These parameters identify a variable or variables for subsetting the input
data. While these parameters have similar capabilities, there are some
difference between them. The differences can be examined by comparing
the two function calls.
### Class
```{r eval=FALSE, echo=TRUE}
# Class grouping
res1 <- proc_means(dat, stats = v(median, sum, q1, q3),
class = y, options = v(maxdec = 4))
```
Below is the output dataset from the `class` parameter. Notice that
summary values have been provided for each variable, in addition to the subsets
by the class variable. The summary rows are identifed by TYPE = 0, while the
subset rows are TYPE = 1.
```{r eval=FALSE, echo=TRUE}
# View results - class
res1
# CLASS TYPE FREQ VAR MEDIAN SUM Q1 Q3
# 1
Notice that with the `by` parameter, separate tables are created for each
by group on the interactive report.
Now let's look at the output dataset:
```{r eval=FALSE, echo=TRUE}
# View results - by
res2
# BY TYPE FREQ VAR MEDIAN SUM Q1 Q3
# 1 A 0 3 x 6.0 14 2 6
# 2 A 0 3 z 70.0 230 60 100
# 3 B 0 2 x 2.5 5 2 3
# 4 B 0 2 z 38.5 77 10 67
# 5 C 0 3 x 3.0 10 2 5
# 6 C 0 3 z 63.0 199 55 81
```
The output dataset is also different from the `class` output.
While the TYPE variables exists on the output dataset,
the output data for the by group does not include the summary rows. The
summary rows are a feature of the `class` variable. You should select
the grouping parameter that most suits your needs.
## Multiple Groups
### Class
The `proc_means()` function can perform analysis with multiple grouping variables.
First let's examine what happens when we pass multiple grouping variables to
the `class` parameter:
```{r eval=FALSE, echo=TRUE}
# Class grouping - two variables
res1 <- proc_means(dat, stats = v(median, sum, q1, q3),
class = v(g, y), options = v(maxdec = 0))
```
Here is the output dataset:
```{r eval=FALSE, echo=TRUE}
# View results - two class variables
res1
# CLASS1 CLASS2 TYPE FREQ VAR MEDIAN SUM Q1 Q3
# 1
Here is the output dataset for the `by` parameter:
```{r eval=FALSE, echo=TRUE}
# View results - two by variables
res2
# BY1 BY2 TYPE FREQ VAR MEDIAN SUM Q1 Q3
# 1 P A 0 3 x 6 14 2 6
# 2 P A 0 3 z 70 230 60 100
# 3 P B 0 1 x 2 2 2 2
# 4 P B 0 1 z 10 10 10 10
# 5 Q B 0 1 x 3 3 3 3
# 6 Q B 0 1 z 67 67 67 67
# 7 Q C 0 3 x 3 10 2 5
# 8 Q C 0 3 z 63 199 55 81
```
### By and Class
Finally, let's see what happens when we specify both `by` and `class`
parameters:
```{r eval=FALSE, echo=TRUE}
# By grouping - by and class
res3 <- proc_means(dat, stats = v(median, sum, q1, q3),
by = g,
class = y,
options = v(maxdec = 0))
```
```{r eval=FALSE, echo=TRUE}
# View results - by and class
res3
# BY CLASS TYPE FREQ VAR MEDIAN SUM Q1 Q3
# 1 P