This vignette explains how to use functions for “melting” wide data tables, i.e. converting to tall/long data tables. To clarify the discussion we introduce the following three terms:
measure.vars in reshape2/data.table.id.vars in
reshape2/data.table.The nc functions use data.table::melt
internally:
nc::capture_melt_single is useful when you have a wide
data table with lots of columns, all of the same type, that you would
like to melt/reshape into a single result/output column. This uses
melt with a character vector for
measure.vars.nc::capture_melt_multiple is useful when you have a
wide data table with columns of possibly different types that you would
like to melt/reshape into multiple result/output columns. This uses
melt with a list for measure.vars.Both are useful when you want to use a regular expression to specify both (1) the set of input columns to reshape and (2) some information to extract from those column names.
Sometimes you want to melt a “wide” data table which has several distinct pieces of information encoded in each column name. One example is the familiar iris data, which have flower part and measurement dimension encoded in each of four column names:
head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosaThose four reshape column names can be specified via a regex in
nc::capture_melt_single. The first argument is the input
data table to reshape, and the subsequent arguments are interpreted as a
pattern which is passed to nc::capture_first_vec. Any input
column names which match the specified regex will be passed as
measure.vars to melt:
(iris.tall <- nc::capture_melt_single(
  iris,
  part=".*",
  "[.]",
  dim=".*",
  value.name="cm"))
#>        Species   part    dim    cm
#>         <fctr> <char> <char> <num>
#>   1:    setosa  Sepal Length   5.1
#>   2:    setosa  Sepal Length   4.9
#>   3:    setosa  Sepal Length   4.7
#>   4:    setosa  Sepal Length   4.6
#>   5:    setosa  Sepal Length   5.0
#>  ---                              
#> 596: virginica  Petal  Width   2.3
#> 597: virginica  Petal  Width   1.9
#> 598: virginica  Petal  Width   2.0
#> 599: virginica  Petal  Width   2.3
#> 600: virginica  Petal  Width   1.8Note the output above has one copy column (Species), two capture
columns (part, dim), and one reshape column (cm). Internally the
function joins the result of nc::capture_first_vec (on
column names) to the result of melt (on the data).
The reshaped data can be plotted with different parts on rows and different dimensions on columns:
if(require(ggplot2)){
  ggplot()+
    theme_bw()+
    theme(panel.spacing=grid::unit(0, "lines"))+
    facet_grid(part ~ dim)+
    geom_bar(aes(cm, fill=Species), data=iris.tall)
}
#> Le chargement a nécessité le package : ggplot2We could instead use capture_melt_multiple to get
multiple output columns. Like capture_melt_single, the
first argument of capture_melt_multiple is the subject data
table and the following arguments form a pattern which is matched to the
input data column names. However the pattern must have at least two
groups:
column. This group is used to
capture the part of the input column names which will be used for the
output column names. The output will contain one column for each unique
value captured in this group.(iris.part.cols <- nc::capture_melt_multiple(
  iris,
  column=".*",
  "[.]",
  dim=".*"))
#>        Species    dim Petal Sepal
#>         <fctr> <char> <num> <num>
#>   1:    setosa Length   1.4   5.1
#>   2:    setosa Length   1.4   4.9
#>   3:    setosa Length   1.3   4.7
#>   4:    setosa Length   1.5   4.6
#>   5:    setosa Length   1.4   5.0
#>  ---                             
#> 296: virginica  Width   2.3   3.0
#> 297: virginica  Width   1.9   2.5
#> 298: virginica  Width   2.0   3.0
#> 299: virginica  Width   2.3   3.4
#> 300: virginica  Width   1.8   3.0Note that the reshaped table above contains one copy column (Species), one capture column (dim), and two reshape columns (Petal, Sepal). We can plot these data to see whether or not sepals are bigger than petals:
if(require(ggplot2)){
  ggplot()+
    theme_bw()+
    theme(panel.spacing=grid::unit(0, "lines"))+
    facet_grid(dim ~ Species)+
    coord_equal()+
    geom_abline(slope=1, intercept=0, color="grey")+
    geom_point(aes(
      Petal, Sepal),
      data=iris.part.cols)
}It is clear from the plot above that sepals are indeed both longer and wider than petals, on each measured plant.
Another data set where it is useful to do column name pattern matching followed by melting is the World Health Organization data:
if(requireNamespace("tidyr")){
  data(who, package="tidyr")
}else{
  who <- data.frame(id=1, new_sp_m5564=2, newrel_f65=3)
}
#> Le chargement a nécessité le package : tidyr
names(who)
#>  [1] "country"      "iso2"         "iso3"         "year"         "new_sp_m014" 
#>  [6] "new_sp_m1524" "new_sp_m2534" "new_sp_m3544" "new_sp_m4554" "new_sp_m5564"
#> [11] "new_sp_m65"   "new_sp_f014"  "new_sp_f1524" "new_sp_f2534" "new_sp_f3544"
#> [16] "new_sp_f4554" "new_sp_f5564" "new_sp_f65"   "new_sn_m014"  "new_sn_m1524"
#> [21] "new_sn_m2534" "new_sn_m3544" "new_sn_m4554" "new_sn_m5564" "new_sn_m65"  
#> [26] "new_sn_f014"  "new_sn_f1524" "new_sn_f2534" "new_sn_f3544" "new_sn_f4554"
#> [31] "new_sn_f5564" "new_sn_f65"   "new_ep_m014"  "new_ep_m1524" "new_ep_m2534"
#> [36] "new_ep_m3544" "new_ep_m4554" "new_ep_m5564" "new_ep_m65"   "new_ep_f014" 
#> [41] "new_ep_f1524" "new_ep_f2534" "new_ep_f3544" "new_ep_f4554" "new_ep_f5564"
#> [46] "new_ep_f65"   "newrel_m014"  "newrel_m1524" "newrel_m2534" "newrel_m3544"
#> [51] "newrel_m4554" "newrel_m5564" "newrel_m65"   "newrel_f014"  "newrel_f1524"
#> [56] "newrel_f2534" "newrel_f3544" "newrel_f4554" "newrel_f5564" "newrel_f65"Each column which starts with new has three distinct
pieces of information encoded in its name: diagnosis type (e.g. sp or
rel), gender (m or f), and age range (e.g. 5564 or 1524). We would like
to use a regex to match these column names, then using the matching
columns as measure.vars in a melt, then join the two results.
new.diag.gender <- list(
  "new_?",
  diagnosis=".*",
  "_",
  gender=".")
nc::capture_melt_single(who, new.diag.gender, ages=".*")
#>                          country   iso2   iso3  year diagnosis gender   ages
#>                           <char> <char> <char> <num>    <char> <char> <char>
#>     1:               Afghanistan     AF    AFG  1997        sp      m    014
#>     2:               Afghanistan     AF    AFG  1998        sp      m    014
#>     3:               Afghanistan     AF    AFG  1999        sp      m    014
#>     4:               Afghanistan     AF    AFG  2000        sp      m    014
#>     5:               Afghanistan     AF    AFG  2001        sp      m    014
#>    ---                                                                      
#> 76042:                  Viet Nam     VN    VNM  2013       rel      f     65
#> 76043: Wallis and Futuna Islands     WF    WLF  2013       rel      f     65
#> 76044:                     Yemen     YE    YEM  2013       rel      f     65
#> 76045:                    Zambia     ZM    ZMB  2013       rel      f     65
#> 76046:                  Zimbabwe     ZW    ZWE  2013       rel      f     65
#>        value
#>        <num>
#>     1:     0
#>     2:    30
#>     3:     8
#>     4:    52
#>     5:   129
#>    ---      
#> 76042:  3110
#> 76043:     2
#> 76044:   360
#> 76045:   669
#> 76046:   725Note the output includes the new reshape column called
value by default, as in melt. The input
reshape column names which matched the specified pattern, and there is a
new column for each group in that pattern. The following example shows
how to rename the value column and use numeric type
conversion functions:
years.pattern <- list(new.diag.gender, ages=list(
  min.years="0|[0-9]{2}", as.numeric,
  max.years="[0-9]{0,2}", function(x)ifelse(x=="", Inf, as.numeric(x))))
(who.typed <- nc::capture_melt_single(
  who, years.pattern,
  value.name="count"))
#>                          country   iso2   iso3  year diagnosis gender   ages
#>                           <char> <char> <char> <num>    <char> <char> <char>
#>     1:               Afghanistan     AF    AFG  1997        sp      m    014
#>     2:               Afghanistan     AF    AFG  1998        sp      m    014
#>     3:               Afghanistan     AF    AFG  1999        sp      m    014
#>     4:               Afghanistan     AF    AFG  2000        sp      m    014
#>     5:               Afghanistan     AF    AFG  2001        sp      m    014
#>    ---                                                                      
#> 76042:                  Viet Nam     VN    VNM  2013       rel      f     65
#> 76043: Wallis and Futuna Islands     WF    WLF  2013       rel      f     65
#> 76044:                     Yemen     YE    YEM  2013       rel      f     65
#> 76045:                    Zambia     ZM    ZMB  2013       rel      f     65
#> 76046:                  Zimbabwe     ZW    ZWE  2013       rel      f     65
#>        min.years max.years count
#>            <num>     <num> <num>
#>     1:         0        14     0
#>     2:         0        14    30
#>     3:         0        14     8
#>     4:         0        14    52
#>     5:         0        14   129
#>    ---                          
#> 76042:        65       Inf  3110
#> 76043:        65       Inf     2
#> 76044:        65       Inf   360
#> 76045:        65       Inf   669
#> 76046:        65       Inf   725
str(who.typed)
#> Classes 'data.table' and 'data.frame':   76046 obs. of  10 variables:
#>  $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
#>  $ iso2     : chr  "AF" "AF" "AF" "AF" ...
#>  $ iso3     : chr  "AFG" "AFG" "AFG" "AFG" ...
#>  $ year     : num  1997 1998 1999 2000 2001 ...
#>  $ diagnosis: chr  "sp" "sp" "sp" "sp" ...
#>  $ gender   : chr  "m" "m" "m" "m" ...
#>  $ ages     : chr  "014" "014" "014" "014" ...
#>  $ min.years: num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ max.years: num  14 14 14 14 14 14 14 14 14 14 ...
#>  $ count    : num  0 30 8 52 129 90 127 139 151 193 ...
#>  - attr(*, ".internal.selfref")=<externalptr>Note in the code/result above that non-character captured output columns can be obtained by specifying type conversion functions in the pattern.
Another example is exchange rate data from Eurostat. We first use read the entire data set into R:
ert.gz <- system.file(
  "extdata", "ert_eff_ic_m.tsv.gz", package="nc", mustWork=TRUE)
ert.all <- data.table::fread(ert.gz, na.strings=":")
ert.all[1:5, 1:5]
#>    exch_rt,unit,geo\\time 2020M07 2020M06 2020M05 2020M04
#>                    <char>   <num>   <num>   <num>   <num>
#> 1:       NEER_EA19,I10,AT  100.00  100.00   100.0  100.00
#> 2:       NEER_EA19,I10,AU   88.50   88.39    86.3   83.46
#> 3:       NEER_EA19,I10,BE  100.00  100.00   100.0  100.00
#> 4:       NEER_EA19,I10,BG  100.00  100.00   100.0  100.00
#> 5:       NEER_EA19,I10,BR   38.49   39.94    38.0   40.45We see that the first column has some CSV data which we can parse via:
ert.first <- ert.all[, 1]
csv.lines <- c(sub("\\\\.*", "", names(ert.first)), ert.first[[1]])
ert.first.dt <- data.table::fread(text=paste(csv.lines, collapse="\n"))
ert.wide <- data.table::data.table(ert.first.dt, ert.all[,-1])
ert.wide[1:5, 1:5]
#>      exch_rt   unit    geo 2020M07 2020M06
#>       <char> <char> <char>   <num>   <num>
#> 1: NEER_EA19    I10     AT  100.00  100.00
#> 2: NEER_EA19    I10     AU   88.50   88.39
#> 3: NEER_EA19    I10     BE  100.00  100.00
#> 4: NEER_EA19    I10     BG  100.00  100.00
#> 5: NEER_EA19    I10     BR   38.49   39.94The wide data table can then be melted:
(ert.tall <- nc::capture_melt_single(
  ert.wide,
  year="[0-9]{4}", as.integer,
  "M",
  month="[0-9]{2}", as.integer))
#>         exch_rt   unit    geo  year month   value
#>          <char> <char> <char> <int> <int>   <num>
#>    1: NEER_EA19    I10     AT  2020     7  100.00
#>    2: NEER_EA19    I10     AU  2020     7   88.50
#>    3: NEER_EA19    I10     BE  2020     7  100.00
#>    4: NEER_EA19    I10     BG  2020     7  100.00
#>    5: NEER_EA19    I10     BR  2020     7   38.49
#>   ---                                            
#> 2229: NEER_EA19    I10     BE  1994     1   99.07
#> 2230: NEER_EA19    I10     BG  1994     1 4808.49
#> 2231: NEER_EA19    I10     BR  1994     1 1225.11
#> 2232: NEER_EA19    I10     CA  1994     1   92.19
#> 2233: NEER_EA19    I10     CH  1994     1   83.43After that we can create a time variable and plot via
ert.tall[, month.IDate := data.table::as.IDate(
  sprintf("%d-%d-15", year, month))]
if(require("ggplot2")){
  ggplot()+
    geom_hline(aes(
      yintercept=value),
      color="grey",
      data=data.frame(value=100))+
    geom_line(aes(
      month.IDate, value, color=geo),
      data=ert.tall[geo %in% c("CA", "US", "JP", "FR")])+
    facet_grid(exch_rt ~ .)+
    theme_bw()+
    theme(panel.spacing=grid::unit(0, "lines"))
}Another way to do it would be via
nc::capture_melt_single(ert.wide, month.POSIXct="[0-9].*", function(x){
  as.POSIXct(strptime(paste0(x,"15"), "%YM%m%d"))
})
#>         exch_rt   unit    geo month.POSIXct   value
#>          <char> <char> <char>         <num>   <num>
#>    1: NEER_EA19    I10     AT    1594764000  100.00
#>    2: NEER_EA19    I10     AU    1594764000   88.50
#>    3: NEER_EA19    I10     BE    1594764000  100.00
#>    4: NEER_EA19    I10     BG    1594764000  100.00
#>    5: NEER_EA19    I10     BR    1594764000   38.49
#>   ---                                              
#> 2229: NEER_EA19    I10     BE     758588400   99.07
#> 2230: NEER_EA19    I10     BG     758588400 4808.49
#> 2231: NEER_EA19    I10     BR     758588400 1225.11
#> 2232: NEER_EA19    I10     CA     758588400   92.19
#> 2233: NEER_EA19    I10     CH     758588400   83.43What if the input data set has “missing” input columns?
iris.missing <- iris[, names(iris) != "Sepal.Length"]
head(iris.missing)
#>   Sepal.Width Petal.Length Petal.Width Species
#> 1         3.5          1.4         0.2  setosa
#> 2         3.0          1.4         0.2  setosa
#> 3         3.2          1.3         0.2  setosa
#> 4         3.1          1.5         0.2  setosa
#> 5         3.6          1.4         0.2  setosa
#> 6         3.9          1.7         0.4  setosaIn that case melting into multiple columns is an error by default:
iris.pattern <- list(column=".*", "[.]", dim=".*")
nc::capture_melt_multiple(iris.missing, iris.pattern)
#> Error in (function (subject.names, match.dt, no.match, fill = TRUE) : need dim=same count for each value, but have: Length=1 Width=2; please change pattern, edit input column names, or use fill=TRUE to output missing valuesThe error message explains that the number of input columns for each
value of dim must be the same, but there is one for
Length and two for Width. To ignore the error
and fill the output with missing values,
nc::capture_melt_multiple(iris.missing, iris.pattern, fill=TRUE)
#>        Species    dim Petal Sepal
#>         <fctr> <char> <num> <num>
#>   1:    setosa Length   1.4    NA
#>   2:    setosa Length   1.4    NA
#>   3:    setosa Length   1.3    NA
#>   4:    setosa Length   1.5    NA
#>   5:    setosa Length   1.4    NA
#>  ---                             
#> 296: virginica  Width   2.3   3.0
#> 297: virginica  Width   1.9   2.5
#> 298: virginica  Width   2.0   3.0
#> 299: virginica  Width   2.3   3.4
#> 300: virginica  Width   1.8   3.0Note the missing values in the table above, which correspond to the missing input column in the original/wide data set.
Some real-world data sets can be reshaped into output columns with different types. An example data set from the PROVEDIt benchmark in criminology:
peaks.csv <- system.file(
  "extdata", "RD12-0002_PP16HS_5sec_GM_F_1P.csv",
  package="nc", mustWork=TRUE)
peaks.wide <- data.table::fread(peaks.csv)
print(data.table::data.table(
  names=names(peaks.wide),
  class=sapply(peaks.wide, class)),
  topn=10)
#>            names     class
#>           <char>    <char>
#>   1: Sample File character
#>   2:      Marker character
#>   3:         Dye character
#>   4:    Allele 1 character
#>   5:      Size 1   numeric
#>   6:    Height 1   integer
#>   7:    Allele 2 character
#>   8:      Size 2   numeric
#>   9:    Height 2   integer
#>  10:    Allele 3 character
#>  ---                      
#> 294:   Height 97   logical
#> 295:   Allele 98   logical
#> 296:     Size 98   logical
#> 297:   Height 98   logical
#> 298:   Allele 99   logical
#> 299:     Size 99   logical
#> 300:   Height 99   logical
#> 301:  Allele 100   logical
#> 302:    Size 100   logical
#> 303:  Height 100   logicalThere are 303 columns, with info for 100 peaks. Each peak has three features: Allele=character, Size=numeric, and Height=integer. The ending peaks are class logical because they are all missing. These data can be reshaped via
peaks.tall <- nc::capture_melt_multiple(
  peaks.wide,
  column=".*",
  " ",
  peak="[0-9]+", as.integer,
  na.rm=TRUE)
#> Warning in melt.data.table(L[["data"]], measure.vars = L[["measure.vars"]], :
#> les colonnes 'measure.vars' [Allele 1, Allele 2, Allele 3, Allele 4, ...] ne
#> sont pas toutes du même type. Par ordre hiérarchique, la colonne de valeurs des
#> données transformée sera de type 'character'. Toutes les variables de mesure
#> qui ne sont pas de type 'character' seront également converties
#> automatiquement. Consultez DETAILS dans ?melt.data.table pour en savoir plus
#> sur la conversion automatique.
#> Warning in melt.data.table(L[["data"]], measure.vars = L[["measure.vars"]], :
#> les colonnes 'measure.vars' [Height 1, Height 2, Height 3, Height 4, ...] ne
#> sont pas toutes du même type. Par ordre hiérarchique, la colonne de valeurs des
#> données transformée sera de type 'integer'. Toutes les variables de mesure qui
#> ne sont pas de type 'integer' seront également converties automatiquement.
#> Consultez DETAILS dans ?melt.data.table pour en savoir plus sur la conversion
#> automatique.
#> Warning in melt.data.table(L[["data"]], measure.vars = L[["measure.vars"]], :
#> les colonnes 'measure.vars' [Size 1, Size 2, Size 3, Size 4, ...] ne sont pas
#> toutes du même type. Par ordre hiérarchique, la colonne de valeurs des données
#> transformée sera de type 'double'. Toutes les variables de mesure qui ne sont
#> pas de type 'double' seront également converties automatiquement. Consultez
#> DETAILS dans ?melt.data.table pour en savoir plus sur la conversion
#> automatique.
old.opt <- options(width=90)
print(peaks.tall)
#>                                Sample File  Marker    Dye  peak Allele Height   Size
#>                                     <char>  <char> <char> <int> <char>  <int>  <num>
#>   1: A02-RD12-0002-35-0.5PP16-001.5sec.fsa D3S1358      B     1     OL      5  94.68
#>   2: A02-RD12-0002-35-0.5PP16-001.5sec.fsa    TH01      B     1     OL      3 148.69
#>   3: A02-RD12-0002-35-0.5PP16-001.5sec.fsa  D21S11      B     1     OL      7 195.84
#>   4: A02-RD12-0002-35-0.5PP16-001.5sec.fsa  D18S51      B     1     OL      4 285.51
#>   5: A02-RD12-0002-35-0.5PP16-001.5sec.fsa Penta E      B     1     OL      3 369.36
#>  ---                                                                                
#> 438: A02-RD12-0002-35-0.5PP16-001.5sec.fsa     FGA      Y    25     OL      5 413.28
#> 439: A02-RD12-0002-35-0.5PP16-001.5sec.fsa Penta E      B    26     OL      3 479.21
#> 440: A02-RD12-0002-35-0.5PP16-001.5sec.fsa     FGA      Y    26     OL      4 415.55
#> 441: A02-RD12-0002-35-0.5PP16-001.5sec.fsa     FGA      Y    27     OL      4 428.06
#> 442: A02-RD12-0002-35-0.5PP16-001.5sec.fsa     FGA      Y    28     OL      4 445.08
str(peaks.tall)
#> Classes 'data.table' and 'data.frame':   442 obs. of  7 variables:
#>  $ Sample File: chr  "A02-RD12-0002-35-0.5PP16-001.5sec.fsa" "A02-RD12-0002-35-0.5PP16-001.5sec.fsa" "A02-RD12-0002-35-0.5PP16-001.5sec.fsa" "A02-RD12-0002-35-0.5PP16-001.5sec.fsa" ...
#>  $ Marker     : chr  "D3S1358" "TH01" "D21S11" "D18S51" ...
#>  $ Dye        : chr  "B" "B" "B" "B" ...
#>  $ peak       : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ Allele     : chr  "OL" "OL" "OL" "OL" ...
#>  $ Height     : int  5 3 7 4 3 4 3 197 45 4 ...
#>  $ Size       : num  94.7 148.7 195.8 285.5 369.4 ...
#>  - attr(*, ".internal.selfref")=<externalptr>
options(old.opt)