Introducing the collapse package (v2)

collapse: a fast, flexible, parsimonious package for R.

R
statistics
data.table
Author
Published

October 29, 2024

Introduction

  • collapse
    1. A C/C++-based package designed to make large datasets easier to work with.

    2. Aims to dramatically improve the performance of R code so that large data can be processed quickly and efficiently.

    3. Maximizes performance while providing a stable, optimized user API that integrates with existing data-manipulation frameworks (dplyr, tidyverse, data.table, etc.).

  • MAIN FOCUS -> use collapse together with data.table to speed up computation.

Setup

##setup

#install.packages("collapse")

library(magrittr);library(dplyr);library(data.table) 

library(collapse);library(microbenchmark);library(lmtest)

Basic

As with data.table, csv files are read and written with fread() & fwrite().

# Exam data: 09-15

dt <- fread("https://raw.githubusercontent.com/jinseob2kim/lecture-snuhlab/master/data/example_g1e.csv")
df <- read.csv("https://raw.githubusercontent.com/jinseob2kim/lecture-snuhlab/master/data/example_g1e.csv")
fwrite(dt, "aa.csv")

Columns: fselect() pulls out the desired columns.

fselect(dt, 1:3, 13:16) |> head()
   EXMD_BZ_YYYY RN_INDI HME_YYYYMM  HGHT  WGHT  WSTC   BMI
          <int>   <int>      <int> <int> <int> <int> <num>
1:         2009  562083     200909   144    61    90  29.4
2:         2009  334536     200911   162    51    63  19.4
3:         2009  911867     200903   163    65    82  24.5
4:         2009  183321     200908   152    51    70  22.1
5:         2009  942671     200909   159    50    73  19.8
6:         2009  979358     200912   157    55    73  22.3
fselect(dt, EXMD_BZ_YYYY,RN_INDI,HME_YYYYMM )|> head() # fselect(dt, "EXMD_BZ_YYYY","RN_INDI","HME_YYYYMM" )
   EXMD_BZ_YYYY RN_INDI HME_YYYYMM
          <int>   <int>      <int>
1:         2009  562083     200909
2:         2009  334536     200911
3:         2009  911867     200903
4:         2009  183321     200908
5:         2009  942671     200909
6:         2009  979358     200912

Rows: fsubset() pulls out the desired rows (and, optionally, columns).

fsubset(dt, 1:3)
   EXMD_BZ_YYYY RN_INDI HME_YYYYMM Q_PHX_DX_STK Q_PHX_DX_HTDZ Q_PHX_DX_HTN
          <int>   <int>      <int>        <int>         <int>        <int>
1:         2009  562083     200909            0             0            1
2:         2009  334536     200911            0             0            0
3:         2009  911867     200903            0             0            0
   Q_PHX_DX_DM Q_PHX_DX_DLD Q_PHX_DX_PTB Q_HBV_AG Q_SMK_YN Q_DRK_FRQ_V09N  HGHT
         <int>        <int>        <int>    <int>    <int>          <int> <int>
1:           0            0           NA        3        1              0   144
2:           0            0           NA        2        1              0   162
3:           0            0           NA        3        1              0   163
    WGHT  WSTC   BMI VA_LT VA_RT BP_SYS BP_DIA URN_PROT   HGB   FBS TOT_CHOL
   <int> <int> <num> <num> <num>  <int>  <int>    <int> <num> <int>    <int>
1:    61    90  29.4   0.7   0.8    120     80        1  12.6   117      264
2:    51    63  19.4   0.8   1.0    120     80        1  13.8    96      169
3:    65    82  24.5   0.7   0.6    130     80        1  15.0   118      216
      TG   HDL   LDL  CRTN  SGOT  SGPT   GGT   GFR
   <int> <int> <int> <num> <int> <int> <int> <int>
1:   128    60   179   0.9    25    20    25    59
2:    92    70    80   0.9    18    15    28    74
3:   132    55   134   0.8    26    30    30    79
#fsubset(dt, c(1:3, 13:16)) #rows
fsubset(dt, 1:3, 13:16)  #(dt, row, col)
    HGHT  WGHT  WSTC   BMI
   <int> <int> <int> <num>
1:   144    61    90  29.4
2:   162    51    63  19.4
3:   163    65    82  24.5
fsubset(dt, c(1:nrow(dt)),c(1:3, 13:16)) |> head() #cols
   EXMD_BZ_YYYY RN_INDI HME_YYYYMM  HGHT  WGHT  WSTC   BMI
          <int>   <int>      <int> <int> <int> <int> <num>
1:         2009  562083     200909   144    61    90  29.4
2:         2009  334536     200911   162    51    63  19.4
3:         2009  911867     200903   163    65    82  24.5
4:         2009  183321     200908   152    51    70  22.1
5:         2009  942671     200909   159    50    73  19.8
6:         2009  979358     200912   157    55    73  22.3
# fsubset(dt, EXMD_BZ_YYYY %in% 2009:2012 & BMI >= 25) %>%  fsubset(c(1:3),c(1:3,13:16))
fsubset(dt, c(1:nrow(dt)),c(1:3, 13:16)) %>% fsubset(EXMD_BZ_YYYY %in% 2009:2012 & BMI >= 25) |> head() # same
   EXMD_BZ_YYYY RN_INDI HME_YYYYMM  HGHT  WGHT  WSTC   BMI
          <int>   <int>      <int> <int> <int> <int> <num>
1:         2009  562083     200909   144    61    90  29.4
2:         2009  318669     200904   155    66    78  27.5
3:         2009  668438     200904   160    71    94  27.7
4:         2009  560878     200903   144    58    93  28.0
5:         2009  375694     200906   151    70    94  30.7
6:         2009  446652     200909   158    64    80  25.6
roworder(dt, HGHT) %>% fsubset(EXMD_BZ_YYYY %in% 2009:2012 & BMI >= 25) %>%
  fsubset(c(1:nrow(dt)),c(1:3,13:16)) |> head()
   EXMD_BZ_YYYY RN_INDI HME_YYYYMM  HGHT  WGHT  WSTC   BMI
          <int>   <int>      <int> <int> <int> <int> <num>
1:         2009  562083     200909   144    61    90  29.4
2:         2009  560878     200903   144    58    93  28.0
3:         2011  562083     201111   144    59    88  28.5
4:         2011  519824     201109   145    58    79  27.6
5:         2011  914987     201103   145    70    95  33.3
6:         2012  560878     201208   145    59    85  28.1

Collapse package

So far we have covered row/column handling in collapse. Next come the tools that make computation and data processing in collapse even faster.

Fast Statistical Function

.FAST_STAT_FUN
 # [1]  "fmean"      "fmedian"    "fmode"      "fsum"       "fprod"      
 # [6]  "fsd"        "fvar"       "fmin"       "fmax"       "fnth"       
 # [11] "ffirst"     "flast"      "fnobs"      "fndistinct"

# Not tied to any particular data structure.
v1 <- c(1,2,3,4)
m1 <- matrix(1:50, nrow = 10, ncol = 5)
 
fmean(v1); fmean(m1); fmean(dt)
fmode(v1); fmode(m1); fmode(dt)
# fmean(m1): column-wise means
# collapse is noticeably faster than base R.
x <- rnorm(1e7)
microbenchmark(mean(x), fmean(x), fmean(x, nthreads = 4)) 
Unit: milliseconds
                   expr       min        lq      mean    median        uq
                mean(x) 23.818189 23.881988 23.920430 23.913170 23.947722
               fmean(x) 15.334012 15.392979 15.428531 15.429847 15.449727
 fmean(x, nthreads = 4)  4.025523  6.862578  7.695351  7.864132  8.486929
      max neval cld
 24.09041   100 a  
 15.58527   100  b 
 10.64787   100   c
microbenchmark(colMeans(dt), sapply(dt, mean), fmean(dt))
Unit: microseconds
             expr      min        lq      mean    median       uq      max
     colMeans(dt) 3154.642 3301.1200 3295.8407 3304.7220 3309.081 3593.496
 sapply(dt, mean)  190.719  196.3270  208.5230  206.0745  216.491  270.384
        fmean(dt)   52.694   53.7125   56.1483   55.6745   56.788   86.429
 neval cld
   100 a  
   100  b 
   100   c
  • Even more useful on larger data. (GGDC10S: 5000 rows, 11 cols used, ~10% missing values)
microbenchmark(base = sapply(GGDC10S[6:16], mean, na.rm = TRUE), fmean(GGDC10S[6:16]))
Unit: microseconds
                 expr     min       lq     mean   median       uq      max
                 base 409.223 419.1965 659.1579 433.2085 813.6440 4242.667
 fmean(GGDC10S[6:16])  94.504  95.5710 100.6444 102.1360 103.8795  135.503
 neval cld
   100  a 
   100   b
  • As shown, collapse works with any data format and is built for speed.

Let us look at the syntax.

-   Fast Statistical Functions

  Syntax:

FUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE], use.g.names = TRUE, [drop = TRUE,] [nthreads = 1,] ...)

       
Argument            Description
      g             grouping vectors / lists of vectors or ’GRP’ object
      w             a vector of (frequency) weights
    TRA             a quoted operation to transform x using the statistics
  na.rm             efficiently skips missing values in x
  use.g.names       generate names/row-names from g
  drop              drop dimensions if g = TRA = NULL
  nthreads          number of threads for OpenMP multithreading

Usage example: fmean

# Weighted Mean
w <- abs(rnorm(nrow(iris)))
all.equal(fmean(num_vars(iris), w = w), sapply(num_vars(iris), weighted.mean, w = w))
[1] TRUE
wNA <- na_insert(w, prop = 0.05)
sapply(num_vars(iris), weighted.mean, w = wNA) # weighted.mean() cannot handle missing weights
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
          NA           NA           NA           NA 
fmean(num_vars(iris), w = wNA) # missing values in the weights are skipped automatically
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
    5.766455     3.088102     3.558861     1.107960 
# Grouped Mean
fmean(iris$Sepal.Length, g = iris$Species)
    setosa versicolor  virginica 
     5.006      5.936      6.588 
fmean(num_vars(iris), iris$Species)  
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026
# Weighted Group Mean
fmean(num_vars(iris), iris$Species, w)
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa         5.033403    3.434271     1.464525   0.2377769
versicolor     5.941119    2.783020     4.263050   1.3291146
virginica      6.499996    2.965439     5.492447   2.0008032
# The speed advantage again.
microbenchmark(fmean = fmean(iris$Sepal.Length, iris$Species),
               tapply = tapply(iris$Sepal.Length, iris$Species, mean))
Unit: microseconds
   expr    min     lq     mean median      uq     max neval cld
  fmean  7.531  7.862  8.54733  8.239  8.4255  41.448   100  a 
 tapply 47.166 48.190 49.66377 48.636 49.1180 123.933   100   b

Considerations with missing data

# wlddev$GINI, g: country, functions: mean, median, min, max, sum, prod
collap(wlddev, GINI ~ country, list(mean, median, min, max, sum, prod),
       na.rm = TRUE, give.names = FALSE) |> head()
         country     mean median  min  max   sum         prod
1    Afghanistan      NaN     NA  Inf -Inf   0.0 1.000000e+00
2        Albania 31.41111   31.7 27.0 34.6 282.7 2.902042e+13
3        Algeria 34.36667   35.3 27.6 40.2 103.1 3.916606e+04
4 American Samoa      NaN     NA  Inf -Inf   0.0 1.000000e+00
5        Andorra      NaN     NA  Inf -Inf   0.0 1.000000e+00
6         Angola 48.66667   51.3 42.7 52.0 146.0 1.139065e+05
# For the fast functions na.rm = TRUE is the default, and all-NA groups return NA rather than NaN/Inf/0/1.
collap(wlddev, GINI ~ country, list(fmean, fmedian, fmin, fmax, fsum, fprod),
       give.names = FALSE) |> head()
         country    fmean fmedian fmin fmax  fsum        fprod
1    Afghanistan       NA      NA   NA   NA    NA           NA
2        Albania 31.41111    31.7 27.0 34.6 282.7 2.902042e+13
3        Algeria 34.36667    35.3 27.6 40.2 103.1 3.916606e+04
4 American Samoa       NA      NA   NA   NA    NA           NA
5        Andorra       NA      NA   NA   NA    NA           NA
6         Angola 48.66667    51.3 42.7 52.0 146.0 1.139065e+05
microbenchmark(a = collap(wlddev, GINI ~ country, list(mean, median, min, max, sum, prod),
                          na.rm = TRUE, give.names = FALSE) |> head(),
               b=collap(wlddev, GINI ~ country, list(fmean, fmedian, fmin, fmax, fsum, fprod),
                        give.names = FALSE) |> head())
Unit: microseconds
 expr      min        lq       mean    median        uq       max neval cld
    a 9854.603 9940.8865 10522.7090 10008.930 10297.669 14969.065   100  a 
    b  545.479  590.4145   611.5872   621.694   633.038   685.942   100   b
# The speed advantage shows once again.

TRA function

  • With the TRA() function, transformations over many rows/columns are handled concisely.
Syntax:
  TRA(x, STATS, FUN = "-", g = NULL, set = FALSE, ...)


  setTRA(x, STATS, FUN = "-", g = NULL, ...)

  STATS = vector/matrix/list of statistics

0        "replace_NA"     replace missing values in x
1        "replace_fill"   replace data and missing values in x
2        "replace"        replace data but preserve missing values in x
3        "-"              subtract (i.e. center)
4        "-+"             center on overall average statistic
5        "/"              divide (i.e. scale)
6        "%"              compute percentages (i.e. divide and multiply by 100)   
7        "+"              add
8        "*"              multiply
9        "%%"             modulus (i.e. remainder from division by STATS)
10       "-%%"            subtract modulus (i.e. make data divisible by STATS)
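A minimal sketch of a few of the TRA codes above, applied without grouping (assuming collapse is attached; the vector x is made up for illustration):

```r
library(collapse)

x <- c(2, 4, NA, 6)

# "replace_NA": fill missing values with the computed statistic (here the median)
fmedian(x, TRA = "replace_NA")   # c(2, 4, 4, 6)

# "-": center on the statistic
fmean(x, TRA = "-")              # c(-2, 0, NA, 2)

# "%": each value as a percentage of the statistic
fsum(x, TRA = "%")               # 100 * x / 12
```

With a g argument the same codes operate group-wise, which is what the examples below exploit.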
dt2 <- as.data.table(iris)

attach(iris)    # attach() lets us reference column names directly, as in data.table.

# 평균값과의 차: g= Species
all_obj_equal(Sepal.Length - ave(Sepal.Length, g = Species),
              fmean(Sepal.Length, g = Species, TRA= "-"),
              TRA(Sepal.Length, fmean(Sepal.Length, g = Species), "-", g = Species))
[1] TRUE
microbenchmark(baseR= Sepal.Length - ave(Sepal.Length, g = Species),
               fmean = fmean(Sepal.Length, g = Species, TRA= "-"),
               TRA_fmean = TRA(Sepal.Length, fmean(Sepal.Length, g = Species), "-", g = Species));detach(iris)
Unit: microseconds
      expr    min      lq     mean  median      uq     max neval cld
     baseR 57.077 58.4975 60.77960 59.4215 60.3085 159.264   100 a  
     fmean  3.796  3.9900  4.18378  4.1510  4.2665   7.214   100  b 
 TRA_fmean 11.882 12.3505 13.25294 12.7595 13.3165  44.254   100   c
  • Rather than calling TRA() separately, use the TRA argument of the Fast Statistical Functions!
# Example
num_vars(dt2) %<>%  na_insert(prop = 0.05)

# Replace NA values with the group median.
num_vars(dt2) |> fmedian(iris$Species, TRA = "replace_NA", set = TRUE)
# num_vars(dt2) |> fmean(iris$Species, TRA = "replace_NA", set = TRUE) --> replace with the mean instead.


# Many different operations can be handled in a single call.
mtcars |> ftransform(A = fsum(mpg, TRA = "%"),
                     B = mpg > fmedian(mpg, cyl, TRA = "replace_fill"),
                     C = fmedian(mpg, list(vs, am), wt, "-"),
                     D = fmean(mpg, vs,, 1L) > fmean(mpg, am,, 1L)) |> head(3)
               mpg cyl disp  hp drat    wt  qsec vs am gear carb        A     B
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 3.266449  TRUE
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 3.266449  TRUE
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 3.546430 FALSE
                 C     D
Mazda RX4      1.3 FALSE
Mazda RX4 Wag  1.3 FALSE
Datsun 710    -7.6  TRUE

Grouping Object

  • With a GRP object, grouped computations become easy and reusable.

    Syntax:
    
        GRP(X, by = NULL, sort = TRUE, decreasing = FALSE, na.last = TRUE, 
        return.groups = TRUE, return.order = sort, method = "auto", ...)
g <- GRP(iris, by = ~ Species)
print(g)
collapse grouping object of length 150 with 3 ordered groups

Call: GRP.default(X = iris, by = ~Species), X is sorted

Distribution of group sizes: 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     50      50      50      50      50      50 

Groups with sizes: 
    setosa versicolor  virginica 
        50         50         50 
str(g)
Class 'GRP'  hidden list of 9
 $ N.groups    : int 3
 $ group.id    : int [1:150] 1 1 1 1 1 1 1 1 1 1 ...
 $ group.sizes : int [1:3] 50 50 50
 $ groups      :'data.frame':   3 obs. of  1 variable:
  ..$ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 2 3
 $ group.vars  : chr "Species"
 $ ordered     : Named logi [1:2] TRUE TRUE
  ..- attr(*, "names")= chr [1:2] "ordered" "sorted"
 $ order       : int [1:150] 1 2 3 4 5 6 7 8 9 10 ...
  ..- attr(*, "starts")= int [1:3] 1 51 101
  ..- attr(*, "maxgrpn")= int 50
  ..- attr(*, "sorted")= logi TRUE
 $ group.starts: int [1:3] 1 51 101
 $ call        : language GRP.default(X = iris, by = ~Species)
# Reuse the GRP object by passing it as the grouping argument!
fmean(num_vars(iris), g)
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026
fmean(num_vars(iris), iris$Species)
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026

Factors in operation

collapse is agnostic about types: factors can be passed directly as groups, and qF() generates factors fast.

x <- na_insert(rnorm(1e7), prop = 0.01) 
g <- sample.int(1e6, 1e7, TRUE)         
# compare with GRP
system.time(gg <- GRP(g))
   user  system elapsed 
  0.619   0.040   0.659 
system.time(f <- qF(g, na.exclude = FALSE))
   user  system elapsed 
  0.273   0.032   0.306 
class(f)
[1] "factor"      "na.included"
microbenchmark(fmean(x, g), 
               fmean(x, gg), 
               fmean(x, gg, na.rm = FALSE), 
               fmean(x, f))
 ## Unit: milliseconds
 ##       expr                    min         lq          mean        median
 ## fmean(x, g)                   146.060983  150.493309  155.02585   152.197822
 ## fmean(x, gg)                  25.354564   27.709625   29.48497    29.022157
 ## fmean(x, gg, na.rm = FALSE)   13.184534   13.783585   15.61769    14.128067
 ## fmean(x, f)                   24.847271   27.503661   29.47271    29.248580

# qF() gives a speed-up similar to GRP().

Summary: FAST grouping and Ordering

There are many related tools:
GRP()           Fast sorted or unsorted grouping of multivariate data, returns detailed object of class ’GRP’ 
qF()/qG()       Fast generation of factors and quick-group (’qG’) objects from atomic vectors 
finteraction()  Fast interactions: returns factor or ’qG’ objects 
fdroplevels()   Efficiently remove unused factor levels

radixorder()    Fast ordering and ordered grouping 
group()         Fast first-appearance-order grouping: returns ’qG’ object 
gsplit()        Split vector based on ’GRP’ object 
greorder()      Reorder the results

- Functions that also return ’qG’ objects:
groupid()       Generalized run-length-type grouping 
seqid()         Grouping of integer sequences 
timeid()        Grouping of time sequences (based on GCD)

dapply()        Apply a function to rows or columns of data.frame or matrix based objects. 
BY()            Apply a function to vectors or matrix/data frame columns by groups.

-   Specialized Data Transformation Functions 
fbetween()      Fast averaging and (quasi-)centering 
fwithin()       (between-group means and within-group demeaning).
fhdbetween()    Higher-Dimensional averaging/centering and linear prediction/partialling out 
fhdwithin()     (powered by fixest’s algorithm for multiple factors).
fscale()        (advanced) scaling and centering.

-   Time / Panel Series Functions 
fcumsum()       Cumulative sums 
flag()          Lags and leads 
fdiff()         (Quasi-, Log-, Iterated-) differences 
fgrowth()       (Compounded-) growth rates

-    Data manipulation functions
fselect(),      fsubset(),      fgroup_by(),    [f/set]transform[v](),          
fmutate(),      fsummarise(),   across(),       roworder[v](),            
colorder[v](),  [f/set]rename(),                [set]relabel()
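A minimal sketch of a few of the transformation and time-series functions listed above, on a toy vector (values invented for illustration):

```r
library(collapse)

x <- c(1, 3, 6, 10)

flag(x)      # lag by 1:          NA  1  3  6
fdiff(x)     # first differences: NA  2  3  4
fcumsum(x)   # cumulative sums:    1  4 10 20
fbetween(x)  # overall mean, repeated (group means when g is supplied)
fwithin(x)   # demeaned values
```

All of these accept the same g (and for panels, t) arguments as the fast statistical functions, so they work group-wise out of the box.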

collapse is fast!

fdim(wlddev)    ## faster dim() for data frames. rows/cols: 13176 13

# For year >= 1990: group sums including ODA/POP (g: region, income, OECD)
microbenchmark( 
  
dplyr = qDT(wlddev) |>
        filter(year >= 1990) |>
        mutate(ODA_POP = ODA / POP) |>
        group_by(region, income, OECD) |>
        summarise(across(PCGDP:POP, sum, na.rm = TRUE), .groups = "drop") |>
        arrange(income, desc(PCGDP)),

data.table = qDT(wlddev)[, ODA_POP := ODA / POP][
             year >= 1990, lapply(.SD, sum, na.rm = TRUE),
             by = .(region, income, OECD), .SDcols = PCGDP:ODA_POP][
             order(income, -PCGDP)],

collapse_base = qDT(wlddev) |>
                fsubset(year >= 1990) |>
                fmutate(ODA_POP = ODA / POP) |>
                fgroup_by(region, income, OECD) |>
                fsummarise(across(PCGDP:ODA_POP, sum, na.rm = TRUE)) |>
                roworder(income, -PCGDP),

collapse_optimized = qDT(wlddev) |>
                    fsubset(year >= 1990, region, income, OECD, PCGDP:POP) |>
                    fmutate(ODA_POP = ODA / POP) |>
                    fgroup_by(1:3, sort = FALSE) |> fsum() |>
                    roworder(income, -PCGDP)
)


## Unit: microseconds
##        expr            min         lq            mean            median          uq            max         neval
## dplyr                  71955.523   72291.9715    80009.2208      72453.1165      76902.671   393947.262  100 
## data.table             5960.503    6310.7045     7116.6673       6721.3450       7046.837    18615.736     100   
## collapse_base          859.505     948.2200      1041.1137       990.1375        1061.864     3148.804       100 
## collapse_optimized     442.040     482.9705      542.6927        523.6950        574.921     1036.817      100   

collapse with the Fast Statistical Functions: further uses

# The following three produce identical results.
# sum of mpg per cyl
 mtcars %>% ftransform(mpg_sum = fsum(mpg, g = cyl, TRA = "replace_fill")) %>% invisible()
 mtcars %>% fgroup_by(cyl) %>% ftransform(mpg_sum = fsum(mpg, GRP(.), TRA = "replace_fill")) %>% invisible()
 mtcars %>% fgroup_by(cyl) %>% fmutate(mpg_sum = fsum(mpg)) %>% head(10)
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb mpg_sum
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4   138.2
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4   138.2
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1   293.3
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1   138.2
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2   211.4
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1   138.2
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4   211.4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2   293.3
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2   293.3
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4   138.2
  • ad-hoc grouping, often fastest!
microbenchmark(a=mtcars %>% ftransform(mpg_sum = fsum(mpg, g = cyl, TRA = "replace_fill")),
               b=mtcars %>% fgroup_by(cyl) %>% ftransform(mpg_sum = fsum(mpg, GRP(.), TRA = "replace_fill")),
               c=mtcars %>% fgroup_by(cyl) %>% fmutate(mpg_sum = fsum(mpg)))
Unit: microseconds
 expr    min      lq     mean  median      uq     max neval cld
    a 27.266 29.7885 31.47125 30.4025 31.7165 107.002   100 a  
    b 64.819 66.7990 68.84531 67.8300 68.8585 138.077   100  b 
    c 78.526 80.3145 82.11237 81.4460 82.3050 126.137   100   c
  • ftransform() ignores a preceding fgroup_by(), so the two calls below give different results. (Only fmutate() and fsummarise() respect the existing grouping.)
mtcars %>% fgroup_by(cyl) %>% ftransform(mpg_sum = fsum(mpg, GRP(.), TRA = "replace_fill")) %>% head()
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb mpg_sum
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4   138.2
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4   138.2
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1   293.3
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1   138.2
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2   211.4
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1   138.2
mtcars %>% fgroup_by(cyl) %>% ftransform(mpg_sum = fsum(mpg, TRA = "replace_fill")) %>% head()
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb mpg_sum
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4   642.9
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4   642.9
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1   642.9
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1   642.9
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2   642.9
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1   642.9
  • As noted above, using the TRA argument is faster than base R's "/".
microbenchmark(
"/"=      mtcars |> fgroup_by(cyl) |> fmutate(mpg_prop = mpg / fsum(mpg))      |> head(),     
"TRA=/" = mtcars |> fgroup_by(cyl) |> fmutate(mpg_prop = fsum(mpg, TRA = "/")) |> head()
)
Unit: microseconds
  expr     min      lq     mean   median       uq     max neval cld
     / 208.423 211.461 216.9684 212.9690 215.2815 456.075   100  a 
 TRA=/ 198.332 200.922 203.9442 202.6085 204.8170 239.689   100   b
  • Here fsum() computes per group, whereas base sum() uses all rows.
mtcars |> fgroup_by(cyl) |> fmutate(mpg_prop2 = fsum(mpg) / sum(mpg))|> head() # result != 1
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb mpg_prop2
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 0.2149634
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 0.2149634
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 0.4562140
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 0.2149634
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 0.3288225
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 0.2149634
  • Free use of %>%
# The following two are identical.
 mtcars %>% fgroup_by(cyl) %>% ftransform(fselect(., hp:qsec) %>% fsum(TRA = "/")) %>% invisible()
 mtcars %>% fgroup_by(cyl) %>% fmutate(across(hp:qsec, fsum, TRA = "/")) %>% head()
                   mpg cyl disp         hp       drat         wt       qsec vs
Mazda RX4         21.0   6  160 0.12850467 0.15537849 0.12007333 0.13080102  0
Mazda RX4 Wag     21.0   6  160 0.12850467 0.15537849 0.13175985 0.13525111  0
Datsun 710        22.8   4  108 0.10231023 0.08597588 0.09227220 0.08840435  1
Hornet 4 Drive    21.4   6  258 0.12850467 0.12270916 0.14734189 0.15448188  1
Hornet Sportabout 18.7   8  360 0.05974735 0.06967485 0.06144064 0.07248414  0
Valiant           18.1   6  225 0.12266355 0.10996016 0.15857012 0.16068023  1
                  am gear carb
Mazda RX4          1    4    4
Mazda RX4 Wag      1    4    4
Datsun 710         1    4    1
Hornet 4 Drive     0    3    1
Hornet Sportabout  0    3    2
Valiant            0    3    1
  • With set = TRUE the result is written back to the original data by reference.
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# Each value in columns hp:qsec as a share of its g = cyl group sum.
mtcars %>% fgroup_by(cyl) %>% fmutate(across(hp:qsec, fsum, TRA = "/", set = TRUE)) %>% invisible()
head(mtcars)
                   mpg cyl disp         hp       drat         wt       qsec vs
Mazda RX4         21.0   6  160 0.12850467 0.15537849 0.12007333 0.13080102  0
Mazda RX4 Wag     21.0   6  160 0.12850467 0.15537849 0.13175985 0.13525111  0
Datsun 710        22.8   4  108 0.10231023 0.08597588 0.09227220 0.08840435  1
Hornet 4 Drive    21.4   6  258 0.12850467 0.12270916 0.14734189 0.15448188  1
Hornet Sportabout 18.7   8  360 0.05974735 0.06967485 0.06144064 0.07248414  0
Valiant           18.1   6  225 0.12266355 0.10996016 0.15857012 0.16068023  1
                  am gear carb
Mazda RX4          1    4    4
Mazda RX4 Wag      1    4    4
Datsun 710         1    4    1
Hornet 4 Drive     0    3    1
Hornet Sportabout  0    3    2
Valiant            0    3    1
  • With .apply = FALSE the selected columns are passed to the function as a whole data frame rather than column by column.
# pairwise correlations among hp:qsec within each g = cyl group
mtcars %>% fgroup_by(cyl) %>% fsummarise(across(hp:qsec, \(x) qDF(pwcor(x), "var"), .apply = FALSE)) %>% head()
  cyl  var         hp       drat         wt       qsec
1   4   hp  1.0000000 -0.4702200  0.1598761 -0.1783611
2   4 drat -0.4702200  1.0000000 -0.4788681 -0.2833656
3   4   wt  0.1598761 -0.4788681  1.0000000  0.6380214
4   4 qsec -0.1783611 -0.2833656  0.6380214  1.0000000
5   6   hp  1.0000000  0.2171636 -0.3062284 -0.6280148
6   6 drat  0.2171636  1.0000000 -0.3546583 -0.6231083

Rows/columns can be referenced by name, index, vector, or regular expression.

get_vars(x, vars, return = "names", regex = FALSE, ...) 
get_vars(x, vars, regex = FALSE, ...) <- value 

- The position can also be chosen.
add_vars(x, ..., pos = "end") 
add_vars(x, pos = "end") <- value 

- Columns can be selected by data type.
num_vars(x, return = "data");   cat_vars(x, return = "data");   char_vars(x, return = "data"); 
fact_vars(x, return = "data");  logi_vars(x, return = "data");  date_vars(x, return = "data") 

- Replacement is also possible.
num_vars(x) <- value;   cat_vars(x) <- value;   char_vars(x) <- value; 
fact_vars(x) <- value;  logi_vars(x) <- value;  date_vars(x) <- value
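A minimal sketch of these selectors on iris (the regex pattern and the added id column are made up for illustration):

```r
library(collapse)

df <- iris

# select by type or by regular expression
names(num_vars(df))                          # the four numeric columns
names(get_vars(df, "^Sepal", regex = TRUE))  # "Sepal.Length" "Sepal.Width"

# replace in place: standardize all numeric columns
num_vars(df) <- fscale(num_vars(df))

# append a column at a chosen position (pos also accepts integer positions)
add_vars(df, pos = 2) <- data.frame(id = seq_len(nrow(df)))
```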

Efficient programming

-   Quick data conversion:
    qDF(),  qDT(),  qTBL(),   qM(),   mrtl(),   mctl()
    
-   anyv(x, value) / allv(x, value)     # Faster than any/all(x == value)
-   allNA(x)                            # Faster than all(is.na(x))
-   whichv(x, value, invert = F)        # Faster than which(x (!/=)= value)
-   whichNA(x, invert = FALSE)          # Faster than which((!)is.na(x))
-   x %(!/=)=% value                    # Infix for whichv(v, value, TRUE/FALSE)
-   setv(X, v, R, ...)                  # x[x(!/=)=v] <- r / x[v] <- r[v] (by reference)
-   setop(X, op, V, rowwise = F)        # Faster than X <- X +/-/*// V (by reference)
-   X %(+,-,*,/)=% V                    # Infix for setop()
-   na_rm(x)                            # Fast: if(anyNA(x)) x[!is.na(x)] else x
-   na_omit(X, cols = NULL, ...)        # Faster na.omit for matrices and data frames
-   vlengths(X, use.names=TRUE)         # Faster version of lengths()
-   frange(x, na.rm = TRUE)             # Much faster base::range
-   fdim(X)                             # Faster dim for data frames
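A minimal sketch of a few of these helpers on a toy vector (values invented for illustration):

```r
library(collapse)

x <- c(4, NA, 4, 7)

anyv(x, 7)     # TRUE
whichv(x, 4)   # indices where x == 4: 1 3
whichNA(x)     # 2
frange(x)      # 4 7 (na.rm = TRUE is the default)
na_rm(x)       # 4 4 7
fdim(mtcars)   # 32 11
```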

Collapse and data.table

Let us see how collapse can be used inside data.table.

DT <- qDT(wlddev) # as.data.table(wlddev)
DT %>% fgroup_by(country) %>% get_vars(9:13) %>% fmean()  # gby() abbreviates fgroup_by()
                   country      PCGDP   LIFEEX     GINI        ODA         POP
                    <char>      <num>    <num>    <num>      <num>       <num>
  1:           Afghanistan   483.8351 49.19717       NA 1487548499 18362258.22
  2:               Albania  2819.2400 71.68027 31.41111  312928126  2708297.17
  3:               Algeria  3532.2714 63.56290 34.36667  612238500 25305290.68
  4:        American Samoa 10071.0659       NA       NA         NA    43115.10
  5:               Andorra 40083.0911       NA       NA         NA    51547.35
 ---                                                                          
212: Virgin Islands (U.S.) 35629.7336 73.71292       NA         NA    92238.53
213:    West Bank and Gaza  2388.4348 71.60780 34.52500 1638581462  3312289.13
214:           Yemen, Rep.  1069.6596 52.53707 35.46667  859950996 13741375.82
215:                Zambia  1318.8627 51.09263 52.68889  734624330  8614972.38
216:              Zimbabwe  1219.4360 54.53360 45.93333  397104997  9402160.33
DT[, lapply(.SD, mean, na.rm = TRUE), keyby = country, .SDcols = 9:13] %>%  invisible()
collap(DT, ~ country, fmean, cols = 9:13) %>% invisible()     #same

microbenchmark(collapse     = DT %>% gby(country) %>% get_vars(9:13) %>% fmean,
               data.table   = DT[, lapply(.SD, mean, na.rm = TRUE), keyby = country, .SDcols = 9:13],
               data.table2  = DT[, lapply(.SD, fmean, na.rm = TRUE), keyby = country, .SDcols = 9:13])
Unit: microseconds
        expr      min        lq      mean    median        uq      max neval
    collapse  339.330  369.4435  424.8557  397.9745  409.1075 3425.186   100
  data.table 5010.470 5242.2960 5356.0409 5280.7165 5376.4760 8372.164   100
 data.table2 5164.241 5322.2865 5557.2280 5391.0230 5541.6530 8763.263   100
 cld
 a  
  b 
   c
  • DT[, lapply(.SD, fmean, …)] turns out to be slower than DT[, lapply(.SD, mean, …)]. Inside data.table, mean is not base R's mean: data.table replaces it internally with the GForce-optimized gmean. fmean called through lapply cannot use this optimization, so it ends up slower.

  • That call is processed much like the one below; it is slow because fmean is applied separately to every group and column.

BY(gv(DT, 9:13), g, fmean) 

This can be largely avoided as follows:

 fmean(gv(DT, 9:13), DT$country)
 g <- GRP(DT, "country"); add_vars(g[["groups"]], fmean(gv(DT, 9:13), g))
DT <- qDT(wlddev); g <- GRP(DT, "country")
#gv: abbreviation for get_vars()
microbenchmark(a = fmean(gv(DT, 9:13), DT$country),
               b0= g <- GRP(DT, "country"),
               b = add_vars(g[["groups"]], fmean(gv(DT, 9:13), g)),
               dt_fmean = DT[, lapply(.SD, fmean, na.rm = TRUE), keyby = country, .SDcols = 9:13],
               dt_gmean = DT[, lapply(.SD, mean, na.rm = TRUE), keyby = country, .SDcols = 9:13]) 
Unit: microseconds
     expr      min        lq      mean    median       uq       max neval  cld
        a  358.473  375.3265  392.0909  392.3025  398.751   580.505   100 a   
       b0   76.707   92.1125  105.7192  110.7625  113.721   175.649   100  b  
        b  224.185  238.6440  256.6076  255.7260  265.781   606.967   100 ab  
 dt_fmean 5213.289 5319.2375 5620.3223 5371.5520 5547.785 11522.735   100   c 
 dt_gmean 5064.030 5246.1655 5389.5898 5315.4075 5475.593  8260.358   100    d
  • dplyr's data %>% group_by(…) %>% summarize(…) and data.table's [i, j, by] syntax are the standard ways to apply functions to groups of data. They can apply arbitrary functions to grouped data, and data.table in particular optimizes several built-in functions (min, max, mean, etc.) internally via GForce.

  • collapse groups data (fgroup_by, collap) and then applies statistical and transformation functions implemented in C++.

  • All of collapse's functionality (with BY as the exception) is optimized in the spirit of GForce, but inside data.table there appears to be a difference in the degree of optimization, along with an issue in how lapply applies these functions.

  • Can fmean, then, be used inside data.table at all?

DT[, fmean(.SD, country), .SDcols = 9:13]
          PCGDP   LIFEEX     GINI        ODA         POP
          <num>    <num>    <num>      <num>       <num>
  1:   483.8351 49.19717       NA 1487548499 18362258.22
  2:  2819.2400 71.68027 31.41111  312928126  2708297.17
  3:  3532.2714 63.56290 34.36667  612238500 25305290.68
  4: 10071.0659       NA       NA         NA    43115.10
  5: 40083.0911       NA       NA         NA    51547.35
 ---                                                    
212: 35629.7336 73.71292       NA         NA    92238.53
213:  2388.4348 71.60780 34.52500 1638581462  3312289.13
214:  1069.6596 52.53707 35.46667  859950996 13741375.82
215:  1318.8627 51.09263 52.68889  734624330  8614972.38
216:  1219.4360 54.53360 45.93333  397104997  9402160.33
DT[, fmean(gby(.SD, country)), .SDcols = c(1L, 9:13)] # gby = abbreviation for fgroup_by()
                   country      PCGDP   LIFEEX     GINI        ODA         POP
                    <char>      <num>    <num>    <num>      <num>       <num>
  1:           Afghanistan   483.8351 49.19717       NA 1487548499 18362258.22
  2:               Albania  2819.2400 71.68027 31.41111  312928126  2708297.17
  3:               Algeria  3532.2714 63.56290 34.36667  612238500 25305290.68
  4:        American Samoa 10071.0659       NA       NA         NA    43115.10
  5:               Andorra 40083.0911       NA       NA         NA    51547.35
 ---                                                                          
212: Virgin Islands (U.S.) 35629.7336 73.71292       NA         NA    92238.53
213:    West Bank and Gaza  2388.4348 71.60780 34.52500 1638581462  3312289.13
214:           Yemen, Rep.  1069.6596 52.53707 35.46667  859950996 13741375.82
215:                Zambia  1318.8627 51.09263 52.68889  734624330  8614972.38
216:              Zimbabwe  1219.4360 54.53360 45.93333  397104997  9402160.33
microbenchmark(collapse        = DT %>% gby(country) %>% get_vars(9:13) %>% fmean,
               data.table      = DT[, lapply(.SD, mean, na.rm = TRUE), keyby = country, .SDcols = 9:13],
               data.table_base = DT[, lapply(.SD, base::mean, na.rm = TRUE), keyby = country, .SDcols = 9:13],
               hybrid_bad      = DT[, lapply(.SD, fmean), keyby = country, .SDcols = 9:13],
               hybrid_ok       = DT[, fmean(gby(.SD, country)), .SDcols = c(1L, 9:13)])
Unit: microseconds
            expr      min        lq      mean    median        uq      max neval cld
        collapse  345.137  376.5125  419.4952  393.9470  399.4225 3603.978   100 a
      data.table 5086.504 5240.5405 5384.2358 5319.8300 5484.4740 6284.893   100  b
 data.table_base 2545.283 2597.0910 2695.3421 2625.8495 2648.6360 6084.044   100   c
      hybrid_bad 5197.406 5331.3155 5539.1046 5382.9660 5607.2690 8885.335   100    d
       hybrid_ok  837.602  885.5560  899.4496  902.0515  914.6245 1003.834   100     e
  • Using fmean and friends inside data.table is not advisable.
DT %>% gby(country) %>% get_vars(9:13) %>% fmean
fmean(gv(DT, 9:13), DT$country)
  • For better efficiency, process the data outside data.table as shown above.
# An example beyond fmean: fsum

# Sum ODA by country: the three lines below are all equivalent.
DT[, sum_ODA := sum(ODA, na.rm = TRUE), by = country]
DT[, sum_ODA := fsum(ODA, country, TRA = "replace_fill")]  
settfm(DT, sum_ODA = fsum(ODA, country, TRA = "replace_fill"))  # settfm/tfm = settransform/ftransform 

# settransform is more convenient than ':=' when modifying several columns.
settfm(DT, perc_c_ODA = fsum(ODA, country, TRA = "%"),
           perc_y_ODA = fsum(ODA, year, TRA = "%"))
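The TRA argument used above controls how the grouped statistic is mapped back onto the original rows. A minimal sketch with toy data (values chosen purely for illustration):

```r
library(collapse)

x <- c(1, 2, 3, 4)
g <- c("a", "a", "b", "b")

# TRA = "replace_fill": every element is replaced by its group sum
fsum(x, g, TRA = "replace_fill")  # 3 3 7 7

# TRA = "%": every element expressed as a percentage of its group sum
fsum(x, g, TRA = "%")             # approx. 33.33 66.67 42.86 57.14
```

This is why the sum_ODA and perc_c_ODA columns above have one value repeated within each country.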

microbenchmark(
  S1 = DT[, sum_ODA := sum(ODA, na.rm = TRUE), by = country],
  S2 = DT[, sum_ODA := fsum(ODA, country, TRA = "replace_fill")],
  S3 = settfm(DT, sum_ODA = fsum(ODA, country, TRA = "replace_fill"))
)
Unit: microseconds
 expr      min        lq      mean    median       uq      max neval cld
   S1 2088.236 2178.8360 2255.2440 2243.0780 2280.824 4270.312   100 a  
   S2  409.735  484.6600  528.7572  533.0865  577.559  665.195   100  b 
   S3  121.994  171.2135  203.7296  202.9935  229.109  290.783   100   c
  • As above, prefer processing outside data.table.

  • collapse functions useful for data manipulation with data.table:

fcumsum()   fscale()    fbetween()    fwithin()   fhdbetween() 
fhdwithin()   flag()    fdiff()       fgrowth()
# Centering GDP
#DT[, demean_PCGDP := PCGDP - mean(PCGDP, na.rm = TRUE), by = country]
DT[, demean_PCGDP := fwithin(PCGDP, country)]
settfm(DT, demean_PCGDP = fwithin(PCGDP, country)) # prefer settfm

# Lagging GDP
#DT[order(year), lag_PCGDP := shift(PCGDP, 1L), by = country]
DT[, lag_PCGDP := flag(PCGDP, 1L, country, year)]


# Computing a growth rate
#DT[order(year), growth_PCGDP := (PCGDP / shift(PCGDP, 1L) - 1) * 100, by = country]
DT[, growth_PCGDP := fgrowth(PCGDP, 1L, 1L, country, year)] # 1 lag, 1 iteration

# Several Growth rates
#DT[order(year), paste0("growth_", .c(PCGDP, LIFEEX, GINI, ODA)) := (.SD / shift(.SD, 1L) - 1) * 100, by = country, .SDcols = 9:13]
DT %<>% tfm(gv(., 9:13) %>% fgrowth(1L, 1L, country, year) %>% add_stub("growth_"))
settfmv(DT, 9:13, G, 1L, 1L, country, year, apply = FALSE)
 
result <- DT[sample(.N, 7)] |> fselect(9:ncol(DT)); print(result)
        PCGDP   LIFEEX  GINI        ODA      POP     sum_ODA perc_c_ODA
        <num>    <num> <num>      <num>    <num>       <num>      <num>
1:  7808.4047 71.04878    NA   65139999 42449038 26214490031  0.2484885
2: 35593.4255 68.80683    NA         NA    56911          NA         NA
3:  2171.3605 64.10800    NA 2025609985  2123180 73237229858  2.7658200
4: 47413.6225 70.84878    NA         NA    56186          NA         NA
5:   872.9171 55.17000    NA  350309998  6094259 18904160029  1.8530842
6:  3131.2099 65.24600    NA   61150002   792736  4340079982  1.4089602
7:  1814.4672 70.15500    NA  251289993 29774500  7412250079  3.3901985
   perc_y_ODA demean_PCGDP lag_PCGDP growth_PCGDP growth_LIFEEX growth_GINI
        <num>        <num>     <num>        <num>         <num>       <num>
1:  0.1087501   -3117.1382  6.019063     6.019063    0.70872947          NA
2:         NA    4307.9524  6.632324     6.632324    0.54279452          NA
3:  3.6079552    -990.5459        NA           NA    0.95430065          NA
4:         NA   16128.1494  4.547809     4.547809   -0.55800897          NA
5:  0.5839065     -21.0514  1.094015     1.094015   -0.05977936          NA
6:  0.1099883     120.4957 -3.230141    -3.230141    0.10893748          NA
7:  0.2788049     419.3212  5.806588     5.806588    0.35045058          NA
    growth_ODA growth_POP  G1.PCGDP   G1.LIFEEX G1.GINI      G1.ODA    G1.POP
         <num>      <num>     <num>       <num>   <num>       <num>     <num>
1: 1436.320823  0.9940010  6.019063  0.70872947      NA 1436.320823 0.9940010
2:          NA  0.2572007  6.632324  0.54279452      NA          NA 0.2572007
3:   12.181760  2.7719948        NA  0.95430065      NA   12.181760 2.7719948
4:          NA  0.1283102  4.547809 -0.55800897      NA          NA 0.1283102
5:    3.217532  3.1953119  1.094015 -0.05977936      NA    3.217532 3.1953119
6:    2.652345  1.0645269 -3.230141  0.10893748      NA    2.652345 1.0645269
7:   81.791218  1.4829887  5.806588  0.35045058      NA   81.791218 1.4829887
  • := is less optimized inside data.table, so using collapse is faster in most cases.
microbenchmark(
  W1 = DT[, demean_PCGDP := PCGDP - mean(PCGDP, na.rm = TRUE), by = country],
  W2 = DT[, demean_PCGDP := fwithin(PCGDP, country)],
  L1 = DT[order(year), lag_PCGDP := shift(PCGDP, 1L), by = country],
  L2 = DT[, lag_PCGDP := flag(PCGDP, 1L, country, year)],
  L3 = DT[, lag_PCGDP := shift(PCGDP, 1L), by = country], # Not ordered
  L4 = DT[, lag_PCGDP := flag(PCGDP, 1L, country)] # Not ordered
)
Unit: microseconds
 expr      min        lq      mean   median       uq       max neval    cld
   W1 1912.389 1990.3745 2156.8241 2023.363 2085.882 11093.551   100 a     
   W2  784.069  836.6985  873.2039  865.280  894.328  1336.105   100  b    
   L1 4025.457 4216.2925 4483.2931 4281.839 4414.296 16599.366   100   c   
   L2 1296.289 1338.7050 1467.9625 1373.009 1409.083  9663.483   100    d  
   L3 2604.725 2672.9305 2748.6873 2697.509 2746.567  3559.639   100     e 
   L4  451.273  476.2480  515.3331  507.103  535.107   861.385   100      f
# Time-series functions such as flag do not reorder the data first, hence the clear performance difference.
m <- qM(mtcars)
# matrix to data: mrtl
mrtl(m, names = TRUE, return = "data.table") %>% head(2) # convert to data.table
   Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant
       <num>         <num>      <num>          <num>             <num>   <num>
1:        21            21       22.8           21.4              18.7    18.1
2:         6             6        4.0            6.0               8.0     6.0
   Duster 360 Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE Merc 450SL
        <num>     <num>    <num>    <num>     <num>      <num>      <num>
1:       14.3      24.4     22.8     19.2      17.8       16.4       17.3
2:        8.0       4.0      4.0      6.0       6.0        8.0        8.0
   Merc 450SLC Cadillac Fleetwood Lincoln Continental Chrysler Imperial
         <num>              <num>               <num>             <num>
1:        15.2               10.4                10.4              14.7
2:         8.0                8.0                 8.0               8.0
   Fiat 128 Honda Civic Toyota Corolla Toyota Corona Dodge Challenger
      <num>       <num>          <num>         <num>            <num>
1:     32.4        30.4           33.9          21.5             15.5
2:      4.0         4.0            4.0           4.0              8.0
   AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
         <num>      <num>            <num>     <num>         <num>        <num>
1:        15.2       13.3             19.2      27.3            26         30.4
2:         8.0        8.0              8.0       4.0             4          4.0
   Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
            <num>        <num>         <num>      <num>
1:           15.8         19.7            15       21.4
2:            8.0          6.0             8        4.0
  • fast linear model: flm
wlddev %>% fselect(country, PCGDP, LIFEEX) %>% na_omit(cols = -1) %>%  
   fsubset(fnobs(PCGDP, country, "replace_fill") > 20L) %>% qDT %>% 
  .[, qDT(coeftest(lm(G(PCGDP) ~ G(LIFEEX))), "Coef"), keyby = country] %>% head
Key: <country>
   country        Coef   Estimate Std. Error    t value    Pr(>|t|)
    <char>      <char>      <num>      <num>      <num>       <num>
1: Albania (Intercept) -3.6146411   2.371885 -1.5239527 0.136023086
2: Albania   G(LIFEEX) 22.1596308   7.288971  3.0401591 0.004325856
3: Algeria (Intercept)  0.5973329   1.740619  0.3431726 0.732731107
4: Algeria   G(LIFEEX)  0.8412547   1.689221  0.4980134 0.620390703
5:  Angola (Intercept) -3.3793976   1.540330 -2.1939445 0.034597175
6:  Angola   G(LIFEEX)  4.2362895   1.402380  3.0207852 0.004553260
# If you only want the intercept and slope quickly, use flm w/ mrtl (no standard errors):
wlddev %>% fselect(country, PCGDP, LIFEEX) %>% na_omit(cols = -1L) %>% 
  fsubset(fnobs(PCGDP, country, "replace_fill") > 20L) %>% qDT %>% 
  .[, mrtl(flm(fgrowth(PCGDP)[-1L], 
               cbind(Intercept = 1, LIFEEX = fgrowth(LIFEEX)[-1L])), TRUE), keyby = country] %>% head
Key: <country>
               country   Intercept     LIFEEX
                <char>       <num>      <num>
1:             Albania -3.61464113 22.1596308
2:             Algeria  0.59733291  0.8412547
3:              Angola -3.37939760  4.2362895
4: Antigua and Barbuda -3.11880717 18.8700870
5:           Argentina  1.14613567 -0.2896305
6:             Armenia  0.08178344 11.5523992
microbenchmark(
  A= wlddev %>% fselect(country, PCGDP, LIFEEX) %>% na_omit(cols = -1) %>%  
   fsubset(fnobs(PCGDP, country, "replace_fill") > 20L) %>% qDT %>% 
  .[, qDT(coeftest(lm(G(PCGDP) ~ G(LIFEEX))), "Coef"), keyby = country] ,
  
  B= wlddev %>% fselect(country, PCGDP, LIFEEX) %>% na_omit(cols = -1L) %>% 
   fsubset(fnobs(PCGDP, country, "replace_fill") > 20L) %>% qDT %>% 
  .[, mrtl(flm(fgrowth(PCGDP)[-1L], 
               cbind(Intercept = 1, LIFEEX = fgrowth(LIFEEX)[-1L])), TRUE), keyby = country]
)
Unit: milliseconds
 expr        min        lq       mean     median         uq       max neval cld
    A 167.429776 168.55069 172.656171 169.121475 172.448620 336.99646   100  a 
    B   7.141076   7.40031   7.818226   7.546154   7.698282  12.44983   100   b
# Replacing coeftest + lm + G with the collapse equivalents flm + fgrowth yields a large speedup.
  • collapse with lists: rsplit; rapply2d; get_elem; unlist2d

rapply2d(): applies a function to every data.table/frame inside a nested list.
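In the code below, DT_list is assumed to be a nested list of data frames split by region and income; such a list can be produced with rsplit(). A sketch (the formula interface and column selection here are illustrative):

```r
library(collapse)
library(data.table)
DT <- qDT(wlddev)

# Split the selected columns into a nested list of data frames,
# one nesting level per grouping variable (region, then income)
DT_list <- rsplit(DT, PCGDP + LIFEEX + country ~ region + income)

# Inspect the 2-level nested structure
str(DT_list, max.level = 2, give.attr = FALSE)
```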

lm_summary_list <- DT_list %>% 
  rapply2d(lm, formula = G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country)) %>% 
  rapply2d(summary, classes = "lm")

get_elem(): extracts the desired elements; unlist2d can then bind them into a data.table.

 lm_summary_list %>%
  get_elem("coefficients") %>% 
  unlist2d(idcols = .c(Region, Income), row.names = "Coef", DT = TRUE) %>% head
                Region              Income                  Coef  Estimate
                <char>              <char>                <char>     <num>
1: East Asia & Pacific         High income           (Intercept) 0.5313479
2: East Asia & Pacific         High income             G(LIFEEX) 2.4935584
3: East Asia & Pacific         High income B(G(LIFEEX), country) 3.8297123
4: East Asia & Pacific Lower middle income           (Intercept) 1.3476602
5: East Asia & Pacific Lower middle income             G(LIFEEX) 0.5238856
6: East Asia & Pacific Lower middle income B(G(LIFEEX), country) 0.9494439
   Std. Error   t value    Pr(>|t|)
        <num>     <num>       <num>
1:  0.7058550 0.7527720 0.451991327
2:  0.7586943 3.2866443 0.001095466
3:  1.6916770 2.2638554 0.024071386
4:  0.7008556 1.9228785 0.055015131
5:  0.7574904 0.6916069 0.489478164
6:  1.2031228 0.7891496 0.430367103
# Of course, it can also be done this way:
DT[, qDT(coeftest(lm(G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country))), "Coef"), 
   keyby = .(region, income)]

Summary

  1. collapse is fast and economical in terms of data and memory.

  2. It works with any data format: vectors, matrices, data.tables, and more.

  3. It can be used in combination with existing frameworks (dplyr, tidyverse, data.table, etc.).

  4. When mixed with data.table, calling it inside DT[] degrades performance, owing to differences in how the two process data internally.

  5. When working with data.table objects, use syntax like the following to get the full benefit.

Not recommended:

DT[order(year), paste0("growth_", .c(PCGDP, LIFEEX, GINI, ODA)) := (.SD / shift(.SD, 1L) - 1) * 100, 
   by = country, .SDcols = 9:13]

Recommended:

DT %<>% tfm(gv(., 9:13) %>% fgrowth(1L, 1L, country, year) %>% add_stub("growth_"))
settfmv(DT, 9:13, G, 1L, 1L, country, year, apply = FALSE)

Reuse

Citation

BibTeX citation:
@online{lee2024,
  author = {LEE, Hojun},
  title = {Collapse {패키지} {소개} V2},
  date = {2024-10-29},
  url = {https://blog.zarathu.com/posts/2024-10-28-Collapse/},
  langid = {en}
}
For attribution, please cite this work as:
LEE, Hojun. 2024. “Collapse 패키지 소개 V2.” October 29, 2024. https://blog.zarathu.com/posts/2024-10-28-Collapse/.