Introduction
Collapse
- A package built on C/C++ that makes large datasets easier to handle.
- Aims to dramatically improve the performance of R code, processing large data quickly and efficiently.
- Maximizes performance while providing a stable, optimized user API that integrates with existing data-manipulation frameworks (dplyr, tidyverse, data.table, etc.).
- MAIN FOCUS: use collapse together with data.table for faster computation.
Setup
Basic
As in data.table, csv files are handled with fread & fwrite.
Columns: ‘fselect()’ retrieves the columns you want.
fselect(dt, 1:3, 13:16) |> head()
EXMD_BZ_YYYY RN_INDI HME_YYYYMM HGHT WGHT WSTC BMI
<int> <int> <int> <int> <int> <int> <num>
1: 2009 562083 200909 144 61 90 29.4
2: 2009 334536 200911 162 51 63 19.4
3: 2009 911867 200903 163 65 82 24.5
4: 2009 183321 200908 152 51 70 22.1
5: 2009 942671 200909 159 50 73 19.8
6: 2009 979358 200912 157 55 73 22.3
fselect(dt, EXMD_BZ_YYYY,RN_INDI,HME_YYYYMM )|> head() # fselect(dt, "EXMD_BZ_YYYY","RN_INDI","HME_YYYYMM" )
EXMD_BZ_YYYY RN_INDI HME_YYYYMM
<int> <int> <int>
1: 2009 562083 200909
2: 2009 334536 200911
3: 2009 911867 200903
4: 2009 183321 200908
5: 2009 942671 200909
6: 2009 979358 200912
Rows: ‘fsubset()’ retrieves the rows/columns you want.
fsubset(dt, 1:3)
EXMD_BZ_YYYY RN_INDI HME_YYYYMM Q_PHX_DX_STK Q_PHX_DX_HTDZ Q_PHX_DX_HTN
<int> <int> <int> <int> <int> <int>
1: 2009 562083 200909 0 0 1
2: 2009 334536 200911 0 0 0
3: 2009 911867 200903 0 0 0
Q_PHX_DX_DM Q_PHX_DX_DLD Q_PHX_DX_PTB Q_HBV_AG Q_SMK_YN Q_DRK_FRQ_V09N HGHT
<int> <int> <int> <int> <int> <int> <int>
1: 0 0 NA 3 1 0 144
2: 0 0 NA 2 1 0 162
3: 0 0 NA 3 1 0 163
WGHT WSTC BMI VA_LT VA_RT BP_SYS BP_DIA URN_PROT HGB FBS TOT_CHOL
<int> <int> <num> <num> <num> <int> <int> <int> <num> <int> <int>
1: 61 90 29.4 0.7 0.8 120 80 1 12.6 117 264
2: 51 63 19.4 0.8 1.0 120 80 1 13.8 96 169
3: 65 82 24.5 0.7 0.6 130 80 1 15.0 118 216
TG HDL LDL CRTN SGOT SGPT GGT GFR
<int> <int> <int> <num> <int> <int> <int> <int>
1: 128 60 179 0.9 25 20 25 59
2: 92 70 80 0.9 18 15 28 74
3: 132 55 134 0.8 26 30 30 79
#fsubset(dt, c(1:3, 13:16)) #rows
fsubset(dt, 1:3, 13:16) #(dt, row, col)
HGHT WGHT WSTC BMI
<int> <int> <int> <num>
1: 144 61 90 29.4
2: 162 51 63 19.4
3: 163 65 82 24.5
# fsubset(dt, EXMD_BZ_YYYY %in% 2009:2012 & BMI >= 25) %>% fsubset(c(1:3),c(1:3,13:16))
fsubset(dt, c(1:nrow(dt)),c(1:3, 13:16)) %>% fsubset(EXMD_BZ_YYYY %in% 2009:2012 & BMI >= 25) |> head() # same
EXMD_BZ_YYYY RN_INDI HME_YYYYMM HGHT WGHT WSTC BMI
<int> <int> <int> <int> <int> <int> <num>
1: 2009 562083 200909 144 61 90 29.4
2: 2009 318669 200904 155 66 78 27.5
3: 2009 668438 200904 160 71 94 27.7
4: 2009 560878 200903 144 58 93 28.0
5: 2009 375694 200906 151 70 94 30.7
6: 2009 446652 200909 158 64 80 25.6
roworder(dt, HGHT) %>% fsubset(EXMD_BZ_YYYY %in% 2009:2012 & BMI >= 25) %>%
fsubset(c(1:nrow(dt)),c(1:3,13:16)) |> head()
EXMD_BZ_YYYY RN_INDI HME_YYYYMM HGHT WGHT WSTC BMI
<int> <int> <int> <int> <int> <int> <num>
1: 2009 562083 200909 144 61 90 29.4
2: 2009 560878 200903 144 58 93 28.0
3: 2011 562083 201111 144 59 88 28.5
4: 2011 519824 201109 145 58 79 27.6
5: 2011 914987 201103 145 70 95 33.3
6: 2012 560878 201208 145 59 85 28.1
Collapse package
So far we have covered row/column handling in collapse. Next are the tools in collapse that enable faster computation and data processing.
Fast Statistical Function
.FAST_STAT_FUN
# [1] "fmean" "fmedian" "fmode" "fsum" "fprod"
# [6] "fsd" "fvar" "fmin" "fmax" "fnth"
# [11] "ffirst" "flast" "fnobs" "fndistinct"
# Not tied to any particular data structure.
v1 <- c(1,2,3,4)
m1 <- matrix(1:50, nrow = 10, ncol = 5)
fmean(v1); fmean(m1); fmean(dt)
fmode(v1); fmode(m1); fmode(dt)
# fmean(m1): by columns
# collapse shows faster speeds compared with base R.
x <- rnorm(1e7)
microbenchmark(mean(x), fmean(x), fmean(x, nthreads = 4))
Unit: milliseconds
expr min lq mean median uq
mean(x) 23.761096 23.786943 23.802908 23.799427 23.815928
fmean(x) 15.332085 15.367978 15.388554 15.387914 15.404170
fmean(x, nthreads = 4) 4.217606 6.684896 7.634676 7.741456 8.499509
max neval cld
23.94914 100 a
15.57999 100 b
11.20740 100 c
microbenchmark(colMeans(dt), sapply(dt, mean), fmean(dt))
Unit: microseconds
expr min lq mean median uq max
colMeans(dt) 3154.750 3302.700 3300.31781 3307.968 3312.7130 3641.641
sapply(dt, mean) 190.076 199.417 208.52219 206.010 215.9145 318.417
fmean(dt) 52.889 53.803 56.23603 55.644 56.8805 90.947
neval cld
100 a
100 b
100 c
- Even more useful for larger data. (GGDC10S: 5000 rows, 11 cols, ~10% missing values)
microbenchmark(base = sapply(GGDC10S[6:16], mean, na.rm = TRUE), fmean(GGDC10S[6:16]))
Unit: microseconds
expr min lq mean median uq max neval
base 412.369 429.161 773.8810 807.4445 818.8705 7949.178 100
fmean(GGDC10S[6:16]) 94.481 95.856 102.7777 103.9790 108.0860 142.060 100
cld
a
b
- As shown, collapse is a package that handles any data format and stands out for its speed.
Let's look at its syntax.
- Fast Statistical Functions
Syntax:
FUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE], use.g.names = TRUE, [drop = TRUE,] [nthreads = 1,] ...)
Argument Description
g grouping vectors / lists of vectors or ’GRP’ object
w a vector of (frequency) weights
TRA a quoted operation to transform x using the statistics
na.rm efficiently skips missing values in x
use.g.names generate names/row-names from g
drop drop dimensions if g = TRA = NULL
nthreads number of threads for OpenMP multithreading
Usage example: fmean
# Weighted Mean
w <- abs(rnorm(nrow(iris)))
all.equal(fmean(num_vars(iris), w = w), sapply(num_vars(iris), weighted.mean, w = w))
[1] TRUE
wNA <- na_insert(w, prop = 0.05)
sapply(num_vars(iris), weighted.mean, w = wNA) # weighted.mean(): cannot handle missing weights.
Sepal.Length Sepal.Width Petal.Length Petal.Width
NA NA NA NA
fmean(num_vars(iris), w = wNA) # missing weights are skipped automatically.
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.797389 3.048473 3.683776 1.151507
# Grouped Mean
fmean(iris$Sepal.Length, g = iris$Species)
setosa versicolor virginica
5.006 5.936 6.588
fmean(num_vars(iris), iris$Species)
Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026
# Weighted Group Mean
fmean(num_vars(iris), iris$Species, w)
Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa 5.015518 3.460443 1.479887 0.2495797
versicolor 5.918636 2.698947 4.259102 1.2888099
virginica 6.568402 2.959146 5.577613 2.0433786
# The speed advantage.
microbenchmark(fmean = fmean(iris$Sepal.Length, iris$Species),
tapply = tapply(iris$Sepal.Length, iris$Species, mean))
Unit: microseconds
expr min lq mean median uq max neval cld
fmean 7.488 7.8230 8.49842 8.2905 8.5785 32.276 100 a
tapply 46.609 47.8695 49.83706 48.5615 48.9820 152.046 100 b
Consideration w/ missing data: handling NAs
#wlddev$GINI, g: country, function: mean, median, min, max, sum, prod
collap(wlddev, GINI ~ country, list(mean, median, min, max, sum, prod),
na.rm = TRUE, give.names = FALSE) |> head()
country mean median min max sum prod
1 Afghanistan NaN NA Inf -Inf 0.0 1.000000e+00
2 Albania 31.41111 31.7 27.0 34.6 282.7 2.902042e+13
3 Algeria 34.36667 35.3 27.6 40.2 103.1 3.916606e+04
4 American Samoa NaN NA Inf -Inf 0.0 1.000000e+00
5 Andorra NaN NA Inf -Inf 0.0 1.000000e+00
6 Angola 48.66667 51.3 42.7 52.0 146.0 1.139065e+05
# na.rm = TRUE is the default for the fast functions; groups where every value is NA consistently return NA.
collap(wlddev, GINI ~ country, list(fmean, fmedian, fmin, fmax, fsum, fprod),
give.names = FALSE) |> head()
country fmean fmedian fmin fmax fsum fprod
1 Afghanistan NA NA NA NA NA NA
2 Albania 31.41111 31.7 27.0 34.6 282.7 2.902042e+13
3 Algeria 34.36667 35.3 27.6 40.2 103.1 3.916606e+04
4 American Samoa NA NA NA NA NA NA
5 Andorra NA NA NA NA NA NA
6 Angola 48.66667 51.3 42.7 52.0 146.0 1.139065e+05
microbenchmark(a = collap(wlddev, GINI ~ country, list(mean, median, min, max, sum, prod),
na.rm = TRUE, give.names = FALSE) |> head(),
b=collap(wlddev, GINI ~ country, list(fmean, fmedian, fmin, fmax, fsum, fprod),
give.names = FALSE) |> head())
Unit: microseconds
expr min lq mean median uq max neval cld
a 9783.454 9857.552 10483.5492 9942.606 10258.8435 15291.497 100 a
b 534.478 577.196 604.1118 615.318 626.4175 855.808 100 b
# The speed advantage is confirmed once again.
TRA function
- The TRA function makes it easy to apply transformations across many rows/columns.
Syntax:
TRA(x, STATS, FUN = "-", g = NULL, set = FALSE, ...)
setTRA(x, STATS, FUN = "-", g = NULL, ...)
STATS = vector/matrix/list of statistics
0 "replace_NA" replace missing values in x
1 "replace_fill" replace data and missing values in x
2 "replace" replace data but preserve missing values in x
3 "-" subtract (i.e. center)
4 "-+" center on overall average statistic
5 "/" divide (i.e. scale)
6 "%" compute percentages (i.e. divide and multiply by 100)
7 "+" add
8 "*" multiply
9 "%%" modulus (i.e. remainder from division by STATS)
10 "-%%" subtract modulus (i.e. make data divisible by STATS)
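A minimal sketch of a few of the codes above, using a scalar statistic (the values in the comments follow directly from the definitions in the table):

```r
library(collapse)

x <- c(1, 2, NA, 4)

TRA(x, 10, "replace_NA")  # 1 2 10 4    : fill missing values only
TRA(x, 10, "-")           # -9 -8 NA -6 : center, i.e. subtract the statistic
TRA(x, 10, "%")           # 10 20 NA 40 : x as a percentage of the statistic
```

With a grouping vector g, STATS is expected to hold one statistic per group, as in TRA(x, fmean(x, g), "-", g).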
dt2 <- as.data.table(iris)
attach(iris) # attach() lets us call variables by name directly, as in data.table.
# Difference from the group mean: g = Species
all_obj_equal(Sepal.Length - ave(Sepal.Length, g = Species),
fmean(Sepal.Length, g = Species, TRA= "-"),
TRA(Sepal.Length, fmean(Sepal.Length, g = Species), "-", g = Species))
[1] TRUE
microbenchmark(baseR= Sepal.Length - ave(Sepal.Length, g = Species),
fmean = fmean(Sepal.Length, g = Species, TRA= "-"),
TRA_fmean = TRA(Sepal.Length, fmean(Sepal.Length, g = Species), "-", g = Species));detach(iris)
Unit: microseconds
expr min lq mean median uq max neval cld
baseR 57.640 58.9120 61.35788 59.9555 61.2070 156.905 100 a
fmean 3.754 3.9635 4.29474 4.1200 4.2485 19.150 100 b
TRA_fmean 11.907 12.4290 13.49149 13.0765 13.4220 55.347 100 c
- Prefer invoking TRA through the Fast Statistical Functions rather than calling TRA() separately!
# Example
num_vars(dt2) %<>% na_insert(prop = 0.05)
# Replace NA values with the group median.
num_vars(dt2) |> fmedian(iris$Species, TRA = "replace_NA", set = TRUE)
# num_vars(dt2) |> fmean(iris$Species, TRA = "replace_NA", set = TRUE) --> impute with the mean instead.
# Many operations and tasks can be handled in one pass.
mtcars |> ftransform(A = fsum(mpg, TRA = "%"),
B = mpg > fmedian(mpg, cyl, TRA = "replace_fill"),
C = fmedian(mpg, list(vs, am), wt, "-"),
D = fmean(mpg, vs,, 1L) > fmean(mpg, am,, 1L)) |> head(3)
mpg cyl disp hp drat wt qsec vs am gear carb A B
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 3.266449 TRUE
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 3.266449 TRUE
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 3.546430 FALSE
C D
Mazda RX4 1.3 FALSE
Mazda RX4 Wag 1.3 FALSE
Datsun 710 -7.6 TRUE
Grouping Object
- With the GRP() function, grouped computations become simple and fast.
Syntax: GRP(X, by = NULL, sort = TRUE, decreasing = FALSE, na.last = TRUE, return.groups = TRUE, return.order = sort, method = "auto", ...)
g <- GRP(iris, by = ~ Species)
print(g)
collapse grouping object of length 150 with 3 ordered groups
Call: GRP.default(X = iris, by = ~Species), X is sorted
Distribution of group sizes:
Min. 1st Qu. Median Mean 3rd Qu. Max.
50 50 50 50 50 50
Groups with sizes:
setosa versicolor virginica
50 50 50
str(g)
Class 'GRP' hidden list of 9
$ N.groups : int 3
$ group.id : int [1:150] 1 1 1 1 1 1 1 1 1 1 ...
$ group.sizes : int [1:3] 50 50 50
$ groups :'data.frame': 3 obs. of 1 variable:
..$ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 2 3
$ group.vars : chr "Species"
$ ordered : Named logi [1:2] TRUE TRUE
..- attr(*, "names")= chr [1:2] "ordered" "sorted"
$ order : int [1:150] 1 2 3 4 5 6 7 8 9 10 ...
..- attr(*, "starts")= int [1:3] 1 51 101
..- attr(*, "maxgrpn")= int 50
..- attr(*, "sorted")= logi TRUE
$ group.starts: int [1:3] 1 51 101
$ call : language GRP.default(X = iris, by = ~Species)
# Build the GRP object once and pass it to the functions!
fmean(num_vars(iris), g)
Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026
fmean(num_vars(iris), iris$Species)
Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026
Factors in operation
collapse is agnostic to data formats: factors can be used directly in computations, and qF() generates factors quickly.
x <- na_insert(rnorm(1e7), prop = 0.01)
g <- sample.int(1e6, 1e7, TRUE)
# compare with GRP()
system.time(gg <- GRP(g))
user system elapsed
0.649 0.027 0.677
system.time(f <- qF(g, na.exclude = FALSE))
user system elapsed
0.254 0.044 0.298
class(f)
[1] "factor" "na.included"
microbenchmark(fmean(x, g),
fmean(x, gg),
fmean(x, gg, na.rm = FALSE),
fmean(x, f))
## Unit: milliseconds
## expr min lq mean median
## fmean(x, g) 146.060983 150.493309 155.02585 152.197822
## fmean(x, gg) 25.354564 27.709625 29.48497 29.022157
## fmean(x, gg, na.rm = FALSE) 13.184534 13.783585 15.61769 14.128067
## fmean(x, f) 24.847271 27.503661 29.47271 29.248580
# qF() yields a performance gain similar to GRP().
Summary: FAST grouping and Ordering
A variety of functions are available.
GRP() Fast sorted or unsorted grouping of multivariate data, returns detailed object of class ’GRP’
qF()/qG() Fast generation of factors and quick-group (’qG’) objects from atomic vectors
finteraction() Fast interactions: returns factor or ’qG’ objects
fdroplevels() Efficiently remove unused factor levels
radixorder() Fast ordering and ordered grouping
group() Fast first-appearance-order grouping: returns ’qG’ object
gsplit() Split vector based on ’GRP’ object
greorder() Reorder results from gsplit() back to the original data order
Functions that also return ’qG’ objects:
groupid() Generalized run-length-type grouping
seqid() Grouping of integer sequences
timeid() Grouping of time sequences (based on GCD)
dapply() Apply a function to rows or columns of data.frame or matrix based objects.
BY() Apply a function to vectors or matrix/data frame columns by groups.
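A small sketch of a few of these functions (the toy vector and outputs in the comments are illustrative):

```r
library(collapse)

g <- c("b", "a", "b", "c", "a")

qF(g)                # fast factor: sorted levels a, b, c
group(g)             # 'qG' object, first-appearance order: 1 2 1 3 2
gsplit(1:5, GRP(g))  # split a vector by a 'GRP' object
BY(iris$Sepal.Length, iris$Species, mad)  # apply any function by groups
```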
- Specialized Data Transformation Functions
fbetween() / fwithin() Fast averaging and (quasi-)centering.
fhdbetween() / fhdwithin() Higher-dimensional averaging/centering and linear prediction/partialling out (powered by fixest’s algorithm for multiple factors).
fscale() (Advanced) scaling and centering.
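A quick sketch of fbetween()/fwithin() on a toy vector (the comments state the values implied by the group means):

```r
library(collapse)

x <- c(1, 2, 3, 4)
g <- c(1, 1, 2, 2)

fbetween(x, g)  # 1.5 1.5 3.5 3.5   : group means expanded to full length
fwithin(x, g)   # -0.5 0.5 -0.5 0.5 : demeaned within groups
fscale(x, g)    # scaled and centered within groups
```

Note that fbetween(x, g) + fwithin(x, g) reconstructs x.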
- Time / Panel Series Functions
fcumsum() Cumulative sums
flag() Lags and leads
fdiff() (Quasi-, Log-, Iterated-) differences
fgrowth() (Compounded-) growth rates
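On an ungrouped toy series these behave as follows (a minimal sketch; with panel data a grouping vector and time variable are passed as extra arguments):

```r
library(collapse)

x <- c(100, 110, 121)

flag(x)     # NA 100 110 : one lag
fdiff(x)    # NA 10 11   : first difference
fgrowth(x)  # NA 10 10   : growth rate in percent
fcumsum(x)  # 100 210 331
```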
- Data manipulation functions
fselect(), fsubset(), fgroup_by(), [f/set]transform[v](),
fmutate(), fsummarise(), across(), roworder[v](),
colorder[v](), [f/set]rename(), [set]relabel()
collapse is fast!
fdim(wlddev) ## faster dim() for data frames; wlddev: 13176 rows, 13 cols
# From 1990 onward: compute ODA/POP, then grouped sums (g: region, income, OECD)
microbenchmark(
dplyr = qDT(wlddev) |>
filter(year >= 1990) |>
mutate(ODA_POP = ODA / POP) |>
group_by(region, income, OECD) |>
summarise(across(PCGDP:POP, sum, na.rm = TRUE), .groups = "drop") |>
arrange(income, desc(PCGDP)),
data.table = qDT(wlddev)[, ODA_POP := ODA / POP][
year >= 1990, lapply(.SD, sum, na.rm = TRUE),
by = .(region, income, OECD), .SDcols = PCGDP:ODA_POP][
order(income, -PCGDP)],
collapse_base = qDT(wlddev) |>
fsubset(year >= 1990) |>
fmutate(ODA_POP = ODA / POP) |>
fgroup_by(region, income, OECD) |>
fsummarise(across(PCGDP:ODA_POP, sum, na.rm = TRUE)) |>
roworder(income, -PCGDP),
collapse_optimized = qDT(wlddev) |>
fsubset(year >= 1990, region, income, OECD, PCGDP:POP) |>
fmutate(ODA_POP = ODA / POP) |>
fgroup_by(1:3, sort = FALSE) |> fsum() |>
roworder(income, -PCGDP)
)
## Unit: microseconds
## expr min lq mean median uq max neval
## dplyr 71955.523 72291.9715 80009.2208 72453.1165 76902.671 393947.262 100
## data.table 5960.503 6310.7045 7116.6673 6721.3450 7046.837 18615.736 100
## collapse_base 859.505 948.2200 1041.1137 990.1375 1061.864 3148.804 100
## collapse_optimized 442.040 482.9705 542.6927 523.6950 574.921 1036.817 100
collapse w/ Fast Statistical Functions: various applications
# The three calls below produce the same result.
# Sum of mpg by cyl
mtcars %>% ftransform(mpg_sum = fsum(mpg, g = cyl, TRA = "replace_fill")) %>% invisible()
mtcars %>% fgroup_by(cyl) %>% ftransform(mpg_sum = fsum(mpg, GRP(.), TRA = "replace_fill")) %>% invisible()
mtcars %>% fgroup_by(cyl) %>% fmutate(mpg_sum = fsum(mpg)) %>% head(10)
mpg cyl disp hp drat wt qsec vs am gear carb mpg_sum
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 138.2
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 138.2
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 293.3
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 138.2
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 211.4
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 138.2
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 211.4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 293.3
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 293.3
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 138.2
- ad-hoc grouping, often fastest!
microbenchmark(a=mtcars %>% ftransform(mpg_sum = fsum(mpg, g = cyl, TRA = "replace_fill")),
b=mtcars %>% fgroup_by(cyl) %>% ftransform(mpg_sum = fsum(mpg, GRP(.), TRA = "replace_fill")),
c=mtcars %>% fgroup_by(cyl) %>% fmutate(mpg_sum = fsum(mpg)))
Unit: microseconds
expr min lq mean median uq max neval cld
a 27.362 28.9525 30.22495 29.9020 31.5925 39.924 100 a
b 64.001 66.0010 68.26108 67.2155 68.9120 114.379 100 b
c 78.678 80.4105 84.21481 81.2445 82.4050 264.876 100 c
- ftransform() ignores a preceding fgroup_by(); the two results below differ. (Only fmutate and fsummarise respect the prior grouping.)
mtcars %>% fgroup_by(cyl) %>% ftransform(mpg_sum = fsum(mpg, GRP(.), TRA = "replace_fill")) %>% head()
mpg cyl disp hp drat wt qsec vs am gear carb mpg_sum
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 138.2
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 138.2
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 293.3
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 138.2
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 211.4
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 138.2
mtcars %>% fgroup_by(cyl) %>% ftransform(mpg_sum = fsum(mpg, TRA = "replace_fill")) %>% head()
mpg cyl disp hp drat wt qsec vs am gear carb mpg_sum
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 642.9
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 642.9
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 642.9
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 642.9
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 642.9
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 642.9
- As noted above, the TRA argument of collapse is faster than base R's "/".
microbenchmark(
"/"= mtcars |> fgroup_by(cyl) |> fmutate(mpg_prop = mpg / fsum(mpg)) |> head(),
"TRA=/" = mtcars |> fgroup_by(cyl) |> fmutate(mpg_prop = fsum(mpg, TRA = "/")) |> head()
)
Unit: microseconds
expr min lq mean median uq max neval cld
/ 207.643 210.3915 214.5014 212.7320 215.4790 281.881 100 a
TRA=/ 196.455 200.0970 205.9386 202.4655 205.0695 471.228 100 b
- fsum() computes by group, whereas base sum() is applied to the whole column.
mtcars |> fgroup_by(cyl) |> fmutate(mpg_prop2 = fsum(mpg) / sum(mpg)) |> head()
mpg cyl disp hp drat wt qsec vs am gear carb mpg_prop2
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0.2149634
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0.2149634
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 0.4562140
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 0.2149634
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 0.3288225
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 0.2149634
- Pipes (%>%) can be used freely.
# The two calls below are equivalent.
mtcars %>% fgroup_by(cyl) %>% ftransform(fselect(., hp:qsec) %>% fsum(TRA = "/")) %>% invisible()
mtcars %>% fgroup_by(cyl) %>% fmutate(across(hp:qsec, fsum, TRA = "/")) %>% head()
mpg cyl disp hp drat wt qsec vs
Mazda RX4 21.0 6 160 0.12850467 0.15537849 0.12007333 0.13080102 0
Mazda RX4 Wag 21.0 6 160 0.12850467 0.15537849 0.13175985 0.13525111 0
Datsun 710 22.8 4 108 0.10231023 0.08597588 0.09227220 0.08840435 1
Hornet 4 Drive 21.4 6 258 0.12850467 0.12270916 0.14734189 0.15448188 1
Hornet Sportabout 18.7 8 360 0.05974735 0.06967485 0.06144064 0.07248414 0
Valiant 18.1 6 225 0.12266355 0.10996016 0.15857012 0.16068023 1
am gear carb
Mazda RX4 1 4 4
Mazda RX4 Wag 1 4 4
Datsun 710 1 4 1
Hornet 4 Drive 0 3 1
Hornet Sportabout 0 3 2
Valiant 0 3 1
- With set = TRUE, the result is written back to the original data.
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# Each value of columns hp:qsec as a share of its group (cyl) sum.
mtcars %>% fgroup_by(cyl) %>% fmutate(across(hp:qsec, fsum, TRA = "/", set = TRUE)) %>% invisible()
head(mtcars)
mpg cyl disp hp drat wt qsec vs
Mazda RX4 21.0 6 160 0.12850467 0.15537849 0.12007333 0.13080102 0
Mazda RX4 Wag 21.0 6 160 0.12850467 0.15537849 0.13175985 0.13525111 0
Datsun 710 22.8 4 108 0.10231023 0.08597588 0.09227220 0.08840435 1
Hornet 4 Drive 21.4 6 258 0.12850467 0.12270916 0.14734189 0.15448188 1
Hornet Sportabout 18.7 8 360 0.05974735 0.06967485 0.06144064 0.07248414 0
Valiant 18.1 6 225 0.12266355 0.10996016 0.15857012 0.16068023 1
am gear carb
Mazda RX4 1 4 4
Mazda RX4 Wag 1 4 4
Datsun 710 1 4 1
Hornet 4 Drive 0 3 1
Hornet Sportabout 0 3 2
Valiant 0 3 1
- With .apply = FALSE, the function receives the whole (grouped) data subset instead of each column separately.
# Pairwise correlations of the variables hp:qsec within each cyl group
mtcars %>% fgroup_by(cyl) %>% fsummarise(across(hp:qsec, \(x) qDF(pwcor(x), "var"), .apply = FALSE)) %>% head()
cyl var hp drat wt qsec
1 4 hp 1.0000000 -0.4702200 0.1598761 -0.1783611
2 4 drat -0.4702200 1.0000000 -0.4788681 -0.2833656
3 4 wt 0.1598761 -0.4788681 1.0000000 0.6380214
4 4 qsec -0.1783611 -0.2833656 0.6380214 1.0000000
5 6 hp 1.0000000 0.2171636 -0.3062284 -0.6280148
6 6 drat 0.2171636 1.0000000 -0.3546583 -0.6231083
Columns can be referenced by name, position, vectors, or regular expressions.
get_vars(x, vars, return = "names", regex = FALSE, ...)
get_vars(x, vars, regex = FALSE, ...) <- value
- The insertion position can also be chosen.
add_vars(x, ..., pos = "end")
add_vars(x, pos = "end") <- value
- Columns can be selected by data type.
num_vars(x, return = "data"); cat_vars(x, return = "data"); char_vars(x, return = "data");
fact_vars(x, return = "data"); logi_vars(x, return = "data"); date_vars(x, return = "data")
- Replacement is also possible.
num_vars(x) <- value; cat_vars(x) <- value; char_vars(x) <- value;
fact_vars(x) <- value; logi_vars(x) <- value; date_vars(x) <- value
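For instance, on iris (a small sketch): num_vars() keeps the four measurement columns and fact_vars() keeps Species.

```r
library(collapse)

num_vars(iris) |> head(2)  # numeric columns only
names(fact_vars(iris))     # "Species"
```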
Efficient programming
> quick data conversion
- qDF(), qDT(), qTBL(), qM(), mrtl(), mctl()
- anyv(x, value) / allv(x, value)  # Faster than any/all(x == value)
- allNA(x)                         # Faster than all(is.na(x))
- whichv(x, value, invert = FALSE) # Faster than which(x == value) / which(x != value)
- whichNA(x, invert = FALSE)       # Faster than which(is.na(x)) / which(!is.na(x))
- x %==% value / x %!=% value      # Infix for whichv(x, value, FALSE/TRUE)
- setv(X, v, R, ...)               # x[x == v] <- r / x[v] <- r[v] (by reference)
- setop(X, op, V, rowwise = FALSE) # Faster than X <- X +/-/*// V (by reference)
- X %+=% V, %-=%, %*=%, %/=%       # Infix for setop()
- na_rm(x)                         # Fast: if(anyNA(x)) x[!is.na(x)] else x
- na_omit(X, cols = NULL, ...)     # Faster na.omit for matrices and data frames
- vlengths(X, use.names = TRUE)    # Faster version of lengths()
- frange(x, na.rm = TRUE)          # Much faster base::range
- fdim(X)                          # Faster dim for data frames
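A few of these helpers on a toy vector (a sketch; outputs in the comments follow from the base-R equivalents listed above):

```r
library(collapse)

x <- c(3, 1, 3, NA, 3)

anyv(x, 3)    # TRUE
whichv(x, 3)  # 1 3 5
allNA(x)      # FALSE
na_rm(x)      # 3 1 3 3
frange(x)     # 1 3 (na.rm = TRUE by default)
```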
Collapse and data.table
Let's look at applying collapse within data.table.
DT <- qDT(wlddev) # as.data.table(wlddev)
DT %>% fgroup_by(country) %>% get_vars(9:13) %>% fmean() #fgroup_by == gby
country PCGDP LIFEEX GINI ODA POP
<char> <num> <num> <num> <num> <num>
1: Afghanistan 483.8351 49.19717 NA 1487548499 18362258.22
2: Albania 2819.2400 71.68027 31.41111 312928126 2708297.17
3: Algeria 3532.2714 63.56290 34.36667 612238500 25305290.68
4: American Samoa 10071.0659 NA NA NA 43115.10
5: Andorra 40083.0911 NA NA NA 51547.35
---
212: Virgin Islands (U.S.) 35629.7336 73.71292 NA NA 92238.53
213: West Bank and Gaza 2388.4348 71.60780 34.52500 1638581462 3312289.13
214: Yemen, Rep. 1069.6596 52.53707 35.46667 859950996 13741375.82
215: Zambia 1318.8627 51.09263 52.68889 734624330 8614972.38
216: Zimbabwe 1219.4360 54.53360 45.93333 397104997 9402160.33
DT[, lapply(.SD, mean, na.rm = TRUE), keyby = country, .SDcols = 9:13] %>% invisible()
collap(DT, ~ country, fmean, cols = 9:13) %>% invisible() #same
microbenchmark(collapse = DT %>% gby(country) %>% get_vars(9:13) %>% fmean,
data.table = DT[, lapply(.SD, mean, na.rm = TRUE), keyby = country, .SDcols = 9:13],
data.table2 = DT[, lapply(.SD, fmean, na.rm = TRUE), keyby = country, .SDcols = 9:13])
Unit: microseconds
expr min lq mean median uq max neval cld
collapse 334.570 368.7865 389.3642 396.088 400.6755 668.870 100 a
data.table 5047.715 5215.1535 5386.4627 5293.606 5492.7620 8787.870 100 b
data.table2 5215.911 5295.7220 5637.0815 5379.914 5587.5015 9707.914 100 c
DT[, lapply(.SD, fmean, …)] turns out slower than DT[, lapply(.SD, mean, …)]. Inside data.table, mean is not base R's mean: it is swapped for gmean, an internally (GForce-)optimized version. fmean called through lapply, by contrast, cannot use that optimized path and is invoked once per group, so it ends up slower.
The call above is processed much like:
BY(gv(DT, 9:13), g, fmean)
i.e. fmean is applied separately to every group and column, which is slow. This can be largely avoided as follows:
fmean(gv(DT, 9:13), DT$country)
g <- GRP(DT, "country"); add_vars(g[["groups"]], fmean(gv(DT, 9:13), g))
DT <- qDT(wlddev); g <- GRP(DT, "country")
#gv: abbreviation for get_vars()
microbenchmark(a = fmean(gv(DT, 9:13), DT$country),
b0= g <- GRP(DT, "country"),
b = add_vars(g[["groups"]], fmean(gv(DT, 9:13), g)),
dt_fmean = DT[, lapply(.SD, fmean, na.rm = TRUE), keyby = country, .SDcols = 9:13],
dt_gmean = DT[, lapply(.SD, mean, na.rm = TRUE), keyby = country, .SDcols = 9:13])
Unit: microseconds
expr min lq mean median uq max neval cld
a 358.543 378.1060 389.9855 391.505 398.437 615.572 100 a
b0 76.665 100.1125 107.3157 111.378 114.803 147.121 100 b
b 224.602 241.6375 257.6594 255.871 265.802 679.694 100 ab
dt_fmean 5131.574 5300.4195 5686.7481 5472.694 5572.977 12437.685 100 c
dt_gmean 5073.098 5293.6580 5446.3542 5399.370 5579.908 6110.983 100 d
dplyr's data %>% group_by(…) %>% summarize(…) and data.table's [i, j, by] are general-purpose idioms for applying functions to groups of data. They accept arbitrary functions, and data.table additionally optimizes a handful of built-ins (min, max, mean, …) internally via GForce.
collapse instead groups the data (fgroup_by, collap) and computes its statistical and transformation functions in C++.
Nearly every collapse function (BY being the exception) carries this grouped optimization built in, but inside data.table it cannot benefit from GForce, and dispatching it through lapply adds per-group call overhead.
Can we still use fmean inside data.table, then?
DT[, fmean(.SD, country), .SDcols = 9:13]
PCGDP LIFEEX GINI ODA POP
<num> <num> <num> <num> <num>
1: 483.8351 49.19717 NA 1487548499 18362258.22
2: 2819.2400 71.68027 31.41111 312928126 2708297.17
3: 3532.2714 63.56290 34.36667 612238500 25305290.68
4: 10071.0659 NA NA NA 43115.10
5: 40083.0911 NA NA NA 51547.35
---
212: 35629.7336 73.71292 NA NA 92238.53
213: 2388.4348 71.60780 34.52500 1638581462 3312289.13
214: 1069.6596 52.53707 35.46667 859950996 13741375.82
215: 1318.8627 51.09263 52.68889 734624330 8614972.38
216: 1219.4360 54.53360 45.93333 397104997 9402160.33
DT[, fmean(gby(.SD, country)), .SDcols = c(1L, 9:13)] #gby = abbreviation for fgroup_by()
country PCGDP LIFEEX GINI ODA POP
<char> <num> <num> <num> <num> <num>
1: Afghanistan 483.8351 49.19717 NA 1487548499 18362258.22
2: Albania 2819.2400 71.68027 31.41111 312928126 2708297.17
3: Algeria 3532.2714 63.56290 34.36667 612238500 25305290.68
4: American Samoa 10071.0659 NA NA NA 43115.10
5: Andorra 40083.0911 NA NA NA 51547.35
---
212: Virgin Islands (U.S.) 35629.7336 73.71292 NA NA 92238.53
213: West Bank and Gaza 2388.4348 71.60780 34.52500 1638581462 3312289.13
214: Yemen, Rep. 1069.6596 52.53707 35.46667 859950996 13741375.82
215: Zambia 1318.8627 51.09263 52.68889 734624330 8614972.38
216: Zimbabwe 1219.4360 54.53360 45.93333 397104997 9402160.33
microbenchmark(collapse = DT %>% gby(country) %>% get_vars(9:13) %>% fmean,
data.table = DT[, lapply(.SD, mean, na.rm = TRUE), keyby = country, .SDcols = 9:13],
data.table_base = DT[, lapply(.SD, base::mean, na.rm = TRUE), keyby = country, .SDcols = 9:13],
hybrid_bad = DT[, lapply(.SD, fmean), keyby = country, .SDcols = 9:13],
hybrid_ok = DT[, fmean(gby(.SD, country)), .SDcols = c(1L, 9:13)])
Unit: microseconds
expr min lq mean median uq max
collapse 341.792 366.5840 384.8933 390.777 399.7335 501.416
data.table 5125.348 5255.6865 5562.5321 5365.828 5636.5370 11012.503
data.table_base 2558.182 2589.2545 2626.3920 2615.736 2638.2230 2884.591
hybrid_bad 5166.150 5264.3745 5555.3007 5330.985 5526.0495 10046.046
hybrid_ok 842.343 883.5925 901.0746 898.846 920.2410 995.945
neval cld
100 a
100 b
100 c
100 b
100 d
- Mixing fmean and friends into data.table's [i, j, by] is therefore not advisable.
DT %>% gby(country) %>% get_vars(9:13) %>% fmean
fmean(gv(DT, 9:13), DT$country)
- For efficiency, compute outside data.table as above.
# An example beyond fmean: fsum
# Sum of ODA by country: the three lines below are all equivalent.
DT[, sum_ODA := sum(ODA, na.rm = TRUE), by = country]
DT[, sum_ODA := fsum(ODA, country, TRA = "replace_fill")]
settfm(DT, sum_ODA = fsum(ODA, country, TRA = "replace_fill")) # settfm/tfm= settransform/ftransform
# settransform is more convenient than ':=' when modifying several columns.
settfm(DT, perc_c_ODA = fsum(ODA, country, TRA = "%"),
perc_y_ODA = fsum(ODA, year, TRA = "%"))
microbenchmark(
S1 = DT[, sum_ODA := sum(ODA, na.rm = TRUE), by = country],
S2 = DT[, sum_ODA := fsum(ODA, country, TRA = "replace_fill")],
S3 = settfm(DT, sum_ODA = fsum(ODA, country, TRA = "replace_fill"))
)
Unit: microseconds
expr min lq mean median uq max neval cld
S1 2080.392 2157.9970 2232.1498 2242.7080 2288.3410 2563.404 100 a
S2 421.740 507.3445 538.3390 535.7035 578.1545 799.340 100 b
S3 121.292 180.0820 206.7299 201.0345 235.0795 337.004 100 c
As above, prefer doing such computations outside data.table's [ ].
collapse functions useful for data processing alongside data.table:
"fcumsum()" "fscale()" "fbetween()" "fwithin()" "fhdbetween()"
"fhdwithin()" "flag()" "fdiff()" "fgrowth()"
# Centering GDP
#DT[, demean_PCGDP := PCGDP - mean(PCGDP, na.rm = TRUE), by = country]
DT[, demean_PCGDP := fwithin(PCGDP, country)]
settfm(DT, demean_PCGDP = fwithin(PCGDP, country)) # prefer settfm()
# Lagging GDP
#DT[order(year), lag_PCGDP := shift(PCGDP, 1L), by = country]
DT[, lag_PCGDP := flag(PCGDP, 1L, country, year)]
# Computing a growth rate
#DT[order(year), growth_PCGDP := (PCGDP / shift(PCGDP, 1L) - 1) * 100, by = country]
DT[, growth_PCGDP := fgrowth(PCGDP, 1L, 1L, country, year)] # 1 lag, 1 iteration
# Several Growth rates
#DT[order(year), paste0("growth_", .c(PCGDP, LIFEEX, GINI, ODA)) := (.SD / shift(.SD, 1L) - 1) * 100, by = country, .SDcols = 9:13]
DT %<>% tfm(gv(., 9:13) %>% fgrowth(1L, 1L, country, year) %>% add_stub("growth_"))
settfmv(DT, 9:13, G, 1L, 1L, country, year, apply = FALSE)
result <- DT[sample(.N, 7)] |> fselect(9:ncol(DT)); print(result)
PCGDP LIFEEX GINI ODA POP sum_ODA perc_c_ODA
<num> <num> <num> <num> <num> <num> <num>
1: 598.0675 61.607 NA 161470001 48949924 29002710049 0.5567411
2: 1223.2034 55.032 NA 959559998 13115131 23032089813 4.1661873
3: NA 43.213 NA 254979996 29248643 91042129921 0.2800681
4: 1101.9787 56.283 NA 1005000000 7086627 42753590080 2.3506798
5: NA NA NA 322790009 28720 7609940052 4.2416892
6: 3931.0345 66.926 NA 88650002 26900508 30162480091 0.2939082
7: 542.0542 56.963 NA 3204889893 837468930 217402459229 1.4741737
perc_y_ODA demean_PCGDP lag_PCGDP growth_PCGDP growth_LIFEEX growth_GINI
<num> <num> <num> <num> <num> <num>
1: 0.1663325 153.777583 12.674689 12.674689 0.5106536 NA
2: 1.0646265 3.767466 14.701173 14.701173 4.0381125 NA
3: 0.7028241 NA NA NA 0.6263972 NA
4: 1.7515717 -109.153605 -3.550449 -3.550449 1.2102140 NA
5: 0.4978978 NA NA NA NA NA
6: 0.1379507 -650.571973 1.814067 1.814067 0.7148124 NA
7: 5.5856661 -234.271909 7.299421 7.299421 0.7249836 NA
growth_ODA growth_POP G1.PCGDP G1.LIFEEX G1.GINI G1.ODA G1.POP
<num> <num> <num> <num> <num> <num> <num>
1: 14.9498112 0.7936664 12.674689 0.5106536 NA 14.9498112 0.7936664
2: 39.2442542 1.7124987 14.701173 4.0381125 NA 39.2442542 1.7124987
3: 0.3265795 2.9335342 NA 0.6263972 NA 0.3265795 2.9335342
4: -15.8256174 3.0669073 -3.550449 1.2102140 NA -15.8256174 3.0669073
5: -16.6757024 12.6274510 NA NA NA -16.6757024 12.6274510
6: -5.0246357 2.2327826 1.814067 0.7148124 NA -5.0246357 2.2327826
7: 9.9296837 2.1699666 7.299421 0.7249836 NA 9.9296837 2.1699666
- := receives little internal optimization in data.table, so using collapse is faster in most cases.
microbenchmark(
W1 = DT[, demean_PCGDP := PCGDP - mean(PCGDP, na.rm = TRUE), by = country],
W2 = DT[, demean_PCGDP := fwithin(PCGDP, country)],
L1 = DT[order(year), lag_PCGDP := shift(PCGDP, 1L), by = country],
L2 = DT[, lag_PCGDP := flag(PCGDP, 1L, country, year)],
L3 = DT[, lag_PCGDP := shift(PCGDP, 1L), by = country], # Not ordered
L4 = DT[, lag_PCGDP := flag(PCGDP, 1L, country)] # Not ordered
)
Unit: microseconds
expr min lq mean median uq max neval cld
W1 1907.281 1979.7700 2213.1757 2011.0155 2070.968 10277.173 100 a
W2 768.293 805.5075 843.2050 840.9535 865.933 1161.758 100 b
L1 4107.066 4190.2105 4329.4591 4258.0900 4373.160 5072.518 100 c
L2 1273.071 1316.5770 1424.8645 1358.4510 1377.865 8709.820 100 d
L3 2591.240 2637.7440 2710.1965 2678.2500 2708.528 3347.374 100 e
L4 427.919 459.2580 501.5867 498.7115 518.322 879.667 100 f
# Time-series functions such as flag skip the prior reordering step, which produces a clear performance difference.
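As a base-R sketch of what W2 and L2 above compute (the collapse functions `fwithin` and `flag` are assumed to return the same values, just much faster):

```r
# Base-R equivalents of the collapse calls benchmarked above (a sketch only;
# fwithin(x, g) and flag(x, 1L, g) are assumed to match these results).
x <- c(10, 12, 14, 20, 22)
g <- c("A", "A", "A", "B", "B")   # country-like grouping (data already sorted by year)

demean <- x - ave(x, g)           # grouped demeaning, cf. fwithin(x, g)

# per-group lag on sorted data, cf. flag(x, 1L, g)
lag_by_group <- function(v, grp) ave(v, grp, FUN = function(z) c(NA, z[-length(z)]))
lag1 <- lag_by_group(x, g)

demean   # -2  0  2 -1  1
lag1     # NA 10 12 NA 20
```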
m <- qM(mtcars)
# matrix to data: mrtl
mrtl(m, names = T, return = "data.table") %>% head(2) # convert: data.table
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant
<num> <num> <num> <num> <num> <num>
1: 21 21 22.8 21.4 18.7 18.1
2: 6 6 4.0 6.0 8.0 6.0
Duster 360 Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE Merc 450SL
<num> <num> <num> <num> <num> <num> <num>
1: 14.3 24.4 22.8 19.2 17.8 16.4 17.3
2: 8.0 4.0 4.0 6.0 6.0 8.0 8.0
Merc 450SLC Cadillac Fleetwood Lincoln Continental Chrysler Imperial
<num> <num> <num> <num>
1: 15.2 10.4 10.4 14.7
2: 8.0 8.0 8.0 8.0
Fiat 128 Honda Civic Toyota Corolla Toyota Corona Dodge Challenger
<num> <num> <num> <num> <num>
1: 32.4 30.4 33.9 21.5 15.5
2: 4.0 4.0 4.0 4.0 8.0
AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
<num> <num> <num> <num> <num> <num>
1: 15.2 13.3 19.2 27.3 26 30.4
2: 8.0 8.0 8.0 4.0 4 4.0
Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
<num> <num> <num> <num>
1: 15.8 19.7 15 21.4
2: 8.0 6.0 8 4.0
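The qM/mrtl pair above can be mimicked with base R (a sketch only; `qM` and `mrtl` themselves avoid intermediate copies and are much faster):

```r
# Base-R analogue of the qM + mrtl round trip shown above (a sketch):
# qM(mtcars) ~ as.matrix(mtcars), and mrtl(m, names = TRUE) turns the rows
# of m into named columns, i.e. roughly a transpose back into tabular form.
m <- as.matrix(mtcars)               # cf. qM(mtcars)
rows_as_cols <- as.data.frame(t(m))  # cf. mrtl(m, names = TRUE, return = "data.frame")
rows_as_cols[1:2, 1:3]               # mpg and cyl for the first three cars
```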
- fast linear model: flm
wlddev %>% fselect(country, PCGDP, LIFEEX) %>% na_omit(cols = -1) %>%
fsubset(fnobs(PCGDP, country, "replace_fill") > 20L) %>% qDT %>%
.[, qDT(coeftest(lm(G(PCGDP) ~ G(LIFEEX))), "Coef"), keyby = country] %>% head
Key: <country>
country Coef Estimate Std. Error t value Pr(>|t|)
<char> <char> <num> <num> <num> <num>
1: Albania (Intercept) -3.6146411 2.371885 -1.5239527 0.136023086
2: Albania G(LIFEEX) 22.1596308 7.288971 3.0401591 0.004325856
3: Algeria (Intercept) 0.5973329 1.740619 0.3431726 0.732731107
4: Algeria G(LIFEEX) 0.8412547 1.689221 0.4980134 0.620390703
5: Angola (Intercept) -3.3793976 1.540330 -2.1939445 0.034597175
6: Angola G(LIFEEX) 4.2362895 1.402380 3.0207852 0.004553260
# If you only need the intercepts and growth-rate coefficients quickly, use flm with mrtl (no standard errors):
wlddev %>% fselect(country, PCGDP, LIFEEX) %>% na_omit(cols = -1L) %>%
fsubset(fnobs(PCGDP, country, "replace_fill") > 20L) %>% qDT %>%
.[, mrtl(flm(fgrowth(PCGDP)[-1L],
cbind(Intercept = 1, LIFEEX = fgrowth(LIFEEX)[-1L])), TRUE), keyby = country] %>% head
Key: <country>
country Intercept LIFEEX
<char> <num> <num>
1: Albania -3.61464113 22.1596308
2: Algeria 0.59733291 0.8412547
3: Angola -3.37939760 4.2362895
4: Antigua and Barbuda -3.11880717 18.8700870
5: Argentina 1.14613567 -0.2896305
6: Armenia 0.08178344 11.5523992
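What makes flm fast is that it solves only the least-squares problem, with no inference. A base-R sketch of the same computation (assuming `flm(y, X)` returns these coefficients):

```r
# flm(y, X) essentially computes the OLS solution below -- coefficients only,
# no standard errors or tests, hence the large speedup over lm + coeftest.
y <- mtcars$mpg
X <- cbind(Intercept = 1, wt = mtcars$wt)
beta <- qr.solve(X, y)   # least-squares coefficients via QR, cf. flm(y, X)
beta
```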
microbenchmark(
A= wlddev %>% fselect(country, PCGDP, LIFEEX) %>% na_omit(cols = -1) %>%
fsubset(fnobs(PCGDP, country, "replace_fill") > 20L) %>% qDT %>%
.[, qDT(coeftest(lm(G(PCGDP) ~ G(LIFEEX))), "Coef"), keyby = country] ,
B= wlddev %>% fselect(country, PCGDP, LIFEEX) %>% na_omit(cols = -1L) %>%
fsubset(fnobs(PCGDP, country, "replace_fill") > 20L) %>% qDT %>%
.[, mrtl(flm(fgrowth(PCGDP)[-1L],
cbind(Intercept = 1, LIFEEX = fgrowth(LIFEEX)[-1L])), TRUE), keyby = country]
)
Unit: milliseconds
expr min lq mean median uq max neval cld
A 166.570912 167.508276 172.559989 168.680762 172.24341 344.73358 100 a
B 7.115262 7.422762 7.813804 7.526314 7.61188 13.21533 100 b
# Replacing coeftest + lm + G with the collapse equivalents flm + fgrowth yields a large speed gain.
- collapse with lists: rsplit; rapply2d; get_elem; unlist2d
rapply2d(): apply a function to every data.table/frame inside a nested list.
get_elem(): extract the desired elements; unlist2d() can then bind them into a single data.table.
lm_summary_list %>%
get_elem("coefficients") %>%
unlist2d(idcols = .c(Region, Income), row.names = "Coef", DT = TRUE) %>% head
Region Income Coef Estimate
<char> <char> <char> <num>
1: East Asia & Pacific High income (Intercept) 0.5313479
2: East Asia & Pacific High income G(LIFEEX) 2.4935584
3: East Asia & Pacific High income B(G(LIFEEX), country) 3.8297123
4: East Asia & Pacific Lower middle income (Intercept) 1.3476602
5: East Asia & Pacific Lower middle income G(LIFEEX) 0.5238856
6: East Asia & Pacific Lower middle income B(G(LIFEEX), country) 0.9494439
Std. Error t value Pr(>|t|)
<num> <num> <num>
1: 0.7058550 0.7527720 0.451991327
2: 0.7586943 3.2866443 0.001095466
3: 1.6916770 2.2638554 0.024071386
4: 0.7008556 1.9228785 0.055015131
5: 0.7574904 0.6916069 0.489478164
6: 1.2031228 0.7891496 0.430367103
# Of course, the same result can also be obtained this way:
DT[, qDT(coeftest(lm(G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country))), "Coef"),
keyby = .(region, income)]
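The rsplit / get_elem / unlist2d workflow above follows the classic split-apply-bind pattern, which can be sketched with base-R building blocks (the collapse versions are assumed to do the same on nested lists, faster and with less code):

```r
# Base-R sketch of the split-apply-bind pattern: split by group, fit a model
# per group, extract coefficients, then row-bind with an id column
# (cf. rsplit + rapply2d/get_elem + unlist2d(idcols = ...)).
splits <- split(mtcars[, c("mpg", "wt")], mtcars$cyl)        # cf. rsplit
fits   <- lapply(splits, function(d) coef(lm(mpg ~ wt, data = d)))
result <- do.call(rbind, Map(function(id, b) data.frame(cyl = id, t(b)),
                             names(fits), fits))             # cf. unlist2d
result
```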
Summary
collapse is fast and economical in terms of data and memory use.
It works regardless of data format: vectors, matrices, data.tables, and more.
It integrates with existing frameworks (dplyr, tidyverse, data.table, etc.).
When mixed with data.table, performance degrades if collapse is used inside dt[...]; this stems from differences in how the two process data internally.
To get the full benefit when handling data.table objects, use syntax like the following.
Not recommended:
> DT[order(year), paste0("growth_", .c(PCGDP, LIFEEX, GINI, ODA)) := (.SD / shift(.SD, 1L) - 1) * 100,
by = country, .SDcols = 9:13]
Recommended:
>> DT %<>% tfm(gv(., 9:13) %>% fgrowth(1L, 1L, country, year) %>% add_stub("growth_"))
>> settfmv(DT, 9:13, G, 1L, 1L, country, year, apply = FALSE)
Reuse
Citation
@online{lee2024,
author = {LEE, Hojun},
title = {Collapse {패키지} {소개} V2},
date = {2024-10-29},
url = {https://blog.zarathu.com/posts/2024-10-28-Collapse/},
langid = {en}
}