Introduction
- Collapse
A package built on C/C++ to make large datasets easier to handle.
Its goal is to dramatically improve the performance of R code so that large data can be processed quickly and efficiently.
It maximizes performance while providing a stable, optimized user API that integrates with existing data-manipulation frameworks (dplyr, tidyverse, data.table, etc.).
- MAIN FOCUS -> use it together with data.table to process operations even faster.
Setup
Basic
Read and write csv files with fread & fwrite, just as in data.table.
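For example (a minimal sketch using a temporary file, so it is self-contained; the actual file name used in these notes is not shown):

```r
library(data.table)  # fread()/fwrite() are provided by data.table

# write a small table to a temporary csv and read it back
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(id = 1:3, val = c(1.5, 2.5, 3.5)), tmp)
dt <- fread(tmp)
dt  # a data.table with 3 rows and 2 columns
```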
Columns: fselect() selects the desired columns.
EXMD_BZ_YYYY RN_INDI HME_YYYYMM HGHT WGHT WSTC BMI
<int> <int> <int> <int> <int> <int> <num>
1: 2009 562083 200909 144 61 90 29.4
2: 2009 334536 200911 162 51 63 19.4
3: 2009 911867 200903 163 65 82 24.5
4: 2009 183321 200908 152 51 70 22.1
5: 2009 942671 200909 159 50 73 19.8
6: 2009 979358 200912 157 55 73 22.3
fselect(dt, EXMD_BZ_YYYY,RN_INDI,HME_YYYYMM )|> head() # fselect(dt, "EXMD_BZ_YYYY","RN_INDI","HME_YYYYMM" )
EXMD_BZ_YYYY RN_INDI HME_YYYYMM
<int> <int> <int>
1: 2009 562083 200909
2: 2009 334536 200911
3: 2009 911867 200903
4: 2009 183321 200908
5: 2009 942671 200909
6: 2009 979358 200912
Rows: fsubset() selects the desired rows (and columns).
fsubset(dt, 1:3)
EXMD_BZ_YYYY RN_INDI HME_YYYYMM Q_PHX_DX_STK Q_PHX_DX_HTDZ Q_PHX_DX_HTN
<int> <int> <int> <int> <int> <int>
1: 2009 562083 200909 0 0 1
2: 2009 334536 200911 0 0 0
3: 2009 911867 200903 0 0 0
Q_PHX_DX_DM Q_PHX_DX_DLD Q_PHX_DX_PTB Q_HBV_AG Q_SMK_YN Q_DRK_FRQ_V09N HGHT
<int> <int> <int> <int> <int> <int> <int>
1: 0 0 NA 3 1 0 144
2: 0 0 NA 2 1 0 162
3: 0 0 NA 3 1 0 163
WGHT WSTC BMI VA_LT VA_RT BP_SYS BP_DIA URN_PROT HGB FBS TOT_CHOL
<int> <int> <num> <num> <num> <int> <int> <int> <num> <int> <int>
1: 61 90 29.4 0.7 0.8 120 80 1 12.6 117 264
2: 51 63 19.4 0.8 1.0 120 80 1 13.8 96 169
3: 65 82 24.5 0.7 0.6 130 80 1 15.0 118 216
TG HDL LDL CRTN SGOT SGPT GGT GFR
<int> <int> <int> <num> <int> <int> <int> <int>
1: 128 60 179 0.9 25 20 25 59
2: 92 70 80 0.9 18 15 28 74
3: 132 55 134 0.8 26 30 30 79
#fsubset(dt, c(1:3, 13:16)) #rows
fsubset(dt, 1:3, 13:16) #(dt, row, col)
HGHT WGHT WSTC BMI
<int> <int> <int> <num>
1: 144 61 90 29.4
2: 162 51 63 19.4
3: 163 65 82 24.5
EXMD_BZ_YYYY RN_INDI HME_YYYYMM HGHT WGHT WSTC BMI
<int> <int> <int> <int> <int> <int> <num>
1: 2009 562083 200909 144 61 90 29.4
2: 2009 334536 200911 162 51 63 19.4
3: 2009 911867 200903 163 65 82 24.5
4: 2009 183321 200908 152 51 70 22.1
5: 2009 942671 200909 159 50 73 19.8
6: 2009 979358 200912 157 55 73 22.3
# fsubset(dt, EXMD_BZ_YYYY %in% 2009:2012 & BMI >= 25) %>% fsubset(c(1:3),c(1:3,13:16))
fsubset(dt, c(1:nrow(dt)),c(1:3, 13:16)) %>% fsubset(EXMD_BZ_YYYY %in% 2009:2012 & BMI >= 25) |> head() # same
EXMD_BZ_YYYY RN_INDI HME_YYYYMM HGHT WGHT WSTC BMI
<int> <int> <int> <int> <int> <int> <num>
1: 2009 562083 200909 144 61 90 29.4
2: 2009 318669 200904 155 66 78 27.5
3: 2009 668438 200904 160 71 94 27.7
4: 2009 560878 200903 144 58 93 28.0
5: 2009 375694 200906 151 70 94 30.7
6: 2009 446652 200909 158 64 80 25.6
roworder(dt, HGHT) %>% fsubset(EXMD_BZ_YYYY %in% 2009:2012 & BMI >= 25) %>%
fsubset(c(1:nrow(dt)),c(1:3,13:16)) |> head()
EXMD_BZ_YYYY RN_INDI HME_YYYYMM HGHT WGHT WSTC BMI
<int> <int> <int> <int> <int> <int> <num>
1: 2009 562083 200909 144 61 90 29.4
2: 2009 560878 200903 144 58 93 28.0
3: 2011 562083 201111 144 59 88 28.5
4: 2011 519824 201109 145 58 79 27.6
5: 2011 914987 201103 145 70 95 33.3
6: 2012 560878 201208 145 59 85 28.1
Collapse package
So far we have looked at row/column handling in collapse. Next come the tools in collapse that enable faster computation and data processing.
Fast Statistical Function
.FAST_STAT_FUN
# [1] "fmean" "fmedian" "fmode" "fsum" "fprod"
# [6] "fsd" "fvar" "fmin" "fmax" "fnth"
# [11] "ffirst" "flast" "fnobs" "fndistinct"
# not restricted to a particular data structure.
v1 <- c(1,2,3,4)
m1 <- matrix(1:50, nrow = 10, ncol = 5)
fmean(v1); fmean(m1); fmean(dt)
fmode(v1); fmode(m1); fmode(dt)
# fmean(m1): by columns
# collapse is faster than base R.
x <- rnorm(1e7)
microbenchmark(mean(x), fmean(x), fmean(x, nthreads = 4))
Unit: milliseconds
expr min lq mean median uq
mean(x) 23.818189 23.881988 23.920430 23.913170 23.947722
fmean(x) 15.334012 15.392979 15.428531 15.429847 15.449727
fmean(x, nthreads = 4) 4.025523 6.862578 7.695351 7.864132 8.486929
max neval cld
24.09041 100 a
15.58527 100 b
10.64787 100 c
microbenchmark(colMeans(dt), sapply(dt, mean), fmean(dt))
Unit: microseconds
expr min lq mean median uq max
colMeans(dt) 3154.642 3301.1200 3295.8407 3304.7220 3309.081 3593.496
sapply(dt, mean) 190.719 196.3270 208.5230 206.0745 216.491 270.384
fmean(dt) 52.694 53.7125 56.1483 55.6745 56.788 86.429
neval cld
100 a
100 b
100 c
- The benefit grows with data size. (GGDC10S: 5000 rows, 11 cols, ~10% missing values)
microbenchmark(base = sapply(GGDC10S[6:16], mean, na.rm = TRUE), fmean(GGDC10S[6:16]))
Unit: microseconds
expr min lq mean median uq max
base 409.223 419.1965 659.1579 433.2085 813.6440 4242.667
fmean(GGDC10S[6:16]) 94.504 95.5710 100.6444 102.1360 103.8795 135.503
neval cld
100 a
100 b
- As these examples show, collapse handles any data format and is distinguished by its speed.
Let's look at the syntax of these functions.
- Fast Statistical Functions
Syntax:
FUN(x, g = NULL, \[w = NULL,\] TRA = NULL, \[na.rm = TRUE\], use.g.names = TRUE, \[drop = TRUE,\] \[nthreads = 1,\] ...)
Argument Description
g grouping vectors / lists of vectors or ’GRP’ object
w a vector of (frequency) weights
TRA a quoted operation to transform x using the statistics
na.rm efficiently skips missing values in x
use.g.names generate names/row-names from g
drop drop dimensions if g = TRA = NULL
nthreads number of threads for OpenMP multithreading
Example: fmean
# Weighted Mean
w <- abs(rnorm(nrow(iris)))
all.equal(fmean(num_vars(iris), w = w), sapply(num_vars(iris), weighted.mean, w = w))
[1] TRUE
wNA <- na_insert(w, prop = 0.05)
sapply(num_vars(iris), weighted.mean, w = wNA) # weighted.mean() cannot handle missing weights.
Sepal.Length Sepal.Width Petal.Length Petal.Width
NA NA NA NA
fmean(num_vars(iris), w = wNA) # fmean() skips the missing weights (na.rm = TRUE)
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.766455 3.088102 3.558861 1.107960
# Grouped Mean
fmean(iris$Sepal.Length, g = iris$Species)
setosa versicolor virginica
5.006 5.936 6.588
fmean(num_vars(iris), g = iris$Species)
Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026
fmean(num_vars(iris), g = iris$Species, w = w) # weighted group means
Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa 5.033403 3.434271 1.464525 0.2377769
versicolor 5.941119 2.783020 4.263050 1.3291146
virginica 6.499996 2.965439 5.492447 2.0008032
# speed advantage.
microbenchmark(fmean = fmean(iris$Sepal.Length, iris$Species),
tapply = tapply(iris$Sepal.Length, iris$Species, mean))
Unit: microseconds
expr min lq mean median uq max neval cld
fmean 7.531 7.862 8.54733 8.239 8.4255 41.448 100 a
tapply 47.166 48.190 49.66377 48.636 49.1180 123.933 100 b
Consideration w/ missing data: handling missing values
#wlddev$GINI, g: country, function: mean, median, min, max, sum, prod
collap(wlddev, GINI ~ country, list(mean, median, min, max, sum, prod),
na.rm = TRUE, give.names = FALSE) |> head()
country mean median min max sum prod
1 Afghanistan NaN NA Inf -Inf 0.0 1.000000e+00
2 Albania 31.41111 31.7 27.0 34.6 282.7 2.902042e+13
3 Algeria 34.36667 35.3 27.6 40.2 103.1 3.916606e+04
4 American Samoa NaN NA Inf -Inf 0.0 1.000000e+00
5 Andorra NaN NA Inf -Inf 0.0 1.000000e+00
6 Angola 48.66667 51.3 42.7 52.0 146.0 1.139065e+05
# na.rm = TRUE is the default; groups where everything is NA simply return NA.
collap(wlddev, GINI ~ country, list(fmean, fmedian, fmin, fmax, fsum, fprod),
give.names = FALSE) |> head()
country fmean fmedian fmin fmax fsum fprod
1 Afghanistan NA NA NA NA NA NA
2 Albania 31.41111 31.7 27.0 34.6 282.7 2.902042e+13
3 Algeria 34.36667 35.3 27.6 40.2 103.1 3.916606e+04
4 American Samoa NA NA NA NA NA NA
5 Andorra NA NA NA NA NA NA
6 Angola 48.66667 51.3 42.7 52.0 146.0 1.139065e+05
microbenchmark(a = collap(wlddev, GINI ~ country, list(mean, median, min, max, sum, prod),
na.rm = TRUE, give.names = FALSE) |> head(),
b=collap(wlddev, GINI ~ country, list(fmean, fmedian, fmin, fmax, fsum, fprod),
give.names = FALSE) |> head())
Unit: microseconds
expr min lq mean median uq max neval cld
a 9854.603 9940.8865 10522.7090 10008.930 10297.669 14969.065 100 a
b 545.479 590.4145 611.5872 621.694 633.038 685.942 100 b
# The speed advantage shows once again.
TRA function
- With the TRA functionality, operations across many rows/columns are handled conveniently.
Syntax:
TRA(x, STATS, FUN = "-", g = NULL, set = FALSE, ...)
setTRA(x, STATS, FUN = "-", g = NULL, ...)
STATS = vector/matrix/list of statistics
0 "replace_NA" replace missing values in x
1 "replace_fill" replace data and missing values in x
2 "replace" replace data but preserve missing values in x
3 "-" subtract (i.e. center)
4 "-+" center on overall average statistic
5 "/" divide (i.e. scale)
6 "%" compute percentages (i.e. divide and multiply by 100)
7 "+" add
8 "*" multiply
9 "%%" modulus (i.e. remainder from division by STATS)
10 "-%%" subtract modulus (i.e. make data divisible by STATS)
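A few of the TRA codes in action on a small vector (a minimal sketch):

```r
library(collapse)

x <- c(2, 4, NA, 8)           # mean of the non-missing values is 14/3
fmean(x, TRA = "replace_NA")  # fill the NA with the mean
fmean(x, TRA = "-")           # center: subtract the mean from each value
fsum(x, TRA = "%")            # each value as a percentage of the total (14)
```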
dt2 <- as.data.table(iris)
attach(iris) # as in data.table, attach() lets us reference variable names directly.
# deviation from the group mean: g = Species
all_obj_equal(Sepal.Length - ave(Sepal.Length, g = Species),
fmean(Sepal.Length, g = Species, TRA= "-"),
TRA(Sepal.Length, fmean(Sepal.Length, g = Species), "-", g = Species))
[1] TRUE
microbenchmark(baseR= Sepal.Length - ave(Sepal.Length, g = Species),
fmean = fmean(Sepal.Length, g = Species, TRA= "-"),
TRA_fmean = TRA(Sepal.Length, fmean(Sepal.Length, g = Species), "-", g = Species));detach(iris)
Unit: microseconds
expr min lq mean median uq max neval cld
baseR 57.077 58.4975 60.77960 59.4215 60.3085 159.264 100 a
fmean 3.796 3.9900 4.18378 4.1510 4.2665 7.214 100 b
TRA_fmean 11.882 12.3505 13.25294 12.7595 13.3165 44.254 100 c
- Rather than calling TRA() directly, invoke TRA through the Fast Statistical Functions!
# example
num_vars(dt2) %<>% na_insert(prop = 0.05)
# replace NA values with the median.
num_vars(dt2) |> fmedian(iris$Species, TRA = "replace_NA", set = TRUE)
# num_vars(dt2) |> fmean(iris$Species, TRA = "replace_NA", set = TRUE) --> replace with the mean instead.
# A variety of operations can be combined in a single call.
mtcars |> ftransform(A = fsum(mpg, TRA = "%"),
B = mpg > fmedian(mpg, cyl, TRA = "replace_fill"),
C = fmedian(mpg, list(vs, am), wt, "-"),
D = fmean(mpg, vs,, 1L) > fmean(mpg, am,, 1L)) |> head(3)
mpg cyl disp hp drat wt qsec vs am gear carb A B
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 3.266449 TRUE
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 3.266449 TRUE
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 3.546430 FALSE
C D
Mazda RX4 1.3 FALSE
Mazda RX4 Wag 1.3 FALSE
Datsun 710 -7.6 TRUE
Grouping Object
- The GRP() function makes computing over groups easy.
Syntax: GRP(X, by = NULL, sort = TRUE, decreasing = FALSE, na.last = TRUE, return.groups = TRUE, return.order = sort, method = "auto", ...)
g <- GRP(iris, by = ~ Species)
collapse grouping object of length 150 with 3 ordered groups
Call: GRP.default(X = iris, by = ~Species), X is sorted
Distribution of group sizes:
Min. 1st Qu. Median Mean 3rd Qu. Max.
50 50 50 50 50 50
Groups with sizes:
setosa versicolor virginica
50 50 50
str(g)
Class 'GRP' hidden list of 9
$ N.groups : int 3
$ group.id : int [1:150] 1 1 1 1 1 1 1 1 1 1 ...
$ group.sizes : int [1:3] 50 50 50
$ groups :'data.frame': 3 obs. of 1 variable:
..$ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 2 3
$ group.vars : chr "Species"
$ ordered : Named logi [1:2] TRUE TRUE
..- attr(*, "names")= chr [1:2] "ordered" "sorted"
$ order : int [1:150] 1 2 3 4 5 6 7 8 9 10 ...
..- attr(*, "starts")= int [1:3] 1 51 101
..- attr(*, "maxgrpn")= int 50
..- attr(*, "sorted")= logi TRUE
$ group.starts: int [1:3] 1 51 101
$ call : language GRP.default(X = iris, by = ~Species)
fmean(num_vars(iris), g)
Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026
Factors in operation
Collapse is not restricted by type: factors can be used directly in computations, and qF() generates factors fast.
x <- na_insert(rnorm(1e7), prop = 0.01)
g <- sample.int(1e6, 1e7, TRUE)
# compare with GRP
system.time(gg <- GRP(g))
user system elapsed
0.619 0.040 0.659
system.time(f <- qF(g, na.exclude = FALSE))
user system elapsed
0.273 0.032 0.306
class(f)
[1] "factor" "na.included"
microbenchmark(fmean(x, g),
fmean(x, gg),
fmean(x, gg, na.rm = FALSE),
fmean(x, f))
## Unit: milliseconds
## expr min lq mean median
## fmean(x, g) 146.060983 150.493309 155.02585 152.197822
## fmean(x, gg) 25.354564 27.709625 29.48497 29.022157
## fmean(x, gg, na.rm = FALSE) 13.184534 13.783585 15.61769 14.128067
## fmean(x, f) 24.847271 27.503661 29.47271 29.248580
# qF gives a performance gain comparable to GRP.
Summary: FAST grouping and Ordering
A variety of functions are available:
GRP() Fast sorted or unsorted grouping of multivariate data, returns detailed object of class ’GRP’
qF()/qG() Fast generation of factors and quick-group (’qG’) objects from atomic vectors
finteraction() Fast interactions: returns factor or ’qG’ objects
fdroplevels() Efficiently remove unused factor levels
radixorder() Fast ordering and ordered grouping
group() Fast first-appearance-order grouping: returns ’qG’ object
gsplit() Split vector based on ’GRP’ object
greorder() Reorder the results
- Functions that also return ’qG’ objects:
groupid() Generalized run-length-type grouping
seqid() Grouping of integer sequences
timeid() Grouping of time sequences (based on GCD)
dapply() Apply a function to rows or columns of data.frame or matrix based objects.
BY() Apply a function to vectors or matrix/data frame columns by groups.
- Specialized Data Transformation Functions
fbetween() Fast averaging and (quasi-)centering.
fwithin()
fhdbetween() Higher-Dimensional averaging/centering and linear prediction/partialling out
fhdwithin() (powered by fixest’s algorithm for multiple factors).
fscale() (advanced) scaling and centering.
- Time / Panel Series Functions
fcumsum() Cumulative sums
flag() Lags and leads
fdiff() (Quasi-, Log-, Iterated-) differences
fgrowth() (Compounded-) growth rates
- Data manipulation functions
fselect(), fsubset(), fgroup_by(), [f/set]transform[v](),
fmutate(), fsummarise(), across(), roworder[v](),
colorder[v](), [f/set]rename(), [set]relabel()
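A sketch of how these data manipulation verbs chain together (the kg conversion on mtcars is just for illustration):

```r
library(collapse)

mtcars |>
  fsubset(mpg > 15, mpg, cyl, wt) |>  # rows + columns in one step
  fmutate(wt_kg = wt * 453.6) |>      # derive a new column
  fgroup_by(cyl) |>
  fsummarise(mean_mpg   = fmean(mpg),
             mean_wt_kg = fmean(wt_kg)) |>
  roworder(-mean_mpg)                 # sort descending
```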
Collapse is fast!
fdim(wlddev) ##faster dim for dt. col/row: 13176 13
# ODA/POP since 1990, grouped by region, income, OECD
microbenchmark(
dplyr = qDT(wlddev) |>
filter(year >= 1990) |>
mutate(ODA_POP = ODA / POP) |>
group_by(region, income, OECD) |>
summarise(across(PCGDP:POP, sum, na.rm = TRUE), .groups = "drop") |>
arrange(income, desc(PCGDP)),
data.table = qDT(wlddev)[, ODA_POP := ODA / POP][
year >= 1990, lapply(.SD, sum, na.rm = TRUE),
by = .(region, income, OECD), .SDcols = PCGDP:ODA_POP][
order(income, -PCGDP)],
collapse_base = qDT(wlddev) |>
fsubset(year >= 1990) |>
fmutate(ODA_POP = ODA / POP) |>
fgroup_by(region, income, OECD) |>
fsummarise(across(PCGDP:ODA_POP, sum, na.rm = TRUE)) |>
roworder(income, -PCGDP),
collapse_optimized = qDT(wlddev) |>
fsubset(year >= 1990, region, income, OECD, PCGDP:POP) |>
fmutate(ODA_POP = ODA / POP) |>
fgroup_by(1:3, sort = FALSE) |> fsum() |>
roworder(income, -PCGDP)
)
## Unit: microseconds
## expr min lq mean median uq max neval
## dplyr 71955.523 72291.9715 80009.2208 72453.1165 76902.671 393947.262 100
## data.table 5960.503 6310.7045 7116.6673 6721.3450 7046.837 18615.736 100
## collapse_base 859.505 948.2200 1041.1137 990.1375 1061.864 3148.804 100
## collapse_optimized 442.040 482.9705 542.6927 523.6950 574.921 1036.817 100
Collapse w/ Fast Statistical Function: various applications
# The three calls below give identical results.
# sum of mpg by cyl
mtcars %>% ftransform(mpg_sum = fsum(mpg, g = cyl, TRA = "replace_fill")) %>% invisible()
mtcars %>% fgroup_by(cyl) %>% ftransform(mpg_sum = fsum(mpg, GRP(.), TRA = "replace_fill")) %>% invisible()
mtcars %>% fgroup_by(cyl) %>% fmutate(mpg_sum = fsum(mpg)) %>% head(10)
mpg cyl disp hp drat wt qsec vs am gear carb mpg_sum
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 138.2
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 138.2
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 293.3
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 138.2
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 211.4
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 138.2
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 211.4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 293.3
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 293.3
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 138.2
- ad-hoc grouping, often fastest!
microbenchmark(a=mtcars %>% ftransform(mpg_sum = fsum(mpg, g = cyl, TRA = "replace_fill")),
b=mtcars %>% fgroup_by(cyl) %>% ftransform(mpg_sum = fsum(mpg, GRP(.), TRA = "replace_fill")),
c=mtcars %>% fgroup_by(cyl) %>% fmutate(mpg_sum = fsum(mpg)))
Unit: microseconds
expr min lq mean median uq max neval cld
a 27.266 29.7885 31.47125 30.4025 31.7165 107.002 100 a
b 64.819 66.7990 68.84531 67.8300 68.8585 138.077 100 b
c 78.526 80.3145 82.11237 81.4460 82.3050 126.137 100 c
- ftransform() ignores the preceding fgroup_by(); the two results below differ. (Only fmutate and fsummarise honor the prior grouping.)
mtcars %>% fgroup_by(cyl) %>% ftransform(mpg_sum = fsum(mpg, GRP(.), TRA = "replace_fill")) %>% head()
mpg cyl disp hp drat wt qsec vs am gear carb mpg_sum
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 138.2
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 138.2
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 293.3
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 138.2
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 211.4
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 138.2
mtcars %>% fgroup_by(cyl) %>% ftransform(mpg_sum = fsum(mpg, TRA = "replace_fill")) %>% head() # grouping ignored: overall sum
mpg cyl disp hp drat wt qsec vs am gear carb mpg_sum
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 642.9
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 642.9
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 642.9
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 642.9
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 642.9
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 642.9
- As noted above, using collapse's TRA argument is faster than base R's "/".
microbenchmark(
"/"= mtcars |> fgroup_by(cyl) |> fmutate(mpg_prop = mpg / fsum(mpg)) |> head(),
"TRA=/" = mtcars |> fgroup_by(cyl) |> fmutate(mpg_prop = fsum(mpg, TRA = "/")) |> head()
)
Unit: microseconds
expr min lq mean median uq max neval cld
/ 208.423 211.461 216.9684 212.9690 215.2815 456.075 100 a
TRA=/ 198.332 200.922 203.9442 202.6085 204.8170 239.689 100 b
- fsum computes per group, while base sum uses the entire column.
mtcars |> fgroup_by(cyl) |> fmutate(mpg_prop2 = fsum(mpg) / sum(mpg)) |> head() # group sum / overall sum
mpg cyl disp hp drat wt qsec vs am gear carb mpg_prop2
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0.2149634
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0.2149634
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 0.4562140
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 0.2149634
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 0.3288225
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 0.2149634
- free use of %>%
# The two calls below are identical.
mtcars %>% fgroup_by(cyl) %>% ftransform(fselect(., hp:qsec) %>% fsum(TRA = "/")) %>% invisible()
mtcars %>% fgroup_by(cyl) %>% fmutate(across(hp:qsec, fsum, TRA = "/")) %>% head()
mpg cyl disp hp drat wt qsec vs
Mazda RX4 21.0 6 160 0.12850467 0.15537849 0.12007333 0.13080102 0
Mazda RX4 Wag 21.0 6 160 0.12850467 0.15537849 0.13175985 0.13525111 0
Datsun 710 22.8 4 108 0.10231023 0.08597588 0.09227220 0.08840435 1
Hornet 4 Drive 21.4 6 258 0.12850467 0.12270916 0.14734189 0.15448188 1
Hornet Sportabout 18.7 8 360 0.05974735 0.06967485 0.06144064 0.07248414 0
Valiant 18.1 6 225 0.12266355 0.10996016 0.15857012 0.16068023 1
am gear carb
Mazda RX4 1 4 4
Mazda RX4 Wag 1 4 4
Datsun 710 1 4 1
Hornet 4 Drive 0 3 1
Hornet Sportabout 0 3 2
Valiant 0 3 1
- With set = TRUE the changes are written back to the original data.
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# ratio of each value in columns hp:qsec to the corresponding sum within its cyl group.
mtcars %>% fgroup_by(cyl) %>% fmutate(across(hp:qsec, fsum, TRA = "/", set = TRUE)) %>% invisible()
head(mtcars)
mpg cyl disp hp drat wt qsec vs
Mazda RX4 21.0 6 160 0.12850467 0.15537849 0.12007333 0.13080102 0
Mazda RX4 Wag 21.0 6 160 0.12850467 0.15537849 0.13175985 0.13525111 0
Datsun 710 22.8 4 108 0.10231023 0.08597588 0.09227220 0.08840435 1
Hornet 4 Drive 21.4 6 258 0.12850467 0.12270916 0.14734189 0.15448188 1
Hornet Sportabout 18.7 8 360 0.05974735 0.06967485 0.06144064 0.07248414 0
Valiant 18.1 6 225 0.12266355 0.10996016 0.15857012 0.16068023 1
am gear carb
Mazda RX4 1 4 4
Mazda RX4 Wag 1 4 4
Datsun 710 1 4 1
Hornet 4 Drive 0 3 1
Hornet Sportabout 0 3 2
Valiant 0 3 1
- With .apply = FALSE the function is applied to the whole variable subset within each group (not column by column).
# pairwise correlations among the variables hp:qsec within each cyl group
mtcars %>% fgroup_by(cyl) %>% fsummarise(across(hp:qsec, \(x) qDF(pwcor(x), "var"), .apply = FALSE)) %>% head()
cyl var hp drat wt qsec
1 4 hp 1.0000000 -0.4702200 0.1598761 -0.1783611
2 4 drat -0.4702200 1.0000000 -0.4788681 -0.2833656
3 4 wt 0.1598761 -0.4788681 1.0000000 0.6380214
4 4 qsec -0.1783611 -0.2833656 0.6380214 1.0000000
5 6 hp 1.0000000 0.2171636 -0.3062284 -0.6280148
6 6 drat 0.2171636 1.0000000 -0.3546583 -0.6231083
Rows/columns can be referenced by name, index, vector, or regular expression.
get_vars(x, vars, return = "names", regex = FALSE, ...)
get_vars(x, vars, regex = FALSE, ...) <- value
- The position can also be chosen.
add_vars(x, ..., pos = "end")
add_vars(x, pos = "end") <- value
- Columns can be selected by data type.
num_vars(x, return = "data"); cat_vars(x, return = "data"); char_vars(x, return = "data");
fact_vars(x, return = "data"); logi_vars(x, return = "data"); date_vars(x, return = "data")
- Replacement is also possible.
num_vars(x) <- value; cat_vars(x) <- value; char_vars(x) <- value;
fact_vars(x) <- value; logi_vars(x) <- value; date_vars(x) <- value
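For instance, on iris (a minimal sketch):

```r
library(collapse)

get_vars(iris, "^Sepal", regex = TRUE) |> head(2)  # columns matching a regex
get_vars(iris, 1:2) |> head(2)                     # the same columns by position
num_vars(iris) |> head(2)                          # all numeric columns
cat_vars(iris, return = "names")                   # categorical columns: "Species"
```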
Efficient programming
> quick data conversion
- qDF(), qDT(), qTBL(), qM(), mrtl(), mctl()
- anyv(x, value) / allv(x, value) # Faster than any/all(x == value)
- allNA(x) # Faster than all(is.na(x))
- whichv(x, value, invert = F) # Faster than which(x (!/=)= value)
- whichNA(x, invert = FALSE) # Faster than which((!)is.na(x))
- x %(!/=)=% value # Infix for whichv(v, value, TRUE/FALSE)
- setv(X, v, R, ...) # x\[x(!/=)=v\]\<-r / x\[v\]\<-r\[v\] (by reference)
- setop(X, op, V, rowwise = F) # Faster than X \<- X +/-/\*// V (by reference)
- X %(+,-,\*,/)=% V # Infix for setop,()
- na_rm(x) # Fast: if(anyNA(x)) x\[!is.na(x)\] else x,
- na_omit(X, cols = NULL, ...) # Faster na.omit for matrices and data frames
- vlengths(X, use.names=TRUE) # Faster version of lengths()
- frange(x, na.rm = TRUE) # Much faster base::range
- fdim(X) # Faster dim for data frames
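A quick tour of a few of these helpers (a minimal sketch):

```r
library(collapse)

x <- c(4, NA, 2, 4)
anyv(x, 4)      # faster any(x == 4)   -> TRUE
whichv(x, 4)    # faster which(x == 4) -> 1 4
allNA(x)        # all values missing?  -> FALSE
na_rm(x)        # drop missing values  -> 4 2 4
frange(x)       # fast range (na.rm = TRUE by default) -> 2 4
setv(x, 2, 0)   # replace 2 with 0 by reference
x               # 4 NA 0 4
```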
Collapse and data.table
Let's see how collapse can be applied within data.table.
DT <- qDT(wlddev) # as.data.table(wlddev)
DT %>% fgroup_by(country) %>% get_vars(9:13) %>% fmean() #fgroup_by == gby
country PCGDP LIFEEX GINI ODA POP
<char> <num> <num> <num> <num> <num>
1: Afghanistan 483.8351 49.19717 NA 1487548499 18362258.22
2: Albania 2819.2400 71.68027 31.41111 312928126 2708297.17
3: Algeria 3532.2714 63.56290 34.36667 612238500 25305290.68
4: American Samoa 10071.0659 NA NA NA 43115.10
5: Andorra 40083.0911 NA NA NA 51547.35
---
212: Virgin Islands (U.S.) 35629.7336 73.71292 NA NA 92238.53
213: West Bank and Gaza 2388.4348 71.60780 34.52500 1638581462 3312289.13
214: Yemen, Rep. 1069.6596 52.53707 35.46667 859950996 13741375.82
215: Zambia 1318.8627 51.09263 52.68889 734624330 8614972.38
216: Zimbabwe 1219.4360 54.53360 45.93333 397104997 9402160.33
DT[, lapply(.SD, mean, na.rm = TRUE), keyby = country, .SDcols = 9:13] %>% invisible()
collap(DT, ~ country, fmean, cols = 9:13) %>% invisible() #same
microbenchmark(collapse = DT %>% gby(country) %>% get_vars(9:13) %>% fmean,
data.table = DT[, lapply(.SD, mean, na.rm = TRUE), keyby = country, .SDcols = 9:13],
data.table2 = DT[, lapply(.SD, fmean, na.rm = TRUE), keyby = country, .SDcols = 9:13])
Unit: microseconds
expr min lq mean median uq max neval
collapse 339.330 369.4435 424.8557 397.9745 409.1075 3425.186 100
data.table 5010.470 5242.2960 5356.0409 5280.7165 5376.4760 8372.164 100
data.table2 5164.241 5322.2865 5557.2280 5391.0230 5541.6530 8763.263 100
cld
a
b
c
DT[, lapply(.SD, fmean, …)] turns out to be slower than DT[, lapply(.SD, mean, …)]. Inside data.table, mean is not base R's mean: it is replaced by gmean, a version optimized for data.table (GForce). fmean called via lapply, in contrast, does not run in its optimized mode and is therefore actually slower.
The call above is processed in essentially the same way as the one below: fmean is applied separately to every group and column, which is slow.
This can be alleviated to a fair degree as follows.
DT <- qDT(wlddev); g <- GRP(DT, "country")
#gv: abbreviation for get_vars()
microbenchmark(a = fmean(gv(DT, 9:13), DT$country),
b0= g <- GRP(DT, "country"),
b = add_vars(g[["groups"]], fmean(gv(DT, 9:13), g)),
dt_fmean = DT[, lapply(.SD, fmean, na.rm = TRUE), keyby = country, .SDcols = 9:13],
dt_gmean = DT[, lapply(.SD, mean, na.rm = TRUE), keyby = country, .SDcols = 9:13])
Unit: microseconds
expr min lq mean median uq max neval cld
a 358.473 375.3265 392.0909 392.3025 398.751 580.505 100 a
b0 76.707 92.1125 105.7192 110.7625 113.721 175.649 100 b
b 224.185 238.6440 256.6076 255.7260 265.781 606.967 100 ab
dt_fmean 5213.289 5319.2375 5620.3223 5371.5520 5547.785 11522.735 100 c
dt_gmean 5064.030 5246.1655 5389.5898 5315.4075 5475.593 8260.358 100 d
dplyr's data %>% group_by(…) %>% summarize(…) and data.table's [i, j, by] are the universal idioms for applying functions to groups of data. They accept arbitrary functions, and data.table in particular optimizes several built-ins (min, max, mean, etc.) internally via GForce.
collapse processes statistical and transformation functions on grouped data (fgroup_by, collap) in C++.
Every collapse function (BY being the exception) is optimized in a GForce-like way, but inside data.table the differing degree of optimization and the per-group lapply dispatch presumably cause the slowdown.
Can we then use fmean inside data.table at all?
DT[, fmean(.SD, country), .SDcols = 9:13]
PCGDP LIFEEX GINI ODA POP
<num> <num> <num> <num> <num>
1: 483.8351 49.19717 NA 1487548499 18362258.22
2: 2819.2400 71.68027 31.41111 312928126 2708297.17
3: 3532.2714 63.56290 34.36667 612238500 25305290.68
4: 10071.0659 NA NA NA 43115.10
5: 40083.0911 NA NA NA 51547.35
---
212: 35629.7336 73.71292 NA NA 92238.53
213: 2388.4348 71.60780 34.52500 1638581462 3312289.13
214: 1069.6596 52.53707 35.46667 859950996 13741375.82
215: 1318.8627 51.09263 52.68889 734624330 8614972.38
216: 1219.4360 54.53360 45.93333 397104997 9402160.33
DT[, fmean(gby(.SD, country)), .SDcols = c(1L, 9:13)] # include column 1 (country) to keep the group names
country PCGDP LIFEEX GINI ODA POP
<char> <num> <num> <num> <num> <num>
1: Afghanistan 483.8351 49.19717 NA 1487548499 18362258.22
2: Albania 2819.2400 71.68027 31.41111 312928126 2708297.17
3: Algeria 3532.2714 63.56290 34.36667 612238500 25305290.68
4: American Samoa 10071.0659 NA NA NA 43115.10
5: Andorra 40083.0911 NA NA NA 51547.35
---
212: Virgin Islands (U.S.) 35629.7336 73.71292 NA NA 92238.53
213: West Bank and Gaza 2388.4348 71.60780 34.52500 1638581462 3312289.13
214: Yemen, Rep. 1069.6596 52.53707 35.46667 859950996 13741375.82
215: Zambia 1318.8627 51.09263 52.68889 734624330 8614972.38
216: Zimbabwe 1219.4360 54.53360 45.93333 397104997 9402160.33
microbenchmark(collapse = DT %>% gby(country) %>% get_vars(9:13) %>% fmean,
data.table = DT[, lapply(.SD, mean, na.rm = TRUE), keyby = country, .SDcols = 9:13],
data.table_base = DT[, lapply(.SD, base::mean, na.rm = TRUE), keyby = country, .SDcols = 9:13],
hybrid_bad = DT[, lapply(.SD, fmean), keyby = country, .SDcols = 9:13],
hybrid_ok = DT[, fmean(gby(.SD, country)), .SDcols = c(1L, 9:13)])
Unit: microseconds
expr min lq mean median uq max
collapse 345.137 376.5125 419.4952 393.9470 399.4225 3603.978
data.table 5086.504 5240.5405 5384.2358 5319.8300 5484.4740 6284.893
data.table_base 2545.283 2597.0910 2695.3421 2625.8495 2648.6360 6084.044
hybrid_bad 5197.406 5331.3155 5539.1046 5382.9660 5607.2690 8885.335
hybrid_ok 837.602 885.5560 899.4496 902.0515 914.6245 1003.834
neval cld
100 a
100 b
100 c
100 d
100 e
- Mixing fmean and friends into data.table's lapply idiom is not advisable.
DT %>% gby(country) %>% get_vars(9:13) %>% fmean
fmean(gv(DT, 9:13), DT$country)
- For more efficient work, do the computation outside data.table as above.
# examples beyond fmean: fsum
# total ODA per country = the lines below are all equivalent.
DT[, sum_ODA := sum(ODA, na.rm = TRUE), by = country]
DT[, sum_ODA := fsum(ODA, country, TRA = "replace_fill")]
settfm(DT, sum_ODA = fsum(ODA, country, TRA = "replace_fill")) # settfm/tfm= settransform/ftransform
# settransform is more convenient than ':=' when modifying several columns.
settfm(DT, perc_c_ODA = fsum(ODA, country, TRA = "%"),
perc_y_ODA = fsum(ODA, year, TRA = "%"))
microbenchmark(
S1 = DT[, sum_ODA := sum(ODA, na.rm = TRUE), by = country],
S2 = DT[, sum_ODA := fsum(ODA, country, TRA = "replace_fill")],
S3 = settfm(DT, sum_ODA = fsum(ODA, country, TRA = "replace_fill"))
)
Unit: microseconds
expr min lq mean median uq max neval cld
S1 2088.236 2178.8360 2255.2440 2243.0780 2280.824 4270.312 100 a
S2 409.735 484.6600 528.7572 533.0865 577.559 665.195 100 b
S3 121.994 171.2135 203.7296 202.9935 229.109 290.783 100 c
As before, prefer doing the processing outside data.table.
collapse functions useful for data processing within data.table:
"fcumsum()" "fscale()" "fbetween()" "fwithin()" "fhdbetween()"
"fhdwithin()" "flag()" "fdiff()" "fgrowth()"
# Centering GDP
#DT[, demean_PCGDP := PCGDP - mean(PCGDP, na.rm = TRUE), by = country]
DT[, demean_PCGDP := fwithin(PCGDP, country)]
settfm(DT, demean_PCGDP = fwithin(PCGDP, country)) # prefer settfm().
# Lagging GDP
#DT[order(year), lag_PCGDP := shift(PCGDP, 1L), by = country]
DT[, lag_PCGDP := flag(PCGDP, 1L, country, year)]
# Computing a growth rate
#DT[order(year), growth_PCGDP := (PCGDP / shift(PCGDP, 1L) - 1) * 100, by = country]
DT[, growth_PCGDP := fgrowth(PCGDP, 1L, 1L, country, year)] # 1 lag, 1 iteration
# Several Growth rates
#DT[order(year), paste0("growth_", .c(PCGDP, LIFEEX, GINI, ODA)) := (.SD / shift(.SD, 1L) - 1) * 100, by = country, .SDcols = 9:13]
DT %<>% tfm(gv(., 9:13) %>% fgrowth(1L, 1L, country, year) %>% add_stub("growth_"))
settfmv(DT, 9:13, G, 1L, 1L, country, year, apply = FALSE)
result <- DT[sample(.N, 7)] |> fselect(9:ncol(DT)); print(result)
PCGDP LIFEEX GINI ODA POP sum_ODA perc_c_ODA
<num> <num> <num> <num> <num> <num> <num>
1: 7808.4047 71.04878 NA 65139999 42449038 26214490031 0.2484885
2: 35593.4255 68.80683 NA NA 56911 NA NA
3: 2171.3605 64.10800 NA 2025609985 2123180 73237229858 2.7658200
4: 47413.6225 70.84878 NA NA 56186 NA NA
5: 872.9171 55.17000 NA 350309998 6094259 18904160029 1.8530842
6: 3131.2099 65.24600 NA 61150002 792736 4340079982 1.4089602
7: 1814.4672 70.15500 NA 251289993 29774500 7412250079 3.3901985
perc_y_ODA demean_PCGDP lag_PCGDP growth_PCGDP growth_LIFEEX growth_GINI
<num> <num> <num> <num> <num> <num>
1: 0.1087501 -3117.1382 6.019063 6.019063 0.70872947 NA
2: NA 4307.9524 6.632324 6.632324 0.54279452 NA
3: 3.6079552 -990.5459 NA NA 0.95430065 NA
4: NA 16128.1494 4.547809 4.547809 -0.55800897 NA
5: 0.5839065 -21.0514 1.094015 1.094015 -0.05977936 NA
6: 0.1099883 120.4957 -3.230141 -3.230141 0.10893748 NA
7: 0.2788049 419.3212 5.806588 5.806588 0.35045058 NA
growth_ODA growth_POP G1.PCGDP G1.LIFEEX G1.GINI G1.ODA G1.POP
<num> <num> <num> <num> <num> <num> <num>
1: 1436.320823 0.9940010 6.019063 0.70872947 NA 1436.320823 0.9940010
2: NA 0.2572007 6.632324 0.54279452 NA NA 0.2572007
3: 12.181760 2.7719948 NA 0.95430065 NA 12.181760 2.7719948
4: NA 0.1283102 4.547809 -0.55800897 NA NA 0.1283102
5: 3.217532 3.1953119 1.094015 -0.05977936 NA 3.217532 3.1953119
6: 2.652345 1.0645269 -3.230141 0.10893748 NA 2.652345 1.0645269
7: 81.791218 1.4829887 5.806588 0.35045058 NA 81.791218 1.4829887
- ':=' is only weakly optimized inside data.table, so using collapse is faster in most cases.
microbenchmark(
W1 = DT[, demean_PCGDP := PCGDP - mean(PCGDP, na.rm = TRUE), by = country],
W2 = DT[, demean_PCGDP := fwithin(PCGDP, country)],
L1 = DT[order(year), lag_PCGDP := shift(PCGDP, 1L), by = country],
L2 = DT[, lag_PCGDP := flag(PCGDP, 1L, country, year)],
L3 = DT[, lag_PCGDP := shift(PCGDP, 1L), by = country], # Not ordered
L4 = DT[, lag_PCGDP := flag(PCGDP, 1L, country)] # Not ordered
)
Unit: microseconds
expr min lq mean median uq max neval cld
W1 1912.389 1990.3745 2156.8241 2023.363 2085.882 11093.551 100 a
W2 784.069 836.6985 873.2039 865.280 894.328 1336.105 100 b
L1 4025.457 4216.2925 4483.2931 4281.839 4414.296 16599.366 100 c
L2 1296.289 1338.7050 1467.9625 1373.009 1409.083 9663.483 100 d
L3 2604.725 2672.9305 2748.6873 2697.509 2746.567 3559.639 100 e
L4 451.273 476.2480 515.3331 507.103 535.107 861.385 100 f
# Time-series functions such as flag do not reorder the data first, which accounts for the clear performance difference.
m <- qM(mtcars) # data.frame -> matrix: qM
# matrix rows to list/data: mrtl
mrtl(m, names = TRUE, return = "data.table") %>% head(2) # convert: data.table
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant
<num> <num> <num> <num> <num> <num>
1: 21 21 22.8 21.4 18.7 18.1
2: 6 6 4.0 6.0 8.0 6.0
Duster 360 Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE Merc 450SL
<num> <num> <num> <num> <num> <num> <num>
1: 14.3 24.4 22.8 19.2 17.8 16.4 17.3
2: 8.0 4.0 4.0 6.0 6.0 8.0 8.0
Merc 450SLC Cadillac Fleetwood Lincoln Continental Chrysler Imperial
<num> <num> <num> <num>
1: 15.2 10.4 10.4 14.7
2: 8.0 8.0 8.0 8.0
Fiat 128 Honda Civic Toyota Corolla Toyota Corona Dodge Challenger
<num> <num> <num> <num> <num>
1: 32.4 30.4 33.9 21.5 15.5
2: 4.0 4.0 4.0 4.0 8.0
AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
<num> <num> <num> <num> <num> <num>
1: 15.2 13.3 19.2 27.3 26 30.4
2: 8.0 8.0 8.0 4.0 4 4.0
Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
<num> <num> <num> <num>
1: 15.8 19.7 15 21.4
2: 8.0 6.0 8 4.0
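For the column-wise counterpart, collapse also provides mctl() (matrix columns to list), which, as a data frame, reverses qM(); a minimal sketch assuming collapse is attached:

```r
library(collapse)

m <- qM(mtcars)                                  # data.frame -> matrix
# mctl: each matrix column becomes a list element / data frame column
d <- mctl(m, names = TRUE, return = "data.frame")
head(d, 2)
```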
- fast linear model: flm
wlddev %>% fselect(country, PCGDP, LIFEEX) %>% na_omit(cols = -1) %>%
fsubset(fnobs(PCGDP, country, "replace_fill") > 20L) %>% qDT %>%
.[, qDT(coeftest(lm(G(PCGDP) ~ G(LIFEEX))), "Coef"), keyby = country] %>% head
Key: <country>
country Coef Estimate Std. Error t value Pr(>|t|)
<char> <char> <num> <num> <num> <num>
1: Albania (Intercept) -3.6146411 2.371885 -1.5239527 0.136023086
2: Albania G(LIFEEX) 22.1596308 7.288971 3.0401591 0.004325856
3: Algeria (Intercept) 0.5973329 1.740619 0.3431726 0.732731107
4: Algeria G(LIFEEX) 0.8412547 1.689221 0.4980134 0.620390703
5: Angola (Intercept) -3.3793976 1.540330 -2.1939445 0.034597175
6: Angola G(LIFEEX) 4.2362895 1.402380 3.0207852 0.004553260
# If you only need the intercepts and slopes quickly, use flm w/ mrtl: (no standard errors)
wlddev %>% fselect(country, PCGDP, LIFEEX) %>% na_omit(cols = -1L) %>%
fsubset(fnobs(PCGDP, country, "replace_fill") > 20L) %>% qDT %>%
.[, mrtl(flm(fgrowth(PCGDP)[-1L],
cbind(Intercept = 1, LIFEEX = fgrowth(LIFEEX)[-1L])), TRUE), keyby = country] %>% head
Key: <country>
country Intercept LIFEEX
<char> <num> <num>
1: Albania -3.61464113 22.1596308
2: Algeria 0.59733291 0.8412547
3: Angola -3.37939760 4.2362895
4: Antigua and Barbuda -3.11880717 18.8700870
5: Argentina 1.14613567 -0.2896305
6: Armenia 0.08178344 11.5523992
microbenchmark(
A= wlddev %>% fselect(country, PCGDP, LIFEEX) %>% na_omit(cols = -1) %>%
fsubset(fnobs(PCGDP, country, "replace_fill") > 20L) %>% qDT %>%
.[, qDT(coeftest(lm(G(PCGDP) ~ G(LIFEEX))), "Coef"), keyby = country] ,
B= wlddev %>% fselect(country, PCGDP, LIFEEX) %>% na_omit(cols = -1L) %>%
fsubset(fnobs(PCGDP, country, "replace_fill") > 20L) %>% qDT %>%
.[, mrtl(flm(fgrowth(PCGDP)[-1L],
cbind(Intercept = 1, LIFEEX = fgrowth(LIFEEX)[-1L])), TRUE), keyby = country]
)
Unit: milliseconds
expr min lq mean median uq max neval cld
A 167.429776 168.55069 172.656171 169.121475 172.448620 336.99646 100 a
B 7.141076 7.40031 7.818226 7.546154 7.698282 12.44983 100 b
# Replacing coeftest + lm + G with collapse equivalents such as flm + fgrowth yields a large speedup.
- collapse w/ list; rsplit; rapply2d; get_elem; unlist2d
rapply2d(): applies a function recursively to the data.tables/frames nested in a list.
get_elem(): extracts the desired elements; the result can then be assembled into a data.table with unlist2d().
lm_summary_list %>%
get_elem("coefficients") %>%
unlist2d(idcols = .c(Region, Income), row.names = "Coef", DT = TRUE) %>% head
Region Income Coef Estimate
<char> <char> <char> <num>
1: East Asia & Pacific High income (Intercept) 0.5313479
2: East Asia & Pacific High income G(LIFEEX) 2.4935584
3: East Asia & Pacific High income B(G(LIFEEX), country) 3.8297123
4: East Asia & Pacific Lower middle income (Intercept) 1.3476602
5: East Asia & Pacific Lower middle income G(LIFEEX) 0.5238856
6: East Asia & Pacific Lower middle income B(G(LIFEEX), country) 0.9494439
Std. Error t value Pr(>|t|)
<num> <num> <num>
1: 0.7058550 0.7527720 0.451991327
2: 0.7586943 3.2866443 0.001095466
3: 1.6916770 2.2638554 0.024071386
4: 0.7008556 1.9228785 0.055015131
5: 0.7574904 0.6916069 0.489478164
6: 1.2031228 0.7891496 0.430367103
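For context, an object like the `lm_summary_list` used above can be built with rsplit() and rapply2d(). The sketch below is an assumption inferred from the coefficient names in the output (the grouping columns `region`/`income` from `wlddev` and the model formula are not shown in this section):

```r
library(collapse)

# Split wlddev into a nested list of data frames by region and income,
# then fit the model suggested by the output above in each cell
# and keep the model summaries (a hedged sketch, not the author's exact code).
lm_summary_list <- wlddev %>%
  fselect(region, income, country, year, PCGDP, LIFEEX) %>%
  rsplit(~ region + income) %>%
  rapply2d(function(d)
    summary(lm(G(PCGDP) ~ G(LIFEEX) + B(G(LIFEEX), country), data = d)))
```

get_elem("coefficients") then pulls the coefficient matrices out of this nested list, and unlist2d() row-binds them into a single data.table with the list names as id columns.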
Summary
collapse is fast and economical in terms of data and memory use.
It works regardless of data format: vectors, matrices, data.tables, and more.
It integrates with existing frameworks (dplyr, tidyverse, data.table, etc.).
When mixed with data.table, performance degrades if collapse is used inside dt[] grouped operations; this stems from differences in how the two packages process data internally.
When handling data.table objects, use syntax like the following to get the full benefit.
Not recommended:
> DT[order(year), paste0("growth_", .c(PCGDP, LIFEEX, GINI, ODA)) := (.SD / shift(.SD, 1L) - 1) * 100,
by = country, .SDcols = 9:13]
Recommended:
>> DT %<>% tfm(gv(., 9:13) %>% fgrowth(1L, 1L, country, year) %>% add_stub("growth_"))
>> settfmv(DT, 9:13, G, 1L, 1L, country, year, apply = FALSE)
Reuse
Citation
@online{lee2024,
author = {LEE, Hojun},
title = {Collapse {패키지} {소개} V2},
date = {2024-10-29},
url = {https://blog.zarathu.com/posts/2024-10-28-Collapse/},
langid = {en}
}