data mining, random forest + boosting in R

창이 2021. 12. 15.

< Random Forest >

- As in bagging, a random forest builds many decision trees on bootstrapped training samples.

- Each time a split is considered in a tree, a random sample of m predictors is drawn from the full set of p predictors, and the split may use only one of those m predictors.

- Typically m ≈ √p.
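To make the per-split sampling concrete, here is a minimal base-R sketch. The helper name candidate_predictors and the value p = 13 (the number of predictors in the MASS version of Boston) are assumptions for illustration only; this is not how randomForest is implemented internally.

```r
# Hypothetical helper: draw the m candidate predictors considered at one split.
candidate_predictors <- function(p, m = floor(sqrt(p))) {
  sample.int(p, size = m)   # m distinct predictor indices out of 1..p
}

set.seed(1)
p <- 13                     # assumed number of predictors
m <- floor(sqrt(p))         # m is roughly sqrt(p)

# A fresh random subset is drawn every time a split is considered.
split_1 <- candidate_predictors(p)
split_2 <- candidate_predictors(p)

print(m)
print(split_1)
print(split_2)
```

Note that m ≈ √p is a rule of thumb. In randomForest() the subset size is set with the mtry argument, and setting mtry = p recovers bagging.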

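- A random forest reduces variance further than bagging does, and therefore lowers the test error. Why?

• Suppose there is one very strong predictor and several moderately strong predictors. In bagging, most trees will then use the very strong predictor in the top split.

• As a result the bagged trees all look alike and their predictions are highly correlated. Averaging highly correlated predictions reduces variance much less than averaging uncorrelated ones, so the variance stays high and the test error with it.

• A random forest forces the trees to differ: on average (p − m)/p of the splits do not even consider the strong predictor, so the other predictors get a chance. This decorrelates the trees and makes their average less variable.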
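The correlation argument can be made precise with a standard identity, which is not stated in the original notes. As an idealization, assume the B tree predictions at a point x each have variance σ² and every pair has correlation ρ (the symbols σ² and ρ are introduced here for illustration). The variance of their average is then:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}^{b}(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

Growing more trees shrinks only the second term. The first term, ρσ², remains no matter how large B is, so lowering the correlation ρ between trees is what lets a random forest improve on bagging.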

< Boosting >

- Bagging uses the bootstrap to create many training sets, fits a separate decision tree to each, and combines all the trees into a single prediction model.

- Boosting does not use the bootstrap. Instead, the trees are grown sequentially, each one fitted to a modified version of the original data.

- Boosting for regression trees

1. Set $\hat{f}(x) = 0$ and $r_i = y_i$ for all $i$ in the training set

2. For $b = 1, 2, \ldots, B$, repeat:

1) Fit a tree $\hat{f}^b$ with $d$ splits ($d + 1$ terminal nodes) to the training data $(X, r)$

2) Update $\hat{f}$ by adding in a shrunken version of the new tree: $\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)$

3) Update the residuals: $r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)$

3. Output the boosted model: $\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x)$
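The steps above can be sketched in base R. This is a minimal illustration under simplifying assumptions, not the gbm implementation: it uses a single numeric predictor, d = 1 (one-split trees, or stumps), and simulated data. All function names (fit_stump, predict_stump, boost_stumps, predict_boost) are hypothetical.

```r
# Fit a one-split regression tree (stump) of target r on a single predictor x.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between neighbours
  rss <- sapply(cuts, function(cut) {
    left <- r[x < cut]
    right <- r[x >= cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  best <- cuts[which.min(rss)]
  list(cut = best, left = mean(r[x < best]), right = mean(r[x >= best]))
}

predict_stump <- function(stump, x) {
  ifelse(x < stump$cut, stump$left, stump$right)
}

boost_stumps <- function(x, y, B = 200, lambda = 0.1) {
  r <- y                                  # step 1: f_hat = 0, residuals = y
  stumps <- vector("list", B)
  for (b in seq_len(B)) {
    stumps[[b]] <- fit_stump(x, r)        # step 2.1: fit a tree to (x, r)
    # step 2.2 is implicit: f_hat is the lambda-weighted sum of stored stumps
    r <- r - lambda * predict_stump(stumps[[b]], x)   # step 2.3
  }
  list(stumps = stumps, lambda = lambda)
}

# Step 3: the boosted model is the sum of the shrunken trees.
predict_boost <- function(model, x) {
  f_hat <- numeric(length(x))
  for (stump in model$stumps) {
    f_hat <- f_hat + model$lambda * predict_stump(stump, x)
  }
  f_hat
}

set.seed(1)
x <- runif(200, 0, 2 * pi)
y <- sin(x) + rnorm(200, sd = 0.2)

model <- boost_stumps(x, y, B = 200, lambda = 0.1)
mse_boost <- mean((y - predict_boost(model, x))^2)   # training MSE
mse_null <- mean((y - mean(y))^2)                    # constant-fit baseline
print(c(boosted = mse_boost, constant = mse_null))
```

Each round fits only what the previous rounds left unexplained, and λ keeps any single tree from contributing too much. The MSE printed here is a training error, so it says nothing about overfitting; that requires a held-out set or cross-validation.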

Three tuning parameters in Boosting

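- B (number of trees): unlike bagging and random forests, boosting can overfit if B is too large, so B is chosen by cross-validation (CV)

- d (number of splits in each tree): controls the complexity of the boosted ensemble (typically d = 1). d is the interaction depth, i.e. the interaction.depth argument of gbm()

- λ (shrinkage parameter): controls the rate at which boosting learns (typically λ = 0.01 or 0.001). A very small λ can require a very large B to fit well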
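As a rough sketch of choosing B by CV, gbm can run the cross-validation itself through its cv.folds argument, and gbm.perf() then reports the number of trees with the lowest CV error. The simulated data, the variable names (sim, fit, best_B) and the parameter values below are assumptions for illustration, not settings from the original notes.

```r
library(gbm)

set.seed(1)
n <- 300
sim <- data.frame(x1 = runif(n), x2 = runif(n))
sim$y <- sin(2 * pi * sim$x1) + sim$x2 + rnorm(n, sd = 0.3)

# Fit up to 1000 stumps (d = 1) with a small shrinkage and 5-fold CV.
fit <- gbm(y ~ x1 + x2, data = sim, distribution = "gaussian",
           n.trees = 1000, interaction.depth = 1, shrinkage = 0.01,
           cv.folds = 5)

# Number of trees that minimizes the cross-validated error.
best_B <- gbm.perf(fit, method = "cv", plot.it = FALSE)
print(best_B)
```

If best_B comes out equal to n.trees, the CV error was probably still falling, and it may be worth refitting with a larger n.trees or a larger shrinkage.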

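# (7) random forest
library(MASS)          # Boston data (assumed to come from MASS, as in the ISLR labs)
library(randomForest)
# Assumes train, test and y were created in the earlier steps of this series,
# with y holding the observed medv of the test rows, e.g. y <- Boston$medv[test]
set.seed(1)
rf.boston <- randomForest(medv ~ ., data = Boston, subset = train, mtry = 6, importance = TRUE)
yhat.rf <- predict(rf.boston, newdata = Boston[test, ])
mean((yhat.rf - y)^2) # test MSE
plot(yhat.rf, y)      # predicted vs. observed medv
abline(0, 1, col = "red")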
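The forest above is fitted with importance = TRUE, but the importance measures are never inspected. A self-contained sketch of how they could be viewed follows. The 50/50 split and the names train_idx, rf_fit and imp are assumptions made here so the block runs on its own; they are not the split used in this series.

```r
library(MASS)          # Boston data
library(randomForest)

set.seed(1)
train_idx <- sample(nrow(Boston), nrow(Boston) / 2)   # assumed 50/50 split

rf_fit <- randomForest(medv ~ ., data = Boston, subset = train_idx,
                       mtry = 6, importance = TRUE)

# For regression forests, importance() returns two columns:
# %IncMSE (permutation-based) and IncNodePurity (reduction in node RSS).
imp <- importance(rf_fit)
print(imp[order(imp[, "%IncMSE"], decreasing = TRUE), ])

varImpPlot(rf_fit)     # dot charts of both measures
```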

# (8) Boosting
# install.packages("gbm")  # run once if gbm is not installed
library(gbm)
set.seed(1)
boost.boston <- gbm(medv ~ ., data = Boston[train, ], distribution = "gaussian", n.trees = 5000, interaction.depth = 4, shrinkage = 0.2)
summary(boost.boston) # relative influence of each predictor
yhat.boost <- predict(boost.boston, newdata = Boston[test, ], n.trees = 5000)
mean((yhat.boost - y)^2) # test MSE

License: Attribution-NonCommercial-NoDerivatives
