Data Mining, Classification tree + Regression tree + Bagging

창이 2021. 12. 14.

Bagging

- The goal of bagging is to reduce variance.

- Generate B different bootstrapped training data sets by random sampling with replacement from the data set Z.

- Compute $\hat{f}^{*}_{1}(x), \hat{f}^{*}_{2}(x), \dots, \hat{f}^{*}_{B}(x)$ by training the method on each of the B bootstrapped training sets.

- Finally, average all the predictions.
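
The averaging step can be written out explicitly. The first expression below is the bagged prediction in the notation used above. The second is the standard variance identity for an average of independent quantities, included only to motivate why averaging helps. The symbols $W_1, \dots, W_B$ and $\sigma^2$ are introduced here for illustration, and bootstrapped fits are correlated in practice, so the actual reduction is smaller than the identity suggests.

```latex
\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*}_{b}(x),
\qquad
\operatorname{Var}\!\left( \frac{1}{B} \sum_{b=1}^{B} W_b \right) = \frac{\sigma^{2}}{B}
\quad \text{for independent } W_1, \dots, W_B \text{ with common variance } \sigma^{2}.
```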

< Bagging for regression trees >

- To apply bagging to regression trees, build B regression trees from B bootstrapped training sets and average their predictions.

- Each tree, when grown deep without pruning, tends to have low bias but high variance.

- Averaging the B trees through bagging reduces the variance.
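
The procedure above can be sketched by hand before reaching for a dedicated package. This is a minimal sketch, not the post's own method: it assumes the `tree` and `MASS` packages are installed, uses B = 100 as an arbitrary choice, and fits each tree with the default `tree()` settings rather than growing it especially deep. The names `preds`, `yhat.bag`, and `bag.mse` are illustrative.

```r
library(MASS)  # Boston data
library(tree)  # regression trees

set.seed(1)
n <- nrow(Boston)
train <- sample(n, n / 2)
test <- setdiff(seq_len(n), train)

B <- 100  # number of bootstrapped trees (assumed value)
preds <- matrix(NA_real_, nrow = length(test), ncol = B)

for (b in seq_len(B)) {
  # bootstrap sample: draw training rows with replacement
  idx <- sample(train, length(train), replace = TRUE)
  fit <- tree(medv ~ ., data = Boston[idx, ])
  preds[, b] <- predict(fit, newdata = Boston[test, ])
}

yhat.bag <- rowMeans(preds)  # average the B predictions per test observation
bag.mse <- mean((yhat.bag - Boston$medv[test])^2)
bag.mse
```

Comparing `bag.mse` with the test MSE of a single tree fitted on `Boston[train, ]` shows how much the averaging helps on this split.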

< Bagging for Classification trees >

- As with bagging for regression trees, build B classification trees from B bootstrapped training sets.

- Take a majority vote over the B predicted classes: the most frequently predicted class becomes the final prediction.
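
The majority vote can be illustrated in base R. This is a sketch on made-up predictions: the `votes` matrix and the helper `majority_vote` are hypothetical, with one row per observation and one column per tree (B = 5 here). With two classes, an odd B avoids ties. If a tie does occur, `which.max` returns the first class in alphabetical order.

```r
# Hypothetical class predictions: rows = observations, columns = B = 5 trees
votes <- matrix(c("yes", "no",  "yes", "yes", "no",
                  "no",  "no",  "yes", "no",  "no",
                  "yes", "yes", "yes", "no",  "yes"),
                nrow = 3, byrow = TRUE)

# Return the most frequent class among one observation's B predictions
majority_vote <- function(v) {
  counts <- table(v)
  names(counts)[which.max(counts)]
}

final <- apply(votes, 1, majority_vote)
final  # "yes" "no" "yes"
```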

< Out-of-Bag (OOB) Error Estimation >

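- Cross-validation can be used to estimate the test MSE, but the OOB observations give an OOB test MSE (regression) or an OOB test error rate (classification) more easily.

- Each bagged tree uses about 2/3 of the observations on average; the remaining 1/3, which were not used to train that tree, are called its OOB observations.

- The i-th observation is predicted using only the trees for which it was OOB (about B/3 of them), and these predictions are averaged (regression) or combined by majority vote (classification).

- When B is sufficiently large, the OOB error is virtually equivalent to the LOOCV error.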
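
The "about 2/3" figure can be checked numerically. The probability that a given observation appears in a bootstrap sample of size n is 1 - (1 - 1/n)^n, which approaches 1 - 1/e (roughly 0.63) as n grows. The sketch below uses base R only and assumes n = 400 and 500 bootstrap draws; the names `in_bag_frac` and `theory` are illustrative.

```r
set.seed(1)
n <- 400  # number of observations (assumed)
B <- 500  # number of bootstrap samples (assumed)

# Fraction of distinct observations that land in each bootstrap sample
in_bag_frac <- replicate(B, length(unique(sample(n, n, replace = TRUE))) / n)

theory <- 1 - (1 - 1 / n)^n  # exact in-bag probability for one observation
mean(in_bag_frac)            # simulated in-bag fraction, close to theory
theory
1 - mean(in_bag_frac)        # simulated OOB fraction, roughly 1/3
```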

< Variable Importance Measures >

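- Compared with a single tree, bagging improves prediction accuracy but makes the model harder to interpret.

- The importance of each predictor can be measured by the decrease in RSS (regression) or in the Gini index (classification) due to splits on that predictor, averaged over the B trees.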
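
To make the "decrease due to a split" concrete, here is a small base R sketch for the Gini index on toy data. The helpers `gini` and `gini_decrease` and the data are hypothetical. They show only the idea for a single split; the importance values reported by a package such as randomForest are computed internally and may be weighted differently.

```r
# Gini index of a node: sum over classes of p_k * (1 - p_k)
gini <- function(y) {
  p <- table(y) / length(y)
  sum(p * (1 - p))
}

# Decrease in Gini when a node is split into left/right children,
# each child weighted by its share of the observations
gini_decrease <- function(y, go_left) {
  n <- length(y)
  left <- y[go_left]
  right <- y[!go_left]
  gini(y) - (length(left) / n) * gini(left) - (length(right) / n) * gini(right)
}

# Toy data (hypothetical): x1 separates the classes perfectly, x2 does not
y  <- c("yes", "yes", "yes", "no", "no", "no")
x1 <- c(1, 2, 3, 7, 8, 9)
x2 <- c(1, 7, 2, 8, 3, 9)

dec_x1 <- gini_decrease(y, x1 < 5)  # 0.5: both children are pure
dec_x2 <- gini_decrease(y, x2 < 5)  # much smaller: children stay mixed
c(x1 = dec_x1, x2 = dec_x2)
```

A predictor whose splits produce larger decreases, averaged over all trees, is ranked as more important.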

-------Code

#classification trees using training data
rm(list = ls())
library(ISLR)
library(tree)
attach(Carseats)
str(Carseats)
quantile(Sales)
hight <- as.factor(ifelse(Sales <= 8, "no", "yes"))
table(hight)
Carseats <- data.frame(Carseats,hight)
head(Carseats)
tree.car <- tree(hight ~. -Sales, Carseats) # fit classification tree
summary(tree.car) # 27 terminal nodes, training error rate 9%

plot(tree.car)
text(tree.car, pretty = 0)
tree.car # tree built by recursive binary splitting

#classification tree using test error rate  

set.seed(3)
train <-  sort(sample(1:400, 200))
length(unique(train))

test <- setdiff(1:400, train)
high.test <- hight[test]
carseat.test <- Carseats[test,]
length(high.test)

tree.carseat <-  tree(hight ~. -Sales, Carseats, subset = train) #training data
plot(tree.carseat)
text(tree.carseat, pretty = 0)
summary(tree.carseat) 

tree.pred <- predict(tree.carseat, newdata = carseat.test, type = "class") # "no" = Sales is not high, "yes" = Sales is high
table(tree.pred, high.test)
mean(tree.pred == high.test) # test accuracy; the author's run gave (88+48)/200 = 0.68

# (3) CV + pruning for classification tree
set.seed(6)
cv.carseat <- cv.tree(tree.carseat, FUN = prune.misclass)
class(cv.carseat)
names(cv.carseat)
cv.carseat # as k grows, pruning is heavier: fewer terminal nodes (smaller size), but the error generally increases
prune.carseat <- prune.misclass(tree.carseat, best = 12)
plot(prune.carseat)
text(prune.carseat, pretty = 0)

tree.pred <- predict(prune.carseat, newdata = carseat.test, type = "class") # "no" = Sales is not high, "yes" = Sales is high
table(tree.pred, high.test)
mean(tree.pred == high.test) # slightly better; the author's run gave (87+54)/200 = 0.705

#regression tree using training data
library(MASS)
set.seed(1)
head(Boston)
dim(Boston)
train <- sort(sample(1:nrow(Boston), nrow(Boston)/2))
test <- setdiff(1:nrow(Boston), train)
tree.boston <- tree(medv ~., Boston, subset = train)
summary(tree.boston)
plot(tree.boston)
text(tree.boston, pretty = 0, cex = 0.8)

# (5) CV + pruning for regression tree
cv.boston <- cv.tree(tree.boston)
cv.boston # minimum at 7, hence best = 7 below
prune.boston <- prune.tree(tree.boston, best = 7)
plot(prune.boston)
text(prune.boston, pretty = 0, cex = 0.8)
yhat <- predict(tree.boston, Boston[test,]) # predictions from the unpruned tree; use prune.boston here to evaluate the pruned one
y <- Boston[test,]$medv
length(yhat)
length(y)
plot(yhat, y)
abline(0,1)
test.MSE <- mean((yhat - y)^2)
test.MSE # author's run: 35.28688

# (6) bagging
# install.packages("randomForest") # run once if the package is not installed
library(randomForest)
library(MASS)
set.seed(1)
train <- sort(sample(1:nrow(Boston), nrow(Boston)/2))
test <- setdiff(1:nrow(Boston), train)
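bag.boston <- randomForest(medv ~ ., data = Boston, subset = train, mtry = 13, importance = TRUE, ntree = 500) # mtry = 13 uses all 13 predictors (every column except the response), which makes this bagging; importance = TRUE stores the variable importance
bag.boston # predictions are averages over the trees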
y <- Boston[test,]$medv
yhat.bag <- predict(bag.boston, newdata = Boston[test,])
plot(yhat.bag, y)
abline(0, 1, col = "red")
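mean((yhat.bag - y)^2) # test MSE: compare the bagged predictions with the observed y, not with the single-tree predictions yhat
importance(bag.boston)
varImpPlot(bag.boston)
bag.boston

MSE = 5.093239 (the author's output from the original expression mean((yhat.bag - yhat)^2), which is the mean squared difference between the bagged and single-tree predictions, not the test MSE against y)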

License: Attribution, NonCommercial, NoDerivatives
