内存受限的大数据预测模型
junjun
2016年2月15日
一、项目简介:回归树用于分类预测
1、项目集数据介绍
使用randomForest包和party包来创建随机森林的区别:randomForest包无法处理包含缺失值或者拥有超过32个等级水平的分类变量。
本例子是在内存受限的情况下简历一个预测模型。由于训练集太大而不能直接通过R构建决策树,所以需要先从训练集中通过随机抽样的方式抽取多个子集,并分别对每一个子集构建决策树,只选取决策树中存在的所有变量,以便缩减训练集的规模。在评分时,得分的集合同样被划分为多个子集,以便在内存受限的条件下成功运行。
数据简介 KDD Cup 1998年竞赛的目标是估计一个直邮的回复量,以便获得最多的捐款。数据集的格式是以逗号作为分隔符,其中学习数据集”cup98lrn.txt”包含了95412条记录,481个字段,验证数据集“cup98val.txt”包含了96367条记录,479个字段。每条记录都包含一个CONTROLN字段,该字段是记录的唯一标识符;有两个目标变量TARGET_B和TARGET_D,TARGET_B是一个二进制变量,表示当一条记录中的TARGET_D变量中有捐款时,该条记录是否对邮件做了回复。学习数据集和验证数据集的数据格式相同,但是在验证数据集中没有包含TARGET_B和TARGET_D这两个变量。
2、研究方法
本例的数据分为两类:目标客户和非目标客户,分别为1和0,与客户的风险模型相似。 本例仍然使用决策树技术,因为对于商人和管理者来说,决策树更易于理解,规则也更简单。与SVM或神经网络相比,决策树应用到业务上更容易被接受和执行。决策树还支持分类变量和数值变量的混合数据类型,同时还可以处理缺失值。特别地,party包中提供了函数ctree()来构建决策树。 在大数据上训练模型需要花费很长时间,特别是对于分类变量含有多个水平值的情况。一种方法是使用一个小样本来训练模型。这里我们使用另一种方法:它能够使用尽可能多的数据进行训练。首先,从训练数据中抽取20个随机样本集,并分别对每一个样本集创建一棵决策树,每一棵树中含大约20-30个变量,其中有多棵决策树包含了相同的变量。然后,收集包含在决策树中的所有变量,大约60个。接着使用原始数据中的这60个变量的数据进行训练。这样的方法可以将所有的训练实例都用于最后模型的训练,而不仅仅是抽样数据的20棵决策树中的变量。
二、项目过程
1、加载数据并查看
#1)加载数据
cup98 <- read.csv("F:\\R\\Rworkspace\\cup98lrn/cup98lrn.txt")
dim(cup98)
## [1] 95412 481
n.missing <- rowSums(is.na(cup98))
sum(n.missing > 0) #计算存在NA值的行数
## [1] 95412
#2)选择变量
varSet <- c(
#demographics
"ODATEDW", "OSOURCE", "STATE", "ZIP", "PVASTATE", "DOB", "RECINHSE", "MDMAUD",
"DOMAIN", "CLUSTER", "AGE", "HOMEOWNR", "CHILD03", "CHILD07", "CHILD12", "CHILD18",
"NUMCHLD", "INCOME", "GENDER", "WEALTH1", "HIT",
#donor interests
"COLLECT1", "VETERANS", "BIBLE", "CATLG", "HOMEE", "PETS", "CDPLAY", "STEREO",
"PCOWNERS", "PHOTO", "CRAFTS", "FISHER", "GARDENIN", "BOATS", "WALKER", "KIDSTUFF",
"CARDS", "PLATES", "PEPSTRFL",
#summary variables of promotion history
"CARDPROM", "MAXADATE", "NUMPROM", "CARDPM12", "NUMPRM12",
#summary variables of giving history
"RAMNTALL", "NGIFTALL", "CARDGIFT", "MINRAMNT", "MAXRAMNT", "LASTGIFT", "LASTDATE",
"FISTDATE", "TIMELAG", "AVGGIFT",
#ID & targets
"CONTROLN", "TARGET_B", "TARGET_D", "HPHONE_D",
#RFA
"RFA_2F", "RFA_2A", "MDMAUD_R", "MDMAUD_F", "MDMAUD_A",
#OTHERS
"CLUSTER2", "GEOCODE2")
#删除Id和TARGET_D属性
vars <- setdiff(varSet, c("CONTROLN", "TARGET_D"))
cup98 <- cup98[, vars]
dim(cup98)
## [1] 95412 64
2、使用随机森林创建模型
查看缺失值以及分类变量等级超过10 的数据
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
#model <- randomForest(TARGET_B~., data=cup98)
#此时会报:存在缺失值
#1)检测缺失值
n.missing <- rowSums(is.na(cup98))
(tab.missing <- table(n.missing))
## n.missing
## 0 1 2 3 4 5 6 7
## 6782 36864 23841 13684 11716 2483 41 1
#查看没有确实值数据的比例
round(tab.missing["0"]/nrow(cup98), digits=2)
## 0
## 0.07
#2)检查分类变量的等级大于10的属性
(idx.cat <- which(sapply(cup98, is.factor)))
## OSOURCE STATE ZIP PVASTATE RECINHSE MDMAUD DOMAIN HOMEOWNR
## 2 3 4 5 7 8 9 12
## CHILD03 CHILD07 CHILD12 CHILD18 GENDER COLLECT1 VETERANS BIBLE
## 13 14 15 16 19 22 23 24
## CATLG HOMEE PETS CDPLAY STEREO PCOWNERS PHOTO CRAFTS
## 25 26 27 28 29 30 31 32
## FISHER GARDENIN BOATS WALKER KIDSTUFF CARDS PLATES PEPSTRFL
## 33 34 35 36 37 38 39 40
## RFA_2A MDMAUD_R MDMAUD_F MDMAUD_A GEOCODE2
## 59 60 61 62 64
all.levels <- sapply(names(idx.cat), function(x) nlevels(cup98[, x]))
(levels10 <- all.levels[all.levels > 10])
## OSOURCE STATE ZIP MDMAUD DOMAIN
## 896 57 19938 28 17
#3)创建训练集和测试集数据:
ind <- sample(1:2, nrow(cup98), prob=c(80, 20), replace = T)
trainData <- cup98[ind==1, ]
testData <- cup98[ind==2, ]
#4)使用party包中的函数cforest()创建随机森林:内存受限而报错
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
(time1 <- Sys.time())
## [1] "2016-02-16 11:54:18 CST"
#cf <- cforest(TARGET_B~., data=trainData, control=cforest_unbiased(mtry=2, ntree=50))
#错误: 无法分配大小为11.4 Gb的矢量
(time2 <- Sys.time())
## [1] "2016-02-16 11:54:18 CST"
(time2-time1)
## Time difference of 0.06253886 secs
#print(object.size(cf), units="Mb")
#注意:此处代码不能正确运行,回报内存溢出。因为ZIP有19938个分类等级,OSOURCE有896个分类等级。
#5)减少内存需要的一种方法是对有多个等级水平的分类变量进行分组或者删除。删除"ZIP", "OSOURCE"两个属性,并重新创建测试集和训练集数据
cup <- cup98[, setdiff(names(cup98), c("ZIP", "OSOURCE"))]
train <- cup[ind==1, ]
test <- cup[ind==2, ]
#建模
#(teme1 <- Sys.time())
#cf <- cforest(TARGET_B~., data=train, controls = cforest_unbiased(mtry=2, ntree=50))
#print(object.size(cf), units="Mb")
#(time2 <- Sys.time())
#(time2 - time1)
#预测
#myPrediction <- predict(cf, newdata=test)
#(time3 <- Sys.time())
#print(object.size(myPrediction), units="Mb")
#time3 -time2
#总结:10万条记录,62个字段,字段的最大等级水平为57个;80%的数据用于建模,耗时将近一个小时;20%的数据用于预测,耗时10多分钟。(删除"ZIP"19938, "OSOURCE896"两个属性的情况下)
3、解决内存受限问题
减少内存需求的一种方法是对含有多个等级水平的分类变量进行分组或者删除。 确定哪些变量用于建模:为了找出哪些变量将用于建模,在本节中需要对创建决策树的过程重复10次。然后收集出现在所有决策树中的每一个变量,并将收集到的变量用于建立最终模型。
#1)创建训练集数据和测试集数据:将数据集划分为3个子集,训练数据集30%、测试数据集20%和其余的数据。划分出一小部分的数据是为了缩减训练数据和测试数据的规模,以便在内存受限的环境下成功的执行训练和测试。
library(party)
trainPercentage <- 30
testPercentage <- 20
restPercentage <- 100 - trainPercentage - testPercentage
(fileName <- paste("cup98-ctree", trainPercentage, testPercentage, sep="-"))
## [1] "cup98-ctree-30-20"
(vars <- setdiff(varSet, c("TARGET_D", "CONTROLN", "ZIP", "OSOURCE")))
## [1] "ODATEDW" "STATE" "PVASTATE" "DOB" "RECINHSE" "MDMAUD"
## [7] "DOMAIN" "CLUSTER" "AGE" "HOMEOWNR" "CHILD03" "CHILD07"
## [13] "CHILD12" "CHILD18" "NUMCHLD" "INCOME" "GENDER" "WEALTH1"
## [19] "HIT" "COLLECT1" "VETERANS" "BIBLE" "CATLG" "HOMEE"
## [25] "PETS" "CDPLAY" "STEREO" "PCOWNERS" "PHOTO" "CRAFTS"
## [31] "FISHER" "GARDENIN" "BOATS" "WALKER" "KIDSTUFF" "CARDS"
## [37] "PLATES" "PEPSTRFL" "CARDPROM" "MAXADATE" "NUMPROM" "CARDPM12"
## [43] "NUMPRM12" "RAMNTALL" "NGIFTALL" "CARDGIFT" "MINRAMNT" "MAXRAMNT"
## [49] "LASTGIFT" "LASTDATE" "FISTDATE" "TIMELAG" "AVGGIFT" "TARGET_B"
## [55] "HPHONE_D" "RFA_2F" "RFA_2A" "MDMAUD_R" "MDMAUD_F" "MDMAUD_A"
## [61] "CLUSTER2" "GEOCODE2"
ind <- sample(3, nrow(cup98), replace = T, prob=c(trainPercentage, testPercentage, restPercentage))
trainData <- cup98[ind==1, vars]
testData <- cup98[ind==2, vars]
#2)检查抽样后的训练集和测试集中的因变量,看其分布与原始数据中的分布时候一致,如果不一致,可是使用分层抽样
round(prop.table(table(cup98$TARGET_B)), digits = 3)
##
## 0 1
## 0.949 0.051
round(prop.table(table(trainData$TARGET_B)), digits = 3)
##
## 0 1
## 0.951 0.049
round(prop.table(table(testData$TARGET_B)), digits = 3)
##
## 0 1
## 0.95 0.05
#rm(cup98, ind)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 578656 31.0 940480 50.3 750400 40.1
## Vcells 9117436 69.6 83217080 634.9 91593727 698.9
#3)创建决策树
myCtree <- NULL
startTime <- Sys.time()
myCtree <- ctree(TARGET_B~., data=trainData)
Sys.time() - startTime
## Time difference of 5.178561 secs
print(object.size(myCtree), units="Mb")
## 4.4 Mb
memory.size()
## [1] 417.92
pdf(paste("F:\\R\\Rworkspace\\", fileName, ".pdf", sep=""))
plot(myCtree, type="simple", ip_args=list(pval=F), ep_args=list(digits=0), main=fileName)
graphics.off()
#4)创建10棵决策树:通过自定义的脚本创建
#source('F:/R/Rworkspace/ctreeN.R')
#ctreeN(10)
#大约耗时6分钟
4、使用已选变量建立模型
上面建立了10棵决策树之后,选取其中包含的所有变量来创建最后的模型。这一次所有的数据都用于学习,80%作为训练集和20%作为测试集。
#1)选择变量
vars.selected <- c("CARDS", "CARDGIFT", "CARDPM12", "CHILD12", "CLUSTER2", "DOMAIN", "GENDER", "GEOCODE2", "HIT", "HOMEOWNR", "INCOME", "LASTDATE", "MINRAMNT", "NGIFTALL", "PEPSTRFL", "RECINHSE", "RFA_2A", "RFA_2F", "STATE", "WALKER")
#2)创建训练集和测试集数据
trainPercentage <- 80
testPercentage <- 20
(fileName <- paste("cup98-ctree", trainPercentage, testPercentage, sep="-"))
## [1] "cup98-ctree-80-20"
vars <- c("TARGET_B", vars.selected)
ind <- sample(2, nrow(cup98), replace=T, prob=c(trainPercentage, testPercentage))
trainData <- cup98[ind==1, vars]
testData <- cup98[ind==2, vars]
round(100*prop.table(table(trainData$TARGET_B)), digits = 1)
##
## 0 1
## 95 5
round(100*prop.table(table(testData$TARGET_B)), digits = 1)
##
## 0 1
## 94.7 5.3
#3)建模
myCtree <- ctree(TARGET_B~., data=trainData)
print(object.size(myCtree), units="Mb")
## 43.6 Mb
memory.size()
## [1] 344.93
print(myCtree)
##
## Conditional inference tree with 23 terminal nodes
##
## Response: TARGET_B
## Inputs: CARDS, CARDGIFT, CARDPM12, CHILD12, CLUSTER2, DOMAIN, GENDER, GEOCODE2, HIT, HOMEOWNR, INCOME, LASTDATE, MINRAMNT, NGIFTALL, PEPSTRFL, RECINHSE, RFA_2A, RFA_2F, STATE, WALKER
## Number of observations: 76081
##
## 1) RFA_2A == {D, E}; criterion = 1, statistic = 416.197
## 2) LASTDATE <= 9611; criterion = 1, statistic = 79.624
## 3) RFA_2F <= 2; criterion = 1, statistic = 69.366
## 4) INCOME <= 6; criterion = 0.997, statistic = 49.32
## 5)* weights = 7159
## 4) INCOME > 6
## 6)* weights = 429
## 3) RFA_2F > 2
## 7) WALKER == {Y}; criterion = 1, statistic = 58.471
## 8)* weights = 1762
## 7) WALKER == { }
## 9) CARDPM12 <= 4; criterion = 0.999, statistic = 55.405
## 10)* weights = 1295
## 9) CARDPM12 > 4
## 11) PEPSTRFL == {X}; criterion = 0.998, statistic = 37.816
## 12) LASTDATE <= 9512; criterion = 0.978, statistic = 37.025
## 13)* weights = 3794
## 12) LASTDATE > 9512
## 14)* weights = 4693
## 11) PEPSTRFL == { }
## 15)* weights = 3310
## 2) LASTDATE > 9611
## 16) RFA_2F <= 2; criterion = 0.962, statistic = 29.529
## 17)* weights = 237
## 16) RFA_2F > 2
## 18)* weights = 363
## 1) RFA_2A == {F, G}
## 19) PEPSTRFL == {X}; criterion = 1, statistic = 109.472
## 20) LASTDATE <= 9607; criterion = 1, statistic = 59.983
## 21) RFA_2F <= 1; criterion = 1, statistic = 55.059
## 22) MINRAMNT <= 13; criterion = 0.993, statistic = 37.24
## 23) INCOME <= 2; criterion = 0.964, statistic = 34.578
## 24)* weights = 1929
## 23) INCOME > 2
## 25)* weights = 6252
## 22) MINRAMNT > 13
## 26) RFA_2A == {F}; criterion = 0.999, statistic = 24.021
## 27)* weights = 76
## 26) RFA_2A == {G}
## 28)* weights = 250
## 21) RFA_2F > 1
## 29) GENDER == { , A, J}; criterion = 0.999, statistic = 54.434
## 30) GENDER == {A, J}; criterion = 0.994, statistic = 32.28
## 31)* weights = 36
## 30) GENDER == { }
## 32)* weights = 316
## 29) GENDER == {F, M, U}
## 33)* weights = 8015
## 20) LASTDATE > 9607
## 34) CARDPM12 <= 10; criterion = 1, statistic = 27.286
## 35)* weights = 874
## 34) CARDPM12 > 10
## 36)* weights = 109
## 19) PEPSTRFL == { }
## 37) CARDGIFT <= 3; criterion = 1, statistic = 90.392
## 38) CLUSTER2 <= 42; criterion = 1, statistic = 100.831
## 39) STATE == {AA, AE, AP, AZ, CA, CO, CT, HI, ID, ND, NE, OK, OR, PA, SC, SD, WY}; criterion = 0.985, statistic = 90.333
## 40)* weights = 7563
## 39) STATE == {AK, AL, AR, AS, DE, FL, GA, IA, IL, IN, KS, KY, LA, MA, MD, ME, MI, MN, MO, MS, MT, NC, NJ, NM, NV, NY, OH, RI, TN, TX, UT, VA, VI, VT, WA, WI}
## 41)* weights = 12950
## 38) CLUSTER2 > 42
## 42)* weights = 9404
## 37) CARDGIFT > 3
## 43) CLUSTER2 <= 20; criterion = 0.959, statistic = 46.778
## 44)* weights = 2153
## 43) CLUSTER2 > 20
## 45)* weights = 3112
#4)将所有已建立的决策树保存为一个Rdata文件,并将决策树的图像保存到一个PDF文件中
save(myCtree, file=paste("F:\\R\\Rworkspace/项目/", fileName, ".rdata", sep=""))
#pdf(paste("F:\\R\\Rworkspace/项目/", ".pdf", sep=""),width=12, height=9, paper="a4r", pointsize=6)
#plot(myCtree, type="simple", ip_args=list(pval=F), ep_args=list(digits=0),main=fileName)
#plot(myCtree, terminal_panel=node_barplot(myCtree), ip_args=list(pval=F), ep_args=list(digits=0),main=fileName)
#graphics.off()
#5)预测并使用测试数据对决策树模型进行测试
myPrediction <- predict(myCtree, testData)
myPrediction <- predict(myCtree, testData, type="node")
str(myPrediction)
## int [1:19331] 45 42 41 41 8 5 41 41 41 33 ...
(testResult <- table(myPrediction, testData$TARGET_B))
##
## myPrediction 0 1
## 5 1778 108
## 6 103 8
## 8 399 38
## 10 262 43
## 13 911 73
## 14 1150 110
## 15 808 45
## 17 70 8
## 18 86 9
## 24 446 19
## 25 1504 68
## 27 17 2
## 28 54 5
## 31 10 0
## 32 47 5
## 33 1944 114
## 35 205 16
## 36 27 7
## 40 1827 84
## 41 3119 123
## 42 2241 75
## 44 505 26
## 45 798 34
(percentageOfOne <- round(100*testResult[, 2]/(testResult[, 1] + testResult[, 2]), digits=1))
## 5 6 8 10 13 14 15 17 18 24 25 27 28 31 32
## 5.7 7.2 8.7 14.1 7.4 8.7 5.3 10.3 9.5 4.1 4.3 10.5 8.5 0.0 9.6
## 33 35 36 40 41 42 44 45
## 5.5 7.2 20.6 4.4 3.8 3.2 4.9 4.1
(testResult <- cbind(testResult, percentageOfOne))
## 0 1 percentageOfOne
## 5 1778 108 5.7
## 6 103 8 7.2
## 8 399 38 8.7
## 10 262 43 14.1
## 13 911 73 7.4
## 14 1150 110 8.7
## 15 808 45 5.3
## 17 70 8 10.3
## 18 86 9 9.5
## 24 446 19 4.1
## 25 1504 68 4.3
## 27 17 2 10.5
## 28 54 5 8.5
## 31 10 0 0.0
## 32 47 5 9.6
## 33 1944 114 5.5
## 35 205 16 7.2
## 36 27 7 20.6
## 40 1827 84 4.4
## 41 3119 123 3.8
## 42 2241 75 3.2
## 44 505 26 4.9
## 45 798 34 4.1
#绘制预测数据0/1的箱线图
boxplot(myPrediction~testData$TARGET_B, xlab="TARGET_B", ylab="Prediction", ylim=c(0, 0.25))
#模型评估
s1 <- sort(myPrediction, decreasing = T, method="quick", index.return=T)
testSize <- nrow(testData)
TotalNumOfTarget <- sum(testData$TARGET_B)
NumOfTarget <- rep(0, testSize)
NumOfTarget[1] <- (testData$TARGET_B)[s1$ix[1]]
for(i in 2:testSize) {
NumOfTarget[i] <- NumOfTarget[i-1] + testData$TARGET_B[s1$ix[i]]
}
plot(1:testSize, NumOfTarget, pty=".", type="l", lty="solid", col="red", ylab="Count Of Responses in Top k", xlab="Top K", main=fileName)
grid(col="gray", lty="dotted")
percentile <- 100*(1:testSize)/testSize
percentileTarget <- 100*NumOfTarget/TotalNumOfTarget
plot(percentile, percentileTarget, pty=".", type="l", lty="solid", col="red", ylab="Percentage of Predicted Donations(%)", xlab="Percentage of Pool", main=fileName)
grid(col="gray", lty="dotted")
5、评分
当使用一棵较大的决策树对大数据评分是,将会出现内存溢出。为了减少内存消耗,将评分数据划分为多个子集,并对每一个子集分别使用预测模型,然后再将所有的评分结果进行融合。
#1)加载评分数据
cup98val <- read.csv("F:\\R\\Rworkspace\\cup98lrn/cup98val.txt")
cup98 <- read.csv("F:\\R\\Rworkspace\\cup98lrn/cup98lrn.txt")
library(party)
treeFileName <- "cup98-ctree-80-20"
splitNum <- 10
#2)设置评分数据的因子水平:把评分数据scoreData中的分类变量的等级水平设置和训练集数据trainData的一致
trainData <- cup98[, vars]
vars2 <- setdiff(c(vars, "CONTROLN"), "TARGET_B")
scoreData <- cup98val[, vars2]
#rm(cup98, cup98val)
trainNames <- names(trainData)
scoreNames <- names(scoreData)
newScoreData <- scoreData
variableList <- intersect(trainNames, scoreNames)
for(i in 1:length(variableList)) {
varname <- variableList[i]
trainLevels <- levels(trainData[, varname])
scoreLevels <- levels(newScoreData[, varname])
if(is.factor(trainData[, varname]) & setequal(trainLevels, scoreLevels)==F) {
cat("Waring: new values found in score data, and they will be changed to NA!\n")
cat(varname, "\n")
cat("train:", length(trainLevels), ", ", trainLevels, "\n")
cat("score:", length(scoreLevels), ", ", scoreLevels, "\n\n")
newScoreData[, varname] <- factor(newScoreData[, varname], levels=trainLevels)
}
}
## Waring: new values found in score data, and they will be changed to NA!
## GENDER
## train: 7 , A C F J M U
## score: 5 , F J M U
##
## Waring: new values found in score data, and they will be changed to NA!
## STATE
## train: 57 , AA AE AK AL AP AR AS AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VI VT WA WI WV WY
## score: 59 , AA AE AK AL AP AR AS AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA PR PW RI SC SD TN TX UT VA VI VT WA WI WV WY
#3)加载决策树模型并查看其大小
load(paste("F:\\R\\Rworkspace/项目/", fileName, ".rdata", sep=""))
print(object.size(trainData), units="Mb")
## 8 Mb
print(object.size(scoreData), units="Mb")
## 8.1 Mb
print(object.size(newScoreData), units="Mb")
## 8.1 Mb
print(object.size(myCtree), units="Mb")
## 43.6 Mb
#回收内存
memory.size()
## [1] 1086.55
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 702415 37.6 1442291 77.1 1168576 62.5
## Vcells 113783508 868.1 172988376 1319.8 172951373 1319.6
#4)将预测(评分)数据划分为多个子集,并对每一个子集建立一棵决策树以便降低内存消耗
nScore <- dim(newScoreData)[1]
(splitSize <- round(nScore/splitNum))
## [1] 9637
myPred <- NULL
for(i in 1:splitNum) {
startPos <- 1 + (i-1)*splitSize
if(i==splitNum) {
endPos <- nScore
} else{
endPos <- i*splitSize
}
print(paste("Predictions:", startPos, "-", endPos))
tmpPred <- predict(myCtree, newdata=newScoreData[startPos:endPos, ])
myPred <- c(myPred, tmpPred)
}
## [1] "Predictions: 1 - 9637"
## [1] "Predictions: 9638 - 19274"
## [1] "Predictions: 19275 - 28911"
## [1] "Predictions: 28912 - 38548"
## [1] "Predictions: 38549 - 48185"
## [1] "Predictions: 48186 - 57822"
## [1] "Predictions: 57823 - 67459"
## [1] "Predictions: 67460 - 77096"
## [1] "Predictions: 77097 - 86733"
## [1] "Predictions: 86734 - 96367"
#计算预测的数量及其所占的百分比
length(myPred)
## [1] 96367
(rankedLevels <- table(round(myPred, digits=4)))
##
## 0.0262 0.0295 0.03 0.0402 0.0443 0.0467 0.0515 0.055 0.0553 0.056
## 11904 2358 16415 3978 7848 9650 8997 4226 10208 311
## 0.0595 0.0651 0.0665 0.0789 0.084 0.0862 0.0928 0.1061 0.1127 0.1928
## 2628 4705 367 1138 6122 552 358 2172 1623 465
## 0.1944 0.2105 0.2294
## 68 133 141
#颠倒rankedLevels
rankedLevels <- rankedLevels[length(rankedLevels):1]
(levelNum <- length(rankedLevels))
## [1] 23
cumCnt <- rep(0, levelNum)
cumCnt[1] <- rankedLevels[1]
for(i in 2:levelNum) {
cumCnt[i] <- cumCnt[i-1] + rankedLevels[i]
}
(cumPercent <- 100*cumCnt/nScore)
## [1] 0.1463156 0.2843297 0.3548933 0.8374236 2.5216101
## [6] 4.7754937 5.1469902 5.7198003 12.0725975 13.2534996
## [11] 13.6343354 18.5167122 21.2437868 21.5665114 32.1593492
## [16] 36.5446678 45.8808513 55.8946527 64.0385194 68.1664885
## [21] 85.2003279 87.6472236 100.0000000
cumPercent <- round(cumPercent,digits=1)
percent <- 100*rankedLevels/nScore
precent <- round(percent,digits=1)
cumRanking <- data.frame(rankedLevels, cumCnt, percent, cumPercent)
names(cumRanking) <- c("Frequency", "CumFrequency", "Percentage", "CumPercentage")
print(cumRanking)
## Frequency CumFrequency Percentage CumPercentage
## 0.2294 141 141 0.14631565 0.1
## 0.2105 133 274 0.13801405 0.3
## 0.1944 68 342 0.07056357 0.4
## 0.1928 465 807 0.48253033 0.8
## 0.1127 1623 2430 1.68418650 2.5
## 0.1061 2172 4602 2.25388359 4.8
## 0.0928 358 4960 0.37149647 5.1
## 0.0862 552 5512 0.57281019 5.7
## 0.084 6122 11634 6.35279712 12.1
## 0.0789 1138 12772 1.18090218 13.3
## 0.0665 367 13139 0.38083576 13.6
## 0.0651 4705 17844 4.88237675 18.5
## 0.0595 2628 20472 2.72707462 21.2
## 0.056 311 20783 0.32272458 21.6
## 0.0553 10208 30991 10.59283780 32.2
## 0.055 4226 35217 4.38531863 36.5
## 0.0515 8997 44214 9.33618355 45.9
## 0.0467 9650 53864 10.01380141 55.9
## 0.0443 7848 61712 8.14386668 64.0
## 0.0402 3978 65690 4.12796912 68.2
## 0.03 16415 82105 17.03383938 85.2
## 0.0295 2358 84463 2.44689572 87.6
## 0.0262 11904 96367 12.35277637 100.0
#5)保存结果
#write.csv(cumRanking, "F:\\R\\Rworkspace/项目/cup98-cumulative-ranking.csv", row.names=T)
#pdf(paste("F:\\R\\Rworkspace/项目/cup98-score-distribution.pdf", sep=""))
#plot(rankedLevels, x=names(rankedLevels), type="h", xlab="Score", ylab="# of Customers")
#graphics.off()
#6)使用预测结果得分对客户进行排名,并将结果保存到一个.csv文件中
s1 <- sort(myPred, decreasing=T, method="quick", index.return=T)
varToOutput <- c("CONTROLN")
score <- round(myPred[s1$ix], digits=4)
table(score, useNA="ifany")
## score
## 0.0262 0.0295 0.03 0.0402 0.0443 0.0467 0.0515 0.055 0.0553 0.056
## 11904 2358 16415 3978 7848 9650 8997 4226 10208 311
## 0.0595 0.0651 0.0665 0.0789 0.084 0.0862 0.0928 0.1061 0.1127 0.1928
## 2628 4705 367 1138 6122 552 358 2172 1623 465
## 0.1944 0.2105 0.2294
## 68 133 141
result <- data.frame(cbind(newScoreData[s1$ix, varToOutput]), score)
names(result) <- c(varToOutput, "score")
#write.csv(result, "cup98-predicted-score.csv", row.names=F)