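Is there a way to use R data.table to set a column value that takes two levels of function calls to compute, and is there a way to set column values in a join between two data.tables?
Example: this works, but it uses a for loop.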
library(data.table)
# use something from datasets to illustrate
# bind per-ID tables so Amt stays numeric (cbind() with a string coerces it to character)
tAmounts<-rbind(data.table(ID="Apple", Amt=as.numeric(EuStockMarkets[1:50,2])),
data.table(ID="Orange", Amt=as.numeric(EuStockMarkets[1:30,3])),
data.table(ID="Lemon", Amt=as.numeric(EuStockMarkets[1:60,4])))
setkey(tAmounts, ID, Amt)
# is there a data.table way to do this without the for loops?
# summary table with the full hierarchical cluster for each ID
tSummary<-tAmounts[, .(.N, Clust=list()), keyby="ID"]
for (Id in unique(tAmounts$ID)) {
D<-dist(tAmounts[ID==Id]$Amt)
C<-hclust(D, method="average")
tSummary[ID==Id]$Clust<-list(C) # any way to mapply & lapply?
}
# ID N Clust
# Apple 50 <hclust[7]>
# Lemon 60 <hclust[7]>
# Orange 30 <hclust[7]>
Perhaps there is a way to express this with some combination of lapply and mapply, something like tSummary[, Clust:=hclust(dist(Amt), method="average"), by="ID"]?
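One possible way to drop the first loop, offered as a sketch rather than the definitive data.table idiom: compute the tree inside j and wrap it in list(), so each group contributes exactly one element to a list column. The tAmounts table is rebuilt here only to keep the block self-contained, and building it from per-ID data.tables (so Amt is numeric) is an assumption of this sketch.

```r
library(data.table)

# Assumption: same sample data as in the question, with Amt kept numeric
tAmounts <- rbind(
  data.table(ID = "Apple",  Amt = as.numeric(EuStockMarkets[1:50, 2])),
  data.table(ID = "Orange", Amt = as.numeric(EuStockMarkets[1:30, 3])),
  data.table(ID = "Lemon",  Amt = as.numeric(EuStockMarkets[1:60, 4]))
)

# One row per ID; list(...) makes Clust a list column holding one hclust per group
tSummary <- tAmounts[
  , .(N = .N, Clust = list(hclust(dist(Amt), method = "average"))),
  keyby = ID
]
```

The two-level call hclust(dist(Amt)) is evaluated once per ID, so no lapply or mapply is needed; the grouping does the iteration.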
Similarly, is there a way to set a column in a join using a function? Continuing from the example above:
# table of hierarchical cluster cuts, e.g., height of $20, height of $40
tCuts<-CJ(ID=unique(tAmounts$ID), Cut=seq(20,100,20))
setkey(tCuts, ID, Cut)
# ID Cut
# Apple 20
# Apple 40
# ...etc...
# table with clusters taken at each cut
tClust<-tCuts[tAmounts, on="ID", allow.cartesian=TRUE]
setkey(tClust, ID, Cut, Amt)
# ID Cut Amt
# Apple 20 1587.4
# Apple 20 1630.6
# ...etc...
# Orange 100 1789.5
# set ClustNum for each ID, cut, and amount
for (i in 1:nrow(tCuts)) {
Id<-tCuts[i]$ID
tClust[ID==Id & Cut==tCuts[i]$Cut, ClustNum:=cutree(tSummary[ID==Id]$Clust[[1]], h=tCuts[i]$Cut)] # any way to mapply in a join?
}
Is there something like tClust[tCuts, ClustNum:=cutree(Clust, h=Cut)] that can join and set the values in one step?
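A hedged sketch of one way to avoid the second loop: do the cartesian join first, then assign ClustNum by group with := and look up the matching tree through .BY. The named list clusts is a hypothetical helper introduced for this illustration, and the data setup is repeated (with Amt kept numeric, an assumption) so the block runs on its own. This is two steps rather than a single join-and-set, so it may not be exactly what the question has in mind.

```r
library(data.table)

# Assumption: same sample data as in the question, with Amt kept numeric
tAmounts <- rbind(
  data.table(ID = "Apple",  Amt = as.numeric(EuStockMarkets[1:50, 2])),
  data.table(ID = "Orange", Amt = as.numeric(EuStockMarkets[1:30, 3])),
  data.table(ID = "Lemon",  Amt = as.numeric(EuStockMarkets[1:60, 4]))
)

# One hclust per ID, stored in a list column
tSummary <- tAmounts[
  , .(Clust = list(hclust(dist(Amt), method = "average"))),
  keyby = ID
]

# Every ID paired with every cut height
tCuts <- CJ(ID = unique(tAmounts$ID), Cut = seq(20, 100, 20))

# Cartesian join: one row per (ID, Cut, Amt); rows keep the order of tAmounts
tClust <- tCuts[tAmounts, on = "ID", allow.cartesian = TRUE]

# Hypothetical helper: look trees up by ID
clusts <- setNames(tSummary$Clust, tSummary$ID)

# Cut each ID's tree at each height. Within a group the rows are still in
# the order that was passed to dist(), so do not re-sort before this step.
tClust[, ClustNum := cutree(clusts[[.BY$ID]], h = .BY$Cut), by = .(ID, Cut)]
```

The assignment relies on row order: cutree() returns labels in the order of the observations given to dist(), so the rows of each (ID, Cut) group must be in that same order when := runs. Calling setkey(tClust, ID, Cut, Amt) is safe afterwards.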