Tue 28 Feb, 2017 05:36 am
I did item-based collaborative filtering in R and have some questions about it.
1- How can I know the confidence level of the results? For example, the results can show that item x is 50% similar to item y. How can I tell whether this result is reliable?
2- I see many duplicated similarity values in the similarity matrix (see some examples below).
1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
1 0.707106781186547 0.707106781186547 0.707106781186547 0.707106781186547 0.707106781186547 0.707106781186547 0.5 0.5 0.5 0.5
1 1 1 1 1 0 0 0 0 0 0
And so on.
Since almost all of my data looks like this, I find it difficult to rely on the results. I believe my data set is large and varied enough. What could be the reason for this type of result? Could you please help me understand it?
Here is my code:
RestaurantData1 <- read.csv(paste(getwd(),"/Restaurant/Restaurant.csv",sep = ""), stringsAsFactors = FALSE)
names(RestaurantData1)[1:2] <- c("UserName","ResName")
RestaurantData1 <- RestaurantData1[,!names(RestaurantData1) %in% "Visits"]
RestaurantData1 <- subset(RestaurantData1, RestaurantData1$Orders > 0 & RestaurantData1$UserName != "")
RestaurantData1$Orders <- 1
gc()
# Cosine similarity between two numeric vectors (or one-column matrices)
getCosine <- function(x,y)
{
  this.cosine <- sum(x*y) / (sqrt(sum(x*x)) * sqrt(sum(y*y)))
  return(this.cosine)
}
ColumnBasedData <-
reshape(
RestaurantData1, idvar = "UserName", timevar = "ResName", direction =
"wide"
)
rm(RestaurantData1)
gc()
ColumnBasedData[is.na(ColumnBasedData)] <- 0
ResData <-
  (ColumnBasedData[,!(names(ColumnBasedData) %in% c("UserName"))])
rm(ColumnBasedData)
gc()
holder <-
matrix(
NA, nrow = ncol(ResData),ncol = ncol(ResData),dimnames = list(colnames(ResData),colnames(ResData))
)
ResData.similarity <- as.data.frame(holder)
# Compare every pair of restaurant columns (i, j)
for(i in 1:ncol(ResData)) {
  for(j in 1:ncol(ResData)) {
    ResData.similarity[i,j] <- getCosine(as.matrix(ResData[i]),as.matrix(ResData[j]))
  }
}
ResDataNeighbour <- matrix(NA, nrow=ncol(ResData.similarity),ncol=11,dimnames=list(colnames(ResData.similarity)))
for(i in 1:ncol(ResData))
{
ResDataNeighbour[i,] <- (t(head(n=11,rownames(ResData.similarity[order(ResData.similarity[,i],decreasing=TRUE),]))))
}
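For reference, here is a minimal sketch of the same cosine calculation on a tiny made-up 0/1 matrix. The data, the restaurant names and the helper `cosine_matrix` are hypothetical and not part of my real script. Because the code above sets every order count to 1, each column is a 0/1 vector, and for such vectors the cosine works out to (number of shared users) / sqrt(users of item i * users of item j), so the toy data produces the same kind of values (0.5, 0.7071...) that I see in my matrix.

```r
# Hypothetical toy data: rows are users, columns are restaurants,
# 1 = the user ordered there at least once, 0 = never ordered.
toy <- matrix(
  c(1, 1, 1,
    1, 0, 0,
    0, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("user", 1:3), c("ResA", "ResB", "ResC"))
)

# Cosine similarity between all pairs of columns in one step.
# crossprod(m) is t(m) %*% m, i.e. the dot product of every column pair;
# its diagonal holds each column's squared length.
# Note: a column of all zeros would give NaN (division by zero).
cosine_matrix <- function(m) {
  cross <- crossprod(m)
  norms <- sqrt(diag(cross))
  cross / outer(norms, norms)
}

sim <- cosine_matrix(toy)
print(round(sim, 4))
```

In this toy case ResA and ResB each have two users and share one, giving 1 / sqrt(2 * 2) = 0.5, while ResC has a single user who also ordered at ResA, giving 1 / sqrt(2 * 1) = 0.7071.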