I have a dataset into which I need to insert data (updating the value columns on conflict). Since this is a bottleneck of an application, I would like this to be fairly well optimized. But in this case duckdb is really slow compared to SQLite. What am I doing wrong?
MWE
A simple example is this table, where id and name are unique, and val and created_at are the values that should be updated on conflict.
I set up the table like this
con <- DBI::dbConnect(duckdb::duckdb())
q <- "CREATE TABLE main (
  id INTEGER NOT NULL,
  name TEXT NOT NULL,
  val REAL NOT NULL,
  created_at TIMESTAMP NOT NULL,
  UNIQUE(id, name)
)"
DBI::dbExecute(con, q)
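Just to rule out a missing constraint: as far as I know the UNIQUE constraint can be verified via DuckDB's duckdb_constraints() catalog function (a quick sanity check, not part of the benchmark):
DBI::dbGetQuery(con, "SELECT table_name, constraint_type, constraint_text
                      FROM duckdb_constraints()
                      WHERE table_name = 'main'")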
Then I simulate some values to insert into the table
# Create a data frame
create_data <- function(N, k_letters) {
  data.frame(
    id = sample.int(100, N, replace = TRUE),
    name = replicate(N, paste(sample(letters, k_letters), collapse = "")),
    val = runif(N),
    created_at = Sys.time()
  ) |>
    # keep only one row per (id, name), since that pair must be unique
    dplyr::group_by(id, name) |> dplyr::slice(1) |> dplyr::ungroup()
}
set.seed(123)
df_orig <- create_data(10000, 2)
nrow(df_orig)
#> [1] 9285
DBI::dbAppendTable(con, "main", df_orig)
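A quick check that all rows actually arrived in the table:
DBI::dbGetQuery(con, "SELECT COUNT(*) AS n FROM main")
# should be 9285, i.e. nrow(df_orig)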
Now I simulate new data...
set.seed(124)
new_data <- create_data(100000, 2)
nrow(new_data)
#> [1] 50902
# how many rows of the original stay as they are, because they are not in the new data?
dplyr::anti_join(df_orig, new_data, by = c("id", "name")) |> nrow()
#> [1] 1971
# how many are overwritten by the new data?
dplyr::semi_join(df_orig, new_data, by = c("id", "name")) |> nrow()
#> [1] 7314
# how many are newly added?
dplyr::anti_join(new_data, df_orig, by = c("id", "name")) |> nrow()
#> [1] 43588
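These counts are consistent with each other, which can be asserted directly:
# untouched + overwritten rows make up the original data,
# overwriting + newly added rows make up the new data
stopifnot(1971 + 7314  == nrow(df_orig))  # 9285
stopifnot(7314 + 43588 == nrow(new_data)) # 50902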
...and insert it into the table
upsert_db <- function(con, data) {
  q <- "INSERT INTO main
        VALUES ($id, $name, $val, $created_at)
        ON CONFLICT (id, name)
        DO UPDATE SET val = excluded.val, created_at = excluded.created_at;"
  res <- DBI::dbSendStatement(con, q)
  # bind the whole data frame; the statement is executed once per row
  DBI::dbBind(res, params = unname(as.list(data)))
  rr <- DBI::dbGetRowsAffected(res)
  DBI::dbClearResult(res)
  rr
}
t0 <- Sys.time()
upsert_db(con, new_data)
#> [1] 50902
difftime(Sys.time(), t0)
#> Time difference of 16.30046 secs
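For reference, the resulting table size matches the join counts above (1971 untouched plus 50902 affected rows):
DBI::dbGetQuery(con, "SELECT COUNT(*) AS n FROM main")
# expected: 1971 + 50902 = 52873 rows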
This works, but it is very slow; similar code using SQLite takes less than a second.
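In case it matters, the SQLite code I am comparing against is essentially the same, just via RSQLite and with positional ? placeholders (a rough sketch of what I run):
con_sqlite <- DBI::dbConnect(RSQLite::SQLite())
DBI::dbExecute(con_sqlite, "CREATE TABLE main (
  id INTEGER NOT NULL,
  name TEXT NOT NULL,
  val REAL NOT NULL,
  created_at TIMESTAMP NOT NULL,
  UNIQUE(id, name)
)")
DBI::dbAppendTable(con_sqlite, "main", df_orig)

upsert_sqlite <- function(con, data) {
  q <- "INSERT INTO main
        VALUES (?, ?, ?, ?)
        ON CONFLICT (id, name)
        DO UPDATE SET val = excluded.val, created_at = excluded.created_at;"
  res <- DBI::dbSendStatement(con, q)
  DBI::dbBind(res, params = unname(as.list(data)))
  rr <- DBI::dbGetRowsAffected(res)
  DBI::dbClearResult(res)
  rr
}
upsert_sqlite(con_sqlite, new_data)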
Am I missing something obvious?