In R, how can left-censored missing data be imputed within a desired range (e.g. 0 < imputed value < 0.2)?

Posted on May 28

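I have some left-censored data on C-reactive protein (CRP), and I would like to know how to impute the values below the detection limit so that the imputed values fall inside a desired range (here: 0 < imputed_value < 0.2).

I am trying to do this with the package 'imputeLCMD', but since installing 'imputeLCMD' and all of its dependencies is slightly complicated, I am also open to other approaches.

Please consider the following MWE: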

# Load libraries
library(dplyr)
library(imputeLCMD)

# Assign the dputted random data to a data frame
df <- structure(list(
  participant_id = 1:10,
  CRP = c("2.9", "<0.2", "<0.2", "8.8", "9.4", "0.5", "5.3", "8.9", "5.5", "<0.2"),
  LDL_cholesterol = c(195.7, 145.3, 167.8, 157.3, 110.3, 190, 124.6, 104.2, 132.8, 195.5),
  fasting_glucose = c(114.5, 104.6, 102, 119.7, 102.8, 105.4, 97.2, 99.7, 84.5, 77.4),
  creatinine = c(1.5, 1.4, 1.2, 1.3, 0.5, 1, 1.3, 0.7, 0.8, 0.7)
), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))

The mock data frame entered above and displayed below resembles my real data.

# View the random data
head(df, n=5)
#> # A tibble: 5 × 5
#>   participant_id CRP   LDL_cholesterol fasting_glucose creatinine
#>            <int> <chr>           <dbl>           <dbl>      <dbl>
#> 1              1 2.9              196.            114.        1.5
#> 2              2 <0.2             145.            105.        1.4
#> 3              3 <0.2             168.            102         1.2
#> 4              4 8.8              157.            120.        1.3
#> 5              5 9.4              110.            103.        0.5

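However, I am a bit lost about how to proceed from here. To impute these left-censored values with the imputeLCMD package, I assume I first have to convert the left-censored values to NAs: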

df <- df %>%
  mutate(CRP = na_if(CRP, "<0.2")) %>%
  mutate(CRP = as.numeric(CRP))

head(df, n=5)
#> # A tibble: 5 × 5
#>   participant_id   CRP LDL_cholesterol fasting_glucose creatinine
#>            <int> <dbl>           <dbl>           <dbl>      <dbl>
#> 1              1   2.9            196.            114.        1.5
#> 2              2  NA              145.            105.        1.4
#> 3              3  NA              168.            102         1.2
#> 4              4   8.8            157.            120.        1.3
#> 5              5   9.4            110.            103.        0.5
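Converting "<0.2" straight to NA discards the detection limit itself. As a minimal base R sketch (the names `crp_censored`, `crp_limit` and `crp_value` are made up for illustration), the censoring flag and the limit can be kept alongside the numeric values, which is useful for any method that needs to know where the censoring point is:

```r
# CRP values as they appear in the example data
crp_raw <- c("2.9", "<0.2", "<0.2", "8.8", "9.4", "0.5", "5.3", "8.9", "5.5", "<0.2")

# TRUE where the value was reported as below the detection limit
crp_censored <- startsWith(crp_raw, "<")

# Numeric part of every entry; for censored entries this is the limit (0.2)
crp_limit <- as.numeric(sub("^<", "", crp_raw))

# Observed values, with NA where only "< limit" is known
crp_value <- ifelse(crp_censored, NA_real_, crp_limit)

data.frame(crp_raw, crp_censored, crp_value)
```

This assumes every censored entry is written as "<" followed by a number; other notations would need their own parsing rule.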

If I now run one of the wrappers from the imputeLCMD package, I do get some kind of result:

# Impute the missing data
df_imputed <- impute.wrapper.SVD(df, K = 4) %>% as.data.frame()

# Round the result
df_imputed <- df_imputed %>% mutate_at(vars(CRP), ~round(., digits = 1))

# Place the original CRP next to the imputed one for comparison
df_imputed <- df_imputed %>% mutate(original_CRP = df$CRP)
df_imputed <- df_imputed %>% select(1,2,6,3,4,5)

# Display the result
head(df_imputed, n=5)
#>   participant_id CRP original_CRP LDL_cholesterol fasting_glucose creatinine
#> 1              1 2.9          2.9           195.7           114.5        1.5
#> 2              2 5.8           NA           145.3           104.6        1.4
#> 3              3 5.9           NA           167.8           102.0        1.2
#> 4              4 8.8          8.8           157.3           119.7        1.3
#> 5              5 9.4          9.4           110.3           102.8        0.5
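Whatever imputation routine is used, one general way to keep an identifier out of the calculation is to pass only the measurement columns and re-attach the identifier afterwards. The sketch below illustrates that pattern in base R; `impute_half_min()` is a hypothetical stand-in (a crude half-minimum fill), not an imputeLCMD function, and the small data frame is made up:

```r
# Small made-up data frame with an ID column and some missing CRP values
df_small <- data.frame(
  participant_id = 1:4,
  CRP = c(2.9, NA, NA, 8.8),
  creatinine = c(1.5, 1.4, 1.2, 1.3)
)

# Hypothetical stand-in for a matrix-based imputation routine:
# replaces each column's NAs with half of that column's observed minimum
impute_half_min <- function(m) {
  apply(m, 2, function(col) replace(col, is.na(col), min(col, na.rm = TRUE) / 2))
}

# Impute on the measurement columns only, then re-attach the identifier
measure_cols <- setdiff(names(df_small), "participant_id")
imputed <- impute_half_min(as.matrix(df_small[measure_cols]))
df_small_imputed <- cbind(df_small["participant_id"], as.data.frame(imputed))
df_small_imputed
```

The same split-and-rebind step should work with any function that takes a numeric matrix and returns one of the same shape.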

Created on 2023-05-27 with reprex v2.0.2

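My questions:

I do not know how to set the parameters for the imputeLCMD package so that the imputed values are > 0 AND < 0.2.
How can I make sure that imputeLCMD does not treat participant_id itself as numeric data in the imputation?

I have already seen some great alternative approaches for left-censored data on SO, but it would be great to know what I am doing wrong (or right) with 'imputeLCMD'.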

Recommended answer

set.seed(42)

# Simulate some underlying log-normal CRP values
# Parameters from: https://stackoverflow.com/a/63938717/4550695
crp <- rlnorm(10000, 1.355, 1.45)
summary(crp)
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
#>    0.011    1.418    3.842   11.158   10.139 2060.559

# Apply a lower limit of quantification to observations
lloq <- 3
crp_obs <- replace(crp, crp < lloq, NA)
summary(crp_obs)
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's
#>    3.002    5.019    8.738   18.669   18.107 2060.559     4331

# Any missing values in x are assumed to be known to be < LLOQ
left_censored_log_normal_log_likelihood <- function(mu, sigma, x, lloq) {
  sum(dlnorm(na.omit(x), mu, sigma, log = TRUE)) +
    sum(is.na(x)) * plnorm(lloq, mu, sigma, log.p = TRUE)
}
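Written out, the function above appears to compute the usual left-censored log-likelihood. The notation here is introduced for illustration: $f$ and $F$ are the log-normal density and distribution function, and $n_{\text{cens}}$ is the number of values known only to be below the LLOQ.

```latex
\ell(\mu, \sigma)
  = \sum_{i \,:\, x_i \text{ observed}} \log f(x_i;\, \mu, \sigma)
  \;+\; n_{\text{cens}} \, \log F(\mathrm{LLOQ};\, \mu, \sigma)
```

Each observed value contributes its log density, while each censored value contributes the log probability of falling below the limit.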

mean_sd <- function(x, ...) {
  c(mean(x, ...), sd(x, ...))
}

# Initial values from observed data
theta0 <- mean_sd(log(crp_obs), na.rm = TRUE)
theta0
#> [1] 2.350642 0.924159

fit <- optim(theta0, function(theta) {
  -left_censored_log_normal_log_likelihood(theta[1], theta[2], crp_obs, lloq)
})

str(fit)
#> List of 5
#>  $ par        : num [1:2] 1.34 1.45
#>  $ value      : num 26786
#>  $ counts     : Named int [1:2] 69 NA
#>   ..- attr(*, "names")= chr [1:2] "function" "gradient"
#>  $ convergence: int 0
#>  $ message    : NULL
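An unconstrained optimiser can, in principle, propose a negative sigma during the search. One possible variation, sketched here as an assumption rather than as part of the original approach, is to optimise over log(sigma) so that sigma stays positive by construction. The names `neg_loglik`, `start`, `fit_log` and `est` are made up for this example:

```r
set.seed(42)

# Same simulation set-up as above
crp <- rlnorm(10000, 1.355, 1.45)
lloq <- 3
crp_obs <- replace(crp, crp < lloq, NA)

# Negative log-likelihood with theta = c(mu, log(sigma))
neg_loglik <- function(theta, x, lloq) {
  mu <- theta[1]
  sigma <- exp(theta[2])  # back-transform, always positive
  -(sum(dlnorm(x[!is.na(x)], mu, sigma, log = TRUE)) +
      sum(is.na(x)) * plnorm(lloq, mu, sigma, log.p = TRUE))
}

# Starting values from the observed (uncensored) data, sigma on the log scale
start <- c(mean(log(crp_obs), na.rm = TRUE), log(sd(log(crp_obs), na.rm = TRUE)))
fit_log <- optim(start, neg_loglik, x = crp_obs, lloq = lloq)

# Report the estimates on the original scale
est <- c(mu = fit_log$par[1], sigma = exp(fit_log$par[2]))
est
```

The estimates should land close to the simulation parameters (1.355 and 1.45), as with the direct parameterisation.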

n <- sum(is.na(crp_obs))
p <- runif(n, 0, plnorm(lloq, fit$par[1], fit$par[2]))
y <- qlnorm(p, fit$par[1], fit$par[2])
summary(y)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#> 0.00557 0.62488 1.20758 1.32565 1.97396 2.99494

crp_imp <- replace(crp_obs, which(is.na(crp_obs)), y)

mean_sd(log(crp_imp))
#> [1] 1.337859 1.461132
summary(crp_imp)
#>      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
#>    0.0056    1.4200    3.8418   11.1576   10.1394 2060.5585
summary(crp)
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
#>    0.011    1.418    3.842   11.158   10.139 2060.559
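The imputation step above is inverse-CDF sampling from a log-normal truncated at the LLOQ: uniform probabilities are drawn between 0 and the fitted probability of falling below the limit, then mapped back through the quantile function. A minimal sketch of that idea as a reusable helper (`rlnorm_below()` is a made-up name, and the parameter values are just the rounded estimates from above):

```r
# Hypothetical helper: draw n values from a log-normal truncated above at `upper`
rlnorm_below <- function(n, meanlog, sdlog, upper) {
  # Probability mass of the untruncated distribution below the limit
  p_upper <- plnorm(upper, meanlog, sdlog)
  # Uniform draws on (0, p_upper), mapped back through the quantile function
  qlnorm(runif(n, 0, p_upper), meanlog, sdlog)
}

set.seed(1)
draws <- rlnorm_below(1000, meanlog = 1.34, sdlog = 1.45, upper = 3)
range(draws)
```

Because the probabilities never exceed the mass below `upper`, every draw stays between 0 and the limit, which is the property the question asks for.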

library(censlm) # remotes::install_github("mikmart/censlm")
#> Loading required package: survival

set.seed(42)

# Simulate left-censored log-normal CRP observations
crp <- rlnorm(10000, 1.355, 1.450)
lloq <- 3
obs <- replace(crp, crp < lloq, lloq)
crpdf <- data.frame(crp, lloq, obs)

# Fit censored linear model and extract (random) imputations
fit <- clm(log(obs) ~ 1, left = log(lloq))
imp <- exp(imputed(fit))
summary(cbind(crpdf, imp))
#>       crp                lloq        obs                imp
#>  Min.   :   0.011   Min.   :3   Min.   :   3.000   Min.   :   0.0056
#>  1st Qu.:   1.418   1st Qu.:3   1st Qu.:   3.000   1st Qu.:   1.4200
#>  Median :   3.842   Median :3   Median :   3.842   Median :   3.8418
#>  Mean   :  11.158   Mean   :3   Mean   :  11.883   Mean   :  11.1576
#>  3rd Qu.:  10.139   3rd Qu.:3   3rd Qu.:  10.139   3rd Qu.:  10.1394
#>  Max.   :2060.559   Max.   :3   Max.   :2060.559   Max.   :2060.5585

summary(fit)
#>
#> Call:
#> survreg(formula = Surv(log(obs), log(obs) > log(lloq), type = "left") ~
#>     1, dist = "gaussian")
#>              Value Std. Error    z      p
#> (Intercept) 1.3421     0.0168 79.8 <2e-16
#> Log(scale)  0.3749     0.0103 36.3 <2e-16
#>
#> Scale= 1.45
#>
#> Gaussian distribution
#> Loglik(model)= -13460.2   Loglik(intercept only)= -13460.2
#> Number of Newton-Raphson Iterations: 4
#> n= 10000
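The summary above shows that the fit comes down to a `survreg()` call from the survival package. As a rough sketch of how the same idea might be applied to the CRP values from the question (detection limit 0.2) using survival directly, the code below fits the censored model and then imputes by truncated inverse-CDF sampling. The variable names are made up, and with only ten observations the fitted parameters will be very uncertain:

```r
library(survival)

# CRP values from the question; "<0.2" marks values below the detection limit
crp_raw <- c("2.9", "<0.2", "<0.2", "8.8", "9.4", "0.5", "5.3", "8.9", "5.5", "<0.2")
lloq <- 0.2
censored <- startsWith(crp_raw, "<")

# Censored entries are set to the limit itself, as survreg() expects
obs <- as.numeric(sub("^<", "", crp_raw))

# Left-censored Gaussian model on the log scale (event = TRUE means observed)
fit_crp <- survreg(Surv(log(obs), !censored, type = "left") ~ 1, dist = "gaussian")
mu <- unname(coef(fit_crp))
sigma <- fit_crp$scale

# Draw imputations from the fitted log-normal, truncated above at the limit
set.seed(1)
p <- runif(sum(censored), 0, plnorm(lloq, mu, sigma))
crp_imp <- replace(obs, censored, qlnorm(p, mu, sigma))
crp_imp
```

Only the CRP vector enters the model here, so an identifier column such as participant_id plays no part in the imputation.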
