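I have a data frame with regional data, and I would like to equalize the proportions of a particular variable within each country. Here is my example: the table below shows the number of samples per country, split by gender. I want to delete samples so that the counts for 0 and 1 are equal.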

> table(df$Gender , df$COUNTRY)
   
      1   2   3
  0  86  81 282
  1  21   7  23

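Is there a package or function that can remove rows where the value is 0, keeping just enough of them to match the number of rows where the value is 1?

This would be the expected result: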

> table(df$Gender , df$COUNTRY)
   
      1   2   3
  0  21   7  23
  1  21   7  23

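A more basic way to do this would also help. For example, removing a random sample of 65 rows where df$COUNTRY == 1 & df$Gender == 0. I could then repeat that manually for each country.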
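A minimal sketch of that manual step, for illustration only. It rebuilds a toy data frame from the counts in the first table rather than using the real data, and the names `counts`, `idx`, `to_drop` and `df_c1` as well as the seed are assumptions made for this example.

```r
# Toy data rebuilt from the counts in the table above (not the real df)
counts <- data.frame(
  Gender  = c(0, 0, 0, 1, 1, 1),
  COUNTRY = c(1, 2, 3, 1, 2, 3),
  n       = c(86, 81, 282, 21, 7, 23)
)
df <- counts[rep(seq_len(nrow(counts)), counts$n), c("Gender", "COUNTRY")]
rownames(df) <- NULL

set.seed(1)  # arbitrary seed, only so the draw is reproducible

# Positions of all rows with COUNTRY == 1 and Gender == 0
idx <- which(df$COUNTRY == 1 & df$Gender == 0)

# Draw 65 of those positions (86 - 21 = 65); sampling positions of idx
# avoids sample() treating a single number n as the range 1:n
to_drop <- idx[sample.int(length(idx), 65)]
df_c1 <- df[-to_drop, ]

table(df_c1$Gender, df_c1$COUNTRY)
```

After this step country 1 has 21 rows for each gender, while countries 2 and 3 are left untouched.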

As requested, here is the data. The tables above have been updated accordingly.

df <- structure(list(Gender = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                                0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                                0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 
                                1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 
                                0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                                1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 
                                0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                                0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                                0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 
                                0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                                1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 
                                0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                                0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 
                                0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                                0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 
                                1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 
                                0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                                0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 
                                1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 
                                0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 
                                0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                                0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                                0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                                1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 
                                0, 1, 1, 0, 0), COUNTRY = c(2, 3, 2, 1, 3, 3, 3, 2, 2, 3, 3, 
                                                            3, 3, 3, 3, 2, 3, 3, 1, 3, 3, 3, 3, 1, 2, 3, 2, 3, 1, 3, 2, 3, 
                                                            3, 3, 2, 2, 3, 3, 3, 2, 3, 2, 2, 1, 3, 3, 3, 2, 2, 3, 1, 1, 2, 
                                                            2, 1, 3, 3, 1, 2, 1, 3, 3, 3, 1, 1, 3, 3, 3, 1, 3, 2, 1, 3, 2, 
                                                            3, 3, 2, 3, 3, 3, 3, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
                                                            3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 1, 1, 3, 1, 2, 1, 3, 
                                                            2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 1, 2, 3, 3, 3, 1, 1, 3, 1, 1, 2, 
                                                            2, 3, 3, 1, 2, 3, 3, 3, 2, 3, 3, 1, 3, 3, 1, 3, 1, 1, 3, 3, 2, 
                                                            3, 1, 1, 1, 3, 3, 3, 2, 3, 3, 2, 3, 2, 1, 3, 2, 3, 1, 3, 2, 2, 
                                                            2, 3, 3, 2, 1, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 2, 2, 
                                                            1, 1, 3, 1, 1, 1, 3, 3, 1, 2, 1, 1, 1, 1, 3, 3, 3, 1, 3, 3, 2, 
                                                            3, 3, 3, 3, 3, 1, 3, 3, 2, 1, 1, 2, 3, 2, 3, 3, 3, 2, 2, 3, 3, 
                                                            3, 3, 1, 3, 2, 2, 1, 3, 2, 1, 3, 2, 3, 2, 3, 3, 2, 3, 2, 3, 3, 
                                                            3, 1, 3, 2, 1, 1, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 1, 2, 3, 1, 2, 
                                                            3, 2, 1, 2, 1, 3, 1, 3, 3, 3, 3, 3, 1, 3, 1, 1, 3, 1, 3, 1, 1, 
                                                            3, 3, 1, 3, 1, 1, 1, 2, 3, 2, 2, 3, 3, 3, 2, 3, 3, 2, 3, 3, 3, 
                                                            3, 3, 3, 3, 2, 3, 1, 3, 3, 3, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
                                                            1, 3, 3, 1, 2, 3, 1, 3, 3, 3, 1, 3, 1, 3, 3, 3, 1, 1, 3, 2, 3, 
                                                            1, 3, 3, 3, 1, 2, 3, 3, 3, 3, 1, 3, 1, 3, 1, 1, 3, 3, 3, 3, 1, 
                                                            3, 3, 3, 3, 3, 1, 1, 3, 3, 2, 3, 3, 3, 3, 1, 3, 3, 2, 3, 3, 1, 
                                                            3, 3, 3, 2, 3, 1, 3, 3, 1, 3, 2, 1, 2, 3, 3, 3, 3, 3, 3, 2, 2, 
                                                            2, 3, 3, 2, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 2, 1, 3, 3, 3, 3, 3, 
                                                            3, 1, 3, 1, 2, 3, 3, 2, 3, 3, 3, 3, 2, 2, 3, 3, 3, 3, 3, 3, 3, 
                                                            1, 1, 3, 3, 3, 1, 3, 3, 1, 3, 3, 3, 3, 3, 1, 3, 3, 1, 3, 3, 3, 
                                                            3, 3, 2, 3, 2, 3)), row.names = c(NA, 500L), class = "data.frame")

Recommended answer

Here is a base R way.
First, tabulate the data and compute, for each column of the table, the difference between its two rows. Then, for each table column name, sample that many indices from the rows whose first column is 0 and whose second column equals that name. These are the rows to remove.

n <- 100L
set.seed(2023)
df1 <- data.frame(
  a = sample(0:1, n, TRUE, prob = c(3, 1)/4),
  b = sample(3, n, TRUE)
)

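# rows with a == 0 are the candidates for removal
i <- df1$a == 0
# per level of b: (count of 1s) - (count of 0s); negative when 0s are in excess
table(df1) |>
  apply(2, diff) -> tmp
# for each level of b, sample as many a == 0 row indices as the surplus
lapply(names(tmp), \(nm) {
  j <- which(i & df1$b == nm)
  sample(j, abs(tmp[nm]))
}) |> unlist() -> tmp

# drop the sampled rows and re-tabulate
df1[-tmp, ] |> table()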
#>    b
#> a    1  2  3
#>   0  8 10  8
#>   1  8 10  8
rm(tmp)  # tidy up

Created on 2023-08-08 with reprex v2.0.2
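The same idea can be wrapped in a small function and applied to the column names used in the question. This is only a sketch: `balance_by_group` is a hypothetical helper written for this example, not a function from any package, and it assumes every group contains both classes. The toy data frame is rebuilt from the counts in the question's table rather than from the posted data.

```r
# Hypothetical helper: within each level of `group`, randomly keep the same
# number of rows for every class of `flag` (the size of the smallest class).
# Assumes each group contains every class at least once.
balance_by_group <- function(data, flag, group) {
  rows_by_group <- split(seq_len(nrow(data)), data[[group]])
  keep <- lapply(rows_by_group, function(rows) {
    by_flag <- split(rows, data[[flag]][rows])
    n_min <- min(lengths(by_flag))
    # sample positions, not values, so a single row index is never read as 1:n
    lapply(by_flag, function(r) r[sample.int(length(r), n_min)])
  })
  keep <- sort(unname(unlist(keep)))
  data[keep, , drop = FALSE]
}

# Toy data with the counts from the question's table (not the real df)
counts <- data.frame(
  Gender  = c(0, 0, 0, 1, 1, 1),
  COUNTRY = c(1, 2, 3, 1, 2, 3),
  n       = c(86, 81, 282, 21, 7, 23)
)
df <- counts[rep(seq_len(nrow(counts)), counts$n), c("Gender", "COUNTRY")]

set.seed(2023)  # arbitrary seed, only for reproducibility
balanced <- balance_by_group(df, flag = "Gender", group = "COUNTRY")
table(balanced$Gender, balanced$COUNTRY)
```

With these counts the resulting table has 21, 7 and 23 in both rows, matching the expected result in the question. Unlike the code above, this version trims whichever class is larger in a group, so it also works when the 1s outnumber the 0s.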
