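I'm working with a dataset in R and trying to calculate the number of children each woman has, based on her relationship to the household head. The dataset includes variables such as household ID, individual ID, relationship to the household head, age, gender, and income.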

 HouseholdID IndividualID Relationshiptothehouseholdhead  Age Gender  Income
    <dbl>  <dbl> <chr>        <dbl> <chr> <dbl>
 1      1      1 C               80 male      150
 2      1      2 D               81 female      120
 3      1      3 A               60 male      630
 4      1      4 B               59 female      500
 5      1      5 E3              35 male      380
 6      1      6 F3              30 female      220
 7      1      7 E5              33 female      170
 8      1      8 F5              30 male      160
 9      1      9 G32             20 female      290
10      1     10 G51             15 female      200
11      1     11 G52             12 female      100
12      1     12 G55              8 male       80
13      2      1 A               58 male      380
14      2      2 B               55 female      220
15      2      3 E1              35 male      170
16      2      4 F1              37 female      160
17      2      5 E2              33 male      290
18      2      6 F2              30 female      110
19      2      7 G21             17 female      210
20      2      8 G22             15 female      750
21      2      9 G23             12 female      350

The data structure shown in the table includes the following variables:

Household ID: This is a unique identifier for a family household.
Individual ID: This is a unique number assigned to each individual within the household.
Relationship to the household head: Specific symbols are used to represent the relationship of an individual to the head of the household.

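  • "A" refers to the household head;
  • "B" refers to the spouse of the household head;
  • "C" refers to the father of the household head;
  • "D" refers to the mother of the household head;
  • For the children of the household head, "E1" is used for the first child, "E2" for the second child, and so on; "F1" is used for the spouse of the first child, "F2" for the spouse of the second child, and so on.
  • For grandchildren, "G11" denotes the first child of the first child (E1), "G12" the second child of the first child (E1), "G21" the first child of the second child (E2), and so on; "H11" denotes the spouse of grandchild G11, and so on.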
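As a rough illustration of the coding scheme above, the following base R sketch splits a code into its letter and its indices. The helper name `parse_rel` and the column names are hypothetical, and the sketch assumes every index is a single digit, as in the sample data.

```r
# Hypothetical helper (not from the original post): split a relationship
# code such as "G32" into its letter and up to two single-digit indices.
parse_rel <- function(code) {
  data.frame(
    code     = code,
    letter   = substr(code, 1, 1),
    gen2_idx = as.integer(substr(code, 2, 2)),  # NA when the code has no digit
    gen3_idx = as.integer(substr(code, 3, 3)),  # NA unless it is a G/H code
    stringsAsFactors = FALSE
  )
}

parsed <- parse_rel(c("A", "B", "E3", "F3", "G32", "G55"))
print(parsed)
```

Under this reading, "G32" is the second child of the household head's third child.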

Age: The age of the individual.
Gender: The gender of the individual, represented by "male" or "female".
Income: The income situation of the individual.

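Please generate a dataset similar to Table 2 from the data in Table 1, meeting the following requirements:

  1. Include only female individuals.
  2. Calculate the number of children each woman has given birth to.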

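Note that the number of children is determined by the highest number following the letter, not by simply counting the observations in the data. For example, in household 1, the individual with ID 4 should be treated as having 5 children, not 2.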
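To make the rule concrete, here is a minimal base R sketch using the relationship codes of household 1. The variable names are illustrative only.

```r
# Relationship codes of household 1, copied from the table above.
codes <- c("C", "D", "A", "B", "E3", "F3", "E5", "F5",
           "G32", "G51", "G52", "G55")

# Children of the household head are the "E" codes.
child_codes <- codes[substr(codes, 1, 1) == "E"]

n_observed <- length(child_codes)                         # rows present in the data
n_by_index <- max(as.integer(substr(child_codes, 2, 2)))  # highest child index

print(c(observed = n_observed, by_index = n_by_index))
```

Only E3 and E5 appear in the data, so counting rows gives 2, while the highest index gives the 5 children the question asks for.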

The result should look like this:

HouseholdID IndividualID  Age Gender  Income  Numofkids
1 2 81  female 120 1
1 4 59  female 500 5
1 6 30  female 220 2
1 7 33  female 170 3
1 9 20  female 290 0
1 10  15  female 200 0
1 11  12  female 100 0
2 2 55  female 220 2
2 4 37  female 160 0
2 6 30  female 110 3
2 7 17  female 210 0
2 8 15  female 750 0
2 9 12  female 350 0

Here is the data:

data = structure(list(HouseholdID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2), IndividualID = c(1, 2, 3, 4, 
5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9), Relationshiptothehouseholdhead = c("C", 
"D", "A", "B", "E3", "F3", "E5", "F5", "G32", "G51", "G52", "G55", 
"A", "B", "E1", "F1", "E2", "F2", "G21", "G22", "G23"), Age = c(80, 
81, 60, 59, 35, 30, 33, 30, 20, 15, 12, 8, 58, 55, 35, 37, 33, 
30, 17, 15, 12), Gender = c("male", "female", "male", "female", 
"male", "female", "female", "male", "female", "female", "female", 
"male", "male", "female", "male", "female", "male", "female", 
"female", "female", "female"), Income = c(150, 120, 630, 500, 
380, 220, 170, 160, 290, 200, 100, 80, 380, 220, 170, 160, 290, 
110, 210, 750, 350)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -21L))

Thanks!

Recommended answer

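From a coding perspective the question is clear, even if the exercise itself involves simplifications and assumptions. However, I assume your comment about using the numbering rather than counting also extends to the children of the household head's children, so I get a different answer from yours for one individual. She is E5, the mother of G51, G52 and G55, which implies 5 children, not 3.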

library(data.table)
dt <- as.data.table(data)

# split the household relationship column into 3 for ease
dt[, let(hh_rel   = substr(Relationshiptothehouseholdhead,1,1),
         gen2_idx = as.integer(substr(Relationshiptothehouseholdhead,2,2)),
         gen3_idx = as.integer(substr(Relationshiptothehouseholdhead,3,3)))]

# family tree lookup tables (household head is gen1)
gen2 <- dt[,
           .(gen2_max = max(gen2_idx, na.rm=TRUE)),
           by=HouseholdID]

gen3 <- dt[!is.na( gen2_idx),
           .(gen3_max = max(gen3_idx, na.rm=TRUE)),
           by=.(HouseholdID, gen2_idx)]

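This gives the number of children of each household head (generation 2), and the number of children (generation 3) of each generation-2 member (or their spouse) in the household: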

> gen2
   HouseholdID gen2_max
         <num>    <int>
1:           1        5
2:           2        2

> gen3
   HouseholdID gen2_idx gen3_max
         <num>    <int>    <int>
1:           1        3        2
2:           1        5        5
3:           2        1       NA
4:           2        2        3
   

Build the output from these lookups:

# women
out <- dt[Gender == "female",
          .(HouseholdID,
            IndividualID,
            Age,
            Gender,
            Income,
            hh_rel,
            gen2_idx)]

# if mother of HH, count 1 child
out[hh_rel == "D", Numofkids := 1L]

# if HH or wife of HH, highest 2nd gen index
out[hh_rel %in% c("A", "B"), Numofkids := gen2[.SD, on=.(HouseholdID), gen2_max]]

# if daughter or daughter-in-law of HH, highest 3rd gen index among own 2nd gen index
out[hh_rel %in% c("E", "F"), Numofkids := gen3[.SD, on=.(HouseholdID, gen2_idx), gen3_max]]

# otherwise 0
out[is.na(Numofkids), Numofkids := 0L]

# drop cols
out[, let(hh_rel = NULL, gen2_idx = NULL)]

Output:

> out
    HouseholdID IndividualID   Age Gender Income Numofkids
          <num>        <num> <num> <char>  <num>     <int>
 1:           1            2    81 female    120         1
 2:           1            4    59 female    500         5
 3:           1            6    30 female    220         2
 4:           1            7    33 female    170         5
 5:           1            9    20 female    290         0
 6:           1           10    15 female    200         0
 7:           1           11    12 female    100         0
 8:           2            2    55 female    220         2
 9:           2            4    37 female    160         0
10:           2            6    30 female    110         3
11:           2            7    17 female    210         0
12:           2            8    15 female    750         0
13:           2            9    12 female    350         0
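As a cross-check, the same rules can be sketched in base R without data.table. This is not the answer's implementation, only an illustrative rewrite under the same assumptions (single-digit indices, highest index counts as the number of children). The names `df`, `max0` and `count_kids` are hypothetical.

```r
# Reduced copy of the question's data (only the columns needed here).
df <- data.frame(
  HouseholdID  = c(rep(1, 12), rep(2, 9)),
  IndividualID = c(1:12, 1:9),
  rel = c("C", "D", "A", "B", "E3", "F3", "E5", "F5", "G32", "G51", "G52", "G55",
          "A", "B", "E1", "F1", "E2", "F2", "G21", "G22", "G23"),
  Gender = c("male", "female", "male", "female", "male", "female", "female",
             "male", "female", "female", "female", "male",
             "male", "female", "male", "female", "male", "female",
             "female", "female", "female"),
  stringsAsFactors = FALSE
)

letter <- substr(df$rel, 1, 1)
i2 <- as.integer(substr(df$rel, 2, 2))  # generation-2 index, NA if absent
i3 <- as.integer(substr(df$rel, 3, 3))  # generation-3 index, NA if absent

# Highest index in x, or 0 when x has no non-missing values.
max0 <- function(x) if (all(is.na(x))) 0L else max(x, na.rm = TRUE)

# Number of children for the person in row k, following the answer's rules.
count_kids <- function(k) {
  same_hh <- df$HouseholdID == df$HouseholdID[k]
  if (letter[k] == "D") return(1L)                       # mother of the head
  if (letter[k] %in% c("A", "B")) return(max0(i2[same_hh]))
  if (letter[k] %in% c("E", "F")) {
    return(max0(i3[same_hh & !is.na(i2) & i2 == i2[k]]))
  }
  0L                                                     # everyone else
}

females   <- which(df$Gender == "female")
Numofkids <- vapply(females, count_kids, integer(1))
print(cbind(df[females, c("HouseholdID", "IndividualID")], Numofkids))
```

The resulting counts should agree with the data.table output above, including 5 children for individual 7 in household 1.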
