How to count the number of children for each woman in an R dataset

Posted on April 3

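I am working with a dataset in R and am trying to count the number of children each woman has, based on her relationship to the household head. The dataset includes variables such as household ID, individual ID, relationship to the household head, age, gender, and income.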

 HouseholdID IndividualID Relationshiptothehouseholdhead  Age Gender  Income
    <dbl>  <dbl> <chr>        <dbl> <chr> <dbl>
 1      1      1 C               80 male      150
 2      1      2 D               81 female      120
 3      1      3 A               60 male      630
 4      1      4 B               59 female      500
 5      1      5 E3              35 male      380
 6      1      6 F3              30 female      220
 7      1      7 E5              33 female      170
 8      1      8 F5              30 male      160
 9      1      9 G32             20 female      290
10      1     10 G51             15 female      200
11      1     11 G52             12 female      100
12      1     12 G55              8 male       80
13      2      1 A               58 male      380
14      2      2 B               55 female      220
15      2      3 E1              35 male      170
16      2      4 F1              37 female      160
17      2      5 E2              33 male      290
18      2      6 F2              30 female      110
19      2      7 G21             17 female      210
20      2      8 G22             15 female      750
21      2      9 G23             12 female      350

The data structure shown in the table includes the following variables:

Household ID: This is a unique identifier for a family household.
Individual ID: This is a unique number assigned to each individual within the household.
Relationship to the household head: Specific symbols are used to represent the relationship of an individual to the head of the household.

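"A" is the household head;
"B" is the household head's spouse;
"C" is the household head's father;
"D" is the household head's mother;
For the household head's children and their descendants, "E1" denotes the first child, "E2" the second child, and so on; "F1" denotes the spouse of the first child, "F2" the spouse of the second child, and so on.
For grandchildren, "G11" denotes the first child of the first child (E1), "G12" the second child of the first child (E1), "G21" the first child of the second child (E2), and so on; "H11" denotes the spouse of grandchild G11, and so on.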

Age: The age of the individual.
Gender: The gender of the individual, represented by "male" or "female".
Income: The income situation of the individual.
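To make the coding scheme concrete, here is a minimal sketch of how a relationship code could be split into its letter and its indices. The helper name `parse_rel` and the column names `role`, `gen2_idx`, and `gen3_idx` are hypothetical, and the sketch assumes every index is a single digit, as in the sample data.

```r
# Hypothetical helper (not part of the original post): split a relationship
# code such as "G32" into its letter and its one-digit indices.
# Assumption: each index is a single digit (1-9).
parse_rel <- function(code) {
  data.frame(
    code     = code,
    role     = substr(code, 1, 1),                 # "A", "B", "E", "G", ...
    gen2_idx = as.integer(substr(code, 2, 2)),     # which child of the head
    gen3_idx = as.integer(substr(code, 3, 3)),     # which grandchild
    stringsAsFactors = FALSE
  )
}

# Codes without a second or third character get NA for the missing index
parsed <- parse_rel(c("A", "E3", "F5", "G32"))
print(parsed)
```

Here "G32" is read as the second child of the head's third child, so `gen2_idx` is 3 and `gen3_idx` is 2.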

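Based on the data in Table 1, please generate a dataset like Table 2 that meets the following requirements:

Include only female individuals.
Count the number of children each woman has given birth to.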

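Note that the number of children is determined by the highest number following the letter, not by simply counting the observations in the data. For example, in household 1, the individual with ID 4 should be treated as having 5 children, not 2.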
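The difference between the two counts can be illustrated with the household head's children in household 1, who appear in the data as "E3" and "E5". This is only a sketch of the rule, assuming single-digit indices; the variable names are made up for illustration.

```r
# Codes of the household head's children that appear in household 1
codes <- c("E3", "E5")

# Counting rows gives the number of children observed in the data
n_observed <- length(codes)

# Taking the highest index gives the number of children under the rule above
n_children <- max(as.integer(substr(codes, 2, 2)))

print(c(observed = n_observed, by_rule = n_children))
```

Only two children are observed, but the highest index is 5, so the spouse of the household head (ID 4) is counted as having 5 children.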

The result should look like this:

HouseholdID IndividualID  Age Gender  Income  Numofkids
1 2 81  female 120 1
1 4 59  female 500 5
1 6 30  female 220 2
1 7 33  female 170 5
1 9 20  female 290 0
1 10  15  female 200 0
1 11  12  female 100 0
2 2 55  female 220 2
2 4 37  female 160 0
2 6 30  female 110 3
2 7 17  female 210 0
2 8 15  female 750 0
2 9 12  female 350 0

Here is the data:

data = structure(list(HouseholdID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2), IndividualID = c(1, 2, 3, 4, 
5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9), Relationshiptothehouseholdhead = c("C", 
"D", "A", "B", "E3", "F3", "E5", "F5", "G32", "G51", "G52", "G55", 
"A", "B", "E1", "F1", "E2", "F2", "G21", "G22", "G23"), Age = c(80, 
81, 60, 59, 35, 30, 33, 30, 20, 15, 12, 8, 58, 55, 35, 37, 33, 
30, 17, 15, 12), Gender = c("male", "female", "male", "female", 
"male", "female", "female", "male", "female", "female", "female", 
"male", "male", "female", "male", "female", "male", "female", 
"female", "female", "female"), Income = c(150, 120, 630, 500, 
380, 220, 170, 160, 290, 200, 100, 80, 380, 220, 170, 160, 290, 
110, 210, 750, 350)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -21L))

Thanks!

library(data.table)
dt <- as.data.table(data)

# split the household relationship column into 3 for ease
# (let() is an alias for `:=`() in recent data.table releases)
dt[, let(hh_rel   = substr(Relationshiptothehouseholdhead, 1, 1),
         gen2_idx = as.integer(substr(Relationshiptothehouseholdhead, 2, 2)),
         gen3_idx = as.integer(substr(Relationshiptothehouseholdhead, 3, 3)))]

# family tree lookup tables (household head is gen1)
gen2 <- dt[, .(gen2_max = max(gen2_idx, na.rm = TRUE)), by = HouseholdID]
gen3 <- dt[!is.na(gen2_idx), .(gen3_max = max(gen3_idx, na.rm = TRUE)),
           by = .(HouseholdID, gen2_idx)]

> gen2
   HouseholdID gen2_max
         <num>    <int>
1:           1        5
2:           2        2
> gen3
   HouseholdID gen2_idx gen3_max
         <num>    <int>    <int>
1:           1        3        2
2:           1        5        5
3:           2        1       NA
4:           2        2        3

# women
out <- dt[Gender == "female",
          .(HouseholdID, IndividualID, Age, Gender, Income, hh_rel, gen2_idx)]

# if mother of HH, count 1 child
out[hh_rel == "D", Numofkids := 1L]

# if HH or wife of HH, highest 2nd gen index
out[hh_rel %in% c("A", "B"),
    Numofkids := gen2[.SD, on = .(HouseholdID), gen2_max]]

# if daughter or daughter-in-law of HH, highest 3rd gen index among own 2nd gen index
out[hh_rel %in% c("E", "F"),
    Numofkids := gen3[.SD, on = .(HouseholdID, gen2_idx), gen3_max]]

# otherwise 0
out[is.na(Numofkids), Numofkids := 0L]

# drop cols
out[, let(hh_rel = NULL, gen2_idx = NULL)]

> out
    HouseholdID IndividualID   Age Gender Income Numofkids
          <num>        <num> <num> <char>  <num>     <int>
 1:           1            2    81 female    120         1
 2:           1            4    59 female    500         5
 3:           1            6    30 female    220         2
 4:           1            7    33 female    170         5
 5:           1            9    20 female    290         0
 6:           1           10    15 female    200         0
 7:           1           11    12 female    100         0
 8:           2            2    55 female    220         2
 9:           2            4    37 female    160         0
10:           2            6    30 female    110         3
11:           2            7    17 female    210         0
12:           2            8    15 female    750         0
13:           2            9    12 female    350         0
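As a cross-check, the same rule can be sketched in base R on a small subset of household 1. This is not the data.table code above but a simplified variant: it assumes single-digit indices, looks only at "E" rows for the head and spouse and only at "G" rows for children and children-in-law, and leaves out the special case for the head's mother ("D"). The helper `max_or_zero` is a made-up name.

```r
# Small illustrative subset of household 1 (rows taken from the question's data)
df <- data.frame(
  HouseholdID  = 1,
  IndividualID = c(4, 5, 6, 7, 9, 12),
  Relationshiptothehouseholdhead = c("B", "E3", "F3", "E5", "G32", "G55"),
  Gender = c("female", "male", "female", "female", "female", "male"),
  stringsAsFactors = FALSE
)

rel  <- df$Relationshiptothehouseholdhead
role <- substr(rel, 1, 1)
i2   <- as.integer(substr(rel, 2, 2))  # NA when the code has no second character
i3   <- as.integer(substr(rel, 3, 3))  # NA unless the row is a grandchild

# Hypothetical helper: highest index in x, or 0 when x is empty
max_or_zero <- function(x) if (length(x)) max(x) else 0L

df$Numofkids <- vapply(seq_len(nrow(df)), function(r) {
  hh <- df$HouseholdID == df$HouseholdID[r]
  if (role[r] %in% c("A", "B")) {
    # head or spouse: highest child index in the same household
    max_or_zero(i2[hh & role == "E"])
  } else if (role[r] %in% c("E", "F")) {
    # child or child-in-law: highest grandchild index under the same child index
    max_or_zero(i3[hh & role == "G" & i2 == i2[r]])
  } else {
    0L
  }
}, integer(1))

print(df[df$Gender == "female", c("IndividualID", "Numofkids")])
```

On this subset the women with IDs 4, 6, 7, and 9 get 5, 2, 5, and 0 children, which agrees with the corresponding rows of the data.table output.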
