I have the following toy data frame:
df <- data.frame(
  product = c("apple", "banana", "cherry", "durian", "eggplant", "fuyu"),
  ingredients = c("flour|fibre|500", "sugar|500", "505|wheat|flavouring", "fibre(500)|eggs", "wholegrainrice|sesameoil", "500|fibre|500"),
  stringsAsFactors = FALSE
)
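My goal is to detect whether fibre appears in a product's ingredients, count the number of times it appears, and extract the values used to record fibre in the ingredients.
For the purposes of this analysis, fibre can be represented in the ingredients as "fibre", "500", or "fibre(500)".
My current code is: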
library(tidyverse)
fibre_strings_to_check <- c("fibre", "500", "fibre\\(500\\)")
df2 <- df %>%
  mutate(
    fibre_present = str_detect(ingredients, paste(fibre_strings_to_check, collapse = "|")),
    fibre_count = str_count(ingredients, paste(fibre_strings_to_check, collapse = "|")),
    fibre_used = str_extract_all(ingredients, paste(fibre_strings_to_check, collapse = "|"))
  )
This gives the following output for `df2`:
| product | ingredients | fibre_present | fibre_count | fibre_used |
|----------|-----------------------------|---------------|-------------|-------------------|
| apple | flour\|fibre\|500 | TRUE | 2 | fibre, 500 |
| banana | sugar\|500 | TRUE | 1 | 500 |
| cherry | 505\|wheat\|flavouring | FALSE | 0 | |
| durian | fibre(500)\|eggs | TRUE | 2 | fibre, 500 |
| eggplant | wholegrainrice\|sesameoil | FALSE | 0 | |
| fuyu | 500\|fibre\|500 | TRUE | 3 | 500, fibre, 500 |
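The problem is with the "durian" product. I would like "fibre(500)" to count as a single value/instance of fibre, as defined in `fibre_strings_to_check`. However, because it also matches the other fibre strings in `fibre_strings_to_check` ("fibre" and "500"), it is counted as two values/instances of fibre.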
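The double count can be reproduced in isolation on the single "durian" string. This is only a minimal sketch, assuming the regex engine tries the alternatives in the combined pattern from left to right and stops at the first one that matches; the names `x`, `pattern_original`, and `pattern_reordered` are made up for illustration.

```r
library(stringr)

# The problematic ingredient string from the "durian" row
x <- "fibre(500)|eggs"

# Same order as fibre_strings_to_check: "fibre" is tried first and matches,
# then "500" is matched separately further along the string
pattern_original <- paste(c("fibre", "500", "fibre\\(500\\)"), collapse = "|")
matches_original <- str_extract_all(x, pattern_original)[[1]]
print(matches_original)   # "fibre" "500"

# Assumption being tested: listing the longest alternative first lets
# "fibre(500)" be consumed as one match
pattern_reordered <- paste(c("fibre\\(500\\)", "fibre", "500"), collapse = "|")
matches_reordered <- str_extract_all(x, pattern_reordered)[[1]]
print(matches_reordered)  # "fibre(500)"
```

If that assumption holds, the order of the strings in the pattern seems to be what causes the split, and putting the longest alternative first may be one way around it, though I am not sure it is the most robust approach.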
My expected output for `df2` is:
| product | ingredients | fibre_present | fibre_count | fibre_used |
|----------|-----------------------------|---------------|-------------|-------------------|
| apple | flour\|fibre\|500 | TRUE | 2 | fibre, 500 |
| banana | sugar\|500 | TRUE | 1 | 500 |
| cherry | 505\|wheat\|flavouring | FALSE | 0 | |
| durian | fibre(500)\|eggs | TRUE | 1 | fibre(500) |
| eggplant | wholegrainrice\|sesameoil | FALSE | 0 | |
| fuyu | 500\|fibre\|500 | TRUE | 3 | 500, fibre, 500 |
How do I adjust the script so that there is no double counting of what is intended to be a single value?