How to scrape this in R

Posted on May 23

My goal is to take the HTML and create a data frame with 10 rows (10 columns would also work), one per item. I have already asked ChatGPT for help.

Suppose I have this HTML (from https://www.history.com/shows/alone/cast/wyatt-black):

library(rvest)
contestant <- '    <p><strong>Here are the ten items Wyatt selected to bring on his survival journey to the bone-chilling temperatures of Northern Saskatchewan, Canada:</strong></p>
    <p>1. Cooking Pot</p>
    <p><p>2. Axe</p>
    <p><p>3. Saw</p>
    <p><p>4. Ferro Rod</p>
    <p><p>5. Sleeping Bag</p>
    <p><p>6. Snare Wire</p>
    <p><p>7. Paracord</p>
    <p><p>8. Fishing Line and Hooks</p>
    <p><p>9. Bow and Arrows</p>
    <p><p>10. Multitool</p>
    <p>'

contestant_html <- read_html(contestant)

Then I can scrape it with the following commands:

contestant_items <- html_nodes(contestant_html, xpath = '//p[starts-with(text(), "1.")]/following-sibling::p')
item_list <- html_text(contestant_items[1:10])

The contents of item_list are:

item_list
 [1] ""                "2. Axe"          ""                "3. Saw"          ""               
 [6] "4. Ferro Rod"    ""                "5. Sleeping Bag" ""                "6. Snare Wire"

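There are two problems: first, the first item is not included; second, some of the entries are blank.

How can I improve the scraping code to handle these problems?

Relatedly, how should I handle a list that does not start with numbers (from https://www.history.com/shows/alone/cast/brooke-and-dave-whipple)?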

contestant2 <- '<p><strong>Here are the ten items Brooke and Dave selected to bring on their survival journey to Vancouver Island:</strong></p>
<ul>
<li>Bow saw</li>
<li>Pot &#8211; vintage aluminum coffee pot, 2 quarts</li>
<li>Tarp &#8211; 12&#8242; x 12&#8242;</li>
<li>Bar of Soap</li>
<li>Rations</li>
<li>Ax &#8211; full-sized felling ax</li>
<li>Tarp &#8211; 12&#8242; x 12&#8242;</li>
<li>Fishing line and hooks</li>
<li>Pan</li>
<li>Rations</li>
</ul>'

Recommended answer

library(rvest)

url1 <- "https://www.history.com/shows/alone/cast/wyatt-black"
url2 <- "https://www.history.com/shows/alone/cast/brooke-and-dave-whipple"

# Numbered list: the items sit in every other <p> starting at position 9
page <- read_html(url1)
page %>%
  html_elements(xpath = "/html/body/div[1]/div[2]/div/div/article/p[position() >= 9 and position() mod 2 = 1]") %>%
  html_text()
# [1] "1. Cooking Pot"            "2. Axe"
# [3] "3. Saw"                    "4. Ferro Rod"
# [5] "5. Sleeping Bag"           "6. Snare Wire"
# [7] "7. Paracord"               "8. Fishing Line and Hooks"
# [9] "9. Bow and Arrows"         "10. Multitool"

# Bulleted list: the items are the <li> children of the article's <ul>
page2 <- read_html(url2)
page2 %>%
  html_elements(xpath = "/html/body/div[1]/div[2]/div/div/article/ul/li") %>%
  html_text()
# [1] "Bow saw"
# [2] "Pot – vintage aluminum coffee pot, 2 quarts"
# [3] "Tarp – 12′ x 12′"
# [4] "Bar of Soap"
# [5] "Rations"
# [6] "Ax – full-sized felling ax"
# [7] "Tarp – 12′ x 12′"
# [8] "Fishing line and hooks"
# [9] "Pan"
# [10] "Rations"
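The two problems in the question have a common cause in the markup and the XPath. The following-sibling axis never includes the matched node itself, so the "1." paragraph is skipped, and each stray "<p><p>" pair is parsed as an empty paragraph followed by the real one, which produces the blank strings. The absolute XPath above also depends on the exact page layout. As an alternative, here is a minimal sketch that works directly on the HTML strings: it takes every <p>, keeps only the ones whose text starts with a number and a dot, and uses the <li> entries instead when the page has a real list. The helper name extract_items and the shortened HTML snippets are hypothetical, introduced only for illustration, and the sketch assumes the live pages follow the same patterns as the snippets in the question.

```r
library(rvest)

# Shortened copies of the question's two snippets (assumption: the real pages
# follow the same patterns, including the stray "<p><p>" pairs).
numbered_html <- '<p><strong>Here are the ten items Wyatt selected:</strong></p>
<p>1. Cooking Pot</p>
<p><p>2. Axe</p>
<p><p>3. Saw</p>
<p>'

bulleted_html <- '<p><strong>Here are the ten items Brooke and Dave selected:</strong></p>
<ul>
<li>Bow saw</li>
<li>Bar of Soap</li>
</ul>'

# Hypothetical helper: return the item names from either layout.
extract_items <- function(html) {
  doc <- read_html(html)

  # Bulleted layout: every <li> is an item.
  li <- html_text(html_elements(doc, "li"), trim = TRUE)
  if (length(li) > 0) {
    return(li)
  }

  # Numbered layout: keep only paragraphs that start with "<digits>. ".
  # This drops the intro paragraph and the empty paragraphs in one step.
  p <- html_text(html_elements(doc, "p"), trim = TRUE)
  p <- p[grepl("^[0-9]+\\.\\s", p)]

  # Strip the leading number so only the item name remains.
  sub("^[0-9]+\\.\\s*", "", p)
}

# One row per item, as asked for in the question.
items <- extract_items(numbered_html)
items_df <- data.frame(
  item_no = seq_along(items),
  item = items,
  stringsAsFactors = FALSE
)

bulleted_items <- extract_items(bulleted_html)
```

Filtering on the text rather than on element positions means the result does not change if extra paragraphs are added above the list, though it would also pick up any unrelated paragraph that happens to start with a number and a dot.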
