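My goal is to take some HTML and build a data frame with 10 rows (columns would work too), one row per item. I have already tried asking ChatGPT for help.
Suppose I have this HTML (from https://www.history.com/shows/alone/cast/wyatt-black):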
library(rvest)
contestant <- ' <p><strong>Here are the ten items Wyatt selected to bring on his survival journey to the bone-chilling temperatures of Northern Saskatchewan, Canada:</strong></p>
<p>1. Cooking Pot</p>
<p><p>2. Axe</p>
<p><p>3. Saw</p>
<p><p>4. Ferro Rod</p>
<p><p>5. Sleeping Bag</p>
<p><p>6. Snare Wire</p>
<p><p>7. Paracord</p>
<p><p>8. Fishing Line and Hooks</p>
<p><p>9. Bow and Arrows</p>
<p><p>10. Multitool</p>
<p>'
contestant_html <- read_html(contestant)
Then I can scrape it with the following commands:
contestant_items <- html_nodes(contestant_html, xpath = '//p[starts-with(text(), "1.")]/following-sibling::p')
item_list <- html_text(contestant_items[1:10])
The contents of item_list are:
item_list
[1] "" "2. Axe" "" "3. Saw" ""
[6] "4. Ferro Rod" "" "5. Sleeping Bag" "" "6. Snare Wire"
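There are two problems. First, the first item is not included. Second, some of the entries are blank.
How can I improve the scraping code to handle these problems?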
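One possible direction, offered as a minimal sketch rather than a definitive fix: select every <p> node instead of only the siblings that follow the first item, then keep only the paragraphs whose trimmed text starts with a number. The names all_paragraphs, numbered, and items_df are hypothetical, and the sketch assumes every item has the form "N. Name".

```r
library(rvest)

# Same HTML as in the question, including the stray <p> tags.
contestant <- ' <p><strong>Here are the ten items Wyatt selected to bring on his survival journey to the bone-chilling temperatures of Northern Saskatchewan, Canada:</strong></p>
<p>1. Cooking Pot</p>
<p><p>2. Axe</p>
<p><p>3. Saw</p>
<p><p>4. Ferro Rod</p>
<p><p>5. Sleeping Bag</p>
<p><p>6. Snare Wire</p>
<p><p>7. Paracord</p>
<p><p>8. Fishing Line and Hooks</p>
<p><p>9. Bow and Arrows</p>
<p><p>10. Multitool</p>
<p>'
contestant_html <- read_html(contestant)

# Take the text of every <p>, so the first item is not skipped.
all_paragraphs <- html_text(html_nodes(contestant_html, "p"), trim = TRUE)

# Keep only paragraphs that start with "<digits>."; this drops the intro
# sentence and the empty paragraphs created by the stray <p> tags.
numbered <- all_paragraphs[grepl("^[0-9]+\\.", all_paragraphs)]

# Strip the leading "N. " so only the item name remains (assumed format).
items_df <- data.frame(
  item = sub("^[0-9]+\\.[[:space:]]*", "", numbered),
  stringsAsFactors = FALSE
)
```

Filtering on the text pattern sidesteps both problems at once, because it does not depend on where the first item sits relative to its siblings or on how the parser handles the doubled <p> tags.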
Relatedly, how should I handle a list that does not start with numbers (from https://www.history.com/shows/alone/cast/brooke-and-dave-whipple)?
contestant2 <- '<p><strong>Here are the ten items Brooke and Dave selected to bring on their survival journey to Vancouver Island:</strong></p>
<ul>
<li>Bow saw</li>
<li>Pot – vintage aluminum coffee pot, 2 quarts</li>
<li>Tarp – 12′ x 12′</li>
<li>Bar of Soap</li>
<li>Rations</li>
<li>Ax – full-sized felling ax</li>
<li>Tarp – 12′ x 12′</li>
<li>Fishing line and hooks</li>
<li>Pan</li>
<li>Rations</li>
</ul>'
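For this bulleted variant, a sketch under the assumption that each item is always its own <li> element: selecting the li nodes directly avoids any dependence on numbering. The names contestant2_html, items2, and items2_df are hypothetical.

```r
library(rvest)

# Same HTML as in the question.
contestant2 <- '<p><strong>Here are the ten items Brooke and Dave selected to bring on their survival journey to Vancouver Island:</strong></p>
<ul>
<li>Bow saw</li>
<li>Pot – vintage aluminum coffee pot, 2 quarts</li>
<li>Tarp – 12′ x 12′</li>
<li>Bar of Soap</li>
<li>Rations</li>
<li>Ax – full-sized felling ax</li>
<li>Tarp – 12′ x 12′</li>
<li>Fishing line and hooks</li>
<li>Pan</li>
<li>Rations</li>
</ul>'
contestant2_html <- read_html(contestant2)

# Each <li> holds exactly one item, so no text filtering is needed here.
items2 <- html_text(html_nodes(contestant2_html, "li"), trim = TRUE)

# One row per item; duplicates (two tarps, two rations) are kept as listed.
items2_df <- data.frame(item = items2, stringsAsFactors = FALSE)
```

If a page could use either layout, one option would be to try the li selector first and fall back to filtering numbered paragraphs when it returns nothing, though that has not been tested against the live pages.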