I have the following abbreviated HTML (html_file.html):

<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<div class="listing-wrapper__content">
<section class="card__amenities ">
<p class="l-text l-u-color-neutral-28 l-text--variant-body-small l-text--weight-regular card__amenity" itemprop="floorSize"><span data-testid="l-icon" role="document" aria-label="Tamanho do imóvel" class="l-icon l-u-color-undefined"><svg viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">...</svg></span> 94 - 100 m² </p>
<p class="l-text l-u-color-neutral-28 l-text--variant-body-small l-text--weight-regular card__amenity" itemprop="numberOfRooms"><span data-testid="l-icon" role="document" aria-label="Quantidade de quartos" class="l-icon l-u-color-undefined"><svg viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">...</svg></span> 3 </p>
<p class="l-text l-u-color-neutral-28 l-text--variant-body-small l-text--weight-regular card__amenity" itemprop="numberOfBathroomsTotal"><span data-testid="l-icon" role="document" aria-label="Quantidade de banheiros" class="l-icon l-u-color-undefined"><svg viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">...</svg></span>3</p>
<p class="l-text l-u-color-neutral-28 l-text--variant-body-small l-text--weight-regular card__amenity"><span data-testid="l-icon" role="document" aria-label="Quantidade de vagas de garagem" class="l-icon l-u-color-undefined"><svg viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">...</svg></span>2</p>
</section>
</div>
</body>
</html>

I managed to extract the first three elements. For example:

library(rvest)
pagee <- read_html("html_file.html") 
nofrooms <- html_elements(pagee, ".listing-wrapper__content")%>%html_nodes("[itemprop='numberOfRooms']")%>%html_text()
nofrooms

The output is

" 3 "

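The problem is with the last p tag. As far as I can tell, it has no attribute (such as itemprop) that I can use to extract information from it. I have tried the following, without success: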

nofgarage <- html_elements(pagee, ".listing-wrapper__content")%>%html_nodes("[aria-label='Quantidade de vagas de garagem']")%>%html_text()
nofgarage

The output is

""

As expected, the result is empty, because the value I want to extract is not inside the span tag.

Thanks for your help.

Recommended answer

Since most listings appear to have 4 amenities, one could use the xml_child() function from xml2 to select the fourth one by position.
A few listings are missing the 4th amenity, so we need to filter before attempting to extract it.
See the comments below.

library(rvest)
library(xml2)
library(dplyr)

url <- "https://www.zapimoveis.com.br/venda/apartamentos/ms+campo-grande/?transacao=venda&onde=,Mato%20Grosso%20do%20Sul,Campo%20Grande,,,,,city,BR%3EMato%20Grosso%20do%20Sul%3ENULL%3ECampo%20Grande,-20.464852,-54.621848,&tipos=apartamento_residencial&pagina=1"

#read page
pagee <- read_html(url)

#get the amenities section from each listing
sections <- html_elements(pagee, "section.card__amenities ")
#sections %>% html_elements("p") %>% html_text()

#create a vector for the results; listings without a 4th amenity keep the initial 0
garages <- vector("numeric", length=length(sections))

#retrieve the text of the 4th child node - not all listings have 4 amenities, hence the filter
garages[xml_length(sections)==4] <- sapply(sections[xml_length(sections)==4], function(node) 
                                   {xml_child(node, 4) %>% html_text()})

#print the final vector
garages
# [1] "2" "4" "1" "1" "1" "1" "0" "2" "2" "2" "3" "1" "1" "1" "0"
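
As an alternative to selecting by position, one could select the p tag that contains the garage span, which should not depend on how many amenities a listing has. The following is only a minimal sketch: the two-listing HTML is a hypothetical, shortened version of the markup in the question (the second listing has no garage entry), and it assumes rvest 1.0 or later for minimal_html() and html_element().

```r
library(rvest)

# Hypothetical sample modelled on the question's markup (not the live page).
# The second listing deliberately lacks the garage amenity.
html <- '
<div class="listing-wrapper__content">
  <section class="card__amenities">
    <p itemprop="numberOfRooms"><span aria-label="Quantidade de quartos"></span> 3 </p>
    <p><span aria-label="Quantidade de vagas de garagem"></span>2</p>
  </section>
  <section class="card__amenities">
    <p itemprop="numberOfRooms"><span aria-label="Quantidade de quartos"></span> 1 </p>
  </section>
</div>'

pagee <- minimal_html(html)
sections <- html_elements(pagee, "section.card__amenities")

# Within each section, find the <p> that has the garage <span> as a child.
# html_element() (singular) gives one result per section, missing when absent.
garage_p <- html_element(
  sections,
  xpath = ".//p[span[@aria-label='Quantidade de vagas de garagem']]"
)

# Missing nodes yield NA text, so listings without a garage become NA
garages <- as.integer(html_text(garage_p, trim = TRUE))
garages
```

With this sample, the first listing should give 2 and the second NA, so a missing garage is distinguishable from a listing that reports 0 spaces. Whether the aria-label text is stable across the real site is an assumption that would need checking against the live page.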
