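I have been working with the rvest package and have a question about extracting URLs from a list. My goal is to build a data frame with the columns Country, City, and the city's URL. I already have a data frame of the countries and a list of the cities for each country.

My question is: how do I reference each city so that I can get its URL? I tried to target the href inside the td cells of the table with class "wikitable sortable jquery-tablesorter", but when I run links = webpage %>% html_node("href") %>% html_text(), I only get the main URL.

Thanks for any suggestions!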

library(tidyverse)
library(rvest)

# Get URL
url = "https://en.wikipedia.org/wiki/List_of_towns_and_cities_with_100,000_or_more_inhabitants/country:_A-B"

# Read the HTML code from the website
page = read_html(url)

# Get name of the countries
countries = page %>% html_nodes(".mw-headline") %>% html_text()

#Remove the last two items which are not countries
countries = as_tibble(countries) %>%
  slice(1:(n()-2))

#Add row number to each Country to left_join later
countries = rowid_to_column(countries, "column_label")

# Get cities for that country
# Still working on this since it includes the first table and I get blanks when I filter the html_nodes(".jquery-tablesorter td")
tables = html_nodes(page, "table")
tables = lapply(tables, html_table)

#Remove first element which is not a city, only on the first page
tables = tables[-1]

#---WIP
# Get links for the cities, currently picks the main domain instead of the city
# Can I add a clause before the html node to indicate I want the href from "wikitable sortable jquery-tablesorter"?
links = page %>% html_attr("href") %>% html_text()
#---

#Remove the Province and Population columns and keep City (the URL still has to be added)
tables = lapply(tables, "[", -c(2, 3))

#Standardize City as the column
tables = map(tables, set_names, "City")

# Flatten List
all <- bind_rows(tables, .id = "column_label") %>%
  mutate(column_label = as.integer(column_label)) %>%
  left_join(countries, by = "column_label")

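Recommended answer

One way to achieve your desired result could look like the following. I took a different approach and use a small custom function that scrapes the table rows to get what you want: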

library(tidyverse)
library(rvest)

# Get a dataframe of city names and urls for one table
get_cities <- function(x) {
  x %>%
    html_nodes("tr") %>%
    .[-1] %>%
    # Get first column/cell containing city
    html_node("td a") %>%
    map_dfr(function(x) {
      data.frame(
        city = html_text(x),
        url = paste0("https://en.wikipedia.org",
                     html_attr(x, 'href'))
      )
    })
}

url <- "https://en.wikipedia.org/wiki/List_of_towns_and_cities_with_100,000_or_more_inhabitants/country:_A-B"

# Read the HTML code from the website
webpage <- read_html(url)

# Get name of the countries
countries <- webpage %>%
  html_nodes(".mw-headline") %>%
  html_text()
countries <- countries[!grepl("(See also|References)", countries)]
# Get table nodes
tables <- webpage %>%
  html_nodes("table.wikitable.sortable")
names(tables) <- countries

res <- map_dfr(tables, get_cities, .id = "country")

head(res)
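#>       country      city                                     url
#> 1 Afghanistan    Ghazni    https://en.wikipedia.org/wiki/Ghazni
#> 2 Afghanistan     Herat     https://en.wikipedia.org/wiki/Herat
#> 3 Afghanistan Jalalabad https://en.wikipedia.org/wiki/Jalalabad
#> 4 Afghanistan     Kabul     https://en.wikipedia.org/wiki/Kabul
#> 5 Afghanistan  Kandahar  https://en.wikipedia.org/wiki/Kandahar
#> 6 Afghanistan     Khost     https://en.wikipedia.org/wiki/Khost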
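
As for why the original attempt returned nothing useful: a CSS selector such as "href" looks for an element named href, while href is an attribute of the a elements. The sketch below illustrates this on a tiny, made-up HTML snippet (the snippet, the object names, and the use of xml2::url_absolute() in place of paste0() are assumptions for illustration, not part of the original page or answer). It selects the link in the first cell of each table row and then reads its href attribute.

```r
library(rvest)
library(xml2)

# Hypothetical miniature of the page: one link outside the table and two
# city rows inside a table carrying the classes mentioned in the question.
html <- '
<html><body>
  <a href="/wiki/Main_Page">Main page</a>
  <table class="wikitable sortable">
    <tr><th>City</th><th>Province</th></tr>
    <tr><td><a href="/wiki/Ghazni">Ghazni</a></td>
        <td><a href="/wiki/Ghazni_Province">Ghazni</a></td></tr>
    <tr><td><a href="/wiki/Herat">Herat</a></td>
        <td>Herat</td></tr>
  </table>
</body></html>'

page <- read_html(html)

# "href" as a selector matches <href> elements, which do not exist,
# so the result is an empty node set.
no_match <- html_nodes(page, "href")

# Select the anchors in the first cell of each table row only.
anchors <- html_nodes(page, "table.wikitable tr td:first-child a")

# Read the link text and the href attribute; url_absolute() resolves the
# relative paths against the site root.
cities <- data.frame(
  city = html_text(anchors),
  url  = url_absolute(html_attr(anchors, "href"), "https://en.wikipedia.org/"),
  stringsAsFactors = FALSE
)
```

Restricting the selector to td:first-child keeps links from other columns (here the province link) out of the result; whether that matches every table on the real page would need to be checked.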
