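I've been working with the rvest package and have a question about extracting URLs from a list. My goal is to build a data frame with the following columns: Country, City, and the city's URL. I already have a data frame of the countries and a list of the cities for each country.
My question: how do I reference each city so that I get its own URL? I tried to target the href inside the td cells of the "wikitable sortable jquery-tablesorter" table, but when I run links = webpage %>% html_node("href") %>% html_text() I only get the main URL.
Thanks for any advice!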
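# Packages used below: rvest for scraping, tidyverse for tibble/dplyr/purrr helpers
library(rvest)
library(tidyverse)
# Get URL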
url = "https://en.wikipedia.org/wiki/List_of_towns_and_cities_with_100,000_or_more_inhabitants/country:_A-B"
# Read the HTML code from the website
page = read_html(url)
# Get name of the countries
countries = page %>% html_nodes(".mw-headline") %>% html_text()
#Remove the last two items which are not countries
countries = as_tibble(countries) %>%
  slice(1:(n()-2))
#Add row number to each Country to left_join later
countries = rowid_to_column(countries, "column_label")
# Get cities for that country
# Still working on this since it includes the first table and I get blanks when I filter the html_nodes(".jquery-tablesorter td")
tables = html_nodes(page, "table")
tables = lapply(tables, html_table)
#Remove the first element, which is not a city table (only on the first page)
tables = tables[-1]
#---WIP
# Get links for the cities, currently picks the main domain instead of the city
# Can I add a clause before the html node to indicate I want the href from "wikitable sortable jquery-tablesorter"?
links = page %>% html_attr("href") %>% html_text()
#---
#Remove the Province and Population columns, keeping City (and, once it works, URL)
tables = lapply(tables, "[", -c(2, 3))
#Standardize City as the column
tables = map(tables, set_names, "City")
# Flatten List
all <- bind_rows(tables, .id = "column_label") %>%
  mutate(column_label = as.integer(column_label)) %>%
  left_join(countries, by = "column_label")
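
For reference, here is a minimal sketch of the usual way to pull an href with rvest: select the `<a>` elements with a CSS selector first, then read the attribute with html_attr("href") (html_text() returns the link text, not the link target). Everything in the HTML fragment below is a made-up placeholder standing in for one country table, and the selector "table.wikitable td:first-child a" is an assumption about the page layout (city link in the first cell of each row). As far as I know, the jquery-tablesorter class is added by JavaScript in the browser, so it would not be present in what read_html() downloads, which may explain the blanks.

```r
library(rvest)

# Hypothetical fragment standing in for one country table (placeholder data)
html <- '<html><body>
<table class="wikitable sortable">
<tr><th>City</th><th>Province</th><th>Population</th></tr>
<tr><td><a href="/wiki/City_A">City A</a></td><td><a href="/wiki/Province_X">Province X</a></td><td>1</td></tr>
<tr><td><a href="/wiki/City_B">City B</a></td><td><a href="/wiki/Province_Y">Province Y</a></td><td>2</td></tr>
</table>
</body></html>'

page <- read_html(html)

# Select the <a> elements in the first cell of each row (assumed to be the city)
city_links <- html_nodes(page, "table.wikitable td:first-child a")

cities <- data.frame(
  City = html_text(city_links),                       # visible link text
  # hrefs on Wikipedia are relative, so prepend the site root
  URL  = paste0("https://en.wikipedia.org", html_attr(city_links, "href")),
  stringsAsFactors = FALSE
)

print(cities)
```

On the real page, the same selection could be run once per table (for example with lapply over html_nodes(page, "table.wikitable")) so that the resulting list stays aligned with the tables list used for the country join.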