This is a follow-up to my question here.

The code provided there does produce the desired output, but there seems to be a problem when a page does not exist. I am trying to use tryCatch to skip those errors and continue.

For example, I specify the dates with the following:

month <- c('02')
year <- c('2024')
day <- c('220','270','280')   
team <- c('CHI')

This works fine, because Chicago played at home on each of those days, so the following URLs are all valid:

https://www.basketball-reference.com/boxscores/202402220CHI.html

https://www.basketball-reference.com/boxscores/202402270CHI.html

https://www.basketball-reference.com/boxscores/202402280CHI.html

But if I add another day and/or month, like this:

month <- c('02')
year <- c('2024')
day <- c('210','220','270','280')
team <- c('CHI')

then there is a problem: the Bulls did not have a home game on February 21st, so this URL does not exist:

https://www.basketball-reference.com/boxscores/202402210CHI.html

I tried adding this to the code:

page <- tryCatch(read_html(url), error = function(err) "error 404")

but then I get this message:

no applicable method for 'xml_find_first' applied to an object of class "character"

How can I skip the pages that do not exist and return results only for the pages that do?

Full code:

library(rvest)
library(dplyr)
library(tidyr)

##sample only - ultimately this will include all teams and all months and days
month <- c('02')
year <- c('2024')
day <- c('220','270','280')   
team <- c('CHI')

make_url <- function(team, year, month, day) {
   paste0('https://www.basketball-reference.com/boxscores/', year, month, day, team, '.html')
}

dates <- expand.grid(team = team, year = year, month = month, day = day)

urls <- dates |>
   mutate( url = make_url(team, year, month, day),
      team = team,
      date = paste(year, month, gsub('.{1}$', '', day), sep = '-'),
      .keep = 'unused'
   )
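
# for the sample vectors above, `urls` has one row per combination, e.g.:
#   url:  https://www.basketball-reference.com/boxscores/202402220CHI.html
#   team: CHI
#   date: 2024-02-22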

getPageTable <- function(url) {
   #read the page
   page <- read_html(url)

   #get the game's date
   gamedate <- page %>% html_element("div.scorebox_meta div") %>% html_text2()
   
   #get game title
   gameInfo <- page %>% html_elements("div.box h1") %>% html_text()
   #get the table headings
   headings <- page %>% html_elements("div.section_wrapper") %>% html_element("h2") %>% html_text()
   
   #find the quarter scores
   quarters <- grep("Q[1-4]", headings)
   
   #retrieve the tables from the page
   tables <- page %>% html_elements("div.section_wrapper") %>% html_element("table") 

   #select the desired headings and tables
   headings <- headings[quarters]
   tables <- tables[quarters] %>% html_table(header=FALSE)

   #add game date and team name/quarter to the results
   tables <- lapply(seq_along(tables), function(i) {
      #set column titles to second row
      names(tables[[i]]) <- tables[[i]][2,]
     tables[[i]] <- tables[[i]][-c(1:2),]  
      tables[[i]]$gamedate <- gamedate
      tables[[i]]$team <- headings[i]
      tables[[i]]$title <- gameInfo
      tables[[i]]
   })
   #merge the quarterly stats into 1 dataframe
   df <- bind_rows(tables)
   df <- df %>% filter(Starters != "Reserves" & Starters != "Team Totals")
   df
}


#loop through the URLs
dfs <- lapply(urls$url, getPageTable)
#merge into one big table
finalResult <- bind_rows(dfs)
finalResult <- finalResult %>% separate("team", into=c("team", "quarter"), " \\(")
finalResult$quarter <- sub("\\)", "", finalResult$quarter)

Answer

Here is a solution. Wrap the call to read_html in tryCatch and, if any error occurs, return the error condition object itself rather than a plain string. (Your attempt failed because tryCatch returned the character string "error 404", which the rest of the function then tried to treat as a parsed page, hence the "no applicable method for 'xml_find_first'" message.) Then test for that condition immediately after the read. That way you end up with a list containing both data frames (for the URLs that worked) and error objects (for the URLs that did not), and you can test which is which outside the function.

Here is the corrected function.

getPageTable <- function(url) {
  # read the page, returning the error condition if error 404 (or other)
  page <- tryCatch(
    read_html(url),
    error = function(e) e
  )
  if(inherits(page, "error")) {
    return(page)
  }
  # then continue as in the question's code 
  #get the game's date
  gamedate <- page %>% html_element("div.scorebox_meta div") %>% html_text2()
  
  #get game title
  gameInfo <- page %>% html_elements("div.box h1") %>% html_text()
  #get the table headings
  headings <- page %>% html_elements("div.section_wrapper") %>% html_element("h2") %>% html_text()
  
  #find the quarter scores
  quarters <- grep("Q[1-4]", headings)
  
  #retrieve the tables from the page
  tables <- page %>% html_elements("div.section_wrapper") %>% html_element("table") 
  
  #select the desired headings and tables
  headings <- headings[quarters]
  tables <- tables[quarters] %>% html_table(header=FALSE)
  
  #add game date and team name/quarter to the results
  tables <- lapply(seq_along(tables), function(i) {
    #set column titles to second row
    names(tables[[i]]) <- tables[[i]][2,]
    tables[[i]] <- tables[[i]][-c(1:2),]  
    tables[[i]]$gamedate <- gamedate
    tables[[i]]$team <- headings[i]
    tables[[i]]$title <- gameInfo
    tables[[i]]
  })
  #merge the quarterly stats into 1 dataframe
  df <- bind_rows(tables)
  df <- df %>% filter(Starters != "Reserves" & Starters != "Team Totals")
  df
}
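
A quick way to check the new behaviour on a single missing page is to call it on the February 21st URL from the question; inherits() should return TRUE, since the error condition is returned rather than thrown:

res <- getPageTable("https://www.basketball-reference.com/boxscores/202402210CHI.html")
inherits(res, "error")
#> TRUE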

Call the function above, check which return values are valid, and decide what to do with the errors. In the code below, the bad URLs and their corresponding errors are printed as messages.

#loop through the URLS
dfs <- lapply(urls$url, getPageTable)
# get which weren't read in
err <- sapply(dfs, inherits, what = "error")
# optional, make a list of the bad ones
dfs_err <- dfs[err]
# and print the URL's and error messages
for(i in which(err)) {
  urls$url[i] %>% message()
  dfs[[i]] %>%
    conditionMessage() %>%
    message()
}
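# for the missing Feb 21 page this prints something like (exact text may vary):
#   https://www.basketball-reference.com/boxscores/202402210CHI.html
#   HTTP error 404.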

# these are the good ones and the rest of the code is like in the question
dfs <- dfs[!err]
#merge into one big table
finalResult <- bind_rows(dfs)
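
As an aside, not part of the original answer: if you would rather not handle condition objects yourself, purrr::possibly() gives the same skip-on-error behaviour. It wraps the scraper so that failed URLs return a default value (NULL here), and bind_rows() silently drops NULL list elements. A minimal sketch, assuming the unmodified getPageTable() from the question:

library(purrr)

#wrap the scraper: on any error, return NULL instead of stopping
safe_getPageTable <- possibly(getPageTable, otherwise = NULL)

dfs <- lapply(urls$url, safe_getPageTable)
#NULL elements (pages that failed to load) are dropped by bind_rows()
finalResult <- bind_rows(dfs)

The trade-off is that the error messages are discarded, so the tryCatch approach above is preferable if you want to log which URLs failed and why.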
