This is a follow-up to my question here.

The code provided there does produce the desired output, but there seems to be a problem when a page does not exist. I am trying to use tryCatch to skip those errors and continue.

For example, I specify the dates with the following:

month <- c('02')
year <- c('2024')
day <- c('220','270','280')   
team <- c('CHI')

This works fine, because Chicago played at home on each of those days, so the following URLs are all valid:

https://www.basketball-reference.com/boxscores/202402220CHI.html

https://www.basketball-reference.com/boxscores/202402270CHI.html

https://www.basketball-reference.com/boxscores/202402280CHI.html

But if I add another day and/or month, like this:

month <- c('02')
year <- c('2024')
day <- c('210','220','270','280')
team <- c('CHI')

then there is a problem: the Bulls did not have a home game on February 21st, so this URL does not exist:

https://www.basketball-reference.com/boxscores/202402210CHI.html

I tried adding this to the code:

page <- tryCatch(read_html(url), error = function(err) "error 404")

but then I get this message:

no applicable method for 'xml_find_first' applied to an object of class "character"

How can I skip the pages that do not exist and return results only for the pages that do?

Full code:

library(rvest)
library(dplyr)
library(tidyr)

##sample only - ultimately this will include all teams and all months and days
month <- c('02')
year <- c('2024')
day <- c('220','270','280')   
team <- c('CHI')

make_url <- function(team, year, month, day) {
   paste0('https://www.basketball-reference.com/boxscores/', year, month, day, team, '.html')
}

dates <- expand.grid(team = team, year = year, month = month, day = day)

urls <- dates |>
   mutate( url = make_url(team, year, month, day),
      team = team,
      date = paste(year, month, gsub('.{1}$', '', day), sep = '-'),
      .keep = 'unused'
   )
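
# for the sample vectors above, `urls` has one row per combination, e.g.:
#   url:  https://www.basketball-reference.com/boxscores/202402220CHI.html
#   team: CHI
#   date: 2024-02-22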

getPageTable <- function(url) {
   #read the page
   page <- read_html(url)

   #get the game's date
   gamedate <- page %>% html_element("div.scorebox_meta div") %>% html_text2()
   
   #get game title
   gameInfo <- page %>% html_elements("div.box h1") %>% html_text()
   #get the table headings
   headings <- page %>% html_elements("div.section_wrapper") %>% html_element("h2") %>% html_text()
   
   #find the quarter scores
   quarters <- grep("Q[1-4]", headings)
   
   #retrieve the tables from the page
   tables <- page %>% html_elements("div.section_wrapper") %>% html_element("table") 

   #select the desired headings and tables
   headings <- headings[quarters]
   tables <- tables[quarters] %>% html_table(header=FALSE)

   #add game date and team name/quarter to the results
   tables <- lapply(seq_along(tables), function(i) {
      #set column titles to second row
      names(tables[[i]]) <- tables[[i]][2,]
     tables[[i]] <- tables[[i]][-c(1:2),]  
      tables[[i]]$gamedate <- gamedate
      tables[[i]]$team <- headings[i]
      tables[[i]]$title <- gameInfo
      tables[[i]]
   })
   #merge the quarterly stats into 1 dataframe
   df <- bind_rows(tables)
   df <- df %>% filter(Starters != "Reserves" & Starters != "Team Totals")
   df
}


#loop through the URLs
dfs <- lapply(urls$url, getPageTable)
#merge into one big table
finalResult <- bind_rows(dfs)
finalResult <- finalResult %>% separate("team", into=c("team", "quarter"), " \\(")
finalResult$quarter <- sub("\\)", "", finalResult$quarter)

Answer

Here is a solution. Wrap the call to read_html in tryCatch and, if any error occurs, return the error condition object itself rather than a plain string. (Your attempt failed because tryCatch returned the character string "error 404", which the rest of the function then tried to treat as a parsed page, hence the "no applicable method for 'xml_find_first'" message.) Then test for that condition immediately after the read. That way you end up with a list containing both data frames (for the URLs that worked) and error objects (for the URLs that did not), and you can test which is which outside the function.

Here is the corrected function.

getPageTable <- function(url) {
  # read the page, returning the error condition if error 404 (or other)
  page <- tryCatch(
    read_html(url),
    error = function(e) e
  )
  if(inherits(page, "error")) {
    return(page)
  }
  # then continue as in the question's code 
  #get the game's date
  gamedate <- page %>% html_element("div.scorebox_meta div") %>% html_text2()
  
  #get game title
  gameInfo <- page %>% html_elements("div.box h1") %>% html_text()
  #get the table headings
  headings <- page %>% html_elements("div.section_wrapper") %>% html_element("h2") %>% html_text()
  
  #find the quarter scores
  quarters <- grep("Q[1-4]", headings)
  
  #retrieve the tables from the page
  tables <- page %>% html_elements("div.section_wrapper") %>% html_element("table") 
  
  #select the desired headings and tables
  headings <- headings[quarters]
  tables <- tables[quarters] %>% html_table(header=FALSE)
  
  #add game date and team name/quarter to the results
  tables <- lapply(seq_along(tables), function(i) {
    #set column titles to second row
    names(tables[[i]]) <- tables[[i]][2,]
    tables[[i]] <- tables[[i]][-c(1:2),]  
    tables[[i]]$gamedate <- gamedate
    tables[[i]]$team <- headings[i]
    tables[[i]]$title <- gameInfo
    tables[[i]]
  })
  #merge the quarterly stats into 1 dataframe
  df <- bind_rows(tables)
  df <- df %>% filter(Starters != "Reserves" & Starters != "Team Totals")
  df
}
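
A quick way to check the new behaviour on a single missing page is to call it on the February 21st URL from the question; inherits() should return TRUE, since the error condition is returned rather than thrown:

res <- getPageTable("https://www.basketball-reference.com/boxscores/202402210CHI.html")
inherits(res, "error")
#> TRUE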

Call the function above, check which return values are valid, and decide what to do with the errors. In the code below, the bad URLs and their corresponding errors are printed as messages.

#loop through the URLS
dfs <- lapply(urls$url, getPageTable)
# get which weren't read in
err <- sapply(dfs, inherits, what = "error")
# optional, make a list of the bad ones
dfs_err <- dfs[err]
# and print the URL's and error messages
for(i in which(err)) {
  urls$url[i] %>% message()
  dfs[[i]] %>%
    conditionMessage() %>%
    message()
}
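# for the missing Feb 21 page this prints something like (exact text may vary):
#   https://www.basketball-reference.com/boxscores/202402210CHI.html
#   HTTP error 404.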

# these are the good ones and the rest of the code is like in the question
dfs <- dfs[!err]
#merge into one big table
finalResult <- bind_rows(dfs)
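
As an aside, not part of the original answer: if you would rather not handle condition objects yourself, purrr::possibly() gives the same skip-on-error behaviour. It wraps the scraper so that failed URLs return a default value (NULL here), and bind_rows() silently drops NULL list elements. A minimal sketch, assuming the unmodified getPageTable() from the question:

library(purrr)

#wrap the scraper: on any error, return NULL instead of stopping
safe_getPageTable <- possibly(getPageTable, otherwise = NULL)

dfs <- lapply(urls$url, safe_getPageTable)
#NULL elements (pages that failed to load) are dropped by bind_rows()
finalResult <- bind_rows(dfs)

The trade-off is that the error messages are discarded, so the tryCatch approach above is preferable if you want to log which URLs failed and why.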
