This is a follow-up to my question here.
The code provided there does produce the desired output, but there seems to be a problem when a page doesn't exist, and I'm trying to use tryCatch to skip those errors and continue.
For example, I specify all the dates using the following:
month <- c('02')
year <- c('2024')
day <- c('220','270','280')
team <- c('CHI')
This works fine, since Chicago played at home on all of those days, so the following URLs are all valid:
https://www.basketball-reference.com/boxscores/202402220CHI.html
https://www.basketball-reference.com/boxscores/202402270CHI.html
https://www.basketball-reference.com/boxscores/202402280CHI.html
But if I add another day and/or month, like this:
month <- c('02')
year <- c('2024')
day <- c('210','220','270','280')
team <- c('CHI')
The Bulls didn't play at home on February 21st or 24th, so this URL doesn't exist:
https://www.basketball-reference.com/boxscores/202402210CHI.html
I tried adding this to the code:
page <- tryCatch(read_html(url), error = function(err) "error 404")
But then I get this message:
no applicable method for 'xml_find_first' applied to an object of class "character"
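The message makes sense once you see what tryCatch returns on failure. A minimal offline sketch (the stop() call just simulates the 404; no network needed):

```r
# Simulate read_html() failing: the handler returns a plain string,
# so the result is no longer an xml_document, and any later
# html_element()/xml_find_first() call fails on it.
page <- tryCatch(stop("HTTP error 404."), error = function(err) "error 404")
class(page)                      # "character"
inherits(page, "xml_document")   # FALSE
```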
How can I skip the pages that don't exist and only return values for the pages that do?
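One pattern I'm considering (the fetch() function below is a made-up stand-in for the real scraper, not actual rvest code): have the error handler return NULL instead of a character sentinel, then drop the NULLs before binding.

```r
# fetch() is a hypothetical stand-in that errors the way read_html()
# does for a missing box score page.
fetch <- function(url) {
  if (grepl("202402210", url)) stop("HTTP error 404.")
  data.frame(url = url)
}

urls <- c("https://www.basketball-reference.com/boxscores/202402210CHI.html",
          "https://www.basketball-reference.com/boxscores/202402220CHI.html")

# Return NULL on failure instead of a string ...
results <- lapply(urls, function(u) tryCatch(fetch(u), error = function(err) NULL))
# ... then keep only the pages that parsed before binding.
results <- Filter(Negate(is.null), results)
df <- do.call(rbind, results)
nrow(df)  # 1
```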
Full code:
library(rvest)
library(dplyr)
library(tidyr)
##sample only - ultimately this will include all teams and all months and days
month <- c('02')
year <- c('2024')
day <- c('220','270','280')
team <- c('CHI')
make_url <- function(team, year, month, day) {
paste0('https://www.basketball-reference.com/boxscores/', year, month, day, team, '.html')
}
dates <- expand.grid(team = team, year = year, month = month, day = day)
urls <- dates |>
mutate( url = make_url(team, year, month, day),
team = team,
date = paste(year, month, gsub('.{1}$', '', day), sep = '-'),
.keep = 'unused'
)
getPageTable <- function(url) {
#read the page
page <- read_html(url)
#get the game's date
gamedate <- page %>% html_element("div.scorebox_meta div") %>% html_text2()
#get game title
gameInfo <- page %>% html_elements("div.box h1") %>% html_text()
#get the table headings
headings <- page %>% html_elements("div.section_wrapper") %>% html_element("h2") %>% html_text()
#find the quarter scores
quarters <- grep("Q[1|2|3|4]", headings)
#retrieve the tables from the page
tables <- page %>% html_elements("div.section_wrapper") %>% html_element("table")
#select the desired headings and tables
headings <- headings[quarters]
tables <- tables[quarters] %>% html_table(header=FALSE)
#add game date and team name/quarter to the results
tables <- lapply(1:length(tables), function(i) {
#set column titles to second row
names(tables[[i]]) <- tables[[i]][2,]
tables[[i]] <- tables[[i]][-c(1:2),]
tables[[i]]$gamedate <- gamedate
tables[[i]]$team <- headings[i]
tables[[i]]$title <- gameInfo
tables[[i]]
})
#merge the quarterly status into 1 dataframe
df <- bind_rows(tables)
df <- df %>% filter(Starters != "Reserves" & Starters != "Team Totals")
df
}
#loop through the URLS
dfs <- lapply(urls$url, getPageTable)
#merge into one big table
finalResult <- bind_rows(dfs)
finalResult <- finalResult %>% separate("team", into=c("team", "quarter"), " \\(")
finalResult$quarter <- sub("\\)", "", finalResult$quarter)
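Since the rest of the pipeline is already tidyverse, purrr::possibly() wraps a function the same way. This is a sketch assuming purrr is installed; get_one() below is a toy stand-in of my own, not the real page scraper:

```r
library(purrr)

# Toy function standing in for the scraper: fails on "bad" input.
get_one <- function(x) {
  if (x == "bad") stop("HTTP error 404.")
  data.frame(value = x)
}

safe_get <- possibly(get_one, otherwise = NULL)  # errors become NULL
dfs <- map(c("ok1", "bad", "ok2"), safe_get)
dfs <- compact(dfs)  # drop the NULLs
length(dfs)  # 2
```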