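Archive.org provides the Wayback CDX API for looking up captures; it returns timestamps and original URLs in tabular or JSON form. Such a query can be made with read.table() alone, and links to specific captures can then be built from the timestamp and original columns plus the base URL.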
read.table("https://web.archive.org/cdx/search/cdx?url=covid.cdc.gov/covid-data-tracker/&limit=5",
           col.names = c("urlkey","timestamp","original","mimetype","statuscode","digest","length"),
           colClasses = "character")
#> urlkey timestamp
#> 1 gov,cdc,covid)/covid-data-tracker 20200824224244
#> 2 gov,cdc,covid)/covid-data-tracker 20200825013347
#> 3 gov,cdc,covid)/covid-data-tracker 20200825024622
#> 4 gov,cdc,covid)/covid-data-tracker 20200825042657
#> 5 gov,cdc,covid)/covid-data-tracker 20200825050018
#> original mimetype statuscode
#> 1 https://covid.cdc.gov/covid-data-tracker/ text/html 200
#> 2 https://covid.cdc.gov/covid-data-tracker/ text/html 200
#> 3 https://covid.cdc.gov/covid-data-tracker/ text/html 200
#> 4 https://covid.cdc.gov/covid-data-tracker/ text/html 200
#> 5 https://covid.cdc.gov/covid-data-tracker/ text/html 200
#> digest length
#> 1 APS6SXNXBXCJU3P4N23WH4XCVDVZQYAD 5342
#> 2 XFEMFRGXIPWM4K5F6CBIYDSOFIGCUBQZ 5370
#> 3 TVQKZHRM452CFX4RIORWGSMK5PG3PAPR 5343
#> 4 XZDLPJ6EQIXEO4SUFQTFEX4S6SF7O4GT 5370
#> 5 A4J63TFU7HMZQE5KFTSLBD6EFNZ4IBZ4 5373
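As a rough sketch of the link-building step mentioned above, assuming a data frame shaped like the read.table() result: the names `captures` and `base_url` are only illustrative, and the two rows are copied by hand from the output above so the snippet runs without a network request.

```r
# Two rows copied from the output above; `captures` is an illustrative name
captures <- data.frame(
  timestamp = c("20200824224244", "20200825013347"),
  original  = "https://covid.cdc.gov/covid-data-tracker/",
  stringsAsFactors = FALSE
)

# Capture links follow the pattern <base URL>/<timestamp>/<original URL>
base_url <- "https://web.archive.org/web"
captures$link <- paste(base_url, captures$timestamp, captures$original, sep = "/")
captures$link[1]
#> [1] "https://web.archive.org/web/20200824224244/https://covid.cdc.gov/covid-data-tracker/"
```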
For more convenience, we can customize the API request with, for example, httr / httr2, and pass the response through a readr / dplyr / lubridate pipeline:
library(dplyr)
library(httr2)
library(readr)
archive_links <- request("https://web.archive.org/cdx/search/cdx") %>%
  # set query parameters
  req_url_query(
    url = "covid.cdc.gov/covid-data-tracker/",
    filter = "statuscode:200", # include only successful captures where the HTTP status code was 200
    collapse = "timestamp:8",  # limit to 1 capture per day by comparing the first 8 digits of the timestamp: <20200824>224244
    limit = 10,                # limit the number of returned values
    # output = "json"          # request JSON output, includes column names
  ) %>%
  req_perform() %>%
  # pass the HTTP response string to read_table() for parsing
  resp_body_string() %>%
  read_table(col_names = c("urlkey","timestamp","original","mimetype","statuscode","digest","length"),
             col_types = cols_only(timestamp = "c",
                                   original = "c",
                                   mimetype = "c",
                                   length = "i")) %>%
  # build the capture link and parse the timestamp into a date-time
  mutate(link = paste("https://web.archive.org/web", timestamp, original, sep = "/") %>% tibble::char(shorten = "front"),
         timestamp = lubridate::ymd_hms(timestamp)) %>%
  select(timestamp, link, length)
archive_links
#> # A tibble: 10 × 3
#> timestamp link length
#> <dttm> <char> <int>
#> 1 2020-08-24 22:42:44 …4224244/https://covid.cdc.gov/covid-data-tracker/ 5342
#> 2 2020-08-25 01:33:47 …5013347/https://covid.cdc.gov/covid-data-tracker/ 5370
#> 3 2020-08-26 02:37:09 …6023709/https://covid.cdc.gov/covid-data-tracker/ 5371
#> 4 2020-08-27 01:05:48 …7010548/https://covid.cdc.gov/covid-data-tracker/ 5703
#> 5 2020-08-28 02:23:26 …8022326/https://covid.cdc.gov/covid-data-tracker/ 31177
#> 6 2020-08-29 02:01:27 …9020127/https://covid.cdc.gov/covid-data-tracker/ 31237
#> 7 2020-08-30 00:06:31 …0000631/https://covid.cdc.gov/covid-data-tracker/ 31218
#> 8 2020-08-31 00:18:29 …1001829/https://covid.cdc.gov/covid-data-tracker/ 31640
#> 9 2020-09-01 02:30:30 …1023030/https://covid.cdc.gov/covid-data-tracker/ 31257
#> 10 2020-09-02 04:08:31 …2040831/https://covid.cdc.gov/covid-data-tracker/ 31654
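The one-capture-per-day result above comes from collapse = "timestamp:8". A small local sketch of the same idea, using the timestamps from the first read.table() output; it only mimics the prefix comparison in plain R and is not how the server implements it (the name `day` is illustrative):

```r
# Timestamps taken from the first example output above
ts <- c("20200824224244", "20200825013347", "20200825024622",
        "20200825042657", "20200825050018")

# Compare only the first 8 digits (YYYYMMDD) and keep the first capture of each day.
# For sorted timestamps this matches collapsing adjacent rows that share a prefix.
day <- substr(ts, 1, 8)
ts[!duplicated(day)]
#> [1] "20200824224244" "20200825013347"
```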
# first capture:
archive_links$link[1]
#> <pillar_char[1]>
#> [1] https://web.archive.org/web/20200824224244/https://covid.cdc.gov/covid-data-tracker/
Created on 2023-07-02 with reprex v2.0.2
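The commented-out output = "json" parameter in the request above switches the response to JSON, with the column names in the first row. A minimal sketch of turning such a response into a data frame, assuming the jsonlite package is installed; `json_txt` is a hand-written sample in that shape (values copied from the output above), where a real response body would come from resp_body_string():

```r
library(jsonlite)

# Hand-written sample in the shape of a CDX JSON response: an array of rows,
# the first of which holds the column names (values copied from the output above)
json_txt <- '[
  ["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
  ["gov,cdc,covid)/covid-data-tracker","20200824224244","https://covid.cdc.gov/covid-data-tracker/","text/html","200","APS6SXNXBXCJU3P4N23WH4XCVDVZQYAD","5342"],
  ["gov,cdc,covid)/covid-data-tracker","20200825013347","https://covid.cdc.gov/covid-data-tracker/","text/html","200","XFEMFRGXIPWM4K5F6CBIYDSOFIGCUBQZ","5370"]
]'

m <- fromJSON(json_txt)  # equal-length string arrays simplify to a character matrix
cdx <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(cdx) <- m[1, ]     # the first row carries the column names
cdx$length <- as.integer(cdx$length)
cdx[, c("timestamp", "length")]
```

With JSON output there is no need to spell out the column names by hand, at the cost of converting column types afterwards.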
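There are also some Archive.org client libraries for R, for example https://github.com/liserman/archiveRetriever and https://hrbrmstr.github.io/wayback/, though the query interface of the first is a bit odd and the other is currently not available through CRAN.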