I am working with the R programming language.

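For the following website (https://covid.cdc.gov/covid-data-tracker/), I am trying to get all versions of the site that are available on the Wayback Machine, along with the date and time of each capture. The final result should look something like this: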

         date                                                                                links     time
1 Jan-01-2023 https://web.archive.org/web/20230101000547/https://covid.cdc.gov/covid-data-tracker/ 00:05:47
2 Jan-01-2023 https://web.archive.org/web/20230101000557/https://covid.cdc.gov/covid-data-tracker/ 00:05:57

Here is what I have tried so far:

Step 1: First, I inspected the HTML source code (in "Elements") and copied/pasted it into a notepad file.

Step 2: Then, I imported this code into R and parsed the resulting HTML for the link structure:

file <- "cdc.txt"

text <- readLines(file)

html <- paste(text, collapse = "\n")

pattern1 <- '/web/\\d+/https://covid\\.cdc\\.gov/covid-data-tracker/[^"]+'

links <- regmatches(html, gregexpr(pattern1, html))[[1]]

But this is not working:

> links
character(0)

Can someone please show me if there is an easier way to do this?

Thanks!

Note:

  • I am trying to learn how to do this in general (e.g. for any website on the Wayback Machine; the Covid Data Tracker is just a placeholder example for this question)

  • I realize there might be more efficient ways to do this, and I am open to learning about different approaches!

Recommended answer

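Archive.org provides the Wayback CDX API for looking up captures; it returns timestamps and original URLs in tabular form or as JSON. Such a query can be made with read.table() alone, and links to specific captures can then be built from the timestamp and original columns plus a base URL.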

read.table("https://web.archive.org/cdx/search/cdx?url=covid.cdc.gov/covid-data-tracker/&limit=5", 
           col.names = c("urlkey","timestamp","original","mimetype","statuscode","digest","length"),
           colClasses = "character")
#>                              urlkey      timestamp
#> 1 gov,cdc,covid)/covid-data-tracker 20200824224244
#> 2 gov,cdc,covid)/covid-data-tracker 20200825013347
#> 3 gov,cdc,covid)/covid-data-tracker 20200825024622
#> 4 gov,cdc,covid)/covid-data-tracker 20200825042657
#> 5 gov,cdc,covid)/covid-data-tracker 20200825050018
#>                                    original  mimetype statuscode
#> 1 https://covid.cdc.gov/covid-data-tracker/ text/html        200
#> 2 https://covid.cdc.gov/covid-data-tracker/ text/html        200
#> 3 https://covid.cdc.gov/covid-data-tracker/ text/html        200
#> 4 https://covid.cdc.gov/covid-data-tracker/ text/html        200
#> 5 https://covid.cdc.gov/covid-data-tracker/ text/html        200
#>                             digest length
#> 1 APS6SXNXBXCJU3P4N23WH4XCVDVZQYAD   5342
#> 2 XFEMFRGXIPWM4K5F6CBIYDSOFIGCUBQZ   5370
#> 3 TVQKZHRM452CFX4RIORWGSMK5PG3PAPR   5343
#> 4 XZDLPJ6EQIXEO4SUFQTFEX4S6SF7O4GT   5370
#> 5 A4J63TFU7HMZQE5KFTSLBD6EFNZ4IBZ4   5373
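
The link-building step mentioned above can be sketched in base R. This is only an illustration: the captures data frame is typed in by hand from two rows of the output above so that the snippet runs without a network request, and the variable names captures and base_url are made up for this example.

```r
# Two sample rows copied from the CDX output above, so the sketch runs offline.
captures <- data.frame(
  timestamp = c("20200824224244", "20200825013347"),
  original  = "https://covid.cdc.gov/covid-data-tracker/",
  stringsAsFactors = FALSE
)

# Base URL of the Wayback Machine replay pages.
base_url <- "https://web.archive.org/web"

# A capture link has the form <base_url>/<timestamp>/<original>.
captures$links <- paste(base_url, captures$timestamp, captures$original, sep = "/")
captures$links
```

With the real data, the same paste() call would be applied to the data frame returned by read.table().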

To make this a bit more convenient, we can customize the API request with, for example, httr/httr2 and pass the response through a readr/dplyr/lubridate pipeline:

library(dplyr)
library(httr2)
library(readr)

archive_links <- request("https://web.archive.org/cdx/search/cdx") %>% 
  # set query parameters
  req_url_query(
    url      = "covid.cdc.gov/covid-data-tracker/",
    filter   = "statuscode:200", # include only successful captures where HTTP status code was 200
    collapse = "timestamp:8",    # limit to 1 capt. per day by comparing first 8 digits of timestamp: <20200824>224244
    limit    = 10,               # limit the number of returned values
    # output = "json"            # request json output, includes column names
  ) %>% 
  req_perform() %>%
  # pass http response string to read_table() for parsing
  resp_body_string() %>% 
  read_table(col_names = c("urlkey","timestamp","original","mimetype","statuscode","digest","length"),
             col_types = cols_only(timestamp = "c",
                                   original  = "c",
                                   mimetype  = "c",
                                   length    = "i")) %>% 
  mutate(link = paste("https://web.archive.org/web", timestamp, original, sep = "/") %>% tibble::char(shorten = "front"),
         timestamp = lubridate::ymd_hms(timestamp)) %>% 
  select(timestamp, link, length)
archive_links
#> # A tibble: 10 × 3
#>    timestamp           link                                               length
#>    <dttm>              <char>                                              <int>
#>  1 2020-08-24 22:42:44 …4224244/https://covid.cdc.gov/covid-data-tracker/   5342
#>  2 2020-08-25 01:33:47 …5013347/https://covid.cdc.gov/covid-data-tracker/   5370
#>  3 2020-08-26 02:37:09 …6023709/https://covid.cdc.gov/covid-data-tracker/   5371
#>  4 2020-08-27 01:05:48 …7010548/https://covid.cdc.gov/covid-data-tracker/   5703
#>  5 2020-08-28 02:23:26 …8022326/https://covid.cdc.gov/covid-data-tracker/  31177
#>  6 2020-08-29 02:01:27 …9020127/https://covid.cdc.gov/covid-data-tracker/  31237
#>  7 2020-08-30 00:06:31 …0000631/https://covid.cdc.gov/covid-data-tracker/  31218
#>  8 2020-08-31 00:18:29 …1001829/https://covid.cdc.gov/covid-data-tracker/  31640
#>  9 2020-09-01 02:30:30 …1023030/https://covid.cdc.gov/covid-data-tracker/  31257
#> 10 2020-09-02 04:08:31 …2040831/https://covid.cdc.gov/covid-data-tracker/  31654

# first capture:
archive_links$link[1]
#> <pillar_char[1]>
#> [1] https://web.archive.org/web/20200824224244/https://covid.cdc.gov/covid-data-tracker/

Created on 2023-07-02 with reprex v2.0.2
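
The question asks for separate date and time columns (for example Jan-01-2023 and 00:05:47). Below is a minimal base R sketch of that formatting step, assuming the timestamps are UTC values in YYYYMMDDhhmmss form as returned by the CDX API. The two sample timestamps are taken from the question, and the names timestamp, original, parsed and result are only illustrative.

```r
# Sample timestamps from the question's expected output (no network needed).
timestamp <- c("20230101000547", "20230101000557")
original  <- "https://covid.cdc.gov/covid-data-tracker/"

# Parse the 14-digit timestamps; tz = "UTC" keeps the clock time unchanged.
parsed <- as.POSIXct(timestamp, format = "%Y%m%d%H%M%S", tz = "UTC")

result <- data.frame(
  # month.abb gives English month abbreviations regardless of the locale,
  # unlike format(parsed, "%b").
  date  = paste(month.abb[as.integer(format(parsed, "%m"))],
                format(parsed, "%d-%Y"), sep = "-"),
  links = paste("https://web.archive.org/web", timestamp, original, sep = "/"),
  time  = format(parsed, "%H:%M:%S"),
  stringsAsFactors = FALSE
)
result
```

The same three expressions could also go inside the mutate() call of the pipeline above, as long as they run before timestamp is converted with lubridate::ymd_hms().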

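There are also a few Archive.org client libraries for R, e.g. https://github.com/liserman/archiveRetriever and https://hrbrmstr.github.io/wayback/, though the query interface of the first is a bit odd and the other is currently not available through CRAN.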
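
Since the question asks how to do this for any website, the read.table() approach can be wrapped in a small helper. This is a rough sketch rather than a tested client: cdx_query_url() and fetch_captures() are hypothetical names invented here, and only the query parameters already used in this answer (url, limit, filter, collapse) are supported.

```r
# Build a CDX query URL for an arbitrary site (hypothetical helper).
cdx_query_url <- function(url, limit = NULL, filter = NULL, collapse = NULL) {
  params <- list(url = url, limit = limit, filter = filter, collapse = collapse)
  # Drop the parameters that were not supplied.
  params <- params[!vapply(params, is.null, logical(1))]
  # Percent-encode each value; format() avoids scientific notation for large limits.
  values <- vapply(
    params,
    function(p) utils::URLencode(format(p, scientific = FALSE, trim = TRUE), reserved = TRUE),
    character(1)
  )
  paste0("https://web.archive.org/cdx/search/cdx?",
         paste0(names(params), "=", values, collapse = "&"))
}

# Fetch captures and add a link column (hypothetical helper, needs network access).
fetch_captures <- function(url, ...) {
  captures <- read.table(
    cdx_query_url(url, ...),
    col.names  = c("urlkey", "timestamp", "original", "mimetype",
                   "statuscode", "digest", "length"),
    colClasses = "character"
  )
  captures$links <- paste("https://web.archive.org/web",
                          captures$timestamp, captures$original, sep = "/")
  captures
}

# Only the URL is built here; the commented call below would perform the request.
query <- cdx_query_url("covid.cdc.gov/covid-data-tracker/", limit = 5)
query
# fetch_captures("covid.cdc.gov/covid-data-tracker/", limit = 5, filter = "statuscode:200")
```

Note that read.table() stops with an error when the response is empty, so a site without any captures would need extra handling.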
