Scraping JSON data from an API in R

Posted on April 17

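I am trying to collect public data on various police statistics from the relatively new nypdonline.org site, which can be accessed in several ways. The main ways to query this data are usually the online lookup tool at https://nypdonline.org/link/2, or directly with a unique Tax ID, for example: http://oip.nypdonline.org/view/1/@TAXID=938661.

Ideally, I would like an R script to go through a list of Tax IDs (these are the unique identifiers) and pull specific data points from the site. I am currently looking at "Total Arrests", because it would serve as a control in regression work, but if I can get one data point I should be able to get them all.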
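The loop over Tax IDs described above could be sketched roughly as follows. This is only an illustration: the URL pattern is the one quoted in the post, while `build_officer_url()`, `collect_by_tax_id()` and the `fetch_one` argument are hypothetical names, and the second Tax ID is made up.

```r
# Build the per-officer page URL from a Tax ID, following the URL pattern
# quoted in the post. sprintf() is vectorised, so a vector of IDs works too.
build_officer_url <- function(tax_id) {
  sprintf("http://oip.nypdonline.org/view/1/@TAXID=%s", tax_id)
}

# Apply a fetch function to every Tax ID and collect the results by ID.
# `fetch_one` is a placeholder: supply whatever function ends up retrieving
# the data for a single URL.
collect_by_tax_id <- function(tax_ids, fetch_one) {
  urls <- build_officer_url(tax_ids)
  results <- lapply(urls, function(u) {
    # Keep going if a single request fails; record NULL for that ID.
    tryCatch(fetch_one(u), error = function(e) NULL)
  })
  names(results) <- tax_ids
  results
}

# Tax IDs kept as strings; the second one is invented for illustration.
tax_ids <- c("938661", "123456")
urls <- build_officer_url(tax_ids)
print(urls)
```

Keeping the fetch step as an argument means the loop does not need to change once the right endpoint to query is known.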

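The main problem is that the site goes through a process of querying a second URL, from which I think it pulls the JSON data. My first effort was to use rvest against a single page, with html_nodes() to try to select CSS or XPath elements: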

library(rvest)
library(dplyr)
url <- "http://oip.nypdonline.org/view/1/@TAXID=938661"
webpage <- read_html(url)
total_arrests <- webpage %>%
  html_nodes("[various CSS / xpath elements attempted here]") %>%
  html_text() %>%
  as.numeric()

But this produces nothing. Inspecting the HTML, I can see these numbers embedded (see image one, and the number 201 within the code), but I cannot find a way to actually scrape them, presumably because their location changes dynamically given how the code is arranged, or because part of their data is a random identifier.

Using dev tools in Chrome I can see (image 2) that the site is sending a request to a secondary URL. I assume this is the URL I really need to query for JSON data.

My confusion is mainly about exactly how to do this, especially for a site that seems to be querying JSON data from an MSSQL database. If I visit the following URL directly, I can see what appear to be JSON parameters, including a likely variable for the unique Tax ID value I would need: http://oip.nypdonline.org/api/reports/

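But this looks like only the skeleton of the data, and I need a step that I cannot find and do not quite understand in order to actually query the API for specific data. I can easily pull the above into R, but what I end up with is essentially a JSON skeleton without any actual data: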

library(rvest)
library(jsonlite)
url <- "http://oip.nypdonline.org/api/reports/"
webpage <- read_html(url)
json_data <- html_text(html_nodes(webpage, "body"))
data_list <- fromJSON(json_data, flatten = TRUE)
View(data_list)

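This very helpfully gives me the JSON categories, under which the unique Tax ID value I would want to search for (also highlighted in image 3 above) is probably something like: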

json[[1]][["DataSource"]][["Parameters"]][[1]][["Name"]]

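But I am not quite sure how to use this to actually query the API. I would be glad to be pointed in the right direction, because I think there is a piece I am just not understanding.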
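To make the index path above concrete, here is a minimal sketch of walking such a skeleton with jsonlite. The JSON text is a hand-written stand-in: only the DataSource, Parameters and Name nesting comes from the post, and the parameter names and values (including "@TAXID") are assumptions rather than what the live endpoint returns.

```r
library(jsonlite)

# Hand-written stand-in for the report skeleton. Only the DataSource ->
# Parameters -> Name nesting comes from the post; the values are invented.
skeleton_txt <- '[
  {"DataSource": {"Parameters": [{"Name": "@TAXID", "Value": ""}]}},
  {"DataSource": {"Parameters": [{"Name": "@TAXID", "Value": ""},
                                 {"Name": "@OTHER", "Value": ""}]}}
]'

# simplifyVector = FALSE keeps everything as nested lists, so [[ ]] indexing
# behaves the same at every level.
json <- fromJSON(skeleton_txt, simplifyVector = FALSE)

# fromJSON() also accepts a URL, so against the live endpoint this could be
# (not run here):
# json <- fromJSON("http://oip.nypdonline.org/api/reports/", simplifyVector = FALSE)

# The index path from the post: first report, first parameter, its name.
json[[1]][["DataSource"]][["Parameters"]][[1]][["Name"]]

# Collect the parameter names of every report in the skeleton.
param_names <- lapply(json, function(report) {
  vapply(report[["DataSource"]][["Parameters"]],
         function(p) p[["Name"]], character(1))
})
print(param_names)
```

Listing every parameter name this way shows which reports expect a Tax ID, but it does not by itself reveal how the site submits those parameters.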

Recommended answer

> rjsoncons::j_pivot(json, as = "tibble")
# A tibble: 1 × 6
  Label                   ImageURL CodeTemplate Items  Interactions RelatedItems
  <chr>                   <chr>    <chr>        <list> <list>       <list>
1 "HERNANDEZ, GREGORY D … https:/… "HERNANDEZ,… <list> <list [0]>   <list [0]>

rjsoncons::j_pivot(json, "[0].Items", as = "tibble")
## # A tibble: 6 × 9
##   Id                Label Value CodeTemplate LabelAlignment LabelFont LabelColor
##   <chr>             <chr> <chr> <chr>        <chr>          <chr>     <list>
## 1 a2fded09-5439-4b… Rank: "POL… {Value}      text-right     12px Ari… <chr [1]>
## 2 20e891ce-1dcf-4d… Appo… "7/1… {Value}      text-right     12px Ari… <NULL>
## 3 1692f3bf-ed70-4b… Comm… "049… {Value}      text-right     12px Ari… <chr [1]>
## 4 8a2bcb6f-e064-44… Assi… "7/5… {Value}      text-right     12px Ari… <NULL>
## 5 0ec90f94-b636-47… Ethn… "HIS… {Value}      text-right     12px Ari… <chr [1]>
## 6 42f74dfc-ee54-4b… Shie… "264… {Value}      text-right     12px Ari… <NULL>
## # ℹ 2 more variables: ValueAlignment <chr>, ValueFont <chr>

rjsoncons::j_pivot(json, "[0].Items", as = "tibble") |>
    dplyr::select(Label, Value)
## A tibble: 6 × 2
##   Label             Value
##   <chr>             <chr>
## Rank:             "POLICE OFFICER"
## Appointment Date: "7/11/2005 12:00:00 AM"
## Command:          "049 PRECINCT"
## Assignment Date:  "7/5/2006 12:00:00 AM"
## Ethnicity:        "HISPANIC …
## Shield No:        "26400 "
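The same Label/Value extraction can also be sketched with jsonlite alone. This swaps in jsonlite for rjsoncons and runs on a hand-written stand-in shaped like the output above (one record whose Items hold Label/Value pairs); the record name is a placeholder, and the real response may contain more fields than shown here.

```r
library(jsonlite)

# Hand-written stand-in shaped like the response summarised above: an array
# with one record whose Items hold Label/Value pairs. The name is a
# placeholder, and only two of the six items are reproduced.
response_txt <- '[{
  "Label": "EXAMPLE, OFFICER",
  "Items": [
    {"Label": "Rank:",    "Value": "POLICE OFFICER"},
    {"Label": "Command:", "Value": "049 PRECINCT"}
  ]
}]'

# With the default simplification, the top-level array becomes a data frame
# and Items becomes a list-column holding one data frame per record.
response <- fromJSON(response_txt)
items <- response$Items[[1]][, c("Label", "Value")]

# Trim stray whitespace, since some values in the output above are padded.
items$Value <- trimws(items$Value)
print(items)
```

If a "Total Arrests" item is returned in the same Label/Value form, it could be picked out by filtering `items` on `Label`; whether it is depends on which report is queried.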
