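I am trying to collect public data on various police statistics from the relatively new nypdonline.org website, which can be accessed in several ways. The data is usually queried either through the online lookup tool at https://nypdonline.org/link/2 or directly with a unique Tax ID, for example: http://oip.nypdonline.org/view/1/@TAXID=938661
Ideally, I would like an R script to run through a list of Tax IDs (these are the unique identifiers) and pull specific data points from the site. I am currently looking at "Total Arrests", since it would serve as a control in my regression work, but if I can get one data point I should be able to get them all.
The main problem is that the site goes through a process of querying a second URL, from which I believe it pulls JSON data. My first attempt was to use rvest against a single page, using html_nodes() to try to select CSS or XPath elements: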
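For context, the loop I have in mind is roughly the following. It is only a sketch: `fetch_one` is a placeholder for the single-ID step I have not worked out yet, and the function name, the `pause` argument, and the column names are my own inventions rather than anything from the site.

```r
# Sketch of the outer loop over Tax IDs.
# `fetch_one` is a hypothetical function that takes one Tax ID and
# returns a single number (e.g. Total Arrests); it is NOT implemented here.
collect_total_arrests <- function(tax_ids, fetch_one, pause = 1) {
  rows <- lapply(tax_ids, function(id) {
    # Record NA instead of aborting the whole run if one ID fails
    value <- tryCatch(fetch_one(id), error = function(e) NA_real_)
    Sys.sleep(pause)  # wait between requests to avoid hammering the server
    data.frame(tax_id = id, total_arrests = value, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)  # one row per Tax ID
}
```

With a working `fetch_one`, something like `collect_total_arrests(c("938661"), fetch_one)` should give a data frame I can join onto my regression data.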
library(rvest)
library(dplyr)
url <- "http://oip.nypdonline.org/view/1/@TAXID=938661"
webpage <- read_html(url)
total_arrests <- webpage %>%
  html_nodes("[various CSS / xpath elements attempted here]") %>%
  html_text() %>%
  as.numeric()
But this returns nothing. Inspecting the HTML, I can see the numbers embedded (see image one, and the number 201 within the code), but I cannot find a way to scrape them, presumably because the way the page is built means their location changes dynamically, or because part of their markup is a random identifier.
Using the dev tools in Chrome, I can see (image 2) that the site is sending a request to a secondary URL. I assume this is the URL I actually need to query for the JSON data.
My confusion may simply be about how to do this, especially for a site that seems to be serving JSON data from an MSSQL database. If I visit the following URL directly, I can see what appear to be JSON parameters, including a likely variable for the unique Tax ID value I would need: http://oip.nypdonline.org/api/reports/
But this looks like only the skeleton of the data, and there is a step I cannot find and do not really understand for actually querying the API to get specific data. I can easily pull the above into R, but what I end up with is essentially the JSON skeleton without any actual data:
library(rvest)
library(jsonlite)
url <- "http://oip.nypdonline.org/api/reports/"
webpage <- read_html(url)
json_data <- html_text(html_nodes(webpage, "body"))
data_list <- fromJSON(json_data, flatten = TRUE)
View(data_list)
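This very helpfully gives me the JSON categories, among which the one for the unique Tax ID value I want to search on (also highlighted in image 3 above) might be something like: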
json[[1]][["DataSource"]][["Parameters"]][[1]][["Name"]]
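But I am not sure how to use this to actually query the API. I would be glad to be pointed in the right direction, as I think there is one element I am simply not understanding.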
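To make the missing step concrete, this is my rough guess at what the query might look like, using httr and jsonlite. Everything specific here is an assumption on my part: I do not know the real endpoint path beyond `/api/reports/`, whether the site expects a POST, or what the request body looks like. The `Parameters`/`Name`/`Value` layout and the `@TAXID` name are only inferred from the view URL and the JSON skeleton above, and would need to be replaced with whatever the Network tab in dev tools actually shows.

```r
# Requires the httr and jsonlite packages to be installed.

# ASSUMPTION: placeholder endpoint; the real report path is unknown to me.
api_url <- "http://oip.nypdonline.org/api/reports/"

# ASSUMPTION: hypothetical request body, modelled on the
# DataSource -> Parameters -> Name structure seen in the skeleton.
build_body <- function(tax_id) {
  list(Parameters = list(
    list(Name = "@TAXID", Value = as.character(tax_id))
  ))
}

# Send the body as JSON and parse the JSON response into an R list.
fetch_report <- function(tax_id, endpoint = api_url) {
  resp <- httr::POST(endpoint, body = build_body(tax_id), encode = "json")
  httr::stop_for_status(resp)  # raise an R error on 4xx/5xx responses
  txt <- httr::content(resp, as = "text", encoding = "UTF-8")
  jsonlite::fromJSON(txt, flatten = TRUE)
}
```

If the real request turns out to be a GET with query parameters instead, I assume only `fetch_report` would need to change, which is why I kept the body construction separate.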