enter image description herei正在编写代码,以便从一个项目的reddit帖子中抓取帖子标题、 comments 和作者姓名. 我可以在网上抓取帖子的标题,作者的名字,但 comments 没有被正确提取.
如果帖子上有31条 comments ,则每条 comments 将被提取31次.下面的代码可供参考:
# load packages
library(RSelenium)
library(netstat)
# start the server
rs_driver_object <- rsDriver(browser = 'firefox',verbose = FALSE, port = free_port(), chromever = NULL)
# create a client object
remDr <- rs_driver_object$client
# open a browser
remDr$open()
# maximize window
remDr$maxWindowSize()
remDr$navigate("https://www.reddit.com/r/AnimeReviews/comments/essf1u/assassination_classroom_is_a_1010_the_charm_the/")
Sys.sleep(2)
# scroll to the end of the webpage
remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
Sys.sleep(2)
remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
load_more_comments <- remDr$findElement(using = 'xpath', '//*[@id="comment-tree"]/faceplate-partial/div[1]/button')
load_more_comments$clickElement()
#load_more_comments$refresh()
#pickup title
title <- remDr$findElement(using = 'xpath', '//*[@id="main-content"]/shreddit-title')$getElementAttribute('title')
#comments
comment_list <- remDr$findElements(using = 'tag name', 'shreddit-comment')
#print(typeof(comment_list))
for (each_comment in comment_list) {
print(paste("Author --->", each_comment$getElementAttribute('author')))
p_tags <- each_comment$findElements(using = "xpath", value = ".//div[3]/div/p")
# Extract and print the text from each <p> tag
for (p_tag in p_tags) {
print(p_tag$getElementText())
}
}
请参阅下面的屏幕截图:
我不知道为什么它不能只工作一次. 似乎有一些问题是关于如何
P_tag<;-each_Comment$findElements(Using="XPath",Value=".//div[3]/div/p") 正在起作用
参考上面的代码,我使用RSelum在R中try 了Web抓取.我试图删掉Reddit的 comments ,但它们出现了多次,而不是一次.