I am building a web crawler for a school's web page. I use Puppeteer to get the client-side rendered HTML. However, I ran into a problem during development.
My code:
const puppeteer = require("puppeteer");

async function scrapeData(url) {
  console.log("Target URL: ", url);
  const browser = await puppeteer.launch({ headless: "new" });
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // wait for client-side loading
    await page.waitForSelector(".tit");
    // get texts from html. ignore this code.
    const titles = await page.$$eval(".tit a", (elements) => {
      return elements.map((element) => element.textContent);
    });
    console.log("before click");
    // click element which has ".tit" class.
    // that element have onclick event-listener (checked with chrome manually)
    // however, this code throws timeout exception from `page.waitForNavigation()`
    await Promise.all([page.waitForNavigation(), page.click(".tit")]);
    console.log("navigation success.");
    const newUrl = page.url();
    const result = {
      titles,
      newUrl,
    };
    return result;
  } finally {
    await browser.close();
  }
}

const targetUrl = "https://kau.ac.kr/web/pages/gc32172b.do";
scrapeData(targetUrl)
  .then((result) => {
    console.log("Scraped Titles:", result.titles);
    console.log("New URL after click:", result.newUrl);
  })
  .catch((error) => console.error("Error during scraping:", error));
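Summary of my code:
- Puppeteer opens a browser and goes to "https://kau.ac.kr/web/pages/gc32172b.do".
- It waits for rendering, then clicks the element that has the '.tit' class.
- When a client clicks the '.tit' element, the browser navigates to a new URL. (There is no other option, because it navigates to the new URL dynamically.)
- After the navigation, it gets the navigated URL and returns that URL.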
However, the line await Promise.all([page.waitForNavigation(), page.click(".tit")]); throws a timeout exception.
What I tried:
- In Chrome, I ran this code in the console.
const title = document.querySelector(".tit");
title.click();
// I checked that this code navigates the browser
- I set a timeout manually through Puppeteer's API instead of using waitForNavigation. However, I could not get the new URL.
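For reference, the fixed-wait attempt can be sketched roughly as below. This is only a minimal sketch, not my exact code: `urlsAfterDelay` is a hypothetical helper name, the 3000 ms default is an arbitrary assumption, and `browser` is assumed to be the Puppeteer `Browser` object from the code above. Instead of reading only `page.url()`, it reads the URL of every open tab, which might show whether the new URL ended up somewhere other than the original page.

```javascript
// Hypothetical helper (not part of Puppeteer): wait a fixed time,
// then collect the URL of every tab the browser currently has open.
async function urlsAfterDelay(browser, delayMs = 3000) {
  // Plain timer, so this does not rely on any Puppeteer wait helper.
  await new Promise((resolve) => setTimeout(resolve, delayMs));
  // browser.pages() resolves to an array of all open Page objects.
  const pages = await browser.pages();
  // page.url() is synchronous and returns the page's current URL.
  return pages.map((p) => p.url());
}
```

It would be called right after `await page.click(".tit")`, for example `console.log(await urlsAfterDelay(browser));`.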
Does this mean that page.click() creates a new page and navigates to the new URL there?
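One way this could be checked is sketched below, under assumptions: `clickAndDetect` is a hypothetical helper (not a Puppeteer API), and `browser` and `page` are assumed to be the Puppeteer `Browser` and `Page` objects from the code above. It listens for the browser's `targetcreated` event while also waiting for a navigation in the current tab, then reports which of the two happened first after the click.

```javascript
// Hypothetical helper (not part of Puppeteer): clicks `selector` and reports
// whether the click navigated the current tab or opened a new one.
async function clickAndDetect(browser, page, selector, timeoutMs = 5000) {
  let onTarget;

  // Resolves when the browser reports a newly created page target (a new tab).
  const newTab = new Promise((resolve) => {
    onTarget = async (target) => {
      const opened = await target.page(); // null for non-page targets
      if (opened) resolve({ kind: "new-tab", page: opened });
    };
    browser.on("targetcreated", onTarget);
  });

  // Resolves when the current tab navigates. A failure (such as the
  // navigation timeout) is mapped to "none" instead of throwing.
  const sameTab = page
    .waitForNavigation({ timeout: timeoutMs })
    .then(() => ({ kind: "same-tab", page }))
    .catch(() => ({ kind: "none", page: null }));

  try {
    // Both waits are set up before the click so that no event is missed.
    const [outcome] = await Promise.all([
      Promise.race([newTab, sameTab]),
      page.click(selector),
    ]);
    return outcome;
  } finally {
    // Always remove the listener, whatever the outcome was.
    browser.off("targetcreated", onTarget);
  }
}
```

In my script this would replace the `Promise.all` line, for example `const outcome = await clickAndDetect(browser, page, ".tit");`, and the new URL would then be read from `outcome.page.url()` when `outcome.kind` is not "none".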