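I'm scraping a dynamic website with Puppeteer. My goal is to keep the scraping logic as generic as possible, which should also remove a lot of boilerplate code. To that end, I created an external function that scrapes the data given certain parameters. The problem is that when I try to use that function inside Puppeteer's page.evaluate() method, I get a ReferenceError saying the function is not defined.
I did some research and found page.exposeFunction() and page.addScriptTag(). However, when I tried them in my scraper, addScriptTag() did not work, and exposeFunction() did not let me access DOM elements inside the exposed function. I understand that a function registered with exposeFunction() executes in Node.js, while a script added with addScriptTag() executes in the browser, but I don't know how to act on that information or whether it is even relevant to my case.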
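To make the Node.js versus browser distinction concrete, here is a minimal sketch that does not use Puppeteer at all. It assumes (this is my understanding, not something I have confirmed) that arguments passed to an exposed function cross the browser-to-Node.js boundary as serialized data. FakeElement and crossBoundary are hypothetical names invented for the illustration, and a plain JSON round trip stands in for the real transport.

```typescript
// Hypothetical stand-in for a DOM element: some data plus a method.
interface FakeElement {
  tagName: string;
  querySelector: (selector: string) => string | null;
}

const fakeElement: FakeElement = {
  tagName: "DIV",
  querySelector: (selector) => (selector === ".title" ? "found" : null),
};

// Assumption: the boundary behaves like a JSON round trip.
// Plain data survives it, function-valued properties are dropped.
function crossBoundary(value: unknown): Record<string, unknown> {
  return JSON.parse(JSON.stringify(value));
}

const received = crossBoundary(fakeElement);

console.log(typeof fakeElement.querySelector); // "function"
console.log(typeof received.querySelector); // "undefined"
console.log(received.tagName); // "DIV"
```

If that assumption holds, calling querySelector on the received value would fail with a "not a function" TypeError, which would suggest that any logic touching DOM elements has to run in the browser context.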
Here is my scraper:
import { Browser } from "puppeteer";
import { dataMapper } from "../../utils/api/functions/data-mapper.js";

export const mainCategoryScraper = async (browser: Browser) => {
  const [page] = await browser.pages();

  await page.setUserAgent(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
  );

  await page.setRequestInterception(true);
  page.on("request", (req) => {
    if (
      req.resourceType() === "stylesheet" ||
      req.resourceType() === "font" ||
      req.resourceType() === "image"
    ) {
      req.abort();
    } else {
      req.continue();
    }
  });

  await page.goto("https://www.ozone.bg/pazeli-2d-3d/nastolni-igri", {
    waitUntil: "domcontentloaded",
  });

  /**
   * Function will execute in Node.js
   */
  // await page.exposeFunction('dataMapper', dataMapper);

  /**
   * The way of passing DOM elements to the function, because like that the function executes in the browser
   */
  // await page.addScriptTag({ content: `${dataMapper}` });

  const data = await page.evaluate(async () => {
    const contentContainer = document.querySelector(".col-main") as HTMLDivElement;
    const carousels = Array.from(
      contentContainer.querySelectorAll(".owl-item") as NodeListOf<HTMLDivElement>
    );
    const carouselsData = await dataMapper<HTMLDivElement>(carousels, ".title", "img", "a");
    return {
      carouselsData,
    };
  });

  await browser.close();
  return data;
};
Here is the dataMapper function:
import { PossibleTags } from "../typescript/types.js";

export const dataMapper = function <T extends HTMLDivElement>(items: Array<T>, ...selectors: string[]) {
  let hasTitle = false;
  for (const selector of selectors) {
    if (selector === ".title" || selector === "h3") {
      hasTitle = true;
      break;
    }
  }

  return items.map((item) => {
    const data: PossibleTags = {};
    return selectors.map((selector) => {
      const dataProp = item.querySelector(selector);
      switch (selector) {
        case ".title": {
          data["title"] = (dataProp as HTMLSpanElement)?.innerText;
          break;
        }
        case "h3": {
          data["title"] = (dataProp as HTMLHeadingElement)?.innerText;
          break;
        }
        case "h6": {
          data["subTitle"] = (dataProp as HTMLHeadingElement)?.innerText;
          break;
        }
        case "img": {
          if (!hasTitle) {
            data["img"] = (dataProp as HTMLImageElement)?.getAttribute("src") ?? undefined;
            break;
          }
          data["title"] = (dataProp as HTMLImageElement)?.getAttribute("alt") ?? undefined;
          break;
        }
        case "a": {
          data["url"] = (dataProp as HTMLAnchorElement)?.getAttribute("href") ?? undefined;
        }
        default: {
          throw new Error("Such selector is not yet added to the possible selectors");
        }
      }
    });
  });
};
When I use page.exposeFunction('dataMapper', dataMapper); it tells me that item.querySelector is not a function (inside dataMapper). With await page.addScriptTag({ content: `${dataMapper}` }); it later throws an error inside page.evaluate saying that dataMapper is not a function.
Update: When specifying a path inside addScriptTag, it still gives me: Error [ReferenceError]: dataMapper is not defined
Just to mention that mainCategoryScraper is later used in the scrapersHandler function, which decides which scraper to execute based on the URL endpoint.
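One thing that may be relevant to the addScriptTag attempt: interpolating a function into a template string yields only the source text of the function expression, without the name of the const it was assigned to. Below is a minimal sketch of that behaviour, run entirely in Node.js. The names mapper, parsesAsScript and injectedMapper are hypothetical, and new Function is only a rough stand-in for how a script tag would evaluate its content.

```typescript
// Hypothetical stand-in for dataMapper: a function expression bound to a const.
const mapper = function (items: string[], suffix: string): string[] {
  return items.map((item) => item + suffix);
};

// Interpolation calls toString(), which returns the compiled source text
// of the function expression; the const name is not part of that text.
const bareSource: string = `${mapper}`;

// Returns true when the text parses as a standalone script body.
function parsesAsScript(code: string): boolean {
  try {
    new Function(code);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// Prefixing an assignment turns the expression into a complete statement
// that also attaches the function to the global object.
const wrappedSource: string = `globalThis.injectedMapper = ${bareSource};`;

console.log(parsesAsScript(bareSource)); // false
console.log(parsesAsScript(wrappedSource)); // true

// Evaluate the wrapped text, roughly as a script tag would, then call the result.
new Function(wrappedSource)();
const injected = (globalThis as any).injectedMapper as typeof mapper;
console.log(injected(["a", "b"], "!").join(",")); // "a!,b!"
```

If the same thing happens in my scraper, the content passed to addScriptTag would need to be a full statement, for example an assignment to a property on window, rather than the bare function text. I have not verified this against the real page, so treat it as an assumption.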