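I'm trying to parse the IMDb movie connections page (https://www.imdb.com/title/tt0090887/movieconnections/), but the "more" buttons don't load the information for each category (Featured in, Followed by, etc.). Puppeteer's click function doesn't seem to work, apparently because the button triggers a JavaScript function.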

const puppeteer = require('puppeteer');

const scrape = async function () {
    const browser = await puppeteer.launch({ headless: false });

    const page = await browser.newPage();
    await page.goto('https://www.imdb.com/title/tt0090887/movieconnections/');
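    await page.click('.ipc-see-more__button');

    // The original snippet was cut off here; close the browser and run it.
    await browser.close();
};

scrape();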

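The button I want to click sits inside a span containing the text "more". I expected that pressing it would call a JavaScript function that shows more items, but nothing happens: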

  <span class="ipc-see-more sc-4d3dda93-0 fMZdeF single-page-see-more-button-followed_by">
          <button class="ipc-btn ipc-btn--single-padding ipc-btn--center-align-content ipc-btn--default-height ipc-btn--core-base ipc-btn--theme-base ipc-btn--on-accent2 ipc-text-button ipc-see-more__button" role="button" tabindex="0" aria-disabled="false">
            <span class="ipc-btn__text">
              <span class="ipc-see-more__text">2 more</span>
            </span>
            <svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" class="ipc-icon ipc-icon--expand-more ipc-btn__icon ipc-btn__icon--post" viewBox="0 0 24 24" fill="currentColor" role="presentation">
              <path opacity=".87" fill="none" d="M24 24H0V0h24v24z"></path>
              <path d="M15.88 9.29L12 13.17 8.12 9.29a.996.996 0 1 0-1.41 1.41l4.59 4.59c.39.39 1.02.39 1.41 0l4.59-4.59a.996.996 0 0 0 0-1.41c-.39-.38-1.03-.39-1.42 0z"></path>
            </svg>
          </button>
        </span>

Recommended answer

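This is a somewhat tricky scrape, because you need to click each button and then wait for the results to arrive. You can do that either by monitoring the network requests, or by waiting for the number of items in each section that has a "more" button to increase.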
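For the first approach (monitoring requests), a minimal sketch might look like the following. The helper name `clickAndWaitForResponse` and the `isWanted` predicate are hypothetical; which request IMDb actually sends when a "more" button is pressed is not stated here, so inspect the network tab and adjust the predicate accordingly. The helper relies only on Puppeteer's `page.waitForResponse` and `page.$eval`.

```javascript
// Hypothetical helper: click a button and wait for the response it triggers.
// `page` is expected to be a Puppeteer Page (or any object exposing
// compatible `waitForResponse` and `$eval` methods).
// `isWanted` is a predicate that receives a response and returns true
// for the one we care about; what it should match is an assumption.
async function clickAndWaitForResponse(page, selector, isWanted, timeout = 10000) {
  // Start waiting before the click is issued so a fast response is not missed.
  const [response] = await Promise.all([
    page.waitForResponse(isWanted, { timeout }),
    // Trigger a native DOM click inside the page.
    page.$eval(selector, el => el.click()),
  ]);
  return response;
}

// Possible usage (the "graphql" URL fragment is only a guess):
// const res = await clickAndWaitForResponse(
//   page,
//   ".ipc-see-more__button",
//   r => r.url().includes("graphql")
// );
```

Because the helper only touches two methods of `page`, it can be exercised with a small fake object before pointing it at a real browser.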

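Here's a quick sketch of the second approach (waiting for the section lengths to grow). It works, but could use some cleanup, which is left as an exercise: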

const puppeteer = require("puppeteer"); // ^22.6.0

const url = "<Your URL>";

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const ua =
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36";
  await page.setUserAgent(ua);

  // Performance optimization: block unnecessary requests
  await page.setRequestInterception(true);
  const allowedResources = ["script", "other", "fetch"];
  page.on("request", req => {
    if (
      (req.url().startsWith("https://www.imdb.com") ||
        allowedResources.includes(req.resourceType())) &&
      !/google|amazon|beacon/.test(req.url()) &&
      req.resourceType() !== "xhr"
    ) {
      req.continue();
    } else {
      req.abort();
    }
  });

  await page.goto(url, {waitUntil: "domcontentloaded"});

  // Retrieve the sections with 'more' buttons as [length, index] pairs
  const lengths = await page.$$eval(
    ".ipc-page-grid .ipc-page-section",
    els =>
      els
        .map((e, i) => [e, i])
        .filter(([e]) => e.querySelector(".ipc-see-more__text"))
        .map(([e, i]) => [e.querySelectorAll("p").length, i])
  );

  // Click all of the 'more' buttons
  await page.$$eval(".ipc-see-more__text", els =>
    els.forEach(el => el.click())
  );

  // Wait until the lengths of each 'more' section increase
  await page.waitForFunction(
    lengths =>
      [
        ...document.querySelectorAll(
          ".ipc-page-grid .ipc-page-section"
        ),
      ].every((el, i) => {
        const companion = lengths.find(e => e[1] === i);
        const {length} = el.querySelectorAll("p");
        return !companion || companion[0] < length;
      }),
    {},
    lengths
  );

  // Scrape the data
  const data = await page.$$eval(
    ".ipc-page-grid .ipc-page-section",
    els =>
      els
        .map(el => ({
          title: el
            .querySelector(".ipc-title")
            ?.textContent.trim(),
          items: [...el.querySelectorAll("p")].map(e => ({
            href: e.querySelector("a").href,
            year: [...e.childNodes].at(-1).textContent.trim(),
          })),
        }))
        .filter(e => e.items.length)
  );
  console.log(JSON.stringify(data, null, 2));
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

Partial output:

[
  {
    "title": "Edited into",
    "items": [
      {
        "href": "https://www.imdb.com/title/tt0101627?ref_=ttcnn",
        "year": "(1991)"
      },
      {
        "href": "https://www.imdb.com/title/tt3233580?ref_=ttcnn",
        "year": "(2013)"
      }
    ]
  },
  {
    "title": "Featured in",
    "items": [
      {
        "href": "https://www.imdb.com/title/tt14701700?ref_=ttcnn",
        "year": "(TV Episode 1986)"
      },
      {
        "href": "https://www.imdb.com/title/tt1577448?ref_=ttcnn",
        "year": "(TV Episode 1986)"
      },
      {
        "href": "https://www.imdb.com/title/tt0093629?ref_=ttcnn",
        "year": "(1987)"
      },
      {
        "href": "https://www.imdb.com/title/tt6079512?ref_=ttcnn",
        "year": "(TV Episode 1989)"
      },
      {
        "href": "https://www.imdb.com/title/tt0116289?ref_=ttcnn",
        "year": "(1996)"
      },
      {
        "href": "https://www.imdb.com/title/tt0834914?ref_=ttcnn",
        "year": "(Video 2006)"
      },
      {
        "href": "https://www.imdb.com/title/tt1748981?ref_=ttcnn",
        "year": "(TV Episode 2010)"
      },
      {
        "href": "https://www.imdb.com/title/tt4213530?ref_=ttcnn",
        "year": "(TV Episode 2014)"
      },
// ...
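The wait condition in the answer's code compares the `[length, index]` pairs recorded before clicking against the current item count of each section. As a rough illustration, that comparison can be pulled out into a plain function and checked without a browser. The name `allTrackedSectionsGrew` and the sample numbers are made up for this sketch.

```javascript
// `before` holds [length, index] pairs recorded before the clicks,
// one per section that had a "more" button.
// `current` holds the current item count of every section, by index.
function allTrackedSectionsGrew(before, current) {
  // Every tracked section must now contain more items than it did.
  return before.every(([length, index]) => current[index] > length);
}

// Sections 1 and 3 had a "more" button and 5 items each.
const before = [[5, 1], [5, 3]];
console.log(allTrackedSectionsGrew(before, [2, 5, 1, 7])); // false: section 1 is still at 5
console.log(allTrackedSectionsGrew(before, [2, 7, 1, 7])); // true: both sections grew
```

Note that with no tracked sections the function returns `true` immediately, which matches the answer's behavior on a page without any "more" buttons.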
