JavaScript: How to get a table from a website with Cheerio in Google Apps Script

Posted on March 16

I'm trying to get the table from this website using the Cheerio library in Google Apps Script. I used some code from this answer below, but console.log() only prints [].

Here is my code:

function test2() {
  const url = 'https://github.com/labnol/apps-script-starter/blob/master/scopes.md';
  const res = UrlFetchApp.fetch(url, { muteHttpExceptions: true }).getContentText();
  const $ = Cheerio.load(res);
  var data = $('tbody').find('td').toArray().map((x) => { return $(x).text() });
  console.log(data);
}

I have also looked at some other answers:

one
two

But they gave me no clue about how to get the expected result.

Recommended answer

For starters, you might want to use GitHub's API to avoid the pitfalls of web scraping.
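As a rough sketch of the API route (Node-flavored, and not something I ran against the live API for this answer): GitHub's REST "repository contents" endpoint returns a JSON body whose `content` field holds the file base64-encoded. The helper names `contentsUrl` and `decodeContents`, and the fake payload, are my own illustrations rather than part of any library.

```javascript
// Build the REST "repository contents" URL for one file in a repo.
function contentsUrl(owner, repo, path, ref) {
  const encodedPath = path.split("/").map(encodeURIComponent).join("/");
  const base = `https://api.github.com/repos/${owner}/${repo}/contents/${encodedPath}`;
  return ref ? `${base}?ref=${encodeURIComponent(ref)}` : base;
}

// Turn the JSON body of a contents response into the file's text.
function decodeContents(payload) {
  if (payload.encoding !== "base64") {
    throw new Error(`Unexpected encoding: ${payload.encoding}`);
  }
  // Buffer is Node-only; in GAS, Utilities.base64Decode() combined with
  // Utilities.newBlob(...).getDataAsString() would be the rough equivalent.
  return Buffer.from(payload.content, "base64").toString("utf8");
}

// Hypothetical payload standing in for a real API response.
const sampleText = "| Google OAuth API Scope | Scope Description |\n| -- | -- |\n";
const fakePayload = {
  encoding: "base64",
  content: Buffer.from(sampleText, "utf8").toString("base64"),
};

console.log(contentsUrl("labnol", "apps-script-starter", "scopes.md", "master"));
console.log(decodeContents(fakePayload) === sampleText); // true
```

With a real request, the decoded text would be the raw markdown of scopes.md, so the table still needs parsing afterwards. Keep in mind that unauthenticated API requests are rate-limited.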

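If you do want to stick with GAS and avoid the API, the problem appears to be that the page served to the script differs from the one you see in the browser. I determined this by adding DriveApp.createFile("test.html", res); to get around log truncation (apparently there's no better way, according to TheMaster). The HTML output makes it clear that the data is only available in a React JSON string inside a script tag, which can be extracted with Cheerio, parsed with JSON.parse() and traversed.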
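I haven't pinned down the exact shape of that embedded JSON, so here is only a generic sketch of the "parse and traverse" step: walk the parsed value and collect any string that looks like it contains a table. The `collectStrings` helper, the key names in the sample (`payload`, `blob`, `richText`) and the selector in the trailing comment are all assumptions for illustration.

```javascript
// Recursively collect every string inside a parsed JSON value
// for which `predicate` returns true.
function collectStrings(value, predicate, found = []) {
  if (typeof value === "string") {
    if (predicate(value)) found.push(value);
  } else if (Array.isArray(value)) {
    for (const item of value) collectStrings(item, predicate, found);
  } else if (value !== null && typeof value === "object") {
    for (const item of Object.values(value)) collectStrings(item, predicate, found);
  }
  return found;
}

// Hypothetical stand-in for the text content of the embedded script tag.
const scriptText = JSON.stringify({
  title: "scopes.md",
  payload: {
    blob: { richText: "<table><tbody><tr><td>cell</td></tr></tbody></table>" },
  },
});

const htmlFragments = collectStrings(
  JSON.parse(scriptText),
  s => s.includes("<table")
);
console.log(htmlFragments.length); // 1

// With Cheerio, the script text might be obtained along these lines
// (the selector is a guess; inspect the saved test.html to find the real one):
//   const scriptText = $('script[type="application/json"]').first().text();
// Each fragment could then be passed to Cheerio.load() and scraped as usual.
```

Searching by content rather than by a hard-coded key path is a deliberate choice here: it keeps working if the surrounding JSON structure changes, at the cost of being less precise.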

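However, a simpler option may be to request the raw markdown, then either convert it to HTML with marked and carry on with Cheerio, or parse the table by hand. I'll use the latter option since I'm not very familiar with the GAS package ecosystem: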

function myFunction() { // default GAS function name
  const url = "https://raw.githubusercontent.com/labnol/apps-script-starter/master/scopes.md";
  const res = UrlFetchApp.fetch(url).getContentText();
  const data = [];
  
  for (const line of res.split("\n")) {
    const chunks = line
      .replace(/[*`]/g, "")
      .split("|")
      .slice(1, 3)
      .filter(e => e !== " -- ")
      .map(e => e.trim());

    if (chunks.length) {
      data.push(chunks);
    }
  }
  
  console.log(data);
}

Output:

Logging output too large. Truncating output. [
  [ 'Google OAuth API Scope', 'Scope Description' ],
  [ 'Cloud SQL Admin API v1beta4', '' ],
  [ 'View and manage your data across Google Cloud Platform services',
    'https://www.googleapis.com/auth/cloud-platform' ],
  [ 'Manage your Google SQL Service instances',
    'https://www.googleapis.com/auth/sqlservice.admin' ],
  [ '', '' ],
  [ 'Android Management API v1', '' ],
  // ...

Parsing the raw markdown is a bit clumsy, but should be reliable enough. If it turns out not to be, try one of the other options.
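If the hand-rolled parsing does need hardening, one possible direction is a slightly more tolerant variant that keeps every column and recognizes separator rows of any width (`--`, `---`, `:---:`). This is a sketch run on a few sample lines, not on the full file; `parseMarkdownTable` is a name I made up, and it assumes cells never contain an escaped pipe.

```javascript
// Parse a pipe-delimited markdown table into an array of rows.
// Assumption: no cell contains an escaped pipe ("\|").
function parseMarkdownTable(markdown) {
  const rows = [];
  for (const line of markdown.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("|")) continue; // not a table row

    const cells = trimmed
      .replace(/^\|/, "") // drop the leading pipe
      .replace(/\|$/, "") // drop the trailing pipe
      .split("|")
      .map(cell => cell.replace(/[*`]/g, "").trim());

    // Skip the header separator row, e.g. "| -- | :---: |".
    if (cells.every(cell => /^:?-+:?$/.test(cell))) continue;

    rows.push(cells);
  }
  return rows;
}

const sample = [
  "| Google OAuth API Scope | Scope Description |",
  "| -- | -- |",
  "| **Cloud SQL Admin API v1beta4** | |",
  "| Manage your Google SQL Service instances | `https://www.googleapis.com/auth/sqlservice.admin` |",
  "| | |",
].join("\n");

const rows = parseMarkdownTable(sample);
console.log(rows);
```

On this sample it yields four rows: the header, the category title with an empty right cell, one scope row, and one all-empty separator row, which matches the shape of the output shown above.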

If you're not attached to GAS, your original code works for me in Node 20.11.1:

const cheerio = require("cheerio"); // ^1.0.0-rc.12 or rc.10

const url = "https://github.com/labnol/apps-script-starter/blob/master/scopes.md";

fetch(url)
  .then(res => {
    if (!res.ok) {
      throw Error(res.statusText);
    }

    return res.text();
  })
  .then(html => {
    const $ = cheerio.load(html);
    const data = $("tbody")
      .find("td")
      .toArray()
      .map(x => $(x).text());
    console.log(data);
  })
  .catch(err => console.error(err));

Output:

[
  'Cloud SQL Admin API v1beta4',
  '',
  'View and manage your data across Google Cloud Platform services',
  'https://www.googleapis.com/auth/cloud-platform',
  'Manage your Google SQL Service instances',
  'https://www.googleapis.com/auth/sqlservice.admin',
  '',
  '',
  'Android Management API v1',
  // ... 1360 total items ...
]

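Although this approach works, the array shown above is too flat to be usable; it's essentially a single row. I'd scrape by nested rows and cells instead, which preserves the tabular nature of the data and avoids flattening it.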

// ...
const $ = cheerio.load(html);
const data = [...$("tr")].map(e =>
  [...$(e).find("td, th")].map(e => $(e).text().slice(0, 25))
);
console.table(data.slice(0, 10));
// ...

Here is the output, which resembles the output of the GAS script (remove the slice calls to see all of the data without truncation):

┌─────────┬─────────────────────────────┬─────────────────────────────┐
│ (index) │ 0                           │ 1                           │
├─────────┼─────────────────────────────┼─────────────────────────────┤
│ 0       │ 'Google OAuth API Scope'    │ 'Scope Description'         │
│ 1       │ 'Cloud SQL Admin API v1bet' │ ''                          │
│ 2       │ 'View and manage your data' │ 'https://www.googleapis.co' │
│ 3       │ 'Manage your Google SQL Se' │ 'https://www.googleapis.co' │
│ 4       │ ''                          │ ''                          │
│ 5       │ 'Android Management API v1' │ ''                          │
│ 6       │ 'Manage Android devices an' │ 'https://www.googleapis.co' │
│ 7       │ ''                          │ ''                          │
│ 8       │ 'YouTube Data API v3'       │ ''                          │
│ 9       │ 'Manage your YouTube accou' │ 'https://www.googleapis.co' │
└─────────┴─────────────────────────────┴─────────────────────────────┘

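You can process this further to group by subcategory. Rows with two empty cells are separators between scope categories (I think; I'm not a domain expert), while rows with an empty right cell are category headings. Here's an example that groups by subcategory and attaches the headers to each cell: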

const grouped = [];
const headers = data[0];

for (const row of data.slice(1)) {
  if (row.every(e => e === "")) {
    continue;
  } else if (row[1] === "") {
    grouped.push({title: row[0], items: []});
  } else {
    grouped
      .at(-1)
      .items.push(
        Object.fromEntries(
          row.map((e, i) => [headers[i], e])
        )
      );
  }
}

console.log(JSON.stringify(grouped, null, 2));

I tested this example processing code in both GAS and Node.
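To make the shape of the grouped result concrete, here is the same logic wrapped in a function and run on a handful of rows taken from the output above. The function name `groupRows` and the guard for item rows that appear before any category title are my additions, so treat this as a sketch rather than the exact code I tested.

```javascript
// Group table rows under their category titles.
// data[0] is the header row; the remaining rows are titles, items or separators.
function groupRows(data) {
  const headers = data[0];
  const grouped = [];

  for (const row of data.slice(1)) {
    if (row.every(e => e === "")) {
      continue; // separator row between categories
    } else if (row[1] === "") {
      grouped.push({ title: row[0], items: [] }); // category title
    } else {
      // Added guard: an item row before any title gets an untitled group.
      if (grouped.length === 0) {
        grouped.push({ title: "", items: [] });
      }
      grouped
        .at(-1)
        .items.push(Object.fromEntries(row.map((e, i) => [headers[i], e])));
    }
  }

  return grouped;
}

const sampleData = [
  ["Google OAuth API Scope", "Scope Description"],
  ["Cloud SQL Admin API v1beta4", ""],
  [
    "View and manage your data across Google Cloud Platform services",
    "https://www.googleapis.com/auth/cloud-platform",
  ],
  [
    "Manage your Google SQL Service instances",
    "https://www.googleapis.com/auth/sqlservice.admin",
  ],
  ["", ""],
  ["Android Management API v1", ""],
];

const groupedSample = groupRows(sampleData);
console.log(JSON.stringify(groupedSample, null, 2));
```

For this sample the result has two groups: "Cloud SQL Admin API v1beta4" with two items, each keyed by the two header strings, and "Android Management API v1" with an empty items array because no scope rows follow it in the sample.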
