Extracting data from an HTML table using bash

Posted on October 23

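Hello, I'm having trouble parsing an HTML table with a bash script. I managed to fetch the table, but extracting the data from it is not going well. The table content looks like this: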

<tr><td><small>1</small></td><td>Kalisz</td><td>62-800</td><td>Poland</td><td>Greater Poland</td><td>Kalisz</td><td>Kalisz<tr><td></td><td colspan=6>&nbsp;&nbsp;&nbsp;<a href="/maps/browse_51.75_18.087.html" rel="nofollow"><small>51.75/18.087</small></a></td></tr>
<tr class="odd"><td><small>2</small></td><td>Piotrków Trybunalski</td><td>97-300</td><td>Poland</td><td>Łódź Voivodeship</td><td>Piotrków Trybunalski</td><td>Piotrków Trybunalski<tr class="odd"><td></td><td colspan=6>&nbsp;&nbsp;&nbsp;<a href="/maps/browse_51.411_19.689.html" rel="nofollow"><small>51.411/19.689</small></a></td></tr>
<tr><td><small>3</small></td><td>Toruń</td><td>87-100</td><td>Poland</td><td>Kujawsko-Pomorskie</td><td>Toruń</td><td>Toruń<tr><td></td><td colspan=6>&nbsp;&nbsp;&nbsp;<a href="/maps/browse_53.021_18.623.html" rel="nofollow"><small>53.021/18.623</small></a></td></tr>

There are 200 rows like these. My bash script looks like this:

#!/bin/bash

URL="https://www.geonames.org/postalcode-search.html?country=PL&q="

HTML=$(curl -s "$URL")

(echo "$HTML" | grep -A 201 "<table class=\"restable\">" | tail -n 200 )>> table.html

html_lines=()
while IFS= read -r line; do
  html_lines+=("$line")
done < "table.html"

for html_line in "${html_lines[@]}"; do
  field1_value=$(echo "$html_line" | grep -oP '(?<=<td>)[^<]+(?=</td>)')
  field2_value=$(echo "$html_line" | grep -oP '[0-9]{2}-[0-9]{3}')
  field3_value=$(echo "$html_line" | grep -oP '(?<=<small>)[^<]+(?=</small>)')

  # Printing the extracted fields
  echo "$field1_value;$field2_value;$field3_value" >> output.txt
done

The result I currently get is:

Kalisz
62-800
Poland
Greater Poland
Kalisz;62-800;1
51.75/18.087
Piotrków Trybunalski
97-300
Poland
Łódź Voivodeship
Piotrków Trybunalski;97-300;2
51.411/19.689
Toruń
87-100
Poland
Kujawsko-Pomorskie
Toruń;87-100;3
53.021/18.623

The result I want is:

Greater Poland;62-800;51.75/18.087
Łódź Voivodeship;97-300;51.411/19.689
Lesser Poland;33-300;49.609/20.704

I want to parse the rows so that I can use them in a CSV file later.

^                  ## Match from the start of the line.
<tr.*?>            ## Match from <tr to the very next >, which also covers rows with a class= attribute.
(?:                ## Start a non-capturing group.
  <td>.*?<\/td>    ## Match from <td> to the very next </td>.
){2}               ## Match 2 occurrences of it.
<td>(.*?)<\/td>    ## Match from <td> to the very next </td> and store the value in between in the 1st capturing group.
<td>.*?<\/td>      ## Match from <td> to the very next </td>.
<td>(.*?)<\/td>    ## Match from <td> to the very next </td> and store the value in between in the 2nd capturing group.
.*?                ## Lazy match.
<a href=".*?_      ## Match the very first <a href=" up to the next _.
(.*?)_             ## 3rd capturing group: lazy match up to the next _.
(.*?)              ## 4th capturing group: lazy match.
\.html.*$          ## Match a literal . followed by html, then everything to the end of the line.
