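Hello, I'm having trouble parsing an HTML table with a bash script. I managed to fetch the table, but extracting the data from it isn't going well. The table content looks like this: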
<tr><td><small>1</small></td><td>Kalisz</td><td>62-800</td><td>Poland</td><td>Greater Poland</td><td>Kalisz</td><td>Kalisz<tr><td></td><td colspan=6> <a href="/maps/browse_51.75_18.087.html" rel="nofollow"><small>51.75/18.087</small></a></td></tr>
<tr class="odd"><td><small>2</small></td><td>Piotrków Trybunalski</td><td>97-300</td><td>Poland</td><td>Łódź Voivodeship</td><td>Piotrków Trybunalski</td><td>Piotrków Trybunalski<tr class="odd"><td></td><td colspan=6> <a href="/maps/browse_51.411_19.689.html" rel="nofollow"><small>51.411/19.689</small></a></td></tr>
<tr><td><small>3</small></td><td>Toruń</td><td>87-100</td><td>Poland</td><td>Kujawsko-Pomorskie</td><td>Toruń</td><td>Toruń<tr><td></td><td colspan=6> <a href="/maps/browse_53.021_18.623.html" rel="nofollow"><small>53.021/18.623</small></a></td></tr>
There are 200 rows like this. My bash script looks like this:
#!/bin/bash
URL="https://www.geonames.org/postalcode-search.html?country=PL&q="
HTML=$(curl -s "$URL")
(echo "$HTML" | grep -A 201 "<table class=\"restable\">" | tail -n 200 )>> table.html
html_lines=()
while IFS= read -r line; do
    html_lines+=("$line")
done < "table.html"
for html_line in "${html_lines[@]}"; do
    field1_value=$(echo "$html_line" | grep -oP '(?<=<td>)[^<]+(?=</td>)')
    field2_value=$(echo "$html_line" | grep -oP '[0-9]{2}-[0-9]{3}')
    field3_value=$(echo "$html_line" | grep -oP '(?<=<small>)[^<]+(?=</small>)')
    # Print the extracted fields
    echo "$field1_value;$field2_value;$field3_value" >> output.txt
done
The output I currently get is:
Kalisz
62-800
Poland
Greater Poland
Kalisz;62-800;1
51.75/18.087
Piotrków Trybunalski
97-300
Poland
Łódź Voivodeship
Piotrków Trybunalski;97-300;2
51.411/19.689
Toruń
87-100
Poland
Kujawsko-Pomorskie
Toruń;87-100;3
53.021/18.623
The output I want is:
Greater Poland;62-800;51.75/18.087
Łódź Voivodeship;97-300;51.411/19.689
Lesser Poland;33-300;49.609/20.704
I want to parse them so that I can use them in a CSV file later.
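
A likely reason for the multi-line output: `grep -o` prints every match on its own line, so `field1_value` and `field3_value` each end up holding several lines per row. Below is a minimal sketch of one alternative, assuming each table row sits on a single line with the columns in the order shown in the sample above. It captures all three fields with a single `sed` substitution. The function name `extract_rows` is made up for illustration and is not part of the original script.

```bash
#!/bin/bash
# Sketch: print "region;postal code;lat/lng" for each data row read from stdin.
# Assumption: one table row per line, columns ordered as in the sample rows
# (index, place, postal code, country, region, ...), coordinates in the last <small>.
extract_rows() {
    # Capture groups: \1 = postal code (3rd cell), \2 = region (5th cell),
    # \3 = text of the last <small>...</small></a> on the line.
    # Lines that do not match the pattern (e.g. a header row) print nothing.
    sed -nE 's#^<tr[^>]*><td><small>[0-9]+</small></td><td>[^<]*</td><td>([^<]*)</td><td>[^<]*</td><td>([^<]*)</td>.*<small>([^<]*)</small></a>.*#\2;\1;\3#p'
}

# Example with two rows copied from the sample above.
extract_rows <<'EOF'
<tr><td><small>1</small></td><td>Kalisz</td><td>62-800</td><td>Poland</td><td>Greater Poland</td><td>Kalisz</td><td>Kalisz<tr><td></td><td colspan=6> <a href="/maps/browse_51.75_18.087.html" rel="nofollow"><small>51.75/18.087</small></a></td></tr>
<tr class="odd"><td><small>2</small></td><td>Piotrków Trybunalski</td><td>97-300</td><td>Poland</td><td>Łódź Voivodeship</td><td>Piotrków Trybunalski</td><td>Piotrków Trybunalski<tr class="odd"><td></td><td colspan=6> <a href="/maps/browse_51.411_19.689.html" rel="nofollow"><small>51.411/19.689</small></a></td></tr>
EOF
# prints:
# Greater Poland;62-800;51.75/18.087
# Łódź Voivodeship;97-300;51.411/19.689
```

With the files from the script above, this could be run as `extract_rows < table.html > output.txt`. Matching HTML with regular expressions is fragile, so the pattern would need adjusting if the site changes its markup or splits a row across lines.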