
Given the following sample HTML page:

<div class="ia-secondary-content">
  <div class="plugin_pagetree conf-macro output-inline" data-hasbody="false" data-macro-name="pagetree">
    <div class="plugin_pagetree_children_list plugin_pagetree_children_list_noleftspace">
      <div class="plugin_pagetree_children" id="children1326817570-0">
        <ul class="plugin_pagetree_children_list" id="child_ul1326817570-0">
          <li>
            <div class="plugin_pagetree_childtoggle_container">
              <a aria-expanded="false" aria-label="Expand item Topic 1" class="plugin_pagetree_childtoggle aui-icon aui-icon-small aui-iconfont-chevron-right" data-page-id="1630374642" data-tree-id="0" data-type="toggle" href="" id="plusminus1630374642-0"></a>
            </div>
            <div class="plugin_pagetree_children_content">
              <span class="plugin_pagetree_children_span" id="childrenspan1630374642-0"> <a href="#">Topic 1</a></span>
            </div>
            <div class="plugin_pagetree_children_container" id="children1630374642-0"></div>
          </li>
          <li>
            <div class="plugin_pagetree_childtoggle_container">
              <a aria-expanded="false" aria-label="Expand item Topic 2" class="plugin_pagetree_childtoggle aui-icon aui-icon-small aui-iconfont-chevron-right" data-page-id="1565544568" data-tree-id="0" data-type="toggle" href="" id="plusminus1565544568-0"></a>
            </div>
            <div class="plugin_pagetree_children_content">
              <span class="plugin_pagetree_children_span" id="childrenspan1565544568-0"> <a href="#">Topic 2</a></span>
            </div>
            <div class="plugin_pagetree_children_container" id="children1565544568-0"></div>
          </li>
          <li>
            <div class="plugin_pagetree_childtoggle_container">
              <a aria-expanded="true" aria-label="Expand item Topic 3" class="plugin_pagetree_childtoggle aui-icon aui-icon-small aui-iconfont-chevron-down" data-children-loaded="true" data-expanded="true" data-page-id="3733362288" data-tree-id="0" data-type="toggle"
                href="" id="plusminus3733362288-0"></a>
            </div>
            <div class="plugin_pagetree_children_content">
              <span class="plugin_pagetree_children_span" id="childrenspan3733362288-0"> <a href="#">Topic 3</a></span>
            </div>
            <div class="plugin_pagetree_children_container" id="children3733362288-0">
              <ul class="plugin_pagetree_children_list" id="child_ul3733362288-0">
                <li>
                  <div class="plugin_pagetree_childtoggle_container">
                    <span class="no-children icon"></span>
                  </div>
                  <div class="plugin_pagetree_children_content">
                    <span class="plugin_pagetree_children_span"> <a href="#">Subtopic 1</a></span>
                  </div>
                  <div class="plugin_pagetree_children_container"></div>
                </li>
                <li>
                  <div class="plugin_pagetree_childtoggle_container">
                    <span class="no-children icon"></span>
                  </div>
                  <div class="plugin_pagetree_children_content">
                    <span class="plugin_pagetree_children_span"> <a href="#">Subtopic 2</a></span>
                  </div>
                  <div class="plugin_pagetree_children_container"></div>
                </li>
              </ul>
            </div>
          </li>
          <li>
            <div class="plugin_pagetree_childtoggle_container">
              <a aria-expanded="false" aria-label="Expand item Topic 4" class="plugin_pagetree_childtoggle aui-icon aui-icon-small aui-iconfont-chevron-right" data-page-id="2238798992" data-tree-id="0" data-type="toggle" href="" id="plusminus2238798992-0"></a>
            </div>
            <div class="plugin_pagetree_children_content">
              <span class="plugin_pagetree_children_span" id="childrenspan2238798992-0"> <a href="#">Topic 4</a></span>
            </div>
            <div class="plugin_pagetree_children_container" id="children2238798992-0"></div>
          </li>
        </ul>
      </div>
    </div>
    <fieldset class="hidden">
    </fieldset>
  </div>
</div>

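I need to extract the innermost nested links from a page tree like this one. Given a title, I need to get all the links under it. How can I find all the innermost nested links? I want to write a Python script that dynamically extracts the innermost nested links from a variety of HTML pages. Note that the nesting depth may vary.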

For example, I should get:

<a href="#">Subtopic 1</a>
<a href="#">Subtopic 2</a>

I tried to extract all the links within the same nested structure, but it did not work:

# Step 1: Find the div with the given title
title = "Topic 3"
target_div = soup.find('span', class_='plugin_pagetree_children_span', text=title)

# Step 2: Extract the next div with class "plugin_pagetree_children_container"
if target_div:
    container_div = target_div.find_next_sibling('div', class_='plugin_pagetree_children_container')

    # Step 3: Extract all links within the container and print them
    if container_div:
        links = container_div.find_all('a')
        for link in links:
            print(link['href'])

Recommended answer

IIUC, you can do:

from bs4 import BeautifulSoup

# html_text = ... # your html code from the question

soup = BeautifulSoup(html_text, "html.parser")

for a in soup.select("li li a"):
    print(a)

Prints:

<a href="#">Subtopic 1</a>
<a href="#">Subtopic 2</a>

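Edit: if the nesting depth varies, keep prepending "li" to the selector until it stops matching; the last non-empty result holds the innermost links: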

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, "html.parser")

result, tags = None, ["li", "a"]
while True:
    a = soup.select(" ".join(tags))

    if not a:
        break
    else:
        tags.insert(0, "li")
        result = a

print(result)

Prints:

[<a href="#">Subtopic 1</a>, <a href="#">Subtopic 2</a>]
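
The question also asks for the links under a given title, which the snippets above do not scope to. Below is a minimal sketch of one way to do that, not a definitive implementation. The helper names deepest_links and links_under_title are hypothetical, the sample HTML is a trimmed-down version of the page tree from the question, and the class names are assumed to match that markup. The attempt in the question likely fails for two reasons: the span contains a leading space plus a nested a tag, so text=title does not match it, and the container div is a sibling of the span's parent div, not of the span itself.

```python
from bs4 import BeautifulSoup

# Trimmed-down sample modelled on the page tree in the question (assumption).
html_text = """
<ul>
  <li>
    <div class="plugin_pagetree_children_content">
      <span class="plugin_pagetree_children_span"> <a href="#">Topic 1</a></span>
    </div>
    <div class="plugin_pagetree_children_container"></div>
  </li>
  <li>
    <div class="plugin_pagetree_children_content">
      <span class="plugin_pagetree_children_span"> <a href="#">Topic 3</a></span>
    </div>
    <div class="plugin_pagetree_children_container">
      <ul>
        <li>
          <div class="plugin_pagetree_children_content">
            <span class="plugin_pagetree_children_span"> <a href="#">Subtopic 1</a></span>
          </div>
          <div class="plugin_pagetree_children_container"></div>
        </li>
        <li>
          <div class="plugin_pagetree_children_content">
            <span class="plugin_pagetree_children_span"> <a href="#">Subtopic 2</a></span>
          </div>
          <div class="plugin_pagetree_children_container"></div>
        </li>
      </ul>
    </div>
  </li>
</ul>
"""

# Selector for the title links only; it skips the expand/collapse toggle anchors.
TITLE_LINKS = "span.plugin_pagetree_children_span > a"


def deepest_links(root):
    """Return the title links under `root` that have the most <li> ancestors."""
    links = root.select(TITLE_LINKS)
    if not links:
        return []
    depths = [len(a.find_parents("li")) for a in links]
    max_depth = max(depths)
    return [a for a, depth in zip(links, depths) if depth == max_depth]


def links_under_title(soup, title):
    """Return the innermost links nested under the tree item named `title`."""
    for a in soup.select(TITLE_LINKS):
        # Compare stripped text, since the markup has whitespace around the link.
        if a.get_text(strip=True) == title:
            li = a.find_parent("li")
            # The children container is a direct child of the same <li>.
            container = li.find(
                "div", class_="plugin_pagetree_children_container", recursive=False
            )
            return deepest_links(container) if container else []
    return []


soup = BeautifulSoup(html_text, "html.parser")
print(links_under_title(soup, "Topic 3"))
```

Counting li ancestors avoids rebuilding the selector in a loop and handles any nesting depth in a single pass. A title with no loaded children, such as Topic 1 here, yields an empty list.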
