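I would like to create a page where all the images on my website are listed with their title and alternative text.

I have already written a little program to find and load all the HTML files, but now I am stuck on how to extract src, title and alt from this HTML: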

<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />

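I guess this should be done with some regular expression, but since the order of the attributes may vary, and I need all of them, I don't really know how to parse this in an elegant way (I could do it the hard way, character by character, but that's painful).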

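Recommended answer

EDIT: now that I know better

Using regexp to solve this kind of problem is a bad idea and will likely lead to unmaintainable and unreliable code. Better use an HTML parser.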
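As a minimal sketch of the parser route, assuming PHP's bundled DOM extension is available, DOMDocument can load non-strict HTML and hand back the attributes directly. The $html sample below is the tag from the question, and the $images array layout is just one illustrative choice, not something the original answer prescribes.

```php
<?php
// Sample input taken from the question; in practice $html would hold a loaded page.
$html = '<p><img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" /></p>';

// Collect parser warnings internally instead of printing them,
// since real-world HTML is rarely well formed.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$images = array();
foreach ($doc->getElementsByTagName('img') as $node) {
    // getAttribute() returns an empty string when the attribute is missing.
    $images[] = array(
        'src'   => $node->getAttribute('src'),
        'title' => $node->getAttribute('title'),
        'alt'   => $node->getAttribute('alt'),
    );
}

print_r($images);
```

Attribute order and quote style stop mattering here, because the parser normalises both before the attributes are read.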

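Solution with regexp

In that case it's better to split the process into two parts:

  • get all the img tags
  • extract their metadata

I will assume your document is not strict xHTML, so you can't use an XML parser. For example, with the source code of this web page: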

/* preg_match_all matches the regexp against the whole $html string and outputs
every match as an array in $result. The "i" option makes it case insensitive */

preg_match_all('/<img[^>]+>/i',$html, $result); 

print_r($result);
Array
(
    [0] => Array
        (
            [0] => <img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />
            [1] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
            [2] => <img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />
            [3] => <img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />
            [4] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />

[...]
        )

)

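Then we get all the img tag attributes with a loop:

$img = array();
// $result[0] holds the full matches, i.e. the img tags themselves
foreach( $result[0] as $img_tag)
{
    preg_match_all('/(alt|title|src)=("[^"]*")/i',$img_tag, $img[$img_tag]);
}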

print_r($img);

Array
(
    [<img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />] => Array
        (
            [0] => Array
                (
                    [0] => src="/Content/Img/stackoverflow-logo-250.png"
                    [1] => alt="logo link to homepage"
                )

            [1] => Array
                (
                    [0] => src
                    [1] => alt
                )

            [2] => Array
                (
                    [0] => "/Content/Img/stackoverflow-logo-250.png"
                    [1] => "logo link to homepage"
                )

        )

    [<img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />] => Array
        (
            [0] => Array
                (
                    [0] => src="/content/img/vote-arrow-up.png"
                    [1] => alt="vote up"
                    [2] => title="This was helpful (click again to undo)"
                )

            [1] => Array
                (
                    [0] => src
                    [1] => alt
                    [2] => title
                )

            [2] => Array
                (
                    [0] => "/content/img/vote-arrow-up.png"
                    [1] => "vote up"
                    [2] => "This was helpful (click again to undo)"
                )

        )

    [<img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />] => Array
        (
            [0] => Array
                (
                    [0] => src="/content/img/vote-arrow-down.png"
                    [1] => alt="vote down"
                    [2] => title="This was not helpful (click again to undo)"
                )

            [1] => Array
                (
                    [0] => src
                    [1] => alt
                    [2] => title
                )

            [2] => Array
                (
                    [0] => "/content/img/vote-arrow-down.png"
                    [1] => "vote down"
                    [2] => "This was not helpful (click again to undo)"
                )

        )

    [<img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />] => Array
        (
            [0] => Array
                (
                    [0] => src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
                    [1] => alt="gravatar image"
                )

            [1] => Array
                (
                    [0] => src
                    [1] => alt
                )

            [2] => Array
                (
                    [0] => "http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
                    [1] => "gravatar image"
                )

        )

   [..]
        )

)

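Regexps are CPU intensive, so you may want to cache this page. If you have no cache system, you can put together your own by using ob_start and loading / saving from a text file.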
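A rough sketch of such a home-made cache, under a few assumptions: the cache path, the one-hour lifetime and the helper names render_image_list() and cached_page() are all hypothetical, and render_image_list() merely stands in for the regexp work.

```php
<?php
// Hypothetical cache location and lifetime (in seconds).
$cache_file = sys_get_temp_dir() . '/image_list_cache.html';
$cache_ttl  = 3600;

function render_image_list()
{
    // Stand-in for the expensive regexp work that prints the page.
    echo "<p>image list goes here</p>\n";
}

function cached_page($cache_file, $cache_ttl)
{
    // Serve the saved copy while it is still fresh.
    if (is_file($cache_file) && time() - filemtime($cache_file) < $cache_ttl) {
        return file_get_contents($cache_file);
    }

    // Otherwise capture the output with ob_start and save it for next time.
    ob_start();
    render_image_list();
    $page = ob_get_clean();
    file_put_contents($cache_file, $page);

    return $page;
}

echo cached_page($cache_file, $cache_ttl);
```

Deleting the cache file forces a rebuild on the next request.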

How does this stuff work?

First, we use preg_match_all, a function that gets every string matching the pattern and outputs it in its third parameter.

The regexps:

<img[^>]+>

We apply it to the HTML of every web page. It can be read as: every string that starts with "<img", contains one or more non-">" characters and ends with a ">".

(alt|title|src)=("[^"]*")

We then apply it to each img tag in turn. It can be read as: every string starting with "alt", "title" or "src", then a "=", then a double quote, a run of characters that are not a double quote, and a closing double quote. The parentheses isolate the sub-strings as capture groups.
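To make the two patterns concrete, here is a small sketch that runs them on the tag from the question and folds the captures into name => value pairs. The variable names ($tags, $m, $attrs, $images) and the quote stripping with trim() are illustrative additions, not part of the original answer.

```php
<?php
$html = '<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />';

// Step 1: grab every img tag; $tags[0] holds the full matches.
preg_match_all('/<img[^>]+>/i', $html, $tags);

$images = array();
foreach ($tags[0] as $img_tag) {
    // Step 2: $m[1] receives the attribute names, $m[2] the quoted values.
    preg_match_all('/(alt|title|src)=("[^"]*")/i', $img_tag, $m);

    $attrs = array();
    foreach ($m[1] as $i => $name) {
        // Strip the surrounding double quotes captured by the second group.
        $attrs[strtolower($name)] = trim($m[2][$i], '"');
    }
    $images[] = $attrs;
}

print_r($images);
```

Because each attribute is matched on its own, the order in which src, title and alt appear inside the tag does not matter.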

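Finally, every time you want to deal with regexps, it's handy to have good tools to quickly test them. Check this online regexp tester.

EDIT: answer to the first comment.

It's true that I did not think about the (hopefully few) people using single quotes.

Well, if you use only ', just replace all the " by '.

If you mix both, first you should slap yourself :-), then try to use ("|') instead of " and [^ø] to replace [^"].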
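For mixed quoting, one alternative worth sketching (a swapped-in technique, not the substitution suggested above) is to capture the opening quote and require the same character to close the value with a backreference. The sample tag and the $attrs array are made up for illustration.

```php
<?php
// Hypothetical tag mixing single and double quotes.
$img_tag = "<img src='/image/fluffybunny.jpg' title=\"Harvey's bunny\" alt='a \"cute\" bunny' />";

// Group 2 captures the opening quote, \2 demands the same quote to close,
// and the lazy (.*?) stops at the first such closing quote.
preg_match_all('/(alt|title|src)=(["\'])(.*?)\2/i', $img_tag, $m, PREG_SET_ORDER);

$attrs = array();
foreach ($m as $set) {
    // $set[1] is the attribute name, $set[3] the value without its quotes.
    $attrs[strtolower($set[1])] = $set[3];
}

print_r($attrs);
```

This still misses unquoted values such as height=32, which is one more reason the parser approach tends to be safer.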
