I'm trying to get text from a website. My code works (sort of):

library(rvest)      # read_html(), html_elements(), html_text2()
library(stringr)    # str_extract()
library(lubridate)  # mdy()

for (i in seq_len(no_urls)) {
  this_url <- urls_meetings[[i]]
  page <- read_html(this_url)

  text <- page |> html_elements("body") |> html_text2()

  # Pull a date such as "January 29, 2003" out of the first element
  date_str <- str_extract(text[1], "\\b\\w+ \\d{1,2}, \\d{4}\\b")

  # Convert to a Date, then format as "YYYYMMDD" for use as a list name
  date_1 <- format(mdy(date_str), "%Y%m%d")

  statements_list[[i]] <- text[2]
  names(statements_list)[i] <- date_1
}

The problem is the output of the line

text=page |> html_elements("body") |> html_text2()

This gives me the entire text of the page:

[1] "\r \r\r \r\nRelease Date: January 29, 2003\r\n\n\n\n\n\r For immediate release\r\n\n\r\n\n\r\r\n\n\r The Federal Open Market Committee decided today to keep its target for the federal funds rate unchanged at 1-1/4 percent. \r\n\n\r Oil price premiums and other aspects of geopolitical risks have reportedly fostered continued restraint on spending and hiring by businesses. However, the Committee believes that as those risks lift, as most analysts expect, the accommodative stance of monetary policy, coupled with ongoing growth in productivity, will provide support to an improving economic climate over time.\r\n\n\r In these circumstances, the Committee believes that, against the background of its long-run goals of price stability and sustainable economic growth and of the information currently available, the risks are balanced with respect to the prospects for both goals for the foreseeable future. \r\n\n\r Voting for the FOMC monetary policy action were Alan Greenspan, Chairman; William J. McDonough, Vice Chairman; Ben S. Bernanke, Susan S. Bies; J. Alfred Broaddus, Jr.; Roger W. Ferguson, Jr.; Edward M. Gramlich; Jack Guynn; Donald L. Kohn; Michael H. Moskow; Mark W. Olson, and Robert T. Parry. 
\r \r \r\n\n\r -----------------------------------------------------------------------------------------\r DO NOT REMOVE: Wireless Generation\r ------------------------------------------------------------------------------------------\r 2003 Monetary policy \r\n\nHome | News and \r events\nAccessibility\r\n\r Last update: January 29, 2003\r\r \r\n(function(){if (!document.body) return;var js = \"window['__CF$cv$params']={r:'8775c6b49a2a2015',t:'MTcxMzYyMjgzOC41MjIwMDA='};_cpo=document.createElement('script');_cpo.nonce='',_cpo.src='/cdn-cgi/challenge-platform/scripts/jsd/main.js',document.getElementsByTagName('head')[0].appendChild(_cpo);\";var _0xh = document.createElement('iframe');_0xh.height = 1;_0xh.width = 1;_0xh.style.position = 'absolute';_0xh.style.top = 0;_0xh.style.left = 0;_0xh.style.border = 'none';_0xh.style.visibility = 'hidden';document.body.appendChild(_0xh);function handler() {var _0xi = _0xh.contentDocument || _0xh.contentWindow.document;if (_0xi) {var _0xj = _0xi.createElement('script');_0xj.innerHTML = js;_0xi.getElementsByTagName('head')[0].appendChild(_0xj);}}if (document.readyState !== 'loading') {handler();} else if (window.addEventListener) {document.addEventListener('DOMContentLoaded', handler);} else {var prev = document.onreadystatechange || function () {};document.onreadystatechange = function (e) {prev(e);if (document.readyState !== 'loading') {document.onreadystatechange = prev;handler();}};}})();"


I only need to keep the relevant text. I have tried various things:

str_extract(text, "(?<=The Federal Open Market)(.*?)(?=Voting)")


str_match(text, "The Federal Open Market(.*?)Voting")

But both of them return an empty result.

The ideal output is:

The Federal Open Market Committee decided today to keep its target for the federal funds rate unchanged at 1-1/4 percent. \r\n\n\r Oil price premiums and other aspects of geopolitical risks have reportedly fostered continued restraint on spending and hiring by businesses. However, the Committee believes that as those risks lift, as most analysts expect, the accommodative stance of monetary policy, coupled with ongoing growth in productivity, will provide support to an improving economic climate over time.\r\n\n\r In these circumstances, the Committee believes that, against the background of its long-run goals of price stability and sustainable economic growth and of the information currently available, the risks are balanced with respect to the prospects for both goals for the foreseeable future.

Recommended answer

The . character does not match new lines by default

Your pattern doesn't work because your string contains newlines. By definition, the . metacharacter matches any character except a newline.

Here is a shorter example:

txt <- "there are some\r\nwords here"
str_extract(txt, "some.+words")
# [1] NA

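Overriding the default

To override the default in stringr::str_extract(), you need to wrap the pattern in stringr::regex() with the relevant option. In this case, setting dotall = TRUE allows . to match everything, including \n: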

str_extract(txt, regex("some.+words", dotall = TRUE))
# [1] "some\r\nwords"
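As a side note, the same effect can be had without regex(). The sketch below reuses the txt example and shows two alternatives that are not part of the answer above: the inline flag (?s), which the ICU regex engine behind stringr accepts, and the character class [\s\S], which matches any character regardless of flags.

```r
library(stringr)

txt <- "there are some\r\nwords here"

# (?s) is the inline form of dotall: "." may now match line breaks too.
inline_match <- str_extract(txt, "(?s)some.+words")

# [\s\S] means "whitespace or non-whitespace", i.e. any character at all,
# so no flag is needed.
class_match <- str_extract(txt, "some[\\s\\S]+words")

inline_match
# [1] "some\r\nwords"
identical(inline_match, class_match)
# [1] TRUE
```

regex(..., dotall = TRUE) is arguably the most readable of the three; the inline flag is mainly handy when the pattern is stored as a plain string.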

Or, for your string:

str_extract(text, regex("(?<=The Federal Open Market)(.*?)(?=Voting)", dotall = TRUE))  |> 
    trimws()
# [1] "Committee decided today to keep its target for the federal funds rate unchanged at 1-1/4 percent. \r\n\n\r Oil price premiums and other aspects of geopolitical risks have reportedly fostered continued restraint on spending and hiring by businesses. However, the Committee believes that as those risks lift, as most analysts expect, the accommodative stance of monetary policy, coupled with ongoing growth in productivity, will provide support to an improving economic climate over time.\r\n\n\r In these circumstances, the Committee believes that, against the background of its long-run goals of price stability and sustainable economic growth and of the information currently available, the risks are balanced with respect to the prospects for both goals for the foreseeable future."

I also pass the result to trimws() to remove leading and trailing whitespace.
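Because the lookbehind consumes "The Federal Open Market", that phrase is missing from the result, whereas the ideal output in the question starts with it. A minimal sketch of one way to keep it follows: put the phrase inside the match and keep only the lookahead. The page_text value is a hypothetical, shortened stand-in for the scraped page, and str_squish() is an optional extra step that collapses the \r and \n runs into single spaces.

```r
library(stringr)

# Hypothetical, shortened stand-in for the scraped page text.
page_text <- paste0(
  "Release Date: January 29, 2003\r\n\n",
  "The Federal Open Market Committee decided today.\r\n\n\r ",
  "The risks are balanced. \r\n\n\r ",
  "Voting for the FOMC monetary policy action were Alan Greenspan, Chairman."
)

statement <- page_text |>
  # Match from the opening phrase up to (but not including) "Voting".
  str_extract(regex("The Federal Open Market.*?(?=Voting)", dotall = TRUE)) |>
  # Collapse every run of whitespace to one space and trim both ends.
  str_squish()

statement
# [1] "The Federal Open Market Committee decided today. The risks are balanced."
```

If the original line breaks matter, trimws() can be used in place of str_squish().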

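Other multiline options

If you want your regular expression to handle multiple lines more generally (not just through the . metacharacter), you can use regex(pattern, multiline = TRUE). From the stringr docs:

For multiline strings, you can use regex(multiline = TRUE). This changes the behaviour of ^ and $, and introduces three new operators: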

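  • ^ now matches the start of each line.
  • $ now matches the end of each line.
  • \A matches the start of the input.
  • \z matches the end of the input.
  • \Z matches the end of the input, but before the final line terminator, if it exists.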

See the stringr docs for more information.
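To illustrate the anchors listed above, here is a small sketch. The lines_txt string is a made-up three-line example, not taken from the scraped page.

```r
library(stringr)

# Hypothetical three-line input.
lines_txt <- "Release Date: January 29, 2003\nFor immediate release\nLast update: January 29, 2003"

# Default behaviour: ^ anchors only at the very start of the string.
first_only <- str_extract_all(lines_txt, "^\\w+")[[1]]
first_only
# [1] "Release"

# multiline = TRUE: ^ anchors at the start of every line.
each_line <- str_extract_all(lines_txt, regex("^\\w+", multiline = TRUE))[[1]]
each_line
# [1] "Release" "For"     "Last"
```

Note that multiline only changes the anchors. It does not let . cross line breaks, so it would not fix the original pattern on its own; dotall = TRUE is the relevant option there.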
