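Hey, I want to scrape movies from this website, https://www.yidio.com/movies, but the page only contains a limited number of movies. I found that when scrolling down in the browser, the HTML changes and the hidden movies appear; each scroll seems to add another batch of movies.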

I want to scrape at least 1000 movies. How can I do that with BeautifulSoup?

Here is some of my Python code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Define the URL of the movie page
list_url = 'https://www.yidio.com/movies'

# Send a GET request to the URL
response = requests.get(list_url)

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'lxml')

# Find all the movie links on the page
movie_links = soup.find_all('a', class_='card movie')

# Extract the movie URLs
urls = [link['href'] for link in movie_links]

movies_info = {
    'Title': [],
    'Genres': [],
    'Cast': [],
    'Director': [],
    'Release Date': [],
    'MPAA Rating': [],
    'Runtime': [],
    'Language': [],
    'IMDB Rating': [],
    'Metascore': []
}


def get_movie_info(url):
    # Send a GET request to the URL
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.content, 'lxml')

        # Extract specific information from the page
        '''
        Extract movie title, genres, cast, director, release date, MPAA rating, runtime, language,
        IMDB rating, meta score
        '''
        # extract title of the movie
        movie_title = soup.find('h1').text.strip().split(' ', 1)[1]
        print('processing movie ' + movie_title + '...')

'''
some other code ...
'''
def main():
    # Iterate over the URLs and scrape movie information
    for url in urls:
        get_movie_info(url)

    # Create a DataFrame from the movie information
    movies_df = pd.DataFrame(movies_info)
    movies_df = movies_df.dropna(subset=['MPAA Rating', 'IMDB Rating', 'Metascore'])
    # Save the movie information to a CSV file with index column and named 'id'
    movies_df.to_csv('movies_info.csv', index_label='id')
    print('==========================================================')
    print('\nMovie information saved to movies_info.csv successfully.')


if __name__ == '__main__':
    main()


Recommended answer

To get the names and URLs of all the movies, you can use the site's pagination API:

import requests
import pandas as pd


api_url = 'https://www.yidio.com/redesign/json/browse_results.php'
params = {"type": "movie", "index": "0", "limit": "100"}

all_data = []
for params['index'] in range(0, 1500, 100):   # <-- increase the max index here (0, 100, 200, 300, ...)
    print(f"{params['index']=}")
    data = requests.get(api_url, params=params).json()
    all_data.extend(data['response'])

df = pd.DataFrame(all_data)
print(df.tail())

Prints:

                         name   type      id                                                     url                                                   image
1495                  Sabrina  movie   14361               https://www.yidio.com/movie/sabrina/14361   //cfm.yidio.com/images/movie/14361/poster-193x290.jpg
1496  When Harry Met Sally...  movie   11237  https://www.yidio.com/movie/when-harry-met-sally/11237   //cfm.yidio.com/images/movie/11237/poster-193x290.jpg
1497              Roll Bounce  movie   23861           https://www.yidio.com/movie/roll-bounce/23861   //cfm.yidio.com/images/movie/23861/poster-193x290.jpg
1498              The Holiday  movie   24956           https://www.yidio.com/movie/the-holiday/24956   //cfm.yidio.com/images/movie/24956/poster-193x290.jpg
1499               Juneteenth  movie  229233           https://www.yidio.com/movie/juneteenth/229233  //cfm.yidio.com/images/movie/229233/poster-193x290.jpg
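
Since the question asks for at least 1000 movies rather than a fixed index range, the same paging idea can be wrapped in a loop that stops once the target is reached or the API returns an empty page. This is only a minimal sketch: the function names `fetch_page` and `collect_movies` are hypothetical, and it assumes the endpoint keeps accepting the `type`, `index`, and `limit` parameters and returning its rows under the `response` key, as in the answer above.

```python
API_URL = "https://www.yidio.com/redesign/json/browse_results.php"


def fetch_page(index, limit=100):
    """Fetch one page of movie rows from the browse API (hypothetical helper)."""
    # Imported here so the paging logic below can be exercised
    # with a fake fetcher, without touching the network.
    import requests

    params = {"type": "movie", "index": index, "limit": limit}
    resp = requests.get(API_URL, params=params, timeout=30)
    resp.raise_for_status()
    # Assumption: rows live under the "response" key, as in the answer above.
    return resp.json().get("response", [])


def collect_movies(target, limit=100, fetch=fetch_page):
    """Page through the API until `target` rows are collected or a page is empty."""
    rows = []
    index = 0
    while len(rows) < target:
        page = fetch(index, limit)
        if not page:
            # An empty page is treated as the end of the catalogue.
            break
        rows.extend(page)
        index += limit
    # Trim any surplus rows from the last page.
    return rows[:target]
```

Calling `collect_movies(1000)` should then return a list of dicts, and `[m["url"] for m in movies]` gives the detail-page URLs that can be passed to `get_movie_info` from the question in place of the links scraped from the listing page.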
