Scraping movies with BeautifulSoup in Python

Posted on July 18

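Hey, I want to scrape movies from https://www.yidio.com/movies, but the page only gives me a limited number of movies and nothing more. I noticed that when I scroll down in the browser, the HTML changes and the hidden movies appear; each scroll seems to load another batch of movies.

I want to scrape at least 1000 movies. How can I do that with BeautifulSoup?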

Here is some of my Python code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Define the URL of the movie page
list_url = 'https://www.yidio.com/movies'

# Send a GET request to the URL
response = requests.get(list_url)

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'lxml')

# Find all the movie links on the page
movie_links = soup.find_all('a', class_='card movie')

# Extract the movie URLs
urls = [link['href'] for link in movie_links]

movies_info = {
    'Title': [],
    'Genres': [],
    'Cast': [],
    'Director': [],
    'Release Date': [],
    'MPAA Rating': [],
    'Runtime': [],
    'Language': [],
    'IMDB Rating': [],
    'Metascore': []
}


def get_movie_info(url):
    # Send a GET request to the URL
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.content, 'lxml')

        # Extract specific information from the page
        '''
        Extract movie title, genres, cast, director, release date, MPAA rating, runtime, language,
        IMDB rating, meta score
        '''
        # extract title of the movie
        movie_title = soup.find('h1').text.strip().split(' ', 1)[1]
        print('processing movie ' + movie_title + '...')

'''
some other code ...
'''
def main():
    # Iterate over the URLs and scrape movie information
    for url in urls:
        get_movie_info(url)

    # Create a DataFrame from the movie information
    movies_df = pd.DataFrame(movies_info)
    movies_df = movies_df.dropna(subset=['MPAA Rating', 'IMDB Rating', 'Metascore'])
    # Save the movie information to a CSV file with index column and named 'id'
    movies_df.to_csv('movies_info.csv', index_label='id')
    print('==========================================================')
    print('\nMovie information saved to movies_info.csv successfully.')


if __name__ == '__main__':
    main()

Recommended answer

import requests
import pandas as pd

api_url = 'https://www.yidio.com/redesign/json/browse_results.php'

params = {"type": "movie", "index": "0", "limit": "100"}

all_data = []
for params['index'] in range(0, 1500, 100):  # <-- increase the max index here (0, 100, 200, 300, ...)
    print(f"{params['index']=}")
    data = requests.get(api_url, params=params).json()
    all_data.extend(data['response'])

df = pd.DataFrame(all_data)
print(df.tail())

Prints:

      name                     type   id      url                                                     image
1495  Sabrina                  movie  14361   https://www.yidio.com/movie/sabrina/14361               //cfm.yidio.com/images/movie/14361/poster-193x290.jpg
1496  When Harry Met Sally...  movie  11237   https://www.yidio.com/movie/when-harry-met-sally/11237  //cfm.yidio.com/images/movie/11237/poster-193x290.jpg
1497  Roll Bounce              movie  23861   https://www.yidio.com/movie/roll-bounce/23861           //cfm.yidio.com/images/movie/23861/poster-193x290.jpg
1498  The Holiday              movie  24956   https://www.yidio.com/movie/the-holiday/24956           //cfm.yidio.com/images/movie/24956/poster-193x290.jpg
1499  Juneteenth               movie  229233  https://www.yidio.com/movie/juneteenth/229233           //cfm.yidio.com/images/movie/229233/poster-193x290.jpg
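If the goal is to feed these results back into the per-movie scraper from the question, something along these lines might work. It is only a sketch: it assumes the endpoint keeps returning its results under the `response` key and that an empty or missing list means there are no more movies. The helper names `fetch_page`, `collect_movies` and `movie_urls` are made up for illustration and are not part of any library.

```python
from typing import Callable, Dict, List

# Endpoint taken from the answer; its behaviour may change without notice.
API_URL = 'https://www.yidio.com/redesign/json/browse_results.php'


def fetch_page(index: int, limit: int) -> List[Dict]:
    """Fetch one page of results from the JSON endpoint (performs a network request)."""
    import requests  # imported here so the paging logic can be exercised offline

    params = {'type': 'movie', 'index': index, 'limit': limit}
    response = requests.get(API_URL, params=params, timeout=30)
    response.raise_for_status()
    # Assumption: results live under the 'response' key, as in the answer's code.
    return response.json().get('response', [])


def collect_movies(wanted: int, limit: int = 100,
                   fetch: Callable[[int, int], List[Dict]] = fetch_page) -> List[Dict]:
    """Request pages until `wanted` movies are collected or a page comes back empty."""
    movies: List[Dict] = []
    index = 0
    while len(movies) < wanted:
        page = fetch(index, limit)
        if not page:  # assumed end-of-data signal: stop instead of looping forever
            break
        movies.extend(page)
        index += limit
    return movies[:wanted]


def movie_urls(movies: List[Dict]) -> List[str]:
    """Pull the detail-page URL (the 'url' column in the output) out of each record."""
    return [m['url'] for m in movies if m.get('url')]
```

With that in place, `urls = movie_urls(collect_movies(1000))` could replace the `soup.find_all('a', class_='card movie')` step in the question, and the existing loop in `main()` would then call `get_movie_info(url)` on each detail page. Passing a different `fetch` function makes it possible to try the paging logic without hitting the site.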
