Hey, I want to scrape movies from https://www.yidio.com/movies, but the page only shows a limited number of movies at first. I noticed that when I scroll down in the browser, the HTML changes and the hidden movies appear, so additional movies seem to be loaded dynamically.
I want to scrape at least 1000 movies. How can I do that with BeautifulSoup?
Here is some of my Python code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Define the URL of the movie list page
list_url = 'https://www.yidio.com/movies'

# Send a GET request to the URL
response = requests.get(list_url)

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'lxml')

# Find all the movie links on the page
movie_links = soup.find_all('a', class_='card movie')

# Extract the movie URLs
urls = [link['href'] for link in movie_links]

movies_info = {
    'Title': [],
    'Genres': [],
    'Cast': [],
    'Director': [],
    'Release Date': [],
    'MPAA Rating': [],
    'Runtime': [],
    'Language': [],
    'IMDB Rating': [],
    'Metascore': []
}

def get_movie_info(url):
    # Send a GET request to the URL
    response = requests.get(url)
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.content, 'lxml')
        '''
        Extract movie title, genres, cast, director, release date, MPAA rating,
        runtime, language, IMDB rating, Metascore
        '''
        # Extract the title of the movie
        movie_title = soup.find('h1').text.strip().split(' ', 1)[1]
        print('processing movie ' + movie_title + '...')
        '''
        some other code ...
        '''

def main():
    # Iterate over the URLs and scrape movie information
    for url in urls:
        get_movie_info(url)
    # Create a DataFrame from the movie information
    movies_df = pd.DataFrame(movies_info)
    movies_df = movies_df.dropna(subset=['MPAA Rating', 'IMDB Rating', 'Metascore'])
    # Save the movie information to a CSV file with the index column named 'id'
    movies_df.to_csv('movies_info.csv', index_label='id')
    print('==========================================================')
    print('\nMovie information saved to movies_info.csv successfully.')

if __name__ == '__main__':
    main()
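The scroll behaviour described above usually means the page fetches additional batches of cards in the background. Whatever the real data source turns out to be, the accumulation logic is the same: keep requesting batches until the target count is reached or no more data comes back. A minimal sketch, where `fetch_batch` is a stand-in callable for the real paginated request (its name and signature are my assumption, not Yidio's API):

```python
def collect_items(fetch_batch, target=1000):
    """Keep fetching batches until we have `target` items or a batch is empty."""
    items = []
    page = 0
    while len(items) < target:
        batch = fetch_batch(page)   # stand-in for the real paginated request
        if not batch:               # no more data: stop early
            break
        items.extend(batch)
        page += 1
    return items[:target]

# Demo with a fake fetcher that serves 3 pages of 4 items each
def fake_fetch(page):
    return [f'movie-{page}-{i}' for i in range(4)] if page < 3 else []

print(len(collect_items(fake_fetch, target=10)))  # -> 10
```

In your script, the real `fetch_batch` would issue the request that the browser makes while scrolling (visible in the Network tab of the developer tools) and return the parsed links from that batch.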
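One practical note: if the extra movies do arrive via repeated batch requests, successive batches can overlap, so the collected links may contain duplicates. A small order-preserving dedupe before looping over `urls` avoids scraping the same movie page twice:

```python
def dedupe_keep_order(urls):
    """Drop duplicate URLs while keeping first-seen order."""
    seen = set()
    unique = []
    for u in urls:
        if u not in seen:
            seen.add(u)
            unique.append(u)
    return unique

print(dedupe_keep_order(['a', 'b', 'a', 'c', 'b']))  # -> ['a', 'b', 'c']
```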