# Crawler With ChatGPT

## Conversation

- 🤡: Write a crawler in Python that can take a URL as input
- 🤖: Gave an example using Beautiful Soup
- 🤡: How do I capture content dynamically generated by JS?
- 🤖: You can use Selenium, or Scrapy + Splash
- 🤡: How do I use Scrapy + Splash?
- 🤖: Here is an example script using Scrapy

After running it, I found that all logs were printed to the screen, so I checked the configuration options on the official website and found that you can specify the log file, crawl depth, priority, and so on.
I then asked GPT for some configuration examples with annotations, and after organizing and refining everything, I ended up with the following final prompt.
Write a Python crawler script
- It takes 4 input parameters in total
- The 1st parameter: the URL to start crawling from
- The 2nd parameter: the keywords to match; multiple keywords are supported
- The 3rd parameter: whether multiple keywords are combined with an AND or OR relationship
- The 4th parameter: the crawl depth
- When a webpage matches multiple keywords, record it only once
- Output the URLs of matched pages to the screen, one per line
- Use Splash and Scrapy
- The Scrapy project name is mycrawler
- The Scrapy log level is INFO and the log file is spider.log
- Keep the Scrapy configuration in a separate settings.py
- Add crawl-depth-related configuration to settings.py
- Add a Chinese comment to each configuration item in settings.py
## ChatGPT Answer
Below is ChatGPT's response, with some modifications and optimizations of my own.
### Install Splash
Install Scrapy-Splash and the Splash service:
```bash {frame="none"}
pip install scrapy scrapy-splash
```

### Create a Scrapy Project

First, create a new Scrapy project from the command line:

```bash {frame="none"}
scrapy startproject mycrawler
```

### Run Splash in Docker

```bash {frame="none"}
docker run -p 8050:8050 scrapinghub/splash
```
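Optionally, before wiring Splash into Scrapy, you can confirm that the container is reachable by calling Splash's `render.html` HTTP endpoint directly. Below is a minimal sketch using the third-party `requests` library (not part of the original setup); the target URL is just a placeholder:

```python {frame="none"}
import requests

# Ask Splash to render a page and return the resulting HTML.
# The 'wait' parameter gives the page time to run its JavaScript first.
resp = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'http://example.com', 'wait': 1},
    timeout=30,
)
resp.raise_for_status()
print(resp.text[:200])  # first part of the rendered HTML
```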
### Modify settings.py

```python {frame="none"}
# -*- coding: utf-8 -*-

# Scrapy settings for the mycrawler project

BOT_NAME = 'mycrawler'

SPIDER_MODULES = ['mycrawler.spiders']
NEWSPIDER_MODULE = 'mycrawler.spiders'

# Use Splash to render JavaScript pages
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Crawl depth settings
DEPTH_LIMIT = 3     # Maximum crawl depth
DEPTH_STATS = True  # Collect crawl depth statistics
DEPTH_PRIORITY = 1  # Adjust priority by depth: a positive value favours breadth-first, a negative value depth-first

# Log settings
LOG_LEVEL = 'INFO'       # Record logs at INFO level and above
LOG_FILE = 'spider.log'  # Path of the log file

# Other configurations
ROBOTSTXT_OBEY = True  # Whether the spider obeys robots.txt rules (default is True)

# Set a User-Agent that simulates a common browser
#USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
```
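A note on `DEPTH_PRIORITY`: on its own it only reweights request priorities by depth. If you actually want breadth-first order, the Scrapy documentation's broad-crawl recipe combines it with FIFO scheduler queues. A sketch of that optional variant for settings.py, using the queue class paths given in the Scrapy docs:

```python {frame="none"}
# Optional: true breadth-first crawl order (per the Scrapy "broad crawls" recipe).
# A positive DEPTH_PRIORITY lowers the priority of deeper requests,
# and FIFO queues make the scheduler process requests in arrival order.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
```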
### Create a Spider

Create a new spider file `my_spider.py` in the `mycrawler/spiders` directory with the following content:
```python {frame="none"}
import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = "my_spider"  # Name of the spider

    def __init__(self, start_url, keywords, logic, depth, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url]        # Starting URL to crawl
        self.keywords = keywords.split(',')  # Split the keyword string into a list
        self.logic = logic                   # Logical relationship for keyword matching ('and' / 'or')
        # Crawl depth limit from the -a depth argument. It is enforced manually in
        # parse(), because custom_settings only takes effect as a class attribute,
        # not when assigned in __init__.
        self.max_depth = int(depth)

    def start_requests(self):
        # Generate the initial requests, rendered through Splash
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 1})

    def parse(self, response):
        # Parse the response: get the page text and convert it to lowercase
        page_text = response.text.lower()
        # Check whether each keyword appears in the page text
        matches = [kw.lower() in page_text for kw in self.keywords]
        # Decide whether the page matches, based on the logical relationship
        if (self.logic == 'and' and all(matches)) or (self.logic == 'or' and any(matches)):
            # Record the matching URL only once per page
            self.log(f'Matched URL: {response.url}')
            print(response.url)
        # Follow all links on the page and crawl recursively,
        # stopping once the requested depth is reached
        if response.meta.get('depth', 0) < self.max_depth:
            for next_page in response.css('a::attr(href)').getall():
                if next_page:
                    yield SplashRequest(response.urljoin(next_page), self.parse, args={'wait': 1})
```
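The matching rule itself boils down to `all()` versus `any()` over per-keyword membership tests. A tiny standalone illustration of that logic (the sample text and keywords below are made up):

```python {frame="none"}
# Hypothetical sample data, just to show the AND/OR decision.
page_text = 'splash renders javascript before scrapy parses the page'.lower()
keywords = ['scrapy', 'selenium']

matches = [kw.lower() in page_text for kw in keywords]

print(all(matches))  # 'and' logic -> False, because 'selenium' is missing
print(any(matches))  # 'or' logic  -> True, because 'scrapy' is present
```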
### Run the Spider

Run the spider from the command line:

```bash {frame="none"}
scrapy crawl my_spider -a start_url='http://example.com' -a keywords='keyword1,keyword2' -a logic='and' -a depth=3
```
This script will crawl the web pages, match the keywords, and output the matching URLs based on the input parameters.
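If you prefer to launch the spider from a Python script instead of the scrapy CLI, Scrapy's `CrawlerProcess` can pass the same arguments as keyword arguments. A sketch assuming it is run from the project directory; the URL and keywords are the same placeholders as above:

```python {frame="none"}
# run.py - launch the spider programmatically instead of via "scrapy crawl"
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # load settings.py
process.crawl(
    'my_spider',                  # spider name, as defined in MySpider.name
    start_url='http://example.com',
    keywords='keyword1,keyword2',
    logic='and',
    depth=3,
)
process.start()  # block until the crawl finishes
```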
## Summary

ChatGPT plays more of a guiding role. Sometimes you need a tool but don't yet know what it will look like in the end. As long as you keep the conversation going, step by step, it will eventually give you the answer you want. You can also learn a lot from this process, as long as you know how to ask questions.