To implement paginated crawling in Python, design the request logic around the target site's pagination mechanism, then collect data across multiple pages with a loop or recursion. The concrete steps and code examples follow.

1. Identify the pagination mechanism

- GET-parameter pagination: the URL carries a page=N parameter (e.g. https://example.com?page=2).
- Path pagination: the page number is part of the URL path (e.g. https://example.com/page/2/).
- Ajax dynamic loading: inspect the network requests (XHR) to find the data API (e.g. https://example.com/api/data?page=2).
- Button/link pagination: clicking has to be simulated, e.g. with Selenium (see the sketch below).

How to detect which one applies:

- Browse the site manually and watch how the URL or the network requests change.
- Use the browser developer tools (Network tab) to inspect the requests issued when you switch pages.
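For the button/link case, where turning the page does not expose a usable URL pattern, browser automation is usually the simplest route. Below is a minimal Selenium sketch, not part of the original examples: the target URL, the ".next-page" button selector, the ".item .title" data selector, and the fixed delay are all illustrative assumptions to be adapted to the real page.

```python
# Click-driven pagination with Selenium (hypothetical selectors).
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import time

driver = webdriver.Chrome()  # requires a matching ChromeDriver on PATH
driver.get("https://example.com/data")

try:
    while True:
        # Extract data from the currently loaded page
        for title in driver.find_elements(By.CSS_SELECTOR, ".item .title"):
            print(title.text)
        try:
            next_button = driver.find_element(By.CSS_SELECTOR, ".next-page")
        except NoSuchElementException:
            break  # no "next page" button -> last page reached
        next_button.click()
        time.sleep(2)  # crude wait for the next page to render
finally:
    driver.quit()
```

An explicit WebDriverWait on the newly loaded content is more robust than a fixed sleep, but the fixed delay keeps the sketch short.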
2. Construct the pagination requests

Build the request URL or parameters according to the mechanism you identified. Code examples for the common scenarios:

Scenario 1: GET-parameter pagination

```python
import requests

def get_page_data(url, page):
    params = {"page": page}  # GET parameter carrying the page number
    response = requests.get(url, params=params)
    if response.status_code == 200:
        return response.text
    else:
        print(f"Request failed, status code: {response.status_code}")
        return None

# Example call
base_url = "https://example.com/data"
for page in range(1, 6):  # crawl the first 5 pages
    html = get_page_data(base_url, page)
    # parse html ...
```

Scenario 2: path pagination

```python
def get_page_data_path(base_url, page):
    url = f"{base_url}/page/{page}/"  # page number embedded in the path
    response = requests.get(url)
    return response.text if response.ok else None

# Example call
base_url = "https://example.com"
for page in range(1, 6):
    html = get_page_data_path(base_url, page)
    # parse html ...
```

Scenario 3: Ajax API pagination

```python
def get_ajax_data(api_url, page):
    params = {"page": page, "size": 10}  # assume the API takes page and size parameters
    response = requests.get(api_url, params=params)
    return response.json() if response.ok else None

# Example call
api_url = "https://example.com/api/data"
for page in range(1, 6):
    data = get_ajax_data(api_url, page)
    # process the JSON data ...
```

3. Parse each page

Use a parsing library (such as BeautifulSoup, lxml, or json) to extract the data; an lxml-based sketch appears at the end of the article. Example:

```python
from bs4 import BeautifulSoup

def parse_html(html):
    soup = BeautifulSoup(html, "html.parser")
    items = soup.select(".item")  # assume the data sits in tags with class "item"
    for item in items:
        title = item.select_one(".title").text
        print(title)

# Call it after fetching the HTML
html = get_page_data("https://example.com/data", 1)
if html:
    parse_html(html)
```

4. Iterate over all pages

Crawl multiple pages with a loop or recursion, and define a stop condition (a maximum page number, or "no more data").

Method 1: fixed page range

```python
def crawl_fixed_pages(base_url, max_page):
    for page in range(1, max_page + 1):
        html = get_page_data(base_url, page)
        if html:
            parse_html(html)
        else:
            break  # stop when a request fails

crawl_fixed_pages("https://example.com/data", 5)
```

Method 2: detect the stop condition dynamically

Some sites return empty data or a specific marker on the last page; check the parsed result to decide when to stop:

```python
def crawl_dynamic_pages(base_url):
    page = 1
    while True:
        html = get_page_data(base_url, page)
        if not html:
            break
        items = parse_html(html)  # assume parse_html returns the extracted list
        if not items:  # stop when there is no more data
            break
        page += 1

crawl_dynamic_pages("https://example.com/data")
```

Notes

- Anti-crawling measures:
  - Set request headers (e.g. User-Agent, Referer).
  - Throttle the crawl rate (e.g. time.sleep(2)).
  - Use a proxy IP pool (e.g. requests.get(url, proxies={"http": "ip:port"})).
- Page structure changes: check periodically that the parsing logic still matches the pages.
- Concurrency: use threading or asyncio to issue requests concurrently (a hedged asyncio sketch appears at the end of the article).

Multithreaded example:

```python
import threading
from queue import Queue

def worker(url_queue):
    while True:
        page = url_queue.get()
        html = get_page_data("https://example.com/data", page)
        if html:
            parse_html(html)
        url_queue.task_done()

url_queue = Queue()
for page in range(1, 6):
    url_queue.put(page)

# Daemon threads so the program can exit once the queue has been drained
threads = [threading.Thread(target=worker, args=(url_queue,), daemon=True) for _ in range(5)]
for t in threads:
    t.start()
url_queue.join()
```

Complete example

```python
import requests
from bs4 import BeautifulSoup
import time

def get_page_data(url, page):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    }
    params = {"page": page}
    try:
        response = requests.get(url, params=params, headers=headers, timeout=10)
        return response.text if response.ok else None
    except requests.RequestException as e:
        print(f"Request error: {e}")
        return None

def parse_html(html):
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for item in soup.select(".item"):
        title = item.select_one(".title").text.strip()
        items.append(title)
    return items

def crawl_pages(base_url, max_page=None):
    page = 1
    all_data = []
    while True:
        if max_page is not None and page > max_page:
            break
        html = get_page_data(base_url, page)
        if not html:
            break
        data = parse_html(html)
        if not data:  # stop when there is no more data
            break
        all_data.extend(data)
        print(f"Crawled page {page}, {len(data)} items")
        page += 1
        time.sleep(1)  # throttle to avoid anti-crawling measures
    return all_data

# Example call
data = crawl_pages("https://example.com/data", max_page=5)
print(f"Total items: {len(data)}")
```

With the steps above you can implement stable paginated crawling. In real-world development, adjust the request parameters, parsing logic, and anti-crawling strategy to the characteristics of the target site.
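Following up on the concurrency note above: the multithreaded example can also be written with asyncio. The sketch below is not part of the original examples; it assumes the third-party aiohttp package (pip install aiohttp), reuses the same placeholder URL, and is only one possible way to structure the coroutines.

```python
# Concurrent page fetching with asyncio + aiohttp (a hedged sketch).
import asyncio
import aiohttp

async def fetch_page(session, url, page):
    # Fetch a single page; return its HTML text or None on a non-200 status
    async with session.get(url, params={"page": page}) as response:
        if response.status == 200:
            return await response.text()
        return None

async def crawl(base_url, max_page):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_page(session, base_url, page) for page in range(1, max_page + 1)]
        pages = await asyncio.gather(*tasks)
        # Drop failed requests; parse_html() from the complete example can be applied here
        return [html for html in pages if html]

# Example call
pages = asyncio.run(crawl("https://example.com/data", 5))
print(f"Fetched {len(pages)} pages")
```

Firing all page requests at once can trigger anti-crawling measures; an asyncio.Semaphore or a small delay between tasks throttles the concurrency if needed.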

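Section 3 mentions lxml as an alternative parser; for completeness, here is a hedged sketch of the same parsing step using lxml and XPath. The "item" and "title" class names are the same assumed placeholders as in the BeautifulSoup example.

```python
# Alternative to parse_html() based on lxml and XPath (assumed class names).
from lxml import html as lxml_html

def parse_html_lxml(page_html):
    tree = lxml_html.fromstring(page_html)
    # Take the text of every ".title" node nested inside an ".item" node
    titles = tree.xpath('//*[contains(@class, "item")]//*[contains(@class, "title")]/text()')
    return [t.strip() for t in titles if t.strip()]

# Example call, reusing get_page_data() defined earlier in the article
page_html = get_page_data("https://example.com/data", 1)
if page_html:
    print(parse_html_lxml(page_html))
```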


































