Document Parsing in Practice: A Guide to Cleaning and Extracting PDF, Word, and HTML

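Date: 2026-07-01 15:06

This article covers the three mainstream document formats: PDF, Word, and HTML. It walks through environment setup, core library installation, text and table extraction, and noise cleaning, and ends with a unified processing pipeline for parsing documents efficiently.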

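Document parsing sounds like grunt work, but the right tools and methods save a great deal of time. This article works through the three most common document formats, from environment setup and table extraction to encoding repair, and breaks down each key step of document cleaning and data extraction.

① Environment setup and core library installation

Set up the runtime environment first. A virtual environment is recommended to isolate project dependencies and avoid library version conflicts between projects.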

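    # Create a virtual environment (Python 3.8+)
    python -m venv doc_parser_env

    # Activate it (Windows)
    doc_parser_env\Scripts\activate

    # Activate it (Mac/Linux)
    source doc_parser_env/bin/activate

Then install the core libraries used below in one go. The OCR, .doc, and progress-bar examples later in the article additionally use pdf2image, pytesseract, pywin32, and tqdm, which this command does not install: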

pip install pdfplumber pypdf python-docx beautifulsoup4 lxml openpyxl pandas camelot-py tabula-py ftfy chardet

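A quick note on what each library is for:

pdfplumber: the main tool for extracting text and tables from PDFs.
pypdf: structural PDF operations such as merging, splitting, rotating, and encrypting.
python-docx: reads and writes Word (.docx) documents.
beautifulsoup4: the core library for HTML parsing and tag stripping.
lxml: a faster parser backend for BeautifulSoup.
openpyxl: reads and writes Excel files, used here to export table data.
pandas: data processing, used alongside table extraction.
camelot / tabula-py: dedicated PDF table extraction tools.
ftfy: repairs garbled (mojibake) text.
chardet: detects file encodings.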
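Several of these packages are imported under a different name than the one passed to pip, which is a common source of confusion right after installation. The following is a minimal sketch of a sanity check using only the standard library; the helper name `find_missing` and the mapping are illustrative, not part of any of these libraries.

```python
import importlib.util

# pip name -> import name. Some differ: python-docx is imported as docx,
# beautifulsoup4 as bs4, camelot-py as camelot, tabula-py as tabula.
REQUIRED_MODULES = {
    "pdfplumber": "pdfplumber",
    "pypdf": "pypdf",
    "python-docx": "docx",
    "beautifulsoup4": "bs4",
    "lxml": "lxml",
    "openpyxl": "openpyxl",
    "pandas": "pandas",
    "camelot-py": "camelot",
    "tabula-py": "tabula",
    "ftfy": "ftfy",
    "chardet": "chardet",
}


def find_missing(modules=None):
    """Return the pip names whose import module cannot be located."""
    modules = REQUIRED_MODULES if modules is None else modules
    missing = []
    for pip_name, module_name in modules.items():
        # find_spec returns None for a top-level module that is not installed.
        if importlib.util.find_spec(module_name) is None:
            missing.append(pip_name)
    return missing


if __name__ == "__main__":
    missing = find_missing()
    print("All libraries found." if not missing else "Missing: " + ", ".join(missing))
```

Run it inside the activated virtual environment; anything it lists can be passed straight back to `pip install`.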

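② The three document formats at a glance

Before writing any code, take two minutes to understand how these three formats differ at their core. It will spare you many detours.

PDF (Portable Document Format): PDF is designed for "what you see is what you get": the layout stays identical on any device. The price is that the text inside a PDF is not necessarily stored in reading order; it may be scattered character fragments positioned by coordinates. That is why text read directly from a PDF often comes out in the wrong order. PDFs also come in two kinds: text-based (exported from Word or similar, with selectable text) and scanned (essentially images).

Word (.docx): A .docx file is essentially a ZIP archive containing a set of XML files. It is therefore structured by nature: paragraphs are paragraphs, tables are tables, headings are headings. Reading it with python-docx yields fairly regular results.

HTML: HTML is also structured, expressing hierarchy through nested tags. But web pages are cluttered with styles, scripts, ads, and other noise, so the goal is to strip the tags and keep only the valuable body text.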
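The point about .docx being a ZIP of XML files can be checked with nothing but the standard library. The sketch below reads paragraph text directly from `word/document.xml`; `make_toy_docx` is a hypothetical helper that builds a stripped-down stand-in archive for demonstration (it contains only that one XML part, so Word itself would not open it).

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(docx_bytes):
    """Read paragraph text straight from word/document.xml inside the ZIP."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        xml_data = zf.read("word/document.xml")
    root = ET.fromstring(xml_data)
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over one or more <w:t> runs.
        runs = [t.text or "" for t in p.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs


def make_toy_docx(paragraphs):
    """Build a minimal stand-in archive (plain text only, no XML escaping)."""
    body = "".join(f"<w:p><w:r><w:t>{p}</w:t></w:r></w:p>" for p in paragraphs)
    xml_data = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", xml_data)
    return buf.getvalue()


if __name__ == "__main__":
    print(docx_paragraphs(make_toy_docx(["Hello", "World"])))
```

In practice python-docx does this work for you and also handles styles, tables, and relationships; the sketch only illustrates why the format is easy to parse.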

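③ PDF content extraction and noise cleaning

3.1 Extracting text with pdfplumber

pdfplumber is one of the most convenient libraries for extracting PDF text, and it copes well with Chinese documents.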

    import re

    import pdfplumber


    def extract_pdf_text(pdf_path):
        """Return the text of every page, joined by newlines."""
        full_text = []
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                text = page.extract_text()
                if text:  # skip pages that yield no text
                    full_text.append(text)
        return '\n'.join(full_text)


    # Usage
    text = extract_pdf_text('document.pdf')
    print(text[:500])

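3.2 Cleaning common noise from PDF text

Text extracted from a PDF usually carries a lot of clutter: page numbers, headers and footers, hyphens left by words broken across lines, and redundant line breaks. The following cleaning function handles page numbers, hyphenation, broken paragraphs, and extra whitespace: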

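    import re

    # A line consisting only of a page number, e.g. "- 1 -", "Page 1 of 10", "第1页".
    PAGE_NUMBER_LINE = re.compile(
        r'^[-—\s]*(第?\s*\d+\s*页|Page\s*\d+(\s*of\s*\d+)?|\d+)[-—\s]*$', re.I)
    SENTENCE_END = ('。', '！', '？', '.', '!', '?')


    def clean_pdf_text(text):
        # 1. Drop page-number lines. Matching whole lines only keeps the
        #    numbers inside ordinary sentences intact.
        lines = [ln for ln in text.split('\n') if not PAGE_NUMBER_LINE.match(ln)]
        text = '\n'.join(lines)

        # 2. Rejoin words hyphenated across a line break ('-' + newline)
        text = re.sub(r'-\n', '', text)

        # 3. Merge lines broken mid-paragraph: if the previous line does not
        #    end with sentence punctuation, append the current line to it.
        #    Blank lines are kept as paragraph breaks.
        merged = []
        for line in text.split('\n'):
            if merged and merged[-1] and line and not merged[-1].endswith(SENTENCE_END):
                merged[-1] += line  # no separator suits Chinese; use ' ' for English
            else:
                merged.append(line)
        text = '\n'.join(merged)

        # 4. Collapse redundant whitespace
        text = re.sub(r' +', ' ', text)
        text = re.sub(r'\n{3,}', '\n\n', text)
        return text.strip()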
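Running headers and footers are hard to catch with a fixed regex because their wording differs from document to document. One common heuristic, sketched here as an assumption rather than as part of the original workflow, is to look for lines that recur on most pages. The function name `strip_repeated_lines` and the `min_ratio` threshold are illustrative; it expects a list of per-page strings, so it would be applied before the pages are joined into one text.

```python
from collections import Counter


def strip_repeated_lines(pages, min_ratio=0.6):
    """Drop lines that recur on at least min_ratio of the pages.

    pages: list of strings, one per page. Returns a new list of strings.
    """
    if len(pages) < 2:
        return list(pages)  # nothing to compare against

    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({ln.strip() for ln in page.split("\n") if ln.strip()})

    threshold = min_ratio * len(pages)
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [ln for ln in page.split("\n") if ln.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned


if __name__ == "__main__":
    demo = [
        "ACME Report\nIntro text.\nConfidential",
        "ACME Report\nMore text.\nConfidential",
        "ACME Report\nFinal text.\nConfidential",
    ]
    print(strip_repeated_lines(demo))
```

The heuristic can misfire on short documents or on body lines that legitimately repeat (for example a recurring table header), so the threshold is worth tuning per corpus.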

3.3 Structural operations with pypdf

If you only need to merge, split, or encrypt files, pypdf is the better fit:

    from pypdf import PdfReader, PdfWriter

    # Merge several PDFs. PdfWriter.append handles merging; the older
    # PdfMerger class is deprecated and absent from recent pypdf releases.
    merger = PdfWriter()
    merger.append('file1.pdf')
    merger.append('file2.pdf')
    merger.write('merged.pdf')
    merger.close()

    # Split a PDF (one output file per page)
    reader = PdfReader('big_file.pdf')
    for i, page in enumerate(reader.pages):
        writer = PdfWriter()
        writer.add_page(page)
        with open(f'page_{i+1}.pdf', 'wb') as f:
            writer.write(f)

④ Reading structured data from Word documents

4.1 Extracting paragraph text

python-docx is much simpler to use than the PDF tools, because Word is structured by nature:

    from docx import Document


    def extract_docx_text(file_path):
        doc = Document(file_path)
        return '\n'.join(para.text for para in doc.paragraphs)


    text = extract_docx_text('report.docx')

4.2 Extracting table data

Tables are often the most valuable structured data in a Word document:

    def extract_docx_tables(file_path):
        doc = Document(file_path)
        all_tables = []
        for table in doc.tables:
            table_data = []
            for row in table.rows:
                row_data = [cell.text.strip() for cell in row.cells]
                table_data.append(row_data)
            all_tables.append(table_data)
        return all_tables


    tables = extract_docx_tables('report.docx')
    for i, table in enumerate(tables):
        print(f"Table {i+1}: {len(table)} rows")
        for row in table[:3]:  # preview the first 3 rows
            print(row)

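4.3 Handling merged cells

Merged cells are the trickiest part of Word table extraction. python-docx has no simple flag that says which cells were merged, so you have to work it out yourself. One simple approach is to treat an empty cell as part of a merge and fill it with the nearest non-empty value to its left. Be aware that this heuristic also fills cells that are genuinely empty: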

    def extract_table_with_merged(table):
        rows_data = []
        for row in table.rows:
            row_data = []
            for cell in row.cells:
                text = cell.text.strip()
                if text:
                    row_data.append(text)
                elif row_data:
                    # Empty cell: fill with the nearest non-empty value to the left
                    row_data.append(row_data[-1])
                else:
                    row_data.append('')
            rows_data.append(row_data)
        return rows_data

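4.4 Handling the legacy .doc format

python-docx only handles .docx, so .doc files must be converted or read another way. On Windows you can use pywin32 to drive a locally installed copy of Word: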

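    # Windows with Microsoft Word installed
    import os

    import win32com.client as win32


    def extract_doc_text(file_path):
        word = win32.gencache.EnsureDispatch('Word.Application')
        try:
            # Word resolves relative paths against its own working directory,
            # so pass an absolute path.
            doc = word.Documents.Open(os.path.abspath(file_path))
            text = doc.Content.Text
            doc.Close()
        finally:
            word.Quit()  # always shut Word down, even if opening fails
        return text

A cross-platform option is unoconv, a command-line tool that converts documents through LibreOffice: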

unoconv -f docx input.doc
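LibreOffice can also be driven directly through its own `soffice` binary in headless mode, which avoids the extra wrapper. The sketch below assumes `soffice` is on the PATH (the executable name and location vary by platform); the function names are illustrative.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Build the LibreOffice command line that converts one file to .docx."""
    return [
        soffice,
        "--headless",            # run without a GUI
        "--convert-to", "docx",  # target format
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_doc_to_docx(doc_path, out_dir, soffice="soffice", timeout=120):
    """Run the conversion and return the path where the .docx is expected."""
    subprocess.run(
        build_convert_command(doc_path, out_dir, soffice),
        check=True,       # raise CalledProcessError on a non-zero exit code
        timeout=timeout,  # avoid hanging forever on a broken file
    )
    return Path(out_dir) / (Path(doc_path).stem + ".docx")


if __name__ == "__main__":
    print(" ".join(build_convert_command("input.doc", "converted")))
```

The returned path can then be handed to the .docx extraction functions from sections 4.1 and 4.2.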

⑤ Stripping HTML tags and purifying text

5.1 Extracting plain text with BeautifulSoup

The most common combination for HTML parsing is BeautifulSoup + lxml:

    from bs4 import BeautifulSoup


    def extract_html_text(html_content):
        soup = BeautifulSoup(html_content, 'lxml')
        # Remove script, style, and other tags that hold no body text
        for tag in soup(['script', 'style', 'head', 'meta', 'noscript']):
            tag.decompose()
        # Extract the text, separating blocks with newlines
        text = soup.get_text(separator='\n', strip=True)
        return text


    # Read from a file
    with open('page.html', 'r', encoding='utf-8') as f:
        html = f.read()
    text = extract_html_text(html)

5.2 Further cleaning: removing extra blank lines and special characters

    import re


    def clean_html_text(text):
        # Collapse runs of blank lines
        text = re.sub(r'\n\s*\n', '\n\n', text)
        # Strip each line and drop the empty ones
        lines = [line.strip() for line in text.split('\n') if line.strip()]
        text = '\n'.join(lines)
        # HTML entities (&nbsp;, &amp;, etc.) were already decoded by BeautifulSoup
        return text
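Decoding entities does not make the resulting characters harmless: a decoded non-breaking space is still not an ordinary space, and zero-width characters often survive text extraction unnoticed. A minimal sketch of one way to normalise them follows; the function name and the particular character set are assumptions chosen for illustration, and it deliberately leaves full-width punctuation alone so Chinese text is unchanged.

```python
import re

# Map space-like characters to a plain space and delete invisible ones.
_CHAR_MAP = {
    0x00A0: " ",   # non-breaking space (what &nbsp; decodes to)
    0x3000: " ",   # ideographic (full-width) space
    0x200B: None,  # zero-width space
    0x200C: None,  # zero-width non-joiner
    0x200D: None,  # zero-width joiner
    0x2060: None,  # word joiner
    0xFEFF: None,  # byte-order mark / zero-width no-break space
}


def normalize_special_chars(text):
    """Replace space-like characters, drop invisible ones, collapse spaces."""
    text = text.translate(_CHAR_MAP)
    text = re.sub(r"[ \t]+", " ", text)  # collapse runs of spaces and tabs
    return text.strip()


if __name__ == "__main__":
    print(repr(normalize_special_chars("Hello\u00a0\u200bworld\u3000 ok\ufeff")))
```

It can be called on each line or on the whole text after the blank-line cleanup.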

5.3 Keeping selected tags

Sometimes you do not want to drop every tag, for example when the heading hierarchy should be preserved:

    def extract_html_with_headers(html_content):
        soup = BeautifulSoup(html_content, 'lxml')
        # Remove noise tags
        for tag in soup(['script', 'style', 'nav', 'footer', 'aside']):
            tag.decompose()
        # Collect headings and body paragraphs in document order
        result = []
        for tag in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p']):
            if tag.name.startswith('h'):
                # Prefix headings with their level, e.g. 【H2】
                result.append(f"【{tag.name.upper()}】{tag.get_text(strip=True)}")
            else:
                result.append(tag.get_text(strip=True))
        return '\n'.join(result)

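⑥ A unified multi-format pipeline in practice

Real work often involves a mix of document types: a PDF today, a Word file tomorrow, a scraped web page the day after. Write a single entry-point function that picks the parser from the file extension: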

    import os
    import re


    def parse_document(file_path):
        ext = os.path.splitext(file_path)[1].lower()
        if ext == '.pdf':
            raw_text = extract_pdf_text(file_path)
            return clean_pdf_text(raw_text)
        elif ext == '.doc':
            # python-docx cannot open .doc files; convert them first
            # (add the conversion logic here, see section 4.4)
            raise ValueError("Convert .doc to .docx first (see section 4.4)")
        elif ext == '.docx':
            raw_text = extract_docx_text(file_path)
            # Word documents usually carry little noise; light cleaning is enough
            return re.sub(r'\n{3,}', '\n\n', raw_text)
        elif ext in ('.html', '.htm'):
            with open(file_path, 'r', encoding='utf-8') as f:
                html = f.read()
            text = extract_html_text(html)
            return clean_html_text(text)
        else:
            raise ValueError(f"Unsupported format: {ext}")


    # Batch processing
    def batch_parse(folder_path):
        results = {}
        for filename in os.listdir(folder_path):
            file_path = os.path.join(folder_path, filename)
            try:
                results[filename] = parse_document(file_path)
            except Exception as e:
                print(f"Failed to parse {filename}: {e}")
                results[filename] = None
        return results
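As the number of formats grows, an if/elif chain gets long. One alternative design, shown here only as a sketch and not as the article's method, is a small registry that maps extensions to parser functions. The names `PARSERS`, `register`, and `parse_any` are hypothetical, and the `.txt` parser is just a placeholder to make the example self-contained.

```python
import os

# Extension (lowercase, with dot) -> parser function taking a path.
PARSERS = {}


def register(*extensions):
    """Decorator that maps one or more file extensions to a parser."""
    def decorator(func):
        for ext in extensions:
            PARSERS[ext.lower()] = func
        return func
    return decorator


@register(".txt")
def parse_txt(path):
    """Placeholder parser: read a UTF-8 text file as-is."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def parse_any(path):
    """Dispatch to the parser registered for the file's extension."""
    ext = os.path.splitext(path)[1].lower()
    try:
        parser = PARSERS[ext]
    except KeyError:
        raise ValueError(f"Unsupported format: {ext}") from None
    return parser(path)
```

Adding a format then means writing one decorated function, for example `@register('.html', '.htm')` above a function that reads the file and calls the HTML helpers from section ⑤.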

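⑦ Repairing special-character and encoding errors

Encoding problems are the most common and most frustrating pitfall in document parsing.

7.1 Detecting a file's encoding

Use chardet to detect the encoding automatically: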

    import chardet


    def detect_encoding(file_path):
        with open(file_path, 'rb') as f:
            raw_data = f.read(10000)  # the first ~10 KB is usually enough
        result = chardet.detect(raw_data)
        return result['encoding'], result['confidence']


    # Usage
    encoding, confidence = detect_encoding('unknown.txt')
    print(f"Detected encoding: {encoding}, confidence: {confidence}")

7.2 Repairing mojibake with ftfy

Some files open as pure gibberish, such as "æ–‡æ¡£". ftfy can often repair this in a single call:

    from ftfy import fix_text

    # Fix a single string
    garbled = "âœ” No problems"
    fixed = fix_text(garbled)
    print(fixed)  # Output: "✔ No problems"


    # Fix a whole document's text
    def fix_document_text(text):
        return fix_text(text)
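It helps to see where this kind of garbling comes from. Text such as "æ–‡æ¡£" is what UTF-8 bytes look like when they are decoded with a single-byte Western codec. The sketch below reproduces the mistake with the standard library and then reverses it; the function names are illustrative, and Windows-1252 (`cp1252`) is assumed as the wrong codec because it matches the examples in this section.

```python
def simulate_mojibake(text, wrong_encoding="cp1252"):
    """Encode as UTF-8, then decode with the wrong codec, as a misconfigured reader would."""
    return text.encode("utf-8").decode(wrong_encoding)


def undo_mojibake(garbled, wrong_encoding="cp1252"):
    """Reverse the mistake: re-encode with the wrong codec, then decode as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")


if __name__ == "__main__":
    original = "文档"
    garbled = simulate_mojibake(original)
    print(garbled)                 # the garbled form
    print(undo_mojibake(garbled))  # back to the original
```

This manual round trip only works when the wrong codec is known and every byte survived the bad decode. Real files are often messier (mixed or repeated mis-decoding), which is the situation ftfy is designed to detect and repair heuristically.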

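7.3 Tolerant file reading

    def safe_read_text(file_path):
        # Try common encodings in order. gb18030 is a superset of gbk and
        # gb2312, so one entry covers all three. Order matters: a wrong
        # codec can decode without raising and silently produce garbage.
        encodings = ['utf-8', 'gb18030', 'big5', 'shift_jis']
        for encoding in encodings:
            try:
                with open(file_path, 'r', encoding=encoding) as f:
                    return f.read()
            except UnicodeDecodeError:
                continue
        # If every attempt fails, fall back to the detected encoding.
        # chardet returns None when it cannot decide, so default to utf-8.
        encoding, _ = detect_encoding(file_path)
        with open(file_path, 'r', encoding=encoding or 'utf-8', errors='ignore') as f:
            return f.read()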

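⑧ Table extraction under complex layouts

Table extraction is the most technically demanding part of document parsing, especially for tables inside PDFs.

8.1 Extracting tables with pdfplumber (suited to tables with borders)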

    import pdfplumber


    def extract_tables_pdfplumber(pdf_path):
        with pdfplumber.open(pdf_path) as pdf:
            all_tables = []
            for page in pdf.pages:
                tables = page.extract_tables()
                for table in tables:
                    # Skip empty tables and tables with only a header row
                    if table and len(table) > 1:
                        all_tables.append(table)
        return all_tables
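Extracted tables arrive as plain lists of rows, and in practice cells may come back as `None` or contain line breaks, and rows may be ragged. Before handing the data to pandas or Excel it is often convenient to normalise it. The following sketch assumes the first row is the header; `table_to_records` is an illustrative name, not a pdfplumber function.

```python
def table_to_records(table):
    """Turn [[header...], [row...], ...] into a list of dicts with clean cells."""
    def clean(cell):
        # Treat None as empty and fold line breaks inside a cell into spaces.
        return " ".join(str(cell).split()) if cell is not None else ""

    # Give blank header cells a positional name so keys stay unique.
    header = [clean(c) or f"col_{i}" for i, c in enumerate(table[0], 1)]

    records = []
    for row in table[1:]:
        cells = [clean(c) for c in row]
        if not any(cells):
            continue  # skip rows that are entirely empty
        # Pad short rows so every record has the same keys.
        cells += [""] * (len(header) - len(cells))
        records.append(dict(zip(header, cells)))
    return records


if __name__ == "__main__":
    demo = [
        ["Name", None, "Qty"],
        ["Apple", "red\nfruit", "3"],
        [None, None, None],
        ["Pear", "green"],
    ]
    for record in table_to_records(demo):
        print(record)
```

The resulting list of dicts can be passed directly to `pandas.DataFrame(records)`.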

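8.2 Extracting tables with Camelot (suited to complex tables)

Camelot offers two modes:

lattice: for grid tables with clearly drawn ruling lines.
stream: for tables without ruling lines, which it separates by whitespace.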

    import camelot

    # lattice mode: PDFs whose tables have ruling lines
    tables = camelot.read_pdf('table_with_lines.pdf', flavor='lattice')
    tables[0].df  # the table as a pandas DataFrame

    # stream mode: PDFs whose tables have no ruling lines
    tables = camelot.read_pdf('table_no_lines.pdf', flavor='stream')

    # Export to CSV
    tables[0].to_csv('output.csv')

8.3 Extracting tables with tabula-py (requires Java, but more stable)

    import tabula

    # Read every table in the PDF
    tables = tabula.read_pdf('table.pdf', pages='all')

    # Restrict to specific pages
    tables = tabula.read_pdf('table.pdf', pages='1-3')

    # read_pdf returns a list of DataFrames
    for df in tables:
        print(df.head())

8.4 Exporting to Excel

    from openpyxl import Workbook


    def tables_to_excel(tables, output_path):
        wb = Workbook()
        for i, table in enumerate(tables):
            ws = wb.create_sheet(title=f'Table_{i+1}')
            for row_idx, row in enumerate(table, 1):
                for col_idx, cell in enumerate(row, 1):
                    ws.cell(row=row_idx, column=col_idx, value=cell)
        # Remove the empty sheet created by default, unless it is the only
        # one (a workbook needs at least one sheet to be saved)
        if 'Sheet' in wb.sheetnames and len(wb.sheetnames) > 1:
            wb.remove(wb['Sheet'])
        wb.save(output_path)

⑨ Common parsing errors and how to troubleshoot them

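Symptom	Likely cause	Fix
PDF text extraction returns nothing	Scanned PDF (pure images)	Add OCR (Tesseract + pdf2image)
PDF text is out of order	Multi-column layout	Re-sort using pdfplumber's coordinate information
Word file fails to open	The file is .doc, not .docx	Convert with LibreOffice or pywin32
HTML parsing hangs	Very large page or unclosed tags	Switch to the 'lxml' parser, or set a time limit
UnicodeDecodeError	Wrong encoding	Detect with chardet, then re-read with the right encoding
Camelot finds no tables	Ruling lines too faint or absent	Switch to stream mode, or tune parameters
Out of memory	File too large	Read page by page; use a generator instead of loading everything at once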
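For the multi-column case in the table, the idea is to sort word boxes by column first and only then by vertical position. The sketch below works on dictionaries shaped like the items returned by pdfplumber's `page.extract_words()` (keys `text`, `x0`, `top`). It assumes equal-width columns and a known column count, which is a simplification; the function name and parameters are illustrative.

```python
def words_in_reading_order(words, page_width, n_columns=2, line_tolerance=3):
    """Order word boxes column by column, then top to bottom, then left to right.

    words: iterable of dicts with 'text', 'x0' (left edge) and 'top' keys.
    line_tolerance: vertical bucket size, so words on the same visual line
    with slightly different 'top' values usually sort together.
    """
    column_width = page_width / n_columns

    def sort_key(word):
        # Which column the word starts in (clamped to the last column).
        column = min(int(word["x0"] // column_width), n_columns - 1)
        line = round(word["top"] / line_tolerance)
        return (column, line, word["x0"])

    return [w["text"] for w in sorted(words, key=sort_key)]


if __name__ == "__main__":
    # Words as they might arrive: ordered by vertical position across both columns.
    demo = [
        {"text": "A", "x0": 50, "top": 100},
        {"text": "B", "x0": 90, "top": 100.2},
        {"text": "D", "x0": 350, "top": 100},
        {"text": "C", "x0": 50, "top": 120},
        {"text": "E", "x0": 350, "top": 120},
    ]
    print(words_in_reading_order(demo, page_width=600))
```

With pdfplumber this would be called as `words_in_reading_order(page.extract_words(), page.width)`. Layouts with uneven columns or full-width headings need a smarter column detector than a fixed split.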

OCR example for scanned PDFs:

    from pdf2image import convert_from_path
    import pytesseract


    def ocr_pdf(pdf_path, lang='chi_sim+eng'):
        images = convert_from_path(pdf_path, dpi=300)
        text = ''
        for i, img in enumerate(images):
            # Grayscale + binarization to improve recognition
            img = img.convert('L').point(lambda x: 0 if x < 140 else 255)
            text += f"--- Page {i+1} ---\n"
            text += pytesseract.image_to_string(img, lang=lang)
        return text

⑩ Batch processing scripts and performance tuning

10.1 A basic batch processing script

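    import os
    from concurrent.futures import ProcessPoolExecutor, as_completed


    def process_single_file(file_path):
        """Process one file (defined at module level so worker processes can call it)."""
        try:
            text = parse_document(file_path)
            # Save the result next to the source file
            output_path = file_path + '.txt'
            with open(output_path, 'w', encoding='utf-8') as f:
                f.write(text)
            return file_path, True, len(text)
        except Exception as e:
            return file_path, False, str(e)


    def batch_process_parallel(folder_path, max_workers=4):
        """Process a folder in parallel.

        On Windows and macOS, call this from under `if __name__ == '__main__':`.
        """
        files = []
        for root, dirs, filenames in os.walk(folder_path):
            for f in filenames:
                if f.lower().endswith(('.pdf', '.docx', '.doc', '.html', '.htm')):
                    files.append(os.path.join(root, f))
        print(f"Found {len(files)} documents")

        results = []
        with ProcessPoolExecutor(max_workers=max_workers) as executor:
            future_to_file = {
                executor.submit(process_single_file, file_path): file_path
                for file_path in files
            }
            for future in as_completed(future_to_file):
                file_path, success, info = future.result()
                if success:
                    print(f"✓ {os.path.basename(file_path)}: {info} characters")
                else:
                    print(f"✗ {os.path.basename(file_path)}: {info}")
                results.append((file_path, success))
        return results

10.2 Performance tuning tips

Use processes rather than threads: because of Python's GIL, CPU-bound work only makes full use of multiple cores when run in multiple processes.
Read large files in chunks: process large PDFs page by page instead of loading everything into memory at once.
Cache intermediate results: if the same document is parsed repeatedly, caching the parsed output improves throughput considerably.
Pick a sensible DPI: for OCR, 300 DPI is generally a good balance between accuracy and speed.
Limit concurrency: one to two times the number of CPU cores is usually enough; more only adds context-switching overhead.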
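The caching tip can be implemented in a few lines. The sketch below keys the cache on a SHA-256 digest of the file contents, so a renamed file still hits the cache and an edited file misses it. The function names, the JSON layout, and the `.parse_cache` directory are assumptions made for illustration; `parse_func` stands for any parser that takes a path and returns text.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in chunks to keep memory use flat."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def parse_with_cache(path, parse_func, cache_dir=".parse_cache"):
    """Return cached text for this file content, parsing only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / (file_digest(path) + ".json")

    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))["text"]

    text = parse_func(path)
    payload = {"source": str(path), "text": text}
    cache_file.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")
    return text
```

Hashing a very large file has its own cost, so for huge inputs a cheaper key such as path plus size plus modification time may be a reasonable trade-off.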

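    # Batch processing with a progress bar
    # (ProcessPoolExecutor, as_completed, and process_single_file come from 10.1)
    from tqdm import tqdm


    def batch_process_with_progress(folder_path):
        files = [...]  # collected the same way as in 10.1
        results = []
        with ProcessPoolExecutor(max_workers=4) as executor:
            futures = [executor.submit(process_single_file, f) for f in files]
            for future in tqdm(as_completed(futures), total=len(files), desc="Processing documents"):
                results.append(future.result())
        return results

That completes the end-to-end guide to document parsing. Environment setup, per-format handling, the unified pipeline, encoding repair, table extraction, and batch optimization together cover most situations that come up in day-to-day work. When you hit a specific problem, first identify the document type (text-based or scanned PDF, .doc or .docx, well-formed HTML or not), then pick the matching tool. That alone saves a lot of debugging time.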

Source: https://juejin.cn/post/7656817003366465555
