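PDF Skill in Detail
Processing, creating, and analyzing PDF documents
Basic information
| Property | Value |
|---|---|
| Name | pdf |
| Category | Document processing |
| Output format | |
| License | Proprietary (source-available only) |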
```yaml
name: pdf
description: Comprehensive PDF manipulation toolkit for extracting text
  and tables, creating new PDFs, merging/splitting documents, and handling
  forms. When Claude needs to fill in a PDF form or programmatically process,
  generate, or analyze PDF documents at scale.
```
1. Python library overview
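This section covers three libraries: pypdf, pdfplumber, and reportlab. Before running the examples, it may be useful to confirm that they are importable. The following is a minimal sketch using only the standard library; the helper name `missing_libraries` is hypothetical and not part of the skill.

```python
from importlib.util import find_spec

# Import names of the three libraries covered in this section.
LIBRARIES = ("pypdf", "pdfplumber", "reportlab")


def missing_libraries(names=LIBRARIES):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_libraries()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All libraries are installed.")
```

Anything reported as missing can be installed with the `pip install` command listed in section 9.2.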
1.1 pypdf - Basic operations
Use cases: merging, splitting, rotating, encrypting, watermarking
```python
from pypdf import PdfReader, PdfWriter

# Read the PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text page by page
text = ""
for page in reader.pages:
    text += page.extract_text()
```
1.2 pdfplumber - Text and table extraction
Use cases: layout-preserving text extraction, tabular data
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
```
1.3 reportlab - Creating PDFs
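Use cases: creating new PDFs from scratch
```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
# Coordinates are in points, measured from the bottom-left corner
c.drawString(100, 750, "Hello World!")
c.save()
```
2. Common operations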
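Several operations in this section act on a subset of pages, while pypdf indexes `reader.pages` from zero. As a small shared helper, here is a sketch that converts a human-style range string into zero-based indices. The function name `parse_page_ranges` and the `"1-3,5"` syntax are assumptions chosen for illustration; they are not part of pypdf.

```python
def parse_page_ranges(spec: str, page_count: int) -> list[int]:
    """Convert a 1-based range string such as "1-3,5" into 0-based page indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Reject ranges that are reversed or fall outside the document.
        if not 1 <= start <= end <= page_count:
            raise ValueError(f"Page range {part!r} is outside 1-{page_count}")
        indices.extend(range(start - 1, end))
    return indices


print(parse_page_ranges("1-3,5", page_count=10))  # [0, 1, 2, 4]
```

Each resulting index can be used as `reader.pages[i]` and passed to `writer.add_page(...)`, as in the merge and split examples in this section.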
2.1 Merging PDFs
```python
from pypdf import PdfReader, PdfWriter

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)
```
2.2 Splitting PDFs
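```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
# Write each page to its own file: page_1.pdf, page_2.pdf, ...
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
```
2.3 Rotating pages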
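```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()
page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)  # Only the rotated first page is written

with open("rotated.pdf", "wb") as output:
    writer.write(output)
```
2.4 Extracting metadata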
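```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
meta = reader.metadata  # None when the PDF has no document information
if meta is not None:
    print(f"Title: {meta.title}")
    print(f"Author: {meta.author}")
    print(f"Subject: {meta.subject}")
    print(f"Creator: {meta.creator}")
```
3. Table extraction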
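pdfplumber returns each table as a list of rows, and cells can come back as `None` (for example where a row has no cell at that position) or contain embedded line breaks. The sketch below normalizes such rows before they are printed or loaded into pandas. The helper name `clean_table` is hypothetical, and the sample data is made up to mimic the list-of-rows shape rather than taken from a real PDF.

```python
def clean_table(table):
    """Normalize rows shaped like pdfplumber's extract_tables() output.

    None cells become empty strings, runs of whitespace (including line
    breaks) collapse to single spaces, and rows with no text are dropped.
    """
    cleaned = []
    for row in table:
        cells = [" ".join(cell.split()) if cell else "" for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


# Made-up sample in the list-of-rows shape, not real extraction output.
sample = [
    ["Name", "Qty"],
    ["Widget\nA", None],
    [None, None],
    ["Gadget", " 3 "],
]
print(clean_table(sample))
# [['Name', 'Qty'], ['Widget A', ''], ['Gadget', '3']]
```

Cleaning before building a DataFrame keeps `None` out of the header row, which is used as the column names in section 3.2.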
3.1 Basic table extraction
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)
```
3.2 Advanced table extraction (export to Excel)
```python
import pandas as pd
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:  # Skip empty tables
                # Treat the first row as the header
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

# Combine all tables
if all_tables:
    combined_df = pd.concat(all_tables, ignore_index=True)
    # Writing .xlsx requires an Excel engine such as openpyxl
    combined_df.to_excel("extracted_tables.xlsx", index=False)
```
4. Creating complex PDFs
4.1 Multi-page documents
```python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

# Add content
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))
body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)
story.append(PageBreak())

# Page 2
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Content for page 2", styles['Normal']))

# Build the PDF
doc.build(story)
```
5. Command-line tools
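The tools in this section can also be driven from Python, which is convenient when a script mixes library calls with command-line steps. The sketch below builds the argument list for the qpdf merge command shown in section 5.2 and runs a command only if its executable is on `PATH`. Both helper names (`qpdf_merge_command`, `run_if_available`) are hypothetical wrappers, not part of qpdf or the skill.

```python
import shutil
import subprocess


def qpdf_merge_command(inputs, output):
    """Build the argv for: qpdf --empty --pages <inputs...> -- <output>."""
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def run_if_available(argv):
    """Run argv only when its executable is on PATH; return True if it ran."""
    if shutil.which(argv[0]) is None:
        print(f"{argv[0]} not found on PATH; skipping")
        return False
    subprocess.run(argv, check=True)  # raises CalledProcessError on failure
    return True


command = qpdf_merge_command(["file1.pdf", "file2.pdf"], "merged.pdf")
print(command)
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems with file names that contain spaces. Calling `run_if_available(command)` would then perform the merge when qpdf is installed and the input files exist.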
5.1 pdftotext (poppler-utils)
```bash
# Extract text
pdftotext input.pdf output.txt
# Preserve layout
pdftotext -layout input.pdf output.txt
# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt  # Pages 1-5
```
5.2 qpdf
```bash
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
# Split by page range
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf
# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1  # Rotate page 1 by 90 degrees
# Remove a password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
```
5.3 pdftk
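```bash
# Merge
pdftk file1.pdf file2.pdf cat output merged.pdf
# Split into one file per page (pg_0001.pdf, pg_0002.pdf, ...)
pdftk input.pdf burst
# Rotate page 1 by 90 degrees clockwise
pdftk input.pdf rotate 1east output rotated.pdf
```
6. Advanced features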
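OCR is comparatively slow, so it can help to check first whether a page already carries a text layer. This is a minimal sketch, assuming a page counts as scanned when almost no text can be extracted from it. The function name `needs_ocr` and the default threshold of 20 characters are hypothetical choices for illustration, not values from the skill.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Guess whether a page is a scan from how little text was extracted.

    extracted_text is whatever a text extractor returned for the page
    (for example page.extract_text() from pypdf or pdfplumber). None and
    empty strings count as no text.
    """
    visible = "".join((extracted_text or "").split())  # drop all whitespace
    return len(visible) < min_chars


print(needs_ocr(""))  # True
print(needs_ocr("Quarterly report for the fiscal year 2023"))  # False
```

Pages flagged this way are candidates for the OCR route in section 6.1. The threshold is a heuristic and may need tuning for documents with very short pages.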
6.1 OCR for scanned documents
```python
# Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path

# Convert PDF pages to images
images = convert_from_path('scanned.pdf')

# Run OCR on each page
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"
print(text)
```
6.2 Adding watermarks
```python
from pypdf import PdfReader, PdfWriter

# Create the watermark (or load an existing one)
watermark = PdfReader("watermark.pdf").pages[0]

# Apply it to every page
reader = PdfReader("document.pdf")
writer = PdfWriter()
for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)
```
6.3 Extracting images
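```bash
# Using pdfimages (poppler-utils)
pdfimages -j input.pdf output_prefix
# Produces output_prefix-000.jpg, output_prefix-001.jpg, and so on
# (with -j, images not stored as JPEG are written as PPM/PBM instead)
```
6.4 Password protection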
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)

# Set the user password and the owner password
writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)
```
7. Quick reference table
| Task | Best tool | Command/code |
|---|---|---|
| Merge PDFs | pypdf | writer.add_page(page) |
| Split PDFs | pypdf | One file per page |
| Extract text | pdfplumber | page.extract_text() |
| Extract tables | pdfplumber | page.extract_tables() |
| Create PDFs | reportlab | Canvas or Platypus |
| Merge from the command line | qpdf | qpdf --empty --pages ... |
| OCR scanned documents | pytesseract | Convert to images first |
| Fill in forms | pdf-lib/pypdf | See forms.md |
8. Form handling
8.1 Reference documentation
To fill in PDF forms, read forms.md for the complete guide.
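For orientation only, the four-step workflow listed in section 8.2 might look roughly like this with pypdf. This is a sketch under assumptions, not the procedure from forms.md: it assumes an AcroForm PDF whose fields sit on the first page, and the helper names `describe_fields` and `fill_form` are hypothetical.

```python
# Readable names for the /FT (field type) codes used by PDF AcroForms.
FIELD_TYPES = {"/Tx": "text", "/Btn": "button", "/Ch": "choice", "/Sig": "signature"}


def describe_fields(fields):
    """Map field name -> readable type for a dict shaped like PdfReader.get_fields()."""
    return {
        name: FIELD_TYPES.get(field.get("/FT"), "unknown")
        for name, field in (fields or {}).items()
    }


def fill_form(src_path, dst_path, values):
    """Fill fields on the first page and save a copy (requires pypdf)."""
    from pypdf import PdfReader, PdfWriter  # imported lazily; only needed here

    reader = PdfReader(src_path)
    print(describe_fields(reader.get_fields()))  # steps 1-2: names and types
    writer = PdfWriter()
    writer.append(reader)  # copy the pages into the writer
    writer.update_page_form_field_values(writer.pages[0], values)  # step 3
    with open(dst_path, "wb") as output:  # step 4: save the modified PDF
        writer.write(output)


# Stand-in data in the same shape, so the helper can be tried without a PDF.
print(describe_fields({"full_name": {"/FT": "/Tx"}, "agree": {"/FT": "/Btn"}}))
# {'full_name': 'text', 'agree': 'button'}
```

A call such as `fill_form("form.pdf", "filled.pdf", {"full_name": "Ada"})` would use placeholder file and field names. Real forms often need the extra handling described in forms.md.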
8.2 Basic workflow
```markdown
1. Identify the form fields
2. Extract the field names and types
3. Fill in the values
4. Save the modified PDF
```
9. Usage examples
9.1 Trigger phrases
"Extract the text from this PDF for me"
"Merge these PDF files"
"Extract the table data from this PDF"
"create a PDF report"
"fill in this PDF form"
9.2 Installing dependencies
```bash
# Python libraries
pip install pypdf pdfplumber reportlab
# Excel export in section 3.2 (optional)
pip install pandas openpyxl
# Command-line tools
sudo apt-get install poppler-utils  # pdftotext, pdfimages
sudo apt-get install qpdf           # qpdf
# OCR (optional)
pip install pytesseract pdf2image
sudo apt-get install tesseract-ocr
```
10. Section summary
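| Key point | Description |
|---|---|
| pypdf | Basic operations: merging, splitting, rotating, encrypting |
| pdfplumber | Text and table extraction with layout preserved |
| reportlab | Creating new PDFs from scratch |
| Command-line tools | qpdf, pdftotext, pdftk |
| Form handling | See forms.md |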