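PDF Skill in Detail

Processing, creating, and analyzing PDF documents

Basic Information

| Property | Value |
| --- | --- |
| Name | pdf |
| Category | Document processing |
| Output format | .pdf |
| License | Proprietary (source viewable only) |
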
yaml
name: pdf
description: Comprehensive PDF manipulation toolkit for extracting text
  and tables, creating new PDFs, merging/splitting documents, and handling
  forms. When Claude needs to fill in a PDF form or programmatically process,
  generate, or analyze PDF documents at scale.

1. Python Library Overview

1.1 pypdf - Basic Operations

Use cases: merging, splitting, rotating, encrypting, watermarking

python
from pypdf import PdfReader, PdfWriter

# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text from every page
text = ""
for page in reader.pages:
    text += page.extract_text()

1.2 pdfplumber - Text and Table Extraction

Use cases: layout-preserving text extraction, tabular data

python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

1.3 reportlab - Creating PDFs

Use case: creating new PDFs from scratch

python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
c.drawString(100, 750, "Hello World!")
c.save()

2. Common Operations

2.1 Merging PDFs

python
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)

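2.2 Splitting a PDF

python
from pypdf import PdfReader, PdfWriter

# Write each page to its own file: page_1.pdf, page_2.pdf, ...
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)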
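
Splitting one page per file is the simplest case. When only some pages are needed, a small helper can translate a human-readable range into page indices. The sketch below is an illustration, not part of the skill: `parse_page_ranges` and its spec syntax ("1-3,5", 1-based, inclusive) are assumptions made for this example.

```python
def parse_page_ranges(spec: str, page_count: int) -> list[int]:
    """Turn a 1-based spec such as "1-3,5" into zero-based page indices.

    Hypothetical helper for illustration; the spec syntax is an assumption.
    """
    indices: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start
        if not 1 <= start <= end <= page_count:
            raise ValueError(
                f"invalid range {part!r} for a {page_count}-page document"
            )
        # Convert the inclusive 1-based range to zero-based indices.
        indices.extend(range(start - 1, end))
    return indices


# Usage with pypdf (same objects as in the example above):
#   for i in parse_page_ranges("1-3,5", len(reader.pages)):
#       writer.add_page(reader.pages[i])
```

Validating against the real page count up front keeps the error close to the user's input instead of surfacing later as an index error.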

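2.3 Rotating Pages

python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

# Only the rotated first page is written to the output file
page = reader.pages[0]
page.rotate(90)  # rotate 90 degrees clockwise
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)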

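2.4 Extracting Metadata

python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
meta = reader.metadata  # None when the file carries no document information
if meta is not None:
    print(f"Title: {meta.title}")
    print(f"Author: {meta.author}")
    print(f"Subject: {meta.subject}")
    print(f"Creator: {meta.creator}")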

3. Table Extraction

3.1 Basic Table Extraction

python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)

3.2 Advanced Table Extraction (Export to Excel)

python
import pandas as pd
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:  # skip empty tables
                # Treat the first row as the header
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

# Combine all tables (writing .xlsx requires openpyxl)
if all_tables:
    combined_df = pd.concat(all_tables, ignore_index=True)
    combined_df.to_excel("extracted_tables.xlsx", index=False)
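
Tables returned by `page.extract_tables()` are lists of rows whose cells can be `None` or carry stray whitespace, and header cells can be blank or repeated, which makes the resulting DataFrame columns ambiguous. A minimal clean-up sketch follows; `clean_table` and the `column_N` / `_2` naming scheme are assumptions for illustration, not something pdfplumber provides.

```python
def clean_table(table):
    """Normalize one table (a list of rows) before building a DataFrame.

    Hypothetical helper: returns (header, body) with empty rows dropped,
    cells stripped, and header names made non-empty and unique.
    """
    # Replace None with "" and trim whitespace in every cell.
    rows = [[(cell or "").strip() for cell in row] for row in table]
    # Drop rows in which every cell is empty.
    rows = [row for row in rows if any(row)]
    if not rows:
        return [], []

    header, body = rows[0], rows[1:]
    seen = {}
    unique_header = []
    for idx, name in enumerate(header):
        name = name or f"column_{idx + 1}"  # placeholder for blank headers
        count = seen.get(name, 0)
        seen[name] = count + 1
        # Second and later duplicates get a numeric suffix.
        unique_header.append(name if count == 0 else f"{name}_{count + 1}")
    return unique_header, body


# Usage with the export example above:
#   header, body = clean_table(table)
#   df = pd.DataFrame(body, columns=header)
```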

4. Creating Complex PDFs

4.1 Multi-Page Documents

python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

# Add content
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))

body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)
story.append(PageBreak())

# Page 2
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Content for page 2", styles['Normal']))

# Build the PDF
doc.build(story)

5. Command-Line Tools

5.1 pdftotext (poppler-utils)

bash
# Extract text
pdftotext input.pdf output.txt

# Preserve the layout
pdftotext -layout input.pdf output.txt

# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt  # pages 1-5

5.2 qpdf

bash
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Split by page range
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf

# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1  # rotate page 1 by 90 degrees

# Remove the password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
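
When these commands are driven from Python rather than typed in a shell, passing the arguments as a list avoids quoting problems with file names. The sketch below wraps only the merge command shown above; `build_qpdf_merge_cmd` and `merge_with_qpdf` are hypothetical names for this example. It treats exit status 3 as success because qpdf uses that status when it finished with warnings.

```python
import shutil
import subprocess


def build_qpdf_merge_cmd(inputs, output):
    """Return the argument list for `qpdf --empty --pages ... -- output`."""
    if not inputs:
        raise ValueError("at least one input PDF is required")
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def merge_with_qpdf(inputs, output):
    """Run the merge; raise if qpdf is missing or reports an error."""
    if shutil.which("qpdf") is None:
        raise RuntimeError("qpdf was not found on PATH")
    cmd = build_qpdf_merge_cmd(inputs, output)
    result = subprocess.run(cmd)
    # qpdf exits with 3 when it succeeded but emitted warnings.
    if result.returncode not in (0, 3):
        raise subprocess.CalledProcessError(result.returncode, cmd)


# Usage:
#   merge_with_qpdf(["file1.pdf", "file2.pdf"], "merged.pdf")
```

Keeping command construction separate from execution makes the argument list easy to inspect or log before anything is run.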

5.3 pdftk

bash
# Merge
pdftk file1.pdf file2.pdf cat output merged.pdf

# Split into one file per page
pdftk input.pdf burst

# Rotate page 1 by 90 degrees clockwise
pdftk input.pdf rotate 1east output rotated.pdf

6. Advanced Features

6.1 OCR for Scanned Documents

python
# Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path

# Convert the PDF pages to images
images = convert_from_path('scanned.pdf')

# Run OCR on each page
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"

print(text)

6.2 Adding a Watermark

python
from pypdf import PdfReader, PdfWriter

# Load an existing watermark page (create watermark.pdf beforehand if needed)
watermark = PdfReader("watermark.pdf").pages[0]

# Apply it to every page
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)

6.3 Extracting Images

bash
# Use pdfimages (poppler-utils)
pdfimages -j input.pdf output_prefix
# Produces output_prefix-000.jpg, output_prefix-001.jpg, and so on
# (images not stored as JPEG in the PDF are written as .ppm or .pbm)

6.4 Password Protection

python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

# Set the user and owner passwords
writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)

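7. Quick Reference

| Task | Best tool | Command/code |
| --- | --- | --- |
| Merge PDFs | pypdf | writer.add_page(page) |
| Split a PDF | pypdf | One file per page |
| Extract text | pdfplumber | page.extract_text() |
| Extract tables | pdfplumber | page.extract_tables() |
| Create PDFs | reportlab | Canvas or Platypus |
| Command-line merge | qpdf | qpdf --empty --pages ... |
| OCR scanned documents | pytesseract | Convert to images first |
| Fill in forms | pdf-lib/pypdf | See forms.md |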

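8. Form Handling

8.1 Reference Documentation

To fill in a PDF form, read forms.md for the complete guide.

8.2 Basic Workflow

markdown
1. Identify the form fields
2. Extract the field names and types
3. Fill in the values
4. Save the modified PDF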
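
The four steps above can be sketched with pypdf as follows. This is a rough illustration under stated assumptions, not the procedure from forms.md: `summarize_fields` and `fill_form` are hypothetical helpers, the form is assumed to be an AcroForm that pypdf can read, and the field names passed in `values` are assumed to match the names in the document.

```python
def summarize_fields(fields):
    """Steps 1-2: map each field name to (type, current value).

    `fields` is a dict as returned by PdfReader.get_fields(), or None
    when the document has no form fields.
    """
    summary = {}
    for name, field in (fields or {}).items():
        # "/FT" holds the field type (e.g. "/Tx" for text), "/V" the value.
        summary[name] = (field.get("/FT"), field.get("/V"))
    return summary


def fill_form(src_path, dst_path, values):
    """Steps 3-4: write `values` into the form and save a new PDF."""
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    reader = PdfReader(src_path)
    known = summarize_fields(reader.get_fields())
    unknown = set(values) - set(known)
    if unknown:
        raise KeyError(f"fields not present in the form: {sorted(unknown)}")

    writer = PdfWriter()
    writer.append(reader)  # copy the pages together with the form
    for page in writer.pages:
        if "/Annots" in page:  # only pages that carry form widgets
            writer.update_page_form_field_values(page, values)

    with open(dst_path, "wb") as output:
        writer.write(output)


# Usage (file and field names are placeholders):
#   fill_form("form.pdf", "filled.pdf", {"full_name": "Ada Lovelace"})
```

Rejecting unknown field names before writing catches typos early, since a mistyped name would otherwise be ignored silently.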

9. Usage Examples

9.1 Trigger Phrases

"Extract the text from this PDF for me"
"Merge these PDF files"
"Extract the table data from this PDF"
"create a PDF report"
"fill in this PDF form"

9.2 Installing Dependencies

bash
# Python libraries
pip install pypdf pdfplumber reportlab

# Excel export in section 3.2 (optional)
pip install pandas openpyxl

# Command-line tools
sudo apt-get install poppler-utils  # pdftotext, pdfimages
sudo apt-get install qpdf           # qpdf

# OCR (optional)
pip install pytesseract pdf2image
sudo apt-get install tesseract-ocr

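10. Summary

| Key point | Description |
| --- | --- |
| pypdf | Basic operations: merging, splitting, rotating, encrypting |
| pdfplumber | Text and table extraction with layout preserved |
| reportlab | Creating new PDFs from scratch |
| Command-line tools | qpdf, pdftotext, pdftk |
| Form handling | See forms.md |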

Released under the MIT License. Content copyright belongs to the author.