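Data Interpreter: A Data Science Agent

Overview

Data Interpreter (DI) is the MetaGPT agent dedicated to data-related problems. It understands user requirements, makes a plan, writes and executes code, and uses tools when needed. Together, these capabilities let it handle a wide range of data science scenarios.

Paper: Data Interpreter: An LLM Agent For Data Science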

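Core capabilities

Capability	Description
Requirement understanding	Parses the user's natural-language requirements
Task planning	Breaks complex tasks into executable steps
Code generation	Generates Python code automatically
Code execution	Runs the code in a sandboxed environment
Tool use	Uses tools such as web scraping and image processing
Reflection	Iterates and improves based on execution results

Quick start

Basic usage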

python

import asyncio
from metagpt.roles.di.data_interpreter import DataInterpreter

async def main():
    di = DataInterpreter()
    await di.run("Run data analysis on sklearn Iris dataset, include a plot")

asyncio.run(main())

Using it in a Jupyter Notebook

python

from metagpt.roles.di.data_interpreter import DataInterpreter

di = DataInterpreter()
await di.run("Analyze the Titanic dataset and predict survival")

Supported task types

1. Data visualization

python

di = DataInterpreter()
await di.run("""
    Load the iris dataset from sklearn,
    create a scatter plot showing the relationship between
    sepal length and sepal width, colored by species
""")

2. Machine learning modeling

python

di = DataInterpreter()
await di.run("""
    This is a titanic passenger survival dataset.
    Your goal is to predict passenger survival outcome.
    The target column is 'Survived'.
    Perform data analysis, preprocessing, feature engineering,
    and modeling to predict the target.
    Train data path: './data/titanic_train.csv'
    Eval data path: './data/titanic_eval.csv'
""")

3. Image processing

python

di = DataInterpreter()
await di.run("""
    Remove the background of the image using rembg.
    Image path: './images/photo.jpg'
    Save to: './images/photo_nobg.png'
""")

4. OCR text recognition

python

di = DataInterpreter()
await di.run("""
    This is an invoice image.
    Perform OCR on the image using PaddleOCR,
    extract the total amount and save as a table.
    Image path: './invoices/invoice.png'
""")

5. Web scraping

python

di = DataInterpreter()
await di.run("""
    Get products data from website https://example.com/shop
    and save it as a CSV file.
    Extract: product name, price, URL, and image URL.
""")

6. Web page imitation

python

di = DataInterpreter()
await di.run("""
    This is a URL of webpage: https://example.com/
    Utilize Selenium and WebDriver for rendering.
    Convert image to a webpage including HTML, CSS and JS in one go.
    Save webpage in a file.
""")

7. Math problem solving

python

di = DataInterpreter()
await di.run("""
    Solve the following math problem step by step:
    If a train travels at 60 mph for 2.5 hours,
    then at 80 mph for 1.5 hours,
    what is the total distance traveled?
""")

8. Email handling

python

di = DataInterpreter()
await di.run("""
    Check the content of the latest email in my Outlook.
    If the sender is from @company.com domain,
    automatically reply with a confirmation message.
    Email: user@example.com
    Password: ****
""")

Workflow

The Data Interpreter execution flow:

text

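User requirement
    │
    ▼
┌──────────────────────┐
│ Requirement analysis │  Understand the user's intent
└──────────────────────┘
    │
    ▼
┌──────────────────────┐
│ Task planning        │  Break the work into subtasks
└──────────────────────┘
    │
    ▼
┌──────────────────────┐
│ Code generation      │  Generate Python code
└──────────────────────┘
    │
    ▼
┌──────────────────────┐
│ Code execution       │  Run in the sandbox
└──────────────────────┘
    │
    ▼
┌──────────────────────┐
│ Result verification  │  Check the execution result
└──────────────────────┘
    │
    ├──▶ Success: return the result
    │
    └──▶ Failure: reflect and retry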
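
One way to picture the task planning step is as a dependency graph of subtasks that must run in a valid order. The sketch below uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+). The task names and the `execution_order` helper are assumptions made for illustration; they are not MetaGPT's internal plan format.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each task id maps to the ids of the tasks it depends on.
plan = {
    "load_data": set(),
    "clean_data": {"load_data"},
    "train_model": {"clean_data"},
    "plot_results": {"clean_data"},
    "report": {"train_model", "plot_results"},
}


def execution_order(dependencies):
    """Return task ids in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(plan)
print(order)  # "load_data" comes first and "report" last
```

Tasks with no dependency on each other, such as `train_model` and `plot_results` here, may appear in either order.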

Reflection mechanism

Data Interpreter has a built-in reflection mechanism that lets it recover from failed executions:

python

di = DataInterpreter(use_reflection=True)
await di.run("Complex task that might need iteration")

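Reflection flow:

Execute code: run the generated code
Analyze the error: if execution fails, analyze the cause
Generate a fix: produce a fix based on the error message
Re-execute: run the fixed code
Iterate: repeat until the code succeeds or the maximum number of retries is reached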
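
The loop described above can be sketched in plain Python. This is a simplified stand-in, not MetaGPT's implementation: the list `candidates` plays the role of the LLM by supplying the next "fix" after each failure, and the names `run_with_retries` and `max_attempts` are assumptions.

```python
import traceback


def run_with_retries(candidates, max_attempts=3):
    """Execute code snippets in turn until one succeeds or attempts run out."""
    errors = []
    for attempt, code in enumerate(candidates[:max_attempts], start=1):
        namespace = {}
        try:
            exec(code, namespace)  # run the generated code
            return {
                "ok": True,
                "attempt": attempt,
                "result": namespace.get("result"),
                "errors": errors,
            }
        except Exception:
            # Keep the final traceback line as feedback for the next attempt.
            errors.append(traceback.format_exc().strip().splitlines()[-1])
    return {"ok": False, "attempt": len(errors), "result": None, "errors": errors}


attempts = [
    "result = 10 / 0",  # first attempt fails with ZeroDivisionError
    "result = 10 / 2",  # "repaired" version succeeds
]
outcome = run_with_retries(attempts)
print(outcome["ok"], outcome["attempt"], outcome["result"])
print(outcome["errors"])
```

In a real agent the error text would be passed back to the model to produce the next candidate, rather than read from a fixed list.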

ML-Benchmark datasets

MetaGPT provides 8 typical machine learning datasets for evaluation:

ID	Task name	Dataset	Metric
01	iris	Iris	Visualization
02	wines	Wine Recognition	Accuracy
03	breast_cancer	Breast Cancer	Accuracy
04	titanic	Titanic	Accuracy
05	house_prices	House Prices	RMSE (log)
06	santander_customer	Santander Customer	AUC
07	icr_identify	ICR Identifying	F1 Score
08	santander_value	Santander Value	RMSLE
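
For readers unfamiliar with the two log-based metrics in the table, the sketch below gives their conventional definitions. The source does not spell out the exact formulas the benchmark uses, so treat these as assumptions: "RMSE (log)" is taken as RMSE between the natural logs of the values, and RMSLE as RMSE between `log(1 + value)`.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def rmse_log(y_true, y_pred):
    """RMSE between natural logs of the values (all values must be > 0)."""
    return rmse([math.log(t) for t in y_true], [math.log(p) for p in y_pred])


def rmsle(y_true, y_pred):
    """RMSE between log(1 + value) (all values must be > -1)."""
    return rmse([math.log1p(t) for t in y_true], [math.log1p(p) for p in y_pred])


actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
print(rmse(actual, predicted))
print(rmse_log(actual, predicted))
print(rmsle(actual, predicted))
```

Both log-based variants penalize relative error rather than absolute error, which suits targets such as prices that span several orders of magnitude.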

Running the benchmark

bash

python run_ml_benchmark.py --task_name 04_titanic

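Open-Ended Tasks dataset

20 open-ended tasks covering a variety of scenarios:

Category	Example tasks
OCR	Invoice recognition, text extraction
Web scraping	Data scraping, information organization
Email handling	Automatic replies, content summarization
Web page imitation	Screenshot to code
Image processing	Background removal
Text-to-image	Generation with Stable Diffusion
Game generation	Pyxel game development

Running the open-ended tasks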

bash

python run_open_ended_tasks.py \
    --task_name 14_image_background_removal \
    --data_dir ./di_dataset \
    --use_reflection True

Configuration options

Basic configuration

python

di = DataInterpreter(
    use_reflection=True,       # enable reflection
    max_auto_reply=3,          # maximum number of automatic retries
    tools=["pandas", "matplotlib", "sklearn"],  # available tools
)

Custom tools

python

from metagpt.tools import Tool

class CustomTool(Tool):
    name: str = "custom_tool"
    desc: str = "A custom tool for specific tasks"

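    async def run(self, *args) -> str:
        # Tool implementation goes here; this placeholder just echoes its arguments.
        result = " ".join(str(arg) for arg in args)
        return result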

di = DataInterpreter(tools=[CustomTool()])
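
The `name` and `desc` fields above hint at why a tool carries metadata: an agent needs a name and a description to decide when a tool applies. The sketch below illustrates that idea with a small, hypothetical registry. `TOOLS`, `register`, and `describe_tools` are invented for this example and are not MetaGPT's API.

```python
from typing import Callable, Dict

# Hypothetical registry, for illustration only; not MetaGPT's API.
TOOLS: Dict[str, dict] = {}


def register(name: str, desc: str):
    """Decorator that records a function under a name and description."""
    def decorator(func: Callable) -> Callable:
        TOOLS[name] = {"desc": desc, "func": func}
        return func
    return decorator


@register("word_count", "Count whitespace-separated words in a text")
def word_count(text: str) -> int:
    return len(text.split())


def describe_tools() -> str:
    """Render the name/description pairs an agent could be shown in a prompt."""
    return "\n".join(f"{name}: {info['desc']}" for name, info in TOOLS.items())


print(describe_tools())
print(TOOLS["word_count"]["func"]("a custom tool for specific tasks"))  # 6
```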

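Comparison with Software Company mode

Feature	Software Company	Data Interpreter
Focus	Software development	Data science
Team structure	Multi-role collaboration	Single agent
Output	Complete project	Analysis results / code
Use cases	Complex software systems	Data analysis tasks
Execution style	Pipeline	Iterative with reflection

Best practices

1. Write a clear task description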

python

# A good description
await di.run("""
    Analyze the sales data in 'sales.csv'.
    1. Clean missing values
    2. Calculate monthly revenue trends
    3. Create a line chart showing revenue over time
    4. Save the chart as 'revenue_trend.png'
""")

# Avoid vague descriptions
await di.run("Analyze data")  # too vague

2. Provide context

python

await di.run("""
    Context: This is a customer churn dataset with the following columns:
    - CustomerID: Unique identifier
    - Tenure: Months with service
    - MonthlyCharges: Monthly fee
    - Churn: Target variable (Yes/No)

    Task: Build a classification model to predict churn.
    Report: Accuracy, Precision, Recall, F1-Score
""")

3. Specify the output format

python

await di.run("""
    Analyze the dataset and provide:
    1. Summary statistics in a table
    2. Correlation heatmap saved as 'correlation.png'
    3. Top 5 insights as bullet points
    4. Code saved to 'analysis.py'
""")

Next section: 16.4 AFlow and SPO
