星驰编程网 (Xingchi Programming Network)


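Cool: TableGPT2, an open-source language model built for tabular data processing

Introduction

TableGPT2, open-sourced by Zhejiang University, is a language model designed specifically for tabular data processing. It copes with real-world conditions such as irregular tables and ambiguous queries, which makes it a good fit for enterprise business intelligence (BI) and document processing applications.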


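The model offers the following core capabilities:

  • Pre-built agent design: TableGPT-Agent is a pre-built agent designed for TableGPT2 to strengthen its table data processing capabilities.
  • Multi-format data reading: the tool reads tabular data from CSV or Excel files and generates the corresponding analysis results.
  • Backed by a large language model: it builds on the TableGPT2 model series, which is optimized for table question answering and handles complex queries efficiently.
  • Integrated multimodal features: it combines table recognition and generation techniques to provide a complete analysis pipeline from image to data.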
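
To make the multi-format reading point concrete, here is a minimal sketch using plain pandas. It is not tablegpt-agent's own loader: `load_table` is a hypothetical helper introduced only for illustration, and it simply picks a pandas reader based on the file extension.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_table(path):
    """Hypothetical helper: choose a pandas reader from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in {".xls", ".xlsx"}:
        # pandas needs an Excel engine (for example openpyxl) installed for .xlsx files.
        return pd.read_excel(path)
    raise ValueError(f"Unsupported table format: {suffix!r}")


# Demo with a throwaway CSV file so the snippet runs on its own.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "games.csv"
    csv_path.write_text("Opponent,Score\nPadres,8-7\nDodgers,9-5\n", encoding="utf-8")
    demo_df = load_table(csv_path)

print(demo_df.shape)  # (2, 2)
```

The resulting DataFrame can then be summarized with `head(5).to_string(index=False)` and passed to the model in a prompt.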

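TableGPT2 is trained on top of Qwen2.5 and introduces a newly designed table encoder that captures the structure and semantics of tables more effectively. It can process both structured and unstructured data.

The model currently comes in two sizes, 7B and 72B. The 7B version has been released as open source.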

GitHub project:
https://github.com/tablegpt/tablegpt-agent/

Hugging Face: https://huggingface.co/tablegpt/TableGPT2-7B


Deployment and usage

Download the model

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("tablegpt/TableGPT2-7B")

Code

from transformers import AutoModelForCausalLM, AutoTokenizer

# Using pandas to read some structured data
import pandas as pd
from io import StringIO

# single table
EXAMPLE_CSV_CONTENT = """
"Loss","Date","Score","Opponent","Record","Attendance"
"Hampton (14–12)","September 25","8–7","Padres","67–84","31,193"
"Speier (5–3)","September 26","3–1","Padres","67–85","30,711"
"Elarton (4–9)","September 22","3–1","@ Expos","65–83","9,707"
"Lundquist (0–1)","September 24","15–11","Padres","67–83","30,774"
"Hampton (13–11)","September 6","9–5","Dodgers","61–78","31,407"
"""

csv_file = StringIO(EXAMPLE_CSV_CONTENT)
df = pd.read_csv(csv_file)

model_name = "tablegpt/TableGPT2-7B"

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

example_prompt_template = """Given access to several pandas dataframes, write the Python code to answer the user's question.

/*
"{var_name}.head(5).to_string(index=False)" as follows:
{df_info}
*/

Question: {user_question}
"""
question = "Please read the table data and help me interpret what data is in it."

prompt = example_prompt_template.format(
    var_name="df",
    df_info=df.head(5).to_string(index=False),
    user_question=question,
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)
generated_ids = [
    output_ids[len(input_ids) :]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)
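
The prompt template above speaks of "several pandas dataframes" but the example only passes one. The sketch below shows one way to build the prompt for more than one table. Repeating the `/* ... */` block once per table is an assumption, not a format confirmed by the source, and `build_prompt` and `TABLE_BLOCK` are hypothetical names.

```python
import pandas as pd

# One comment block per table, mirroring the single-table template above.
TABLE_BLOCK = (
    "/*\n"
    '"{var_name}.head(5).to_string(index=False)" as follows:\n'
    "{df_info}\n"
    "*/"
)


def build_prompt(tables, user_question):
    """Hypothetical helper: tables maps a variable name to its DataFrame."""
    blocks = [
        TABLE_BLOCK.format(
            var_name=name,
            df_info=df.head(5).to_string(index=False),
        )
        for name, df in tables.items()
    ]
    return (
        "Given access to several pandas dataframes, write the Python code "
        "to answer the user's question.\n\n"
        + "\n\n".join(blocks)
        + f"\n\nQuestion: {user_question}\n"
    )


games = pd.DataFrame({"Opponent": ["Padres", "Dodgers"], "Score": ["8-7", "9-5"]})
parks = pd.DataFrame({"Opponent": ["Padres", "Dodgers"], "Attendance": [31193, 31407]})

multi_prompt = build_prompt(
    {"games": games, "parks": parks},
    "Which opponent drew the largest attendance?",
)
print(multi_prompt)
```

The returned string can be used as the user message content in place of `prompt` in the example above.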

Result

The model accurately outputs the final table information.


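Complex use cases

For more complex scenarios, the tablegpt-agent toolkit makes it easier to handle many kinds of table input.

First, install the required Python packages with the following commands: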

pip install tablegpt-agent
pip install 'tablegpt-agent[local]'
pip install "vllm>=0.5.5"
pip install "transformers>=4.37.0"

Set up the vLLM server:
Run the following command to start the server:

python -m vllm.entrypoints.openai.api_server --served-model-name TableGPT2-7B --model path/to/weights
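
Before wiring up the agent, it can help to confirm that the server is running and serving the expected model name. The sketch below queries the OpenAI-compatible `/v1/models` endpoint using only the standard library. The host and port (`localhost:8000`, vLLM's default) are assumptions about your setup, and `parse_model_ids` and `list_served_models` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def parse_model_ids(payload):
    """Extract model ids from an OpenAI-style model list response."""
    return [model["id"] for model in payload.get("data", [])]


def list_served_models(base_url="http://localhost:8000/v1", timeout=5):
    """Ask the server which models it serves (assumes it is reachable at base_url)."""
    with urlopen(f"{base_url}/models", timeout=timeout) as response:
        return parse_model_ids(json.load(response))


# Shape of the response the parser expects, shown with a hand-written sample.
sample_payload = {"object": "list", "data": [{"id": "TableGPT2-7B", "object": "model"}]}
print(parse_model_ids(sample_payload))  # ['TableGPT2-7B']
```

Once the server is running, calling `list_served_models()` should return a list that includes the name passed to `--served-model-name`.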

Configure the agent:
Import the required modules and set up the LLM:

from langchain_openai import ChatOpenAI
from pybox import LocalPyBoxManager
from tablegpt import DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR
from tablegpt.agent import create_tablegpt_graph
from langgraph.checkpoint.memory import MemorySaver

# vLLM serves its OpenAI-compatible API under the /v1 path.
llm = ChatOpenAI(openai_api_base="http://localhost:8000/v1", openai_api_key="whatever", model_name="TableGPT2-7B")
pybox_manager = LocalPyBoxManager(profile_dir=DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR)
checkpointer = MemorySaver()
agent = create_tablegpt_graph(llm=llm, pybox_manager=pybox_manager, checkpointer=checkpointer, session_id="some-session-id")

Interact with the agent:

agent.run("I have a file called data.csv. Please read it and understand its content.")


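Summary

TableGPT2-7B performs consistently well on benchmarks for table understanding, code generation, and structured data reasoning. Compared with similar models, it improves performance by 35.20% on standard benchmarks and by 49.32% on BI-focused evaluations. Overall, the results are quite good.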
