Introduction
TableGPT2, open-sourced by Zhejiang University, is a language model designed specifically for tabular data processing. It handles real-world scenarios such as irregular tables and ambiguous queries, making it well suited to enterprise business intelligence (BI) and document-processing applications.
The model and its ecosystem offer the following core capabilities:
- Pre-built agent: tablegpt-agent is a pre-built agent designed around TableGPT2 to strengthen tabular data processing.
- Multi-format data ingestion: the toolkit reads tabular data from CSV or Excel files and produces the corresponding analysis results.
- Large-language-model backbone: built on the TableGPT2 model family and optimized for table question answering, it handles complex queries efficiently.
- Multimodal integration: combines table recognition and generation techniques to provide an end-to-end image-to-data analysis solution.
TableGPT2 is trained on top of Qwen2.5 and introduces a novel table encoder that captures the structural and semantic information of tables more effectively, while supporting both structured and unstructured data.
The model currently comes in 7B and 72B variants; the 7B version has been released as open source.
GitHub project:
https://github.com/tablegpt/tablegpt-agent/
Hugging Face: https://huggingface.co/tablegpt/TableGPT2-7B
Deployment and Usage
Download the model
from transformers import AutoModelForCausalLM
# The checkpoint is downloaded and cached locally on first use
model = AutoModelForCausalLM.from_pretrained("tablegpt/TableGPT2-7B")
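If you prefer to fetch the weights into a local directory first (for example, to point vLLM at them later), a minimal sketch using huggingface_hub; the local_dir path here is just an illustrative choice:
from huggingface_hub import snapshot_download

# Download the full model repository into a local directory (path is an arbitrary example)
local_path = snapshot_download(repo_id="tablegpt/TableGPT2-7B", local_dir="./TableGPT2-7B")
print(local_path)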
Code
from transformers import AutoModelForCausalLM, AutoTokenizer
# Using pandas to read some structured data
import pandas as pd
from io import StringIO
# single table
EXAMPLE_CSV_CONTENT = """
"Loss","Date","Score","Opponent","Record","Attendance"
"Hampton (14–12)","September 25","8–7","Padres","67–84","31,193"
"Speier (5–3)","September 26","3–1","Padres","67–85","30,711"
"Elarton (4–9)","September 22","3–1","@ Expos","65–83","9,707"
"Lundquist (0–1)","September 24","15–11","Padres","67–83","30,774"
"Hampton (13–11)","September 6","9–5","Dodgers","61–78","31,407"
"""
csv_file = StringIO(EXAMPLE_CSV_CONTENT)
df = pd.read_csv(csv_file)
model_name = "tablegpt/TableGPT2-7B"
# Load the model and tokenizer; device_map="auto" places the weights on the available devices
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
example_prompt_template = """Given access to several pandas dataframes, write the Python code to answer the user's question.
/*
"{var_name}.head(5).to_string(index=False)" as follows:
{df_info}
*/
Question: {user_question}
"""
question = "Please read the table data and help me interpret what data is in it."
# Fill the template with a 5-row preview of the dataframe and the user's question
prompt = example_prompt_template.format(
    var_name="df",
    df_info=df.head(5).to_string(index=False),
    user_question=question,
)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
# Render the chat template into a single prompt string for generation
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
# Strip the prompt tokens so only the newly generated text is decoded
generated_ids = [
    output_ids[len(input_ids) :]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
Running result
The model accurately describes the contents of the table.
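The same prompt template can be reused for more specific analytical questions on the same dataframe; a hypothetical follow-up:
# A hypothetical follow-up question, reusing the template and dataframe from above
question = "Which game had the highest attendance?"
prompt = example_prompt_template.format(
    var_name="df",
    df_info=df.head(5).to_string(index=False),
    user_question=question,
)
# Build the chat messages and call model.generate() exactly as in the previous snippet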
Complex usage scenarios
For more complex scenarios, the tablegpt-agent toolkit makes it easier to handle all kinds of tabular inputs.
First, install the required Python packages:
pip install tablegpt-agent
pip install 'tablegpt-agent[local]'
pip install "vllm>=0.5.5" 和 pip install transformers>=4.37.0
Set up the vLLM server:
Start the server with the following command:
python -m vllm.entrypoints.openai.api_server --served-model-name TableGPT2-7B --model path/to/weights
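Once the server is up, you can sanity-check the OpenAI-compatible endpoint before wiring up the agent. A minimal sketch using the openai client, assuming the server is listening on the default localhost:8000 port:
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API under /v1; the key is not checked but must be non-empty
client = OpenAI(base_url="http://localhost:8000/v1", api_key="whatever")
resp = client.chat.completions.create(
    model="TableGPT2-7B",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)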
Configure the agent:
Import the required modules and set up the LLM:
from langchain_openai import ChatOpenAI
from pybox import LocalPyBoxManager
from tablegpt import DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR
from tablegpt.agent import create_tablegpt_graph
from langgraph.checkpoint.memory import MemorySaver
# Point at the vLLM OpenAI-compatible endpoint started above (note the /v1 suffix);
# vLLM does not validate the API key, but it must be non-empty
llm = ChatOpenAI(openai_api_base="http://localhost:8000/v1", openai_api_key="whatever", model_name="TableGPT2-7B")
# Local sandbox in which the agent executes the Python code it generates
pybox_manager = LocalPyBoxManager(profile_dir=DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR)
# In-memory checkpointer that keeps per-session conversation state
checkpointer = MemorySaver()
agent = create_tablegpt_graph(llm=llm, pybox_manager=pybox_manager, checkpointer=checkpointer, session_id="some-session-id")
Interact with the agent:
Note that create_tablegpt_graph returns a compiled LangGraph graph rather than an object with a run() method, so a call like agent.run("I have a file called data.csv. Please read it and understand its content.") will not work; the graph is driven through LangGraph's standard invoke/ainvoke interface instead.
Summary
TableGPT2-7B performs consistently well on benchmarks for table understanding, code generation, and structured-data reasoning, improving over comparable models by 35.20% on standard benchmarks and by 49.32% on BI-focused evaluations. Overall, the results are quite strong.