JSONFormer in Practice: Structured Decoding with Hugging Face

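Hey folks, today let's talk about JSONFormer, a library that takes some of the pain out of generating structured data with Hugging Face models. Put simply, JSONFormer is an experimental tool that wraps a local Hugging Face pipeline so that the model's output is decoded against a subset of JSON Schema and comes out in a more reliable structure.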

Technical Background

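When working with complex JSON structures, we usually want the model's output to follow a predefined shape. Many generative models fall short here, and their output often drifts away from the format we asked for. JSONFormer handles this quite smoothly: it fills in the structural tokens itself and samples only the content tokens from the model.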
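
To make the "fill in the structure, sample only the content" idea concrete, here is a toy sketch in plain Python. It is not JSONFormer's actual implementation: `fake_sample` is a hypothetical stand-in for a language model, and `structured_decode` is an illustrative name. The point is only that the keys and nesting come from the schema, while the "model" is asked for leaf values.

```python
import json
import random


def fake_sample(prompt: str, kind: str, rng: random.Random):
    """Hypothetical stand-in for a language model: returns one value of the requested type."""
    if kind == "string":
        return rng.choice(["ask_star_coder", "iterator vs iterable"])
    if kind == "number":
        return round(rng.uniform(0.0, 1.0), 2)
    if kind == "boolean":
        return rng.choice([True, False])
    raise ValueError(f"unsupported type: {kind}")


def structured_decode(schema: dict, prompt: str, rng: random.Random):
    """Walk the schema; the structure is emitted by us, only leaf values come from the model."""
    kind = schema["type"]
    if kind == "object":
        return {
            key: structured_decode(sub_schema, prompt, rng)
            for key, sub_schema in schema["properties"].items()
        }
    if kind == "array":
        # Toy simplification: always produce exactly one element.
        return [structured_decode(schema["items"], prompt, rng)]
    return fake_sample(prompt, kind, rng)


schema = {
    "type": "object",
    "properties": {
        "action": {"type": "string"},
        "action_input": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "temperature": {"type": "number"},
            },
        },
    },
}

decoded = structured_decode(schema, "What is an iterator?", random.Random(0))
text = json.dumps(decoded)
print(text)
```

Because the braces, keys, and nesting never pass through the model, the serialized result always parses as JSON, whatever the sampled values turn out to be.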

A Closer Look at How It Works

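JSONFormer builds on a Hugging Face model pipeline and uses structured decoding to make sure the output conforms to a specified JSON Schema. Before trying it, let's set up a baseline and see what a Hugging Face model produces without structured decoding. We start by installing the library and defining a tool that the model is allowed to call.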

%pip install --upgrade --quiet jsonformer langchain-huggingface langchain-experimental > /dev/null

import logging
import json
import os
import requests
from langchain_core.tools import tool

logging.basicConfig(level=logging.ERROR)

HF_TOKEN = os.environ.get("HUGGINGFACE_API_KEY")

@tool
def ask_star_coder(query: str, temperature: float = 1.0, max_new_tokens: float = 250):
    """Query the BigCode StarCoder model about coding questions."""
    url = "https://api-inference.huggingface.co/models/bigcode/starcoder"
    headers = {
        "Authorization": f"Bearer {HF_TOKEN}",
        "content-type": "application/json",
    }
    # The Inference API reads generation settings from the "parameters" field.
    payload = {
        "inputs": f"{query}\n\nAnswer:",
        "parameters": {
            "temperature": temperature,
            "max_new_tokens": int(max_new_tokens),
        },
    }
    response = requests.post(url, headers=headers, data=json.dumps(payload))
    response.raise_for_status()
    return json.loads(response.content.decode("utf-8"))

Hands-On Code Demo

We define a simple prompt that asks the AI assistant to answer in JSON. First, let's look at the result without JSONFormer:

from langchain_huggingface import HuggingFacePipeline
from transformers import pipeline

hf_model = pipeline(
    "text-generation", model="cerebras/Cerebras-GPT-590M", max_new_tokens=200
)

original_model = HuggingFacePipeline(pipeline=hf_model)

prompt = """You must respond using JSON format, with a single action and single action input.
You may 'ask_star_coder' for help on coding problems.

{arg_schema}

BEGIN! Answer the Human's question as best as you are able.
------
Human: 'What's the difference between an iterator and an iterable?'
AI Assistant:""".format(arg_schema=ask_star_coder.args)

generated = original_model.invoke(prompt, stop=["Observation:", "Human:"])
print(generated)
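
A quick way to judge output like this is to try parsing it as JSON. The helper below is a minimal sketch using only the standard library. `is_valid_json` is an illustrative name, and both sample strings are made up for the demonstration; neither is real model output.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Hypothetical samples: free-form prose versus a well-formed action object.
free_form = "An iterator is an object that yields values one at a time..."
well_formed = '{"action": "ask_star_coder", "action_input": {"query": "What is an iterable?"}}'

print(is_valid_json(free_form))
print(is_valid_json(well_formed))
```

Passing `generated` to the same helper gives a yes/no answer instead of an eyeball check.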

The result is not satisfying: the model did not follow the JSON format. So let's try JSONFormer:

decoder_schema = {
    "title": "Decoding Schema",
    "type": "object",
    "properties": {
        "action": {"type": "string", "default": ask_star_coder.name},
        "action_input": {
            "type": "object",
            "properties": ask_star_coder.args,
        },
    },
}

from langchain_experimental.llms import JsonFormer

json_former = JsonFormer(json_schema=decoder_schema, pipeline=hf_model)
results = json_former.invoke(prompt, stop=["Observation:", "Human:"])
print(results)
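
Once the output is reliably JSON, it can be parsed and routed to the matching tool. The following is a rough sketch under stated assumptions: `fake_ask_star_coder` is a stub that replaces the real network call, `TOOLS` and `run_action` are illustrative names, and `sample` is a hand-written string in the shape of the decoding schema, not actual JSONFormer output.

```python
import json


def fake_ask_star_coder(query: str, temperature: float = 1.0, max_new_tokens: float = 250) -> str:
    """Stub standing in for the real tool so the example runs offline."""
    return f"(stub answer for: {query})"


# Hypothetical registry mapping action names to callables.
TOOLS = {"ask_star_coder": fake_ask_star_coder}


def run_action(raw: str) -> str:
    """Parse a structured response and call the tool named in its "action" field."""
    data = json.loads(raw)
    tool_fn = TOOLS[data["action"]]
    return tool_fn(**data["action_input"])


# Hand-written example in the shape of the decoding schema.
sample = (
    '{"action": "ask_star_coder", '
    '"action_input": {"query": "What is an iterable?", '
    '"temperature": 0.0, "max_new_tokens": 100}}'
)

answer = run_action(sample)
print(answer)
```

In a real setup you would pass `results` instead of `sample` and handle an unknown action name or missing arguments before calling the tool.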

Optimization Tips

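When using JSONFormer, you can combine it with other proxy services to improve stability. Personally, I have been using the one-stop LLM solution from zzzzapi.com, and it has worked quite well for me.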

Additional Notes and Summary

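JSONFormer really does help make model generation more reliable, especially when dealing with complex structured data. That's it for today's technical share, and I hope it helps. If you run into problems during development, feel free to discuss them in the comments~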
