LM Format Enforcer

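LM Format Enforcer is a library that enforces the output format of language models (JSON Schema, regular expressions, and so on). Rather than merely "suggesting" the desired output structure to the LLM, LM Format Enforcer can actually "force" the LLM's output to follow the specified schema.

LM Format Enforcer works with local LLMs (currently the LlamaCPP and HuggingFaceLLM backends are supported) and operates only by processing the LLM's output logits. Because it leaves the generation loop itself untouched, unlike other solutions that modify it, it can support advanced generation methods such as beam search and batching. See the comparison table on the LM Format Enforcer page for more details.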
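
To make the idea of "processing the output logits" concrete, here is a minimal, library-free sketch. It is not LM Format Enforcer's actual implementation: the toy vocabulary, the scores, and the helper names `mask_logits` and `greedy_pick` are all hypothetical. It only illustrates the general technique of setting the scores of disallowed tokens to negative infinity before the next token is chosen.

```python
import math


def mask_logits(logits, allowed_token_ids):
    """Return a copy of `logits` where every disallowed token scores -inf."""
    allowed = set(allowed_token_ids)
    return [
        score if token_id in allowed else -math.inf
        for token_id, score in enumerate(logits)
    ]


def greedy_pick(logits):
    """Return the id of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical toy vocabulary and scores: the model would like to start with
# "Sure", but a JSON object has to start with "{".
vocab = ["Sure", "{", "}", '"name"', ":"]
logits = [3.2, 1.1, 0.3, 0.9, 0.2]
allowed = [1]  # only "{" is a valid first token in this toy format

unconstrained = vocab[greedy_pick(logits)]
constrained = vocab[greedy_pick(mask_logits(logits, allowed))]
print(unconstrained, constrained)
```

Since only the scores are changed, whatever sampling or search strategy runs afterwards (greedy here) continues to work unmodified, which is the property the paragraph above relies on for beam search and batching.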

JSON Schema Output

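In LlamaIndex, we provide an initial integration with LM Format Enforcer that makes it easy to generate structured output (more specifically, pydantic objects).

For example, suppose we want to generate an album containing songs, with the following schema: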

from typing import List

from pydantic import BaseModel


class Song(BaseModel):
    title: str
    length_seconds: int


class Album(BaseModel):
    name: str
    artist: str
    songs: List[Song]

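Simply create an LMFormatEnforcerPydanticProgram, specify our desired pydantic class Album, and supply a suitable prompt template.

Note: LMFormatEnforcerPydanticProgram automatically fills the JSON schema of the pydantic class into the optional {json_schema} parameter of the prompt template. This helps the LLM produce the correct JSON naturally and reduces how often the format enforcer has to intervene, which improves output quality.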

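# Assumed import paths for the split-package llama-index releases;
# older releases expose these classes under different modules.
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.program.lmformatenforcer import LMFormatEnforcerPydanticProgram

program = LMFormatEnforcerPydanticProgram(
    output_cls=Album,
    prompt_template_str="Generate an example album, with an artist and a list of songs. Using the movie {movie_name} as inspiration. You must answer according to the following schema: \n{json_schema}\n",
    llm=LlamaCPP(),
    verbose=True,
)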
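
The {json_schema} substitution mentioned in the note can be pictured with the standard library alone. The sketch below is an assumption-laden illustration, not the program's internal code: `album_schema` is a hand-written, simplified schema (pydantic's real output for Album is more detailed), and `fill_prompt` is a hypothetical helper.

```python
import json

# Hand-written, simplified schema (assumption): roughly the shape of what a
# JSON schema for Album looks like, not pydantic's exact output.
album_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "artist": {"type": "string"},
        "songs": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "length_seconds": {"type": "integer"},
                },
                "required": ["title", "length_seconds"],
            },
        },
    },
    "required": ["name", "artist", "songs"],
}

template = (
    "Generate an example album, with an artist and a list of songs. "
    "Using the movie {movie_name} as inspiration. "
    "You must answer according to the following schema: \n{json_schema}\n"
)


def fill_prompt(template, schema, **user_inputs):
    """Hypothetical helper: substitute the schema and the user inputs."""
    return template.format(json_schema=json.dumps(schema), **user_inputs)


prompt = fill_prompt(template, album_schema, movie_name="The Shining")
print(prompt)
```

The braces inside the serialized schema are safe here because `str.format` only interprets placeholders in the template itself, not in the values being substituted.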

Now we can run the program by passing in additional user input. Here, let's go for something spooky and create an album inspired by The Shining.

output = program(movie_name="The Shining")

We get back a pydantic object:

Album(
    name="The Shining: A Musical Journey Through the Haunted Halls of the Overlook Hotel",
    artist="The Shining Choir",
    songs=[
        Song(title="Redrum", length_seconds=300),
        Song(
            title="All Work and No Play Makes Jack a Dull Boy",
            length_seconds=240,
        ),
        Song(title="Heeeeere's Johnny!", length_seconds=180),
    ],
)

See this notebook for more details.

Regex Output

LM Format Enforcer also supports regex output. Since LlamaIndex currently has no abstraction for regular expressions, we use the LLM directly after injecting LM Format Enforcer into it.

import re

import lmformatenforcer

# build_lm_format_enforcer_function and activate_lm_format_enforcer are
# LlamaIndex helpers; their import path depends on the installed version.
regex = r'"Hello, my name is (?P<name>[a-zA-Z]*)\. I was born in (?P<hometown>[a-zA-Z]*). Nice to meet you!"'
prompt = "Here is a way to present myself, if my name was John and I was born in Boston: "

llm = LlamaCPP()
regex_parser = lmformatenforcer.RegexParser(regex)
lm_format_enforcer_fn = build_lm_format_enforcer_function(llm, regex_parser)
with activate_lm_format_enforcer(llm, lm_format_enforcer_fn):
    output = llm.complete(prompt)

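This makes the LLM generate output in the regex format we specified. We can also parse the output to get the named groups. Note that the `.` after the hometown group in the regex is unescaped, so it matches any character; this is why the sample output below contains a comma at that position. Escape it as `\.` to require a literal period.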

print(output)
# "Hello, my name is John. I was born in Boston, Nice to meet you!"
print(re.match(regex, output.text).groupdict())
# {'name': 'John', 'hometown': 'Boston'}
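
The effect of the unescaped `.` in the pattern can be checked with Python's `re` module alone, without any LLM. In the sketch below, `loose` is the pattern exactly as written in the example, while `strict` is a hypothetical variant (not part of the original example) in which the second dot is escaped.

```python
import re

# Pattern as written in the example: the '.' after the hometown group is
# unescaped, so it matches any single character.
loose = r'"Hello, my name is (?P<name>[a-zA-Z]*)\. I was born in (?P<hometown>[a-zA-Z]*). Nice to meet you!"'
# Hypothetical stricter variant: the dot is escaped, so only a literal
# period is accepted after the hometown.
strict = r'"Hello, my name is (?P<name>[a-zA-Z]*)\. I was born in (?P<hometown>[a-zA-Z]*)\. Nice to meet you!"'

with_comma = '"Hello, my name is John. I was born in Boston, Nice to meet you!"'
with_period = '"Hello, my name is John. I was born in Boston. Nice to meet you!"'

loose_match = re.fullmatch(loose, with_comma)
strict_on_comma = re.fullmatch(strict, with_comma)
strict_match = re.fullmatch(strict, with_period)

print(loose_match.groupdict())
print(strict_on_comma)
print(strict_match.groupdict())
```

Because the enforcer only allows output that the pattern accepts, a loose pattern also loosens what the model is allowed to produce, so it is worth testing a regex this way before handing it to the parser.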

See this notebook for more details.