Docling

DoclingReader

Bases: BasePydanticReader

Docling Reader.

Extracts PDF, DOCX, and other document formats into LlamaIndex Documents as either Markdown or JSON-serialized Docling native format.

Parameters

Name	Type	Description	Default
`export_type`	`Literal["markdown", "json"]`	The type to export to.	`"markdown"`
`doc_converter`	`DocumentConverter`	The Docling converter to use.	default factory: `DocumentConverter`
`md_export_kwargs`	`Dict[str, Any]`	Kwargs to use in case of Markdown export.	`{"image_placeholder": ""}`
`id_func`	`DocIDGenCallable`	Doc ID generation function to use.	`_uuid4_doc_id_gen`
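
The default `id_func` returns a fresh UUID on every load, so re-reading the same file yields a different document ID each time. As a minimal sketch of the `id_func` parameter, the helper below derives a stable ID from the source path instead. `path_based_doc_id` and `build_reader` are hypothetical names introduced for illustration, and the import path `llama_index.readers.docling` is assumed from the package layout shown on this page.

```python
import hashlib
from pathlib import Path
from typing import Any, Union


def path_based_doc_id(doc: Any, file_path: Union[str, Path]) -> str:
    """Return a stable ID derived only from the source path (hypothetical helper)."""
    # `doc` is accepted because the reader calls id_func(doc=..., file_path=...),
    # but this sketch does not inspect the converted document.
    return hashlib.sha256(str(file_path).encode("utf-8")).hexdigest()


def build_reader():
    """Construct a DoclingReader that exports JSON and uses the ID function above."""
    # Imported lazily so the helper above works without the package installed.
    # Assumption: the reader is importable from llama_index.readers.docling.
    from llama_index.readers.docling import DoclingReader

    return DoclingReader(
        export_type=DoclingReader.ExportType.JSON,
        id_func=path_based_doc_id,
    )
```

Hashing only the path means two files with identical content but different paths get different IDs; hashing the exported text instead would be an alternative design if content identity matters more.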

Source code in llama-index-integrations/readers/llama-index-readers-docling/llama_index/readers/docling/base.py

class DoclingReader(BasePydanticReader):
    """
    Docling Reader.

    Extracts PDF, DOCX, and other document formats into LlamaIndex Documents as either Markdown or JSON-serialized Docling native format.

    Args:
        export_type (Literal["markdown", "json"], optional): The type to export to. Defaults to "markdown".
        doc_converter (DocumentConverter, optional): The Docling converter to use. Default factory: `DocumentConverter`.
        md_export_kwargs (Dict[str, Any], optional): Kwargs to use in case of markdown export. Defaults to `{"image_placeholder": ""}`.
        id_func: (DocIDGenCallable, optional): Doc ID generation function to use. Default: `_uuid4_doc_id_gen`

    """

    class ExportType(str, Enum):
        MARKDOWN = "markdown"
        JSON = "json"

    @runtime_checkable
    class DocIDGenCallable(Protocol):
        def __call__(self, doc: DLDocument, file_path: str | Path) -> str:
            ...

    @staticmethod
    def _uuid4_doc_id_gen(doc: DLDocument, file_path: str | Path) -> str:
        return str(uuid.uuid4())

    export_type: ExportType = ExportType.MARKDOWN
    doc_converter: DocumentConverter = Field(default_factory=DocumentConverter)
    md_export_kwargs: Dict[str, Any] = {"image_placeholder": ""}
    id_func: DocIDGenCallable = _uuid4_doc_id_gen

    def lazy_load_data(
        self,
        file_path: str | Path | Iterable[str] | Iterable[Path],
        extra_info: dict | None = None,
        fs: Optional[AbstractFileSystem] = None,
    ) -> Iterable[LIDocument]:
        """
        Lazily load from given source.

        Args:
            file_path (str | Path | Iterable[str] | Iterable[Path]): Document file source as single str (URL or local file) or pathlib.Path — or iterable thereof
            extra_info (dict | None, optional): Any pre-existing metadata to include. Defaults to None.

        Returns:
            Iterable[LIDocument]: Iterable over the created LlamaIndex documents.

        """
        file_paths = (
            file_path
            if isinstance(file_path, Iterable) and not isinstance(file_path, str)
            else [file_path]
        )

        for source in file_paths:
            dl_doc = self.doc_converter.convert(source).document
            text: str
            if self.export_type == self.ExportType.MARKDOWN:
                text = dl_doc.export_to_markdown(**self.md_export_kwargs)
            elif self.export_type == self.ExportType.JSON:
                text = json.dumps(dl_doc.export_to_dict())
            else:
                raise ValueError(f"Unexpected export type: {self.export_type}")
            li_doc = LIDocument(
                doc_id=self.id_func(doc=dl_doc, file_path=source),
                text=text,
            )
            li_doc.metadata = extra_info or {}
            yield li_doc
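
The export branching in the listing above can be sketched in isolation. This is an illustration only: `FakeDoclingDocument` and `export_text` are hypothetical stand-ins, not part of Docling or LlamaIndex, and the fake document's output is invented for the example. It shows that in JSON mode the resulting text is a JSON string that can be parsed back with `json.loads`.

```python
import json
from typing import Any, Dict


class FakeDoclingDocument:
    """Hypothetical stand-in exposing the two export methods the reader calls."""

    def export_to_markdown(self, image_placeholder: str = "[image]") -> str:
        return f"# Title\n\n{image_placeholder}Body text"

    def export_to_dict(self) -> Dict[str, Any]:
        return {"name": "example", "texts": ["Body text"]}


def export_text(doc: Any, export_type: str, md_export_kwargs: Dict[str, Any]) -> str:
    """Mirror the reader's branching: Markdown text or a JSON string."""
    if export_type == "markdown":
        return doc.export_to_markdown(**md_export_kwargs)
    if export_type == "json":
        return json.dumps(doc.export_to_dict())
    raise ValueError(f"Unexpected export type: {export_type}")


doc = FakeDoclingDocument()
# The kwargs are forwarded unchanged, so an empty placeholder removes image markers.
markdown_text = export_text(doc, "markdown", {"image_placeholder": ""})
# JSON mode serializes the dict export; json.loads recovers the original dict.
json_text = export_text(doc, "json", {})
restored = json.loads(json_text)
```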

lazy_load_data

lazy_load_data(file_path: str | Path | Iterable[str] | Iterable[Path], extra_info: dict | None = None, fs: Optional[AbstractFileSystem] = None) -> Iterable[Document]

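Lazily load from the given source.

Parameters

Name	Type	Description	Default
`file_path`	`str \| Path \| Iterable[str] \| Iterable[Path]`	Document file source as a single str (URL or local file) or pathlib.Path, or an iterable thereof.	required
`extra_info`	`dict \| None`	Any pre-existing metadata to include.	`None`
`fs`	`Optional[AbstractFileSystem]`	Not described in the docstring and not referenced in the method body shown below.	`None`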

Returns

Type	Description
`Iterable[Document]`	Iterable over the created LlamaIndex documents.

Source code in llama-index-integrations/readers/llama-index-readers-docling/llama_index/readers/docling/base.py
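
Because the method body uses `yield`, the returned iterable is a generator: each source is converted only when the caller advances it. The sketch below illustrates that behavior with a stand-in; `lazy_convert` is a hypothetical function that performs no real conversion and only records when each source would be processed.

```python
from typing import Iterable, Iterator, List


def lazy_convert(sources: Iterable[str], log: List[str]) -> Iterator[str]:
    """Stand-in for a lazy loader: handles one source per iteration step."""
    for source in sources:
        log.append(source)  # records the moment the pretend conversion runs
        yield f"converted:{source}"


calls: List[str] = []
docs = lazy_convert(["a.pdf", "b.pdf"], calls)
assert calls == []      # creating the generator converts nothing
first = next(docs)      # advancing once converts only the first source
assert calls == ["a.pdf"]
rest = list(docs)       # draining the generator converts the remainder
```

In practice this means a caller can stop early, for example after the first few documents, without paying the conversion cost for the remaining sources.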

def lazy_load_data(
    self,
    file_path: str | Path | Iterable[str] | Iterable[Path],
    extra_info: dict | None = None,
    fs: Optional[AbstractFileSystem] = None,
) -> Iterable[LIDocument]:
    """
    Lazily load from given source.

    Args:
        file_path (str | Path | Iterable[str] | Iterable[Path]): Document file source as single str (URL or local file) or pathlib.Path — or iterable thereof
        extra_info (dict | None, optional): Any pre-existing metadata to include. Defaults to None.

    Returns:
        Iterable[LIDocument]: Iterable over the created LlamaIndex documents.

    """
    file_paths = (
        file_path
        if isinstance(file_path, Iterable) and not isinstance(file_path, str)
        else [file_path]
    )

    for source in file_paths:
        dl_doc = self.doc_converter.convert(source).document
        text: str
        if self.export_type == self.ExportType.MARKDOWN:
            text = dl_doc.export_to_markdown(**self.md_export_kwargs)
        elif self.export_type == self.ExportType.JSON:
            text = json.dumps(dl_doc.export_to_dict())
        else:
            raise ValueError(f"Unexpected export type: {self.export_type}")
        li_doc = LIDocument(
            doc_id=self.id_func(doc=dl_doc, file_path=source),
            text=text,
        )
        li_doc.metadata = extra_info or {}
        yield li_doc
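
The first statement of the method accepts either a single source or a collection of sources. A minimal standalone sketch of that check follows; `normalize_sources` is a hypothetical name used only here. A `str` is itself iterable, so it is excluded explicitly, while a `pathlib.Path` is not iterable and is therefore wrapped like any other single value.

```python
from collections.abc import Iterable
from pathlib import Path


def normalize_sources(file_path):
    """Wrap a single str or Path in a list; pass other iterables through unchanged."""
    if isinstance(file_path, Iterable) and not isinstance(file_path, str):
        return file_path
    return [file_path]


single_str = normalize_sources("report.pdf")            # wrapped in a list
single_path = normalize_sources(Path("report.pdf"))     # wrapped in a list
several = normalize_sources(["a.pdf", Path("b.docx")])  # returned as-is
```

Without the `str` exclusion, a single path string would be iterated character by character, and each character would be treated as a separate source.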