PDF Marker

PDFMarkerReader

Bases: BaseReader

PDF Marker Reader. Reads a PDF and converts it to Markdown, including tables with their layout.

Source code in llama-index-integrations/readers/llama-index-readers-pdf-marker/llama_index/readers/pdf_marker/base.py

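from pathlib import Path
from typing import Any, Dict, List, Optional

from llama_index.core.readers.base import BaseReader
from llama_index.core.schema import Document


class PDFMarkerReader(BaseReader):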
    """
    PDF Marker Reader. Reads a pdf to markdown format and tables with layout.
    """

    def __init__(self, *args: Any, **kwargs: Any) -> None:
        super().__init__(*args, **kwargs)

    def load_data(
        self,
        file: Path,
        max_pages: Optional[int] = None,
        langs: Optional[List[str]] = None,
        batch_multiplier: int = 2,
        start_page: Optional[int] = None,
        extra_info: Optional[Dict] = None,
    ) -> List[Document]:
        """
        Load data from a PDF.

        Args:
            file (Path): Path to the PDF file.
            max_pages (Optional[int]): Maximum number of pages to process. Omit to convert the entire document.
            langs (Optional[List[str]]): Languages to use for OCR. See the supported languages: https://github.com/VikParuchuri/surya/blob/master/surya/languages.py
            batch_multiplier (int): Factor by which to multiply the default batch sizes if you have extra VRAM. Higher values use more VRAM but process faster. Defaults to 2. The default batch sizes use about 3 GB of VRAM.
            start_page (Optional[int]): Page at which to start the conversion.
            extra_info (Optional[Dict]): Metadata to attach to the returned Document. Defaults to an empty dict.

        Returns:
            List[Document]: A list containing a single Document with the converted text.

        """
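        # Imported lazily so that marker is only required when load_data is called.
        from marker.convert import convert_single_pdf
        from marker.models import load_all_models

        # The models are loaded again on every call to load_data.
        model_lst = load_all_models()
        # Only the converted text is used; the images and metadata are discarded.
        full_text, _images, _out_meta = convert_single_pdf(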
            str(file),
            model_lst,
            max_pages=max_pages,
            langs=langs,
            batch_multiplier=batch_multiplier,
            start_page=start_page,
        )

        doc = Document(text=full_text, extra_info=extra_info or {})

        return [doc]
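
The class above can be exercised with a small wrapper. This is a minimal sketch, not part of the library: `pdf_to_markdown` and its `reader` parameter are hypothetical names introduced here, and the import path `llama_index.readers.pdf_marker` is assumed from the package layout shown in the source path. It relies only on `load_data` returning a one-element list whose `Document` carries the converted text.

```python
from pathlib import Path
from typing import Any, Optional


def pdf_to_markdown(
    pdf_path: str,
    max_pages: Optional[int] = None,
    reader: Optional[Any] = None,
) -> str:
    """Convert one PDF and return its text (hypothetical helper)."""
    if reader is None:
        # Imported lazily so this module loads even when the integration
        # is not installed. The import path is an assumption.
        from llama_index.readers.pdf_marker import PDFMarkerReader

        reader = PDFMarkerReader()

    # load_data returns a list with a single Document.
    documents = reader.load_data(file=Path(pdf_path), max_pages=max_pages)
    return documents[0].text
```

Accepting an optional `reader` keeps the helper easy to test with a stand-in object; in normal use it can be called as `pdf_to_markdown("paper.pdf", max_pages=3)`, which requires both the integration package and marker to be installed.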

load_data

load_data(file: Path, max_pages: Optional[int] = None, langs: Optional[List[str]] = None, batch_multiplier: int = 2, start_page: Optional[int] = None, extra_info: Optional[Dict] = None) -> List[Document]

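Load data from a PDF.

Parameters

Name	Type	Description	Default
`file`	`Path`	Path to the PDF file.	required
`max_pages`	`Optional[int]`	Maximum number of pages to process. Omit to convert the entire document.	`None`
`langs`	`Optional[List[str]]`	Languages to use for OCR. See the supported languages: https://github.com/VikParuchuri/surya/blob/master/surya/languages.py	`None`
`batch_multiplier`	`int`	Factor by which to multiply the default batch sizes if you have extra VRAM. Higher values use more VRAM but process faster. The default batch sizes use about 3 GB of VRAM.	`2`
`start_page`	`Optional[int]`	Page at which to start the conversion.	`None`
`extra_info`	`Optional[Dict]`	Metadata to attach to the returned Document. An empty dict is used when omitted.	`None`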
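
To make the parameters concrete, the following sketch collects them into a keyword dictionary that can be unpacked into `load_data`. The helper `build_load_kwargs` is hypothetical, and its range checks are illustrative assumptions added here; the reader itself does not perform them. It also records the requested page settings in `extra_info` so they travel with the resulting Document.

```python
from pathlib import Path
from typing import Any, Dict, List, Optional


def build_load_kwargs(
    pdf_path: str,
    max_pages: Optional[int] = None,
    start_page: Optional[int] = None,
    langs: Optional[List[str]] = None,
    batch_multiplier: int = 2,
) -> Dict[str, Any]:
    """Assemble keyword arguments for load_data (hypothetical helper)."""
    # Illustrative sanity checks, not part of the reader.
    if max_pages is not None and max_pages < 1:
        raise ValueError("max_pages must be at least 1 when given")
    if batch_multiplier < 1:
        raise ValueError("batch_multiplier must be at least 1")

    return {
        "file": Path(pdf_path),
        "max_pages": max_pages,
        "langs": langs,
        "batch_multiplier": batch_multiplier,
        "start_page": start_page,
        # Keep the page settings as metadata on the returned Document.
        "extra_info": {"start_page": start_page, "max_pages": max_pages},
    }
```

With a reader instance in hand, the call would look like `reader.load_data(**build_load_kwargs("report.pdf", max_pages=5))`.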

Returns

Type	Description
`List[Document]`	A list containing a single Document whose text is the converted Markdown.
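
Because each call returns a one-element list, converting several files means concatenating the results. The sketch below shows one way to do that; `load_many` is a hypothetical helper, and the `file_name` metadata key is an assumption chosen here so the Documents can be told apart. Note that, per the source, the models are loaded on every `load_data` call, so this loop pays that cost once per file.

```python
from pathlib import Path
from typing import Any, Iterable, List


def load_many(reader: Any, pdf_paths: Iterable[str]) -> List[Any]:
    """Convert several PDFs and return one flat list of Documents."""
    documents: List[Any] = []
    for raw_path in pdf_paths:
        path = Path(raw_path)
        # Each call yields a single-element list, so extend flattens it.
        documents.extend(
            reader.load_data(file=path, extra_info={"file_name": path.name})
        )
    return documents
```

The result holds one Document per input file, in input order.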
