Document Processor

Document processing utilities for loading and splitting PDFs.

load_and_split_pdf(paper_id, pdf_url, paper_metadata, config, **kwargs)

Load a PDF and split it into chunks.

Parameters:

Name             Type                 Description                                                                  Default
paper_id         str                  Unique identifier for the paper.                                             required
pdf_url          str                  URL to the PDF.                                                              required
paper_metadata   Dict[str, Any]       Metadata about the paper (e.g. Title, Authors, etc.).                        required
config           Any                  Configuration object with chunk_size and chunk_overlap attributes.          required
metadata_fields  List[str]            Additional metadata keys to propagate into each chunk (passed via kwargs).  required
documents_dict   Dict[str, Document]  Dictionary where split chunks are also stored under keys of the form "{paper_id}_{chunk_index}" (passed via kwargs).  required

Returns:

Type            Description
List[Document]  A list of Document chunks, each with updated metadata.

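A minimal usage sketch, assuming a plain namespace object stands in for config; the paper ID, URL, and metadata values below are hypothetical placeholders.

from types import SimpleNamespace
from typing import Dict

from langchain_core.documents import Document

# Hypothetical splitter configuration: any object exposing chunk_size
# and chunk_overlap attributes satisfies the config parameter.
config = SimpleNamespace(chunk_size=1200, chunk_overlap=200)

documents_dict: Dict[str, Document] = {}

chunks = load_and_split_pdf(
    paper_id="paper_001",                      # hypothetical identifier
    pdf_url="https://example.org/paper.pdf",   # hypothetical URL
    paper_metadata={"Title": "An Example Paper", "Authors": ["A. Author"]},
    config=config,
    metadata_fields=["Authors"],    # extra keys copied into each chunk's metadata
    documents_dict=documents_dict,  # populated in place with "{paper_id}_{chunk_index}" keys
)
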
Source code in aiagents4pharma/talk2scholars/tools/pdf/utils/document_processor.py
# Imports and logger setup are assumed from the module header, which is
# not shown in this excerpt.
import logging
from typing import Any, Dict, List

from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

logger = logging.getLogger(__name__)


def load_and_split_pdf(
    paper_id: str,
    pdf_url: str,
    paper_metadata: Dict[str, Any],
    config: Any,
    **kwargs: Any,
) -> List[Document]:
    """
    Load a PDF and split it into chunks.

    Args:
        paper_id: Unique identifier for the paper.
        pdf_url: URL to the PDF.
        paper_metadata: Metadata about the paper (e.g. Title, Authors, etc.).
        config: Configuration object with `chunk_size` and `chunk_overlap` attributes.
        metadata_fields: List of additional metadata keys to propagate into each
            chunk (passed via kwargs).
        documents_dict: Dictionary where split chunks will also be stored under keys
            of the form "{paper_id}_{chunk_index}" (passed via kwargs).

    Returns:
        A list of Document chunks, each with updated metadata.
    """
    metadata_fields: List[str] = kwargs["metadata_fields"]
    documents_dict: Dict[str, Document] = kwargs["documents_dict"]

    logger.info("Loading PDF for paper %s from %s", paper_id, pdf_url)

    # Load pages
    documents = PyPDFLoader(pdf_url).load()
    logger.info("Loaded %d pages from paper %s", len(documents), paper_id)

    if config is None:
        raise ValueError("Configuration is required for text splitting in Vectorstore.")
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=config.chunk_size,
        chunk_overlap=config.chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )

    # Split into chunks
    chunks = splitter.split_documents(documents)
    logger.info("Split paper %s into %d chunks", paper_id, len(chunks))

    # Attach metadata & populate documents_dict
    for i, chunk in enumerate(chunks):
        chunk_id = f"{paper_id}_{i}"
        chunk.metadata.update(
            {
                "paper_id": paper_id,
                "title": paper_metadata.get("Title", "Unknown"),
                "chunk_id": i,
                "page": chunk.metadata.get("page", 0),
                "source": pdf_url,
            }
        )
        for field in metadata_fields:
            if field in paper_metadata and field not in chunk.metadata:
                chunk.metadata[field] = paper_metadata[field]
        documents_dict[chunk_id] = chunk

    return chunks
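
After the call, documents_dict maps chunk IDs such as "paper_001_0" and "paper_001_1" to the corresponding Document objects, and each chunk's metadata carries paper_id, title, chunk_id, page, and source, plus any requested metadata_fields found in paper_metadata. Fields already present on a chunk are never overwritten.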