This tutorial will guide you on how to use your own customized chunking strategy and pass those chunks to BerriAI.
Let’s say we want to QA a research paper. Since we can’t fit the entire paper into GPT, we need a way to break the paper down into smaller chunks. By default, Berri chunks your document for you, but we can also write our own chunking strategy. Writing your own strategy is a good way to improve the quality of responses, since we answer user questions based on the most relevant chunks we find.
In this case, let’s make every page a chunk (i.e. the thing we feed into GPT).
```python
import PyPDF2

text_list = []
with open("./ml_paper.pdf", "rb") as fp:
    # Create a PDF reader object
    pdf = PyPDF2.PdfReader(fp)
    # Get the number of pages in the PDF document
    num_pages = len(pdf.pages)
    # Iterate over every page
    for page in range(num_pages):
        # Extract the text from the page
        page_text = pdf.pages[page].extract_text()
        text_list.append(page_text)  # chunk by page
```
Here we’re using our own data loader (PyPDF2) to extract the text from each page and append it to text_list, giving us one chunk per page.
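Page-level chunks are just one option. As a sketch of an alternative strategy, we could split the extracted text into fixed-size word windows with a small overlap, so that a relevant passage isn’t cut off at a chunk boundary. The `chunk_size` and `overlap` values below are illustrative choices, not Berri defaults:

```python
def chunk_text(text, chunk_size=200, overlap=20):
    """Split text into overlapping chunks of roughly chunk_size words.

    chunk_size and overlap are illustrative values; tune them for your
    documents and your model's context window.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

# Example: a page of ~450 words becomes 3 overlapping chunks
page_text = " ".join(["word"] * 450)
chunks = chunk_text(page_text)
print(len(chunks))  # → 3
```

You could apply `chunk_text` to each `page_text` in the loop above instead of appending the whole page, then pass the resulting list of chunks to Berri.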
Each instance has its own unique playground link. This is a place for you to test your model and quickly make changes (e.g. updating the prompt).