Improving Document QA w/ Custom Chunking Tutorial
This tutorial walks you through using your own custom chunking strategy and passing those chunks to BerriAI.
Let’s say we want to QA a research paper.
Since we can’t fit the entire paper into GPT’s context window, we need a way to break it down into smaller chunks. Berri provides default chunking out of the box, but we can also write our own.
Writing your own chunking strategy is a good way to improve the quality of responses, since answers to user questions are based on the most relevant chunk we find.
Step 1: Set up your environment
For this tutorial we’re going to use a sample ML research paper as our initial data source.
Step 2: Customize chunking
In this case, let’s make every page a chunk (i.e. the thing we feed into GPT).
Here we’re using our own data loader (PyPDF2), extracting the text from each page, and appending it to text_list.
Step 3: Create a custom ChatGPT instance to QA against our doc
Since we’ve stored our chunks as a list (text_list), let’s pass that to Berri.
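A rough sketch of what passing pre-chunked text might look like. Note that the endpoint and field names below are illustrative placeholders, not Berri’s actual API — consult the Berri docs for the real request shape:

```python
import json

text_list = ["text of page 1...", "text of page 2..."]  # built in Step 2

# Placeholder payload: serialize the pre-chunked list so Berri indexes
# our chunks as-is instead of re-chunking a raw file.
payload = {
    "user_email": "you@example.com",          # placeholder
    "data_source": json.dumps(text_list),     # one entry per page
}

# Hypothetical request -- check Berri's docs for the real endpoint:
# import requests
# response = requests.post("https://api.berri.ai/...", data=payload)
# print(response.json())
```

The key idea is that Berri receives a list of strings, one per chunk, so retrieval happens over the boundaries you chose.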
Step 4: Test our instance in the playground
Each instance has its own unique playground link. This is a place for you to test your model and quickly make changes (e.g. updating the prompt).