Why do I need to comment out this line after running it for the first time? Do I have to do this every time I add a document? After commenting it out, I added new content, but the query results all seem to come from the content that was added the first time.
When 10,000 PDF documents need to be queried, what is the processing mechanism of PDFKnowledgeBase?
Hi @zgli
Thank you for reaching out and using Phidata! I’ve tagged the relevant engineers to assist you with your query. We aim to respond within 24 hours.
If this is urgent, please feel free to let us know, and we’ll do our best to prioritize it.
Thanks for your patience!
Hello @zgli! You do not need to comment out the knowledge_base.load() line as long as recreate=False is passed to load(). For example, in the following code, the knowledge base is loaded only on the first run. On subsequent runs, the knowledge base checks whether you are loading any new documents; if not, the same documents are not added again.
from phi.agent import Agent
from phi.model.anthropic import Claude
from phi.knowledge.pdf import PDFUrlKnowledgeBase
from phi.vectordb.pgvector import PgVector

db_url = "postgresql+psycopg://ai:ai@localhost:5532/ai"

knowledge_base = PDFUrlKnowledgeBase(
    urls=["https://phi-public.s3.amazonaws.com/recipes/ThaiRecipes.pdf"],
    vector_db=PgVector(table_name="recipes", db_url=db_url),
)
knowledge_base.load(recreate=False)  # Comment out after first run

agent = Agent(
    model=Claude(id="claude-3-5-sonnet-20241022"),
    knowledge_base=knowledge_base,
    use_tools=True,
    show_tool_calls=True,
)
agent.print_response("How to make Thai curry?", markdown=True)
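The "load only what is new" behaviour described above can be pictured with a small sketch. This is not Phidata's implementation: `InMemoryStore`, `doc_exists`, `insert`, and this `load` function are hypothetical names used only to illustrate the idea of skipping documents that are already stored, assuming each document is identified by a hash of its content.

```python
import hashlib


class InMemoryStore:
    """Hypothetical stand-in for a vector DB table (not a Phidata class)."""

    def __init__(self):
        self.rows = {}

    def doc_exists(self, doc_id):
        return doc_id in self.rows

    def insert(self, doc_id, content):
        self.rows[doc_id] = content


def load(store, documents, recreate=False):
    """Insert only documents that are not stored yet; return how many were added."""
    if recreate:
        # recreate=True wipes the existing data before loading.
        store.rows.clear()
    added = 0
    for content in documents:
        # Assumption: a content hash is used as the document identifier.
        doc_id = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if store.doc_exists(doc_id):
            continue  # already loaded on an earlier run, so skip it
        store.insert(doc_id, content)
        added += 1
    return added


store = InMemoryStore()
first = load(store, ["pad thai recipe", "green curry recipe"])
second = load(store, ["pad thai recipe", "green curry recipe"])
third = load(store, ["pad thai recipe", "tom yum recipe"])
```

Under these assumptions, calling `load` again with the same documents adds nothing, and only the genuinely new document is inserted on the third call.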
When 10,000 PDF documents need to be queried, what is the processing mechanism of PDFKnowledgeBase?
PDFKnowledgeBase currently processes documents sequentially, though the team is working on supporting async inserts for vector DBs.
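To make the difference concrete, here is a minimal sketch contrasting sequential inserts with concurrent ones. It does not use any Phidata API: `fake_insert` is a hypothetical stand-in for a vector DB insert, and the concurrency limit of 8 is an arbitrary assumption.

```python
import asyncio


async def fake_insert(doc: str) -> str:
    # Hypothetical stand-in for a vector DB insert; the sleep mimics I/O latency.
    await asyncio.sleep(0.01)
    return f"inserted:{doc}"


async def insert_sequentially(docs):
    # One document at a time: total time grows linearly with len(docs).
    return [await fake_insert(d) for d in docs]


async def insert_concurrently(docs, limit=8):
    # At most `limit` inserts are in flight at any moment.
    sem = asyncio.Semaphore(limit)

    async def guarded(doc):
        async with sem:
            return await fake_insert(doc)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(d) for d in docs))


docs = [f"doc_{i}.pdf" for i in range(20)]
sequential = asyncio.run(insert_sequentially(docs))
concurrent = asyncio.run(insert_concurrently(docs))
```

Both variants produce the same results; the concurrent one simply overlaps the waiting time, which is where async inserts would be expected to help with a large number of PDFs.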
I used LanceDB. Looking at the LanceDb class in phi.vectordb.lancedb, I found that in the __init__ method, if no lancedb.db.LanceTable is passed in, self.table = self._init_table() is called.
This means that whenever LanceDb is instantiated, the table data is recreated.
On the first run, knowledge_base.load(recreate=False) loads the PDFs in the TEMP_DIR directory into LanceDB. If you then comment out knowledge_base.load(recreate=False), the __init__ method in LanceDb deletes the LanceDB data from the first run.
class LanceDb(VectorDb):
    def __init__(
        …
        self.table: lancedb.db.LanceTable
        self.table_name: str
        if table:
            if not isinstance(table, lancedb.db.LanceTable):
                …
        else:
            if not table_name:
                raise ValueError("Either table or table_name should be provided.")
            self.table_name = table_name
            self._id = "id"
            self._vector_col = "vector"
            self.table = self._init_table()
        …
Does this mean that each user needs a unique identifier to keep their LanceDB data separate? Otherwise, could one user uploading a file cause failures for other users?
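One way to picture the per-user separation being asked about is to derive a distinct table name from each user identifier. The sketch below is only an illustration of that idea, not something Phidata or LanceDB provides: `table_name_for_user` and the `documents` prefix are hypothetical.

```python
import hashlib
import re


def table_name_for_user(user_id: str, prefix: str = "documents") -> str:
    """Build a deterministic, table-safe name for one user (hypothetical helper)."""
    # Keep a short readable slug made of lowercase letters, digits and underscores.
    slug = re.sub(r"[^a-z0-9]+", "_", user_id.lower()).strip("_")[:24]
    # A short hash of the raw identifier keeps names distinct even when slugs collide.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}_{slug}_{digest}" if slug else f"{prefix}_{digest}"


alice_table = table_name_for_user("alice")
bob_table = table_name_for_user("bob")
```

Each user's uploads would then go to a table of their own, so creating or recreating one user's table would not touch another user's data.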
This is my test code
import json

import pandas as pd
import lancedb

# DB_URL is the path or URI of the LanceDB database, defined elsewhere.
db = lancedb.connect(DB_URL)
table_name = "documents"

try:
    table = db.open_table(table_name)
    print(f"Successfully opened table: {table_name}")
    row_count = table.count_rows()
    print(f"Total number of documents in the table: {row_count}")
    df = table.to_pandas()
    # Each row stores its content and metadata as a JSON string in "payload".
    for _, row in df.iterrows():
        payload = json.loads(row["payload"])
        print(f"Document ID: {row['id']}")
        print(f"Content: {payload['content']}")
        print(f"Metadata: {payload['metadata']}")
        print("---")
    print("\nTable Contents:")
    print(df)
except Exception as e:
    print(f"An error occurred: {e}")
Another question:
How do I configure the agent or assistant to search the knowledge_base and send the results to the LLM?
Hey @zgli, if you add search_knowledge=True to your Agent, it will give you the results you are looking for! Your Agent class will look something like this: