# LlamaIndex RAG Setup Guide

## Overview

RewardPilot uses LlamaIndex to build a semantic search system over 50+ credit card benefit documents. This enables the agent to answer complex questions like "Which card has the best travel insurance?" or "Does Amex Gold work at Costco?"

## Why LlamaIndex + RAG?

| Problem | Traditional Approach | RAG Solution |
|---------|---------------------|--------------|
| **Card benefits change** | Hardcode rules → outdated | Dynamic document retrieval |
| **Complex questions** | Manual lookup | Semantic search |
| **50+ cards** | Impossible to memorize | Vector similarity |
| **Nuanced rules** | Prone to errors | Context-aware answers |

**Example:**

- **Question:** "Can I use Chase Sapphire Reserve for airport lounge access when flying domestic?"
- **Traditional:** Check 10+ pages of terms
- **RAG:** Semantic search → "Yes, Priority Pass includes domestic lounges"

---

## Architecture

```
┌─────────────────────────────────────────────────────────┐
│                      User Question                      │
│         "Which card has best grocery rewards?"          │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                  Query Transformation                   │
│          (Expand, rephrase, extract keywords)           │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                     Embedding Model                     │
│              OpenAI text-embedding-3-small              │
│                    (1536 dimensions)                    │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                      Vector Store                       │
│                        ChromaDB                         │
│                  (50+ card documents)                   │
│                    (10,000+ chunks)                     │
└────────────────────┬────────────────────────────────────┘
                     │
                     │ Retrieve top-k (k=5)
                     ▼
┌─────────────────────────────────────────────────────────┐
│                    Retrieved Context                    │
│  1. Amex Gold: 4x points on U.S. supermarkets...        │
│  2. Citi Custom Cash: 5% on top category...             │
│  3. Chase Freedom Flex: 5% rotating categories...       │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                        Reranking                        │
│            (Cohere Rerank or Cross-Encoder)             │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                      LLM Synthesis                      │
│                  Gemini 2.0 Flash Exp                   │
│             (Generate answer from context)              │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                       Final Answer                      │
│  "Amex Gold offers 4x points (best rate) but has        │
│   $25k annual cap. Citi Custom Cash gives 5% but        │
│   only $500/month. For high spenders, use Amex."        │
└─────────────────────────────────────────────────────────┘
```
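The "Query Transformation" box in the diagram is not wired up anywhere in the server code that follows. One way to add it is LlamaIndex's built-in HyDE transform, which has the LLM draft a hypothetical answer and embeds that draft instead of the raw question. A minimal sketch, assuming the `index` object built in the Implementation section below:

```python
# query_transform_sketch.py — optional; the rest of this guide queries the
# index directly without a transform step.
from llama_index.core.indices.query.query_transform.base import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine

def create_transforming_query_engine(index):
    """Wrap the base query engine so every query is expanded via HyDE."""
    base_engine = index.as_query_engine(similarity_top_k=5)
    # include_original=True retrieves with both the draft answer and the raw query
    hyde = HyDEQueryTransform(include_original=True)
    return TransformQueryEngine(base_engine, query_transform=hyde)
```

HyDE tends to help most on short, underspecified questions ("best grocery card?") where the raw query embedding carries little signal.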
---

## Setup

### 1. Install Dependencies

```bash
pip install llama-index==0.12.5 \
    llama-index-vector-stores-chroma==0.4.1 \
    llama-index-embeddings-openai==0.3.1 \
    llama-index-llms-gemini==0.4.2 \
    chromadb==0.5.23 \
    pypdf==5.1.0 \
    beautifulsoup4==4.12.3
```

### 2. Prepare Card Documents

Create the directory structure:

```
data/
├── cards/
│   ├── amex_gold.pdf
│   ├── chase_sapphire_reserve.pdf
│   ├── citi_custom_cash.pdf
│   └── ... (50+ cards)
├── terms/
│   ├── amex_terms.pdf
│   ├── chase_terms.pdf
│   └── ...
└── guides/
    ├── maximizing_rewards.md
    ├── category_codes.md
    └── ...
```

### 3. Document Sources

#### Option A: Scrape from Issuer Websites

```python
# scrape_card_docs.py
import requests
from bs4 import BeautifulSoup

CARD_URLS = {
    "amex_gold": "https://www.americanexpress.com/us/credit-cards/card/gold-card/",
    "chase_sapphire_reserve": "https://creditcards.chase.com/rewards-credit-cards/sapphire/reserve",
    # ... more cards
}

def scrape_card_benefits(card_name, url, output_file):
    """Scrape card benefits from an issuer website."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract the benefits section (the selector varies per issuer)
    benefits = soup.find('div', class_='benefits-section')
    if benefits is None:
        raise ValueError(f"No benefits section found for {card_name}")

    # Save to markdown
    with open(output_file, 'w') as f:
        f.write(f"# {card_name}\n\n")
        f.write(benefits.get_text())

# Scrape all cards
for card_name, url in CARD_URLS.items():
    scrape_card_benefits(card_name, url, f"data/cards/{card_name}.md")
```

Issuer pages are heavily scripted and their markup changes often, so treat scraping as a starting point and verify the output by hand.

#### Option B: Manual Documentation

Create markdown files:

**File:** `data/cards/amex_gold.md`

```markdown
# American Express Gold Card

## Overview
- **Annual Fee:** $325
- **Rewards Rate:** 4x points on dining & U.S. supermarkets (up to $25k/year)
- **Welcome Bonus:** 90,000 points after $6k spend in 6 months

## Earning Structure

### 4x Points
- Restaurants worldwide (including takeout & delivery)
- U.S. supermarkets (up to $25,000 per year, then 1x)

### 3x Points
- Flights booked directly with airlines or on amextravel.com

### 1x Points
- All other purchases

## Monthly Credits
- $10 Uber Cash (Uber Eats eligible)
- $10 Grubhub/Seamless/The Cheesecake Factory/select Shake Shack

## Travel Benefits
- No foreign transaction fees
- Trip delay insurance
- Lost luggage insurance
- Car rental loss and damage insurance

## Merchant Acceptance
- **Accepted:** Most merchants worldwide
- **Not Accepted:** Costco warehouses (Costco.com works)
- **Not Accepted:** Some small businesses

## Redemption Options
- Transfer to 20+ airline/hotel partners (1:1 ratio)
- Pay with Points at Amazon (0.7 cents per point)
- Statement credits (0.6 cents per point)
- Book travel through Amex Travel (1 cent per point)

## Best For
- High grocery spending (up to $25k/year)
- Frequent dining out
- Travelers who value transfer partners

## Limitations
- $25,000 annual cap on 4x supermarket category
- Amex not accepted everywhere
- Annual fee not waived first year
```
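The dependency list pins `pypdf`, which `SimpleDirectoryReader` uses under the hood for the PDFs in `data/`. If you would rather normalize issuer PDFs into markdown yourself, a minimal sketch (the `data/terms_md/` output directory is illustrative; index either the PDFs or the converted files, not both, to avoid duplicate chunks):

```python
# pdf_to_md_sketch.py — optional pre-processing; SimpleDirectoryReader can
# also read the PDFs directly via pypdf.
from pathlib import Path

from pypdf import PdfReader

def pdf_to_markdown(pdf_path: Path, out_dir: Path) -> Path:
    """Extract text page by page and write it out as one markdown file."""
    reader = PdfReader(str(pdf_path))
    text = "\n\n".join(page.extract_text() or "" for page in reader.pages)
    out_file = out_dir / f"{pdf_path.stem}.md"
    out_file.write_text(f"# {pdf_path.stem}\n\n{text}")
    return out_file

out_dir = Path("data/terms_md")
out_dir.mkdir(parents=True, exist_ok=True)
for pdf in Path("data/terms").glob("*.pdf"):
    pdf_to_markdown(pdf, out_dir)
```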
---

## Implementation

### File: `rewards_rag_server.py`

```python
"""
LlamaIndex RAG server for credit card benefits
"""
from typing import Optional

from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    Settings,
    load_index_from_storage
)
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.gemini import Gemini
from llama_index.core.node_parser import SentenceSplitter
import chromadb
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import os

# Initialize FastAPI
app = FastAPI(title="Rewards RAG MCP Server")

# Configure LlamaIndex
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_key=os.getenv("OPENAI_API_KEY")
)
Settings.llm = Gemini(
    model="models/gemini-2.0-flash-exp",
    api_key=os.getenv("GEMINI_API_KEY")
)
Settings.chunk_size = 512
Settings.chunk_overlap = 50

# Initialize ChromaDB
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("credit_cards")

# Create vector store
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
```

---

## Document Loading & Indexing

```python
def load_and_index_documents():
    """Load card documents and create a vector index."""
    # Load documents from directory
    documents = SimpleDirectoryReader(
        input_dir="./data",
        recursive=True,
        required_exts=[".pdf", ".md", ".txt"]
    ).load_data()
    print(f"Loaded {len(documents)} documents")

    # Parse into nodes (chunks)
    node_parser = SentenceSplitter(
        chunk_size=512,
        chunk_overlap=50
    )
    nodes = node_parser.get_nodes_from_documents(documents)
    print(f"Created {len(nodes)} nodes")

    # Create index
    index = VectorStoreIndex(
        nodes=nodes,
        storage_context=storage_context
    )

    # Persist to disk
    index.storage_context.persist(persist_dir="./storage")

    return index

# Load index on startup
try:
    # Try loading an existing index
    storage_context = StorageContext.from_defaults(
        vector_store=vector_store,
        persist_dir="./storage"
    )
    index = load_index_from_storage(storage_context)
    print("Loaded existing index")
except Exception:
    # Create a new index
    print("Creating new index...")
    index = load_and_index_documents()

# Create query engine
query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="compact"
)
```

---

## API Endpoints

```python
class QueryRequest(BaseModel):
    query: str
    card_name: Optional[str] = None
    top_k: int = 5

class QueryResponse(BaseModel):
    answer: str
    sources: list
    confidence: float

@app.post("/query", response_model=QueryResponse)
async def query_benefits(request: QueryRequest):
    """
    Query credit card benefits

    Example:
    POST /query
    {
        "query": "Which card has best grocery rewards?",
        "top_k": 5
    }
    """
    try:
        # Scope the query to one card if specified
        if request.card_name:
            query = f"For {request.card_name}: {request.query}"
        else:
            query = request.query

        # Query the index
        response = query_engine.query(query)

        # Extract sources
        sources = []
        for node in response.source_nodes:
            sources.append({
                "card_name": node.metadata.get("file_name", "Unknown"),
                "content": node.text[:200] + "...",
                "relevance_score": float(node.score)
            })

        # Use the top retrieval score as a rough confidence signal
        confidence = sources[0]["relevance_score"] if sources else 0.0

        return QueryResponse(
            answer=str(response),
            sources=sources,
            confidence=confidence
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
```
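With the server running (`uvicorn rewards_rag_server:app --port 7860`, matching the Dockerfile below), a quick smoke test of the endpoint; the local host and port here are assumptions:

```python
# smoke_test.py — exercise the /query endpoint of a locally running server
import requests

resp = requests.post(
    "http://localhost:7860/query",
    json={"query": "Which card has best grocery rewards?", "top_k": 5},
    timeout=60,
)
resp.raise_for_status()
body = resp.json()

print(body["answer"])
for source in body["sources"]:
    print(f"- {source['card_name']} (score: {source['relevance_score']:.2f})")
```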
""" final_response = Settings.llm.complete(synthesis_prompt) return { "comparison": str(final_response), "details": comparisons } --- ## Metadata Filtering def add_metadata_to_documents(): """Add rich metadata for filtering""" documents = SimpleDirectoryReader("./data").load_data() for doc in documents: # Extract card name from filename card_name = doc.metadata["file_name"].replace(".md", "") # Add metadata doc.metadata.update({ "card_name": card_name, "issuer": extract_issuer(card_name), "annual_fee": extract_annual_fee(doc.text), "category": extract_category(doc.text) }) return documents # Query with filters @app.post("/query_filtered") async def query_with_filters(request: dict): """ Query with metadata filters Example: POST /query_filtered { "query": "best travel card", "filters": { "issuer": "Chase", "annual_fee": {"$lte": 500} } } """ from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter # Build filters filters = MetadataFilters( filters=[ ExactMatchFilter(key="issuer", value=request["filters"]["issuer"]) ] ) # Query with filters query_engine = index.as_query_engine( similarity_top_k=5, filters=filters ) response = query_engine.query(request["query"]) return {"answer": str(response)} --- ## Hybrid Search (Keyword + Semantic) from llama_index.core.retrievers import VectorIndexRetriever, BM25Retriever from llama_index.core.query_engine import RetrieverQueryEngine def create_hybrid_retriever(): """Combine vector search + keyword search""" # Vector retriever vector_retriever = VectorIndexRetriever( index=index, similarity_top_k=10 ) # BM25 keyword retriever bm25_retriever = BM25Retriever.from_defaults( docstore=index.docstore, similarity_top_k=10 ) # Combine retrievers from llama_index.core.retrievers import QueryFusionRetriever hybrid_retriever = QueryFusionRetriever( retrievers=[vector_retriever, bm25_retriever], similarity_top_k=5, num_queries=1 ) return RetrieverQueryEngine(retriever=hybrid_retriever) --- ## Reranking for Better Results from llama_index.postprocessor.cohere_rerank import CohereRerank def create_reranking_query_engine(): """Add reranking for improved relevance""" # Cohere reranker reranker = CohereRerank( api_key=os.getenv("COHERE_API_KEY"), top_n=3 ) query_engine = index.as_query_engine( similarity_top_k=10, # Retrieve more candidates node_postprocessors=[reranker] # Rerank to top 3 ) return query_engine --- ## Evaluation & Metrics from llama_index.core.evaluation import ( RelevancyEvaluator, FaithfulnessEvaluator ) async def evaluate_rag_quality(): """Evaluate RAG system quality""" # Test queries test_queries = [ "Which card has best grocery rewards?", "Does Amex Gold work at Costco?", "What are Chase Sapphire Reserve travel benefits?" ] # Ground truth answers ground_truth = [ "Citi Custom Cash offers 5% on groceries...", "No, American Express is not accepted at Costco warehouses...", "Chase Sapphire Reserve includes Priority Pass..." 
---

## Hybrid Search (Keyword + Semantic)

```python
from llama_index.core.retrievers import VectorIndexRetriever, QueryFusionRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
# BM25Retriever lives in a separate package:
#   pip install llama-index-retrievers-bm25
from llama_index.retrievers.bm25 import BM25Retriever

def create_hybrid_retriever():
    """Combine vector search + keyword search."""
    # Vector retriever
    vector_retriever = VectorIndexRetriever(
        index=index,
        similarity_top_k=10
    )

    # BM25 keyword retriever
    # NOTE: with a Chroma-backed index the docstore may be empty (nodes live
    # in the vector store); pass the parsed nodes via nodes=... in that case.
    bm25_retriever = BM25Retriever.from_defaults(
        docstore=index.docstore,
        similarity_top_k=10
    )

    # Fuse the two result lists
    hybrid_retriever = QueryFusionRetriever(
        retrievers=[vector_retriever, bm25_retriever],
        similarity_top_k=5,
        num_queries=1  # disable extra query generation
    )

    return RetrieverQueryEngine(retriever=hybrid_retriever)
```

---

## Reranking for Better Results

```python
# Requires: pip install llama-index-postprocessor-cohere-rerank
from llama_index.postprocessor.cohere_rerank import CohereRerank

def create_reranking_query_engine():
    """Add reranking for improved relevance."""
    # Cohere reranker
    reranker = CohereRerank(
        api_key=os.getenv("COHERE_API_KEY"),
        top_n=3
    )

    query_engine = index.as_query_engine(
        similarity_top_k=10,            # Retrieve more candidates
        node_postprocessors=[reranker]  # Rerank down to top 3
    )

    return query_engine
```

---

## Evaluation & Metrics

```python
from llama_index.core.evaluation import (
    RelevancyEvaluator,
    FaithfulnessEvaluator
)

async def evaluate_rag_quality():
    """Evaluate RAG quality; run with asyncio.run(evaluate_rag_quality())."""
    # Test queries
    test_queries = [
        "Which card has best grocery rewards?",
        "Does Amex Gold work at Costco?",
        "What are Chase Sapphire Reserve travel benefits?"
    ]

    # Ground-truth answers, kept for manual spot-checks; the evaluators
    # below are reference-free and do not consume them
    ground_truth = [
        "Citi Custom Cash offers 5% on groceries...",
        "No, American Express is not accepted at Costco warehouses...",
        "Chase Sapphire Reserve includes Priority Pass..."
    ]

    # Evaluators
    relevancy_evaluator = RelevancyEvaluator(llm=Settings.llm)
    faithfulness_evaluator = FaithfulnessEvaluator(llm=Settings.llm)

    results = []
    for query, truth in zip(test_queries, ground_truth):
        response = query_engine.query(query)
        contexts = [node.text for node in response.source_nodes]

        # Evaluate relevancy (does the answer address the query?)
        relevancy_result = await relevancy_evaluator.aevaluate(
            query=query,
            response=str(response),
            contexts=contexts
        )

        # Evaluate faithfulness (is the answer grounded in the contexts?)
        faithfulness_result = await faithfulness_evaluator.aevaluate(
            query=query,
            response=str(response),
            contexts=contexts
        )

        results.append({
            "query": query,
            "expected": truth,
            "relevancy_score": relevancy_result.score,
            "faithfulness_score": faithfulness_result.score
        })

    return results
```

---

## Deployment

### 1. Build Docker Image

**File:** `Dockerfile`

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY . .

# Index documents at build time
# NOTE: OPENAI_API_KEY must be available during the build for embedding calls
RUN python -c "from rewards_rag_server import load_and_index_documents; load_and_index_documents()"

# Expose port
EXPOSE 7860

# Run server
CMD ["uvicorn", "rewards_rag_server:app", "--host", "0.0.0.0", "--port", "7860"]
```

### 2. Deploy to Hugging Face Spaces

```bash
# Create Space
huggingface-cli repo create rewardpilot-rewards-rag --type space --space_sdk docker

# Push files
git add .
git commit -m "Deploy RAG server"
git push
```

---

## Performance Optimization

### 1. Caching Embeddings

```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_embedding(text: str):
    """Cache embeddings for repeated queries."""
    return Settings.embed_model.get_text_embedding(text)
```

### 2. Batch Processing

```python
import asyncio

async def batch_query(queries: list):
    """Process multiple queries concurrently."""
    tasks = [query_engine.aquery(q) for q in queries]
    results = await asyncio.gather(*tasks)
    return results
```

### 3. Index Optimization

```python
# Use the smaller embedding model for speed
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",  # 1536 dims
    # vs text-embedding-3-large (3072 dims)
)

# Reduce chunk size for faster retrieval
Settings.chunk_size = 256  # vs 512
```

---

## Monitoring

```python
import time

# Requires: pip install prometheus-client
from prometheus_client import Counter, Histogram

# Metrics
query_counter = Counter('rag_queries_total', 'Total RAG queries')
query_duration = Histogram('rag_query_duration_seconds', 'RAG query duration')

# Registered on a separate path so it does not clash with the /query
# route defined earlier
@app.post("/query_monitored")
async def query_with_monitoring(request: QueryRequest):
    query_counter.inc()

    start_time = time.time()
    response = query_engine.query(request.query)
    duration = time.time() - start_time

    query_duration.observe(duration)

    return {"answer": str(response)}
```

---

**Related Documentation:**

- [MCP Server Implementation](./mcp_architecture.md)
- [Modal Deployment Guide](./modal_deployment.md)
- [Agent Reasoning Flow](./agent_reasoning.md)