Lindow committed · Commit 7602502 · 0 Parent(s)

initial commit

.agent/docs/common_core_spec.md ADDED
@@ -0,0 +1,136 @@
1
+ This is the authoritative **Common Core Data Specification**. It contains the exact source locations, data schemas, field definitions, and the specific processing logic required to interpret the hierarchy correctly.
2
+
3
+ **Use this document as the source of truth for `tools/build_data.py`.**
4
+
5
+ ---
6
+
7
+ # Data Specification: Common Core Standards
8
+
9
+ **Authority:** Common Standards Project (GitHub)
10
+ **License:** Creative Commons Attribution 4.0 (CC BY 4.0)
11
+ **Format:** JSON (Flat List of Objects)
12
+
13
+ ## 1. Source Locations
14
+
15
+ We are using the "Clean Data" export from the Common Standards Project. These files are static JSON dumps where each file represents a full Subject.
16
+
17
+ | Subject | Direct Download URL |
18
+ | :----------------- | :--------------------------------------------------------------------------------------------------------------------------- |
19
+ | **Mathematics** | `https://raw.githubusercontent.com/commoncurriculum/common-standards-project/master/data/clean-data/CCSSI/Mathematics.json` |
20
+ | **ELA / Literacy** | `https://raw.githubusercontent.com/commoncurriculum/common-standards-project/master/data/clean-data/CCSSI/ELA-Literacy.json` |
21
+
22
+ ---
23
+
24
+ ## 2. The Data Structure (Glossary)
25
+
26
+ The JSON file contains a root object. The actual standards are located in the `standards` dictionary, keyed by their internal GUID.
27
+
28
+ ### **Root Object**
29
+
30
+ ```json
31
+ {
32
+ "subject": "Mathematics",
33
+ "standards": {
34
+ "6051566A...": { ... }, // Standard Object
35
+ "5E367098...": { ... } // Standard Object
36
+ }
37
+ }
38
+ ```
39
+
40
+ ### **Standard Object (The Item)**
41
+
42
+ Each item represents a node in the curriculum tree. It could be a broad **Domain**, a grouping **Cluster**, or a specific **Standard**.
43
+
44
+ | Field Name | Type | Definition & Usage |
45
+ | :---------------------- | :-------------- | :------------------------------------------------------------------------------------------------------------------------------------ |
46
+ | **`id`** | `String (GUID)` | The internal unique identifier. Used for lookups in `ancestorIds`. |
47
+ | **`statementNotation`** | `String` | **The Display Code.** (e.g., `CCSS.Math.Content.1.OA.A.1`). This is what teachers recognize. Use this for the UI. |
48
+ | **`description`** | `String` | The text content. **Warning:** For standards, this text is often incomplete without its parent context (see Hierarchy below). |
49
+ | **`statementLabel`** | `String` | The hierarchy type. Critical values: <br>• `Domain` (Highest) <br>• `Cluster` (Grouping) <br>• `Standard` (The actionable item) |
50
+ | **`gradeLevels`** | `Array[String]` | Scope of the standard. <br>• Format: `["01", "02"]` (Grades 1 & 2), `["K"]` (Kindergarten), `["09", "10", "11", "12"]` (High School). |
51
+ | **`ancestorIds`** | `Array[GUID]` | **CRITICAL.** An ordered list of parent IDs (from root to immediate parent). You must resolve these to build the full context. |
52
+
53
+ ---
54
+
55
+ ## 3. Hierarchy & Context (The "Interpretation" Problem)
56
+
57
+ **The Problem:**
58
+ A standard's description often relies on its parent "Cluster" for meaning.
59
+
60
+ - _Cluster Text:_ "Understand the place value system."
61
+ - _Standard Text:_ "Recognize that in a multi-digit number, a digit in one place represents 10 times as much..."
62
+
63
+ If you only embed the _Standard Text_, the vector will miss the concept of "Place Value."
64
+
65
+ **The Solution (Processing Logic):**
66
+ To generate the **Search String** for embedding, you must concatenate the hierarchy.
67
+
68
+ 1. **Domain:** The broad category (e.g., "Number and Operations in Base Ten").
69
+ 2. **Cluster:** The specific topic (e.g., "Generalize place value understanding").
70
+ 3. **Standard:** The task.
71
+
72
+ **Formula:**
73
+
74
+ ```text
75
+ "{Subject} {Grade}: {Domain Text} - {Cluster Text} - {Standard Text}"
76
+ ```
77
+
78
+ ---
79
+
80
+ ## 4. Build Pipeline Specification (`tools/build_data.py`)
81
+
82
+ This specific logic ensures we extract meaningful vectors.
83
+
84
+ ### **Step A: Ingestion**
85
+
86
+ 1. Download both JSON files.
87
+ 2. Merge the `standards` dictionaries into a single **Lookup Map** (Memory: `Map<GUID, Object>`).
88
+
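+ A minimal ingestion sketch, assuming the two "Clean Data" URLs from Section 1 and the root-object shape from Section 2 (`build_lookup_map` is an illustrative helper name, not part of the spec):
+
+ ```python
+ import requests
+
+ URLS = [
+     "https://raw.githubusercontent.com/commoncurriculum/common-standards-project/master/data/clean-data/CCSSI/Mathematics.json",
+     "https://raw.githubusercontent.com/commoncurriculum/common-standards-project/master/data/clean-data/CCSSI/ELA-Literacy.json",
+ ]
+
+ def build_lookup_map() -> dict[str, dict]:
+     """Download both subject files and merge their 'standards' dicts into one GUID-keyed map."""
+     lookup: dict[str, dict] = {}
+     for url in URLS:
+         root = requests.get(url, timeout=60).json()
+         lookup.update(root["standards"])  # keyed by internal GUID
+     return lookup
+ ```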
89
+ ### **Step B: Iteration & Filtering**
90
+
91
+ Iterate through the Lookup Map.
92
+ **Filter Rule:**
93
+
94
+ - **KEEP** if `statementLabel` equals `"Standard"`.
95
+ - **DISCARD** if `statementLabel` is `"Domain"`, `"Cluster"`, or `"Component"`. (We only index the actionable leaves).
96
+
97
+ ### **Step C: Context Resolution (The "Breadcrumb" Loop)**
98
+
99
+ For every kept Standard:
100
+
101
+ 1. Initialize `context_text = ""`
102
+ 2. Iterate through `ancestorIds`:
103
+ - Use the ID to look up the Parent Object in the **Lookup Map**.
104
+ - Append `Parent.description` to `context_text`.
105
+ 3. Construct the final string:
106
+ - `full_text = f"{context_text} {current_standard.description}"`
107
+ 4. **Vectorize `full_text`**.
108
+
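+ A minimal sketch of Steps B and C combined, assuming the Lookup Map from Step A and the field names defined in Section 2; the function names are illustrative, and the search string follows the formula from Section 3:
+
+ ```python
+ def iter_standards(lookup: dict[str, dict]):
+     """Step B filter: keep only the actionable 'Standard' leaves."""
+     for item in lookup.values():
+         if item.get("statementLabel") == "Standard":
+             yield item
+
+ def build_search_string(standard: dict, lookup: dict[str, dict], subject: str) -> str:
+     """Step C: resolve the ancestor breadcrumb and prepend it to the standard's own text."""
+     parts = []
+     for ancestor_id in standard.get("ancestorIds", []):
+         parent = lookup.get(ancestor_id)
+         if parent and parent.get("description"):
+             parts.append(parent["description"])    # Domain, then Cluster
+     parts.append(standard.get("description", ""))  # the Standard itself
+     grade = (standard.get("gradeLevels") or ["NA"])[0]
+     return f"{subject} {grade}: " + " - ".join(parts)
+ ```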
109
+ ### **Step D: Output Schema (`data/standards.json`)**
110
+
111
+ The clean, flat JSON file you save for the App to load must look like this:
112
+
113
+ ```json
114
+ [
115
+ {
116
+ "id": "CCSS.Math.Content.1.OA.A.1", // From 'statementNotation'
117
+ "guid": "6051566A...", // From 'id'
118
+ "grade": "01", // From 'gradeLevels[0]'
119
+ "subject": "Mathematics", // From 'subject'
120
+ "description": "Use addition and subtraction within 20 to solve word problems...", // From 'description'
121
+ "full_context": "Operations and Algebraic Thinking - Represent and solve problems... - Use addition and..." // The text we used for embedding
122
+ }
123
+ ]
124
+ ```
125
+
126
+ ---
127
+
128
+ ## 5. Summary of Valid `gradeLevels`
129
+
130
+ When processing, normalize these strings if necessary, but typically they appear as:
131
+
132
+ - `K` (Kindergarten)
133
+ - `01` - `08` (Grades 1-8)
134
+ - `09-12` (High School generic)
135
+
136
+ _Note: If `gradeLevels` is an array `["09", "10", "11", "12"]`, you can display it as "High School" or "Grades 9-12"._
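+
+ A small display-normalization sketch for the note above (`display_grade` is an illustrative helper, not part of the data):
+
+ ```python
+ def display_grade(grade_levels: list[str]) -> str:
+     """Turn a gradeLevels array into a human-readable label."""
+     if not grade_levels:
+         return "Unknown"
+     if grade_levels == ["K"]:
+         return "Kindergarten"
+     if set(grade_levels) == {"09", "10", "11", "12"}:
+         return "High School (Grades 9-12)"
+     return ", ".join(f"Grade {g.lstrip('0')}" for g in grade_levels)
+ ```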
.agent/docs/hackathon_criteria.md ADDED
@@ -0,0 +1,97 @@
1
+ ### **Hackathon Track 1: Building MCP**
2
+
3
+ **Goal:** Build **Model Context Protocol (MCP) servers** that extend Large Language Model (LLM) capabilities.
4
+ **Focus:** Creating tools for data analysis, API integrations, specialized workflows, or novel capabilities that make AI agents more powerful.
5
+
6
+ [Hugging Face Link](https://huggingface.co/MCP-1st-Birthday)
7
+
8
+ ---
9
+
10
+ ### **1. Categories & Submission Tags**
11
+
12
+ You must classify your entry into **one** of the three categories below by adding the specific tag to your Hugging Face Space `README.md`.
13
+
14
+ - **Enterprise MCP Servers**
15
+ - _Focus:_ Business/corporate tools, workflows, data analysis.
16
+ - _Required Tag:_ `building-mcp-track-enterprise`
17
+ - **Consumer MCP Servers**
18
+ - _Focus:_ Personal utility, lifestyle, daily tasks.
19
+ - _Required Tag:_ `building-mcp-track-consumer`
20
+ - **Creative MCP Servers**
21
+ - _Focus:_ Art, media, novelty, unique interactions.
22
+ - _Required Tag:_ `building-mcp-track-creative`
23
+
24
+ ---
25
+
26
+ ### **2. Mandatory Registration Steps**
27
+
28
+ Before submitting, ensure you have completed the administrative requirements:
29
+
30
+ 1. **Join the Organization:** Click "Request to join this org" on the [Hackathon Hugging Face page](https://huggingface.co/MCP-1st-Birthday).
31
+ 2. **Register:** Complete the official registration form (linked on the hackathon page).
32
+ 3. **Team Members:** If working in a team (2-5 people), **all** members must join the organization and register individually.
33
+
34
+ ---
35
+
36
+ ### **3. Technical Requirements (Track 1 Specific)**
37
+
38
+ Your project must meet these technical standards to be eligible:
39
+
40
+ - **Functioning MCP Server:** The core of your project must be a working MCP server.
41
+ - **Integration:** It must integrate with an MCP client (e.g., Claude Desktop, Cursor, or similar).
42
+ - **Platform:** It must be published as a **Hugging Face Space**.
43
+ - _Note:_ It can be a Gradio app, since "Any Gradio app can be an MCP server."
44
+
45
+ ---
46
+
47
+ ### **4. Submission Deliverables**
48
+
49
+ Your final submission must include the following elements by the deadline:
50
+
51
+ - **Hugging Face Space:** The actual codebase hosted in the event organization.
52
+ - **README.md Metadata:**
53
+ - Include the **Track Tag** (see Section 1).
54
+ - (If a team) Include Hugging Face usernames of all team members.
55
+ - **Documentation:** Clear explanation of the tool’s purpose, capabilities, and usage instructions in the README.
56
+ - **Demo Video:**
57
+ - **Length:** 1–5 minutes.
58
+ - **Content:** Must show the MCP server **in action**, specifically demonstrating its integration with an MCP client (like Claude Desktop).
59
+ - **Social Media Proof:** A link to a post (X/Twitter, LinkedIn, etc.) about your project. This link must be included in your submission (likely in the README or submission form).
60
+
61
+ ---
62
+
63
+ ### **5. Judging Criteria**
64
+
65
+ Judges will evaluate your project based on:
66
+
67
+ 1. **Completeness:** Is the Space, video, documentation, and social link all present?
68
+ 2. **Functionality:** Does it work? Does it effectively use relevant functionalities (Gradio 6, MCPs)?
69
+ 3. **Real-world Impact:** Is the tool useful? Does it have potential for real-world application?
70
+ 4. **Creativity:** Is the idea or implementation innovative/original?
71
+ 5. **Design/UI-UX:** Is it polished, intuitive, and easy to use?
72
+ 6. **Documentation:** Is the implementation well-communicated in the README/video?
73
+
74
+ ---
75
+
76
+ ### **6. Timeline**
77
+
78
+ - **Hackathon Period:** November 14 – November 30, 2025.
79
+ - **Submission Deadline:** **November 30, 2025, at 11:59 PM UTC**.
80
+ - **Judging Period:** December 1 – December 14, 2025.
81
+ - **Winners Announced:** December 15, 2025.
82
+
83
+ ---
84
+
85
+ ### **7. Prizes (Track 1)**
86
+
87
+ - **Best Overall:** $1,500 USD + $1,250 Claude API credits.
88
+ - **Best Enterprise MCP Server:** $750 Claude API credits.
89
+ - **Best Consumer MCP Server:** $750 Claude API credits.
90
+ - **Best Creative MCP Server:** $750 Claude API credits.
91
+ - _Note:_ There are additional sponsor prizes (e.g., from Google Gemini, Modal, Blaxel) available if you use their specific tools/APIs.
92
+
93
+ ### **Rules Summary**
94
+
95
+ - **Original Work:** Must be created during the hackathon period (Nov 14–30).
96
+ - **Open Source:** Open-source licenses (MIT, Apache 2.0) are encouraged.
97
+ - **Team Size:** Solo or 2–5 members.
.agent/docs/user_stories.md ADDED
@@ -0,0 +1,81 @@
1
+ This **User Stories & Acceptance Criteria** document focuses on the professional and intentional nature of both homeschooling parents and classroom teachers, ensuring the tone reflects the serious work of education management.
2
+
3
+ ---
4
+
5
+ # EduMatch MCP: User Stories (Sprint 1)
6
+
7
+ **Project:** EduMatch MCP
8
+ **Sprint Focus:** Core Functionality (Search & Lookup)
9
+ **Target Personas:**
10
+
11
+ 1. **The Intentional Parent:** A homeschool educator seeking to formalize learning experiences and align daily life with educational benchmarks.
12
+ 2. **The Adaptive Teacher:** A classroom educator looking to tailor curriculum to student interests while maintaining strict adherence to state standards.
13
+
14
+ ---
15
+
16
+ ## Story 1: The "Retroactive Alignment" (Experience to Record)
17
+
18
+ **As a** homeschool parent or teacher,
19
+ **I want to** describe a completed activity, field trip, or real-world experience to the AI,
20
+ **So that** I can identify which Common Core standards were addressed and articulate the educational value in my official logs or lesson plans.
21
+
22
+ ### Context
23
+
24
+ Educators often seize "teachable moments" (e.g., a trip to a science center, a gardening project). They need to translate these rich, unstructured experiences into the rigid language of educational bureaucracy for reporting purposes.
25
+
26
+ ### Acceptance Criteria
27
+
28
+ 1. **Input:** The user provides a natural language narrative (e.g., "We visited the planetarium, looked at constellations, and calculated the distance between stars.").
29
+ 2. **System Action:** The system queries the `find_relevant_standards` tool using the narrative text.
30
+ 3. **Output:** The system returns a list of relevant standards (with ID and text) and a generated reasoning explaining _how_ the activity met that standard.
31
+ 4. **Tone Check:** The system treats the activity as a valid educational event, not an "accident," and helps the user professionalize their documentation.
32
+
33
+ **Example Prompt:**
34
+
35
+ > "I took my class to the Natural History Museum today. We focused on the timeline of the Jurassic period and compared the sizes of different fossils. Can you find the Common Core standards this visit supported so I can add them to my weekly report?"
36
+
37
+ ---
38
+
39
+ ## Story 2: The "Interest-Based Planner" (Proactive Integration)
40
+
41
+ **As an** educator looking to engage a student,
42
+ **I want to** input a specific student interest (e.g., Minecraft, Baking, Robotics) alongside a target grade level,
43
+ **So that** I can discover standards that can be taught _through_ that activity.
44
+
45
+ ### Context
46
+
47
+ Students learn best when engaged. Teachers and parents often want to build lessons around a child's obsession but need to ensure they aren't skipping required learning targets. This bridges the gap between "Fun" and "Required."
48
+
49
+ ### Acceptance Criteria
50
+
51
+ 1. **Input:** The user provides a topic and a constraint (e.g., "Baking cookies, 3rd Grade Math").
52
+ 2. **System Action:** The system queries `find_relevant_standards` with a combined vector of the activity and the grade level context.
53
+ 3. **Output:** The system returns standards that are semantically viable (e.g., standards about measurement, volume, or fractions for baking).
54
+ 4. **Reasoning:** The generated explanation explicitly suggests the connection (e.g., "This standard applies because baking requires understanding fractions to measure ingredients.").
55
+
56
+ **Example Prompt:**
57
+
58
+ > "My 3rd grader is obsessed with baking. I want to build a math unit around doubling recipes and measuring ingredients. Which standards can we cover with this project?"
59
+
60
+ ---
61
+
62
+ ## Story 3: The "Jargon Decoder" (Curriculum Clarification)
63
+
64
+ **As a** parent or teacher reviewing administrative documents,
65
+ **I want to** ask about a specific standard code (e.g., `CCSS.ELA-LITERACY.RL.4.3`),
66
+ **So that** I can retrieve the full text and hierarchy to understand exactly what is required of the student.
67
+
68
+ ### Context
69
+
70
+ Educational documentation is full of codes that are opaque to parents and hard to memorize for teachers. Users need a quick, authoritative lookup to verify requirements without leaving the chat interface.
71
+
72
+ ### Acceptance Criteria
73
+
74
+ 1. **Input:** The user provides a specific Standard ID/Code.
75
+ 2. **System Action:** The system identifies the code format and calls `get_standard_details`.
76
+ 3. **Output:** The system returns the full object, including the parent Domain and Cluster text, allowing the LLM to explain the standard in plain English.
77
+ 4. **Error Handling:** If the code doesn't exist, the system returns a polite failure message or suggests the user try a keyword search instead.
78
+
79
+ **Example Prompt:**
80
+
81
+ > "The state curriculum guide lists '1.OA.B.3' as a prerequisite for next week. What is that standard, and what does it look like in practice?"
.agent/specs/.gitignore ADDED
@@ -0,0 +1,2 @@
1
+ *draft.md
2
+ **/__*
.agent/specs/000_data_cli/spec.md ADDED
@@ -0,0 +1,1347 @@
1
+ # Data Ingestion CLI Specification
2
+
3
+ **Tool Name:** Common Core MCP Data CLI
4
+ **Framework:** `typer` (Python)
5
+ **Purpose:** To explore, discover, and download official standard sets (e.g., Utah Math, Wyoming Science) from the Common Standards Project API for local processing.
6
+ **Scope:** Development Tool (Dev Dependency). Not deployed to production.
7
+
8
+ **Architecture:** Clean separation between CLI interface (`tools/cli.py`) and business logic (`tools/api_client.py`, `tools/data_processor.py`). The CLI file contains only command definitions and invokes reusable functions.
9
+
10
+ **Initial Proof of Concept:** Grade 3 Mathematics for Utah, Wyoming, and Idaho.
11
+
12
+ ---
13
+
14
+ ## 1. Environment & Setup
15
+
16
+ ### 1.1 Prerequisites
17
+
18
+ To use this CLI, you must register for an API key.
19
+
20
+ 1. Go to [Common Standards Project Developers](https://commonstandardsproject.com/developers).
21
+ 2. Create an account and generate an **API Key**.
22
+
23
+ ### 1.2 Configuration (`.env`)
24
+
25
+ The CLI must load sensitive credentials from a local `.env` file.
26
+
27
+ ```bash
28
+ # .env file in project root
29
+ CSP_API_KEY=your_generated_api_key_here
30
+ ```
31
+
32
+ ### 1.3 Dependencies (`pyproject.toml`)
33
+
34
+ Add these to the existing `dependencies` array in the `[project]` table:
35
+
36
+ ```toml
37
+ # Under [project], inside the dependencies = [ ... ] array:
38
+ # ... existing dependencies ...
39
+ "typer", # CLI framework
40
+ "requests", # HTTP client for API calls
41
+ "rich", # Pretty printing tables in terminal
42
+ "loguru", # Structured logging
43
+ ```
44
+
45
+ **Note:** `python-dotenv` is already in the project dependencies.
46
+
47
+ ### 1.4 CLI Invocation
48
+
49
+ The CLI is invoked directly with Python (not via `uv`):
50
+
51
+ ```bash
52
+ python tools/cli.py --help
53
+ ```
54
+
55
+ ---
56
+
57
+ ## 2. API Reference (Internal)
58
+
59
+ The CLI acts as a wrapper around these specific Common Standards Project API endpoints.
60
+
61
+ **Base URL:** `https://api.commonstandardsproject.com/api/v1`
62
+ **Authentication:** Header `Api-Key: <YOUR_KEY>`
63
+
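+ A minimal request sketch (assumes the key is loaded from `.env` as `CSP_API_KEY`; the full client with retries lives in Section 6.1):
+
+ ```python
+ import os
+
+ import requests
+ from dotenv import load_dotenv
+
+ load_dotenv()
+ resp = requests.get(
+     "https://api.commonstandardsproject.com/api/v1/jurisdictions",
+     headers={"Api-Key": os.environ["CSP_API_KEY"]},
+     timeout=30,
+ )
+ resp.raise_for_status()
+ print(resp.json()["data"][:3])  # first few jurisdictions: id, title, type
+ ```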
64
+ ### Endpoint A: List Jurisdictions
65
+
66
+ - **URL:** `/jurisdictions`
67
+ - **Purpose:** Find the IDs for "Utah", "Wyoming", and "Idaho".
68
+ - **Response Shape:**
69
+ ```json
70
+ {
71
+ "data": [
72
+ { "id": "49FCDFBD...", "title": "Utah", "type": "state" },
73
+ ...
74
+ ]
75
+ }
76
+ ```
77
+
78
+ ### Endpoint B: List Standard Sets
79
+
80
+ - **URL:** `/standard_sets`
81
+ - **Query Params:** `jurisdictionId=<ID>`
82
+ - **Purpose:** Find "Utah Core Standards - Mathematics - Grade 3".
83
+ - **Response Shape:**
84
+ ```json
85
+ {
86
+ "data": [
87
+ {
88
+ "id": "SOME_SET_ID",
89
+ "title": "Utah Core Standards - Mathematics",
90
+ "subject": "Mathematics",
91
+ "educationLevels": ["03"]
92
+ }
93
+ ]
94
+ }
95
+ ```
96
+
97
+ ### Endpoint C: Get Standard Set (Download)
98
+
99
+ - **URL:** `/standard_sets/{standard_set_id}`
100
+ - **Purpose:** Download the full hierarchy for a specific set.
101
+ - **Response Shape:** Returns a complex object containing the full tree (Standards, Clusters, etc.).
102
+
103
+ ---
104
+
105
+ ## 3. CLI Architecture
106
+
107
+ ### 3.1 File Structure
108
+
109
+ The CLI is organized with clean separation of concerns:
110
+
111
+ - **`tools/cli.py`**: CLI command definitions only (uses Typer). Imports and invokes functions from other modules.
112
+ - **`tools/api_client.py`**: Business logic for interacting with Common Standards Project API. Includes retry mechanisms, rate limiting, and error handling.
113
+ - **`tools/data_processor.py`**: Business logic for processing raw API data into flattened format with embeddings.
114
+ - **`tools/data_manager.py`**: Business logic for managing local data files (listing, status tracking, cleanup).
115
+
116
+ ### 3.2 Command Structure
117
+
118
+ ```bash
119
+ # View help
120
+ python tools/cli.py --help
121
+
122
+ # Explore API
123
+ python tools/cli.py jurisdictions --search "Utah"
124
+ python tools/cli.py sets <JURISDICTION_ID>
125
+
126
+ # Download raw data
127
+ python tools/cli.py download <SET_ID>
128
+
129
+ # View local data
130
+ python tools/cli.py list
131
+
132
+ # Process raw data
133
+ python tools/cli.py process <SET_ID>
134
+
135
+ # Check processing status
136
+ python tools/cli.py status
137
+ ```
138
+
139
+ ---
140
+
141
+ ## 4. Command Specifications
142
+
143
+ ### Command 1: `jurisdictions`
144
+
145
+ Allows the developer to find the internal IDs for states/organizations.
146
+
147
+ - **Arguments:** None.
148
+ - **Options:**
149
+ - `--search` / `-s` (Optional): Filter output by name (case-insensitive).
150
+ - **Business Logic:** Implemented in `api_client.get_jurisdictions(search_term: str | None) -> list[dict]`
151
+ - **Display Logic:**
152
+ 1. Call `api_client.get_jurisdictions()`.
153
+ 2. Print table using `rich.table.Table`: `ID | Title | Type`.
154
+ 3. Log operation with loguru.
155
+
156
+ ### Command 2: `sets`
157
+
158
+ Allows the developer to see what standards are available for a specific state.
159
+
160
+ - **Arguments:**
161
+ - `jurisdiction_id` (Required): The ID found in the previous command.
162
+ - **Business Logic:** Implemented in `api_client.get_standard_sets(jurisdiction_id: str) -> list[dict]`
163
+ - **Display Logic:**
164
+ 1. Call `api_client.get_standard_sets(jurisdiction_id)`.
165
+ 2. Print table: `Set ID | Subject | Title | Grade Levels`.
166
+ 3. Log operation.
167
+
168
+ ### Command 3: `download`
169
+
170
+ Downloads the official JSON definition for a standard set and saves it locally in an organized directory structure.
171
+
172
+ - **Arguments:**
173
+ - `set_id` (Required): The ID of the standard set (e.g., Utah Math).
174
+ - **Options:** None (output path is automatically determined based on metadata).
175
+ - **Business Logic:** Implemented in `api_client.download_standard_set(set_id: str) -> dict` and `data_manager.save_raw_data(set_id: str, data: dict, metadata: dict) -> Path`
176
+ - **Workflow:**
177
+ 1. Call `api_client.download_standard_set(set_id)` (includes retry logic).
178
+ 2. Extract metadata: jurisdiction, subject, grade levels.
179
+ 3. Call `data_manager.save_raw_data()` to save with auto-generated path.
180
+ 4. Print success message with file path.
181
+ 5. Log download operation.
182
+
183
+ ### Command 4: `list`
184
+
185
+ Shows all downloaded raw datasets with their metadata.
186
+
187
+ - **Arguments:** None.
188
+ - **Business Logic:** Implemented in `data_manager.list_downloaded_data() -> list[dict]`
189
+ - **Display Logic:**
190
+ 1. Call `data_manager.list_downloaded_data()`.
191
+ 2. Print table: `Set ID | Subject | Title | Grade Levels | Downloaded | Processed`.
192
+ 3. Show total count.
193
+
194
+ ### Command 5: `process`
195
+
196
+ Processes a raw downloaded dataset into flattened format with embeddings.
197
+
198
+ - **Arguments:**
199
+ - `set_id` (Required): The ID of the standard set to process.
200
+ - **Business Logic:** Implemented in `data_processor.process_standard_set(set_id: str) -> tuple[Path, Path]`
201
+ - **Workflow:**
202
+ 1. Verify raw data exists for set_id.
203
+ 2. Call `data_processor.process_standard_set(set_id)`.
204
+ 3. Generate flattened standards.json.
205
+ 4. Generate embeddings.npy.
206
+ 5. Save to `data/processed/<jurisdiction>/<subject>/`.
207
+ 6. Update processing status metadata.
208
+ 7. Print success message with output paths.
209
+ 8. Log processing operation.
210
+
211
+ ### Command 6: `status`
212
+
213
+ Shows processing status for all datasets (processed vs unprocessed).
214
+
215
+ - **Arguments:** None.
216
+ - **Business Logic:** Implemented in `data_manager.get_processing_status() -> dict`
217
+ - **Display Logic:**
218
+ 1. Call `data_manager.get_processing_status()`.
219
+ 2. Show summary: Total Downloaded, Processed, Unprocessed.
220
+ 3. List unprocessed datasets.
221
+ 4. List processed datasets with output paths.
222
+
223
+ ---
224
+
225
+ ## 5. Data Directory Structure
226
+
227
+ ### 5.1 Raw Data Organization
228
+
229
+ Downloaded raw data is organized by jurisdiction and stored locally only (not in git):
230
+
231
+ ```
232
+ data/raw/
233
+ ├── <jurisdiction_id>/
234
+ │   ├── <set_id>/
235
+ │   │   ├── data.json        # Raw API response
236
+ │   │   └── metadata.json    # Download metadata
237
+ │   └── <set_id>/
238
+ │       ├── data.json
239
+ │       └── metadata.json
240
+ ```
241
+
242
+ **Example:**
243
+
244
+ ```
245
+ data/raw/
246
+ ├── 49FCDFBD.../          # Utah
247
+ │   ├── ABC123.../        # Utah Math Grade 3
248
+ │   │   ├── data.json
249
+ │   │   └── metadata.json
250
+ │   └── DEF456.../        # Utah Science Grade 5
251
+ │       ├── data.json
252
+ │       └── metadata.json
253
+ └── 82ABCDEF.../          # Wyoming
254
+     └── GHI789.../        # Wyoming Math Grade 3
255
+         ├── data.json
256
+         └── metadata.json
257
+ ```
258
+
259
+ ### 5.2 Processed Data Organization
260
+
261
+ Processed data (flattened standards with embeddings) is organized by logical grouping:
262
+
263
+ ```
264
+ data/processed/
265
+ ├── <jurisdiction_name>/
266
+ │   ├── <subject>/
267
+ │   │   ├── <grade_level>/
268
+ │   │   │   ├── standards.json    # Flattened standards
269
+ │   │   │   └── embeddings.npy    # Vector embeddings
270
+ ```
271
+
272
+ **Example (Initial Proof of Concept):**
273
+
274
+ ```
275
+ data/processed/
276
+ ├── utah/
277
+ │   └── mathematics/
278
+ │       └── grade_03/
279
+ │           ├── standards.json
280
+ │           └── embeddings.npy
281
+ ├── wyoming/
282
+ │   └── mathematics/
283
+ │       └── grade_03/
284
+ │           ├── standards.json
285
+ │           └── embeddings.npy
286
+ └── idaho/
287
+     └── mathematics/
288
+         └── grade_03/
289
+             ├── standards.json
290
+             └── embeddings.npy
291
+ ```
292
+
293
+ **Git Tracking:**
294
+
295
+ - `data/raw/` is added to `.gitignore` (local only)
296
+ - `data/processed/` for example datasets (Utah, Wyoming, Idaho Math Grade 3) is committed to git
297
+ - For production expansion, processed data would move to a vector database
298
+
299
+ ### 5.3 Metadata Schema
300
+
301
+ The `metadata.json` file stored with each raw dataset:
302
+
303
+ ```json
304
+ {
305
+ "set_id": "ABC123...",
306
+ "title": "Utah Core Standards - Mathematics - Grade 3",
307
+ "jurisdiction": {
308
+ "id": "49FCDFBD...",
309
+ "title": "Utah"
310
+ },
311
+ "subject": "Mathematics",
312
+ "grade_levels": ["03"],
313
+ "download_date": "2024-11-25T10:30:00Z",
314
+ "download_url": "https://api.commonstandardsproject.com/api/v1/standard_sets/ABC123...",
315
+ "processed": false,
316
+ "processed_date": null,
317
+ "processed_output": null
318
+ }
319
+ ```
320
+
321
+ ---
322
+
323
+ ## 6. Implementation Guide
324
+
325
+ The implementation follows clean architecture principles with separated concerns.
326
+
327
+ ### 6.1 API Client Module (`tools/api_client.py`)
328
+
329
+ Handles all interactions with the Common Standards Project API, including retry logic, rate limiting, and error handling.
330
+
331
+ ```python
332
+ """API client for Common Standards Project with retry logic and rate limiting."""
333
+ from __future__ import annotations
334
+
335
+ import os
336
+ import time
337
+ from typing import Any
338
+
339
+ import requests
340
+ from dotenv import load_dotenv
341
+ from loguru import logger
342
+
343
+ load_dotenv()
344
+
345
+ API_KEY = os.getenv("CSP_API_KEY")
346
+ BASE_URL = "https://api.commonstandardsproject.com/api/v1"
347
+
348
+ # Rate limiting: Max requests per minute
349
+ MAX_REQUESTS_PER_MINUTE = 60
350
+ _request_timestamps: list[float] = []
351
+
352
+
353
+ class APIError(Exception):
354
+ """Raised when API request fails after all retries."""
355
+ pass
356
+
357
+
358
+ def _get_headers() -> dict[str, str]:
359
+ """Get authentication headers for API requests."""
360
+ if not API_KEY:
361
+ logger.error("CSP_API_KEY not found in .env file")
362
+ raise ValueError("CSP_API_KEY environment variable not set")
363
+ return {"Api-Key": API_KEY}
364
+
365
+
366
+ def _enforce_rate_limit() -> None:
367
+ """Enforce rate limiting by tracking request timestamps."""
368
+ global _request_timestamps
369
+ now = time.time()
370
+
371
+ # Remove timestamps older than 1 minute
372
+ _request_timestamps = [ts for ts in _request_timestamps if now - ts < 60]
373
+
374
+ # If at limit, wait
375
+ if len(_request_timestamps) >= MAX_REQUESTS_PER_MINUTE:
376
+ sleep_time = 60 - (now - _request_timestamps[0])
377
+ logger.warning(f"Rate limit reached. Waiting {sleep_time:.1f} seconds...")
378
+ time.sleep(sleep_time)
379
+ _request_timestamps = []
380
+
381
+ _request_timestamps.append(now)
382
+
383
+
384
+ def _make_request(
385
+ endpoint: str,
386
+ params: dict[str, Any] | None = None,
387
+ max_retries: int = 3
388
+ ) -> dict[str, Any]:
389
+ """
390
+ Make API request with exponential backoff retry logic.
391
+
392
+ Args:
393
+ endpoint: API endpoint path (e.g., "/jurisdictions")
394
+ params: Query parameters
395
+ max_retries: Maximum number of retry attempts
396
+
397
+ Returns:
398
+ Parsed JSON response
399
+
400
+ Raises:
401
+ APIError: After all retries exhausted or on fatal errors
402
+ """
403
+ url = f"{BASE_URL}{endpoint}"
404
+ headers = _get_headers()
405
+
406
+ for attempt in range(max_retries):
407
+ try:
408
+ _enforce_rate_limit()
409
+
410
+ logger.debug(f"API request: {endpoint} (attempt {attempt + 1}/{max_retries})")
411
+ response = requests.get(url, headers=headers, params=params, timeout=30)
412
+
413
+ # Handle specific status codes
414
+ if response.status_code == 401:
415
+ logger.error("Invalid API key (401 Unauthorized)")
416
+ raise APIError("Authentication failed. Check your CSP_API_KEY in .env")
417
+
418
+ if response.status_code == 404:
419
+ logger.error(f"Resource not found (404): {endpoint}")
420
+ raise APIError(f"Resource not found: {endpoint}")
421
+
422
+ if response.status_code == 429:
423
+ # Rate limited by server
424
+ retry_after = int(response.headers.get("Retry-After", 60))
425
+ logger.warning(f"Server rate limit hit. Waiting {retry_after} seconds...")
426
+ time.sleep(retry_after)
427
+ continue
428
+
429
+ response.raise_for_status()
430
+ logger.info(f"API request successful: {endpoint}")
431
+ return response.json()
432
+
433
+ except requests.exceptions.Timeout:
434
+ wait_time = 2 ** attempt # Exponential backoff: 1s, 2s, 4s
435
+ logger.warning(f"Request timeout. Retrying in {wait_time}s...")
436
+ if attempt < max_retries - 1:
437
+ time.sleep(wait_time)
438
+ else:
439
+ raise APIError(f"Request timeout after {max_retries} attempts")
440
+
441
+ except requests.exceptions.ConnectionError:
442
+ wait_time = 2 ** attempt
443
+ logger.warning(f"Connection error. Retrying in {wait_time}s...")
444
+ if attempt < max_retries - 1:
445
+ time.sleep(wait_time)
446
+ else:
447
+ raise APIError(f"Connection failed after {max_retries} attempts")
448
+
449
+ except requests.exceptions.HTTPError as e:
450
+ # Don't retry on 4xx errors (except 429)
451
+ if 400 <= response.status_code < 500 and response.status_code != 429:
452
+ raise APIError(f"HTTP {response.status_code}: {response.text}")
453
+ # Retry on 5xx errors
454
+ wait_time = 2 ** attempt
455
+ logger.warning(f"Server error {response.status_code}. Retrying in {wait_time}s...")
456
+ if attempt < max_retries - 1:
457
+ time.sleep(wait_time)
458
+ else:
459
+ raise APIError(f"Server error after {max_retries} attempts")
460
+
461
+ raise APIError("Request failed after all retries")
462
+
463
+
464
+ def get_jurisdictions(search_term: str | None = None) -> list[dict[str, Any]]:
465
+ """
466
+ Fetch all jurisdictions from the API.
467
+
468
+ Args:
469
+ search_term: Optional filter for jurisdiction title (case-insensitive)
470
+
471
+ Returns:
472
+ List of jurisdiction dicts with 'id', 'title', 'type' fields
473
+ """
474
+ logger.info("Fetching jurisdictions from API")
475
+ response = _make_request("/jurisdictions")
476
+ jurisdictions = response.get("data", [])
477
+
478
+ if search_term:
479
+ search_lower = search_term.lower()
480
+ jurisdictions = [
481
+ j for j in jurisdictions
482
+ if search_lower in j.get("title", "").lower()
483
+ ]
484
+ logger.info(f"Filtered to {len(jurisdictions)} jurisdictions matching '{search_term}'")
485
+
486
+ return jurisdictions
487
+
488
+
489
+ def get_standard_sets(jurisdiction_id: str) -> list[dict[str, Any]]:
490
+ """
491
+ Fetch standard sets for a specific jurisdiction.
492
+
493
+ Args:
494
+ jurisdiction_id: The jurisdiction GUID
495
+
496
+ Returns:
497
+ List of standard set dicts
498
+ """
499
+ logger.info(f"Fetching standard sets for jurisdiction {jurisdiction_id}")
500
+ response = _make_request("/standard_sets", params={"jurisdictionId": jurisdiction_id})
501
+ return response.get("data", [])
502
+
503
+
504
+ def download_standard_set(set_id: str) -> dict[str, Any]:
505
+ """
506
+ Download full standard set data.
507
+
508
+ Args:
509
+ set_id: The standard set GUID
510
+
511
+ Returns:
512
+ Complete standard set data including hierarchy
513
+ """
514
+ logger.info(f"Downloading standard set {set_id}")
515
+ response = _make_request(f"/standard_sets/{set_id}")
516
+ return response.get("data", {})
517
+ ```
518
+
519
+ ### 6.2 Data Manager Module (`tools/data_manager.py`)
520
+
521
+ Handles local file operations, directory structure, and metadata tracking.
522
+
523
+ ```python
524
+ """Manages local data storage and metadata tracking."""
525
+ from __future__ import annotations
526
+
527
+ import json
528
+ from datetime import datetime
529
+ from pathlib import Path
530
+ from typing import Any
531
+
532
+ from loguru import logger
533
+
534
+ # Data directories
535
+ PROJECT_ROOT = Path(__file__).parent.parent
536
+ RAW_DATA_DIR = PROJECT_ROOT / "data" / "raw"
537
+ PROCESSED_DATA_DIR = PROJECT_ROOT / "data" / "processed"
538
+
539
+
540
+ def save_raw_data(set_id: str, data: dict[str, Any], metadata_override: dict[str, Any] | None = None) -> Path:
541
+ """
542
+ Save raw standard set data with metadata.
543
+
544
+ Args:
545
+ set_id: Standard set GUID
546
+ data: Raw API response data
547
+ metadata_override: Optional metadata to merge (for jurisdiction info, etc.)
548
+
549
+ Returns:
550
+ Path to saved data file
551
+ """
552
+ # Extract metadata from data
553
+ jurisdiction_id = data.get("jurisdiction", {}).get("id", "unknown")
554
+ jurisdiction_title = data.get("jurisdiction", {}).get("title", "Unknown")
555
+
556
+ # Create directory structure
557
+ set_dir = RAW_DATA_DIR / jurisdiction_id / set_id
558
+ set_dir.mkdir(parents=True, exist_ok=True)
559
+
560
+ # Save raw data
561
+ data_file = set_dir / "data.json"
562
+ with open(data_file, "w", encoding="utf-8") as f:
563
+ json.dump(data, f, indent=2, ensure_ascii=False)
564
+
565
+ # Create metadata
566
+ metadata = {
567
+ "set_id": set_id,
568
+ "title": data.get("title", ""),
569
+ "jurisdiction": {
570
+ "id": jurisdiction_id,
571
+ "title": jurisdiction_title
572
+ },
573
+ "subject": data.get("subject", "Unknown"),
574
+ "grade_levels": data.get("educationLevels", []),
575
+ "download_date": datetime.utcnow().isoformat() + "Z",
576
+ "download_url": f"https://api.commonstandardsproject.com/api/v1/standard_sets/{set_id}",
577
+ "processed": False,
578
+ "processed_date": None,
579
+ "processed_output": None
580
+ }
581
+
582
+ # Merge override metadata
583
+ if metadata_override:
584
+ metadata.update(metadata_override)
585
+
586
+ # Save metadata
587
+ metadata_file = set_dir / "metadata.json"
588
+ with open(metadata_file, "w", encoding="utf-8") as f:
589
+ json.dump(metadata, f, indent=2, ensure_ascii=False)
590
+
591
+ logger.info(f"Saved raw data to {data_file}")
592
+ logger.info(f"Saved metadata to {metadata_file}")
593
+
594
+ return data_file
595
+
596
+
597
+ def list_downloaded_data() -> list[dict[str, Any]]:
598
+ """
599
+ List all downloaded raw datasets with their metadata.
600
+
601
+ Returns:
602
+ List of metadata dicts for each downloaded dataset
603
+ """
604
+ if not RAW_DATA_DIR.exists():
605
+ return []
606
+
607
+ datasets = []
608
+ for jurisdiction_dir in RAW_DATA_DIR.iterdir():
609
+ if not jurisdiction_dir.is_dir():
610
+ continue
611
+
612
+ for set_dir in jurisdiction_dir.iterdir():
613
+ if not set_dir.is_dir():
614
+ continue
615
+
616
+ metadata_file = set_dir / "metadata.json"
617
+ if metadata_file.exists():
618
+ with open(metadata_file, encoding="utf-8") as f:
619
+ metadata = json.load(f)
620
+ datasets.append(metadata)
621
+
622
+ logger.debug(f"Found {len(datasets)} downloaded datasets")
623
+ return datasets
624
+
625
+
626
+ def get_processing_status() -> dict[str, Any]:
627
+ """
628
+ Get processing status summary for all datasets.
629
+
630
+ Returns:
631
+ Dict with 'total', 'processed', 'unprocessed', 'processed_list', 'unprocessed_list'
632
+ """
633
+ datasets = list_downloaded_data()
634
+ processed = [d for d in datasets if d.get("processed", False)]
635
+ unprocessed = [d for d in datasets if not d.get("processed", False)]
636
+
637
+ return {
638
+ "total": len(datasets),
639
+ "processed": len(processed),
640
+ "unprocessed": len(unprocessed),
641
+ "processed_list": processed,
642
+ "unprocessed_list": unprocessed
643
+ }
644
+
645
+
646
+ def mark_as_processed(set_id: str, output_path: Path) -> None:
647
+ """
648
+ Update metadata to mark a dataset as processed.
649
+
650
+ Args:
651
+ set_id: Standard set GUID
652
+ output_path: Path to processed output directory
653
+ """
654
+ # Find the dataset
655
+ for jurisdiction_dir in RAW_DATA_DIR.iterdir():
656
+ if not jurisdiction_dir.is_dir():
657
+ continue
658
+
659
+ set_dir = jurisdiction_dir / set_id
660
+ if set_dir.exists():
661
+ metadata_file = set_dir / "metadata.json"
662
+ if metadata_file.exists():
663
+ with open(metadata_file, encoding="utf-8") as f:
664
+ metadata = json.load(f)
665
+
666
+ metadata["processed"] = True
667
+ metadata["processed_date"] = datetime.utcnow().isoformat() + "Z"
668
+ metadata["processed_output"] = str(output_path)
669
+
670
+ with open(metadata_file, "w", encoding="utf-8") as f:
671
+ json.dump(metadata, f, indent=2, ensure_ascii=False)
672
+
673
+ logger.info(f"Marked {set_id} as processed")
674
+ return
675
+
676
+ logger.warning(f"Could not find dataset {set_id} to mark as processed")
677
+ ```
678
+
679
+ ### 6.3 Data Processor Module (`tools/data_processor.py`)
680
+
681
+ Handles the transformation of raw API data into flattened format with embeddings.
682
+
683
+ ```python
684
+ """Processes raw standard sets into flattened format with embeddings."""
685
+ from __future__ import annotations
686
+
687
+ import json
688
+ from pathlib import Path
689
+ from typing import Any
690
+
691
+ import numpy as np
692
+ from loguru import logger
693
+ from sentence_transformers import SentenceTransformer
694
+
695
+ from tools.data_manager import PROCESSED_DATA_DIR, RAW_DATA_DIR, mark_as_processed
696
+
697
+
698
+ def _build_lookup_map(standards_dict: dict[str, Any]) -> dict[str, Any]:
699
+ """
700
+ Build lookup map from standards dictionary in API response.
701
+
702
+ The API returns standards in a flat dictionary keyed by GUID.
703
+
704
+ Args:
705
+ standards_dict: The 'standards' field from API response
706
+
707
+ Returns:
708
+ Lookup map of GUID -> standard object
709
+ """
710
+ logger.debug(f"Building lookup map with {len(standards_dict)} items")
711
+ return standards_dict
712
+
713
+
714
+ def _resolve_context(standard: dict[str, Any], lookup_map: dict[str, Any]) -> str:
715
+ """
716
+ Build full context string by resolving ancestor chain.
717
+
718
+ Concatenates descriptions from Domain -> Cluster -> Standard
719
+ to create rich context for embedding.
720
+
721
+ Args:
722
+ standard: Standard dict with 'ancestorIds' and 'description'
723
+ lookup_map: Map of GUID -> standard object
724
+
725
+ Returns:
726
+ Full context string with ancestors
727
+ """
728
+ context_parts = []
729
+
730
+ # Resolve ancestors
731
+ ancestor_ids = standard.get("ancestorIds", [])
732
+ for ancestor_id in ancestor_ids:
733
+ if ancestor_id in lookup_map:
734
+ ancestor = lookup_map[ancestor_id]
735
+ ancestor_desc = ancestor.get("description", "").strip()
736
+ if ancestor_desc:
737
+ context_parts.append(ancestor_desc)
738
+
739
+ # Add standard's own description
740
+ standard_desc = standard.get("description", "").strip()
741
+ if standard_desc:
742
+ context_parts.append(standard_desc)
743
+
744
+ return " - ".join(context_parts)
745
+
746
+
747
+ def _extract_grade(grade_levels: list[str]) -> str:
748
+ """Extract primary grade level from gradeLevels array."""
749
+ if not grade_levels:
750
+ return "Unknown"
751
+
752
+ grade = grade_levels[0]
753
+
754
+ # Handle high school ranges
755
+ if grade in ["09", "10", "11", "12"]:
756
+ return "09-12"
757
+
758
+ return grade
759
+
760
+
761
+ def _process_standards(
762
+ data: dict[str, Any],
763
+ lookup_map: dict[str, Any]
764
+ ) -> list[dict[str, Any]]:
765
+ """
766
+ Filter and process standards from raw API data.
767
+
768
+ Keeps only items where statementLabel == "Standard".
769
+
770
+ Args:
771
+ data: Raw API response
772
+ lookup_map: Map of GUID -> standard object
773
+
774
+ Returns:
775
+ List of processed standard dicts
776
+ """
777
+ processed = []
778
+ standards_dict = data.get("standards", {})
779
+ subject = data.get("subject", "Unknown")
780
+
781
+ for guid, item in standards_dict.items():
782
+ # Filter: Keep only "Standard" items
783
+ if item.get("statementLabel") != "Standard":
784
+ continue
785
+
786
+ # Extract fields
787
+ standard_id = item.get("statementNotation", "")
788
+ grade_levels = item.get("educationLevels", [])
789
+ grade = _extract_grade(grade_levels)
790
+ description = item.get("description", "").strip()
791
+
792
+ # Skip if missing critical fields
793
+ if not standard_id or not description:
794
+ continue
795
+
796
+ # Resolve full context
797
+ full_context = _resolve_context(item, lookup_map)
798
+
799
+ # Build output record
800
+ record = {
801
+ "id": standard_id,
802
+ "guid": guid,
803
+ "subject": subject,
804
+ "grade": grade,
805
+ "description": description,
806
+ "full_context": full_context
807
+ }
808
+
809
+ processed.append(record)
810
+
811
+ logger.info(f"Processed {len(processed)} standards")
812
+ return processed
813
+
814
+
815
+ def _generate_embeddings(standards: list[dict[str, Any]]) -> np.ndarray:
816
+ """
817
+ Generate embeddings for all standards.
818
+
819
+ Uses sentence-transformers with 'full_context' field.
820
+
821
+ Args:
822
+ standards: List of standard dicts
823
+
824
+ Returns:
825
+ Numpy array of embeddings
826
+ """
827
+ logger.info("Initializing sentence-transformers model...")
828
+ model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
829
+
830
+ contexts = [s["full_context"] for s in standards]
831
+
832
+ logger.info(f"Generating embeddings for {len(contexts)} standards...")
833
+ embeddings = model.encode(contexts, show_progress_bar=True)
834
+
835
+ return embeddings
836
+
837
+
838
+ def process_standard_set(set_id: str) -> tuple[Path, Path]:
839
+ """
840
+ Process a raw standard set into flattened format with embeddings.
841
+
842
+ Args:
843
+ set_id: Standard set GUID
844
+
845
+ Returns:
846
+ Tuple of (standards_file_path, embeddings_file_path)
847
+
848
+ Raises:
849
+ FileNotFoundError: If raw data not found for set_id
850
+ ValueError: If processing fails
851
+ """
852
+ logger.info(f"Processing standard set {set_id}")
853
+
854
+ # Find raw data
855
+ raw_data_file = None
856
+ metadata_file = None
857
+
858
+ for jurisdiction_dir in RAW_DATA_DIR.iterdir():
859
+ if not jurisdiction_dir.is_dir():
860
+ continue
861
+
862
+ set_dir = jurisdiction_dir / set_id
863
+ if set_dir.exists():
864
+ raw_data_file = set_dir / "data.json"
865
+ metadata_file = set_dir / "metadata.json"
866
+ break
867
+
868
+ if not raw_data_file or not raw_data_file.exists():
869
+ raise FileNotFoundError(f"Raw data not found for set {set_id}. Run download first.")
870
+
871
+ # Load metadata
872
+ with open(metadata_file, encoding="utf-8") as f:
873
+ metadata = json.load(f)
874
+
875
+ # Load raw data
876
+ with open(raw_data_file, encoding="utf-8") as f:
877
+ raw_data = json.load(f)
878
+
879
+ # Build lookup map
880
+ standards_dict = raw_data.get("standards", {})
881
+ lookup_map = _build_lookup_map(standards_dict)
882
+
883
+ # Process standards
884
+ processed_standards = _process_standards(raw_data, lookup_map)
885
+
886
+ if not processed_standards:
887
+ raise ValueError(f"No standards processed from set {set_id}")
888
+
889
+ # Generate embeddings
890
+ embeddings = _generate_embeddings(processed_standards)
891
+
892
+ # Determine output path
893
+ jurisdiction_name = metadata["jurisdiction"]["title"].lower().replace(" ", "_")
894
+ subject_name = metadata["subject"].lower().replace(" ", "_").replace("-", "_")
895
+ grade = _extract_grade(metadata["grade_levels"])
896
+ grade_str = f"grade_{grade}".replace("-", "_")
897
+
898
+ output_dir = PROCESSED_DATA_DIR / jurisdiction_name / subject_name / grade_str
899
+ output_dir.mkdir(parents=True, exist_ok=True)
900
+
901
+ # Save standards
902
+ standards_file = output_dir / "standards.json"
903
+ with open(standards_file, "w", encoding="utf-8") as f:
904
+ json.dump(processed_standards, f, indent=2, ensure_ascii=False)
905
+ logger.info(f"Saved standards to {standards_file}")
906
+
907
+ # Save embeddings
908
+ embeddings_file = output_dir / "embeddings.npy"
909
+ np.save(embeddings_file, embeddings)
910
+ logger.info(f"Saved embeddings to {embeddings_file}")
911
+
912
+ # Mark as processed
913
+ mark_as_processed(set_id, output_dir)
914
+
915
+ return standards_file, embeddings_file
916
+ ```
917
+
918
+ ### 6.4 CLI Entry Point (`tools/cli.py`)
919
+
920
+ Thin CLI layer that imports and invokes business logic functions.
921
+
922
+ ```python
923
+ """CLI entry point for EduMatch Data Management."""
924
+ from __future__ import annotations
925
+
926
+ import sys
927
+
928
+ import typer
929
+ from loguru import logger
930
+ from rich.console import Console
931
+ from rich.table import Table
932
+
933
+ from tools import api_client, data_manager, data_processor
934
+
935
+ # Configure logger
936
+ logger.remove() # Remove default handler
937
+ logger.add(sys.stderr, format="<green>{time:HH:mm:ss}</green> | <level>{level: <8}</level> | <level>{message}</level>")
938
+ logger.add("data/cli.log", rotation="10 MB", retention="7 days", format="{time} | {level} | {message}")
939
+
940
+ app = typer.Typer(help="EduMatch Data CLI - Manage educational standards data")
941
+ console = Console()
942
+
943
+
944
+ @app.command()
945
+ def jurisdictions(
946
+ search: str = typer.Option(None, "--search", "-s", help="Filter by jurisdiction name")
947
+ ):
948
+ """List all available jurisdictions (states/organizations)."""
949
+ try:
950
+ results = api_client.get_jurisdictions(search)
951
+
952
+ table = Table("ID", "Title", "Type", title="Jurisdictions")
953
+ for j in results:
954
+ table.add_row(
955
+ j.get("id", ""),
956
+ j.get("title", ""),
957
+ j.get("type", "N/A")
958
+ )
959
+
960
+ console.print(table)
961
+ console.print(f"\n[green]Found {len(results)} jurisdictions[/green]")
962
+
963
+ except Exception as e:
964
+ console.print(f"[red]Error: {e}[/red]")
965
+ logger.exception("Failed to fetch jurisdictions")
966
+ raise typer.Exit(code=1)
967
+
968
+
969
+ @app.command()
970
+ def sets(jurisdiction_id: str = typer.Argument(..., help="Jurisdiction ID")):
971
+ """List standard sets for a specific jurisdiction."""
972
+ try:
973
+ results = api_client.get_standard_sets(jurisdiction_id)
974
+
975
+ table = Table("Set ID", "Subject", "Title", "Grades", title=f"Standard Sets")
976
+ for s in results:
977
+ grade_levels = ", ".join(s.get("educationLevels", []))
978
+ table.add_row(
979
+ s.get("id", ""),
980
+ s.get("subject", "N/A"),
981
+ s.get("title", ""),
982
+ grade_levels or "N/A"
983
+ )
984
+
985
+ console.print(table)
986
+ console.print(f"\n[green]Found {len(results)} standard sets[/green]")
987
+
988
+ except Exception as e:
989
+ console.print(f"[red]Error: {e}[/red]")
990
+ logger.exception("Failed to fetch standard sets")
991
+ raise typer.Exit(code=1)
992
+
993
+
994
+ @app.command()
995
+ def download(set_id: str = typer.Argument(..., help="Standard set ID")):
996
+ """Download a standard set and save locally."""
997
+ try:
998
+ with console.status(f"[bold blue]Downloading set {set_id}..."):
999
+ data = api_client.download_standard_set(set_id)
1000
+ output_path = data_manager.save_raw_data(set_id, data)
1001
+
1002
+ console.print(f"[green]✓ Successfully downloaded to {output_path}[/green]")
1003
+
1004
+ except Exception as e:
1005
+ console.print(f"[red]Error: {e}[/red]")
1006
+ logger.exception("Failed to download standard set")
1007
+ raise typer.Exit(code=1)
1008
+
1009
+
1010
+ @app.command(name="list")
1011
+ def list_datasets():
1012
+ """List all downloaded datasets."""
1013
+ try:
1014
+ datasets = data_manager.list_downloaded_data()
1015
+
1016
+ if not datasets:
1017
+ console.print("[yellow]No datasets downloaded yet.[/yellow]")
1018
+ return
1019
+
1020
+ table = Table("Set ID", "Subject", "Title", "Grades", "Downloaded", "Processed", title="Downloaded Datasets")
1021
+ for d in datasets:
1022
+ table.add_row(
1023
+ d["set_id"][:12] + "...",
1024
+ d.get("subject", "N/A"),
1025
+ d.get("title", "")[:50],
1026
+ ", ".join(d.get("grade_levels", [])),
1027
+ d.get("download_date", "")[:10],
1028
+ "✓" if d.get("processed") else "✗"
1029
+ )
1030
+
1031
+ console.print(table)
1032
+ console.print(f"\n[green]Total: {len(datasets)} datasets[/green]")
1033
+
1034
+ except Exception as e:
1035
+ console.print(f"[red]Error: {e}[/red]")
1036
+ logger.exception("Failed to list datasets")
1037
+ raise typer.Exit(code=1)
1038
+
1039
+
1040
+ @app.command()
1041
+ def process(set_id: str = typer.Argument(..., help="Standard set ID to process")):
1042
+ """Process a downloaded dataset into flattened format with embeddings."""
1043
+ try:
1044
+ with console.status(f"[bold blue]Processing set {set_id}..."):
1045
+ standards_file, embeddings_file = data_processor.process_standard_set(set_id)
1046
+
1047
+ console.print(f"[green]✓ Processing complete![/green]")
1048
+ console.print(f" Standards: {standards_file}")
1049
+ console.print(f" Embeddings: {embeddings_file}")
1050
+
1051
+ except FileNotFoundError as e:
1052
+ console.print(f"[red]Error: {e}[/red]")
1053
+ console.print("[yellow]Hint: Run 'download' command first.[/yellow]")
1054
+ raise typer.Exit(code=1)
1055
+ except Exception as e:
1056
+ console.print(f"[red]Error: {e}[/red]")
1057
+ logger.exception("Failed to process dataset")
1058
+ raise typer.Exit(code=1)
1059
+
1060
+
1061
+ @app.command()
1062
+ def status():
1063
+ """Show processing status for all datasets."""
1064
+ try:
1065
+ status_data = data_manager.get_processing_status()
1066
+
1067
+ console.print(f"\n[bold]Processing Status Summary[/bold]")
1068
+ console.print(f" Total Downloaded: {status_data['total']}")
1069
+ console.print(f" Processed: {status_data['processed']}")
1070
+ console.print(f" Unprocessed: {status_data['unprocessed']}")
1071
+
1072
+ if status_data["unprocessed_list"]:
1073
+ console.print(f"\n[yellow]Unprocessed Datasets:[/yellow]")
1074
+ for d in status_data["unprocessed_list"]:
1075
+ console.print(f" • {d['title']} ({d['set_id'][:12]}...)")
1076
+
1077
+ if status_data["processed_list"]:
1078
+ console.print(f"\n[green]Processed Datasets:[/green]")
1079
+ for d in status_data["processed_list"]:
1080
+ console.print(f" • {d['title']}")
1081
+ console.print(f" Output: {d.get('processed_output', 'N/A')}")
1082
+
1083
+ except Exception as e:
1084
+ console.print(f"[red]Error: {e}[/red]")
1085
+ logger.exception("Failed to get status")
1086
+ raise typer.Exit(code=1)
1087
+
1088
+
1089
+ if __name__ == "__main__":
1090
+ app()
1091
+ ```
1092
+
1093
+ ---
1094
+
1095
+ ## 7. API Data Format Reference
1096
+
1097
+ ### 7.1 Raw Data Structure (from API)
1098
+
1099
+ When you run the `download` command, the `data/raw/<jurisdiction>/<set_id>/data.json` files will contain the Common Standards Project API response format:
1100
+
1101
+ ```json
1102
+ {
1103
+ "id": "SET_ID",
1104
+ "title": "Utah Core Standards - Mathematics",
1105
+ "subject": "Mathematics",
1106
+ "educationLevels": ["03"],
1107
+ "jurisdiction": {
1108
+ "id": "JURISDICTION_ID",
1109
+ "title": "Utah"
1110
+ },
1111
+ "standards": {
1112
+ "STANDARD_UUID": {
1113
+ "id": "STANDARD_UUID",
1114
+ "statementNotation": "3.OA.1",
1115
+ "description": "Interpret products of whole numbers...",
1116
+ "ancestorIds": ["CLUSTER_UUID", "DOMAIN_UUID"],
1117
+ "statementLabel": "Standard",
1118
+ "educationLevels": ["03"]
1119
+ },
1120
+ "CLUSTER_UUID": {
1121
+ "id": "CLUSTER_UUID",
1122
+ "description": "Represent and solve problems involving multiplication...",
1123
+ "statementLabel": "Cluster",
1124
+ "ancestorIds": ["DOMAIN_UUID"]
1125
+ },
1126
+ "DOMAIN_UUID": {
1127
+ "id": "DOMAIN_UUID",
1128
+ "description": "Operations and Algebraic Thinking",
1129
+ "statementLabel": "Domain",
1130
+ "ancestorIds": []
1131
+ }
1132
+ }
1133
+ }
1134
+ ```
1135
+
1136
+ **Key Points:**
1137
+
1138
+ - The `standards` field is a flat dictionary keyed by GUID (not an array)
1139
+ - Each item includes `ancestorIds` that reference other items in the same dictionary
1140
+ - The `statementLabel` field indicates the type: "Domain", "Cluster", or "Standard"
1141
+ - We filter to keep only `"statementLabel": "Standard"` items and resolve their ancestor context
1142
+
1143
+ ### 7.2 Processed Data Format
1144
+
1145
+ After running the `process` command, the output `standards.json` has this flattened structure:
1146
+
1147
+ ```json
1148
+ [
1149
+ {
1150
+ "id": "CCSS.Math.Content.3.OA.A.1",
1151
+ "guid": "STANDARD_UUID",
1152
+ "subject": "Mathematics",
1153
+ "grade": "03",
1154
+ "description": "Interpret products of whole numbers...",
1155
+ "full_context": "Operations and Algebraic Thinking - Represent and solve problems involving multiplication... - Interpret products of whole numbers..."
1156
+ }
1157
+ ]
1158
+ ```
1159
+
1160
+ The `full_context` field is created by concatenating:
1161
+
1162
+ 1. Domain description
1163
+ 2. Cluster description
1164
+ 3. Standard description
1165
+
1166
+ This rich context is used for generating embeddings that capture the hierarchical meaning.
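+
+ For illustration, here is a minimal sketch of how `full_context` could be assembled from the raw `standards` dictionary. The helper name and the surrounding wiring are assumptions, not the actual implementation in `tools/data_processor.py`:
+
+ ```python
+ def build_full_context(standard: dict, standards_by_id: dict) -> str:
+     """Concatenate ancestor descriptions (Domain, Cluster, ...) with the standard's own text."""
+     parts = []
+     # `ancestorIds` is ordered from root to immediate parent in this dataset.
+     for ancestor_id in standard.get("ancestorIds", []):
+         ancestor = standards_by_id.get(ancestor_id)
+         if ancestor and ancestor.get("description"):
+             parts.append(ancestor["description"])
+     parts.append(standard.get("description", ""))
+     return " - ".join(p for p in parts if p)
+
+ # Keep only "Standard" items and attach their resolved context:
+ # records = [
+ #     {"id": s.get("statementNotation"), "guid": s["id"], "description": s.get("description", ""),
+ #      "full_context": build_full_context(s, data["standards"])}
+ #     for s in data["standards"].values()
+ #     if s.get("statementLabel") == "Standard"
+ # ]
+ ```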
1167
+
1168
+ ---
1169
+
1170
+ ## 8. Error Handling & Retry Logic
1171
+
1172
+ ### 8.1 API Error Categories
1173
+
1174
+ The CLI handles these error scenarios (a retry sketch follows the table):
1175
+
1176
+ | Error Type | Status Code | Behavior |
1177
+ | ------------------ | ----------- | -------------------------------------------- |
1178
+ | Invalid API Key | 401 | Stop immediately, show error message |
1179
+ | Resource Not Found | 404 | Stop immediately, show helpful error |
1180
+ | Rate Limited | 429 | Wait for `Retry-After` header, then retry |
1181
+ | Timeout | - | Exponential backoff: 1s, 2s, 4s (3 attempts) |
1182
+ | Connection Error | - | Exponential backoff: 1s, 2s, 4s (3 attempts) |
1183
+ | Server Error | 5xx | Exponential backoff: 1s, 2s, 4s (3 attempts) |
1184
+ | Client Error | 4xx | Stop immediately (no retry) |
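+
+ For reference, a hedged sketch of this retry policy, assuming the API client is built on the `requests` library (the helper name, timeout, and session handling here are illustrative, not the project's actual implementation):
+
+ ```python
+ import time
+ import requests
+
+ def get_with_retries(url: str, headers: dict | None = None, max_attempts: int = 3) -> requests.Response:
+     """GET with retries: back off on timeouts, connection errors, and 5xx; honor Retry-After on 429; fail fast on other 4xx."""
+     delay = 1.0
+     for attempt in range(1, max_attempts + 1):
+         try:
+             response = requests.get(url, headers=headers, timeout=30)
+         except (requests.Timeout, requests.ConnectionError):
+             if attempt == max_attempts:
+                 raise
+             time.sleep(delay)
+             delay *= 2  # 1s, 2s, 4s
+             continue
+
+         if response.status_code == 429:
+             # Wait for the server-provided Retry-After, falling back to the current delay.
+             time.sleep(float(response.headers.get("Retry-After", delay)))
+             continue
+         if response.status_code >= 500 and attempt < max_attempts:
+             time.sleep(delay)
+             delay *= 2
+             continue
+
+         response.raise_for_status()  # Raises immediately for the remaining 4xx errors
+         return response
+
+     raise RuntimeError(f"Giving up on {url} after {max_attempts} attempts")
+ ```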
1185
+
1186
+ ### 8.2 Rate Limiting
1187
+
1188
+ - **Client-Side Limit:** 60 requests per minute
1189
+ - **Implementation:** Track request timestamps and enforce a delay when the limit is reached (see the sketch after this list)
1190
+ - **Server-Side Limit:** Respect `429` status and `Retry-After` header
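+
+ A minimal sliding-window sketch of the client-side limiter described above (the class name and where it hooks into the API client are assumptions):
+
+ ```python
+ import time
+ from collections import deque
+
+ class RateLimiter:
+     """Allow at most `max_requests` calls per `window_seconds`, sleeping when the budget is exhausted."""
+
+     def __init__(self, max_requests: int = 60, window_seconds: float = 60.0) -> None:
+         self.max_requests = max_requests
+         self.window_seconds = window_seconds
+         self._timestamps: deque[float] = deque()
+
+     def wait(self) -> None:
+         now = time.monotonic()
+         # Drop timestamps that have aged out of the window.
+         while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
+             self._timestamps.popleft()
+         if len(self._timestamps) >= self.max_requests:
+             time.sleep(max(self.window_seconds - (now - self._timestamps[0]), 0))
+         self._timestamps.append(time.monotonic())
+
+ # limiter = RateLimiter()
+ # limiter.wait()  # call before every API request
+ ```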
1191
+
1192
+ ### 8.3 Logging
1193
+
1194
+ All operations are logged using `loguru` (a configuration sketch follows this list):
1195
+
1196
+ - **Console:** Formatted output with timestamps and log levels
1197
+ - **File:** `data/cli.log` with rotation (10MB max, 7 days retention)
1198
+ - **Exception Tracking:** Full stack traces logged for debugging
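+
+ For reference, a loguru setup along these lines (sink paths, rotation, and retention mirror the description above; the exact format string and levels are placeholders):
+
+ ```python
+ import sys
+ from loguru import logger
+
+ logger.remove()  # Drop the default handler so both sinks are configured explicitly
+ logger.add(sys.stderr, level="INFO",
+            format="{time:YYYY-MM-DD HH:mm:ss} | {level} | {message}")
+ logger.add("data/cli.log", rotation="10 MB", retention="7 days", level="DEBUG")
+
+ # logger.exception("Failed to process dataset")  # records the full stack trace
+ ```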
1199
+
1200
+ ---
1201
+
1202
+ ## 9. Git Configuration
1203
+
1204
+ ### 9.1 .gitignore Additions
1205
+
1206
+ Add to `.gitignore`:
1207
+
1208
+ ```
1209
+ # Raw data (local only)
1210
+ data/raw/
1211
+
1212
+ # CLI logs
1213
+ data/cli.log
1214
+ ```
1215
+
1216
+ ### 9.2 Git Tracking
1217
+
1218
+ **Tracked in Git:**
1219
+
1220
+ - `data/processed/utah/mathematics/grade_03/` (example dataset)
1221
+ - `data/processed/wyoming/mathematics/grade_03/` (example dataset)
1222
+ - `data/processed/idaho/mathematics/grade_03/` (example dataset)
1223
+
1224
+ **Not Tracked:**
1225
+
1226
+ - `data/raw/` (developer's local cache)
1227
+ - `data/cli.log` (operational logs)
1228
+
1229
+ ---
1230
+
1231
+ ## 10. Testing Strategy
1232
+
1233
+ ### 10.1 Manual Testing Checklist
1234
+
1235
+ For this initial sprint, manual testing is sufficient. Complete these test scenarios:
1236
+
1237
+ **API Discovery:**
1238
+
1239
+ - [ ] Run `jurisdictions` command without search
1240
+ - [ ] Run `jurisdictions --search "Utah"` to filter
1241
+ - [ ] Verify table output is readable
1242
+ - [ ] Confirm logging to console and file
1243
+
1244
+ **Standard Set Discovery:**
1245
+
1246
+ - [ ] Find Utah jurisdiction ID from previous step
1247
+ - [ ] Run `sets <UTAH_ID>` to list available standards
1248
+ - [ ] Identify Mathematics Grade 3 set ID
1249
+ - [ ] Repeat for Wyoming and Idaho
1250
+
1251
+ **Data Download:**
1252
+
1253
+ - [ ] Run `download <SET_ID>` for Utah Math Grade 3
1254
+ - [ ] Verify file created in `data/raw/<jurisdiction>/<set_id>/data.json`
1255
+ - [ ] Verify metadata created in `data/raw/<jurisdiction>/<set_id>/metadata.json`
1256
+ - [ ] Repeat for Wyoming and Idaho Math Grade 3
1257
+ - [ ] Run `list` command to see all downloads
1258
+
1259
+ **Data Processing:**
1260
+
1261
+ - [ ] Run `status` to see unprocessed datasets
1262
+ - [ ] Run `process <SET_ID>` for Utah Math Grade 3
1263
+ - [ ] Verify `data/processed/utah/mathematics/grade_03/standards.json` created
1264
+ - [ ] Verify `data/processed/utah/mathematics/grade_03/embeddings.npy` created
1265
+ - [ ] Run `status` again to confirm marked as processed
1266
+ - [ ] Repeat for Wyoming and Idaho
1267
+
1268
+ **Error Handling:**
1269
+
1270
+ - [ ] Test with invalid API key (should fail immediately with clear message)
1271
+ - [ ] Test `download` with invalid set ID (should show 404 error)
1272
+ - [ ] Test `process` without downloading first (should show helpful error)
1273
+
1274
+ ### 10.2 Validation Criteria
1275
+
1276
+ After processing all three datasets, verify the following (a validation sketch follows this list):
1277
+
1278
+ - Each `standards.json` contains an array of standard objects
1279
+ - Each standard has all required fields: `id`, `guid`, `subject`, `grade`, `description`, `full_context`
1280
+ - The `full_context` field is not empty and contains ancestor descriptions
1281
+ - Each `embeddings.npy` file shape matches the number of standards
1282
+ - Metadata files correctly show `"processed": true`
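+
+ These checks can be scripted. A rough sketch (paths and the required-field list follow this spec; the script itself is illustrative):
+
+ ```python
+ import json
+ import numpy as np
+
+ REQUIRED_FIELDS = {"id", "guid", "subject", "grade", "description", "full_context"}
+
+ def validate_dataset(base_dir: str) -> None:
+     """Spot-check a processed dataset directory, e.g. data/processed/utah/mathematics/grade_03."""
+     with open(f"{base_dir}/standards.json") as f:
+         standards = json.load(f)
+     assert isinstance(standards, list) and standards, "standards.json must be a non-empty array"
+
+     for record in standards:
+         missing = REQUIRED_FIELDS - record.keys()
+         assert not missing, f"Record {record.get('guid')} missing fields: {missing}"
+         assert record["full_context"].strip(), f"Empty full_context for {record.get('guid')}"
+
+     embeddings = np.load(f"{base_dir}/embeddings.npy")
+     assert embeddings.shape[0] == len(standards), "Embedding count must match standard count"
+
+ # validate_dataset("data/processed/utah/mathematics/grade_03")
+ ```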
1283
+
1284
+ ---
1285
+
1286
+ ## 11. Implementation Workflow
1287
+
1288
+ ### 11.1 Development Order
1289
+
1290
+ Implement modules in this order:
1291
+
1292
+ 1. **`tools/api_client.py`** - Core API interactions with retry logic
1293
+ 2. **`tools/data_manager.py`** - File management and metadata tracking
1294
+ 3. **`tools/data_processor.py`** - Data transformation and embeddings
1295
+ 4. **`tools/cli.py`** - CLI commands that tie everything together
1296
+
1297
+ ### 11.2 Testing Workflow
1298
+
1299
+ After implementing each module:
1300
+
1301
+ 1. Test API discovery commands (`jurisdictions`, `sets`)
1302
+ 2. Test download command for one dataset
1303
+ 3. Test list command
1304
+ 4. Test process command
1305
+ 5. Test status command
1306
+ 6. Repeat download and process for remaining datasets
1307
+
1308
+ ### 11.3 Deliverables
1309
+
1310
+ **Code:**
1311
+
1312
+ - `tools/api_client.py`
1313
+ - `tools/data_manager.py`
1314
+ - `tools/data_processor.py`
1315
+ - `tools/cli.py`
1316
+
1317
+ **Data (committed to git):**
1318
+
1319
+ - `data/processed/utah/mathematics/grade_03/standards.json`
1320
+ - `data/processed/utah/mathematics/grade_03/embeddings.npy`
1321
+ - `data/processed/wyoming/mathematics/grade_03/standards.json`
1322
+ - `data/processed/wyoming/mathematics/grade_03/embeddings.npy`
1323
+ - `data/processed/idaho/mathematics/grade_03/standards.json`
1324
+ - `data/processed/idaho/mathematics/grade_03/embeddings.npy`
1325
+
1326
+ **Configuration:**
1327
+
1328
+ - Updated `pyproject.toml` with new dependencies
1329
+ - Updated `.gitignore` with data/raw/ and data/cli.log
1330
+
1331
+ ---
1332
+
1333
+ ## 12. Future Enhancements
1334
+
1335
+ Features not included in this sprint but planned for future:
1336
+
1337
+ - **Batch Processing:** Command to process all downloaded datasets at once
1338
+ - **Update Detection:** Check API for updates to already-downloaded sets
1339
+ - **Data Validation:** Verify processed data integrity
1340
+ - **Export Formats:** Support CSV or other output formats
1341
+ - **Automated Tests:** Unit tests for each module
1342
+ - **Configuration File:** YAML config for default settings (rate limits, retry attempts, etc.)
1343
+ - **Progress Tracking:** Better progress bars for long operations
1344
+
1345
+ ---
1346
+
1347
+ _This specification provides complete requirements for implementing a CLI tool to download and process Common Core standards from the Common Standards Project API, with clean architecture, robust error handling, and proper data organization for the MVP._
.agent/specs/001_pinecone/spec.md ADDED
@@ -0,0 +1,660 @@
1
+ # Pinecone Integration Sprint
2
+
3
+ ## Overview
4
+
5
+ This sprint integrates Pinecone for vector storage and semantic search of educational standards. We will use Pinecone's hosted embedding model (`llama-text-embed-v2`) for both embedding generation and search, leveraging Pinecone's native search functionality through the Python SDK. This approach allows rapid iteration, takes advantage of the free tier, and eliminates the need for local embedding generation.
6
+
7
+ ---
8
+
9
+ ## User Stories
10
+
11
+ **US-1: Transform Standards on Download**
12
+ As a developer, I want downloaded standard sets to be automatically transformed into Pinecone-ready format so that I don't need a separate processing step before uploading.
13
+
14
+ **US-2: Upload Standards to Pinecone**
15
+ As a developer, I want to upload processed standards to Pinecone with a single CLI command so that the standards become searchable.
16
+
17
+ **US-3: Prevent Duplicate Uploads**
18
+ As a developer, I want the system to track which standard sets have been uploaded so that I don't waste time and API calls re-uploading data.
19
+
20
+ **US-4: Resume Failed Uploads**
21
+ As a developer, I want to be able to resume uploads after a failure so that I can recover from crashes without starting over.
22
+
23
+ **US-5: Preview Before Upload**
24
+ As a developer, I want a dry-run option to see what would be uploaded without actually uploading so that I can verify the data before committing.
25
+
26
+ ---
27
+
28
+ ## Sprint Parts
29
+
30
+ ### Part 1: Transform Standard Sets on Download
31
+
32
+ Update the CLI `download-sets` command to create transformed `processed.json` files alongside the original `data.json`. These transformed files contain records ready for Pinecone ingestion.
33
+
34
+ ### Part 2: Pinecone Upsert CLI Command
35
+
36
+ Implement a new CLI command to upload transformed standard set records to Pinecone in batches, with tracking to prevent duplicate uploads.
37
+
38
+ ---
39
+
40
+ ## Technical Decisions
41
+
42
+ ### Text Content Format (for Embedding)
43
+
44
+ Use a structured text block with depth-based headers to preserve the full parent-child context:
45
+
46
+ ```
47
+ Depth 0: Geometry
48
+ Depth 1 (1.G.K): Reason with shapes and their attributes.
49
+ Depth 2 (1.G.K.3.Ba): Partition circles and rectangles into two equal shares and:
50
+ Depth 3 (1.G.K.3.Ba.B): Describe the whole as two of the shares.
51
+ ```
52
+
53
+ Format rules:
54
+
55
+ - Each line starts with `Depth N:` where N is the standard's depth value
56
+ - If `statementNotation` is present, include it in parentheses after the depth label
57
+ - Include the full ancestor chain from root (depth 0) down to the current standard
58
+ - Join all lines with `\n`
59
+
60
+ This format is depth-agnostic and works regardless of how deep the hierarchy goes, avoiding assumptions about what each depth level represents semantically.
61
+
62
+ ### Which Standards to Process
63
+
64
+ **Process ALL standards in the hierarchy, not just leaf nodes.** This enables:
65
+
66
+ - Direct lookup of any standard by ID (including parents)
67
+ - Navigation up and down the hierarchy
68
+ - Finding children of any standard
69
+ - Complete relationship traversal
70
+
71
+ Each record includes an `is_leaf` boolean to distinguish leaf nodes (no children) from branch nodes (have children). Search queries can filter to `is_leaf: true` when only the most specific standards are desired.
72
+
73
+ **Identifying leaf vs branch nodes:** A standard is a leaf node if its `id` does NOT appear as a `parentId` for any other standard in the set.
74
+
75
+ **Note:** The previous implementation in `data_processor.py` that filtered only on `statementLabel == "Standard"` is incorrect and should be completely replaced.
76
+
77
+ ### Pinecone Record ID Strategy
78
+
79
+ Use the individual standard's GUID as the record `_id` (e.g., `EA60C8D165F6481B90BFF782CE193F93`). This ensures uniqueness and enables direct lookup. The parent hierarchy is preserved in the text content, not the ID.
80
+
81
+ ### Namespace Strategy
82
+
83
+ Use a single namespace for all standards (configurable via environment variable, default: `standards`). This allows cross-jurisdiction and cross-subject searches in a single query. Filtering by metadata handles scoping to specific jurisdictions, subjects, or grade levels.
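+
+ For context, a hypothetical query against this single namespace that scopes by metadata filter. The exact shape of the `search` call depends on the installed `pinecone` SDK version, so treat this as a sketch rather than the project's final search API:
+
+ ```python
+ import os
+ from pinecone import Pinecone
+
+ pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
+ index = pc.Index("common-core-standards")
+
+ # Semantic search over all standards, scoped to one jurisdiction and leaf nodes only.
+ results = index.search(
+     namespace="standards",
+     query={
+         "inputs": {"text": "partition circles and rectangles into equal shares"},
+         "top_k": 5,
+         "filter": {"jurisdiction_title": "Wyoming", "is_leaf": True},
+     },
+     fields=["statement_notation", "content"],
+ )
+ ```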
84
+
85
+ ### Upload Tracking
86
+
87
+ Create a `.pinecone_uploaded` marker file in each standard set directory after successful upload. Before uploading, check for this marker file to skip already-uploaded sets.
88
+
89
+ ### Index Management
90
+
91
+ The Pinecone index should be created manually using the Pinecone CLI before running the upsert command. The index name is configured via environment variable.
92
+
93
+ **Index creation command (run once manually):**
94
+
95
+ ```bash
96
+ pc index create -n common-core-standards -m cosine -c aws -r us-east-1 --model llama-text-embed-v2 --field_map text=content
97
+ ```
98
+
99
+ The upsert CLI command should validate the index exists and fail with helpful instructions if not found.
100
+
101
+ ### File Changes
102
+
103
+ **Files to Delete:**
104
+
105
+ - `tools/data_processor.py` - Outdated local embedding approach, completely replaced
106
+
107
+ **Files to Edit:**
108
+
109
+ - `tools/cli.py` - Add `pinecone-upload` command, update `download-sets` to call processor, remove old `process` command
110
+ - `tools/config.py` - Add Pinecone configuration settings
111
+ - `.env.example` - Add Pinecone environment variables
112
+ - `pyproject.toml` - Remove `sentence-transformers` and `numpy`, add `pinecone`
113
+
114
+ **Files to Create:**
115
+
116
+ - `tools/pinecone_processor.py` - New module for transforming standards to Pinecone format
117
+ - `tools/pinecone_client.py` - New module for Pinecone SDK interactions (upsert, index validation)
118
+
119
+ ### Processing Trigger
120
+
121
+ The `download-sets` command will automatically generate `processed.json` immediately after saving `data.json`. This ensures processed data is always in sync with raw data.
122
+
123
+ ### Handling Missing Optional Fields
124
+
125
+ Per Pinecone best practices, handle missing fields as follows (see the sketch after this list):
126
+
127
+ - **Omit missing string/array fields** from the record entirely. Do not include empty strings or empty arrays for optional fields.
128
+ - **`parent_id`**: Store as `null` for root nodes (do not omit). This explicitly indicates "no parent" vs "unknown parent".
129
+ - **`statement_label`**: If missing in source, omit the field entirely. Do not infer from depth.
130
+ - **`statement_notation`**: If missing, omit from content text parentheses (just use `Depth N: {description}`).
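+
+ A small sketch of this omit-versus-null policy when assembling a record (the function and variable names are illustrative; field names match the schema below):
+
+ ```python
+ def base_record(std: dict, content: str) -> dict:
+     """Build a record, omitting optional fields that are missing instead of writing empty values."""
+     record = {
+         "_id": std["id"],
+         "content": content,
+         "parent_id": std.get("parentId"),  # explicitly None (null) for root nodes, never omitted
+     }
+     # Optional fields: include only when present and non-empty.
+     if std.get("statementLabel"):
+         record["statement_label"] = std["statementLabel"]
+     if std.get("statementNotation"):
+         record["statement_notation"] = std["statementNotation"]
+     if std.get("asnIdentifier"):
+         record["asn_identifier"] = std["asnIdentifier"]
+     return record
+ ```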
131
+
132
+ ### Handling Education Levels
133
+
134
+ The source `educationLevels` field may contain comma-separated values within individual array elements (e.g., `["01,02"]` instead of `["01", "02"]`). Process as follows:
135
+
136
+ 1. **Split comma-separated strings**: For each element in the array, split on commas to extract individual grade levels
137
+ 2. **Flatten**: Combine all split values into a single array
138
+ 3. **Deduplicate**: Remove any duplicate grade level strings
139
+ 4. **Preserve as array**: Store as an array of strings in the output—do NOT join back into a comma-separated string
140
+
141
+ Pinecone metadata supports string lists natively.
142
+
143
+ **Example transformation:**
144
+
145
+ ```python
146
+ # Input: ["01,02", "02", "03"]
147
+ # After split: [["01", "02"], ["02"], ["03"]]
148
+ # After flatten: ["01", "02", "02", "03"]
149
+ # After dedupe: ["01", "02", "03"]
150
+ # Output: ["01", "02", "03"]
151
+ ```
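+
+ A minimal sketch of this normalization (the helper name is hypothetical):
+
+ ```python
+ def normalize_education_levels(levels: list[str]) -> list[str]:
+     """Split comma-separated entries, flatten, and dedupe while preserving first-seen order."""
+     seen: dict[str, None] = {}
+     for entry in levels:
+         for level in entry.split(","):
+             level = level.strip()
+             if level:
+                 seen.setdefault(level, None)
+     return list(seen)
+
+ # normalize_education_levels(["01,02", "02", "03"])  # -> ["01", "02", "03"]
+ ```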
152
+
153
+ **Note:** The `education_levels` value comes from the **standard set** level (`data.educationLevels`), not from individual standards. Individual standards do not have their own education level field. The same education levels are applied to all records from a given standard set to enhance retrieval filtering.
154
+
155
+ ---
156
+
157
+ ## Processed.json Schema
158
+
159
+ Each `processed.json` file contains records ready for Pinecone upsert:
160
+
161
+ ```json
162
+ {
163
+ "records": [
164
+ {
165
+ "_id": "EA60C8D165F6481B90BFF782CE193F93",
166
+ "content": "Depth 0: Geometry\nDepth 1 (1.G.K): Reason with shapes and their attributes.\nDepth 2 (1.G.K.3.Ba): Partition circles and rectangles into two equal shares and:\nDepth 3 (1.G.K.3.Ba.B): Describe the whole as two of the shares.",
167
+ "standard_set_id": "744704BE56D44FB9B3D18B543FBF9BCC_D21218769_grade-01",
168
+ "standard_set_title": "Grade 1",
169
+ "subject": "Mathematics (2021-)",
170
+ "normalized_subject": "Math",
171
+ "education_levels": ["01"],
172
+ "document_id": "D21218769",
173
+ "document_valid": "2021",
174
+ "publication_status": "Published",
175
+ "jurisdiction_id": "744704BE56D44FB9B3D18B543FBF9BCC",
176
+ "jurisdiction_title": "Wyoming",
177
+ "asn_identifier": "S21238682",
178
+ "statement_notation": "1.G.K.3.Ba.B",
179
+ "statement_label": "Benchmark",
180
+ "depth": 3,
181
+ "is_leaf": true,
182
+ "is_root": false,
183
+ "parent_id": "3445678A58C74065B7DF5617B353B89C",
184
+ "root_id": "FE0D33F3287E4137AD66FA3926FAB114",
185
+ "ancestor_ids": [
186
+ "FE0D33F3287E4137AD66FA3926FAB114",
187
+ "386EA56EADD24A209DC2D77A71B2F89B",
188
+ "3445678A58C74065B7DF5617B353B89C"
189
+ ],
190
+ "child_ids": [],
191
+ "sibling_count": 1
192
+ }
193
+ ]
194
+ }
195
+ ```
196
+
197
+ ### Metadata Fields
198
+
199
+ All metadata fields must be flat (no nested objects) per Pinecone requirements. Arrays of strings are allowed.
200
+
201
+ **Standard Set Context:**
202
+
203
+ | Field | Description |
204
+ | -------------------- | --------------------------------------------------------------------------- |
205
+ | `_id` | Standard's unique GUID |
206
+ | `content` | Rich text block with full hierarchy (used for embedding) |
207
+ | `standard_set_id` | ID of the parent standard set |
208
+ | `standard_set_title` | Title of the standard set (e.g., "Grade 1") |
209
+ | `subject` | Full subject name |
210
+ | `normalized_subject` | Normalized subject (e.g., "Math", "ELA") |
211
+ | `education_levels` | Array of grade level strings (e.g., `["01"]` or `["09", "10", "11", "12"]`) |
212
+ | `document_id` | Document ID |
213
+ | `document_valid` | Year the document is valid |
214
+ | `publication_status` | Publication status (e.g., "Published") |
215
+ | `jurisdiction_id` | Jurisdiction GUID |
216
+ | `jurisdiction_title` | Jurisdiction name (e.g., "Wyoming") |
217
+
218
+ **Standard Identity & Position:**
219
+
220
+ | Field | Description |
221
+ | -------------------- | ------------------------------------------------------------ |
222
+ | `asn_identifier` | ASN identifier if available (e.g., "S21238682") |
223
+ | `statement_notation` | Standard notation teachers use (e.g., "1.G.K.3.Ba.B") |
224
+ | `statement_label` | Type of standard if present in source (e.g., "Benchmark") |
225
+ | `depth` | Hierarchy depth level (0 is root, increases with each level) |
226
+ | `is_leaf` | Boolean: true if this standard has no children |
227
+ | `is_root` | Boolean: true if this is a root node (depth=0, no parent) |
228
+
229
+ **Hierarchy Relationships:**
230
+
231
+ | Field | Description |
232
+ | --------------- | ----------------------------------------------------------------- |
233
+ | `parent_id` | Immediate parent's GUID, or `null` for root nodes |
234
+ | `root_id` | Root ancestor's GUID. For root nodes, equals the node's own `_id` |
235
+ | `ancestor_ids` | Array of ancestor GUIDs ordered root→parent (see ordering below) |
236
+ | `child_ids` | Array of direct children's GUIDs ordered by position (see below) |
237
+ | `sibling_count` | Number of siblings (standards with same parent_id), excludes self |
238
+
239
+ **Array Ordering Guarantees:**
240
+
241
+ `ancestor_ids` is ordered from **root (index 0) to immediate parent (last index)**:
242
+
243
+ ```
244
+ ancestor_ids[0] = root ancestor (depth 0)
245
+ ancestor_ids[1] = second level ancestor (depth 1)
246
+ ancestor_ids[2] = third level ancestor (depth 2)
247
+ ...
248
+ ancestor_ids[-1] = immediate parent (depth = current_depth - 1)
249
+ ancestor_ids.length = current standard's depth
250
+ ```
251
+
252
+ Example for a depth-3 standard:
253
+
254
+ ```python
255
+ ancestor_ids = ["ROOT_ID", "DEPTH1_ID", "DEPTH2_ID"]
256
+ # ancestor_ids[0] is the root (depth 0)
257
+ # ancestor_ids[1] is the depth-1 ancestor
258
+ # ancestor_ids[2] is the immediate parent (depth 2)
259
+ # To get ancestor at depth N: ancestor_ids[N]
260
+ ```
261
+
262
+ `child_ids` is ordered by the source `position` field (ascending), preserving the natural document order of standards.
263
+
264
+ ---
265
+
266
+ ## Configuration
267
+
268
+ ### Environment Variables
269
+
270
+ Add to `tools/config.py` and `.env.example`:
271
+
272
+ | Variable | Description | Default |
273
+ | --------------------- | -------------------------- | ----------------------- |
274
+ | `PINECONE_API_KEY` | Pinecone API key | (required) |
275
+ | `PINECONE_INDEX_NAME` | Name of the Pinecone index | `common-core-standards` |
276
+ | `PINECONE_NAMESPACE` | Namespace for records | `standards` |
277
+
278
+ ---
279
+
280
+ ## File Locations
281
+
282
+ - **Original data:** `data/raw/standardSets/{standard_set_id}/data.json`
283
+ - **Processed data:** `data/raw/standardSets/{standard_set_id}/processed.json`
284
+ - **Upload marker:** `data/raw/standardSets/{standard_set_id}/.pinecone_uploaded`
285
+
286
+ ### Upload Marker File Format
287
+
288
+ The `.pinecone_uploaded` marker file contains an ISO 8601 timestamp indicating when the upload completed:
289
+
290
+ ```
291
+ 2025-01-15T14:30:00Z
292
+ ```
293
+
294
+ This allows tracking when each standard set was last uploaded to Pinecone.
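+
+ A sketch of the marker helpers implied here (function names are assumptions; where they ultimately live is decided elsewhere in this sprint):
+
+ ```python
+ from datetime import datetime, timezone
+ from pathlib import Path
+
+ MARKER_NAME = ".pinecone_uploaded"
+
+ def mark_uploaded(set_dir: Path) -> None:
+     """Write the marker with the completion time in ISO 8601 (UTC)."""
+     timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
+     (set_dir / MARKER_NAME).write_text(timestamp + "\n")
+
+ def is_uploaded(set_dir: Path) -> bool:
+     return (set_dir / MARKER_NAME).exists()
+
+ def get_upload_timestamp(set_dir: Path) -> str | None:
+     marker = set_dir / MARKER_NAME
+     return marker.read_text().strip() if marker.exists() else None
+ ```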
295
+
296
+ ---
297
+
298
+ ## Source Data Reference
299
+
300
+ The source data is stored at `data/raw/standardSets/{standard_set_id}/data.json`. Key fields used for transformation:
301
+
302
+ **Standard Set Level:**
303
+
304
+ - `id`, `title`, `subject`, `normalizedSubject`, `educationLevels`
305
+ - `document.id`, `document.valid`, `document.asnIdentifier`, `document.publicationStatus`
306
+ - `jurisdiction.id`, `jurisdiction.title`
307
+
308
+ **Individual Standard Level:**
309
+
310
+ - `id` (GUID)
311
+ - `asnIdentifier`
312
+ - `depth` (hierarchy level, 0 is root)
313
+ - `position` (numeric sort order within the document - used for ordering `child_ids`)
314
+ - `statementNotation` (e.g., "1.G.K.3.Ba.B")
315
+ - `statementLabel` (e.g., "Domain", "Standard", "Benchmark")
316
+ - `description`
317
+ - `ancestorIds` (array of ancestor GUIDs - **order is NOT guaranteed**, must be rebuilt programmatically)
318
+ - `parentId`
319
+
320
+ ---
321
+
322
+ ## CLI Commands
323
+
324
+ ### Updated Command: `download-sets`
325
+
326
+ After downloading `data.json`, automatically call the processor to generate `processed.json` in the same directory. No changes to command arguments.
327
+
328
+ ### New Command: `pinecone-upload`
329
+
330
+ ```
331
+ Usage: cli pinecone-upload [OPTIONS]
332
+
333
+ Upload processed standard sets to Pinecone.
334
+
335
+ Options:
336
+ --set-id TEXT Upload a specific standard set by ID
337
+ --all Upload all downloaded standard sets with processed.json
338
+ --force Re-upload even if .pinecone_uploaded marker exists
339
+ --dry-run Show what would be uploaded without actually uploading
340
+ --batch-size INT Number of records per batch (default: 96)
341
+ ```
342
+
343
+ **Behavior:**
344
+
345
+ - If neither `--set-id` nor `--all` is provided, prompt for confirmation before uploading all
346
+ - Skip sets that have `.pinecone_uploaded` marker unless `--force` is specified
347
+ - Show progress with count of records uploaded
348
+ - On success, create `.pinecone_uploaded` marker file with timestamp
349
+ - On failure, log error and continue with next set (if `--all`)
350
+ - Validate index exists before attempting upload; fail with helpful instructions if not found
351
+
352
+ ### Removed Command: `process`
353
+
354
+ The old `process` command is removed as it used the deprecated local embedding approach.
355
+
356
+ ---
357
+
358
+ ## Dependencies
359
+
360
+ ### Remove from `pyproject.toml`:
361
+
362
+ - `sentence-transformers`
363
+ - `numpy<2`
364
+ - `huggingface_hub`
365
+
366
+ ### Add to `pyproject.toml`:
367
+
368
+ - `pinecone` (current SDK, not `pinecone-client`)
369
+
370
+ ---
371
+
372
+ ## Transformation Algorithm
373
+
374
+ ### Pre-processing: Build Relationship Maps
375
+
376
+ Before processing individual standards, build helper data structures from ALL standards in the set:
377
+
378
+ 1. **ID-to-standard map**: Map of `id` → standard object for lookups
379
+ 2. **Parent-to-children map**: Map of `parentId` → `[child_ids]`, with children **sorted by `position` ascending**
380
+ 3. **Leaf node set**: A standard is a leaf if its `id` does NOT appear as any standard's `parentId`
381
+ 4. **Root identification**: Find all standards where `parentId` is `null`. These are root nodes.
382
+
383
+ **Note on ordering:** The source data includes a `position` field for each standard that defines the natural document order. When building `child_ids`, sort by this `position` value to maintain consistent ordering.
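+
+ A condensed sketch of these helper structures (names and the exact return shape are illustrative):
+
+ ```python
+ def build_relationship_maps(standards: dict[str, dict]) -> tuple[dict, dict, set, set]:
+     """Build id->standard, parent->children (sorted by position), leaf-id, and root-id structures."""
+     id_to_standard = dict(standards)
+
+     parent_to_children: dict[str, list[str]] = {}
+     for std in standards.values():
+         parent_id = std.get("parentId")
+         if parent_id is not None:
+             parent_to_children.setdefault(parent_id, []).append(std["id"])
+     # Children keep the natural document order defined by `position`.
+     for children in parent_to_children.values():
+         children.sort(key=lambda child_id: id_to_standard[child_id].get("position", 0))
+
+     leaf_ids = {std_id for std_id in standards if std_id not in parent_to_children}
+     root_ids = {std_id for std_id, std in standards.items() if std.get("parentId") is None}
+     return id_to_standard, parent_to_children, leaf_ids, root_ids
+ ```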
384
+
385
+ ### Determining root_id
386
+
387
+ **Do NOT rely on the order of `ancestorIds` from the source data.** Instead, programmatically determine the root by walking up the parent chain:
388
+
389
+ ```python
390
+ def find_root_id(standard: dict, id_to_standard: dict[str, dict]) -> str:
391
+ """Walk up the parent chain to find the root ancestor."""
392
+ current = standard
393
+ visited = set() # Prevent infinite loops from bad data
394
+
395
+ while current.get("parentId") is not None:
396
+ parent_id = current["parentId"]
397
+ if parent_id in visited:
398
+ break # Circular reference protection
399
+ visited.add(parent_id)
400
+
401
+ if parent_id not in id_to_standard:
402
+ break # Parent not found, use current as root
403
+ current = id_to_standard[parent_id]
404
+
405
+ return current["id"]
406
+ ```
407
+
408
+ For root nodes themselves (where `parentId` is `null`), `root_id` equals the node's own `_id`.
409
+
410
+ ### Building ancestor_ids in Correct Order
411
+
412
+ Since `ancestorIds` order in source data is NOT guaranteed, rebuild the ancestor chain by walking up the parent chain:
413
+
414
+ ```python
415
+ def build_ordered_ancestors(standard: dict, id_to_standard: dict[str, dict]) -> list[str]:
416
+ """Build ancestor list ordered from root to immediate parent."""
417
+ ancestors = []
418
+ current_id = standard.get("parentId")
419
+ visited = set()
420
+
421
+ while current_id is not None and current_id not in visited:
422
+ visited.add(current_id)
423
+ if current_id in id_to_standard:
424
+ ancestors.append(current_id)
425
+ current_id = id_to_standard[current_id].get("parentId")
426
+ else:
427
+ break
428
+
429
+ ancestors.reverse() # Now ordered root → immediate parent
430
+ return ancestors
431
+ ```
432
+
433
+ ### Processing Each Standard
434
+
435
+ For **EACH** standard in the set (not just leaves), create a record:
436
+
437
+ **Step 1: Compute hierarchy relationships**
438
+
439
+ | Field | How to compute |
440
+ | --------------- | ----------------------------------------------------------------------- |
441
+ | `parent_id` | Copy from source `parentId` (`null` if not present) |
442
+ | `ancestor_ids` | Build using `build_ordered_ancestors()` - ordered root (idx 0) → parent |
443
+ | `root_id` | Use `find_root_id()`. For root nodes, equals own `_id` |
444
+ | `is_root` | `True` if `parentId` is `null` |
445
+ | `child_ids` | Look up in parent-to-children map, **sorted by `position` ascending** |
446
+ | `is_leaf` | `True` if `child_ids` is empty |
447
+ | `sibling_count` | Count of other standards with same `parent_id` (excludes self) |
448
+
449
+ **Step 2: Build the content text block** (see the sketch after these steps)
450
+
451
+ 1. Get ordered ancestors from the computed `ancestor_ids`
452
+ 2. Look up each ancestor in `id_to_standard` map
453
+ 3. Build text lines in order from root to current standard:
454
+ - If `statementNotation` is present: `Depth {depth} ({statementNotation}): {description}`
455
+ - Otherwise: `Depth {depth}: {description}`
456
+ 4. Join all lines with `\n`
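+
+ A sketch of this step in code, assuming the `id_to_standard` map and the ordered `ancestor_ids` computed in Step 1 (the function name is illustrative):
+
+ ```python
+ def build_content_text(standard: dict, ancestor_ids: list[str], id_to_standard: dict[str, dict]) -> str:
+     """Render 'Depth N (notation): description' lines from the root down to this standard."""
+     lines = []
+     for node in [id_to_standard[a] for a in ancestor_ids] + [standard]:
+         notation = node.get("statementNotation")
+         prefix = f"Depth {node['depth']} ({notation}):" if notation else f"Depth {node['depth']}:"
+         lines.append(f"{prefix} {node['description']}")
+     return "\n".join(lines)
+ ```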
457
+
458
+ **Step 3: Set statement_label**
459
+
460
+ - Copy `statementLabel` from source if present
461
+ - If missing in source, **omit the field entirely** (do not infer from depth)
462
+
463
+ ### Example Transformation
464
+
465
+ Given this hierarchy:
466
+
467
+ - Root (depth 0, id "FE0D..."): "Geometry"
468
+ - Child (depth 1, notation "1.G.K"): "Reason with shapes and their attributes."
469
+ - Child (depth 2, notation "1.G.K.3.Ba"): "Partition circles and rectangles into two equal shares and:"
470
+ - Child (depth 3, notation "1.G.K.3.Ba.B"): "Describe the whole as two of the shares."
471
+
472
+ Output `content` for the depth-3 standard:
473
+
474
+ ```
475
+ Depth 0: Geometry
476
+ Depth 1 (1.G.K): Reason with shapes and their attributes.
477
+ Depth 2 (1.G.K.3.Ba): Partition circles and rectangles into two equal shares and:
478
+ Depth 3 (1.G.K.3.Ba.B): Describe the whole as two of the shares.
479
+ ```
480
+
481
+ **For a root node** (depth 0, e.g., "Geometry"):
482
+
483
+ - `is_root`: `true`
484
+ - `root_id`: equals own `_id` (e.g., "FE0D...")
485
+ - `parent_id`: `null`
486
+ - `ancestor_ids`: `[]` (empty array)
487
+ - `content`: `Depth 0: Geometry`
488
+
489
+ ---
490
+
491
+ ## Error Handling
492
+
493
+ ### Processing Errors
494
+
495
+ - **Missing `data.json`**: Skip with warning, continue to next set
496
+ - **Invalid JSON**: Log error with file path and continue to next set
497
+ - **No standards found**: Create `processed.json` with an empty records array and log a warning
498
+
499
+ ### Pinecone API Errors
500
+
501
+ | Error Type | Action |
502
+ | ------------------- | ------------------------------------------------------ |
503
+ | 4xx (client errors) | Fail immediately, do not retry (indicates bad request) |
504
+ | 429 (rate limit) | Retry with exponential backoff |
505
+ | 5xx (server errors) | Retry with exponential backoff |
506
+
507
+ ### Retry Pattern
508
+
509
+ ```python
510
+ import time
511
+ from pinecone.exceptions import PineconeException
512
+
513
+ def exponential_backoff_retry(func, max_retries=5):
514
+ for attempt in range(max_retries):
515
+ try:
516
+ return func()
517
+ except PineconeException as e:
518
+ status_code = getattr(e, 'status', None)
519
+ # Only retry transient errors
520
+ if status_code and (status_code >= 500 or status_code == 429):
521
+ if attempt < max_retries - 1:
522
+ delay = min(2 ** attempt, 60) # Cap at 60s
523
+ time.sleep(delay)
524
+ else:
525
+ raise
526
+ else:
527
+ raise # Don't retry client errors
528
+ ```
529
+
530
+ ### Upload Failure Recovery
531
+
532
+ - On batch failure, log which standard set and batch failed
533
+ - Continue with remaining sets if `--all` flag is used
534
+ - Do NOT create `.pinecone_uploaded` marker for failed sets
535
+ - User can re-run command to retry failed sets
536
+
537
+ ---
538
+
539
+ ## Pinecone SDK Requirements
540
+
541
+ ### Installation
542
+
543
+ ```bash
544
+ pip install pinecone # ✅ Correct (current SDK)
545
+ # NOT: pip install pinecone-client # ❌ Deprecated
546
+ ```
547
+
548
+ ### Initialization
549
+
550
+ ```python
551
+ from pinecone import Pinecone
552
+ import os
553
+
554
+ api_key = os.getenv("PINECONE_API_KEY")
555
+ if not api_key:
556
+ raise ValueError("PINECONE_API_KEY environment variable not set")
557
+
558
+ pc = Pinecone(api_key=api_key)
559
+ index = pc.Index("common-core-standards")
560
+ ```
561
+
562
+ ### Index Validation
563
+
564
+ Use SDK to check index exists before upload:
565
+
566
+ ```python
567
+ if not pc.has_index(index_name):
568
+ # Fail with helpful message including the CLI command to create index
569
+ raise ValueError(f"Index '{index_name}' not found. Create it with:\n"
570
+ f"pc index create -n {index_name} -m cosine -c aws -r us-east-1 "
571
+ f"--model llama-text-embed-v2 --field_map text=content")
572
+ ```
573
+
574
+ ### Upserting Records
575
+
576
+ **Critical**: Use `upsert_records()` for indexes with integrated embeddings, NOT `upsert()`:
577
+
578
+ ```python
579
+ # ✅ Correct - for integrated embeddings
580
+ index.upsert_records(namespace, records)
581
+
582
+ # ❌ Wrong - this is for pre-computed vectors
583
+ index.upsert(vectors=...)
584
+ ```
585
+
586
+ ### Batch Processing
587
+
588
+ ```python
589
+ def batch_upsert(index, namespace, records, batch_size=96):
590
+ """Upsert records in batches with rate limiting."""
591
+ for i in range(0, len(records), batch_size):
592
+ batch = records[i:i + batch_size]
593
+ exponential_backoff_retry(
594
+ lambda b=batch: index.upsert_records(namespace, b)
595
+ )
596
+ time.sleep(0.1) # Rate limiting between batches
597
+ ```
598
+
599
+ ### Key Constraints
600
+
601
+ | Constraint | Limit | Notes |
602
+ | ------------------- | ------------------------------------------ | ---------------------------------------- |
603
+ | Text batch size | 96 records | Also 2MB total per batch |
604
+ | Metadata per record | 40KB | Flat JSON only |
605
+ | Metadata types | strings, ints, floats, bools, string lists | No nested objects |
606
+ | Consistency | Eventually consistent | Wait ~1-5s after upsert before searching |
607
+
608
+ ### Record Format
609
+
610
+ Records must have:
611
+
612
+ - `_id`: Unique identifier (string)
613
+ - `content`: Text field for embedding (must match `field_map` in index config)
614
+ - Additional flat metadata fields (no nesting)
615
+
616
+ ```python
617
+ record = {
618
+ "_id": "EA60C8D165F6481B90BFF782CE193F93",
619
+ "content": "Depth 0: Geometry\nDepth 1 (1.G.K): ...", # Embedded by Pinecone
620
+ "subject": "Mathematics", # Flat metadata
621
+ "jurisdiction_title": "Wyoming",
622
+ "depth": 3, # int allowed
623
+ "is_root": False, # bool allowed
624
+ "parent_id": "3445678A...", # null for root nodes
625
+ }
626
+ ```
627
+
628
+ ### Common Mistakes to Avoid
629
+
630
+ 1. **Nested metadata**: Will cause API errors
631
+
632
+ ```python
633
+ # ❌ Wrong
634
+ {"user": {"name": "John"}}
635
+ # ✅ Correct
636
+ {"user_name": "John"}
637
+ ```
638
+
639
+ 2. **Hardcoded API keys**: Always use environment variables
640
+
641
+ 3. **Missing namespace**: Always specify namespace in all operations
642
+
643
+ 4. **Wrong upsert method**: Use `upsert_records()` not `upsert()` for integrated embeddings
644
+
645
+ ---
646
+
647
+ ## Assumptions and Dependencies
648
+
649
+ ### Assumptions
650
+
651
+ - Pinecone free tier limits are sufficient for initial dataset
652
+ - The index has been created manually via Pinecone CLI before running upload
653
+ - API key has been configured in environment
654
+
655
+ ### Dependencies
656
+
657
+ - Python 3.12+
658
+ - `pinecone` SDK (current version, 2025)
659
+ - Pinecone account with API key
660
+ - Network access to Pinecone API
.agent/specs/001_pinecone/tasks.md ADDED
@@ -0,0 +1,109 @@
1
+ # Spec Tasks
2
+
3
+ Tasks for implementing the Pinecone Integration Sprint as defined in `spec.md`.
4
+
5
+ Recommended execution order:
6
+ - Task 1 (Setup) - foundation
7
+ - Task 2 (Models) - data structures
8
+ - Tasks 3-7 (Processor) - in sequence
9
+ - Tasks 8-9 (Client) - can run in parallel with Tasks 3-7
10
+ - Tasks 10-12 (CLI) - depends on the processor and client
11
+
12
+ ---
13
+
14
+ ## Tasks
15
+
16
+ - [x] 1. **Project Setup & Configuration**
17
+
18
+ - [x] 1.1 Update `pyproject.toml`: Remove `sentence-transformers`, `numpy<2`, `huggingface_hub`; add `pinecone`
19
+ - [x] 1.2 Update `tools/config.py`: Add `pinecone_api_key`, `pinecone_index_name` (default: `common-core-standards`), `pinecone_namespace` (default: `standards`) settings
20
+ - [x] 1.3 Create/update `.env.example`: Add `PINECONE_API_KEY`, `PINECONE_INDEX_NAME`, `PINECONE_NAMESPACE` variables
21
+ - [x] 1.4 Delete `tools/data_processor.py` (outdated local embedding approach)
22
+ - [x] 1.5 Run `pip install -e .` to verify dependencies resolve correctly
23
+
24
+ - [x] 2. **Pydantic Models for Processed Records**
25
+
26
+ - [x] 2.1 Create `tools/pinecone_models.py` with `PineconeRecord` model containing all fields from processed.json schema
27
+ - [x] 2.2 Define `ProcessedStandardSet` model with `records: list[PineconeRecord]` for the output file structure
28
+ - [x] 2.3 Add field validators for `education_levels` (split comma-separated, flatten, dedupe)
29
+ - [x] 2.4 Add `model_config` with `json_encoders` for proper null handling of `parent_id`
30
+ - [x] 2.5 Write unit tests for model validation and education_levels processing
31
+
32
+ - [x] 3. **Pinecone Processor - Relationship Maps**
33
+
34
+ - [x] 3.1 Create `tools/pinecone_processor.py` with `StandardSetProcessor` class
35
+ - [x] 3.2 Implement `_build_id_to_standard_map()`: Map of `id` → standard object
36
+ - [x] 3.3 Implement `_build_parent_to_children_map()`: Map of `parentId` → `[child_ids]` sorted by `position` ascending
37
+ - [x] 3.4 Implement `_identify_leaf_nodes()`: Set of IDs that are NOT any standard's `parentId`
38
+ - [x] 3.5 Implement `_identify_root_nodes()`: Set of IDs where `parentId` is `null`
39
+ - [x] 3.6 Write unit tests for relationship map building with sample data
40
+
41
+ - [x] 4. **Pinecone Processor - Hierarchy Functions**
42
+
43
+ - [x] 4.1 Implement `find_root_id()`: Walk up parent chain with circular reference protection
44
+ - [x] 4.2 Implement `build_ordered_ancestors()`: Build ancestor list ordered root (idx 0) → immediate parent
45
+ - [x] 4.3 Implement `_compute_sibling_count()`: Count standards with same `parent_id`, excluding self
46
+ - [x] 4.4 Write unit tests for hierarchy functions with various depth levels (0, 1, 3+)
47
+
48
+ - [x] 5. **Pinecone Processor - Content Generation**
49
+
50
+ - [x] 5.1 Implement `_build_content_text()`: Generate `Depth N (notation): description` format
51
+ - [x] 5.2 Handle missing `statementNotation` (omit parentheses)
52
+ - [x] 5.3 Handle root nodes (single line `Depth 0: description`)
53
+ - [x] 5.4 Write unit tests for content generation with various hierarchy depths
54
+
55
+ - [x] 6. **Pinecone Processor - Record Transformation**
56
+
57
+ - [x] 6.1 Implement `_transform_standard()`: Convert single standard to `PineconeRecord`
58
+ - [x] 6.2 Extract standard set context fields (subject, jurisdiction, document, education_levels)
59
+ - [x] 6.3 Compute all hierarchy fields (is_leaf, is_root, parent_id, root_id, ancestor_ids, child_ids, sibling_count)
60
+ - [x] 6.4 Handle optional fields (omit if missing: statement_label, statement_notation, asn_identifier)
61
+ - [x] 6.5 Implement `process_standard_set()`: Main entry point that processes all standards and returns `ProcessedStandardSet`
62
+ - [x] 6.6 Write integration test with real `data.json` sample file
63
+
64
+ - [x] 7. **Pinecone Processor - File Operations**
65
+
66
+ - [x] 7.1 Implement `process_and_save()`: Load `data.json`, process, save `processed.json`
67
+ - [x] 7.2 Add error handling for missing `data.json` (skip with warning)
68
+ - [x] 7.3 Add error handling for invalid JSON (log error, continue)
69
+ - [x] 7.4 Add logging for processing progress and record counts
70
+ - [x] 7.5 Write integration test for file operations
71
+
72
+ - [x] 8. **Pinecone Client - Core Functions**
73
+
74
+ - [x] 8.1 Create `tools/pinecone_client.py` with `PineconeClient` class
75
+ - [x] 8.2 Implement `__init__()`: Initialize Pinecone SDK from config settings
76
+ - [x] 8.3 Implement `validate_index()`: Check index exists with `pc.has_index()`, raise helpful error if not
77
+ - [x] 8.4 Implement `exponential_backoff_retry()`: Retry on 429/5xx, fail on 4xx
78
+ - [x] 8.5 Implement `batch_upsert()`: Upsert records in batches of 96 with rate limiting
79
+
80
+ - [x] 9. **Pinecone Client - Upload Tracking**
81
+
82
+ - [x] 9.1 Implement `is_uploaded()`: Check for `.pinecone_uploaded` marker file
83
+ - [x] 9.2 Implement `mark_uploaded()`: Create marker file with ISO 8601 timestamp
84
+ - [x] 9.3 Implement `get_upload_timestamp()`: Read timestamp from marker file
85
+ - [x] 9.4 Write unit tests for upload tracking functions
86
+
87
+ - [x] 10. **CLI - Update download-sets Command**
88
+
89
+ - [x] 10.1 Import `pinecone_processor` in `tools/cli.py`
90
+ - [x] 10.2 After successful `download_standard_set()` call, invoke `process_and_save()` for single set download
91
+ - [x] 10.3 After successful bulk download loop, invoke `process_and_save()` for each downloaded set
92
+ - [x] 10.4 Add console output showing processing status
93
+ - [x] 10.5 Update `list` command to show processing status (processed.json exists)
94
+
95
+ - [x] 11. **CLI - Remove Old Process Command**
96
+
97
+ - [x] 11.1 Remove the `process` command function from `tools/cli.py`
98
+ - [x] 11.2 Remove `data_processor` import from `tools/cli.py`
99
+ - [x] 11.3 Update `list` command if it references old processing status
100
+
101
+ - [x] 12. **CLI - Implement pinecone-upload Command**
102
+ - [x] 12.1 Add `pinecone-upload` command with options: `--set-id`, `--all`, `--force`, `--dry-run`, `--batch-size`
103
+ - [x] 12.2 Implement set discovery: Find all standard sets with `processed.json`
104
+ - [x] 12.3 Implement upload filtering: Skip sets with `.pinecone_uploaded` unless `--force`
105
+ - [x] 12.4 Implement `--dry-run`: Show what would be uploaded without uploading
106
+ - [x] 12.5 Implement upload execution: Call `PineconeClient.batch_upsert()` for each set
107
+ - [x] 12.6 Add progress output with record counts
108
+ - [x] 12.7 Handle upload failures: Log error, continue with next set if `--all`, don't create marker
109
+ - [x] 12.8 Write manual test script to verify end-to-end upload flow
.cursor/commands/complete_tasks.md ADDED
@@ -0,0 +1,4 @@
1
+ Please implement the tasks for the following section(s) in the provided task document.
2
+ The provided spec document includes documentation for completing these tasks. Use this to understand the overall context of this unit of work and the overall sprint.
3
+ Complete only the tasks in the indicated section(s). Do not move onto other sections and their tasks.
4
+ When you have completed all the work for the section(s), mark the section(s) and all associated tasks complete. Do this by editing the provided tasks.md and marking the checkboxes with an 'x' (`- [x] Some section or task ...`).
.cursor/commands/create_tasks.md ADDED
@@ -0,0 +1,48 @@
1
+ Generate tasks from a spec document. The tasks should be broken into numbered sections with clear, cohesive, focused units of work. The generated tasks document must fully cover the requirements specified in the spec document.
2
+
3
+ <guidelines>
4
+ - One unit of work per section: Each section delivers a single, clearly defined outcome an LLM can implement end-to-end in one pass.
5
+ - Timebox: Target 30–40 minutes of dev work.
6
+ - Size and focus: 3–7 concrete subtasks per section. Avoid grab-bag sections.
7
+ - Boundaries: Keep sections and tasks focused on a small subset of files, unless edits are simple and tightly related.
8
+ </guidelines>
9
+
10
+ <file_template>
11
+ <header>
12
+ # Spec Tasks
13
+ </header>
14
+ </file_template>
15
+
16
+ <task_structure>
17
+ <major_tasks>
18
+ - count: 1-12
19
+ - format: numbered checklist
20
+ - grouping: by feature or component
21
+ </major_tasks>
22
+ <subtasks>
23
+ - count: up to 8 per major task
24
+ - format: decimal notation (1.1, 1.2)
25
+ </subtasks>
26
+ </task_structure>
27
+
28
+ <task_template>
29
+ ## Tasks
30
+
31
+ - [ ] 1. [MAJOR_TASK_DESCRIPTION]
32
+ - [ ] 1.1 [IMPLEMENTATION_STEP]
33
+ - [ ] 1.2 [IMPLEMENTATION_STEP]
34
+ - [ ] 1.3 [IMPLEMENTATION_STEP]
35
+
36
+ - [ ] 2. [MAJOR_TASK_DESCRIPTION]
37
+ - [ ] 2.1 [IMPLEMENTATION_STEP]
38
+ - [ ] 2.2 [IMPLEMENTATION_STEP]
39
+ </task_template>
40
+
41
+ <ordering_principles>
42
+ - Consider technical dependencies
43
+ - Follow TDD approach
44
+ - Group related functionality
45
+ - Build incrementally
46
+ </ordering_principles>
47
+
.cursor/commands/finalize_spec.md ADDED
@@ -0,0 +1,21 @@
1
+ Convert the spec_draft document into a final draft in the spec.md file.
2
+
3
+ The spec acts as the comprehensive source of truth for this sprint and should include all the necessary context and technical details to implement this sprint. It should leave no ambiguity for important details necessary to properly implement the changes required.
4
+
5
+ The spec.md will act as a reference for an LLM coding agent responsible for completing this sprint.
6
+
7
+ The spec should include the following information if applicable:
8
+ - An overview of the changes implemented in this sprint.
9
+ - User stories for the new functionality, if applicable.
10
+ - An outline of any new data models proposed.
11
+ - Any other technical details determined in the spec_draft or related conversations.
12
+ - Specific filepaths for any files that need to be added, edited, or deleted as part of this sprint.
13
+ - Specific files or modules relevant to this sprint.
14
+ - Details on how things should function, such as a function, workflow, or other process.
15
+ - Describe what any new functions, services, etc. are supposed to do.
16
+ - Any reasoning or rationale behind decisions, preferences, or changes that provides context for the sprint and its changes.
17
+ - Any other information required to properly understand this sprint, the desired changes, the expected deliverables, or important technical details.
18
+
19
+ Strive to retain all the final decisions and implementation details provided in the spec_draft and related conversations. Cleaning and organizing these raw notes is desirable, but do not exclude or leave out information provided in the spec_draft if it is relevant to this sprint. If there is information in the spec_draft that is outdated and negated or revised by further direction in the draft or related conversation, you should leave that stale information out of the final spec.
20
+
21
+ The spec should have all the information a junior developer needs to complete this sprint. They should be able to independently find answers to any questions they have about this sprint and how to implement it in this document.
.cursor/commands/spec_draft.md ADDED
@@ -0,0 +1,122 @@
1
+ I am working on developing a comprehensive spec document for the next development sprint.
2
+
3
+ <goal>
4
+ Solidify the current spec_draft document into a comprehensive specification for the next development sprint through iterative refinement.
5
+
6
+ The spec draft represents the rough notes and ideas for the next sprint. These notes are likely incomplete and require additional details and decisions to obtain sufficient information to move forward with the sprint.
7
+
8
+ READ: @.cursor/commands/finalize_spec.md to see the complete requirements for the finalized spec. The goal is to reach the level of specificity and clarity required to create this final spec.
9
+ </goal>
10
+
11
+ <process>
12
+ <overview>
13
+ Iteratively carry out the following steps to progressively refine the requirements for this sprint. Use `Requests for Input` only to gather information that cannot be inferred from the user's selection of a Recommendation; do not ask to confirm details already specified by a selected option. The initial `spec_draft` may be a loose assortment of notes, ideas, and thoughts; treat it accordingly in the first round.
14
+
15
+ First round: produce a response that includes Recommendations and Requests for Input. The user will reply by selecting exactly one option per Recommendation (or asking for refinement if none fit) and answering only those questions that cannot be inferred from selected options.
16
+
17
+ After each user response: update the `spec_draft` to incorporate the selected options with minimal, focused edits. Remove any conflicting or superseded information made obsolete by the selection. Avoid unrelated formatting or editorial changes.
18
+
19
+ Repeat this back-and-forth until ambiguity is removed and the draft aligns with the requirements in `@.cursor/commands/finalize_spec.md`.
20
+ </overview>
21
+
22
+ <steps>
23
+ - READ the spec_draft.
24
+ - IDENTIFY anything in the spec draft that is confusing, conflicting, unclear, or missing. Identify important decisions that need to be made.
25
+ - REVIEW the current state of the project to fully understand how these new requirements fit into what already exists.
26
+ - RECOMMEND specific additions or updates to the draft spec to resolve confusion, add clarity, fill gaps, or add specificity. Recommendations may provide a single option when appropriate or multiple options when needed. Each Recommendation expects selection of one and only one option by the user.
27
+ - ASK targeted questions to acquire details, decisions, or preferences from the user.
28
+ - APPLY the user's selections: make minimal, localized edits to the `spec_draft` to incorporate the chosen options and remove conflicting content. Incorporate all information contained in the selected options; do not omit details. Do not change unrelated text, structure, or formatting.
29
+ - REFINE: if the user rejects the provided options, revise the Recommendations based on feedback and repeat selection and apply.
30
+ </steps>
31
+
32
+ <end_conditions>
33
+ - Continue this process until the draft is unambiguous and conforms to `@.cursor/commands/finalize_spec.md`, or the user directs you to do otherwise.
34
+ - Do not stop after a single round unless the draft already satisfies all requirements in `@.cursor/commands/finalize_spec.md`.
35
+ </end_conditions>
36
+ </process>
37
+
38
+ <response>
39
+ <overview>
40
+ Your responses should be focused on providing clear, concrete recommendations for content to add to the spec draft to resolve ambiguity, add specificity, and increase clarity for the sprint. The options you provide in your recommendations should provide complete content that can be incorporated into the spec draft. For each Recommendation, expect the user to select exactly one option; Recommendations may include a single option when appropriate. If no option fits, the user may request refinement. If you do not have sufficient understanding of the user's intent or the meaning of some element of the spec draft, use `Request for Input` sections to ask targeted questions of the user. Only ask for information that cannot be inferred from the user's selection of a Recommendation. Do not ask to confirm details already encoded in an option (e.g., if Option 1.1 specifies renaming a file to `foo.py`, do not ask to confirm that rename).
41
+
42
+ Using incrementing section numbers is essential for helping the user quickly reference specific options or questions in their responses.
43
+ Responses must strictly follow the Format section. Include only the specified sections and no additional commentary or subsections.
44
+ The agent is responsible for updating the spec draft after each user response.
45
+ </overview>
46
+
47
+ <guidelines>
48
+ - Break recommendations and requests for input into related sections to provide concrete options or ask targeted questions to the user.
49
+ - Focus sections on a specific, concrete decision or unit of work related to the sprint outlined in the spec draft.
50
+ - Recommendations may provide one or more options; when multiple options are presented, the user must select exactly one.
51
+ - `Requests for Input` may include one or more questions, but only for details that cannot be derived from the selected option(s).
52
+ - Do not ask confirmation questions about facts stated by options; assume the selected option is authoritative.
53
+ - Use numbered sections that increment.
54
+ - Use incrementing decimals for recommendation options and request for input questions.
55
+ - After the user selects options, apply minimal, focused edits to the `spec_draft` reflecting only those selections. Remove conflicting or superseded content. Avoid broad formatting or editorial changes to unrelated content.
56
+ - Do not clutter options or questions with information already clear and unambiguous from the current draft.
57
+ - Do not add subsections beyond those defined in the Format.
58
+ </guidelines>
59
+
60
+ <format>
61
+
62
+ # Recommendations
63
+ ## 1: Section Title
64
+ Short overview providing background on the section.
65
+
66
+ **Option 1.1**
67
+ Specifics of the first option.
68
+
69
+ **Option 1.2**
70
+ Specifics of the second option.
71
+
72
+ ## 2: Section Title
73
+ Short overview providing background on the section.
74
+
75
+ **Option 2.1**
76
+ Specifics of the first option.
77
+
78
+ # Request for Input
79
+ ## 3: Section Title
80
+ Short overview providing background on the section.
81
+
82
+ **Questions**
83
+ - 3.1 Some question.
84
+ - 3.2 Another question.
85
+
86
+ </format>
87
+ <user_selection_format>
88
+ Respond by indicating a single selection per Recommendation, e.g.: `Select 1.2, 2.1`. If no option fits, reply with `Refine 1:` followed by feedback to guide revised options. You may also answer targeted questions under `Request for Input` inline.
89
+
90
+ Example mixed selections and answers:
91
+
92
+ ```text
93
+ 1.1 OK
94
+ 2: Clarifying question from the user?
95
+ 3.1 OK
96
+ 4.1 OK
97
+ 5.1 OK
98
+ 6: Answer to the specific question.
99
+ 7 Directions that indicate the user's preference in response to the question.
100
+ 8 Clear directive in response to the question.
101
+ ```
102
+ </user_selection_format>
103
+
104
+ <selection_and_editing_rules>
105
+ - One and only one option must be selected per Recommendation. If none fit, request refinement.
106
+ - Apply edits narrowly: change only text directly impacted by the chosen option(s).
107
+ - Incorporate all information from the selected options into the draft.
108
+ - Remove or rewrite conflicting statements made obsolete by the selection.
109
+ - Preserve unrelated content and overall formatting; do not perform wide editorial passes.
110
+ </selection_and_editing_rules>
111
+ </response>
112
+
113
+ <guardrails>
114
+ - Only edit the draft to apply selected options and answers. Do not edit code or any other files.
115
+ </guardrails>
116
+
117
+ <finalize_spec_compliance_checklist>
118
+ - [ ] All information required by @.cursor/commands/finalize_spec.md is present.
119
+ - [ ] Requirements are testable and unambiguous.
120
+ - [ ] Risks, dependencies, and assumptions captured.
121
+ - [ ] Approval received.
122
+ </finalize_spec_compliance_checklist>
.cursor/rules/standards/best_practices.mdc ADDED
@@ -0,0 +1,37 @@
1
+ ---
2
+ alwaysApply: true
3
+ ---
4
+ # Development Best Practices
5
+
6
+ ## Core Principles
7
+
8
+ ### Keep It Simple
9
+ - Implement code in the fewest lines possible
10
+ - Avoid over-engineering solutions
11
+ - Choose straightforward approaches over clever ones
12
+
13
+ ### Optimize for Readability
14
+ - Prioritize code clarity over micro-optimizations
15
+ - Write self-documenting code with clear variable names
16
+ - Add comments for "why" not "what"
17
+
18
+ ### DRY (Don't Repeat Yourself)
19
+ - Extract repeated business logic to private methods
20
+ - Extract repeated UI markup to reusable components
21
+ - Create utility functions for common operations
22
+
23
+ ### File Structure
24
+ - Keep files focused on a single responsibility
25
+ - Group related functionality together
26
+ - Use consistent naming conventions
27
+
28
+ ## Dependencies
29
+
30
+ ### Choose Libraries Wisely
31
+ When adding third-party dependencies:
32
+ - Select the most popular and actively maintained option
33
+ - Check the library's GitHub repository for:
34
+ - Recent commits (within last 6 months)
35
+ - Active issue resolution
36
+ - Number of stars/downloads
37
+ - Clear documentation
.cursor/rules/standards/cli/overview.mdc ADDED
@@ -0,0 +1,54 @@
1
+ ---
2
+ description: Defines rules for defining CLI functionality.
3
+ alwaysApply: false
4
+ ---
5
+
6
+ # CLI Guidance & Best Practices
7
+
8
+ - The main entrypoint for using the CLI in this project is `cli.py`; the sole responsibility of this file is to import the CLI app and expose it for execution. NO OTHER CODE SHOULD EXIST IN THIS FILE!
9
+ - The root-level app for defining the main CLI and its sub-commands is `src/cli.py`. This file should import sub-sections of the CLI from other modules as well as provide easy-access, root-level commands for the most common workflows.
10
+ - Sub-modules that expose CLI functionality should do so by defining their own `cli.py` or `cli_{subgroup}.py` file(s).
11
+ - If a single module has multiple natural groupings of CLI commands, create multiple `cli_{subgroup}.py` files rather than one complex `cli.py` of unrelated commands; this keeps related content organized and easy to find.
12
+ - Module-specific `cli.py` or `cli_{subgroup}.py` files should export a typer app with a collection of sub-commands. Import and attach these groups of commands to the main cli in `src/cli.py`.
13
+ - Place expensive imports at the top of each typer command's body rather than at module level, so they are not executed before the CLI starts.
14
+
15
+ # Example Template
16
+ ```python
17
+ # Example root-level src/cli.py
18
+ from __future__ import annotations
19
+
20
+ import typer
21
+
22
+ # Import groups of commands from sub-modules exported as typer apps.
23
+ from src.first_mod.cli import first_mod_app
24
+ from src.second_mod.cli import second_mod_app
25
+
26
+ # Define the root-level app.
27
+ app = typer.Typer()
28
+
29
+
30
+ # Attach the sub-commands.
31
+ app.add_typer(first_mod_app, name="first_mod")
32
+ app.add_typer(second_mod_app, name="second_mod")
33
+
34
+
35
+ # Add root-level commands.
36
+ @app.command()
37
+ def my_command(
38
+ flag: bool = typer.Option(False, "--flag", help="Flag for some option."),
39
+ ) -> None:
40
+ """Root level command."""
41
+ # Print statement to provide rapid feedback to the user, before expensive imports.
42
+ print("Starting my command.")
43
+
44
+ # Import command-level modules inside the command to avoid expensive imports before executing commands.
45
+ from src.my_mod import my_function_with_expensive_imports
46
+
47
+ for i in range(10):
48
+ my_function_with_expensive_imports(toggle_arg=flag)
49
+
50
+
51
+ if __name__ == "__main__":
52
+ app()
53
+
54
+ ```
.cursor/rules/standards/code_style/emojis.mdc ADDED
@@ -0,0 +1,11 @@
1
+ ---
2
+ description: Guidelines for the use of emojis.
3
+ alwaysApply: false
4
+ ---
5
+
6
+ # Emoji Usage
7
+
8
+ - **NEVER use emojis** in code, user-facing messages, comments, or UI elements
9
+ - **ALWAYS use professional icon sets** appropriate to the technology and context (e.g., Material icons `:material/icon_name:` in Streamlit)
10
+ - If no appropriate icon exists in the available icon set, omit the icon rather than using an emoji
11
+ - Keep user-facing messages clear and professional without decorative elements
.cursor/rules/standards/code_style/pydantic.mdc ADDED
@@ -0,0 +1,72 @@
1
+ ---
2
+ description: Rules for implementing Pydantic models in Python.
3
+ alwaysApply: false
4
+ ---
5
+
6
+ ## Type Safety and Data Validation
7
+
8
+ - **Always use Pydantic models for structured data**: Wrap JSON, dictionaries, or untyped data in Pydantic models. Never pass raw dictionaries or JSON strings when a typed model can be used.
9
+ - **Use the most specific model type**: Use the most specific Pydantic model type in function signatures. Only use union types or `BaseModel` when the function truly handles multiple types. Example: prefer `RoleOverviewUpdate` over `ProposalContent` if only handling role updates.
10
+ - **Single source of truth**: Define Pydantic models once (typically in database layer for persisted data) and import elsewhere. Avoid duplicating model definitions.
11
+ - **Database boundary pattern**: Serialization happens at the database boundary. Database models provide `parse_*`/`serialize_*` methods. Application code works with typed models, not JSON strings or dictionaries.
12
+
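+ A minimal sketch of the database boundary pattern (the model and method names here are illustrative, not an existing project API):
+
+ ```python
+ from __future__ import annotations
+
+ from pydantic import BaseModel
+
+
+ class UserProfile(BaseModel):
+     """Typed model used throughout application code."""
+
+     name: str
+     interests: list[str] = []
+
+
+ class UserProfileRow(BaseModel):
+     """Hypothetical database-layer model that owns (de)serialization."""
+
+     id: int
+     profile_json: str
+
+     def parse_profile(self) -> UserProfile:
+         # The database boundary is the only place JSON strings exist.
+         return UserProfile.model_validate_json(self.profile_json)
+
+     @classmethod
+     def serialize_profile(cls, row_id: int, profile: UserProfile) -> UserProfileRow:
+         return cls(id=row_id, profile_json=profile.model_dump_json())
+
+
+ # Application code only ever touches the typed model.
+ row = UserProfileRow.serialize_profile(1, UserProfile(name="Ada", interests=["math"]))
+ assert row.parse_profile().name == "Ada"
+ ```
+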
13
+ ## Model Definition
14
+
15
+ - Prefer the simpler syntax for default values in Pydantic models, over the more verbose `Field` notation whenever possible.
16
+
17
+ ```python
18
+ from pydantic import BaseModel
19
+
20
+
21
+ class Model(BaseModel):
22
+ # Use simple syntax for basic, mutable values.
23
+ # Pydantic creates a deep copy so this is safe.
24
+ item_counts: list[dict[str, int]] = [{}]
25
+ # Use simple `= {default}` syntax for basic values too.
26
+ some_number: int = 42
27
+ ```
28
+
29
+ - Prefer the use of `Annotated` to attach additional metadata to models.
30
+
31
+ ```python
32
+ from typing import Annotated
33
+
34
+ from pydantic import BaseModel, Field, WithJsonSchema
35
+
36
+
37
+ class Model(BaseModel):
38
+ name: Annotated[str, Field(strict=True), WithJsonSchema({'extra': 'data'})]
39
+ ```
40
+
41
+ - However, note that certain arguments to the `Field()` function (namely `default`, `default_factory`, and `alias`) are taken into account by static type checkers to synthesize a correct `__init__` method. The `Annotated` pattern is not understood by them, so use the normal assignment form for those arguments instead.
42
+
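+ For example, a small sketch of that distinction: keep `default` and `alias` in the assignment form so type checkers see them, while other metadata can stay in `Annotated`.
+
+ ```python
+ from typing import Annotated
+
+ from pydantic import BaseModel, Field
+
+
+ class Item(BaseModel):
+     # Assignment form: `default` and `alias` are understood by static type checkers.
+     quantity: int = Field(default=1, alias="qty")
+     # Annotated form: fine for constraints and other metadata without defaults.
+     name: Annotated[str, Field(min_length=1)]
+ ```
+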
43
+ - Use a discriminator in scenarios with mixed types where there is a need to differentiate between the types.
44
+
45
+ ```python
46
+ # Example of using a discriminator to identify specific models in a mixed-type situation.
47
+ from typing import Annotated, Literal
48
+
49
+ from pydantic import BaseModel, Discriminator, Field, Tag
50
+
51
+
52
+ class Cat(BaseModel):
53
+ pet_type: Literal['cat']
54
+ age: int
55
+
56
+
57
+ class Dog(BaseModel):
58
+ pet_kind: Literal['dog']
59
+ age: int
60
+
61
+
62
+ def pet_discriminator(v):
63
+ if isinstance(v, dict):
64
+ return v.get('pet_type', v.get('pet_kind'))
65
+ return getattr(v, 'pet_type', getattr(v, 'pet_kind', None))
66
+
67
+
68
+ class Model(BaseModel):
69
+ pet: Annotated[Cat, Tag('cat')] | Annotated[Dog, Tag('dog')] = Field(
70
+ discriminator=Discriminator(pet_discriminator)
71
+ )
72
+ ```
.cursor/rules/standards/code_style/python.mdc ADDED
@@ -0,0 +1,71 @@
1
+ ---
2
+ globs: **/*.py
3
+ alwaysApply: false
4
+ ---
5
+
6
+ # General
7
+
8
+ - This project uses Python 3.12. Follow best practices and coding standards for this version.
9
+ - This project is in a Python virtual environment. Remember to activate the environment in `.venv` before executing commands that rely on Python.
10
+ - This project uses pip and pyproject.toml. All dependencies should be declared in pyproject.toml.
11
+
12
+ ## Type Annotations
13
+
14
+ - **ALWAYS** include full type annotations for all function parameters and return types
15
+ - **EXCEPTION**: Test functions do not need return type annotations (they can use `-> None` or omit entirely)
16
+ - Use typing standards for Python 3.12+, such as using `list` instead of `List` and `str | None` instead of `Optional[str]`.
17
+ - Use `from __future__ import annotations` at the top of files for forward references
18
+ - Whenever an if/elif chain handles an enum or a set of literals where every value must be handled, use `assert_never` in the final `else` branch to catch missed values (see the sketch after this list).
19
+ - Always use the new syntax for defining generics rather than using TypeVars and Generic.
20
+ - **Prefer typed models over dictionaries**: When functions accept or return structured data, use Pydantic models instead of `dict` types. See `code_style/pydantic.mdc` for details.
21
+ - **Use the most specific type possible**: Use the most specific type that matches what the function handles. Only use generic types (`dict`, `BaseModel`, union types) when the function truly needs to handle multiple types. Example: prefer `AchievementAdd` over `ProposalContent` if the function only handles achievement additions.
22
+
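+ A small sketch of both points: `typing.assert_never` for exhaustive handling and the Python 3.12 type-parameter syntax instead of `TypeVar`/`Generic`.
+
+ ```python
+ from __future__ import annotations
+
+ from enum import Enum
+ from typing import assert_never
+
+
+ class Color(Enum):
+     RED = "red"
+     BLUE = "blue"
+
+
+ def describe(color: Color) -> str:
+     if color is Color.RED:
+         return "warm"
+     elif color is Color.BLUE:
+         return "cool"
+     else:
+         # Type checkers flag this branch (and it raises at runtime) if a new Color is missed.
+         assert_never(color)
+
+
+ # New generic syntax (Python 3.12) instead of TypeVar and Generic.
+ class Stack[T]:
+     def __init__(self) -> None:
+         self._items: list[T] = []
+
+     def push(self, item: T) -> None:
+         self._items.append(item)
+
+     def pop(self) -> T:
+         return self._items.pop()
+ ```
+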
23
+ ## Docstrings
24
+
25
+ - Use Google-style docstrings for all public functions and classes
26
+ - Include type information in docstrings for complex parameters
27
+ - Keep docstrings concise but informative
28
+
29
+ ## Code Structure
30
+
31
+ - **Main function first**: Place the primary/main function(s) at the top of the file. Supporting functions, constants, and setup code go below. This ensures readers see the most important content immediately.
32
+ - Keep functions under 50 lines when possible
33
+ - Use early returns to reduce nesting
34
+ - Prefer list comprehensions over explicit loops for simple transformations
35
+ - Use type guards for complex conditional logic
36
+
37
+ ## String Formatting
38
+
39
+ - **ALWAYS use f-strings** for string formatting and interpolation
40
+ - **NEVER use percent-style formatting** (`%s`, `%d`, `%f`, etc.)
41
+ - Use f-strings for all string concatenation and formatting needs
42
+
43
+ ## Configuration and Constants
44
+
45
+ - Define constants at module level
46
+ - Use `typing.Final` for constants that shouldn't be modified
47
+ - Group related constants together
48
+ - Use PydanticSettings for overall settings, including ones read from .env.
49
+
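+ A minimal sketch of a PydanticSettings class (field names mirror `.env.example`; the class itself is illustrative, not the project's actual settings module):
+
+ ```python
+ from typing import Final
+
+ from pydantic_settings import BaseSettings, SettingsConfigDict
+
+ DEFAULT_NAMESPACE: Final[str] = "standards"
+
+
+ class Settings(BaseSettings):
+     """Application settings read from the environment and .env."""
+
+     model_config = SettingsConfigDict(env_file=".env", extra="ignore")
+
+     csp_api_key: str = ""
+     pinecone_api_key: str = ""
+     pinecone_index_name: str = "common-core-standards"
+     pinecone_namespace: str = DEFAULT_NAMESPACE
+
+
+ settings = Settings()
+ ```
+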
50
+ ## Logging
51
+
52
+ - Import the logger from `logger_config` via `from src.logger_config import logger`.
53
+ - Use structured logging with appropriate log levels
54
+ - Include context in log messages
55
+ - Use loguru for consistent logging across the project
56
+ - Make sure to use `logger.exception` to log all errors inside of except blocks.
57
+ - Outside of except blocks, use `logger.opt(exception=err).error(msg)` so the exception and stack trace are included (see the sketch below).
58
+
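+ A short sketch of this pattern (the config-loading function is illustrative; in project code the logger import comes from `src.logger_config`):
+
+ ```python
+ from __future__ import annotations
+
+ import json
+ from pathlib import Path
+
+ from loguru import logger  # in project code: from src.logger_config import logger
+
+
+ def load_config(path: Path) -> dict[str, str]:
+     """Load a JSON config file, logging failures with full context."""
+     logger.info(f"Loading config from {path}")
+     try:
+         return json.loads(path.read_text(encoding="utf-8"))
+     except (OSError, json.JSONDecodeError):
+         # Inside an except block: logger.exception records the stack trace.
+         logger.exception(f"Failed to load config from {path}")
+         return {}
+
+
+ def report_failure(err: Exception) -> None:
+     # Outside an except block: attach the exception explicitly.
+     logger.opt(exception=err).error("Operation failed")
+ ```
+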
59
+ ## Performance Considerations
60
+
61
+ - Use `pathlib.Path` instead of string paths
62
+ - Prefer `list` comprehensions over `map()` for readability
63
+ - Use `collections.defaultdict` when appropriate
64
+ - Use Pydantic for simple data containers.
65
+
66
+ ## Security
67
+
68
+ - Never log sensitive information (passwords, API keys, etc.)
69
+ - Use environment variables for configuration
70
+ - Validate all user inputs
71
+ - Use parameterized queries for database operations
.cursor/rules/standards/testing.mdc ADDED
@@ -0,0 +1,12 @@
1
+ ---
2
+ globs: tests/**/test_*.py
3
+ alwaysApply: false
4
+ ---
5
+
6
+ - Don't import `pytest` unless you use it directly.
7
+ - Test functions should be descriptive and test one specific behavior
8
+ - Use pytest fixtures for common setup
9
+ - Mock external dependencies
10
+ - Use parametrized tests for multiple input scenarios (see the sketch below)
11
+ - Test functions do not need to provide a return type annotation.
12
+ - Test fixtures and test arguments should use type annotations.
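+
+ A small sketch of a parametrized test with a typed fixture (the helper under test is hypothetical):
+
+ ```python
+ from __future__ import annotations
+
+ import pytest
+
+
+ def normalize_grade(value: str) -> str:
+     """Hypothetical helper under test: pad single-digit grades with a leading zero."""
+     return value if value == "K" else value.zfill(2)
+
+
+ @pytest.fixture
+ def kindergarten_grade() -> str:
+     return "K"
+
+
+ @pytest.mark.parametrize(
+     ("raw", "expected"),
+     [
+         ("1", "01"),
+         ("10", "10"),
+         ("K", "K"),
+     ],
+ )
+ def test_normalize_grade_pads_single_digits(raw: str, expected: str):
+     assert normalize_grade(raw) == expected
+
+
+ def test_normalize_grade_keeps_kindergarten(kindergarten_grade: str):
+     assert normalize_grade(kindergarten_grade) == "K"
+ ```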
.cursor/rules/standards/ux/deletions.mdc ADDED
@@ -0,0 +1,19 @@
1
+ ---
2
+ description: UX patterns for destructive actions (deletions)
3
+ alwaysApply: false
4
+ ---
5
+
6
+ ### Deletion Confirmations
7
+
8
+ Use confirmation dialogs before deleting significant objects that are difficult or impossible to undo. Confirmation dialogs should:
9
+
10
+ - Display the item name/title being deleted
11
+ - Clearly state the action cannot be undone
12
+ - Require explicit "Delete" or "Confirm" action
13
+ - Use destructive button styling (e.g., `variant="destructive"`)
14
+
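+ A minimal CLI-side sketch of this checklist using `typer.confirm` (the function and item names are illustrative; in a UI the same information would go into a dialog component):
+
+ ```python
+ from __future__ import annotations
+
+ import typer
+
+
+ def delete_standard_set(title: str) -> None:
+     """Delete a standard set only after an explicit, informative confirmation."""
+     # Show the item name and state that the action cannot be undone.
+     confirmed = typer.confirm(f"Delete standard set '{title}'? This action cannot be undone.")
+     if not confirmed:
+         typer.echo("Deletion cancelled.")
+         raise typer.Exit()
+     # ... perform the actual deletion here ...
+     typer.echo(f"Deleted '{title}'.")
+ ```
+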
15
+ **When confirmation is not necessary:**
16
+
17
+ - Deleting easily recoverable items
18
+ - Operations that can be easily undone
19
+ - Temporary or draft content with minimal user investment
.cursor/rules/standards/workflows.mdc ADDED
@@ -0,0 +1,22 @@
1
+ ---
2
+ description: Rules for implementing workflow files with LLM chains and structured outputs.
3
+ alwaysApply: false
4
+ ---
5
+
6
+ # Workflow File Structure
7
+
8
+ Files containing simple LLM interactions with inline prompts and structured outputs follow a specific structure.
9
+
10
+ ## File Layout Order
11
+
12
+ 1. **Main workflow function(s)** - Primary public function(s)
13
+ 2. **Helper functions** - Private utility functions
14
+ 3. **Prompt strings** - System and user prompts (`_SYSTEM_PROMPT`, `_USER_PROMPT`)
15
+ 4. **Pydantic models** - Structured output schema models
16
+ 5. **Model and chain setup** - LLM initialization, structured output config, prompt template, chain
17
+
18
+ ## Key Principles
19
+
20
+ - Use `from __future__ import annotations` to enable forward references
21
+ - Main function references types/variables defined later via forward references
22
+ - Use clear section dividers to separate logical sections
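+
+ A condensed skeleton of this layout (prompts, the output model, and the `_run_chain` stub are placeholders, not real project or library APIs):
+
+ ```python
+ from __future__ import annotations
+
+ from pydantic import BaseModel
+
+
+ # 1. Main workflow function(s): reference names defined later via forward references.
+ def summarize_topic(topic: str) -> TopicSummary:
+     """Primary entry point for this workflow."""
+     return _run_chain(topic)
+
+
+ # 2. Helper functions.
+ def _run_chain(topic: str) -> TopicSummary:
+     # Placeholder: a real implementation would send _SYSTEM_PROMPT and the formatted
+     # _USER_PROMPT through the configured LLM chain and parse the structured output.
+     user_prompt = _USER_PROMPT.format(topic=topic)
+     return TopicSummary(topic=topic, summary=f"Stub summary for: {user_prompt}")
+
+
+ # 3. Prompt strings.
+ _SYSTEM_PROMPT = "You are a concise summarizer."
+ _USER_PROMPT = "Summarize the topic: {topic}"
+
+
+ # 4. Pydantic models for structured output.
+ class TopicSummary(BaseModel):
+     topic: str
+     summary: str
+
+
+ # 5. Model and chain setup (LLM initialization, prompt template, chain) would follow here.
+ ```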
.env.example ADDED
@@ -0,0 +1,7 @@
1
+ # Common Standards Project API Configuration
2
+ CSP_API_KEY=your_generated_api_key_here
3
+
4
+ # Pinecone Configuration
5
+ PINECONE_API_KEY=your_pinecone_api_key_here
6
+ PINECONE_INDEX_NAME=common-core-standards
7
+ PINECONE_NAMESPACE=standards
.gitignore ADDED
@@ -0,0 +1,56 @@
1
+ # Environment variables
2
+ .env
3
+
4
+ # Python cache
5
+ __pycache__/
6
+ .mypy_cache/
7
+ *.py[cod]
8
+ *$py.class
9
+ *.so
10
+ .Python
11
+
12
+ # Distribution / packaging
13
+ build/
14
+ develop-eggs/
15
+ dist/
16
+ downloads/
17
+ eggs/
18
+ .eggs/
19
+ lib/
20
+ lib64/
21
+ parts/
22
+ sdist/
23
+ var/
24
+ wheels/
25
+ *.egg-info/
26
+ .installed.cfg
27
+ *.egg
28
+
29
+ # Virtual environments
30
+ .venv/
31
+ venv/
32
+ ENV/
33
+ env/
34
+
35
+ # IDEs
36
+ .vscode/
37
+ .idea/
38
+ *.swp
39
+ *.swo
40
+ *~
41
+
42
+ # Testing
43
+ .pytest_cache/
44
+ .coverage
45
+ htmlcov/
46
+
47
+ # OS
48
+ .DS_Store
49
+ Thumbs.db
50
+
51
+ # Raw data (local only)
52
+ data/raw/
53
+
54
+ # CLI logs
55
+ data/cli.log
56
+
data/.gitkeep ADDED
@@ -0,0 +1,2 @@
1
+ # This directory will contain generated data files
2
+
pyproject.toml ADDED
@@ -0,0 +1,23 @@
1
+ [project]
2
+ name = "common-core-mcp"
3
+ version = "0.1.0"
4
+ requires-python = ">=3.12"
5
+ dependencies = [
6
+ "mcp",
7
+ "gradio>=5.0.0,<6.0.0",
8
+ "pinecone",
9
+ "python-dotenv",
10
+ "typer",
11
+ "requests",
12
+ "rich",
13
+ "loguru",
14
+ "pydantic>=2.0.0",
15
+ "pydantic-settings>=2.0.0",
16
+ ]
17
+
18
+ [project.optional-dependencies]
19
+ dev = ["pytest>=8.0.0", "pytest-asyncio>=0.23.0"]
20
+
21
+ [build-system]
22
+ requires = ["setuptools>=61.0"]
23
+ build-backend = "setuptools.build_meta"
src/__init__.py ADDED
@@ -0,0 +1,2 @@
1
+ """Common Core MCP - Educational Standards Search."""
2
+
tests/__init__.py ADDED
@@ -0,0 +1,2 @@
1
+ """Test suite for Common Core MCP."""
2
+
tests/test_pinecone_client.py ADDED
@@ -0,0 +1,318 @@
1
+ """Unit tests for Pinecone client."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import tempfile
6
+ from datetime import datetime, timezone
7
+ from pathlib import Path
8
+ from unittest.mock import MagicMock, patch
9
+
10
+ import pytest
11
+ from pinecone.exceptions import PineconeException
12
+
13
+ from tools.pinecone_client import PineconeClient
14
+ from tools.pinecone_models import PineconeRecord
15
+
16
+
17
+ class TestUploadTracking:
18
+ """Tests for upload tracking marker file operations."""
19
+
20
+ def test_is_uploaded_returns_false_when_marker_missing(self):
21
+ """is_uploaded() returns False when marker file doesn't exist."""
22
+ with tempfile.TemporaryDirectory() as tmpdir:
23
+ set_dir = Path(tmpdir)
24
+ assert PineconeClient.is_uploaded(set_dir) is False
25
+
26
+ def test_is_uploaded_returns_true_when_marker_exists(self):
27
+ """is_uploaded() returns True when marker file exists."""
28
+ with tempfile.TemporaryDirectory() as tmpdir:
29
+ set_dir = Path(tmpdir)
30
+ marker_file = set_dir / ".pinecone_uploaded"
31
+ marker_file.write_text("2025-01-15T14:30:00Z")
32
+ assert PineconeClient.is_uploaded(set_dir) is True
33
+
34
+ def test_mark_uploaded_creates_marker_file(self):
35
+ """mark_uploaded() creates marker file with ISO 8601 timestamp."""
36
+ with tempfile.TemporaryDirectory() as tmpdir:
37
+ set_dir = Path(tmpdir)
38
+ marker_file = set_dir / ".pinecone_uploaded"
39
+
40
+ assert not marker_file.exists()
41
+ PineconeClient.mark_uploaded(set_dir)
42
+ assert marker_file.exists()
43
+
44
+ # Verify timestamp format
45
+ timestamp = marker_file.read_text(encoding="utf-8").strip()
46
+ # Should be valid ISO 8601 format
47
+ datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
48
+
49
+ def test_mark_uploaded_writes_utc_timestamp(self):
50
+ """mark_uploaded() writes UTC timestamp in ISO 8601 format."""
51
+ with tempfile.TemporaryDirectory() as tmpdir:
52
+ set_dir = Path(tmpdir)
53
+ PineconeClient.mark_uploaded(set_dir)
54
+
55
+ marker_file = set_dir / ".pinecone_uploaded"
56
+ timestamp_str = marker_file.read_text(encoding="utf-8").strip()
57
+
58
+ # Parse and verify it's UTC
59
+ if timestamp_str.endswith("Z"):
60
+ timestamp_str = timestamp_str[:-1] + "+00:00"
61
+ parsed = datetime.fromisoformat(timestamp_str)
62
+ assert parsed.tzinfo == timezone.utc
63
+
64
+ def test_get_upload_timestamp_returns_none_when_marker_missing(self):
65
+ """get_upload_timestamp() returns None when marker file doesn't exist."""
66
+ with tempfile.TemporaryDirectory() as tmpdir:
67
+ set_dir = Path(tmpdir)
68
+ assert PineconeClient.get_upload_timestamp(set_dir) is None
69
+
70
+ def test_get_upload_timestamp_returns_timestamp_when_marker_exists(self):
71
+ """get_upload_timestamp() returns timestamp string when marker exists."""
72
+ with tempfile.TemporaryDirectory() as tmpdir:
73
+ set_dir = Path(tmpdir)
74
+ expected_timestamp = "2025-01-15T14:30:00Z"
75
+ marker_file = set_dir / ".pinecone_uploaded"
76
+ marker_file.write_text(expected_timestamp)
77
+
78
+ result = PineconeClient.get_upload_timestamp(set_dir)
79
+ assert result == expected_timestamp
80
+
81
+ def test_get_upload_timestamp_handles_read_error(self):
82
+ """get_upload_timestamp() returns None if marker file can't be read."""
83
+ with tempfile.TemporaryDirectory() as tmpdir:
84
+ set_dir = Path(tmpdir)
85
+ marker_file = set_dir / ".pinecone_uploaded"
86
+ marker_file.write_text("test")
87
+
88
+ # Make file unreadable (on Unix systems)
89
+ marker_file.chmod(0o000)
90
+
91
+ try:
92
+ result = PineconeClient.get_upload_timestamp(set_dir)
93
+ # Should return None or handle gracefully
94
+ assert result is None or isinstance(result, str)
95
+ finally:
96
+ # Restore permissions for cleanup
97
+ marker_file.chmod(0o644)
98
+
99
+
100
+ class TestPineconeClientCore:
101
+ """Tests for core Pinecone client functionality."""
102
+
103
+ @patch("tools.pinecone_client.Pinecone")
104
+ @patch("tools.pinecone_client.get_settings")
105
+ def test_init_raises_error_when_api_key_missing(self, mock_get_settings, mock_pinecone):
106
+ """__init__() raises ValueError when API key is not set."""
107
+ mock_settings = MagicMock()
108
+ mock_settings.pinecone_api_key = ""
109
+ mock_get_settings.return_value = mock_settings
110
+
111
+ with pytest.raises(ValueError, match="PINECONE_API_KEY"):
112
+ PineconeClient()
113
+
114
+ @patch("tools.pinecone_client.Pinecone")
115
+ @patch("tools.pinecone_client.get_settings")
116
+ def test_init_initializes_pinecone_client(self, mock_get_settings, mock_pinecone):
117
+ """__init__() initializes Pinecone SDK with API key."""
118
+ mock_settings = MagicMock()
119
+ mock_settings.pinecone_api_key = "test-api-key"
120
+ mock_settings.pinecone_index_name = "test-index"
121
+ mock_settings.pinecone_namespace = "test-namespace"
122
+ mock_get_settings.return_value = mock_settings
123
+
124
+ mock_pc = MagicMock()
125
+ mock_pc.Index.return_value = MagicMock()
126
+ mock_pc.has_index.return_value = True
127
+ mock_pinecone_class = MagicMock(return_value=mock_pc)
128
+ with patch("tools.pinecone_client.Pinecone", mock_pinecone_class):
129
+ client = PineconeClient()
130
+
131
+ assert client.pc == mock_pc
132
+ assert client.index_name == "test-index"
133
+ assert client.namespace == "test-namespace"
134
+
135
+ @patch("tools.pinecone_client.Pinecone")
136
+ @patch("tools.pinecone_client.get_settings")
137
+ def test_validate_index_raises_error_when_index_missing(self, mock_get_settings, mock_pinecone):
138
+ """validate_index() raises ValueError when index doesn't exist."""
139
+ mock_settings = MagicMock()
140
+ mock_settings.pinecone_api_key = "test-api-key"
141
+ mock_settings.pinecone_index_name = "missing-index"
142
+ mock_get_settings.return_value = mock_settings
143
+
144
+ mock_pc = MagicMock()
145
+ mock_pc.has_index.return_value = False
146
+ mock_pinecone_class = MagicMock(return_value=mock_pc)
147
+ with patch("tools.pinecone_client.Pinecone", mock_pinecone_class):
148
+ client = PineconeClient()
149
+
150
+ with pytest.raises(ValueError, match="Index 'missing-index' not found"):
151
+ client.validate_index()
152
+
153
+ @patch("tools.pinecone_client.Pinecone")
154
+ @patch("tools.pinecone_client.get_settings")
155
+ def test_validate_index_succeeds_when_index_exists(self, mock_get_settings, mock_pinecone):
156
+ """validate_index() succeeds when index exists."""
157
+ mock_settings = MagicMock()
158
+ mock_settings.pinecone_api_key = "test-api-key"
159
+ mock_settings.pinecone_index_name = "existing-index"
160
+ mock_get_settings.return_value = mock_settings
161
+
162
+ mock_pc = MagicMock()
163
+ mock_pc.has_index.return_value = True
164
+ mock_pinecone_class = MagicMock(return_value=mock_pc)
165
+ with patch("tools.pinecone_client.Pinecone", mock_pinecone_class):
166
+ client = PineconeClient()
167
+
168
+ # Should not raise
169
+ client.validate_index()
170
+
171
+ def test_exponential_backoff_retry_succeeds_on_first_attempt(self):
172
+ """exponential_backoff_retry() succeeds when function succeeds immediately."""
173
+ func = MagicMock(return_value="success")
174
+ result = PineconeClient.exponential_backoff_retry(func)
175
+ assert result == "success"
176
+ func.assert_called_once()
177
+
178
+ @patch("tools.pinecone_client.time.sleep")
179
+ def test_exponential_backoff_retry_retries_on_429(self, mock_sleep):
180
+ """exponential_backoff_retry() retries on 429 rate limit errors."""
181
+ error_429 = PineconeException("Rate limited")
182
+ error_429.status = 429
183
+
184
+ func = MagicMock(side_effect=[error_429, "success"])
185
+ result = PineconeClient.exponential_backoff_retry(func, max_retries=2)
186
+
187
+ assert result == "success"
188
+ assert func.call_count == 2
189
+ mock_sleep.assert_called_once_with(1) # 2^0 = 1
190
+
191
+ @patch("tools.pinecone_client.time.sleep")
192
+ def test_exponential_backoff_retry_retries_on_5xx(self, mock_sleep):
193
+ """exponential_backoff_retry() retries on 5xx server errors."""
194
+ error_500 = PineconeException("Server error")
195
+ error_500.status = 500
196
+
197
+ func = MagicMock(side_effect=[error_500, "success"])
198
+ result = PineconeClient.exponential_backoff_retry(func, max_retries=2)
199
+
200
+ assert result == "success"
201
+ assert func.call_count == 2
202
+ mock_sleep.assert_called_once_with(1)
203
+
204
+ def test_exponential_backoff_retry_fails_on_4xx(self):
205
+ """exponential_backoff_retry() fails immediately on 4xx client errors."""
206
+ error_400 = PineconeException("Bad request")
207
+ error_400.status = 400
208
+
209
+ func = MagicMock(side_effect=error_400)
210
+ with pytest.raises(PineconeException):
211
+ PineconeClient.exponential_backoff_retry(func, max_retries=3)
212
+
213
+ # Should only try once (no retries for 4xx)
214
+ assert func.call_count == 1
215
+
216
+ @patch("tools.pinecone_client.time.sleep")
217
+ def test_exponential_backoff_retry_caps_delay_at_60s(self, mock_sleep):
218
+ """exponential_backoff_retry() caps delay at 60 seconds."""
219
+ error_500 = PineconeException("Server error")
220
+ error_500.status = 500
221
+
222
+ func = MagicMock(side_effect=[error_500, error_500, "success"])
223
+ result = PineconeClient.exponential_backoff_retry(func, max_retries=3)
224
+
225
+ assert result == "success"
226
+ # First retry: 2^0 = 1s, second retry: min(2^1, 60) = 2s
227
+ assert mock_sleep.call_count == 2
228
+ mock_sleep.assert_any_call(1)
229
+ mock_sleep.assert_any_call(2)
230
+
231
+ def test_record_to_dict_omits_none_optional_fields(self):
232
+ """_record_to_dict() omits None values for optional fields."""
233
+ record = PineconeRecord(
234
+ _id="test-id",
235
+ content="test content",
236
+ standard_set_id="set-id",
237
+ standard_set_title="Test Set",
238
+ subject="Math",
239
+ education_levels=["01"],
240
+ document_id="doc-id",
241
+ document_valid="2021",
242
+ jurisdiction_id="jur-id",
243
+ jurisdiction_title="Test Jurisdiction",
244
+ depth=0,
245
+ is_leaf=True,
246
+ is_root=True,
247
+ root_id="test-id",
248
+ ancestor_ids=[],
249
+ child_ids=[],
250
+ sibling_count=0,
251
+ # Optional fields set to None
252
+ normalized_subject=None,
253
+ publication_status=None,
254
+ asn_identifier=None,
255
+ statement_notation=None,
256
+ statement_label=None,
257
+ parent_id=None,
258
+ )
259
+
260
+ record_dict = PineconeClient._record_to_dict(record)
261
+
262
+ # Verify _id is serialized (not id)
263
+ assert "_id" in record_dict
264
+ assert record_dict["_id"] == "test-id"
265
+ assert "id" not in record_dict
266
+
267
+ # Optional fields should be omitted
268
+ assert "asn_identifier" not in record_dict
269
+ assert "statement_notation" not in record_dict
270
+ assert "statement_label" not in record_dict
271
+ assert "normalized_subject" not in record_dict
272
+ assert "publication_status" not in record_dict
273
+ # parent_id should be present as null
274
+ assert "parent_id" in record_dict
275
+ assert record_dict["parent_id"] is None
276
+
277
+ def test_record_to_dict_includes_present_optional_fields(self):
278
+ """_record_to_dict() includes optional fields when they have values."""
279
+ record = PineconeRecord(
280
+ _id="test-id",
281
+ content="test content",
282
+ standard_set_id="set-id",
283
+ standard_set_title="Test Set",
284
+ subject="Math",
285
+ normalized_subject="Math",
286
+ education_levels=["01"],
287
+ document_id="doc-id",
288
+ document_valid="2021",
289
+ publication_status="Published",
290
+ jurisdiction_id="jur-id",
291
+ jurisdiction_title="Test Jurisdiction",
292
+ asn_identifier="ASN123",
293
+ statement_notation="1.2.3",
294
+ statement_label="Standard",
295
+ depth=1,
296
+ is_leaf=True,
297
+ is_root=False,
298
+ parent_id="parent-id",
299
+ root_id="root-id",
300
+ ancestor_ids=["root-id"],
301
+ child_ids=[],
302
+ sibling_count=0,
303
+ )
304
+
305
+ record_dict = PineconeClient._record_to_dict(record)
306
+
307
+ # Verify _id is serialized (not id)
308
+ assert "_id" in record_dict
309
+ assert record_dict["_id"] == "test-id"
310
+ assert "id" not in record_dict
311
+
312
+ # Optional fields should be included when present
313
+ assert record_dict["asn_identifier"] == "ASN123"
314
+ assert record_dict["statement_notation"] == "1.2.3"
315
+ assert record_dict["statement_label"] == "Standard"
316
+ assert record_dict["normalized_subject"] == "Math"
317
+ assert record_dict["publication_status"] == "Published"
318
+ assert record_dict["parent_id"] == "parent-id"
tests/test_pinecone_models.py ADDED
@@ -0,0 +1,339 @@
1
+ """Unit tests for Pinecone Pydantic models."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import json
6
+
7
+ import pytest
8
+
9
+ from tools.pinecone_models import PineconeRecord, ProcessedStandardSet
10
+
11
+
12
+ class TestEducationLevelsProcessing:
13
+ """Test education_levels field validator."""
14
+
15
+ def test_simple_array(self):
16
+ """Test simple array without comma-separated values."""
17
+ record = PineconeRecord(
18
+ **{"_id": "test-id"},
19
+ content="Test content",
20
+ standard_set_id="set-1",
21
+ standard_set_title="Grade 1",
22
+ subject="Math",
23
+ education_levels=["01", "02"],
24
+ document_id="doc-1",
25
+ document_valid="2021",
26
+ jurisdiction_id="jur-1",
27
+ jurisdiction_title="Wyoming",
28
+ depth=0,
29
+ is_leaf=True,
30
+ is_root=True,
31
+ root_id="test-id",
32
+ ancestor_ids=[],
33
+ child_ids=[],
34
+ sibling_count=0,
35
+ )
36
+ assert record.education_levels == ["01", "02"]
37
+
38
+ def test_comma_separated_strings(self):
39
+ """Test array with comma-separated values."""
40
+ record = PineconeRecord(
41
+ **{"_id": "test-id"},
42
+ content="Test content",
43
+ standard_set_id="set-1",
44
+ standard_set_title="Grade 1",
45
+ subject="Math",
46
+ education_levels=["01,02", "02", "03"],
47
+ document_id="doc-1",
48
+ document_valid="2021",
49
+ jurisdiction_id="jur-1",
50
+ jurisdiction_title="Wyoming",
51
+ depth=0,
52
+ is_leaf=True,
53
+ is_root=True,
54
+ root_id="test-id",
55
+ ancestor_ids=[],
56
+ child_ids=[],
57
+ sibling_count=0,
58
+ )
59
+ assert record.education_levels == ["01", "02", "03"]
60
+
61
+ def test_high_school_range(self):
62
+ """Test high school grade levels."""
63
+ record = PineconeRecord(
64
+ **{"_id": "test-id"},
65
+ content="Test content",
66
+ standard_set_id="set-1",
67
+ standard_set_title="High School",
68
+ subject="Math",
69
+ education_levels=["09,10,11,12"],
70
+ document_id="doc-1",
71
+ document_valid="2021",
72
+ jurisdiction_id="jur-1",
73
+ jurisdiction_title="Wyoming",
74
+ depth=0,
75
+ is_leaf=True,
76
+ is_root=True,
77
+ root_id="test-id",
78
+ ancestor_ids=[],
79
+ child_ids=[],
80
+ sibling_count=0,
81
+ )
82
+ assert record.education_levels == ["09", "10", "11", "12"]
83
+
84
+ def test_empty_array(self):
85
+ """Test empty array."""
86
+ record = PineconeRecord(
87
+ **{"_id": "test-id"},
88
+ content="Test content",
89
+ standard_set_id="set-1",
90
+ standard_set_title="Grade 1",
91
+ subject="Math",
92
+ education_levels=[],
93
+ document_id="doc-1",
94
+ document_valid="2021",
95
+ jurisdiction_id="jur-1",
96
+ jurisdiction_title="Wyoming",
97
+ depth=0,
98
+ is_leaf=True,
99
+ is_root=True,
100
+ root_id="test-id",
101
+ ancestor_ids=[],
102
+ child_ids=[],
103
+ sibling_count=0,
104
+ )
105
+ assert record.education_levels == []
106
+
107
+ def test_whitespace_handling(self):
108
+ """Test that whitespace is stripped."""
109
+ record = PineconeRecord(
110
+ **{"_id": "test-id"},
111
+ content="Test content",
112
+ standard_set_id="set-1",
113
+ standard_set_title="Grade 1",
114
+ subject="Math",
115
+ education_levels=["01 , 02", " 03 "],
116
+ document_id="doc-1",
117
+ document_valid="2021",
118
+ jurisdiction_id="jur-1",
119
+ jurisdiction_title="Wyoming",
120
+ depth=0,
121
+ is_leaf=True,
122
+ is_root=True,
123
+ root_id="test-id",
124
+ ancestor_ids=[],
125
+ child_ids=[],
126
+ sibling_count=0,
127
+ )
128
+ assert record.education_levels == ["01", "02", "03"]
129
+
130
+
131
+ class TestParentIdNullHandling:
132
+ """Test that parent_id null is properly serialized."""
133
+
134
+ def test_root_node_parent_id_null(self):
135
+ """Test root node has parent_id as null."""
136
+ record = PineconeRecord(
137
+ **{"_id": "root-id"},
138
+ content="Root content",
139
+ standard_set_id="set-1",
140
+ standard_set_title="Grade 1",
141
+ subject="Math",
142
+ education_levels=["01"],
143
+ document_id="doc-1",
144
+ document_valid="2021",
145
+ jurisdiction_id="jur-1",
146
+ jurisdiction_title="Wyoming",
147
+ depth=0,
148
+ is_leaf=False,
149
+ is_root=True,
150
+ parent_id=None,
151
+ root_id="root-id",
152
+ ancestor_ids=[],
153
+ child_ids=["child-1"],
154
+ sibling_count=0,
155
+ )
156
+ assert record.parent_id is None
157
+
158
+ # Test JSON serialization preserves null
159
+ json_str = record.model_dump_json()
160
+ data = json.loads(json_str)
161
+ assert data["parent_id"] is None
162
+
163
+ def test_child_node_parent_id_set(self):
164
+ """Test child node has parent_id set."""
165
+ record = PineconeRecord(
166
+ **{"_id": "child-id"},
167
+ content="Child content",
168
+ standard_set_id="set-1",
169
+ standard_set_title="Grade 1",
170
+ subject="Math",
171
+ education_levels=["01"],
172
+ document_id="doc-1",
173
+ document_valid="2021",
174
+ jurisdiction_id="jur-1",
175
+ jurisdiction_title="Wyoming",
176
+ depth=1,
177
+ is_leaf=True,
178
+ is_root=False,
179
+ parent_id="parent-id",
180
+ root_id="root-id",
181
+ ancestor_ids=["root-id"],
182
+ child_ids=[],
183
+ sibling_count=0,
184
+ )
185
+ assert record.parent_id == "parent-id"
186
+
187
+ # Test JSON serialization
188
+ json_str = record.model_dump_json()
189
+ data = json.loads(json_str)
190
+ assert data["parent_id"] == "parent-id"
191
+
192
+
193
+ class TestOptionalFields:
194
+ """Test optional fields can be omitted."""
195
+
196
+ def test_all_optional_fields_omitted(self):
197
+ """Test record with all optional fields omitted."""
198
+ record = PineconeRecord(
199
+ **{"_id": "test-id"},
200
+ content="Test content",
201
+ standard_set_id="set-1",
202
+ standard_set_title="Grade 1",
203
+ subject="Math",
204
+ education_levels=["01"],
205
+ document_id="doc-1",
206
+ document_valid="2021",
207
+ jurisdiction_id="jur-1",
208
+ jurisdiction_title="Wyoming",
209
+ depth=0,
210
+ is_leaf=True,
211
+ is_root=True,
212
+ root_id="test-id",
213
+ ancestor_ids=[],
214
+ child_ids=[],
215
+ sibling_count=0,
216
+ )
217
+ assert record.normalized_subject is None
218
+ assert record.asn_identifier is None
219
+ assert record.statement_notation is None
220
+ assert record.statement_label is None
221
+ assert record.publication_status is None
222
+
223
+ def test_optional_fields_set(self):
224
+ """Test record with optional fields set."""
225
+ record = PineconeRecord(
226
+ **{"_id": "test-id"},
227
+ content="Test content",
228
+ standard_set_id="set-1",
229
+ standard_set_title="Grade 1",
230
+ subject="Math",
231
+ normalized_subject="Math",
232
+ education_levels=["01"],
233
+ document_id="doc-1",
234
+ document_valid="2021",
235
+ publication_status="Published",
236
+ jurisdiction_id="jur-1",
237
+ jurisdiction_title="Wyoming",
238
+ asn_identifier="S12345",
239
+ statement_notation="1.G.K",
240
+ statement_label="Standard",
241
+ depth=1,
242
+ is_leaf=True,
243
+ is_root=False,
244
+ parent_id="parent-id",
245
+ root_id="root-id",
246
+ ancestor_ids=["root-id"],
247
+ child_ids=[],
248
+ sibling_count=1,
249
+ )
250
+ assert record.normalized_subject == "Math"
251
+ assert record.asn_identifier == "S12345"
252
+ assert record.statement_notation == "1.G.K"
253
+ assert record.statement_label == "Standard"
254
+ assert record.publication_status == "Published"
255
+
256
+
257
+ class TestProcessedStandardSet:
258
+ """Test ProcessedStandardSet container model."""
259
+
260
+ def test_empty_records(self):
261
+ """Test ProcessedStandardSet with empty records."""
262
+ processed = ProcessedStandardSet(records=[])
263
+ assert processed.records == []
264
+
265
+ def test_multiple_records(self):
266
+ """Test ProcessedStandardSet with multiple records."""
267
+ record1 = PineconeRecord(
268
+ **{"_id": "id-1"},
269
+ content="Content 1",
270
+ standard_set_id="set-1",
271
+ standard_set_title="Grade 1",
272
+ subject="Math",
273
+ education_levels=["01"],
274
+ document_id="doc-1",
275
+ document_valid="2021",
276
+ jurisdiction_id="jur-1",
277
+ jurisdiction_title="Wyoming",
278
+ depth=0,
279
+ is_leaf=True,
280
+ is_root=True,
281
+ root_id="id-1",
282
+ ancestor_ids=[],
283
+ child_ids=[],
284
+ sibling_count=0,
285
+ )
286
+ record2 = PineconeRecord(
287
+ **{"_id": "id-2"},
288
+ content="Content 2",
289
+ standard_set_id="set-1",
290
+ standard_set_title="Grade 1",
291
+ subject="Math",
292
+ education_levels=["01"],
293
+ document_id="doc-1",
294
+ document_valid="2021",
295
+ jurisdiction_id="jur-1",
296
+ jurisdiction_title="Wyoming",
297
+ depth=1,
298
+ is_leaf=True,
299
+ is_root=False,
300
+ parent_id="id-1",
301
+ root_id="id-1",
302
+ ancestor_ids=["id-1"],
303
+ child_ids=[],
304
+ sibling_count=0,
305
+ )
306
+ processed = ProcessedStandardSet(records=[record1, record2])
307
+ assert len(processed.records) == 2
308
+ assert processed.records[0].id == "id-1"
309
+ assert processed.records[1].id == "id-2"
310
+
311
+ def test_json_serialization(self):
312
+ """Test JSON serialization of ProcessedStandardSet."""
313
+ record = PineconeRecord(
314
+ **{"_id": "test-id"},
315
+ content="Test content",
316
+ standard_set_id="set-1",
317
+ standard_set_title="Grade 1",
318
+ subject="Math",
319
+ education_levels=["01"],
320
+ document_id="doc-1",
321
+ document_valid="2021",
322
+ jurisdiction_id="jur-1",
323
+ jurisdiction_title="Wyoming",
324
+ depth=0,
325
+ is_leaf=True,
326
+ is_root=True,
327
+ root_id="test-id",
328
+ ancestor_ids=[],
329
+ child_ids=[],
330
+ sibling_count=0,
331
+ )
332
+ processed = ProcessedStandardSet(records=[record])
333
+ json_str = processed.model_dump_json(by_alias=True)
334
+ data = json.loads(json_str)
335
+ assert "records" in data
336
+ assert len(data["records"]) == 1
337
+ assert data["records"][0]["_id"] == "test-id"
338
+ assert data["records"][0]["parent_id"] is None # Verify null handling
339
+
tests/test_pinecone_processor.py ADDED
@@ -0,0 +1,463 @@
1
+ """Tests for Pinecone processor module."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import json
6
+ from pathlib import Path
7
+
8
+ import pytest
9
+
10
+ from tools.models import Standard, StandardSet
11
+ from tools.pinecone_processor import StandardSetProcessor, process_and_save
12
+
13
+
14
+ @pytest.fixture
15
+ def sample_standard_set():
16
+ """Create a sample standard set for testing."""
17
+ # Create a simple hierarchy:
18
+ # Root (depth 0): "Math"
19
+ # Child (depth 1, notation "1.1"): "Numbers"
20
+ # Child (depth 2, notation "1.1.A"): "Count to 10"
21
+ root_id = "ROOT_ID"
22
+ child1_id = "CHILD1_ID"
23
+ child2_id = "CHILD2_ID"
24
+
25
+ standards = {
26
+ root_id: Standard(
27
+ id=root_id,
28
+ position=100000,
29
+ depth=0,
30
+ description="Math",
31
+ statementLabel="Domain",
32
+ ancestorIds=[],
33
+ parentId=None,
34
+ ),
35
+ child1_id: Standard(
36
+ id=child1_id,
37
+ position=101000,
38
+ depth=1,
39
+ description="Numbers",
40
+ statementNotation="1.1",
41
+ statementLabel="Standard",
42
+ ancestorIds=[root_id],
43
+ parentId=root_id,
44
+ ),
45
+ child2_id: Standard(
46
+ id=child2_id,
47
+ position=102000,
48
+ depth=2,
49
+ description="Count to 10",
50
+ statementNotation="1.1.A",
51
+ statementLabel="Benchmark",
52
+ ancestorIds=[root_id, child1_id],
53
+ parentId=child1_id,
54
+ ),
55
+ }
56
+
57
+ standard_set = StandardSet(
58
+ id="SET_ID",
59
+ title="Grade 1",
60
+ subject="Mathematics",
61
+ normalizedSubject="Math",
62
+ educationLevels=["01"],
63
+ license={
64
+ "title": "CC BY",
65
+ "URL": "https://example.com",
66
+ "rightsHolder": "Test",
67
+ },
68
+ document={
69
+ "id": "DOC_ID",
70
+ "title": "Test Document",
71
+ "valid": "2021",
72
+ "publicationStatus": "Published",
73
+ },
74
+ jurisdiction={"id": "JUR_ID", "title": "Test State"},
75
+ standards=standards,
76
+ )
77
+
78
+ return standard_set
79
+
80
+
81
+ class TestRelationshipMaps:
82
+ """Test relationship map building (Task 3)."""
83
+
84
+ def test_build_id_to_standard_map(self, sample_standard_set):
85
+ """Test ID-to-standard map building."""
86
+ processor = StandardSetProcessor()
87
+ standards_dict = {
88
+ std_id: std.model_dump()
89
+ for std_id, std in sample_standard_set.standards.items()
90
+ }
91
+
92
+ result = processor._build_id_to_standard_map(standards_dict)
93
+
94
+ assert len(result) == 3
95
+ assert "ROOT_ID" in result
96
+ assert "CHILD1_ID" in result
97
+ assert "CHILD2_ID" in result
98
+ assert result["ROOT_ID"]["id"] == "ROOT_ID"
99
+
100
+ def test_build_parent_to_children_map(self, sample_standard_set):
101
+ """Test parent-to-children map building with position sorting."""
102
+ processor = StandardSetProcessor()
103
+ standards_dict = {
104
+ std_id: std.model_dump()
105
+ for std_id, std in sample_standard_set.standards.items()
106
+ }
107
+
108
+ result = processor._build_parent_to_children_map(standards_dict)
109
+
110
+ # Root should have one child
111
+ assert None in result
112
+ assert result[None] == ["ROOT_ID"]
113
+
114
+ # Root should have child1 as child
115
+ assert "ROOT_ID" in result
116
+ assert result["ROOT_ID"] == ["CHILD1_ID"]
117
+
118
+ # Child1 should have child2 as child
119
+ assert "CHILD1_ID" in result
120
+ assert result["CHILD1_ID"] == ["CHILD2_ID"]
121
+
122
+ # Child2 should have no children
123
+ assert "CHILD2_ID" not in result or result.get("CHILD2_ID") == []
124
+
125
+ def test_identify_leaf_nodes(self, sample_standard_set):
126
+ """Test leaf node identification."""
127
+ processor = StandardSetProcessor()
128
+ standards_dict = {
129
+ std_id: std.model_dump()
130
+ for std_id, std in sample_standard_set.standards.items()
131
+ }
132
+
133
+ result = processor._identify_leaf_nodes(standards_dict)
134
+
135
+ # Only child2 should be a leaf (no children)
136
+ assert "CHILD2_ID" in result
137
+ assert "ROOT_ID" not in result
138
+ assert "CHILD1_ID" not in result
139
+
140
+ def test_identify_root_nodes(self, sample_standard_set):
141
+ """Test root node identification."""
142
+ processor = StandardSetProcessor()
143
+ standards_dict = {
144
+ std_id: std.model_dump()
145
+ for std_id, std in sample_standard_set.standards.items()
146
+ }
147
+
148
+ result = processor._identify_root_nodes(standards_dict)
149
+
150
+ # Only ROOT_ID should be a root
151
+ assert "ROOT_ID" in result
152
+ assert "CHILD1_ID" not in result
153
+ assert "CHILD2_ID" not in result
154
+
155
+
156
+ class TestHierarchyFunctions:
157
+ """Test hierarchy functions (Task 4)."""
158
+
159
+ def test_find_root_id_for_root(self, sample_standard_set):
160
+ """Test finding root ID for a root node."""
161
+ processor = StandardSetProcessor()
162
+ standards_dict = {
163
+ std_id: std.model_dump()
164
+ for std_id, std in sample_standard_set.standards.items()
165
+ }
166
+ processor.id_to_standard = processor._build_id_to_standard_map(standards_dict)
167
+
168
+ root_std = standards_dict["ROOT_ID"]
169
+ root_id = processor.find_root_id(root_std, processor.id_to_standard)
170
+
171
+ assert root_id == "ROOT_ID"
172
+
173
+ def test_find_root_id_for_child(self, sample_standard_set):
174
+ """Test finding root ID for a child node."""
175
+ processor = StandardSetProcessor()
176
+ standards_dict = {
177
+ std_id: std.model_dump()
178
+ for std_id, std in sample_standard_set.standards.items()
179
+ }
180
+ processor.id_to_standard = processor._build_id_to_standard_map(standards_dict)
181
+
182
+ child_std = standards_dict["CHILD2_ID"]
183
+ root_id = processor.find_root_id(child_std, processor.id_to_standard)
184
+
185
+ assert root_id == "ROOT_ID"
186
+
187
+ def test_build_ordered_ancestors(self, sample_standard_set):
188
+ """Test building ordered ancestor list."""
189
+ processor = StandardSetProcessor()
190
+ standards_dict = {
191
+ std_id: std.model_dump()
192
+ for std_id, std in sample_standard_set.standards.items()
193
+ }
194
+ processor.id_to_standard = processor._build_id_to_standard_map(standards_dict)
195
+
196
+ # For root, ancestors should be empty
197
+ root_std = standards_dict["ROOT_ID"]
198
+ ancestors = processor.build_ordered_ancestors(root_std, processor.id_to_standard)
199
+ assert ancestors == []
200
+
201
+ # For child1, ancestors should be [ROOT_ID]
202
+ child1_std = standards_dict["CHILD1_ID"]
203
+ ancestors = processor.build_ordered_ancestors(child1_std, processor.id_to_standard)
204
+ assert ancestors == ["ROOT_ID"]
205
+
206
+ # For child2, ancestors should be [ROOT_ID, CHILD1_ID]
207
+ child2_std = standards_dict["CHILD2_ID"]
208
+ ancestors = processor.build_ordered_ancestors(child2_std, processor.id_to_standard)
209
+ assert ancestors == ["ROOT_ID", "CHILD1_ID"]
210
+
211
+ def test_compute_sibling_count(self, sample_standard_set):
212
+ """Test sibling count computation."""
213
+ processor = StandardSetProcessor()
214
+ standards_dict = {
215
+ std_id: std.model_dump()
216
+ for std_id, std in sample_standard_set.standards.items()
217
+ }
218
+ processor.parent_to_children = processor._build_parent_to_children_map(standards_dict)
219
+
220
+ # Root has no siblings
221
+ root_std = standards_dict["ROOT_ID"]
222
+ count = processor._compute_sibling_count(root_std)
223
+ assert count == 0
224
+
225
+ # Child1 has no siblings
226
+ child1_std = standards_dict["CHILD1_ID"]
227
+ count = processor._compute_sibling_count(child1_std)
228
+ assert count == 0
229
+
230
+ # Child2 has no siblings
231
+ child2_std = standards_dict["CHILD2_ID"]
232
+ count = processor._compute_sibling_count(child2_std)
233
+ assert count == 0
234
+
235
+
236
+ class TestContentGeneration:
237
+ """Test content text generation (Task 5)."""
238
+
239
+ def test_build_content_text_for_root(self, sample_standard_set):
240
+ """Test content generation for root node."""
241
+ processor = StandardSetProcessor()
242
+ standards_dict = {
243
+ std_id: std.model_dump()
244
+ for std_id, std in sample_standard_set.standards.items()
245
+ }
246
+ processor.id_to_standard = processor._build_id_to_standard_map(standards_dict)
247
+
248
+ root_std = standards_dict["ROOT_ID"]
249
+ content = processor._build_content_text(root_std)
250
+
251
+ assert content == "Depth 0: Math"
252
+
253
+ def test_build_content_text_for_child(self, sample_standard_set):
254
+ """Test content generation for child node with notation."""
255
+ processor = StandardSetProcessor()
256
+ standards_dict = {
257
+ std_id: std.model_dump()
258
+ for std_id, std in sample_standard_set.standards.items()
259
+ }
260
+ processor.id_to_standard = processor._build_id_to_standard_map(standards_dict)
261
+
262
+ child1_std = standards_dict["CHILD1_ID"]
263
+ content = processor._build_content_text(child1_std)
264
+
265
+ expected = "Depth 0: Math\nDepth 1 (1.1): Numbers"
266
+ assert content == expected
267
+
268
+ def test_build_content_text_for_deep_child(self, sample_standard_set):
269
+ """Test content generation for deep child node."""
270
+ processor = StandardSetProcessor()
271
+ standards_dict = {
272
+ std_id: std.model_dump()
273
+ for std_id, std in sample_standard_set.standards.items()
274
+ }
275
+ processor.id_to_standard = processor._build_id_to_standard_map(standards_dict)
276
+
277
+ child2_std = standards_dict["CHILD2_ID"]
278
+ content = processor._build_content_text(child2_std)
279
+
280
+ expected = "Depth 0: Math\nDepth 1 (1.1): Numbers\nDepth 2 (1.1.A): Count to 10"
281
+ assert content == expected
282
+
283
+ def test_build_content_text_without_notation(self):
284
+ """Test content generation without statement notation."""
285
+ processor = StandardSetProcessor()
286
+
287
+ # Create a standard without notation
288
+ standard = {
289
+ "id": "TEST_ID",
290
+ "depth": 1,
291
+ "description": "Test Description",
292
+ "parentId": "PARENT_ID",
293
+ }
294
+
295
+ parent = {
296
+ "id": "PARENT_ID",
297
+ "depth": 0,
298
+ "description": "Parent",
299
+ "parentId": None,
300
+ }
301
+
302
+ processor.id_to_standard = {"TEST_ID": standard, "PARENT_ID": parent}
303
+
304
+ content = processor._build_content_text(standard)
305
+
306
+ expected = "Depth 0: Parent\nDepth 1: Test Description"
307
+ assert content == expected
308
+
309
+
310
+ class TestRecordTransformation:
311
+ """Test record transformation (Task 6)."""
312
+
313
+ def test_transform_root_standard(self, sample_standard_set):
314
+ """Test transforming a root standard."""
315
+ processor = StandardSetProcessor()
316
+ processor._build_relationship_maps(sample_standard_set.standards)
317
+
318
+ root_standard = sample_standard_set.standards["ROOT_ID"]
319
+ record = processor._transform_standard(root_standard, sample_standard_set)
320
+
321
+ assert record.id == "ROOT_ID"
322
+ assert record.is_root is True
323
+ assert record.is_leaf is False
324
+ assert record.parent_id is None
325
+ assert record.root_id == "ROOT_ID"
326
+ assert record.ancestor_ids == []
327
+ assert record.depth == 0
328
+ assert record.content == "Depth 0: Math"
329
+
330
+ def test_transform_leaf_standard(self, sample_standard_set):
331
+ """Test transforming a leaf standard."""
332
+ processor = StandardSetProcessor()
333
+ processor._build_relationship_maps(sample_standard_set.standards)
334
+
335
+ leaf_standard = sample_standard_set.standards["CHILD2_ID"]
336
+ record = processor._transform_standard(leaf_standard, sample_standard_set)
337
+
338
+ assert record.id == "CHILD2_ID"
339
+ assert record.is_root is False
340
+ assert record.is_leaf is True
341
+ assert record.parent_id == "CHILD1_ID"
342
+ assert record.root_id == "ROOT_ID"
343
+ assert record.ancestor_ids == ["ROOT_ID", "CHILD1_ID"]
344
+ assert record.depth == 2
345
+ assert "Depth 0: Math" in record.content
346
+ assert "Depth 2 (1.1.A): Count to 10" in record.content
347
+
348
+ def test_transform_standard_with_optional_fields(self, sample_standard_set):
349
+ """Test transformation includes optional fields when present."""
350
+ processor = StandardSetProcessor()
351
+ processor._build_relationship_maps(sample_standard_set.standards)
352
+
353
+ standard = sample_standard_set.standards["CHILD2_ID"]
354
+ record = processor._transform_standard(standard, sample_standard_set)
355
+
356
+ assert record.statement_notation == "1.1.A"
357
+ assert record.statement_label == "Benchmark"
358
+
359
+ def test_transform_standard_without_optional_fields(self):
360
+ """Test transformation omits optional fields when missing."""
361
+ # Create a minimal standard set
362
+ standard = Standard(
363
+ id="MIN_ID",
364
+ position=100000,
365
+ depth=0,
366
+ description="Minimal",
367
+ ancestorIds=[],
368
+ parentId=None,
369
+ )
370
+
371
+ standard_set = StandardSet(
372
+ id="SET_ID",
373
+ title="Test",
374
+ subject="Test",
375
+ normalizedSubject="Test",
376
+ educationLevels=["01"],
377
+ license={"title": "CC", "URL": "https://example.com", "rightsHolder": "Test"},
378
+ document={"id": "DOC", "title": "Doc", "valid": "2021", "publicationStatus": "Published"},
379
+ jurisdiction={"id": "JUR", "title": "Jur"},
380
+ standards={"MIN_ID": standard},
381
+ )
382
+
383
+ processor = StandardSetProcessor()
384
+ processor._build_relationship_maps(standard_set.standards)
385
+
386
+ record = processor._transform_standard(standard, standard_set)
387
+
388
+ assert record.asn_identifier is None
389
+ assert record.statement_notation is None
390
+ assert record.statement_label is None
391
+
392
+ def test_process_standard_set(self, sample_standard_set):
393
+ """Test processing entire standard set."""
394
+ processor = StandardSetProcessor()
395
+ processed_set = processor.process_standard_set(sample_standard_set)
396
+
397
+ assert len(processed_set.records) == 3
398
+ assert all(isinstance(r, type(processed_set.records[0])) for r in processed_set.records)
399
+
400
+
401
+ class TestFileOperations:
402
+ """Test file operations (Task 7)."""
403
+
404
+ def test_process_and_save(self, tmp_path, sample_standard_set):
405
+ """Test processing and saving to file."""
406
+ # Create temporary directory structure
407
+ set_dir = tmp_path / "standardSets" / "TEST_SET_ID"
408
+ set_dir.mkdir(parents=True)
409
+
410
+ # Save sample data.json
411
+ data_file = set_dir / "data.json"
412
+ response_data = {"data": sample_standard_set.model_dump(mode="json")}
413
+ with open(data_file, "w", encoding="utf-8") as f:
414
+ json.dump(response_data, f)
415
+
416
+ # Mock the settings to use tmp_path
417
+ from unittest.mock import patch
418
+ from tools.config import ToolsSettings
419
+
420
+ with patch("tools.pinecone_processor.settings") as mock_settings:
421
+ mock_settings.standard_sets_dir = tmp_path / "standardSets"
422
+
423
+ processed_file = process_and_save("TEST_SET_ID")
424
+
425
+ assert processed_file.exists()
426
+ assert processed_file.name == "processed.json"
427
+
428
+ # Verify content
429
+ with open(processed_file, encoding="utf-8") as f:
430
+ data = json.load(f)
431
+
432
+ assert "records" in data
433
+ assert len(data["records"]) == 3
434
+
435
+ def test_process_and_save_missing_file(self):
436
+ """Test error handling for missing data.json."""
437
+ from unittest.mock import patch
438
+ from tools.config import ToolsSettings
439
+
440
+ with patch("tools.pinecone_processor.settings") as mock_settings:
441
+ mock_settings.standard_sets_dir = Path("/nonexistent/path")
442
+
443
+ with pytest.raises(FileNotFoundError):
444
+ process_and_save("NONEXISTENT_SET")
445
+
446
+ def test_process_and_save_invalid_json(self, tmp_path):
447
+ """Test error handling for invalid JSON."""
448
+ set_dir = tmp_path / "standardSets" / "TEST_SET_ID"
449
+ set_dir.mkdir(parents=True)
450
+
451
+ # Write invalid JSON
452
+ data_file = set_dir / "data.json"
453
+ with open(data_file, "w", encoding="utf-8") as f:
454
+ f.write("{ invalid json }")
455
+
456
+ from unittest.mock import patch
457
+
458
+ with patch("tools.pinecone_processor.settings") as mock_settings:
459
+ mock_settings.standard_sets_dir = tmp_path / "standardSets"
460
+
461
+ with pytest.raises(ValueError, match="Invalid JSON"):
462
+ process_and_save("TEST_SET_ID")
463
+
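These are plain pytest tests, so they can be driven from the command line or programmatically. A minimal programmatic run might look like the sketch below; the test module path is an assumption about the repository layout, not something stated in this commit.

```python
import sys

import pytest

# Path is assumed from the repo layout; adjust if the test module lives elsewhere.
sys.exit(pytest.main(["-v", "tests/test_pinecone_processor.py"]))
```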
tools/__init__.py ADDED
@@ -0,0 +1,2 @@
1
+ """Build and maintenance tools."""
2
+
tools/api_client.py ADDED
@@ -0,0 +1,435 @@
1
+ """API client for Common Standards Project with retry logic and rate limiting."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import json
6
+ import time
7
+ from typing import Any
8
+
9
+ import requests
10
+ from loguru import logger
11
+
12
+ from tools.config import get_settings
13
+ from tools.models import (
14
+ Jurisdiction,
15
+ JurisdictionDetails,
16
+ StandardSet,
17
+ StandardSetReference,
18
+ )
19
+
20
+ settings = get_settings()
21
+
22
+ # Cache file for jurisdictions
23
+ JURISDICTIONS_CACHE_FILE = settings.raw_data_dir / "jurisdictions.json"
24
+
25
+ # Rate limiting: Max requests per minute
26
+ MAX_REQUESTS_PER_MINUTE = settings.max_requests_per_minute
27
+ _request_timestamps: list[float] = []
28
+
29
+
30
+ class APIError(Exception):
31
+ """Raised when API request fails after all retries."""
32
+
33
+ pass
34
+
35
+
36
+ def _get_headers() -> dict[str, str]:
37
+ """Get authentication headers for API requests."""
38
+ if not settings.csp_api_key:
39
+ logger.error("CSP_API_KEY not found in .env file")
40
+ raise ValueError("CSP_API_KEY environment variable not set")
41
+ return {"Api-Key": settings.csp_api_key}
42
+
43
+
44
+ def _enforce_rate_limit() -> None:
45
+ """Enforce rate limiting by tracking request timestamps."""
46
+ global _request_timestamps
47
+ now = time.time()
48
+
49
+ # Remove timestamps older than 1 minute
50
+ _request_timestamps = [ts for ts in _request_timestamps if now - ts < 60]
51
+
52
+ # If at limit, wait
53
+ if len(_request_timestamps) >= MAX_REQUESTS_PER_MINUTE:
54
+ sleep_time = 60 - (now - _request_timestamps[0])
55
+ logger.warning(f"Rate limit reached. Waiting {sleep_time:.1f} seconds...")
56
+ time.sleep(sleep_time)
57
+ _request_timestamps = []
58
+
59
+ _request_timestamps.append(now)
60
+
61
+
62
+ def _make_request(
63
+ endpoint: str, params: dict[str, Any] | None = None, max_retries: int = 3
64
+ ) -> dict[str, Any]:
65
+ """
66
+ Make API request with exponential backoff retry logic.
67
+
68
+ Args:
69
+ endpoint: API endpoint path (e.g., "/jurisdictions")
70
+ params: Query parameters
71
+ max_retries: Maximum number of retry attempts
72
+
73
+ Returns:
74
+ Parsed JSON response
75
+
76
+ Raises:
77
+ APIError: After all retries exhausted or on fatal errors
78
+ """
79
+ url = f"{settings.csp_base_url}{endpoint}"
80
+ headers = _get_headers()
81
+
82
+ for attempt in range(max_retries):
83
+ try:
84
+ _enforce_rate_limit()
85
+
86
+ logger.debug(
87
+ f"API request: {endpoint} (attempt {attempt + 1}/{max_retries})"
88
+ )
89
+ response = requests.get(url, headers=headers, params=params, timeout=30)
90
+
91
+ # Handle specific status codes
92
+ if response.status_code == 401:
93
+ logger.error("Invalid API key (401 Unauthorized)")
94
+ raise APIError("Authentication failed. Check your CSP_API_KEY in .env")
95
+
96
+ if response.status_code == 404:
97
+ logger.error(f"Resource not found (404): {endpoint}")
98
+ raise APIError(f"Resource not found: {endpoint}")
99
+
100
+ if response.status_code == 429:
101
+ # Rate limited by server
102
+ retry_after = int(response.headers.get("Retry-After", 60))
103
+ logger.warning(
104
+ f"Server rate limit hit. Waiting {retry_after} seconds..."
105
+ )
106
+ time.sleep(retry_after)
107
+ continue
108
+
109
+ response.raise_for_status()
110
+ logger.info(f"API request successful: {endpoint}")
111
+ return response.json()
112
+
113
+ except requests.exceptions.Timeout:
114
+ wait_time = 2**attempt # Exponential backoff: 1s, 2s, 4s
115
+ logger.warning(f"Request timeout. Retrying in {wait_time}s...")
116
+ if attempt < max_retries - 1:
117
+ time.sleep(wait_time)
118
+ else:
119
+ raise APIError(f"Request timeout after {max_retries} attempts")
120
+
121
+ except requests.exceptions.ConnectionError:
122
+ wait_time = 2**attempt
123
+ logger.warning(f"Connection error. Retrying in {wait_time}s...")
124
+ if attempt < max_retries - 1:
125
+ time.sleep(wait_time)
126
+ else:
127
+ raise APIError(f"Connection failed after {max_retries} attempts")
128
+
129
+ except requests.exceptions.HTTPError as e:
130
+ # Don't retry on 4xx errors (except 429)
131
+ if 400 <= response.status_code < 500 and response.status_code != 429:
132
+ raise APIError(f"HTTP {response.status_code}: {response.text}")
133
+ # Retry on 5xx errors
134
+ wait_time = 2**attempt
135
+ logger.warning(
136
+ f"Server error {response.status_code}. Retrying in {wait_time}s..."
137
+ )
138
+ if attempt < max_retries - 1:
139
+ time.sleep(wait_time)
140
+ else:
141
+ raise APIError(f"Server error after {max_retries} attempts")
142
+
143
+ raise APIError("Request failed after all retries")
144
+
145
+
146
+ def get_jurisdictions(
147
+ search_term: str | None = None,
148
+ type_filter: str | None = None,
149
+ force_refresh: bool = False,
150
+ ) -> list[Jurisdiction]:
151
+ """
152
+ Fetch all jurisdictions from the API or local cache.
153
+
154
+ Jurisdictions are cached locally in data/raw/jurisdictions.json to avoid
155
+ repeated API calls. Use force_refresh=True to fetch fresh data from the API.
156
+
157
+ Args:
158
+ search_term: Optional filter for jurisdiction title (case-insensitive partial match)
159
+ type_filter: Optional filter for jurisdiction type (case-insensitive).
160
+ Valid values: "school", "organization", "state", "nation"
161
+ force_refresh: If True, fetch fresh data from API and update cache
162
+
163
+ Returns:
164
+ List of Jurisdiction models
165
+ """
166
+ jurisdictions: list[Jurisdiction] = []
167
+ raw_data: list[dict[str, Any]] = []
168
+
169
+ # Check cache first (unless forcing refresh)
170
+ if not force_refresh and JURISDICTIONS_CACHE_FILE.exists():
171
+ try:
172
+ logger.info("Loading jurisdictions from cache")
173
+ with open(JURISDICTIONS_CACHE_FILE, encoding="utf-8") as f:
174
+ cached_response = json.load(f)
175
+ raw_data = cached_response.get("data", [])
176
+ logger.info(f"Loaded {len(raw_data)} jurisdictions from cache")
177
+ except (json.JSONDecodeError, IOError) as e:
178
+ logger.warning(f"Failed to load cache: {e}. Fetching from API...")
179
+ force_refresh = True
180
+
181
+ # Fetch from API if cache doesn't exist or force_refresh is True
182
+ if force_refresh or not raw_data:
183
+ logger.info("Fetching jurisdictions from API")
184
+ response = _make_request("/jurisdictions")
185
+ raw_data = response.get("data", [])
186
+
187
+ # Save to cache
188
+ try:
189
+ settings.raw_data_dir.mkdir(parents=True, exist_ok=True)
190
+ with open(JURISDICTIONS_CACHE_FILE, "w", encoding="utf-8") as f:
191
+ json.dump(response, f, indent=2, ensure_ascii=False)
192
+ logger.info(
193
+ f"Cached {len(raw_data)} jurisdictions to {JURISDICTIONS_CACHE_FILE}"
194
+ )
195
+ except IOError as e:
196
+ logger.warning(f"Failed to save cache: {e}")
197
+
198
+ # Parse into Pydantic models
199
+ jurisdictions = [Jurisdiction(**j) for j in raw_data]
200
+
201
+ # Apply type filter if provided (case-insensitive)
202
+ if type_filter:
203
+ type_lower = type_filter.lower()
204
+ original_count = len(jurisdictions)
205
+ jurisdictions = [j for j in jurisdictions if j.type.lower() == type_lower]
206
+ logger.info(
207
+ f"Filtered to {len(jurisdictions)} jurisdictions of type '{type_filter}' (from {original_count})"
208
+ )
209
+
210
+ # Apply search filter if provided (case-insensitive partial match)
211
+ if search_term:
212
+ search_lower = search_term.lower()
213
+ original_count = len(jurisdictions)
214
+ jurisdictions = [j for j in jurisdictions if search_lower in j.title.lower()]
215
+ logger.info(
216
+ f"Filtered to {len(jurisdictions)} jurisdictions matching '{search_term}' (from {original_count})"
217
+ )
218
+
219
+ return jurisdictions
220
+
221
+
222
+ def get_jurisdiction_details(
223
+ jurisdiction_id: str, force_refresh: bool = False, hide_hidden_sets: bool = True
224
+ ) -> JurisdictionDetails:
225
+ """
226
+ Fetch jurisdiction metadata including standard set references.
227
+
228
+ Jurisdiction metadata is cached locally in data/raw/jurisdictions/{jurisdiction_id}/data.json
229
+ to avoid repeated API calls. Use force_refresh=True to fetch fresh data from the API.
230
+
231
+ Note: This returns metadata about standard sets (IDs, titles, subjects) but NOT the
232
+ full standard set content. Use download_standard_set() to get full standard set data.
233
+
234
+ Args:
235
+ jurisdiction_id: The jurisdiction GUID
236
+ force_refresh: If True, fetch fresh data from API and update cache
237
+ hide_hidden_sets: If True, hide deprecated/outdated sets (default: True)
238
+
239
+ Returns:
240
+ JurisdictionDetails model with jurisdiction metadata and standardSets array
241
+ """
242
+ cache_dir = settings.raw_data_dir / "jurisdictions" / jurisdiction_id
243
+ cache_file = cache_dir / "data.json"
244
+ raw_data: dict[str, Any] = {}
245
+
246
+ # Check cache first (unless forcing refresh)
247
+ if not force_refresh and cache_file.exists():
248
+ try:
249
+ logger.info(f"Loading jurisdiction {jurisdiction_id} from cache")
250
+ with open(cache_file, encoding="utf-8") as f:
251
+ cached_response = json.load(f)
252
+ raw_data = cached_response.get("data", {})
253
+ logger.info(f"Loaded jurisdiction metadata from cache")
254
+ except (json.JSONDecodeError, IOError) as e:
255
+ logger.warning(f"Failed to load cache: {e}. Fetching from API...")
256
+ force_refresh = True
257
+
258
+ # Fetch from API if cache doesn't exist or force_refresh is True
259
+ if force_refresh or not raw_data:
260
+ logger.info(f"Fetching jurisdiction {jurisdiction_id} from API")
261
+ params = {"hideHiddenSets": "true" if hide_hidden_sets else "false"}
262
+ response = _make_request(f"/jurisdictions/{jurisdiction_id}", params=params)
263
+ raw_data = response.get("data", {})
264
+
265
+ # Save to cache
266
+ try:
267
+ cache_dir.mkdir(parents=True, exist_ok=True)
268
+ with open(cache_file, "w", encoding="utf-8") as f:
269
+ json.dump(response, f, indent=2, ensure_ascii=False)
270
+ logger.info(f"Cached jurisdiction metadata to {cache_file}")
271
+ except IOError as e:
272
+ logger.warning(f"Failed to save cache: {e}")
273
+
274
+ # Parse into Pydantic model
275
+ return JurisdictionDetails(**raw_data)
276
+
277
+
278
+ def download_standard_set(set_id: str, force_refresh: bool = False) -> StandardSet:
279
+ """
280
+ Download full standard set data with caching.
281
+
282
+ Standard set data is cached locally in data/raw/standardSets/{set_id}/data.json
283
+ to avoid repeated API calls. Use force_refresh=True to fetch fresh data from the API.
284
+
285
+ Args:
286
+ set_id: The standard set GUID
287
+ force_refresh: If True, fetch fresh data from API and update cache
288
+
289
+ Returns:
290
+ StandardSet model with complete standard set data including hierarchy
291
+ """
292
+ cache_dir = settings.raw_data_dir / "standardSets" / set_id
293
+ cache_file = cache_dir / "data.json"
294
+ raw_data: dict[str, Any] = {}
295
+
296
+ # Check cache first (unless forcing refresh)
297
+ if not force_refresh and cache_file.exists():
298
+ try:
299
+ logger.info(f"Loading standard set {set_id} from cache")
300
+ with open(cache_file, encoding="utf-8") as f:
301
+ cached_response = json.load(f)
302
+ raw_data = cached_response.get("data", {})
303
+ logger.info(f"Loaded standard set from cache")
304
+ except (json.JSONDecodeError, IOError) as e:
305
+ logger.warning(f"Failed to load cache: {e}. Fetching from API...")
306
+ force_refresh = True
307
+
308
+ # Fetch from API if cache doesn't exist or force_refresh is True
309
+ if force_refresh or not raw_data:
310
+ logger.info(f"Downloading standard set {set_id} from API")
311
+ response = _make_request(f"/standard_sets/{set_id}")
312
+ raw_data = response.get("data", {})
313
+
314
+ # Save to cache
315
+ try:
316
+ cache_dir.mkdir(parents=True, exist_ok=True)
317
+ with open(cache_file, "w", encoding="utf-8") as f:
318
+ json.dump(response, f, indent=2, ensure_ascii=False)
319
+ logger.info(f"Cached standard set to {cache_file}")
320
+ except IOError as e:
321
+ logger.warning(f"Failed to save cache: {e}")
322
+
323
+ # Parse into Pydantic model
324
+ return StandardSet(**raw_data)
325
+
326
+
327
+ def _filter_standard_set(
328
+ standard_set: StandardSetReference,
329
+ education_levels: list[str] | None = None,
330
+ publication_status: str | None = None,
331
+ valid_year: str | None = None,
332
+ title_search: str | None = None,
333
+ subject_search: str | None = None,
334
+ ) -> bool:
335
+ """
336
+ Check if a standard set matches all provided filters (AND logic).
337
+
338
+ Args:
339
+ standard_set: StandardSetReference model from jurisdiction metadata
340
+ education_levels: List of grade levels to match (any match)
341
+ publication_status: Publication status to match
342
+ valid_year: Valid year string to match
343
+ title_search: Partial string match on title (case-insensitive)
344
+ subject_search: Partial string match on subject (case-insensitive)
345
+
346
+ Returns:
347
+ True if standard set matches all provided filters
348
+ """
349
+ # Filter by education levels (any match)
350
+ if education_levels:
351
+ set_levels = {level.upper() for level in standard_set.educationLevels}
352
+ filter_levels = {level.upper() for level in education_levels}
353
+ if not set_levels.intersection(filter_levels):
354
+ return False
355
+
356
+ # Filter by publication status
357
+ if publication_status:
358
+ if (
359
+ standard_set.document.publicationStatus
360
+ and standard_set.document.publicationStatus.lower()
361
+ != publication_status.lower()
362
+ ):
363
+ return False
364
+
365
+ # Filter by valid year
366
+ if valid_year:
367
+ if standard_set.document.valid != valid_year:
368
+ return False
369
+
370
+ # Filter by title search (partial match, case-insensitive)
371
+ if title_search:
372
+ if title_search.lower() not in standard_set.title.lower():
373
+ return False
374
+
375
+ # Filter by subject search (partial match, case-insensitive)
376
+ if subject_search:
377
+ if subject_search.lower() not in standard_set.subject.lower():
378
+ return False
379
+
380
+ return True
381
+
382
+
383
+ def download_standard_sets_by_jurisdiction(
384
+ jurisdiction_id: str,
385
+ force_refresh: bool = False,
386
+ education_levels: list[str] | None = None,
387
+ publication_status: str | None = None,
388
+ valid_year: str | None = None,
389
+ title_search: str | None = None,
390
+ subject_search: str | None = None,
391
+ ) -> list[str]:
392
+ """
393
+ Download standard sets for a jurisdiction with optional filtering.
394
+
395
+ Args:
396
+ jurisdiction_id: The jurisdiction GUID
397
+ force_refresh: If True, force refresh all downloads (ignores cache)
398
+ education_levels: List of grade levels to filter by
399
+ publication_status: Publication status to filter by
400
+ valid_year: Valid year string to filter by
401
+ title_search: Partial string match on title
402
+ subject_search: Partial string match on subject
403
+
404
+ Returns:
405
+ List of downloaded standard set IDs
406
+ """
407
+ # Get jurisdiction metadata
408
+ jurisdiction_data = get_jurisdiction_details(jurisdiction_id, force_refresh=False)
409
+ standard_sets = jurisdiction_data.standardSets
410
+
411
+ # Apply filters
412
+ filtered_sets = [
413
+ s
414
+ for s in standard_sets
415
+ if _filter_standard_set(
416
+ s,
417
+ education_levels=education_levels,
418
+ publication_status=publication_status,
419
+ valid_year=valid_year,
420
+ title_search=title_search,
421
+ subject_search=subject_search,
422
+ )
423
+ ]
424
+
425
+ # Download each filtered standard set
426
+ downloaded_ids = []
427
+ for standard_set in filtered_sets:
428
+ set_id = standard_set.id
429
+ try:
430
+ download_standard_set(set_id, force_refresh=force_refresh)
431
+ downloaded_ids.append(set_id)
432
+ except Exception as e:
433
+ logger.error(f"Failed to download standard set {set_id}: {e}")
434
+
435
+ return downloaded_ids
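Taken together, these functions form a small cache-aware download pipeline: list jurisdictions, pick one, then pull and cache its standard sets. A hedged usage sketch (the jurisdiction GUID is a placeholder, and `CSP_API_KEY` must be set in `.env` for any cache miss):

```python
from tools import api_client

# Served from data/raw/jurisdictions.json after the first call
states = api_client.get_jurisdictions(type_filter="state")
for j in states[:5]:
    print(j.id, j.title)

# Download every published math set for one jurisdiction (placeholder GUID)
downloaded = api_client.download_standard_sets_by_jurisdiction(
    "JURISDICTION_GUID",
    publication_status="Published",
    subject_search="math",
)
print(f"Downloaded {len(downloaded)} standard set(s)")
```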
tools/cli.py ADDED
@@ -0,0 +1,752 @@
1
+ """CLI entry point for EduMatch Data Management."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import sys
6
+ from pathlib import Path
7
+
8
+ # Add project root to Python path
9
+ project_root = Path(__file__).parent.parent
10
+ if str(project_root) not in sys.path:
11
+ sys.path.insert(0, str(project_root))
12
+
13
+ import typer
14
+ from loguru import logger
15
+ from rich.console import Console
16
+ from rich.table import Table
17
+
18
+ from tools import api_client, data_manager
19
+ from tools.config import get_settings
20
+ from tools.pinecone_processor import process_and_save
21
+
22
+ settings = get_settings()
23
+
24
+ # Configure logger
25
+ logger.remove() # Remove default handler
26
+ logger.add(
27
+ sys.stderr,
28
+ format="<green>{time:HH:mm:ss}</green> | <level>{level: <8}</level> | <level>{message}</level>",
29
+ )
30
+ logger.add(
31
+ settings.log_file,
32
+ rotation=settings.log_rotation,
33
+ retention=settings.log_retention,
34
+ format="{time} | {level} | {message}",
35
+ )
36
+
37
+ app = typer.Typer(help="Common Core MCP CLI - Manage educational standards data")
38
+ console = Console()
39
+
40
+
41
+ @app.command()
42
+ def jurisdictions(
43
+ search: str = typer.Option(
44
+ None,
45
+ "--search",
46
+ "-s",
47
+ help="Filter by jurisdiction name (case-insensitive partial match)",
48
+ ),
49
+ type: str = typer.Option(
50
+ None,
51
+ "--type",
52
+ "-t",
53
+ help="Filter by jurisdiction type: school, organization, state, or nation",
54
+ ),
55
+ force: bool = typer.Option(
56
+ False, "--force", "-f", help="Force refresh from API, ignoring local cache"
57
+ ),
58
+ ):
59
+ """
60
+ List all available jurisdictions (states/organizations).
61
+
62
+ By default, jurisdictions are loaded from local cache (data/raw/jurisdictions.json)
63
+ to avoid repeated API calls. Use --force to fetch fresh data from the API and update
64
+ the cache. The cache is automatically created on first use.
65
+
66
+ Filters can be combined: use --search to filter by name and --type to filter by type.
67
+ """
68
+ try:
69
+ if force:
70
+ console.print("[yellow]Forcing refresh from API...[/yellow]")
71
+
72
+ # Validate type filter if provided
73
+ if type:
74
+ valid_types = {"school", "organization", "state", "nation"}
75
+ if type.lower() not in valid_types:
76
+ console.print(
77
+ f"[red]Error: Invalid type '{type}'. Must be one of: {', '.join(sorted(valid_types))}[/red]"
78
+ )
79
+ raise typer.Exit(code=1)
80
+
81
+ results = api_client.get_jurisdictions(
82
+ search_term=search, type_filter=type, force_refresh=force
83
+ )
84
+
85
+ table = Table("ID", "Title", "Type", title="Jurisdictions")
86
+ for j in results:
87
+ table.add_row(j.id, j.title, j.type)
88
+
89
+ console.print(table)
90
+ console.print(f"\n[green]Found {len(results)} jurisdictions[/green]")
91
+
92
+ if not force:
93
+ console.print("[dim]Tip: Use --force to refresh from API[/dim]")
94
+
95
+ except Exception as e:
96
+ console.print(f"[red]Error: {e}[/red]")
97
+ logger.exception("Failed to fetch jurisdictions")
98
+ raise typer.Exit(code=1)
99
+
100
+
101
+ @app.command()
102
+ def jurisdiction_details(
103
+ jurisdiction_id: str = typer.Argument(..., help="Jurisdiction ID"),
104
+ force: bool = typer.Option(
105
+ False, "--force", "-f", help="Force refresh from API, ignoring local cache"
106
+ ),
107
+ ):
108
+ """
109
+ Download and display jurisdiction metadata including standard set references.
110
+
111
+ By default, jurisdiction metadata is loaded from local cache (data/raw/jurisdictions/{id}/data.json)
112
+ to avoid repeated API calls. Use --force to fetch fresh data from the API and update the cache.
113
+ The cache is automatically created on first use.
114
+
115
+ Note: This command downloads metadata about standard sets (IDs, titles, subjects) but NOT
116
+ the full standard set content. Use the 'download' command to get full standard set data.
117
+ """
118
+ try:
119
+ if force:
120
+ console.print("[yellow]Forcing refresh from API...[/yellow]")
121
+
122
+ jurisdiction_data = api_client.get_jurisdiction_details(
123
+ jurisdiction_id, force_refresh=force
124
+ )
125
+
126
+ # Display jurisdiction info
127
+ console.print(f"\n[bold]Jurisdiction:[/bold] {jurisdiction_data.title}")
128
+ console.print(f"[bold]Type:[/bold] {jurisdiction_data.type}")
129
+ console.print(f"[bold]ID:[/bold] {jurisdiction_data.id}")
130
+
131
+ # Display standard sets
132
+ standard_sets = jurisdiction_data.standardSets
133
+ if standard_sets:
134
+ table = Table(
135
+ "Set ID", "Subject", "Title", "Grade Levels", title="Standard Sets"
136
+ )
137
+ for s in standard_sets:
138
+ grade_levels = ", ".join(s.educationLevels)
139
+ table.add_row(
140
+ s.id,
141
+ s.subject,
142
+ s.title,
143
+ grade_levels or "N/A",
144
+ )
145
+
146
+ console.print("\n")
147
+ console.print(table)
148
+ console.print(f"\n[green]Found {len(standard_sets)} standard sets[/green]")
149
+ else:
150
+ console.print("\n[yellow]No standard sets found[/yellow]")
151
+
152
+ if not force:
153
+ console.print("[dim]Tip: Use --force to refresh from API[/dim]")
154
+
155
+ except Exception as e:
156
+ console.print(f"[red]Error: {e}[/red]")
157
+ logger.exception("Failed to fetch jurisdiction details")
158
+ raise typer.Exit(code=1)
159
+
160
+
161
+ @app.command("download-sets")
162
+ def download_sets(
163
+ set_id: str = typer.Argument(None, help="Standard set ID (if downloading by ID)"),
164
+ jurisdiction: str = typer.Option(
165
+ None,
166
+ "--jurisdiction",
167
+ "-j",
168
+ help="Jurisdiction ID (if downloading by jurisdiction)",
169
+ ),
170
+ force: bool = typer.Option(
171
+ False, "--force", "-f", help="Force refresh from API, ignoring local cache"
172
+ ),
173
+ yes: bool = typer.Option(
174
+ False,
175
+ "--yes",
176
+ "-y",
177
+ help="Skip confirmation prompt when downloading by jurisdiction",
178
+ ),
179
+ dry_run: bool = typer.Option(
180
+ False,
181
+ "--dry-run",
182
+ help="Show what would be downloaded without actually downloading",
183
+ ),
184
+ education_levels: str = typer.Option(
185
+ None,
186
+ "--education-levels",
187
+ help="Comma-separated grade levels (e.g., '03,04,05')",
188
+ ),
189
+ publication_status: str = typer.Option(
190
+ None,
191
+ "--publication-status",
192
+ help="Publication status filter (e.g., 'Published', 'Deprecated')",
193
+ ),
194
+ valid_year: str = typer.Option(
195
+ None, "--valid-year", help="Valid year filter (e.g., '2012')"
196
+ ),
197
+ title: str = typer.Option(
198
+ None, "--title", help="Partial title match (case-insensitive)"
199
+ ),
200
+ subject: str = typer.Option(
201
+ None, "--subject", help="Partial subject match (case-insensitive)"
202
+ ),
203
+ ):
204
+ """
205
+ Download standard sets either by ID or by jurisdiction with filtering.
206
+
207
+ When downloading by jurisdiction, filters can be applied and all filters combine with AND logic.
208
+ A confirmation prompt will be shown listing all standard sets that will be downloaded.
209
+
210
+ Use --dry-run to preview what would be downloaded without actually downloading anything.
211
+ """
212
+ try:
213
+ # Validate arguments
214
+ if not set_id and not jurisdiction:
215
+ console.print(
216
+ "[red]Error: Must provide either set_id or --jurisdiction[/red]"
217
+ )
218
+ raise typer.Exit(code=1)
219
+
220
+ if set_id and jurisdiction:
221
+ console.print(
222
+ "[red]Error: Cannot specify both set_id and --jurisdiction[/red]"
223
+ )
224
+ raise typer.Exit(code=1)
225
+
226
+ # Download by ID
227
+ if set_id:
228
+ if dry_run:
229
+ console.print(
230
+ f"[yellow][DRY RUN] Would download standard set: {set_id}[/yellow]"
231
+ )
232
+ cache_path = Path("data/raw/standardSets") / set_id / "data.json"
233
+ console.print(f" Would cache to: {cache_path}")
234
+ return
235
+
236
+ with console.status(f"[bold blue]Downloading standard set {set_id}..."):
237
+ api_client.download_standard_set(set_id, force_refresh=force)
238
+
239
+ cache_path = Path("data/raw/standardSets") / set_id / "data.json"
240
+ console.print("[green]✓ Successfully downloaded standard set[/green]")
241
+ console.print(f" Cached to: {cache_path}")
242
+
243
+ # Process the downloaded set
244
+ try:
245
+ with console.status(f"[bold blue]Processing standard set {set_id}..."):
246
+ processed_path = process_and_save(set_id)
247
+ console.print("[green]✓ Successfully processed standard set[/green]")
248
+ console.print(f" Processed to: {processed_path}")
249
+ except FileNotFoundError:
250
+ console.print(
251
+ "[yellow]Warning: data.json not found, skipping processing[/yellow]"
252
+ )
253
+ except Exception as e:
254
+ console.print(
255
+ f"[yellow]Warning: Failed to process standard set: {e}[/yellow]"
256
+ )
257
+ logger.exception(f"Failed to process standard set {set_id}")
258
+
259
+ return
260
+
261
+ # Download by jurisdiction
262
+ if jurisdiction:
263
+ # Parse education levels
264
+ education_levels_list = None
265
+ if education_levels:
266
+ education_levels_list = [
267
+ level.strip() for level in education_levels.split(",")
268
+ ]
269
+
270
+ # Get jurisdiction metadata
271
+ jurisdiction_data = api_client.get_jurisdiction_details(
272
+ jurisdiction, force_refresh=False
273
+ )
274
+ all_sets = jurisdiction_data.standardSets
275
+
276
+ # Apply filters using the API client's filter function
277
+ from tools.api_client import _filter_standard_set
278
+
279
+ filtered_sets = [
280
+ s
281
+ for s in all_sets
282
+ if _filter_standard_set(
283
+ s,
284
+ education_levels=education_levels_list,
285
+ publication_status=publication_status,
286
+ valid_year=valid_year,
287
+ title_search=title,
288
+ subject_search=subject,
289
+ )
290
+ ]
291
+
292
+ if not filtered_sets:
293
+ console.print(
294
+ "[yellow]No standard sets match the provided filters.[/yellow]"
295
+ )
296
+ return
297
+
298
+ # Display filtered sets
299
+ if dry_run:
300
+ console.print(
301
+ f"\n[yellow][DRY RUN] Standard sets that would be downloaded ({len(filtered_sets)}):[/yellow]"
302
+ )
303
+ else:
304
+ console.print(
305
+ f"\n[bold]Standard sets to download ({len(filtered_sets)}):[/bold]"
306
+ )
307
+
308
+ table = Table(
309
+ "Set ID",
310
+ "Subject",
311
+ "Title",
312
+ "Grade Levels",
313
+ "Status",
314
+ "Year",
315
+ "Downloaded",
316
+ title="Standard Sets",
317
+ )
318
+ for s in filtered_sets:
319
+ display_id = s.id[:20] + "..." if len(s.id) > 20 else s.id
320
+ # Check if already downloaded
321
+ set_data_path = settings.standard_sets_dir / s.id / "data.json"
322
+ is_downloaded = set_data_path.exists()
323
+ downloaded_status = (
324
+ "[green]✓[/green]" if is_downloaded else "[yellow]✗[/yellow]"
325
+ )
326
+ table.add_row(
327
+ display_id,
328
+ s.subject,
329
+ s.title[:40],
330
+ ", ".join(s.educationLevels),
331
+ s.document.publicationStatus or "N/A",
332
+ s.document.valid,
333
+ downloaded_status,
334
+ )
335
+ console.print(table)
336
+
337
+ # If dry run, show summary and exit
338
+ if dry_run:
339
+ console.print(
340
+ f"\n[yellow][DRY RUN] Would download {len(filtered_sets)} standard set(s)[/yellow]"
341
+ )
342
+ console.print(
343
+ "[dim]Run without --dry-run to actually download these standard sets.[/dim]"
344
+ )
345
+ return
346
+
347
+ # Confirmation prompt
348
+ if not yes:
349
+ if not typer.confirm(
350
+ f"\nDownload {len(filtered_sets)} standard set(s)?"
351
+ ):
352
+ console.print("[yellow]Download cancelled.[/yellow]")
353
+ return
354
+
355
+ # Download each standard set
356
+ console.print(
357
+ f"\n[bold blue]Downloading {len(filtered_sets)} standard set(s)...[/bold blue]"
358
+ )
359
+ downloaded = 0
360
+ failed = 0
361
+
362
+ for i, standard_set in enumerate(filtered_sets, 1):
363
+ set_id = standard_set.id
364
+ try:
365
+ with console.status(
366
+ f"[bold blue][{i}/{len(filtered_sets)}] Downloading {set_id[:20]}..."
367
+ ):
368
+ api_client.download_standard_set(set_id, force_refresh=force)
369
+ downloaded += 1
370
+
371
+ # Process the downloaded set
372
+ try:
373
+ with console.status(
374
+ f"[bold blue][{i}/{len(filtered_sets)}] Processing {set_id[:20]}..."
375
+ ):
376
+ process_and_save(set_id)
377
+ except FileNotFoundError:
378
+ console.print(
379
+ f"[yellow]Warning: Skipping processing for {set_id[:20]}... (data.json not found)[/yellow]"
380
+ )
381
+ except Exception as e:
382
+ console.print(
383
+ f"[yellow]Warning: Failed to process {set_id[:20]}...: {e}[/yellow]"
384
+ )
385
+ logger.exception(f"Failed to process standard set {set_id}")
386
+
387
+ except Exception as e:
388
+ console.print(f"[red]✗ Failed to download {set_id}: {e}[/red]")
389
+ logger.exception(f"Failed to download standard set {set_id}")
390
+ failed += 1
391
+
392
+ # Summary
393
+ console.print(
394
+ f"\n[green]✓ Successfully downloaded {downloaded} standard set(s)[/green]"
395
+ )
396
+ if failed > 0:
397
+ console.print(
398
+ f"[red]✗ Failed to download {failed} standard set(s)[/red]"
399
+ )
400
+
401
+ except Exception as e:
402
+ console.print(f"[red]Error: {e}[/red]")
403
+ logger.exception("Failed to download standard sets")
404
+ raise typer.Exit(code=1)
405
+
406
+
407
+ @app.command("list")
408
+ def list_datasets():
409
+ """List all downloaded standard sets and their processing status."""
410
+ try:
411
+ datasets = data_manager.list_downloaded_standard_sets()
412
+
413
+ if not datasets:
414
+ console.print("[yellow]No standard sets downloaded yet.[/yellow]")
415
+ console.print("[dim]Use 'download-sets' to download standard sets.[/dim]")
416
+ return
417
+
418
+ # Check for processed.json files
419
+ for d in datasets:
420
+ set_dir = settings.standard_sets_dir / d.set_id
421
+ processed_file = set_dir / "processed.json"
422
+ d.processed = processed_file.exists()
423
+
424
+ # Count processed vs unprocessed
425
+ processed_count = sum(1 for d in datasets if d.processed)
426
+ unprocessed_count = len(datasets) - processed_count
427
+
428
+ table = Table(
429
+ "Set ID",
430
+ "Jurisdiction",
431
+ "Subject",
432
+ "Title",
433
+ "Grades",
434
+ "Status",
435
+ "Processed",
436
+ title="Downloaded Standard Sets",
437
+ )
438
+ for d in datasets:
439
+ # Truncate long set IDs
440
+ display_id = d.set_id[:25] + "..." if len(d.set_id) > 25 else d.set_id
441
+
442
+ table.add_row(
443
+ display_id,
444
+ d.jurisdiction,
445
+ d.subject[:30],
446
+ d.title[:30],
447
+ ", ".join(d.education_levels),
448
+ d.publication_status,
449
+ "[green]✓[/green]" if d.processed else "[yellow]✗[/yellow]",
450
+ )
451
+
452
+ console.print(table)
453
+ console.print("\n[bold]Summary:[/bold]")
454
+ console.print(f" Total: {len(datasets)} standard sets")
455
+ console.print(f" Processed: [green]{processed_count}[/green]")
456
+ console.print(f" Unprocessed: [yellow]{unprocessed_count}[/yellow]")
457
+
458
+ except Exception as e:
459
+ console.print(f"[red]Error: {e}[/red]")
460
+ logger.exception("Failed to list datasets")
461
+ raise typer.Exit(code=1)
462
+
463
+
464
+ @app.command("pinecone-init")
465
+ def pinecone_init():
466
+ """
467
+ Initialize Pinecone index.
468
+
469
+ Checks if the configured index exists and creates it if not.
470
+ Uses integrated embeddings with llama-text-embed-v2 model.
471
+ """
472
+ try:
473
+ from tools.pinecone_client import PineconeClient
474
+
475
+ console.print("[bold]Initializing Pinecone...[/bold]")
476
+
477
+ # Initialize Pinecone client (validates API key)
478
+ try:
479
+ client = PineconeClient()
480
+ except ValueError as e:
481
+ console.print(f"[red]Error: {e}[/red]")
482
+ raise typer.Exit(code=1)
483
+
484
+ console.print(f" Index name: [cyan]{client.index_name}[/cyan]")
485
+ console.print(f" Namespace: [cyan]{client.namespace}[/cyan]")
486
+
487
+ # Check and create index if needed
488
+ with console.status("[bold blue]Checking index status..."):
489
+ created = client.ensure_index_exists()
490
+
491
+ if created:
492
+ console.print(
493
+ f"\n[green]Successfully created index '{client.index_name}'[/green]"
494
+ )
495
+ console.print("[dim]Index configuration:[/dim]")
496
+ console.print(" Cloud: aws")
497
+ console.print(" Region: us-east-1")
498
+ console.print(" Embedding model: llama-text-embed-v2")
499
+ console.print(" Field map: text -> content")
500
+ else:
501
+ console.print(
502
+ f"\n[green]Index '{client.index_name}' already exists[/green]"
503
+ )
504
+
505
+ # Show index stats
506
+ with console.status("[bold blue]Fetching index stats..."):
507
+ stats = client.get_index_stats()
508
+
509
+ console.print("\n[bold]Index Statistics:[/bold]")
510
+ console.print(f" Total vectors: [cyan]{stats['total_vector_count']}[/cyan]")
511
+
512
+ namespaces = stats.get("namespaces", {})
513
+ if namespaces:
514
+ console.print(f" Namespaces: [cyan]{len(namespaces)}[/cyan]")
515
+ table = Table("Namespace", "Vector Count", title="Namespace Details")
516
+ for ns_name, ns_info in namespaces.items():
517
+ vector_count = getattr(ns_info, "vector_count", 0)
518
+ table.add_row(ns_name or "(default)", str(vector_count))
519
+ console.print(table)
520
+ else:
521
+ console.print(" Namespaces: [yellow]None (empty index)[/yellow]")
522
+
523
+ except Exception as e:
524
+ console.print(f"[red]Error: {e}[/red]")
525
+ logger.exception("Failed to initialize Pinecone")
526
+ raise typer.Exit(code=1)
527
+
528
+
529
+ @app.command("pinecone-upload")
530
+ def pinecone_upload(
531
+ set_id: str = typer.Option(
532
+ None, "--set-id", help="Upload a specific standard set by ID"
533
+ ),
534
+ all: bool = typer.Option(
535
+ False, "--all", help="Upload all downloaded standard sets with processed.json"
536
+ ),
537
+ force: bool = typer.Option(
538
+ False,
539
+ "--force",
540
+ help="Re-upload even if .pinecone_uploaded marker exists",
541
+ ),
542
+ dry_run: bool = typer.Option(
543
+ False,
544
+ "--dry-run",
545
+ help="Show what would be uploaded without actually uploading",
546
+ ),
547
+ batch_size: int = typer.Option(
548
+ 96, "--batch-size", help="Number of records per batch (default: 96)"
549
+ ),
550
+ ):
551
+ """
552
+ Upload processed standard sets to Pinecone.
553
+
554
+ Use --set-id to upload a specific set, or --all to upload all sets with processed.json.
555
+ If neither is provided, you'll be prompted to confirm uploading all sets.
556
+ """
557
+ try:
558
+ from tools.pinecone_client import PineconeClient
559
+ from tools.pinecone_models import ProcessedStandardSet
560
+ import json
561
+
562
+ # Initialize Pinecone client
563
+ try:
564
+ client = PineconeClient()
565
+ except ValueError as e:
566
+ console.print(f"[red]Error: {e}[/red]")
567
+ raise typer.Exit(code=1)
568
+
569
+ # Validate index exists
570
+ try:
571
+ client.validate_index()
572
+ except ValueError as e:
573
+ console.print(f"[red]Error: {e}[/red]")
574
+ raise typer.Exit(code=1)
575
+
576
+ # Discover standard sets with processed.json
577
+ standard_sets_dir = settings.standard_sets_dir
578
+ if not standard_sets_dir.exists():
579
+ console.print("[yellow]No standard sets directory found.[/yellow]")
580
+ console.print(
581
+ "[dim]Use 'download-sets' to download standard sets first.[/dim]"
582
+ )
583
+ return
584
+
585
+ # Find all sets with processed.json
586
+ sets_to_upload = []
587
+ for set_dir in standard_sets_dir.iterdir():
588
+ if not set_dir.is_dir():
589
+ continue
590
+
591
+ processed_file = set_dir / "processed.json"
592
+ if not processed_file.exists():
593
+ continue
594
+
595
+ set_id_from_dir = set_dir.name
596
+
597
+ # Check if already uploaded (unless --force)
598
+ # Mark all sets during discovery; filtering by --set-id happens later
599
+ if not force and PineconeClient.is_uploaded(set_dir):
600
+ sets_to_upload.append(
601
+ (set_id_from_dir, set_dir, True)
602
+ ) # True = already uploaded
603
+ else:
604
+ sets_to_upload.append(
605
+ (set_id_from_dir, set_dir, False)
606
+ ) # False = needs upload
607
+
608
+ if not sets_to_upload:
609
+ console.print(
610
+ "[yellow]No standard sets with processed.json found.[/yellow]"
611
+ )
612
+ console.print(
613
+ "[dim]Use 'download-sets' to download and process standard sets first.[/dim]"
614
+ )
615
+ return
616
+
617
+ # Filter by --set-id if provided
618
+ if set_id:
619
+ sets_to_upload = [
620
+ (sid, sdir, skipped)
621
+ for sid, sdir, skipped in sets_to_upload
622
+ if sid == set_id
623
+ ]
624
+ if not sets_to_upload:
625
+ console.print(
626
+ f"[yellow]Standard set '{set_id}' not found or has no processed.json.[/yellow]"
627
+ )
628
+ return
629
+
630
+ # If neither --set-id nor --all provided, prompt for confirmation
631
+ if not set_id and not all:
632
+ console.print(
633
+ f"\n[bold]Found {len(sets_to_upload)} standard set(s) with processed.json:[/bold]"
634
+ )
635
+ table = Table("Set ID", "Status", title="Standard Sets")
636
+ for sid, sdir, skipped in sets_to_upload:
637
+ status = (
638
+ "[yellow]Already uploaded[/yellow]"
639
+ if skipped
640
+ else "[green]Ready[/green]"
641
+ )
642
+ table.add_row(sid, status)
643
+ console.print(table)
644
+
645
+ if not typer.confirm(
646
+ f"\nUpload {len(sets_to_upload)} standard set(s) to Pinecone?"
647
+ ):
648
+ console.print("[yellow]Upload cancelled.[/yellow]")
649
+ return
650
+
651
+ # Show what would be uploaded (dry-run or preview)
652
+ if dry_run or not all:
653
+ console.print(
654
+ f"\n[bold]Standard sets to upload ({len(sets_to_upload)}):[/bold]"
655
+ )
656
+ table = Table("Set ID", "Records", "Status", title="Upload Preview")
657
+ for sid, sdir, skipped in sets_to_upload:
658
+ if skipped and not force:
659
+ table.add_row(
660
+ sid, "N/A", "[yellow]Skipped (already uploaded)[/yellow]"
661
+ )
662
+ continue
663
+
664
+ # Load processed.json to count records
665
+ try:
666
+ with open(sdir / "processed.json", encoding="utf-8") as f:
667
+ processed_data = json.load(f)
668
+ record_count = len(processed_data.get("records", []))
669
+ status = (
670
+ "[green]Ready[/green]"
671
+ if not dry_run
672
+ else "[yellow]Would upload[/yellow]"
673
+ )
674
+ table.add_row(sid, str(record_count), status)
675
+ except Exception as e:
676
+ table.add_row(sid, "Error", f"[red]Failed to read: {e}[/red]")
677
+ console.print(table)
678
+
679
+ if dry_run:
680
+ console.print(
681
+ f"\n[yellow][DRY RUN] Would upload {len([s for s in sets_to_upload if not s[2] or force])} standard set(s)[/yellow]"
682
+ )
683
+ console.print("[dim]Run without --dry-run to actually upload.[/dim]")
684
+ return
685
+
686
+ # Perform uploads
687
+ uploaded_count = 0
688
+ failed_count = 0
689
+ skipped_count = 0
690
+
691
+ for i, (sid, sdir, already_uploaded) in enumerate(sets_to_upload, 1):
692
+ if already_uploaded and not force:
693
+ skipped_count += 1
694
+ continue
695
+
696
+ try:
697
+ # Load processed.json
698
+ with open(sdir / "processed.json", encoding="utf-8") as f:
699
+ processed_data = json.load(f)
700
+
701
+ processed_set = ProcessedStandardSet(**processed_data)
702
+ records = processed_set.records
703
+
704
+ if not records:
705
+ console.print(
706
+ f"[yellow]Skipping {sid} (no records)[/yellow]"
707
+ )
708
+ skipped_count += 1
709
+ continue
710
+
711
+ # Upload records
712
+ with console.status(
713
+ f"[bold blue][{i}/{len(sets_to_upload)}] Uploading {sid} ({len(records)} records)"
714
+ ):
715
+ client.batch_upsert(records, batch_size=batch_size)
716
+
717
+ # Mark as uploaded
718
+ PineconeClient.mark_uploaded(sdir)
719
+ uploaded_count += 1
720
+ console.print(
721
+ f"[green]✓ [{i}/{len(sets_to_upload)}] Uploaded {sid} ({len(records)} records)[/green]"
722
+ )
723
+
724
+ except FileNotFoundError:
725
+ console.print(
726
+ f"[red]✗ [{i}/{len(sets_to_upload)}] Failed: {sid} (processed.json not found)[/red]"
727
+ )
728
+ logger.exception(f"Failed to upload standard set {sid}")
729
+ failed_count += 1
730
+ except Exception as e:
731
+ console.print(
732
+ f"[red]✗ [{i}/{len(sets_to_upload)}] Failed: {sid} ({e})[/red]"
733
+ )
734
+ logger.exception(f"Failed to upload standard set {sid}")
735
+ failed_count += 1
736
+
737
+ # Summary
738
+ console.print("\n[bold]Upload Summary:[/bold]")
739
+ console.print(f" Uploaded: [green]{uploaded_count}[/green]")
740
+ if skipped_count > 0:
741
+ console.print(f" Skipped: [yellow]{skipped_count}[/yellow]")
742
+ if failed_count > 0:
743
+ console.print(f" Failed: [red]{failed_count}[/red]")
744
+
745
+ except Exception as e:
746
+ console.print(f"[red]Error: {e}[/red]")
747
+ logger.exception("Failed to upload to Pinecone")
748
+ raise typer.Exit(code=1)
749
+
750
+
751
+ if __name__ == "__main__":
752
+ app()
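Because all commands are registered on a single Typer app, the CLI can also be exercised programmatically, which is handy for smoke tests. A sketch using `typer.testing.CliRunner` (the set ID is a placeholder; the `jurisdictions` call needs either a populated cache or `CSP_API_KEY`):

```python
from typer.testing import CliRunner

from tools.cli import app

runner = CliRunner()

# Roughly equivalent to: python tools/cli.py jurisdictions --type state
result = runner.invoke(app, ["jurisdictions", "--type", "state"])
print(result.output)

# Preview a single-set download without hitting the API or writing files
result = runner.invoke(app, ["download-sets", "PLACEHOLDER_SET_ID", "--dry-run"])
print(result.exit_code)
```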
tools/config.py ADDED
@@ -0,0 +1,65 @@
1
+ """Centralized configuration for the tools module."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from pathlib import Path
6
+
7
+ from pydantic_settings import BaseSettings, SettingsConfigDict
8
+
9
+
10
+ class ToolsSettings(BaseSettings):
11
+ """Configuration settings for the tools module."""
12
+
13
+ model_config = SettingsConfigDict(
14
+ env_file=".env",
15
+ env_file_encoding="utf-8",
16
+ case_sensitive=False,
17
+ )
18
+
19
+ # API Configuration
20
+ csp_api_key: str = ""
21
+ csp_base_url: str = "https://api.commonstandardsproject.com/api/v1"
22
+ max_requests_per_minute: int = 60
23
+
24
+ # Path Configuration
25
+ # These are computed properties based on project root
26
+ @property
27
+ def project_root(self) -> Path:
28
+ """Get the project root directory."""
29
+ return Path(__file__).parent.parent
30
+
31
+ @property
32
+ def raw_data_dir(self) -> Path:
33
+ """Get the raw data directory."""
34
+ return self.project_root / "data" / "raw"
35
+
36
+ @property
37
+ def standard_sets_dir(self) -> Path:
38
+ """Get the standard sets directory."""
39
+ return self.raw_data_dir / "standardSets"
40
+
41
+ @property
42
+ def processed_data_dir(self) -> Path:
43
+ """Get the processed data directory."""
44
+ return self.project_root / "data" / "processed"
45
+
46
+ # Logging Configuration
47
+ log_file: str = "data/cli.log"
48
+ log_rotation: str = "10 MB"
49
+ log_retention: str = "7 days"
50
+
51
+ # Pinecone Configuration
52
+ pinecone_api_key: str = ""
53
+ pinecone_index_name: str = "common-core-standards"
54
+ pinecone_namespace: str = "standards"
55
+
56
+
57
+ _settings: ToolsSettings | None = None
58
+
59
+
60
+ def get_settings() -> ToolsSettings:
61
+ """Get the singleton settings instance."""
62
+ global _settings
63
+ if _settings is None:
64
+ _settings = ToolsSettings()
65
+ return _settings
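Other modules obtain configuration exclusively through `get_settings()`, so a quick way to sanity-check the environment is to print the resolved values. The `.env` entries shown are examples only:

```python
# .env (example values, not real credentials)
# CSP_API_KEY=your-csp-key
# PINECONE_API_KEY=your-pinecone-key

from tools.config import get_settings

settings = get_settings()          # singleton; .env is read on first instantiation
print(settings.csp_base_url)       # https://api.commonstandardsproject.com/api/v1
print(settings.raw_data_dir)       # <project root>/data/raw
print(settings.standard_sets_dir)  # <project root>/data/raw/standardSets
```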
tools/data_manager.py ADDED
@@ -0,0 +1,81 @@
1
+ """Manages local data storage and metadata tracking."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import json
6
+ from dataclasses import dataclass
7
+
8
+ from loguru import logger
9
+
10
+ from tools.config import get_settings
11
+ from tools.models import StandardSetResponse
12
+
13
+ settings = get_settings()
14
+
15
+ # Data directories (from config)
16
+ RAW_DATA_DIR = settings.raw_data_dir
17
+ STANDARD_SETS_DIR = settings.standard_sets_dir
18
+ PROCESSED_DATA_DIR = settings.processed_data_dir
19
+
20
+
21
+ @dataclass
22
+ class StandardSetInfo:
23
+ """Information about a downloaded standard set with processing status."""
24
+
25
+ set_id: str
26
+ title: str
27
+ subject: str
28
+ education_levels: list[str]
29
+ jurisdiction: str
30
+ publication_status: str
31
+ valid_year: str
32
+ processed: bool
33
+
34
+
35
+ def list_downloaded_standard_sets() -> list[StandardSetInfo]:
36
+ """
37
+ List all downloaded standard sets from the standardSets directory.
38
+
39
+ Returns:
40
+ List of StandardSetInfo with standard set info and processing status
41
+ """
42
+ if not STANDARD_SETS_DIR.exists():
43
+ return []
44
+
45
+ datasets = []
46
+ for set_dir in STANDARD_SETS_DIR.iterdir():
47
+ if not set_dir.is_dir():
48
+ continue
49
+
50
+ data_file = set_dir / "data.json"
51
+ if not data_file.exists():
52
+ continue
53
+
54
+ try:
55
+ with open(data_file, encoding="utf-8") as f:
56
+ raw_data = json.load(f)
57
+
58
+ # Parse the API response wrapper
59
+ response = StandardSetResponse(**raw_data)
60
+ standard_set = response.data
61
+
62
+ # Build the dataset info
63
+ dataset_info = StandardSetInfo(
64
+ set_id=standard_set.id,
65
+ title=standard_set.title,
66
+ subject=standard_set.subject,
67
+ education_levels=standard_set.educationLevels,
68
+ jurisdiction=standard_set.jurisdiction.title,
69
+ publication_status=standard_set.document.publicationStatus or "Unknown",
70
+ valid_year=standard_set.document.valid,
71
+ processed=False, # TODO: Check against processed directory
72
+ )
73
+
74
+ datasets.append(dataset_info)
75
+
76
+ except Exception as e:  # covers JSONDecodeError, IOError, and model validation errors
77
+ logger.warning(f"Failed to read {data_file}: {e}")
78
+ continue
79
+
80
+ logger.debug(f"Found {len(datasets)} downloaded standard sets")
81
+ return datasets
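A short sketch of using this helper outside the CLI, for example to report which downloaded sets still need processing. Note that `list_downloaded_standard_sets()` itself leaves `processed` as `False`; checking for `processed.json` mirrors what the `list` command does:

```python
from tools.config import get_settings
from tools.data_manager import list_downloaded_standard_sets

settings = get_settings()

for info in list_downloaded_standard_sets():
    # Mirror the CLI: a set counts as processed once processed.json exists
    processed = (settings.standard_sets_dir / info.set_id / "processed.json").exists()
    print(f"{info.set_id}  {info.subject}  processed={processed}")
```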
tools/models.py ADDED
@@ -0,0 +1,129 @@
1
+ """Pydantic models for Common Standards Project API data structures."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from typing import Any, Optional
6
+
7
+ from pydantic import BaseModel, ConfigDict
8
+
9
+
10
+ class CSPBaseModel(BaseModel):
11
+ """Base model for all CSP API models with extra fields allowed."""
12
+
13
+ model_config = ConfigDict(extra="allow")
14
+
15
+
16
+ # ============================================================================
17
+ # Jurisdiction List Models
18
+ # ============================================================================
19
+
20
+
21
+ class Jurisdiction(CSPBaseModel):
22
+ """Basic jurisdiction information from the jurisdictions list endpoint."""
23
+
24
+ id: str
25
+ title: str
26
+ type: str # "school", "organization", "state", "nation"
27
+
28
+
29
+ class JurisdictionsResponse(CSPBaseModel):
30
+ """API response wrapper for jurisdictions list."""
31
+
32
+ data: list[Jurisdiction]
33
+
34
+
35
+ # ============================================================================
36
+ # Jurisdiction Details Models
37
+ # ============================================================================
38
+
39
+
40
+ class Document(CSPBaseModel):
41
+ """Standard document metadata."""
42
+
43
+ id: Optional[str] = None
44
+ title: str
45
+ valid: Optional[str] = None # Year as string
46
+ sourceURL: Optional[str] = None
47
+ asnIdentifier: Optional[str] = None
48
+ publicationStatus: Optional[str] = None
49
+
50
+
51
+ class StandardSetReference(CSPBaseModel):
52
+ """Reference to a standard set (metadata only, not full content)."""
53
+
54
+ id: str
55
+ title: str
56
+ subject: str
57
+ educationLevels: list[str]
58
+ document: Document
59
+
60
+
61
+ class JurisdictionDetails(CSPBaseModel):
62
+ """Full jurisdiction details including standard set references."""
63
+
64
+ id: str
65
+ title: str
66
+ type: str # "school", "organization", "state", "nation"
67
+ standardSets: list[StandardSetReference]
68
+
69
+
70
+ class JurisdictionDetailsResponse(CSPBaseModel):
71
+ """API response wrapper for jurisdiction details."""
72
+
73
+ data: JurisdictionDetails
74
+
75
+
76
+ # ============================================================================
77
+ # Standard Set Models
78
+ # ============================================================================
79
+
80
+
81
+ class License(CSPBaseModel):
82
+ """License information for a standard set."""
83
+
84
+ title: str
85
+ URL: str
86
+ rightsHolder: str
87
+
88
+
89
+ class JurisdictionRef(CSPBaseModel):
90
+ """Simple jurisdiction reference within a standard set."""
91
+
92
+ id: str
93
+ title: str
94
+
95
+
96
+ class Standard(CSPBaseModel):
97
+ """Individual standard within a standard set."""
98
+
99
+ id: str
100
+ asnIdentifier: Optional[str] = None
101
+ position: int
102
+ depth: int
103
+ statementNotation: Optional[str] = None
104
+ description: str
105
+ ancestorIds: list[str]
106
+ parentId: Optional[str] = None
107
+ statementLabel: Optional[str] = None # e.g., "Standard", "Benchmark"
108
+ educationLevels: Optional[list[str]] = None
109
+
110
+
111
+ class StandardSet(CSPBaseModel):
112
+ """Full standard set data including all standards."""
113
+
114
+ id: str
115
+ title: str
116
+ subject: str
117
+ normalizedSubject: Optional[str] = None
118
+ educationLevels: list[str]
119
+ license: License
120
+ document: Document
121
+ jurisdiction: JurisdictionRef
122
+ standards: dict[str, Standard] # GUID -> Standard mapping
123
+ cspStatus: Optional[dict[str, Any]] = None
124
+
125
+
126
+ class StandardSetResponse(CSPBaseModel):
127
+ """API response wrapper for standard set."""
128
+
129
+ data: StandardSet
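Since the cached `data.json` files store the full API response wrapper, they can be loaded straight into these models. A minimal sketch (the set ID in the path is a placeholder):

```python
import json
from pathlib import Path

from tools.models import StandardSetResponse

data_file = Path("data/raw/standardSets/PLACEHOLDER_SET_ID/data.json")
with open(data_file, encoding="utf-8") as f:
    response = StandardSetResponse(**json.load(f))

standard_set = response.data
print(standard_set.title, len(standard_set.standards))

# Standards are keyed by GUID; ancestorIds and parentId refer back into this dict
first = next(iter(standard_set.standards.values()))
print(first.statementLabel, first.description[:80])
```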
tools/pinecone_client.py ADDED
@@ -0,0 +1,262 @@
1
+ """Pinecone client for uploading and managing standard records."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import time
6
+ from datetime import datetime, timezone
7
+ from pathlib import Path
8
+ from collections.abc import Callable
9
+ from typing import Any
10
+
11
+ from loguru import logger
12
+ from pinecone import Pinecone
13
+ from pinecone.exceptions import PineconeException
14
+
15
+ from tools.config import get_settings
16
+ from tools.pinecone_models import PineconeRecord
17
+
18
+ settings = get_settings()
19
+
20
+
21
+ class PineconeClient:
22
+ """Client for interacting with Pinecone index."""
23
+
24
+ def __init__(self) -> None:
25
+ """Initialize Pinecone SDK from config settings."""
26
+ api_key = settings.pinecone_api_key
27
+ if not api_key:
28
+ raise ValueError("PINECONE_API_KEY environment variable not set")
29
+
30
+ self.pc = Pinecone(api_key=api_key)
31
+ self.index_name = settings.pinecone_index_name
32
+ self.namespace = settings.pinecone_namespace
33
+ self._index = None
34
+
35
+ @property
36
+ def index(self):
37
+ """Get the index object, creating it if needed."""
38
+ if self._index is None:
39
+ self._index = self.pc.Index(self.index_name)
40
+ return self._index
41
+
42
+ def validate_index(self) -> None:
43
+ """
44
+ Check index exists with pc.has_index(), raise helpful error if not.
45
+
46
+ Raises:
47
+ ValueError: If index does not exist, with instructions to create it.
48
+ """
49
+ if not self.pc.has_index(name=self.index_name):
50
+ raise ValueError(
51
+ f"Index '{self.index_name}' not found. Create it with:\n"
52
+ f"pc index create -n {self.index_name} -m cosine -c aws -r us-east-1 "
53
+ f"--model llama-text-embed-v2 --field_map text=content"
54
+ )
55
+
56
+ def ensure_index_exists(self) -> bool:
57
+ """
58
+ Check if index exists, create it if not.
59
+
60
+ Creates the index with integrated embeddings using llama-text-embed-v2 model.
61
+
62
+ Returns:
63
+ True if index was created, False if it already existed.
64
+ """
65
+ if self.pc.has_index(name=self.index_name):
66
+ logger.info(f"Index '{self.index_name}' already exists")
67
+ return False
68
+
69
+ logger.info(f"Creating index '{self.index_name}' with integrated embeddings...")
70
+ self.pc.create_index_for_model(
71
+ name=self.index_name,
72
+ cloud="aws",
73
+ region="us-east-1",
74
+ embed={
75
+ "model": "llama-text-embed-v2",
76
+ "field_map": {"text": "content"},
77
+ },
78
+ )
79
+ logger.info(f"Successfully created index '{self.index_name}'")
80
+ return True
81
+
82
+ def get_index_stats(self) -> dict[str, Any]:
83
+ """
84
+ Get index statistics including vector count and namespaces.
85
+
86
+ Returns:
87
+ Dictionary with index stats including total_vector_count and namespaces.
88
+ """
89
+ stats = self.index.describe_index_stats()
90
+ return {
91
+ "total_vector_count": stats.total_vector_count,
92
+ "namespaces": dict(stats.namespaces) if stats.namespaces else {},
93
+ }
94
+
95
+ @staticmethod
96
+ def exponential_backoff_retry(
97
+ func: Callable[[], Any], max_retries: int = 5
98
+ ) -> Any:
99
+ """
100
+ Retry function with exponential backoff on 429/5xx, fail on 4xx.
101
+
102
+ Args:
103
+ func: Function to retry (should be a callable that takes no args)
104
+ max_retries: Maximum number of retry attempts
105
+
106
+ Returns:
107
+ Result of func()
108
+
109
+ Raises:
110
+ PineconeException: If retries exhausted or non-retryable error
111
+ """
112
+ for attempt in range(max_retries):
113
+ try:
114
+ return func()
115
+ except PineconeException as e:
116
+ status_code = getattr(e, "status", None)
117
+ # Only retry transient errors
118
+ if status_code and (status_code >= 500 or status_code == 429):
119
+ if attempt < max_retries - 1:
120
+ delay = min(2 ** attempt, 60) # Cap at 60s
121
+ logger.warning(
122
+ f"Retryable error (status {status_code}), "
123
+ f"retrying in {delay}s (attempt {attempt + 1}/{max_retries})"
124
+ )
125
+ time.sleep(delay)
126
+ else:
127
+ logger.error(
128
+ f"Max retries ({max_retries}) exceeded for retryable error"
129
+ )
130
+ raise
131
+ else:
132
+ # Don't retry client errors
133
+ logger.error(f"Non-retryable error (status {status_code}): {e}")
134
+ raise
135
+ except Exception as e:
136
+ # Non-Pinecone exceptions should not be retried
137
+ logger.error(f"Non-retryable exception: {e}")
138
+ raise
139
+
140
+ def batch_upsert(
141
+ self, records: list[PineconeRecord], batch_size: int = 96
142
+ ) -> None:
143
+ """
144
+ Upsert records in batches of specified size with rate limiting.
145
+
146
+ Args:
147
+ records: List of PineconeRecord objects to upsert
148
+ batch_size: Number of records per batch (default: 96)
149
+ """
150
+ if not records:
151
+ logger.info("No records to upsert")
152
+ return
153
+
154
+ total_batches = (len(records) + batch_size - 1) // batch_size
155
+ logger.info(
156
+ f"Upserting {len(records)} records in {total_batches} batch(es) "
157
+ f"(batch size: {batch_size})"
158
+ )
159
+
160
+ for i in range(0, len(records), batch_size):
161
+ batch = records[i : i + batch_size]
162
+ batch_num = (i // batch_size) + 1
163
+
164
+ # Convert PineconeRecord models to dict format for Pinecone
165
+ batch_dicts = [self._record_to_dict(record) for record in batch]
166
+
167
+ logger.debug(f"Upserting batch {batch_num}/{total_batches} ({len(batch)} records)")
168
+
169
+ # Retry with exponential backoff
170
+ self.exponential_backoff_retry(
171
+ lambda b=batch_dicts: self.index.upsert_records(
172
+ namespace=self.namespace, records=b
173
+ )
174
+ )
175
+
176
+ # Rate limiting between batches
177
+ if i + batch_size < len(records):
178
+ time.sleep(0.1)
179
+
180
+ logger.info(f"Successfully upserted {len(records)} records")
181
+
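+ # Usage sketch (illustrative): given a ProcessedStandardSet `processed_set`
+ # produced by tools/pinecone_processor.py (added later in this commit) and an
+ # assumed instance name `uploader`:
+ #
+ #     uploader.batch_upsert(processed_set.records, batch_size=96)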
182
+ @staticmethod
183
+ def _record_to_dict(record: PineconeRecord) -> dict[str, Any]:
184
+ """
185
+ Convert PineconeRecord model to dict format for Pinecone API.
186
+
187
+ Handles optional fields by omitting them if None. Pinecone doesn't accept
188
+ null values for metadata fields, so parent_id must be omitted entirely
189
+ when None (for root nodes).
190
+
191
+ Args:
192
+ record: PineconeRecord model instance
193
+
194
+ Returns:
195
+ Dictionary ready for Pinecone upsert_records
196
+ """
197
+ # Use by_alias=True to serialize 'id' as '_id' per model serialization_alias
198
+ record_dict = record.model_dump(exclude_none=False, by_alias=True)
199
+
200
+ # Remove None values for optional fields
201
+ optional_fields = {
202
+ "asn_identifier",
203
+ "statement_notation",
204
+ "statement_label",
205
+ "normalized_subject",
206
+ "publication_status",
207
+ "parent_id", # Must be omitted when None (Pinecone doesn't accept null)
208
+ }
209
+ for field in optional_fields:
210
+ if record_dict.get(field) is None:
211
+ record_dict.pop(field, None)
212
+
213
+ return record_dict
214
+
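+ # Example (illustrative): for a root-node record (parent_id=None) the dict
+ # keeps the "_id" alias and contains no "parent_id" key at all, which is the
+ # shape upsert_records expects; `uploader` and `root_record` are assumed names:
+ #
+ #     d = uploader._record_to_dict(root_record)
+ #     assert "_id" in d and "parent_id" not in d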
215
+ @staticmethod
216
+ def is_uploaded(set_dir: Path) -> bool:
217
+ """
218
+ Check for .pinecone_uploaded marker file.
219
+
220
+ Args:
221
+ set_dir: Path to standard set directory
222
+
223
+ Returns:
224
+ True if marker file exists, False otherwise
225
+ """
226
+ marker_file = set_dir / ".pinecone_uploaded"
227
+ return marker_file.exists()
228
+
229
+ @staticmethod
230
+ def mark_uploaded(set_dir: Path) -> None:
231
+ """
232
+ Create marker file with ISO 8601 timestamp.
233
+
234
+ Args:
235
+ set_dir: Path to standard set directory
236
+ """
237
+ marker_file = set_dir / ".pinecone_uploaded"
238
+ timestamp = datetime.now(timezone.utc).isoformat()
239
+ marker_file.write_text(timestamp, encoding="utf-8")
240
+ logger.debug(f"Created upload marker: {marker_file}")
241
+
242
+ @staticmethod
243
+ def get_upload_timestamp(set_dir: Path) -> str | None:
244
+ """
245
+ Read timestamp from marker file.
246
+
247
+ Args:
248
+ set_dir: Path to standard set directory
249
+
250
+ Returns:
251
+ ISO 8601 timestamp string if marker exists, None otherwise
252
+ """
253
+ marker_file = set_dir / ".pinecone_uploaded"
254
+ if not marker_file.exists():
255
+ return None
256
+
257
+ try:
258
+ return marker_file.read_text(encoding="utf-8").strip()
259
+ except Exception as e:
260
+ logger.warning(f"Failed to read upload marker {marker_file}: {e}")
261
+ return None
262
+
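+ # Usage sketch (illustrative): the marker helpers make uploads idempotent per
+ # standard-set directory; `uploader`, `set_dir`, and `records` are assumed names:
+ #
+ #     if not uploader.is_uploaded(set_dir):
+ #         uploader.batch_upsert(records)
+ #         uploader.mark_uploaded(set_dir)
+ #     print(uploader.get_upload_timestamp(set_dir))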
tools/pinecone_models.py ADDED
@@ -0,0 +1,95 @@
1
+ """Pydantic models for Pinecone-processed standard records."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from typing import Any
6
+
7
+ from pydantic import BaseModel, ConfigDict, Field, field_validator
8
+
9
+
10
+ class PineconeRecord(BaseModel):
11
+ """A single standard record ready for Pinecone upsert."""
12
+
13
+ model_config = ConfigDict(
+ # Allow population by the field name ("id") as well as by the "_id" alias
+ populate_by_name=True,
+ )
21
+
22
+ # Core identifier - use alias to serialize as _id
23
+ id: str = Field(alias="_id", serialization_alias="_id")
24
+
25
+ # Content for embedding
26
+ content: str
27
+
28
+ # Standard Set Context
29
+ standard_set_id: str
30
+ standard_set_title: str
31
+ subject: str
32
+ normalized_subject: str | None = None
33
+ education_levels: list[str]
34
+ document_id: str
35
+ document_valid: str
36
+ publication_status: str | None = None
37
+ jurisdiction_id: str
38
+ jurisdiction_title: str
39
+
40
+ # Standard Identity & Position
41
+ asn_identifier: str | None = None
42
+ statement_notation: str | None = None
43
+ statement_label: str | None = None
44
+ depth: int
45
+ is_leaf: bool
46
+ is_root: bool
47
+
48
+ # Hierarchy Relationships
49
+ parent_id: str | None = None # null for root nodes
50
+ root_id: str
51
+ ancestor_ids: list[str]
52
+ child_ids: list[str]
53
+ sibling_count: int
54
+
55
+ @field_validator("education_levels", mode="before")
56
+ @classmethod
57
+ def process_education_levels(cls, v: Any) -> list[str]:
58
+ """
59
+ Process education_levels: split comma-separated strings, flatten, dedupe.
60
+
61
+ Handles cases where source data has comma-separated values within array
62
+ elements (e.g., ["01,02"] instead of ["01", "02"]).
63
+
64
+ Args:
65
+ v: Input value (list[str] or list with comma-separated strings)
66
+
67
+ Returns:
68
+ Flattened, deduplicated list of grade level strings
69
+ """
70
+ if not isinstance(v, list):
71
+ return []
72
+
73
+ # Split comma-separated strings and flatten
74
+ flattened: list[str] = []
75
+ for item in v:
76
+ if isinstance(item, str):
77
+ # Split on commas and strip whitespace
78
+ split_items = [s.strip() for s in item.split(",") if s.strip()]
79
+ flattened.extend(split_items)
80
+
81
+ # Deduplicate while preserving order
82
+ seen: set[str] = set()
83
+ result: list[str] = []
84
+ for item in flattened:
85
+ if item not in seen:
86
+ seen.add(item)
87
+ result.append(item)
88
+
89
+ return result
90
+
91
+
92
+ class ProcessedStandardSet(BaseModel):
93
+ """Container for processed standard set records ready for Pinecone."""
94
+
95
+ records: list[PineconeRecord]
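+
+ # Usage sketch (all values illustrative): the validator splits comma-joined
+ # grade levels, and the "_id" alias only appears when dumping with by_alias=True.
+ #
+ #     record = PineconeRecord(
+ #         id="abc123",
+ #         content="Depth 0: Operations and Algebraic Thinking",
+ #         standard_set_id="set-1",
+ #         standard_set_title="Grade 1 Mathematics",
+ #         subject="Mathematics",
+ #         education_levels=["01,02"],  # normalized to ["01", "02"]
+ #         document_id="doc-1",
+ #         document_valid="2010",
+ #         jurisdiction_id="jur-1",
+ #         jurisdiction_title="Example Jurisdiction",
+ #         depth=0,
+ #         is_leaf=False,
+ #         is_root=True,
+ #         root_id="abc123",
+ #         ancestor_ids=[],
+ #         child_ids=["def456"],
+ #         sibling_count=0,
+ #     )
+ #     assert record.education_levels == ["01", "02"]
+ #     assert record.model_dump(by_alias=True)["_id"] == "abc123"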
tools/pinecone_processor.py ADDED
@@ -0,0 +1,375 @@
1
+ """Processor for transforming standard sets into Pinecone-ready format."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import json
6
+ from pathlib import Path
7
+ from typing import TYPE_CHECKING
8
+
9
+ from loguru import logger
10
+
11
+ from tools.config import get_settings
12
+ from tools.models import Standard, StandardSet, StandardSetResponse
13
+ from tools.pinecone_models import PineconeRecord, ProcessedStandardSet
14
+
15
+ if TYPE_CHECKING:
16
+ from collections.abc import Mapping
17
+
18
+ settings = get_settings()
19
+
20
+
21
+ class StandardSetProcessor:
22
+ """Processes standard sets into Pinecone-ready format."""
23
+
24
+ def __init__(self):
25
+ """Initialize the processor."""
26
+ self.id_to_standard: dict[str, dict] = {}
27
+ self.parent_to_children: dict[str | None, list[str]] = {}
28
+ self.leaf_nodes: set[str] = set()
29
+ self.root_nodes: set[str] = set()
30
+
31
+ def process_standard_set(self, standard_set: StandardSet) -> ProcessedStandardSet:
32
+ """
33
+ Process a standard set into Pinecone-ready records.
34
+
35
+ Args:
36
+ standard_set: The StandardSet model from the API
37
+
38
+ Returns:
39
+ ProcessedStandardSet with all records ready for Pinecone
40
+ """
41
+ # Build relationship maps from all standards
42
+ self._build_relationship_maps(standard_set.standards)
43
+
44
+ # Process each standard into a PineconeRecord
45
+ records = []
46
+ for standard in standard_set.standards.values():
47
+ record = self._transform_standard(standard, standard_set)
48
+ records.append(record)
49
+
50
+ return ProcessedStandardSet(records=records)
51
+
52
+ def _build_relationship_maps(self, standards: dict[str, Standard]) -> None:
53
+ """
54
+ Build helper data structures from all standards in the set.
55
+
56
+ Args:
57
+ standards: Dictionary mapping standard ID to Standard object
58
+ """
59
+ # Convert to dict format for easier manipulation
60
+ standards_dict = {
61
+ std_id: standard.model_dump() for std_id, standard in standards.items()
62
+ }
63
+
64
+ # Build ID-to-standard map
65
+ self.id_to_standard = self._build_id_to_standard_map(standards_dict)
66
+
67
+ # Build parent-to-children map (sorted by position)
68
+ self.parent_to_children = self._build_parent_to_children_map(standards_dict)
69
+
70
+ # Identify leaf nodes
71
+ self.leaf_nodes = self._identify_leaf_nodes(standards_dict)
72
+
73
+ # Identify root nodes
74
+ self.root_nodes = self._identify_root_nodes(standards_dict)
75
+
76
+ def _build_id_to_standard_map(
77
+ self, standards: dict[str, dict]
78
+ ) -> dict[str, dict]:
79
+ """Build map of id -> standard object."""
80
+ return {std_id: std for std_id, std in standards.items()}
81
+
82
+ def _build_parent_to_children_map(
83
+ self, standards: dict[str, dict]
84
+ ) -> dict[str | None, list[str]]:
85
+ """
86
+ Build map of parentId -> [child_ids], sorted by position ascending.
87
+
88
+ Args:
89
+ standards: Dictionary of standard ID to standard dict
90
+
91
+ Returns:
92
+ Dictionary mapping parent ID (or None for roots) to sorted list of child IDs
93
+ """
94
+ parent_map: dict[str | None, list[tuple[int, str]]] = {}
95
+
96
+ for std_id, std in standards.items():
97
+ parent_id = std.get("parentId")
98
+ position = std.get("position", 0)
99
+
100
+ if parent_id not in parent_map:
101
+ parent_map[parent_id] = []
102
+ parent_map[parent_id].append((position, std_id))
103
+
104
+ # Sort each list by position and extract just the IDs
105
+ result: dict[str | None, list[str]] = {}
106
+ for parent_id, children in parent_map.items():
107
+ sorted_children = sorted(children, key=lambda x: x[0])
108
+ result[parent_id] = [std_id for _, std_id in sorted_children]
109
+
110
+ return result
111
+
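+ # Worked example (illustrative IDs): three standards sharing parent "P" with
+ # positions 20, 10, 30 map to {"P": ["b", "a", "c"]}, because children are
+ # sorted by ascending position; root standards are grouped under the key None.
+ #
+ #     standards = {
+ #         "a": {"parentId": "P", "position": 20},
+ #         "b": {"parentId": "P", "position": 10},
+ #         "c": {"parentId": "P", "position": 30},
+ #     }
+ #     # _build_parent_to_children_map(standards) -> {"P": ["b", "a", "c"]}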
112
+ def _identify_leaf_nodes(self, standards: dict[str, dict]) -> set[str]:
113
+ """
114
+ Identify leaf nodes: standards whose ID does NOT appear as any standard's parentId.
115
+
116
+ Args:
117
+ standards: Dictionary of standard ID to standard dict
118
+
119
+ Returns:
120
+ Set of standard IDs that are leaf nodes
121
+ """
122
+ all_ids = set(standards.keys())
123
+ parent_ids = {std.get("parentId") for std in standards.values() if std.get("parentId") is not None}
124
+
125
+ # Leaf nodes are IDs that are NOT in parent_ids
126
+ return all_ids - parent_ids
127
+
128
+ def _identify_root_nodes(self, standards: dict[str, dict]) -> set[str]:
129
+ """
130
+ Identify root nodes: standards where parentId is null.
131
+
132
+ Args:
133
+ standards: Dictionary of standard ID to standard dict
134
+
135
+ Returns:
136
+ Set of standard IDs that are root nodes
137
+ """
138
+ return {
139
+ std_id
140
+ for std_id, std in standards.items()
141
+ if std.get("parentId") is None
142
+ }
143
+
144
+ def find_root_id(self, standard: dict, id_to_standard: dict[str, dict]) -> str:
145
+ """
146
+ Walk up the parent chain to find the root ancestor.
147
+
148
+ Args:
149
+ standard: The standard dict to find root for
150
+ id_to_standard: Map of ID to standard dict
151
+
152
+ Returns:
153
+ The root ancestor's ID
154
+ """
155
+ current = standard
156
+ visited = set() # Prevent infinite loops from bad data
157
+
158
+ while current.get("parentId") is not None:
159
+ parent_id = current["parentId"]
160
+ if parent_id in visited:
161
+ break # Circular reference protection
162
+ visited.add(parent_id)
163
+
164
+ if parent_id not in id_to_standard:
165
+ break # Parent not found, use current as root
166
+ current = id_to_standard[parent_id]
167
+
168
+ return current["id"]
169
+
170
+ def build_ordered_ancestors(
171
+ self, standard: dict, id_to_standard: dict[str, dict]
172
+ ) -> list[str]:
173
+ """
174
+ Build ancestor list ordered from root (index 0) to immediate parent (last index).
175
+
176
+ Args:
177
+ standard: The standard dict to build ancestors for
178
+ id_to_standard: Map of ID to standard dict
179
+
180
+ Returns:
181
+ List of ancestor IDs ordered root -> immediate parent
182
+ """
183
+ ancestors = []
184
+ current_id = standard.get("parentId")
185
+ visited = set()
186
+
187
+ while current_id is not None and current_id not in visited:
188
+ visited.add(current_id)
189
+ if current_id in id_to_standard:
190
+ ancestors.append(current_id)
191
+ current_id = id_to_standard[current_id].get("parentId")
192
+ else:
193
+ break
194
+
195
+ ancestors.reverse() # Now ordered root → immediate parent
196
+ return ancestors
197
+
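+ # Worked example (illustrative IDs): for the chain root -> cluster -> standard,
+ # the ancestors list is ordered root first, immediate parent last.
+ #
+ #     id_to_standard = {
+ #         "root": {"id": "root", "parentId": None},
+ #         "cluster": {"id": "cluster", "parentId": "root"},
+ #         "std": {"id": "std", "parentId": "cluster"},
+ #     }
+ #     # build_ordered_ancestors(id_to_standard["std"], id_to_standard) == ["root", "cluster"]
+ #     # find_root_id(id_to_standard["std"], id_to_standard) == "root"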
198
+ def _compute_sibling_count(self, standard: dict) -> int:
199
+ """
200
+ Count standards with same parent_id, excluding self.
201
+
202
+ Args:
203
+ standard: The standard dict
204
+
205
+ Returns:
206
+ Number of siblings (excluding self)
207
+ """
208
+ parent_id = standard.get("parentId")
209
+ if parent_id not in self.parent_to_children:
210
+ return 0
211
+
212
+ siblings = self.parent_to_children[parent_id]
213
+ # Exclude self from count
214
+ return len([s for s in siblings if s != standard["id"]])
215
+
216
+ def _build_content_text(self, standard: dict) -> str:
217
+ """
218
+ Generate content text block with full hierarchy.
219
+
220
+ Format: "Depth N (notation): description" for each ancestor and self.
221
+
222
+ Args:
223
+ standard: The standard dict
224
+
225
+ Returns:
226
+ Multi-line text block with full hierarchy
227
+ """
228
+ # Build ordered ancestor chain
229
+ ancestor_ids = self.build_ordered_ancestors(standard, self.id_to_standard)
230
+
231
+ # Build lines from root to current standard
232
+ lines = []
233
+
234
+ # Add ancestor lines
235
+ for ancestor_id in ancestor_ids:
236
+ ancestor = self.id_to_standard[ancestor_id]
237
+ depth = ancestor.get("depth", 0)
238
+ description = ancestor.get("description", "")
239
+ notation = ancestor.get("statementNotation")
240
+
241
+ if notation:
242
+ lines.append(f"Depth {depth} ({notation}): {description}")
243
+ else:
244
+ lines.append(f"Depth {depth}: {description}")
245
+
246
+ # Add current standard line
247
+ depth = standard.get("depth", 0)
248
+ description = standard.get("description", "")
249
+ notation = standard.get("statementNotation")
250
+
251
+ if notation:
252
+ lines.append(f"Depth {depth} ({notation}): {description}")
253
+ else:
254
+ lines.append(f"Depth {depth}: {description}")
255
+
256
+ return "\n".join(lines)
257
+
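+ # Example output (illustrative text): for a depth-2 standard nested under a
+ # domain and a cluster, the embedded content block looks like:
+ #
+ #     Depth 0 (1.OA): Operations and Algebraic Thinking
+ #     Depth 1 (1.OA.A): Represent and solve problems involving addition and subtraction.
+ #     Depth 2 (1.OA.A.1): Use addition and subtraction within 20 to solve word problems.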
258
+ def _transform_standard(
259
+ self, standard: Standard, standard_set: StandardSet
260
+ ) -> PineconeRecord:
261
+ """
262
+ Transform a single standard into a PineconeRecord.
263
+
264
+ Args:
265
+ standard: The Standard object to transform
266
+ standard_set: The parent StandardSet containing context
267
+
268
+ Returns:
269
+ PineconeRecord ready for Pinecone upsert
270
+ """
271
+ std_dict = standard.model_dump()
272
+
273
+ # Compute hierarchy relationships
274
+ is_root = std_dict.get("parentId") is None
275
+ root_id = (
276
+ std_dict["id"] if is_root else self.find_root_id(std_dict, self.id_to_standard)
277
+ )
278
+ ancestor_ids = self.build_ordered_ancestors(std_dict, self.id_to_standard)
279
+ child_ids = self.parent_to_children.get(std_dict["id"], [])
280
+ is_leaf = std_dict["id"] in self.leaf_nodes
281
+ sibling_count = self._compute_sibling_count(std_dict)
282
+
283
+ # Build content text
284
+ content = self._build_content_text(std_dict)
285
+
286
+ # Extract standard set context
287
+ parent_id = std_dict.get("parentId") # Keep as None if null
288
+
289
+ # Build record with all fields
290
+ # Note: Use "id" not "_id" - Pydantic handles serialization alias automatically
291
+ record_data = {
292
+ "id": std_dict["id"],
293
+ "content": content,
294
+ "standard_set_id": standard_set.id,
295
+ "standard_set_title": standard_set.title,
296
+ "subject": standard_set.subject,
297
+ "normalized_subject": standard_set.normalizedSubject, # Optional, can be None
298
+ "education_levels": standard_set.educationLevels,
299
+ "document_id": standard_set.document.id,
300
+ "document_valid": standard_set.document.valid,
301
+ "publication_status": standard_set.document.publicationStatus, # Optional, can be None
302
+ "jurisdiction_id": standard_set.jurisdiction.id,
303
+ "jurisdiction_title": standard_set.jurisdiction.title,
304
+ "depth": std_dict.get("depth", 0),
305
+ "is_leaf": is_leaf,
306
+ "is_root": is_root,
307
+ "parent_id": parent_id,
308
+ "root_id": root_id,
309
+ "ancestor_ids": ancestor_ids,
310
+ "child_ids": child_ids,
311
+ "sibling_count": sibling_count,
312
+ }
313
+
314
+ # Add optional fields only if present
315
+ if std_dict.get("asnIdentifier"):
316
+ record_data["asn_identifier"] = std_dict["asnIdentifier"]
317
+ if std_dict.get("statementNotation"):
318
+ record_data["statement_notation"] = std_dict["statementNotation"]
319
+ if std_dict.get("statementLabel"):
320
+ record_data["statement_label"] = std_dict["statementLabel"]
321
+
322
+ return PineconeRecord(**record_data)
323
+
324
+
325
+ def process_and_save(standard_set_id: str) -> Path:
326
+ """
327
+ Load data.json, process it, and save processed.json.
328
+
329
+ Args:
330
+ standard_set_id: The ID of the standard set to process
331
+
332
+ Returns:
333
+ Path to the saved processed.json file
334
+
335
+ Raises:
336
+ FileNotFoundError: If data.json doesn't exist
337
+ ValueError: If JSON is invalid
338
+ """
339
+ # Locate data.json
340
+ data_file = settings.standard_sets_dir / standard_set_id / "data.json"
341
+ if not data_file.exists():
342
+ logger.warning(f"data.json not found for set {standard_set_id}, skipping")
343
+ raise FileNotFoundError(f"data.json not found for set {standard_set_id}")
344
+
345
+ # Load and parse JSON
346
+ try:
347
+ with open(data_file, encoding="utf-8") as f:
348
+ raw_data = json.load(f)
349
+ except json.JSONDecodeError as e:
350
+ raise ValueError(f"Invalid JSON in {data_file}: {e}") from e
351
+
352
+ # Parse into Pydantic model
353
+ try:
354
+ response = StandardSetResponse(**raw_data)
355
+ standard_set = response.data
356
+ except Exception as e:
357
+ raise ValueError(f"Failed to parse standard set data: {e}") from e
358
+
359
+ # Process the standard set
360
+ processor = StandardSetProcessor()
361
+ processed_set = processor.process_standard_set(standard_set)
362
+
363
+ # Save processed.json
364
+ processed_file = settings.standard_sets_dir / standard_set_id / "processed.json"
365
+ processed_file.parent.mkdir(parents=True, exist_ok=True)
366
+
367
+ with open(processed_file, "w", encoding="utf-8") as f:
368
+ json.dump(processed_set.model_dump(mode="json"), f, indent=2)
369
+
370
+ logger.info(
371
+ f"Processed {standard_set_id}: {len(processed_set.records)} records"
372
+ )
373
+
374
+ return processed_file
375
+
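+
+ # End-to-end sketch (illustrative): process a downloaded set, then upload the
+ # records with the uploader class shown earlier in this commit; `uploader` and
+ # the set ID are assumed names/values.
+ #
+ #     processed_file = process_and_save("ABC123")
+ #     with open(processed_file, encoding="utf-8") as f:
+ #         processed = ProcessedStandardSet(**json.load(f))
+ #     uploader.batch_upsert(processed.records)
+ #     uploader.mark_uploaded(processed_file.parent)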