Lindow committed · Commit 7602502 · 0 Parent(s)

initial commit

.agent/docs/common_core_spec.md ADDED
@@ -0,0 +1,136 @@
1
+ This is the authoritative **Common Core Data Specification**. It contains the exact source locations, data schemas, field definitions, and the specific processing logic required to interpret the hierarchy correctly.
2
+
3
+ **Use this document as the source of truth for `tools/build_data.py`.**
4
+
5
+ ---
6
+
7
+ # Data Specification: Common Core Standards
8
+
9
+ **Authority:** Common Standards Project (GitHub)
10
+ **License:** Creative Commons Attribution 4.0 (CC BY 4.0)
11
+ **Format:** JSON (Flat List of Objects)
12
+
13
+ ## 1. Source Locations
14
+
15
+ We are using the "Clean Data" export from the Common Standards Project. These files are static JSON dumps where each file represents a full Subject.
16
+
17
+ | Subject | Direct Download URL |
18
+ | :----------------- | :--------------------------------------------------------------------------------------------------------------------------- |
19
+ | **Mathematics** | `https://raw.githubusercontent.com/commoncurriculum/common-standards-project/master/data/clean-data/CCSSI/Mathematics.json` |
20
+ | **ELA / Literacy** | `https://raw.githubusercontent.com/commoncurriculum/common-standards-project/master/data/clean-data/CCSSI/ELA-Literacy.json` |
21
+
22
+ ---
23
+
24
+ ## 2. The Data Structure (Glossary)
25
+
26
+ The JSON file contains a root object. The actual standards are located in the `standards` dictionary, keyed by their internal GUID.
27
+
28
+ ### **Root Object**
29
+
30
+ ```json
31
+ {
32
+ "subject": "Mathematics",
33
+ "standards": {
34
+ "6051566A...": { ... }, // Standard Object
35
+ "5E367098...": { ... } // Standard Object
36
+ }
37
+ }
38
+ ```
39
+
40
+ ### **Standard Object (The Item)**
41
+
42
+ Each item represents a node in the curriculum tree. It could be a broad **Domain**, a grouping **Cluster**, or a specific **Standard**.
43
+
44
+ | Field Name | Type | Definition & Usage |
45
+ | :---------------------- | :-------------- | :------------------------------------------------------------------------------------------------------------------------------------ |
46
+ | **`id`** | `String (GUID)` | The internal unique identifier. Used for lookups in `ancestorIds`. |
47
+ | **`statementNotation`** | `String` | **The Display Code.** (e.g., `CCSS.Math.Content.1.OA.A.1`). This is what teachers recognize. Use this for the UI. |
48
+ | **`description`** | `String` | The text content. **Warning:** For standards, this text is often incomplete without its parent context (see Hierarchy below). |
49
+ | **`statementLabel`** | `String` | The hierarchy type. Critical values: <br>• `Domain` (Highest) <br>• `Cluster` (Grouping) <br>• `Standard` (The actionable item) |
50
+ | **`gradeLevels`** | `Array[String]` | Scope of the standard. <br>• Format: `["01", "02"]` (Grades 1 & 2), `["K"]` (Kindergarten), `["09", "10", "11", "12"]` (High School). |
51
+ | **`ancestorIds`** | `Array[GUID]` | **CRITICAL.** An ordered list of parent IDs (from root to immediate parent). You must resolve these to build the full context. |
52
+
53
+ ---
54
+
55
+ ## 3. Hierarchy & Context (The "Interpretation" Problem)
56
+
57
+ **The Problem:**
58
+ A standard's description often relies on its parent "Cluster" for meaning.
59
+
60
+ - _Cluster Text:_ "Understand the place value system."
61
+ - _Standard Text:_ "Recognize that in a multi-digit number, a digit in one place represents 10 times as much..."
62
+
63
+ If you only embed the _Standard Text_, the vector will miss the concept of "Place Value."
64
+
65
+ **The Solution (Processing Logic):**
66
+ To generate the **Search String** for embedding, you must concatenate the hierarchy.
67
+
68
+ 1. **Domain:** The broad category (e.g., "Number and Operations in Base Ten").
69
+ 2. **Cluster:** The specific topic (e.g., "Generalize place value understanding").
70
+ 3. **Standard:** The task.
71
+
72
+ **Formula:**
73
+
74
+ ```text
75
+ "{Subject} {Grade}: {Domain Text} - {Cluster Text} - {Standard Text}"
76
+ ```
77
+
78
+ ---
79
+
80
+ ## 4. Build Pipeline Specification (`tools/build_data.py`)
81
+
82
+ This specific logic ensures we extract meaningful vectors.
83
+
84
+ ### **Step A: Ingestion**
85
+
86
+ 1. Download both JSON files.
87
+ 2. Merge the `standards` dictionaries into a single **Lookup Map** (Memory: `Map<GUID, Object>`).
88
+
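+ A minimal ingestion sketch, assuming the two "Clean Data" URLs from Section 1 and the root-object shape from Section 2 (`build_lookup_map` is an illustrative helper name, not part of the spec):
+
+ ```python
+ import requests
+
+ URLS = [
+     "https://raw.githubusercontent.com/commoncurriculum/common-standards-project/master/data/clean-data/CCSSI/Mathematics.json",
+     "https://raw.githubusercontent.com/commoncurriculum/common-standards-project/master/data/clean-data/CCSSI/ELA-Literacy.json",
+ ]
+
+ def build_lookup_map() -> dict[str, dict]:
+     """Download both subject files and merge their 'standards' dicts into one GUID-keyed map."""
+     lookup: dict[str, dict] = {}
+     for url in URLS:
+         root = requests.get(url, timeout=60).json()
+         lookup.update(root["standards"])  # keyed by internal GUID
+     return lookup
+ ```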
89
+ ### **Step B: Iteration & Filtering**
90
+
91
+ Iterate through the Lookup Map.
92
+ **Filter Rule:**
93
+
94
+ - **KEEP** if `statementLabel` equals `"Standard"`.
95
+ - **DISCARD** if `statementLabel` is `"Domain"`, `"Cluster"`, or `"Component"`. (We only index the actionable leaves).
96
+
97
+ ### **Step C: Context Resolution (The "Breadcrumb" Loop)**
98
+
99
+ For every kept Standard:
100
+
101
+ 1. Initialize `context_text = ""`
102
+ 2. Iterate through `ancestorIds`:
103
+ - Use the ID to look up the Parent Object in the **Lookup Map**.
104
+ - Append `Parent.description` to `context_text`.
105
+ 3. Construct the final string:
106
+ - `full_text = f"{context_text} {current_standard.description}"`
107
+ 4. **Vectorize `full_text`**.
108
+
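+ A minimal sketch of Steps B and C combined, assuming the Lookup Map from Step A and the field names defined in Section 2; the function names are illustrative, and the search string follows the formula from Section 3:
+
+ ```python
+ def iter_standards(lookup: dict[str, dict]):
+     """Step B filter: keep only the actionable 'Standard' leaves."""
+     for item in lookup.values():
+         if item.get("statementLabel") == "Standard":
+             yield item
+
+ def build_search_string(standard: dict, lookup: dict[str, dict], subject: str) -> str:
+     """Step C: resolve the ancestor breadcrumb and prepend it to the standard's own text."""
+     parts = []
+     for ancestor_id in standard.get("ancestorIds", []):
+         parent = lookup.get(ancestor_id)
+         if parent and parent.get("description"):
+             parts.append(parent["description"])    # Domain, then Cluster
+     parts.append(standard.get("description", ""))  # the Standard itself
+     grade = (standard.get("gradeLevels") or ["NA"])[0]
+     return f"{subject} {grade}: " + " - ".join(parts)
+ ```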
109
+ ### **Step D: Output Schema (`data/standards.json`)**
110
+
111
+ The clean, flat JSON file you save for the App to load must look like this:
112
+
113
+ ```json
114
+ [
115
+ {
116
+ "id": "CCSS.Math.Content.1.OA.A.1", // From 'statementNotation'
117
+ "guid": "6051566A...", // From 'id'
118
+ "grade": "01", // From 'gradeLevels[0]'
119
+ "subject": "Mathematics", // From 'subject'
120
+ "description": "Use addition and subtraction within 20 to solve word problems...", // From 'description'
121
+ "full_context": "Operations and Algebraic Thinking - Represent and solve problems... - Use addition and..." // The text we used for embedding
122
+ }
123
+ ]
124
+ ```
125
+
126
+ ---
127
+
128
+ ## 5. Summary of Valid `gradeLevels`
129
+
130
+ When processing, normalize these strings if necessary, but typically they appear as:
131
+
132
+ - `K` (Kindergarten)
133
+ - `01` - `08` (Grades 1-8)
134
+ - `09-12` (High School generic)
135
+
136
+ _Note: If `gradeLevels` is an array `["09", "10", "11", "12"]`, you can display it as "High School" or "Grades 9-12"._
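+
+ A small display-normalization sketch for the note above (`display_grade` is an illustrative helper, not part of the data):
+
+ ```python
+ def display_grade(grade_levels: list[str]) -> str:
+     """Turn a gradeLevels array into a human-readable label."""
+     if not grade_levels:
+         return "Unknown"
+     if grade_levels == ["K"]:
+         return "Kindergarten"
+     if set(grade_levels) == {"09", "10", "11", "12"}:
+         return "High School (Grades 9-12)"
+     return ", ".join(f"Grade {g.lstrip('0')}" for g in grade_levels)
+ ```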
.agent/docs/hackathon_criteria.md ADDED
@@ -0,0 +1,97 @@
1
+ ### **Hackathon Track 1: Building MCP**
2
+
3
+ **Goal:** Build **Model Context Protocol (MCP) servers** that extend Large Language Model (LLM) capabilities.
4
+ **Focus:** Creating tools for data analysis, API integrations, specialized workflows, or novel capabilities that make AI agents more powerful.
5
+
6
+ [Hugging Face Link](https://huggingface.co/MCP-1st-Birthday)
7
+
8
+ ---
9
+
10
+ ### **1. Categories & Submission Tags**
11
+
12
+ You must classify your entry into **one** of the three categories below by adding the specific tag to your Hugging Face Space `README.md`.
13
+
14
+ - **Enterprise MCP Servers**
15
+ - _Focus:_ Business/corporate tools, workflows, data analysis.
16
+ - _Required Tag:_ `building-mcp-track-enterprise`
17
+ - **Consumer MCP Servers**
18
+ - _Focus:_ Personal utility, lifestyle, daily tasks.
19
+ - _Required Tag:_ `building-mcp-track-consumer`
20
+ - **Creative MCP Servers**
21
+ - _Focus:_ Art, media, novelty, unique interactions.
22
+ - _Required Tag:_ `building-mcp-track-creative`
23
+
24
+ ---
25
+
26
+ ### **2. Mandatory Registration Steps**
27
+
28
+ Before submitting, ensure you have completed the administrative requirements:
29
+
30
+ 1. **Join the Organization:** Click "Request to join this org" on the [Hackathon Hugging Face page](https://huggingface.co/MCP-1st-Birthday).
31
+ 2. **Register:** Complete the official registration form (linked on the hackathon page).
32
+ 3. **Team Members:** If working in a team (2-5 people), **all** members must join the organization and register individually.
33
+
34
+ ---
35
+
36
+ ### **3. Technical Requirements (Track 1 Specific)**
37
+
38
+ Your project must meet these technical standards to be eligible:
39
+
40
+ - **Functioning MCP Server:** The core of your project must be a working MCP server.
41
+ - **Integration:** It must integrate with an MCP client (e.g., Claude Desktop, Cursor, or similar).
42
+ - **Platform:** It must be published as a **Hugging Face Space**.
43
+ - _Note:_ It can be a Gradio app, since "Any Gradio app can be an MCP server."
44
+
45
+ ---
46
+
47
+ ### **4. Submission Deliverables**
48
+
49
+ Your final submission must include the following elements by the deadline:
50
+
51
+ - **Hugging Face Space:** The actual codebase hosted in the event organization.
52
+ - **README.md Metadata:**
53
+ - Include the **Track Tag** (see Section 1).
54
+ - (If a team) Include Hugging Face usernames of all team members.
55
+ - **Documentation:** Clear explanation of the tool’s purpose, capabilities, and usage instructions in the README.
56
+ - **Demo Video:**
57
+ - **Length:** 1–5 minutes.
58
+ - **Content:** Must show the MCP server **in action**, specifically demonstrating its integration with an MCP client (like Claude Desktop).
59
+ - **Social Media Proof:** A link to a post (X/Twitter, LinkedIn, etc.) about your project. This link must be included in your submission (likely in the README or submission form).
60
+
61
+ ---
62
+
63
+ ### **5. Judging Criteria**
64
+
65
+ Judges will evaluate your project based on:
66
+
67
+ 1. **Completeness:** Is the Space, video, documentation, and social link all present?
68
+ 2. **Functionality:** Does it work? Does it effectively use relevant functionalities (Gradio 6, MCPs)?
69
+ 3. **Real-world Impact:** Is the tool useful? Does it have potential for real-world application?
70
+ 4. **Creativity:** Is the idea or implementation innovative/original?
71
+ 5. **Design/UI-UX:** Is it polished, intuitive, and easy to use?
72
+ 6. **Documentation:** Is the implementation well-communicated in the README/video?
73
+
74
+ ---
75
+
76
+ ### **6. Timeline**
77
+
78
+ - **Hackathon Period:** November 14 – November 30, 2025.
79
+ - **Submission Deadline:** **November 30, 2025, at 11:59 PM UTC**.
80
+ - **Judging Period:** December 1 – December 14, 2025.
81
+ - **Winners Announced:** December 15, 2025.
82
+
83
+ ---
84
+
85
+ ### **7. Prizes (Track 1)**
86
+
87
+ - **Best Overall:** $1,500 USD + $1,250 Claude API credits.
88
+ - **Best Enterprise MCP Server:** $750 Claude API credits.
89
+ - **Best Consumer MCP Server:** $750 Claude API credits.
90
+ - **Best Creative MCP Server:** $750 Claude API credits.
91
+ - _Note:_ There are additional sponsor prizes (e.g., from Google Gemini, Modal, Blaxel) available if you use their specific tools/APIs.
92
+
93
+ ### **Rules Summary**
94
+
95
+ - **Original Work:** Must be created during the hackathon period (Nov 14–30).
96
+ - **Open Source:** Open-source licenses (MIT, Apache 2.0) are encouraged.
97
+ - **Team Size:** Solo or 2–5 members.
.agent/docs/user_stories.md ADDED
@@ -0,0 +1,81 @@
1
+ This **User Stories & Acceptance Criteria** document focuses on the professional and intentional nature of both homeschooling parents and classroom teachers, ensuring the tone reflects the serious work of education management.
2
+
3
+ ---
4
+
5
+ # EduMatch MCP: User Stories (Sprint 1)
6
+
7
+ **Project:** EduMatch MCP
8
+ **Sprint Focus:** Core Functionality (Search & Lookup)
9
+ **Target Personas:**
10
+
11
+ 1. **The Intentional Parent:** A homeschool educator seeking to formalize learning experiences and align daily life with educational benchmarks.
12
+ 2. **The Adaptive Teacher:** A classroom educator looking to tailor curriculum to student interests while maintaining strict adherence to state standards.
13
+
14
+ ---
15
+
16
+ ## Story 1: The "Retroactive Alignment" (Experience to Record)
17
+
18
+ **As a** homeschool parent or teacher,
19
+ **I want to** describe a completed activity, field trip, or real-world experience to the AI,
20
+ **So that** I can identify which Common Core standards were addressed and articulate the educational value in my official logs or lesson plans.
21
+
22
+ ### Context
23
+
24
+ Educators often seize "teachable moments" (e.g., a trip to a science center, a gardening project). They need to translate these rich, unstructured experiences into the rigid language of educational bureaucracy for reporting purposes.
25
+
26
+ ### Acceptance Criteria
27
+
28
+ 1. **Input:** The user provides a natural language narrative (e.g., "We visited the planetarium, looked at constellations, and calculated the distance between stars.").
29
+ 2. **System Action:** The system queries the `find_relevant_standards` tool using the narrative text.
30
+ 3. **Output:** The system returns a list of relevant standards (with ID and text) and a generated reasoning explaining _how_ the activity met that standard.
31
+ 4. **Tone Check:** The system treats the activity as a valid educational event, not an "accident," and helps the user professionalize their documentation.
32
+
33
+ **Example Prompt:**
34
+
35
+ > "I took my class to the Natural History Museum today. We focused on the timeline of the Jurassic period and compared the sizes of different fossils. Can you find the Common Core standards this visit supported so I can add them to my weekly report?"
36
+
37
+ ---
38
+
39
+ ## Story 2: The "Interest-Based Planner" (Proactive Integration)
40
+
41
+ **As an** educator looking to engage a student,
42
+ **I want to** input a specific student interest (e.g., Minecraft, Baking, Robotics) alongside a target grade level,
43
+ **So that** I can discover standards that can be taught _through_ that activity.
44
+
45
+ ### Context
46
+
47
+ Students learn best when engaged. Teachers and parents often want to build lessons around a child's obsession but need to ensure they aren't skipping required learning targets. This bridges the gap between "Fun" and "Required."
48
+
49
+ ### Acceptance Criteria
50
+
51
+ 1. **Input:** The user provides a topic and a constraint (e.g., "Baking cookies, 3rd Grade Math").
52
+ 2. **System Action:** The system queries `find_relevant_standards` with a combined vector of the activity and the grade level context.
53
+ 3. **Output:** The system returns standards that are semantically viable (e.g., standards about measurement, volume, or fractions for baking).
54
+ 4. **Reasoning:** The generated explanation explicitly suggests the connection (e.g., "This standard applies because baking requires understanding fractions to measure ingredients.").
55
+
56
+ **Example Prompt:**
57
+
58
+ > "My 3rd grader is obsessed with baking. I want to build a math unit around doubling recipes and measuring ingredients. Which standards can we cover with this project?"
59
+
60
+ ---
61
+
62
+ ## Story 3: The "Jargon Decoder" (Curriculum Clarification)
63
+
64
+ **As a** parent or teacher reviewing administrative documents,
65
+ **I want to** ask about a specific standard code (e.g., `CCSS.ELA-LITERACY.RL.4.3`),
66
+ **So that** I can retrieve the full text and hierarchy to understand exactly what is required of the student.
67
+
68
+ ### Context
69
+
70
+ Educational documentation is full of codes that are opaque to parents and hard to memorize for teachers. Users need a quick, authoritative lookup to verify requirements without leaving the chat interface.
71
+
72
+ ### Acceptance Criteria
73
+
74
+ 1. **Input:** The user provides a specific Standard ID/Code.
75
+ 2. **System Action:** The system identifies the code format and calls `get_standard_details`.
76
+ 3. **Output:** The system returns the full object, including the parent Domain and Cluster text, allowing the LLM to explain the standard in plain English.
77
+ 4. **Error Handling:** If the code doesn't exist, the system returns a polite failure message or suggests the user try a keyword search instead.
78
+
79
+ **Example Prompt:**
80
+
81
+ > "The state curriculum guide lists '1.OA.B.3' as a prerequisite for next week. What is that standard, and what does it look like in practice?"
.agent/specs/.gitignore ADDED
@@ -0,0 +1,2 @@
1
+ *draft.md
2
+ **/__*
.agent/specs/000_data_cli/spec.md ADDED
@@ -0,0 +1,1347 @@
1
+ # Data Ingestion CLI Specification
2
+
3
+ **Tool Name:** Common Core MCP Data CLI
4
+ **Framework:** `typer` (Python)
5
+ **Purpose:** To explore, discover, and download official standard sets (e.g., Utah Math, Wyoming Science) from the Common Standards Project API for local processing.
6
+ **Scope:** Development Tool (Dev Dependency). Not deployed to production.
7
+
8
+ **Architecture:** Clean separation between CLI interface (`tools/cli.py`) and business logic (`tools/api_client.py`, `tools/data_processor.py`). The CLI file contains only command definitions and invokes reusable functions.
9
+
10
+ **Initial Proof of Concept:** Grade 3 Mathematics for Utah, Wyoming, and Idaho.
11
+
12
+ ---
13
+
14
+ ## 1. Environment & Setup
15
+
16
+ ### 1.1 Prerequisites
17
+
18
+ To use this CLI, you must register for an API key.
19
+
20
+ 1. Go to [Common Standards Project Developers](https://commonstandardsproject.com/developers).
21
+ 2. Create an account and generate an **API Key**.
22
+
23
+ ### 1.2 Configuration (`.env`)
24
+
25
+ The CLI must load sensitive credentials from a local `.env` file.
26
+
27
+ ```bash
28
+ # .env file in project root
29
+ CSP_API_KEY=your_generated_api_key_here
30
+ ```
31
+
32
+ ### 1.3 Dependencies (`pyproject.toml`)
33
+
34
+ Add these to the existing `dependencies` array in the `[project]` table:
35
+
36
+ ```toml
37
+ # Under [project], inside the dependencies = [ ... ] array:
38
+ # ... existing dependencies ...
39
+ "typer", # CLI framework
40
+ "requests", # HTTP client for API calls
41
+ "rich", # Pretty printing tables in terminal
42
+ "loguru", # Structured logging
43
+ ```
44
+
45
+ **Note:** `python-dotenv` is already in the project dependencies.
46
+
47
+ ### 1.4 CLI Invocation
48
+
49
+ The CLI is invoked directly with Python (not via `uv`):
50
+
51
+ ```bash
52
+ python tools/cli.py --help
53
+ ```
54
+
55
+ ---
56
+
57
+ ## 2. API Reference (Internal)
58
+
59
+ The CLI acts as a wrapper around these specific Common Standards Project API endpoints.
60
+
61
+ **Base URL:** `https://api.commonstandardsproject.com/api/v1`
62
+ **Authentication:** Header `Api-Key: <YOUR_KEY>`
63
+
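+ A minimal request sketch (assumes the key is loaded from `.env` as `CSP_API_KEY`; the full client with retries lives in Section 6.1):
+
+ ```python
+ import os
+
+ import requests
+ from dotenv import load_dotenv
+
+ load_dotenv()
+ resp = requests.get(
+     "https://api.commonstandardsproject.com/api/v1/jurisdictions",
+     headers={"Api-Key": os.environ["CSP_API_KEY"]},
+     timeout=30,
+ )
+ resp.raise_for_status()
+ print(resp.json()["data"][:3])  # first few jurisdictions: id, title, type
+ ```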
64
+ ### Endpoint A: List Jurisdictions
65
+
66
+ - **URL:** `/jurisdictions`
67
+ - **Purpose:** Find the IDs for "Utah", "Wyoming", and "Idaho".
68
+ - **Response Shape:**
69
+ ```json
70
+ {
71
+ "data": [
72
+ { "id": "49FCDFBD...", "title": "Utah", "type": "state" },
73
+ ...
74
+ ]
75
+ }
76
+ ```
77
+
78
+ ### Endpoint B: List Standard Sets
79
+
80
+ - **URL:** `/standard_sets`
81
+ - **Query Params:** `jurisdictionId=<ID>`
82
+ - **Purpose:** Find "Utah Core Standards - Mathematics - Grade 3".
83
+ - **Response Shape:**
84
+ ```json
85
+ {
86
+ "data": [
87
+ {
88
+ "id": "SOME_SET_ID",
89
+ "title": "Utah Core Standards - Mathematics",
90
+ "subject": "Mathematics",
91
+ "educationLevels": ["03"]
92
+ }
93
+ ]
94
+ }
95
+ ```
96
+
97
+ ### Endpoint C: Get Standard Set (Download)
98
+
99
+ - **URL:** `/standard_sets/{standard_set_id}`
100
+ - **Purpose:** Download the full hierarchy for a specific set.
101
+ - **Response Shape:** Returns a complex object containing the full tree (Standards, Clusters, etc.).
102
+
103
+ ---
104
+
105
+ ## 3. CLI Architecture
106
+
107
+ ### 3.1 File Structure
108
+
109
+ The CLI is organized with clean separation of concerns:
110
+
111
+ - **`tools/cli.py`**: CLI command definitions only (uses Typer). Imports and invokes functions from other modules.
112
+ - **`tools/api_client.py`**: Business logic for interacting with Common Standards Project API. Includes retry mechanisms, rate limiting, and error handling.
113
+ - **`tools/data_processor.py`**: Business logic for processing raw API data into flattened format with embeddings.
114
+ - **`tools/data_manager.py`**: Business logic for managing local data files (listing, status tracking, cleanup).
115
+
116
+ ### 3.2 Command Structure
117
+
118
+ ```bash
119
+ # View help
120
+ python tools/cli.py --help
121
+
122
+ # Explore API
123
+ python tools/cli.py jurisdictions --search "Utah"
124
+ python tools/cli.py sets <JURISDICTION_ID>
125
+
126
+ # Download raw data
127
+ python tools/cli.py download <SET_ID>
128
+
129
+ # View local data
130
+ python tools/cli.py list
131
+
132
+ # Process raw data
133
+ python tools/cli.py process <SET_ID>
134
+
135
+ # Check processing status
136
+ python tools/cli.py status
137
+ ```
138
+
139
+ ---
140
+
141
+ ## 4. Command Specifications
142
+
143
+ ### Command 1: `jurisdictions`
144
+
145
+ Allows the developer to find the internal IDs for states/organizations.
146
+
147
+ - **Arguments:** None.
148
+ - **Options:**
149
+ - `--search` / `-s` (Optional): Filter output by name (case-insensitive).
150
+ - **Business Logic:** Implemented in `api_client.get_jurisdictions(search_term: str | None) -> list[dict]`
151
+ - **Display Logic:**
152
+ 1. Call `api_client.get_jurisdictions()`.
153
+ 2. Print table using `rich.table.Table`: `ID | Title | Type`.
154
+ 3. Log operation with loguru.
155
+
156
+ ### Command 2: `sets`
157
+
158
+ Allows the developer to see what standards are available for a specific state.
159
+
160
+ - **Arguments:**
161
+ - `jurisdiction_id` (Required): The ID found in the previous command.
162
+ - **Business Logic:** Implemented in `api_client.get_standard_sets(jurisdiction_id: str) -> list[dict]`
163
+ - **Display Logic:**
164
+ 1. Call `api_client.get_standard_sets(jurisdiction_id)`.
165
+ 2. Print table: `Set ID | Subject | Title | Grade Levels`.
166
+ 3. Log operation.
167
+
168
+ ### Command 3: `download`
169
+
170
+ Downloads the official JSON definition for a standard set and saves it locally in an organized directory structure.
171
+
172
+ - **Arguments:**
173
+ - `set_id` (Required): The ID of the standard set (e.g., Utah Math).
174
+ - **Options:** None (output path is automatically determined based on metadata).
175
+ - **Business Logic:** Implemented in `api_client.download_standard_set(set_id: str) -> dict` and `data_manager.save_raw_data(set_id: str, data: dict, metadata: dict) -> Path`
176
+ - **Workflow:**
177
+ 1. Call `api_client.download_standard_set(set_id)` (includes retry logic).
178
+ 2. Extract metadata: jurisdiction, subject, grade levels.
179
+ 3. Call `data_manager.save_raw_data()` to save with auto-generated path.
180
+ 4. Print success message with file path.
181
+ 5. Log download operation.
182
+
183
+ ### Command 4: `list`
184
+
185
+ Shows all downloaded raw datasets with their metadata.
186
+
187
+ - **Arguments:** None.
188
+ - **Business Logic:** Implemented in `data_manager.list_downloaded_data() -> list[dict]`
189
+ - **Display Logic:**
190
+ 1. Call `data_manager.list_downloaded_data()`.
191
+ 2. Print table: `Set ID | Subject | Title | Grade Levels | Downloaded | Processed`.
192
+ 3. Show total count.
193
+
194
+ ### Command 5: `process`
195
+
196
+ Processes a raw downloaded dataset into flattened format with embeddings.
197
+
198
+ - **Arguments:**
199
+ - `set_id` (Required): The ID of the standard set to process.
200
+ - **Business Logic:** Implemented in `data_processor.process_standard_set(set_id: str) -> tuple[Path, Path]`
201
+ - **Workflow:**
202
+ 1. Verify raw data exists for set_id.
203
+ 2. Call `data_processor.process_standard_set(set_id)`.
204
+ 3. Generate flattened standards.json.
205
+ 4. Generate embeddings.npy.
206
+ 5. Save to `data/processed/<jurisdiction>/<subject>/`.
207
+ 6. Update processing status metadata.
208
+ 7. Print success message with output paths.
209
+ 8. Log processing operation.
210
+
211
+ ### Command 6: `status`
212
+
213
+ Shows processing status for all datasets (processed vs unprocessed).
214
+
215
+ - **Arguments:** None.
216
+ - **Business Logic:** Implemented in `data_manager.get_processing_status() -> dict`
217
+ - **Display Logic:**
218
+ 1. Call `data_manager.get_processing_status()`.
219
+ 2. Show summary: Total Downloaded, Processed, Unprocessed.
220
+ 3. List unprocessed datasets.
221
+ 4. List processed datasets with output paths.
222
+
223
+ ---
224
+
225
+ ## 5. Data Directory Structure
226
+
227
+ ### 5.1 Raw Data Organization
228
+
229
+ Downloaded raw data is organized by jurisdiction and stored locally only (not in git):
230
+
231
+ ```
232
+ data/raw/
233
+ ├── <jurisdiction_id>/
234
+ │   ├── <set_id>/
235
+ │   │   ├── data.json        # Raw API response
236
+ │   │   └── metadata.json    # Download metadata
237
+ │   └── <set_id>/
238
+ │       ├── data.json
239
+ │       └── metadata.json
240
+ ```
241
+
242
+ **Example:**
243
+
244
+ ```
245
+ data/raw/
246
+ ├── 49FCDFBD.../          # Utah
247
+ │   ├── ABC123.../        # Utah Math Grade 3
248
+ │   │   ├── data.json
249
+ │   │   └── metadata.json
250
+ │   └── DEF456.../        # Utah Science Grade 5
251
+ │       ├── data.json
252
+ │       └── metadata.json
253
+ └── 82ABCDEF.../          # Wyoming
254
+     └── GHI789.../        # Wyoming Math Grade 3
255
+         ├── data.json
256
+         └── metadata.json
257
+ ```
258
+
259
+ ### 5.2 Processed Data Organization
260
+
261
+ Processed data (flattened standards with embeddings) is organized by logical grouping:
262
+
263
+ ```
264
+ data/processed/
265
+ ├── <jurisdiction_name>/
266
+ │   ├── <subject>/
267
+ │   │   ├── <grade_level>/
268
+ │   │   │   ├── standards.json    # Flattened standards
269
+ │   │   │   └── embeddings.npy    # Vector embeddings
270
+ ```
271
+
272
+ **Example (Initial Proof of Concept):**
273
+
274
+ ```
275
+ data/processed/
276
+ ├── utah/
277
+ │   └── mathematics/
278
+ │       └── grade_03/
279
+ │           ├── standards.json
280
+ │           └── embeddings.npy
281
+ ├── wyoming/
282
+ │   └── mathematics/
283
+ │       └── grade_03/
284
+ │           ├── standards.json
285
+ │           └── embeddings.npy
286
+ └── idaho/
287
+     └── mathematics/
288
+         └── grade_03/
289
+             ├── standards.json
290
+             └── embeddings.npy
291
+ ```
292
+
293
+ **Git Tracking:**
294
+
295
+ - `data/raw/` is added to `.gitignore` (local only)
296
+ - `data/processed/` for example datasets (Utah, Wyoming, Idaho Math Grade 3) is committed to git
297
+ - For production expansion, processed data would move to a vector database
298
+
299
+ ### 5.3 Metadata Schema
300
+
301
+ The `metadata.json` file stored with each raw dataset:
302
+
303
+ ```json
304
+ {
305
+ "set_id": "ABC123...",
306
+ "title": "Utah Core Standards - Mathematics - Grade 3",
307
+ "jurisdiction": {
308
+ "id": "49FCDFBD...",
309
+ "title": "Utah"
310
+ },
311
+ "subject": "Mathematics",
312
+ "grade_levels": ["03"],
313
+ "download_date": "2024-11-25T10:30:00Z",
314
+ "download_url": "https://api.commonstandardsproject.com/api/v1/standard_sets/ABC123...",
315
+ "processed": false,
316
+ "processed_date": null,
317
+ "processed_output": null
318
+ }
319
+ ```
320
+
321
+ ---
322
+
323
+ ## 6. Implementation Guide
324
+
325
+ The implementation follows clean architecture principles with separated concerns.
326
+
327
+ ### 6.1 API Client Module (`tools/api_client.py`)
328
+
329
+ Handles all interactions with the Common Standards Project API, including retry logic, rate limiting, and error handling.
330
+
331
+ ```python
332
+ """API client for Common Standards Project with retry logic and rate limiting."""
333
+ from __future__ import annotations
334
+
335
+ import os
336
+ import time
337
+ from typing import Any
338
+
339
+ import requests
340
+ from dotenv import load_dotenv
341
+ from loguru import logger
342
+
343
+ load_dotenv()
344
+
345
+ API_KEY = os.getenv("CSP_API_KEY")
346
+ BASE_URL = "https://api.commonstandardsproject.com/api/v1"
347
+
348
+ # Rate limiting: Max requests per minute
349
+ MAX_REQUESTS_PER_MINUTE = 60
350
+ _request_timestamps: list[float] = []
351
+
352
+
353
+ class APIError(Exception):
354
+ """Raised when API request fails after all retries."""
355
+ pass
356
+
357
+
358
+ def _get_headers() -> dict[str, str]:
359
+ """Get authentication headers for API requests."""
360
+ if not API_KEY:
361
+ logger.error("CSP_API_KEY not found in .env file")
362
+ raise ValueError("CSP_API_KEY environment variable not set")
363
+ return {"Api-Key": API_KEY}
364
+
365
+
366
+ def _enforce_rate_limit() -> None:
367
+ """Enforce rate limiting by tracking request timestamps."""
368
+ global _request_timestamps
369
+ now = time.time()
370
+
371
+ # Remove timestamps older than 1 minute
372
+ _request_timestamps = [ts for ts in _request_timestamps if now - ts < 60]
373
+
374
+ # If at limit, wait
375
+ if len(_request_timestamps) >= MAX_REQUESTS_PER_MINUTE:
376
+ sleep_time = 60 - (now - _request_timestamps[0])
377
+ logger.warning(f"Rate limit reached. Waiting {sleep_time:.1f} seconds...")
378
+ time.sleep(sleep_time)
379
+ _request_timestamps = []
380
+
381
+ _request_timestamps.append(now)
382
+
383
+
384
+ def _make_request(
385
+ endpoint: str,
386
+ params: dict[str, Any] | None = None,
387
+ max_retries: int = 3
388
+ ) -> dict[str, Any]:
389
+ """
390
+ Make API request with exponential backoff retry logic.
391
+
392
+ Args:
393
+ endpoint: API endpoint path (e.g., "/jurisdictions")
394
+ params: Query parameters
395
+ max_retries: Maximum number of retry attempts
396
+
397
+ Returns:
398
+ Parsed JSON response
399
+
400
+ Raises:
401
+ APIError: After all retries exhausted or on fatal errors
402
+ """
403
+ url = f"{BASE_URL}{endpoint}"
404
+ headers = _get_headers()
405
+
406
+ for attempt in range(max_retries):
407
+ try:
408
+ _enforce_rate_limit()
409
+
410
+ logger.debug(f"API request: {endpoint} (attempt {attempt + 1}/{max_retries})")
411
+ response = requests.get(url, headers=headers, params=params, timeout=30)
412
+
413
+ # Handle specific status codes
414
+ if response.status_code == 401:
415
+ logger.error("Invalid API key (401 Unauthorized)")
416
+ raise APIError("Authentication failed. Check your CSP_API_KEY in .env")
417
+
418
+ if response.status_code == 404:
419
+ logger.error(f"Resource not found (404): {endpoint}")
420
+ raise APIError(f"Resource not found: {endpoint}")
421
+
422
+ if response.status_code == 429:
423
+ # Rate limited by server
424
+ retry_after = int(response.headers.get("Retry-After", 60))
425
+ logger.warning(f"Server rate limit hit. Waiting {retry_after} seconds...")
426
+ time.sleep(retry_after)
427
+ continue
428
+
429
+ response.raise_for_status()
430
+ logger.info(f"API request successful: {endpoint}")
431
+ return response.json()
432
+
433
+ except requests.exceptions.Timeout:
434
+ wait_time = 2 ** attempt # Exponential backoff: 1s, 2s, 4s
435
+ logger.warning(f"Request timeout. Retrying in {wait_time}s...")
436
+ if attempt < max_retries - 1:
437
+ time.sleep(wait_time)
438
+ else:
439
+ raise APIError(f"Request timeout after {max_retries} attempts")
440
+
441
+ except requests.exceptions.ConnectionError:
442
+ wait_time = 2 ** attempt
443
+ logger.warning(f"Connection error. Retrying in {wait_time}s...")
444
+ if attempt < max_retries - 1:
445
+ time.sleep(wait_time)
446
+ else:
447
+ raise APIError(f"Connection failed after {max_retries} attempts")
448
+
449
+ except requests.exceptions.HTTPError as e:
450
+ # Don't retry on 4xx errors (except 429)
451
+ if 400 <= response.status_code < 500 and response.status_code != 429:
452
+ raise APIError(f"HTTP {response.status_code}: {response.text}")
453
+ # Retry on 5xx errors
454
+ wait_time = 2 ** attempt
455
+ logger.warning(f"Server error {response.status_code}. Retrying in {wait_time}s...")
456
+ if attempt < max_retries - 1:
457
+ time.sleep(wait_time)
458
+ else:
459
+ raise APIError(f"Server error after {max_retries} attempts")
460
+
461
+ raise APIError("Request failed after all retries")
462
+
463
+
464
+ def get_jurisdictions(search_term: str | None = None) -> list[dict[str, Any]]:
465
+ """
466
+ Fetch all jurisdictions from the API.
467
+
468
+ Args:
469
+ search_term: Optional filter for jurisdiction title (case-insensitive)
470
+
471
+ Returns:
472
+ List of jurisdiction dicts with 'id', 'title', 'type' fields
473
+ """
474
+ logger.info("Fetching jurisdictions from API")
475
+ response = _make_request("/jurisdictions")
476
+ jurisdictions = response.get("data", [])
477
+
478
+ if search_term:
479
+ search_lower = search_term.lower()
480
+ jurisdictions = [
481
+ j for j in jurisdictions
482
+ if search_lower in j.get("title", "").lower()
483
+ ]
484
+ logger.info(f"Filtered to {len(jurisdictions)} jurisdictions matching '{search_term}'")
485
+
486
+ return jurisdictions
487
+
488
+
489
+ def get_standard_sets(jurisdiction_id: str) -> list[dict[str, Any]]:
490
+ """
491
+ Fetch standard sets for a specific jurisdiction.
492
+
493
+ Args:
494
+ jurisdiction_id: The jurisdiction GUID
495
+
496
+ Returns:
497
+ List of standard set dicts
498
+ """
499
+ logger.info(f"Fetching standard sets for jurisdiction {jurisdiction_id}")
500
+ response = _make_request("/standard_sets", params={"jurisdictionId": jurisdiction_id})
501
+ return response.get("data", [])
502
+
503
+
504
+ def download_standard_set(set_id: str) -> dict[str, Any]:
505
+ """
506
+ Download full standard set data.
507
+
508
+ Args:
509
+ set_id: The standard set GUID
510
+
511
+ Returns:
512
+ Complete standard set data including hierarchy
513
+ """
514
+ logger.info(f"Downloading standard set {set_id}")
515
+ response = _make_request(f"/standard_sets/{set_id}")
516
+ return response.get("data", {})
517
+ ```
518
+
519
+ ### 6.2 Data Manager Module (`tools/data_manager.py`)
520
+
521
+ Handles local file operations, directory structure, and metadata tracking.
522
+
523
+ ```python
524
+ """Manages local data storage and metadata tracking."""
525
+ from __future__ import annotations
526
+
527
+ import json
528
+ from datetime import datetime
529
+ from pathlib import Path
530
+ from typing import Any
531
+
532
+ from loguru import logger
533
+
534
+ # Data directories
535
+ PROJECT_ROOT = Path(__file__).parent.parent
536
+ RAW_DATA_DIR = PROJECT_ROOT / "data" / "raw"
537
+ PROCESSED_DATA_DIR = PROJECT_ROOT / "data" / "processed"
538
+
539
+
540
+ def save_raw_data(set_id: str, data: dict[str, Any], metadata_override: dict[str, Any] | None = None) -> Path:
541
+ """
542
+ Save raw standard set data with metadata.
543
+
544
+ Args:
545
+ set_id: Standard set GUID
546
+ data: Raw API response data
547
+ metadata_override: Optional metadata to merge (for jurisdiction info, etc.)
548
+
549
+ Returns:
550
+ Path to saved data file
551
+ """
552
+ # Extract metadata from data
553
+ jurisdiction_id = data.get("jurisdiction", {}).get("id", "unknown")
554
+ jurisdiction_title = data.get("jurisdiction", {}).get("title", "Unknown")
555
+
556
+ # Create directory structure
557
+ set_dir = RAW_DATA_DIR / jurisdiction_id / set_id
558
+ set_dir.mkdir(parents=True, exist_ok=True)
559
+
560
+ # Save raw data
561
+ data_file = set_dir / "data.json"
562
+ with open(data_file, "w", encoding="utf-8") as f:
563
+ json.dump(data, f, indent=2, ensure_ascii=False)
564
+
565
+ # Create metadata
566
+ metadata = {
567
+ "set_id": set_id,
568
+ "title": data.get("title", ""),
569
+ "jurisdiction": {
570
+ "id": jurisdiction_id,
571
+ "title": jurisdiction_title
572
+ },
573
+ "subject": data.get("subject", "Unknown"),
574
+ "grade_levels": data.get("educationLevels", []),
575
+ "download_date": datetime.utcnow().isoformat() + "Z",
576
+ "download_url": f"https://api.commonstandardsproject.com/api/v1/standard_sets/{set_id}",
577
+ "processed": False,
578
+ "processed_date": None,
579
+ "processed_output": None
580
+ }
581
+
582
+ # Merge override metadata
583
+ if metadata_override:
584
+ metadata.update(metadata_override)
585
+
586
+ # Save metadata
587
+ metadata_file = set_dir / "metadata.json"
588
+ with open(metadata_file, "w", encoding="utf-8") as f:
589
+ json.dump(metadata, f, indent=2, ensure_ascii=False)
590
+
591
+ logger.info(f"Saved raw data to {data_file}")
592
+ logger.info(f"Saved metadata to {metadata_file}")
593
+
594
+ return data_file
595
+
596
+
597
+ def list_downloaded_data() -> list[dict[str, Any]]:
598
+ """
599
+ List all downloaded raw datasets with their metadata.
600
+
601
+ Returns:
602
+ List of metadata dicts for each downloaded dataset
603
+ """
604
+ if not RAW_DATA_DIR.exists():
605
+ return []
606
+
607
+ datasets = []
608
+ for jurisdiction_dir in RAW_DATA_DIR.iterdir():
609
+ if not jurisdiction_dir.is_dir():
610
+ continue
611
+
612
+ for set_dir in jurisdiction_dir.iterdir():
613
+ if not set_dir.is_dir():
614
+ continue
615
+
616
+ metadata_file = set_dir / "metadata.json"
617
+ if metadata_file.exists():
618
+ with open(metadata_file, encoding="utf-8") as f:
619
+ metadata = json.load(f)
620
+ datasets.append(metadata)
621
+
622
+ logger.debug(f"Found {len(datasets)} downloaded datasets")
623
+ return datasets
624
+
625
+
626
+ def get_processing_status() -> dict[str, Any]:
627
+ """
628
+ Get processing status summary for all datasets.
629
+
630
+ Returns:
631
+ Dict with 'total', 'processed', 'unprocessed', 'processed_list', 'unprocessed_list'
632
+ """
633
+ datasets = list_downloaded_data()
634
+ processed = [d for d in datasets if d.get("processed", False)]
635
+ unprocessed = [d for d in datasets if not d.get("processed", False)]
636
+
637
+ return {
638
+ "total": len(datasets),
639
+ "processed": len(processed),
640
+ "unprocessed": len(unprocessed),
641
+ "processed_list": processed,
642
+ "unprocessed_list": unprocessed
643
+ }
644
+
645
+
646
+ def mark_as_processed(set_id: str, output_path: Path) -> None:
647
+ """
648
+ Update metadata to mark a dataset as processed.
649
+
650
+ Args:
651
+ set_id: Standard set GUID
652
+ output_path: Path to processed output directory
653
+ """
654
+ # Find the dataset
655
+ for jurisdiction_dir in RAW_DATA_DIR.iterdir():
656
+ if not jurisdiction_dir.is_dir():
657
+ continue
658
+
659
+ set_dir = jurisdiction_dir / set_id
660
+ if set_dir.exists():
661
+ metadata_file = set_dir / "metadata.json"
662
+ if metadata_file.exists():
663
+ with open(metadata_file, encoding="utf-8") as f:
664
+ metadata = json.load(f)
665
+
666
+ metadata["processed"] = True
667
+ metadata["processed_date"] = datetime.utcnow().isoformat() + "Z"
668
+ metadata["processed_output"] = str(output_path)
669
+
670
+ with open(metadata_file, "w", encoding="utf-8") as f:
671
+ json.dump(metadata, f, indent=2, ensure_ascii=False)
672
+
673
+ logger.info(f"Marked {set_id} as processed")
674
+ return
675
+
676
+ logger.warning(f"Could not find dataset {set_id} to mark as processed")
677
+ ```
678
+
679
+ ### 6.3 Data Processor Module (`tools/data_processor.py`)
680
+
681
+ Handles the transformation of raw API data into flattened format with embeddings.
682
+
683
+ ```python
684
+ """Processes raw standard sets into flattened format with embeddings."""
685
+ from __future__ import annotations
686
+
687
+ import json
688
+ from pathlib import Path
689
+ from typing import Any
690
+
691
+ import numpy as np
692
+ from loguru import logger
693
+ from sentence_transformers import SentenceTransformer
694
+
695
+ from tools.data_manager import PROCESSED_DATA_DIR, RAW_DATA_DIR, mark_as_processed
696
+
697
+
698
+ def _build_lookup_map(standards_dict: dict[str, Any]) -> dict[str, Any]:
699
+ """
700
+ Build lookup map from standards dictionary in API response.
701
+
702
+ The API returns standards in a flat dictionary keyed by GUID.
703
+
704
+ Args:
705
+ standards_dict: The 'standards' field from API response
706
+
707
+ Returns:
708
+ Lookup map of GUID -> standard object
709
+ """
710
+ logger.debug(f"Building lookup map with {len(standards_dict)} items")
711
+ return standards_dict
712
+
713
+
714
+ def _resolve_context(standard: dict[str, Any], lookup_map: dict[str, Any]) -> str:
715
+ """
716
+ Build full context string by resolving ancestor chain.
717
+
718
+ Concatenates descriptions from Domain -> Cluster -> Standard
719
+ to create rich context for embedding.
720
+
721
+ Args:
722
+ standard: Standard dict with 'ancestorIds' and 'description'
723
+ lookup_map: Map of GUID -> standard object
724
+
725
+ Returns:
726
+ Full context string with ancestors
727
+ """
728
+ context_parts = []
729
+
730
+ # Resolve ancestors
731
+ ancestor_ids = standard.get("ancestorIds", [])
732
+ for ancestor_id in ancestor_ids:
733
+ if ancestor_id in lookup_map:
734
+ ancestor = lookup_map[ancestor_id]
735
+ ancestor_desc = ancestor.get("description", "").strip()
736
+ if ancestor_desc:
737
+ context_parts.append(ancestor_desc)
738
+
739
+ # Add standard's own description
740
+ standard_desc = standard.get("description", "").strip()
741
+ if standard_desc:
742
+ context_parts.append(standard_desc)
743
+
744
+ return " - ".join(context_parts)
745
+
746
+
747
+ def _extract_grade(grade_levels: list[str]) -> str:
748
+ """Extract primary grade level from gradeLevels array."""
749
+ if not grade_levels:
750
+ return "Unknown"
751
+
752
+ grade = grade_levels[0]
753
+
754
+ # Handle high school ranges
755
+ if grade in ["09", "10", "11", "12"]:
756
+ return "09-12"
757
+
758
+ return grade
759
+
760
+
761
+ def _process_standards(
762
+ data: dict[str, Any],
763
+ lookup_map: dict[str, Any]
764
+ ) -> list[dict[str, Any]]:
765
+ """
766
+ Filter and process standards from raw API data.
767
+
768
+ Keeps only items where statementLabel == "Standard".
769
+
770
+ Args:
771
+ data: Raw API response
772
+ lookup_map: Map of GUID -> standard object
773
+
774
+ Returns:
775
+ List of processed standard dicts
776
+ """
777
+ processed = []
778
+ standards_dict = data.get("standards", {})
779
+ subject = data.get("subject", "Unknown")
780
+
781
+ for guid, item in standards_dict.items():
782
+ # Filter: Keep only "Standard" items
783
+ if item.get("statementLabel") != "Standard":
784
+ continue
785
+
786
+ # Extract fields
787
+ standard_id = item.get("statementNotation", "")
788
+ grade_levels = item.get("educationLevels", [])
789
+ grade = _extract_grade(grade_levels)
790
+ description = item.get("description", "").strip()
791
+
792
+ # Skip if missing critical fields
793
+ if not standard_id or not description:
794
+ continue
795
+
796
+ # Resolve full context
797
+ full_context = _resolve_context(item, lookup_map)
798
+
799
+ # Build output record
800
+ record = {
801
+ "id": standard_id,
802
+ "guid": guid,
803
+ "subject": subject,
804
+ "grade": grade,
805
+ "description": description,
806
+ "full_context": full_context
807
+ }
808
+
809
+ processed.append(record)
810
+
811
+ logger.info(f"Processed {len(processed)} standards")
812
+ return processed
813
+
814
+
815
+ def _generate_embeddings(standards: list[dict[str, Any]]) -> np.ndarray:
816
+ """
817
+ Generate embeddings for all standards.
818
+
819
+ Uses sentence-transformers with 'full_context' field.
820
+
821
+ Args:
822
+ standards: List of standard dicts
823
+
824
+ Returns:
825
+ Numpy array of embeddings
826
+ """
827
+ logger.info("Initializing sentence-transformers model...")
828
+ model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
829
+
830
+ contexts = [s["full_context"] for s in standards]
831
+
832
+ logger.info(f"Generating embeddings for {len(contexts)} standards...")
833
+ embeddings = model.encode(contexts, show_progress_bar=True)
834
+
835
+ return embeddings
836
+
837
+
838
+ def process_standard_set(set_id: str) -> tuple[Path, Path]:
839
+ """
840
+ Process a raw standard set into flattened format with embeddings.
841
+
842
+ Args:
843
+ set_id: Standard set GUID
844
+
845
+ Returns:
846
+ Tuple of (standards_file_path, embeddings_file_path)
847
+
848
+ Raises:
849
+ FileNotFoundError: If raw data not found for set_id
850
+ ValueError: If processing fails
851
+ """
852
+ logger.info(f"Processing standard set {set_id}")
853
+
854
+ # Find raw data
855
+ raw_data_file = None
856
+ metadata_file = None
857
+
858
+ for jurisdiction_dir in RAW_DATA_DIR.iterdir():
859
+ if not jurisdiction_dir.is_dir():
860
+ continue
861
+
862
+ set_dir = jurisdiction_dir / set_id
863
+ if set_dir.exists():
864
+ raw_data_file = set_dir / "data.json"
865
+ metadata_file = set_dir / "metadata.json"
866
+ break
867
+
868
+ if not raw_data_file or not raw_data_file.exists():
869
+ raise FileNotFoundError(f"Raw data not found for set {set_id}. Run download first.")
870
+
871
+ # Load metadata
872
+ with open(metadata_file, encoding="utf-8") as f:
873
+ metadata = json.load(f)
874
+
875
+ # Load raw data
876
+ with open(raw_data_file, encoding="utf-8") as f:
877
+ raw_data = json.load(f)
878
+
879
+ # Build lookup map
880
+ standards_dict = raw_data.get("standards", {})
881
+ lookup_map = _build_lookup_map(standards_dict)
882
+
883
+ # Process standards
884
+ processed_standards = _process_standards(raw_data, lookup_map)
885
+
886
+ if not processed_standards:
887
+ raise ValueError(f"No standards processed from set {set_id}")
888
+
889
+ # Generate embeddings
890
+ embeddings = _generate_embeddings(processed_standards)
891
+
892
+ # Determine output path
893
+ jurisdiction_name = metadata["jurisdiction"]["title"].lower().replace(" ", "_")
894
+ subject_name = metadata["subject"].lower().replace(" ", "_").replace("-", "_")
895
+ grade = _extract_grade(metadata["grade_levels"])
896
+ grade_str = f"grade_{grade}".replace("-", "_")
897
+
898
+ output_dir = PROCESSED_DATA_DIR / jurisdiction_name / subject_name / grade_str
899
+ output_dir.mkdir(parents=True, exist_ok=True)
900
+
901
+ # Save standards
902
+ standards_file = output_dir / "standards.json"
903
+ with open(standards_file, "w", encoding="utf-8") as f:
904
+ json.dump(processed_standards, f, indent=2, ensure_ascii=False)
905
+ logger.info(f"Saved standards to {standards_file}")
906
+
907
+ # Save embeddings
908
+ embeddings_file = output_dir / "embeddings.npy"
909
+ np.save(embeddings_file, embeddings)
910
+ logger.info(f"Saved embeddings to {embeddings_file}")
911
+
912
+ # Mark as processed
913
+ mark_as_processed(set_id, output_dir)
914
+
915
+ return standards_file, embeddings_file
916
+ ```
917
+
918
+ ### 6.4 CLI Entry Point (`tools/cli.py`)
919
+
920
+ Thin CLI layer that imports and invokes business logic functions.
921
+
922
+ ```python
923
+ """CLI entry point for EduMatch Data Management."""
924
+ from __future__ import annotations
925
+
926
+ import sys
927
+
928
+ import typer
929
+ from loguru import logger
930
+ from rich.console import Console
931
+ from rich.table import Table
932
+
933
+ from tools import api_client, data_manager, data_processor
934
+
935
+ # Configure logger
936
+ logger.remove() # Remove default handler
937
+ logger.add(sys.stderr, format="<green>{time:HH:mm:ss}</green> | <level>{level: <8}</level> | <level>{message}</level>")
938
+ logger.add("data/cli.log", rotation="10 MB", retention="7 days", format="{time} | {level} | {message}")
939
+
940
+ app = typer.Typer(help="EduMatch Data CLI - Manage educational standards data")
941
+ console = Console()
942
+
943
+
944
+ @app.command()
945
+ def jurisdictions(
946
+ search: str = typer.Option(None, "--search", "-s", help="Filter by jurisdiction name")
947
+ ):
948
+ """List all available jurisdictions (states/organizations)."""
949
+ try:
950
+ results = api_client.get_jurisdictions(search)
951
+
952
+ table = Table("ID", "Title", "Type", title="Jurisdictions")
953
+ for j in results:
954
+ table.add_row(
955
+ j.get("id", ""),
956
+ j.get("title", ""),
957
+ j.get("type", "N/A")
958
+ )
959
+
960
+ console.print(table)
961
+ console.print(f"\n[green]Found {len(results)} jurisdictions[/green]")
962
+
963
+ except Exception as e:
964
+ console.print(f"[red]Error: {e}[/red]")
965
+ logger.exception("Failed to fetch jurisdictions")
966
+ raise typer.Exit(code=1)
967
+
968
+
969
+ @app.command()
970
+ def sets(jurisdiction_id: str = typer.Argument(..., help="Jurisdiction ID")):
971
+ """List standard sets for a specific jurisdiction."""
972
+ try:
973
+ results = api_client.get_standard_sets(jurisdiction_id)
974
+
975
+ table = Table("Set ID", "Subject", "Title", "Grades", title=f"Standard Sets")
976
+ for s in results:
977
+ grade_levels = ", ".join(s.get("educationLevels", []))
978
+ table.add_row(
979
+ s.get("id", ""),
980
+ s.get("subject", "N/A"),
981
+ s.get("title", ""),
982
+ grade_levels or "N/A"
983
+ )
984
+
985
+ console.print(table)
986
+ console.print(f"\n[green]Found {len(results)} standard sets[/green]")
987
+
988
+ except Exception as e:
989
+ console.print(f"[red]Error: {e}[/red]")
990
+ logger.exception("Failed to fetch standard sets")
991
+ raise typer.Exit(code=1)
992
+
993
+
994
+ @app.command()
995
+ def download(set_id: str = typer.Argument(..., help="Standard set ID")):
996
+ """Download a standard set and save locally."""
997
+ try:
998
+ with console.status(f"[bold blue]Downloading set {set_id}..."):
999
+ data = api_client.download_standard_set(set_id)
1000
+ output_path = data_manager.save_raw_data(set_id, data)
1001
+
1002
+ console.print(f"[green]✓ Successfully downloaded to {output_path}[/green]")
1003
+
1004
+ except Exception as e:
1005
+ console.print(f"[red]Error: {e}[/red]")
1006
+ logger.exception("Failed to download standard set")
1007
+ raise typer.Exit(code=1)
1008
+
1009
+
1010
+ @app.command(name="list")
1011
+ def list_datasets():
1012
+ """List all downloaded datasets."""
1013
+ try:
1014
+ datasets = data_manager.list_downloaded_data()
1015
+
1016
+ if not datasets:
1017
+ console.print("[yellow]No datasets downloaded yet.[/yellow]")
1018
+ return
1019
+
1020
+ table = Table("Set ID", "Subject", "Title", "Grades", "Downloaded", "Processed", title="Downloaded Datasets")
1021
+ for d in datasets:
1022
+ table.add_row(
1023
+ d["set_id"][:12] + "...",
1024
+ d.get("subject", "N/A"),
1025
+ d.get("title", "")[:50],
1026
+ ", ".join(d.get("grade_levels", [])),
1027
+ d.get("download_date", "")[:10],
1028
+ "✓" if d.get("processed") else "✗"
1029
+ )
1030
+
1031
+ console.print(table)
1032
+ console.print(f"\n[green]Total: {len(datasets)} datasets[/green]")
1033
+
1034
+ except Exception as e:
1035
+ console.print(f"[red]Error: {e}[/red]")
1036
+ logger.exception("Failed to list datasets")
1037
+ raise typer.Exit(code=1)
1038
+
1039
+
1040
+ @app.command()
1041
+ def process(set_id: str = typer.Argument(..., help="Standard set ID to process")):
1042
+ """Process a downloaded dataset into flattened format with embeddings."""
1043
+ try:
1044
+ with console.status(f"[bold blue]Processing set {set_id}..."):
1045
+ standards_file, embeddings_file = data_processor.process_standard_set(set_id)
1046
+
1047
+ console.print(f"[green]✓ Processing complete![/green]")
1048
+ console.print(f" Standards: {standards_file}")
1049
+ console.print(f" Embeddings: {embeddings_file}")
1050
+
1051
+ except FileNotFoundError as e:
1052
+ console.print(f"[red]Error: {e}[/red]")
1053
+ console.print("[yellow]Hint: Run 'download' command first.[/yellow]")
1054
+ raise typer.Exit(code=1)
1055
+ except Exception as e:
1056
+ console.print(f"[red]Error: {e}[/red]")
1057
+ logger.exception("Failed to process dataset")
1058
+ raise typer.Exit(code=1)
1059
+
1060
+
1061
+ @app.command()
1062
+ def status():
1063
+ """Show processing status for all datasets."""
1064
+ try:
1065
+ status_data = data_manager.get_processing_status()
1066
+
1067
+ console.print(f"\n[bold]Processing Status Summary[/bold]")
1068
+ console.print(f" Total Downloaded: {status_data['total']}")
1069
+ console.print(f" Processed: {status_data['processed']}")
1070
+ console.print(f" Unprocessed: {status_data['unprocessed']}")
1071
+
1072
+ if status_data["unprocessed_list"]:
1073
+ console.print(f"\n[yellow]Unprocessed Datasets:[/yellow]")
1074
+ for d in status_data["unprocessed_list"]:
1075
+ console.print(f" • {d['title']} ({d['set_id'][:12]}...)")
1076
+
1077
+ if status_data["processed_list"]:
1078
+ console.print(f"\n[green]Processed Datasets:[/green]")
1079
+ for d in status_data["processed_list"]:
1080
+ console.print(f" • {d['title']}")
1081
+ console.print(f" Output: {d.get('processed_output', 'N/A')}")
1082
+
1083
+ except Exception as e:
1084
+ console.print(f"[red]Error: {e}[/red]")
1085
+ logger.exception("Failed to get status")
1086
+ raise typer.Exit(code=1)
1087
+
1088
+
1089
+ if __name__ == "__main__":
1090
+ app()
1091
+ ```
1092
+
1093
+ ---
1094
+
1095
+ ## 7. API Data Format Reference
1096
+
1097
+ ### 7.1 Raw Data Structure (from API)
1098
+
1099
+ When you run the `download` command, the `data/raw/<jurisdiction>/<set_id>/data.json` files will contain the Common Standards Project API response format:
1100
+
1101
+ ```json
1102
+ {
1103
+ "id": "SET_ID",
1104
+ "title": "Utah Core Standards - Mathematics",
1105
+ "subject": "Mathematics",
1106
+ "educationLevels": ["03"],
1107
+ "jurisdiction": {
1108
+ "id": "JURISDICTION_ID",
1109
+ "title": "Utah"
1110
+ },
1111
+ "standards": {
1112
+ "STANDARD_UUID": {
1113
+ "id": "STANDARD_UUID",
1114
+ "statementNotation": "3.OA.1",
1115
+ "description": "Interpret products of whole numbers...",
1116
+ "ancestorIds": ["CLUSTER_UUID", "DOMAIN_UUID"],
1117
+ "statementLabel": "Standard",
1118
+ "educationLevels": ["03"]
1119
+ },
1120
+ "CLUSTER_UUID": {
1121
+ "id": "CLUSTER_UUID",
1122
+ "description": "Represent and solve problems involving multiplication...",
1123
+ "statementLabel": "Cluster",
1124
+ "ancestorIds": ["DOMAIN_UUID"]
1125
+ },
1126
+ "DOMAIN_UUID": {
1127
+ "id": "DOMAIN_UUID",
1128
+ "description": "Operations and Algebraic Thinking",
1129
+ "statementLabel": "Domain",
1130
+ "ancestorIds": []
1131
+ }
1132
+ }
1133
+ }
1134
+ ```
1135
+
1136
+ **Key Points:**
1137
+
1138
+ - The `standards` field is a flat dictionary keyed by GUID (not an array)
1139
+ - Each item includes `ancestorIds` that reference other items in the same dictionary
1140
+ - The `statementLabel` field indicates the type: "Domain", "Cluster", or "Standard"
1141
+ - We filter to keep only `"statementLabel": "Standard"` items and resolve their ancestor context
1142
+
1143
+ ### 7.2 Processed Data Format
1144
+
1145
+ After running the `process` command, the output `standards.json` has this flattened structure:
1146
+
1147
+ ```json
1148
+ [
1149
+ {
1150
+ "id": "CCSS.Math.Content.3.OA.A.1",
1151
+ "guid": "STANDARD_UUID",
1152
+ "subject": "Mathematics",
1153
+ "grade": "03",
1154
+ "description": "Interpret products of whole numbers...",
1155
+ "full_context": "Operations and Algebraic Thinking - Represent and solve problems involving multiplication... - Interpret products of whole numbers..."
1156
+ }
1157
+ ]
1158
+ ```
1159
+
1160
+ The `full_context` field is created by concatenating:
1161
+
1162
+ 1. Domain description
1163
+ 2. Cluster description
1164
+ 3. Standard description
1165
+
1166
+ This rich context is used for generating embeddings that capture the hierarchical meaning.
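+
+ For illustration, here is a minimal sketch of how `full_context` could be assembled from the raw `standards` dictionary. The helper name and the surrounding wiring are assumptions, not the actual implementation in `tools/data_processor.py`:
+
+ ```python
+ def build_full_context(standard: dict, standards_by_id: dict) -> str:
+     """Concatenate ancestor descriptions (Domain, Cluster, ...) with the standard's own text."""
+     parts = []
+     # `ancestorIds` is ordered from root to immediate parent in this dataset.
+     for ancestor_id in standard.get("ancestorIds", []):
+         ancestor = standards_by_id.get(ancestor_id)
+         if ancestor and ancestor.get("description"):
+             parts.append(ancestor["description"])
+     parts.append(standard.get("description", ""))
+     return " - ".join(p for p in parts if p)
+
+ # Keep only "Standard" items and attach their resolved context:
+ # records = [
+ #     {"id": s.get("statementNotation"), "guid": s["id"], "description": s.get("description", ""),
+ #      "full_context": build_full_context(s, data["standards"])}
+ #     for s in data["standards"].values()
+ #     if s.get("statementLabel") == "Standard"
+ # ]
+ ```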
1167
+
1168
+ ---
1169
+
1170
+ ## 8. Error Handling & Retry Logic
1171
+
1172
+ ### 8.1 API Error Categories
1173
+
1174
+ The CLI handles these error scenarios (a retry sketch follows the table):
1175
+
1176
+ | Error Type | Status Code | Behavior |
1177
+ | ------------------ | ----------- | -------------------------------------------- |
1178
+ | Invalid API Key | 401 | Stop immediately, show error message |
1179
+ | Resource Not Found | 404 | Stop immediately, show helpful error |
1180
+ | Rate Limited | 429 | Wait for `Retry-After` header, then retry |
1181
+ | Timeout | - | Exponential backoff: 1s, 2s, 4s (3 attempts) |
1182
+ | Connection Error | - | Exponential backoff: 1s, 2s, 4s (3 attempts) |
1183
+ | Server Error | 5xx | Exponential backoff: 1s, 2s, 4s (3 attempts) |
1184
+ | Client Error | 4xx | Stop immediately (no retry) |
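+
+ For reference, a hedged sketch of this retry policy, assuming the API client is built on the `requests` library (the helper name, timeout, and session handling here are illustrative, not the project's actual implementation):
+
+ ```python
+ import time
+ import requests
+
+ def get_with_retries(url: str, headers: dict | None = None, max_attempts: int = 3) -> requests.Response:
+     """GET with retries: back off on timeouts, connection errors, and 5xx; honor Retry-After on 429; fail fast on other 4xx."""
+     delay = 1.0
+     for attempt in range(1, max_attempts + 1):
+         try:
+             response = requests.get(url, headers=headers, timeout=30)
+         except (requests.Timeout, requests.ConnectionError):
+             if attempt == max_attempts:
+                 raise
+             time.sleep(delay)
+             delay *= 2  # 1s, 2s, 4s
+             continue
+
+         if response.status_code == 429:
+             # Wait for the server-provided Retry-After, falling back to the current delay.
+             time.sleep(float(response.headers.get("Retry-After", delay)))
+             continue
+         if response.status_code >= 500 and attempt < max_attempts:
+             time.sleep(delay)
+             delay *= 2
+             continue
+
+         response.raise_for_status()  # Raises immediately for the remaining 4xx errors
+         return response
+
+     raise RuntimeError(f"Giving up on {url} after {max_attempts} attempts")
+ ```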
1185
+
1186
+ ### 8.2 Rate Limiting
1187
+
1188
+ - **Client-Side Limit:** 60 requests per minute
1189
+ - **Implementation:** Track request timestamps and enforce a delay when the limit is reached (see the sketch after this list)
1190
+ - **Server-Side Limit:** Respect `429` status and `Retry-After` header
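+
+ A minimal sliding-window sketch of the client-side limiter described above (the class name and where it hooks into the API client are assumptions):
+
+ ```python
+ import time
+ from collections import deque
+
+ class RateLimiter:
+     """Allow at most `max_requests` calls per `window_seconds`, sleeping when the budget is exhausted."""
+
+     def __init__(self, max_requests: int = 60, window_seconds: float = 60.0) -> None:
+         self.max_requests = max_requests
+         self.window_seconds = window_seconds
+         self._timestamps: deque[float] = deque()
+
+     def wait(self) -> None:
+         now = time.monotonic()
+         # Drop timestamps that have aged out of the window.
+         while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
+             self._timestamps.popleft()
+         if len(self._timestamps) >= self.max_requests:
+             time.sleep(max(self.window_seconds - (now - self._timestamps[0]), 0))
+         self._timestamps.append(time.monotonic())
+
+ # limiter = RateLimiter()
+ # limiter.wait()  # call before every API request
+ ```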
1191
+
1192
+ ### 8.3 Logging
1193
+
1194
+ All operations are logged using `loguru` (a configuration sketch follows this list):
1195
+
1196
+ - **Console:** Formatted output with timestamps and log levels
1197
+ - **File:** `data/cli.log` with rotation (10MB max, 7 days retention)
1198
+ - **Exception Tracking:** Full stack traces logged for debugging
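+
+ For reference, a loguru setup along these lines (sink paths, rotation, and retention mirror the description above; the exact format string and levels are placeholders):
+
+ ```python
+ import sys
+ from loguru import logger
+
+ logger.remove()  # Drop the default handler so both sinks are configured explicitly
+ logger.add(sys.stderr, level="INFO",
+            format="{time:YYYY-MM-DD HH:mm:ss} | {level} | {message}")
+ logger.add("data/cli.log", rotation="10 MB", retention="7 days", level="DEBUG")
+
+ # logger.exception("Failed to process dataset")  # records the full stack trace
+ ```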
1199
+
1200
+ ---
1201
+
1202
+ ## 9. Git Configuration
1203
+
1204
+ ### 9.1 .gitignore Additions
1205
+
1206
+ Add to `.gitignore`:
1207
+
1208
+ ```
1209
+ # Raw data (local only)
1210
+ data/raw/
1211
+
1212
+ # CLI logs
1213
+ data/cli.log
1214
+ ```
1215
+
1216
+ ### 9.2 Git Tracking
1217
+
1218
+ **Tracked in Git:**
1219
+
1220
+ - `data/processed/utah/mathematics/grade_03/` (example dataset)
1221
+ - `data/processed/wyoming/mathematics/grade_03/` (example dataset)
1222
+ - `data/processed/idaho/mathematics/grade_03/` (example dataset)
1223
+
1224
+ **Not Tracked:**
1225
+
1226
+ - `data/raw/` (developer's local cache)
1227
+ - `data/cli.log` (operational logs)
1228
+
1229
+ ---
1230
+
1231
+ ## 10. Testing Strategy
1232
+
1233
+ ### 10.1 Manual Testing Checklist
1234
+
1235
+ For this initial sprint, manual testing is sufficient. Complete these test scenarios:
1236
+
1237
+ **API Discovery:**
1238
+
1239
+ - [ ] Run `jurisdictions` command without search
1240
+ - [ ] Run `jurisdictions --search "Utah"` to filter
1241
+ - [ ] Verify table output is readable
1242
+ - [ ] Confirm logging to console and file
1243
+
1244
+ **Standard Set Discovery:**
1245
+
1246
+ - [ ] Find Utah jurisdiction ID from previous step
1247
+ - [ ] Run `sets <UTAH_ID>` to list available standards
1248
+ - [ ] Identify Mathematics Grade 3 set ID
1249
+ - [ ] Repeat for Wyoming and Idaho
1250
+
1251
+ **Data Download:**
1252
+
1253
+ - [ ] Run `download <SET_ID>` for Utah Math Grade 3
1254
+ - [ ] Verify file created in `data/raw/<jurisdiction>/<set_id>/data.json`
1255
+ - [ ] Verify metadata created in `data/raw/<jurisdiction>/<set_id>/metadata.json`
1256
+ - [ ] Repeat for Wyoming and Idaho Math Grade 3
1257
+ - [ ] Run `list` command to see all downloads
1258
+
1259
+ **Data Processing:**
1260
+
1261
+ - [ ] Run `status` to see unprocessed datasets
1262
+ - [ ] Run `process <SET_ID>` for Utah Math Grade 3
1263
+ - [ ] Verify `data/processed/utah/mathematics/grade_03/standards.json` created
1264
+ - [ ] Verify `data/processed/utah/mathematics/grade_03/embeddings.npy` created
1265
+ - [ ] Run `status` again to confirm marked as processed
1266
+ - [ ] Repeat for Wyoming and Idaho
1267
+
1268
+ **Error Handling:**
1269
+
1270
+ - [ ] Test with invalid API key (should fail immediately with clear message)
1271
+ - [ ] Test `download` with invalid set ID (should show 404 error)
1272
+ - [ ] Test `process` without downloading first (should show helpful error)
1273
+
1274
+ ### 10.2 Validation Criteria
1275
+
1276
+ After processing all three datasets, verify the following (a validation sketch follows this list):
1277
+
1278
+ - Each `standards.json` contains an array of standard objects
1279
+ - Each standard has all required fields: `id`, `guid`, `subject`, `grade`, `description`, `full_context`
1280
+ - The `full_context` field is not empty and contains ancestor descriptions
1281
+ - Each `embeddings.npy` file shape matches the number of standards
1282
+ - Metadata files correctly show `"processed": true`
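+
+ These checks can be scripted. A rough sketch (paths and the required-field list follow this spec; the script itself is illustrative):
+
+ ```python
+ import json
+ import numpy as np
+
+ REQUIRED_FIELDS = {"id", "guid", "subject", "grade", "description", "full_context"}
+
+ def validate_dataset(base_dir: str) -> None:
+     """Spot-check a processed dataset directory, e.g. data/processed/utah/mathematics/grade_03."""
+     with open(f"{base_dir}/standards.json") as f:
+         standards = json.load(f)
+     assert isinstance(standards, list) and standards, "standards.json must be a non-empty array"
+
+     for record in standards:
+         missing = REQUIRED_FIELDS - record.keys()
+         assert not missing, f"Record {record.get('guid')} missing fields: {missing}"
+         assert record["full_context"].strip(), f"Empty full_context for {record.get('guid')}"
+
+     embeddings = np.load(f"{base_dir}/embeddings.npy")
+     assert embeddings.shape[0] == len(standards), "Embedding count must match standard count"
+
+ # validate_dataset("data/processed/utah/mathematics/grade_03")
+ ```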
1283
+
1284
+ ---
1285
+
1286
+ ## 11. Implementation Workflow
1287
+
1288
+ ### 11.1 Development Order
1289
+
1290
+ Implement modules in this order:
1291
+
1292
+ 1. **`tools/api_client.py`** - Core API interactions with retry logic
1293
+ 2. **`tools/data_manager.py`** - File management and metadata tracking
1294
+ 3. **`tools/data_processor.py`** - Data transformation and embeddings
1295
+ 4. **`tools/cli.py`** - CLI commands that tie everything together
1296
+
1297
+ ### 11.2 Testing Workflow
1298
+
1299
+ After implementing each module:
1300
+
1301
+ 1. Test API discovery commands (`jurisdictions`, `sets`)
1302
+ 2. Test download command for one dataset
1303
+ 3. Test list command
1304
+ 4. Test process command
1305
+ 5. Test status command
1306
+ 6. Repeat download and process for remaining datasets
1307
+
1308
+ ### 11.3 Deliverables
1309
+
1310
+ **Code:**
1311
+
1312
+ - `tools/api_client.py`
1313
+ - `tools/data_manager.py`
1314
+ - `tools/data_processor.py`
1315
+ - `tools/cli.py`
1316
+
1317
+ **Data (committed to git):**
1318
+
1319
+ - `data/processed/utah/mathematics/grade_03/standards.json`
1320
+ - `data/processed/utah/mathematics/grade_03/embeddings.npy`
1321
+ - `data/processed/wyoming/mathematics/grade_03/standards.json`
1322
+ - `data/processed/wyoming/mathematics/grade_03/embeddings.npy`
1323
+ - `data/processed/idaho/mathematics/grade_03/standards.json`
1324
+ - `data/processed/idaho/mathematics/grade_03/embeddings.npy`
1325
+
1326
+ **Configuration:**
1327
+
1328
+ - Updated `pyproject.toml` with new dependencies
1329
+ - Updated `.gitignore` with data/raw/ and data/cli.log
1330
+
1331
+ ---
1332
+
1333
+ ## 12. Future Enhancements
1334
+
1335
+ Features not included in this sprint but planned for future:
1336
+
1337
+ - **Batch Processing:** Command to process all downloaded datasets at once
1338
+ - **Update Detection:** Check API for updates to already-downloaded sets
1339
+ - **Data Validation:** Verify processed data integrity
1340
+ - **Export Formats:** Support CSV or other output formats
1341
+ - **Automated Tests:** Unit tests for each module
1342
+ - **Configuration File:** YAML config for default settings (rate limits, retry attempts, etc.)
1343
+ - **Progress Tracking:** Better progress bars for long operations
1344
+
1345
+ ---
1346
+
1347
+ _This specification provides complete requirements for implementing a CLI tool to download and process Common Core standards from the Common Standards Project API, with clean architecture, robust error handling, and proper data organization for the MVP._
.agent/specs/001_pinecone/spec.md ADDED
@@ -0,0 +1,660 @@
1
+ # Pinecone Integration Sprint
2
+
3
+ ## Overview
4
+
5
+ This sprint integrates Pinecone for vector storage and semantic search of educational standards. We will use Pinecone's hosted embedding model (`llama-text-embed-v2`) for both embedding generation and search, leveraging Pinecone's native search functionality through the Python SDK. This approach allows rapid iteration, takes advantage of the free tier, and eliminates the need for local embedding generation.
6
+
7
+ ---
8
+
9
+ ## User Stories
10
+
11
+ **US-1: Transform Standards on Download**
12
+ As a developer, I want downloaded standard sets to be automatically transformed into Pinecone-ready format so that I don't need a separate processing step before uploading.
13
+
14
+ **US-2: Upload Standards to Pinecone**
15
+ As a developer, I want to upload processed standards to Pinecone with a single CLI command so that the standards become searchable.
16
+
17
+ **US-3: Prevent Duplicate Uploads**
18
+ As a developer, I want the system to track which standard sets have been uploaded so that I don't waste time and API calls re-uploading data.
19
+
20
+ **US-4: Resume Failed Uploads**
21
+ As a developer, I want to be able to resume uploads after a failure so that I can recover from crashes without starting over.
22
+
23
+ **US-5: Preview Before Upload**
24
+ As a developer, I want a dry-run option to see what would be uploaded without actually uploading so that I can verify the data before committing.
25
+
26
+ ---
27
+
28
+ ## Sprint Parts
29
+
30
+ ### Part 1: Transform Standard Sets on Download
31
+
32
+ Update the CLI `download-sets` command to create transformed `processed.json` files alongside the original `data.json`. These transformed files contain records ready for Pinecone ingestion.
33
+
34
+ ### Part 2: Pinecone Upsert CLI Command
35
+
36
+ Implement a new CLI command to upload transformed standard set records to Pinecone in batches, with tracking to prevent duplicate uploads.
37
+
38
+ ---
39
+
40
+ ## Technical Decisions
41
+
42
+ ### Text Content Format (for Embedding)
43
+
44
+ Use a structured text block with depth-based headers to preserve the full parent-child context:
45
+
46
+ ```
47
+ Depth 0: Geometry
48
+ Depth 1 (1.G.K): Reason with shapes and their attributes.
49
+ Depth 2 (1.G.K.3.Ba): Partition circles and rectangles into two equal shares and:
50
+ Depth 3 (1.G.K.3.Ba.B): Describe the whole as two of the shares.
51
+ ```
52
+
53
+ Format rules:
54
+
55
+ - Each line starts with `Depth N:` where N is the standard's depth value
56
+ - If `statementNotation` is present, include it in parentheses after the depth label
57
+ - Include the full ancestor chain from root (depth 0) down to the current standard
58
+ - Join all lines with `\n`
59
+
60
+ This format is depth-agnostic and works regardless of how deep the hierarchy goes, avoiding assumptions about what each depth level represents semantically.
61
+
62
+ ### Which Standards to Process
63
+
64
+ **Process ALL standards in the hierarchy, not just leaf nodes.** This enables:
65
+
66
+ - Direct lookup of any standard by ID (including parents)
67
+ - Navigation up and down the hierarchy
68
+ - Finding children of any standard
69
+ - Complete relationship traversal
70
+
71
+ Each record includes an `is_leaf` boolean to distinguish leaf nodes (no children) from branch nodes (have children). Search queries can filter to `is_leaf: true` when only the most specific standards are desired.
72
+
73
+ **Identifying leaf vs branch nodes:** A standard is a leaf node if its `id` does NOT appear as a `parentId` for any other standard in the set.
74
+
75
+ **Note:** The previous implementation in `data_processor.py` that filtered only on `statementLabel == "Standard"` is incorrect and should be completely replaced.
76
+
77
+ ### Pinecone Record ID Strategy
78
+
79
+ Use the individual standard's GUID as the record `_id` (e.g., `EA60C8D165F6481B90BFF782CE193F93`). This ensures uniqueness and enables direct lookup. The parent hierarchy is preserved in the text content, not the ID.
80
+
81
+ ### Namespace Strategy
82
+
83
+ Use a single namespace for all standards (configurable via environment variable, default: `standards`). This allows cross-jurisdiction and cross-subject searches in a single query. Filtering by metadata handles scoping to specific jurisdictions, subjects, or grade levels.
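+
+ For context, a hypothetical query against this single namespace that scopes by metadata filter. The exact shape of the `search` call depends on the installed `pinecone` SDK version, so treat this as a sketch rather than the project's final search API:
+
+ ```python
+ import os
+ from pinecone import Pinecone
+
+ pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
+ index = pc.Index("common-core-standards")
+
+ # Semantic search over all standards, scoped to one jurisdiction and leaf nodes only.
+ results = index.search(
+     namespace="standards",
+     query={
+         "inputs": {"text": "partition circles and rectangles into equal shares"},
+         "top_k": 5,
+         "filter": {"jurisdiction_title": "Wyoming", "is_leaf": True},
+     },
+     fields=["statement_notation", "content"],
+ )
+ ```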
84
+
85
+ ### Upload Tracking
86
+
87
+ Create a `.pinecone_uploaded` marker file in each standard set directory after successful upload. Before uploading, check for this marker file to skip already-uploaded sets.
88
+
89
+ ### Index Management
90
+
91
+ The Pinecone index should be created manually using the Pinecone CLI before running the upsert command. The index name is configured via environment variable.
92
+
93
+ **Index creation command (run once manually):**
94
+
95
+ ```bash
96
+ pc index create -n common-core-standards -m cosine -c aws -r us-east-1 --model llama-text-embed-v2 --field_map text=content
97
+ ```
98
+
99
+ The upsert CLI command should validate the index exists and fail with helpful instructions if not found.
100
+
101
+ ### File Changes
102
+
103
+ **Files to Delete:**
104
+
105
+ - `tools/data_processor.py` - Outdated local embedding approach, completely replaced
106
+
107
+ **Files to Edit:**
108
+
109
+ - `tools/cli.py` - Add `pinecone-upload` command, update `download-sets` to call processor, remove old `process` command
110
+ - `tools/config.py` - Add Pinecone configuration settings
111
+ - `.env.example` - Add Pinecone environment variables
112
+ - `pyproject.toml` - Remove `sentence-transformers` and `numpy`, add `pinecone`
113
+
114
+ **Files to Create:**
115
+
116
+ - `tools/pinecone_processor.py` - New module for transforming standards to Pinecone format
117
+ - `tools/pinecone_client.py` - New module for Pinecone SDK interactions (upsert, index validation)
118
+
119
+ ### Processing Trigger
120
+
121
+ The `download-sets` command will automatically generate `processed.json` immediately after saving `data.json`. This ensures processed data is always in sync with raw data.
122
+
123
+ ### Handling Missing Optional Fields
124
+
125
+ Per Pinecone best practices, handle missing fields as follows (see the sketch after this list):
126
+
127
+ - **Omit missing string/array fields** from the record entirely. Do not include empty strings or empty arrays for optional fields.
128
+ - **`parent_id`**: Store as `null` for root nodes (do not omit). This explicitly indicates "no parent" vs "unknown parent".
129
+ - **`statement_label`**: If missing in source, omit the field entirely. Do not infer from depth.
130
+ - **`statement_notation`**: If missing, omit from content text parentheses (just use `Depth N: {description}`).
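+
+ A small sketch of this omit-versus-null policy when assembling a record (the function and variable names are illustrative; field names match the schema below):
+
+ ```python
+ def base_record(std: dict, content: str) -> dict:
+     """Build a record, omitting optional fields that are missing instead of writing empty values."""
+     record = {
+         "_id": std["id"],
+         "content": content,
+         "parent_id": std.get("parentId"),  # explicitly None (null) for root nodes, never omitted
+     }
+     # Optional fields: include only when present and non-empty.
+     if std.get("statementLabel"):
+         record["statement_label"] = std["statementLabel"]
+     if std.get("statementNotation"):
+         record["statement_notation"] = std["statementNotation"]
+     if std.get("asnIdentifier"):
+         record["asn_identifier"] = std["asnIdentifier"]
+     return record
+ ```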
131
+
132
+ ### Handling Education Levels
133
+
134
+ The source `educationLevels` field may contain comma-separated values within individual array elements (e.g., `["01,02"]` instead of `["01", "02"]`). Process as follows:
135
+
136
+ 1. **Split comma-separated strings**: For each element in the array, split on commas to extract individual grade levels
137
+ 2. **Flatten**: Combine all split values into a single array
138
+ 3. **Deduplicate**: Remove any duplicate grade level strings
139
+ 4. **Preserve as array**: Store as an array of strings in the output—do NOT join back into a comma-separated string
140
+
141
+ Pinecone metadata supports string lists natively.
142
+
143
+ **Example transformation:**
144
+
145
+ ```python
146
+ # Input: ["01,02", "02", "03"]
147
+ # After split: [["01", "02"], ["02"], ["03"]]
148
+ # After flatten: ["01", "02", "02", "03"]
149
+ # After dedupe: ["01", "02", "03"]
150
+ # Output: ["01", "02", "03"]
151
+ ```
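+
+ A minimal sketch of this normalization (the helper name is hypothetical):
+
+ ```python
+ def normalize_education_levels(levels: list[str]) -> list[str]:
+     """Split comma-separated entries, flatten, and dedupe while preserving first-seen order."""
+     seen: dict[str, None] = {}
+     for entry in levels:
+         for level in entry.split(","):
+             level = level.strip()
+             if level:
+                 seen.setdefault(level, None)
+     return list(seen)
+
+ # normalize_education_levels(["01,02", "02", "03"])  # -> ["01", "02", "03"]
+ ```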
152
+
153
+ **Note:** The `education_levels` value comes from the **standard set** level (`data.educationLevels`), not from individual standards. Individual standards do not have their own education level field. The same education levels are applied to all records from a given standard set to enhance retrieval filtering.
154
+
155
+ ---
156
+
157
+ ## Processed.json Schema
158
+
159
+ Each `processed.json` file contains records ready for Pinecone upsert:
160
+
161
+ ```json
162
+ {
163
+ "records": [
164
+ {
165
+ "_id": "EA60C8D165F6481B90BFF782CE193F93",
166
+ "content": "Depth 0: Geometry\nDepth 1 (1.G.K): Reason with shapes and their attributes.\nDepth 2 (1.G.K.3.Ba): Partition circles and rectangles into two equal shares and:\nDepth 3 (1.G.K.3.Ba.B): Describe the whole as two of the shares.",
167
+ "standard_set_id": "744704BE56D44FB9B3D18B543FBF9BCC_D21218769_grade-01",
168
+ "standard_set_title": "Grade 1",
169
+ "subject": "Mathematics (2021-)",
170
+ "normalized_subject": "Math",
171
+ "education_levels": ["01"],
172
+ "document_id": "D21218769",
173
+ "document_valid": "2021",
174
+ "publication_status": "Published",
175
+ "jurisdiction_id": "744704BE56D44FB9B3D18B543FBF9BCC",
176
+ "jurisdiction_title": "Wyoming",
177
+ "asn_identifier": "S21238682",
178
+ "statement_notation": "1.G.K.3.Ba.B",
179
+ "statement_label": "Benchmark",
180
+ "depth": 3,
181
+ "is_leaf": true,
182
+ "is_root": false,
183
+ "parent_id": "3445678A58C74065B7DF5617B353B89C",
184
+ "root_id": "FE0D33F3287E4137AD66FA3926FAB114",
185
+ "ancestor_ids": [
186
+ "FE0D33F3287E4137AD66FA3926FAB114",
187
+ "386EA56EADD24A209DC2D77A71B2F89B",
188
+ "3445678A58C74065B7DF5617B353B89C"
189
+ ],
190
+ "child_ids": [],
191
+ "sibling_count": 1
192
+ }
193
+ ]
194
+ }
195
+ ```
196
+
197
+ ### Metadata Fields
198
+
199
+ All metadata fields must be flat (no nested objects) per Pinecone requirements. Arrays of strings are allowed.
200
+
201
+ **Standard Set Context:**
202
+
203
+ | Field | Description |
204
+ | -------------------- | --------------------------------------------------------------------------- |
205
+ | `_id` | Standard's unique GUID |
206
+ | `content` | Rich text block with full hierarchy (used for embedding) |
207
+ | `standard_set_id` | ID of the parent standard set |
208
+ | `standard_set_title` | Title of the standard set (e.g., "Grade 1") |
209
+ | `subject` | Full subject name |
210
+ | `normalized_subject` | Normalized subject (e.g., "Math", "ELA") |
211
+ | `education_levels` | Array of grade level strings (e.g., `["01"]` or `["09", "10", "11", "12"]`) |
212
+ | `document_id` | Document ID |
213
+ | `document_valid` | Year the document is valid |
214
+ | `publication_status` | Publication status (e.g., "Published") |
215
+ | `jurisdiction_id` | Jurisdiction GUID |
216
+ | `jurisdiction_title` | Jurisdiction name (e.g., "Wyoming") |
217
+
218
+ **Standard Identity & Position:**
219
+
220
+ | Field | Description |
221
+ | -------------------- | ------------------------------------------------------------ |
222
+ | `asn_identifier` | ASN identifier if available (e.g., "S21238682") |
223
+ | `statement_notation` | Standard notation teachers use (e.g., "1.G.K.3.Ba.B") |
224
+ | `statement_label` | Type of standard if present in source (e.g., "Benchmark") |
225
+ | `depth` | Hierarchy depth level (0 is root, increases with each level) |
226
+ | `is_leaf` | Boolean: true if this standard has no children |
227
+ | `is_root` | Boolean: true if this is a root node (depth=0, no parent) |
228
+
229
+ **Hierarchy Relationships:**
230
+
231
+ | Field | Description |
232
+ | --------------- | ----------------------------------------------------------------- |
233
+ | `parent_id` | Immediate parent's GUID, or `null` for root nodes |
234
+ | `root_id` | Root ancestor's GUID. For root nodes, equals the node's own `_id` |
235
+ | `ancestor_ids` | Array of ancestor GUIDs ordered root→parent (see ordering below) |
236
+ | `child_ids` | Array of direct children's GUIDs ordered by position (see below) |
237
+ | `sibling_count` | Number of siblings (standards with same parent_id), excludes self |
238
+
239
+ **Array Ordering Guarantees:**
240
+
241
+ `ancestor_ids` is ordered from **root (index 0) to immediate parent (last index)**:
242
+
243
+ ```
244
+ ancestor_ids[0] = root ancestor (depth 0)
245
+ ancestor_ids[1] = second level ancestor (depth 1)
246
+ ancestor_ids[2] = third level ancestor (depth 2)
247
+ ...
248
+ ancestor_ids[-1] = immediate parent (depth = current_depth - 1)
249
+ ancestor_ids.length = current standard's depth
250
+ ```
251
+
252
+ Example for a depth-3 standard:
253
+
254
+ ```python
255
+ ancestor_ids = ["ROOT_ID", "DEPTH1_ID", "DEPTH2_ID"]
256
+ # ancestor_ids[0] is the root (depth 0)
257
+ # ancestor_ids[1] is the depth-1 ancestor
258
+ # ancestor_ids[2] is the immediate parent (depth 2)
259
+ # To get ancestor at depth N: ancestor_ids[N]
260
+ ```
261
+
262
+ `child_ids` is ordered by the source `position` field (ascending), preserving the natural document order of standards.
263
+
264
+ ---
265
+
266
+ ## Configuration
267
+
268
+ ### Environment Variables
269
+
270
+ Add to `tools/config.py` and `.env.example`:
271
+
272
+ | Variable | Description | Default |
273
+ | --------------------- | -------------------------- | ----------------------- |
274
+ | `PINECONE_API_KEY` | Pinecone API key | (required) |
275
+ | `PINECONE_INDEX_NAME` | Name of the Pinecone index | `common-core-standards` |
276
+ | `PINECONE_NAMESPACE` | Namespace for records | `standards` |
277
+
278
+ ---
279
+
280
+ ## File Locations
281
+
282
+ - **Original data:** `data/raw/standardSets/{standard_set_id}/data.json`
283
+ - **Processed data:** `data/raw/standardSets/{standard_set_id}/processed.json`
284
+ - **Upload marker:** `data/raw/standardSets/{standard_set_id}/.pinecone_uploaded`
285
+
286
+ ### Upload Marker File Format
287
+
288
+ The `.pinecone_uploaded` marker file contains an ISO 8601 timestamp indicating when the upload completed:
289
+
290
+ ```
291
+ 2025-01-15T14:30:00Z
292
+ ```
293
+
294
+ This allows tracking when each standard set was last uploaded to Pinecone.
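+
+ A sketch of the marker helpers implied here (function names are assumptions; where they ultimately live is decided elsewhere in this sprint):
+
+ ```python
+ from datetime import datetime, timezone
+ from pathlib import Path
+
+ MARKER_NAME = ".pinecone_uploaded"
+
+ def mark_uploaded(set_dir: Path) -> None:
+     """Write the marker with the completion time in ISO 8601 (UTC)."""
+     timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
+     (set_dir / MARKER_NAME).write_text(timestamp + "\n")
+
+ def is_uploaded(set_dir: Path) -> bool:
+     return (set_dir / MARKER_NAME).exists()
+
+ def get_upload_timestamp(set_dir: Path) -> str | None:
+     marker = set_dir / MARKER_NAME
+     return marker.read_text().strip() if marker.exists() else None
+ ```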
295
+
296
+ ---
297
+
298
+ ## Source Data Reference
299
+
300
+ The source data is stored at `data/raw/standardSets/{standard_set_id}/data.json`. Key fields used for transformation:
301
+
302
+ **Standard Set Level:**
303
+
304
+ - `id`, `title`, `subject`, `normalizedSubject`, `educationLevels`
305
+ - `document.id`, `document.valid`, `document.asnIdentifier`, `document.publicationStatus`
306
+ - `jurisdiction.id`, `jurisdiction.title`
307
+
308
+ **Individual Standard Level:**
309
+
310
+ - `id` (GUID)
311
+ - `asnIdentifier`
312
+ - `depth` (hierarchy level, 0 is root)
313
+ - `position` (numeric sort order within the document - used for ordering `child_ids`)
314
+ - `statementNotation` (e.g., "1.G.K.3.Ba.B")
315
+ - `statementLabel` (e.g., "Domain", "Standard", "Benchmark")
316
+ - `description`
317
+ - `ancestorIds` (array of ancestor GUIDs - **order is NOT guaranteed**, must be rebuilt programmatically)
318
+ - `parentId`
319
+
320
+ ---
321
+
322
+ ## CLI Commands
323
+
324
+ ### Updated Command: `download-sets`
325
+
326
+ After downloading `data.json`, automatically call the processor to generate `processed.json` in the same directory. No changes to command arguments.
327
+
328
+ ### New Command: `pinecone-upload`
329
+
330
+ ```
331
+ Usage: cli pinecone-upload [OPTIONS]
332
+
333
+ Upload processed standard sets to Pinecone.
334
+
335
+ Options:
336
+ --set-id TEXT Upload a specific standard set by ID
337
+ --all Upload all downloaded standard sets with processed.json
338
+ --force Re-upload even if .pinecone_uploaded marker exists
339
+ --dry-run Show what would be uploaded without actually uploading
340
+ --batch-size INT Number of records per batch (default: 96)
341
+ ```
342
+
343
+ **Behavior:**
344
+
345
+ - If neither `--set-id` nor `--all` is provided, prompt for confirmation before uploading all
346
+ - Skip sets that have `.pinecone_uploaded` marker unless `--force` is specified
347
+ - Show progress with count of records uploaded
348
+ - On success, create `.pinecone_uploaded` marker file with timestamp
349
+ - On failure, log error and continue with next set (if `--all`)
350
+ - Validate index exists before attempting upload; fail with helpful instructions if not found
351
+
352
+ ### Removed Command: `process`
353
+
354
+ The old `process` command is removed as it used the deprecated local embedding approach.
355
+
356
+ ---
357
+
358
+ ## Dependencies
359
+
360
+ ### Remove from `pyproject.toml`:
361
+
362
+ - `sentence-transformers`
363
+ - `numpy<2`
364
+ - `huggingface_hub`
365
+
366
+ ### Add to `pyproject.toml`:
367
+
368
+ - `pinecone` (current SDK, not `pinecone-client`)
369
+
370
+ ---
371
+
372
+ ## Transformation Algorithm
373
+
374
+ ### Pre-processing: Build Relationship Maps
375
+
376
+ Before processing individual standards, build helper data structures from ALL standards in the set:
377
+
378
+ 1. **ID-to-standard map**: Map of `id` → standard object for lookups
379
+ 2. **Parent-to-children map**: Map of `parentId` → `[child_ids]`, with children **sorted by `position` ascending**
380
+ 3. **Leaf node set**: A standard is a leaf if its `id` does NOT appear as any standard's `parentId`
381
+ 4. **Root identification**: Find all standards where `parentId` is `null`. These are root nodes.
382
+
383
+ **Note on ordering:** The source data includes a `position` field for each standard that defines the natural document order. When building `child_ids`, sort by this `position` value to maintain consistent ordering.
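+
+ A condensed sketch of these helper structures (names and the exact return shape are illustrative):
+
+ ```python
+ def build_relationship_maps(standards: dict[str, dict]) -> tuple[dict, dict, set, set]:
+     """Build id->standard, parent->children (sorted by position), leaf-id, and root-id structures."""
+     id_to_standard = dict(standards)
+
+     parent_to_children: dict[str, list[str]] = {}
+     for std in standards.values():
+         parent_id = std.get("parentId")
+         if parent_id is not None:
+             parent_to_children.setdefault(parent_id, []).append(std["id"])
+     # Children keep the natural document order defined by `position`.
+     for children in parent_to_children.values():
+         children.sort(key=lambda child_id: id_to_standard[child_id].get("position", 0))
+
+     leaf_ids = {std_id for std_id in standards if std_id not in parent_to_children}
+     root_ids = {std_id for std_id, std in standards.items() if std.get("parentId") is None}
+     return id_to_standard, parent_to_children, leaf_ids, root_ids
+ ```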
384
+
385
+ ### Determining root_id
386
+
387
+ **Do NOT rely on the order of `ancestorIds` from the source data.** Instead, programmatically determine the root by walking up the parent chain:
388
+
389
+ ```python
390
+ def find_root_id(standard: dict, id_to_standard: dict[str, dict]) -> str:
391
+ """Walk up the parent chain to find the root ancestor."""
392
+ current = standard
393
+ visited = set() # Prevent infinite loops from bad data
394
+
395
+ while current.get("parentId") is not None:
396
+ parent_id = current["parentId"]
397
+ if parent_id in visited:
398
+ break # Circular reference protection
399
+ visited.add(parent_id)
400
+
401
+ if parent_id not in id_to_standard:
402
+ break # Parent not found, use current as root
403
+ current = id_to_standard[parent_id]
404
+
405
+ return current["id"]
406
+ ```
407
+
408
+ For root nodes themselves (where `parentId` is `null`), `root_id` equals the node's own `_id`.
409
+
410
+ ### Building ancestor_ids in Correct Order
411
+
412
+ Since `ancestorIds` order in source data is NOT guaranteed, rebuild the ancestor chain by walking up the parent chain:
413
+
414
+ ```python
415
+ def build_ordered_ancestors(standard: dict, id_to_standard: dict[str, dict]) -> list[str]:
416
+ """Build ancestor list ordered from root to immediate parent."""
417
+ ancestors = []
418
+ current_id = standard.get("parentId")
419
+ visited = set()
420
+
421
+ while current_id is not None and current_id not in visited:
422
+ visited.add(current_id)
423
+ if current_id in id_to_standard:
424
+ ancestors.append(current_id)
425
+ current_id = id_to_standard[current_id].get("parentId")
426
+ else:
427
+ break
428
+
429
+ ancestors.reverse() # Now ordered root → immediate parent
430
+ return ancestors
431
+ ```
432
+
433
+ ### Processing Each Standard
434
+
435
+ For **EACH** standard in the set (not just leaves), create a record:
436
+
437
+ **Step 1: Compute hierarchy relationships**
438
+
439
+ | Field | How to compute |
440
+ | --------------- | ----------------------------------------------------------------------- |
441
+ | `parent_id` | Copy from source `parentId` (`null` if not present) |
442
+ | `ancestor_ids` | Build using `build_ordered_ancestors()` - ordered root (idx 0) → parent |
443
+ | `root_id` | Use `find_root_id()`. For root nodes, equals own `_id` |
444
+ | `is_root` | `True` if `parentId` is `null` |
445
+ | `child_ids` | Look up in parent-to-children map, **sorted by `position` ascending** |
446
+ | `is_leaf` | `True` if `child_ids` is empty |
447
+ | `sibling_count` | Count of other standards with same `parent_id` (excludes self) |
448
+
449
+ **Step 2: Build the content text block** (see the sketch after these steps)
450
+
451
+ 1. Get ordered ancestors from the computed `ancestor_ids`
452
+ 2. Look up each ancestor in `id_to_standard` map
453
+ 3. Build text lines in order from root to current standard:
454
+ - If `statementNotation` is present: `Depth {depth} ({statementNotation}): {description}`
455
+ - Otherwise: `Depth {depth}: {description}`
456
+ 4. Join all lines with `\n`
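+
+ A sketch of this step in code, assuming the `id_to_standard` map and the ordered `ancestor_ids` computed in Step 1 (the function name is illustrative):
+
+ ```python
+ def build_content_text(standard: dict, ancestor_ids: list[str], id_to_standard: dict[str, dict]) -> str:
+     """Render 'Depth N (notation): description' lines from the root down to this standard."""
+     lines = []
+     for node in [id_to_standard[a] for a in ancestor_ids] + [standard]:
+         notation = node.get("statementNotation")
+         prefix = f"Depth {node['depth']} ({notation}):" if notation else f"Depth {node['depth']}:"
+         lines.append(f"{prefix} {node['description']}")
+     return "\n".join(lines)
+ ```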
457
+
458
+ **Step 3: Set statement_label**
459
+
460
+ - Copy `statementLabel` from source if present
461
+ - If missing in source, **omit the field entirely** (do not infer from depth)
462
+
463
+ ### Example Transformation
464
+
465
+ Given this hierarchy:
466
+
467
+ - Root (depth 0, id "FE0D..."): "Geometry"
468
+ - Child (depth 1, notation "1.G.K"): "Reason with shapes and their attributes."
469
+ - Child (depth 2, notation "1.G.K.3.Ba"): "Partition circles and rectangles into two equal shares and:"
470
+ - Child (depth 3, notation "1.G.K.3.Ba.B"): "Describe the whole as two of the shares."
471
+
472
+ Output `content` for the depth-3 standard:
473
+
474
+ ```
475
+ Depth 0: Geometry
476
+ Depth 1 (1.G.K): Reason with shapes and their attributes.
477
+ Depth 2 (1.G.K.3.Ba): Partition circles and rectangles into two equal shares and:
478
+ Depth 3 (1.G.K.3.Ba.B): Describe the whole as two of the shares.
479
+ ```
480
+
481
+ **For a root node** (depth 0, e.g., "Geometry"):
482
+
483
+ - `is_root`: `true`
484
+ - `root_id`: equals own `_id` (e.g., "FE0D...")
485
+ - `parent_id`: `null`
486
+ - `ancestor_ids`: `[]` (empty array)
487
+ - `content`: `Depth 0: Geometry`
488
+
489
+ ---
490
+
491
+ ## Error Handling
492
+
493
+ ### Processing Errors
494
+
495
+ - **Missing `data.json`**: Skip with warning, continue to next set
496
+ - **Invalid JSON**: Log error with file path and continue to next set
497
+ - **No standards found**: Create `processed.json` with an empty records array and log a warning
498
+
499
+ ### Pinecone API Errors
500
+
501
+ | Error Type | Action |
502
+ | ------------------- | ------------------------------------------------------ |
503
+ | 4xx (client errors) | Fail immediately, do not retry (indicates bad request) |
504
+ | 429 (rate limit) | Retry with exponential backoff |
505
+ | 5xx (server errors) | Retry with exponential backoff |
506
+
507
+ ### Retry Pattern
508
+
509
+ ```python
510
+ import time
511
+ from pinecone.exceptions import PineconeException
512
+
513
+ def exponential_backoff_retry(func, max_retries=5):
514
+ for attempt in range(max_retries):
515
+ try:
516
+ return func()
517
+ except PineconeException as e:
518
+ status_code = getattr(e, 'status', None)
519
+ # Only retry transient errors
520
+ if status_code and (status_code >= 500 or status_code == 429):
521
+ if attempt < max_retries - 1:
522
+ delay = min(2 ** attempt, 60) # Cap at 60s
523
+ time.sleep(delay)
524
+ else:
525
+ raise
526
+ else:
527
+ raise # Don't retry client errors
528
+ ```
529
+
530
+ ### Upload Failure Recovery
531
+
532
+ - On batch failure, log which standard set and batch failed
533
+ - Continue with remaining sets if `--all` flag is used
534
+ - Do NOT create `.pinecone_uploaded` marker for failed sets
535
+ - User can re-run command to retry failed sets
536
+
537
+ ---
538
+
539
+ ## Pinecone SDK Requirements
540
+
541
+ ### Installation
542
+
543
+ ```bash
544
+ pip install pinecone # ✅ Correct (current SDK)
545
+ # NOT: pip install pinecone-client # ❌ Deprecated
546
+ ```
547
+
548
+ ### Initialization
549
+
550
+ ```python
551
+ from pinecone import Pinecone
552
+ import os
553
+
554
+ api_key = os.getenv("PINECONE_API_KEY")
555
+ if not api_key:
556
+ raise ValueError("PINECONE_API_KEY environment variable not set")
557
+
558
+ pc = Pinecone(api_key=api_key)
559
+ index = pc.Index("common-core-standards")
560
+ ```
561
+
562
+ ### Index Validation
563
+
564
+ Use SDK to check index exists before upload:
565
+
566
+ ```python
567
+ if not pc.has_index(index_name):
568
+ # Fail with helpful message including the CLI command to create index
569
+ raise ValueError(f"Index '{index_name}' not found. Create it with:\n"
570
+ f"pc index create -n {index_name} -m cosine -c aws -r us-east-1 "
571
+ f"--model llama-text-embed-v2 --field_map text=content")
572
+ ```
573
+
574
+ ### Upserting Records
575
+
576
+ **Critical**: Use `upsert_records()` for indexes with integrated embeddings, NOT `upsert()`:
577
+
578
+ ```python
579
+ # ✅ Correct - for integrated embeddings
580
+ index.upsert_records(namespace, records)
581
+
582
+ # ❌ Wrong - this is for pre-computed vectors
583
+ index.upsert(vectors=...)
584
+ ```
585
+
586
+ ### Batch Processing
587
+
588
+ ```python
589
+ def batch_upsert(index, namespace, records, batch_size=96):
590
+ """Upsert records in batches with rate limiting."""
591
+ for i in range(0, len(records), batch_size):
592
+ batch = records[i:i + batch_size]
593
+ exponential_backoff_retry(
594
+ lambda b=batch: index.upsert_records(namespace, b)
595
+ )
596
+ time.sleep(0.1) # Rate limiting between batches
597
+ ```
598
+
599
+ ### Key Constraints
600
+
601
+ | Constraint | Limit | Notes |
602
+ | ------------------- | ------------------------------------------ | ---------------------------------------- |
603
+ | Text batch size | 96 records | Also 2MB total per batch |
604
+ | Metadata per record | 40KB | Flat JSON only |
605
+ | Metadata types | strings, ints, floats, bools, string lists | No nested objects |
606
+ | Consistency | Eventually consistent | Wait ~1-5s after upsert before searching |
607
+
608
+ ### Record Format
609
+
610
+ Records must have:
611
+
612
+ - `_id`: Unique identifier (string)
613
+ - `content`: Text field for embedding (must match `field_map` in index config)
614
+ - Additional flat metadata fields (no nesting)
615
+
616
+ ```python
617
+ record = {
618
+ "_id": "EA60C8D165F6481B90BFF782CE193F93",
619
+ "content": "Depth 0: Geometry\nDepth 1 (1.G.K): ...", # Embedded by Pinecone
620
+ "subject": "Mathematics", # Flat metadata
621
+ "jurisdiction_title": "Wyoming",
622
+ "depth": 3, # int allowed
623
+ "is_root": False, # bool allowed
624
+ "parent_id": "3445678A...", # null for root nodes
625
+ }
626
+ ```
627
+
628
+ ### Common Mistakes to Avoid
629
+
630
+ 1. **Nested metadata**: Will cause API errors
631
+
632
+ ```python
633
+ # ❌ Wrong
634
+ {"user": {"name": "John"}}
635
+ # ✅ Correct
636
+ {"user_name": "John"}
637
+ ```
638
+
639
+ 2. **Hardcoded API keys**: Always use environment variables
640
+
641
+ 3. **Missing namespace**: Always specify namespace in all operations
642
+
643
+ 4. **Wrong upsert method**: Use `upsert_records()` not `upsert()` for integrated embeddings
644
+
645
+ ---
646
+
647
+ ## Assumptions and Dependencies
648
+
649
+ ### Assumptions
650
+
651
+ - Pinecone free tier limits are sufficient for initial dataset
652
+ - The index has been created manually via Pinecone CLI before running upload
653
+ - API key has been configured in environment
654
+
655
+ ### Dependencies
656
+
657
+ - Python 3.12+
658
+ - `pinecone` SDK (current version, 2025)
659
+ - Pinecone account with API key
660
+ - Network access to Pinecone API
.agent/specs/001_pinecone/tasks.md ADDED
@@ -0,0 +1,109 @@
1
+ # Spec Tasks
2
+
3
+ Tasks for implementing the Pinecone Integration Sprint as defined in `spec.md`.
4
+
5
+ Recommended execution order:
6
+ - Task 1 (Setup) - foundation
7
+ - Task 2 (Models) - data structures
8
+ - Tasks 3-7 (Processor) - in sequence
9
+ - Tasks 8-9 (Client) - can run in parallel with Tasks 3-7
10
+ - Tasks 10-12 (CLI) - depends on the processor and client
11
+
12
+ ---
13
+
14
+ ## Tasks
15
+
16
+ - [x] 1. **Project Setup & Configuration**
17
+
18
+ - [x] 1.1 Update `pyproject.toml`: Remove `sentence-transformers`, `numpy<2`, `huggingface_hub`; add `pinecone`
19
+ - [x] 1.2 Update `tools/config.py`: Add `pinecone_api_key`, `pinecone_index_name` (default: `common-core-standards`), `pinecone_namespace` (default: `standards`) settings
20
+ - [x] 1.3 Create/update `.env.example`: Add `PINECONE_API_KEY`, `PINECONE_INDEX_NAME`, `PINECONE_NAMESPACE` variables
21
+ - [x] 1.4 Delete `tools/data_processor.py` (outdated local embedding approach)
22
+ - [x] 1.5 Run `pip install -e .` to verify dependencies resolve correctly
23
+
24
+ - [x] 2. **Pydantic Models for Processed Records**
25
+
26
+ - [x] 2.1 Create `tools/pinecone_models.py` with `PineconeRecord` model containing all fields from processed.json schema
27
+ - [x] 2.2 Define `ProcessedStandardSet` model with `records: list[PineconeRecord]` for the output file structure
28
+ - [x] 2.3 Add field validators for `education_levels` (split comma-separated, flatten, dedupe)
29
+ - [x] 2.4 Add `model_config` with `json_encoders` for proper null handling of `parent_id`
30
+ - [x] 2.5 Write unit tests for model validation and education_levels processing
31
+
32
+ - [x] 3. **Pinecone Processor - Relationship Maps**
33
+
34
+ - [x] 3.1 Create `tools/pinecone_processor.py` with `StandardSetProcessor` class
35
+ - [x] 3.2 Implement `_build_id_to_standard_map()`: Map of `id` → standard object
36
+ - [x] 3.3 Implement `_build_parent_to_children_map()`: Map of `parentId` → `[child_ids]` sorted by `position` ascending
37
+ - [x] 3.4 Implement `_identify_leaf_nodes()`: Set of IDs that are NOT any standard's `parentId`
38
+ - [x] 3.5 Implement `_identify_root_nodes()`: Set of IDs where `parentId` is `null`
39
+ - [x] 3.6 Write unit tests for relationship map building with sample data
40
+
41
+ - [x] 4. **Pinecone Processor - Hierarchy Functions**
42
+
43
+ - [x] 4.1 Implement `find_root_id()`: Walk up parent chain with circular reference protection
44
+ - [x] 4.2 Implement `build_ordered_ancestors()`: Build ancestor list ordered root (idx 0) → immediate parent
45
+ - [x] 4.3 Implement `_compute_sibling_count()`: Count standards with same `parent_id`, excluding self
46
+ - [x] 4.4 Write unit tests for hierarchy functions with various depth levels (0, 1, 3+)
47
+
48
+ - [x] 5. **Pinecone Processor - Content Generation**
49
+
50
+ - [x] 5.1 Implement `_build_content_text()`: Generate `Depth N (notation): description` format
51
+ - [x] 5.2 Handle missing `statementNotation` (omit parentheses)
52
+ - [x] 5.3 Handle root nodes (single line `Depth 0: description`)
53
+ - [x] 5.4 Write unit tests for content generation with various hierarchy depths
54
+
55
+ - [x] 6. **Pinecone Processor - Record Transformation**
56
+
57
+ - [x] 6.1 Implement `_transform_standard()`: Convert single standard to `PineconeRecord`
58
+ - [x] 6.2 Extract standard set context fields (subject, jurisdiction, document, education_levels)
59
+ - [x] 6.3 Compute all hierarchy fields (is_leaf, is_root, parent_id, root_id, ancestor_ids, child_ids, sibling_count)
60
+ - [x] 6.4 Handle optional fields (omit if missing: statement_label, statement_notation, asn_identifier)
61
+ - [x] 6.5 Implement `process_standard_set()`: Main entry point that processes all standards and returns `ProcessedStandardSet`
62
+ - [x] 6.6 Write integration test with real `data.json` sample file
63
+
64
+ - [x] 7. **Pinecone Processor - File Operations**
65
+
66
+ - [x] 7.1 Implement `process_and_save()`: Load `data.json`, process, save `processed.json`
67
+ - [x] 7.2 Add error handling for missing `data.json` (skip with warning)
68
+ - [x] 7.3 Add error handling for invalid JSON (log error, continue)
69
+ - [x] 7.4 Add logging for processing progress and record counts
70
+ - [x] 7.5 Write integration test for file operations
71
+
72
+ - [x] 8. **Pinecone Client - Core Functions**
73
+
74
+ - [x] 8.1 Create `tools/pinecone_client.py` with `PineconeClient` class
75
+ - [x] 8.2 Implement `__init__()`: Initialize Pinecone SDK from config settings
76
+ - [x] 8.3 Implement `validate_index()`: Check index exists with `pc.has_index()`, raise helpful error if not
77
+ - [x] 8.4 Implement `exponential_backoff_retry()`: Retry on 429/5xx, fail on 4xx
78
+ - [x] 8.5 Implement `batch_upsert()`: Upsert records in batches of 96 with rate limiting
79
+
80
+ - [x] 9. **Pinecone Client - Upload Tracking**
81
+
82
+ - [x] 9.1 Implement `is_uploaded()`: Check for `.pinecone_uploaded` marker file
83
+ - [x] 9.2 Implement `mark_uploaded()`: Create marker file with ISO 8601 timestamp
84
+ - [x] 9.3 Implement `get_upload_timestamp()`: Read timestamp from marker file
85
+ - [x] 9.4 Write unit tests for upload tracking functions
86
+
87
+ - [x] 10. **CLI - Update download-sets Command**
88
+
89
+ - [x] 10.1 Import `pinecone_processor` in `tools/cli.py`
90
+ - [x] 10.2 After successful `download_standard_set()` call, invoke `process_and_save()` for single set download
91
+ - [x] 10.3 After successful bulk download loop, invoke `process_and_save()` for each downloaded set
92
+ - [x] 10.4 Add console output showing processing status
93
+ - [x] 10.5 Update `list` command to show processing status (processed.json exists)
94
+
95
+ - [x] 11. **CLI - Remove Old Process Command**
96
+
97
+ - [x] 11.1 Remove the `process` command function from `tools/cli.py`
98
+ - [x] 11.2 Remove `data_processor` import from `tools/cli.py`
99
+ - [x] 11.3 Update `list` command if it references old processing status
100
+
101
+ - [x] 12. **CLI - Implement pinecone-upload Command**
102
+ - [x] 12.1 Add `pinecone-upload` command with options: `--set-id`, `--all`, `--force`, `--dry-run`, `--batch-size`
103
+ - [x] 12.2 Implement set discovery: Find all standard sets with `processed.json`
104
+ - [x] 12.3 Implement upload filtering: Skip sets with `.pinecone_uploaded` unless `--force`
105
+ - [x] 12.4 Implement `--dry-run`: Show what would be uploaded without uploading
106
+ - [x] 12.5 Implement upload execution: Call `PineconeClient.batch_upsert()` for each set
107
+ - [x] 12.6 Add progress output with record counts
108
+ - [x] 12.7 Handle upload failures: Log error, continue with next set if `--all`, don't create marker
109
+ - [x] 12.8 Write manual test script to verify end-to-end upload flow
.cursor/commands/complete_tasks.md ADDED
@@ -0,0 +1,4 @@
1
+ Please implement the tasks for the following section(s) in the provided task document.
2
+ The provided spec document includes documentation for completing these tasks. Use this to understand the overall context of this unit of work and the overall sprint.
3
+ Complete only the tasks in the indicated section(s). Do not move onto other sections and their tasks.
4
+ When you have completed all the work for the section(s), mark the section(s) and all associated tasks complete. Do this by editing the provided tasks.md and marking the checkboxes with an 'x' (`- [x] Some section or task ...`).
.cursor/commands/create_tasks.md ADDED
@@ -0,0 +1,48 @@
1
+ Generate tasks from a spec document. The tasks should be broken into numbered sections with clear, cohesive, focused units of work. The generated tasks document must fully cover the requirements specified in the spec document.
2
+
3
+ <guidelines>
4
+ - One unit of work per section: Each section delivers a single, clearly defined outcome an LLM can implement end-to-end in one pass.
5
+ - Timebox: Target 30–40 minutes of dev work.
6
+ - Size and focus: 3–7 concrete subtasks per section. Avoid grab-bag sections.
7
+ - Boundaries: Keep sections and tasks focused on a small subset of files, unless edits are simple and tightly related.
8
+ </guidelines>
9
+
10
+ <file_template>
11
+ <header>
12
+ # Spec Tasks
13
+ </header>
14
+ </file_template>
15
+
16
+ <task_structure>
17
+ <major_tasks>
18
+ - count: 1-12
19
+ - format: numbered checklist
20
+ - grouping: by feature or component
21
+ </major_tasks>
22
+ <subtasks>
23
+ - count: up to 8 per major task
24
+ - format: decimal notation (1.1, 1.2)
25
+ </subtasks>
26
+ </task_structure>
27
+
28
+ <task_template>
29
+ ## Tasks
30
+
31
+ - [ ] 1. [MAJOR_TASK_DESCRIPTION]
32
+ - [ ] 1.1 [IMPLEMENTATION_STEP]
33
+ - [ ] 1.2 [IMPLEMENTATION_STEP]
34
+ - [ ] 1.3 [IMPLEMENTATION_STEP]
35
+
36
+ - [ ] 2. [MAJOR_TASK_DESCRIPTION]
37
+ - [ ] 2.1 [IMPLEMENTATION_STEP]
38
+ - [ ] 2.2 [IMPLEMENTATION_STEP]
39
+ </task_template>
40
+
41
+ <ordering_principles>
42
+ - Consider technical dependencies
43
+ - Follow TDD approach
44
+ - Group related functionality
45
+ - Build incrementally
46
+ </ordering_principles>
47
+
.cursor/commands/finalize_spec.md ADDED
@@ -0,0 +1,21 @@
1
+ Convert the spec_draft document into a final draft in the spec.md file.
2
+
3
+ The spec acts as the comprehensive source of truth for this sprint and should include all the necessary context and technical details to implement this sprint. It should leave no ambiguity for important details necessary to properly implement the changes required.
4
+
5
+ The spec.md will act as a reference for an LLM coding agent responsible for completing this sprint.
6
+
7
+ The spec should include the following information if applicable:
8
+ - An overview of the changes implemented in this sprint.
9
+ - User stories for the new functionality, if applicable.
10
+ - An outline of any new data models proposed.
11
+ - Any other technical details determined in the spec_draft or related conversations.
12
+ - Specific filepaths for any files that need to be added, edited, or deleted as part of this sprint.
13
+ - Specific files or modules relevant to this sprint.
14
+ - Details on how things should function, such as a function, workflow, or other process.
15
+ - Describe what any new functions, services, etc. are supposed to do.
16
+ - Any reasoning or rationale behind decisions, preferences, or changes that provides context for the sprint and its changes.
17
+ - Any other information required to properly understand this sprint, the desired changes, the expected deliverables, or important technical details.
18
+
19
+ Strive to retain all the final decisions and implementation details provided in the spec_draft and related conversations. Cleaning and organizing these raw notes is desirable, but do not exclude or leave out information provided in the spec_draft if it is relevant to this sprint. If there is information in the spec_draft that is outdated and negated or revised by further direction in the draft or related conversation, you should leave that stale information out of the final spec.
20
+
21
+ The spec should have all the information a junior developer needs to complete this sprint. They should be able to independently find answers to any questions they have about this sprint and how to implement it in this document.
.cursor/commands/spec_draft.md ADDED
@@ -0,0 +1,122 @@
1
+ I am working on developing a comprehensive spec document for the next development sprint.
2
+
3
+ <goal>
4
+ Solidify the current spec_draft document into a comprehensive specification for the next development sprint through iterative refinement.
5
+
6
+ The spec draft represents the rough notes and ideas for the next sprint. These notes are likely incomplete and require additional details and decisions to obtain sufficient information to move forward with the sprint.
7
+
8
+ READ: @.cursor/commands/finalize_spec.md to see the complete requirements for the finalized spec. The goal is to reach the level of specificity and clarity required to create this final spec.
9
+ </goal>
10
+
11
+ <process>
12
+ <overview>
13
+ Iteratively carry out the following steps to progressively refine the requirements for this sprint. Use `Requests for Input` only to gather information that cannot be inferred from the user's selection of a Recommendation; do not ask to confirm details already specified by a selected option. The initial `spec_draft` may be a loose assortment of notes, ideas, and thoughts; treat it accordingly in the first round.
14
+
15
+ First round: produce a response that includes Recommendations and Requests for Input. The user will reply by selecting exactly one option per Recommendation (or asking for refinement if none fit) and answering only those questions that cannot be inferred from selected options.
16
+
17
+ After each user response: update the `spec_draft` to incorporate the selected options with minimal, focused edits. Remove any conflicting or superseded information made obsolete by the selection. Avoid unrelated formatting or editorial changes.
18
+
19
+ Repeat this back-and-forth until ambiguity is removed and the draft aligns with the requirements in `@.cursor/commands/finalize_spec.md`.
20
+ </overview>
21
+
22
+ <steps>
23
+ - READ the spec_draft.
24
+ - IDENTIFY anything in the spec draft that is confusing, conflicting, unclear, or missing. Identify important decisions that need to be made.
25
+ - REVIEW the current state of the project to fully understand how these new requirements fit into what already exists.
26
+ - RECOMMEND specific additions or updates to the draft spec to resolve confusion, add clarity, fill gaps, or add specificity. Recommendations may provide a single option when appropriate or multiple options when needed. Each Recommendation expects selection of one and only one option by the user.
27
+ - ASK targeted questions to acquire details, decisions, or preferences from the user.
28
+ - APPLY the user's selections: make minimal, localized edits to the `spec_draft` to incorporate the chosen options and remove conflicting content. Incorporate all information contained in the selected options; do not omit details. Do not change unrelated text, structure, or formatting.
29
+ - REFINE: if the user rejects the provided options, revise the Recommendations based on feedback and repeat selection and apply.
30
+ </steps>
31
+
32
+ <end_conditions>
33
+ - Continue this process until the draft is unambiguous and conforms to `@.cursor/commands/finalize_spec.md`, or the user directs you to do otherwise.
34
+ - Do not stop after a single round unless the draft already satisfies all requirements in `@.cursor/commands/finalize_spec.md`.
35
+ </end_conditions>
36
+ </process>
37
+
38
+ <response>
39
+ <overview>
40
+ Your responses should be focused on providing clear, concrete recommendations for content to add to the spec draft to resolve ambiguity, add specificity, and increase clarity for the sprint. The options you provide in your recommendations should provide complete content that can be incorporated into the spec draft. For each Recommendation, expect the user to select exactly one option; Recommendations may include a single option when appropriate. If no option fits, the user may request refinement. If you do not have sufficient understanding of the user's intent or the meaning of some element of the spec draft, use `Request for Input` sections to ask targeted questions of the user. Only ask for information that cannot be inferred from the user's selection of a Recommendation. Do not ask to confirm details already encoded in an option (e.g., if Option 1.1 specifies renaming a file to `foo.py`, do not ask to confirm that rename).
41
+
42
+ Using incrementing section numbers is essential for helping the user quickly reference specific options or questions in their responses.
43
+ Responses must strictly follow the Format section. Include only the specified sections and no additional commentary or subsections.
44
+ The agent is responsible for updating the spec draft after each user response.
45
+ </overview>
46
+
47
+ <guidelines>
48
+ - Break recommendations and requests for input into related sections to provide concrete options or ask targeted questions to the user.
49
+ - Focus sections on a specific, concrete decision or unit of work related to the sprint outlined in the spec draft.
50
+ - Recommendations may provide one or more options; when multiple options are presented, the user must select exactly one.
51
+ - `Requests for Input` may include one or more questions, but only for details that cannot be derived from the selected option(s).
52
+ - Do not ask confirmation questions about facts stated by options; assume the selected option is authoritative.
53
+ - Use numbered sections that increment.
54
+ - Use incrementing decimals for recommendation options and request for input questions.
55
+ - After the user selects options, apply minimal, focused edits to the `spec_draft` reflecting only those selections. Remove conflicting or superseded content. Avoid broad formatting or editorial changes to unrelated content.
56
+ - Do not clutter options or questions with information already clear and unambiguous from the current draft.
57
+ - Do not add subsections beyond those defined in the Format.
58
+ </guidelines>
59
+
60
+ <format>
61
+
62
+ # Recommendations
63
+ ## 1: Section Title
64
+ Short overview providing background on the section.
65
+
66
+ **Option 1.1**
67
+ Specifics of the first option.
68
+
69
+ **Option 1.2**
70
+ Specifics of the second option.
71
+
72
+ ## 2: Section Title
73
+ Short overview providing background on the section.
74
+
75
+ **Option 2.1**
76
+ Specifics of the first option.
77
+
78
+ # Request for Input
79
+ ## 3: Section Title
80
+ Short overview providing background on the section.
81
+
82
+ **Questions**
83
+ - 3.1 Some question.
84
+ - 3.2 Another question.
85
+
86
+ </format>
87
+ <user_selection_format>
88
+ Respond by indicating a single selection per Recommendation, e.g.: `Select 1.2, 2.1`. If no option fits, reply with `Refine 1:` followed by feedback to guide revised options. You may also answer targeted questions under `Request for Input` inline.
89
+
90
+ Example mixed selections and answers:
91
+
92
+ ```text
93
+ 1.1 OK
94
+ 2: Clarifying question from the user?
95
+ 3.1 OK
96
+ 4.1 OK
97
+ 5.1 OK
98
+ 6: Answer to the specific question.
99
+ 7 Directions that indicate the user's preference in response to the question.
100
+ 8 Clear directive in response to the question.
101
+ ```
102
+ </user_selection_format>
103
+
104
+ <selection_and_editing_rules>
105
+ - One and only one option must be selected per Recommendation. If none fit, request refinement.
106
+ - Apply edits narrowly: change only text directly impacted by the chosen option(s).
107
+ - Incorporate all information from the selected options into the draft.
108
+ - Remove or rewrite conflicting statements made obsolete by the selection.
109
+ - Preserve unrelated content and overall formatting; do not perform wide editorial passes.
110
+ </selection_and_editing_rules>
111
+ </response>
112
+
113
+ <guardrails>
114
+ - Only edit the draft to apply selected options and answers. Do not edit code or any other files.
115
+ </guardrails>
116
+
117
+ <finalize_spec_compliance_checklist>
118
+ - [ ] All information required by @.cursor/commands/finalize_spec.md is present.
119
+ - [ ] Requirements are testable and unambiguous.
120
+ - [ ] Risks, dependencies, and assumptions captured.
121
+ - [ ] Approval received.
122
+ </finalize_spec_compliance_checklist>
.cursor/rules/standards/best_practices.mdc ADDED
@@ -0,0 +1,37 @@
1
+ ---
2
+ alwaysApply: true
3
+ ---
4
+ # Development Best Practices
5
+
6
+ ## Core Principles
7
+
8
+ ### Keep It Simple
9
+ - Implement code in the fewest lines possible
10
+ - Avoid over-engineering solutions
11
+ - Choose straightforward approaches over clever ones
12
+
13
+ ### Optimize for Readability
14
+ - Prioritize code clarity over micro-optimizations
15
+ - Write self-documenting code with clear variable names
16
+ - Add comments for "why" not "what"
17
+
18
+ ### DRY (Don't Repeat Yourself)
19
+ - Extract repeated business logic to private methods
20
+ - Extract repeated UI markup to reusable components
21
+ - Create utility functions for common operations
22
+
23
+ ### File Structure
24
+ - Keep files focused on a single responsibility
25
+ - Group related functionality together
26
+ - Use consistent naming conventions
27
+
28
+ ## Dependencies
29
+
30
+ ### Choose Libraries Wisely
31
+ When adding third-party dependencies:
32
+ - Select the most popular and actively maintained option
33
+ - Check the library's GitHub repository for:
34
+ - Recent commits (within last 6 months)
35
+ - Active issue resolution
36
+ - Number of stars/downloads
37
+ - Clear documentation
.cursor/rules/standards/cli/overview.mdc ADDED
@@ -0,0 +1,54 @@
1
+ ---
2
+ description: Defines rules for defining CLI functionality.
3
+ alwaysApply: false
4
+ ---
5
+
6
+ # CLI Guidance & Best Practices
7
+
8
+ - The main entrypoint for using the CLI in this project is `cli.py`; the sole responsibility of this file is to import the CLI app and expose it for execution. NO OTHER CODE SHOULD EXIST IN THIS FILE!
9
+ - The root-level app for defining the main CLI and its sub-commands is `src/cli.py`. This file should import sub-sections of the CLI from other modules as well as provide easy-access, root-level commands for the most common workflows.
10
+ - Sub-modules that expose CLI functionality should do so by defining their own `cli.py` or `cli_{subgroup}.py` file(s).
11
+ - If a single module has multiple natural groupings of CLI commands, create multiple `cli_{subgroup}.py` files rather than one complex `cli.py` of unrelated commands; this keeps related content organized and easy to find.
12
+ - Module-specific `cli.py` or `cli_{subgroup}.py` files should export a typer app with a collection of sub-commands. Import and attach these groups of commands to the main cli in `src/cli.py`.
13
+ - Place expensive imports at the top of each typer command's body rather than at module level, so they are not executed before the CLI starts.
14
+
15
+ # Example Template
16
+ ```python
17
+ # Example root-level src/cli.py
18
+ from __future__ import annotations
19
+
20
+ import typer
21
+
22
+ # Import groups of commands from sub-modules exported as typer apps.
23
+ from src.first_mod.cli import first_mod_app
24
+ from src.second_mod.cli import second_mod_app
25
+
26
+ # Define the root-level app.
27
+ app = typer.Typer()
28
+
29
+
30
+ # Attach the sub-commands.
31
+ app.add_typer(first_mod_app, name="first_mod")
32
+ app.add_typer(second_mod_app, name="second_mod")
33
+
34
+
35
+ # Add root-level commands.
36
+ @app.command()
37
+ def my_command(
38
+ flag: bool = typer.Option(False, "--flag", help="Flag for some option."),
39
+ ) -> None:
40
+ """Root level command."""
41
+ # Print statement to provide rapid feedback to the user, before expensive imports.
42
+ print("Starting my command.")
43
+
44
+ # Import command-level modules inside the command to avoid expensive imports before executing commands.
45
+ from src.my_mod import my_function_with_expensive_imports
46
+
47
+ for i in range(10):
48
+ my_function_with_expensive_imports(toggle_arg=flag)
49
+
50
+
51
+ if __name__ == "__main__":
52
+ app()
53
+
54
+ ```
.cursor/rules/standards/code_style/emojis.mdc ADDED
@@ -0,0 +1,11 @@
1
+ ---
2
+ description: Guidelines for the use of emojis.
3
+ alwaysApply: false
4
+ ---
5
+
6
+ # Emoji Usage
7
+
8
+ - **NEVER use emojis** in code, user-facing messages, comments, or UI elements
9
+ - **ALWAYS use professional icon sets** appropriate to the technology and context (e.g., Material icons `:material/icon_name:` in Streamlit)
10
+ - If no appropriate icon exists in the available icon set, omit the icon rather than using an emoji
11
+ - Keep user-facing messages clear and professional without decorative elements
.cursor/rules/standards/code_style/pydantic.mdc ADDED
@@ -0,0 +1,72 @@
1
+ ---
2
+ description: Rules for implementing Pydantic models in Python.
3
+ alwaysApply: false
4
+ ---
5
+
6
+ ## Type Safety and Data Validation
7
+
8
+ - **Always use Pydantic models for structured data**: Wrap JSON, dictionaries, or untyped data in Pydantic models. Never pass raw dictionaries or JSON strings when a typed model can be used.
9
+ - **Use the most specific model type**: Use the most specific Pydantic model type in function signatures. Only use union types or `BaseModel` when the function truly handles multiple types. Example: prefer `RoleOverviewUpdate` over `ProposalContent` if only handling role updates.
10
+ - **Single source of truth**: Define Pydantic models once (typically in database layer for persisted data) and import elsewhere. Avoid duplicating model definitions.
11
+ - **Database boundary pattern**: Serialization happens at the database boundary. Database models provide `parse_*`/`serialize_*` methods. Application code works with typed models, not JSON strings or dictionaries.
12
+
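+ A minimal sketch of the database boundary pattern (the model and method names here are illustrative, not an existing project API):
+
+ ```python
+ from __future__ import annotations
+
+ from pydantic import BaseModel
+
+
+ class UserProfile(BaseModel):
+     """Typed model used throughout application code."""
+
+     name: str
+     interests: list[str] = []
+
+
+ class UserProfileRow(BaseModel):
+     """Hypothetical database-layer model that owns (de)serialization."""
+
+     id: int
+     profile_json: str
+
+     def parse_profile(self) -> UserProfile:
+         # The database boundary is the only place JSON strings exist.
+         return UserProfile.model_validate_json(self.profile_json)
+
+     @classmethod
+     def serialize_profile(cls, row_id: int, profile: UserProfile) -> UserProfileRow:
+         return cls(id=row_id, profile_json=profile.model_dump_json())
+
+
+ # Application code only ever touches the typed model.
+ row = UserProfileRow.serialize_profile(1, UserProfile(name="Ada", interests=["math"]))
+ assert row.parse_profile().name == "Ada"
+ ```
+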
13
+ ## Model Definition
14
+
15
+ - Prefer the simpler syntax for default values in Pydantic models, over the more verbose `Field` notation whenever possible.
16
+
17
+ ```python
18
+ from pydantic import BaseModel
19
+
20
+
21
+ class Model(BaseModel):
22
+ # Use simple syntax for basic, mutable values.
23
+ # Pydantic creates a deep copy so this is safe.
24
+ item_counts: list[dict[str, int]] = [{}]
25
+ # Use simple `= {default}` syntax for basic values too.
26
+ some_number: int = 42
27
+ ```
28
+
29
+ - Prefer the use of `Annotated` to attach additional metadata to models.
30
+
31
+ ```python
32
+ from typing import Annotated
33
+
34
+ from pydantic import BaseModel, Field, WithJsonSchema
35
+
36
+
37
+ class Model(BaseModel):
38
+ name: Annotated[str, Field(strict=True), WithJsonSchema({'extra': 'data'})]
39
+ ```
40
+
41
+ - However, note that certain arguments to the `Field()` function (namely `default`, `default_factory`, and `alias`) are taken into account by static type checkers to synthesize a correct `__init__` method. The `Annotated` pattern is not understood by them, so use the normal assignment form for those arguments instead.
42
+
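+ For example, a small sketch of that distinction: keep `default` and `alias` in the assignment form so type checkers see them, while other metadata can stay in `Annotated`.
+
+ ```python
+ from typing import Annotated
+
+ from pydantic import BaseModel, Field
+
+
+ class Item(BaseModel):
+     # Assignment form: `default` and `alias` are understood by static type checkers.
+     quantity: int = Field(default=1, alias="qty")
+     # Annotated form: fine for constraints and other metadata without defaults.
+     name: Annotated[str, Field(min_length=1)]
+ ```
+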
43
+ - Use a discriminator in scenarios with mixed types where there is a need to differentiate between the types.
44
+
45
+ ```python
46
+ # Example of using a discriminator to identify specific models in a mixed-type situation.
47
+ from typing import Annotated, Literal
48
+
49
+ from pydantic import BaseModel, Discriminator, Field, Tag
50
+
51
+
52
+ class Cat(BaseModel):
53
+ pet_type: Literal['cat']
54
+ age: int
55
+
56
+
57
+ class Dog(BaseModel):
58
+ pet_kind: Literal['dog']
59
+ age: int
60
+
61
+
62
+ def pet_discriminator(v):
63
+ if isinstance(v, dict):
64
+ return v.get('pet_type', v.get('pet_kind'))
65
+ return getattr(v, 'pet_type', getattr(v, 'pet_kind', None))
66
+
67
+
68
+ class Model(BaseModel):
69
+ pet: Annotated[Cat, Tag('cat')] | Annotated[Dog, Tag('dog')] = Field(
70
+ discriminator=Discriminator(pet_discriminator)
71
+ )
72
+ ```
.cursor/rules/standards/code_style/python.mdc ADDED
@@ -0,0 +1,71 @@
1
+ ---
2
+ globs: **/*.py
3
+ alwaysApply: false
4
+ ---
5
+
6
+ # General
7
+
8
+ - This project uses Python 3.12. Follow best practices and coding standards for this version.
9
+ - This project is in a Python virtual environment. Remember to activate the environment in `.venv` before executing commands that rely on Python.
10
+ - This project uses pip and pyproject.toml. All dependencies should be declared in pyproject.toml.
11
+
12
+ ## Type Annotations
13
+
14
+ - **ALWAYS** include full type annotations for all function parameters and return types
15
+ - **EXCEPTION**: Test functions do not need return type annotations (they can use `-> None` or omit entirely)
16
+ - Use typing standards for Python 3.12+, such as using `list` instead of `List` and `str | None` instead of `Optional[str]`.
17
+ - Use `from __future__ import annotations` at the top of files for forward references
18
+ - Whenever an if/elif chain handles an enum or a set of literals where every value must be handled, use `assert_never` in the final `else` branch to catch missed values (see the sketch after this list).
19
+ - Always use the new syntax for defining generics rather than using TypeVars and Generic.
20
+ - **Prefer typed models over dictionaries**: When functions accept or return structured data, use Pydantic models instead of `dict` types. See `code_style/pydantic.mdc` for details.
21
+ - **Use the most specific type possible**: Use the most specific type that matches what the function handles. Only use generic types (`dict`, `BaseModel`, union types) when the function truly needs to handle multiple types. Example: prefer `AchievementAdd` over `ProposalContent` if the function only handles achievement additions.
22
+
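+ A small sketch of both points: `typing.assert_never` for exhaustive handling and the Python 3.12 type-parameter syntax instead of `TypeVar`/`Generic`.
+
+ ```python
+ from __future__ import annotations
+
+ from enum import Enum
+ from typing import assert_never
+
+
+ class Color(Enum):
+     RED = "red"
+     BLUE = "blue"
+
+
+ def describe(color: Color) -> str:
+     if color is Color.RED:
+         return "warm"
+     elif color is Color.BLUE:
+         return "cool"
+     else:
+         # Type checkers flag this branch (and it raises at runtime) if a new Color is missed.
+         assert_never(color)
+
+
+ # New generic syntax (Python 3.12) instead of TypeVar and Generic.
+ class Stack[T]:
+     def __init__(self) -> None:
+         self._items: list[T] = []
+
+     def push(self, item: T) -> None:
+         self._items.append(item)
+
+     def pop(self) -> T:
+         return self._items.pop()
+ ```
+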
23
+ ## Docstrings
24
+
25
+ - Use Google-style docstrings for all public functions and classes
26
+ - Include type information in docstrings for complex parameters
27
+ - Keep docstrings concise but informative
28
+
29
+ ## Code Structure
30
+
31
+ - **Main function first**: Place the primary/main function(s) at the top of the file. Supporting functions, constants, and setup code go below. This ensures readers see the most important content immediately.
32
+ - Keep functions under 50 lines when possible
33
+ - Use early returns to reduce nesting
34
+ - Prefer list comprehensions over explicit loops for simple transformations
35
+ - Use type guards for complex conditional logic
36
+
37
+ ## String Formatting
38
+
39
+ - **ALWAYS use f-strings** for string formatting and interpolation
40
+ - **NEVER use percent-style formatting** (`%s`, `%d`, `%f`, etc.)
41
+ - Use f-strings for all string concatenation and formatting needs
42
+
43
+ ## Configuration and Constants
44
+
45
+ - Define constants at module level
46
+ - Use `typing.Final` for constants that shouldn't be modified
47
+ - Group related constants together
48
+ - Use PydanticSettings for overall settings, including ones read from .env.
49
+
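+ A minimal sketch of a PydanticSettings class (field names mirror `.env.example`; the class itself is illustrative, not the project's actual settings module):
+
+ ```python
+ from typing import Final
+
+ from pydantic_settings import BaseSettings, SettingsConfigDict
+
+ DEFAULT_NAMESPACE: Final[str] = "standards"
+
+
+ class Settings(BaseSettings):
+     """Application settings read from the environment and .env."""
+
+     model_config = SettingsConfigDict(env_file=".env", extra="ignore")
+
+     csp_api_key: str = ""
+     pinecone_api_key: str = ""
+     pinecone_index_name: str = "common-core-standards"
+     pinecone_namespace: str = DEFAULT_NAMESPACE
+
+
+ settings = Settings()
+ ```
+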
50
+ ## Logging
51
+
52
+ - Import the logger from `logger_config` via `from src.logger_config import logger`.
53
+ - Use structured logging with appropriate log levels
54
+ - Include context in log messages
55
+ - Use loguru for consistent logging across the project
56
+ - Make sure to use `logger.exception` to log all errors inside of except blocks.
57
+ - Outside of except blocks, use `logger.opt(exception=err).error(msg)` so the exception and stack trace are included (see the sketch below).
58
+
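+ A short sketch of this pattern (the config-loading function is illustrative; in project code the logger import comes from `src.logger_config`):
+
+ ```python
+ from __future__ import annotations
+
+ import json
+ from pathlib import Path
+
+ from loguru import logger  # in project code: from src.logger_config import logger
+
+
+ def load_config(path: Path) -> dict[str, str]:
+     """Load a JSON config file, logging failures with full context."""
+     logger.info(f"Loading config from {path}")
+     try:
+         return json.loads(path.read_text(encoding="utf-8"))
+     except (OSError, json.JSONDecodeError):
+         # Inside an except block: logger.exception records the stack trace.
+         logger.exception(f"Failed to load config from {path}")
+         return {}
+
+
+ def report_failure(err: Exception) -> None:
+     # Outside an except block: attach the exception explicitly.
+     logger.opt(exception=err).error("Operation failed")
+ ```
+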
59
+ ## Performance Considerations
60
+
61
+ - Use `pathlib.Path` instead of string paths
62
+ - Prefer `list` comprehensions over `map()` for readability
63
+ - Use `collections.defaultdict` when appropriate
64
+ - Use Pydantic for simple data containers.
65
+
66
+ ## Security
67
+
68
+ - Never log sensitive information (passwords, API keys, etc.)
69
+ - Use environment variables for configuration
70
+ - Validate all user inputs
71
+ - Use parameterized queries for database operations
.cursor/rules/standards/testing.mdc ADDED
@@ -0,0 +1,12 @@
1
+ ---
2
+ globs: tests/**/test_*.py
3
+ alwaysApply: false
4
+ ---
5
+
6
+ - Don't import `pytest` unless you use it directly.
7
+ - Test functions should be descriptive and test one specific behavior
8
+ - Use pytest fixtures for common setup
9
+ - Mock external dependencies
10
+ - Use parametrized tests for multiple input scenarios (see the sketch below)
11
+ - Test functions do not need to provide a return type annotation.
12
+ - Test fixtures and test arguments should use type annotations.
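+
+ A small sketch of a parametrized test with a typed fixture (the helper under test is hypothetical):
+
+ ```python
+ from __future__ import annotations
+
+ import pytest
+
+
+ def normalize_grade(value: str) -> str:
+     """Hypothetical helper under test: pad single-digit grades with a leading zero."""
+     return value if value == "K" else value.zfill(2)
+
+
+ @pytest.fixture
+ def kindergarten_grade() -> str:
+     return "K"
+
+
+ @pytest.mark.parametrize(
+     ("raw", "expected"),
+     [
+         ("1", "01"),
+         ("10", "10"),
+         ("K", "K"),
+     ],
+ )
+ def test_normalize_grade_pads_single_digits(raw: str, expected: str):
+     assert normalize_grade(raw) == expected
+
+
+ def test_normalize_grade_keeps_kindergarten(kindergarten_grade: str):
+     assert normalize_grade(kindergarten_grade) == "K"
+ ```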
.cursor/rules/standards/ux/deletions.mdc ADDED
@@ -0,0 +1,19 @@
1
+ ---
2
+ description: UX patterns for destructive actions (deletions)
3
+ alwaysApply: false
4
+ ---
5
+
6
+ ### Deletion Confirmations
7
+
8
+ Use confirmation dialogs before deleting significant objects that are difficult or impossible to undo. Confirmation dialogs should:
9
+
10
+ - Display the item name/title being deleted
11
+ - Clearly state the action cannot be undone
12
+ - Require explicit "Delete" or "Confirm" action
13
+ - Use destructive button styling (e.g., `variant="destructive"`)
14
+
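+ A minimal CLI-side sketch of this checklist using `typer.confirm` (the function and item names are illustrative; in a UI the same information would go into a dialog component):
+
+ ```python
+ from __future__ import annotations
+
+ import typer
+
+
+ def delete_standard_set(title: str) -> None:
+     """Delete a standard set only after an explicit, informative confirmation."""
+     # Show the item name and state that the action cannot be undone.
+     confirmed = typer.confirm(f"Delete standard set '{title}'? This action cannot be undone.")
+     if not confirmed:
+         typer.echo("Deletion cancelled.")
+         raise typer.Exit()
+     # ... perform the actual deletion here ...
+     typer.echo(f"Deleted '{title}'.")
+ ```
+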
15
+ **When confirmation is not necessary:**
16
+
17
+ - Deleting easily recoverable items
18
+ - Operations that can be easily undone
19
+ - Temporary or draft content with minimal user investment
.cursor/rules/standards/workflows.mdc ADDED
@@ -0,0 +1,22 @@
1
+ ---
2
+ description: Rules for implementing workflow files with LLM chains and structured outputs.
3
+ alwaysApply: false
4
+ ---
5
+
6
+ # Workflow File Structure
7
+
8
+ Files containing simple LLM interactions with inline prompts and structured outputs follow a specific structure.
9
+
10
+ ## File Layout Order
11
+
12
+ 1. **Main workflow function(s)** - Primary public function(s)
13
+ 2. **Helper functions** - Private utility functions
14
+ 3. **Prompt strings** - System and user prompts (`_SYSTEM_PROMPT`, `_USER_PROMPT`)
15
+ 4. **Pydantic models** - Structured output schema models
16
+ 5. **Model and chain setup** - LLM initialization, structured output config, prompt template, chain
17
+
18
+ ## Key Principles
19
+
20
+ - Use `from __future__ import annotations` to enable forward references
21
+ - Main function references types/variables defined later via forward references
22
+ - Use clear section dividers to separate logical sections
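+
+ A condensed skeleton of this layout (prompts, the output model, and the `_run_chain` stub are placeholders, not real project or library APIs):
+
+ ```python
+ from __future__ import annotations
+
+ from pydantic import BaseModel
+
+
+ # 1. Main workflow function(s): reference names defined later via forward references.
+ def summarize_topic(topic: str) -> TopicSummary:
+     """Primary entry point for this workflow."""
+     return _run_chain(topic)
+
+
+ # 2. Helper functions.
+ def _run_chain(topic: str) -> TopicSummary:
+     # Placeholder: a real implementation would send _SYSTEM_PROMPT and the formatted
+     # _USER_PROMPT through the configured LLM chain and parse the structured output.
+     user_prompt = _USER_PROMPT.format(topic=topic)
+     return TopicSummary(topic=topic, summary=f"Stub summary for: {user_prompt}")
+
+
+ # 3. Prompt strings.
+ _SYSTEM_PROMPT = "You are a concise summarizer."
+ _USER_PROMPT = "Summarize the topic: {topic}"
+
+
+ # 4. Pydantic models for structured output.
+ class TopicSummary(BaseModel):
+     topic: str
+     summary: str
+
+
+ # 5. Model and chain setup (LLM initialization, prompt template, chain) would follow here.
+ ```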
.env.example ADDED
@@ -0,0 +1,7 @@
1
+ # Common Standards Project API Configuration
2
+ CSP_API_KEY=your_generated_api_key_here
3
+
4
+ # Pinecone Configuration
5
+ PINECONE_API_KEY=your_pinecone_api_key_here
6
+ PINECONE_INDEX_NAME=common-core-standards
7
+ PINECONE_NAMESPACE=standards
.gitignore ADDED
@@ -0,0 +1,56 @@
1
+ # Environment variables
2
+ .env
3
+
4
+ # Python cache
5
+ __pycache__/
6
+ .mypy_cache/
7
+ *.py[cod]
8
+ *$py.class
9
+ *.so
10
+ .Python
11
+
12
+ # Distribution / packaging
13
+ build/
14
+ develop-eggs/
15
+ dist/
16
+ downloads/
17
+ eggs/
18
+ .eggs/
19
+ lib/
20
+ lib64/
21
+ parts/
22
+ sdist/
23
+ var/
24
+ wheels/
25
+ *.egg-info/
26
+ .installed.cfg
27
+ *.egg
28
+
29
+ # Virtual environments
30
+ .venv/
31
+ venv/
32
+ ENV/
33
+ env/
34
+
35
+ # IDEs
36
+ .vscode/
37
+ .idea/
38
+ *.swp
39
+ *.swo
40
+ *~
41
+
42
+ # Testing
43
+ .pytest_cache/
44
+ .coverage
45
+ htmlcov/
46
+
47
+ # OS
48
+ .DS_Store
49
+ Thumbs.db
50
+
51
+ # Raw data (local only)
52
+ data/raw/
53
+
54
+ # CLI logs
55
+ data/cli.log
56
+
data/.gitkeep ADDED
@@ -0,0 +1,2 @@
1
+ # This directory will contain generated data files
2
+
pyproject.toml ADDED
@@ -0,0 +1,23 @@
1
+ [project]
2
+ name = "common-core-mcp"
3
+ version = "0.1.0"
4
+ requires-python = ">=3.12"
5
+ dependencies = [
6
+ "mcp",
7
+ "gradio>=5.0.0,<6.0.0",
8
+ "pinecone",
9
+ "python-dotenv",
10
+ "typer",
11
+ "requests",
12
+ "rich",
13
+ "loguru",
14
+ "pydantic>=2.0.0",
15
+ "pydantic-settings>=2.0.0",
16
+ ]
17
+
18
+ [project.optional-dependencies]
19
+ dev = ["pytest>=8.0.0", "pytest-asyncio>=0.23.0"]
20
+
21
+ [build-system]
22
+ requires = ["setuptools>=61.0"]
23
+ build-backend = "setuptools.build_meta"
src/__init__.py ADDED
@@ -0,0 +1,2 @@
1
+ """Common Core MCP - Educational Standards Search."""
2
+
tests/__init__.py ADDED
@@ -0,0 +1,2 @@
1
+ """Test suite for Common Core MCP."""
2
+
tests/test_pinecone_client.py ADDED
@@ -0,0 +1,318 @@
1
+ """Unit tests for Pinecone client."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import tempfile
6
+ from datetime import datetime, timezone
7
+ from pathlib import Path
8
+ from unittest.mock import MagicMock, patch
9
+
10
+ import pytest
11
+ from pinecone.exceptions import PineconeException
12
+
13
+ from tools.pinecone_client import PineconeClient
14
+ from tools.pinecone_models import PineconeRecord
15
+
16
+
17
+ class TestUploadTracking:
18
+ """Tests for upload tracking marker file operations."""
19
+
20
+ def test_is_uploaded_returns_false_when_marker_missing(self):
21
+ """is_uploaded() returns False when marker file doesn't exist."""
22
+ with tempfile.TemporaryDirectory() as tmpdir:
23
+ set_dir = Path(tmpdir)
24
+ assert PineconeClient.is_uploaded(set_dir) is False
25
+
26
+ def test_is_uploaded_returns_true_when_marker_exists(self):
27
+ """is_uploaded() returns True when marker file exists."""
28
+ with tempfile.TemporaryDirectory() as tmpdir:
29
+ set_dir = Path(tmpdir)
30
+ marker_file = set_dir / ".pinecone_uploaded"
31
+ marker_file.write_text("2025-01-15T14:30:00Z")
32
+ assert PineconeClient.is_uploaded(set_dir) is True
33
+
34
+ def test_mark_uploaded_creates_marker_file(self):
35
+ """mark_uploaded() creates marker file with ISO 8601 timestamp."""
36
+ with tempfile.TemporaryDirectory() as tmpdir:
37
+ set_dir = Path(tmpdir)
38
+ marker_file = set_dir / ".pinecone_uploaded"
39
+
40
+ assert not marker_file.exists()
41
+ PineconeClient.mark_uploaded(set_dir)
42
+ assert marker_file.exists()
43
+
44
+ # Verify timestamp format
45
+ timestamp = marker_file.read_text(encoding="utf-8").strip()
46
+ # Should be valid ISO 8601 format
47
+ datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
48
+
49
+ def test_mark_uploaded_writes_utc_timestamp(self):
50
+ """mark_uploaded() writes UTC timestamp in ISO 8601 format."""
51
+ with tempfile.TemporaryDirectory() as tmpdir:
52
+ set_dir = Path(tmpdir)
53
+ PineconeClient.mark_uploaded(set_dir)
54
+
55
+ marker_file = set_dir / ".pinecone_uploaded"
56
+ timestamp_str = marker_file.read_text(encoding="utf-8").strip()
57
+
58
+ # Parse and verify it's UTC
59
+ if timestamp_str.endswith("Z"):
60
+ timestamp_str = timestamp_str[:-1] + "+00:00"
61
+ parsed = datetime.fromisoformat(timestamp_str)
62
+ assert parsed.tzinfo == timezone.utc
63
+
64
+ def test_get_upload_timestamp_returns_none_when_marker_missing(self):
65
+ """get_upload_timestamp() returns None when marker file doesn't exist."""
66
+ with tempfile.TemporaryDirectory() as tmpdir:
67
+ set_dir = Path(tmpdir)
68
+ assert PineconeClient.get_upload_timestamp(set_dir) is None
69
+
70
+ def test_get_upload_timestamp_returns_timestamp_when_marker_exists(self):
71
+ """get_upload_timestamp() returns timestamp string when marker exists."""
72
+ with tempfile.TemporaryDirectory() as tmpdir:
73
+ set_dir = Path(tmpdir)
74
+ expected_timestamp = "2025-01-15T14:30:00Z"
75
+ marker_file = set_dir / ".pinecone_uploaded"
76
+ marker_file.write_text(expected_timestamp)
77
+
78
+ result = PineconeClient.get_upload_timestamp(set_dir)
79
+ assert result == expected_timestamp
80
+
81
+ def test_get_upload_timestamp_handles_read_error(self):
82
+ """get_upload_timestamp() returns None if marker file can't be read."""
83
+ with tempfile.TemporaryDirectory() as tmpdir:
84
+ set_dir = Path(tmpdir)
85
+ marker_file = set_dir / ".pinecone_uploaded"
86
+ marker_file.write_text("test")
87
+
88
+ # Make file unreadable (on Unix systems)
89
+ marker_file.chmod(0o000)
90
+
91
+ try:
92
+ result = PineconeClient.get_upload_timestamp(set_dir)
93
+ # Should return None or handle gracefully
94
+ assert result is None or isinstance(result, str)
95
+ finally:
96
+ # Restore permissions for cleanup
97
+ marker_file.chmod(0o644)
98
+
99
+
100
+ class TestPineconeClientCore:
101
+ """Tests for core Pinecone client functionality."""
102
+
103
+ @patch("tools.pinecone_client.Pinecone")
104
+ @patch("tools.pinecone_client.get_settings")
105
+ def test_init_raises_error_when_api_key_missing(self, mock_get_settings, mock_pinecone):
106
+ """__init__() raises ValueError when API key is not set."""
107
+ mock_settings = MagicMock()
108
+ mock_settings.pinecone_api_key = ""
109
+ mock_get_settings.return_value = mock_settings
110
+
111
+ with pytest.raises(ValueError, match="PINECONE_API_KEY"):
112
+ PineconeClient()
113
+
114
+ @patch("tools.pinecone_client.Pinecone")
115
+ @patch("tools.pinecone_client.get_settings")
116
+ def test_init_initializes_pinecone_client(self, mock_get_settings, mock_pinecone):
117
+ """__init__() initializes Pinecone SDK with API key."""
118
+ mock_settings = MagicMock()
119
+ mock_settings.pinecone_api_key = "test-api-key"
120
+ mock_settings.pinecone_index_name = "test-index"
121
+ mock_settings.pinecone_namespace = "test-namespace"
122
+ mock_get_settings.return_value = mock_settings
123
+
124
+ mock_pc = MagicMock()
125
+ mock_pc.Index.return_value = MagicMock()
126
+ mock_pc.has_index.return_value = True
127
+ mock_pinecone_class = MagicMock(return_value=mock_pc)
128
+ with patch("tools.pinecone_client.Pinecone", mock_pinecone_class):
129
+ client = PineconeClient()
130
+
131
+ assert client.pc == mock_pc
132
+ assert client.index_name == "test-index"
133
+ assert client.namespace == "test-namespace"
134
+
135
+ @patch("tools.pinecone_client.Pinecone")
136
+ @patch("tools.pinecone_client.get_settings")
137
+ def test_validate_index_raises_error_when_index_missing(self, mock_get_settings, mock_pinecone):
138
+ """validate_index() raises ValueError when index doesn't exist."""
139
+ mock_settings = MagicMock()
140
+ mock_settings.pinecone_api_key = "test-api-key"
141
+ mock_settings.pinecone_index_name = "missing-index"
142
+ mock_get_settings.return_value = mock_settings
143
+
144
+ mock_pc = MagicMock()
145
+ mock_pc.has_index.return_value = False
146
+ mock_pinecone_class = MagicMock(return_value=mock_pc)
147
+ with patch("tools.pinecone_client.Pinecone", mock_pinecone_class):
148
+ client = PineconeClient()
149
+
150
+ with pytest.raises(ValueError, match="Index 'missing-index' not found"):
151
+ client.validate_index()
152
+
153
+ @patch("tools.pinecone_client.Pinecone")
154
+ @patch("tools.pinecone_client.get_settings")
155
+ def test_validate_index_succeeds_when_index_exists(self, mock_get_settings, mock_pinecone):
156
+ """validate_index() succeeds when index exists."""
157
+ mock_settings = MagicMock()
158
+ mock_settings.pinecone_api_key = "test-api-key"
159
+ mock_settings.pinecone_index_name = "existing-index"
160
+ mock_get_settings.return_value = mock_settings
161
+
162
+ mock_pc = MagicMock()
163
+ mock_pc.has_index.return_value = True
164
+ mock_pinecone_class = MagicMock(return_value=mock_pc)
165
+ with patch("tools.pinecone_client.Pinecone", mock_pinecone_class):
166
+ client = PineconeClient()
167
+
168
+ # Should not raise
169
+ client.validate_index()
170
+
171
+ def test_exponential_backoff_retry_succeeds_on_first_attempt(self):
172
+ """exponential_backoff_retry() succeeds when function succeeds immediately."""
173
+ func = MagicMock(return_value="success")
174
+ result = PineconeClient.exponential_backoff_retry(func)
175
+ assert result == "success"
176
+ func.assert_called_once()
177
+
178
+ @patch("tools.pinecone_client.time.sleep")
179
+ def test_exponential_backoff_retry_retries_on_429(self, mock_sleep):
180
+ """exponential_backoff_retry() retries on 429 rate limit errors."""
181
+ error_429 = PineconeException("Rate limited")
182
+ error_429.status = 429
183
+
184
+ func = MagicMock(side_effect=[error_429, "success"])
185
+ result = PineconeClient.exponential_backoff_retry(func, max_retries=2)
186
+
187
+ assert result == "success"
188
+ assert func.call_count == 2
189
+ mock_sleep.assert_called_once_with(1) # 2^0 = 1
190
+
191
+ @patch("tools.pinecone_client.time.sleep")
192
+ def test_exponential_backoff_retry_retries_on_5xx(self, mock_sleep):
193
+ """exponential_backoff_retry() retries on 5xx server errors."""
194
+ error_500 = PineconeException("Server error")
195
+ error_500.status = 500
196
+
197
+ func = MagicMock(side_effect=[error_500, "success"])
198
+ result = PineconeClient.exponential_backoff_retry(func, max_retries=2)
199
+
200
+ assert result == "success"
201
+ assert func.call_count == 2
202
+ mock_sleep.assert_called_once_with(1)
203
+
204
+ def test_exponential_backoff_retry_fails_on_4xx(self):
205
+ """exponential_backoff_retry() fails immediately on 4xx client errors."""
206
+ error_400 = PineconeException("Bad request")
207
+ error_400.status = 400
208
+
209
+ func = MagicMock(side_effect=error_400)
210
+ with pytest.raises(PineconeException):
211
+ PineconeClient.exponential_backoff_retry(func, max_retries=3)
212
+
213
+ # Should only try once (no retries for 4xx)
214
+ assert func.call_count == 1
215
+
216
+ @patch("tools.pinecone_client.time.sleep")
217
+ def test_exponential_backoff_retry_caps_delay_at_60s(self, mock_sleep):
218
+ """exponential_backoff_retry() caps delay at 60 seconds."""
219
+ error_500 = PineconeException("Server error")
220
+ error_500.status = 500
221
+
222
+ func = MagicMock(side_effect=[error_500, error_500, "success"])
223
+ result = PineconeClient.exponential_backoff_retry(func, max_retries=3)
224
+
225
+ assert result == "success"
226
+ # First retry: 2^0 = 1s, second retry: min(2^1, 60) = 2s
227
+ assert mock_sleep.call_count == 2
228
+ mock_sleep.assert_any_call(1)
229
+ mock_sleep.assert_any_call(2)
230
+
231
+ def test_record_to_dict_omits_none_optional_fields(self):
232
+ """_record_to_dict() omits None values for optional fields."""
233
+ record = PineconeRecord(
234
+ _id="test-id",
235
+ content="test content",
236
+ standard_set_id="set-id",
237
+ standard_set_title="Test Set",
238
+ subject="Math",
239
+ education_levels=["01"],
240
+ document_id="doc-id",
241
+ document_valid="2021",
242
+ jurisdiction_id="jur-id",
243
+ jurisdiction_title="Test Jurisdiction",
244
+ depth=0,
245
+ is_leaf=True,
246
+ is_root=True,
247
+ root_id="test-id",
248
+ ancestor_ids=[],
249
+ child_ids=[],
250
+ sibling_count=0,
251
+ # Optional fields set to None
252
+ normalized_subject=None,
253
+ publication_status=None,
254
+ asn_identifier=None,
255
+ statement_notation=None,
256
+ statement_label=None,
257
+ parent_id=None,
258
+ )
259
+
260
+ record_dict = PineconeClient._record_to_dict(record)
261
+
262
+ # Verify _id is serialized (not id)
263
+ assert "_id" in record_dict
264
+ assert record_dict["_id"] == "test-id"
265
+ assert "id" not in record_dict
266
+
267
+ # Optional fields should be omitted
268
+ assert "asn_identifier" not in record_dict
269
+ assert "statement_notation" not in record_dict
270
+ assert "statement_label" not in record_dict
271
+ assert "normalized_subject" not in record_dict
272
+ assert "publication_status" not in record_dict
273
+ # parent_id should be present as null
274
+ assert "parent_id" in record_dict
275
+ assert record_dict["parent_id"] is None
276
+
277
+ def test_record_to_dict_includes_present_optional_fields(self):
278
+ """_record_to_dict() includes optional fields when they have values."""
279
+ record = PineconeRecord(
280
+ _id="test-id",
281
+ content="test content",
282
+ standard_set_id="set-id",
283
+ standard_set_title="Test Set",
284
+ subject="Math",
285
+ normalized_subject="Math",
286
+ education_levels=["01"],
287
+ document_id="doc-id",
288
+ document_valid="2021",
289
+ publication_status="Published",
290
+ jurisdiction_id="jur-id",
291
+ jurisdiction_title="Test Jurisdiction",
292
+ asn_identifier="ASN123",
293
+ statement_notation="1.2.3",
294
+ statement_label="Standard",
295
+ depth=1,
296
+ is_leaf=True,
297
+ is_root=False,
298
+ parent_id="parent-id",
299
+ root_id="root-id",
300
+ ancestor_ids=["root-id"],
301
+ child_ids=[],
302
+ sibling_count=0,
303
+ )
304
+
305
+ record_dict = PineconeClient._record_to_dict(record)
306
+
307
+ # Verify _id is serialized (not id)
308
+ assert "_id" in record_dict
309
+ assert record_dict["_id"] == "test-id"
310
+ assert "id" not in record_dict
311
+
312
+ # Optional fields should be included when present
313
+ assert record_dict["asn_identifier"] == "ASN123"
314
+ assert record_dict["statement_notation"] == "1.2.3"
315
+ assert record_dict["statement_label"] == "Standard"
316
+ assert record_dict["normalized_subject"] == "Math"
317
+ assert record_dict["publication_status"] == "Published"
318
+ assert record_dict["parent_id"] == "parent-id"
tests/test_pinecone_models.py ADDED
@@ -0,0 +1,339 @@
1
+ """Unit tests for Pinecone Pydantic models."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import json
6
+
7
+ import pytest
8
+
9
+ from tools.pinecone_models import PineconeRecord, ProcessedStandardSet
10
+
11
+
12
+ class TestEducationLevelsProcessing:
13
+ """Test education_levels field validator."""
14
+
15
+ def test_simple_array(self):
16
+ """Test simple array without comma-separated values."""
17
+ record = PineconeRecord(
18
+ **{"_id": "test-id"},
19
+ content="Test content",
20
+ standard_set_id="set-1",
21
+ standard_set_title="Grade 1",
22
+ subject="Math",
23
+ education_levels=["01", "02"],
24
+ document_id="doc-1",
25
+ document_valid="2021",
26
+ jurisdiction_id="jur-1",
27
+ jurisdiction_title="Wyoming",
28
+ depth=0,
29
+ is_leaf=True,
30
+ is_root=True,
31
+ root_id="test-id",
32
+ ancestor_ids=[],
33
+ child_ids=[],
34
+ sibling_count=0,
35
+ )
36
+ assert record.education_levels == ["01", "02"]
37
+
38
+ def test_comma_separated_strings(self):
39
+ """Test array with comma-separated values."""
40
+ record = PineconeRecord(
41
+ **{"_id": "test-id"},
42
+ content="Test content",
43
+ standard_set_id="set-1",
44
+ standard_set_title="Grade 1",
45
+ subject="Math",
46
+ education_levels=["01,02", "02", "03"],
47
+ document_id="doc-1",
48
+ document_valid="2021",
49
+ jurisdiction_id="jur-1",
50
+ jurisdiction_title="Wyoming",
51
+ depth=0,
52
+ is_leaf=True,
53
+ is_root=True,
54
+ root_id="test-id",
55
+ ancestor_ids=[],
56
+ child_ids=[],
57
+ sibling_count=0,
58
+ )
59
+ assert record.education_levels == ["01", "02", "03"]
60
+
61
+ def test_high_school_range(self):
62
+ """Test high school grade levels."""
63
+ record = PineconeRecord(
64
+ **{"_id": "test-id"},
65
+ content="Test content",
66
+ standard_set_id="set-1",
67
+ standard_set_title="High School",
68
+ subject="Math",
69
+ education_levels=["09,10,11,12"],
70
+ document_id="doc-1",
71
+ document_valid="2021",
72
+ jurisdiction_id="jur-1",
73
+ jurisdiction_title="Wyoming",
74
+ depth=0,
75
+ is_leaf=True,
76
+ is_root=True,
77
+ root_id="test-id",
78
+ ancestor_ids=[],
79
+ child_ids=[],
80
+ sibling_count=0,
81
+ )
82
+ assert record.education_levels == ["09", "10", "11", "12"]
83
+
84
+ def test_empty_array(self):
85
+ """Test empty array."""
86
+ record = PineconeRecord(
87
+ **{"_id": "test-id"},
88
+ content="Test content",
89
+ standard_set_id="set-1",
90
+ standard_set_title="Grade 1",
91
+ subject="Math",
92
+ education_levels=[],
93
+ document_id="doc-1",
94
+ document_valid="2021",
95
+ jurisdiction_id="jur-1",
96
+ jurisdiction_title="Wyoming",
97
+ depth=0,
98
+ is_leaf=True,
99
+ is_root=True,
100
+ root_id="test-id",
101
+ ancestor_ids=[],
102
+ child_ids=[],
103
+ sibling_count=0,
104
+ )
105
+ assert record.education_levels == []
106
+
107
+ def test_whitespace_handling(self):
108
+ """Test that whitespace is stripped."""
109
+ record = PineconeRecord(
110
+ **{"_id": "test-id"},
111
+ content="Test content",
112
+ standard_set_id="set-1",
113
+ standard_set_title="Grade 1",
114
+ subject="Math",
115
+ education_levels=["01 , 02", " 03 "],
116
+ document_id="doc-1",
117
+ document_valid="2021",
118
+ jurisdiction_id="jur-1",
119
+ jurisdiction_title="Wyoming",
120
+ depth=0,
121
+ is_leaf=True,
122
+ is_root=True,
123
+ root_id="test-id",
124
+ ancestor_ids=[],
125
+ child_ids=[],
126
+ sibling_count=0,
127
+ )
128
+ assert record.education_levels == ["01", "02", "03"]
129
+
130
+
131
+ class TestParentIdNullHandling:
132
+ """Test that parent_id null is properly serialized."""
133
+
134
+ def test_root_node_parent_id_null(self):
135
+ """Test root node has parent_id as null."""
136
+ record = PineconeRecord(
137
+ **{"_id": "root-id"},
138
+ content="Root content",
139
+ standard_set_id="set-1",
140
+ standard_set_title="Grade 1",
141
+ subject="Math",
142
+ education_levels=["01"],
143
+ document_id="doc-1",
144
+ document_valid="2021",
145
+ jurisdiction_id="jur-1",
146
+ jurisdiction_title="Wyoming",
147
+ depth=0,
148
+ is_leaf=False,
149
+ is_root=True,
150
+ parent_id=None,
151
+ root_id="root-id",
152
+ ancestor_ids=[],
153
+ child_ids=["child-1"],
154
+ sibling_count=0,
155
+ )
156
+ assert record.parent_id is None
157
+
158
+ # Test JSON serialization preserves null
159
+ json_str = record.model_dump_json()
160
+ data = json.loads(json_str)
161
+ assert data["parent_id"] is None
162
+
163
+ def test_child_node_parent_id_set(self):
164
+ """Test child node has parent_id set."""
165
+ record = PineconeRecord(
166
+ **{"_id": "child-id"},
167
+ content="Child content",
168
+ standard_set_id="set-1",
169
+ standard_set_title="Grade 1",
170
+ subject="Math",
171
+ education_levels=["01"],
172
+ document_id="doc-1",
173
+ document_valid="2021",
174
+ jurisdiction_id="jur-1",
175
+ jurisdiction_title="Wyoming",
176
+ depth=1,
177
+ is_leaf=True,
178
+ is_root=False,
179
+ parent_id="parent-id",
180
+ root_id="root-id",
181
+ ancestor_ids=["root-id"],
182
+ child_ids=[],
183
+ sibling_count=0,
184
+ )
185
+ assert record.parent_id == "parent-id"
186
+
187
+ # Test JSON serialization
188
+ json_str = record.model_dump_json()
189
+ data = json.loads(json_str)
190
+ assert data["parent_id"] == "parent-id"
191
+
192
+
193
+ class TestOptionalFields:
194
+ """Test optional fields can be omitted."""
195
+
196
+ def test_all_optional_fields_omitted(self):
197
+ """Test record with all optional fields omitted."""
198
+ record = PineconeRecord(
199
+ **{"_id": "test-id"},
200
+ content="Test content",
201
+ standard_set_id="set-1",
202
+ standard_set_title="Grade 1",
203
+ subject="Math",
204
+ education_levels=["01"],
205
+ document_id="doc-1",
206
+ document_valid="2021",
207
+ jurisdiction_id="jur-1",
208
+ jurisdiction_title="Wyoming",
209
+ depth=0,
210
+ is_leaf=True,
211
+ is_root=True,
212
+ root_id="test-id",
213
+ ancestor_ids=[],
214
+ child_ids=[],
215
+ sibling_count=0,
216
+ )
217
+ assert record.normalized_subject is None
218
+ assert record.asn_identifier is None
219
+ assert record.statement_notation is None
220
+ assert record.statement_label is None
221
+ assert record.publication_status is None
222
+
223
+ def test_optional_fields_set(self):
224
+ """Test record with optional fields set."""
225
+ record = PineconeRecord(
226
+ **{"_id": "test-id"},
227
+ content="Test content",
228
+ standard_set_id="set-1",
229
+ standard_set_title="Grade 1",
230
+ subject="Math",
231
+ normalized_subject="Math",
232
+ education_levels=["01"],
233
+ document_id="doc-1",
234
+ document_valid="2021",
235
+ publication_status="Published",
236
+ jurisdiction_id="jur-1",
237
+ jurisdiction_title="Wyoming",
238
+ asn_identifier="S12345",
239
+ statement_notation="1.G.K",
240
+ statement_label="Standard",
241
+ depth=1,
242
+ is_leaf=True,
243
+ is_root=False,
244
+ parent_id="parent-id",
245
+ root_id="root-id",
246
+ ancestor_ids=["root-id"],
247
+ child_ids=[],
248
+ sibling_count=1,
249
+ )
250
+ assert record.normalized_subject == "Math"
251
+ assert record.asn_identifier == "S12345"
252
+ assert record.statement_notation == "1.G.K"
253
+ assert record.statement_label == "Standard"
254
+ assert record.publication_status == "Published"
255
+
256
+
257
+ class TestProcessedStandardSet:
258
+ """Test ProcessedStandardSet container model."""
259
+
260
+ def test_empty_records(self):
261
+ """Test ProcessedStandardSet with empty records."""
262
+ processed = ProcessedStandardSet(records=[])
263
+ assert processed.records == []
264
+
265
+ def test_multiple_records(self):
266
+ """Test ProcessedStandardSet with multiple records."""
267
+ record1 = PineconeRecord(
268
+ **{"_id": "id-1"},
269
+ content="Content 1",
270
+ standard_set_id="set-1",
271
+ standard_set_title="Grade 1",
272
+ subject="Math",
273
+ education_levels=["01"],
274
+ document_id="doc-1",
275
+ document_valid="2021",
276
+ jurisdiction_id="jur-1",
277
+ jurisdiction_title="Wyoming",
278
+ depth=0,
279
+ is_leaf=True,
280
+ is_root=True,
281
+ root_id="id-1",
282
+ ancestor_ids=[],
283
+ child_ids=[],
284
+ sibling_count=0,
285
+ )
286
+ record2 = PineconeRecord(
287
+ **{"_id": "id-2"},
288
+ content="Content 2",
289
+ standard_set_id="set-1",
290
+ standard_set_title="Grade 1",
291
+ subject="Math",
292
+ education_levels=["01"],
293
+ document_id="doc-1",
294
+ document_valid="2021",
295
+ jurisdiction_id="jur-1",
296
+ jurisdiction_title="Wyoming",
297
+ depth=1,
298
+ is_leaf=True,
299
+ is_root=False,
300
+ parent_id="id-1",
301
+ root_id="id-1",
302
+ ancestor_ids=["id-1"],
303
+ child_ids=[],
304
+ sibling_count=0,
305
+ )
306
+ processed = ProcessedStandardSet(records=[record1, record2])
307
+ assert len(processed.records) == 2
308
+ assert processed.records[0].id == "id-1"
309
+ assert processed.records[1].id == "id-2"
310
+
311
+ def test_json_serialization(self):
312
+ """Test JSON serialization of ProcessedStandardSet."""
313
+ record = PineconeRecord(
314
+ **{"_id": "test-id"},
315
+ content="Test content",
316
+ standard_set_id="set-1",
317
+ standard_set_title="Grade 1",
318
+ subject="Math",
319
+ education_levels=["01"],
320
+ document_id="doc-1",
321
+ document_valid="2021",
322
+ jurisdiction_id="jur-1",
323
+ jurisdiction_title="Wyoming",
324
+ depth=0,
325
+ is_leaf=True,
326
+ is_root=True,
327
+ root_id="test-id",
328
+ ancestor_ids=[],
329
+ child_ids=[],
330
+ sibling_count=0,
331
+ )
332
+ processed = ProcessedStandardSet(records=[record])
333
+ json_str = processed.model_dump_json(by_alias=True)
334
+ data = json.loads(json_str)
335
+ assert "records" in data
336
+ assert len(data["records"]) == 1
337
+ assert data["records"][0]["_id"] == "test-id"
338
+ assert data["records"][0]["parent_id"] is None # Verify null handling
339
+
tests/test_pinecone_processor.py ADDED
@@ -0,0 +1,463 @@
1
+ """Tests for Pinecone processor module."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import json
6
+ from pathlib import Path
7
+
8
+ import pytest
9
+
10
+ from tools.models import Standard, StandardSet
11
+ from tools.pinecone_processor import StandardSetProcessor, process_and_save
12
+
13
+
14
+ @pytest.fixture
15
+ def sample_standard_set():
16
+ """Create a sample standard set for testing."""
17
+ # Create a simple hierarchy:
18
+ # Root (depth 0): "Math"
19
+ # Child (depth 1, notation "1.1"): "Numbers"
20
+ # Child (depth 2, notation "1.1.A"): "Count to 10"
21
+ root_id = "ROOT_ID"
22
+ child1_id = "CHILD1_ID"
23
+ child2_id = "CHILD2_ID"
24
+
25
+ standards = {
26
+ root_id: Standard(
27
+ id=root_id,
28
+ position=100000,
29
+ depth=0,
30
+ description="Math",
31
+ statementLabel="Domain",
32
+ ancestorIds=[],
33
+ parentId=None,
34
+ ),
35
+ child1_id: Standard(
36
+ id=child1_id,
37
+ position=101000,
38
+ depth=1,
39
+ description="Numbers",
40
+ statementNotation="1.1",
41
+ statementLabel="Standard",
42
+ ancestorIds=[root_id],
43
+ parentId=root_id,
44
+ ),
45
+ child2_id: Standard(
46
+ id=child2_id,
47
+ position=102000,
48
+ depth=2,
49
+ description="Count to 10",
50
+ statementNotation="1.1.A",
51
+ statementLabel="Benchmark",
52
+ ancestorIds=[root_id, child1_id],
53
+ parentId=child1_id,
54
+ ),
55
+ }
56
+
57
+ standard_set = StandardSet(
58
+ id="SET_ID",
59
+ title="Grade 1",
60
+ subject="Mathematics",
61
+ normalizedSubject="Math",
62
+ educationLevels=["01"],
63
+ license={
64
+ "title": "CC BY",
65
+ "URL": "https://example.com",
66
+ "rightsHolder": "Test",
67
+ },
68
+ document={
69
+ "id": "DOC_ID",
70
+ "title": "Test Document",
71
+ "valid": "2021",
72
+ "publicationStatus": "Published",
73
+ },
74
+ jurisdiction={"id": "JUR_ID", "title": "Test State"},
75
+ standards=standards,
76
+ )
77
+
78
+ return standard_set
79
+
80
+
81
+ class TestRelationshipMaps:
82
+ """Test relationship map building (Task 3)."""
83
+
84
+ def test_build_id_to_standard_map(self, sample_standard_set):
85
+ """Test ID-to-standard map building."""
86
+ processor = StandardSetProcessor()
87
+ standards_dict = {
88
+ std_id: std.model_dump()
89
+ for std_id, std in sample_standard_set.standards.items()
90
+ }
91
+
92
+ result = processor._build_id_to_standard_map(standards_dict)
93
+
94
+ assert len(result) == 3
95
+ assert "ROOT_ID" in result
96
+ assert "CHILD1_ID" in result
97
+ assert "CHILD2_ID" in result
98
+ assert result["ROOT_ID"]["id"] == "ROOT_ID"
99
+
100
+ def test_build_parent_to_children_map(self, sample_standard_set):
101
+ """Test parent-to-children map building with position sorting."""
102
+ processor = StandardSetProcessor()
103
+ standards_dict = {
104
+ std_id: std.model_dump()
105
+ for std_id, std in sample_standard_set.standards.items()
106
+ }
107
+
108
+ result = processor._build_parent_to_children_map(standards_dict)
109
+
110
+ # Root should have one child
111
+ assert None in result
112
+ assert result[None] == ["ROOT_ID"]
113
+
114
+ # Root should have child1 as child
115
+ assert "ROOT_ID" in result
116
+ assert result["ROOT_ID"] == ["CHILD1_ID"]
117
+
118
+ # Child1 should have child2 as child
119
+ assert "CHILD1_ID" in result
120
+ assert result["CHILD1_ID"] == ["CHILD2_ID"]
121
+
122
+ # Child2 should have no children
123
+ assert "CHILD2_ID" not in result or result.get("CHILD2_ID") == []
124
+
125
+ def test_identify_leaf_nodes(self, sample_standard_set):
126
+ """Test leaf node identification."""
127
+ processor = StandardSetProcessor()
128
+ standards_dict = {
129
+ std_id: std.model_dump()
130
+ for std_id, std in sample_standard_set.standards.items()
131
+ }
132
+
133
+ result = processor._identify_leaf_nodes(standards_dict)
134
+
135
+ # Only child2 should be a leaf (no children)
136
+ assert "CHILD2_ID" in result
137
+ assert "ROOT_ID" not in result
138
+ assert "CHILD1_ID" not in result
139
+
140
+ def test_identify_root_nodes(self, sample_standard_set):
141
+ """Test root node identification."""
142
+ processor = StandardSetProcessor()
143
+ standards_dict = {
144
+ std_id: std.model_dump()
145
+ for std_id, std in sample_standard_set.standards.items()
146
+ }
147
+
148
+ result = processor._identify_root_nodes(standards_dict)
149
+
150
+ # Only ROOT_ID should be a root
151
+ assert "ROOT_ID" in result
152
+ assert "CHILD1_ID" not in result
153
+ assert "CHILD2_ID" not in result
154
+
155
+
156
+ class TestHierarchyFunctions:
157
+ """Test hierarchy functions (Task 4)."""
158
+
159
+ def test_find_root_id_for_root(self, sample_standard_set):
160
+ """Test finding root ID for a root node."""
161
+ processor = StandardSetProcessor()
162
+ standards_dict = {
163
+ std_id: std.model_dump()
164
+ for std_id, std in sample_standard_set.standards.items()
165
+ }
166
+ processor.id_to_standard = processor._build_id_to_standard_map(standards_dict)
167
+
168
+ root_std = standards_dict["ROOT_ID"]
169
+ root_id = processor.find_root_id(root_std, processor.id_to_standard)
170
+
171
+ assert root_id == "ROOT_ID"
172
+
173
+ def test_find_root_id_for_child(self, sample_standard_set):
174
+ """Test finding root ID for a child node."""
175
+ processor = StandardSetProcessor()
176
+ standards_dict = {
177
+ std_id: std.model_dump()
178
+ for std_id, std in sample_standard_set.standards.items()
179
+ }
180
+ processor.id_to_standard = processor._build_id_to_standard_map(standards_dict)
181
+
182
+ child_std = standards_dict["CHILD2_ID"]
183
+ root_id = processor.find_root_id(child_std, processor.id_to_standard)
184
+
185
+ assert root_id == "ROOT_ID"
186
+
187
+ def test_build_ordered_ancestors(self, sample_standard_set):
188
+ """Test building ordered ancestor list."""
189
+ processor = StandardSetProcessor()
190
+ standards_dict = {
191
+ std_id: std.model_dump()
192
+ for std_id, std in sample_standard_set.standards.items()
193
+ }
194
+ processor.id_to_standard = processor._build_id_to_standard_map(standards_dict)
195
+
196
+ # For root, ancestors should be empty
197
+ root_std = standards_dict["ROOT_ID"]
198
+ ancestors = processor.build_ordered_ancestors(root_std, processor.id_to_standard)
199
+ assert ancestors == []
200
+
201
+ # For child1, ancestors should be [ROOT_ID]
202
+ child1_std = standards_dict["CHILD1_ID"]
203
+ ancestors = processor.build_ordered_ancestors(child1_std, processor.id_to_standard)
204
+ assert ancestors == ["ROOT_ID"]
205
+
206
+ # For child2, ancestors should be [ROOT_ID, CHILD1_ID]
207
+ child2_std = standards_dict["CHILD2_ID"]
208
+ ancestors = processor.build_ordered_ancestors(child2_std, processor.id_to_standard)
209
+ assert ancestors == ["ROOT_ID", "CHILD1_ID"]
210
+
211
+ def test_compute_sibling_count(self, sample_standard_set):
212
+ """Test sibling count computation."""
213
+ processor = StandardSetProcessor()
214
+ standards_dict = {
215
+ std_id: std.model_dump()
216
+ for std_id, std in sample_standard_set.standards.items()
217
+ }
218
+ processor.parent_to_children = processor._build_parent_to_children_map(standards_dict)
219
+
220
+ # Root has no siblings
221
+ root_std = standards_dict["ROOT_ID"]
222
+ count = processor._compute_sibling_count(root_std)
223
+ assert count == 0
224
+
225
+ # Child1 has no siblings
226
+ child1_std = standards_dict["CHILD1_ID"]
227
+ count = processor._compute_sibling_count(child1_std)
228
+ assert count == 0
229
+
230
+ # Child2 has no siblings
231
+ child2_std = standards_dict["CHILD2_ID"]
232
+ count = processor._compute_sibling_count(child2_std)
233
+ assert count == 0
234
+
235
+
236
+ class TestContentGeneration:
237
+ """Test content text generation (Task 5)."""
238
+
239
+ def test_build_content_text_for_root(self, sample_standard_set):
240
+ """Test content generation for root node."""
241
+ processor = StandardSetProcessor()
242
+ standards_dict = {
243
+ std_id: std.model_dump()
244
+ for std_id, std in sample_standard_set.standards.items()
245
+ }
246
+ processor.id_to_standard = processor._build_id_to_standard_map(standards_dict)
247
+
248
+ root_std = standards_dict["ROOT_ID"]
249
+ content = processor._build_content_text(root_std)
250
+
251
+ assert content == "Depth 0: Math"
252
+
253
+ def test_build_content_text_for_child(self, sample_standard_set):
254
+ """Test content generation for child node with notation."""
255
+ processor = StandardSetProcessor()
256
+ standards_dict = {
257
+ std_id: std.model_dump()
258
+ for std_id, std in sample_standard_set.standards.items()
259
+ }
260
+ processor.id_to_standard = processor._build_id_to_standard_map(standards_dict)
261
+
262
+ child1_std = standards_dict["CHILD1_ID"]
263
+ content = processor._build_content_text(child1_std)
264
+
265
+ expected = "Depth 0: Math\nDepth 1 (1.1): Numbers"
266
+ assert content == expected
267
+
268
+ def test_build_content_text_for_deep_child(self, sample_standard_set):
269
+ """Test content generation for deep child node."""
270
+ processor = StandardSetProcessor()
271
+ standards_dict = {
272
+ std_id: std.model_dump()
273
+ for std_id, std in sample_standard_set.standards.items()
274
+ }
275
+ processor.id_to_standard = processor._build_id_to_standard_map(standards_dict)
276
+
277
+ child2_std = standards_dict["CHILD2_ID"]
278
+ content = processor._build_content_text(child2_std)
279
+
280
+ expected = "Depth 0: Math\nDepth 1 (1.1): Numbers\nDepth 2 (1.1.A): Count to 10"
281
+ assert content == expected
282
+
283
+ def test_build_content_text_without_notation(self):
284
+ """Test content generation without statement notation."""
285
+ processor = StandardSetProcessor()
286
+
287
+ # Create a standard without notation
288
+ standard = {
289
+ "id": "TEST_ID",
290
+ "depth": 1,
291
+ "description": "Test Description",
292
+ "parentId": "PARENT_ID",
293
+ }
294
+
295
+ parent = {
296
+ "id": "PARENT_ID",
297
+ "depth": 0,
298
+ "description": "Parent",
299
+ "parentId": None,
300
+ }
301
+
302
+ processor.id_to_standard = {"TEST_ID": standard, "PARENT_ID": parent}
303
+
304
+ content = processor._build_content_text(standard)
305
+
306
+ expected = "Depth 0: Parent\nDepth 1: Test Description"
307
+ assert content == expected
308
+
309
+
310
+ class TestRecordTransformation:
311
+ """Test record transformation (Task 6)."""
312
+
313
+ def test_transform_root_standard(self, sample_standard_set):
314
+ """Test transforming a root standard."""
315
+ processor = StandardSetProcessor()
316
+ processor._build_relationship_maps(sample_standard_set.standards)
317
+
318
+ root_standard = sample_standard_set.standards["ROOT_ID"]
319
+ record = processor._transform_standard(root_standard, sample_standard_set)
320
+
321
+ assert record.id == "ROOT_ID"
322
+ assert record.is_root is True
323
+ assert record.is_leaf is False
324
+ assert record.parent_id is None
325
+ assert record.root_id == "ROOT_ID"
326
+ assert record.ancestor_ids == []
327
+ assert record.depth == 0
328
+ assert record.content == "Depth 0: Math"
329
+
330
+ def test_transform_leaf_standard(self, sample_standard_set):
331
+ """Test transforming a leaf standard."""
332
+ processor = StandardSetProcessor()
333
+ processor._build_relationship_maps(sample_standard_set.standards)
334
+
335
+ leaf_standard = sample_standard_set.standards["CHILD2_ID"]
336
+ record = processor._transform_standard(leaf_standard, sample_standard_set)
337
+
338
+ assert record.id == "CHILD2_ID"
339
+ assert record.is_root is False
340
+ assert record.is_leaf is True
341
+ assert record.parent_id == "CHILD1_ID"
342
+ assert record.root_id == "ROOT_ID"
343
+ assert record.ancestor_ids == ["ROOT_ID", "CHILD1_ID"]
344
+ assert record.depth == 2
345
+ assert "Depth 0: Math" in record.content
346
+ assert "Depth 2 (1.1.A): Count to 10" in record.content
347
+
348
+ def test_transform_standard_with_optional_fields(self, sample_standard_set):
349
+ """Test transformation includes optional fields when present."""
350
+ processor = StandardSetProcessor()
351
+ processor._build_relationship_maps(sample_standard_set.standards)
352
+
353
+ standard = sample_standard_set.standards["CHILD2_ID"]
354
+ record = processor._transform_standard(standard, sample_standard_set)
355
+
356
+ assert record.statement_notation == "1.1.A"
357
+ assert record.statement_label == "Benchmark"
358
+
359
+ def test_transform_standard_without_optional_fields(self):
360
+ """Test transformation omits optional fields when missing."""
361
+ # Create a minimal standard set
362
+ standard = Standard(
363
+ id="MIN_ID",
364
+ position=100000,
365
+ depth=0,
366
+ description="Minimal",
367
+ ancestorIds=[],
368
+ parentId=None,
369
+ )
370
+
371
+ standard_set = StandardSet(
372
+ id="SET_ID",
373
+ title="Test",
374
+ subject="Test",
375
+ normalizedSubject="Test",
376
+ educationLevels=["01"],
377
+ license={"title": "CC", "URL": "https://example.com", "rightsHolder": "Test"},
378
+ document={"id": "DOC", "title": "Doc", "valid": "2021", "publicationStatus": "Published"},
379
+ jurisdiction={"id": "JUR", "title": "Jur"},
380
+ standards={"MIN_ID": standard},
381
+ )
382
+
383
+ processor = StandardSetProcessor()
384
+ processor._build_relationship_maps(standard_set.standards)
385
+
386
+ record = processor._transform_standard(standard, standard_set)
387
+
388
+ assert record.asn_identifier is None
389
+ assert record.statement_notation is None
390
+ assert record.statement_label is None
391
+
392
+ def test_process_standard_set(self, sample_standard_set):
393
+ """Test processing entire standard set."""
394
+ processor = StandardSetProcessor()
395
+ processed_set = processor.process_standard_set(sample_standard_set)
396
+
397
+ assert len(processed_set.records) == 3
398
+ assert all(isinstance(r, type(processed_set.records[0])) for r in processed_set.records)
399
+
400
+
401
+ class TestFileOperations:
402
+ """Test file operations (Task 7)."""
403
+
404
+ def test_process_and_save(self, tmp_path, sample_standard_set):
405
+ """Test processing and saving to file."""
406
+ # Create temporary directory structure
407
+ set_dir = tmp_path / "standardSets" / "TEST_SET_ID"
408
+ set_dir.mkdir(parents=True)
409
+
410
+ # Save sample data.json
411
+ data_file = set_dir / "data.json"
412
+ response_data = {"data": sample_standard_set.model_dump(mode="json")}
413
+ with open(data_file, "w", encoding="utf-8") as f:
414
+ json.dump(response_data, f)
415
+
416
+ # Mock the settings to use tmp_path
417
+ from unittest.mock import patch
418
+ from tools.config import ToolsSettings
419
+
420
+ with patch("tools.pinecone_processor.settings") as mock_settings:
421
+ mock_settings.standard_sets_dir = tmp_path / "standardSets"
422
+
423
+ processed_file = process_and_save("TEST_SET_ID")
424
+
425
+ assert processed_file.exists()
426
+ assert processed_file.name == "processed.json"
427
+
428
+ # Verify content
429
+ with open(processed_file, encoding="utf-8") as f:
430
+ data = json.load(f)
431
+
432
+ assert "records" in data
433
+ assert len(data["records"]) == 3
434
+
435
+ def test_process_and_save_missing_file(self):
436
+ """Test error handling for missing data.json."""
437
+ from unittest.mock import patch
438
+ from tools.config import ToolsSettings
439
+
440
+ with patch("tools.pinecone_processor.settings") as mock_settings:
441
+ mock_settings.standard_sets_dir = Path("/nonexistent/path")
442
+
443
+ with pytest.raises(FileNotFoundError):
444
+ process_and_save("NONEXISTENT_SET")
445
+
446
+ def test_process_and_save_invalid_json(self, tmp_path):
447
+ """Test error handling for invalid JSON."""
448
+ set_dir = tmp_path / "standardSets" / "TEST_SET_ID"
449
+ set_dir.mkdir(parents=True)
450
+
451
+ # Write invalid JSON
452
+ data_file = set_dir / "data.json"
453
+ with open(data_file, "w", encoding="utf-8") as f:
454
+ f.write("{ invalid json }")
455
+
456
+ from unittest.mock import patch
457
+
458
+ with patch("tools.pinecone_processor.settings") as mock_settings:
459
+ mock_settings.standard_sets_dir = tmp_path / "standardSets"
460
+
461
+ with pytest.raises(ValueError, match="Invalid JSON"):
462
+ process_and_save("TEST_SET_ID")
463
+
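These are plain pytest tests, so they can be driven from the command line or programmatically. A minimal programmatic run might look like the sketch below; the test module path is an assumption about the repository layout, not something stated in this commit.

```python
import sys

import pytest

# Path is assumed from the repo layout; adjust if the test module lives elsewhere.
sys.exit(pytest.main(["-v", "tests/test_pinecone_processor.py"]))
```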
tools/__init__.py ADDED
@@ -0,0 +1,2 @@
1
+ """Build and maintenance tools."""
2
+
tools/api_client.py ADDED
@@ -0,0 +1,435 @@
1
+ """API client for Common Standards Project with retry logic and rate limiting."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import json
6
+ import time
7
+ from typing import Any
8
+
9
+ import requests
10
+ from loguru import logger
11
+
12
+ from tools.config import get_settings
13
+ from tools.models import (
14
+ Jurisdiction,
15
+ JurisdictionDetails,
16
+ StandardSet,
17
+ StandardSetReference,
18
+ )
19
+
20
+ settings = get_settings()
21
+
22
+ # Cache file for jurisdictions
23
+ JURISDICTIONS_CACHE_FILE = settings.raw_data_dir / "jurisdictions.json"
24
+
25
+ # Rate limiting: Max requests per minute
26
+ MAX_REQUESTS_PER_MINUTE = settings.max_requests_per_minute
27
+ _request_timestamps: list[float] = []
28
+
29
+
30
+ class APIError(Exception):
31
+ """Raised when API request fails after all retries."""
32
+
33
+ pass
34
+
35
+
36
+ def _get_headers() -> dict[str, str]:
37
+ """Get authentication headers for API requests."""
38
+ if not settings.csp_api_key:
39
+ logger.error("CSP_API_KEY not found in .env file")
40
+ raise ValueError("CSP_API_KEY environment variable not set")
41
+ return {"Api-Key": settings.csp_api_key}
42
+
43
+
44
+ def _enforce_rate_limit() -> None:
45
+ """Enforce rate limiting by tracking request timestamps."""
46
+ global _request_timestamps
47
+ now = time.time()
48
+
49
+ # Remove timestamps older than 1 minute
50
+ _request_timestamps = [ts for ts in _request_timestamps if now - ts < 60]
51
+
52
+ # If at limit, wait
53
+ if len(_request_timestamps) >= MAX_REQUESTS_PER_MINUTE:
54
+ sleep_time = 60 - (now - _request_timestamps[0])
55
+ logger.warning(f"Rate limit reached. Waiting {sleep_time:.1f} seconds...")
56
+ time.sleep(sleep_time)
57
+ _request_timestamps = []
58
+
59
+ _request_timestamps.append(now)
60
+
61
+
62
+ def _make_request(
63
+ endpoint: str, params: dict[str, Any] | None = None, max_retries: int = 3
64
+ ) -> dict[str, Any]:
65
+ """
66
+ Make API request with exponential backoff retry logic.
67
+
68
+ Args:
69
+ endpoint: API endpoint path (e.g., "/jurisdictions")
70
+ params: Query parameters
71
+ max_retries: Maximum number of retry attempts
72
+
73
+ Returns:
74
+ Parsed JSON response
75
+
76
+ Raises:
77
+ APIError: After all retries exhausted or on fatal errors
78
+ """
79
+ url = f"{settings.csp_base_url}{endpoint}"
80
+ headers = _get_headers()
81
+
82
+ for attempt in range(max_retries):
83
+ try:
84
+ _enforce_rate_limit()
85
+
86
+ logger.debug(
87
+ f"API request: {endpoint} (attempt {attempt + 1}/{max_retries})"
88
+ )
89
+ response = requests.get(url, headers=headers, params=params, timeout=30)
90
+
91
+ # Handle specific status codes
92
+ if response.status_code == 401:
93
+ logger.error("Invalid API key (401 Unauthorized)")
94
+ raise APIError("Authentication failed. Check your CSP_API_KEY in .env")
95
+
96
+ if response.status_code == 404:
97
+ logger.error(f"Resource not found (404): {endpoint}")
98
+ raise APIError(f"Resource not found: {endpoint}")
99
+
100
+ if response.status_code == 429:
101
+ # Rate limited by server
102
+ retry_after = int(response.headers.get("Retry-After", 60))
103
+ logger.warning(
104
+ f"Server rate limit hit. Waiting {retry_after} seconds..."
105
+ )
106
+ time.sleep(retry_after)
107
+ continue
108
+
109
+ response.raise_for_status()
110
+ logger.info(f"API request successful: {endpoint}")
111
+ return response.json()
112
+
113
+ except requests.exceptions.Timeout:
114
+ wait_time = 2**attempt # Exponential backoff: 1s, 2s, 4s
115
+ logger.warning(f"Request timeout. Retrying in {wait_time}s...")
116
+ if attempt < max_retries - 1:
117
+ time.sleep(wait_time)
118
+ else:
119
+ raise APIError(f"Request timeout after {max_retries} attempts")
120
+
121
+ except requests.exceptions.ConnectionError:
122
+ wait_time = 2**attempt
123
+ logger.warning(f"Connection error. Retrying in {wait_time}s...")
124
+ if attempt < max_retries - 1:
125
+ time.sleep(wait_time)
126
+ else:
127
+ raise APIError(f"Connection failed after {max_retries} attempts")
128
+
129
+ except requests.exceptions.HTTPError as e:
130
+ # Don't retry on 4xx errors (except 429)
131
+ if 400 <= response.status_code < 500 and response.status_code != 429:
132
+ raise APIError(f"HTTP {response.status_code}: {response.text}")
133
+ # Retry on 5xx errors
134
+ wait_time = 2**attempt
135
+ logger.warning(
136
+ f"Server error {response.status_code}. Retrying in {wait_time}s..."
137
+ )
138
+ if attempt < max_retries - 1:
139
+ time.sleep(wait_time)
140
+ else:
141
+ raise APIError(f"Server error after {max_retries} attempts")
142
+
143
+ raise APIError("Request failed after all retries")
144
+
145
+
146
+ def get_jurisdictions(
147
+ search_term: str | None = None,
148
+ type_filter: str | None = None,
149
+ force_refresh: bool = False,
150
+ ) -> list[Jurisdiction]:
151
+ """
152
+ Fetch all jurisdictions from the API or local cache.
153
+
154
+ Jurisdictions are cached locally in data/raw/jurisdictions.json to avoid
155
+ repeated API calls. Use force_refresh=True to fetch fresh data from the API.
156
+
157
+ Args:
158
+ search_term: Optional filter for jurisdiction title (case-insensitive partial match)
159
+ type_filter: Optional filter for jurisdiction type (case-insensitive).
160
+ Valid values: "school", "organization", "state", "nation"
161
+ force_refresh: If True, fetch fresh data from API and update cache
162
+
163
+ Returns:
164
+ List of Jurisdiction models
165
+ """
166
+ jurisdictions: list[Jurisdiction] = []
167
+ raw_data: list[dict[str, Any]] = []
168
+
169
+ # Check cache first (unless forcing refresh)
170
+ if not force_refresh and JURISDICTIONS_CACHE_FILE.exists():
171
+ try:
172
+ logger.info("Loading jurisdictions from cache")
173
+ with open(JURISDICTIONS_CACHE_FILE, encoding="utf-8") as f:
174
+ cached_response = json.load(f)
175
+ raw_data = cached_response.get("data", [])
176
+ logger.info(f"Loaded {len(raw_data)} jurisdictions from cache")
177
+ except (json.JSONDecodeError, IOError) as e:
178
+ logger.warning(f"Failed to load cache: {e}. Fetching from API...")
179
+ force_refresh = True
180
+
181
+ # Fetch from API if cache doesn't exist or force_refresh is True
182
+ if force_refresh or not raw_data:
183
+ logger.info("Fetching jurisdictions from API")
184
+ response = _make_request("/jurisdictions")
185
+ raw_data = response.get("data", [])
186
+
187
+ # Save to cache
188
+ try:
189
+ settings.raw_data_dir.mkdir(parents=True, exist_ok=True)
190
+ with open(JURISDICTIONS_CACHE_FILE, "w", encoding="utf-8") as f:
191
+ json.dump(response, f, indent=2, ensure_ascii=False)
192
+ logger.info(
193
+ f"Cached {len(raw_data)} jurisdictions to {JURISDICTIONS_CACHE_FILE}"
194
+ )
195
+ except IOError as e:
196
+ logger.warning(f"Failed to save cache: {e}")
197
+
198
+ # Parse into Pydantic models
199
+ jurisdictions = [Jurisdiction(**j) for j in raw_data]
200
+
201
+ # Apply type filter if provided (case-insensitive)
202
+ if type_filter:
203
+ type_lower = type_filter.lower()
204
+ original_count = len(jurisdictions)
205
+ jurisdictions = [j for j in jurisdictions if j.type.lower() == type_lower]
206
+ logger.info(
207
+ f"Filtered to {len(jurisdictions)} jurisdictions of type '{type_filter}' (from {original_count})"
208
+ )
209
+
210
+ # Apply search filter if provided (case-insensitive partial match)
211
+ if search_term:
212
+ search_lower = search_term.lower()
213
+ original_count = len(jurisdictions)
214
+ jurisdictions = [j for j in jurisdictions if search_lower in j.title.lower()]
215
+ logger.info(
216
+ f"Filtered to {len(jurisdictions)} jurisdictions matching '{search_term}' (from {original_count})"
217
+ )
218
+
219
+ return jurisdictions
220
+
221
+
222
+ def get_jurisdiction_details(
223
+ jurisdiction_id: str, force_refresh: bool = False, hide_hidden_sets: bool = True
224
+ ) -> JurisdictionDetails:
225
+ """
226
+ Fetch jurisdiction metadata including standard set references.
227
+
228
+ Jurisdiction metadata is cached locally in data/raw/jurisdictions/{jurisdiction_id}/data.json
229
+ to avoid repeated API calls. Use force_refresh=True to fetch fresh data from the API.
230
+
231
+ Note: This returns metadata about standard sets (IDs, titles, subjects) but NOT the
232
+ full standard set content. Use download_standard_set() to get full standard set data.
233
+
234
+ Args:
235
+ jurisdiction_id: The jurisdiction GUID
236
+ force_refresh: If True, fetch fresh data from API and update cache
237
+ hide_hidden_sets: If True, hide deprecated/outdated sets (default: True)
238
+
239
+ Returns:
240
+ JurisdictionDetails model with jurisdiction metadata and standardSets array
241
+ """
242
+ cache_dir = settings.raw_data_dir / "jurisdictions" / jurisdiction_id
243
+ cache_file = cache_dir / "data.json"
244
+ raw_data: dict[str, Any] = {}
245
+
246
+ # Check cache first (unless forcing refresh)
247
+ if not force_refresh and cache_file.exists():
248
+ try:
249
+ logger.info(f"Loading jurisdiction {jurisdiction_id} from cache")
250
+ with open(cache_file, encoding="utf-8") as f:
251
+ cached_response = json.load(f)
252
+ raw_data = cached_response.get("data", {})
253
+ logger.info(f"Loaded jurisdiction metadata from cache")
254
+ except (json.JSONDecodeError, IOError) as e:
255
+ logger.warning(f"Failed to load cache: {e}. Fetching from API...")
256
+ force_refresh = True
257
+
258
+ # Fetch from API if cache doesn't exist or force_refresh is True
259
+ if force_refresh or not raw_data:
260
+ logger.info(f"Fetching jurisdiction {jurisdiction_id} from API")
261
+ params = {"hideHiddenSets": "true" if hide_hidden_sets else "false"}
262
+ response = _make_request(f"/jurisdictions/{jurisdiction_id}", params=params)
263
+ raw_data = response.get("data", {})
264
+
265
+ # Save to cache
266
+ try:
267
+ cache_dir.mkdir(parents=True, exist_ok=True)
268
+ with open(cache_file, "w", encoding="utf-8") as f:
269
+ json.dump(response, f, indent=2, ensure_ascii=False)
270
+ logger.info(f"Cached jurisdiction metadata to {cache_file}")
271
+ except IOError as e:
272
+ logger.warning(f"Failed to save cache: {e}")
273
+
274
+ # Parse into Pydantic model
275
+ return JurisdictionDetails(**raw_data)
276
+
277
+
278
+ def download_standard_set(set_id: str, force_refresh: bool = False) -> StandardSet:
279
+ """
280
+ Download full standard set data with caching.
281
+
282
+ Standard set data is cached locally in data/raw/standardSets/{set_id}/data.json
283
+ to avoid repeated API calls. Use force_refresh=True to fetch fresh data from the API.
284
+
285
+ Args:
286
+ set_id: The standard set GUID
287
+ force_refresh: If True, fetch fresh data from API and update cache
288
+
289
+ Returns:
290
+ StandardSet model with complete standard set data including hierarchy
291
+ """
292
+ cache_dir = settings.raw_data_dir / "standardSets" / set_id
293
+ cache_file = cache_dir / "data.json"
294
+ raw_data: dict[str, Any] = {}
295
+
296
+ # Check cache first (unless forcing refresh)
297
+ if not force_refresh and cache_file.exists():
298
+ try:
299
+ logger.info(f"Loading standard set {set_id} from cache")
300
+ with open(cache_file, encoding="utf-8") as f:
301
+ cached_response = json.load(f)
302
+ raw_data = cached_response.get("data", {})
303
+ logger.info(f"Loaded standard set from cache")
304
+ except (json.JSONDecodeError, IOError) as e:
305
+ logger.warning(f"Failed to load cache: {e}. Fetching from API...")
306
+ force_refresh = True
307
+
308
+ # Fetch from API if cache doesn't exist or force_refresh is True
309
+ if force_refresh or not raw_data:
310
+ logger.info(f"Downloading standard set {set_id} from API")
311
+ response = _make_request(f"/standard_sets/{set_id}")
312
+ raw_data = response.get("data", {})
313
+
314
+ # Save to cache
315
+ try:
316
+ cache_dir.mkdir(parents=True, exist_ok=True)
317
+ with open(cache_file, "w", encoding="utf-8") as f:
318
+ json.dump(response, f, indent=2, ensure_ascii=False)
319
+ logger.info(f"Cached standard set to {cache_file}")
320
+ except IOError as e:
321
+ logger.warning(f"Failed to save cache: {e}")
322
+
323
+ # Parse into Pydantic model
324
+ return StandardSet(**raw_data)
325
+
326
+
327
+ def _filter_standard_set(
328
+ standard_set: StandardSetReference,
329
+ education_levels: list[str] | None = None,
330
+ publication_status: str | None = None,
331
+ valid_year: str | None = None,
332
+ title_search: str | None = None,
333
+ subject_search: str | None = None,
334
+ ) -> bool:
335
+ """
336
+ Check if a standard set matches all provided filters (AND logic).
337
+
338
+ Args:
339
+ standard_set: StandardSetReference model from jurisdiction metadata
340
+ education_levels: List of grade levels to match (any match)
341
+ publication_status: Publication status to match
342
+ valid_year: Valid year string to match
343
+ title_search: Partial string match on title (case-insensitive)
344
+ subject_search: Partial string match on subject (case-insensitive)
345
+
346
+ Returns:
347
+ True if standard set matches all provided filters
348
+ """
349
+ # Filter by education levels (any match)
350
+ if education_levels:
351
+ set_levels = {level.upper() for level in standard_set.educationLevels}
352
+ filter_levels = {level.upper() for level in education_levels}
353
+ if not set_levels.intersection(filter_levels):
354
+ return False
355
+
356
+ # Filter by publication status
357
+ if publication_status:
358
+ if (
359
+ standard_set.document.publicationStatus
360
+ and standard_set.document.publicationStatus.lower()
361
+ != publication_status.lower()
362
+ ):
363
+ return False
364
+
365
+ # Filter by valid year
366
+ if valid_year:
367
+ if standard_set.document.valid != valid_year:
368
+ return False
369
+
370
+ # Filter by title search (partial match, case-insensitive)
371
+ if title_search:
372
+ if title_search.lower() not in standard_set.title.lower():
373
+ return False
374
+
375
+ # Filter by subject search (partial match, case-insensitive)
376
+ if subject_search:
377
+ if subject_search.lower() not in standard_set.subject.lower():
378
+ return False
379
+
380
+ return True
381
+
382
+
383
+ def download_standard_sets_by_jurisdiction(
384
+ jurisdiction_id: str,
385
+ force_refresh: bool = False,
386
+ education_levels: list[str] | None = None,
387
+ publication_status: str | None = None,
388
+ valid_year: str | None = None,
389
+ title_search: str | None = None,
390
+ subject_search: str | None = None,
391
+ ) -> list[str]:
392
+ """
393
+ Download standard sets for a jurisdiction with optional filtering.
394
+
395
+ Args:
396
+ jurisdiction_id: The jurisdiction GUID
397
+ force_refresh: If True, force refresh all downloads (ignores cache)
398
+ education_levels: List of grade levels to filter by
399
+ publication_status: Publication status to filter by
400
+ valid_year: Valid year string to filter by
401
+ title_search: Partial string match on title
402
+ subject_search: Partial string match on subject
403
+
404
+ Returns:
405
+ List of downloaded standard set IDs
406
+ """
407
+ # Get jurisdiction metadata
408
+ jurisdiction_data = get_jurisdiction_details(jurisdiction_id, force_refresh=False)
409
+ standard_sets = jurisdiction_data.standardSets
410
+
411
+ # Apply filters
412
+ filtered_sets = [
413
+ s
414
+ for s in standard_sets
415
+ if _filter_standard_set(
416
+ s,
417
+ education_levels=education_levels,
418
+ publication_status=publication_status,
419
+ valid_year=valid_year,
420
+ title_search=title_search,
421
+ subject_search=subject_search,
422
+ )
423
+ ]
424
+
425
+ # Download each filtered standard set
426
+ downloaded_ids = []
427
+ for standard_set in filtered_sets:
428
+ set_id = standard_set.id
429
+ try:
430
+ download_standard_set(set_id, force_refresh=force_refresh)
431
+ downloaded_ids.append(set_id)
432
+ except Exception as e:
433
+ logger.error(f"Failed to download standard set {set_id}: {e}")
434
+
435
+ return downloaded_ids
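Taken together, these functions form a small cache-aware download pipeline: list jurisdictions, pick one, then pull and cache its standard sets. A hedged usage sketch (the jurisdiction GUID is a placeholder, and `CSP_API_KEY` must be set in `.env` for any cache miss):

```python
from tools import api_client

# Served from data/raw/jurisdictions.json after the first call
states = api_client.get_jurisdictions(type_filter="state")
for j in states[:5]:
    print(j.id, j.title)

# Download every published math set for one jurisdiction (placeholder GUID)
downloaded = api_client.download_standard_sets_by_jurisdiction(
    "JURISDICTION_GUID",
    publication_status="Published",
    subject_search="math",
)
print(f"Downloaded {len(downloaded)} standard set(s)")
```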
tools/cli.py ADDED
@@ -0,0 +1,752 @@
1
+ """CLI entry point for EduMatch Data Management."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import sys
6
+ from pathlib import Path
7
+
8
+ # Add project root to Python path
9
+ project_root = Path(__file__).parent.parent
10
+ if str(project_root) not in sys.path:
11
+ sys.path.insert(0, str(project_root))
12
+
13
+ import typer
14
+ from loguru import logger
15
+ from rich.console import Console
16
+ from rich.table import Table
17
+
18
+ from tools import api_client, data_manager
19
+ from tools.config import get_settings
20
+ from tools.pinecone_processor import process_and_save
21
+
22
+ settings = get_settings()
23
+
24
+ # Configure logger
25
+ logger.remove() # Remove default handler
26
+ logger.add(
27
+ sys.stderr,
28
+ format="<green>{time:HH:mm:ss}</green> | <level>{level: <8}</level> | <level>{message}</level>",
29
+ )
30
+ logger.add(
31
+ settings.log_file,
32
+ rotation=settings.log_rotation,
33
+ retention=settings.log_retention,
34
+ format="{time} | {level} | {message}",
35
+ )
36
+
37
+ app = typer.Typer(help="Common Core MCP CLI - Manage educational standards data")
38
+ console = Console()
39
+
40
+
41
+ @app.command()
42
+ def jurisdictions(
43
+ search: str = typer.Option(
44
+ None,
45
+ "--search",
46
+ "-s",
47
+ help="Filter by jurisdiction name (case-insensitive partial match)",
48
+ ),
49
+ type: str = typer.Option(
50
+ None,
51
+ "--type",
52
+ "-t",
53
+ help="Filter by jurisdiction type: school, organization, state, or nation",
54
+ ),
55
+ force: bool = typer.Option(
56
+ False, "--force", "-f", help="Force refresh from API, ignoring local cache"
57
+ ),
58
+ ):
59
+ """
60
+ List all available jurisdictions (states/organizations).
61
+
62
+ By default, jurisdictions are loaded from local cache (data/raw/jurisdictions.json)
63
+ to avoid repeated API calls. Use --force to fetch fresh data from the API and update
64
+ the cache. The cache is automatically created on first use.
65
+
66
+ Filters can be combined: use --search to filter by name and --type to filter by type.
67
+ """
68
+ try:
69
+ if force:
70
+ console.print("[yellow]Forcing refresh from API...[/yellow]")
71
+
72
+ # Validate type filter if provided
73
+ if type:
74
+ valid_types = {"school", "organization", "state", "nation"}
75
+ if type.lower() not in valid_types:
76
+ console.print(
77
+ f"[red]Error: Invalid type '{type}'. Must be one of: {', '.join(sorted(valid_types))}[/red]"
78
+ )
79
+ raise typer.Exit(code=1)
80
+
81
+ results = api_client.get_jurisdictions(
82
+ search_term=search, type_filter=type, force_refresh=force
83
+ )
84
+
85
+ table = Table("ID", "Title", "Type", title="Jurisdictions")
86
+ for j in results:
87
+ table.add_row(j.id, j.title, j.type)
88
+
89
+ console.print(table)
90
+ console.print(f"\n[green]Found {len(results)} jurisdictions[/green]")
91
+
92
+ if not force:
93
+ console.print("[dim]Tip: Use --force to refresh from API[/dim]")
94
+
95
+ except Exception as e:
96
+ console.print(f"[red]Error: {e}[/red]")
97
+ logger.exception("Failed to fetch jurisdictions")
98
+ raise typer.Exit(code=1)
99
+
100
+
101
+ @app.command()
102
+ def jurisdiction_details(
103
+ jurisdiction_id: str = typer.Argument(..., help="Jurisdiction ID"),
104
+ force: bool = typer.Option(
105
+ False, "--force", "-f", help="Force refresh from API, ignoring local cache"
106
+ ),
107
+ ):
108
+ """
109
+ Download and display jurisdiction metadata including standard set references.
110
+
111
+ By default, jurisdiction metadata is loaded from local cache (data/raw/jurisdictions/{id}/data.json)
112
+ to avoid repeated API calls. Use --force to fetch fresh data from the API and update the cache.
113
+ The cache is automatically created on first use.
114
+
115
+ Note: This command downloads metadata about standard sets (IDs, titles, subjects) but NOT
116
+ the full standard set content. Use the 'download' command to get full standard set data.
117
+ """
118
+ try:
119
+ if force:
120
+ console.print("[yellow]Forcing refresh from API...[/yellow]")
121
+
122
+ jurisdiction_data = api_client.get_jurisdiction_details(
123
+ jurisdiction_id, force_refresh=force
124
+ )
125
+
126
+ # Display jurisdiction info
127
+ console.print(f"\n[bold]Jurisdiction:[/bold] {jurisdiction_data.title}")
128
+ console.print(f"[bold]Type:[/bold] {jurisdiction_data.type}")
129
+ console.print(f"[bold]ID:[/bold] {jurisdiction_data.id}")
130
+
131
+ # Display standard sets
132
+ standard_sets = jurisdiction_data.standardSets
133
+ if standard_sets:
134
+ table = Table(
135
+ "Set ID", "Subject", "Title", "Grade Levels", title="Standard Sets"
136
+ )
137
+ for s in standard_sets:
138
+ grade_levels = ", ".join(s.educationLevels)
139
+ table.add_row(
140
+ s.id,
141
+ s.subject,
142
+ s.title,
143
+ grade_levels or "N/A",
144
+ )
145
+
146
+ console.print("\n")
147
+ console.print(table)
148
+ console.print(f"\n[green]Found {len(standard_sets)} standard sets[/green]")
149
+ else:
150
+ console.print("\n[yellow]No standard sets found[/yellow]")
151
+
152
+ if not force:
153
+ console.print("[dim]Tip: Use --force to refresh from API[/dim]")
154
+
155
+ except Exception as e:
156
+ console.print(f"[red]Error: {e}[/red]")
157
+ logger.exception("Failed to fetch jurisdiction details")
158
+ raise typer.Exit(code=1)
159
+
160
+
161
+ @app.command("download-sets")
162
+ def download_sets(
163
+ set_id: str = typer.Argument(None, help="Standard set ID (if downloading by ID)"),
164
+ jurisdiction: str = typer.Option(
165
+ None,
166
+ "--jurisdiction",
167
+ "-j",
168
+ help="Jurisdiction ID (if downloading by jurisdiction)",
169
+ ),
170
+ force: bool = typer.Option(
171
+ False, "--force", "-f", help="Force refresh from API, ignoring local cache"
172
+ ),
173
+ yes: bool = typer.Option(
174
+ False,
175
+ "--yes",
176
+ "-y",
177
+ help="Skip confirmation prompt when downloading by jurisdiction",
178
+ ),
179
+ dry_run: bool = typer.Option(
180
+ False,
181
+ "--dry-run",
182
+ help="Show what would be downloaded without actually downloading",
183
+ ),
184
+ education_levels: str = typer.Option(
185
+ None,
186
+ "--education-levels",
187
+ help="Comma-separated grade levels (e.g., '03,04,05')",
188
+ ),
189
+ publication_status: str = typer.Option(
190
+ None,
191
+ "--publication-status",
192
+ help="Publication status filter (e.g., 'Published', 'Deprecated')",
193
+ ),
194
+ valid_year: str = typer.Option(
195
+ None, "--valid-year", help="Valid year filter (e.g., '2012')"
196
+ ),
197
+ title: str = typer.Option(
198
+ None, "--title", help="Partial title match (case-insensitive)"
199
+ ),
200
+ subject: str = typer.Option(
201
+ None, "--subject", help="Partial subject match (case-insensitive)"
202
+ ),
203
+ ):
204
+ """
205
+ Download standard sets either by ID or by jurisdiction with filtering.
206
+
207
+ When downloading by jurisdiction, filters can be applied and all filters combine with AND logic.
208
+ A confirmation prompt will be shown listing all standard sets that will be downloaded.
209
+
210
+ Use --dry-run to preview what would be downloaded without actually downloading anything.
211
+ """
212
+ try:
213
+ # Validate arguments
214
+ if not set_id and not jurisdiction:
215
+ console.print(
216
+ "[red]Error: Must provide either set_id or --jurisdiction[/red]"
217
+ )
218
+ raise typer.Exit(code=1)
219
+
220
+ if set_id and jurisdiction:
221
+ console.print(
222
+ "[red]Error: Cannot specify both set_id and --jurisdiction[/red]"
223
+ )
224
+ raise typer.Exit(code=1)
225
+
226
+ # Download by ID
227
+ if set_id:
228
+ if dry_run:
229
+ console.print(
230
+ f"[yellow][DRY RUN] Would download standard set: {set_id}[/yellow]"
231
+ )
232
+ cache_path = Path("data/raw/standardSets") / set_id / "data.json"
233
+ console.print(f" Would cache to: {cache_path}")
234
+ return
235
+
236
+ with console.status(f"[bold blue]Downloading standard set {set_id}..."):
237
+ api_client.download_standard_set(set_id, force_refresh=force)
238
+
239
+ cache_path = Path("data/raw/standardSets") / set_id / "data.json"
240
+ console.print("[green]✓ Successfully downloaded standard set[/green]")
241
+ console.print(f" Cached to: {cache_path}")
242
+
243
+ # Process the downloaded set
244
+ try:
245
+ with console.status(f"[bold blue]Processing standard set {set_id}..."):
246
+ processed_path = process_and_save(set_id)
247
+ console.print("[green]✓ Successfully processed standard set[/green]")
248
+ console.print(f" Processed to: {processed_path}")
249
+ except FileNotFoundError:
250
+ console.print(
251
+ "[yellow]Warning: data.json not found, skipping processing[/yellow]"
252
+ )
253
+ except Exception as e:
254
+ console.print(
255
+ f"[yellow]Warning: Failed to process standard set: {e}[/yellow]"
256
+ )
257
+ logger.exception(f"Failed to process standard set {set_id}")
258
+
259
+ return
260
+
261
+ # Download by jurisdiction
262
+ if jurisdiction:
263
+ # Parse education levels
264
+ education_levels_list = None
265
+ if education_levels:
266
+ education_levels_list = [
267
+ level.strip() for level in education_levels.split(",")
268
+ ]
269
+
270
+ # Get jurisdiction metadata
271
+ jurisdiction_data = api_client.get_jurisdiction_details(
272
+ jurisdiction, force_refresh=False
273
+ )
274
+ all_sets = jurisdiction_data.standardSets
275
+
276
+ # Apply filters using the API client's filter function
277
+ from tools.api_client import _filter_standard_set
278
+
279
+ filtered_sets = [
280
+ s
281
+ for s in all_sets
282
+ if _filter_standard_set(
283
+ s,
284
+ education_levels=education_levels_list,
285
+ publication_status=publication_status,
286
+ valid_year=valid_year,
287
+ title_search=title,
288
+ subject_search=subject,
289
+ )
290
+ ]
291
+
292
+ if not filtered_sets:
293
+ console.print(
294
+ "[yellow]No standard sets match the provided filters.[/yellow]"
295
+ )
296
+ return
297
+
298
+ # Display filtered sets
299
+ if dry_run:
300
+ console.print(
301
+ f"\n[yellow][DRY RUN] Standard sets that would be downloaded ({len(filtered_sets)}):[/yellow]"
302
+ )
303
+ else:
304
+ console.print(
305
+ f"\n[bold]Standard sets to download ({len(filtered_sets)}):[/bold]"
306
+ )
307
+
308
+ table = Table(
309
+ "Set ID",
310
+ "Subject",
311
+ "Title",
312
+ "Grade Levels",
313
+ "Status",
314
+ "Year",
315
+ "Downloaded",
316
+ title="Standard Sets",
317
+ )
318
+ for s in filtered_sets:
319
+ display_id = s.id[:20] + "..." if len(s.id) > 20 else s.id
320
+ # Check if already downloaded
321
+ set_data_path = settings.standard_sets_dir / s.id / "data.json"
322
+ is_downloaded = set_data_path.exists()
323
+ downloaded_status = (
324
+ "[green]✓[/green]" if is_downloaded else "[yellow]✗[/yellow]"
325
+ )
326
+ table.add_row(
327
+ display_id,
328
+ s.subject,
329
+ s.title[:40],
330
+ ", ".join(s.educationLevels),
331
+ s.document.publicationStatus or "N/A",
332
+ s.document.valid,
333
+ downloaded_status,
334
+ )
335
+ console.print(table)
336
+
337
+ # If dry run, show summary and exit
338
+ if dry_run:
339
+ console.print(
340
+ f"\n[yellow][DRY RUN] Would download {len(filtered_sets)} standard set(s)[/yellow]"
341
+ )
342
+ console.print(
343
+ "[dim]Run without --dry-run to actually download these standard sets.[/dim]"
344
+ )
345
+ return
346
+
347
+ # Confirmation prompt
348
+ if not yes:
349
+ if not typer.confirm(
350
+ f"\nDownload {len(filtered_sets)} standard set(s)?"
351
+ ):
352
+ console.print("[yellow]Download cancelled.[/yellow]")
353
+ return
354
+
355
+ # Download each standard set
356
+ console.print(
357
+ f"\n[bold blue]Downloading {len(filtered_sets)} standard set(s)...[/bold blue]"
358
+ )
359
+ downloaded = 0
360
+ failed = 0
361
+
362
+ for i, standard_set in enumerate(filtered_sets, 1):
363
+ set_id = standard_set.id
364
+ try:
365
+ with console.status(
366
+ f"[bold blue][{i}/{len(filtered_sets)}] Downloading {set_id[:20]}..."
367
+ ):
368
+ api_client.download_standard_set(set_id, force_refresh=force)
369
+ downloaded += 1
370
+
371
+ # Process the downloaded set
372
+ try:
373
+ with console.status(
374
+ f"[bold blue][{i}/{len(filtered_sets)}] Processing {set_id[:20]}..."
375
+ ):
376
+ process_and_save(set_id)
377
+ except FileNotFoundError:
378
+ console.print(
379
+ f"[yellow]Warning: Skipping processing for {set_id[:20]}... (data.json not found)[/yellow]"
380
+ )
381
+ except Exception as e:
382
+ console.print(
383
+ f"[yellow]Warning: Failed to process {set_id[:20]}...: {e}[/yellow]"
384
+ )
385
+ logger.exception(f"Failed to process standard set {set_id}")
386
+
387
+ except Exception as e:
388
+ console.print(f"[red]✗ Failed to download {set_id}: {e}[/red]")
389
+ logger.exception(f"Failed to download standard set {set_id}")
390
+ failed += 1
391
+
392
+ # Summary
393
+ console.print(
394
+ f"\n[green]✓ Successfully downloaded {downloaded} standard set(s)[/green]"
395
+ )
396
+ if failed > 0:
397
+ console.print(
398
+ f"[red]✗ Failed to download {failed} standard set(s)[/red]"
399
+ )
400
+
401
+ except Exception as e:
402
+ console.print(f"[red]Error: {e}[/red]")
403
+ logger.exception("Failed to download standard sets")
404
+ raise typer.Exit(code=1)
405
+
406
+
407
+ @app.command("list")
408
+ def list_datasets():
409
+ """List all downloaded standard sets and their processing status."""
410
+ try:
411
+ datasets = data_manager.list_downloaded_standard_sets()
412
+
413
+ if not datasets:
414
+ console.print("[yellow]No standard sets downloaded yet.[/yellow]")
415
+ console.print("[dim]Use 'download-sets' to download standard sets.[/dim]")
416
+ return
417
+
418
+ # Check for processed.json files
419
+ for d in datasets:
420
+ set_dir = settings.standard_sets_dir / d.set_id
421
+ processed_file = set_dir / "processed.json"
422
+ d.processed = processed_file.exists()
423
+
424
+ # Count processed vs unprocessed
425
+ processed_count = sum(1 for d in datasets if d.processed)
426
+ unprocessed_count = len(datasets) - processed_count
427
+
428
+ table = Table(
429
+ "Set ID",
430
+ "Jurisdiction",
431
+ "Subject",
432
+ "Title",
433
+ "Grades",
434
+ "Status",
435
+ "Processed",
436
+ title="Downloaded Standard Sets",
437
+ )
438
+ for d in datasets:
439
+ # Truncate long set IDs
440
+ display_id = d.set_id[:25] + "..." if len(d.set_id) > 25 else d.set_id
441
+
442
+ table.add_row(
443
+ display_id,
444
+ d.jurisdiction,
445
+ d.subject[:30],
446
+ d.title[:30],
447
+ ", ".join(d.education_levels),
448
+ d.publication_status,
449
+ "[green]✓[/green]" if d.processed else "[yellow]✗[/yellow]",
450
+ )
451
+
452
+ console.print(table)
453
+ console.print("\n[bold]Summary:[/bold]")
454
+ console.print(f" Total: {len(datasets)} standard sets")
455
+ console.print(f" Processed: [green]{processed_count}[/green]")
456
+ console.print(f" Unprocessed: [yellow]{unprocessed_count}[/yellow]")
457
+
458
+ except Exception as e:
459
+ console.print(f"[red]Error: {e}[/red]")
460
+ logger.exception("Failed to list datasets")
461
+ raise typer.Exit(code=1)
462
+
463
+
464
+ @app.command("pinecone-init")
465
+ def pinecone_init():
466
+ """
467
+ Initialize Pinecone index.
468
+
469
+ Checks if the configured index exists and creates it if not.
470
+ Uses integrated embeddings with llama-text-embed-v2 model.
471
+ """
472
+ try:
473
+ from tools.pinecone_client import PineconeClient
474
+
475
+ console.print("[bold]Initializing Pinecone...[/bold]")
476
+
477
+ # Initialize Pinecone client (validates API key)
478
+ try:
479
+ client = PineconeClient()
480
+ except ValueError as e:
481
+ console.print(f"[red]Error: {e}[/red]")
482
+ raise typer.Exit(code=1)
483
+
484
+ console.print(f" Index name: [cyan]{client.index_name}[/cyan]")
485
+ console.print(f" Namespace: [cyan]{client.namespace}[/cyan]")
486
+
487
+ # Check and create index if needed
488
+ with console.status("[bold blue]Checking index status..."):
489
+ created = client.ensure_index_exists()
490
+
491
+ if created:
492
+ console.print(
493
+ f"\n[green]Successfully created index '{client.index_name}'[/green]"
494
+ )
495
+ console.print("[dim]Index configuration:[/dim]")
496
+ console.print(" Cloud: aws")
497
+ console.print(" Region: us-east-1")
498
+ console.print(" Embedding model: llama-text-embed-v2")
499
+ console.print(" Field map: text -> content")
500
+ else:
501
+ console.print(
502
+ f"\n[green]Index '{client.index_name}' already exists[/green]"
503
+ )
504
+
505
+ # Show index stats
506
+ with console.status("[bold blue]Fetching index stats..."):
507
+ stats = client.get_index_stats()
508
+
509
+ console.print("\n[bold]Index Statistics:[/bold]")
510
+ console.print(f" Total vectors: [cyan]{stats['total_vector_count']}[/cyan]")
511
+
512
+ namespaces = stats.get("namespaces", {})
513
+ if namespaces:
514
+ console.print(f" Namespaces: [cyan]{len(namespaces)}[/cyan]")
515
+ table = Table("Namespace", "Vector Count", title="Namespace Details")
516
+ for ns_name, ns_info in namespaces.items():
517
+ vector_count = getattr(ns_info, "vector_count", 0)
518
+ table.add_row(ns_name or "(default)", str(vector_count))
519
+ console.print(table)
520
+ else:
521
+ console.print(" Namespaces: [yellow]None (empty index)[/yellow]")
522
+
523
+ except Exception as e:
524
+ console.print(f"[red]Error: {e}[/red]")
525
+ logger.exception("Failed to initialize Pinecone")
526
+ raise typer.Exit(code=1)
527
+
528
+
529
+ @app.command("pinecone-upload")
530
+ def pinecone_upload(
531
+ set_id: str = typer.Option(
532
+ None, "--set-id", help="Upload a specific standard set by ID"
533
+ ),
534
+ all: bool = typer.Option(
535
+ False, "--all", help="Upload all downloaded standard sets with processed.json"
536
+ ),
537
+ force: bool = typer.Option(
538
+ False,
539
+ "--force",
540
+ help="Re-upload even if .pinecone_uploaded marker exists",
541
+ ),
542
+ dry_run: bool = typer.Option(
543
+ False,
544
+ "--dry-run",
545
+ help="Show what would be uploaded without actually uploading",
546
+ ),
547
+ batch_size: int = typer.Option(
548
+ 96, "--batch-size", help="Number of records per batch (default: 96)"
549
+ ),
550
+ ):
551
+ """
552
+ Upload processed standard sets to Pinecone.
553
+
554
+ Use --set-id to upload a specific set, or --all to upload all sets with processed.json.
555
+ If neither is provided, you'll be prompted to confirm uploading all sets.
556
+ """
557
+ try:
558
+ from tools.pinecone_client import PineconeClient
559
+ from tools.pinecone_models import ProcessedStandardSet
560
+ import json
561
+
562
+ # Initialize Pinecone client
563
+ try:
564
+ client = PineconeClient()
565
+ except ValueError as e:
566
+ console.print(f"[red]Error: {e}[/red]")
567
+ raise typer.Exit(code=1)
568
+
569
+ # Validate index exists
570
+ try:
571
+ client.validate_index()
572
+ except ValueError as e:
573
+ console.print(f"[red]Error: {e}[/red]")
574
+ raise typer.Exit(code=1)
575
+
576
+ # Discover standard sets with processed.json
577
+ standard_sets_dir = settings.standard_sets_dir
578
+ if not standard_sets_dir.exists():
579
+ console.print("[yellow]No standard sets directory found.[/yellow]")
580
+ console.print(
581
+ "[dim]Use 'download-sets' to download standard sets first.[/dim]"
582
+ )
583
+ return
584
+
585
+ # Find all sets with processed.json
586
+ sets_to_upload = []
587
+ for set_dir in standard_sets_dir.iterdir():
588
+ if not set_dir.is_dir():
589
+ continue
590
+
591
+ processed_file = set_dir / "processed.json"
592
+ if not processed_file.exists():
593
+ continue
594
+
595
+ set_id_from_dir = set_dir.name
596
+
597
+ # Check if already uploaded (unless --force)
598
+ # Mark all sets during discovery; filtering by --set-id happens later
599
+ if not force and PineconeClient.is_uploaded(set_dir):
600
+ sets_to_upload.append(
601
+ (set_id_from_dir, set_dir, True)
602
+ ) # True = already uploaded
603
+ else:
604
+ sets_to_upload.append(
605
+ (set_id_from_dir, set_dir, False)
606
+ ) # False = needs upload
607
+
608
+ if not sets_to_upload:
609
+ console.print(
610
+ "[yellow]No standard sets with processed.json found.[/yellow]"
611
+ )
612
+ console.print(
613
+ "[dim]Use 'download-sets' to download and process standard sets first.[/dim]"
614
+ )
615
+ return
616
+
617
+ # Filter by --set-id if provided
618
+ if set_id:
619
+ sets_to_upload = [
620
+ (sid, sdir, skipped)
621
+ for sid, sdir, skipped in sets_to_upload
622
+ if sid == set_id
623
+ ]
624
+ if not sets_to_upload:
625
+ console.print(
626
+ f"[yellow]Standard set '{set_id}' not found or has no processed.json.[/yellow]"
627
+ )
628
+ return
629
+
630
+ # If neither --set-id nor --all provided, prompt for confirmation
631
+ if not set_id and not all:
632
+ console.print(
633
+ f"\n[bold]Found {len(sets_to_upload)} standard set(s) with processed.json:[/bold]"
634
+ )
635
+ table = Table("Set ID", "Status", title="Standard Sets")
636
+ for sid, sdir, skipped in sets_to_upload:
637
+ status = (
638
+ "[yellow]Already uploaded[/yellow]"
639
+ if skipped
640
+ else "[green]Ready[/green]"
641
+ )
642
+ table.add_row(sid, status)
643
+ console.print(table)
644
+
645
+ if not typer.confirm(
646
+ f"\nUpload {len(sets_to_upload)} standard set(s) to Pinecone?"
647
+ ):
648
+ console.print("[yellow]Upload cancelled.[/yellow]")
649
+ return
650
+
651
+ # Show what would be uploaded (dry-run or preview)
652
+ if dry_run or not all:
653
+ console.print(
654
+ f"\n[bold]Standard sets to upload ({len(sets_to_upload)}):[/bold]"
655
+ )
656
+ table = Table("Set ID", "Records", "Status", title="Upload Preview")
657
+ for sid, sdir, skipped in sets_to_upload:
658
+ if skipped and not force:
659
+ table.add_row(
660
+ sid, "N/A", "[yellow]Skipped (already uploaded)[/yellow]"
661
+ )
662
+ continue
663
+
664
+ # Load processed.json to count records
665
+ try:
666
+ with open(sdir / "processed.json", encoding="utf-8") as f:
667
+ processed_data = json.load(f)
668
+ record_count = len(processed_data.get("records", []))
669
+ status = (
670
+ "[green]Ready[/green]"
671
+ if not dry_run
672
+ else "[yellow]Would upload[/yellow]"
673
+ )
674
+ table.add_row(sid, str(record_count), status)
675
+ except Exception as e:
676
+ table.add_row(sid, "Error", f"[red]Failed to read: {e}[/red]")
677
+ console.print(table)
678
+
679
+ if dry_run:
680
+ console.print(
681
+ f"\n[yellow][DRY RUN] Would upload {len([s for s in sets_to_upload if not s[2] or force])} standard set(s)[/yellow]"
682
+ )
683
+ console.print("[dim]Run without --dry-run to actually upload.[/dim]")
684
+ return
685
+
686
+ # Perform uploads
687
+ uploaded_count = 0
688
+ failed_count = 0
689
+ skipped_count = 0
690
+
691
+ for i, (sid, sdir, already_uploaded) in enumerate(sets_to_upload, 1):
692
+ if already_uploaded and not force:
693
+ skipped_count += 1
694
+ continue
695
+
696
+ try:
697
+ # Load processed.json
698
+ with open(sdir / "processed.json", encoding="utf-8") as f:
699
+ processed_data = json.load(f)
700
+
701
+ processed_set = ProcessedStandardSet(**processed_data)
702
+ records = processed_set.records
703
+
704
+ if not records:
705
+ console.print(
706
+ f"[yellow]Skipping {sid} (no records)[/yellow]"
707
+ )
708
+ skipped_count += 1
709
+ continue
710
+
711
+ # Upload records
712
+ with console.status(
713
+ f"[bold blue][{i}/{len(sets_to_upload)}] Uploading {sid} ({len(records)} records)"
714
+ ):
715
+ client.batch_upsert(records, batch_size=batch_size)
716
+
717
+ # Mark as uploaded
718
+ PineconeClient.mark_uploaded(sdir)
719
+ uploaded_count += 1
720
+ console.print(
721
+ f"[green]✓ [{i}/{len(sets_to_upload)}] Uploaded {sid} ({len(records)} records)[/green]"
722
+ )
723
+
724
+ except FileNotFoundError:
725
+ console.print(
726
+ f"[red]✗ [{i}/{len(sets_to_upload)}] Failed: {sid} (processed.json not found)[/red]"
727
+ )
728
+ logger.exception(f"Failed to upload standard set {sid}")
729
+ failed_count += 1
730
+ except Exception as e:
731
+ console.print(
732
+ f"[red]✗ [{i}/{len(sets_to_upload)}] Failed: {sid} ({e})[/red]"
733
+ )
734
+ logger.exception(f"Failed to upload standard set {sid}")
735
+ failed_count += 1
736
+
737
+ # Summary
738
+ console.print("\n[bold]Upload Summary:[/bold]")
739
+ console.print(f" Uploaded: [green]{uploaded_count}[/green]")
740
+ if skipped_count > 0:
741
+ console.print(f" Skipped: [yellow]{skipped_count}[/yellow]")
742
+ if failed_count > 0:
743
+ console.print(f" Failed: [red]{failed_count}[/red]")
744
+
745
+ except Exception as e:
746
+ console.print(f"[red]Error: {e}[/red]")
747
+ logger.exception("Failed to upload to Pinecone")
748
+ raise typer.Exit(code=1)
749
+
750
+
751
+ if __name__ == "__main__":
752
+ app()
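Because all commands are registered on a single Typer app, the CLI can also be exercised programmatically, which is handy for smoke tests. A sketch using `typer.testing.CliRunner` (the set ID is a placeholder; the `jurisdictions` call needs either a populated cache or `CSP_API_KEY`):

```python
from typer.testing import CliRunner

from tools.cli import app

runner = CliRunner()

# Roughly equivalent to: python tools/cli.py jurisdictions --type state
result = runner.invoke(app, ["jurisdictions", "--type", "state"])
print(result.output)

# Preview a single-set download without hitting the API or writing files
result = runner.invoke(app, ["download-sets", "PLACEHOLDER_SET_ID", "--dry-run"])
print(result.exit_code)
```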
tools/config.py ADDED
@@ -0,0 +1,65 @@
1
+ """Centralized configuration for the tools module."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from pathlib import Path
6
+
7
+ from pydantic_settings import BaseSettings, SettingsConfigDict
8
+
9
+
10
+ class ToolsSettings(BaseSettings):
11
+ """Configuration settings for the tools module."""
12
+
13
+ model_config = SettingsConfigDict(
14
+ env_file=".env",
15
+ env_file_encoding="utf-8",
16
+ case_sensitive=False,
17
+ )
18
+
19
+ # API Configuration
20
+ csp_api_key: str = ""
21
+ csp_base_url: str = "https://api.commonstandardsproject.com/api/v1"
22
+ max_requests_per_minute: int = 60
23
+
24
+ # Path Configuration
25
+ # These are computed properties based on project root
26
+ @property
27
+ def project_root(self) -> Path:
28
+ """Get the project root directory."""
29
+ return Path(__file__).parent.parent
30
+
31
+ @property
32
+ def raw_data_dir(self) -> Path:
33
+ """Get the raw data directory."""
34
+ return self.project_root / "data" / "raw"
35
+
36
+ @property
37
+ def standard_sets_dir(self) -> Path:
38
+ """Get the standard sets directory."""
39
+ return self.raw_data_dir / "standardSets"
40
+
41
+ @property
42
+ def processed_data_dir(self) -> Path:
43
+ """Get the processed data directory."""
44
+ return self.project_root / "data" / "processed"
45
+
46
+ # Logging Configuration
47
+ log_file: str = "data/cli.log"
48
+ log_rotation: str = "10 MB"
49
+ log_retention: str = "7 days"
50
+
51
+ # Pinecone Configuration
52
+ pinecone_api_key: str = ""
53
+ pinecone_index_name: str = "common-core-standards"
54
+ pinecone_namespace: str = "standards"
55
+
56
+
57
+ _settings: ToolsSettings | None = None
58
+
59
+
60
+ def get_settings() -> ToolsSettings:
61
+ """Get the singleton settings instance."""
62
+ global _settings
63
+ if _settings is None:
64
+ _settings = ToolsSettings()
65
+ return _settings
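Other modules obtain configuration exclusively through `get_settings()`, so a quick way to sanity-check the environment is to print the resolved values. The `.env` entries shown are examples only:

```python
# .env (example values, not real credentials)
# CSP_API_KEY=your-csp-key
# PINECONE_API_KEY=your-pinecone-key

from tools.config import get_settings

settings = get_settings()          # singleton; .env is read on first instantiation
print(settings.csp_base_url)       # https://api.commonstandardsproject.com/api/v1
print(settings.raw_data_dir)       # <project root>/data/raw
print(settings.standard_sets_dir)  # <project root>/data/raw/standardSets
```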
tools/data_manager.py ADDED
@@ -0,0 +1,81 @@
1
+ """Manages local data storage and metadata tracking."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import json
6
+ from dataclasses import dataclass
7
+
8
+ from loguru import logger
9
+
10
+ from tools.config import get_settings
11
+ from tools.models import StandardSetResponse
12
+
13
+ settings = get_settings()
14
+
15
+ # Data directories (from config)
16
+ RAW_DATA_DIR = settings.raw_data_dir
17
+ STANDARD_SETS_DIR = settings.standard_sets_dir
18
+ PROCESSED_DATA_DIR = settings.processed_data_dir
19
+
20
+
21
+ @dataclass
22
+ class StandardSetInfo:
23
+ """Information about a downloaded standard set with processing status."""
24
+
25
+ set_id: str
26
+ title: str
27
+ subject: str
28
+ education_levels: list[str]
29
+ jurisdiction: str
30
+ publication_status: str
31
+ valid_year: str
32
+ processed: bool
33
+
34
+
35
+ def list_downloaded_standard_sets() -> list[StandardSetInfo]:
36
+ """
37
+ List all downloaded standard sets from the standardSets directory.
38
+
39
+ Returns:
40
+ List of StandardSetInfo with standard set info and processing status
41
+ """
42
+ if not STANDARD_SETS_DIR.exists():
43
+ return []
44
+
45
+ datasets = []
46
+ for set_dir in STANDARD_SETS_DIR.iterdir():
47
+ if not set_dir.is_dir():
48
+ continue
49
+
50
+ data_file = set_dir / "data.json"
51
+ if not data_file.exists():
52
+ continue
53
+
54
+ try:
55
+ with open(data_file, encoding="utf-8") as f:
56
+ raw_data = json.load(f)
57
+
58
+ # Parse the API response wrapper
59
+ response = StandardSetResponse(**raw_data)
60
+ standard_set = response.data
61
+
62
+ # Build the dataset info
63
+ dataset_info = StandardSetInfo(
64
+ set_id=standard_set.id,
65
+ title=standard_set.title,
66
+ subject=standard_set.subject,
67
+ education_levels=standard_set.educationLevels,
68
+ jurisdiction=standard_set.jurisdiction.title,
69
+ publication_status=standard_set.document.publicationStatus or "Unknown",
70
+ valid_year=standard_set.document.valid,
71
+ processed=False, # TODO: Check against processed directory
72
+ )
73
+
74
+ datasets.append(dataset_info)
75
+
76
+ except Exception as e:  # covers JSONDecodeError, IOError, and model validation errors
77
+ logger.warning(f"Failed to read {data_file}: {e}")
78
+ continue
79
+
80
+ logger.debug(f"Found {len(datasets)} downloaded standard sets")
81
+ return datasets
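A short sketch of using this helper outside the CLI, for example to report which downloaded sets still need processing. Note that `list_downloaded_standard_sets()` itself leaves `processed` as `False`; checking for `processed.json` mirrors what the `list` command does:

```python
from tools.config import get_settings
from tools.data_manager import list_downloaded_standard_sets

settings = get_settings()

for info in list_downloaded_standard_sets():
    # Mirror the CLI: a set counts as processed once processed.json exists
    processed = (settings.standard_sets_dir / info.set_id / "processed.json").exists()
    print(f"{info.set_id}  {info.subject}  processed={processed}")
```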
tools/models.py ADDED
@@ -0,0 +1,129 @@
1
+ """Pydantic models for Common Standards Project API data structures."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from typing import Any, Optional
6
+
7
+ from pydantic import BaseModel, ConfigDict
8
+
9
+
10
+ class CSPBaseModel(BaseModel):
11
+ """Base model for all CSP API models with extra fields allowed."""
12
+
13
+ model_config = ConfigDict(extra="allow")
14
+
15
+
16
+ # ============================================================================
17
+ # Jurisdiction List Models
18
+ # ============================================================================
19
+
20
+
21
+ class Jurisdiction(CSPBaseModel):
22
+ """Basic jurisdiction information from the jurisdictions list endpoint."""
23
+
24
+ id: str
25
+ title: str
26
+ type: str # "school", "organization", "state", "nation"
27
+
28
+
29
+ class JurisdictionsResponse(CSPBaseModel):
30
+ """API response wrapper for jurisdictions list."""
31
+
32
+ data: list[Jurisdiction]
33
+
34
+
35
+ # ============================================================================
36
+ # Jurisdiction Details Models
37
+ # ============================================================================
38
+
39
+
40
+ class Document(CSPBaseModel):
41
+ """Standard document metadata."""
42
+
43
+ id: Optional[str] = None
44
+ title: str
45
+ valid: Optional[str] = None # Year as string
46
+ sourceURL: Optional[str] = None
47
+ asnIdentifier: Optional[str] = None
48
+ publicationStatus: Optional[str] = None
49
+
50
+
51
+ class StandardSetReference(CSPBaseModel):
52
+ """Reference to a standard set (metadata only, not full content)."""
53
+
54
+ id: str
55
+ title: str
56
+ subject: str
57
+ educationLevels: list[str]
58
+ document: Document
59
+
60
+
61
+ class JurisdictionDetails(CSPBaseModel):
62
+ """Full jurisdiction details including standard set references."""
63
+
64
+ id: str
65
+ title: str
66
+ type: str # "school", "organization", "state", "nation"
67
+ standardSets: list[StandardSetReference]
68
+
69
+
70
+ class JurisdictionDetailsResponse(CSPBaseModel):
71
+ """API response wrapper for jurisdiction details."""
72
+
73
+ data: JurisdictionDetails
74
+
75
+
76
+ # ============================================================================
77
+ # Standard Set Models
78
+ # ============================================================================
79
+
80
+
81
+ class License(CSPBaseModel):
82
+ """License information for a standard set."""
83
+
84
+ title: str
85
+ URL: str
86
+ rightsHolder: str
87
+
88
+
89
+ class JurisdictionRef(CSPBaseModel):
90
+ """Simple jurisdiction reference within a standard set."""
91
+
92
+ id: str
93
+ title: str
94
+
95
+
96
+ class Standard(CSPBaseModel):
97
+ """Individual standard within a standard set."""
98
+
99
+ id: str
100
+ asnIdentifier: Optional[str] = None
101
+ position: int
102
+ depth: int
103
+ statementNotation: Optional[str] = None
104
+ description: str
105
+ ancestorIds: list[str]
106
+ parentId: Optional[str] = None
107
+ statementLabel: Optional[str] = None # e.g., "Standard", "Benchmark"
108
+ educationLevels: Optional[list[str]] = None
109
+
110
+
111
+ class StandardSet(CSPBaseModel):
112
+ """Full standard set data including all standards."""
113
+
114
+ id: str
115
+ title: str
116
+ subject: str
117
+ normalizedSubject: Optional[str] = None
118
+ educationLevels: list[str]
119
+ license: License
120
+ document: Document
121
+ jurisdiction: JurisdictionRef
122
+ standards: dict[str, Standard] # GUID -> Standard mapping
123
+ cspStatus: Optional[dict[str, Any]] = None
124
+
125
+
126
+ class StandardSetResponse(CSPBaseModel):
127
+ """API response wrapper for standard set."""
128
+
129
+ data: StandardSet
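Since the cached `data.json` files store the full API response wrapper, they can be loaded straight into these models. A minimal sketch (the set ID in the path is a placeholder):

```python
import json
from pathlib import Path

from tools.models import StandardSetResponse

data_file = Path("data/raw/standardSets/PLACEHOLDER_SET_ID/data.json")
with open(data_file, encoding="utf-8") as f:
    response = StandardSetResponse(**json.load(f))

standard_set = response.data
print(standard_set.title, len(standard_set.standards))

# Standards are keyed by GUID; ancestorIds and parentId refer back into this dict
first = next(iter(standard_set.standards.values()))
print(first.statementLabel, first.description[:80])
```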
tools/pinecone_client.py ADDED
@@ -0,0 +1,262 @@
1
+ """Pinecone client for uploading and managing standard records."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import time
6
+ from datetime import datetime, timezone
7
+ from pathlib import Path
8
+ from collections.abc import Callable
9
+ from typing import Any
10
+
11
+ from loguru import logger
12
+ from pinecone import Pinecone
13
+ from pinecone.exceptions import PineconeException
14
+
15
+ from tools.config import get_settings
16
+ from tools.pinecone_models import PineconeRecord
17
+
18
+ settings = get_settings()
19
+
20
+
21
+ class PineconeClient:
22
+ """Client for interacting with Pinecone index."""
23
+
24
+ def __init__(self) -> None:
25
+ """Initialize Pinecone SDK from config settings."""
26
+ api_key = settings.pinecone_api_key
27
+ if not api_key:
28
+ raise ValueError("PINECONE_API_KEY environment variable not set")
29
+
30
+ self.pc = Pinecone(api_key=api_key)
31
+ self.index_name = settings.pinecone_index_name
32
+ self.namespace = settings.pinecone_namespace
33
+ self._index = None
34
+
35
+ @property
36
+ def index(self):
37
+ """Get the index object, creating it if needed."""
38
+ if self._index is None:
39
+ self._index = self.pc.Index(self.index_name)
40
+ return self._index
41
+
42
+ def validate_index(self) -> None:
43
+ """
44
+ Check index exists with pc.has_index(), raise helpful error if not.
45
+
46
+ Raises:
47
+ ValueError: If index does not exist, with instructions to create it.
48
+ """
49
+ if not self.pc.has_index(name=self.index_name):
50
+ raise ValueError(
51
+ f"Index '{self.index_name}' not found. Create it with:\n"
52
+ f"pc index create -n {self.index_name} -m cosine -c aws -r us-east-1 "
53
+ f"--model llama-text-embed-v2 --field_map text=content"
54
+ )
55
+
56
+ def ensure_index_exists(self) -> bool:
57
+ """
58
+ Check if index exists, create it if not.
59
+
60
+ Creates the index with integrated embeddings using llama-text-embed-v2 model.
61
+
62
+ Returns:
63
+ True if index was created, False if it already existed.
64
+ """
65
+ if self.pc.has_index(name=self.index_name):
66
+ logger.info(f"Index '{self.index_name}' already exists")
67
+ return False
68
+
69
+ logger.info(f"Creating index '{self.index_name}' with integrated embeddings...")
70
+ self.pc.create_index_for_model(
71
+ name=self.index_name,
72
+ cloud="aws",
73
+ region="us-east-1",
74
+ embed={
75
+ "model": "llama-text-embed-v2",
76
+ "field_map": {"text": "content"},
77
+ },
78
+ )
79
+ logger.info(f"Successfully created index '{self.index_name}'")
80
+ return True
81
+
82
+ def get_index_stats(self) -> dict[str, Any]:
83
+ """
84
+ Get index statistics including vector count and namespaces.
85
+
86
+ Returns:
87
+ Dictionary with index stats including total_vector_count and namespaces.
88
+ """
89
+ stats = self.index.describe_index_stats()
90
+ return {
91
+ "total_vector_count": stats.total_vector_count,
92
+ "namespaces": dict(stats.namespaces) if stats.namespaces else {},
93
+ }
94
+
95
+ @staticmethod
96
+ def exponential_backoff_retry(
97
+ func: Callable[[], Any], max_retries: int = 5
98
+ ) -> Any:
99
+ """
100
+ Retry function with exponential backoff on 429/5xx, fail on 4xx.
101
+
102
+ Args:
103
+ func: Function to retry (should be a callable that takes no args)
104
+ max_retries: Maximum number of retry attempts
105
+
106
+ Returns:
107
+ Result of func()
108
+
109
+ Raises:
110
+ PineconeException: If retries exhausted or non-retryable error
111
+ """
112
+ for attempt in range(max_retries):
113
+ try:
114
+ return func()
115
+ except PineconeException as e:
116
+ status_code = getattr(e, "status", None)
117
+ # Only retry transient errors
118
+ if status_code and (status_code >= 500 or status_code == 429):
119
+ if attempt < max_retries - 1:
120
+ delay = min(2 ** attempt, 60) # Cap at 60s
121
+ logger.warning(
122
+ f"Retryable error (status {status_code}), "
123
+ f"retrying in {delay}s (attempt {attempt + 1}/{max_retries})"
124
+ )
125
+ time.sleep(delay)
126
+ else:
127
+ logger.error(
128
+ f"Max retries ({max_retries}) exceeded for retryable error"
129
+ )
130
+ raise
131
+ else:
132
+ # Don't retry client errors
133
+ logger.error(f"Non-retryable error (status {status_code}): {e}")
134
+ raise
135
+ except Exception as e:
136
+ # Non-Pinecone exceptions should not be retried
137
+ logger.error(f"Non-retryable exception: {e}")
138
+ raise
139
+
140
+ def batch_upsert(
141
+ self, records: list[PineconeRecord], batch_size: int = 96
142
+ ) -> None:
143
+ """
144
+ Upsert records in batches of specified size with rate limiting.
145
+
146
+ Args:
147
+ records: List of PineconeRecord objects to upsert
148
+ batch_size: Number of records per batch (default: 96)
149
+ """
150
+ if not records:
151
+ logger.info("No records to upsert")
152
+ return
153
+
154
+ total_batches = (len(records) + batch_size - 1) // batch_size
155
+ logger.info(
156
+ f"Upserting {len(records)} records in {total_batches} batch(es) "
157
+ f"(batch size: {batch_size})"
158
+ )
159
+
160
+ for i in range(0, len(records), batch_size):
161
+ batch = records[i : i + batch_size]
162
+ batch_num = (i // batch_size) + 1
163
+
164
+ # Convert PineconeRecord models to dict format for Pinecone
165
+ batch_dicts = [self._record_to_dict(record) for record in batch]
166
+
167
+ logger.debug(f"Upserting batch {batch_num}/{total_batches} ({len(batch)} records)")
168
+
169
+ # Retry with exponential backoff
170
+ self.exponential_backoff_retry(
171
+ lambda b=batch_dicts: self.index.upsert_records(
172
+ namespace=self.namespace, records=b
173
+ )
174
+ )
175
+
176
+ # Rate limiting between batches
177
+ if i + batch_size < len(records):
178
+ time.sleep(0.1)
179
+
180
+ logger.info(f"Successfully upserted {len(records)} records")
181
+
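+ # Usage sketch (illustrative): given a ProcessedStandardSet `processed_set`
+ # produced by tools/pinecone_processor.py (added later in this commit) and an
+ # assumed instance name `uploader`:
+ #
+ #     uploader.batch_upsert(processed_set.records, batch_size=96)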
182
+ @staticmethod
183
+ def _record_to_dict(record: PineconeRecord) -> dict[str, Any]:
184
+ """
185
+ Convert PineconeRecord model to dict format for Pinecone API.
186
+
187
+ Handles optional fields by omitting them if None. Pinecone doesn't accept
188
+ null values for metadata fields, so parent_id must be omitted entirely
189
+ when None (for root nodes).
190
+
191
+ Args:
192
+ record: PineconeRecord model instance
193
+
194
+ Returns:
195
+ Dictionary ready for Pinecone upsert_records
196
+ """
197
+ # Use by_alias=True to serialize 'id' as '_id' per model serialization_alias
198
+ record_dict = record.model_dump(exclude_none=False, by_alias=True)
199
+
200
+ # Remove None values for optional fields
201
+ optional_fields = {
202
+ "asn_identifier",
203
+ "statement_notation",
204
+ "statement_label",
205
+ "normalized_subject",
206
+ "publication_status",
207
+ "parent_id", # Must be omitted when None (Pinecone doesn't accept null)
208
+ }
209
+ for field in optional_fields:
210
+ if record_dict.get(field) is None:
211
+ record_dict.pop(field, None)
212
+
213
+ return record_dict
214
+
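+ # Example (illustrative): for a root-node record (parent_id=None) the dict
+ # keeps the "_id" alias and contains no "parent_id" key at all, which is the
+ # shape upsert_records expects; `uploader` and `root_record` are assumed names:
+ #
+ #     d = uploader._record_to_dict(root_record)
+ #     assert "_id" in d and "parent_id" not in d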
215
+ @staticmethod
216
+ def is_uploaded(set_dir: Path) -> bool:
217
+ """
218
+ Check for .pinecone_uploaded marker file.
219
+
220
+ Args:
221
+ set_dir: Path to standard set directory
222
+
223
+ Returns:
224
+ True if marker file exists, False otherwise
225
+ """
226
+ marker_file = set_dir / ".pinecone_uploaded"
227
+ return marker_file.exists()
228
+
229
+ @staticmethod
230
+ def mark_uploaded(set_dir: Path) -> None:
231
+ """
232
+ Create marker file with ISO 8601 timestamp.
233
+
234
+ Args:
235
+ set_dir: Path to standard set directory
236
+ """
237
+ marker_file = set_dir / ".pinecone_uploaded"
238
+ timestamp = datetime.now(timezone.utc).isoformat()
239
+ marker_file.write_text(timestamp, encoding="utf-8")
240
+ logger.debug(f"Created upload marker: {marker_file}")
241
+
242
+ @staticmethod
243
+ def get_upload_timestamp(set_dir: Path) -> str | None:
244
+ """
245
+ Read timestamp from marker file.
246
+
247
+ Args:
248
+ set_dir: Path to standard set directory
249
+
250
+ Returns:
251
+ ISO 8601 timestamp string if marker exists, None otherwise
252
+ """
253
+ marker_file = set_dir / ".pinecone_uploaded"
254
+ if not marker_file.exists():
255
+ return None
256
+
257
+ try:
258
+ return marker_file.read_text(encoding="utf-8").strip()
259
+ except Exception as e:
260
+ logger.warning(f"Failed to read upload marker {marker_file}: {e}")
261
+ return None
262
+
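+ # Usage sketch (illustrative): the marker helpers make uploads idempotent per
+ # standard-set directory; `uploader`, `set_dir`, and `records` are assumed names:
+ #
+ #     if not uploader.is_uploaded(set_dir):
+ #         uploader.batch_upsert(records)
+ #         uploader.mark_uploaded(set_dir)
+ #     print(uploader.get_upload_timestamp(set_dir))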
tools/pinecone_models.py ADDED
@@ -0,0 +1,95 @@
1
+ """Pydantic models for Pinecone-processed standard records."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from typing import Any
6
+
7
+ from pydantic import BaseModel, ConfigDict, Field, field_validator
8
+
9
+
10
+ class PineconeRecord(BaseModel):
11
+ """A single standard record ready for Pinecone upsert."""
12
+
13
+ model_config = ConfigDict(
+ # Allow population by the field name ("id") as well as by the "_id" alias
+ populate_by_name=True,
+ )
21
+
22
+ # Core identifier - use alias to serialize as _id
23
+ id: str = Field(alias="_id", serialization_alias="_id")
24
+
25
+ # Content for embedding
26
+ content: str
27
+
28
+ # Standard Set Context
29
+ standard_set_id: str
30
+ standard_set_title: str
31
+ subject: str
32
+ normalized_subject: str | None = None
33
+ education_levels: list[str]
34
+ document_id: str
35
+ document_valid: str
36
+ publication_status: str | None = None
37
+ jurisdiction_id: str
38
+ jurisdiction_title: str
39
+
40
+ # Standard Identity & Position
41
+ asn_identifier: str | None = None
42
+ statement_notation: str | None = None
43
+ statement_label: str | None = None
44
+ depth: int
45
+ is_leaf: bool
46
+ is_root: bool
47
+
48
+ # Hierarchy Relationships
49
+ parent_id: str | None = None # null for root nodes
50
+ root_id: str
51
+ ancestor_ids: list[str]
52
+ child_ids: list[str]
53
+ sibling_count: int
54
+
55
+ @field_validator("education_levels", mode="before")
56
+ @classmethod
57
+ def process_education_levels(cls, v: Any) -> list[str]:
58
+ """
59
+ Process education_levels: split comma-separated strings, flatten, dedupe.
60
+
61
+ Handles cases where source data has comma-separated values within array
62
+ elements (e.g., ["01,02"] instead of ["01", "02"]).
63
+
64
+ Args:
65
+ v: Input value (list[str] or list with comma-separated strings)
66
+
67
+ Returns:
68
+ Flattened, deduplicated list of grade level strings
69
+ """
70
+ if not isinstance(v, list):
71
+ return []
72
+
73
+ # Split comma-separated strings and flatten
74
+ flattened: list[str] = []
75
+ for item in v:
76
+ if isinstance(item, str):
77
+ # Split on commas and strip whitespace
78
+ split_items = [s.strip() for s in item.split(",") if s.strip()]
79
+ flattened.extend(split_items)
80
+
81
+ # Deduplicate while preserving order
82
+ seen: set[str] = set()
83
+ result: list[str] = []
84
+ for item in flattened:
85
+ if item not in seen:
86
+ seen.add(item)
87
+ result.append(item)
88
+
89
+ return result
90
+
91
+
92
+ class ProcessedStandardSet(BaseModel):
93
+ """Container for processed standard set records ready for Pinecone."""
94
+
95
+ records: list[PineconeRecord]
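+
+ # Usage sketch (all values illustrative): the validator splits comma-joined
+ # grade levels, and the "_id" alias only appears when dumping with by_alias=True.
+ #
+ #     record = PineconeRecord(
+ #         id="abc123",
+ #         content="Depth 0: Operations and Algebraic Thinking",
+ #         standard_set_id="set-1",
+ #         standard_set_title="Grade 1 Mathematics",
+ #         subject="Mathematics",
+ #         education_levels=["01,02"],  # normalized to ["01", "02"]
+ #         document_id="doc-1",
+ #         document_valid="2010",
+ #         jurisdiction_id="jur-1",
+ #         jurisdiction_title="Example Jurisdiction",
+ #         depth=0,
+ #         is_leaf=False,
+ #         is_root=True,
+ #         root_id="abc123",
+ #         ancestor_ids=[],
+ #         child_ids=["def456"],
+ #         sibling_count=0,
+ #     )
+ #     assert record.education_levels == ["01", "02"]
+ #     assert record.model_dump(by_alias=True)["_id"] == "abc123"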
tools/pinecone_processor.py ADDED
@@ -0,0 +1,375 @@
1
+ """Processor for transforming standard sets into Pinecone-ready format."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import json
6
+ from pathlib import Path
7
+ from typing import TYPE_CHECKING
8
+
9
+ from loguru import logger
10
+
11
+ from tools.config import get_settings
12
+ from tools.models import Standard, StandardSet, StandardSetResponse
13
+ from tools.pinecone_models import PineconeRecord, ProcessedStandardSet
14
+
15
+ if TYPE_CHECKING:
16
+ from collections.abc import Mapping
17
+
18
+ settings = get_settings()
19
+
20
+
21
+ class StandardSetProcessor:
22
+ """Processes standard sets into Pinecone-ready format."""
23
+
24
+ def __init__(self):
25
+ """Initialize the processor."""
26
+ self.id_to_standard: dict[str, dict] = {}
27
+ self.parent_to_children: dict[str | None, list[str]] = {}
28
+ self.leaf_nodes: set[str] = set()
29
+ self.root_nodes: set[str] = set()
30
+
31
+ def process_standard_set(self, standard_set: StandardSet) -> ProcessedStandardSet:
32
+ """
33
+ Process a standard set into Pinecone-ready records.
34
+
35
+ Args:
36
+ standard_set: The StandardSet model from the API
37
+
38
+ Returns:
39
+ ProcessedStandardSet with all records ready for Pinecone
40
+ """
41
+ # Build relationship maps from all standards
42
+ self._build_relationship_maps(standard_set.standards)
43
+
44
+ # Process each standard into a PineconeRecord
45
+ records = []
46
+ for standard in standard_set.standards.values():
47
+ record = self._transform_standard(standard, standard_set)
48
+ records.append(record)
49
+
50
+ return ProcessedStandardSet(records=records)
51
+
52
+ def _build_relationship_maps(self, standards: dict[str, Standard]) -> None:
53
+ """
54
+ Build helper data structures from all standards in the set.
55
+
56
+ Args:
57
+ standards: Dictionary mapping standard ID to Standard object
58
+ """
59
+ # Convert to dict format for easier manipulation
60
+ standards_dict = {
61
+ std_id: standard.model_dump() for std_id, standard in standards.items()
62
+ }
63
+
64
+ # Build ID-to-standard map
65
+ self.id_to_standard = self._build_id_to_standard_map(standards_dict)
66
+
67
+ # Build parent-to-children map (sorted by position)
68
+ self.parent_to_children = self._build_parent_to_children_map(standards_dict)
69
+
70
+ # Identify leaf nodes
71
+ self.leaf_nodes = self._identify_leaf_nodes(standards_dict)
72
+
73
+ # Identify root nodes
74
+ self.root_nodes = self._identify_root_nodes(standards_dict)
75
+
76
+ def _build_id_to_standard_map(
77
+ self, standards: dict[str, dict]
78
+ ) -> dict[str, dict]:
79
+ """Build map of id -> standard object."""
80
+ return {std_id: std for std_id, std in standards.items()}
81
+
82
+ def _build_parent_to_children_map(
83
+ self, standards: dict[str, dict]
84
+ ) -> dict[str | None, list[str]]:
85
+ """
86
+ Build map of parentId -> [child_ids], sorted by position ascending.
87
+
88
+ Args:
89
+ standards: Dictionary of standard ID to standard dict
90
+
91
+ Returns:
92
+ Dictionary mapping parent ID (or None for roots) to sorted list of child IDs
93
+ """
94
+ parent_map: dict[str | None, list[tuple[int, str]]] = {}
95
+
96
+ for std_id, std in standards.items():
97
+ parent_id = std.get("parentId")
98
+ position = std.get("position", 0)
99
+
100
+ if parent_id not in parent_map:
101
+ parent_map[parent_id] = []
102
+ parent_map[parent_id].append((position, std_id))
103
+
104
+ # Sort each list by position and extract just the IDs
105
+ result: dict[str | None, list[str]] = {}
106
+ for parent_id, children in parent_map.items():
107
+ sorted_children = sorted(children, key=lambda x: x[0])
108
+ result[parent_id] = [std_id for _, std_id in sorted_children]
109
+
110
+ return result
111
+
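+ # Worked example (illustrative IDs): three standards sharing parent "P" with
+ # positions 20, 10, 30 map to {"P": ["b", "a", "c"]}, because children are
+ # sorted by ascending position; root standards are grouped under the key None.
+ #
+ #     standards = {
+ #         "a": {"parentId": "P", "position": 20},
+ #         "b": {"parentId": "P", "position": 10},
+ #         "c": {"parentId": "P", "position": 30},
+ #     }
+ #     # _build_parent_to_children_map(standards) -> {"P": ["b", "a", "c"]}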
112
+ def _identify_leaf_nodes(self, standards: dict[str, dict]) -> set[str]:
113
+ """
114
+ Identify leaf nodes: standards whose ID does NOT appear as any standard's parentId.
115
+
116
+ Args:
117
+ standards: Dictionary of standard ID to standard dict
118
+
119
+ Returns:
120
+ Set of standard IDs that are leaf nodes
121
+ """
122
+ all_ids = set(standards.keys())
123
+ parent_ids = {std.get("parentId") for std in standards.values() if std.get("parentId") is not None}
124
+
125
+ # Leaf nodes are IDs that are NOT in parent_ids
126
+ return all_ids - parent_ids
127
+
128
+ def _identify_root_nodes(self, standards: dict[str, dict]) -> set[str]:
129
+ """
130
+ Identify root nodes: standards where parentId is null.
131
+
132
+ Args:
133
+ standards: Dictionary of standard ID to standard dict
134
+
135
+ Returns:
136
+ Set of standard IDs that are root nodes
137
+ """
138
+ return {
139
+ std_id
140
+ for std_id, std in standards.items()
141
+ if std.get("parentId") is None
142
+ }
143
+
144
+ def find_root_id(self, standard: dict, id_to_standard: dict[str, dict]) -> str:
145
+ """
146
+ Walk up the parent chain to find the root ancestor.
147
+
148
+ Args:
149
+ standard: The standard dict to find root for
150
+ id_to_standard: Map of ID to standard dict
151
+
152
+ Returns:
153
+ The root ancestor's ID
154
+ """
155
+ current = standard
156
+ visited = set() # Prevent infinite loops from bad data
157
+
158
+ while current.get("parentId") is not None:
159
+ parent_id = current["parentId"]
160
+ if parent_id in visited:
161
+ break # Circular reference protection
162
+ visited.add(parent_id)
163
+
164
+ if parent_id not in id_to_standard:
165
+ break # Parent not found, use current as root
166
+ current = id_to_standard[parent_id]
167
+
168
+ return current["id"]
169
+
170
+ def build_ordered_ancestors(
171
+ self, standard: dict, id_to_standard: dict[str, dict]
172
+ ) -> list[str]:
173
+ """
174
+ Build ancestor list ordered from root (index 0) to immediate parent (last index).
175
+
176
+ Args:
177
+ standard: The standard dict to build ancestors for
178
+ id_to_standard: Map of ID to standard dict
179
+
180
+ Returns:
181
+ List of ancestor IDs ordered root -> immediate parent
182
+ """
183
+ ancestors = []
184
+ current_id = standard.get("parentId")
185
+ visited = set()
186
+
187
+ while current_id is not None and current_id not in visited:
188
+ visited.add(current_id)
189
+ if current_id in id_to_standard:
190
+ ancestors.append(current_id)
191
+ current_id = id_to_standard[current_id].get("parentId")
192
+ else:
193
+ break
194
+
195
+ ancestors.reverse() # Now ordered root → immediate parent
196
+ return ancestors
197
+
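+ # Worked example (illustrative IDs): for the chain root -> cluster -> standard,
+ # the ancestors list is ordered root first, immediate parent last.
+ #
+ #     id_to_standard = {
+ #         "root": {"id": "root", "parentId": None},
+ #         "cluster": {"id": "cluster", "parentId": "root"},
+ #         "std": {"id": "std", "parentId": "cluster"},
+ #     }
+ #     # build_ordered_ancestors(id_to_standard["std"], id_to_standard) == ["root", "cluster"]
+ #     # find_root_id(id_to_standard["std"], id_to_standard) == "root"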
198
+ def _compute_sibling_count(self, standard: dict) -> int:
199
+ """
200
+ Count standards with same parent_id, excluding self.
201
+
202
+ Args:
203
+ standard: The standard dict
204
+
205
+ Returns:
206
+ Number of siblings (excluding self)
207
+ """
208
+ parent_id = standard.get("parentId")
209
+ if parent_id not in self.parent_to_children:
210
+ return 0
211
+
212
+ siblings = self.parent_to_children[parent_id]
213
+ # Exclude self from count
214
+ return len([s for s in siblings if s != standard["id"]])
215
+
216
+ def _build_content_text(self, standard: dict) -> str:
217
+ """
218
+ Generate content text block with full hierarchy.
219
+
220
+ Format: "Depth N (notation): description" for each ancestor and self.
221
+
222
+ Args:
223
+ standard: The standard dict
224
+
225
+ Returns:
226
+ Multi-line text block with full hierarchy
227
+ """
228
+ # Build ordered ancestor chain
229
+ ancestor_ids = self.build_ordered_ancestors(standard, self.id_to_standard)
230
+
231
+ # Build lines from root to current standard
232
+ lines = []
233
+
234
+ # Add ancestor lines
235
+ for ancestor_id in ancestor_ids:
236
+ ancestor = self.id_to_standard[ancestor_id]
237
+ depth = ancestor.get("depth", 0)
238
+ description = ancestor.get("description", "")
239
+ notation = ancestor.get("statementNotation")
240
+
241
+ if notation:
242
+ lines.append(f"Depth {depth} ({notation}): {description}")
243
+ else:
244
+ lines.append(f"Depth {depth}: {description}")
245
+
246
+ # Add current standard line
247
+ depth = standard.get("depth", 0)
248
+ description = standard.get("description", "")
249
+ notation = standard.get("statementNotation")
250
+
251
+ if notation:
252
+ lines.append(f"Depth {depth} ({notation}): {description}")
253
+ else:
254
+ lines.append(f"Depth {depth}: {description}")
255
+
256
+ return "\n".join(lines)
257
+
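+ # Example output (illustrative text): for a depth-2 standard nested under a
+ # domain and a cluster, the embedded content block looks like:
+ #
+ #     Depth 0 (1.OA): Operations and Algebraic Thinking
+ #     Depth 1 (1.OA.A): Represent and solve problems involving addition and subtraction.
+ #     Depth 2 (1.OA.A.1): Use addition and subtraction within 20 to solve word problems.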
258
+ def _transform_standard(
259
+ self, standard: Standard, standard_set: StandardSet
260
+ ) -> PineconeRecord:
261
+ """
262
+ Transform a single standard into a PineconeRecord.
263
+
264
+ Args:
265
+ standard: The Standard object to transform
266
+ standard_set: The parent StandardSet containing context
267
+
268
+ Returns:
269
+ PineconeRecord ready for Pinecone upsert
270
+ """
271
+ std_dict = standard.model_dump()
272
+
273
+ # Compute hierarchy relationships
274
+ is_root = std_dict.get("parentId") is None
275
+ root_id = (
276
+ std_dict["id"] if is_root else self.find_root_id(std_dict, self.id_to_standard)
277
+ )
278
+ ancestor_ids = self.build_ordered_ancestors(std_dict, self.id_to_standard)
279
+ child_ids = self.parent_to_children.get(std_dict["id"], [])
280
+ is_leaf = std_dict["id"] in self.leaf_nodes
281
+ sibling_count = self._compute_sibling_count(std_dict)
282
+
283
+ # Build content text
284
+ content = self._build_content_text(std_dict)
285
+
286
+ # Extract standard set context
287
+ parent_id = std_dict.get("parentId") # Keep as None if null
288
+
289
+ # Build record with all fields
290
+ # Note: Use "id" not "_id" - Pydantic handles serialization alias automatically
291
+ record_data = {
292
+ "id": std_dict["id"],
293
+ "content": content,
294
+ "standard_set_id": standard_set.id,
295
+ "standard_set_title": standard_set.title,
296
+ "subject": standard_set.subject,
297
+ "normalized_subject": standard_set.normalizedSubject, # Optional, can be None
298
+ "education_levels": standard_set.educationLevels,
299
+ "document_id": standard_set.document.id,
300
+ "document_valid": standard_set.document.valid,
301
+ "publication_status": standard_set.document.publicationStatus, # Optional, can be None
302
+ "jurisdiction_id": standard_set.jurisdiction.id,
303
+ "jurisdiction_title": standard_set.jurisdiction.title,
304
+ "depth": std_dict.get("depth", 0),
305
+ "is_leaf": is_leaf,
306
+ "is_root": is_root,
307
+ "parent_id": parent_id,
308
+ "root_id": root_id,
309
+ "ancestor_ids": ancestor_ids,
310
+ "child_ids": child_ids,
311
+ "sibling_count": sibling_count,
312
+ }
313
+
314
+ # Add optional fields only if present
315
+ if std_dict.get("asnIdentifier"):
316
+ record_data["asn_identifier"] = std_dict["asnIdentifier"]
317
+ if std_dict.get("statementNotation"):
318
+ record_data["statement_notation"] = std_dict["statementNotation"]
319
+ if std_dict.get("statementLabel"):
320
+ record_data["statement_label"] = std_dict["statementLabel"]
321
+
322
+ return PineconeRecord(**record_data)
323
+
324
+
325
+ def process_and_save(standard_set_id: str) -> Path:
326
+ """
327
+ Load data.json, process it, and save processed.json.
328
+
329
+ Args:
330
+ standard_set_id: The ID of the standard set to process
331
+
332
+ Returns:
333
+ Path to the saved processed.json file
334
+
335
+ Raises:
336
+ FileNotFoundError: If data.json doesn't exist
337
+ ValueError: If JSON is invalid
338
+ """
339
+ # Locate data.json
340
+ data_file = settings.standard_sets_dir / standard_set_id / "data.json"
341
+ if not data_file.exists():
342
+ logger.warning(f"data.json not found for set {standard_set_id}, skipping")
343
+ raise FileNotFoundError(f"data.json not found for set {standard_set_id}")
344
+
345
+ # Load and parse JSON
346
+ try:
347
+ with open(data_file, encoding="utf-8") as f:
348
+ raw_data = json.load(f)
349
+ except json.JSONDecodeError as e:
350
+ raise ValueError(f"Invalid JSON in {data_file}: {e}") from e
351
+
352
+ # Parse into Pydantic model
353
+ try:
354
+ response = StandardSetResponse(**raw_data)
355
+ standard_set = response.data
356
+ except Exception as e:
357
+ raise ValueError(f"Failed to parse standard set data: {e}") from e
358
+
359
+ # Process the standard set
360
+ processor = StandardSetProcessor()
361
+ processed_set = processor.process_standard_set(standard_set)
362
+
363
+ # Save processed.json
364
+ processed_file = settings.standard_sets_dir / standard_set_id / "processed.json"
365
+ processed_file.parent.mkdir(parents=True, exist_ok=True)
366
+
367
+ with open(processed_file, "w", encoding="utf-8") as f:
368
+ json.dump(processed_set.model_dump(mode="json"), f, indent=2)
369
+
370
+ logger.info(
371
+ f"Processed {standard_set_id}: {len(processed_set.records)} records"
372
+ )
373
+
374
+ return processed_file
375
+
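+
+ # End-to-end sketch (illustrative): process a downloaded set, then upload the
+ # records with the uploader class shown earlier in this commit; `uploader` and
+ # the set ID are assumed names/values.
+ #
+ #     processed_file = process_and_save("ABC123")
+ #     with open(processed_file, encoding="utf-8") as f:
+ #         processed = ProcessedStandardSet(**json.load(f))
+ #     uploader.batch_upsert(processed.records)
+ #     uploader.mark_uploaded(processed_file.parent)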