BRONZE LAYER — Raw Data Ingestion & ETL
Goal: Get all 16,852 historical records into unified JSON format

--------------------------------------------------------------------------------
STORY B-01: Build IRIS Solution Intake Excel Ingestion Pipeline
--------------------------------------------------------------------------------
Story Name: Build IRIS Solution Intake Excel Ingestion Pipeline
Epic: Bronze Layer - Data Ingestion
Story Points: 5
Sprint: Sprint 1
Priority: Critical
Assignee: [TBD - Backend Developer]
Labels: bronze, etl, iris, data-pipeline

Description:
As a data engineer, I need to build a Python script using openpyxl that reads the IRIS Solution Intake XLSX file (3,091 records, 124 columns) and outputs normalized JSON records so that downstream semantic processing can consume clean, standardized data.

The script must:
- Read IRIS_Solution_Intake_requests.xlsx
- Parse all 124 fields per record
- Handle missing/null values gracefully (empty string, not None)
- Extract requestor names from "Full Name (ID)" format
  e.g., "Johnny Hoogenboom (406508)" → name: "Johnny Hoogenboom", id: "406508"
- Normalize date fields to ISO 8601 format
  e.g., "2025-08-26 04:55:58" → "2025-08-26T04:55:58Z"
- Output each record as a JSON object
- Store output to Content Sphere Bronze bucket (partitioned: source=iris)

Key fields to extract:
- number (RITM number), cat_item, stage, state, approval
- request.requested_for, request.opened_by, due_date
- short_description, description, work_notes, comments
- business_service, service_offering, assignment_group
- opened_at, closed_at, made_sla, project_id

Acceptance Criteria:
AC-1: Script reads IRIS_Solution_Intake_requests.xlsx without errors
AC-2: Outputs exactly 3,091 JSON records
AC-3: All 124 fields mapped to output JSON
AC-4: Date fields converted to ISO 8601 (e.g., "2025-08-26T04:55:58Z")
AC-5: Names extracted from parenthetical format (name separate from ID)
AC-6: Null/missing values handled as empty strings
AC-7: Output JSON stored to Content Sphere Bronze bucket
AC-8: Script logs processing stats (records processed, errors, time taken)
AC-9: Unit tests cover date parsing, name extraction, null handling

Dependencies: None (first story in pipeline)
Test Data: IRIS_Solution_Intake_requests.xlsx (in project files)

--------------------------------------------------------------------------------
STORY B-02: Build Jira Epics Excel Ingestion Pipeline with Summary Parsing
--------------------------------------------------------------------------------
Story Name: Build Jira Epics Excel Ingestion Pipeline with Summary Parsing
Epic: Bronze Layer - Data Ingestion
Story Points: 8
Sprint: Sprint 1
Priority: Critical
Assignee: [TBD - Backend Developer]
Labels: bronze, etl, jira, epics, regex

Description:
As a data engineer, I need to build a Python script that reads the MLL_JIRA_Epics_extract.xlsx file (7,210 records, 476 columns) and parses the Epic Summary field using a regex pattern to extract structured site and request type information.
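A minimal sketch of such a parser, assuming the documented "{SITE}, {COUNTRY_CODE} - {Request Type} - {RITM_NUMBER}" Summary pattern (the group names mirror the target fields; the exact regex is an illustration, not the final implementation):

```python
import re

# Assumed pattern: "{SITE}, {COUNTRY_CODE} - {Request Type} - {RITM_NUMBER}"
SUMMARY_RE = re.compile(
    r"^(?P<site_name>[^,]+),\s*(?P<site_country_code>[A-Z]{2})"
    r"\s*-\s*(?P<request_type_parsed>.+?)"
    r"\s*-\s*(?P<linked_ritm>RITM\d+)\s*$"
)

def parse_summary(summary: str) -> dict:
    """Return the parsed fields, or empty strings when the pattern does not match."""
    keys = ("site_name", "site_country_code", "request_type_parsed", "linked_ritm")
    m = SUMMARY_RE.match(summary or "")
    if not m:
        # Non-matching records get empty parsed fields, no errors (per AC-7)
        return {k: "" for k in keys}
    return {k: m.group(k) for k in keys}
```

The non-greedy request-type group lets the final " - RITM…" segment anchor the match, so request types containing spaces parse cleanly.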
The Epic Summary follows this pattern:
  "{SITE}, {COUNTRY_CODE} - {Request Type} - {RITM_NUMBER}"
Example: "JACKSONVILLE, US - InfoLink Modules Implementation - RITM000023585344"

The regex must extract:
- site_name: "JACKSONVILLE"
- site_country_code: "US"
- request_type_parsed: "InfoLink Modules Implementation"
- linked_ritm: "RITM000023585344"

The script must also extract:
- site_id from labels (e.g., "Site_5619" → "5619")
- site_region from epic_region field
- site_tier from labels (e.g., "Tier_1")
- epic_components as array (e.g., ["ProjectDemand", "Network"])
- epic_labels as array
- epic_issue_links as structured objects

Acceptance Criteria:
AC-1: Script reads MLL_JIRA_Epics_extract.xlsx without errors
AC-2: Outputs exactly 7,210 JSON records
AC-3: Regex correctly parses Summary field for records matching pattern
AC-4: Extracted fields: site_name, site_country_code, request_type_parsed, linked_ritm
AC-5: site_id extracted from labels (Site_XXXX pattern)
AC-6: site_tier extracted from labels (Tier_X pattern)
AC-7: Records not matching Summary pattern have empty parsed fields (no errors)
AC-8: epic_components and epic_labels stored as JSON arrays
AC-9: Output stored to Content Sphere Bronze bucket (partitioned: source=jira_epics)
AC-10: Unit tests cover regex parsing with 10+ sample summaries

Dependencies: None (can run parallel with B-01)
Test Data: MLL_JIRA_Epics_extract.xlsx (in project files)

--------------------------------------------------------------------------------
STORY B-03: Build Jira User Stories Excel Ingestion Pipeline
--------------------------------------------------------------------------------
Story Name: Build Jira User Stories Excel Ingestion Pipeline
Epic: Bronze Layer - Data Ingestion
Story Points: 5
Sprint: Sprint 1
Priority: High
Assignee: [TBD - Backend Developer]
Labels: bronze, etl, jira, user-stories

Description:
As a data engineer, I need to build a Python script that reads the MLL_JIRA_Userstory_extracts.xlsx file (6,551 records, 407 columns) and links each story to its parent Epic via the Parent ID field so that stories can be nested under their Epics in the unified record.

Key fields to extract per story:
- story_key (Issue key, e.g., "ABFZ-97354")
- summary
- status (Completed, Open, In Progress, etc.)
- assignee
- story_points
- sprint (sprint name)
- description (full text)
- parent_id (links to Epic)
- acceptance_criteria (if present in description)

The output must be a lookup dictionary { parent_epic_id: [list of stories] } so the Unified Field Mapper (B-06) can attach stories to each Epic.

Acceptance Criteria:
AC-1: Script reads MLL_JIRA_Userstory_extracts.xlsx without errors
AC-2: Outputs exactly 6,551 story records
AC-3: Each story linked to parent Epic via parent_id
AC-4: Output includes: story_key, summary, status, assignee, story_points, sprint, description
AC-5: Lookup dictionary keyed by parent Epic ID produced
AC-6: Stories with no parent_id logged as warnings (not errors)
AC-7: Output stored to Content Sphere Bronze bucket (partitioned: source=jira_stories)
AC-8: Unit tests verify parent linkage with known Epic-Story pairs

Dependencies: None (can run parallel with B-01 and B-02)
Test Data: MLL_JIRA_Userstory_extracts.xlsx (in project files)

--------------------------------------------------------------------------------
STORY B-04: Build PDF Invoice OCR Extraction Pipeline
--------------------------------------------------------------------------------
Story Name: Build PDF Invoice OCR Extraction Pipeline
Epic: Bronze Layer - Data Ingestion
Story Points: 8
Sprint: Sprint 2
Priority: High
Assignee: [TBD - Backend Developer]
Labels: bronze, etl, ocr, pdf, invoices

Description:
As a data engineer, I need to build an OCR pipeline using Tesseract + PyMuPDF + pdfplumber that extracts structured cost data from ~2,200 PDF vendor invoices so that cost estimation features can use real historical pricing data.
The pipeline must extract from each invoice:
- vendor_name (e.g., "Crown Equipment Corporation")
- invoice_number (e.g., "INV-2024-AESR-0847")
- invoice_date
- po_number (Purchase Order)
- currency (USD, EUR, etc.)
- line_items: array of { description, quantity, unit_cost, total }
  e.g., { "InfoLink Terminal IT5000": qty=24, unit=$1,850, total=$44,400 }
- subtotal
- total_amount

The pipeline must handle:
- Multi-page invoices (concatenate pages before parsing)
- Scanned vs digital PDFs (OCR for scanned, text extraction for digital)
- Multiple table formats (vendors use different layouts)
- Currency symbols and number formatting

Acceptance Criteria:
AC-1: Pipeline processes PDF files from input directory
AC-2: Extracts vendor_name, invoice_number, invoice_date, po_number
AC-3: Extracts line items with description, quantity, unit_cost, total
AC-4: Handles multi-page invoices correctly
AC-5: Uses OCR (Tesseract) for scanned PDFs, text extraction for digital
AC-6: Output as structured JSON per invoice
AC-7: Error logging for invoices that fail extraction (with PDF filename)
AC-8: >85% successful extraction rate on test set of 100 invoices
AC-9: Output stored to Content Sphere Bronze bucket (partitioned: source=pdf_invoices)
AC-10: Processing time <2 minutes per invoice on average

Dependencies: None
Test Data: Sample PDF invoices (to be provided)

--------------------------------------------------------------------------------
STORY B-05: Build Confluence Knowledge Base Ingestion Pipeline
--------------------------------------------------------------------------------
Story Name: Build Confluence Knowledge Base Ingestion Pipeline
Epic: Bronze Layer - Data Ingestion
Story Points: 5
Sprint: Sprint 2
Priority: Medium
Assignee: [TBD - Backend Developer]
Labels: bronze, etl, confluence, knowledge-base

Description:
As a data engineer, I need to build a pipeline that extracts content from key Confluence pages used in MLL intake processes so that the Knowledge Agent has access to process documentation and guidelines.

Target Confluence pages:
- MLLF Intake Process (detailed submission instructions)
- MLL Network and Firewall (network team engagement process)
- Intake vs Incident decision guide
- TS Engage or IRIS routing guide
- User Story Checklist template

The pipeline must:
- Connect to Confluence REST API (or use exported HTML)
- Extract page title, body content (cleaned HTML → plain text)
- Preserve section structure (headings, lists, tables)
- Extract linked KB article references
- Store as JSON documents with page_id, title, content, last_updated

Acceptance Criteria:
AC-1: All 5 target Confluence pages extracted
AC-2: HTML cleaned to plain text with structure preserved
AC-3: Section headings maintained for chunking
AC-4: Tables converted to structured text
AC-5: Output JSON includes: page_id, title, content, sections[], last_updated
AC-6: Stored to Content Sphere Bronze bucket (partitioned: source=confluence)
AC-7: Handles Confluence API authentication

Dependencies: Confluence API access credentials
Test Data: Confluence page URLs (in project screenshots)

--------------------------------------------------------------------------------
STORY B-06: Build Unified Field Mapper (IRIS + Epic + Stories + PDF Merge)
--------------------------------------------------------------------------------
Story Name: Build Unified Field Mapper - Merge All Sources
Epic: Bronze Layer - Data Ingestion
Story Points: 8
Sprint: Sprint 2-3
Priority: Critical
Assignee: [TBD - Senior Backend Developer]
Labels: bronze, etl, unified-schema, merge, critical-path

Description:
As a data engineer, I need to create the Unified Field Mapper that merges data from all four sources (IRIS, Jira Epics, Jira Stories, PDF invoices) into a single unified JSON record per request, following the MLL_Unified_Ticket_Schema.json schema with 12 logical sections.

Merge logic:
1. Start with each IRIS RITM record as the base
2. Link to Jira Epic via work_notes field (contains Jira URL like "https://jira.jnj.com/browse/ABFZ-97353")
3. Attach User Stories to Epic via parent_id linkage
4. Attach PDF invoice data to Epic via RITM number match
5. For Epics without IRIS RITM, create record from Epic data only
6. Compute derived fields:
   - duration_days (closed_at - opened_at)
   - has_network_requirement (true if "Network" in components)
   - has_quotation (true if PDF invoice linked)
   - story_count (number of linked stories)
   - attachment_count (number of attachments)

The 12 schema sections:
1. iris_identity
2. iris_workflow
3. iris_people
4. epic_identity
5. epic_classification
6. epic_description
7. epic_scope
8. epic_timeline
9. user_stories
10. network_request
11. quotation
12. computed_fields

Output: ~16,852 unified JSON records (union of IRIS + Epics)

Acceptance Criteria:
AC-1: Merges IRIS + Epics + Stories + PDF data into single records
AC-2: IRIS-to-Epic linkage via work_notes Jira URL extraction
AC-3: Stories nested under their parent Epic
AC-4: PDF invoice data attached via RITM number
AC-5: 12-section schema structure maintained per record
AC-6: Computed fields calculated correctly (duration, story_count, etc.)
AC-7: ~16,852 unified records produced (IRIS + Epics union)
AC-8: Records with partial data (missing Epic or missing IRIS) handled gracefully
AC-9: Deduplication: no duplicate records for same RITM/Epic pair
AC-10: Output stored to Content Sphere Bronze bucket (partitioned: source=unified)
AC-11: Mapping report generated: fields mapped, fields dropped, merge stats

Dependencies: B-01, B-02, B-03, B-04 (all source ingestion complete)
Test Data: Output from B-01 through B-04

--------------------------------------------------------------------------------
STORY B-07: Bronze Layer Validation & Quality Checks
--------------------------------------------------------------------------------
Story Name: Bronze Layer Data Validation & Quality Report
Epic: Bronze Layer - Data Ingestion
Story Points: 3
Sprint: Sprint 3
Priority: High
Assignee: [TBD - Backend Developer]
Labels: bronze, validation, quality

Description:
As a data engineer, I need to build a validation script that verifies the quality and completeness of all Bronze layer data before it moves to the Silver layer for semantic processing.

Validation checks:
- Record count verification (3,091 + 7,210 + 6,551 expected)
- Required field completeness rates (description, site_name, etc.)
- Date format consistency (all ISO 8601)
- RITM-to-Epic linkage success rate
- Story-to-Epic linkage success rate
- PDF extraction success rate
- Duplicate detection (same RITM or Epic appearing twice)

Output: HTML quality report with pass/fail status and statistics.
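The completeness-rate check behind the pass/fail gate can be sketched as follows (the 95% threshold comes from the acceptance criteria; the record shape and function names are illustrative):

```python
def completeness_rate(records: list[dict], field: str) -> float:
    """Fraction of records where `field` is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field))
    return filled / len(records)

def passes_gate(records: list[dict], required: list[str],
                threshold: float = 0.95) -> bool:
    """Pass only if every required field meets the completeness threshold."""
    return all(completeness_rate(records, f) >= threshold for f in required)
```

Treating empty strings as missing (the falsy check in `r.get(field)`) matches the Bronze convention of defaulting nulls to "".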
Acceptance Criteria:
AC-1: Validates record counts per source
AC-2: Reports field completeness percentages
AC-3: Flags records with missing required fields
AC-4: Reports linkage success rates (RITM→Epic, Story→Epic)
AC-5: Identifies duplicate records
AC-6: Generates HTML quality report
AC-7: Returns pass/fail status (pass = >95% completeness)
AC-8: Report stored to Content Sphere Bronze bucket

Dependencies: B-06 (unified records available)

SILVER LAYER — Semantic Processing & Embedding
Goal: Convert unified JSON into searchable semantic vectors

--------------------------------------------------------------------------------
STORY S-01: Build Text Cleaning Module (clean_text)
--------------------------------------------------------------------------------
Story Name: Build Text Cleaning Module for Embedding Quality
Epic: Silver Layer - Semantic Processing
Story Points: 3
Sprint: Sprint 3
Priority: Critical
Assignee: [TBD - ML Engineer]
Labels: silver, semantic, text-cleaning, nlp

Description:
As an ML engineer, I need to implement the clean_text() function that removes noise from raw ticket text fields before they are converted to semantic documents. This directly impacts embedding quality — noisy text produces poor vectors that reduce RAG retrieval accuracy.

The function must handle these noise patterns:
- Jira markup: [~username], {panel:title=...}, {code:...}, {noformat}
- Confluence artifacts: _x000D_ carriage returns
- URLs → replace with [URL] token (URLs add noise to embeddings)
- Email addresses → replace with [EMAIL] token
- MAC addresses (e.g., 00:07:4d:a0:1b:cb) → replace with [MAC] token
- Multiple consecutive newlines → collapse to double newline
- Multiple spaces/tabs → collapse to single space
- Non-breaking spaces (U+00A0) → regular space
- Leading/trailing whitespace → strip

Important: Do NOT remove RITM numbers or Jira keys — they carry identity meaning for the embedding model.
Acceptance Criteria:
AC-1: Removes all Jira markup patterns ([~user], {panel}, {code}, {noformat})
AC-2: Removes _x000D_ artifacts
AC-3: URLs replaced with [URL] token
AC-4: Email addresses replaced with [EMAIL] token
AC-5: MAC addresses replaced with [MAC] token
AC-6: Whitespace normalized (no triple+ newlines, no double+ spaces)
AC-7: Non-breaking spaces converted to regular spaces
AC-8: RITM numbers and Jira keys preserved (NOT removed)
AC-9: Function returns empty string for None/empty input
AC-10: Unit tests with 15+ test cases covering all patterns

Dependencies: None (can start while Bronze completes)
Reference: mll_semantic_document_converter.py Section 3

--------------------------------------------------------------------------------
STORY S-02: Build Schema Normalization Module (normalize_schema)
--------------------------------------------------------------------------------
Story Name: Build Schema Normalization Module
Epic: Silver Layer - Semantic Processing
Story Points: 3
Sprint: Sprint 3
Priority: Critical
Assignee: [TBD - ML Engineer]
Labels: silver, semantic, normalization

Description:
As an ML engineer, I need to implement normalize_schema() that standardizes field formats from the unified JSON into a structure suitable for the semantic template engine.

Normalization tasks:
- Extract names: "Johnny Hoogenboom (406508)" → "Johnny Hoogenboom"
- Normalize dates: "2025-08-26 04:55:58" → "2025-08-26T04:55:58Z"
- Human-readable dates: "2025-08-26T04:55:58Z" → "August 26, 2025"
- Compute derived fields:
  * days_open = (today - opened_at).days
  * sla_sentence = "The request met its SLA target." or "missed"
  * nexus_id_sentence = "A Nexus ID is required..." or ""
- Default empty values for missing optional sections
- Structure classification sub-object (request_type, capability_center, routing_system, complexity_tier, confidence_score)

Acceptance Criteria:
AC-1: Names extracted from "Name (ID)" format correctly
AC-2: All dates normalized to ISO 8601
AC-3: Human-readable date strings generated (e.g., "August 26, 2025")
AC-4: days_open computed correctly
AC-5: sla_sentence generated based on made_sla field
AC-6: nexus_id_sentence generated based on capability_center
AC-7: Classification sub-object populated
AC-8: Missing optional fields defaulted (empty string, not None)
AC-9: Unit tests with 10+ test cases

Dependencies: B-06 (unified JSON schema must be defined)
Reference: mll_semantic_document_converter.py Section 2

--------------------------------------------------------------------------------
STORY S-03: Build Semantic Document Template Engine (THE CORE)
--------------------------------------------------------------------------------
Story Name: Build Semantic Document Template Engine
Epic: Silver Layer - Semantic Processing
Story Points: 8
Sprint: Sprint 4
Priority: Critical (MOST IMPORTANT STORY IN SILVER)
Assignee: [TBD - Senior ML Engineer]
Labels: silver, semantic, template-engine, critical-path

Description:
As an ML engineer, I need to implement the semantic template engine that converts structured JSON records into natural language documents across 8 section templates. This is the CORE of the entire pipeline — the quality of these semantic documents directly determines RAG retrieval accuracy.

WHY THIS MATTERS: Embedding models (like all-MiniLM-L6-v2) are trained on natural language text, NOT on database field names. The template engine bridges this gap.

BAD input for embedding (raw key-value):
  "site_name: JACKSONVILLE, site_country_code: US, site_id: 5619"
GOOD input for embedding (semantic text):
  "This request is for site JACKSONVILLE, US (Site ID: 5619). The site is in the NA region and is classified as Tier 1. The sector is MedTech."

The 8 templates to implement:
1. TEMPLATE_IDENTITY - RITM + Epic + Site identifiers
   "MLL Solution Intake request {ritm_number} linked to Jira Epic {epic_key} in ABFZ project..."
2. TEMPLATE_SUMMARY - Request type + capability + confidence
   "Request summary: {epic_summary}. Request type: {request_type_parsed}. Classified as {classification_request_type}..."
3. TEMPLATE_DESCRIPTION - Business problem in natural language
   "Business problem and description for {ritm_number} ({epic_key}): {epic_description_cleaned}"
4. TEMPLATE_SCOPE - Components, team, labels
   "Components: {epic_components_text}. Labels: {epic_labels_text}. Submitted by {opened_by} for {requested_for}..."
5. TEMPLATE_TIMELINE - Dates, status, SLA
   "Timeline for {ritm_number}: Opened {opened_at_human}, state {state}, approval {approval}..."
6. TEMPLATE_STORIES - Linked stories and sprints (OPTIONAL - omit if empty)
   "User stories for {epic_key}: Story 1: {story_key}..."
7. TEMPLATE_NETWORK - LAN/WAN/Firewall (OPTIONAL - omit if no network)
   "Network requirements for {ritm_number}: Network type: {type}..."
8. TEMPLATE_QUOTATION - Vendor pricing (OPTIONAL - omit if no invoice)
   "Quotation and cost estimate for {ritm_number}: Vendor: {vendor}..."

Design principles:
- Write as natural language paragraphs, never key:value pairs
- Group related fields into coherent sentences
- Use domain vocabulary (RITM, Epic, MLL, FLNEC, etc.)
- Omit empty/null sections — shorter docs embed better
- Lead with most important info (summary, site, type)

Two output modes:
A) Full document: All 8 sections concatenated with section headers
B) Chunked: Dict of {section_name: text} for per-section embedding

Acceptance Criteria:
AC-1: All 8 templates implemented
AC-2: Each template produces natural language paragraphs (not key:value)
AC-3: Empty/null sections omitted from output
AC-4: Full document mode: concatenates all sections with [SECTION] headers
AC-5: Chunked mode: returns dict of {section_name: text}
AC-6: Template variables populated from normalized record via build_template_variables()
AC-7: Output matches reference format in semantic_document_output.txt
AC-8: doc_hash (SHA-256) generated for deduplication
AC-9: Processing time <50ms per record
AC-10: Unit tests verify output for sample record (InfoLink Jacksonville)

Dependencies: S-01 (clean_text), S-02 (normalize_schema)
Reference: mll_semantic_document_converter.py Sections 4-6
Test Data: sample_input_json.json → expected: semantic_document_output.txt

--------------------------------------------------------------------------------
STORY S-04: Build Embedding Generation Pipeline
--------------------------------------------------------------------------------
Story Name: Build Embedding Generation Pipeline (all-MiniLM-L6-v2)
Epic: Silver Layer - Semantic Processing
Story Points: 5
Sprint: Sprint 4
Priority: Critical
Assignee: [TBD - ML Engineer]
Labels: silver, embedding, sentence-transformers, vectors

Description:
As an ML engineer, I need to implement the embedding pipeline using sentence-transformers (all-MiniLM-L6-v2) that generates 384-dimensional vectors for each semantic document and chunk.

The pipeline must support two modes:
A) Full-document embedding: entire semantic doc → single 384-dim vector
B) Chunked embedding: each of the 8 sections → separate 384-dim vector

All vectors must be L2-normalized for cosine similarity computation.
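A dependency-free sketch of the L2 normalization and the cosine check it enables (all-MiniLM-L6-v2 would produce the 384-dim inputs; in practice these operations run on NumPy arrays, not Python lists):

```python
import math

def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return vec  # degenerate zero vector; later flagged by Silver validation
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two already L2-normalized vectors (plain dot product)."""
    return sum(x * y for x, y in zip(a, b))
```

Normalizing once at generation time keeps query-time similarity a single dot product, which is what the HNSW cosine metric in the vector store relies on.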
Batch processing requirements:
- Process all ~16,852 records
- Batch size: 256 records per batch (GPU) or 64 (CPU)
- Show progress bar during processing
- Save intermediate results every 1,000 records (resume on failure)
- Total processing time target: <60 minutes on CPU

Output format per record:
{
  "ritm_number": "RITM000023587097",
  "full_embedding": [0.023, -0.041, ...],   // 384 floats
  "chunk_embeddings": {
    "identity_summary": [0.018, ...],       // 384 floats
    "classification": [0.031, ...],         // 384 floats
    ...
  }
}

Acceptance Criteria:
AC-1: all-MiniLM-L6-v2 model loads successfully
AC-2: Full document → single 384-dim vector per record
AC-3: Each chunk → separate 384-dim vector
AC-4: All vectors L2-normalized (unit length)
AC-5: Batch processing for all ~16,852 records
AC-6: Progress bar shows processing status
AC-7: Intermediate saves every 1,000 records
AC-8: Total processing time <60 min on CPU
AC-9: Output embeddings saved as NumPy arrays (.npy)
AC-10: Spot-check: cosine similarity between related tickets > 0.7

Dependencies: S-03 (semantic documents must be generated first)
Reference: mll_semantic_document_converter.py Section 9

--------------------------------------------------------------------------------
STORY S-05: Build Metadata Extraction Module
--------------------------------------------------------------------------------
Story Name: Build Metadata Extraction for Filtered Vector Retrieval
Epic: Silver Layer - Semantic Processing
Story Points: 3
Sprint: Sprint 4
Priority: High
Assignee: [TBD - ML Engineer]
Labels: silver, metadata, filtering

Description:
As an ML engineer, I need to implement extract_metadata() that pulls filterable dimensions from each normalized record. These metadata fields are stored ALONGSIDE vectors in the vector store and used for filtered retrieval (e.g., "find similar tickets only from the same region").
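The doc_hash computation and flag derivation can be sketched as follows (field names follow the story; the record shape and helper names are illustrative, not the final extract_metadata() signature):

```python
import hashlib

def doc_hash(semantic_doc: str) -> str:
    """SHA-256 of the full semantic document, truncated to the first 16 hex chars,
    used for deduplication."""
    return hashlib.sha256(semantic_doc.encode("utf-8")).hexdigest()[:16]

def extract_flags(record: dict) -> dict:
    """Boolean retrieval flags; missing optional sections default to falsy."""
    return {
        "has_network": bool(record.get("network_request")),
        "has_quotation": bool(record.get("quotation")),
        "has_stories": bool(record.get("user_stories")),
    }
```

Hashing the rendered semantic document (rather than the raw record) means two records that render to identical text, and would therefore embed identically, collapse to one vector.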
Metadata fields to extract:
- Primary keys: ritm_number, request_number, epic_key, epic_id
- Site dimensions: site_name, site_country_code, site_id, site_region, site_tier
- Classification: request_type, capability_center, routing_system, complexity_tier, confidence_score
- Status: epic_status, epic_priority, epic_sector, state, stage
- Flags: has_network, has_quotation, has_stories
- Computed: doc_hash (SHA-256 of full semantic doc, first 16 chars)

These metadata fields enable queries like:
- "Find similar network requests at Tier 1 sites in NA region"
- "Find InfoLink implementations with vendor quotations"

Acceptance Criteria:
AC-1: All filterable fields extracted per record
AC-2: doc_hash computed (SHA-256, first 16 chars) for deduplication
AC-3: Metadata output as JSON compatible with vector store format
AC-4: Boolean flags (has_network, has_quotation, has_stories) computed
AC-5: Null/missing fields defaulted to empty string (not None)
AC-6: Metadata stored to Content Sphere Silver bucket
AC-7: Unit tests verify extraction for sample record

Dependencies: S-02 (normalize_schema)
Reference: mll_semantic_document_converter.py Section 7

--------------------------------------------------------------------------------
STORY S-06: Load Embeddings into LangFlow Vector Store
--------------------------------------------------------------------------------
Story Name: Load Embeddings into LangFlow-Compatible Vector Store
Epic: Silver Layer - Semantic Processing
Story Points: 8
Sprint: Sprint 5
Priority: Critical
Assignee: [TBD - ML Engineer]
Labels: silver, vector-store, langflow, hnsw, critical-path

Description:
As an ML engineer, I need to load all generated embeddings and metadata into the LangFlow-compatible vector store (AstraDB or Chroma) and configure the HNSW index for fast similarity search.

Setup tasks:
1. Provision vector store (AstraDB cloud or Chroma local)
2. Create collection/index: "mll_intake_vectors"
3. Configure HNSW index: 384 dimensions, cosine similarity metric
4. Bulk load all embeddings with associated metadata
5. Verify query latency: Top-K=5 retrieval in <3ms

Loading strategy:
- Load full-document embeddings (primary retrieval)
- Load chunk embeddings (secondary, for section-level search)
- Attach metadata to each vector for filtered queries
- Batch upload: 500 vectors per batch

Verification queries to run after loading:
1. "network drops Jacksonville warehouse" → should return InfoLink ticket
2. "server implementation Mexico" → should return JUAREZ server tickets
3. "firewall MACD request" → should return network/firewall tickets

Acceptance Criteria:
AC-1: Vector store provisioned (AstraDB or Chroma)
AC-2: Collection created with HNSW index (384-dim, cosine)
AC-3: All ~16,852 full-document embeddings loaded
AC-4: All chunk embeddings loaded (~100,000+ vectors)
AC-5: Metadata attached to each vector
AC-6: Top-K=5 query latency <3ms (measured)
AC-7: 3 verification queries return semantically correct results
AC-8: Filtered query works (e.g., site_region="NA" + similarity search)
AC-9: Vector store connection config documented for LangFlow
AC-10: Content Sphere Silver bucket updated with index state snapshot

Dependencies: S-04 (embeddings), S-05 (metadata)

--------------------------------------------------------------------------------
STORY S-07: Build Confluence KB Embedding Pipeline
--------------------------------------------------------------------------------
Story Name: Build Confluence Knowledge Base Embedding Pipeline
Epic: Silver Layer - Semantic Processing
Story Points: 5
Sprint: Sprint 5
Priority: Medium
Assignee: [TBD - ML Engineer]
Labels: silver, confluence, embedding, knowledge-base

Description:
As an ML engineer, I need to process the Confluence KB documents through the same Silver layer pipeline: clean text, chunk by section headings, generate embeddings, and load into the vector store as a separate collection ("mll_kb_vectors") so the Knowledge Agent can perform RAG search over process documentation.

Chunking strategy for Confluence pages:
- Split by H2 headings (each section = one chunk)
- Include page title as prefix for each chunk
- Target chunk size: 200-500 words
- Overlap: 50 words between consecutive chunks

Acceptance Criteria:
AC-1: Confluence pages chunked by section headings
AC-2: Each chunk prefixed with page title
AC-3: Chunks embedded using all-MiniLM-L6-v2
AC-4: Loaded into "mll_kb_vectors" collection in vector store
AC-5: Metadata includes: page_id, page_title, section_heading
AC-6: Verification query: "how to submit intake request" returns MLLF Intake Process
AC-7: Stored to Content Sphere Silver bucket

Dependencies: B-05 (Confluence ingestion), S-06 (vector store ready)

--------------------------------------------------------------------------------
STORY S-08: Silver Layer Validation & Embedding Quality Report
--------------------------------------------------------------------------------
Story Name: Silver Layer Validation & Embedding Quality Report
Epic: Silver Layer - Semantic Processing
Story Points: 5
Sprint: Sprint 5
Priority: High
Assignee: [TBD - ML Engineer]
Labels: silver, validation, quality, embedding

Description:
As an ML engineer, I need to build a validation suite that verifies embedding quality and vector store integrity before the Gold layer agents can use it for RAG retrieval.
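The zero-vector/NaN integrity check can be sketched as a small pure-Python scan (a real suite would vectorize this with NumPy over the .npy arrays; the function name is illustrative):

```python
import math

def find_bad_vectors(vectors: dict[str, list[float]]) -> list[str]:
    """Return the IDs of vectors that are all-zero or contain NaN values."""
    bad = []
    for vec_id, vec in vectors.items():
        has_nan = any(math.isnan(x) for x in vec)
        all_zero = not any(x != 0.0 for x in vec)
        if has_nan or all_zero:
            bad.append(vec_id)
    return bad
```

Zero vectors typically mean an empty semantic document slipped through; NaNs point to a numerical fault in the embedding run. Both should fail the quality gate rather than silently load into the index.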
Validation checks:
- Vector count matches record count (~16,852)
- No zero vectors or NaN values
- Cosine similarity sanity checks:
  * Same-site tickets should cluster (similarity > 0.6)
  * Different-domain tickets should be distant (similarity < 0.4)
- Top-K retrieval relevance test (20 curated queries with expected results)
- Metadata filter verification (filtered search returns correct subset)
- Latency benchmark (measure p50, p95, p99 query times)

Acceptance Criteria:
AC-1: Vector count verified (matches ~16,852 records)
AC-2: No zero vectors or NaN values found
AC-3: Same-site clustering verified (avg similarity > 0.6)
AC-4: Cross-domain separation verified (avg similarity < 0.4)
AC-5: 20 curated queries return expected top results
AC-6: Metadata filtering works correctly
AC-7: Latency: p50 <2ms, p95 <5ms, p99 <10ms
AC-8: HTML quality report generated
AC-9: Report stored to Content Sphere Silver bucket

Dependencies: S-06, S-07 (all vectors loaded)

GOLD LAYER — Multi-Agent RAG System & Chat UI
Goal: Build 7 agents in LangFlow + Chainlit chat UI

--------------------------------------------------------------------------------
STORY G-01: Build Supervisor Agent Flow in LangFlow
--------------------------------------------------------------------------------
Story Name: Build Supervisor Agent - Central Orchestrator
Epic: Gold Layer - Multi-Agent System
Story Points: 5
Sprint: Sprint 6
Priority: Critical
Assignee: [TBD - AI Engineer]
Labels: gold, agent, supervisor, langflow, critical-path

Description:
As an AI engineer, I need to create the Supervisor Agent flow in LangFlow that serves as the central orchestrator for all user interactions. Every user message first hits the Supervisor, which decides where to route it.

Routing logic:
1. New message + no active workflow → Route to Intent Classifier (G-02)
2. Active workflow + awaiting user response → Route back to active agent
3. User says "cancel" or "start over" → Reset state, fresh start
4. User asks about status → Route directly to Status Agent (G-09)
5. Classification confidence < 0.6 → Ask user to clarify intent
6. Agent cannot resolve → Escalate to human agent

State management (via Redis):
- session_id: unique per user session
- active_agent: which agent currently owns the conversation
- collected_fields: dict of info gathered so far
- missing_fields: list of required fields not yet collected
- classification_result: output from Intent Classifier
- created_tickets: list of RITMs/Jira tickets created in session

Acceptance Criteria:
AC-1: Supervisor receives all user messages as entry point
AC-2: Routes new messages to Intent Classifier
AC-3: Routes follow-up messages to active agent
AC-4: "Cancel"/"start over" resets conversation state
AC-5: Status keywords route directly to Status Agent
AC-6: Low confidence (<0.6) triggers clarification question
AC-7: Escalation path to human agent works
AC-8: Session state persisted in Redis
AC-9: State survives page refresh (session resumed from Redis)
AC-10: LangFlow flow exported and version-controlled

Dependencies: None (assumes LangFlow setup is complete)

--------------------------------------------------------------------------------
STORY G-02: Build Intent Classifier Agent with RAG Pipeline
--------------------------------------------------------------------------------
Story Name: Build Intent Classifier Agent with RAG Retrieval
Epic: Gold Layer - Multi-Agent System
Story Points: 8
Sprint: Sprint 6-7
Priority: Critical
Assignee: [TBD - Senior AI Engineer]
Labels: gold, agent, classifier, rag, langflow, critical-path

Description:
As an AI engineer, I need to create the Intent Classifier Agent in LangFlow that uses the RAG pipeline to classify user requests by comparing them against historical data in the vector store.

RAG Classification Flow (step by step):
1. Receive user message from Supervisor
2. Embed user message using all-MiniLM-L6-v2 (384-dim vector)
3. Query vector store for Top-K=5 most similar historical tickets
4. Build LLM prompt with:
   - User's original message
   - 5 retrieved similar tickets (semantic document text)
   - Classification instructions
5. Send to LLM (Claude/GPT-4) for classification
6. Parse LLM response into structured output

Classification output:
{
  "request_type": "NetworkRequest",             // or IntakeRequest, IncidentRequest, AccessRequest, StatusQuery
  "capability_center": "ApprovedProjectDemand", // or AgileDemand, TSInternalDemand, EstimateDemand
  "confidence_score": 0.96,
  "similar_tickets": [
    {"ritm": "RITM000023590001", "summary": "...", "similarity": 0.94},
    ...
  ],
  "extracted_entities": {
    "site_name": "JACKSONVILLE",
    "site_id": "5619",
    "request_details": "12 network drops for wireless access points"
  }
}

LangFlow components to use:
- Embedding component (all-MiniLM-L6-v2)
- Vector Store Retriever (connected to S-06 vector store)
- Prompt Template (classification prompt)
- LLM component (Claude/GPT-4)
- Output Parser (structured JSON)

Acceptance Criteria:
AC-1: User message embedded via all-MiniLM-L6-v2 in LangFlow
AC-2: Top-5 similar tickets retrieved from vector store
AC-3: LLM classifies using retrieved context (not just keywords)
AC-4: Output includes: request_type, capability_center, confidence_score
AC-5: similar_tickets returned with RITM, summary, similarity score
AC-6: Entities extracted from user message (site, quantities, etc.)
AC-7: Classification accuracy >95% on 50-case test set
AC-8: Total classification time <2 seconds end-to-end
AC-9: Handles ambiguous requests (asks for clarification if confidence <0.6)
AC-10: LangFlow flow tested and exported

Dependencies: S-06 (vector store loaded), G-01 (Supervisor routes to this)
Test Data: 50 test messages covering all 5 request types

--------------------------------------------------------------------------------
STORY G-03: Build Clarity Agent - Smart Question Generator
--------------------------------------------------------------------------------
Story Name: Build Clarity Agent - Missing Field Detection & Smart Questions
Epic: Gold Layer - Multi-Agent System
Story Points: 8
Sprint: Sprint 7
Priority: Critical
Assignee: [TBD - AI Engineer]
Labels: gold, agent, clarity, questions, langflow, critical-path

Description:
As an AI engineer, I need to create the Clarity Agent in LangFlow that identifies missing required fields based on the classified request type and asks targeted, context-aware follow-up questions.

Required fields by request type:
IntakeRequest:
- site_id (4-digit, validated against known sites)
- capability_center (Agile Demand / Approved Project Demand / etc.)
- nexus_id (required for Agile and Approved Project) - business_problem (min 50 chars, natural language) - business_value (why this matters) NetworkRequest: - existing_intake_ritm (REQUIRED prerequisite — must exist) - network_type (LAN/WAN, Firewall MACD, Switch Config — multi-select) - cable_drops (number) - switch_names (optional, format validation) - network_zone (ICE or standard) IncidentRequest: - what_stopped_working (description) - when_it_broke (date/time) - who_is_affected (scope) - business_impact (urgency justification) AccessRequest: - application_name (e.g., "TS ENGAGE - PROD") - access_type (new account, modify, remove) - user_network_id Smart defaults from similar tickets: - If Intent Classifier returned similar_tickets, extract likely values - Example: Jacksonville network ticket → suggest "ICE zone" and known switch names - Present as: "Based on similar projects at Jacksonville, this would typically be an ICE zone network with switches JAX-MDF-SW01. Can you confirm?" Question behavior: - Ask ONE question at a time (not a form dump) - Accept substantive text responses (if user provides business problem when asked for site ID, extract both) - Validate responses (site_id must be 4 digits, RITM must match pattern) - Track collected vs missing fields in state Acceptance Criteria: AC-1: Required field list defined per request type AC-2: Missing fields identified by comparing collected vs required AC-3: Questions asked one at a time AC-4: Smart defaults suggested from similar historical tickets AC-5: Accepts substantive text (extracts multiple fields from one response) AC-6: Input validation: site_id (4 digits), RITM (pattern match), etc. 
AC-7: State tracks collected_fields and missing_fields AC-8: Passes completed fields to specialized agent when all required gathered AC-9: Handles "I don't know" responses gracefully (mark as optional or explain why needed) AC-10: Works for all 4 request types Dependencies: G-02 (classification result provides request_type and similar_tickets) -------------------------------------------------------------------------------- STORY G-04: Build Intake Agent - Solution Intake RITM Creation -------------------------------------------------------------------------------- Story Name: Build Intake Agent - IRIS RITM Creation for Solution Intake Epic: Gold Layer - Multi-Agent System Story Points: 5 Sprint: Sprint 8 Priority: High Assignee: [TBD - AI Engineer] Labels: gold, agent, intake, iris, servicenow Description: As an AI engineer, I need to create the Intake Agent in LangFlow that handles new Solution Intake requests. It takes the complete field set from the Clarity Agent and creates an IRIS RITM via the ServiceNow API. Workflow: 1. Receive collected fields from Clarity Agent 2. Validate all required fields present 3. Format payload for ServiceNow API (sc_req_item table) 4. Create RITM via POST /api/now/table/sc_req_item 5. Return RITM number to user 6. Inform user that IRIS automation will create Jira Epic in ABFZ 7. Provide link format: https://jira.jnj.com/browse/ABFZ-XXXXX 8. 
Log action to Content Sphere Gold audit bucket Acceptance Criteria: AC-1: Receives complete field set from Clarity Agent AC-2: Validates all required fields before API call AC-3: Creates RITM in IRIS via ServiceNow REST API AC-4: Returns RITM number to user (e.g., "RITM000023XXXXXX") AC-5: Informs user about Jira Epic auto-creation AC-6: Handles API errors gracefully (timeout, auth failure, validation error) AC-7: Retry logic for transient failures (3 retries with backoff) AC-8: Audit log entry written to Content Sphere Gold bucket AC-9: Response time <5 seconds for RITM creation Dependencies: G-03 (Clarity Agent provides collected fields) ServiceNow API access credentials -------------------------------------------------------------------------------- STORY G-05: Build Network Agent - Network RITM Creation -------------------------------------------------------------------------------- Story Name: Build Network Agent - Network/Firewall RITM with Prerequisite Check Epic: Gold Layer - Multi-Agent System Story Points: 5 Sprint: Sprint 8 Priority: High Assignee: [TBD - AI Engineer] Labels: gold, agent, network, firewall, pete-ward Description: As an AI engineer, I need to create the Network Agent in LangFlow that handles LAN/WAN, Firewall MACD, and switch configuration requests. It enforces the prerequisite that an MLL Intake RITM must exist before a Network RITM can be created. Workflow: 1. Validate prerequisite: parent Intake RITM exists (query IRIS API) 2. If no parent RITM → route back to Intake Agent first 3. Collect network-specific fields (from Clarity Agent): - network_type, cable_drops, switch_names, network_zone, firewall_required 4. Assign priority (P1-P4) based on business impact: - P1: Emergency/security concern - P2: High priority project work - P3: Standard (DEFAULT) - P4: Deferred/back burner 5. Create Network RITM in IRIS, linked to parent Intake RITM 6. Inform user: "Network RITM created. 
Will be reviewed at next triage call and assigned to Pete Ward's network team." 7. Log to Content Sphere Gold audit bucket Acceptance Criteria: AC-1: Validates parent Intake RITM exists before proceeding AC-2: If no parent RITM, redirects to Intake Agent workflow AC-3: Network RITM created and linked to parent RITM AC-4: Priority assigned (P1-P4), defaults to P3 AC-5: Network RITM includes: type, cable_drops, switch_names, zone AC-6: User informed about triage call review process AC-7: Handles API errors gracefully AC-8: Audit log written to Content Sphere Gold bucket Dependencies: G-03 (Clarity Agent), G-04 (Intake Agent for prerequisite creation) -------------------------------------------------------------------------------- STORY G-06: Build Incident Agent - Break/Fix Handling -------------------------------------------------------------------------------- Story Name: Build Incident Agent - Break/Fix Scenario Handler Epic: Gold Layer - Multi-Agent System Story Points: 5 Sprint: Sprint 9 Priority: High Assignee: [TBD - AI Engineer] Labels: gold, agent, incident, break-fix Description: As an AI engineer, I need to create the Incident Agent in LangFlow that handles break-fix scenarios where existing functionality has stopped working. The agent must distinguish between incidents and intakes. 
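The Network Agent workflow above (prerequisite check, route-back, P3 default) can be sketched as a small routing function. This is a minimal sketch: the IRIS lookup is injected as a callable so the logic is testable offline, and all payload field names are illustrative assumptions, not confirmed IRIS schema.

```python
# Sketch of Network Agent steps 1-4 (G-05). `ritm_exists` stands in for an
# IRIS API lookup; payload keys are illustrative assumptions.
def route_network_request(fields: dict, ritm_exists) -> dict:
    """Decide whether a Network RITM can be created.

    fields      -- values collected by the Clarity Agent
    ritm_exists -- callable(ritm_number) -> bool, e.g. an IRIS API query
    """
    parent = fields.get("existing_intake_ritm", "")
    if not parent or not ritm_exists(parent):
        # Prerequisite missing: route back to the Intake Agent first
        return {"route": "intake_agent", "reason": "parent Intake RITM required"}

    payload = {
        "parent_ritm": parent,
        "network_type": fields.get("network_type"),
        "cable_drops": fields.get("cable_drops"),
        "switch_names": fields.get("switch_names"),
        "network_zone": fields.get("network_zone"),
        # P3 is the documented default unless business impact says otherwise
        "priority": fields.get("priority", "P3"),
    }
    return {"route": "create_network_ritm", "payload": payload}
```

For example, a request with no `existing_intake_ritm` routes to `"intake_agent"`, while a complete field set yields a creation payload with priority defaulted to P3.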
Decision logic:

  INTAKE (new demand)                  INCIDENT (break-fix)
  - Something never worked before      - Something WAS working, now broke
  - New capability request             - Unplanned interruption
  - Enhancement or upgrade             - Degraded service quality
  → Route to Intake Agent              → Continue as Incident

If classified as incident, collect:
- What stopped working (description)
- When it broke (date/time)
- Who is affected (individuals, teams, sites)
- Business impact (critical, high, medium, low)
- Any error messages or symptoms

Acceptance Criteria:
AC-1: Intake vs Incident decision logic implemented
AC-2: Collects: what, when, who, impact, symptoms
AC-3: If user describes new demand, routes to Intake Agent
AC-4: Creates incident ticket via IRIS API
AC-5: Assigns priority based on business impact
AC-6: Provides incident reference number to user
AC-7: Audit log written to Content Sphere Gold bucket

Dependencies: G-03 (Clarity Agent)

--------------------------------------------------------------------------------
STORY G-07: Build Access Agent - Application Access Requests
--------------------------------------------------------------------------------
Story Name: Build Access Agent - TS ENGAGE and Application Access
Epic: Gold Layer - Multi-Agent System
Story Points: 3
Sprint: Sprint 9
Priority: Medium
Assignee: [TBD - AI Engineer]
Labels: gold, agent, access, ts-engage

Description:
As an AI engineer, I need to create the Access Agent in LangFlow that handles
application access requests, particularly TS ENGAGE-PROD access.

TS ENGAGE-PROD access workflow:
1. Instruct user: Submit in IRIS → "Create a new account for an application"
2. Application CI: TS ENGAGE - PROD
3. Application Login ID: User's Network Account
4. Account Type: Standard Account
5. Provide direct link to IRIS form if available

For other applications:
- Ask for application name
- Check if application is in known list (CMDB lookup)
- Guide through appropriate access request process

Acceptance Criteria:
AC-1: TS ENGAGE-PROD workflow guides user through all 5 steps
AC-2: Other application access requests handled
AC-3: CMDB lookup for application CI (if available)
AC-4: Provides IRIS form link
AC-5: Audit log written to Content Sphere Gold bucket

Dependencies: G-03 (Clarity Agent)

--------------------------------------------------------------------------------
STORY G-08: Build Status Agent - RITM/Jira Ticket Tracking
--------------------------------------------------------------------------------
Story Name: Build Status Agent - Real-Time Ticket Tracking
Epic: Gold Layer - Multi-Agent System
Story Points: 5
Sprint: Sprint 9
Priority: High
Assignee: [TBD - AI Engineer]
Labels: gold, agent, status, tracking

Description:
As an AI engineer, I need to create the Status Agent in LangFlow that queries
both IRIS and Jira APIs to provide real-time ticket status.

The agent must handle three input types:
1. RITM number: "What's the status of RITM000023587097?"
   → Query IRIS API for RITM state, stage, assignment
2. Jira key: "What's happening with ABFZ-97353?"
   → Query Jira API for Epic status, stories, assignees
3. Natural language: "What's my latest request?"
   → Search by user's ID across both systems

Cross-reference capability:
- Given RITM → find linked Jira Epic → show both statuses
- Given Jira key → find linked RITM → show both statuses

Status response format:
"Your request RITM000023587097 is currently in [Fulfilled] state.
The linked Jira Epic ABFZ-97353 is [In Progress] with 2 user stories:
- ABFZ-97354: Gather User Stories (Completed)
- ABFZ-97355: Network Assessment (Open)
Next step: Network assessment is pending assignment."
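The Status Agent's three input types can be detected with a couple of regular expressions before any API call is made. A minimal sketch follows; the RITM pattern matches the 12-digit sample numbers used in this document and should be adjusted if the real instance uses a different length.

```python
import re

# Sketch of Status Agent input-type detection (G-08). Patterns follow the
# sample ticket numbers in this document (RITM + 12 digits, ABFZ-<n>).
RITM_RE = re.compile(r"RITM\d{12}")
JIRA_RE = re.compile(r"ABFZ-\d+")

def classify_status_query(text: str):
    """Return (input_type, ticket_id) for a status question."""
    m = RITM_RE.search(text)
    if m:
        return ("ritm", m.group(0))        # query IRIS, then the linked Jira Epic
    m = JIRA_RE.search(text)
    if m:
        return ("jira", m.group(0))        # query Jira, then the linked RITM
    return ("natural_language", None)      # search both systems by user ID
```

For example, `classify_status_query("What's the status of RITM000023587097?")` returns `("ritm", "RITM000023587097")`, and a free-form question falls through to the natural-language branch.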
Acceptance Criteria:
AC-1: Accepts RITM numbers (regex: RITM\d{12})
AC-2: Accepts Jira keys (regex: ABFZ-\d+)
AC-3: Accepts natural language status queries
AC-4: Queries IRIS REST API for RITM status
AC-5: Queries Jira REST API for Epic/Story status
AC-6: Cross-references RITM to Epic (and vice versa)
AC-7: Returns formatted status with next steps
AC-8: Handles "ticket not found" gracefully
AC-9: Response time <3 seconds

Dependencies: G-01 (Supervisor routes status queries here)

--------------------------------------------------------------------------------
STORY G-09: Build Chainlit Chat UI with LangFlow Integration
--------------------------------------------------------------------------------
Story Name: Build Chainlit Chat Interface with LangFlow Backend
Epic: Gold Layer - Chat UI
Story Points: 8
Sprint: Sprint 10
Priority: Critical
Assignee: [TBD - Full-Stack Developer]
Labels: gold, chainlit, ui, langflow, chat, critical-path

Description:
As a full-stack developer, I need to build the Chainlit-based chat interface
that serves as the user-facing front end, connecting to the LangFlow
multi-agent backend via REST API.

Implementation:
@cl.on_chat_start → Initialize session, display welcome message
@cl.on_message → Forward to LangFlow API, stream response back

Features required:
1. Welcome message with quick-action buttons:
   "New Intake Request | Check Status | Report Incident | Get Access"
2. Streaming responses (tokens appear as generated)
3. Agent step visualization (show which agent is handling:
   "🔍 Classifier analyzing your request..." → "Clarity Agent: asking follow-up...")
4. File upload support (for PDF invoices and supporting docs)
5. Clickable option buttons (when Clarity Agent offers choices)
6. SSO authentication (OAuth2/SAML integration)
7. Conversation history (sidebar, persisted via Redis)
8. Mobile responsive layout
9. Custom branding (MLL/J&J colors and logo)

Acceptance Criteria:
AC-1: Chainlit app launches and displays welcome message
AC-2: Messages forwarded to LangFlow REST API endpoint
AC-3: Streaming responses work (tokens appear progressively)
AC-4: Agent step visualization shows active agent name and status
AC-5: File upload works for PDF and common document types
AC-6: Quick-action buttons on welcome screen
AC-7: SSO authentication configured
AC-8: Conversation history in sidebar
AC-9: Mobile responsive (tested on iPhone and Android)
AC-10: Custom MLL branding applied (colors, logo)

Dependencies: G-01 through G-08 (all agents must be functional)

--------------------------------------------------------------------------------
STORY G-10: Build Knowledge Agent - Confluence KB RAG Search
--------------------------------------------------------------------------------
Story Name: Build Knowledge Agent - Process Documentation RAG Search
Epic: Gold Layer - Multi-Agent System
Story Points: 3
Sprint: Sprint 10
Priority: Medium
Assignee: [TBD - AI Engineer]
Labels: gold, agent, knowledge, confluence, rag

Description:
As an AI engineer, I need to create the Knowledge Agent in LangFlow that
searches Confluence KB articles using RAG to answer process questions like
"How do I submit an intake request?" or "What's the difference between intake
and incident?" The agent queries the "mll_kb_vectors" collection (from S-07)
and returns relevant Confluence content with source attribution.
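The Knowledge Agent's post-retrieval step (answer with source attribution, graceful fallback when nothing relevant is found) can be sketched independently of the vector store. The chunk dict shape, the 0.5 relevance threshold, and the fallback wording below are assumptions for illustration.

```python
# Sketch of Knowledge Agent answer assembly (G-10). `chunks` are assumed to be
# results from the mll_kb_vectors collection: {"text", "score", "page_url"}.
FALLBACK = ("I couldn't find a KB article covering that. "
            "You may want to raise the question through the intake process.")

def answer_from_kb(chunks: list, min_score: float = 0.5) -> str:
    relevant = [c for c in chunks if c.get("score", 0.0) >= min_score]
    if not relevant:
        return FALLBACK                               # graceful fallback
    best = max(relevant, key=lambda c: c["score"])
    answer = best["text"]
    if best.get("page_url"):                          # link when available
        answer += f"\n\nSource: {best['page_url']}"
    return answer
```

The threshold keeps low-similarity matches from being presented as authoritative answers; tuning it against the 20 curated Silver-layer queries would be a reasonable starting point.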
Acceptance Criteria:
AC-1: Queries mll_kb_vectors collection for relevant KB content
AC-2: Returns synthesized answer with source page reference
AC-3: Handles: intake process, network engagement, intake vs incident, TS Engage
AC-4: Falls back gracefully if no relevant KB article found
AC-5: Provides Confluence page link when available

Dependencies: S-07 (KB vectors loaded)

--------------------------------------------------------------------------------
STORY G-11: Build Cost Estimation Feature
--------------------------------------------------------------------------------
Story Name: Build Cost Estimation from Historical Invoice Data
Epic: Gold Layer - Multi-Agent System
Story Points: 5
Sprint: Sprint 11
Priority: High
Assignee: [TBD - AI Engineer]
Labels: gold, cost-estimation, invoices, rag

Description:
As an AI engineer, I need to add cost estimation capability to the Intake
Agent that uses historical PDF invoice data (from Silver layer embeddings) to
provide ballpark cost estimates for new requests.

Estimation formula:
Est_Cost = SUM(Unit_Cost_i × Qty_i) × Geo_Factor × Complexity_Multiplier
           + Network_Surcharge + Firewall_Fee + MedDevice_Compliance

Factors:
- Geolocation: US=1.0x, EU=1.3x, LATAM=0.8x, APAC=1.1x
- Complexity: Simple (1-5 items)=1.0x, Medium (6-20)=1.2x, Complex (21+)=1.5x
- Network surcharge: $350/drop
- Firewall MACD: $2,500 per rule set
- Medical device compliance: +15%

The agent retrieves Top-3 similar historical tickets with quotation data and
uses their actual costs as reference points.
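The estimation formula above can be sketched directly in code. Factor values come straight from the story; the function shape, the rounding, and the choice to apply the +15% medical-device uplift to the running total (the story leaves this ambiguous) are assumptions.

```python
# Sketch of the G-11 estimation formula. Factor values are from the story;
# everything else (signature, rounding, uplift base) is an assumption.
GEO = {"US": 1.0, "EU": 1.3, "LATAM": 0.8, "APAC": 1.1}

def complexity_multiplier(n_items: int) -> float:
    if n_items <= 5:
        return 1.0        # Simple
    if n_items <= 20:
        return 1.2        # Medium
    return 1.5            # Complex

def estimate_cost(line_items, geo="US", cable_drops=0,
                  firewall_rule_sets=0, medical_device=False):
    """line_items: iterable of (unit_cost, qty) taken from similar quotations."""
    base = sum(unit * qty for unit, qty in line_items)
    items = sum(qty for _, qty in line_items)
    cost = base * GEO[geo] * complexity_multiplier(items)
    cost += 350 * cable_drops            # network surcharge: $350/drop
    cost += 2500 * firewall_rule_sets    # Firewall MACD: $2,500 per rule set
    if medical_device:
        cost *= 1.15                     # +15% compliance (assumed on total)
    return round(cost, 2)
```

For example, ten items at $100 each with two cable drops in the US gives 1000 × 1.0 × 1.2 + 700 = 1900.0. A low-high range (AC-3) could then be produced by running this against each of the Top-3 reference tickets.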
Acceptance Criteria:
AC-1: Retrieves Top-3 similar tickets with quotation_cost chunks
AC-2: Applies estimation formula with all factors
AC-3: Returns estimated cost range (low-high) with confidence
AC-4: Shows reference tickets: "Based on similar project at Jacksonville: $98,200"
AC-5: Geo, complexity, and surcharge factors applied correctly
AC-6: Handles cases with no historical cost data gracefully

Dependencies: G-04 (Intake Agent), S-06 (quotation vectors available)

--------------------------------------------------------------------------------
STORY G-12: End-to-End Integration Testing
--------------------------------------------------------------------------------
Story Name: End-to-End Integration Testing - All Agents & Workflows
Epic: Gold Layer - Quality Assurance
Story Points: 8
Sprint: Sprint 11-12
Priority: Critical
Assignee: [TBD - QA Engineer]
Labels: gold, testing, e2e, integration, critical-path

Description:
As a QA engineer, I need to verify the complete flow from user message through
all agents to RITM/Jira creation for every supported request type.

Test matrix (50 test cases total):
- IntakeRequest: 10 cases (various sites, capability centers)
- NetworkRequest: 10 cases (LAN/WAN, Firewall, Switch, with/without parent RITM)
- IncidentRequest: 10 cases (various break/fix scenarios)
- AccessRequest: 5 cases (TS ENGAGE, other apps)
- StatusQuery: 10 cases (RITM lookup, Jira lookup, natural language)
- KnowledgeQuery: 5 cases (process questions)

Verification per test case:
1. Intent Classifier produces correct request_type
2. Clarity Agent asks correct follow-up questions
3. Smart defaults from similar tickets are relevant
4. Specialized agent creates correct ticket
5. Response is clear and helpful
6. Audit log entry created in Content Sphere Gold bucket
7. Response time <5 seconds end-to-end

Acceptance Criteria:
AC-1: 50 test cases executed across all request types
AC-2: 100% pass rate (all workflows complete successfully)
AC-3: Classification accuracy >95% (at least 48 of 50 correct)
AC-4: RAG retrieval returns relevant similar tickets
AC-5: RITM creation verified in IRIS for intake/network tests
AC-6: Audit logs verified in Content Sphere Gold bucket
AC-7: End-to-end response time <5s for 95th percentile
AC-8: Test report generated with pass/fail per test case
AC-9: Any failures have detailed repro steps and logs

Dependencies: All G-stories (complete agent system)

--------------------------------------------------------------------------------
STORY G-13: Production Deployment & Monitoring Setup
--------------------------------------------------------------------------------
Story Name: Production Deployment, Monitoring & Content Sphere Audit Pipeline
Epic: Gold Layer - Deployment
Story Points: 5
Sprint: Sprint 12
Priority: Critical
Assignee: [TBD - DevOps / Senior Engineer]
Labels: gold, deployment, monitoring, production, content-sphere

Description:
As a DevOps engineer, I need to deploy the complete MLL Intake system to
production and configure monitoring, alerting, and the Content Sphere audit
logging pipeline.
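The story's latency alert rule (p95 alert if >5s) amounts to a percentile check over recent samples. A minimal sketch using the nearest-rank percentile method follows; the real alerting plumbing (Prometheus, Grafana, or similar) is assumed and out of scope here.

```python
import math

# Sketch of the G-13 "p95 alert if >5s" rule. Nearest-rank percentile; the
# sample window and alert transport are assumptions.
def percentile(samples, pct: float) -> float:
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))  # nearest-rank method
    return ordered[rank - 1]

def latency_alerts(samples_s) -> list:
    """Return alert messages for a window of response times in seconds."""
    alerts = []
    if percentile(samples_s, 95) > 5.0:
        alerts.append("p95 latency above 5s")
    return alerts
```

The same `percentile` helper can serve the p50/p99 dashboard panels and the vector-store p95 >10ms rule with different thresholds.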
Deployment components:
- Chainlit app (containerized, 3 replicas behind load balancer)
- LangFlow backend (containerized, with all agent flows)
- Redis cluster (3 nodes for session state)
- Vector store (AstraDB cloud or managed Chroma)
- FastAPI gateway (2 replicas for IRIS/Jira API integration)

Content Sphere Gold audit pipeline:
- Every agent action logged: timestamp, session_id, agent_name, action, result
- Every RITM/Jira creation logged: ticket_number, fields, user
- Every RAG retrieval logged: query, top_k results, classification
- Conversation transcripts stored (anonymized)
- Daily aggregation job for analytics dashboard

Monitoring & alerting:
- Response latency: p95 alert if >5s
- Error rate: alert if >1%
- Classification accuracy: weekly report
- IRIS/Jira API health: alert on >1% failure rate
- Vector store query latency: alert if p95 >10ms

Acceptance Criteria:
AC-1: All components deployed and running in production
AC-2: Health checks pass for all services
AC-3: Load balancer routes traffic to Chainlit replicas
AC-4: SSL/TLS configured for all endpoints
AC-5: SSO authentication working in production
AC-6: Content Sphere Gold audit pipeline capturing all actions
AC-7: Monitoring dashboards live with alerting rules
AC-8: Runbook documented for common operational scenarios
AC-9: Rollback procedure tested
AC-10: Stakeholder sign-off obtained

Dependencies: G-12 (E2E testing passed)

LAYER       | STORIES | POINTS | SPRINTS  | FOCUS
------------|---------|--------|----------|----------------------------------
Bronze      | 7       | 42     | 1-2      | ETL, Data Ingestion, Unification
Silver      | 8       | 44     | 3-4      | Semantic Text, Embeddings, Vector DB
Gold        | 13      | 67     | 6-12     | 7 Agents, Chainlit UI, Deployment
------------|---------|--------|----------|----------------------------------
TOTAL       | 28      | 153    | 12       | Complete MLL Intake System

CRITICAL PATH (must complete in order):
B-01/02/03 → B-06 → S-01/02 → S-03 → S-04 → S-06 → G-01 → G-02 → G-03 →
G-04/05 → G-09 → G-12 → G-13

PARALLEL TRACKS:
Track A: B-01, B-02, B-03 (all Sprint 1, independent)
Track B: B-04, B-05 (Sprint 2, independent)
Track C: S-01, S-02 (Sprint 3, can start before Bronze complete)
Track D: G-06, G-07, G-08 (Sprint 9, after G-03)
Track E: G-10, G-11 (Sprints 10-11, after vector store ready)

STORAGE (Content Sphere):
Bronze Bucket: Raw JSON records (~16,852), partitioned by source
Silver Bucket: Semantic docs, embeddings (.npy), metadata JSON
Gold Bucket: Audit logs, conversation history, analytics (append-only)
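Given the "partitioned by source" note for the Bronze bucket, object keys could be laid out Hive-style so downstream jobs can prune by source and date. This is a sketch only: the `bronze/` prefix, the `dt=` partition, and the record-per-object layout are illustrative assumptions, not a confirmed Content Sphere convention.

```python
from datetime import date

# Sketch of a partitioned Bronze object-key scheme (assumed layout).
def bronze_key(source: str, record_id: str, day: date) -> str:
    """e.g. source='iris', record_id='RITM000023587097'."""
    return f"bronze/source={source}/dt={day.isoformat()}/{record_id}.json"
```

With this layout, an ETL job for the IRIS source would list only the `bronze/source=iris/` prefix rather than scanning all ~16,852 records.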