This section summarizes NaviGo’s testing and evaluation setup, based on the current tests/ directory and script configuration.
tests/
├── unit/
│   ├── agents/
│   ├── security/
│   └── tools/
├── integration/
│   ├── api.chat-endpoint.test.ts
│   ├── api.frontend-route.test.ts
│   ├── api.plan-endpoint.test.ts
│   └── graph.plan-flow.test.ts
├── redteam/
│   └── guardrails.redteam.test.ts
├── evals/
│   └── travel-planner.eval.ts
└── helpers/
    └── fake-model.ts
- tests/unit/agents/budget.agent.test.ts
- tests/unit/agents/itinerary.agent.test.ts
- tests/unit/agents/risk-guard.agent.test.ts
- tests/unit/agents/form-completer.agent.test.ts
- tests/unit/agents/requirement-parser.agent.test.ts
- tests/unit/security/guardrails.test.ts
- tests/unit/tools/http.test.ts

Goal: Verify single-module logic correctness and error paths.
- tests/integration/graph.plan-flow.test.ts
- tests/integration/api.plan-endpoint.test.ts
- tests/integration/api.frontend-route.test.ts
- tests/integration/api.chat-endpoint.test.ts

Goal: Verify complete graph flows, API routes (including chat and resume), static resources, and read behavior of persisted state.
- tests/redteam/guardrails.redteam.test.ts

Goal: Verify guardrail detection capability against known attack vectors (injection, jailbreak, homoglyph, zero-width characters, indirect injection, context manipulation).
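
To make the homoglyph and zero-width vectors concrete, here is a minimal sketch of the kind of normalization a guardrail might apply before pattern matching. The function names, the tiny homoglyph table, and the single injection pattern are all assumptions for illustration; the actual guardrails.ts may work differently.

```typescript
// Hypothetical sketch: not the project's actual guardrails implementation.

// Zero-width and invisible characters commonly used to split keywords.
const ZERO_WIDTH = /[\u200B-\u200D\u2060\uFEFF]/g;

// Deliberately tiny look-alike table (Cyrillic/Greek -> Latin), for illustration only.
const HOMOGLYPHS: Record<string, string> = {
  "\u0430": "a", // Cyrillic a
  "\u0435": "e", // Cyrillic e
  "\u043E": "o", // Cyrillic o
  "\u0440": "p", // Cyrillic r (looks like Latin p)
  "\u0441": "c", // Cyrillic s (looks like Latin c)
  "\u0456": "i", // Cyrillic i
  "\u03BF": "o", // Greek omicron
};

// Fold compatibility forms, lowercase, drop invisible characters,
// then map known look-alikes to their Latin counterparts.
function normalizeForScan(input: string): string {
  const stripped = input.normalize("NFKC").toLowerCase().replace(ZERO_WIDTH, "");
  return stripped
    .split("")
    .map((ch) => HOMOGLYPHS[ch] ?? ch)
    .join("");
}

// Assumed rule: a single illustrative pattern, not the project's rule set.
const INJECTION_PATTERN = /ignore (all )?previous instructions/;

function looksLikeInjection(input: string): boolean {
  return INJECTION_PATTERN.test(normalizeForScan(input));
}
```

Normalizing before matching means one rule can cover the plain phrase and its obfuscated variants, rather than needing a separate pattern per encoding trick.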
- tests/evals/travel-planner.eval.ts

Goal: Verify the “final plan completeness” baseline; gated by LANGSMITH_API_KEY.
| Module | Verified Points (from test code) |
|---|---|
| requirement-parser.agent.ts | Natural language field extraction, missing field filtering |
| form-completer.agent.ts | Complete form assembly, pending clarification questions generation |
| risk-guard.agent.ts | Injection hit/non-hit branches, LLM scan and rule scan merging, risk flag writing |
| itinerary.agent.ts | LLM itinerary generation, round-trip flight integration, weather risk propagation, unknown city anchor fallback |
| budget.agent.ts | Over-budget/within-budget branches and risk flags |
| guardrails.ts | Prompt injection / unsafe output detection, zero-width character and homoglyph normalization |
| tools/common/http.ts | Query assembly, timeout interruption and error mapping |
| graph/builder.ts + routes.ts | Full chain execution, node progression by state, chat resume, thread recovery |
| API routes | POST /plan, POST /plan/chat, POST /plan/chat/resume, GET /plan/:threadId behavior and state reading |
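
As one illustration of the branch-level checks listed in the table, the sketch below shows what an over-budget/within-budget decision with a risk flag could look like. The CostItem and BudgetResult shapes, the checkBudget name, and the OVER_BUDGET flag are hypothetical; the real budget.agent.ts may use different state and flag names.

```typescript
// Hypothetical shapes for illustration; not taken from budget.agent.ts.
interface CostItem {
  label: string;
  amount: number;
}

interface BudgetResult {
  total: number;
  withinBudget: boolean;
  riskFlags: string[];
}

// Sum the cost items and raise a risk flag only on the over-budget branch.
function checkBudget(items: CostItem[], budget: number): BudgetResult {
  const total = items.reduce((sum, item) => sum + item.amount, 0);
  const withinBudget = total <= budget;
  return {
    total,
    withinBudget,
    riskFlags: withinBudget ? [] : ["OVER_BUDGET"],
  };
}
```

A unit test for such a module would typically call it once per branch and assert on both the boolean outcome and the resulting flags.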
Tests heavily use:
- FakeStructuredChatModel
- createInMemoryCheckpointer()

Therefore, unit and integration tests do not depend on real external APIs, which keeps results stable.
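
The idea behind these two helpers can be sketched as follows. ScriptedModel and createMemoryStore are hypothetical stand-ins written for this example; the real FakeStructuredChatModel and createInMemoryCheckpointer() in the repository may expose different signatures.

```typescript
// Hypothetical stand-ins; not the repository's actual test helpers.

// Returns pre-scripted structured outputs in order and records each prompt,
// so a test can assert on both what was asked and what came back.
class ScriptedModel<T> {
  readonly prompts: string[] = [];
  private readonly queue: T[];

  constructor(responses: T[]) {
    this.queue = [...responses];
  }

  async invoke(prompt: string): Promise<T> {
    this.prompts.push(prompt);
    const next = this.queue.shift();
    if (next === undefined) {
      throw new Error("ScriptedModel: no scripted response left");
    }
    return next;
  }
}

// Keeps the latest state per thread in a Map instead of a real database.
function createMemoryStore<S>() {
  const threads = new Map<string, S>();
  return {
    save(threadId: string, state: S): void {
      threads.set(threadId, state);
    },
    load(threadId: string): S | undefined {
      return threads.get(threadId);
    },
  };
}
```

Because both helpers are deterministic and in-process, a test that uses them fails only when the code under test changes, not when a network call or model response varies.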
Test fixtures are generally constructed via schema (e.g., UserRequestSchema.parse(...)), ensuring consistency with production input contracts.
Integration tests assert on state graph execution results (finalPlan, snapshot, thread recovery, chat resume) rather than internal implementation details, so they keep passing across refactors that preserve behavior.
tests/redteam/guardrails.redteam.test.ts logs known blind spots (e.g., variant verbs, plural forms) as informational output without failing the build. This keeps known gaps from slowing development while preserving a security audit trail.
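
A minimal sketch of this "report, don't fail" pattern is shown below. The Sample and Report shapes and the evaluateSamples function are assumptions made for illustration, not the structure of the actual red-team test.

```typescript
// Hypothetical sketch of separating known blind spots from real failures.
interface Sample {
  text: string;
  knownBlindSpot: boolean; // true when a miss is already documented
}

interface Report {
  failures: string[]; // undetected samples that should fail the build
  blindSpots: string[]; // undetected samples that are only logged
}

function evaluateSamples(
  samples: Sample[],
  detect: (text: string) => boolean
): Report {
  const report: Report = { failures: [], blindSpots: [] };
  for (const sample of samples) {
    if (detect(sample.text)) continue; // detected: nothing to record
    const bucket = sample.knownBlindSpot ? report.blindSpots : report.failures;
    bucket.push(sample.text);
  }
  return report;
}
```

A test built on this would assert that failures is empty and print blindSpots, so a newly missed attack breaks the build while documented gaps stay visible in the logs.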
tests/evals/travel-planner.eval.ts baseline scoring consists of four items:
Pass condition: completenessScore >= 4.
This is a “structural completeness” evaluation, suitable as a minimum quality threshold.
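
A structural scorer of this kind can be sketched as a list of boolean checks whose passing count is compared against the threshold. The plan shape and the four check names below are placeholders: the actual scoring items in travel-planner.eval.ts are not listed here, so only the completenessScore >= 4 pass condition comes from the text.

```typescript
// Placeholder plan shape and checks; illustrative only, not the real eval items.
interface PlanLike {
  itinerary?: unknown[];
  budget?: { total: number };
  risks?: string[];
  summary?: string;
}

interface Check {
  name: string;
  passes: (plan: PlanLike) => boolean;
}

const CHECKS: Check[] = [
  { name: "hasItinerary", passes: (p) => Array.isArray(p.itinerary) && p.itinerary.length > 0 },
  { name: "hasBudget", passes: (p) => typeof p.budget?.total === "number" },
  { name: "hasRisks", passes: (p) => Array.isArray(p.risks) },
  { name: "hasSummary", passes: (p) => typeof p.summary === "string" && p.summary.trim().length > 0 },
];

// One point per passing structural check.
function completenessScore(plan: PlanLike): number {
  return CHECKS.filter((check) => check.passes(plan)).length;
}

// Pass condition from the text: completenessScore >= 4.
function passesBaseline(plan: PlanLike): boolean {
  return completenessScore(plan) >= 4;
}
```

Because every check is structural, a plan can pass with poor content; the score only guards against missing sections.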
The following are recommendations (not fully implemented in the current repository):
npm run test:unit
npm run test:integration
npm run test:eval
npm run test
npm run acceptance
Red-team tests (require OPENAI_API_KEY for LLM-generated variants; static adversarial samples do not):
npx vitest run tests/redteam/
npm run acceptance additionally runs live CLI scenario verification when the required environment variables are set.
Based on existing test code, we can confirm:
Meanwhile, the current eval still measures primarily structural completeness. For higher-reliability scenarios, it is recommended to add evaluations of semantic quality, quantitative scoring of security robustness, and performance regression checks.