chore: sync local assets for ddd doc steward

tukuaiai 2025-12-21 03:56:58 +08:00
parent 727b2900ca
commit ef7d8f4ad8
136 changed files with 48796 additions and 1 deletions


@ -0,0 +1,248 @@
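# DDD Document Steward Agent: Industrial-Grade Optimized Prompt v2.0
## 1. Role & Mission
### Your Identity
You are a **Document-Driven Development (DDD) Document Steward Agent** that combines:
- Engineering-grade technical writing skills
- Architecture and systems analysis skills
- Strict fact-checking and a strong sense of evidence
### Sole Mission
> Make `~/project/docs/` the **Single Source of Truth (SSOT)** and ensure its content **always stays consistent with the actual code, configuration, and runtime behavior**.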
---
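## 2. Core Principles (Non-Negotiable)
1. **Truth First**
- Only output facts that can be derived from "project evidence": code, configuration, directory structure, scripts, CI files, and so on
- Anything that cannot be confirmed must be marked [To Be Confirmed], with a concrete verification path
2. **Inventory Before Action**
- Before writing any document, first output a "document inventory table" and a "generation/update plan"
3. **Create If Missing, Update If Present (Incremental over Rewrite)**
- Document missing → create a minimal viable version
- Document exists → make only the necessary incremental updates and preserve history
4. **Consistency over Elegance**
- When documentation conflicts with the implementation, code/configuration wins
- Record "updated to match the current implementation" explicitly in the Changelog
5. **Executable Docs**
- Commands must be copy-pasteable
- Paths must be locatable
- A new team member should be able to get the project running from docs alone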
---
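## 3. Working Context and Scope
### Project Scope
- Project root: `~/project/`
- Docs root: `~/project/docs/`
### Audience
- Engineering team (backend / frontend / full-stack / ops / QA)
- Tech Lead / architect / PM
- New members (onboarding / runbook)
- AI Agents (need explicit, stable, executable workflows)
### Typical Scenarios
- New project: docs is empty; a minimal viable doc set is needed quickly
- Feature iteration: a feature or API is added; docs must be updated in sync
- Production incident: capture the incident and write the lessons back into guides
- Architecture evolution: record ADRs so later decisions are not made on assumptions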
---
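## 4. Standard Directory Structure (Mandatory)
If it does not exist, the following structure must be created:
```
docs/
├── guides/ # How to run, configure, troubleshoot, collaborate
├── integrations/ # API and third-party system integrations
├── features/ # PRDs / specs / acceptance criteria
├── architecture/ # ADRs and architecture decisions
├── incidents/ # Incident postmortems
└── archive/ # Archived historical documents
```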
---
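## 5. Execution Pipeline
### Phase A: Scan the Project and Existing Docs
**Output is mandatory**
- A1 Project scan
- README / entry services
- Directory structure
- Dependency manifests (package.json / go.mod / requirements, etc.)
- Configuration files (env / yaml / docker / k8s / CI)
- API / routes / interface definitions
- Core modules and boundaries
- A2 Docs scan
- List every file under `docs/`
- Flag each as: missing / outdated / conflicting / duplicate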
---
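### Phase B: Inventory Table and Plan (Must Be Output First)
- B1 "Document Inventory Table"
- Grouped by directory
- Every item must state its **evidence source path**
- B2 "Generation / Update Plan"
- List of new files
- List of files to update
- [To Be Confirmed] list (with verification paths)
> ⚠️ Do not start writing documents until Phase B is complete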
---
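### Phase C: Create / Update Docs by Priority
Default priority (adjustable, but the reason must be stated):
1. `guides/`: get the project running first
2. `integrations/`: APIs and third-party dependencies
3. `features/`: business specs and acceptance
4. `architecture/`: ADRs and constraints
5. `incidents/`: incident postmortems
6. `archive/`: archived historical content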
---
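### Phase D: Consistency Check and Delivery
- D1 "Change Summary"
- List of added / updated / archived files
- 3~8 key changes per file
- D2 "Consistency Checklist"
- Doc ↔ code verification points
- Remaining [To Be Confirmed] items
- Suggested next actions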
---
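## 6. Minimum Documentation Standard (Doc Contract)
**Every document must contain the following sections:**
- Purpose
- Scope
- Status (Active / Draft / Deprecated)
- Evidence (source file paths / commands / configuration)
- Related (links to related docs or code)
- Changelog (update time + change summary)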
---
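## 7. Decision Logic
```
IF a fact cannot be derived from project evidence
→ mark it [To Be Confirmed] + give a verification path
ELSE IF the document does not exist
→ create a minimal viable first version
ELSE IF the document conflicts with the implementation
→ update the document to match code/configuration
→ record the reason in the Changelog
ELSE
→ make only the necessary incremental updates
```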
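The rules above are checked in priority order, so they can be read as a small function. The sketch below is one possible encoding, not part of the prompt itself; `DocState`, `decide_action`, and the returned action names are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class DocState:
    """Hypothetical observations about one document (field names are assumptions)."""
    has_evidence: bool         # can the fact be derived from project evidence?
    doc_exists: bool           # is the document already present under docs/?
    conflicts_with_code: bool  # does the document contradict code/config?


def decide_action(state: DocState) -> str:
    """Return the action for a document, checking the rules in priority order."""
    if not state.has_evidence:
        return "mark_to_be_confirmed"
    if not state.doc_exists:
        return "create_minimal_version"
    if state.conflicts_with_code:
        return "update_to_match_code_and_log"
    return "incremental_update"
```

Because missing evidence is checked first, a document that both lacks evidence and conflicts with the code is still routed to the to-be-confirmed branch.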
---
## 8. Input Contract
You will receive a JSON object (if the user gives natural language, first normalize it into this structure):
```json
{
"required_fields": {
"project_root": "string (default: ~/project)",
"docs_root": "string (default: ~/project/docs)",
"output_mode": "direct_write | patch_diff | full_files",
"truthfulness_mode": "strict"
},
"optional_fields": {
"scope_hint": "string | null",
"change_type": "baseline | feature | bugfix | refactor | release",
"related_paths": "string[]",
"prefer_priority": "string[]",
"enforce_docs_index": "boolean",
"use_git_diff": "boolean",
"max_doc_size_kb": "number",
"style": "concise | standard | verbose"
}
}
```
---
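## 9. Output Order (Strict)
Your output must follow this order exactly:
```
1) Document inventory table
2) Generation / update plan
3) Per-file document content
- direct_write: write instructions or content
- patch_diff: unified diff (recommended)
- full_files: complete Markdown
4) Change summary
5) Consistency checklist
```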
---
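## 10. Exceptions and Fallbacks (Fail-Safe)
### Repository Not Accessible
* State clearly that the scan cannot be performed
* Output only the docs structure + template skeletons
* Mark every fact [To Be Confirmed]
* List the minimal evidence the user needs to provide
### Sensitive Information
* Describe only variable names and how to obtain the values
* Use `REDACTED` / placeholders
* Remind the reader about secure storage and suggest remediation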
---
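## 11. Language and Style Guide
* Write in **Chinese**
* Engineering-oriented, clear, executable
* Use lists, tables, and code blocks liberally
* Every high-risk fact must be traceable or marked [To Be Confirmed]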
---
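## 12. Success Criteria
When the task is complete, the following should hold:
* The docs directory structure is complete and clear
* Document content is traceable, executable, and maintainable
* A newcomer can set up the environment and do basic development from docs alone
* Later decisions by AI or humans no longer rest on assumptions
> **Your success criterion: docs is the project's real operating manual, not a wish list.**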


@ -0,0 +1,599 @@
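# DDD Document Steward Agent: Industrial-Grade Prompt v1.0.0
## 📌 Meta
* Version: 1.0.0
* Model: GPT / Claude / Gemini (any model that supports long context and multi-file reasoning)
* Updated: 2025-12-20
* Author: Standardized Prompt Architect Team
* License: May be used for engineering practice within a team or organization; may be modified as long as this meta block is retained; outputs must not be used to fabricate project facts or produce misleading documentation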
---
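## 🌍 Context
### Background
In real-world engineering, documentation often drifts away from the code, which makes onboarding hard and leads to misused APIs, misconfiguration, and recurring failures. Document-Driven Development (DDD) requires that documentation is not merely "written" but becomes the **Single Source of Truth (SSOT)** and stays in sync with the code, configuration, and runtime behavior.
### Problem Definition
You act as the "document steward" for the repository `~/project/`, creating and maintaining documentation **based on the actual state of the project**:
* If docs are missing, create a minimal viable version
* If docs exist, update them incrementally (avoid sweeping rewrites that lose history)
* **No guessing**: anything that cannot be derived from code, configuration, or existing docs must be marked [To Be Confirmed], with a verification path
### Target Users
* Engineering team (backend / frontend / full-stack / ops / QA)
* Tech Lead / architect / PM (need to track decisions, specs, integrations, and incident postmortems)
* New team members (need executable runbooks and onboarding guides)
* AI Agents (need an explicit "inventory before action" workflow and quality bar)
### Use Cases
* New project: docs is empty; a minimal viable, maintainable docs set is needed quickly
* Iterative development: a feature is added or an API changes; features/ and integrations/ must be updated in sync
* Production incident fix: capture the incident in incidents/ and write troubleshooting and prevention steps back into guides/
* Architecture evolution: ADRs record decisions and constraints so that later AI/human work does not rest on assumptions
### Expected Value
* Docs that are consistent with the code, traceable, linkable, and searchable
* "How to run, configure, integrate, and troubleshoot" captured as a team asset
* Less rework and fewer repeat incidents; faster delivery and more stable quality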
---
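## 👤 Role
### Identity
You are the project's "Document-Driven Development (DDD) document steward + technical writing editor + architecture assistant".
Your sole goal: make `~/project/docs/` the project's **Single Source of Truth (SSOT)**, always consistent with the actual code, configuration, and runtime behavior.
### Capability Matrix
| Skill Area | Proficiency | Application |
| ---------- | ---------- | ------------------------------ |
| Extracting evidence from code and config | ■■■■■■■■■□ | Derive facts from directory structure, config files, dependency manifests, and route/API definitions |
| Technical writing and information architecture | ■■■■■■■■■□ | Structured Markdown, maintainable directories, cross-references, reader-oriented docs |
| Understanding engineering workflows | ■■■■■■■■□□ | CI/CD, branching strategy, release and rollback, environment variables and run modes |
| API/integration documentation | ■■■■■■■■■□ | Request/response examples, error codes, authentication, retry/rate limiting, verification steps |
| Incident postmortems and prevention | ■■■■■■■■□□ | RCA, timeline, fix verification, preventive measures, runbook write-back |
| Quality gates and consistency checks | ■■■■■■■■■□ | Doc-code consistency checks, change summaries, tracking of to-be-confirmed items |
### Experience
* Familiar with the common structures and configuration conventions of multi-language projects (Node/Python/Go/Java, etc.)
* Writes docs as an "evidence chain": every key fact points to a file path or command output
* When uncertain, stops and marks the item as to be confirmed instead of making things up
### Code of Conduct
1. **Truth first**: write only what can be derived from project evidence; no guessing.
2. **Create if missing, update if present**: fill gaps with a minimal viable version; update existing docs incrementally and preserve history.
3. **Inventory before action**: provide the inventory table and plan before writing or outputting any document.
4. **Consistency over polished prose**: code/configuration wins; where needed, note "updated to match the current implementation".
5. **Executable first**: commands are copy-pasteable, paths are locatable, steps are actionable, and a newcomer can get running by following the docs.
### Communication Style
* Output in Chinese; engineering-oriented, clear, executable
* Prefer lists and tables; key paths and commands must be copy-pasteable
* When uncertain, use [To Be Confirmed] plus the evidence gap and verification guidance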
---
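## 📋 Task
### Core Objective
Based on the actual content of `~/project/`, **inventory, create, update, and archive** `~/project/docs/` according to the prescribed directory structure, and output document content or patches that can be written to disk directly, so that docs becomes the SSOT.
### Dependencies
* Must be able to read the project file tree and key file contents (README, configuration, dependency manifests, route/API definitions, scripts, CI configuration, etc.)
* With write access: create/modify files under `~/project/docs/` directly
* Without write access: output "complete per-file content" or a "unified diff patch" that can be copied to disk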
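### Execution Flow
#### Phase A: Scan the Project and Existing Docs
```
A1 Scan the project overview (cover at least)
└─> Output: project overview summary (list of evidence paths)
- README / entry services / directory structure
- Dependency manifests (package.json/pyproject/requirements/go.mod, etc.)
- Configuration files (.env* / yaml / toml / docker / k8s / terraform, etc.)
- API definitions (OpenAPI/Swagger/Proto/route code)
- Core business modules and boundaries (module layout, key domains)
A2 Scan the existing content of ~/project/docs/
└─> Output: docs file list + initial assessment (outdated/missing/duplicate/conflicting)
```
#### Phase B: Document Inventory Table and Generation/Update Plan
```
B1 Output the "Document Inventory Table"
└─> Output: status table grouped by directory (with evidence source paths)
B2 Output the "Generation/Update Plan"
└─> Output: list of new files, list of files to update, to-be-confirmed list
Note: the plan must be output before any document content is written
```
#### Phase C: Create and Update Docs by Priority
Default priority (may be adjusted to fit the project, but the reason must be stated):
```
1 guides/ └─> Get the team running (dev environment, workflow, troubleshooting, AI collaboration conventions)
2 integrations/ └─> APIs and third-party dependencies (the most error-prone area)
3 features/ └─> PRDs and specs (business and acceptance criteria)
4 architecture/ └─> ADRs (decisions and constraints, to prevent ungrounded suggestions)
5 incidents/ └─> Postmortems (capture context and prevention)
6 archive/ └─> Archive outdated but still valuable content
```
#### Phase D: Consistency Check and Delivery Summary
```
D1 Output the "Change Summary"
└─> Output: paths of added/updated/archived files + 3~8 key changes per file
D2 Output the "Consistency Checklist"
└─> Output: doc-code consistency check points + remaining [To Be Confirmed] items and next-step suggestions
```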
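### Decision Logic
```
IF a key fact lacks evidence THEN
mark it [To Be Confirmed] in the document
and give a verification path (file path/command/log/module)
ELSE IF the docs directory or a subdirectory is missing THEN
create a minimal viable first version (with Purpose/Scope/Status/Related/Changelog)
ELSE IF the document exists but conflicts with the implementation THEN
update the document to match code/configuration
and record an "updated to match the current implementation" change summary
ELSE
make only the necessary incremental updates
```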
---
## 🔄 Input/Output
### Input Specification
> You will receive a JSON object (or an equivalent key-value description). If the user only provides natural language, first normalize it into this structure and then proceed.
```json
{
"required_fields": {
"project_root": "string, default: ~/project",
"docs_root": "string, default: ~/project/docs",
"output_mode": "enum[direct_write|patch_diff|full_files], default: patch_diff",
"truthfulness_mode": "enum[strict], default: strict"
},
"optional_fields": {
"scope_hint": "string, default: null; the module/feature/directory the user wants to emphasize (e.g. 'auth' or 'services/api')",
"change_type": "enum[baseline|feature|bugfix|refactor|release], default: baseline",
"related_paths": "array[string], default: []; affected paths already known to the user (may be empty)",
"prefer_priority": "array[string], default: ['guides','integrations','features','architecture','incidents','archive']",
"enforce_docs_index": "boolean, default: true; force generation of docs/README.md as the navigation index",
"use_git_diff": "boolean, default: true; when available, focus updates based on git diff",
"max_doc_size_kb": "number, default: 200; suggested maximum size of a single document, split when exceeded",
"style": "enum[concise|standard|verbose], default: standard"
},
"validation_rules": [
"project_root and docs_root must be resolvable paths",
"output_mode must be one of direct_write / patch_diff / full_files",
"when truthfulness_mode=strict, factual statements without supporting evidence are forbidden",
"if use_git_diff=true and the repository uses git, prefer the diff to determine affected modules"
]
}
```
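Normalizing a partial input amounts to merging it over the defaults listed in the specification and validating `output_mode`. The sketch below illustrates that step under stated assumptions: `normalize_input` is a hypothetical helper name, and only the `output_mode` rule is enforced here.

```python
import copy
from typing import Any

# Defaults copied from the input specification above.
DEFAULTS: dict[str, Any] = {
    "project_root": "~/project",
    "docs_root": "~/project/docs",
    "output_mode": "patch_diff",
    "truthfulness_mode": "strict",
    "scope_hint": None,
    "change_type": "baseline",
    "related_paths": [],
    "prefer_priority": ["guides", "integrations", "features",
                        "architecture", "incidents", "archive"],
    "enforce_docs_index": True,
    "use_git_diff": True,
    "max_doc_size_kb": 200,
    "style": "standard",
}
OUTPUT_MODES = {"direct_write", "patch_diff", "full_files"}


def normalize_input(raw: dict[str, Any]) -> dict[str, Any]:
    """Fill in defaults and reject an invalid output_mode (hypothetical helper)."""
    # deepcopy so callers never share the mutable default lists
    merged = {**copy.deepcopy(DEFAULTS), **raw}
    if merged["output_mode"] not in OUTPUT_MODES:
        raise ValueError(f"invalid output_mode: {merged['output_mode']!r}")
    return merged
```

A caller that passes only `{"change_type": "feature"}` gets back a full structure with `patch_diff` output and strict truthfulness.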
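### Output Template Structure
> Output must be organized strictly in the following order so that both humans and automation tools can consume it.
```
1) Document inventory table
2) Generation/update plan
3) Per-file created/updated content
- direct_write: give the path and content to be written (or describe the write action)
- patch_diff: output a unified diff patch (recommended)
- full_files: output complete Markdown file by file
4) Change summary
5) Consistency checklist
```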
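### Document Map and Directory Structure Requirements
The following directory structure must be maintained (create it if it does not exist):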
```
~/project/docs/
├── architecture/
├── features/
├── integrations/
├── guides/
├── incidents/
└── archive/
```
### File Naming Conventions
* ADR: `docs/architecture/adr-YYYYMMDD-<kebab-topic>.md`
* PRD: `docs/features/prd-<kebab-feature>.md`
* Spec / technical design: `docs/features/spec-<kebab-feature>.md`
* Integration: `docs/integrations/<kebab-service-or-api>.md`
* Guide: `docs/guides/<kebab-topic>.md`
* Incident postmortem: `docs/incidents/incident-YYYYMMDD-<kebab-topic>.md`
* Archive: `docs/archive/YYYY/<original file name or topic>.md` (leave a note or pointer link at the original location)
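The naming patterns above are mechanical enough to generate. The sketch below builds three of them; the helper names (`kebab`, `adr_path`, `incident_path`, `prd_path`) are illustrative assumptions, and `kebab` handles only ASCII letters and digits.

```python
import re
from datetime import date


def kebab(text: str) -> str:
    """Lowercase the text and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", text.lower()))


def adr_path(topic: str, day: date) -> str:
    """docs/architecture/adr-YYYYMMDD-<kebab-topic>.md"""
    return f"docs/architecture/adr-{day:%Y%m%d}-{kebab(topic)}.md"


def incident_path(topic: str, day: date) -> str:
    """docs/incidents/incident-YYYYMMDD-<kebab-topic>.md"""
    return f"docs/incidents/incident-{day:%Y%m%d}-{kebab(topic)}.md"


def prd_path(feature: str) -> str:
    """docs/features/prd-<kebab-feature>.md"""
    return f"docs/features/prd-{kebab(feature)}.md"
```

For example, `adr_path("Use PostgreSQL", date(2025, 12, 20))` yields `docs/architecture/adr-20251220-use-postgresql.md`.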
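### Minimum Structure for Every Document
All documents must include:
* Purpose
* Scope
* Status (for example Active / Draft / Deprecated)
* Evidence (code paths / config files / command output sources)
* Related (links to other docs or code paths)
* Changelog (at least the last update time and a change summary)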
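A check for the six required sections can be automated. The sketch below assumes each section appears as a Markdown heading whose text contains the English section name; `missing_sections` is a hypothetical helper, not something the prompt defines.

```python
REQUIRED_SECTIONS = ("Purpose", "Scope", "Status", "Evidence", "Related", "Changelog")


def missing_sections(markdown: str) -> list[str]:
    """Return the required section names that no heading line mentions."""
    headings = [line.lstrip("#").strip().lower()
                for line in markdown.splitlines() if line.startswith("#")]
    return [name for name in REQUIRED_SECTIONS
            if not any(name.lower() in heading for heading in headings)]
```

An empty result means the document meets the minimum structure; anything else can be reported in the consistency checklist.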
---
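## 💡 Examples
> Each example shows "user input → what you should output"; the output content may be abbreviated, but the structure must be complete.
### Example 1: Basic Scenario, docs Is Empty
Input:
```json
{
"project_root": "~/project",
"docs_root": "~/project/docs",
"output_mode": "patch_diff",
"change_type": "baseline",
"scope_hint": "Project just started; docs is empty",
"enforce_docs_index": true,
"use_git_diff": false
}
```
Output (abbreviated):
```
1) Document inventory table
- guides/: missing, needs to be created (evidence: docs directory is empty)
- integrations/: missing, needs to be created (evidence: docs directory is empty)
...
2) Generation/update plan
- Add: docs/README.md (navigation)
- Add: docs/guides/getting-started.md (how to get running)
- Add: docs/guides/development-workflow.md (branching/PR/release)
- Add: docs/integrations/<...>.md (extracted from project dependencies)
- To be confirmed: runtime port/environment variables (must be read from .env / docker-compose / config)
3) Per-file patches
(diff...)
4) Change summary
...
5) Consistency checklist
...
```
Key points:
* Create only the "minimal viable" set, but it must be executable
* Run method, ports, environment variables, and the like must be evidenced from configuration; without evidence, mark them [To Be Confirmed]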
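### Example 2: Advanced Scenario, a New Feature Requires Syncing the PRD and API Docs
Input:
```json
{
"project_root": "~/project",
"docs_root": "~/project/docs",
"output_mode": "patch_diff",
"change_type": "feature",
"scope_hint": "New: user login and token refresh",
"related_paths": ["services/api/auth", "services/api/routes", "docs/integrations"],
"use_git_diff": true
}
```
Output (abbreviated):
```
1) Document inventory table
- features/prd-auth-login.md: missing, needs to be created (evidence: git diff shows a new auth module)
- features/spec-auth-login.md: missing, needs to be created (evidence: routes add /login /refresh)
- integrations/auth-api.md: exists but likely outdated (evidence: API fields/error codes do not match the route implementation)
- guides/getting-started.md: exists and up to date (not affected)
2) Generation/update plan
- Add: docs/features/prd-auth-login.md
- Add: docs/features/spec-auth-login.md
- Update: docs/integrations/auth-api.md (add request/response examples, error codes, authentication, verification commands)
- To be confirmed: token expiry (must be confirmed from configuration or a constants file: path...)
3) Per-file patches
(diff...)
4) Change summary
...
5) Consistency checklist
...
```
Key points:
* The PRD covers "why / acceptance criteria"; the Spec covers "API and behavior details"
* Integrations docs must include verifiable steps (curl / script paths)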
---
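### Example 3: Edge Case, No File System Access or Missing Evidence
Input:
```json
{
"project_root": "~/project",
"docs_root": "~/project/docs",
"output_mode": "full_files",
"change_type": "baseline",
"scope_hint": "I cannot provide the repository files; I can only describe the general situation",
"truthfulness_mode": "strict"
}
```
Output (abbreviated):
```
1) Document inventory table
[To Be Confirmed] The file system cannot be scanned, so the docs list cannot be produced (evidence gap: directory tree and key file contents are missing)
2) Generation/update plan
- Only an "actionable document template skeleton" can be generated; every factual field is marked [To Be Confirmed]
- To-be-confirmed list: project language/dependencies/start command/port/environment variables/API definition location...
3) Per-file content
- docs/README.md (navigation skeleton + to-be-confirmed notes)
- docs/guides/getting-started.md (step skeleton; every command marked [To Be Confirmed] + where to look)
...
4) Change summary
...
5) Consistency checklist
...
```
Key points:
* In strict mode, output "template + to be confirmed" rather than fabricate commands, ports, or fields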
---
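### ❌ Common Mistake (Avoid This)
Bad output example:
```
The project starts with Docker: docker compose up -d
The service port is 8080
The DATABASE_URL environment variable must be configured
```
Problems:
* No evidence source is given (which files / which lines / which command output)
* Ports and variables are high-risk facts; in strict mode they must be traceable, otherwise mark them [To Be Confirmed] and say where to confirm them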
---
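## 📊 Evaluation
### Scoring Criteria (100 Points Total)
| Dimension | Weight | Criteria |
| ---- | --- | ------------------------------------ |
| Accuracy | 30% | Does every key fact have an evidence path? Are unsupported facts correctly marked [To Be Confirmed]? |
| Completeness | 25% | Are all 6 directories covered? Inventory first, then plan, then execution? Are the change summary and consistency check present? |
| Clarity | 20% | Is the structure navigable? Are commands copy-pasteable? Can a reader get running by following the steps? |
| Efficiency | 15% | Does it focus on the diff / affected modules first? Are updates incremental rather than large rewrites? |
| Maintainability | 10% | Does it include a Changelog, cross-links, naming conventions, and a splitting strategy? |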
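The total is a weighted sum of the five dimensions. The sketch below shows the arithmetic, assuming each dimension is scored from 0 to 100; the dictionary keys and `total_score` are illustrative names.

```python
# Weights taken from the scoring table above.
WEIGHTS = {"accuracy": 0.30, "completeness": 0.25, "clarity": 0.20,
           "efficiency": 0.15, "maintainability": 0.10}


def total_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (each 0-100) into a weighted total."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

With accuracy at 0 and every other dimension at 100, the total is 25 + 20 + 15 + 10 = 70, so a response that loses all accuracy credit cannot reach a score of 85.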
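### Quality Checklist
#### Must Have (Critical)
* [ ] Output contains, in order: inventory table → plan → document content → change summary → consistency check
* [ ] Every factual statement gives an evidence source path, or is marked [To Be Confirmed] with verification guidance
* [ ] Follows "create if missing, update if present"; no pointless large rewrites
* [ ] Every modified document contains a Changelog (with last update time and change summary)
* [ ] The docs directory structure matches the prescribed 6 directories
#### Should Have (Important)
* [ ] Provide a docs/README.md navigation index (if enforce_docs_index=true)
* [ ] Integrations docs include verifiable steps (curl / scripts / test paths)
* [ ] Guides include FAQs and troubleshooting (drawn from real project pain points or logs/issues/tests)
#### Nice to Have
* [ ] Generate ADRs for key decisions (with Alternatives and Consequences)
* [ ] Give an archiving strategy for outdated content and keep a pointer at the original location
* [ ] Provide a "next to-be-confirmed list" that can be turned into issues directly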
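### Performance Baselines
* Stable response structure: always deliver in the 5-part structure
* Minimal document changes: do not rewrite more than 30% of a file unless necessary
* Actionable to-be-confirmed items: every [To Be Confirmed] includes a path or command suggesting where to find the evidence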
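The 30% rewrite budget needs a definition of "rewritten" before it can be checked. The sketch below uses one possible definition, the share of the old file's lines that have no unchanged counterpart in the new file according to `difflib`; this metric and the helper names are assumptions, not something the prompt specifies.

```python
import difflib


def rewritten_fraction(old_text: str, new_text: str) -> float:
    """Fraction of the old file's lines that did not survive unchanged."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    if not old_lines:
        return 0.0  # a new file rewrites nothing
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - kept / len(old_lines)


def exceeds_rewrite_budget(old_text: str, new_text: str, limit: float = 0.30) -> bool:
    """True when more than `limit` of the old file was rewritten."""
    return rewritten_fraction(old_text, new_text) > limit
```

Changing one line of a four-line file gives a fraction of 0.25, which stays inside the budget.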
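### Improvement Mechanism
* If the score is < 85, a next-round improvement list must be appended at the end (ordered by impact, highest first)
* If a single fabricated fact appears: the Accuracy dimension drops straight to 0, and a correction strategy must be given under exception handling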
---
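## ⚠️ Exceptions
### Scenario 1: Repository Not Accessible or Files Not Readable
```
Trigger:
- You cannot read ~/project/, or the user has not provided file contents / a directory tree
Handling:
1) State clearly that "a real scan cannot be performed" and enter strict degraded mode
2) Output only the docs structure and minimal viable templates for each document type
3) Mark every factual field [To Be Confirmed] and list the evidence the user needs to provide
Fallback:
- Ask the user to provide at least: tree (directory tree), README, dependency manifests, main config files, route/API definition locations
User guidance text:
- "Please provide the following files/output so that I can generate documentation consistent with the implementation: ... (list of paths/commands)"
```
### Scenario 2: Docs Conflict with Code
```
Trigger:
- Ports/commands/fields/error codes in docs do not match the code or configuration
Handling:
1) Update the docs to match code/configuration
2) Record the conflict and the reason for the update in the document's Changelog
3) If the conflict involves a behavior change or breaking change, suggest adding an ADR or annotating the PRD/Spec
Fallback:
- If it cannot be determined which side is "currently in effect", mark it [To Be Confirmed] and list runtime verification methods (tests/logs/commands)
```
### Scenario 3: Repository Too Large, Output Too Long
```
Trigger:
- Too many files/modules to cover completely in one pass
Handling:
1) Still output "inventory table (may be batched) + plan (phased)" first
2) Prioritize generating/updating the minimal viable set for guides/ and integrations/
3) List the rest as a "batch plan", with the evidence path range for each batch
Fallback:
- If the user gives scope_hint or related_paths, focus only on the affected modules and state "the scope of this run" explicitly
```
### Scenario 4: Sensitive Information or Secret Leakage Risk
```
Trigger:
- Config files contain sensitive content such as token/secret/key/password
Handling:
1) Describe only variable names and how to obtain the values; never output real secrets
2) Use REDACTED or placeholders in examples
3) Recommend moving sensitive configuration into secure storage (such as vault/secret manager) and explain this in guides
Fallback:
- If sensitive information is already in the repository, suggest creating an incident or security remediation document and point to the handling process
```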
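### Error Message Templates
```
ERROR_001: "Evidence source missing; cannot generate documentation consistent with the implementation."
Suggested action: provide the directory tree, README, dependency manifests, key configuration, and route/API definition locations.
ERROR_002: "Conflict detected between docs and implementation; docs were updated to match current code/configuration and the Changelog was recorded."
Suggested action: confirm whether an ADR or release notes are needed.
```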
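### Degradation Strategy
When a primary capability is unavailable (for example, the repository cannot be read or files cannot be written):
1. Output the docs structure and minimal viable template skeletons (strictly marked [To Be Confirmed])
2. Output an "evidence collection checklist" (commands the user can copy in one go)
3. Output full_files or patch_diff that can be written to disk (even a skeleton must be usable as-is)
### Escalation Decision Tree
```
IF the repository cannot be read AND the user can provide files/output THEN
request the minimal evidence set (tree/README/dependencies/config/API)
ELSE IF files cannot be written THEN
output_mode=patch_diff or full_files
ELSE
direct_write (and keep changes traceable)
```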
---
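## 🔧 Usage
### Quick Start
1. Copy the entire prompt as the AI Agent's system prompt or main prompt
2. Pass in the input for this task (JSON or natural language; JSON recommended)
3. Have the Agent output in the fixed structure: inventory table → plan → document content → summary → checklist
4. Write the resulting diff or file content to `~/project/docs/`
### Suggested Split Between System Prompt and User Prompt
* System prompt: Role, principles, execution flow, quality gates, exception handling
* User prompt: the input JSON for this task (change_type, scope_hint, related_paths, etc.)
### Parameter Tuning Suggestions
* For stricter engineering discipline:
* `enforce_docs_index=true` (force the docs/README.md navigation)
* `use_git_diff=true` (force updates to focus on the diff)
* `output_mode=patch_diff` (force an applicable patch)
* For brevity: `style=concise` (but the inventory table and plan must not be omitted)
* For safety: keep `truthfulness_mode=strict`; prefer [To Be Confirmed] over fabrication
### Version History
* v1.0.0 (2025-12-20): First industrial-grade DDD document steward prompt; includes an 8-layer structure, a strict evidence chain, inventory and planning first, disk-ready output modes, and an exception handling system.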
---
## 🎯 Ready-to-Paste Task Input Template
> Paste the following to the Agent as the "user prompt" (edit as needed):
```json
{
"project_root": "~/project",
"docs_root": "~/project/docs",
"output_mode": "patch_diff",
"truthfulness_mode": "strict",
"change_type": "baseline",
"scope_hint": "Maintain docs based on the actual content of ~/project/ so that it becomes the SSOT",
"related_paths": [],
"prefer_priority": ["guides", "integrations", "features", "architecture", "incidents", "archive"],
"enforce_docs_index": true,
"use_git_diff": true,
"max_doc_size_kb": 200,
"style": "standard"
}
```


@ -1 +0,0 @@
# Third-party libraries (read-only)


@ -0,0 +1,12 @@
{
"mcpServers": {
"skill-seeker": {
"command": "python3",
"args": [
"/REPLACE/WITH/YOUR/PATH/Skill_Seekers/mcp/server.py"
],
"cwd": "/REPLACE/WITH/YOUR/PATH/Skill_Seekers",
"env": {}
}
}
}


@ -0,0 +1,57 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
# Virtual Environment
venv/
ENV/
env/
# Output directory
output/
*.zip
# IDE
.vscode/
.idea/
*.swp
*.swo
*~
# OS
.DS_Store
Thumbs.db
# Backups
*.backup
# Testing artifacts
.pytest_cache/
.coverage
htmlcov/
.tox/
*.cover
.hypothesis/
.mypy_cache/
.ruff_cache/
# Build artifacts
.build/


@ -0,0 +1,292 @@
# Async Support Documentation
## 🚀 Async Mode for High-Performance Scraping
As of this release, Skill Seeker supports **asynchronous scraping** for dramatically improved performance when scraping documentation websites.
---
## ⚡ Performance Benefits
| Metric | Sync (Threads) | Async | Improvement |
|--------|----------------|-------|-------------|
| **Pages/second** | ~15-20 | ~40-60 | **2-3x faster** |
| **Memory per worker** | ~10-15 MB | ~1-2 MB | **80-90% less** |
| **Max concurrent** | ~50-100 | ~500-1000 | **10x more** |
| **CPU efficiency** | GIL-limited, thread-switching overhead | Single-threaded event loop, no thread switching | **Lower overhead** |
---
## 📋 How to Enable Async Mode
### Option 1: Command Line Flag
```bash
# Enable async mode with 8 workers for best performance
python3 cli/doc_scraper.py --config configs/react.json --async --workers 8
# Quick mode with async
python3 cli/doc_scraper.py --name react --url https://react.dev/ --async --workers 8
# Dry run with async to test
python3 cli/doc_scraper.py --config configs/godot.json --async --workers 4 --dry-run
```
### Option 2: Configuration File
Add `"async_mode": true` to your config JSON:
```json
{
"name": "react",
"base_url": "https://react.dev/",
"async_mode": true,
"workers": 8,
"rate_limit": 0.5,
"max_pages": 500
}
```
Then run normally:
```bash
python3 cli/doc_scraper.py --config configs/react-async.json
```
---
## 🎯 Recommended Settings
### Small Documentation (~100-500 pages)
```bash
--async --workers 4
```
### Medium Documentation (~500-2000 pages)
```bash
--async --workers 8
```
### Large Documentation (2000+ pages)
```bash
--async --workers 8 --no-rate-limit
```
**Note:** More workers isn't always better. Test with 4, then 8, to find optimal performance for your use case.
---
## 🔧 Technical Implementation
### What Changed
**New Methods:**
- `async def scrape_page_async()` - Async version of page scraping
- `async def scrape_all_async()` - Async version of scraping loop
**Key Technologies:**
- **httpx.AsyncClient** - Async HTTP client with connection pooling
- **asyncio.Semaphore** - Concurrency control (replaces threading.Lock)
- **asyncio.gather()** - Parallel task execution
- **asyncio.sleep()** - Non-blocking rate limiting
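The way these pieces fit together can be sketched with the standard library alone. This is an illustrative example, not Skill Seeker's actual implementation: `fetch` is a stand-in for a real HTTP request, and `scrape_all` and `scrape_one` are hypothetical names.

```python
import asyncio


async def fetch(url: str) -> str:
    """Stand-in for a real HTTP request (assumption: swap in an async client)."""
    await asyncio.sleep(0.01)  # simulate network latency without blocking
    return f"content of {url}"


async def scrape_all(urls: list[str], workers: int, rate_limit: float) -> list[str]:
    """Fetch every URL with at most `workers` requests in flight."""
    semaphore = asyncio.Semaphore(workers)

    async def scrape_one(url: str) -> str:
        async with semaphore:                # wait for a free worker slot
            page = await fetch(url)
            await asyncio.sleep(rate_limit)  # non-blocking delay before freeing the slot
            return page

    # gather() runs the coroutines concurrently and returns results in input order
    return await asyncio.gather(*(scrape_one(u) for u in urls))


pages = asyncio.run(scrape_all([f"https://example.com/{i}" for i in range(5)],
                               workers=2, rate_limit=0.0))
```

The semaphore caps concurrency at the worker count, while the awaited sleep applies the delay without stalling other tasks.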
**Backwards Compatibility:**
- Async mode is **opt-in** (default: sync mode)
- All existing configs work unchanged
- Zero breaking changes
---
## 📊 Benchmarks
### Test Case: React Documentation (7,102 chars, 500 pages)
**Sync Mode (Threads):**
```bash
python3 cli/doc_scraper.py --config configs/react.json --workers 8
# Time: ~45 minutes
# Pages/sec: ~18
# Memory: ~120 MB
```
**Async Mode:**
```bash
python3 cli/doc_scraper.py --config configs/react.json --async --workers 8
# Time: ~15 minutes (3x faster!)
# Pages/sec: ~55
# Memory: ~40 MB (66% less)
```
---
## ⚠️ Important Notes
### When to Use Async
✅ **Use async when:**
- Scraping 500+ pages
- Using 4+ workers
- Network latency is high
- Memory is constrained
❌ **Don't use async when:**
- Scraping < 100 pages (overhead not worth it)
- workers = 1 (no parallelism benefit)
- Testing/debugging (sync is simpler)
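The guidance above can be condensed into a small decision helper. This is a sketch under stated assumptions: `should_use_async` is a hypothetical function, it combines the page and worker thresholds with AND, and it keeps sync mode for the 100 to 500 page range that the lists leave open.

```python
def should_use_async(pages: int, workers: int, debugging: bool = False) -> bool:
    """Hypothetical helper encoding the thresholds listed above."""
    # Cases where the guidance says async is not worth it
    if debugging or workers <= 1 or pages < 100:
        return False
    # Recommend async only for large scrapes with enough workers (assumption: AND)
    return pages >= 500 and workers >= 4
```

Network latency and memory pressure are not modeled here; they would need measurements from the actual run.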
### Rate Limiting
Async mode respects rate limits just like sync mode:
```bash
# 0.5 second delay between requests (default)
--async --workers 8 --rate-limit 0.5
# No rate limiting (use carefully!)
--async --workers 8 --no-rate-limit
```
### Checkpoints
Async mode supports checkpoints for resuming interrupted scrapes:
```json
{
"async_mode": true,
"checkpoint": {
"enabled": true,
"interval": 1000
}
}
```
---
## 🧪 Testing
Async mode includes comprehensive tests:
```bash
# Run async-specific tests
python -m pytest tests/test_async_scraping.py -v
# Run all tests
python cli/run_tests.py
```
**Test Coverage:**
- 11 async-specific tests
- Configuration tests
- Routing tests (sync vs async)
- Error handling
- llms.txt integration
---
## 🐛 Troubleshooting
### "Too many open files" error
Reduce worker count:
```bash
--async --workers 4 # Instead of 8
```
### Async mode slower than sync
This can happen with:
- Very low worker count (use >= 4)
- Very fast local network (async overhead not worth it)
- Small documentation (< 100 pages)
**Solution:** Use sync mode for small docs, async for large ones.
### Memory usage still high
Async reduces memory per worker, but:
- BeautifulSoup parsing is still memory-intensive
- More workers = more memory
**Solution:** Use 4-6 workers instead of 8-10.
---
## 📚 Examples
### Example 1: Fast scraping with async
```bash
# Godot documentation (~1,600 pages)
python3 cli/doc_scraper.py \
--config configs/godot.json \
--async \
--workers 8 \
--rate-limit 0.3
# Result: ~12 minutes (vs 40 minutes sync)
```
### Example 2: Respectful scraping with async
```bash
# Django documentation with polite rate limiting
python3 cli/doc_scraper.py \
--config configs/django.json \
--async \
--workers 4 \
--rate-limit 1.0
# Still faster than sync, but respectful to server
```
### Example 3: Testing async mode
```bash
# Dry run to test async without actual scraping
python3 cli/doc_scraper.py \
--config configs/react.json \
--async \
--workers 8 \
--dry-run
# Preview URLs, test configuration
```
---
## 🔮 Future Enhancements
Planned improvements for async mode:
- [ ] Adaptive worker scaling based on server response time
- [ ] Connection pooling optimization
- [ ] Progress bars for async scraping
- [ ] Real-time performance metrics
- [ ] Automatic retry with backoff for failed requests
---
## 💡 Best Practices
1. **Start with 4 workers** - Test, then increase if needed
2. **Use --dry-run first** - Verify configuration before scraping
3. **Respect rate limits** - Don't disable unless necessary
4. **Monitor memory** - Reduce workers if memory usage is high
5. **Use checkpoints** - Enable for large scrapes (>1000 pages)
---
## 📖 Additional Resources
- **Main README**: [README.md](README.md)
- **Technical Docs**: [docs/CLAUDE.md](docs/CLAUDE.md)
- **Test Suite**: [tests/test_async_scraping.py](tests/test_async_scraping.py)
- **Configuration Guide**: See `configs/` directory for examples
---
## ✅ Version Information
- **Feature**: Async Support
- **Version**: Added in current release
- **Status**: Production-ready
- **Test Coverage**: 11 async-specific tests, all passing
- **Backwards Compatible**: Yes (opt-in feature)


@ -0,0 +1,518 @@
# Bulletproof Quick Start Guide
**Target Audience:** Complete beginners | Never used Python/git before? Start here!
**Time:** 15-30 minutes total (including all installations)
**Result:** Working Skill Seeker installation + your first Claude skill created
---
## 📋 What You'll Need
Before starting, you need:
- A computer (macOS, Linux, or Windows with WSL)
- Internet connection
- 30 minutes of time
That's it! We'll install everything else together.
---
## Step 1: Install Python (5 minutes)
### Check if You Already Have Python
Open Terminal (macOS/Linux) or Command Prompt (Windows) and type:
```bash
python3 --version
```
**✅ If you see:** `Python 3.10.x` or `Python 3.11.x` or higher → **Skip to Step 2!**
**❌ If you see:** `command not found` or version less than 3.10 → **Continue below**
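If you would rather let Python check itself, the comparison can be scripted. This is a minimal sketch; the 3.10 minimum comes from this guide, and `python_is_supported` is an illustrative name.

```python
import sys

MIN_VERSION = (3, 10)  # minimum version stated in this guide


def python_is_supported(version: tuple = sys.version_info[:2]) -> bool:
    """Return True when the (major, minor) version meets the guide's minimum."""
    return tuple(version[:2]) >= MIN_VERSION


if not python_is_supported():
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is too old; "
          "install 3.10 or newer.")
```

Save it as a file and run it with `python3`; no output means your version is new enough.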
### Install Python
#### macOS:
```bash
# Install Homebrew (if not installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install Python
brew install python3
```
**Verify:**
```bash
python3 --version
# Should show: Python 3.11.x or similar
```
#### Linux (Ubuntu/Debian):
```bash
sudo apt update
sudo apt install python3 python3-pip
```
**Verify:**
```bash
python3 --version
pip3 --version
```
#### Windows:
1. Download Python from: https://www.python.org/downloads/
2. Run installer
3. **IMPORTANT:** Check "Add Python to PATH" during installation
4. Open Command Prompt and verify:
```bash
python --version
```
**✅ Success looks like:**
```
Python 3.11.5
```
---
## Step 2: Install Git (3 minutes)
### Check if You Have Git
```bash
git --version
```
**✅ If you see:** `git version 2.x.x` → **Skip to Step 3!**
**❌ If not installed:**
#### macOS:
```bash
brew install git
```
#### Linux:
```bash
sudo apt install git
```
#### Windows:
Download from: https://git-scm.com/download/win
**Verify:**
```bash
git --version
# Should show: git version 2.x.x
```
---
## Step 3: Get Skill Seeker (2 minutes)
### Choose Where to Put It
Pick a location for the project. Good choices:
- macOS/Linux: `~/Projects/` or `~/Documents/`
- Note: `~` means your home directory (`$HOME` or `/Users/yourname` on macOS, `/home/yourname` on Linux)
- Windows: `C:\Users\YourName\Projects\`
### Clone the Repository
```bash
# Create Projects directory (if it doesn't exist)
mkdir -p ~/Projects
cd ~/Projects
# Clone Skill Seeker
git clone https://github.com/yusufkaraaslan/Skill_Seekers.git
# Enter the directory
cd Skill_Seekers
```
**✅ Success looks like:**
```
Cloning into 'Skill_Seekers'...
remote: Enumerating objects: 245, done.
remote: Counting objects: 100% (245/245), done.
```
**Verify you're in the right place:**
```bash
pwd
# Should show something like:
# macOS: /Users/yourname/Projects/Skill_Seekers
# Linux: /home/yourname/Projects/Skill_Seekers
# (Replace 'yourname' with YOUR actual username)
ls
# Should show: README.md, cli/, mcp/, configs/, etc.
```
**❌ If `git clone` fails:**
```bash
# Check internet connection
ping google.com
# Or download ZIP manually:
# https://github.com/yusufkaraaslan/Skill_Seekers/archive/refs/heads/main.zip
# Then unzip and cd into it
```
---
## Step 4: Setup Virtual Environment & Install Dependencies (3 minutes)
A virtual environment keeps Skill Seeker's dependencies isolated and prevents conflicts.
```bash
# Make sure you're in the Skill_Seekers directory
cd ~/Projects/Skill_Seekers # ~ means your home directory ($HOME)
# Adjust if you chose a different location
# Create virtual environment
python3 -m venv venv
# Activate it
source venv/bin/activate # macOS/Linux
# Windows users: venv\Scripts\activate
```
**✅ Success looks like:**
```
(venv) username@computer Skill_Seekers %
```
Notice `(venv)` appears in your prompt - this means the virtual environment is active!
```bash
# Now install packages (only needed once)
pip install requests beautifulsoup4 pytest
# Save the dependency list
pip freeze > requirements.txt
```
**✅ Success looks like:**
```
Successfully installed requests-2.32.5 beautifulsoup4-4.14.2 pytest-8.4.2 ...
```
**Optional - Only if you want API-based enhancement (not needed for LOCAL enhancement):**
```bash
pip install anthropic
```
**Important Notes:**
- **Every time** you open a new terminal to use Skill Seeker, run `source venv/bin/activate` first
- You'll know it's active when you see `(venv)` in your terminal prompt
- To deactivate later: just type `deactivate`
**❌ If python3 not found:**
```bash
# Try without the 3
python -m venv venv
```
**❌ If permission denied:**
```bash
# Virtual environment approach doesn't need sudo - you might have the wrong path
# Make sure you're in the Skill_Seekers directory:
pwd
# Should show something like:
# macOS: /Users/yourname/Projects/Skill_Seekers
# Linux: /home/yourname/Projects/Skill_Seekers
# (Replace 'yourname' with YOUR actual username)
```
---
## Step 5: Test Your Installation (1 minute)
Let's make sure everything works:
```bash
# Test the main script can run
skill-seekers scrape --help
```
**✅ Success looks like:**
```
usage: doc_scraper.py [-h] [--config CONFIG] [--interactive] ...
```
**❌ If you see "No such file or directory":**
```bash
# Check you're in the right directory
pwd
# Should show path ending in /Skill_Seekers
# List files
ls cli/
# Should show: doc_scraper.py, estimate_pages.py, etc.
```
---
## Step 6: Create Your First Skill! (5-10 minutes)
Let's create a simple skill using a preset configuration.
### Option A: Small Test (Recommended First Time)
```bash
# Create a config for a small site first
cat > configs/test.json << 'EOF'
{
"name": "test-skill",
"description": "Test skill creation",
"base_url": "https://tailwindcss.com/docs/installation",
"max_pages": 5,
"rate_limit": 0.5
}
EOF
# Run the scraper
skill-seekers scrape --config configs/test.json
```
**What happens:**
1. Scrapes 5 pages from Tailwind CSS docs
2. Creates `output/test-skill/` directory
3. Generates SKILL.md and reference files
**⏱️ Time:** ~30 seconds
**✅ Success looks like:**
```
Scraping: https://tailwindcss.com/docs/installation
Page 1/5: Installation
Page 2/5: Editor Setup
...
✅ Skill created at: output/test-skill/
```
### Option B: Full Example (React Docs)
```bash
# Use the React preset
skill-seekers scrape --config configs/react.json --max-pages 50
```
**⏱️ Time:** ~5 minutes
**What you get:**
- `output/react/SKILL.md` - Main skill file
- `output/react/references/` - Organized documentation
### Verify It Worked
```bash
# Check the output
ls output/test-skill/
# Should show: SKILL.md, references/, scripts/, assets/
# Look at the generated skill
head output/test-skill/SKILL.md
```
---
## Step 7: Package for Claude (30 seconds)
```bash
# Package the skill
skill-seekers package output/test-skill/
```
**✅ Success looks like:**
```
✅ Skill packaged successfully!
📦 Created: output/test-skill.zip
📏 Size: 45.2 KB
Ready to upload to Claude AI!
```
**Now you have:** `output/test-skill.zip` ready to upload to Claude!
---
## Step 8: Upload to Claude (2 minutes)
1. Go to https://claude.ai
2. Click your profile → Settings
3. Click "Knowledge" or "Skills"
4. Click "Upload Skill"
5. Select `output/test-skill.zip`
6. Done! Claude can now use this skill
---
## 🎉 Success! What's Next?
You now have a working Skill Seeker installation! Here's what you can do:
### Try Other Presets
```bash
# See all available presets
ls configs/
# Try Vue.js
skill-seekers scrape --config configs/vue.json --max-pages 50
# Try Django
skill-seekers scrape --config configs/django.json --max-pages 50
```
### Create Custom Skills
```bash
# Interactive mode - answer questions
skill-seekers scrape --interactive
# Or create config for any website
skill-seekers scrape \
--name myframework \
--url https://docs.myframework.com/ \
--description "My favorite framework"
```
### Use with Claude Code (Advanced)
If you have Claude Code installed:
```bash
# One-time setup
./setup_mcp.sh
# Then use natural language in Claude Code:
# "Generate a skill for Svelte docs"
# "Package the skill at output/svelte/"
```
**See:** [docs/MCP_SETUP.md](docs/MCP_SETUP.md) for full MCP setup
---
## 🔧 Troubleshooting
### "Command not found" errors
**Problem:** `python3: command not found`
**Solution:** Python not installed or not in PATH
- macOS/Linux: Reinstall Python with brew/apt
- Windows: Reinstall Python, check "Add to PATH"
- Try `python` instead of `python3`
### "Permission denied" errors
**Problem:** Can't install packages or run scripts
**Solution:**
```bash
# Use --user flag
pip3 install --user requests beautifulsoup4
# Or make script executable
chmod +x cli/doc_scraper.py
```
### "No such file or directory"
**Problem:** Can't find cli/doc_scraper.py
**Solution:** You're not in the right directory
```bash
# Go to the Skill_Seekers directory
cd ~/Projects/Skill_Seekers # Adjust your path
# Verify
ls cli/
# Should show doc_scraper.py
```
### "ModuleNotFoundError"
**Problem:** Missing Python packages
**Solution:**
```bash
# Install dependencies again
pip3 install requests beautifulsoup4
# If that fails, try:
pip3 install --user requests beautifulsoup4
```
### Scraping is slow or fails
**Problem:** Takes forever or gets errors
**Solution:**
```bash
# Use smaller max_pages for testing
skill-seekers scrape --config configs/react.json --max-pages 10
# Check internet connection
ping google.com
# Check the website is accessible
curl -I https://docs.yoursite.com
```
### Still stuck?
1. **Check our detailed troubleshooting guide:** [TROUBLESHOOTING.md](TROUBLESHOOTING.md)
2. **Open an issue:** https://github.com/yusufkaraaslan/Skill_Seekers/issues
3. **Include this info:**
- Operating system (macOS 13, Ubuntu 22.04, Windows 11, etc.)
- Python version (`python3 --version`)
- Full error message
- What command you ran
---
## 📚 Next Steps
- **Read the full README:** [README.md](README.md)
- **Learn about presets:** [configs/](configs/)
- **Try MCP integration:** [docs/MCP_SETUP.md](docs/MCP_SETUP.md)
- **Advanced usage:** [docs/](docs/)
---
## ✅ Quick Reference
```bash
# Your typical workflow:
# 1. Create/use a config
skill-seekers scrape --config configs/react.json --max-pages 50
# 2. Package it
skill-seekers package output/react/
# 3. Upload output/react.zip to Claude
# Done! 🎉
```
**Common locations:**
- **Configs:** `configs/*.json`
- **Output:** `output/skill-name/`
- **Packaged skills:** `output/skill-name.zip`
**Time estimates:**
- Small skill (5-10 pages): 30 seconds
- Medium skill (50-100 pages): 3-5 minutes
- Large skill (500+ pages): 15-30 minutes
---
**Still confused?** That's okay! Open an issue and we'll help you get started: https://github.com/yusufkaraaslan/Skill_Seekers/issues/new

View File

@ -0,0 +1,693 @@
# Changelog
All notable changes to Skill Seekers will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [Unreleased]
---
## [2.1.1] - 2025-11-30
### 🚀 GitHub Repository Analysis Enhancements
This release significantly improves GitHub repository scraping with unlimited local analysis, configurable directory exclusions, and numerous bug fixes.
### Added
- **Configurable directory exclusions** for local repository analysis ([#203](https://github.com/yusufkaraaslan/Skill_Seekers/issues/203))
- `exclude_dirs_additional`: Extend default exclusions with custom directories
- `exclude_dirs`: Replace default exclusions entirely (advanced users)
- 19 comprehensive tests covering all scenarios
- Logging: INFO for extend mode, WARNING for replace mode
- **Unlimited local repository analysis** via `local_repo_path` configuration parameter
- **Auto-exclusion** of virtual environments, build artifacts, and cache directories
- **Support for analyzing repositories without GitHub API rate limits** (50 → unlimited files)
- **Skip llms.txt option** - Force HTML scraping even when llms.txt is detected ([#198](https://github.com/yusufkaraaslan/Skill_Seekers/pull/198))
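The difference between the two exclusion options can be sketched as follows. This is an illustration only, not the project's code: the default directory names, the function name `resolve_excluded_dirs`, and the rule that `exclude_dirs` wins when both keys are present are all assumptions. Only the two option names and the INFO/WARNING log levels come from the entries above.

```python
import logging

logger = logging.getLogger("exclusions_sketch")

# Hypothetical defaults; the real default list is not given in this changelog.
DEFAULT_EXCLUDED_DIRS = {"venv", ".venv", "node_modules", "__pycache__", "build", "dist"}


def resolve_excluded_dirs(config):
    """Return the set of directory names to skip for a given config dict."""
    if "exclude_dirs" in config:
        # Replace mode: the defaults are discarded entirely (assumed to take precedence).
        logger.warning("Replacing default exclusions with %s", sorted(config["exclude_dirs"]))
        return set(config["exclude_dirs"])
    extra = set(config.get("exclude_dirs_additional", []))
    if extra:
        # Extend mode: defaults stay, custom directories are added on top.
        logger.info("Extending default exclusions with %s", sorted(extra))
    return DEFAULT_EXCLUDED_DIRS | extra
```

Extend mode is the safer choice for most projects, because replace mode also drops the virtual environment and build directory exclusions unless they are listed again.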
### Fixed
- Fixed logger initialization error causing `AttributeError: 'NoneType' object has no attribute 'setLevel'` ([#190](https://github.com/yusufkaraaslan/Skill_Seekers/issues/190))
- Fixed 3 NoneType subscriptable errors in release tag parsing
- Fixed relative import paths causing `ModuleNotFoundError`
- Fixed hardcoded 50-file analysis limit preventing comprehensive code analysis
- Fixed GitHub API file tree limitation (140 → 345 files discovered)
- Fixed AST parser "not iterable" errors, eliminating 100% of parsing failures (95 → 0 errors)
- Fixed virtual environment file pollution, reducing file tree noise by 95%
- Fixed `force_rescrape` flag not being checked before the interactive prompt, which caused `EOFError` in CI/CD environments
### Improved
- Increased code analysis coverage from 14% to 93.6% (+79.6 percentage points)
- Improved file discovery from 140 to 345 files (+146%)
- Improved class extraction from 55 to 585 classes (+964%)
- Improved function extraction from 512 to 2,784 functions (+444%)
- Test suite expanded to 427 tests (up from 391)
---
## [2.1.0] - 2025-11-12
### 🎉 Major Enhancement: Quality Assurance + Race Condition Fixes
This release focuses on quality and reliability improvements, adding comprehensive quality checks and fixing critical race conditions in the enhancement workflow.
### 🚀 Major Features
#### Comprehensive Quality Checker
- **Automatic quality checks before packaging** - Validates skill quality before upload
- **Quality scoring system** - 0-100 score with A-F grades
- **Enhancement verification** - Checks for template text, code examples, sections
- **Structure validation** - Validates SKILL.md, references/ directory
- **Content quality checks** - YAML frontmatter, language tags, "When to Use" section
- **Link validation** - Validates internal markdown links
- **Detailed reporting** - Errors, warnings, and info messages with file locations
- **CLI tool** - `skill-seekers-quality-checker` with verbose and strict modes
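As a rough picture of how a 0-100 score could map to a letter grade, consider the sketch below. The cutoffs (90/80/70/60) and the function name `grade_for_score` are assumptions for illustration; the changelog does not state the actual thresholds used by the quality checker.

```python
def grade_for_score(score):
    """Map a 0-100 quality score to a letter grade (cutoffs are assumed, not documented)."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for cutoff, grade in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if score >= cutoff:
            return grade
    return "F"
```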
#### Headless Enhancement Mode (Default)
- **No terminal windows** - Runs enhancement in background by default
- **Proper waiting** - Main console waits for enhancement to complete
- **Timeout protection** - 10-minute default timeout (configurable)
- **Verification** - Checks that SKILL.md was actually updated
- **Progress messages** - Clear status updates during enhancement
- **Interactive mode available** - `--interactive-enhancement` flag for terminal mode
### Added
#### New CLI Tools
- **quality_checker.py** - Comprehensive skill quality validation
- Structure checks (SKILL.md, references/)
- Enhancement verification (code examples, sections)
- Content validation (frontmatter, language tags)
- Link validation (internal markdown links)
- Quality scoring (0-100 + A-F grade)
#### New Features
- **Headless enhancement** - `skill-seekers-enhance` runs in background by default
- **Quality checks in packaging** - Automatic validation before creating .zip
- **MCP quality skip** - MCP server skips interactive checks
- **Enhanced error handling** - Better error messages and timeout handling
#### Tests
- **+12 quality checker tests** - Comprehensive validation testing
- **391 total tests passing** - Up from 379 in v2.0.0
- **0 test failures** - All tests green
- **CI improvements** - Fixed macOS terminal detection tests
### Changed
#### Enhancement Workflow
- **Default mode changed** - Headless mode is now default (was terminal mode)
- **Waiting behavior** - Main console waits for enhancement completion
- **No race conditions** - Fixed "Package your skill" message appearing too early
- **Better progress** - Clear status messages during enhancement
#### Package Workflow
- **Quality checks added** - Automatic validation before packaging
- **User confirmation** - Ask to continue if warnings/errors found
- **Skip option** - `--skip-quality-check` flag to bypass checks
- **MCP context** - Automatically skips checks in non-interactive contexts
#### CLI Arguments
- **doc_scraper.py:**
- Updated `--enhance-local` help text (mentions headless mode)
- Added `--interactive-enhancement` flag
- **enhance_skill_local.py:**
- Changed default to `headless=True`
- Added `--interactive-enhancement` flag
- Added `--timeout` flag (default: 600 seconds)
- **package_skill.py:**
- Added `--skip-quality-check` flag
### Fixed
#### Critical Bugs
- **Enhancement race condition** - Main console no longer exits before enhancement completes
- **MCP stdin errors** - MCP server now skips interactive prompts
- **Terminal detection tests** - Fixed for headless mode default
#### Enhancement Issues
- **Process detachment** - Enhancement now uses `subprocess.run()`, which waits for the process to exit, instead of `Popen()`, which returned immediately
- **Timeout handling** - Added timeout protection to prevent infinite hangs
- **Verification** - Checks file modification time and size to verify success
- **Error messages** - Better error handling and user-friendly messages
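The combination of waiting, timeout protection, and verification described above might look roughly like this. It is a sketch under stated assumptions, not the code in `enhance_skill_local.py`: the function name `run_and_verify` and its exact success criteria are hypothetical. The 600-second default mirrors the `--timeout` default listed earlier.

```python
import subprocess
from pathlib import Path


def run_and_verify(cmd, target, timeout=600):
    """Run cmd, wait for it to exit, and report whether target was updated."""
    target = Path(target)
    before = target.stat() if target.exists() else None
    try:
        # subprocess.run() blocks until the child exits or the timeout expires.
        result = subprocess.run(cmd, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    if result.returncode != 0 or not target.exists():
        return False
    after = target.stat()
    if before is None:
        return after.st_size > 0
    # A newer modification time or a different size counts as evidence of an update.
    return after.st_mtime_ns > before.st_mtime_ns or after.st_size != before.st_size
```

Checking the file rather than trusting the exit code alone catches the case where the child process exits cleanly without writing anything.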
#### Test Fixes
- **package_skill tests** - Added skip_quality_check=True to prevent stdin errors
- **Terminal detection tests** - Updated to use headless=False for interactive tests
- **MCP server tests** - Fixed to skip quality checks in non-interactive context
### Technical Details
#### New Modules
- `src/skill_seekers/cli/quality_checker.py` - Quality validation engine
- `tests/test_quality_checker.py` - 12 comprehensive tests
#### Modified Modules
- `src/skill_seekers/cli/enhance_skill_local.py` - Added headless mode
- `src/skill_seekers/cli/doc_scraper.py` - Updated enhancement integration
- `src/skill_seekers/cli/package_skill.py` - Added quality checks
- `src/skill_seekers/mcp/server.py` - Skip quality checks in MCP context
- `tests/test_package_skill.py` - Updated for quality checker
- `tests/test_terminal_detection.py` - Updated for headless default
#### Commits in This Release
- `e279ed6` - Phase 1: Enhancement race condition fix (headless mode)
- `3272f9c` - Phases 2 & 3: Quality checker implementation
- `2dd1027` - Phase 4: Tests (+12 quality checker tests)
- `befcb89` - CI Fix: Skip quality checks in MCP context
- `67ab627` - CI Fix: Update terminal tests for headless default
### Upgrade Notes
#### Breaking Changes
- **Headless mode default** - Enhancement now runs in background by default
- Use `--interactive-enhancement` if you want the old terminal mode
- Affects: `skill-seekers-enhance` and `skill-seekers scrape --enhance-local`
#### New Behavior
- **Quality checks** - Packaging now runs quality checks by default
- May prompt for confirmation if warnings/errors found
- Use `--skip-quality-check` to bypass (not recommended)
#### Recommendations
- **Try headless mode** - Faster and more reliable than terminal mode
- **Review quality reports** - Fix warnings before packaging
- **Update scripts** - Add `--skip-quality-check` to automated packaging scripts if needed
### Migration Guide
**If you want the old terminal mode behavior:**
```bash
# Old (v2.0.0): Default was terminal mode
skill-seekers-enhance output/react/
# New (v2.1.0): Use --interactive-enhancement
skill-seekers-enhance output/react/ --interactive-enhancement
```
**If you want to skip quality checks:**
```bash
# Add --skip-quality-check to package command
skill-seekers-package output/react/ --skip-quality-check
```
---
## [2.0.0] - 2025-11-11
### 🎉 Major Release: PyPI Publication + Modern Python Packaging
**Skill Seekers is now available on PyPI!** Install with: `pip install skill-seekers`
This is a major milestone release featuring complete restructuring for modern Python packaging, comprehensive testing improvements, and publication to the Python Package Index.
### 🚀 Major Changes
#### PyPI Publication
- **Published to PyPI** - https://pypi.org/project/skill-seekers/
- **Installation:** `pip install skill-seekers` or `uv tool install skill-seekers`
- **No cloning required** - Install globally or in virtual environments
- **Automatic dependency management** - All dependencies handled by pip/uv
#### Modern Python Packaging
- **pyproject.toml-based configuration** - Standard PEP 621 metadata
- **src/ layout structure** - Best practice package organization
- **Entry point scripts** - `skill-seekers` command available globally
- **Proper dependency groups** - Separate dev, test, and MCP dependencies
- **Build backend** - setuptools-based build with uv support
#### Unified CLI Interface
- **Single `skill-seekers` command** - Git-style subcommands
- **Subcommands:** `scrape`, `github`, `pdf`, `unified`, `enhance`, `package`, `upload`, `estimate`
- **Consistent interface** - All tools accessible through one entry point
- **Help system** - Comprehensive `--help` for all commands
### Added
#### Testing Infrastructure
- **379 passing tests** (up from 299) - Comprehensive test coverage
- **0 test failures** - All tests passing successfully
- **Test suite improvements:**
- Fixed import paths for src/ layout
- Updated CLI tests for unified entry points
- Added package structure verification tests
- Fixed MCP server import tests
- Added pytest configuration in pyproject.toml
#### Documentation
- **Updated README.md** - PyPI badges, reordered installation options
- **FUTURE_RELEASES.md** - Roadmap for upcoming features
- **Installation guides** - Simplified with PyPI as primary method
- **Testing documentation** - How to run full test suite
### Changed
#### Package Structure
- **Moved to src/ layout:**
- `src/skill_seekers/` - Main package
- `src/skill_seekers/cli/` - CLI tools
- `src/skill_seekers/mcp/` - MCP server
- **Import paths updated** - All imports use proper package structure
- **Entry points configured** - All CLI tools available as commands
#### Import Fixes
- **Fixed `merge_sources.py`** - Corrected conflict_detector import (`.conflict_detector`)
- **Fixed MCP server tests** - Updated to use `skill_seekers.mcp.server` imports
- **Fixed test paths** - All tests updated for src/ layout
### Fixed
#### Critical Bugs
- **Import path errors** - Fixed relative imports in CLI modules
- **MCP test isolation** - Added proper MCP availability checks
- **Package installation** - Resolved entry point conflicts
- **Dependency resolution** - All dependencies properly specified
#### Test Improvements
- **17 test fixes** - Updated for modern package structure
- **MCP test guards** - Proper skipif decorators for MCP tests
- **CLI test updates** - Accept both exit codes 0 and 2 for help
- **Path validation** - Tests verify correct package structure
### Technical Details
#### Build System
- **Build backend:** setuptools.build_meta
- **Build command:** `uv build`
- **Publish command:** `uv publish`
- **Distribution formats:** wheel + source tarball
#### Dependencies
- **Core:** requests, beautifulsoup4, PyGithub, mcp, httpx
- **PDF:** PyMuPDF, Pillow, pytesseract
- **Dev:** pytest, pytest-cov, pytest-anyio, mypy
- **MCP:** mcp package for Claude Code integration
### Migration Guide
#### For Users
**Old way:**
```bash
git clone https://github.com/yusufkaraaslan/Skill_Seekers.git
cd Skill_Seekers
pip install -r requirements.txt
python3 cli/doc_scraper.py --config configs/react.json
```
**New way:**
```bash
pip install skill-seekers
skill-seekers scrape --config configs/react.json
```
#### For Developers
- Update imports: `from cli.*` → `from skill_seekers.cli.*`
- Use `pip install -e ".[dev]"` for development
- Run tests: `python -m pytest`
- Entry points instead of direct script execution
### Breaking Changes
- **CLI interface changed** - Use `skill-seekers` command instead of `python3 cli/...`
- **Import paths changed** - Package now at `skill_seekers.*` instead of `cli.*`
- **Installation method changed** - PyPI recommended over git clone
### Deprecations
- **Direct script execution** - Still works but deprecated (use `skill-seekers` command)
- **Old import patterns** - Legacy imports still work but will be removed in v3.0
### Compatibility
- **Python 3.10+** required
- **Backward compatible** - Old scripts still work with legacy CLI
- **Config files** - No changes required
- **Output format** - No changes to generated skills
---
## [1.3.0] - 2025-10-26
### Added - Refactoring & Performance Improvements
- **Async/Await Support for Parallel Scraping** (2-3x performance boost)
- `--async` flag to enable async mode
- `async def scrape_page_async()` method using httpx.AsyncClient
- `async def scrape_all_async()` method with asyncio.gather()
- Connection pooling for better performance
- asyncio.Semaphore for concurrency control
- Comprehensive async testing (11 new tests)
- Full documentation in ASYNC_SUPPORT.md
- Performance: ~55 pages/sec vs ~18 pages/sec (sync)
- Memory: 40 MB vs 120 MB (~67% reduction)
- **Python Package Structure** (Phase 0 Complete)
- `cli/__init__.py` - CLI tools package with clean imports
- `skill_seeker_mcp/__init__.py` - MCP server package (renamed from mcp/)
- `skill_seeker_mcp/tools/__init__.py` - MCP tools subpackage
- Proper package imports: `from cli import constants`
- **Centralized Configuration Module**
- `cli/constants.py` with 18 configuration constants
- `DEFAULT_ASYNC_MODE`, `DEFAULT_RATE_LIMIT`, `DEFAULT_MAX_PAGES`
- Enhancement limits, categorization scores, file limits
- All magic numbers now centralized and configurable
- **Code Quality Improvements**
- Converted 71 print() statements to proper logging calls
- Added type hints to all DocToSkillConverter methods
- Fixed all mypy type checking issues
- Installed types-requests for better type safety
- Multi-variant llms.txt detection: downloads all 3 variants (full, standard, small)
- Automatic .txt → .md file extension conversion
- No content truncation: preserves complete documentation
- `detect_all()` method for finding all llms.txt variants
- `get_proper_filename()` for correct .md naming
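The async scraping pattern named in the first entry above (`asyncio.gather()` bounded by an `asyncio.Semaphore`) can be sketched as follows. This is an illustration, not the project's `scrape_all_async()`: the names `scrape_all`, `fetch`, and `fake_fetch` are hypothetical, and a stand-in coroutine replaces the `httpx.AsyncClient` request so the example runs without network access.

```python
import asyncio


async def scrape_all(urls, fetch, max_concurrency=5):
    """Fetch every URL concurrently, with at most max_concurrency requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url):
        async with semaphore:
            return url, await fetch(url)

    # gather() returns results in the same order as its inputs.
    results = await asyncio.gather(*(fetch_one(u) for u in urls))
    return dict(results)


async def fake_fetch(url):
    # Stand-in for an HTTP request (assumption: real code would call an HTTP client here).
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


if __name__ == "__main__":
    urls = [f"https://example.com/{i}" for i in range(10)]
    pages = asyncio.run(scrape_all(urls, fake_fetch))
    print(len(pages), "pages fetched")
```

The semaphore is what keeps concurrency polite: without it, `gather()` would start every request at once.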
### Changed
- `_try_llms_txt()` now downloads all available variants instead of just one
- Reference files now contain complete content (no 2500 char limit)
- Code samples now include full code (no 600 char limit)
- Test count increased from 207 to 299 (92 new tests)
- All print() statements replaced with logging (logger.info, logger.warning, logger.error)
- Better IDE support with proper package structure
- Code quality improved from 5.5/10 to 6.5/10
### Fixed
- File extension bug: llms.txt files now saved as .md
- Content loss: 0% truncation (was 36%)
- Test isolation issues in test_async_scraping.py (proper cleanup with try/finally)
- Import issues: no more sys.path.insert() hacks needed
- .gitignore: added test artifacts (.pytest_cache, .coverage, htmlcov, etc.)
---
## [1.2.0] - 2025-10-23
### 🚀 PDF Advanced Features Release
Major enhancement to PDF extraction capabilities with Priority 2 & 3 features.
### Added
#### Priority 2: Support More PDF Types
- **OCR Support for Scanned PDFs**
- Automatic text extraction from scanned documents using Tesseract OCR
- Fallback mechanism when page text < 50 characters
- Integration with pytesseract and Pillow
- Command: `--ocr` flag
- New dependencies: `Pillow==11.0.0`, `pytesseract==0.3.13`
- **Password-Protected PDF Support**
- Handle encrypted PDFs with password authentication
- Clear error messages for missing/wrong passwords
- Secure password handling
- Command: `--password PASSWORD` flag
- **Complex Table Extraction**
- Extract tables from PDFs using PyMuPDF's table detection
- Capture table data as 2D arrays with metadata (bbox, row/col count)
- Integration with skill references in markdown format
- Command: `--extract-tables` flag
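The OCR fallback rule (switch to OCR when a page yields fewer than 50 characters of text) can be expressed in a few lines. This is a sketch only: the function name `page_text_with_fallback`, the `run_ocr` callback, and the choice to strip whitespace before counting are assumptions, and no real OCR library is called here.

```python
OCR_FALLBACK_THRESHOLD = 50  # characters; threshold taken from the OCR entry above


def page_text_with_fallback(extracted_text, run_ocr):
    """Return the extracted text, or the OCR result when the page has too little text."""
    if len(extracted_text.strip()) < OCR_FALLBACK_THRESHOLD:
        # Likely a scanned page: the text layer is empty or nearly empty.
        return run_ocr()
    return extracted_text
```

Passing the OCR step in as a callback keeps the expensive call lazy, so it only runs for pages that need it.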
#### Priority 3: Performance Optimizations
- **Parallel Page Processing**
- 3x faster PDF extraction using ThreadPoolExecutor
- Auto-detect CPU count or custom worker specification
- Only activates for PDFs with > 5 pages
- Commands: `--parallel` and `--workers N` flags
- Benchmarks: 500-page PDF reduced from 4m 10s to 1m 15s
- **Intelligent Caching**
- In-memory cache for expensive operations (text extraction, code detection, quality scoring)
- 50% faster on re-runs
- Command: `--no-cache` to disable (enabled by default)
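The parallel page processing described above (a `ThreadPoolExecutor`, an auto-detected worker count, and activation only above 5 pages) could be structured like this. It is a hedged sketch rather than the code in `pdf_extractor_poc.py`: `extract_pages` and `extract_one` are hypothetical names.

```python
import os
from concurrent.futures import ThreadPoolExecutor

PARALLEL_MIN_PAGES = 5  # parallel mode only activates above this page count


def extract_pages(pages, extract_one, parallel=True, workers=None):
    """Apply extract_one to every page, using threads for larger documents."""
    if not parallel or len(pages) <= PARALLEL_MIN_PAGES:
        return [extract_one(page) for page in pages]
    # Fall back to the CPU count when no worker count is given.
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so page order is preserved.
        return list(pool.map(extract_one, pages))
```

Skipping the pool for small documents avoids paying thread start-up cost where it cannot be recovered.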
#### New Documentation
- **`docs/PDF_ADVANCED_FEATURES.md`** (580 lines)
- Complete usage guide for all advanced features
- Installation instructions
- Performance benchmarks showing 3x speedup
- Best practices and troubleshooting
- API reference with all parameters
#### Testing
- **New test file:** `tests/test_pdf_advanced_features.py` (568 lines, 26 tests)
- TestOCRSupport (5 tests)
- TestPasswordProtection (4 tests)
- TestTableExtraction (5 tests)
- TestCaching (5 tests)
- TestParallelProcessing (4 tests)
- TestIntegration (3 tests)
- **Updated:** `tests/test_pdf_extractor.py` (23 tests fixed and passing)
- **Total PDF tests:** 49/49 PASSING ✅ (100% pass rate)
### Changed
- Enhanced `cli/pdf_extractor_poc.py` with all advanced features
- Updated `requirements.txt` with new dependencies
- Updated `README.md` with PDF advanced features usage
- Updated `docs/TESTING.md` with new test counts (142 total tests)
### Performance Improvements
- **3.3x faster** with parallel processing (8 workers)
- **1.7x faster** on re-runs with caching enabled
- Support for unlimited page PDFs (no more 500-page limit)
### Dependencies
- Added `Pillow==11.0.0` for image processing
- Added `pytesseract==0.3.13` for OCR support
- Tesseract OCR engine (system package, optional)
---
## [1.1.0] - 2025-10-22
### 🌐 Documentation Scraping Enhancements
Major improvements to documentation scraping with unlimited pages, parallel processing, and new configs.
### Added
#### Unlimited Scraping & Performance
- **Unlimited Page Scraping** - Removed 500-page limit, now supports unlimited pages
- **Parallel Scraping Mode** - Process multiple pages simultaneously for faster scraping
- **Dynamic Rate Limiting** - Smart rate limit control to avoid server blocks
- **CLI Utilities** - New helper scripts for common tasks
#### New Configurations
- **Ansible Core 2.19** - Complete Ansible documentation config
- **Claude Code** - Documentation for this very tool!
- **Laravel 9.x** - PHP framework documentation
#### Testing & Quality
- Comprehensive test coverage for CLI utilities
- Parallel scraping test suite
- Virtual environment setup documentation
- Thread-safety improvements
### Fixed
- Thread-safety issues in parallel scraping
- CLI path references across all documentation
- Flaky upload_skill tests
- MCP server streaming subprocess implementation
### Changed
- All CLI examples now use `cli/` directory prefix
- Updated documentation structure
- Enhanced error handling
---
## [1.0.0] - 2025-10-19
### 🎉 First Production Release
This is the first production-ready release of Skill Seekers with complete feature set, full test coverage, and comprehensive documentation.
### Added
#### Smart Auto-Upload Feature
- New `upload_skill.py` CLI tool for automatic API-based upload
- Enhanced `package_skill.py` with `--upload` flag
- Smart API key detection with graceful fallback
- Cross-platform folder opening in `utils.py`
- Helpful error messages instead of confusing errors
#### MCP Integration Enhancements
- **9 MCP tools** (added `upload_skill` tool)
- `mcp__skill-seeker__upload_skill` - Upload .zip files to Claude automatically
- Enhanced `package_skill` tool with smart auto-upload parameter
- Updated all MCP documentation to reflect 9 tools
#### Documentation Improvements
- Updated README with version badge (v1.0.0)
- Enhanced upload guide with 3 upload methods
- Updated MCP setup guide with all 9 tools
- Comprehensive test documentation (14/14 tests)
- All references to tool counts corrected
### Fixed
- Missing `import os` in `mcp/server.py`
- `package_skill.py` exit code behavior (now exits 0 when API key missing)
- Improved UX with helpful messages instead of errors
### Changed
- Test count badge updated (96 → 14 passing)
- All documentation references updated to 9 tools
### Testing
- **CLI Tests:** 8/8 PASSED ✅
- **MCP Tests:** 6/6 PASSED ✅
- **Total:** 14/14 PASSED (100%)
---
## [0.4.0] - 2025-10-18
### Added
#### Large Documentation Support (40K+ Pages)
- Config splitting functionality for massive documentation sites
- Router/hub skill generation for intelligent query routing
- Checkpoint/resume feature for long scrapes
- Parallel scraping support for faster processing
- 4 split strategies: auto, category, router, size
#### New CLI Tools
- `split_config.py` - Split large configs into focused sub-skills
- `generate_router.py` - Generate router/hub skills
- `package_multi.py` - Package multiple skills at once
#### New MCP Tools
- `split_config` - Split large documentation via MCP
- `generate_router` - Generate router skills via MCP
#### Documentation
- New `docs/LARGE_DOCUMENTATION.md` guide
- Example config: `godot-large-example.json` (40K pages)
### Changed
- MCP tool count: 6 → 8 tools
- Updated documentation for large docs workflow
---
## [0.3.0] - 2025-10-15
### Added
#### MCP Server Integration
- Complete MCP server implementation (`mcp/server.py`)
- 6 MCP tools for Claude Code integration:
- `list_configs`
- `generate_config`
- `validate_config`
- `estimate_pages`
- `scrape_docs`
- `package_skill`
#### Setup & Configuration
- Automated setup script (`setup_mcp.sh`)
- MCP configuration examples
- Comprehensive MCP setup guide (`docs/MCP_SETUP.md`)
- MCP testing guide (`docs/TEST_MCP_IN_CLAUDE_CODE.md`)
#### Testing
- 31 comprehensive unit tests for MCP server
- Integration tests via Claude Code MCP protocol
- 100% test pass rate
#### Documentation
- Complete MCP integration documentation
- Natural language usage examples
- Troubleshooting guides
### Changed
- Restructured project as monorepo with CLI and MCP server
- Moved CLI tools to `cli/` directory
- Added MCP server to `mcp/` directory
---
## [0.2.0] - 2025-10-10
### Added
#### Testing & Quality
- Comprehensive test suite with 71 tests
- 100% test pass rate
- Test coverage for all major features
- Config validation tests
#### Optimization
- Page count estimator (`estimate_pages.py`)
- Framework config optimizations with `start_urls`
- Better URL pattern coverage
- Improved scraping efficiency
#### New Configs
- Kubernetes documentation config
- Tailwind CSS config
- Astro framework config
### Changed
- Optimized all framework configs
- Improved categorization accuracy
- Enhanced error messages
---
## [0.1.0] - 2025-10-05
### Added
#### Initial Release
- Basic documentation scraper functionality
- Manual skill creation
- Framework configs (Godot, React, Vue, Django, FastAPI)
- Smart categorization system
- Code language detection
- Pattern extraction
- Local and API-based enhancement options
- Basic packaging functionality
#### Core Features
- BFS traversal for documentation scraping
- CSS selector-based content extraction
- Smart categorization with scoring
- Code block detection and formatting
- Caching system for scraped data
- Interactive mode for config creation
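Breadth-first traversal, as used for documentation scraping, visits pages level by level from the start URL. The sketch below illustrates the idea under stated assumptions: `bfs_crawl` and `get_links` are hypothetical names, and link extraction is passed in as a function so the example needs no network access.

```python
from collections import deque


def bfs_crawl(start_url, get_links, max_pages=100):
    """Visit pages breadth-first from start_url and return them in visit order."""
    visited = []
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            # Track queued URLs so that no page is visited twice.
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

With a page limit in place, breadth-first order means the pages closest to the start URL are collected first.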
#### Documentation
- README with quick start guide
- Basic usage documentation
- Configuration file examples
---
## Release Links
- [v1.2.0](https://github.com/yusufkaraaslan/Skill_Seekers/releases/tag/v1.2.0) - PDF Advanced Features
- [v1.1.0](https://github.com/yusufkaraaslan/Skill_Seekers/releases/tag/v1.1.0) - Documentation Scraping Enhancements
- [v1.0.0](https://github.com/yusufkaraaslan/Skill_Seekers/releases/tag/v1.0.0) - Production Release
- [v0.4.0](https://github.com/yusufkaraaslan/Skill_Seekers/releases/tag/v0.4.0) - Large Documentation Support
- [v0.3.0](https://github.com/yusufkaraaslan/Skill_Seekers/releases/tag/v0.3.0) - MCP Integration
---
## Version History Summary
| Version | Date | Highlights |
|---------|------|------------|
| **2.1.1** | 2025-11-30 | 🚀 GitHub repository analysis: unlimited local analysis, configurable directory exclusions |
| **2.1.0** | 2025-11-12 | 🎉 Quality checker, headless enhancement mode, race condition fixes |
| **2.0.0** | 2025-11-11 | 🎉 PyPI publication, modern Python packaging, unified CLI |
| **1.3.0** | 2025-10-26 | Async scraping, Python package structure, centralized configuration |
| **1.2.0** | 2025-10-23 | 📄 PDF advanced features: OCR, passwords, tables, 3x faster |
| **1.1.0** | 2025-10-22 | 🌐 Unlimited scraping, parallel mode, new configs (Ansible, Laravel) |
| **1.0.0** | 2025-10-19 | 🚀 Production release, auto-upload, 9 MCP tools |
| **0.4.0** | 2025-10-18 | 📚 Large docs support (40K+ pages) |
| **0.3.0** | 2025-10-15 | 🔌 MCP integration with Claude Code |
| **0.2.0** | 2025-10-10 | 🧪 Testing & optimization |
| **0.1.0** | 2025-10-05 | 🎬 Initial release |
---
[Unreleased]: https://github.com/yusufkaraaslan/Skill_Seekers/compare/v1.2.0...HEAD
[1.2.0]: https://github.com/yusufkaraaslan/Skill_Seekers/compare/v1.1.0...v1.2.0
[1.1.0]: https://github.com/yusufkaraaslan/Skill_Seekers/compare/v1.0.0...v1.1.0
[1.0.0]: https://github.com/yusufkaraaslan/Skill_Seekers/compare/v0.4.0...v1.0.0
[0.4.0]: https://github.com/yusufkaraaslan/Skill_Seekers/compare/v0.3.0...v0.4.0
[0.3.0]: https://github.com/yusufkaraaslan/Skill_Seekers/compare/v0.2.0...v0.3.0
[0.2.0]: https://github.com/yusufkaraaslan/Skill_Seekers/releases/tag/v0.2.0
[0.1.0]: https://github.com/yusufkaraaslan/Skill_Seekers/releases/tag/v0.1.0

View File

@ -0,0 +1,860 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## 🎯 Current Status (November 30, 2025)
**Version:** v2.1.1 (Production Ready - GitHub Analysis Enhanced!)
**Active Development:** Flexible, incremental task-based approach
### Recent Updates (November 2025):
**🎉 MAJOR MILESTONE: Published on PyPI! (v2.0.0)**
- **📦 PyPI Publication**: Install with `pip install skill-seekers` - https://pypi.org/project/skill-seekers/
- **🔧 Modern Python Packaging**: pyproject.toml, src/ layout, entry points
- **✅ CI/CD Fixed**: All 5 test matrix jobs passing (Ubuntu + macOS, Python 3.10-3.12)
- **📚 Documentation Complete**: README, CHANGELOG, FUTURE_RELEASES.md all updated
- **🚀 Unified CLI**: Single `skill-seekers` command with Git-style subcommands
- **🧪 Test Coverage**: 427 tests passing (up from 391), 39% coverage
- **🌐 Community**: GitHub Discussion, Release notes, announcements published
**🚀 Unified Multi-Source Scraping (v2.0.0)**
- **NEW**: Combine documentation + GitHub + PDF in one skill
- **NEW**: Automatic conflict detection between docs and code
- **NEW**: Rule-based and AI-powered merging
- **NEW**: 5 example unified configs (React, Django, FastAPI, Godot, FastAPI-test)
- **Status**: ✅ All 22 unified tests passing (18 core + 4 MCP integration)
**✅ Community Response (H1 Group):**
- **Issue #8 Fixed** - Added BULLETPROOF_QUICKSTART.md and TROUBLESHOOTING.md for beginners
- **Issue #7 Fixed** - Fixed all 11 configs (Django, Laravel, Astro, Tailwind) - 100% working
- **Issue #4 Linked** - Connected to roadmap Tasks A2/A3 (knowledge sharing + website)
- **PR #5 Reviewed** - Approved anchor stripping feature (security verified, 32/32 tests pass)
- **MCP Setup Fixed** - Path expansion bug resolved in setup_mcp.sh
**📦 Configs Status:**
- ✅ **24 total configs available** (including unified configs)
- ✅ 5 unified configs added (React, Django, FastAPI, Godot, FastAPI-test)
- ✅ Core selectors tested and validated
- 📝 Single-source configs: ansible-core, astro, claude-code, django, fastapi, godot, godot-large-example, hono, kubernetes, laravel, react, steam-economy-complete, tailwind, vue
- 📝 Multi-source configs: django_unified, fastapi_unified, fastapi_unified_test, godot_unified, react_unified
- 📝 Test/Example configs: godot_github, react_github, python-tutorial-test, example_pdf, test-manual
**📋 Completed (November 29, 2025):**
- **✅ DONE**: PyPI publication complete (v2.0.0)
- **✅ DONE**: CI/CD fixed - all checks passing
- **✅ DONE**: Documentation updated (README, CHANGELOG, FUTURE_RELEASES.md)
- **✅ DONE**: Quality Assurance + Race Condition Fixes (v2.1.0)
- **✅ DONE**: All critical bugs fixed (Issues #190, #192, #193)
- **✅ DONE**: Test suite stabilized (427 tests passing)
- **✅ DONE**: Unified tests fixed (all 22 passing)
- **✅ DONE**: PR #195 merged - Unlimited local repository analysis
- **✅ DONE**: PR #198 merged - Skip llms.txt config option
- **✅ DONE**: Issue #203 - Configurable EXCLUDED_DIRS (19 tests, 2 commits)
**📋 Next Up (Post-v2.1.0):**
- **Priority 1**: Review open PRs (#187, #186)
- **Priority 2**: Issue #202 - Add warning for missing local_repo_path
- **Priority 3**: Task H1.3 - Create example project folder
- **Priority 4**: Task A3.1 - GitHub Pages site (skillseekersweb.com)
**📊 Roadmap Progress:**
- 134 tasks organized into 22 feature groups
- Project board: https://github.com/users/yusufkaraaslan/projects/2
- See [FLEXIBLE_ROADMAP.md](FLEXIBLE_ROADMAP.md) for complete task list
---
## 🔌 MCP Integration Available
**This repository includes a fully tested MCP server with 9 tools:**
- `mcp__skill-seeker__list_configs` - List all available preset configurations
- `mcp__skill-seeker__generate_config` - Generate a new config file for any docs site
- `mcp__skill-seeker__validate_config` - Validate a config file structure
- `mcp__skill-seeker__estimate_pages` - Estimate page count before scraping
- `mcp__skill-seeker__scrape_docs` - Scrape and build a skill
- `mcp__skill-seeker__package_skill` - Package skill into .zip file (with auto-upload)
- `mcp__skill-seeker__upload_skill` - Upload .zip to Claude (NEW)
- `mcp__skill-seeker__split_config` - Split large documentation configs
- `mcp__skill-seeker__generate_router` - Generate router/hub skills
**Setup:** See [docs/MCP_SETUP.md](docs/MCP_SETUP.md) or run `./setup_mcp.sh`
**Status:** ✅ Tested and working in production with Claude Code
## Overview
Skill Seekers automatically converts any documentation website into a Claude AI skill. It scrapes documentation, organizes content, extracts code patterns, and packages everything into an uploadable `.zip` file for Claude.
## Prerequisites
**Python Version:** Python 3.10 or higher (required for MCP integration)
**Installation:**
### Option 1: Install from PyPI (Recommended - Easiest!)
```bash
# Install globally or in virtual environment
pip install skill-seekers
# Use the unified CLI immediately
skill-seekers scrape --config configs/react.json
skill-seekers --help
```
### Option 2: Install from Source (For Development)
```bash
# Clone the repository
git clone https://github.com/yusufkaraaslan/Skill_Seekers.git
cd Skill_Seekers
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # macOS/Linux (Windows: venv\Scripts\activate)
# Install in editable mode
pip install -e .
# Or install dependencies manually
pip install -r requirements.txt
```
**Why use a virtual environment?**
- Keeps dependencies isolated from system Python
- Prevents package version conflicts
- Standard Python development practice
- Required for running tests with pytest
**Optional (for API-based enhancement):**
```bash
pip install anthropic
export ANTHROPIC_API_KEY=sk-ant-...
```
## Core Commands
### Quick Start - Use a Preset
```bash
# Single-source scraping (documentation only)
skill-seekers scrape --config configs/godot.json
skill-seekers scrape --config configs/react.json
skill-seekers scrape --config configs/vue.json
skill-seekers scrape --config configs/django.json
skill-seekers scrape --config configs/laravel.json
skill-seekers scrape --config configs/fastapi.json
```
### Unified Multi-Source Scraping (**NEW - v2.0.0**)
```bash
# Combine documentation + GitHub + PDF in one skill
skill-seekers unified --config configs/react_unified.json
skill-seekers unified --config configs/django_unified.json
skill-seekers unified --config configs/fastapi_unified.json
skill-seekers unified --config configs/godot_unified.json
# Override merge mode
skill-seekers unified --config configs/react_unified.json --merge-mode claude-enhanced
# Result: One comprehensive skill with conflict detection
```
**What makes it special:**
- ✅ Detects discrepancies between documentation and code
- ✅ Shows both versions side-by-side with ⚠️ warnings
- ✅ Identifies outdated docs and undocumented features
- ✅ Single source of truth showing intent (docs) AND reality (code)
**See full guide:** [docs/UNIFIED_SCRAPING.md](docs/UNIFIED_SCRAPING.md)
### First-Time User Workflow (Recommended)
```bash
# 1. Install from PyPI (one-time, easiest!)
pip install skill-seekers
# 2. Estimate page count BEFORE scraping (fast, no data download)
skill-seekers estimate configs/godot.json
# Time: ~1-2 minutes, shows estimated total pages and recommended max_pages
# 3. Scrape with local enhancement (uses Claude Code Max, no API key)
skill-seekers scrape --config configs/godot.json --enhance-local
# Time: 20-40 minutes scraping + 60 seconds enhancement
# 4. Package the skill
skill-seekers package output/godot/
# Result: godot.zip ready to upload to Claude
```
### Interactive Mode
```bash
# Step-by-step configuration wizard
skill-seekers scrape --interactive
```
### Quick Mode (Minimal Config)
```bash
# Create skill from any documentation URL
skill-seekers scrape --name react --url https://react.dev/ --description "React framework for UIs"
```
### Skip Scraping (Use Cached Data)
```bash
# Fast rebuild using previously scraped data
skill-seekers scrape --config configs/godot.json --skip-scrape
# Time: 1-3 minutes (rebuilds from cached data, no re-scraping)
```
### Async Mode (2-3x Faster Scraping)
```bash
# Enable async mode with 8 workers for best performance
skill-seekers scrape --config configs/react.json --async --workers 8
# Quick mode with async
skill-seekers scrape --name react --url https://react.dev/ --async --workers 8
# Dry run with async to test
skill-seekers scrape --config configs/godot.json --async --workers 4 --dry-run
```
**Recommended Settings:**
- Small docs (~100-500 pages): `--async --workers 4`
- Medium docs (~500-2000 pages): `--async --workers 8`
- Large docs (2000+ pages): `--async --workers 8 --no-rate-limit`
**Performance:**
- Sync: ~18 pages/sec, 120 MB memory
- Async: ~55 pages/sec, 40 MB memory (3x faster!)
**See full guide:** [ASYNC_SUPPORT.md](ASYNC_SUPPORT.md)
### Enhancement Options
**LOCAL Enhancement (Recommended - No API Key Required):**
```bash
# During scraping
skill-seekers scrape --config configs/react.json --enhance-local
# Standalone after scraping
skill-seekers enhance output/react/
```
**API Enhancement (Alternative - Requires API Key):**
```bash
# During scraping
skill-seekers scrape --config configs/react.json --enhance
# Standalone after scraping
skill-seekers-enhance output/react/
skill-seekers-enhance output/react/ --api-key sk-ant-...
```
### Package and Upload the Skill
```bash
# Package skill (opens folder, shows upload instructions)
skill-seekers package output/godot/
# Result: output/godot.zip
# Package and auto-upload (requires ANTHROPIC_API_KEY)
export ANTHROPIC_API_KEY=sk-ant-...
skill-seekers package output/godot/ --upload
# Upload existing .zip
skill-seekers upload output/godot.zip
# Package without opening folder
skill-seekers package output/godot/ --no-open
```
### Force Re-scrape
```bash
# Delete cached data and re-scrape from scratch
rm -rf output/godot_data/
skill-seekers scrape --config configs/godot.json
```
### Estimate Page Count (Before Scraping)
```bash
# Quick estimation - discover up to 100 pages
skill-seekers estimate configs/react.json --max-discovery 100
# Time: ~30-60 seconds
# Full estimation - discover up to 1000 pages (default)
skill-seekers estimate configs/godot.json
# Time: ~1-2 minutes
# Deep estimation - discover up to 2000 pages
skill-seekers estimate configs/vue.json --max-discovery 2000
# Time: ~3-5 minutes
# What it shows:
# - Estimated total pages
# - Recommended max_pages value
# - Estimated scraping time
# - Discovery rate (pages/sec)
```
**Why use estimation:**
- Validates config URL patterns before full scrape
- Helps set optimal `max_pages` value
- Estimates total scraping time
- Fast (only HEAD requests + minimal parsing)
- No data downloaded or stored
## Repository Architecture
### File Structure (v2.0.0 - Modern Python Packaging)
```
Skill_Seekers/
├── pyproject.toml                  # Modern Python package configuration (PEP 621)
├── src/                            # Source code (src/ layout best practice)
│   └── skill_seekers/
│       ├── __init__.py
│       ├── cli/                    # CLI tools (entry points)
│       │   ├── doc_scraper.py      # Main scraper (~790 lines)
│       │   ├── estimate_pages.py   # Page count estimator
│       │   ├── enhance_skill.py    # AI enhancement (API-based)
│       │   ├── package_skill.py    # Skill packager
│       │   ├── github_scraper.py   # GitHub scraper
│       │   ├── pdf_scraper.py      # PDF scraper
│       │   ├── unified_scraper.py  # Unified multi-source scraper
│       │   ├── merge_sources.py    # Source merger
│       │   └── conflict_detector.py # Conflict detection
│       └── mcp/                    # MCP server integration
│           └── server.py
├── tests/                          # Test suite (391 tests passing)
│   ├── test_scraper_features.py
│   ├── test_config_validation.py
│   ├── test_integration.py
│   ├── test_mcp_server.py
│   ├── test_unified.py             # Unified scraping tests (18 tests)
│   ├── test_unified_mcp_integration.py # (4 tests)
│   └── ...
├── configs/                        # Preset configurations (24 configs)
│   ├── godot.json
│   ├── react.json
│   ├── django_unified.json         # Multi-source configs
│   └── ...
├── docs/                           # Documentation
│   ├── CLAUDE.md                   # This file
│   ├── ENHANCEMENT.md              # Enhancement guide
│   ├── UPLOAD_GUIDE.md             # Upload instructions
│   └── UNIFIED_SCRAPING.md         # Unified scraping guide
├── README.md                       # User documentation
├── CHANGELOG.md                    # Release history
├── FUTURE_RELEASES.md              # Roadmap
└── output/                         # Generated output (git-ignored)
    ├── {name}_data/                # Scraped raw data (cached)
    │   ├── pages/*.json            # Individual page data
    │   └── summary.json            # Scraping summary
    └── {name}/                     # Built skill directory
        ├── SKILL.md                # Main skill file
        ├── SKILL.md.backup         # Backup (if enhanced)
        ├── references/             # Categorized documentation
        │   ├── index.md
        │   ├── getting_started.md
        │   ├── api.md
        │   └── ...
        ├── scripts/                # Empty (user scripts)
        └── assets/                 # Empty (user assets)
```
**Key Changes in v2.0.0:**
- **src/ layout**: Modern Python packaging structure
- **pyproject.toml**: PEP 621 compliant configuration
- **Entry points**: `skill-seekers` CLI with subcommands
- **Published to PyPI**: `pip install skill-seekers`
### Data Flow
1. **Scrape Phase** (`scrape_all()` in src/skill_seekers/cli/doc_scraper.py):
- Input: Config JSON (name, base_url, selectors, url_patterns, categories)
- Process: BFS traversal from base_url, respecting include/exclude patterns
- Output: `output/{name}_data/pages/*.json` + `summary.json`
2. **Build Phase** (`build_skill()` in src/skill_seekers/cli/doc_scraper.py):
- Input: Scraped JSON data from `output/{name}_data/`
- Process: Load pages → Smart categorize → Extract patterns → Generate references
- Output: `output/{name}/SKILL.md` + `output/{name}/references/*.md`
3. **Enhancement Phase** (optional via enhance_skill.py or enhance_skill_local.py):
- Input: Built skill directory with references
- Process: Claude analyzes references and rewrites SKILL.md
- Output: Enhanced SKILL.md with real examples and guidance
4. **Package Phase** (via package_skill.py):
- Input: Skill directory
- Process: Zip all files (excluding .backup)
- Output: `{name}.zip`
5. **Upload Phase** (optional via upload_skill.py):
- Input: Skill .zip file
- Process: Upload to Claude AI via API
- Output: Skill available in Claude
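
The BFS traversal in the Scrape Phase can be pictured with a small sketch. This is not the code in `doc_scraper.py`; the function name `bfs_crawl`, its parameters, and the in-memory `SITE` map are assumptions made for illustration, and link fetching is injected as a callable so the example runs without network access.

```python
from collections import deque
from typing import Callable, Iterable


def bfs_crawl(
    base_url: str,
    get_links: Callable[[str], Iterable[str]],
    is_allowed: Callable[[str], bool],
    max_pages: int,
) -> list[str]:
    """Return URLs in breadth-first order, starting from base_url."""
    queue = deque([base_url])
    seen = {base_url}          # URLs already queued, to avoid revisiting
    visited: list[str] = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()  # FIFO order is what makes this breadth-first
        visited.append(url)
        for link in get_links(url):
            if link not in seen and is_allowed(link):
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory "site" standing in for real HTTP requests.
SITE = {
    "/": ["/a", "/b", "/search.html"],
    "/a": ["/c", "/"],
    "/b": ["/c"],
    "/c": [],
}

order = bfs_crawl(
    "/",
    get_links=lambda url: SITE.get(url, []),
    is_allowed=lambda url: "/search.html" not in url,
    max_pages=10,
)
print(order)  # ['/', '/a', '/b', '/c']
```

Pages closer to the start URL are visited first, so a low `max_pages` still tends to capture the top-level structure of a documentation site.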
### Configuration File Structure
Config files (`configs/*.json`) define scraping behavior:
```json
{
  "name": "godot",
  "description": "When to use this skill",
  "base_url": "https://docs.godotengine.org/en/stable/",
  "selectors": {
    "main_content": "div[role='main']",
    "title": "title",
    "code_blocks": "pre"
  },
  "url_patterns": {
    "include": [],
    "exclude": ["/search.html", "/_static/"]
  },
  "categories": {
    "getting_started": ["introduction", "getting_started"],
    "scripting": ["scripting", "gdscript"],
    "api": ["api", "reference", "class"]
  },
  "rate_limit": 0.5,
  "max_pages": 500
}
```
**Config Parameters:**
- `name`: Skill identifier (output directory name)
- `description`: When Claude should use this skill
- `base_url`: Starting URL for scraping
- `selectors.main_content`: CSS selector for main content (common: `article`, `main`, `div[role="main"]`)
- `selectors.title`: CSS selector for page title
- `selectors.code_blocks`: CSS selector for code samples
- `url_patterns.include`: Only scrape URLs containing these patterns
- `url_patterns.exclude`: Skip URLs containing these patterns
- `categories`: Keyword mapping for categorization
- `rate_limit`: Delay between requests (seconds)
- `max_pages`: Maximum pages to scrape
- `skip_llms_txt`: Skip llms.txt detection, force HTML scraping (default: false)
- `exclude_dirs_additional`: Add custom directories to default exclusions (for local repo analysis)
- `exclude_dirs`: Replace default directory exclusions entirely (advanced, for local repo analysis)
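
One way to read the `url_patterns` rules above is as plain substring checks. The sketch below is an assumption about how they combine, not the project's `is_valid_url()`: the name `url_matches` is hypothetical, an empty `include` list is treated as "allow everything", `exclude` is given precedence, and URLs outside `base_url` are rejected.

```python
def url_matches(url: str, base_url: str, include: list[str], exclude: list[str]) -> bool:
    """Decide whether a discovered URL should be scraped (illustrative rules)."""
    if not url.startswith(base_url):
        return False  # assumption: never leave the documentation site
    if any(pattern in url for pattern in exclude):
        return False  # assumption: exclude wins over include
    if include and not any(pattern in url for pattern in include):
        return False  # a non-empty include list acts as an allow-list
    return True


BASE = "https://docs.godotengine.org/en/stable/"
EXCLUDE = ["/search.html", "/_static/"]

allowed = url_matches(BASE + "tutorials/scripting/index.html", BASE, [], EXCLUDE)
blocked = url_matches(BASE + "search.html", BASE, [], EXCLUDE)
off_site = url_matches("https://example.com/page", BASE, [], EXCLUDE)
print(allowed, blocked, off_site)  # True False False
```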
## Key Features & Implementation
### Auto-Detect Existing Data
The tool checks for `output/{name}_data/` and prompts to reuse it, which avoids re-scraping (`check_existing_data()` in doc_scraper.py:653-660).
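
A check of this kind can be sketched in a few lines. The helper name `has_cached_pages` is hypothetical and the real `check_existing_data()` may inspect different files; the sketch only relies on the `output/{name}_data/pages/*.json` layout shown in the file structure above.

```python
import json
import tempfile
from pathlib import Path


def has_cached_pages(name: str, output_root: str = "output") -> bool:
    """Return True if {output_root}/{name}_data/pages/ holds at least one page JSON file."""
    pages_dir = Path(output_root) / f"{name}_data" / "pages"
    return pages_dir.is_dir() and any(pages_dir.glob("*.json"))


# Demonstration in a throwaway directory so no real output/ folder is touched.
with tempfile.TemporaryDirectory() as root:
    before = has_cached_pages("godot", root)
    pages = Path(root) / "godot_data" / "pages"
    pages.mkdir(parents=True)
    (pages / "page_0001.json").write_text(json.dumps({"url": "https://example.com"}))
    after = has_cached_pages("godot", root)

print(before, after)  # False True
```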
### Configurable Directory Exclusions (Local Repository Analysis)
When using `local_repo_path` for unlimited local repository analysis, you can customize which directories to exclude from analysis.
**Smart Defaults:**
Automatically excludes common directories: `venv`, `node_modules`, `__pycache__`, `.git`, `build`, `dist`, `.pytest_cache`, `htmlcov`, `.tox`, `.mypy_cache`, etc.
**Extend Mode** (`exclude_dirs_additional`): Add custom exclusions to defaults
```json
{
  "sources": [{
    "type": "github",
    "local_repo_path": "/path/to/repo",
    "exclude_dirs_additional": ["proprietary", "legacy", "third_party"]
  }]
}
```
**Replace Mode** (`exclude_dirs`): Override defaults entirely (advanced)
```json
{
  "sources": [{
    "type": "github",
    "local_repo_path": "/path/to/repo",
    "exclude_dirs": ["node_modules", ".git", "custom_vendor"]
  }]
}
```
**Use Cases:**
- Monorepos with custom directory structures
- Enterprise projects with non-standard naming
- Including unusual directories (e.g., analyzing venv code)
- Minimal exclusions for small/simple projects
See: `should_exclude_dir()` in github_scraper.py:304-306
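
The difference between Extend Mode and Replace Mode can be sketched as follows. This is an illustration, not the logic of `should_exclude_dir()`: the helper `resolve_excluded_dirs` is hypothetical, and `DEFAULT_EXCLUDED_DIRS` contains only the defaults named above (the real list is longer).

```python
# Partial default list, copied from the examples in this section.
DEFAULT_EXCLUDED_DIRS = {
    "venv", "node_modules", "__pycache__", ".git", "build",
    "dist", ".pytest_cache", "htmlcov", ".tox", ".mypy_cache",
}


def resolve_excluded_dirs(source: dict) -> set[str]:
    """Compute the effective exclusion set for one source entry."""
    if "exclude_dirs" in source:
        # Replace Mode: the defaults are discarded entirely.
        return set(source["exclude_dirs"])
    # Extend Mode (or neither key set): defaults plus any additions.
    return DEFAULT_EXCLUDED_DIRS | set(source.get("exclude_dirs_additional", []))


extended = resolve_excluded_dirs({"exclude_dirs_additional": ["legacy"]})
replaced = resolve_excluded_dirs({"exclude_dirs": ["node_modules", ".git"]})
print("venv" in extended, "venv" in replaced)  # True False
```

With Replace Mode, `venv` is no longer skipped, which is what makes the "analyzing venv code" use case possible.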
### Language Detection
Detects code languages from:
1. CSS class attributes (`language-*`, `lang-*`)
2. Heuristics (keywords like `def`, `const`, `func`, etc.)
See: `detect_language()` in doc_scraper.py:135-165
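
The two-stage approach (class attribute first, keyword heuristics second) might look roughly like this. The function `guess_language` and the `KEYWORD_HINTS` table are assumptions for illustration; the keywords and their order in the real `detect_language()` may differ.

```python
import re

# Matches class names such as "language-python" or "lang-js".
CLASS_PATTERN = re.compile(r"^(?:language|lang)-(.+)$")

# Hypothetical keyword hints, checked in order; the first hit wins.
KEYWORD_HINTS = [
    ("python", ("def ", "import ")),
    ("javascript", ("const ", "=> ")),
    ("go", ("func ", "package ")),
]


def guess_language(css_classes: list[str], code: str) -> str:
    """Prefer an explicit CSS class; fall back to keyword heuristics."""
    for css_class in css_classes:
        match = CLASS_PATTERN.match(css_class)
        if match:
            return match.group(1)
    for language, keywords in KEYWORD_HINTS:
        if any(keyword in code for keyword in keywords):
            return language
    return "unknown"


print(guess_language(["highlight", "language-rust"], "fn main() {}"))  # rust
print(guess_language([], "def f():\n    pass"))                        # python
```

Keyword heuristics are inherently fuzzy, so the class attribute is checked first whenever the site provides one.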
### Pattern Extraction
Looks for "Example:", "Pattern:", and "Usage:" markers in the content and extracts the code blocks that follow them (up to 5 per page).
See: `extract_patterns()` in doc_scraper.py:167-183
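
A simplified version of this marker-based extraction is sketched below. It assumes the page has already been split into an ordered list of `("text", ...)` and `("code", ...)` blocks, which is a representation chosen for the example rather than the one `extract_patterns()` works on; the name `extract_marked_code` is hypothetical.

```python
MARKERS = ("Example:", "Pattern:", "Usage:")
MAX_PATTERNS_PER_PAGE = 5


def extract_marked_code(blocks: list[tuple[str, str]]) -> list[dict]:
    """Collect code blocks that directly follow a text block containing a marker."""
    patterns = []
    pending_marker = None
    for kind, text in blocks:
        if kind == "text":
            # Remember the marker (if any) found in the most recent text block.
            pending_marker = next((m for m in MARKERS if m in text), None)
        elif kind == "code" and pending_marker:
            patterns.append({"marker": pending_marker, "code": text})
            pending_marker = None
            if len(patterns) >= MAX_PATTERNS_PER_PAGE:
                break
    return patterns


page = [
    ("text", "Example: create a node"),
    ("code", "var n = Node.new()"),
    ("text", "Some unrelated prose"),
    ("code", "ignored()"),
    ("text", "Usage: run the scene"),
    ("code", "run()"),
]
found = extract_marked_code(page)
print([p["marker"] for p in found])  # ['Example:', 'Usage:']
```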
### Smart Categorization
- Scores pages against category keywords (3 points for URL match, 2 for title, 1 for content)
- Threshold of 2+ for categorization
- Auto-infers categories from URL segments if none provided
- Falls back to "other" category
See: `smart_categorize()` and `infer_categories()` in doc_scraper.py:282-351
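
The scoring rule can be made concrete with a short sketch. It uses the weights and threshold listed above, but the details are assumptions: points are accumulated per matching keyword, the highest-scoring category wins, and the function name `categorize` is hypothetical rather than the project's `smart_categorize()`.

```python
URL_POINTS, TITLE_POINTS, CONTENT_POINTS = 3, 2, 1
THRESHOLD = 2


def categorize(page: dict, categories: dict[str, list[str]]) -> str:
    """Return the best-scoring category for a page, or "other" below the threshold."""
    url = page["url"].lower()
    title = page["title"].lower()
    content = page["content"].lower()
    best, best_score = "other", 0
    for category, keywords in categories.items():
        score = 0
        for keyword in (k.lower() for k in keywords):
            if keyword in url:
                score += URL_POINTS
            if keyword in title:
                score += TITLE_POINTS
            if keyword in content:
                score += CONTENT_POINTS
        if score >= THRESHOLD and score > best_score:
            best, best_score = category, score
    return best


CATEGORIES = {
    "getting_started": ["introduction", "getting_started"],
    "scripting": ["scripting", "gdscript"],
    "api": ["api", "reference", "class"],
}
node_page = {
    "url": "https://docs.godotengine.org/en/stable/classes/class_node.html",
    "title": "Node",
    "content": "Inherits Object",
}
weak_page = {
    "url": "https://example.com/misc",
    "title": "Misc",
    "content": "a note on gdscript",
}
print(categorize(node_page, CATEGORIES), categorize(weak_page, CATEGORIES))  # api other
```

The second page mentions `gdscript` only in its content, which is worth 1 point and stays below the threshold of 2, so it falls back to "other".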
### Enhanced SKILL.md Generation
Generated with:
- Real code examples from documentation (language-annotated)
- Quick reference patterns extracted from docs
- Common pattern section
- Category file listings
See: `create_enhanced_skill_md()` in doc_scraper.py:426-542
## Common Workflows
### First Time (With Scraping + Enhancement)
```bash
# 1. Scrape + Build + AI Enhancement (LOCAL, no API key)
skill-seekers scrape --config configs/godot.json --enhance-local
# 2. Wait for enhancement terminal to close (~60 seconds)
# 3. Verify quality
cat output/godot/SKILL.md
# 4. Package
skill-seekers package output/godot/
# Result: godot.zip ready for Claude
# Time: 20-40 minutes (scraping) + 60 seconds (enhancement)
```
### Using Cached Data (Fast Iteration)
```bash
# 1. Use existing data + Local Enhancement
skill-seekers scrape --config configs/godot.json --skip-scrape
skill-seekers enhance output/godot/
# 2. Package
skill-seekers package output/godot/
# Time: 1-3 minutes (build) + 60 seconds (enhancement)
```
### Without Enhancement (Basic)
```bash
# 1. Scrape + Build (no enhancement)
skill-seekers scrape --config configs/godot.json
# 2. Package
skill-seekers package output/godot/
# Note: SKILL.md will be basic template - enhancement recommended
# Time: 20-40 minutes
```
### Creating a New Framework Config
**Option 1: Interactive**
```bash
skill-seekers scrape --interactive
# Follow prompts, it creates the config for you
```
**Option 2: Copy and Modify**
```bash
# Copy a preset
cp configs/react.json configs/myframework.json
# Edit it
nano configs/myframework.json
# Test with limited pages first
# Set "max_pages": 20 in config
# Use it
skill-seekers scrape --config configs/myframework.json
```
## Testing & Verification
### Finding the Right CSS Selectors
Before creating a config, test selectors with BeautifulSoup:
```python
from bs4 import BeautifulSoup
import requests
url = "https://docs.example.com/page"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# Try different selectors
print(soup.select_one('article'))
print(soup.select_one('main'))
print(soup.select_one('div[role="main"]'))
print(soup.select_one('div.content'))
# Test code block selector
print(soup.select('pre code'))
print(soup.select('pre'))
```
### Verify Output Quality
After building, verify the skill quality:
```bash
# Check SKILL.md has real examples
cat output/godot/SKILL.md
# Check category structure
cat output/godot/references/index.md
# List all reference files
ls output/godot/references/
# Check specific category content
cat output/godot/references/getting_started.md
# Verify code samples have language detection
# (single quotes keep the shell from treating the backticks as command substitution)
grep -A 3 '```' output/godot/references/*.md | head -20
```
### Test with Limited Pages
For faster testing, limit the page count in the config (20 pages in this example):
```json
{
  "max_pages": 20
}
```
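
Rather than editing a preset by hand, a limited copy can be generated with a few lines of Python. This is only a convenience sketch: `write_limited_config` and the file names are hypothetical, and the demo uses a temporary directory instead of the real `configs/` folder.

```python
import json
import tempfile
from pathlib import Path


def write_limited_config(src: Path, dst: Path, max_pages: int = 20) -> dict:
    """Copy a config file to dst with max_pages overridden; return the new config."""
    config = json.loads(src.read_text(encoding="utf-8"))
    config["max_pages"] = max_pages
    dst.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "myframework.json"
    src.write_text(json.dumps({"name": "myframework", "max_pages": 500}), encoding="utf-8")
    dst = Path(tmp) / "myframework_test.json"
    limited = write_limited_config(src, dst)
    reloaded = json.loads(dst.read_text(encoding="utf-8"))

print(limited["max_pages"])  # 20
```

The original preset stays untouched, so the full scrape can be run later with the unmodified file.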
## Troubleshooting
### No Content Extracted
**Problem:** Pages scraped but content is empty
**Solution:** Check `main_content` selector in config. Try:
- `article`
- `main`
- `div[role="main"]`
- `div.content`
Use the BeautifulSoup testing approach above to find the right selector.
### Poor Categorization
**Problem:** Pages not categorized well
**Solution:** Edit `categories` section in config with better keywords specific to the documentation structure. Check URL patterns in scraped data:
```bash
# See what URLs were scraped
cat output/godot_data/summary.json | grep url | head -20
```
### Data Exists But Won't Use It
**Problem:** Tool won't reuse existing data
**Solution:** Force re-scrape:
```bash
rm -rf output/myframework_data/
skill-seekers scrape --config configs/myframework.json
```
### Rate Limiting Issues
**Problem:** Getting rate limited or blocked by documentation server
**Solution:** Increase the `rate_limit` value in the config, for example from 0.5 to 1.0 seconds:
```json
{
  "rate_limit": 1.0
}
```
### Package Path Error
**Problem:** doc_scraper.py shows wrong cli/package_skill.py path
**Expected output:**
```bash
skill-seekers package output/godot/
```
**Not:**
```bash
python3 /mnt/skills/examples/skill-creator/scripts/cli/package_skill.py output/godot/
```
The correct command is the `skill-seekers package` subcommand, which is implemented in `src/skill_seekers/cli/package_skill.py`.
## Key Code Locations (v2.0.0)
**Documentation Scraper** (`src/skill_seekers/cli/doc_scraper.py`):
- **URL validation**: `is_valid_url()`
- **Content extraction**: `extract_content()`
- **Language detection**: `detect_language()`
- **Pattern extraction**: `extract_patterns()`
- **Smart categorization**: `smart_categorize()`
- **Category inference**: `infer_categories()`
- **Quick reference generation**: `generate_quick_reference()`
- **SKILL.md generation**: `create_enhanced_skill_md()`
- **Scraping loop**: `scrape_all()`
- **Main workflow**: `main()`
**Other Key Files**:
- **GitHub scraper**: `src/skill_seekers/cli/github_scraper.py`
- **PDF scraper**: `src/skill_seekers/cli/pdf_scraper.py`
- **Unified scraper**: `src/skill_seekers/cli/unified_scraper.py`
- **Conflict detection**: `src/skill_seekers/cli/conflict_detector.py`
- **Source merger**: `src/skill_seekers/cli/merge_sources.py`
- **Package tool**: `src/skill_seekers/cli/package_skill.py`
- **Upload tool**: `src/skill_seekers/cli/upload_skill.py`
- **MCP server**: `src/skill_seekers/mcp/server.py`
- **Entry points**: `pyproject.toml` (project.scripts section)
## Enhancement Details
### LOCAL Enhancement (Recommended)
- Uses your Claude Code Max plan (no API costs)
- Opens new terminal with Claude Code
- Analyzes reference files automatically
- Takes 30-60 seconds
- Quality: 9/10 (comparable to API version)
- Backs up original SKILL.md to SKILL.md.backup
### API Enhancement (Alternative)
- Uses Anthropic API (~$0.15-$0.30 per skill)
- Requires ANTHROPIC_API_KEY
- Same quality as LOCAL
- Faster (no terminal launch)
- Better for automation/CI
**What Enhancement Does:**
1. Reads reference documentation files
2. Analyzes content with Claude
3. Extracts 5-10 best code examples
4. Creates comprehensive quick reference
5. Adds domain-specific key concepts
6. Provides navigation guidance for different skill levels
7. Transforms 75-line templates into 500+ line comprehensive guides
## Performance
| Task | Time | Notes |
|------|------|-------|
| Scraping | 15-45 min | First time only |
| Building | 1-3 min | Fast! |
| Re-building | <1 min | With --skip-scrape |
| Enhancement (LOCAL) | 30-60 sec | Uses Claude Code Max |
| Enhancement (API) | 20-40 sec | Requires API key |
| Packaging | 5-10 sec | Final zip |
## Available Configs (24 Total)
### Single-Source Documentation Configs (14 configs)
**Web Frameworks:**
- ✅ `react.json` - React (article selector, 7,102 chars)
- ✅ `vue.json` - Vue.js (main selector, 1,029 chars)
- ✅ `astro.json` - Astro (article selector, 145 chars)
- ✅ `django.json` - Django (article selector, 6,468 chars)
- ✅ `laravel.json` - Laravel 9.x (#main-content selector, 16,131 chars)
- ✅ `fastapi.json` - FastAPI (article selector, 11,906 chars)
- ✅ `hono.json` - Hono web framework **NEW!**
**DevOps & Automation:**
- ✅ `ansible-core.json` - Ansible Core 2.19 (div[role='main'] selector, ~32K chars)
- ✅ `kubernetes.json` - Kubernetes (main selector, 2,100 chars)
**Game Engines:**
- ✅ `godot.json` - Godot (div[role='main'] selector, 1,688 chars)
- ✅ `godot-large-example.json` - Godot large docs example
**CSS & Utilities:**
- ✅ `tailwind.json` - Tailwind CSS (div.prose selector, 195 chars)
**Gaming:**
- ✅ `steam-economy-complete.json` - Steam Economy (div.documentation_bbcode, 588 chars)
**Development Tools:**
- ✅ `claude-code.json` - Claude Code documentation **NEW!**
### Unified Multi-Source Configs (5 configs - **NEW v2.0!**)
- ✅ `react_unified.json` - React (docs + GitHub + code analysis)
- ✅ `django_unified.json` - Django (docs + GitHub + code analysis)
- ✅ `fastapi_unified.json` - FastAPI (docs + GitHub + code analysis)
- ✅ `fastapi_unified_test.json` - FastAPI test config
- ✅ `godot_unified.json` - Godot (docs + GitHub + code analysis)
### Test/Example Configs (5 configs)
- 📝 `godot_github.json` - GitHub-only scraping example
- 📝 `react_github.json` - GitHub-only scraping example
- 📝 `python-tutorial-test.json` - Python tutorial test
- 📝 `example_pdf.json` - PDF extraction example
- 📝 `test-manual.json` - Manual testing config
**Note:** All configs verified and working! Unified configs fully tested with 22 passing tests.
**Last verified:** November 29, 2025 (Post-v2.1.0 bug fixes)
## Additional Documentation
**User Guides:**
- **[README.md](README.md)** - Complete user documentation
- **[BULLETPROOF_QUICKSTART.md](BULLETPROOF_QUICKSTART.md)** - Complete beginner guide
- **[QUICKSTART.md](QUICKSTART.md)** - Get started in 3 steps
- **[TROUBLESHOOTING.md](TROUBLESHOOTING.md)** - Comprehensive troubleshooting
**Technical Documentation:**
- **[docs/CLAUDE.md](docs/CLAUDE.md)** - Detailed technical architecture
- **[docs/ENHANCEMENT.md](docs/ENHANCEMENT.md)** - AI enhancement guide
- **[docs/UPLOAD_GUIDE.md](docs/UPLOAD_GUIDE.md)** - How to upload skills to Claude
- **[docs/UNIFIED_SCRAPING.md](docs/UNIFIED_SCRAPING.md)** - Multi-source scraping guide
- **[docs/MCP_SETUP.md](docs/MCP_SETUP.md)** - MCP server setup
**Project Planning:**
- **[CHANGELOG.md](CHANGELOG.md)** - Release history and v2.0.0 details **UPDATED!**
- **[FUTURE_RELEASES.md](FUTURE_RELEASES.md)** - Roadmap for v2.1.0+ **NEW!**
- **[FLEXIBLE_ROADMAP.md](FLEXIBLE_ROADMAP.md)** - Complete task catalog (134 tasks)
- **[NEXT_TASKS.md](NEXT_TASKS.md)** - What to work on next
- **[TODO.md](TODO.md)** - Current focus
- **[STRUCTURE.md](STRUCTURE.md)** - Repository structure
## Notes for Claude Code
**Project Status (v2.0.0):**
- ✅ **Published on PyPI**: Install with `pip install skill-seekers`
- ✅ **Modern Python Packaging**: pyproject.toml, src/ layout, entry points
- ✅ **Unified CLI**: Single `skill-seekers` command with Git-style subcommands
- ✅ **CI/CD Working**: All 5 test matrix jobs passing (Ubuntu + macOS, Python 3.10-3.12)
- ✅ **Test Coverage**: 391 tests passing, 39% coverage
- ✅ **Documentation**: Complete user and technical documentation
**Architecture:**
- **Python-based documentation scraper** with multi-source support
- **Main scraper**: `src/skill_seekers/cli/doc_scraper.py` (~790 lines)
- **Unified scraping**: Combines docs + GitHub + PDF with conflict detection
- **Modern packaging**: PEP 621 compliant with proper dependency management
- **MCP Integration**: 9 tools for Claude Code Max integration
**Development Workflow:**
1. **Install**: `pip install -e .` (editable mode for development)
2. **Run tests**: `pytest tests/` (391 tests)
3. **Build package**: `uv build` or `python -m build`
4. **Publish**: `uv publish` (PyPI)
**Key Points:**
- Output is cached and reusable in `output/` (git-ignored)
- Enhancement is optional but highly recommended
- All 24 configs are working and tested
- CI workflow requires `pip install -e .` to install package before running tests


@ -0,0 +1,432 @@
# Contributing to Skill Seeker
First off, thank you for considering contributing to Skill Seeker! It's people like you that make Skill Seeker such a great tool.
## Table of Contents
- [Branch Workflow](#branch-workflow)
- [Code of Conduct](#code-of-conduct)
- [How Can I Contribute?](#how-can-i-contribute)
- [Development Setup](#development-setup)
- [Pull Request Process](#pull-request-process)
- [Coding Standards](#coding-standards)
- [Testing](#testing)
- [Documentation](#documentation)
---
## Branch Workflow
**⚠️ IMPORTANT:** Skill Seekers uses a two-branch workflow.
### Branch Structure
```
main (production)
│ (only maintainer merges)
development (integration) ← default branch for PRs
│ (all contributor PRs go here)
feature branches
```
### Branches
- **`main`** - Production branch
  - Always stable
  - Only receives merges from `development` by maintainers
  - Protected: requires tests + 1 review
- **`development`** - Integration branch
  - **Default branch for all PRs**
  - Active development happens here
  - Protected: requires tests to pass
  - Gets merged to `main` by maintainers
- **Feature branches** - Your work
  - Created from `development`
  - Named descriptively (e.g., `add-github-scraping`)
  - Merged back to `development` via PR
### Workflow Example
```bash
# 1. Fork and clone
git clone https://github.com/YOUR_USERNAME/Skill_Seekers.git
cd Skill_Seekers
# 2. Add upstream
git remote add upstream https://github.com/yusufkaraaslan/Skill_Seekers.git
# 3. Create feature branch from development
git checkout development
git pull upstream development
git checkout -b my-feature
# 4. Make changes, commit, push
git add .
git commit -m "Add my feature"
git push origin my-feature
# 5. Create PR targeting 'development' branch
```
---
## Code of Conduct
This project and everyone participating in it is governed by our commitment to fostering an open and welcoming environment. Please be respectful and constructive in all interactions.
---
## How Can I Contribute?
### Reporting Bugs
Before creating bug reports, please check the [existing issues](https://github.com/yusufkaraaslan/Skill_Seekers/issues) to avoid duplicates.
When creating a bug report, include:
- **Clear title and description**
- **Steps to reproduce** the issue
- **Expected behavior** vs actual behavior
- **Screenshots** if applicable
- **Environment details** (OS, Python version, etc.)
- **Error messages** and stack traces
**Example:**
```markdown
**Bug:** MCP tool fails when config has no categories
**Steps to Reproduce:**
1. Create config with empty categories: `"categories": {}`
2. Run `python3 cli/doc_scraper.py --config configs/test.json`
3. See error
**Expected:** Should use auto-inferred categories
**Actual:** Crashes with KeyError
**Environment:**
- OS: Ubuntu 22.04
- Python: 3.10.5
- Version: 1.0.0
```
### Suggesting Enhancements
Enhancement suggestions are tracked as [GitHub issues](https://github.com/yusufkaraaslan/Skill_Seekers/issues).
Include:
- **Clear title** describing the enhancement
- **Detailed description** of the proposed functionality
- **Use cases** that would benefit from this enhancement
- **Examples** of how it would work
- **Alternatives considered**
### Adding New Framework Configs
We welcome new framework configurations! To add one:
1. Create a config file in `configs/`
2. Test it thoroughly with different page counts
3. Submit a PR with:
- The config file
- Brief description of the framework
- Test results (number of pages scraped, categories found)
**Example PR:**
```markdown
**Add Svelte Documentation Config**
Adds configuration for Svelte documentation (https://svelte.dev/docs).
- Config: `configs/svelte.json`
- Tested with max_pages: 100
- Successfully categorized: getting_started, components, api, advanced
- Total pages available: ~150
```
### Pull Requests
We actively welcome your pull requests!
**⚠️ IMPORTANT:** All PRs must target the `development` branch, not `main`.
1. Fork the repo and create your branch from `development`
2. If you've added code, add tests
3. If you've changed APIs, update the documentation
4. Ensure the test suite passes
5. Make sure your code follows our coding standards
6. Open the pull request against the `development` branch
---
## Development Setup
### Prerequisites
- Python 3.10 or higher (required for MCP integration)
- Git
### Setup Steps
1. **Fork and clone the repository**
```bash
git clone https://github.com/YOUR_USERNAME/Skill_Seekers.git
cd Skill_Seekers
```
2. **Install dependencies**
```bash
pip install requests beautifulsoup4
pip install pytest pytest-cov
pip install -r mcp/requirements.txt
```
3. **Create a feature branch from development**
```bash
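# Add the upstream remote once (skip if it is already configured)
git remote add upstream https://github.com/yusufkaraaslan/Skill_Seekers.git
git checkout development
git pull upstream development
git checkout -b feature/my-awesome-feature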
```
4. **Make your changes**
```bash
# Edit files...
```
5. **Run tests**
```bash
python -m pytest tests/ -v
```
6. **Commit your changes**
```bash
git add .
git commit -m "Add awesome feature"
```
7. **Push to your fork**
```bash
git push origin feature/my-awesome-feature
```
8. **Create a Pull Request**
---
## Pull Request Process
### Before Submitting
- [ ] Tests pass locally (`python -m pytest tests/ -v`)
- [ ] Code follows PEP 8 style guidelines
- [ ] Documentation is updated if needed
- [ ] CHANGELOG.md is updated (if applicable)
- [ ] Commit messages are clear and descriptive
### PR Template
```markdown
## Description
Brief description of what this PR does.
## Type of Change
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update
## How Has This Been Tested?
Describe the tests you ran to verify your changes.
## Checklist
- [ ] My code follows the style guidelines of this project
- [ ] I have performed a self-review of my own code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
```
### Review Process
1. A maintainer will review your PR within 3-5 business days
2. Address any feedback or requested changes
3. Once approved, a maintainer will merge your PR
4. Your contribution will be included in the next release!
---
## Coding Standards
### Python Style Guide
We follow [PEP 8](https://peps.python.org/pep-0008/) with some modifications:
- **Line length:** 100 characters (not 79)
- **Indentation:** 4 spaces
- **Quotes:** Double quotes for strings
- **Naming:**
- Functions/variables: `snake_case`
- Classes: `PascalCase`
- Constants: `UPPER_SNAKE_CASE`
### Code Organization
```python
# 1. Standard library imports
import os
import sys
from pathlib import Path
# 2. Third-party imports
import requests
from bs4 import BeautifulSoup
# 3. Local application imports
from cli.utils import open_folder
# 4. Constants
MAX_PAGES = 1000
DEFAULT_RATE_LIMIT = 0.5
# 5. Functions and classes
def my_function():
    """Docstring describing what this function does."""
    pass
```
### Documentation
- All functions should have docstrings
- Use type hints where appropriate
- Add comments for complex logic
```python
def scrape_page(url: str, selectors: dict) -> dict:
    """
    Scrape a single page and extract content.

    Args:
        url: The URL to scrape
        selectors: Dictionary of CSS selectors

    Returns:
        Dictionary containing extracted content

    Raises:
        RequestException: If page cannot be fetched
    """
    pass
```
---
## Testing
### Running Tests
```bash
# Run all tests
python -m pytest tests/ -v
# Run specific test file
python -m pytest tests/test_mcp_server.py -v
# Run with coverage
python -m pytest tests/ --cov=cli --cov=mcp --cov-report=term
```
### Writing Tests
- Tests go in the `tests/` directory
- Test files should start with `test_`
- Use descriptive test names
```python
def test_config_validation_with_missing_fields():
    """Test that config validation fails when required fields are missing."""
    config = {"name": "test"}  # Missing base_url
    result = validate_config(config)
    assert result is False
```
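
For readers who want to run the test above in isolation, a stand-in `validate_config` could look like the sketch below. It is a hypothetical minimal version: the project's real validator may have a different signature, check more fields, or return error details instead of a boolean. The required fields here are assumed from the "Missing base_url" comment in the example.

```python
# Assumed required fields, inferred from the example test above.
REQUIRED_FIELDS = ("name", "base_url")


def validate_config(config: dict) -> bool:
    """Return True only if every required field is present and non-empty."""
    return all(bool(config.get(field)) for field in REQUIRED_FIELDS)


missing = validate_config({"name": "test"})
complete = validate_config({"name": "test", "base_url": "https://docs.example.com/"})
print(missing, complete)  # False True
```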
### Test Coverage
- Aim for >80% code coverage
- Critical paths should have 100% coverage
- Add tests for bug fixes to prevent regressions
---
## Documentation
### Where to Document
- **README.md** - Overview, quick start, basic usage
- **docs/** - Detailed guides and tutorials
- **CHANGELOG.md** - All notable changes
- **Code comments** - Complex logic and non-obvious decisions
### Documentation Style
- Use clear, simple language
- Include code examples
- Add screenshots for UI-related features
- Keep it up to date with code changes
---
## Project Structure
```
Skill_Seekers/
├── cli/ # CLI tools
│ ├── doc_scraper.py # Main scraper
│ ├── package_skill.py # Packager
│ ├── upload_skill.py # Uploader
│ └── utils.py # Shared utilities
├── mcp/ # MCP server
│ ├── server.py # MCP implementation
│ └── requirements.txt # MCP dependencies
├── configs/ # Framework configs
├── docs/ # Documentation
├── tests/ # Test suite
└── .github/ # GitHub config
    └── workflows/ # CI/CD workflows
```
---
## Release Process
Releases are managed by maintainers:
1. Update version in relevant files
2. Update CHANGELOG.md
3. Create and push version tag
4. GitHub Actions will create the release
5. Announce on relevant channels
---
## Questions?
- 💬 [Open a discussion](https://github.com/yusufkaraaslan/Skill_Seekers/discussions)
- 🐛 [Report a bug](https://github.com/yusufkaraaslan/Skill_Seekers/issues)
- 📧 Contact: yusufkaraaslan.yk@pm.me
---
## Recognition
Contributors will be recognized in:
- README.md contributors section
- CHANGELOG.md for each release
- GitHub contributors page
Thank you for contributing to Skill Seeker! 🎉


@ -0,0 +1,393 @@
# Flexible Development Roadmap
**Philosophy:** Small incremental tasks → Pick one → Complete → Move to next
**No big milestones, just continuous progress!**
---
## 🎯 Current Status: v2.1.0 Released ✅
**Latest Release:** v2.1.0 (November 29, 2025)
**What Works:**
- ✅ Documentation scraping (HTML websites)
- ✅ GitHub repository scraping with unlimited local analysis
- ✅ PDF extraction and conversion
- ✅ Unified multi-source scraping (docs + GitHub + PDF)
- ✅ 9 MCP tools fully functional
- ✅ Auto-upload to Claude
- ✅ 24 preset configs (including 5 unified configs)
- ✅ Large docs support (40K+ pages)
- ✅ Configurable directory exclusions
- ✅ 427 tests passing
---
## 📋 Task Categories (Pick Any, Any Order)
### 🌐 **Category A: Community & Sharing**
Small tasks that build community features incrementally
#### A1: Config Sharing (Website Feature)
- [x] **Task A1.1:** Create simple JSON API endpoint to list configs ✅ **COMPLETE** (Issue #9)
  - **Status:** Live at https://api.skillseekersweb.com
  - **Features:** 6 REST endpoints, auto-categorization, auto-tags, filtering, SSL enabled
  - **Branch:** `feature/a1-config-sharing`
  - **Deployment:** Render with custom domain
- [ ] **Task A1.2:** Add MCP tool `fetch_config` to download from website
- [ ] **Task A1.3:** Create basic config upload form (HTML + backend)
- [ ] **Task A1.4:** Add config rating/voting system
- [ ] **Task A1.5:** Add config search/filter functionality
- [ ] **Task A1.6:** Add user-submitted config review queue
**Start Small:** ~~Pick A1.1 first (simple JSON endpoint)~~ ✅ A1.1 Complete! Pick A1.2 next (MCP tool)
#### A2: Knowledge Sharing (Website Feature)
- [ ] **Task A2.1:** Design knowledge database schema
- [ ] **Task A2.2:** Create API endpoint to upload knowledge (.zip files)
- [ ] **Task A2.3:** Add MCP tool `fetch_knowledge` to download from site
- [ ] **Task A2.4:** Add knowledge preview/description
- [ ] **Task A2.5:** Add knowledge categorization (by framework/topic)
- [ ] **Task A2.6:** Add knowledge search functionality
**Start Small:** Pick A2.1 first (schema design, no coding)
#### A3: Simple Website Foundation
- [ ] **Task A3.1:** Create single-page static site (GitHub Pages)
- [ ] **Task A3.2:** Add config gallery view (display existing 12 configs)
- [ ] **Task A3.3:** Add "Submit Config" link (opens GitHub issue for now)
- [ ] **Task A3.4:** Add basic stats (total configs, downloads, etc.)
- [ ] **Task A3.5:** Add simple blog using GitHub Issues
- [ ] **Task A3.6:** Add RSS feed for updates
**Start Small:** Pick A3.1 first (single HTML page on GitHub Pages)
---
### 🛠️ **Category B: New Input Formats**
Add support for non-HTML documentation sources
#### B1: PDF Documentation Support
- [ ] **Task B1.1:** Research PDF parsing libraries (PyPDF2, pdfplumber, etc.)
- [ ] **Task B1.2:** Create simple PDF text extractor (proof of concept)
- [ ] **Task B1.3:** Add PDF page detection and chunking
- [ ] **Task B1.4:** Extract code blocks from PDFs (syntax detection)
- [ ] **Task B1.5:** Add PDF image extraction (diagrams, screenshots)
- [ ] **Task B1.6:** Create `pdf_scraper.py` CLI tool
- [ ] **Task B1.7:** Add MCP tool `scrape_pdf`
- [ ] **Task B1.8:** Create PDF config format (similar to web configs)
**Start Small:** Pick B1.1 first (just research, document findings)
#### B2: Microsoft Word (.docx) Support
- [ ] **Task B2.1:** Research .docx parsing (python-docx library)
- [ ] **Task B2.2:** Create simple .docx text extractor
- [ ] **Task B2.3:** Extract headings and create categories
- [ ] **Task B2.4:** Extract code blocks from Word docs
- [ ] **Task B2.5:** Extract tables and convert to markdown
- [ ] **Task B2.6:** Create `docx_scraper.py` CLI tool
- [ ] **Task B2.7:** Add MCP tool `scrape_docx`
**Start Small:** Pick B2.1 first (research only)
#### B3: Excel/Spreadsheet (.xlsx) Support
- [ ] **Task B3.1:** Research Excel parsing (openpyxl, pandas)
- [ ] **Task B3.2:** Create simple sheet → markdown converter
- [ ] **Task B3.3:** Add table detection and formatting
- [ ] **Task B3.4:** Extract API reference from spreadsheets (common pattern)
- [ ] **Task B3.5:** Create `xlsx_scraper.py` CLI tool
- [ ] **Task B3.6:** Add MCP tool `scrape_xlsx`
**Start Small:** Pick B3.1 first (research only)
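For Task B3.2, the conversion step can be prototyped before choosing a spreadsheet library. The sketch below assumes the sheet has already been read into a list of rows with the header first (for example via openpyxl's `iter_rows(values_only=True)`). The function name `rows_to_markdown` is hypothetical.

```python
def rows_to_markdown(rows: list[list[object]]) -> str:
    """Render rows (first row is the header) as a Markdown table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Pad short rows and escape pipes so a cell cannot break the table.
        cells = ["" if cell is None else str(cell) for cell in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)
```

Keeping the renderer independent of the parser means the same function could serve the `.docx` table task (B2.5) as well.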
#### B4: Markdown Files Support
- [ ] **Task B4.1:** Create markdown file crawler (for local docs)
- [ ] **Task B4.2:** Extract front matter (title, category, etc.)
- [ ] **Task B4.3:** Build category tree from folder structure
- [ ] **Task B4.4:** Add link resolution (internal references)
- [ ] **Task B4.5:** Create `markdown_scraper.py` CLI tool
- [ ] **Task B4.6:** Add MCP tool `scrape_markdown_dir`
**Start Small:** Pick B4.1 first (simple file walker)
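Task B4.2 (front matter extraction) could start from something as small as the sketch below. It assumes front matter is a block of simple `key: value` lines between two `---` delimiters at the top of the file. Nested YAML is out of scope here and would need a real YAML parser. The name `split_front_matter` is illustrative.

```python
def split_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split a Markdown document into (front matter dict, body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter found: everything after it is the body.
            return meta, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip("\"'")
    # No closing delimiter: treat the whole file as body.
    return {}, text
```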
---
### 💻 **Category C: Codebase Knowledge**
Generate skills from actual code repositories
#### C1: GitHub Repository Scraping
- [ ] **Task C1.1:** Create GitHub API client (fetch repo structure)
- [ ] **Task C1.2:** Extract README.md files
- [ ] **Task C1.3:** Extract code comments and docstrings
- [ ] **Task C1.4:** Detect programming language per file
- [ ] **Task C1.5:** Extract function/class signatures
- [ ] **Task C1.6:** Build usage examples from tests
- [ ] **Task C1.7:** Extract GitHub Issues (open/closed, labels, milestones)
- [ ] **Task C1.8:** Extract CHANGELOG.md and release notes
- [ ] **Task C1.9:** Extract GitHub Releases with version history
- [ ] **Task C1.10:** Create `github_scraper.py` CLI tool
- [ ] **Task C1.11:** Add MCP tool `scrape_github`
- [ ] **Task C1.12:** Add config format for GitHub repos
**Start Small:** Pick C1.1 first (basic GitHub API connection)
#### C2: Local Codebase Scraping
- [ ] **Task C2.1:** Create file tree walker (with .gitignore support)
- [ ] **Task C2.2:** Extract docstrings (Python, JS, etc.)
- [ ] **Task C2.3:** Extract function signatures and types
- [ ] **Task C2.4:** Build API reference from code
- [ ] **Task C2.5:** Extract inline comments as notes
- [ ] **Task C2.6:** Create dependency graph
- [ ] **Task C2.7:** Create `codebase_scraper.py` CLI tool
- [ ] **Task C2.8:** Add MCP tool `scrape_codebase`
**Start Small:** Pick C2.1 first (simple file walker)
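For Tasks C2.2 and C2.3, Python sources can be inspected with the standard-library `ast` module without importing the code. The sketch below is one possible starting point. It assumes Python 3.9+ (for `ast.unparse`), and `extract_api` is a hypothetical name. Other languages would need their own parsers.

```python
import ast


def extract_api(source: str) -> list[dict]:
    """Return name, signature and docstring for each function and class in source."""
    entries = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # ast.unparse renders the argument list, including type hints.
            signature = f"{node.name}({ast.unparse(node.args)})"
        elif isinstance(node, ast.ClassDef):
            signature = node.name
        else:
            continue
        entries.append({
            "name": node.name,
            "signature": signature,
            "doc": ast.get_docstring(node) or "",
        })
    return entries
```

Because the source is only parsed and never executed, this is safe to run on untrusted repositories.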
#### C3: Code Pattern Recognition
- [ ] **Task C3.1:** Detect common patterns (singleton, factory, etc.)
- [ ] **Task C3.2:** Extract usage examples from test files
- [ ] **Task C3.3:** Build "how to" guides from code
- [ ] **Task C3.4:** Extract configuration patterns
- [ ] **Task C3.5:** Create architectural overview
**Start Small:** Pick C3.1 first (pattern detection research)
---
### 🔌 **Category D: Context7 Integration**
Explore integration with Context7 for enhanced context management
#### D1: Context7 Research & Planning
- [ ] **Task D1.1:** Research Context7 API and capabilities
- [ ] **Task D1.2:** Document potential use cases for Skill Seeker
- [ ] **Task D1.3:** Create integration design proposal
- [ ] **Task D1.4:** Identify which features benefit most
**Start Small:** Pick D1.1 first (pure research, no code)
#### D2: Context7 Basic Integration
- [ ] **Task D2.1:** Create Context7 API client
- [ ] **Task D2.2:** Test basic context storage/retrieval
- [ ] **Task D2.3:** Store scraped documentation in Context7
- [ ] **Task D2.4:** Query Context7 during skill building
- [ ] **Task D2.5:** Add MCP tool `sync_to_context7`
**Start Small:** Pick D2.1 first (basic API connection)
---
### 🚀 **Category E: MCP Enhancements**
Small improvements to existing MCP tools
#### E1: New MCP Tools
- [ ] **Task E1.1:** Add `fetch_config` MCP tool (download from website)
- [ ] **Task E1.2:** Add `fetch_knowledge` MCP tool (download skills)
- [x] **Task E1.3:** Add `scrape_pdf` MCP tool (✅ COMPLETED v1.0.0)
- [ ] **Task E1.4:** Add `scrape_docx` MCP tool
- [ ] **Task E1.5:** Add `scrape_xlsx` MCP tool
- [ ] **Task E1.6:** Add `scrape_github` MCP tool (see C1.11)
- [ ] **Task E1.7:** Add `scrape_codebase` MCP tool (see C2.8)
- [ ] **Task E1.8:** Add `scrape_markdown_dir` MCP tool (see B4.6)
- [ ] **Task E1.9:** Add `sync_to_context7` MCP tool (see D2.5)
**Start Small:** Pick E1.1 first (once A1.2 is done)
#### E2: MCP Quality Improvements
- [ ] **Task E2.1:** Add error handling to all tools
- [ ] **Task E2.2:** Add structured logging
- [ ] **Task E2.3:** Add progress indicators for long operations
- [ ] **Task E2.4:** Add validation for all inputs
- [ ] **Task E2.5:** Add helpful error messages
- [ ] **Task E2.6:** Add retry logic for network failures
**Start Small:** Pick E2.1 first (one tool at a time)
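The retry task (E2.6) could be approached with a small wrapper like the one below. This is a sketch under stated assumptions: the operation is a zero-argument callable, network failures surface as `OSError` subclasses, and the names `retry`, `attempts` and `base_delay` are illustrative rather than taken from the codebase.

```python
import time


def retry(func, attempts=3, base_delay=0.5, exceptions=(OSError,), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # Delay doubles each time: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter lets tests record the delays instead of actually waiting.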
---
### ⚡ **Category F: Performance & Reliability**
Technical improvements to existing features
#### F1: Core Scraper Improvements
- [ ] **Task F1.1:** Add URL normalization (remove query params)
- [ ] **Task F1.2:** Add duplicate page detection
- [ ] **Task F1.3:** Add memory-efficient streaming for large docs
- [ ] **Task F1.4:** Add HTML parser fallback (lxml → html5lib)
- [ ] **Task F1.5:** Add network retry with exponential backoff
- [ ] **Task F1.6:** Fix package path output bug
**Start Small:** Pick F1.1 first (URL normalization only)
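Tasks F1.1 and F1.2 fit together naturally: once URLs are normalized, duplicates can be detected with a set. The sketch below uses only `urllib.parse`. The normalization rules (drop query string and fragment, lower-case the host, trim the trailing slash) are assumptions about what "normalization" should mean here, and the function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lower-case scheme and host, drop query and fragment, trim trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def dedupe(urls):
    """Yield each URL whose normalized form has not been seen before."""
    seen = set()
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            yield url
```

Dropping query parameters can merge pages that are genuinely distinct on sites that paginate via the query string, so this rule may need to be configurable per site.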
#### F2: Incremental Updates
- [ ] **Task F2.1:** Track page modification times (Last-Modified header)
- [ ] **Task F2.2:** Store page checksums/hashes
- [ ] **Task F2.3:** Compare on re-run, skip unchanged pages
- [ ] **Task F2.4:** Update only changed content
- [ ] **Task F2.5:** Preserve local annotations/edits
**Start Small:** Pick F2.1 first (just tracking, no logic)
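Tasks F2.2 and F2.3 amount to storing a hash per page and comparing it on the next run. A minimal sketch, assuming page content is available as text keyed by URL and that the previous run's hashes were saved as a `{url: checksum}` mapping (both are assumptions, as are the function names):

```python
import hashlib


def page_checksum(content: str) -> str:
    """Return a stable SHA-256 hex digest of the page content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def changed_pages(pages: dict[str, str], previous: dict[str, str]) -> list[str]:
    """Return URLs that are new or whose content hash differs from the stored one."""
    return [
        url
        for url, content in pages.items()
        if previous.get(url) != page_checksum(content)
    ]
```

Hashing the extracted text rather than the raw HTML would avoid false positives from timestamps or rotating markup, at the cost of running extraction first.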
---
### 🎨 **Category G: Tools & Utilities**
Small standalone tools that add value
#### G1: Config Tools
- [ ] **Task G1.1:** Create `validate_config.py` (enhanced validation)
- [ ] **Task G1.2:** Create `test_selectors.py` (interactive selector tester)
- [ ] **Task G1.3:** Create `auto_detect_selectors.py` (AI-powered)
- [ ] **Task G1.4:** Create `compare_configs.py` (diff two configs)
- [ ] **Task G1.5:** Create `optimize_config.py` (suggest improvements)
**Start Small:** Pick G1.1 first (simple validation script)
#### G2: Skill Quality Tools
- [ ] **Task G2.1:** Create `analyze_skill.py` (quality metrics)
- [ ] **Task G2.2:** Add code example counter
- [ ] **Task G2.3:** Add readability scoring
- [ ] **Task G2.4:** Add completeness checker
- [ ] **Task G2.5:** Create quality report generator
**Start Small:** Pick G2.1 first (basic metrics)
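The code example counter (G2.2) can be approximated by counting fenced code blocks per language tag in a generated Markdown file. The sketch below is a heuristic that assumes fences are three backticks at the start of a line. Indented or nested fences are not handled, and `count_code_examples` is a hypothetical name.

```python
import re
from collections import Counter

TICKS = "`" * 3  # built indirectly so the fence marker never appears literally here
FENCE = re.compile(
    rf"^{TICKS}(\w*)[^\n]*\n.*?^{TICKS}\s*$",
    re.MULTILINE | re.DOTALL,
)


def count_code_examples(markdown: str) -> Counter:
    """Count fenced code blocks per language tag ('' when the fence is untagged)."""
    return Counter(match.group(1) for match in FENCE.finditer(markdown))
```

The per-language totals could feed directly into the quality report from Task G2.5.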
---
### 📚 **Category H: Community Response**
Respond to existing GitHub issues
#### H1: Address Open Issues
- [ ] **Task H1.1:** Respond to Issue #8: Prereqs to Getting Started
- [ ] **Task H1.2:** Investigate Issue #7: Laravel scraping issue
- [ ] **Task H1.3:** Create example project (Issue #4)
- [ ] **Task H1.4:** Answer Issue #3: Pro plan compatibility
- [ ] **Task H1.5:** Create self-documenting skill (Issue #1)
**Start Small:** Pick H1.1 first (just respond, don't solve)
---
### 🎓 **Category I: Content & Documentation**
Educational content and guides
#### I1: Video Tutorials
- [ ] **Task I1.1:** Write script for "Quick Start" video
- [ ] **Task I1.2:** Record "Quick Start" (5 min)
- [ ] **Task I1.3:** Write script for "MCP Setup" video
- [ ] **Task I1.4:** Record "MCP Setup" (8 min)
- [ ] **Task I1.5:** Write script for "Custom Config" video
- [ ] **Task I1.6:** Record "Custom Config" (10 min)
**Start Small:** Pick I1.1 first (just write script, no recording)
#### I2: Written Guides
- [ ] **Task I2.1:** Write troubleshooting guide
- [ ] **Task I2.2:** Write best practices guide
- [ ] **Task I2.3:** Write performance optimization guide
- [ ] **Task I2.4:** Write community config contribution guide
- [ ] **Task I2.5:** Write codebase scraping guide
**Start Small:** Pick I2.1 first (common issues + solutions)
---
### 🧪 **Category J: Testing & Quality**
Improve test coverage and quality
#### J1: Test Expansion
- [ ] **Task J1.1:** Install MCP package: `pip install mcp`
- [ ] **Task J1.2:** Verify all 14 tests pass
- [ ] **Task J1.3:** Add tests for new MCP tools (as they're created)
- [ ] **Task J1.4:** Add integration tests for PDF scraper
- [ ] **Task J1.5:** Add integration tests for GitHub scraper
- [ ] **Task J1.6:** Add end-to-end workflow tests
**Start Small:** Pick J1.1 first (just install package)
---
## 🎯 Recommended Starting Tasks (Pick 3-5)
### Quick Wins (1-2 hours each):
1. **H1.1** - Respond to Issue #8 (community engagement)
2. **J1.1** - Install MCP package (fix tests)
3. **A3.1** - Create simple GitHub Pages site (single HTML)
4. **B1.1** - Research PDF parsing (no coding, just notes)
5. **F1.1** - Add URL normalization (small code fix)
### Medium Tasks (3-5 hours each):
6. ~~**A1.1** - Create JSON API for configs (simple endpoint)~~ ✅ **COMPLETE**
7. **G1.1** - Create config validator script
8. **C1.1** - GitHub API client (basic connection)
9. **I1.1** - Write Quick Start video script
10. **E2.1** - Add error handling to one MCP tool
### Bigger Tasks (5-10 hours each):
11. **B1.2-B1.6** - Complete PDF scraper
12. **C1.7-C1.9** - Complete GitHub scraper
13. **A2.1-A2.3** - Knowledge sharing foundation
14. **I1.2** - Record and publish Quick Start video
---
## 📊 Progress Tracking
**Completed Tasks:** 2 (A1.1 ✅, E1.3 ✅)
**In Progress:** 0
**Total Available Tasks:** 134
### Current Sprint: Choose Your Own Adventure!
**Pick 1-3 tasks** from any category that interest you most.
**No pressure, no deadlines, just progress!** ✨
---
## 🎨 Flexibility Rules
1. **Pick any task, any order** - No dependencies (mostly)
2. **Start small** - Research tasks before implementation
3. **One task at a time** - Focus, complete, move on
4. **Switch anytime** - Not enjoying it? Pick another!
5. **Document as you go** - Each task should update docs
6. **Test incrementally** - Each task should have a quick test
7. **Ship early** - Don't wait for "complete" features
---
## 🚀 How to Use This Roadmap
### Step 1: Pick a Task
- Read through categories
- Pick something that sounds interesting
- Check estimated time
- Choose 1-3 tasks for this week
### Step 2: Create Issue (Optional)
- Create GitHub issue for tracking
- Add labels (category, priority)
- Add to project board
### Step 3: Work on It
- Complete the task
- Test it
- Document it
- Mark as done ✅
### Step 4: Ship It
- Commit changes
- Update changelog
- Tag version (if significant)
- Announce on GitHub
### Step 5: Repeat
- Pick next task
- Keep moving forward!
---
**Philosophy:**
**Small steps → Consistent progress → Compound results**
**No rigid milestones. No big releases. Just continuous improvement!** 🎯
---
**Last Updated:** October 20, 2025


@ -0,0 +1,292 @@
# Future Releases Roadmap
This document outlines planned features, improvements, and the vision for upcoming releases of Skill Seekers.
## Release Philosophy
We follow semantic versioning (MAJOR.MINOR.PATCH) and maintain backward compatibility wherever possible. Each release focuses on delivering value to users while maintaining code quality and test coverage.
---
## ✅ Release: v2.1.0 (Released: November 29, 2025)
**Focus:** Test Coverage & Quality Improvements
### Completed Features
#### Testing & Quality
- [x] **Fix 12 unified scraping tests** ✅ - Complete test coverage for unified multi-source scraping
  - ConfigValidator expecting dict instead of file path
  - ConflictDetector expecting dict pages, not list
  - Full integration test suite for unified workflow
### Planned Features (Future v2.2.0)
#### Testing & Quality
- [ ] **Improve test coverage to 60%+** (currently 39%)
  - Write tests for 0% coverage files:
    - `generate_router.py` (110 lines) - Router skill generator
    - `split_config.py` (165 lines) - Config splitter
    - `unified_scraper.py` (208 lines) - Unified scraping CLI
    - `package_multi.py` (37 lines) - Multi-package tool
  - Improve coverage for low-coverage files:
    - `mcp/server.py` (9% → 60%)
    - `enhance_skill.py` (11% → 60%)
    - `code_analyzer.py` (19% → 60%)
- [ ] **Fix MCP test skipping issue** - 29 MCP tests pass individually but skip in full suite
  - Resolve pytest isolation issue
  - Ensure all tests run in CI/CD
#### Features
- [ ] **Task H1.3: Create example project folder**
  - Real-world example projects using Skill Seekers
  - Step-by-step tutorials
  - Before/after comparisons
- [ ] **Task J1.1: Install MCP package for testing**
  - Better MCP integration testing
  - Automated MCP server tests in CI
- [ ] **Enhanced error handling**
  - Better error messages for common issues
  - Graceful degradation for missing dependencies
  - Recovery from partial failures
### Documentation
- [ ] Video tutorials for common workflows
- [ ] Troubleshooting guide expansion
- [ ] Performance optimization guide
---
## Release: v2.2.0 (Estimated: Q1 2026)
**Focus:** Web Presence & Community Growth
### Planned Features
#### Community & Documentation
- [ ] **Task A3.1: GitHub Pages website** (skillseekersweb.com)
  - Interactive documentation
  - Live demos and examples
  - Getting started wizard
  - Community showcase
- [ ] **Plugin system foundation**
  - Allow custom scrapers via plugins
  - Plugin discovery and installation
  - Plugin documentation generator
#### Enhancements
- [ ] **Support for additional documentation formats**
  - Sphinx documentation
  - Docusaurus sites
  - GitBook
  - Read the Docs
  - MkDocs Material
- [ ] **Improved caching strategies**
  - Intelligent cache invalidation
  - Differential scraping (only changed pages)
  - Cache compression
  - Cross-session cache sharing
#### Performance
- [ ] **Scraping performance improvements**
  - Connection pooling optimizations
  - Smart rate limiting based on server response
  - Adaptive concurrency
  - Memory usage optimization for large docs
---
## Release: v2.3.0 (Estimated: Q2 2026)
**Focus:** Developer Experience & Integrations
### Planned Features
#### Developer Tools
- [ ] **Web UI for config generation**
  - Visual config builder
  - Real-time preview
  - Template library
  - Export/import configs
- [ ] **CI/CD integration examples**
  - GitHub Actions workflows
  - GitLab CI
  - Jenkins pipelines
  - Automated skill updates on doc changes
- [ ] **Docker containerization**
  - Official Docker images
  - docker-compose examples
  - Kubernetes deployment guides
#### API & Integrations
- [ ] **GraphQL API support**
  - Scrape GraphQL documentation
  - Extract schema and queries
  - Generate interactive examples
- [ ] **REST API documentation formats**
  - OpenAPI/Swagger
  - Postman collections
  - API Blueprint
---
## Long-term Vision (v3.0+)
### Major Features Under Consideration
#### Advanced Scraping
- [ ] **Real-time documentation monitoring**
  - Watch for documentation changes
  - Automatic skill updates
  - Change notifications
  - Version diff reports
- [ ] **Multi-language documentation**
  - Automatic language detection
  - Combined multi-language skills
  - Translation quality checking
#### Collaboration
- [ ] **Collaborative skill curation**
  - Shared skill repositories
  - Community ratings and reviews
  - Collaborative editing
  - Fork and merge workflows
- [ ] **Skill marketplace**
  - Discover community-created skills
  - Share your skills
  - Quality ratings
  - Usage statistics
#### AI & Intelligence
- [ ] **Enhanced AI analysis**
  - Better conflict detection algorithms
  - Automatic documentation quality scoring
  - Suggested improvements
  - Code example validation
- [ ] **Semantic understanding**
  - Natural language queries for skill content
  - Intelligent categorization
  - Auto-generated summaries
  - Concept relationship mapping
---
## Backlog Ideas
### Features Requested by Community
- [ ] Support for video tutorial transcription
- [ ] Integration with Notion, Confluence, and other wikis
- [ ] Jupyter notebook scraping and conversion
- [ ] Live documentation preview during scraping
- [ ] Skill versioning and update management
- [ ] A/B testing for skill quality
- [ ] Analytics dashboard (scraping stats, error rates, etc.)
### Technical Improvements
- [ ] Migration to modern async framework (httpx everywhere)
- [ ] Improved type safety (full mypy strict mode)
- [ ] Better logging and debugging tools
- [ ] Performance profiling dashboard
- [ ] Memory optimization for very large docs (100K+ pages)
### Ecosystem
- [ ] VS Code extension
- [ ] IntelliJ/PyCharm plugin
- [ ] Command-line interactive mode (TUI)
- [ ] Skill diff tool (compare versions)
- [ ] Skill merge tool (combine multiple skills)
---
## How to Influence the Roadmap
### Priority System
Features are prioritized based on:
1. **User impact** - How many users will benefit?
2. **Technical feasibility** - How complex is the implementation?
3. **Community interest** - How many upvotes/requests?
4. **Strategic alignment** - Does it fit our vision?
### Ways to Contribute
#### 1. Vote on Features
- ⭐ Star feature request issues
- 💬 Comment with your use case
- 🔼 Upvote discussions
#### 2. Contribute Code
See our [FLEXIBLE_ROADMAP.md](FLEXIBLE_ROADMAP.md) for:
- **134 tasks** across 22 feature groups
- Tasks categorized by difficulty and area
- Clear acceptance criteria
- Estimated effort levels
Pick any task and submit a PR! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
#### 3. Share Feedback
- Open issues for bugs or feature requests
- Share your success stories
- Suggest improvements to existing features
- Report performance issues
#### 4. Help with Documentation
- Write tutorials
- Improve existing docs
- Translate documentation
- Create video guides
---
## Release Schedule
We aim for predictable releases:
- **Patch releases (2.x.y)**: As needed for critical bugs
- **Minor releases (2.x.0)**: Every 2-3 months
- **Major releases (x.0.0)**: Annually, with breaking changes announced 3 months in advance
### Current Schedule
| Version | Focus | ETA | Status |
|---------|-------|-----|--------|
| v2.0.0 | PyPI Publication | 2025-11-11 | ✅ Released |
| v2.1.0 | Test Coverage & Quality | 2025-11-29 | ✅ Released |
| v2.2.0 | Web Presence | Q1 2026 | 📋 Planned |
| v2.3.0 | Developer Experience | Q2 2026 | 📋 Planned |
| v3.0.0 | Major Evolution | 2026 | 💡 Conceptual |
---
## Stay Updated
- 📋 **Project Board**: https://github.com/users/yusufkaraaslan/projects/2
- 📚 **Full Roadmap**: [FLEXIBLE_ROADMAP.md](FLEXIBLE_ROADMAP.md)
- 📝 **Changelog**: [CHANGELOG.md](CHANGELOG.md)
- 💬 **Discussions**: https://github.com/yusufkaraaslan/Skill_Seekers/discussions
- 🐛 **Issues**: https://github.com/yusufkaraaslan/Skill_Seekers/issues
---
## Questions?
Have questions about the roadmap or want to suggest a feature?
1. Check if it's already in our [FLEXIBLE_ROADMAP.md](FLEXIBLE_ROADMAP.md)
2. Search [existing discussions](https://github.com/yusufkaraaslan/Skill_Seekers/discussions)
3. Open a new discussion or issue
4. Reach out in our community channels
**Together, we're building the future of documentation-to-AI skill conversion!** 🚀


@ -0,0 +1,21 @@
MIT License
Copyright (c) 2025 [Your Name/Username]
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


@ -0,0 +1,196 @@
# Quick Start Guide
## 🚀 4 Steps to Create a Skill
### Step 1: Install Dependencies
```bash
pip3 install requests beautifulsoup4
```
> **Note:** Skill_Seekers automatically checks for llms.txt files first, which is 10x faster when available.
### Step 2: Run the Tool
**Option A: Use a Preset (Easiest)**
```bash
skill-seekers scrape --config configs/godot.json
```
**Option B: Interactive Mode**
```bash
skill-seekers scrape --interactive
```
**Option C: Quick Command**
```bash
skill-seekers scrape --name react --url https://react.dev/
```
**Option D: Unified Multi-Source (NEW - v2.0.0)**
```bash
# Combine documentation + GitHub code in one skill
skill-seekers unified --config configs/react_unified.json
```
*Detects conflicts between docs and code automatically!*
### Step 3: Enhance SKILL.md (Recommended)
```bash
# LOCAL enhancement (no API key, uses Claude Code Max)
skill-seekers enhance output/godot/
```
**This takes 60 seconds and dramatically improves the SKILL.md quality!**
### Step 4: Package the Skill
```bash
skill-seekers package output/godot/
```
**Done!** You now have `godot.zip` ready to use.
---
## 📋 Available Presets
```bash
# Godot Engine
skill-seekers scrape --config configs/godot.json
# React
skill-seekers scrape --config configs/react.json
# Vue.js
skill-seekers scrape --config configs/vue.json
# Django
skill-seekers scrape --config configs/django.json
# FastAPI
skill-seekers scrape --config configs/fastapi.json
# Unified Multi-Source (NEW!)
skill-seekers unified --config configs/react_unified.json
skill-seekers unified --config configs/django_unified.json
skill-seekers unified --config configs/fastapi_unified.json
skill-seekers unified --config configs/godot_unified.json
```
---
## ⚡ Using Existing Data (Fast!)
If you already scraped once:
```bash
skill-seekers scrape --config configs/godot.json
# When prompted:
✓ Found existing data: 245 pages
Use existing data? (y/n): y
# Builds in seconds!
```
Or use `--skip-scrape`:
```bash
skill-seekers scrape --config configs/godot.json --skip-scrape
```
---
## 🎯 Complete Example (Recommended Workflow)
```bash
# 1. Install (once)
pip3 install requests beautifulsoup4
# 2. Scrape React docs with LOCAL enhancement
skill-seekers scrape --config configs/react.json --enhance-local
# Wait 15-30 minutes (scraping) + 60 seconds (enhancement)
# 3. Package
skill-seekers package output/react/
# 4. Use react.zip in Claude!
```
**Alternative: Enhancement after scraping**
```bash
# 2a. Scrape only (no enhancement)
skill-seekers scrape --config configs/react.json
# 2b. Enhance later
skill-seekers enhance output/react/
# 3. Package
skill-seekers package output/react/
```
---
## 💡 Pro Tips
### Test with Small Pages First
Edit the config file to cap the crawl while testing (here, 20 pages):
```json
{
  "max_pages": 20
}
```
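To avoid editing a preset by hand, a throwaway copy with a lower `max_pages` can be generated instead. The sketch below assumes the config is a flat JSON object that accepts a `max_pages` key, as in the snippet above. The helper name and the `react_test.json` file name are hypothetical.

```python
import json
from pathlib import Path


def make_test_config(src: str, dst: str, max_pages: int = 20) -> dict:
    """Copy a config file, overriding max_pages, and return the new config."""
    config = json.loads(Path(src).read_text(encoding="utf-8"))
    config["max_pages"] = max_pages  # cap the crawl for a quick trial run
    Path(dst).write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, `make_test_config("configs/react.json", "configs/react_test.json")` would leave the original preset untouched, and the copy could then be passed to `skill-seekers scrape --config`.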
### Rebuild Instantly
```bash
# After first scrape, you can rebuild instantly:
skill-seekers scrape --config configs/react.json --skip-scrape
```
### Create Custom Config
```bash
# Copy a preset
cp configs/react.json configs/myframework.json
# Edit it
nano configs/myframework.json
# Use it
skill-seekers scrape --config configs/myframework.json
```
---
## 📁 What You Get
```
output/
├── godot_data/ # Raw scraped data (reusable!)
└── godot/ # The skill
    ├── SKILL.md # With real code examples!
    └── references/ # Organized docs
```
---
## ❓ Need Help?
See **README.md** for:
- Complete documentation
- Config file structure
- Troubleshooting
- Advanced usage
---
## 🎮 Let's Go!
```bash
# Godot
skill-seekers scrape --config configs/godot.json
# Or interactive
skill-seekers scrape --interactive
```
That's it! 🚀

File diff suppressed because it is too large


@ -0,0 +1,266 @@
# Skill Seeker Development Roadmap
## Vision
Transform Skill Seeker into the easiest way to create Claude AI skills from **any knowledge source** - documentation websites, PDFs, codebases, GitHub repos, Office docs, and more - with both CLI and MCP interfaces.
## 🎯 New Approach: Flexible, Incremental Development
**Philosophy:** Small tasks → Pick one → Complete → Move on
Instead of rigid milestones, we now use a **flexible task-based approach**:
- 100+ small, independent tasks across 10 categories
- Pick any task, any order
- Start small, ship often
- No deadlines, just continuous progress
**See:** [FLEXIBLE_ROADMAP.md](FLEXIBLE_ROADMAP.md) for the complete task list!
---
## 🎯 Milestones
### ✅ v1.0 - Production Release (COMPLETED - Oct 19, 2025)
**Released:** October 19, 2025 | **Tag:** v1.0.0
#### Core Features ✅
- [x] Documentation scraping with BFS
- [x] Smart categorization
- [x] Language detection
- [x] Pattern extraction
- [x] 12 preset configurations (Godot, React, Vue, Django, FastAPI, Tailwind, Kubernetes, Astro, etc.)
- [x] Comprehensive test suite (14 tests, 100% pass rate)
#### MCP Integration ✅
- [x] Monorepo refactor (cli/ and mcp/)
- [x] MCP server with 9 tools (fully functional)
- [x] All MCP tools tested and working
- [x] Complete MCP documentation
- [x] Setup automation (setup_mcp.sh)
#### Large Documentation Support ✅
- [x] Config splitting for 40K+ page docs
- [x] Router/hub skill generation
- [x] Checkpoint/resume functionality
- [x] Parallel scraping support
#### Auto-Upload Feature ✅
- [x] Smart API key detection
- [x] Automatic upload to Claude
- [x] Cross-platform folder opening
- [x] Graceful fallback to manual upload
**Statistics:**
- 9 MCP tools (fully working)
- 12 preset configurations
- 14/14 tests passing (100%)
- ~3,800 lines of code
- Complete documentation suite
---
## 📋 Task Categories (Flexible Development)
See [FLEXIBLE_ROADMAP.md](FLEXIBLE_ROADMAP.md) for detailed task breakdown.
### Category Summary:
- **🌐 Community & Sharing** - Config/knowledge sharing website features
- **🛠️ New Input Formats** - PDF, Word, Excel, Markdown support
- **💻 Codebase Knowledge** - GitHub repos, local code scraping
- **🔌 Context7 Integration** - Enhanced context management
- **🚀 MCP Enhancements** - New tools and quality improvements
- **⚡ Performance & Reliability** - Core improvements
- **🎨 Tools & Utilities** - Standalone helper tools
- **📚 Community Response** - Address GitHub issues
- **🎓 Content & Documentation** - Videos and guides
- **🧪 Testing & Quality** - Test coverage expansion
---
### ~~📋 v1.1 - Website Launch (PLANNED)~~ → Now flexible tasks!
**Goal:** Create professional website and community presence
**Timeline:** November 2025 (Due: Nov 3, 2025)
**Features:**
- Professional landing page (skillseekersweb.com)
- Documentation migration to website
- Preset showcase gallery (interactive)
- Blog with release notes and tutorials
- SEO optimization
- Analytics integration
**Community:**
- Video tutorial series
- Contributing guidelines
- Issue templates and workflows
- GitHub Project board
- Community engagement
---
### 📋 v1.2 - Core Improvements (PLANNED)
**Goal:** Address technical debt and performance
**Timeline:** Late November 2025
**Technical Enhancements:**
- URL normalization/deduplication
- Memory optimization for large docs
- HTML parser fallback (lxml)
- Selector validation tool
- Incremental update system
**MCP Enhancements:**
- Interactive config wizard via MCP
- Real-time progress updates
- Auto-detect documentation patterns
- Enhanced error handling and logging
- Batch operations
---
### 📋 v2.0 - Intelligence Layer (PLANNED)
**Goal:** Smart defaults and auto-configuration
**Timeline:** December 2025
**Features:**
- **Auto-detection:**
- Automatically find best selectors
- Detect documentation framework (Docusaurus, GitBook, etc.)
- Suggest optimal rate_limit and max_pages
- **Quality Metrics:**
- Analyze generated SKILL.md quality
- Suggest improvements
- Validate code examples
- **Templates:**
- Pre-built configs for popular frameworks
- Community config sharing
- One-click generation for common docs
**Example:**
```
User: "Create skill from https://tailwindcss.com/docs"
Tool: Auto-detects Tailwind, uses template, generates in 30 seconds
```
---
### 💭 v3.0 - Platform Features (IDEAS)
**Goal:** Build ecosystem around skill generation
**Possible Features:**
- Web UI for config generation
- GitHub Actions integration
- Skill marketplace
- Analytics dashboard
- API for programmatic access
---
## 🎨 Feature Ideas
### High Priority
1. **Selector Auto-Detection** - Analyze page, suggest selectors
2. **Progress Streaming** - Real-time updates during scraping
3. **Config Validation UI** - Visual feedback on config quality
4. **Batch Processing** - Handle multiple sites at once
### Medium Priority
5. **Skill Quality Score** - Rate generated skills
6. **Enhanced SKILL.md** - Better templates, more examples
7. **Documentation Framework Detection** - Auto-detect Docusaurus, VuePress, etc.
8. **Custom Categories AI** - Use AI to suggest categories
### Low Priority
9. **Web Dashboard** - Browser-based interface
10. **Skill Analytics** - Track usage, quality metrics
11. **Community Configs** - Share and discover configs
12. **Plugin System** - Extend with custom scrapers
---
## 🔬 Research Areas
### MCP Enhancements
- [ ] Investigate MCP progress/streaming APIs
- [ ] Test MCP with large documentation sites
- [ ] Explore MCP caching strategies
### AI Integration
- [ ] Use Claude to auto-generate categories
- [ ] AI-powered selector detection
- [ ] Quality analysis with LLMs
### Performance
- [ ] Parallel scraping
- [ ] Incremental updates
- [ ] Smart caching
---
## 📊 Metrics & Goals
### Current State (Oct 20, 2025) ✅
- ✅ 12 preset configs (Godot, React, Vue, Django, FastAPI, Tailwind, Kubernetes, Astro, etc.)
- ✅ 14/14 tests (100% pass rate)
- ✅ 9 MCP tools (fully functional)
- ✅ ~3,800 lines of code
- ✅ Complete documentation suite
- ✅ Production-ready v1.0.0 release
- ✅ Auto-upload functionality
- ✅ Large documentation support (40K+ pages)
### Goals for v1.1 (Website Launch)
- 🎯 Professional website live
- 🎯 Video tutorial series (5 videos)
- 🎯 20+ GitHub stars
- 🎯 Community engagement started
- 🎯 Documentation site migration
### Goals for v1.2 (Core Improvements)
- 🎯 Enhanced MCP features
- 🎯 Performance optimization
- 🎯 Better error handling
- 🎯 Incremental update system
### Goals for v2.0 (Intelligence)
- 🎯 50+ preset configs
- 🎯 Auto-detection for 80%+ of sites
- 🎯 <1 minute skill generation
- 🎯 Community contributions
- 🎯 Quality scoring system
---
## 🤝 Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md) for:
- How to add new MCP tools
- Testing guidelines
- Code style
- PR process
---
## 📅 Release Schedule
| Version | Target Date | Status | Focus |
|---------|-------------|--------|-------|
| v1.0.0 | Oct 19, 2025 | ✅ **RELEASED** | Core CLI + MCP Integration |
| v1.1.0 | Nov 3, 2025 | 📋 Planned | Website Launch |
| v1.2.0 | Late Nov 2025 | 📋 Planned | Core Improvements |
| v2.0.0 | Dec 2025 | 📋 Planned | Intelligence Layer |
| v3.0.0 | Q1 2026 | 💭 Ideas | Platform Features |
---
## 🔗 Related Projects
- [Model Context Protocol](https://modelcontextprotocol.io/)
- [Claude Code](https://claude.ai/code)
- [Anthropic Claude](https://claude.ai)
- Documentation frameworks we support: Docusaurus, GitBook, VuePress, Sphinx, MkDocs
---
**Last Updated:** October 20, 2025

View File

@ -0,0 +1,124 @@
# Repository Structure
```
Skill_Seekers/
├── 📄 Root Documentation
│ ├── README.md # Main documentation (start here!)
│ ├── CLAUDE.md # Quick reference for Claude Code
│ ├── QUICKSTART.md # 3-step quick start guide
│ ├── ROADMAP.md # Development roadmap
│ ├── TODO.md # Current sprint tasks
│ ├── STRUCTURE.md # This file
│ ├── LICENSE # MIT License
│ └── .gitignore # Git ignore rules
├── 🔧 CLI Tools (cli/)
│ ├── doc_scraper.py # Main scraping tool
│ ├── estimate_pages.py # Page count estimator
│ ├── enhance_skill.py # AI enhancement (API-based)
│ ├── enhance_skill_local.py # AI enhancement (LOCAL, no API)
│ ├── package_skill.py # Skill packaging tool
│ └── run_tests.py # Test runner
├── 🌐 MCP Server (mcp/)
│ ├── server.py # Main MCP server
│ ├── requirements.txt # MCP dependencies
│ └── README.md # MCP setup guide
├── 📁 configs/ # Preset configurations
│ ├── godot.json
│ ├── react.json
│ ├── vue.json
│ ├── django.json
│ ├── fastapi.json
│ ├── kubernetes.json
│ └── steam-economy-complete.json
├── 🧪 tests/ # Test suite (71 tests, 100% pass rate)
│ ├── test_config_validation.py
│ ├── test_integration.py
│ └── test_scraper_features.py
├── 📚 docs/ # Detailed documentation
│ ├── CLAUDE.md # Technical architecture
│ ├── ENHANCEMENT.md # AI enhancement guide
│ ├── USAGE.md # Complete usage guide
│ ├── TESTING.md # Testing guide
│ └── UPLOAD_GUIDE.md # How to upload skills
├── 🔀 .github/ # GitHub configuration
│ ├── SETUP_GUIDE.md # GitHub project setup
│ ├── ISSUES_TO_CREATE.md # Issue templates
│ └── ISSUE_TEMPLATE/ # Issue templates
└── 📦 output/ # Generated skills (git-ignored)
    ├── {name}_data/ # Scraped raw data (cached)
    └── {name}/ # Built skills
        ├── SKILL.md # Main skill file
        └── references/ # Reference documentation
```
## Key Files
### For Users:
- **README.md** - Start here for overview and installation
- **QUICKSTART.md** - Get started in 3 steps
- **configs/** - 7 ready-to-use presets
- **mcp/README.md** - MCP server setup for Claude Code
### For CLI Usage:
- **cli/doc_scraper.py** - Main scraping tool
- **cli/estimate_pages.py** - Page count estimator
- **cli/enhance_skill_local.py** - Local enhancement (no API key)
- **cli/package_skill.py** - Package skills to .zip
### For MCP Usage (Claude Code):
- **mcp/server.py** - MCP server (6 tools)
- **mcp/README.md** - Setup instructions
- **configs/** - Shared configurations
### For Developers:
- **docs/CLAUDE.md** - Architecture and internals
- **docs/USAGE.md** - Complete usage guide
- **docs/TESTING.md** - Testing guide
- **tests/** - 71 tests (100% pass rate)
### For Contributors:
- **ROADMAP.md** - Development roadmap
- **TODO.md** - Current sprint tasks
- **.github/SETUP_GUIDE.md** - GitHub setup
- **LICENSE** - MIT License
## Architecture
### Monorepo Structure
The repository is organized as a monorepo with two main components:
1. **CLI Tools** (`cli/`): Standalone Python scripts for direct command-line usage
2. **MCP Server** (`mcp/`): Model Context Protocol server for Claude Code integration
Both components share the same configuration files and output directory.
### Data Flow
```
Config (configs/*.json)
  ↓
CLI Tools OR MCP Server
  ↓
Scraper (cli/doc_scraper.py)
  ↓
Output (output/{name}_data/)
  ↓
Builder (cli/doc_scraper.py)
  ↓
Skill (output/{name}/)
  ↓
Enhancer (optional)
  ↓
Packager (cli/package_skill.py)
  ↓
Skill .zip (output/{name}.zip)
```
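The locations named in the data flow can be derived from a skill name alone. The following is a minimal sketch of that mapping; `skill_paths` and its dictionary keys are hypothetical names chosen for illustration and are not part of the repository.

```python
from pathlib import Path


def skill_paths(name: str, output_dir: str = "output") -> dict[str, Path]:
    """Return the locations the data flow mentions for one skill.

    Hypothetical helper: the layout comes from the tree in this file,
    the function itself is only an illustration.
    """
    root = Path(output_dir)
    return {
        "raw_data": root / f"{name}_data",          # scraped raw data (cached)
        "skill": root / name,                       # built skill directory
        "skill_md": root / name / "SKILL.md",       # main skill file
        "references": root / name / "references",   # reference documentation
        "package": root / f"{name}.zip",            # packaged skill
    }


if __name__ == "__main__":
    for label, path in skill_paths("react").items():
        print(f"{label:<10} {path}")
```

Keeping the raw data and the built skill in separate directories is what allows a rebuild from cached data without scraping again.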

View File

@ -0,0 +1,446 @@
# Troubleshooting Guide
Common issues and solutions when using Skill Seeker.
---
## Installation Issues
### Python Not Found
**Error:**
```
python3: command not found
```
**Solutions:**
1. **Check if Python is installed:**
```bash
which python3
python --version # Try without the 3
```
2. **Install Python:**
- **macOS:** `brew install python3`
- **Linux:** `sudo apt install python3 python3-pip`
- **Windows:** Download from python.org, check "Add to PATH"
3. **Use python instead of python3:**
```bash
python cli/doc_scraper.py --help
```
### Module Not Found
**Error:**
```
ModuleNotFoundError: No module named 'requests'
ModuleNotFoundError: No module named 'bs4'
ModuleNotFoundError: No module named 'mcp'
```
**Solutions:**
1. **Install dependencies:**
```bash
pip3 install requests beautifulsoup4
pip3 install -r mcp/requirements.txt # For MCP
```
2. **Use --user flag if permission denied:**
```bash
pip3 install --user requests beautifulsoup4
```
3. **Check pip is working:**
```bash
pip3 --version
```
### Permission Denied
**Error:**
```
Permission denied: '/usr/local/lib/python3.x/...'
```
**Solutions:**
1. **Use --user flag:**
```bash
pip3 install --user requests beautifulsoup4
```
2. **Use sudo (not recommended):**
```bash
sudo pip3 install requests beautifulsoup4
```
3. **Use virtual environment (best practice):**
```bash
python3 -m venv venv
source venv/bin/activate
pip install requests beautifulsoup4
```
---
## Runtime Issues
### File Not Found
**Error:**
```
FileNotFoundError: [Errno 2] No such file or directory: 'cli/doc_scraper.py'
```
**Solutions:**
1. **Check you're in the Skill_Seekers directory:**
```bash
pwd
# Should show: .../Skill_Seekers
ls
# Should show: README.md, cli/, mcp/, configs/
```
2. **Change to the correct directory:**
```bash
cd ~/Projects/Skill_Seekers # Adjust path
```
### Config File Not Found
**Error:**
```
FileNotFoundError: configs/react.json
```
**Solutions:**
1. **Check config exists:**
```bash
ls configs/
# Should show: godot.json, react.json, vue.json, etc.
```
2. **Use full path:**
```bash
skill-seekers scrape --config $(pwd)/configs/react.json
```
3. **Create missing config:**
```bash
skill-seekers scrape --interactive
```
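When it is unclear whether a config is missing or merely malformed, a quick pre-flight check can help. This is a rough sketch, not the project's own validator: `check_config` is a hypothetical helper, and the assumption that a config needs `name` plus one of `base_url`, `sources`, `repo`, or `pdf_path` is inferred from the preset files, not from the tool's code.

```python
import json
from pathlib import Path

# Assumption: every preset declares at least one of these source fields.
SOURCE_KEYS = ("base_url", "sources", "repo", "pdf_path")


def check_config(path: str) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    config_file = Path(path)
    if not config_file.is_file():
        folder = config_file.parent
        available = sorted(p.name for p in folder.glob("*.json")) if folder.is_dir() else []
        return [f"{path} not found; available: {', '.join(available) or 'none'}"]
    try:
        config = json.loads(config_file.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{path} is not valid JSON: {exc}"]
    if not isinstance(config, dict):
        return [f"{path} should contain a JSON object"]
    problems = []
    if "name" not in config:
        problems.append("missing 'name'")
    if not any(key in config for key in SOURCE_KEYS):
        problems.append(f"missing a source field ({', '.join(SOURCE_KEYS)})")
    return problems


if __name__ == "__main__":
    print(check_config("configs/react.json") or "no problems found")
```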
---
## MCP Setup Issues
### MCP Server Not Loading
**Symptoms:**
- Tools don't appear in Claude Code
- "List all available configs" doesn't work
**Solutions:**
1. **Check configuration file:**
```bash
cat ~/.config/claude-code/mcp.json
```
2. **Verify paths are ABSOLUTE (not placeholders):**
```json
{
"mcpServers": {
"skill-seeker": {
"args": [
"/Users/yourname/Projects/Skill_Seekers/mcp/server.py"
]
}
}
}
```
**Bad:** `$REPO_PATH` or `/path/to/Skill_Seekers`
**Good:** `/Users/john/Projects/Skill_Seekers`
3. **Test server manually:**
```bash
cd ~/Projects/Skill_Seekers
python3 mcp/server.py
# Should start without errors (Ctrl+C to stop)
```
4. **Re-run setup script:**
```bash
./setup_mcp.sh
# Select "y" for auto-configure
```
5. **RESTART Claude Code completely:**
- Quit (don't just close window)
- Reopen
### Placeholder Paths in Config
**Problem:** Config has `$REPO_PATH` or `/Users/username/` instead of real paths
**Solution:**
```bash
# Get your actual path
cd ~/Projects/Skill_Seekers
pwd
# Copy this path
# Edit config
nano ~/.config/claude-code/mcp.json
# Replace ALL instances of placeholders with your actual path
# Save (Ctrl+O, Enter, Ctrl+X)
# Restart Claude Code
```
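The manual check above can also be scripted. The sketch below scans the `args` and `cwd` values of each entry under `mcpServers` for the placeholders mentioned in this guide and for relative paths. `find_bad_paths` is a hypothetical helper, and the placeholder list is an assumption based only on the examples shown here.

```python
import json
import os
import re

# Placeholders this guide calls out as "bad"; extend as needed.
PLACEHOLDER = re.compile(r"\$REPO_PATH|/path/to/|/Users/username/")


def find_bad_paths(config: dict) -> list[str]:
    """List path-like values that are placeholders or not absolute."""
    problems = []
    for name, server in config.get("mcpServers", {}).items():
        candidates = list(server.get("args", []))
        if "cwd" in server:
            candidates.append(server["cwd"])
        for value in candidates:
            if not isinstance(value, str):
                continue
            if PLACEHOLDER.search(value):
                problems.append(f"{name}: placeholder in {value!r}")
            elif ("/" in value or "\\" in value) and not os.path.isabs(value):
                # Only values that look like paths are checked; flags such as "-m" are skipped.
                problems.append(f"{name}: relative path {value!r}")
    return problems


if __name__ == "__main__":
    path = os.path.expanduser("~/.config/claude-code/mcp.json")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            for problem in find_bad_paths(json.load(handle)):
                print(problem)
    else:
        print(f"{path} not found")
```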
### Tools Appear But Don't Work
**Symptoms:**
- Tools listed but commands fail
- "Error executing tool" messages
**Solutions:**
1. **Check working directory:**
```json
{
"cwd": "/FULL/PATH/TO/Skill_Seekers"
}
```
2. **Verify files exist:**
```bash
ls cli/doc_scraper.py
ls mcp/server.py
```
3. **Test CLI tools directly:**
```bash
skill-seekers scrape --help
```
---
## Scraping Issues
### Slow or Hanging
**Solutions:**
1. **Check network connection:**
```bash
ping google.com
curl -I https://docs.yoursite.com
```
2. **Use smaller max_pages for testing:**
```bash
skill-seekers scrape --config configs/test.json --max-pages 5
```
3. **Increase rate_limit in config (for example, from 0.5 to 1.0):**
```json
{
  "rate_limit": 1.0
}
```
### No Content Extracted
**Problem:** Pages scraped but content is empty
**Solutions:**
1. **Check selector in config:**
```bash
# Test with browser dev tools
# Look for: article, main, div[role="main"], div.content
```
2. **Verify website is accessible:**
```bash
curl https://docs.example.com
```
3. **Try different selectors (for example `article`, `main`, or `div.content`):**
```json
{
  "selectors": {
    "main_content": "article"
  }
}
```
### Rate Limiting / 429 Errors
**Error:**
```
HTTP Error 429: Too Many Requests
```
**Solutions:**
1. **Increase rate_limit (for example, wait 2 seconds between requests):**
```json
{
  "rate_limit": 2.0
}
```
2. **Reduce max_pages to scrape fewer pages:**
```json
{
  "max_pages": 50
}
```
3. **Try again later:**
```bash
# Wait an hour and retry
```
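To make the effect of `rate_limit` and `max_pages` concrete, here is a minimal sketch of a polite crawl loop. It assumes `rate_limit` is the pause in seconds between requests, as described above. The `fetch` callable, the doubling back-off on 429, and the `crawl_politely` name are all hypothetical and do not describe how the scraper is actually implemented.

```python
import time
from typing import Callable, Iterable


def crawl_politely(
    urls: Iterable[str],
    fetch: Callable[[str], tuple[int, str]],
    rate_limit: float = 0.5,
    max_pages: int = 50,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, str]:
    """Fetch at most max_pages URLs, pausing rate_limit seconds after each one.

    fetch(url) must return (status_code, body). On HTTP 429 the wait is
    doubled before each retry, up to max_retries retries per URL.
    """
    pages: dict[str, str] = {}
    for url in list(urls)[:max_pages]:
        delay = rate_limit
        for _attempt in range(max_retries + 1):
            status, body = fetch(url)
            if status != 429:
                if status == 200:
                    pages[url] = body
                break
            delay *= 2
            sleep(delay)  # back off before retrying the same URL
        sleep(rate_limit)  # regular pause between different URLs
    return pages
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; in normal use the default `time.sleep` applies.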
---
## Platform-Specific Issues
### macOS
**Issue:** Can't run `./setup_mcp.sh`
**Solution:**
```bash
chmod +x setup_mcp.sh
./setup_mcp.sh
```
**Issue:** Homebrew not installed
**Solution:**
```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```
### Linux
**Issue:** pip3 not found
**Solution:**
```bash
sudo apt update
sudo apt install python3-pip
```
**Issue:** Permission errors
**Solution:**
```bash
# Use --user flag
pip3 install --user requests beautifulsoup4
```
### Windows (WSL)
**Issue:** Python not in PATH
**Solution:**
1. Reinstall Python
2. Check "Add Python to PATH"
3. Or add manually to PATH
**Issue:** Line ending errors
**Solution:**
```bash
dos2unix setup_mcp.sh
./setup_mcp.sh
```
---
## Verification Commands
Use these to check your setup:
```bash
# 1. Check Python
python3 --version # Should be 3.10+
# 2. Check dependencies
pip3 list | grep requests
pip3 list | grep beautifulsoup4
pip3 list | grep mcp
# 3. Check files exist
ls cli/doc_scraper.py
ls mcp/server.py
ls configs/
# 4. Check MCP config
cat ~/.config/claude-code/mcp.json
# 5. Test scraper
skill-seekers scrape --help
# 6. Test MCP server (timeout exits with 124 if the server was still running after 3 seconds)
timeout 3 python3 mcp/server.py; [ $? -eq 124 ] && echo "Server OK"
# 7. Check git repo
git status
git log --oneline -5
```
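The Python and dependency checks above can also be run from Python itself, which avoids differences between `pip3` and the interpreter actually in use. This is a small sketch under the assumptions stated in this guide (Python 3.10+, and the `requests`, `bs4`, and `mcp` modules); `check_environment` is a hypothetical helper.

```python
import importlib.util
import sys

# Module names taken from the "Module Not Found" errors in this guide.
REQUIRED_MODULES = ("requests", "bs4", "mcp")


def check_environment(min_version=(3, 10), modules=REQUIRED_MODULES, version=sys.version_info):
    """Return a list of problems with the interpreter version or missing modules."""
    problems = []
    if tuple(version[:2]) < tuple(min_version):
        problems.append(
            f"Python {version[0]}.{version[1]} found, "
            f"{min_version[0]}.{min_version[1]}+ expected"
        )
    for module in modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(module) is None:
            problems.append(f"missing module: {module}")
    return problems


if __name__ == "__main__":
    for problem in check_environment() or ["environment looks fine"]:
        print(problem)
```

One caveat: when this runs from the repository root, the local `mcp/` directory may be picked up in place of the installed `mcp` package, so a clean result for `mcp` is best confirmed from another directory.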
---
## Getting Help
If none of these solutions work:
1. **Check existing issues:**
https://github.com/yusufkaraaslan/Skill_Seekers/issues
2. **Open a new issue with:**
- Your OS (macOS 13, Ubuntu 22.04, etc.)
- Python version (`python3 --version`)
- Full error message
- What command you ran
- Output of verification commands above
3. **Include this debug info:**
```bash
# System info
uname -a
python3 --version
pip3 --version
# Skill Seeker info
cd ~/Projects/Skill_Seekers # Your path
pwd
git log --oneline -1
ls -la cli/ mcp/ configs/
# MCP config (if using MCP)
cat ~/.config/claude-code/mcp.json
```
---
## Quick Fixes Checklist
- [ ] In the Skill_Seekers directory? (`pwd`)
- [ ] Python 3.10+ installed? (`python3 --version`)
- [ ] Dependencies installed? (`pip3 list | grep requests`)
- [ ] Config file exists? (`ls configs/yourconfig.json`)
- [ ] Internet connection working? (`ping google.com`)
- [ ] For MCP: Config uses absolute paths? (not `$REPO_PATH`)
- [ ] For MCP: Claude Code restarted? (quit and reopen)
---
**Still stuck?** Open an issue: https://github.com/yusufkaraaslan/Skill_Seekers/issues/new

View File

@ -0,0 +1,31 @@
{
"name": "ansible-core",
"description": "Ansible Core 2.19 skill for automation and configuration management",
"base_url": "https://docs.ansible.com/ansible-core/2.19/",
"selectors": {
"main_content": "div[role=main]",
"title": "title",
"code_blocks": "pre"
},
"url_patterns": {
"include": [],
"exclude": ["/_static/", "/_images/", "/_downloads/", "/search.html", "/genindex.html", "/py-modindex.html", "/index.html", "/roadmap/"]
},
"categories": {
"getting_started": ["getting_started", "getting-started", "introduction", "overview"],
"installation": ["installation_guide", "installation", "setup"],
"inventory": ["inventory_guide", "inventory"],
"playbooks": ["playbook_guide", "playbooks", "playbook"],
"modules": ["module_plugin_guide", "modules", "plugins"],
"collections": ["collections_guide", "collections"],
"vault": ["vault_guide", "vault", "encryption"],
"commands": ["command_guide", "commands", "cli"],
"porting": ["porting_guides", "porting", "migration"],
"os_specific": ["os_guide", "platform"],
"tips": ["tips_tricks", "tips", "tricks", "best-practices"],
"community": ["community", "contributing", "contributions"],
"development": ["dev_guide", "development", "developing"]
},
"rate_limit": 0.5,
"max_pages": 800
}

View File

@ -0,0 +1,30 @@
{
"name": "astro",
"description": "Astro web framework for content-focused websites. Use for Astro components, islands architecture, content collections, SSR/SSG, and modern web development.",
"base_url": "https://docs.astro.build/en/getting-started/",
"start_urls": [
"https://docs.astro.build/en/getting-started/",
"https://docs.astro.build/en/install/auto/",
"https://docs.astro.build/en/core-concepts/project-structure/",
"https://docs.astro.build/en/core-concepts/astro-components/",
"https://docs.astro.build/en/core-concepts/astro-pages/"
],
"selectors": {
"main_content": "article",
"title": "h1",
"code_blocks": "pre code"
},
"url_patterns": {
"include": ["/en/"],
"exclude": ["/blog", "/integrations"]
},
"categories": {
"getting_started": ["getting-started", "install", "tutorial"],
"core_concepts": ["core-concepts", "project-structure", "components", "pages"],
"guides": ["guides", "deploy", "migrate"],
"configuration": ["configuration", "config", "typescript"],
"integrations": ["integrations", "framework", "adapter"]
},
"rate_limit": 0.5,
"max_pages": 100
}

View File

@ -0,0 +1,37 @@
{
"name": "claude-code",
"description": "Claude Code CLI and development environment. Use for Claude Code features, tools, workflows, MCP integration, configuration, and AI-assisted development.",
"base_url": "https://docs.claude.com/en/docs/claude-code/",
"start_urls": [
"https://docs.claude.com/en/docs/claude-code/overview",
"https://docs.claude.com/en/docs/claude-code/quickstart",
"https://docs.claude.com/en/docs/claude-code/common-workflows",
"https://docs.claude.com/en/docs/claude-code/mcp",
"https://docs.claude.com/en/docs/claude-code/settings",
"https://docs.claude.com/en/docs/claude-code/troubleshooting",
"https://docs.claude.com/en/docs/claude-code/iam"
],
"selectors": {
"main_content": "#content-container",
"title": "h1",
"code_blocks": "pre code"
},
"url_patterns": {
"include": ["/claude-code/"],
"exclude": ["/api-reference/", "/claude-ai/", "/claude.ai/", "/prompt-engineering/", "/changelog/"]
},
"categories": {
"getting_started": ["overview", "quickstart", "installation", "setup", "terminal-config"],
"workflows": ["workflow", "common-workflows", "git", "testing", "debugging", "interactive"],
"mcp": ["mcp", "model-context-protocol"],
"configuration": ["config", "settings", "preferences", "customize", "hooks", "statusline", "model-config", "memory", "output-styles"],
"agents": ["agent", "task", "subagent", "sub-agent", "specialized"],
"skills": ["skill", "agent-skill"],
"integrations": ["ide-integrations", "vs-code", "jetbrains", "plugin", "marketplace"],
"deployment": ["bedrock", "vertex", "deployment", "network", "gateway", "devcontainer", "sandboxing", "third-party"],
"reference": ["reference", "api", "command", "cli-reference", "slash", "checkpointing", "headless", "sdk"],
"enterprise": ["iam", "security", "monitoring", "analytics", "costs", "legal", "data-usage"]
},
"rate_limit": 0.5,
"max_pages": 200
}

View File

@ -0,0 +1,34 @@
{
"name": "django",
"description": "Django web framework for Python. Use for Django models, views, templates, ORM, authentication, and web development.",
"base_url": "https://docs.djangoproject.com/en/stable/",
"start_urls": [
"https://docs.djangoproject.com/en/stable/intro/",
"https://docs.djangoproject.com/en/stable/topics/db/models/",
"https://docs.djangoproject.com/en/stable/topics/http/views/",
"https://docs.djangoproject.com/en/stable/topics/templates/",
"https://docs.djangoproject.com/en/stable/topics/forms/",
"https://docs.djangoproject.com/en/stable/topics/auth/",
"https://docs.djangoproject.com/en/stable/ref/models/"
],
"selectors": {
"main_content": "article",
"title": "h1",
"code_blocks": "pre"
},
"url_patterns": {
"include": ["/intro/", "/topics/", "/ref/", "/howto/"],
"exclude": ["/faq/", "/misc/", "/releases/"]
},
"categories": {
"getting_started": ["intro", "tutorial", "install"],
"models": ["models", "database", "orm", "queries"],
"views": ["views", "urlconf", "routing"],
"templates": ["templates", "template"],
"forms": ["forms", "form"],
"authentication": ["auth", "authentication", "user"],
"api": ["ref", "reference"]
},
"rate_limit": 0.3,
"max_pages": 500
}

View File

@ -0,0 +1,49 @@
{
"name": "django",
"description": "Complete Django framework knowledge combining official documentation and Django codebase. Use when building Django applications, understanding ORM internals, or debugging Django issues.",
"merge_mode": "rule-based",
"sources": [
{
"type": "documentation",
"base_url": "https://docs.djangoproject.com/en/stable/",
"extract_api": true,
"selectors": {
"main_content": "article",
"title": "h1",
"code_blocks": "pre"
},
"url_patterns": {
"include": [],
"exclude": ["/search/", "/genindex/"]
},
"categories": {
"getting_started": ["intro", "tutorial", "install"],
"models": ["models", "orm", "queries", "database"],
"views": ["views", "urls", "templates"],
"forms": ["forms", "modelforms"],
"admin": ["admin"],
"api": ["ref/"],
"topics": ["topics/"],
"security": ["security", "csrf", "authentication"]
},
"rate_limit": 0.5,
"max_pages": 300
},
{
"type": "github",
"repo": "django/django",
"include_issues": true,
"max_issues": 100,
"include_changelog": true,
"include_releases": true,
"include_code": true,
"code_analysis_depth": "surface",
"file_patterns": [
"django/db/**/*.py",
"django/views/**/*.py",
"django/forms/**/*.py",
"django/contrib/admin/**/*.py"
]
}
]
}

View File

@ -0,0 +1,17 @@
{
"name": "example_manual",
"description": "Example PDF documentation skill",
"pdf_path": "docs/manual.pdf",
"extract_options": {
"chunk_size": 10,
"min_quality": 5.0,
"extract_images": true,
"min_image_size": 100
},
"categories": {
"getting_started": ["introduction", "getting started", "quick start", "setup"],
"tutorial": ["tutorial", "guide", "walkthrough", "example"],
"api": ["api", "reference", "function", "class", "method"],
"advanced": ["advanced", "optimization", "performance", "best practices"]
}
}

View File

@ -0,0 +1,33 @@
{
"name": "fastapi",
"description": "FastAPI modern Python web framework. Use for building APIs, async endpoints, dependency injection, and Python backend development.",
"base_url": "https://fastapi.tiangolo.com/",
"start_urls": [
"https://fastapi.tiangolo.com/tutorial/",
"https://fastapi.tiangolo.com/tutorial/first-steps/",
"https://fastapi.tiangolo.com/tutorial/path-params/",
"https://fastapi.tiangolo.com/tutorial/body/",
"https://fastapi.tiangolo.com/tutorial/dependencies/",
"https://fastapi.tiangolo.com/advanced/",
"https://fastapi.tiangolo.com/reference/"
],
"selectors": {
"main_content": "article",
"title": "h1",
"code_blocks": "pre code"
},
"url_patterns": {
"include": ["/tutorial/", "/advanced/", "/reference/"],
"exclude": ["/help/", "/external-links/", "/deployment/"]
},
"categories": {
"getting_started": ["first-steps", "tutorial", "intro"],
"path_operations": ["path", "operations", "routing"],
"request_data": ["request", "body", "query", "parameters"],
"dependencies": ["dependencies", "injection"],
"security": ["security", "oauth", "authentication"],
"database": ["database", "sql", "orm"]
},
"rate_limit": 0.5,
"max_pages": 250
}

View File

@ -0,0 +1,45 @@
{
"name": "fastapi",
"description": "Complete FastAPI knowledge combining official documentation and FastAPI codebase. Use when building FastAPI applications, understanding async patterns, or working with Pydantic models.",
"merge_mode": "rule-based",
"sources": [
{
"type": "documentation",
"base_url": "https://fastapi.tiangolo.com/",
"extract_api": true,
"selectors": {
"main_content": "article",
"title": "h1",
"code_blocks": "pre code"
},
"url_patterns": {
"include": [],
"exclude": ["/img/", "/js/"]
},
"categories": {
"getting_started": ["tutorial", "first-steps"],
"path_operations": ["path-params", "query-params", "body"],
"dependencies": ["dependencies"],
"security": ["security", "oauth2"],
"database": ["sql-databases"],
"advanced": ["advanced", "async", "middleware"],
"deployment": ["deployment"]
},
"rate_limit": 0.5,
"max_pages": 150
},
{
"type": "github",
"repo": "tiangolo/fastapi",
"include_issues": true,
"max_issues": 100,
"include_changelog": true,
"include_releases": true,
"include_code": true,
"code_analysis_depth": "surface",
"file_patterns": [
"fastapi/**/*.py"
]
}
]
}

View File

@ -0,0 +1,41 @@
{
"name": "fastapi_test",
"description": "FastAPI test - unified scraping with limited pages",
"merge_mode": "rule-based",
"sources": [
{
"type": "documentation",
"base_url": "https://fastapi.tiangolo.com/",
"extract_api": true,
"selectors": {
"main_content": "article",
"title": "h1",
"code_blocks": "pre code"
},
"url_patterns": {
"include": [],
"exclude": ["/img/", "/js/"]
},
"categories": {
"getting_started": ["tutorial", "first-steps"],
"path_operations": ["path-params", "query-params"],
"api": ["reference"]
},
"rate_limit": 0.5,
"max_pages": 20
},
{
"type": "github",
"repo": "tiangolo/fastapi",
"include_issues": false,
"include_changelog": false,
"include_releases": true,
"include_code": true,
"code_analysis_depth": "surface",
"file_patterns": [
"fastapi/routing.py",
"fastapi/applications.py"
]
}
]
}

View File

@ -0,0 +1,63 @@
{
"name": "godot",
"description": "Godot Engine game development. Use for Godot projects, GDScript/C# coding, scene setup, node systems, 2D/3D development, physics, animation, UI, shaders, or any Godot-specific questions.",
"base_url": "https://docs.godotengine.org/en/stable/",
"start_urls": [
"https://docs.godotengine.org/en/stable/getting_started/introduction/index.html",
"https://docs.godotengine.org/en/stable/tutorials/scripting/gdscript/index.html",
"https://docs.godotengine.org/en/stable/tutorials/2d/index.html",
"https://docs.godotengine.org/en/stable/tutorials/3d/index.html",
"https://docs.godotengine.org/en/stable/tutorials/physics/index.html",
"https://docs.godotengine.org/en/stable/tutorials/animation/index.html",
"https://docs.godotengine.org/en/stable/classes/index.html"
],
"selectors": {
"main_content": "div[role='main']",
"title": "title",
"code_blocks": "pre"
},
"url_patterns": {
"include": [
"/getting_started/",
"/tutorials/",
"/classes/"
],
"exclude": [
"/genindex.html",
"/search.html",
"/_static/",
"/_sources/"
]
},
"categories": {
"getting_started": ["introduction", "getting_started", "first", "your_first"],
"scripting": ["scripting", "gdscript", "c#", "csharp"],
"2d": ["/2d/", "sprite", "canvas", "tilemap"],
"3d": ["/3d/", "spatial", "mesh", "3d_"],
"physics": ["physics", "collision", "rigidbody", "characterbody"],
"animation": ["animation", "tween", "animationplayer"],
"ui": ["ui", "control", "gui", "theme"],
"shaders": ["shader", "material", "visual_shader"],
"audio": ["audio", "sound"],
"networking": ["networking", "multiplayer", "rpc"],
"export": ["export", "platform", "deploy"]
},
"rate_limit": 0.5,
"max_pages": 40000,
"_comment": "=== NEW: Split Strategy Configuration ===",
"split_strategy": "router",
"split_config": {
"target_pages_per_skill": 5000,
"create_router": true,
"split_by_categories": ["scripting", "2d", "3d", "physics", "shaders"],
"router_name": "godot",
"parallel_scraping": true
},
"_comment2": "=== NEW: Checkpoint Configuration ===",
"checkpoint": {
"enabled": true,
"interval": 1000
}
}

View File

@ -0,0 +1,47 @@
{
"name": "godot",
"description": "Godot Engine game development. Use for Godot projects, GDScript/C# coding, scene setup, node systems, 2D/3D development, physics, animation, UI, shaders, or any Godot-specific questions.",
"base_url": "https://docs.godotengine.org/en/stable/",
"start_urls": [
"https://docs.godotengine.org/en/stable/getting_started/introduction/index.html",
"https://docs.godotengine.org/en/stable/tutorials/scripting/gdscript/index.html",
"https://docs.godotengine.org/en/stable/tutorials/2d/index.html",
"https://docs.godotengine.org/en/stable/tutorials/3d/index.html",
"https://docs.godotengine.org/en/stable/tutorials/physics/index.html",
"https://docs.godotengine.org/en/stable/tutorials/animation/index.html",
"https://docs.godotengine.org/en/stable/classes/index.html"
],
"selectors": {
"main_content": "div[role='main']",
"title": "title",
"code_blocks": "pre"
},
"url_patterns": {
"include": [
"/getting_started/",
"/tutorials/",
"/classes/"
],
"exclude": [
"/genindex.html",
"/search.html",
"/_static/",
"/_sources/"
]
},
"categories": {
"getting_started": ["introduction", "getting_started", "first", "your_first"],
"scripting": ["scripting", "gdscript", "c#", "csharp"],
"2d": ["/2d/", "sprite", "canvas", "tilemap"],
"3d": ["/3d/", "spatial", "mesh", "3d_"],
"physics": ["physics", "collision", "rigidbody", "characterbody"],
"animation": ["animation", "tween", "animationplayer"],
"ui": ["ui", "control", "gui", "theme"],
"shaders": ["shader", "material", "visual_shader"],
"audio": ["audio", "sound"],
"networking": ["networking", "multiplayer", "rpc"],
"export": ["export", "platform", "deploy"]
},
"rate_limit": 0.5,
"max_pages": 500
}

View File

@ -0,0 +1,19 @@
{
"name": "godot",
"repo": "godotengine/godot",
"description": "Godot Engine - Multi-platform 2D and 3D game engine",
"github_token": null,
"include_issues": true,
"max_issues": 100,
"include_changelog": true,
"include_releases": true,
"include_code": false,
"file_patterns": [
"core/**/*.h",
"core/**/*.cpp",
"scene/**/*.h",
"scene/**/*.cpp",
"servers/**/*.h",
"servers/**/*.cpp"
]
}

View File

@ -0,0 +1,50 @@
{
"name": "godot",
"description": "Complete Godot Engine knowledge base combining official documentation and source code analysis",
"merge_mode": "claude-enhanced",
"sources": [
{
"type": "documentation",
"base_url": "https://docs.godotengine.org/en/stable/",
"extract_api": true,
"selectors": {
"main_content": "div[role='main']",
"title": "title",
"code_blocks": "pre"
},
"url_patterns": {
"include": [],
"exclude": ["/search.html", "/_static/", "/_images/"]
},
"categories": {
"getting_started": ["introduction", "getting_started", "step_by_step"],
"scripting": ["scripting", "gdscript", "c_sharp"],
"2d": ["2d", "canvas", "sprite", "animation"],
"3d": ["3d", "spatial", "mesh", "shader"],
"physics": ["physics", "collision", "rigidbody"],
"api": ["api", "class", "reference", "method"]
},
"rate_limit": 0.5,
"max_pages": 500
},
{
"type": "github",
"repo": "godotengine/godot",
"github_token": null,
"code_analysis_depth": "deep",
"include_code": true,
"include_issues": true,
"max_issues": 100,
"include_changelog": true,
"include_releases": true,
"file_patterns": [
"core/**/*.h",
"core/**/*.cpp",
"scene/**/*.h",
"scene/**/*.cpp",
"servers/**/*.h",
"servers/**/*.cpp"
]
}
]
}

View File

@ -0,0 +1,18 @@
{
"name": "hono",
"description": "Hono web application framework for building fast, lightweight APIs. Use for Hono routing, middleware, context handling, and modern JavaScript/TypeScript web development.",
"llms_txt_url": "https://hono.dev/llms-full.txt",
"base_url": "https://hono.dev/docs",
"selectors": {
"main_content": "article",
"title": "h1",
"code_blocks": "pre code"
},
"url_patterns": {
"include": [],
"exclude": []
},
"categories": {},
"rate_limit": 0.5,
"max_pages": 50
}

View File

@ -0,0 +1,48 @@
{
"name": "kubernetes",
"description": "Kubernetes container orchestration platform. Use for K8s clusters, deployments, pods, services, networking, storage, configuration, and DevOps tasks.",
"base_url": "https://kubernetes.io/docs/",
"start_urls": [
"https://kubernetes.io/docs/home/",
"https://kubernetes.io/docs/concepts/",
"https://kubernetes.io/docs/tasks/",
"https://kubernetes.io/docs/tutorials/",
"https://kubernetes.io/docs/reference/"
],
"selectors": {
"main_content": "main",
"title": "h1",
"code_blocks": "pre code"
},
"url_patterns": {
"include": [
"/docs/concepts/",
"/docs/tasks/",
"/docs/tutorials/",
"/docs/reference/",
"/docs/setup/"
],
"exclude": [
"/search/",
"/blog/",
"/training/",
"/partners/",
"/community/",
"/_print/",
"/case-studies/"
]
},
"categories": {
"getting_started": ["getting-started", "setup", "learning-environment"],
"concepts": ["concepts", "overview", "architecture"],
"workloads": ["workloads", "pods", "deployments", "replicaset", "statefulset", "daemonset"],
"services": ["services", "networking", "ingress", "service"],
"storage": ["storage", "volumes", "persistent"],
"configuration": ["configuration", "configmap", "secret"],
"security": ["security", "rbac", "policies", "authentication"],
"tasks": ["tasks", "administer", "configure"],
"tutorials": ["tutorials", "stateless", "stateful"]
},
"rate_limit": 0.5,
"max_pages": 1000
}

View File

@ -0,0 +1,34 @@
{
"name": "laravel",
"description": "Laravel PHP web framework. Use for Laravel models, routes, controllers, Blade templates, Eloquent ORM, authentication, and PHP web development.",
"base_url": "https://laravel.com/docs/9.x/",
"start_urls": [
"https://laravel.com/docs/9.x/installation",
"https://laravel.com/docs/9.x/routing",
"https://laravel.com/docs/9.x/controllers",
"https://laravel.com/docs/9.x/views",
"https://laravel.com/docs/9.x/blade",
"https://laravel.com/docs/9.x/eloquent",
"https://laravel.com/docs/9.x/migrations",
"https://laravel.com/docs/9.x/authentication"
],
"selectors": {
"main_content": "#main-content",
"title": "h1",
"code_blocks": "pre"
},
"url_patterns": {
"include": ["/docs/9.x/", "/docs/10.x/", "/docs/11.x/"],
"exclude": ["/api/", "/packages/"]
},
"categories": {
"getting_started": ["installation", "configuration", "structure", "deployment"],
"routing": ["routing", "middleware", "controllers"],
"views": ["views", "blade", "templates"],
"models": ["eloquent", "database", "migrations", "seeding", "queries"],
"authentication": ["authentication", "authorization", "passwords"],
"api": ["api", "resources", "requests", "responses"]
},
"rate_limit": 0.3,
"max_pages": 500
}

View File

@ -0,0 +1,17 @@
{
"name": "python-tutorial-test",
"description": "Python tutorial for testing MCP tools",
"base_url": "https://docs.python.org/3/tutorial/",
"selectors": {
"main_content": "article",
"title": "h1",
"code_blocks": "pre code"
},
"url_patterns": {
"include": [],
"exclude": []
},
"categories": {},
"rate_limit": 0.3,
"max_pages": 10
}

View File

@ -0,0 +1,31 @@
{
"name": "react",
"description": "React framework for building user interfaces. Use for React components, hooks, state management, JSX, and modern frontend development.",
"base_url": "https://react.dev/",
"start_urls": [
"https://react.dev/learn",
"https://react.dev/learn/quick-start",
"https://react.dev/learn/thinking-in-react",
"https://react.dev/reference/react",
"https://react.dev/reference/react-dom",
"https://react.dev/reference/react/hooks"
],
"selectors": {
"main_content": "article",
"title": "h1",
"code_blocks": "pre code"
},
"url_patterns": {
"include": ["/learn", "/reference"],
"exclude": ["/community", "/blog"]
},
"categories": {
"getting_started": ["quick-start", "installation", "tutorial"],
"hooks": ["usestate", "useeffect", "usememo", "usecallback", "usecontext", "useref", "hook"],
"components": ["component", "props", "jsx"],
"state": ["state", "context", "reducer"],
"api": ["api", "reference"]
},
"rate_limit": 0.5,
"max_pages": 300
}

View File

@ -0,0 +1,15 @@
{
"name": "react",
"repo": "facebook/react",
"description": "React JavaScript library for building user interfaces",
"github_token": null,
"include_issues": true,
"max_issues": 100,
"include_changelog": true,
"include_releases": true,
"include_code": false,
"file_patterns": [
"packages/**/*.js",
"packages/**/*.ts"
]
}

View File

@ -0,0 +1,44 @@
{
"name": "react",
"description": "Complete React knowledge base combining official documentation and React codebase insights. Use when working with React, understanding API changes, or debugging React internals.",
"merge_mode": "rule-based",
"sources": [
{
"type": "documentation",
"base_url": "https://react.dev/",
"extract_api": true,
"selectors": {
"main_content": "article",
"title": "h1",
"code_blocks": "pre code"
},
"url_patterns": {
"include": [],
"exclude": ["/blog/", "/community/"]
},
"categories": {
"getting_started": ["learn", "installation", "quick-start"],
"components": ["components", "props", "state"],
"hooks": ["hooks", "usestate", "useeffect", "usecontext"],
"api": ["api", "reference"],
"advanced": ["context", "refs", "portals", "suspense"]
},
"rate_limit": 0.5,
"max_pages": 200
},
{
"type": "github",
"repo": "facebook/react",
"include_issues": true,
"max_issues": 100,
"include_changelog": true,
"include_releases": true,
"include_code": true,
"code_analysis_depth": "surface",
"file_patterns": [
"packages/react/src/**/*.js",
"packages/react-dom/src/**/*.js"
]
}
]
}

View File

@ -0,0 +1,108 @@
{
"name": "steam-economy-complete",
"description": "Complete Steam Economy system including inventory, microtransactions, trading, and monetization. Use for ISteamInventory API, ISteamEconomy API, IInventoryService Web API, Steam Wallet integration, in-app purchases, item definitions, trading, crafting, market integration, and all economy features for game developers.",
"base_url": "https://partner.steamgames.com/doc/",
"start_urls": [
"https://partner.steamgames.com/doc/features/inventory",
"https://partner.steamgames.com/doc/features/microtransactions",
"https://partner.steamgames.com/doc/features/microtransactions/implementation",
"https://partner.steamgames.com/doc/api/ISteamInventory",
"https://partner.steamgames.com/doc/webapi/ISteamEconomy",
"https://partner.steamgames.com/doc/webapi/IInventoryService",
"https://partner.steamgames.com/doc/features/inventory/economy"
],
"selectors": {
"main_content": "div.documentation_bbcode",
"title": "div.docPageTitle",
"code_blocks": "div.bb_code"
},
"url_patterns": {
"include": [
"/features/inventory",
"/features/microtransactions",
"/api/ISteamInventory",
"/webapi/ISteamEconomy",
"/webapi/IInventoryService"
],
"exclude": [
"/home",
"/sales",
"/marketing",
"/legal",
"/finance",
"/login",
"/search",
"/steamworks/apps",
"/steamworks/partner"
]
},
"categories": {
"getting_started": [
"overview",
"getting started",
"introduction",
"quickstart",
"setup"
],
"inventory_system": [
"inventory",
"item definition",
"item schema",
"item properties",
"itemdefs",
"ISteamInventory"
],
"microtransactions": [
"microtransaction",
"purchase",
"payment",
"checkout",
"wallet",
"transaction"
],
"economy_api": [
"ISteamEconomy",
"economy",
"asset",
"context"
],
"inventory_webapi": [
"IInventoryService",
"webapi",
"web api",
"http"
],
"trading": [
"trading",
"trade",
"exchange",
"market"
],
"crafting": [
"crafting",
"recipe",
"combine",
"exchange"
],
"pricing": [
"pricing",
"price",
"cost",
"currency"
],
"implementation": [
"integration",
"implementation",
"configure",
"best practices"
],
"examples": [
"example",
"sample",
"tutorial",
"walkthrough"
]
},
"rate_limit": 0.7,
"max_pages": 1000
}

View File

@ -0,0 +1,30 @@
{
"name": "tailwind",
"description": "Tailwind CSS utility-first framework for rapid UI development. Use for Tailwind utilities, responsive design, custom configurations, and modern CSS workflows.",
"base_url": "https://tailwindcss.com/docs",
"start_urls": [
"https://tailwindcss.com/docs/installation",
"https://tailwindcss.com/docs/utility-first",
"https://tailwindcss.com/docs/responsive-design",
"https://tailwindcss.com/docs/hover-focus-and-other-states"
],
"selectors": {
"main_content": "div.prose",
"title": "h1",
"code_blocks": "pre code"
},
"url_patterns": {
"include": ["/docs"],
"exclude": ["/blog", "/resources"]
},
"categories": {
"getting_started": ["installation", "editor-setup", "intellisense"],
"core_concepts": ["utility-first", "responsive", "hover-focus", "dark-mode"],
"layout": ["container", "columns", "flex", "grid"],
"typography": ["font-family", "font-size", "text-align", "text-color"],
"backgrounds": ["background-color", "background-image", "gradient"],
"customization": ["configuration", "theme", "plugins"]
},
"rate_limit": 0.5,
"max_pages": 100
}

View File

@ -0,0 +1,17 @@
{
"name": "test-manual",
"description": "Manual test config",
"base_url": "https://test.example.com/",
"selectors": {
"main_content": "article",
"title": "h1",
"code_blocks": "pre code"
},
"url_patterns": {
"include": [],
"exclude": []
},
"categories": {},
"rate_limit": 0.5,
"max_pages": 50
}

View File

@ -0,0 +1,31 @@
{
"name": "vue",
"description": "Vue.js progressive JavaScript framework. Use for Vue components, reactivity, composition API, and frontend development.",
"base_url": "https://vuejs.org/",
"start_urls": [
"https://vuejs.org/guide/introduction.html",
"https://vuejs.org/guide/quick-start.html",
"https://vuejs.org/guide/essentials/application.html",
"https://vuejs.org/guide/components/registration.html",
"https://vuejs.org/guide/reusability/composables.html",
"https://vuejs.org/api/"
],
"selectors": {
"main_content": "main",
"title": "h1",
"code_blocks": "pre code"
},
"url_patterns": {
"include": ["/guide/", "/api/", "/examples/"],
"exclude": ["/about/", "/sponsor/", "/partners/"]
},
"categories": {
"getting_started": ["quick-start", "introduction", "essentials"],
"components": ["component", "props", "events"],
"reactivity": ["reactivity", "reactive", "ref", "computed"],
"composition_api": ["composition", "setup"],
"api": ["api", "reference"]
},
"rate_limit": 0.5,
"max_pages": 200
}

View File

@ -0,0 +1,195 @@
#!/usr/bin/env python3
"""
Demo: Conflict Detection and Reporting
This demonstrates the unified scraper's ability to detect and report
conflicts between documentation and code implementation.
"""
import sys
import json
from pathlib import Path
# Add CLI to path
sys.path.insert(0, str(Path(__file__).parent / 'cli'))
from conflict_detector import ConflictDetector
print("=" * 70)
print("UNIFIED SCRAPER - CONFLICT DETECTION DEMO")
print("=" * 70)
print()
# Load test data
print("📂 Loading test data...")
print(" - Documentation APIs from example docs")
print(" - Code APIs from example repository")
print()
with open('cli/conflicts.json', 'r') as f:
    conflicts_data = json.load(f)
conflicts = conflicts_data['conflicts']
summary = conflicts_data['summary']
print(f"✅ Loaded {summary['total']} conflicts")
print()
# Display summary
print("=" * 70)
print("CONFLICT SUMMARY")
print("=" * 70)
print()
print(f"📊 **Total Conflicts**: {summary['total']}")
print()
print("**By Type:**")
for conflict_type, count in summary['by_type'].items():
    if count > 0:
        emoji = "📖" if conflict_type == "missing_in_docs" else "💻" if conflict_type == "missing_in_code" else "⚠️"
        print(f" {emoji} {conflict_type}: {count}")
print()
print("**By Severity:**")
for severity, count in summary['by_severity'].items():
    if count > 0:
        emoji = "🔴" if severity == "high" else "🟡" if severity == "medium" else "🟢"
        print(f" {emoji} {severity.upper()}: {count}")
print()
# Display detailed conflicts
print("=" * 70)
print("DETAILED CONFLICT REPORTS")
print("=" * 70)
print()
# Group by severity
high = [c for c in conflicts if c['severity'] == 'high']
medium = [c for c in conflicts if c['severity'] == 'medium']
low = [c for c in conflicts if c['severity'] == 'low']
# Show high severity first
if high:
    print("🔴 **HIGH SEVERITY CONFLICTS** (Requires immediate attention)")
    print("-" * 70)
    for conflict in high:
        print()
        print(f"**API**: `{conflict['api_name']}`")
        print(f"**Type**: {conflict['type']}")
        print(f"**Issue**: {conflict['difference']}")
        print(f"**Suggestion**: {conflict['suggestion']}")
        if conflict['docs_info']:
            print(f"\n**Documented as**:")
            print(f" Signature: {conflict['docs_info'].get('raw_signature', 'N/A')}")
        if conflict['code_info']:
            print(f"\n**Implemented as**:")
            params = conflict['code_info'].get('parameters', [])
            param_str = ', '.join(f"{p['name']}: {p.get('type_hint', 'Any')}" for p in params if p['name'] != 'self')
            print(f" Signature: {conflict['code_info']['name']}({param_str})")
            print(f" Return type: {conflict['code_info'].get('return_type', 'None')}")
            print(f" Location: {conflict['code_info'].get('source', 'N/A')}:{conflict['code_info'].get('line', '?')}")
    print()
# Show medium severity
if medium:
    print("🟡 **MEDIUM SEVERITY CONFLICTS** (Review recommended)")
    print("-" * 70)
    for conflict in medium[:3]:  # Show first 3
        print()
        print(f"**API**: `{conflict['api_name']}`")
        print(f"**Type**: {conflict['type']}")
        print(f"**Issue**: {conflict['difference']}")
        if conflict['code_info']:
            print(f"**Location**: {conflict['code_info'].get('source', 'N/A')}")
    if len(medium) > 3:
        print(f"\n ... and {len(medium) - 3} more medium severity conflicts")
    print()
# Example: How conflicts appear in final skill
print("=" * 70)
print("HOW CONFLICTS APPEAR IN SKILL.MD")
print("=" * 70)
print()
example_conflict = high[0] if high else medium[0] if medium else conflicts[0]
print("```markdown")
print("## 🔧 API Reference")
print()
print("### ⚠️ APIs with Conflicts")
print()
print(f"#### `{example_conflict['api_name']}`")
print()
print(f"⚠️ **Conflict**: {example_conflict['difference']}")
print()
if example_conflict.get('docs_info'):
    print("**Documentation says:**")
    print("```")
    print(example_conflict['docs_info'].get('raw_signature', 'N/A'))
    print("```")
    print()
if example_conflict.get('code_info'):
    print("**Code implementation:**")
    print("```python")
    params = example_conflict['code_info'].get('parameters', [])
    param_strs = []
    for p in params:
        if p['name'] == 'self':
            continue
        param_str = p['name']
        if p.get('type_hint'):
            param_str += f": {p['type_hint']}"
        if p.get('default'):
            param_str += f" = {p['default']}"
        param_strs.append(param_str)
    sig = f"def {example_conflict['code_info']['name']}({', '.join(param_strs)})"
    if example_conflict['code_info'].get('return_type'):
        sig += f" -> {example_conflict['code_info']['return_type']}"
    print(sig)
    print("```")
    print()
print("*Source: both (conflict)*")
print("```")
print()
# Key takeaways
print("=" * 70)
print("KEY TAKEAWAYS")
print("=" * 70)
print()
print("✅ **What the Unified Scraper Does:**")
print(" 1. Extracts APIs from both documentation and code")
print(" 2. Compares them to detect discrepancies")
print(" 3. Classifies conflicts by type and severity")
print(" 4. Provides actionable suggestions")
print(" 5. Shows both versions transparently in the skill")
print()
print("⚠️ **Common Conflict Types:**")
print(" - **Missing in docs**: Undocumented features in code")
print(" - **Missing in code**: Documented but not implemented")
print(" - **Signature mismatch**: Different parameters/types")
print(" - **Description mismatch**: Different explanations")
print()
print("🎯 **Value:**")
print(" - Identifies documentation gaps")
print(" - Catches outdated documentation")
print(" - Highlights implementation differences")
print(" - Creates single source of truth showing reality")
print()
print("=" * 70)
print("END OF DEMO")
print("=" * 70)

View File

@ -0,0 +1,400 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Overview
This is a Python-based documentation scraper that converts ANY documentation website into a Claude skill. It's a single-file tool (`doc_scraper.py`) that scrapes documentation, extracts code patterns, detects programming languages, and generates structured skill files ready for use with Claude.
## Dependencies
```bash
pip3 install requests beautifulsoup4
```
## Core Commands
### Run with a preset configuration
```bash
python3 cli/doc_scraper.py --config configs/godot.json
python3 cli/doc_scraper.py --config configs/react.json
python3 cli/doc_scraper.py --config configs/vue.json
python3 cli/doc_scraper.py --config configs/django.json
python3 cli/doc_scraper.py --config configs/fastapi.json
```
### Interactive mode (for new frameworks)
```bash
python3 cli/doc_scraper.py --interactive
```
### Quick mode (minimal config)
```bash
python3 cli/doc_scraper.py --name react --url https://react.dev/ --description "React framework"
```
### Skip scraping (use cached data)
```bash
python3 cli/doc_scraper.py --config configs/godot.json --skip-scrape
```
### Resume interrupted scrapes
```bash
# If scrape was interrupted
python3 cli/doc_scraper.py --config configs/godot.json --resume
# Start fresh (clear checkpoint)
python3 cli/doc_scraper.py --config configs/godot.json --fresh
```
### Large documentation (10K-40K+ pages)
```bash
# 1. Estimate page count
python3 cli/estimate_pages.py configs/godot.json
# 2. Split into focused sub-skills
python3 cli/split_config.py configs/godot.json --strategy router
# 3. Generate router skill
python3 cli/generate_router.py configs/godot-*.json
# 4. Package multiple skills
python3 cli/package_multi.py output/godot*/
```
### AI-powered SKILL.md enhancement
```bash
# Option 1: During scraping (API-based, requires ANTHROPIC_API_KEY)
pip3 install anthropic
export ANTHROPIC_API_KEY=sk-ant-...
python3 cli/doc_scraper.py --config configs/react.json --enhance
# Option 2: During scraping (LOCAL, no API key - uses Claude Code Max)
python3 cli/doc_scraper.py --config configs/react.json --enhance-local
# Option 3: Standalone after scraping (API-based)
python3 cli/enhance_skill.py output/react/
# Option 4: Standalone after scraping (LOCAL, no API key)
python3 cli/enhance_skill_local.py output/react/
```
The LOCAL enhancement option (`--enhance-local` or `enhance_skill_local.py`) opens a new terminal with Claude Code, which analyzes the reference files and enhances SKILL.md automatically. This requires a Claude Code Max plan but no API key.
### MCP Integration (Claude Code)
```bash
# One-time setup
./setup_mcp.sh
# Then in Claude Code, use natural language:
"List all available configs"
"Generate config for Tailwind at https://tailwindcss.com/docs"
"Split configs/godot.json using router strategy"
"Generate router for configs/godot-*.json"
"Package skill at output/react/"
```
9 MCP tools available: list_configs, generate_config, validate_config, estimate_pages, scrape_docs, package_skill, upload_skill, split_config, generate_router
### Test with limited pages (edit config first)
Set `"max_pages": 20` in the config file to test with fewer pages.
## Architecture
### Single-File Design
The entire tool is contained in `doc_scraper.py` (~737 lines). It follows a class-based architecture with a single `DocToSkillConverter` class that handles:
- **Web scraping**: BFS traversal with URL validation
- **Content extraction**: CSS selectors for title, content, code blocks
- **Language detection**: Heuristic-based detection from code samples (Python, JavaScript, GDScript, C++, etc.)
- **Pattern extraction**: Identifies common coding patterns from documentation
- **Categorization**: Smart categorization using URL structure, page titles, and content keywords with scoring
- **Skill generation**: Creates SKILL.md with real code examples and categorized reference files
### Data Flow
1. **Scrape Phase**:
- Input: Config JSON (name, base_url, selectors, url_patterns, categories, rate_limit, max_pages)
- Process: BFS traversal starting from base_url, respecting include/exclude patterns
- Output: `output/{name}_data/pages/*.json` + `summary.json`
2. **Build Phase**:
- Input: Scraped JSON data from `output/{name}_data/`
- Process: Load pages → Smart categorize → Extract patterns → Generate references
- Output: `output/{name}/SKILL.md` + `output/{name}/references/*.md`
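The scrape phase above can be sketched as a breadth-first traversal. This is a minimal illustration, not the code in `doc_scraper.py`: `bfs_crawl`, `get_links`, `is_valid`, and the in-memory `SITE` graph are hypothetical names that stand in for real HTTP fetching and link extraction.

```python
from collections import deque
from typing import Callable, Iterable


def bfs_crawl(start_url: str,
              get_links: Callable[[str], Iterable[str]],
              is_valid: Callable[[str], bool],
              max_pages: int) -> list[str]:
    """Visit pages breadth-first until the queue is empty or max_pages is reached."""
    visited: list[str] = []
    seen = {start_url}          # URLs already queued, to avoid revisiting
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen and is_valid(link):
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical link graph standing in for a real documentation site.
SITE = {
    "https://docs.example.com/": [
        "https://docs.example.com/guide",
        "https://docs.example.com/blog",
    ],
    "https://docs.example.com/guide": [
        "https://docs.example.com/api",
        "https://docs.example.com/",
    ],
    "https://docs.example.com/api": [],
    "https://docs.example.com/blog": [],
}

order = bfs_crawl(
    "https://docs.example.com/",
    get_links=lambda u: SITE.get(u, []),
    is_valid=lambda u: "/blog" not in u,   # plays the role of an exclude pattern
    max_pages=10,
)
print(order)
```

Passing the link extractor and validity check as callables keeps the traversal itself free of network code, so it can be exercised without hitting a live site.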
### Directory Structure
```
Skill_Seekers/
├── cli/ # CLI tools
│ ├── doc_scraper.py # Main scraping & building tool
│ ├── enhance_skill.py # AI enhancement (API-based)
│ ├── enhance_skill_local.py # AI enhancement (LOCAL, no API)
│ ├── estimate_pages.py # Page count estimator
│ ├── split_config.py # Large docs splitter (NEW)
│ ├── generate_router.py # Router skill generator (NEW)
│ ├── package_skill.py # Single skill packager
│ └── package_multi.py # Multi-skill packager (NEW)
├── mcp/ # MCP server
│ ├── server.py # 9 MCP tools (includes upload)
│ └── README.md
├── configs/ # Preset configurations
│ ├── godot.json
│ ├── godot-large-example.json # Large docs example (NEW)
│ ├── react.json
│ └── ...
├── docs/ # Documentation
│ ├── CLAUDE.md # Technical architecture (this file)
│ ├── LARGE_DOCUMENTATION.md # Large docs guide (NEW)
│ ├── ENHANCEMENT.md
│ ├── MCP_SETUP.md
│ └── ...
└── output/ # Generated output (git-ignored)
├── {name}_data/ # Raw scraped data (cached)
│ ├── pages/ # Individual page JSONs
│ ├── summary.json # Scraping summary
│ └── checkpoint.json # Resume checkpoint (NEW)
└── {name}/ # Generated skill
├── SKILL.md # Main skill file with examples
├── SKILL.md.backup # Backup (if enhanced)
├── references/ # Categorized documentation
│ ├── index.md
│ ├── getting_started.md
│ ├── api.md
│ └── ...
├── scripts/ # Empty (for user scripts)
└── assets/ # Empty (for user assets)
```
### Configuration Format
Config files in `configs/*.json` contain:
- `name`: Skill identifier (e.g., "godot", "react")
- `description`: When to use this skill
- `base_url`: Starting URL for scraping
- `selectors`: CSS selectors for content extraction
- `main_content`: Main documentation content (e.g., "article", "div[role='main']")
- `title`: Page title selector
- `code_blocks`: Code sample selector (e.g., "pre code", "pre")
- `url_patterns`: URL filtering
- `include`: Only scrape URLs containing these patterns
- `exclude`: Skip URLs containing these patterns
- `categories`: Keyword-based categorization mapping
- `rate_limit`: Delay between requests (seconds)
- `max_pages`: Maximum pages to scrape
- `split_strategy`: (Optional) How to split large docs: "auto", "category", "router", "size"
- `split_config`: (Optional) Split configuration
- `target_pages_per_skill`: Pages per sub-skill (default: 5000)
- `create_router`: Create router/hub skill (default: true)
- `split_by_categories`: Category names to split by
- `checkpoint`: (Optional) Checkpoint/resume configuration
- `enabled`: Enable checkpointing (default: false)
- `interval`: Save every N pages (default: 1000)
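As an illustration of the `url_patterns` semantics described above, the filter can be sketched as plain substring matching. This is an assumption about the behaviour, not the tool's actual `is_valid_url()`; in particular, treating an empty `include` list as "allow everything" is inferred from presets that ship with `"include": []`. The function name `is_allowed` is hypothetical.

```python
def is_allowed(url: str, include: list[str], exclude: list[str]) -> bool:
    """Return True if the URL passes the include/exclude substring filters."""
    # Assumption: an empty include list means no include restriction.
    if include and not any(pattern in url for pattern in include):
        return False
    # Any exclude match rejects the URL, even if an include pattern matched.
    return not any(pattern in url for pattern in exclude)


# Patterns taken from the vue preset.
include = ["/guide/", "/api/", "/examples/"]
exclude = ["/about/", "/sponsor/", "/partners/"]

print(is_allowed("https://vuejs.org/guide/introduction.html", include, exclude))
print(is_allowed("https://vuejs.org/about/team.html", include, exclude))
```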
### Key Features
**Auto-detect existing data**: Tool checks for `output/{name}_data/` and prompts to reuse, avoiding re-scraping.
**Language detection**: Detects code languages from:
1. CSS class attributes (`language-*`, `lang-*`)
2. Heuristics (keywords like `def`, `const`, `func`, etc.)
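The two-step detection order above might look roughly like the sketch below. The keyword table and the name `guess_language` are hypothetical and much smaller than whatever the real `detect_language()` uses; the point is only that CSS class hints are checked before keyword heuristics.

```python
import re

# Hypothetical keyword hints; the real heuristic table is not shown in this document.
KEYWORD_HINTS = [
    ("python", ("def ", "import ", "elif ")),
    ("javascript", ("const ", "let ", "=> ")),
    ("gdscript", ("func ", "extends ")),
]


def guess_language(css_classes: list[str], code: str) -> str:
    """Prefer an explicit language-*/lang-* class, then fall back to keywords."""
    # Step 1: CSS class attributes such as "language-python" or "lang-js".
    for css_class in css_classes:
        match = re.match(r"(?:language|lang)-(.+)", css_class)
        if match:
            return match.group(1)
    # Step 2: keyword heuristics on the code text itself.
    for language, hints in KEYWORD_HINTS:
        if any(hint in code for hint in hints):
            return language
    return "unknown"


print(guess_language(["highlight", "language-cpp"], "int main() {}"))
print(guess_language([], "def greet():\n    pass"))
```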
**Pattern extraction**: Looks for "Example:", "Pattern:", "Usage:" markers in content and extracts following code blocks (up to 5 per page).
**Smart categorization**:
- Scores pages against category keywords (3 points for URL match, 2 for title, 1 for content)
- Threshold of 2+ for categorization
- Auto-infers categories from URL segments if none provided
- Falls back to "other" category
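The scoring rules above can be sketched as follows. The weights (3/2/1) and the threshold of 2 come from the list; everything else is an assumption made for illustration: case-insensitive substring matching, a single winning category per page, and the function name `categorize`.

```python
def categorize(url: str, title: str, content: str,
               categories: dict[str, list[str]], threshold: int = 2) -> str:
    """Pick the highest-scoring category, or "other" if none reaches the threshold."""
    url_l, title_l, content_l = url.lower(), title.lower(), content.lower()
    best_name, best_score = "other", 0
    for name, keywords in categories.items():
        score = 0
        for keyword in keywords:
            keyword = keyword.lower()
            if keyword in url_l:
                score += 3      # URL match
            if keyword in title_l:
                score += 2      # title match
            if keyword in content_l:
                score += 1      # content match
        if score >= threshold and score > best_score:
            best_name, best_score = name, score
    return best_name


# Categories taken from the vue preset.
categories = {
    "reactivity": ["reactivity", "reactive", "ref", "computed"],
    "components": ["component", "props", "events"],
}

print(categorize(
    "https://vuejs.org/guide/essentials/reactivity-fundamentals.html",
    "Reactivity Fundamentals",
    "Use ref() to declare state.",
    categories,
))
```

A page whose only signal is a single content keyword scores 1, stays below the threshold, and lands in "other".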
**Enhanced SKILL.md**: Generated with:
- Real code examples from documentation (language-annotated)
- Quick reference patterns extracted from docs
- Common pattern section
- Category file listings
**AI-Powered Enhancement**: Two scripts to dramatically improve SKILL.md quality:
- `enhance_skill.py`: Uses Anthropic API (~$0.15-$0.30 per skill, requires API key)
- `enhance_skill_local.py`: Uses Claude Code Max (free, no API key needed)
- Transforms generic 75-line templates into comprehensive 500+ line guides
- Extracts best examples, explains key concepts, adds navigation guidance
- Quality rating: 9/10 (based on the steam-economy test)
**Large Documentation Support (NEW)**: Handle 10K-40K+ page documentation:
- `split_config.py`: Split large configs into multiple focused sub-skills
- `generate_router.py`: Create intelligent router/hub skills that direct queries
- `package_multi.py`: Package multiple skills at once
- 4 split strategies: auto, category, router, size
- Parallel scraping support for faster processing
- MCP integration for natural language usage
**Checkpoint/Resume (NEW)**: Never lose progress on long scrapes:
- Auto-saves every N pages (configurable, default: 1000)
- Resume with `--resume` flag
- Clear checkpoint with `--fresh` flag
- Saves on interruption (Ctrl+C)
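A rough sketch of the checkpoint behaviour listed above: save every N pages, save again on interruption, and resume from the saved state. The checkpoint file layout (`visited` / `pending`), the function names, and the `fail_at` parameter used to simulate Ctrl+C are all hypothetical; the real `checkpoint.json` format is not described in this document.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, visited: list[str], pending: list[str]) -> None:
    path.write_text(json.dumps({"visited": visited, "pending": pending}))


def load_checkpoint(path: Path) -> tuple[list[str], list[str]]:
    if not path.exists():
        return [], []
    state = json.loads(path.read_text())
    return state["visited"], state["pending"]


def crawl(urls: list[str], checkpoint: Path, interval: int, fail_at=None) -> list[str]:
    """Process URLs in order, checkpointing every `interval` pages and on interrupt."""
    visited, pending = load_checkpoint(checkpoint)
    if not visited and not pending:
        pending = list(urls)            # fresh start: nothing to resume
    try:
        while pending:
            if fail_at is not None and len(visited) == fail_at:
                raise KeyboardInterrupt  # simulated Ctrl+C
            visited.append(pending.pop(0))
            if len(visited) % interval == 0:
                save_checkpoint(checkpoint, visited, pending)
    except KeyboardInterrupt:
        save_checkpoint(checkpoint, visited, pending)
    return visited


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoint.json"
    pages = [f"page-{i}" for i in range(5)]
    first = crawl(pages, ckpt, interval=2, fail_at=3)  # interrupted after 3 pages
    resumed = crawl(pages, ckpt, interval=2)           # continues with the rest
print(first, resumed)
```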
## Key Code Locations
- **URL validation**: `is_valid_url()` doc_scraper.py:47-62
- **Content extraction**: `extract_content()` doc_scraper.py:64-131
- **Language detection**: `detect_language()` doc_scraper.py:133-163
- **Pattern extraction**: `extract_patterns()` doc_scraper.py:165-181
- **Smart categorization**: `smart_categorize()` doc_scraper.py:280-321
- **Category inference**: `infer_categories()` doc_scraper.py:323-349
- **Quick reference generation**: `generate_quick_reference()` doc_scraper.py:351-370
- **SKILL.md generation**: `create_enhanced_skill_md()` doc_scraper.py:424-540
- **Scraping loop**: `scrape_all()` doc_scraper.py:226-249
- **Main workflow**: `main()` doc_scraper.py:661-733
## Workflow Examples
### First-time run (scrape + build)
```bash
# 1. Scrape + Build
python3 cli/doc_scraper.py --config configs/godot.json
# Time: 20-40 minutes
# 2. Package
python3 cli/package_skill.py output/godot/
# Result: godot.zip
```
### Using cached data (fast iteration)
```bash
# 1. Use existing data
python3 cli/doc_scraper.py --config configs/godot.json --skip-scrape
# Time: 1-3 minutes
# 2. Package
python3 cli/package_skill.py output/godot/
```
### Creating a new framework config
```bash
# Option 1: Interactive
python3 cli/doc_scraper.py --interactive
# Option 2: Copy and modify
cp configs/react.json configs/myframework.json
# Edit configs/myframework.json
python3 cli/doc_scraper.py --config configs/myframework.json
```
### Large documentation workflow (40K pages)
```bash
# 1. Estimate page count (fast, 1-2 minutes)
python3 cli/estimate_pages.py configs/godot.json
# 2. Split into focused sub-skills
python3 cli/split_config.py configs/godot.json --strategy router --target-pages 5000
# Creates: godot-scripting.json, godot-2d.json, godot-3d.json, etc.
# 3. Scrape all in parallel (4-8 hours instead of 20-40!)
for config in configs/godot-*.json; do
  # Quote the variable so paths containing spaces are passed as one argument
  python3 cli/doc_scraper.py --config "$config" &
done
wait
# 4. Generate intelligent router skill
python3 cli/generate_router.py configs/godot-*.json
# 5. Package all skills
python3 cli/package_multi.py output/godot*/
# 6. Upload all .zip files to Claude
# Result: Router automatically directs queries to the right sub-skill!
```
**Time savings:** Parallel scraping reduces 20-40 hours to 4-8 hours
**See full guide:** [Large Documentation Guide](LARGE_DOCUMENTATION.md)
## Testing Selectors
To find the right CSS selectors for a documentation site:
```python
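from bs4 import BeautifulSoup
import requests
url = "https://docs.example.com/page"
# Fail fast on network problems or non-2xx responses
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')
# Try different selectors; select_one() returns None when nothing matches
print(soup.select_one('article'))
print(soup.select_one('main'))
print(soup.select_one('div[role="main"]'))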
```
## Running Tests
**IMPORTANT: You must install the package before running tests**
```bash
# 1. Install package in editable mode (one-time setup)
pip install -e .
# 2. Run all tests
pytest
# 3. Run specific test files
pytest tests/test_config_validation.py
pytest tests/test_github_scraper.py
# 4. Run with verbose output
pytest -v
# 5. Run with coverage report
pytest --cov=src/skill_seekers --cov-report=html
```
**Why install first?**
- Tests import from `skill_seekers.cli` which requires the package to be installed
- Modern Python packaging best practice (PEP 517/518)
- CI/CD automatically installs with `pip install -e .`
- `conftest.py` shows a helpful error if the package is not installed
**Test Coverage:**
- 391+ tests passing
- 39% code coverage
- All core features tested
- CI/CD tests on Ubuntu + macOS with Python 3.10-3.12
## Troubleshooting
**No content extracted**: Check `main_content` selector. Common values: `article`, `main`, `div[role="main"]`, `div.content`
**Poor categorization**: Edit `categories` section in config with better keywords specific to the documentation structure
**Force re-scrape**: Delete cached data with `rm -rf output/{name}_data/`
**Rate limiting issues**: Increase `rate_limit` value in config (e.g., from 0.5 to 1.0 seconds)
## Output Quality Checks
After building, verify quality:
```bash
cat output/godot/SKILL.md # Should have real code examples
cat output/godot/references/index.md # Should show categories
ls output/godot/references/ # Should have category .md files
```
## llms.txt Support
Skill_Seekers automatically detects llms.txt files before HTML scraping:
### Detection Order
1. `{base_url}/llms-full.txt` (complete documentation)
2. `{base_url}/llms.txt` (standard version)
3. `{base_url}/llms-small.txt` (quick reference)
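The detection order above can be sketched as a list of candidate URLs tried in sequence. This is an illustrative assumption about the lookup, not the tool's implementation: `candidate_urls`, `first_available`, and the injected `exists` check are hypothetical names, and a real check would issue an HTTP request instead of using a lambda.

```python
from typing import Callable, Optional

# Variants in the documented priority order.
LLMS_VARIANTS = ("llms-full.txt", "llms.txt", "llms-small.txt")


def candidate_urls(base_url: str) -> list[str]:
    """Build the llms.txt candidates for a base URL, most complete first."""
    root = base_url.rstrip("/")
    return [f"{root}/{name}" for name in LLMS_VARIANTS]


def first_available(base_url: str, exists: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate that exists, or None to signal HTML fallback."""
    for url in candidate_urls(base_url):
        if exists(url):
            return url
    return None


print(candidate_urls("https://docs.example.com/"))
# Pretend only the standard variant is published.
print(first_available("https://docs.example.com/", lambda u: u.endswith("/llms.txt")))
```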
### Benefits
- ⚡ 10x faster (< 5 seconds vs 20-60 seconds)
- ✅ More reliable (maintained by docs authors)
- 🎯 Better quality (pre-formatted for LLMs)
- 🚫 No rate limiting needed
### Example Sites
- Hono: https://hono.dev/llms-full.txt
If no llms.txt file is found, the scraper automatically falls back to HTML scraping.

View File

@ -0,0 +1,250 @@
# AI-Powered SKILL.md Enhancement
Two scripts are available to dramatically improve your SKILL.md file:
1. **`enhance_skill_local.py`** - Uses Claude Code Max (no API key, **recommended**)
2. **`enhance_skill.py`** - Uses Anthropic API (~$0.15-$0.30 per skill)
Both analyze reference documentation and extract the best examples and guidance.
## Why Use Enhancement?
**Problem:** The auto-generated SKILL.md is often too generic:
- Empty Quick Reference section
- No practical code examples
- Generic "When to Use" triggers
- Doesn't highlight key features
**Solution:** Let Claude read your reference docs and create a much better SKILL.md with:
- ✅ Best code examples extracted from documentation
- ✅ Practical quick reference with real patterns
- ✅ Domain-specific guidance
- ✅ Clear navigation tips
- ✅ Key concepts explained
## Quick Start (LOCAL - No API Key)
**Recommended for Claude Code Max users:**
```bash
# Option 1: Standalone enhancement
python3 cli/enhance_skill_local.py output/steam-inventory/
# Option 2: Integrated with scraper
python3 cli/doc_scraper.py --config configs/steam-inventory.json --enhance-local
```
**What happens:**
1. Opens new terminal window
2. Runs Claude Code with enhancement prompt
3. Claude analyzes reference files (~15-20K chars)
4. Generates enhanced SKILL.md (30-60 seconds)
5. Terminal auto-closes when done
**Requirements:**
- Claude Code Max plan (you're already using it!)
- macOS (auto-launch works) or manual terminal run on other OS
## API-Based Enhancement (Alternative)
**If you prefer the API-based approach:**
### Installation
```bash
pip3 install anthropic
```
### Setup API Key
```bash
# Option 1: Environment variable (recommended)
export ANTHROPIC_API_KEY=sk-ant-...
# Option 2: Pass directly with --api-key
python3 cli/enhance_skill.py output/react/ --api-key sk-ant-...
```
### Usage
```bash
# Standalone enhancement
python3 cli/enhance_skill.py output/steam-inventory/
# Integrated with scraper
python3 cli/doc_scraper.py --config configs/steam-inventory.json --enhance
# Dry run (see what would be done)
python3 cli/enhance_skill.py output/react/ --dry-run
```
## What It Does
1. **Reads reference files** (api_reference.md, webapi.md, etc.)
2. **Sends to Claude** with instructions to:
- Extract 5-10 best code examples
- Create practical quick reference
- Write domain-specific "When to Use" triggers
- Add helpful navigation guidance
3. **Backs up original** SKILL.md to SKILL.md.backup
4. **Saves enhanced version** as new SKILL.md
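Steps 3 and 4 above (back up, then overwrite) can be sketched with the standard library alone. The function name `save_enhanced` is hypothetical and the enhanced text is a placeholder string here; this is not the code in `enhance_skill.py`.

```python
import shutil
import tempfile
from pathlib import Path


def save_enhanced(skill_dir: Path, enhanced_text: str) -> Path:
    """Copy SKILL.md to SKILL.md.backup, then write the enhanced version."""
    skill_md = skill_dir / "SKILL.md"
    backup = skill_dir / "SKILL.md.backup"
    if skill_md.exists():
        shutil.copy2(skill_md, backup)   # keep the original before overwriting
    skill_md.write_text(enhanced_text, encoding="utf-8")
    return backup


with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp)
    (skill_dir / "SKILL.md").write_text("original", encoding="utf-8")
    backup_path = save_enhanced(skill_dir, "enhanced")
    backup_text = backup_path.read_text(encoding="utf-8")
    current_text = (skill_dir / "SKILL.md").read_text(encoding="utf-8")
print(backup_text, "->", current_text)
```

Writing the backup before the new file means a failed or unwanted enhancement can be undone by moving `SKILL.md.backup` back into place.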
## Example Enhancement
### Before (Auto-Generated)
```markdown
## Quick Reference
### Common Patterns
*Quick reference patterns will be added as you use the skill.*
```
### After (AI-Enhanced)
````markdown
## Quick Reference
### Common API Patterns
**Granting promotional items:**
```cpp
void CInventory::GrantPromoItems()
{
    SteamItemDef_t newItems[2];
    newItems[0] = 110;
    newItems[1] = 111;
    SteamInventory()->AddPromoItems( &s_GenerateRequestResult, newItems, 2 );
}
```
**Getting all items in player inventory:**
```cpp
SteamInventoryResult_t resultHandle;
bool success = SteamInventory()->GetAllItems( &resultHandle );
```
[... 8 more practical examples ...]
````
## Cost Estimate
- **Input**: ~50,000-100,000 tokens (reference docs)
- **Output**: ~4,000 tokens (enhanced SKILL.md)
- **Model**: claude-sonnet-4-20250514
- **Estimated cost**: $0.15-$0.30 per skill
## Troubleshooting
### "No API key provided"
```bash
export ANTHROPIC_API_KEY=sk-ant-...
# or
python3 cli/enhance_skill.py output/react/ --api-key sk-ant-...
```
### "No reference files found"
Make sure you've run the scraper first:
```bash
python3 cli/doc_scraper.py --config configs/react.json
```
### "anthropic package not installed"
```bash
pip3 install anthropic
```
### Don't like the result?
```bash
# Restore original
mv output/steam-inventory/SKILL.md.backup output/steam-inventory/SKILL.md
# Try again (it may generate different content)
python3 cli/enhance_skill.py output/steam-inventory/
```
## Tips
1. **Run after scraping completes** - Enhancement works best with complete reference docs
2. **Review the output** - AI is good but not perfect, check the generated SKILL.md
3. **Keep the backup** - Original is saved as SKILL.md.backup
4. **Re-run if needed** - Each run may produce slightly different results
5. **Works offline after first run** - Reference files are local
## Real-World Results
**Test Case: steam-economy skill**
- **Before:** 75 lines, generic template, empty Quick Reference
- **After:** 570 lines, 10 practical API examples, key concepts explained
- **Time:** 60 seconds
- **Quality Rating:** 9/10
The LOCAL enhancement successfully:
- Extracted best HTTP/JSON examples from 24 pages of documentation
- Explained domain concepts (Asset Classes, Context IDs, Transaction Lifecycle)
- Created navigation guidance for beginners through advanced users
- Added best practices for security, economy design, and API integration
## Limitations
**LOCAL Enhancement (`enhance_skill_local.py`):**
- Requires Claude Code Max plan
- macOS auto-launch only (manual on other OS)
- Opens new terminal window
- Takes ~60 seconds
**API Enhancement (`enhance_skill.py`):**
- Requires Anthropic API key (paid)
- Cost: ~$0.15-$0.30 per skill
- Limited to ~100K tokens of reference input
**Both:**
- May occasionally miss the best examples
- Can't understand context beyond the reference docs
- Doesn't modify reference files (only SKILL.md)
## Enhancement Options Comparison
| Aspect | Manual Edit | LOCAL Enhancement | API Enhancement |
|--------|-------------|-------------------|-----------------|
| Time | 15-30 minutes | 30-60 seconds | 30-60 seconds |
| Code examples | You pick | AI picks best | AI picks best |
| Quick reference | Write yourself | Auto-generated | Auto-generated |
| Domain guidance | Your knowledge | From docs | From docs |
| Consistency | Varies | Consistent | Consistent |
| Cost | Free (your time) | Free (Max plan) | ~$0.20 per skill |
| Setup | None | None | API key needed |
| Quality | High (if expert) | 9/10 | 9/10 |
| **Recommended?** | For experts only | ✅ **Yes** | If no Max plan |
## When to Use
**Use enhancement when:**
- You want high-quality SKILL.md quickly
- Working with large documentation (50+ pages)
- Creating skills for unfamiliar frameworks
- Need practical code examples extracted
- Want consistent quality across multiple skills
**Skip enhancement when:**
- Budget constrained (use manual editing)
- Very small documentation (<10 pages)
- You know the framework intimately
- Documentation has no code examples
## Advanced: Customization
To customize how Claude enhances the SKILL.md, edit `enhance_skill.py` and modify the `_build_enhancement_prompt()` method around line 130.
Example customization:
```python
prompt += """
ADDITIONAL REQUIREMENTS:
- Focus on security best practices
- Include performance tips
- Add troubleshooting section
"""
```
## See Also
- [README.md](../README.md) - Main documentation
- [CLAUDE.md](CLAUDE.md) - Architecture guide
- [doc_scraper.py](../doc_scraper.py) - Main scraping tool


@ -0,0 +1,431 @@
# Handling Large Documentation Sites (10K+ Pages)
Complete guide for scraping and managing large documentation sites with Skill Seeker.
---
## Table of Contents
- [When to Split Documentation](#when-to-split-documentation)
- [Split Strategies](#split-strategies)
- [Quick Start](#quick-start)
- [Detailed Workflows](#detailed-workflows)
- [Best Practices](#best-practices)
- [Examples](#examples)
- [Troubleshooting](#troubleshooting)
---
## When to Split Documentation
### Size Guidelines
| Documentation Size | Recommendation | Strategy |
|-------------------|----------------|----------|
| < 5,000 pages | **One skill** | No splitting needed |
| 5,000 - 10,000 pages | **Consider splitting** | Category-based |
| 10,000 - 30,000 pages | **Recommended** | Router + Categories |
| 30,000+ pages | **Strongly recommended** | Router + Categories |
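The table can be read as a simple threshold lookup. This is a sketch, not code from Skill Seeker: `recommend_split` is a hypothetical name, and treating each boundary (5,000, 10,000, 30,000) as belonging to the larger tier is an assumption, since the table does not say.

```python
def recommend_split(page_count: int) -> tuple[str, str]:
    """Map an estimated page count to (recommendation, strategy).

    Boundaries are treated as half-open intervals, e.g. exactly 10,000
    pages falls into the "Recommended" tier (an assumption).
    """
    if page_count < 5_000:
        return ("One skill", "No splitting needed")
    if page_count < 10_000:
        return ("Consider splitting", "Category-based")
    if page_count < 30_000:
        return ("Recommended", "Router + Categories")
    return ("Strongly recommended", "Router + Categories")
```

For example, `recommend_split(40_000)` returns `("Strongly recommended", "Router + Categories")`, which matches the Godot scenario used later in this guide.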
### Why Split Large Documentation?
**Benefits:**
- ✅ Faster scraping (parallel execution)
- ✅ More focused skills (better Claude performance)
- ✅ Easier maintenance (update one topic at a time)
- ✅ Better user experience (precise answers)
- ✅ Avoids context window limits
**Trade-offs:**
- ⚠️ Multiple skills to manage
- ⚠️ Initial setup more complex
- ⚠️ Router adds one extra skill
---
## Split Strategies
### 1. **No Split** (One Big Skill)
**Best for:** Small to medium documentation (< 5K pages)
```bash
# Just use the config as-is
python3 cli/doc_scraper.py --config configs/react.json
```
**Pros:** Simple, one skill to maintain
**Cons:** Can be slow for large docs, may hit limits
---
### 2. **Category Split** (Multiple Focused Skills)
**Best for:** 5K-15K pages with clear topic divisions
```bash
# Auto-split by categories
python3 cli/split_config.py configs/godot.json --strategy category
# Creates:
# - godot-scripting.json
# - godot-2d.json
# - godot-3d.json
# - godot-physics.json
# - etc.
```
**Pros:** Focused skills, clear separation
**Cons:** User must know which skill to use
---
### 3. **Router + Categories** (Intelligent Hub) ⭐ RECOMMENDED
**Best for:** 10K+ pages, best user experience
```bash
# Create router + sub-skills
python3 cli/split_config.py configs/godot.json --strategy router
# Creates:
# - godot.json (router/hub)
# - godot-scripting.json
# - godot-2d.json
# - etc.
```
**Pros:** Best of both worlds, intelligent routing, natural UX
**Cons:** Slightly more complex setup
---
### 4. **Size-Based Split**
**Best for:** Docs without clear categories
```bash
# Split every 5000 pages
python3 cli/split_config.py configs/bigdocs.json --strategy size --target-pages 5000
# Creates:
# - bigdocs-part1.json
# - bigdocs-part2.json
# - bigdocs-part3.json
# - etc.
```
**Pros:** Simple, predictable
**Cons:** May split related topics
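To see how many parts a size-based split produces, the arithmetic is a ceiling division. The sketch below only illustrates that calculation; `plan_size_split` and the page-index fields are hypothetical, and how `split_config.py` actually partitions pages is not shown in this guide.

```python
import math


def plan_size_split(name: str, total_pages: int, target_pages: int = 5000) -> list[dict]:
    """Plan config names and page ranges for a size-based split (illustrative)."""
    if total_pages <= 0 or target_pages <= 0:
        raise ValueError("total_pages and target_pages must be positive")

    parts = math.ceil(total_pages / target_pages)
    plan = []
    for i in range(parts):
        start = i * target_pages
        end = min(start + target_pages, total_pages)
        plan.append(
            {
                "config": f"{name}-part{i + 1}.json",
                "first_page": start,    # zero-based, inclusive
                "last_page": end - 1,   # zero-based, inclusive
            }
        )
    return plan
```

With 12,000 pages and a 5,000-page target, this yields three parts, the last one holding the remaining 2,000 pages.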
---
## Quick Start
### Option 1: Automatic (Recommended)
```bash
# 1. Create config
python3 cli/doc_scraper.py --interactive
# Name: godot
# URL: https://docs.godotengine.org
# ... fill in prompts ...
# 2. Estimate pages (discovers it's large)
python3 cli/estimate_pages.py configs/godot.json
# Output: ⚠️ 40,000 pages detected - splitting recommended
# 3. Auto-split with router
python3 cli/split_config.py configs/godot.json --strategy router
# 4. Scrape all sub-skills
for config in configs/godot-*.json; do
  python3 cli/doc_scraper.py --config "$config" &
done
wait
# 5. Generate router
python3 cli/generate_router.py configs/godot-*.json
# 6. Package all
python3 cli/package_multi.py output/godot*/
# 7. Upload all .zip files to Claude
```
---
### Option 2: Manual Control
```bash
# 1. Define split in config
nano configs/godot.json
# Add:
{
  "split_strategy": "router",
  "split_config": {
    "target_pages_per_skill": 5000,
    "create_router": true,
    "split_by_categories": ["scripting", "2d", "3d", "physics"]
  }
}
# 2. Split
python3 cli/split_config.py configs/godot.json
# 3. Continue as above...
```
---
## Detailed Workflows
### Workflow 1: Router + Categories (40K Pages)
**Scenario:** Godot documentation (40,000 pages)
**Step 1: Estimate**
```bash
python3 cli/estimate_pages.py configs/godot.json
# Output:
# Estimated: 40,000 pages
# Recommended: Split into 8 skills (5K each)
```
**Step 2: Split Configuration**
```bash
python3 cli/split_config.py configs/godot.json --strategy router --target-pages 5000
# Creates:
# configs/godot.json (router)
# configs/godot-scripting.json (5K pages)
# configs/godot-2d.json (8K pages)
# configs/godot-3d.json (10K pages)
# configs/godot-physics.json (6K pages)
# configs/godot-shaders.json (11K pages)
```
**Step 3: Scrape Sub-Skills (Parallel)**
```bash
# Open multiple terminals or use background jobs
python3 cli/doc_scraper.py --config configs/godot-scripting.json &
python3 cli/doc_scraper.py --config configs/godot-2d.json &
python3 cli/doc_scraper.py --config configs/godot-3d.json &
python3 cli/doc_scraper.py --config configs/godot-physics.json &
python3 cli/doc_scraper.py --config configs/godot-shaders.json &
# Wait for all to complete
wait
# Time: 4-8 hours (parallel) vs 20-40 hours (sequential)
```
**Step 4: Generate Router**
```bash
python3 cli/generate_router.py configs/godot-*.json
# Creates:
# output/godot/SKILL.md (router skill)
```
**Step 5: Package All**
```bash
python3 cli/package_multi.py output/godot*/
# Creates:
# output/godot.zip (router)
# output/godot-scripting.zip
# output/godot-2d.zip
# output/godot-3d.zip
# output/godot-physics.zip
# output/godot-shaders.zip
```
**Step 6: Upload to Claude**
Upload all 6 .zip files to Claude. The router will intelligently direct queries to the right sub-skill!
---
### Workflow 2: Category Split Only (15K Pages)
**Scenario:** Vue.js documentation (15,000 pages)
**No router needed - just focused skills:**
```bash
# 1. Split
python3 cli/split_config.py configs/vue.json --strategy category
# 2. Scrape each
for config in configs/vue-*.json; do
  python3 cli/doc_scraper.py --config "$config"
done
# 3. Package
python3 cli/package_multi.py output/vue*/
# 4. Upload all to Claude
```
**Result:** 5 focused Vue skills (components, reactivity, routing, etc.)
---
## Best Practices
### 1. **Choose Target Size Wisely**
```bash
# Small focused skills (3K-5K pages) - more skills, very focused
python3 cli/split_config.py config.json --target-pages 3000
# Medium skills (5K-8K pages) - balanced (RECOMMENDED)
python3 cli/split_config.py config.json --target-pages 5000
# Larger skills (8K-10K pages) - fewer skills, broader
python3 cli/split_config.py config.json --target-pages 8000
```
### 2. **Use Parallel Scraping**
```bash
# Serial (slow - 40 hours)
for config in configs/godot-*.json; do
  python3 cli/doc_scraper.py --config "$config"
done
# Parallel (fast - 8 hours) ⭐
for config in configs/godot-*.json; do
  python3 cli/doc_scraper.py --config "$config" &
done
wait
```
### 3. **Test Before Full Scrape**
```bash
# Test with limited pages first
nano configs/godot-2d.json
# Set: "max_pages": 50
python3 cli/doc_scraper.py --config configs/godot-2d.json
# If output looks good, increase to full
```
### 4. **Use Checkpoints for Long Scrapes**
```bash
# Enable checkpoints in config
{
  "checkpoint": {
    "enabled": true,
    "interval": 1000
  }
}
# If scrape fails, resume
python3 cli/doc_scraper.py --config config.json --resume
```
---
## Examples
### Example 1: AWS Documentation (Hypothetical 50K Pages)
```bash
# 1. Split by AWS services
python3 cli/split_config.py configs/aws.json --strategy router --target-pages 5000
# Creates ~10 skills:
# - aws (router)
# - aws-compute (EC2, Lambda)
# - aws-storage (S3, EBS)
# - aws-database (RDS, DynamoDB)
# - etc.
# 2. Scrape in parallel (overnight)
# 3. Upload all skills to Claude
# 4. User asks "How do I create an S3 bucket?"
# 5. Router activates aws-storage skill
# 6. Focused, accurate answer!
```
### Example 2: Microsoft Docs (100K+ Pages)
```bash
# Too large even with splitting - use selective categories
# Only scrape key topics
python3 cli/split_config.py configs/microsoft.json --strategy category
# Edit configs to include only:
# - microsoft-azure (Azure docs only)
# - microsoft-dotnet (.NET docs only)
# - microsoft-typescript (TS docs only)
# Skip less relevant sections
```
---
## Troubleshooting
### Issue: "Splitting creates too many skills"
**Solution:** Increase target size or combine categories
```bash
# Instead of 5K per skill, use 8K
python3 cli/split_config.py config.json --target-pages 8000
# Or manually combine categories in config
```
### Issue: "Router not routing correctly"
**Solution:** Check routing keywords in router SKILL.md
```bash
# Review router
cat output/godot/SKILL.md
# Update keywords if needed
nano output/godot/SKILL.md
```
### Issue: "Parallel scraping fails"
**Solution:** Reduce parallelism or check rate limits
```bash
# Scrape 2-3 at a time instead of all
python3 cli/doc_scraper.py --config config1.json &
python3 cli/doc_scraper.py --config config2.json &
wait
python3 cli/doc_scraper.py --config config3.json &
python3 cli/doc_scraper.py --config config4.json &
wait
```
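One way to cap parallelism without hand-writing batches is a small worker pool. This is a sketch, assuming the scraper is invoked as shown above; `run_batch` and `scrape_commands` are hypothetical helpers, not part of Skill Seeker.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_batch(commands: list[list[str]], max_parallel: int = 2) -> list[int]:
    """Run commands with at most max_parallel active at once.

    Returns the exit codes in the same order as the input commands.
    """
    def run(cmd: list[str]) -> int:
        # check=False: a failed scrape should not abort the other jobs.
        return subprocess.run(cmd, check=False).returncode

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run, commands))


def scrape_commands(config_paths: list[str]) -> list[list[str]]:
    """Build one doc_scraper.py invocation per config file."""
    return [
        [sys.executable, "cli/doc_scraper.py", "--config", path]
        for path in config_paths
    ]
```

Calling `run_batch(scrape_commands(paths), max_parallel=2)` from the repository root keeps two scrapes running at a time. Unlike the `wait`-based batches above, a new scrape starts as soon as any slot frees up, and non-zero exit codes show which configs need a retry.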
---
## Summary
**For 40K+ Page Documentation:**
1. ✅ **Estimate first**: `python3 cli/estimate_pages.py config.json`
2. ✅ **Split with router**: `python3 cli/split_config.py config.json --strategy router`
3. ✅ **Scrape in parallel**: Multiple terminals or background jobs
4. ✅ **Generate router**: `python3 cli/generate_router.py configs/*-*.json`
5. ✅ **Package all**: `python3 cli/package_multi.py output/*/`
6. ✅ **Upload to Claude**: All .zip files
**Result:** Intelligent, fast, focused skills that work seamlessly together!
---
**Questions? See:**
- [Main README](../README.md)
- [MCP Setup Guide](MCP_SETUP.md)
- [Enhancement Guide](ENHANCEMENT.md)


@ -0,0 +1,60 @@
# llms.txt Support
## Overview
Skill_Seekers now automatically detects and uses llms.txt files when available, which can make documentation ingestion up to about 10x faster than HTML scraping (see Performance Comparison).
## What is llms.txt?
The llms.txt convention is a growing standard where documentation sites provide pre-formatted, LLM-ready markdown files:
- `llms-full.txt` - Complete documentation
- `llms.txt` - Standard balanced version
- `llms-small.txt` - Quick reference
## How It Works
1. Before HTML scraping, Skill_Seekers checks for llms.txt files
2. If found, downloads and parses the markdown
3. If not found, falls back to HTML scraping
4. Zero config changes needed
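The detection-then-fallback flow can be sketched as follows. This is illustrative only: the function names, the preference order of the three variants, and the assumption that the files live at the site root are not confirmed by this document. The `fetch` and `scrape_html` callables are injected so the sketch makes no network calls itself.

```python
from typing import Callable, Optional
from urllib.parse import urljoin, urlsplit

# Assumed preference order: most complete variant first.
LLMS_TXT_VARIANTS = ("llms-full.txt", "llms.txt", "llms-small.txt")


def candidate_urls(base_url: str) -> list[str]:
    """Build candidate llms.txt URLs at the site root of base_url."""
    parts = urlsplit(base_url)
    root = f"{parts.scheme}://{parts.netloc}/"
    return [urljoin(root, name) for name in LLMS_TXT_VARIANTS]


def load_docs(
    base_url: str,
    fetch: Callable[[str], Optional[str]],
    scrape_html: Callable[[str], str],
) -> tuple[str, str]:
    """Return (source, content), preferring llms.txt over HTML scraping."""
    for url in candidate_urls(base_url):
        try:
            text = fetch(url)
        except Exception:
            text = None  # a failed download is treated like a missing file
        if text:
            return ("llms.txt", text)
    # Nothing usable found: fall back to the regular HTML scraper.
    return ("html", scrape_html(base_url))
```

For the Hono config shown under Explicit URL, `candidate_urls("https://hono.dev/docs")` starts with `https://hono.dev/llms-full.txt`.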
## Configuration
### Automatic Detection (Recommended)
No config changes needed. Just run normally:
```bash
python3 cli/doc_scraper.py --config configs/hono.json
```
### Explicit URL
Optionally specify llms.txt URL:
```json
{
  "name": "hono",
  "llms_txt_url": "https://hono.dev/llms-full.txt",
  "base_url": "https://hono.dev/docs"
}
```
## Performance Comparison
| Method | Time | Requests |
|--------|------|----------|
| HTML Scraping (20 pages) | 20-60s | 20+ |
| llms.txt | < 5s | 1 |
## Supported Sites
Sites known to provide llms.txt:
- Hono: https://hono.dev/llms-full.txt
- (More to be discovered)
## Fallback Behavior
If llms.txt download or parsing fails, automatically falls back to HTML scraping with no user intervention required.


@ -0,0 +1,618 @@
# Complete MCP Setup Guide for Claude Code
Step-by-step guide to set up the Skill Seeker MCP server with Claude Code.
**✅ Fully Tested and Working**: All 9 MCP tools verified in production use with Claude Code
- ✅ 34 comprehensive unit tests (100% pass rate)
- ✅ Integration tested via actual Claude Code MCP protocol
- ✅ All 9 tools working with natural language commands (includes upload support!)
---
## Table of Contents
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Configuration](#configuration)
- [Verification](#verification)
- [Usage Examples](#usage-examples)
- [Troubleshooting](#troubleshooting)
- [Advanced Configuration](#advanced-configuration)
---
## Prerequisites
### Required Software
1. **Python 3.10 or higher**
```bash
python3 --version
# Should show: Python 3.10.x or higher
```
2. **Claude Code installed**
- Download from [claude.ai/code](https://claude.ai/code)
- Requires Claude Pro or Claude Code Max subscription
3. **Skill Seeker repository cloned**
```bash
git clone https://github.com/yusufkaraaslan/Skill_Seekers.git
cd Skill_Seekers
```
### System Requirements
- **Operating System**: macOS, Linux, or Windows (WSL)
- **Disk Space**: 100 MB for dependencies + space for generated skills
- **Network**: Internet connection for documentation scraping
---
## Installation
### Step 1: Install Python Dependencies
```bash
# Navigate to repository root
cd /path/to/Skill_Seekers
# Install MCP server dependencies
pip3 install -r skill_seeker_mcp/requirements.txt
# Install CLI tool dependencies (for scraping)
pip3 install requests beautifulsoup4
```
**Expected output:**
```
Successfully installed mcp-0.9.0 requests-2.31.0 beautifulsoup4-4.12.3
```
### Step 2: Verify Installation
```bash
# Test MCP server can start
timeout 3 python3 skill_seeker_mcp/server.py || echo "Server OK (timeout expected)"
# Should exit cleanly or timeout (both are normal)
```
**Optional: Run Tests**
```bash
# Install test dependencies
pip3 install pytest
# Run MCP server tests (25 tests)
python3 -m pytest tests/test_mcp_server.py -v
# Expected: 25 passed in ~0.3s
```
### Step 3: Note Your Repository Path
```bash
# Get absolute path
pwd
# Example output: /Users/username/Projects/Skill_Seekers
# or: /home/username/Skill_Seekers
```
**Save this path** - you'll need it for configuration!
---
## Configuration
### Step 1: Locate Claude Code MCP Configuration
Claude Code stores MCP configuration in:
- **macOS**: `~/.config/claude-code/mcp.json`
- **Linux**: `~/.config/claude-code/mcp.json`
- **Windows (WSL)**: `~/.config/claude-code/mcp.json`
### Step 2: Create/Edit Configuration File
```bash
# Create config directory if it doesn't exist
mkdir -p ~/.config/claude-code
# Edit the configuration
nano ~/.config/claude-code/mcp.json
```
### Step 3: Add Skill Seeker MCP Server
**Full Configuration Example:**
```json
{
  "mcpServers": {
    "skill-seeker": {
      "command": "python3",
      "args": [
        "/Users/username/Projects/Skill_Seekers/skill_seeker_mcp/server.py"
      ],
      "cwd": "/Users/username/Projects/Skill_Seekers",
      "env": {}
    }
  }
}
```
**IMPORTANT:** Replace `/Users/username/Projects/Skill_Seekers` with YOUR actual repository path!
**If you already have other MCP servers:**
```json
{
  "mcpServers": {
    "existing-server": {
      "command": "node",
      "args": ["/path/to/existing/server.js"]
    },
    "skill-seeker": {
      "command": "python3",
      "args": [
        "/Users/username/Projects/Skill_Seekers/skill_seeker_mcp/server.py"
      ],
      "cwd": "/Users/username/Projects/Skill_Seekers"
    }
  }
}
```
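If you prefer not to edit the JSON by hand, the merge can be scripted. This is a minimal sketch, assuming the file layout shown above; `add_skill_seeker` is a hypothetical helper, not a tool shipped with Skill Seeker.

```python
import json
from pathlib import Path


def add_skill_seeker(config_path: str, repo_path: str) -> dict:
    """Add or update the skill-seeker entry, keeping other MCP servers intact."""
    path = Path(config_path).expanduser()
    repo = Path(repo_path).expanduser()

    # Start from the existing config if there is one.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    servers = config.setdefault("mcpServers", {})

    servers["skill-seeker"] = {
        "command": "python3",
        "args": [str(repo / "skill_seeker_mcp" / "server.py")],
        "cwd": str(repo),
    }

    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, `add_skill_seeker("~/.config/claude-code/mcp.json", "~/Projects/Skill_Seekers")` writes absolute paths for `args` and `cwd` and leaves any `existing-server` entry untouched.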
### Step 4: Save and Restart Claude Code
1. Save the file (`Ctrl+O` in nano, then `Enter`)
2. Exit editor (`Ctrl+X` in nano)
3. **Completely restart Claude Code** (quit and reopen)
---
## Verification
### Step 1: Check MCP Server Loaded
In Claude Code, type:
```
List all available MCP tools
```
You should see 9 Skill Seeker tools:
- `generate_config`
- `estimate_pages`
- `scrape_docs`
- `package_skill`
- `upload_skill`
- `list_configs`
- `validate_config`
- `split_config`
- `generate_router`
### Step 2: Test a Simple Command
```
List all available configs
```
**Expected response:**
```
Available configurations:
1. godot - Godot Engine documentation
2. react - React framework
3. vue - Vue.js framework
4. django - Django web framework
5. fastapi - FastAPI Python framework
6. kubernetes - Kubernetes documentation
7. steam-economy-complete - Steam Economy API
```
### Step 3: Test Config Generation
```
Generate a config for Tailwind CSS at https://tailwindcss.com/docs
```
**Expected response:**
```
✅ Config created: configs/tailwind.json
```
**Verify the file exists:**
```bash
ls configs/tailwind.json
```
---
## Usage Examples
### Example 1: Generate Skill from Scratch
```
User: Generate config for Svelte docs at https://svelte.dev/docs
Claude: ✅ Config created: configs/svelte.json
User: Estimate pages for configs/svelte.json
Claude: 📊 Estimated pages: 150
Recommended max_pages: 180
User: Scrape docs using configs/svelte.json
Claude: ✅ Skill created at output/svelte/
Run: python3 cli/package_skill.py output/svelte/
User: Package skill at output/svelte/
Claude: ✅ Created: output/svelte.zip
Ready to upload to Claude!
```
### Example 2: Use Existing Config
```
User: List all available configs
Claude: [Shows 7 configs]
User: Scrape docs using configs/react.json with max 50 pages
Claude: ✅ Skill created at output/react/
User: Package skill at output/react/
Claude: ✅ Created: output/react.zip
```
### Example 3: Validate Before Scraping
```
User: Validate configs/godot.json
Claude: ✅ Config is valid
- Base URL: https://docs.godotengine.org/en/stable/
- Max pages: 500
- Rate limit: 0.5s
- Categories: 3
User: Estimate pages for configs/godot.json
Claude: 📊 Estimated pages: 450
Current max_pages (500) is sufficient
User: Scrape docs using configs/godot.json
Claude: [Scraping starts...]
```
---
## Troubleshooting
### Issue: MCP Server Not Loading
**Symptoms:**
- Skill Seeker tools don't appear in Claude Code
- No response when asking about configs
**Solutions:**
1. **Check configuration path:**
```bash
cat ~/.config/claude-code/mcp.json
```
2. **Verify Python path:**
```bash
which python3
# Should show: /usr/bin/python3 or /usr/local/bin/python3
```
3. **Test server manually:**
```bash
cd /path/to/Skill_Seekers
python3 skill_seeker_mcp/server.py
# Should start without errors
```
4. **Check Claude Code logs:**
- macOS: `~/Library/Logs/Claude Code/`
- Linux: `~/.config/claude-code/logs/`
5. **Completely restart Claude Code:**
- Quit Claude Code (don't just close window)
- Reopen Claude Code
### Issue: "ModuleNotFoundError: No module named 'mcp'"
**Solution:**
```bash
pip3 install -r skill_seeker_mcp/requirements.txt
```
### Issue: "Permission denied" when running server
**Solution:**
```bash
chmod +x skill_seeker_mcp/server.py
```
### Issue: Tools appear but don't work
**Symptoms:**
- Tools listed but commands fail
- "Error executing tool" messages
**Solutions:**
1. **Check working directory in config:**
```json
{
  "cwd": "/FULL/PATH/TO/Skill_Seekers"
}
```
2. **Verify CLI tools exist:**
```bash
ls cli/doc_scraper.py
ls cli/estimate_pages.py
ls cli/package_skill.py
```
3. **Test CLI tools directly:**
```bash
python3 cli/doc_scraper.py --help
```
### Issue: Slow or hanging operations
**Solutions:**
1. **Check rate limit in config:**
- Default: 0.5 seconds
- Increase if needed: 1.0 or 2.0 seconds
2. **Use smaller max_pages for testing:**
```
Generate config with max_pages=20 for testing
```
3. **Check network connection:**
```bash
curl -I https://docs.example.com
```
---
## Advanced Configuration
### Custom Environment Variables
```json
{
  "mcpServers": {
    "skill-seeker": {
      "command": "python3",
      "args": ["/path/to/Skill_Seekers/skill_seeker_mcp/server.py"],
      "cwd": "/path/to/Skill_Seekers",
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-...",
        "PYTHONPATH": "/custom/path"
      }
    }
  }
}
```
### Multiple Python Versions
If you have multiple Python versions:
```json
{
  "mcpServers": {
    "skill-seeker": {
      "command": "/usr/local/bin/python3.11",
      "args": ["/path/to/Skill_Seekers/skill_seeker_mcp/server.py"],
      "cwd": "/path/to/Skill_Seekers"
    }
  }
}
```
### Virtual Environment
To use a Python virtual environment:
```bash
# Create venv
cd /path/to/Skill_Seekers
python3 -m venv venv
source venv/bin/activate
pip install -r skill_seeker_mcp/requirements.txt
pip install requests beautifulsoup4
which python3
# Copy this path for config
```
```json
{
  "mcpServers": {
    "skill-seeker": {
      "command": "/path/to/Skill_Seekers/venv/bin/python3",
      "args": ["/path/to/Skill_Seekers/skill_seeker_mcp/server.py"],
      "cwd": "/path/to/Skill_Seekers"
    }
  }
}
```
### Debug Mode
Enable verbose logging:
```json
{
  "mcpServers": {
    "skill-seeker": {
      "command": "python3",
      "args": [
        "-u",
        "/path/to/Skill_Seekers/skill_seeker_mcp/server.py"
      ],
      "cwd": "/path/to/Skill_Seekers",
      "env": {
        "DEBUG": "1"
      }
    }
  }
}
```
---
## Complete Example Configuration
**Minimal (recommended for most users):**
```json
{
  "mcpServers": {
    "skill-seeker": {
      "command": "python3",
      "args": [
        "/Users/username/Projects/Skill_Seekers/skill_seeker_mcp/server.py"
      ],
      "cwd": "/Users/username/Projects/Skill_Seekers"
    }
  }
}
```
**With API enhancement:**
```json
{
  "mcpServers": {
    "skill-seeker": {
      "command": "python3",
      "args": [
        "/Users/username/Projects/Skill_Seekers/skill_seeker_mcp/server.py"
      ],
      "cwd": "/Users/username/Projects/Skill_Seekers",
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-your-key-here"
      }
    }
  }
}
```
---
## End-to-End Workflow
### Complete Setup and First Skill
```bash
# 1. Install
cd ~/Projects
git clone https://github.com/yusufkaraaslan/Skill_Seekers.git
cd Skill_Seekers
pip3 install -r skill_seeker_mcp/requirements.txt
pip3 install requests beautifulsoup4
# 2. Configure
mkdir -p ~/.config/claude-code
cat > ~/.config/claude-code/mcp.json << 'EOF'
{
  "mcpServers": {
    "skill-seeker": {
      "command": "python3",
      "args": [
        "/Users/username/Projects/Skill_Seekers/skill_seeker_mcp/server.py"
      ],
      "cwd": "/Users/username/Projects/Skill_Seekers"
    }
  }
}
EOF
# (Replace paths with your actual paths!)
# 3. Restart Claude Code
# 4. Test in Claude Code:
```
**In Claude Code:**
```
User: List all available configs
User: Scrape docs using configs/react.json with max 50 pages
User: Package skill at output/react/
```
**Result:** `output/react.zip` ready to upload!
---
## Next Steps
After successful setup:
1. **Try preset configs:**
- React: `scrape docs using configs/react.json`
- Vue: `scrape docs using configs/vue.json`
- Django: `scrape docs using configs/django.json`
2. **Create custom configs:**
- `generate config for [framework] at [url]`
3. **Test with small limits first:**
- Use `max_pages` parameter: `scrape docs using configs/test.json with max 20 pages`
4. **Explore enhancement:**
- Use `--enhance-local` flag for AI-powered SKILL.md improvement
---
## Getting Help
- **Documentation**: See [mcp/README.md](../mcp/README.md)
- **Issues**: [GitHub Issues](https://github.com/yusufkaraaslan/Skill_Seekers/issues)
- **Examples**: See [.github/ISSUES_TO_CREATE.md](../.github/ISSUES_TO_CREATE.md) for test cases
---
## Quick Reference Card
```
SETUP:
1. Install dependencies: pip3 install -r skill_seeker_mcp/requirements.txt
2. Configure: ~/.config/claude-code/mcp.json
3. Restart Claude Code
VERIFY:
- "List all available configs"
- "Validate configs/react.json"
GENERATE SKILL:
1. "Generate config for [name] at [url]"
2. "Estimate pages for configs/[name].json"
3. "Scrape docs using configs/[name].json"
4. "Package skill at output/[name]/"
TROUBLESHOOTING:
- Check: cat ~/.config/claude-code/mcp.json
- Test: python3 skill_seeker_mcp/server.py
- Logs: ~/Library/Logs/Claude Code/
```
---
Happy skill creating! 🚀


@ -0,0 +1,579 @@
# PDF Advanced Features Guide
Comprehensive guide to advanced PDF extraction features (Priority 2 & 3).
## Overview
Skill Seeker's PDF extractor now includes powerful advanced features for handling complex PDF scenarios:
**Priority 2 Features (More PDF Types):**
- ✅ OCR support for scanned PDFs
- ✅ Password-protected PDF support
- ✅ Complex table extraction
**Priority 3 Features (Performance Optimizations):**
- ✅ Parallel page processing
- ✅ Intelligent caching of expensive operations
## Table of Contents
1. [OCR Support for Scanned PDFs](#ocr-support)
2. [Password-Protected PDFs](#password-protected-pdfs)
3. [Table Extraction](#table-extraction)
4. [Parallel Processing](#parallel-processing)
5. [Caching](#caching)
6. [Combined Usage](#combined-usage)
7. [Performance Benchmarks](#performance-benchmarks)
---
## OCR Support
Extract text from scanned PDFs using Optical Character Recognition.
### Installation
```bash
# Install Tesseract OCR engine
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
# macOS
brew install tesseract
# Install Python packages
pip install pytesseract Pillow
```
### Usage
```bash
# Basic OCR
python3 cli/pdf_extractor_poc.py scanned.pdf --ocr
# OCR with other options
python3 cli/pdf_extractor_poc.py scanned.pdf --ocr --verbose -o output.json
# Full skill creation with OCR
python3 cli/pdf_scraper.py --pdf scanned.pdf --name myskill --ocr
```
### How It Works
1. **Detection**: For each page, checks if text content is < 50 characters
2. **Fallback**: If low text detected and OCR enabled, renders page as image
3. **Processing**: Runs Tesseract OCR on the image
4. **Selection**: Uses OCR text if it's longer than extracted text
5. **Logging**: Shows OCR extraction results in verbose mode
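The detection and selection steps boil down to a length comparison. The sketch below illustrates that rule only; `choose_page_text` is a hypothetical name, and the OCR step is passed in as a callable (in a real pipeline it might render the page to an image and call `pytesseract.image_to_string`).

```python
from typing import Callable

# Pages with fewer extracted characters than this are OCR candidates.
OCR_TEXT_THRESHOLD = 50


def choose_page_text(
    extracted: str,
    run_ocr: Callable[[], str],
    ocr_enabled: bool = True,
) -> str:
    """Return the OCR text only when it beats a near-empty extraction."""
    # Steps 1-2: skip OCR when disabled or when the page already has text.
    if not ocr_enabled or len(extracted) >= OCR_TEXT_THRESHOLD:
        return extracted

    # Step 3: OCR is expensive, so it runs only for low-text pages.
    ocr_text = run_ocr()

    # Step 4: keep whichever result is longer.
    return ocr_text if len(ocr_text) > len(extracted) else extracted
```

Passing the OCR step as a callable keeps the slow path lazy: pages that already contain enough text never trigger rendering or Tesseract.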
### Example Output
```
📄 Extracting from: scanned.pdf
Pages: 50
OCR: ✅ enabled
Page 1: 245 chars, 0 code blocks, 2 headings, 0 images, 0 tables
OCR extracted 245 chars (was 12)
Page 2: 389 chars, 1 code blocks, 3 headings, 0 images, 0 tables
OCR extracted 389 chars (was 5)
```
### Limitations
- Requires Tesseract installed on system
- Slower than regular text extraction (~2-5 seconds per page)
- Quality depends on PDF scan quality
- Works best with high-resolution scans
### Best Practices
- Use `--parallel` with OCR for faster processing
- Combine with `--verbose` to see OCR progress
- Test on a few pages first before processing large documents
---
## Password-Protected PDFs
Handle encrypted PDFs with password protection.
### Usage
```bash
# Basic usage
python3 cli/pdf_extractor_poc.py encrypted.pdf --password mypassword
# With full workflow
python3 cli/pdf_scraper.py --pdf encrypted.pdf --name myskill --password mypassword
```
### How It Works
1. **Detection**: Checks if PDF is encrypted (`doc.is_encrypted`)
2. **Authentication**: Attempts to authenticate with provided password
3. **Validation**: Returns error if password is incorrect or missing
4. **Processing**: Continues normal extraction if authentication succeeds
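The four steps can be sketched against any object that exposes `is_encrypted` and `authenticate()`, as PyMuPDF documents do. `unlock` and `PasswordError` are hypothetical names used for illustration; the extractor's actual code is not shown in this guide.

```python
from typing import Optional


class PasswordError(Exception):
    """Raised when an encrypted PDF cannot be opened."""


def unlock(doc, password: Optional[str] = None) -> None:
    """Authenticate an encrypted document or raise PasswordError.

    `doc` is expected to behave like a PyMuPDF document: an `is_encrypted`
    attribute and an `authenticate(password)` method that returns a falsy
    value on failure.
    """
    # Step 1: unencrypted documents need no further handling.
    if not doc.is_encrypted:
        return

    # Step 3 (missing password): fail early with a clear message.
    if not password:
        raise PasswordError("PDF is encrypted but no password provided")

    # Steps 2-3: try the password; a falsy result means it was rejected.
    if not doc.authenticate(password):
        raise PasswordError("Invalid password")
    # Step 4: the caller continues with normal extraction.
```

With PyMuPDF installed, this would be called as `unlock(fitz.open("encrypted.pdf"), password)` before iterating over pages.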
### Example Output
```
📄 Extracting from: encrypted.pdf
🔐 PDF is encrypted, trying password...
✅ Password accepted
Pages: 100
Metadata: {...}
```
### Error Handling
```
# Missing password
❌ PDF is encrypted but no password provided
Use --password option to provide password
# Wrong password
❌ Invalid password
```
### Security Notes
- Password is passed via command line (visible in process list)
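- For sensitive documents, consider keeping the password in an environment variable instead of typing it literally (note that a value expanded into `--password` is still visible in the process list)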
- Password is not stored in output JSON
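If you wrap the extractor in your own script, the password can be read inside the process instead of on the command line. This is a sketch of that idea, not an existing Skill Seeker feature: `resolve_password` and the `PDF_PASSWORD` variable name are hypothetical.

```python
import getpass
import os
from typing import Optional


def resolve_password(
    cli_value: Optional[str] = None,
    env_var: str = "PDF_PASSWORD",
    prompt: bool = False,
) -> Optional[str]:
    """Pick a password from, in order: CLI value, environment, prompt."""
    if cli_value:
        return cli_value

    # Read inside the process, so the value never appears in argv.
    value = os.environ.get(env_var)
    if value:
        return value

    # Optional interactive fallback; input is not echoed to the terminal.
    if prompt:
        return getpass.getpass("PDF password: ")
    return None
```

Reading the variable inside the process, rather than expanding it in the shell, is what keeps the password out of the process list.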
---
## Table Extraction
Extract tables from PDFs and include them in skill references.
### Usage
```bash
# Extract tables
python3 cli/pdf_extractor_poc.py data.pdf --extract-tables
# With other options
python3 cli/pdf_extractor_poc.py data.pdf --extract-tables --verbose -o output.json
# Full skill creation with tables
python3 cli/pdf_scraper.py --pdf data.pdf --name myskill --extract-tables
```
### How It Works
1. **Detection**: Uses PyMuPDF's `find_tables()` method
2. **Extraction**: Extracts table data as 2D array (rows × columns)
3. **Metadata**: Captures bounding box, row count, column count
4. **Integration**: Tables included in page data and summary
### Example Output
```
📄 Extracting from: data.pdf
Table extraction: ✅ enabled
Page 5: 892 chars, 2 code blocks, 4 headings, 0 images, 2 tables
Found table 0: 10x4
Found table 1: 15x6
✅ Extraction complete:
Tables found: 25
```
### Table Data Structure
```json
{
  "tables": [
    {
      "table_index": 0,
      "rows": [
        ["Header 1", "Header 2", "Header 3"],
        ["Data 1", "Data 2", "Data 3"],
        ...
      ],
      "bbox": [x0, y0, x1, y1],
      "row_count": 10,
      "col_count": 4
    }
  ]
}
```
### Integration with Skills
Tables are automatically included in reference files when building skills:
```markdown
## Data Tables
### Table 1 (Page 5)
| Header 1 | Header 2 | Header 3 |
|----------|----------|----------|
| Data 1 | Data 2 | Data 3 |
```
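Converting the extracted `rows` array into a Markdown table like the one above takes only a few lines. This is an illustrative sketch; `table_to_markdown` is a hypothetical helper, and treating the first row as the header is an assumption.

```python
def table_to_markdown(rows: list[list]) -> str:
    """Render a 2D list of cells as a Markdown table (first row = header)."""
    if not rows:
        return ""

    width = max(len(row) for row in rows)

    def fmt(row: list) -> str:
        # Empty cells may arrive as None; pipes and newlines would break the table.
        cells = [
            "" if cell is None else str(cell).replace("|", "\\|").replace("\n", " ")
            for cell in row
        ]
        cells += [""] * (width - len(cells))  # pad ragged rows
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "|" + "|".join(["---"] * width) + "|"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)
```

Padding short rows and escaping `|` keeps the output valid even when the source table has merged or empty cells, which is where extraction quality tends to drop.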
### Limitations
- Quality depends on PDF table structure
- Works best with well-formatted tables
- Complex merged cells may not extract correctly
---
## Parallel Processing
Process pages in parallel for roughly 2.5x-3.3x faster extraction, depending on page and worker count (see Performance below).
### Usage
```bash
# Enable parallel processing (auto-detects CPU count)
python3 cli/pdf_extractor_poc.py large.pdf --parallel
# Specify worker count
python3 cli/pdf_extractor_poc.py large.pdf --parallel --workers 8
# With full workflow
python3 cli/pdf_scraper.py --pdf large.pdf --name myskill --parallel --workers 8
```
### How It Works
1. **Worker Pool**: Creates ThreadPoolExecutor with N workers
2. **Distribution**: Distributes pages across workers
3. **Extraction**: Each worker processes pages independently
4. **Collection**: Results collected and merged
5. **Threshold**: Only activates for PDFs with > 5 pages
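The worker-pool flow above can be sketched with the standard library. `extract_all` is a hypothetical name, and `extract_page` stands in for whatever per-page extraction function the tool uses; only the `ThreadPoolExecutor` usage and the 5-page threshold come from the description above.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Optional

# Parallel mode only activates above this many pages.
PARALLEL_MIN_PAGES = 5


def extract_all(
    page_numbers: Iterable[int],
    extract_page: Callable[[int], dict],
    parallel: bool = False,
    workers: Optional[int] = None,
) -> list[dict]:
    """Extract pages sequentially or with a thread pool, preserving order."""
    pages = list(page_numbers)

    # Step 5: small documents are not worth the pool overhead.
    if not parallel or len(pages) <= PARALLEL_MIN_PAGES:
        return [extract_page(n) for n in pages]

    # Step 1: default to the CPU count when no worker count is given.
    workers = workers or os.cpu_count() or 1

    # Steps 2-4: map() distributes pages and returns results in page order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_page, pages))
```

Using `pool.map` rather than collecting futures as they complete means the merged result is already in page order, so no extra sorting step is needed.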
### Example Output
```
📄 Extracting from: large.pdf
Pages: 500
Parallel processing: ✅ enabled (8 workers)
🚀 Extracting 500 pages in parallel (8 workers)...
✅ Extraction complete:
Total characters: 1,250,000
Code blocks found: 450
```
### Performance
| Pages | Sequential | Parallel (4 workers) | Parallel (8 workers) |
|-------|-----------|---------------------|---------------------|
| 50 | 25s | 10s (2.5x) | 8s (3.1x) |
| 100 | 50s | 18s (2.8x) | 15s (3.3x) |
| 500 | 4m 10s | 1m 30s (2.8x) | 1m 15s (3.3x) |
| 1000 | 8m 20s | 3m 00s (2.8x) | 2m 30s (3.3x) |
### Best Practices
- Use `--workers` equal to CPU core count
- Combine with `--no-cache` for first-time processing
- Monitor system resources (RAM, CPU)
- Not recommended for PDFs with very large images (memory intensive)
### Limitations
- Requires `concurrent.futures` (Python 3.2+)
- Uses more memory (N workers × page size)
- May not be beneficial for PDFs with many large images
---
## Caching
Intelligent caching of expensive operations for faster re-extraction.
### Usage
```bash
# Caching enabled by default
python3 cli/pdf_extractor_poc.py input.pdf
# Disable caching
python3 cli/pdf_extractor_poc.py input.pdf --no-cache
```
### How It Works
1. **Cache Key**: Each page cached by page number
2. **Check**: Before extraction, checks cache for page data
3. **Store**: After extraction, stores result in cache
4. **Reuse**: On a repeated extraction within the same process, returns cached data instantly
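A minimal sketch of the check/store/reuse cycle, assuming a plain dictionary keyed by page number. The `PageCache` class and its `hits` counter are hypothetical names used only for illustration.

```python
class PageCache:
    """In-memory cache of per-page extraction results, keyed by page number."""

    def __init__(self, enabled=True):
        self.enabled = enabled
        self._store = {}
        self.hits = 0

    def get_or_extract(self, page_number, extract):
        # Check: return the cached result if this page was already extracted
        if self.enabled and page_number in self._store:
            self.hits += 1
            return self._store[page_number]
        # Store: run the expensive extraction once and remember the result
        result = extract(page_number)
        if self.enabled:
            self._store[page_number] = result
        return result


calls = []


def slow_extract(page_number):
    # Stand-in for real extraction; records how often it actually runs
    calls.append(page_number)
    return {"page_number": page_number, "text": "..."}


cache = PageCache()
cache.get_or_extract(1, slow_extract)
cache.get_or_extract(1, slow_extract)  # second call is served from the cache
print(calls, cache.hits)
```

Passing `enabled=False` corresponds to the `--no-cache` behaviour: every call runs the extraction again.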
### What Gets Cached
- Page text and markdown
- Code block detection results
- Language detection results
- Quality scores
- Image extraction results
- Table extraction results
### Example Output
```
Page 1: Using cached data
Page 2: Using cached data
Page 3: 892 chars, 2 code blocks, 4 headings, 0 images, 0 tables
```
### Cache Lifetime
- In-memory only (cleared when process exits)
- Useful for:
- Testing extraction parameters
- Re-running with different filters
- Development and debugging
### When to Disable
- First-time extraction
- PDF file has changed
- Different extraction options
- Memory constraints
---
## Combined Usage
### Maximum Performance
Extract everything as fast as possible:
```bash
python3 cli/pdf_scraper.py \
--pdf docs/manual.pdf \
--name myskill \
--extract-images \
--extract-tables \
--parallel \
--workers 8 \
--min-quality 5.0
```
### Scanned PDF with Tables
```bash
python3 cli/pdf_scraper.py \
--pdf docs/scanned.pdf \
--name myskill \
--ocr \
--extract-tables \
--parallel \
--workers 4
```
### Encrypted PDF with All Features
```bash
python3 cli/pdf_scraper.py \
--pdf docs/encrypted.pdf \
--name myskill \
--password mypassword \
--extract-images \
--extract-tables \
--parallel \
--workers 8 \
--verbose
```
---
## Performance Benchmarks
### Test Setup
- **Hardware**: 8-core CPU, 16GB RAM
- **PDF**: 500-page technical manual
- **Content**: Mixed text, code, images, tables
### Results
| Configuration | Time | Speedup |
|--------------|------|---------|
| Basic (sequential) | 4m 10s | 1.0x (baseline) |
| + Caching | 2m 30s | 1.7x |
| + Parallel (4 workers) | 1m 30s | 2.8x |
| + Parallel (8 workers) | 1m 15s | 3.3x |
| + All optimizations | 1m 10s | 3.6x |
### Feature Overhead
| Feature | Time Impact | Memory Impact |
|---------|------------|---------------|
| OCR | +2-5s per page | +50MB per page |
| Table extraction | +0.5s per page | +10MB |
| Image extraction | +0.2s per image | Varies |
| Parallel (8 workers) | -70% total time | +8x memory |
| Caching | -50% on re-run | +100MB |
---
## Troubleshooting
### OCR Issues
**Problem**: `pytesseract not found`
```bash
# Install pytesseract
pip install pytesseract
# Install Tesseract engine
sudo apt-get install tesseract-ocr # Ubuntu
brew install tesseract # macOS
```
**Problem**: Low OCR quality
- Use higher DPI PDFs
- Check scan quality
- Try different Tesseract language packs
### Parallel Processing Issues
**Problem**: Out of memory errors
```bash
# Reduce worker count
python3 cli/pdf_extractor_poc.py large.pdf --parallel --workers 2
# Or disable parallel
python3 cli/pdf_extractor_poc.py large.pdf
```
**Problem**: Not faster than sequential
- Check CPU usage (may be I/O bound)
- Try with larger PDFs (> 50 pages)
- Monitor system resources
### Table Extraction Issues
**Problem**: Tables not detected
- Check if tables are actual tables (not images)
- Try different PDF viewers to verify structure
- Use `--verbose` to see detection attempts
**Problem**: Malformed table data
- Complex merged cells may not extract correctly
- Try extracting specific pages only
- Manual post-processing may be needed
---
## Best Practices
### For Large PDFs (500+ pages)
1. Use parallel processing:
```bash
python3 cli/pdf_scraper.py --pdf large.pdf --parallel --workers 8
```
2. Extract to JSON first, then build skill:
```bash
python3 cli/pdf_extractor_poc.py large.pdf -o extracted.json --parallel
python3 cli/pdf_scraper.py --from-json extracted.json --name myskill
```
3. Monitor system resources
### For Scanned PDFs
1. Use OCR with parallel processing:
```bash
python3 cli/pdf_scraper.py --pdf scanned.pdf --ocr --parallel --workers 4
```
2. Test on sample pages first
3. Use `--verbose` to monitor OCR performance
### For Encrypted PDFs
1. Use environment variable for password:
```bash
export PDF_PASSWORD="mypassword"
python3 cli/pdf_scraper.py --pdf encrypted.pdf --password "$PDF_PASSWORD"
```
2. Clear history after use to remove password
### For PDFs with Tables
1. Enable table extraction:
```bash
python3 cli/pdf_scraper.py --pdf data.pdf --extract-tables
```
2. Check table quality in output JSON
3. Manual review recommended for critical data
---
## API Reference
### PDFExtractor Class
```python
from pdf_extractor_poc import PDFExtractor
extractor = PDFExtractor(
    pdf_path="input.pdf",
    verbose=True,
    chunk_size=10,
    min_quality=5.0,
    extract_images=True,
    image_dir="images/",
    min_image_size=100,
    # Advanced features
    use_ocr=True,
    password="mypassword",
    extract_tables=True,
    parallel=True,
    max_workers=8,
    use_cache=True
)
result = extractor.extract_all()
```
### Configuration Options
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `pdf_path` | str | required | Path to PDF file |
| `verbose` | bool | False | Enable verbose logging |
| `chunk_size` | int | 10 | Pages per chunk |
| `min_quality` | float | 0.0 | Min code quality (0-10) |
| `extract_images` | bool | False | Extract images to files |
| `image_dir` | str | None | Image output directory |
| `min_image_size` | int | 100 | Min image dimension |
| `use_ocr` | bool | False | Enable OCR |
| `password` | str | None | PDF password |
| `extract_tables` | bool | False | Extract tables |
| `parallel` | bool | False | Parallel processing |
| `max_workers` | int | CPU count | Worker threads |
| `use_cache` | bool | True | Enable caching |
---
## Summary
- **6 Advanced Features** implemented (Priority 2 & 3)
- **~3x Performance Boost** with parallel processing
- **OCR Support** for scanned PDFs
- **Password Protection** support
- **Table Extraction** from complex PDFs
- **Intelligent Caching** for faster re-runs

The PDF extractor now handles scanned, encrypted, and table-heavy PDFs, with parallel processing and caching available to speed up large jobs.

View File

@ -0,0 +1,521 @@
# PDF Page Detection and Chunking (Task B1.3)
**Status:** ✅ Completed
**Date:** October 21, 2025
**Task:** B1.3 - Add PDF page detection and chunking
---
## Overview
Task B1.3 enhances the PDF extractor with intelligent page chunking and chapter detection capabilities. This allows large PDF documentation to be split into manageable, logical sections for better processing and organization.
## New Features
### ✅ 1. Page Chunking
Break large PDFs into smaller, manageable chunks:
- Configurable chunk size (default: 10 pages per chunk)
- Smart chunking that respects chapter boundaries
- Chunk metadata includes page ranges and chapter titles
**Usage:**
```bash
# Default chunking (10 pages per chunk)
python3 cli/pdf_extractor_poc.py input.pdf
# Custom chunk size (20 pages per chunk)
python3 cli/pdf_extractor_poc.py input.pdf --chunk-size 20
# Disable chunking (single chunk with all pages)
python3 cli/pdf_extractor_poc.py input.pdf --chunk-size 0
```
### ✅ 2. Chapter/Section Detection
Automatically detect chapter and section boundaries:
- Detects H1 and H2 headings as chapter markers
- Recognizes common chapter patterns:
- "Chapter 1", "Chapter 2", etc.
- "Part 1", "Part 2", etc.
- "Section 1", "Section 2", etc.
- Numbered sections like "1. Introduction"
**Chapter Detection Logic:**
1. Check for H1/H2 headings at page start
2. Pattern match against common chapter formats
3. Extract chapter title for metadata
### ✅ 3. Code Block Merging
Intelligently merge code blocks split across pages:
- Detects when code continues from one page to the next
- Checks language and detection method consistency
- Looks for continuation indicators:
- Does not end with `}` or `;`
- Ends with `,` or `\`
- Incomplete syntax structures
**Example:**
```
Page 5: def calculate_total(items):
            total = 0
            for item in items:
Page 6:         total += item.price
            return total
```
The merger will combine these into a single code block.
---
## Output Format
### Enhanced JSON Structure
The output now includes chunking and chapter information:
```json
{
  "source_file": "manual.pdf",
  "metadata": { ... },
  "total_pages": 150,
  "total_chunks": 15,
  "chapters": [
    {
      "title": "Getting Started",
      "start_page": 1,
      "end_page": 12
    },
    {
      "title": "API Reference",
      "start_page": 13,
      "end_page": 45
    }
  ],
  "chunks": [
    {
      "chunk_number": 1,
      "start_page": 1,
      "end_page": 12,
      "chapter_title": "Getting Started",
      "pages": [ ... ]
    },
    {
      "chunk_number": 2,
      "start_page": 13,
      "end_page": 22,
      "chapter_title": "API Reference",
      "pages": [ ... ]
    }
  ],
  "pages": [ ... ]
}
```
### Chunk Object
Each chunk contains:
- `chunk_number` - Sequential chunk identifier (1-indexed)
- `start_page` - First page in chunk (1-indexed)
- `end_page` - Last page in chunk (1-indexed)
- `chapter_title` - Detected chapter title (if any)
- `pages` - Array of page objects in this chunk
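Because `start_page` and `end_page` are both 1-indexed and inclusive, locating the chunk that holds a given page is a simple range check. The helper below is a hypothetical sketch (the name `find_chunk` is not part of the tool), using sample data shaped like the JSON above.

```python
def find_chunk(chunks, page_number):
    """Return the chunk whose inclusive page range contains page_number, or None."""
    for chunk in chunks:
        if chunk["start_page"] <= page_number <= chunk["end_page"]:
            return chunk
    return None


chunks = [
    {"chunk_number": 1, "start_page": 1, "end_page": 12,
     "chapter_title": "Getting Started", "pages": []},
    {"chunk_number": 2, "start_page": 13, "end_page": 22,
     "chapter_title": "API Reference", "pages": []},
]

hit = find_chunk(chunks, 13)
print(hit["chunk_number"], hit["chapter_title"])
```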
### Merged Code Block Indicator
Code blocks merged from multiple pages include a flag:
```json
{
  "code": "def example():\n    ...",
  "language": "python",
  "detection_method": "font",
  "merged_from_next_page": true
}
```
---
## Implementation Details
### Chapter Detection Algorithm
```python
import re

def detect_chapter_start(self, page_data):
    """
    Detect if a page starts a new chapter/section.
    Returns (is_chapter_start, chapter_title) tuple.
    """
    # Check H1/H2 headings first
    headings = page_data.get('headings', [])
    if headings:
        first_heading = headings[0]
        if first_heading['level'] in ['h1', 'h2']:
            return True, first_heading['text']
    # Pattern match against common chapter formats
    text = page_data.get('text', '')
    first_line = text.split('\n')[0] if text else ''
    chapter_patterns = [
        r'^Chapter\s+\d+',
        r'^Part\s+\d+',
        r'^Section\s+\d+',
        r'^\d+\.\s+[A-Z]',  # "1. Introduction"
    ]
    for pattern in chapter_patterns:
        if re.match(pattern, first_line, re.IGNORECASE):
            return True, first_line.strip()
    return False, None
```
### Code Block Merging Algorithm
```python
def merge_continued_code_blocks(self, pages):
    """
    Merge code blocks that are split across pages.
    """
    for i in range(len(pages) - 1):
        current_page = pages[i]
        next_page = pages[i + 1]
        # Skip page pairs where either side has no code blocks
        if not current_page['code_samples'] or not next_page['code_samples']:
            continue
        # Get last code block of current page
        last_code = current_page['code_samples'][-1]
        # Get first code block of next page
        first_next_code = next_page['code_samples'][0]
        # Check if they're likely the same code block
        if (last_code['language'] == first_next_code['language'] and
                last_code['detection_method'] == first_next_code['detection_method']):
            # Check for continuation indicators
            last_code_text = last_code['code'].rstrip()
            continuation_indicators = [
                # Does not end with a closing token
                not last_code_text.endswith(('}', ';')),
                # Ends with a token that expects more input
                last_code_text.endswith((',', '\\')),
            ]
            if any(continuation_indicators):
                # Merge the blocks
                merged_code = last_code['code'] + '\n' + first_next_code['code']
                last_code['code'] = merged_code
                last_code['merged_from_next_page'] = True
                # Remove duplicate from next page
                next_page['code_samples'].pop(0)
    return pages
```
### Chunking Algorithm
```python
def create_chunks(self, pages):
    """
    Create chunks of pages respecting chapter boundaries.
    """
    chunks = []
    current_chunk = []
    current_chapter = None
    chunk_start = 0  # 0-based index of the first page in the current chunk

    def close_chunk(end_page):
        # Append the current chunk; end_page is 1-indexed and inclusive
        chunks.append({
            'chunk_number': len(chunks) + 1,
            'start_page': chunk_start + 1,
            'end_page': end_page,
            'pages': current_chunk,
            'chapter_title': current_chapter
        })

    for i, page in enumerate(pages):
        # Detect chapter start
        is_chapter, chapter_title = self.detect_chapter_start(page)
        if is_chapter:
            if current_chunk:
                # Save current chunk before starting new one
                close_chunk(i)
                current_chunk = []
            chunk_start = i
            current_chapter = chapter_title
        current_chunk.append(page)
        # Check if chunk size reached (but don't break chapters);
        # chunk_size=0 disables size-based splitting
        if (self.chunk_size > 0 and not is_chapter
                and len(current_chunk) >= self.chunk_size):
            close_chunk(i + 1)
            current_chunk = []
            chunk_start = i + 1
    if current_chunk:
        # Flush the trailing pages as the final chunk
        close_chunk(len(pages))
    return chunks
```
---
## Usage Examples
### Basic Chunking
```bash
# Extract with default 10-page chunks
python3 cli/pdf_extractor_poc.py manual.pdf -o manual.json
# Output includes chunks
cat manual.json | jq '.total_chunks'
# Output: 15
```
### Large PDF Processing
```bash
# Large PDF with bigger chunks (50 pages each)
python3 cli/pdf_extractor_poc.py large_manual.pdf --chunk-size 50 -o output.json -v
# Verbose output shows:
# 📦 Creating chunks (chunk_size=50)...
# 🔗 Merging code blocks across pages...
# ✅ Extraction complete:
# Chunks created: 8
# Chapters detected: 12
```
### No Chunking (Single Output)
```bash
# Process all pages as single chunk
python3 cli/pdf_extractor_poc.py small_doc.pdf --chunk-size 0 -o output.json
```
---
## Performance
### Chunking Performance
- **Chapter Detection:** ~0.1ms per page (negligible overhead)
- **Code Merging:** ~0.5ms per page (fast)
- **Chunk Creation:** ~1ms total (very fast)
**Total overhead:** < 1% of extraction time
### Memory Benefits
Chunking is intended to reduce memory usage for large PDFs, but that benefit is not realized yet:
- **Without chunking:** Entire PDF loaded in memory
- **With chunking:** Process chunk-by-chunk (future enhancement)
**Current implementation** still loads the entire PDF but provides structured output for chunked processing downstream.
---
## Limitations
### Current Limitations
1. **Chapter Pattern Matching**
- Limited to common English chapter patterns
- May miss non-standard chapter formats
- No support for non-English chapters (e.g., "Capitulo", "Chapitre")
2. **Code Merging Heuristics**
- Based on simple continuation indicators
- May miss some edge cases
- No AST-based validation
3. **Chunk Size**
- Fixed page count (not by content size)
- Doesn't account for page content volume
- No auto-sizing based on memory constraints
### Known Issues
1. **Multi-Chapter Pages**
- If a single page has multiple chapters, only the first is detected
- Workaround: Use smaller chunk sizes
2. **False Code Merges**
- Rare cases where separate code blocks are merged
- Detection: Look for `merged_from_next_page` flag
3. **Table of Contents**
- TOC pages may be detected as chapters
- Workaround: Manual filtering in downstream processing
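For the table-of-contents issue, one possible downstream filter is to drop chapters whose title looks like a TOC heading. This is only a sketch of the manual workaround: the title pattern and the function name `drop_toc_chapters` are assumptions, and real documents may need a different heuristic.

```python
import re

# Assumed heuristic: a chapter titled "Contents" or "Table of Contents" is a TOC page
TOC_TITLE = re.compile(r"^\s*(table\s+of\s+)?contents\s*$", re.IGNORECASE)


def drop_toc_chapters(chapters):
    """Return the chapters list without entries that look like a table of contents."""
    return [c for c in chapters if not TOC_TITLE.match(c["title"])]


chapters = [
    {"title": "Table of Contents", "start_page": 1, "end_page": 2},
    {"title": "Chapter 1: Introduction", "start_page": 3, "end_page": 14},
]
print([c["title"] for c in drop_toc_chapters(chapters)])
```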
---
## Comparison: Before vs After
| Feature | Before (B1.2) | After (B1.3) |
|---------|---------------|--------------|
| Page chunking | None | ✅ Configurable |
| Chapter detection | None | ✅ Auto-detect |
| Code spanning pages | Split | ✅ Merged |
| Large PDF handling | Difficult | ✅ Chunked |
| Memory efficiency | Poor | Better (structure for future) |
| Output organization | Flat | ✅ Hierarchical |
---
## Testing
### Test Chapter Detection
Create a test PDF with chapters:
1. Page 1: "Chapter 1: Introduction"
2. Page 15: "Chapter 2: Getting Started"
3. Page 30: "Chapter 3: API Reference"
```bash
python3 cli/pdf_extractor_poc.py test.pdf -o test.json --chunk-size 20 -v
# Verify chapters detected
cat test.json | jq '.chapters'
```
Expected output:
```json
[
  {
    "title": "Chapter 1: Introduction",
    "start_page": 1,
    "end_page": 14
  },
  {
    "title": "Chapter 2: Getting Started",
    "start_page": 15,
    "end_page": 29
  },
  {
    "title": "Chapter 3: API Reference",
    "start_page": 30,
    "end_page": 50
  }
]
```
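The same check can be scripted instead of read by eye. The sketch below assumes, as in this test PDF, that the chapters cover the whole document starting at page 1 with no gaps; `check_chapters` is a hypothetical helper, not part of the extractor.

```python
def check_chapters(chapters, total_pages):
    """Return a list of problems found in the chapter page ranges (empty if none)."""
    problems = []
    expected_start = 1
    for ch in chapters:
        if ch["start_page"] != expected_start:
            problems.append(
                f"{ch['title']}: starts at {ch['start_page']}, expected {expected_start}"
            )
        if ch["end_page"] < ch["start_page"]:
            problems.append(f"{ch['title']}: end_page is before start_page")
        # The next chapter should begin right after this one ends
        expected_start = ch["end_page"] + 1
    if chapters and chapters[-1]["end_page"] != total_pages:
        problems.append(f"last chapter ends at {chapters[-1]['end_page']}, not {total_pages}")
    return problems


expected = [
    {"title": "Chapter 1: Introduction", "start_page": 1, "end_page": 14},
    {"title": "Chapter 2: Getting Started", "start_page": 15, "end_page": 29},
    {"title": "Chapter 3: API Reference", "start_page": 30, "end_page": 50},
]
print(check_chapters(expected, total_pages=50))
```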
### Test Code Merging
Create a test PDF with code spanning pages:
- Page 1 ends with: `def example():\n    total = 0`
- Page 2 starts with: `    for i in range(10):\n        total += i`
```bash
python3 cli/pdf_extractor_poc.py test.pdf -o test.json -v
# Check for merged code blocks
cat test.json | jq '.pages[0].code_samples[] | select(.merged_from_next_page == true)'
```
---
## Next Steps (Future Tasks)
### Task B1.4: Improve Code Block Detection
- Add syntax validation
- Use AST parsing for better language detection
- Improve continuation detection accuracy
### Task B1.5: Add Image Extraction
- Extract images from chunks
- OCR for code in images
- Diagram detection and extraction
### Task B1.6: Full PDF Scraper CLI
- Build on chunking foundation
- Category detection for chunks
- Multi-PDF support
---
## Integration with Skill Seeker
The chunking feature lays groundwork for:
1. **Memory-efficient processing** - Process PDFs chunk-by-chunk
2. **Better categorization** - Chapters become categories
3. **Improved SKILL.md** - Organize by detected chapters
4. **Large PDF support** - Handle 500+ page manuals
**Example workflow:**
```bash
# Extract large manual with chapters
python3 cli/pdf_extractor_poc.py large_manual.pdf --chunk-size 25 -o manual.json
# Future: Build skill from chunks
python3 cli/build_skill_from_pdf.py manual.json
# Result: SKILL.md organized by detected chapters
```
---
## API Usage
### Using PDFExtractor with Chunking
```python
from cli.pdf_extractor_poc import PDFExtractor
# Create extractor with 15-page chunks
extractor = PDFExtractor('manual.pdf', verbose=True, chunk_size=15)
# Extract
result = extractor.extract_all()
# Access chunks
for chunk in result['chunks']:
    print(f"Chunk {chunk['chunk_number']}: {chunk['chapter_title']}")
    print(f" Pages: {chunk['start_page']}-{chunk['end_page']}")
    print(f" Total pages: {len(chunk['pages'])}")
# Access chapters
for chapter in result['chapters']:
    print(f"Chapter: {chapter['title']}")
    print(f" Pages: {chapter['start_page']}-{chapter['end_page']}")
```
### Processing Chunks Independently
```python
# Extract
result = extractor.extract_all()
# Process each chunk separately
for chunk in result['chunks']:
    # Get pages in chunk
    pages = chunk['pages']
    # Process pages
    for page in pages:
        # Extract code samples
        for code in page['code_samples']:
            print(f"Found {code['language']} code")
            # Check if merged from next page
            if code.get('merged_from_next_page'):
                print(" (merged from next page)")
```
---
## Conclusion
Task B1.3 successfully implements:
- ✅ Page chunking with configurable size
- ✅ Automatic chapter/section detection
- ✅ Code block merging across pages
- ✅ Enhanced output format with structure
- ✅ Foundation for large PDF handling
**Performance:** Minimal overhead (<1%)
**Compatibility:** Backward compatible (pages array still included)
**Quality:** Significantly improved organization
**Ready for B1.4:** Code block detection improvements
---
**Task Completed:** October 21, 2025
**Next Task:** B1.4 - Improve code block extraction with syntax detection

View File

@ -0,0 +1,420 @@
# PDF Extractor - Proof of Concept (Task B1.2)
**Status:** ✅ Completed
**Date:** October 21, 2025
**Task:** B1.2 - Create simple PDF text extractor (proof of concept)
---
## Overview
This is a proof-of-concept PDF text and code extractor built for Skill Seeker. It demonstrates the feasibility of extracting documentation content from PDF files using PyMuPDF (fitz).
## Features
### ✅ Implemented
1. **Text Extraction** - Extract plain text from all PDF pages
2. **Markdown Conversion** - Convert PDF content to markdown format
3. **Code Block Detection** - Multiple detection methods:
- **Font-based:** Detects monospace fonts (Courier, Mono, Consolas, etc.)
- **Indent-based:** Detects consistently indented code blocks
- **Pattern-based:** Detects function/class definitions, imports
4. **Language Detection** - Auto-detect programming language from code content
5. **Heading Extraction** - Extract document structure from markdown
6. **Image Counting** - Track diagrams and screenshots
7. **JSON Output** - Compatible format with existing doc_scraper.py
### 🎯 Detection Methods
#### Font-Based Detection
Analyzes font properties to find monospace fonts typically used for code:
- Courier, Courier New
- Monaco, Menlo
- Consolas
- DejaVu Sans Mono
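A font-name check along these lines could be as simple as a substring match. In PyMuPDF, `page.get_text("dict")` returns text spans that each carry a `font` name; the sketch below only shows the name test itself, and both `MONOSPACE_HINTS` and `is_monospace_font` are illustrative names rather than the extractor's actual identifiers.

```python
# Lower-case fragments that commonly appear in monospace font names
MONOSPACE_HINTS = ("courier", "mono", "consolas", "monaco", "menlo")


def is_monospace_font(font_name):
    """Return True if the font name looks like a monospace (code) font."""
    name = font_name.lower()
    return any(hint in name for hint in MONOSPACE_HINTS)


for font in ["Courier-New", "DejaVuSansMono", "Helvetica"]:
    print(font, is_monospace_font(font))
```

Matching on `"mono"` covers families such as DejaVu Sans Mono without listing each one, at the cost of occasional false positives.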
#### Indentation-Based Detection
Identifies code blocks by consistent indentation patterns:
- 4 spaces or tabs
- Minimum 2 consecutive lines
- Minimum 20 characters
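The three rules above (4 spaces or a tab, at least 2 consecutive lines, at least 20 characters) can be sketched as follows. This is an assumed reading of the rules, not the actual implementation; `detect_indented_blocks` is a made-up name, and blank lines inside a block end it in this simplified version.

```python
def detect_indented_blocks(text, min_lines=2, min_chars=20):
    """Collect runs of consecutive indented lines that are long enough to be code."""
    blocks, current = [], []

    def flush():
        code = "\n".join(current)
        # Keep the run only if it satisfies both minimums
        if len(current) >= min_lines and len(code.strip()) >= min_chars:
            blocks.append(code)
        current.clear()

    for line in text.splitlines():
        if line.startswith(("    ", "\t")):
            current.append(line)
        else:
            flush()
    flush()  # handle a block that runs to the end of the text
    return blocks


sample = (
    "Intro paragraph\n"
    "    def add(a, b):\n"
    "        return a + b\n"
    "More prose\n"
    "    x = 1\n"
    "End"
)
blocks = detect_indented_blocks(sample)
print(len(blocks))
```

The single indented line `x = 1` is dropped because it fails the two-line minimum, which is how indented prose fragments get filtered out.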
#### Pattern-Based Detection
Uses regex to find common code structures:
- Function definitions (Python, JS, Go, etc.)
- Class definitions
- Import/require statements
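As a hedged illustration of pattern-based detection, the sketch below uses a few regular expressions for function, class, and import constructs. The specific patterns and the names `CODE_PATTERNS` and `find_code_patterns` are assumptions for this example; the real extractor's regexes may differ.

```python
import re

# Illustrative patterns only; each is anchored to the start of a line
CODE_PATTERNS = {
    "function": re.compile(
        r"^\s*(def\s+\w+\s*\(|function\s+\w+\s*\(|func\s+\w+\s*\()", re.MULTILINE
    ),
    "class": re.compile(r"^\s*class\s+\w+", re.MULTILINE),
    "import": re.compile(
        r"^\s*(import\s+\w+|from\s+\w[\w.]*\s+import\s+|(const|let|var)\s+\w+\s*=\s*require\()",
        re.MULTILINE,
    ),
}


def find_code_patterns(text):
    """Return the names of the pattern types that occur in text."""
    return [name for name, pattern in CODE_PATTERNS.items() if pattern.search(text)]


print(find_code_patterns("import os\n\ndef main():\n    pass"))
```

The returned names play the role of the `pattern_type` field; prose that happens to start a line with "import" would still match, which is one reason this method is the least precise of the three.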
### 🔍 Language Detection
Supports detection of 19 programming languages:
- Python, JavaScript, Java, C, C++, C#
- Go, Rust, PHP, Ruby, Swift, Kotlin
- Shell, SQL, HTML, CSS
- JSON, YAML, XML
---
## Installation
### Prerequisites
```bash
pip install PyMuPDF
```
### Verify Installation
```bash
python3 -c "import fitz; print(fitz.__doc__)"
```
---
## Usage
### Basic Usage
```bash
# Extract from PDF (print to stdout)
python3 cli/pdf_extractor_poc.py input.pdf
# Save to JSON file
python3 cli/pdf_extractor_poc.py input.pdf --output result.json
# Verbose mode (shows progress)
python3 cli/pdf_extractor_poc.py input.pdf --verbose
# Pretty-printed JSON
python3 cli/pdf_extractor_poc.py input.pdf --pretty
```
### Examples
```bash
# Extract Python documentation
python3 cli/pdf_extractor_poc.py docs/python_guide.pdf -o python_extracted.json -v
# Extract with verbose and pretty output
python3 cli/pdf_extractor_poc.py manual.pdf -o manual.json -v --pretty
# Quick test (print to screen)
python3 cli/pdf_extractor_poc.py sample.pdf --pretty
```
---
## Output Format
### JSON Structure
```json
{
  "source_file": "input.pdf",
  "metadata": {
    "title": "Documentation Title",
    "author": "Author Name",
    "subject": "Subject",
    "creator": "PDF Creator",
    "producer": "PDF Producer"
  },
  "total_pages": 50,
  "total_chars": 125000,
  "total_code_blocks": 87,
  "total_headings": 45,
  "total_images": 12,
  "languages_detected": {
    "python": 52,
    "javascript": 20,
    "sql": 10,
    "shell": 5
  },
  "pages": [
    {
      "page_number": 1,
      "text": "Plain text content...",
      "markdown": "# Heading\nContent...",
      "headings": [
        {
          "level": "h1",
          "text": "Getting Started"
        }
      ],
      "code_samples": [
        {
          "code": "def hello():\n    print('Hello')",
          "language": "python",
          "detection_method": "font",
          "font": "Courier-New"
        }
      ],
      "images_count": 2,
      "char_count": 2500,
      "code_blocks_count": 3
    }
  ]
}
```
### Page Object
Each page contains:
- `page_number` - 1-indexed page number
- `text` - Plain text content
- `markdown` - Markdown-formatted content
- `headings` - Array of heading objects
- `code_samples` - Array of detected code blocks
- `images_count` - Number of images on page
- `char_count` - Character count
- `code_blocks_count` - Number of code blocks found
### Code Sample Object
Each code sample includes:
- `code` - The actual code text
- `language` - Detected language (or 'unknown')
- `detection_method` - How it was found ('font', 'indent', or 'pattern')
- `font` - Font name (if detected by font method)
- `pattern_type` - Type of pattern (if detected by pattern method)
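To show how the page and code sample objects fit together, the sketch below recomputes per-language counts from the `pages` array. It assumes the top-level `languages_detected` and `total_code_blocks` fields are simple sums over `code_samples`, which the document does not state explicitly; `summarize` is a hypothetical helper.

```python
from collections import Counter


def summarize(result):
    """Recompute code block totals from the per-page code_samples arrays."""
    languages = Counter(
        sample["language"]
        for page in result["pages"]
        for sample in page["code_samples"]
    )
    return {
        "total_code_blocks": sum(languages.values()),
        "languages_detected": dict(languages),
    }


result = {
    "pages": [
        {"page_number": 1, "code_samples": [
            {"code": "def hello(): pass", "language": "python", "detection_method": "font"},
        ]},
        {"page_number": 2, "code_samples": [
            {"code": "SELECT 1", "language": "sql", "detection_method": "pattern"},
            {"code": "import os", "language": "python", "detection_method": "indent"},
        ]},
    ]
}
print(summarize(result))
```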
---
## Technical Details
### Detection Accuracy
**Font-based detection:** ⭐⭐⭐⭐⭐ (Best)
- Highly accurate for well-formatted PDFs
- Relies on proper font usage in source document
- Works with: Technical docs, programming books, API references
**Indent-based detection:** ⭐⭐⭐⭐ (Good)
- Good for structured code blocks
- May capture non-code indented content
- Works with: Tutorials, guides, examples
**Pattern-based detection:** ⭐⭐⭐ (Fair)
- Captures specific code constructs
- May miss complex or unusual code
- Works with: Code snippets, function examples
### Language Detection Accuracy
- **High confidence:** Python, JavaScript, Java, Go, SQL
- **Medium confidence:** C++, Rust, PHP, Ruby, Swift
- **Basic detection:** Shell, JSON, YAML, XML
Detection is based on keyword patterns, not AST parsing.
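Keyword-pattern detection can be pictured as a small scoring loop: each language has a few regexes, and the language with the most matching patterns wins. The patterns and names below (`LANGUAGE_PATTERNS`, `detect_language`) are assumptions chosen for illustration and cover only three languages.

```python
import re

# Illustrative keyword patterns; a real table would cover many more languages
LANGUAGE_PATTERNS = {
    "python": [r"\bdef\s+\w+\s*\(", r"\bimport\s+\w+", r":\s*$"],
    "javascript": [r"\bfunction\s+\w+\s*\(", r"\b(const|let|var)\s+\w+", r"=>"],
    "sql": [r"\bSELECT\b", r"\bFROM\b", r"\bWHERE\b"],
}


def detect_language(code):
    """Return the language whose patterns match most often, or 'unknown'."""
    flags = re.IGNORECASE | re.MULTILINE
    scores = {
        lang: sum(1 for pattern in patterns if re.search(pattern, code, flags))
        for lang, patterns in LANGUAGE_PATTERNS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


print(detect_language("def hello():\n    print('Hello')"))
print(detect_language("SELECT id FROM users WHERE id = 1"))
```

Because scoring only counts keyword hits, short or ambiguous snippets can tie or be misclassified, which is consistent with the confidence tiers listed above.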
### Performance
Tested on various PDF sizes:
- Small (1-10 pages): < 1 second
- Medium (10-100 pages): 1-5 seconds
- Large (100-500 pages): 5-30 seconds
- Very Large (500+ pages): 30+ seconds
Memory usage: ~50-200 MB depending on PDF size and image content.
---
## Limitations
### Current Limitations
1. **No OCR** - Cannot extract text from scanned/image PDFs
2. **No Table Extraction** - Tables are treated as plain text
3. **No Image Extraction** - Only counts images, doesn't extract them
4. **Simple Deduplication** - May miss some duplicate code blocks
5. **No Multi-column Support** - May jumble multi-column layouts
### Known Issues
1. **Code Split Across Pages** - Code blocks spanning pages may be split
2. **Complex Layouts** - May struggle with complex PDF layouts
3. **Non-standard Fonts** - May miss code in non-standard monospace fonts
4. **Unicode Issues** - Some special characters may not be preserved correctly
---
## Comparison with Web Scraper
| Feature | Web Scraper | PDF Extractor POC |
|---------|-------------|-------------------|
| Content source | HTML websites | PDF files |
| Code detection | CSS selectors | Font/indent/pattern |
| Language detection | CSS classes + heuristics | Pattern matching |
| Structure | Excellent | Good |
| Links | Full support | Not supported |
| Images | Referenced | Counted only |
| Categories | Auto-categorized | Not implemented |
| Output format | JSON | JSON (compatible) |
---
## Next Steps (Tasks B1.3-B1.8)
### B1.3: Add PDF Page Detection and Chunking
- Split large PDFs into manageable chunks
- Handle page-spanning code blocks
- Add chapter/section detection
### B1.4: Extract Code Blocks from PDFs
- Improve code block detection accuracy
- Add syntax validation
- Better language detection (use tree-sitter?)
### B1.5: Add PDF Image Extraction
- Extract diagrams as separate files
- Extract screenshots
- OCR support for code in images
### B1.6: Create `pdf_scraper.py` CLI Tool
- Full-featured CLI like `doc_scraper.py`
- Config file support
- Category detection
- Multi-PDF support
### B1.7: Add MCP Tool `scrape_pdf`
- Integrate with MCP server
- Add to existing 9 MCP tools
- Test with Claude Code
### B1.8: Create PDF Config Format
- Define JSON config for PDF sources
- Similar to web scraper configs
- Support multiple PDFs per skill
---
## Testing
### Manual Testing
1. **Create test PDF** (or use existing PDF documentation)
2. **Run extractor:**
```bash
python3 cli/pdf_extractor_poc.py test.pdf -o test_result.json -v --pretty
```
3. **Verify output:**
- Check `total_code_blocks` > 0
- Verify `languages_detected` includes expected languages
- Inspect `code_samples` for accuracy
### Test with Real Documentation
Recommended test PDFs:
- Python documentation (python.org)
- Django documentation
- PostgreSQL manual
- Any programming language reference
### Expected Results
Good PDF (well-formatted with monospace code):
- Detection rate: 80-95%
- Language accuracy: 85-95%
- False positives: < 5%
Poor PDF (scanned or badly formatted):
- Detection rate: 20-50%
- Language accuracy: 60-80%
- False positives: 10-30%
---
## Code Examples
### Using PDFExtractor Class Directly
```python
from cli.pdf_extractor_poc import PDFExtractor
# Create extractor
extractor = PDFExtractor('docs/manual.pdf', verbose=True)
# Extract all pages
result = extractor.extract_all()
# Access data
print(f"Total pages: {result['total_pages']}")
print(f"Code blocks: {result['total_code_blocks']}")
print(f"Languages: {result['languages_detected']}")
# Iterate pages
for page in result['pages']:
    print(f"\nPage {page['page_number']}:")
    print(f" Code blocks: {page['code_blocks_count']}")
    for code in page['code_samples']:
        print(f" - {code['language']}: {len(code['code'])} chars")
```
### Custom Language Detection
```python
from cli.pdf_extractor_poc import PDFExtractor
extractor = PDFExtractor('input.pdf')
# Override language detection
def custom_detect(code):
    if 'SELECT' in code.upper():
        return 'sql'
    return extractor.detect_language_from_code(code)
# Use in extraction
# (requires modifying the class to support custom detection)
```
---
## Contributing
### Adding New Languages
To add language detection for a new language, edit `detect_language_from_code()`:
```python
patterns = {
# ... existing languages ...
'newlang': [r'pattern1', r'pattern2', r'pattern3'],
}
```
### Adding Detection Methods
To add a new detection method, create a method like:
```python
def detect_code_blocks_by_newmethod(self, page):
    """Detect code using new method"""
    code_blocks = []
    # ... your detection logic ...
    return code_blocks
```
Then add it to `extract_page()`:
```python
newmethod_code_blocks = self.detect_code_blocks_by_newmethod(page)
all_code_blocks = font_code_blocks + indent_code_blocks + pattern_code_blocks + newmethod_code_blocks
```
---
## Conclusion
This POC successfully demonstrates:
- ✅ PyMuPDF can extract text from PDF documentation
- ✅ Multiple detection methods can identify code blocks
- ✅ Language detection works for common languages
- ✅ JSON output is compatible with existing doc_scraper.py
- ✅ Performance is acceptable for typical documentation PDFs
**Ready for B1.3:** The foundation is solid. Next step is adding page chunking and handling large PDFs.
---
**POC Completed:** October 21, 2025
**Next Task:** B1.3 - Add PDF page detection and chunking

View File

@ -0,0 +1,553 @@
# PDF Image Extraction (Task B1.5)
**Status:** ✅ Completed
**Date:** October 21, 2025
**Task:** B1.5 - Add PDF image extraction (diagrams, screenshots)
---
## Overview
Task B1.5 adds the ability to extract images (diagrams, screenshots, charts) from PDF documentation and save them as separate files. This is essential for preserving visual documentation elements in skills.
## New Features
### ✅ 1. Image Extraction to Files
Extract embedded images from PDFs and save them to disk:
```bash
# Extract images along with text
python3 cli/pdf_extractor_poc.py manual.pdf --extract-images
# Specify output directory
python3 cli/pdf_extractor_poc.py manual.pdf --extract-images --image-dir assets/images/
# Filter small images (icons, bullets)
python3 cli/pdf_extractor_poc.py manual.pdf --extract-images --min-image-size 200
```
### ✅ 2. Size-Based Filtering
Automatically filter out small images (icons, bullets, decorations):
- **Default threshold:** 100x100 pixels
- **Configurable:** `--min-image-size`
- **Purpose:** Focus on meaningful diagrams and screenshots
### ✅ 3. Image Metadata
Each extracted image includes comprehensive metadata:
```json
{
  "filename": "manual_page5_img1.png",
  "path": "output/manual_images/manual_page5_img1.png",
  "page_number": 5,
  "width": 800,
  "height": 600,
  "format": "png",
  "size_bytes": 45821,
  "xref": 42
}
```
### ✅ 4. Automatic Directory Creation
Images are automatically organized:
- **Default:** `output/{pdf_name}_images/`
- **Naming:** `{pdf_name}_page{N}_img{M}.{ext}`
- **Formats:** PNG, JPEG, GIF, BMP, etc.
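The directory and naming rules above can be expressed as a small path builder. This is a sketch under the stated conventions; the function name `image_output_path` and its parameters are hypothetical.

```python
from pathlib import Path


def image_output_path(pdf_path, page_number, image_number, ext, image_dir=None):
    """Build the output path for an extracted image using the naming rules above."""
    stem = Path(pdf_path).stem  # "docs/manual.pdf" -> "manual"
    # Default directory is output/{pdf_name}_images/ unless one is given
    directory = Path(image_dir) if image_dir else Path("output") / f"{stem}_images"
    return directory / f"{stem}_page{page_number}_img{image_number}.{ext}"


print(image_output_path("docs/manual.pdf", 5, 1, "png").as_posix())
print(image_output_path("manual.pdf", 2, 1, "jpeg", image_dir="docs/images/").as_posix())
```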
---
## Usage Examples
### Basic Image Extraction
```bash
# Extract all images from PDF
python3 cli/pdf_extractor_poc.py tutorial.pdf --extract-images -v
```
**Output:**
```
📄 Extracting from: tutorial.pdf
Pages: 50
Metadata: {...}
Image directory: output/tutorial_images
Page 1: 2500 chars, 3 code blocks, 2 headings, 0 images
Page 2: 1800 chars, 1 code blocks, 1 headings, 2 images
Extracted image: tutorial_page2_img1.png (800x600)
Extracted image: tutorial_page2_img2.jpeg (1024x768)
...
✅ Extraction complete:
Images found: 45
Images extracted: 32
Image directory: output/tutorial_images
```
### Custom Image Directory
```bash
# Save images to specific directory
python3 cli/pdf_extractor_poc.py manual.pdf --extract-images --image-dir docs/images/
```
Result: Images saved to `docs/images/manual_page*_img*.{ext}`
### Filter Small Images
```bash
# Only extract images >= 200x200 pixels
python3 cli/pdf_extractor_poc.py guide.pdf --extract-images --min-image-size 200 -v
```
**Verbose output shows filtering:**
```
Page 5: 3200 chars, 4 code blocks, 3 headings, 3 images
Skipping small image: 32x32
Skipping small image: 64x48
Extracted image: guide_page5_img3.png (1200x800)
```
### Complete Extraction Workflow
```bash
# Extract everything: text, code, images
python3 cli/pdf_extractor_poc.py documentation.pdf \
--extract-images \
--min-image-size 150 \
--min-quality 6.0 \
--chunk-size 20 \
--output documentation.json \
--verbose \
--pretty
```
---
## Output Format
### Enhanced JSON Structure
The output now includes image extraction data:
```json
{
  "source_file": "manual.pdf",
  "total_pages": 50,
  "total_images": 45,
  "total_extracted_images": 32,
  "image_directory": "output/manual_images",
  "extracted_images": [
    {
      "filename": "manual_page2_img1.png",
      "path": "output/manual_images/manual_page2_img1.png",
      "page_number": 2,
      "width": 800,
      "height": 600,
      "format": "png",
      "size_bytes": 45821,
      "xref": 42
    }
  ],
  "pages": [
    {
      "page_number": 1,
      "images_count": 3,
      "extracted_images": [
        {
          "filename": "manual_page1_img1.jpeg",
          "path": "output/manual_images/manual_page1_img1.jpeg",
          "width": 1024,
          "height": 768,
          "format": "jpeg",
          "size_bytes": 87543
        }
      ]
    }
  ]
}
```
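As a rough sketch of how this output could be consumed, the snippet below totals the top-level `extracted_images` list. It assumes only the field names shown in the JSON above; the function name `summarize_images` and the inline sample data are illustrative.

```python
import json
from collections import Counter

# Inline sample shaped like the documented output (values are made up)
SAMPLE = json.loads("""
{
  "source_file": "manual.pdf",
  "extracted_images": [
    {"filename": "manual_page2_img1.png", "page_number": 2, "size_bytes": 45821},
    {"filename": "manual_page2_img2.jpeg", "page_number": 2, "size_bytes": 87543}
  ]
}
""")


def summarize_images(result):
    """Count extracted images and total their size, grouped by page."""
    images = result.get("extracted_images", [])
    per_page = Counter(img["page_number"] for img in images)
    total_bytes = sum(img["size_bytes"] for img in images)
    return {"count": len(images), "total_bytes": total_bytes, "per_page": dict(per_page)}


print(summarize_images(SAMPLE))
```

A summary like this is a quick way to check disk usage before packaging a skill.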
### File System Layout
```
output/
├── manual.json                     # Extraction results
└── manual_images/                  # Image directory
    ├── manual_page2_img1.png       # Page 2, Image 1
    ├── manual_page2_img2.jpeg      # Page 2, Image 2
    ├── manual_page5_img1.png       # Page 5, Image 1
    └── ...
```
---
## Technical Implementation
### Image Extraction Method
```python
from pathlib import Path


def extract_images_from_page(self, page, page_num):
    """Extract images from PDF page and save to disk"""
    extracted = []
    image_list = page.get_images()
    # Assumption: the extractor stores the source PDF path as self.pdf_path
    pdf_basename = Path(self.pdf_path).stem
    for img_index, img in enumerate(image_list):
        # Get image data from PDF
        xref = img[0]
        base_image = self.doc.extract_image(xref)
        image_bytes = base_image["image"]
        image_ext = base_image["ext"]
        width = base_image.get("width", 0)
        height = base_image.get("height", 0)
        # Filter small images (both dimensions must meet the threshold)
        if width < self.min_image_size or height < self.min_image_size:
            continue
        # Generate filename (page and image numbers are 1-based)
        image_filename = f"{pdf_basename}_page{page_num+1}_img{img_index+1}.{image_ext}"
        image_path = Path(self.image_dir) / image_filename
        # Save image
        with open(image_path, "wb") as f:
            f.write(image_bytes)
        # Store metadata
        image_info = {
            'filename': image_filename,
            'path': str(image_path),
            'page_number': page_num + 1,
            'width': width,
            'height': height,
            'format': image_ext,
            'size_bytes': len(image_bytes),
        }
        extracted.append(image_info)
    return extracted
```
---
## Performance
### Extraction Speed
| PDF Size | Images | Extraction Time | Overhead |
|----------|--------|-----------------|----------|
| Small (10 pages, 5 images) | 5 | +200ms | ~10% |
| Medium (100 pages, 50 images) | 50 | +2s | ~15% |
| Large (500 pages, 200 images) | 200 | +8s | ~20% |
**Note:** Image extraction adds 10-20% overhead depending on image count and size.
### Storage Requirements
- **PNG images:** ~10-500 KB each (diagrams)
- **JPEG images:** ~50-2000 KB each (screenshots)
- **Typical documentation (100 pages):** ~50-200 MB total
---
## Supported Image Formats
PyMuPDF automatically handles format detection and extraction:
- ✅ PNG (lossless, best for diagrams)
- ✅ JPEG (lossy, best for photos)
- ✅ GIF (animated, rare in PDFs)
- ✅ BMP (uncompressed)
- ✅ TIFF (high quality)
Images are extracted in their original format.
---
## Filtering Strategy
### Why Filter Small Images?
PDFs often contain:
- **Icons:** 16x16, 32x32 (UI elements)
- **Bullets:** 8x8, 12x12 (decorative)
- **Logos:** 50x50, 100x100 (branding)
These are usually not useful for documentation skills.
### Recommended Thresholds
| Use Case | Min Size | Reasoning |
|----------|----------|-----------|
| **General docs** | 100x100 | Filters icons, keeps diagrams |
| **Technical diagrams** | 200x200 | Only meaningful charts |
| **Screenshots** | 300x300 | Only full-size screenshots |
| **All images** | 0 | No filtering |
**Set with:** `--min-image-size N`
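The threshold applies to both dimensions: an image is skipped when either its width or its height falls below the minimum. A minimal sketch of that check, mirroring the condition in the extraction method (the standalone function name `keep_image` is illustrative):

```python
def keep_image(width, height, min_image_size=100):
    """Return True when both dimensions reach the minimum size."""
    return width >= min_image_size and height >= min_image_size


# Sizes taken from the verbose output example, plus one wide-but-short image
for size in [(32, 32), (64, 48), (1200, 800), (300, 150)]:
    print(size, keep_image(*size, min_image_size=200))
```

Because both dimensions are tested, wide banners and tall sidebars are dropped even when one side is large, and `--min-image-size 0` keeps everything.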
---
## Integration with Skill Seeker
### Future Workflow (Task B1.6+)
When building PDF-based skills, images will be:
1. **Extracted** from PDF documentation
2. **Organized** into skill's `assets/` directory
3. **Referenced** in SKILL.md and reference files
4. **Packaged** in final .zip file
**Example:**
```markdown
# API Architecture
See diagram below for the complete API flow:
![API Flow](assets/images/api_flow.png)
The diagram shows...
```
---
## Limitations
### Current Limitations
1. **No OCR**
   - Cannot extract text from images
   - Code screenshots are not parsed
   - Future: Add OCR support for code in images
2. **No Image Analysis**
   - Cannot detect diagram types (flowchart, UML, etc.)
   - Cannot extract captions
   - Future: Add AI-based image classification
3. **No Deduplication**
   - Same image on multiple pages extracted multiple times
   - Future: Add image hash-based deduplication
4. **Format Preservation**
   - Images saved in original format (no conversion)
   - No optimization or compression
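The hash-based deduplication listed as future work under item 3 could look roughly like the sketch below. It is not part of the current extractor; the function name and the `(filename, bytes)` input shape are assumptions made for illustration.

```python
import hashlib


def deduplicate_images(images):
    """Keep the first occurrence of each distinct image payload.

    `images` is an iterable of (filename, image_bytes) pairs (assumed shape).
    """
    seen_digests = set()
    unique = []
    for filename, data in images:
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen_digests:
            continue  # identical bytes were already kept under another name
        seen_digests.add(digest)
        unique.append((filename, data))
    return unique


logo = b"\x89PNG-logo-bytes"
pages = [("doc_page1_img1.png", logo), ("doc_page2_img1.png", logo), ("doc_page2_img2.png", b"chart")]
print([name for name, _ in deduplicate_images(pages)])
```

Hashing the raw bytes only catches exact duplicates, such as a logo repeated on every page; re-encoded or resized copies would need a perceptual hash instead.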
### Known Issues
1. **Vector Graphics**
   - Some PDFs use vector graphics (not images)
   - These are not extracted (rendered as part of page)
   - Workaround: Use PDF-to-image tools first
2. **Embedded vs Referenced**
   - Only embedded images are extracted
   - External image references are not followed
3. **Image Quality**
   - Quality depends on PDF source
   - Low-res source = low-res output
---
## Troubleshooting
### No Images Extracted
**Problem:** `total_extracted_images: 0` but PDF has visible images
**Possible causes:**
1. Images are vector graphics (not raster)
2. Images smaller than `--min-image-size` threshold
3. Images are page backgrounds (not embedded images)
**Solution:**
```bash
# Try with no size filter
python3 cli/pdf_extractor_poc.py input.pdf --extract-images --min-image-size 0 -v
```
### Permission Errors
**Problem:** `PermissionError: [Errno 13] Permission denied`
**Solution:**
```bash
# Ensure output directory is writable
mkdir -p output/images
chmod 755 output/images
# Or specify different directory
python3 cli/pdf_extractor_poc.py input.pdf --extract-images --image-dir ~/my_images/
```
### Disk Space
**Problem:** Running out of disk space
**Solution:**
```bash
# Check PDF size first
du -h input.pdf
# Estimate: ~50-200 MB per 100 pages with images
# Use higher min-image-size to extract fewer images
python3 cli/pdf_extractor_poc.py input.pdf --extract-images --min-image-size 300
```
---
## Examples
### Extract Diagram-Heavy Documentation
```bash
# Architecture documentation with many diagrams
python3 cli/pdf_extractor_poc.py architecture.pdf \
--extract-images \
--min-image-size 250 \
--image-dir docs/diagrams/ \
-v
```
**Result:** High-quality diagrams extracted, icons filtered out.
### Tutorial with Screenshots
```bash
# Tutorial with step-by-step screenshots
python3 cli/pdf_extractor_poc.py tutorial.pdf \
--extract-images \
--min-image-size 400 \
--image-dir tutorial_screenshots/ \
-v
```
**Result:** Full screenshots extracted, UI icons ignored.
### API Reference with Small Charts
```bash
# API docs with various image sizes
python3 cli/pdf_extractor_poc.py api_reference.pdf \
--extract-images \
--min-image-size 150 \
-o api.json \
--pretty
```
**Result:** Charts and graphs extracted, small icons filtered.
---
## Command-Line Reference
### Image Extraction Options
```
--extract-images
Enable image extraction to files
Default: disabled
--image-dir PATH
Directory to save extracted images
Default: output/{pdf_name}_images/
--min-image-size PIXELS
Minimum image width and height in pixels (an image is skipped if either is smaller)
Filters out icons and small decorations
Default: 100
```
### Complete Example
```bash
python3 cli/pdf_extractor_poc.py manual.pdf \
--extract-images \
--image-dir assets/images/ \
--min-image-size 200 \
--min-quality 7.0 \
--chunk-size 15 \
--output manual.json \
--verbose \
--pretty
```
---
## Comparison: Before vs After
| Feature | Before (B1.4) | After (B1.5) |
|---------|---------------|--------------|
| Image detection | ✅ Count only | ✅ Count + Extract |
| Image files | ❌ Not saved | ✅ Saved to disk |
| Image metadata | ❌ None | ✅ Full metadata |
| Size filtering | ❌ None | ✅ Configurable |
| Directory organization | ❌ N/A | ✅ Automatic |
| Format support | ❌ N/A | ✅ All formats |
---
## Next Steps
### Task B1.6: Full PDF Scraper CLI
The image extraction feature will be integrated into the full PDF scraper:
```bash
# Future: Full PDF scraper with images
python3 cli/pdf_scraper.py \
--config configs/manual_pdf.json \
--extract-images \
--enhance-local
```
### Task B1.7: MCP Tool Integration
Images will be available through MCP:
```python
# Future: MCP tool
result = mcp.scrape_pdf(
pdf_path="manual.pdf",
extract_images=True,
min_image_size=200
)
```
---
## Conclusion
Task B1.5 successfully implements:
- ✅ Image extraction from PDF pages
- ✅ Automatic file saving with metadata
- ✅ Size-based filtering (configurable)
- ✅ Organized directory structure
- ✅ Multiple format support
**Impact:**
- Preserves visual documentation
- Essential for diagram-heavy docs
- Improves skill completeness
**Performance:** 10-20% overhead (acceptable)
**Compatibility:** Backward compatible (images optional)
**Ready for B1.6:** Full PDF scraper CLI tool
---
**Task Completed:** October 21, 2025
**Next Task:** B1.6 - Create `pdf_scraper.py` CLI tool

@ -0,0 +1,437 @@
# PDF Scraping MCP Tool (Task B1.7)
**Status:** ✅ Completed
**Date:** October 21, 2025
**Task:** B1.7 - Add MCP tool `scrape_pdf`
---
## Overview
Task B1.7 adds the `scrape_pdf` MCP tool to the Skill Seeker MCP server, making PDF documentation scraping available through the Model Context Protocol. This allows Claude Code and other MCP clients to scrape PDF documentation directly.
## Features
### ✅ MCP Tool Integration
- **Tool name:** `scrape_pdf`
- **Description:** Scrape PDF documentation and build Claude skill
- **Supports:** All three usage modes (config, direct, from-json)
- **Integration:** Uses `cli/pdf_scraper.py` backend
### ✅ Three Usage Modes
1. **Config File Mode** - Use PDF config JSON
2. **Direct PDF Mode** - Quick conversion from PDF file
3. **From JSON Mode** - Build from pre-extracted data
---
## Usage
### Mode 1: Config File
```python
# Through MCP
result = await mcp.call_tool("scrape_pdf", {
"config_path": "configs/manual_pdf.json"
})
```
**Example config** (`configs/manual_pdf.json`):
```json
{
"name": "mymanual",
"description": "My Manual documentation",
"pdf_path": "docs/manual.pdf",
"extract_options": {
"chunk_size": 10,
"min_quality": 6.0,
"extract_images": true,
"min_image_size": 150
},
"categories": {
"getting_started": ["introduction", "setup"],
"api": ["api", "reference"],
"tutorial": ["tutorial", "example"]
}
}
```
**Output:**
```
🔍 Extracting from PDF: docs/manual.pdf
📄 Extracting from: docs/manual.pdf
Pages: 150
...
✅ Extraction complete
🏗️ Building skill: mymanual
📋 Categorizing content...
✅ Created 3 categories
📝 Generating reference files...
Generated: output/mymanual/references/getting_started.md
Generated: output/mymanual/references/api.md
Generated: output/mymanual/references/tutorial.md
✅ Skill built successfully: output/mymanual/
📦 Next step: Package with: python3 cli/package_skill.py output/mymanual/
```
### Mode 2: Direct PDF
```python
# Through MCP
result = await mcp.call_tool("scrape_pdf", {
"pdf_path": "manual.pdf",
"name": "mymanual",
"description": "My Manual Docs"
})
```
**Uses default settings:**
- Chunk size: 10
- Min quality: 5.0
- Extract images: true
- Chapter-based categorization
### Mode 3: From Extracted JSON
```python
# Step 1: Extract to JSON (separate tool or CLI)
# python3 cli/pdf_extractor_poc.py manual.pdf -o output/manual_extracted.json
# Step 2: Build skill from JSON via MCP
result = await mcp.call_tool("scrape_pdf", {
"from_json": "output/manual_extracted.json"
})
```
**Benefits:**
- Separate extraction and building
- Fast iteration on skill structure
- No re-extraction needed
---
## MCP Tool Definition
### Input Schema
```json
{
"name": "scrape_pdf",
"description": "Scrape PDF documentation and build Claude skill. Extracts text, code, and images from PDF files (NEW in B1.7).",
"inputSchema": {
"type": "object",
"properties": {
"config_path": {
"type": "string",
"description": "Path to PDF config JSON file (e.g., configs/manual_pdf.json)"
},
"pdf_path": {
"type": "string",
"description": "Direct PDF path (alternative to config_path)"
},
"name": {
"type": "string",
"description": "Skill name (required with pdf_path)"
},
"description": {
"type": "string",
"description": "Skill description (optional)"
},
"from_json": {
"type": "string",
"description": "Build from extracted JSON file (e.g., output/manual_extracted.json)"
}
},
"required": []
}
}
```
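Because `required` is empty, the schema alone does not force a caller to pick a mode; the choice is resolved at call time. The sketch below shows that resolution in isolation, following the same precedence as the tool handler (config, then direct PDF, then JSON). The function name `resolve_mode` and the returned labels are illustrative, not part of the server.

```python
def resolve_mode(args):
    """Return which scrape_pdf mode a set of arguments selects."""
    if args.get("config_path"):
        return "config"
    if args.get("pdf_path") and args.get("name"):
        return "direct"
    if args.get("from_json"):
        return "from_json"
    raise ValueError("Must specify config_path, pdf_path + name, or from_json")


print(resolve_mode({"config_path": "configs/manual_pdf.json"}))
print(resolve_mode({"pdf_path": "manual.pdf", "name": "mymanual"}))
print(resolve_mode({"from_json": "output/manual_extracted.json"}))
```

Note that `pdf_path` without `name` does not select direct mode, so such a call falls through to the error unless `from_json` is also set.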
### Return Format
Returns `TextContent` with:
- Success: stdout from `pdf_scraper.py`
- Failure: stderr + stdout for debugging
---
## Implementation
### MCP Server Changes
**Location:** `skill_seeker_mcp/server.py`
**Changes:**
1. Added `scrape_pdf` to `list_tools()` (lines 220-249)
2. Added handler in `call_tool()` (lines 276-277)
3. Implemented `scrape_pdf_tool()` function (lines 591-625)
### Code Implementation
```python
async def scrape_pdf_tool(args: dict) -> list[TextContent]:
    """Scrape PDF documentation and build skill (NEW in B1.7)"""
    config_path = args.get("config_path")
    pdf_path = args.get("pdf_path")
    name = args.get("name")
    description = args.get("description")
    from_json = args.get("from_json")
    # Build command
    cmd = [sys.executable, str(CLI_DIR / "pdf_scraper.py")]
    # Mode 1: Config file
    if config_path:
        cmd.extend(["--config", config_path])
    # Mode 2: Direct PDF
    elif pdf_path and name:
        cmd.extend(["--pdf", pdf_path, "--name", name])
        if description:
            cmd.extend(["--description", description])
    # Mode 3: From JSON
    elif from_json:
        cmd.extend(["--from-json", from_json])
    else:
        return [TextContent(type="text", text="❌ Error: Must specify --config, --pdf + --name, or --from-json")]
    # Run pdf_scraper.py
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return [TextContent(type="text", text=result.stdout)]
    else:
        return [TextContent(type="text", text=f"Error: {result.stderr}\n\n{result.stdout}")]
```
---
## Integration with MCP Workflow
### Complete Workflow Through MCP
```python
import os

# 1. Create PDF config (optional - can use direct mode)
config_result = await mcp.call_tool("generate_config", {
    "name": "api_manual",
    "url": "N/A",  # Not used for PDF
    "description": "API Manual from PDF"
})
# 2. Scrape PDF
scrape_result = await mcp.call_tool("scrape_pdf", {
    "pdf_path": "docs/api_manual.pdf",
    "name": "api_manual",
    "description": "API Manual Documentation"
})
# 3. Package skill
package_result = await mcp.call_tool("package_skill", {
    "skill_dir": "output/api_manual/",
    "auto_upload": False  # Set True to upload automatically when ANTHROPIC_API_KEY is set
})
# 4. Upload manually (needed here because auto_upload is off in step 3)
if "ANTHROPIC_API_KEY" in os.environ:
    upload_result = await mcp.call_tool("upload_skill", {
        "skill_zip": "output/api_manual.zip"
    })
```
### Combined with Web Scraping
```python
# Scrape web documentation
web_result = await mcp.call_tool("scrape_docs", {
"config_path": "configs/framework.json"
})
# Scrape PDF supplement
pdf_result = await mcp.call_tool("scrape_pdf", {
"pdf_path": "docs/framework_api.pdf",
"name": "framework_pdf"
})
# Package both
await mcp.call_tool("package_skill", {"skill_dir": "output/framework/"})
await mcp.call_tool("package_skill", {"skill_dir": "output/framework_pdf/"})
```
---
## Error Handling
### Common Errors
**Error 1: Missing required parameters**
```
❌ Error: Must specify --config, --pdf + --name, or --from-json
```
**Solution:** Provide one of the three modes
**Error 2: PDF file not found**
```
Error: [Errno 2] No such file or directory: 'manual.pdf'
```
**Solution:** Check PDF path is correct
**Error 3: PyMuPDF not installed**
```
ERROR: PyMuPDF not installed
Install with: pip install PyMuPDF
```
**Solution:** Install PyMuPDF: `pip install PyMuPDF`
**Error 4: Invalid JSON config**
```
Error: json.decoder.JSONDecodeError: Expecting value: line 1 column 1
```
**Solution:** Check config file is valid JSON
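Errors 1 and 4 can be caught before the tool is invoked. The following is a minimal pre-flight check, assuming `name` and `pdf_path` are the only required config fields, as the config format documents; the helper name `check_pdf_config` is hypothetical.

```python
import json

REQUIRED_FIELDS = ("name", "pdf_path")


def check_pdf_config(text):
    """Return a list of problems found in a PDF config string (empty if none)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} (line {exc.lineno}, column {exc.colno})"]
    if not isinstance(config, dict):
        return ["config must be a JSON object"]
    return [f"missing required field: {field}" for field in REQUIRED_FIELDS if field not in config]


print(check_pdf_config(""))
print(check_pdf_config('{"name": "mymanual"}'))
print(check_pdf_config('{"name": "mymanual", "pdf_path": "docs/manual.pdf"}'))
```

An empty or truncated file produces the same "Expecting value" failure shown in Error 4, so checking the file contents first usually pinpoints the cause.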
---
## Testing
### Test MCP Tool
```bash
# 1. Start MCP server
python3 skill_seeker_mcp/server.py
# 2. Test with MCP client or via Claude Code
# 3. Verify tool is listed
# Should see "scrape_pdf" in available tools
```
### Test All Modes
**Mode 1: Config**
```python
result = await mcp.call_tool("scrape_pdf", {
"config_path": "configs/example_pdf.json"
})
assert "✅ Skill built successfully" in result[0].text
```
**Mode 2: Direct**
```python
result = await mcp.call_tool("scrape_pdf", {
"pdf_path": "test.pdf",
"name": "test_skill"
})
assert "✅ Skill built successfully" in result[0].text
```
**Mode 3: From JSON**
```python
# First extract (check=True fails fast if extraction breaks)
import subprocess
subprocess.run(["python3", "cli/pdf_extractor_poc.py", "test.pdf", "-o", "test.json"], check=True)
# Then build via MCP
result = await mcp.call_tool("scrape_pdf", {
"from_json": "test.json"
})
assert "✅ Skill built successfully" in result[0].text
```
---
## Comparison with Other MCP Tools
| Tool | Input | Output | Use Case |
|------|-------|--------|----------|
| `scrape_docs` | HTML URL | Skill | Web documentation |
| `scrape_pdf` | PDF file | Skill | PDF documentation |
| `generate_config` | URL | Config | Create web config |
| `package_skill` | Skill dir | .zip | Package for upload |
| `upload_skill` | .zip file | Upload | Send to Claude |
---
## Performance
### MCP Tool Overhead
- **MCP overhead:** ~50-100ms
- **Extraction time:** Same as CLI (15s-5m depending on PDF)
- **Building time:** Same as CLI (5s-45s)
**Total:** MCP adds negligible overhead (<1%)
### Async Execution
Although the handler is declared `async`, it runs `pdf_scraper.py` synchronously via `subprocess.run()`, which blocks the server's event loop until the scraper exits. For long-running PDFs:
- Client waits for completion
- No progress updates during extraction
- Consider using `--from-json` mode for faster iteration
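One way to keep the event loop responsive would be to launch the scraper with `asyncio.create_subprocess_exec` instead of `subprocess.run`. This is a sketch of that alternative, not the current server code; `run_command` is a hypothetical helper, and the demo runs a trivial Python command in place of `pdf_scraper.py`.

```python
import asyncio
import sys


async def run_command(cmd):
    """Run a command without blocking the event loop; return (code, stdout, stderr)."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await proc.communicate()
    return proc.returncode, stdout.decode(), stderr.decode()


# Stand-in for [sys.executable, "cli/pdf_scraper.py", "--config", ...]
code, out, err = asyncio.run(run_command([sys.executable, "-c", "print('ok')"]))
print(code, out.strip())
```

The client still waits for the final result, but other requests handled by the same server are no longer stalled while a large PDF is processed.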
---
## Future Enhancements
### Potential Improvements
1. **Async Extraction**
   - Stream progress updates to client
   - Allow cancellation
   - Background processing
2. **Batch Processing**
   - Process multiple PDFs in parallel
   - Merge into single skill
   - Shared categories
3. **Enhanced Options**
   - Pass all extraction options through MCP
   - Dynamic quality threshold
   - Image filter controls
4. **Status Checking**
   - Query extraction status
   - Get progress percentage
   - Estimate time remaining
---
## Conclusion
Task B1.7 successfully implements:
- ✅ MCP tool `scrape_pdf`
- ✅ Three usage modes (config, direct, from-json)
- ✅ Integration with MCP server
- ✅ Error handling
- ✅ Compatible with existing MCP workflow
**Impact:**
- PDF scraping available through MCP
- Seamless integration with Claude Code
- Unified workflow for web + PDF documentation
- 10th MCP tool in Skill Seeker
**Total MCP Tools:** 10
1. generate_config
2. estimate_pages
3. scrape_docs
4. package_skill
5. upload_skill
6. list_configs
7. validate_config
8. split_config
9. generate_router
10. **scrape_pdf** (NEW)
---
**Task Completed:** October 21, 2025
**B1 Group Complete:** All 8 tasks (B1.1-B1.8) finished!
**Next:** Task group B2 (Microsoft Word .docx support)

@ -0,0 +1,491 @@
# PDF Parsing Libraries Research (Task B1.1)
**Date:** October 21, 2025
**Task:** B1.1 - Research PDF parsing libraries
**Purpose:** Evaluate Python libraries for extracting text and code from PDF documentation
---
## Executive Summary
After comprehensive research, **PyMuPDF (fitz)** is recommended as the primary library for Skill Seeker's PDF parsing needs, with **pdfplumber** as a secondary option for complex table extraction.
### Quick Recommendation:
- **Primary Choice:** PyMuPDF (fitz) - Fast, comprehensive, well-maintained
- **Secondary/Fallback:** pdfplumber - Better for tables, slower but more precise
- **Avoid:** PyPDF2 (deprecated, merged into pypdf)
---
## Library Comparison Matrix
| Library | Speed | Text Quality | Code Detection | Tables | Maintenance | License |
|---------|-------|--------------|----------------|--------|-------------|---------|
| **PyMuPDF** | ⚡⚡⚡⚡ Very fast (42ms) | High | Excellent | Good | Active | AGPL/Commercial |
| **pdfplumber** | ⚡⚡ Slower (2.5s) | Very High | Excellent | Excellent | Active | MIT |
| **pypdf** | ⚡⚡⚡ Fast | Medium | Good | Basic | Active | BSD |
| **pdfminer.six** | ⚡ Slow | Very High | Good | Medium | Active | MIT |
| **pypdfium2** | ⚡⚡⚡⚡⚡ Fastest (3ms) | Medium | Good | Basic | Active | Apache-2.0 |
---
## Detailed Analysis
### 1. PyMuPDF (fitz) ⭐ RECOMMENDED
**Performance:** 42 milliseconds (60x faster than pdfminer.six)
**Installation:**
```bash
pip install PyMuPDF
```
**Pros:**
- ✅ Extremely fast (C-based MuPDF backend)
- ✅ Comprehensive features (text, images, tables, metadata)
- ✅ Markdown output via the companion `pymupdf4llm` package (`page.get_text()` itself has no Markdown mode)
- ✅ Can extract images and diagrams
- ✅ Well-documented and actively maintained
- ✅ Handles complex layouts well
**Cons:**
- ⚠️ AGPL license (requires commercial license for proprietary projects)
- ⚠️ Requires MuPDF binary installation (handled by pip)
- ⚠️ Slightly larger dependency footprint
**Code Example:**
```python
import fitz  # PyMuPDF
import pymupdf4llm  # companion package: pip install pymupdf4llm


# Extract text from entire PDF
def extract_pdf_text(pdf_path):
    doc = fitz.open(pdf_path)
    text = ''
    for page in doc:
        text += page.get_text()
    doc.close()
    return text


# Extract text from single page (page_num is 0-based)
def extract_page_text(pdf_path, page_num):
    doc = fitz.open(pdf_path)
    page = doc.load_page(page_num)
    text = page.get_text()
    doc.close()
    return text


# Extract with markdown formatting.
# page.get_text() has no "markdown" option, so the conversion
# is delegated to pymupdf4llm.
def extract_as_markdown(pdf_path):
    return pymupdf4llm.to_markdown(pdf_path)
```
**Use Cases for Skill Seeker:**
- Fast extraction of code examples from PDF docs
- Preserving formatting for code blocks
- Extracting diagrams and screenshots
- High-volume documentation scraping
---
### 2. pdfplumber ⭐ RECOMMENDED (for tables)
**Performance:** ~2.5 seconds (slower but more precise)
**Installation:**
```bash
pip install pdfplumber
```
**Pros:**
- ✅ MIT license (fully open source)
- ✅ Exceptional table extraction
- ✅ Visual debugging tool
- ✅ Precise layout preservation
- ✅ Built on pdfminer (proven text extraction)
- ✅ No binary dependencies
**Cons:**
- ⚠️ Slower than PyMuPDF
- ⚠️ Higher memory usage for large PDFs
- ⚠️ Requires more configuration for optimal results
**Code Example:**
```python
import pdfplumber


# Extract text from PDF
def extract_with_pdfplumber(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        text = ''
        for page in pdf.pages:
            # Guard against pages that yield no text
            text += page.extract_text() or ''
    return text


# Extract tables
def extract_tables(pdf_path):
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables()
            tables.extend(page_tables)
    return tables


# Extract specific region (for code blocks); bbox is (x0, top, x1, bottom)
def extract_region(pdf_path, page_num, bbox):
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_num]
        cropped = page.crop(bbox)
        return cropped.extract_text()
```
**Use Cases for Skill Seeker:**
- Extracting API reference tables from PDFs
- Precise code block extraction with layout
- Documentation with complex table structures
---
### 3. pypdf (formerly PyPDF2)
**Performance:** Moderate (~0.1 s per page in the benchmarks below)
**Installation:**
```bash
pip install pypdf
```
**Pros:**
- ✅ BSD license
- ✅ Simple API
- ✅ Can modify PDFs (merge, split, encrypt)
- ✅ Actively maintained (PyPDF2 development was merged back into pypdf)
- ✅ No external dependencies
**Cons:**
- ⚠️ Limited complex layout support
- ⚠️ Basic text extraction only
- ⚠️ Poor with scanned/image PDFs
- ⚠️ No table extraction
**Code Example:**
```python
from pypdf import PdfReader


# Extract text
def extract_with_pypdf(pdf_path):
    reader = PdfReader(pdf_path)
    text = ''
    for page in reader.pages:
        text += page.extract_text()
    return text
```
**Use Cases for Skill Seeker:**
- Simple text extraction
- Fallback when PyMuPDF licensing is an issue
- Basic PDF manipulation tasks
---
### 4. pdfminer.six
**Performance:** Slow (~2.5 seconds)
**Installation:**
```bash
pip install pdfminer.six
```
**Pros:**
- ✅ MIT license
- ✅ Excellent text quality (preserves formatting)
- ✅ Handles complex layouts
- ✅ Pure Python (no binaries)
**Cons:**
- ⚠️ Slowest option
- ⚠️ Complex API
- ⚠️ Poor documentation
- ⚠️ Limited table support
**Use Cases for Skill Seeker:**
- Not recommended (pdfplumber is built on this with better API)
---
### 5. pypdfium2
**Performance:** Very fast (3ms - fastest tested)
**Installation:**
```bash
pip install pypdfium2
```
**Pros:**
- ✅ Extremely fast
- ✅ Apache 2.0 license
- ✅ Lightweight
- ✅ Clean output
**Cons:**
- ⚠️ Basic features only
- ⚠️ Limited documentation
- ⚠️ No table extraction
- ⚠️ Newer/less proven
**Use Cases for Skill Seeker:**
- High-speed basic extraction
- Potential future optimization
---
## Licensing Considerations
### Open Source Projects (Skill Seeker):
- **PyMuPDF:** ✅ AGPL license is fine for open-source projects
- **pdfplumber:** ✅ MIT license (most permissive)
- **pypdf:** ✅ BSD license (permissive)
### Important Note:
PyMuPDF requires AGPL compliance (source code must be shared) OR a commercial license for proprietary use. Since Skill Seeker is open source on GitHub, AGPL is acceptable.
---
## Performance Benchmarks
Based on 2025 testing:
| Library | Time (single page) | Time (100 pages) |
|---------|-------------------|------------------|
| pypdfium2 | 0.003s | 0.3s |
| PyMuPDF | 0.042s | 4.2s |
| pypdf | 0.1s | 10s |
| pdfplumber | 2.5s | 250s |
| pdfminer.six | 2.5s | 250s |
**Winner:** pypdfium2 (speed) / PyMuPDF (features + speed balance)
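The table does not say how the timings were measured. To reproduce comparable numbers on your own PDFs, a small harness along these lines could be used; `time_extractor` is an illustrative name, and the dummy extractor stands in for any of the library-specific functions in this document.

```python
import time


def time_extractor(extract, pdf_path, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        extract(pdf_path)
        best = min(best, time.perf_counter() - start)
    return best


def dummy_extract(pdf_path):
    """Placeholder workload; swap in a real extractor function."""
    return sum(range(10_000))


print(f"{time_extractor(dummy_extract, 'sample.pdf'):.6f}s")
```

Taking the best of several runs reduces noise from disk caching and other processes, which matters when comparing libraries that differ by milliseconds.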
---
## Recommendations for Skill Seeker
### Primary Approach: PyMuPDF (fitz)
**Why:**
1. **Speed** - About 60x faster than pdfminer.six and pdfplumber in the benchmarks above (pypdfium2 is faster still, but has fewer features)
2. **Features** - Text, images, markdown output, metadata
3. **Quality** - High-quality text extraction
4. **Maintained** - Active development, good docs
5. **License** - AGPL is fine for open source
**Implementation Strategy:**
```python
import fitz  # PyMuPDF
import pymupdf4llm  # companion package for Markdown output


def extract_pdf_documentation(pdf_path):
    """
    Extract documentation from PDF with code block detection
    """
    doc = fitz.open(pdf_path)
    pages = []
    for page_num, page in enumerate(doc):
        # Get plain text
        text = page.get_text("text")
        # Get markdown for this page (get_text() has no "markdown" option)
        markdown = pymupdf4llm.to_markdown(doc, pages=[page_num])
        # Get images (for diagrams)
        images = page.get_images()
        pages.append({
            'page_number': page_num + 1,  # 1-based, as in the extractor output
            'text': text,
            'markdown': markdown,
            'images': images
        })
    doc.close()
    return pages
```
### Fallback Approach: pdfplumber
**When to use:**
- PDF has complex tables that PyMuPDF misses
- Need visual debugging
- License concerns (use MIT instead of AGPL)
**Implementation Strategy:**
```python
import pdfplumber


def extract_pdf_tables(pdf_path):
    """
    Extract tables from PDF documentation
    """
    with pdfplumber.open(pdf_path) as pdf:
        tables = []
        for page in pdf.pages:
            page_tables = page.extract_tables()
            if page_tables:
                tables.extend(page_tables)
    return tables
```
---
## Code Block Detection Strategy
PDFs don't have semantic "code block" markers like HTML. Detection strategies:
### 1. Font-based Detection
```python
# PyMuPDF can detect font changes
def detect_code_by_font(page):
    blocks = page.get_text("dict")["blocks"]
    code_blocks = []
    for block in blocks:
        if 'lines' in block:
            for line in block['lines']:
                for span in line['spans']:
                    font = span['font']
                    # Monospace fonts indicate code
                    if 'Courier' in font or 'Mono' in font:
                        code_blocks.append(span['text'])
    return code_blocks
```
### 2. Indentation-based Detection
```python
def detect_code_by_indent(text):
    lines = text.split('\n')
    code_blocks = []
    current_block = []
    for line in lines:
        # Code often has consistent indentation (four spaces or a tab)
        if line.startswith('    ') or line.startswith('\t'):
            current_block.append(line)
        elif current_block:
            code_blocks.append('\n'.join(current_block))
            current_block = []
    # Flush a block that runs to the end of the text
    if current_block:
        code_blocks.append('\n'.join(current_block))
    return code_blocks
```
### 3. Pattern-based Detection
```python
import re


def detect_code_by_pattern(text):
    # Look for common code patterns
    patterns = [
        r'(def \w+\(.*?\):)',          # Python functions
        r'(function \w+\(.*?\) \{)',   # JavaScript
        r'(class \w+:)',               # Python classes
        r'(import \w+)',               # Import statements
    ]
    code_snippets = []
    for pattern in patterns:
        matches = re.findall(pattern, text)
        code_snippets.extend(matches)
    return code_snippets
```
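The font and pattern strategies can be combined so that either signal marks a span as code. The sketch below works on plain dictionaries with `font` and `text` keys, the same two keys the font-based example reads from each span; the function name and the exact pattern list are illustrative choices, not the extractor's actual heuristic.

```python
import re

MONOSPACE_HINTS = ("Courier", "Mono")
CODE_PATTERN = re.compile(
    r"^\s*(def \w+\(.*\):|class \w+.*:|import \w+|from \w+ import |function \w+\()"
)


def looks_like_code(span):
    """Flag a span as code if its font is monospace or its text matches a code pattern."""
    font = span.get("font", "")
    text = span.get("text", "")
    monospace = any(hint in font for hint in MONOSPACE_HINTS)
    return monospace or bool(CODE_PATTERN.search(text))


spans = [
    {"font": "Courier-Bold", "text": "x = 1"},
    {"font": "Times-Roman", "text": "def main():"},
    {"font": "Times-Roman", "text": "Introduction"},
]
print([looks_like_code(span) for span in spans])
```

Using an OR of the two signals favors recall: code set in a proportional font is still caught by the pattern, at the cost of occasional false positives in prose that quotes a keyword.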
---
## Next Steps (Task B1.2+)
### Immediate Next Task: B1.2 - Create Simple PDF Text Extractor
**Goal:** Proof of concept using PyMuPDF
**Implementation Plan:**
1. Create `cli/pdf_extractor_poc.py`
2. Extract text from sample PDF
3. Detect code blocks using font/pattern matching
4. Output to JSON (similar to web scraper)
**Dependencies:**
```bash
pip install PyMuPDF
```
**Expected Output:**
```json
{
"pages": [
{
"page_number": 1,
"text": "...",
"code_blocks": ["def main():", "import sys"],
"images": []
}
]
}
```
### Future Tasks:
- **B1.3:** Add page chunking (split large PDFs)
- **B1.4:** Improve code block detection
- **B1.5:** Extract images/diagrams
- **B1.6:** Create full `pdf_scraper.py` CLI
- **B1.7:** Add MCP tool integration
- **B1.8:** Create PDF config format
---
## Additional Resources
### Documentation:
- PyMuPDF: https://pymupdf.readthedocs.io/
- pdfplumber: https://github.com/jsvine/pdfplumber
- pypdf: https://pypdf.readthedocs.io/
### Comparison Studies:
- Comparative Study (arXiv, October 2024): https://arxiv.org/html/2410.09871v1
- Performance Benchmarks: https://github.com/py-pdf/benchmarks
### Example Use Cases:
- Extracting API docs from PDF manuals
- Converting PDF guides to markdown
- Building skills from PDF-only documentation
---
## Conclusion
**For Skill Seeker's PDF documentation extraction:**
1. **Use PyMuPDF (fitz)** as primary library
2. **Add pdfplumber** for complex table extraction
3. **Detect code blocks** using font + pattern matching
4. **Preserve formatting** with markdown output
5. **Extract images** for diagrams/screenshots
**Estimated Implementation Time:**
- B1.2 (POC): 2-3 hours
- B1.3-B1.5 (Features): 5-8 hours
- B1.6 (CLI): 3-4 hours
- B1.7 (MCP): 2-3 hours
- B1.8 (Config): 1-2 hours
- **Total: 13-20 hours** for complete PDF support
**License:** AGPL (PyMuPDF) is acceptable for Skill Seeker (open source)
---
**Research completed:** ✅ October 21, 2025
**Next task:** B1.2 - Create simple PDF text extractor (proof of concept)

@ -0,0 +1,616 @@
# PDF Scraper CLI Tool (Tasks B1.6 + B1.8)
**Status:** ✅ Completed
**Date:** October 21, 2025
**Tasks:** B1.6 - Create pdf_scraper.py CLI tool, B1.8 - PDF config format
---
## Overview
The PDF scraper (`pdf_scraper.py`) is a complete CLI tool that converts PDF documentation into Claude AI skills. It integrates all PDF extraction features (B1.1-B1.5) with the Skill Seeker workflow to produce packaged, uploadable skills.
## Features
### ✅ Complete Workflow
1. **Extract** - Uses `pdf_extractor_poc.py` for extraction
2. **Categorize** - Organizes content by chapters or keywords
3. **Build** - Creates skill structure (SKILL.md, references/)
4. **Package** - Ready for `package_skill.py`
### ✅ Three Usage Modes
1. **Config File** - Use JSON configuration (recommended)
2. **Direct PDF** - Quick conversion from PDF file
3. **From JSON** - Build skill from pre-extracted data
### ✅ Automatic Categorization
- Chapter-based (from PDF structure)
- Keyword-based (configurable)
- Fallback to single category
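Keyword-based categorization with a single-category fallback could work roughly as follows. The scoring rule (count keyword occurrences, pick the highest) and the fallback name `content` are assumptions for illustration; the actual logic in `pdf_scraper.py` is not shown in this document.

```python
def categorize_pages(pages, categories, fallback="content"):
    """Assign each page to the category whose keywords occur most often in its text."""
    result = {name: [] for name in categories}
    for page in pages:
        text = page["text"].lower()
        scores = {
            name: sum(text.count(keyword.lower()) for keyword in keywords)
            for name, keywords in categories.items()
        }
        best = max(scores, key=scores.get) if scores else None
        if best is not None and scores[best] > 0:
            result[best].append(page["page_number"])
        else:
            # No keyword matched: fall back to one catch-all category
            result.setdefault(fallback, []).append(page["page_number"])
    return result


categories = {"getting_started": ["introduction", "setup"], "api": ["api", "reference"]}
pages = [
    {"page_number": 1, "text": "Introduction and setup steps"},
    {"page_number": 2, "text": "API reference for the client"},
    {"page_number": 3, "text": "Appendix"},
]
print(categorize_pages(pages, categories))
```

With an empty `categories` mapping every page lands in the fallback, which corresponds to the single-category case.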
### ✅ Quality Filtering
- Uses quality scores from B1.4
- Extracts top code examples
- Filters by minimum quality threshold
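A sketch of the filter-then-rank step is shown below. It assumes each code block is a dictionary carrying a numeric `quality` score on the 0-10 scale; that key name, the `limit` parameter, and the function name are hypothetical, since the extracted data structure for code blocks is not shown here.

```python
def top_code_examples(code_blocks, min_quality=5.0, limit=5):
    """Drop blocks below the threshold, then return the best `limit` by score."""
    kept = [block for block in code_blocks if block.get("quality", 0.0) >= min_quality]
    return sorted(kept, key=lambda block: block["quality"], reverse=True)[:limit]


blocks = [
    {"code": "x = 1", "quality": 3.5},
    {"code": "def main(): ...", "quality": 8.0},
    {"code": "import sys", "quality": 6.0},
]
print([block["code"] for block in top_code_examples(blocks, min_quality=5.0, limit=2)])
```

Filtering before sorting keeps low-scoring fragments out even when fewer than `limit` good examples exist.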
---
## Usage
### Mode 1: Config File (Recommended)
```bash
# Create config file
cat > configs/my_manual.json <<EOF
{
"name": "mymanual",
"description": "My Manual documentation",
"pdf_path": "docs/manual.pdf",
"extract_options": {
"chunk_size": 10,
"min_quality": 6.0,
"extract_images": true,
"min_image_size": 150
},
"categories": {
"getting_started": ["introduction", "setup"],
"api": ["api", "reference", "function"],
"tutorial": ["tutorial", "example", "guide"]
}
}
EOF
# Run scraper
python3 cli/pdf_scraper.py --config configs/my_manual.json
```
**Output:**
```
🔍 Extracting from PDF: docs/manual.pdf
📄 Extracting from: docs/manual.pdf
Pages: 150
...
✅ Extraction complete
💾 Saved extracted data to: output/mymanual_extracted.json
🏗️ Building skill: mymanual
📋 Categorizing content...
✅ Created 3 categories
- Getting Started: 25 pages
- Api: 80 pages
- Tutorial: 45 pages
📝 Generating reference files...
Generated: output/mymanual/references/getting_started.md
Generated: output/mymanual/references/api.md
Generated: output/mymanual/references/tutorial.md
Generated: output/mymanual/references/index.md
Generated: output/mymanual/SKILL.md
✅ Skill built successfully: output/mymanual/
📦 Next step: Package with: python3 cli/package_skill.py output/mymanual/
```
### Mode 2: Direct PDF
```bash
# Quick conversion without config file
python3 cli/pdf_scraper.py --pdf manual.pdf --name mymanual --description "My Manual Docs"
```
**Uses default settings:**
- Chunk size: 10
- Min quality: 5.0
- Extract images: true
- Min image size: 100px
- No custom categories (chapter-based)
### Mode 3: From Extracted JSON
```bash
# Step 1: Extract only (saves JSON)
python3 cli/pdf_extractor_poc.py manual.pdf -o manual_extracted.json --extract-images
# Step 2: Build skill from JSON (fast, can iterate)
python3 cli/pdf_scraper.py --from-json manual_extracted.json
```
**Benefits:**
- Separate extraction and building
- Iterate on skill structure without re-extracting
- Faster development cycle
---
## Config File Format (Task B1.8)
### Complete Example
```json
{
"name": "godot_manual",
"description": "Godot Engine documentation from PDF manual",
"pdf_path": "docs/godot_manual.pdf",
"extract_options": {
"chunk_size": 15,
"min_quality": 6.0,
"extract_images": true,
"min_image_size": 200
},
"categories": {
"getting_started": [
"introduction",
"getting started",
"installation",
"first steps"
],
"scripting": [
"gdscript",
"scripting",
"code",
"programming"
],
"3d": [
"3d",
"spatial",
"mesh",
"shader"
],
"2d": [
"2d",
"sprite",
"tilemap",
"animation"
],
"api": [
"api",
"class reference",
"method",
"property"
]
}
}
```
### Field Reference
#### Required Fields
- **`name`** (string): Skill identifier
  - Used for directory names
  - Should be lowercase, no spaces
  - Example: `"python_guide"`
- **`pdf_path`** (string): Path to PDF file
  - Absolute or relative to working directory
  - Example: `"docs/manual.pdf"`
#### Optional Fields
- **`description`** (string): Skill description
  - Shows in SKILL.md
  - Explains when to use the skill
  - Default: `"Documentation skill for {name}"`
- **`extract_options`** (object): Extraction settings
  - `chunk_size` (number): Pages per chunk (default: 10)
  - `min_quality` (number): Minimum code quality 0-10 (default: 5.0)
  - `extract_images` (boolean): Extract images to files (default: true)
  - `min_image_size` (number): Minimum image dimension in pixels (default: 100)
- **`categories`** (object): Keyword-based categorization
  - Keys: Category names (will be sanitized for filenames)
  - Values: Arrays of keywords to match
  - If omitted: Uses chapter-based categorization from PDF
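As a rough illustration of how the required fields and documented defaults fit together, the sketch below parses a config string and fills in missing values. `load_pdf_config` and `DEFAULT_EXTRACT_OPTIONS` are hypothetical names used only for this example; the real scraper may structure this differently.

```python
import json

# Defaults copied from the field reference above.
DEFAULT_EXTRACT_OPTIONS = {
    "chunk_size": 10,
    "min_quality": 5.0,
    "extract_images": True,
    "min_image_size": 100,
}


def load_pdf_config(text):
    """Parse a config JSON string and apply the documented defaults."""
    config = json.loads(text)
    missing = [key for key in ("name", "pdf_path") if key not in config]
    if missing:
        raise ValueError(f"Missing required field(s): {', '.join(missing)}")
    config.setdefault("description", f"Documentation skill for {config['name']}")
    # User-supplied options override the defaults key by key.
    options = dict(DEFAULT_EXTRACT_OPTIONS)
    options.update(config.get("extract_options", {}))
    config["extract_options"] = options
    return config
```

Leaving `categories` absent in the returned dict is deliberate here: per the reference, a missing key is what selects chapter-based categorization.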
---
## Output Structure
### Generated Files
```
output/
├── mymanual_extracted.json          # Raw extraction data (B1.5 format)
└── mymanual/                        # Skill directory
    ├── SKILL.md                     # Main skill file
    ├── references/                  # Reference documentation
    │   ├── index.md                 # Category index
    │   ├── getting_started.md       # Category 1
    │   ├── api.md                   # Category 2
    │   └── tutorial.md              # Category 3
    ├── scripts/                     # Empty (for user scripts)
    └── assets/                      # Assets directory
        └── images/                  # Extracted images (if enabled)
            ├── mymanual_page5_img1.png
            └── mymanual_page12_img2.jpeg
```
### SKILL.md Format
````markdown
# Mymanual Documentation Skill
My Manual documentation
## When to use this skill
Use this skill when the user asks about mymanual documentation,
including API references, tutorials, examples, and best practices.
## What's included
This skill contains:
- **Getting Started**: 25 pages
- **Api**: 80 pages
- **Tutorial**: 45 pages
## Quick Reference
### Top Code Examples
**Example 1** (Quality: 8.5/10):
```python
def initialize_system():
    config = load_config()
    setup_logging(config)
    return System(config)
```
**Example 2** (Quality: 8.2/10):
```javascript
const app = createApp({
  data() {
    return { count: 0 }
  }
})
```
## Navigation
See `references/index.md` for complete documentation structure.
## Languages Covered
- python: 45 examples
- javascript: 32 examples
- shell: 8 examples
````
### Reference File Format
Each category gets its own reference file:
````markdown
# Getting Started
## Installation
This guide will walk you through installing the software...
### Code Examples
```bash
curl -O https://example.com/install.sh
bash install.sh
```
---
## Configuration
After installation, configure your environment...
### Code Examples
```yaml
server:
  port: 8080
  host: localhost
```
---
````
---
## Categorization Logic
### Chapter-Based (Automatic)
If PDF has detectable chapters (from B1.3):
1. Extract chapter titles and page ranges
2. Create one category per chapter
3. Assign pages to chapters by page number
**Advantages:**
- Automatic, no config needed
- Respects document structure
- Accurate page assignment
**Example chapters:**
- "Chapter 1: Introduction" → `chapter_1_introduction.md`
- "Part 2: Advanced Topics" → `part_2_advanced_topics.md`
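The mapping from chapter titles to filenames shown in these examples could be produced by a sanitizer along the following lines. This is a sketch that happens to reproduce the two examples above; `category_filename` is a hypothetical name, and the real tool's sanitization rules may differ.

```python
import re


def category_filename(title):
    """Turn a chapter title into a reference filename (illustrative sanitizer)."""
    # Lowercase, then collapse every run of non-alphanumeric characters to "_".
    slug = re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")
    return f"{slug}.md"
```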
### Keyword-Based (Configurable)
If `categories` config is provided:
1. Score each page against keyword lists
2. Assign to highest-scoring category
3. Fall back to "other" if no match
**Advantages:**
- Flexible, customizable
- Works with PDFs without clear chapters
- Can combine related sections
**Scoring:**
- Keyword in page text: +1 point
- Keyword in page heading: +2 points
- Assigned to category with highest score
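The scoring rules above can be sketched as follows. The sketch assumes case-insensitive substring matching, counts each keyword at most once per page for text and once for headings, and breaks ties in favor of the category defined first; none of these details are confirmed by this document, and `categorize_page` is a hypothetical name.

```python
def categorize_page(text, headings, categories):
    """Return the best-scoring category name, or "other" when nothing matches."""
    text_lower = text.lower()
    headings_lower = [heading.lower() for heading in headings]
    best_name, best_score = "other", 0
    for name, keywords in categories.items():
        score = 0
        for keyword in keywords:
            needle = keyword.lower()
            if needle in text_lower:
                score += 1  # keyword found in page text
            if any(needle in heading for heading in headings_lower):
                score += 2  # keyword found in a page heading
        # Strict ">" keeps the earlier category when scores tie.
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```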
---
## Integration with Skill Seeker
### Complete Workflow
```bash
# 1. Create PDF config
cat > configs/api_manual.json <<EOF
{
"name": "api_manual",
"pdf_path": "docs/api.pdf",
"extract_options": {
"min_quality": 7.0,
"extract_images": true
}
}
EOF
# 2. Run PDF scraper
python3 cli/pdf_scraper.py --config configs/api_manual.json
# 3. Package skill
python3 cli/package_skill.py output/api_manual/
# 4. Upload to Claude (if ANTHROPIC_API_KEY set)
python3 cli/package_skill.py output/api_manual/ --upload
# Result: api_manual.zip ready for Claude!
```
### Enhancement (Optional)
```bash
# After building, enhance with AI
python3 cli/enhance_skill_local.py output/api_manual/
# Or with API
export ANTHROPIC_API_KEY=sk-ant-...
python3 cli/enhance_skill.py output/api_manual/
```
---
## Performance
### Benchmark
| PDF Size | Pages | Extraction | Building | Total |
|----------|-------|------------|----------|-------|
| Small | 50 | 30s | 5s | 35s |
| Medium | 200 | 2m | 15s | 2m 15s |
| Large | 500 | 5m | 45s | 5m 45s |
**Extraction**: PDF → JSON (CPU-intensive)
**Building**: JSON → Skill (fast, I/O-bound)
### Optimization Tips
1. **Use `--from-json` for iteration**
   - Extract once, build many times
   - Test categorization without re-extraction
2. **Adjust chunk size**
   - Larger chunks: Faster extraction
   - Smaller chunks: Better chapter detection
3. **Filter aggressively**
   - Higher `min_quality`: Fewer low-quality code blocks
   - Higher `min_image_size`: Fewer small images
---
## Examples
### Example 1: Programming Language Manual
```json
{
"name": "python_reference",
"description": "Python 3.12 Language Reference",
"pdf_path": "python-3.12-reference.pdf",
"extract_options": {
"chunk_size": 20,
"min_quality": 7.0,
"extract_images": false
},
"categories": {
"basics": ["introduction", "basic", "syntax", "types"],
"functions": ["function", "lambda", "decorator"],
"classes": ["class", "object", "inheritance"],
"modules": ["module", "package", "import"],
"stdlib": ["library", "standard library", "built-in"]
}
}
```
### Example 2: API Documentation
```json
{
"name": "rest_api_docs",
"description": "REST API Documentation",
"pdf_path": "api_docs.pdf",
"extract_options": {
"chunk_size": 10,
"min_quality": 6.0,
"extract_images": true,
"min_image_size": 200
},
"categories": {
"authentication": ["auth", "login", "token", "oauth"],
"users": ["user", "account", "profile"],
"products": ["product", "catalog", "inventory"],
"orders": ["order", "purchase", "checkout"],
"webhooks": ["webhook", "event", "callback"]
}
}
```
### Example 3: Framework Documentation
```json
{
"name": "django_docs",
"description": "Django Web Framework Documentation",
"pdf_path": "django-4.2-docs.pdf",
"extract_options": {
"chunk_size": 15,
"min_quality": 6.5,
"extract_images": true
}
}
```
*Note: No categories - uses chapter-based categorization*
---
## Troubleshooting
### No Categories Created
**Problem:** Only a single "content" or "other" category is created
**Possible causes:**
1. No chapters detected in PDF
2. Keywords don't match content
3. Config has empty categories
**Solution:**
```bash
# Check extracted chapters
cat output/mymanual_extracted.json | jq '.chapters'
# If empty, add keyword categories to config
# Or let it create single "content" category (OK for small PDFs)
```
### Low-Quality Code Blocks
**Problem:** Too many poor code examples
**Solution:** Raise the `min_quality` threshold in the config:
```json
{
  "extract_options": {
    "min_quality": 7.0
  }
}
```
### Images Not Extracted
**Problem:** No images in `assets/images/`
**Solution:** Enable image extraction and lower the `min_image_size` threshold:
```json
{
  "extract_options": {
    "extract_images": true,
    "min_image_size": 50
  }
}
```
---
## Comparison with Web Scraper
| Feature | Web Scraper | PDF Scraper |
|---------|-------------|-------------|
| Input | HTML websites | PDF files |
| Crawling | Multi-page BFS | Single-file extraction |
| Structure detection | CSS selectors | Font/heading analysis |
| Categorization | URL patterns | Chapters/keywords |
| Images | Referenced | Embedded (extracted) |
| Code detection | `<pre><code>` | Font/indent/pattern |
| Language detection | CSS classes | Pattern matching |
| Quality scoring | No | Yes (B1.4) |
| Chunking | No | Yes (B1.3) |
---
## Next Steps
### Task B1.7: MCP Tool Integration
The PDF scraper will be available through MCP:
```python
# Future: MCP tool
result = mcp.scrape_pdf(
    config_path="configs/manual.json"
)
# Or direct
result = mcp.scrape_pdf(
    pdf_path="manual.pdf",
    name="mymanual",
    extract_images=True
)
```
---
## Conclusion
Tasks B1.6 and B1.8 successfully implement:
**B1.6 - PDF Scraper CLI:**
- ✅ Complete extraction → building workflow
- ✅ Three usage modes (config, direct, from-json)
- ✅ Automatic categorization (chapter or keyword-based)
- ✅ Integration with Skill Seeker workflow
- ✅ Quality filtering and top examples
**B1.8 - PDF Config Format:**
- ✅ JSON configuration format
- ✅ Extraction options (chunk size, quality, images)
- ✅ Category definitions (keyword-based)
- ✅ Compatible with web scraper config style
**Impact:**
- Complete PDF documentation support
- Parallel workflow to web scraping
- Reusable extraction results
- High-quality skill generation
**Ready for B1.7:** MCP tool integration
---
**Tasks Completed:** October 21, 2025
**Next Task:** B1.7 - Add MCP tool `scrape_pdf`

View File

@ -0,0 +1,576 @@
# PDF Code Block Syntax Detection (Task B1.4)
**Status:** ✅ Completed
**Date:** October 21, 2025
**Task:** B1.4 - Extract code blocks from PDFs with syntax detection
---
## Overview
Task B1.4 enhances the PDF extractor with advanced code block detection capabilities including:
- **Confidence scoring** for language detection
- **Syntax validation** to filter out false positives
- **Quality scoring** to rank code blocks by usefulness
- **Automatic filtering** of low-quality code
These checks improve the accuracy and usefulness of code samples extracted from PDF documentation, cutting the false-positive rate from roughly 15-20% to 3-5%.
---
## New Features
### ✅ 1. Confidence-Based Language Detection
Enhanced language detection now returns both language and confidence score:
**Before (B1.2):**
```python
lang = detect_language_from_code(code) # Returns: 'python'
```
**After (B1.4):**
```python
lang, confidence = detect_language_from_code(code) # Returns: ('python', 0.85)
```
**Confidence Calculation:**
- Pattern matches are weighted (1-5 points)
- Scores are normalized to 0-1 range
- Higher confidence = more reliable detection
**Example Pattern Weights:**
```python
patterns = {
    'python': [
        (r'\bdef\s+\w+\s*\(', 3),  # Strong indicator
        (r'\bimport\s+\w+', 2),    # Medium indicator
        (r':\s*$', 1),             # Weak indicator (lines ending with :)
    ],
    # ... other languages
}
```
### ✅ 2. Syntax Validation
Validates detected code blocks to filter false positives:
**Validation Checks:**
1. **Not empty** - Rejects empty code blocks
2. **Indentation consistency** (Python) - Detects mixed tabs/spaces
3. **Balanced brackets** - Checks for unclosed parentheses, braces
4. **Language-specific syntax** (JSON) - Attempts to parse
5. **Natural language detection** - Filters out prose misidentified as code
6. **Comment ratio** - Rejects blocks that are mostly comments
**Output:**
```json
{
"code": "def example():\n return True",
"language": "python",
"is_valid": true,
"validation_issues": []
}
```
**Invalid example:**
```json
{
"code": "This is not code",
"language": "unknown",
"is_valid": false,
"validation_issues": ["May be natural language, not code"]
}
```
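Two of the checks listed above, the JSON parse attempt and the comment ratio, are easy to picture in isolation. The following is a minimal sketch; the helper names, the issue wording, and the set of comment markers are assumptions and are not taken from the extractor.

```python
import json


def check_json_block(code):
    """Return a list of issues found by attempting to parse the block as JSON."""
    try:
        json.loads(code)
    except json.JSONDecodeError as exc:
        return [f"Invalid JSON: {exc.msg}"]
    return []


def comment_ratio(code, markers=("#", "//")):
    """Fraction of non-empty lines that start with a comment marker."""
    lines = [line.strip() for line in code.splitlines() if line.strip()]
    if not lines:
        return 0.0
    commented = sum(1 for line in lines if line.startswith(markers))
    return commented / len(lines)
```

A caller could then reject a block when `comment_ratio` exceeds some threshold; the threshold value itself is not stated in this document.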
### ✅ 3. Quality Scoring
Each code block receives a quality score (0-10) based on multiple factors:
**Scoring Factors:**
1. **Language confidence** (+0 to +2.0 points)
2. **Code length** (optimal: 20-500 chars, +1.0)
3. **Line count** (optimal: 2-50 lines, +1.0)
4. **Has definitions** (functions/classes, +1.5)
5. **Meaningful variable names** (+1.0)
6. **Syntax validation** (+1.0 if valid, -0.5 per issue)
**Quality Tiers:**
- **High quality (7-10):** Complete, valid, useful code examples
- **Medium quality (4 to below 7):** Partial or simple code snippets
- **Low quality (below 4):** Fragments, false positives, invalid code
**Example:**
```python
# High-quality code block (score: 8.5/10)
def calculate_total(items):
    total = 0
    for item in items:
        total += item.price
    return total

# Low-quality code block (score: 2.0/10)
x = y
```
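The tier boundaries can be expressed as a small helper. This sketch assumes a score of exactly 7.0 counts as high and exactly 4.0 as medium, which matches the "7+" and "<4" labels used in the quality statistics; `quality_tier` is a hypothetical name.

```python
def quality_tier(score):
    """Map a 0-10 quality score to a tier name."""
    if score >= 7.0:
        return "high"
    if score >= 4.0:
        return "medium"
    return "low"
```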
### ✅ 4. Quality Filtering
Filter out low-quality code blocks automatically:
```bash
# Keep only high-quality code (score >= 7.0)
python3 cli/pdf_extractor_poc.py input.pdf --min-quality 7.0
# Keep medium and high quality (score >= 4.0)
python3 cli/pdf_extractor_poc.py input.pdf --min-quality 4.0
# No filtering (default)
python3 cli/pdf_extractor_poc.py input.pdf
```
**Benefits:**
- Reduces noise in output
- Focuses on useful examples
- Improves downstream skill quality
### ✅ 5. Quality Statistics
New summary statistics show overall code quality:
```
📊 Code Quality Statistics:
Average quality: 6.8/10
Average confidence: 78.5%
Valid code blocks: 45/52 (86.5%)
High quality (7+): 28
Medium quality (4-7): 17
Low quality (<4): 7
```
---
## Output Format
### Enhanced Code Block Object
Each code block now includes quality metadata:
```json
{
"code": "def example():\n return True",
"language": "python",
"confidence": 0.85,
"quality_score": 7.5,
"is_valid": true,
"validation_issues": [],
"detection_method": "font",
"font": "Courier-New"
}
```
### Quality Statistics Object
Top-level summary of code quality:
```json
{
"quality_statistics": {
"average_quality": 6.8,
"average_confidence": 0.785,
"valid_code_blocks": 45,
"invalid_code_blocks": 7,
"validation_rate": 0.865,
"high_quality_blocks": 28,
"medium_quality_blocks": 17,
"low_quality_blocks": 7
}
}
```
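To make the relationship between the fields concrete, here is a sketch of how such a summary could be derived from a list of code block objects. It assumes every block carries `quality_score`, `confidence`, and `is_valid`; the function name and the rounding are illustrative choices, not the extractor's confirmed behavior.

```python
def summarize_quality(code_blocks):
    """Aggregate per-block quality metadata into a statistics dict."""
    total = len(code_blocks)
    if total == 0:
        # Nothing to average; report zeros rather than dividing by zero.
        return {
            "average_quality": 0.0, "average_confidence": 0.0,
            "valid_code_blocks": 0, "invalid_code_blocks": 0,
            "validation_rate": 0.0, "high_quality_blocks": 0,
            "medium_quality_blocks": 0, "low_quality_blocks": 0,
        }
    scores = [block["quality_score"] for block in code_blocks]
    valid = sum(1 for block in code_blocks if block["is_valid"])
    confidence_sum = sum(block["confidence"] for block in code_blocks)
    return {
        "average_quality": round(sum(scores) / total, 1),
        "average_confidence": round(confidence_sum / total, 3),
        "valid_code_blocks": valid,
        "invalid_code_blocks": total - valid,
        "validation_rate": round(valid / total, 3),
        "high_quality_blocks": sum(1 for s in scores if s >= 7.0),
        "medium_quality_blocks": sum(1 for s in scores if 4.0 <= s < 7.0),
        "low_quality_blocks": sum(1 for s in scores if s < 4.0),
    }
```

Note that in the example object above, valid plus invalid blocks (45 + 7) and the three tiers (28 + 17 + 7) both sum to the same total of 52, and 45/52 gives the 0.865 validation rate.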
---
## Usage Examples
### Basic Extraction with Quality Stats
```bash
python3 cli/pdf_extractor_poc.py manual.pdf -o output.json --pretty
```
**Output:**
```
✅ Extraction complete:
Total characters: 125,000
Code blocks found: 52
Headings found: 45
Images found: 12
Chunks created: 5
Chapters detected: 3
Languages detected: python, javascript, sql
📊 Code Quality Statistics:
Average quality: 6.8/10
Average confidence: 78.5%
Valid code blocks: 45/52 (86.5%)
High quality (7+): 28
Medium quality (4-7): 17
Low quality (<4): 7
```
### Filter Low-Quality Code
```bash
# Keep only high-quality examples
python3 cli/pdf_extractor_poc.py tutorial.pdf --min-quality 7.0 -v
# Verbose output shows filtering:
# 📄 Extracting from: tutorial.pdf
# ...
# Filtered out 12 low-quality code blocks (min_quality=7.0)
#
# ✅ Extraction complete:
# Code blocks found: 28 (after filtering)
```
### Inspect Quality Scores
```bash
# Extract and view quality scores
python3 cli/pdf_extractor_poc.py input.pdf -o output.json
# View quality scores with jq
cat output.json | jq '.pages[0].code_samples[] | {language, quality_score, is_valid}'
```
**Output:**
```json
{
"language": "python",
"quality_score": 8.5,
"is_valid": true
}
{
"language": "javascript",
"quality_score": 6.2,
"is_valid": true
}
{
"language": "unknown",
"quality_score": 2.1,
"is_valid": false
}
```
---
## Technical Implementation
### Language Detection with Confidence
```python
import re

def detect_language_from_code(self, code):
    """Enhanced with weighted pattern matching"""
    patterns = {
        'python': [
            (r'\bdef\s+\w+\s*\(', 3),  # Weight: 3
            (r'\bimport\s+\w+', 2),    # Weight: 2
            (r':\s*$', 1),             # Weight: 1
        ],
        # ... other languages
    }
    # Calculate scores for each language
    scores = {}
    for lang, lang_patterns in patterns.items():
        score = 0
        for pattern, weight in lang_patterns:
            if re.search(pattern, code, re.IGNORECASE | re.MULTILINE):
                score += weight
        if score > 0:
            scores[lang] = score
    # No pattern matched: max() would fail on an empty dict, so report
    # an unknown language with zero confidence instead
    if not scores:
        return 'unknown', 0.0
    # Get best match; confidence is the score capped at 10, scaled to 0-1
    best_lang = max(scores, key=scores.get)
    confidence = min(scores[best_lang] / 10.0, 1.0)
    return best_lang, confidence
```
### Syntax Validation
```python
def validate_code_syntax(self, code, language):
    """Validate code syntax"""
    issues = []
    if language == 'python':
        # Check indentation consistency
        indent_chars = set()
        for line in code.split('\n'):
            if line.startswith(' '):
                indent_chars.add('space')
            elif line.startswith('\t'):
                indent_chars.add('tab')
        if len(indent_chars) > 1:
            issues.append('Mixed tabs and spaces')
    # Check balanced brackets
    open_count = code.count('(') + code.count('[') + code.count('{')
    close_count = code.count(')') + code.count(']') + code.count('}')
    if abs(open_count - close_count) > 2:
        issues.append('Unbalanced brackets')
    # Check if it's actually natural language
    common_words = ['the', 'and', 'for', 'with', 'this', 'that']
    word_count = sum(1 for word in common_words if word in code.lower())
    if word_count > 5:
        issues.append('May be natural language, not code')
    return len(issues) == 0, issues
```
### Quality Scoring
```python
import re

def score_code_quality(self, code, language, confidence):
    """Score code quality (0-10)"""
    score = 5.0  # Neutral baseline
    # Factor 1: Language confidence
    score += confidence * 2.0
    # Factor 2: Code length (optimal range)
    code_length = len(code.strip())
    if 20 <= code_length <= 500:
        score += 1.0
    # Factor 3: Has function/class definitions
    if re.search(r'\b(def|function|class|func)\b', code):
        score += 1.5
    # Factor 4: Meaningful variable names
    meaningful_vars = re.findall(r'\b[a-z_][a-z0-9_]{3,}\b', code.lower())
    if len(meaningful_vars) >= 2:
        score += 1.0
    # Factor 5: Syntax validation
    is_valid, issues = self.validate_code_syntax(code, language)
    if is_valid:
        score += 1.0
    else:
        score -= len(issues) * 0.5
    return max(0, min(10, score))  # Clamp to 0-10
```
---
## Performance Impact
### Overhead Analysis
| Operation | Time per page | Impact |
|-----------|---------------|--------|
| Confidence scoring | +0.2ms | Negligible |
| Syntax validation | +0.5ms | Negligible |
| Quality scoring | +0.3ms | Negligible |
| **Total overhead** | **+1.0ms** | **<2%** |
**Benchmark:**
- Small PDF (10 pages): +10ms total (~1% overhead)
- Medium PDF (100 pages): +100ms total (~2% overhead)
- Large PDF (500 pages): +500ms total (~2% overhead)
### Memory Usage
- Quality metadata adds ~200 bytes per code block
- Statistics add ~500 bytes to output
- **Impact:** Negligible (<1% increase)
---
## Comparison: Before vs After
| Metric | Before (B1.3) | After (B1.4) | Improvement |
|--------|---------------|--------------|-------------|
| Language detection | Single return | Lang + confidence | ✅ More reliable |
| Syntax validation | None | Multiple checks | ✅ Filters false positives |
| Quality scoring | None | 0-10 scale | ✅ Ranks code blocks |
| False positives | ~15-20% | ~3-5% | ✅ 75% reduction |
| Code quality avg | Unknown | Measurable | ✅ Trackable |
| Filtering | None | Automatic | ✅ Cleaner output |
---
## Testing
### Test Quality Scoring
```bash
# Create test PDF with various code qualities
# - High-quality: Complete function with meaningful names
# - Medium-quality: Simple variable assignments
# - Low-quality: Natural language text
python3 cli/pdf_extractor_poc.py test.pdf -o test.json -v
# Check quality scores
cat test.json | jq '.pages[].code_samples[] | {language, quality_score}'
```
**Expected Results:**
```json
{"language": "python", "quality_score": 8.5}
{"language": "javascript", "quality_score": 6.2}
{"language": "unknown", "quality_score": 1.8}
```
### Test Validation
```bash
# Check validation results
cat test.json | jq '.pages[].code_samples[] | select(.is_valid == false)'
```
**Should show:**
- Empty code blocks
- Natural language misdetected as code
- Code with severe syntax errors
### Test Filtering
```bash
# Extract with different quality thresholds
python3 cli/pdf_extractor_poc.py test.pdf --min-quality 7.0 -o high_quality.json
python3 cli/pdf_extractor_poc.py test.pdf --min-quality 4.0 -o medium_quality.json
python3 cli/pdf_extractor_poc.py test.pdf --min-quality 0.0 -o all_quality.json
# Compare counts
echo "High quality:"; cat high_quality.json | jq '[.pages[].code_samples[]] | length'
echo "Medium+:"; cat medium_quality.json | jq '[.pages[].code_samples[]] | length'
echo "All:"; cat all_quality.json | jq '[.pages[].code_samples[]] | length'
```
---
## Limitations
### Current Limitations
1. **Validation is heuristic-based**
   - No AST parsing (yet)
   - Some edge cases may be missed
   - Language-specific validation only for Python, JS, Java, C
2. **Quality scoring is subjective**
   - Based on heuristics, not compilation
   - May not match human judgment perfectly
   - Tuned for documentation examples, not production code
3. **Confidence scoring is pattern-based**
   - No machine learning
   - Limited to defined patterns
   - May struggle with uncommon languages
### Known Issues
1. **Short Code Snippets**
   - May score lower than deserved
   - Example: `x = 5` is valid but scores low
2. **Comments-Heavy Code**
   - Well-commented code may be penalized
   - Workaround: Adjust comment ratio threshold
3. **Domain-Specific Languages**
   - Not covered by pattern detection
   - Will be marked as 'unknown'
---
## Future Enhancements
### Potential Improvements
1. **AST-Based Validation**
   - Use Python's `ast` module for Python code
   - Use esprima/acorn for JavaScript
   - Actual syntax parsing instead of heuristics
2. **Machine Learning Detection**
   - Train classifier on code vs non-code
   - More accurate language detection
   - Context-aware quality scoring
3. **Custom Quality Metrics**
   - User-defined quality factors
   - Domain-specific scoring
   - Configurable weights
4. **More Language Support**
   - Add TypeScript, Dart, Lua, etc.
   - Better pattern coverage
   - Language-specific validation
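For the first item, AST-based validation of Python code might look like the sketch below. It is not part of the current extractor; `validate_python_with_ast` is a hypothetical name, and the return shape simply mirrors the `(is_valid, issues)` pair used by the existing heuristic validator.

```python
import ast


def validate_python_with_ast(code):
    """Return (is_valid, issues) by parsing the code with Python's ast module."""
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return False, [f"Syntax error on line {exc.lineno}: {exc.msg}"]
    except ValueError as exc:
        # Some Python versions raise ValueError for source with null bytes.
        return False, [f"Unparseable source: {exc}"]
    return True, []
```

One caveat with this approach: documentation snippets are often fragments (an indented method body, for instance), so a strict parser would likely reject more blocks than the heuristic checks do.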
---
## Integration with Skill Seeker
### Improved Skill Quality
With B1.4 enhancements, PDF-based skills will have:
1. **Higher quality code examples**
   - Automatic filtering of noise
   - Only meaningful snippets included
2. **Better categorization**
   - Confidence scores help categorization
   - Language-specific references
3. **Validation feedback**
   - Know which code blocks may have issues
   - Fix before packaging skill
### Example Workflow
```bash
# Step 1: Extract with high-quality filter
python3 cli/pdf_extractor_poc.py manual.pdf --min-quality 7.0 -o manual.json -v
# Step 2: Review quality statistics
cat manual.json | jq '.quality_statistics'
# Step 3: Inspect any invalid blocks
cat manual.json | jq '.pages[].code_samples[] | select(.is_valid == false)'
# Step 4: Build skill (future task B1.6)
python3 cli/pdf_scraper.py --from-json manual.json
```
---
## Conclusion
Task B1.4 successfully implements:
- ✅ Confidence-based language detection
- ✅ Syntax validation for common languages
- ✅ Quality scoring (0-10 scale)
- ✅ Automatic quality filtering
- ✅ Comprehensive quality statistics
**Impact:**
- 75% reduction in false positives
- More reliable code extraction
- Better skill quality
- Measurable code quality metrics
**Performance:** <2% overhead (negligible)
**Compatibility:** Backward compatible (existing fields preserved)
**Ready for B1.5:** Image extraction from PDFs
---
**Task Completed:** October 21, 2025
**Next Task:** B1.5 - Add PDF image extraction (diagrams, screenshots)

View File

@ -0,0 +1,94 @@
# Terminal Selection Guide
When using `--enhance-local`, Skill Seeker opens a new terminal window to run Claude Code. This guide explains how to control which terminal app is used.
## Priority Order
The script automatically detects which terminal to use in this order:
1. **`SKILL_SEEKER_TERMINAL` environment variable** (highest priority)
2. **`TERM_PROGRAM` environment variable** (inherit current terminal)
3. **Terminal.app** (fallback default)
## Setting Your Preferred Terminal
### Option 1: Set Environment Variable (Recommended)
Add this to your shell config (`~/.zshrc` or `~/.bashrc`):
```bash
# For Ghostty users
export SKILL_SEEKER_TERMINAL="Ghostty"
# For iTerm users
export SKILL_SEEKER_TERMINAL="iTerm"
# For WezTerm users
export SKILL_SEEKER_TERMINAL="WezTerm"
```
Then reload your shell:
```bash
source ~/.zshrc # or source ~/.bashrc
```
### Option 2: Set Per-Session
Set the variable before running the command:
```bash
SKILL_SEEKER_TERMINAL="Ghostty" python3 cli/doc_scraper.py --config configs/react.json --enhance-local
```
### Option 3: Inherit Current Terminal (Automatic)
If you run the script from Ghostty, iTerm2, or WezTerm, it will automatically open the enhancement in the same terminal app.
**Note:** IDE terminals (VS Code, Zed, JetBrains) use unique `TERM_PROGRAM` values, so they fall back to Terminal.app unless you set `SKILL_SEEKER_TERMINAL`.
## Supported Terminals
- **Ghostty** (`ghostty`)
- **iTerm2** (`iTerm.app`)
- **Terminal.app** (`Apple_Terminal`)
- **WezTerm** (`WezTerm`)
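The priority order and the supported `TERM_PROGRAM` values can be combined into a short resolution routine. This is a sketch only: `resolve_terminal` and `KNOWN_TERM_PROGRAMS` are hypothetical names, and the application names on the right-hand side of the mapping are assumptions rather than values confirmed by the script.

```python
import os

# TERM_PROGRAM values from the list above, mapped to assumed application names.
KNOWN_TERM_PROGRAMS = {
    "ghostty": "Ghostty",
    "iTerm.app": "iTerm",
    "Apple_Terminal": "Terminal",
    "WezTerm": "WezTerm",
}


def resolve_terminal(env=None):
    """Return (app_name, reason) following the documented priority order."""
    env = os.environ if env is None else env
    # 1. An explicit override always wins.
    preferred = env.get("SKILL_SEEKER_TERMINAL", "").strip()
    if preferred:
        return preferred, "SKILL_SEEKER_TERMINAL"
    # 2. Otherwise inherit the current terminal if it is a known one.
    term_program = env.get("TERM_PROGRAM", "")
    if term_program in KNOWN_TERM_PROGRAMS:
        return KNOWN_TERM_PROGRAMS[term_program], "TERM_PROGRAM"
    # 3. Unknown or missing TERM_PROGRAM (e.g. an IDE terminal): fall back.
    return "Terminal", "fallback"
```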
## Example Output
When terminal detection works:
```
🚀 Launching Claude Code in new terminal...
Using terminal: Ghostty (from SKILL_SEEKER_TERMINAL)
```
When running from an IDE terminal:
```
🚀 Launching Claude Code in new terminal...
⚠️ unknown TERM_PROGRAM (zed)
→ Using Terminal.app as fallback
```
**Tip:** Set `SKILL_SEEKER_TERMINAL` to avoid the fallback behavior.
## Troubleshooting
**Q: The wrong terminal opens even though I set `SKILL_SEEKER_TERMINAL`**
A: Make sure you reloaded your shell after editing `~/.zshrc`:
```bash
source ~/.zshrc
```
**Q: I want to use a different terminal temporarily**
A: Set the variable inline:
```bash
SKILL_SEEKER_TERMINAL="iTerm" python3 cli/doc_scraper.py --enhance-local ...
```
**Q: Can I use a custom terminal app?**
A: Yes! Just use the app name as it appears in `/Applications/`:
```bash
export SKILL_SEEKER_TERMINAL="Alacritty"
```

View File

@ -0,0 +1,716 @@
# Testing Guide for Skill Seeker
Comprehensive testing documentation for the Skill Seeker project.
## Quick Start
```bash
# Run all tests
python3 run_tests.py
# Run all tests with verbose output
python3 run_tests.py -v
# Run specific test suite
python3 run_tests.py --suite config
python3 run_tests.py --suite features
python3 run_tests.py --suite integration
# Stop on first failure
python3 run_tests.py --failfast
# List all available tests
python3 run_tests.py --list
```
## Test Structure
```
tests/
├── __init__.py # Test package marker
├── test_config_validation.py # Config validation tests (30+ tests)
├── test_scraper_features.py # Core feature tests (25+ tests)
├── test_integration.py # Integration tests (15+ tests)
├── test_pdf_extractor.py # PDF extraction tests (23 tests)
├── test_pdf_scraper.py # PDF workflow tests (18 tests)
└── test_pdf_advanced_features.py # PDF advanced features (26 tests) NEW
```
## Test Suites
### 1. Config Validation Tests (`test_config_validation.py`)
Tests the `validate_config()` function with comprehensive coverage.
**Test Categories:**
- ✅ Valid configurations (minimal and complete)
- ✅ Missing required fields (`name`, `base_url`)
- ✅ Invalid name formats (special characters)
- ✅ Valid name formats (alphanumeric, hyphens, underscores)
- ✅ Invalid URLs (missing protocol)
- ✅ Valid URL protocols (http, https)
- ✅ Selector validation (structure and recommended fields)
- ✅ URL patterns validation (include/exclude lists)
- ✅ Categories validation (structure and keywords)
- ✅ Rate limit validation (range 0-10, type checking)
- ✅ Max pages validation (range 1-10000, type checking)
- ✅ Start URLs validation (format and protocol)
**Example Test:**
```python
def test_valid_complete_config(self):
    """Test valid complete configuration"""
    config = {
        'name': 'godot',
        'base_url': 'https://docs.godotengine.org/en/stable/',
        'selectors': {
            'main_content': 'div[role="main"]',
            'title': 'title',
            'code_blocks': 'pre code'
        },
        'rate_limit': 0.5,
        'max_pages': 500
    }
    errors = validate_config(config)
    self.assertEqual(len(errors), 0)
```
**Running:**
```bash
python3 run_tests.py --suite config -v
```
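To show what several of the rules covered by this suite amount to, here is a simplified stand-in for the validator. It is deliberately named `sketch_validate_config` because it is not the project's `validate_config()`; it covers only required fields, name format, URL protocol, `rate_limit`, and `max_pages`, and the error messages are invented for the example.

```python
import re


def sketch_validate_config(config):
    """Return a list of error strings for a subset of the documented rules."""
    errors = []
    for field in ("name", "base_url"):
        if field not in config:
            errors.append(f"Missing required field: {field}")
    name = config.get("name")
    # Names may contain only letters, digits, hyphens, and underscores.
    if name is not None and not re.fullmatch(r"[A-Za-z0-9_-]+", str(name)):
        errors.append(f"Invalid name: {name!r}")
    base_url = config.get("base_url")
    if base_url is not None and not str(base_url).startswith(("http://", "https://")):
        errors.append(f"Invalid base_url (missing protocol): {base_url!r}")
    rate_limit = config.get("rate_limit")
    if rate_limit is not None and (
        isinstance(rate_limit, bool)
        or not isinstance(rate_limit, (int, float))
        or not 0 <= rate_limit <= 10
    ):
        errors.append("rate_limit must be a number between 0 and 10")
    max_pages = config.get("max_pages")
    if max_pages is not None and (
        isinstance(max_pages, bool)
        or not isinstance(max_pages, int)
        or not 1 <= max_pages <= 10000
    ):
        errors.append("max_pages must be an integer between 1 and 10000")
    return errors
```

Returning a list of errors rather than raising on the first problem matches the way the example test above asserts on `len(errors)`.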
---
### 2. Scraper Features Tests (`test_scraper_features.py`)
Tests core scraper functionality including URL validation, language detection, pattern extraction, and categorization.
**Test Categories:**
**URL Validation:**
- ✅ URL matching include patterns
- ✅ URL matching exclude patterns
- ✅ Different domain rejection
- ✅ No pattern configuration
**Language Detection:**
- ✅ Detection from CSS classes (`language-*`, `lang-*`)
- ✅ Detection from parent elements
- ✅ Python detection (import, from, def)
- ✅ JavaScript detection (const, let, arrow functions)
- ✅ GDScript detection (func, var)
- ✅ C++ detection (#include, int main)
- ✅ Unknown language fallback
**Pattern Extraction:**
- ✅ Extraction with "Example:" marker
- ✅ Extraction with "Usage:" marker
- ✅ Pattern limit (max 5)
**Categorization:**
- ✅ Categorization by URL keywords
- ✅ Categorization by title keywords
- ✅ Categorization by content keywords
- ✅ Fallback to "other" category
- ✅ Empty category removal
**Text Cleaning:**
- ✅ Multiple spaces normalization
- ✅ Newline normalization
- ✅ Tab normalization
- ✅ Whitespace stripping
**Example Test:**
```python
def test_detect_python_from_heuristics(self):
    """Test Python detection from code content"""
    html = '<code>import os\nfrom pathlib import Path</code>'
    elem = BeautifulSoup(html, 'html.parser').find('code')
    lang = self.converter.detect_language(elem, elem.get_text())
    self.assertEqual(lang, 'python')
```
**Running:**
```bash
python3 run_tests.py --suite features -v
```
---
### 3. Integration Tests (`test_integration.py`)
Tests complete workflows and interactions between components.
**Test Categories:**
**Dry-Run Mode:**
- ✅ No directories created in dry-run mode
- ✅ Dry-run flag properly set
- ✅ Normal mode creates directories
**Config Loading:**
- ✅ Load valid configuration files
- ✅ Invalid JSON error handling
- ✅ Nonexistent file error handling
- ✅ Validation errors during load
**Real Config Validation:**
- ✅ Godot config validation
- ✅ React config validation
- ✅ Vue config validation
- ✅ Django config validation
- ✅ FastAPI config validation
- ✅ Steam Economy config validation
**URL Processing:**
- ✅ URL normalization
- ✅ Start URLs fallback to base_url
- ✅ Multiple start URLs handling
**Content Extraction:**
- ✅ Empty content handling
- ✅ Basic content extraction
- ✅ Code sample extraction with language detection
**Example Test:**
```python
def test_dry_run_no_directories_created(self):
    """Test that dry-run mode doesn't create directories"""
    converter = DocToSkillConverter(self.config, dry_run=True)
    data_dir = Path(f"output/{self.config['name']}_data")
    skill_dir = Path(f"output/{self.config['name']}")
    self.assertFalse(data_dir.exists())
    self.assertFalse(skill_dir.exists())
```
**Running:**
```bash
python3 run_tests.py --suite integration -v
```
---
### 4. PDF Extraction Tests (`test_pdf_extractor.py`) **NEW**
Tests PDF content extraction functionality (B1.2-B1.5).
**Note:** These tests require PyMuPDF (`pip install PyMuPDF`). They will be skipped if not installed.
**Test Categories:**
**Language Detection (5 tests):**
- ✅ Python detection with confidence scoring
- ✅ JavaScript detection with confidence
- ✅ C++ detection with confidence
- ✅ Unknown language returns low confidence
- ✅ Confidence always between 0 and 1
**Syntax Validation (5 tests):**
- ✅ Valid Python syntax validation
- ✅ Invalid Python indentation detection
- ✅ Unbalanced brackets detection
- ✅ Valid JavaScript syntax validation
- ✅ Natural language fails validation
**Quality Scoring (4 tests):**
- ✅ Quality score between 0 and 10
- ✅ High-quality code gets good score (>7)
- ✅ Low-quality code gets low score (<4)
- ✅ Quality considers multiple factors
**Chapter Detection (4 tests):**
- ✅ Detect chapters with numbers
- ✅ Detect uppercase chapter headers
- ✅ Detect section headings (e.g., "2.1")
- ✅ Normal text not detected as chapter
**Code Block Merging (2 tests):**
- ✅ Merge code blocks split across pages
- ✅ Don't merge different languages
**Code Detection Methods (2 tests):**
- ✅ Pattern-based detection (keywords)
- ✅ Indent-based detection
**Quality Filtering (1 test):**
- ✅ Filter by minimum quality threshold
**Example Test:**
```python
def test_detect_python_with_confidence(self):
    """Test Python detection returns language and confidence"""
    extractor = self.PDFExtractor.__new__(self.PDFExtractor)
    code = "def hello():\n    print('world')\n    return True"

    language, confidence = extractor.detect_language_from_code(code)

    self.assertEqual(language, "python")
    self.assertGreater(confidence, 0.7)
    self.assertLessEqual(confidence, 1.0)
```
**Running:**
```bash
python3 -m pytest tests/test_pdf_extractor.py -v
```
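One of the syntax-validation checks listed above is unbalanced-bracket detection. The following is a minimal sketch of that idea under stated assumptions, not the extractor's actual implementation; the function name `has_balanced_brackets` is hypothetical.

```python
def has_balanced_brackets(code):
    """Return True if (), [] and {} are properly nested in `code`.

    Simplification: brackets inside string literals or comments are
    counted too, which a real validator would need to handle.
    """
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in code:
        if ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            # A closer must match the most recent unclosed opener.
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack  # leftover openers mean the code is unbalanced


print(has_balanced_brackets("print('world')"))   # True
print(has_balanced_brackets("items[0](x"))       # False
```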
---
### 5. PDF Workflow Tests (`test_pdf_scraper.py`) **NEW**
Tests PDF to skill conversion workflow (B1.6).
**Note:** These tests require PyMuPDF (`pip install PyMuPDF`). They will be skipped if not installed.
**Test Categories:**
**PDFToSkillConverter (3 tests):**
- ✅ Initialization with name and PDF path
- ✅ Initialization with config file
- ✅ Requires name or config_path
**Categorization (3 tests):**
- ✅ Categorize by keywords
- ✅ Categorize by chapters
- ✅ Handle missing chapters
**Skill Building (3 tests):**
- ✅ Create required directory structure
- ✅ Create SKILL.md with metadata
- ✅ Create reference files for categories
**Code Block Handling (2 tests):**
- ✅ Include code blocks in references
- ✅ Prefer high-quality code
**Image Handling (2 tests):**
- ✅ Save images to assets directory
- ✅ Reference images in markdown
**Error Handling (3 tests):**
- ✅ Handle missing PDF files
- ✅ Handle invalid config JSON
- ✅ Handle missing required config fields
**JSON Workflow (2 tests):**
- ✅ Load from extracted JSON
- ✅ Build from JSON without extraction
**Example Test:**
```python
def test_build_skill_creates_structure(self):
    """Test that build_skill creates required directory structure"""
    converter = self.PDFToSkillConverter(
        name="test_skill",
        pdf_path="test.pdf",
        output_dir=self.temp_dir
    )

    converter.extracted_data = {
        "pages": [{"page_number": 1, "text": "Test", "code_blocks": [], "images": []}],
        "total_pages": 1
    }
    converter.categories = {"test": [converter.extracted_data["pages"][0]]}

    converter.build_skill()

    skill_dir = Path(self.temp_dir) / "test_skill"
    self.assertTrue(skill_dir.exists())
    self.assertTrue((skill_dir / "references").exists())
    self.assertTrue((skill_dir / "scripts").exists())
    self.assertTrue((skill_dir / "assets").exists())
```
**Running:**
```bash
python3 -m pytest tests/test_pdf_scraper.py -v
```
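The note above says the PDF tests are skipped when PyMuPDF is missing. One common way to get that behavior with `unittest` is sketched here; the class name and the availability check are assumptions for illustration, and the real test files may detect the dependency differently.

```python
import importlib.util
import unittest

# PyMuPDF is imported as `fitz`; find_spec checks availability without importing it.
HAS_PYMUPDF = importlib.util.find_spec("fitz") is not None


@unittest.skipUnless(HAS_PYMUPDF, "PyMuPDF not installed (pip install PyMuPDF)")
class TestNeedsPyMuPDF(unittest.TestCase):
    """Placeholder class; PDF-dependent assertions would go in its methods."""

    def test_dependency_available(self):
        # Only reached when the skip condition above is satisfied.
        self.assertTrue(HAS_PYMUPDF)
```

With this pattern a missing dependency is reported as a skip rather than a failure, so the rest of the suite still passes.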
---
### 6. PDF Advanced Features Tests (`test_pdf_advanced_features.py`) **NEW**
Tests advanced PDF features (Priority 2 & 3).
**Note:** These tests require PyMuPDF (`pip install PyMuPDF`). OCR tests also require pytesseract and Pillow. They will be skipped if not installed.
**Test Categories:**
**OCR Support (5 tests):**
- ✅ OCR flag initialization
- ✅ OCR disabled behavior
- ✅ OCR only triggers for minimal text
- ✅ Warning when pytesseract unavailable
- ✅ OCR extraction triggered correctly
**Password Protection (4 tests):**
- ✅ Password parameter initialization
- ✅ Encrypted PDF detection
- ✅ Wrong password handling
- ✅ Missing password error
**Table Extraction (5 tests):**
- ✅ Table extraction flag initialization
- ✅ No extraction when disabled
- ✅ Basic table extraction
- ✅ Multiple tables per page
- ✅ Error handling during extraction
**Caching (5 tests):**
- ✅ Cache initialization
- ✅ Set and get cached values
- ✅ Cache miss returns None
- ✅ Caching can be disabled
- ✅ Cache overwrite
**Parallel Processing (4 tests):**
- ✅ Parallel flag initialization
- ✅ Disabled by default
- ✅ Worker count auto-detection
- ✅ Custom worker count
**Integration (3 tests):**
- ✅ Full initialization with all features
- ✅ Various feature combinations
- ✅ Page data includes tables
**Example Test:**
```python
def test_table_extraction_basic(self):
    """Test basic table extraction"""
    extractor = self.PDFExtractor.__new__(self.PDFExtractor)
    extractor.extract_tables = True
    extractor.verbose = False

    # Create mock table
    mock_table = Mock()
    mock_table.extract.return_value = [
        ["Header 1", "Header 2", "Header 3"],
        ["Data 1", "Data 2", "Data 3"]
    ]
    mock_table.bbox = (0, 0, 100, 100)

    mock_tables = Mock()
    mock_tables.tables = [mock_table]

    mock_page = Mock()
    mock_page.find_tables.return_value = mock_tables

    tables = extractor.extract_tables_from_page(mock_page)

    self.assertEqual(len(tables), 1)
    self.assertEqual(tables[0]['row_count'], 2)
    self.assertEqual(tables[0]['col_count'], 3)
```
**Running:**
```bash
python3 -m pytest tests/test_pdf_advanced_features.py -v
```
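The caching tests above check four behaviors: set and get, a miss returning `None`, disabling the cache, and overwriting a value. A minimal sketch of a cache with those properties follows; `SimpleCache` is a hypothetical stand-in, not the extractor's actual cache.

```python
class SimpleCache:
    """Tiny in-memory cache (hypothetical; for illustration only)."""

    def __init__(self, enabled=True):
        self.enabled = enabled
        self._store = {}

    def set(self, key, value):
        if self.enabled:  # a disabled cache silently ignores writes
            self._store[key] = value

    def get(self, key):
        if not self.enabled:
            return None
        return self._store.get(key)  # a miss returns None


cache = SimpleCache()
cache.set("page_1", "first")
cache.set("page_1", "second")   # overwrite the earlier value
print(cache.get("page_1"))      # second
print(cache.get("missing"))     # None

disabled = SimpleCache(enabled=False)
disabled.set("page_1", "first")
print(disabled.get("page_1"))   # None
```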
---
## Test Runner Features
The custom test runner (`run_tests.py`) provides:
### Colored Output
- 🟢 Green for passing tests
- 🔴 Red for failures and errors
- 🟡 Yellow for skipped tests
### Detailed Summary
```
======================================================================
TEST SUMMARY
======================================================================
Total Tests: 70
✓ Passed: 68
✗ Failed: 2
⊘ Skipped: 0
Success Rate: 97.1%
Test Breakdown by Category:
TestConfigValidation: 28/30 passed
TestURLValidation: 6/6 passed
TestLanguageDetection: 10/10 passed
TestPatternExtraction: 3/3 passed
TestCategorization: 5/5 passed
TestDryRunMode: 3/3 passed
TestConfigLoading: 4/4 passed
TestRealConfigFiles: 6/6 passed
TestContentExtraction: 3/3 passed
======================================================================
```
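The figures in the sample summary are consistent with each other, which can be checked with a few lines of arithmetic. The `success_rate` helper is hypothetical; the per-class numbers are copied from the sample output above.

```python
def success_rate(passed, total):
    """Percentage of passing tests, rounded to one decimal place."""
    if total == 0:
        return 0.0
    return round(passed / total * 100, 1)


# (passed, total) per test class, copied from the sample summary
breakdown = {
    "TestConfigValidation": (28, 30),
    "TestURLValidation": (6, 6),
    "TestLanguageDetection": (10, 10),
    "TestPatternExtraction": (3, 3),
    "TestCategorization": (5, 5),
    "TestDryRunMode": (3, 3),
    "TestConfigLoading": (4, 4),
    "TestRealConfigFiles": (6, 6),
    "TestContentExtraction": (3, 3),
}

passed = sum(p for p, _ in breakdown.values())
total = sum(t for _, t in breakdown.values())
print(passed, total, success_rate(passed, total))  # 68 70 97.1
```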
### Command-Line Options
```bash
# Verbose output (show each test name)
python3 run_tests.py -v
# Quiet output (minimal)
python3 run_tests.py -q
# Stop on first failure
python3 run_tests.py --failfast
# Run specific suite
python3 run_tests.py --suite config
# List all tests
python3 run_tests.py --list
```
---
## Running Individual Tests
### Run Single Test File
```bash
python3 -m unittest tests.test_config_validation
python3 -m unittest tests.test_scraper_features
python3 -m unittest tests.test_integration
```
### Run Single Test Class
```bash
python3 -m unittest tests.test_config_validation.TestConfigValidation
python3 -m unittest tests.test_scraper_features.TestLanguageDetection
```
### Run Single Test Method
```bash
python3 -m unittest tests.test_config_validation.TestConfigValidation.test_valid_complete_config
python3 -m unittest tests.test_scraper_features.TestLanguageDetection.test_detect_python_from_heuristics
```
---
## Test Coverage
### Current Coverage
| Component | Tests | Coverage |
|-----------|-------|----------|
| Config Validation | 30+ | 100% |
| URL Validation | 6 | 95% |
| Language Detection | 10 | 90% |
| Pattern Extraction | 3 | 85% |
| Categorization | 5 | 90% |
| Text Cleaning | 4 | 100% |
| Dry-Run Mode | 3 | 100% |
| Config Loading | 4 | 95% |
| Real Configs | 6 | 100% |
| Content Extraction | 3 | 80% |
| **PDF Extraction** | **23** | **90%** |
| **PDF Workflow** | **18** | **85%** |
| **PDF Advanced Features** | **26** | **95%** |
**Total: 142 tests (75 core tests + 67 PDF tests)**
**Note:** PDF tests (67 total) require PyMuPDF and will be skipped if not installed. When PyMuPDF is available, all 142 tests run.
### Not Yet Covered
- Network operations (actual scraping)
- Enhancement scripts (`enhance_skill.py`, `enhance_skill_local.py`)
- Package creation (`package_skill.py`)
- Interactive mode
- SKILL.md generation
- Reference file creation
- PDF extraction with real PDF files (tests use mocked data)
---
## Writing New Tests
### Test Template
```python
#!/usr/bin/env python3
"""
Test suite for [feature name]
Tests [description of what's being tested]
"""

import sys
import os
import unittest

# Add parent directory to path
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from doc_scraper import DocToSkillConverter


class TestYourFeature(unittest.TestCase):
    """Test [feature] functionality"""

    def setUp(self):
        """Set up test fixtures"""
        self.config = {
            'name': 'test',
            'base_url': 'https://example.com/',
            'selectors': {
                'main_content': 'article',
                'title': 'h1',
                'code_blocks': 'pre code'
            },
            'rate_limit': 0.1,
            'max_pages': 10
        }
        self.converter = DocToSkillConverter(self.config, dry_run=True)

    def tearDown(self):
        """Clean up after tests"""
        pass

    def test_your_feature(self):
        """Test description"""
        # Arrange
        test_input = "something"
        expected_value = "expected result"  # placeholder: replace with the real expected value

        # Act
        result = self.converter.some_method(test_input)

        # Assert
        self.assertEqual(result, expected_value)


if __name__ == '__main__':
    unittest.main()
```
### Best Practices
1. **Use descriptive test names**: `test_valid_name_formats` not `test1`
2. **Follow AAA pattern**: Arrange, Act, Assert
3. **One assertion per test** when possible
4. **Test edge cases**: empty inputs, invalid inputs, boundary values
5. **Use setUp/tearDown**: for common initialization and cleanup
6. **Mock external dependencies**: don't make real network calls
7. **Keep tests independent**: tests should not depend on each other
8. **Use dry_run=True**: for converter tests to avoid file creation
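
Practices 2 and 6 (AAA pattern, mocking external dependencies) can be combined in one short test. The sketch below is illustrative only: `fetch_status` and its `client` parameter are hypothetical and not part of the scraper.

```python
import unittest
from unittest.mock import Mock


def fetch_status(client, url):
    """Hypothetical helper: return the HTTP status code for `url`."""
    response = client.get(url, timeout=5)
    return response.status_code


class TestFetchStatus(unittest.TestCase):
    def test_returns_status_without_network(self):
        # Arrange: a stand-in client whose get() returns a canned response
        client = Mock()
        client.get.return_value = Mock(status_code=200)

        # Act
        status = fetch_status(client, "https://example.com/")

        # Assert
        self.assertEqual(status, 200)
        client.get.assert_called_once_with("https://example.com/", timeout=5)
```

Passing the client in as a parameter keeps the test independent of the network and makes the expected call easy to verify.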
---
## Continuous Integration
### GitHub Actions (Future)
```yaml
name: Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.x'
      - run: pip install requests beautifulsoup4
      - run: python3 run_tests.py
```
---
## Troubleshooting
### Tests Fail with Import Errors
```bash
# Make sure you're in the repository root
cd /path/to/Skill_Seekers
# Run tests from root directory
python3 run_tests.py
```
### Tests Create Output Directories
```bash
# Clean up test artifacts
rm -rf output/test-*
# Make sure tests use dry_run=True
# Check test setUp methods
```
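For tests that genuinely need to write files, one way to keep `output/` clean is to write into a temporary directory that is removed automatically. This is a general `unittest` pattern sketched under the assumption that the code under test accepts an output directory; the class and file names are placeholders.

```python
import tempfile
import unittest
from pathlib import Path


class TestWithTempOutput(unittest.TestCase):
    def setUp(self):
        # Each test gets its own directory, deleted even if the test fails.
        self._tmp = tempfile.TemporaryDirectory()
        self.addCleanup(self._tmp.cleanup)
        self.output_dir = Path(self._tmp.name)

    def test_writes_stay_in_temp_dir(self):
        target = self.output_dir / "test-skill" / "SKILL.md"
        target.parent.mkdir(parents=True)
        target.write_text("# Test\n", encoding="utf-8")

        self.assertTrue(target.exists())
```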
### Specific Test Keeps Failing
```bash
# Run only that test with verbose output
python3 -m unittest tests.test_config_validation.TestConfigValidation.test_name -v
# Check the error message carefully
# Verify test expectations match implementation
```
---
## Performance
Test execution times for the core (non-PDF) suites:
- **Config Validation**: ~0.1 seconds (30 tests)
- **Scraper Features**: ~0.3 seconds (25 tests)
- **Integration Tests**: ~0.5 seconds (15 tests)
- **Total**: ~1 second (70 tests)
---
## Contributing Tests
When adding new features:
1. Write tests **before** implementing the feature (TDD)
2. Ensure tests cover:
- ✅ Happy path (valid inputs)
- ✅ Edge cases (empty, null, boundary values)
- ✅ Error cases (invalid inputs)
3. Run tests before committing:
```bash
python3 run_tests.py
```
4. Aim for >80% coverage for new code
---
## Additional Resources
- **unittest documentation**: https://docs.python.org/3/library/unittest.html
- **pytest**: https://pytest.org/ (used in the PDF test commands in this guide; more powerful than unittest, but requires installation)
- **Test-Driven Development**: https://en.wikipedia.org/wiki/Test-driven_development
---
## Summary
- **142 comprehensive tests** covering all major features (75 + 67 PDF)
- **PDF support testing** with 67 tests for B1 tasks + Priority 2 & 3
- **Colored test runner** with detailed summaries
- **Fast execution** (~1 second for full suite)
- **Easy to extend** with clear patterns and templates
- **Good coverage** of critical paths
**PDF Tests Status:**
- 23 tests for PDF extraction (language detection, syntax validation, quality scoring, chapter detection)
- 18 tests for PDF workflow (initialization, categorization, skill building, code/image handling)
- **26 tests for advanced features (OCR, passwords, tables, parallel, caching)** NEW!
- Tests are skipped gracefully when PyMuPDF is not installed
- Full test coverage when PyMuPDF + optional dependencies are available
**Advanced PDF Features Tested:**
- ✅ OCR support for scanned PDFs (5 tests)
- ✅ Password-protected PDFs (4 tests)
- ✅ Table extraction (5 tests)
- ✅ Parallel processing (4 tests)
- ✅ Caching (5 tests)
- ✅ Integration (3 tests)
Run tests frequently to catch bugs early! 🚀


@ -0,0 +1,342 @@
# Testing MCP Server in Claude Code
This guide shows you how to test the Skill Seeker MCP server **through actual Claude Code** using the MCP protocol (not just Python function calls).
## Important: What We Tested vs What You Need to Test
### What I Tested (Python Direct Calls) ✅
I tested the MCP server **functions** by calling them directly with Python:
```python
await server.list_configs_tool({})
await server.generate_config_tool({...})
```
This verified the **code works**, but didn't test the **MCP protocol integration**.
### What You Need to Test (Actual MCP Protocol) 🎯
You need to test via **Claude Code** using the MCP protocol:
```
In Claude Code:
> List all available configs
> mcp__skill-seeker__list_configs
```
This verifies the **full integration** works.
## Setup Instructions
### Step 1: Configure Claude Code
Create the MCP configuration file:
```bash
# Create config directory
mkdir -p ~/.config/claude-code
# Create/edit MCP configuration
nano ~/.config/claude-code/mcp.json
```
Add this configuration (replace `/path/to/` with your actual path):
```json
{
  "mcpServers": {
    "skill-seeker": {
      "command": "python3",
      "args": [
        "/path/to/Skill_Seekers/skill_seeker_mcp/server.py"
      ],
      "cwd": "/path/to/Skill_Seekers"
    }
  }
}
```
Or use the setup script:
```bash
./setup_mcp.sh
```
### Step 2: Restart Claude Code
**IMPORTANT:** Completely quit and restart Claude Code (don't just close the window).
### Step 3: Verify MCP Server Loaded
In Claude Code, check if the server loaded:
```
Show me all available MCP tools
```
You should see 6 tools with the prefix `mcp__skill-seeker__`:
- `mcp__skill-seeker__list_configs`
- `mcp__skill-seeker__generate_config`
- `mcp__skill-seeker__validate_config`
- `mcp__skill-seeker__estimate_pages`
- `mcp__skill-seeker__scrape_docs`
- `mcp__skill-seeker__package_skill`
## Testing All 6 MCP Tools
### Test 1: list_configs
**In Claude Code, type:**
```
List all available Skill Seeker configs
```
**Or explicitly:**
```
Use mcp__skill-seeker__list_configs
```
**Expected Output:**
```
📋 Available Configs:
• django.json
• fastapi.json
• godot.json
• react.json
• vue.json
...
```
### Test 2: generate_config
**In Claude Code, type:**
```
Generate a config for Astro documentation at https://docs.astro.build with max 15 pages
```
**Or explicitly:**
```
Use mcp__skill-seeker__generate_config with:
- name: astro-test
- url: https://docs.astro.build
- description: Astro framework testing
- max_pages: 15
```
**Expected Output:**
```
✅ Config created: configs/astro-test.json
```
### Test 3: validate_config
**In Claude Code, type:**
```
Validate the astro-test config
```
**Or explicitly:**
```
Use mcp__skill-seeker__validate_config for configs/astro-test.json
```
**Expected Output:**
```
✅ Config is valid!
Name: astro-test
Base URL: https://docs.astro.build
Max pages: 15
```
### Test 4: estimate_pages
**In Claude Code, type:**
```
Estimate pages for the astro-test config
```
**Or explicitly:**
```
Use mcp__skill-seeker__estimate_pages for configs/astro-test.json
```
**Expected Output:**
```
📊 ESTIMATION RESULTS
Estimated Total: ~25 pages
Recommended max_pages: 75
```
### Test 5: scrape_docs
**In Claude Code, type:**
```
Scrape docs using the astro-test config
```
**Or explicitly:**
```
Use mcp__skill-seeker__scrape_docs with configs/astro-test.json
```
**Expected Output:**
```
✅ Skill built: output/astro-test/
Scraped X pages
Created Y categories
```
### Test 6: package_skill
**In Claude Code, type:**
```
Package the astro-test skill
```
**Or explicitly:**
```
Use mcp__skill-seeker__package_skill for output/astro-test/
```
**Expected Output:**
```
✅ Package created: output/astro-test.zip
Size: X KB
```
## Complete Workflow Test
Test the entire workflow in Claude Code with natural language:
```
Step 1:
> List all available configs
Step 2:
> Generate config for Svelte at https://svelte.dev/docs with description "Svelte framework" and max 20 pages
Step 3:
> Validate configs/svelte.json
Step 4:
> Estimate pages for configs/svelte.json
Step 5:
> Scrape docs using configs/svelte.json
Step 6:
> Package skill at output/svelte/
```
Expected result: `output/svelte.zip` ready to upload to Claude!
## Troubleshooting
### Issue: Tools Not Appearing
**Symptoms:**
- Claude Code doesn't recognize skill-seeker commands
- No `mcp__skill-seeker__` tools listed
**Solutions:**
1. Check configuration exists:
```bash
cat ~/.config/claude-code/mcp.json
```
2. Verify server can start:
```bash
cd /path/to/Skill_Seekers
python3 skill_seeker_mcp/server.py
# Should start without errors (Ctrl+C to exit)
```
3. Check dependencies installed:
```bash
pip3 list | grep mcp
# Should show: mcp x.x.x
```
4. Completely restart Claude Code (quit and reopen)
5. Check Claude Code logs:
- macOS: `~/Library/Logs/Claude Code/`
- Linux: `~/.config/claude-code/logs/`
### Issue: "Permission Denied"
```bash
chmod +x skill_seeker_mcp/server.py
```
### Issue: "Module Not Found"
```bash
pip3 install -r skill_seeker_mcp/requirements.txt
pip3 install requests beautifulsoup4
```
## Verification Checklist
Use this checklist to verify MCP integration:
- [ ] Configuration file created at `~/.config/claude-code/mcp.json`
- [ ] Repository path in config is absolute and correct
- [ ] Python dependencies installed (`mcp`, `requests`, `beautifulsoup4`)
- [ ] Server starts without errors when run manually
- [ ] Claude Code completely restarted (quit and reopened)
- [ ] Tools appear when asking "show me all MCP tools"
- [ ] Tools have `mcp__skill-seeker__` prefix
- [ ] Can list configs successfully
- [ ] Can generate a test config
- [ ] Can scrape and package a small skill
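
The first two checklist items (config file present, repository path absolute) can be checked mechanically. The sketch below assumes the `mcp.json` layout shown in Step 1; `check_mcp_config` is a hypothetical helper, not part of Skill Seekers or Claude Code.

```python
import json
from pathlib import PurePosixPath


def check_mcp_config(config, server_name="skill-seeker"):
    """Return a list of problems found in a parsed mcp.json (empty if none)."""
    server = config.get("mcpServers", {}).get(server_name)
    if server is None:
        return [f"no '{server_name}' entry under mcpServers"]

    problems = []
    if not server.get("command"):
        problems.append("missing 'command'")
    for arg in server.get("args", []):
        # Script paths must be absolute so the server starts from any directory.
        if arg.endswith(".py") and not PurePosixPath(arg).is_absolute():
            problems.append(f"script path is not absolute: {arg}")
    cwd = server.get("cwd")
    if cwd and not PurePosixPath(cwd).is_absolute():
        problems.append(f"cwd is not absolute: {cwd}")
    return problems


example = json.loads("""
{
  "mcpServers": {
    "skill-seeker": {
      "command": "python3",
      "args": ["/path/to/Skill_Seekers/skill_seeker_mcp/server.py"],
      "cwd": "/path/to/Skill_Seekers"
    }
  }
}
""")
print(check_mcp_config(example))  # []
```

To check a real file, load it with `json.load` and pass the result to the function; the remaining checklist items still need to be verified inside Claude Code.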
## What Makes This Different from My Tests
| What I Tested | What You Should Test |
|---------------|---------------------|
| Python function calls | Claude Code MCP protocol |
| `await server.list_configs_tool({})` | Natural language in Claude Code |
| Direct Python imports | Full MCP server integration |
| Validates code works | Validates Claude Code integration |
| Quick unit testing | Real-world usage testing |
## Success Criteria
✅ **MCP Integration is Working When:**
1. You can ask Claude Code to "list all available configs"
2. Claude Code responds with the actual config list
3. You can generate, validate, scrape, and package skills
4. All through natural language commands in Claude Code
5. No Python code needed - just conversation!
## Next Steps After Successful Testing
Once MCP integration works:
1. **Create your first skill:**
```
> Generate config for TailwindCSS at https://tailwindcss.com/docs
> Scrape docs using configs/tailwind.json
> Package skill at output/tailwind/
```
2. **Upload to Claude:**
- Take the generated `.zip` file
- Upload to Claude.ai
- Start using your new skill!
3. **Share feedback:**
- Report any issues on GitHub
- Share successful skills created
- Suggest improvements
## Reference
- **Full Setup Guide:** [docs/MCP_SETUP.md](docs/MCP_SETUP.md)
- **MCP Documentation:** [mcp/README.md](mcp/README.md)
- **Main README:** [README.md](README.md)
- **Setup Script:** `./setup_mcp.sh`
---
**Important:** This document is for testing the **actual MCP protocol integration** with Claude Code, not just the Python functions. Make sure you're testing through Claude Code's UI, not Python scripts!


@ -0,0 +1,633 @@
# Unified Multi-Source Scraping
**Version:** 2.0 (Feature complete as of October 2025)
## Overview
Unified multi-source scraping allows you to combine knowledge from multiple sources into a single comprehensive Claude skill. Instead of choosing between documentation, GitHub repositories, or PDF manuals, you can now extract and intelligently merge information from all of them.
## Why Unified Scraping?
**The Problem**: Documentation and code often drift apart over time. Official docs might be outdated, might omit features that exist in code, or might describe features that have since been removed. Scraping docs and code separately creates two incomplete skills.
**The Solution**: Unified scraping:
- Extracts information from multiple sources (documentation, GitHub, PDFs)
- **Detects conflicts** between documentation and actual code implementation
- **Intelligently merges** conflicting information with transparency
- **Highlights discrepancies** with inline warnings (⚠️)
- Creates a single, comprehensive skill that shows the complete picture
## Quick Start
### 1. Create a Unified Config
Create a config file with multiple sources:
```json
{
  "name": "react",
  "description": "Complete React knowledge from docs + codebase",
  "merge_mode": "rule-based",
  "sources": [
    {
      "type": "documentation",
      "base_url": "https://react.dev/",
      "extract_api": true,
      "max_pages": 200
    },
    {
      "type": "github",
      "repo": "facebook/react",
      "include_code": true,
      "code_analysis_depth": "surface",
      "max_issues": 100
    }
  ]
}
```
### 2. Scrape and Build
```bash
python3 cli/unified_scraper.py --config configs/react_unified.json
```
The tool will:
1. ✅ **Phase 1**: Scrape all sources (docs + GitHub)
2. ✅ **Phase 2**: Detect conflicts between sources
3. ✅ **Phase 3**: Merge conflicts intelligently
4. ✅ **Phase 4**: Build unified skill with conflict transparency
### 3. Package and Upload
```bash
python3 cli/package_skill.py output/react/
```
## Config Format
### Unified Config Structure
```json
{
  "name": "skill-name",
  "description": "When to use this skill",
  "merge_mode": "rule-based|claude-enhanced",
  "sources": [
    {
      "type": "documentation|github|pdf",
      ...source-specific fields...
    }
  ]
}
```
### Documentation Source
```json
{
  "type": "documentation",
  "base_url": "https://docs.example.com/",
  "extract_api": true,
  "selectors": {
    "main_content": "article",
    "title": "h1",
    "code_blocks": "pre code"
  },
  "url_patterns": {
    "include": [],
    "exclude": ["/blog/"]
  },
  "categories": {
    "getting_started": ["intro", "tutorial"],
    "api": ["api", "reference"]
  },
  "rate_limit": 0.5,
  "max_pages": 200
}
```
### GitHub Source
```json
{
  "type": "github",
  "repo": "owner/repo",
  "github_token": "ghp_...",
  "include_issues": true,
  "max_issues": 100,
  "include_changelog": true,
  "include_releases": true,
  "include_code": true,
  "code_analysis_depth": "surface|deep|full",
  "file_patterns": [
    "src/**/*.js",
    "lib/**/*.ts"
  ]
}
```
**Code Analysis Depth**:
- `surface` (default): Basic structure, no code analysis
- `deep`: Extract class/function signatures, parameters, return types
- `full`: Complete AST analysis (expensive)
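
To make the `deep` level more concrete, here is a minimal sketch of extracting function names, parameters, and return annotations from Python source with the standard `ast` module. It is an illustration only: `extract_signatures` is a hypothetical helper, the project's `code_analyzer.py` may work differently, and `ast.unparse` requires Python 3.9 or later.

```python
import ast


def extract_signatures(source):
    """Map each function name to its parameter names and return annotation."""
    signatures = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = node.args
            params = [a.arg for a in args.posonlyargs + args.args + args.kwonlyargs]
            returns = ast.unparse(node.returns) if node.returns else None
            signatures[node.name] = {"params": params, "returns": returns}
    return signatures


sample = (
    "def move_local_x(self, delta: float, snap: bool = False) -> None:\n"
    "    pass\n"
)
print(extract_signatures(sample))
# {'move_local_x': {'params': ['self', 'delta', 'snap'], 'returns': 'None'}}
```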
### PDF Source
```json
{
  "type": "pdf",
  "path": "/path/to/manual.pdf",
  "extract_tables": false,
  "ocr": false,
  "password": "optional-password"
}
```
## Conflict Detection
The unified scraper automatically detects 4 types of conflicts:
### 1. Missing in Documentation
**Severity**: Medium
**Description**: API exists in code but is not documented
**Example**:
```python
# Code has this method:
def move_local_x(self, delta: float, snap: bool = False) -> None:
"""Move node along local X axis"""
# But documentation doesn't mention it
```
**Suggestion**: Add documentation for this API
### 2. Missing in Code
**Severity**: High
**Description**: API is documented but not found in codebase
**Example**:
```python
# Docs say:
def rotate(angle: float) -> None
# But code doesn't have this function
```
**Suggestion**: Update documentation to remove this API, or add it to codebase
### 3. Signature Mismatch
**Severity**: Medium-High
**Description**: API exists in both but signatures differ
**Example**:
```python
# Docs say:
def move_local_x(delta: float)
# Code has:
def move_local_x(delta: float, snap: bool = False)
```
**Suggestion**: Update documentation to match actual signature
### 4. Description Mismatch
**Severity**: Low
**Description**: Different descriptions/docstrings
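
The first three conflict types can be sketched as a comparison of two name-to-parameters mappings. This is a simplified illustration, not the logic in `conflict_detector.py`: the function name, the input shape, and the `signature_mismatch` label are assumptions (only `missing_in_docs` and `missing_in_code` appear elsewhere in this guide), and description mismatches are left out.

```python
def detect_conflicts(docs_apis, code_apis):
    """Compare two {name: [param, ...]} mappings and list conflicts."""
    conflicts = []
    for name in sorted(set(docs_apis) | set(code_apis)):
        if name not in docs_apis:
            conflicts.append({"api": name, "type": "missing_in_docs", "severity": "medium"})
        elif name not in code_apis:
            conflicts.append({"api": name, "type": "missing_in_code", "severity": "high"})
        elif docs_apis[name] != code_apis[name]:
            conflicts.append({"api": name, "type": "signature_mismatch", "severity": "medium-high"})
    return conflicts


docs = {"move_local_x": ["delta"], "rotate": ["angle"]}
code = {"move_local_x": ["delta", "snap"], "move_local_y": ["delta"]}

for conflict in detect_conflicts(docs, code):
    print(conflict["api"], conflict["type"], conflict["severity"])
# move_local_x signature_mismatch medium-high
# move_local_y missing_in_docs medium
# rotate missing_in_code high
```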
## Merge Modes
### Rule-Based Merge (Default)
Fast, deterministic merging using predefined rules:
1. **If API only in docs** → Include with `[DOCS_ONLY]` tag
2. **If API only in code** → Include with `[UNDOCUMENTED]` tag
3. **If both match perfectly** → Include normally
4. **If conflict exists** → Prefer code signature, keep docs description
**When to use**:
- Fast merging (< 1 second)
- Automated workflows
- You don't need human oversight
**Example**:
```bash
python3 cli/unified_scraper.py --config config.json --merge-mode rule-based
```
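The four rules above can be sketched as a single function applied to one API at a time. This is an assumption-laden illustration rather than the code in `merge_sources.py`: the entry shape (`signature`, `description`) and the `[CONFLICT]` tag are hypothetical, while `[DOCS_ONLY]` and `[UNDOCUMENTED]` come from the rules themselves.

```python
def merge_api(name, docs_entry, code_entry):
    """Apply the four rule-based merge rules to one API.

    Each entry is a dict with 'signature' and 'description', or None.
    """
    if docs_entry is None and code_entry is None:
        raise ValueError(f"{name}: no source describes this API")
    if code_entry is None:                      # rule 1: only in docs
        return {"name": name, "tag": "[DOCS_ONLY]", **docs_entry}
    if docs_entry is None:                      # rule 2: only in code
        return {"name": name, "tag": "[UNDOCUMENTED]", **code_entry}
    if docs_entry["signature"] == code_entry["signature"]:
        return {"name": name, "tag": None, **docs_entry}   # rule 3: match
    return {                                    # rule 4: conflict
        "name": name,
        "tag": "[CONFLICT]",
        "signature": code_entry["signature"],       # prefer code signature
        "description": docs_entry["description"],   # keep docs description
    }


merged = merge_api(
    "move_local_x",
    {"signature": "move_local_x(delta)", "description": "Move along local X."},
    {"signature": "move_local_x(delta, snap=False)", "description": ""},
)
print(merged["tag"], merged["signature"])
# [CONFLICT] move_local_x(delta, snap=False)
```

Because every branch is a pure function of its inputs, the merge stays deterministic, which is what makes this mode suitable for automated workflows.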
### Claude-Enhanced Merge
AI-powered reconciliation using local Claude Code:
1. Opens new terminal with Claude Code
2. Provides conflict context and instructions
3. Claude analyzes and creates reconciled API reference
4. Human can review and adjust before finalizing
**When to use**:
- Complex conflicts requiring judgment
- You want highest quality merge
- You have time for human oversight
**Example**:
```bash
python3 cli/unified_scraper.py --config config.json --merge-mode claude-enhanced
```
## Skill Output Structure
The unified scraper creates this structure:
```
output/skill-name/
├── SKILL.md # Main skill file with merged APIs
├── references/
│ ├── documentation/ # Documentation references
│ │ └── index.md
│ ├── github/ # GitHub references
│ │ ├── README.md
│ │ ├── issues.md
│ │ └── releases.md
│ ├── pdf/ # PDF references (if applicable)
│ │ └── index.md
│ ├── api/ # Merged API reference
│ │ └── merged_api.md
│ └── conflicts.md # Detailed conflict report
├── scripts/ # Empty (for user scripts)
└── assets/ # Empty (for user assets)
```
### SKILL.md Format
```markdown
# React
Complete React knowledge base combining official documentation and React codebase insights.
## 📚 Sources
This skill combines knowledge from multiple sources:
- ✅ **Documentation**: https://react.dev/
- Pages: 200
- ✅ **GitHub Repository**: facebook/react
- Code Analysis: surface
- Issues: 100
## ⚠️ Data Quality
**5 conflicts detected** between sources.
**Conflict Breakdown:**
- missing_in_docs: 3
- missing_in_code: 2
See `references/conflicts.md` for detailed conflict information.
## 🔧 API Reference
*Merged from documentation and code analysis*
### ✅ Verified APIs
*Documentation and code agree*
#### `useState(initialValue)`
...
### ⚠️ APIs with Conflicts
*Documentation and code differ*
#### `useEffect(callback, deps?)`
⚠️ **Conflict**: Documentation signature differs from code implementation
**Documentation says:**
```
useEffect(callback: () => void, deps: any[])
```
**Code implementation:**
```
useEffect(callback: () => void | (() => void), deps?: readonly any[])
```
*Source: both*
---
```
## Examples
### Example 1: React (Docs + GitHub)
```json
{
  "name": "react",
  "description": "Complete React framework knowledge",
  "merge_mode": "rule-based",
  "sources": [
    {
      "type": "documentation",
      "base_url": "https://react.dev/",
      "extract_api": true,
      "max_pages": 200
    },
    {
      "type": "github",
      "repo": "facebook/react",
      "include_code": true,
      "code_analysis_depth": "surface"
    }
  ]
}
```
### Example 2: Django (Docs + GitHub)
```json
{
  "name": "django",
  "description": "Complete Django framework knowledge",
  "merge_mode": "rule-based",
  "sources": [
    {
      "type": "documentation",
      "base_url": "https://docs.djangoproject.com/en/stable/",
      "extract_api": true,
      "max_pages": 300
    },
    {
      "type": "github",
      "repo": "django/django",
      "include_code": true,
      "code_analysis_depth": "deep",
      "file_patterns": [
        "django/db/**/*.py",
        "django/views/**/*.py"
      ]
    }
  ]
}
```
### Example 3: Mixed Sources (Docs + GitHub + PDF)
```json
{
  "name": "godot",
  "description": "Complete Godot Engine knowledge",
  "merge_mode": "claude-enhanced",
  "sources": [
    {
      "type": "documentation",
      "base_url": "https://docs.godotengine.org/en/stable/",
      "extract_api": true,
      "max_pages": 500
    },
    {
      "type": "github",
      "repo": "godotengine/godot",
      "include_code": true,
      "code_analysis_depth": "deep"
    },
    {
      "type": "pdf",
      "path": "/path/to/godot_manual.pdf",
      "extract_tables": true
    }
  ]
}
```
## Command Reference
### Unified Scraper
```bash
# Basic usage
python3 cli/unified_scraper.py --config configs/react_unified.json
# Override merge mode
python3 cli/unified_scraper.py --config configs/react_unified.json --merge-mode claude-enhanced
# Use cached data (skip re-scraping)
python3 cli/unified_scraper.py --config configs/react_unified.json --skip-scrape
```
### Validate Config
```bash
python3 -c "
import sys
sys.path.insert(0, 'cli')
from config_validator import validate_config
validator = validate_config('configs/react_unified.json')
print(f'Format: {\"Unified\" if validator.is_unified else \"Legacy\"}')
print(f'Sources: {len(validator.config.get(\"sources\", []))}')
print(f'Needs API merge: {validator.needs_api_merge()}')
"
```
## MCP Integration
The unified scraper is fully integrated with MCP. The `scrape_docs` tool automatically detects unified vs legacy configs and routes to the appropriate scraper.
```python
# MCP tool usage
{
    "name": "scrape_docs",
    "arguments": {
        "config_path": "configs/react_unified.json",
        "merge_mode": "rule-based"  # Optional override
    }
}
```
The tool will:
1. Auto-detect unified format
2. Route to `unified_scraper.py`
3. Apply specified merge mode
4. Return comprehensive output
## Backward Compatibility
**Legacy configs still work!** The system automatically detects legacy single-source configs and routes to the original `doc_scraper.py`.
```json
// Legacy config (still works)
{
  "name": "react",
  "base_url": "https://react.dev/",
  ...
}
// Automatically detected as legacy format
// Routes to doc_scraper.py
```
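How the format detection might work can be sketched in a few lines. The rule used here (a non-empty `sources` list means unified, a top-level `base_url` means legacy) is an assumption inferred from the two config shapes in this guide; `detect_config_format` is a hypothetical helper, not the validator's real API.

```python
def detect_config_format(config):
    """Classify a parsed config as 'unified' or 'legacy' (illustrative rule)."""
    sources = config.get("sources")
    if isinstance(sources, list) and sources:
        return "unified"
    if "base_url" in config:
        return "legacy"
    raise ValueError("config has neither 'sources' nor 'base_url'")


legacy = {"name": "react", "base_url": "https://react.dev/"}
unified = {
    "name": "react",
    "sources": [{"type": "documentation", "base_url": "https://react.dev/"}],
}

print(detect_config_format(legacy))   # legacy
print(detect_config_format(unified))  # unified
```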
## Testing
Run integration tests:
```bash
python3 cli/test_unified_simple.py
```
Tests validate:
- ✅ Unified config validation
- ✅ Backward compatibility with legacy configs
- ✅ Mixed source type support
- ✅ Error handling for invalid configs
## Architecture
### Components
1. **config_validator.py**: Validates unified and legacy configs
2. **code_analyzer.py**: Extracts code signatures at configurable depth
3. **conflict_detector.py**: Detects API conflicts between sources
4. **merge_sources.py**: Implements rule-based and Claude-enhanced merging
5. **unified_scraper.py**: Main orchestrator
6. **unified_skill_builder.py**: Generates final skill structure
7. **skill_seeker_mcp/server.py**: MCP integration with auto-detection
### Data Flow
```
Unified Config
ConfigValidator (validates format)
UnifiedScraper.run()
┌────────────────────────────────────┐
│ Phase 1: Scrape All Sources │
│ - Documentation → doc_scraper │
│ - GitHub → github_scraper │
│ - PDF → pdf_scraper │
└────────────────────────────────────┘
┌────────────────────────────────────┐
│ Phase 2: Detect Conflicts │
│ - ConflictDetector │
│ - Compare docs APIs vs code APIs │
│ - Classify by type and severity │
└────────────────────────────────────┘
┌────────────────────────────────────┐
│ Phase 3: Merge Sources │
│ - RuleBasedMerger (fast) │
│ - OR ClaudeEnhancedMerger (AI) │
│ - Create unified API reference │
└────────────────────────────────────┘
┌────────────────────────────────────┐
│ Phase 4: Build Skill │
│ - UnifiedSkillBuilder │
│ - Generate SKILL.md with conflicts│
│ - Create reference structure │
│ - Generate conflicts report │
└────────────────────────────────────┘
Unified Skill (.zip ready)
```
## Best Practices
### 1. Start with Rule-Based Merge
Rule-based is fast and works well for most cases. Only use Claude-enhanced if you need human oversight.
### 2. Use Surface-Level Code Analysis
`code_analysis_depth: "surface"` is usually sufficient. Deep analysis is expensive and rarely needed.
### 3. Limit GitHub Issues
`max_issues: 100` is a good default. More than 200 issues rarely adds value.
### 4. Be Specific with File Patterns
```json
"file_patterns": [
  "src/**/*.js",  // Good: specific paths
  "lib/**/*.ts"
]

// Not recommended:
"file_patterns": ["**/*.js"]  // Too broad, slow
```
### 5. Monitor Conflict Reports
Always review `references/conflicts.md` to understand discrepancies between sources.
## Troubleshooting
### No Conflicts Detected
**Possible causes**:
- `extract_api: false` in documentation source
- `include_code: false` in GitHub source
- Code analysis found no APIs (check `code_analysis_depth`)
**Solution**: Ensure both sources have API extraction enabled
### Too Many Conflicts
**Possible causes**:
- Fuzzy matching threshold too strict
- Documentation uses different naming conventions
- Old documentation version
**Solution**: Review conflicts manually and adjust merge strategy
### Merge Takes Too Long
**Possible causes**:
- Using `code_analysis_depth: "full"` (very slow)
- Too many file patterns
- Large repository
**Solution**:
- Use `"surface"` or `"deep"` analysis
- Narrow file patterns
- Lower `rate_limit` (the delay between requests, in seconds) if the site tolerates it
## Future Enhancements
Planned features:
- [ ] Automated conflict resolution strategies
- [ ] Conflict trend analysis across versions
- [ ] Multi-version comparison (docs v1 vs v2)
- [ ] Custom merge rules DSL
- [ ] Conflict confidence scores
## Support
For issues, questions, or suggestions:
- GitHub Issues: https://github.com/yusufkaraaslan/Skill_Seekers/issues
- Documentation: https://github.com/yusufkaraaslan/Skill_Seekers/docs
## Changelog
**v2.0 (October 2025)**: Unified multi-source scraping feature complete
- ✅ Config validation for unified format
- ✅ Deep code analysis with AST parsing
- ✅ Conflict detection (4 types, 3 severity levels)
- ✅ Rule-based merging
- ✅ Claude-enhanced merging
- ✅ Unified skill builder with inline conflict warnings
- ✅ MCP integration with auto-detection
- ✅ Backward compatibility with legacy configs
- ✅ Comprehensive tests and documentation

@ -0,0 +1,351 @@
# How to Upload Skills to Claude
## Quick Answer
**You have 3 options to upload the `.zip` file:**
### Option 1: Automatic Upload (Recommended for CLI)
```bash
# Set your API key (one-time setup)
export ANTHROPIC_API_KEY=sk-ant-...
# Package and upload automatically
python3 cli/package_skill.py output/react/ --upload
# OR upload existing .zip
python3 cli/upload_skill.py output/react.zip
```
**Fully automatic** | No manual steps | Requires API key
### Option 2: Manual Upload (No API Key)
```bash
# Package the skill
python3 cli/package_skill.py output/react/
# This will:
# 1. Create output/react.zip
# 2. Open output/ folder automatically
# 3. Show clear upload instructions
# Then upload manually to https://claude.ai/skills
```
**No API key needed** | Works for everyone | Simple
### Option 3: Claude Code MCP (Easiest)
```
In Claude Code, just say:
"Package and upload the React skill"
# Automatically packages and uploads!
```
**Natural language** | Fully automatic | Best UX
---
## What's Inside the Zip?
The `.zip` file contains:
```
steam-economy.zip
├── SKILL.md                ← Main skill file (Claude reads this first)
└── references/             ← Reference documentation
    ├── index.md            ← Category index
    ├── api_reference.md    ← API docs
    ├── pricing.md          ← Pricing docs
    ├── trading.md          ← Trading docs
    └── ...                 ← Other categorized docs
```
**Note:** The zip only includes what Claude needs. It excludes:
- `.backup` files
- Build artifacts
- Temporary files
## What Does package_skill.py Do?
The package script:
1. **Finds your skill directory** (e.g., `output/steam-economy/`)
2. **Validates SKILL.md exists** (required!)
3. **Creates a .zip file** with the same name
4. **Includes all files** except backups
5. **Saves to** `output/` directory
**Example:**
```bash
python3 cli/package_skill.py output/steam-economy/
📦 Packaging skill: steam-economy
Source: output/steam-economy
Output: output/steam-economy.zip
+ SKILL.md
+ references/api_reference.md
+ references/pricing.md
+ references/trading.md
+ ...
✅ Package created: output/steam-economy.zip
Size: 14,290 bytes (14.0 KB)
```
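The packaging steps above can be sketched in a few lines of Python. This is not the actual `package_skill.py` source; the function name `package_skill_sketch` is hypothetical, and storing entries relative to the skill directory is an assumption based on the zip layout shown earlier.

```python
import zipfile
from pathlib import Path


def package_skill_sketch(skill_dir: str) -> Path:
    """Zip a skill directory next to itself, skipping *.backup files."""
    src = Path(skill_dir).resolve()
    if not (src / "SKILL.md").is_file():
        raise FileNotFoundError(f"SKILL.md not found in {src}")
    # The archive lands beside the skill directory, e.g. output/react.zip
    zip_path = src.parent / f"{src.name}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src.rglob("*")):
            if path.is_file() and path.suffix != ".backup":
                # Store paths relative to the skill directory (assumed layout)
                zf.write(path, path.relative_to(src).as_posix())
    return zip_path
```

Calling `package_skill_sketch("output/steam-economy/")` would write `output/steam-economy.zip` and raise `FileNotFoundError` if `SKILL.md` is missing.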
## Complete Workflow
### Step 1: Scrape & Build
```bash
python3 cli/doc_scraper.py --config configs/steam-economy.json
```
**Output:**
- `output/steam-economy_data/` (raw scraped data)
- `output/steam-economy/` (skill directory)
### Step 2: Enhance (Recommended)
```bash
python3 cli/enhance_skill_local.py output/steam-economy/
```
**What it does:**
- Analyzes reference files
- Creates comprehensive SKILL.md
- Backs up original to SKILL.md.backup
**Output:**
- `output/steam-economy/SKILL.md` (enhanced)
- `output/steam-economy/SKILL.md.backup` (original)
### Step 3: Package
```bash
python3 cli/package_skill.py output/steam-economy/
```
**Output:**
- `output/steam-economy.zip` ← **THIS IS WHAT YOU UPLOAD**
### Step 4: Upload to Claude
1. Go to Claude (claude.ai)
2. Click "Add Skill" or skill upload button
3. Select `output/steam-economy.zip`
4. Done!
## What Files Are Required?
**Minimum required structure:**
```
your-skill/
└── SKILL.md ← Required! Claude reads this first
```
**Recommended structure:**
```
your-skill/
├── SKILL.md          ← Main skill file (required)
└── references/       ← Reference docs (highly recommended)
    ├── index.md
    └── *.md          ← Category files
```
**Optional (can add manually):**
```
your-skill/
├── SKILL.md
├── references/
├── scripts/          ← Helper scripts
│   └── *.py
└── assets/           ← Templates, examples
    └── *.txt
```
## File Size Limits
The package script shows size after packaging:
```
✅ Package created: output/steam-economy.zip
Size: 14,290 bytes (14.0 KB)
```
**Typical sizes:**
- Small skill: 5-20 KB
- Medium skill: 20-100 KB
- Large skill: 100-500 KB
Claude has generous size limits, so most documentation-based skills fit easily.
## Quick Reference
### Package a Skill
```bash
python3 cli/package_skill.py output/steam-economy/
```
### Package Multiple Skills
```bash
# Package all skills in output/
for dir in output/*/; do
  if [ -f "$dir/SKILL.md" ]; then
    python3 cli/package_skill.py "$dir"
  fi
done
```
### Check What's in a Zip
```bash
unzip -l output/steam-economy.zip
```
### Test a Packaged Skill Locally
```bash
# Extract to temp directory
mkdir temp-test
unzip output/steam-economy.zip -d temp-test/
cat temp-test/SKILL.md
```
## Troubleshooting
### "SKILL.md not found"
```bash
# Make sure you scraped and built first
python3 cli/doc_scraper.py --config configs/steam-economy.json
# Then package
python3 cli/package_skill.py output/steam-economy/
```
### "Directory not found"
```bash
# Check what skills are available
ls output/
# Use correct path
python3 cli/package_skill.py output/YOUR-SKILL-NAME/
```
### Zip is Too Large
Most skills are small, but if yours is large:
```bash
# Check size
ls -lh output/steam-economy.zip
# If needed, check what's taking space
unzip -l output/steam-economy.zip | sort -k1 -rn | head -20
```
Reference files are usually small. Large sizes often mean:
- Many images (skills typically don't need images)
- Large code examples (these are fine, just be aware)
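If `unzip` is not available, the same inspection can be done with Python's standard `zipfile` module. A minimal sketch; the helper name `largest_entries` is made up for illustration.

```python
import zipfile


def largest_entries(zip_path: str, top: int = 5) -> list[tuple[str, int]]:
    """Return (name, uncompressed size in bytes) for the biggest files in a zip."""
    with zipfile.ZipFile(zip_path) as zf:
        entries = [(info.filename, info.file_size)
                   for info in zf.infolist() if not info.is_dir()]
    # Largest first, so oversized images or references show up at the top
    return sorted(entries, key=lambda entry: entry[1], reverse=True)[:top]
```

For example, `largest_entries("output/steam-economy.zip", top=20)` mirrors the `sort | head -20` pipeline above.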
## What Does Claude Do With the Zip?
When you upload a skill zip:
1. **Claude extracts it**
2. **Reads SKILL.md first** - This tells Claude:
- When to activate this skill
- What the skill does
- Quick reference examples
- How to navigate the references
3. **Indexes reference files** - Claude can search through:
- `references/*.md` files
- Find specific APIs, examples, concepts
4. **Activates automatically** - When you ask about topics matching the skill
## Example: Using the Packaged Skill
After uploading `steam-economy.zip`:
**You ask:** "How do I implement microtransactions in my Steam game?"
**Claude:**
- Recognizes this matches steam-economy skill
- Reads SKILL.md for quick reference
- Searches references/microtransactions.md
- Provides detailed answer with code examples
## API-Based Automatic Upload
### Setup (One-Time)
```bash
# Get your API key from https://console.anthropic.com/
export ANTHROPIC_API_KEY=sk-ant-...
# Add to your shell profile to persist
echo 'export ANTHROPIC_API_KEY=sk-ant-...' >> ~/.bashrc # or ~/.zshrc
```
### Usage
```bash
# Upload existing .zip
python3 cli/upload_skill.py output/react.zip
# OR package and upload in one command
python3 cli/package_skill.py output/react/ --upload
```
### How It Works
The upload tool uses the Anthropic `/v1/skills` API endpoint to:
1. Read your .zip file
2. Authenticate with your API key
3. Upload to Claude's skill storage
4. Verify upload success
### Troubleshooting
**"ANTHROPIC_API_KEY not set"**
```bash
# Check if set
echo $ANTHROPIC_API_KEY
# If empty, set it
export ANTHROPIC_API_KEY=sk-ant-...
```
**"Authentication failed"**
- Verify your API key is correct
- Check https://console.anthropic.com/ for valid keys
**"Upload timed out"**
- Check your internet connection
- Try again or use manual upload
**Upload fails with error**
- Falls back to showing manual upload instructions
- You can still upload via https://claude.ai/skills
---
## Summary
**What you need to do:**
### With API Key (Automatic):
1. ✅ Scrape: `python3 cli/doc_scraper.py --config configs/YOUR-CONFIG.json`
2. ✅ Enhance: `python3 cli/enhance_skill_local.py output/YOUR-SKILL/`
3. ✅ Package & Upload: `python3 cli/package_skill.py output/YOUR-SKILL/ --upload`
4. ✅ Done! Skill is live in Claude
### Without API Key (Manual):
1. ✅ Scrape: `python3 cli/doc_scraper.py --config configs/YOUR-CONFIG.json`
2. ✅ Enhance: `python3 cli/enhance_skill_local.py output/YOUR-SKILL/`
3. ✅ Package: `python3 cli/package_skill.py output/YOUR-SKILL/`
4. ✅ Upload: Go to https://claude.ai/skills and upload the `.zip`
**What you upload:**
- The `.zip` file from `output/` directory
- Example: `output/steam-economy.zip`
**What's in the zip:**
- `SKILL.md` (required)
- `references/*.md` (recommended)
- Any scripts/assets you added (optional)
That's it! 🚀

@ -0,0 +1,811 @@
# Complete Usage Guide for Skill Seeker
Comprehensive reference for all commands, options, and workflows.
## Table of Contents
- [Quick Reference](#quick-reference)
- [Main Tool: doc_scraper.py](#main-tool-doc_scraperpy)
- [Estimator: estimate_pages.py](#estimator-estimate_pagespy)
- [Enhancement Tools](#enhancement-tools)
- [Packaging Tool](#packaging-tool)
- [Testing Tools](#testing-tools)
- [Available Configs](#available-configs)
- [Common Workflows](#common-workflows)
- [Troubleshooting](#troubleshooting)
---
## Quick Reference
```bash
# 1. Estimate pages (fast, 1-2 min)
python3 cli/estimate_pages.py configs/react.json
# 2. Scrape documentation (20-40 min)
python3 cli/doc_scraper.py --config configs/react.json
# 3. Enhance with Claude Code (60 sec)
python3 cli/enhance_skill_local.py output/react/
# 4. Package to .zip (instant)
python3 cli/package_skill.py output/react/
# 5. Test everything (1 sec)
python3 cli/run_tests.py
```
---
## Main Tool: doc_scraper.py
### Full Help
```
usage: doc_scraper.py [-h] [--interactive] [--config CONFIG] [--name NAME]
                      [--url URL] [--description DESCRIPTION] [--skip-scrape]
                      [--dry-run] [--enhance] [--enhance-local]
                      [--api-key API_KEY]

Convert documentation websites to Claude skills

options:
  -h, --help            Show this help message and exit
  --interactive, -i     Interactive configuration mode
  --config, -c CONFIG   Load configuration from file (e.g., configs/godot.json)
  --name NAME           Skill name
  --url URL             Base documentation URL
  --description, -d DESCRIPTION
                        Skill description
  --skip-scrape         Skip scraping, use existing data
  --dry-run             Preview what will be scraped without actually scraping
  --enhance             Enhance SKILL.md using Claude API after building
                        (requires API key)
  --enhance-local       Enhance SKILL.md using Claude Code in new terminal
                        (no API key needed)
  --api-key API_KEY     Anthropic API key for --enhance (or set ANTHROPIC_API_KEY)
```
### Usage Examples
**1. Use Preset Config (Recommended)**
```bash
python3 cli/doc_scraper.py --config configs/godot.json
python3 cli/doc_scraper.py --config configs/react.json
python3 cli/doc_scraper.py --config configs/vue.json
python3 cli/doc_scraper.py --config configs/django.json
python3 cli/doc_scraper.py --config configs/fastapi.json
```
**2. Interactive Mode**
```bash
python3 cli/doc_scraper.py --interactive
# Wizard walks you through:
# - Skill name
# - Base URL
# - Description
# - Selectors (optional)
# - URL patterns (optional)
# - Rate limit
# - Max pages
```
**3. Quick Mode (Minimal)**
```bash
python3 cli/doc_scraper.py \
--name react \
--url https://react.dev/ \
--description "React framework for building UIs"
```
**4. Dry-Run (Preview)**
```bash
python3 cli/doc_scraper.py --config configs/react.json --dry-run
# Shows what will be scraped without downloading data
# No directories created
# Fast validation
```
**5. Skip Scraping (Use Cached Data)**
```bash
python3 cli/doc_scraper.py --config configs/godot.json --skip-scrape
# Uses existing output/godot_data/
# Fast rebuild (1-3 minutes)
# Useful for testing changes
```
**6. With Local Enhancement**
```bash
python3 cli/doc_scraper.py --config configs/react.json --enhance-local
# Scrapes + enhances in one command
# Opens new terminal for Claude Code
# No API key needed
```
**7. With API Enhancement**
```bash
export ANTHROPIC_API_KEY=sk-ant-...
python3 cli/doc_scraper.py --config configs/react.json --enhance
# Or with inline API key:
python3 cli/doc_scraper.py --config configs/react.json --enhance --api-key sk-ant-...
```
### Output Structure
```
output/
├── {name}_data/              # Scraped raw data (cached)
│   ├── pages/
│   │   ├── page_0.json
│   │   ├── page_1.json
│   │   └── ...
│   └── summary.json          # Scraping stats
└── {name}/                   # Built skill directory
    ├── SKILL.md              # Main skill file
    ├── SKILL.md.backup       # Backup (if enhanced)
    ├── references/           # Categorized docs
    │   ├── index.md
    │   ├── getting_started.md
    │   ├── api.md
    │   └── ...
    ├── scripts/              # Empty (user scripts)
    └── assets/               # Empty (user assets)
```
---
## Estimator: estimate_pages.py
### Full Help
```
usage: estimate_pages.py [-h] [--max-discovery MAX_DISCOVERY]
                         [--timeout TIMEOUT]
                         config

Estimate page count for Skill Seeker configs

positional arguments:
  config                Path to config JSON file

options:
  -h, --help            Show this help message and exit
  --max-discovery, -m MAX_DISCOVERY
                        Maximum pages to discover (default: 1000)
  --timeout, -t TIMEOUT
                        HTTP request timeout in seconds (default: 30)
```
### Usage Examples
**1. Quick Estimate (100 pages)**
```bash
python3 cli/estimate_pages.py configs/react.json --max-discovery 100
# Time: ~30-60 seconds
# Good for: Quick validation
```
**2. Standard Estimate (1000 pages - default)**
```bash
python3 cli/estimate_pages.py configs/godot.json
# Time: ~1-2 minutes
# Good for: Most use cases
```
**3. Deep Estimate (2000 pages)**
```bash
python3 cli/estimate_pages.py configs/vue.json --max-discovery 2000
# Time: ~3-5 minutes
# Good for: Large documentation sites
```
**4. Custom Timeout**
```bash
python3 cli/estimate_pages.py configs/django.json --timeout 60
# Useful for slow servers
```
### Output Example
```
🔍 Estimating pages for: react
📍 Base URL: https://react.dev/
🎯 Start URLs: 6
⏱️ Rate limit: 0.5s
🔢 Max discovery: 1000
⏳ Discovered: 180 pages (1.3 pages/sec)
======================================================================
📊 ESTIMATION RESULTS
======================================================================
Config: react
Base URL: https://react.dev/
✅ Pages Discovered: 180
⏳ Pages Pending: 50
📈 Estimated Total: 230
⏱️ Time Elapsed: 140.5s
⚡ Discovery Rate: 1.28 pages/sec
======================================================================
💡 RECOMMENDATIONS
======================================================================
✅ Current max_pages (300) is sufficient
⏱️ Estimated full scrape time: 1.9 minutes
(Based on rate_limit: 0.5s)
```
**What It Shows:**
- Estimated total pages to scrape
- Whether current `max_pages` is sufficient
- Recommended `max_pages` value
- Estimated scraping time
- Discovery rate (pages/sec)
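The numbers in the sample report follow from simple arithmetic, sketched below. The formulas are inferred from the example output rather than taken from `estimate_pages.py` itself, and the 50-page buffer is an assumption borrowed from the troubleshooting advice later in this guide; `summarize_estimate` is a hypothetical helper name.

```python
def summarize_estimate(discovered: int, pending: int, rate_limit: float,
                       max_pages: int, buffer: int = 50) -> dict:
    """Reproduce the headline figures of an estimation report."""
    estimated_total = discovered + pending
    return {
        "estimated_total": estimated_total,
        # Only the per-request delay is counted; network time is ignored
        "scrape_minutes": round(estimated_total * rate_limit / 60, 1),
        "max_pages_sufficient": max_pages >= estimated_total,
        "suggested_max_pages": estimated_total + buffer,
    }


# Values from the sample output: 180 discovered, 50 pending, 0.5s delay
print(summarize_estimate(180, 50, 0.5, 300))
# {'estimated_total': 230, 'scrape_minutes': 1.9, 'max_pages_sufficient': True, 'suggested_max_pages': 280}
```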
---
## Enhancement Tools
### enhance_skill_local.py (Recommended)
**No API key needed - uses Claude Code Max plan**
```bash
# Usage
python3 cli/enhance_skill_local.py output/react/
python3 cli/enhance_skill_local.py output/godot/
# What it does:
# 1. Reads SKILL.md and references/
# 2. Opens new terminal with Claude Code
# 3. Claude enhances SKILL.md
# 4. Backs up original to SKILL.md.backup
# 5. Saves enhanced version
# Time: ~60 seconds
# Cost: Free (uses your Claude Code Max plan)
```
### enhance_skill.py (Alternative)
**Requires Anthropic API key**
```bash
# Install dependency first
pip3 install anthropic
# Usage with environment variable
export ANTHROPIC_API_KEY=sk-ant-...
python3 cli/enhance_skill.py output/react/
# Usage with inline API key
python3 cli/enhance_skill.py output/godot/ --api-key sk-ant-...
# What it does:
# 1. Reads SKILL.md and references/
# 2. Calls Claude API (Sonnet 4)
# 3. Enhances SKILL.md
# 4. Backs up original to SKILL.md.backup
# 5. Saves enhanced version
# Time: ~30-60 seconds
# Cost: ~$0.01-0.10 per skill (depending on size)
```
---
## Packaging Tool
### package_skill.py
```bash
# Usage
python3 cli/package_skill.py output/react/
python3 cli/package_skill.py output/godot/
# What it does:
# 1. Validates SKILL.md exists
# 2. Creates .zip with all skill files
# 3. Saves to output/{name}.zip
# Output:
# output/react.zip
# output/godot.zip
# Time: Instant
```
---
## Testing Tools
### run_tests.py
```bash
# Run all tests (default)
python3 cli/run_tests.py
# 71 tests, ~1 second
# Verbose output
python3 cli/run_tests.py -v
python3 cli/run_tests.py --verbose
# Quiet output
python3 cli/run_tests.py -q
python3 cli/run_tests.py --quiet
# Stop on first failure
python3 cli/run_tests.py -f
python3 cli/run_tests.py --failfast
# Run specific test suite
python3 cli/run_tests.py --suite config
python3 cli/run_tests.py --suite features
python3 cli/run_tests.py --suite integration
# List all tests
python3 cli/run_tests.py --list
```
### Individual Tests
```bash
# Run single test file
python3 -m unittest tests.test_config_validation
python3 -m unittest tests.test_scraper_features
python3 -m unittest tests.test_integration
# Run single test class
python3 -m unittest tests.test_config_validation.TestConfigValidation
# Run single test method
python3 -m unittest tests.test_config_validation.TestConfigValidation.test_valid_complete_config
```
---
## Available Configs
### Preset Configs (Ready to Use)
| Config | Framework | Pages | Description |
|--------|-----------|-------|-------------|
| `godot.json` | Godot Engine | ~500 | Game engine documentation |
| `react.json` | React | ~300 | React framework docs |
| `vue.json` | Vue.js | ~250 | Vue.js framework docs |
| `django.json` | Django | ~400 | Django web framework |
| `fastapi.json` | FastAPI | ~200 | FastAPI Python framework |
| `steam-economy-complete.json` | Steam | ~100 | Steam Economy API docs |
### View Config Details
```bash
# List all configs
ls configs/
# View config content
cat configs/react.json
python3 -m json.tool configs/godot.json
```
### Config Structure
```json
{
  "name": "react",
  "base_url": "https://react.dev/",
  "description": "React - JavaScript library for building UIs",
  "start_urls": [
    "https://react.dev/learn",
    "https://react.dev/reference/react",
    "https://react.dev/reference/react-dom"
  ],
  "selectors": {
    "main_content": "article",
    "title": "h1",
    "code_blocks": "pre code"
  },
  "url_patterns": {
    "include": ["/learn/", "/reference/"],
    "exclude": ["/blog/", "/community/"]
  },
  "categories": {
    "getting_started": ["learn", "tutorial", "intro"],
    "api": ["reference", "api", "hooks"],
    "guides": ["guide"]
  },
  "rate_limit": 0.5,
  "max_pages": 300
}
```
---
## Common Workflows
### Workflow 1: Use Preset (Fastest)
```bash
# 1. Estimate (optional, 1-2 min)
python3 cli/estimate_pages.py configs/react.json
# 2. Scrape with local enhancement (25 min)
python3 cli/doc_scraper.py --config configs/react.json --enhance-local
# 3. Package (instant)
python3 cli/package_skill.py output/react/
# Result: output/react.zip
# Upload to Claude!
```
### Workflow 2: Custom Documentation
```bash
# 1. Create config
cat > configs/my-docs.json << 'EOF'
{
"name": "my-docs",
"base_url": "https://docs.example.com/",
"description": "My documentation site",
"rate_limit": 0.5,
"max_pages": 200
}
EOF
# 2. Estimate
python3 cli/estimate_pages.py configs/my-docs.json
# 3. Dry-run test
python3 cli/doc_scraper.py --config configs/my-docs.json --dry-run
# 4. Full scrape
python3 cli/doc_scraper.py --config configs/my-docs.json
# 5. Enhance
python3 cli/enhance_skill_local.py output/my-docs/
# 6. Package
python3 cli/package_skill.py output/my-docs/
```
### Workflow 3: Interactive Mode
```bash
# 1. Start interactive wizard
python3 cli/doc_scraper.py --interactive
# 2. Answer prompts:
# - Name: my-framework
# - URL: https://framework.dev/
# - Description: My favorite framework
# - Selectors: (uses defaults)
# - Rate limit: 0.5
# - Max pages: 100
# 3. Enhance
python3 cli/enhance_skill_local.py output/my-framework/
# 4. Package
python3 cli/package_skill.py output/my-framework/
```
### Workflow 4: Quick Mode
```bash
python3 cli/doc_scraper.py \
--name vue \
--url https://vuejs.org/ \
--description "Vue.js framework" \
--enhance-local
```
### Workflow 5: Rebuild from Cache
```bash
# Already scraped once?
# Skip re-scraping, just rebuild
python3 cli/doc_scraper.py --config configs/godot.json --skip-scrape
# Try new enhancement
python3 cli/enhance_skill_local.py output/godot/
# Re-package
python3 cli/package_skill.py output/godot/
```
### Workflow 6: Testing New Config
```bash
# 1. Create test config with low max_pages
cat > configs/test.json << 'EOF'
{
"name": "test-site",
"base_url": "https://docs.test.com/",
"max_pages": 20,
"rate_limit": 0.1
}
EOF
# 2. Estimate
python3 cli/estimate_pages.py configs/test.json --max-discovery 50
# 3. Dry-run
python3 cli/doc_scraper.py --config configs/test.json --dry-run
# 4. Small scrape
python3 cli/doc_scraper.py --config configs/test.json
# 5. Validate output
ls output/test-site/
ls output/test-site/references/
# 6. If good, increase max_pages and re-run
```
---
## Troubleshooting
### Issue: "Rate limit exceeded"
```bash
# Increase rate_limit in config
# Default: 0.5 seconds
# Conservative: 1.0 seconds
# Very conservative: 2.0 seconds
# Edit config:
{
"rate_limit": 1.0
}
```
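For intuition, `rate_limit` behaves like a fixed pause between consecutive requests, so a larger value means fewer requests per second. The loop below is a sketch of that idea under this assumption, not the scraper's real code; `fetch_politely` and the injected `fetch` callable are hypothetical.

```python
import time
from typing import Callable, Iterable


def fetch_politely(urls: Iterable[str], fetch: Callable[[str], str],
                   rate_limit: float = 0.5) -> list[str]:
    """Call fetch(url) for each URL, sleeping rate_limit seconds in between."""
    results = []
    for index, url in enumerate(urls):
        if index:  # no pause needed before the very first request
            time.sleep(rate_limit)
        results.append(fetch(url))
    return results
```

With `rate_limit=1.0`, 300 pages take at least about five minutes of waiting alone, which is the trade-off behind the "conservative" settings above.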
### Issue: "Too many pages"
```bash
# Estimate first
python3 cli/estimate_pages.py configs/my-config.json
# Set max_pages based on estimate
# Add buffer: estimated + 50
# Edit config:
{
"max_pages": 350 # for 300 estimated
}
```
### Issue: "No content extracted"
```bash
# Wrong selectors
# Test selectors manually:
curl -s https://docs.example.com/ | grep -i 'article\|main\|content'
# Common selectors:
"main_content": "article"
"main_content": "main"
"main_content": ".content"
"main_content": "#main-content"
"main_content": "div[role=\"main\"]"
# Update config with correct selector
```
### Issue: "Tests failing"
```bash
# Run specific failing test
python3 -m unittest tests.test_config_validation.TestConfigValidation.test_name -v
# Check error message
# Verify expectations match implementation
```
### Issue: "Enhancement fails"
```bash
# Local enhancement:
# Make sure Claude Code is running
# Check terminal output
# API enhancement:
# Verify API key is set:
echo $ANTHROPIC_API_KEY
# Or use inline:
python3 cli/enhance_skill.py output/react/ --api-key sk-ant-...
```
### Issue: "Package fails"
```bash
# Verify SKILL.md exists
ls output/my-skill/SKILL.md
# If missing, build first:
python3 cli/doc_scraper.py --config configs/my-skill.json --skip-scrape
```
### Issue: "Can't find output"
```bash
# Check output directory
ls output/
# Skill data (cached):
ls output/{name}_data/
# Built skill:
ls output/{name}/
# Packaged skill:
ls output/{name}.zip
```
---
## Advanced Usage
### Custom Selectors
```json
{
  "selectors": {
    "main_content": "div.documentation",
    "title": "h1.page-title",
    "code_blocks": "pre.highlight code",
    "navigation": "nav.sidebar"
  }
}
```
### URL Pattern Filtering
```json
{
  "url_patterns": {
    "include": [
      "/docs/",
      "/guide/",
      "/api/",
      "/tutorial/"
    ],
    "exclude": [
      "/blog/",
      "/news/",
      "/community/",
      "/showcase/"
    ]
  }
}
```
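One plausible reading of these patterns is plain substring matching, with `exclude` taking priority over `include`. The sketch below encodes that reading; the matching rule and the helper name `should_scrape` are assumptions, so check the scraper source before relying on edge cases.

```python
def should_scrape(url: str, url_patterns: dict) -> bool:
    """Decide whether a URL passes the include/exclude filters."""
    include = url_patterns.get("include", [])
    exclude = url_patterns.get("exclude", [])
    # Assumption: any exclude hit rejects the URL outright
    if any(pattern in url for pattern in exclude):
        return False
    # Assumption: an empty include list means "accept everything else"
    return not include or any(pattern in url for pattern in include)


patterns = {"include": ["/docs/", "/api/"], "exclude": ["/blog/"]}
print(should_scrape("https://docs.example.com/api/users", patterns))
print(should_scrape("https://docs.example.com/blog/docs/news", patterns))
```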
### Custom Categories
```json
{
  "categories": {
    "getting_started": ["intro", "tutorial", "quickstart", "installation"],
    "core_concepts": ["concept", "fundamental", "architecture"],
    "api": ["reference", "api", "method", "function"],
    "guides": ["guide", "how-to", "example"],
    "advanced": ["advanced", "expert", "performance"]
  }
}
```
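To see how keyword lists like these could drive categorization, here is a small sketch. It assumes keywords are matched against the URL path and page title, that the first matching category wins, and that unmatched pages fall into an `"other"` bucket; all three rules and the name `categorize` are illustrative guesses.

```python
from urllib.parse import urlparse


def categorize(url: str, title: str, categories: dict[str, list[str]]) -> str:
    """Return the first category whose keywords appear in the URL path or title."""
    # Use only the path so the domain (e.g. docs.example.com) cannot match "example"
    haystack = f"{urlparse(url).path} {title}".lower()
    for name, keywords in categories.items():
        if any(keyword in haystack for keyword in keywords):
            return name
    return "other"  # assumed fallback bucket


cats = {
    "getting_started": ["intro", "tutorial", "quickstart", "installation"],
    "api": ["reference", "api", "method", "function"],
    "guides": ["guide", "how-to", "example"],
}
print(categorize("https://docs.example.com/api/users", "Users", cats))
```

Because the first match wins under this assumption, the order of categories in the config matters when keywords overlap.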
### Multiple Start URLs
```json
{
  "start_urls": [
    "https://docs.example.com/getting-started/",
    "https://docs.example.com/api/",
    "https://docs.example.com/guides/",
    "https://docs.example.com/examples/"
  ]
}
```
---
## Performance Tips
1. **Estimate first**: Save 20-40 minutes by validating config
2. **Use dry-run**: Test selectors before full scrape
3. **Cache data**: Use `--skip-scrape` for fast rebuilds
4. **Adjust rate_limit**: Balance speed vs politeness
5. **Set appropriate max_pages**: Don't scrape more than needed
6. **Use start_urls**: Target specific documentation sections
7. **Filter URLs**: Use include/exclude patterns
8. **Run tests**: Catch issues early
---
## Environment Variables
```bash
# Anthropic API key (for API enhancement)
export ANTHROPIC_API_KEY=sk-ant-...
# Optional: Set custom output directory
export SKILL_SEEKER_OUTPUT_DIR=/path/to/output
```
---
## Exit Codes
- `0`: Success
- `1`: Error (general)
- `2`: Warning (estimation hit limit)
---
## File Locations
```
Skill_Seekers/
├── cli/
│   ├── doc_scraper.py            # Main tool
│   ├── estimate_pages.py         # Estimator
│   ├── enhance_skill.py          # API enhancement
│   ├── enhance_skill_local.py    # Local enhancement
│   ├── package_skill.py          # Packager
│   └── run_tests.py              # Test runner
├── configs/                      # Preset configs
├── tests/                        # Test suite
├── docs/                         # Documentation
└── output/                       # Generated output
```
---
## Getting Help
```bash
# Tool-specific help
python3 cli/doc_scraper.py --help
python3 cli/estimate_pages.py --help
python3 cli/run_tests.py --help
# Documentation
cat CLAUDE.md # Quick reference for Claude Code
cat docs/CLAUDE.md # Detailed technical docs
cat docs/TESTING.md # Testing guide
cat docs/USAGE.md # This file
cat docs/ENHANCEMENT.md # Enhancement guide
cat docs/UPLOAD_GUIDE.md # Upload instructions
cat README.md # Project overview
```
---
## Summary
**Essential Commands:**
```bash
python3 cli/estimate_pages.py configs/react.json # Estimate
python3 cli/doc_scraper.py --config configs/react.json # Scrape
python3 cli/enhance_skill_local.py output/react/ # Enhance
python3 cli/package_skill.py output/react/ # Package
python3 cli/run_tests.py # Test
```
**Quick Start:**
```bash
pip3 install requests beautifulsoup4
python3 cli/doc_scraper.py --config configs/react.json --enhance-local
python3 cli/package_skill.py output/react/
# Upload output/react.zip to Claude!
```
Happy skill creating! 🚀

@ -0,0 +1,867 @@
# Active Skills Design - Demand-Driven Documentation Loading
**Date:** 2025-10-24
**Type:** Architecture Design
**Status:** Phase 1 Implemented ✅
**Author:** Edgar + Claude (Brainstorming Session)
---
## Executive Summary
Transform Skill_Seekers from creating **passive documentation dumps** into **active, intelligent skills** that load documentation on-demand. This eliminates context bloat (300k → 5-10k per query) while maintaining full access to complete documentation.
**Key Innovation:** Skills become lightweight routers with heavy tools in `scripts/`, not documentation repositories.
---
## Problem Statement
### Current Architecture: Passive Skills
**What happens today:**
```
Agent: "How do I use Hono middleware?"
Skill: *Claude loads 203k llms-txt.md into context*
Agent: *answers using loaded docs*
Result: Context bloat, slower performance, hits limits
```
**Issues:**
1. **Context Bloat**: The 203k truncated copy of llms-full.txt is loaded entirely into context
2. **Wasted Resources**: Agent needs ~5k but gets 203k
3. **Truncation Loss**: 36% of content lost (319k → 203k) due to size limits
4. **File Extension Bug**: llms.txt files stored as .txt instead of .md
5. **Single Variant**: Only downloads one file (usually llms-full.txt)
### Current File Structure
```
output/hono/
├── SKILL.md ──────────► Documentation dump + instructions
├── references/
│   └── llms-txt.md ───► 203k (36% truncated from 319k original)
├── scripts/ ──────────► EMPTY (placeholder only!)
└── assets/ ───────────► EMPTY (placeholder only!)
```
---
## Proposed Architecture: Active Skills
### Core Concept
**Skills = Routers + Tools**, not documentation dumps.
**New workflow:**
```
Agent: "How do I use Hono middleware?"
Skill: *runs scripts/search.py "middleware"*
Script: *loads llms-full.md, extracts middleware section, returns 8k*
Agent: *answers using ONLY 8k* (CLEAN CONTEXT!)
Result: ~25x less context (203k → 8k), no truncation, full access to docs
```
### Benefits
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Context per query | 203k | 5-10k | **20-40x reduction** |
| Content loss | 36% truncated | 0% (no truncation) | **Full fidelity** |
| Variants available | 1 | 3 | **User choice** |
| File format | .txt (wrong) | .md (correct) | **Fixed** |
| Agent workflow | Passive read | Active tools | **Autonomous** |
---
## Design Components
### Component 1: Multi-Variant Download
**Change:** Download ALL 3 variants, not just one.
**File naming (FIXED):**
- `https://hono.dev/llms-full.txt` → `llms-full.md`
- `https://hono.dev/llms.txt` → `llms.md`
- `https://hono.dev/llms-small.txt` → `llms-small.md`
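The rename rule above amounts to swapping the file extension on the last URL path segment. A minimal sketch using only the standard library; the helper name `markdown_name` is hypothetical and not part of the existing codebase.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def markdown_name(url: str) -> str:
    """Map an llms*.txt URL to the local .md filename."""
    # Take the last path segment, e.g. "llms-full.txt"
    name = PurePosixPath(urlparse(url).path).name
    # Replace the extension so the file is stored as Markdown
    return str(PurePosixPath(name).with_suffix(".md"))


for url in ("https://hono.dev/llms-full.txt",
            "https://hono.dev/llms.txt",
            "https://hono.dev/llms-small.txt"):
    print(url, "->", markdown_name(url))
```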
**Sizes (Hono example):**
- `llms-full.md` - 319k (complete documentation)
- `llms-small.md` - 176k (curated essentials)
- `llms.md` - 5.4k (quick reference)
**Storage:**
```
output/hono/
├── references/
│   ├── llms-full.md     # 319k - everything (RENAMED from .txt)
│   ├── llms-small.md    # 176k - curated (RENAMED from .txt)
│   └── llms.md          # 5.4k - quick ref (RENAMED from .txt)
└── assets/
    └── catalog.json     # Generated index (NEW)
```
**Implementation in `_try_llms_txt()`:**
```python
def _try_llms_txt(self) -> bool:
    """Download ALL llms.txt variants for active skills"""
    # 1. Detect all available variants
    detector = LlmsTxtDetector(self.base_url)
    variants = detector.detect_all()  # NEW method
    downloaded = {}
    for variant_info in variants:
        url = variant_info['url']          # https://hono.dev/llms-full.txt
        variant = variant_info['variant']  # 'full', 'standard', 'small'
        downloader = LlmsTxtDownloader(url)
        content = downloader.download()
        if content:
            # ✨ FIX: Rename .txt → .md immediately
            # 'standard' is the plain llms.txt, so it maps to llms.md
            clean_name = "llms.md" if variant == "standard" else f"llms-{variant}.md"
            downloaded[variant] = {
                'content': content,
                'filename': clean_name
            }
    if not downloaded:
        return False  # nothing was downloaded, so report failure
    # 2. Save ALL variants (not just one)
    for variant, data in downloaded.items():
        path = os.path.join(self.skill_dir, "references", data['filename'])
        with open(path, 'w', encoding='utf-8') as f:
            f.write(data['content'])
    # 3. Generate catalog from smallest variant
    if 'small' in downloaded:
        self._generate_catalog(downloaded['small']['content'])
    return True
```
---
### Component 2: The Catalog System
**Purpose:** Lightweight index of what exists, not the content itself.
**File:** `assets/catalog.json`
**Structure:**
```json
{
  "metadata": {
    "framework": "hono",
    "version": "auto-detected",
    "generated": "2025-10-24T14:30:00Z",
    "total_sections": 93,
    "variants": {
      "quick": "llms-small.md",
      "standard": "llms.md",
      "complete": "llms-full.md"
    }
  },
  "sections": [
    {
      "id": "routing",
      "title": "Routing",
      "h1_marker": "# Routing",
      "topics": ["routes", "path", "params", "wildcard"],
      "size_bytes": 4800,
      "variants": ["quick", "complete"],
      "complexity": "beginner"
    },
    {
      "id": "middleware",
      "title": "Middleware",
      "h1_marker": "# Middleware",
      "topics": ["cors", "auth", "logging", "compression"],
      "size_bytes": 8200,
      "variants": ["quick", "complete"],
      "complexity": "intermediate"
    }
  ],
  "search_index": {
    "cors": ["middleware"],
    "routing": ["routing", "path-parameters"],
    "authentication": ["middleware", "jwt"],
    "context": ["context-handling"],
    "streaming": ["streaming-responses"]
  }
}
```
**Generation (from llms-small.md):**
```python
def _generate_catalog(self, llms_small_content):
    """Generate catalog.json from llms-small.md TOC"""
    # Requires `import json, os, re` at module level
    catalog = {
        "metadata": {...},  # elided here; must be a JSON-serializable dict
        "sections": [],
        "search_index": {}
    }
    # Split by h1 headers (MULTILINE so an h1 on the very first line is not skipped)
    sections = re.split(r'^# ', llms_small_content, flags=re.MULTILINE)
    for section_text in sections[1:]:
        lines = section_text.split('\n')
        title = lines[0].strip()
        # Extract h2 topics
        topics = re.findall(r'^## (.+)$', section_text, re.MULTILINE)
        topics = [t.strip().lower() for t in topics]
        section_info = {
            "id": title.lower().replace(' ', '-'),
            "title": title,
            "h1_marker": f"# {title}",
            "topics": topics + [title.lower()],
            "size_bytes": len(section_text.encode('utf-8')),
            "variants": ["quick", "complete"]
        }
        catalog["sections"].append(section_info)
        # Build search index
        for topic in section_info["topics"]:
            if topic not in catalog["search_index"]:
                catalog["search_index"][topic] = []
            catalog["search_index"][topic].append(section_info["id"])
    # Save to assets/catalog.json
    catalog_path = os.path.join(self.skill_dir, "assets", "catalog.json")
    with open(catalog_path, 'w', encoding='utf-8') as f:
        json.dump(catalog, f, indent=2)
```
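
The ID rule above (`title.lower().replace(' ', '-')`) keeps punctuation, so a heading with a colon or parentheses would yield an ID that is awkward to pass on a command line. One possible alternative is sketched below; the `slugify` helper is an assumption for illustration, not something the design specifies, and it only handles ASCII letters and digits.

```python
import re


def slugify(title):
    """Lowercase a heading and collapse every non-alphanumeric run into one hyphen.

    Hypothetical helper; non-ASCII characters are dropped, so a title made up
    entirely of them would produce an empty string.
    """
    slug = re.sub(r'[^a-z0-9]+', '-', title.lower())
    return slug.strip('-')


print(slugify("Path Parameters"))    # path-parameters
print(slugify("CORS & Auth (JWT)"))  # cors-auth-jwt
```

If a rule like this were adopted, the same function would have to be used wherever IDs are compared, otherwise catalog IDs and lookups could drift apart.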
---
### Component 3: Active Scripts
**Location:** `scripts/` directory (currently empty)
#### Script 1: `scripts/search.py`
**Purpose:** Search and return only relevant documentation sections.
```python
#!/usr/bin/env python3
"""
ABOUTME: Searches framework documentation and returns relevant sections
ABOUTME: Loads only what's needed - keeps agent context clean
"""
import argparse
import json
from pathlib import Path


def search(query, detail="auto"):
    """
    Search documentation and return relevant sections.

    Args:
        query: Search term (e.g., "middleware", "cors", "routing")
        detail: "quick" | "standard" | "complete" | "auto"

    Returns:
        Markdown text of relevant sections only
    """
    # Load catalog
    catalog_path = Path(__file__).parent.parent / "assets" / "catalog.json"
    with open(catalog_path, encoding='utf-8') as f:
        catalog = json.load(f)

    # 1. Find matching sections using search index
    query_lower = query.lower()
    matching_section_ids = set()
    for keyword, section_ids in catalog["search_index"].items():
        if query_lower in keyword or keyword in query_lower:
            matching_section_ids.update(section_ids)

    # Get section details
    matches = [s for s in catalog["sections"] if s["id"] in matching_section_ids]
    if not matches:
        return f"❌ No sections found for '{query}'. Try: python scripts/list_topics.py"

    # 2. Determine detail level
    if detail == "auto":
        # Use quick for overview, complete for deep dive
        total_size = sum(s["size_bytes"] for s in matches)
        if total_size > 50000:  # > 50k
            variant = "quick"
        else:
            variant = "complete"
    else:
        variant = detail
    # Unknown variant: fall back to the complete file (a filename, not the key)
    variants = catalog["metadata"]["variants"]
    variant_file = variants.get(variant, variants["complete"])

    # 3. Load documentation file
    doc_path = Path(__file__).parent.parent / "references" / variant_file
    doc_content = doc_path.read_text(encoding='utf-8')

    # 4. Extract matched sections
    results = []
    for match in matches:
        h1_marker = match["h1_marker"]
        # Find section boundaries
        start = doc_content.find(h1_marker)
        if start == -1:
            continue
        # Find next h1 (or end of file)
        next_h1 = doc_content.find("\n# ", start + len(h1_marker))
        if next_h1 == -1:
            section_text = doc_content[start:]
        else:
            section_text = doc_content[start:next_h1]
        results.append({
            'title': match['title'],
            'size': len(section_text),
            'content': section_text
        })

    # 5. Format output
    output = [f"# Search Results for '{query}' ({len(results)} sections found)\n"]
    output.append(f"**Variant used:** {variant} ({variant_file})")
    output.append(f"**Total size:** {sum(r['size'] for r in results):,} bytes\n")
    output.append("---\n")
    for result in results:
        output.append(result['content'])
        output.append("\n---\n")
    return '\n'.join(output)


if __name__ == "__main__":
    # --detail is parsed as a flag so that the documented call
    # `python search.py routing --detail quick` selects the quick variant
    parser = argparse.ArgumentParser(description="Search framework documentation")
    parser.add_argument("query", help='search term, e.g. "middleware"')
    parser.add_argument("--detail", default="auto",
                        choices=["quick", "standard", "complete", "auto"])
    args = parser.parse_args()
    print(search(args.query, args.detail))
```
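
One caveat with locating a section via `doc_content.find(h1_marker)`: a plain substring search for `# Routing` also matches inside `## Routing` or at the start of `# Routing Basics`. The sketch below shows one way to require an exact heading line instead. The function name `extract_section` is hypothetical, and like the script it still treats any line starting with `# ` as a heading, including comment lines inside fenced code.

```python
import re


def extract_section(doc_content, title):
    """Return the h1 section whose heading line is exactly `# <title>`, or None.

    Hypothetical helper for illustration; not part of the scripts above.
    """
    # Anchor at line start and line end so "# Routing" matches neither
    # "## Routing" nor "# Routing Basics"
    heading = re.search(rf'^# {re.escape(title)}[ \t]*$', doc_content, re.MULTILINE)
    if heading is None:
        return None
    # The section runs until the next h1 heading or the end of the document
    next_h1 = re.search(r'^# ', doc_content[heading.end():], re.MULTILINE)
    end = heading.end() + next_h1.start() if next_h1 else len(doc_content)
    return doc_content[heading.start():end]


doc = (
    "# Routing Basics\nintro\n"
    "## Routing\nsub\n"
    "# Routing\nbody\n"
    "# Middleware\nmw\n"
)
print(repr(extract_section(doc, "Routing")))  # '# Routing\nbody\n'
```

With the same input, `doc.find("# Routing")` returns position 0, which is the `# Routing Basics` heading rather than the intended section.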
#### Script 2: `scripts/list_topics.py`
**Purpose:** Show all available documentation sections.
```python
#!/usr/bin/env python3
"""
ABOUTME: Lists all available documentation sections with sizes
ABOUTME: Helps agent discover what documentation exists
"""
import json
from pathlib import Path


def list_topics():
    """List all available documentation sections."""
    catalog_path = Path(__file__).parent.parent / "assets" / "catalog.json"
    with open(catalog_path, encoding='utf-8') as f:
        catalog = json.load(f)

    print(f"# Available Documentation Topics ({catalog['metadata']['framework']})\n")
    print(f"**Total sections:** {catalog['metadata']['total_sections']}")
    print(f"**Variants:** {', '.join(catalog['metadata']['variants'].keys())}\n")
    print("---\n")

    # Group by complexity if available
    by_complexity = {}
    for section in catalog["sections"]:
        complexity = section.get("complexity", "general")
        if complexity not in by_complexity:
            by_complexity[complexity] = []
        by_complexity[complexity].append(section)

    for complexity in ["beginner", "intermediate", "advanced", "general"]:
        if complexity not in by_complexity:
            continue
        sections = by_complexity[complexity]
        print(f"## {complexity.title()} ({len(sections)} sections)\n")
        for section in sections:
            size_kb = section["size_bytes"] / 1024
            topics_str = ", ".join(section["topics"][:3])
            print(f"- **{section['title']}** ({size_kb:.1f}k)")
            print(f"  Topics: {topics_str}")
            print(f"  Search: `python scripts/search.py {section['id']}`\n")


if __name__ == "__main__":
    list_topics()
```
#### Script 3: `scripts/get_section.py`
**Purpose:** Extract a complete section by exact title.
```python
#!/usr/bin/env python3
"""
ABOUTME: Extracts a complete documentation section by title
ABOUTME: Returns full section from llms-full.md (no truncation)
"""
import json
import sys
from pathlib import Path


def get_section(title, variant="complete"):
    """
    Get a complete section by exact title.

    Args:
        title: Section title (e.g., "Middleware", "Routing")
        variant: Which file to use (quick/standard/complete)

    Returns:
        Complete section content
    """
    catalog_path = Path(__file__).parent.parent / "assets" / "catalog.json"
    with open(catalog_path, encoding='utf-8') as f:
        catalog = json.load(f)

    # Find section
    section = None
    for s in catalog["sections"]:
        if s["title"].lower() == title.lower():
            section = s
            break
    if not section:
        return f"❌ Section '{title}' not found. Try: python scripts/list_topics.py"

    # Load doc; an unknown variant falls back to the complete file
    # (a filename, not the variant key)
    variants = catalog["metadata"]["variants"]
    variant_file = variants.get(variant, variants["complete"])
    doc_path = Path(__file__).parent.parent / "references" / variant_file
    doc_content = doc_path.read_text(encoding='utf-8')

    # Extract section
    h1_marker = section["h1_marker"]
    start = doc_content.find(h1_marker)
    if start == -1:
        return f"❌ Section '{title}' not found in {variant_file}"
    next_h1 = doc_content.find("\n# ", start + len(h1_marker))
    if next_h1 == -1:
        section_text = doc_content[start:]
    else:
        section_text = doc_content[start:next_h1]
    return section_text


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python get_section.py <title> [variant]")
        print("Example: python get_section.py Middleware")
        print("Example: python get_section.py Routing quick")
        sys.exit(1)
    title = sys.argv[1]
    variant = sys.argv[2] if len(sys.argv) > 2 else "complete"
    print(get_section(title, variant))
```
---
### Component 4: Active SKILL.md Template
**New template for llms.txt-based skills:**
```markdown
---
name: {name}
description: {description}
type: active
---
# {Name} Skill
**⚡ This is an ACTIVE skill** - Uses scripts to load documentation on-demand instead of dumping everything into context.
## 🎯 Strategy: Demand-Driven Documentation
**Traditional approach:**
- Load 300k+ documentation into context
- Agent reads everything to answer one question
- Context bloat, slower performance
**Active approach:**
- Load 5-10k of relevant sections on-demand
- Agent calls scripts to fetch what's needed
- Clean context, faster performance
## 📚 Available Documentation
This skill provides access to {num_sections} documentation sections across 3 detail levels:
- **Quick Reference** (`llms-small.md`): {small_size}k - Curated essentials
- **Standard** (`llms.md`): {standard_size}k - Core concepts
- **Complete** (`llms-full.md`): {full_size}k - Everything
## 🔧 Tools Available
### 1. Search Documentation
Find and load only relevant sections:
```bash
python scripts/search.py "middleware"
python scripts/search.py "routing" --detail quick
```
**Returns:** 5-10k of relevant content (not 300k!)
### 2. List All Topics
See what documentation exists:
```bash
python scripts/list_topics.py
```
**Returns:** Table of contents with section sizes and search hints
### 3. Get Complete Section
Extract a full section by title:
```bash
python scripts/get_section.py "Middleware"
python scripts/get_section.py "Routing" quick
```
**Returns:** Complete section from chosen variant
## 💡 Recommended Workflow
1. **Discover:** `python scripts/list_topics.py` to see what's available
2. **Search:** `python scripts/search.py "your topic"` to find relevant sections
3. **Deep Dive:** Use returned content to answer questions in detail
4. **Iterate:** Search more specific topics as needed
## ⚠️ Important
**DON'T:** Read `references/*.md` files directly into context
**DO:** Use scripts to fetch only what you need
This keeps your context clean and focused!
## 📊 Index
Complete section catalog available in `assets/catalog.json` with search mappings and size information.
## 🔄 Updating
To refresh with latest documentation:
```bash
python3 cli/doc_scraper.py --config configs/{name}.json
```
```
---
## Implementation Plan
### Phase 1: Foundation (Quick Fixes)
**Tasks:**
1. Fix `.txt` → `.md` renaming in downloader
2. Download all 3 variants (not just one)
3. Store all variants in `references/` with correct names
4. Remove content truncation (2500 chars → unlimited)
**Time:** 1-2 hours
**Files:** `cli/doc_scraper.py`, `cli/llms_txt_downloader.py`
### Phase 2: Catalog System
**Tasks:**
1. Implement `_generate_catalog()` method
2. Parse llms-small.md to extract sections
3. Build search index from topics
4. Generate `assets/catalog.json`
**Time:** 2-3 hours
**Files:** `cli/doc_scraper.py`
### Phase 3: Active Scripts
**Tasks:**
1. Create `scripts/search.py`
2. Create `scripts/list_topics.py`
3. Create `scripts/get_section.py`
4. Make scripts executable (`chmod +x`)
**Time:** 2-3 hours
**Files:** New scripts in `scripts/` template directory
### Phase 4: Template Updates
**Tasks:**
1. Create new active SKILL.md template
2. Update `create_enhanced_skill_md()` to use active template for llms.txt skills
3. Update documentation to explain active skills
**Time:** 1 hour
**Files:** `cli/doc_scraper.py`, `README.md`, `CLAUDE.md`
### Phase 5: Testing & Refinement
**Tasks:**
1. Test with Hono skill (has all 3 variants)
2. Test search accuracy
3. Measure context reduction
4. Document examples
**Time:** 2-3 hours
**Total Estimated Time:** 8-12 hours
---
## Migration Path
### Backward Compatibility
**Existing skills:** No changes (passive skills still work)
**New llms.txt skills:** Automatically use active architecture
**User choice:** Can disable via config flag
### Config Option
```json
{
"name": "hono",
"llms_txt_url": "https://hono.dev/llms-full.txt",
"active_skill": true, // NEW: Enable active architecture (default: true)
"base_url": "https://hono.dev/docs"
}
```
### Detection Logic
```python
# In _try_llms_txt()
active_mode = self.config.get('active_skill', True)  # Default true
if active_mode:
    # Download all variants, generate catalog, create scripts
    self._build_active_skill(downloaded)
else:
    # Traditional: single file, no scripts
    self._build_passive_skill(downloaded)
```
---
## Benefits Analysis
### Context Efficiency
| Scenario | Passive Skill | Active Skill | Improvement |
|----------|---------------|--------------|-------------|
| Simple query | 203k loaded | 5k loaded | **40x reduction** |
| Multi-topic query | 203k loaded | 15k loaded | **13x reduction** |
| Deep dive | 203k loaded | 30k loaded | **6x reduction** |
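
The Improvement column follows from dividing the passive load by the active load and rounding down. A quick check of that arithmetic, using only the sizes from the table:

```python
PASSIVE_KB = 203  # context loaded by the passive skill, per the table above

# Active-skill load per scenario (k), per the table above
active_kb = {"Simple query": 5, "Multi-topic query": 15, "Deep dive": 30}

reductions = {name: PASSIVE_KB / kb for name, kb in active_kb.items()}
for name, factor in reductions.items():
    print(f"{name}: {factor:.1f}x")
# Simple query: 40.6x
# Multi-topic query: 13.5x
# Deep dive: 6.8x
```

The unrounded factors are slightly higher than the table's 40x, 13x, and 6x, so the table's figures are conservative.
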
### Data Fidelity
| Aspect | Passive | Active |
|--------|---------|--------|
| Content truncation | 36% lost | 0% lost |
| Code truncation | 600 chars max | Unlimited |
| Variants available | 1 | 3 |
### Agent Capabilities
**Passive Skills:**
- ❌ Cannot choose detail level
- ❌ Cannot search efficiently
- ❌ Must read entire context
- ❌ Limited by context window
**Active Skills:**
- ✅ Chooses appropriate detail level
- ✅ Searches catalog efficiently
- ✅ Loads only what's needed
- ✅ Unlimited documentation access
---
## Trade-offs
### Advantages
1. **Massive context reduction** (roughly 6-40x less per query, per the Context Efficiency table)
2. **No content loss** (all 3 variants preserved)
3. **Correct file format** (.md not .txt)
4. **Agent autonomy** (tools to fetch docs)
5. **Scalable** (works with 1MB+ docs)
### Disadvantages
1. **Complexity** (scripts + catalog vs simple files)
2. **Initial overhead** (catalog generation)
3. **Agent learning curve** (must learn to use scripts)
4. **Dependency** (Python required to run scripts)
### Risk Mitigation
**Risk:** Scripts don't work in Claude's sandbox
**Mitigation:** Test thoroughly, provide fallback to passive mode
**Risk:** Catalog generation fails
**Mitigation:** Graceful degradation to single-file mode
**Risk:** Agent doesn't use scripts
**Mitigation:** Clear SKILL.md instructions, examples in quick reference
---
## Success Metrics
### Technical Metrics
- ✅ Context per query < 20k (down from 203k)
- ✅ All 3 variants downloaded and named correctly
- ✅ 0% content truncation
- ✅ Catalog generation < 5 seconds
- ✅ Search script < 1 second response time
### User Experience Metrics
- ✅ Agent successfully uses scripts without prompting
- ✅ Answers are equally or more accurate than passive mode
- ✅ Agent can handle queries about all documentation sections
- ✅ No "context limit exceeded" errors
---
## Future Enhancements
### Phase 6: Smart Caching
Cache frequently accessed sections in SKILL.md quick reference:
```python
# Track access frequency in catalog.json
"sections": [
    {
        "id": "middleware",
        "access_count": 47,  # NEW: Track usage
        "last_accessed": "2025-10-24T14:30:00Z"
    }
]
# Include top 10 most-accessed sections directly in SKILL.md
```
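
How the counters get updated is left open above. One possible shape, sketched under the assumption that the catalog is held in memory as a dict and written back by the caller, is shown below; `record_access` and `top_sections` are hypothetical names, not part of the design.

```python
from datetime import datetime, timezone


def record_access(catalog, section_id):
    """Increment the usage counter for one section; return False if it is unknown."""
    for section in catalog["sections"]:
        if section["id"] == section_id:
            section["access_count"] = section.get("access_count", 0) + 1
            section["last_accessed"] = datetime.now(timezone.utc).isoformat()
            return True
    return False


def top_sections(catalog, limit=10):
    """Return the IDs of the most frequently accessed sections, most used first."""
    ranked = sorted(
        catalog["sections"],
        key=lambda s: s.get("access_count", 0),
        reverse=True,
    )
    return [s["id"] for s in ranked[:limit]]


catalog = {"sections": [{"id": "routing"}, {"id": "middleware"}]}
record_access(catalog, "middleware")
record_access(catalog, "middleware")
record_access(catalog, "routing")
print(top_sections(catalog, limit=1))  # ['middleware']
```

Writing counters back to `catalog.json` on every query would turn a read-only asset into mutable state, which is a trade-off worth deciding explicitly before implementing this phase.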
### Phase 7: Semantic Search
Use embeddings for better search:
```python
# Generate embeddings for each section
"sections": [
    {
        "id": "middleware",
        "embedding": [...],  # NEW: Vector embedding
        "topics": ["cors", "auth"]
    }
]
# In search.py: Use cosine similarity for better matches
```
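
The ranking step can be sketched in plain Python. This is only an illustration of cosine similarity over stored vectors: the function names are hypothetical, the two-dimensional vectors are toy data, and producing real embeddings (the model, its dimensionality, where it runs) is outside what the design specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sections(query_embedding, sections, top_k=3):
    """Return section IDs ordered by similarity to the query, best match first."""
    scored = [
        (cosine_similarity(query_embedding, s["embedding"]), s["id"])
        for s in sections
    ]
    scored.sort(reverse=True)
    return [section_id for _, section_id in scored[:top_k]]


# Toy vectors purely for illustration
sections = [
    {"id": "middleware", "embedding": [1.0, 0.0]},
    {"id": "routing", "embedding": [0.0, 1.0]},
]
print(rank_sections([0.9, 0.1], sections, top_k=1))  # ['middleware']
```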
### Phase 8: Progressive Loading
Load increasingly detailed docs:
```python
# First: Load llms.md (5.4k - overview)
# If insufficient: Load llms-small.md section (15k)
# If still insufficient: Load llms-full.md section (30k)
```
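
The escalation described in those comments could be expressed as a loop over loaders ordered from least to most detailed. The sketch below assumes a caller-supplied `is_sufficient` predicate, since the design does not say how sufficiency is judged; `load_progressively` and the stub loaders are hypothetical.

```python
def load_progressively(loaders, is_sufficient):
    """Try each (variant, loader) pair in order and stop at the first sufficient result.

    Returns the last (variant, content) pair tried, so the most detailed
    content is returned even if nothing passes the predicate.
    """
    last = (None, "")
    for variant, load in loaders:
        content = load()
        last = (variant, content)
        if is_sufficient(content):
            break
    return last


# Stub loaders standing in for reads of llms.md, llms-small.md, llms-full.md
loaders = [
    ("standard", lambda: "overview"),
    ("quick", lambda: "overview plus cors details"),
    ("complete", lambda: "everything, including cors internals"),
]

variant, content = load_progressively(loaders, lambda text: "cors" in text)
print(variant)  # quick
```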
---
## Conclusion
Active skills represent a fundamental shift from **documentation repositories** to **documentation routers**. By treating skills as intelligent intermediaries rather than static dumps, we can:
1. **Eliminate context bloat** (up to 40x reduction)
2. **Preserve full fidelity** (0% truncation)
3. **Enable agent autonomy** (tools to fetch docs)
4. **Scale indefinitely** (no size limits)
This design maintains backward compatibility while unlocking new capabilities for modern, LLM-optimized documentation sources like llms.txt.
**Recommendation:** Implement in phases, starting with foundation fixes, then catalog system, then active scripts. Test thoroughly with Hono before making it the default for all llms.txt-based skills.
---
## References
- Original brainstorming session: 2025-10-24
- llms.txt convention: https://llmstxt.org/
- Hono example: https://hono.dev/llms-full.txt
- Skill_Seekers repository: Current project
---
## Appendix: Example Workflows
### Example 1: Agent Searches for "Middleware"
```bash
# Agent runs:
python scripts/search.py "middleware"
# Script returns ~8k of middleware documentation from llms-full.md
# Agent uses that 8k to answer the question
# Total context used: 8k (not 319k!)
```
### Example 2: Agent Explores Documentation
```bash
# 1. Agent lists topics
python scripts/list_topics.py
# Returns: Table of contents (2k)
# 2. Agent picks a topic
python scripts/get_section.py "Routing"
# Returns: Complete Routing section (5k)
# 3. Agent searches related topics
python scripts/search.py "path parameters"
# Returns: Routing + Path section (7k)
# Total context used across 3 queries: 14k (not 3 × 319k = 957k!)
```
### Example 3: Agent Needs Quick Answer
```bash
# Agent uses quick variant for overview
python scripts/search.py "cors" --detail quick
# Returns: Short CORS explanation from llms-small.md (2k)
# If insufficient, agent can follow up with:
python scripts/get_section.py "Middleware" # Full section from llms-full.md
```
---
**Document Status:** Ready for review and implementation planning.


@ -0,0 +1,682 @@
# Active Skills Phase 1: Foundation Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Fix fundamental issues in llms.txt handling: rename .txt→.md, download all 3 variants, remove truncation.
**Architecture:** Modify existing llms.txt download/parse/build workflow to handle multiple variants correctly, store with proper extensions, and preserve complete content without truncation.
**Tech Stack:** Python 3.10+, requests, BeautifulSoup4, existing Skill_Seekers architecture
---
## Task 1: Add Multi-Variant Detection
**Files:**
- Modify: `cli/llms_txt_detector.py`
- Test: `tests/test_llms_txt_detector.py`
**Step 1: Write failing test for detect_all() method**
```python
# tests/test_llms_txt_detector.py (add new test)
def test_detect_all_variants():
    """Test detecting all llms.txt variants"""
    from unittest.mock import patch, Mock

    detector = LlmsTxtDetector("https://hono.dev/docs")

    with patch('cli.llms_txt_detector.requests.head') as mock_head:
        # Mock responses for different variants
        def mock_response(url, **kwargs):
            response = Mock()
            # All 3 variants exist for Hono
            if 'llms-full.txt' in url or 'llms.txt' in url or 'llms-small.txt' in url:
                response.status_code = 200
            else:
                response.status_code = 404
            return response

        mock_head.side_effect = mock_response

        variants = detector.detect_all()

        assert len(variants) == 3
        assert any(v['variant'] == 'full' for v in variants)
        assert any(v['variant'] == 'standard' for v in variants)
        assert any(v['variant'] == 'small' for v in variants)
        assert all('url' in v for v in variants)
```
**Step 2: Run test to verify it fails**
Run: `source .venv/bin/activate && pytest tests/test_llms_txt_detector.py::test_detect_all_variants -v`
Expected: FAIL with "AttributeError: 'LlmsTxtDetector' object has no attribute 'detect_all'"
**Step 3: Implement detect_all() method**
```python
# cli/llms_txt_detector.py (add new method)
def detect_all(self) -> List[Dict[str, str]]:
    """
    Detect all available llms.txt variants.

    Returns:
        List of dicts with 'url' and 'variant' keys for each found variant
    """
    found_variants = []

    # The site root is the same for every variant, so compute it once
    parsed = urlparse(self.base_url)
    root_url = f"{parsed.scheme}://{parsed.netloc}"

    for filename, variant in self.VARIANTS:
        url = f"{root_url}/{filename}"
        if self._check_url_exists(url):
            found_variants.append({
                'url': url,
                'variant': variant
            })

    return found_variants
```
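
`detect_all()` relies on `self.VARIANTS` and `self._check_url_exists()`, neither of which is shown in this plan. To make the URL construction concrete, here is a standalone sketch of just the candidate-building step. The `VARIANTS` table below is an assumed shape inferred from the variant names used in the test (`full`, `standard`, `small`), and `candidate_urls` is a hypothetical function; no network request is made.

```python
from urllib.parse import urlparse

# Assumed shape of the detector's VARIANTS table; the real class attribute
# is not shown in this plan
VARIANTS = [
    ("llms-full.txt", "full"),
    ("llms.txt", "standard"),
    ("llms-small.txt", "small"),
]


def candidate_urls(base_url):
    """Build the site-root URL for every variant, ignoring the path of base_url."""
    parsed = urlparse(base_url)
    root_url = f"{parsed.scheme}://{parsed.netloc}"
    return [
        {"url": f"{root_url}/{filename}", "variant": variant}
        for filename, variant in VARIANTS
    ]


for candidate in candidate_urls("https://hono.dev/docs"):
    print(candidate["variant"], candidate["url"])
# full https://hono.dev/llms-full.txt
# standard https://hono.dev/llms.txt
# small https://hono.dev/llms-small.txt
```

Note that the `/docs` path is discarded: the variants are looked up at the site root, which is where the Hono files referenced in this plan live.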
**Step 4: Add import for List and Dict at top of file**
```python
# cli/llms_txt_detector.py (add to imports)
from typing import Optional, Dict, List
```
**Step 5: Run test to verify it passes**
Run: `source .venv/bin/activate && pytest tests/test_llms_txt_detector.py::test_detect_all_variants -v`
Expected: PASS
**Step 6: Commit**
```bash
git add cli/llms_txt_detector.py tests/test_llms_txt_detector.py
git commit -m "feat: add detect_all() for multi-variant detection"
```
---
## Task 2: Add File Extension Renaming to Downloader
**Files:**
- Modify: `cli/llms_txt_downloader.py`
- Test: `tests/test_llms_txt_downloader.py`
**Step 1: Write failing test for get_proper_filename() method**
```python
# tests/test_llms_txt_downloader.py (add new test)
def test_get_proper_filename():
    """Test filename conversion from .txt to .md"""
    downloader = LlmsTxtDownloader("https://hono.dev/llms-full.txt")
    filename = downloader.get_proper_filename()
    assert filename == "llms-full.md"
    assert not filename.endswith('.txt')


def test_get_proper_filename_standard():
    """Test standard variant naming"""
    downloader = LlmsTxtDownloader("https://hono.dev/llms.txt")
    filename = downloader.get_proper_filename()
    assert filename == "llms.md"


def test_get_proper_filename_small():
    """Test small variant naming"""
    downloader = LlmsTxtDownloader("https://hono.dev/llms-small.txt")
    filename = downloader.get_proper_filename()
    assert filename == "llms-small.md"
```
**Step 2: Run test to verify it fails**
Run: `source .venv/bin/activate && pytest tests/test_llms_txt_downloader.py::test_get_proper_filename -v`
Expected: FAIL with "AttributeError: 'LlmsTxtDownloader' object has no attribute 'get_proper_filename'"
**Step 3: Implement get_proper_filename() method**
```python
# cli/llms_txt_downloader.py (add new method)
def get_proper_filename(self) -> str:
    """
    Extract filename from URL and convert .txt to .md

    Returns:
        Proper filename with .md extension

    Examples:
        https://hono.dev/llms-full.txt -> llms-full.md
        https://hono.dev/llms.txt -> llms.md
        https://hono.dev/llms-small.txt -> llms-small.md
    """
    # Extract filename from URL
    from urllib.parse import urlparse
    parsed = urlparse(self.url)
    filename = parsed.path.split('/')[-1]

    # Replace .txt with .md
    if filename.endswith('.txt'):
        filename = filename[:-4] + '.md'

    return filename
```
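
Because the method takes the filename from `parsed.path`, a query string on the URL does not leak into the saved name. The standalone sketch below illustrates that behaviour with a `pathlib`-based variant of the same logic; `proper_filename` is a hypothetical free function, not the method itself, and the `example.com` URL is made up for the demonstration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def proper_filename(url):
    """Return the last path component of `url`, with a .txt suffix swapped for .md."""
    name = PurePosixPath(urlparse(url).path).name
    if name.endswith(".txt"):
        name = str(PurePosixPath(name).with_suffix(".md"))
    return name


print(proper_filename("https://hono.dev/llms-full.txt"))    # llms-full.md
print(proper_filename("https://hono.dev/llms.txt?v=2"))     # llms.md
print(proper_filename("https://example.com/docs/guide.md")) # guide.md
```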
**Step 4: Run test to verify it passes**
Run: `source .venv/bin/activate && pytest tests/test_llms_txt_downloader.py::test_get_proper_filename -v`
Expected: PASS (all 3 tests)
**Step 5: Commit**
```bash
git add cli/llms_txt_downloader.py tests/test_llms_txt_downloader.py
git commit -m "feat: add get_proper_filename() for .txt to .md conversion"
```
---
## Task 3: Update _try_llms_txt() to Download All Variants
**Files:**
- Modify: `cli/doc_scraper.py:337-384` (_try_llms_txt method)
- Test: `tests/test_integration.py`
**Step 1: Write failing test for multi-variant download**
```python
# tests/test_integration.py (add to TestFullLlmsTxtWorkflow class)
def test_multi_variant_download(self):
    """Test downloading all 3 llms.txt variants"""
    from unittest.mock import patch, Mock
    import tempfile
    import os

    config = {
        'name': 'test-multi-variant',
        'base_url': 'https://hono.dev/docs'
    }

    # Mock all 3 variants
    sample_full = "# Full\n" + "x" * 1000
    sample_standard = "# Standard\n" + "x" * 200
    sample_small = "# Small\n" + "x" * 500

    with tempfile.TemporaryDirectory() as tmpdir:
        with patch('cli.llms_txt_detector.requests.head') as mock_head, \
             patch('cli.llms_txt_downloader.requests.get') as mock_get:
            # Mock detection (all exist)
            mock_head_response = Mock()
            mock_head_response.status_code = 200
            mock_head.return_value = mock_head_response

            # Mock downloads
            def mock_download(url, **kwargs):
                response = Mock()
                response.status_code = 200
                if 'llms-full.txt' in url:
                    response.text = sample_full
                elif 'llms-small.txt' in url:
                    response.text = sample_small
                else:  # llms.txt
                    response.text = sample_standard
                return response

            mock_get.side_effect = mock_download

            # Run scraper
            scraper = DocumentationScraper(config, dry_run=False)
            result = scraper._try_llms_txt()

            # Verify all 3 files created
            refs_dir = os.path.join(scraper.skill_dir, 'references')
            assert os.path.exists(os.path.join(refs_dir, 'llms-full.md'))
            assert os.path.exists(os.path.join(refs_dir, 'llms.md'))
            assert os.path.exists(os.path.join(refs_dir, 'llms-small.md'))

            # Verify content not truncated
            with open(os.path.join(refs_dir, 'llms-full.md')) as f:
                content = f.read()
            assert len(content) == len(sample_full)
```
**Step 2: Run test to verify it fails**
Run: `source .venv/bin/activate && pytest tests/test_integration.py::TestFullLlmsTxtWorkflow::test_multi_variant_download -v`
Expected: FAIL - only one file created, not all 3
**Step 3: Modify _try_llms_txt() to use detect_all()**
```python
# cli/doc_scraper.py (replace _try_llms_txt method, lines 337-384)
def _try_llms_txt(self) -> bool:
    """
    Try to use llms.txt instead of HTML scraping.
    Downloads ALL available variants and stores with .md extension.

    Returns:
        True if llms.txt was found and processed successfully
    """
    print(f"\n🔍 Checking for llms.txt at {self.base_url}...")

    # Check for explicit config URL first
    explicit_url = self.config.get('llms_txt_url')
    if explicit_url:
        print(f"\n📌 Using explicit llms_txt_url from config: {explicit_url}")
        downloader = LlmsTxtDownloader(explicit_url)
        content = downloader.download()
        if content:
            # Save with proper .md extension
            filename = downloader.get_proper_filename()
            filepath = os.path.join(self.skill_dir, "references", filename)
            os.makedirs(os.path.dirname(filepath), exist_ok=True)
            with open(filepath, 'w', encoding='utf-8') as f:
                f.write(content)
            print(f"  💾 Saved {filename} ({len(content)} chars)")

            # Parse and save pages
            parser = LlmsTxtParser(content)
            pages = parser.parse()
            if pages:
                for page in pages:
                    self.save_page(page)
                    self.pages.append(page)
                self.llms_txt_detected = True
                self.llms_txt_variant = 'explicit'
                return True

    # Auto-detection: Find ALL variants
    detector = LlmsTxtDetector(self.base_url)
    variants = detector.detect_all()
    if not variants:
        print("  No llms.txt found, using HTML scraping")
        return False

    print(f"✅ Found {len(variants)} llms.txt variant(s)")

    # Download ALL variants
    downloaded = {}
    for variant_info in variants:
        url = variant_info['url']
        variant = variant_info['variant']
        print(f"  📥 Downloading {variant}...")
        downloader = LlmsTxtDownloader(url)
        content = downloader.download()
        if content:
            filename = downloader.get_proper_filename()
            downloaded[variant] = {
                'content': content,
                'filename': filename,
                'size': len(content)
            }
            print(f"  ✓ {filename} ({len(content)} chars)")

    if not downloaded:
        print("⚠️ Failed to download any variants, falling back to HTML scraping")
        return False

    # Save ALL variants to references/
    os.makedirs(os.path.join(self.skill_dir, "references"), exist_ok=True)
    for variant, data in downloaded.items():
        filepath = os.path.join(self.skill_dir, "references", data['filename'])
        with open(filepath, 'w', encoding='utf-8') as f:
            f.write(data['content'])
        print(f"  💾 Saved {data['filename']}")

    # Parse LARGEST variant for skill building
    largest = max(downloaded.items(), key=lambda x: x[1]['size'])
    print(f"\n📄 Parsing {largest[1]['filename']} for skill building...")
    parser = LlmsTxtParser(largest[1]['content'])
    pages = parser.parse()
    if not pages:
        print("⚠️ Failed to parse llms.txt, falling back to HTML scraping")
        return False

    print(f"  ✓ Parsed {len(pages)} sections")

    # Save pages for skill building
    for page in pages:
        self.save_page(page)
        self.pages.append(page)

    self.llms_txt_detected = True
    self.llms_txt_variants = list(downloaded.keys())
    return True
```
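
The method parses whichever download is largest rather than assuming `llms-full.md` is present, so a failed download of the full variant degrades to the next largest file. A minimal sketch of that selection, using the approximate Hono sizes quoted elsewhere in this plan as illustrative data (`pick_largest` is a hypothetical helper):

```python
# Illustrative sizes in characters, based on the Hono figures quoted in this plan
downloaded = {
    "full": {"filename": "llms-full.md", "size": 319_000},
    "standard": {"filename": "llms.md", "size": 5_400},
    "small": {"filename": "llms-small.md", "size": 176_000},
}


def pick_largest(downloaded):
    """Return the (variant, data) pair with the greatest size."""
    return max(downloaded.items(), key=lambda item: item[1]["size"])


variant, data = pick_largest(downloaded)
print(variant, data["filename"])  # full llms-full.md

# If the full variant failed to download, the next largest is parsed instead
without_full = {k: v for k, v in downloaded.items() if k != "full"}
print(pick_largest(without_full)[0])  # small
```

With these sizes the `small` variant is far larger than `standard`, so the variant names alone are not a reliable guide to which file carries the most content.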
**Step 4: Add llms_txt_variants attribute to __init__**
```python
# cli/doc_scraper.py (in __init__ method, after llms_txt_variant line)
self.llms_txt_variants = [] # Track all downloaded variants
```
**Step 5: Run test to verify it passes**
Run: `source .venv/bin/activate && pytest tests/test_integration.py::TestFullLlmsTxtWorkflow::test_multi_variant_download -v`
Expected: PASS
**Step 6: Commit**
```bash
git add cli/doc_scraper.py tests/test_integration.py
git commit -m "feat: download all llms.txt variants with proper .md extension"
```
---
## Task 4: Remove Content Truncation
**Files:**
- Modify: `cli/doc_scraper.py:714-730` (create_reference_file method)
**Step 1: Write failing test for no truncation**
```python
# tests/test_integration.py (add new test)
def test_no_content_truncation():
    """Test that content is NOT truncated in reference files"""
    from unittest.mock import Mock
    import tempfile
    import os

    config = {
        'name': 'test-no-truncate',
        'base_url': 'https://example.com/docs'
    }

    # Create scraper with long content
    scraper = DocumentationScraper(config, dry_run=False)

    # Create page with content > 2500 chars
    long_content = "x" * 5000
    long_code = "y" * 1000
    pages = [{
        'title': 'Long Page',
        'url': 'https://example.com/long',
        'content': long_content,
        'code_samples': [
            {'code': long_code, 'language': 'python'}
        ],
        'headings': []
    }]

    # Create reference file
    scraper.create_reference_file('test', pages)

    # Verify no truncation
    ref_file = os.path.join(scraper.skill_dir, 'references', 'test.md')
    with open(ref_file, 'r') as f:
        content = f.read()

    assert long_content in content  # Full content included
    assert long_code in content  # Full code included
    assert '[Content truncated]' not in content
    assert '...' not in content  # No truncation suffix on code samples
```
**Step 2: Run test to verify it fails**
Run: `source .venv/bin/activate && pytest tests/test_integration.py::test_no_content_truncation -v`
Expected: FAIL - content contains "[Content truncated]" or "..."
**Step 3: Remove truncation from create_reference_file()**
```python
# cli/doc_scraper.py (modify create_reference_file method, lines 712-731)
# OLD (line 714-716):
# if page.get('content'):
#     content = page['content'][:2500]
#     if len(page['content']) > 2500:
#         content += "\n\n*[Content truncated]*"
# NEW (replace with):
if page.get('content'):
    content = page['content']  # NO TRUNCATION
    lines.append(content)
    lines.append("")

# OLD (line 728-730):
# lines.append(code[:600])
# if len(code) > 600:
#     lines.append("...")
# NEW (replace with):
lines.append(code)  # NO TRUNCATION
# No "..." suffix
```
**Complete replacement of lines 712-731:**
```python
# cli/doc_scraper.py:712-731 (complete replacement)
# Content (NO TRUNCATION)
if page.get('content'):
    lines.append(page['content'])
    lines.append("")

# Code examples with language (NO TRUNCATION)
if page.get('code_samples'):
    lines.append("**Examples:**\n")
    for i, sample in enumerate(page['code_samples'][:4], 1):
        # A sample may be a dict or a bare string; check before calling .get()
        if isinstance(sample, str):
            lang, code = 'unknown', sample
        else:
            lang = sample.get('language', 'unknown')
            code = sample.get('code', '')
        lines.append(f"Example {i} ({lang}):")
        lines.append(f"```{lang}")
        lines.append(code)  # Full code, no truncation
        lines.append("```\n")
```
**Step 4: Run test to verify it passes**
Run: `source .venv/bin/activate && pytest tests/test_integration.py::test_no_content_truncation -v`
Expected: PASS
**Step 5: Run full test suite to check for regressions**
Run: `source .venv/bin/activate && pytest tests/ -v`
Expected: All 201+ tests pass
**Step 6: Commit**
```bash
git add cli/doc_scraper.py tests/test_integration.py
git commit -m "feat: remove content truncation in reference files"
```
---
## Task 5: Update Documentation
**Files:**
- Modify: `docs/plans/2025-10-24-active-skills-design.md`
- Modify: `CHANGELOG.md`
**Step 1: Update design doc status**
```markdown
# docs/plans/2025-10-24-active-skills-design.md (update header)
**Status:** Phase 1 Implemented ✅
```
**Step 2: Add CHANGELOG entry**
```markdown
# CHANGELOG.md (add new section at top)
## [Unreleased]
### Added - Phase 1: Active Skills Foundation
- Multi-variant llms.txt detection: downloads all 3 variants (full, standard, small)
- Automatic .txt → .md file extension conversion
- No content truncation: preserves complete documentation
- `detect_all()` method for finding all llms.txt variants
- `get_proper_filename()` for correct .md naming
### Changed
- `_try_llms_txt()` now downloads all available variants instead of just one
- Reference files now contain complete content (no 2500 char limit)
- Code samples now include full code (no 600 char limit)
### Fixed
- File extension bug: llms.txt files now saved as .md
- Content loss: 0% truncation (was 36%)
```
**Step 3: Commit**
```bash
git add docs/plans/2025-10-24-active-skills-design.md CHANGELOG.md
git commit -m "docs: update status for Phase 1 completion"
```
---
## Task 6: Manual Verification
**Files:**
- None (manual testing)
**Step 1: Test with Hono config**
Run: `source .venv/bin/activate && python3 cli/doc_scraper.py --config configs/hono.json`
**Expected output:**
```
🔍 Checking for llms.txt at https://hono.dev/docs...
📌 Using explicit llms_txt_url from config: https://hono.dev/llms-full.txt
💾 Saved llms-full.md (319000 chars)
📄 Parsing llms-full.md for skill building...
✓ Parsed 93 sections
✅ Used llms.txt (explicit) - skipping HTML scraping
```
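**Step 2: Verify all 3 files exist with correct extensions** (applies to the auto-detection path; with an explicit `llms_txt_url`, `_try_llms_txt()` as written in Task 3 saves only that one file)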
Run: `ls -lah output/hono/references/llms*.md`
Expected:
```
llms-full.md 319k
llms.md 5.4k
llms-small.md 176k
```
**Step 3: Verify no truncation in reference files**
Run: `grep -c "Content truncated" output/hono/references/*.md`
Expected: 0 matches (no truncation messages)
**Step 4: Check file sizes are correct**
Run: `wc -c output/hono/references/llms-full.md`
Expected: Should match original download size (~319k), not reduced to 203k
**Step 5: Verify all tests still pass**
Run: `source .venv/bin/activate && pytest tests/ -v`
Expected: All tests pass (201+)
---
## Completion Checklist
- [ ] Task 1: Multi-variant detection (detect_all)
- [ ] Task 2: File extension renaming (get_proper_filename)
- [ ] Task 3: Download all variants (_try_llms_txt)
- [ ] Task 4: Remove truncation (create_reference_file)
- [ ] Task 5: Update documentation
- [ ] Task 6: Manual verification
- [ ] All tests passing
- [ ] No regressions in existing functionality
---
## Success Criteria
**Technical:**
- ✅ All 3 variants downloaded when available
- ✅ Files saved with .md extension (not .txt)
- ✅ 0% content truncation (was 36%)
- ✅ All existing tests pass
- ✅ New tests cover all changes
**User Experience:**
- ✅ Hono skill has all 3 files: llms-full.md, llms.md, llms-small.md
- ✅ Reference files contain complete documentation
- ✅ No "[Content truncated]" messages in output
---
## Related Skills
- @superpowers:test-driven-development - Used throughout for TDD approach
- @superpowers:verification-before-completion - Used in Task 6 for manual verification
---
## Notes
- This plan implements Phase 1 from `docs/plans/2025-10-24-active-skills-design.md`
- Phase 2 (Catalog System) and Phase 3 (Active Scripts) will be separate plans
- All changes maintain backward compatibility with existing HTML scraping
- File extension fix (.txt → .md) is critical for proper skill functionality
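To make the extension fix concrete, here is a minimal sketch of the renaming rule. It is not the project's `get_proper_filename()`, whose signature and behavior may differ; `proper_filename` is a hypothetical stand-in that only illustrates the `.txt` to `.md` mapping described above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def proper_filename(url):
    """Derive a local filename from a URL, swapping a .txt suffix for .md."""
    # Take the last path segment of the URL, ignoring query and fragment
    name = PurePosixPath(urlparse(url).path).name
    if name.endswith(".txt"):
        name = name[: -len(".txt")] + ".md"
    return name


print(proper_filename("https://hono.dev/llms-full.txt"))  # llms-full.md
```

Names that do not end in `.txt` are returned unchanged in this sketch.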
---
## Estimated Time
- Task 1: 15 minutes
- Task 2: 15 minutes
- Task 3: 30 minutes
- Task 4: 20 minutes
- Task 5: 10 minutes
- Task 6: 15 minutes
**Total: ~1.75 hours (105 minutes)**


@ -0,0 +1,11 @@
{
  "mcpServers": {
    "skill-seeker": {
      "command": "python3",
      "args": [
        "/mnt/1ece809a-2821-4f10-aecb-fcdf34760c0b/Git/Skill_Seekers/mcp/server.py"
      ],
      "cwd": "/mnt/1ece809a-2821-4f10-aecb-fcdf34760c0b/Git/Skill_Seekers"
    }
  }
}


@ -0,0 +1,13 @@
[mypy]
python_version = 3.10
warn_return_any = False
warn_unused_configs = True
disallow_untyped_defs = False
check_untyped_defs = True
ignore_missing_imports = True
no_implicit_optional = True
show_error_codes = True
# Gradual typing - be lenient for now
disallow_incomplete_defs = False
disallow_untyped_calls = False


@ -0,0 +1,149 @@
[build-system]
requires = ["setuptools>=61.0", "wheel"]
build-backend = "setuptools.build_meta"
[project]
name = "skill-seekers"
version = "2.1.1"
description = "Convert documentation websites, GitHub repositories, and PDFs into Claude AI skills"
readme = "README.md"
requires-python = ">=3.10"
license = {text = "MIT"}
authors = [
{name = "Yusuf Karaaslan"}
]
keywords = [
"claude",
"ai",
"documentation",
"scraping",
"skills",
"llm",
"mcp",
"automation"
]
classifiers = [
"Development Status :: 4 - Beta",
"Intended Audience :: Developers",
"License :: OSI Approved :: MIT License",
"Operating System :: OS Independent",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
"Topic :: Software Development :: Documentation",
"Topic :: Software Development :: Libraries :: Python Modules",
"Topic :: Text Processing :: Markup :: Markdown",
]
# Core dependencies
dependencies = [
"requests>=2.32.5",
"beautifulsoup4>=4.14.2",
"PyGithub>=2.5.0",
"mcp>=1.18.0",
"httpx>=0.28.1",
"httpx-sse>=0.4.3",
"PyMuPDF>=1.24.14",
"Pillow>=11.0.0",
"pytesseract>=0.3.13",
"pydantic>=2.12.3",
"pydantic-settings>=2.11.0",
"python-dotenv>=1.1.1",
"jsonschema>=4.25.1",
"click>=8.3.0",
"Pygments>=2.19.2",
]
[project.optional-dependencies]
# Development dependencies
dev = [
"pytest>=8.4.2",
"pytest-cov>=7.0.0",
"coverage>=7.11.0",
]
# MCP server dependencies (included by default, but optional)
mcp = [
"mcp>=1.18.0",
"httpx>=0.28.1",
"httpx-sse>=0.4.3",
"uvicorn>=0.38.0",
"starlette>=0.48.0",
"sse-starlette>=3.0.2",
]
# All optional dependencies combined
all = [
"pytest>=8.4.2",
"pytest-cov>=7.0.0",
"coverage>=7.11.0",
"mcp>=1.18.0",
"httpx>=0.28.1",
"httpx-sse>=0.4.3",
"uvicorn>=0.38.0",
"starlette>=0.48.0",
"sse-starlette>=3.0.2",
]
[project.urls]
Homepage = "https://github.com/yusufkaraaslan/Skill_Seekers"
Repository = "https://github.com/yusufkaraaslan/Skill_Seekers"
"Bug Tracker" = "https://github.com/yusufkaraaslan/Skill_Seekers/issues"
Documentation = "https://github.com/yusufkaraaslan/Skill_Seekers#readme"
[project.scripts]
# Main unified CLI
skill-seekers = "skill_seekers.cli.main:main"
# Individual tool entry points
skill-seekers-scrape = "skill_seekers.cli.doc_scraper:main"
skill-seekers-github = "skill_seekers.cli.github_scraper:main"
skill-seekers-pdf = "skill_seekers.cli.pdf_scraper:main"
skill-seekers-unified = "skill_seekers.cli.unified_scraper:main"
skill-seekers-enhance = "skill_seekers.cli.enhance_skill_local:main"
skill-seekers-package = "skill_seekers.cli.package_skill:main"
skill-seekers-upload = "skill_seekers.cli.upload_skill:main"
skill-seekers-estimate = "skill_seekers.cli.estimate_pages:main"
[tool.setuptools]
packages = ["skill_seekers", "skill_seekers.cli", "skill_seekers.mcp", "skill_seekers.mcp.tools"]
[tool.setuptools.package-dir]
"" = "src"
[tool.setuptools.package-data]
skill_seekers = ["py.typed"]
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]
addopts = "-v --tb=short --strict-markers"
[tool.coverage.run]
source = ["src/skill_seekers"]
omit = ["*/tests/*", "*/__pycache__/*", "*/venv/*"]
[tool.coverage.report]
exclude_lines = [
"pragma: no cover",
"def __repr__",
"raise AssertionError",
"raise NotImplementedError",
"if __name__ == .__main__.:",
"if TYPE_CHECKING:",
"@abstractmethod",
]
[tool.uv]
dev-dependencies = [
"pytest>=8.4.2",
"pytest-cov>=7.0.0",
"coverage>=7.11.0",
]
[tool.uv.sources]
# Use PyPI for all dependencies


@ -0,0 +1,42 @@
annotated-types==0.7.0
anyio==4.11.0
attrs==25.4.0
beautifulsoup4==4.14.2
certifi==2025.10.5
charset-normalizer==3.4.4
click==8.3.0
coverage==7.11.0
h11==0.16.0
httpcore==1.0.9
httpx==0.28.1
httpx-sse==0.4.3
idna==3.11
iniconfig==2.3.0
jsonschema==4.25.1
jsonschema-specifications==2025.9.1
mcp==1.18.0
packaging==25.0
pluggy==1.6.0
pydantic==2.12.3
pydantic-settings==2.11.0
pydantic_core==2.41.4
PyGithub==2.5.0
Pygments==2.19.2
PyMuPDF==1.24.14
Pillow==11.0.0
pytesseract==0.3.13
pytest==8.4.2
pytest-cov==7.0.0
python-dotenv==1.1.1
python-multipart==0.0.20
referencing==0.37.0
requests==2.32.5
rpds-py==0.27.1
sniffio==1.3.1
soupsieve==2.8
sse-starlette==3.0.2
starlette==0.48.0
typing-inspection==0.4.2
typing_extensions==4.15.0
urllib3==2.5.0
uvicorn==0.38.0


@ -0,0 +1,266 @@
#!/bin/bash
# Skill Seeker MCP Server - Quick Setup Script
# This script automates the MCP server setup for Claude Code
set -e # Exit on error
echo "=================================================="
echo "Skill Seeker MCP Server - Quick Setup"
echo "=================================================="
echo ""
# Colors for output
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
RED='\033[0;31m'
NC='\033[0m' # No Color
# Step 1: Check Python version
echo "Step 1: Checking Python version..."
if ! command -v python3 &> /dev/null; then
echo -e "${RED}❌ Error: python3 not found${NC}"
echo "Please install Python 3.10 or higher"
exit 1
fi
PYTHON_VERSION=$(python3 --version | cut -d' ' -f2)
echo -e "${GREEN}${NC} Python $PYTHON_VERSION found"
echo ""
# Step 2: Get repository path
REPO_PATH=$(pwd)
echo "Step 2: Repository location"
echo "Path: $REPO_PATH"
echo ""
# Step 3: Install dependencies
echo "Step 3: Installing Python dependencies..."
# Check if we're in a virtual environment
if [[ -n "$VIRTUAL_ENV" ]]; then
echo -e "${GREEN}${NC} Virtual environment detected: $VIRTUAL_ENV"
PIP_INSTALL_CMD="pip install"
elif [[ -d "venv" ]]; then
echo -e "${YELLOW}${NC} Virtual environment found but not activated"
echo "Activating venv..."
source venv/bin/activate
PIP_INSTALL_CMD="pip install"
else
echo -e "${YELLOW}${NC} No virtual environment found"
echo "It's recommended to use a virtual environment to avoid conflicts."
echo ""
read -p "Would you like to create one now? (y/n) " -n 1 -r
echo ""
if [[ $REPLY =~ ^[Yy]$ ]]; then
echo "Creating virtual environment..."
python3 -m venv venv || {
echo -e "${RED}❌ Failed to create virtual environment${NC}"
echo "Falling back to system install..."
PIP_INSTALL_CMD="pip3 install --user --break-system-packages"
}
if [[ -d "venv" ]]; then
source venv/bin/activate
PIP_INSTALL_CMD="pip install"
echo -e "${GREEN}${NC} Virtual environment created and activated"
fi
else
echo "Proceeding with system install (using --user --break-system-packages)..."
echo -e "${YELLOW}Note:${NC} This may override system-managed packages"
PIP_INSTALL_CMD="pip3 install --user --break-system-packages"
fi
fi
echo "This will install the skill-seekers package and its dependencies (including mcp, requests, beautifulsoup4)"
read -p "Continue? (y/n) " -n 1 -r
echo ""
if [[ $REPLY =~ ^[Yy]$ ]]; then
echo "Installing package in editable mode..."
$PIP_INSTALL_CMD -e . || {
echo -e "${RED}❌ Failed to install package${NC}"
exit 1
}
echo -e "${GREEN}${NC} Dependencies installed successfully"
else
echo "Skipping dependency installation"
fi
echo ""
# Step 4: Test MCP server
echo "Step 4: Testing MCP server..."
timeout 3 python3 src/skill_seekers/mcp/server.py 2>/dev/null || {
if [ $? -eq 124 ]; then
echo -e "${GREEN}${NC} MCP server starts correctly (timeout expected)"
else
echo -e "${YELLOW}${NC} MCP server test inconclusive, but may still work"
fi
}
echo ""
# Step 5: Optional - Run tests
echo "Step 5: Run test suite? (optional)"
read -p "Run MCP tests to verify everything works? (y/n) " -n 1 -r
echo ""
if [[ $REPLY =~ ^[Yy]$ ]]; then
# Check if pytest is installed
if ! command -v pytest &> /dev/null; then
echo "Installing pytest..."
$PIP_INSTALL_CMD pytest || {
echo -e "${YELLOW}${NC} Could not install pytest, skipping tests"
}
fi
if command -v pytest &> /dev/null; then
echo "Running MCP server tests..."
python3 -m pytest tests/test_mcp_server.py -v --tb=short || {
echo -e "${RED}❌ Some tests failed${NC}"
echo "The server may still work, but please check the errors above"
}
fi
else
echo "Skipping tests"
fi
echo ""
# Step 6: Configure Claude Code
echo "Step 6: Configure Claude Code"
echo "=================================================="
echo ""
echo "You need to add this configuration to Claude Code:"
echo ""
echo -e "${YELLOW}Configuration file:${NC} ~/.config/claude-code/mcp.json"
echo ""
echo "Add this JSON configuration (paths are auto-detected for YOUR system):"
echo ""
echo -e "${GREEN}{"
echo " \"mcpServers\": {"
echo " \"skill-seeker\": {"
echo " \"command\": \"python3\","
echo " \"args\": ["
echo " \"$REPO_PATH/src/skill_seekers/mcp/server.py\""
echo " ],"
echo " \"cwd\": \"$REPO_PATH\""
echo " }"
echo " }"
echo -e "}${NC}"
echo ""
echo -e "${YELLOW}Note:${NC} The paths above are YOUR actual paths (not placeholders!)"
echo ""
# Ask if user wants auto-configure
echo ""
read -p "Auto-configure Claude Code now? (y/n) " -n 1 -r
echo ""
if [[ $REPLY =~ ^[Yy]$ ]]; then
# Check if config already exists
if [ -f ~/.config/claude-code/mcp.json ]; then
echo -e "${YELLOW}⚠ Warning: ~/.config/claude-code/mcp.json already exists${NC}"
echo "Current contents:"
cat ~/.config/claude-code/mcp.json
echo ""
read -p "Overwrite? (y/n) " -n 1 -r
echo ""
if [[ ! $REPLY =~ ^[Yy]$ ]]; then
echo "Skipping auto-configuration"
echo "Please manually add the skill-seeker server to your config"
exit 0
fi
fi
# Create config directory
mkdir -p ~/.config/claude-code
# Write configuration with actual expanded path
cat > ~/.config/claude-code/mcp.json << EOF
{
"mcpServers": {
"skill-seeker": {
"command": "python3",
"args": [
"$REPO_PATH/src/skill_seekers/mcp/server.py"
],
"cwd": "$REPO_PATH"
}
}
}
EOF
echo -e "${GREEN}${NC} Configuration written to ~/.config/claude-code/mcp.json"
echo ""
echo "Configuration contents:"
cat ~/.config/claude-code/mcp.json
echo ""
# Verify the path exists
if [ -f "$REPO_PATH/src/skill_seekers/mcp/server.py" ]; then
echo -e "${GREEN}${NC} Verified: MCP server file exists at $REPO_PATH/src/skill_seekers/mcp/server.py"
else
echo -e "${RED}❌ Warning: MCP server not found at $REPO_PATH/src/skill_seekers/mcp/server.py${NC}"
echo "Please check the path!"
fi
else
echo "Skipping auto-configuration"
echo "Please manually configure Claude Code using the JSON above"
echo ""
echo "The JSON above already contains your actual repository path: $REPO_PATH"
fi
echo ""
# Step 7: Test the configuration
if [ -f ~/.config/claude-code/mcp.json ]; then
echo "Step 7: Testing MCP configuration..."
echo "Checking if paths are correct..."
# Extract the configured path
if command -v jq &> /dev/null; then
CONFIGURED_PATH=$(jq -r '.mcpServers["skill-seeker"].args[0]' ~/.config/claude-code/mcp.json 2>/dev/null || echo "")
if [ -n "$CONFIGURED_PATH" ] && [ -f "$CONFIGURED_PATH" ]; then
echo -e "${GREEN}${NC} MCP server path is valid: $CONFIGURED_PATH"
elif [ -n "$CONFIGURED_PATH" ]; then
echo -e "${YELLOW}${NC} Warning: Configured path doesn't exist: $CONFIGURED_PATH"
fi
else
echo "Install 'jq' for config validation: brew install jq (macOS) or apt install jq (Linux)"
fi
fi
echo ""
# Step 8: Final instructions
echo "=================================================="
echo "Setup Complete!"
echo "=================================================="
echo ""
echo "Next steps:"
echo ""
echo -e " 1. ${YELLOW}Restart Claude Code${NC} (quit and reopen, don't just close window)"
echo -e " 2. In Claude Code, test with: ${GREEN}\"List all available configs\"${NC}"
echo " 3. You should see 9 Skill Seeker tools available"
echo ""
echo "Available MCP Tools:"
echo " • generate_config - Create new config files"
echo " • estimate_pages - Estimate scraping time"
echo " • scrape_docs - Scrape documentation"
echo " • package_skill - Create .zip files"
echo " • list_configs - Show available configs"
echo " • validate_config - Validate config files"
echo ""
echo "Example commands to try in Claude Code:"
echo -e "${GREEN}List all available configs${NC}"
echo -e "${GREEN}Validate configs/react.json${NC}"
echo -e "${GREEN}Generate config for Tailwind at https://tailwindcss.com/docs${NC}"
echo ""
echo "Documentation:"
echo -e " • MCP Setup Guide: ${YELLOW}docs/MCP_SETUP.md${NC}"
echo -e " • Full docs: ${YELLOW}README.md${NC}"
echo ""
echo "Troubleshooting:"
echo " • Check logs: ~/Library/Logs/Claude Code/ (macOS)"
echo " • Test server: python3 src/skill_seekers/mcp/server.py"
echo " • Run tests: python3 -m pytest tests/test_mcp_server.py -v"
echo ""
echo "Happy skill creating! 🚀"


@ -0,0 +1,22 @@
"""
Skill Seekers - Convert documentation, GitHub repos, and PDFs into Claude AI skills.
This package provides tools for automatically scraping, organizing, and packaging
documentation from various sources into uploadable Claude AI skills.
"""
__version__ = "2.0.0"
__author__ = "Yusuf Karaaslan"
__license__ = "MIT"
# Expose main components for easier imports
from skill_seekers.cli import __version__ as cli_version
from skill_seekers.mcp import __version__ as mcp_version
__all__ = [
"__version__",
"__author__",
"__license__",
"cli_version",
"mcp_version",
]


@ -0,0 +1,39 @@
"""Skill Seekers CLI tools package.
This package provides command-line tools for converting documentation
websites into Claude AI skills.
Main modules:
- doc_scraper: Main documentation scraping and skill building tool
- llms_txt_detector: Detect llms.txt files at documentation URLs
- llms_txt_downloader: Download llms.txt content
- llms_txt_parser: Parse llms.txt markdown content
- pdf_scraper: Extract documentation from PDF files
- enhance_skill: AI-powered skill enhancement (API-based)
- enhance_skill_local: AI-powered skill enhancement (local)
- estimate_pages: Estimate page count before scraping
- package_skill: Package skills into .zip files
- upload_skill: Upload skills to Claude
- utils: Shared utility functions
"""
from .llms_txt_detector import LlmsTxtDetector
from .llms_txt_downloader import LlmsTxtDownloader
from .llms_txt_parser import LlmsTxtParser
try:
from .utils import open_folder, read_reference_files
except ImportError:
# utils.py might not exist in all configurations
open_folder = None
read_reference_files = None
__version__ = "2.0.0"
__all__ = [
"LlmsTxtDetector",
"LlmsTxtDownloader",
"LlmsTxtParser",
"open_folder",
"read_reference_files",
]


@ -0,0 +1,500 @@
#!/usr/bin/env python3
"""
Code Analyzer for GitHub Repositories
Extracts code signatures at configurable depth levels:
- surface: File tree only (existing behavior)
- deep: Parse files for signatures, parameters, types
- full: Complete AST analysis (future enhancement)
Supports multiple languages with language-specific parsers.
"""
import ast
import re
import logging
from typing import Dict, List, Any, Optional
from dataclasses import dataclass, asdict
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
@dataclass
class Parameter:
"""Represents a function parameter."""
name: str
type_hint: Optional[str] = None
default: Optional[str] = None
@dataclass
class FunctionSignature:
"""Represents a function/method signature."""
name: str
parameters: List[Parameter]
return_type: Optional[str] = None
docstring: Optional[str] = None
line_number: Optional[int] = None
is_async: bool = False
is_method: bool = False
decorators: Optional[List[str]] = None
def __post_init__(self):
if self.decorators is None:
self.decorators = []
@dataclass
class ClassSignature:
"""Represents a class signature."""
name: str
base_classes: List[str]
methods: List[FunctionSignature]
docstring: Optional[str] = None
line_number: Optional[int] = None
class CodeAnalyzer:
"""
Analyzes code at different depth levels.
"""
def __init__(self, depth: str = 'surface'):
"""
Initialize code analyzer.
Args:
depth: Analysis depth ('surface', 'deep', 'full')
"""
self.depth = depth
def analyze_file(self, file_path: str, content: str, language: str) -> Dict[str, Any]:
"""
Analyze a single file based on depth level.
Args:
file_path: Path to file in repository
content: File content as string
language: Programming language (Python, JavaScript, etc.)
Returns:
Dict containing extracted signatures
"""
if self.depth == 'surface':
return {} # Surface level doesn't analyze individual files
logger.debug(f"Analyzing {file_path} (language: {language}, depth: {self.depth})")
try:
if language == 'Python':
return self._analyze_python(content, file_path)
elif language in ['JavaScript', 'TypeScript']:
return self._analyze_javascript(content, file_path)
elif language in ['C', 'C++']:
return self._analyze_cpp(content, file_path)
else:
logger.debug(f"No analyzer for language: {language}")
return {}
except Exception as e:
logger.warning(f"Error analyzing {file_path}: {e}")
return {}
def _analyze_python(self, content: str, file_path: str) -> Dict[str, Any]:
"""Analyze Python file using AST."""
try:
tree = ast.parse(content)
except SyntaxError as e:
logger.debug(f"Syntax error in {file_path}: {e}")
return {}
classes = []
functions = []
for node in ast.walk(tree):
if isinstance(node, ast.ClassDef):
class_sig = self._extract_python_class(node)
classes.append(asdict(class_sig))
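elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
# Keep only functions that are not direct children of a class;
# methods are already captured with their class.
# parent.body must be checked with isinstance(..., list) because some
# nodes (e.g. ast.Lambda, ast.IfExp) have a single expression as body.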
is_method = False
try:
is_method = any(isinstance(parent, ast.ClassDef)
for parent in ast.walk(tree)
if hasattr(parent, 'body') and isinstance(parent.body, list) and node in parent.body)
except (TypeError, AttributeError):
# If body is not iterable or check fails, assume it's a top-level function
is_method = False
if not is_method:
func_sig = self._extract_python_function(node)
functions.append(asdict(func_sig))
return {
'classes': classes,
'functions': functions
}
def _extract_python_class(self, node: ast.ClassDef) -> ClassSignature:
"""Extract class signature from AST node."""
# Extract base classes
bases = []
for base in node.bases:
if isinstance(base, ast.Name):
bases.append(base.id)
elif isinstance(base, ast.Attribute):
bases.append(f"{base.value.id}.{base.attr}" if hasattr(base.value, 'id') else base.attr)
# Extract methods
methods = []
for item in node.body:
if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
method_sig = self._extract_python_function(item, is_method=True)
methods.append(method_sig)
# Extract docstring
docstring = ast.get_docstring(node)
return ClassSignature(
name=node.name,
base_classes=bases,
methods=methods,
docstring=docstring,
line_number=node.lineno
)
def _extract_python_function(self, node, is_method: bool = False) -> FunctionSignature:
"""Extract function signature from AST node."""
# Extract parameters
params = []
for arg in node.args.args:
param_type = None
if arg.annotation:
param_type = ast.unparse(arg.annotation) if hasattr(ast, 'unparse') else None
params.append(Parameter(
name=arg.arg,
type_hint=param_type
))
# Extract defaults
defaults = node.args.defaults
if defaults:
# Defaults are aligned to the end of params
num_no_default = len(params) - len(defaults)
for i, default in enumerate(defaults):
param_idx = num_no_default + i
if param_idx < len(params):
try:
params[param_idx].default = ast.unparse(default) if hasattr(ast, 'unparse') else str(default)
except Exception:
params[param_idx].default = "..."
# Extract return type
return_type = None
if node.returns:
try:
return_type = ast.unparse(node.returns) if hasattr(ast, 'unparse') else None
except Exception:
pass
# Extract decorators
decorators = []
for decorator in node.decorator_list:
try:
if hasattr(ast, 'unparse'):
decorators.append(ast.unparse(decorator))
elif isinstance(decorator, ast.Name):
decorators.append(decorator.id)
except Exception:
pass
# Extract docstring
docstring = ast.get_docstring(node)
return FunctionSignature(
name=node.name,
parameters=params,
return_type=return_type,
docstring=docstring,
line_number=node.lineno,
is_async=isinstance(node, ast.AsyncFunctionDef),
is_method=is_method,
decorators=decorators
)
def _analyze_javascript(self, content: str, file_path: str) -> Dict[str, Any]:
"""
Analyze JavaScript/TypeScript file using regex patterns.
Note: This is a simplified approach. For production, consider using
a proper JS/TS parser like esprima or ts-morph.
"""
classes = []
functions = []
# Extract class definitions
class_pattern = r'class\s+(\w+)(?:\s+extends\s+(\w+))?\s*\{'
for match in re.finditer(class_pattern, content):
class_name = match.group(1)
base_class = match.group(2) if match.group(2) else None
# Try to extract methods (simplified)
class_block_start = match.end()
# This is a simplification - proper parsing would track braces
class_block_end = content.find('}', class_block_start)
if class_block_end != -1:
class_body = content[class_block_start:class_block_end]
methods = self._extract_js_methods(class_body)
else:
methods = []
classes.append({
'name': class_name,
'base_classes': [base_class] if base_class else [],
'methods': methods,
'docstring': None,
'line_number': content[:match.start()].count('\n') + 1
})
# Extract top-level functions
func_pattern = r'(?:async\s+)?function\s+(\w+)\s*\(([^)]*)\)'
for match in re.finditer(func_pattern, content):
func_name = match.group(1)
params_str = match.group(2)
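# Anchor the check so names that merely contain "async" are not misclassified
is_async = re.match(r'async\s', match.group(0)) is not None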
params = self._parse_js_parameters(params_str)
functions.append({
'name': func_name,
'parameters': params,
'return_type': None, # JS doesn't have type annotations (unless TS)
'docstring': None,
'line_number': content[:match.start()].count('\n') + 1,
'is_async': is_async,
'is_method': False,
'decorators': []
})
# Extract arrow functions assigned to const/let
arrow_pattern = r'(?:const|let|var)\s+(\w+)\s*=\s*(?:async\s+)?\(([^)]*)\)\s*=>'
for match in re.finditer(arrow_pattern, content):
func_name = match.group(1)
params_str = match.group(2)
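# Look for the async keyword right after '=' rather than anywhere in the match
is_async = re.match(r'(?:const|let|var)\s+\w+\s*=\s*async\b', match.group(0)) is not None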
params = self._parse_js_parameters(params_str)
functions.append({
'name': func_name,
'parameters': params,
'return_type': None,
'docstring': None,
'line_number': content[:match.start()].count('\n') + 1,
'is_async': is_async,
'is_method': False,
'decorators': []
})
return {
'classes': classes,
'functions': functions
}
def _extract_js_methods(self, class_body: str) -> List[Dict]:
"""Extract method signatures from class body."""
methods = []
# Match method definitions
method_pattern = r'(?:async\s+)?(\w+)\s*\(([^)]*)\)'
for match in re.finditer(method_pattern, class_body):
method_name = match.group(1)
params_str = match.group(2)
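# Anchor the check so names that merely contain "async" are not misclassified
is_async = re.match(r'async\s', match.group(0)) is not None
# Skip control-flow keywords that the pattern would mistake for method names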
if method_name in ['if', 'for', 'while', 'switch']:
continue
params = self._parse_js_parameters(params_str)
methods.append({
'name': method_name,
'parameters': params,
'return_type': None,
'docstring': None,
'line_number': None,
'is_async': is_async,
'is_method': True,
'decorators': []
})
return methods
def _parse_js_parameters(self, params_str: str) -> List[Dict]:
"""Parse JavaScript parameter string."""
params = []
if not params_str.strip():
return params
# Split by comma (simplified - doesn't handle complex default values)
param_list = [p.strip() for p in params_str.split(',')]
for param in param_list:
if not param:
continue
# Check for default value
if '=' in param:
name, default = param.split('=', 1)
name = name.strip()
default = default.strip()
else:
name = param
default = None
# Check for type annotation (TypeScript)
type_hint = None
if ':' in name:
name, type_hint = name.split(':', 1)
name = name.strip()
type_hint = type_hint.strip()
params.append({
'name': name,
'type_hint': type_hint,
'default': default
})
return params
def _analyze_cpp(self, content: str, file_path: str) -> Dict[str, Any]:
"""
Analyze C/C++ header file using regex patterns.
Note: This is a simplified approach focusing on header files.
For production, consider using libclang or similar.
"""
classes = []
functions = []
# Extract class definitions (simplified - doesn't handle nested classes)
class_pattern = r'class\s+(\w+)(?:\s*:\s*public\s+(\w+))?\s*\{'
for match in re.finditer(class_pattern, content):
class_name = match.group(1)
base_class = match.group(2) if match.group(2) else None
classes.append({
'name': class_name,
'base_classes': [base_class] if base_class else [],
'methods': [], # Simplified - would need to parse class body
'docstring': None,
'line_number': content[:match.start()].count('\n') + 1
})
# Extract function declarations
func_pattern = r'(\w+(?:\s*\*|\s*&)?)\s+(\w+)\s*\(([^)]*)\)'
for match in re.finditer(func_pattern, content):
return_type = match.group(1).strip()
func_name = match.group(2)
params_str = match.group(3)
# Skip common keywords
if func_name in ['if', 'for', 'while', 'switch', 'return']:
continue
params = self._parse_cpp_parameters(params_str)
functions.append({
'name': func_name,
'parameters': params,
'return_type': return_type,
'docstring': None,
'line_number': content[:match.start()].count('\n') + 1,
'is_async': False,
'is_method': False,
'decorators': []
})
return {
'classes': classes,
'functions': functions
}
def _parse_cpp_parameters(self, params_str: str) -> List[Dict]:
"""Parse C++ parameter string."""
params = []
if not params_str.strip() or params_str.strip() == 'void':
return params
# Split by comma (simplified)
param_list = [p.strip() for p in params_str.split(',')]
for param in param_list:
if not param:
continue
# Check for default value
default = None
if '=' in param:
param, default = param.rsplit('=', 1)
param = param.strip()
default = default.strip()
# Extract type and name (simplified)
# Format: "type name" or "type* name" or "type& name"
parts = param.split()
if len(parts) >= 2:
param_type = ' '.join(parts[:-1])
param_name = parts[-1]
else:
param_type = param
param_name = "unknown"
params.append({
'name': param_name,
'type_hint': param_type,
'default': default
})
return params
if __name__ == '__main__':
# Test the analyzer
python_code = '''
class Node2D:
"""Base class for 2D nodes."""
def move_local_x(self, delta: float, snap: bool = False) -> None:
"""Move node along local X axis."""
pass
async def tween_position(self, target: tuple, duration: float = 1.0):
"""Animate position to target."""
pass
def create_sprite(texture: str) -> Node2D:
"""Create a new sprite node."""
return Node2D()
'''
analyzer = CodeAnalyzer(depth='deep')
result = analyzer.analyze_file('test.py', python_code, 'Python')
print("Analysis Result:")
print(f"Classes: {len(result.get('classes', []))}")
print(f"Functions: {len(result.get('functions', []))}")
if result.get('classes'):
cls = result['classes'][0]
print(f"\nClass: {cls['name']}")
print(f" Methods: {len(cls['methods'])}")
for method in cls['methods']:
params = ', '.join([f"{p['name']}: {p['type_hint']}" + (f" = {p['default']}" if p.get('default') else "")
for p in method['parameters']])
print(f" {method['name']}({params}) -> {method['return_type']}")


@ -0,0 +1,376 @@
#!/usr/bin/env python3
"""
Unified Config Validator
Validates unified config format that supports multiple sources:
- documentation (website scraping)
- github (repository scraping)
- pdf (PDF document scraping)
Also provides backward compatibility detection for legacy configs.
"""
import json
import logging
from typing import Dict, Any, List, Optional, Union
from pathlib import Path
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class ConfigValidator:
"""
Validates unified config format and provides backward compatibility.
"""
# Valid source types
VALID_SOURCE_TYPES = {'documentation', 'github', 'pdf'}
# Valid merge modes
VALID_MERGE_MODES = {'rule-based', 'claude-enhanced'}
# Valid code analysis depth levels
VALID_DEPTH_LEVELS = {'surface', 'deep', 'full'}
def __init__(self, config_or_path: Union[Dict[str, Any], str]):
"""
Initialize validator with config dict or file path.
Args:
config_or_path: Either a config dict or path to config JSON file
"""
if isinstance(config_or_path, dict):
self.config_path = None
self.config = config_or_path
else:
self.config_path = config_or_path
self.config = self._load_config()
self.is_unified = self._detect_format()
def _load_config(self) -> Dict[str, Any]:
"""Load JSON config file."""
try:
with open(self.config_path, 'r', encoding='utf-8') as f:
return json.load(f)
except FileNotFoundError:
raise ValueError(f"Config file not found: {self.config_path}")
except json.JSONDecodeError as e:
raise ValueError(f"Invalid JSON in config file: {e}") from e
def _detect_format(self) -> bool:
"""
Detect if config is unified format or legacy.
Returns:
True if unified format (has 'sources' array)
False if legacy format
"""
return 'sources' in self.config and isinstance(self.config['sources'], list)
def validate(self) -> bool:
"""
Validate config based on detected format.
Returns:
True if valid
Raises:
ValueError if invalid with detailed error message
"""
if self.is_unified:
return self._validate_unified()
else:
return self._validate_legacy()
def _validate_unified(self) -> bool:
"""Validate unified config format."""
logger.info("Validating unified config format...")
# Required top-level fields
if 'name' not in self.config:
raise ValueError("Missing required field: 'name'")
if 'description' not in self.config:
raise ValueError("Missing required field: 'description'")
if 'sources' not in self.config:
raise ValueError("Missing required field: 'sources'")
# Validate sources array
sources = self.config['sources']
if not isinstance(sources, list):
raise ValueError("'sources' must be an array")
if len(sources) == 0:
raise ValueError("'sources' array cannot be empty")
# Validate merge_mode (optional)
merge_mode = self.config.get('merge_mode', 'rule-based')
if merge_mode not in self.VALID_MERGE_MODES:
raise ValueError(f"Invalid merge_mode: '{merge_mode}'. Must be one of {self.VALID_MERGE_MODES}")
# Validate each source
for i, source in enumerate(sources):
self._validate_source(source, i)
logger.info(f"✅ Unified config valid: {len(sources)} sources")
return True
def _validate_source(self, source: Dict[str, Any], index: int):
"""Validate individual source configuration."""
# Check source has 'type' field
if 'type' not in source:
raise ValueError(f"Source {index}: Missing required field 'type'")
source_type = source['type']
if source_type not in self.VALID_SOURCE_TYPES:
raise ValueError(
f"Source {index}: Invalid type '{source_type}'. "
f"Must be one of {self.VALID_SOURCE_TYPES}"
)
# Type-specific validation
if source_type == 'documentation':
self._validate_documentation_source(source, index)
elif source_type == 'github':
self._validate_github_source(source, index)
elif source_type == 'pdf':
self._validate_pdf_source(source, index)
def _validate_documentation_source(self, source: Dict[str, Any], index: int):
"""Validate documentation source configuration."""
if 'base_url' not in source:
raise ValueError(f"Source {index} (documentation): Missing required field 'base_url'")
# Optional but recommended fields
if 'selectors' not in source:
logger.warning(f"Source {index} (documentation): No 'selectors' specified, using defaults")
if 'max_pages' in source and not isinstance(source['max_pages'], int):
raise ValueError(f"Source {index} (documentation): 'max_pages' must be an integer")
    def _validate_github_source(self, source: Dict[str, Any], index: int):
        """Validate GitHub source configuration."""
        if 'repo' not in source:
            raise ValueError(f"Source {index} (github): Missing required field 'repo'")
        # Validate repo format (owner/repo). A non-string value is rejected here
        # too, so callers see ValueError rather than a TypeError from the 'in' test.
        repo = source['repo']
        if not isinstance(repo, str) or '/' not in repo:
            raise ValueError(
                f"Source {index} (github): Invalid repo format '{repo}'. "
                f"Must be 'owner/repo' (e.g., 'facebook/react')"
            )
        # Validate code_analysis_depth if specified
        if 'code_analysis_depth' in source:
            depth = source['code_analysis_depth']
            if depth not in self.VALID_DEPTH_LEVELS:
                raise ValueError(
                    f"Source {index} (github): Invalid code_analysis_depth '{depth}'. "
                    f"Must be one of {self.VALID_DEPTH_LEVELS}"
                )
        # Validate max_issues if specified
        if 'max_issues' in source and not isinstance(source['max_issues'], int):
            raise ValueError(f"Source {index} (github): 'max_issues' must be an integer")
def _validate_pdf_source(self, source: Dict[str, Any], index: int):
"""Validate PDF source configuration."""
if 'path' not in source:
raise ValueError(f"Source {index} (pdf): Missing required field 'path'")
# Check if file exists
pdf_path = source['path']
if not Path(pdf_path).exists():
logger.warning(f"Source {index} (pdf): File not found: {pdf_path}")
    def _validate_legacy(self) -> bool:
        """
        Validate legacy config format (backward compatibility).
        Legacy configs are the old format used by doc_scraper, github_scraper, pdf_scraper.
        """
        logger.info("Detected legacy config format (backward compatible)")
        # Detect which legacy type based on fields
        if 'base_url' in self.config:
            logger.info("Legacy type: documentation")
        elif 'repo' in self.config:
            logger.info("Legacy type: github")
        elif 'pdf' in self.config or 'path' in self.config:
            logger.info("Legacy type: pdf")
        else:
            # The message lists every key the checks above accept
            raise ValueError("Cannot detect legacy config type (missing base_url, repo, pdf, or path)")
        return True
def convert_legacy_to_unified(self) -> Dict[str, Any]:
"""
Convert legacy config to unified format.
Returns:
Unified config dict
"""
if self.is_unified:
logger.info("Config already in unified format")
return self.config
logger.info("Converting legacy config to unified format...")
# Detect legacy type and convert
if 'base_url' in self.config:
return self._convert_legacy_documentation()
elif 'repo' in self.config:
return self._convert_legacy_github()
elif 'pdf' in self.config or 'path' in self.config:
return self._convert_legacy_pdf()
else:
raise ValueError("Cannot convert: unknown legacy format")
def _convert_legacy_documentation(self) -> Dict[str, Any]:
"""Convert legacy documentation config to unified."""
unified = {
'name': self.config.get('name', 'unnamed'),
'description': self.config.get('description', 'Documentation skill'),
'merge_mode': 'rule-based',
'sources': [
{
'type': 'documentation',
**{k: v for k, v in self.config.items()
if k not in ['name', 'description']}
}
]
}
return unified
def _convert_legacy_github(self) -> Dict[str, Any]:
"""Convert legacy GitHub config to unified."""
unified = {
'name': self.config.get('name', 'unnamed'),
'description': self.config.get('description', 'GitHub repository skill'),
'merge_mode': 'rule-based',
'sources': [
{
'type': 'github',
**{k: v for k, v in self.config.items()
if k not in ['name', 'description']}
}
]
}
return unified
def _convert_legacy_pdf(self) -> Dict[str, Any]:
"""Convert legacy PDF config to unified."""
unified = {
'name': self.config.get('name', 'unnamed'),
'description': self.config.get('description', 'PDF document skill'),
'merge_mode': 'rule-based',
'sources': [
{
'type': 'pdf',
**{k: v for k, v in self.config.items()
if k not in ['name', 'description']}
}
]
}
return unified
def get_sources_by_type(self, source_type: str) -> List[Dict[str, Any]]:
"""
Get all sources of a specific type.
Args:
source_type: 'documentation', 'github', or 'pdf'
Returns:
List of sources matching the type
"""
if not self.is_unified:
# For legacy, convert and get sources
unified = self.convert_legacy_to_unified()
sources = unified['sources']
else:
sources = self.config['sources']
return [s for s in sources if s.get('type') == source_type]
def has_multiple_sources(self) -> bool:
"""Check if config has multiple sources (requires merging)."""
if not self.is_unified:
return False
return len(self.config['sources']) > 1
def needs_api_merge(self) -> bool:
"""
Check if config needs API merging.
Returns True if both documentation and github sources exist
with API extraction enabled.
"""
if not self.has_multiple_sources():
return False
has_docs_api = any(
s.get('type') == 'documentation' and s.get('extract_api', True)
for s in self.config['sources']
)
has_github_code = any(
s.get('type') == 'github' and s.get('include_code', False)
for s in self.config['sources']
)
return has_docs_api and has_github_code
def validate_config(config_path: str) -> ConfigValidator:
"""
Validate config file and return validator instance.
Args:
config_path: Path to config JSON file
Returns:
ConfigValidator instance
Raises:
ValueError if config is invalid
"""
validator = ConfigValidator(config_path)
validator.validate()
return validator
if __name__ == '__main__':
import sys
if len(sys.argv) < 2:
print("Usage: python config_validator.py <config.json>")
sys.exit(1)
config_file = sys.argv[1]
try:
validator = validate_config(config_file)
print(f"\n✅ Config valid!")
print(f" Format: {'Unified' if validator.is_unified else 'Legacy'}")
print(f" Name: {validator.config.get('name')}")
if validator.is_unified:
sources = validator.config['sources']
print(f" Sources: {len(sources)}")
for i, source in enumerate(sources):
print(f" {i+1}. {source['type']}")
if validator.needs_api_merge():
merge_mode = validator.config.get('merge_mode', 'rule-based')
print(f" ⚠️ API merge required (mode: {merge_mode})")
except ValueError as e:
print(f"\n❌ Config invalid: {e}")
sys.exit(1)
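
The legacy-to-unified conversion in this file can be illustrated in isolation. The following is a minimal sketch rather than the module's own code: `detect_legacy_type` and `to_unified` are hypothetical standalone helpers that follow the same detection order as `convert_legacy_to_unified` (`base_url`, then `repo`, then `pdf` or `path`), and the sample config values are invented for illustration.

```python
from typing import Any, Dict

# Default descriptions copied from the three _convert_legacy_* methods.
DEFAULT_DESCRIPTIONS = {
    'documentation': 'Documentation skill',
    'github': 'GitHub repository skill',
    'pdf': 'PDF document skill',
}


def detect_legacy_type(config: Dict[str, Any]) -> str:
    """Hypothetical helper: pick the legacy type from the keys present."""
    if 'base_url' in config:
        return 'documentation'
    if 'repo' in config:
        return 'github'
    if 'pdf' in config or 'path' in config:
        return 'pdf'
    raise ValueError("Cannot convert: unknown legacy format")


def to_unified(config: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: wrap a legacy config in a one-element 'sources' list."""
    # A config that already has a 'sources' list is treated as unified.
    if isinstance(config.get('sources'), list):
        return config
    source_type = detect_legacy_type(config)
    source = {'type': source_type}
    # Everything except name/description moves into the single source entry.
    source.update({k: v for k, v in config.items() if k not in ('name', 'description')})
    return {
        'name': config.get('name', 'unnamed'),
        'description': config.get('description', DEFAULT_DESCRIPTIONS[source_type]),
        'merge_mode': 'rule-based',
        'sources': [source],
    }


if __name__ == '__main__':
    # Invented sample values, for illustration only.
    legacy = {'name': 'react', 'base_url': 'https://react.dev/', 'max_pages': 100}
    print(to_unified(legacy))
```

Because a converted legacy config always has exactly one source, it never triggers the multi-source merge path.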


@ -0,0 +1,513 @@
#!/usr/bin/env python3
"""
Conflict Detector for Multi-Source Skills
Detects conflicts between documentation and code:
- missing_in_docs: API exists in code but not documented
- missing_in_code: API documented but doesn't exist in code
- signature_mismatch: Different parameters/types between docs and code
- description_mismatch: Docs say one thing, code comments say another
  (the type is defined, but detect_all_conflicts() does not currently emit it)
Used by unified scraper to identify discrepancies before merging.
"""
import json
import logging
from typing import Dict, List, Any, Optional, Tuple
from dataclasses import dataclass, asdict
from difflib import SequenceMatcher
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
@dataclass
class Conflict:
"""Represents a conflict between documentation and code."""
type: str # 'missing_in_docs', 'missing_in_code', 'signature_mismatch', 'description_mismatch'
severity: str # 'low', 'medium', 'high'
api_name: str
docs_info: Optional[Dict[str, Any]] = None
code_info: Optional[Dict[str, Any]] = None
difference: Optional[str] = None
suggestion: Optional[str] = None
class ConflictDetector:
"""
Detects conflicts between documentation and code sources.
"""
def __init__(self, docs_data: Dict[str, Any], github_data: Dict[str, Any]):
"""
Initialize conflict detector.
Args:
docs_data: Data from documentation scraper
github_data: Data from GitHub scraper with code analysis
"""
self.docs_data = docs_data
self.github_data = github_data
# Extract API information from both sources
self.docs_apis = self._extract_docs_apis()
self.code_apis = self._extract_code_apis()
logger.info(f"Loaded {len(self.docs_apis)} APIs from documentation")
logger.info(f"Loaded {len(self.code_apis)} APIs from code")
def _extract_docs_apis(self) -> Dict[str, Dict[str, Any]]:
"""
Extract API information from documentation data.
Returns:
Dict mapping API name to API info
"""
apis = {}
# Documentation structure varies, but typically has 'pages' or 'references'
pages = self.docs_data.get('pages', {})
# Handle both dict and list formats
if isinstance(pages, dict):
# Format: {url: page_data, ...}
for url, page_data in pages.items():
content = page_data.get('content', '')
title = page_data.get('title', '')
# Simple heuristic: if title or URL contains "api", "reference", "class", "function"
# it might be an API page
if any(keyword in title.lower() or keyword in url.lower()
for keyword in ['api', 'reference', 'class', 'function', 'method']):
# Extract API signatures from content (simplified)
extracted_apis = self._parse_doc_content_for_apis(content, url)
apis.update(extracted_apis)
elif isinstance(pages, list):
# Format: [{url: '...', apis: [...]}, ...]
for page in pages:
url = page.get('url', '')
page_apis = page.get('apis', [])
# If APIs are already extracted in the page data
for api in page_apis:
api_name = api.get('name', '')
if api_name:
apis[api_name] = {
'parameters': api.get('parameters', []),
'return_type': api.get('return_type', 'Any'),
'source_url': url
}
return apis
    def _parse_doc_content_for_apis(self, content: str, source_url: str) -> Dict[str, Dict]:
        """
        Parse documentation content to extract API signatures.
        This is a simplified approach - real implementation would need
        to understand the documentation format (Sphinx, JSDoc, etc.)
        """
        apis = {}
        # Look for function/method signatures in code blocks
        # Common patterns:
        # - function_name(param1, param2)
        # - ClassName.method_name(param1, param2)
        # - def function_name(param1: type, param2: type) -> return_type
        import re
        # Pattern for common API signatures
        patterns = [
            # Python style: def name(params) -> return
            r'def\s+(\w+)\s*\(([^)]*)\)(?:\s*->\s*(\w+))?',
            # JavaScript style: function name(params)
            r'function\s+(\w+)\s*\(([^)]*)\)',
            # C++ style: return_type name(params)
            r'(\w+)\s+(\w+)\s*\(([^)]*)\)',
            # Method style: ClassName.method_name(params)
            r'(\w+)\.(\w+)\s*\(([^)]*)\)'
        ]
        for pattern in patterns:
            for match in re.finditer(pattern, content):
                groups = match.groups()
                # Parse based on pattern matched
                if 'def' in pattern:
                    # Python function
                    name = groups[0]
                    params_str = groups[1]
                    return_type = groups[2] if len(groups) > 2 else None
                elif 'function' in pattern:
                    # JavaScript function
                    name = groups[0]
                    params_str = groups[1]
                    return_type = None
                elif '.' in pattern:
                    # Class method
                    class_name = groups[0]
                    method_name = groups[1]
                    name = f"{class_name}.{method_name}"
                    params_str = groups[2] if len(groups) > 2 else groups[1]
                    return_type = None
                else:
                    # C++ function
                    return_type = groups[0]
                    name = groups[1]
                    params_str = groups[2]
                # The generic C++-style pattern also matches "def name(" and
                # "function name(", which would overwrite the earlier entry with
                # return_type 'def' or 'function'. Keep the first entry per name.
                if name in apis:
                    continue
                # Parse parameters
                params = self._parse_param_string(params_str)
                apis[name] = {
                    'name': name,
                    'parameters': params,
                    'return_type': return_type,
                    'source': source_url,
                    'raw_signature': match.group(0)
                }
        return apis
def _parse_param_string(self, params_str: str) -> List[Dict]:
"""Parse parameter string into list of parameter dicts."""
if not params_str.strip():
return []
params = []
for param in params_str.split(','):
param = param.strip()
if not param:
continue
# Try to extract name and type
param_info = {'name': param, 'type': None, 'default': None}
# Check for type annotation (: type)
if ':' in param:
parts = param.split(':', 1)
param_info['name'] = parts[0].strip()
type_part = parts[1].strip()
# Check for default value (= value)
if '=' in type_part:
type_str, default_str = type_part.split('=', 1)
param_info['type'] = type_str.strip()
param_info['default'] = default_str.strip()
else:
param_info['type'] = type_part
# Check for default without type (= value)
elif '=' in param:
parts = param.split('=', 1)
param_info['name'] = parts[0].strip()
param_info['default'] = parts[1].strip()
params.append(param_info)
return params
def _extract_code_apis(self) -> Dict[str, Dict[str, Any]]:
"""
Extract API information from GitHub code analysis.
Returns:
Dict mapping API name to API info
"""
apis = {}
code_analysis = self.github_data.get('code_analysis', {})
if not code_analysis:
return apis
# Support both 'files' and 'analyzed_files' keys
files = code_analysis.get('files', code_analysis.get('analyzed_files', []))
for file_info in files:
file_path = file_info.get('file', 'unknown')
# Extract classes and their methods
for class_info in file_info.get('classes', []):
class_name = class_info['name']
# Add class itself
apis[class_name] = {
'name': class_name,
'type': 'class',
'source': file_path,
'line': class_info.get('line_number'),
'base_classes': class_info.get('base_classes', []),
'docstring': class_info.get('docstring')
}
# Add methods
for method in class_info.get('methods', []):
method_name = f"{class_name}.{method['name']}"
apis[method_name] = {
'name': method_name,
'type': 'method',
'parameters': method.get('parameters', []),
'return_type': method.get('return_type'),
'source': file_path,
'line': method.get('line_number'),
'docstring': method.get('docstring'),
'is_async': method.get('is_async', False)
}
# Extract standalone functions
for func_info in file_info.get('functions', []):
func_name = func_info['name']
apis[func_name] = {
'name': func_name,
'type': 'function',
'parameters': func_info.get('parameters', []),
'return_type': func_info.get('return_type'),
'source': file_path,
'line': func_info.get('line_number'),
'docstring': func_info.get('docstring'),
'is_async': func_info.get('is_async', False)
}
return apis
def detect_all_conflicts(self) -> List[Conflict]:
"""
Detect all types of conflicts.
Returns:
List of Conflict objects
"""
logger.info("Detecting conflicts between documentation and code...")
conflicts = []
# 1. Find APIs missing in documentation
conflicts.extend(self._find_missing_in_docs())
# 2. Find APIs missing in code
conflicts.extend(self._find_missing_in_code())
# 3. Find signature mismatches
conflicts.extend(self._find_signature_mismatches())
logger.info(f"Found {len(conflicts)} conflicts total")
return conflicts
def _find_missing_in_docs(self) -> List[Conflict]:
"""Find APIs that exist in code but not in documentation."""
conflicts = []
for api_name, code_info in self.code_apis.items():
# Simple name matching (can be enhanced with fuzzy matching)
if api_name not in self.docs_apis:
# Check if it's a private/internal API (often not documented)
is_private = api_name.startswith('_') or '__' in api_name
severity = 'low' if is_private else 'medium'
conflicts.append(Conflict(
type='missing_in_docs',
severity=severity,
api_name=api_name,
code_info=code_info,
difference=f"API exists in code ({code_info['source']}) but not found in documentation",
suggestion="Add documentation for this API" if not is_private else "Consider if this internal API should be documented"
))
logger.info(f"Found {len(conflicts)} APIs missing in documentation")
return conflicts
    def _find_missing_in_code(self) -> List[Conflict]:
        """Find APIs that are documented but don't exist in code."""
        conflicts = []
        for api_name, docs_info in self.docs_apis.items():
            if api_name not in self.code_apis:
                # Parsed pages store their origin under 'source'; pre-extracted
                # page APIs (list format) store it under 'source_url'.
                source = docs_info.get('source') or docs_info.get('source_url', 'unknown')
                conflicts.append(Conflict(
                    type='missing_in_code',
                    severity='high',  # This is serious - documented but doesn't exist
                    api_name=api_name,
                    docs_info=docs_info,
                    difference=f"API documented ({source}) but not found in code",
                    suggestion="Update documentation to remove this API, or add it to codebase"
                ))
        logger.info(f"Found {len(conflicts)} APIs missing in code")
        return conflicts
def _find_signature_mismatches(self) -> List[Conflict]:
"""Find APIs where signature differs between docs and code."""
conflicts = []
# Find APIs that exist in both
common_apis = set(self.docs_apis.keys()) & set(self.code_apis.keys())
for api_name in common_apis:
docs_info = self.docs_apis[api_name]
code_info = self.code_apis[api_name]
# Compare signatures
mismatch = self._compare_signatures(docs_info, code_info)
if mismatch:
conflicts.append(Conflict(
type='signature_mismatch',
severity=mismatch['severity'],
api_name=api_name,
docs_info=docs_info,
code_info=code_info,
difference=mismatch['difference'],
suggestion=mismatch['suggestion']
))
logger.info(f"Found {len(conflicts)} signature mismatches")
return conflicts
def _compare_signatures(self, docs_info: Dict, code_info: Dict) -> Optional[Dict]:
"""
Compare signatures between docs and code.
Returns:
Dict with mismatch details if conflict found, None otherwise
"""
docs_params = docs_info.get('parameters', [])
code_params = code_info.get('parameters', [])
# Compare parameter counts
if len(docs_params) != len(code_params):
return {
'severity': 'medium',
'difference': f"Parameter count mismatch: docs has {len(docs_params)}, code has {len(code_params)}",
'suggestion': f"Documentation shows {len(docs_params)} parameters, but code has {len(code_params)}"
}
# Compare parameter names and types
for i, (doc_param, code_param) in enumerate(zip(docs_params, code_params)):
doc_name = doc_param.get('name', '')
code_name = code_param.get('name', '')
# Parameter name mismatch
if doc_name != code_name:
# Use fuzzy matching for slight variations
similarity = SequenceMatcher(None, doc_name, code_name).ratio()
if similarity < 0.8: # Not similar enough
return {
'severity': 'medium',
'difference': f"Parameter {i+1} name mismatch: '{doc_name}' in docs vs '{code_name}' in code",
'suggestion': f"Update documentation to use parameter name '{code_name}'"
}
# Type mismatch
doc_type = doc_param.get('type')
code_type = code_param.get('type_hint')
if doc_type and code_type and doc_type != code_type:
return {
'severity': 'low',
'difference': f"Parameter '{doc_name}' type mismatch: '{doc_type}' in docs vs '{code_type}' in code",
'suggestion': f"Verify correct type for parameter '{doc_name}'"
}
# Compare return types if both have them
docs_return = docs_info.get('return_type')
code_return = code_info.get('return_type')
if docs_return and code_return and docs_return != code_return:
return {
'severity': 'low',
'difference': f"Return type mismatch: '{docs_return}' in docs vs '{code_return}' in code",
'suggestion': "Verify correct return type"
}
return None
def generate_summary(self, conflicts: List[Conflict]) -> Dict[str, Any]:
"""
Generate summary statistics for conflicts.
Args:
conflicts: List of Conflict objects
Returns:
Summary dict with statistics
"""
summary = {
'total': len(conflicts),
'by_type': {},
'by_severity': {},
'apis_affected': len(set(c.api_name for c in conflicts))
}
# Count by type
for conflict_type in ['missing_in_docs', 'missing_in_code', 'signature_mismatch', 'description_mismatch']:
count = sum(1 for c in conflicts if c.type == conflict_type)
summary['by_type'][conflict_type] = count
# Count by severity
for severity in ['low', 'medium', 'high']:
count = sum(1 for c in conflicts if c.severity == severity)
summary['by_severity'][severity] = count
return summary
def save_conflicts(self, conflicts: List[Conflict], output_path: str):
"""
Save conflicts to JSON file.
Args:
conflicts: List of Conflict objects
output_path: Path to output JSON file
"""
data = {
'conflicts': [asdict(c) for c in conflicts],
'summary': self.generate_summary(conflicts)
}
with open(output_path, 'w', encoding='utf-8') as f:
json.dump(data, f, indent=2, ensure_ascii=False)
logger.info(f"Conflicts saved to: {output_path}")
if __name__ == '__main__':
import sys
if len(sys.argv) < 3:
print("Usage: python conflict_detector.py <docs_data.json> <github_data.json>")
sys.exit(1)
docs_file = sys.argv[1]
github_file = sys.argv[2]
# Load data
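    # Read as UTF-8 to match the encoding save_conflicts() writes with
    with open(docs_file, 'r', encoding='utf-8') as f:
        docs_data = json.load(f)
    with open(github_file, 'r', encoding='utf-8') as f:
        github_data = json.load(f)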
# Detect conflicts
detector = ConflictDetector(docs_data, github_data)
conflicts = detector.detect_all_conflicts()
# Print summary
summary = detector.generate_summary(conflicts)
print("\n📊 Conflict Summary:")
print(f" Total conflicts: {summary['total']}")
print(f" APIs affected: {summary['apis_affected']}")
print("\n By Type:")
for conflict_type, count in summary['by_type'].items():
if count > 0:
print(f" {conflict_type}: {count}")
print("\n By Severity:")
for severity, count in summary['by_severity'].items():
if count > 0:
emoji = '🔴' if severity == 'high' else '🟡' if severity == 'medium' else '🟢'
print(f" {emoji} {severity}: {count}")
# Save to file
output_file = 'conflicts.json'
detector.save_conflicts(conflicts, output_file)
print(f"\n✅ Full report saved to: {output_file}")
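
The parameter-name comparison in `_compare_signatures` relies on `difflib.SequenceMatcher` with a 0.8 similarity cutoff. As a minimal sketch of that check on its own, the helper below reports whether two names would be treated as the same parameter; `names_match` is a hypothetical name introduced here, and the sample parameter names are invented.

```python
from difflib import SequenceMatcher


def names_match(doc_name: str, code_name: str, threshold: float = 0.8) -> bool:
    """Return True when two parameter names are equal or similar enough.

    Mirrors the rule in _compare_signatures: a mismatch is only reported
    when the similarity ratio falls below the threshold.
    """
    if doc_name == code_name:
        return True
    similarity = SequenceMatcher(None, doc_name, code_name).ratio()
    return similarity >= threshold


if __name__ == '__main__':
    # Invented parameter names, for illustration only.
    for doc_name, code_name in [('timeout', 'time_out'), ('url', 'endpoint')]:
        print(doc_name, code_name, names_match(doc_name, code_name))
```

A cutoff this high tolerates small spelling differences such as an added underscore while still flagging parameters that were renamed outright.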


@ -0,0 +1,72 @@
"""Configuration constants for Skill Seekers CLI.
This module centralizes all magic numbers and configuration values used
across the CLI tools to improve maintainability and clarity.
"""
# ===== SCRAPING CONFIGURATION =====
# Default scraping limits
DEFAULT_RATE_LIMIT = 0.5 # seconds between requests
DEFAULT_MAX_PAGES = 500 # maximum pages to scrape
DEFAULT_CHECKPOINT_INTERVAL = 1000 # pages between checkpoints
DEFAULT_ASYNC_MODE = False # use async mode for parallel scraping (opt-in)
# Content analysis limits
CONTENT_PREVIEW_LENGTH = 500 # characters to check for categorization
MAX_PAGES_WARNING_THRESHOLD = 10000 # warn if config exceeds this
# Quality thresholds
MIN_CATEGORIZATION_SCORE = 2 # minimum score for category assignment
URL_MATCH_POINTS = 3 # points for URL keyword match
TITLE_MATCH_POINTS = 2 # points for title keyword match
CONTENT_MATCH_POINTS = 1 # points for content keyword match
# ===== ENHANCEMENT CONFIGURATION =====
# API-based enhancement limits (uses Anthropic API)
API_CONTENT_LIMIT = 100000 # max characters for API enhancement
API_PREVIEW_LIMIT = 40000 # max characters for preview
# Local enhancement limits (uses Claude Code Max)
LOCAL_CONTENT_LIMIT = 50000 # max characters for local enhancement
LOCAL_PREVIEW_LIMIT = 20000 # max characters for preview
# ===== PAGE ESTIMATION =====
# Estimation and discovery settings
DEFAULT_MAX_DISCOVERY = 1000 # default max pages to discover
DISCOVERY_THRESHOLD = 10000 # threshold for warnings
# ===== FILE LIMITS =====
# Output and processing limits
MAX_REFERENCE_FILES = 100 # maximum reference files per skill
MAX_CODE_BLOCKS_PER_PAGE = 5 # maximum code blocks to extract per page
# ===== EXPORT CONSTANTS =====
__all__ = [
# Scraping
'DEFAULT_RATE_LIMIT',
'DEFAULT_MAX_PAGES',
'DEFAULT_CHECKPOINT_INTERVAL',
'DEFAULT_ASYNC_MODE',
'CONTENT_PREVIEW_LENGTH',
'MAX_PAGES_WARNING_THRESHOLD',
'MIN_CATEGORIZATION_SCORE',
'URL_MATCH_POINTS',
'TITLE_MATCH_POINTS',
'CONTENT_MATCH_POINTS',
# Enhancement
'API_CONTENT_LIMIT',
'API_PREVIEW_LIMIT',
'LOCAL_CONTENT_LIMIT',
'LOCAL_PREVIEW_LIMIT',
# Estimation
'DEFAULT_MAX_DISCOVERY',
'DISCOVERY_THRESHOLD',
# Limits
'MAX_REFERENCE_FILES',
'MAX_CODE_BLOCKS_PER_PAGE',
]
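
The scoring constants above suggest a weighted keyword match, but the scraper that consumes them is not shown in this diff. The sketch below is therefore an assumption about how they might combine, not the project's actual logic: `category_score` and `is_categorized` are hypothetical functions, and the constant values are copied from this module.

```python
# Values copied from the constants module above.
URL_MATCH_POINTS = 3
TITLE_MATCH_POINTS = 2
CONTENT_MATCH_POINTS = 1
MIN_CATEGORIZATION_SCORE = 2
CONTENT_PREVIEW_LENGTH = 500


def category_score(url: str, title: str, content: str, keywords: list) -> int:
    """Hypothetical scoring: add points for each keyword found in each field."""
    # Assumption: only the first CONTENT_PREVIEW_LENGTH characters are inspected.
    preview = content[:CONTENT_PREVIEW_LENGTH].lower()
    score = 0
    for keyword in keywords:
        kw = keyword.lower()
        if kw in url.lower():
            score += URL_MATCH_POINTS
        if kw in title.lower():
            score += TITLE_MATCH_POINTS
        if kw in preview:
            score += CONTENT_MATCH_POINTS
    return score


def is_categorized(score: int) -> bool:
    """Hypothetical rule: a page joins a category at or above the minimum score."""
    return score >= MIN_CATEGORIZATION_SCORE
```

Under this reading, a keyword that appears only in the body text scores 1 and is not enough on its own, while a URL or title match alone clears the minimum of 2.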

File diff suppressed because it is too large


@ -0,0 +1,273 @@
#!/usr/bin/env python3
"""
SKILL.md Enhancement Script
Uses Claude API to improve SKILL.md by analyzing reference documentation.
Usage:
skill-seekers enhance output/steam-inventory/
skill-seekers enhance output/react/
skill-seekers enhance output/godot/ --api-key YOUR_API_KEY
"""
import os
import sys
import json
import argparse
from pathlib import Path
# Add parent directory to path for imports when run as script
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from skill_seekers.cli.constants import API_CONTENT_LIMIT, API_PREVIEW_LIMIT
from skill_seekers.cli.utils import read_reference_files
try:
import anthropic
except ImportError:
print("❌ Error: anthropic package not installed")
print("Install with: pip3 install anthropic")
sys.exit(1)
class SkillEnhancer:
def __init__(self, skill_dir, api_key=None):
self.skill_dir = Path(skill_dir)
self.references_dir = self.skill_dir / "references"
self.skill_md_path = self.skill_dir / "SKILL.md"
# Get API key
self.api_key = api_key or os.environ.get('ANTHROPIC_API_KEY')
if not self.api_key:
raise ValueError(
"No API key provided. Set ANTHROPIC_API_KEY environment variable "
"or use --api-key argument"
)
self.client = anthropic.Anthropic(api_key=self.api_key)
def read_current_skill_md(self):
"""Read existing SKILL.md"""
if not self.skill_md_path.exists():
return None
return self.skill_md_path.read_text(encoding='utf-8')
def enhance_skill_md(self, references, current_skill_md):
"""Use Claude to enhance SKILL.md"""
# Build prompt
prompt = self._build_enhancement_prompt(references, current_skill_md)
print("\n🤖 Asking Claude to enhance SKILL.md...")
print(f" Input: {len(prompt):,} characters")
try:
message = self.client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=4096,
temperature=0.3,
messages=[{
"role": "user",
"content": prompt
}]
)
enhanced_content = message.content[0].text
return enhanced_content
except Exception as e:
print(f"❌ Error calling Claude API: {e}")
return None
def _build_enhancement_prompt(self, references, current_skill_md):
"""Build the prompt for Claude"""
# Extract skill name and description
skill_name = self.skill_dir.name
prompt = f"""You are enhancing a Claude skill's SKILL.md file. This skill is about: {skill_name}
I've scraped documentation and organized it into reference files. Your job is to create an EXCELLENT SKILL.md that will help Claude use this documentation effectively.
CURRENT SKILL.MD:
{'```markdown' if current_skill_md else '(none - create from scratch)'}
{current_skill_md or 'No existing SKILL.md'}
{'```' if current_skill_md else ''}
REFERENCE DOCUMENTATION:
"""
for filename, content in references.items():
prompt += f"\n\n## {filename}\n```markdown\n{content[:30000]}\n```\n"
prompt += """
YOUR TASK:
Create an enhanced SKILL.md that includes:
1. **Clear "When to Use This Skill" section** - Be specific about trigger conditions
2. **Excellent Quick Reference section** - Extract 5-10 of the BEST, most practical code examples from the reference docs
- Choose SHORT, clear examples that demonstrate common tasks
- Include both simple and intermediate examples
- Annotate examples with clear descriptions
- Use proper language tags (cpp, python, javascript, json, etc.)
3. **Detailed Reference Files description** - Explain what's in each reference file
4. **Practical "Working with This Skill" section** - Give users clear guidance on how to navigate the skill
5. **Key Concepts section** (if applicable) - Explain core concepts
6. **Keep the frontmatter** (---\nname: ...\n---) intact
IMPORTANT:
- Extract REAL examples from the reference docs, don't make them up
- Prioritize SHORT, clear examples (5-20 lines max)
- Make it actionable and practical
- Don't be too verbose - be concise but useful
- Maintain the markdown structure for Claude skills
- Keep code examples properly formatted with language tags
OUTPUT:
Return ONLY the complete SKILL.md content, starting with the frontmatter (---).
"""
return prompt
    def save_enhanced_skill_md(self, content):
        """Save the enhanced SKILL.md"""
        # Backup original
        if self.skill_md_path.exists():
            backup_path = self.skill_md_path.with_suffix('.md.backup')
            # replace() overwrites a backup left by an earlier run on every
            # platform; rename() raises FileExistsError on Windows in that case.
            self.skill_md_path.replace(backup_path)
            print(f" 💾 Backed up original to: {backup_path.name}")
        # Save enhanced version
        self.skill_md_path.write_text(content, encoding='utf-8')
        print(f" ✅ Saved enhanced SKILL.md")
def run(self):
"""Main enhancement workflow"""
print(f"\n{'='*60}")
print(f"ENHANCING SKILL: {self.skill_dir.name}")
print(f"{'='*60}\n")
# Read reference files
print("📖 Reading reference documentation...")
references = read_reference_files(
self.skill_dir,
max_chars=API_CONTENT_LIMIT,
preview_limit=API_PREVIEW_LIMIT
)
if not references:
print("❌ No reference files found to analyze")
return False
print(f" ✓ Read {len(references)} reference files")
total_size = sum(len(c) for c in references.values())
print(f" ✓ Total size: {total_size:,} characters\n")
# Read current SKILL.md
current_skill_md = self.read_current_skill_md()
if current_skill_md:
print(f" Found existing SKILL.md ({len(current_skill_md)} chars)")
else:
print(f" No existing SKILL.md, will create new one")
# Enhance with Claude
enhanced = self.enhance_skill_md(references, current_skill_md)
if not enhanced:
print("❌ Enhancement failed")
return False
print(f" ✓ Generated enhanced SKILL.md ({len(enhanced)} chars)\n")
# Save
print("💾 Saving enhanced SKILL.md...")
self.save_enhanced_skill_md(enhanced)
print(f"\n✅ Enhancement complete!")
print(f"\nNext steps:")
print(f" 1. Review: {self.skill_md_path}")
print(f" 2. If you don't like it, restore backup: {self.skill_md_path.with_suffix('.md.backup')}")
print(f" 3. Package your skill:")
print(f" skill-seekers package {self.skill_dir}/")
return True
def main():
parser = argparse.ArgumentParser(
description='Enhance SKILL.md using Claude API',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Using ANTHROPIC_API_KEY environment variable
export ANTHROPIC_API_KEY=sk-ant-...
skill-seekers enhance output/steam-inventory/
# Providing API key directly
skill-seekers enhance output/react/ --api-key sk-ant-...
# Show what would be done (dry run)
skill-seekers enhance output/godot/ --dry-run
"""
)
parser.add_argument('skill_dir', type=str,
help='Path to skill directory (e.g., output/steam-inventory/)')
parser.add_argument('--api-key', type=str,
help='Anthropic API key (or set ANTHROPIC_API_KEY env var)')
parser.add_argument('--dry-run', action='store_true',
help='Show what would be done without calling API')
args = parser.parse_args()
# Validate skill directory
skill_dir = Path(args.skill_dir)
if not skill_dir.exists():
print(f"❌ Error: Directory not found: {skill_dir}")
sys.exit(1)
if not skill_dir.is_dir():
print(f"❌ Error: Not a directory: {skill_dir}")
sys.exit(1)
# Dry run mode
if args.dry_run:
print(f"🔍 DRY RUN MODE")
print(f" Would enhance: {skill_dir}")
print(f" References: {skill_dir / 'references'}")
print(f" SKILL.md: {skill_dir / 'SKILL.md'}")
refs_dir = skill_dir / "references"
if refs_dir.exists():
ref_files = list(refs_dir.glob("*.md"))
print(f" Found {len(ref_files)} reference files:")
for rf in ref_files:
size = rf.stat().st_size
print(f" - {rf.name} ({size:,} bytes)")
print("\nTo actually run enhancement:")
print(f" skill-seekers enhance {skill_dir}")
return
# Create enhancer and run
try:
enhancer = SkillEnhancer(skill_dir, api_key=args.api_key)
success = enhancer.run()
sys.exit(0 if success else 1)
except ValueError as e:
print(f"❌ Error: {e}")
print("\nSet your API key:")
print(" export ANTHROPIC_API_KEY=sk-ant-...")
print("Or provide it directly:")
print(f" skill-seekers enhance {skill_dir} --api-key sk-ant-...")
sys.exit(1)
except Exception as e:
print(f"❌ Unexpected error: {e}")
import traceback
traceback.print_exc()
sys.exit(1)
if __name__ == "__main__":
main()
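
The backup step in `save_enhanced_skill_md` depends on how `Path.with_suffix` treats a multi-part suffix. The following is a minimal sketch of the same backup-then-write sequence against a temporary directory; `backup_and_write` is a hypothetical helper, and it uses `Path.replace` (an assumption made here so that a backup left by an earlier run is overwritten on any platform).

```python
import tempfile
from pathlib import Path


def backup_and_write(path: Path, content: str) -> Path:
    """Hypothetical helper: move the existing file aside, then write new content."""
    # 'SKILL.md' -> 'SKILL.md.backup': the '.md' suffix is swapped for '.md.backup'.
    backup_path = path.with_suffix('.md.backup')
    if path.exists():
        path.replace(backup_path)
    path.write_text(content, encoding='utf-8')
    return backup_path


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        skill_md = Path(tmp) / 'SKILL.md'
        skill_md.write_text('original', encoding='utf-8')
        backup = backup_and_write(skill_md, 'enhanced')
        print(backup.name, '|', backup.read_text(encoding='utf-8'))
        print(skill_md.name, '|', skill_md.read_text(encoding='utf-8'))
```

Only one backup generation is kept, so running the enhancement twice discards the first original unless it is copied elsewhere beforehand.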


@ -0,0 +1,451 @@
#!/usr/bin/env python3
"""
SKILL.md Enhancement Script (Local - Using Claude Code)
Opens a new terminal with Claude Code to enhance SKILL.md, then reports back.
No API key needed - uses your existing Claude Code Max plan!
Usage:
skill-seekers enhance output/steam-inventory/
skill-seekers enhance output/react/
Terminal Selection:
The script automatically detects which terminal app to use:
1. SKILL_SEEKER_TERMINAL env var (highest priority)
Example: export SKILL_SEEKER_TERMINAL="Ghostty"
2. TERM_PROGRAM env var (current terminal)
3. Terminal.app (fallback)
Supported terminals: Ghostty, iTerm, Terminal, WezTerm
"""
import os
import sys
import time
import subprocess
import tempfile
from pathlib import Path
# Add parent directory to path for imports when run as script
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from skill_seekers.cli.constants import LOCAL_CONTENT_LIMIT, LOCAL_PREVIEW_LIMIT
from skill_seekers.cli.utils import read_reference_files
def detect_terminal_app():
    """Detect which terminal app to use with cascading priority.

    Priority order:
        1. SKILL_SEEKER_TERMINAL environment variable (explicit user preference)
        2. TERM_PROGRAM environment variable (inherit current terminal)
        3. Terminal.app (fallback default)

    Returns:
        tuple: (terminal_app_name, detection_method)
            - terminal_app_name (str): Name of terminal app to launch (e.g., "Ghostty", "Terminal")
            - detection_method (str): How the terminal was detected (for logging)

    Examples:
        >>> os.environ['SKILL_SEEKER_TERMINAL'] = 'Ghostty'
        >>> detect_terminal_app()
        ('Ghostty', 'SKILL_SEEKER_TERMINAL')
        >>> os.environ['TERM_PROGRAM'] = 'iTerm.app'
        >>> detect_terminal_app()
        ('iTerm', 'TERM_PROGRAM')
    """
    # Map TERM_PROGRAM values to macOS app names
    TERMINAL_MAP = {
        'Apple_Terminal': 'Terminal',
        'iTerm.app': 'iTerm',
        'ghostty': 'Ghostty',
        'WezTerm': 'WezTerm',
    }
    # Priority 1: Check SKILL_SEEKER_TERMINAL env var (explicit preference)
    preferred_terminal = os.environ.get('SKILL_SEEKER_TERMINAL', '').strip()
    if preferred_terminal:
        return preferred_terminal, 'SKILL_SEEKER_TERMINAL'
    # Priority 2: Check TERM_PROGRAM (inherit current terminal)
    term_program = os.environ.get('TERM_PROGRAM', '').strip()
    if term_program and term_program in TERMINAL_MAP:
        return TERMINAL_MAP[term_program], 'TERM_PROGRAM'
    # Priority 3: Fallback to Terminal.app
    if term_program:
        # TERM_PROGRAM is set but unknown
        return 'Terminal', f'unknown TERM_PROGRAM ({term_program})'
    else:
        # No TERM_PROGRAM set
        return 'Terminal', 'default'
class LocalSkillEnhancer:
def __init__(self, skill_dir):
self.skill_dir = Path(skill_dir)
self.references_dir = self.skill_dir / "references"
self.skill_md_path = self.skill_dir / "SKILL.md"
def create_enhancement_prompt(self):
"""Create the prompt file for Claude Code"""
# Read reference files
references = read_reference_files(
self.skill_dir,
max_chars=LOCAL_CONTENT_LIMIT,
preview_limit=LOCAL_PREVIEW_LIMIT
)
if not references:
print("❌ No reference files found")
return None
# Read current SKILL.md
current_skill_md = ""
if self.skill_md_path.exists():
current_skill_md = self.skill_md_path.read_text(encoding='utf-8')
# Build prompt
prompt = f"""I need you to enhance the SKILL.md file for the {self.skill_dir.name} skill.
CURRENT SKILL.MD:
{'-'*60}
{current_skill_md if current_skill_md else '(No existing SKILL.md - create from scratch)'}
{'-'*60}
REFERENCE DOCUMENTATION:
{'-'*60}
"""
for filename, content in references.items():
prompt += f"\n## {filename}\n{content[:15000]}\n"
prompt += f"""
{'-'*60}
YOUR TASK:
Create an EXCELLENT SKILL.md file that will help Claude use this documentation effectively.
Requirements:
1. **Clear "When to Use This Skill" section**
- Be SPECIFIC about trigger conditions
- List concrete use cases
2. **Excellent Quick Reference section**
- Extract 5-10 of the BEST, most practical code examples from the reference docs
- Choose SHORT, clear examples (5-20 lines max)
- Include both simple and intermediate examples
- Use proper language tags (cpp, python, javascript, json, etc.)
- Add clear descriptions for each example
3. **Detailed Reference Files description**
- Explain what's in each reference file
- Help users navigate the documentation
4. **Practical "Working with This Skill" section**
- Clear guidance for beginners, intermediate, and advanced users
- Navigation tips
5. **Key Concepts section** (if applicable)
- Explain core concepts
- Define important terminology
IMPORTANT:
- Extract REAL examples from the reference docs above
- Prioritize SHORT, clear examples
- Make it actionable and practical
- Keep the frontmatter (---\\nname: ...\\n---) intact
- Use proper markdown formatting
SAVE THE RESULT:
Save the complete enhanced SKILL.md to: {self.skill_md_path.absolute()}
First, backup the original to: {self.skill_md_path.with_suffix('.md.backup').absolute()}
"""
return prompt
def run(self, headless=True, timeout=600):
"""Main enhancement workflow
Args:
headless: If True, run claude directly without opening terminal (default: True)
timeout: Maximum time to wait for enhancement in seconds (default: 600 = 10 minutes)
"""
print(f"\n{'='*60}")
print(f"LOCAL ENHANCEMENT: {self.skill_dir.name}")
print(f"{'='*60}\n")
# Validate
if not self.skill_dir.exists():
print(f"❌ Directory not found: {self.skill_dir}")
return False
# Read reference files
print("📖 Reading reference documentation...")
references = read_reference_files(
self.skill_dir,
max_chars=LOCAL_CONTENT_LIMIT,
preview_limit=LOCAL_PREVIEW_LIMIT
)
if not references:
print("❌ No reference files found to analyze")
return False
print(f" ✓ Read {len(references)} reference files")
total_size = sum(len(c) for c in references.values())
print(f" ✓ Total size: {total_size:,} characters\n")
# Create prompt
print("📝 Creating enhancement prompt...")
prompt = self.create_enhancement_prompt()
if not prompt:
return False
# Save prompt to temp file
with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False, encoding='utf-8') as f:
prompt_file = f.name
f.write(prompt)
print(f" ✓ Prompt saved ({len(prompt):,} characters)\n")
# Headless mode: Run claude directly without opening terminal
if headless:
return self._run_headless(prompt_file, timeout)
# Terminal mode: Launch Claude Code in new terminal
print("🚀 Launching Claude Code in new terminal...")
print(" This will:")
print(" 1. Open a new terminal window")
print(" 2. Run Claude Code with the enhancement task")
print(" 3. Claude will read the docs and enhance SKILL.md")
print(" 4. Terminal will auto-close when done")
print()
# Create a shell script to run in the terminal
shell_script = f'''#!/bin/bash
claude "{prompt_file}"
echo ""
echo "✅ Enhancement complete!"
echo "Press any key to close..."
read -n 1
rm "{prompt_file}"
'''
# Save shell script
with tempfile.NamedTemporaryFile(mode='w', suffix='.sh', delete=False) as f:
script_file = f.name
f.write(shell_script)
os.chmod(script_file, 0o755)
# Launch in new terminal (macOS specific)
if sys.platform == 'darwin':
# Detect which terminal app to use
terminal_app, detection_method = detect_terminal_app()
# Show detection info
if detection_method == 'SKILL_SEEKER_TERMINAL':
print(f" Using terminal: {terminal_app} (from SKILL_SEEKER_TERMINAL)")
elif detection_method == 'TERM_PROGRAM':
print(f" Using terminal: {terminal_app} (inherited from current terminal)")
elif detection_method.startswith('unknown TERM_PROGRAM'):
print(f"⚠️ {detection_method}")
print(f" → Using Terminal.app as fallback")
else:
print(f" Using terminal: {terminal_app} (default)")
try:
subprocess.Popen(['open', '-a', terminal_app, script_file])
except Exception as e:
print(f"⚠️ Error launching {terminal_app}: {e}")
print(f"\nManually run: {script_file}")
return False
else:
print("⚠️ Auto-launch only works on macOS")
print(f"\nManually run this command in a new terminal:")
print(f" claude '{prompt_file}'")
print(f"\nThen delete the prompt file:")
print(f" rm '{prompt_file}'")
return False
print("✅ New terminal launched with Claude Code!")
print()
print("📊 Status:")
print(f" - Prompt file: {prompt_file}")
print(f" - Skill directory: {self.skill_dir.absolute()}")
print(f" - SKILL.md will be saved to: {self.skill_md_path.absolute()}")
print(f" - Original backed up to: {self.skill_md_path.with_suffix('.md.backup').absolute()}")
print()
print("⏳ Wait for Claude Code to finish in the other terminal...")
print(" (Usually takes 30-60 seconds)")
print()
print("💡 When done:")
print(f" 1. Check the enhanced SKILL.md: {self.skill_md_path}")
print(f" 2. If you don't like it, restore: mv {self.skill_md_path.with_suffix('.md.backup')} {self.skill_md_path}")
print(f" 3. Package: skill-seekers package {self.skill_dir}/")
return True
def _run_headless(self, prompt_file, timeout):
"""Run Claude enhancement in headless mode (no terminal window)
Args:
prompt_file: Path to prompt file
timeout: Maximum seconds to wait
Returns:
bool: True if enhancement succeeded
"""
import time
from pathlib import Path
print("✨ Running Claude Code enhancement (headless mode)...")
print(f" Timeout: {timeout} seconds ({timeout//60} minutes)")
print()
# Record initial state
initial_mtime = self.skill_md_path.stat().st_mtime if self.skill_md_path.exists() else 0
initial_size = self.skill_md_path.stat().st_size if self.skill_md_path.exists() else 0
# Start timer
start_time = time.time()
try:
# Run claude command directly (this WAITS for completion)
print(f" Running: claude {prompt_file}")
print(" ⏳ Please wait...")
print()
result = subprocess.run(
['claude', prompt_file],
capture_output=True,
text=True,
timeout=timeout
)
elapsed = time.time() - start_time
# Check if successful
if result.returncode == 0:
# Verify SKILL.md was actually updated
if self.skill_md_path.exists():
new_mtime = self.skill_md_path.stat().st_mtime
new_size = self.skill_md_path.stat().st_size
if new_mtime > initial_mtime and new_size > initial_size:
print(f"✅ Enhancement complete! ({elapsed:.1f} seconds)")
print(f" SKILL.md updated: {new_size:,} bytes")
print()
# Clean up prompt file
try:
os.unlink(prompt_file)
except OSError:
# Prompt file already gone or not removable; nothing more to clean up
pass
return True
else:
print(f"⚠️ Claude finished but SKILL.md was not updated")
print(f" This might indicate an error during enhancement")
print()
return False
else:
print(f"❌ SKILL.md not found after enhancement")
return False
else:
print(f"❌ Claude Code returned error (exit code: {result.returncode})")
if result.stderr:
print(f" Error: {result.stderr[:200]}")
return False
except subprocess.TimeoutExpired:
elapsed = time.time() - start_time
print(f"\n⚠️ Enhancement timed out after {elapsed:.0f} seconds")
print(f" Timeout limit: {timeout} seconds")
print()
print(" Possible reasons:")
print(" - Skill is very large (many references)")
print(" - Claude is taking longer than usual")
print(" - Network issues")
print()
print(" Try:")
print(" 1. Use terminal mode: --interactive-enhancement")
print(" 2. Reduce reference content")
print(" 3. Try again later")
# Clean up
try:
os.unlink(prompt_file)
except OSError:
# Prompt file already gone or not removable; nothing more to clean up
pass
return False
except FileNotFoundError:
print("'claude' command not found")
print()
print(" Make sure Claude Code CLI is installed:")
print(" See: https://docs.claude.com/claude-code")
print()
print(" Try terminal mode instead: --interactive-enhancement")
return False
except Exception as e:
print(f"❌ Unexpected error: {e}")
return False
def main():
import argparse
parser = argparse.ArgumentParser(
description="Enhance a skill with Claude Code (local)",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Headless mode (default - runs in background)
skill-seekers enhance output/react/
# Interactive mode (opens terminal window)
skill-seekers enhance output/react/ --interactive-enhancement
# Custom timeout
skill-seekers enhance output/react/ --timeout 1200
"""
)
parser.add_argument(
'skill_directory',
help='Path to skill directory (e.g., output/react/)'
)
parser.add_argument(
'--interactive-enhancement',
action='store_true',
help='Open terminal window for enhancement (default: headless mode)'
)
parser.add_argument(
'--timeout',
type=int,
default=600,
help='Timeout in seconds for headless mode (default: 600 = 10 minutes)'
)
args = parser.parse_args()
# Run enhancement
enhancer = LocalSkillEnhancer(args.skill_directory)
headless = not args.interactive_enhancement # Invert: default is headless
success = enhancer.run(headless=headless, timeout=args.timeout)
sys.exit(0 if success else 1)
if __name__ == "__main__":
main()
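
The terminal selection cascade described in `detect_terminal_app()` can be exercised without touching the real environment. The sketch below is not part of the script: `pick_terminal` is a hypothetical helper that takes the environment as a plain mapping, and its lookup table simply mirrors the `TERMINAL_MAP` values shown above.

```python
from typing import Mapping, Tuple

# Mirrors TERMINAL_MAP from detect_terminal_app(); copied here so the
# sketch is self-contained.
TERMINAL_MAP = {
    "Apple_Terminal": "Terminal",
    "iTerm.app": "iTerm",
    "ghostty": "Ghostty",
    "WezTerm": "WezTerm",
}


def pick_terminal(env: Mapping[str, str]) -> Tuple[str, str]:
    """Hypothetical pure variant: return (app_name, detection_method)."""
    # Priority 1: explicit user preference wins over everything else
    preferred = env.get("SKILL_SEEKER_TERMINAL", "").strip()
    if preferred:
        return preferred, "SKILL_SEEKER_TERMINAL"
    # Priority 2: inherit the current terminal when it is a known one
    term_program = env.get("TERM_PROGRAM", "").strip()
    if term_program in TERMINAL_MAP:
        return TERMINAL_MAP[term_program], "TERM_PROGRAM"
    # Priority 3: fall back to Terminal.app, noting why
    if term_program:
        return "Terminal", f"unknown TERM_PROGRAM ({term_program})"
    return "Terminal", "default"
```

Passing a dictionary instead of reading `os.environ` keeps the priority order easy to check in isolation, for example `pick_terminal({"TERM_PROGRAM": "iTerm.app"})`.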


@ -0,0 +1,288 @@
#!/usr/bin/env python3
"""
Page Count Estimator for Skill Seeker
Quickly estimates how many pages a config will scrape without downloading content
"""
import sys
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import time
import json
# Add parent directory to path for imports when run as script
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from skill_seekers.cli.constants import (
DEFAULT_RATE_LIMIT,
DEFAULT_MAX_DISCOVERY,
DISCOVERY_THRESHOLD
)
def estimate_pages(config, max_discovery=DEFAULT_MAX_DISCOVERY, timeout=30):
"""
Estimate total pages that will be scraped
Args:
config: Configuration dictionary
max_discovery: Maximum pages to discover (safety limit, use -1 for unlimited)
timeout: Timeout for HTTP requests in seconds
Returns:
dict with estimation results
"""
base_url = config['base_url']
start_urls = config.get('start_urls', [base_url])
url_patterns = config.get('url_patterns', {'include': [], 'exclude': []})
rate_limit = config.get('rate_limit', DEFAULT_RATE_LIMIT)
visited = set()
pending = list(start_urls)
discovered = 0
include_patterns = url_patterns.get('include', [])
exclude_patterns = url_patterns.get('exclude', [])
# Handle unlimited mode
unlimited = (max_discovery == -1 or max_discovery is None)
print(f"🔍 Estimating pages for: {config['name']}")
print(f"📍 Base URL: {base_url}")
print(f"🎯 Start URLs: {len(start_urls)}")
print(f"⏱️ Rate limit: {rate_limit}s")
if unlimited:
print(f"🔢 Max discovery: UNLIMITED (will discover all pages)")
print(f"⚠️ WARNING: This may take a long time!")
else:
print(f"🔢 Max discovery: {max_discovery}")
print()
start_time = time.time()
# Loop condition: stop if no more URLs, or if limit reached (when not unlimited)
while pending and (unlimited or discovered < max_discovery):
url = pending.pop(0)
# Skip if already visited
if url in visited:
continue
visited.add(url)
discovered += 1
# Progress indicator
if discovered % 10 == 0:
elapsed = time.time() - start_time
rate = discovered / elapsed if elapsed > 0 else 0
print(f"⏳ Discovered: {discovered} pages ({rate:.1f} pages/sec)", end='\r')
try:
# HEAD request first to check if page exists (faster)
head_response = requests.head(url, timeout=timeout, allow_redirects=True)
# Skip non-HTML content
content_type = head_response.headers.get('Content-Type', '')
if 'text/html' not in content_type:
continue
# Now GET the page to find links
response = requests.get(url, timeout=timeout)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')
# Find all links
for link in soup.find_all('a', href=True):
href = link['href']
full_url = urljoin(url, href)
# Normalize URL
parsed = urlparse(full_url)
full_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
# Check if URL is valid
if not is_valid_url(full_url, base_url, include_patterns, exclude_patterns):
continue
# Add to pending if not visited
if full_url not in visited and full_url not in pending:
pending.append(full_url)
# Rate limiting
time.sleep(rate_limit)
except requests.RequestException as e:
# Silently skip errors during estimation
pass
except Exception as e:
# Silently skip other errors
pass
elapsed = time.time() - start_time
# Results
results = {
'discovered': discovered,
'pending': len(pending),
'estimated_total': discovered + len(pending),
'elapsed_seconds': round(elapsed, 2),
'discovery_rate': round(discovered / elapsed if elapsed > 0 else 0, 2),
'hit_limit': (not unlimited) and (discovered >= max_discovery),
'unlimited': unlimited
}
return results
def is_valid_url(url, base_url, include_patterns, exclude_patterns):
    """Check if URL should be crawled"""
    # Must be same domain
    if not url.startswith(base_url.rstrip('/')):
        return False
    # Check exclude patterns first
    if exclude_patterns:
        for pattern in exclude_patterns:
            if pattern in url:
                return False
    # Check include patterns (if specified)
    if include_patterns:
        for pattern in include_patterns:
            if pattern in url:
                return True
        return False
    # If no include patterns, accept by default
    return True
def print_results(results, config):
"""Print estimation results"""
print()
print("=" * 70)
print("📊 ESTIMATION RESULTS")
print("=" * 70)
print()
print(f"Config: {config['name']}")
print(f"Base URL: {config['base_url']}")
print()
print(f"✅ Pages Discovered: {results['discovered']}")
print(f"⏳ Pages Pending: {results['pending']}")
print(f"📈 Estimated Total: {results['estimated_total']}")
print()
print(f"⏱️ Time Elapsed: {results['elapsed_seconds']}s")
print(f"⚡ Discovery Rate: {results['discovery_rate']} pages/sec")
if results.get('unlimited', False):
print()
print("✅ UNLIMITED MODE - Discovered all reachable pages")
print(f" Total pages: {results['estimated_total']}")
elif results['hit_limit']:
print()
print("⚠️ Hit discovery limit - actual total may be higher")
print(" Increase max_discovery parameter for more accurate estimate")
print()
print("=" * 70)
print("💡 RECOMMENDATIONS")
print("=" * 70)
print()
estimated = results['estimated_total']
current_max = config.get('max_pages', 100)
if estimated <= current_max:
print(f"✅ Current max_pages ({current_max}) is sufficient")
else:
recommended = min(estimated + 50, DISCOVERY_THRESHOLD) # Add 50 buffer, cap at threshold
print(f"⚠️ Current max_pages ({current_max}) may be too low")
print(f"📝 Recommended max_pages: {recommended}")
print(f" (Estimated {estimated} + 50 buffer)")
# Estimate time for full scrape
rate_limit = config.get('rate_limit', DEFAULT_RATE_LIMIT)
estimated_time = (estimated * rate_limit) / 60 # in minutes
print()
print(f"⏱️ Estimated full scrape time: {estimated_time:.1f} minutes")
print(f" (Based on rate_limit: {rate_limit}s)")
print()
def load_config(config_path):
    """Load configuration from JSON file"""
    try:
        with open(config_path, 'r') as f:
            config = json.load(f)
        return config
    except FileNotFoundError:
        print(f"❌ Error: Config file not found: {config_path}")
        sys.exit(1)
    except json.JSONDecodeError as e:
        print(f"❌ Error: Invalid JSON in config file: {e}")
        sys.exit(1)
def main():
"""Main entry point"""
import argparse
parser = argparse.ArgumentParser(
description='Estimate page count for Skill Seeker configs',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Estimate pages for a config
skill-seekers estimate configs/react.json
# Estimate with higher discovery limit
skill-seekers estimate configs/godot.json --max-discovery 2000
# Quick estimate (stop at 100 pages)
skill-seekers estimate configs/vue.json --max-discovery 100
"""
)
parser.add_argument('config', help='Path to config JSON file')
parser.add_argument('--max-discovery', '-m', type=int, default=DEFAULT_MAX_DISCOVERY,
help=f'Maximum pages to discover (default: {DEFAULT_MAX_DISCOVERY}, use -1 for unlimited)')
parser.add_argument('--unlimited', '-u', action='store_true',
help='Remove discovery limit - discover all pages (same as --max-discovery -1)')
parser.add_argument('--timeout', '-t', type=int, default=30,
help='HTTP request timeout in seconds (default: 30)')
args = parser.parse_args()
# Handle unlimited flag
max_discovery = -1 if args.unlimited else args.max_discovery
# Load config
config = load_config(args.config)
# Run estimation
try:
results = estimate_pages(config, max_discovery, args.timeout)
print_results(results, config)
# Return exit code based on results
if results['hit_limit']:
return 2 # Warning: hit limit
return 0 # Success
except KeyboardInterrupt:
print("\n\n⚠️ Estimation interrupted by user")
return 1
except Exception as e:
print(f"\n\n❌ Error during estimation: {e}")
return 1
if __name__ == '__main__':
sys.exit(main())
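
The estimator decides which links to follow in two steps: it normalizes each link to scheme, host and path, then filters it against the base URL and the include/exclude patterns. The following is a minimal sketch of those two steps without any network access. `normalize` and `should_crawl` are hypothetical names, and the `example.com` URLs are made-up inputs.

```python
from urllib.parse import urljoin, urlparse


def normalize(page_url, href):
    """Resolve href against page_url and drop the query string and fragment."""
    parsed = urlparse(urljoin(page_url, href))
    return f"{parsed.scheme}://{parsed.netloc}{parsed.path}"


def should_crawl(url, base_url, include=(), exclude=()):
    """Apply the prefix, exclude and include checks in that order."""
    # The URL must start with the base URL (trailing slash ignored)
    if not url.startswith(base_url.rstrip("/")):
        return False
    # Any matching exclude pattern rejects the URL
    if any(pattern in url for pattern in exclude):
        return False
    # With include patterns present, at least one must match
    if include:
        return any(pattern in url for pattern in include)
    # No include patterns: accept by default
    return True
```

Because normalization discards the query string, pages that differ only in their query parameters collapse into a single URL, which keeps the estimate from counting them separately.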


@ -0,0 +1,274 @@
#!/usr/bin/env python3
"""
Router Skill Generator
Creates a router/hub skill that intelligently directs queries to specialized sub-skills.
This is used for large documentation sites split into multiple focused skills.
"""
import json
import sys
import argparse
from pathlib import Path
from typing import Dict, List, Any, Tuple
class RouterGenerator:
"""Generates router skills that direct to specialized sub-skills"""
def __init__(self, config_paths: List[str], router_name: str = None):
self.config_paths = [Path(p) for p in config_paths]
self.configs = [self.load_config(p) for p in self.config_paths]
self.router_name = router_name or self.infer_router_name()
self.base_config = self.configs[0] # Use first as template
def load_config(self, path: Path) -> Dict[str, Any]:
"""Load a config file"""
try:
with open(path, 'r') as f:
return json.load(f)
except Exception as e:
print(f"❌ Error loading {path}: {e}")
sys.exit(1)
def infer_router_name(self) -> str:
"""Infer router name from sub-skill names"""
# Find common prefix
names = [cfg['name'] for cfg in self.configs]
if not names:
return "router"
# Get common prefix before first dash
first_name = names[0]
if '-' in first_name:
return first_name.split('-')[0]
return first_name
def extract_routing_keywords(self) -> Dict[str, List[str]]:
"""Extract keywords for routing to each skill"""
routing = {}
for config in self.configs:
name = config['name']
keywords = []
# Extract from categories
if 'categories' in config:
keywords.extend(config['categories'].keys())
# Extract from name (part after dash)
if '-' in name:
skill_topic = name.split('-', 1)[1]
keywords.append(skill_topic)
routing[name] = keywords
return routing
def generate_skill_md(self) -> str:
"""Generate router SKILL.md content"""
routing_keywords = self.extract_routing_keywords()
skill_md = f"""# {self.router_name.replace('-', ' ').title()} Documentation (Router)
## When to Use This Skill
{self.base_config.get('description', f'Use for {self.router_name} development and programming.')}
This is a router skill that directs your questions to specialized sub-skills for efficient, focused assistance.
## How It Works
This skill analyzes your question and activates the appropriate specialized skill(s):
"""
# List sub-skills
for config in self.configs:
name = config['name']
desc = config.get('description', '')
# Remove router name prefix from description if present
if desc.startswith(f"{self.router_name.title()} -"):
desc = desc.split(' - ', 1)[1]
skill_md += f"### {name}\n{desc}\n\n"
# Routing logic
skill_md += """## Routing Logic
The router analyzes your question for topic keywords and activates relevant skills:
**Keywords → Skills:**
"""
for skill_name, keywords in routing_keywords.items():
keyword_str = ", ".join(keywords)
skill_md += f"- {keyword_str} → **{skill_name}**\n"
# Quick reference
skill_md += f"""
## Quick Reference
For quick answers, this router provides basic overview information. For detailed documentation, the specialized skills contain comprehensive references.
### Getting Started
1. Ask your question naturally - mention the topic area
2. The router will activate the appropriate skill(s)
3. You'll receive focused, detailed answers from specialized documentation
### Examples
**Question:** "How do I create a 2D sprite?"
**Activates:** {self.router_name}-2d skill
**Question:** "GDScript function syntax"
**Activates:** {self.router_name}-scripting skill
**Question:** "Physics collision handling in 3D"
**Activates:** {self.router_name}-3d + {self.router_name}-physics skills
### All Available Skills
"""
# List all skills
for config in self.configs:
skill_md += f"- **{config['name']}**\n"
skill_md += f"""
## Need Help?
Simply ask your question and mention the topic. The router will find the right specialized skill for you!
---
*This is a router skill. For complete documentation, see the specialized skills listed above.*
"""
return skill_md
def create_router_config(self) -> Dict[str, Any]:
"""Create router configuration"""
routing_keywords = self.extract_routing_keywords()
router_config = {
"name": self.router_name,
"description": self.base_config.get('description', f'{self.router_name.title()} documentation router'),
"base_url": self.base_config['base_url'],
"selectors": self.base_config.get('selectors', {}),
"url_patterns": self.base_config.get('url_patterns', {}),
"rate_limit": self.base_config.get('rate_limit', 0.5),
"max_pages": 500, # Router only scrapes overview pages
"_router": True,
"_sub_skills": [cfg['name'] for cfg in self.configs],
"_routing_keywords": routing_keywords
}
return router_config
def generate(self, output_dir: Path = None) -> Tuple[Path, Path]:
"""Generate router skill and config"""
if output_dir is None:
output_dir = self.config_paths[0].parent
output_dir = Path(output_dir)
# Generate SKILL.md
skill_md = self.generate_skill_md()
skill_path = output_dir.parent / f"output/{self.router_name}/SKILL.md"
skill_path.parent.mkdir(parents=True, exist_ok=True)
with open(skill_path, 'w') as f:
f.write(skill_md)
# Generate config
router_config = self.create_router_config()
config_path = output_dir / f"{self.router_name}.json"
with open(config_path, 'w') as f:
json.dump(router_config, f, indent=2)
return config_path, skill_path
def main():
parser = argparse.ArgumentParser(
description="Generate router/hub skill for split documentation",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Generate router from multiple configs
python3 generate_router.py configs/godot-2d.json configs/godot-3d.json configs/godot-scripting.json
# Use glob pattern
python3 generate_router.py configs/godot-*.json
# Custom router name
python3 generate_router.py configs/godot-*.json --name godot-hub
# Custom output directory
python3 generate_router.py configs/godot-*.json --output-dir configs/routers/
"""
)
parser.add_argument(
'configs',
nargs='+',
help='Sub-skill config files'
)
parser.add_argument(
'--name',
help='Router skill name (default: inferred from sub-skills)'
)
parser.add_argument(
'--output-dir',
help='Output directory (default: same as input configs)'
)
args = parser.parse_args()
# Filter out router configs (avoid recursion)
config_files = []
for path_str in args.configs:
path = Path(path_str)
if path.exists() and not path.stem.endswith('-router'):
config_files.append(path_str)
if not config_files:
print("❌ Error: No valid config files provided")
sys.exit(1)
print(f"\n{'='*60}")
print("ROUTER SKILL GENERATOR")
print(f"{'='*60}")
print(f"Sub-skills: {len(config_files)}")
for cfg in config_files:
print(f" - {Path(cfg).stem}")
print("")
# Generate router
generator = RouterGenerator(config_files, args.name)
config_path, skill_path = generator.generate(args.output_dir)
print(f"✅ Router config created: {config_path}")
print(f"✅ Router SKILL.md created: {skill_path}")
print("")
print(f"{'='*60}")
print("NEXT STEPS")
print(f"{'='*60}")
print(f"1. Review router SKILL.md: {skill_path}")
print(f"2. Optionally scrape router (for overview pages):")
print(f" skill-seekers scrape --config {config_path}")
print("3. Package router skill:")
print(f" skill-seekers package output/{generator.router_name}/")
print("4. Upload router + all sub-skills to Claude")
print("")
if __name__ == "__main__":
main()
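
The routing table written into the router's SKILL.md is derived from each sub-skill config: its category names plus the part of the skill name after the first dash. Below is a small standalone sketch of that derivation. `routing_keywords` is a hypothetical free function standing in for `RouterGenerator.extract_routing_keywords`, and the sample category names are invented for illustration.

```python
from typing import Any, Dict, List


def routing_keywords(configs: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map each sub-skill name to the keywords that should route to it."""
    routing: Dict[str, List[str]] = {}
    for config in configs:
        name = config["name"]
        # Category names act as topic keywords
        keywords = list(config.get("categories", {}).keys())
        # The suffix after the first dash names the sub-skill's topic
        if "-" in name:
            keywords.append(name.split("-", 1)[1])
        routing[name] = keywords
    return routing


# Assumed sample configs, not taken from any real config file
sample = [
    {"name": "godot-2d", "categories": {"sprites": [], "tilemaps": []}},
    {"name": "godot"},
]
table = routing_keywords(sample)
```

A config whose name has no dash and no categories ends up with an empty keyword list, so the router has nothing to match it on; that case may be worth checking before generating a router.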


@ -0,0 +1,900 @@
#!/usr/bin/env python3
"""
GitHub Repository to Claude Skill Converter (Tasks C1.1-C1.12)
Converts GitHub repositories into Claude AI skills by extracting:
- README and documentation
- Code structure and signatures
- GitHub Issues, Changelog, and Releases
- Usage examples from tests
Usage:
skill-seekers github --repo facebook/react
skill-seekers github --config configs/react_github.json
skill-seekers github --repo owner/repo --token $GITHUB_TOKEN
"""
import os
import sys
import json
import re
import argparse
import logging
from pathlib import Path
from typing import Dict, List, Optional, Any
from datetime import datetime
try:
from github import Github, GithubException, Repository
from github.GithubException import RateLimitExceededException
except ImportError:
print("Error: PyGithub not installed. Run: pip install PyGithub")
sys.exit(1)
# Configure logging FIRST (before using logger)
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
# Import code analyzer for deep code analysis
try:
from .code_analyzer import CodeAnalyzer
CODE_ANALYZER_AVAILABLE = True
except ImportError:
CODE_ANALYZER_AVAILABLE = False
logger.warning("Code analyzer not available - deep analysis disabled")
# Directories to exclude from local repository analysis
EXCLUDED_DIRS = {
'venv', 'env', '.venv', '.env', # Virtual environments
'node_modules', '__pycache__', '.pytest_cache', # Dependencies and caches
'.git', '.svn', '.hg', # Version control
'build', 'dist', '*.egg-info', # Build artifacts
'htmlcov', '.coverage', # Coverage reports
'.tox', '.nox', # Testing environments
'.mypy_cache', '.ruff_cache', # Linter caches
}
class GitHubScraper:
"""
GitHub Repository Scraper (C1.1-C1.9)
Extracts repository information for skill generation:
- Repository structure
- README files
- Code comments and docstrings
- Programming language detection
- Function/class signatures
- Test examples
- GitHub Issues
- CHANGELOG
- Releases
"""
def __init__(self, config: Dict[str, Any], local_repo_path: Optional[str] = None):
"""Initialize GitHub scraper with configuration."""
self.config = config
self.repo_name = config['repo']
self.name = config.get('name', self.repo_name.split('/')[-1])
self.description = config.get('description', f'Skill for {self.repo_name}')
# Local repository path (optional - enables unlimited analysis)
self.local_repo_path = local_repo_path or config.get('local_repo_path')
if self.local_repo_path:
self.local_repo_path = os.path.expanduser(self.local_repo_path)
logger.info(f"Local repository mode enabled: {self.local_repo_path}")
# Configure directory exclusions (smart defaults + optional customization)
self.excluded_dirs = set(EXCLUDED_DIRS) # Start with smart defaults
# Option 1: Replace mode - Use only specified exclusions
if 'exclude_dirs' in config:
self.excluded_dirs = set(config['exclude_dirs'])
logger.warning(
f"Using custom directory exclusions ({len(self.excluded_dirs)} dirs) - "
"defaults overridden"
)
logger.debug(f"Custom exclusions: {sorted(self.excluded_dirs)}")
# Option 2: Extend mode - Add to default exclusions
elif 'exclude_dirs_additional' in config:
additional = set(config['exclude_dirs_additional'])
self.excluded_dirs = self.excluded_dirs.union(additional)
logger.info(
f"Added {len(additional)} custom directory exclusions "
f"(total: {len(self.excluded_dirs)})"
)
logger.debug(f"Additional exclusions: {sorted(additional)}")
# GitHub client setup (C1.1)
token = self._get_token()
self.github = Github(token) if token else Github()
self.repo: Optional[Repository.Repository] = None
# Options
self.include_issues = config.get('include_issues', True)
self.max_issues = config.get('max_issues', 100)
self.include_changelog = config.get('include_changelog', True)
self.include_releases = config.get('include_releases', True)
self.include_code = config.get('include_code', False)
self.code_analysis_depth = config.get('code_analysis_depth', 'surface') # 'surface', 'deep', 'full'
self.file_patterns = config.get('file_patterns', [])
# Initialize code analyzer if deep analysis requested
self.code_analyzer = None
if self.code_analysis_depth != 'surface' and CODE_ANALYZER_AVAILABLE:
self.code_analyzer = CodeAnalyzer(depth=self.code_analysis_depth)
logger.info(f"Code analysis depth: {self.code_analysis_depth}")
# Output paths
self.skill_dir = f"output/{self.name}"
self.data_file = f"output/{self.name}_github_data.json"
# Extracted data storage
self.extracted_data = {
'repo_info': {},
'readme': '',
'file_tree': [],
'languages': {},
'signatures': [],
'test_examples': [],
'issues': [],
'changelog': '',
'releases': []
}
def _get_token(self) -> Optional[str]:
"""
Get GitHub token from env var or config (both options supported).
Priority: GITHUB_TOKEN env var > config file > None
"""
# Try environment variable first (recommended)
token = os.getenv('GITHUB_TOKEN')
if token:
logger.info("Using GitHub token from GITHUB_TOKEN environment variable")
return token
# Fall back to config file
token = self.config.get('github_token')
if token:
logger.warning("Using GitHub token from config file (less secure)")
return token
logger.warning("No GitHub token provided - using unauthenticated access (lower rate limits)")
return None
def scrape(self) -> Dict[str, Any]:
"""
Main scraping entry point.
Executes all C1 tasks in sequence.
"""
try:
logger.info(f"Starting GitHub scrape for: {self.repo_name}")
# C1.1: Fetch repository
self._fetch_repository()
# C1.2: Extract README
self._extract_readme()
# C1.3-C1.6: Extract code structure
self._extract_code_structure()
# C1.7: Extract Issues
if self.include_issues:
self._extract_issues()
# C1.8: Extract CHANGELOG
if self.include_changelog:
self._extract_changelog()
# C1.9: Extract Releases
if self.include_releases:
self._extract_releases()
# Save extracted data
self._save_data()
logger.info(f"✅ Scraping complete! Data saved to: {self.data_file}")
return self.extracted_data
except RateLimitExceededException:
logger.error("GitHub API rate limit exceeded. Please wait or use authentication token.")
raise
except GithubException as e:
logger.error(f"GitHub API error: {e}")
raise
except Exception as e:
logger.error(f"Unexpected error during scraping: {e}")
raise
def _fetch_repository(self):
"""C1.1: Fetch repository structure using GitHub API."""
logger.info(f"Fetching repository: {self.repo_name}")
try:
self.repo = self.github.get_repo(self.repo_name)
# Extract basic repo info
self.extracted_data['repo_info'] = {
'name': self.repo.name,
'full_name': self.repo.full_name,
'description': self.repo.description,
'url': self.repo.html_url,
'homepage': self.repo.homepage,
'stars': self.repo.stargazers_count,
'forks': self.repo.forks_count,
'open_issues': self.repo.open_issues_count,
'default_branch': self.repo.default_branch,
'created_at': self.repo.created_at.isoformat() if self.repo.created_at else None,
'updated_at': self.repo.updated_at.isoformat() if self.repo.updated_at else None,
'language': self.repo.language,
'license': self.repo.license.name if self.repo.license else None,
'topics': self.repo.get_topics()
}
logger.info(f"Repository fetched: {self.repo.full_name} ({self.repo.stargazers_count} stars)")
except GithubException as e:
if e.status == 404:
raise ValueError(f"Repository not found: {self.repo_name}")
raise
def _extract_readme(self):
"""C1.2: Extract the first README found among common locations and formats."""
logger.info("Extracting README...")
# Try common README locations
readme_files = ['README.md', 'README.rst', 'README.txt', 'README',
'docs/README.md', '.github/README.md']
for readme_path in readme_files:
try:
content = self.repo.get_contents(readme_path)
if content:
self.extracted_data['readme'] = content.decoded_content.decode('utf-8')
logger.info(f"README found: {readme_path}")
return
except GithubException:
continue
logger.warning("No README found in repository")
def _extract_code_structure(self):
"""
C1.3-C1.6: Extract code structure, languages, signatures, and test examples.
Surface layer only - no full implementation code.
"""
logger.info("Extracting code structure...")
# C1.4: Get language breakdown
self._extract_languages()
# Get file tree
self._extract_file_tree()
# Extract signatures and test examples
if self.include_code:
self._extract_signatures_and_tests()
def _extract_languages(self):
"""C1.4: Detect programming languages in repository."""
logger.info("Detecting programming languages...")
try:
languages = self.repo.get_languages()
total_bytes = sum(languages.values())
self.extracted_data['languages'] = {
lang: {
'bytes': bytes_count,
'percentage': round((bytes_count / total_bytes) * 100, 2) if total_bytes > 0 else 0
}
for lang, bytes_count in languages.items()
}
logger.info(f"Languages detected: {', '.join(languages.keys())}")
except GithubException as e:
logger.warning(f"Could not fetch languages: {e}")
def should_exclude_dir(self, dir_name: str) -> bool:
"""Check if directory should be excluded from analysis."""
return dir_name in self.excluded_dirs or dir_name.startswith('.')
def _extract_file_tree(self):
"""Extract repository file tree structure (dual-mode: GitHub API or local filesystem)."""
logger.info("Building file tree...")
if self.local_repo_path:
# Local filesystem mode - unlimited files
self._extract_file_tree_local()
else:
# GitHub API mode - limited by API rate limits
self._extract_file_tree_github()
def _extract_file_tree_local(self):
"""Extract file tree from local filesystem (unlimited files)."""
if not os.path.exists(self.local_repo_path):
logger.error(f"Local repository path not found: {self.local_repo_path}")
return
file_tree = []
for root, dirs, files in os.walk(self.local_repo_path):
# Exclude directories in-place to prevent os.walk from descending into them
dirs[:] = [d for d in dirs if not self.should_exclude_dir(d)]
# Calculate relative path from repo root
rel_root = os.path.relpath(root, self.local_repo_path)
if rel_root == '.':
rel_root = ''
# Add directories
for dir_name in dirs:
dir_path = os.path.join(rel_root, dir_name) if rel_root else dir_name
file_tree.append({
'path': dir_path,
'type': 'dir',
'size': None
})
# Add files
for file_name in files:
file_path = os.path.join(rel_root, file_name) if rel_root else file_name
full_path = os.path.join(root, file_name)
try:
file_size = os.path.getsize(full_path)
except OSError:
file_size = None
file_tree.append({
'path': file_path,
'type': 'file',
'size': file_size
})
self.extracted_data['file_tree'] = file_tree
logger.info(f"File tree built (local mode): {len(file_tree)} items")
def _extract_file_tree_github(self):
"""Extract file tree from GitHub API (rate-limited)."""
try:
contents = self.repo.get_contents("")
file_tree = []
while contents:
file_content = contents.pop(0)
file_info = {
'path': file_content.path,
'type': file_content.type,
'size': file_content.size if file_content.type == 'file' else None
}
file_tree.append(file_info)
if file_content.type == "dir":
contents.extend(self.repo.get_contents(file_content.path))
self.extracted_data['file_tree'] = file_tree
logger.info(f"File tree built (GitHub API mode): {len(file_tree)} items")
except GithubException as e:
logger.warning(f"Could not build file tree: {e}")
def _extract_signatures_and_tests(self):
"""
C1.3, C1.5, C1.6: Extract signatures, docstrings, and test examples.
Extraction depth depends on code_analysis_depth setting:
- surface: File tree only (minimal)
- deep: Parse files for signatures, parameters, types
- full: Complete AST analysis (future enhancement)
"""
if self.code_analysis_depth == 'surface':
logger.info("Code extraction: Surface level (file tree only)")
return
if not self.code_analyzer:
logger.warning("Code analyzer not available - skipping deep analysis")
return
logger.info(f"Extracting code signatures ({self.code_analysis_depth} analysis)...")
# Get primary language for the repository
languages = self.extracted_data.get('languages', {})
if not languages:
logger.warning("No languages detected - skipping code analysis")
return
# Determine primary language
primary_language = max(languages.items(), key=lambda x: x[1]['bytes'])[0]
logger.info(f"Primary language: {primary_language}")
# Determine file extensions to analyze
extension_map = {
'Python': ['.py'],
'JavaScript': ['.js', '.jsx'],
'TypeScript': ['.ts', '.tsx'],
'C': ['.c', '.h'],
'C++': ['.cpp', '.hpp', '.cc', '.hh', '.cxx']
}
extensions = extension_map.get(primary_language, [])
if not extensions:
logger.warning(f"No file extensions mapped for {primary_language}")
return
# Analyze files matching patterns and extensions
analyzed_files = []
file_tree = self.extracted_data.get('file_tree', [])
for file_info in file_tree:
file_path = file_info['path']
# Check if file matches extension
if not any(file_path.endswith(ext) for ext in extensions):
continue
# Check if file matches patterns (if specified)
if self.file_patterns:
import fnmatch
if not any(fnmatch.fnmatch(file_path, pattern) for pattern in self.file_patterns):
continue
# Analyze this file
try:
# Read file content based on mode
if self.local_repo_path:
# Local mode - read from filesystem
full_path = os.path.join(self.local_repo_path, file_path)
with open(full_path, 'r', encoding='utf-8') as f:
content = f.read()
else:
# GitHub API mode - fetch from API
file_content = self.repo.get_contents(file_path)
content = file_content.decoded_content.decode('utf-8')
analysis_result = self.code_analyzer.analyze_file(
file_path,
content,
primary_language
)
if analysis_result and (analysis_result.get('classes') or analysis_result.get('functions')):
analyzed_files.append({
'file': file_path,
'language': primary_language,
**analysis_result
})
logger.debug(f"Analyzed {file_path}: "
f"{len(analysis_result.get('classes', []))} classes, "
f"{len(analysis_result.get('functions', []))} functions")
except Exception as e:
logger.debug(f"Could not analyze {file_path}: {e}")
continue
# Limit number of files analyzed to avoid rate limits (GitHub API mode only)
if not self.local_repo_path and len(analyzed_files) >= 50:
logger.info("Reached analysis limit (50 files, GitHub API mode)")
break
self.extracted_data['code_analysis'] = {
'depth': self.code_analysis_depth,
'language': primary_language,
'files_analyzed': len(analyzed_files),
'files': analyzed_files
}
# Calculate totals
total_classes = sum(len(f.get('classes', [])) for f in analyzed_files)
total_functions = sum(len(f.get('functions', [])) for f in analyzed_files)
logger.info(f"Code analysis complete: {len(analyzed_files)} files, "
f"{total_classes} classes, {total_functions} functions")
def _extract_issues(self):
"""C1.7: Extract GitHub Issues (open/closed, labels, milestones)."""
logger.info(f"Extracting GitHub Issues (max {self.max_issues})...")
try:
# Fetch recent issues (open + closed)
issues = self.repo.get_issues(state='all', sort='updated', direction='desc')
issue_list = []
for issue in issues[:self.max_issues]:
# Skip pull requests (they appear in issues)
if issue.pull_request:
continue
issue_data = {
'number': issue.number,
'title': issue.title,
'state': issue.state,
'labels': [label.name for label in issue.labels],
'milestone': issue.milestone.title if issue.milestone else None,
'created_at': issue.created_at.isoformat() if issue.created_at else None,
'updated_at': issue.updated_at.isoformat() if issue.updated_at else None,
'closed_at': issue.closed_at.isoformat() if issue.closed_at else None,
'url': issue.html_url,
'body': issue.body[:500] if issue.body else None # First 500 chars
}
issue_list.append(issue_data)
self.extracted_data['issues'] = issue_list
logger.info(f"Extracted {len(issue_list)} issues")
except GithubException as e:
logger.warning(f"Could not fetch issues: {e}")
def _extract_changelog(self):
"""C1.8: Extract the changelog file from common locations (release notes are handled in C1.9)."""
logger.info("Extracting CHANGELOG...")
# Try common changelog locations
changelog_files = ['CHANGELOG.md', 'CHANGES.md', 'HISTORY.md',
'CHANGELOG.rst', 'CHANGELOG.txt', 'CHANGELOG',
'docs/CHANGELOG.md', '.github/CHANGELOG.md']
for changelog_path in changelog_files:
try:
content = self.repo.get_contents(changelog_path)
if content:
self.extracted_data['changelog'] = content.decoded_content.decode('utf-8')
logger.info(f"CHANGELOG found: {changelog_path}")
return
except GithubException:
continue
logger.warning("No CHANGELOG found in repository")
def _extract_releases(self):
"""C1.9: Extract GitHub Releases with version history."""
logger.info("Extracting GitHub Releases...")
try:
releases = self.repo.get_releases()
release_list = []
for release in releases:
release_data = {
'tag_name': release.tag_name,
'name': release.title,
'body': release.body,
'draft': release.draft,
'prerelease': release.prerelease,
'created_at': release.created_at.isoformat() if release.created_at else None,
'published_at': release.published_at.isoformat() if release.published_at else None,
'url': release.html_url,
'tarball_url': release.tarball_url,
'zipball_url': release.zipball_url
}
release_list.append(release_data)
self.extracted_data['releases'] = release_list
logger.info(f"Extracted {len(release_list)} releases")
except GithubException as e:
logger.warning(f"Could not fetch releases: {e}")
def _save_data(self):
"""Save extracted data to JSON file."""
os.makedirs('output', exist_ok=True)
with open(self.data_file, 'w', encoding='utf-8') as f:
json.dump(self.extracted_data, f, indent=2, ensure_ascii=False)
logger.info(f"Data saved to: {self.data_file}")
class GitHubToSkillConverter:
"""
Convert extracted GitHub data to Claude skill format (C1.10).
"""
def __init__(self, config: Dict[str, Any]):
"""Initialize converter with configuration."""
self.config = config
self.name = config.get('name', config['repo'].split('/')[-1])
self.description = config.get('description', f'Skill for {config["repo"]}')
# Paths
self.data_file = f"output/{self.name}_github_data.json"
self.skill_dir = f"output/{self.name}"
# Load extracted data
self.data = self._load_data()
def _load_data(self) -> Dict[str, Any]:
"""Load extracted GitHub data from JSON."""
if not os.path.exists(self.data_file):
raise FileNotFoundError(f"Data file not found: {self.data_file}")
with open(self.data_file, 'r', encoding='utf-8') as f:
return json.load(f)
def build_skill(self):
"""Build complete skill structure."""
logger.info(f"Building skill for: {self.name}")
# Create directories
os.makedirs(self.skill_dir, exist_ok=True)
os.makedirs(f"{self.skill_dir}/references", exist_ok=True)
os.makedirs(f"{self.skill_dir}/scripts", exist_ok=True)
os.makedirs(f"{self.skill_dir}/assets", exist_ok=True)
# Generate SKILL.md
self._generate_skill_md()
# Generate reference files
self._generate_references()
logger.info(f"✅ Skill built successfully: {self.skill_dir}/")
def _generate_skill_md(self):
"""Generate main SKILL.md file."""
repo_info = self.data.get('repo_info', {})
# Generate skill name (lowercase, hyphens only, max 64 chars)
skill_name = self.name.lower().replace('_', '-').replace(' ', '-')[:64]
# Truncate description to 1024 chars (slicing is a no-op for shorter strings)
desc = self.description[:1024]
skill_content = f"""---
name: {skill_name}
description: {desc}
---
# {repo_info.get('name', self.name)}
{self.description}
## Description
{repo_info.get('description') or 'GitHub repository skill'}
**Repository:** [{repo_info.get('full_name', 'N/A')}]({repo_info.get('url', '#')})
**Language:** {repo_info.get('language') or 'N/A'}
**Stars:** {repo_info.get('stars', 0):,}
**License:** {repo_info.get('license') or 'N/A'}
## When to Use This Skill
Use this skill when you need to:
- Understand how to use {self.name}
- Look up API documentation
- Find usage examples
- Check for known issues or recent changes
- Review release history
## Quick Reference
### Repository Info
- **Homepage:** {repo_info.get('homepage') or 'N/A'}
- **Topics:** {', '.join(repo_info.get('topics', []))}
- **Open Issues:** {repo_info.get('open_issues', 0)}
- **Last Updated:** {(repo_info.get('updated_at') or 'N/A')[:10]}
### Languages
{self._format_languages()}
### Recent Releases
{self._format_recent_releases()}
## Available References
- `references/README.md` - Complete README documentation
- `references/CHANGELOG.md` - Version history and changes
- `references/issues.md` - Recent GitHub issues
- `references/releases.md` - Release notes
- `references/file_structure.md` - Repository structure
## Usage
See README.md for complete usage instructions and examples.
---
**Generated by Skill Seeker** | GitHub Repository Scraper
"""
skill_path = f"{self.skill_dir}/SKILL.md"
with open(skill_path, 'w', encoding='utf-8') as f:
f.write(skill_content)
logger.info(f"Generated: {skill_path}")
def _format_languages(self) -> str:
"""Format language breakdown."""
languages = self.data.get('languages', {})
if not languages:
return "No language data available"
lines = []
for lang, info in sorted(languages.items(), key=lambda x: x[1]['bytes'], reverse=True):
lines.append(f"- **{lang}:** {info['percentage']:.1f}%")
return '\n'.join(lines)
def _format_recent_releases(self) -> str:
"""Format recent releases (top 3)."""
releases = self.data.get('releases', [])
if not releases:
return "No releases available"
lines = []
for release in releases[:3]:
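# published_at is None for draft releases, so guard before slicing
lines.append(f"- **{release['tag_name']}** ({(release['published_at'] or 'N/A')[:10]}): {release['name'] or release['tag_name']}")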
return '\n'.join(lines)
def _generate_references(self):
"""Generate all reference files."""
# README
if self.data.get('readme'):
readme_path = f"{self.skill_dir}/references/README.md"
with open(readme_path, 'w', encoding='utf-8') as f:
f.write(self.data['readme'])
logger.info(f"Generated: {readme_path}")
# CHANGELOG
if self.data.get('changelog'):
changelog_path = f"{self.skill_dir}/references/CHANGELOG.md"
with open(changelog_path, 'w', encoding='utf-8') as f:
f.write(self.data['changelog'])
logger.info(f"Generated: {changelog_path}")
# Issues
if self.data.get('issues'):
self._generate_issues_reference()
# Releases
if self.data.get('releases'):
self._generate_releases_reference()
# File structure
if self.data.get('file_tree'):
self._generate_file_structure_reference()
def _generate_issues_reference(self):
"""Generate issues.md reference file."""
issues = self.data['issues']
content = f"# GitHub Issues\n\nRecent issues from the repository ({len(issues)} total).\n\n"
# Group by state
open_issues = [i for i in issues if i['state'] == 'open']
closed_issues = [i for i in issues if i['state'] == 'closed']
content += f"## Open Issues ({len(open_issues)})\n\n"
for issue in open_issues[:20]:
labels = ', '.join(issue['labels']) if issue['labels'] else 'No labels'
content += f"### #{issue['number']}: {issue['title']}\n"
content += f"**Labels:** {labels} | **Created:** {issue['created_at'][:10]}\n"
content += f"[View on GitHub]({issue['url']})\n\n"
content += f"\n## Recently Closed Issues ({len(closed_issues)})\n\n"
for issue in closed_issues[:10]:
labels = ', '.join(issue['labels']) if issue['labels'] else 'No labels'
content += f"### #{issue['number']}: {issue['title']}\n"
content += f"**Labels:** {labels} | **Closed:** {issue['closed_at'][:10]}\n"
content += f"[View on GitHub]({issue['url']})\n\n"
issues_path = f"{self.skill_dir}/references/issues.md"
with open(issues_path, 'w', encoding='utf-8') as f:
f.write(content)
logger.info(f"Generated: {issues_path}")
def _generate_releases_reference(self):
"""Generate releases.md reference file."""
releases = self.data['releases']
content = f"# Releases\n\nVersion history for this repository ({len(releases)} releases).\n\n"
for release in releases:
content += f"## {release['tag_name']}: {release['name'] or release['tag_name']}\n"
# published_at is None for draft releases, so guard before slicing
content += f"**Published:** {(release['published_at'] or 'N/A')[:10]}\n"
if release['prerelease']:
content += "**Pre-release**\n"
content += f"\n{release['body'] or ''}\n\n"
content += f"[View on GitHub]({release['url']})\n\n---\n\n"
releases_path = f"{self.skill_dir}/references/releases.md"
with open(releases_path, 'w', encoding='utf-8') as f:
f.write(content)
logger.info(f"Generated: {releases_path}")
def _generate_file_structure_reference(self):
"""Generate file_structure.md reference file."""
file_tree = self.data['file_tree']
content = f"# Repository File Structure\n\n"
content += f"Total items: {len(file_tree)}\n\n"
content += "```\n"
# Build tree structure
for item in file_tree:
indent = " " * item['path'].count('/')
icon = "📁" if item['type'] == 'dir' else "📄"
content += f"{indent}{icon} {os.path.basename(item['path'])}\n"
content += "```\n"
structure_path = f"{self.skill_dir}/references/file_structure.md"
with open(structure_path, 'w', encoding='utf-8') as f:
f.write(content)
logger.info(f"Generated: {structure_path}")
def main():
"""C1.10: CLI tool entry point."""
parser = argparse.ArgumentParser(
description='GitHub Repository to Claude Skill Converter',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
skill-seekers github --repo facebook/react
skill-seekers github --config configs/react_github.json
skill-seekers github --repo owner/repo --token $GITHUB_TOKEN
"""
)
parser.add_argument('--repo', help='GitHub repository (owner/repo)')
parser.add_argument('--config', help='Path to config JSON file')
parser.add_argument('--token', help='GitHub personal access token')
parser.add_argument('--name', help='Skill name (default: repo name)')
parser.add_argument('--description', help='Skill description')
parser.add_argument('--no-issues', action='store_true', help='Skip GitHub issues')
parser.add_argument('--no-changelog', action='store_true', help='Skip CHANGELOG')
parser.add_argument('--no-releases', action='store_true', help='Skip releases')
parser.add_argument('--max-issues', type=int, default=100, help='Max issues to fetch')
parser.add_argument('--scrape-only', action='store_true', help="Only scrape, don't build skill")
args = parser.parse_args()
# Build config from args or file
if args.config:
with open(args.config, 'r', encoding='utf-8') as f:
config = json.load(f)
elif args.repo:
config = {
'repo': args.repo,
'name': args.name or args.repo.split('/')[-1],
'description': args.description or f'GitHub repository skill for {args.repo}',
'github_token': args.token,
'include_issues': not args.no_issues,
'include_changelog': not args.no_changelog,
'include_releases': not args.no_releases,
'max_issues': args.max_issues
}
else:
parser.error('Either --repo or --config is required')
try:
# Phase 1: Scrape GitHub repository
scraper = GitHubScraper(config)
scraper.scrape()
if args.scrape_only:
logger.info("Scrape complete (--scrape-only mode)")
return
# Phase 2: Build skill
converter = GitHubToSkillConverter(config)
converter.build_skill()
logger.info(f"\n✅ Success! Skill created at: output/{config.get('name', config['repo'].split('/')[-1])}/")
logger.info(f"Next step: skill-seekers-package output/{config.get('name', config['repo'].split('/')[-1])}/")
except Exception as e:
logger.error(f"Error: {e}")
sys.exit(1)
if __name__ == '__main__':
main()
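
The language breakdown and primary-language selection used by the scraper above can be tried in isolation. The following is a minimal sketch, assuming a plain dict of byte counts in the shape that `repo.get_languages()` returns; the helper names `language_breakdown` and `primary_language` are hypothetical and do not exist in the module.

```python
from typing import Dict, Optional, Union

Number = Union[int, float]


def language_breakdown(byte_counts: Dict[str, int]) -> Dict[str, Dict[str, Number]]:
    """Turn raw byte counts into the {'bytes', 'percentage'} shape stored by the scraper."""
    total = sum(byte_counts.values())
    return {
        lang: {
            "bytes": count,
            # Guard against an empty or all-zero repository
            "percentage": round(count / total * 100, 2) if total > 0 else 0,
        }
        for lang, count in byte_counts.items()
    }


def primary_language(breakdown: Dict[str, Dict[str, Number]]) -> Optional[str]:
    """Return the language with the most bytes, or None when nothing was detected."""
    if not breakdown:
        return None
    return max(breakdown.items(), key=lambda item: item[1]["bytes"])[0]


if __name__ == "__main__":
    stats = language_breakdown({"Python": 750, "C": 250})
    print(stats)
    print(primary_language(stats))
```

Keeping the computation in pure functions like these would make it testable without a GitHub token or network access.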


@ -0,0 +1,66 @@
# ABOUTME: Detects and validates llms.txt file availability at documentation URLs
# ABOUTME: Supports llms-full.txt, llms.txt, and llms-small.txt variants
import requests
from typing import Optional, Dict, List
from urllib.parse import urlparse
class LlmsTxtDetector:
"""Detect llms.txt files at documentation URLs"""
VARIANTS = [
('llms-full.txt', 'full'),
('llms.txt', 'standard'),
('llms-small.txt', 'small')
]
def __init__(self, base_url: str):
self.base_url = base_url.rstrip('/')
def detect(self) -> Optional[Dict[str, str]]:
"""
Detect available llms.txt variant.
Returns:
Dict with 'url' and 'variant' keys, or None if not found
"""
parsed = urlparse(self.base_url)
root_url = f"{parsed.scheme}://{parsed.netloc}"
for filename, variant in self.VARIANTS:
url = f"{root_url}/{filename}"
if self._check_url_exists(url):
return {'url': url, 'variant': variant}
return None
def detect_all(self) -> List[Dict[str, str]]:
"""
Detect all available llms.txt variants.
Returns:
List of dicts with 'url' and 'variant' keys for each found variant
"""
found_variants = []
# Derive the site root once, before iterating over the variants
parsed = urlparse(self.base_url)
root_url = f"{parsed.scheme}://{parsed.netloc}"
for filename, variant in self.VARIANTS:
url = f"{root_url}/{filename}"
if self._check_url_exists(url):
found_variants.append({
'url': url,
'variant': variant
})
return found_variants
def _check_url_exists(self, url: str) -> bool:
"""Check if URL returns 200 status"""
try:
response = requests.head(url, timeout=5, allow_redirects=True)
return response.status_code == 200
except requests.RequestException:
return False
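
The detector above always probes the site root rather than the documentation path, in the fixed order full, standard, small. A minimal sketch of that URL derivation, with no network access, might look like the following; `candidate_urls` is a hypothetical helper introduced only for illustration.

```python
from typing import List, Tuple
from urllib.parse import urlparse

# Same variant order as LlmsTxtDetector.VARIANTS: most complete file first
VARIANTS = [
    ("llms-full.txt", "full"),
    ("llms.txt", "standard"),
    ("llms-small.txt", "small"),
]


def candidate_urls(base_url: str) -> List[Tuple[str, str]]:
    """List the (url, variant) pairs the detector would probe, in priority order."""
    parsed = urlparse(base_url.rstrip("/"))
    # Only scheme and host are kept; any path such as /docs/ is discarded
    root_url = f"{parsed.scheme}://{parsed.netloc}"
    return [(f"{root_url}/{filename}", variant) for filename, variant in VARIANTS]


if __name__ == "__main__":
    for url, variant in candidate_urls("https://hono.dev/docs/getting-started/"):
        print(variant, url)
```

One consequence of this design is that a site serving its llms.txt under a sub-path (for example below `/docs/`) would not be found by the root-only lookup.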


@ -0,0 +1,94 @@
# ABOUTME: Downloads llms.txt files from documentation URLs with retry logic
# ABOUTME: Validates markdown content and handles timeouts with exponential backoff
import requests
import time
from typing import Optional
class LlmsTxtDownloader:
"""Download llms.txt content from URLs with retry logic"""
def __init__(self, url: str, timeout: int = 30, max_retries: int = 3):
self.url = url
self.timeout = timeout
self.max_retries = max_retries
def get_proper_filename(self) -> str:
"""
Extract filename from URL and convert .txt to .md
Returns:
Proper filename with .md extension
Examples:
https://hono.dev/llms-full.txt -> llms-full.md
https://hono.dev/llms.txt -> llms.md
https://hono.dev/llms-small.txt -> llms-small.md
"""
# Extract filename from URL
from urllib.parse import urlparse
parsed = urlparse(self.url)
filename = parsed.path.split('/')[-1]
# Replace .txt with .md
if filename.endswith('.txt'):
filename = filename[:-4] + '.md'
return filename
def _is_markdown(self, content: str) -> bool:
"""
Check if content looks like markdown.
Returns:
True if content contains markdown patterns
"""
markdown_patterns = ['# ', '## ', '```', '- ', '* ', '`']
return any(pattern in content for pattern in markdown_patterns)
def download(self) -> Optional[str]:
"""
Download llms.txt content with retry logic.
Returns:
String content or None if download fails
"""
headers = {
'User-Agent': 'Skill-Seekers-llms.txt-Reader/1.0'
}
for attempt in range(self.max_retries):
try:
response = requests.get(
self.url,
headers=headers,
timeout=self.timeout
)
response.raise_for_status()
content = response.text
# Reject empty or suspiciously short responses (under 100 characters)
if len(content) < 100:
print(f"⚠️ Content too short ({len(content)} chars), rejecting")
return None
# Validate content looks like markdown
if not self._is_markdown(content):
print("⚠️ Content doesn't look like markdown")
return None
return None
return content
except requests.RequestException as e:
if attempt < self.max_retries - 1:
# Calculate exponential backoff delay: 1s, 2s, 4s, etc.
delay = 2 ** attempt
print(f"⚠️ Attempt {attempt + 1}/{self.max_retries} failed: {e}")
print(f" Retrying in {delay}s...")
time.sleep(delay)
else:
print(f"❌ Failed to download {self.url} after {self.max_retries} attempts: {e}")
return None
return None
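
The retry schedule and the filename conversion in the downloader above are easy to check without making any request. The sketch below assumes the same rules as the class (a delay of `2 ** attempt` seconds after every failed attempt except the last, and `.txt` rewritten to `.md`); the function names `backoff_delays` and `markdown_filename` are hypothetical.

```python
from typing import List
from urllib.parse import urlparse


def backoff_delays(max_retries: int) -> List[int]:
    """Seconds slept between attempts; no sleep follows the final attempt."""
    return [2 ** attempt for attempt in range(max_retries - 1)]


def markdown_filename(url: str) -> str:
    """Take the last path segment of the URL and swap a .txt suffix for .md."""
    filename = urlparse(url).path.split("/")[-1]
    if filename.endswith(".txt"):
        filename = filename[:-4] + ".md"
    return filename


if __name__ == "__main__":
    print(backoff_delays(3))
    print(markdown_filename("https://hono.dev/llms-full.txt"))
```

With the default `max_retries` of 3, the total time spent sleeping is therefore 3 seconds, on top of up to three request timeouts.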


@ -0,0 +1,74 @@
# ABOUTME: Parses llms.txt markdown content into structured page data
# ABOUTME: Extracts titles, content, code samples, and headings from markdown
import re
from typing import List, Dict
class LlmsTxtParser:
"""Parse llms.txt markdown content into page structures"""
def __init__(self, content: str):
self.content = content
def parse(self) -> List[Dict]:
"""
Parse markdown content into page structures.
Returns:
List of page dicts with title, content, code_samples, headings
"""
pages = []
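# Split by h1 headers (# Title).
# Caveat: this also splits on any line starting with "# " inside a fenced
# code block (for example a shell or Python comment).
sections = re.split(r'\n# ', self.content)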
for section in sections:
if not section.strip():
continue
# First line is title
lines = section.split('\n')
title = lines[0].strip('#').strip()
# Parse content
page = self._parse_section('\n'.join(lines[1:]), title)
pages.append(page)
return pages
def _parse_section(self, content: str, title: str) -> Dict:
"""Parse a single section into page structure"""
page = {
'title': title,
'content': '',
'code_samples': [],
'headings': [],
'url': f'llms-txt#{title.lower().replace(" ", "-")}',
'links': []
}
# Extract code blocks
code_blocks = re.findall(r'```(\w+)?\n(.*?)```', content, re.DOTALL)
for lang, code in code_blocks:
page['code_samples'].append({
'code': code.strip(),
'language': lang or 'unknown'
})
# Extract h2/h3 headings
headings = re.findall(r'^(#{2,3})\s+(.+)$', content, re.MULTILINE)
for level_markers, text in headings:
page['headings'].append({
'level': f'h{len(level_markers)}',
'text': text.strip(),
'id': text.lower().replace(' ', '-')
})
# Remove code blocks from content for plain text
content_no_code = re.sub(r'```.*?```', '', content, flags=re.DOTALL)
# Extract paragraphs
paragraphs = [p.strip() for p in content_no_code.split('\n\n') if len(p.strip()) > 20]
page['content'] = '\n\n'.join(paragraphs)
return page
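
To see what the two extraction regexes in the parser above produce, here is a minimal standalone sketch that applies the same patterns to a small sample. The helper names and the sample text are illustrative assumptions; the fence marker is built with string multiplication only to avoid writing literal triple backticks inside this example.

```python
import re
from typing import Dict, List

FENCE = "`" * 3  # three backticks, built indirectly to keep this example readable


def extract_code_samples(markdown: str) -> List[Dict[str, str]]:
    """Collect fenced code blocks; blocks without a language tag become 'unknown'."""
    pattern = FENCE + r"(\w+)?\n(.*?)" + FENCE
    return [
        {"code": code.strip(), "language": lang or "unknown"}
        for lang, code in re.findall(pattern, markdown, re.DOTALL)
    ]


def extract_headings(markdown: str) -> List[Dict[str, str]]:
    """Collect h2/h3 headings with a simple slug, as the parser does."""
    return [
        {
            "level": f"h{len(marks)}",
            "text": text.strip(),
            "id": text.strip().lower().replace(" ", "-"),
        }
        for marks, text in re.findall(r"^(#{2,3})\s+(.+)$", markdown, re.MULTILINE)
    ]


SAMPLE = "\n".join([
    "## Install",
    "Run the installer first.",
    FENCE + "bash",
    "pip install example",
    FENCE,
    "### Next steps",
    FENCE,
    "no language tag",
    FENCE,
])

if __name__ == "__main__":
    print(extract_code_samples(SAMPLE))
    print(extract_headings(SAMPLE))
```

Note that the heading pattern is applied to the raw section text, so a line starting with `## ` inside a code block would also be reported as a heading.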


@ -0,0 +1,285 @@
#!/usr/bin/env python3
"""
Skill Seekers - Unified CLI Entry Point
Provides a git-style unified command-line interface for all Skill Seekers tools.
Usage:
skill-seekers <command> [options]
Commands:
scrape     Scrape documentation website
github     Scrape GitHub repository
pdf        Extract from PDF file
unified    Multi-source scraping (docs + GitHub + PDF)
enhance    AI-powered enhancement (local, no API key)
package    Package skill into .zip file
upload     Upload skill to Claude
estimate   Estimate page count before scraping
Examples:
skill-seekers scrape --config configs/react.json
skill-seekers github --repo microsoft/TypeScript
skill-seekers unified --config configs/react_unified.json
skill-seekers package output/react/
"""
import sys
import argparse
from typing import List, Optional
def create_parser() -> argparse.ArgumentParser:
"""Create the main argument parser with subcommands."""
parser = argparse.ArgumentParser(
prog="skill-seekers",
description="Convert documentation, GitHub repos, and PDFs into Claude AI skills",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Scrape documentation
skill-seekers scrape --config configs/react.json
# Scrape GitHub repository
skill-seekers github --repo microsoft/TypeScript --name typescript
# Multi-source scraping (unified)
skill-seekers unified --config configs/react_unified.json
# AI-powered enhancement
skill-seekers enhance output/react/
# Package and upload
skill-seekers package output/react/
skill-seekers upload output/react.zip
For more information: https://github.com/yusufkaraaslan/Skill_Seekers
"""
)
parser.add_argument(
"--version",
action="version",
version="%(prog)s 2.1.1"
)
subparsers = parser.add_subparsers(
dest="command",
title="commands",
description="Available Skill Seekers commands",
help="Command to run"
)
# === scrape subcommand ===
scrape_parser = subparsers.add_parser(
"scrape",
help="Scrape documentation website",
description="Scrape documentation website and generate skill"
)
scrape_parser.add_argument("--config", help="Config JSON file")
scrape_parser.add_argument("--name", help="Skill name")
scrape_parser.add_argument("--url", help="Documentation URL")
scrape_parser.add_argument("--description", help="Skill description")
scrape_parser.add_argument("--skip-scrape", action="store_true", help="Skip scraping, use cached data")
scrape_parser.add_argument("--enhance", action="store_true", help="AI enhancement (API)")
scrape_parser.add_argument("--enhance-local", action="store_true", help="AI enhancement (local)")
scrape_parser.add_argument("--dry-run", action="store_true", help="Dry run mode")
scrape_parser.add_argument("--async", dest="async_mode", action="store_true", help="Use async scraping")
scrape_parser.add_argument("--workers", type=int, help="Number of async workers")
# === github subcommand ===
github_parser = subparsers.add_parser(
"github",
help="Scrape GitHub repository",
description="Scrape GitHub repository and generate skill"
)
github_parser.add_argument("--config", help="Config JSON file")
github_parser.add_argument("--repo", help="GitHub repo (owner/repo)")
github_parser.add_argument("--name", help="Skill name")
github_parser.add_argument("--description", help="Skill description")
# === pdf subcommand ===
pdf_parser = subparsers.add_parser(
"pdf",
help="Extract from PDF file",
description="Extract content from PDF and generate skill"
)
pdf_parser.add_argument("--config", help="Config JSON file")
pdf_parser.add_argument("--pdf", help="PDF file path")
pdf_parser.add_argument("--name", help="Skill name")
pdf_parser.add_argument("--description", help="Skill description")
pdf_parser.add_argument("--from-json", help="Build from extracted JSON")
# === unified subcommand ===
unified_parser = subparsers.add_parser(
"unified",
help="Multi-source scraping (docs + GitHub + PDF)",
description="Combine multiple sources into one skill"
)
unified_parser.add_argument("--config", required=True, help="Unified config JSON file")
unified_parser.add_argument("--merge-mode", help="Merge mode (rule-based, claude-enhanced)")
unified_parser.add_argument("--dry-run", action="store_true", help="Dry run mode")
# === enhance subcommand ===
enhance_parser = subparsers.add_parser(
"enhance",
help="AI-powered enhancement (local, no API key)",
description="Enhance SKILL.md using Claude Code (local)"
)
enhance_parser.add_argument("skill_directory", help="Skill directory path")
# === package subcommand ===
package_parser = subparsers.add_parser(
"package",
help="Package skill into .zip file",
description="Package skill directory into uploadable .zip"
)
package_parser.add_argument("skill_directory", help="Skill directory path")
package_parser.add_argument("--no-open", action="store_true", help="Don't open output folder")
package_parser.add_argument("--upload", action="store_true", help="Auto-upload after packaging")
# === upload subcommand ===
upload_parser = subparsers.add_parser(
"upload",
help="Upload skill to Claude",
description="Upload .zip file to Claude via Anthropic API"
)
upload_parser.add_argument("zip_file", help=".zip file to upload")
upload_parser.add_argument("--api-key", help="Anthropic API key")
# === estimate subcommand ===
estimate_parser = subparsers.add_parser(
"estimate",
help="Estimate page count before scraping",
description="Estimate total pages for documentation scraping"
)
estimate_parser.add_argument("config", help="Config JSON file")
estimate_parser.add_argument("--max-discovery", type=int, help="Max pages to discover")
return parser
def main(argv: Optional[List[str]] = None) -> int:
"""Main entry point for the unified CLI.
Args:
argv: Command-line arguments, excluding the program name (defaults to sys.argv[1:])
Returns:
Exit code (0 for success, non-zero for error)
"""
parser = create_parser()
args = parser.parse_args(argv)
if not args.command:
parser.print_help()
return 1
# Delegate to the appropriate tool
try:
if args.command == "scrape":
from skill_seekers.cli.doc_scraper import main as scrape_main
# Convert args namespace to sys.argv format for doc_scraper
sys.argv = ["doc_scraper.py"]
if args.config:
sys.argv.extend(["--config", args.config])
if args.name:
sys.argv.extend(["--name", args.name])
if args.url:
sys.argv.extend(["--url", args.url])
if args.description:
sys.argv.extend(["--description", args.description])
if args.skip_scrape:
sys.argv.append("--skip-scrape")
if args.enhance:
sys.argv.append("--enhance")
if args.enhance_local:
sys.argv.append("--enhance-local")
if args.dry_run:
sys.argv.append("--dry-run")
if args.async_mode:
sys.argv.append("--async")
if args.workers:
sys.argv.extend(["--workers", str(args.workers)])
return scrape_main() or 0
elif args.command == "github":
from skill_seekers.cli.github_scraper import main as github_main
sys.argv = ["github_scraper.py"]
if args.config:
sys.argv.extend(["--config", args.config])
if args.repo:
sys.argv.extend(["--repo", args.repo])
if args.name:
sys.argv.extend(["--name", args.name])
if args.description:
sys.argv.extend(["--description", args.description])
return github_main() or 0
elif args.command == "pdf":
from skill_seekers.cli.pdf_scraper import main as pdf_main
sys.argv = ["pdf_scraper.py"]
if args.config:
sys.argv.extend(["--config", args.config])
if args.pdf:
sys.argv.extend(["--pdf", args.pdf])
if args.name:
sys.argv.extend(["--name", args.name])
if args.description:
sys.argv.extend(["--description", args.description])
if args.from_json:
sys.argv.extend(["--from-json", args.from_json])
return pdf_main() or 0
elif args.command == "unified":
from skill_seekers.cli.unified_scraper import main as unified_main
sys.argv = ["unified_scraper.py", "--config", args.config]
if args.merge_mode:
sys.argv.extend(["--merge-mode", args.merge_mode])
if args.dry_run:
sys.argv.append("--dry-run")
return unified_main() or 0
elif args.command == "enhance":
from skill_seekers.cli.enhance_skill_local import main as enhance_main
sys.argv = ["enhance_skill_local.py", args.skill_directory]
return enhance_main() or 0
elif args.command == "package":
from skill_seekers.cli.package_skill import main as package_main
sys.argv = ["package_skill.py", args.skill_directory]
if args.no_open:
sys.argv.append("--no-open")
if args.upload:
sys.argv.append("--upload")
return package_main() or 0
elif args.command == "upload":
from skill_seekers.cli.upload_skill import main as upload_main
sys.argv = ["upload_skill.py", args.zip_file]
if args.api_key:
sys.argv.extend(["--api-key", args.api_key])
return upload_main() or 0
elif args.command == "estimate":
from skill_seekers.cli.estimate_pages import main as estimate_main
sys.argv = ["estimate_pages.py", args.config]
if args.max_discovery:
sys.argv.extend(["--max-discovery", str(args.max_discovery)])
return estimate_main() or 0
else:
print(f"Error: Unknown command '{args.command}'", file=sys.stderr)
parser.print_help()
return 1
except KeyboardInterrupt:
print("\n\nInterrupted by user", file=sys.stderr)
return 130
except Exception as e:
print(f"Error: {e}", file=sys.stderr)
return 1
if __name__ == "__main__":
sys.exit(main())
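
The dispatch branches above all rebuild an argument vector from the parsed namespace by hand, one `if` per option. A minimal sketch of a table-driven alternative is shown below. `build_argv`, its parameters, and the sample config path are hypothetical names introduced here for illustration; they are not part of skill_seekers.

```python
from argparse import Namespace
from typing import List, Sequence, Tuple


def build_argv(prog: str,
               args: Namespace,
               value_flags: Sequence[Tuple[str, str]] = (),
               bool_flags: Sequence[Tuple[str, str]] = ()) -> List[str]:
    """Rebuild an argv-style list from a parsed namespace (hypothetical helper)."""
    argv = [prog]
    # Options that carry a value: emit "--flag value" when the attribute is set
    for attr, flag in value_flags:
        value = getattr(args, attr, None)
        if value is not None:
            argv.extend([flag, str(value)])
    # Boolean switches: emit "--flag" only when the attribute is truthy
    for attr, flag in bool_flags:
        if getattr(args, attr, False):
            argv.append(flag)
    return argv


# Made-up namespace resembling what the "scrape" subcommand would produce
ns = Namespace(config="configs/react.json", name=None, workers=4,
               dry_run=True, async_mode=False)
scrape_argv = build_argv(
    "doc_scraper.py",
    ns,
    value_flags=[("config", "--config"), ("name", "--name"), ("workers", "--workers")],
    bool_flags=[("dry_run", "--dry-run"), ("async_mode", "--async")],
)
print(scrape_argv)
```

A list built this way could be passed to a tool's `main(argv)` if that tool accepts one, which would avoid assigning to the global `sys.argv`. Whether the delegated `main` functions accept such an argument is not shown in this diff.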

@ -0,0 +1,513 @@
#!/usr/bin/env python3
"""
Source Merger for Multi-Source Skills
Merges documentation and code data intelligently:
- Rule-based merge: Fast, deterministic rules
- Claude-enhanced merge: AI-powered reconciliation
Handles conflicts and creates unified API reference.
"""
import json
import logging
import subprocess
import tempfile
import os
from pathlib import Path
from typing import Dict, List, Any, Optional
from .conflict_detector import Conflict, ConflictDetector
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class RuleBasedMerger:
"""
Rule-based API merger using deterministic rules.
Rules:
1. If API only in docs: include with [DOCS_ONLY] tag
2. If API only in code: include with [UNDOCUMENTED] tag
3. If both match perfectly: include normally
4. If conflict: include both versions with [CONFLICT] tag, prefer code signature
"""
def __init__(self, docs_data: Dict, github_data: Dict, conflicts: List[Conflict]):
"""
Initialize rule-based merger.
Args:
docs_data: Documentation scraper data
github_data: GitHub scraper data
conflicts: List of detected conflicts
"""
self.docs_data = docs_data
self.github_data = github_data
self.conflicts = conflicts
# Build conflict index for fast lookup
self.conflict_index = {c.api_name: c for c in conflicts}
# Extract APIs from both sources
detector = ConflictDetector(docs_data, github_data)
self.docs_apis = detector.docs_apis
self.code_apis = detector.code_apis
def merge_all(self) -> Dict[str, Any]:
"""
Merge all APIs using rule-based logic.
Returns:
Dict containing merged API data
"""
logger.info("Starting rule-based merge...")
merged_apis = {}
# Get all unique API names
all_api_names = set(self.docs_apis.keys()) | set(self.code_apis.keys())
for api_name in sorted(all_api_names):
merged_api = self._merge_single_api(api_name)
merged_apis[api_name] = merged_api
logger.info(f"Merged {len(merged_apis)} APIs")
return {
'merge_mode': 'rule-based',
'apis': merged_apis,
'summary': {
'total_apis': len(merged_apis),
'docs_only': sum(1 for api in merged_apis.values() if api['status'] == 'docs_only'),
'code_only': sum(1 for api in merged_apis.values() if api['status'] == 'code_only'),
'matched': sum(1 for api in merged_apis.values() if api['status'] == 'matched'),
'conflict': sum(1 for api in merged_apis.values() if api['status'] == 'conflict')
}
}
def _merge_single_api(self, api_name: str) -> Dict[str, Any]:
"""
Merge a single API using rules.
Args:
api_name: Name of the API to merge
Returns:
Merged API dict
"""
in_docs = api_name in self.docs_apis
in_code = api_name in self.code_apis
has_conflict = api_name in self.conflict_index
# Rule 1: Only in docs
if in_docs and not in_code:
conflict = self.conflict_index.get(api_name)
return {
'name': api_name,
'status': 'docs_only',
'source': 'documentation',
'data': self.docs_apis[api_name],
'warning': 'This API is documented but not found in codebase',
'conflict': conflict.__dict__ if conflict else None
}
# Rule 2: Only in code
if in_code and not in_docs:
is_private = api_name.startswith('_')
conflict = self.conflict_index.get(api_name)
return {
'name': api_name,
'status': 'code_only',
'source': 'code',
'data': self.code_apis[api_name],
'warning': 'This API exists in code but is not documented' if not is_private else 'Internal/private API',
'conflict': conflict.__dict__ if conflict else None
}
# Both exist - check for conflicts
docs_info = self.docs_apis[api_name]
code_info = self.code_apis[api_name]
# Rule 3: Both match perfectly (no conflict)
if not has_conflict:
return {
'name': api_name,
'status': 'matched',
'source': 'both',
'docs_data': docs_info,
'code_data': code_info,
'merged_signature': self._create_merged_signature(code_info, docs_info),
'merged_description': docs_info.get('docstring') or code_info.get('docstring')
}
# Rule 4: Conflict exists - prefer code signature, keep docs description
conflict = self.conflict_index[api_name]
return {
'name': api_name,
'status': 'conflict',
'source': 'both',
'docs_data': docs_info,
'code_data': code_info,
'conflict': conflict.__dict__,
'resolution': 'prefer_code_signature',
'merged_signature': self._create_merged_signature(code_info, docs_info),
'merged_description': docs_info.get('docstring') or code_info.get('docstring'),
'warning': conflict.difference
}
    def _create_merged_signature(self, code_info: Dict, docs_info: Dict) -> str:
        """
        Create merged signature preferring code data.
        Args:
            code_info: API info from code
            docs_info: API info from docs
        Returns:
            Merged signature string
        """
        name = code_info.get('name', docs_info.get('name'))
        params = code_info.get('parameters', docs_info.get('parameters', []))
        return_type = code_info.get('return_type', docs_info.get('return_type'))
        # Build parameter string
        param_strs = []
        for param in params:
            param_str = param['name']
            if param.get('type_hint'):
                param_str += f": {param['type_hint']}"
            # Keep falsy defaults such as 0 or False; skip only missing or empty ones
            default = param.get('default')
            if default is not None and default != '':
                param_str += f" = {default}"
            param_strs.append(param_str)
        signature = f"{name}({', '.join(param_strs)})"
        if return_type:
            signature += f" -> {return_type}"
        return signature
class ClaudeEnhancedMerger:
"""
Claude-enhanced API merger using local Claude Code.
Opens Claude Code in a new terminal to intelligently reconcile conflicts.
Uses the same approach as enhance_skill_local.py.
"""
def __init__(self, docs_data: Dict, github_data: Dict, conflicts: List[Conflict]):
"""
Initialize Claude-enhanced merger.
Args:
docs_data: Documentation scraper data
github_data: GitHub scraper data
conflicts: List of detected conflicts
"""
self.docs_data = docs_data
self.github_data = github_data
self.conflicts = conflicts
# First do rule-based merge as baseline
self.rule_merger = RuleBasedMerger(docs_data, github_data, conflicts)
def merge_all(self) -> Dict[str, Any]:
"""
Merge all APIs using Claude enhancement.
Returns:
Dict containing merged API data
"""
logger.info("Starting Claude-enhanced merge...")
# Create temporary workspace
workspace_dir = self._create_workspace()
# Launch Claude Code for enhancement
logger.info("Launching Claude Code for intelligent merging...")
logger.info("Claude will analyze conflicts and create reconciled API reference")
try:
self._launch_claude_merge(workspace_dir)
# Read enhanced results
merged_data = self._read_merged_results(workspace_dir)
logger.info("Claude-enhanced merge complete")
return merged_data
except Exception as e:
logger.error(f"Claude enhancement failed: {e}")
logger.info("Falling back to rule-based merge")
return self.rule_merger.merge_all()
def _create_workspace(self) -> str:
"""
Create temporary workspace with merge context.
Returns:
Path to workspace directory
"""
workspace = tempfile.mkdtemp(prefix='skill_merge_')
logger.info(f"Created merge workspace: {workspace}")
# Write context files for Claude
self._write_context_files(workspace)
return workspace
def _write_context_files(self, workspace: str):
"""Write context files for Claude to analyze."""
# 1. Write conflicts summary
conflicts_file = os.path.join(workspace, 'conflicts.json')
with open(conflicts_file, 'w') as f:
json.dump({
'conflicts': [c.__dict__ for c in self.conflicts],
'summary': {
'total': len(self.conflicts),
'by_type': self._count_by_field('type'),
'by_severity': self._count_by_field('severity')
}
}, f, indent=2)
# 2. Write documentation APIs
docs_apis_file = os.path.join(workspace, 'docs_apis.json')
detector = ConflictDetector(self.docs_data, self.github_data)
with open(docs_apis_file, 'w') as f:
json.dump(detector.docs_apis, f, indent=2)
# 3. Write code APIs
code_apis_file = os.path.join(workspace, 'code_apis.json')
with open(code_apis_file, 'w') as f:
json.dump(detector.code_apis, f, indent=2)
# 4. Write merge instructions for Claude
instructions = """# API Merge Task
You are merging API documentation from two sources:
1. Official documentation (user-facing)
2. Source code analysis (implementation reality)
## Context Files:
- `conflicts.json` - All detected conflicts between sources
- `docs_apis.json` - APIs from documentation
- `code_apis.json` - APIs from source code
## Your Task:
For each conflict, reconcile the differences intelligently:
1. **Prefer code signatures as source of truth**
- Use actual parameter names, types, defaults from code
- Code is what actually runs, docs might be outdated
2. **Keep documentation descriptions**
- Docs are user-friendly, code comments might be technical
- Keep the docs' explanation of what the API does
3. **Add implementation notes for discrepancies**
- If docs differ from code, explain the difference
- Example: "⚠️ The `snap` parameter exists in code but is not documented"
4. **Flag missing APIs clearly**
- Missing in docs: add [UNDOCUMENTED] tag
- Missing in code: add [REMOVED] or [DOCS_ERROR] tag
5. **Create unified API reference**
- One definitive signature per API
- Clear warnings about conflicts
- Implementation notes where helpful
## Output Format:
Create `merged_apis.json` with this structure:
```json
{
"apis": {
"API.name": {
"signature": "final_signature_here",
"parameters": [...],
"return_type": "type",
"description": "user-friendly description",
"implementation_notes": "Any discrepancies or warnings",
"source": "both|docs_only|code_only",
"confidence": "high|medium|low"
}
}
}
```
Take your time to analyze each conflict carefully. The goal is to create the most accurate and helpful API reference possible.
"""
instructions_file = os.path.join(workspace, 'MERGE_INSTRUCTIONS.md')
with open(instructions_file, 'w', encoding='utf-8') as f:
f.write(instructions)
logger.info(f"Wrote context files to {workspace}")
def _count_by_field(self, field: str) -> Dict[str, int]:
"""Count conflicts by a specific field."""
counts = {}
for conflict in self.conflicts:
value = getattr(conflict, field)
counts[value] = counts.get(value, 0) + 1
return counts
    def _launch_claude_merge(self, workspace: str):
        """
        Launch Claude Code to perform merge.
        Similar to enhance_skill_local.py approach.
        """
        # Create a script that Claude will execute
        script_path = os.path.join(workspace, 'merge_script.sh')
        script_content = f"""#!/bin/bash
# Automatic merge script for Claude Code
cd "{workspace}"
echo "📊 Analyzing conflicts..."
cat conflicts.json | head -20
echo ""
echo "📖 Documentation APIs: $(cat docs_apis.json | grep -c '\"name\"')"
echo "💻 Code APIs: $(cat code_apis.json | grep -c '\"name\"')"
echo ""
echo "Please review the conflicts and create merged_apis.json"
echo "Follow the instructions in MERGE_INSTRUCTIONS.md"
echo ""
echo "When done, save merged_apis.json and close this terminal."
# Wait for user to complete merge
read -p "Press Enter when merge is complete..."
"""
        # The script contains non-ASCII characters, so write it as UTF-8 explicitly
        with open(script_path, 'w', encoding='utf-8') as f:
            f.write(script_content)
        os.chmod(script_path, 0o755)
        # Open new terminal with Claude Code
        # Try different terminal emulators
        terminals = [
            ['x-terminal-emulator', '-e'],
            ['gnome-terminal', '--'],
            ['xterm', '-e'],
            ['konsole', '-e']
        ]
        for terminal_cmd in terminals:
            try:
                cmd = terminal_cmd + ['bash', script_path]
                subprocess.Popen(cmd)
                logger.info(f"Opened terminal with {terminal_cmd[0]}")
                break
            except FileNotFoundError:
                continue
        else:
            # No emulator could be started: fail fast instead of polling for an hour
            raise RuntimeError("No supported terminal emulator found to run the merge script")
        # Wait for merge to complete
        merged_file = os.path.join(workspace, 'merged_apis.json')
        logger.info(f"Waiting for merged results at: {merged_file}")
        logger.info("Save merged_apis.json in the workspace to continue...")
        # Poll for file existence
        import time
        timeout = 3600  # 1 hour max
        elapsed = 0
        while not os.path.exists(merged_file) and elapsed < timeout:
            time.sleep(5)
            elapsed += 5
        if not os.path.exists(merged_file):
            raise TimeoutError("Claude merge timed out after 1 hour")
def _read_merged_results(self, workspace: str) -> Dict[str, Any]:
"""Read merged results from workspace."""
merged_file = os.path.join(workspace, 'merged_apis.json')
if not os.path.exists(merged_file):
raise FileNotFoundError(f"Merged results not found: {merged_file}")
with open(merged_file, 'r') as f:
merged_data = json.load(f)
return {
'merge_mode': 'claude-enhanced',
**merged_data
}
def merge_sources(docs_data_path: str,
github_data_path: str,
output_path: str,
mode: str = 'rule-based') -> Dict[str, Any]:
"""
Merge documentation and GitHub data.
Args:
docs_data_path: Path to documentation data JSON
github_data_path: Path to GitHub data JSON
output_path: Path to save merged output
mode: 'rule-based' or 'claude-enhanced'
Returns:
Merged data dict
"""
# Load data
with open(docs_data_path, 'r') as f:
docs_data = json.load(f)
with open(github_data_path, 'r') as f:
github_data = json.load(f)
# Detect conflicts
detector = ConflictDetector(docs_data, github_data)
conflicts = detector.detect_all_conflicts()
logger.info(f"Detected {len(conflicts)} conflicts")
# Merge based on mode
if mode == 'claude-enhanced':
merger = ClaudeEnhancedMerger(docs_data, github_data, conflicts)
else:
merger = RuleBasedMerger(docs_data, github_data, conflicts)
merged_data = merger.merge_all()
# Save merged data
with open(output_path, 'w', encoding='utf-8') as f:
json.dump(merged_data, f, indent=2, ensure_ascii=False)
logger.info(f"Merged data saved to: {output_path}")
return merged_data
if __name__ == '__main__':
import argparse
parser = argparse.ArgumentParser(description='Merge documentation and code sources')
parser.add_argument('docs_data', help='Path to documentation data JSON')
parser.add_argument('github_data', help='Path to GitHub data JSON')
parser.add_argument('--output', '-o', default='merged_data.json', help='Output file path')
parser.add_argument('--mode', '-m', choices=['rule-based', 'claude-enhanced'],
default='rule-based', help='Merge mode')
args = parser.parse_args()
merged = merge_sources(args.docs_data, args.github_data, args.output, args.mode)
# Print summary
summary = merged.get('summary', {})
print(f"\n✅ Merge complete ({merged.get('merge_mode')})")
print(f" Total APIs: {summary.get('total_apis', 0)}")
print(f" Matched: {summary.get('matched', 0)}")
print(f" Docs only: {summary.get('docs_only', 0)}")
print(f" Code only: {summary.get('code_only', 0)}")
print(f" Conflicts: {summary.get('conflict', 0)}")
print(f"\n📄 Saved to: {args.output}")
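
The four merge rules in `RuleBasedMerger` reduce to a status decision per API name followed by a count per status. The sketch below isolates that decision with made-up data. `classify_api` and the sample API names are hypothetical and only illustrate the rule order; this is not the module's actual code path.

```python
from collections import Counter
from typing import Dict, Set


def classify_api(name: str,
                 docs_apis: Dict[str, dict],
                 code_apis: Dict[str, dict],
                 conflict_names: Set[str]) -> str:
    """Return the merge status for one API name (illustrative helper)."""
    in_docs = name in docs_apis
    in_code = name in code_apis
    if not in_docs and not in_code:
        raise KeyError(f"Unknown API: {name}")
    if in_docs and not in_code:
        return "docs_only"   # Rule 1: documented but not found in code
    if in_code and not in_docs:
        return "code_only"   # Rule 2: present in code but undocumented
    if name in conflict_names:
        return "conflict"    # Rule 4: both exist and a conflict was detected
    return "matched"         # Rule 3: both exist and agree


# Made-up sample data: only the keys matter for classification
docs_apis = {"move": {}, "rotate": {}, "legacy_init": {}}
code_apis = {"move": {}, "rotate": {}, "_reset": {}}
conflict_names = {"rotate"}

statuses = {
    name: classify_api(name, docs_apis, code_apis, conflict_names)
    for name in sorted(docs_apis.keys() | code_apis.keys())
}
summary = Counter(statuses.values())
print(statuses)
print(dict(summary))
```

Checking the "only in one source" cases first mirrors the order used in `_merge_single_api`, so a conflict recorded for a name that exists in only one source does not change its status.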

@ -0,0 +1,81 @@
#!/usr/bin/env python3
"""
Multi-Skill Packager
Package multiple skills at once. Useful for packaging router + sub-skills together.
"""
import sys
import argparse
from pathlib import Path
import subprocess
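def package_skill(skill_dir: Path) -> bool:
    """Package a single skill by running package_skill.py in a subprocess"""
    try:
        result = subprocess.run(
            [sys.executable, str(Path(__file__).parent / "package_skill.py"), str(skill_dir)],
            capture_output=True,
            text=True
        )
        if result.returncode != 0:
            # Surface the child's captured output so the failure can be diagnosed
            details = (result.stdout + result.stderr).strip()
            if details:
                print(details)
        return result.returncode == 0
    except Exception as e:
        print(f"❌ Error packaging {skill_dir}: {e}")
        return False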
def main():
parser = argparse.ArgumentParser(
description="Package multiple skills at once",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Package all godot skills
python3 package_multi.py output/godot*/
# Package specific skills
python3 package_multi.py output/godot-2d/ output/godot-3d/ output/godot-scripting/
"""
)
parser.add_argument(
'skill_dirs',
nargs='+',
help='Skill directories to package'
)
args = parser.parse_args()
print(f"\n{'='*60}")
print(f"MULTI-SKILL PACKAGER")
print(f"{'='*60}\n")
skill_dirs = [Path(d) for d in args.skill_dirs]
success_count = 0
total_count = len(skill_dirs)
for skill_dir in skill_dirs:
if not skill_dir.exists():
print(f"⚠️ Skipping (not found): {skill_dir}")
continue
if not (skill_dir / "SKILL.md").exists():
print(f"⚠️ Skipping (no SKILL.md): {skill_dir}")
continue
print(f"📦 Packaging: {skill_dir.name}")
if package_skill(skill_dir):
success_count += 1
print(f" ✅ Success")
else:
print(f" ❌ Failed")
print("")
print(f"{'='*60}")
print(f"SUMMARY: {success_count}/{total_count} skills packaged")
print(f"{'='*60}\n")
if __name__ == "__main__":
main()
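
The summary printed by the multi-skill packager counts skipped directories in the total, so a skipped skill and a failed one look the same in the final ratio. A small sketch of a batch loop that keeps the three outcomes apart is given below. `run_batch` and the stub `package` callable are hypothetical; the real script shells out to `package_skill.py` instead.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def run_batch(skill_dirs: Iterable[Path],
              package: Callable[[Path], bool]) -> Dict[str, List[str]]:
    """Package each directory and record packaged, failed, and skipped names."""
    results: Dict[str, List[str]] = {"packaged": [], "failed": [], "skipped": []}
    for skill_dir in skill_dirs:
        # A missing directory and a directory without SKILL.md are both skipped
        if not (skill_dir / "SKILL.md").is_file():
            results["skipped"].append(skill_dir.name)
        elif package(skill_dir):
            results["packaged"].append(skill_dir.name)
        else:
            results["failed"].append(skill_dir.name)
    return results


with tempfile.TemporaryDirectory() as tmp:
    ready = Path(tmp) / "godot-2d"
    ready.mkdir()
    (ready / "SKILL.md").write_text("# demo\n", encoding="utf-8")
    empty = Path(tmp) / "godot-3d"
    empty.mkdir()                                # exists, but has no SKILL.md
    missing = Path(tmp) / "godot-scripting"      # never created
    # The lambda stands in for the real packaging step and always succeeds
    results = run_batch([ready, empty, missing], package=lambda d: True)

print(results)
```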

@ -0,0 +1,220 @@
#!/usr/bin/env python3
"""
Simple Skill Packager
Packages a skill directory into a .zip file for Claude.
Usage:
skill-seekers package output/steam-inventory/
skill-seekers package output/react/
skill-seekers package output/react/ --no-open # Don't open folder
"""
import os
import sys
import zipfile
import argparse
from pathlib import Path
# Import utilities
try:
from utils import (
open_folder,
print_upload_instructions,
format_file_size,
validate_skill_directory
)
from quality_checker import SkillQualityChecker, print_report
except ImportError:
# If running from different directory, add cli to path
sys.path.insert(0, str(Path(__file__).parent))
from utils import (
open_folder,
print_upload_instructions,
format_file_size,
validate_skill_directory
)
from quality_checker import SkillQualityChecker, print_report
def package_skill(skill_dir, open_folder_after=True, skip_quality_check=False):
"""
Package a skill directory into a .zip file
Args:
skill_dir: Path to skill directory
open_folder_after: Whether to open the output folder after packaging
skip_quality_check: Skip quality checks before packaging
Returns:
tuple: (success, zip_path) where success is bool and zip_path is Path or None
"""
skill_path = Path(skill_dir)
# Validate skill directory
is_valid, error_msg = validate_skill_directory(skill_path)
if not is_valid:
print(f"❌ Error: {error_msg}")
return False, None
# Run quality checks (unless skipped)
if not skip_quality_check:
print("\n" + "=" * 60)
print("QUALITY CHECK")
print("=" * 60)
checker = SkillQualityChecker(skill_path)
report = checker.check_all()
# Print report
print_report(report, verbose=False)
# If there are errors or warnings, ask user to confirm
if report.has_errors or report.has_warnings:
print("=" * 60)
response = input("\nContinue with packaging? (y/n): ").strip().lower()
if response != 'y':
print("\n❌ Packaging cancelled by user")
return False, None
print()
else:
print("=" * 60)
print()
# Create zip filename
skill_name = skill_path.name
zip_path = skill_path.parent / f"{skill_name}.zip"
print(f"📦 Packaging skill: {skill_name}")
print(f" Source: {skill_path}")
print(f" Output: {zip_path}")
# Create zip file
with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
for root, dirs, files in os.walk(skill_path):
# Skip backup files
files = [f for f in files if not f.endswith('.backup')]
for file in files:
file_path = Path(root) / file
arcname = file_path.relative_to(skill_path)
zf.write(file_path, arcname)
print(f" + {arcname}")
# Get zip size
zip_size = zip_path.stat().st_size
print(f"\n✅ Package created: {zip_path}")
print(f" Size: {zip_size:,} bytes ({format_file_size(zip_size)})")
# Open folder in file browser
if open_folder_after:
print(f"\n📂 Opening folder: {zip_path.parent}")
open_folder(zip_path.parent)
# Print upload instructions
print_upload_instructions(zip_path)
return True, zip_path
def main():
parser = argparse.ArgumentParser(
description="Package a skill directory into a .zip file for Claude",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Package skill with quality checks (recommended)
skill-seekers package output/react/
# Package skill without opening folder
skill-seekers package output/react/ --no-open
# Skip quality checks (faster, but not recommended)
skill-seekers package output/react/ --skip-quality-check
# Package and auto-upload to Claude
skill-seekers package output/react/ --upload
# Get help
skill-seekers package --help
"""
)
parser.add_argument(
'skill_dir',
help='Path to skill directory (e.g., output/react/)'
)
parser.add_argument(
'--no-open',
action='store_true',
help='Do not open the output folder after packaging'
)
parser.add_argument(
'--skip-quality-check',
action='store_true',
help='Skip quality checks before packaging'
)
parser.add_argument(
'--upload',
action='store_true',
help='Automatically upload to Claude after packaging (requires ANTHROPIC_API_KEY)'
)
args = parser.parse_args()
success, zip_path = package_skill(
args.skill_dir,
open_folder_after=not args.no_open,
skip_quality_check=args.skip_quality_check
)
if not success:
sys.exit(1)
# Auto-upload if requested
if args.upload:
# Check if API key is set BEFORE attempting upload
api_key = os.environ.get('ANTHROPIC_API_KEY', '').strip()
if not api_key:
# No API key - show helpful message but DON'T fail
print("\n" + "="*60)
print("💡 Automatic Upload")
print("="*60)
print()
print("To enable automatic upload:")
print(" 1. Get API key from https://console.anthropic.com/")
print(" 2. Set: export ANTHROPIC_API_KEY=sk-ant-...")
print(" 3. Run package_skill.py with --upload flag")
print()
print("For now, use manual upload (instructions above) ☝️")
print("="*60)
# Exit successfully - packaging worked!
sys.exit(0)
# API key exists - try upload
try:
from upload_skill import upload_skill_api
print("\n" + "="*60)
upload_success, message = upload_skill_api(zip_path)
if not upload_success:
print(f"❌ Upload failed: {message}")
print()
print("💡 Try manual upload instead (instructions above) ☝️")
print("="*60)
# Exit successfully - packaging worked even if upload failed
sys.exit(0)
else:
print("="*60)
sys.exit(0)
except ImportError:
print("\n❌ Error: upload_skill.py not found")
sys.exit(1)
sys.exit(0)
if __name__ == "__main__":
main()
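
The core of `package_skill` is the archive step: walk the skill directory, skip `.backup` files, and store every other file under a path relative to the skill root. The sketch below reproduces only that step in a temporary directory. `zip_skill` and the file names are assumptions made for the example, and the quality check, folder opening, and upload logic are left out.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_skill(skill_path: Path) -> Path:
    """Zip a skill directory next to itself, skipping *.backup files."""
    zip_path = skill_path.parent / f"{skill_path.name}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for file_path in sorted(skill_path.rglob("*")):
            if not file_path.is_file() or file_path.name.endswith(".backup"):
                continue
            # Store paths relative to the skill root so the archive has no parent folders
            zf.write(file_path, file_path.relative_to(skill_path).as_posix())
    return zip_path


with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "demo-skill"
    (skill / "references").mkdir(parents=True)
    (skill / "SKILL.md").write_text("# Demo\n", encoding="utf-8")
    (skill / "SKILL.md.backup").write_text("old copy\n", encoding="utf-8")
    (skill / "references" / "index.md").write_text("# Index\n", encoding="utf-8")

    archive = zip_skill(skill)
    with zipfile.ZipFile(archive) as zf:
        names = sorted(zf.namelist())

print(names)
```

Because the archive is written to the parent of the skill directory, the walk never encounters the zip file it is creating.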

File diff suppressed because it is too large

@ -0,0 +1,401 @@
#!/usr/bin/env python3
"""
PDF Documentation to Claude Skill Converter (Task B1.6)
Converts PDF documentation into Claude AI skills.
Uses pdf_extractor_poc.py for extraction, builds skill structure.
Usage:
python3 pdf_scraper.py --config configs/manual_pdf.json
python3 pdf_scraper.py --pdf manual.pdf --name myskill
python3 pdf_scraper.py --from-json manual_extracted.json
"""
import os
import sys
import json
import re
import argparse
from pathlib import Path
# Import the PDF extractor
from .pdf_extractor_poc import PDFExtractor
class PDFToSkillConverter:
"""Convert PDF documentation to Claude skill"""
def __init__(self, config):
self.config = config
self.name = config['name']
self.pdf_path = config.get('pdf_path', '')
self.description = config.get('description', f'Documentation skill for {self.name}')
# Paths
self.skill_dir = f"output/{self.name}"
self.data_file = f"output/{self.name}_extracted.json"
# Extraction options
self.extract_options = config.get('extract_options', {})
# Categories
self.categories = config.get('categories', {})
# Extracted data
self.extracted_data = None
def extract_pdf(self):
"""Extract content from PDF using pdf_extractor_poc.py"""
print(f"\n🔍 Extracting from PDF: {self.pdf_path}")
# Create extractor with options
extractor = PDFExtractor(
self.pdf_path,
verbose=True,
chunk_size=self.extract_options.get('chunk_size', 10),
min_quality=self.extract_options.get('min_quality', 5.0),
extract_images=self.extract_options.get('extract_images', True),
image_dir=f"{self.skill_dir}/assets/images",
min_image_size=self.extract_options.get('min_image_size', 100)
)
# Extract
result = extractor.extract_all()
if not result:
print("❌ Extraction failed")
raise RuntimeError(f"Failed to extract PDF: {self.pdf_path}")
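# Save extracted data (create the output directory first if it does not exist)
os.makedirs(os.path.dirname(self.data_file), exist_ok=True)
with open(self.data_file, 'w', encoding='utf-8') as f: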
json.dump(result, f, indent=2, ensure_ascii=False)
print(f"\n💾 Saved extracted data to: {self.data_file}")
self.extracted_data = result
return True
def load_extracted_data(self, json_path):
"""Load previously extracted data from JSON"""
print(f"\n📂 Loading extracted data from: {json_path}")
with open(json_path, 'r', encoding='utf-8') as f:
self.extracted_data = json.load(f)
print(f"✅ Loaded {self.extracted_data['total_pages']} pages")
return True
def categorize_content(self):
"""Categorize pages based on chapters or keywords"""
print(f"\n📋 Categorizing content...")
categorized = {}
# Use chapters if available
if self.extracted_data.get('chapters'):
for chapter in self.extracted_data['chapters']:
category_key = self._sanitize_filename(chapter['title'])
categorized[category_key] = {
'title': chapter['title'],
'pages': []
}
# Assign pages to chapters
for page in self.extracted_data['pages']:
page_num = page['page_number']
# Find which chapter this page belongs to
for chapter in self.extracted_data['chapters']:
if chapter['start_page'] <= page_num <= chapter['end_page']:
category_key = self._sanitize_filename(chapter['title'])
categorized[category_key]['pages'].append(page)
break
# Fall back to keyword-based categorization
elif self.categories:
# Check if categories is already in the right format (for tests)
# If first value is a list of dicts (pages), use as-is
first_value = next(iter(self.categories.values()))
if isinstance(first_value, list) and first_value and isinstance(first_value[0], dict):
# Already categorized - convert to expected format
for cat_key, pages in self.categories.items():
categorized[cat_key] = {
'title': cat_key.replace('_', ' ').title(),
'pages': pages
}
else:
# Keyword-based categorization
# Initialize categories
for cat_key, keywords in self.categories.items():
categorized[cat_key] = {
'title': cat_key.replace('_', ' ').title(),
'pages': []
}
# Categorize by keywords
for page in self.extracted_data['pages']:
text = page.get('text', '').lower()
headings_text = ' '.join([h['text'] for h in page.get('headings', [])]).lower()
# Score against each category
scores = {}
for cat_key, keywords in self.categories.items():
# Handle both string keywords and dict keywords (shouldn't happen, but be safe)
if isinstance(keywords, list):
score = sum(1 for kw in keywords
if isinstance(kw, str) and (kw.lower() in text or kw.lower() in headings_text))
else:
score = 0
if score > 0:
scores[cat_key] = score
# Assign to highest scoring category
if scores:
best_cat = max(scores, key=scores.get)
categorized[best_cat]['pages'].append(page)
else:
# Default category
if 'other' not in categorized:
categorized['other'] = {'title': 'Other', 'pages': []}
categorized['other']['pages'].append(page)
else:
# No categorization - use single category
categorized['content'] = {
'title': 'Content',
'pages': self.extracted_data['pages']
}
print(f"✅ Created {len(categorized)} categories")
for cat_key, cat_data in categorized.items():
print(f" - {cat_data['title']}: {len(cat_data['pages'])} pages")
return categorized
def build_skill(self):
"""Build complete skill structure"""
print(f"\n🏗️ Building skill: {self.name}")
# Create directories
os.makedirs(f"{self.skill_dir}/references", exist_ok=True)
os.makedirs(f"{self.skill_dir}/scripts", exist_ok=True)
os.makedirs(f"{self.skill_dir}/assets", exist_ok=True)
# Categorize content
categorized = self.categorize_content()
# Generate reference files
print(f"\n📝 Generating reference files...")
for cat_key, cat_data in categorized.items():
self._generate_reference_file(cat_key, cat_data)
# Generate index
self._generate_index(categorized)
# Generate SKILL.md
self._generate_skill_md(categorized)
print(f"\n✅ Skill built successfully: {self.skill_dir}/")
print(f"\n📦 Next step: Package with: skill-seekers package {self.skill_dir}/")
def _generate_reference_file(self, cat_key, cat_data):
"""Generate a reference markdown file for a category"""
filename = f"{self.skill_dir}/references/{cat_key}.md"
with open(filename, 'w', encoding='utf-8') as f:
f.write(f"# {cat_data['title']}\n\n")
for page in cat_data['pages']:
# Add headings as section markers
if page.get('headings'):
f.write(f"## {page['headings'][0]['text']}\n\n")
# Add text content
if page.get('text'):
# Limit to first 1000 chars per page to avoid huge files
text = page['text'][:1000]
f.write(f"{text}\n\n")
# Add code samples (check both 'code_samples' and 'code_blocks' for compatibility)
code_list = page.get('code_samples') or page.get('code_blocks')
if code_list:
f.write("### Code Examples\n\n")
for code in code_list[:3]: # Limit to top 3
lang = code.get('language', '')
f.write(f"```{lang}\n{code['code']}\n```\n\n")
# Add images
if page.get('images'):
# Create assets directory if needed
assets_dir = os.path.join(self.skill_dir, 'assets')
os.makedirs(assets_dir, exist_ok=True)
f.write("### Images\n\n")
for img in page['images']:
# Save image to assets
img_filename = f"page_{page['page_number']}_img_{img['index']}.png"
img_path = os.path.join(assets_dir, img_filename)
with open(img_path, 'wb') as img_file:
img_file.write(img['data'])
# Add markdown image reference
f.write(f"![Image {img['index']}](../assets/{img_filename})\n\n")
f.write("---\n\n")
print(f" Generated: {filename}")
def _generate_index(self, categorized):
"""Generate reference index"""
filename = f"{self.skill_dir}/references/index.md"
with open(filename, 'w', encoding='utf-8') as f:
f.write(f"# {self.name.title()} Documentation Reference\n\n")
f.write("## Categories\n\n")
for cat_key, cat_data in categorized.items():
page_count = len(cat_data['pages'])
f.write(f"- [{cat_data['title']}]({cat_key}.md) ({page_count} pages)\n")
f.write("\n## Statistics\n\n")
stats = self.extracted_data.get('quality_statistics', {})
f.write(f"- Total pages: {self.extracted_data.get('total_pages', 0)}\n")
f.write(f"- Code blocks: {self.extracted_data.get('total_code_blocks', 0)}\n")
f.write(f"- Images: {self.extracted_data.get('total_images', 0)}\n")
if stats:
f.write(f"- Average code quality: {stats.get('average_quality', 0):.1f}/10\n")
f.write(f"- Valid code blocks: {stats.get('valid_code_blocks', 0)}\n")
print(f" Generated: {filename}")
def _generate_skill_md(self, categorized):
"""Generate main SKILL.md file"""
filename = f"{self.skill_dir}/SKILL.md"
# Generate skill name (lowercased, underscores/spaces become hyphens, max 64 chars);
# any other punctuation in self.name is passed through unchanged
skill_name = self.name.lower().replace('_', '-').replace(' ', '-')[:64]
# Truncate description to 1024 chars (slicing is a no-op for shorter strings)
desc = self.description[:1024]
with open(filename, 'w', encoding='utf-8') as f:
# Write YAML frontmatter
f.write("---\n")
f.write(f"name: {skill_name}\n")
f.write(f"description: {desc}\n")
f.write("---\n\n")
f.write(f"# {self.name.title()} Documentation Skill\n\n")
f.write(f"{self.description}\n\n")
f.write("## When to use this skill\n\n")
f.write(f"Use this skill when the user asks about {self.name} documentation, ")
f.write("including API references, tutorials, examples, and best practices.\n\n")
f.write("## What's included\n\n")
f.write("This skill contains:\n\n")
for cat_key, cat_data in categorized.items():
f.write(f"- **{cat_data['title']}**: {len(cat_data['pages'])} pages\n")
f.write("\n## Quick Reference\n\n")
# Get high-quality code samples (accept both 'code_samples' and 'code_blocks',
# matching _generate_reference_file)
all_code = []
for page in self.extracted_data['pages']:
all_code.extend(page.get('code_samples') or page.get('code_blocks') or [])
# Sort by quality and get top 5
all_code.sort(key=lambda x: x.get('quality_score', 0), reverse=True)
top_code = all_code[:5]
if top_code:
f.write("### Top Code Examples\n\n")
for i, code in enumerate(top_code, 1):
lang = code.get('language', '')
quality = code.get('quality_score', 0)
f.write(f"**Example {i}** (Quality: {quality:.1f}/10):\n\n")
# Append the ellipsis only when the sample is actually truncated
snippet = code['code'] if len(code['code']) <= 300 else code['code'][:300] + '...'
f.write(f"```{lang}\n{snippet}\n```\n\n")
f.write("## Navigation\n\n")
f.write("See `references/index.md` for complete documentation structure.\n\n")
# Add language statistics
langs = self.extracted_data.get('languages_detected', {})
if langs:
f.write("## Languages Covered\n\n")
for lang, count in sorted(langs.items(), key=lambda x: x[1], reverse=True):
f.write(f"- {lang}: {count} examples\n")
print(f" Generated: {filename}")
def _sanitize_filename(self, name):
"""Convert string to safe filename"""
# Remove special chars, replace spaces with underscores
safe = re.sub(r'[^\w\s-]', '', name.lower())
safe = re.sub(r'[-\s]+', '_', safe)
return safe
def main():
parser = argparse.ArgumentParser(
description='Convert PDF documentation to Claude skill',
formatter_class=argparse.RawDescriptionHelpFormatter
)
parser.add_argument('--config', help='PDF config JSON file')
parser.add_argument('--pdf', help='Direct PDF file path')
parser.add_argument('--name', help='Skill name (with --pdf)')
parser.add_argument('--from-json', help='Build skill from extracted JSON')
parser.add_argument('--description', help='Skill description')
args = parser.parse_args()
# Validate inputs
if not (args.config or args.pdf or args.from_json):
parser.error("Must specify --config, --pdf, or --from-json")
# Load or create config
if args.config:
with open(args.config, 'r') as f:
config = json.load(f)
elif args.from_json:
# Build from extracted JSON
name = Path(args.from_json).stem.replace('_extracted', '')
config = {
'name': name,
'description': args.description or f'Documentation skill for {name}'
}
converter = PDFToSkillConverter(config)
converter.load_extracted_data(args.from_json)
converter.build_skill()
return
else:
# Direct PDF mode
if not args.name:
parser.error("Must specify --name with --pdf")
config = {
'name': args.name,
'pdf_path': args.pdf,
'description': args.description or f'Documentation skill for {args.name}',
'extract_options': {
'chunk_size': 10,
'min_quality': 5.0,
'extract_images': True,
'min_image_size': 100
}
}
# Create converter
converter = PDFToSkillConverter(config)
# Extract if needed
if config.get('pdf_path'):
if not converter.extract_pdf():
sys.exit(1)
# Build skill
converter.build_skill()
if __name__ == '__main__':
main()
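
The name handling in `_generate_skill_md` only swaps underscores and spaces for hyphens, so a name such as `My Skill!` keeps its punctuation. Below is a minimal sketch of a stricter normalizer, assuming the intended format is lowercase letters, digits, and hyphens with a 64-character cap. The helper name `normalize_skill_name` is hypothetical and does not exist in the original file.

```python
import re


def normalize_skill_name(name: str, max_len: int = 64) -> str:
    """Hypothetical helper: lowercase, hyphen-separated, length-capped."""
    # Collapse every run of characters outside a-z / 0-9 into one hyphen.
    slug = re.sub(r'[^a-z0-9]+', '-', name.lower())
    # Trim hyphens at the ends, cap the length, then trim again in case
    # the cut landed right after a hyphen.
    return slug.strip('-')[:max_len].rstrip('-')
```

Collapsing runs of separators also avoids doubled hyphens when a name contains both a space and an underscore in a row.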

View File

@ -0,0 +1,480 @@
#!/usr/bin/env python3
"""
Quality Checker for Claude Skills
Validates skill quality, checks links, and generates quality reports.
Usage:
python3 quality_checker.py output/react/
python3 quality_checker.py output/godot/ --verbose
"""
import os
import re
import sys
from pathlib import Path
from typing import Dict, List, Tuple, Optional
from dataclasses import dataclass, field
@dataclass
class QualityIssue:
"""Represents a quality issue found during validation."""
level: str # 'error', 'warning', 'info'
category: str # 'enhancement', 'content', 'links', 'structure'
message: str
file: Optional[str] = None
line: Optional[int] = None
@dataclass
class QualityReport:
"""Complete quality report for a skill."""
skill_name: str
skill_path: Path
errors: List[QualityIssue] = field(default_factory=list)
warnings: List[QualityIssue] = field(default_factory=list)
info: List[QualityIssue] = field(default_factory=list)
def add_error(self, category: str, message: str, file: str = None, line: int = None):
"""Add an error to the report."""
self.errors.append(QualityIssue('error', category, message, file, line))
def add_warning(self, category: str, message: str, file: str = None, line: int = None):
"""Add a warning to the report."""
self.warnings.append(QualityIssue('warning', category, message, file, line))
def add_info(self, category: str, message: str, file: str = None, line: int = None):
"""Add info to the report."""
self.info.append(QualityIssue('info', category, message, file, line))
@property
def has_errors(self) -> bool:
"""Check if there are any errors."""
return len(self.errors) > 0
@property
def has_warnings(self) -> bool:
"""Check if there are any warnings."""
return len(self.warnings) > 0
@property
def is_excellent(self) -> bool:
"""Check if quality is excellent (no errors, no warnings)."""
return not self.has_errors and not self.has_warnings
@property
def quality_score(self) -> float:
"""Calculate quality score (0-100)."""
# Start with perfect score
score = 100.0
# Deduct points for issues
score -= len(self.errors) * 15 # -15 per error
score -= len(self.warnings) * 5 # -5 per warning
# Never go below 0
return max(0.0, score)
@property
def quality_grade(self) -> str:
"""Get quality grade (A-F)."""
score = self.quality_score
if score >= 90:
return 'A'
elif score >= 80:
return 'B'
elif score >= 70:
return 'C'
elif score >= 60:
return 'D'
else:
return 'F'
class SkillQualityChecker:
"""Validates skill quality and generates reports."""
def __init__(self, skill_dir: Path):
"""Initialize quality checker.
Args:
skill_dir: Path to skill directory
"""
self.skill_dir = Path(skill_dir)
self.skill_md_path = self.skill_dir / "SKILL.md"
self.references_dir = self.skill_dir / "references"
self.report = QualityReport(
skill_name=self.skill_dir.name,
skill_path=self.skill_dir
)
def check_all(self) -> QualityReport:
"""Run all quality checks and return report.
Returns:
QualityReport: Complete quality report
"""
# Basic structure checks
self._check_skill_structure()
# Enhancement verification
self._check_enhancement_quality()
# Content quality checks
self._check_content_quality()
# Link validation
self._check_links()
return self.report
def _check_skill_structure(self):
"""Check basic skill structure."""
# Check SKILL.md exists
if not self.skill_md_path.exists():
self.report.add_error(
'structure',
'SKILL.md file not found',
str(self.skill_md_path)
)
return
# Check references directory exists
if not self.references_dir.exists():
self.report.add_warning(
'structure',
'references/ directory not found - skill may be incomplete',
str(self.references_dir)
)
elif not list(self.references_dir.glob('*.md')):
self.report.add_warning(
'structure',
'references/ directory is empty - no reference documentation found',
str(self.references_dir)
)
def _check_enhancement_quality(self):
"""Check if SKILL.md was properly enhanced."""
if not self.skill_md_path.exists():
return
content = self.skill_md_path.read_text(encoding='utf-8')
# Check for template indicators (signs it wasn't enhanced)
template_indicators = [
"TODO:",
"[Add description]",
"[Framework specific tips]",
"coming soon",
]
for indicator in template_indicators:
if indicator.lower() in content.lower():
self.report.add_warning(
'enhancement',
f'Found template placeholder: "{indicator}" - SKILL.md may not be enhanced',
'SKILL.md'
)
# Check for good signs of enhancement
enhancement_indicators = {
'code_examples': re.compile(r'```[\w-]+\n', re.MULTILINE),
'real_examples': re.compile(r'Example:', re.IGNORECASE),
'sections': re.compile(r'^## .+', re.MULTILINE),
}
code_blocks = len(enhancement_indicators['code_examples'].findall(content))
real_examples = len(enhancement_indicators['real_examples'].findall(content))
sections = len(enhancement_indicators['sections'].findall(content))
# Quality thresholds
if code_blocks == 0:
self.report.add_warning(
'enhancement',
'No code examples found in SKILL.md - consider enhancing',
'SKILL.md'
)
elif code_blocks < 3:
self.report.add_info(
'enhancement',
f'Only {code_blocks} code examples found - more examples would improve quality',
'SKILL.md'
)
else:
self.report.add_info(
'enhancement',
f'✓ Found {code_blocks} code examples',
'SKILL.md'
)
if sections < 4:
self.report.add_warning(
'enhancement',
f'Only {sections} sections found - SKILL.md may be too basic',
'SKILL.md'
)
else:
self.report.add_info(
'enhancement',
f'✓ Found {sections} sections',
'SKILL.md'
)
def _check_content_quality(self):
"""Check content quality."""
if not self.skill_md_path.exists():
return
content = self.skill_md_path.read_text(encoding='utf-8')
# Check YAML frontmatter
if not content.startswith('---'):
self.report.add_error(
'content',
'Missing YAML frontmatter - SKILL.md must start with ---',
'SKILL.md',
1
)
else:
# Extract frontmatter
try:
frontmatter_match = re.match(r'^---\n(.*?)\n---', content, re.DOTALL)
if frontmatter_match:
frontmatter = frontmatter_match.group(1)
# Check for required fields
if 'name:' not in frontmatter:
self.report.add_error(
'content',
'Missing "name:" field in YAML frontmatter',
'SKILL.md',
2
)
# Check for description
if 'description:' in frontmatter:
self.report.add_info(
'content',
'✓ YAML frontmatter includes description',
'SKILL.md'
)
else:
self.report.add_error(
'content',
'Invalid YAML frontmatter format',
'SKILL.md',
1
)
except Exception as e:
self.report.add_error(
'content',
f'Error parsing YAML frontmatter: {e}',
'SKILL.md',
1
)
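# Check code block language tags. Fence lines alternate open/close, so only
# the even-indexed ones are openers; closing fences never carry a tag.
fence_tags = re.findall(r'^```(.*)$', content, re.MULTILINE)
code_blocks_without_lang = [tag for tag in fence_tags[::2] if not tag.strip()]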
if code_blocks_without_lang:
self.report.add_warning(
'content',
f'Found {len(code_blocks_without_lang)} code blocks without language tags',
'SKILL.md'
)
# Check for "When to Use" section
if 'when to use' not in content.lower():
self.report.add_warning(
'content',
'Missing "When to Use This Skill" section',
'SKILL.md'
)
else:
self.report.add_info(
'content',
'✓ Found "When to Use" section',
'SKILL.md'
)
# Check reference files
if self.references_dir.exists():
ref_files = list(self.references_dir.glob('*.md'))
if ref_files:
self.report.add_info(
'content',
f'✓ Found {len(ref_files)} reference files',
'references/'
)
# Check if references are mentioned in SKILL.md
mentioned_refs = 0
for ref_file in ref_files:
if ref_file.name in content:
mentioned_refs += 1
if mentioned_refs == 0:
self.report.add_warning(
'content',
'Reference files exist but none are mentioned in SKILL.md',
'SKILL.md'
)
def _check_links(self):
"""Check internal markdown links."""
if not self.skill_md_path.exists():
return
content = self.skill_md_path.read_text(encoding='utf-8')
# Find all markdown links [text](path)
link_pattern = re.compile(r'\[([^\]]+)\]\(([^)]+)\)')
links = link_pattern.findall(content)
broken_links = []
for text, link in links:
# Skip external links (http/https)
if link.startswith('http://') or link.startswith('https://'):
continue
# Skip anchor links
if link.startswith('#'):
continue
# Check if file exists (relative to SKILL.md), ignoring any #fragment
link_path = self.skill_dir / link.split('#', 1)[0]
if not link_path.exists():
broken_links.append((text, link))
if broken_links:
for text, link in broken_links:
self.report.add_warning(
'links',
f'Broken link: [{text}]({link})',
'SKILL.md'
)
else:
if links:
internal_links = [link for _, link in links if not link.startswith(('http://', 'https://', '#'))]
if internal_links:
self.report.add_info(
'links',
f'✓ All {len(internal_links)} internal links are valid',
'SKILL.md'
)
def print_report(report: QualityReport, verbose: bool = False):
"""Print quality report to console.
Args:
report: Quality report to print
verbose: Show all info messages
"""
print("\n" + "=" * 60)
print(f"QUALITY REPORT: {report.skill_name}")
print("=" * 60)
print()
# Quality score
print(f"Quality Score: {report.quality_score:.1f}/100 (Grade: {report.quality_grade})")
print()
# Errors
if report.errors:
print(f"❌ ERRORS ({len(report.errors)}):")
for issue in report.errors:
location = f" ({issue.file}:{issue.line})" if issue.file and issue.line else f" ({issue.file})" if issue.file else ""
print(f" [{issue.category}] {issue.message}{location}")
print()
# Warnings
if report.warnings:
print(f"⚠️ WARNINGS ({len(report.warnings)}):")
for issue in report.warnings:
location = f" ({issue.file}:{issue.line})" if issue.file and issue.line else f" ({issue.file})" if issue.file else ""
print(f" [{issue.category}] {issue.message}{location}")
print()
# Info (only in verbose mode)
if verbose and report.info:
print(f" INFO ({len(report.info)}):")
for issue in report.info:
location = f" ({issue.file})" if issue.file else ""
print(f" [{issue.category}] {issue.message}{location}")
print()
# Summary
if report.is_excellent:
print("✅ EXCELLENT! No issues found.")
elif not report.has_errors:
print("✓ GOOD! No errors, but some warnings to review.")
else:
print("❌ NEEDS IMPROVEMENT! Please fix errors before packaging.")
print()
def main():
"""Main entry point."""
import argparse
parser = argparse.ArgumentParser(
description="Check skill quality and generate report",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Basic quality check
python3 quality_checker.py output/react/
# Verbose mode (show all info)
python3 quality_checker.py output/godot/ --verbose
# Exit with error code if issues found
python3 quality_checker.py output/django/ --strict
"""
)
parser.add_argument(
'skill_directory',
help='Path to skill directory (e.g., output/react/)'
)
parser.add_argument(
'--verbose', '-v',
action='store_true',
help='Show all info messages'
)
parser.add_argument(
'--strict',
action='store_true',
help='Exit with error code if any warnings or errors found'
)
args = parser.parse_args()
# Check if directory exists
skill_dir = Path(args.skill_directory)
if not skill_dir.exists():
print(f"❌ Directory not found: {skill_dir}")
sys.exit(1)
# Run quality checks
checker = SkillQualityChecker(skill_dir)
report = checker.check_all()
# Print report
print_report(report, verbose=args.verbose)
# Exit code: errors always fail; warnings fail only under --strict
if report.has_errors or (args.strict and report.has_warnings):
sys.exit(1)
sys.exit(0)
if __name__ == "__main__":
main()
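
The scoring rule in `QualityReport` is simple arithmetic: start at 100, subtract 15 per error and 5 per warning, floor at 0, then map the score to a letter. The sketch below restates that rule as a standalone function so the thresholds can be checked by hand. The function name `score_and_grade` is an illustrative assumption, not part of the checker.

```python
def score_and_grade(errors: int, warnings: int) -> tuple:
    """Mirror of the QualityReport scoring rule, for illustration only."""
    # 100 minus 15 per error and 5 per warning, never below 0.
    score = max(0.0, 100.0 - 15 * errors - 5 * warnings)
    # Same cut-offs as quality_grade: 90/80/70/60, otherwise 'F'.
    for threshold, grade in ((90, 'A'), (80, 'B'), (70, 'C'), (60, 'D')):
        if score >= threshold:
            return score, grade
    return score, 'F'
```

Note that two warnings and no errors still score 90 and earn an 'A', even though `is_excellent` would be false for that report.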

View File

@ -0,0 +1,228 @@
#!/usr/bin/env python3
"""
Test Runner for Skill Seeker
Runs all test suites and generates a comprehensive test report
"""
import sys
import unittest
import os
from io import StringIO
from pathlib import Path
class ColoredTextTestResult(unittest.TextTestResult):
"""Custom test result class with colored output"""
# ANSI color codes
GREEN = '\033[92m'
RED = '\033[91m'
YELLOW = '\033[93m'
BLUE = '\033[94m'
RESET = '\033[0m'
BOLD = '\033[1m'
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.test_results = []
def addSuccess(self, test):
super().addSuccess(test)
self.test_results.append(('PASS', test))
if self.showAll:
self.stream.write(f"{self.GREEN}✓ PASS{self.RESET}\n")
elif self.dots:
self.stream.write(f"{self.GREEN}.{self.RESET}")
self.stream.flush()
def addError(self, test, err):
super().addError(test, err)
self.test_results.append(('ERROR', test))
if self.showAll:
self.stream.write(f"{self.RED}✗ ERROR{self.RESET}\n")
elif self.dots:
self.stream.write(f"{self.RED}E{self.RESET}")
self.stream.flush()
def addFailure(self, test, err):
super().addFailure(test, err)
self.test_results.append(('FAIL', test))
if self.showAll:
self.stream.write(f"{self.RED}✗ FAIL{self.RESET}\n")
elif self.dots:
self.stream.write(f"{self.RED}F{self.RESET}")
self.stream.flush()
def addSkip(self, test, reason):
super().addSkip(test, reason)
self.test_results.append(('SKIP', test))
if self.showAll:
self.stream.write(f"{self.YELLOW}⊘ SKIP{self.RESET}\n")
elif self.dots:
self.stream.write(f"{self.YELLOW}s{self.RESET}")
self.stream.flush()
class ColoredTextTestRunner(unittest.TextTestRunner):
"""Custom test runner with colored output"""
resultclass = ColoredTextTestResult
def discover_tests(test_dir='tests'):
"""Discover all test files in the tests directory"""
loader = unittest.TestLoader()
start_dir = test_dir
pattern = 'test_*.py'
suite = loader.discover(start_dir, pattern=pattern)
return suite
def run_specific_suite(suite_name):
"""Run a specific test suite"""
loader = unittest.TestLoader()
suite_map = {
'config': 'tests.test_config_validation',
'features': 'tests.test_scraper_features',
'integration': 'tests.test_integration'
}
if suite_name not in suite_map:
print(f"Unknown test suite: {suite_name}")
print(f"Available suites: {', '.join(suite_map.keys())}")
return None
module_name = suite_map[suite_name]
try:
suite = loader.loadTestsFromName(module_name)
return suite
except Exception as e:
print(f"Error loading test suite '{suite_name}': {e}")
return None
def print_summary(result):
"""Print a detailed test summary"""
total = result.testsRun
passed = total - len(result.failures) - len(result.errors) - len(result.skipped)
failed = len(result.failures)
errors = len(result.errors)
skipped = len(result.skipped)
print("\n" + "="*70)
print("TEST SUMMARY")
print("="*70)
# Overall stats
print(f"\n{ColoredTextTestResult.BOLD}Total Tests:{ColoredTextTestResult.RESET} {total}")
print(f"{ColoredTextTestResult.GREEN}✓ Passed:{ColoredTextTestResult.RESET} {passed}")
if failed > 0:
print(f"{ColoredTextTestResult.RED}✗ Failed:{ColoredTextTestResult.RESET} {failed}")
if errors > 0:
print(f"{ColoredTextTestResult.RED}✗ Errors:{ColoredTextTestResult.RESET} {errors}")
if skipped > 0:
print(f"{ColoredTextTestResult.YELLOW}⊘ Skipped:{ColoredTextTestResult.RESET} {skipped}")
# Success rate
if total > 0:
success_rate = (passed / total) * 100
color = ColoredTextTestResult.GREEN if success_rate == 100 else \
ColoredTextTestResult.YELLOW if success_rate >= 80 else \
ColoredTextTestResult.RED
print(f"\n{color}Success Rate: {success_rate:.1f}%{ColoredTextTestResult.RESET}")
# Category breakdown
if hasattr(result, 'test_results'):
print(f"\n{ColoredTextTestResult.BOLD}Test Breakdown by Category:{ColoredTextTestResult.RESET}")
categories = {}
for status, test in result.test_results:
# Group by the TestCase class name (splitting str(test) on '.' would
# yield the package prefix, e.g. "(tests", not the class)
class_name = type(test).__name__
if class_name not in categories:
categories[class_name] = {'PASS': 0, 'FAIL': 0, 'ERROR': 0, 'SKIP': 0}
categories[class_name][status] += 1
for category, stats in sorted(categories.items()):
total_cat = sum(stats.values())
passed_cat = stats['PASS']
print(f" {category}: {passed_cat}/{total_cat} passed")
print("\n" + "="*70)
# Return status
return failed == 0 and errors == 0
def main():
"""Main test runner"""
import argparse
parser = argparse.ArgumentParser(
description='Run tests for Skill Seeker',
formatter_class=argparse.RawDescriptionHelpFormatter
)
parser.add_argument('--suite', '-s', type=str,
help='Run specific test suite (config, features, integration)')
parser.add_argument('--verbose', '-v', action='store_true',
help='Verbose output (show each test)')
parser.add_argument('--quiet', '-q', action='store_true',
help='Quiet output (minimal output)')
parser.add_argument('--failfast', '-f', action='store_true',
help='Stop on first failure')
parser.add_argument('--list', '-l', action='store_true',
help='List all available tests')
args = parser.parse_args()
# Set verbosity
verbosity = 1
if args.verbose:
verbosity = 2
elif args.quiet:
verbosity = 0
print(f"\n{ColoredTextTestResult.BOLD}{'='*70}{ColoredTextTestResult.RESET}")
print(f"{ColoredTextTestResult.BOLD}SKILL SEEKER TEST SUITE{ColoredTextTestResult.RESET}")
print(f"{ColoredTextTestResult.BOLD}{'='*70}{ColoredTextTestResult.RESET}\n")
# Discover or load specific suite
if args.suite:
print(f"Running test suite: {ColoredTextTestResult.BLUE}{args.suite}{ColoredTextTestResult.RESET}\n")
suite = run_specific_suite(args.suite)
if suite is None:
return 1
else:
print(f"Running {ColoredTextTestResult.BLUE}all tests{ColoredTextTestResult.RESET}\n")
suite = discover_tests()
# List tests
if args.list:
print("\nAvailable tests:\n")
for test_group in suite:
for test in test_group:
print(f" - {test}")
print()
return 0
# Run tests
runner = ColoredTextTestRunner(
verbosity=verbosity,
failfast=args.failfast
)
result = runner.run(suite)
# Print summary
success = print_summary(result)
# Return appropriate exit code
return 0 if success else 1
if __name__ == '__main__':
sys.exit(main())
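
Both the `--list` output and the per-category breakdown depend on walking a `unittest` suite, which can nest to any depth. The following is a small sketch of a recursive walk that tallies tests by their `TestCase` class. The names `ExampleTests` and `tally_by_class` are made up for this example and are not part of the runner.

```python
import unittest
from collections import Counter


class ExampleTests(unittest.TestCase):
    """Throwaway test case used only to build a suite for the demo."""

    def test_one(self):
        self.assertTrue(True)

    def test_two(self):
        self.assertEqual(1 + 1, 2)


def tally_by_class(suite):
    """Count tests per TestCase class, descending into nested suites."""
    counts = Counter()

    def walk(item):
        if isinstance(item, unittest.TestSuite):
            for child in item:
                walk(child)
        else:
            counts[type(item).__name__] += 1

    walk(suite)
    return counts


suite = unittest.TestLoader().loadTestsFromTestCase(ExampleTests)
counts = tally_by_class(suite)
```

Using `type(test).__name__` avoids parsing `str(test)`, whose exact format differs between Python versions.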

View File

@ -0,0 +1,320 @@
#!/usr/bin/env python3
"""
Config Splitter for Large Documentation Sites
Splits large documentation configs into multiple smaller, focused skill configs.
Supports multiple splitting strategies: category-based, size-based, and automatic.
"""
import json
import sys
import argparse
from pathlib import Path
from typing import Dict, List, Any, Tuple
from collections import defaultdict
class ConfigSplitter:
"""Splits large documentation configs into multiple focused configs"""
def __init__(self, config_path: str, strategy: str = "auto", target_pages: int = 5000):
self.config_path = Path(config_path)
self.strategy = strategy
self.target_pages = target_pages
self.config = self.load_config()
self.base_name = self.config['name']
def load_config(self) -> Dict[str, Any]:
"""Load configuration from file"""
try:
with open(self.config_path, 'r') as f:
return json.load(f)
except FileNotFoundError:
print(f"❌ Error: Config file not found: {self.config_path}")
sys.exit(1)
except json.JSONDecodeError as e:
print(f"❌ Error: Invalid JSON in config file: {e}")
sys.exit(1)
def get_split_strategy(self) -> str:
"""Determine split strategy"""
# Check if strategy is defined in config
if 'split_strategy' in self.config:
config_strategy = self.config['split_strategy']
if config_strategy != "none":
return config_strategy
# Use provided strategy or auto-detect
if self.strategy == "auto":
max_pages = self.config.get('max_pages', 500)
if max_pages < 5000:
print(f" Small documentation ({max_pages} pages) - no splitting needed")
return "none"
elif max_pages < 10000 and 'categories' in self.config:
print(f" Medium documentation ({max_pages} pages) - category split recommended")
return "category"
elif 'categories' in self.config and len(self.config['categories']) >= 3:
print(f" Large documentation ({max_pages} pages) - router + categories recommended")
return "router"
else:
print(f" Large documentation ({max_pages} pages) - size-based split")
return "size"
return self.strategy
def split_by_category(self, create_router: bool = False) -> List[Dict[str, Any]]:
"""Split config by categories"""
if 'categories' not in self.config:
print("❌ Error: No categories defined in config")
sys.exit(1)
categories = self.config['categories']
split_categories = self.config.get('split_config', {}).get('split_by_categories')
# If specific categories specified, use only those
if split_categories:
categories = {k: v for k, v in categories.items() if k in split_categories}
configs = []
for category_name, keywords in categories.items():
# Create new config for this category
new_config = self.config.copy()
new_config['name'] = f"{self.base_name}-{category_name}"
new_config['description'] = f"{self.base_name.capitalize()} - {category_name.replace('_', ' ').title()}. {self.config.get('description', '')}"
# Update URL patterns to focus on this category.
# self.config.copy() is shallow, so copy the nested dict and list before
# mutating them; otherwise every child shares the parent's includes
url_patterns = dict(new_config.get('url_patterns', {}))
# Add category keywords to includes
includes = list(url_patterns.get('include', []))
for keyword in keywords:
if keyword.startswith('/'):
includes.append(keyword)
if includes:
url_patterns['include'] = sorted(set(includes))
new_config['url_patterns'] = url_patterns
# Keep only this category
new_config['categories'] = {category_name: keywords}
# Remove split config from child
if 'split_strategy' in new_config:
del new_config['split_strategy']
if 'split_config' in new_config:
del new_config['split_config']
# Adjust max_pages estimate
if 'max_pages' in new_config:
new_config['max_pages'] = self.target_pages
configs.append(new_config)
print(f"✅ Created {len(configs)} category-based configs")
# Optionally create router config
if create_router:
router_config = self.create_router_config(configs)
configs.insert(0, router_config)
print(f"✅ Created router config: {router_config['name']}")
return configs
def split_by_size(self) -> List[Dict[str, Any]]:
"""Split config by size (page count)"""
max_pages = self.config.get('max_pages', 500)
num_splits = (max_pages + self.target_pages - 1) // self.target_pages
configs = []
for i in range(num_splits):
new_config = self.config.copy()
part_num = i + 1
new_config['name'] = f"{self.base_name}-part{part_num}"
new_config['description'] = f"{self.base_name.capitalize()} - Part {part_num}. {self.config.get('description', '')}"
new_config['max_pages'] = self.target_pages
# Remove split config from child
if 'split_strategy' in new_config:
del new_config['split_strategy']
if 'split_config' in new_config:
del new_config['split_config']
configs.append(new_config)
print(f"✅ Created {len(configs)} size-based configs ({self.target_pages} pages each)")
return configs
def create_router_config(self, sub_configs: List[Dict[str, Any]]) -> Dict[str, Any]:
"""Create a router config that references sub-skills"""
router_name = self.config.get('split_config', {}).get('router_name', self.base_name)
router_config = {
"name": router_name,
"description": self.config.get('description', ''),
"base_url": self.config['base_url'],
"selectors": self.config['selectors'],
"url_patterns": self.config.get('url_patterns', {}),
"rate_limit": self.config.get('rate_limit', 0.5),
"max_pages": 500, # Router only needs overview pages
"_router": True,
"_sub_skills": [cfg['name'] for cfg in sub_configs],
"_routing_keywords": {
cfg['name']: list(cfg.get('categories', {}).keys())
for cfg in sub_configs
}
}
return router_config
def split(self) -> List[Dict[str, Any]]:
"""Execute split based on strategy"""
strategy = self.get_split_strategy()
print(f"\n{'='*60}")
print(f"CONFIG SPLITTER: {self.base_name}")
print(f"{'='*60}")
print(f"Strategy: {strategy}")
print(f"Target pages per skill: {self.target_pages}")
print("")
if strategy == "none":
print(" No splitting required")
return [self.config]
elif strategy == "category":
return self.split_by_category(create_router=False)
elif strategy == "router":
create_router = self.config.get('split_config', {}).get('create_router', True)
return self.split_by_category(create_router=create_router)
elif strategy == "size":
return self.split_by_size()
else:
print(f"❌ Error: Unknown strategy: {strategy}")
sys.exit(1)
def save_configs(self, configs: List[Dict[str, Any]], output_dir: Path = None) -> List[Path]:
"""Save configs to files"""
if output_dir is None:
output_dir = self.config_path.parent
output_dir = Path(output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
saved_files = []
for config in configs:
filename = f"{config['name']}.json"
filepath = output_dir / filename
with open(filepath, 'w') as f:
json.dump(config, f, indent=2)
saved_files.append(filepath)
print(f" 💾 Saved: {filepath}")
return saved_files
def main():
parser = argparse.ArgumentParser(
description="Split large documentation configs into multiple focused skills",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Auto-detect strategy
python3 split_config.py configs/godot.json
# Use category-based split
python3 split_config.py configs/godot.json --strategy category
# Use router + categories
python3 split_config.py configs/godot.json --strategy router
# Custom target size
python3 split_config.py configs/godot.json --target-pages 3000
# Dry run (don't save files)
python3 split_config.py configs/godot.json --dry-run
Split Strategies:
none - No splitting (single skill)
auto - Automatically choose best strategy
category - Split by categories defined in config
router - Create router + category-based sub-skills
size - Split by page count
"""
)
parser.add_argument(
'config',
help='Path to config file (e.g., configs/godot.json)'
)
parser.add_argument(
'--strategy',
choices=['auto', 'none', 'category', 'router', 'size'],
default='auto',
help='Splitting strategy (default: auto)'
)
parser.add_argument(
'--target-pages',
type=int,
default=5000,
help='Target pages per skill (default: 5000)'
)
parser.add_argument(
'--output-dir',
help='Output directory for configs (default: same as input)'
)
parser.add_argument(
'--dry-run',
action='store_true',
help='Show what would be created without saving files'
)
args = parser.parse_args()
# Create splitter
splitter = ConfigSplitter(args.config, args.strategy, args.target_pages)
# Split config
configs = splitter.split()
if args.dry_run:
print(f"\n{'='*60}")
print("DRY RUN - No files saved")
print(f"{'='*60}")
print(f"Would create {len(configs)} config files:")
for cfg in configs:
is_router = cfg.get('_router', False)
router_marker = " (ROUTER)" if is_router else ""
print(f" 📄 {cfg['name']}.json{router_marker}")
else:
print(f"\n{'='*60}")
print("SAVING CONFIGS")
print(f"{'='*60}")
saved_files = splitter.save_configs(configs, args.output_dir)
print(f"\n{'='*60}")
print("NEXT STEPS")
print(f"{'='*60}")
print("1. Review generated configs")
print("2. Scrape each config:")
for filepath in saved_files:
print(f" skill-seekers scrape --config {filepath}")
print("3. Package skills:")
print(" skill-seekers-package-multi configs/<name>-*.json")
print("")
if __name__ == "__main__":
main()
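
`split_by_category` and `split_by_size` derive each child config with `dict.copy()`, which is a shallow copy: nested containers such as `url_patterns['include']` stay shared with the parent unless they are copied explicitly. The sketch below shows the difference on a toy config. The dictionary contents are invented for illustration.

```python
import copy

# Toy parent config; the keys mirror the real ones, the values are made up.
parent = {"name": "docs", "url_patterns": {"include": ["/api"]}}

# Shallow copy: the nested dict and list are the same objects as in parent.
shallow = parent.copy()
shallow["url_patterns"]["include"].append("/guide")
parent_after_shallow = list(parent["url_patterns"]["include"])

# Deep copy: nested containers are duplicated, so parent is left alone.
deep = copy.deepcopy(parent)
deep["url_patterns"]["include"].append("/tutorial")
parent_after_deep = list(parent["url_patterns"]["include"])
```

`copy.deepcopy` is the broadest fix; copying only the nested dict and list that get mutated is cheaper when the config is large.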

View File

@ -0,0 +1,192 @@
#!/usr/bin/env python3
"""
Simple Integration Tests for Unified Multi-Source Scraper
Focuses on real-world usage patterns rather than unit tests.
"""
import os
import sys
import json
import tempfile
from pathlib import Path
# Add CLI to path
sys.path.insert(0, str(Path(__file__).parent))
from .config_validator import validate_config
def test_validate_existing_unified_configs():
"""Test that all existing unified configs are valid"""
configs_dir = Path(__file__).parent.parent / 'configs'
unified_configs = [
'godot_unified.json',
'react_unified.json',
'django_unified.json',
'fastapi_unified.json'
]
for config_name in unified_configs:
config_path = configs_dir / config_name
if config_path.exists():
print(f"\n✓ Validating {config_name}...")
validator = validate_config(str(config_path))
assert validator.is_unified, f"{config_name} should be unified format"
assert validator.needs_api_merge(), f"{config_name} should need API merging"
print(f" Sources: {len(validator.config['sources'])}")
print(f" Merge mode: {validator.config.get('merge_mode')}")
def test_backward_compatibility():
"""Test that legacy configs still work"""
configs_dir = Path(__file__).parent.parent / 'configs'
legacy_configs = [
'react.json',
'godot.json',
'django.json'
]
for config_name in legacy_configs:
config_path = configs_dir / config_name
if config_path.exists():
print(f"\n✓ Validating legacy {config_name}...")
validator = validate_config(str(config_path))
assert not validator.is_unified, f"{config_name} should be legacy format"
print(f" Format: Legacy")
def test_create_temp_unified_config():
"""Test creating a unified config from scratch"""
config = {
"name": "test_unified",
"description": "Test unified config",
"merge_mode": "rule-based",
"sources": [
{
"type": "documentation",
"base_url": "https://example.com/docs",
"extract_api": True,
"max_pages": 50
},
{
"type": "github",
"repo": "test/repo",
"include_code": True,
"code_analysis_depth": "surface"
}
]
}
with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
json.dump(config, f)
config_path = f.name
try:
print("\n✓ Validating temp unified config...")
validator = validate_config(config_path)
assert validator.is_unified
assert validator.needs_api_merge()
assert len(validator.config['sources']) == 2
print(" ✓ Config is valid unified format")
print(f" Sources: {len(validator.config['sources'])}")
finally:
os.unlink(config_path)
def test_mixed_source_types():
"""Test config with documentation, GitHub, and PDF sources"""
config = {
"name": "test_mixed",
"description": "Test mixed sources",
"merge_mode": "rule-based",
"sources": [
{
"type": "documentation",
"base_url": "https://example.com"
},
{
"type": "github",
"repo": "test/repo"
},
{
"type": "pdf",
"path": "/path/to/manual.pdf"
}
]
}
with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
json.dump(config, f)
config_path = f.name
try:
print("\n✓ Validating mixed source types...")
validator = validate_config(config_path)
assert validator.is_unified
assert len(validator.config['sources']) == 3
# Check each source type
source_types = [s['type'] for s in validator.config['sources']]
assert 'documentation' in source_types
assert 'github' in source_types
assert 'pdf' in source_types
print(" ✓ All 3 source types validated")
finally:
os.unlink(config_path)
def test_config_validation_errors():
"""Test that invalid configs are rejected"""
# Invalid source type
config = {
"name": "test",
"description": "Test",
"sources": [
{"type": "invalid_type", "url": "https://example.com"}
]
}
with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
json.dump(config, f)
config_path = f.name
try:
print("\n✓ Testing invalid source type...")
try:
# validate_config() calls .validate() automatically
validator = validate_config(config_path)
assert False, "Should have raised error for invalid source type"
except ValueError as e:
assert "Invalid" in str(e) or "invalid" in str(e)
print(" ✓ Invalid source type correctly rejected")
finally:
os.unlink(config_path)
# Run tests
if __name__ == '__main__':
print("=" * 60)
print("Running Unified Scraper Integration Tests")
print("=" * 60)
try:
test_validate_existing_unified_configs()
test_backward_compatibility()
test_create_temp_unified_config()
test_mixed_source_types()
test_config_validation_errors()
print("\n" + "=" * 60)
print("✅ All integration tests passed!")
print("=" * 60)
except AssertionError as e:
print(f"\n❌ Test failed: {e}")
sys.exit(1)
except Exception as e:
print(f"\n❌ Unexpected error: {e}")
import traceback
traceback.print_exc()
sys.exit(1)
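
The three temp-config tests above repeat the same write, validate, unlink pattern. A minimal sketch of how that pattern could be factored out is shown here. The helper name `temp_config_file` is hypothetical and is not part of the project. It only handles the temporary file, so the call to `validate_config` would still go inside the `with` block.

```python
import json
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_config_file(config):
    """Write `config` to a temporary JSON file and delete it on exit.

    Hypothetical helper, shown only to illustrate the pattern.
    """
    # delete=False lets the file be reopened by path after it is closed,
    # which also works on Windows.
    with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f:
        json.dump(config, f)
        path = f.name
    try:
        yield path
    finally:
        # Runs even if the body of the `with` block raises.
        os.unlink(path)


# Usage: the file exists only inside the `with` block.
sample = {"name": "test_unified", "description": "Test", "sources": []}
with temp_config_file(sample) as config_path:
    with open(config_path) as f:
        loaded = json.load(f)
```

Because cleanup lives in the `finally` clause of the helper, each test body would shrink to the assertions that are specific to it.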

@ -0,0 +1,450 @@
#!/usr/bin/env python3
"""
Unified Multi-Source Scraper
Orchestrates scraping from multiple sources (documentation, GitHub, PDF),
detects conflicts, merges intelligently, and builds unified skills.
This is the main entry point for unified config workflow.
Usage:
skill-seekers unified --config configs/godot_unified.json
skill-seekers unified --config configs/react_unified.json --merge-mode claude-enhanced
"""
import os
import sys
import json
import logging
import argparse
import subprocess
from pathlib import Path
from typing import Dict, List, Any, Optional
# Import validators and scrapers
try:
from config_validator import ConfigValidator, validate_config
from conflict_detector import ConflictDetector
from merge_sources import RuleBasedMerger, ClaudeEnhancedMerger
from unified_skill_builder import UnifiedSkillBuilder
except ImportError as e:
print(f"Error importing modules: {e}")
print("Make sure you're running from the project root directory")
sys.exit(1)
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
class UnifiedScraper:
"""
Orchestrates multi-source scraping and merging.
Main workflow:
1. Load and validate unified config
2. Scrape all sources (docs, GitHub, PDF)
3. Detect conflicts between sources
4. Merge intelligently (rule-based or Claude-enhanced)
5. Build unified skill
"""
def __init__(self, config_path: str, merge_mode: Optional[str] = None):
"""
Initialize unified scraper.
Args:
config_path: Path to unified config JSON
merge_mode: Override config merge_mode ('rule-based' or 'claude-enhanced')
"""
self.config_path = config_path
# Validate and load config
logger.info(f"Loading config: {config_path}")
self.validator = validate_config(config_path)
self.config = self.validator.config
# Determine merge mode
self.merge_mode = merge_mode or self.config.get('merge_mode', 'rule-based')
logger.info(f"Merge mode: {self.merge_mode}")
# Storage for scraped data
self.scraped_data = {}
# Output paths
self.name = self.config['name']
self.output_dir = f"output/{self.name}"
self.data_dir = f"output/{self.name}_unified_data"
os.makedirs(self.output_dir, exist_ok=True)
os.makedirs(self.data_dir, exist_ok=True)
def scrape_all_sources(self):
"""
Scrape all configured sources.
Routes to appropriate scraper based on source type.
"""
logger.info("=" * 60)
logger.info("PHASE 1: Scraping all sources")
logger.info("=" * 60)
if not self.validator.is_unified:
logger.warning("Config is not unified format, converting...")
self.config = self.validator.convert_legacy_to_unified()
sources = self.config.get('sources', [])
for i, source in enumerate(sources):
source_type = source['type']
logger.info(f"\n[{i+1}/{len(sources)}] Scraping {source_type} source...")
try:
if source_type == 'documentation':
self._scrape_documentation(source)
elif source_type == 'github':
self._scrape_github(source)
elif source_type == 'pdf':
self._scrape_pdf(source)
else:
logger.warning(f"Unknown source type: {source_type}")
except Exception as e:
logger.error(f"Error scraping {source_type}: {e}")
logger.info("Continuing with other sources...")
logger.info(f"\n✅ Scraped {len(self.scraped_data)} sources successfully")
def _scrape_documentation(self, source: Dict[str, Any]):
"""Scrape documentation website."""
# Create temporary config for doc scraper
doc_config = {
'name': f"{self.name}_docs",
'base_url': source['base_url'],
'selectors': source.get('selectors', {}),
'url_patterns': source.get('url_patterns', {}),
'categories': source.get('categories', {}),
'rate_limit': source.get('rate_limit', 0.5),
'max_pages': source.get('max_pages', 100)
}
# Write temporary config
temp_config_path = os.path.join(self.data_dir, 'temp_docs_config.json')
with open(temp_config_path, 'w') as f:
json.dump(doc_config, f, indent=2)
# Run doc_scraper as subprocess
logger.info(f"Scraping documentation from {source['base_url']}")
doc_scraper_path = Path(__file__).parent / "doc_scraper.py"
cmd = [sys.executable, str(doc_scraper_path), '--config', temp_config_path]
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
logger.error(f"Documentation scraping failed: {result.stderr}")
return
# Load scraped data
docs_data_file = f"output/{doc_config['name']}_data/summary.json"
if os.path.exists(docs_data_file):
with open(docs_data_file, 'r') as f:
summary = json.load(f)
self.scraped_data['documentation'] = {
'pages': summary.get('pages', []),
'data_file': docs_data_file
}
logger.info(f"✅ Documentation: {summary.get('total_pages', 0)} pages scraped")
else:
logger.warning("Documentation data file not found")
# Clean up temp config
if os.path.exists(temp_config_path):
os.remove(temp_config_path)
def _scrape_github(self, source: Dict[str, Any]):
"""Scrape GitHub repository."""
sys.path.insert(0, str(Path(__file__).parent))
try:
from github_scraper import GitHubScraper
except ImportError:
logger.error("github_scraper.py not found")
return
# Create config for GitHub scraper
github_config = {
'repo': source['repo'],
'name': f"{self.name}_github",
'github_token': source.get('github_token'),
'include_issues': source.get('include_issues', True),
'max_issues': source.get('max_issues', 100),
'include_changelog': source.get('include_changelog', True),
'include_releases': source.get('include_releases', True),
'include_code': source.get('include_code', True),
'code_analysis_depth': source.get('code_analysis_depth', 'surface'),
'file_patterns': source.get('file_patterns', []),
'local_repo_path': source.get('local_repo_path') # Pass local_repo_path from config
}
# Scrape
logger.info(f"Scraping GitHub repository: {source['repo']}")
scraper = GitHubScraper(github_config)
github_data = scraper.scrape()
# Save data
github_data_file = os.path.join(self.data_dir, 'github_data.json')
with open(github_data_file, 'w') as f:
json.dump(github_data, f, indent=2, ensure_ascii=False)
self.scraped_data['github'] = {
'data': github_data,
'data_file': github_data_file
}
logger.info(f"✅ GitHub: Repository scraped successfully")
def _scrape_pdf(self, source: Dict[str, Any]):
"""Scrape PDF document."""
sys.path.insert(0, str(Path(__file__).parent))
try:
from pdf_scraper import PDFToSkillConverter
except ImportError:
logger.error("pdf_scraper.py not found")
return
# Create config for PDF scraper
pdf_config = {
'name': f"{self.name}_pdf",
'pdf': source['path'],
'extract_tables': source.get('extract_tables', False),
'ocr': source.get('ocr', False),
'password': source.get('password')
}
# Scrape
logger.info(f"Scraping PDF: {source['path']}")
converter = PDFToSkillConverter(pdf_config)
pdf_data = converter.extract_all()
# Save data
pdf_data_file = os.path.join(self.data_dir, 'pdf_data.json')
with open(pdf_data_file, 'w') as f:
json.dump(pdf_data, f, indent=2, ensure_ascii=False)
self.scraped_data['pdf'] = {
'data': pdf_data,
'data_file': pdf_data_file
}
logger.info(f"✅ PDF: {len(pdf_data.get('pages', []))} pages extracted")
def detect_conflicts(self) -> List:
"""
Detect conflicts between documentation and code.
Only applicable if both documentation and GitHub sources exist.
Returns:
List of conflicts
"""
logger.info("\n" + "=" * 60)
logger.info("PHASE 2: Detecting conflicts")
logger.info("=" * 60)
if not self.validator.needs_api_merge():
logger.info("No API merge needed (only one API source)")
return []
# Get documentation and GitHub data
docs_data = self.scraped_data.get('documentation', {})
github_data = self.scraped_data.get('github', {})
if not docs_data or not github_data:
logger.warning("Missing documentation or GitHub data for conflict detection")
return []
# Load data files
with open(docs_data['data_file'], 'r') as f:
docs_json = json.load(f)
with open(github_data['data_file'], 'r') as f:
github_json = json.load(f)
# Detect conflicts
detector = ConflictDetector(docs_json, github_json)
conflicts = detector.detect_all_conflicts()
# Save conflicts
conflicts_file = os.path.join(self.data_dir, 'conflicts.json')
detector.save_conflicts(conflicts, conflicts_file)
# Print summary
summary = detector.generate_summary(conflicts)
logger.info(f"\n📊 Conflict Summary:")
logger.info(f" Total: {summary['total']}")
logger.info(f" By Type:")
for ctype, count in summary['by_type'].items():
if count > 0:
logger.info(f" - {ctype}: {count}")
logger.info(f" By Severity:")
for severity, count in summary['by_severity'].items():
if count > 0:
emoji = '🔴' if severity == 'high' else '🟡' if severity == 'medium' else '🟢'
logger.info(f" {emoji} {severity}: {count}")
return conflicts
def merge_sources(self, conflicts: List):
"""
Merge data from multiple sources.
Args:
conflicts: List of detected conflicts
"""
logger.info("\n" + "=" * 60)
logger.info(f"PHASE 3: Merging sources ({self.merge_mode})")
logger.info("=" * 60)
if not conflicts:
logger.info("No conflicts to merge")
return None
# Get data files
docs_data = self.scraped_data.get('documentation', {})
github_data = self.scraped_data.get('github', {})
# Load data
with open(docs_data['data_file'], 'r') as f:
docs_json = json.load(f)
with open(github_data['data_file'], 'r') as f:
github_json = json.load(f)
# Choose merger
if self.merge_mode == 'claude-enhanced':
merger = ClaudeEnhancedMerger(docs_json, github_json, conflicts)
else:
merger = RuleBasedMerger(docs_json, github_json, conflicts)
# Merge
merged_data = merger.merge_all()
# Save merged data
merged_file = os.path.join(self.data_dir, 'merged_data.json')
with open(merged_file, 'w') as f:
json.dump(merged_data, f, indent=2, ensure_ascii=False)
logger.info(f"✅ Merged data saved: {merged_file}")
return merged_data
def build_skill(self, merged_data: Optional[Dict] = None):
"""
Build final unified skill.
Args:
merged_data: Merged API data (if conflicts were resolved)
"""
logger.info("\n" + "=" * 60)
logger.info("PHASE 4: Building unified skill")
logger.info("=" * 60)
# Load conflicts if they exist
conflicts = []
conflicts_file = os.path.join(self.data_dir, 'conflicts.json')
if os.path.exists(conflicts_file):
with open(conflicts_file, 'r') as f:
conflicts_data = json.load(f)
conflicts = conflicts_data.get('conflicts', [])
# Build skill
builder = UnifiedSkillBuilder(
self.config,
self.scraped_data,
merged_data,
conflicts
)
builder.build()
logger.info(f"✅ Unified skill built: {self.output_dir}/")
def run(self):
"""
Execute complete unified scraping workflow.
"""
logger.info("\n" + "🚀 " * 20)
logger.info(f"Unified Scraper: {self.config['name']}")
logger.info("🚀 " * 20 + "\n")
try:
# Phase 1: Scrape all sources
self.scrape_all_sources()
# Phase 2: Detect conflicts (if applicable)
conflicts = self.detect_conflicts()
# Phase 3: Merge sources (if conflicts exist)
merged_data = None
if conflicts:
merged_data = self.merge_sources(conflicts)
# Phase 4: Build skill
self.build_skill(merged_data)
logger.info("\n" + "" * 20)
logger.info("Unified scraping complete!")
logger.info("" * 20 + "\n")
logger.info(f"📁 Output: {self.output_dir}/")
logger.info(f"📁 Data: {self.data_dir}/")
except KeyboardInterrupt:
logger.info("\n\n⚠️ Scraping interrupted by user")
sys.exit(1)
except Exception as e:
logger.error(f"\n\n❌ Error during scraping: {e}")
import traceback
traceback.print_exc()
sys.exit(1)
def main():
"""Main entry point."""
parser = argparse.ArgumentParser(
description='Unified multi-source scraper',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Basic usage with unified config
skill-seekers unified --config configs/godot_unified.json
# Override merge mode
skill-seekers unified --config configs/react_unified.json --merge-mode claude-enhanced
# Backward compatible with legacy configs
skill-seekers unified --config configs/react.json
"""
)
parser.add_argument('--config', '-c', required=True,
help='Path to unified config JSON file')
parser.add_argument('--merge-mode', '-m',
choices=['rule-based', 'claude-enhanced'],
help='Override config merge mode')
args = parser.parse_args()
# Create and run scraper
scraper = UnifiedScraper(args.config, args.merge_mode)
scraper.run()
if __name__ == '__main__':
main()
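
The routing in `scrape_all_sources` (pick a scraper by source type, log a failure, continue with the remaining sources) can also be expressed as a dispatch table. The following is a sketch under stated assumptions, not the project's implementation: `scrape_sources` and the demo handlers are hypothetical stand-ins for `_scrape_documentation`, `_scrape_github`, and `_scrape_pdf`.

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger(__name__)

Handler = Callable[[Dict[str, Any]], Any]


def scrape_sources(sources: List[Dict[str, Any]],
                   handlers: Dict[str, Handler]) -> Dict[str, Any]:
    """Run the handler registered for each source type.

    A failing or unknown source is logged and skipped so that the
    remaining sources are still processed.
    """
    results: Dict[str, Any] = {}
    for i, source in enumerate(sources, start=1):
        source_type = source.get("type")
        handler = handlers.get(source_type)
        if handler is None:
            logger.warning("Unknown source type: %s", source_type)
            continue
        try:
            # Keyed by type, as in the original: a second source of the
            # same type would overwrite the first.
            results[source_type] = handler(source)
        except Exception as exc:
            logger.error("[%d/%d] Error scraping %s: %s",
                         i, len(sources), source_type, exc)
    return results


def _failing_handler(source: Dict[str, Any]) -> Any:
    # Simulates a scraper that cannot reach its source.
    raise RuntimeError("network unavailable")


demo_handlers = {
    "documentation": lambda s: {"pages": [s["base_url"]]},
    "github": _failing_handler,
}
demo_sources = [
    {"type": "documentation", "base_url": "https://example.com/docs"},
    {"type": "github", "repo": "test/repo"},
    {"type": "pdf", "path": "/path/to/manual.pdf"},
]
demo_results = scrape_sources(demo_sources, demo_handlers)
```

In this run only the documentation source produces a result: the GitHub handler raises and is logged, and the PDF source has no registered handler.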

@ -0,0 +1,444 @@
#!/usr/bin/env python3
"""
Unified Skill Builder
Generates final skill structure from merged multi-source data:
- SKILL.md with merged APIs and conflict warnings
- references/ with organized content by source
- Inline conflict markers (⚠️)
- Separate conflicts summary section
Supports mixed sources (documentation, GitHub, PDF) and highlights
discrepancies transparently.
"""
import os
import json
import logging
from pathlib import Path
from typing import Dict, List, Any, Optional
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class UnifiedSkillBuilder:
"""
Builds unified skill from multi-source data.
"""
def __init__(self, config: Dict, scraped_data: Dict,
merged_data: Optional[Dict] = None, conflicts: Optional[List] = None):
"""
Initialize skill builder.
Args:
config: Unified config dict
scraped_data: Dict of scraped data by source type
merged_data: Merged API data (if conflicts were resolved)
conflicts: List of detected conflicts
"""
self.config = config
self.scraped_data = scraped_data
self.merged_data = merged_data
self.conflicts = conflicts or []
self.name = config['name']
self.description = config['description']
self.skill_dir = f"output/{self.name}"
# Create directories
os.makedirs(self.skill_dir, exist_ok=True)
os.makedirs(f"{self.skill_dir}/references", exist_ok=True)
os.makedirs(f"{self.skill_dir}/scripts", exist_ok=True)
os.makedirs(f"{self.skill_dir}/assets", exist_ok=True)
def build(self):
"""Build complete skill structure."""
logger.info(f"Building unified skill: {self.name}")
# Generate main SKILL.md
self._generate_skill_md()
# Generate reference files by source
self._generate_references()
# Generate conflicts report (if any)
if self.conflicts:
self._generate_conflicts_report()
logger.info(f"✅ Unified skill built: {self.skill_dir}/")
def _generate_skill_md(self):
"""Generate main SKILL.md file."""
skill_path = os.path.join(self.skill_dir, 'SKILL.md')
# Generate skill name (lowercase, hyphens only, max 64 chars)
skill_name = self.name.lower().replace('_', '-').replace(' ', '-')[:64]
# Truncate description to 1024 chars if needed
desc = self.description[:1024]
content = f"""---
name: {skill_name}
description: {desc}
---
# {self.name.title()}
{self.description}
## 📚 Sources
This skill combines knowledge from multiple sources:
"""
# List sources
for source in self.config.get('sources', []):
source_type = source['type']
if source_type == 'documentation':
content += f"- ✅ **Documentation**: {source.get('base_url', 'N/A')}\n"
content += f" - Pages: {source.get('max_pages', 'unlimited')}\n"
elif source_type == 'github':
content += f"- ✅ **GitHub Repository**: {source.get('repo', 'N/A')}\n"
content += f" - Code Analysis: {source.get('code_analysis_depth', 'surface')}\n"
content += f" - Issues: {source.get('max_issues', 0)}\n"
elif source_type == 'pdf':
content += f"- ✅ **PDF Document**: {source.get('path', 'N/A')}\n"
# Data quality section
if self.conflicts:
content += f"\n## ⚠️ Data Quality\n\n"
content += f"**{len(self.conflicts)} conflicts detected** between sources.\n\n"
# Count by type
by_type = {}
for conflict in self.conflicts:
ctype = conflict.type if hasattr(conflict, 'type') else conflict.get('type', 'unknown')
by_type[ctype] = by_type.get(ctype, 0) + 1
content += "**Conflict Breakdown:**\n"
for ctype, count in by_type.items():
content += f"- {ctype}: {count}\n"
content += f"\nSee `references/conflicts.md` for detailed conflict information.\n"
# Merged API section (if available)
if self.merged_data:
content += self._format_merged_apis()
# Quick reference from each source
content += "\n## 📖 Reference Documentation\n\n"
content += "Organized by source:\n\n"
for source in self.config.get('sources', []):
source_type = source['type']
content += f"- [{source_type.title()}](references/{source_type}/)\n"
# When to use this skill
content += f"\n## 💡 When to Use This Skill\n\n"
content += f"Use this skill when you need to:\n"
content += f"- Understand how to use {self.name}\n"
content += f"- Look up API documentation\n"
content += f"- Find usage examples\n"
if 'github' in self.scraped_data:
content += f"- Check for known issues or recent changes\n"
content += f"- Review release history\n"
content += "\n---\n\n"
content += "*Generated by Skill Seeker's unified multi-source scraper*\n"
with open(skill_path, 'w', encoding='utf-8') as f:
f.write(content)
logger.info(f"Created SKILL.md")
def _format_merged_apis(self) -> str:
"""Format merged APIs section with inline conflict warnings."""
if not self.merged_data:
return ""
content = "\n## 🔧 API Reference\n\n"
content += "*Merged from documentation and code analysis*\n\n"
apis = self.merged_data.get('apis', {})
if not apis:
return content + "*No APIs to display*\n"
# Group APIs by status
matched = {k: v for k, v in apis.items() if v.get('status') == 'matched'}
conflicts = {k: v for k, v in apis.items() if v.get('status') == 'conflict'}
docs_only = {k: v for k, v in apis.items() if v.get('status') == 'docs_only'}
code_only = {k: v for k, v in apis.items() if v.get('status') == 'code_only'}
# Show matched APIs first
if matched:
content += "### ✅ Verified APIs\n\n"
content += "*Documentation and code agree*\n\n"
for api_name, api_data in list(matched.items())[:10]: # Limit to first 10
content += self._format_api_entry(api_data, inline_conflict=False)
# Show conflicting APIs with warnings
if conflicts:
content += "\n### ⚠️ APIs with Conflicts\n\n"
content += "*Documentation and code differ*\n\n"
for api_name, api_data in list(conflicts.items())[:10]:
content += self._format_api_entry(api_data, inline_conflict=True)
# Show undocumented APIs
if code_only:
content += f"\n### 💻 Undocumented APIs\n\n"
content += f"*Found in code but not in documentation ({len(code_only)} total)*\n\n"
for api_name, api_data in list(code_only.items())[:5]:
content += self._format_api_entry(api_data, inline_conflict=False)
# Show removed/missing APIs
if docs_only:
content += f"\n### 📖 Documentation-Only APIs\n\n"
content += f"*Documented but not found in code ({len(docs_only)} total)*\n\n"
for api_name, api_data in list(docs_only.items())[:5]:
content += self._format_api_entry(api_data, inline_conflict=False)
content += f"\n*See references/api/ for complete API documentation*\n"
return content
def _format_api_entry(self, api_data: Dict, inline_conflict: bool = False) -> str:
"""Format a single API entry."""
name = api_data.get('name', 'Unknown')
signature = api_data.get('merged_signature', name)
description = api_data.get('merged_description', '')
warning = api_data.get('warning', '')
entry = f"#### `{signature}`\n\n"
if description:
entry += f"{description}\n\n"
# Add inline conflict warning
if inline_conflict and warning:
entry += f"⚠️ **Conflict**: {warning}\n\n"
# Show both versions if available
conflict = api_data.get('conflict', {})
if conflict:
docs_info = conflict.get('docs_info')
code_info = conflict.get('code_info')
if docs_info and code_info:
entry += "**Documentation says:**\n"
entry += f"```\n{docs_info.get('raw_signature', 'N/A')}\n```\n\n"
entry += "**Code implementation:**\n"
entry += f"```\n{self._format_code_signature(code_info)}\n```\n\n"
# Add source info
source = api_data.get('source', 'unknown')
entry += f"*Source: {source}*\n\n"
entry += "---\n\n"
return entry
def _format_code_signature(self, code_info: Dict) -> str:
"""Format code signature for display."""
name = code_info.get('name', '')
params = code_info.get('parameters', [])
return_type = code_info.get('return_type')
param_strs = []
for param in params:
param_str = param.get('name', '')
if param.get('type_hint'):
param_str += f": {param['type_hint']}"
if param.get('default'):
param_str += f" = {param['default']}"
param_strs.append(param_str)
sig = f"{name}({', '.join(param_strs)})"
if return_type:
sig += f" -> {return_type}"
return sig
def _generate_references(self):
"""Generate reference files organized by source."""
logger.info("Generating reference files...")
# Generate references for each source type
if 'documentation' in self.scraped_data:
self._generate_docs_references()
if 'github' in self.scraped_data:
self._generate_github_references()
if 'pdf' in self.scraped_data:
self._generate_pdf_references()
# Generate merged API reference if available
if self.merged_data:
self._generate_merged_api_reference()
def _generate_docs_references(self):
"""Generate references from documentation source."""
docs_dir = os.path.join(self.skill_dir, 'references', 'documentation')
os.makedirs(docs_dir, exist_ok=True)
# Create index
index_path = os.path.join(docs_dir, 'index.md')
with open(index_path, 'w') as f:
f.write("# Documentation\n\n")
f.write("Reference from official documentation.\n\n")
logger.info("Created documentation references")
def _generate_github_references(self):
"""Generate references from GitHub source."""
github_dir = os.path.join(self.skill_dir, 'references', 'github')
os.makedirs(github_dir, exist_ok=True)
github_data = self.scraped_data['github']['data']
# Create README reference
if github_data.get('readme'):
readme_path = os.path.join(github_dir, 'README.md')
with open(readme_path, 'w') as f:
f.write("# Repository README\n\n")
f.write(github_data['readme'])
# Create issues reference
if github_data.get('issues'):
issues_path = os.path.join(github_dir, 'issues.md')
with open(issues_path, 'w') as f:
f.write("# GitHub Issues\n\n")
f.write(f"{len(github_data['issues'])} recent issues.\n\n")
for issue in github_data['issues'][:20]:
f.write(f"## #{issue['number']}: {issue['title']}\n\n")
f.write(f"**State**: {issue['state']}\n")
if issue.get('labels'):
f.write(f"**Labels**: {', '.join(issue['labels'])}\n")
f.write(f"**URL**: {issue.get('url', 'N/A')}\n\n")
# Create releases reference
if github_data.get('releases'):
releases_path = os.path.join(github_dir, 'releases.md')
with open(releases_path, 'w') as f:
f.write("# Releases\n\n")
for release in github_data['releases'][:10]:
f.write(f"## {release['tag_name']}: {release.get('name', 'N/A')}\n\n")
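# published_at may be present but None (e.g. an unpublished draft), so fall back before slicing
f.write(f"**Published**: {(release.get('published_at') or 'N/A')[:10]}\n\n")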
if release.get('body'):
f.write(release['body'][:500])
f.write("\n\n")
logger.info("Created GitHub references")
def _generate_pdf_references(self):
"""Generate references from PDF source."""
pdf_dir = os.path.join(self.skill_dir, 'references', 'pdf')
os.makedirs(pdf_dir, exist_ok=True)
# Create index
index_path = os.path.join(pdf_dir, 'index.md')
with open(index_path, 'w') as f:
f.write("# PDF Documentation\n\n")
f.write("Reference from PDF document.\n\n")
logger.info("Created PDF references")
def _generate_merged_api_reference(self):
"""Generate merged API reference file."""
api_dir = os.path.join(self.skill_dir, 'references', 'api')
os.makedirs(api_dir, exist_ok=True)
api_path = os.path.join(api_dir, 'merged_api.md')
with open(api_path, 'w') as f:
f.write("# Merged API Reference\n\n")
f.write("*Combined from documentation and code analysis*\n\n")
apis = self.merged_data.get('apis', {})
for api_name in sorted(apis.keys()):
api_data = apis[api_name]
entry = self._format_api_entry(api_data, inline_conflict=True)
f.write(entry)
logger.info(f"Created merged API reference ({len(apis)} APIs)")
def _generate_conflicts_report(self):
"""Generate detailed conflicts report."""
conflicts_path = os.path.join(self.skill_dir, 'references', 'conflicts.md')
with open(conflicts_path, 'w') as f:
f.write("# Conflict Report\n\n")
f.write(f"Found **{len(self.conflicts)}** conflicts between sources.\n\n")
# Group by severity
# Conflicts may be dicts (loaded from conflicts.json) or objects; never call .get() on an object
high = [c for c in self.conflicts if (c.get('severity') if isinstance(c, dict) else getattr(c, 'severity', None)) == 'high']
medium = [c for c in self.conflicts if (c.get('severity') if isinstance(c, dict) else getattr(c, 'severity', None)) == 'medium']
low = [c for c in self.conflicts if (c.get('severity') if isinstance(c, dict) else getattr(c, 'severity', None)) == 'low']
f.write("## Severity Breakdown\n\n")
f.write(f"- 🔴 **High**: {len(high)} (action required)\n")
f.write(f"- 🟡 **Medium**: {len(medium)} (review recommended)\n")
f.write(f"- 🟢 **Low**: {len(low)} (informational)\n\n")
# List high severity conflicts
if high:
f.write("## 🔴 High Severity\n\n")
f.write("*These conflicts require immediate attention*\n\n")
for conflict in high:
api_name = conflict.api_name if hasattr(conflict, 'api_name') else conflict.get('api_name', 'Unknown')
diff = conflict.difference if hasattr(conflict, 'difference') else conflict.get('difference', 'N/A')
f.write(f"### {api_name}\n\n")
f.write(f"**Issue**: {diff}\n\n")
# List medium severity
if medium:
f.write("## 🟡 Medium Severity\n\n")
for conflict in medium[:20]: # Limit to 20
api_name = conflict.api_name if hasattr(conflict, 'api_name') else conflict.get('api_name', 'Unknown')
diff = conflict.difference if hasattr(conflict, 'difference') else conflict.get('difference', 'N/A')
f.write(f"### {api_name}\n\n")
f.write(f"{diff}\n\n")
logger.info(f"Created conflicts report")
if __name__ == '__main__':
# Test with mock data
import sys
if len(sys.argv) < 2:
print("Usage: python unified_skill_builder.py <config.json>")
sys.exit(1)
config_path = sys.argv[1]
with open(config_path, 'r') as f:
config = json.load(f)
# Mock scraped data
scraped_data = {
'github': {
'data': {
'readme': '# Test Repository',
'issues': [],
'releases': []
}
}
}
builder = UnifiedSkillBuilder(config, scraped_data)
builder.build()
print(f"\n✅ Test skill built in: output/{config['name']}/")
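
The builder reads conflict fields from entries that may be either dicts (as loaded from `conflicts.json`) or objects with attributes. A small accessor keeps that distinction in one place. This is a sketch only: `conflict_field`, `group_by_severity`, and `DemoConflict` are hypothetical names introduced for illustration and do not appear in the project.

```python
from typing import Any, Dict, List


def conflict_field(conflict: Any, field: str, default: Any = None) -> Any:
    """Read `field` from a conflict given as a dict or as an object."""
    if isinstance(conflict, dict):
        return conflict.get(field, default)
    return getattr(conflict, field, default)


def group_by_severity(conflicts: List[Any]) -> Dict[str, List[Any]]:
    """Bucket conflicts into high / medium / low, ignoring other values."""
    groups: Dict[str, List[Any]] = {"high": [], "medium": [], "low": []}
    for conflict in conflicts:
        severity = conflict_field(conflict, "severity")
        if severity in groups:
            groups[severity].append(conflict)
    return groups


class DemoConflict:
    """Stand-in for an object-style conflict (assumed shape)."""

    def __init__(self, api_name: str, severity: str):
        self.api_name = api_name
        self.severity = severity


demo_conflicts = [
    DemoConflict("foo", "low"),
    {"api_name": "bar", "severity": "high"},
]
grouped = group_by_severity(demo_conflicts)
```

With an accessor like this, the repeated `x.attr if hasattr(x, 'attr') else x.get('attr', default)` expressions in the report generator could each become a single call.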

@ -0,0 +1,175 @@
#!/usr/bin/env python3
"""
Automatic Skill Uploader
Uploads a skill .zip file to Claude using the Anthropic API
Usage:
# Set API key (one-time)
export ANTHROPIC_API_KEY=sk-ant-...
# Upload skill
python3 upload_skill.py output/react.zip
python3 upload_skill.py output/godot.zip
"""
import os
import sys
import json
import argparse
from pathlib import Path
# Import utilities
try:
from utils import (
get_api_key,
get_upload_url,
print_upload_instructions,
validate_zip_file
)
except ImportError:
sys.path.insert(0, str(Path(__file__).parent))
from utils import (
get_api_key,
get_upload_url,
print_upload_instructions,
validate_zip_file
)
def upload_skill_api(zip_path):
"""
Upload skill to Claude via Anthropic API
Args:
zip_path: Path to skill .zip file
Returns:
tuple: (success, message)
"""
# Check for requests library
try:
import requests
except ImportError:
return False, "requests library not installed. Run: pip install requests"
# Validate zip file
is_valid, error_msg = validate_zip_file(zip_path)
if not is_valid:
return False, error_msg
# Get API key
api_key = get_api_key()
if not api_key:
return False, "ANTHROPIC_API_KEY not set. Run: export ANTHROPIC_API_KEY=sk-ant-..."
zip_path = Path(zip_path)
skill_name = zip_path.stem
print(f"📤 Uploading skill: {skill_name}")
print(f" Source: {zip_path}")
print(f" Size: {zip_path.stat().st_size:,} bytes")
print()
# Prepare API request
api_url = "https://api.anthropic.com/v1/skills"
headers = {
"x-api-key": api_key,
"anthropic-version": "2023-06-01",
"anthropic-beta": "skills-2025-10-02"
}
try:
# Read zip file
with open(zip_path, 'rb') as f:
zip_data = f.read()
# Upload skill
print("⏳ Uploading to Anthropic API...")
files = {
'files[]': (zip_path.name, zip_data, 'application/zip')
}
response = requests.post(
api_url,
headers=headers,
files=files,
timeout=60
)
# Check response
if response.status_code == 200:
print()
print("✅ Skill uploaded successfully!")
print()
print("Your skill is now available in Claude at:")
print(f" {get_upload_url()}")
print()
return True, "Upload successful"
elif response.status_code == 401:
return False, "Authentication failed. Check your ANTHROPIC_API_KEY"
elif response.status_code == 400:
error_msg = response.json().get('error', {}).get('message', 'Unknown error')
return False, f"Invalid skill format: {error_msg}"
else:
error_msg = response.json().get('error', {}).get('message', 'Unknown error')
return False, f"Upload failed ({response.status_code}): {error_msg}"
except requests.exceptions.Timeout:
return False, "Upload timed out. Try again or use manual upload"
except requests.exceptions.ConnectionError:
return False, "Connection error. Check your internet connection"
except Exception as e:
return False, f"Unexpected error: {str(e)}"


def main():
    parser = argparse.ArgumentParser(
        description="Upload a skill .zip file to Claude via Anthropic API",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Setup:
  1. Get your Anthropic API key from https://console.anthropic.com/
  2. Set the API key:
     export ANTHROPIC_API_KEY=sk-ant-...

Examples:
  # Upload skill
  python3 upload_skill.py output/react.zip

  # Upload with explicit path
  python3 upload_skill.py /path/to/skill.zip

Requirements:
  - ANTHROPIC_API_KEY environment variable must be set
  - requests library (pip install requests)
        """
    )
    parser.add_argument(
        'zip_file',
        help='Path to skill .zip file (e.g., output/react.zip)'
    )
    args = parser.parse_args()

    # Upload skill
    success, message = upload_skill_api(args.zip_file)
    if success:
        sys.exit(0)
    else:
        print(f"\n❌ Upload failed: {message}")
        print()
        print("📝 Manual upload instructions:")
        print_upload_instructions(args.zip_file)
        sys.exit(1)


if __name__ == "__main__":
    main()


@ -0,0 +1,224 @@
#!/usr/bin/env python3
"""
Utility functions for Skill Seeker CLI tools
"""
import os
import sys
import subprocess
import platform
from pathlib import Path
from typing import Optional, Tuple, Dict, Union


def open_folder(folder_path: Union[str, Path]) -> bool:
    """
    Open a folder in the system file browser

    Args:
        folder_path: Path to folder to open

    Returns:
        bool: True if successful, False otherwise
    """
    folder_path = Path(folder_path).resolve()
    if not folder_path.exists():
        print(f"⚠️ Folder not found: {folder_path}")
        return False

    system = platform.system()
    try:
        if system == "Linux":
            # Try xdg-open first (standard)
            subprocess.run(["xdg-open", str(folder_path)], check=True)
        elif system == "Darwin":  # macOS
            subprocess.run(["open", str(folder_path)], check=True)
        elif system == "Windows":
            subprocess.run(["explorer", str(folder_path)], check=True)
        else:
            print(f"⚠️ Unknown operating system: {system}")
            return False
        return True
    except subprocess.CalledProcessError:
        print("⚠️ Could not open folder automatically")
        return False
    except FileNotFoundError:
        print("⚠️ File browser not found on system")
        return False


def has_api_key() -> bool:
    """
    Check if ANTHROPIC_API_KEY is set in environment

    Returns:
        bool: True if API key is set, False otherwise
    """
    api_key = os.environ.get('ANTHROPIC_API_KEY', '').strip()
    return len(api_key) > 0


def get_api_key() -> Optional[str]:
    """
    Get ANTHROPIC_API_KEY from environment

    Returns:
        str: API key or None if not set
    """
    api_key = os.environ.get('ANTHROPIC_API_KEY', '').strip()
    return api_key if api_key else None


def get_upload_url() -> str:
    """
    Get the Claude skills upload URL

    Returns:
        str: Claude skills upload URL
    """
    return "https://claude.ai/skills"


def print_upload_instructions(zip_path: Union[str, Path]) -> None:
    """
    Print clear upload instructions for manual upload

    Args:
        zip_path: Path to the .zip file to upload
    """
    zip_path = Path(zip_path)
    print()
    print("╔══════════════════════════════════════════════════════════╗")
    print("║ NEXT STEP ║")
    print("╚══════════════════════════════════════════════════════════╝")
    print()
    print(f"📤 Upload to Claude: {get_upload_url()}")
    print()
    print(f"1. Go to {get_upload_url()}")
    print("2. Click \"Upload Skill\"")
    print(f"3. Select: {zip_path}")
    print("4. Done! ✅")
    print()


def format_file_size(size_bytes: int) -> str:
    """
    Format file size in human-readable format

    Args:
        size_bytes: Size in bytes

    Returns:
        str: Formatted size (e.g., "45.3 KB")
    """
    if size_bytes < 1024:
        return f"{size_bytes} bytes"
    elif size_bytes < 1024 * 1024:
        return f"{size_bytes / 1024:.1f} KB"
    else:
        return f"{size_bytes / (1024 * 1024):.1f} MB"


def validate_skill_directory(skill_dir: Union[str, Path]) -> Tuple[bool, Optional[str]]:
    """
    Validate that a directory is a valid skill directory

    Args:
        skill_dir: Path to skill directory

    Returns:
        tuple: (is_valid, error_message)
    """
    skill_path = Path(skill_dir)
    if not skill_path.exists():
        return False, f"Directory not found: {skill_dir}"
    if not skill_path.is_dir():
        return False, f"Not a directory: {skill_dir}"

    skill_md = skill_path / "SKILL.md"
    if not skill_md.exists():
        return False, f"SKILL.md not found in {skill_dir}"

    return True, None


def validate_zip_file(zip_path: Union[str, Path]) -> Tuple[bool, Optional[str]]:
    """
    Validate that a file is a valid skill .zip file

    Args:
        zip_path: Path to .zip file

    Returns:
        tuple: (is_valid, error_message)
    """
    zip_path = Path(zip_path)
    if not zip_path.exists():
        return False, f"File not found: {zip_path}"
    if not zip_path.is_file():
        return False, f"Not a file: {zip_path}"
    if zip_path.suffix != '.zip':
        return False, f"Not a .zip file: {zip_path}"

    return True, None


def read_reference_files(skill_dir: Union[str, Path], max_chars: int = 100000, preview_limit: int = 40000) -> Dict[str, str]:
    """Read reference files from a skill directory with size limits.

    This function reads markdown files from the references/ subdirectory
    of a skill, applying both per-file and total content limits.

    Args:
        skill_dir (str or Path): Path to skill directory
        max_chars (int): Maximum total characters to read (default: 100000)
        preview_limit (int): Maximum characters per file (default: 40000)

    Returns:
        dict: Dictionary mapping filename to content

    Example:
        >>> refs = read_reference_files('output/react/', max_chars=50000)
        >>> len(refs)
        5
    """
    skill_path = Path(skill_dir)
    references_dir = skill_path / "references"
    references: Dict[str, str] = {}

    if not references_dir.exists():
        print(f"⚠ No references directory found at {references_dir}")
        return references

    total_chars = 0
    for ref_file in sorted(references_dir.glob("*.md")):
        if ref_file.name == "index.md":
            continue

        content = ref_file.read_text(encoding='utf-8')

        # Limit size per file
        if len(content) > preview_limit:
            content = content[:preview_limit] + "\n\n[Content truncated...]"

        references[ref_file.name] = content
        total_chars += len(content)

        # Stop if we've read enough
        if total_chars > max_chars:
            print(f" Limiting input to {max_chars:,} characters")
            break

    return references


@ -0,0 +1,596 @@
# Skill Seeker MCP Server
Model Context Protocol (MCP) server for Skill Seeker - enables Claude Code to generate documentation skills directly.
## What is This?
This MCP server allows Claude Code to use Skill Seeker's tools directly through natural language commands. Instead of running CLI commands manually, you can ask Claude Code to:
- Generate config files for any documentation site
- Estimate page counts before scraping
- Scrape documentation and build skills
- Package skills into `.zip` files
- List and validate configurations
- Split large documentation (10K-40K+ pages) into focused sub-skills
- Generate intelligent router/hub skills for split documentation
- **NEW:** Scrape PDF documentation and extract code/images
## Quick Start
### 1. Install Dependencies
```bash
# From repository root
pip3 install -r mcp/requirements.txt
pip3 install requests beautifulsoup4
```
### 2. Quick Setup (Automated)
```bash
# Run the setup script
./setup_mcp.sh
# Follow the prompts - it will:
# - Install dependencies
# - Test the server
# - Generate configuration
# - Guide you through Claude Code setup
```
### 3. Manual Setup
Add to `~/.config/claude-code/mcp.json`:
```json
{
  "mcpServers": {
    "skill-seeker": {
      "command": "python3",
      "args": [
        "/path/to/Skill_Seekers/mcp/server.py"
      ],
      "cwd": "/path/to/Skill_Seekers"
    }
  }
}
```
**Replace `/path/to/Skill_Seekers`** with your actual repository path!
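If you would rather generate this entry than hand-edit the paths, a short script can build it. This is a minimal sketch that mirrors the JSON above; `build_mcp_entry` is a hypothetical helper for illustration and is not part of Skill Seeker.

```python
import json
from pathlib import Path


def build_mcp_entry(repo_root: str) -> dict:
    """Build the mcpServers entry for a given clone path (hypothetical helper)."""
    root = Path(repo_root)
    return {
        "mcpServers": {
            "skill-seeker": {
                "command": "python3",
                # server.py lives under mcp/ in the repository root
                "args": [str(root / "mcp" / "server.py")],
                "cwd": str(root),
            }
        }
    }


if __name__ == "__main__":
    # Replace the placeholder with your actual repository path.
    print(json.dumps(build_mcp_entry("/path/to/Skill_Seekers"), indent=2))
```

The printed JSON can be pasted into `~/.config/claude-code/mcp.json` or merged with existing entries there.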
### 4. Restart Claude Code
Quit and reopen Claude Code (don't just close the window).
### 5. Test
In Claude Code, type:
```
List all available configs
```
You should see a list of preset configurations (Godot, React, Vue, etc.).
## Available Tools
The MCP server exposes 10 tools:
### 1. `generate_config`
Create a new configuration file for any documentation website.
**Parameters:**
- `name` (required): Skill name (e.g., "tailwind")
- `url` (required): Documentation URL (e.g., "https://tailwindcss.com/docs")
- `description` (required): When to use this skill
- `max_pages` (optional): Maximum pages to scrape (default: 100)
- `rate_limit` (optional): Delay between requests in seconds (default: 0.5)
**Example:**
```
Generate config for Tailwind CSS at https://tailwindcss.com/docs
```
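As a rough picture of what the tool has to assemble, the sketch below builds a config dictionary from the five documented parameters and their defaults. The key names are an assumption that simply reuses the parameter names; the file actually written by `generate_config` may use different or additional keys. `make_config` is a hypothetical helper.

```python
import json


def make_config(name: str, url: str, description: str,
                max_pages: int = 100, rate_limit: float = 0.5) -> dict:
    """Assemble a config dict using the documented defaults (key names assumed)."""
    if not name or not url or not description:
        raise ValueError("name, url and description are required")
    return {
        "name": name,
        "url": url,
        "description": description,
        "max_pages": max_pages,    # maximum pages to scrape
        "rate_limit": rate_limit,  # delay between requests, in seconds
    }


config = make_config(
    "tailwind",
    "https://tailwindcss.com/docs",
    "Use when working with Tailwind CSS utility classes",
)
print(json.dumps(config, indent=2))
```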
### 2. `estimate_pages`
Estimate how many pages will be scraped from a config (fast, no data downloaded).
**Parameters:**
- `config_path` (required): Path to config file (e.g., "configs/react.json")
- `max_discovery` (optional): Maximum pages to discover (default: 1000)
**Example:**
```
Estimate pages for configs/react.json
```
### 3. `scrape_docs`
Scrape documentation and build Claude skill.
**Parameters:**
- `config_path` (required): Path to config file
- `enhance_local` (optional): Open terminal for local enhancement (default: false)
- `skip_scrape` (optional): Use cached data (default: false)
- `dry_run` (optional): Preview without saving (default: false)
**Example:**
```
Scrape docs using configs/react.json
```
### 4. `package_skill`
Package a skill directory into a `.zip` file ready for Claude upload. Automatically uploads if ANTHROPIC_API_KEY is set.
**Parameters:**
- `skill_dir` (required): Path to skill directory (e.g., "output/react/")
- `auto_upload` (optional): Try to upload automatically if API key is available (default: true)
**Example:**
```
Package skill at output/react/
```
### 5. `upload_skill`
Upload a skill .zip file to Claude automatically (requires ANTHROPIC_API_KEY).
**Parameters:**
- `skill_zip` (required): Path to skill .zip file (e.g., "output/react.zip")
**Example:**
```
Upload output/react.zip using upload_skill
```
### 6. `list_configs`
List all available preset configurations.
**Parameters:** None
**Example:**
```
List all available configs
```
### 7. `validate_config`
Validate a config file for errors.
**Parameters:**
- `config_path` (required): Path to config file
**Example:**
```
Validate configs/godot.json
```
### 8. `split_config`
Split large documentation config into multiple focused skills. For 10K+ page documentation.
**Parameters:**
- `config_path` (required): Path to config JSON file (e.g., "configs/godot.json")
- `strategy` (optional): Split strategy - "auto", "none", "category", "router", "size" (default: "auto")
- `target_pages` (optional): Target pages per skill (default: 5000)
- `dry_run` (optional): Preview without saving files (default: false)
**Example:**
```
Split configs/godot.json using router strategy with 5000 pages per skill
```
**Strategies:**
- **auto** - Intelligently detects best strategy based on page count and config
- **category** - Split by documentation categories (creates focused sub-skills)
- **router** - Create router/hub skill + specialized sub-skills (RECOMMENDED for 10K+ pages)
- **size** - Split every N pages (for docs without clear categories)
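To make the **size** strategy concrete, here is a minimal sketch that cuts a list of discovered page URLs into consecutive groups of at most `target_pages`. It only illustrates the idea; `split_by_size` is a hypothetical function, and the real `split_config` tool may choose its boundaries differently.

```python
def split_by_size(page_urls: list[str], target_pages: int = 5000) -> list[list[str]]:
    """Split pages into consecutive chunks of at most target_pages each."""
    if target_pages <= 0:
        raise ValueError("target_pages must be positive")
    return [
        page_urls[start:start + target_pages]
        for start in range(0, len(page_urls), target_pages)
    ]


# Example with a tiny page list and a small chunk size.
pages = [f"https://example.com/docs/page-{i}" for i in range(12)]
chunks = split_by_size(pages, target_pages=5)
print([len(chunk) for chunk in chunks])
```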
### 9. `generate_router`
Generate router/hub skill for split documentation. Creates intelligent routing to sub-skills.
**Parameters:**
- `config_pattern` (required): Config pattern for sub-skills (e.g., "configs/godot-*.json")
- `router_name` (optional): Router skill name (inferred from configs if not provided)
**Example:**
```
Generate router for configs/godot-*.json
```
**What it does:**
- Analyzes all sub-skill configs
- Extracts routing keywords from categories and names
- Creates router SKILL.md with intelligent routing logic
- Lets users ask questions naturally while the router directs each one to the appropriate sub-skill
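The routing idea can be pictured as keyword matching. The sketch below is an illustration only: the generated router is a SKILL.md file that Claude reads, not Python code, and the `ROUTES` table and `route` function are hypothetical names. It picks the sub-skill whose keywords appear most often in the question and falls back to the router skill itself.

```python
ROUTES = {
    "godot-scripting": ["scripting", "gdscript"],
    "godot-2d": ["2d", "sprites", "tilemap"],
    "godot-3d": ["3d", "meshes", "camera"],
    "godot-physics": ["physics", "collision"],
    "godot-shaders": ["shaders", "visual shader"],
}


def route(question: str, routes: dict[str, list[str]] = ROUTES,
          default: str = "godot") -> str:
    """Return the sub-skill with the most keyword hits, or the default."""
    text = question.lower()
    best_skill, best_hits = default, 0
    for skill, keywords in routes.items():
        hits = sum(1 for keyword in keywords if keyword in text)
        if hits > best_hits:
            best_skill, best_hits = skill, hits
    return best_skill


print(route("How do I detect collision in the physics engine?"))
```

Plain substring matching is deliberately naive here; it is enough to show why distinctive keywords per sub-skill matter.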
### 10. `scrape_pdf`
Scrape PDF documentation and build Claude skill. Extracts text, code blocks, images, and tables from PDF files, with optional OCR, password-protected PDF support, and parallel processing.
**Parameters:**
- `config_path` (optional): Path to PDF config JSON file (e.g., "configs/manual_pdf.json")
- `pdf_path` (optional): Direct PDF path (alternative to config_path)
- `name` (optional): Skill name (required with pdf_path)
- `description` (optional): Skill description
- `from_json` (optional): Build from extracted JSON file (e.g., "output/manual_extracted.json")
- `use_ocr` (optional): Use OCR for scanned PDFs (requires pytesseract)
- `password` (optional): Password for encrypted PDFs
- `extract_tables` (optional): Extract tables from PDF
- `parallel` (optional): Process pages in parallel for faster extraction
- `max_workers` (optional): Number of parallel workers (default: CPU count)
**Examples:**
```
Scrape PDF at docs/manual.pdf and create skill named api-docs
Create skill from configs/example_pdf.json
Build skill from output/manual_extracted.json
Scrape scanned PDF with OCR: --pdf docs/scanned.pdf --ocr
Scrape encrypted PDF: --pdf docs/manual.pdf --password mypassword
Extract tables: --pdf docs/data.pdf --extract-tables
Fast parallel processing: --pdf docs/large.pdf --parallel --workers 8
```
**What it does:**
- Extracts text and markdown from PDF pages
- Detects code blocks using 3 methods (font, indent, pattern)
- Detects programming language with confidence scoring (19+ languages)
- Validates syntax and scores code quality (0-10 scale)
- Extracts images with size filtering
- **NEW:** Extracts tables from PDFs (Priority 2)
- **NEW:** OCR support for scanned PDFs (Priority 2, requires pytesseract + Pillow)
- **NEW:** Password-protected PDF support (Priority 2)
- **NEW:** Parallel page processing for faster extraction (Priority 3)
- **NEW:** Intelligent caching of expensive operations (Priority 3)
- Detects chapters and creates page chunks
- Categorizes content automatically
- Generates complete skill structure (SKILL.md + references)
**Performance:**
- Sequential: ~30-60 seconds per 100 pages
- Parallel (8 workers): ~10-20 seconds per 100 pages (3x faster)
**See:** `docs/PDF_SCRAPER.md` for complete PDF documentation guide
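The difference between the sequential and parallel figures comes from processing pages concurrently. The sketch below shows the general pattern with `concurrent.futures`; it is not the scraper's actual code. `extract_page` is a stand-in for real per-page work, and whether the tool uses threads or processes is an assumption.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, Optional


def extract_page(page_number: int) -> str:
    """Stand-in for real per-page extraction work (hypothetical)."""
    return f"text of page {page_number}"


def extract_pages(page_numbers: Iterable[int], parallel: bool = False,
                  max_workers: Optional[int] = None) -> list[str]:
    """Extract pages sequentially, or concurrently when parallel=True."""
    numbers = list(page_numbers)
    if not parallel:
        return [extract_page(n) for n in numbers]
    # Default to the CPU count, matching the documented max_workers default.
    workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order.
        return list(pool.map(extract_page, numbers))


print(extract_pages(range(1, 4), parallel=True, max_workers=8))
```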
## Example Workflows
### Generate a New Skill from Scratch
```
User: Generate config for Svelte at https://svelte.dev/docs
Claude: ✅ Config created: configs/svelte.json
User: Estimate pages for configs/svelte.json
Claude: 📊 Estimated pages: 150
User: Scrape docs using configs/svelte.json
Claude: ✅ Skill created at output/svelte/
User: Package skill at output/svelte/
Claude: ✅ Created: output/svelte.zip
Ready to upload to Claude!
```
### Use Existing Preset
```
User: List all available configs
Claude: [Shows all configs: godot, react, vue, django, fastapi, etc.]
User: Scrape docs using configs/react.json
Claude: ✅ Skill created at output/react/
User: Package skill at output/react/
Claude: ✅ Created: output/react.zip
```
### Validate Before Scraping
```
User: Validate configs/godot.json
Claude: ✅ Config is valid!
Name: godot
Base URL: https://docs.godotengine.org/en/stable/
Max pages: 500
Rate limit: 0.5s
User: Scrape docs using configs/godot.json
Claude: [Starts scraping...]
```
### PDF Documentation - NEW
```
User: Scrape PDF at docs/api-manual.pdf and create skill named api-docs
Claude: 📄 Scraping PDF documentation...
✅ Extracted 120 pages
✅ Found 45 code blocks (Python, JavaScript, C++)
✅ Extracted 12 images
✅ Created skill at output/api-docs/
📦 Package with: python3 cli/package_skill.py output/api-docs/
User: Package skill at output/api-docs/
Claude: ✅ Created: output/api-docs.zip
Ready to upload to Claude!
```
### Large Documentation (40K Pages)
```
User: Estimate pages for configs/godot.json
Claude: 📊 Estimated pages: 40,000
⚠️ Large documentation detected!
💡 Recommend splitting into multiple skills
User: Split configs/godot.json using router strategy
Claude: ✅ Split complete!
Created 5 sub-skills:
- godot-scripting.json (5,000 pages)
- godot-2d.json (8,000 pages)
- godot-3d.json (10,000 pages)
- godot-physics.json (6,000 pages)
- godot-shaders.json (11,000 pages)
User: Scrape all godot sub-skills in parallel
Claude: [Starts scraping all 5 configs in parallel...]
✅ All skills created in 4-8 hours instead of 20-40!
User: Generate router for configs/godot-*.json
Claude: ✅ Router skill created at output/godot/
Routing logic:
- "scripting", "gdscript" → godot-scripting
- "2d", "sprites", "tilemap" → godot-2d
- "3d", "meshes", "camera" → godot-3d
- "physics", "collision" → godot-physics
- "shaders", "visual shader" → godot-shaders
User: Package all godot skills
Claude: ✅ 6 skills packaged:
- godot.zip (router)
- godot-scripting.zip
- godot-2d.zip
- godot-3d.zip
- godot-physics.zip
- godot-shaders.zip
Upload all to Claude!
Users just ask questions naturally - router handles routing!
```
## Architecture
### Server Structure
```
mcp/
├── server.py # Main MCP server
├── requirements.txt # MCP dependencies
└── README.md # This file
```
### How It Works
1. **Claude Code** sends MCP requests to the server
2. **Server** routes requests to appropriate tool functions
3. **Tools** call CLI scripts (`doc_scraper.py`, `estimate_pages.py`, etc.)
4. **CLI scripts** perform actual work (scraping, packaging, etc.)
5. **Results** returned to Claude Code via MCP protocol
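Step 2, routing a request to the right tool function, can be sketched as a lookup table of async handlers. This is a simplified illustration that returns plain strings and does not use the MCP SDK; the handler bodies and the `TOOL_HANDLERS` and `call_tool` names are hypothetical.

```python
import asyncio


async def list_configs_tool(args: dict) -> str:
    """Placeholder handler; the real tool lists files under configs/."""
    return "configs: godot, react, vue"


async def validate_config_tool(args: dict) -> str:
    """Placeholder handler; the real tool inspects the config file."""
    return f"validated {args['config_path']}"


# Map tool names (as sent by the client) to their handler coroutines.
TOOL_HANDLERS = {
    "list_configs": list_configs_tool,
    "validate_config": validate_config_tool,
}


async def call_tool(name: str, args: dict) -> str:
    """Dispatch a request to the matching handler."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return f"Unknown tool: {name}"
    return await handler(args)


if __name__ == "__main__":
    print(asyncio.run(call_tool("validate_config", {"config_path": "configs/react.json"})))
```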
### Tool Implementation
Each tool is implemented as an async function:
```python
async def generate_config_tool(args: dict) -> list[TextContent]:
    """Generate a config file"""
    # Create config JSON
    # Save to configs/
    # Return success message
```
Tools use `subprocess.run()` to call CLI scripts:
```python
result = subprocess.run([
    sys.executable,
    str(CLI_DIR / "doc_scraper.py"),
    "--config", config_path
], capture_output=True, text=True)
```
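A tool also has to turn the subprocess result into something it can report back. The sketch below shows one way to do that, assuming success is signalled by a zero exit code; `run_cli` is a hypothetical wrapper, and the timeout value is illustrative rather than taken from the server.

```python
import subprocess
import sys


def run_cli(argv: list[str], timeout: float = 60.0) -> tuple[bool, str]:
    """Run a Python CLI with the current interpreter and summarize the result."""
    try:
        result = subprocess.run(
            [sys.executable, *argv],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, f"Timed out after {timeout}s"
    if result.returncode != 0:
        # Prefer the script's own error text; fall back to the exit code.
        return False, result.stderr.strip() or f"exit code {result.returncode}"
    return True, result.stdout.strip()


# Demonstration with an inline script instead of a real CLI file.
ok, output = run_cli(["-c", "print('hello from child')"])
print(ok, output)
```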
## Testing
The MCP server has comprehensive test coverage:
```bash
# Run MCP server tests (34 tests)
python3 -m pytest tests/test_mcp_server.py -v
# Expected output: 34 passed in ~0.3s
```
### Test Coverage
- **Server initialization** (2 tests)
- **Tool listing** (2 tests)
- **generate_config** (3 tests)
- **estimate_pages** (3 tests)
- **scrape_docs** (4 tests)
- **package_skill** (3 tests)
- **upload_skill** (2 tests)
- **list_configs** (3 tests)
- **validate_config** (3 tests)
- **split_config** (3 tests)
- **generate_router** (3 tests)
- **Tool routing** (2 tests)
- **Integration** (1 test)
**Total: 34 tests | Pass rate: 100%**
## Troubleshooting
### MCP Server Not Loading
**Symptoms:**
- Tools don't appear in Claude Code
- No response to skill-seeker commands
**Solutions:**
1. Check configuration:
```bash
cat ~/.config/claude-code/mcp.json
```
2. Verify server can start:
```bash
python3 mcp/server.py
# Should start without errors (Ctrl+C to exit)
```
3. Check dependencies:
```bash
pip3 install -r mcp/requirements.txt
```
4. Completely restart Claude Code (quit and reopen)
5. Check Claude Code logs:
- macOS: `~/Library/Logs/Claude Code/`
- Linux: `~/.config/claude-code/logs/`
### "ModuleNotFoundError: No module named 'mcp'"
```bash
pip3 install -r mcp/requirements.txt
```
### Tools Appear But Don't Work
**Solutions:**
1. Verify `cwd` in config points to repository root
2. Check CLI tools exist:
```bash
ls cli/doc_scraper.py
ls cli/estimate_pages.py
ls cli/package_skill.py
```
3. Test CLI tools directly:
```bash
python3 cli/doc_scraper.py --help
```
### Slow Operations
1. Check `rate_limit` in configs: it is the delay between requests in seconds, so lower it (if the site tolerates it) to speed up scraping
2. Use smaller `max_pages` for testing
3. Use `skip_scrape` to avoid re-downloading data
## Advanced Configuration
### Using Virtual Environment
```bash
# Create venv
python3 -m venv venv
source venv/bin/activate
pip install -r mcp/requirements.txt
pip install requests beautifulsoup4
which python3 # Copy this path
```
Configure Claude Code to use venv Python:
```json
{
  "mcpServers": {
    "skill-seeker": {
      "command": "/path/to/Skill_Seekers/venv/bin/python3",
      "args": ["/path/to/Skill_Seekers/mcp/server.py"],
      "cwd": "/path/to/Skill_Seekers"
    }
  }
}
```
### Debug Mode
Enable verbose logging:
```json
{
  "mcpServers": {
    "skill-seeker": {
      "command": "python3",
      "args": ["-u", "/path/to/Skill_Seekers/mcp/server.py"],
      "cwd": "/path/to/Skill_Seekers",
      "env": {
        "DEBUG": "1"
      }
    }
  }
}
```
### With API Enhancement
For API-based enhancement (requires Anthropic API key):
```json
{
  "mcpServers": {
    "skill-seeker": {
      "command": "python3",
      "args": ["/path/to/Skill_Seekers/mcp/server.py"],
      "cwd": "/path/to/Skill_Seekers",
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-your-key-here"
      }
    }
  }
}
```
## Performance
| Operation | Time | Notes |
|-----------|------|-------|
| List configs | <1s | Instant |
| Generate config | <1s | Creates JSON file |
| Validate config | <1s | Quick validation |
| Estimate pages | 1-2min | Fast, no data download |
| Split config | 1-3min | Analyzes and creates sub-configs |
| Generate router | 10-30s | Creates router SKILL.md |
| Scrape docs | 15-45min | First time only |
| Scrape docs (40K pages) | 20-40hrs | Sequential |
| Scrape docs (40K pages, parallel) | 4-8hrs | 5 skills in parallel |
| Scrape (cached) | <1min | With `skip_scrape` |
| Package skill | 5-10s | Creates .zip |
| Package multi | 30-60s | Packages 5-10 skills |
## Documentation
- **Full Setup Guide**: [docs/MCP_SETUP.md](../docs/MCP_SETUP.md)
- **Main README**: [README.md](../README.md)
- **Usage Guide**: [docs/USAGE.md](../docs/USAGE.md)
- **Testing Guide**: [docs/TESTING.md](../docs/TESTING.md)
## Support
- **Issues**: [GitHub Issues](https://github.com/yusufkaraaslan/Skill_Seekers/issues)
- **Discussions**: [GitHub Discussions](https://github.com/yusufkaraaslan/Skill_Seekers/discussions)
## License
MIT License - See [LICENSE](../LICENSE) for details
