414 lines
14 KiB
Markdown
414 lines
14 KiB
Markdown
# Example Issue Description
|
||
|
||
## Problem
|
||
Report generation times out for datasets containing more than 10,000 rows, causing 45% of export attempts to fail and generating 20+ support tickets per week.
|
||
|
||
### Current Behavior
|
||
1. User clicks "Generate Report" button for dataset with >10K rows
|
||
2. No progress indication displayed to user
|
||
3. After 60 seconds, browser timeout error appears: "Request timeout"
|
||
4. No way to resume, cancel, or save partial results
|
||
5. User forced to manually split dataset and export in smaller batches
|
||
|
||
### Impact
|
||
**Users:**
|
||
- 45% of report generation attempts fail (from analytics)
|
||
- Average 3-5 retry attempts before giving up or contacting support
|
||
- Lost productivity: ~15 minutes per failed export
|
||
|
||
**Business:**
|
||
- 20-25 support tickets per week (5 hours support time)
|
||
- User frustration score: 3.2/10 (below acceptable threshold of 7/10)
|
||
- Enterprise customers threatening to churn due to export limitations
|
||
|
||
**Technical:**
|
||
- Server memory spikes to 2GB+ during large exports
|
||
- CPU usage reaches 100% during processing
|
||
- Occasional OOM crashes affecting other users
|
||
|
||
### Root Cause Analysis
|
||
Current implementation loads entire dataset into memory before processing:
|
||
|
||
```python
|
||
# Current approach (problematic)
|
||
def generate_report(dataset_id):
|
||
# Load ALL data into memory at once
|
||
data = db.query(f"SELECT * FROM {dataset_id}").fetchall() # 10K+ rows
|
||
|
||
# Process all data before returning
|
||
results = process_all_data(data) # Blocks for 60+ seconds
|
||
|
||
return results # Times out before reaching this point
|
||
```
|
||
|
||
Problems:
|
||
1. No streaming - all data loaded at once
|
||
2. No progress tracking - user sees nothing for 60s
|
||
3. No cancellation - process continues even if user navigates away
|
||
4. No memory limits - can spike to 2GB+
|
||
|
||
## Solution
|
||
Implement streaming report generation with progressive rendering and chunked processing.
|
||
|
||
### Proposed Architecture
|
||
```
|
||
┌──────────┐ 1. Request ┌────────────────┐ 2. Query ┌──────────┐
|
||
│ Client │ ────────────> │ API Server │ ──────────> │ Database │
|
||
└──────────┘ └────────────────┘ └──────────┘
|
||
│ │ │
|
||
│ │ 3. Stream results │
|
||
│ │ <───────────────────────────┘
|
||
│ │
|
||
│ 4. Server-Sent Events │ 5. Process chunks (1K rows)
|
||
│ (progress updates) │ Send to client as ready
|
||
│ <──────────────────────────│
|
||
│ │
|
||
│ 6. Progressive rendering │
|
||
│ Display results as │
|
||
│ they arrive │
|
||
```
|
||
|
||
### Implementation Approach
|
||
|
||
**Backend (Python/FastAPI):**
|
||
```python
|
||
async def generate_report_streaming(dataset_id):
|
||
"""Stream report generation with chunked processing."""
|
||
async def event_generator():
|
||
# Query with cursor (no full load)
|
||
cursor = db.cursor()
|
||
cursor.execute(f"SELECT * FROM {dataset_id}")
|
||
|
||
total_rows = cursor.rowcount
|
||
processed = 0
|
||
|
||
# Process in 1,000-row chunks
|
||
while True:
|
||
chunk = cursor.fetchmany(size=1000)
|
||
if not chunk:
|
||
break
|
||
|
||
# Process chunk
|
||
results = process_chunk(chunk)
|
||
|
||
# Send progress update
|
||
processed += len(chunk)
|
||
yield {
|
||
"progress": (processed / total_rows) * 100,
|
||
"data": results
|
||
}
|
||
|
||
return StreamingResponse(event_generator(), media_type="text/event-stream")
|
||
```
|
||
|
||
**Frontend (JavaScript):**
|
||
```javascript
|
||
// Connect to streaming endpoint
|
||
const eventSource = new EventSource('/api/reports/stream/' + datasetId);
|
||
|
||
// Update progress bar
|
||
eventSource.addEventListener('message', (event) => {
|
||
const { progress, data } = JSON.parse(event.data);
|
||
|
||
// Update UI
|
||
progressBar.value = progress;
|
||
resultsTable.append(data);
|
||
|
||
if (progress >= 100) {
|
||
eventSource.close();
|
||
showCompleteMessage();
|
||
}
|
||
});
|
||
|
||
// Allow cancellation
|
||
cancelButton.onclick = () => {
|
||
eventSource.close();
|
||
fetch('/api/reports/cancel/' + jobId, { method: 'POST' });
|
||
};
|
||
```
|
||
|
||
### Key Features
|
||
1. **Chunked processing**: Process 1,000 rows at a time
|
||
2. **Progressive rendering**: Display results as they arrive
|
||
3. **Progress tracking**: Real-time percentage indicator
|
||
4. **Cancellation support**: User can cancel at any time
|
||
5. **Memory limits**: Max 500MB regardless of dataset size
|
||
6. **Fault tolerance**: Resume on network interruption
|
||
|
||
## Motivation
|
||
|
||
### User Impact
|
||
- **Current**: 45% failure rate → 2-3 hour productivity loss per week
|
||
- **After fix**: <1% failure rate → 30 minutes saved per week per user
|
||
- **Scale**: 500 active users × 30 min/week = 250 hours/week saved
|
||
|
||
### Business Impact
|
||
- Reduce support tickets from 20/week to <5/week (15 hours/week saved)
|
||
- Improve user satisfaction score from 3.2/10 to >7/10
|
||
- Prevent enterprise customer churn ($50K ARR at risk)
|
||
- Enable larger dataset support (competitive advantage)
|
||
|
||
### Technical Impact
|
||
- Reduce server memory usage by 75% (2GB → 500MB)
|
||
- Enable horizontal scaling (stateless processing)
|
||
- Improve overall system stability (fewer OOM crashes)
|
||
- Better resource utilization (CPU distributed over time)
|
||
|
||
## Acceptance Criteria
|
||
|
||
### Functional Requirements
|
||
- [ ] Reports with 10K+ rows complete successfully without timeout
|
||
- [ ] First results visible within 2 seconds of clicking "Generate"
|
||
- [ ] Complete report generated in <10 seconds for 10K rows
|
||
- [ ] Progress indicator shows accurate % complete during generation
|
||
- [ ] User can cancel report generation at any time
|
||
- [ ] Partial results saved if user cancels
|
||
- [ ] Report generation works for datasets up to 100K rows
|
||
|
||
### Non-Functional Requirements
|
||
- [ ] Memory usage stays below 500MB regardless of dataset size
|
||
- [ ] No memory leaks (tested with 100 consecutive report generations)
|
||
- [ ] Works on Chrome 119+, Firefox 120+, Safari 17+
|
||
- [ ] Responsive on mobile devices (tablet and desktop)
|
||
- [ ] Handles slow network connections (3G, throttled)
|
||
|
||
### Performance Targets
|
||
| Metric | Current | Target | Improvement |
|
||
|--------|---------|--------|-------------|
|
||
| Success rate | 55% | >99% | +80% |
|
||
| Time to first result | N/A (timeout) | <2s | ∞ |
|
||
| Complete export (10K rows) | Timeout (60s) | <10s | 6x faster |
|
||
| Memory usage (10K rows) | 2GB+ | <500MB | 75% reduction |
|
||
| Support tickets/week | 20-25 | <5 | 80% reduction |
|
||
|
||
### Edge Cases
|
||
- [ ] Empty datasets display "No data" message
|
||
- [ ] Datasets with 100K+ rows generate successfully (may take 30-60s)
|
||
- [ ] Special characters render correctly (unicode, emojis, HTML entities)
|
||
- [ ] Network interruption shows error and allows retry
|
||
- [ ] Concurrent report generation by same user works correctly
|
||
- [ ] Server restart during generation shows clear error message
|
||
|
||
### Error Handling
|
||
- [ ] Database connection errors display user-friendly message
|
||
- [ ] Permission denied shows appropriate error (403 Forbidden)
|
||
- [ ] Invalid dataset ID returns 404 Not Found
|
||
- [ ] Rate limiting (>5 concurrent reports) shows clear message
|
||
- [ ] Timeout after 5 minutes shows clear error and suggests smaller dataset
|
||
|
||
## Technical Approach
|
||
|
||
### Architecture Changes
|
||
|
||
**Current (Synchronous):**
|
||
```
|
||
Client ──> API Server ──> Database
|
||
↓ (load all)
|
||
Process all
|
||
↓
|
||
Return
|
||
↓
|
||
(timeout!)
|
||
```
|
||
|
||
**Proposed (Streaming):**
|
||
```
|
||
Client ──> API Server ──> Database
|
||
↑ ↓ (cursor) ↓
|
||
│ Stream chunks Stream rows
|
||
│ ↓ ↑
|
||
└──── Progressive ─────────┘
|
||
rendering
|
||
```
|
||
|
||
### Implementation Steps
|
||
|
||
#### Phase 1: Backend Streaming (Week 1)
|
||
1. Add FastAPI StreamingResponse support
|
||
2. Implement chunked database queries (1K rows/chunk)
|
||
3. Add Server-Sent Events (SSE) endpoint
|
||
4. Implement job cancellation endpoint
|
||
5. Add memory usage monitoring
|
||
|
||
#### Phase 2: Frontend Progressive Rendering (Week 1)
|
||
1. Add EventSource for SSE connection
|
||
2. Implement progress bar component
|
||
3. Add cancel button with confirmation
|
||
4. Implement progressive table rendering
|
||
5. Add error handling and retry logic
|
||
|
||
#### Phase 3: Testing & Optimization (Week 2)
|
||
1. Load testing with 100K row datasets
|
||
2. Memory profiling during generation
|
||
3. Concurrent user testing (10 simultaneous exports)
|
||
4. Edge case testing (network interruption, cancellation)
|
||
5. Performance tuning (chunk size optimization)
|
||
|
||
#### Phase 4: Deployment (Week 2)
|
||
1. Deploy to staging environment
|
||
2. Internal beta testing (dev team)
|
||
3. Gradual rollout (10% → 50% → 100%)
|
||
4. Monitor error rates and performance
|
||
5. Full production deployment
|
||
|
||
### Database Optimization
|
||
- Add index on frequently filtered columns
|
||
- Use read replicas for report queries (reduce load on primary)
|
||
- Implement query result caching for identical requests
|
||
|
||
### Monitoring
|
||
- Track report generation success rate
|
||
- Monitor memory usage per report
|
||
- Alert on failure rate >5%
|
||
- Track average generation time
|
||
|
||
## Alternatives Considered
|
||
|
||
### Alternative 1: Asynchronous Job Queue
|
||
**Approach:** Submit report to background job queue, email user when complete
|
||
|
||
**Pros:**
|
||
- Simple implementation (Celery + Redis)
|
||
- No frontend changes needed
|
||
- Works for very large datasets
|
||
|
||
**Cons:**
|
||
- Poor UX (user must wait for email)
|
||
- No real-time progress updates
|
||
- Increased infrastructure complexity
|
||
- Doesn't solve immediate feedback problem
|
||
|
||
**Decision:** Rejected - UX too poor for interactive reports
|
||
|
||
### Alternative 2: Client-Side Processing
|
||
**Approach:** Download raw data, process in browser with Web Workers
|
||
|
||
**Pros:**
|
||
- Offloads processing to client
|
||
- No server load
|
||
|
||
**Cons:**
|
||
- Slow download for large datasets
|
||
- High bandwidth usage
|
||
- Limited by browser memory
|
||
- Requires significant client-side code
|
||
|
||
**Decision:** Rejected - Not viable for 10K+ row datasets
|
||
|
||
### Alternative 3: Paginated Results
|
||
**Approach:** Show first 100 rows, user clicks "Load More"
|
||
|
||
**Pros:**
|
||
- Fast initial load
|
||
- Simple implementation
|
||
|
||
**Cons:**
|
||
- User must click multiple times for full report
|
||
- Not a true "export" solution
|
||
- Poor UX for users needing complete data
|
||
|
||
**Decision:** Rejected - Doesn't meet user requirements
|
||
|
||
## Open Questions
|
||
- [x] Should we cache generated reports? → No, data changes frequently
|
||
- [x] What's the ideal chunk size? → 1,000 rows (tested)
|
||
- [x] Should we limit concurrent reports per user? → Yes, max 5
|
||
- [ ] Should we support export to CSV/Excel during streaming?
|
||
- [ ] Should we add email notification when generation completes?
|
||
|
||
## Testing Strategy
|
||
|
||
### Unit Tests
|
||
- `test_streaming_report_generator.py`: Chunked processing logic
|
||
- `test_progress_tracking.py`: Accurate progress calculation
|
||
- `test_cancellation.py`: Job cancellation and cleanup
|
||
- `test_error_handling.py`: Database errors, network issues
|
||
|
||
### Integration Tests
|
||
- `test_report_api.py`: End-to-end streaming report generation
|
||
- `test_concurrent_reports.py`: Multiple simultaneous reports
|
||
- `test_large_datasets.py`: 100K row datasets
|
||
|
||
### Load Tests
|
||
```bash
|
||
# Test with 50 concurrent users generating 10K row reports
|
||
locust -f tests/load/test_report_streaming.py --users 50 --spawn-rate 5
|
||
|
||
# Performance targets:
|
||
# - 99th percentile response time: <15s
|
||
# - Error rate: <1%
|
||
# - Memory usage per worker: <500MB
|
||
```
|
||
|
||
### Edge Case Tests
|
||
- Empty dataset
|
||
- Single row dataset
|
||
- 100K row dataset
|
||
- Network interruption mid-generation
|
||
- Database connection loss
|
||
- Server restart during generation
|
||
- Concurrent cancellations
|
||
|
||
## Rollout Plan
|
||
|
||
### Week 1: Development
|
||
- [x] Implement backend streaming
|
||
- [x] Implement frontend progressive rendering
|
||
- [x] Unit tests and integration tests
|
||
|
||
### Week 2: Testing & Staging
|
||
- [x] Load testing
|
||
- [x] Deploy to staging
|
||
- [x] Internal testing (dev team)
|
||
- [x] Fix any issues found
|
||
|
||
### Week 3: Gradual Production Rollout
|
||
- [ ] Deploy to production with feature flag
|
||
- [ ] Enable for 10% of users
|
||
- [ ] Monitor error rates, performance metrics
|
||
- [ ] If successful, increase to 50%
|
||
- [ ] If successful, increase to 100%
|
||
|
||
### Week 4: Full Deployment
|
||
- [ ] 100% of users on streaming reports
|
||
- [ ] Remove old synchronous implementation
|
||
- [ ] Update documentation
|
||
|
||
## Related
|
||
- Related to #234 (API performance improvements)
|
||
- Related to #235 (Memory optimization)
|
||
- Blocks #236 (Enterprise tier launch - requires large dataset support)
|
||
- Depends on #237 (Database read replica setup)
|
||
- See design doc: [Streaming Reports Architecture](link)
|
||
|
||
## Priority
|
||
**P1-High**
|
||
|
||
**Justification:**
|
||
- Affects 45% of report generation attempts (critical failure rate)
|
||
- Generating 20+ support tickets per week (significant support burden)
|
||
- Enterprise customer churn risk ($50K ARR)
|
||
- Competitive disadvantage (competitors support larger datasets)
|
||
|
||
**Timeline:** Target completion in 3 weeks (includes testing and gradual rollout)
|
||
|
||
## Complexity Estimate
|
||
- **Effort**: 2-3 weeks (including testing and gradual rollout)
|
||
- **Risk**: Medium (requires careful testing of streaming implementation)
|
||
- **Dependencies**: Database read replica setup (Issue #237)
|
||
- **Skills needed**: Backend (Python/FastAPI), Frontend (JavaScript/SSE), Database optimization
|
||
|
||
## Labels
|
||
`bug`, `performance`, `P1-high`, `backend`, `frontend`, `user-experience`
|
||
|
||
## Assignees
|
||
- Backend: @backend-dev
|
||
- Frontend: @frontend-dev
|
||
- QA: @qa-engineer
|
||
|
||
---
|
||
|
||
**Issue created by:** Product Manager (@pm-user)
|
||
**Date:** 2025-11-12
|
||
**Milestone:** Q4 2025
|