Example Issue Description
Problem
Report generation times out for datasets containing more than 10,000 rows, causing 45% of export attempts to fail and generating 20+ support tickets per week.
Current Behavior
- User clicks "Generate Report" button for dataset with >10K rows
- No progress indication displayed to user
- After 60 seconds, browser timeout error appears: "Request timeout"
- No way to resume, cancel, or save partial results
- User forced to manually split dataset and export in smaller batches
Impact
Users:
- 45% of report generation attempts fail (from analytics)
- Average 3-5 retry attempts before giving up or contacting support
- Lost productivity: ~15 minutes per failed export
Business:
- 20-25 support tickets per week (5 hours support time)
- User satisfaction score: 3.2/10 (below acceptable threshold of 7/10)
- Enterprise customers threatening to churn due to export limitations
Technical:
- Server memory spikes to 2GB+ during large exports
- CPU usage reaches 100% during processing
- Occasional OOM crashes affecting other users
Root Cause Analysis
Current implementation loads entire dataset into memory before processing:
# Current approach (problematic)
def generate_report(dataset_id):
    # Load ALL data into memory at once
    data = db.query(f"SELECT * FROM {dataset_id}").fetchall()  # 10K+ rows
    # Process all data before returning
    results = process_all_data(data)  # Blocks for 60+ seconds
    return results  # Times out before reaching this point
Problems:
- No streaming - all data loaded at once
- No progress tracking - user sees nothing for 60s
- No cancellation - process continues even if user navigates away
- No memory limits - can spike to 2GB+
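The difference between `fetchall()` and bounded reads can be sketched with the standard-library `sqlite3` module. This is an illustration only: the in-memory database, the `report_rows` table, and the `iter_chunks` helper are assumptions made for the example, not part of the existing codebase.

```python
import sqlite3
from typing import Iterator, List, Tuple


def iter_chunks(cursor: sqlite3.Cursor, size: int = 1000) -> Iterator[List[Tuple]]:
    """Yield lists of at most `size` rows until the cursor is exhausted."""
    while True:
        chunk = cursor.fetchmany(size)
        if not chunk:
            return
        yield chunk


# Hypothetical stand-in for a large dataset: 10,500 rows in an in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE report_rows (id INTEGER PRIMARY KEY, value INTEGER)")
conn.executemany(
    "INSERT INTO report_rows (value) VALUES (?)",
    ((i,) for i in range(10_500)),
)

# Only one chunk of rows is held by this loop at any time.
cursor = conn.execute("SELECT id, value FROM report_rows")
chunk_sizes = [len(chunk) for chunk in iter_chunks(cursor)]
```

With this shape, peak memory depends on the chunk size rather than on the dataset size, which is the property the fix relies on.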
Solution
Implement streaming report generation with progressive rendering and chunked processing.
Proposed Architecture
┌──────────┐  1. Request   ┌────────────────┐  2. Query    ┌──────────┐
│  Client  │ ────────────> │   API Server   │ ───────────> │ Database │
└──────────┘               └────────────────┘              └──────────┘
     │                             │                            │
     │                             │   3. Stream results        │
     │                             │ <──────────────────────────┘
     │                             │
     │  4. Server-Sent Events      │  5. Process chunks (1K rows)
     │     (progress updates)      │     Send to client as ready
     │ <───────────────────────────│
     │                             │
     │  6. Progressive rendering   │
     │     Display results as      │
     │     they arrive             │
Implementation Approach
Backend (Python/FastAPI):
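import json

from fastapi.responses import StreamingResponse


async def generate_report_streaming(dataset_id):
    """Stream report generation with chunked processing."""
    # NOTE: dataset_id is interpolated into SQL below, so it must be validated
    # against known dataset names first (table names cannot be bound as parameters).
    async def event_generator():
        cursor = db.cursor()
        # cursor.rowcount is unreliable for SELECT on many drivers; count explicitly
        cursor.execute(f"SELECT COUNT(*) FROM {dataset_id}")
        total_rows = cursor.fetchone()[0]
        # Query with cursor (no full load)
        cursor.execute(f"SELECT * FROM {dataset_id}")
        processed = 0
        # Process in 1,000-row chunks
        while True:
            chunk = cursor.fetchmany(size=1000)
            if not chunk:
                break
            # Process chunk
            results = process_chunk(chunk)
            # Send progress update as an SSE frame ("data: <json>" plus a blank line)
            processed += len(chunk)
            payload = {
                "progress": (processed / total_rows) * 100,
                "data": results,
            }
            yield f"data: {json.dumps(payload)}\n\n"

    return StreamingResponse(event_generator(), media_type="text/event-stream")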
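A `text/event-stream` response is line-oriented: each event is one or more `field: value` lines followed by a blank line, and the browser's `EventSource` only dispatches an event once it sees that blank line. The helper below is a small sketch of that framing; the name `format_sse` and its parameters are assumptions for illustration.

```python
import json
from typing import Any, Dict, Optional


def format_sse(
    payload: Dict[str, Any],
    event: Optional[str] = None,
    event_id: Optional[str] = None,
) -> str:
    """Serialize one payload as a Server-Sent Events frame."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps without indent emits a single line, so one data: field is enough.
    lines.append(f"data: {json.dumps(payload)}")
    # The trailing blank line terminates the event.
    return "\n".join(lines) + "\n\n"
```

Sending an `id:` field with each chunk also gives the client something to report back if the connection drops.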
Frontend (JavaScript):
// Connect to streaming endpoint
const eventSource = new EventSource('/api/reports/stream/' + datasetId);

// Update progress bar
eventSource.addEventListener('message', (event) => {
  const { progress, data } = JSON.parse(event.data);
  // Update UI
  progressBar.value = progress;
  resultsTable.append(data);
  if (progress >= 100) {
    eventSource.close();
    showCompleteMessage();
  }
});

// Allow cancellation
// NOTE: jobId is not defined in this snippet; it is assumed to be provided
// by the server when the report is started.
cancelButton.onclick = () => {
  eventSource.close();
  fetch('/api/reports/cancel/' + jobId, { method: 'POST' });
};
Key Features
- Chunked processing: Process 1,000 rows at a time
- Progressive rendering: Display results as they arrive
- Progress tracking: Real-time percentage indicator
- Cancellation support: User can cancel at any time
- Memory limits: Max 500MB regardless of dataset size
- Fault tolerance: Resume on network interruption
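Cancellation and progress tracking fit naturally at chunk boundaries. The following is a minimal sketch assuming a `threading.Event` serves as the cancel flag; `run_with_cancellation` and the event dictionary shape are hypothetical names chosen for the example.

```python
import threading
from typing import Dict, Iterable, Iterator, List


def run_with_cancellation(
    chunks: Iterable[List[int]],
    total_rows: int,
    cancelled: threading.Event,
) -> Iterator[Dict[str, object]]:
    """Yield one progress event per chunk, stopping early if `cancelled` is set."""
    processed = 0
    for chunk in chunks:
        if cancelled.is_set():
            # Stop before doing more work and report how far we got.
            yield {"status": "cancelled", "processed": processed}
            return
        processed += len(chunk)
        # Guard against division by zero for empty datasets.
        progress = 100.0 if total_rows == 0 else processed * 100 / total_rows
        yield {"status": "running", "processed": processed, "progress": progress}
    yield {"status": "complete", "processed": processed, "progress": 100.0}
```

Because the flag is only checked between chunks, a cancel request takes effect after at most one chunk of extra work.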
Motivation
User Impact
- Current: 45% failure rate → 2-3 hour productivity loss per week
- After fix: <1% failure rate → 30 minutes saved per week per user
- Scale: 500 active users × 30 min/week = 250 hours/week saved
Business Impact
- Reduce support tickets from 20-25/week to <5/week (about 4 of the current 5 weekly support hours saved)
- Improve user satisfaction score from 3.2/10 to >7/10
- Prevent enterprise customer churn ($50K ARR at risk)
- Enable larger dataset support (competitive advantage)
Technical Impact
- Reduce server memory usage by 75% (2GB → 500MB)
- Enable horizontal scaling (stateless processing)
- Improve overall system stability (fewer OOM crashes)
- Better resource utilization (CPU distributed over time)
Acceptance Criteria
Functional Requirements
- Reports with 10K+ rows complete successfully without timeout
- First results visible within 2 seconds of clicking "Generate"
- Complete report generated in <10 seconds for 10K rows
- Progress indicator shows accurate % complete during generation
- User can cancel report generation at any time
- Partial results saved if user cancels
- Report generation works for datasets up to 100K rows
Non-Functional Requirements
- Memory usage stays below 500MB regardless of dataset size
- No memory leaks (tested with 100 consecutive report generations)
- Works on Chrome 119+, Firefox 120+, Safari 17+
- Responsive on mobile, tablet, and desktop
- Handles slow network connections (3G, throttled)
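The "no memory leaks" requirement could be checked with the standard-library `tracemalloc` module. This is one possible sketch, not the project's test harness: `memory_growth` is a hypothetical helper, and `fn` stands in for a single report generation.

```python
import gc
import tracemalloc
from typing import Callable


def memory_growth(fn: Callable[[], object], runs: int = 100) -> int:
    """Return the change in traced memory (bytes) across `runs` calls of `fn`."""
    tracemalloc.start()
    try:
        fn()  # warm-up so one-time allocations are not counted as a leak
        gc.collect()
        baseline, _peak = tracemalloc.get_traced_memory()
        for _ in range(runs):
            fn()
        gc.collect()
        current, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return current - baseline
```

A result that grows roughly linearly with `runs` suggests something is retained per report; a result near zero suggests each run releases what it allocates.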
Performance Targets
| Metric | Current | Target | Improvement |
|---|---|---|---|
| Success rate | 55% | >99% | +44 points (80% relative) |
| Time to first result | N/A (timeout) | <2s | N/A (no baseline) |
| Complete export (10K rows) | Timeout (60s) | <10s | 6x faster |
| Memory usage (10K rows) | 2GB+ | <500MB | 75% reduction |
| Support tickets/week | 20-25 | <5 | 80% reduction |
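The improvement column mixes relative changes and ratios, so it can help to recompute the figures from the Current and Target columns. The sketch below assumes 2GB is taken as 2,000MB and uses the upper bound of 25 tickets/week; `percent_change` is a hypothetical helper.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) * 100 / before


# Success rate: 55% -> 99%
success_relative = percent_change(55, 99)  # relative gain in percent
success_points = 99 - 55                   # gain in percentage points

# Memory: 2GB (taken as 2,000MB) -> 500MB
memory_reduction = -percent_change(2000, 500)

# Support tickets: upper bound of 25/week -> 5/week
ticket_reduction = -percent_change(25, 5)

# Export time: 60s timeout -> 10s target
speedup = 60 / 10
```

Starting from the lower bound of 20 tickets/week instead gives a 75% reduction, so the table's 80% figure corresponds to the upper bound.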
Edge Cases
- Empty datasets display "No data" message
- Datasets with 100K+ rows generate successfully (may take 30-60s)
- Special characters render correctly (unicode, emojis, HTML entities)
- Network interruption shows error and allows retry
- Concurrent report generation by same user works correctly
- Server restart during generation shows clear error message
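For the network-interruption case, one option is to resume rather than restart: when an `EventSource` reconnects, the browser sends the last `id:` value it received in a `Last-Event-ID` request header. The sketch below assumes event IDs are zero-based chunk indexes; `events_after` is a hypothetical helper and the in-memory `chunks` sequence stands in for the real chunk source.

```python
from typing import Iterator, List, Optional, Sequence, Tuple


def events_after(
    chunks: Sequence[List[int]],
    last_event_id: Optional[str],
) -> Iterator[Tuple[int, List[int]]]:
    """Yield (event_id, chunk) pairs, skipping chunks the client already received."""
    start = 0
    if last_event_id is not None:
        try:
            # Resume with the chunk after the last one the client acknowledged.
            start = max(0, int(last_event_id) + 1)
        except ValueError:
            # Unparseable header value: fall back to sending everything.
            start = 0
    for event_id in range(start, len(chunks)):
        yield event_id, chunks[event_id]
```

This only works if chunk boundaries are stable between attempts, for example when the query has a deterministic ordering.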
Error Handling
- Database connection errors display user-friendly message
- Permission denied shows appropriate error (403 Forbidden)
- Invalid dataset ID returns 404 Not Found
- Rate limiting (>5 concurrent reports) shows clear message
- Timeout after 5 minutes shows clear error and suggests smaller dataset
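The concurrent-report limit can be sketched as a per-user counter guarded by a lock. This is an in-process illustration only (a multi-worker deployment would need shared state); `ReportLimiter`, `TooManyReports`, and `slot` are hypothetical names.

```python
import threading
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterator

MAX_CONCURRENT_REPORTS = 5  # limit taken from the rate-limiting rule above


class TooManyReports(Exception):
    """Raised when a user already has the maximum number of reports running."""


class ReportLimiter:
    def __init__(self, limit: int = MAX_CONCURRENT_REPORTS) -> None:
        self._limit = limit
        self._active: Dict[str, int] = defaultdict(int)
        self._lock = threading.Lock()

    @contextmanager
    def slot(self, user_id: str) -> Iterator[None]:
        """Hold one report slot for `user_id` for the duration of the block."""
        with self._lock:
            if self._active[user_id] >= self._limit:
                raise TooManyReports(f"{user_id} already has {self._limit} reports running")
            self._active[user_id] += 1
        try:
            yield
        finally:
            # Always release the slot, even if report generation fails.
            with self._lock:
                self._active[user_id] -= 1
```

An API handler would catch `TooManyReports` and turn it into the user-facing rate-limit message.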
Technical Approach
Architecture Changes
Current (Synchronous):
Client ──> API Server ──> Database
               ↓ (load all)
          Process all
               ↓
             Return
               ↓
           (timeout!)
Proposed (Streaming):
Client ──> API Server ──> Database
  ↑            ↓ (cursor)     ↓
  │      Stream chunks   Stream rows
  │            ↓              ↑
  └────── Progressive ────────┘
           rendering
Implementation Steps
Phase 1: Backend Streaming (Week 1)
- Add FastAPI StreamingResponse support
- Implement chunked database queries (1K rows/chunk)
- Add Server-Sent Events (SSE) endpoint
- Implement job cancellation endpoint
- Add memory usage monitoring
Phase 2: Frontend Progressive Rendering (Week 1)
- Add EventSource for SSE connection
- Implement progress bar component
- Add cancel button with confirmation
- Implement progressive table rendering
- Add error handling and retry logic
Phase 3: Testing & Optimization (Week 2)
- Load testing with 100K row datasets
- Memory profiling during generation
- Concurrent user testing (10 simultaneous exports)
- Edge case testing (network interruption, cancellation)
- Performance tuning (chunk size optimization)
Phase 4: Deployment (Weeks 3-4)
- Deploy to staging environment
- Internal beta testing (dev team)
- Gradual rollout (10% → 50% → 100%)
- Monitor error rates and performance
- Full production deployment
Database Optimization
- Add index on frequently filtered columns
- Use read replicas for report queries (reduce load on primary)
- Implement query result caching for identical requests
Monitoring
- Track report generation success rate
- Monitor memory usage per report
- Alert on failure rate >5%
- Track average generation time
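The failure-rate alert can be sketched as a sliding window over recent outcomes. The window size of 100 reports is an assumption (the issue only fixes the 5% threshold), and `FailureRateMonitor` is a hypothetical name.

```python
from collections import deque
from typing import Deque


class FailureRateMonitor:
    """Track success/failure of the most recent report generations."""

    def __init__(self, window: int = 100, threshold: float = 0.05) -> None:
        # deque(maxlen=...) silently drops the oldest outcome when full.
        self._outcomes: Deque[bool] = deque(maxlen=window)
        self._threshold = threshold

    def record(self, success: bool) -> None:
        self._outcomes.append(success)

    @property
    def failure_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return self._outcomes.count(False) / len(self._outcomes)

    def should_alert(self) -> bool:
        # Alert only when the rate is strictly above the threshold.
        return self.failure_rate > self._threshold
```

In practice the same numbers would more likely come from the existing metrics system; the sketch only pins down what "failure rate >5%" means.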
Alternatives Considered
Alternative 1: Asynchronous Job Queue
Approach: Submit report to background job queue, email user when complete
Pros:
- Simple implementation (Celery + Redis)
- No frontend changes needed
- Works for very large datasets
Cons:
- Poor UX (user must wait for email)
- No real-time progress updates
- Increased infrastructure complexity
- Doesn't solve immediate feedback problem
Decision: Rejected - UX too poor for interactive reports
Alternative 2: Client-Side Processing
Approach: Download raw data, process in browser with Web Workers
Pros:
- Offloads processing to client
- No server load
Cons:
- Slow download for large datasets
- High bandwidth usage
- Limited by browser memory
- Requires significant client-side code
Decision: Rejected - Not viable for 10K+ row datasets
Alternative 3: Paginated Results
Approach: Show first 100 rows, user clicks "Load More"
Pros:
- Fast initial load
- Simple implementation
Cons:
- User must click multiple times for full report
- Not a true "export" solution
- Poor UX for users needing complete data
Decision: Rejected - Doesn't meet user requirements
Open Questions
- Should we cache generated reports? → No, data changes frequently
- What's the ideal chunk size? → 1,000 rows (tested)
- Should we limit concurrent reports per user? → Yes, max 5
- Should we support export to CSV/Excel during streaming?
- Should we add email notification when generation completes?
Testing Strategy
Unit Tests
- test_streaming_report_generator.py: Chunked processing logic
- test_progress_tracking.py: Accurate progress calculation
- test_cancellation.py: Job cancellation and cleanup
- test_error_handling.py: Database errors, network issues
Integration Tests
- test_report_api.py: End-to-end streaming report generation
- test_concurrent_reports.py: Multiple simultaneous reports
- test_large_datasets.py: 100K row datasets
Load Tests
# Test with 50 concurrent users generating 10K row reports
locust -f tests/load/test_report_streaming.py --users 50 --spawn-rate 5
# Performance targets:
# - 99th percentile response time: <15s
# - Error rate: <1%
# - Memory usage per worker: <500MB
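The load-test targets can be evaluated from raw samples with a nearest-rank percentile. This sketch assumes response times are collected as a list of seconds; `percentile` and `meets_targets` are hypothetical helpers, not part of Locust.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def meets_targets(
    durations_s: Sequence[float],
    failures: int,
    p99_limit_s: float = 15.0,
    max_error_rate: float = 0.01,
) -> bool:
    """Check the p99 response time and error rate against the targets above."""
    total = len(durations_s) + failures
    error_rate = failures / total if total else 0.0
    return percentile(durations_s, 99) < p99_limit_s and error_rate < max_error_rate
```

Failed requests are counted toward the error rate but excluded from the latency samples, which is one common convention among several.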
Edge Case Tests
- Empty dataset
- Single row dataset
- 100K row dataset
- Network interruption mid-generation
- Database connection loss
- Server restart during generation
- Concurrent cancellations
Rollout Plan
Week 1: Development
- Implement backend streaming
- Implement frontend progressive rendering
- Unit tests and integration tests
Week 2: Testing & Staging
- Load testing
- Deploy to staging
- Internal testing (dev team)
- Fix any issues found
Week 3: Gradual Production Rollout
- Deploy to production with feature flag
- Enable for 10% of users
- Monitor error rates, performance metrics
- If successful, increase to 50%
- If successful, increase to 100%
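A percentage rollout behind a feature flag is often implemented by hashing the user ID into a stable bucket, so a user enabled at 10% stays enabled at 50% and 100%. The sketch below assumes that approach; the flag name `streaming_reports` and both function names are hypothetical.

```python
import hashlib


def rollout_bucket(user_id: str, flag: str = "streaming_reports") -> int:
    """Map a user to a stable bucket in the range 0-99 for the given flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(user_id: str, percent: int, flag: str = "streaming_reports") -> bool:
    """Return True if the user falls inside the first `percent` buckets."""
    return rollout_bucket(user_id, flag) < percent
```

Including the flag name in the hash keeps the same users from always being first in line for every rollout.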
Week 4: Full Deployment
- 100% of users on streaming reports
- Remove old synchronous implementation
- Update documentation
Related
- Related to #234 (API performance improvements)
- Related to #235 (Memory optimization)
- Blocks #236 (Enterprise tier launch - requires large dataset support)
- Depends on #237 (Database read replica setup)
- See design doc: Streaming Reports Architecture
Priority
P1-High
Justification:
- Affects 45% of report generation attempts (critical failure rate)
- Generating 20+ support tickets per week (significant support burden)
- Enterprise customer churn risk ($50K ARR)
- Competitive disadvantage (competitors support larger datasets)
Timeline: Target completion in 3-4 weeks (includes testing and gradual rollout; see Rollout Plan)
Complexity Estimate
- Effort: 2-3 weeks (including testing and gradual rollout)
- Risk: Medium (requires careful testing of streaming implementation)
- Dependencies: Database read replica setup (Issue #237)
- Skills needed: Backend (Python/FastAPI), Frontend (JavaScript/SSE), Database optimization
Labels
bug, performance, P1-high, backend, frontend, user-experience
Assignees
- Backend: @backend-dev
- Frontend: @frontend-dev
- QA: @qa-engineer
Issue created by: Product Manager (@pm-user)
Date: 2025-11-12
Milestone: Q4 2025