Overview
Visual content drives engagement. Tagtaly attaches a quality image to every article for dashboard displays and social sharing, and currently finds a real image for 99.2% of articles (497 of 501). This document explains the 9-method extraction system, the CDN bypass techniques, and responsive image handling.
The Challenge
Finding article images is harder than it seems:
- RSS feeds: Many don't include image data
- Guardian: CDN returns 401 Unauthorized errors
- Washington Post: Images embedded in HTML, not RSS
- Lazy loading: Modern sites load images dynamically
- Responsive images: Multiple resolutions in `srcset` attributes
After optimization: 99.2% (497 of 501 articles have images)
Improvement: +35.2 percentage points (which implies a pre-optimization baseline of about 64%)
The 9-Method Priority System
Tagtaly tries 9 extraction methods in order. When one succeeds, it stops (avoiding redundant fetches):
| Priority | Method | Coverage | Speed |
|---|---|---|---|
| 1 | RSS native image tag | 40% | Instant |
| 2 | Open Graph og:image | 35% | 1-2 sec |
| 3 | Twitter Card twitter:image | 25% | 1-2 sec |
| 4 | Schema.org JSON-LD image | 20% | 1-2 sec |
| 5 | Alternate image meta tag | 15% | 1-2 sec |
| 6 | Source-specific CDN patterns | 10% | Instant |
| 7 | Standard HTML img tags | 8% | 1-2 sec |
| 8 | Lazy-loaded data-src | 5% | 1-2 sec |
| 9 | Responsive srcset | 3% | 1-2 sec |
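The "try in order, stop at the first success" behaviour can be sketched as a simple chain. This is a minimal illustration, not Tagtaly's actual code: the `first_image` function, the `Extractor` type, and the two stand-in extractors are hypothetical names.

```python
from typing import Callable, Optional, Sequence, Tuple

# An extractor takes an article record and returns an image URL or None.
Extractor = Callable[[dict], Optional[str]]


def first_image(
    article: dict, extractors: Sequence[Tuple[str, Extractor]]
) -> Tuple[Optional[str], Optional[str]]:
    """Run extractors in priority order and stop at the first hit.

    Returns (image_url, method_name), or (None, None) if every method fails.
    """
    for name, extract in extractors:
        try:
            url = extract(article)
        except Exception:
            # One failing method should not abort the chain; try the next.
            continue
        if url:
            return url, name
    return None, None


# Hypothetical stand-ins for methods 1 and 2 (assumed field names).
def from_rss(article: dict) -> Optional[str]:
    return article.get("rss_image")


def from_open_graph(article: dict) -> Optional[str]:
    return article.get("og_image")


CHAIN = [("rss", from_rss), ("og:image", from_open_graph)]

result = first_image({"og_image": "https://example.com/a.jpg"}, CHAIN)
```

Because the loop returns as soon as a method yields a URL, the slower page-scraping methods run only for articles that the instant methods could not handle.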
Method 1: RSS Native Image
Fastest method. Some RSS feeds include a native image tag on each item, so the image URL is available without fetching the article page.
Coverage: BBC, Sky News (~40%)
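As a rough sketch of method 1, the snippet below reads an image URL straight from a feed item. The source does not say which tags Tagtaly reads, so the Media RSS elements (`media:thumbnail`, `media:content`) and the RSS 2.0 `enclosure` element are assumptions, as are the function name and the sample feed.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Media RSS namespace, in ElementTree's "{uri}tag" notation.
MEDIA_NS = "{http://search.yahoo.com/mrss/}"


def rss_item_image(item: ET.Element) -> Optional[str]:
    """Return the first image URL found on an RSS <item>, or None."""
    # Media RSS elements carry the image in a url attribute.
    for tag in ("thumbnail", "content"):
        node = item.find(MEDIA_NS + tag)
        if node is not None and node.get("url"):
            return node.get("url")
    # Plain RSS 2.0 enclosure, accepted only when declared as an image.
    enclosure = item.find("enclosure")
    if enclosure is not None and enclosure.get("type", "").startswith("image/"):
        return enclosure.get("url")
    return None


SAMPLE = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
<channel><item>
<title>Example</title>
<media:thumbnail url="https://example.com/thumb.jpg" width="240" height="135"/>
</item></channel></rss>"""

item = ET.fromstring(SAMPLE).find("./channel/item")
image_url = rss_item_image(item)
```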
Methods 2-5: Metadata Extraction
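When RSS fails, Tagtaly fetches the article HTML and reads its metadata: Open Graph and Twitter Card meta tags, Schema.org JSON-LD, and an alternate image meta tag.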
Combined Coverage: Most outlets (~35-40%)
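One way to read these metadata sources with only the standard library is shown below. It is a sketch under assumptions: the class and function names are hypothetical, and it covers methods 2 to 4 only, because the source does not say which tag method 5 ("alternate image meta tag") refers to.

```python
import json
from html.parser import HTMLParser
from typing import Optional


class MetaImageParser(HTMLParser):
    """Collects og:image, twitter:image, and JSON-LD image candidates."""

    def __init__(self):
        super().__init__()
        self.found = {}  # method name -> image URL
        self._in_jsonld = False
        self._jsonld_buf = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta":
            # Open Graph uses property=..., Twitter Cards usually name=...
            key = a.get("property") or a.get("name")
            if key in ("og:image", "twitter:image") and a.get("content"):
                self.found.setdefault(key, a["content"])
        elif tag == "script" and a.get("type") == "application/ld+json":
            self._in_jsonld = True
            self._jsonld_buf = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._jsonld_buf.append(data)

    def handle_endtag(self, tag):
        if tag != "script" or not self._in_jsonld:
            return
        self._in_jsonld = False
        try:
            doc = json.loads("".join(self._jsonld_buf))
        except ValueError:
            return  # malformed JSON-LD: ignore this block
        image = doc.get("image") if isinstance(doc, dict) else None
        if isinstance(image, list) and image:
            image = image[0]  # image may be a list of candidates
        if isinstance(image, dict):
            image = image.get("url")  # or an ImageObject with a url field
        if isinstance(image, str):
            self.found.setdefault("json-ld", image)


def meta_image(html: str) -> Optional[str]:
    """Return the best metadata image, honouring the priority order."""
    parser = MetaImageParser()
    parser.feed(html)
    parser.close()
    for key in ("og:image", "twitter:image", "json-ld"):
        if key in parser.found:
            return parser.found[key]
    return None
```

Collecting all candidates in one pass and ranking them afterwards means the page is parsed once, regardless of the order in which the tags appear in the HTML.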
Method 6: Source-Specific CDN Patterns
Each outlet serves images from a recognizable CDN host. BBC uses `images.bbc.co.uk`, Guardian uses `media.guim.co.uk`. For outlets that restrict direct access, Tagtaly routes the image through a proxy service.
Coverage: Guardian, Washington Post, NPR (~10-15%)
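The proxy step might look like the sketch below, which rewrites URLs on restricted hosts to go through images.weserv.nl using its `url` and `w` query parameters. The host set, the default width, and the function name are assumptions for illustration; the source only names the Guardian host and the proxy service.

```python
from urllib.parse import urlencode, urlparse

# Hosts routed through the proxy. This set is an assumption; the source
# only says that Guardian images on media.guim.co.uk are restricted.
RESTRICTED_HOSTS = {"media.guim.co.uk"}
PROXY_BASE = "https://images.weserv.nl/"


def resolve_image_url(image_url: str, width: int = 800) -> str:
    """Return a proxied URL for restricted hosts, else the URL unchanged."""
    host = urlparse(image_url).hostname or ""
    if host in RESTRICTED_HOSTS:
        # urlencode percent-encodes the original URL so it survives
        # being passed as a query parameter.
        query = urlencode({"url": image_url, "w": width})
        return f"{PROXY_BASE}?{query}"
    return image_url


proxied = resolve_image_url("https://media.guim.co.uk/abc/1000.jpg")
direct = resolve_image_url("https://images.bbc.co.uk/news/1.jpg")
```

Since this is pure string rewriting with no network call, it is consistent with the "Instant" speed listed for method 6.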
Methods 7-9: HTML Parsing
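When metadata fails, Tagtaly parses the article body for images: standard `img` tags first, then lazy-loaded `data-src` attributes, then responsive `srcset` candidates.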
Coverage: Fallback for remaining articles (~15-20%)
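A minimal sketch of methods 7 to 9 follows. It assumes the first usable `img` attribute of each kind is good enough and that `srcset` candidates use width descriptors such as `800w`; the names `ImgParser`, `best_from_srcset`, and `body_image` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional


def best_from_srcset(srcset: str) -> Optional[str]:
    """Return the srcset candidate with the largest width descriptor.

    Splitting on commas is a simplification: URLs that themselves
    contain commas would need a stricter parser.
    """
    best_url, best_width = None, -1
    for candidate in srcset.split(","):
        parts = candidate.split()
        if not parts:
            continue
        width = 0  # candidates without an "NNNw" descriptor rank lowest
        if len(parts) > 1 and parts[1].endswith("w") and parts[1][:-1].isdigit():
            width = int(parts[1][:-1])
        if width > best_width:
            best_url, best_width = parts[0], width
    return best_url


class ImgParser(HTMLParser):
    """Records the first src, data-src, and srcset seen on <img> tags."""

    def __init__(self):
        super().__init__()
        self.src = self.data_src = self.srcset = None

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        a = dict(attrs)
        src = a.get("src")
        # Lazy-loading pages often put an inline data: placeholder in src.
        if src and not src.startswith("data:"):
            self.src = self.src or src
        self.data_src = self.data_src or a.get("data-src")
        self.srcset = self.srcset or a.get("srcset")


def body_image(html: str) -> Optional[str]:
    """Apply methods 7, 8, 9 in order: src, then data-src, then srcset."""
    parser = ImgParser()
    parser.feed(html)
    parser.close()
    if parser.src:
        return parser.src
    if parser.data_src:
        return parser.data_src
    if parser.srcset:
        return best_from_srcset(parser.srcset)
    return None
```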
Image Quality & Sizing
Preferred Dimensions
For dashboard and social sharing, images should be:
- Aspect ratio: 16:9 or 1.2:1 (landscape)
- Minimum width: 800px (for quality)
- File size: <2 MB (for load speed)
- Format: JPEG, PNG, or WebP
Validation
Before storing, images are validated:
- URL accessibility (200 HTTP response)
- Valid image format (JPEG, PNG, WebP)
- Minimum dimensions (>400px width)
- File size check (<10 MB)
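The four checks above can be expressed as a pure function over values gathered when the image is fetched. This is a sketch only: the `ImageProbe` record and its field names are hypothetical, and "10 MB" is interpreted here as 10 × 1024 × 1024 bytes.

```python
from dataclasses import dataclass
from typing import List

ALLOWED_TYPES = {"image/jpeg", "image/png", "image/webp"}
MIN_WIDTH_PX = 400             # widths must be strictly greater than this
MAX_BYTES = 10 * 1024 * 1024   # sizes must be strictly below this


@dataclass
class ImageProbe:
    """Facts about a fetched image (hypothetical record)."""
    status: int        # HTTP status code of the image request
    content_type: str  # Content-Type header value
    width: int         # decoded image width in pixels
    size_bytes: int    # payload size in bytes


def validation_errors(probe: ImageProbe) -> List[str]:
    """Return the list of failed checks; an empty list means valid."""
    errors = []
    if probe.status != 200:
        errors.append(f"HTTP status {probe.status}, expected 200")
    # Ignore parameters such as "; charset=..." after the media type.
    media_type = probe.content_type.split(";")[0].strip().lower()
    if media_type not in ALLOWED_TYPES:
        errors.append(f"unsupported format {media_type!r}")
    if probe.width <= MIN_WIDTH_PX:
        errors.append(f"width {probe.width}px is not above {MIN_WIDTH_PX}px")
    if probe.size_bytes >= MAX_BYTES:
        errors.append(f"size {probe.size_bytes} bytes is not below {MAX_BYTES}")
    return errors


good = validation_errors(ImageProbe(200, "image/jpeg", 1200, 350_000))
bad = validation_errors(ImageProbe(404, "text/html", 400, 11 * 1024 * 1024))
```

Returning every failed check, rather than stopping at the first, makes it easier to log why an image was rejected.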
Handling Edge Cases
Blocked/Restricted Images
Some outlets block external access. Tagtaly uses three strategies:
- Proxy services: weserv.nl, imgproxy, or similar
- User-Agent headers: Send browser-like request headers
- Fallback placeholder: Use generic news image
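For the User-Agent strategy, a request with browser-like headers could be built as follows. The exact header values Tagtaly sends are not given in the source; the User-Agent string, the Accept value, and the function name are illustrative assumptions.

```python
import urllib.request

# Example browser-like User-Agent string (an assumption, not Tagtaly's).
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def browser_request(url: str) -> urllib.request.Request:
    """Build a request that identifies itself like a desktop browser."""
    return urllib.request.Request(
        url,
        headers={
            "User-Agent": BROWSER_UA,
            "Accept": "image/*,*/*;q=0.8",
        },
    )


req = browser_request("https://example.com/a.jpg")
# Pass req to urllib.request.urlopen(req, timeout=...) to send it.
```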
Corrupted/Invalid Images
If all 9 methods fail, Tagtaly falls back to a source-branded placeholder image. Every article therefore displays some visual, although a few show a placeholder rather than an actual article image.
Performance Optimization
Caching
Image URLs are cached, so an article that has already been processed is not fetched again.
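A simple in-memory version of such a cache is sketched below. The source does not describe the cache's storage or expiry, so the 24-hour TTL, the class name, and the choice to cache misses as well as hits are assumptions.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ImageUrlCache:
    """In-memory cache mapping article URL -> image URL (or None)."""

    def __init__(
        self,
        ttl_seconds: float = 24 * 3600,  # assumed expiry
        clock: Callable[[], float] = time.monotonic,
    ):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Optional[str]]] = {}

    def get_or_fetch(
        self, article_url: str, fetch: Callable[[str], Optional[str]]
    ) -> Optional[str]:
        """Return the cached value, calling fetch only on a miss or expiry."""
        now = self._clock()
        hit = self._entries.get(article_url)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = fetch(article_url)
        # None results are stored too, so known failures are not retried
        # until the entry expires.
        self._entries[article_url] = (now, value)
        return value


calls = []


def fake_extract(article_url: str) -> Optional[str]:
    """Stand-in for the real extraction chain; records each call."""
    calls.append(article_url)
    return "https://example.com/a.jpg"


cache = ImageUrlCache()
first = cache.get_or_fetch("https://example.com/story", fake_extract)
second = cache.get_or_fetch("https://example.com/story", fake_extract)
```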
Timeout Handling
HTML scraping has a 5-second timeout per article. If a page takes longer, the request is abandoned and the pipeline moves on to the next article. This prevents slow websites from blocking the pipeline.
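With the standard library, the timeout can be passed to `urllib.request.urlopen`, as in the sketch below. The function name and the injectable `opener` parameter are hypothetical and exist only to make the behaviour easy to demonstrate without a network call. Note that this timeout bounds each blocking socket operation rather than the total wall-clock time, so a stricter overall limit would need extra handling.

```python
import urllib.request
from typing import Callable, Optional

SCRAPE_TIMEOUT_S = 5.0  # per-article limit stated in the text


def fetch_html(
    url: str,
    timeout: float = SCRAPE_TIMEOUT_S,
    opener: Callable = urllib.request.urlopen,
) -> Optional[str]:
    """Fetch a page, returning None on timeout or any network error."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        # URLError, connection errors, and socket timeouts all derive
        # from OSError, so a slow site simply yields None.
        return None


def _always_times_out(url, timeout):
    """Test double that simulates a site slower than the timeout."""
    raise TimeoutError(f"no response from {url} within {timeout}s")


slow_result = fetch_html("https://example.com/slow", opener=_always_times_out)
```

Returning `None` instead of raising lets the caller treat a timeout like any other failed method and continue with the next article.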
Batch Processing
Image extraction runs in parallel for multiple articles (up to 10 concurrent), reducing total extraction time from minutes to seconds.
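Because extraction is dominated by network waits, a thread pool is one plausible way to get this concurrency. The sketch below caps the pool at 10 workers to match the text; the function names are hypothetical, and a thread pool is an assumed mechanism, since the source does not say how the parallelism is implemented.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional

MAX_CONCURRENT = 10  # "up to 10 concurrent" in the text


def extract_all(
    article_urls: Iterable[str],
    extract_one: Callable[[str], Optional[str]],
) -> Dict[str, Optional[str]]:
    """Run extract_one for every article, at most 10 at a time."""
    urls = list(article_urls)

    def safe(url: str) -> Optional[str]:
        # One failing article must not cancel the whole batch.
        try:
            return extract_one(url)
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(safe, urls)))


def fake_extract(url: str) -> Optional[str]:
    """Stand-in extractor: fails for one URL, succeeds for the rest."""
    if url.endswith("/broken"):
        raise RuntimeError("simulated failure")
    return url + "/image.jpg"


batch = extract_all(
    ["https://example.com/a", "https://example.com/broken"], fake_extract
)
```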
Results & Statistics
- BBC: 100% (RSS native images)
- Guardian: 100% (CDN proxy bypass)
- Sky News: 100% (RSS + HTML)
- Independent: 98% (meta + HTML)
- Washington Post: 99% (JSON-LD + proxy)
- Total articles processed: 501
- Images found: 497 (99.2%)
- Placeholder fallbacks: 4 (0.8%)
- Average extraction time: 1.2 seconds per article
Future Improvements
- AI-powered image ranking (choose best image if multiple exist)
- Image quality scoring (blur detection, relevance)
- Copyright checking (license verification)
- Video thumbnail extraction (for video articles)
Questions?
For questions about image extraction, contact admin@tagtaly.com.