Image Extraction Methodology

Overview

Visual content drives engagement. Tagtaly ensures every article has a quality image for dashboard displays and social sharing. Image coverage currently stands at 99.2% (497 of 501 articles). This document explains the 9-method extraction system, CDN bypass techniques, and responsive image handling.

The Challenge

Finding article images is harder than it seems:

RSS feeds: Many don't include image data
Guardian: CDN returns 401 Unauthorized errors
Washington Post: Images embedded in HTML, not RSS
Lazy loading: Modern sites load images dynamically
Responsive images: Multiple resolutions in `srcset` attributes

Original Coverage: 64% of the 501 articles had images
After Optimization: 99.2% (497 of 501 articles have images)
Improvement: +35.2 percentage points

The 9-Method Priority System

Tagtaly tries 9 extraction methods in order. When one succeeds, it stops (avoiding redundant fetches):

Priority	Method	Coverage	Speed
1	RSS native image tag	40%	Instant
2	Open Graph og:image	35%	1-2 sec
3	Twitter Card twitter:image	25%	1-2 sec
4	Schema.org JSON-LD image	20%	1-2 sec
5	Alternate image meta tag	15%	1-2 sec
6	Source-specific CDN patterns	10%	Instant
7	Standard HTML img tags	8%	1-2 sec
8	Lazy-loaded data-src	5%	1-2 sec
9	Responsive srcset	3%	1-2 sec
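
The stop-at-first-success behaviour can be sketched as a list of extractor functions tried in priority order. This is a minimal illustration, not Tagtaly's actual code: the `Article` type, the `first_image` helper, and the stub extractors are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Article:
    """Hypothetical article record used only for this sketch."""
    url: str
    rss_image: Optional[str] = None  # image URL taken from the feed, if any


# An extractor returns an image URL, or None when it finds nothing.
Extractor = Callable[[Article], Optional[str]]


def from_rss(article: Article) -> Optional[str]:
    """Priority 1: use the image the feed already supplied."""
    return article.rss_image


def from_open_graph(article: Article) -> Optional[str]:
    """Priority 2 stub: a real version would fetch and parse og:image."""
    return None


def first_image(
    article: Article, extractors: Sequence[Extractor]
) -> Optional[Tuple[int, str]]:
    """Try extractors in order; return (priority, url) for the first hit."""
    for priority, extract in enumerate(extractors, start=1):
        url = extract(article)
        if url:
            # Stop here so that slower, lower-priority methods never run.
            return priority, url
    return None
```

Returning the priority alongside the URL makes it easy to record which method produced each image, which is one way per-method statistics like those in the table could be gathered.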

Method 1: RSS Native Image

Fastest method. Some RSS feeds include `<image>` tags with article images:

<image>
    <url>https://example.com/image.jpg</url>
    <width>1200</width>
    <height>630</height>
</image>

Coverage: BBC, Sky News (~40%)
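
Reading that element takes only a few lines with Python's standard library. The sketch below assumes the `<image>` element is not namespaced and sits somewhere inside the parsed fragment; `rss_image_url` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def rss_image_url(xml_fragment: str) -> Optional[str]:
    """Return the <url> inside the first <image> element, or None."""
    root = ET.fromstring(xml_fragment)
    # iter() also yields the root itself, so a bare <image> fragment works
    # as well as a larger <item> or <rss> document.
    for image in root.iter("image"):
        url = image.findtext("url")
        if url and url.strip():
            return url.strip()
    return None


sample = """<item>
  <title>Example</title>
  <image>
    <url>https://example.com/image.jpg</url>
    <width>1200</width>
    <height>630</height>
  </image>
</item>"""

print(rss_image_url(sample))
```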

Methods 2-5: Metadata Extraction

When RSS fails, scrape article HTML for meta tags:

<!-- Open Graph (most common) -->
<meta property="og:image" content="https://...">

<!-- Twitter Card -->
<meta name="twitter:image" content="https://...">

<!-- JSON-LD Schema (inside a <script type="application/ld+json"> block) -->
"image": "https://..."

Combined Coverage: Most outlets (~35-40%)
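
A minimal sketch of the metadata lookup follows. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra dependencies, and it covers only Open Graph, Twitter Card, and JSON-LD. The class and function names are hypothetical, and real JSON-LD can be more deeply nested than this handles.

```python
import json
from html.parser import HTMLParser
from typing import Dict, List, Optional


class MetaImageParser(HTMLParser):
    """Collect og:image, twitter:image and JSON-LD image candidates."""

    def __init__(self) -> None:
        super().__init__()
        self.found: Dict[str, str] = {}
        self._in_jsonld = False
        self._jsonld_parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta":
            # Open Graph uses property=..., Twitter Cards use name=...
            key = a.get("property") or a.get("name")
            if key in ("og:image", "twitter:image") and a.get("content"):
                self.found.setdefault(key, a["content"])
        elif tag == "script" and a.get("type") == "application/ld+json":
            self._in_jsonld = True
            self._jsonld_parts = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._jsonld_parts.append(data)

    def handle_endtag(self, tag):
        if tag != "script" or not self._in_jsonld:
            return
        self._in_jsonld = False
        try:
            doc = json.loads("".join(self._jsonld_parts))
        except json.JSONDecodeError:
            return
        image = doc.get("image") if isinstance(doc, dict) else None
        if isinstance(image, dict):        # object form with a "url" key
            image = image.get("url")
        elif isinstance(image, list):      # list form: take the first entry
            image = image[0] if image else None
        if isinstance(image, str):
            self.found.setdefault("json-ld", image)


def meta_image(html: str) -> Optional[str]:
    """Return the first image found, in priority order, or None."""
    parser = MetaImageParser()
    parser.feed(html)
    parser.close()
    for key in ("og:image", "twitter:image", "json-ld"):
        if key in parser.found:
            return parser.found[key]
    return None
```

The final loop applies the priority order from the table: Open Graph wins over Twitter Card, which wins over JSON-LD, regardless of where each tag appears in the page.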

Method 6: Source-Specific CDN Patterns

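Each outlet serves images from a recognizable CDN host: BBC uses `images.bbc.co.uk` and the Guardian uses `media.guim.co.uk`. For outlets with access restrictions, route the image through a proxy service:

# Guardian bypass via weserv.nl proxy
from urllib.parse import quote

original_url = "https://media.guim.co.uk/abc123/image.jpg"
# Percent-encode the URL so it survives as a single query parameter
proxied_url = f"https://images.weserv.nl/?url={quote(original_url, safe='')}"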

Coverage: Guardian, Washington Post, NPR (~10-15%)

Methods 7-9: HTML Parsing

When metadata fails, parse article HTML for images:

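from bs4 import BeautifulSoup
import requests

article_url = "https://example.com/article"  # placeholder; substitute a real article URL
response = requests.get(article_url, timeout=5)
soup = BeautifulSoup(response.content, 'html.parser')

# Method 7: standard images
images = soup.find_all('img')

# Method 8: lazy-loaded images (the real URL is stored in data-src)
lazy_images = soup.find_all('img', attrs={'data-src': True})

# Method 9: responsive images in srcset
for img in images:
    if img.get('srcset'):
        # Each candidate looks like "URL 800w" or "URL 2x"
        candidates = [c.split() for c in img['srcset'].split(',') if c.strip()]
        # Pick largest resolution by its descriptor; a bare URL counts as 1x
        largest = max(candidates, key=lambda c: float(c[1][:-1]) if len(c) > 1 else 1.0)
        largest_url = largest[0]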

Coverage: Fallback for remaining articles (~15-20%)

Image Quality & Sizing

Preferred Dimensions

For dashboard and social sharing, images should meet these target specifications:

Aspect ratio: 16:9 or 1.2:1 (landscape)
Minimum width: 800px (for quality)
File size: <2 MB (for load speed)
Format: JPEG, PNG, or WebP

Validation

Before storing, each image is validated against hard limits, which are looser than the preferred targets above:

URL accessibility (200 HTTP response)
Valid image format (JPEG, PNG, WebP)
Minimum dimensions (>400px width)
File size check (<10 MB)
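
The four checks can be sketched as a pure function over metadata that has already been gathered. The `ImageInfo` type and `validation_errors` name are hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption, since the checklist does not define the unit precisely.

```python
from dataclasses import dataclass
from typing import List

ALLOWED_TYPES = {"image/jpeg", "image/png", "image/webp"}
MIN_WIDTH_PX = 400              # checklist: width must exceed 400px
MAX_BYTES = 10 * 1024 * 1024    # checklist: size must be under 10 MB (assumed binary MB)


@dataclass
class ImageInfo:
    """Hypothetical summary of one fetched image."""
    status: int         # HTTP status of the image request
    content_type: str   # Content-Type header value
    width: int          # pixel width
    size_bytes: int     # body size in bytes


def validation_errors(info: ImageInfo) -> List[str]:
    """Return a list of failed checks; an empty list means the image passes."""
    errors = []
    if info.status != 200:
        errors.append(f"unexpected HTTP status {info.status}")
    # Drop parameters such as "; charset=..." before comparing.
    mime = info.content_type.split(";")[0].strip().lower()
    if mime not in ALLOWED_TYPES:
        errors.append(f"unsupported format {mime!r}")
    if info.width <= MIN_WIDTH_PX:
        errors.append(f"width {info.width}px is not above {MIN_WIDTH_PX}px")
    if info.size_bytes >= MAX_BYTES:
        errors.append(f"size {info.size_bytes} bytes is not under {MAX_BYTES}")
    return errors
```

Returning every failed check, rather than stopping at the first, makes it easier to see why a given outlet's images are being rejected.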

Handling Edge Cases

Blocked/Restricted Images

Some outlets block external access. Tagtaly uses three strategies:

Proxy services: weserv.nl, imgproxy, or similar
User-Agent headers: Send a browser-like User-Agent with the request
Fallback placeholder: Use generic news image
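
One way to chain the three strategies is sketched below. The order simply follows the list above and is an assumption, as are the `resolve_image` name, the placeholder URL, and the User-Agent string. The status check is passed in as a function so the logic can be exercised without network access.

```python
from typing import Callable, Mapping
from urllib.parse import quote

# Hypothetical values, for illustration only.
PLACEHOLDER_URL = "https://example.com/static/news-placeholder.jpg"
BROWSER_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example)"}

# fetch_status(url, headers) -> HTTP status code of a request for url.
FetchStatus = Callable[[str, Mapping[str, str]], int]


def resolve_image(url: str, fetch_status: FetchStatus) -> str:
    """Return a usable image URL, falling back to a placeholder."""
    # 1. Proxy service: percent-encode the original URL as one parameter.
    proxied = "https://images.weserv.nl/?url=" + quote(url, safe="")
    if fetch_status(proxied, {}) == 200:
        return proxied
    # 2. Direct request with a browser-like User-Agent header.
    if fetch_status(url, BROWSER_HEADERS) == 200:
        return url
    # 3. Nothing worked: use the generic placeholder.
    return PLACEHOLDER_URL
```

In production, `fetch_status` could wrap an HTTP client call; in tests it can be a small fake that returns fixed status codes.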

Corrupted/Invalid Images

If all 9 methods fail, use a source-branded placeholder image. This ensures every article displays a visual, although some are placeholders rather than actual article images.

Performance Optimization

Caching

Image URLs are cached to avoid redundant fetches:

# Cache of successful extractions: article URL -> image URL
image_cache = {}

# For duplicate articles (same URL), reuse the cached image
if article_url in image_cache:
    image_url = image_cache[article_url]
else:
    image_url = extract_image(article_url)
    if image_url:
        # Store the result so the next duplicate skips extraction
        image_cache[article_url] = image_url

Timeout Handling

HTML scraping has a 5-second timeout per article. If a page takes longer, the request is abandoned and the pipeline moves on to the next article. This prevents slow websites from blocking the pipeline.

Batch Processing

Image extraction runs in parallel for multiple articles (up to 10 concurrent), reducing total extraction time from minutes to seconds.
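
A thread pool is one straightforward way to get this kind of concurrency for network-bound work. The sketch below assumes threads are used (the source does not say which mechanism Tagtaly uses). `extract_all` is a hypothetical name, and the per-article extractor is passed in as a parameter.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Optional


def extract_all(
    urls: Iterable[str],
    extract: Callable[[str], Optional[str]],
    max_workers: int = 10,
) -> Dict[str, Optional[str]]:
    """Run extract() over urls with at most max_workers running at once."""
    results: Dict[str, Optional[str]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception:
                # A timeout or network error on one article must not
                # stop the rest of the batch.
                results[url] = None
    return results
```

Articles whose extraction raised an error map to `None`, so the caller can substitute a placeholder for them afterwards.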

Results & Statistics

Coverage by Source:

BBC: 100% (RSS native images)
Guardian: 100% (CDN proxy bypass)
Sky News: 100% (RSS + HTML)
Independent: 98% (meta + HTML)
Washington Post: 99% (JSON-LD + proxy)

Overall Performance:

Total articles processed: 501
Images found: 497 (99.2%)
Placeholder fallbacks: 4 (0.8%)
Average extraction time: 1.2 seconds per article

Future Improvements

AI-powered image ranking (choose best image if multiple exist)
Image quality scoring (blur detection, relevance)
Copyright checking (license verification)
Video thumbnail extraction (for video articles)

Questions?

For questions about image extraction, contact admin@tagtaly.com.