
How to extract img src, title, and alt from HTML using PHP?

by Anton Shumikhin · Feb 17, 2025
TLDR

Extract img details in PHP using DOMDocument and DOMXPath:

$dom = new DOMDocument();
@$dom->loadHTML($html); // Obi-Wan Kenobi: "This is not the error you're looking for!"
$xpath = new DOMXPath($dom);

// Extract 'src', 'title', 'alt' for all img tags with the power of XPath
foreach ($xpath->query("//img") as $img) {
    $src = $img->getAttribute('src');
    $title = $img->getAttribute('title');
    $alt = $img->getAttribute('alt');
    // TODO: Prepare a welcoming party for above variables
}

This method offers a robust way to fetch the img attributes from HTML documents.

Unraveling the HTML Parse-nomicon with PHP

Parsing HTML with PHP shouldn't feel like performing a ritual from the Necronomicon. Your tool of choice is DOMDocument.

Regex or DOMDocument: The showdown

Regex for HTML parsing is the equivalent of using a shotgun for surgery. Sure, it might work on one page, but it's messy and it breaks as soon as attribute order, quoting, or whitespace changes.

On the flip side, DOMDocument is your surgeon's scalpel: precise, tolerant of HTML that isn't valid XHTML, and indifferent to the order in which attributes appear.
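To make the attribute-order point concrete, here is a minimal sketch. The two sample tags are made up for illustration: one is tidy, the other uses an uppercase tag name, single quotes, unquoted values, and a different attribute order. Both should come out the same way.

```php
<?php
// Two img tags with different casing, quoting, and attribute order (sample data).
$html = '<img src="a.png" alt="A" title="One">'
      . "<IMG title='Two' alt=B src=b.png>";

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$rows = [];
foreach ($dom->getElementsByTagName('img') as $img) {
    // Attributes are looked up by name, so their order in the markup is irrelevant.
    $rows[] = $img->getAttribute('src') . '|'
            . $img->getAttribute('title') . '|'
            . $img->getAttribute('alt');
}

print_r($rows);
```

A regex that expects src before alt would need a separate pattern for the second tag; the DOM lookup does not care.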

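Brace your PHP code against the angry sea of malformed HTML:

  • libxml_use_internal_errors(true): Duct tape for your error messages. Parser warnings are collected internally instead of being printed; call libxml_clear_errors() afterwards so they don't pile up.
  • HTML structure: Enclose fragments in a basic HTML structure before DOMDocument loading.
  • Character encoding: Set <meta charset="UTF-8"> (or the older http-equiv Content-Type form) if necessary. DOMDocument respects a declared charset; without one, loadHTML() typically falls back to ISO-8859-1 and non-ASCII text comes out garbled.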
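The three precautions above can be combined in one small helper. This is a sketch, not the only way to do it: the function name extractImages() and the wrapper markup are assumptions made for illustration, and the http-equiv form of the charset declaration is used because it is the one older libxml2 builds are known to recognise.

```php
<?php
// Hypothetical helper: returns src/title/alt for every img in an HTML fragment.
function extractImages(string $fragment): array
{
    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    // Wrap the fragment in a minimal document and declare the encoding.
    $html = '<!DOCTYPE html><html><head>'
          . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
          . '</head><body>' . $fragment . '</body></html>';

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $images = [];
    foreach ($dom->getElementsByTagName('img') as $img) {
        $images[] = [
            'src'   => $img->getAttribute('src'),
            'title' => $img->getAttribute('title'),
            'alt'   => $img->getAttribute('alt'),
        ];
    }
    return $images;
}

// Usage: a fragment with an unclosed <p> and a non-ASCII alt text.
$images = extractImages('<p>Unclosed paragraph <img alt="Café" src="/a.png" title="First">');
print_r($images);
```

Restoring the previous libxml setting keeps the helper from silently changing error handling for the rest of the script.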

Alternative paths in the labyrinth

Sometimes you'll need a thinner thread to navigate the HTML labyrinth:

  • simplexml_load_string(): A lighter, simpler tool for XPath queries, but it only accepts well-formed XML, so it suits clean XHTML rather than tag soup.
  • simplexml_import_dom(): Convert DOMDocument to SimpleXMLElement for XPath usage when you're already in the DOMDocument realm.
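Here is a short sketch of the second route: let DOMDocument do the forgiving HTML parse, then hand the tree to SimpleXML for the XPath query. The sample markup is invented for the example.

```php
<?php
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<html><body><img src="logo.png" alt="Logo" title="Home"></body></html>');
libxml_clear_errors();

// Convert the already-parsed DOM tree into a SimpleXMLElement.
$xml = simplexml_import_dom($dom);

$found = [];
foreach ($xml->xpath('//img') as $img) {
    // Attributes are read with array syntax; cast to string to get plain values.
    $found[] = [
        'src'   => (string) $img['src'],
        'title' => (string) $img['title'],
        'alt'   => (string) $img['alt'],
    ];
}

print_r($found);
```

A missing attribute casts to an empty string here, which matches what getAttribute() returns on the DOM side.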

Conquering performance bottlenecks and edge cases

Efficiently parsing large HTML documents

When dealing with the Krakens of HTML documents:

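  • Load only the essential parts of the document, for example the one container that holds the images, to save processing time and memory.
  • Don't rely on ob_start() for this: it buffers the output your script emits, not the HTML that DOMDocument parses, so it won't stop a huge document from swamping your server's memory.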
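One rough way to load only the essential part is to cut the interesting region out of the raw string before parsing. The sketch below assumes the images you care about sit inside a single, non-nested <main> element; the helper name sliceRegion() and the sample page are hypothetical.

```php
<?php
// Hypothetical helper: return the substring from $startMarker through $endMarker,
// or the whole input when either marker is missing.
function sliceRegion(string $html, string $startMarker, string $endMarker): string
{
    $start = stripos($html, $startMarker);
    if ($start === false) {
        return $html;
    }
    $end = stripos($html, $endMarker, $start);
    if ($end === false) {
        return $html;
    }
    return substr($html, $start, $end - $start + strlen($endMarker));
}

$page = '<html><body><header><img src="logo.png" alt="Logo"></header>'
      . '<main id="content"><img src="photo.jpg" alt="Photo" title="Beach"></main>'
      . '<footer><img src="badge.png" alt="Badge"></footer></body></html>';

// Only the <main> region reaches the parser.
$region = sliceRegion($page, '<main', '</main>');

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($region);
libxml_clear_errors();

$sources = [];
foreach ($dom->getElementsByTagName('img') as $img) {
    $sources[] = $img->getAttribute('src');
}

print_r($sources);
```

This is plain string slicing, so it only holds up when the markers are unambiguous. If they are not, parse the whole document and narrow the XPath query instead.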

Caching - The secret weapon for performance

Unleash the power of caching:

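  • Cache the extracted attribute data (or the saveHTML() output) and stash it for repeated use. DOMDocument objects themselves can't be stored with serialize(), so keep the result, not the object. Think of it as a roadmap for your HTML.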
  • Control the freshness of your data with ETags or Last-Modified headers to avoid unnecessary HTML fetches.
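A minimal sketch of the caching idea follows. It stores the extracted attributes as JSON keyed by a hash of the HTML, which is easier to persist than a DOM object. The function name cachedImageData(), the file naming scheme, and the temp-directory location are all assumptions made for the example.

```php
<?php
// Hypothetical cache wrapper: parse once per distinct HTML string, then reuse.
function cachedImageData(string $html, callable $extract, string $cacheDir): array
{
    $file = $cacheDir . '/img-' . hash('sha256', $html) . '.json';

    if (is_file($file)) {
        $cached = json_decode((string) file_get_contents($file), true);
        if (is_array($cached)) {
            return $cached; // cache hit: no parsing needed
        }
    }

    $data = $extract($html);
    file_put_contents($file, json_encode($data), LOCK_EX);
    return $data;
}

// Throwaway cache directory for the demo.
$cacheDir = sys_get_temp_dir() . '/img-cache-' . uniqid();
mkdir($cacheDir);

$parseCount = 0;
$extract = function (string $html) use (&$parseCount): array {
    $parseCount++; // counts how often the HTML is actually parsed
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $out = [];
    foreach ($dom->getElementsByTagName('img') as $img) {
        $out[] = [
            'src'   => $img->getAttribute('src'),
            'title' => $img->getAttribute('title'),
            'alt'   => $img->getAttribute('alt'),
        ];
    }
    return $out;
};

$html   = '<img src="a.png" title="One" alt="A">';
$first  = cachedImageData($html, $extract, $cacheDir);
$second = cachedImageData($html, $extract, $cacheDir); // served from the cache file
```

Keying on a content hash means a changed page simply produces a new cache entry. Expiring old entries is left out of this sketch.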

Decoding riddles in the HTML

Sometimes, HTML treats us like we're in an escape room:

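  • Use html_entity_decode() on attribute values that are still entity-encoded after extraction, such as double-encoded text. getAttribute() already decodes ordinary entities for you.
  • Enable the mbstring extension for dealing with the snake pit of multi-byte characters.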
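The following sketch shows both points on invented sample markup: an alt with a normal entity, and a title that was encoded twice. The mb_strlen() call is guarded with function_exists() in case the multi-byte string extension is not loaded.

```php
<?php
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<img src="a.png" alt="Tom &amp; Jerry" title="Caf&amp;eacute;">');
libxml_clear_errors();

$img = $dom->getElementsByTagName('img')->item(0);

// The parser has already turned &amp; into & for us.
$alt = $img->getAttribute('alt');

// Double-encoded input leaves a literal entity behind...
$rawTitle = $img->getAttribute('title');

// ...which html_entity_decode() resolves into a UTF-8 character.
$title = html_entity_decode($rawTitle, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// strlen() counts bytes; mb_strlen() counts characters.
$byteLength = strlen($title);
$charLength = function_exists('mb_strlen') ? mb_strlen($title, 'UTF-8') : null;
```

Decoding a value that was only encoded once would corrupt text that legitimately contains entity-like sequences, so it is worth checking what getAttribute() returned before decoding again.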