How can I extract and categorize text data from an HTML document based on specific element classes using PHP?

Mary-Kate Olsen
2024-11-12 15:48:01

Retrieve Text from Elements with Specified Class as a Comprehensive Array

The task is to extract and categorize text from an HTML document based on element classes. The document contains paragraphs with the classes "Heading1-P" and "Normal-P"; each wraps a span ("Heading1-H" or "Normal-H") that holds a heading or its body text.

To accomplish this, we can use PHP's DOMDocument and DOMXPath classes. DOMDocument parses the HTML, and DOMXPath selects elements with XPath queries. We define a custom function, parseToArray(), that takes a DOMXPath object and a class name. It queries for the elements whose class attribute equals that name and collects the text of their child nodes into an array.

Here's the detailed solution:

$test = <<< HTML
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($test);
$xpath = new DOMXPath($dom);
$heading = parseToArray($xpath, 'Heading1-H');
$content = parseToArray($xpath, 'Normal-H');

var_dump($heading);
echo "<br/>";
var_dump($content);
echo "<br/>";

function parseToArray(DOMXPath $xpath, string $class): array
{
    // "*" matches any element; an XPath step needs a node test before the predicate.
    $xpathquery = "//*[@class='$class']";
    $elements = $xpath->query($xpathquery);

    $resultarray = [];
    foreach ($elements as $element) {
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            $resultarray[] = $node->nodeValue;
        }
    }

    return $resultarray;
}

The function parseToArray() selects the elements whose class attribute matches the given name and collects the text of their child nodes into an array. Calling it twice produces two arrays, $heading and $content, which hold the chapter titles and the corresponding paragraph text. Apart from the <br/> separators, the code prints:

array(3) {
  [0]=>
  string(9) "Chapter 1"
  [1]=>
  string(9) "Chapter 2"
  [2]=>
  string(9) "Chapter 3"
}
array(3) {
  [0]=>
  string(17) "This is chapter 1"
  [1]=>
  string(17) "This is chapter 2"
  [2]=>
  string(17) "This is chapter 3"
}
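
One limitation worth noting: comparing the class attribute with "=" only matches elements whose class attribute is exactly that string, so an element with several classes (for example class="Heading1-H highlight") would be skipped. The sketch below shows one common XPath 1.0 workaround that matches a single class token. The function name textByClassToken() and the sample markup are hypothetical, not part of the original solution, and the sketch assumes the class name contains no quote characters.

```php
<?php
// Hypothetical helper (not from the original article): matches a class
// token even when the element carries several classes.
function textByClassToken(DOMXPath $xpath, string $class): array
{
    // Pad the normalized class list with spaces so "Heading1-H" matches
    // only as a whole token, never as part of a longer class name.
    $query = sprintf(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );

    $result = [];
    foreach ($xpath->query($query) as $element) {
        // textContent concatenates the text of all descendant nodes.
        $result[] = trim($element->textContent);
    }

    return $result;
}

// Sample markup (assumed for illustration): one span has two classes,
// the other contains nested markup.
$html = <<<HTML
<p class="Heading1-P"><span class="Heading1-H highlight">Chapter 1</span></p>
<p class="Normal-P"><span class="Normal-H">This is <b>chapter</b> 1</span></p>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$headings = textByClassToken($xpath, 'Heading1-H');
$content  = textByClassToken($xpath, 'Normal-H');

print_r($headings);
print_r($content);
```

Using textContent yields one string per matched element, whereas iterating childNodes yields one entry per child node; which is preferable depends on whether the matched elements contain nested markup.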

With this approach, you can retrieve text content from an HTML document and separate it by class name, so each group can be processed independently.
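
As a possible next step, the two arrays can be paired so that each heading maps to its text. This is a minimal sketch, assuming the headings and content appear in the same order, have the same count, and that no heading is repeated (a repeated key would overwrite the earlier entry). The variable $chapters is introduced here for illustration only.

```php
<?php
// Sample values shaped like the $heading and $content arrays shown above.
$heading = ['Chapter 1', 'Chapter 2', 'Chapter 3'];
$content = ['This is chapter 1', 'This is chapter 2', 'This is chapter 3'];

// array_combine() requires arrays of equal length: on PHP 8 it throws a
// ValueError otherwise, so check the counts first.
$chapters = [];
if (count($heading) === count($content)) {
    $chapters = array_combine($heading, $content);
}

// Each heading is now the key for its paragraph text.
foreach ($chapters as $title => $text) {
    echo $title, ': ', $text, PHP_EOL;
}
```

If the document could contain a heading without a following paragraph, walking the paragraphs in document order and grouping them would be a safer design than combining two independent lists.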
