Clean up HTML Content for Retrieval-Augmented Generation with Readability.js
Web scraping is a common method for gathering content for your retrieval-augmented generation (RAG) application. However, parsing web page content can be challenging.
Mozilla's open-source Readability.js library offers a convenient solution for extracting only the essential parts of a web page. Let's explore its integration into a data ingestion pipeline for a RAG application.
Web pages are rich sources of unstructured data, ideal for RAG applications. However, web pages often contain irrelevant information such as headers, sidebars, and footers. While useful for browsing, this extra content detracts from the page's main subject.
For optimal RAG data, irrelevant content must be removed. While tools like Cheerio can parse HTML based on a site's known structure, this approach is inefficient for scraping diverse website layouts. A robust method is needed to extract only relevant content.
Most browsers include a reader view that strips everything but the article title and content. Applied to a DataStax blog post, reader mode removes the navigation, sidebars, and footer that standard browsing shows, leaving only the article itself.
Mozilla provides Readability.js, the library behind Firefox's reader mode, as a standalone open-source module. This allows us to integrate Readability.js into a data pipeline to remove irrelevant content and improve scraping results.
Let's illustrate scraping article content from a previous blog post about creating vector embeddings in Node.js. The following JavaScript code retrieves the page's HTML:
<code class="language-javascript">const html = await fetch( "https://www.datastax.com/blog/how-to-create-vector-embeddings-in-node-js" ).then((res) => res.text()); console.log(html);</code>
This returns all of the page's HTML, including the navigation, footer, and other elements common across websites.
Alternatively, you could use Cheerio to select specific elements:
<code class="language-javascript">npm install cheerio</code>
<code class="language-javascript">import * as cheerio from "cheerio"; const html = await fetch( "https://www.datastax.com/blog/how-to-create-vector-embeddings-in-node-js" ).then((res) => res.text()); const $ = cheerio.load(html); console.log($("h1").text(), "\n"); console.log($("section#blog-content > div:first-child").text());</code>
This yields the title and article text. However, this approach relies on knowing the HTML structure, which is not always feasible.
A better approach involves installing Readability.js and jsdom:
<code class="language-bash">npm install @mozilla/readability jsdom</code>
Readability.js expects a browser environment, so in Node.js we need jsdom to simulate one. We can then convert the fetched HTML into a document and have Readability.js parse the content:
<code class="language-javascript">import { Readability } from "@mozilla/readability"; import { JSDOM } from "jsdom"; const url = "https://www.datastax.com/blog/how-to-create-vector-embeddings-in-node-js"; const html = await fetch(url).then((res) => res.text()); const doc = new JSDOM(html, { url }); const reader = new Readability(doc.window.document); const article = reader.parse(); console.log(article);</code>
The <code>article</code> object contains the various parsed elements:
This includes the title, author, excerpt, publication time, and the content as both HTML (<code>content</code>) and plain text (<code>textContent</code>). The <code>textContent</code> is ready for chunking, embedding, and storage, while <code>content</code> retains links and images for further processing.
The <code>isProbablyReaderable</code> function helps determine whether a document is suitable for Readability.js before you parse it:

<code class="language-javascript">import { isProbablyReaderable } from "@mozilla/readability";
import { JSDOM } from "jsdom";

const url =
  "https://www.datastax.com/blog/how-to-create-vector-embeddings-in-node-js";
const html = await fetch(url).then((res) => res.text());
const doc = new JSDOM(html, { url });

if (isProbablyReaderable(doc.window.document)) {
  // the page looks like an article; hand it to Readability
} else {
  // flag the page for manual review
}</code>
Unsuitable pages should be flagged for review.
Readability.js integrates seamlessly with LangChain.js. The following example uses LangChain.js to load a page, extract the content with <code>MozillaReadabilityTransformer</code>, split the text with <code>RecursiveCharacterTextSplitter</code>, create embeddings with OpenAI, and store the data in Astra DB.
First, install the required dependencies. The package split below reflects the current LangChain.js layout and may differ between versions:

<code class="language-bash">npm install langchain @langchain/community @langchain/openai @mozilla/readability jsdom</code>
You'll need Astra DB credentials (<code>ASTRA_DB_APPLICATION_TOKEN</code>, <code>ASTRA_DB_API_ENDPOINT</code>) and an OpenAI API key (<code>OPENAI_API_KEY</code>) set as environment variables.
Import the necessary modules. The import paths below are a reasonable sketch based on <code>@langchain/community</code>'s module layout; check the LangChain.js documentation for your installed version:

<code class="language-javascript">// Note: these import paths are assumptions based on @langchain/community's layout
import { HTMLWebBaseLoader } from "@langchain/community/document_loaders/web/html";
import { MozillaReadabilityTransformer } from "@langchain/community/document_transformers/mozilla_readability";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { OpenAIEmbeddings } from "@langchain/openai";
import { AstraDBVectorStore } from "@langchain/community/vectorstores/astradb";</code>
Initialize the components. The chunking parameters here are illustrative choices, not requirements:

<code class="language-javascript">const loader = new HTMLWebBaseLoader(
  "https://www.datastax.com/blog/how-to-create-vector-embeddings-in-node-js"
);

const transformer = new MozillaReadabilityTransformer();

// Illustrative chunking parameters - tune for your content
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

const embeddings = new OpenAIEmbeddings(); // reads OPENAI_API_KEY from the environment</code>
Load, transform, split, embed, and store the documents. The collection name and vector settings below are illustrative; the 1536 dimension matches OpenAI's <code>text-embedding-ada-002</code> embeddings:

<code class="language-javascript">const docs = await loader.load();

// Strip boilerplate with Readability, then split into chunks
const cleanDocs = await transformer.transformDocuments(docs);
const chunks = await splitter.splitDocuments(cleanDocs);

// Embed the chunks and store them in Astra DB
await AstraDBVectorStore.fromDocuments(chunks, embeddings, {
  token: process.env.ASTRA_DB_APPLICATION_TOKEN,
  endpoint: process.env.ASTRA_DB_API_ENDPOINT,
  collection: "articles", // illustrative collection name
  collectionOptions: {
    vector: { dimension: 1536, metric: "cosine" },
  },
});</code>
Readability.js, the robust library powering Firefox's reader mode, efficiently extracts relevant data from web pages, improving RAG data quality. It can be used directly or via LangChain.js's <code>MozillaReadabilityTransformer</code>.
This is just the initial stage of your ingestion pipeline. Chunking, embedding, and Astra DB storage are subsequent steps in building your RAG application.
Do you employ other methods for cleaning web content in your RAG applications? Share your techniques!