search
HomeWeb Front-endJS TutorialClean up HTML Content for Retrieval-Augmented Generation with Readability.js

Web scraping is a common method for gathering content for your retrieval-augmented generation (RAG) application. However, parsing web page content can be challenging.

Mozilla's open-source Readability.js library offers a convenient solution for extracting only the essential parts of a web page. Let's explore its integration into a data ingestion pipeline for a RAG application.

Extracting Unstructured Data from Web Pages

Web pages are rich sources of unstructured data, ideal for RAG applications. However, web pages often contain irrelevant information such as headers, sidebars, and footers. While useful for browsing, this extra content detracts from the page's main subject.

For optimal RAG data, irrelevant content must be removed. While tools like Cheerio can parse HTML based on a site's known structure, this approach is inefficient for scraping diverse website layouts. A robust method is needed to extract only relevant content.

Leveraging Reader View Functionality

Most browsers include a reader view that removes all but the article title and content. The following image illustrates the difference between standard browsing and reader mode applied to a DataStax blog post:

Clean up HTML Content for Retrieval-Augmented Generation with Readability.js

Mozilla provides Readability.js, the library behind Firefox's reader mode, as a standalone open-source module. This allows us to integrate Readability.js into a data pipeline to remove irrelevant content and improve scraping results.

Scraping Data with Node.js and Readability.js

Let's illustrate scraping article content from a previous blog post about creating vector embeddings in Node.js. The following JavaScript code retrieves the page's HTML:

const html = await fetch(
  "https://www.datastax.com/blog/how-to-create-vector-embeddings-in-node-js"
).then((res) => res.text());
console.log(html);

This includes all HTML, including navigation, footers, and other elements common on websites.

Alternatively, you could use Cheerio to select specific elements:

npm install cheerio
import * as cheerio from "cheerio";

const html = await fetch(
  "https://www.datastax.com/blog/how-to-create-vector-embeddings-in-node-js"
).then((res) => res.text());

const $ = cheerio.load(html);

console.log($("h1").text(), "\n");
console.log($("section#blog-content > div:first-child").text());

This yields the title and article text. However, this approach relies on knowing the HTML structure, which is not always feasible.

A better approach involves installing Readability.js and jsdom:

npm install @mozilla/readability jsdom

Readability.js operates within a browser environment, requiring jsdom to simulate this in Node.js. We can convert the loaded HTML into a document and use Readability.js to parse the content:

import { Readability } from "@mozilla/readability";
import { JSDOM } from "jsdom";

const url = "https://www.datastax.com/blog/how-to-create-vector-embeddings-in-node-js";
const html = await fetch(url).then((res) => res.text());

const doc = new JSDOM(html, { url });
const reader = new Readability(doc.window.document);
const article = reader.parse();

console.log(article);

The article object contains various parsed elements:

Clean up HTML Content for Retrieval-Augmented Generation with Readability.js

This includes the title, author, excerpt, publication time, and both HTML (content) and plain text (textContent). textContent is ready for chunking, embedding, and storage, while content retains links and images for further processing.

The isProbablyReaderable function helps determine if the document is suitable for Readability.js:

const html = await fetch(
  "https://www.datastax.com/blog/how-to-create-vector-embeddings-in-node-js"
).then((res) => res.text());
console.log(html);

Unsuitable pages should be flagged for review.

Integrating Readability with LangChain.js

Readability.js integrates seamlessly with LangChain.js. The following example uses LangChain.js to load a page, extract content with MozillaReadabilityTransformer, split text with RecursiveCharacterTextSplitter, create embeddings with OpenAI, and store data in Astra DB.

Required dependencies:

npm install cheerio

You'll need Astra DB credentials ( ASTRA_DB_APPLICATION_TOKEN, ASTRA_DB_API_ENDPOINT) and an OpenAI API key (OPENAI_API_KEY) as environment variables.

Import necessary modules:

import * as cheerio from "cheerio";

const html = await fetch(
  "https://www.datastax.com/blog/how-to-create-vector-embeddings-in-node-js"
).then((res) => res.text());

const $ = cheerio.load(html);

console.log($("h1").text(), "\n");
console.log($("section#blog-content > div:first-child").text());

Initialize components:

npm install @mozilla/readability jsdom

Load, transform, split, embed, and store documents:

import { Readability } from "@mozilla/readability";
import { JSDOM } from "jsdom";

const url = "https://www.datastax.com/blog/how-to-create-vector-embeddings-in-node-js";
const html = await fetch(url).then((res) => res.text());

const doc = new JSDOM(html, { url });
const reader = new Readability(doc.window.document);
const article = reader.parse();

console.log(article);

Improved Web Scraping Accuracy with Readability.js

Readability.js, a robust library powering Firefox's reader mode, efficiently extracts relevant data from web pages, improving RAG data quality. It can be used directly or via LangChain.js's MozillaReadabilityTransformer.

This is just the initial stage of your ingestion pipeline. Chunking, embedding, and Astra DB storage are subsequent steps in building your RAG application.

Do you employ other methods for cleaning web content in your RAG applications? Share your techniques!

The above is the detailed content of Clean up HTML Content for Retrieval-Augmented Generation with Readability.js. For more information, please follow other related articles on the PHP Chinese website!

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Replace String Characters in JavaScriptReplace String Characters in JavaScriptMar 11, 2025 am 12:07 AM

Detailed explanation of JavaScript string replacement method and FAQ This article will explore two ways to replace string characters in JavaScript: internal JavaScript code and internal HTML for web pages. Replace string inside JavaScript code The most direct way is to use the replace() method: str = str.replace("find","replace"); This method replaces only the first match. To replace all matches, use a regular expression and add the global flag g: str = str.replace(/fi

8 Stunning jQuery Page Layout Plugins8 Stunning jQuery Page Layout PluginsMar 06, 2025 am 12:48 AM

Leverage jQuery for Effortless Web Page Layouts: 8 Essential Plugins jQuery simplifies web page layout significantly. This article highlights eight powerful jQuery plugins that streamline the process, particularly useful for manual website creation

Build Your Own AJAX Web ApplicationsBuild Your Own AJAX Web ApplicationsMar 09, 2025 am 12:11 AM

So here you are, ready to learn all about this thing called AJAX. But, what exactly is it? The term AJAX refers to a loose grouping of technologies that are used to create dynamic, interactive web content. The term AJAX, originally coined by Jesse J

10 Mobile Cheat Sheets for Mobile Development10 Mobile Cheat Sheets for Mobile DevelopmentMar 05, 2025 am 12:43 AM

This post compiles helpful cheat sheets, reference guides, quick recipes, and code snippets for Android, Blackberry, and iPhone app development. No developer should be without them! Touch Gesture Reference Guide (PDF) A valuable resource for desig

Improve Your jQuery Knowledge with the Source ViewerImprove Your jQuery Knowledge with the Source ViewerMar 05, 2025 am 12:54 AM

jQuery is a great JavaScript framework. However, as with any library, sometimes it’s necessary to get under the hood to discover what’s going on. Perhaps it’s because you’re tracing a bug or are just curious about how jQuery achieves a particular UI

How do I create and publish my own JavaScript libraries?How do I create and publish my own JavaScript libraries?Mar 18, 2025 pm 03:12 PM

Article discusses creating, publishing, and maintaining JavaScript libraries, focusing on planning, development, testing, documentation, and promotion strategies.

10 jQuery Fun and Games Plugins10 jQuery Fun and Games PluginsMar 08, 2025 am 12:42 AM

10 fun jQuery game plugins to make your website more attractive and enhance user stickiness! While Flash is still the best software for developing casual web games, jQuery can also create surprising effects, and while not comparable to pure action Flash games, in some cases you can also have unexpected fun in your browser. jQuery tic toe game The "Hello world" of game programming now has a jQuery version. Source code jQuery Crazy Word Composition Game This is a fill-in-the-blank game, and it can produce some weird results due to not knowing the context of the word. Source code jQuery mine sweeping game

jQuery Parallax Tutorial - Animated Header BackgroundjQuery Parallax Tutorial - Animated Header BackgroundMar 08, 2025 am 12:39 AM

This tutorial demonstrates how to create a captivating parallax background effect using jQuery. We'll build a header banner with layered images that create a stunning visual depth. The updated plugin works with jQuery 1.6.4 and later. Download the

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

Hot Tools

PhpStorm Mac version

PhpStorm Mac version

The latest (2018.2.1) professional PHP integrated development tool

Dreamweaver Mac version

Dreamweaver Mac version

Visual web development tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

MinGW - Minimalist GNU for Windows

MinGW - Minimalist GNU for Windows

This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)