Home >Web Front-end >JS Tutorial >Building a Meta Tags Scraping API in Under Lines of Code

Building a Meta Tags Scraping API in Under Lines of Code

DDD
DDDOriginal
2024-10-21 16:33:02568browse

Have you ever wondered how messaging apps like Whatsapp or Telegram let you see a preview of a link that you send?

Building a Meta Tags Scraping API in Under Lines of Code

Building a Meta Tags Scraping API in Under Lines of Code


Whatsapp and Telegram url previews

In this post, we'll be building a scraping API with Deno that accepts a URL and retrieves the meta tags for it, so we can get fields like the title, description, image and more from almost any website.

For example:

curl https://metatags.deno.dev/api/meta?url=https://dev.to

will give this result

{
  "last-updated": "2024-10-15 15:10:02 UTC",
  "user-signed-in": "false",
  "head-cached-at": "1719685934",
  "environment": "production",
  "description": "A constructive and inclusive social network for software developers. With you every step of your journey.",
  "keywords": "software development, engineering, rails, javascript, ruby",
  "og:type": "website",
  "og:url": "https://dev.to/",
  "og:title": "DEV Community",
  "og:image": "https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8lvvnvil0m75nw7yi6iz.jpg",
  "og:description": "A constructive and inclusive social network for software developers. With you every step of your journey.",
  "og:site_name": "DEV Community",
  "twitter:site": "@thepracticaldev",
  "twitter:title": "DEV Community",
  "twitter:description": "A constructive and inclusive social network for software developers. With you every step of your journey.",
  "twitter:image:src": "https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8lvvnvil0m75nw7yi6iz.jpg",
  "twitter:card": "summary_large_image",
  "viewport": "width=device-width, initial-scale=1.0, viewport-fit=cover",
  "apple-mobile-web-app-title": "dev.to",
  "application-name": "dev.to",
  "theme-color": "#000000",
  "forem:name": "DEV Community",
  "forem:logo": "https://media.dev.to/cdn-cgi/image/width=512,height=,fit=scale-down,gravity=auto,format=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8j7kvp660rqzt99zui8e.png",
  "forem:domain": "dev.to",
  "title": "DEV Community"
}

pretty cool, isn't it?

Meta tags and why do we need them

Meta tags are HTML elements that are used to provide additional information about a page to search engines and other clients.
These tags typically include a name or property attribute that defines the type of information, and a content attribute that contains the value of that information. Here’s an example of two meta tags:

<meta name="description" content="The <meta> HTML element represents metadata that cannot be represented by other HTML meta-related elements, like <base>, <link>, <script>, <style> or <title>.">
<meta property="og:image" content="https://developer.mozilla.org/mdn-social-share.cd6c4a5a.png">

The first tag provides a description of the page, while the second is an Open Graph tag that defines an image to display when the page is shared on social media.

One practical application of meta tags is building a bookmark manager. Instead of manually adding the title, description, and image for each bookmark, you can automatically scrape this information from the bookmarked URL using meta tags.

Open Graph

Open Graph is an internet protocol that was originally created by Facebook to standardize the use of metadata within a webpage to represent the content of a page, it helps social networks generate rich link previews.
Read more about it here.

Why Deno?

  1. Deno has secure defaults, meaning it requires explicit permission for file, network, and environment access, reducing the risk of security vulnerabilities.
  2. Deno is built on web standards, uses ES Modules, and aims to use Web Platform APIs (like fetch) over proprietary APIs, making Deno code very similar to the code you'll write in the browser - but still has some spec deviation from the browser.
  3. Deno has built-in TypeScript support, allowing you to write TypeScript code without a build step.
  4. Deno comes with a standard library that includes modules for common tasks like HTTP servers, file system operations, and more.
  5. Deno provides a Linter, Formatter and Test runner, allowing you to use the platform instead of relying on third party packages or tools, making it an all-in-one tool for Javascript development.
  6. Deno provides Deno Deploy which is a scalable platform for serverless JavaScript/Typescript applications that are globally distributed, ensuring minimal latency and maximum uptime.

The API that we're building will consist of two parts, a function for fetching and parsing the meta tags, and an API server which will respond to HTTP requests.

Getting the meta tags

Let's start by going to Deno Deploy and signing in.
After signing in click on "New Playground"
Building a Meta Tags Scraping API in Under Lines of Code
This we'll give us a hello world starting point.
Now we'll add function that is called getMetaTags that accepts a url and uses the Fetch API to get the HTML of the requested URL and passes it on to a package for HTML parsing (deno-dom).
To add deno-dom to our project we can use the jsr package manager:

curl https://metatags.deno.dev/api/meta?url=https://dev.to

Now we'll use the Fetch API to get the HTML as text:

{
  "last-updated": "2024-10-15 15:10:02 UTC",
  "user-signed-in": "false",
  "head-cached-at": "1719685934",
  "environment": "production",
  "description": "A constructive and inclusive social network for software developers. With you every step of your journey.",
  "keywords": "software development, engineering, rails, javascript, ruby",
  "og:type": "website",
  "og:url": "https://dev.to/",
  "og:title": "DEV Community",
  "og:image": "https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8lvvnvil0m75nw7yi6iz.jpg",
  "og:description": "A constructive and inclusive social network for software developers. With you every step of your journey.",
  "og:site_name": "DEV Community",
  "twitter:site": "@thepracticaldev",
  "twitter:title": "DEV Community",
  "twitter:description": "A constructive and inclusive social network for software developers. With you every step of your journey.",
  "twitter:image:src": "https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8lvvnvil0m75nw7yi6iz.jpg",
  "twitter:card": "summary_large_image",
  "viewport": "width=device-width, initial-scale=1.0, viewport-fit=cover",
  "apple-mobile-web-app-title": "dev.to",
  "application-name": "dev.to",
  "theme-color": "#000000",
  "forem:name": "DEV Community",
  "forem:logo": "https://media.dev.to/cdn-cgi/image/width=512,height=,fit=scale-down,gravity=auto,format=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8j7kvp660rqzt99zui8e.png",
  "forem:domain": "dev.to",
  "title": "DEV Community"
}

After getting the HTML, we can use deno-dom to parse it and then use standard DOM functions like querySelectorAll to get all the meta HTML elements, iterate on them and use getAttribute to get the name, property and content of each one of those tags:

<meta name="description" content="The <meta> HTML element represents metadata that cannot be represented by other HTML meta-related elements, like <base>, <link>, <script>, <style> or <title>.">
<meta property="og:image" content="https://developer.mozilla.org/mdn-social-share.cd6c4a5a.png">

Finally, we'll also query the element of the page to add it as a field in our API:<br> </p> <pre class="brush:php;toolbar:false">import { DOMParser, Element } from "jsr:@b-fuze/deno-dom"; </pre> <p>It isn't exactly a meta tag, but I think that it is a useful field, so it's going to be part of our API anyway. :)</p> <p>Our final getMetaTags function should look like this:<br> </p> <pre class="brush:php;toolbar:false"> const headers = new Headers(); headers.set("accept", "text/html,application/xhtml+xml,application/xml"); const res = await fetch(url, { headers }); const html = await res.text(); </pre> <h2> The server </h2> <p>For simplicity, I've decided to use Deno's built in http server which is just a simple Deno.serve() call.<br> Thanks to the fact that deno is built on web standards, we can use the built in Response object in the Fetch API to respond to requests.<br> </p> <pre class="brush:php;toolbar:false">curl https://metatags.deno.dev/api/meta?url=https://dev.to </pre> <p>Our server parses the request url, checks if we received a GET request to the /api/meta path, and calls the getMetaTags function we created and then returns the meta tags as the response body.</p> <p>We also add two headers, the first is Content-Type that is needed for the client to know the kind of data they're getting in the response, which in our case is a JSON response.</p> <p>The second header is Access-Control-Allow-Origin that allows our API to accept requests from specific origins, in our case I chose "*" to accept any origin, but you might want to change it to only accept request from your frontend's origin.<br> Note that the CORS headers will only affect requests made by the browser, meaning the browser will block the request according to the origin that is specified in the header, but directly calling the API from a server will still be possible. Read more about CORS here.</p> <p>You can now click on "Save & Deploy"<br> <img src="https://img.php.cn/upload/article/000/000/000/172949959089268.jpg" alt="Building a Meta Tags Scraping API in Under Lines of Code"><br> Then wait for deno deploy to deploy your code to the playground:<br> <img src="https://img.php.cn/upload/article/000/000/000/172949959198494.jpg" alt="Building a Meta Tags Scraping API in Under Lines of Code"><br> The url on the top right is your playground's url, copy it and add /api/meta?url=https://dev.to to see it in action, the url should look something like https://metatags.deno.dev/api/meta?url=https://dev.to<br> You should now see the API responding with dev.to's meta tags!<br> <img src="https://img.php.cn/upload/article/000/000/000/172949959294656.jpg" alt="Building a Meta Tags Scraping API in Under Lines of Code"></p> <h2> Deployment </h2> <p>Using Deno deploy's playground means your code is technically already deployed, it is public and can be accessed by anyone.<br> For a simple API like the one we're building, a single file playground can be enough, but in many cases we'd like to scale our project further, for that you can use Deno deploy's Github export to make a proper code repository for you API, with support for automatic building on new code pushes:<br> <img src="https://img.php.cn/upload/article/000/000/000/172949959428755.jpg" alt="Building a Meta Tags Scraping API in Under Lines of Code"><br> or from the playground's settings:<br> <img src="https://img.php.cn/upload/article/000/000/000/172949959544011.jpg" alt="Building a Meta Tags Scraping API in Under Lines of Code"></p> <h2> Caveats </h2> <p>The scraping method presented in this post only works on websites that have meta tags in the html file that’s returned from the server, meaning server rendered or prerendered sites are more likely to return proper results, single page apps can also work as long as the meta tags are set in build time and not in runtime.</p> <h2> Conclusion </h2> <p>We've demonstrated how quick and simple it is to build and deploy an API with Deno, we've gone over Meta tags, and how we can use the Fetch API, a DOM parser and Deno's built in server to build a Meta tags scraping API in under 40 lines of code.</p> <p>To see the project built in this post you can check out the Deno deploy playground (You'll need to add /api/meta?url=https://dev.to to the url bar on the right to see an example response) or this github repository.</p> <hr> <h2> What Will You Build Next? </h2> <p>I hope this post has inspired you to explore the power of meta tags and Deno! Try building your own version of the API or integrate it into a project like a bookmark manager. </p> <p>Got stuck, have questions, or want to show off what you built? Drop a comment below or connect with me on Twitter/X – I’d love to hear from you! </p> <p>Check out my previous post about building a react state management library in under 40 lines of code here.</p> <p>The above is the detailed content of Building a Meta Tags Scraping API in Under Lines of Code. For more information, please follow other related articles on the PHP Chinese website!</p></div><div class="nphpQianMsg"><a href="javascript:void(0);">JavaScript</a> <a href="javascript:void(0);">typescript</a> <a href="javascript:void(0);">json</a> <a href="javascript:void(0);">html</a> <a href="javascript:void(0);">Object</a> <a href="javascript:void(0);">if</a> <a href="javascript:void(0);">for</a> <a href="javascript:void(0);">while</a> <a href="javascript:void(0);">include</a> <a href="javascript:void(0);">try</a> <a href="javascript:void(0);">using</a> <a href="javascript:void(0);">public</a> <a href="javascript:void(0);">finally</a> <a href="javascript:void(0);">Attribute</a> <a href="javascript:void(0);">Property</a> <a href="javascript:void(0);">copy</a> <a href="javascript:void(0);">function</a> <a href="javascript:void(0);">dom</a> <a href="javascript:void(0);">this</a> <a href="javascript:void(0);">display</a> <a href="javascript:void(0);">github</a> <a href="javascript:void(0);">serverless</a> <a href="javascript:void(0);">kind</a> <a href="javascript:void(0);">http</a> <a href="javascript:void(0);">https</a> <a href="javascript:void(0);">Access</a><div class="clear"></div></div><div class="nphpQianSheng"><span>Statement:</span><div>The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn</div></div></div><div class="nphpSytBox"><span>Previous article:<a class="dBlack" title="How to Convert Milliseconds to a Readable Date in jQuery/JavaScript?" href="https://m.php.cn/faq/1796638584.html">How to Convert Milliseconds to a Readable Date in jQuery/JavaScript?</a></span><span>Next article:<a class="dBlack" title="How to Convert Milliseconds to a Readable Date in jQuery/JavaScript?" href="https://m.php.cn/faq/1796638593.html">How to Convert Milliseconds to a Readable Date in jQuery/JavaScript?</a></span></div><div class="nphpSytBox2"><div class="nphpZbktTitle"><h2>Related articles</h2><em><a href="https://m.php.cn/article.html" class="bBlack"><i>See more</i><b></b></a></em><div class="clear"></div></div><ins class="adsbygoogle" style="display:block" data-ad-format="fluid" data-ad-layout-key="-6t+ed+2i-1n-4w" data-ad-client="ca-pub-5902227090019525" data-ad-slot="8966999616"></ins><script> (adsbygoogle = window.adsbygoogle || []).push({}); </script><ul class="nphpXgwzList"><li><b></b><a href="https://m.php.cn/faq/1609.html" title="An in-depth analysis of the Bootstrap list group component" class="aBlack">An in-depth analysis of the Bootstrap list group component</a><div class="clear"></div></li><li><b></b><a href="https://m.php.cn/faq/1640.html" title="Detailed explanation of JavaScript function currying" class="aBlack">Detailed explanation of JavaScript function currying</a><div class="clear"></div></li><li><b></b><a href="https://m.php.cn/faq/1949.html" title="Complete example of JS password generation and strength detection (with demo source code download)" class="aBlack">Complete example of JS password generation and strength detection (with demo source code download)</a><div class="clear"></div></li><li><b></b><a href="https://m.php.cn/faq/2248.html" title="Angularjs integrates WeChat UI (weui)" class="aBlack">Angularjs integrates WeChat UI (weui)</a><div class="clear"></div></li><li><b></b><a href="https://m.php.cn/faq/2351.html" title="How to quickly switch between Traditional Chinese and Simplified Chinese with JavaScript and the trick for websites to support switching between Simplified and Traditional Chinese_javascript skills" class="aBlack">How to quickly switch between Traditional Chinese and Simplified Chinese with JavaScript and the trick for websites to support switching between Simplified and Traditional Chinese_javascript skills</a><div class="clear"></div></li></ul></div></div><ins class="adsbygoogle" style="display:block" data-ad-format="autorelaxed" data-ad-client="ca-pub-5902227090019525" data-ad-slot="5027754603"></ins><script> (adsbygoogle = window.adsbygoogle || []).push({}); </script><footer><div class="footer"><div class="footertop"><img src="/static/imghwm/logo.png" alt=""><p>Public welfare online PHP training,Help PHP learners grow quickly!</p></div><div class="footermid"><a href="https://m.php.cn/about/us.html">About us</a><a href="https://m.php.cn/about/disclaimer.html">Disclaimer</a><a href="https://m.php.cn/update/article_0_1.html">Sitemap</a></div><div class="footerbottom"><p> © php.cn All rights reserved </p></div></div></footer><script>isLogin = 0;</script><script type="text/javascript" src="/static/layui/layui.js"></script><script type="text/javascript" src="/static/js/global.js?4.9.47"></script></div><script src="https://vdse.bdstatic.com//search-video.v1.min.js"></script><link rel='stylesheet' id='_main-css' href='/static/css/viewer.min.css' type='text/css' media='all'/><script type='text/javascript' src='/static/js/viewer.min.js?1'></script><script type='text/javascript' src='/static/js/jquery-viewer.min.js'></script><script>jQuery.fn.wait = function (func, times, interval) { var _times = times || -1, //100次 _interval = interval || 20, //20毫秒每次 _self = this, _selector = this.selector, //选择器 _iIntervalID; //定时器id if( this.length ){ //如果已经获取到了,就直接执行函数 func && func.call(this); } else { _iIntervalID = setInterval(function() { if(!_times) { //是0就退出 clearInterval(_iIntervalID); } _times <= 0 || _times--; //如果是正数就 -- _self = $(_selector); //再次选择 if( _self.length ) { //判断是否取到 func && func.call(_self); clearInterval(_iIntervalID); } }, _interval); } return this; } $("table.syntaxhighlighter").wait(function() { $('table.syntaxhighlighter').append("<p class='cnblogs_code_footer'><span class='cnblogs_code_footer_icon'></span></p>"); }); $(document).on("click", ".cnblogs_code_footer",function(){ $(this).parents('table.syntaxhighlighter').css('display','inline-table');$(this).hide(); }); $('.nphpQianCont').viewer({navbar:true,title:false,toolbar:false,movable:false,viewed:function(){$('img').click(function(){$('.viewer-close').trigger('click');});}}); </script></body><!-- Matomo --><script> var _paq = window._paq = window._paq || []; /* tracker methods like "setCustomDimension" should be called before "trackPageView" */ _paq.push(['trackPageView']); _paq.push(['enableLinkTracking']); (function() { var u="https://tongji.php.cn/"; _paq.push(['setTrackerUrl', u+'matomo.php']); _paq.push(['setSiteId', '9']); var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0]; g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s); })(); </script><!-- End Matomo Code --></html>