I'm trying to build a Chrome extension that aggregates information from a range of websites when a user visits website A.
// proxyUrl is not defined in this snippet; it is assumed to be a CORS proxy prefix declared elsewhere.
async function fetchHTML(url) {
  const response = await fetch(proxyUrl + url);
  const html = await response.text();
  console.log(html);
  return html;
}

// Extract an element from the HTML content: the total number of violations
function extractTotalViolations(html) {
  const parser = new DOMParser();
  const doc = parser.parseFromString(html, "text/html");
  // querySelector returns null when nothing matches, so guard before reading textContent
  const element = doc.querySelector(".total-violations");
  return element ? element.textContent : null;
}

// URL of the page we want to scrape
const url = "https://whoownswhat.justfix.org/en/address/MANHATTAN/610/EAST%2020%20STREET";

// Fetch the page's HTML and extract the total number of violations
fetchHTML(url).then(html => {
  const totalViolations = extractTotalViolations(html);
  console.log(totalViolations);
});
When I print totalViolations, I get null. So I printed the scraped HTML and realized that I was getting JavaScript code that looks completely different from the HTML I see directly on the website. I suspect the site is using some kind of JavaScript masking, or I'm not fetching the HTML correctly.
<script> !function(e){function t(t){for(var n,l,i=t[0],f=t[1],a=t[2],p=0,s=[];p<i.length;p++)l=i[p],Object.prototype.hasOwnProperty.call(o,l)&&o[l]&&s.push(o[l][0]),o[l]=0;for(n in f)Object.prototype.hasOwnProperty.call(f,n)&&(e[n]=f[n]);for(c&&c(t);s.length;)s.shift()();return u.push.apply(u,a||[]),r()}function r(){for(var e,t=0;t<u.length;t++){for(var r=u[t],n=!0,i=1;i<r.length;i++){var f=r[i];0!==o[f]&&(n=!1)}n&&(u.splice(t--,1),e=l(l.s=r[0]))}return e}var n={},o={1:0},u=[];function l(t){if(n[t])return n[t].exports;var r=n[t]={i:t,l:!1,exports:{}};return e[t].call(r.exports,r,r.exports,l),r.l=!0,r.exports}l.m=e,l.c=n,l.d=function(e,t,r){l.o(e,t)||Object.defineProperty(e,t,{enumerable:!0,get:r})},l.r=function(e){"undefined"!=typeof Symbol&&Symbol.toStringTag&&Object.defineProperty(e,Symbol.toStringTag,{value:"Module"}),Object.defineProperty(e,"__esModule", </script>
My question is how to extract the HTML correctly so that I can parse the DOM and get all the information from the website that I want to put on the extension. Thanks.
P粉638343995 2024-03-31 00:33:57
The fact that you get JavaScript as a response means the page is built in the browser rather than on the server. When you visit the page, the first request most likely loads JavaScript code, which then runs and sends further requests to the server for the actual data. Load the page with your browser's dev tools open and carefully study those requests, including their URLs, request headers, payloads, and responses.
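One quick way to confirm that the fetched document is only a script-driven shell is to strip the scripts and tags and see how much visible text is left. The sketch below is a rough heuristic, not a reliable detector; the function name `looksClientRendered` and the 200-character threshold are assumptions made for illustration.

```javascript
// Heuristic check: does this HTML contain almost no visible text once
// scripts, styles, and tags are removed? If so, the content is probably
// rendered in the browser by JavaScript.
// The name and the default threshold are illustrative assumptions.
function looksClientRendered(html, minTextLength = 200) {
  const visibleText = html
    .replace(/<script[\s\S]*?<\/script>/gi, "") // drop inline and external scripts
    .replace(/<style[\s\S]*?<\/style>/gi, "")   // drop stylesheets
    .replace(/<[^>]+>/g, "")                    // drop remaining tags
    .replace(/\s+/g, " ")                       // collapse whitespace
    .trim();
  return visibleText.length < minTextLength;
}
```

If this returns true for the HTML printed by fetchHTML, parsing that HTML with DOMParser will not find the data, because DOMParser does not execute scripts.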
You need to replicate the request that returns the data and parse its response. If that response turns out to be HTML, you can parse it the way you've already tried; only where and how the request is sent changes. If the response is something else, such as JSON, take a closer look at the target HTML displayed on the website and write code that converts the raw server response into similar HTML.
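If the data request turns out to return JSON, the approach could look roughly like the sketch below. Everything specific in it is an assumption: `API_URL` is a placeholder for whatever URL appears in the Network tab, and the `{ buildings: [{ violations: ... }] }` shape stands in for the real response, which has to be checked in dev tools.

```javascript
// Placeholder: replace with the data request URL observed in the Network tab.
const API_URL = "https://example.com/api/address-data";

// Sum a per-building count from the parsed JSON.
// The { buildings: [{ violations }] } shape is hypothetical.
function sumViolations(payload) {
  const buildings = Array.isArray(payload?.buildings) ? payload.buildings : [];
  return buildings.reduce(
    (sum, building) => sum + (Number(building.violations) || 0),
    0
  );
}

// Convert the extracted value into markup for the extension's UI.
function renderSummary(total) {
  return `<p class="total-violations">${total}</p>`;
}

// Send the same request the page sends and reduce the response to one number.
async function fetchTotalViolations(apiUrl) {
  const response = await fetch(apiUrl, {
    headers: { Accept: "application/json" },
  });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return sumViolations(await response.json());
}
```

In the extension this might be called as fetchTotalViolations(API_URL).then(total => console.log(renderSummary(total))). Keeping the network call separate from the two pure helpers makes it possible to test the parsing against a response saved from dev tools.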