
Concurrent Scraper

Barbara Streisand
2024-11-06 15:21:03

Program objective

The program fetches several web pages concurrently, extracts the title of each page, and prints the titles in the terminal. It relies on Go's concurrency features (goroutines, a WaitGroup, and a channel) so that the requests overlap instead of running one after another, which saves time.

Explanation of the Code

Packages used

import (
    "fmt"
    "net/http"
    "sync"
    "github.com/PuerkitoBio/goquery"
)

fetchTitle function

This function is responsible for:

  • Accessing a web page (url)
  • Extracting the page title
  • Sending the result to a channel

func fetchTitle(url string, wg *sync.WaitGroup, results chan<- string) {
    defer wg.Done() // Mark this goroutine as done in the WaitGroup

Function parameters:

  • url string: Represents the address of the web page (url) that we are going to access to obtain the title
  • wg *sync.WaitGroup: Pointer to a WaitGroup, which we use to wait until all the tasks (goroutines) running at the same time have finished. The * indicates that we are passing the address of the WaitGroup and not a copy of it. A copy would have its own counter, so main would never see the Done call.
  • results chan<- string: This is a send-only channel: fetchTitle can send strings on it but cannot receive from it. It is used to pass results (titles or error messages) to the main function.

The defer wg.Done() line tells the program to mark this task (goroutine) as completed when the fetchTitle function finishes. This is important so that main knows when all tasks have been completed.
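
As a minimal sketch of why defer is used here, the example below runs a worker with two return paths and shows that wg.Done() runs on both. The report and runBoth names are hypothetical and exist only for this illustration; no network access is involved.

```go
package main

import (
	"fmt"
	"sync"
)

// report is a hypothetical worker used only for illustration. It has two
// return paths, and the deferred wg.Done() runs on both of them.
func report(name string, fail bool, wg *sync.WaitGroup, results chan<- string) {
	defer wg.Done()

	if fail {
		results <- "error: " + name
		return // early return: Done still runs
	}
	results <- "ok: " + name
	// _ = <-results // would not compile: results is send-only in this function
}

// runBoth runs report once on each path and returns how many messages arrived.
func runBoth() int {
	var wg sync.WaitGroup
	results := make(chan string, 2) // buffered, so the sends do not block

	wg.Add(2)
	go report("a", false, &wg, results) // pass the address, not a copy
	go report("b", true, &wg, results)

	wg.Wait() // returns only because Done ran on both paths
	return len(results)
}

func main() {
	fmt.Println("messages received:", runBoth())
}
```

Without defer, every early return would need its own wg.Done() call, and a forgotten one would leave wg.Wait() blocked forever.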

HTTP Request


    req, err := http.Get(url)
    if err != nil {
        results <- fmt.Sprintf("Error accessing %s: %v", url, err)
        return
    }
    defer req.Body.Close()

  • http.Get(url): This line makes an HTTP GET request to the URL. This means we are accessing the page and asking the server for its content. Despite its name, the req variable holds the server's response.
  • err != nil: Here we check whether there was an error when accessing the page (for example, if the host cannot be reached or the server is not responding). If there is an error, we send a message to the results channel and end the function with return.
  • defer req.Body.Close(): This closes the response body when fetchTitle returns, which releases the underlying network connection and its resources.
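
The request, error check, and deferred Close can be tried in isolation with the sketch below. It is not the article's code: fetchBody and demoFetch are hypothetical helpers, the page is served by a local net/http/httptest server so no Internet access is needed, and the 5-second Timeout is an assumed value (http.Get uses the default client, which has no timeout).

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// fetchBody is a hypothetical helper: it performs a GET request and returns
// the whole response body as a string.
func fetchBody(client *http.Client, url string) (string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return "", fmt.Errorf("error accessing %s: %w", url, err)
	}
	defer resp.Body.Close() // release the connection when we are done

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", fmt.Errorf("error reading %s: %w", url, err)
	}
	return string(body), nil
}

// demoFetch starts a local test server and fetches its only page.
func demoFetch() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<title>Hello</title>")
	}))
	defer srv.Close()

	// Assumption: a 5-second limit; pick a value that suits the target sites.
	client := &http.Client{Timeout: 5 * time.Second}
	return fetchBody(client, srv.URL)
}

func main() {
	body, err := demoFetch()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(body)
}
```

With a timeout on the client, one unresponsive server cannot keep a goroutine (and therefore the whole program) waiting indefinitely.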

Status Check


    if req.StatusCode != http.StatusOK {
        results <- fmt.Sprintf("Error accessing %s: status %s", url, req.Status)
        return
    }

  • req.StatusCode != http.StatusOK: We check whether the server responded with the code 200 OK (which indicates success). If it did not, the page was not returned successfully, so we send an error message to the results channel and end the function. req.Status already contains both the code and its text (for example "404 Not Found"), so the message prints it only once.
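
To see the status check react to a failing page, here is a small sketch under stated assumptions: checkStatus and demoStatus are hypothetical helpers, and the server is a local httptest server that always answers with the code it is given.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// checkStatus is a hypothetical helper: it returns an error for any
// response whose status code is not 200 OK.
func checkStatus(resp *http.Response) error {
	if resp.StatusCode != http.StatusOK {
		// resp.Status already includes the numeric code, e.g. "404 Not Found".
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}

// demoStatus starts a local test server that always answers with the given
// code, requests it, and describes the outcome.
func demoStatus(code int) string {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(code)
	}))
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err != nil {
		return "request failed: " + err.Error()
	}
	defer resp.Body.Close()

	if err := checkStatus(resp); err != nil {
		return err.Error()
	}
	return "ok"
}

func main() {
	fmt.Println(demoStatus(http.StatusOK))       // ok
	fmt.Println(demoStatus(http.StatusNotFound)) // unexpected status: 404 Not Found
}
```

Note that http.Get returns a nil error for a 404 or 500 response. The error value only covers failures to complete the request, which is why the status code has to be checked separately.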

Title Loading and Search


    doc, err := goquery.NewDocumentFromReader(req.Body)
    if err != nil {
        results <- fmt.Sprintf("Error loading document from %s: %v", url, err)
        return
    }
    title := doc.Find("title").Text()
    results <- fmt.Sprintf("Title of %s: %s", url, title)
}

  • goquery.NewDocumentFromReader(req.Body): We load the HTML content of the page (provided by req.Body) into goquery, which allows you to navigate and search specific parts of the HTML.
  • doc.Find("title").Text(): We look for the <title> tag in the HTML of the page and get the text inside it (i.e. the title).
  • results <- fmt.Sprintf("Title of %s: %s", url, title): We send the extracted title to the results channel, where it will be read later.

main function

The main function sets up and coordinates the program.

func main() {
    urls := []string{
        "http://olos.novagne.com.br/Olos/login.aspx?logout=true",
        "http://sistema.novagne.com.br/novagne/",
    }

  • urls := []string{...}: We define the list of URLs that we want to process. Each URL will be passed to a goroutine that extracts the page title.

WaitGroup and Channel Configuration

    var wg sync.WaitGroup
    results := make(chan string, len(urls)) // Channel to store the results

  • var wg sync.WaitGroup: We create a WaitGroup, which counts the running goroutines so that main can wait for all of them to finish before the program ends.
  • results := make(chan string, len(urls)): We create a results channel with a capacity equal to the number of URLs, so each goroutine can send its message without blocking. This channel will hold the messages with titles or errors.

Starting the Goroutines

    for _, url := range urls {
        wg.Add(1)
        go fetchTitle(url, &wg, results)
    }

  • for _, url := range urls: Here we loop through each URL in the list.
  • wg.Add(1): For each URL, we increment the WaitGroup counter to indicate that a new task (goroutine) will be started.
  • go fetchTitle(url, &wg, results): We call fetchTitle as a goroutine for each URL, that is, we make it run concurrently with the others.

Waiting and Displaying Results

    wg.Wait()
    close(results)

REPO: https://github.com/ionnss/Scrapper-GoRoutine

ions,

another earth day
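
The walkthrough stops at wg.Wait() and close(results) without showing how the results are displayed. One common way to finish, sketched here as an assumption rather than as the repository's actual code, is to range over the closed channel and print each message. The example is self-contained: fakeFetch is a hypothetical stand-in for fetchTitle that sends a canned message instead of making an HTTP request, and the example.com URLs are placeholders.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// fakeFetch is a hypothetical stand-in for fetchTitle: instead of making an
// HTTP request it sends a canned message for the given URL.
func fakeFetch(url string, wg *sync.WaitGroup, results chan<- string) {
	defer wg.Done()
	results <- fmt.Sprintf("Title of %s: (stub)", url)
}

// collect starts one goroutine per URL, waits for all of them, closes the
// channel, and drains it into a slice.
func collect(urls []string) []string {
	var wg sync.WaitGroup
	results := make(chan string, len(urls))

	for _, url := range urls {
		wg.Add(1)
		go fakeFetch(url, &wg, results)
	}

	wg.Wait()      // block until every goroutine has called Done
	close(results) // no more sends will happen, so the range below can end

	var out []string
	for msg := range results { // reads the buffered messages, then stops
		out = append(out, msg)
	}
	sort.Strings(out) // arrival order is not deterministic; sort for stable output
	return out
}

func main() {
	urls := []string{"http://example.com/a", "http://example.com/b"}
	for _, msg := range collect(urls) {
		fmt.Println(msg)
	}
}
```

Waiting before reading is safe here only because the channel's capacity equals the number of goroutines. With an unbuffered or smaller channel, the sends would block and wg.Wait() would never return.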