Is it possible for the code to skip an iteration when web scraping? IndexError: pop index out of range

I have code that scrapes the name and price of minerals from (so far) 14 pages and saves them to a TXT file. I first tried it with page 1 only, then wanted to add more pages to get more data. But then the code picks up something it shouldn't: random names/strings. I didn't expect it to grab such an entry, but it did, and it gave that entry the wrong price! It happens whenever a mineral has this "unexpected name", and from then on every remaining item in the list gets the wrong price.

Because this string differs from the others, the rest of the code cannot split it and raises this error:

cutted2 = split2.pop(1)
              ^^^^^^^^^^^^^
IndexError: pop index out of range

I tried to ignore these errors using one of the approaches suggested in various Stack Overflow posts:

try:
   cutted2 = split2.pop(1)
except IndexError:
   continue

It actually worked, with no errors... but then it assigned the wrong price to the wrong mineral (as I noticed)! How can I change the code so that it ignores these "strange" names and continues with the list? Below is the full code. As I recall, it stopped at URL5 and raised this pop index error:

import requests
from bs4 import BeautifulSoup
import re
def collecter(URL):
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"}

    soup = BeautifulSoup(requests.get(URL, headers=headers).text, "lxml")

    names = [n.getText(strip=True) for n in soup.select("table tr td font a")]
    prices = [
        p.getText(strip=True).split("Price:")[-1] for p
        in soup.select("table tr td font font")
    ]
    
    names[:] = [" ".join(n.split()) for n in names if not n.startswith("[")]
    prices[:] = [p for p in prices if p]

    with open("Minerals.txt", "a+", encoding='utf-8') as file:
        for name, price in zip(names, prices):
                # print(f"{name}\n{price}")
                # print("-" * 50)
                filename = str(name)+" "+str(price)+"\n"
                split1 = filename.split(' / ')          
                cutted1 = split1.pop(0)
                split2 = cutted1.split(": ")
                try:
                    cutted2 = split2.pop(1)
                except IndexError:
                    continue
                two_prices = cutted2+" "+split1.pop(0)+"\n"
                file.write(two_prices)
                      
URL1 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=0"
URL2 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=25"
URL3 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=50"
URL4 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=75"
URL5 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=100"
URL6 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=125"
URL7 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=150"
URL8 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=175"
URL9 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=200"
URL10 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=225"
URL11 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=250"
URL12 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=275"
URL13 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=300"
URL14 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=325"

collecter(URL1)
collecter(URL2)
collecter(URL3)
collecter(URL4)
collecter(URL5)
collecter(URL6)
collecter(URL7)
collecter(URL8)
collecter(URL9)
collecter(URL10)
collecter(URL11)
collecter(URL12)
collecter(URL13)
collecter(URL14)
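
The symptom described above, where every price after the unexpected name is wrong, is what zip produces when one of the two lists contains an extra entry: all later pairs shift by one, and skipping the bad name inside the loop cannot undo that. A minimal sketch with made-up item codes and prices (none of them taken from the site) illustrates the effect and one possible fix, filtering the names before pairing:

```python
# Hypothetical data: the second "name" is a stray link that has no price.
names = ["A1: Acanthite", "Unexpected link text", "B2: Adamite", "C3: Aegirine"]
prices = ["€90", "€75", "€580"]

# Skipping the stray name inside the loop, as with try/except ... continue:
skipped = []
for name, price in zip(names, prices):
    if ": " not in name:
        continue  # the stray name is dropped, but it has already consumed "€75"
    skipped.append((name, price))

# Filtering first keeps names and prices aligned:
clean_names = [n for n in names if ": " in n]
fixed = list(zip(clean_names, prices))

print(skipped)
print(fixed)
```

Here skipped pairs Adamite with the price that belongs to Aegirine, while fixed keeps each name with its own price. Whether a ": " check is the right filter for the real page is an assumption.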

EDIT: Here is the fully working code, thanks to the helper!

import requests
from bs4 import BeautifulSoup
import re
for URL in range(0,2569,25):
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"}

    soup = BeautifulSoup(requests.get(f'https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First={URL}', headers=headers).text, "lxml")

    names = [n.getText(strip=True) for n in soup.select("table tr td font>a")]

    prices = [p.getText(strip=True).split("Price:")[-1] for p in soup.select("table tr td font>font")]   
       
    names[:] = [" ".join(n.split()) for n in names if not n.startswith("[") ]
    prices[:] = [p for p in prices if p]

    with open("MineralsList.txt", "a+", encoding='utf-8') as file:
        for name, price in zip(names, prices):
                # print(f"{name}\n{price}")
                # print("-" * 50)
                filename = str(name)+"  "+str(price)+"\n"
                split1 = filename.split(' / ')          
                cutted1 = split1.pop(0)
                split2 = cutted1.split(": ")
                cutted2 = split2.pop(1)
                try:
                    two_prices = cutted2+"  "+split1.pop(0)+"\n"
                except IndexError:
                    two_prices = cutted2+"\n"
                file.write(two_prices)

But after some changes it stops with a new error: it cannot find the string by the given attribute, hence "IndexError: pop from empty list". Even soup.select("table tr td font>font") did not help the way the same change did for the names.
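
One way to avoid both IndexError variants is to check the shape of each string before taking it apart, instead of calling pop on a list that may be too short. The following is only a sketch: the helper names parse_name, parse_price and format_line are hypothetical, and it assumes that names look like "CODE: Name" and prices like "Price:€220 / US$227 / ..." as in the code above.

```python
def parse_name(raw):
    """Return the mineral name without its item code, or None if there is no code."""
    code, sep, name = raw.partition(":")
    name = " ".join(name.split())  # collapse newlines and repeated spaces
    if not sep or not name:
        return None
    return name


def parse_price(raw):
    """Return the first two prices as a tuple, or None if fewer than two exist."""
    text = raw.split("Price:")[-1]
    parts = [p.strip() for p in text.split("/")]
    if len(parts) < 2 or not parts[0] or not parts[1]:
        return None
    return parts[0], parts[1]


def format_line(raw_name, raw_price):
    """Build one output line, or None when either string is malformed."""
    name = parse_name(raw_name)
    price = parse_price(raw_price)
    if name is None or price is None:
        return None
    return f"{name} {price[0]} {price[1]}"


print(format_line("TH27AL9:\n'Pearceite' with Calcite",
                  "Price:€220 / US$227 / ¥31860 / AUD$330"))
print(format_line("Next page", "Price:€220 / US$227"))
```

A caller would write the line only when format_line returns a string. This guards the string handling only; the names and prices still have to be paired correctly before they reach it.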

P粉846294303, 308 days ago, 440 views

All answers (2)

  • P粉391955763, 2024-02-22 14:52:43

    You can try the following example, which also handles pagination:

    import requests
    from bs4 import BeautifulSoup
    
    for URL in range(0,100,25):
        headers = {"User-Agent": "Mozilla/5.0"}
    
        soup = BeautifulSoup(requests.get(f'https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First={URL}', headers=headers).text, "lxml")
    
        names = [ x.get_text(strip=True) for x in soup.select('table tr td font a')][:25]
        print(names)
        prices = [ x.get_text(strip=True) for x in soup.select('table tr td font:nth-child(3)')][:25]
        print(prices)
    
        # with open("Minerals.txt", "a+", encoding='utf-8') as file:
        #     for name, price in zip(names, prices):
        #             # print(f"{name}\n{price}")
        #             # print("-" * 50)
        #             filename = str(name)+" "+str(price)+"\n"
        #             split1 = filename.split(' / ')          
        #             cutted1 = split1.pop(0)
        #             split2 = cutted1.split(": ")
        #             try:
        #                 cutted2 = split2.pop(1)
        #             except IndexError:
        #                 continue
        #             two_prices = cutted2+" "+split1.pop(0)+"\n"
        #             file.write(two_prices)

    Output:

    ["NX51AH2:\n'lepidolite' after Elbaite with Elbaite", "TH27AL9:\n'Pearceite' with Calcite", "TFM69AN5:\n'Stilbite'", 'SM90CEX:\nAcanthite', 'TMA97AN5:\nAcanthite', 'TB90AE8:\n Acanthite', 'TZ71AK9:\nAcanthite', 'EC63G1:\nAcanthite', 'MN56K9:\nAcanthite', 'TF89AL3:\nAcanthite (Se-bearing) with Polybasite (Se-bearing) and Calcite', 'TP66AJ8:\nAcanthite (Se-bearing) with Pyrite', 'TY86AN2:\nAcanthite after Polybasite', 'TA66AF6:\nAcanthite with Calcite', 'JFD104AO2:\nAcanthite with Calcite', 'TX36AL6:\nAcanthite with Calcite', 'TA48AH1:\nAcanthite with Chalcopyrite', 'EF89L9:\nAcanthite with Pyrite and Calcite', 'TX89AN0:\nAcanthite with Siderite and Proustite', 'EA56K0:\nAcanthite with Silver', 'EC48K0:\nAcanthite with Silver', '11AT12:\nAcanthite, Calcite', '9EF89L9:\nAcanthite, Pyrite, Calcite', 'SM75TDA:\nAdamite', '2M14:\nAdamite', '20MJX66:\nAdamite']
    ['Price:€580 / US$598 / ¥84010 / AUD$890', 'Price:€220 / US$227 / ¥31860 / AUD$330', 'Price:€450 / US$464 / ¥65180 / AUD$690', 'Price:€90 / US$92 / ¥13030 / AUD$130', 'Price:€240 / US$247 / ¥34760 / AUD$370', 'Price:€540 / US$557 / ¥78220 / AUD$830', 'Price:€580 / US$598 / ¥84010 / AUD$890', 'Price:€85 / US$87 / ¥12310 / AUD$130', 'Price:€155 / US$159 / ¥22450 / AUD$230', 'Price:€460 / US$474 / ¥66630 / AUD$700', 'Price:€1500 / US$1547 / ¥217290 / AUD$2310', 'Price:€1600 / US$1651 / ¥231770 / AUD$2460', 'Price:€160 / US$165 / ¥23170 / AUD$240', 'Price:€240 / US$247 / ¥34760 / AUD$370', 'Price:€1200 / US$1238 / ¥173830 / AUD$1850', 'Price:€290 / US$299 / ¥42000 / AUD$440', 'Price:€480 / US$495 / ¥69530 / AUD$740', 'Price:€4800 / US$4953 / ¥695320 / AUD$7400', 'Price:€150 / US$154 / ¥21720 / AUD$230', 'Price:€290 / US$299 / ¥42000 / AUD$440', 'Price:€70 / US$72 / ¥10140 / AUD$100', 'Price:€320 / US$330 / ¥46350 / AUD$490', 'Price:€75 / US$77 / ¥10860 / AUD$110', 'Price:€90 / US$92 / ¥13030 / AUD$130', 'Price:€140 / US$144 / ¥20280 / AUD$215']
    ['5TD76M9:\nAdamite', 'MA54AE9:\nAdamite (variety Cu-bearing adamite) with Calcite', 'EA11Y6:\nAdamite (variety cuprian)', 'EB14Y6:\nAdamite (variety cuprian)', 'MC11X8:\nAdamite (variety cuprian) with Smithsonite', 'JRM10AN8:\nAegirine', 'MFA46AP3:\nAegirine with Zircon, Orthoclase and Quartz (variety smoky)', 'EM48AF8:\nAlabandite with Calcite', 'MC92T6:\nAlabandite with Calcite and Rhodochrosite', 'TF16AN1:\nAlabandite with Rhodochrosite', 'TX17S1:\nAlabandite with Rhodochrosite', 'TD34S1:\nAlabandite with Rhodochrosite', '10TR46:\nAlmandine', 'HM90EJ:\nAnalcime', 'EFH36AP3:\nAnalcime with Natrolite, Rhodochrosite and Serandite', 'ELR67AP1:\nAnalcime with Quartz', 'EML88AP1:\nAnalcime with Quartz', 'TF87AF4:\nAndorite', 'TR88AJ3:\nAndorite', 'ND56AN0:\nAndorite with Zinkenite', 'SM180NH:\nAndradite (variety demantoid)', 'MT86AL3:\nAndradite (variety demantoid) with Calcite', 'MA27AL7:\nAndradite (variety demantoid) with Calcite', 'TC80TL:\nAndradite (variety topazolite) with Clinochlore', 'TC85TE:\nAndradite (variety topazolite) with Clinochlore']
    ['Price:€180 / US$185 / ¥26070 / AUD$270', 'Price:€840 / US$866 / ¥121680 / AUD$1290', 'Price:€60 / US$61 / ¥8690 / AUD$90', 'Price:€90 / US$92 / ¥13030 / AUD$130', 'Price:€70 / US$72 / ¥10140 / AUD$100', 'Price:€580 / US$598 / ¥84010 / AUD$890', 'Price:€1600 / US$1651 / ¥231770 / AUD$2468', 'Price:€2700 / US$2786 / ¥391120 / AUD$4160', 'Price:€740 / US$763 / ¥107190 / AUD$1140', 'Price:€110 / US$113 / ¥15930 / AUD$160', 'Price:€220 / US$227 / ¥31860 / AUD$330', 'Price:€920 / US$949 / ¥133270 / AUD$1410', 'Price:€140 / US$144 / ¥20280 / AUD$210', 'Price:€90 / US$92 / ¥13030 / AUD$130', 'Price:€130 / US$134 / ¥18830 / AUD$200', 'Price:€260 / US$268 / ¥37660 / AUD$400', 'Price:€380 / US$392 / ¥55040 / AUD$580', 'Price:€240 / US$247 / ¥34760 / AUD$370', 'Price:€390 / US$402 / ¥56490 / AUD$600', 'Price:€150 / US$154 / ¥21720 / AUD$230', 'Price:€180 / US$185 / ¥26070 / AUD$270', 'Price:€1600 / US$1651 / ¥231770 / AUD$2460', 'Price:€2200 / US$2270 / ¥318690 / AUD$3390', 'Price:€80 / US$82 / ¥11580 / AUD$120', 'Price:€85 / US$87 / ¥12310 / AUD$130']
    ['T29NAK3:\nAndradite (variety topazolite) with Clinochlore', 'TC85TV:\nAndradite (variety topazolite) with Clinochlore', 'T89GH5:\nAndradite (variety topazolite) with Clinochlore', 'TQ94Q0:\nAndradite (variety topazolite) with Stilbite', 'SM140TFV:\nAndradite on Microcline', 'HM140NG:\nAndradite with Calcite', 'GM66R9:\nAndradite with Clinochlore', 'SM70TYW:\nAndradite with Epidote', 'TC290TVH:\nAndradite with Epidote and Microcline', 'TKX11AO7:\nAndradite with Microcline', 'TC2100TEJ:\nAndradite with Microcline', 'TH16AN2:\nAndradite with Microcline', 'TTX66AO7:\nAndradite with Microcline', 'TC2150TJL:\nAndradite with Microcline', 'TQ96AN2:\nAndradite with Microcline', 'TF48AF2:\nAnglesite', 'MA47AL4:\nAnglesite with Galena', 'LQ88AE6:\nAnglesite with Galena', 'ER90AL8:\nAnglesite with Galena', 'TP70AE1:\nAnglesite with Galena', 'N54NAL5:\nAnglesite with Galena', 'GV96R9:\nAnhydrite with Calcite and Pyrite', '11TV99:\nAnhydrite, Calcite', 'MG26AL4:\nAnorthoroselite with Calcite', 'XM260NFF:\nAragonite']
    ['Price:€240 / US$247 / ¥34760 / AUD$370', 'Price:€85 / US$87 / ¥12310 / AUD$130', 'Price:€220 / US$227 / ¥31860 / AUD$330', 'Price:€980 / US$1011 / ¥141960 / AUD$1510', 'Price:€140 / US$144 / ¥20280 / AUD$210', 'Price:€140 / US$144 / ¥20280 / AUD$210', 'Price:€160 / US$165 / ¥23170 / AUD$240', 'Price:€70 / US$72 / ¥10140 / AUD$100', 'Price:€90 / US$92 / ¥13030 / AUD$130', 'Price:€70 / US$72 / ¥10140 / AUD$100', 'Price:€100 / US$103 / ¥14480 / AUD$150', 'Price:€110 / US$113 / ¥15930 / AUD$160', 'Price:€140 / US$144 / ¥20280 / AUD$210', 'Price:€150 / US$154 / ¥21720 / AUD$230', 'Price:€220 / US$227 / ¥31860 / AUD$330', 'Price:€380 / US$392 / ¥55040 / AUD$580', 'Price:€220 / US$227 / ¥31860 / AUD$330', 'Price:€360 / US$371 / ¥52140 / AUD$550', 'Price:€540 / US$557 / ¥78220 / AUD$830', 'Price:€540 / US$557 / ¥78220 / AUD$830', 'Price:€940 / US$969 / ¥136160 / AUD$1450', 'Price:€220 / US$227 / ¥31860 / AUD$330', 'Price:€460 / US$474 / ¥66630 / AUD$700', 'Price:€140 / US$144 / ¥20280 / AUD$210', 'Price:€60 / US$61 / ¥8690 / AUD$92'] 
    ['XM295EAR:\nAragonite', 'ETE46AP2:\nAragonite', 'EXM26AP0:\nAragonite', 'EYB26AP0:\nAragonite', 'EXE56AP2:\nAragonite', 'ETF46AP0:\nAragonite', 'XM2160ERF:\nAragonite', 'EXM46AP0:\nAragonite', 'XM2190MEX:\nAragonite', 'XM2780EFT:\nAragonite', 'EHM93AO9:\nAragonite', 'TYB37AO8:\nAragonite (variety Cu-bearing aragonite)', 'SM99AM3:\nAragonite (variety cuprian)', '1M06:\nAragonite (variety flos ferri)', 'TG69AL3:\nAragonite (variety tarnowitzite)', 'MLC96AO2:\nAragonite on Calcite', 'MLE68AO2:\nAragonite on Calcite', 'MTB66AP3:\nAragonite with Quartz (variety hematoide)', 'MXF96AP3:\nAragonite with Quartz (variety hematoide)', 'MRR47AP3:\nAragonite with Quartz (variety hematoide)', 'MTR37AP3:\nAragonite with Quartz (variety hematoide)', 'JFD193AP3:\nArfvedsonite with Microcline', 'TFX76AO7:\nArsenopyrite with Calcite, Pyrite, Sphalerite and Rhodochrosite', 'NB37AL3:\nArsenopyrite with Muscovite', 'HM220NX:\nArsenopyrite with Muscovite']
    ['Price:€95 / US$98 / ¥13760 / AUD$146', 'Price:€140 / US$144 / ¥20280 / AUD$210', 'Price:€140 / US$144 / ¥20280 / AUD$210', 'Price:€140 / US$144 / ¥20280 / AUD$210', 'Price:€150 / US$154 / ¥21720 / AUD$230', 'Price:€150 / US$154 / ¥21720 / AUD$230', 'Price:€160 / US$165 / ¥23170 / AUD$246', 'Price:€160 / US$165 / ¥23170 / AUD$240', 'Price:€190 / US$196 / ¥27520 / AUD$293', 'Price:€780 / US$804 / ¥112990 / AUD$1203', 'Price:€880 / US$908 / ¥127470 / AUD$1350', 'Price:€240 / US$247 / ¥34760 / AUD$370', 'Price:€480 / US$495 / ¥69530 / AUD$740', 'Price:€100 / US$103 / ¥14480 / AUD$150', 'Price:€460 / US$474 / ¥66630 / AUD$700', 'Price:€190 / US$196 / ¥27520 / AUD$290', 'Price:€360 / US$371 / ¥52140 / AUD$550', 'Price:€160 / US$165 / ¥23170 / AUD$246', 'Price:€190 / US$196 / ¥27520 / AUD$293', 'Price:€230 / US$237 / ¥33310 / AUD$354', 'Price:€230 / US$237 / ¥33310 / AUD$354', 'Price:€240 / US$247 / ¥34760 / AUD$370', 'Price:€170 / US$175 / ¥24620 / AUD$260', 'Price:€220 / US$227 / ¥31860 / AUD$330', 'Price:€220 / US$227 / ¥31860 / AUD$330']
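
The range-based pagination above can also be wrapped in a small helper, so that the long query string is written only once. This is a sketch only: page_url is a hypothetical name, and the parameter names and values are copied from the URLs in the question.

```python
from urllib.parse import urlencode

BASE = "https://www.fabreminerals.com/search_results.php"


def page_url(first):
    """Build the search URL for the results page starting at offset `first`."""
    params = {
        "LANG": "EN",
        "SearchTerms": "",
        "submit": "Buscar",
        "MineralSpeciment": "",
        "Country": "",
        "Locality": "",
        "PriceRange": "",
        "checkbox": "enventa",
        "First": first,
    }
    return f"{BASE}?{urlencode(params)}"


# Offsets 0, 25, 50, 75 match range(0, 100, 25) in the example above.
urls = [page_url(first) for first in range(0, 100, 25)]
for url in urls:
    print(url)
```

Each entry of urls can then be passed to requests.get in place of the f-string.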

  • P粉677684876, 2024-02-22 00:54:41

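    You just need to make the CSS selector more specific, so that it only matches links that sit directly inside the font element (rather than several levels down):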

    soup.select("table tr td font>a")

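    Adding a further condition, namely that the link points to an individual item rather than to the next/previous page links at the bottom of the page, will also help: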

    soup.select("table tr td font>a[href*='CODE']")
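
The difference between the three selectors can be seen on a small, invented HTML fragment. The markup below is an assumption made for illustration and is not copied from the site; it uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this sketch; the real page may differ.
HTML = """
<table>
  <tr><td><font><a href="item.php?CODE=TH27AL9">TH27AL9: 'Pearceite' with Calcite</a></font></td></tr>
  <tr><td><font><b><a href="other.php?id=1">Unrelated nested link</a></b></font></td></tr>
  <tr><td><font><a href="search_results.php?First=25">Next page</a></font></td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")


def texts(selector):
    """Return the stripped text of every element matched by the selector."""
    return [a.get_text(strip=True) for a in soup.select(selector)]


loose = texts("table tr td font a")                 # any <a> below a <font>
direct = texts("table tr td font>a")                # only <a> that are direct children
items = texts("table tr td font>a[href*='CODE']")   # direct children whose href contains CODE

print(loose)
print(direct)
print(items)
```

On this fragment the loose selector matches all three links, the child selector drops the nested one, and the href condition leaves only the item link.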
