
Is there a way to extract a regex pattern from links in a Pandas DataFrame?

I'm trying to extract a regex pattern from the links in a table generated with Pandas.

The code that generates the DataFrame is:

    import pandas as pd
    import re

    url = 'https://www.espncricinfo.com/records/year/team-match-results/2005-2005/twenty20-internationals-3'
    base_url = 'https://www.espncricinfo.com'

    table = pd.read_html(url, extract_links = "body")[0]
    table = table.apply(lambda col: [link[0] if link[1] is None else f'{base_url}{link[1]}' for link in col])
    table

I want to extract the match ID from the links in the table. For each match, the match ID is the run of digits that follows "t20i-" and ends before the next slash. For example, for the match below, the match ID is 211048. Here is the code for a single match:

    scorecard_url = 'https://www.espncricinfo.com/series/australia-tour-of-new-zealand-2004-05-61407/new-zealand-vs-australia-only-t20i-211048/full-scorecard'
    match_id = re.findall('t20i-(\d*)/', scorecard_url)
    match_id[0]

I want to do this for the whole table by adding a derived column, match_id, computed from the Scorecard column. However, I have not been able to get this to work.

I first tried this simple command:

    table['match_id'] = re.findall('t20i-(\d*)/', table['Scorecard'])
    table

This raises 'TypeError: expected string or bytes-like object', which makes me think the link is not being stored as a string and that this may be the problem.

Then I tried:

    table['match_id'] = re.findall('t20i-(\d*)/', str(table['Scorecard']))
    table

This raises 'ValueError: Length of values (0) does not match length of index (3)', and I'm not sure what causes it.

I also tried a lambda function, without success. If that approach can work, I'd be happy to use it.
Asked by P粉770375450, 430 days ago

Replies (1)

  • P粉310931198 (2023-08-17 00:08:30)

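    You are close. re.findall expects a single string, so it has to be applied to each cell rather than to the whole column. Because read_html is called with extract_links="body", each Scorecard cell is a (text, link) tuple, and the link is element 1. The code below adds a new column with the match ID.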

    import pandas as pd
    import re

    url = 'https://www.espncricinfo.com/records/year/team-match-results/2005-2005/twenty20-internationals-3'

    def match(cell):
        # Each cell is a (text, link) tuple; search the link (element 1).
        # The raw string keeps \d from being treated as a string escape.
        match_id = re.findall(r't20i-(\d+)/', cell[1])
        return match_id[0]

    table = pd.read_html(url, extract_links="body")[0]
    # Run the function on every cell of the Scorecard column.
    table['match'] = table['Scorecard'].apply(match)
    print(table)

    Output:

                     Team 1  ...   match
    0   (New Zealand, None)  ...  211048
    1       (England, None)  ...  211028
    2  (South Africa, None)  ...  222678

    [3 rows x 8 columns]
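
The same idea can be sketched without any network access, which makes it easier to see what happens when a cell has no link or the pattern is missing. This is only an illustration under stated assumptions: the small table below is a hand-built stand-in for the scraped one, the second row and the names extract_match_id, MATCH_ID and match_id_vec are hypothetical, and the second variant swaps in the pandas .str accessor in place of apply.

```python
import re
import pandas as pd

# Hypothetical stand-in for the scraped table: every Scorecard cell is a
# (text, href) tuple, the shape produced by read_html(..., extract_links="body").
table = pd.DataFrame({
    "Scorecard": [
        ("T20I # 1",
         "/series/australia-tour-of-new-zealand-2004-05-61407/"
         "new-zealand-vs-australia-only-t20i-211048/full-scorecard"),
        ("T20I # 2", None),  # made-up row: a cell that carries no link
    ]
})

# Digits that follow "t20i-" and end before the next slash.
MATCH_ID = re.compile(r"t20i-(\d+)/")

def extract_match_id(cell):
    """Return the match ID from a (text, href) cell, or None if there is none."""
    href = cell[1]
    if href is None:
        return None
    found = MATCH_ID.search(href)
    return found.group(1) if found else None

# Variant 1: apply a plain function to each cell.
table["match_id"] = table["Scorecard"].apply(extract_match_id)

# Variant 2: .str[1] takes the href out of each tuple, then str.extract
# applies the regex to the whole column; cells without a match become missing.
table["match_id_vec"] = (
    table["Scorecard"].str[1].str.extract(r"t20i-(\d+)/", expand=False)
)

print(table[["match_id", "match_id_vec"]])
```

If the cells have already been converted to plain URL strings, as in the question's table, the .str[1] step is not needed and str.extract can be called on the column directly. The second attempt in the question most likely failed because str() on a column returns its printed, truncated representation, so the pattern matched nothing and findall returned an empty list.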
