How to extract data from wiki links?
I want to extract data from the wiki link returned by the mwparserfromhell library. For example, I want to parse the following string:
[[File:Warszawa, ul. Freta 16 20170516 002.jpg|thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].]]
If I split the string on the | character, it doesn't work, because the image description also contains a link that uses |: [[Maria Skłodowska-Curie Museum|Birthplace]].
I used a regular expression to first replace all the links in the string and then split it. It works (in this case), but doesn't feel clean (see code below). Is there a better way to extract information from a string like this?
    import re

    wiki_code = "[[File:Warszawa, ul. Freta 16 20170516 002.jpg|thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].]]"

    # Remove [[File: at the beginning of the string
    prefix = "[[File:"
    if wiki_code.startswith(prefix):
        wiki_code = wiki_code[len(prefix):]

    # Remove ]] at the end of the string
    suffix = "]]"
    if wiki_code.endswith(suffix):
        wiki_code = wiki_code[:-len(suffix)]

    # Replace links with their label (the part after the last |)
    link_pattern = re.compile(r'\[\[.*?\]\]')
    matches = link_pattern.findall(wiki_code)
    for match in matches:
        content = match[2:-2]
        arr = content.split("|")
        label = arr[-1]
        wiki_code = wiki_code.replace(match, label)

    print(wiki_code.split("|"))
The links returned by .filter_wikilinks() are instances of the Wikilink class, which has title and text properties.
title returns the title of the link: File:Warszawa, ul. Freta 16 20170516 002.jpg
text returns the rest of the link: thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].
Both are returned as Wikicode objects.
Since the actual text is always the last fragment, you first need to find the other fragments using the following regular expression:
([^\[\]|]*\|)+
( ) : group
[^\[\]|]* : 0 or more characters that are not square brackets or vertical bars
\| : literal pipe
+ : 1 or more repetitions of the group
Everything from the end index of the match to the end of the string is the last fragment.
    >>> import mwparserfromhell
    >>> import re
    >>> wikitext = mwparserfromhell.parse('[[File:Warszawa, ul. Freta 16 20170516 002.jpg|thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].]]')
    >>> image_link = wikitext.filter_wikilinks()[0]
    >>> image_link
    '[[File:Warszawa, ul. Freta 16 20170516 002.jpg|thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].]]'
    >>> image_link.title
    'File:Warszawa, ul. Freta 16 20170516 002.jpg'
    >>> text = str(image_link.text)
    >>> text
    'thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].'
    >>> other_fragments = re.match(r'([^\[\]|]*\|)+', text)
    >>> other_fragments
    <re.Match object; span=(0, 19), match='thumb|upright=1.18|'>
    >>> other_fragments.span(0)[1]
    19
    >>> text[19:]
    '[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].'
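The session above could be wrapped in a small helper so that the options and the caption come back separately. This is only a sketch: split_link_text and OPTIONS_RE are hypothetical names, not part of mwparserfromhell, and the helper assumes the caption is the last fragment and that no option fragment contains square brackets. It works on a plain string, so in practice you would pass it str(image_link.text).

```python
import re

# Leading run of pipe-terminated fragments that contain no brackets or pipes
# (the same pattern as in the session above).
OPTIONS_RE = re.compile(r'([^\[\]|]*\|)+')


def split_link_text(text):
    """Split a file link's text into (options, caption).

    Hypothetical helper; assumes the caption is the last fragment.
    """
    match = OPTIONS_RE.match(text)
    if match is None:
        # No pipe-terminated fragment: the whole text is the caption.
        return [], text
    # Drop the trailing pipe before splitting so no empty option is produced.
    options = match.group(0)[:-1].split('|')
    caption = text[match.end():]
    return options, caption


text = ('thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] '
        'of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].')
options, caption = split_link_text(text)
print(options)
print(caption)
```

The caption still contains wikilinks, so it can be passed back to mwparserfromhell.parse() if the nested links need to be processed further.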