
How to extract a domain name using a Python regular expression

<script type="application/ld+json">{
    "@context": "http://schema.org",
    "@type": "SaleEvent",
    "name": "10% Off First Orders",
    "url": "https://www.myvouchercodes.co.uk/coggles",
    "image": "https://mvp.tribesgds.com/dyn/oh/Ow/ohOwXIWglMg/_/mQR5xLX5go8/m0Ys/coggles-logo.png",
    "startDate": "2017-02-17",
    "endDate": "2017-12-31",
    "location": {
        "@type": "Place",
        "name": "Coggles",
        "url": "coggles.co.uk",
        "address": "Coggles"
    },
    "description": "Get the top branded fashion items from Coggles at discounted prices. Apply this code and enjoy savings on your purchase.",
    "eventStatus": "EventScheduled"
}</script>

How can I use a Python regular expression to extract the domain name coggles.co.uk from this script? Any help from experienced users would be appreciated.

淡淡烟草味, 2699 days ago

Replies (2)

  • ringa_lee, 2017-06-22 11:53:53

    When writing a regular expression, make sure that the anchor you match on is unique. Here the key "url" is not unique: it appears both at the top level and inside "location". In this situation @prolifes' method is a very good choice.

    If you must use a regular expression, you can use zero-width assertions (lookarounds). The literal translation of this term is rather blunt, which leads to many misunderstandings. It actually means matching at a specified position, and the position itself has width 0, so the assertion consumes no characters.
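    To make "width 0" concrete, here is a minimal sketch. The sample text and variable names are made up for illustration and do not come from the question. Both assertions constrain where the match may occur, yet neither contributes any characters to the matched text.

```python
import re

# Hypothetical sample text, not taken from the question.
text = "price: 42 USD"

# Lookbehind: the match must be preceded by "price: ", but that prefix
# is not part of the matched text.
after_prefix = re.search(r"(?<=price: )\d+", text).group(0)

# Lookahead: the match must be followed by " USD", which is likewise
# left out of the matched text.
before_suffix = re.search(r"\d+(?= USD)", text).group(0)

print(after_prefix, before_suffix)
```

    In both cases only the digits are returned, because the surrounding context is checked but never consumed.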

    Here the "url" we need sits inside "location", so "location" can serve as the positional anchor.

    The code is as follows:

    re.search('(?<=location).+?"url": "([^"]+)"', string, re.DOTALL).group(1)

    A brief explanation:
    (?<=location) is a lookbehind: it requires that "location" appear immediately before the current position. To require that it appear immediately after instead, use a lookahead: (?=location)
    re.DOTALL is necessary because the text spans multiple lines. It extends what . matches to include newline characters.
    "([^"]+)" is my habit: [^"] matches any character other than ", so the group captures everything between a pair of double quotes.
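    The effect of re.DOTALL can be checked with a short sketch. It assumes a trimmed copy of the JSON-LD from the question, and the variable names are illustrative only.

```python
import re

# Trimmed copy of the JSON-LD from the question, kept to the relevant keys.
html = '''<script type="application/ld+json">{
    "url": "https://www.myvouchercodes.co.uk/coggles",
    "location": {
        "@type": "Place",
        "name": "Coggles",
        "url": "coggles.co.uk"
    }
}</script>'''

pattern = r'(?<=location).+?"url": "([^"]+)"'

# With re.DOTALL, "." can cross line breaks and reach the nested "url".
with_dotall = re.search(pattern, html, re.DOTALL)

# Without it, "." stops at the end of the "location" line, so nothing matches.
without_dotall = re.search(pattern, html)

print(with_dotall.group(1))
print(without_dotall)
```

    The top-level "url" is skipped because it appears before "location", which is why the lookbehind makes a usable anchor here.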

  • 世界只因有你, 2017-06-22 11:53:53

    This is pretty standard JSON. If you want a more brute-force approach, parse it directly as JSON:

    import json
    import re
    
    # Page fragment from the question (named html rather than str,
    # so the built-in str type is not shadowed).
    html = '''
    <script type="application/ld+json">{
        "@context": "http://schema.org",
        "@type": "SaleEvent",
        "name": "10% Off First Orders",
        "url": "https://www.myvouchercodes.co.uk/coggles",
        "image": "https://mvp.tribesgds.com/dyn/oh/Ow/ohOwXIWglMg/_/mQR5xLX5go8/m0Ys/coggles-logo.png",
        "startDate": "2017-02-17",
        "endDate": "2017-12-31",
        "location": {
            "@type": "Place",
            "name": "Coggles",
            "url": "coggles.co.uk",
            "address": "Coggles"
        },
        "description": "Get the top branded fashion items from Coggles at discounted prices. Apply this code and enjoy savings on your purchase.",
        "eventStatus": "EventScheduled"
    }</script>
    '''
    
    # Capture everything from the first "{" to the last "}" and parse it as JSON.
    d = json.loads(re.search(r'(\{[\s\S]*\})', html).group(1))
    print(d['location']['url'])
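    A slightly more defensive variant might look like the sketch below. The function name extract_location_url is hypothetical. It assumes the page contains a script tag written exactly as in the question and that the JSON-LD is a single object rather than a list. It anchors on the script tag instead of the outermost braces and returns None when nothing usable is found.

```python
import json
import re


def extract_location_url(html):
    """Return location.url from the first JSON-LD script block, or None."""
    match = re.search(
        r'<script type="application/ld\+json">(.*?)</script>',
        html,
        re.DOTALL,  # the JSON body spans several lines
    )
    if match is None:
        return None
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    # Assumes the top-level JSON value is an object (dict).
    return data.get("location", {}).get("url")


# Trimmed copy of the JSON-LD from the question.
sample = '''<script type="application/ld+json">{
    "url": "https://www.myvouchercodes.co.uk/coggles",
    "location": {"@type": "Place", "name": "Coggles", "url": "coggles.co.uk"}
}</script>'''

print(extract_location_url(sample))
```

    Parsing the JSON first avoids relying on the order of keys, so the nested "url" is found even if "location" were moved elsewhere in the object.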
