Use lexical analysis to extract domain names and IPs

王林

Dec 25, 2019 pm 01:08 PM

Background

While analyzing logs, I found that some request parameters contain other URLs.

The goal is to extract the URL in the request parameters (xss.ha.ckers.org in the log entry I was looking at) and compare it with the threat intelligence database. If it hits the blacklist, block it outright. If it appears in neither the blacklist nor the company's whitelist, mark it first and analyze it in depth later.
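The triage rule above can be sketched as follows. This is a minimal illustration, not the author's code: the function name `triage`, the set-based lists, and the returned labels are all assumptions.

```python
def triage(host, blacklist, whitelist):
    """Decide what to do with a host extracted from a request parameter.

    blacklist and whitelist are assumed to be plain sets of host names;
    a real threat intelligence lookup would likely be a service call.
    """
    if host in blacklist:
        return "block"    # known bad: block outright
    if host in whitelist:
        return "allow"    # known good: nothing to do
    return "review"       # unknown: mark now, analyze later


# Hypothetical lists, for illustration only.
blacklist = {"xss.ha.ckers.org"}
whitelist = {"www.baidu.com"}

print(triage("xss.ha.ckers.org", blacklist, whitelist))
print(triage("unknown.example.com", blacklist, whitelist))
```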

Extract URL

There are many articles online about URL extraction, and most of them use regular expressions. That approach is simple but not very accurate. I offer another method here: use lexical analysis to extract domain names and IPs. The idea is borrowed from this article:

https://blog.csdn.net/breaksoftware/article/details/7009209

Take a look if you are interested. Learning from an expert really does raise your game.

The original is written in C; I wrote a similar version in Python for reference.
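To see why a plain regular expression can be inaccurate, consider the following minimal sketch. The pattern and the function name `naive_ips` are assumptions made for illustration and are not taken from any of the articles mentioned above. The pattern accepts any four dot-separated digit groups, including values that no IP address can contain.

```python
import re

# Hypothetical naive pattern: four groups of 1-3 digits separated by dots.
NAIVE_IP = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


def naive_ips(text):
    """Return every substring that looks like a dotted quad."""
    return NAIVE_IP.findall(text)


# 999.300.1.1 is not a valid address, yet the pattern accepts it.
print(naive_ips("ver=999.300.1.1&host=192.168.1.1"))
```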

Common URL classification

A look at the common forms shows that the IP form of a URL is the simplest: four numbers, each in the range 0 to 255, separated by dots. The domain form is more complex, but all domain URLs have one thing in common: they end in a top-level domain name such as .com.
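The IP rule described above (four dot-separated numbers, each at most 255) can be checked with a few lines of Python. This is a sketch under that stated rule only; the name `is_ipv4` is hypothetical, and ports or surrounding text are not handled here.

```python
def is_ipv4(text):
    """Return True if text is exactly four dot-separated numbers in 0..255."""
    parts = text.split(".")
    if len(parts) != 4:
        return False
    for part in parts:
        # isascii() rejects non-ASCII digits that isdigit() alone would accept.
        if not (part.isascii() and part.isdigit()):
            return False
        if int(part) > 255:
            return False
    return True


print(is_ipv4("192.168.1.1"))
print(is_ipv4("256.1.1.1"))
```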

Define legal characters:
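A minimal sketch of such a definition follows. It assumes that host names consist only of ASCII letters, digits, hyphens and dots; the names `LEGAL_CHARS` and `leading_legal_run` are hypothetical and may differ from what the repository uses.

```python
import string

# Assumption: only ASCII letters, digits, '-' and '.' may appear in a host.
LEGAL_CHARS = frozenset(string.ascii_letters + string.digits + "-.")


def leading_legal_run(text):
    """Return the length of the longest prefix made only of legal characters."""
    i = 0
    while i < len(text) and text[i] in LEGAL_CHARS:
        i += 1
    return i


# The scan stops at the first character outside the set ('/' here).
print(leading_legal_run("www.baidu.com/path"))
```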

Top-level domain name list:
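As an illustration, the list can be kept in a set and used to test the last label of a host. The entries below are a deliberately short, hypothetical sample, and `ends_with_tld` is an assumed helper name; a real list would be much longer.

```python
# Hypothetical short sample; not the list used in the repository.
TOP_LEVEL_DOMAINS = frozenset({"com", "net", "org", "cn", "edu", "gov"})


def ends_with_tld(host):
    """Return True if host has at least two labels and ends in a known TLD."""
    host = host.rstrip(".")           # tolerate a trailing dot
    if "." not in host:
        return False                  # a bare "com" is not a domain name
    last_label = host.rsplit(".", 1)[-1].lower()
    return last_label in TOP_LEVEL_DOMAINS


print(ends_with_tld("www.baidu.com"))
print(ends_with_tld("192.168.1.1"))
```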

Domain name form extraction: such as www.baidu.com.
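The scanning idea for the domain form can be sketched as below: read labels separated by dots, then accept the run only if the last label is a top-level domain. This is an assumption-laden illustration rather than the repository code; `scan_domain`, `extract_domains`, and the short TLD set are hypothetical.

```python
import string

LABEL_CHARS = frozenset(string.ascii_letters + string.digits + "-")
TOP_LEVEL_DOMAINS = frozenset({"com", "net", "org", "cn"})  # hypothetical sample


def scan_domain(text, start=0):
    """Try to read a domain at text[start:]; return (domain, end) or (None, start)."""
    i = start
    labels = []
    while True:
        j = i
        while j < len(text) and text[j] in LABEL_CHARS:
            j += 1
        if j == i:
            break                      # no label at this position
        labels.append(text[i:j])
        i = j
        # Consume a dot only when another label follows it.
        if i + 1 < len(text) and text[i] == "." and text[i + 1] in LABEL_CHARS:
            i += 1
            continue
        break
    if len(labels) >= 2 and labels[-1].lower() in TOP_LEVEL_DOMAINS:
        return ".".join(labels), i
    return None, start


def extract_domains(text):
    """Walk the text and collect every domain that scan_domain recognizes."""
    found = []
    i = 0
    while i < len(text):
        domain, end = scan_domain(text, i)
        if domain:
            found.append(domain)
            i = end
        else:
            i += 1
    return found


print(extract_domains("q=http://xss.ha.ckers.org/xss.js"))
```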

IP format extraction: such as 192.168.1.1.

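    # Fragment of the tokenizer method. z is the text being scanned and i the
    # current index; ip_v1..ip_v4 record whether each octet had any digits.
    # reti is the length returned to the caller together with tokenType.

    # First octet.
    while (i < len(z) and z[i].isdigit()):
        i = i + 1
        ip_v1 = True
        reti = i
    if i < len(z) and z[i] == '.':
        i = i + 1
        reti = i
    else:
        # No dot after the first number: fall back to TK_OTHER, length 1.
        tokenType = TK_OTHER
        reti = 1

    # Second octet.
    while (i < len(z) and z[i].isdigit()):
        i = i + 1
        ip_v2 = True
    if i < len(z) and z[i] == '.':
        i = i + 1
    else:
        if tokenType != TK_DOMAIN:
            tokenType = TK_OTHER
            reti = 1

    # Third octet.
    while (i < len(z) and z[i].isdigit()):
        i = i + 1
        ip_v3 = True
    if i < len(z) and z[i] == '.':
        i = i + 1
    else:
        if tokenType != TK_DOMAIN:
            tokenType = TK_OTHER
            reti = 1

    # Fourth octet.
    while (i < len(z) and z[i].isdigit()):
        i = i + 1
        ip_v4 = True

    # Optional ':' followed by the port digits.
    if i < len(z) and z[i] == ':':
        i = i + 1
    while (i < len(z) and z[i].isdigit()):
        i = i + 1

    # Record the match only when all four octets were seen.
    if ip_v1 and ip_v2 and ip_v3 and ip_v4:
        self.urls.append(z[0:i])
        return reti, tokenType
    else:
        if tokenType != TK_DOMAIN:
            tokenType = TK_OTHER
            reti = 1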
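The four near-identical octet steps above can also be written as a loop. The following is a compact sketch of the same scanning idea, not a drop-in replacement for the method: `scan_ipv4` is a hypothetical standalone function, and unlike the fragment above it also rejects octets greater than 255.

```python
def _is_ascii_digit(ch):
    # Plain comparison avoids the non-ASCII digits that str.isdigit() accepts.
    return "0" <= ch <= "9"


def scan_ipv4(text, start=0):
    """Try to read 'a.b.c.d' or 'a.b.c.d:port' at text[start:].

    Returns (match, end_index) on success or (None, start) on failure.
    """
    i = start
    for position in range(4):
        j = i
        while j < len(text) and _is_ascii_digit(text[j]):
            j += 1
        if j == i or int(text[i:j]) > 255:
            return None, start         # missing octet or value out of range
        i = j
        if position < 3:
            if i < len(text) and text[i] == ".":
                i += 1
            else:
                return None, start     # the first three octets need a dot
    # Optional port: a colon followed by at least one digit.
    if i < len(text) and text[i] == ":":
        j = i + 1
        while j < len(text) and _is_ascii_digit(text[j]):
            j += 1
        if j > i + 1:
            i = j
    return text[start:i], i


print(scan_ipv4("192.168.1.1"))
print(scan_ipv4("10.0.0.1:8080/x"))
```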

Mixed form extraction: such as 1234.com.

Scanning the leading 1234 matches the characteristics of the IP form, but the IP-handling code then raises an exception. The IP-handling segment therefore needs an additional check on whether the suffix is a top-level domain name.
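One way to handle the mixed form is to test the suffix against the top-level domain list before treating the token as an IP. The sketch below shows only that ordering; `classify_host`, the labels it returns, and the short TLD set are hypothetical and are not the code from the repository.

```python
TOP_LEVEL_DOMAINS = frozenset({"com", "net", "org", "cn"})  # hypothetical sample


def classify_host(token):
    """Classify a token as 'domain', 'ip' or 'other'."""
    labels = token.rstrip(".").split(".")
    # Check the suffix first, so that 1234.com is not mistaken for an IP.
    if len(labels) >= 2 and all(labels) and labels[-1].lower() in TOP_LEVEL_DOMAINS:
        return "domain"
    # Otherwise require exactly four numeric labels, each at most 255.
    if len(labels) == 4 and all(
        part.isascii() and part.isdigit() and int(part) <= 255 for part in labels
    ):
        return "ip"
    return "other"


print(classify_host("1234.com"))
print(classify_host("192.168.1.1"))
```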

Result test

[Figure: test data]

[Figure: running result]

This is only a preliminary version; please let me know if you find any bugs.

Conclusion

In the past I kept my head down writing code and neglected to reflect and summarize afterwards. I am now trying to change that by refining and summarizing as I work. When I come across something worthwhile, I try to turn it into a tool and open-source it to share with everyone.

Code Portal:

https://github.com/skskevin/UrlDetect/blob/master/tool/domainExtract/domainExtract.py