Extracting image addresses with regular expressions in PHP
Recently, while developing a program, I needed to extract the image addresses from a block of content. I will briefly share the method here; anyone who needs it can use it as a reference.
I am obsessed with regular expressions and keep trying new tricks. First, thanks go to TNA for its partial-content RSS feed, and then to SH for his insistence on learning. Without TNA I would never have looked at regular expressions or known that such powerful expressions existed. If SH had not kept saying he did not understand, I would not have had the drive to work things out and improve them. A regular expression that achieves a given goal is rarely unique, and there is very little that cannot be done; it is usually just that you have not thought of it yet. Put another way, regular expressions are a game of setting rules, and I love that kind of thing. Nothing excites me more than setting rules to filter things.

Here are some tips on using regular expressions to extract image addresses in a PHP environment.

The standard HTML for an image is nothing more than the following. The code is as follows:

囧1 and 囧2 are optional. To pass XHTML validation, 囧4, 囧5, and 囧6 are essential. 囧3 is the core content and is of course indispensable.

As for the regular expression itself, the shortest pattern I wrote is this. The code is as follows:

(?<=img.+?src=").*?(?=")

However, this does not work in PHP. It produces:

Warning: preg_match_all() [function.preg-match-all]: Compilation failed: lookbehind assertion is not fixed length at offset *** in ***

I struggled with this for a long time without success. After many attempts I finally found that the problem is the lookbehind assertion (?<=img.+?src="). In PHP, a lookbehind assertion cannot contain unbounded quantifiers such as "*" and "+", so an error is reported. The fix is to change ".+?" to something of fixed length. However, it is basically impossible to fix the length of what sits between "img" and "src=". Usually img and src are separated by nothing more than a single space, but in some cases alt, title, and other attributes appear after img and before src.
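The compile failure and the fixed-length workaround described above can be reproduced with a short script. This is only a minimal sketch, not the author's own code: the sample markup in $html and the variable names are made up for illustration, and the exact wording of the warning depends on the PCRE version bundled with PHP.

```php
<?php
// Hypothetical sample markup (an assumption for this sketch): one tag with a
// single space before src, one tag with an alt attribute between img and src.
$html = '<img src="a.png" alt="first"> <img alt="second" src="b.png">';

// Unbounded quantifier inside the lookbehind: PCRE refuses to compile it.
// The @ operator hides the warning so the return value can be inspected.
$unbounded = '/(?<=img.+?src=").*?(?=")/';
$failed = @preg_match_all($unbounded, $html, $m1);
var_dump($failed); // bool(false)

// Fixed-length lookbehind: "img", exactly one whitespace character, then src=".
$fixed = '/(?<=img\ssrc=").*?(?=")/';
$count = preg_match_all($fixed, $html, $m2);

// Only a.png is found; b.png is missed because alt sits between img and src.
print_r($m2[0]);
```

The second pattern compiles, but it only catches tags in which src follows img after exactly one whitespace character, which is the limitation the text goes on to discuss.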
So the following may work. The code is as follows:

(?<=img.src=").*?(?=")

or, the code is as follows:

(?<=img\ssrc=").*?(?=")

Either may work, but there is no guarantee that it will work 100% of the time.

You may ask whether a simpler pattern would do. The code is as follows:

(?<=src=").*?(?=")

Normally it would, but anyone who has looked at page source knows that image addresses are not the only thing introduced by src: JavaScript files use src as well. There are also too many other unpredictable cases hiding in real pages, so this seemingly short and perfect pattern will not work.

You may also ask: if short and clever is not enough, why not list the image file extensions? The code is as follows:

(?<=src=").*?\.(jpg|jpeg|gif|png|bmp|JPG|JPEG|GIF|PNG|BMP)

This approach is certainly straightforward, but have you ever seen images without extensions? wwe.com has many examples:

RAW: http://us.wwe.com/content/media/images/Headers/15559182
SmackDown: http://us.wwe.com/content/media/images/Headers/15854138
NXT: http://us.wwe.com/content/media/images/Headers/15929136
Superstars: http://us.wwe.com/content/media/images/Headers/15815850

The URLs above are all images, but none of them has a traditional extension, so the straightforward approach still cannot catch them. What to do? It can still be done, like this. The code is as follows: