
Notes on processing scraped data with preg_match_all (encoding conversion and regular expression matching)

WBOY
Original
2016-07-13 10:39:53

1. Use curl to fetch the remote page

For details, see my previous note: http://www.jb51.net/article/46432.htm
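
For readers without the earlier note at hand, a minimal sketch of the fetch step might look like the following. The helper name fetch_page and the timeout values are assumptions made for illustration, not taken from that note.

```php
<?php
// Hypothetical helper: the name and the option values are illustrative assumptions.
function fetch_page(string $url, int $timeoutSeconds = 10): string
{
    $ch = curl_init($url);
    if ($ch === false) {
        throw new RuntimeException('Could not initialise curl');
    }

    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds, // give up connecting after this many seconds
        CURLOPT_TIMEOUT        => $timeoutSeconds, // give up on the whole transfer after this many seconds
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('curl error: ' . curl_error($ch));
    }

    return $body;
}
```

A call such as $contents = fetch_page($url); would then supply the $contents string used in the later steps.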

2. Encoding conversion
First find out which encoding the target site uses by viewing its page source, then transcode the fetched text with the mb_convert_encoding function.

Usage:


// The source string is $str

// When the original encoding is known to be GBK, convert it to UTF-8
$str = mb_convert_encoding($str, "UTF-8", "GBK");

// When the original encoding is unknown, let "auto" detect it before converting to UTF-8
// (the candidate list behind "auto" depends on the mbstring.language setting and may not include GBK)
$str = mb_convert_encoding($str, "UTF-8", "auto");
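
Because automatic detection can guess wrong, one possible alternative is to check whether the text is already valid UTF-8 and only otherwise convert from an assumed source encoding. The sketch below does that; the helper name to_utf8 and the GBK default are assumptions for illustration.

```php
<?php
// Hypothetical helper: returns $str as UTF-8, assuming that anything which is
// not already valid UTF-8 is encoded in $fallback (GBK by default).
function to_utf8(string $str, string $fallback = 'GBK'): string
{
    // Valid UTF-8 (including plain ASCII) is returned unchanged.
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }

    // Otherwise convert from the assumed source encoding.
    return mb_convert_encoding($str, 'UTF-8', $fallback);
}

// "\xD6\xD0\xCE\xC4" is the GBK byte sequence for the two characters 中文.
$utf8 = to_utf8("\xD6\xD0\xCE\xC4");
```

The fallback is only a guess, so it should be set to whatever encoding the page source actually declares.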

3. Strip line breaks, spaces and tab characters from the fetched source first, so that unpredictable whitespace does not get in the way of the matching


// Method 1: replace with str_replace
$contents = str_replace("\r\n", '', $contents); // Remove Windows line breaks
$contents = str_replace("\n", '', $contents); // Remove line breaks
$contents = str_replace("\t", '', $contents); // Remove tab characters
$contents = str_replace(" ", '', $contents); // Remove space characters

// Method 2: replace with a regular expression
$contents = preg_replace("/[\r\n\t ]+/", '', $contents);
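
Removing every space also removes the spaces inside the visible text and between a tag name and its attributes. The sketch below compares the approach above with a gentler variant that collapses each run of whitespace into a single space; both helper names are assumptions for illustration.

```php
<?php
// Hypothetical helper: removes every line break, tab and space, as in Method 2.
function strip_layout_whitespace(string $html): string
{
    return preg_replace('/[\r\n\t ]+/', '', $html) ?? $html;
}

// Hypothetical helper: collapses each run of whitespace into one space instead,
// which keeps the words in the text apart.
function collapse_whitespace(string $html): string
{
    return trim(preg_replace('/\s+/', ' ', $html) ?? $html);
}

$sample    = "<p>\r\n\tHello   world </p>\n";
$stripped  = strip_layout_whitespace($sample); // "<p>Helloworld</p>"
$collapsed = collapse_whitespace($sample);     // "<p> Hello world </p>"
```

Which variant fits depends on whether the text to extract contains meaningful spaces.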

4. Use a regular expression to locate the fragment you need; preg_match_all performs the matching


Function explanation:
int preg_match_all ( string $pattern, string $subject, array &$matches [, int $flags ] )
pattern is the regular expression
subject is the text to search
matches is the array that receives the results
flags controls how the results are arranged:
PREG_PATTERN_ORDER; // (default) Two-dimensional array: $arr1[0] holds the full matches, boundaries included; $arr1[1] holds the text captured by the first parenthesized group, boundaries excluded
PREG_SET_ORDER; // Two-dimensional array: $arr2[0][0] is the first full match, boundaries included; $arr2[0][1] is the captured text of the first match, boundaries excluded; later matches follow the same layout
PREG_OFFSET_CAPTURE; // Three-dimensional array: $arr3[0][0][0] is the first full match, boundaries included, and $arr3[0][0][1] is the byte offset of that match within subject; likewise $arr3[1][0][0] is the captured text of the first match, boundaries excluded, and $arr3[1][0][1] is the byte offset of that captured text
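
The three layouts are easier to compare side by side. The sketch below runs one pattern over a small invented string with each flag; the sample markup and variable names are assumptions for illustration.

```php
<?php
// Invented sample input with two matches.
$subject = '<b>one</b><b>two</b>';
$pattern = '/<b>(.*?)<\/b>/';

// Grouped by pattern part: [0] = all full matches, [1] = all first-group captures.
preg_match_all($pattern, $subject, $byPattern, PREG_PATTERN_ORDER);
// $byPattern[0] is ['<b>one</b>', '<b>two</b>'], $byPattern[1] is ['one', 'two']

// Grouped by match: each element holds one match's full text and captures.
preg_match_all($pattern, $subject, $bySet, PREG_SET_ORDER);
// $bySet[0] is ['<b>one</b>', 'one'], $bySet[1] is ['<b>two</b>', 'two']

// Pattern order, but every string becomes a [text, byte offset] pair.
preg_match_all($pattern, $subject, $withOffsets, PREG_OFFSET_CAPTURE);
// $withOffsets[0][0] is ['<b>one</b>', 0], $withOffsets[1][0] is ['one', 3]
```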

// Application
// <tag> and </tag> are placeholders for whatever markup bounds the text you want
preg_match_all('/<tag>(.*?)<\/tag>/', $contents, $out, PREG_SET_ORDER);
// $out receives every match
// $out[0][0] is the whole first match, including the boundary markup
// $out[0][1] is only the text matched by (.*?) inside the parentheses

// By analogy, the captured text of the nth match is
$out[n-1][1]

// If the regular expression contains several parenthesized groups, the mth group of the nth match is
$out[n-1][m]
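
As a concrete case of the $out[n-1][m] indexing, the sketch below uses a pattern with two groups on an invented list of links. The sample markup is an assumption and has not been passed through the whitespace removal of step 3, which would also delete the space in "<a href".

```php
<?php
// Invented sample input: two links, each with an address and a label.
$contents = '<li><a href="/a.html">First</a></li><li><a href="/b.html">Second</a></li>';

// Group 1 captures the href value, group 2 captures the link text.
$count = preg_match_all('/<a href="(.*?)">(.*?)<\/a>/', $contents, $out, PREG_SET_ORDER);

// The 2nd match (n = 2) sits at index n - 1 = 1.
$n = 2;
$secondHref = $out[$n - 1][1]; // group m = 1
$secondText = $out[$n - 1][2]; // group m = 2

// Walking every match builds an address => label map.
$links = [];
foreach ($out as $match) {
    $links[$match[1]] = $match[2];
}
```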

5. To remove the HTML tags from the extracted text, use PHP's built-in strip_tags function


// Example
$result = strip_tags($out[0][1]);
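
strip_tags removes the tags but leaves HTML entities such as &amp; in place, so a cleanup step often decodes those as well. The following is a minimal sketch under that assumption; the helper name clean_text is hypothetical.

```php
<?php
// Hypothetical helper: strips tags, decodes HTML entities, trims the ends.
function clean_text(string $fragment): string
{
    $text = strip_tags($fragment);
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    return trim($text);
}

$result = clean_text(' <a href="/x">Tom &amp; <b>Jerry</b></a> ');
```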
