
Filtering advertising content in PHP: part-time job offers, QQ numbers, Taobao part-time jobs, and website links

WBOY
2016-08-08 09:19:16

If your website accepts comments, you will soon find it flooded with advertisements, often from the same person: part-time job offers, QQ numbers, Taobao part-time jobs, and website links. Let's look at how to filter this content.


Advertisements posted in comments or other user-submitted content generally fall into the following types:

1: Taobao part-time job, add QQ group 123456789 (contains a QQ number, WeChat ID, or other numeric contact)
2: Taobao part-time job, add QQ number (contains English keywords)
3: Taobao part-time job, add QQ ①①①①①① (special numeric characters)
4: 22222222 (full-width digits)

Filtering method:
Use regular expressions to strip punctuation and whitespace from the string, then check whether consecutive digits or keywords remain (both full-width and half-width characters are handled), because advertisements generally carry contact information such as QQ numbers. So we first "purify" the comment: convert full-width characters to half-width ones and remove the "sand" such as punctuation marks and spaces, so that only Chinese characters, letters, and digits remain.

Example:

$comment = "This $% is a (1)8 artifact 三四 website, come and join ①④he@#heqq 1 2 3 4 5 6 7 8";

1: "Purify" the content by removing punctuation marks and whitespace

$flag_arr = array('?','!','¥','(',')',':','‘','’','“','”','《','》',',','…','。','、','nbsp','】','【','~');
$comment = preg_replace('/\s/','',preg_replace("/[[:punct:]]/",'',strip_tags(html_entity_decode(str_replace($flag_arr,'',$comment),ENT_QUOTES,'UTF-8'))));

After processing, $comment becomes: "Thisisa18artifact三四websitecomeandjoin①④heheqq12345678"
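
The character list in step 1 has to be extended by hand whenever a new symbol turns up. A possible alternative, sketched here on the assumption that comments are valid UTF-8, is to rely on PCRE's Unicode properties: with the `u` modifier the pattern works on characters rather than bytes, and `\p{P}`, `\p{S}`, and `\p{Z}` match punctuation, symbols, and separators, full-width ones included. The function name `purify_comment` is illustrative and not part of the original code.

```php
<?php
// Illustrative helper (not from the original article): strip markup,
// punctuation, symbols and whitespace from a UTF-8 comment.
function purify_comment(string $comment): string
{
    // Decode entities such as &nbsp; first, then drop any HTML tags.
    $text = strip_tags(html_entity_decode($comment, ENT_QUOTES, 'UTF-8'));
    // \p{P} = punctuation, \p{S} = symbols, \p{Z} = separators (U+3000, U+00A0, ...).
    $clean = preg_replace('/[\p{P}\p{S}\p{Z}\s]+/u', '', $text);
    // preg_replace returns null on invalid UTF-8; fall back to the tag-stripped text.
    return $clean ?? $text;
}

echo purify_comment('加 QQ:1 2 3-4 5 6!'), "\n";   // prints "加QQ123456"
```

Circled digits such as ① and Chinese numerals are not punctuation or symbols in Unicode terms, so they survive this step and can still be converted later.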


2: The comment may contain full-width symbols or digits, so use the following code to convert full-width characters into half-width ones that regular expressions can match

$quanjiao = array(
    '0' => '0', '1' => '1', '2' => '2', '3' => '3', '4' => '4', '5' => '5', '6' => '6', '7' => '7', '8' => '8', '9' => '9',
    'A' => 'A', 'B' => 'B', 'C' => 'C', 'D' => 'D', 'E' => 'E', 'F' => 'F', 'G' => 'G', 'H' => 'H', 'I' => 'I', 'J' => 'J', 'K' => 'K', 'L' => 'L', 'M' => 'M',
    'N' => 'N', 'O' => 'O', 'P' => 'P', 'Q' => 'Q', 'R' => 'R', 'S' => 'S', 'T' => 'T', 'U' => 'U', 'V' => 'V', 'W' => 'W', 'X' => 'X', 'Y' => 'Y', 'Z' => 'Z',
    'a' => 'a', 'b' => 'b', 'c' => 'c', 'd' => 'd', 'e' => 'e', 'f' => 'f', 'g' => 'g', 'h' => 'h', 'i' => 'i', 'j' => 'j', 'k' => 'k', 'l' => 'l', 'm' => 'm',
    'n' => 'n', 'o' => 'o', 'p' => 'p', 'q' => 'q', 'r' => 'r', 's' => 's', 't' => 't', 'u' => 'u', 'v' => 'v', 'w' => 'w', 'x' => 'x', 'y' => 'y', 'z' => 'z',
    '(' => '(', ')' => ')', '〔' => '[', '〕' => ']', '【' => '[', '】' => ']', '〖' => '[', '〗' => ']', '“' => '[', '{' => '{', '}' => '}', '《' => '<', '》' => '>',
    '%' => '%', '+' => '+', '—' => '-', '-' => '-', '~' => '-', ':' => ':', '。' => '.', ',' => '.', '、' => '.', ';' => ',', '?' => '?', '!' => '!',
    '…' => '-', '‖' => '|', '”' => '"', '’' => '`', '‘' => '`', '|' => '|', '〃' => '"', ' ' => ' '
);

$comment = strtr($comment, $quanjiao);

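PHP's strtr function translates characters or replaces substrings in a string.

It can be called as
strtr(string,from,to)
or
strtr(string,array)
In the array form, the keys are the substrings to find and the values are their replacements.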
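
As a quick illustration of how the two call forms differ, here is a minimal sketch; the sample strings are made up for the demonstration. The three-argument form works byte by byte, while the array form replaces whole substrings, which is why the array form is the one that suits multi-byte UTF-8 characters such as full-width digits.

```php
<?php
// Three-argument form: each byte in $from is replaced by the byte at the
// same position in $to, so it is only safe for single-byte characters.
echo strtr('Hi all', 'ai', 'eo'), "\n";   // prints "Ho ell"

// Array form: whole substrings are replaced, the longest key is tried first,
// and text that has already been replaced is not scanned again.
$pairs = ['Hi' => 'Hello', 'Hello' => 'Bye'];
echo strtr('Hi Hello', $pairs), "\n";     // prints "Hello Bye"
```

Because replaced text is never rescanned, a table can safely map one character to another that is itself a key elsewhere in the table.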

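After processing, the sample comment is unchanged, because step 1 already removed its punctuation and it contains no full-width letters or digits. A comment containing "QQ123", on the other hand, would become "QQ123".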
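
The letter and digit part of such a table does not have to be typed by hand. The full-width ASCII variants occupy U+FF01 to U+FF5E, each exactly 0xFEE0 above its half-width counterpart, so the table can be generated in a loop. The sketch below assumes UTF-8 input; the helper names `utf8_chr3`, `full_width_map`, and `to_half_width` are hypothetical. Unlike the hand-written table, it maps every full-width symbol to its direct ASCII equivalent and does not cover CJK brackets such as 【 and 】.

```php
<?php
// Encode a code point in U+0800..U+FFFF as a three-byte UTF-8 sequence.
function utf8_chr3(int $cp): string
{
    return chr(0xE0 | ($cp >> 12))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: build the full-width => half-width table in a loop.
function full_width_map(): array
{
    $map = [utf8_chr3(0x3000) => ' '];           // ideographic (full-width) space
    for ($cp = 0xFF01; $cp <= 0xFF5E; $cp++) {   // full-width "!" through "~"
        $map[utf8_chr3($cp)] = chr($cp - 0xFEE0);
    }
    return $map;
}

function to_half_width(string $text): string
{
    return strtr($text, full_width_map());
}

// "\u{FF31}" is full-width Q and "\u{FF11}" is full-width 1 (PHP 7+ escape syntax).
echo to_half_width("\u{FF31}\u{FF31} \u{FF11}\u{FF12}\u{FF13}"), "\n";   // prints "QQ 123"
```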

3: The comment may also contain special numeric characters (add new ones to the array below as you encounter them)

$special_num_char = array('①'=>'1','②'=>'2','③'=>'3','④'=>'4','⑤'=>'5','⑥'=>'6','⑦'=>'7','⑧'=>'8','⑨'=>'9','⑩'=>'10','⑴'=>'1','⑵'=>'2','⑶'=>'3','⑷'=>'4','⑸'=>'5','⑹'=>'6','⑺'=>'7','⑻'=>'8','⑼'=>'9','⑽'=>'10','一'=>'1','二'=>'2','三'=>'3','四'=>'4','五'=>'5','六'=>'6','七'=>'7','八'=>'8','九'=>'9','零'=>'0');
$comment = strtr($comment, $special_num_char);
After processing, $comment becomes: "Thisisa18artifact34websitecomeandjoin14heheqq12345678"
If formal Chinese numerals appear in comments, such as '零', '壹', '贰', '叁', '肆', '伍', '陆', '柒', '捌', '玖', '拾', just add them to the $special_num_char array above.

4: Comments may also mix ordinary digits with Chinese numerals. The method in step 3 converts them all to ordinary digits.

Example: This is an advertisement qq 1二二45六7899
After conversion:
This is an advertisement qq 1224567899
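
The numeral conversion in steps 3 and 4 can be sketched as a small function. This is a minimal sketch, not the article's own code: the names `numeral_map` and `normalize_numerals` are made up, and the fourth row adds the formal numerals 壹 to 玖 as an example of extending the table.

```php
<?php
// Hypothetical helper: build the numeral table from rows of nine characters,
// where the character at index $i stands for the digit $i + 1.
function numeral_map(): array
{
    $sets = [
        ['①', '②', '③', '④', '⑤', '⑥', '⑦', '⑧', '⑨'],   // circled digits
        ['⑴', '⑵', '⑶', '⑷', '⑸', '⑹', '⑺', '⑻', '⑼'],   // parenthesized digits
        ['一', '二', '三', '四', '五', '六', '七', '八', '九'],   // everyday numerals
        ['壹', '贰', '叁', '肆', '伍', '陆', '柒', '捌', '玖'],   // formal numerals
    ];
    $map = ['零' => '0', '⑩' => '10', '⑽' => '10'];
    foreach ($sets as $chars) {
        foreach ($chars as $i => $char) {
            $map[$char] = (string) ($i + 1);
        }
    }
    return $map;
}

function normalize_numerals(string $text): string
{
    return strtr($text, numeral_map());
}

echo normalize_numerals('qq 1二二45六7899'), "\n";   // prints "qq 1224567899"
echo normalize_numerals('QQ 壹贰叁④⑤⑥'), "\n";      // prints "QQ 123456"
```

The characters 十 and 拾 are left out of this sketch because they act as place values ("ten") rather than single digits, so how to map them depends on the kind of spam being targeted.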

5: Use a regular expression to detect advertisements

Call preg_match_all('/\d+/', $comment, $match) to collect every run of digits, then analyze the resulting $match[0] array:

$is_ad = false;
foreach ($match[0] as $val) // Is there a numeric QQ number or WeChat ID?
{
    if (strlen($val) >= 6)
    { // A run of 6 or more consecutive digits: very likely an advertisement
        $is_ad = true;
        break;
    }
}
if (count($match[0]) >= 10)
{ // Many scattered digit groups: possibly an advertisement
    $is_ad = true;
}
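
The check in step 5 can also be wrapped in a function so the two thresholds are easy to tune. This is a minimal sketch: the name `looks_like_ad` and its parameters are hypothetical, with defaults that mirror the values 6 and 10 used in the article, and it expects text that has already been through the earlier normalization steps.

```php
<?php
// Hypothetical helper: flag text that has one long digit run or many short ones.
function looks_like_ad(string $normalized, int $minRunLength = 6, int $maxRuns = 10): bool
{
    preg_match_all('/\d+/', $normalized, $matches);
    foreach ($matches[0] as $run) {
        if (strlen($run) >= $minRunLength) {
            return true;                        // long enough to be a QQ or WeChat number
        }
    }
    return count($matches[0]) >= $maxRuns;      // many scattered digit groups
}

var_dump(looks_like_ad('joinheheqq12345678'));          // bool(true)
var_dump(looks_like_ad('I bought 2 apples in 2016'));   // bool(false)
var_dump(looks_like_ad('1a2b3c4d5e6f7g8h9i0j'));        // bool(true)
```

Both thresholds are heuristics. Legitimate comments that quote an order number, a phone number, or a date written as eight digits would also be flagged, so it may be safer to hold flagged comments for review than to delete them outright.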

With this check you can judge whether a piece of content is an advertisement, which filters out most common ads. The complete code:

        // Step 1: strip listed full-width punctuation, HTML entities and tags, ASCII punctuation, and whitespace
        $flag_arr=array('?','!','¥','(',')',':','‘','’','“','”','《','》',',','…','。','、','nbsp','】','【','~');
        $comment=preg_replace('/\s/','',preg_replace("/[[:punct:]]/",'',strip_tags(html_entity_decode(str_replace($flag_arr,'',$comment),ENT_QUOTES,'UTF-8'))));

        // Step 2: convert full-width characters to half-width ones
        $quanjiao = array(
            '0' => '0', '1' => '1', '2' => '2', '3' => '3', '4' => '4', '5' => '5', '6' => '6', '7' => '7', '8' => '8', '9' => '9',
            'A' => 'A', 'B' => 'B', 'C' => 'C', 'D' => 'D', 'E' => 'E', 'F' => 'F', 'G' => 'G', 'H' => 'H', 'I' => 'I', 'J' => 'J', 'K' => 'K', 'L' => 'L', 'M' => 'M',
            'N' => 'N', 'O' => 'O', 'P' => 'P', 'Q' => 'Q', 'R' => 'R', 'S' => 'S', 'T' => 'T', 'U' => 'U', 'V' => 'V', 'W' => 'W', 'X' => 'X', 'Y' => 'Y', 'Z' => 'Z',
            'a' => 'a', 'b' => 'b', 'c' => 'c', 'd' => 'd', 'e' => 'e', 'f' => 'f', 'g' => 'g', 'h' => 'h', 'i' => 'i', 'j' => 'j', 'k' => 'k', 'l' => 'l', 'm' => 'm',
            'n' => 'n', 'o' => 'o', 'p' => 'p', 'q' => 'q', 'r' => 'r', 's' => 's', 't' => 't', 'u' => 'u', 'v' => 'v', 'w' => 'w', 'x' => 'x', 'y' => 'y', 'z' => 'z',
            '(' => '(', ')' => ')', '〔' => '[', '〕' => ']', '【' => '[', '】' => ']', '〖' => '[', '〗' => ']', '“' => '[', '{' => '{', '}' => '}', '《' => '<', '》' => '>',
            '%' => '%', '+' => '+', '—' => '-', '-' => '-', '~' => '-', ':' => ':', '。' => '.', ',' => '.', '、' => '.', ';' => ',', '?' => '?', '!' => '!',
            '…' => '-', '‖' => '|', '”' => '"', '’' => '`', '‘' => '`', '|' => '|', '〃' => '"', ' ' => ' '
        );
        $comment=strtr($comment, $quanjiao);

        // Step 3: convert special numeric characters and Chinese numerals to ordinary digits
        $special_num_char=array('①'=>'1','②'=>'2','③'=>'3','④'=>'4','⑤'=>'5','⑥'=>'6','⑦'=>'7','⑧'=>'8','⑨'=>'9','⑩'=>'10','⑴'=>'1','⑵'=>'2','⑶'=>'3','⑷'=>'4','⑸'=>'5','⑹'=>'6','⑺'=>'7','⑻'=>'8','⑼'=>'9','⑽'=>'10','一'=>'1','二'=>'2','三'=>'3','四'=>'4','五'=>'5','六'=>'6','七'=>'7','八'=>'8','九'=>'9','零'=>'0');
        $comment=strtr($comment, $special_num_char);

        // Step 5: collect every run of digits and apply the two heuristics
        preg_match_all('/\d+/',$comment,$match);
        $is_ad = false;
        foreach($match[0] as $val) // Is there a numeric QQ number or WeChat ID?
        {
            if(strlen($val)>=6)
            { // A run of 6 or more consecutive digits: very likely an advertisement
                $is_ad=true;
                break;
            }
        }
        if(count($match[0])>=10)
        { // Many scattered digit groups: possibly an advertisement
            $is_ad=true;
        }

