search
HomeBackend DevelopmentPHP TutorialSome notes on post-collection data processing based on preg_match_all (encoding conversion and regular matching)_PHP tutorial

1. Use curl to achieve off-site collection

Please refer to my last note for details: http://www.jb51.net/article/46432.htm

2. Encoding conversion
First find the encoding used by the collected website by viewing the source code, and transcode it through the mb_convert_encoding function;

Specific usage:

Copy code The code is as follows:

//The source character is $str

//The following is known The original encoding is GBK, converted to utf-8
mb_convert_encoding($str, "UTF-8", "GBK");

//The following unknown original encoding, after automatic detection by auto, convert the encoding For utf-8
mb_convert_encoding($str, "UTF-8", "auto");

3. In order to better avoid the obstacles of uncertain factors such as line breaks and spaces, it is necessary to first remove line breaks, spaces and tab characters in the collected source code

Copy code The code is as follows:

//Method 1, use str_replace to replace
$contents = str_replace(" rn", '', $contents); //Clear newline characters
$contents = str_replace("n", '', $contents); //Clear newline characters
$contents = str_replace("t" , '', $contents); //Clear tab characters
$contents = str_replace(" ", '', $contents); //Clear space characters

//Method 2, use regular expressions Expression replacement
$contents = preg_replace("/([rn|n|t| ]+)/",'',$contents);

4. Find the code segment you need to obtain through regular expression matching, and use preg_match_all to achieve the matching

Copy code The code is as follows:

Function explanation:
int preg_match_all ( string pattern, string subject, array matches [ , int flags] )
pattern is the regular expression
subject is the original text to be searched
matches is the array used to store the output results
flags is the stored pattern, including:
PREG_PATTERN_ORDER ; //The entire array is a two-dimensional array, $arr1[0] is an array of matching strings including the boundaries, $arr1[1] is an array of matching strings minus the boundaries
PREG_SET_ORDER; //The entire array is a two-dimensional array, $arr2[0][0] is the first matching string consisting of boundaries, $arr2[0][1] is the first matching string consisting of removing boundaries, and then The array can be deduced by analogy
PREG_OFFSET_CAPTURE; //The entire array is a three-dimensional array, $arr3[0][0][0] is the first matching string including the boundary, $arr3[0][0 ][1] is the offset to the boundary of the first matching string (the boundary is not included), and so on, $arr2[1][0][0] is the first including the boundary The matched string, $arr3[1][0][1] is the offset to the boundary of the first matched string (boundary is included);

//Application
preg_match_all('/(.*?)/',$contents, $out, PREG_SET_ORDER);
$out will get all matching elements
$out[0][0] will be the entire character including
$out[0][1] will be only the (.* ?) The matched character segment in the brackets

// By analogy, the nth matched field can be obtained using the following method
$out[n-1][1]

//If there are multiple parentheses in the regular expression, the method to obtain the mth matching point in the sentence is
$out[n-1][m]

5. After obtaining the characters to be found, if you want to remove the html tags, you can easily achieve this by using the function strip_tags that comes with PHP

Copy code The code is as follows:

//Example
$result=strip_tags($out[0][1 ]);

www.bkjia.comtruehttp: //www.bkjia.com/PHPjc/728086.htmlTechArticle1. For details on using curl to achieve off-site collection, please refer to my last note: http://www.jb51 .net/article/46432.htm 2. Encoding conversion: First find the encoding used by the collected website by viewing the source code...
Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
如何在电脑上输入下划线如何在电脑上输入下划线Feb 19, 2024 pm 08:36 PM

电脑下划线怎么打在电脑输入文字时,我们经常需要使用下划线来突出某些内容或进行标记。然而,对于一些不太熟悉电脑输入法的人来说,打出下划线可能会有些困惑。本文就将向大家介绍如何在电脑上打出下划线。在不同的电脑操作系统和软件中,输入下划线的方式可能会稍有不同。下面将分别介绍Windows操作系统和Mac操作系统上的常用方法。首先,我们先来看一下在Windows操作

如何用 Golang 正则匹配多个单词或字符串?如何用 Golang 正则匹配多个单词或字符串?May 31, 2024 am 10:32 AM

Golang正则表达式使用管道符|来匹配多个单词或字符串,将各个选项作为逻辑OR表达式分隔开来。例如:匹配"fox"或"dog":fox|dog匹配"quick"、"brown"或"lazy":(quick|brown|lazy)匹配"Go"、"Python"或"Java":Go|Python|Java匹配单词或4位邮政编码:([a-zA

如何用php正则替换以什么开头的字符串如何用php正则替换以什么开头的字符串Mar 24, 2023 pm 02:57 PM

PHP正则表达式是一种针对文本处理和转换的有力工具。它可以通过解析文本内容,并按照特定的模式进行替换或截取,达到有效管理文本信息的目的。其中,正则表达式的一个常见应用是替换以特定字符开头的字符串,对此,我们进行如下的讲解

php 如何用正则去除中文php 如何用正则去除中文Mar 03, 2023 am 10:12 AM

php用正则去除中文的方法:1、创建一个php示例文件;2、定义一个含有中文和英文的字符串;3、通过“preg_replace('/([\x80-\xff]*)/i','',$a);”正则方法去除查询结果中的中文字符即可。

php怎么利用正则匹配去掉html标签php怎么利用正则匹配去掉html标签Mar 21, 2023 pm 05:17 PM

在本文中,我们将学习如何使用PHP正则表达式删除HTML标签,并从HTML字符串中提取纯文本内容。 为了演示如何去掉HTML标记,让我们首先定义一个包含HTML标签的字符串。

探究golang编码转换的实现机制探究golang编码转换的实现机制Feb 19, 2024 pm 03:21 PM

Golang作为一种强大的编程语言,具有较高的性能和并发能力,同时也提供了丰富的标准库支持,其中包括了对编码转换的支持。本文将深入探讨Golang中编码转换的实现原理,并结合具体的代码示例进行分析。什么是编码转换?编码转换指的是将一个字符序列从一种编码转换为另一种编码的过程。在实际的开发中,我们经常需要处理不同编码之间的转换,比如将UTF-8编码的字符串转换

简单学习dedecms编码转换功能的方法简单学习dedecms编码转换功能的方法Mar 14, 2024 pm 02:09 PM

学习dedecms编码转换功能并不复杂,通过简单的代码示例,可以帮助您快速掌握这一技能。在dedecms中,编码转换功能通常用于处理中文乱码、特殊字符等问题,确保系统的正常运行和数据的准确性。下面将详细介绍如何使用dedecms的编码转换功能,让您轻松应对各种编码相关的需求。1.UTF-8转GBK在dedecms中,如果需要将UTF-8编码的字符串转换为G

如何处理C++开发中的编码转换问题如何处理C++开发中的编码转换问题Aug 22, 2023 am 11:07 AM

如何处理C++开发中的编码转换问题在C++开发过程中,经常会遇到需要处理不同编码之间转换的问题。由于不同的编码格式之间存在差异,因此在进行编码转换时需要注意一些细节。本文将介绍如何处理C++开发中的编码转换问题。一、了解不同编码格式在处理编码转换问题之前,首先需要了解不同的编码格式。常见的编码格式有ASCII、UTF-8、GBK等。ASCII是最早的编码格式

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
2 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
Repo: How To Revive Teammates
4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
Hello Kitty Island Adventure: How To Get Giant Seeds
3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

SublimeText3 Linux new version

SublimeText3 Linux new version

SublimeText3 Linux latest version

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

Atom editor mac version download

Atom editor mac version download

The most popular open source editor

WebStorm Mac version

WebStorm Mac version

Useful JavaScript development tools

ZendStudio 13.5.1 Mac

ZendStudio 13.5.1 Mac

Powerful PHP integrated development environment