


1. Use curl to achieve off-site collection
Please refer to my last note for details: http://www.jb51.net/article/46432.htm
2. Encoding conversion
First find the encoding used by the collected website by viewing the source code, and transcode it through the mb_convert_encoding function;
Specific usage:
//The source character is $str
//The following is known The original encoding is GBK, converted to utf-8
mb_convert_encoding($str, "UTF-8", "GBK");
//The following unknown original encoding, after automatic detection by auto, convert the encoding For utf-8
mb_convert_encoding($str, "UTF-8", "auto");
3. In order to better avoid the obstacles of uncertain factors such as line breaks and spaces, it is necessary to first remove line breaks, spaces and tab characters in the collected source code
//Method 1, use str_replace to replace
$contents = str_replace(" rn", '', $contents); //Clear newline characters
$contents = str_replace("n", '', $contents); //Clear newline characters
$contents = str_replace("t" , '', $contents); //Clear tab characters
$contents = str_replace(" ", '', $contents); //Clear space characters
//Method 2, use regular expressions Expression replacement
$contents = preg_replace("/([rn|n|t| ]+)/",'',$contents);
4. Find the code segment you need to obtain through regular expression matching, and use preg_match_all to achieve the matching
Function explanation:
int preg_match_all ( string pattern, string subject, array matches [ , int flags] )
pattern is the regular expression
subject is the original text to be searched
matches is the array used to store the output results
flags is the stored pattern, including:
PREG_PATTERN_ORDER ; //The entire array is a two-dimensional array, $arr1[0] is an array of matching strings including the boundaries, $arr1[1] is an array of matching strings minus the boundaries
PREG_SET_ORDER; //The entire array is a two-dimensional array, $arr2[0][0] is the first matching string consisting of boundaries, $arr2[0][1] is the first matching string consisting of removing boundaries, and then The array can be deduced by analogy
PREG_OFFSET_CAPTURE; //The entire array is a three-dimensional array, $arr3[0][0][0] is the first matching string including the boundary, $arr3[0][0 ][1] is the offset to the boundary of the first matching string (the boundary is not included), and so on, $arr2[1][0][0] is the first including the boundary The matched string, $arr3[1][0][1] is the offset to the boundary of the first matched string (boundary is included);
//Application
preg_match_all('/
$out will get all matching elements
$out[0][0] will be the entire character including
$out[0][1] will be only the (.* ?) The matched character segment in the brackets
// By analogy, the nth matched field can be obtained using the following method
$out[n-1][1]
//If there are multiple parentheses in the regular expression, the method to obtain the mth matching point in the sentence is
$out[n-1][m]
5. After obtaining the characters to be found, if you want to remove the html tags, you can easily achieve this by using the function strip_tags that comes with PHP
//Example
$result=strip_tags($out[0][1 ]);

电脑下划线怎么打在电脑输入文字时,我们经常需要使用下划线来突出某些内容或进行标记。然而,对于一些不太熟悉电脑输入法的人来说,打出下划线可能会有些困惑。本文就将向大家介绍如何在电脑上打出下划线。在不同的电脑操作系统和软件中,输入下划线的方式可能会稍有不同。下面将分别介绍Windows操作系统和Mac操作系统上的常用方法。首先,我们先来看一下在Windows操作

Golang正则表达式使用管道符|来匹配多个单词或字符串,将各个选项作为逻辑OR表达式分隔开来。例如:匹配"fox"或"dog":fox|dog匹配"quick"、"brown"或"lazy":(quick|brown|lazy)匹配"Go"、"Python"或"Java":Go|Python|Java匹配单词或4位邮政编码:([a-zA

PHP正则表达式是一种针对文本处理和转换的有力工具。它可以通过解析文本内容,并按照特定的模式进行替换或截取,达到有效管理文本信息的目的。其中,正则表达式的一个常见应用是替换以特定字符开头的字符串,对此,我们进行如下的讲解

php用正则去除中文的方法:1、创建一个php示例文件;2、定义一个含有中文和英文的字符串;3、通过“preg_replace('/([\x80-\xff]*)/i','',$a);”正则方法去除查询结果中的中文字符即可。

在本文中,我们将学习如何使用PHP正则表达式删除HTML标签,并从HTML字符串中提取纯文本内容。 为了演示如何去掉HTML标记,让我们首先定义一个包含HTML标签的字符串。

Golang作为一种强大的编程语言,具有较高的性能和并发能力,同时也提供了丰富的标准库支持,其中包括了对编码转换的支持。本文将深入探讨Golang中编码转换的实现原理,并结合具体的代码示例进行分析。什么是编码转换?编码转换指的是将一个字符序列从一种编码转换为另一种编码的过程。在实际的开发中,我们经常需要处理不同编码之间的转换,比如将UTF-8编码的字符串转换

学习dedecms编码转换功能并不复杂,通过简单的代码示例,可以帮助您快速掌握这一技能。在dedecms中,编码转换功能通常用于处理中文乱码、特殊字符等问题,确保系统的正常运行和数据的准确性。下面将详细介绍如何使用dedecms的编码转换功能,让您轻松应对各种编码相关的需求。1.UTF-8转GBK在dedecms中,如果需要将UTF-8编码的字符串转换为G

如何处理C++开发中的编码转换问题在C++开发过程中,经常会遇到需要处理不同编码之间转换的问题。由于不同的编码格式之间存在差异,因此在进行编码转换时需要注意一些细节。本文将介绍如何处理C++开发中的编码转换问题。一、了解不同编码格式在处理编码转换问题之前,首先需要了解不同的编码格式。常见的编码格式有ASCII、UTF-8、GBK等。ASCII是最早的编码格式


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

SublimeText3 Linux new version
SublimeText3 Linux latest version

Notepad++7.3.1
Easy-to-use and free code editor

Atom editor mac version download
The most popular open source editor

WebStorm Mac version
Useful JavaScript development tools

ZendStudio 13.5.1 Mac
Powerful PHP integrated development environment
