Some notes on post-collection data processing based on preg_match_all (encoding conversion and regular matching)


WBOY

Jul 13, 2016, 10:39 AM

Tags: match, regular, transcoding

1. Use curl to collect pages from a remote site

For details, see my previous note: http://www.jb51.net/article/46432.htm
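As a rough illustration of this step, the sketch below wraps a curl request in a small function. The name fetch_page, the timeout default, and the choice of options are assumptions made for this example and are not taken from the earlier note.

```php
<?php
// Hypothetical helper (not from the original note): fetch a page body with
// the curl extension, returning null when the request fails.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_TIMEOUT        => $timeoutSeconds, // give up after this many seconds
    ]);
    $body = curl_exec($ch);

    return $body === false ? null : $body;
}

// Usage (needs network access and the curl extension):
// $contents = fetch_page('http://www.example.com/');
```

Returning null on failure lets the later steps skip pages that could not be downloaded instead of running regular expressions over an empty or false value.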

2. Encoding conversion
First find the encoding used by the collected website by viewing its source code, then transcode the collected text with the mb_convert_encoding function.

Specific usage:


// The source string is $str

// The original encoding is known to be GBK; convert it to UTF-8
$str = mb_convert_encoding($str, "UTF-8", "GBK");

// The original encoding is unknown; "auto" lets mbstring detect it, then the text is converted to UTF-8
$str = mb_convert_encoding($str, "UTF-8", "auto");
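Note that "auto" only tries the encodings in mbstring's configured detection order, which may not include GBK. One possible alternative, sketched below, is to detect against an explicit candidate list first. The helper name to_utf8 and the candidate list are assumptions for this example; it requires the mbstring extension.

```php
<?php
// Hypothetical helper: convert collected bytes to UTF-8 by testing an
// explicit list of candidate encodings (an assumed list, adjust as needed).
function to_utf8(string $raw, array $candidates = ['UTF-8', 'GBK']): string
{
    // Strict mode (third argument) rejects encodings in which the bytes are invalid.
    $from = mb_detect_encoding($raw, $candidates, true);
    if ($from === false) {
        return $raw; // no candidate fits: leave the bytes untouched
    }

    return mb_convert_encoding($raw, 'UTF-8', $from);
}

$gbk = "\xD6\xD0\xCE\xC4"; // the GBK bytes for "中文"
echo to_utf8($gbk), "\n";
```

Detection is a heuristic, so when the site declares its charset it is safer to pass that encoding to mb_convert_encoding directly.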

3. To keep line breaks, spaces, and similar incidental whitespace from interfering with matching, first remove the line breaks, spaces, and tab characters from the collected source code


// Method 1: replace with str_replace
$contents = str_replace("\r\n", '', $contents); // remove Windows line breaks
$contents = str_replace("\n", '', $contents);   // remove Unix line breaks
$contents = str_replace("\t", '', $contents);   // remove tab characters
$contents = str_replace(" ", '', $contents);    // remove spaces

// Method 2: replace with a regular expression
$contents = preg_replace("/[\r\n\t ]+/", '', $contents);
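The sketch below runs the removal on a small, made-up HTML fragment. It also shows a gentler variant, offered here only as a suggestion, that keeps spaces: stripping every space also merges a tag name with its attributes and joins words in the text, which may or may not matter for the pattern you plan to match.

```php
<?php
// Hypothetical sample input, invented for this illustration.
$contents = "<ul>\r\n\t<li class=\"a\">one two</li>\n</ul>";

// Remove every CR, LF, tab, and space in one pass.
$stripped = preg_replace('/[\r\n\t ]+/', '', $contents);

// Gentler variant: drop line breaks and tabs only, leaving spaces in place.
$collapsed = str_replace(["\r\n", "\n", "\t"], '', $contents);

echo $stripped, "\n";  // <ul><liclass="a">onetwo</li></ul>
echo $collapsed, "\n"; // <ul><li class="a">one two</li></ul>
```

Passing an array as the search argument lets str_replace handle all three replacements in a single call.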

4. Use regular expression matching to find the code segments you need; preg_match_all performs the matching


Function explanation:
int preg_match_all ( string pattern, string subject, array matches [, int flags] )
pattern is the regular expression
subject is the original text to be searched
matches is the array that receives the results
flags controls how the results are arranged in matches, including:
PREG_PATTERN_ORDER; // The default. A two-dimensional array: $arr1[0] is the array of full matches, including the boundaries; $arr1[1] is the array of strings captured by the first group, without the boundaries
PREG_SET_ORDER; // A two-dimensional array: $arr2[0][0] is the first full match, including the boundaries; $arr2[0][1] is the string captured in the first match, without the boundaries; later matches follow the same layout
PREG_OFFSET_CAPTURE; // A three-dimensional array: $arr3[0][0][0] is the first full match, including the boundaries, and $arr3[0][0][1] is the offset in subject where that full match starts; likewise, $arr3[1][0][0] is the string captured in the first match, without the boundaries, and $arr3[1][0][1] is the offset where that captured string starts
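To make the three layouts concrete, the sketch below runs one pattern over a made-up string with each flag. The subject, the pattern, and the variable names are assumptions chosen for this illustration.

```php
<?php
// Hypothetical subject and pattern, invented to show the array layouts.
$subject = '<b>x</b><b>yz</b>';
$pattern = '/<b>(.*?)<\/b>/';

preg_match_all($pattern, $subject, $byPattern, PREG_PATTERN_ORDER);
// $byPattern[0] is ['<b>x</b>', '<b>yz</b>'], $byPattern[1] is ['x', 'yz']

preg_match_all($pattern, $subject, $bySet, PREG_SET_ORDER);
// $bySet[0] is ['<b>x</b>', 'x'], $bySet[1] is ['<b>yz</b>', 'yz']

preg_match_all($pattern, $subject, $withOffsets, PREG_OFFSET_CAPTURE);
// Every entry becomes [string, byte offset]:
// $withOffsets[0][1] is ['<b>yz</b>', 8], $withOffsets[1][1] is ['yz', 11]

print_r($bySet);
```

PREG_SET_ORDER is usually the more convenient layout when each match is processed as one record, which is why the application in this note uses it.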

// Application
preg_match_all('/(.*?)/',$contents, $out, PREG_SET_ORDER);
// $out receives every match
// $out[0][0] is the whole first match, including the boundary text
// $out[0][1] is only the segment matched by the (.*?) group

// By analogy, the captured field of the nth match is
$out[n-1][1]

// If the regular expression contains several parenthesized groups, the mth group of the nth match is
$out[n-1][m]
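The indexing rule can be checked on a small example. The table markup and the pattern below are assumptions made up for this sketch; the input is written as it would look after the whitespace removal in step 3.

```php
<?php
// Hypothetical sample input, with spaces and line breaks already removed.
$contents = '<tr><td>apple</td><td>3</td></tr><tr><td>pear</td><td>5</td></tr>';

// Two capture groups per match: the name cell and the quantity cell.
$count = preg_match_all(
    '/<tr><td>(.*?)<\/td><td>(.*?)<\/td><\/tr>/',
    $contents,
    $out,
    PREG_SET_ORDER
);

// $out[n-1][m] is the mth group of the nth match, so $out[1][2] is "5".
foreach ($out as $match) {
    echo $match[1], ' => ', $match[2], "\n";
}
```

The return value is the number of full matches, which is 2 here, so it can double as a quick check that the pattern found what was expected.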

5. If you want to remove the HTML tags from the text you have matched, PHP's built-in strip_tags function does this easily


// Example
$result = strip_tags($out[0][1]);
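A minimal sketch of this step on a made-up fragment follows. strip_tags removes the tags but leaves HTML entities such as &amp; in place, so the optional html_entity_decode call is shown as one way to turn them back into plain characters; that extra step is a suggestion and not part of the original note.

```php
<?php
// Hypothetical captured fragment, invented for this illustration.
$fragment = '<p>Hello <b>world</b> &amp; <a href="#">more</a></p>';

// Remove the tags; entities are left as they are.
$text = strip_tags($fragment);

// Optionally decode entities into plain characters.
$plain = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $text, "\n";  // Hello world &amp; more
echo $plain, "\n"; // Hello world & more
```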
