Regular expressions and text mining--Text Mining-PHP Tutorial-php.cn

伊谢尔伦

Dec 05, 2016 am 11:56 AM


When doing text mining, the wildcard characters in TSQL are often not expressive enough. In that case, "CLR + regular expression" is a very good choice. Regular expressions look complicated, but they are all built from the same small set of elements. Once you are proficient with the metacharacters of regular expressions, you can use regular expressions flexibly to complete complex text mining work.

I. Special characters of regular expressions

1. Commonly used metacharacters

Metacharacters are used to match specific kinds of characters (letters, digits, symbols). Note that the letters in metacharacters are case-sensitive:

.: Matches any character except a line break
\w: Matches a letter, digit, underscore, or Chinese character
\s: Matches any whitespace character
\d: Matches a digit
\b: Matches the beginning or end of a word
^: Matches the beginning of the string
$: Matches the end of the string
\k<group_name>: References a group by name; for example, \k<group_name> references the group named group_name
\group_number: References a group by number, where group_number is the number of the group; for example, \1, \2, \3
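
As a minimal sketch of these metacharacters, the snippet below uses Python's `re` module rather than the CLR engine the article targets; the sample string is made up for illustration. In Python 3, `\w` applied to a `str` is Unicode-aware, so it also matches Chinese characters.

```python
import re

# Hypothetical sample text, chosen only to exercise each metacharacter.
text = "Order 42 shipped to 北京 on 2016-12-05"

digits = re.findall(r"\d+", text)              # runs of digits
words = re.findall(r"\w+", text)               # runs of word characters
whole_word_on = re.findall(r"\bon\b", text)    # "on" only as a whole word
starts_with_order = bool(re.search(r"^Order", text))  # anchored at the start
ends_with_05 = bool(re.search(r"05$", text))          # anchored at the end
fields = re.split(r"\s+", "a  b\tc")           # split on runs of whitespace
dot_matches = re.findall(r"a.c", "abc a\nc")   # "." does not match a line break

print(digits)
print(words)
print(fields)
```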

2. Repeated characters or groups

Quantifiers specify how many times the preceding character or group is repeated:

*: Repeats zero or more times
+: Repeats one or more times
?: Repeats zero or one time
{n}: Repeats exactly n times
{n,}: Repeats n or more times
{n,m}: Repeats n to m times
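
The quantifiers can be compared side by side with a small Python sketch. The sample strings and the helper `matching` are hypothetical; `re.fullmatch` is used so that each pattern must cover the whole string.

```python
import re

# Hypothetical samples: "a", then zero to three "b", then "c".
samples = ["ac", "abc", "abbc", "abbbc"]

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = matching(r"ab*c")            # zero or more "b"
plus = matching(r"ab+c")            # one or more "b"
optional = matching(r"ab?c")        # zero or one "b"
exactly_two = matching(r"ab{2}c")   # exactly two "b"
two_or_more = matching(r"ab{2,}c")  # two or more "b"
one_to_two = matching(r"ab{1,2}c")  # one or two "b"

print(star, plus, optional, exactly_two, two_or_more, one_to_two)
```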

3. Grouping, escaping, branching, and qualifiers

These characters have specific meanings and uses:

(): Parentheses define a group
?<group_name>: Defines the group name; the string between < and > is the name of the group
\: Escape character; it makes the metacharacter that follows match literally, for example \. matches a period
|: Branch; the expressions on either side are in an "or" relationship
[]: Specifies a list of allowed characters; the character must match one of the characters listed in the square brackets. For example, [aeiou] matches any one of a, e, i, o, u
[^]: Specifies a list of excluded characters; the character must not be any of the characters listed in the square brackets. For example, [^aeiou] matches any character other than a, e, i, o, u
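
A short Python sketch of grouping with a branch, character lists, and escaping; the sample sentence is invented for illustration.

```python
import re

# Hypothetical sample text.
text = "The cat sat; a dog ran. Price: 3.50"

# Branch inside a group: match "cat" or "dog" as whole words.
pets = re.findall(r"\b(cat|dog)\b", text)

# Allowed and excluded character lists.
vowels = re.findall(r"[aeiou]", "regex")
non_vowels = re.findall(r"[^aeiou]", "regex")

# Escaping: "\." matches a literal period instead of "any character".
price = re.findall(r"\d+\.\d+", text)

print(pets, vowels, non_vowels, price)
```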

II. Group references

A group is a subexpression enclosed in parentheses; a group reference reuses that group later in the same expression, which makes the regular expression more concise. By default, the regular expression engine automatically assigns a number to each group. Group numbers are 1-based and increase by 1 from left to right: the first group is number 1, the second group is number 2, and so on.

Three forms of group definition:

(exp): The group is assigned a number automatically and is referenced by that number;
(?<name>exp): The group is named and is referenced by its name;
(?:exp): The group only matches text at the current position; it cannot be referenced afterwards, and it has neither a group name nor a group number.

1. Referencing a group by its group number

Define a group (exp) earlier in the regular expression; later in the expression you can reference that group by its number. The syntax for the reference is: \group_number

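For example: \b(\w+)\b\s+\1\b. This regular expression has only one group, (\w+), whose group number is 1; after the group, \1 references it. The reference \1 matches the same text that group 1 captured, so the expression matches a word that is immediately repeated, such as "is is". It is therefore not equivalent to \b(\w+)\b\s+(\w+)\b, which matches any two consecutive words.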
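
The behavior of a numbered back-reference can be checked with a short Python sketch; the sample sentences are made up. It also shows that repeating the subexpression instead of referencing the group matches any two words, not only a repeated one.

```python
import re

# \1 must match the exact text captured by group 1.
repeated_word = re.compile(r"\b(\w+)\b\s+\1\b")

m = repeated_word.search("This is is a test")
whole_match = m.group(0)   # the repeated pair
captured = m.group(1)      # the text captured by group 1

# No word is repeated here, so the back-reference cannot match.
no_repeat = repeated_word.search("This is a test")

# Repeating the subexpression matches any two consecutive words.
two_words = re.search(r"\b(\w+)\b\s+(\w+)\b", "This is a test").group(0)

print(whole_match, captured, no_repeat, two_words)
```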

2. Referencing a group by its group name

In a regular expression, a group can be named. The format of a named group is (?<name>exp), where name is the group name, and the format for referencing the group by name is \k<name>. Whether a group is referenced by name or by number, the text matching behavior is the same.

For example: \b(?<name>\w+)\b\s+\k<name>\b. After the group, \k<name> references it and matches the same text that the group captured, so this expression behaves the same as \b(\w+)\b\s+\1\b.
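
The same idea can be sketched in Python, with one caveat: Python's `re` module spells a named group `(?P<name>exp)` and a named reference `(?P=name)` instead of the `(?<name>exp)` and `\k<name>` forms used by the CLR engine. The group name `word` and the sample sentence are assumptions for illustration.

```python
import re

# Python syntax: (?P<word>...) defines the group, (?P=word) references it.
repeated_word = re.compile(r"\b(?P<word>\w+)\b\s+(?P=word)\b")

m = repeated_word.search("Paris in the the spring")
whole_match = m.group(0)
by_name = m.group("word")   # access by group name
by_number = m.group(1)      # a named group still gets a number

print(whole_match, by_name, by_number)
```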

3. Non-capturing groups

(?:exp): A group defined with this syntax cannot be referenced and only matches text at the current position. The regular expression engine does not assign a group number to it.
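
A brief Python sketch of the difference, using a made-up year-month string: the non-capturing group takes part in matching but does not receive a group number, so the following group becomes group 1.

```python
import re

capturing = re.compile(r"(\d{4})-(\d{2})")
non_capturing = re.compile(r"(?:\d{4})-(\d{2})")

# Number of capturing groups in each compiled pattern.
capturing_count = capturing.groups
non_capturing_count = non_capturing.groups

# With the year group non-capturing, the month is group 1.
month = non_capturing.search("2016-12").group(1)

print(capturing_count, non_capturing_count, month)
```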

III. Assertion searches

An assertion is a logical condition: the match succeeds only when the condition is true. When a match succeeds, the returned text does not include the asserted prefix or suffix; in other words, an assertion is used to find text that comes before or after a specific piece of text. There are four assertion syntaxes:

(?=exp): The text must be followed by exp; returns the text before the position where exp matches
(?<=exp): The text must be preceded by exp; returns the text after exp
(?!exp): The text must not be followed by exp
(?<!exp): The text must not be preceded by exp

1. Suffix matching

(?=exp): The text must be followed by exp, and the text before exp is returned. Suffix matching is similar to the TSQL pattern "%ing".

For example, regular expression: \b\w+(?=ing\b)

Analysis: the assertion requires that the suffix is ing and that ing sits at the end of the word (\b). The expression matches words ending in ing but returns only the front part of the word, the part before ing.

For example, searching "I'm reading a book" matches within "reading", because the word ends with ing. The regular expression returns read; the text returned by an assertion match does not include the suffix.
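
The suffix assertion above can be tried in Python, whose `re` module uses the same `(?=exp)` syntax. This is only a sketch with the article's sample sentence.

```python
import re

text = "I'm reading a book"

# Words followed by "ing" at a word boundary; "ing" itself is not returned.
stems = re.findall(r"\b\w+(?=ing\b)", text)

# The match span stops before the asserted suffix.
m = re.search(r"\b\w+(?=ing\b)", text)
matched_text = text[m.start():m.end()]

print(stems, matched_text)
```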

2. Prefix matching

(?<=exp): The text must be preceded by exp, and the text after exp is returned.

For example, regular expression: (?<=\bre)\w+\b

Analysis: the assertion requires the beginning of a word (\b) followed by the prefix re. The expression matches words starting with re but returns only the second half of the word, the part after re.

For example, searching "I am reading a book" matches within "reading", because the word starts with re. The regular expression returns ading; the text returned by an assertion match does not include the prefix.
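
A Python sketch of the prefix assertion, again with the article's sample sentence. Python's `re` accepts `(?<=exp)` only when exp has a fixed width; `\bre` qualifies because `\b` consumes no characters.

```python
import re

text = "I am reading a book"

# Word characters preceded by "re" at the start of a word; "re" is not returned.
tails = re.findall(r"(?<=\bre)\w+\b", text)

print(tails)
```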

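3. Finding text whose prefix or suffix is not a specific text

These two assertions are the negations of the previous two and are used less often. A brief overview:

(?!exp): The text must not be followed by exp
(?<!exp): The text must not be preceded by exp

3.1 For example, regular expression: \b\w+(?!ing\b)

Analysis: the intent is to skip words that end in ing. As written, however, the greedy \w+ first consumes the whole word, and the assertion is then tested at the end of the word, where ing never follows. Searching "I am reading a book" therefore still returns reading along with I, am, a, book. To really exclude such words, test the assertion at the start of the word: \b(?!\w*ing\b)\w+\b returns I, am, a, book.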

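3.2 For example, an expression such as: \b\w+\b(?<!\bre\w*)

Analysis: the intent is to skip words that start with re. The assertion is tested at the end of the word and looks back over the whole word, which relies on the .NET engine used by CLR allowing a variable-length expression inside (?<!exp). Searching "I am reading a book" returns: I, am, a, book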
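
The negative assertions can be checked with a short Python sketch. A greedy \w+ followed by (?!ing\b) still returns whole words ending in ing, because the assertion is only tested after the word has been consumed; placing the assertion at the start of the word, or using a fixed-width negative lookbehind at the end of the word, excludes them. Python's `re` only accepts fixed-width lookbehind, so the patterns below are illustrative variants and not necessarily the expressions used with CLR.

```python
import re

text = "I am reading a book"

# Assertion tested after the whole word: "reading" is still returned.
naive = re.findall(r"\b\w+(?!ing\b)", text)

# Negative lookahead at the word start: skip words ending in "ing".
not_ending_ing = re.findall(r"\b(?!\w*ing\b)\w+\b", text)

# Fixed-width negative lookbehind at the word end: same result.
not_ending_ing_behind = re.findall(r"\b\w+\b(?<!ing)", text)

# Negative lookahead at the word start: skip words starting with "re".
not_starting_re = re.findall(r"\b(?!re)\w+\b", text)

print(naive)
print(not_ending_ing, not_ending_ing_behind, not_starting_re)
```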
