Home  >  Article  >  Backend Development  >  Match the overlap between the csdn user database and official users and filter out the overlapping users_PHP tutorial

Match the overlap between the csdn user database and official users and filter out the overlapping users_PHP tutorial

WBOY
WBOYOriginal
2016-07-21 15:21:56892browse

Process:
1. Obtain the user database of csdn and import it locally
When trying to open editplus, it prompts that there is insufficient memory and cannot find a way. My colleague checked it under Linux and the basic format is as follows:
Username # Password # Email
Username# Password# Email
Corresponding data structure:

Copy code The code is as follows:

CREATE TABLE IF NOT EXISTS `csdn_userdb` (
`id` int(10) NOT NULL auto_increment,
`username` varchar(50) character set gbk NOT NULL,
`password` varchar(50) character set gbk NOT NULL,
`email` varchar(50) character set gbk NOT NULL,
PRIMARY KEY (`id`),
KEY `username` (`username`),
KEY `email` (`email`)
 ) ENGINE=MyISAM DEFAULT CHARSET=gbk AUTO_INCREMENT=1;

I have always suspected that fopen opens files into the cache, but practice has proved that it is very fast, so there should be no Write to the cache, the following is the code for importing data
Copy the code The code is as follows:

$ link = mysql_connect('localhost', 'root', 'admin', true);
mysql_select_db('csdn',$link);
$handle = fopen("C:UserszhudongDesktopwww.csdn.net.sql ", "r");
while (!feof($handle)){
$i++;
$buffer = fgets($handle);
list($u,$p,$ e) = explode(" # ",$buffer);
mysql_query("INSERT INTO csdn_userdb(username,password,email) VALUES ('$u','$p','$e')",$link );
if ($i%1000 == 0) echo $i."n";
}
fclose($handle);
?>

The efficiency of the above code is very poor, so the modified code is as follows:
Copy the code The code is as follows:

$link = mysql_connect('localhost', 'root', 'admin', true);
mysql_select_db('csdn',$link);
$handle = fopen("C:UserszhudongDesktopwww.csdn .net.sql", "r");
$perpage = 50;
while (!feof($handle)){
$i++;
$buffer = fgets($handle);
list($u,$p,$e) = explode(" # ",$buffer);
$insertValue[] = "('$u','$p','$e') ";
if ($i% $perpage == 0){ $perpage == 0){
$insrtValueString = implode(',',$insertValue);
mysql_query("INSERT INTO csdn_userdb( username,password,email) VALUES $instrtValueString",$link);
echo $i."n";
$insertValue = array();
}
}
fclose($ handle);

In order to find out what factors affect the efficiency of imported data, we conducted tests based on different settings
Total CSDN user data 6428600
When $perpage=500 ;Data after import: 5,902,000; data loss 526600 Loss rate: 8%; Data table engine: MyISAM; Index: Yes; Total time taken: 15 minutes
When $perpage=200, total data after import: 6,210,200; data loss :218400; Loss rate: 3.3%; Data table engine: MYISAM; Index: Yes; Total time taken: 30 minutes
When $perpage=200, the total number of data after import: 6,210,200; Data loss: 218400; Loss rate: 3.3 %; Data table engine: INNODB; Index: Yes; Total time taken: 65 minutes
When $perpage=200, the total number of data after import: 6,210,200; Data loss: 218400; Loss rate: 3.3%; Data table engine: MYISAM ; Index: None; Total time taken: 14 minutes (rebuild the index separately after the data is imported)
When $perpage=50, the total number of data after import: 6,371,200; data loss: 57400, loss rate: 0.8%; data table engine : MYISAM; Index: None: Total time taken: 20 minutes
Based on the above situation, the summary is as follows:
1. Importing data first and then adding index is twice as efficient as adding index first and then importing data
2. The efficiency of InnoDB in single-process data insertion is much lower than that of MYISAM
3. When perpage=50, the data loss rate is less than 1%
Copy code The code is as follows:

Because there will be a timeout problem when executing through the browser, and the efficiency is low, it is run through the command line. I encountered a little trouble during the process and was delayed. It took less time
At first I executed the following code:
php.exe E:usrwwwimportcsdndb.php
But it kept reporting an error: call to undefined function mysql_connect
After much trouble, I found that php.ini was not loaded
Correct code For:
php.exe -c E:/usr/local/apache2/php.ini importcsdndb.php

2. Import the user data that needs to be matched to the local
command line Enter msyql (you don’t know how to use Baidu)
Then execute: mysql>source C:/Users/zhudong/Desktop/userdb.sql
3. Compare and filter users
After the comparison program is written, remember to add it in the command Run under the line:
Copy the code The code is as follows:

$link = mysql_connect('localhost', 'root', 'admin', true);
mysql_select_db('csdn',$link);
$handle_username = fopen("E:/records_username.txt","a");
//$handle_email = fopen("E:/records_email.txt","a");
$username_num = $email_num = $uid = 0;
while ($uid<2181106) {
$nextuid=$uid+10000;
$query = mysql_query("SELECT * FROM pw_members WHERE uid>'$uid' AND uid<'$nextuid'");
while ($rt = mysql_fetch_array($query,MYSQL_ASSOC)) {
$username = $rt['username'];
$email = $rt['email'];
$query2 = mysql_query("SELECT * FROM scdn_userdb WHERE username='$username' OR email='$email'");
while ($rt2 = mysql_fetch_array($query2,MYSQL_ASSOC)) {
if ($rt['password'] = md5($rt2['password'])) {
if ($rt2['username'] == $username) {
$username_num++;
fwrite($handle_username,'OWN:'.$rt['uid'].'|'.$rt['username'].'|'.$rt['password'].'|'.$rt['email'].' CSDN:'.$rt2['username'].'|'.$rt2['password'].'|'.$rt2['email']."rn");
echo 'username_num='.$username_num."rn";
continue;
}
/*
if ($rt2['email'] == $email) {
$email_num++;
fwrite($handle_email,'OWN:'.$rt['uid'].'|'.$rt['username'].'|'.$rt['password'].'|'.$rt['email'].' CSDN:'.$rt2['username'].'|'.$rt2['password'].'|'.$rt2['email']."rn");
echo 'email_num='.$email_num."rn";
}
*/
}
}
mysql_free_result($query2);
}
$uid = $nextuid;
}
?>

您看到的以上的代码是非常蹩脚的,因为其效率特别低 ,几百万的数据,要跑10多个小时,怎么能忘记连表查询这么基本的东西呢,以下为修正后的方法
复制代码 代码如下:

$link = mysql_connect('localhost', 'root', 'admin', true);
mysql_select_db('csdn',$link);
$handle_username = fopen("E:/records_username.txt","a");
while($uid<2181106) {//此处的数字为要对比用户库的最大ID
$nextuid= $uid+10000;
$query = mysql_query("SELECT m.uid,m.username,m.password,m.email,u.password as csdn_password,u.email as csdn_email FROM own_members m LEFT JOIN csdn_userdb u USING(username) WHERE m.uid>'$uid' AND m.uid<='$nextuid' AND u.username!=''");
while ($rt = mysql_fetch_array($query,MYSQL_ASSOC)) {
if ($rt['password'] == md5($rt['csdn_password'])) {
$username_num++;
fwrite($handle_username,'OWN:'.$rt['uid'].'|'.$rt['username'].'|'.$rt['password'].'|'.$rt['email'].' CSDN:'.$rt['username'].'|'.$rt['csdn_password'].'|'.$rt['csdn_email']."rn");
echo 'username_num='.$username_num."rn";
}
}
$uid = $nextuid;
echo 'uid='.$uid;
}
?>

总对比时间25分钟,相比较之前10多个小时的执行真是大有提升
总重名用户:34175
占总会员比例:1.7%
1.7%的重名用户还是挺严重的,希望本文对各位站长对比出本站的用户有所帮助

www.bkjia.comtruehttp://www.bkjia.com/PHPjc/324851.htmlTechArticle过程: 1、获取csdn的用户数据库导入本地 试用editplus打开提示内存不足,没找到办法,同事的linux下查看了一下,基本的格式如下: 用户名...
Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Previous article:PHP reads mysql garbled code and shares the principle of solving it with set names XXX_PHP tutorialNext article:PHP reads mysql garbled code and shares the principle of solving it with set names XXX_PHP tutorial

Related articles

See more