Home >Web Front-end >JS Tutorial >Summary of the problem of garbled data captured by nodejs crawler_node.js
1. Non-UTF-8 page processing.
1. Background
windows-1251 encoding
For example, Russian website: https://vk.com/cciinniikk
Shameful to find this encoding
What we mainly talk about here is the issue of Windows-1251 (cp1251) encoding and utf-8 encoding. Others such as gbk will not be taken into consideration~
2. Solution
1.
Use js native encoding conversion
But I haven’t found a way yet..
If it’s utf-8 to window-1251 it’s okayhttp://stackoverflow.com/questions/2696481/encoding-conversation-utf-8-to-1251-in-javascript
var DMap = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25, 26: 26, 27: 27, 28: 28, 29: 29, 30: 30, 31: 31, 32: 32, 33: 33, 34: 34, 35: 35, 36: 36, 37: 37, 38: 38, 39: 39, 40: 40, 41: 41, 42: 42, 43: 43, 44: 44, 45: 45, 46: 46, 47: 47, 48: 48, 49: 49, 50: 50, 51: 51, 52: 52, 53: 53, 54: 54, 55: 55, 56: 56, 57: 57, 58: 58, 59: 59, 60: 60, 61: 61, 62: 62, 63: 63, 64: 64, 65: 65, 66: 66, 67: 67, 68: 68, 69: 69, 70: 70, 71: 71, 72: 72, 73: 73, 74: 74, 75: 75, 76: 76, 77: 77, 78: 78, 79: 79, 80: 80, 81: 81, 82: 82, 83: 83, 84: 84, 85: 85, 86: 86, 87: 87, 88: 88, 89: 89, 90: 90, 91: 91, 92: 92, 93: 93, 94: 94, 95: 95, 96: 96, 97: 97, 98: 98, 99: 99, 100: 100, 101: 101, 102: 102, 103: 103, 104: 104, 105: 105, 106: 106, 107: 107, 108: 108, 109: 109, 110: 110, 111: 111, 112: 112, 113: 113, 114: 114, 115: 115, 116: 116, 117: 117, 118: 118, 119: 119, 120: 120, 121: 121, 122: 122, 123: 123, 124: 124, 125: 125, 126: 126, 127: 127, 1027: 129, 8225: 135, 1046: 198, 8222: 132, 1047: 199, 1168: 165, 1048: 200, 1113: 154, 1049: 201, 1045: 197, 1050: 202, 1028: 170, 160: 160, 1040: 192, 1051: 203, 164: 164, 166: 166, 167: 167, 169: 169, 171: 171, 172: 172, 173: 173, 174: 174, 1053: 205, 176: 176, 177: 177, 1114: 156, 181: 181, 182: 182, 183: 183, 8221: 148, 187: 187, 1029: 189, 1056: 208, 1057: 209, 1058: 210, 8364: 136, 1112: 188, 1115: 158, 1059: 211, 1060: 212, 1030: 178, 1061: 213, 1062: 214, 1063: 215, 1116: 157, 1064: 216, 1065: 217, 1031: 175, 1066: 218, 1067: 219, 1068: 220, 1069: 221, 1070: 222, 1032: 163, 8226: 149, 1071: 223, 1072: 224, 8482: 153, 1073: 225, 8240: 137, 1118: 162, 1074: 226, 1110: 179, 8230: 133, 1075: 227, 1033: 138, 1076: 228, 1077: 229, 8211: 150, 1078: 230, 1119: 159, 1079: 231, 1042: 194, 1080: 232, 1034: 140, 1025: 168, 1081: 233, 1082: 234, 8212: 151, 1083: 235, 1169: 180, 1084: 236, 1052: 204, 1085: 237, 1035: 142, 1086: 238, 1087: 239, 1088: 240, 1089: 241, 1090: 242, 1036: 141, 1041: 193, 1091: 243, 1092: 244, 8224: 134, 1093: 245, 8470: 185, 1094: 246, 1054: 206, 1095: 247, 1096: 248, 8249: 139, 1097: 249, 1098: 250, 1044: 196, 1099: 251, 1111: 191, 1055: 207, 1100: 252, 1038: 161, 8220: 147, 1101: 253, 8250: 155, 1102: 254, 8216: 145, 1103: 255, 1043: 195, 1105: 184, 1039: 143, 1026: 128, 1106: 144, 8218: 130, 1107: 131, 8217: 146, 1108: 186, 1109: 190} function UnicodeToWin1251(s) { var L = [] for (var i=0; i<s.length; i++) { var ord = s.charCodeAt(i) if (!(ord in DMap)) throw "Character "+s.charAt(i)+" isn't supported by win1251!" L.push(String.fromCharCode(DMap[ord])) } return L.join('') }
Well, this is a good idea. What Dmap stores is actually the mapping relationship between window-1251 encoding and unicode
So I just planned to do it the other way around
But on the contrary, I discovered that the charCodeAt method is only valid for unicode. How to dig out the code segments of other encodings? Because I am using nodejs, I consider using the corresponding module
2.
For instructions on installing and using the nodejs module iconv-lite, see https://www.npmjs.com/package/iconv-lite
According to the usage method, it should be used in a similar way
var iconv = require('iconv-lite'); var Buffer = require('buffer').Buffer; // Convert from an encoded windows-1251 to utf-8 //这个str1应该是http.get 或request等请求返回的数据 //请求的时候要带参数,不然就会出错 //除了基本的参数之外 要注意记得使用 encoding: 'binary'这个参数 //比如 str1 = 'ценности ни в '; //把获取到的数据 转换成Buffer,记得格式使用 binary //binary在各编码直接穿梭无阻~ var buf = new Buffer(str1,'binary'); var str2 = iconv.decode(buf, 'win1251'); //str2就被转换出来了,默认是转成 Unicode格式,估计这也是iconv-lite的初衷吧 console.log(str2);
3.
Instructions for installing and using the nodejs module iconv are available at https://github.com/bnoordhuis/node-iconv
(In fact, the essence is to install node-gyp. I didn’t read the official instructions carefully before)
Generally, after simple use, the code is still garbled. The format is: пїЅпїЅпїЅпїЅпїЅ пїЅпїЅпїЅпїЅпїЅпїЅ пїЅпїЅпїЅпїЅпїЅпїЅ пїЅпїЅ
http://stackoverflow.com/questions/8693400/nodejs-convertinf-from-windows-1251-to-utf-8
The solution is to convert the read data into binary encoding: binary (the default encoding is utf-8)
request({ uri: website_url, method: 'GET', encoding: 'binary' }, function (error, response, body) { body = new Buffer(body, 'binary'); conv = new iconv.Iconv('WINDOWS-1251', 'utf8'); body = conv.convert(body).toString(); } });
--> In addition, the use of iconv requires some environmental dependencies. See the official instructions: https://github.com/TooTallNate/node-gyp
So:
Firstly, you need the support of python corresponding version (such as 2.7);
Second, it requires the support of compilation tools (most errors occur under Windows)
Error similar to this
Node, if there is no specific version or higher, the vs2005 compilation tool is used by default (so the solution to the error message is generally to follow vs2005 and framwork sdk2.0)
Problem solution:
1. Install visual studio 2010
2. Specify the vs compilation tool version (if it is vs2012, it is 2012)
(Sometimes it will be automatically specified, so this command is not necessarily needed npm config set msvs_version 2010 --global)
3. If it still prompts that the framwork sdk cannot be found, you can add its installation path to the system environment variable path
(2010 corresponds to sdk4.0 version, similar to 2008 sdj3.5 2012 sdk4.5?)
Another thing to remember is that the environment variable will only read the first one!
For example, if you have set the path of SDK2.0 to the system environment variable before, then when you add and set the path of SDK4.0 now, only the first one will work
So:
Or delete the previous one
Or put the path you want to add in front of it
2. Gzip page processing
Sometimes we find that it is normal for the browser to access the page, but the simulated request is garbled when it comes back. You can check the Response information requested by the browser. If there is Content-Encoding: gzip, it is most likely because the page is compressed by gzip. , then you need to add the following parameters when requesting
gzip:true
The above is the entire content of this article, I hope you all like it.