


Summary of garbled-data problems when crawling with Node.js
1. Non-UTF-8 page processing.
1. Background
windows-1251 encoding
For example, the Russian website https://vk.com/cciinniikk
Embarrassingly, I only just found out that this encoding exists.
This article mainly covers converting between Windows-1251 (cp1251) and UTF-8. Other encodings such as GBK are not considered here.
2. Solution
1. Use native JavaScript encoding conversion
I have not found a way to do this yet.
Converting from UTF-8 to windows-1251 is possible, though: http://stackoverflow.com/questions/2696481/encoding-conversation-utf-8-to-1251-in-javascript
// Keys are Unicode code points, values are windows-1251 byte values.
var DMap = {
    0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15,
    16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25, 26: 26, 27: 27, 28: 28, 29: 29, 30: 30, 31: 31,
    32: 32, 33: 33, 34: 34, 35: 35, 36: 36, 37: 37, 38: 38, 39: 39, 40: 40, 41: 41, 42: 42, 43: 43, 44: 44, 45: 45, 46: 46, 47: 47,
    48: 48, 49: 49, 50: 50, 51: 51, 52: 52, 53: 53, 54: 54, 55: 55, 56: 56, 57: 57, 58: 58, 59: 59, 60: 60, 61: 61, 62: 62, 63: 63,
    64: 64, 65: 65, 66: 66, 67: 67, 68: 68, 69: 69, 70: 70, 71: 71, 72: 72, 73: 73, 74: 74, 75: 75, 76: 76, 77: 77, 78: 78, 79: 79,
    80: 80, 81: 81, 82: 82, 83: 83, 84: 84, 85: 85, 86: 86, 87: 87, 88: 88, 89: 89, 90: 90, 91: 91, 92: 92, 93: 93, 94: 94, 95: 95,
    96: 96, 97: 97, 98: 98, 99: 99, 100: 100, 101: 101, 102: 102, 103: 103, 104: 104, 105: 105, 106: 106, 107: 107, 108: 108, 109: 109, 110: 110, 111: 111,
    112: 112, 113: 113, 114: 114, 115: 115, 116: 116, 117: 117, 118: 118, 119: 119, 120: 120, 121: 121, 122: 122, 123: 123, 124: 124, 125: 125, 126: 126, 127: 127,
    1027: 129, 8225: 135, 1046: 198, 8222: 132, 1047: 199, 1168: 165, 1048: 200, 1113: 154, 1049: 201, 1045: 197, 1050: 202, 1028: 170,
    160: 160, 1040: 192, 1051: 203, 164: 164, 166: 166, 167: 167, 169: 169, 171: 171, 172: 172, 173: 173, 174: 174, 1053: 205,
    176: 176, 177: 177, 1114: 156, 181: 181, 182: 182, 183: 183, 8221: 148, 187: 187, 1029: 189, 1056: 208, 1057: 209, 1058: 210,
    8364: 136, 1112: 188, 1115: 158, 1059: 211, 1060: 212, 1030: 178, 1061: 213, 1062: 214, 1063: 215, 1116: 157, 1064: 216, 1065: 217,
    1031: 175, 1066: 218, 1067: 219, 1068: 220, 1069: 221, 1070: 222, 1032: 163, 8226: 149, 1071: 223, 1072: 224, 8482: 153, 1073: 225,
    8240: 137, 1118: 162, 1074: 226, 1110: 179, 8230: 133, 1075: 227, 1033: 138, 1076: 228, 1077: 229, 8211: 150, 1078: 230, 1119: 159,
    1079: 231, 1042: 194, 1080: 232, 1034: 140, 1025: 168, 1081: 233, 1082: 234, 8212: 151, 1083: 235, 1169: 180, 1084: 236, 1052: 204,
    1085: 237, 1035: 142, 1086: 238, 1087: 239, 1088: 240, 1089: 241, 1090: 242, 1036: 141, 1041: 193, 1091: 243, 1092: 244, 8224: 134,
    1093: 245, 8470: 185, 1094: 246, 1054: 206, 1095: 247, 1096: 248, 8249: 139, 1097: 249, 1098: 250, 1044: 196, 1099: 251, 1111: 191,
    1055: 207, 1100: 252, 1038: 161, 8220: 147, 1101: 253, 8250: 155, 1102: 254, 8216: 145, 1103: 255, 1043: 195, 1105: 184, 1039: 143,
    1026: 128, 1106: 144, 8218: 130, 1107: 131, 8217: 146, 1108: 186, 1109: 190
};

function UnicodeToWin1251(s) {
    var L = [];
    for (var i = 0; i < s.length; i++) {
        var ord = s.charCodeAt(i);
        if (!(ord in DMap))
            throw "Character " + s.charAt(i) + " isn't supported by win1251!";
        L.push(String.fromCharCode(DMap[ord]));
    }
    return L.join('');
}
This is a good idea. What DMap stores is the mapping between Unicode code points and windows-1251 byte values.
So I planned to do the same thing in the opposite direction.
In trying that, however, I found that charCodeAt only returns Unicode (UTF-16) code units. How can the byte values of text in another encoding be extracted? Since I am using Node.js, I decided to use an existing module instead.
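The reverse direction does not need charCodeAt at all if the page is kept as raw bytes (a Buffer) instead of a string: each byte value is then the windows-1251 code and can be looked up directly. The following is only a minimal sketch, not a complete decoder. The name win1251ToUnicode is hypothetical, and it handles just ASCII, the contiguous letters from А to я (bytes 0xC0 to 0xFF) and Ё/ё; a full version would invert every entry of DMap.

```javascript
// Sketch: decode windows-1251 bytes into a JavaScript string by table lookup.
// Assumption: only ASCII, А..я (0xC0-0xFF) and Ё/ё are covered; every other
// byte is replaced with U+FFFD instead of being looked up in a full table.
function win1251ToUnicode(buf) {
  var out = [];
  for (var i = 0; i < buf.length; i++) {
    var b = buf[i];
    var code;
    if (b < 0x80) {
      code = b;                 // ASCII is identical in both encodings
    } else if (b >= 0xC0) {
      code = b - 0xC0 + 0x0410; // А (0xC0) .. я (0xFF) are contiguous in Unicode
    } else if (b === 0xA8) {
      code = 0x0401;            // Ё
    } else if (b === 0xB8) {
      code = 0x0451;            // ё
    } else {
      code = 0xFFFD;            // not handled by this sketch
    }
    out.push(String.fromCharCode(code));
  }
  return out.join('');
}

// The windows-1251 bytes of the word "ценности"
var sample = Buffer.from([0xF6, 0xE5, 0xED, 0xED, 0xEE, 0xF1, 0xF2, 0xE8]);
console.log(win1251ToUnicode(sample));
```

This only works when the bytes reach the function untouched, which is why the modules below are fed a Buffer rather than an already decoded string.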
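2. Use the Node.js module iconv-lite. Installation and usage instructions are at https://www.npmjs.com/package/iconv-lite
Following its documentation, usage looks roughly like this:
var iconv = require('iconv-lite');
var Buffer = require('buffer').Buffer;

// Convert windows-1251 encoded data to a JavaScript string.
// str1 should be the data returned by http.get, request, or a similar call.
// The request must be made with the right options, otherwise the result is wrong:
// besides the basic options, remember to pass encoding: 'binary'.
// For example, the windows-1251 bytes of 'ценности ни в ' as a binary string:
var str1 = '\xF6\xE5\xED\xED\xEE\xF1\xF2\xE8 \xED\xE8 \xE2 ';
// Turn the received data into a Buffer, again using the 'binary' encoding.
// 'binary' carries every byte through unchanged, whatever the real encoding is.
var buf = Buffer.from(str1, 'binary');
var str2 = iconv.decode(buf, 'win1251');
// str2 is now decoded. The result is a Unicode JavaScript string,
// which is presumably what iconv-lite intends.
console.log(str2);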
3. Use the Node.js module iconv. Installation and usage instructions are at https://github.com/bnoordhuis/node-iconv
(In practice, the real work is installing node-gyp. I had not read the official instructions carefully beforehand.)
With a naive setup, the output is still garbled. It looks like this: пїЅпїЅпїЅпїЅпїЅ пїЅпїЅпїЅпїЅпїЅпїЅ пїЅпїЅпїЅпїЅпїЅпїЅ пїЅпїЅ
http://stackoverflow.com/questions/8693400/nodejs-convertinf-from-windows-1251-to-utf-8
The solution is to read the response with the binary encoding (the default encoding is utf-8):
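var request = require('request');
var iconv = require('iconv');

request({
    uri: website_url,
    method: 'GET',
    encoding: 'binary'
}, function (error, response, body) {
    // body is a binary string; turn it back into the raw bytes
    body = Buffer.from(body, 'binary');
    var conv = new iconv.Iconv('WINDOWS-1251', 'utf8');
    body = conv.convert(body).toString();
});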
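Why the binary encoding matters can be checked with Node's built-in Buffer alone. The sketch below is an illustration under stated assumptions, not part of either library: it uses a hard-coded byte sequence (the windows-1251 bytes of "ценности") in place of a real response. Decoding those bytes as utf-8 replaces every invalid byte with U+FFFD, and the UTF-8 bytes of U+FFFD (EF BF BD) read back as windows-1251 are exactly the пїЅ sequence shown above.

```javascript
// windows-1251 bytes of the word "ценности", standing in for a response body
var raw = Buffer.from([0xF6, 0xE5, 0xED, 0xED, 0xEE, 0xF1, 0xF2, 0xE8]);

// Decoding as utf-8 is lossy: invalid bytes turn into U+FFFD,
// so the original bytes cannot be recovered afterwards.
var asUtf8 = raw.toString('utf8');
var lossy = Buffer.from(asUtf8, 'utf8');

// 'binary' (an alias of latin1) maps each byte to one character and back,
// so the original bytes survive the round trip.
var asBinary = raw.toString('binary');
var roundTrip = Buffer.from(asBinary, 'binary');

console.log(lossy.equals(raw));     // false
console.log(roundTrip.equals(raw)); // true
```

Once the bytes are intact, either iconv or iconv-lite can decode them as windows-1251.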
In addition, iconv has some environment dependencies because it is built with node-gyp. See the official instructions: https://github.com/TooTallNate/node-gyp
In short:
First, it needs a matching version of Python (such as 2.7).
Second, it needs a compiler toolchain (most errors occur on Windows).
A typical failure is a build error:
Unless a specific, newer version is configured, the VS2005 build tools are used by default (so the fix suggested in the error message generally points to VS2005 and Framework SDK 2.0).
Problem solution:
1. Install Visual Studio 2010.
2. Specify the VS build tools version (for VS2012, use 2012):
npm config set msvs_version 2010 --global
(Sometimes the version is detected automatically, so this command is not always needed.)
3. If it still reports that the Framework SDK cannot be found, add its installation path to the system PATH environment variable.
(VS2010 corresponds to SDK 4.0; similarly VS2008 to SDK 3.5, and VS2012 to SDK 4.5?)
Also remember that only the first matching entry in the environment variable is used!
For example, if you previously added the SDK 2.0 path to the system environment variable and now add the SDK 4.0 path after it, only the first one takes effect.
So either:
delete the earlier entry, or
put the path you are adding in front of it.
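The "first one wins" behaviour can be illustrated with a small lookup function. This is a sketch assuming the usual first-match search through a PATH-style list. The function name firstDirContaining, the directories and the file name tool.exe are all hypothetical, and a fake file list stands in for the real file system.

```javascript
var path = require('path');

// Return the first directory in a PATH-style value that contains fileName.
// `exists` is a caller-supplied predicate; for real lookups it could be
// fs.existsSync. Windows separators are used to match the setup above.
function firstDirContaining(pathValue, fileName, exists) {
  var dirs = pathValue.split(path.win32.delimiter).filter(Boolean);
  for (var i = 0; i < dirs.length; i++) {
    if (exists(path.win32.join(dirs[i], fileName))) {
      return dirs[i];
    }
  }
  return null;
}

// Hypothetical layout: both SDK directories contain the same tool
var fakeFiles = ['C:\\sdk2.0\\tool.exe', 'C:\\sdk4.0\\tool.exe'];
function fakeExists(p) {
  return fakeFiles.indexOf(p) !== -1;
}

// The directory listed first is the one that gets used
console.log(firstDirContaining('C:\\sdk2.0;C:\\sdk4.0', 'tool.exe', fakeExists));
console.log(firstDirContaining('C:\\sdk4.0;C:\\sdk2.0', 'tool.exe', fakeExists));
```

Swapping the order of the two entries changes the result, which is the effect of putting the new path in front.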
2. Gzip page processing
Sometimes a page displays normally in the browser, but the body returned by a simulated request is garbled. Check the response headers of the browser's request: if they include Content-Encoding: gzip, the page was most likely compressed with gzip, and you need to add the following option to the request:
gzip: true
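If the HTTP client in use has no such option, the body can be decompressed by hand with Node's built-in zlib module. The following is a minimal sketch under a few assumptions: decompressBody is a hypothetical helper, the headers object uses lower-case keys (as Node's http module provides), the body is a Buffer, and only gzip is handled.

```javascript
var zlib = require('zlib');

// Return the response body as uncompressed bytes.
// Assumption: `headers` has lower-case keys and `body` is a Buffer.
function decompressBody(headers, body) {
  var encoding = String(headers['content-encoding'] || '').toLowerCase();
  if (encoding === 'gzip') {
    return zlib.gunzipSync(body);
  }
  return body; // not gzip: leave the bytes untouched
}

// Usage with a locally compressed payload standing in for a real response
var original = Buffer.from('hello, gzip');
var decoded = decompressBody({ 'content-encoding': 'gzip' }, zlib.gzipSync(original));
console.log(decoded.toString()); // hello, gzip
```

Decompression has to happen before any charset conversion, because the windows-1251 bytes are inside the compressed stream.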
The above is the entire content of this article, I hope you all like it.
