Today, while crawling a Chinese website with Nutch 1.7, I found that the fetched data came out garbled, and none of the material I found online solved it. So I went into the source code: Nutch parses pages with the HtmlParser class, which contains the code that sniffs a page's encoding:
// NUTCH-1006 Meta equiv with single quotes not accepted
private static Pattern metaPattern =
    Pattern.compile("<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>",
        Pattern.CASE_INSENSITIVE);
private static Pattern charsetPattern =
    Pattern.compile("charset=\\s*([a-z][_\\-0-9a-z]*)",
        Pattern.CASE_INSENSITIVE);
private static String sniffCharacterEncoding(byte[] content) {
  int length = content.length < CHUNK_SIZE ? content.length : CHUNK_SIZE;

  // We don't care about non-ASCII parts so that it's sufficient
  // to just inflate each byte to a 16-bit value by padding.
  // For instance, the sequence {0x41, 0x82, 0xb7} will be turned into
  // {U+0041, U+0082, U+00B7}.
  String str = "";
  try {
    str = new String(content, 0, length, Charset.forName("ASCII").toString());
  } catch (UnsupportedEncodingException e) {
    // code should never come here, but just in case...
    return null;
  }

  Matcher metaMatcher = metaPattern.matcher(str);
  String encoding = null;
  if (metaMatcher.find()) {
    Matcher charsetMatcher = charsetPattern.matcher(metaMatcher.group(1));
    if (charsetMatcher.find())
      encoding = new String(charsetMatcher.group(1));
  }
  return encoding;
}
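To see what the sniffing yields in isolation, here is a minimal standalone sketch. The two patterns are copied verbatim from HtmlParser above; the sample meta tag is made up for illustration:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SniffDemo {
    // Same patterns as HtmlParser above.
    private static final Pattern META_PATTERN = Pattern.compile(
        "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>",
        Pattern.CASE_INSENSITIVE);
    private static final Pattern CHARSET_PATTERN = Pattern.compile(
        "charset=\\s*([a-z][_\\-0-9a-z]*)", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        // Hypothetical page head declaring GBK, like the site that showed the problem.
        String head = "<meta http-equiv='Content-Type' content='text/html; charset=GBK'>";
        Matcher meta = META_PATTERN.matcher(head);
        if (meta.find()) {
            Matcher charset = CHARSET_PATTERN.matcher(meta.group(1));
            if (charset.find()) {
                System.out.println(charset.group(1)); // prints: GBK
            }
        }
    }
}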
After the charset is sniffed from the page, detector.guessEncoding is called to pick the best-matching encoding:
EncodingDetector detector = new EncodingDetector(conf);
detector.autoDetectClues(content, true);
detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
String encoding = detector.guessEncoding(content, defaultCharEncoding);
The guessEncoding method of the EncodingDetector class walks the list of clues and returns the first whose confidence meets minConfidence (as the code shows, a negative minConfidence disables the confidence check entirely), otherwise falling back to the best unscored clue:
public String guessEncoding(Content content, String defaultValue) {
  /*
   * This algorithm could be replaced by something more sophisticated;
   * ideally we would gather a bunch of data on where various clues
   * (autodetect, HTTP headers, HTML meta tags, etc.) disagree, tag each with
   * the correct answer, and use machine learning/some statistical method
   * to generate a better heuristic.
   */
  String base = content.getBaseUrl();

  if (LOG.isTraceEnabled()) {
    findDisagreements(base, clues);
  }

  /*
   * Go down the list of encoding "clues". Use a clue if:
   * 1. Has a confidence value which meets our confidence threshold, OR
   * 2. Doesn't meet the threshold, but is the best try,
   *    since nothing else is available.
   */
  EncodingClue defaultClue = new EncodingClue(defaultValue, "default");
  EncodingClue bestClue = defaultClue;

  for (EncodingClue clue : clues) {
    if (LOG.isTraceEnabled()) {
      LOG.trace(base + ": charset " + clue);
    }
    String charset = clue.value;
    if (minConfidence >= 0 && clue.confidence >= minConfidence) {
      if (LOG.isTraceEnabled()) {
        LOG.trace(base + ": Choosing encoding: " + charset
            + " with confidence " + clue.confidence);
      }
      return resolveEncodingAlias(charset).toLowerCase();
    } else if (clue.confidence == NO_THRESHOLD && bestClue == defaultClue) {
      bestClue = clue;
    }
  }

  if (LOG.isTraceEnabled()) {
    LOG.trace(base + ": Choosing encoding: " + bestClue);
  }
  return bestClue.value.toLowerCase();
}
Stepping through with the debugger showed that the page's declared charset is GBK, but the encoding ultimately returned is GB18030. The cause is a default mapping in EncodingDetector that makes GBK content get parsed as GB18030:
static {
  DETECTABLES.add("text/html");
  DETECTABLES.add("text/plain");
  DETECTABLES.add("text/richtext");
  DETECTABLES.add("text/rtf");
  DETECTABLES.add("text/sgml");
  DETECTABLES.add("text/tab-separated-values");
  DETECTABLES.add("text/xml");
  DETECTABLES.add("application/rss+xml");
  DETECTABLES.add("application/xhtml+xml");
  /*
   * the following map is not an alias mapping table, but
   * maps character encodings which are often used in mislabelled
   * documents to their correct encodings. For instance,
   * there are a lot of documents labelled 'ISO-8859-1' which contain
   * characters not covered by ISO-8859-1 but covered by windows-1252.
   * Because windows-1252 is a superset of ISO-8859-1 (sharing code points
   * for the common part), it's better to treat ISO-8859-1 as
   * synonymous with windows-1252 than to reject, as invalid, documents
   * labelled as ISO-8859-1 that have characters outside ISO-8859-1.
   */
  ALIASES.put("ISO-8859-1", "windows-1252");
  ALIASES.put("EUC-KR", "x-windows-949");
  ALIASES.put("x-EUC-CN", "GB18030");
  ALIASES.put("GBK", "GB18030");
  //ALIASES.put("Big5", "Big5HKSCS");
  //ALIASES.put("TIS620", "Cp874");
  //ALIASES.put("ISO-8859-11", "Cp874");
}
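The effect of that map is easy to reproduce outside Nutch. Below is a simplified sketch of the alias lookup that guessEncoding performs via resolveEncodingAlias (the real method also canonicalizes the name through java.nio.charset.Charset first); the map entries mirror the static block above:

import java.util.HashMap;
import java.util.Map;

public class AliasLookupDemo {
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        // The two entries relevant to Chinese pages, copied from EncodingDetector.
        ALIASES.put("x-EUC-CN", "GB18030");
        ALIASES.put("GBK", "GB18030");
    }

    // Simplified stand-in for EncodingDetector.resolveEncodingAlias().
    static String resolveEncodingAlias(String charset) {
        return ALIASES.containsKey(charset) ? ALIASES.get(charset) : charset;
    }

    public static void main(String[] args) {
        // A page that declares charset=GBK comes out as gb18030:
        System.out.println(resolveEncodingAlias("GBK").toLowerCase()); // gb18030
    }
}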
The fix is to change the GBK entry in that static block so that GBK resolves to itself; everything else stays as it was:

static {
  DETECTABLES.add("text/html");
  // ... (remaining DETECTABLES entries unchanged) ...
  ALIASES.put("ISO-8859-1", "windows-1252");
  ALIASES.put("EUC-KR", "x-windows-949");
  ALIASES.put("x-EUC-CN", "GB18030");
  ALIASES.put("GBK", "GBK");  // was: ALIASES.put("GBK", "GB18030");
  //ALIASES.put("Big5", "Big5HKSCS");
  //ALIASES.put("TIS620", "Cp874");
  //ALIASES.put("ISO-8859-11", "Cp874");
}
Rebuilding Nutch with this change resolved the garbled text.
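A quick way to sanity-check the outcome, assuming the fetched bytes really are GBK-encoded (the sample bytes here are just "中文" encoded as GBK for illustration):

import java.nio.charset.Charset;

public class GbkDecodeCheck {
    public static void main(String[] args) {
        // Hypothetical sample: the bytes a GBK page would deliver for "中文".
        byte[] fetched = "中文".getBytes(Charset.forName("GBK"));
        // With the fix, guessEncoding() returns "gbk", so the parser decodes with GBK:
        System.out.println(new String(fetched, Charset.forName("gbk"))); // 中文
    }
}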