h140465

Database Exceptions with Nutch2 + MySQL

 

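      I have recently been using Nutch 2.2.1 with MySQL for crawler research, and kept running into problems such as table creation failures and exceptions caused by garbled (mis-encoded) characters. After several days of digging, I am recording the fix here.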

     First, the MySQL installation: the database encoding must be utf8 (GBK also works). This requires editing the my.ini file:

[client]
port = 3306
default-character-set = utf8

[mysql]
default-character-set = utf8

[mysqld]
port = 3306
character-set-client-handshake = FALSE
character-set-server = utf8
collation-server = utf8_general_ci
init_connect='SET NAMES utf8'
# MySQL installation path
basedir=E:\Program Files\MySql5.6\
# Data storage path
datadir=E:\ProgramData\MySQL\MySQL Server 5.6\data\ #
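A quick way to confirm that an option file really carries the character-set settings listed above is to parse it and compare. The following is only a minimal sketch: the function name `charset_mismatches` and the `SAMPLE` text are made up for illustration, and it uses Python's generic `configparser`, which handles simple option files like this one but is not a full MySQL option-file parser.

```python
import configparser

# Character-set settings copied from the my.ini listing in this post.
EXPECTED = {
    "client": {"default-character-set": "utf8"},
    "mysql": {"default-character-set": "utf8"},
    "mysqld": {
        "character-set-server": "utf8",
        "collation-server": "utf8_general_ci",
    },
}


def charset_mismatches(ini_text):
    """Return (section, option, found) for every setting that differs from EXPECTED.

    `found` is None when the section or option is missing.
    Hypothetical helper, not part of MySQL or Nutch.
    """
    parser = configparser.ConfigParser(
        interpolation=None,   # values may legitimately contain '%'
        allow_no_value=True,  # option files may contain bare flags
        strict=False,         # tolerate repeated options
    )
    parser.read_string(ini_text)
    problems = []
    for section, options in EXPECTED.items():
        for option, wanted in options.items():
            found = parser.get(section, option, fallback=None)
            if found is None or found.strip().lower() != wanted:
                problems.append((section, option, found))
    return problems


# Illustrative input only; in practice, read the text of the real my.ini.
SAMPLE = """
[client]
port = 3306
default-character-set = utf8

[mysql]
default-character-set = utf8

[mysqld]
port = 3306
character-set-server = utf8
collation-server = utf8_general_ci
"""

if __name__ == "__main__":
    for section, option, found in charset_mismatches(SAMPLE):
        print(f"[{section}] {option}: found {found!r}")
```

An empty result means all four settings match. Note that this only inspects the file; the running server still has to be restarted to pick up the changes.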

 Next, modify Nutch's table mapping file, gora-sql-mapping.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at
  
  http://www.apache.org/licenses/LICENSE-2.0
  
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<gora-orm>

<class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String" table="webpage">
  <primarykey column="id" length="255"/>
    <field name="baseUrl" column="baseUrl" length="512"/>    
    <field name="status" column="status"/>
    <field name="prevFetchTime" column="prevFetchTime"/>
    <field name="fetchTime" column="fetchTime"/>
    <field name="fetchInterval" column="fetchInterval"/>
    <field name="retriesSinceFetch" column="retriesSinceFetch"/>
    <field name="reprUrl" column="reprUrl" length="512"/>
    <field name="content" column="content" length="21044"  />
    <field name="contentType" column="typ" length="32"/>    
    <field name="protocolStatus" column="protocolStatus"/>
    <field name="modifiedTime" column="modifiedTime"/>
    <field name="prevModifiedTime" column="prevModifiedTime"/>
    <field name="batchId" column="batchId" length="32"/>

    <!-- parse fields                                       -->
    <field name="title" column="title" length="512"/>
    <field name="text" column="text" jdbc-type="TEXT"/>
    <field name="parseStatus" column="parseStatus"/>
    <field name="signature" column="signature"/>
    <field name="prevSignature" column="prevSignature"/>

    <!-- score fields                                       -->
    <field name="score" column="score"/>
    <field name="headers" column="headers"/>
    <field name="inlinks" column="inlinks"/>
    <field name="outlinks" column="outlinks"/>
    <field name="metadata" column="metadata"/>
    <field name="markers" column="markers"/>
</class>

<class name="org.apache.nutch.storage.Host" keyClass="java.lang.String"
table="host">
  <primarykey column="id" length="512"/>
  <field name="metadata" column="metadata"/>
  <field name="inlinks" column="inlinks"/>
  <field name="outlinks" column="outlinks"/>
</class>

</gora-orm>
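One likely reason the primary key `length` matters for table creation is the index key size limit. As a rough check, the sketch below reads the `primarykey` lengths from a mapping and estimates their worst-case byte size. It rests on assumptions the post does not state: that the key column is created as a VARCHAR of the given length, that MySQL's utf8 needs at most 3 bytes per character, and that InnoDB runs with its MySQL 5.6 default limit of 767 bytes per index key. The function name and `SAMPLE` string are illustrative only.

```python
import xml.etree.ElementTree as ET

# Assumptions, not taken from the post:
UTF8_MAX_BYTES_PER_CHAR = 3   # MySQL utf8 stores at most 3 bytes per character
INNODB_KEY_LIMIT_BYTES = 767  # default InnoDB index key limit in MySQL 5.6


def primary_key_sizes(mapping_xml):
    """Return {table: (length, max_bytes, fits)} for each <class> with a sized key.

    Hypothetical helper for illustration; not part of Gora or Nutch.
    """
    root = ET.fromstring(mapping_xml)
    report = {}
    for cls in root.findall("class"):
        key = cls.find("primarykey")
        if key is None or key.get("length") is None:
            continue  # nothing to estimate without an explicit length
        length = int(key.get("length"))
        max_bytes = length * UTF8_MAX_BYTES_PER_CHAR
        fits = max_bytes <= INNODB_KEY_LIMIT_BYTES
        report[cls.get("table")] = (length, max_bytes, fits)
    return report


# Only the primary keys from the mapping above, trimmed for brevity.
SAMPLE = """<gora-orm>
  <class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String" table="webpage">
    <primarykey column="id" length="255"/>
  </class>
  <class name="org.apache.nutch.storage.Host" keyClass="java.lang.String" table="host">
    <primarykey column="id" length="512"/>
  </class>
</gora-orm>"""

if __name__ == "__main__":
    for table, (length, max_bytes, fits) in primary_key_sizes(SAMPLE).items():
        status = "within limit" if fits else "exceeds limit"
        print(f"{table}: length={length}, up to {max_bytes} bytes, {status}")
```

Under these assumptions the `webpage` key (255 characters, up to 765 bytes) fits, while the `host` key (512 characters, up to 1536 bytes) would not. Whether that matters in practice depends on the server's settings and on whether the `host` table is ever created during a crawl.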

 With these changes in place, Nutch crawls pages normally, with no table creation failures and no exceptions caused by garbled characters.
