
How to configure the Solr DataImportHandler to parse a Wikipedia XML document?

Question:

So this is what I have done so far.

I have added a request handler in solrconfig.xml as follows:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">

<lst name="defaults">

<str name="config">wiki-data-config.xml</str>

</lst>

</requestHandler>

In the same configuration directory I created a file, wiki-data-config.xml, which contains the following:

<dataConfig>

<dataSource type="FileDataSource" encoding="UTF-8" />

<document>

<entity name="page"

pk="id"

processor="XPathEntityProcessor"

stream="true"

forEach="/mediawiki/page/"

url="/home/tanny/Downloads/Data/Wiki/enwiki-20150702-stub-articles8.xml"

flatten="true" >

<field column="id" xpath="/mediawiki/page/id" />

<field column="title" xpath="/mediawiki/page/title" />

<field column="revision" xpath="/mediawiki/page/revision/id" />

<field column="user" xpath="/mediawiki/page/revision/contributor/username" />

<field column="userId" xpath="/mediawiki/page/revision/contributor/id" />

<field column="text" xpath="/mediawiki/page/revision/text" />

<field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />

<field column="$skipDoc" regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>

</entity>

</document>

</dataConfig>
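
A note on the entity above: regex, replaceWith and sourceColName are RegexTransformer attributes, and dateTimeFormat is a DateFormatTransformer attribute. DIH only runs the transformers named in the entity's transformer attribute, so without it the $skipDoc and timestamp rules are probably not being applied. A sketch of the same entity with that attribute added (an assumption about the intent, not a tested config; the other fields are omitted for brevity):

```xml
<entity name="page"
        pk="id"
        processor="XPathEntityProcessor"
        stream="true"
        forEach="/mediawiki/page/"
        url="/home/tanny/Downloads/Data/Wiki/enwiki-20150702-stub-articles8.xml"
        flatten="true"
        transformer="RegexTransformer,DateFormatTransformer">
  <!-- Remaining <field> elements stay exactly as in the config above. -->
  <!-- Handled by DateFormatTransformer: -->
  <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
  <!-- Handled by RegexTransformer: -->
  <field column="$skipDoc" regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
</entity>
```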

And my schema.xml contains the following:

<!-- Tanny edit starts -->

<field name="id" type="int" indexed="true" stored="true" required="true"/>

<field name="title" type="string" indexed="true" stored="false"/>

<field name="revision" type="int" indexed="true" stored="true"/>

<field name="user" type="string" indexed="true" stored="true"/>

<field name="userId" type="int" indexed="true" stored="true"/>

<field name="text" type="text_en" indexed="true" stored="false"/>

<field name="timestamp" type="date" indexed="true" stored="true"/>

<field name="titleText" type="text_en" indexed="true" stored="true"/>

<uniqueKey>id</uniqueKey>

<copyField source="title" dest="titleText"/>

<!-- Tanny edit ends -->

After restarting Solr, I tried to post the Wikimedia XML data using the ./bin/post script as follows:

~/binaries/solr-5.2.1$ ./bin/post -c core-base-wiki /home/tanny/Downloads/Data/Wiki/enwiki-20150702-stub-articles8.xml

It prints the following to the console:

/usr/lib/jvm/java-7-oracle-cloudera//bin/java -classpath /home/tanny/binaries/solr-5.2.1/dist/solr-core-5.2.1.jar -Dauto=yes -Dc=core-base-wiki -Ddata=files org.apache.solr.util.SimplePostTool /home/tanny/Downloads/Data/Wiki/enwiki-20150702-stub-articles8.xml

SimplePostTool version 5.0.0

Posting files to [base] url http://localhost:8983/solr/core-base-wiki/update...

Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log

POSTing file enwiki-20150702-stub-articles8.xml (application/xml) to [base]

1 files indexed.

COMMITting Solr index changes to http://localhost:8983/solr/core-base-wiki/update...

Time spent: 0:00:00.863

However, when I open the admin UI and check the core overview, it shows 0 documents indexed.

I am at a loss to understand what configuration I am missing. Any help or guidance will be highly appreciated.

P.S.: The dataset enwiki-20150702-stub-articles8.xml was downloaded from the Wikimedia page. A few sample lines from the file:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">

<siteinfo>

<sitename>Wikipedia</sitename>

<dbname>enwiki</dbname>

<base>https://en.wikipedia.org/wiki/Main_Page</base>

<generator>MediaWiki 1.26wmf11</generator>

<case>first-letter</case>

<namespaces>

<namespace key="-2" case="first-letter">Media</namespace>

<namespace key="829" case="first-letter">Module talk</namespace>

...

...

<namespace key="2600" case="first-letter">Topic</namespace>

</namespaces>

</siteinfo>

<page>

<title>700 (number)</title>

<ns>0</ns>

<id>465001</id>

<revision>

<id>663854862</id>

<parentid>655386821</parentid>

<timestamp>2015-05-24T21:01:24Z</timestamp>

<contributor>

<username>Cnwilliams</username>

<id>10190671</id>

</contributor>

<comment>Disambiguated: [[Tintin]] → [[The Adventures of Tintin]]</comment>

<model>wikitext</model>

<format>text/x-wiki</format>

<text id="669059875" bytes="12464" />

<sha1>q15fslnvlsrgbeo8f6mcyrg00l2d2a5</sha1>

</revision>

</page>

<page>

<title>Canadian federal election, 1957</title>

<ns>0</ns>

<id>465004</id>

<revision>

<id>666418811</id>

<parentid>666417048</parentid>

<timestamp>2015-06-11T01:38:05Z</timestamp>

<contributor>

<username>Wehwalt</username>

<id>458237</id>

</contributor>

<comment>/* Impact */ clarify</comment>

<model>wikitext</model>

<format>text/x-wiki</format>

<text id="671713242" bytes="77788" />

<sha1>05g14m9sfavo7buuirpr8lx4c6vfwee</sha1>

</revision>

</page>

...

...

<page>

<title>Professional Players Tournament (snooker)</title>

<ns>0</ns>

<id>665001</id>

<redirect title="World Open (snooker)" />

<revision>

<id>359952698</id>

<parentid>25566787</parentid>

<timestamp>2010-05-03T23:48:34Z</timestamp>

<contributor>

<username>Xqbot</username>

<id>8066546</id>

</contributor>

<minor/>

<comment>Robot: Fixing double redirect to [[World Open (snooker)]]</comment>

<model>wikitext</model>

<format>text/x-wiki</format>

<text id="360810125" bytes="34" />

<sha1>lxtjwcda9vk58fphj8ie2logjm607mv</sha1>

</revision>

</page>

</mediawiki>

Answer:

The data was indexed once I triggered the import with: curl "http://localhost:8983/solr/core-base-wiki/dataimport?command=full-import"

For some reason ./bin/post could not do the same. I did not look into it further; if anyone has figured out how to make it work, please share your findings.
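
A likely explanation, judging from the console output in the question: ./bin/post sends the file to the core's /update handler, which expects Solr's own XML update format rather than a MediaWiki dump, and it does not go through wiki-data-config.xml at all. As a rough illustration (written by hand from the first sample page, with field names taken from the schema.xml in the question), a document in the shape /update expects looks something like this:

```xml
<!-- Illustrative only: Solr's native XML update format, built by hand
     from the first sample <page> in the question. -->
<add>
  <doc>
    <field name="id">465001</field>
    <field name="title">700 (number)</field>
    <field name="revision">663854862</field>
    <field name="user">Cnwilliams</field>
    <field name="userId">10190671</field>
    <field name="timestamp">2015-05-24T21:01:24Z</field>
  </doc>
</add>
```

The DataImportHandler exists to do this mapping from the XPath rules, which is why the import has to be triggered through /dataimport rather than posted to /update.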

Answer:

You're missing the lib element in solrconfig.xml:

<lib dir="../../../dist" regex="solr-dataimporthandler-.*\.jar" />
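
For context, a minimal sketch of how the two pieces might sit together in solrconfig.xml. The relative dir is an assumption: it fits a core that lives under server/solr/ in a stock Solr 5.x layout, so adjust it if your core is somewhere else. The handler definition is copied from the question.

```xml
<!-- Assumption: the core directory is server/solr/core-base-wiki, so
     ../../../dist resolves to the dist/ folder of the Solr install. -->
<lib dir="../../../dist" regex="solr-dataimporthandler-.*\.jar" />

<!-- Handler from the question; its class is provided by the jar loaded above. -->
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">wiki-data-config.xml</str>
  </lst>
</requestHandler>
```

Without that jar on the classpath, Solr cannot load the handler class named in the class attribute.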