• Solr test dataset

    29-12-2010 · Author: basdenooijer

    For an open-source project I’m working on I need a good Solr test dataset. More info about the project will follow soon, but as a teaser I can already tell it’s Solr and PHP related ;)
    The dataset needs to be of a reasonable size (not unrealistically small, but not huge either) and it should be free for anyone to use, as anyone should be able to test the project.

    I’ve worked on quite a lot of Solr projects by now, and have a local environment of most. But obviously I cannot use these indexes for anything other than the projects they belong to, let alone redistribute the data.

    For some demos and my post on complex Solr faceting I’ve used the dataset from the book ‘Solr 1.4 Enterprise Search Server’, based on MusicBrainz data. But this dataset is not so great for faceting, which is one of the more important items to me. So I decided to look for a better alternative.

    The first thing that came to my mind was IMDB. But at first glance their licensing terms seem to be an issue. Then I remembered using a geocoding service based on free location data. With some searching I found it: GeoNames. It seems like a perfect fit:

    • it uses a Creative Commons Attribution 3.0 License
    • the dataset is big enough, and can easily be extended by importing more parts of the dataset
    • the data lends itself well to faceting
    • lots of special characters in the data for testing UTF-8 handling

    The only downside is that this dataset has no large texts, just some names for items. So for text-analysis-related items I would need another dataset, maybe some Wikipedia content. But for now this dataset will do just fine.

    The next step was to get the data into Solr. A quick search turned up this blog post: Solr and Geonames.
    I used some of the steps he describes as a starting point. These are the steps I took:

    1. create a new core in my solr test instance
    2. create a schema, see my settings below
    3. I also needed to alter my solrconfig.xml: enableRemoteStreaming needs to be set to "true". Be aware of security though, don’t do this if your Solr instance can be reached by others!
    4. reload solr to load the new core
    5. download the data file: http://download.geonames.org/export/dump/cities1000.zip
    6. unzip it
    7. I had issues importing the file due to some stray double quotes in the data. Since only a few of the roughly 100k records triggered the error, and it doesn’t matter for my tests, I didn’t really look into it; instead I ‘fixed’ it by stripping the offending quotes:
      sed 's/"//g' cities1000.txt > cities1000_fixed.txt
    8. start the import by calling this URL (adapt to your own environment):
      http://localhost:8983/solr/geonames/update/csv?commit=true&separator=%09&fieldnames=id,name,,alternative_names,latitude,longitude,,,countrycode,,,,,,population,elevation,,timezone,lastupdate&stream.file=/path/to/file.txt&overwrite=true&stream.contentType=text/plain;charset=utf-8
    9. after waiting a few seconds you should get a confirmation, and you should be able to see the results at:
      http://localhost:8983/solr/geonames/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on
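    The import URL in step 8 is awkward to escape by hand. As a small convenience, here is a Python sketch that builds the same URL from readable parameters (the host, core name and file path are the ones used above; adapt them to your environment):

```python
from urllib.parse import urlencode

# Column layout of the GeoNames dump; empty names skip columns
# we don't want to index (this mirrors the fieldnames in step 8).
fieldnames = [
    "id", "name", "", "alternative_names", "latitude", "longitude",
    "", "", "countrycode", "", "", "", "", "",
    "population", "elevation", "", "timezone", "lastupdate",
]

params = {
    "commit": "true",
    "separator": "\t",  # the dump is tab-separated; encodes to %09
    "fieldnames": ",".join(fieldnames),
    "stream.file": "/path/to/cities1000_fixed.txt",
    "overwrite": "true",
    "stream.contentType": "text/plain;charset=utf-8",
}
url = "http://localhost:8983/solr/geonames/update/csv?" + urlencode(params)
print(url)
```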

    If you are new to Solr these steps are probably described too briefly; just let me know and I’ll write a step-by-step tutorial on request.

    These are the schema settings (I left out the standard parts, Solr comes with good examples):

    <fields>   
       <field name="id" type="int" indexed="true" stored="true" />
       <field name="name" type="string" indexed="true" stored="true" />
       <field name="alternative_names" type="text" indexed="true" stored="true" />
       <field name="latitude" type="string" indexed="true" stored="true" />
       <field name="longitude" type="string" indexed="true" stored="true" />   
       <field name="countrycode" type="string" indexed="true" stored="true" />
       <field name="population" type="int" indexed="true" stored="true" />
       <field name="elevation" type="string" indexed="true" stored="true" />
       <field name="timezone" type="string" indexed="true" stored="true" />
       <field name="lastupdate" type="string" indexed="true" stored="true" />
     </fields>
    
     <uniqueKey>id</uniqueKey>
     
     <defaultSearchField>name</defaultSearchField>
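    Once the data is in, you can try the faceting that made this dataset attractive in the first place. A small sketch (again assuming the geonames core on localhost as above) that builds a facet query over countrycode and timezone:

```python
from urllib.parse import urlencode

# rows=0: we only want the facet counts, not the documents themselves
params = {
    "q": "*:*",
    "rows": 0,
    "facet": "true",
    "facet.field": ["countrycode", "timezone"],  # repeated parameter
}
url = ("http://localhost:8983/solr/geonames/select?"
       + urlencode(params, doseq=True))
print(url)
```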
    
  • 5 comments on “Solr test dataset”

    • 26 February 2011 at 23:06

      Thanks for the article. Good to know Solr can load CSV files out of the box. Can it load XML files as well? How about other types?
      I’ve been trying to come up with a good ETL strategy to load Solr, but so far nothing that works flawlessly. Talend comes close but it has its own problems.

    • 27 February 2011 at 08:38

      There are many ways of getting data into Solr. Which one is best really depends on your situation (datasource, available tools, schema, etcetera).

      The Solr DataImportHandler has support for multiple datasources.
      – XML files (this includes remote files like RSS feeds)
      – CSV files
      – a direct JDBC database connection using queries
      – emails using an IMAP connection
      – lots of document types using Tika (see http://tika.apache.org/0.9/formats.html)

      For info about the above options see http://wiki.apache.org/solr/DataImportHandler
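      To illustrate the JDBC option: a minimal DIH data-config.xml could look roughly like this (the driver, connection details, table and column names are all made up for the example):

```xml
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="solr" password="secret"/>
  <document>
    <entity name="city" query="SELECT id, name, countrycode FROM cities">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
      <field column="countrycode" name="countrycode"/>
    </entity>
  </document>
</dataConfig>
```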

      You can also use Apache Nutch, a web crawler that can use Solr as an index. This way you can index a complete website just like search engines do. An easy and quick way to add search to a site, but limited in options.

      Finally, you can also create your own solution using the update API; this allows for the most flexibility. Libraries are available for most languages.
      A hybrid solution is also a possibility: use some of your own tools to create XML files in a suitable format and let the Solr DIH import those XML files.
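      The hybrid approach can be sketched in a few lines: build the Solr <add> XML with your own tooling, then post it to the update handler. A minimal Python illustration (the field names reuse the geonames schema above; the sample values are made up):

```python
import xml.etree.ElementTree as ET

def make_add_xml(docs):
    """Build a Solr <add> XML payload from a list of field dicts."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

# sample record; values are illustrative only
payload = make_add_xml([{"id": 1234, "name": "Amsterdam", "countrycode": "NL"}])
print(payload)
# To index it for real, POST the payload to
# http://localhost:8983/solr/geonames/update with Content-Type text/xml,
# followed by a <commit/>.
```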

    • 27 February 2011 at 18:28

      Thanks for the quick response. If you want sample customer data, try http://www.briandunning.com/sample-data/ — it has up to 35k records for free. I just loaded my Solr instance with that. You can get a CSV file that you can load directly into Solr. I tried loading it using Talend with the SolrJ library.

      The data import handler works fine for a full load, but if you want to do frequent near-real-time loads and keep track of failures, logging, etc., I think doing the ETL in an external tool is better.

    • 28 February 2011 at 18:04

      If you’re interested, I wrote a blog post about ETL into Solr using Talend.

    • 28 February 2011 at 18:05