Benchmark for natural language queries

This package contains the data set for the paper

Stefanie Nadig, Martin Braschler, Kurt Stockinger, Database Search vs. Information Retrieval: A Novel Method for Studying Natural Language Querying of Semi-Structured Data, International Conference on Language Resources and Evaluation (LREC), 2020.

Please cite the paper when using the package.

The scripts transform the 2010 IMDB Collection provided by INEX from its XML structure into a relational database, so that the data set can be used to run experiments on different systems. The target database schema is as follows: ER Schema


Download the data set from here: 2010 IMDB Collection (1.4GB). Unzip it and move all documents from the subfolders into a single folder.
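The flattening step can be done by hand or with a few lines of Python. The following is a minimal sketch, assuming the documents are `.xml` files with unique names across subfolders; the function name `flatten_collection` and both path arguments are hypothetical and not part of this package.

```python
import shutil
from pathlib import Path


def flatten_collection(source_root, target_dir):
    """Move every .xml file found below source_root into target_dir.

    Returns the number of files that were moved.
    """
    source_root = Path(source_root).resolve()
    target_dir = Path(target_dir).resolve()
    target_dir.mkdir(parents=True, exist_ok=True)

    moved = 0
    # sorted() materialises the listing so that moving files
    # does not disturb the directory walk.
    for xml_file in sorted(source_root.rglob("*.xml")):
        if xml_file.parent == target_dir:
            continue  # already in the target folder
        destination = target_dir / xml_file.name
        if destination.exists():
            # Assumption: file names are unique; stop rather than overwrite.
            raise FileExistsError(f"name clash: {xml_file.name}")
        shutil.move(str(xml_file), str(destination))
        moved += 1
    return moved
```

The resulting folder is what the parsing step expects as its document collection path.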

Relevance Assessment

Download the original relevance assessment here: 2011 qrels (adhoc track)

Parse XML to SQL Statements

The Python script parses all XML files and generates SQL statements, which are written to 8 SQL files named after the respective tables in the database. The first and only argument is the path to the document collection.

python ~/scripts/ ~/documentcollection

After executing the script, you will see the following sql-files as output in the scripts folder:

  • movie_inserts.sql
  • person_inserts.sql
  • biography_inserts.sql
  • link_inserts.sql
  • movie_value_inserts.sql
  • person_value_inserts.sql
  • actor_inserts.sql
  • role_inserts.sql

The sql statements for tables movie_attributes and person_attributes are already provided and can be found in the database folder. After a clean run the created errors.txt should be empty.
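To illustrate the general idea of turning one XML document into an insert statement, here is a minimal sketch. It is not the package's parser: the element names (`title`, `year`), the column list of the `movie` table, and the helper names are assumptions made for the example, so check them against the actual collection and `create-eav-db.sql` before reusing anything.

```python
import xml.etree.ElementTree as ET


def sql_quote(value):
    """Render a string as a SQL literal; None becomes NULL."""
    if value is None:
        return "NULL"
    # Escape backslashes first, then double any single quotes.
    escaped = value.replace("\\", "\\\\").replace("'", "''")
    return f"'{escaped}'"


def movie_insert(xml_text, movie_id):
    """Build one INSERT statement from a (hypothetical) movie document."""
    root = ET.fromstring(xml_text)
    # findtext returns None when the element is missing.
    title = root.findtext("title")
    year = root.findtext("year")
    return (
        "INSERT INTO movie (id, title, year) VALUES "
        f"({int(movie_id)}, {sql_quote(title)}, {sql_quote(year)});"
    )


print(movie_insert("<movie><title>It's Alive</title></movie>", 7))
```

Quoting values explicitly matters here because titles and biographies frequently contain apostrophes, which would otherwise break the generated statements.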

Remove Duplicates

Use the script to remove duplicate entries for roles and actors. The other tables have no duplicates. The first and only argument is a sql-file.

To remove the duplicates of the table "roles":

python ~/scripts/ ~/scripts/role_inserts.sql

To remove the duplicates of the table "actors":

python ~/scripts/ ~/scripts/actor_inserts.sql
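For readers who want to see what such a clean-up amounts to, the sketch below drops repeated statements from a SQL file while keeping the first occurrence and the original order. It assumes one statement per line, which may not match the generated files exactly; `remove_duplicate_lines` is a hypothetical name, not the script shipped with this package.

```python
def remove_duplicate_lines(path):
    """Rewrite the file at path without repeated lines.

    Keeps the first occurrence of each line, preserves order,
    skips empty lines, and returns the number of lines kept.
    """
    seen = set()
    unique = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            statement = line.rstrip("\n")
            if statement and statement not in seen:
                seen.add(statement)
                unique.append(statement)

    # Overwrite the input file in place with the de-duplicated statements.
    with open(path, "w", encoding="utf-8") as handle:
        handle.writelines(statement + "\n" for statement in unique)
    return len(unique)
```

Calling `remove_duplicate_lines("role_inserts.sql")` would rewrite that file in place, so keep a copy if the original output is still needed.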


Create a MySQL database with ~/database/create-eav-db.sql, or adapt the script as necessary for other database systems.

Import the generated sql-files into your database. Load the files for the following tables first, in this order, to respect the dependencies:

  • Movie
  • Person
  • Movie_Attributes
  • Person_Attributes

Afterwards import the remaining sql-files.
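The ordering rule can be expressed as a small sort, which is handy when the import is scripted. This is a sketch under assumptions: it supposes that files are named `<table>_inserts.sql` or `<table>.sql` (the names of the provided attribute files are not stated here, so `movie_attributes.sql` in the example is hypothetical), and it only computes the order rather than running the import.

```python
from pathlib import Path

# Tables that must be loaded first, in this order (taken from the list above).
PRIORITY = ["movie", "person", "movie_attributes", "person_attributes"]


def table_name(sql_file):
    """Derive a table name from a file name such as 'movie_inserts.sql'."""
    stem = Path(sql_file).stem.lower()
    suffix = "_inserts"
    return stem[: -len(suffix)] if stem.endswith(suffix) else stem


def import_order(sql_files):
    """Return sql_files with the priority tables first, the rest by name."""
    def key(sql_file):
        name = table_name(sql_file)
        rank = PRIORITY.index(name) if name in PRIORITY else len(PRIORITY)
        return (rank, name)

    return sorted(sql_files, key=key)


files = ["role_inserts.sql", "person_inserts.sql", "movie_inserts.sql"]
print(import_order(files))
```

Each file in the returned list can then be fed to the database client of your choice, one after the other.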
