nli-db-benchmark

Benchmark for natural language queries

This package contains the data set for the paper "Database Search vs. Information Retrieval: A Novel Method for Studying Natural Language Querying of Semi-Structured Data" by Stefanie Nadig, Martin Braschler and Kurt Stockinger (Zurich University of Applied Sciences, Switzerland).

The scripts transform the 2010 IMDB Collection provided by INEX from its XML structure into a relational database, so that the data set can be used to run experiments on different systems. The target database schema is shown in the ER schema diagram.

Data

Download the data set: 2010 IMDB Collection (1.4 GB). Unzip it and move all documents from the subfolders into a single folder.
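The flattening step can also be scripted. The following is a minimal sketch, not part of this package: the function name flatten_collection is hypothetical, and it assumes the documents carry an .xml extension and that no two documents share a file name.

```python
import shutil
from pathlib import Path


def flatten_collection(source_root: Path, target_dir: Path) -> int:
    """Move every .xml file found below source_root into target_dir.

    Returns the number of files moved. Raises FileExistsError instead of
    overwriting when two files share a name.
    Assumption: the documents use the .xml extension.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    moved = 0
    # sorted() materialises the file list before anything is moved.
    for xml_file in sorted(source_root.rglob("*.xml")):
        if xml_file.parent == target_dir:
            continue  # already in the target folder
        destination = target_dir / xml_file.name
        if destination.exists():
            raise FileExistsError(f"name collision: {destination}")
        shutil.move(str(xml_file), str(destination))
        moved += 1
    return moved
```

Calling flatten_collection(Path("unzipped"), Path("documentcollection")) would leave all documents directly in documentcollection, which can then be passed to create-inserts.py. Both folder names are placeholders.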

Relevance Assessment

Download the original relevance assessment here: 2011 qrels (adhoc track)

Parse XML to SQL Statements

The Python script create-inserts.py parses all XML files and generates SQL statements, which are written to 8 SQL files named after the respective tables in the database. The first and only argument is the path to the document collection.

python ~/scripts/create-inserts.py ~/documentcollection

After executing create-inserts.py, you will find the following SQL files as output in the scripts folder:

  • movie_inserts.sql
  • person_inserts.sql
  • biography_inserts.sql
  • link_inserts.sql
  • movie_value_inserts.sql
  • person_value_inserts.sql
  • actor_inserts.sql
  • role_inserts.sql

The SQL statements for the tables movie_attributes and person_attributes are already provided and can be found in the database folder. After a clean run, the generated errors.txt file should be empty.
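To illustrate the general idea of turning one XML document into an INSERT statement, here is a minimal sketch. It is not the code of create-inserts.py: the title element, the movie table columns (id, title) and both function names are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def sql_quote(value: str) -> str:
    """Quote a string as a MySQL literal, escaping backslashes and single quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "''") + "'"


def movie_insert(xml_text: str, movie_id: int) -> str:
    """Build one INSERT statement from a document containing a <title> element.

    Assumption: element, table and column names are hypothetical.
    """
    root = ET.fromstring(xml_text)
    # findtext returns None when the element is missing; fall back to "".
    title = (root.findtext(".//title") or "").strip()
    return f"INSERT INTO movie (id, title) VALUES ({movie_id}, {sql_quote(title)});"
```

Escaping quotes matters here because many titles and names contain apostrophes, which would otherwise produce invalid statements.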

Remove Duplicates

Use the script remove-duplicates.py to remove duplicate entries for roles and actors. The other tables have no duplicates. The first and only argument is the SQL file to clean.

To remove the duplicates of the table "roles":

python ~/scripts/remove-duplicates.py ~/scripts/role_inserts.sql

To remove the duplicates of the table "actors":

python ~/scripts/remove-duplicates.py ~/scripts/actor_inserts.sql
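Conceptually, removing duplicates from such a file can be as simple as keeping the first occurrence of each statement. The sketch below shows that idea. It is not the implementation of remove-duplicates.py, the function name drop_duplicate_lines is hypothetical, and it assumes each statement occupies exactly one line.

```python
from pathlib import Path


def drop_duplicate_lines(sql_file: Path) -> int:
    """Rewrite sql_file in place, keeping only the first copy of each line.

    Returns the number of lines removed. Blank lines are always kept.
    Assumption: one SQL statement per line.
    """
    seen = set()
    kept = []
    removed = 0
    with sql_file.open(encoding="utf-8") as handle:
        for line in handle:
            key = line.strip()
            if key and key in seen:
                removed += 1
                continue
            seen.add(key)
            kept.append(line)
    sql_file.write_text("".join(kept), encoding="utf-8")
    return removed
```

Keeping the original order of the surviving lines means the cleaned file can be imported exactly like the original one.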

Database

Create a MySQL database with ~/database/create-eav-db.sql or adapt as necessary for other database systems.

Import the generated SQL files into your database. Import the files for the following tables first, in this order, to respect the dependencies:

  • Movie
  • Person
  • Movie_Attributes
  • Person_Attributes

Afterwards, import the remaining SQL files.
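The import order can be computed rather than tracked by hand. The following sketch sorts a list of SQL files so that the four parent tables come first. The helper names are hypothetical, and the file names movie_attributes.sql and person_attributes.sql are assumptions, since the names of the provided files in the database folder are not stated here.

```python
from pathlib import Path

# Tables that other tables depend on, in the order listed above.
PRIORITY = ["movie", "person", "movie_attributes", "person_attributes"]


def table_name(sql_file: Path) -> str:
    """Derive a table name from a file name such as movie_inserts.sql."""
    stem = sql_file.stem.lower()
    return stem[: -len("_inserts")] if stem.endswith("_inserts") else stem


def import_order(sql_files):
    """Return the files sorted so that the four parent tables come first."""
    def rank(sql_file):
        name = table_name(sql_file)
        position = PRIORITY.index(name) if name in PRIORITY else len(PRIORITY)
        return (position, name)
    return sorted(sql_files, key=rank)
```

The resulting list can then be passed to the MySQL client one file at a time. The remaining files are ordered alphabetically by table name, which is an arbitrary choice.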
