EventKG - Implementation

Code on GitHub

The EventKG source code is available on GitHub.

Software License

The EventKG software code is licensed under the terms of the MIT license (see LICENSE.txt in the GitHub repository).

Configuration

Create a configuration file like the following. It specifies where to store your EventKG version, which languages to extract, and which dumps to use:

data_folder /home/....../data
languages en,de,ru,fr,pt,it
enwiki 20190101
dewiki 20190101
frwiki 20190101
ruwiki 20190101
ptwiki 20190101
itwiki 20190101
dbpedia 2016-10
wikidata 20181231
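
To illustrate the format above, here is a minimal Python sketch that reads such a file and reports languages that have no matching dump entry. It is not part of EventKG: the function names are hypothetical, the rule that every language needs a "<lang>wiki" key is inferred from the sample above, and the data folder path in the example is made up.

```python
def parse_config(text):
    """Parse 'key value' lines into a dict; 'languages' becomes a list."""
    config = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first run of whitespace: the key, then the value.
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"expected 'key value', got: {line!r}")
        key, value = parts
        config[key] = value.strip()
    # Turn the comma-separated language list into a Python list.
    config["languages"] = [
        lang.strip()
        for lang in config.get("languages", "").split(",")
        if lang.strip()
    ]
    return config


def missing_dump_keys(config):
    """Return the '<lang>wiki' keys implied by 'languages' but absent from the file.

    Assumption: each configured language needs its own '<lang>wiki' entry.
    """
    return [
        f"{lang}wiki"
        for lang in config["languages"]
        if f"{lang}wiki" not in config
    ]


# Hypothetical example values; the data folder path is for illustration only.
example = """\
data_folder /tmp/eventkg/data
languages en,de,it
enwiki 20190101
dewiki 20190101
dbpedia 2016-10
wikidata 20181231
"""

config = parse_config(example)
print(config["languages"])        # ['en', 'de', 'it']
print(missing_dump_keys(config))  # ['itwiki']
```

A check like this can catch a language that was added to the languages line without a corresponding Wikipedia dump date before the long-running download starts.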

Currently, 15 languages are supported (en, fr, de, ru, pt, es, it, da, nl, ro, no, pl, hr, sl, bg). There is one Wikipedia dump for each language; timestamps of current dumps can be found at https://dumps.wikimedia.org/enwiki. Usually, the dump dates are consistent across languages. The chosen dump must be marked "Dump complete" on the dump's website. Wikidata dumps are listed at https://dumps.wikimedia.org/wikidatawiki/entities/. DBpedia is dumped for all languages at once; the newest dump is listed at the top of http://wiki.dbpedia.org/datasets.
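
The checks described above can be prepared with a small helper. The following Python sketch is only an illustration, not EventKG code: the function names are hypothetical, and the per-dump URL layout (wiki name followed by the timestamp) is an assumption based on how dumps.wikimedia.org is organised. It does not contact the server; it only prints the pages on which to look for "Dump complete".

```python
# The 15 language codes listed above.
SUPPORTED_LANGUAGES = {
    "en", "fr", "de", "ru", "pt", "es", "it", "da",
    "nl", "ro", "no", "pl", "hr", "sl", "bg",
}


def unsupported_languages(languages):
    """Return the configured language codes that are not in the supported set."""
    return [lang for lang in languages if lang not in SUPPORTED_LANGUAGES]


def wikipedia_dump_page(language, timestamp):
    """Build the page for one Wikipedia dump (assumed layout: <lang>wiki/<timestamp>/)."""
    return f"https://dumps.wikimedia.org/{language}wiki/{timestamp}/"


if __name__ == "__main__":
    # Example values taken from the sample configuration.
    dumps = {"en": "20190101", "pt": "20190101"}
    print("Unsupported:", unsupported_languages(list(dumps)))
    for language, timestamp in dumps.items():
        # Open each page manually and confirm that it says "Dump complete".
        print(wikipedia_dump_page(language, timestamp))
```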

Run the extraction

The EventKG extraction pipeline consists of the steps described below. Note that some of these steps require considerable time and resources (e.g. for downloading the data, for processing the large Wikidata dump file, and for processing the Wikipedia XML files).

1. Export the Pipeline class (de.l3s.eventkg.pipeline.Pipeline) as an executable JAR (Pipeline.jar).

2. Start the data download using:

java -jar Pipeline.jar path_to_config_file.txt 1

3. Run the first steps of the extraction:

java -jar Pipeline.jar path_to_config_file.txt 2,3

4. Export the Dumper class (de.l3s.eventkg.wikipedia.mwdumper.Dumper) as a JAR (Dumper.jar). Then run the extraction from the Wikipedia dump files for each language with the following command (shown here for Portuguese; replace pt with another language code as needed). GNU parallel is required.

nohup parallel -j9 "bzip2 -dc {} | java -jar -Xmx6G -Xss40m Dumper.jar path_to_config_file.txt pt" :::: data/raw_data/wikipedia/pt/dump_file_list.txt 2> log_dumper.txt

5. Start the final steps of extraction:

java -jar Pipeline.jar path_to_config_file.txt 4,5,6,7,8

6. The resulting .nq files can be found in the folder data/results/all.
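
The command sequence from steps 2 to 5 can be summarised in a short Python sketch. It is not part of EventKG and the function names are hypothetical; it assumes that Pipeline.jar and Dumper.jar have already been exported as described, and it only builds and prints the commands rather than running them, so each long-running step can still be started and monitored by hand.

```python
import shlex


def pipeline_command(config_path, steps):
    """Command line for Pipeline.jar with a comma-separated list of step numbers."""
    step_list = ",".join(str(step) for step in steps)
    return ["java", "-jar", "Pipeline.jar", config_path, step_list]


def dumper_command(config_path, language, jobs=9):
    """GNU parallel invocation that feeds each dump file of one language to Dumper.jar."""
    # '{}' is GNU parallel's placeholder for the current input file.
    inner = (
        "bzip2 -dc {} | java -jar -Xmx6G -Xss40m Dumper.jar "
        f"{config_path} {language}"
    )
    file_list = f"data/raw_data/wikipedia/{language}/dump_file_list.txt"
    return ["parallel", f"-j{jobs}", inner, "::::", file_list]


def extraction_plan(config_path, languages):
    """Ordered list of commands: download, first steps, one dumper run per language, final steps."""
    plan = [
        pipeline_command(config_path, [1]),      # step 2: data download
        pipeline_command(config_path, [2, 3]),   # step 3: first extraction steps
    ]
    plan += [dumper_command(config_path, lang) for lang in languages]  # step 4
    plan.append(pipeline_command(config_path, [4, 5, 6, 7, 8]))        # step 5
    return plan


if __name__ == "__main__":
    for command in extraction_plan("path_to_config_file.txt", ["pt", "en"]):
        # shlex.join quotes the piped command so it can be pasted into a shell.
        print(shlex.join(command))
```

Printing the plan first makes it easy to verify that every configured language gets its own Dumper run before the final steps are started.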