We provide the following files for future experiments:
Predictions are evaluated as follows:
python code/evaluate.py --golds <gold_annotations> --predictions <prediction_file>
e.g., python code/evaluate.py --golds evaluation_data/gold_American_VealeNOC.txt --predictions embedding_predictions/predictions_3cosadd_American_VealeNOC.txt
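As a rough guide to what the script checks, the sketch below scores predictions against golds line by line. code/evaluate.py is the reference implementation; the one-answer-per-line layout and exact-match metric assumed here may differ from it.

# Rough sketch of prediction scoring (code/evaluate.py is authoritative;
# the file layout and exact-match metric here are assumptions).
def exact_match_accuracy(gold_path, pred_path):
    with open(gold_path, encoding="utf-8") as g, open(pred_path, encoding="utf-8") as p:
        golds = [line.strip() for line in g if line.strip()]
        preds = [line.strip() for line in p if line.strip()]
    correct = sum(gold == pred for gold, pred in zip(golds, preds))
    return correct / len(golds)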
Human Generated Adaptations
The final human-generated data is available under evaluation_data as the four gold files: German VealeNOC, German Wiki, American VealeNOC, American Wiki.
Human Evaluations of All Adaptations
The questions and the five translator judgements for them are provided at evaluation_data/translator_evaluations.csv.
Embedding-Based Adaptations
The final generated data for our VealeNOC- and Wikipedia-sourced entities is available under embedding_predictions. There are six files: four (German VealeNOC, German Wiki, American VealeNOC, American Wiki) created with 3cosadd, and two ("learned") that are trained on Wikipedia and tested on VealeNOC.
WikiData Adaptations
The final generated data for our VealeNOC- and Wikipedia-sourced entities is available under wikidata_predictions. There are four files: German VealeNOC, German Wiki, American VealeNOC, American Wiki, created with our WikiData method.
To create them yourself, you will need (combined 50GB):
Optionally, we provide the original WikiData dump from 10-26-2020 (processed to remove everything except the Properties and Values we need): https://obj.umiacs.umd.edu/adaptation/10-26-20-wikidata.jsonl
FAQ
1) What Python environment do I need?
pip install -r requirements.txt
2) How do I produce embedding-based modulations?
We provide modulate.py, which supports both the unsupervised 3cosadd and the supervised learned modulation modes. For detailed parameters run:
python code/modulate.py -h
- Example American-to-German modulation with 3cosadd:
python code/modulate.py \
--input input_American_Wiki.txt \
--output predictions_3cosadd_American_Wiki.txt \
--src_emb vectors-en.txt \
--trg_emb vectors-de.txt \
--method add \
--src_pos Germany \
--src_neg United_States \
--trg_pos Deutschland \
--trg_neg USA
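The core of 3cosadd is a single vector-analogy query over the embedding spaces. The sketch below is ours, not extracted from modulate.py: it assumes the source and target vectors are aligned in one space, uses only the source-side anchors for simplicity (the script also takes --trg_pos/--trg_neg), and the function names are hypothetical.

# Minimal 3cosadd sketch (ours; modulate.py is the actual implementation).
import numpy as np

def normalize(m):
    # Row-normalize so dot products become cosine similarities.
    return m / (np.linalg.norm(m, axis=-1, keepdims=True) + 1e-9)

def three_cos_add(x, src_pos, src_neg, trg_matrix, trg_words, k=5):
    # Score every target word y by cos(y, x - src_neg + src_pos),
    # e.g. x = an American entity, src_neg = "United_States", src_pos = "Germany".
    query = normalize(x - src_neg + src_pos)
    scores = normalize(trg_matrix) @ query
    top = np.argsort(-scores)[:k]
    return [(trg_words[i], float(scores[i])) for i in top]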
- Example German-to-American modulation with learned:
python code/modulate.py \
--input input_German_VealeNOC.txt \
--output predictions_learned_German_VealeNOC.txt \
--src_emb vectors-de.txt \
--trg_emb vectors-en.txt \
--method ridge \
--train_file train_German_Wiki.txt
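Conceptually, the learned mode fits a regression from source-language vectors to target-language vectors on the --train_file pairs and then returns nearest neighbors of the mapped test vectors. The sketch below is our reading of --method ridge using scikit-learn; the function names, defaults, and neighbor lookup are ours, not the script's.

# Sketch of a "learned" (ridge-regression) modulation (our reading of --method ridge;
# the actual objective and hyperparameters live in code/modulate.py).
import numpy as np
from sklearn.linear_model import Ridge

def fit_mapping(src_train, trg_train, alpha=1.0):
    # src_train, trg_train: aligned (n, d) matrices built from the --train_file pairs.
    model = Ridge(alpha=alpha)
    model.fit(src_train, trg_train)  # multi-output linear map: source -> target space
    return model

def adapt(model, x, trg_matrix, trg_words, k=5):
    # Map one source-entity vector into the target space and return its nearest neighbors.
    mapped = model.predict(x.reshape(1, -1))[0]
    sims = trg_matrix @ mapped / (
        np.linalg.norm(trg_matrix, axis=1) * np.linalg.norm(mapped) + 1e-9)
    top = np.argsort(-sims)[:k]
    return [trg_words[i] for i in top]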
3) How do I get my own WikiData dump?
- Download a dump for a specific date from https://dumps.wikimedia.org/wikidatawiki/entities/. You're looking for a file named like wikidata-20210830-all.json.bz2 under a recent date.
- Process the data into .jsonl format. (WikiData is unsurprisingly large, so removing unrelated attributes and converting it into a JSONLines format, which can be loaded item by item, is a helpful preprocessing step.)
We use https://github.com/EntilZha/wikidata-rust to make this conversion.
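Once the dump is in JSONLines form, it can be read one entity at a time without loading the whole file into memory. The generic sketch below uses the standard WikiData entity schema (id/labels/claims); which fields our preprocessing actually keeps is an assumption here.

# Generic sketch for streaming a WikiData .jsonl dump item by item.
import json

def iter_entities(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")  # tolerate raw array-style dumps too
            if not line or line in ("[", "]"):
                continue
            item = json.loads(line)
            yield {
                "id": item.get("id"),
                "label": item.get("labels", {}).get("en", {}).get("value"),
                "claims": item.get("claims", {}),  # property -> value statements
            }

# Example: count entities that have an English label.
# n = sum(1 for e in iter_entities("10-26-20-wikidata.jsonl") if e["label"])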
4) What computing environment do I need?
Wikipedia and WikiData are obviously large. The code for WikiData used a machine with a large amount of RAM (100+ GB) to pre-process the data, and a GPU to compute Faiss distances. Since the data is provided in .jsonl format, the code could likely be reworked to require less CPU memory if needed. The Faiss distance calculation is also tractable (~1 hour) on a CPU.
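For orientation, the Faiss step amounts to building an index over the target-side vectors and querying it with the entities to adapt. The sketch below uses an exact inner-product index over L2-normalized vectors; that index choice, and the random stand-in data, are ours rather than what the repository's code necessarily does.

# Minimal Faiss sketch for the nearest-neighbor step (index choice is ours).
import faiss
import numpy as np

trg = np.random.rand(100_000, 300).astype("float32")   # target-side vectors (stand-in)
queries = np.random.rand(10, 300).astype("float32")    # vectors of entities to adapt

faiss.normalize_L2(trg)       # normalize so inner product equals cosine similarity
faiss.normalize_L2(queries)

index = faiss.IndexFlatIP(trg.shape[1])   # exact search; tractable on CPU at this scale
index.add(trg)
scores, ids = index.search(queries, 5)    # top-5 neighbors per query
# On a machine with GPUs: index = faiss.index_cpu_to_all_gpus(index) before add().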
Please contact dpeskov.work@gmail.com with any questions.