We provide the following files for future experiments:
Predictions are evaluated as follows:
python code/evaluate.py --golds <gold_annotations> --predictions <prediction_file>
e.g., python code/evaluate.py --golds evaluation_data/gold_American_VealeNOC.txt --predictions embedding_predictions/predictions_3cosadd_American_VealeNOC.txt
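As a rough guide to what the script checks, the sketch below scores predictions against golds line by line. code/evaluate.py is the reference implementation; the one-answer-per-line layout and exact-match metric assumed here may differ from it.

# Rough sketch of prediction scoring (code/evaluate.py is authoritative;
# the file layout and exact-match metric here are assumptions).
def exact_match_accuracy(gold_path, pred_path):
    with open(gold_path, encoding="utf-8") as g, open(pred_path, encoding="utf-8") as p:
        golds = [line.strip() for line in g if line.strip()]
        preds = [line.strip() for line in p if line.strip()]
    correct = sum(gold == pred for gold, pred in zip(golds, preds))
    return correct / len(golds)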
Human Generated Adaptations
The final human-generated data is available under evaluation_data as the four gold files: German VealeNOC, German Wiki, American VealeNOC, American Wiki.
Human Evaluations of All Adaptations
The questions and the five translator judgements for them are provided at evaluation_data/translator_evaluations.csv.
Embedding-Based Adaptations
The final generated data for our VealeNOC- and Wikipedia-sourced entities is available under embedding_predictions. There are six files: four (German VealeNOC, German Wiki, American VealeNOC, American Wiki) created with 3cosadd, and two ("learned") that are trained on Wikipedia and tested on VealeNOC.
WikiData Adaptations
The final generated data for our VealeNOC- and Wikipedia-sourced entities is available under wikidata_predictions. There are four files: German VealeNOC, German Wiki, American VealeNOC, American Wiki, created with our WikiData method.
To create them yourself, you will need (combined 50GB):
Optionally, we provide the original WikiData dump from 10-26-2020 (processed to remove everything except the Properties and Values we need): https://obj.umiacs.umd.edu/adaptation/10-26-20-wikidata.jsonl
FAQ
1) What Python environment do I need?
pip install -r requirements.txt
2) How do I produce embedding-based modulations?
We provide modulate.py, which supports both the unsupervised 3cosadd and the supervised learned modulation modes. For detailed parameters run:
python code/modulate.py -h
- Example American-to-German modulation with 3cosadd:
python code/modulate.py \
--input input_American_Wiki.txt \
--output predictions_3cosadd_American_Wiki.txt \
--src_emb vectors-en.txt \
--trg_emb vectors-de.txt \
--method add \
--src_pos Germany \
--src_neg United_States \
--trg_pos Deutschland \
--trg_neg USA
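The core of 3cosadd is a single vector-analogy query over the embedding spaces. The sketch below is ours, not extracted from modulate.py: it assumes the source and target vectors are aligned in one space, uses only the source-side anchors for simplicity (the script also takes --trg_pos/--trg_neg), and the function names are hypothetical.

# Minimal 3cosadd sketch (ours; modulate.py is the actual implementation).
import numpy as np

def normalize(m):
    # Row-normalize so dot products become cosine similarities.
    return m / (np.linalg.norm(m, axis=-1, keepdims=True) + 1e-9)

def three_cos_add(x, src_pos, src_neg, trg_matrix, trg_words, k=5):
    # Score every target word y by cos(y, x - src_neg + src_pos),
    # e.g. x = an American entity, src_neg = "United_States", src_pos = "Germany".
    query = normalize(x - src_neg + src_pos)
    scores = normalize(trg_matrix) @ query
    top = np.argsort(-scores)[:k]
    return [(trg_words[i], float(scores[i])) for i in top]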
- Example German-to-American modulation with learned:
python code/modulate.py \
--input input_German_VealeNOC.txt \
--output predictions_learned_German_VealeNOC.txt \
--src_emb vectors-de.txt \
--trg_emb vectors-en.txt \
--method ridge \
--train_file train_German_Wiki.txt
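Conceptually, the learned mode fits a regression from source-language vectors to target-language vectors on the --train_file pairs and then returns nearest neighbors of the mapped test vectors. The sketch below is our reading of --method ridge using scikit-learn; the function names, defaults, and neighbor lookup are ours, not the script's.

# Sketch of a "learned" (ridge-regression) modulation (our reading of --method ridge;
# the actual objective and hyperparameters live in code/modulate.py).
import numpy as np
from sklearn.linear_model import Ridge

def fit_mapping(src_train, trg_train, alpha=1.0):
    # src_train, trg_train: aligned (n, d) matrices built from the --train_file pairs.
    model = Ridge(alpha=alpha)
    model.fit(src_train, trg_train)  # multi-output linear map: source -> target space
    return model

def adapt(model, x, trg_matrix, trg_words, k=5):
    # Map one source-entity vector into the target space and return its nearest neighbors.
    mapped = model.predict(x.reshape(1, -1))[0]
    sims = trg_matrix @ mapped / (
        np.linalg.norm(trg_matrix, axis=1) * np.linalg.norm(mapped) + 1e-9)
    top = np.argsort(-sims)[:k]
    return [trg_words[i] for i in top]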
3) How do I get my own WikiData dump?
- Download a dump for a specific date from https://dumps.wikimedia.org/wikidatawiki/entities/. You're looking for a file named like wikidata-20210830-all.json.bz2 under a recent date.
- Process the data into .jsonl format. (WikiData is unsurprisingly large, so removing unrelated attributes and converting it into a JSONLines format, which can be loaded item by item, is a helpful preprocessing step.)
We use https://github.com/EntilZha/wikidata-rust to make this conversion.
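Once the dump is in JSONLines form, it can be read one entity at a time without loading the whole file into memory. The generic sketch below uses the standard WikiData entity schema (id/labels/claims); which fields our preprocessing actually keeps is an assumption here.

# Generic sketch for streaming a WikiData .jsonl dump item by item.
import json

def iter_entities(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")  # tolerate raw array-style dumps too
            if not line or line in ("[", "]"):
                continue
            item = json.loads(line)
            yield {
                "id": item.get("id"),
                "label": item.get("labels", {}).get("en", {}).get("value"),
                "claims": item.get("claims", {}),  # property -> value statements
            }

# Example: count entities that have an English label.
# n = sum(1 for e in iter_entities("10-26-20-wikidata.jsonl") if e["label"])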
4) What computing environment do I need?
Wikipedia and WikiData are obviously large. The code for WikiData used a machine with a large amount of RAM (100+ GB) to pre-process the data, and a GPU to compute Faiss distances. Since the data is provided in .jsonl format, the code could likely be reworked to require less CPU memory if needed. The Faiss distance calculation is also tractable (~1 hour) on a CPU.
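For orientation, the Faiss step amounts to building an index over the target-side vectors and querying it with the entities to adapt. The sketch below uses an exact inner-product index over L2-normalized vectors; that index choice, and the random stand-in data, are ours rather than what the repository's code necessarily does.

# Minimal Faiss sketch for the nearest-neighbor step (index choice is ours).
import faiss
import numpy as np

trg = np.random.rand(100_000, 300).astype("float32")   # target-side vectors (stand-in)
queries = np.random.rand(10, 300).astype("float32")    # vectors of entities to adapt

faiss.normalize_L2(trg)       # normalize so inner product equals cosine similarity
faiss.normalize_L2(queries)

index = faiss.IndexFlatIP(trg.shape[1])   # exact search; tractable on CPU at this scale
index.add(trg)
scores, ids = index.search(queries, 5)    # top-5 neighbors per query
# On a machine with GPUs: index = faiss.index_cpu_to_all_gpus(index) before add().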
Please contact dpeskov.work@gmail.com with any questions.