Shared LRs

LREC recognises the importance of sharing Language Resources (LRs) and making them available to the community. When submitting a paper, participants were offered the possibility to share their LRs (data, tools, web-services, etc.), uploading them in a special LREC repository set up by ELRA. This effort of sharing LRs, linked to the LRE Map initiative for their description, contributes to creating a common repository where everyone can deposit and share data.

After the conference, the Shared LRs set at LREC 2018 was manually checked and a cleaned version of the list of LRs is now available. The LRs in this list comply with the following criteria:

LRs accessible (whether downloadable directly or through an external URL)
LRs categorized as Datasets only. A dataset can be a:
- Corpus,
- Evaluation Data,
- Lexicon,
- Ontology,
- Terminology,
- Treebank.

Excluded LRs are:

Uploaded LRs whose content does not match the description
LRs with no download URL, or whose URL is now a dead link
LRs categorized as tools or guidelines
LRs associated with rejected papers

Shared-LRs @ LREC 2018

Name	A Tweet Dataset Annotated in Four Emotion Dimensions
Resource type	Corpus
Size	2019 entries
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Emotion Recognition/Generation
License	CC-BY 4.0
Conditions of use	Attribution
Description	A corpus of 2,019 tweets annotated along each of four emotion dimensions: Valence, Dominance, Arousal and Surprise. Two annotation schemes are used: a 5-point ordinal scale (using SAM manikins for Valence, Arousal and Dominance) and pair-wise comparisons with an "about the same" option (here 2,019 tweet pairs are annotated such that each of the 2,019 tweets is in at least one pair and no pairs are duplicated). In all cases, there was a "Can't Tell" option for unintelligible tweets. Files provided are CSV output from CrowdFlower with useful columns largely self-explanatory. Annotation columns are emotion names (5-point scale) or "most_emotion" (comparisons); "index" contains a unique id for each annotation task, and "_worker_id" contains a unique identifier for each annotator.
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/61_res_1.tgz [3,45 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/61.html
Edition	LREC 2018

Name	Abu El-Khair Corpus
Resource type	Corpus
Size	16 Gbyte
Languages	Arabic
Production status	Complete
Resource usage	Information Retrieval, Natural Language Processing, Machine Learning
License	OpenSource
Conditions of use	Free
Description	A text corpus of five million newspaper articles, totalling a billion and a half words, of which about three million are unique. The data were collected from newspaper articles in ten major news sources from eight Arabic countries, over a period of fourteen years. The corpus is provided in two encodings, UTF-8 and Windows CP-1256, and in two mark-up languages, SGML and XML.
Download from	http://www.abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/4.html
Edition	LREC 2018

Name	Adimen-SUMO v2.6
Resource type	Ontology
Size	4000 rules
Languages	<Not Specified>
Production status	Existing-updated
Resource usage	Automated reasoning
License	Creative Commons CC BY 3.0 Unported
Conditions of use	Attribution
Description	Adimen-SUMO is an off-the-shelf first-order ontology obtained by reengineering 88% of SUMO (Suggested Upper Merged Ontology). Adimen-SUMO can be used by first-order (FO) theorem provers (such as E-Prover or Vampire) for formal reasoning.
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/308_res_1.zip [3,41 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/308.html
Edition	LREC 2018

Name	Anchor wordlists for code-switching
Resource type	Lexicon
Size	65 MByte
Languages	English (eng) Spanish (spa)
Production status	Newly created-finished
Resource usage	Language Modelling
License	<Not Specified>
Conditions of use	<Not Specified>
Description	This Language Resource package contains the anchor wordlists described in the LREC 2018 paper: "Collecting Code-Switched Data from Social Media", Gideon Mendels, Victor Soto, Aaron Jaech and Julia Hirschberg, LREC 2018. - The folder strong_anchors contains the collection of strong anchors computed as described in the paper. The three subfolders anchors, anchors_news and anchors_wiki contain the strong anchors when computed using the news resources, wiki resources, or all resources from the Leipzig Corpora Collection (LCC). - The folder weak_anchors contains the collection of Spanish and English weak anchors as described in the paper, using the GigaCorpus dataset of Broadcast News data. You can use the anchor wordlists to seed the search for code-switched tweets using Babler (https://github.com/gidim/Babler/blob/master/README.md).
Download from	http://www.cs.columbia.edu/~vsoto/files/anchor_wordlists.zip
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/92.html
Edition	LREC 2018

Name	Annotated arXiv CS Data Set
Resource type	Corpus
Size	3.5 GByte
Languages	<Not Specified>
Production status	Newly created-finished
Resource usage	Information Extraction, Information Retrieval
License	arXiv
Conditions of use	Website is granted a perpetual, non-exclusive right to distribute the content
Description	The data set contains 15.5M sentences of arXiv.org publications in the computer science domain. In those sentences, the citation markers were replaced by global paper identifiers. All citing and cited papers are linked to DBLP, as far as possible. The data set can be used for a variety of citation-based tasks, such as citation recommendation, citation function determination, and citation-based document summarization.
Download from	http://citation-recommendation.org/publications/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/283.html
Edition	LREC 2018

Name	Annotated Corpus of Scientific Conference's Homepages
Resource type	Corpus
Size	57.8
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Information Extraction, Information Retrieval
License	Creative Commons CC BY-NC-SA 4.0
Conditions of use	Attribution Non-Commercial Share-Alike
Description	The corpus contains homepages of conferences with annotations of interesting information, e.g., the name of a conference, its abbreviation, and several important dates for the conference. The corpus can be used to train a tool for information extraction from unstructured sources containing data describing conferences. We chose conference home pages as a source as they contain up-to-date information. Structured services, such as WikiCFP, do not always update information (e.g., deadline changes) and so cannot be used in a real system for gathering up-to-date information about conferences.
Download from	http://ii.pw.edu.pl/~pandrusz/data/conferences/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/828.html
Edition	LREC 2018

Name	Arabic Dialects Dataset
Resource type	Corpus
Size	355069 words
Languages	Egyptian Arabic (arz) Gulf Arabic (afb) North Levantine Arabic (apc) Tunisian Arabic (aeb) Modern Standard Arabic
Production status	Existing-updated
Resource usage	Document Classification, Text categorisation
License	Open Source
Conditions of use	<Not Specified>
Description	Arabic Dialect Dataset is a collection of Arabic dialect documents collected from the Arabic Commentary Dataset and the Tunisian Arabic dataset. We selected only 100% dialectal documents from each resource and filtered them by dialect so that no document is written in more than one dialect. We built frequency lists for the collected documents, as well as bivalency and dialectal Modern Standard Arabic lists.
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/237_res_1.zip [2,24 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/237.html
Edition	LREC 2018

Name	ArapTweet
Resource type	Corpus
Size	2500000 Tweets
Languages	Dialectal Arabic
Production status	Newly created-in progress
Resource usage	Language Identification
License	Creative Commons CC BY 4.0
Conditions of use	Attribution
Description	An Arabic multi-dialectal corpus of Tweets from 12 regions and 17 countries in the Arab world. The corpus is annotated for age categories, gender and for the dialectal variety.
Download from	http://arap.qatar.cmu.edu/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/521.html
Edition	LREC 2018

Name	ArmanPersoNERCorpus
Resource type	Corpus
Size	1.6 MByte
Languages	Iranian Persian (pes)
Production status	Existing-updated
Resource usage	Named Entity Recognition
License	<Not Specified>
Conditions of use	<Not Specified>
Description	ArmanPersoNERCorpus is the first manually-annotated Persian named-entity (NE) dataset (ISLRN 399-379-640-828-6). We are releasing it only for academic research use. The data set includes 250,015 tokens and 7,682 Persian sentences in total. It is available in 3 folds to be used in turn as training and test sets. Each file contains one token, along with its manually annotated named-entity tag, per line. Each sentence is separated with a newline. The NER tags are in IOB format. According to the instructions provided to the annotators, NEs are categorized into six classes: person, organization (such as banks, ministries, embassies, teams, nationalities, networks and publishers), location (such as cities, villages, rivers, seas, gulfs, deserts and mountains), facility (such as schools, universities, research centers, airports, railways, bridges, roads, harbors, stations, hospitals, parks, zoos and cinemas), product (such as books, newspapers, TV shows, movies, airplanes, ships, cars, theories, laws, agreements and religions), and event (such as wars, earthquakes, national holidays, festivals and conferences); other are the remaining tokens.
Download from	https://github.com/HaniehP/PersianNER
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/48.html
Edition	LREC 2018
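
As a rough illustration of the file layout described in the ArmanPersoNERCorpus entry (one token with its IOB tag per line, sentences separated from each other), the following Python sketch groups such lines into sentences. It assumes that the token and the tag are separated by whitespace and that a blank line marks a sentence boundary; the function name `read_iob` and the sample tokens and tag names are hypothetical and are not taken from the resource itself.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_iob(lines: Iterable[str]) -> List[Sentence]:
    """Group 'token tag' lines into sentences (blank line = sentence boundary)."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # Assumed sentence separator: an empty line.
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumption: the tag is the last whitespace-separated field.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            continue  # skip lines that do not look like "token tag"
        current.append((parts[0], parts[1]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample in the assumed layout (not real corpus content).
sample = ["Tehran B-loc", "is O", "", "Arman B-org", "Bank I-org"]
sentences = read_iob(sample)
```

With a real file, the same function could be applied to an open file handle, for example `read_iob(open(path, encoding="utf-8"))`, after checking the actual separator used in the downloaded data.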

Name	b5 corpus
Resource type	Corpus
Size	1082 Big five inventories and text
Languages	Portuguese (por)
Production status	Newly created-finished
Resource usage	Document Classification, Text categorisation
License	Creative Commons CC BY 4.0
Conditions of use	Attribution
Description	The b5 corpus is a dataset containing texts (in Brazilian Portuguese) and self-reported personality inventories of their authors. This consists of an author's knowledge base called b5-subject and four text databases (or subcorpora) called b5-post, b5-ref, b5-text, and b5-caption. The b5-subject knowledge base contains 1082 personality inventories, and partial author information regarding gender, age, background, degree of religiosity (on a 1-5 scale) and undergraduate course information. The b5-post subcorpus contains Facebook status updates from participants who filled out the personality inventory using a purpose-built application. For each subject, up to 1,000 Facebook status updates were collected. Users with little or no Facebook activity were discarded, resulting in a corpus of 1019 texts. The b5-ref corpus is a collection of 1810 definite descriptions elicited from visual contexts, and annotated with their semantic properties. The b5-text and b5-caption subcorpora contain about 1500 scene descriptions produced in two versions: a detailed version in the form of multi-sentential text, and a short version in the form of a single sentence similar to picture captions.
Download from	https://drive.google.com/open?id=0B-KyU7T8S8bLTHpaMnh2U2NWZzQ
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/31.html
Edition	LREC 2018

Name	b5-ref-lex corpus
Resource type	Corpus
Size	4711 Definite descriptions
Languages	Portuguese (por)
Production status	Newly created-finished
Resource usage	Natural Language Generation
License	Creative Commons CC BY 4.0
Conditions of use	Attribution
Description	The b5-ref-lex dataset conveys semantic properties taken from the b5-ref corpus of referring expressions, the corresponding Big Five personality traits about their authors, and their surface forms (in Brazilian Portuguese). The dataset has been created for the development of machine learning methods of personality-based referring expression lexical choice, in which the goal is to learn the appropriate lexicalisation for a given input semantics and target personality. Each instance of the dataset is represented by seven features: a referential property (or attribute-value pair) as defined in the b5-ref domain, the five personality values (Extraversion, Agreeableness, Conscientiousness, Neuroticism, Openness to Experience) of the speaker who selected the property to produce a definite description, and the basic word that was uttered.
Download from	https://drive.google.com/open?id=0B-KyU7T8S8bLclA0XzIwX0NncEk
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/35.html
Edition	LREC 2018

Name	Baidu Baike document titles
Resource type	Lexicon
Size	93.9 MByte
Languages	Chinese
Production status	Newly created-finished
Resource usage	Lexicon Creation/Annotation
License	OpenSource
Conditions of use	<Not Specified>
Description	The resource contains 10,143,321 titles of documents from Baidu Baike (the largest Chinese-language encyclopedia, similar to Wikipedia). Each title can be considered a specific word or combination of words.
Download from	https://drive.google.com/open?id=1rO-LUcWpm_5KhBUpYID1M5B_an55XVO2
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/295.html
Edition	LREC 2018

Name	Basque Multiword Expressions Scored for Compositionality (Unfiltered)
Resource type	Lexicon
Size	77679 lexemes
Languages	Basque (eus)
Production status	Newly created-finished
Resource usage	Multiword Expression Compositionality
License	CreativeCommons
Conditions of use	<Not Specified>
Description	A compositionality-ranked list of Basque multiword expressions, created using the methodology detailed in our paper; evaluation for this resource is not presented.
Download from	http://amor.cms.hu-berlin.de/~robertsw/files/lrec2016/MWE_eu-unfiltered.utf8.txt.gz
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/650.html
Edition	LREC 2018

Name	Berriak
Resource type	Corpus
Size	500 sentences
Languages	Basque English
Production status	Newly created-in progress
Resource usage	Machine Translation, SpeechToSpeech Translation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	500 English-Basque accurate translations, translated by human translators. The corpus was created thanks to the help of Librezale, a group that works for an increase in the presence of Basque language in informatics. This is the first version of the corpus; we aim to increase the number of translations in the future. Pre-trained word embeddings in Basque: We have learned Basque word embeddings from Basque Wikipedia with GloVe.
Download from	https://github.com/ijauregiCMCRC/english_basque_MT
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/101.html
Edition	LREC 2018

Name	BlogSet-BR
Resource type	Corpus
Size	4.7 GByte
Languages	Portuguese (por)
Production status	Newly created-finished
Resource usage	Text Mining
License	Apache 2.0
Conditions of use	Preservation of Copyright Notice
Description	The processed corpus is a CSV file with 7.4 million posts only from Brazilian bloggers with the columns: post id number, blog id number, published date, title, content, author id number, author display name and replies (number of comments). The total size of compressed processed data is 4.7 GB. The survey data with responses from 4 thousand Brazilian bloggers is an XLS file sized 1.1 MB.
Download from	http://www.inf.pucrs.br/linatural/blogset-br
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/10.html
Edition	LREC 2018

Name	BPEmb
Resource type	Corpus
Size	100 GByte
Languages	275 languages
Production status	Existing-used
Resource usage	Input for Neural Models
License	MIT License
Conditions of use	Preservation of Copyright Notice
Description	Subword embeddings in 275 languages, based on Byte-Pair Encoding, trained on Wikipedia article text
Download from	https://github.com/bheinzerling/bpemb
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/1049.html
Edition	LREC 2018

Name	Bulgarian Multiword Expressions Scored for Compositionality (Unfiltered)
Resource type	Lexicon
Size	98872 entries
Languages	Bulgarian (bul)
Production status	Newly created-finished
Resource usage	Multiword Expression Compositionality
License	CreativeCommons
Conditions of use	<Not Specified>
Description	A compositionality-ranked list of Bulgarian multiword expressions, created using the methodology detailed in our paper; evaluation for this resource is not presented.
Download from	http://amor.cms.hu-berlin.de/~robertsw/files/lrec2016/MWE_bg-unfiltered.utf8.txt.gz
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/650.html
Edition	LREC 2018

Name	Chinese Word Embedding Evaluation Sets
Resource type	Lexicon
Size	entries
Languages	Chinese
Production status	Newly created-finished
Resource usage	Word Sense Disambiguation
License	OpenSource
Conditions of use	Attribution
Description	Evaluation Datasets for Chinese Word Embedding.
Download from	http://ckipsvr.iis.sinica.edu.tw/ecemb/reg.php
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/159.html
Edition	LREC 2018

Name	Classical Chinese Evaluation Dataset (Twenty-Five Histories)
Resource type	Evaluation Data
Size	147 KByte
Languages	Classical Chinese
Production status	Newly created-finished
Resource usage	Evaluation/Validation
License	OpenSource
Conditions of use	<Not Specified>
Description	The manually segmented texts in the evaluation dataset, consisting of 32,689 characters, are proportionally selected from each historical book of the Twenty-Five histories. The dataset is exported from MongoDB.
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/295_res_2.zip [147 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/295.html
Edition	LREC 2018

Name	Code-switched English-Spanish Tweets
Resource type	Corpus
Size	493 KByte
Languages	English (eng) Spanish (spa)
Production status	Newly created-finished
Resource usage	Language Modelling
License	Apache 2.0
Conditions of use	Preservation of Copyright Notice Share-Alike
Description	This package contains the collection of tweets described in the LREC 2018 paper: "Collecting Code-Switched Data from Social Media", Gideon Mendels, Victor Soto, Aaron Jaech and Julia Hirschberg, LREC 2018. Please remember to cite this paper if you use this resource. The tagged_tweets_ids file contains the IDs of the 8,285 tweets for which we crowdsourced language tags. These tweets were collected using Babler (https://github.com/gidim/Babler/blob/master/README.md) and the anchor wordlists described in the paper and that can be found in http://www.cs.columbia.edu/~vsoto/files/anchor_wordlists.zip The tagged_tweets_labels file contains the crowdsourced language tags for each token in the collection of 8,285 tweets. The format of the file is one line per token and each line contains a tweet ID, token index and language tag. The language tag values are the following (for a more thorough explanation read the paper): lang1 = English, lang2 = Spanish, ne = Named Entity, unk = Unknown, fw = Foreign Word, ambiguous, mixed and other.
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/92_res_2.zip [481 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/92.html
Edition	LREC 2018
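
To make the label file layout in the Code-switched English-Spanish Tweets entry more concrete (one line per token, holding a tweet ID, a token index and a language tag), here is a minimal Python sketch that groups tokens by tweet and checks whether a tweet mixes both languages. It assumes the three fields are whitespace-separated and that the token index is an integer; the function names and the sample IDs are hypothetical, so the real file should be inspected before relying on this.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Tokens = List[Tuple[int, str]]


def group_labels(lines: Iterable[str]) -> Dict[str, Tokens]:
    """Map each tweet ID to its (token index, language tag) pairs."""
    by_tweet: Dict[str, Tokens] = defaultdict(list)
    for raw in lines:
        fields = raw.split()  # assumption: whitespace-separated fields
        if len(fields) != 3:
            continue  # skip lines that do not match the assumed layout
        tweet_id, index, tag = fields
        by_tweet[tweet_id].append((int(index), tag))
    for tokens in by_tweet.values():
        tokens.sort()  # restore token order within each tweet
    return dict(by_tweet)


def is_code_switched(tokens: Tokens) -> bool:
    """True if the tweet has at least one lang1 and one lang2 token."""
    tags = {tag for _, tag in tokens}
    return {"lang1", "lang2"} <= tags


# Hypothetical sample lines (IDs and indices are made up).
sample = [
    "1001 0 lang1",
    "1001 1 lang2",
    "1001 2 ne",
    "1002 0 lang1",
    "1002 1 lang1",
]
grouped = group_labels(sample)
```

Treating a tweet as code-switched when it contains both `lang1` and `lang2` tokens is only one possible criterion; the referring paper describes the tag set in more detail.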

Name	Concept Space
Resource type	Corpus
Size	12.5 GByte
Languages	English
Production status	Newly created-finished
Resource usage	Knowledge Discovery/Representation
License	CreativeCommons
Conditions of use	<Not Specified>
Description	A concept space for Explicit Semantic Analysis (ESA) (Gabrilovich & Markovitch 2007). ESA is a technique that provides a semantic representation of text in a space of concepts derived from Wikipedia. ESA defines concepts from Wikipedia articles, e.g., BARACK OBAMA and COMPUTER SCIENCE. This resource is a concept space created from a Wikipedia snapshot (April 2017).
Download from	https://goo.gl/JZhEvm
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/806.html
Edition	LREC 2018

Name	Cornell eRulemaking Corpus -- CDCP
Resource type	Corpus
Size	4931 sentences
Languages	English
Production status	Newly created-finished
Resource usage	Argument Mining
License	Open Database License
Conditions of use	<Not Specified>
Description	This dataset consists of argument annotations on user comments about rule proposals regarding Consumer Debt Collection Practices by the Consumer Financial Protection Bureau, crawled from an eRulemaking website, regulationroom.org. The annotation scheme is based on the argumentation model presented in "Toward Machine-assisted Participation in eRulemaking: An Argumentation Model of Evaluability" by Joonsuk Park, Cheryl Blake and Claire Cardie (ICAIL 2015).
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/679_res_1.zip [194 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/679.html
Edition	LREC 2018

Name	Corpus for Sarcasm Detection in English-Hindi code-mixed tweets
Resource type	Corpus
Size	3.2 Mbyte
Languages	Hindi English
Production status	Complete
Resource usage	Sarcasm detection in English-Hindi code-mixed tweets
License	GNU GPL 3.0
Conditions of use	Preservation of Copyright Notice Share-Alike
Description	The dataset is divided into three files: - Sarcasm_tweets.txt: This file contains all the tweet ids and their corresponding tweet text. Each tweet id is followed by the text and a blank line and so on. - Sarcasm_tweets_with_language.txt: This file contains tweet id followed by the corresponding tweet that is tokenized and each token is tagged with a language tag (en/hi/rest). - Sarcasm_tweet_truth.txt: This file contains tweet id followed by a label (YES/NO) that indicates presence of sarcasm and then a blank line.
Download from	https://github.com/sahilswami96/SarcasmDetection_CodeMixed/tree/master/Dataset
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/7.html
Edition	LREC 2018
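
The layout given for Sarcasm_tweet_truth.txt in the entry above (a tweet id, then a YES/NO label, then a blank line) could be read along the following lines. This is only a sketch: the description does not say whether the id and label share a line, so the code treats every blank-line-delimited block as one record and takes its first field as the id and its last field as the label. The function name `read_truth` and the sample ids are hypothetical.

```python
from typing import Dict, Iterable, List


def read_truth(lines: Iterable[str]) -> Dict[str, bool]:
    """Map tweet id -> True if the label is YES, from blank-line-separated records."""
    labels: Dict[str, bool] = {}
    block: List[str] = []

    def flush() -> None:
        # A usable record needs at least an id and a label.
        if len(block) >= 2:
            labels[block[0]] = block[-1].upper() == "YES"
        block.clear()

    for raw in lines:
        fields = raw.split()
        if fields:
            block.extend(fields)
        else:
            flush()  # blank line closes the current record
    flush()  # handle a final record without a trailing blank line
    return labels


# Hypothetical sample covering both possible layouts.
sample = ["111", "YES", "", "222 NO", ""]
truth = read_truth(sample)
```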

Name	CorpusDRF
Resource type	Lexicon
Size	936 entries
Languages	French (fra)
Production status	Newly created-finished
Resource usage	Corpus Creation/Annotation
License	OpenSource
Conditions of use	<Not Specified>
Description	CorpusDRF is an open-source, digitized collection of regionalisms, their parts of speech, and recognition rates published in the 'Dictionnaire des Regionalismes de France' (DRF, "Dictionary of Regionalisms of France") (Rezeau, 2001), enabling the visualization and analyses of the largest-scale study of French regionalisms in the 20th century using publicly available data. The corpus documents, in a tabular format, the numerical values of recognition rates for each DRF entry sorted according to each of the 94 departments in continental France. Each CSV file contains 95 rows for the 94 French departments with a header and 936 DRF entries as columns. There are 3 versions of the data -- one with empty values (NAs) left as empty, one with NAs imputed with 0s, and one with NAs as -1s.
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/1005_res_1.zip [62,6 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/1005.html
Edition	LREC 2018
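
The three CorpusDRF variants described above differ only in how missing recognition rates are encoded. The following is a minimal Python sketch of converting the "empty" variant into one of the filled variants. The sample table, its column names, and the presence of a leading department column are assumptions made for illustration; they are not taken from the distributed files.

```python
import csv
import io


def recode_missing(csv_text, fill):
    """Return (header, rows) with every empty cell replaced by `fill`."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [[cell if cell != "" else fill for cell in row] for row in reader]
    return header, rows


# Hypothetical excerpt: two departments and two DRF entries.
# The real files are described as 94 department rows by 936 entry columns.
sample = "department,entry_a,entry_b\n01,12.5,\n02,,40\n"

header, rows = recode_missing(sample, "-1")
print(rows)  # [['01', '12.5', '-1'], ['02', '-1', '40']]
```

Passing "0" instead of "-1" would give the zero-imputed encoding. Which encoding is appropriate depends on the analysis, since a 0 cannot be told apart from a genuine recognition rate of zero.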

Name	Czech Multiword Expressions Scored for Compositionality (Unfiltered)
Resource type	Lexicon
Size	175713 entries
Languages	Czech (ces)
Production status	Newly created-finished
Resource usage	Multiword Expression Compositionality
License	CreativeCommons
Conditions of use	<Not Specified>
Description	A compositionality-ranked list of Czech multiword expressions, created using the methodology detailed in our paper; evaluation for this resource is not presented.
Download from	http://amor.cms.hu-berlin.de/~robertsw/files/lrec2016/MWE_cs-unfiltered.utf8.txt.gz
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/650.html
Edition	LREC 2018

Name	Czech Text Document Corpus v 2.0
Resource type	Corpus
Size	694 MByte
Languages	Czech (ces)
Production status	Existing-updated
Resource usage	Document Classification, Text categorisation
License	Creative Commons CC BY-NC-SA 3.0
Conditions of use	Attribution Non-Commercial Share-Alike
Description	Czech Text Document Corpus v 2.0 is a collection of text documents for automatic document classification in the Czech language. It is composed of text documents provided by the Czech News Agency and is freely available for research purposes. This corpus was created to facilitate a straightforward comparison of document classification approaches on Czech data. It is particularly dedicated to the evaluation of multi-label document classification approaches, because one document is usually labelled with more than one label. Besides the information about the document classes, the corpus is also annotated at the morphological layer. The main part (for training and testing) is composed of 11,955 real newspaper articles. We also provide a development set, which is intended for tuning the hyper-parameters of the created models. This set contains 2,735 additional articles. The total number of categories is 60, of which the 37 most frequent ones are used for classification. The reason for this reduction is to keep only the classes with a sufficient number of occurrences to train the models. Technical Details: Text documents are stored in individual text files using UTF-8 encoding. Each filename is composed of the serial number and the list of category abbreviations, separated by the underscore symbol, followed by the .txt suffix. Serial numbers are composed of five digits and the numerical series starts from the value one. For instance, the file 00046_kul_nab_mag.txt represents document number 46, annotated with the categories kul (culture), nab (religion) and mag (magazine selection). The content of the document, i.e. the word tokens, is stored on one line. The tokens are separated by space symbols. Every text document was further automatically morphologically analyzed. This analysis includes lemmatization, POS tagging and syntactic parsing. The fully annotated files are stored in .conll files. We also provide the lemmatized form (files with the suffix .lemma) and the corresponding POS tags (see the .pos files). The tokenized version of the documents is also available in .tok files. This corpus is available free of charge for research purposes only. Commercial use in any form is strictly excluded.
Download from	http://ctdc.kiv.zcu.cz/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/671.html
Edition	LREC 2018
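
The filename convention described for the Czech Text Document Corpus (a five-digit serial number followed by underscore-separated category abbreviations) can be unpacked with a few lines of Python. This is a minimal sketch based only on the description above; the function name is hypothetical and the corpus itself does not ship this helper.

```python
from pathlib import Path


def parse_ctdc_filename(filename):
    """Split a corpus filename into (serial number, category abbreviations)."""
    stem = Path(filename).stem  # e.g. "00046_kul_nab_mag"
    serial, *categories = stem.split("_")
    if len(serial) != 5 or not serial.isdigit():
        raise ValueError(f"unexpected serial number in {filename!r}")
    return int(serial), categories


serial, categories = parse_ctdc_filename("00046_kul_nab_mag.txt")
print(serial, categories)  # 46 ['kul', 'nab', 'mag']
```

The list of categories is the multi-label target for a document, so a classifier's training labels can be read directly from the directory listing.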

Name	Dataset of Nuanced Assertions on Controversial Issues (NAoCI dataset)
Resource type	Evaluation Data
Size	<Not Specified>
Languages	English
Production status	Newly created-finished
Resource usage	Opinion Mining/Sentiment Analysis
License	Creative Commons CC BY-NC-ND 4.0
Conditions of use	Attribution Non-Commercial No-Derivatives
Description	The Dataset of Nuanced Assertions on Controversial Issues (NAoCI) consists of over 2,000 assertions on sixteen different controversial issues. It has over 100,000 judgments of whether people agree or disagree with the assertions, and about 70,000 judgments indicating how strongly people support or oppose the assertions.
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/321_res_1.zip [283 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/321.html
Edition	LREC 2018

Name	Datasets for classification experiments IS-pros
Resource type	Corpus
Size	135 KByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Evaluation/Validation
License	GNU GPL 3.0
Conditions of use	Preservation of Copyright Notice Share-Alike
Description	Datasets in ARFF format (for the Weka machine learning software) are made available to reproduce the validation experiments presented in the paper.
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/530_res_3.zip [135 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/530.html
Edition	LREC 2018

Name	Debate Recordings
Resource type	Corpus
Size	280 MByte
Languages	English
Production status	Newly created-finished
Resource usage	Debating technologies and computational argumentation
License	© Copyright Wikipedia CC-BY-SA 3.0 © Copyright IBM 2014. Released under CC-BY-SA 4.0
Conditions of use	Attribution Non-Commercial Share-Alike
Description	This resource is an audio and textual dataset of debating speeches for computational argumentation and debating technologies research. It contains 60 speeches recorded by experienced debaters, as well as their automatic and manual transcripts, in both raw and clean versions (5 formats in total). We plan to release more data in the future.
Download from	https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/66.html
Edition	LREC 2018

Name	Dundee GCG-Bank
Resource type	Treebank
Size	2372 sentences
Languages	English
Production status	Complete
Resource usage	<Not Specified>
License	GNU GPL 3.0
Conditions of use	Preservation of Copyright Notice Share-Alike Free
Description	Dundee GCG-Bank contains hand-corrected deep syntactic annotations for the Dundee eye-tracking corpus (Kennedy et al., 2003). The annotations are designed to support psycholinguistic investigation into the structural determinants of sentence processing effort. Dundee GCG-Bank is distributed as a sub-module of the ModelBlocks repository, a code base designed to support broad-coverage psycholinguistic modeling.
Download from	https://github.com/modelblocks/modelblocks-release
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/9.html
Edition	LREC 2018

Name	Dutch Multiword Expressions Scored for Compositionality (Unfiltered)
Resource type	Lexicon
Size	269498 entries
Languages	Dutch (nld)
Production status	Newly created-finished
Resource usage	Multiword Expression Compositionality
License	CreativeCommons
Conditions of use	<Not Specified>
Description	A compositionality-ranked list of Dutch multiword expressions, created using the methodology detailed in our paper; evaluation for this resource is not presented.
Download from	http://amor.cms.hu-berlin.de/~robertsw/files/lrec2016/MWE_nl-unfiltered.utf8.txt.gz
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/650.html
Edition	LREC 2018

Name	E-HowNet
Resource type	Lexicon
Size	web browser: 90000 words ; download: 30000 words
Languages	Chinese English
Production status	Newly created-finished
Resource usage	Information Extraction, Information Retrieval
License	Academic License Trial Version Agreement (Automatic Translation of Chinese version)
Conditions of use	Attribution Share-Alike
Description	Extended-HowNet (E-HowNet) is a lexical knowledge base evolved from HowNet and created by the CKIP (Chinese Knowledge and Information Processing) group. It consists of definitions for lexical senses and an ontology. The ontology is built by modifying the HowNet taxonomy of sememes to denote taxonomic relations between concepts and attributes of concepts, with the aim of constructing a lexical knowledge database. It is important groundwork for the E-HowNet project.
Download from	http://ckip.iis.sinica.edu.tw/CKIP/ehownet_reg/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/547.html
Edition	LREC 2018

Name	Emotion Movie Transcript Corpus
Resource type	Corpus
Size	100 MByte
Languages	English (eng)
Production status	Newly created-in progress
Resource usage	Emotion Recognition/Generation
License	In Readme File
Conditions of use	Research Uses Commercial usage if approved by the author
Description	Emotion Movie Transcript Corpus (EMTC) is a conversational text corpus for emotion analysis, collected from the IMDb quotes dataset. The corpus is partly annotated using a multi-label scheme and has a relatively high inter-annotator agreement score. The corpus is practical and closer to real-life settings than other emotion corpora. Emotion analysis systems can benefit from using the corpus as training/testing data or from extracting an emotion lexicon from it. The corpus includes 3 files (excluding the README.txt file).
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/405_res_1.zip [88 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/405.html
Edition	LREC 2018

Name	EN-Hi : Humor Detection Code-mixed texts
Resource type	Corpus
Size	1.8 MByte
Languages	English (eng) Hindi (hin)
Production status	Newly created-in progress
Resource usage	Humor detection and language identification in English-Hindi code-mixed texts
License	GNU GPL 3.0
Conditions of use	Attribution Preservation of Copyright Notice
Description	The corpus consists of English-Hindi code-mixed social media texts. It contains 3453 tweets annotated for the presence of humor, along with language identification at the word level.
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/363_res_1.txt [692 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/363.html
Edition	LREC 2018

Name	English Multiword Expressions Scored for Compositionality (Filtered)
Resource type	Lexicon
Size	817592 entries
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Multiword Expression Compositionality
License	CreativeCommons
Conditions of use	<Not Specified>
Description	The compositionality-ranked list of English multiword expressions presented in the paper, semi-automatically filtered for use in our machine translation experiments.
Download from	http://amor.cms.hu-berlin.de/~robertsw/files/lrec2016/MWE_en-filtered.utf8.txt.gz
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/650.html
Edition	LREC 2018

Name	English Multiword Expressions Scored for Compositionality (Unfiltered)
Resource type	Lexicon
Size	917648 entries
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Multiword Expression Compositionality
License	CreativeCommons
Conditions of use	<Not Specified>
Description	The compositionality-ranked list of English multiword expressions presented in the paper.
Download from	http://amor.cms.hu-berlin.de/~robertsw/files/lrec2016/MWE_en-unfiltered.utf8.txt.gz
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/650.html
Edition	LREC 2018

Name	English Vocabulary Knowledge Dataset
Resource type	Evaluation Data
Size	22 KByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Building and evaluating educational applications.
License	Open Source
Conditions of use	<Not Specified>
Description	These are the vocabulary test results of Japanese English-as-a-Second-Language learners, collected via crowdsourcing. The test used for collecting this data set is the Vocabulary Size Test (Nation and Beglar 2007). The test assesses test-takers' vocabulary by asking them to choose the correct meaning from multiple options for 100 English words. The test was answered by 100 test-takers recruited from a crowdsourcing service called Lancers, whose workers are mostly Japanese. Every test-taker was required to have taken the TOEIC test at least once, and was asked to report their TOEIC score and when they took the test. This dataset was collected in Jan. 2016. The test results for the 100 questions of each test-taker follow. The test can be downloaded from https://www.victoria.ac.nz/lals/about/staff/publications/paul-nation/VST-version-A.pdf The test answers can be downloaded from https://www.victoria.ac.nz/lals/about/staff/publications/paul-nation/VST-version-A_answers.pdf For copyright reasons, we do not attach the test and the test answers.
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/978_res_1.zip [6,19 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/978.html
Edition	LREC 2018

Name	English Wiktionary
Resource type	Lexicon
Size	5317000 entries
Languages	English (eng)
Production status	Existing-used
Resource usage	Lexicon Creation/Annotation
License	Creative Commons CC BY-SA 3.0
Conditions of use	Attribution Share-Alike
Description	<Not Specified>
Download from	https://en.wiktionary.org
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/312.html
Edition	LREC 2018

Name	English-Hindi code-mixed dataset for sarcasm detection
Resource type	Corpus
Size	3.2 MByte
Languages	English Hindi
Production status	Complete
Resource usage	Develop systems for classification of English-Hindi code-mixed texts and sarcasm detection
License	GNU GPL 3.0
Conditions of use	Preservation of Copyright Notice Share-Alike
Description	The dataset is divided into three files: - Sarcasm_tweets.txt: This file contains all the tweet ids and their corresponding tweet text. Each tweet id is followed by the text and a blank line, and so on. - Sarcasm_tweets_with_language.txt: This file contains each tweet id followed by the corresponding tweet, which is tokenized, with each token tagged with a language tag (en/hi/rest). - Sarcasm_tweet_truth.txt: This file contains each tweet id followed by a label (YES/NO) that indicates the presence of sarcasm, and then a blank line.
Download from	https://github.com/sahilswami96/SarcasmDetection_CodeMixed/tree/master/Dataset
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/20.html
Edition	LREC 2018
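
The file layout described for this dataset (a tweet id, then its text or label, then a blank line) can be read with a small block parser. The sketch below is a hedged illustration in Python: it assumes the id sits on its own line and the payload on the following line or lines, which the description suggests but does not state exactly. The function name, sample ids and sample texts are invented placeholders.

```python
def read_records(lines):
    """Parse blank-line-separated blocks into {tweet_id: payload}."""
    records = {}
    block = []
    # A trailing empty string flushes the last block even without a final blank line.
    for line in list(lines) + [""]:
        line = line.strip()
        if line:
            block.append(line)
        elif block:
            if len(block) >= 2:  # first line is the id, the rest is the payload
                records[block[0]] = " ".join(block[1:])
            block = []
    return records


# Placeholder content standing in for Sarcasm_tweets.txt and Sarcasm_tweet_truth.txt.
sample_tweets = "1001\nexample tweet one\n\n1002\nexample tweet two\n"
sample_truth = "1001\nYES\n\n1002\nNO\n"

tweets = read_records(sample_tweets.splitlines())
labels = read_records(sample_truth.splitlines())
sarcastic = [tweets[i] for i, label in labels.items() if label == "YES" and i in tweets]
print(sarcastic)  # ['example tweet one']
```

If the real files place the id and text on the same line, the block parser would need a different split rule, so it is worth checking a few records by hand first.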

Name	English-Malayalam-MorphGenerated Forms
Resource type	Lexicon
Size	413397 entries
Languages	English Malayalam (mal)
Production status	Newly created-in progress
Resource usage	Machine Translation, SpeechToSpeech Translation
License	Creative Commons CC BY-NC 4.0
Conditions of use	Attribution Non-Commercial
Description	The corpus contains the following resources: 1. File name - MorphWords.en.txt, MorphWords.ml.txt: morphology generated forms of 370065 parallel entries of the English-Malayalam language pair 2. File name - NounCaseForms.en.txt, NounCaseForms.ml.txt: 21360 entries of NounCase morphology generated sample forms with tags for English and Malayalam 3. File name - NounBaseWord.en.txt, NounBaseWords.ml.txt: 13326 entries of NounBase forms for English and Malayalam 4. File name - VerbRootWords.en.txt, VerbRootWords.ml.txt: 2526 parallel entries of English-Malayalam root verb words 5. File name - VerbMorphTagForms.en.txt, VerbMorphTagForms.ml.txt: 6120 parallel entries of English-Malayalam morphology generated tagged sample verb forms. These pairs were generated programmatically and also include phrases extracted from the corpus. These forms have been validated manually by English-Malayalam bilingual experts holding a Master's degree in Malayalam Literature. The details regarding the dataset are given in the following paper. Kindly cite this paper if you are using this dataset for research: Sreelekha.S, Pushpak Bhattacharyya. Morphology Generation for English-Malayalam SMT. Language Resources and Evaluation Conference (LREC). 2018. License: Corpus with Morphology Generated Forms for English-Malayalam is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/).
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/125_res_1.zip [7.13 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/125.html
Edition	LREC 2018

Name	Extended Typology Paraphrase Corpus (ETPC)
Resource type	Corpus
Size	5800 paraphrase pairs
Languages	English
Production status	Newly created-in progress
Resource usage	Textual Entailment and Paraphrasing
License	<Not Specified>
Conditions of use	<Not Specified>
Description	Corpus for Paraphrase Identification (PI), annotated with paraphrase types. Additionally annotated with negation.
Download from	https://github.com/venelink/ETPC
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/661.html
Edition	LREC 2018

Name	Finnish Wiktionary
Resource type	Lexicon
Size	340787 entries
Languages	Finnish (fin)
Production status	Existing-used
Resource usage	Lexicon Creation/Annotation
License	Creative Commons CC BY-SA 3.0
Conditions of use	Attribution Share-Alike
Description	<Not Specified>
Download from	https://fi.wiktionary.org
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/312.html
Edition	LREC 2018

Name	FooTweets_Corpus
Resource type	Corpus
Size	747 KByte
Languages	English German
Production status	Newly created-finished
Resource usage	Corpus Creation/Annotation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	FooTweets is a first bilingual parallel corpus for English-German tweets. A total of 4,000 English tweets were collected from the FIFA World Cup 2014 and translated into German. The English tweets are essentially informal in nature, but they are translated into formal German text in order to help build machine translation systems that are capable of translating informal texts into formal ones. In addition, each tweet is assigned a sentiment score of 0.3, 0.5 or 0.7 to represent the negative, neutral and positive sentiment classes, respectively.
Download from	https://github.com/HAfli/FooTweets_Corpus
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/471.html
Edition	LREC 2018
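
The FooTweets description encodes the three sentiment classes as the scores 0.3, 0.5 and 0.7. A minimal Python sketch of mapping those scores back to class names follows. The function and constant names are hypothetical, and how the score is stored in the distributed files is not specified here, so the sketch accepts either a string or a number.

```python
import math

# Score-to-class mapping as stated in the resource description.
CLASSES = ((0.3, "negative"), (0.5, "neutral"), (0.7, "positive"))


def sentiment_class(score):
    """Map a FooTweets sentiment score (0.3, 0.5 or 0.7) to its class name."""
    value = float(score)
    for reference, name in CLASSES:
        # Tolerant comparison avoids surprises from floating-point parsing.
        if math.isclose(value, reference, abs_tol=1e-9):
            return name
    raise ValueError(f"unexpected sentiment score: {score!r}")


print([sentiment_class(s) for s in ("0.3", 0.5, "0.7")])
# ['negative', 'neutral', 'positive']
```

Raising an error on any other value, rather than rounding to the nearest class, makes it obvious if a file contains scores outside the documented set.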

Name	German Multiword Expressions Scored for Compositionality (Unfiltered)
Resource type	Lexicon
Size	773468 entries
Languages	German (deu)
Production status	Newly created-finished
Resource usage	Multiword Expression Compositionality
License	CreativeCommons
Conditions of use	<Not Specified>
Description	A compositionality-ranked list of German multiword expressions, created using the methodology detailed in our paper; evaluation for this resource is not presented.
Download from	http://amor.cms.hu-berlin.de/~robertsw/files/lrec2016/MWE_de-unfiltered.utf8.txt.gz
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/650.html
Edition	LREC 2018

Name	GOST Lexicon
Resource type	Lexicon
Size	44613 lexemes
Languages	English
Production status	Newly created-finished
Resource usage	Corpus Creation/Annotation
License	Open Source
Conditions of use	Academic Use Only
Description	Created by Lancaster University, GOST Lexicon contains 433 single word bio-terms (see OntologySW-AllPaths.usas) and 44,180 multiword bio-terms (see OntologyMWE-AllPaths.usas). It has been merged into the Lancaster UCREL Semantic lexicons to create a new version of the Lancaster USAS semantic annotation system (Rayson et al., 2004; Piao et al., 2015; Piao et al., 2017), named GOST (Gene Ontology Semantic Tagger), in order to automatically annotate the bio-terms in medical journal articles with GO IDs, along with generic USAS semantic tags.
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/706_res_1.zip [2,72 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/706.html
Edition	LREC 2018

Name	HAI Alice-corpus
Resource type	Corpus
Size	9900 tokens
Languages	English
Production status	Newly created-finished
Resource usage	Question Answering
License	Creative Commons CC BY 4.0
Conditions of use	Attribution
Description	The resource currently contains the speech transcriptions of 15 human-agent dialogs. We will provide additional resources in the near future.
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/429_res_1.xml [95,1 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/429.html
Edition	LREC 2018

Name	HappyDB
Resource type	Corpus
Size	24 MByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Corpus Creation/Annotation
License	CreativeCommons
Conditions of use	Attribution
Description	HappyDB is a corpus of 100,000+ crowd-sourced happy moments. The goal of the corpus is to advance the state of the art of understanding the causes of happiness that can be gleaned from text.
Download from	https://rit-public.github.io/HappyDB/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/204.html
Edition	LREC 2018

Name	Hotels Dialogues and Utterances
Resource type	Corpus
Size	6000 sentences
Languages	English
Production status	Newly created-finished
Resource usage	Dialogue
License	<Not Specified>
Conditions of use	<Not Specified>
Description	Utterances collected using the methods described in the paper, and whole dialogues for a conversational agent in the hotels domain.
Download from	https://nlds.soe.ucsc.edu/hotels
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/763.html
Edition	LREC 2018

Name	https://github.com/PLN-FaMAF/ArgumentMiningECHR
Resource type	Corpus
Size	<Not Specified>
Languages	English
Production status	Newly created-in progress
Resource usage	Machine Learning
License	<Not Specified>
Conditions of use	<Not Specified>
Description	Corpus of Sentences of the European Court of Human Rights annotated with Argumentation concepts, namely Claims and justifications (Premises) attacking or supporting them.
Download from	https://github.com/PLN-FaMAF/ArgumentMiningECHR
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/1048.html
Edition	LREC 2018

Name	Humor Detection Classifier
Resource type	Corpus
Size	28.8 KByte
Languages	<Not Specified>
Production status	Newly created-finished
Resource usage	Classification of humor in texts
License	GNU GPL 3.0
Conditions of use	Preservation of Copyright Notice Share-Alike
Description	The corpus contains tweet IDs along with Humor(N) and Non-Humorous(N) tags. It was built for humor detection in English-Hindi code-mixed social media content.
Download from	https://github.com/Ankh2295/humor-detection-corpus
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/363.html
Edition	LREC 2018

Name	Hungarian Webcorpus
Resource type	Corpus
Size	1.48 billion words
Languages	Hungarian (hun)
Production status	Existing-used
Resource usage	Evaluation/Validation
License	Open Source
Conditions of use	<Not Specified>
Description	With over 1.48 billion words unfiltered (589m words fully filtered), this is by far the largest Hungarian language corpus, and unlike the Hungarian National Corpus (125m words), it is available in its entirety under a permissive Open Content license. The Hungarian webcorpus was crawled in the winter of 2003 as part of the WordSword project at the Media Research and Education Centre.
Download from	http://mokk.bme.hu/resources/webcorpus/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/730.html
Edition	LREC 2018

Name	Hungarian Wiktionary
Resource type	Lexicon
Size	335886 entries
Languages	Hungarian (hun)
Production status	Existing-used
Resource usage	Lexicon Creation/Annotation
License	Creative Commons CC BY-SA 3.0
Conditions of use	Attribution Share-Alike
Description	<Not Specified>
Download from	https://hu.wiktionary.org
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/312.html
Edition	LREC 2018

Name	IPSL: A Database of Iconicity Patterns in Sign Languages
Resource type	Lexicon
Size	193 KByte
Languages	Russian Sign Language (rsl) French Sign Language (fsl) American Sign Language (ase) British Sign Language (bfi) Spanish Sign Language (ssp)
Production status	Newly created-finished
Resource usage	Lexicon Creation/Annotation
License	Creative Commons CC BY-SA 3.0
Conditions of use	Attribution
Description	This is the first large-scale database of signs annotated according to various parameters of iconicity. The signs represent concrete concepts in seven semantic fields in nineteen sign languages; 1542 signs in total. Each sign was annotated with respect to the type of form-image association, the presence of iconic location and movement, personification, and with respect to whether the sign depicts a salient part of the concept. The database also serves as the basis of a website with several visualization tools for the data. It is possible to visualize iconic properties of separate concepts or iconic properties of semantic fields on the map of the world, and to build graphs representing iconic patterns for selected semantic fields.
Download from	https://sl-iconicity.shinyapps.io/iconicity_patterns/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/102.html
Edition	LREC 2018

Name	KIT-Multi
Resource type	Corpus
Size	140000 entries
Languages	English German French
Production status	Newly created-in progress
Resource usage	Knowledge Discovery/Representation
License	CreativeCommons
Conditions of use	<Not Specified>
Description	KIT-Multi is a multilingual embedding corpus, currently consisting of word embeddings for English, German and French. Other languages such as Chinese, Japanese, Korean, Vietnamese, Dutch, Italian, Romanian, Spanish or Portuguese are being added.
Download from	http://i13pc106.ira.uka.de/~tha/KIT-Multi
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/688.html
Edition	LREC 2018

Name	Konstanz Resource of Questions (KRoQ)
Resource type	Corpus
Size	140 MByte
Languages	German French Spanish Greek
Production status	Newly created-in progress
Resource usage	Question Classification
License	https://github.com/kkalouli/BIBLE-processing/blob/master/KRoQ/license
Conditions of use	<Not Specified>
Description	A Multilingual Approach to Question Classification. To cite this work use: Kalouli, A.-L., Kaiser, K., Hautli-Janisz, A., Kaiser, G., Butt, M. 2018. A Multilingual Approach to Question Classification. In Proceedings of the Language Resources and Evaluation Conference (LREC), Miyazaki, Japan.
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/13_res_1.zip [180 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/13.html
Edition	LREC 2018

Name	Korean L2 Unknown Words - Labeled Dataset
Resource type	Evaluation Data
Size	53202 word annotations
Languages	Korean (kor)
Production status	Newly created-finished
Resource usage	Supervised Machine Learning & Evaluation/Validation
License	Creative Commons CC BY-NC 4.0
Conditions of use	Attribution Non-Commercial
Description	This is a labeled dataset for training and/or evaluating unknown word prediction models for L2 learners of Korean. It was extracted from a corpus of passages annotated by L2 learners. To produce this dataset, each annotated word was normalized to its base form by removing inflectional and derivational suffixes, and duplicate annotations were removed so that there was at most one annotation per annotator-word pair. Each annotated word is labeled as either known or unknown by the annotator. Metadata about each annotator is also provided, including reported Korean proficiency level, estimated level based on the annotations provided, native language, country, etc.
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/272_res_1.zip [555 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/272.html
Edition	LREC 2018
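
The de-duplication step mentioned in the description above can be illustrated with a minimal Python sketch. The record layout (annotator id, base form, known/unknown flag) and the choice to keep the first annotation seen for each pair are assumptions made for illustration only; the description does not say which duplicate was kept or how the files are laid out.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (annotator_id, base_form, is_known).
Annotation = Tuple[str, str, bool]


def deduplicate(annotations: Iterable[Annotation]) -> List[Annotation]:
    """Keep at most one annotation per (annotator, word) pair.

    Assumption: the first annotation encountered for a pair is the one kept.
    """
    seen: Dict[Tuple[str, str], Annotation] = {}
    for annotator_id, base_form, is_known in annotations:
        key = (annotator_id, base_form)
        if key not in seen:
            seen[key] = (annotator_id, base_form, is_known)
    # dicts preserve insertion order, so the original ordering is retained
    return list(seen.values())
```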

Name	KRAUTS
Resource type	Corpus
Size	<Not Specified>
Languages	German (deu)
Production status	Newly created-finished
Resource usage	<Not Specified>
License	Creative Commons CC BY-NC 4.0
Conditions of use	Attribution Non-Commercial
Description	KRAUTS (Korpus of newspapeR Articles with Underlined Temporal expressionS) is a German temporally annotated news corpus accompanied by TimeML annotation guidelines for German. It was developed at Fondazione Bruno Kessler, Trento, Italy and at the Max Planck Institute for Informatics, Saarbrücken, Germany. Our goal is to boost temporal tagging research [1] for German.
Download from	https://github.com/JannikStroetgen/KRAUTS/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/436.html
Edition	LREC 2018

Name	lexfom
Resource type	Ontology
Size	100 KByte
Languages	<Not Specified>
Production status	Existing-updated
Resource usage	Knowledge Discovery/Representation
License	OpenSource
Conditions of use	<Not Specified>
Description	This is an ontology for representing the lexical functions of Meaning-Text Theory and, more generally, lexical relations.
Download from	https://github.com/alex-fonseca/lexfom
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/1102.html
Edition	LREC 2018

Name	LIA-msc
Resource type	Corpus
Size	210.8 KByte
Languages	Portuguese (por) Spanish (spa)
Production status	Newly created-finished
Resource usage	Summarisation
License	Lesser General Public License For Linguistic Resources
Conditions of use	https://dev.termwatch.es/~fresa/CORPUS/MSF2/lgpllr.html Preservation of Copyright Notice Share-Alike Notify substantive changes
Description	Multi-Sentence Compression (MSC) is a variation of Sentence Compression. MSC aims at analyzing a cluster of similar sentences to generate a new sentence, which is shorter than the average length of source sentences and has the key information of the cluster. MSC enables summarisation and question-answering systems to generate outputs combining fully formed sentences from one or several documents. We present a new annotated corpus in the Portuguese and Spanish languages for the MSC task. This corpus was collected from Portuguese and Spanish Google News and it is composed of clusters of similar sentences along with reference compression for each cluster.
Download from	http://dev.termwatch.es/~fresa/CORPUS/MSF2/index.html
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/275.html
Edition	LREC 2018

Name	LIdioms
Resource type	Lexicon
Size	147 KByte
Languages	English (en) Portuguese (pt) German (de) Italian (ita) Russian (ru)
Production status	Newly created-finished
Resource usage	Semantic Web
License	Creative Commons CC BY-NC-SA 3.0
Conditions of use	<Not Specified>
Description	The LIDIOMS data set is a multilingual RDF representation of idioms covering five languages. The data set is intended to support natural language processing applications by providing links between idioms across languages. The underlying data was crawled and integrated from various sources. To ensure the quality of the crawled data, all idioms were evaluated by at least two native speakers. LIDIOMS is linked with two well-known multilingual data sets, BabelNet and DBnary.
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/46_res_1.tgz [147 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/46.html
Edition	LREC 2018

Name	Lingmotif-lex
Resource type	Lexicon
Size	67400 entries
Languages	English Spanish
Production status	Newly created-in progress
Resource usage	Opinion Mining/Sentiment Analysis
License	<Not Specified>
Conditions of use	<Not Specified>
Description	Wide-coverage, manually curated sentiment lexicon featuring a fine-grained valence system and a sentiment shifter system, accessible through an accompanying Python 3 library.
Download from	http://tecnolengua.uma.es
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/457.html
Edition	LREC 2018

Name	Live blog Summarization Corpus
Resource type	Corpus
Size	778 MByte
Languages	English
Production status	Newly created-finished
Resource usage	Summarisation
License	Apache 2.0
Conditions of use	Preservation of Copyright Notice
Description	Live blogs are an increasingly popular news format to cover breaking news and live events in online journalism. Online news websites around the world are using this medium to give their readers a minute-by-minute update on an event. Good summaries enhance the value of the live blogs for a reader but are often not available. In this paper, we study a way of collecting corpora for automatic live blog summarization. In an empirical evaluation using well-known state-of-the-art summarization systems, we show that the live blog corpus poses new challenges in the field of summarization. We make our tools publicly available to reconstruct the corpus to encourage the research community and replicate our results.
Download from	https://github.com/UKPLab/lrec2018-live-blog-corpus
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/317.html
Edition	LREC 2018

Name	LOaDing
Resource type	Ontology
Size	1.1 MByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Word Sense Disambiguation
License	Creative Commons CC BY 4.0
Conditions of use	<Not Specified>
Description	LOaDing is a new resource that enriches the Framester knowledge graph, which links FrameNet, WordNet, VerbNet and other resources, with semantic features extracted from text corpora. Features are extracted from distributional semantics-based sense inventories and make it possible to connect the resource with text, for instance to boost the performance on Word Frame Disambiguation. Since Framester is a frame-based knowledge graph, which enables full-fledged OWL querying and reasoning, our resource paves the way for the development of novel, deeper semantic-aware applications that could benefit from the combination of knowledge from text and complex symbolic representations of events and participants.
Download from	http://data.dws.informatik.uni-mannheim.de/download/loading/ddt-wiki-n30-1400k-loading.zip
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/263.html
Edition	LREC 2018

Name	LX-DSemVectors 2.2b
Resource type	Lexicon
Size	<Not Specified>
Languages	Portuguese (por)
Production status	Newly created-in progress
Resource usage	<Not Specified>
License	Creative Commons CC BY 4.0
Conditions of use	<Not Specified>
Description	Distributional semantics model (aka word embeddings) for Portuguese, LX-DSemVectors 2.2b, trained over 2.2 billion tokens, with the largest vocabulary and the best intrinsic evaluation scores.
Download from	http://github.com/nlx-group
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/592.html
Edition	LREC 2018

Name	MCDTB
Resource type	Treebank
Size	294 KByte
Languages	Chinese
Production status	Newly created-in progress
Resource usage	Discourse
License	Creative Commons CC BY 4.0
Conditions of use	Attribution
Description	In view of the differences between the annotations of micro and macro discourse relationships, this paper describes the relevant experiments on the construction of the Macro Chinese Discourse Treebank (MCDTB), a higher-level Chinese discourse corpus. Following RST (Rhetorical Structure Theory), we annotate the macro discourse information, including discourse structure, nuclearity and relationship, and the additional discourse information, including topic sentences, lead and abstract, to make the macro discourse annotation more objective and accurate. Finally, we annotated 720 articles with a Kappa value greater than 0.6. Preliminary experiments on this corpus verify the computability of MCDTB.
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/147_res_1.zip [39,4 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/147.html
Edition	LREC 2018

Name	MGAD Syntactic Analogy Datasets
Resource type	Evaluation Data
Size	5 MByte
Languages	Arabic Russian Hindi
Production status	Newly created-in progress
Resource usage	Evaluation/Validation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	Syntactic analogy datasets for evaluating word embeddings in Arabic, Russian, and Hindi.
Download from	https://github.com/rutrastone/LREC2018
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/1022.html
Edition	LREC 2018

Name	MirasText
Resource type	Corpus
Size	15.3 GByte
Languages	Iranian Persian (pes)
Production status	Newly created-in progress
Resource usage	Language Modelling
License	MIT License
Conditions of use	Preservation of Copyright Notice
Description	This repository contains the MirasText corpus and its description, along with what it has been used for and what it can be used for. A sample of the dataset is provided in MirasText_sample.txt, which contains 1000 documents.
Download from	https://github.com/miras-tech/MirasText
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/385.html
Edition	LREC 2018

Name	MirasVoice
Resource type	Corpus
Size	20 GByte
Languages	Iranian Persian (pes) English (eng)
Production status	Newly created-in progress
Resource usage	Person Identification
License	Apache 2.0
Conditions of use	Preservation of Copyright Notice
Description	The MirasVoice Speech Corpus (MVSC) is one of the largest Farsi-English voice datasets currently available for general purpose studies and expert system development. Applications of this dataset include speaker recognition systems, speech recognition studies, gender recognition, cognitive science, and pattern recognition.
Download from	https://github.com/miras-tech/MirasVoice/blob/master/LICENSE
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/443.html
Edition	LREC 2018

Name	morphdb.hu
Resource type	Lexicon
Size	6.2 MByte
Languages	Hungarian (hun)
Production status	Existing-used
Resource usage	Morphological Analysis
License	Open Source
Conditions of use	<Not Specified>
Description	morphdb.hu is an open source morphological database of Hungarian, consisting of a lexicon and morphological grammar that are based on well-founded theoretical decisions.
Download from	http://mokk.bme.hu/resources/morphdb-hu/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/730.html
Edition	LREC 2018

Name	MPST: A Corpus of Movie Plot Synopses with Tags
Resource type	Corpus
Size	153 MByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Document Classification, Text categorisation
License	GNU GPL 3.0
Conditions of use	Preservation of Copyright Notice Share-Alike
Description	The corpus contains 14,828 plot synopses of movies and their multi-label associations with 71 fine-grained tags. The tagset was created by collecting tags from the MovieLens 20M dataset and IMDB, filtering tags related to the plots and grouping semantically similar tags together. These tags represent a wide range of information about the movies like their genres, plot structures, and emotional experiences that a viewer may feel after watching the movie. The plot synopses were collected from IMDB and Wikipedia and they all have at least 10 sentences.
Download from	http://ritual.uh.edu/mpst-2018/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/332.html
Edition	LREC 2018

Name	MSimlex999_Polish
Resource type	Evaluation Data
Size	1998 words
Languages	Polish (pol)
Production status	Newly created-in progress
Resource usage	Evaluation/Validation
License	Creative Commons CC BY
Conditions of use	<Not Specified>
Description	Polish translation of the SimLex-999 data (https://www.cl.cam.ac.uk/~fh295/simlex.html) with similarity and relatedness scores.
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/687_res_1.txt [29,9 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/687.html
Edition	LREC 2018

Name	Multimodal Lexical Translation Dataset
Resource type	Corpus
Size	98647 sentences
Languages	English German French
Production status	Newly created-finished
Resource usage	Evaluation/Validation
License	Creative Commons CC BY-NC-SA 4.0
Conditions of use	<Not Specified>
Description	Multimodal Lexical Translation Dataset is a collection of 4-tuples of the form: (x, y, X, V) where x is an ambiguous word, X is its textual context (a sentence in source language), V is its visual context (an image), and y is its translation that conforms with both the textual and visual contexts.
Download from	https://github.com/sheffieldnlp/mlt
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/629.html
Edition	LREC 2018

Name	MultiBooked Corpus
Resource type	Corpus
Size	29 MByte
Languages	Basque (eus) Catalan (cat)
Production status	Newly created-finished
Resource usage	Opinion Mining/Sentiment Analysis
License	Creative Commons
Conditions of use	<Not Specified>
Description	While sentiment analysis has become an established field in the NLP community, research into languages other than English has been hindered by the lack of resources. Although much research in multi-lingual and cross-lingual sentiment analysis has focused on unsupervised or semi-supervised approaches, these still require a large number of resources and do not reach the performance of supervised approaches. With this in mind, we introduce two datasets for supervised aspect-level sentiment analysis in Basque and Catalan, both of which are under-resourced languages. We provide high-quality annotations and benchmarks with the hope that they will be useful to the growing community of researchers working on these languages.
Download from	https://repositori.upf.edu/handle/10230/33928
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/217.html
Edition	LREC 2018

Name	Multilingual IsA (MIsA)
Resource type	Corpus
Size	2 GByte
Languages	English Spanish French Italian Dutch
Production status	Newly created-finished
Resource usage	Information Extraction, Information Retrieval
License	Creative Commons CC BY 4.0
Conditions of use	<Not Specified>
Description	Multilingual IsA (MISA) is a collection of hypernymy relations in five languages (i.e., English, Spanish, French, Italian and Dutch) extracted from the corresponding full Wikipedia corpus. The extraction, for each language, is based on an established set of existing (viz. found in literature) or newly defined lexico-syntactic patterns. Similarly to WebIsADb, the resulting resource contains hypernymy relations represented as "tuples", as well as additional information such as provenance, context of the extraction etc.
Download from	http://web.informatik.uni-mannheim.de/misa/download.html
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/254.html
Edition	LREC 2018

Name	N-gram Analogical Clusters and Analogical Grids
Resource type	Lexicon
Size	374 MByte
Languages	Danish German Modern Greek English Spanish
Production status	Newly created-finished
Resource usage	Morphological Analysis
License	Creative Commons CC BY-NC 4.0
Conditions of use	<Not Specified>
Description	This is a complete dataset of analogical clusters and analogical grids built on the vocabularies of 11 languages contained in 1,000 corresponding lines of the 11 different language versions of the Europarl corpus v.3. The grids were built on N-grams of different lengths, from words to 6-grams. We hope that the use of structured parallel data will foster research in comparative linguistics.
Download from	https://waseda.pure.elsevier.com/en/publications/tools-for-the-production-of-analogical-grids-and-a-resource-of-n-
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/344.html
Edition	LREC 2018

Name	Natural Stories GCG-Bank
Resource type	Treebank
Size	485 Other
Languages	English
Production status	Complete
Resource usage	<Not Specified>
License	GNU GPL 3.0
Conditions of use	Preservation of Copyright Notice Share-Alike Free
Description	Natural Stories GCG-Bank contains hand-corrected deep syntactic annotations for the Natural Stories self-paced reading corpus (Futrell et al., 2017). The annotations are designed to support psycholinguistic investigation into the structural determinants of sentence processing effort. Natural Stories GCG-Bank is distributed as a sub-module of the ModelBlocks repository, a code base designed to support broad-coverage psycholinguistic modeling.
Download from	https://github.com/modelblocks/modelblocks-release
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/9.html
Edition	LREC 2018

Name	NL2Bash
Resource type	Corpus
Size	12609 entries
Languages	English
Production status	Newly created-finished
Resource usage	Natural language to code generation
License	GNU GPL 3.0
Conditions of use	Attribution Share-Alike
Description	A parallel corpus of one-line Bash commands paired with their natural language descriptions.
Download from	https://github.com/TellinaTool/nl2bash
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/1021.html
Edition	LREC 2018

Name	NL2KB
Resource type	Terminology
Size	3.9 MByte
Languages	English
Production status	Newly created-finished
Resource usage	Information Extraction, Information Retrieval
License	MIT License
Conditions of use	<Not Specified>
Description	Two files are included in this release: - kb2nl.txt: The relational mappings from knowledge base (KB) predicates to natural language (NL) relation patterns. Each line corresponds to one of the 629 most frequent KB predicates in DBpedia; the columns in a line are separated by tags. The first column is the predicate in the format #p#{predicate_name}. The rest of the line shows the mapped NL relation patterns in the format: #r#{pattern} score - nl2kb.txt: The relational mappings from natural language (NL) relation patterns to knowledge base (KB) predicates.
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/94_res_1.tgz [3,88 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/94.html
Edition	LREC 2018
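
As an illustration of the kb2nl.txt layout described above, the following Python sketch parses one line into a predicate and its scored patterns. It rests on assumptions: that the braces in #p#{predicate_name} and #r#{pattern} are placeholders rather than literal characters, that each pattern is followed by a whitespace-separated numeric score, and that splitting on the #r# tag is sufficient. The function name and the sample line in the usage note are made up for illustration and are not taken from the released files.

```python
from typing import List, Optional, Tuple

ScoredPattern = Tuple[str, Optional[float]]


def parse_kb2nl_line(line: str) -> Tuple[str, List[ScoredPattern]]:
    """Parse one kb2nl.txt line under the assumed '#p#... #r#... score' layout."""
    head, *chunks = line.strip().split("#r#")
    head = head.strip()
    if not head.startswith("#p#"):
        raise ValueError(f"expected a '#p#' predicate column, got {head!r}")
    predicate = head[len("#p#"):].strip()

    patterns: List[ScoredPattern] = []
    for chunk in chunks:
        body = chunk.strip()
        if not body:
            continue
        # Assumption: the score is the last whitespace-separated token.
        parts = body.rsplit(None, 1)
        if len(parts) == 2:
            patterns.append((parts[0], float(parts[1])))
        else:
            patterns.append((body, None))  # no score found on this pattern
    return predicate, patterns
```

For a made-up line such as "#p#birthPlace", followed by "#r#was born in 0.8", the function would return the predicate name and a list of (pattern, score) pairs; a real line may deviate from this layout and should be checked against the files themselves.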

Name	Online Drug User Guideline Corpus
Resource type	Corpus
Size	1.4 MByte
Languages	English
Production status	Newly created-in progress
Resource usage	Information Extraction, Information Retrieval
License	Creative Commons CC BY 4.0
Conditions of use	Attribution
Description	The resource is the dataset or corpus presented in the paper. It is publicly available.
Download from	https://zenodo.org/record/1173345#.WoTZkJM-f-Y
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/426.html
Edition	LREC 2018

Name	Open Source International Arabic News Corpus
Resource type	Corpus
Size	157000000 tokens (roughly)
Languages	Arabic
Production status	First version
Resource usage	<Not Specified>
License	OpenSource
Conditions of use	Public (free)
Description	The Open Source International Arabic News (OSIAN) corpus has been collected from international Arabic news websites like CNN, DW, RT, Aljazeera, among others. With a server-friendly crawling policy we extracted 1 million web pages. After necessary cleaning and filtering steps, the OSIAN corpus has 477,556 articles comprising 2,861,944 sentences and roughly 157 million words. The corpus is encoded in XML; each article is annotated with metadata giving its web location and the date of its extraction. Moreover, each word is annotated with lemma and part-of-speech.
Download from	http://oujda-nlp-team.net/en/corpora/osian-corpus/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/7.html
Edition	LREC 2018

Name	Open-Content Text Corpus
Resource type	Corpus
Size	28 MByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Corpus Creation/Annotation
License	Annotations: Creative Commons CC-BY 4.0 license Original content from ClueWeb12 keeps its original license Annotation tool: GNU General Public License v3.0
Conditions of use	<Not Specified>
Description	The repository contains the corpus that was created for the publication 'Beyond Generic Summarization: A Multi-faceted Hierarchical Summarization Corpus of Large Heterogeneous Data', the annotation tool that was developed for that purpose, and an example Amazon Mechanical Turk HIT. The Corpus: The Corpus folder contains the following. The SourceDocuments folder holds the .xml files of all source topics and a .txt file with the topic names. The AMTAllNuggets folder holds a tab-delimited csv file with all annotations from Amazon Mechanical Turk in the format worker [tab] annotation. The turker IDs have been hashed in order to anonymize them. The Trees folder holds the input documents for the tree annotation, the trees from three annotators, and the gold standard trees created from these trees. The Annotation tool: The AnnotationTool folder holds the annotation tool as a Java archive as well as the source code and documentation of the tool. The HIT-Template: The HIT-Template folder holds an example HIT along with the JavaScript and stylesheet.
Download from	https://github.com/AIPHES/HierarchicalSummarization
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/252.html
Edition	LREC 2018
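
The "worker [tab] annotation" file described above could be read along the following lines. This is a sketch only: it assumes the file has no header row and no quoting, and the function name is hypothetical; the actual file in the AMTAllNuggets folder may differ.

```python
import csv
from collections import defaultdict
from typing import Dict, List, TextIO


def read_annotations(handle: TextIO) -> Dict[str, List[str]]:
    """Group annotations by (hashed) worker id from a tab-delimited file."""
    by_worker: Dict[str, List[str]] = defaultdict(list)
    # Assumption: plain tab-separated rows, no quoting, no header line.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        worker, annotation = row[0], row[1]
        by_worker[worker].append(annotation)
    return dict(by_worker)
```

Any open text file (or an io.StringIO object) can be passed as the handle; the result maps each hashed worker id to that worker's annotations in file order.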

Name	ORTOFON v1: balanced corpus of informal spoken Czech with multi-tier transcription (transcriptions)
Resource type	Corpus
Size	1000000 words
Languages	Czech (ces)
Production status	Newly created-finished
Resource usage	Speech Recognition/Understanding
License	Creative Commons CC BY-NC-SA 4.0
Conditions of use	Attribution Non-Commercial Share-Alike
Description	ORTOFON v1 is designed as a representation of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) across the whole Czech Republic. The corpus is composed of 332 recordings from 2012–2017 and contains 1,014,786 orthographic words (i.e. a total of 1,236,508 tokens including punctuation); a total of 624 different speakers appear in the probes. ORTOFON v1 is fully balanced regarding the basic sociolinguistic speaker categories (gender, age group, level of education and region of childhood residence). The transcription is linked to the corresponding audio track. Unlike the ORAL-series corpora, the transcription was carried out on two main tiers, orthographic and phonetic, supplemented by an additional metalanguage tier. ORTOFON v1 is lemmatized and morphologically tagged. The (anonymized) corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query engine to registered users of the CNC at http://www.korpus.cz. Please note: this item includes only the transcriptions; the audio (and the transcripts in their original format) is available under a more restrictive non-CC license at http://hdl.handle.net/11234/1-2579. Published by: Charles University, Faculty of Arts, Institute of the Czech National Corpus. Acknowledgements: This resource was created within the Czech National Corpus project (LM2015044) funded by the Ministry of Education, Youth and Sports of the Czech Republic within the framework of Large Research, Development and Innovation Infrastructures.
Download from	http://hdl.handle.net/11234/1-2580
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/833.html
Edition	LREC 2018

Name	Parallel English-Persian Corpus (PEPC)
Resource type	Corpus
Size	200000 sentences
Languages	English Persian
Production status	Newly created-finished
Resource usage	Machine Translation, SpeechToSpeech Translation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	PEPC is a collection of parallel English and Persian sentences extracted from Wikipedia documents using a bidirectional translation method.
Download from	https://iasbs.ac.ir/~ansari/nlp/pepc.html
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/674.html
Edition	LREC 2018

Name	Persian word embeddings
Resource type	Corpus
Size	200 MByte
Languages	Iranian Persian (pes)
Production status	Newly created-finished
Resource usage	Machine Learning
License	<Not Specified>
Conditions of use	<Not Specified>
Description	ArmanPersoNERCorpus is the first manually-annotated Persian named-entity (NE) dataset (ISLRN 399-379-640-828-6). We are releasing it only for academic research use. The data set includes 250,015 tokens and 7,682 Persian sentences in total. It is available in 3 folds to be used in turn as training and test sets. Each file contains one token, along with its manually annotated named-entity tag, per line. Each sentence is separated with a newline. The NER tags are in IOB format. According to the instructions provided to the annotators, NEs are categorized into six classes: person, organization (such as banks, ministries, embassies, teams, nationalities, networks and publishers), location (such as cities, villages, rivers, seas, gulfs, deserts and mountains), facility (such as schools, universities, research centers, airports, railways, bridges, roads, harbors, stations, hospitals, parks, zoos and cinemas), product (such as books, newspapers, TV shows, movies, airplanes, ships, cars, theories, laws, agreements and religions), and event (such as wars, earthquakes, national holidays, festivals and conferences); other are the remaining tokens. Four different word embeddings are trained on a sizable collation of unannotated Persian text. They contain a comprehensive Persian dictionary of nearly 50K unique words. The length of the embedding vectors is 300. The use of these embeddings is unrestricted.
Download from	https://github.com/HaniehP/PersianNER
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/48.html
Edition	LREC 2018
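
The file layout described for ArmanPersoNERCorpus (one token and its tag per line, sentences separated by an empty line, IOB tags) can be read with a few lines of Python. This is only a minimal sketch: it assumes the token and tag are separated by whitespace, and the tag names in the usage note (B-PER, I-PER, B-LOC) are hypothetical placeholders, not necessarily the labels used in the released files.

```python
from typing import Iterable, List, Optional, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, tag) pairs


def read_iob(lines: Iterable[str]) -> List[Sentence]:
    """Group 'token tag' lines into sentences; an empty line ends a sentence."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.rsplit(maxsplit=1)  # assumption: whitespace-separated
        if len(parts) != 2:
            continue  # skip lines that do not look like 'token tag'
        current.append((parts[0], parts[1]))
    if current:
        sentences.append(current)
    return sentences


def iob_spans(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collect (entity text, label) pairs from B-/I- tags in one sentence."""
    spans: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label: Optional[str] = None
    for token, tag in sentence:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            if tokens:
                spans.append((" ".join(tokens), label))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-"):
            tokens.append(token)
        else:  # 'O' or any other non-entity tag closes the open span
            if tokens:
                spans.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        spans.append((" ".join(tokens), label))
    return spans
```

With an open file handle `f` on one of the fold files, `read_iob(f)` would return the sentences and `iob_spans(sentence)` the entity spans of one sentence, provided the files follow the assumed layout.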

Name	PhotoshopQuiA
Resource type	Corpus
Size	2854 entries
Languages	English
Production status	Newly created-finished
Resource usage	Question Answering
License	Creative Commons CC BY-SA 3.0
Conditions of use	Attribution Share-Alike
Description	We introduce the PhotoshopQuiA dataset, a new publicly available set of 2,854 why-question and answer (WhyQ, A) pairs related to Adobe Photoshop usage collected from five CQA web sites. We chose Adobe Photoshop because it is a popular and well-known product, with a lively, knowledgeable and sizable community. To the best of our knowledge, this is the first English dataset for Why-QA that focuses on a product, as opposed to previous open-domain datasets. The corpus is stored in JSON format and contains detailed data about questions and questioners as well as answers and answerers. The dataset can be used to build Why-QA systems, to evaluate current approaches for answering why-questions, and to develop new models for future QA systems research.
Download from	https://github.com/dulceanu/photoshop-quia/blob/master/dataset/PhotoshopQuiA.json
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/758.html
Edition	LREC 2018

Name	PolNARC-2016
Resource type	Corpus
Size	18300 Attribution relations
Languages	English
Production status	Newly created-in progress
Resource usage	Information Extraction, Information Retrieval
License	MIT License
Conditions of use	Preservation of Copyright Notice
Description	The Political News Attribution Relations Corpus annotates the attribution of direct and indirect quotes, as well as private states expressing belief and intention, using an annotation scheme derived from that of PARC3.
Download from	https://github.com/networkdynamics/PolNeAR
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/1051.html
Edition	LREC 2018

Name	Portuguese Multiword Expressions Scored for Compositionality (Unfiltered)
Resource type	Lexicon
Size	184662 entries
Languages	Portuguese (por)
Production status	Newly created-finished
Resource usage	Multiword Expression Compositionality
License	CreativeCommons
Conditions of use	<Not Specified>
Description	A compositionality-ranked list of Portuguese multiword expressions, created using the methodology detailed in our paper; evaluation for this resource is not presented.
Download from	http://amor.cms.hu-berlin.de/~robertsw/files/lrec2016/MWE_pt-unfiltered.utf8.txt.gz
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/650.html
Edition	LREC 2018

Name	POS Tagged Dialectal Arabic Data
Resource type	Corpus
Size	600 KByte
Languages	Egyptian Arabic Levantine Arabic Gulf Arabic Maghrebi Arabic
Production status	Newly created-finished
Resource usage	Part-of-Speech Tagging
License	CreativeCommons
Conditions of use	<Not Specified>
Description	350 tweets for four major Arabic dialects that were manually segmented and POS tagged
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/562_res_1.tgz [582 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/562.html
Edition	LREC 2018

Name	PoSTWITA-UD
Resource type	Treebank
Size	124410 tokens
Languages	Italian
Production status	Newly created-finished
Resource usage	Parsing and Tagging
License	Creative Commons CC BY-NC-SA 4.0
Conditions of use	Attribution Non-Commercial Share-Alike
Description	PoSTWITA-UD is a collection of Italian tweets annotated in Universal Dependencies that can be exploited for the training of NLP systems to enhance their performance on social media texts.
Download from	https://github.com/UniversalDependencies/UD_Italian-PoSTWITA
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/636.html
Edition	LREC 2018

Name	Prague Dependency Treebank 3.5 (PDT 3.5)
Resource type	Corpus
Size	50000 sentences
Languages	Czech (ces)
Production status	Existing-used
Resource usage	Corpus Creation/Annotation
License	Creative Commons CC BY-NC-SA 4.0
Conditions of use	Attribution Non-Commercial Share-Alike
Description	The Prague Dependency Treebank 3.5 is a 2018 edition of the core Prague Dependency Treebank (PDT). It contains all PDT annotation made at the Institute of Formal and Applied Linguistics under various projects between 1996 and 2018 on the original texts.
Download from	http://hdl.handle.net/11234/1-2621
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/20.html
Edition	LREC 2018

Name	Relational Noun Lexicon
Resource type	Lexicon
Size	6224 words
Languages	English
Production status	Newly created-finished
Resource usage	Information Extraction, Information Retrieval
License	Open Source
Conditions of use	<Not Specified>
Description	A lexicon of 6,224 nouns, annotated as either relational or non-relational (1,446 are relational), for use in relation extraction systems and other NLP applications.
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/461_res_1.tsv [64,0 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/461.html
Edition	LREC 2018

Name	Risamálheild (Icelandic Gigaword Corpus)
Resource type	Corpus
Size	1.3 billion words
Languages	Icelandic
Production status	Newly created-in progress
Resource usage	Machine Learning
License	Part 1 under custom user licence (full text : http://www.malfong.is/files/userlicense_rmh1_download_en.pdf) Part 2 under CC-BY 4.0
Conditions of use	Part 1 Research Use No Commercial exploitation of source material Attribution Part 2 : Attribution
Description	A large corpus with more than one billion running words from contemporary Icelandic texts. The two main sources are official texts and texts from news media. The corpus texts are morphosyntactically tagged and provided with metadata.
Download from	http://www.malfong.is/index.php?lang=en&pg=rmh
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/746.html
Edition	LREC 2018

Name	rlfowl
Resource type	Ontology
Size	22 MByte
Languages	French (fra)
Production status	Newly created-in progress
Resource usage	Knowledge Discovery/Representation
License	OpenSource
Conditions of use	<Not Specified>
Description	Ontology representation of the French Lexical Network.
Download from	https://github.com/alex-fonseca/rlfowl
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/1102.html
Edition	LREC 2018

Name	Russian Wiktionary
Resource type	Lexicon
Size	861467 entries
Languages	Russian (rus)
Production status	Existing-used
Resource usage	Lexicon Creation/Annotation
License	Creative Commons CC BY-SA 3.0
Conditions of use	Attribution Share-Alike
Description	<Not Specified>
Download from	https://ru.wiktionary.org
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/312.html
Edition	LREC 2018

Name	Sample IS-pros Corpus
Resource type	Corpus
Size	59.9 MByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Corpus Creation/Annotation
License	GNU GPL 3.0
Conditions of use	Preservation of Copyright Notice Share-Alike
Description	This sample corpus includes the pitch and intensity objects created from speech samples by twelve speakers of American English reading 109 sentences from pieces of news. Sentences are annotated with hierarchical thematicity in TextGrid format. After running the processing pipeline presented in the papers and made available in this submission, the corpus will be annotated with thematicity and acoustic features in both TextGrid and CSV formats.
Download from	https://github.com/TalnUPF/compilationISpros
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/530.html
Edition	LREC 2018

Name	Sarcasm Target Dataset
Resource type	Corpus
Size	<Not Specified>
Languages	English
Production status	Newly created-finished
Resource usage	Opinion Mining/Sentiment Analysis
License	<Not Specified>
Conditions of use	<Not Specified>
Description	The dataset has two domains: book snippets and tweets. Each entity in the dataset is a sarcastic text while the label is either (a) a subset of words in the sentence that point to the sarcasm target, or (b) a fall-back label 'Outside'.
Download from	https://github.com/Pranav-Goel/Sarcasm-Target-Detection
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/583.html
Edition	LREC 2018

Name	Self-Annotated Reddit Corpus (SARC)
Resource type	Corpus
Size	200 GByte
Languages	English
Production status	Newly created-finished
Resource usage	Opinion Mining/Sentiment Analysis
License	Open Source
Conditions of use	<Not Specified>
Description	A large corpus for sarcasm research and for training and evaluating systems for sarcasm detection.
Download from	http://nlp.cs.princeton.edu/SARC/2.0/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/160.html
Edition	LREC 2018

Name	Semantic verb classes for English, Polish, and Croatian
Resource type	Evaluation Data
Size	267 lexemes
Languages	English (eng) Polish (pol) Croatian (hrv)
Production status	Newly created-in progress
Resource usage	Lexicon Creation/Annotation
License	Creative Commons CC BY
Conditions of use	Attribution
Description	The classifications are a result of semantic clustering experiments with native speakers asked to classify a sample of 267 verbs into soft clusters based solely on their meaning (i.e. no reference to verbs' syntactic behaviour was required), aimed at verifying whether semantic verb classes can be reliably obtained from non-expert human annotators following simple instructions.
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/116_res_1.zip [52.7 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/116.html
Edition	LREC 2018

Name	SenSALDO
Resource type	Lexicon
Size	69700 words
Languages	Swedish
Production status	Newly created-finished
Resource usage	Opinion Mining/Sentiment Analysis
License	Creative Commons CC BY
Conditions of use	Attribution
Description	Sentiment lexicon for Swedish, based on word senses in SALDO 2.3. Sentiment values from -1 to +1, and also discrete values in {-1,0,+1}.
Download from	https://spraakbanken.gu.se/eng/resource/sensaldo
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/857.html
Edition	LREC 2018

Name	Sentiment Lexicon of IDiomatic Expressions (SLIDE)
Resource type	Lexicon
Size	5000 entries
Languages	English
Production status	Newly created-finished
Resource usage	Opinion Mining/Sentiment Analysis
License	© Copyright Wikipedia CC-BY-SA 3.0 © Copyright IBM 2014. Released under CC-BY-SA 4.0
Conditions of use	<Not Specified>
Description	The Sentiment Lexicon of IDiomatic Expressions (SLIDE) is a large idiom sentiment lexicon, which includes 5,000 frequently occurring idioms.
Download from	http://www.research.ibm.com/haifa/dept/vst/debating_data.shtml
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/602.html
Edition	LREC 2018

Name	SMS Test Collection
Resource type	Evaluation Data
Size	800 KByte
Languages	English
Production status	Newly created-finished
Resource usage	Information Extraction, Information Retrieval
License	<Not Specified>
Conditions of use	<Not Specified>
Description	This resource contains SMS topics and relevance judgments in TREC style for evaluating information retrieval systems. The collection of SMS conversations itself can be requested or purchased from the Linguistic Data Consortium (LDC) (https://www.ldc.upenn.edu/).
Download from	https://github.com/rashmisankepally/SMSTestCollection/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/695.html
Edition	LREC 2018
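
For readers unfamiliar with TREC-style relevance judgments, the sketch below parses the conventional four-column qrels layout (topic, iteration, document id, relevance) and computes precision at k for a ranked list. It is an illustration under assumptions: the files in this collection may deviate from that layout, and the topic and message ids in the usage note are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def read_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Parse 'topic iteration doc_id relevance' lines into nested dicts."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for raw in lines:
        parts = raw.split()
        if len(parts) != 4:
            continue  # ignore blank or malformed lines
        topic, _iteration, doc_id, relevance = parts
        qrels[topic][doc_id] = int(relevance)
    return dict(qrels)


def precision_at_k(ranked: List[str], judged: Dict[str, int], k: int) -> float:
    """Fraction of the top-k ranked ids judged relevant (relevance > 0)."""
    if k <= 0:
        return 0.0
    # Unjudged documents are treated as non-relevant.
    hits = sum(1 for doc_id in ranked[:k] if judged.get(doc_id, 0) > 0)
    return hits / k
```

For example, with judgments loaded via `read_qrels(open_file)`, a system's ranking for a hypothetical topic `T1` could be scored as `precision_at_k(ranking, qrels["T1"], 10)`.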

Name	Spanish Multiword Expressions Scored for Compositionality (Unfiltered)
Resource type	Lexicon
Size	277960 entries
Languages	Spanish (spa)
Production status	Newly created-finished
Resource usage	Multiword Expression Compositionality
License	CreativeCommons
Conditions of use	<Not Specified>
Description	A compositionality-ranked list of Spanish multiword expressions, created using the methodology detailed in our paper; evaluation for this resource is not presented.
Download from	http://amor.cms.hu-berlin.de/~robertsw/files/lrec2016/MWE_es-unfiltered.utf8.txt.gz
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/650.html
Edition	LREC 2018

Name	Spokes Mix
Resource type	Corpus
Size	2200000 words
Languages	Polish
Production status	Existing-updated
Resource usage	Speech Recognition/Understanding
License	Creative Commons CC BY-NC
Conditions of use	<Not Specified>
Description	Spokes Mix is an online service providing access to a number of spoken corpora of Polish, including three newly released time-aligned collections of manually transcribed spoken-conversational data.
Download from	http://pelcra.clarin-pl.eu/spokes2-web/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/888.html
Edition	LREC 2018

Name	Stars2T corpus of time-constrained referring expressions
Resource type	Corpus
Size	368 annotated referring expressions
Languages	Portuguese (por)
Production status	Newly created-finished
Resource usage	Natural Language Generation
License	Creative Commons CC BY 4.0
Conditions of use	Attribution
Description	The Stars2 corpus is a collection of annotated definite descriptions elicited from visual stimuli in time-constrained situations of communication. The corpus may be used as a standard dataset for referring expression generation (REG) with a particular focus on time constraints (and, as it turns out, on the issue of referential overspecification).
Download from	https://drive.google.com/open?id=0B-KyU7T8S8bLYzNtWTJfWGszdk0
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/39.html
Edition	LREC 2018

Name	STREUSLE 4.0
Resource type	Corpus
Size	55000 words
Languages	English
Production status	Existing-updated
Resource usage	Corpus Creation/Annotation
License	Creative Commons CC BY-SA 4.0
Conditions of use	Attribution Share-Alike
Description	STREUSLE stands for Supersense-Tagged Repository of English with a Unified Semantics for Lexical Expressions. The text is from the web reviews portion of the English Web Treebank. STREUSLE incorporates comprehensive annotations of multiword expressions (MWEs) and semantic supersenses for lexical expressions. The supersense labels apply to single- and multiword noun and verb expressions, and prepositional/possessive expressions. The 4.0 release updates the inventory and application of preposition supersenses, applies those supersenses to possessives, incorporates the syntactic annotations from the Universal Dependencies project, and adds lexical category labels to indicate the holistic grammatical status of strong multiword expressions.
Download from	https://github.com/nert-gu/streusle
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/963.html
Edition	LREC 2018

Name	STS.news.sr
Resource type	Corpus
Size	1192 sentence pairs
Languages	Serbian (srp)
Production status	Newly created-finished
Resource usage	Semantic Textual Similarity
License	Creative Commons CC BY-NC-SA 4.0
Conditions of use	Attribution Non-Commercial Share-Alike
Description	The Serbian STS News Corpus (ISLRN 146-979-597-345-4) consists of 1192 pairs of sentences in Serbian gathered from news sources on the web. Each sentence pair was manually annotated with fine-grained semantic similarity scores on the 0-5 scale. The final scores were obtained by averaging the individual scores of five annotators.
Download from	http://vukbatanovic.github.io/STS.news.sr/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/442.html
Edition	LREC 2018
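
The scoring step described for STS.news.sr (five annotators, a 0-5 scale, final score by averaging) amounts to the small computation sketched below. The function name and the range check are illustrative assumptions, not part of the released resource.

```python
from statistics import mean
from typing import Sequence


def final_similarity(scores: Sequence[float]) -> float:
    """Average the individual annotator scores for one sentence pair."""
    if not scores:
        raise ValueError("at least one annotator score is required")
    if any(not 0 <= s <= 5 for s in scores):
        raise ValueError("scores are expected on the 0-5 scale")
    return float(mean(scores))
```

Calling `final_similarity` with the five annotator scores of a pair returns the averaged score for that pair.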

Name	Swedish Literary Corpus
Resource type	Corpus
Size	178.2 KByte
Languages	Swedish (swe)
Production status	Newly created-in progress
Resource usage	Written Dialogue
License	GNU GPL 3.0
Conditions of use	Preservation of Copyright Notice Share-Alike Attribution
Description	This corpus consists of chapters from novellas by four Swedish authors. The novels used are: - August Strindberg, The Red Room (1879; obtained from the National Edition of August Strindberg’s Collected Works, published in 1981), - Hjalmar Söderberg, The Serious Game (1912), - Birger Sjöberg, The Quartet That Split Up, part I (1924), - Karin Boye, Kallocain (1940). In each file, each line within dialogues is annotated with the following information: - Speaker, - Addressee, - Speaker type, - Addressee type
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/1036_res_1.tgz [174 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/1036.html
Edition	LREC 2018

Name	Synthetic source Korean-Arabic bilingual corpus
Resource type	Corpus
Size	450000 sentences
Languages	Korean Arabic
Production status	Newly created-finished
Resource usage	Machine Translation, SpeechToSpeech Translation
License	OpenSource
Conditions of use	<Not Specified>
Description	The paper uses WIT3, OPUS, a production corpus and synthetic corpora. We share here our synthetic source Korean-Arabic bilingual corpus. The other corpora (WIT3, OPUS, the synthetic target corpus and the production corpus) cannot be shared in the LRE Map. WIT3 and OPUS are open-source corpora, so anyone can obtain them from their respective sites. The production corpus was paid for and cannot be shared.
Download from	https://github.com/ChoiGH/For_LRE_Map_corpus
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/139.html
Edition	LREC 2018

Name	Szeged Corpus
Resource type	Corpus
Size	82000 sentences
Languages	Hungarian (hun)
Production status	Existing-used
Resource usage	Evaluation/Validation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	A concept space for Explicit Semantic Analysis (ESA) (Gabrilovich & Markovitch 2007). ESA is a technique that provides a semantic representation of text in a space of concepts derived from Wikipedia. ESA defines concepts from Wikipedia articles, e.g. BARACK OBAMA and COMPUTER SCIENCE. This resource is a concept space created from a Wikipedia snapshot (April 2017).
Download from	http://rgai.inf.u-szeged.hu/index.php?lang=en&page=SzegedTreebank
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/730.html
Edition	LREC 2018

Name	T-Rex : A Large Scale Alignment of Natural Language with Knowledge Base Triples
Resource type	Corpus
Size	4.4 GByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Information Extraction, Information Retrieval
License	Creative Commons CC BY-SA 4.0
Conditions of use	Attribution Share-Alike
Description	Alignments between natural language and Knowledge Base (KB) triples are an essential prerequisite for training machine learning approaches employed in a variety of Natural Language Processing problems. These include Relation Extraction, KB Population, Question Answering and Natural Language Generation from KB triples. Available datasets that provide those alignments are plagued by significant shortcomings – they are of limited size, they exhibit a restricted predicate coverage, and/or they are of unreported quality. To alleviate these shortcomings, we present T-REx, a dataset of large scale alignments between Wikipedia abstracts and Wikidata triples. T-REx consists of 11 million triples aligned with 3.09 million Wikipedia abstracts (6.2 million sentences). T-REx is two orders of magnitude larger than the largest available alignments dataset and covers 2.5 times more predicates. Additionally, we stress the quality of this language resource thanks to an extensive crowdsourcing evaluation. T-REx is publicly available at https://w3id.org/t-rex.
Download from	http://w3id.org/t-rex
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/632.html
Edition	LREC 2018

Name	TAP-DLND 1.0 : A Corpus for Document Level Novelty Detection
Resource type	Corpus
Size	17.3 MByte
Languages	English (eng)
Production status	Newly created-in progress
Resource usage	Document Classification, Text categorisation
License	Creative Commons CC BY-NC-SA 4.0
Conditions of use	Attribution Non-Commercial Share-Alike
Description	TAP-DLND 1.0 is a document-level annotated corpus for novelty detection. The corpus is created via event-specific topical crawling of news reports from the web, as they develop over time. We view novelty as an ordered update over existing knowledge. We collect information on 223 events from 10 different domains. For each event we fix 3 initial documents as the source and ask the annotators to annotate the other documents for that event as novel or non-novel based on information coverage and human judgment. We leave ambiguous cases out of the resource. Thus we create a resource of 5,440 annotated documents that manifests certain criteria for novelty detection as discussed in the original manuscript.
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/479_res_1.tgz [5,67 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/479.html
Edition	LREC 2018

Name	Test set for Chinese nonlocal dependencies
Resource type	Corpus
Size	1 MByte
Languages	Mandarin Chinese (cmn)
Production status	Newly created-in progress
Resource usage	Emotion Recognition/Generation
License	GNU GPL 3.0
Conditions of use	Preservation of Copyright Notice Share-Alike
Description	It contains nonlocal dependency test data for eight nonlocal dependency constructions in Mandarin Chinese. Each test set contains around 100 sentences except for extractions from embedded clauses because they occur rarely in the data.
Download from	https://github.com/modelblocks/modelblocks-release
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/981.html
Edition	LREC 2018

Name	The Epic Epigraph Graph
Resource type	Corpus
Size	10000 epigraphs
Languages	English
Production status	Newly created-in progress
Resource usage	Distant Reading
License	Creative Commons CC BY 4.0
Conditions of use	Attribution
Description	This is a collection of literary epigraphs, showing each epigraph, its source and where it is used. It is still a work in progress: it is already large enough to be interesting, but needs some checking and normalization. We also intend to make it bigger (roughly twice as large).
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/610_res_1.tsv [2,77 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/610.html
Edition	LREC 2018

Name	TSix
Resource type	Corpus
Size	3.9 MByte
Languages	English
Production status	Newly created-finished
Resource usage	Evaluation/Validation
License	MIT License
Conditions of use	<Not Specified>
Description	The dataset includes six events collected from Twitter from October 10 to November 9, 2016. The gold-standard references were created by humans, which allows extractive methods to be evaluated correctly.
Download from	http://lrec2018.lrec-conf.org/sharedlrs2018/516_res_1.zip [3,74 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/516.html
Edition	LREC 2018

Name	UDLexicons
Resource type	Lexicon
Size	<Not Specified>
Languages	<Not Specified>
Production status	Newly created-in progress
Resource usage	Parsing and Tagging
License	Open Source licences (depends on the lexicon)
Conditions of use	Depends on the lexicon
Description	Multilingual collection of morphological lexicons that follow the guidelines and format of the Universal Dependencies initiative, created based on existing freely available resources.
Download from	http://pauillac.inria.fr/~sagot/udlexicons.html
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/705.html
Edition	LREC 2018

Name	ViDA
Resource type	Corpus
Size	5065 functional segments
Languages	Vietnamese (vie)
Production status	Newly created-finished
Resource usage	Conversation mining, detection of emotion/sentiment in conversation, automatic dialect/accent detection
License	<Not Specified>
Conditions of use	<Not Specified>
Description	This corpus is a new language resource in Vietnamese consisting of dialogues with dialog act annotation according to the ISO 24617-2 (2012) standard, emotion tagging at the functional-segment level according to Ekman's (1972) list of basic emotions, and sentiment annotation. We use spoken text from IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7 for annotation. IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7 was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 201 hours of Vietnamese conversational and scripted telephone speech collected in 2012 along with corresponding transcripts.
Download from	https://drive.google.com/file/d/0B6xRTY1wmqt8UWxwcDBkOXVpalk/view
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/942.html
Edition	LREC 2018

Name	Vision-grounded dataset of human ratings
Resource type	Corpus
Size	52 KByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Thematic role filling
License	<Not Specified>
Conditions of use	<Not Specified>
Description	This dataset consists of a csv file with 4 columns: - an "id" column (integer) containing the numbering of the ratings from 1 to 2000 - a "verb" column (string/character) containing the verbs - a "+X8location" column (string/character) containing the locations - an "avr_rating" column (float/numeric) containing the average ratings. NB: The average rating is the average of 10-11 ratings from different workers on Amazon MTurk, obtained as part of an online experiment. Basic statistics of the collected data: Min. 0.000, 1st Qu. 2.800, Median 4.200, Mean 4.096, 3rd Qu. 5.400, Max. 7.000; standard deviation: 1.55390718673. For other details see the paper.
Download from	http://datasets.d2.mpi-inf.mpg.de/arohrbach/datasetV1.csv
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/1089.html
Edition	LREC 2018
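
The four-column layout described for this dataset can be read with a few lines of Python. The following is a minimal sketch, not the authors' tooling: the sample rows are invented for illustration, and the header spellings (in particular "location", which the record lists as "+X8location") are assumptions that should be checked against the first line of the downloaded file.

```python
import csv
import io
import statistics

# Hypothetical sample rows for illustration only; the real data is the
# csv file linked in the record above. Header names are assumed from the
# description and may differ in the actual file.
SAMPLE = """id,verb,location,avr_rating
1,cook,kitchen,6.4
2,swim,library,0.8
3,read,library,5.9
4,sleep,kitchen,2.1
"""


def load_ratings(text):
    """Parse csv text into a list of dicts with typed fields."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({
            "id": int(rec["id"]),
            "verb": rec["verb"],
            "location": rec["location"],
            "avr_rating": float(rec["avr_rating"]),
        })
    return rows


def summarize(rows):
    """Compute basic statistics over the average ratings."""
    ratings = [r["avr_rating"] for r in rows]
    return {
        "min": min(ratings),
        "median": statistics.median(ratings),
        "mean": statistics.mean(ratings),
        "max": max(ratings),
        "stdev": statistics.stdev(ratings),  # sample standard deviation
    }


rows = load_ratings(SAMPLE)
summary = summarize(rows)
print(summary)
```

To run this on the real file, replace SAMPLE with the contents of the downloaded csv and adjust the header names if they differ. Whether the standard deviation quoted in the description is the sample or the population value is not stated, so the choice of statistics.stdev here is also an assumption.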

Name	WiFiNE
Resource type	Corpus
Size	1.2 GByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Named Entity Recognition
License	OpenSource
Conditions of use	<Not Specified>
Description	WiFiNE is an English corpus annotated with fine-grained entity types.
Download from	http://rali.iro.umontreal.ca/rali/en/wifiner-wikipedia-for-et
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/11.html
Edition	LREC 2018

Name	Wikipedia discourse connectives
Resource type	Corpus
Size	351 MByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Discourse
License	Creative Commons CC BY-SA 3.0 US
Conditions of use	Attribution Share-Alike
Description	2.9 million pairs of adjacent sentences extracted from the English Wikipedia on September 5, 2016, including the discourse connective at the beginning of the second sentence, if any, i.e., the "gold" connective.
Download from	https://github.com/ekQ/discourse-connectives
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/203.html
Edition	LREC 2018

Name	Wikipedia Embedding
Resource type	Lexicon
Size	2000000 words
Languages	Chinese
Production status	Newly created-finished
Resource usage	Word Sense Disambiguation
License	OpenSource
Conditions of use	Attribution
Description	Chinese Wikipedia Title Embedding
Download from	http://ckipsvr.iis.sinica.edu.tw/cwemb/reg.php
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/159.html
Edition	LREC 2018

Name	WORD
Resource type	Evaluation Data
Size	19276 concept pairs
Languages	<Not Specified>
Production status	Newly created-in progress
Resource usage	Evaluation/Validation
License	Creative Commons CC BY-SA 3.0
Conditions of use	Attribution Share-Alike
Description	A set of 19,276 Wikipedia concepts with their human annotated relatedness level.
Download from	http://www.research.ibm.com/haifa/dept/vst/debating_data.shtml
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/445.html
Edition	LREC 2018

Name	Word Importance Annotations
Resource type	Corpus
Size	1.9 MByte
Languages	English (eng)
Production status	Newly created-in progress
Resource usage	Evaluation/Validation
License	Creative Commons CC BY-NC-SA 4.0
Conditions of use	<Not Specified>
Description	The Switchboard Corpus consists of audio recordings of approximately 260 hours of speech, comprising about 2,400 two-sided telephone conversations among 543 speakers (302 male, 241 female) from across the United States. In January 2003, the Institute for Signal and Information Processing (ISIP) released written transcripts for the entire corpus, which consists of nearly 400,000 conversational turns. The ISIP transcripts include a complete lexicon list and automatic word alignment timing corresponding to the original audio files. In this project, a pair of annotators has assigned word-importance scores to these transcripts. As of September 2017, they have annotated over 25,000 tokens, with an overlap of approximately 3,100 tokens. We announce the release of these annotations as a set of supplementary files, aligned to the ISIP transcripts. Our annotation work continues, and we aim to annotate the entire Switchboard corpus with a larger group of annotators.
Download from	http://latlab.ist.rit.edu/lrec2018/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/187.html
Edition	LREC 2018

Name	Word Segmentation Dataset
Resource type	Corpus
Size	650 MByte
Languages	Sanskrit
Production status	Existing-used
Resource usage	Word Segmentation
License	Creative Commons CC BY 4.0
Conditions of use	Attribution
Description	A Dataset for Sanskrit word segmentation. See English language documentation here: https://zenodo.org/record/803508#.WdJXkRdx3eR
Download from	https://zenodo.org/record/803508#.WdJXkRdx3eR
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/669.html
Edition	LREC 2018

Name	WordNetGraph
Resource type	Lexicon
Size	74 MByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Textual Entailment and Paraphrasing
License	MIT License
Conditions of use	Preservation of Copyright Notice
Description	The WordNetGraph is an RDF graph generated from WordNet, whose noun and verb definitions were labeled with Definition Semantic Roles (DSR).
Download from	https://github.com/Lambda-3/WordnetGraph/blob/master/WN_DSR_model_XML.rdf
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/190.html
Edition	LREC 2018

Name	WordsEye Evaluation Corpus: Imaginative and Realistic Sentences
Resource type	Corpus
Size	459 sentences
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Evaluation/Validation
License	Creative Commons CC BY-NC-SA 4.0
Conditions of use	<Not Specified>
Description	The resource contains 209 imaginative sentences and 250 realistic sentences. It also includes images generated by the WordsEye text-to-scene system, using each sentence as input. We used Amazon Mechanical Turk to obtain the imaginative sentences. Turkers were given short lists of words divided into several categories and were asked to write a short sentence using at least one word from each category. The words provided to the Turkers represent the objects, properties, and relations supported by the WordsEye text-to-scene system. The realistic sentences are a subset of the PASCAL image caption corpus (Rashtchian et al., 2010).
Download from	http://www.cs.columbia.edu/~coyne/wordseye-evaluation-corpus.html
Referring paper	http://www.lrec-conf.org/proceedings/lrec2018/summaries/115.html
Edition	LREC 2018
