
11th edition of the Language Resources and Evaluation Conference, 7-12 May 2018, Miyazaki (Japan)
Under the patronage of the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT)



Current List of LREC 2018 Shared LRs


LREC recognises the importance of sharing Language Resources (LRs) and making them available to the community. When submitting a paper, participants were offered the possibility to share their LRs (data, tools, web services, etc.) by uploading them to a special LREC repository set up by ELRA. This effort to share LRs, linked to the LRE Map initiative for their description, contributes to creating a common repository where everyone can deposit and share data.

After the conference, the set of LRs shared at LREC 2018 was manually checked, and a cleaned version of the list of LRs is now available. The LRs in this list comply with the following criteria:

  • LRs that are accessible (whether downloadable directly or through an external URL)
  • LRs categorized as Datasets only, which can be one of:
    • Corpus,
    • Evaluation Data,
    • Lexicon,
    • Ontology,
    • Terminology,
    • Treebank.


Excluded LRs are:

  • Uploaded LRs whose content does not match the description
  • LRs with no download URL, or whose URL is now a dead link
  • LRs categorized as tools or guidelines
  • LRs associated with rejected papers


Shared-LRs @ LREC 2018
  • Name A Tweet Dataset Annotated in Four Emotion Dimensions
    Resource type Corpus
    Size 2019 entries  
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Emotion Recognition/Generation
    License CC-BY 4.0
    Conditions of use Attribution
    Description A corpus of 2,019 tweets annotated along each of four emotion dimensions: Valence, Dominance, Arousal and Surprise. Two annotation schemes are used: a 5-point ordinal scale (using SAM manikins for Valence, Arousal and Dominance) and pair-wise comparisons with an "about the same" option (here 2,019 tweet pairs are annotated such that each of the 2,019 tweets is in at least one pair and no pairs are duplicated). In all cases, there was a "Can't Tell" option for unintelligible tweets. The files provided are CSV output from CrowdFlower, with largely self-explanatory column names: annotation columns are emotion names (5-point scale) or "most_emotion" (comparisons), "index" contains a unique id for each annotation task, and "_worker_id" contains a unique identifier for each annotator. A minimal reading sketch follows this entry.
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/61_res_1.tgz [3,45 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/61.html
    Edition LREC 2018
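    A minimal Python sketch of grouping the annotations by task from one of the CSV files described above. The filename is hypothetical, and any column name other than "index", "_worker_id" and "most_emotion" is an assumption based on the description.

        import csv
        from collections import defaultdict

        # task index -> list of (worker id, label) pairs
        annotations = defaultdict(list)
        with open("valence_annotations.csv", newline="", encoding="utf-8") as f:  # hypothetical filename
            for row in csv.DictReader(f):
                # 5-point files carry the label in an emotion-name column (e.g. "valence");
                # pairwise files carry it in "most_emotion"
                label = row.get("valence") or row.get("most_emotion")
                annotations[row["index"]].append((row["_worker_id"], label))

        print(len(annotations), "annotation tasks loaded")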
  • Name Abu El-Khair Corpus
    Resource type Corpus
    Size 16 Gbyte
    Languages Arabic
    Production status Complete
    Resource usage Information Retrieval, Natural Language Processing, Machine Learning
    License OpenSource
    Conditions of use Free
    Description A text corpus of five million newspaper articles, about one and a half billion words in total, of which about three million are unique. The data were collected from newspaper articles in ten major news sources from eight Arab countries over a period of fourteen years. The corpus is provided in two encodings, UTF-8 and Windows CP-1256, and marked up in two formats, SGML and XML.
    Download from http://www.abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/4.html
    Edition LREC 2018
  • Name Adimen-SUMO v2.6
    Resource type Ontology
    Size 4000 rules   
    Languages <Not Specified>
    Production status Existing-updated
    Resource usage Automated reasoning
    License Creative Commons CC BY 3.0 Unported
    Conditions of use Attribution
    Description Adimen-SUMO is an off-the-shelf first-order ontology obtained by re-engineering about 88% of SUMO (Suggested Upper Merged Ontology). Adimen-SUMO can be used by first-order (FO) theorem provers (such as E-Prover or Vampire) for formal reasoning.
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/308_res_1.zip [3,41 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/308.html
    Edition LREC 2018
  • Name Anchor wordlists for code-switching
    Resource type Lexicon
    Size 65 MByte  
    Languages English (eng) Spanish (spa)
    Production status Newly created-finished
    Resource usage Language Modelling
    License <Not Specified>
    Conditions of use <Not Specified>
    Description This Language Resource package contains the anchor wordlists described in the LREC 2018 paper "Collecting Code-Switched Data from Social Media" by Gideon Mendels, Victor Soto, Aaron Jaech and Julia Hirschberg. The folder strong_anchors contains the collection of strong anchors computed as described in the paper; the three subfolders anchors, anchors_news and anchors_wiki contain the strong anchors computed using the news resources, wiki resources, or all resources from the Leipzig Corpora Collection (LCC). The folder weak_anchors contains the collection of Spanish and English weak anchors described in the paper, computed using the GigaCorpus dataset of Broadcast News data. You can use the anchor wordlists to seed the search for code-switched tweets using Babler (https://github.com/gidim/Babler/blob/master/README.md).
    Download from http://www.cs.columbia.edu/~vsoto/files/anchor_wordlists.zip
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/92.html
    Edition LREC 2018
  • Name Annotated arXiv CS Data Set
    Resource type Corpus
    Size 3.5 GByte  
    Languages <Not Specified>
    Production status Newly created-finished
    Resource usage Information Extraction, Information Retrieval
    License arXiv
    Conditions of use Website is granted a perpetual, non-exclusive right to distribute the content
    Description The data set contains 15.5M sentences of arXiv.org publications in the computer science domain. In those sentences, the citation markers were replaced by global paper identifiers. All citing and cited papers are linked to DBLP, as far as possible. The data set can be used for a variety of citation-based tasks, such as citation recommendation, citation function determination, and citation-based document summarization.
    Download from http://citation-recommendation.org/publications/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/283.html
    Edition LREC 2018
  • Name Annotated Corpus of Scientific Conference's Homepages
    Resource type Corpus
    Size 57.8
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Information Extraction, Information Retrieval
    License Creative Commons CC BY-NC-SA 4.0
    Conditions of use Attribution Non-Commercial Share-Alike
    Description The corpus contains homepages of conferences with annotations of interesting information, e.g., the name of a conference, its abbreviation, and several important dates. The corpus can be used to train a tool for information extraction from unstructured sources describing conferences. We chose conference home pages as a source because they contain up-to-date information. Structured services, such as WikiCFP, do not always update information (e.g., deadline changes) and cannot be used in a real system for gathering up-to-date information about conferences.
    Download from http://ii.pw.edu.pl/~pandrusz/data/conferences/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/828.html
    Edition LREC 2018
  • Name Arabic Dialects Dataset
    Resource type Corpus
    Size 355069 words  
    Languages Egyptian Arabic (arz) Gulf Arabic (afb) North Levantine Arabic (apc) Tunisian Arabic (aeb) Modern Standard Arabic
    Production status Existing-updated
    Resource usage Document Classification, Text categorisation
    License Open Source
    Conditions of use <Not Specified>
    Description The Arabic Dialects Dataset is a collection of Arabic dialect documents drawn from the Arabic Commentary Dataset and the Tunisian Arabic dataset. We selected only 100% dialectal documents from each resource and filtered them by dialect so that no document is written in more than one dialect. We built frequency lists for the collected documents, as well as bivalency and dialectal Modern Standard Arabic lists.
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/237_res_1.zip [2,24 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/237.html
    Edition LREC 2018
  • Name ArapTweet
    Resource type Corpus
    Size 2500000 Tweets
    Languages Dialectal Arabic
    Production status Newly created-in progress
    Resource usage Language Identification
    License Creative Commons CC BY 4.0
    Conditions of use Attribution
    Description An Arabic multi-dialectal corpus of Tweets from 12 regions and 17 countries in the Arab world. The corpus is annotated for age categories, gender and for the dialectal variety.
    Download from http://arap.qatar.cmu.edu/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/521.html
    Edition LREC 2018
  • Name ArmanPersoNERCorpus
    Resource type Corpus
    Size 1.6 MByte  
    Languages Iranian Persian (pes)
    Production status Existing-updated
    Resource usage Named Entity Recognition
    License <Not Specified>
    Conditions of use <Not Specified>
    Description ArmanPersoNERCorpus is the first manually-annotated Persian named-entity (NE) dataset (ISLRN 399-379-640-828-6). We are releasing it only for academic research use. The data set includes 250,015 tokens and 7,682 Persian sentences in total. It is available in 3 folds to be used in turn as training and test sets. Each file contains one token, along with its manually annotated named-entity tag, per line; each sentence is separated by a newline. The NER tags are in IOB format (a minimal reading sketch follows this entry). According to the instructions provided to the annotators, NEs are categorized into six classes: person, organization (such as banks, ministries, embassies, teams, nationalities, networks and publishers), location (such as cities, villages, rivers, seas, gulfs, deserts and mountains), facility (such as schools, universities, research centers, airports, railways, bridges, roads, harbors, stations, hospitals, parks, zoos and cinemas), product (such as books, newspapers, TV shows, movies, airplanes, ships, cars, theories, laws, agreements and religions), and event (such as wars, earthquakes, national holidays, festivals and conferences); the remaining tokens are tagged as other.
    Download from https://github.com/HaniehP/PersianNER
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/48.html
    Edition LREC 2018
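    A minimal Python sketch of reading one fold of the corpus, assuming whitespace-separated "token tag" lines with blank lines between sentences, as in the description; the fold filename is hypothetical.

        def read_iob(path):
            """Return a list of sentences, each a list of (token, tag) pairs."""
            sentences, current = [], []
            with open(path, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:           # blank line ends a sentence
                        if current:
                            sentences.append(current)
                            current = []
                        continue
                    token, tag = line.rsplit(maxsplit=1)
                    current.append((token, tag))
            if current:
                sentences.append(current)
            return sentences

        fold1 = read_iob("train_fold1.txt")  # hypothetical filename
        print(len(fold1), "sentences")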
  • Name b5 corpus
    Resource type Corpus
    Size 1082 Big five inventories and text
    Languages Portuguese (por)
    Production status Newly created-finished
    Resource usage Document Classification, Text categorisation
    License Creative Commons CC BY 4.0
    Conditions of use Attribution
    Description The b5 corpus is a dataset containing texts (in Brazilian Portuguese) and self-reported personality inventories of their authors. It consists of an author knowledge base called b5-subject and four text databases (or subcorpora) called b5-post, b5-ref, b5-text, and b5-caption. The b5-subject knowledge base contains 1082 personality inventories and partial author information regarding gender, age, background, degree of religiosity (on a 1-5 scale) and undergraduate course. The b5-post subcorpus contains Facebook status updates from participants who filled out the personality inventory using a purpose-built application; for each subject, up to 1,000 Facebook status updates were collected, and users with little or no Facebook activity were discarded, resulting in a corpus of 1019 texts. The b5-ref corpus is a collection of 1810 definite descriptions elicited from visual contexts and annotated with their semantic properties. The b5-text and b5-caption subcorpora contain about 1500 scene descriptions produced in two versions: a detailed version in the form of multi-sentential text, and a short version in the form of a single sentence similar to a picture caption.
    Download from https://drive.google.com/open?id=0B-KyU7T8S8bLTHpaMnh2U2NWZzQ
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/31.html
    Edition LREC 2018
  • Name b5-ref-lex corpus
    Resource type Corpus
    Size 4711 Definite descriptions
    Languages Portuguese (por)
    Production status Newly created-finished
    Resource usage Natural Language Generation
    License Creative Commons CC BY 4.0
    Conditions of use Attribution
    Description The b5-ref-lex dataset conveys semantic properties taken from the b5-ref corpus of referring expressions, the corresponding Big Five personality traits of their authors, and their surface forms (in Brazilian Portuguese). The dataset has been created for the development of machine learning methods for personality-based referring expression lexical choice, in which the goal is to learn the appropriate lexicalisation for a given input semantics and target personality. Each instance of the dataset is represented by seven features: a referential property (or attribute-value pair) as defined in the b5-ref domain, the five personality values (Extraversion, Agreeableness, Conscientiousness, Neuroticism, Openness to Experience) of the speaker who selected the property to produce a definite description, and the basic word that was uttered.
    Download from https://drive.google.com/open?id=0B-KyU7T8S8bLclA0XzIwX0NncEk
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/35.html
    Edition LREC 2018
  • Name Baidu Baike document titles
    Resource type Lexicon
    Size 93.9 MByte  
    Languages Chinese
    Production status Newly created-finished
    Resource usage Lexicon Creation/Annotation
    License OpenSource
    Conditions of use <Not Specified>
    Description The resource contains 10,143,321 titles of Baidu Baike documents (Baidu Baike is the largest Chinese-language encyclopedia, similar to Wikipedia). Each title can be considered a specific word or a combination of words.
    Download from https://drive.google.com/open?id=1rO-LUcWpm_5KhBUpYID1M5B_an55XVO2
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/295.html
    Edition LREC 2018
  • Name Basque Multiword Expressions Scored for Compositionality (Unfiltered)
    Resource type Lexicon
    Size 77679 lexemes  
    Languages Basque (eus)
    Production status Newly created-finished
    Resource usage Multiword Expression Compositionality
    License CreativeCommons
    Conditions of use <Not Specified>
    Description A compositionality-ranked list of Basque multiword expressions, created using the methodology detailed in our paper; evaluation for this resource is not presented.
    Download from http://amor.cms.hu-berlin.de/~robertsw/files/lrec2016/MWE_eu-unfiltered.utf8.txt.gz
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/650.html
    Edition LREC 2018
  • Name Berriak
    Resource type Corpus
    Size 500 sentences  
    Languages Basque English
    Production status Newly created-in progress
    Resource usage Machine Translation, SpeechToSpeech Translation
    License <Not Specified>
    Conditions of use <Not Specified>
    Description 500 accurate English-Basque translations, produced by human translators. The corpus was created with the help of Librezale, a group that works to increase the presence of the Basque language in computing. This is the first version of the corpus; we aim to increase the number of translations in the future. Pre-trained word embeddings in Basque: we have learned Basque word embeddings from the Basque Wikipedia with GloVe.
    Download from https://github.com/ijauregiCMCRC/english_basque_MT
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/101.html
    Edition LREC 2018
  • Name BlogSet-BR
    Resource type Corpus
    Size 4.7 GByte  
    Languages Portuguese (por)
    Production status Newly created-finished
    Resource usage Text Mining
    License Apache 2.0
    Conditions of use Preservation of Copyright Notice
    Description The processed corpus is a CSV file with 7.4 million posts from Brazilian bloggers only, with the columns: post id number, blog id number, published date, title, content, author id number, author display name and replies (number of comments). The total size of the compressed processed data is 4.7 GB. The survey data, with responses from 4 thousand Brazilian bloggers, is an XLS file of 1.1 MB.
    Download from http://www.inf.pucrs.br/linatural/blogset-br
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/10.html
    Edition LREC 2018
  • Name BPEmb
    Resource type Corpus
    Size 100 GByte  
    Languages 275 languages
    Production status Existing-used
    Resource usage Input for Neural Models
    License MIT License
    Conditions of use Preservation of Copyright Notice
    Description Subword embeddings in 275 languages, based on Byte-Pair Encoding and trained on Wikipedia article text (see the usage sketch after this entry).
    Download from https://github.com/bheinzerling/bpemb
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/1049.html
    Edition LREC 2018
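    A usage sketch assuming the "bpemb" Python package distributed from the repository above (pip install bpemb); the exact API should be checked against the repository's README.

        from bpemb import BPEmb

        # English subword embeddings: 10k-merge BPE vocabulary, 100 dimensions
        bpemb_en = BPEmb(lang="en", vs=10000, dim=100)
        print(bpemb_en.encode("melbourne"))       # BPE subword segmentation
        print(bpemb_en.embed("melbourne").shape)  # one vector per subword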
  • Name Bulgarian Multiword Expressions Scored for Compositionality (Unfiltered)
    Resource type Lexicon
    Size 98872 entries  
    Languages Bulgarian (bul)
    Production status Newly created-finished
    Resource usage Multiword Expression Compositionality
    License CreativeCommons
    Conditions of use <Not Specified>
    Description A compositionality-ranked list of Bulgarian multiword expressions, created using the methodology detailed in our paper; evaluation for this resource is not presented.
    Download from http://amor.cms.hu-berlin.de/~robertsw/files/lrec2016/MWE_bg-unfiltered.utf8.txt.gz
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/650.html
    Edition LREC 2018
  • Name Chinese Word Embedding Evaluation Sets
    Resource type Lexicon
    Size entries  
    Languages Chinese
    Production status Newly created-finished
    Resource usage Word Sense Disambiguation
    License OpenSource
    Conditions of use Attribution
    Description Evaluation Datasets for Chinese Word Embedding.
    Download from http://ckipsvr.iis.sinica.edu.tw/ecemb/reg.php
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/159.html
    Edition LREC 2018
  • Name Classical Chinese Evaluation Dataset (Twenty-Five Histories)
    Resource type Evaluation Data
    Size 147 KByte  
    Languages Classical Chinese
    Production status Newly created-finished
    Resource usage Evaluation/Validation
    License OpenSource
    Conditions of use <Not Specified>
    Description The manually segmented texts in the evaluation dataset, consisting of 32,689 characters, are proportionally selected from each historical book of the Twenty-Five histories. The dataset is exported from MongoDB.
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/295_res_2.zip [147 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/295.html
    Edition LREC 2018
  • Name Code-switched English-Spanish Tweets
    Resource type Corpus
    Size 493 KByte  
    Languages English (eng) Spanish (spa)
    Production status Newly created-finished
    Resource usage Language Modelling
    License Apache 2.0
    Conditions of use Preservation of Copyright Notice Share-Alike
    Description This package contains the collection of tweets described in the LREC 2018 paper: "Collecting Code-Switched Data from Social Media", Gideon Mendels, Victor Soto, Aaron Jaech and Julia Hirschberg, LREC 2018. Please remember to cite this paper if you use this resource. The tagged_tweets_ids file contains the IDs of the 8,285 tweets for which we crowdsourced language tags. These tweets were collected using Babler (https://github.com/gidim/Babler/blob/master/README.md) and the anchor wordlists described in the paper, which can be found at http://www.cs.columbia.edu/~vsoto/files/anchor_wordlists.zip. The tagged_tweets_labels file contains the crowdsourced language tags for each token in the collection of 8,285 tweets. The format of the file is one line per token, and each line contains a tweet ID, token index and language tag (a minimal reading sketch follows this entry). The language tag values are the following (for a more thorough explanation, read the paper): lang1 = English, lang2 = Spanish, ne = Named Entity, unk = Unknown, fw = Foreign Word, ambiguous, mixed and other.
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/92_res_2.zip [481 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/92.html
    Edition LREC 2018
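    A minimal Python sketch of reading the tagged_tweets_labels file, assuming whitespace-separated "tweet_id token_index tag" lines as described; the exact delimiter is an assumption.

        from collections import defaultdict

        tags_by_tweet = defaultdict(dict)  # tweet id -> {token index: language tag}
        with open("tagged_tweets_labels", encoding="utf-8") as f:
            for line in f:
                tweet_id, token_index, tag = line.split()
                tags_by_tweet[tweet_id][int(token_index)] = tag

        # tweets containing both English and Spanish tokens are code-switched
        code_switched = [tid for tid, tags in tags_by_tweet.items()
                         if {"lang1", "lang2"} <= set(tags.values())]
        print(len(code_switched), "code-switched tweets")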
  • Name Concept Space
    Resource type Corpus
    Size 12.5 GByte  
    Languages English
    Production status Newly created-finished
    Resource usage Knowledge Discovery/Representation
    License CreativeCommons
    Conditions of use <Not Specified>
    Description A concept space for Explicit Semantic Analysis (ESA) (Gabrilovich & Markovitch 2007), a technique that provides a semantic representation of text in a space of concepts derived from Wikipedia. ESA defines concepts from Wikipedia articles, e.g., BARACK OBAMA and COMPUTER SCIENCE. This resource is a concept space created from a Wikipedia April 2017 snapshot (an illustrative sketch of how such a space is used follows this entry).
    Download from https://goo.gl/JZhEvm
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/806.html
    Edition LREC 2018
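    An illustrative Python sketch of how an ESA concept space is typically used: each word maps to a vector of weights over Wikipedia concepts, a text is represented as the sum of its word vectors, and relatedness is cosine similarity. The tiny in-memory dictionary here is an illustration only, not the file format of this resource.

        import math
        from collections import Counter

        # toy word -> {concept: weight} mapping (illustration only)
        concept_space = {
            "president": {"BARACK OBAMA": 0.8},
            "algorithm": {"COMPUTER SCIENCE": 0.9},
        }

        def esa_vector(text):
            vec = Counter()
            for word in text.lower().split():
                for concept, weight in concept_space.get(word, {}).items():
                    vec[concept] += weight
            return vec

        def cosine(u, v):
            dot = sum(u[c] * v[c] for c in u)
            norm = math.sqrt(sum(x * x for x in u.values())) * \
                   math.sqrt(sum(x * x for x in v.values()))
            return dot / norm if norm else 0.0

        print(cosine(esa_vector("president speech"), esa_vector("algorithm design")))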
  • Name Cornell eRulemaking Corpus -- CDCP
    Resource type Corpus
    Size 4931 sentences  
    Languages English
    Production status Newly created-finished
    Resource usage Argument Mining
    License Open Database License
    Conditions of use <Not Specified>
    Description This dataset consists of argument annotations on user comments about rule proposals regarding Consumer Debt Collection Practices by the Consumer Financial Protection Bureau, crawled from an eRulemaking website, regulationroom.org. The annotation scheme is based on the argumentation model presented in "Toward Machine-assisted Participation in eRulemaking: An Argumentation Model of Evaluability" by Joonsuk Park, Cheryl Blake and Claire Cardie (ICAIL 2015).
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/679_res_1.zip [194 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/679.html
    Edition LREC 2018
  • Name Corpus for Sarcasm Detection in English-Hindi code-mixed tweets
    Resource type Corpus
    Size 3.2 Mbyte
    Languages Hindi English
    Production status Complete
    Resource usage Sarcasm detection in English-Hindi code-mixed tweets
    License GNU GPL 3.0
    Conditions of use Preservation of Copyright Notice Share-Alike
    Description The dataset is divided into three files: - Sarcasm_tweets.txt: contains all the tweet ids and their corresponding tweet text; each tweet id is followed by the text, a blank line, and so on. - Sarcasm_tweets_with_language.txt: contains each tweet id followed by the corresponding tweet, tokenized, with each token tagged with a language tag (en/hi/rest). - Sarcasm_tweet_truth.txt: contains each tweet id followed by a label (YES/NO) that indicates the presence of sarcasm, then a blank line. A minimal reading sketch follows this entry.
    Download from https://github.com/sahilswami96/SarcasmDetection_CodeMixed/tree/master/Dataset
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/7.html
    Edition LREC 2018
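    A minimal Python sketch of pairing tweets with their sarcasm labels, assuming each file repeats the pattern "id line, content line, blank line" as in the description.

        def read_blocks(path):
            """Return a dict mapping each tweet id to the line that follows it."""
            with open(path, encoding="utf-8") as f:
                lines = [line.rstrip("\n") for line in f]
            blocks, i = {}, 0
            while i + 1 < len(lines):
                if lines[i].strip():
                    blocks[lines[i].strip()] = lines[i + 1].strip()
                    i += 3  # id, content, blank separator
                else:
                    i += 1
            return blocks

        tweets = read_blocks("Sarcasm_tweets.txt")
        labels = read_blocks("Sarcasm_tweet_truth.txt")
        dataset = [(tweets[t], labels[t] == "YES") for t in tweets if t in labels]
        print(len(dataset), "labelled tweets")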
  • Name CorpusDRF
    Resource type Lexicon
    Size 936 entries  
    Languages French (fra)
    Production status Newly created-finished
    Resource usage Corpus Creation/Annotation
    License OpenSource
    Conditions of use <Not Specified>
    Description CorpusDRF is an open-source, digitized collection of regionalisms, their parts of speech, and recognition rates published in the 'Dictionnaire des Regionalismes de France' (DRF, "Dictionary of Regionalisms of France") (Rezeau, 2001), enabling the visualization and analysis of the largest-scale study of French regionalisms in the 20th century using publicly available data. The corpus documents, in a tabular format, the numerical values of recognition rates for each DRF entry sorted according to each of the 94 departments in continental France. Each CSV file contains 95 rows (a header plus the 94 French departments) and 936 DRF entries as columns. There are 3 versions of the data: one with empty values (NAs) left as empty, one with NAs imputed with 0s, and one with NAs as -1s. A minimal loading sketch follows this entry.
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/1005_res_1.zip [62,6 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/1005.html
    Edition LREC 2018
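    A minimal Python sketch of loading one of the CorpusDRF CSV files; the filename is a placeholder, and that the first column identifies the department is an assumption (the zero-imputed version is used so every cell parses as a number).

        import csv

        with open("corpus_drf_na_as_zero.csv", newline="", encoding="utf-8") as f:  # placeholder name
            reader = csv.reader(f)
            header = next(reader)  # department column + 936 DRF entries
            rates = {row[0]: [float(x) for x in row[1:]] for row in reader}

        print(len(header) - 1, "DRF entries,", len(rates), "departments")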
  • Name Czech Multiword Expressions Scored for Compositionality (Unfiltered)
    Resource type Lexicon
    Size 175713 entries  
    Languages Czech (ces)
    Production status Newly created-finished
    Resource usage Multiword Expression Compositionality
    License CreativeCommons
    Conditions of use <Not Specified>
    Description A compositionality-ranked list of Czech multiword expressions, created using the methodology detailed in our paper; evaluation for this resource is not presented.
    Download from http://amor.cms.hu-berlin.de/~robertsw/files/lrec2016/MWE_cs-unfiltered.utf8.txt.gz
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/650.html
    Edition LREC 2018
  • Name Czech Text Document Corpus v 2.0
    Resource type Corpus
    Size 694 MByte  
    Languages Czech (ces)
    Production status Existing-updated
    Resource usage Document Classification, Text categorisation
    License Creative Commons CC BY-NC-SA 3.0
    Conditions of use Attribution Non-Commercial Share-Alike
    Description Czech Text Document Corpus v 2.0 is a collection of text documents for automatic document classification in the Czech language. It is composed of text documents provided by the Czech News Agency and is freely available for research purposes. This corpus was created in order to facilitate a straightforward comparison of document classification approaches on Czech data. It is particularly dedicated to the evaluation of multi-label document classification approaches, because one document is usually labelled with more than one label. Besides the information about the document classes, the corpus is also annotated at the morphological layer. The main part (for training and testing) is composed of 11,955 real newspaper articles. We also provide a development set, intended for tuning the hyper-parameters of the created models, which contains 2,735 additional articles. The total category number is 60, out of which the 37 most frequent ones are used for classification; the reason for this reduction is to keep only the classes with a sufficient number of occurrences to train the models. Technical details: text documents are stored in individual text files using UTF-8 encoding. Each filename is composed of a serial number and the list of category abbreviations separated by the underscore symbol, with the .txt suffix. Serial numbers are composed of five digits, and the numerical series starts from the value one. For instance, the file 00046_kul_nab_mag.txt represents document number 46, annotated with the categories kul (culture), nab (religion) and mag (magazine selection); a minimal sketch of parsing this convention follows this entry. The content of the document, i.e. the word tokens, is stored in one line, with tokens separated by space symbols. Every text document was further automatically morphologically analyzed. This analysis includes lemmatization, POS tagging and syntactic parsing. The fully annotated files are stored in .conll files. We also provide the lemmatized form (files with suffix .lemma) and the corresponding POS tags (.pos files). The tokenized version of the documents is also available in .tok files. This corpus is available for free only for research purposes; commercial use in any form is strictly excluded.
    Download from http://ctdc.kiv.zcu.cz/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/671.html
    Edition LREC 2018
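    A minimal Python sketch of decoding the filename convention described above, e.g. 00046_kul_nab_mag.txt -> serial number 46 with categories kul, nab and mag.

        from pathlib import Path

        def parse_name(filename):
            parts = Path(filename).stem.split("_")
            return int(parts[0]), parts[1:]

        serial, categories = parse_name("00046_kul_nab_mag.txt")
        print(serial, categories)  # 46 ['kul', 'nab', 'mag']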
  • Name Dataset of Nuanced Assertions on Controversial Issues (NAoCI dataset)
    Resource type Evaluation Data
    Size <Not Specified>
    Languages English
    Production status Newly created-finished
    Resource usage Opinion Mining/Sentiment Analysis
    License Creative Commons CC BY-NC-ND 4.0
    Conditions of use Attribution Non-Commercial No-Derivatives
    Description The Dataset of Nuanced Assertions on Controversial Issues (NAoCI) consists of over 2,000 assertions on sixteen different controversial issues. It has over 100,000 judgments of whether people agree or disagree with the assertions, and about 70,000 judgments indicating how strongly people support or oppose the assertions.
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/321_res_1.zip [283 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/321.html
    Edition LREC 2018
  • Name Datasets for classification experiments IS-pros
    Resource type Corpus
    Size 135 KByte  
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Evaluation/Validation
    License GNU GPL 3.0
    Conditions of use Preservation of Copyright Notice Share-Alike
    Description Datasets in ARFF format (for the Weka machine learning software) are made available to reproduce the validation experiments presented in the paper.
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/530_res_3.zip [135 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/530.html
    Edition LREC 2018
  • Name Debate Recordings
    Resource type Corpus
    Size 280 MByte  
    Languages English
    Production status Newly created-finished
    Resource usage Debating technologies and computational argumentation
    License © Copyright Wikipedia, CC BY-SA 3.0; © Copyright IBM 2014, released under CC BY-SA 4.0
    Conditions of use Attribution Non-Commercial Share-Alike
    Description This resource is an audio and textual dataset of debating speeches for computational argumentation and debating technologies research. It contains 60 speeches recorded by experienced debaters, as well as their automatic and manual transcripts, in both raw and clean versions (5 formats in total). We plan to release more data in the future.
    Download from https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/66.html
    Edition LREC 2018
  • Name Dundee GCG-Bank
    Resource type Treebank
    Size 2372 sentences
    Languages English
    Production status Complete
    Resource usage <Not Specified>
    License GNU GPL 3.0
    Conditions of use Preservation of Copyright Notice Share-Alike Free
    Description Dundee GCG-Bank contains hand-corrected deep syntactic annotations for the Dundee eye-tracking corpus (Kennedy et al., 2003). The annotations are designed to support psycholinguistic investigation into the structural determinants of sentence processing effort. Dundee GCG-Bank is distributed as a sub-module of the ModelBlocks repository, a code base designed to support broad-coverage psycholinguistic modeling.
    Download from https://github.com/modelblocks/modelblocks-release
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/9.html
    Edition LREC 2018
  • Name Dutch Multiword Expressions Scored for Compositionality (Unfiltered)
    Resource type Lexicon
    Size 269498 entries  
    Languages Dutch (nld)
    Production status Newly created-finished
    Resource usage Multiword Expression Compositionality
    License CreativeCommons
    Conditions of use <Not Specified>
    Description A compositionality-ranked list of Dutch multiword expressions, created using the methodology detailed in our paper; evaluation for this resource is not presented.
    Download from http://amor.cms.hu-berlin.de/~robertsw/files/lrec2016/MWE_nl-unfiltered.utf8.txt.gz
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/650.html
    Edition LREC 2018
  • Name E-HowNet
    Resource type Lexicon
    Size web browser: 90,000 words; download: 30,000 words
    Languages Chinese English
    Production status Newly created-finished
    Resource usage Information Extraction, Information Retrieval
    License Academic License Trial Version Agreement (Automatic Translation of Chinese version)
    Conditions of use Attribution Share-Alike
    Description Extended-HowNet (E-HowNet) is a lexical knowledge base evolved from HowNet and created by the CKIP (Chinese Knowledge and Information Processing) group. It consists of definitions of lexical senses and an ontology. The ontology is built by modifying the HowNet taxonomy of sememes to denote taxonomic relations between concepts and attributes of concepts, with the aim of constructing a lexical knowledge database. It is very important groundwork for the E-HowNet project.
    Download from http://ckip.iis.sinica.edu.tw/CKIP/ehownet_reg/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/547.html
    Edition LREC 2018
  • Name Emotion Movie Transcript Corpus
    Resource type Corpus
    Size 100 MByte  
    Languages English (eng)
    Production status Newly created-in progress
    Resource usage Emotion Recognition/Generation
    License In Readme File
    Conditions of use Research Uses Commercial usage if approved by the author
    Description Emotion Movie Transcript Corpus (EMTC) is an emotion conversational text corpus collected from the IMDb quotes dataset. The corpus is partly annotated using a multi-label scheme and has a relatively high inter-annotator agreement score. The corpus is practical and closer to real-life settings than other emotion corpora. Emotion analysis systems can benefit by using the corpus as training/testing data or by extracting an emotion lexicon from it. The corpus includes 3 files (excluding the README.txt file).
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/405_res_1.zip [88 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/405.html
    Edition LREC 2018
  • Name EN-Hi : Humor Detection Code-mixed texts
    Resource type Corpus
    Size 1.8 MByte  
    Languages English (eng) Hindi (hin)
    Production status Newly created-in progress
    Resource usage Humor detection and language identification in English-Hindi code-mixed texts
    License GNU GPL 3.0
    Conditions of use Attribution Preservation of Copyright Notice
    Description The corpus consists of English-Hindi code-mixed social media texts. It contains 3,453 tweets annotated for the presence of humor, along with language identification at the word level.
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/363_res_1.txt [692 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/363.html
    Edition LREC 2018
  • Name English Multiword Expressions Scored for Compositionality (Filtered)
    Resource type Lexicon
    Size 817592 entries  
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Multiword Expression Compositionality
    License CreativeCommons
    Conditions of use <Not Specified>
    Description The compositionality-ranked list of English multiword expressions presented in the paper, semi-automatically filtered for use in our machine translation experiments.
    Download from http://amor.cms.hu-berlin.de/~robertsw/files/lrec2016/MWE_en-filtered.utf8.txt.gz
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/650.html
    Edition LREC 2018
  • Name English Multiword Expressions Scored for Compositionality (Unfiltered)
    Resource type Lexicon
    Size 917648 entries  
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Multiword Expression Compositionality
    License CreativeCommons
    Conditions of use <Not Specified>
    Description The compositionality-ranked list of English multiword expressions presented in the paper.
    Download from http://amor.cms.hu-berlin.de/~robertsw/files/lrec2016/MWE_en-unfiltered.utf8.txt.gz
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/650.html
    Edition LREC 2018
  • Name English Vocabulary Knowledge Dataset
    Resource type Evaluation Data
    Size 22 KByte  
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Building and evaluating educational applications.
    License Open Source
    Conditions of use <Not Specified>
    Description These are the vocabulary test results of Japanese English-as-a-Second-Language learners, collected via crowdsourcing. The test used for collecting this data set is the Vocabulary Size Test (Nation and Beglar 2007), which assesses test-takers' vocabulary by asking for the correct meaning of 100 English words from multiple options. The test was answered by 100 test-takers recruited from a crowdsourcing service called Lancers, whose workers are mostly Japanese. Every test-taker was required to have taken the TOEIC test at least once and was asked to report their TOEIC score and when they took the test. This dataset was collected in Jan. 2016 and includes each test-taker's results for all 100 questions. The test can be downloaded from https://www.victoria.ac.nz/lals/about/staff/publications/paul-nation/VST-version-A.pdf and the answers from https://www.victoria.ac.nz/lals/about/staff/publications/paul-nation/VST-version-A_answers.pdf. For copyright reasons, we do not attach the test and the test answers.
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/978_res_1.zip [6,19 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/978.html
    Edition LREC 2018
  • Name English Wiktionary
    Resource type Lexicon
    Size 5317000 entries  
    Languages English (eng)
    Production status Existing-used
    Resource usage Lexicon Creation/Annotation
    License Creative Commons CC BY-SA 3.0
    Conditions of use Attribution Share-Alike
    Description <Not Specified>
    Download from https://en.wiktionary.org
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/312.html
    Edition LREC 2018
  • Name English-Hindi code-mixed dataset for sarcasm detection
    Resource type Corpus
    Size 3.2 Mbyte
    Languages English Hindi
    Production status Complete
    Resource usage Develop systems for classification of English-Hindi code-mixed texts and sarcasm detection
    License GNU GPL 3.0
    Conditions of use Preservation of Copyright Notice Share-Alike
    Description The dataset is divided into three files: - Sarcasm_tweets.txt: contains all the tweet ids and their corresponding tweet text; each tweet id is followed by the text, a blank line, and so on. - Sarcasm_tweets_with_language.txt: contains each tweet id followed by the corresponding tweet, tokenized, with each token tagged with a language tag (en/hi/rest). - Sarcasm_tweet_truth.txt: contains each tweet id followed by a label (YES/NO) that indicates the presence of sarcasm, then a blank line. (The reading sketch given after the "Corpus for Sarcasm Detection in English-Hindi code-mixed tweets" entry above applies to this dataset as well.)
    Download from https://github.com/sahilswami96/SarcasmDetection_CodeMixed/tree/master/Dataset
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/20.html
    Edition LREC 2018
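    The three files above share a simple line-oriented layout, so joining texts with labels takes only a few lines of Python. A minimal reading sketch, assuming exactly the id / value / blank-line layout described (the token/tag separator of the language-tagged file is not specified, so that file is not parsed here):

      # Hypothetical reader for the id / value / blank-line files described above.
      def read_id_value_pairs(path):
          """Yield (tweet_id, value) pairs from one of the dataset files."""
          with open(path, encoding="utf-8") as f:
              lines = [ln.rstrip("\n") for ln in f]
          i = 0
          while i + 1 < len(lines):
              if lines[i].strip():
                  yield lines[i].strip(), lines[i + 1].strip()
                  i += 3  # skip id, value and the trailing blank line
              else:
                  i += 1

      texts = dict(read_id_value_pairs("Sarcasm_tweets.txt"))
      labels = dict(read_id_value_pairs("Sarcasm_tweet_truth.txt"))
      sarcastic_ids = {tid for tid, lab in labels.items() if lab == "YES"}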
  • Name English-Malayalam-MorphGenerated Forms
    Resource type Lexicon
    Size 413397 entries  
    Languages English Malayalam (mal)
    Production status Newly created-in progress
    Resource usage Machine Translation, SpeechToSpeech Translation
    License Creative Commons CC BY-NC 4.0
    Conditions of use Attribution Non-Commercial
    Description The corpus contains the following resources: 1. MorphWords.en.txt, MorphWords.ml.txt: 370,065 parallel entries of morphology-generated forms for the English-Malayalam language pair. 2. NounCaseForms.en.txt, NounCaseForms.ml.txt: 21,360 entries of noun-case morphology-generated sample forms with tags for English and Malayalam. 3. NounBaseWord.en.txt, NounBaseWords.ml.txt: 13,326 entries of noun base forms for English and Malayalam. 4. VerbRootWords.en.txt, VerbRootWords.ml.txt: 2,526 parallel entries of English-Malayalam root verbs. 5. VerbMorphTagForms.en.txt, VerbMorphTagForms.ml.txt: 6,120 parallel entries of English-Malayalam morphology-generated tagged sample verb forms. These pairs were generated programmatically, with phrases also extracted from the corpus, and have been validated manually by English-Malayalam bilingual experts holding a Master's degree in Malayalam literature. Details are given in the referring paper; kindly cite it if you use this dataset for research: Sreelekha S., Pushpak Bhattacharyya. Morphology Generation for English-Malayalam SMT. Language Resources and Evaluation Conference (LREC), 2018. The Corpus with Morphology Generated Forms for English-Malayalam is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/).
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/125_res_1.zip [7.13 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/125.html
    Edition LREC 2018
  • Name Extended Typology Paraphrase Corpus (ETPC)
    Resource type Corpus
    Size 5800 paraphrase pairs
    Languages English
    Production status Newly created-in progress
    Resource usage Textual Entailment and Paraphrasing
    License <Not Specified>
    Conditions of use <Not Specified>
    Description Corpus for Paraphrase Identification (PI), annotated with paraphrase types. Additionally annotated with negation.
    Download from https://github.com/venelink/ETPC
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/661.html
    Edition LREC 2018
  • Name Finnish Wiktionary
    Resource type Lexicon
    Size 340787 entries  
    Languages Finnish (fin)
    Production status Existing-used
    Resource usage Lexicon Creation/Annotation
    License Creative Commons CC BY-SA 3.0
    Conditions of use Attribution Share-Alike
    Description <Not Specified>
    Download from https://fi.wiktionary.org
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/312.html
    Edition LREC 2018
  • Name FooTweets_Corpus
    Resource type Corpus
    Size 747 KByte  
    Languages English German
    Production status Newly created-finished
    Resource usage Corpus Creation/Annotation
    License <Not Specified>
    Conditions of use <Not Specified>
    Description FooTweets is the first bilingual parallel corpus of English-German tweets. A total of 4,000 English tweets were collected from the FIFA World Cup 2014 and translated into German. The English tweets are essentially informal in nature, but they were translated into formal German texts in order to help build machine translation systems capable of translating informal texts into formal ones. In addition, each tweet is assigned a sentiment score of either 0.3, 0.5 or 0.7, representing the negative, neutral and positive sentiment classes, respectively (a small mapping helper follows this entry).
    Download from https://github.com/HAfli/FooTweets_Corpus
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/471.html
    Edition LREC 2018
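    Since the corpus fixes sentiment scores at 0.3, 0.5 and 0.7, recovering class labels is a direct lookup. A minimal sketch (only the three score values come from the description above; everything else is illustrative):

      # Map the corpus' fixed sentiment scores back to class labels.
      SENTIMENT_CLASSES = {0.3: "negative", 0.5: "neutral", 0.7: "positive"}

      def sentiment_label(score: float) -> str:
          return SENTIMENT_CLASSES[round(score, 1)]  # round guards float noise

      assert sentiment_label(0.7) == "positive"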
  • Name German Multiword Expressions Scored for Compositionality (Unfiltered)
    Resource type Lexicon
    Size 773468 entries  
    Languages German (deu)
    Production status Newly created-finished
    Resource usage Multiword Expression Compositionality
    License CreativeCommons
    Conditions of use <Not Specified>
    Description A compositionality-ranked list of German multiword expressions, created using the methodology detailed in our paper; evaluation for this resource is not presented.
    Download from http://amor.cms.hu-berlin.de/~robertsw/files/lrec2016/MWE_de-unfiltered.utf8.txt.gz
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/650.html
    Edition LREC 2018
  • Name GOST Lexicon
    Resource type Lexicon
    Size 44613 lexemes  
    Languages English
    Production status Newly created-finished
    Resource usage Corpus Creation/Annotation
    License Open Source
    Conditions of use Academic Use Only
    Description Created by Lancaster University, GOST Lexicon contains 433 single word bio-terms (see OntologySW-AllPaths.usas) and 44,180 multiword bio-terms (see OntologyMWE-AllPaths.usas). It has been merged into the Lancaster UCREL Semantic lexicons to create a new version of the Lancaster USAS semantic annotation system (Rayson et al., 2004; Piao et al., 2015; Piao et al.,2017), named GOST (Gene Ontology Semantic Tagger), in order to automatically annotate the bio-terms with GO IDs in Medical journal articles, along with generic USAS semantic tags.
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/706_res_1.zip [2,72 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/706.html
    Edition LREC 2018
  • Name HAI Alice-corpus
    Resource type Corpus
    Size 9900 tokens  
    Languages English
    Production status Newly created-finished
    Resource usage Question Answering
    License Creative Commons CC BY 4.0
    Conditions of use Attribution
    Description The resource currently contains speech transcriptions of 15 human-agent dialogs. We will provide additional resources in the near future.
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/429_res_1.xml [95,1 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/429.html
    Edition LREC 2018
  • Name HappyDB
    Resource type Corpus
    Size 24 MByte  
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Corpus Creation/Annotation
    License CreativeCommons
    Conditions of use Attribution
    Description HappyDB is a corpus of 100,000+ crowd-sourced happy moments. The goal of the corpus is to advance the state of the art of understanding the causes of happiness that can be gleaned from text. 
    Download from https://rit-public.github.io/HappyDB/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/204.html
    Edition LREC 2018
  • Name Hotels Dialogues and Utterances
    Resource type Corpus
    Size 6000 sentences  
    Languages English
    Production status Newly created-finished
    Resource usage Dialogue
    License <Not Specified>
    Conditions of use <Not Specified>
    Description Collected Utterances using methods described in the paper and whole dialogues for a conversational agent in the hotels domain.
    Download from https://nlds.soe.ucsc.edu/hotels
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/763.html
    Edition LREC 2018
    Name ArgumentMiningECHR
    Resource type Corpus
    Size <Not Specified>
    Languages English
    Production status Newly created-in progress
    Resource usage Machine Learning
    License <Not Specified>
    Conditions of use <Not Specified>
    Description Corpus of Sentences of the European Court of Human Rights annotated with Argumentation concepts, namely Claims and justifications (Premises) attacking or supporting them.
    Download from https://github.com/PLN-FaMAF/ArgumentMiningECHR
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/1048.html
    Edition LREC 2018
  • Name Humor Detection Classifier
    Resource type Corpus
    Size 28.8 KByte  
    Languages <Not Specified>
    Production status Newly created-finished
    Resource usage Classification of humor in texts
    License GNU GPL 3.0
    Conditions of use Preservation of Copyright Notice Share-Alike
    Description The corpus contains tweet IDs along with humorous and non-humorous tags. It was built for humor detection in English-Hindi code-mixed social media content.
    Download from https://github.com/Ankh2295/humor-detection-corpus
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/363.html
    Edition LREC 2018
  • Name Hungarian Webcorpus
    Resource type Corpus
    Size 1.48 billion words  
    Languages Hungarian (hun)
    Production status Existing-used
    Resource usage Evaluation/Validation
    License Open Source
    Conditions of use <Not Specified>
    Description With over 1.48 billion words unfiltered (589m words fully filtered), this is by far the largest Hungarian language corpus, and unlike the Hungarian National Corpus (125m words), it is available in its entirety under a permissive Open Content license. The Hungarian webcorpus was crawled in the winter of 2003 as part of the WordSword project at the Media Research and Education Centre.
    Download from http://mokk.bme.hu/resources/webcorpus/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/730.html
    Edition LREC 2018
  • Name Hungarian Wiktionary
    Resource type Lexicon
    Size 335886 entries  
    Languages Hungarian (hun)
    Production status Existing-used
    Resource usage Lexicon Creation/Annotation
    License Creative Commons CC BY-SA 3.0
    Conditions of use Attribution Share-Alike
    Description <Not Specified>
    Download from https://hu.wiktionary.org
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/312.html
    Edition LREC 2018
  • Name IPSL: A Database of Iconicity Patterns in Sign Languages
    Resource type Lexicon
    Size 193 KByte  
    Languages Russian Sign Language (rsl) French Sign Language (fsl) American Sign Language (ase) British Sign Language (bfi) Spanish Sign Language (ssp)
    Production status Newly created-finished
    Resource usage Lexicon Creation/Annotation
    License Creative Commons CC BY-SA 3.0
    Conditions of use Attribution
    Description This is the first large-scale database of signs annotated according to various parameters of iconicity. The signs represent concrete concepts from seven semantic fields in nineteen sign languages, 1,542 signs in total. Each sign was annotated with respect to the type of form-image association, the presence of iconic location and movement, personification, and whether the sign depicts a salient part of the concept. The database also serves as the basis for a website with several visualization tools. It is possible to visualize iconic properties of separate concepts or of semantic fields on a world map, and to build graphs representing iconic patterns for selected semantic fields.
    Download from https://sl-iconicity.shinyapps.io/iconicity_patterns/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/102.html
    Edition LREC 2018
  • Name KIT-Multi
    Resource type Corpus
    Size 140000 entries  
    Languages English German French
    Production status Newly created-in progress
    Resource usage Knowledge Discovery/Representation
    License CreativeCommons
    Conditions of use <Not Specified>
    Description KIT-Multi is a multilingual embedding corpus, currently consisting of English-German-French word embeddings. Other languages, such as Chinese, Japanese, Korean, Vietnamese, Dutch, Italian, Romanian, Spanish and Portuguese, are being added.
    Download from http://i13pc106.ira.uka.de/~tha/KIT-Multi
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/688.html
    Edition LREC 2018
  • Name Konstanz Resource of Questions (KRoQ)
    Resource type Corpus
    Size 140 MByte  
    Languages German French Spanish Greek
    Production status Newly created-in progress
    Resource usage Question Classification
    License https://github.com/kkalouli/BIBLE-processing/blob/master/KRoQ/license
    Conditions of use <Not Specified>
    Description A Multilingual Approach to Question Classification. To cite this work use: Kalouli, A.-L., Kaiser, K., Hautli-Janisz, A., Kaiser, G., Butt, M. 2018. A Multilingual Approach to Question Classification. In Proceedings of the Language Resources and Evaluation Conference (LREC), Miyazaki, Japan.
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/13_res_1.zip [180 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/13.html
    Edition LREC 2018
  • Name Korean L2 Unknown Words - Labeled Dataset
    Resource type Evaluation Data
    Size 53202 word annotations
    Languages Korean (kor)
    Production status Newly created-finished
    Resource usage Supervised Machine Learning & Evaluation/Validation
    License Creative Commons CC BY-NC 4.0
    Conditions of use Attribution Non-Commercial
    Description This is a labeled dataset for training and/or evaluating unknown word prediction models for L2 learners of Korean. It was extracted from a corpus of passages annotated by L2 learners. To produce this dataset, each annotated word was normalized to its base form by removing inflectional and derivational suffixes, and duplicate annotations were removed so that there is at most one annotation per annotator-word pair (a deduplication sketch follows this entry). All annotated words are labeled as either being known or unknown by the annotator. Metadata about each annotator is also provided, including reported Korean proficiency level, their estimated level based on annotations provided, native language, country, etc.
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/272_res_1.zip [555 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/272.html
    Edition LREC 2018
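    The deduplication rule described above (at most one annotation per annotator-word pair) can be sketched as follows; the file name and column names are assumptions, not the released schema:

      import csv

      # Keep the first annotation seen for each (annotator, base-form word) pair.
      seen, rows = set(), []
      with open("annotations.csv", encoding="utf-8") as f:  # hypothetical file name
          for row in csv.DictReader(f):
              key = (row["annotator_id"], row["word_base_form"])  # assumed columns
              if key not in seen:
                  seen.add(key)
                  rows.append(row)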
  • Name KRAUTS
    Resource type Corpus
    Size <Not Specified>
    Languages German (deu)
    Production status Newly created-finished
    Resource usage <Not Specified>
    License Creative Commons CC BY-NC 4.0
    Conditions of use Attribution Non-Commercial
    Description KRAUTS (Korpus of newspapeR Articles with Underlined Temporal expressionS) is a German temporally annotated news corpus, accompanied by TimeML annotation guidelines for German. It was developed at Fondazione Bruno Kessler, Trento, Italy, and at the Max Planck Institute for Informatics, Saarbrücken, Germany. Our goal is to boost temporal tagging research for German.
    Download from https://github.com/JannikStroetgen/KRAUTS/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/436.html
    Edition LREC 2018
  • Name lexfom
    Resource type Ontology
    Size 100 KByte  
    Languages <Not Specified>
    Production status Existing-updated
    Resource usage Knowledge Discovery/Representation
    License OpenSource
    Conditions of use <Not Specified>
    Description An ontology for representing the lexical functions of Meaning-Text Theory and lexical relations.
    Download from https://github.com/alex-fonseca/lexfom
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/1102.html
    Edition LREC 2018
  • Name LIA-msc
    Resource type Corpus
    Size 210.8 KByte  
    Languages Portuguese (por) Spanish (spa)
    Production status Newly created-finished
    Resource usage Summarisation
    License Lesser General Public License For Linguistic Resources
    Conditions of use https://dev.termwatch.es/~fresa/CORPUS/MSF2/lgpllr.html Preservation of Copyright Notice Share-Alike Notify substantive changes
    Description Multi-Sentence Compression (MSC) is a variation of Sentence Compression. MSC aims at analyzing a cluster of similar sentences to generate a new sentence, which is shorter than the average length of source sentences and has the key information of the cluster. MSC enables summarisation and question-answering systems to generate outputs combining fully formed sentences from one or several documents. We present a new annotated corpus in the Portuguese and Spanish languages for the MSC task. This corpus was collected from Portuguese and Spanish Google News and it is composed of clusters of similar sentences along with a reference compression for each cluster.
    Download from http://dev.termwatch.es/~fresa/CORPUS/MSF2/index.html
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/275.html
    Edition LREC 2018
  • Name LIdioms
    Resource type Lexicon
    Size 147 KByte  
    Languages English (en) Portuguese (pt) German (de) Italian (ita) Russian (ru)
    Production status Newly created-finished
    Resource usage Semantic Web
    License Creative Commons CC BY-NC-SA 3.0
    Conditions of use <Not Specified>
    Description LIDIOMS data set consists in a multilingual RDF representation of idioms containing five languages. The data set is intended to support natural language processing applications by providing links between idioms across languages. The underlying data was crawled and integrated from various sources. To ensure the quality of the crawled data, all idioms were evaluated by at least two native speakers. LIDIOMS is linked with two well-known multilingual data sets BabelNet and DBnary.
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/46_res_1.tgz [147 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/46.html
    Edition LREC 2018
  • Name Lingmotif-lex
    Resource type Lexicon
    Size 67400 entries  
    Languages English Spanish
    Production status Newly created-in progress
    Resource usage Opinion Mining/Sentiment Analysis
    License <Not Specified>
    Conditions of use <Not Specified>
    Description A wide-coverage, manually curated sentiment lexicon featuring a fine-grained valence system and a sentiment-shifter system, accessible through an accompanying Python 3 library.
    Download from http://tecnolengua.uma.es
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/457.html
    Edition LREC 2018
  • Name Live blog Summarization Corpus
    Resource type Corpus
    Size 778 MByte  
    Languages English
    Production status Newly created-finished
    Resource usage Summarisation
    License Apache 2.0
    Conditions of use Preservation of Copyright Notice
    Description Live blogs are an increasingly popular news format used to cover breaking news and live events in online journalism. Online news websites around the world use this medium to give their readers minute-by-minute updates on an event. Good summaries enhance the value of live blogs for a reader but are often not available. In this paper, we study a way of collecting corpora for automatic live blog summarization. In an empirical evaluation using well-known state-of-the-art summarization systems, we show that the live blogs corpus poses new challenges in the field of summarization. We make our tools for reconstructing the corpus publicly available to encourage the research community and allow replication of our results.
    Download from https://github.com/UKPLab/lrec2018-live-blog-corpus
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/317.html
    Edition LREC 2018
  • Name LOaDing
    Resource type Ontology
    Size 1.1 MByte  
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Word Sense Disambiguation
    License Creative Commons CC BY 4.0
    Conditions of use <Not Specified>
    Description LOaDing is a new resource that enriches the Framester knowledge graph, which links Framenet, WordNet, VerbNet and\nother resources, with semantic features extracted from text corpora. Features are extracted from distributional semantics-based sense inventories and allow to connect the resource with text, for instance to boost the performance on Word Frame Disambiguation. Since Framester is a frame-based knowledge graph, which enables full-fledged OWL querying and reasoning, our resource paves the way for the development of novel, deeper semantic-aware applications that could benefit from the combination of knowledge from text and complex symbolic representations of events and participants.
    Download from http://data.dws.informatik.uni-mannheim.de/download/loading/ddt-wiki-n30-1400k-loading.zip
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/263.html
    Edition LREC 2018
    Name LX-DSemVectors 2.2b
    Resource type Lexicon
    Size <Not Specified>
    Languages Portuguese (por)
    Production status Newly created-in progress
    Resource usage <Not Specified>
    License Creative Commons CC BY 4.0
    Conditions of use <Not Specified>
    Description Distributional semantics model (aka word embeddings) for Portuguese, LX-DSemVectors 2.2b, trained over 2.2 billion tokens, with the largest vocabulary and the best intrinsic evaluation scores.
    Download from http://github.com/nlx-group
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/592.html
    Edition LREC 2018
  • Name MCDTB
    Resource type Treebank
    Size 294 KByte  
    Languages Chinese
    Production status Newly created-in progress
    Resource usage Discourse
    License Creative Commons CC BY 4.0
    Conditions of use Attribution
    Description In view of the differences between the annotations of micro and macro discourse relationships, this paper describes the relevant experiments on the construction of the Macro Chinese Discourse Treebank (MCDTB), a higher-level Chinese discourse corpus. Following RST (Rhetorical Structure Theory), we annotate the macro discourse information, including discourse structure, nuclearity and relationship, and the additional discourse information, including topic sentences, lead and abstract, to make the macro discourse annotation more objective and accurate. Finally, we annotated 720 articles with a Kappa value greater than 0.6. Preliminary experiments on this corpus verify the computability of MCDTB.
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/147_res_1.zip [39,4 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/147.html
    Edition LREC 2018
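    The description reports inter-annotator agreement as a Kappa value above 0.6. For reference, a minimal Cohen's kappa over two annotators' parallel label lists (a generic implementation, not the authors' code):

      from collections import Counter

      def cohen_kappa(a, b):
          """a, b: equal-length lists of labels from two annotators."""
          n = len(a)
          p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
          ca, cb = Counter(a), Counter(b)
          p_e = sum(ca[l] * cb[l] for l in ca.keys() | cb.keys()) / (n * n)  # chance
          return (p_o - p_e) / (1 - p_e)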
  • Name MGAD Syntactic Analogy Datasets
    Resource type Evaluation Data
    Size 5 MByte  
    Languages Arabic Russian Hindi
    Production status Newly created-in progress
    Resource usage Evaluation/Validation
    License <Not Specified>
    Conditions of use <Not Specified>
    Description Syntactic analogy datasets for evaluating word embeddings in Arabic, Russian, and Hindi (an evaluation sketch follows this entry).
    Download from https://github.com/rutrastone/LREC2018
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/1022.html
    Edition LREC 2018
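    Datasets like this are typically scored with the 3CosAdd rule: for a : b :: c : ?, predict the word whose vector is most cosine-similar to b - a + c. A minimal sketch, assuming a {word: unit-normalised numpy vector} map has been loaded elsewhere:

      import numpy as np

      def solve_analogy(vectors, a, b, c):
          target = vectors[b] - vectors[a] + vectors[c]
          target /= np.linalg.norm(target)
          best, best_sim = None, -1.0
          for word, vec in vectors.items():
              if word in (a, b, c):
                  continue  # exclude query words, as is conventional
              sim = float(vec @ target)
              if sim > best_sim:
                  best, best_sim = word, sim
          return best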
  • Name MirasText
    Resource type Corpus
    Size 15.3 GByte  
    Languages Iranian Persian (pes)
    Production status Newly created-in progress
    Resource usage Language Modelling
    License MIT License
    Conditions of use Preservation of Copyright Notice
    Description This repository contains the MirasText corpus and a description, along with what it has been used for and what it can be used for. A sample of the dataset is provided in MirasText_sample.txt, which contains 1,000 documents.
    Download from https://github.com/miras-tech/MirasText
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/385.html
    Edition LREC 2018
  • Name MirasVoice
    Resource type Corpus
    Size 20 GByte  
    Languages Iranian Persian (pes) English (eng)
    Production status Newly created-in progress
    Resource usage Person Identification
    License Apache 2.0
    Conditions of use Preservation of Copyright Notice
    Description The MirasVoice Speech Corpus (MVSC) is one of the largest Farsi-English voice datasets currently available for general-purpose studies and expert system development. Applications include speaker recognition systems, speech recognition studies, gender recognition, cognitive science, and pattern recognition.
    Download from https://github.com/miras-tech/MirasVoice
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/443.html
    Edition LREC 2018
  • Name morphdb.hu
    Resource type Lexicon
    Size 6.2 MByte  
    Languages Hungarian (hun)
    Production status Existing-used
    Resource usage Morphological Analysis
    License Open Source
    Conditions of use <Not Specified>
    Description morphdb.hu is an open source morphological database of Hungarian, consisting of a lexicon and morphological grammar that are based on well-founded theoretical decisions.
    Download from http://mokk.bme.hu/resources/morphdb-hu/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/730.html
    Edition LREC 2018
  • Name MPST: A Corpus of Movie Plot Synopses with Tags
    Resource type Corpus
    Size 153 MByte  
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Document Classification, Text categorisation
    License GNU GPL 3.0
    Conditions of use Preservation of Copyright Notice Share-Alike
    Description The corpus contains 14,828 plot synopses of movies and their multi-label associations with 71 fine-grained tags. The tagset was created by collecting tags from the MovieLens 20M dataset and IMDB, filtering tags related to the plots and grouping semantically similar tags together. These tags represent a wide range of information about the movies like their genres, plot structures, and emotional experiences that a viewer may feel after watching the movie. The plot synopses were collected from IMDB and Wikipedia and they all have at least 10 sentences.
    Download from http://ritual.uh.edu/mpst-2018/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/332.html
    Edition LREC 2018
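    Because each synopsis has a multi-label association with the 71 tags, a common first step is binarising the tag lists for classifier training. A minimal sketch with illustrative tag lists (not taken from the corpus):

      from sklearn.preprocessing import MultiLabelBinarizer

      tag_lists = [["cult", "revenge"], ["romantic", "flashback"]]  # illustrative
      mlb = MultiLabelBinarizer()
      Y = mlb.fit_transform(tag_lists)  # shape: (n_movies, n_distinct_tags)
      print(mlb.classes_, Y)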
  • Name MSimlex999_Polish
    Resource type Evaluation Data
    Size 1998 words  
    Languages Polish (pol)
    Production status Newly created-in progress
    Resource usage Evaluation/Validation
    License Creative Commons CC BY
    Conditions of use <Not Specified>
    Description Polish translation of the SimLex-999 data (https://www.cl.cam.ac.uk/~fh295/simlex.html) with similarity and relatedness scores.
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/687_res_1.txt [29,9 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/687.html
    Edition LREC 2018
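    SimLex-style data is normally used to rank-correlate a model's similarity scores against the human judgements. A minimal evaluation sketch, assuming the (word1, word2, score) triples and an embedding map are loaded elsewhere:

      import numpy as np
      from scipy.stats import spearmanr

      def evaluate(pairs, vectors):
          """pairs: iterable of (word1, word2, human_score) triples."""
          model, human = [], []
          for w1, w2, score in pairs:
              if w1 in vectors and w2 in vectors:
                  v1, v2 = vectors[w1], vectors[w2]
                  model.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
                  human.append(score)
          rho, _pvalue = spearmanr(model, human)
          return rho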
    Name Multimodal Lexical Translation Dataset
    Resource type Corpus
    Size 98647 sentences  
    Languages English German French
    Production status Newly created-finished
    Resource usage Evaluation/Validation
    License Creative Commons CC BY-NC-SA 4.0
    Conditions of use <Not Specified>
    Description Multimodal Lexical Translation Dataset is a collection of 4-tuples of the form: (x, y, X, V) where x is an ambiguous word, X is its textual context (a sentence in source language), V is its visual context (an image), and y is its translation that conforms with both the textual and visual contexts.
    Download from https://github.com/sheffieldnlp/mlt
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/629.html
    Edition LREC 2018
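    The 4-tuple structure above maps directly onto a typed record; in this sketch the field names are mine and only the (x, y, X, V) structure comes from the description:

      from dataclasses import dataclass

      @dataclass
      class LexicalTranslationExample:
          ambiguous_word: str  # x: the ambiguous source word
          translation: str     # y: translation consistent with both contexts
          sentence: str        # X: textual context (source-language sentence)
          image_path: str      # V: visual context (an image)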
  • Name MultiBooked Corpus
    Resource type Corpus
    Size 29 MByte  
    Languages Basque (eus) Catalan (cat)
    Production status Newly created-finished
    Resource usage Opinion Mining/Sentiment Analysis
    License Creative Commons
    Conditions of use <Not Specified>
    Description While sentiment analysis has become an established field in the NLP community, research into languages other than English has been hindered by the lack of resources. Although much research in multi-lingual and cross-lingual sentiment analysis has focused on unsupervised or semi-supervised approaches, these still require a large number of resources and do not reach the performance of supervised approaches. With this in mind, we introduce two datasets for supervised aspect-level sentiment analysis in Basque and Catalan, both of which are under-resourced languages. We provide high-quality annotations and benchmarks with the hope that they will be useful to the growing community of researchers working on these languages.
    Download from https://repositori.upf.edu/handle/10230/33928
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/217.html
    Edition LREC 2018
  • Name Multilingual IsA (MIsA)
    Resource type Corpus
    Size 2 GByte  
    Languages English Spanish French Italian Dutch
    Production status Newly created-finished
    Resource usage Information Extraction, Information Retrieval
    License Creative Commons CC BY 4.0
    Conditions of use <Not Specified>
    Description Multilingual IsA (MISA) is a collection of hypernymy relations in five languages (i.e., English, Spanish, French, Italian and Dutch) extracted from the corresponding full Wikipedia corpus. The extraction, for each language, is based on an established set of existing (viz. found in literature) or newly defined lexico-syntactic patterns. Similarly to WebIsADb, the resulting resource contains hypernymy relations represented as "tuples", as well as additional information such as provenance, context of the extraction, etc.
    Download from http://web.informatik.uni-mannheim.de/misa/download.html
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/254.html
    Edition LREC 2018
  • Name N-gram Analogical Clusters and Analogical Grids
    Resource type Lexicon
    Size 374 MByte  
    Languages Danish German Modern Greek English Spanish
    Production status Newly created-finished
    Resource usage Morphological Analysis
    License Creative Commons CC BY-NC 4.0
    Conditions of use <Not Specified>
    Description This is a complete dataset of analogical clusters and analogical grids built on the vocabularies of 11 languages contained in 1,000 corresponding lines of the 11 different language versions of the Europarl corpus v.3. The grids were built on N-grams of different lengths, from words to 6-grams. We hope that the use of structured parallel data will foster research in comparative linguistics.
    Download from https://waseda.pure.elsevier.com/en/publications/tools-for-the-production-of-analogical-grids-and-a-resource-of-n-
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/344.html
    Edition LREC 2018
  • Name Natural Stories GCG-Bank
    Resource type Treebank
    Size 485 Other
    Languages English
    Production status Complete
    Resource usage <Not Specified>
    License GNU GPL 3.0
    Conditions of use Preservation of Copyright Notice Share-Alike Free
    Description Natural Stories GCG-Bank contains hand-corrected deep syntactic annotations for the Natural Stories self-paced reading corpus (Futrell et al., 2017). The annotations are designed to support psycholinguistic investigation into the structural determinants of sentence processing effort. Natural Stories GCG-Bank is distributed as a sub-module of the ModelBlocks repository, a code base designed to support broad-coverage psycholinguistic modeling.
    Download from https://github.com/modelblocks/modelblocks-release
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/9.html
    Edition LREC 2018
  • Name NL2Bash
    Resource type Corpus
    Size 12609 entries  
    Languages English
    Production status Newly created-finished
    Resource usage Natural language to code generation
    License GNU GPL 3.0
    Conditions of use Attribution Share-Alike
    Description A parallel corpus of one-line Bash commands paired with their natural language descriptions.
    Download from https://github.com/TellinaTool/nl2bash
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/1021.html
    Edition LREC 2018
  • Name NL2KB
    Resource type Terminology
    Size 3.9 MByte  
    Languages English
    Production status Newly created-finished
    Resource usage Information Extraction, Information Retrieval
    License MIT License
    Conditions of use <Not Specified>
    Description Two files are included in this release. kb2nl.txt contains the relational mappings from knowledge base (KB) predicates to natural language (NL) relation patterns; each line covers one of the 629 most frequent KB predicates in DBpedia, with columns separated by tabs. The first column is the predicate in the format #p#{predicate_name}; the rest of the line lists the mapped NL relation patterns in the format #r#{pattern} score. nl2kb.txt contains the relational mappings from NL relation patterns to KB predicates. (A reader sketch follows this entry.)
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/94_res_1.tgz [3,88 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/94.html
    Edition LREC 2018
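    A minimal kb2nl.txt reader following the layout described above; the tab separator is my reading of the description, and the released file may differ in detail:

      def load_kb2nl(path):
          """Return {predicate: [(pattern, score), ...]} from kb2nl.txt."""
          mapping = {}
          with open(path, encoding="utf-8") as f:
              for line in f:
                  cols = line.rstrip("\n").split("\t")
                  predicate = cols[0].removeprefix("#p#")
                  patterns = []
                  for col in cols[1:]:
                      pattern, score = col.rsplit(" ", 1)  # "#r#{pattern} score"
                      patterns.append((pattern.removeprefix("#r#"), float(score)))
                  mapping[predicate] = patterns
          return mapping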
  • Name Online Drug User Guideline Corpus
    Resource type Corpus
    Size 1.4 MByte  
    Languages English
    Production status Newly created-in progress
    Resource usage Information Extraction, Information Retrieval
    License Creative Commons CC BY 4.0
    Conditions of use Attribution
    Description The resource is the dataset presented in the paper and is publicly available.
    Download from https://zenodo.org/record/1173345#.WoTZkJM-f-Y
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/426.html
    Edition LREC 2018
  • Name Open Source International Arabic News Corpus
    Resource type Corpus
    Size 157000000 tokens (roughly)
    Languages Arabic
    Production status First version
    Resource usage <Not Specified>
    License OpenSource
    Conditions of use Public (free)
    Description The Open Source International Arabic News (OSIAN) corpus has been collected from international Arabic news websites such as CNN, DW, RT and Aljazeera. With a server-friendly crawling policy, we extracted 1 million web pages. After the necessary cleaning and filtering steps, the OSIAN corpus contains 477,556 articles comprising 2,861,944 sentences and roughly 157 million words. The corpus is encoded in XML; each article is annotated with metadata giving its web location and the date of its extraction. Moreover, each word is annotated with its lemma and part of speech.
    Download from http://oujda-nlp-team.net/en/corpora/osian-corpus/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/7.html
    Edition LREC 2018
  • Name Open-Content Text Corpus
    Resource type Corpus
    Size 28 MByte  
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Corpus Creation/Annotation
    License Annotations: Creative Commons CC BY 4.0; original content from ClueWeb12 keeps its original license; annotation tool: GNU General Public License v3.0
    Conditions of use <Not Specified>
    Description This repository contains the corpus created for the publication 'Beyond Generic Summarization: A Multi-faceted Hierarchical Summarization Corpus of Large Heterogeneous Data', as well as the annotation tool developed for that purpose and an example Amazon Mechanical Turk HIT. The Corpus folder contains: the SourceDocuments folder with the .xml files of all source topics and a .txt file with the topic names; the AMTAllNuggets folder with a tab-delimited csv file of all Amazon Mechanical Turk annotations in the format worker [tab] annotation (turker IDs have been hashed for anonymization); and the Trees folder with the input documents for the tree annotation, the trees from three annotators, and the gold-standard trees created from them. The AnnotationTool folder contains the annotation tool as a Java archive along with its source code and documentation. The HIT-Template folder contains an example HIT along with the JavaScript and stylesheet.
    Download from https://github.com/AIPHES/HierarchicalSummarization
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/252.html
    Edition LREC 2018
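    A minimal reader for the AMTAllNuggets annotations (worker [tab] annotation per line, as described above); the exact file name inside the folder is hypothetical:

      import csv
      from collections import defaultdict

      annotations_by_worker = defaultdict(list)
      # File name is hypothetical; the format (two tab-separated columns) is
      # taken from the description above.
      with open("AMTAllNuggets/annotations.tsv", encoding="utf-8") as f:
          for worker, annotation in csv.reader(f, delimiter="\t"):
              annotations_by_worker[worker].append(annotation)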
  • Name ORTOFON v1: balanced corpus of informal spoken Czech with multi-tier transcription (transcriptions)
    Resource type Corpus
    Size 1000000 words  
    Languages Czech (ces)
    Production status Newly created-finished
    Resource usage Speech Recognition/Understanding
    License Creative Commons CC BY-NC-SA 4.0
    Conditions of use Attribution Non-Commercial Share-Alike
    Description ORTOFON v1 is designed as a representation of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) in the area of the whole Czech Republic. The corpus is composed of 332 recordings from 2012–2017 and contains 1,014,786 orthographic words (i.e. a total of 1,236,508 tokens including punctuation); a total of 624 different speakers appear in the probes. ORTOFON v1 is fully balanced regarding the basic sociolinguistic speaker categories (gender, age group, level of education and region of childhood residence). The transcription is linked to the corresponding audio track. Unlike the ORAL-series corpora, the transcription was carried out on two main tiers, orthographic and phonetic, supplemented by an additional metalanguage tier. ORTOFON v1 is lemmatized and morphologically tagged. The (anonymized) corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query engine to registered users of the CNC at http://www.korpus.cz Please note: this item includes only the transcriptions, audio (and the transcripts in their original format) is available under more restrictive non-CC license at http://hdl.handle.net/11234/1-2579 Published by: Charles University, Faculty of Arts, Institute of the Czech National Corpus Acknowledgements: This resource was created within the Czech National Corpus project (LM2015044) funded by the Ministry of Education, Youth and Sports of the Czech Republic within the framework of Large Research, Development and Innovation Infrastructures.
    Download from http://hdl.handle.net/11234/1-2580
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/833.html
    Edition LREC 2018
  • Name Parallel English-Persian Corpus (PEPC)
    Resource type Corpus
    Size 200000 sentences  
    Languages English Persian
    Production status Newly created-finished
    Resource usage Machine Translation, SpeechToSpeech Translation
    License <Not Specified>
    Conditions of use <Not Specified>
    Description PEPC is a collection of parallel sentences in English and Persian languages extracted from Wikipedia documents using a bidirectional translation method.
    Download from https://iasbs.ac.ir/~ansari/nlp/pepc.html
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/674.html
    Edition LREC 2018
  • Name Persian word embeddings
    Resource type Corpus
    Size 200 MByte  
    Languages Iranian Persian (pes)
    Production status Newly created-finished
    Resource usage Machine Learning
    License <Not Specified>
    Conditions of use <Not Specified>
    Description ArmanPersoNERCorpus is the first manually-annotated Persian named-entity (NE) dataset (ISLRN 399-379-640-828-6). We are releasing it only for academic research use. The data set includes 250,015 tokens and 7,682 Persian sentences in total. It is available in 3 folds to be used in turn as training and test sets. Each file contains one token, along with its manually annotated named-entity tag, per line. Each sentence is separated with a newline. The NER tags are in IOB format. According to the instructions provided to the annotators, NEs are categorized into six classes: person, organization (such as banks, ministries, embassies, teams, nationalities, networks and publishers), location (such as cities, villages, rivers, seas, gulfs, deserts and mountains), facility (such as schools, universities, research centers, airports, railways, bridges, roads, harbors, stations, hospitals, parks, zoos and cinemas), product (such as books, newspapers, TV shows, movies, airplanes, ships, cars, theories, laws, agreements and religions), and event (such as wars, earthquakes, national holidays, festivals and conferences); the remaining tokens are labeled as other. Four different word embeddings were trained on a sizable collection of unannotated Persian text. They contain a comprehensive Persian dictionary of nearly 50K unique words. The length of the embedding vectors is 300. The use of these embeddings is unrestricted. (A reader sketch follows this entry.)
    Download from https://github.com/HaniehP/PersianNER
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/48.html
    Edition LREC 2018
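    The token-per-line IOB layout described above reads back into sentences in a few lines; a minimal sketch, assuming whitespace separates token and tag:

      def read_iob(path):
          """Return a list of sentences, each a list of (token, tag) pairs."""
          sentences, current = [], []
          with open(path, encoding="utf-8") as f:
              for line in f:
                  line = line.strip()
                  if line:
                      token, tag = line.rsplit(None, 1)
                      current.append((token, tag))
                  elif current:  # blank line ends a sentence
                      sentences.append(current)
                      current = []
          if current:
              sentences.append(current)
          return sentences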
  • Name PhotoshopQuiA
    Resource type Corpus
    Size 2854 entries  
    Languages English
    Production status Newly created-finished
    Resource usage Question Answering
    License Creative Commons CC BY-SA 3.0
    Conditions of use Attribution Share-Alike
    Description We introduce the PhotoshopQuiA dataset, a new publicly available set of 2,854 why-question and answer (WhyQ, A) pairs related to Adobe Photoshop usage collected from five CQA web sites. We chose Adobe Photoshop because it is a popular and well-known product, with a lively, knowledgeable and sizable community. To the best of our knowledge, this is the first English dataset for Why-QA that focuses on a product, as opposed to previous open-domain datasets. The corpus is stored in JSON format and contains detailed data about questions and questioners as well as answers and answerers. The dataset can be used to build Why-QA systems, to evaluate current approaches for answering why-questions, and to develop new models for future QA systems research.
    Download from https://github.com/dulceanu/photoshop-quia/blob/master/dataset/PhotoshopQuiA.json
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/758.html
    Edition LREC 2018
  • Name PolNARC-2016
    Resource type Corpus
    Size 18300 Attribution relations
    Languages English
    Production status Newly created-in progress
    Resource usage Information Extraction, Information Retrieval
    License MIT License
    Conditions of use Preservation of Copyright Notice
    Description The Political News Attribution Relations Corpus annotates the attribution of direct and indirect quotes, as well as private states expressing belief and intention, using an annotation scheme derived from that of PARC3.
    Download from https://github.com/networkdynamics/PolNeAR
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/1051.html
    Edition LREC 2018
  • Name Portuguese Multiword Expressions Scored for Compositionality (Unfiltered)
    Resource type Lexicon
    Size 184662 entries  
    Languages Portuguese (por)
    Production status Newly created-finished
    Resource usage Multiword Expression Compositionality
    License CreativeCommons
    Conditions of use <Not Specified>
    Description A compositionality-ranked list of Portuguese multiword expressions, created using the methodology detailed in our paper; evaluation for this resource is not presented.
    Download from http://amor.cms.hu-berlin.de/~robertsw/files/lrec2016/MWE_pt-unfiltered.utf8.txt.gz
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/650.html
    Edition LREC 2018
  • Name POS Tagged Dialectal Arabic Data
    Resource type Corpus
    Size 600 KByte  
    Languages Egyptian Arabic Levantine Arabic Gulf Arabic Maghrebi Arabic
    Production status Newly created-finished
    Resource usage Part-of-Speech Tagging
    License CreativeCommons
    Conditions of use <Not Specified>
    Description 350 tweets for each of four major Arabic dialects, manually segmented and POS tagged.
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/562_res_1.tgz [582 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/562.html
    Edition LREC 2018
  • Name PoSTWITA-UD
    Resource type Treebank
    Size 124410 tokens  
    Languages Italian
    Production status Newly created-finished
    Resource usage Parsing and Tagging
    License Creative Commons CC BY-NC-SA 4.0
    Conditions of use Attribution Non-Commercial Share-Alike
    Description PoSTWITA-UD is a collection of Italian tweets annotated in Universal Dependencies that can be exploited for the training of NLP systems to enhance their performance on social media texts.
    Download from https://github.com/UniversalDependencies/UD_Italian-PoSTWITA
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/636.html
    Edition LREC 2018
  • Name Prague Dependency Treebank 3.5 (PDT 3.5)
    Resource type Corpus
    Size 50000 sentences  
    Languages Czech (ces)
    Production status Existing-used
    Resource usage Corpus Creation/Annotation
    License Creative Commons CC BY-NC-SA 4.0
    Conditions of use Attribution Non-Commercial Share-Alike
    Description The Prague Dependency Treebank 3.5 is a 2018 edition of the core Prague Dependency Treebank (PDT). It contains all PDT annotation made at the Institute of Formal and Applied Linguistics under various projects between 1996 and 2018 on the original texts.
    Download from http://hdl.handle.net/11234/1-2621
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/20.html
    Edition LREC 2018
  • Name Relational Noun Lexicon
    Resource type Lexicon
    Size 6224 words  
    Languages English
    Production status Newly created-finished
    Resource usage Information Extraction, Information Retrieval
    License Open Source
    Conditions of use <Not Specified>
    Description A lexicon of 6,224 nouns, annotated as either relational or non-relational (1,446 are relational), for use in relation extraction systems and other NLP applications.
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/461_res_1.tsv [64,0 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/461.html
    Edition LREC 2018
  • Name Risamálheild (Icelandic Gigaword Corpus)
    Resource type Corpus
    Size 1.3 billion words  
    Languages Icelandic
    Production status Newly created-in progress
    Resource usage Machine Learning
    License Part 1 under a custom user licence (full text: http://www.malfong.is/files/userlicense_rmh1_download_en.pdf); Part 2 under CC-BY 4.0
    Conditions of use Part 1: Research Use, No Commercial Exploitation of Source Material, Attribution; Part 2: Attribution
    Description A large corpus with more than one billion running words from contemporary Icelandic texts. The two main sources are official texts and texts from news media. The corpus texts are morphosyntactically tagged and provided with metadata.
    Download from http://www.malfong.is/index.php?lang=en&pg=rmh
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/746.html
    Edition LREC 2018
  • Name rlfowl
    Resource type Ontology
    Size 22 MByte  
    Languages French (fra)
    Production status Newly created-in progress
    Resource usage Knowledge Discovery/Representation
    License OpenSource
    Conditions of use <Not Specified>
    Description Ontology representation of the French Lexical Network.
    Download from https://github.com/alex-fonseca/rlfowl
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/1102.html
    Edition LREC 2018
  • Name Russian Wiktionary
    Resource type Lexicon
    Size 861467 entries  
    Languages Russian (rus)
    Production status Existing-used
    Resource usage Lexicon Creation/Annotation
    License Creative Commons CC BY-SA 3.0
    Conditions of use Attribution Share-Alike
    Description <Not Specified>
    Download from https://ru.wiktionary.org
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/312.html
    Edition LREC 2018
  • Name Sample IS-pros Corpus
    Resource type Corpus
    Size 59.9 MByte  
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Corpus Creation/Annotation
    License GNU GPL 3.0
    Conditions of use Preservation of Copyright Notice Share-Alike
    Description This sample corpus includes the pitch and intensity objects created from speech samples of twelve speakers of American English reading 109 sentences from news items. The sentences are annotated with hierarchical thematicity in TextGrid format. After running the processing pipeline presented in the paper and made available in this submission, the corpus will be annotated with thematicity and acoustic features in both TextGrid and csv formats. A TextGrid-loading sketch appears after this list.
    Download from https://github.com/TalnUPF/compilationISpros
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/530.html
    Edition LREC 2018
  • Name Sarcasm Target Dataset
    Resource type Corpus
    Size <Not Specified>
    Languages English
    Production status Newly created-finished
    Resource usage Opinion Mining/Sentiment Analysis
    License <Not Specified>
    Conditions of use <Not Specified>
    Description The dataset covers two domains: book snippets and tweets. Each entry is a sarcastic text, and its label is either (a) the subset of words in the sentence that point to the sarcasm target, or (b) the fall-back label 'Outside'. A scoring sketch appears after this list.
    Download from https://github.com/Pranav-Goel/Sarcasm-Target-Detection
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/583.html
    Edition LREC 2018
  • Name Self-Annotated Reddit Corpus (SARC)
    Resource type Corpus
    Size 200 GByte  
    Languages English
    Production status Newly created-finished
    Resource usage Opinion Mining/Sentiment Analysis
    License Open Source
    Conditions of use <Not Specified>
    Description A large corpus for sarcasm research and for training and evaluating systems for sarcasm detection.
    Download from http://nlp.cs.princeton.edu/SARC/2.0/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/160.html
    Edition LREC 2018
  • Name Semantic verb classes for English, Polish, and Croatian
    Resource type Evaluation Data
    Size 267 lexemes  
    Languages English (eng) Polish (pol) Croatian (hrv)
    Production status Newly created-in progress
    Resource usage Lexicon Creation/Annotation
    License Creative Commons CC BY
    Conditions of use Attribution
    Description The classifications result from semantic clustering experiments in which native speakers were asked to sort a sample of 267 verbs into soft clusters based solely on their meaning (no reference to the verbs' syntactic behaviour was required). The experiments were aimed at verifying whether semantic verb classes can be reliably obtained from non-expert human annotators following simple instructions.
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/116_res_1.zip [52.7 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/116.html
    Edition LREC 2018
  • Name SenSALDO
    Resource type Lexicon
    Size 69700 words  
    Languages Swedish
    Production status Newly created-finished
    Resource usage Opinion Mining/Sentiment Analysis
    License Creative Commons CC BY
    Conditions of use Attribution
    Description Sentiment lexicon for Swedish, based on word senses in SALDO 2.3. Sentiment values range from -1 to +1, and discrete values in {-1, 0, +1} are also provided. A polarity-scoring sketch appears after this list.
    Download from https://spraakbanken.gu.se/eng/resource/sensaldo
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/857.html
    Edition LREC 2018
  • Name Sentiment Lexicon of IDiomatic Expressions (SLIDE)
    Resource type Lexicon
    Size 5000 entries  
    Languages English
    Production status Newly created-finished
    Resource usage Opinion Mining/Sentiment Analysis
    License © Wikipedia, CC-BY-SA 3.0; © IBM 2014, released under CC-BY-SA 4.0
    Conditions of use <Not Specified>
    Description The Sentiment Lexicon of IDiomatic Expressions (SLIDE) is a large idiom sentiment lexicon, which includes 5,000 frequently occurring idioms.
    Download from http://www.research.ibm.com/haifa/dept/vst/debating_data.shtml
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/602.html
    Edition LREC 2018
  • Name SMS Test Collection
    Resource type Evaluation Data
    Size 800 KByte  
    Languages English
    Production status Newly created-finished
    Resource usage Information Extraction, Information Retrieval
    License <Not Specified>
    Conditions of use <Not Specified>
    Description This resource contains SMS topics and TREC-style relevance judgments for evaluating information retrieval systems. The underlying collection of SMS conversations can be requested or purchased from the Linguistic Data Consortium (LDC) (https://www.ldc.upenn.edu/). A qrels-loading sketch appears after this list.
    Download from https://github.com/rashmisankepally/SMSTestCollection/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/695.html
    Edition LREC 2018
  • Name Spanish Multiword Expressions Scored for Compositionality (Unfiltered)
    Resource type Lexicon
    Size 277960 entries  
    Languages Spanish (spa)
    Production status Newly created-finished
    Resource usage Multiword Expression Compositionality
    License CreativeCommons
    Conditions of use <Not Specified>
    Description A compositionality-ranked list of Spanish multiword expressions, created using the methodology detailed in our paper; evaluation for this resource is not presented.
    Download from http://amor.cms.hu-berlin.de/~robertsw/files/lrec2016/MWE_es-unfiltered.utf8.txt.gz
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/650.html
    Edition LREC 2018
  • Name Spokes Mix
    Resource type Corpus
    Size 2200000 words  
    Languages Polish
    Production status Existing-updated
    Resource usage Speech Recognition/Understanding
    License Creative Commons CC BY-NC
    Conditions of use <Not Specified>
    Description Spokes Mix is an online service providing access to a number of spoken corpora of Polish, including three newly released time-aligned collections of manually transcribed spoken-conversational data.
    Download from http://pelcra.clarin-pl.eu/spokes2-web/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/888.html
    Edition LREC 2018
  • Name Stars2T corpus of time-constrained referring expressions
    Resource type Corpus
    Size 368 annotated referring expressions
    Languages Portuguese (por)
    Production status Newly created-finished
    Resource usage Natural Language Generation
    License Creative Commons CC BY 4.0
    Conditions of use Attribution
    Description The Stars2T corpus is a collection of annotated definite descriptions elicited from visual stimuli in time-constrained situations of communication. The corpus may be used as a standard dataset for referring expression generation (REG) with a particular focus on time constraints (and, as it turns out, on the issue of referential overspecification).
    Download from https://drive.google.com/open?id=0B-KyU7T8S8bLYzNtWTJfWGszdk0
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/39.html
    Edition LREC 2018
  • Name STREUSLE 4.0
    Resource type Corpus
    Size 55000 words  
    Languages English
    Production status Existing-updated
    Resource usage Corpus Creation/Annotation
    License Creative Commons CC BY-SA 4.0
    Conditions of use Attribution Share-Alike
    Description STREUSLE stands for Supersense-Tagged Repository of English with a Unified Semantics for Lexical Expressions. The text is from the web reviews portion of the English Web Treebank. STREUSLE incorporates comprehensive annotations of multiword expressions (MWEs) and semantic supersenses for lexical expressions. The supersense labels apply to single- and multiword noun and verb expressions, and prepositional/possessive expressions. The 4.0 release updates the inventory and application of preposition supersenses, applies those supersenses to possessives, incorporates the syntactic annotations from the Universal Dependencies project, and adds lexical category labels to indicate the holistic grammatical status of strong multiword expressions.
    Download from https://github.com/nert-gu/streusle
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/963.html
    Edition LREC 2018
  • Name STS.news.sr
    Resource type Corpus
    Size 1192 sentence pairs
    Languages Serbian (srp)
    Production status Newly created-finished
    Resource usage Semantic Textual Similarity
    License Creative Commons CC BY-NC-SA 4.0
    Conditions of use Attribution Non-Commercial Share-Alike
    Description The Serbian STS News Corpus (ISLRN 146-979-597-345-4) consists of 1192 pairs of sentences in Serbian gathered from news sources on the web. Each sentence pair was manually annotated with fine-grained semantic similarity scores on the 0-5 scale. The final scores were obtained by averaging the individual scores of five annotators. An evaluation sketch appears after this list.
    Download from http://vukbatanovic.github.io/STS.news.sr/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/442.html
    Edition LREC 2018
  • Name Swedish Literary Corpus
    Resource type Corpus
    Size 178.2 KByte  
    Languages Swedish (swe)
    Production status Newly created-in progress
    Resource usage Written Dialogue
    License GNU GPL 3.0
    Conditions of use Preservation of Copyright Notice Share-Alike Attribution
    Description This corpus consists of chapters from novels by four Swedish authors: August Strindberg, The Red Room (1879; obtained from the National Edition of August Strindberg’s Collected Works, published in 1981); Hjalmar Söderberg, The Serious Game (1912); Birger Sjöberg, The Quartet That Split Up, part I (1924); and Karin Boye, Kallocain (1940). In each file, every line within a dialogue is annotated with the following information: Speaker, Addressee, Speaker type, and Addressee type.
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/1036_res_1.tgz [174 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/1036.html
    Edition LREC 2018
  • Name Synthetic source Korean-Arabic bilingual corpus
    Resource type Corpus
    Size 450000 sentences  
    Languages Korean Arabic
    Production status Newly created-finished
    Resource usage Machine Translation, SpeechToSpeech Translation
    License OpenSource
    Conditions of use <Not Specified>
    Description The paper uses four kinds of data: WIT3, OPUS, a production corpus, and synthetic corpora. Here we upload our synthetic source-side Korean-Arabic bilingual corpus, which we are free to share. The other corpora (WIT3, OPUS, the synthetic target corpus, and the production corpus) cannot be shared through the LRE Map: WIT3 and OPUS are open-source corpora that anyone can obtain from their respective sites, and the commercially commissioned production corpus cannot be redistributed.
    Download from https://github.com/ChoiGH/For_LRE_Map_corpus
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/139.html
    Edition LREC 2018
  • Name Szeged Corpus
    Resource type Corpus
    Size 82000 sentences  
    Languages Hungarian (hun)
    Production status Existing-used
    Resource usage Evaluation/Validation
    License <Not Specified>
    Conditions of use <Not Specified>
    Description A concept space for Explicit Semantic Analysis (ESA) (Gabrilovich & Markovitch 2007), a technique that provides a semantic representation of text in a space of concepts derived from Wikipedia. ESA defines concepts from Wikipedia articles, e.g., BARACK OBAMA and COMPUTER SCIENCE. This resource is a concept space created from a Wikipedia snapshot (April 2017).
    Download from http://rgai.inf.u-szeged.hu/index.php?lang=en&page=SzegedTreebank
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/730.html
    Edition LREC 2018
  • Name T-Rex : A Large Scale Alignment of Natural Language with Knowledge Base Triples
    Resource type Corpus
    Size 4.4 GByte  
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Information Extraction, Information Retrieval
    License Creative Commons CC BY-SA 4.0
    Conditions of use Attribution Share-Alike
    Description Alignments between natural language and Knowledge Base (KB) triples are an essential prerequisite for training machine learning approaches employed in a variety of Natural Language Processing problems. These include Relation Extraction, KB Population, Question Answering and Natural Language Generation from KB triples. Available datasets that provide those alignments are plagued by significant shortcomings: they are of limited size, they exhibit a restricted predicate coverage, and/or they are of unreported quality. To alleviate these shortcomings, we present T-REx, a dataset of large scale alignments between Wikipedia abstracts and Wikidata triples. T-REx consists of 11 million triples aligned with 3.09 million Wikipedia abstracts (6.2 million sentences). T-REx is two orders of magnitude larger than the largest available alignments dataset and covers 2.5 times more predicates. Additionally, we stress the quality of this language resource thanks to an extensive crowdsourcing evaluation. T-REx is publicly available at https://w3id.org/t-rex. A loading sketch appears after this list.
    Download from http://w3id.org/t-rex
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/632.html
    Edition LREC 2018
  • Name TAP-DLND 1.0 : A Corpus for Document Level Novelty Detection
    Resource type Corpus
    Size 17.3 MByte  
    Languages English (eng)
    Production status Newly created-in progress
    Resource usage Document Classification, Text categorisation
    License Creative Commons CC BY-NC-SA 4.0
    Conditions of use Attribution Non-Commercial Share-Alike
    Description TAP-DLND 1.0 is a document-level annotated corpus for novelty detection. The corpus was created via event-specific topical crawling of news reports from the web as they develop over time; we view novelty as an ordered update over existing knowledge. We collected documents for 223 events from 10 different domains. For each event we fixed 3 initial documents as the source and asked the annotators to label the remaining documents for that event as novel or non-novel, based on information coverage and human judgment; ambiguous cases were left out. The result is a resource of 5,440 annotated documents that satisfies the criteria for novelty detection discussed in the original manuscript.
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/479_res_1.tgz [5,67 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/479.html
    Edition LREC 2018
  • Name Test set for Chinese nonlocal dependencies
    Resource type Corpus
    Size 1 MByte  
    Languages Mandarin Chinese (cmn)
    Production status Newly created-in progress
    Resource usage Emotion Recognition/Generation
    License GNU GPL 3.0
    Conditions of use Preservation of Copyright Notice Share-Alike
    Description It contains test data for eight nonlocal dependency constructions in Mandarin Chinese. Each test set contains around 100 sentences, except for the set of extractions from embedded clauses, which occur rarely in the data.
    Download from https://github.com/modelblocks/modelblocks-release
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/981.html
    Edition LREC 2018
  • Name The Epic Epigraph Graph
    Resource type Corpus
    Size 10000 epigraphs
    Languages English
    Production status Newly created-in progress
    Resource usage Distant Reading
    License Creative Commons CC BY 4.0
    Conditions of use Attribution
    Description This is a collection of literary epigraphs, showing each epigraph, its source, and where it is used. It is still a work in progress: it is already large enough to be interesting, but it needs some checking and normalization. We also intend to make it roughly twice as large.
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/610_res_1.tsv [2,77 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/610.html
    Edition LREC 2018
  • Name TSix
    Resource type Corpus
    Size 3.9 MByte  
    Languages English
    Production status Newly created-finished
    Resource usage Evaluation/Validation
    License MIT License
    Conditions of use <Not Specified>
    Description The dataset includes six events collected from Twitter from October 10 to November 9, 2016. The gold-standard references were created by humans, which allows extractive methods to be evaluated correctly.
    Download from http://lrec2018.lrec-conf.org/sharedlrs2018/516_res_1.zip [3,74 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/516.html
    Edition LREC 2018
  • Name UDLexicons
    Resource type Lexicon
    Size <Not Specified>
    Languages <Not Specified>
    Production status Newly created-in progress
    Resource usage Parsing and Tagging
    License Open Source licences (depends on the lexicon)
    Conditions of use Depends on the lexicon
    Description Multilingual collection of morphological lexicons that follow the guidelines and format of the Universal Dependencies initiative, created based on existing freely available resources.
    Download from http://pauillac.inria.fr/~sagot/udlexicons.html
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/705.html
    Edition LREC 2018
  • Name ViDA
    Resource type Corpus
    Size 5065 functional segments
    Languages Vietnamese (vie)
    Production status Newly created-finished
    Resource usage Conversation mining, detection of emotion/sentiment in conversation, automatic dialect/accent detection
    License <Not Specified>
    Conditions of use <Not Specified>
    Description This corpus is a new language resource for Vietnamese consisting of dialogues with dialog act annotation according to the ISO 24617-2 (2012) standard, emotion tagging at the functional-segment level according to Ekman's (1972) list of basic emotions, and sentiment annotation. We use spoken text from the IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7 for annotation. The pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program and contains approximately 201 hours of Vietnamese conversational and scripted telephone speech collected in 2012, along with corresponding transcripts.
    Download from https://drive.google.com/file/d/0B6xRTY1wmqt8UWxwcDBkOXVpalk/view
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/942.html
    Edition LREC 2018
  • Name Vision-grounded dataset of human ratings
    Resource type Corpus
    Size 52 KByte  
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Thematic role filling
    License <Not Specified>
    Conditions of use <Not Specified>
    Description This dataset consists of a csv file with 4 columns:
    - an "id" column (integer) containing the numbering of the ratings from 1 to 2000,
    - a "verb" column (string/character) containing the verbs,
    - a "+X8location" column (string/character) containing the locations,
    - an "avr_rating" column (float/numeric) containing the average ratings.
    Each average rating is the mean of 10-11 ratings from different workers on Amazon MTurk, obtained as part of an online experiment. Basic statistics of the collected data: min 0.000, 1st quartile 2.800, median 4.200, mean 4.096, 3rd quartile 5.400, max 7.000; standard deviation 1.55390718673. For other details see the paper. A loading sketch appears after this list.
    Download from http://datasets.d2.mpi-inf.mpg.de/arohrbach/datasetV1.csv
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/1089.html
    Edition LREC 2018
  • Name WiFiNE
    Resource type Corpus
    Size 1.2 GByte  
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Named Entity Recognition
    License OpenSource
    Conditions of use <Not Specified>
    Description WiFiNE is an English corpus annotated with fine-grained entity types.
    Download from http://rali.iro.umontreal.ca/rali/en/wifiner-wikipedia-for-et
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/11.html
    Edition LREC 2018
  • Name Wikipedia discourse connectives
    Resource type Corpus
    Size 351 MByte  
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Discourse
    License Creative Commons CC BY-SA 3.0 US
    Conditions of use Attribution Share-Alike
    Description 2.9 million pairs of adjacent sentences extracted from the English Wikipedia on September 5, 2016, including the discourse connective at the beginning of the second sentence, if any, i.e., the "gold" connective. A connective-extraction sketch appears after this list.
    Download from https://github.com/ekQ/discourse-connectives
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/203.html
    Edition LREC 2018
  • Name Wikipedia Embedding
    Resource type Lexicon
    Size 2000000 words  
    Languages Chinese
    Production status Newly created-finished
    Resource usage Word Sense Disambiguation
    License OpenSource
    Conditions of use Attribution
    Description Chinese Wikipedia Title Embedding
    Download from http://ckipsvr.iis.sinica.edu.tw/cwemb/reg.php
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/159.html
    Edition LREC 2018
  • Name WORD
    Resource type Evaluation Data
    Size 19276 concept pairs
    Languages <Not Specified>
    Production status Newly created-in progress
    Resource usage Evaluation/Validation
    License Creative Commons CC BY-SA 3.0
    Conditions of use Attribution Share-Alike
    Description A set of 19,276 Wikipedia concepts with their human annotated relatedness level.
    Download from http://www.research.ibm.com/haifa/dept/vst/debating_data.shtml
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/445.html
    Edition LREC 2018
  • Name Word Importance Annotations
    Resource type Corpus
    Size 1.9 MByte  
    Languages English (eng)
    Production status Newly created-in progress
    Resource usage Evaluation/Validation
    License Creative Commons CC BY-NC-SA 4.0
    Conditions of use <Not Specified>
    Description The Switchboard Corpus consists of audio recordings of approximately 260 hours of speech: about 2,400 two-sided telephone conversations among 543 speakers (302 male, 241 female) from across the United States. In January 2003, the Institute for Signal and Information Processing (ISIP) released written transcripts for the entire corpus, comprising nearly 400,000 conversational turns. The ISIP transcripts include a complete lexicon list and automatic word-alignment timings corresponding to the original audio files. In this project, a pair of annotators assigned word-importance scores to these transcripts. As of September 2017, they have annotated over 25,000 tokens, with an overlap of approximately 3,100 tokens. We release these annotations as a set of supplementary files aligned to the ISIP transcripts. Our annotation work continues, and we aim to annotate all of the Switchboard corpus with a larger group of annotators.
    Download from http://latlab.ist.rit.edu/lrec2018/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/187.html
    Edition LREC 2018
  • Name Word Segmentation Dataset
    Resource type Corpus
    Size 650 MByte  
    Languages Sanskrit
    Production status Existing-used
    Resource usage Word Segmentation
    License Creative Commons CC BY 4.0
    Conditions of use Attribution
    Description A dataset for Sanskrit word segmentation. English-language documentation is available at https://zenodo.org/record/803508#.WdJXkRdx3eR
    Download from https://zenodo.org/record/803508#.WdJXkRdx3eR
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/669.html
    Edition LREC 2018
  • Name WordNetGraph
    Resource type Lexicon
    Size 74 MByte  
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Textual Entailment and Paraphrasing
    License MIT License
    Conditions of use Preservation of Copyright Notice
    Description WordNetGraph is an RDF graph generated from WordNet, whose noun and verb definitions were labeled with Definition Semantic Roles (DSR). An inspection sketch appears after this list.
    Download from https://github.com/Lambda-3/WordnetGraph/blob/master/WN_DSR_model_XML.rdf
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/190.html
    Edition LREC 2018
  • Name WordsEye Evaluation Corpus: Imaginative and Realistic Sentences
    Resource type Corpus
    Size 459 sentences  
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Evaluation/Validation
    License Creative Commons CC BY-NC-SA 4.0
    Conditions of use <Not Specified>
    Description The resource contains 209 imaginative sentences and 250 realistic sentences. It also includes images generated by the WordsEye text-to-scene system, using each sentence as input. We used Amazon Mechanical Turk to obtain the imaginative sentences: Turkers were given short lists of words divided into several categories and were asked to write a short sentence using at least one word from each category. The words provided to the Turkers represent the objects, properties, and relations supported by the WordsEye text-to-scene system. The realistic sentences are a subset of the PASCAL image caption corpus (Rashtchian et al., 2010).
    Download from http://www.cs.columbia.edu/~coyne/wordseye-evaluation-corpus.html
    Referring paper http://www.lrec-conf.org/proceedings/lrec2018/summaries/115.html
    Edition LREC 2018
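
Usage Sketches for Selected Shared LRs

The short Python sketches below illustrate how some of the resources listed above might be loaded or evaluated. They are illustrative only: file names, column layouts, field names, and metrics that are not documented in the entries above are assumptions, not specifications.

Sample IS-pros Corpus: a minimal sketch that lists the annotation intervals in one of the corpus TextGrid files, assuming the third-party Python textgrid package (any Praat TextGrid reader would do; the file name is hypothetical).

    import textgrid  # third-party Praat TextGrid reader (assumed choice)

    # Load one annotated sentence and print its interval tiers.
    tg = textgrid.TextGrid.fromFile("sentence_001.TextGrid")  # hypothetical name
    for tier in tg.tiers:
        print("tier:", tier.name)
        # Only interval tiers carry annotated spans; point tiers have no .intervals.
        for interval in getattr(tier, "intervals", []):
            print(interval.minTime, interval.maxTime, interval.mark)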
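Sarcasm Target Dataset: one plausible way to score a target-identification system against the word-subset labels, treating the fall-back label 'Outside' as an empty target set. The Dice measure here is illustrative, not necessarily the metric used in the referring paper.

    def dice(gold_words, pred_words):
        """Dice overlap between gold and predicted sarcasm-target word sets."""
        gold, pred = set(gold_words), set(pred_words)
        if not gold and not pred:   # both sides say 'Outside'
            return 1.0
        if not gold or not pred:    # only one side says 'Outside'
            return 0.0
        return 2 * len(gold & pred) / (len(gold) + len(pred))

    print(dice(["author"], ["author", "book"]))  # 0.666...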
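SenSALDO: a minimal sketch of lexicon-based polarity scoring. The two-column word/score layout is assumed for illustration and is not necessarily SenSALDO's actual file format.

    def load_lexicon(path):
        """Read a word<TAB>score sentiment lexicon (layout assumed)."""
        lexicon = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, score = line.rstrip("\n").split("\t")
                lexicon[word] = float(score)  # score in [-1, +1]
        return lexicon

    def sentence_polarity(tokens, lexicon):
        """Average the scores of the tokens that are found in the lexicon."""
        scores = [lexicon[t] for t in tokens if t in lexicon]
        return sum(scores) / len(scores) if scores else 0.0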
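SMS Test Collection: TREC-style relevance judgments conventionally use a four-column layout (topic id, iteration, document id, relevance). A minimal loader under that assumption; the file name is hypothetical.

    from collections import defaultdict

    def load_qrels(path):
        """Return {topic_id: {doc_id: relevance}} from a TREC-style qrels file."""
        qrels = defaultdict(dict)
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.split()
                if len(parts) != 4:
                    continue  # skip blank or malformed lines
                topic_id, _iteration, doc_id, relevance = parts
                qrels[topic_id][doc_id] = int(relevance)
        return qrels

    qrels = load_qrels("qrels.sms.txt")  # hypothetical file name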
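STS.news.sr: gold scores are the average of five annotators' 0-5 ratings, and STS system output is conventionally evaluated with Pearson correlation. A toy end-to-end example:

    from statistics import mean
    from scipy.stats import pearsonr

    ratings = [[4, 5, 4, 4, 5], [1, 0, 1, 2, 1], [3, 3, 2, 3, 3]]  # toy annotator ratings
    gold = [mean(a) for a in ratings]   # -> [4.4, 1.0, 2.8]
    system = [4.1, 0.8, 2.6]            # hypothetical system similarity scores
    r, _p = pearsonr(gold, system)      # Pearson r between system and gold scores
    print(round(r, 3))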
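T-REx: a sketch of walking a JSON dump of abstract/triple alignments. The field names ("text", "triples", "surfaceform") are assumptions for illustration; consult https://w3id.org/t-rex for the actual schema.

    import json

    with open("t-rex-sample.json", encoding="utf-8") as f:  # hypothetical file
        documents = json.load(f)

    for doc in documents:
        abstract = doc["text"]              # assumed field name
        for triple in doc["triples"]:       # assumed field name
            print(triple["subject"]["surfaceform"],
                  triple["predicate"]["surfaceform"],
                  triple["object"]["surfaceform"])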
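Vision-grounded dataset of human ratings: the four columns are documented in the entry above, so the quoted summary statistics can be reproduced directly.

    import pandas as pd

    df = pd.read_csv("datasetV1.csv")   # columns: id, verb, +X8location, avr_rating
    print(df["avr_rating"].describe())  # count, mean, std, min, quartiles, max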
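Wikipedia discourse connectives: a sketch of how a "gold" connective can be read off a sentence pair, i.e. the connective (if any) that opens the second sentence. The connective inventory below is a toy; the dataset defines its own list.

    CONNECTIVES = ("however", "therefore", "moreover", "for example")  # toy inventory

    def gold_connective(second_sentence):
        """Return the connective opening the sentence, or None."""
        lowered = second_sentence.lower()
        for c in CONNECTIVES:
            if lowered.startswith(c + " ") or lowered.startswith(c + ","):
                return c
        return None

    print(gold_connective("However, the results were inconclusive."))  # however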
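WordNetGraph: the resource is a single RDF/XML file, so it can be inspected with rdflib; nothing is assumed here about its predicate vocabulary.

    from rdflib import Graph

    g = Graph()
    g.parse("WN_DSR_model_XML.rdf", format="xml")  # file from the repository
    print(len(g), "triples")
    for s, p, o in list(g)[:5]:                    # peek at a few triples
        print(s, p, o)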
