Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
EURLEX57K contains 57k legislative documents in English from the EUR-Lex portal, annotated with EUROVOC concepts.
The EUR-Lex text collection is a collection of documents about European Union law. It contains many different types of documents, including treaties, legislation, case-law and legislative proposals, which are indexed according to several orthogonal categorization schemes to allow for multiple search facilities. The most important categorization is provided by the EUROVOC descriptors, which form a topic hierarchy with almost 4000 categories regarding different aspects of European law. This document collection provides an excellent opportunity to study text classification techniques for several reasons: it contains multiple classifications of the same documents, making it possible to analyze the effects of different classification properties using the same underlying reference data without resorting to artificial or manipulated classifications,
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The CEPS EurLex dataset
The dataset contains 142,036 EU laws, almost the entire corpus of the EU's digitally available legal acts passed between 1952 and 2019. It encompasses the three types of legally binding acts passed by the EU institutions: 102,304 regulations, 4,070 directives and 35,798 decisions, all in the English language. The dataset was scraped from the official EU legal database (Eur-lex.eu) and transformed into machine-readable CSV format with the programming languages R and Python. The dataset was collected by the Centre for European Policy Studies (CEPS) for the TRIGGER project (https://trigger-project.eu/). We hope that it will facilitate future quantitative and computational research on the EU.
Brief description:
- The dataset is organised in tabular format, with each law representing one row and the columns representing 23 variables.
- The full text of 134,633 laws is included (column "act_raw_text"). For newer laws, the text was scraped from Eur-lex.eu via the HTML pages, while for older laws, the text was extracted from (scanned) PDF documents (if available in English).
- 22 additional variables are included, such as 'Act_name', 'Act_type', 'Subject_matter', 'Authors', 'Date_document', 'ELI_link', 'CELEX' (a unique identifier for every law). Please see the "CEPS_EurLex_codebook.pdf" file for an explanation of all variables.
- Given its size, the dataset was uploaded in different batches to facilitate usage. Some Excel files are provided for non-technical users. We recommend, however, the use of the CSV files, since Excel does not save large amounts of data properly. EurLex_all.csv is the master file containing all data.
Caveats:
- The Eur-lex.eu website does not consistently provide data for all the variables. In addition, the HTML documents were not always cleanly formatted and text extraction from scanned PDFs is not entirely clean. Some data points are therefore missing for some laws and some laws were excluded entirely.
- Not all (older) laws were available in English, especially since Ireland and the UK only joined the European Communities in 1973. Non-English laws are excluded from the dataset.
Other:
- For details on the types of EU legal acts: https://ec.europa.eu/info/law/law-making-process/types-eu-law_en
- An example of an experimental analysis with this dataset: https://trigger-project.eu/2019/10/28/a-data-science-approach-to-eu-differentiated-integration/
- The TRIGGER project is funded by the EU's Horizon 2020 programme, grant number 822735
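As a rough illustration of working with the CSV files, here is a minimal sketch using only Python's standard csv module. It assumes the column names quoted in the description ('Act_type', 'act_raw_text') appear verbatim as CSV headers; the codebook remains the authority on the exact names. The helper names read_laws, count_by_act_type and has_full_text are illustrative and not part of the dataset.

```python
import csv
import sys
from collections import Counter
from typing import Iterable, Iterator


def read_laws(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per law (the dataset stores one law per CSV row).

    `lines` can be an open text file or any iterable of CSV lines.
    """
    # Full-text cells can exceed the csv module's default field size limit,
    # so raise it to a value that is safe on all platforms.
    csv.field_size_limit(min(sys.maxsize, 2**31 - 1))
    yield from csv.DictReader(lines)


def count_by_act_type(rows: Iterable[dict], column: str = "Act_type") -> Counter:
    """Count laws per act type; blank or absent cells are counted as '<missing>'."""
    counts: Counter = Counter()
    for row in rows:
        value = (row.get(column) or "").strip()
        counts[value or "<missing>"] += 1
    return counts


def has_full_text(row: dict, column: str = "act_raw_text") -> bool:
    """Return True if the law has a non-empty full-text cell."""
    return bool((row.get(column) or "").strip())


# Possible usage (the UTF-8 encoding is an assumption):
#   with open("EurLex_all.csv", newline="", encoding="utf-8") as f:
#       print(count_by_act_type(read_laws(f)))
```

Treating blank cells as missing reflects the caveat that some data points are absent for some laws.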
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Super-EURLEX is a dataset of legal documents in multiple languages. The data was built by scraping the EURLEX website [https://eur-lex.europa.eu/homepage.html]. There is one split per language and sector, because the available features (metadata) differ between sectors. Each sample contains the content of a full legal document in up to 3 different formats: raw HTML and cleaned HTML (if the HTML format was available on the EURLEX website during the scraping process) and cleaned text. The cleaned text should be available for each sample and was extracted from HTML or PDF. 'Cleaned' HTML here means minor cleaning that largely preserves the necessary HTML information, such as table structures, while removing unnecessary complexity that was introduced into the original documents by actions like writing each sentence into a new object. Additionally, each sample contains metadata that was scraped on the fly, which implies two things. First, not every sector contains the same metadata. Second, most metadata might be irrelevant for most use cases. In our view the most interesting metadata are the celex-id, which identifies the legal document at hand and also contains a lot of information about the document (see [https://eur-lex.europa.eu/content/tools/eur-lex-celex-infographic-A3.pdf]), and the eurovoc-concepts, which are labels that describe the content of the documents. Eurovoc-concepts are, for example, only available for the sectors 1, 2, 3, 4, 5, 6, 9, C, and E. The naming of most metadata is kept as it was on the EURLEX website, except for converting it to lower case and replacing whitespace with '_'.
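The metadata naming rule described above (lower case, whitespace replaced with '_') can be sketched as follows. This is only an illustration of the stated rule, not the authors' code: trimming surrounding whitespace and replacing each whitespace character individually are assumptions, and the example labels in the comments are hypothetical.

```python
import re


def normalize_metadata_name(name: str) -> str:
    """Lower-case a metadata label and replace every whitespace character with '_'.

    Assumption: leading and trailing whitespace is trimmed first.
    """
    return re.sub(r"\s", "_", name.strip().lower())


def normalize_metadata(record: dict) -> dict:
    """Apply the naming rule to every key of a metadata record."""
    return {normalize_metadata_name(key): value for key, value in record.items()}


# Hypothetical labels, for illustration only:
#   normalize_metadata_name("Date of document")  gives "date_of_document"
#   normalize_metadata({"Legal basis": "x"})      gives {"legal_basis": "x"}
```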
EURLEX57K is a publicly available legal large-scale multi-label text classification (LMTC) dataset containing 57k English EU legislative documents from the EUR-LEX portal, tagged with ∼4.3k labels (concepts) from the European Vocabulary (EUROVOC).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
EU Legislation Documents and Metadata from 1971 to 2022 (English language)
This is the set of full text regulation, decision and directive documents in PDF and HTML format, in the English language, downloaded from EURLEX, together with metadata in CSV format about these documents. The documents were downloaded using this Python script, and the metadata was extracted from the CELLAR SPARQL endpoint using this Python script.
During the download process, HTML versions for the legislative documents were extracted if they were available. If there was no HTML version available for a particular document, the PDF version was downloaded (HTML versions were preferred because it is generally simpler to extract and process the text with software because of the added structure the format provides). If there was neither an HTML nor PDF version available, we made a note of the unique identifier (CELEX number) for those documents. The archive in this Zenodo repository which contains all the full text documents consists of three directories "htmls/", "pdfs/" and "problems/", which contain all the downloaded documents in that particular format. The "problems/" directory contains a list of blank .txt files where the name of each file is the CELEX number for a legislative document that was not available on EURLEX for download.
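The directory layout described above lends itself to a small lookup helper. The following is a sketch under stated assumptions: the three directory names come from the description, but the file naming inside "htmls/" and "pdfs/" ("<CELEX>.html", "<CELEX>.pdf") is an assumption, and the function names are hypothetical.

```python
from pathlib import Path
from typing import List, Optional


def unavailable_celex_numbers(archive_root: str) -> List[str]:
    """Return the CELEX numbers recorded as blank .txt files in problems/."""
    problems_dir = Path(archive_root) / "problems"
    # Each file name, minus the .txt suffix, is the CELEX number itself.
    return sorted(p.name[: -len(".txt")] for p in problems_dir.glob("*.txt"))


def locate_document(archive_root: str, celex: str) -> Optional[Path]:
    """Prefer the HTML version, fall back to the PDF, else return None.

    Assumption: files are named "<CELEX>.html" and "<CELEX>.pdf".
    """
    root = Path(archive_root)
    candidates = (
        root / "htmls" / f"{celex}.html",
        root / "pdfs" / f"{celex}.pdf",
    )
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

Checking HTML first mirrors the download preference stated in the description.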
For more information about the scripts and a description of the metadata extracted, please see this Github repository.
The data was extracted as part of the Nature of EU Rules project which seeks to analyse the "strictness" and density of EU regulations over time and by legal policy area.
http://data.europa.eu/eli/dec/2011/833/oj
The European Legislation Identifier (ELI) is a system to make legislation available online in a standardised format, so that it can be accessed, exchanged and reused across borders. This initiative, taken jointly by EU countries and institutions, is enshrined in the Council Conclusions of 6 November 2017 on the European Legislation Identifier (2017/C 441/05).
ELI is based on a voluntary agreement between the EU countries. It includes technical specifications on:
ELI is funded under Action "Facilitating the exchange of legislation data in Europe", which falls under the Interoperability Solutions for European Public Administrations (ISA²) Programme.
Please find more information here: https://eur-lex.europa.eu/eli-register/about.html
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for EurlexResources: A Corpus Covering the Largest EURLEX Resources
Dataset Summary
This dataset contains large text resources (~179GB in total) from EURLEX that can be used for pretraining language models. Use the dataset like this:
    from datasets import load_dataset
    config = "de_caselaw"  # {lang}_{resource}
    dataset = load_dataset("joelito/eurlex_resources", config, split='train', streaming=True)
Supported Tasks and Leaderboards
The… See the full description on the dataset page: https://huggingface.co/datasets/joelniklaus/eurlex_resources.
EUR-Lex-Sum is a dataset for cross-lingual summarization. It is based on manually curated document summaries of legal acts from the European Union law platform. Documents and their respective summaries exist as cross-lingual paragraph-aligned data in several of the 24 official European languages, enabling access to various cross-lingual and lower-resourced summarization setups. The dataset contains up to 1,500 document/summary pairs per language, including a subset of 375 cross-lingually aligned legal acts with texts available in all 24 languages.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Bilingual (EN-DA) corpus acquired from the website (https://eur-lex.europa.eu/legal-content) of the EU portal (9th July 2020).
Contains 21238 translation units (DA-EN)
The EUR-Lex database is an online search tool created by the Publications Office of the European Union. The tool provides free access in the 24 official EU languages to all European Union legislation. The database covers many types of texts produced mostly by the institutions of the European Union, but also by Member States, EFTA, etc. The content is divided into sectors: treaties, international agreements, legislation, complementary legislation, preparatory acts, case-law, national implementing measures, references to national case-law concerning EU law, parliamentary questions, consolidated legislation, other documents published in the Official Journal C series, and EFTA documents. The main topics discussed are: - politics; - international relations; - European law; - law; - economics; - trade; - finance; - social questions; - education and communications; - science; - business and competition; - employment and working conditions; - transport; - environment; - agriculture, forestry and fisheries; - agri-foodstuffs; - production, technology and research; - energy; - industry; - geography; - international relationships.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset is published with:
MultiEURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer. Ilias Chalkidis, Manos Fergadiotis, and Ion Androutsopoulos. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. Punta Cana, Dominican Republic.
Documents: MultiEURLEX comprises 65k EU laws in 23 official EU languages. Each EU law has been annotated with EUROVOC concepts (labels) by the Publications Office of the EU. Each EUROVOC label ID is associated with a label descriptor, e.g., [60, 'agri-foodstuffs'], [6006, 'plant product'], [1115, 'fruit']. The descriptors are also available in 23 languages. Chalkidis et al. (2019) published a monolingual (English) version of this dataset, called EURLEX57K, comprising 57k EU laws with the originally assigned gold labels.
Languages: MultiEURLEX covers 23 languages from 7 families. EU laws are published in all official EU languages, except for Irish for resource-related reasons (Read more: https://europa.eu/european-union/about-eu/eu-languages_en). This wide coverage makes the dataset a valuable testbed for cross-lingual transfer. All languages use the Latin script, except for Bulgarian (Cyrillic script) and Greek.
Multi-granular Labeling: EUROVOC has eight levels of concepts. Each document is assigned one or more concepts (labels). If a document is assigned a concept, the ancestors and descendants of that concept are typically not assigned to the same document. The documents were originally annotated with concepts from levels 3 to 8. We created three alternative sets of labels per document, by replacing each assigned concept by its ancestor from levels 1, 2, or 3, respectively. Thus, we provide four sets of gold labels per document, one for each of the first three levels of the hierarchy, plus the original sparse label assignment.
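The replacement of each assigned concept by its ancestor at a coarser level can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes every concept has exactly one parent, and the `parent` and `level_of` mappings as well as the toy concept IDs are hypothetical.

```python
from typing import Dict, Iterable, List


def ancestor_at_level(
    concept: str, level: int, parent: Dict[str, str], level_of: Dict[str, int]
) -> str:
    """Walk up the hierarchy until the concept sits at the requested level.

    A concept already at (or above) the requested level is returned unchanged.
    """
    current = concept
    while level_of[current] > level:
        current = parent[current]
    return current


def coarsen_labels(
    labels: Iterable[str], level: int, parent: Dict[str, str], level_of: Dict[str, int]
) -> List[str]:
    """Replace each label by its ancestor at `level`, dropping duplicates."""
    return sorted({ancestor_at_level(c, level, parent, level_of) for c in labels})


# Toy hierarchy (hypothetical IDs): A > A1 > A1a > A1a-i and B > B1 > B1a
parent = {"A1": "A", "A1a": "A1", "A1a-i": "A1a", "B1": "B", "B1a": "B1"}
level_of = {"A": 1, "A1": 2, "A1a": 3, "A1a-i": 4, "B": 1, "B1": 2, "B1a": 3}

original = ["A1a-i", "B1a"]  # original sparse assignment (levels 3 to 8)
label_sets = {lvl: coarsen_labels(original, lvl, parent, level_of) for lvl in (1, 2, 3)}
```

Together with the original assignment, the three coarsened sets give the four label sets per document mentioned above.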
Supported Tasks: Similarly to EURLEX (Chalkidis et al., 2019), MultiEURLEX can be used for legal topic classification, a multi-label classification task where legal documents need to be assigned concepts (in our case, from EUROVOC) reflecting their topics. Unlike EURLEX57K, however, MultiEURLEX supports labels from three different granularities (EUROVOC levels). More importantly, apart from monolingual (one-to-one) experiments, it can be used to study cross-lingual transfer scenarios, including one-to-many (systems trained in one language and used in other languages with no training data), and many-to-one or many-to-many (systems jointly trained in multiple languages and used in one or more other languages).
Data Split and Concept Drift: MultiEURLEX is chronologically split in training (55k, 1958-2010), development (5k, 2010-2012), test (5k, 2012-2016) subsets, using the English documents. The test subset contains the same 5k documents in all 23 languages. The development subset also contains the same 5k documents in 23 languages, except Croatian. Croatia is the most recent EU member (2013); older laws are gradually translated. For the official languages of the seven oldest member countries, the same 55k training documents are available; for the other languages, only a subset of the 55k training documents is available. Compared to EURLEX57K (Chalkidis et al., 2019), MultiEURLEX is not only larger (8k more documents) and multilingual; it is also more challenging, as the chronological split leads to temporal real-world concept drift across the training, development, test subsets, i.e., differences in label distribution and phrasing, representing a realistic temporal generalization problem (Huang and Paul, 2019; Lazaridou et al., 2021). Recently, Søgaard et al. (2021) showed this setup is more realistic, as it does not overestimate real performance, contrary to random splits (Gorman and Bedrick, 2019).
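A chronological split of this kind can be sketched as follows. The description does not say how the boundaries were chosen, so this sketch assumes documents are sorted by date and then cut by subset size, which would be consistent with the overlapping year ranges quoted above; the function name and the "date" field are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def chronological_split(
    documents: Sequence[Dict], n_dev: int, n_test: int, date_key: str = "date"
) -> Tuple[List[Dict], List[Dict], List[Dict]]:
    """Sort by date; the oldest documents form the training set, the newest the test set.

    Dates must be mutually comparable, e.g. ISO 8601 strings or datetime.date objects.
    """
    if n_dev < 0 or n_test < 0 or n_dev + n_test > len(documents):
        raise ValueError("dev and test sizes must fit inside the collection")
    ordered = sorted(documents, key=lambda doc: doc[date_key])
    n_train = len(ordered) - n_dev - n_test
    train = ordered[:n_train]
    dev = ordered[n_train:n_train + n_dev]
    test = ordered[n_train + n_dev:]
    return train, dev, test
```

Unlike a random split, every test document here is at least as recent as every training document, which is what exposes a model to the temporal drift discussed above.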
EUR-Lex gives access to EU Law, the jurisprudence of the EU Court of Justice, other EU public documents and the electronic edition of the Official Journal of the EU, in 24 languages.
MultiEURLEX is a multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC taxonomy. The dataset covers 23 official EU languages from 7 language families.
http://data.europa.eu/eli/dec/2011/833/oj
This dataset provides statistics on the EUR-Lex website from two views: type of content and number of legal acts available. It is updated on a daily basis.
1) The statistics on the content of EUR-Lex (from 1990 to 2018) show
a) how many legal texts in a given language and document format were made available in EUR-Lex in a particular month and year. They include:
Since the eighth parliamentary term, parliamentary questions are no longer included.
b) bibliographical notices by sector (e.g. case-law, treaties).
2) The statistics on legal acts (from 1990 to 2018) provide yearly and monthly figures on the number of adopted acts (also by author and by type) as well as those repealed and expired in a given month.
http://data.europa.eu/eli/dec/2011/833/oj
The usage statistics provide yearly and monthly figures on number of visits, number of visitors, number of pages viewed and number of documents consulted.
Overview:
EUR-Lex provides direct free access to European Union law. Here you can consult the Official Journal of the European Union as well as the treaties, legislation, case-law and legislative proposals.
Collections include:
Free to be re-used as long as attribution is given. The Copyright notice on the European Commission website states:
Reproduction is authorised, provided the source is acknowledged, save where otherwise stated.
Where prior permission must be obtained for the reproduction or use of textual and multimedia information (sound, images, software, etc.), such permission shall cancel the above-mentioned general permission and shall clearly indicate any restrictions on use.
There's also a specific legal notice from the Publications Office.
http://data.europa.eu/eli/dec/2011/833/oj
The usage statistics provide yearly and monthly figures on the number of visits, number of visitors, number of pages viewed and number of documents consulted.