French Wikipedia is a dataset used for pretraining the CamemBERT French language model. It uses the official 2019 French Wikipedia dumps
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
The Corpus de Français Parlé Parisien (CFPP2000) is a set of non-directive interviews about the neighborhoods of Paris and its inner suburbs. The interviews, transcribed orthographically and aligned at the level of speaking turns, are available online; they may be used freely provided that any resulting work cites, in its bibliography, both the site address http://cfpp2000.univ-paris3.fr/ and the following presentation document: Branca-Rosoff S., Fleury S., Lefeuvre F., Pires M., 2012, "Discours sur la ville. Présentation du Corpus de Français Parlé Parisien des années 2000 (CFPP2000)". As of February 2013, the corpus contained roughly 550,000 words. A number of online tools, notably a concordancer and textometric tools, support lexical and grammatical queries. CFPP2000 is particularly suited to analyses of spoken French. The project underlying the corpus is, moreover, the study of the changes and variations occurring in what can be regarded as a vehicular Parisian French, in tension between the standard pole and the vernacular pole. The corpus also covers diverse linguistic activities (neighborhood descriptions, anecdotes, argumentation, etc.), so one can study the syntax specific to these different uses of language. Finally, it makes it possible to contrast dialogues (between interviewer and interviewee) and multilogues (where the presence of several interviewees encourages a shift to a familiar register). CFPP2000 consists of long interviews (one hour on average) transcribed in full. It can therefore be used to examine the idiosyncrasies of a given speaker's idiolect, as opposed to variants spread across larger groups (neighborhoods, socio-cultural groups, age brackets, etc.). Lastly, the corpus is a set of interesting testimonies about representations of Paris and its inner suburbs, likely to interest discourse analysts, sociologists, or simply anyone curious about the city. Corpus de Français Parlé Parisien des années 2000 (Corpus of Parisian Spoken French of the 2000s).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We are going to learn how to introduce ourselves in French: greeting each other, asking for and giving one's name, asking and stating one's age, asking and stating one's nationality and where one lives, asking and saying what one does, expressing what one likes, and taking leave. All in a very simple and easy way. Off to Paris... we are ready.
Characterization of the French Lexical Network (RL-fr). The Réseau Lexical du Français (RL-fr) is a formal model of the lexicon of contemporary French, currently under construction.
This statistic shows the share of households with Internet access in France from 2006 to 2019 and in 2021. The household Internet penetration rate in France passed 80% in 2012. In 2019, 90% of French households had Internet access. Internet penetration varies by age: in 2016, 92% of 18-24-year-olds reported being Internet users, compared with only 56% of people aged 70 and over.
FFR Dataset is an ongoing project to collect, clean, and store corpora of Fon and French sentences for Fon-French machine translation. Fon (also called Fongbe) is an African indigenous language spoken mostly in Benin by about 1.7 million people. Since training data is crucial to the high performance of a machine learning model, the aim of the project is to compile the largest set of training corpora for the research and design of translation and NLP models involving Fon. There are 117,029 parallel Fon-French sentences at the moment.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Enhance your Conversational AI model with our Off-the-Shelf Canadian French Language Datasets. Shaip's high-quality audio datasets are a quick and effective solution for model training.
🇫🇷 French Public Domain Newspapers 🇫🇷
French-Public Domain-Newspapers or French-PD-Newpapers is a large collection aiming to aggregate all the French newspapers and periodicals in the public domain. The collection was originally compiled by Pierre-Carl Langlais, on the basis of a large corpus curated by Benoît de Courson and Benjamin Azoulay for Gallicagram and in cooperation with OpenLLMFrance. Gallicagram is a leading cultural analytics project giving access to word and… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/French-PD-Newspapers.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Sport in Pas-de-Calais. This dataset is in French.
Dataset Card for French book reviews
I-Dataset Summary
The majority of review datasets are in English. There are datasets in other languages, but not many. Through this work, I would like to enrich the datasets available in French (my mother tongue, along with Arabic). The data was retrieved from two French websites: Babelio and Critiques Libres. Like Wikipedia, these two French sites are made possible by the contributions of volunteers who use the Internet to share their… See the full description on the dataset page: https://huggingface.co/datasets/Abirate/french_book_reviews.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Explore "French now! : a level one worktext = Le français actuel! : premier programm.." through unique data from multiple sources: key facts, real-time news, interactive charts, detailed maps & open datasets.
https://www.futurebeeai.com/data-license-agreement
Welcome to the French Language General Conversation Speech Dataset, a comprehensive and diverse collection of voice data specifically curated to advance the development of French language speech recognition models, with a particular focus on the accents and dialects of France.
With high-quality audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and Generative Voice AI algorithms. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances found in the French language spoken in France.
Speech Data:
This training dataset comprises 50 hours of audio recordings covering a wide range of topics and scenarios, ensuring robustness and accuracy in speech technology applications. To achieve this, we collaborated with a diverse network of 70 native French speakers from different regions of France. This collaborative effort guarantees a balanced representation of French accents, dialects, and demographics, reducing biases and promoting inclusivity.
Each audio recording captures the essence of spontaneous, unscripted conversations between two individuals, with an average duration ranging from 15 to 60 minutes. The speech data is available in WAV format, with stereo channel files having a bit depth of 16 bits and a sample rate of 8 kHz. The recording environment is generally quiet, without background noise and echo.
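As a quick sanity check, the stated audio format can be verified with Python's standard library. This is a minimal sketch, assuming a placeholder file name; it is not part of the dataset's tooling.

```python
import wave

# Minimal sketch: verify that a recording matches the stated format
# (stereo, 16-bit, 8 kHz WAV). "sample.wav" is a placeholder file name.
with wave.open("sample.wav", "rb") as wav:
    assert wav.getnchannels() == 2      # stereo
    assert wav.getsampwidth() == 2      # 16-bit samples (2 bytes)
    assert wav.getframerate() == 8000   # 8 kHz sample rate
    minutes = wav.getnframes() / wav.getframerate() / 60
    print(f"Duration: {minutes:.1f} minutes")  # expected range: 15-60
```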
Metadata:
In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This metadata includes the participant's age, gender, country, state, and dialect. Furthermore, additional metadata such as recording device details, topic of recording, bit depth, and sample rate will be provided.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of French language speech recognition models.
Transcription:
This dataset provides a manual verbatim transcription of each audio file to enhance your workflow efficiency. The transcriptions are available in JSON format. The transcriptions capture speaker-wise transcription with time-coded segmentation along with non-speech labels and tags.
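The exact JSON schema is not documented here, so the sketch below assumes illustrative field names (segments, speaker, start, end, text) purely to show how speaker-wise, time-coded segments might be iterated:

```python
import json

# Hedged sketch: the field names below are assumptions, not the documented
# schema. Non-speech events are assumed to appear as bracketed tags in the
# text, e.g. "[noise]" or "[laughter]".
with open("transcription.json", encoding="utf-8") as f:
    doc = json.load(f)

for seg in doc.get("segments", []):
    start, end = seg["start"], seg["end"]      # time-coded segmentation
    print(f'{seg["speaker"]} [{start:.2f}-{end:.2f}s]: {seg["text"]}')
```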
Our goal is to expedite the deployment of French language conversational AI and NLP models by offering ready-to-use transcriptions, ultimately saving valuable time and resources in the development process.
Updates and Customization:
We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our voice dataset is regularly updated with new audio data captured in diverse real-world conditions.
If you require a custom training dataset with specific environmental conditions such as in-car, busy street, restaurant, or any other scenario, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8kHz to 48kHz, allowing you to fine-tune your models for different audio recording setups. Additionally, we can also customize the transcription following your specific guidelines and requirements, to further support your ASR development process.
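For reference, converting between sample rates on your own side is also straightforward; here is a minimal sketch using the librosa and soundfile libraries, with placeholder file names:

```python
import librosa
import soundfile as sf

# Hedged sketch: upsample an 8 kHz recording to 16 kHz for a model that
# expects wide-band input. File names are placeholders.
y, sr = librosa.load("sample_8khz.wav", sr=None)          # keep native rate
y_16k = librosa.resample(y, orig_sr=sr, target_sr=16000)  # resample
sf.write("sample_16khz.wav", y_16k, 16000)
```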
License:
This audio dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion:
Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, exploring generative voice AI, or building cutting-edge voice assistants and bots, our dataset serves as a reliable and valuable resource.
https://www.futurebeeai.com/data-license-agreement
Welcome to the French Language Call Center Speech Dataset for the Real Estate domain. It is a specialized and comprehensive collection of voice data designed to enhance the development of call center speech recognition models specifically for the Real Estate industry.
With high-quality call center audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and generative voice AI algorithms in the Real Estate domain. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances found in the French language spoken in France.
Speech Data:
This training dataset comprises 30 hours of call center audio recordings covering various topics and scenarios related to the Real Estate domain, to build robust and accurate customer service speech technology.
To curate realistic call center interactions, we collaborated with a diverse network of 60 expert native French speakers from different regions of France. This collaborative effort ensures a balanced representation of French accents, dialects, and demographics, promoting inclusivity and reducing biases in the dataset.
Each audio recording captures the essence of unscripted and spontaneous conversations between call center agents and customers, with an average duration ranging from 5 to 15 minutes per call. The dataset includes both inbound and outbound calls, covering scenarios such as inquiries, promotional offers, complaints, technical support, and more. Additionally, the dataset contains call center conversations with both positive and negative outcomes, providing a diverse and realistic dataset.
The speech data is available in WAV format with stereo channels, a bit depth of 16 bits, and a sample rate of 8 kHz, ensuring high-quality audio for accurate analysis. The recording environment is generally quiet, without background noise and echo.
Metadata:
In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This includes the participant’s age, gender, country, state, and dialect. Additionally, it includes metadata like domain, topic, call type, outcome, bit depth, and sample rate for each conversation.
The metadata serves as a powerful tool for understanding and characterizing the data, enabling informed decision-making in the development of French language call center speech recognition models for the Real Estate domain.
Transcription:
To facilitate your workflow, the dataset includes manual verbatim transcriptions of each call center audio file in JSON format. The transcriptions capture speaker-wise transcription with time-coded segmentation along with non-speech labels and tags, covering both the agent and customer conversations.
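As a hedged sketch of working with such transcripts, the snippet below splits segments into agent and customer turns; the field names and speaker labels are assumptions for illustration, not the documented schema:

```python
import json
from collections import defaultdict

# Hedged sketch: group time-coded segments by speaker. The keys "segments",
# "speaker", and "text", and the labels "agent"/"customer", are assumed.
with open("call_transcription.json", encoding="utf-8") as f:
    doc = json.load(f)

turns = defaultdict(list)
for seg in doc.get("segments", []):
    turns[seg["speaker"]].append(seg["text"])

print("Agent turns:   ", len(turns.get("agent", [])))
print("Customer turns:", len(turns.get("customer", [])))
```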
These ready-to-use transcriptions accelerate the development of Real Estate call center conversational AI and ASR models for the French language.
Updates and Customization:
We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our call center voice dataset is regularly updated with new audio data captured in diverse real-world conditions.
If you require a custom training dataset with specific environmental conditions, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8kHz to 48kHz, allowing you to fine-tune your models for different audio recording setups. Additionally, we can also customize the transcription following your specific guidelines and requirements, to further support your ASR development process.
License:
This Real Estate call center audio dataset was created by FutureBeeAI and is available for commercial use.
Conclusion:
Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, or building state-of-the-art voice assistants to improve customer experiences in the Real Estate sector, our dataset serves as a trusted resource to meet your goals.
This dataset contains different measures of plosives produced by 16 Norwegian learners of French as a third language during a reading task and a repetition task. The data are extracted from two corpora collected within the framework of the IPFC project (Interphonologie du français contemporain): the Tromsø corpus with high school students, and the Oslo corpus with university students enrolled in a first year course on French phonetics and phonology. The dataset contains four files: A readme file, the word list used during the reading and repetition tasks, a data file containing all measures, and a text file presenting average values and VOT ranges for the individual informants.
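A minimal sketch of how the measures file might be summarized per informant with pandas; the file name and column names are assumptions, so check the dataset's readme for the actual layout:

```python
import pandas as pd

# Hedged sketch: compute per-informant mean and range of voice onset time.
# "plosive_measures.csv", "informant", and "vot_ms" are assumed names.
df = pd.read_csv("plosive_measures.csv")
summary = df.groupby("informant")["vot_ms"].agg(["mean", "min", "max"])
print(summary)  # average values and VOT ranges, as in the dataset's text file
```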
This statistic illustrates the distribution of blood groups in the French population according to the ABO system. It shows that fewer than 5% of French people have blood group AB.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the population of French Settlement by race. It includes the population of French Settlement across racial categories (excluding ethnicity) as identified by the Census Bureau. The dataset can be utilized to understand the population distribution of French Settlement across relevant racial categories.
Key observations
The percent distribution of the French Settlement population by race (across all racial categories recognized by the U.S. Census Bureau): 98.40% are white, 1.03% are Native Hawaiian and other Pacific Islander, and 0.57% are multiracial.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability, and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is part of the main dataset for French Settlement Population by Race & Ethnicity.
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The PAROLE French corpus contains the following data:

Miscellaneous (data provided by ELRA: CRATER, MLCC Multilingual and Parallel Corpora): 2,025,964 words
Books (CNRS Editions): 3,267,409 words
Periodicals (CNRS Info, Hermès): 942,963 words
Newspapers (Le Monde, provided by ELRA): 13,856,763 words
Total: 20,093,099 words

1. Newspapers: 14 million words were extracted from complete issues of the years 1987, 1989, 1991, 1993, and 1995 of the Le Monde newspaper. 241,484 words, from 7 issues of Le Monde of September 1987, were extracted and POS-tagged automatically. Each article consists of a complete item, with a header, following the directives of the TEI (Text Encoding Initiative). Le Monde's original markup was converted into classification features, so that articles on different topics can be extracted.

2. Periodicals:
HERMES: Issues 15 to 22 were used (134 articles, one Word file per article). The data were converted from Word to RTF (Rich Text Format) and then, via a translator, from RTF to HTML. The conversion from HTML to the PAROLE format was done with flex programs. The result for each article is one "header" file, which contains information on the author and the article id, and one "body" file, which contains the article itself. A Perl script creates the final file from both "header" and "body".
CNRS-Infos: The data come from the CNRS-Infos Web site (http://www.cnrs.fr/Cnrspresse/cnrsinfo.html). Each file was processed as follows: cleaning the HTML header, extracting a summary, cleaning HTML markup, translating to the PAROLE format, and creating the "header" and "body" files (see Hermès). As with the Hermès files, a Perl script creates the final file from both "header" and "body".

3. Books: All books were provided on CD-ROM as XPress files, each book having its own structure. Therefore, each book was handled separately. XPress allows conversion to a format called "XPress markup". This format makes it possible to identify the different structures of the book (if the XPress file has been laid out well, which is not always the case). The structure of each book had to be worked out in order to create the Perl script that translates it to the PAROLE format. Conformance to the PAROLE format was verified with the "nsgmls" tool. Errors found during verification were corrected manually.

Introduction to the PAROLE project: The LE-PAROLE project (MLAP/LE2-4017) aims to offer a large-scale harmonised set of "core" corpora and lexica for all European Union languages. Language corpora and lexica were built according to the same design and composition principles, in the period 1996-1998. PAROLE corpora: harmonisation with respect to corpus composition (selection of corpus texts) was to be achieved by the obligatory application of common parameters for time of production and classification according to publication medium. No texts older than 1970 were allowed. As for publication medium, the corpus had to include specif...
The AsCoPain-T terminology lists and defines the main quality assessment descriptors used by bread-making professionals and those used in scientific studies. It aims to link professional and scientific terminologies, since language can vary greatly when denoting similar processing operations and observations made on the dough and bread. The AsCoPain-T terminology is the result of successive efforts: expert knowledge collected by the INRA AsCoPain project first resulted in the definition of the relations between bread-making control variables and the different states of the dough and bread. These relations were implemented in the expert system used to predict the state of the dough throughout the kneading process. The terminology collected during this work was then published as a glossary in a document referenced by the FAO (see https://hal.inrae.fr/hal-02823534). The recent adaptation of its content to semantic web standards now allows unforeseen uses, in particular by applications. The terminology is available in French; its translation into English will soon be under way.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Wikipedia is a free encyclopedia where everyone can contribute and modify, delete, or add text to the articles. Because of this, new text is created every day and, most importantly, new corrections are made to preexisting sentences. The idea is to find these corrections and create a dataset of (X, y) sentence pairs.
This dataset was created using the Wikipedia edit histories thanks to the Wikipedia dumps available here: https://dumps.wikimedia.org/frwiki/
The dataset is composed of 45 million (X, y) sentence pairs extracted from almost the entire French Wikipedia. There are five columns in the CSV files: X (the source sentence), y (the target sentence), title (the title of the article from which the sentence came), timestamps (the two dates when the source sentence and target sentence were created), and comments (the comment of the edit, if specified).
There is one major issue with this dataset if it is used for the GEC task: a large share of the extracted sentence pairs fall outside the scope of GEC (grammar, typos, and syntax). Many corrections made on Wikipedia are reformulations, condensations, or clarifications; training a model on this dataset therefore yields a model that reformulates and deletes parts of the sentences it is supposed to correct. To address this, I suggest training a classification model using transfer learning to filter out the "bad" sentence pairs.
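Until such a classifier exists, a simple character-overlap heuristic can already discard heavy rewrites. A hedged sketch (the CSV file name is a placeholder; the X and y columns follow the description above):

```python
import pandas as pd
from difflib import SequenceMatcher

# Hedged sketch: keep only pairs where the target stays close to the
# source (typo/grammar-scale edits) and drop reformulations or deletions.
# A learned classifier, as suggested above, would replace this threshold.
df = pd.read_csv("frwiki_edit_pairs.csv")  # placeholder file name

def is_minor_edit(src: str, tgt: str, min_ratio: float = 0.85) -> bool:
    # ratio() approaches 1.0 for near-identical strings
    return SequenceMatcher(None, src, tgt).ratio() >= min_ratio

mask = [is_minor_edit(str(x), str(y)) for x, y in zip(df["X"], df["y"])]
gec_pairs = df[mask]
print(f"Kept {len(gec_pairs)} of {len(df)} pairs")
```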
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
French monolingual legal corpus from the Official Journal of France, as collected from the https://www.legifrance.gouv.fr/ website.
This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) actions SMART 2014/1074 and SMART 2015/1091. For further information on the project: http://lr-coordination.eu.