French Wikipedia is a dataset used for pretraining the CamemBERT French language model. It uses the official 2019 French Wikipedia dumps
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
The Corpus de Français Parlé Parisien (CFPP2000) is a set of non-directive interviews about the neighborhoods of Paris and its inner suburbs. The interviews, transcribed orthographically and aligned at the level of speaking turns, are available online; they may be used freely provided that any resulting work cites, in its bibliography, both the site address http://cfpp2000.univ-paris3.fr/ and the following presentation document: Branca-Rosoff S., Fleury S., Lefeuvre F., Pires M., 2012, "Discours sur la ville. Présentation du Corpus de Français Parlé Parisien des années 2000 (CFPP2000)". As of February 2013, the corpus contained roughly 550,000 words. A number of online tools, notably a concordancer and textometric tools, support lexical and grammatical queries. CFPP2000 is particularly suited to analyses of spoken French. The project underlying the corpus is, moreover, the study of the changes and variations occurring in what can be regarded as a vehicular Parisian French, in tension between the standard pole and the vernacular pole. The corpus also covers diverse linguistic activities (neighborhood descriptions, anecdotes, argumentation, etc.), so one can study the syntax specific to these different uses of language. Finally, it makes it possible to contrast dialogues (between interviewer and interviewee) and multilogues (where the presence of several interviewees encourages a shift to a familiar register). CFPP2000 consists of long interviews (one hour on average) transcribed in full. It can therefore be used to examine the idiosyncrasies of a given speaker's idiolect, as opposed to variants spread across larger groups (neighborhoods, socio-cultural groups, age brackets, etc.). Lastly, the corpus is a set of interesting testimonies about representations of Paris and its inner suburbs, likely to interest discourse analysts, sociologists, or simply anyone curious about the city. Corpus de Français Parlé Parisien des années 2000 (Corpus of Parisian Spoken French of the 2000s).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We are going to learn how to introduce ourselves in French: greeting each other, asking for and giving one's name, asking and stating one's age, asking and stating one's nationality and where one lives, asking and saying what one does, expressing what one likes, and taking leave. All in a very simple and easy way. Off to Paris... we are ready.
Characterization of the French Lexical Network (RL-fr). The Réseau Lexical du Français (RL-fr) is a formal model of the lexicon of contemporary French, currently under construction.
This statistic shows the share of households with Internet access in France from 2006 to 2019 and in 2021. The household Internet penetration rate in France passed 80% in 2012. In 2019, 90% of French households had Internet access. Internet penetration varies by age: in 2016, 92% of 18-24-year-olds reported being Internet users, compared with only 56% of people aged 70 and over.
FFR Dataset is an ongoing project to collect, clean, and store corpora of Fon and French sentences for Fon-French machine translation. Fon (also called Fongbe) is an African indigenous language spoken mostly in Benin by about 1.7 million people. Since training data is crucial to the high performance of a machine learning model, the aim of the project is to compile the largest set of training corpora for the research and design of translation and NLP models involving Fon. There are 117,029 parallel Fon-French sentences at the moment.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Enhance your Conversational AI model with our Off-the-Shelf Canadian French Language Datasets. Shaip's high-quality audio datasets are a quick and effective solution for model training.
🇫🇷 French Public Domain Newspapers 🇫🇷
French-Public Domain-Newspapers or French-PD-Newpapers is a large collection aiming to aggregate all the French newspapers and periodicals in the public domain. The collection was originally compiled by Pierre-Carl Langlais, on the basis of a large corpus curated by Benoît de Courson and Benjamin Azoulay for Gallicagram and in cooperation with OpenLLMFrance. Gallicagram is a leading cultural analytics project giving access to word and… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/French-PD-Newspapers.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Sport in Pas-de-Calais. This dataset is in French.
Dataset Card for French book reviews
I-Dataset Summary
The majority of review datasets are in English. There are datasets in other languages, but not many. Through this work, I would like to enrich the datasets available in French (my mother tongue, along with Arabic). The data was retrieved from two French websites: Babelio and Critiques Libres. Like Wikipedia, these two French sites are made possible by the contributions of volunteers who use the Internet to share their… See the full description on the dataset page: https://huggingface.co/datasets/Abirate/french_book_reviews.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Explore "French now! : a level one worktext = Le français actuel! : premier programm.." through unique data from multiple sources: key facts, real-time news, interactive charts, detailed maps & open datasets.
https://www.futurebeeai.com/data-license-agreement
Welcome to the French Language General Conversation Speech Dataset, a comprehensive and diverse collection of voice data specifically curated to advance the development of French language speech recognition models, with a particular focus on the accents and dialects of France.
With high-quality audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and Generative Voice AI algorithms. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances found in the French language spoken in France.
Speech Data:
This training dataset comprises 50 hours of audio recordings covering a wide range of topics and scenarios, ensuring robustness and accuracy in speech technology applications. To achieve this, we collaborated with a diverse network of 70 native French speakers from different regions of France. This collaborative effort guarantees a balanced representation of French accents, dialects, and demographics, reducing biases and promoting inclusivity.
Each audio recording captures the essence of spontaneous, unscripted conversations between two individuals, with an average duration ranging from 15 to 60 minutes. The speech data is available in WAV format, with stereo channel files having a bit depth of 16 bits and a sample rate of 8 kHz. The recording environment is generally quiet, without background noise and echo.
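As a quick sanity check, the stated audio format can be verified with Python's standard library. This is a minimal sketch, assuming a placeholder file name; it is not part of the dataset's tooling.

```python
import wave

# Minimal sketch: verify that a recording matches the stated format
# (stereo, 16-bit, 8 kHz WAV). "sample.wav" is a placeholder file name.
with wave.open("sample.wav", "rb") as wav:
    assert wav.getnchannels() == 2      # stereo
    assert wav.getsampwidth() == 2      # 16-bit samples (2 bytes)
    assert wav.getframerate() == 8000   # 8 kHz sample rate
    minutes = wav.getnframes() / wav.getframerate() / 60
    print(f"Duration: {minutes:.1f} minutes")  # expected range: 15-60
```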
Metadata:
In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This metadata includes the participant's age, gender, country, state, and dialect. Furthermore, additional metadata such as recording device details, topic of recording, bit depth, and sample rate will be provided.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of French language speech recognition models.
Transcription:
This dataset provides a manual verbatim transcription of each audio file to enhance your workflow efficiency. The transcriptions are available in JSON format. The transcriptions capture speaker-wise transcription with time-coded segmentation along with non-speech labels and tags.
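The exact JSON schema is not documented here, so the sketch below assumes illustrative field names (segments, speaker, start, end, text) purely to show how speaker-wise, time-coded segments might be iterated:

```python
import json

# Hedged sketch: the field names below are assumptions, not the documented
# schema. Non-speech events are assumed to appear as bracketed tags in the
# text, e.g. "[noise]" or "[laughter]".
with open("transcription.json", encoding="utf-8") as f:
    doc = json.load(f)

for seg in doc.get("segments", []):
    start, end = seg["start"], seg["end"]      # time-coded segmentation
    print(f'{seg["speaker"]} [{start:.2f}-{end:.2f}s]: {seg["text"]}')
```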
Our goal is to expedite the deployment of French language conversational AI and NLP models by offering ready-to-use transcriptions, ultimately saving valuable time and resources in the development process.
Updates and Customization:
We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our voice dataset is regularly updated with new audio data captured in diverse real-world conditions.
If you require a custom training dataset with specific environmental conditions such as in-car, busy street, restaurant, or any other scenario, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8kHz to 48kHz, allowing you to fine-tune your models for different audio recording setups. Additionally, we can also customize the transcription following your specific guidelines and requirements, to further support your ASR development process.
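For reference, converting between sample rates on your own side is also straightforward; here is a minimal sketch using the librosa and soundfile libraries, with placeholder file names:

```python
import librosa
import soundfile as sf

# Hedged sketch: upsample an 8 kHz recording to 16 kHz for a model that
# expects wide-band input. File names are placeholders.
y, sr = librosa.load("sample_8khz.wav", sr=None)          # keep native rate
y_16k = librosa.resample(y, orig_sr=sr, target_sr=16000)  # resample
sf.write("sample_16khz.wav", y_16k, 16000)
```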
License:
This audio dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion:
Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, exploring generative voice AI, or building cutting-edge voice assistants and bots, our dataset serves as a reliable and valuable resource.
https://www.futurebeeai.com/data-license-agreement
Welcome to the French Language Call Center Speech Dataset for the Real Estate domain. It is a specialized and comprehensive collection of voice data designed to enhance the development of call center speech recognition models specifically for the Real Estate industry.
With high-quality call center audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and generative voice AI algorithms in the Real Estate domain. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances found in the French language spoken in France.
Speech Data:
This training dataset comprises 30 hours of call center audio recordings covering various topics and scenarios related to the Real Estate domain, to build robust and accurate customer service speech technology.
To curate realistic call center interactions, we collaborated with a diverse network of 60 expert native French speakers from different regions of France. This collaborative effort ensures a balanced representation of French accents, dialects, and demographics, promoting inclusivity and reducing biases in the dataset.
Each audio recording captures the essence of unscripted and spontaneous conversations between call center agents and customers, with an average duration ranging from 5 to 15 minutes per call. The dataset includes both inbound and outbound calls, covering scenarios such as inquiries, promotional offers, complaints, technical support, and more. Additionally, the dataset contains call center conversations with both positive and negative outcomes, providing a diverse and realistic dataset.
The speech data is available in WAV format with stereo channels, a bit depth of 16 bits, and a sample rate of 8 kHz, ensuring high-quality audio for accurate analysis. The recording environment is generally quiet, without background noise and echo.
Metadata:
In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This includes the participant’s age, gender, country, state, and dialect. Additionally, it includes metadata like domain, topic, call type, outcome, bit depth, and sample rate for each conversation.
The metadata serves as a powerful tool for understanding and characterizing the data, enabling informed decision-making in the development of French language call center speech recognition models for the Real Estate domain.
Transcription:
To facilitate your workflow, the dataset includes manual verbatim transcriptions of each call center audio file in JSON format. The transcriptions capture speaker-wise transcription with time-coded segmentation along with non-speech labels and tags, covering both the agent and customer conversations.
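As a hedged sketch of working with such transcripts, the snippet below splits segments into agent and customer turns; the field names and speaker labels are assumptions for illustration, not the documented schema:

```python
import json
from collections import defaultdict

# Hedged sketch: group time-coded segments by speaker. The keys "segments",
# "speaker", and "text", and the labels "agent"/"customer", are assumed.
with open("call_transcription.json", encoding="utf-8") as f:
    doc = json.load(f)

turns = defaultdict(list)
for seg in doc.get("segments", []):
    turns[seg["speaker"]].append(seg["text"])

print("Agent turns:   ", len(turns.get("agent", [])))
print("Customer turns:", len(turns.get("customer", [])))
```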
These ready-to-use transcriptions accelerate the development of Real Estate call center conversational AI and ASR models for the French language.
Updates and Customization:
We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our call center voice dataset is regularly updated with new audio data captured in diverse real-world conditions.
If you require a custom training dataset with specific environmental conditions, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8kHz to 48kHz, allowing you to fine-tune your models for different audio recording setups. Additionally, we can also customize the transcription following your specific guidelines and requirements, to further support your ASR development process.
License:
This Real Estate call center audio dataset was created by FutureBeeAI and is available for commercial use.
Conclusion:
Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, or building state-of-the-art voice assistants to improve customer experiences in the Real Estate sector, our dataset serves as a trusted resource to meet your goals.
This dataset contains different measures of plosives produced by 16 Norwegian learners of French as a third language during a reading task and a repetition task. The data are extracted from two corpora collected within the framework of the IPFC project (Interphonologie du français contemporain): the Tromsø corpus with high school students, and the Oslo corpus with university students enrolled in a first year course on French phonetics and phonology. The dataset contains four files: A readme file, the word list used during the reading and repetition tasks, a data file containing all measures, and a text file presenting average values and VOT ranges for the individual informants.
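A minimal sketch of how the measures file might be summarized per informant with pandas; the file name and column names are assumptions, so check the dataset's readme for the actual layout:

```python
import pandas as pd

# Hedged sketch: compute per-informant mean and range of voice onset time.
# "plosive_measures.csv", "informant", and "vot_ms" are assumed names.
df = pd.read_csv("plosive_measures.csv")
summary = df.groupby("informant")["vot_ms"].agg(["mean", "min", "max"])
print(summary)  # average values and VOT ranges, as in the dataset's text file
```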
This statistic illustrates the distribution of blood groups in the French population according to the ABO system. It shows that fewer than 5% of French people have blood group AB.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the population of French Settlement by race. It includes the population of French Settlement across racial categories (excluding ethnicity) as identified by the Census Bureau. The dataset can be utilized to understand the population distribution of French Settlement across relevant racial categories.
Key observations
The percent distribution of the French Settlement population by race (across all racial categories recognized by the U.S. Census Bureau): 98.40% are white, 1.03% are Native Hawaiian and other Pacific Islander, and 0.57% are multiracial.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability, and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is part of the main dataset for French Settlement Population by Race & Ethnicity.
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The PAROLE French corpus contains the following data:

Miscellaneous (data provided by ELRA: CRATER, MLCC Multilingual and Parallel Corpora): 2,025,964 words
Books (CNRS Editions): 3,267,409 words
Periodicals (CNRS Info, Hermès): 942,963 words
Newspapers (Le Monde, provided by ELRA): 13,856,763 words
Total: 20,093,099 words

1. Newspapers: 14 million words were extracted from complete issues of the years 1987, 1989, 1991, 1993, and 1995 of the Le Monde newspaper. 241,484 words, from 7 issues of Le Monde of September 1987, were extracted and POS-tagged automatically. Each article consists of a complete item, with a header, following the directives of the TEI (Text Encoding Initiative). Le Monde's original markup was converted into classification features, so that articles on different topics can be extracted.

2. Periodicals:
HERMES: Issues 15 to 22 were used (134 articles, one Word file per article). The data were converted from Word to RTF (Rich Text Format) and then, via a translator, from RTF to HTML. The conversion from HTML to the PAROLE format was done with flex programs. The result for each article is one "header" file, which contains information on the author and the article id, and one "body" file, which contains the article itself. A Perl script creates the final file from both "header" and "body".
CNRS-Infos: The data come from the CNRS-Infos Web site (http://www.cnrs.fr/Cnrspresse/cnrsinfo.html). Each file was processed as follows: cleaning the HTML header, extracting a summary, cleaning HTML markup, translating to the PAROLE format, and creating the "header" and "body" files (see Hermès). As with the Hermès files, a Perl script creates the final file from both "header" and "body".

3. Books: All books were provided on CD-ROM as XPress files, each book having its own structure. Therefore, each book was handled separately. XPress allows conversion to a format called "XPress markup". This format makes it possible to identify the different structures of the book (if the XPress file has been laid out well, which is not always the case). The structure of each book had to be worked out in order to create the Perl script that translates it to the PAROLE format. Conformance to the PAROLE format was verified with the "nsgmls" tool. Errors found during verification were corrected manually.

Introduction to the PAROLE project: The LE-PAROLE project (MLAP/LE2-4017) aims to offer a large-scale harmonised set of "core" corpora and lexica for all European Union languages. Language corpora and lexica were built according to the same design and composition principles, in the period 1996-1998. PAROLE corpora: harmonisation with respect to corpus composition (selection of corpus texts) was to be achieved by the obligatory application of common parameters for time of production and classification according to publication medium. No texts older than 1970 were allowed. As for publication medium, the corpus had to include specif...
The AsCoPain-T terminology lists and defines the main quality assessment descriptors used by bread-making professionals and those used in scientific studies. It aims to link professional and scientific terminologies, since language can vary greatly when denoting similar processing operations and observations made on the dough and bread. The AsCoPain-T terminology is the result of successive efforts: expert knowledge collected by the INRA AsCoPain project first resulted in the definition of the relations between bread-making control variables and the different states of the dough and bread. These relations were implemented in the expert system used to predict the state of the dough throughout the kneading process. The terminology collected during this work was then published as a glossary in a document referenced by the FAO (see https://hal.inrae.fr/hal-02823534). The recent adaptation of its content to semantic web standards now allows unforeseen uses, in particular by applications. The terminology is available in French; its translation into English will soon be under way.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Wikipedia is a free encyclopedia where everyone can contribute and modify, delete, or add text to the articles. Because of this, new text is created every day and, most importantly, new corrections are made to preexisting sentences. The idea is to find these corrections and create a dataset of (X, y) sentence pairs.
This dataset was created using the Wikipedia edit histories thanks to the Wikipedia dumps available here: https://dumps.wikimedia.org/frwiki/
The dataset is composed of 45 million (X, y) sentence pairs extracted from almost the entire French Wikipedia. There are five columns in the CSV files: X (the source sentence), y (the target sentence), title (the title of the article from which the sentence came), timestamps (the two dates when the source sentence and target sentence were created), and comments (the comment of the edit, if specified).
There is one major issue with this dataset if it is used for the GEC task: a large share of the extracted sentence pairs fall outside the scope of GEC (grammar, typos, and syntax). Many corrections made on Wikipedia are reformulations, condensations, or clarifications; training a model on this dataset therefore yields a model that reformulates and deletes parts of the sentences it is supposed to correct. To address this, I suggest training a classification model using transfer learning to filter out the "bad" sentence pairs.
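Until such a classifier exists, a simple character-overlap heuristic can already discard heavy rewrites. A hedged sketch (the CSV file name is a placeholder; the X and y columns follow the description above):

```python
import pandas as pd
from difflib import SequenceMatcher

# Hedged sketch: keep only pairs where the target stays close to the
# source (typo/grammar-scale edits) and drop reformulations or deletions.
# A learned classifier, as suggested above, would replace this threshold.
df = pd.read_csv("frwiki_edit_pairs.csv")  # placeholder file name

def is_minor_edit(src: str, tgt: str, min_ratio: float = 0.85) -> bool:
    # ratio() approaches 1.0 for near-identical strings
    return SequenceMatcher(None, src, tgt).ratio() >= min_ratio

mask = [is_minor_edit(str(x), str(y)) for x, y in zip(df["X"], df["y"])]
gec_pairs = df[mask]
print(f"Kept {len(gec_pairs)} of {len(df)} pairs")
```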
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
French monolingual legal corpus from the Official Journal of France, as collected from the https://www.legifrance.gouv.fr/ website.
This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) actions SMART 2014/1074 and SMART 2015/1091. For further information on the project: http://lr-coordination.eu.