French Wikipedia is a dataset used for pretraining the CamemBERT French language model. It uses the official 2019 French Wikipedia dumps.
This statistic shows the share of households with Internet access in France from 2006 to 2023. The Internet penetration rate among French households exceeded 80% in 2012. In 2023, 93% of French households had Internet access. The Internet penetration rate varies by age: in 2016, 92% of 18- to 24-year-olds said they were Internet users, compared with only 56% of people aged 70 and over.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Demonstrations in Pas-de-Calais. This dataset is in French.
This dataset contains different measures of plosives produced by 16 Norwegian learners of French as a third language during a reading task and a repetition task. The data are extracted from two corpora collected within the framework of the IPFC project (Interphonologie du français contemporain): the Tromsø corpus with high school students, and the Oslo corpus with university students enrolled in a first-year course on French phonetics and phonology. The dataset contains four files: a readme file, the word list used during the reading and repetition tasks, a data file containing all measures, and a text file presenting average values and VOT ranges for the individual informants.
FFR Dataset is an ongoing project to collect, clean and store corpora of Fon and French sentences for Fon-French machine translation. Fon (also called Fongbe) is an African-indigenous language spoken mostly in Benin, by about 1.7 million people. As training data is crucial to the high performance of a machine learning model, the aim of the project is to compile the largest set of training corpora for the research and design of translation and NLP models involving Fon. There are 117,029 parallel Fon-French sentences at the moment.
https://www.futurebeeai.com/data-license-agreement
Introducing the French Shopping List Image Dataset - a diverse and comprehensive collection of handwritten text images carefully curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the French language.
Dataset Content & Diversity: Containing more than 2,000 images, this French OCR dataset offers a wide distribution of different types of shopping list images. Within this dataset, you'll discover a variety of handwritten text, including sentences, individual item names, quantities, and comments on shopping lists. The images in this dataset showcase distinct handwriting styles, fonts, font sizes, and writing variations.
To ensure diversity and robustness in training your OCR model, we allow only a limited number (fewer than three) of unique images per handwriting. This ensures we have diverse types of handwriting to train your OCR model on. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image a minimum of 80% of the space contains visible French text.
The images have been captured under varying lighting conditions, including day and night, as well as different capture angles and backgrounds. This diversity helps build a balanced OCR dataset, featuring images in both portrait and landscape modes.
All these shopping lists were written and images were captured by native French people to ensure text quality, prevent toxic content, and exclude PII text. We utilized the latest iOS and Android mobile devices with cameras above 5MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.
Metadata: In addition to the image data, you will receive structured metadata in CSV format. For each image, this metadata includes information on image orientation, country, language, and device details. Each image is correctly named to correspond with the metadata.
This metadata serves as a valuable resource for understanding and characterizing the data, aiding informed decision-making in the development of French text recognition models.
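As a rough illustration of how such per-image metadata might be used, the sketch below parses a small CSV and selects images by orientation. The column names (`image_name`, `orientation`, `country`, `language`, `device`) and the sample rows are assumptions made for this example only; the actual schema shipped with the dataset may differ.

```python
import csv
import io

# Illustrative rows only: the column names and values are assumptions,
# not the real schema or contents of the dataset's metadata file.
SAMPLE_METADATA = """image_name,orientation,country,language,device
list_0001.jpg,portrait,France,French,hypothetical-phone-a
list_0002.heic,landscape,France,French,hypothetical-phone-b
list_0003.jpg,portrait,France,French,hypothetical-phone-a
"""


def load_metadata(csv_text):
    """Parse metadata CSV text into a list of dicts, one per image."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def images_by_orientation(rows, orientation):
    """Return the image names whose orientation matches (case-insensitive)."""
    wanted = orientation.strip().lower()
    return [
        row["image_name"]
        for row in rows
        if row["orientation"].strip().lower() == wanted
    ]


rows = load_metadata(SAMPLE_METADATA)
portrait_images = images_by_orientation(rows, "portrait")
```

With a real metadata file, the same functions could be applied to the file's text after reading it from disk, once the column names are adjusted to match.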
Update & Custom Collection: We are committed to continually expanding this dataset by adding more images with the help of our native French crowd community.
If you require a customized OCR dataset containing shopping list images tailored to your specific guidelines or device distribution, please don't hesitate to contact us. We have the capability to curate specialized data to meet your unique requirements.
Additionally, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your project's specific needs using our crowd community.
License: This image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion: Leverage this shopping list image OCR dataset to enhance the training and performance of text recognition, text detection, and optical character recognition models for the French language. Your journey to improved language understanding and processing begins here.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides a comprehensive set of audio recordings for wake word detection in French. It features a range of accents, speech patterns, and environmental conditions to ensure reliable and accurate performance of speech recognition systems. Ideal for developers and researchers working on French language technology solutions.
The French English Discourse Study – Canada (FrEnDS-CAN) is a multisite research project led by a consortium of researchers from a number of Canadian universities. The project examined the development of discourse skills in mono- and bilingual children between the ages of 7 and 12. Discourse samples were collected from monolingual French, monolingual English, and bilingual French- and English-speaking children. In addition, samples were collected from children who spoke Arabic as their home language. Three discourse contexts were included: conversational, narrative (storytelling), and expository (description of a favorite game or sport). There were two main objectives for the study: 1) to increase our understanding of monolingual and bilingual development by analyzing the impacts of language status (mono-/bilingual) and discourse type on various aspects of language development at the word, sentence, and discourse structure levels, and 2) to improve our ability to identify language disabilities in monolingual and bilingual children through the development of normative information and the creation of databases of language samples in the three discourse contexts.
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The BDSONS Database is a French speech database with two subsets: evaluation and acoustic modelling. The corpora comprise 32 speakers, 16 male and 16 female (7 CD-ROMs of approximately 3.5 gigabytes), and contain the following data. "Evaluation" (32 speakers): adjustment: 5 sentences and 54 bi-syllabic "logatomes", numbers, digits, letters, and names (spelled in isolation and in connected speech). "Acoustic" (12 speakers): words: 600 CVCV including 20 consonants and semi-consonants and the vowels /a/, /i/, /u/; 200 consonant clusters; rhyme tests for consonants and vowels (pairs and triplets); sentences: 52 phonetically balanced sentences, 44 nasal sentences, and 192 sentences including real words in French with 16 consonants and 12 vowels. Phonetic labelling for a subset of the data is available on additional floppy disks.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the raw data and the R scripts necessary to replicate all tables and figures in the cited publication. The raw data consists of manually-annotated plain-text concordances containing instances of five pairs of Old French spatial prepositions (ens/dedans, hors/dehors, avant/devant, arriere/derriere, sus/dessus). The concordances were initially extracted from the "Base de Français Médiéval" corpus (http://txm.bfm-corpus.org/). (publication abstract): This paper compares the syntactic distribution of two separate series of spatial preposition-adverbs in medieval French: "base" forms descended directly from Latin adverbs and forms prefixed with de-. As both types of form may occur with a similar meaning either as prepositions or as adverbs, many grammars of Old French typically consider them to be free variants. However, on the basis of a detailed quantitative analysis of five pairs of forms across 1.4 million words of medieval French drawn from the Base de français médiéval corpus, I argue that the base forms are particles, being favoured in motion expressions and showing limited prepositional uses, while the de-prefixed forms, favoured in static contexts or as locative adjuncts, are best analysed as locative adverbs with secondary prepositional uses.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books and is filtered where the book is On parle français, featuring 7 columns including author, BNB id, book, book publisher, and ISBN. The preview is ordered by publication date (descending).
Attitudes towards the European Union.
Topics: attitude towards the following statements on European integration: guarantees peace on the continent, makes France stronger against the rest of the world, contributes to France’s prosperity; preferred decision level for measures against the economic crisis: national or EU level; management of the economic crisis to date as joint action of the European countries or following national interests; attitude towards selected propositions: increased monitoring of national budgets by the EU, increased regulation of financial markets, stricter monitoring of rating agencies, introduction of a financial transaction tax, harmonization of the taxation systems of the member states, programme to stimulate economic growth, ban on imports of products from certain countries, principle of reciprocity in international exchange, decisions on EU level on the basis of a qualified majority and not unanimously; respondent feels well informed about political life in France and in the EU, citizens need more information on the EU given by French politicians, citizens need more information on the EU given by the media; satisfaction with the current personal situation and expected development for the next three years; left-right self-placement.
Demography: age; sex; age at end of education; occupation; professional position; nationality; region; type of community; own a mobile phone and fixed (landline) phone; household composition and household size.
Additionally coded was: type of phone line; weighting factor.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Niveau de vie des Français par commune’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from http://data.europa.eu/88u/dataset/59f89adf88ee381016f69c0c on 17 January 2022.
--- Dataset description provided by original source is as follows ---
Insee has published household living standards by municipality for 2014. The analysis system, called Filosofi, makes it possible to show in detail where the areas of poverty are located in France.
--- Original source retains full ownership of the source dataset ---
This statistic illustrates the distribution of blood groups in the French population according to the ABO system. It shows that fewer than 5% of French people have blood group AB. For more information, see our infographic on blood group compatibility.
In France in 2022, half of the new immigrants who took a test of written comprehension of the French language achieved a success rate of 80 percent or more. This share has increased since 2019, when only 40 percent of new immigrants were highly successful on the test; it was 44 percent in 2020.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This project aims to compile lists of the plants present in each French department.
Taxon nomenclature is being harmonized on the basis of the synonymic index produced by Benoît BOCK as part of Tela Botanica's "Index synonymique" project.
The lists will be made available online as they are completed.
You can also check the progress of the project through a web interface: http://www.tela-botanica.org/chorologie
Licence Ouverte / Open Licence 1.0: https://www.etalab.gouv.fr/wp-content/uploads/2014/05/Open_Licence.pdf
License information was derived automatically
There is no description for this dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comes from the database called Pelagis-Observations (a database developed and administered by UMS 3468 BBEES). It brings together marine mammal observation data collected on board the vessels Astrolabe and Marion Dufresne I and II during campaigns in the French Southern and Antarctic Lands (Terres Australes et Antarctiques Françaises; programme 109 of the Institut Polaire Français - IPEV) between 1982 and 2015. The campaigns covered here are logistical, scientific, and OISO oceanographic campaigns.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘2014-24 - Visas délivrés aux conjoints de Français’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from http://data.europa.eu/88u/dataset/543bd41488ee3805963c163f on 19 January 2022.
--- Dataset description provided by original source is as follows ---
Analysis of trends in visa issuance for the main categories of visas
--- Original source retains full ownership of the source dataset ---
Article abstract: Phonological variation forms an integrated part of language acquisition, and one important challenge for learners of French as a post-L1 language concerns schwa alternation, in perception as well as production. This paper presents a first analysis of the behavior of schwa in conversational speech in two Norwegian learner corpora, and tests the hypothesis whereby the acquisition of the phonological variable depends on phonotactic structure and frequency. The examination of the learners' productions indicates that even at an advanced level, they are far from mastering the target system, which encourages a more explicit exposition in the classroom to the factors conditioning schwa alternation. About the dataset: The dataset consists of 2 data files and a readme-file explaining the content of the data files. The 2 data files contain information about schwa behavior in monosyllables and the initial syllable of polysyllables, in conversational speech, in two learner corpora (Norwegian learners of French as a post-L1 language). The data are coded by making use of the schwa pilot coding system developed within the IPFC project (Interphonologie du français contemporain). For access to the sound files, contact the authors.