100+ datasets found
  1. French Wikipedia Dataset

    • paperswithcode.com
    • opendatalab.com
    Cite
    Louis Martin; Benjamin Muller; Pedro Javier Ortiz Suárez; Yoann Dupont; Laurent Romary; Éric Villemonte de la Clergerie; Djamé Seddah; Benoît Sagot, French Wikipedia Dataset [Dataset]. https://paperswithcode.com/dataset/french-wikipedia
    Explore at:
    Authors
    Louis Martin; Benjamin Muller; Pedro Javier Ortiz Suárez; Yoann Dupont; Laurent Romary; Éric Villemonte de la Clergerie; Djamé Seddah; Benoît Sagot
    Area covered
    French
    Description

    French Wikipedia is a dataset used for pretraining the CamemBERT French language model. It uses the official 2019 French Wikipedia dumps

  2. Corpus de Français Parlé Parisien des années 2000 (CFPP)

    • cocoon.huma-num.fr
    Updated Apr 12, 2013
    Cite
    Fleury, Serge; Lefeuvre, Florence; Branca-Rosoff, Sonia; Pires, Mat (2013). Corpus de Français Parlé Parisien des années 2000 (CFPP) [Dataset]. http://doi.org/10.34847/cocoon.8bc96a4e-9899-30e4-99be-c72d216eb38b
    Explore at:
    Dataset updated
    Apr 12, 2013
    Dataset provided by
    CNRS/COCOON
    Authors
    Fleury, Serge; Lefeuvre, Florence; Branca-Rosoff, Sonia; Pires, Mat
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0) https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Area covered
    Paris, French
    Description

    The Corpus de Français Parlé Parisien (CFPP2000) consists of a set of non-directive interviews about the neighbourhoods of Paris and its inner suburbs. The interviews, transcribed orthographically and aligned at the speaking-turn level, are available online. They may be used freely provided that any resulting work cites in its bibliography both the site address, http://cfpp2000.univ-paris3.fr/, and the following presentation document: Branca-Rosoff S., Fleury S., Lefeuvre F., Pires M., 2012, "Discours sur la ville. Présentation du Corpus de Français Parlé Parisien des années 2000 (CFPP2000)". As of February 2013, the corpus contained about 550,000 words. A number of online tools, notably a concordancer and textometric tools, support lexical and grammatical queries. CFPP2000 is intended particularly for analyses of spoken French. The project underlying the corpus is also the study of the changes and variation at work in what can be regarded as a vehicular Parisian French, in tension between the standard pole and the vernacular pole. The corpus also covers varied linguistic activities (neighbourhood description, anecdotes, argumentation, etc.), so the syntax specific to these different uses of language can be studied. Finally, it makes it possible to contrast dialogues (between interviewer and interviewee) with multilogues (where the presence of several interviewees favours a shift to an informal register). CFPP2000 is made up of long interviews (one hour on average), transcribed in full. It can therefore be used to examine the peculiarities attributable to a given person's idiolect, as opposed to variants spread across larger groups (neighbourhoods, socio-cultural groups, age groups, etc.).
    The corpus also constitutes a set of interesting testimonies about representations of Paris and its inner suburbs, likely to interest discourse analysts, sociologists, or simply anyone curious about the city.

  3. Educational Resource — French Primary — UD0. Le français c’est facile

    • data.europa.eu
    html, unknown
    Updated Jul 5, 2023
    + more versions
    Cite
    Gobierno de Aragón (2023). Educational Resource — French Primary — UD0. Le français c’est facile [Dataset]. https://data.europa.eu/data/datasets/https-opendata-aragon-es-datos-catalogo-dataset-recurso-educativo-primariafrances-ud0-le-francais-cest-facile/
    Explore at:
    Available download formats: html, unknown
    Dataset updated
    Jul 5, 2023
    Dataset authored and provided by
    Gobierno de Aragón
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    French
    Description

    We will learn to introduce ourselves in French: greeting each other, asking and saying one's name, asking and saying one's age, asking and saying one's nationality and place of residence, asking and saying what one does, expressing what one likes, and saying goodbye. All in a very simple and easy way. Off to Paris... we are ready.

  4. Réseau Lexical du Français (RL-fr)

    • ortolang.fr
    Updated Sep 6, 2017
    Cite
    (2017). Réseau Lexical du Français (RL-fr) [Dataset]. https://www.ortolang.fr/market/lexicons/lexical-system-fr/v1
    Explore at:
    Dataset updated
    Sep 6, 2017
    Area covered
    French
    Description

    Characterization of the Réseau Lexical du Français (RL-fr). The Réseau Lexical du Français (RL-fr) is a formal model of the lexicon of contemporary French, currently under construction at…

  5. Accès des ménages français à Internet 2006-2021

    • fr.statista.com
    Cite
    Statista, Accès des ménages français à Internet 2006-2021 [Dataset]. https://fr.statista.com/statistiques/509227/menage-francais-acces-internet/
    Explore at:
    Dataset authored and provided by
    Statista http://statista.com/
    Area covered
    France
    Description

    This statistic shows the share of households with Internet access in France from 2006 to 2019 and in 2021. Internet penetration in French households exceeded 80% in 2012. In 2019, 90% of French households had Internet access. Penetration varies with age: in 2016, 92% of 18-to-24-year-olds described themselves as Internet users, compared with only 56% of people aged 70 and over.

  6. Fon-French Dataset Dataset

    • paperswithcode.com
    Updated Jun 13, 2020
    Cite
    Bonaventure F. P. Dossou; Chris C. Emezue (2020). Fon-French Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/fon-french-dataset
    Explore at:
    Dataset updated
    Jun 13, 2020
    Authors
    Bonaventure F. P. Dossou; Chris C. Emezue
    Area covered
    French
    Description

    FFR Dataset is an ongoing project to collect, clean and store corpora of Fon and French sentences for machine translation from Fon-French. Fon (also called Fongbe) is an African-indigenous language spoken mostly in Benin, by about 1.7 million people. As training data is crucial to the high performance of a machine learning model, the aim of the project is to compile the largest set of training corpora for the research and design of translation and NLP models involving Fon. There are 117,029 parallel Fon-French sentences at the moment.

  7. Canadian French Language Datasets | ASR, Virtual Assistant, Chatbot

    • fr.shaip.com
    • pl.shaip.com
    • +27more
    Updated May 30, 2023
    Cite
    Shaip (2023). Canadian French Language Datasets | ASR, Virtual Assistant, Chatbot [Dataset]. https://fr.shaip.com/offerings/speech-data-catalog/canadian-french-dataset/
    Explore at:
    Dataset updated
    May 30, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Canada, French
    Description

    Enhance your Conversational AI model with our Off-the-Shelf Canadian French Language Datasets. Shaip high-quality audio datasets are a quick and effective solution for model training.

  8. French-PD-Newspapers

    • huggingface.co
    Updated Mar 20, 2024
    Cite
    PleIAs (2024). French-PD-Newspapers [Dataset]. https://huggingface.co/datasets/PleIAs/French-PD-Newspapers
    Explore at:
    Dataset updated
    Mar 20, 2024
    Dataset authored and provided by
    PleIAs
    Area covered
    French
    Description

    🇫🇷 French Public Domain Newspapers 🇫🇷

    French-Public Domain-Newspapers or French-PD-Newspapers is a large collection aiming to aggregate all the French newspapers and periodicals in the public domain. The collection was originally compiled by Pierre-Carl Langlais, on the basis of a large corpus curated by Benoît de Courson and Benjamin Azoulay for Gallicagram, and in cooperation with OpenLLMFrance. Gallicagram is a leading cultural analytics project giving access to word and… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/French-PD-Newspapers.

  9. Sport (Français)

    • data.gouv.fr
    • data.europa.eu
    • +2more
    csv, json, shp, xls
    Updated Nov 26, 2015
    + more versions
    Cite
    Pas-de-Calais tourisme (2015). Sport (Français) [Dataset]. https://www.data.gouv.fr/en/datasets/sport-francais-pdct/
    Explore at:
    Available download formats: csv, json, xls, shp
    Dataset updated
    Nov 26, 2015
    Dataset authored and provided by
    Pas-de-Calais tourisme
    License

    Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Area covered
    French
    Description

    Sport in the Pas-de-Calais. This dataset is in French.

  10. french_book_reviews

    • huggingface.co
    Updated Apr 24, 2023
    Cite
    Abir ELTAIEF (2023). french_book_reviews [Dataset]. http://doi.org/10.57967/hf/1052
    Explore at:
    Dataset updated
    Apr 24, 2023
    Authors
    Abir ELTAIEF
    Area covered
    French
    Description

    Dataset Card for French book reviews

    I. Dataset Summary

    The majority of review datasets are in English. There are datasets in other languages, but not many. Through this work, I would like to enrich the datasets in the French language (my mother tongue, along with Arabic). The data was retrieved from two French websites: Babelio and Critiques Libres. Like Wikipedia, these two French sites are made possible by the contributions of volunteers who use the Internet to share their… See the full description on the dataset page: https://huggingface.co/datasets/Abirate/french_book_reviews.

  11. French now! : a level one worktext = Le français actuel! : premier...

    • workwithdata.com
    Updated May 11, 2023
    Cite
    Work With Data (2023). French now! : a level one worktext = Le français actuel! : premier programm.. [Dataset]. https://www.workwithdata.com/object/french-now-a-level-one-worktext-le-francais-actuel-premier-programme-book-by-christopher-kendris-0000
    Explore at:
    Dataset updated
    May 11, 2023
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    French
    Description

    Explore French now! : a level one worktext = Le français actuel! : premier programm.. through unique data from multiple sources: key facts, real-time news, interactive charts, detailed maps and open datasets.

  12. French (France) General Conversation Speech Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). French (France) General Conversation Speech Dataset [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-french-france
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Area covered
    France, French
    Dataset funded by
    FutureBeeAI
    Description

    Welcome to the French Language General Conversation Speech Dataset, a comprehensive and diverse collection of voice data specifically curated to advance the development of French language speech recognition models, with a particular focus on France accents and dialects.

    With high-quality audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and Generative Voice AI algorithms. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances found in the French language spoken in France.

    Speech Data:

    This training dataset comprises 50 hours of audio recordings covering a wide range of topics and scenarios, ensuring robustness and accuracy in speech technology applications. To achieve this, we collaborated with a diverse network of 70 native French speakers from different states/provinces of France. This collaborative effort guarantees a balanced representation of France accents, dialects, and demographics, reducing biases and promoting inclusivity.

    Each audio recording captures the essence of spontaneous, unscripted conversations between two individuals, with an average duration ranging from 15 to 60 minutes. The speech data is available in WAV format, with stereo channel files having a bit depth of 16 bits and a sample rate of 8 kHz. The recording environment is generally quiet, without background noise and echo.
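    The stated audio format (WAV, stereo, 16-bit, 8 kHz) can be checked before training. The following is a minimal sketch using Python's standard `wave` module; the helper names `describe_wav`, `matches_expected` and `make_silent_wav` are hypothetical, and the silent in-memory file stands in for a real recording from the dataset.

```python
import io
import wave

# Format stated in the description: stereo, 16-bit (2 bytes), 8 kHz.
EXPECTED = {"channels": 2, "sample_width_bytes": 2, "sample_rate_hz": 8000}


def describe_wav(source):
    """Read the header of a WAV file (path or file-like object)."""
    with wave.open(source, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }


def matches_expected(info, expected=EXPECTED):
    """True when every expected header field has the expected value."""
    return all(info[key] == value for key, value in expected.items())


def make_silent_wav(seconds=1, rate=8000, channels=2, width=2):
    """Build a silent WAV in memory (a stand-in for a real recording)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (seconds * rate * channels * width))
    buffer.seek(0)
    return buffer


info = describe_wav(make_silent_wav())
print(info, matches_expected(info))
```

    With a real file, `describe_wav` would be called with its path instead of the in-memory buffer.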

    Metadata:

    In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This metadata includes the participant's age, gender, country, state, and dialect. Furthermore, additional metadata such as recording device detail, topic of recording, bit depth, and sample rate will be provided.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of French language speech recognition models.

    Transcription:

    This dataset provides a manual verbatim transcription of each audio file to enhance your workflow efficiency. The transcriptions are available in JSON format. The transcriptions capture speaker-wise transcription with time-coded segmentation along with non-speech labels and tags.
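    The description does not give the JSON schema, so the sketch below assumes one purely for illustration: the field names `speaker`, `start`, `end` and `text`, and the sample segments themselves, are hypothetical. It shows how speaker-wise, time-coded segments could be summed into talk time per speaker.

```python
import json
from collections import defaultdict

# Hypothetical transcript: the field names and the segments are assumptions
# made for illustration, not the dataset's actual schema or content.
SAMPLE_TRANSCRIPT = """
[
  {"speaker": "A", "start": 0.0, "end": 2.5, "text": "Bonjour, comment allez-vous ?"},
  {"speaker": "B", "start": 2.5, "end": 4.0, "text": "Très bien, merci."},
  {"speaker": "A", "start": 4.0, "end": 6.0, "text": "[laugh]"}
]
"""


def talk_time_per_speaker(segments):
    """Sum segment durations (in seconds) for each speaker label."""
    totals = defaultdict(float)
    for segment in segments:
        totals[segment["speaker"]] += segment["end"] - segment["start"]
    return dict(totals)


segments = json.loads(SAMPLE_TRANSCRIPT)
talk_time = talk_time_per_speaker(segments)
print(talk_time)
```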

    Our goal is to expedite the deployment of French language conversational AI and NLP models by offering ready-to-use transcriptions, ultimately saving valuable time and resources in the development process.

    Updates and Customization:

    We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our voice dataset is regularly updated with new audio data captured in diverse real-world conditions.

    If you require a custom training dataset with specific environmental conditions such as in-car, busy street, restaurant, or any other scenario, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8kHz to 48kHz, allowing you to fine-tune your models for different audio recording setups. Additionally, we can also customize the transcription following your specific guidelines and requirements, to further support your ASR development process.

    License:

    This audio dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, exploring generative voice AI, or building cutting-edge voice assistants and bots, our dataset serves as a reliable and valuable resource.

  13. Real Estate Call Center Speech Data: French (France)

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Real Estate Call Center Speech Data: French (France) [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/realestate-call-center-conversation-french-france
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Area covered
    France, French
    Dataset funded by
    FutureBeeAI
    Description

    Welcome to the French Language Call Center Speech Dataset for the Real Estate domain. It is a specialized and comprehensive collection of voice data designed to enhance the development of call center speech recognition models specifically for the Real Estate industry.

    With high-quality call center audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and generative voice AI algorithms in the Real Estate domain. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances found in the French language spoken in France.

    Speech Data:

    This training dataset comprises 30 hours of call center audio recordings covering various topics and scenarios related to the Real Estate domain, to build robust and accurate customer service speech technology.

    To curate realistic call center interactions, we collaborated with a diverse network of 60 expert native French speakers from different states/provinces of France. This collaborative effort ensures a balanced representation of France accents, dialects, and demographics, promoting inclusivity and reducing biases in the dataset.

    Each audio recording captures the essence of unscripted and spontaneous conversations between call center agents and customers, with an average duration ranging from 5 to 15 minutes per call. The dataset includes both inbound and outbound calls, covering scenarios such as inquiries, promotional offers, complaints, technical support, and more. Additionally, the dataset contains call center conversations with both positive and negative outcomes, providing a diverse and realistic dataset.

    The speech data is available in WAV format with stereo channels, a bit depth of 16 bits, and a sample rate of 8 kHz, ensuring high-quality audio for accurate analysis. The recording environment is generally quiet, without background noise and echo.

    Metadata:

    In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This includes the participant’s age, gender, country, state, and dialect. Additionally, it includes metadata like domain, topic, call type, outcome, bit depth, and sample rate for each conversation.

    The metadata serves as a powerful tool for understanding and characterizing the data, enabling informed decision-making in the development of French language call center speech recognition models for the Real Estate domain.

    Transcription:

    To facilitate your workflow, the dataset includes manual verbatim transcriptions of each call center audio file in JSON format. The transcriptions capture speaker-wise transcription with time-coded segmentation along with non-speech labels and tags, covering both the agent and customer conversations.

    These ready-to-use transcriptions accelerate the development of Real Estate call center conversational AI and ASR models for the French language.

    Updates and Customization:

    We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our call center voice dataset is regularly updated with new audio data captured in diverse real-world conditions.

    If you require a custom training dataset with specific environmental conditions, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8kHz to 48kHz, allowing you to fine-tune your models for different audio recording setups. Additionally, we can also customize the transcription following your specific guidelines and requirements, to further support your ASR development process.

    License:

    This Real Estate call center audio dataset is created by FutureBeeAI and is available for commercial use!

    Conclusion:

    Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, or building state-of-the-art voice assistants to improve customer experiences in the Real Estate sector, our dataset serves as a trusted resource to meet your goals

  14. Perception and production of plosives: Data from Norwegian learners of...

    • search.dataone.org
    • dataverse.no
    Updated Jan 5, 2024
    Cite
    Andreassen, Helene N. (2024). Perception and production of plosives: Data from Norwegian learners of French L3 [Dataset]. https://search.dataone.org/view/sha256%3Ac8e38e49de34af467e880727901435319452ca57921d6c170717c88720cd50a0
    Explore at:
    Dataset updated
    Jan 5, 2024
    Dataset provided by
    DataverseNO
    Authors
    Andreassen, Helene N.
    Area covered
    French
    Description

    This dataset contains different measures of plosives produced by 16 Norwegian learners of French as a third language during a reading task and a repetition task. The data are extracted from two corpora collected within the framework of the IPFC project (Interphonologie du français contemporain): the Tromsø corpus with high school students, and the Oslo corpus with university students enrolled in a first year course on French phonetics and phonology. The dataset contains four files: A readme file, the word list used during the reading and repetition tasks, a data file containing all measures, and a text file presenting average values and VOT ranges for the individual informants.

  15. Groupes sanguins des Français, selon le système ABO

    • fr.statista.com
    Updated Jan 11, 2017
    Cite
    Statista (2017). Groupes sanguins des Français, selon le système ABO [Dataset]. https://fr.statista.com/statistiques/656008/groupes-sanguins-repartition-abo-france/
    Explore at:
    Dataset updated
    Jan 11, 2017
    Dataset authored and provided by
    Statista http://statista.com/
    Area covered
    France
    Description

    This statistic shows the distribution of blood groups in the French population under the ABO system. It shows that fewer than 5% of French people have blood group AB.

  16. French Settlement, LA Population Breakdown by Race

    • neilsberg.com
    csv, json
    Updated Aug 18, 2023
    + more versions
    Cite
    Neilsberg Research (2023). French Settlement, LA Population Breakdown by Race [Dataset]. https://www.neilsberg.com/insights/french-settlement-la-population-by-race/
    Explore at:
    Available download formats: csv, json
    Dataset updated
    Aug 18, 2023
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    French Settlement, Louisiana
    Variables measured
    Asian Population, Black Population, White Population, Some other race Population, Two or more races Population, American Indian and Alaska Native Population, Asian Population as Percent of Total Population, Black Population as Percent of Total Population, White Population as Percent of Total Population, Native Hawaiian and Other Pacific Islander Population, and 4 more
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates. To measure the two variables, namely (a) population and (b) population as a percentage of the total population, we first analyzed and categorized the data for each of the racial categories identified by the U.S. Census Bureau. The population estimates used in this dataset pertain exclusively to the identified racial categories and do not rely on any ethnicity classification. For further information regarding these estimates, please contact us by email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the population of French Settlement by race. It includes the population of French Settlement across racial categories (excluding ethnicity) as identified by the Census Bureau. The dataset can be utilized to understand the population distribution of French Settlement across relevant racial categories.

    Key observations

    The percent distribution of French Settlement population by race (across all racial categories recognized by the U.S. Census Bureau): 98.40% are white, 1.03% are Native Hawaiian and other Pacific Islander and 0.57% are multiracial.

    Figure: French Settlement population by race (https://i.neilsberg.com/ch/french-settlement-la-population-by-race.jpeg)

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.

    Racial categories include:

    • White
    • Black or African American
    • American Indian and Alaska Native
    • Asian
    • Native Hawaiian and Other Pacific Islander
    • Some other race
    • Two or more races (multiracial)

    Variables / Data Columns

    • Race: This column displays the racial categories (excluding ethnicity) for the French Settlement
    • Population: The population of the racial category (excluding ethnicity) in the French Settlement is shown in this column.
    • % of Total Population: This column displays the percentage distribution of each race as a proportion of the total population of French Settlement. Please note that the percentages may not sum to exactly 100% because of rounding.
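    The rounding caveat in the column description can be illustrated with a short sketch. The counts below are invented for illustration and are not taken from the dataset; `percent_shares` is a hypothetical helper.

```python
def percent_shares(counts, digits=2):
    """Convert raw counts to percentages of the total, rounded to `digits`."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("total population is zero")
    return {name: round(100 * value / total, digits) for name, value in counts.items()}


# Invented counts, chosen so that rounding loses a little of the total.
example_counts = {"Group A": 1, "Group B": 1, "Group C": 1}
shares = percent_shares(example_counts)
print(shares)
print(sum(shares.values()))  # slightly below 100 because each share was rounded down
```

    Each share is rounded independently, so the rounded values need not add back up to the total.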

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for a research project, report or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research Team curates, analyzes and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is part of the main dataset for French Settlement Population by Race & Ethnicity.

  17. PAROLE French Corpus

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Nov 15, 2016
    Cite
    ELRA (European Language Resources Association) (2016). PAROLE French Corpus [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-W0020/
    Explore at:
    Dataset updated
    Nov 15, 2016
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Area covered
    French
    Description

    The PAROLE French corpus contains the following data:

    • Miscellaneous: data provided by ELRA (CRATER, MLCC Multilingual and Parallel Corpora): 2,025,964 words
    • Books: CNRS Editions: 3,267,409 words
    • Periodicals: CNRS Info, Hermès: 942,963 words
    • Newspapers: Le Monde, provided by ELRA: 13,856,763 words
    • Total: 20,093,099 words

    1. Newspapers: 14 million words were extracted from complete issues of Le Monde for the years 1987, 1989, 1991, 1993 and 1995. 241,484 words, from 7 issues of Le Monde from September 1987, were extracted and POS-tagged automatically. Each article consists of a complete item (header), according to the directives of the TEI (Text Encoding Initiative). The original Le Monde markup was changed into classification features, so that articles on different topics can be extracted.

    2. Periodicals:

    • HERMES: Issues 15 to 22 were used (134 articles, one Word file per article). The data were converted from Word to RTF (Rich Text Format) and then, via a translator, from RTF to HTML. The conversion from HTML to the PAROLE format was done with flex programs. The result for each article is one "header" file, which contains information on the author and the article id, and one "body" file, which contains the article itself. A Perl script creates the final file from both "header" and "body".
    • CNRS-Infos: The data come from the CNRS-Infos Web site (http://www.cnrs.fr/Cnrspresse/cnrsinfo.html). Each file was processed as follows: cleaning the HTML header, extracting a summary, cleaning the HTML markup, translation to the PAROLE format, and creation of the "header" and "body" files (see Hermès). As with the Hermès files, a Perl script creates the final file from both "header" and "body".

    3. Books: All books were provided on CD-ROM as Xpress files, each book having its own structure. Each book was therefore handled separately. XPress allows conversion to a format called "Xpress markup". This format makes it possible to identify the different structures of the book (if the Xpress file has been laid out well, which is not always the case). The structure of each book had to be worked out in order to write the Perl script that performs the translation to the PAROLE format. Conformance to the PAROLE format was checked with an "nsgmls" tool. The errors found during verification were corrected manually.

    Introduction to the PAROLE project

    The LE-PAROLE project (MLAP/LE2-4017) aims to offer a large-scale harmonised set of "core" corpora and lexica for all European Union languages. Language corpora and lexica were built according to the same design and composition principles, in the period 1996-1998.

    PAROLE Corpora: The harmonisation with respect to corpus composition (selection of corpus texts) was to be achieved by the obligatory application of common parameters for time of production and classification according to publication medium. No texts older than 1970 were allowed. As for publication medium, the corpus had to include specif...

  18. d

    Terminologie des descripteurs des pains français - Dataset - B2FIND

    • b2find.dkrz.de
    Updated Nov 3, 2023
    Cite
    (2023). Terminologie des descripteurs des pains français - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/d2cf994f-58c3-51c6-971a-ebcfa8f831e1
    Dataset updated
    Nov 3, 2023
    Area covered
    French
    Description

    The AsCoPain-T terminology lists and defines the main quality assessment descriptors used by bread-making professionals and those used in scientific studies. It aims to link professional and scientific terminologies, since the language used to denote similar processing operations and observations made on the dough and bread can vary greatly. The AsCoPain-T terminology is the result of successive efforts: expert knowledge collected by the INRA AsCoPain project first resulted in the definition of the relations between bread-making control variables and the different states of the dough and bread. These relations were implemented in the expert system used to predict the state of the dough throughout the kneading process. The terminology collected during this work was then published in the form of a glossary in a document referenced by the FAO (see https://hal.inrae.fr/hal-02823534). The recent adaptation of its content to semantic web standards now allows unforeseen uses, in particular by applications. The terminology is available in French. Its translation into English will soon be under way.

  19. k

    French-GEC-Dataset

    • kaggle.com
    Updated Apr 23, 2023
    Cite
    (2023). French-GEC-Dataset [Dataset]. https://www.kaggle.com/datasets/isakbiderre/french-gec-dataset
    Dataset updated
    Apr 23, 2023
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Area covered
    French
    Description

    Context

    Wikipedia is a free encyclopedia where everyone can contribute and modify, delete, or add text to the articles. Because of this, every day there is newly created text and, most importantly, new corrections made to preexisting sentences. The idea is to find the corrections made to these sentences and create a dataset with X,y sentence pairs.

    The data

    This dataset was created using the Wikipedia edit histories thanks to the Wikipedia dumps available here: https://dumps.wikimedia.org/frwiki/

    The dataset is composed of 45 million X,y sentence pairs extracted from almost the entire French Wikipedia. The CSV files have five columns: X (the source sentence), y (the target sentence), title (the title of the article the sentence came from), timestamps (the two dates when the source sentence and target sentence were created), and comments (the comment of the edit, if specified).
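As an illustration of the column layout listed above, here is a minimal sketch that reads sentence pairs with Python's standard `csv` module. It assumes the CSV header uses exactly the names `X`, `y`, and `title`; the actual header spelling in the files is not confirmed here, and the sample row is invented for the example.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class SentencePair(NamedTuple):
    source: str  # the "X" column: sentence before the edit
    target: str  # the "y" column: sentence after the edit
    title: str   # title of the Wikipedia article


def read_pairs(handle: TextIO) -> Iterator[SentencePair]:
    """Yield one SentencePair per CSV row (column names are assumed)."""
    for row in csv.DictReader(handle):
        yield SentencePair(row["X"], row["y"], row.get("title", ""))


# Invented one-row sample standing in for a real file from the dataset.
sample = io.StringIO(
    "X,y,title,timestamps,comments\n"
    '"Il a manger.","Il a mangé.","Exemple","",""\n'
)
pairs = list(read_pairs(sample))
```

For a real file, the same function can be called on `open(path, newline="", encoding="utf-8")`, where `path` is whichever CSV file was downloaded.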

    There is one major issue with this dataset (if it is used for the GEC task): a large part of the extracted sentence pairs is outside the scope of the GEC task (grammar, typos, and syntax). Many corrections made to sentences on Wikipedia are reformulations, summarizations, or clarifications; a model trained on this dataset therefore learns to reformulate and delete parts of the sentence it was supposed to correct. To solve this problem, I suggest building a classification model using transfer learning to filter out the "bad" sentence pairs.
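The filtering idea above can be illustrated with a much simpler stand-in. The sketch below is not the transfer-learning classifier the description suggests; it is a plain similarity heuristic built on the standard library's `difflib`, keeping only pairs whose source and target are nearly identical. The function name and both thresholds are assumptions chosen for the example, not values from the dataset.

```python
import difflib


def looks_like_small_correction(
    source: str,
    target: str,
    min_ratio: float = 0.9,        # assumed similarity threshold
    max_len_change: float = 0.1,   # assumed relative length-change limit
) -> bool:
    """Heuristically flag pairs that look like local fixes, not rewrites."""
    if source == target:
        return False  # nothing was corrected
    longest = max(len(source), len(target))
    # Large relative length changes suggest deletion or reformulation.
    if abs(len(source) - len(target)) / longest > max_len_change:
        return False
    ratio = difflib.SequenceMatcher(None, source, target).ratio()
    return ratio >= min_ratio


keep = looks_like_small_correction(
    "Il a manger une pomme.", "Il a mangé une pomme."
)
drop = looks_like_small_correction("Il a manger une pomme.", "Il mange.")
```

A heuristic like this could serve as a cheap first pass or as a way to produce weak labels for a classifier; it will still let through short reformulations and reject legitimate longer corrections.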

  20. e

    French Monolingual legal corpus from Official Journal of France

    • data.europa.eu
    • live.european-language-grid.eu
    zip
    Updated Feb 29, 2020
    Cite
    Directorate-General for Communications Networks, Content and Technology (2020). French Monolingual legal corpus from Official Journal of France [Dataset]. https://data.europa.eu/data/datasets/elrc_2501?locale=en
    Dataset updated
    Feb 29, 2020
    Dataset authored and provided by
    Directorate-General for Communications Networks, Content and Technology
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    France, French
    Description

    French Monolingual legal corpus from Official Journal of France as collected from https://www.legifrance.gouv.fr/ web site

    This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) actions SMART 2014/1074 and SMART 2015/1091. For further information on the project: http://lr-coordination.eu.
