Text data sources

Text data is available from several open and subscription-based sources. Here we provide a non-exhaustive list of important sources and indicate how they can be accessed.

Accessing text for data mining

In most cases, open sources allow text data mining of their content, which also makes them useful for first tests. For the subscription-based sources listed here, access for text data mining can be provided via Lib4RI.

Article 24d of the Swiss Copyright Act (CopA) allows the use of works for scientific research, provided they are legally accessible. However, data providers may impose rules on data retrieval in order to protect their servers. If these rules are not followed, access may be blocked for the entire institution! Always make sure that you have thoroughly read and understood the terms and conditions of a data provider before retrieving data.

Regulations frequently state:

  • Systematic downloads can only be carried out via explicitly authorized services, e.g. specific APIs
  • The maximum download volume and the maximum frequency of requests per time interval must not be exceeded
  • Data can only be used for non-commercial research
  • The text data corpus cannot be shared and, in some cases, must be deleted after the project
  • A valid subscription for closed access material is necessary
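
The request-frequency rule in the list above can be enforced in a script with a small throttle. The following Python sketch is only an illustration: the class name Throttle, the helper fetch_all and the interval you pass in are assumptions, not part of any provider's tooling. Always take the actual interval from the provider's terms.

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds pass between two requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds; value comes from the provider's terms
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first one

    def wait(self):
        """Block until the next request is allowed, then record its time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
        return now


def fetch_all(urls, fetch, throttle):
    """Call `fetch(url)` for each URL, one at a time (a single connection)."""
    results = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results
```

Here fetch stands for whatever authorised download function you use (for example a call to a provider's API). Because the requests run strictly one after another, the sketch also respects rules that limit you to a single connection.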

Open sources

  • Description: The arXiv is an archive and distribution server for e-prints in physics, mathematics, computer science, quantitative biology, quantitative finance and statistics. It includes more than 2 million e-prints, with roughly 5,000 new e-prints added every month.

    Access: Full text sources can be retrieved via AWS (PDF and/or (La)TeX source files) or Kaggle (PDF). The use of automated scraping is prohibited.

    Limitations: When using the legacy APIs (including OAI-PMH, RSS, and the arXiv API), make no more than one request every three seconds, and limit requests to a single connection at a time. Acknowledgement is expected!

    Learn more: arXiv Bulk Access Overview

  • Description: Europeana is a digital library with content on scientific and cultural heritage from more than 2,000 European institutions.

    Access: Via Europeana APIs

    Limitations: Acknowledgement of Europeana with official logo (Terms of Use §15).

    Learn more: Terms of Use

  • Description: The HathiTrust collects digitised materials from over 120 academic institutions worldwide. 

    Access: Everybody can search and read open and public domain items. Datasets are available from HathiTrust only by request. The HathiTrust Research Center (HTRC) contains tools for text data mining. However, for non-members only basic services are provided.

    Limitations: Non-commercial research only.

    Learn more: HathiTrust Search & Access and Access & Use Policy

  • Description: The Public Library of Science (PLOS) is a full Open Access publisher with several journals in natural and life sciences.

    Access: Download a full XML dataset of all PLOS articles, excluding figures and supplementary information.

    Limitations: Using other scripts is possible, but is discouraged for bulk downloads.

    Learn more: Text & Data Mining at PLOS

  • Description: PubMed Central (PMC) is a US-based, free full-text archive of biomedical and life sciences journal literature. Several subsets of data are available for download, e.g. the Open Access subset with millions of full-text open access article files. 

    Access: Download only with the authorised tools listed here. The use of automated scraping is prohibited.

    Limitations: Not all articles are available for text mining and other reuse.

    Learn more: PMC Article Datasets
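
Once article XML has been downloaded through an authorised route, the text still has to be extracted from it. The sketch below assumes the files follow the JATS tag set (article-title for the title, p elements inside body for paragraphs), which is our understanding of the PLOS and PMC article XML; check a sample file first. The function name extract_text is hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_text(jats_xml):
    """Return (title, paragraphs) from one JATS-style article XML string.

    Assumption: the root element is <article> with a direct <body> child.
    Paragraphs in captions or table cells inside <body> are included too,
    and nested <p> elements would be reported more than once.
    """
    root = ET.fromstring(jats_xml)

    title_el = root.find(".//article-title")
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""

    paragraphs = []
    body = root.find("body")
    if body is not None:
        for p in body.iter("p"):
            # Join text across inline markup and collapse runs of whitespace.
            text = " ".join("".join(p.itertext()).split())
            if text:
                paragraphs.append(text)
    return title, paragraphs
```

To use it, read one downloaded file into a string and pass it to extract_text. Figures and supplementary information are not covered by this sketch.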

Subscription-based sources

  • Description: Elsevier publishes more than 3,000 scientific journals, specializing in scientific, technical, and medical content.

    Access: Via the Elsevier API. Other forms of bulk download are not permitted!

    Limitations: Non-commercial research only, proprietary notice required. Refer to the Elsevier TDM provision for further details.

    Learn more: Elsevier Policy and Standards on TDM

  • Description: JSTOR is a digital library providing access to articles, books and other resources. Constellate is the associated text analysis platform.

    Access: Through Lib4RI you have basic access, which allows you to create and download datasets of up to 25,000 items. Metadata is provided as a CSV file. Full-text files and metadata files including abstracts can be downloaded as JSONL files. Built-in analysis and visualization are not included in basic access.

    Limitations: Non-commercial research only, acknowledgement required. You may download up to 10 datasets per day. To retrieve closed-access full-text items, you will need to submit a request.

    Learn more: ITHAKA/Constellate Terms and Conditions

  • Description: PressReader is a catalogue of more than 7,000 international newspapers and magazines.

    Access: In 2023, various APIs were released, including catalogue access and natural language processing. They can be accessed through the developer portal. The documentation is not very extensive and access might require more expertise than for other APIs. Feel free to share your experiences with us!

    Limitations: Not documented.

    Learn more: API catalogue

  • Description: Sage Journals publishes more than 1,000 journals in the fields of business, humanities, social sciences, science, technology and medicine.

    Access: Via the publisher website and the Crossref API.

    Limitations: Use only for legitimate academic research and other educational purposes. Respect the limits for download rates.

    Learn more: Text and Data Mining on Sage Journals

  • Description: The Springer Nature portfolio contains around 3,000 journals across many disciplines.

    Access: A special licence is required for TDM. Please contact @email before starting your project.

  • Description: Taylor & Francis publishes around 2,700 scientific journals. Topics include social science, behavioural science, STEM fields and medicine.

    Access: Get in touch with @email, stating your affiliation and briefly describing your project.

    Limitations: Non-commercial research only, corpus must be deleted after finishing the project.

    Learn more: Taylor & Francis Terms and Conditions

  • Description: Wiley journals are mainly in the fields of chemistry, material science, physics and life sciences as well as business and trade.

    Access: Via the Crossref API. Other forms of bulk download are not permitted! An ORCID iD is required.

    Limitations: Respect the limits for download rates. By default, only PDFs can be harvested. Should you need another format, please get in touch with @email.

    Learn more: Text and Data Mining Agreement and help page.
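
For the publishers above that are accessed via the Crossref API, a typical first step is to look up a DOI and collect the full-text links that the publisher has marked for text mining. The sketch below is a minimal illustration using the public Crossref REST API and the Python standard library only. The helper names are hypothetical, and downloading the full text itself may require publisher-specific credentials or tokens that are not shown here.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"


def work_url(doi):
    """Build the Crossref REST API URL for one DOI."""
    return CROSSREF_WORKS + urllib.parse.quote(doi, safe="/")


def tdm_links(message):
    """Return (content_type, url) pairs flagged for text mining.

    `message` is the "message" object of a Crossref work record. Records
    without a "link" list simply yield an empty result.
    """
    return [
        (link.get("content-type", "unspecified"), link["URL"])
        for link in message.get("link", [])
        if link.get("intended-application") == "text-mining" and "URL" in link
    ]


def fetch_work(doi, mailto):
    """Fetch one work record; `mailto` identifies you to Crossref."""
    request = urllib.request.Request(
        work_url(doi),
        headers={"User-Agent": f"tdm-sketch/0.1 (mailto:{mailto})"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["message"]
```

A call such as tdm_links(fetch_work(doi, "you@example.org")) then lists the candidate full-text URLs for one article. Keep to the download rate limits stated by the respective publisher when retrieving them.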

Do you have any questions on data retrieval or terms and conditions of a data source?

Do not hesitate to contact us!