

Text data mining (TDM) can be a powerful tool for efficiently extracting hidden patterns from large volumes of unstructured text. Learn more here about how to start your TDM project and how to avoid legal pitfalls.
Please note: access to licensed electronic resources (such as databases, e-journals or e-books) for TDM is governed by the conditions of use contained in license agreements between the library Lib4RI and the respective publishers. Articles with a CC-BY license can be used for TDM projects; check our FAQs and the respective publisher's guidelines for more information.
If you are planning to use licensed electronic resources for a TDM project, please contact us at @email and we will help you clarify the conditions.
In some cases, the information you need is contained in a set of full-text documents. For many analyses, however, the necessary information is accessible through the bibliographic data of a set of documents. For instance, titles and abstracts of publications often contain sufficient information for topic and trend analysis. Bibliometric data can be accessed with considerably less effort than full-text corpora. Refer to our bibliometrics page to learn more about sources and analyses.
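As a rough illustration of trend analysis on bibliographic data alone, the following Python sketch computes, per publication year, the share of titles that mention a given term. The records, the field layout and the function name `term_share_by_year` are invented for this example and do not reflect any particular database or Lib4RI service.

```python
from collections import defaultdict

# Hypothetical bibliographic records: (publication year, title).
RECORDS = [
    (2019, "Microplastics in alpine lakes"),
    (2019, "Groundwater modelling with neural networks"),
    (2020, "Sources of microplastics in urban runoff"),
    (2020, "Microplastics and sediment transport"),
]


def term_share_by_year(records, term):
    """Return {year: fraction of titles containing `term`} (case-insensitive)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    needle = term.lower()
    for year, title in records:
        totals[year] += 1
        if needle in title.lower():
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


shares = term_share_by_year(RECORDS, "microplastics")
```

A plain substring match like this is only a starting point; abstracts, keyword fields or a proper tokeniser would usually give a more reliable signal.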
To conduct a TDM project, you need a textual dataset, or corpus, and tools to transform and analyse the data.
Starting your project
Almost any written resource can be used to compile a corpus. This includes scientific publications as well as newspaper articles or web posts. In most cases, data can be accessed either via APIs or as a snapshot.
From a computational perspective, text data are inherently unstructured. Therefore, a corpus must be pre-processed before conducting a computational analysis. Depending on your corpus, this can include:
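For illustration, here is a minimal Python sketch of three steps that are commonly part of such pre-processing: Unicode normalisation with lowercasing, tokenisation, and stop-word removal. The choice of steps, the tiny stop-word list and the function name `preprocess` are assumptions made for this example, not a prescribed workflow.

```python
import re
import unicodedata

# Tiny illustrative stop-word list (an assumption for this sketch);
# real projects usually rely on a larger, language-specific list.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def preprocess(text):
    """Normalise Unicode, lowercase, tokenise, and drop stop words."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Tokens are runs of letters and digits (Unicode-aware, no underscores).
    tokens = re.findall(r"[^\W_]+", text)
    return [t for t in tokens if t not in STOP_WORDS]


tokens = preprocess("The Fate of Microplastics in Alpine Lakes!")
```

Which steps are appropriate depends on the corpus and the planned analysis; stop-word removal, for example, can be harmful when phrases or syntax matter.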
After the corpus is transformed into a machine-interpretable dataset, it can be analysed. Models use techniques from computational linguistics, natural language processing, machine learning and statistics. Possible analyses are:
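One possible analysis, shown here as a sketch rather than a recommendation, is TF-IDF term weighting, which highlights terms that are frequent in one document but rare across the corpus. The example corpus and the function name `tf_idf` are hypothetical, and the formula used (relative term frequency times the natural log of N / document frequency) is only one of several common variants.

```python
import math
from collections import Counter

# Hypothetical, already pre-processed corpus: one token list per document.
CORPUS = [
    ["microplastics", "alpine", "lakes"],
    ["microplastics", "urban", "runoff"],
    ["groundwater", "neural", "networks"],
]


def tf_idf(corpus):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    n_docs = len(corpus)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


scores = tf_idf(CORPUS)
```

In the first document, "alpine" receives a higher score than "microplastics" because "microplastics" also occurs in the second document. For larger corpora, an established library implementation is usually preferable to hand-written code like this.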
Further resources