Text data mining tools

Here we list some tools that can be useful for text data mining projects. This list is not exhaustive; feel free to contact us if you have additions or experience with any of these tools.

Resources for R

  • An annotated list of packages for TDM compiled by the Libraries of the University of Pennsylvania. It also includes links to documentation, tutorials and help pages.

    Go to the page

  • A continuously updated web book accompanying the tidytext package. The book features step-by-step examples with code, and detailed use cases will help you set up your first text data mining project.

    Go to the page

Resources for Python

This list is based on the Guide to text mining using Python by the Libraries of the University of Pennsylvania. Minimal usage sketches for each library follow the list.

  • "NLTK is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It can help simplify textual data and gain in-depth information from input messages."

    Go to the Google Colab notebook

  • "spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It is designed for production use which helps users to comprehend large volumes of text. It has a wide range of applications in information extraction, natural language understanding, and text pre-processing."

    Go to the Google Colab notebook

  • "scattertext is a free, open-source library in Python. It is designed to create informative visualizations of the input text. As its name suggests, scattertext creates scatterplots for text."

    Go to the Google Colab notebook

  • "TextBlob is a free, open-source library in Python for processing textual data. It is a powerful package that reduces the complexity of the contextual data and derive in-depth information from the text. Like spaCy, its features and capabilities give insights into a text’s grammatical structure that can be particularly helpful in the following fields. "

    Go to the Google Colab notebook
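
To give a first impression of NLTK, here is a minimal sketch of our own: it tokenises an invented example sentence, removes English stopwords and counts word frequencies. The exact resource names passed to nltk.download may vary with the NLTK version.

```python
# Minimal NLTK sketch: tokenise a sentence, drop English stopwords
# and count word frequencies. The sentence is a made-up example.
import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

nltk.download("punkt")      # tokeniser models
nltk.download("stopwords")  # stopword lists

text = "Text data mining helps researchers extract patterns from large text collections."
tokens = word_tokenize(text.lower())

# Keep alphabetic tokens that are not English stopwords
words = [t for t in tokens if t.isalpha() and t not in stopwords.words("english")]

print(FreqDist(words).most_common(5))
```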
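A similar minimal sketch for spaCy, assuming the small English model has been installed with `python -m spacy download en_core_web_sm`; the sentence is again our own example, and the entities found depend on the statistical model.

```python
# Minimal spaCy sketch: tokenisation, part-of-speech tags, lemmas
# and named-entity recognition on a made-up sentence.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Lib4RI supports researchers at four institutes in Switzerland.")

# Token-level information
for token in doc:
    print(token.text, token.pos_, token.lemma_)

# Named entities recognised by the model
for ent in doc.ents:
    print(ent.text, ent.label_)
```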
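For scattertext, a minimal sketch built on a tiny hand-made pandas DataFrame with a category and a text column; a real project would load a proper corpus, so this only illustrates the mechanics of building a corpus and writing the interactive scatterplot to an HTML file.

```python
# Minimal scattertext sketch: build a corpus from a tiny DataFrame
# and write an interactive term scatterplot to an HTML file.
import pandas as pd
import scattertext as st

df = pd.DataFrame({
    "category": ["physics", "physics", "biology", "biology"],
    "text": [
        "Neutrons scatter at the spallation source.",
        "The detector measures photon energies.",
        "Cells divide and proteins fold in the lab.",
        "Gene expression changes under stress.",
    ],
})

corpus = st.CorpusFromPandas(
    df, category_col="category", text_col="text",
    nlp=st.whitespace_nlp_with_sentences,
).build()

html = st.produce_scattertext_explorer(
    corpus,
    category="physics", category_name="Physics", not_category_name="Biology",
    minimum_term_frequency=1,  # keep terms despite the tiny example corpus
)

with open("scattertext_demo.html", "w", encoding="utf-8") as f:
    f.write(html)
```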
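Finally, a minimal TextBlob sketch on an invented sentence; TextBlob builds on NLTK corpora, which can be fetched with `python -m textblob.download_corpora` if they are missing.

```python
# Minimal TextBlob sketch: sentiment, noun phrases and POS tags.
from textblob import TextBlob

blob = TextBlob("Text data mining is a surprisingly enjoyable way to explore large collections.")

print(blob.sentiment)     # polarity and subjectivity scores
print(blob.noun_phrases)  # extracted noun phrases
print(blob.tags)          # (word, POS tag) pairs
```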

Other free tools

  • According to its self-description, BERTopic is "a topic model that leverages clustering techniques and a class-based variation of TF-IDF to generate coherent topic representations". It can also make use of the open-source large language model Llama 2.

    The code is freely available on GitHub, where you will find use cases for scientific and non-scientific text data as well as extensive documentation.

    The code author Maarten Grootendorst has also released KeyBERT, a lighter-weight tool for extracting keywords from text data. A short usage sketch for both tools follows after this list.

  • Voyant Tools is a web-based, open-source tool for analysing full-text corpora. It can process raw text, text stored in files, or text retrieved from URLs.

    The tool supports different visualisations, such as word clouds or correlation networks. A comprehensive description and help can be found here. The code is also available on GitHub, so it can be adapted to specific needs.

    Third-party packages such as Google Analytics are used for analytics, and data is stored for persistent access during and between work sessions. Make sure that the corpus you use may be uploaded to third-party servers.

    Image: Wordcloud created from the Lib4RI webpage
    The implemented "Cirrus" tool allows you to construct and adapt word clouds from a corpus or from a single document within it. For this example, we used the Lib4RI webpage and some of its subpages.
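
Returning to BERTopic and KeyBERT mentioned above, here is a minimal sketch of both. It uses the 20 Newsgroups dataset from scikit-learn purely as convenient example data, and fitting can take a few minutes; plugging in an LLM such as Llama 2 for topic representations is optional and described in the BERTopic documentation.

```python
# Minimal BERTopic and KeyBERT sketch on example data (20 Newsgroups).
from bertopic import BERTopic
from keybert import KeyBERT
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(
    subset="all", remove=("headers", "footers", "quotes")
).data[:2000]  # a subset keeps the example reasonably fast

# Topic modelling: assign each document to a topic
topic_model = BERTopic(min_topic_size=20)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())

# Keyword extraction for a single document
kw_model = KeyBERT()
print(kw_model.extract_keywords(docs[0], keyphrase_ngram_range=(1, 2)))
```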

Do you have additions to this list?

Let us know if there are further resources you'd like to see here.