VALA2022 Lightning Talk Musgrave 1

The Australian text analytics platform

VALA2022 Lightning Talk

Simon Musgrave
  • Senior Project Officer
  • University of Queensland


Abstract

Students and scholars in many fields work with text data, and it is increasingly easy to assemble large collections of such data. A variety of tools are available to work with text, both online (e.g. Voyant Tools) and as off-the-shelf packages (e.g. AntConc). At the other end of the scale, individuals with relevant skills can hand-craft their own code for specialised tasks. Between these two possibilities there is a space for tools that are more powerful than the ready-made options at one end of the continuum but more general than the custom code at the other. The Australian Text Analytics Platform (ATAP) aims to fill that space.

The ATAP project commenced in June 2021 and is developing an integrated, notebook-based platform for processing and mining text data. Notebook documents (or “notebooks”, all lower case) are produced by the Jupyter Notebook App and contain both computer code (e.g. Python) and rich text elements (paragraphs, equations, figures, links, etc.). Notebook documents are both human-readable documents containing analysis, description and results such as figures and tables, and executable documents which can be run to perform data analysis. Online training modules in text analytics will be provided, and the notebooks platform will be made accessible through a web-based interface. ATAP will bring together users and providers of text analytics in an integrated, collaborative environment which emphasises principles of open access, replicability and transparency.
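
To make the idea of an executable notebook cell concrete, here is a minimal sketch in Python of the kind of code a text analytics notebook might contain. It is an illustration only and is not taken from ATAP: the function name `word_frequencies`, the simple tokenising pattern and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=2):
    """Return the top_n most frequent word tokens in text.

    Hypothetical helper for illustration; not part of ATAP.
    """
    # Lower-case the text and keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Counter tallies each token; most_common returns (token, count) pairs.
    return Counter(tokens).most_common(top_n)


# Made-up sample text, standing in for a document loaded in an earlier cell.
sample = "The cat sat on the mat. The mat was flat."
print(word_frequencies(sample))  # [('the', 3), ('mat', 2)]
```

In a notebook, a cell like this would typically sit between rich text cells that explain what the code does and how to read its output, so a reader can rerun it or change `top_n` and the sample text to see how the result changes.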

The primary audience for the platform is Australian researchers who use text data in their work, but it will be accessible to other potential users, including those in the GLAM sector. Most research libraries now offer information on computational tools for working with text, and ATAP will be an important additional resource in this area. Written material also makes up a significant part of cultural heritage, and ATAP will make many techniques for working with such data more accessible, including tools for extracting and classifying important social and cultural information from those texts. Another aim of ATAP is to provide an environment where users can enhance their technical skills. Users will progress by learning to understand code that is presented in notebooks, then moving on to modifying code chunks and even writing code from scratch, all tailored to the needs of those who work with (or are just fascinated by) text.

Biography

Simon Musgrave was a member of the linguistics program at Monash University from 2003 until 2020. His research interests included the use of computational tools in linguistic research and the relationship between linguistics and digital humanities. He was involved in the Australian National Corpus project, an important piece of digital research infrastructure, and has been a member of the executive of the Australasian Association for Digital Humanities since 2015. Simon is currently part of the team delivering various language-related infrastructures, including the Australian Text Analytics Platform and the Language Data Commons of Australia.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License