OCRE is not just a framework contract; it is also a tool that offers researchers the opportunity to receive support for their projects. Two projects led by Belgian universities have recently won funding. One of them is the iCANDID 3.0 SSH FAIR Data Hub project at KU Leuven, led by Leen d’Haenens in collaboration with Roxanne Wyns, co-promotor and technical lead.
Using Google AI cloud services to empower data processing activities in the iCANDID 3.0 SSH FAIR Data Hub research infrastructure project
iCANDID and the use of cloud AI & ML services
The iCANDID 3.0 SSH FAIR Data Hub project focuses on the field of social sciences and (digital) humanities (SSH) and helps researchers collect and analyze large volumes of data. The infrastructure provides FAIR (Findable, Accessible, Interoperable and Reusable) access to various types of data from press media, social media, governmental open data and other sources. Researchers use iCANDID to query, visualize and export the data in a format of their choice for further analysis with tools such as SPSS, Gephi or Sketch Engine. Data extraction, normalization and database development are activities on which researchers typically spend a considerable amount of time. By making the harvested data available on a dedicated platform, iCANDID ensures that individual researchers do not have to repeat this time-consuming process.
In the first phase of the project (2018-2022) we focused on developing a robust and scalable data infrastructure capable of extracting, transforming and loading (ETL) large amounts of data coming from multiple providers in multiple formats and supporting multiple exchange protocols. The data collected in this early stage was fairly homogeneous: 9 million textual records from press databases and social media accounts, mostly in Dutch. In 2022 we received new funding from the Research Foundation Flanders to expand the infrastructure towards a FAIR data hub for social sciences as well as humanities, with the latter specifically interested in data from libraries and archives. As we planned to expand our data collection in volume and in diversity, both in the languages represented and in the formats included (text, image, audio-visual), we wanted to start using AI and machine learning for data pre-processing to make sense of the growing volume of data available through iCANDID. Of specific interest to the project were machine translation, named entity recognition (NER) and data classification, sentiment analysis, and image analysis (including OCR/HTR).
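To make the extract, transform and load step described above more concrete, here is a minimal sketch in Python. It is not the iCANDID implementation: the provider formats, the field names (`headline`, `body`, `message` and so on) and the common schema are all hypothetical and chosen only for illustration. It shows how records arriving in two different formats could be mapped onto one shared structure before being stored.

```python
import csv
import io
import json


def extract_json(raw):
    """Extract: parse a JSON array of records (hypothetical press provider)."""
    return json.loads(raw)


def extract_csv(raw):
    """Extract: parse CSV rows into dictionaries (hypothetical social media export)."""
    return list(csv.DictReader(io.StringIO(raw)))


def transform(record, source, field_map):
    """Transform: rename provider-specific fields to one common schema."""
    normalized = {
        target: str(record.get(original, "")).strip()
        for target, original in field_map.items()
    }
    normalized["source"] = source
    return normalized


def load(records, store):
    """Load: append normalized records to a store (a plain list in this sketch)."""
    store.extend(records)
    return store


# Hypothetical mappings from the common schema to each provider's field names.
PRESS_MAP = {"title": "headline", "text": "body", "language": "lang"}
SOCIAL_MAP = {"title": "user", "text": "message", "language": "language"}


def run_pipeline(press_json, social_csv):
    """Run both providers through extract, transform and load."""
    store = []
    load((transform(r, "press", PRESS_MAP) for r in extract_json(press_json)), store)
    load((transform(r, "social", SOCIAL_MAP) for r in extract_csv(social_csv)), store)
    return store
```

In a real infrastructure the list would be replaced by a database or search index, and each provider would have its own harvesting protocol; the point of the sketch is only that normalization happens once, centrally, rather than in every research project.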
The OCRE program came at the right time and gave us the opportunity to explore the potential of cloud providers such as Google, which have extensive offerings in AI & Machine Learning services.
With over 18 million records and a continuous growth in volume, we needed a scalable solution that would deliver sufficient quality in the standard processing of our datasets. The funding provided will allow us to both explore the potential of cloud services and boost the research potential of the available datasets.
The opportunity of the OCRE mini competition and status of the work
Originally, I thought OCRE mainly focused on cloud storage and compute, services we were not immediately looking for. When I met the Belnet colleagues at the TNC22 conference in Trieste, we started talking about OCRE and the availability of the different types of services in the catalogue, including AI and ML services. Jean-Pierre Aerts informed me about the opportunity of the mini competition organised by GÉANT and encouraged us to apply for funding. After this, things moved quickly and with the support of Sparkle, the OCRE-awarded provider for Google Cloud Platform (GCP), and the Google Belgium team, we were successful in our application. Sparkle, a long-standing official and trusted OCRE supplier, gave us fundamental support in constructing our proposal in a way that best met the tender criteria and in managing all stages of the process. The Google team was instrumental in helping us translate our functional requirements into technical requirements, while also helping us optimise resources to get as much value for money as possible. As a result, we received funding to use Google AI & ML services for iCANDID.
We are currently testing all relevant AI & ML services for the different types of data in iCANDID and, although there is always a learning curve, we find the services easy to use. To improve machine learning results for a selection of dedicated datasets, we are preparing a pilot with Google AutoML, which will allow us to train custom ML models for better OCR results. In the coming months we will also process some larger batches of data with the standard ML models, such as machine translation of tweets from Hungarian politicians and parliamentary data from Sweden. These datasets are used in ongoing research projects involving KU Leuven researchers.
This OCRE project allows us to explore the possibilities, processes, skill level required, and quality and usability of cloud services for social sciences and humanities.
We see it as an opportunity to scale up recurring activities in SSH data preparation with the benefit of access to cloud scalability when we need it. Our ambition is to incorporate AI and machine learning into our automated processes in the iCANDID infrastructure. As such, the OCRE service catalogue appears to provide opportunities for efficient access to cloud services. The contacts and support provided by Belnet also lower the threshold for using these services.
About the team and author
The iCANDID 3.0 SSH FAIR Data Hub research infrastructure project is financed by the Research Foundation Flanders and is led by Prof. Leen d’Haenens of the Institute of Media Studies (KU Leuven). The project includes several other research groups from translation studies, computational linguistics, mass communication, literary theory and cultural studies at KU Leuven. LIBIS acts as technical partner for the development of the data hub and FAIR data access platform. More info: https://icandid.libis.be/
Roxanne Wyns is innovation manager at LIBIS, a digital service provider that is part of KU Leuven Libraries. As innovation manager she specialises in FAIR data infrastructures and works on several domain-specific research infrastructure projects related to FAIR management of data collections. She participates in several Open Science and Research Data Management initiatives in Flanders and Europe and is co-chair of the EOSC-A Long Term Data Preservation Task Force.