RapTech – Research and selection of an optimal artificial intelligence algorithm to build a tool that automatically generates a state of the art report

Project co-financed from the state budget through a targeted grant from the President of the Łukasiewicz Centre
Acronym: RapTech
Type: R&D
Grant: PLN 100 000
Total value: PLN 100 000
Start date: 01.01.2022
Project completion date: 31.03.2022
Patent searches leading to the determination of the state of the art are a complex, iterative process: the results are observed and analysed, and the range of concepts (phrases, keywords) and the form of the queries are continually adjusted. In this process, the expert’s experience is extremely important, in particular the ability to construct concepts within the field and to build a search strategy in patent databases while taking into account the limitations of their search engines.
The aim of the project was to develop and validate a proof of concept for the feasibility of building a system that prepares state of the art reports from input data (a patent application, in particular its claims section) and source data (collections of patent literature). The AI/NLP algorithms built and trained as part of the project were to identify a set of key phrases (keywords) for a dozen documents selected by the Polish Patent Office (PPO). This set was then compared with a set of concepts prepared in parallel by PPO experts.
For the purposes of the project, the PPO made available more than 150,000 documents (patent descriptions) from 1924 to 2019 in the form of PDF and XML files. These documents were loaded into a specially prepared database using algorithms developed by the project team to improve the quality of the document content. The database prepared in this way was the basis for training the selected algorithms to extract keywords from ten patent descriptions chosen by PPO experts.
A number of algorithms for keyword (phrase) extraction were developed and tested. The following approaches were selected and evaluated:
- YAKE algorithm – an unsupervised approach based on statistical features extracted from the text itself.
- KeyBERT algorithm using the BERT model (a pre-trained HerBERT allegro/herbert-large-cased model was used) – a neural model built on the Transformer architecture, whose attention mechanism learns contextual relationships between words (or sub-words) in the text.
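To illustrate the first, statistical family of methods: the sketch below is not the YAKE library itself, but a deliberately simplified, dependency-free toy that combines two features of the same kind YAKE uses (term frequency and position of first occurrence) to score single-word candidates. All function names and the stop-word list are illustrative.

```python
import re
from collections import Counter

def toy_keyword_scores(text, top=5):
    """Toy unsupervised keyword scoring in the spirit of YAKE:
    combine term frequency with the position of first occurrence.
    Frequent terms that appear early in the text rank highest."""
    words = re.findall(r"[a-zA-Z]{3,}", text.lower())
    stop = {"the", "and", "for", "with", "that", "are", "this", "over", "each"}
    words = [w for w in words if w not in stop]
    freq = Counter(words)
    first_pos = {}
    for i, w in enumerate(words):
        first_pos.setdefault(w, i)     # remember only the first occurrence
    # Lower score = more "key": early, frequent words win.
    scores = {w: (1 + first_pos[w] / len(words)) / freq[w] for w in freq}
    return sorted(scores, key=scores.get)[:top]

doc = ("Patent searches determine the state of the art. "
       "A patent search iterates over queries and keywords, "
       "and each patent query is refined by an expert.")
print(toy_keyword_scores(doc))  # "patent" ranks first (frequent and early)
```

The real YAKE additionally models casing, term dispersion across sentences, and relatedness to context, and scores multi-word phrases; the toy above only conveys the core idea that no training data is needed.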
In addition, comparative tests were performed on an English-language dataset using suitably adapted methods:
- YAKE algorithm
- KeyBERT algorithm (a pre-trained xlm-r-bert-base-nli-stsb-mean-tokens model was used)
- Combined spaCy + BERT embedding model – an algorithm similar in idea to KeyBERT: nouns and noun phrases are first extracted with the spaCy library (a step not available for Polish), and the candidate embeddings are then scored against the document embedding produced by a BERT Transformer encoder using cosine similarity.
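The scoring step shared by KeyBERT and the spaCy + BERT pipeline can be sketched as follows. The vectors here are hypothetical 3-dimensional stand-ins for real BERT embeddings (which have hundreds of dimensions), and the function names are our own; only the cosine-similarity ranking itself reflects the described method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(doc_vec, candidates):
    """Rank candidate phrases by cosine similarity of their embedding
    to the whole-document embedding, most similar first."""
    return sorted(candidates,
                  key=lambda c: cosine(candidates[c], doc_vec),
                  reverse=True)

# Hypothetical toy embeddings for illustration only.
doc_vec = [0.9, 0.1, 0.2]
candidates = {
    "patent claim": [0.8, 0.2, 0.1],    # close to the document vector
    "coffee machine": [0.1, 0.9, 0.3],  # unrelated topic
}
print(rank_candidates(doc_vec, candidates))  # "patent claim" comes first
```

The design rationale is that a phrase whose embedding points in nearly the same direction as the document embedding is likely to summarise the document, regardless of vector magnitude, which is why cosine similarity is preferred over raw dot products here.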
As part of the task, the RoBERTa model (the pre-trained clarin/roberta kgr 10 model) was also further trained with the MLM (Masked Language Modeling) objective on all the documents collected in the database.
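A minimal sketch of the input side of MLM, assuming the standard recipe of masking a random fraction (typically 15%) of tokens; the model is then trained to predict the original token at each masked position. Real MLM training (e.g. in the Hugging Face ecosystem) also sometimes substitutes a random word or leaves the token unchanged, which this stdlib-only toy omits.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Prepare one MLM training example: randomly replace a fraction
    of tokens with [MASK] and record the original token as the label
    at each masked position (None means the position is not scored)."""
    rng = random.Random(seed)  # fixed seed for a reproducible example
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)      # target the model must recover
        else:
            masked.append(tok)
            labels.append(None)     # position excluded from the loss
    return masked, labels

tokens = "the claimed device comprises a rotary encoder".split()
masked, labels = mask_tokens(tokens)
print(masked)
```

Continuing this self-supervised pre-training on the collected patent corpus is what adapts a general-purpose language model to patent vocabulary without requiring any manually labelled data.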
The conclusions of the work carried out are as follows:
- In the authors’ opinion, the results obtained and the activities carried out both justify and call for further work within the project, leading to a system that assists the expert in developing the state of the art.
- The type of machine learning that achieves the best results, in terms of maximising the correctness of document retrieval, is learning supervised by an expert.
- When creating a search strategy, AI algorithms can currently be used only to generate candidate sets of keywords, which must then be verified by an expert.
- Methods still need to be developed for effectively presenting AI-generated keyword suggestions to the expert and for effective expert-supervised training of the system.
