

Related products



At your request, we will design a solution tailored to your needs.


Discover the potential of this application with the help of Clementine staff!


Clementine experts can even provide on-site training to help you make the most of your applications.


Clementine events present the latest trends and developments in data science and natural language processing.


Document processing and network analysis

Development of automated information retrieval and processing to reduce the burden on human resources and automate existing processes.

Companies and law enforcement agencies store large amounts of data as text in documents, contracts, reports, and records. There are two ways to extract and structure this information: manually, using human resources, or automatically, using machines. Both are time- and resource-intensive, but machine-based information retrieval makes it possible to build an automated system that extracts data from new documents in a standardised and objective way.

Manual processing is extremely time-consuming. Because the documents are mostly non-standard in format and contain large blocks of text, we assume that processing one document takes half an hour. At a continuous 10 hours per day, that is 20 documents per processor per day, so processing 200,000 documents would take at least 10,000 man-days. A task of this size therefore requires many people, which makes it impossible to populate the database consistently: there will always be aspects and data that each processor interprets and records differently. In addition, human resources are needed again every time new documents arrive.
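The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the function name and parameters are illustrative, and the default values are the assumptions stated in the text (30 minutes per document, 10 working hours per day).

```python
def manual_processing_estimate(documents, minutes_per_document=30, hours_per_day=10):
    """Return (documents per processor per day, total man-days)."""
    docs_per_day = hours_per_day * 60 // minutes_per_document
    man_days = documents / docs_per_day
    return docs_per_day, man_days


docs_per_day, man_days = manual_processing_estimate(200_000)
print(f"{docs_per_day} documents per processor per day")
print(f"{man_days:,.0f} man-days for 200,000 documents")
```

Under these assumptions the result is a lower bound, since it leaves out breaks, quality checks, and rework.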

The first phase of the machine learning approach is training, which is also time-consuming. During training, the system must be taught how to search for, identify, and extract information and data. Once training is complete, the finished system only requires maintenance, which is much less resource-intensive.

Data processing
Our solution performs the extraction of the following entities:

  • people: names of the following origins: Hungarian, Slavic, Arabic, Russian/Ukrainian, Romanian, Georgian, English
  • names of companies and organizations, based on the company types used in 10 European countries: Austria, Slovakia, Ukraine, Romania, Serbia, Croatia, Slovenia, Germany, Russia, Poland
  • e-mail addresses, IP addresses, URLs
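To illustrate the simplest of these entity types, the sketch below pulls e-mail addresses, IPv4 addresses, and URLs out of free text using Python's standard library. It is not how the solution itself works (that is built on IBM SPSS Modeler Premium); the patterns and the function name `extract_technical_entities` are simplified assumptions made for illustration, and names of people and companies require far richer linguistic resources than regular expressions.

```python
import ipaddress
import re

# Deliberately simple patterns; real-world text needs more careful rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s<>\"']+")
IPV4_CANDIDATE_RE = re.compile(r"(?<![\d.])\d{1,3}(?:\.\d{1,3}){3}(?!\d|\.\d)")


def extract_technical_entities(text):
    """Return e-mail addresses, IPv4 addresses, and URLs found in text."""
    # Strip punctuation that usually belongs to the sentence, not the URL.
    urls = [url.rstrip(".,;:)") for url in URL_RE.findall(text)]
    emails = EMAIL_RE.findall(text)
    ips = []
    for candidate in IPV4_CANDIDATE_RE.findall(text):
        try:
            ipaddress.IPv4Address(candidate)  # rejects values such as 999.1.1.1
        except ValueError:
            continue
        ips.append(candidate)
    return {"email": emails, "ip": ips, "url": urls}


sample = (
    "Contact j.doe@example.org from 192.168.0.12. "
    "See https://example.org/report. Bad: 999.1.1.1"
)
found = extract_technical_entities(sample)
print(found)
```

Validating IP candidates with the `ipaddress` module keeps the regular expression short while still filtering out number sequences that merely look like addresses.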

At the end of the data processing phase, a data structure is created that can then be loaded into an "Internal Knowledgebase" (case-related entities and their attributes, relationships, related reports, etc.). The data processing component is based on the predictive and text analytics toolset of IBM SPSS Modeler Premium.

Internal knowledgebase (data storage)
The "Internal Knowledgebase" has an Entity repository, a continuously expanding structured database that holds the entities extracted from documents together with their attributes and relationships.
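As a rough mental model of such a repository, the Python sketch below stores entities, their attributes, their source documents, and the relationships between them in memory. It does not reflect the actual i2 iBase schema; the class names, fields, and sample data are hypothetical and chosen only to show how a repository can keep growing as new documents mention known or new entities.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    entity_type: str                      # e.g. "person", "company"
    name: str
    attributes: dict = field(default_factory=dict)
    source_documents: set = field(default_factory=set)


class EntityRepository:
    def __init__(self):
        self.entities = {}                # (type, normalised name) -> Entity
        self.relationships = set()        # (source key, relation, target key)

    def add_entity(self, entity_type, name, document_id, **attributes):
        """Insert a new entity or enrich an existing one; return its key."""
        key = (entity_type, name.casefold())
        entity = self.entities.setdefault(key, Entity(entity_type, name))
        entity.attributes.update(attributes)
        entity.source_documents.add(document_id)
        return key

    def relate(self, source_key, relation, target_key):
        self.relationships.add((source_key, relation, target_key))

    def neighbours(self, key):
        """Return (relation, other entity key) pairs linked to key."""
        linked = []
        for source, relation, target in self.relationships:
            if source == key:
                linked.append((relation, target))
            elif target == key:
                linked.append((relation, source))
        return sorted(linked)


repo = EntityRepository()
person = repo.add_entity("person", "Jane Doe", "report-001", role="director")
company = repo.add_entity("company", "Example Kft.", "report-001")
repo.relate(person, "works_for", company)
# A later document mentions the same person: the existing record is extended.
repo.add_entity("person", "jane doe", "report-002")
print(repo.neighbours(person))
```

Matching on a case-folded name is a stand-in for real entity resolution, which would also have to handle spelling variants and transliterations.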

i2 iBase, which is part of the solution, is an easy-to-use intelligence database application that also gives analysts a search interface to the Internal Knowledgebase.

A visualisation, analysis, and search interface - i2 Analyst's Notebook - is connected to the Internal Knowledgebase, which enables visual analysis of the relationships and connections between entities.

What industries can it be used in?

Financial services
Governmental services