Wednesday, August 1, 2018

Easy and powerful PDF processing

Introduction

Applications usually work with data. Different applications depend on different input formats, but in every case the data is processed either as binary or as text (a particular case of binary data). PDF (Portable Document Format) is a widely used format for official documents, letters, scientific and other articles, and many other purposes. Sometimes a business solution must extract text or other data from a structured document. I discovered that a very powerful solution can be built with the commercial Aspose libraries (https://www.aspose.com/). My solution was implemented and deployed on dev.activedictionary.ru.

Application issue

The main task is to get the text out of any PDF document. The application under development is a web application, but the PDF processing can be built with the libraries licensed for standalone applications, which is cheaper and does not cause much trouble. Prices for Aspose solutions are listed here (https://purchase.aspose.com/pricing/pdf/). I used the Java one (small business license type) for $1000 - https://purchase.aspose.com/pricing/pdf/java. I have not mentioned it yet, but the application that needs PDF processing is a Python application built with the Django framework.

Solution with Aspose libraries

Getting text from a PDF is not easy when you work with well-structured documents. The Aspose libraries know how to process structured PDF documents: they identify the paragraphs and images on a page. I don't know every particular case, but in about 80% of cases the documents we process consist of 2-20 pages divided into titled blocks, possibly with images or tables. Documents may also have cover and back pages, an index and so on. The main features of the documents we process:
- they mostly contain text (60-80% of the document);
- the text may be in different languages, possibly with different character encodings (ANSI, KOI8 family, Unicode);
- documents may be either structured or unstructured (because the target system allows users to upload any type of document).
The main purpose of working with PDF is to extract text for further manual text analysis. The extraction must therefore be accurate (we can't afford to lose blocks, paragraphs, or even sentences or words), because the application under development is a learning management system (LMS).
We chose the Aspose libraries at the suggestion of one of our developers. As for me, PDF processing is important but not the only feature of our system, and it must work reliably, without failures. We checked the Aspose libraries with documents written in several languages (ru, en, de) with various encodings, and after the tests we concluded that the library is suitable for our project: it works fairly reliably on most documents (we tried approximately 800-1000 different docs: articles, books, standards).
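
The accuracy requirement can be checked mechanically during such a test run. Below is a minimal sketch in Python, assuming the extracted text is already available as a string and that a few reference phrases per document are known in advance. The names normalize and missing_phrases and the sample data are hypothetical and are not part of our project:

```python
import re


def normalize(text):
    """Collapse whitespace runs and lowercase the text, so that line breaks
    introduced by the PDF layout do not cause false alarms."""
    return re.sub(r'\s+', ' ', text).strip().lower()


def missing_phrases(extracted_text, expected_phrases):
    """Return the expected phrases that do not occur in the extracted text."""
    haystack = normalize(extracted_text)
    return [p for p in expected_phrases if normalize(p) not in haystack]


# Hypothetical sample: text as it might come out of an extractor.
sample = u"Introduction\nApplications usually work\nwith data.\nТекст на русском."
lost = missing_phrases(sample, [u"work with data", u"текст на русском", u"Conclusion"])
```

An empty result means every reference phrase survived extraction; here only the phrase that is absent from the sample is reported.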

For the business process described above we built a Java console application that takes the input PDF file via a command-line argument with the --source key. The result of processing, the text, is written to stdout via System.out.printf("%s\n", text);  Processing consists of the following steps:

1) Instantiation of a License object with the license file:
    License license = new License();
    license.setLicense(LICENSE_FILENAME);

2) Creation of the container object, a Document:
    document = new Document(commandLineArgs.getSource());

3) Content processing: images are absorbed first (using ImagePlacementAbsorber), then paragraphs (using ParagraphAbsorber), and a list of Content objects is returned (full implementation details are not listed until we open the project).

From the Python web application it looks like this:

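import os
import subprocess

extractor_dir = os.path.join(EXTERNAL_SOURCE_DIR, 'pdf')
source_param = u'--source={0}'.format(source_path)
args = ['java', '-jar', 'PdfContentExtractor-0.3.jar', source_param]
proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, cwd=extractor_dir)
# communicate() waits for the process to exit, so no separate wait() is needed
out, error = proc.communicate()
# a negative code means the process was killed by a signal, so check != 0
if proc.returncode != 0:
    raise Exception('PDF extractor failed with code {0}: {1}'.format(proc.returncode, error))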
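
The call above can be wrapped in a small helper that reports why the extractor failed. This is a minimal sketch, assuming Python 3. The names run_extractor and ExtractorError are hypothetical, and the demo uses the Python interpreter itself as a stand-in command so that it runs without Java or the JAR:

```python
import subprocess
import sys


class ExtractorError(RuntimeError):
    """Raised when the external command exits with a non-zero code."""


def run_extractor(args, cwd=None):
    """Run an external command and return its stdout as bytes.

    Raises ExtractorError with the exit code and stderr on failure."""
    proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, cwd=cwd)
    out, err = proc.communicate()  # waits for exit and drains both pipes
    if proc.returncode != 0:
        raise ExtractorError('exit code {0}: {1}'.format(
            proc.returncode, err.decode('utf-8', 'replace').strip()))
    return out


# Stand-in commands (not the real extractor): one succeeds, one fails.
ok = run_extractor([sys.executable, '-c', 'print("hello")'])
try:
    run_extractor([sys.executable, '-c', 'import sys; sys.exit(3)'])
    failed = False
except ExtractorError as exc:
    failed = True
```

With the real extractor, the args list from the snippet above would be passed instead of the stand-in commands, with cwd set to extractor_dir.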

The content (a composition of images and text, i.e. paragraphs) is stored in the out variable in Base64-encoded form (the extraction result is encoded before output in the Java console application). Further processing is as follows:

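import base64
import io

# out holds the raw bytes read from the extractor's stdout
buffer = io.BytesIO(out)
try:
    # the first line is the Base64-encoded text
    self.__text = base64.b64decode(buffer.readline())
    self.__images = []
    # then pairs of lines follow: image name, Base64-encoded image data
    while True:
        name = buffer.readline().rstrip()
        if not name:
            break
        data = base64.b64decode(buffer.readline())
        self.__images.append((name, data))
finally:
    buffer.close()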
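
The line-oriented format can be exercised end to end without the Java side. This is a minimal sketch, assuming Python 3, UTF-8 text and the layout the parsing code implies: the first line is the Base64 text, followed by pairs of a plain name line and a Base64 data line. Both functions are hypothetical: encode_payload only imitates what the Java application is described to print, and parse_payload is a standalone variant of the parsing shown above:

```python
import base64
import io


def encode_payload(text, images):
    """Imitate the extractor output: Base64 text line, then name/data line pairs."""
    lines = [base64.b64encode(text.encode('utf-8'))]  # UTF-8 is an assumption
    for name, data in images:
        lines.append(name.encode('utf-8'))
        lines.append(base64.b64encode(data))
    return b'\n'.join(lines) + b'\n'


def parse_payload(out):
    """Split extractor output into (text, [(name, data), ...])."""
    with io.BytesIO(out) as buffer:
        text = base64.b64decode(buffer.readline()).decode('utf-8')
        images = []
        while True:
            name = buffer.readline().rstrip()
            if not name:  # empty line or end of input
                break
            data = base64.b64decode(buffer.readline())
            images.append((name.decode('utf-8'), data))
    return text, images


payload = encode_payload(u'Привет, PDF', [(u'page1.png', b'\x89PNG\r\n')])
text, images = parse_payload(payload)
```

Encoding each field on its own line keeps binary image data and multi-line text from breaking the line-based framing, which is why the round trip returns the original values unchanged.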
 

A bitter taste in a honey pot (disadvantages)

For me personally there is one disadvantage: the price. I think $1000 is a significant price for a small Russian team, especially if you are building a non-profit open solution. In my humble opinion, for such cases the price should be halved.

Conclusion

I must say that the Aspose libraries are great for processing PDF documents; it was easy to build a component of our application on top of them. Don't forget to comment on or like this article about my experience of working with PDFs.


