Smalot\Pdf Parser Library for Joomla

PDF parser library that can read and extract information from PDF files. The library is wrapped for Joomla 3 and Joomla 4.

Description

Use this library to read PDF files in Joomla.

The following code example works for Joomla 3 and Joomla 4.

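<?php
defined('_JEXEC') or die('Restricted access');

use \Smalot\PdfParser\Parser;

// Register the library namespace once, with the path that matches the Joomla version
if (version_compare(JVERSION, '4.0', '>=')) {
    // For Joomla 4
    JLoader::registerNamespace('Smalot', JPATH_LIBRARIES . '/Smalot');
} else {
    // For Joomla 3
    JLoader::registerNamespace('Smalot', JPATH_LIBRARIES);
}

// Parse a PDF file stored under the site root and read its metadata
$file_name     = 'images/path_to_file.pdf';
$parser        = new Parser();
$pdf           = $parser->parseFile(JPATH_SITE . '/' . $file_name);
$pdf_meta_data = $pdf->getDetails();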

Usage

First create a parser object and point it to a file.

$parser = new \Smalot\PdfParser\Parser();

$pdf = $parser->parseFile('document.pdf');
// ... or ...
$pdf = $parser->parseContent(file_get_contents('document.pdf'));

Extract text

A common scenario is to extract text.

$text = $pdf->getText();

// or extract the text of a specific page (in this case the first page)
$text = $pdf->getPages()[0]->getText();
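
When the text of every page is needed separately, the per-page call above can be wrapped in a loop. The following is a minimal sketch, not part of the library: `collectPageTexts` is a hypothetical helper name, and it only assumes that each page object exposes `getText()` as shown above.

```php
<?php
/**
 * Hypothetical helper (not part of the library): collect the text of each
 * page, keyed by 1-based page number.
 *
 * $pages is expected to be what $pdf->getPages() returns, i.e. a list of
 * objects that expose getText().
 */
function collectPageTexts(iterable $pages): array
{
    $texts = [];
    $pageNumber = 1;

    foreach ($pages as $page) {
        // Trim surrounding whitespace so empty pages become empty strings
        $texts[$pageNumber++] = trim($page->getText());
    }

    return $texts;
}

// Usage with a parsed document (requires the library to be loaded):
// $texts = collectPageTexts($pdf->getPages());
// echo $texts[1]; // text of the first page
```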

Extract text positions

You can extract the transformation matrix (indexes 0-3) and the x,y position (indexes 4, 5) of each text object.

$data = $pdf->getPages()[0]->getDataTm();

Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => 0.999429
                    [1] => 0
                    [2] => 0
                    [3] => 1
                    [4] => 201.96
                    [5] => 720.68
                )

            [1] => Document title
        )

    [1] => Array
        (
            [0] => Array
                (
                    [0] => 0.999402
                    [1] => 0
                    [2] => 0
                    [3] => 1
                    [4] => 70.8
                    [5] => 673.64
                )

            [1] => Calibri : Lorem ipsum dolor sit amet, consectetur a
        )
)
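
The positions in this array can be used to put text fragments into reading order. The sketch below is one possible approach, not a library feature: `sortByReadingOrder` is a hypothetical helper, and it assumes the default PDF coordinate system (origin at the bottom left, so a larger y value is higher on the page) and that fragments on the same line share exactly the same y value.

```php
<?php
/**
 * Hypothetical helper (not part of the library): sort dataTm entries
 * top-to-bottom, then left-to-right.
 *
 * Each entry has the shape shown above: index 0 holds the matrix with
 * x at index 4 and y at index 5, index 1 holds the text.
 */
function sortByReadingOrder(array $dataTm): array
{
    usort($dataTm, function (array $a, array $b): int {
        $ax = (float) $a[0][4];
        $ay = (float) $a[0][5];
        $bx = (float) $b[0][4];
        $by = (float) $b[0][5];

        // Assumption: larger y is closer to the top of the page.
        // Compare y descending first, then x ascending.
        return ($by <=> $ay) ?: ($ax <=> $bx);
    });

    return $dataTm;
}

// Usage with a parsed document (requires the library to be loaded):
// $ordered = sortByReadingOrder($pdf->getPages()[0]->getDataTm());
```

Real documents often need a small tolerance when comparing y values, since fragments on one visual line may differ by a fraction of a unit.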

When enabled via the Config setting (`Config::setDataTmFontInfoHasToBeIncluded(true)`), the font identifier (index 2) and the font size (index 3) are added to each dataTm entry.

// create config
$config = new Smalot\PdfParser\Config();
$config->setDataTmFontInfoHasToBeIncluded(true);

// use config and parse file
$parser = new Smalot\PdfParser\Parser([], $config);
$pdf = $parser->parseFile('document.pdf');

$data = $pdf->getPages()[0]->getDataTm();

Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => 0.999429
                    [1] => 0
                    [2] => 0
                    [3] => 1
                    [4] => 201.96
                    [5] => 720.68
                )

            [1] => Document title
            [2] => R7
            [3] => 27.96
        )

    [1] => Array
        (
            [0] => Array
                (
                    [0] => 0.999402
                    [1] => 0
                    [2] => 0
                    [3] => 1
                    [4] => 70.8
                    [5] => 673.64
                )

            [1] => Calibri : Lorem ipsum dolor sit amet, consectetur a
            [2] => R9
            [3] => 11.04
        )
)

Text width should be calculated on text taken from dataTm to make sure all character widths are available. The next example uses the data from above.

$fonts = $pdf->getFonts();
$font_id = $data[0][2]; //R7
$font = $fonts[$font_id];
$text = $data[0][1];
// $missing is passed by reference and receives the characters for which no width is available
$width = $font->calculateTextWidth($text, $missing);
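
The font identifier and font size in each entry also make it possible to group fragments by typeface, for example to separate headings from body text. The sketch below uses a hypothetical helper, `groupTextByFont`, which is not part of the library and works only on the array shape shown above with font info enabled.

```php
<?php
/**
 * Hypothetical helper (not part of the library): group dataTm text
 * fragments by "fontId@fontSize".
 *
 * Entries without font info (indexes 2 and 3) are skipped, which is the
 * case when setDataTmFontInfoHasToBeIncluded(true) was not set.
 */
function groupTextByFont(array $dataTm): array
{
    $groups = [];

    foreach ($dataTm as $entry) {
        if (!isset($entry[2], $entry[3])) {
            continue; // no font info in this entry
        }

        $key = $entry[2] . '@' . $entry[3];
        $groups[$key][] = $entry[1];
    }

    return $groups;
}

// Usage with the $data array from above:
// $groups = groupTextByFont($data);
// print_r(array_keys($groups));
```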

Extract metadata

You can also extract metadata. The available data varies from PDF to PDF.

$metaData = $pdf->getDetails();

Array
(
    [Producer] => Adobe Acrobat
    [CreatedOn] => 2022-01-28T16:36:11+00:00
    [Pages] => 35
)
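
Because the available keys vary, code that reads the metadata should not assume a key exists. A minimal sketch of a defensive read follows; `metaCreatedOn` is a hypothetical helper name, and it assumes that `CreatedOn`, when present, is a date string like the one in the output above.

```php
<?php
/**
 * Hypothetical helper (not part of the library): read the CreatedOn entry
 * from the array returned by getDetails() as a date object.
 *
 * Returns null when the key is missing or the value cannot be parsed.
 */
function metaCreatedOn(array $details): ?DateTimeImmutable
{
    if (empty($details['CreatedOn']) || !is_string($details['CreatedOn'])) {
        return null;
    }

    try {
        return new DateTimeImmutable($details['CreatedOn']);
    } catch (Exception $e) {
        // The value is present but is not a parseable date string
        return null;
    }
}

// Usage with a parsed document (requires the library to be loaded):
// $created = metaCreatedOn($pdf->getDetails());
// echo $created ? $created->format('Y-m-d') : 'unknown';
```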

Read Base64 encoded PDFs

If working with Base64 encoded PDFs, you might want to parse the PDF without saving the file to disk.

This sample parses the Base64 encoded PDF and extracts its text.

<?php
// $base64PDF holds the Base64 encoded PDF string.
// Decode it, parse the content and build the necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseContent(base64_decode($base64PDF));

$text = $pdf->getText();
echo $text;
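
Base64 input that comes from users or external services may be malformed, so it can help to validate it before handing it to the parser. The sketch below is one way to do that: `decodeBase64Pdf` is a hypothetical helper, not part of the library, and the check for a leading `%PDF-` marker is a simplifying assumption about what counts as a PDF.

```php
<?php
/**
 * Hypothetical helper (not part of the library): strictly decode a Base64
 * string and check that the result starts with the "%PDF-" marker.
 *
 * Returns the decoded bytes, or null when the input is not valid Base64
 * or does not look like a PDF.
 */
function decodeBase64Pdf(string $encoded): ?string
{
    // Strict mode returns false for characters outside the Base64 alphabet
    $bytes = base64_decode($encoded, true);

    if ($bytes === false) {
        return null;
    }

    // Assumption: a PDF file begins with the "%PDF-" header
    if (strncmp($bytes, '%PDF-', 5) !== 0) {
        return null;
    }

    return $bytes;
}

// Usage with the parser (requires the library to be loaded):
// $bytes = decodeBase64Pdf($base64PDF);
// if ($bytes !== null) {
//     $pdf = $parser->parseContent($bytes);
// }
```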

Joomla

Extension type: Library
Joomla version: 4.0

What's new

Smalot / PDF Parser v.2.1.0

Version from February 2, 2022

WebTolk Joomla Extensions
