Portable Document Format (PDF) is a file format for capturing and sending electronic documents in exactly the intended format. PDF files play a major role in everyday tasks and are a significant part of the processes in every industry.
Creating and reading such files is a crucial part of PDF workflows across different market verticals, and it can be easily automated with UiPath Studio.
There are two types of PDFs: text-based PDFs and scanned PDFs.
Let us see, step by step, how to extract the content of a PDF into a message box through automation:
First, launch UiPath Studio on your system and create a new process called PDF Automation.
Once the process opens in UiPath Studio, you need to install the PDF package. Go to Manage Packages, select Official, then select UiPath.PDF.Activities and install it.
Once the PDF activities package is installed, click the Save button.
Next, download a sample PDF to your system and save it in any folder you prefer. I am saving the sample PDF in the Downloads folder under This PC.
Go to UiPath Studio and drag and drop a Sequence into the Designer panel. Add the Read PDF Text activity to the sequence as shown below.
Click on the three horizontal dots in the Read PDF Text box and add the path of the downloaded PDF file in double quotes.
The Read PDF Text activity has several properties in the Properties panel.
Next, add the Message Box activity inside the sequence to display the text from the PDF file, and add the variable name (pdf2text) inside the message box as shown below.
Now save the sequence and run it. Once the sequence starts executing, the text from the sample PDF file is displayed in the message box.
You can compare the two by opening the PDF and the message box at the same time.
This is how to extract the text from a PDF into a message box.
The following example demonstrates how to write the text into a file.
Go back to the same sequence, delete the Message Box, and add the Write Text File activity inside the sequence as shown below.
Click on the three horizontal dots (...) and select the location where you want to save this file. I am going to save it where the process is saved, under the name sample.txt, as shown below. Add the file path in double quotes.
In the Text box, enter the variable name pdf2text. Because pdf2text is a variable rather than a literal string, enter it without double quotes.
We have created a sequence containing a Read PDF Text activity, which reads the content of the selected PDF file. The text from the PDF is stored in a variable called pdf2text, and we are now creating a file called sample.txt with the content of that variable.
Save the sequence and run it. Once the sequence has executed, go to the location of sample.txt and refresh the folder; you will see that the sample.txt file has been created. Open sample.txt and you will see that it contains the content of the sample PDF file.
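For readers who want to see the write step as code, here is a minimal Python sketch of what this part of the sequence does. It is not UiPath code: the function name write_text_file is hypothetical, the string assigned to pdf2text is a placeholder for the real output of Read PDF Text, and overwriting an existing file is an assumption about the desired behaviour.

```python
from pathlib import Path


def write_text_file(file_name: str, text: str) -> Path:
    """Create file_name (or overwrite it) with text and return its path."""
    target = Path(file_name)
    target.write_text(text, encoding="utf-8")
    return target


if __name__ == "__main__":
    # Placeholder standing in for the text produced by Read PDF Text.
    pdf2text = "Text extracted from the sample PDF."
    saved = write_text_file("sample.txt", pdf2text)
    # Read the file back to confirm that it holds the extracted text.
    print(saved.read_text(encoding="utf-8"))
```

Passing the variable itself, and not its name in quotes, is what makes the file contain the extracted text instead of the literal word pdf2text.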
This is how to extract a PDF to a .txt file. In the same way, you can also extract the PDF content into a document by creating a sample.doc file.
Now save and run the sequence. Once it has executed, go to the location of sample.doc and you will see that the file has been created. Open sample.doc and you will see that it contains the content of the PDF file.
In the case of scanned documents, data extraction can also be achieved by using the OCR-based activities Read PDF With OCR and Read XPS With OCR.
You can select one of the three OCR engines available in UiPath: Google OCR, Microsoft OCR, and Abbyy OCR. Microsoft OCR is a reasonable choice, as it is free and provided with UiPath Community.
Follow the steps below to extract a scanned PDF to a file. Before starting, download a scanned PDF to your system.
Whenever we want to copy text from a scanned PDF, we need to use OCR (Optical Character Recognition). UiPath Studio provides several OCR activities that we just have to drag and drop into the sequence.
Create a new sequence called scanned PDF example.
We have already installed the PDF activities package in UiPath Studio, so search for Read PDF With OCR and add this activity inside the sequence.
Click on the three horizontal dots and select the path of the scanned sample file, and then drop an OCR engine activity inside Read PDF With OCR. UiPath provides the Microsoft OCR engine; add it to your sequence.
Keep the properties at their defaults, and then create a variable to hold the extracted text.
First, try extracting the scanned PDF to a message box. Add a Message Box activity to the sequence and then add the variable name inside it.
Save the sequence and run it. Once the sequence has executed, you will see a pop-up message containing the text from the scanned PDF.
Now let us extract the scanned PDF into a text file. Delete the Message Box activity and then add Write Text File inside the sequence.
Click on the three horizontal dots and enter the new file path in the Write to filename box, and then add the variable name in the Text box.
The complete sequence looks as shown below.
Save and run the sequence. Once the sequence has executed, the text file will have been created.
If you want to extract this content into a .doc file, you can do so by creating a doc file as shown in the previous example.
The PDF package contains activities designed to extract data from PDF and XPS files and store it in string variables. The data can be extracted from the entire document or from a range of pages specified in the Range property found in each of the activities.
Most of the activities are self-explanatory, such as Read PDF With OCR, Read PDF Text, and Manage PDF Password. Manage PDF Password is used to change the password of your PDF. Join PDF Files is used to join two or more PDFs into a single file.
Extract PDF Page Range is used to extract the required pages into another PDF. For example, if a PDF contains ten pages and you want to extract only two of them, Extract PDF Page Range makes it easy.
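To make the idea of a page range concrete, here is a minimal Python sketch (not UiPath code) of how a range string could be expanded into page numbers. The helper name pages_in_range is hypothetical, and the accepted syntax ("All", a single page such as "7", a span such as "3-4", comma-separated parts, and "End" for the last page) is an assumption made for illustration; check the activity documentation for the exact rules.

```python
def pages_in_range(range_text: str, page_count: int) -> list[int]:
    """Expand a range string into 1-based page numbers (hypothetical helper)."""
    text = range_text.strip()
    if text.lower() == "all":
        return list(range(1, page_count + 1))

    def to_page(token: str) -> int:
        # "End" stands for the last page; anything else must be a number.
        token = token.strip()
        page = page_count if token.lower() == "end" else int(token)
        if not 1 <= page <= page_count:
            raise ValueError(f"page {page} is outside 1..{page_count}")
        return page

    pages: list[int] = []
    for part in text.split(","):
        first, _, last = part.partition("-")
        start = to_page(first)
        # A part without "-" is a single page.
        stop = to_page(last) if last.strip() else start
        if start > stop:
            raise ValueError(f"invalid range: {part.strip()!r}")
        pages.extend(range(start, stop + 1))
    return pages


if __name__ == "__main__":
    # A ten-page PDF from which only pages 3 and 4 are wanted.
    print(pages_in_range("3-4", 10))
    print(pages_in_range("2-5, 7, 9-End", 10))
```

For the ten-page example above, the range "3-4" selects exactly two pages, which is the situation the page range extraction is meant for.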
The following example demonstrates joining PDF files.
Join PDF Files: The Join PDF Files activity is used to join two or more PDFs into a single file. Let us create a new sequence called JoinPDF files.
Add the Join PDF Files activity inside it. In the Properties panel, you can see the file list, where you enter all the files that you are going to join. In the output file name, enter the name of the file in which you want to combine all the PDF files.
Add an Assign activity to the sequence before Join PDF Files and create a variable. I am creating a variable called file list.
Then write a small expression, as shown below, in the Enter VB expression box. Click on the Properties panel and enter the expression in the Expression Editor. Press Ctrl+Space to see the functions available in UiPath.
The function to get all the PDF files in a directory is shown below.
Directory.GetFiles("PDF path","*.PDF") Here, GetFiles returns the names of all the files in a directory that match the search pattern. PDF path refers to the location of the PDF files on your system, and *.PDF selects only the PDF files in that folder.
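As an illustration of what this expression collects, here is a minimal Python sketch of the same idea. It is not what UiPath runs: the function name get_pdf_files is hypothetical, the case-insensitive extension check is an assumption meant to mirror how the pattern typically behaves on Windows, and the explicit sort is an addition, because the .NET documentation does not guarantee the order in which GetFiles returns file names.

```python
from pathlib import Path


def get_pdf_files(folder: str) -> list[str]:
    """Return the paths of the PDF files directly inside folder, sorted by name."""
    return sorted(
        str(path)
        for path in Path(folder).iterdir()
        # Keep regular files only and compare the extension case-insensitively.
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    # "." is a placeholder; point it at the folder that holds your PDFs.
    for name in get_pdf_files("."):
        print(name)
```

Sorting matters when the list feeds a join step, since the order of the list is presumably the order in which the documents end up in the combined file.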
I have PDF files in my UiPath folder under Documents in This PC, so I am going to give the complete path of that folder in double quotes.
The code is as shown below.
Next, set the variable type to an array of strings. Click on the Variables panel, select Array of [T] under Variable type, and then select String.
Click on the Join PDF Files activity and, in the Properties panel, supply the variable created above as the list of files to join.
Now create a combined .pdf file and add its path in double quotes in the Join PDF Files activity, as shown below.
Now save and run the sequence. Once the sequence has executed, the combined file is created, containing the content of the two different PDFs.