Extract data from PDF


Hi.

I am trying to use @jguinch's XPDF UDF from here:

in order to extract certain data from a PDF file that I converted to a text file, but I am not sure how to move forward. In the text file (test.txt) attached to this thread, I need to extract the "Asset" IP and/or the "Asset Name:", along with the details from the "Details:" section, which includes "Port:", "(u)" and "(p)". Any help is greatly appreciated. Would regex make this easier?

 

test.txt

Edited by antmar904

I'm not very familiar with regex, so the best path I can point you to is to open the file with FileOpen(), then set up a loop that reads the file line by line and compares each line to the specific items you need. When it finds a match, work out how to extract the relevant data based on the structure of the file. I only say this because I'm not sure if the data you're expecting is beside the label or below it. You also need a way to reject lines that match but contain nothing besides the label. There is probably an easier way to do this; what I'm describing is the brute-force method, and it is probably slow as well.

Also, once you find the data you are looking for, you need to save it by passing it to an array, and then do whatever you need to do with it from there.
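A minimal sketch of that brute-force loop, using FileReadToArray() so the file is read once instead of line by line. It assumes test.txt sits next to the script and that each value follows its label on the same line; neither is confirmed by the attachment, so treat the label list and the "value is on the same line" rule as placeholders to adjust.

```autoit
#include <StringConstants.au3>

; Assumption: test.txt is in the script directory.
Local $aLines = FileReadToArray(@ScriptDir & "\test.txt")
If @error Then Exit MsgBox(0, "ERROR", "Could not read test.txt")

; Labels named in the question. They are matched anywhere in the line.
Local $aLabels[5] = ["Asset Name:", "Details:", "Port:", "(u)", "(p)"]
Local $iTrimBoth = $STR_STRIPLEADING + $STR_STRIPTRAILING

For $i = 0 To UBound($aLines) - 1
    Local $sLine = StringStripWS($aLines[$i], $iTrimBoth)
    For $j = 0 To UBound($aLabels) - 1
        Local $iPos = StringInStr($sLine, $aLabels[$j])
        If $iPos = 0 Then ContinueLoop ; label not on this line
        ; Assumption: everything after the label is the value.
        Local $sValue = StringStripWS(StringMid($sLine, $iPos + StringLen($aLabels[$j])), $iTrimBoth)
        ; Reject lines that contain the label but nothing after it.
        If $sValue <> "" Then ConsoleWrite($aLabels[$j] & " -> " & $sValue & @CRLF)
    Next
Next
```

If the values turn out to sit on the line below the label, the same loop can look at $aLines[$i + 1] instead, after checking that $i + 1 is still inside the array.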

 

Edited by markyrocks

This is the best I could come up with, because I'm not exactly sure what you plan to do with the information after you sort it out. I really have no idea if this finds everything, but it seems to be working. It will definitely need more filtering and shifting around before it squeezes out exactly what you are looking for. I could have used FileReadToArray() as well, but it's been a while since I played around with files, so I kind of forgot about that function. Merry Christmas.

 

#include <File.au3>
#include <Array.au3>

Global $result[25], $result_count = 0
Global $path = @ScriptDir & "\test.txt"

Global $file = FileOpen($path)
If $file = -1 Then
    MsgBox(0, "ERROR", "file failed to open")
    Exit 1 ; nothing to read, so stop here
EndIf

Global $linecount = _FileCountLines($path)
; Keywords to look for in each line.
Global $keys[7] = ["Asset Name:", "IP ", "Details:", "Port", "(u)", "(p)", "Asset"]

For $x = 1 To $linecount
    ; No line number is passed, so each call reads the next line instead of re-seeking from the top.
    Local $line = FileReadLine($file)
    Local $pos = 0
    ; If several keywords occur in the line, keep the lowest (leftmost) position.
    For $y = 0 To UBound($keys) - 1
        Local $found = StringInStr($line, $keys[$y])
        If $found <> 0 And ($pos = 0 Or $found < $pos) Then $pos = $found
    Next
    If $pos = 0 Then ContinueLoop ; no keyword in this line

    ; Trim off any garbage before the part we're looking for.
    Local $Trim_Left = StringTrimLeft($line, $pos - 1)

    ; See if there is anything after the keyword in the line.
    ; [0] is the number of elements; [1] should be the found keyword, so it is skipped.
    Local $String_Split = StringSplit($Trim_Left, " ")
    For $n = 2 To $String_Split[0]
        If $String_Split[$n] <> "" Then ; something follows the keyword, so save the line as a result
            If $result_count > UBound($result) - 1 Then ReDim $result[$result_count + 1]
            $result[$result_count] = $Trim_Left
            $result_count += 1
            ExitLoop
        EndIf
    Next
Next

FileClose($file)
If $result_count > 0 Then ReDim $result[$result_count] ; drop the unused slots
_ArrayDisplay($result)

 

 


Converting to and from PDF will always give you inconsistent data as far as order and layout go (especially if the original contains tables). You could fiddle with the pdftotext.exe command-line options, but I've had little success getting clean, consistent data out of them (PDF is for 'looks' and printing, nothing more). This is the best I could do with regex: first regularize the data (lines 17 and 18 of the script), then capture the fields you want. Note that the last item in your test file has the Asset after the IPAddress (if by Asset: you mean COMPUTERNAME). Anyhow, here is something for you to go crazy trying to sort out, if you wish. There are hundreds of ways to write a regex, so no doubt someone will provide different and perhaps even better examples.

Joe

 

#include <MsgBoxConstants.au3>
#include <StringConstants.au3>
#include <FileConstants.au3>


$processing = @MyDocumentsDir & '\AutoIt_code\getter\processing\test.txt'


; Open the file for reading and store the handle to a variable.
Local $hFileOpen = FileOpen($processing, $FO_READ)
If $hFileOpen = -1 Then
    Exit MsgBox($MB_SYSTEMMODAL, "", "An error occurred when reading the file.")
EndIf

; Read the contents of the file using the handle returned by FileOpen.
$sFileRead = FileRead($hFileOpen)
$sFileRead = StringStripWS($sFileRead, 8)
$sFileRead = StringRegExpReplace($sFileRead, '(?i)(?-s)(Asset:.*?\w*COMPUTERNAME\d*.*?)(?=Discovery:)', @CRLF & '$1')
;ConsoleWrite( $sFileRead & @CRLF)

FileClose($hFileOpen)

; Capture the asset name, the IP, the text after each Details:, and the (u) and (p) values from every record.
; Note: the final (.*?) is lazy with nothing after it, so it always captures an empty string;
; anchor it on whatever follows the (p): value in your data.
Local $sPattern = '(?i)(?-s)([A-Z]+COMPUTERNAME\d*)IPAddress:([\d\.]*).*?Details:(.*?)Details:(.*?)\(u\):(.*?)\(p\):(.*?)'
; Variant for records where the asset name comes after the IP address:
;~ $sPattern = '(?i)(?-s)IPAddress:([\d\.]*).*?([A-Z]+COMPUTERNAME\d*).*?Details:(.*?)Details:(.*?)\(u\):(.*?)\(p\):(.*?)'

; Flag 3 returns every capture of every match in one flat array and sets @error when nothing matches.
Local $aArray = StringRegExp($sFileRead, $sPattern, $STR_REGEXPARRAYGLOBALMATCH)
If @error Then Exit MsgBox($MB_SYSTEMMODAL, "", "No records matched the pattern.")

For $i = 0 To UBound($aArray) - 1
    ConsoleWrite($aArray[$i] & @CRLF)
Next
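
Since StringRegExp with flag 3 hands back one flat array, with the six captures of the first record followed by the six of the next, it can help to print them one record per row. A rough sketch: _PrintRecords is a hypothetical helper name, not part of any UDF, and the demo array is made-up data that only shows the grouping.

```autoit
; Hypothetical helper: print a flat capture array in rows of $iGroups values.
Func _PrintRecords(Const ByRef $aFlat, $iGroups)
    For $i = 0 To UBound($aFlat) - $iGroups Step $iGroups
        Local $sRow = ""
        For $j = 0 To $iGroups - 1
            $sRow &= $aFlat[$i + $j]
            If $j < $iGroups - 1 Then $sRow &= " | " ; separator between fields
        Next
        ConsoleWrite($sRow & @CRLF)
    Next
EndFunc   ;==>_PrintRecords

; Made-up demo data: two records with two captures each.
Local $aDemo[4] = ["a", "1", "b", "2"]
_PrintRecords($aDemo, 2)
```

With the pattern in the script above you would call _PrintRecords($aArray, 6), because that pattern has six capture groups.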

 

Edited by Jury
