ctgilbert Posted August 8, 2011
Hello, I am wondering if anyone could provide some guidance on how to check a PDF file for broken links (internal or external)? I think the process would go something like:
1) scan the document for a link
2) click the link
3) make sure it went where it was supposed to go
4) repeat until EOF
I understand how to do this with web pages, but I need help understanding how to apply it to a PDF file. Thanks!
wakillon Posted August 9, 2011
I have used "pdftotext.exe", a PDF command-line tool, to get the links and open them in the default browser, if that helps you.
    ; Paths to the PDF to scan and to the pdftotext command-line tool
    $_PdfFilePath = @DesktopDir & '\any.pdf'
    $_PdfToTextPath = @DesktopDir & '\pdftotext.exe'
    ; Convert the PDF to plain text in a temporary file
    RunWait ( '"' & $_PdfToTextPath & '" "' & $_PdfFilePath & '" C:\file.tmp', '', @SW_HIDE )
    ; Collect every http:// or https:// URL, up to the next whitespace
    $_ArrayLinks = StringRegExp ( FileRead ( 'C:\file.tmp' ), '(?i)(https?://\S+)', 3 )
    For $_I = 0 To UBound ( $_ArrayLinks ) -1
        ConsoleWrite ( $_I +1 & " link : " & $_ArrayLinks[$_I] & @CRLF )
        ShellExecute ( $_ArrayLinks[$_I] ) ; open the link in the default browser
    Next
AutoIt 3.3.14.2 X86 - SciTE 3.6.0 - WIN 8.1 X64 - Other Example Scripts
ctgilbert (Author) Posted August 9, 2011
wakillon, thanks for the suggestion. However, the links inside the PDF aren't "spelled out" in the document. This means that when I run pdftotext against it, I just get the text and not the actual link.
MrMitchell Posted August 9, 2011
Quote (ctgilbert):
    wakillon, Thanks for the suggestion. However, the links inside the pdf aren't "spelled out" in the document. This means that when I run pdftotext against it, I just get the text and not the actual link.
This tool sounds like it will do what you need. It outputs a file in the format:
    Link Text|Hyperlink
    Link Text2|Hyperlink2
The trial version is limited, but at least you can see if it does what you need before you buy...
wakillon Posted August 9, 2011
Quote (ctgilbert):
    wakillon, Thanks for the suggestion. However, the links inside the pdf aren't "spelled out" in the document. This means that when I run pdftotext against it, I just get the text and not the actual link.
Did you try my solution? It extracts all URLs found in a PDF.
Edited August 9, 2011 by wakillon
ctgilbert (Author) Posted August 9, 2011
Quote (wakillon):
    Did you try my solution ? It extracts all urls found in a pdf.
Yes, I did try the solution, and if the full link was in the PDF it was captured. However, links that do not have the full path visible to the reader (internal links that go to another part of the document, for example) were not captured. It turns out that all links are listed if you open the PDF in a text editor. However, I have not been able to decipher how they are mapped within the document.
Thanks again for your help.
wakillon Posted August 9, 2011
Try asking taietel, a PDF expert member!
Edited August 9, 2011 by wakillon
taietel Posted August 9, 2011
wakillon, I'm far from an expert.
ctgilbert, if the PDF is not encoded, you can use a RegExp to search for "/URI /URI (http://www.autoitscript.com)" to find external links. Other types of links (within the document, outside the document, for opening a file, for playing multimedia, etc.) are much more difficult because of the multitude of action types.
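As a rough illustration of the RegExp approach described above, the sketch below reads the raw PDF, collects the text inside each "/URI (...)" entry, and tries to download each target. It is only a sketch under stated assumptions: the file path is hypothetical, the PDF objects must be stored uncompressed and unencrypted, and URLs containing escaped parentheses will be cut short. InetRead downloads the whole resource, so treat the result as a reachability hint rather than proof that the link goes where it should.

```autoit
; Sketch: list and probe external links stored as /URI (...) in an uncompressed PDF.
; $sPdfPath is a hypothetical path; change it to point at your own file.
Local $sPdfPath = @DesktopDir & '\any.pdf'

; Open in binary read mode (16) so the bytes are not altered, then convert to a string.
Local $hFile = FileOpen($sPdfPath, 16)
If $hFile = -1 Then
    ConsoleWrite('Cannot open ' & $sPdfPath & @CRLF)
    Exit 1
EndIf
Local $sRaw = BinaryToString(FileRead($hFile))
FileClose($hFile)

; Capture whatever sits between the parentheses that follow a /URI key.
Local $aUris = StringRegExp($sRaw, '/URI\s*\(([^)]*)\)', 3)
If @error Then
    ConsoleWrite('No /URI (...) entries found; the file may be compressed.' & @CRLF)
    Exit
EndIf

For $i = 0 To UBound($aUris) - 1
    ; InetRead sets @error to non-zero when the download fails.
    InetRead($aUris[$i], 1) ; 1 = force a reload from the remote site
    If @error Then
        ConsoleWrite('BROKEN? ' & $aUris[$i] & @CRLF)
    Else
        ConsoleWrite('OK      ' & $aUris[$i] & @CRLF)
    EndIf
Next
```

Internal links (jumps to a page or named destination) do not use /URI at all, so this sketch will not report them.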
wakillon Posted August 10, 2011
Quote (taietel):
    wakillon, I'm far below from an expert. ctgilbert, if the pdf is not encoded, you can do some RegExp for "/URI /URI (http://www.autoitscript.com)", for external links. For other types of links (within document, outside the document, for opening a file, for playing multimedia etc) this is much more difficult because of the multitude of action types.
Stop being modest. If you are not a PDF expert, who is?
Yab Posted March 31, 2015
Hi All,
I have 10 PDFs cross-linking to approx. 1000-1500 PDF docs (all in one large folder), and I need to check that the links are all good (approx. 1000-1500 links). The links were created as "link to a file", NOT to a web address, and the link itself does not show after converting to text using pdftotext.
Any ideas on a tool or method to check that a "link to file" points to the correct document? Is the document there?
Also, to add to the difficulty, some links are in the bookmarks and not in the PDF content!
Cheers,
Yab
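One possible starting point for the "is the document there?" part of this question is sketched below. It assumes that file links are stored in the raw PDF as "/F (filename)" entries, which is how remote go-to and launch actions usually name their target, that the objects are uncompressed, and that targets are relative to a single folder. The folder and file names are hypothetical. Because it scans the raw file, it should also see targets attached to bookmarks, but it only confirms that a file exists, not that it is the correct document.

```autoit
; Sketch: check that files named in /F (...) entries of a PDF exist on disk.
; Assumptions: uncompressed PDF objects and targets relative to one folder.
Local $sFolder = 'C:\docs'                   ; hypothetical folder holding all the PDFs
Local $sPdfPath = $sFolder & '\index.pdf'    ; hypothetical PDF that contains the links

; Open in binary read mode (16) and convert the bytes to a string for searching.
Local $hFile = FileOpen($sPdfPath, 16)
If $hFile = -1 Then
    ConsoleWrite('Cannot open ' & $sPdfPath & @CRLF)
    Exit 1
EndIf
Local $sRaw = BinaryToString(FileRead($hFile))
FileClose($hFile)

; Capture the file name between the parentheses that follow an /F key.
Local $aTargets = StringRegExp($sRaw, '/F\s*\(([^)]*)\)', 3)
If @error Then
    ConsoleWrite('No /F (...) file targets found.' & @CRLF)
    Exit
EndIf

For $i = 0 To UBound($aTargets) - 1
    ; PDF file specifications use forward slashes; convert them for Windows.
    Local $sTarget = $sFolder & '\' & StringReplace($aTargets[$i], '/', '\')
    If FileExists($sTarget) Then
        ConsoleWrite('OK      ' & $sTarget & @CRLF)
    Else
        ConsoleWrite('MISSING ' & $sTarget & @CRLF)
    EndIf
Next
```

Wrapping the body in a loop over the 10 source PDFs would cover the whole set; file names stored as hex or Unicode strings would need extra decoding that this sketch does not attempt.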