Check PDF for broken links

ctgilbert

Hello,

I am wondering if anyone could provide some guidance on how to check a pdf file for broken links (internal or external)?

I think the process would go something like:

1) scan the document for a link

2) click the link

3) make sure it went where it was supposed to go

4) repeat until EOF

I understand how to do this with web pages, but I need help understanding how to apply it to a pdf file.

Thanks!

wakillon

I have used "pdftotext.exe", a PDF command-line tool, to get the links and open them in the default browser.

In case it helps:

; Convert the PDF to plain text with pdftotext.exe, then open every URL found in that text
$_PdfFilePath = @DesktopDir & '\any.pdf'
$_PdfToTextPath = @DesktopDir & '\pdftotext.exe'
$_TmpFilePath = @TempDir & '\pdflinks.tmp' ; the temp folder is writable without admin rights
RunWait ( '"' & $_PdfToTextPath & '" "' & $_PdfFilePath & '" "' & $_TmpFilePath & '"', '', @SW_HIDE )
; Flag 3 returns all matches; match http and https links up to the next whitespace
$_ArrayLinks = StringRegExp ( FileRead ( $_TmpFilePath ), '(?i)https?://\S+', 3 )
For $_I = 0 To UBound ( $_ArrayLinks ) -1
    ConsoleWrite ( $_I +1 & " link : " & $_ArrayLinks[$_I] & @CRLF )
    ShellExecute ( $_ArrayLinks[$_I] )
Next
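
Opening every link in the browser still leaves the checking to a human. Below is a minimal sketch of an automated check, not a definitive implementation. It assumes the "WinHttp.WinHttpRequest.5.1" COM object is available on the machine and that a failed COM call sets @error rather than aborting the script (if yours aborts, register a COM error handler with ObjEvent first). _CheckLink and _IsGoodStatus are hypothetical helper names, and counting any 2xx or 3xx status as a working link is an assumption, not something from this thread.

```autoit
; Sketch: request a URL with HEAD and report the HTTP status instead of opening a browser.
; _CheckLink and _IsGoodStatus are hypothetical helper names, not part of AutoIt.

; Assumption: 2xx and 3xx responses count as a working link
Func _IsGoodStatus($iStatus)
    Return $iStatus >= 200 And $iStatus < 400
EndFunc

; Returns the HTTP status code, or 0 (with @error set) if the request could not be made
Func _CheckLink($sUrl)
    Local $oHttp = ObjCreate("WinHttp.WinHttpRequest.5.1")
    If Not IsObj($oHttp) Then Return SetError(1, 0, 0)
    $oHttp.Open("HEAD", $sUrl, False) ; False = synchronous request
    $oHttp.Send()
    If @error Then Return SetError(2, 0, 0) ; e.g. host not found or timeout
    Return $oHttp.Status
EndFunc

; Usage with the $_ArrayLinks array from the script above:
; For $_I = 0 To UBound($_ArrayLinks) - 1
;     $iStatus = _CheckLink($_ArrayLinks[$_I])
;     ConsoleWrite($_ArrayLinks[$_I] & " -> " & $iStatus & (_IsGoodStatus($iStatus) ? " OK" : " BROKEN") & @CRLF)
; Next
```

Some servers reject HEAD requests, so a link reported as broken may deserve a second try with "GET" before you trust the result.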

AutoIt 3.3.14.2 X86 - SciTE 3.6.0WIN 8.1 X64 - Other Example Scripts

ctgilbert

wakillon,

Thanks for the suggestion. However, the links inside the pdf aren't "spelled out" in the document. This means that when I run pdftotext against it, I just get the text and not the actual link.

MrMitchell

This tool sounds like it will do what you need. It outputs a file in the format:

Link Text|Hyperlink

Link Text2|Hyperlink2

Trial version is limited but at least you can see if it does what you need before you buy...

wakillon

Did you try my solution?

It extracts all URLs found in a PDF.

ctgilbert

Yes, I did try the solution and if the full link was in the pdf it was captured. However, links that do not have the full path visible to the reader (internal links that go to another part of the document, for example) were not captured.

It turns out that all links are listed if you open the pdf in a text editor. However, I have not been able to decipher how they are mapped within the document.

Thanks again for your help.

wakillon

Try asking taietel, a PDF expert member!

taietel

wakillon, I'm far from an expert. :mellow:

ctgilbert, if the PDF is not encoded, you can use a RegExp to search for entries like "/URI /URI (http://www.autoitscript.com)" to find the external links. Other types of links (within the document, to another document, opening a file, playing multimedia, etc.) are much more difficult because of the multitude of action types.
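
A rough sketch of that RegExp idea, under the same assumption that the PDF source is readable as plain text (compressed or encrypted objects will not match). _ExtractUriTargets is a hypothetical helper name. The pattern also tolerates backslash-escaped characters inside the parentheses, and the captured text is returned as written in the file, escapes included.

```autoit
; Sketch: collect the targets of /URI (...) entries from the raw source of a PDF.
; _ExtractUriTargets is a hypothetical helper name, not part of AutoIt.

; Returns an array of URLs, or '' with @error = 1 when nothing matches
Func _ExtractUriTargets($sPdfSource)
    ; Flag 3 = return all matches; the capture group is the text between the parentheses.
    ; (?:\\.|[^\\)])* accepts escaped characters such as \) inside the URL.
    Local $aUris = StringRegExp($sPdfSource, '/URI\s*\(((?:\\.|[^\\)])*)\)', 3)
    If @error Then Return SetError(1, 0, '')
    Return $aUris
EndFunc

; Usage (the path is only an example):
; Local $hFile = FileOpen(@DesktopDir & '\any.pdf', 16) ; 16 = binary mode
; Local $sPdf = BinaryToString(FileRead($hFile))
; FileClose($hFile)
; Local $aUris = _ExtractUriTargets($sPdf)
; For $i = 0 To UBound($aUris) - 1
;     ConsoleWrite($aUris[$i] & @CRLF)
; Next
```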

wakillon

Stop being modest. If you are not a PDF expert, who is?


Yab

Hi All,

I have 10 PDFs cross-linking to approx. 1000-1500 PDF docs (all in one large folder), and I need to check that the links (approx. 1000-1500 of them) are all good. The links were created as "link to a file", NOT to a web address, and the link itself does not show up after converting to text with pdftotext.

Any ideas on a tool or method to check that a "link to file" points to the correct document? Is the document there?

Also, to add to the difficulty, some links are in the bookmarks and not in the pdf content!

Cheers

Yab
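
One possible starting point for the "is the document there?" part, offered as a sketch under stated assumptions rather than a tested solution: in the PDF format, "link to a file" actions (GoToR and Launch) carry their target in an /F entry, and bookmark actions are written the same way, so scanning the raw source covers both. This only works when the PDF objects are stored uncompressed and unencrypted, and it assumes targets are relative to the folder holding the PDFs. _ExtractFileTargets and _TargetExists are hypothetical helper names. It does not verify that a link points to the *correct* document, only that the target exists.

```autoit
; Sketch: list /F (...) file targets in the raw source of a PDF and check they exist on disk.
; _ExtractFileTargets and _TargetExists are hypothetical helper names, not part of AutoIt.

; Returns an array of file names, or '' with @error = 1 when nothing matches
Func _ExtractFileTargets($sPdfSource)
    ; Flag 3 = return all matches; the capture group is the text between the parentheses
    Local $aFiles = StringRegExp($sPdfSource, '/F\s*\(((?:\\.|[^\\)])*)\)', 3)
    If @error Then Return SetError(1, 0, '')
    Return $aFiles
EndFunc

; Resolve a target relative to $sFolder (assumption) and test that the file exists
Func _TargetExists($sFolder, $sTarget)
    ; PDF file specifications use forward slashes; convert them for Windows
    Return FileExists($sFolder & '\' & StringReplace($sTarget, '/', '\')) = 1
EndFunc

; Usage (folder and file name are only examples):
; Local $sFolder = 'C:\docs'
; Local $hFile = FileOpen($sFolder & '\index.pdf', 16) ; 16 = binary mode
; Local $sPdf = BinaryToString(FileRead($hFile))
; FileClose($hFile)
; Local $aFiles = _ExtractFileTargets($sPdf)
; For $i = 0 To UBound($aFiles) - 1
;     If Not _TargetExists($sFolder, $aFiles[$i]) Then ConsoleWrite('Missing: ' & $aFiles[$i] & @CRLF)
; Next
```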
