hentaiw

Find all ISBNs within text files

hentaiw

I have a text file that was extracted with pdftext.exe from the PDF of a book that has an ISBN. The text file contains the text content of the PDF. Now I want to read the text file and find all the ISBNs in it.

An ISBN can be 10 or 13 digits, e.g. 1234567890 or 9781234567890.

An ISBN-10 might have an X instead of the final digit, e.g. 123456789X.

There might be a space or '-' character between the parts of the ISBN, e.g. 978-12345-67890.

I want to parse the text file and find all these numbers, then normalize them later.

I'm new to AutoIt and would be glad if anyone could provide a RegEx and tell me how I can parse all of this correctly.
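For the later normalization step, here is a minimal sketch, assuming that "normalize" means removing the separators and upper-casing a trailing x. The helper names _ISBN_Normalize and _ISBN_IsValid are hypothetical and not part of any AutoIt UDF. The check-digit test is an addition the post does not ask for. It uses the standard ISBN-10 (mod 11) and ISBN-13 (mod 10) rules and may help discard digit runs that only look like ISBNs.

```autoit
; Hypothetical helpers, not part of any standard AutoIt UDF.

; Remove spaces and dashes, and upper-case a trailing x.
Func _ISBN_Normalize($sRaw)
    Return StringUpper(StringRegExpReplace($sRaw, "[ -]", ""))
EndFunc

; Return True if the already normalized $sISBN has a correct check digit.
Func _ISBN_IsValid($sISBN)
    Local $iSum = 0, $sChar, $iDigit
    If StringRegExp($sISBN, "^\d{9}[\dX]$") Then
        ; ISBN-10: weights run from 10 down to 1, X counts as 10,
        ; and the weighted sum must be divisible by 11.
        For $i = 1 To 10
            $sChar = StringMid($sISBN, $i, 1)
            If $sChar == "X" Then
                $iDigit = 10
            Else
                $iDigit = Number($sChar)
            EndIf
            $iSum += (11 - $i) * $iDigit
        Next
        Return Mod($iSum, 11) = 0
    ElseIf StringRegExp($sISBN, "^\d{13}$") Then
        ; ISBN-13: weights alternate 1, 3, 1, 3, ...,
        ; and the weighted sum must be divisible by 10.
        For $i = 1 To 13
            $iDigit = Number(StringMid($sISBN, $i, 1))
            If Mod($i, 2) = 0 Then $iDigit *= 3
            $iSum += $iDigit
        Next
        Return Mod($iSum, 10) = 0
    EndIf
    Return False
EndFunc

Local $sISBN = _ISBN_Normalize("978-12345-67890") ; gives 9781234567890
ConsoleWrite($sISBN & " valid: " & _ISBN_IsValid($sISBN) & @CRLF)
```

Made-up numbers such as 1234567890 are placeholders and will not necessarily pass the check-digit test; real ISBNs should.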

hentaiw

My work so far:

#include <GUIConstantsEx.au3>
#include <Array.au3>
#include <File.au3>

Global $DetectMode = IniRead("settings.ini", "ISBN", "DetectMode", 404)
Global $FileExtension = IniRead("settings.ini", "ISBN", "FileExtension", 404)
Global $ScanMethod = IniRead("settings.ini", "ISBN", "ScanMethod", 404)
Global $ScanX = IniRead("settings.ini", "ISBN", "ScanX", 404)
Global $MultipleISBN = IniRead("settings.ini", "ISBN", "MultipleISBN", 404)
Global $ISBNNotFound = IniRead("settings.ini", "ISBN", "ISBNNotFound", 404)
Global $InfoNotFound = IniRead("settings.ini", "ISBN", "InfoNotFound", 404)
Global $GglBksKey = IniRead("settings.ini", "ISBN", "GglBksKey", 404)
Global $ISBNDBKey = IniRead("settings.ini", "ISBN", "ISBNDBKey", 404)
Global $FileTemplate = IniRead("settings.ini", "Template", "FileTemplate", 404)
Global $FolderTemplate = IniRead("settings.ini", "Template", "FolderTemplate", 404)
Global $InvCharSub = IniRead("settings.ini", "Template", "InvCharSub", 404)

Local $dir = FileSelectFolder("Choose a folder.", "")
If @error Then Exit ; the dialog was cancelled

Local $FileList = _FileListToArray($dir, "*.*")
; Check @error right after the call, before anything else can change it.
If @error = 1 Then
    MsgBox(0, "", "No Folders Found.")
    Exit
ElseIf @error = 4 Then
    MsgBox(0, "", "No Files Found.")
    Exit
EndIf

$dir = $dir & "\"
_ArrayDisplay($FileList, "$FileList")

For $i=1 To $FileList[0]
   Local $FileExt = _FileGetExt($FileList[$i])
   If $FileExt = ".pdf" Then
      _PDFScan($dir & $FileList[$i])
      MsgBox(0, "TTTT", $dir & $FileList[$i])
   ElseIf $FileExt = ".djvu" Or $FileExt = ".djv" Then
      
   ElseIf $FileExt = ".chm" Then
      
   EndIf
Next

Func _FileGetExt($sPath)
    Local $NULL, $sExt
    _PathSplit($sPath, $NULL, $NULL, $NULL, $sExt)
    Return StringLower($sExt)
 EndFunc
 
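Func _PDFScan($File)
   ; Work on a fixed file name so the command line stays simple.
   FileMove($File, $dir & "processing.pdf")
   ; Quote the PDF path in case the folder name contains spaces.
   ShellExecuteWait("pdftext.exe", '-f 1 -l 10 "' & $dir & 'processing.pdf" processing.txt')
   _ISBNScan("processing.txt")
EndFunc

; Placeholder: the ISBN search itself is not written yet.
Func _ISBNScan($File)

EndFunc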

hentaiw

And here is what the text file might look like:

Produced by Blue Island, London Reproduced by Colourscan, Singapore Printed and bound in China by Leo Paper Products Ltd. First American Edition, 2003 11 12 13 14 10 9 8 7 6 5 4 3 2 1 Published in the United States by DK Publishing, 375 Hudson Street, New York, New York 10014 Copyright 2003, 2011 © Dorling Kindersley Limited Reprinted with revisions 2005, 2007, 2009, 2011 All rights reserved. Without limiting the rights under copyright reserved above, no part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission of both the copyright owner and the above publisher of this book. Published in Great Britain by Dorling Kindersley Limited. A catalog record for this book is available from the Library of Congress. ISSN 1479-344X ISBN 978 0 7566 6923 2 Within each Top 10 list in this book, no hierarchy of quality or popularity is implied. All 10 are, in the editor's opinion, of roughly equal merit. Floors are referred to throughout in accordance with Spanish usage; ie the "first floor" is the floor above ground level.

There could be more than one ISBN, so I want to find all of them in this text file.

Realm

Hello hentaiw

First of all, Welcome to the AutoIt Forums!

There are several ways to accomplish what you are looking for, and more than likely there are better ways than the solution I have found.

Maybe guiness or jchd would chime in with their expertise in RegEx.

Without further ado, I have a simple approach and hope it achieves exactly what you need.

First, I would strip all the dashes and white space from the text to make the search easier. Then a simple StringRegExp should suffice for your needs.

I created a shortened example from what you supplied, utilizing all the different forms of ISBN #'s.

This is merely an example to parse the ISBN numbers from a string.

#include <Array.au3>

$string = "A catalog record for this book is available from the Library of Congress. " _
    & "ISSN 1479-344X ISBN 978 0 7566 6923 2 Within each Top 10 list in this book, no hierarchy of quality or popularity " _
    & "is implied. " _
    & "A catalog record for this book is available from the Library of Congress. " _
    & "ISSN 1479-344X ISBN 978-12345-67890 Within each Top 10 list in this book, no hierarchy of quality or popularity " _
    & "is implied. " _
    & "A catalog record for this book is available from the Library of Congress. " _
    & "ISSN 1479-344X ISBN 123456789X Within each Top 10 list in this book, no hierarchy of quality or popularity " _
    & "is implied. " _
    & "A catalog record for this book is available from the Library of Congress. " _
    & "ISSN 1479-344X ISBN 9781234567890 Within each Top 10 list in this book, no hierarchy of quality or popularity " _
    & "is implied. " _
    & "A catalog record for this book is available from the Library of Congress. " _
    & "ISSN 1479-344X ISBN 1234567890 Within each Top 10 list in this book, no hierarchy of quality or popularity " _
    & "is implied. "


$string = StringStripWS( StringReplace($string, '-', ''), 8 )

$aSRE = StringRegExp( $string, 'ISBN(\d+[X]?)', 3)
; Only what is inside the capturing group (the parentheses) is returned.
; \d matches a digit, and + repeats it until there are no more digits.
; [X] matches an X right after the digits, and ? makes that X optional.

_ArrayDisplay($aSRE)

If this is not as helpful as you had expected, feel free to ask questions or further describe what you're looking for.

Happy Coding!

Realm

Edit: Typo

Edited by Realm

hentaiw

I'm happy that you guys provide such enthusiastic support to a newcomer like me, thank you.

I wrote this RegEx pattern myself and it suits me (for now, at least). You can use it to check whether you get the same result that I did:

Local $var = StringRegExp ( "ISBN 0000000000 - 011029503X · ISBN 0110295048 - 0278067190 · ISBN 0278067204 - 044583935X · ISBN 0445839368 - 0613611519 ", "(9[ -]*7[ -]*8[ -]*[0-9][ -]*[0-9][ -]*[0-9][ -]*[0-9][ -]*[0-9][ -]*[0-9][ -]*[0-9][ -]*[0-9][ -]*[0-9][ -]*[0-9])|([0-9][ -]*[0-9][ -]*[0-9][ -]*[0-9][ -]*[0-9][ -]*[0-9][ -]*[0-9][ -]*[0-9][ -]*[0-9][ -]*[0-9Xx])", 3 )
$var = _RemoveEmptyArrayElements($var)
_ArrayDisplay($var,"$var")

 

I don't know why it leaves empty items at indexes 0, 2, 4, 6, 8 and so on, so I have to use a function that removes empty array elements to get the desired result.

Apparently it works nicely.
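A likely explanation for the empty items: with flag 3, StringRegExp returns the capturing groups of every match, and this pattern has two groups joined by |. Every ISBN in the test string is a 10-digit one, so the second group matches each time and the first group comes back as an empty string. Below is a sketch of a variant that uses only non-capturing groups, in which case flag 3 returns the whole matches. It is an assumption-laden rewrite, not the original pattern: it allows at most one space or dash after each digit (the original allows any number) and adds word boundaries.

```autoit
#include <Array.au3>

Local $sText = "ISBN 0000000000 - 011029503X - ISBN 0110295048 - ISBN 978 0 7566 6923 2"

; Optional 978 prefix, then nine digits and a final digit or X.
; Each digit may be followed by one space or dash. (?: ) groups do not capture,
; so there are no per-group empty strings in the result.
Local $sPattern = "(?i)\b(?:9[ -]?7[ -]?8[ -]?)?(?:\d[ -]?){9}[\dX]\b"

Local $aISBN = StringRegExp($sText, $sPattern, 3)
If @error Then
    MsgBox(0, "", "No ISBN found.")
Else
    _ArrayDisplay($aISBN, "$aISBN")
EndIf
```

A pattern this loose will also pick up other runs of spaced digits, such as a printing-number line, so the results probably still need a sanity check afterwards.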

The next step would be to show all these ISBNs in a GUI so the user can select one of them. I need suggestions; implementing checkboxes could be quite messy because the array can be of any size.

After hitting the OK button, I want the GUI to return the position of the selected item so I can continue the process.
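One possible approach is a list box instead of checkboxes, since it scrolls and copes with any number of items. The sketch below is only an illustration: _ChooseISBN is a hypothetical helper name, the window layout is arbitrary, and the sample array is made up from the example numbers in this thread.

```autoit
#include <GUIConstantsEx.au3>
#include <Array.au3>

; Hypothetical helper: show the ISBNs in a list box and return the
; zero-based array index of the chosen one, or -1 if nothing was chosen.
Func _ChooseISBN(ByRef $aISBN)
    Local $hGUI = GUICreate("Select an ISBN", 260, 260)
    Local $idList = GUICtrlCreateList("", 10, 10, 240, 200)
    ; Items are passed as one string separated by |.
    GUICtrlSetData($idList, _ArrayToString($aISBN, "|"))
    Local $idOK = GUICtrlCreateButton("OK", 90, 220, 80, 28)
    GUISetState(@SW_SHOW, $hGUI)

    Local $iPos = -1, $sSel
    While 1
        Switch GUIGetMsg()
            Case $GUI_EVENT_CLOSE
                ExitLoop
            Case $idOK
                ; GUICtrlRead returns the text of the selected list item.
                $sSel = GUICtrlRead($idList)
                If $sSel <> "" Then $iPos = _ArraySearch($aISBN, $sSel)
                ExitLoop
        EndSwitch
    WEnd
    GUIDelete($hGUI)
    Return $iPos
EndFunc

Local $aFound[3] = ["9780756669232", "123456789X", "1234567890"]
MsgBox(0, "Selected index", _ChooseISBN($aFound))
```

The position is looked up by text with _ArraySearch rather than taken from the list box, because a list box created with default styles sorts its items and may show them in a different order than the array. If the array contains duplicates, the first matching index is returned.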

By the way, could you tell me how to load the whole content of a text file into a variable so I can perform a single RegEx, rather than reading line by line and performing a separate RegEx on each line?
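For loading the whole file, FileRead accepts a file name and, when no character count is given, returns the entire content as one string. A minimal sketch, assuming the file is the processing.txt from the script above and reusing Realm's pattern purely as a placeholder:

```autoit
#include <Array.au3>

; FileRead with a file name and no count reads the entire file.
Local $sText = FileRead("processing.txt")
If @error = 1 Then
    MsgBox(0, "", "Could not open processing.txt.")
    Exit
EndIf

; Strip dashes and all white space, then search the whole text in one pass.
$sText = StringStripWS(StringReplace($sText, "-", ""), 8)
Local $aISBN = StringRegExp($sText, "ISBN(\d+X?)", 3)
If @error Then
    MsgBox(0, "", "No ISBN found.")
Else
    _ArrayDisplay($aISBN, "$aISBN")
EndIf
```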

Edited by hentaiw

TheSaint

If every ISBN is preceded by the text ISBN, then you could StringSplit on that (probably with the flag parameter set to 1, so the whole string is used as the delimiter). To further simplify your text, take a fixed number of characters from the left of each result, or use a check for numbers (StringIsDigit), not forgetting the chance of a final 'X'.

That is a simple approach, and it relies on the text 'ISBN' appearing only once, or at least only once per book.
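A rough sketch of that StringSplit idea follows. The sample text is shortened from the extract earlier in the thread, and the way the pieces are cleaned and measured is an assumption about what was meant, not a tested recipe.

```autoit
Local $sText = "ISSN 1479-344X ISBN 978 0 7566 6923 2 Within each Top 10 list. ISBN 123456789X Floors are referred to"

; Flag 1: use "ISBN" as one delimiter string, not as four separate characters.
Local $aParts = StringSplit($sText, "ISBN", 1)

Local $sPiece, $s13, $s10
; $aParts[1] is the text before the first "ISBN", so start at 2.
For $i = 2 To $aParts[0]
    ; Drop all white space and dashes from the piece.
    $sPiece = StringReplace(StringStripWS($aParts[$i], 8), "-", "")
    $s13 = StringLeft($sPiece, 13)
    $s10 = StringLeft($sPiece, 10)
    If StringLen($s13) = 13 And StringIsDigit($s13) Then
        ConsoleWrite("ISBN-13 candidate: " & $s13 & @CRLF)
    ElseIf StringLen($s10) = 10 And StringIsDigit(StringLeft($s10, 9)) _
            And StringRegExp(StringRight($s10, 1), "^[\dXx]$") Then
        ; Nine digits followed by a digit or a final X.
        ConsoleWrite("ISBN-10 candidate: " & $s10 & @CRLF)
    EndIf
Next
```

Because all white space is removed, a 10-digit ISBN that is directly followed by another number would be misread as a longer digit run, so this only suits text where the ISBN is followed by words.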


