[SOLVED] Extracting all text from a file that starts with >"text": "< and ends with >", "timestamp":<


Solved by Trong


There must be a very simple solution to this "problem". I know how I would do it if there were just a few instances, but I need to do it for every one, and there might be 1000 of them.

Quote

QVzADyw", "author_thumbnail": "https://yt3.ggpht.com/ytc/AKedOLTt1rCz0emlXz_QUVNB7T1AH11QBO13oYbFZw=s176-c-k-c0x00ffffff-no-rj", "author_is_uploader": false, "parent": "root"}, {"id": "UgxjxCRsERmHwFHpnXN4AaABAg", "text": "Some Youtube comment as example", "timestamp": 1583452800, "time_text": "2 years ago", "like_count": 1, "is_favorited": false, "author": "Aramis Papadopulos", "author_id": "UCRkVkOOpOBYkdrvGLmAETLQ", "author_thumbnail": "https://yt3.ggpht.com/ytc/AKedOLSbooVjQtUXaSGBpjrxOWh3kVTfXLRpsvcmu-AP=s176-c-k-c0x00ffffff-no-rj", "author_is_uploader": false, "parent": "root"}, {"id": "UgxbPLIT8b8pFW3K3f54AaABAg", "text": "some other text example", "timestamp": 1583452800, "time_text": "2 years ago", "like_count": 0, "is_favorited": false, "author": "BL1TZ", "author_id": "UCXqbwq4gWUpk8yJC561Pqew", "author_thumbnail": "https://yt3.ggpht.com/ytc/AKedOLRmm233HPvOaL-ohfRufFpnpAaHEoaMxKcypVru=s176-c-k-c0x00ffffff-no-rj", "author_is_uploader": false, "parent": "root"}, {"id": "UgwRoRLyWM_TIiaTEZp4AaABAg", "text": "another text example and so on", "timestamp": 1583452800, "time_text": "2 years ago", "like_count": 6, "is_favorited": false,

I guess it is just a few lines of code. I am not very good with regular expressions, so a solution without them would be much appreciated, for my better understanding. I have used AutoIt for a long time, but I am not an expert, and I am recovering from a COVID illness and haven't been coding for some time. So, if any good soul could give me a hint: the comments that I would like to extract are between "text": " and ", "timestamp": . Anyone? Thanks!

Edited by Fr33b0w

  • Solution

I don't know how to use RegEx but you can use _StringBetween():

#include <String.au3>
Local $InputData = '"text": "Some Youtube comment as example", "timestamp":346230, "text": "SomeSDGs example", "timestamp": 15833460, "text": "Some YoutFGNSFGJnt as example", "timestamp": 45634572800, "'
; Normalize the spacing so every field is separated by ," and every key ends with ":
$InputData = StringReplace($InputData, ', "', ',"')
$InputData = StringReplace($InputData, '": ', '":')
; Collect everything between "text": and the next ," (the surrounding quotes are kept)
Local $textArray = _StringBetween($InputData, '"text":', ',"')
If IsArray($textArray) Then
    For $i = 0 To UBound($textArray) - 1
        ConsoleWrite($textArray[$i] & @CRLF)
    Next
EndIf

; Same approach for the numeric timestamps
Local $timestampArray = _StringBetween($InputData, '"timestamp":', ',"')
If IsArray($timestampArray) Then
    For $i = 0 To UBound($timestampArray) - 1
        ConsoleWrite($timestampArray[$i] & @CRLF)
    Next
EndIf

 

Regards,
 


Or something like:

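#include <Array.au3>
Global $sText = '"text": "Some Youtube comment as example", "timestamp":346230, "text": "SomeSDGs example", "timestamp": 15833460, "text": "Some YoutFGNSFGJnt as example", "timestamp": 45634572800, "'
; The lookbehind (?<=...) and lookahead (?=...) anchor the match without including the markers;
; .*? lazily matches the text in between. Flag 3 returns an array of all matches.
Global $aText = StringRegExp($sText, '(?<=\"text\": \").*?(?=\", \"timestamp\")', 3)
_ArrayDisplay($aText)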

 


Thank you very much, VIP. This solves my problem and does exactly what I wanted to achieve. It was not as simple as I thought it would be, so sorry for that. I am learning from your example. I wish you a good and healthy life.

Re-edit: Thanks, Subz! This regex I can understand and learn from. Guys, thanks a lot. You made my day.

 

Edited by Fr33b0w
  • 2 weeks later...

Sorry... I still have some problems with this. It won't process all the files... I tried renaming them and I tried changing the code, but it won't work... It processes 223 of 327 files and I don't know why...

Script I am trying to use is:

#include <String.au3>
#include <Array.au3>


Local $search = FileFindFirstFile("*.info.json")
DirCreate(@ScriptDir & "\comments\")
Local $dir = @ScriptDir & "\comments\"


If $search = -1 Then
    MsgBox($MB_SYSTEMMODAL, "", "Error: No files/directories matched the search pattern.")
    Return False
EndIf


While 1
    Local $file = FileFindNextFile($search)
    If @error Then ExitLoop
    Local $target = StringReplace($file, '.info.json', '.txt')
    Local $InputData = FileRead($file)

    $InputData = StringReplace($InputData, ', "', ',"')
    $InputData = StringReplace($InputData, '": ', '":')
    Local $textArray = _StringBetween($InputData, '"text":', ',"')
    If IsArray($textArray) Then
        For $i = 0 To UBound($textArray) - 1
            FileWriteLine($dir & $target, @CRLF & " * " & $textArray[$i] & @CRLF)
        Next
    EndIf

    Local $timestampArray = _StringBetween($InputData, '"timestamp":', ',"')
    If IsArray($timestampArray) Then
        For $i = 0 To UBound($timestampArray) - 1
            FileWriteLine($dir & $target, @CRLF & " * " & $textArray[$i] & @CRLF)
        Next
    EndIf
    FileClose($dir & $target)
WEnd

Exit

 

I have added the files I am trying to scrape. I left them in the same folder where the target files are. The files are in the attachment. Thanks.

test.zip

Edited by Fr33b0w
I didn't say how many files were processed out of how many were targeted... Brain burnt by the non-working script...

A few suggestions for your script:

1- Use _FileListToArray instead of FileFindFirstFile/FileFindNextFile. You can then use _ArrayDisplay to make sure you got all the files in the array.

2- Your second FileWriteLine should use $timestampArray instead of $textArray.

3- FileClose on a file name is useless (see the help file: it should be a handle).

4- You should add a ConsoleWrite warning when your _StringBetween does not find anything.

5- Adding traces to a script to understand what is going on is the best way to debug...

Edited by Nine
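
The suggestions above can be put together roughly as follows. This is only a sketch, not Nine's own code: the function name _ExtractComments and all variable names are hypothetical, and it assumes that the .info.json files sit in a single folder and that the _StringBetween approach from the accepted solution is kept.

```autoit
#include <File.au3>
#include <FileConstants.au3>
#include <String.au3>

; Hypothetical helper (not from the thread): extracts the "text" and
; "timestamp" values of every *.info.json file in $sInDir into one .txt
; file per input in $sOutDir. Returns the number of files written.
Func _ExtractComments($sInDir, $sOutDir)
    DirCreate($sOutDir)

    ; Suggestion 1: list every matching file up front; element [0] is the count.
    Local $aFiles = _FileListToArray($sInDir, "*.info.json", $FLTA_FILES)
    If @error Then
        ConsoleWrite("No *.info.json files found in " & $sInDir & @CRLF)
        Return 0
    EndIf

    Local $iDone = 0
    For $i = 1 To $aFiles[0]
        Local $sData = FileRead($sInDir & "\" & $aFiles[$i])
        $sData = StringReplace($sData, ', "', ',"')
        $sData = StringReplace($sData, '": ', '":')

        Local $aText = _StringBetween($sData, '"text":', ',"')
        If Not IsArray($aText) Then
            ; Suggestions 4 and 5: trace the files where nothing was found.
            ConsoleWrite("No text found in " & $aFiles[$i] & @CRLF)
            ContinueLoop
        EndIf
        Local $aTime = _StringBetween($sData, '"timestamp":', ',"')

        ; Suggestion 3: open once, write through the handle, close the handle.
        Local $sTarget = $sOutDir & "\" & StringReplace($aFiles[$i], ".info.json", ".txt")
        Local $hOut = FileOpen($sTarget, $FO_OVERWRITE)
        If $hOut = -1 Then
            ConsoleWrite("Cannot open " & $sTarget & @CRLF)
            ContinueLoop
        EndIf
        For $j = 0 To UBound($aText) - 1
            FileWriteLine($hOut, " * " & $aText[$j])
        Next
        ; Suggestion 2: the second loop writes the timestamps, not the text again.
        If IsArray($aTime) Then
            For $j = 0 To UBound($aTime) - 1
                FileWriteLine($hOut, " * " & $aTime[$j])
            Next
        EndIf
        FileClose($hOut)
        $iDone += 1
    Next
    Return $iDone
EndFunc   ;==>_ExtractComments

ConsoleWrite(_ExtractComments(@ScriptDir, @ScriptDir & "\comments") & " file(s) written" & @CRLF)
```

The console trace lists every file in which no "text" field was found, which should help narrow down why only some of the files are processed.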

Thanks. I decided to use the second example, which I can see is better, but it is far beyond my level of knowledge. It works even better than the first one, but it is much harder to play with. This way it looks like the script is playing with me... The problem is that in this case I can't add @CRLF after every piece of text that is found, and I don't know how to do that. I tried using StringReplace to replace every @CRLF with two, so that I would get a blank line after every piece of text found... But I am not good with arrays and RegEx... Got nothing... I am still using FileFindFirstFile/FileFindNextFile instead of _FileListToArray as you suggested, but that's only because I would like to make this code work in territory where I am more comfortable; after that I could try to do it the other way. It's just that... for someone this is a piece of cake, and for me it is the rest of that cake... How do I add a @CRLF or @CR that will work?

 

#include <String.au3>
#include <Array.au3>
#include <File.au3>


Local $search = FileFindFirstFile("*.info.json")
DirCreate(@ScriptDir & "\comments\")
Local $dir = @ScriptDir & "\comments\"


If $search = -1 Then
    MsgBox($MB_SYSTEMMODAL, "", "Error: No files/directories matched the search pattern.")
    Return False
EndIf

While 1
    Local $file = FileFindNextFile($search)
    If @error Then ExitLoop
    Local $target = StringReplace($file, '.info.json', '.txt')
    Local $InputDataa = FileRead($file)

    Global $InputDatab = StringRegExp($InputDataa, '(?<=\"text\": \").*?(?=\", \"timestamp\")', 3)
    ;_ArrayDisplay($InputDatab)
    _FileWriteFromArray($dir & $target, $InputDatab, 1)

WEnd

Exit

 


5 minutes ago, Nine said:

Replace your _FileWriteFromArray line by this one:

FileWriteLine($dir & $target, _ArrayToString($InputDatab, "|"))

 

I had seen the default array delimiter in the help file but wasn't sure how to use it. It now replaces the existing carriage returns with "|". Any tip for that?

 

So, from:

Line 1

Line 2

Line 3

 

I am getting Line 1|Line 2|Line 3

 

 

 

Edited by Fr33b0w
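
One way to get line breaks instead of "|" is to pass a different delimiter to _ArrayToString. The following is a sketch under assumptions: it expects a 1D array such as the one StringRegExp returns with flag 3, and the sample string and variable names are made up for illustration.

```autoit
#include <Array.au3>
#include <StringConstants.au3>

; Sample input in the same shape as the test string used earlier in the thread.
Local $sJson = '"text": "Line 1", "timestamp": 1, "text": "Line 2", "timestamp": 2, "text": "Line 3", "timestamp": 3, "'
Local $aLines = StringRegExp($sJson, '(?<="text": ").*?(?=", "timestamp")', $STR_REGEXPARRAYGLOBALMATCH)

; Join with two line breaks so each extracted text is followed by a blank line.
Local $sOut = _ArrayToString($aLines, @CRLF & @CRLF)
ConsoleWrite($sOut & @CRLF)

; In the script from the thread, the write call would then become:
; FileWriteLine($dir & $target, _ArrayToString($InputDatab, @CRLF & @CRLF))
```

Using a single @CRLF as the delimiter gives one comment per line without the blank line in between. By default _ArrayToString starts at element 0, whereas the earlier _FileWriteFromArray call was given a start index of 1.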

  • 2 years later...

Hi, sorry for bumping an old post, but I have a problem again because the site's code changed. Everything worked fine, but now there is a new field that stops this regex from working. Instead of "author_id" as the closing marker, there is now sometimes "like_count"; "author_id" is still there, but it comes after more data that I don't need to extract. I tried using a delimiter in the RegEx, but I guess regex is not easy for me... Can someone give me a suggestion for a regex that says: get the text from here to (here or here)? I tried it like this:

 

Global $InputDatab = StringRegExp($InputDataa, '(?<=\"text\": \").*?(?=\", \"author_id\|like_count\")', 3)

...but it didn't work. The line I used before this one took the data from "text": to "timestamp":

Global $InputDatab = StringRegExp($InputDataa, '(?<=\"text\": \").*?(?=\", \"timestamp\")', 3)

Here is an example of text which is in .info.json:

Quote

"text": "8 hours later the Fire HD 8 is  $109. 99. I wish I would have gotten to watch this earlier. :( \nThanks for all you do Matt even if I'm late to the party.", "like_count": 1, "author_id": "UCWFKQey1WtCgGyxHPMhPtGQ", "author": "@kaceycampbell5550", "author_thumbnail": "https://yt3.ggpht.com/ytc/AOPolaSLWOprKke3uCsTselIrClAYoEM8RqDNcgadJvxBg=s176-c-k-c0x00ffffff-no-rj", "parent": "root", "_time_text": "2 years ago", "timestamp": 1630454400, "author_url": "https://www.youtube.com/channel/UCWFKQey1WtCgGyxHPMhPtGQ", "author_is_uploader": false, "is_favorited": false}, {"id": "Ugx1HNwbzMS9V0pUrgN4AaABAg", "text": "Lmao that first product is definitely photoshopped ≡ƒÿé", "like_count": 1, "author_id": "UCzeJMeX2bFwqvs9IJGKorfQ", "author": "@mrhappygoluckyjock", "author_thumbnail": "https://yt3.ggpht.com/ytc/AOPolaTRITy2x4xoy7aYMgIpyvmdF-ixQlv9thvtg7To=s176-c-k-c0x00ffffff-no-rj", "parent": "root", "_time_text": "2 years ago", "timestamp": 1630454400, "author_url": "https://www.youtube.com/channel/UCzeJMeX2bFwqvs9IJGKorfQ", "author_is_uploader": false, "is_favorited": false}, {"id": "Ugx1HNwbzMS9V0pUrgN4AaABAg.9HYyh7phVcw9HbAfkT-Wfr", "text": "Really, how can you tell? 
Genuinely asking, it looks too good to me", "author_id": "UCzTLWlN4pDD1jLiJJLVrfDA", "author": "@kikikiki3216", "author_thumbnail": "https://yt3.ggpht.com/ytc/AOPolaQiA8_KkqCrK7o7WNNL5qLk3C-PrOy1S591OQ=s176-c-k-c0x00ffffff-no-rj", "parent": "Ugx1HNwbzMS9V0pUrgN4AaABAg", "_time_text": "2 years ago", "timestamp": 1630454400, "author_url": "https://www.youtube.com/channel/UCzTLWlN4pDD1jLiJJLVrfDA", "author_is_uploader": false, "is_favorited": false}, {"id": "UgzhitiBqqS5dUzDfIZ4AaABAg", "text": "That echo auto does not have good user reviews", "author_id": "UC8Krza6o2IbS9zTGjYgd4jA", "author": "@soupedkid13", "author_thumbnail": "https://yt3.ggpht.com/ytc/AOPolaQbtLnxh1qhqgYU8i3LsO_6qE8lCRmBbV_OJ6f-=s176-c-k-c0x00ffffff-no-rj", "parent": "root", "_time_text": "2 years ago", "timestamp": 1630454400, "author_url": "https://www.youtube.com/channel/UC8Krza6o2IbS9zTGjYgd4jA", "author_is_uploader": false, "is_favorited": false}, {"id": "UgyWmQAcvH3gCxWMz9x4AaABAg", "text": "Merry Christmas , thank you for your videos and energy", "author_id": "UCOcfr_BebW1QqXpTI-PNEaQ", "author": "@teresafinnerty207", "author_thumbnail": "https://yt3.ggpht.com/ytc/AOPolaQDSwM9-eRu5aBKVVC1bh4xx4A6LoH2Vaompo-j=s176-c-k-c0x00ffffff-no-rj", "parent": "root", "_time_text": "2 years ago", "timestamp": 1630454400, "author_url": "https://www.youtube.com/channel/UCOcfr_BebW1QqXpTI-PNEaQ", "author_is_uploader": false, "is_favorited": false}, {"id": "UgyWmQAcvH3gCxWMz9x4AaABAg.9HYtClhcvg19HYwMNLud3_", "text": "Thanks for being here Teresa!", "author_id": "UC5Qbo0AR3CwpmEq751BIy0g", "author": "@thedealguy", "author_thumbnail": "https://yt3.ggpht.com/PHbn_ZwKQ-3PPhTtF7k6Q5t-vGBnENCPZAQc9lNe-EGCeJJ8T5DgbNIvGSSmFNVUrOCV6l3q=s176-c-k-c0x00ffffff-no-rj", "parent": "UgyWmQAcvH3gCxWMz9x4AaABAg", "_time_text": "2 years ago", "timestamp": 1630454400, "author_url": "https://www.youtube.com/channel/UC5Qbo0AR3CwpmEq751BIy0g", "author_is_uploader": true, "is_favorited": false, "author_is_verified": true}, {"id": "U

 

So now there are two strings that can close the text:   ', "like_count":' and ', "author_id":'

 

How can I write a RegEx that does what I want? I tried on my own with examples I found online, but it does not work...  Again, many thanks in advance for this.

 

 

 

 

Sorry, I just tried a bit more and solved the problem. The correct line is:

Global $InputDatab = StringRegExp($InputDataa, '(?<=\"text\": \").*?(?=\", \"author_id|\", "like_count\")', 3)

Thanks, sorry!

Edited by Fr33b0w
I had a problem that I couldn't solve, but while waiting for an answer I had an idea and... solved it myself.
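
For reference, the first attempt fails because \| is an escaped pipe: it matches a literal "|" character instead of acting as "or". The corrected line works because the unescaped | splits the lookahead into two complete alternatives. A slightly tidier variant, shown here only as a sketch with made-up sample data and variable names, groups the two field names so the shared prefix is written once:

```autoit
#include <StringConstants.au3>

; Sample shaped like the newer .info.json files: "text" is followed by
; either "like_count" or "author_id".
Local $sJson = '{"id": "a", "text": "first comment", "like_count": 1, "author_id": "X"}, ' & _
        '{"id": "b", "text": "second comment", "author_id": "Y"}'

; (?:like_count|author_id) groups the alternatives, so the shared prefix
; ", " is written only once. The | must stay unescaped to mean "or".
Local $sPattern = '(?<="text": ").*?(?=", "(?:like_count|author_id)")'
Local $aText = StringRegExp($sJson, $sPattern, $STR_REGEXPARRAYGLOBALMATCH)

If IsArray($aText) Then
    For $i = 0 To UBound($aText) - 1
        ConsoleWrite($aText[$i] & @CRLF)
    Next
EndIf
```

With the sample above this prints "first comment" and "second comment", each on its own line. The pattern is written without the \" escapes because, inside a single-quoted AutoIt string, a plain " already matches itself.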

Hi Nine, and thanks for trying to help. This version of your solution leaves       ", "like_count"   and   , "author_id"   after every line. I am very bad at regex, so I don't know why, but I would like to see if you can correct it, because your solution looks much clearer to me.


Oh, thanks for that. I am looking forward to checking out those UDFs. I have not been around much lately. I have to start learning RegEx the proper way, but I also like what you said about JSON UDFs...
