Jump to content

StringBetween capturing multiple lines


Recommended Posts

Hello all,

First time posting; please excuse my noobness as I am self-taught over a short period with very little programming experience.  However, I am trying to create a simple program that retrieves college basketball scores from the web, puts them into an array and writes to excel.  Ultimately I want to create an additional variable that would be a date range so the program can cycle through many pages and retrieve years worth of information at a time.  For now, I am working on a single day.  Here is where I am stuck:  I want to use _StringBetween to isolate team names while considering whether they are "winners" or "losers".  Here is a snippet from the source code i am dealing with:

<tr class="winner">
  <td><a href="/teams/MIN/2020.html">Minnesota</a></td>
  <td class="right">118</td>
  <td class="right gamelink">
  <a href="/boxscores/202001050CLE.html">Final</a>
   
  </td>
  </tr>
  <tr class="loser">
  <td><a href="/teams/CLE/2020.html">Cleveland</a></td>
  <td class="right">103</td>
  <td class="right">&nbsp;
  </td>
  </tr>
  </tbody>
 

</table>

 

My thought was to use _Stringbetween( '<tr class ="loser> <td>' & @CRLF & '<td><a href="/teams/', '</a></td>')  

However, i believe I am using the @CRLF out of context.  Is there another way to identify page breaks within the stringbetween function?  Complete script is attached for reference.  Appreciate the help and patience as I try to piece it all together :)  

Muchos gracias.

 

 

basketball scores x2.au3

Link to comment
Share on other sites

posting script:

#include <Array.au3>
#include <Excel.au3>
#include <String.au3>
#include <MsgBoxConstants.au3>

Global $g_bPaused = False
HotKeySet("{ESC}", "Terminate")
Func Terminate()
    Exit
 EndFunc   ;==>Terminate

; Create application object and create a new workbook
Local $oExcel = _Excel_Open()
If @error Then Exit MsgBox($MB_SYSTEMMODAL, "Excel UDF: _Excel_RangeWrite Example", "Error creating the Excel application object." & @CRLF & "@error = " & @error & ", @extended = " & @extended)
Local $oWorkbook = _Excel_BookNew($oExcel)
If @error Then
    MsgBox($MB_SYSTEMMODAL, "Excel UDF: _Excel_RangeWrite Example", "Error creating the new workbook." & @CRLF & "@error = " & @error & ", @extended = " & @extended)
    _Excel_Close($oExcel)
    Exit
EndIf

$Source = BinaryToString(INetRead('https://www.basketball-reference.com/boxscores/?month=01&day=5&year=2020',1))

$Teams = _StringBetween($Source,'<td><a href="/teams/','/2020.html">')

For $a = 0 To UBound($Teams) - 1
   _Excel_RangeWrite($oWorkbook, $oWorkbook.Activesheet, $Teams, "A1", Default, True)
   Sleep(25)
Next
;;;;^Pulls teams array; writes to excel^;;;

Link to comment
Share on other sites

What does the result look like? Or what exactly is the problem?

I'm not familiar with the range write function but I don't believe that it needs looped.  Is it writing the same thing multiple times in each cell?

Edited by markyrocks
Link to comment
Share on other sites

In this particular case the result is pulling each item twice, as the strings chosen are not quite unique.  My goal is to be able to identify a two-line single string from the source code.  Is this possible?

If you have other suggestions, I am also brainstorming other approaches -- 

Looking at this section, 

</tr>
  <tr class="loser">
  <td><a href="/teams/CLE/2020.html">Cleveland</a></td>
  <td class="right">103</td>
  <td class="right">&nbsp;
  </td>
  </tr>
  </tbody>
 

</table>

 

 

I want to isolate "CLE", while using the beginning string of <tr class="loser> and ending string of /2020.html"

So the ouput would read, "<td><a href="/teams/CLE"  rather than just "CLE"

I was hoping there is a simple solution, such as extending the beginning string to include parts from both lines, but that does not seem to work.

Other thoughts are using _StringExplode, or defining a 2nd variable -- I am having no luck with any of these methods.


 

Link to comment
Share on other sites

18 minutes ago, markyrocks said:

What does the result look like? Or what exactly is the problem?

I'm not familiar with the range write function but I don't believe that it needs looped.  Is it writing the same thing multiple times in each cell?

Rangewrite is working fine in the loop, just the stringbetween giving me issues. 

Link to comment
Share on other sites

I would suggest you use IE udf to extract the tables from the web site instead of using String functions.  You would have much more control over it, and would save you lot of coding. 

Link to comment
Share on other sites

5 minutes ago, Nine said:

I would suggest you use IE udf to extract the tables from the web site instead of using String functions.  You would have much more control over it, and would save you lot of coding. 

I would venture to guess that you are correct, thank you for pointing me in a better direction, I will explore.  

Link to comment
Share on other sites

You can loop through the teams array and compare each line to the rest and if they're the same just delete the duplicate via 

For $x=1 to ubound( teams)-1

For $y=$x+1 to ubound( teams)-1

If teams[$x]=teams[$y] then

Teams[$y]=""

Endif

 

Next

Next

 

You can then sort through and use string replace to delete any Parts of the line that reoccur on a pattern or stringregex but that is definitely not for beginers

Youd Also need get rid of empty array slots.  I'm not sure how to do that exactly but I'm sure it's not too difficult 

 

Sry edit 

Edited by markyrocks
Link to comment
Share on other sites

@SOCCERMOMMY  That should get you fast started :

#include <Constants.au3>
#include <IE.au3>
#include <Array.au3>

Opt ("MustDeclareVars", 1)

Local $oIE = _IECreate ("https://www.basketball-reference.com/boxscores/?month=01&day=5&year=2020")
Local $oEast = _IEGetObjById ($oIE, "divs_standings_E")
Local $aEast = _IETableWriteToArray ($oEast, True)
_ArrayDisplay ($aEast)
Local $oWest = _IEGetObjById ($oIE, "divs_standings_W")
Local $aWest = _IETableWriteToArray ($oWest, True)
_ArrayDisplay ($aWest)

Enjoy !

Link to comment
Share on other sites

3 minutes ago, Nine said:

@SOCCERMOMMY  That should get you fast started :

#include <Constants.au3>
#include <IE.au3>
#include <Array.au3>

Opt ("MustDeclareVars", 1)

Local $oIE = _IECreate ("https://www.basketball-reference.com/boxscores/?month=01&day=5&year=2020")
Local $oEast = _IEGetObjById ($oIE, "divs_standings_E")
Local $aEast = _IETableWriteToArray ($oEast, True)
_ArrayDisplay ($aEast)
Local $oWest = _IEGetObjById ($oIE, "divs_standings_W")
Local $aWest = _IETableWriteToArray ($oWest, True)
_ArrayDisplay ($aWest)

Enjoy !

Wow, that is great, much cleaner!!  Thank you Nine!!  

Link to comment
Share on other sites

String funcs are not so bad...  :)

#include <Array.au3>

Local $all[0][3]
$teams = StringRegExp($Source, '(?s)<table class="teams">(.*?)</table>', 3)
For $i = 0 to UBound($teams)-1
   $res = StringRegExp($teams[$i], '(?s)(winner|loser).*?html">([^<]+).*?(\d+)', 3)
   $tmp = $res[0] & "|" & $res[1] & "|" & $res[2] & @crlf & $res[3] & "|" & $res[4] & "|" & $res[5] & @crlf
  _ArrayAdd($all, $tmp)
Next
_ArrayDisplay($all)

 

Link to comment
Share on other sites

On 1/6/2020 at 3:53 PM, mikell said:

String funcs are not so bad...  :)

#include <Array.au3>

Local $all[0][3]
$teams = StringRegExp($Source, '(?s)<table class="teams">(.*?)</table>', 3)
For $i = 0 to UBound($teams)-1
   $res = StringRegExp($teams[$i], '(?s)(winner|loser).*?html">([^<]+).*?(\d+)', 3)
   $tmp = $res[0] & "|" & $res[1] & "|" & $res[2] & @crlf & $res[3] & "|" & $res[4] & "|" & $res[5] & @crlf
  _ArrayAdd($all, $tmp)
Next
_ArrayDisplay($all)

 

Could you point me in the direction of some good reg ex tutorials? 

Link to comment
Share on other sites

3 hours ago, SOCCERMOMMY said:

Could you point me in the direction of some good reg ex tutorials? 

Preliminary : AutoIt uses the PCRE engine ("Perl-Compatible Regular Expressions"). So tutorials about PCRE in general can be useful, even if they are not directly related to AutoIt.

1. The AutoIt-Help for StringRegExp  offers a quite good introduction (possibly a bit unstructured ;)).

2. You should definitely check out the following website : https://regex101.com/  (online regex tester and debugger)

3. I have never used it, but this might help you as well :

 

Musashi-C64.png

"In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move."

Link to comment
Share on other sites

FYI in the helpfile, at the bottom of the "AutoIt/Tutorials/String Regular Expression" page, there is a button which links to a cute integrated tester  

Here are some details about the patterns I used previously (more explanations for the syntax in the helpfile, StringRegExp page)
In both cases flag 3 ($STR_REGEXPARRAYGLOBALMATCH : return an array of global matches)

'(?s)<table class="teams">(.*?)</table>'

(?s) : dot-all option, to make the dot match newlines
<table class="teams"> : literal string
(.*?) : group to match, 0 or more chars, lazy
</table> : literal string
This pattern acts the same way _StringBetween() does :)


'(?s)(winner|loser).*?html">([^<]+).*?(\d+)'

(?s) : dot-all option, to make the dot match newlines
(winner|loser) : 1st group to match, alternation between 2 words (winner or loser)
.*? : 0 or more chars, lazy
html"> : literal string
([^<]+) : 2nd group to match, one or more chars which are not a <
.*? : 0 or more chars, lazy
(\d+) : 3rd group to match, one or more digits

^_^

Link to comment
Share on other sites

At this point I have dabbled with a little bit of what everyone has left on here and come up with something more fitting.  This is my working script: 

#include <Constants.au3>
#include <IE.au3>
#include <Array.au3>
#include <Excel.au3>
#include <String.au3>

Opt ("MustDeclareVars", 1)

Local $oIE = _IECreate ("https://www.basketball-reference.com/boxscores/202001010LAL.html")

Local $oWB = _IEGetObjById ($oIE, "box-LAL-game-basic")
Local $aWB = _IETableWriteToArray ($oWB, True)

Local $oWA = _IEGetObjById ($oIE, "box-LAL-game-advanced")
Local $aWA = _IETableWriteToArray ($oWA, True)

Local $oLB = _IEGetObjById ($oIE, "box-PHO-game-basic")
Local $aLB = _IETableWriteToArray ($oLB, True)

Local $oLA = _IEGetObjById ($oIE, "box-PHO-game-advanced")
Local $aLA = _IETableWriteToArray ($oLA, True)

Local $statsW[0][36]
Local $statsL[0][36]

_ArrayAdd($statsW,$aWB)
_ArrayAdd($statsW,$aWA)
_ArrayAdd($statsL,$aLB)
_ArrayAdd($statsL,$aLA)

Local $oExcel = _Excel_Open()
Local $oWorkbook = _Excel_BookNew($oExcel)

_Excel_RangeWrite($oWorkbook, $oWorkbook.Activesheet, $statsW, "A1", Default, True)
_Excel_SheetAdd($oWorkbook)
_Excel_RangeWrite($oWorkbook, $oWorkbook.Activesheet, $statsL, "A1", Default, True)

This script works for gathering the data I want from one webpage, now my question becomes;  How would one go about Inserting variables into the website location where it is currently reads "20200201010LAL.HTML"  ?  As, the rest of the string is the same,  I figure if i can turn that segment into a variable, I can begin to pull all dates and games simultaneously.   

Additionally, I now am using "LAL" and "PHO" for locating the correct tables to pull in the command line _IEGetObjById.   I am unsure how to fill the variable into the middle of the "object"

I started writing that section of the code as so:

#include<Array.au3>

Global $Source = BinaryToString(INetRead('https://www.basketball-reference.com/boxscores/202001010LAL.html',1))
Local $games[0][3]
Local $teams = StringRegExp($Source, '(?s)<tr class=.thead.>(.*?)</table>', 3)
For $i = 0 to UBound($teams)-1
   Local $result = StringRegExp($teams[$i], '(?s).*?html">([^<]+)', 3)
   Local $tmp = $result[0] & "|" & $result[1] & @crlf
   _ArrayAdd($games, $tmp)
Next

_ArrayDisplay($games)

However, you will see that it pulls more information than necessary.  This is due to using the  .thead.  to represent what is 'thead' in the source code, but also is locating "thead"  and possible some other variations. 

edit:  i realize that defining the team variables in this step is contingent on the first defined variable, my idea is to use a GUI or pre-defined spreadsheet to manipulate the first variable.  

thanks for your help in any of these areas

Edited by SOCCERMOMMY
Link to comment
Share on other sites

$variable="some other value"

Local $oIE = _IECreate ("https://www.basketball-reference.com/boxscores/" & $variable)

$var="something else"

Local $oWA = _IEGetObjById ($oIE, "box-" & $var & "L-game-advanced")

 

Edited by markyrocks
Link to comment
Share on other sites

Before giving you a precise direction, I would like to know what is the goal of your script ?  To gather all game results for 1 team ? All teams ? During a certain period of time ? 

The challenge here is to determine how you will be navigating from one web page to the other.  In other words, if you were to go thru multiple pages, how manually would you do it ?  We can replicate the navigation with IE and then capture all the tables you need along the way.  But to help you see what I mean, here an example of how I navigate thru 4 games :

#include <Constants.au3>
#include <IE.au3>
#include <Array.au3>
#include <Excel.au3>

Opt ("MustDeclareVars", 1)

Local $aMatch [] = ["202001010LAL", "202001030LAL", "202001050LAL", "202001070LAL"] ; this is the challenge - how to get those game strings ?
Local $oDivs, $oTable, $aTable
Local $oIE = _IECreate ()
For $i = 0 To UBound($aMatch)-1
  _IENavigate ($oIE, "https://www.basketball-reference.com/boxscores/" & $aMatch[$i] & ".html")
  $oDivs = _IETagNameGetCollection ($oIE, "div")
  For $oDiv in $oDivs
    If StringRegExp ($oDiv.id, "all_box-[A-Z]{3}-game-.+") And $oDiv.className = "table_wrapper table_controls" Then
      ConsoleWrite ($oDiv.id & @CRLF)
      $oTable = $oDiv.GetElementsByTagName("table").item(0)
      $aTable = _IETableWriteToArray ($oTable, True)
      _ArrayDisplay ($aTable, $oDiv.id)
    EndIf
  Next
Next

 

Edited by Nine
Link to comment
Share on other sites

10 hours ago, Nine said:

Before giving you a precise direction, I would like to know what is the goal of your script ?  To gather all game results for 1 team ? All teams ? During a certain period of time ? 

The challenge here is to determine how you will be navigating from one web page to the other.  In other words, if you were to go thru multiple pages, how manually would you do it ?  We can replicate the navigation with IE and then capture all the tables you need along the way.  But to help you see what I mean, here an example of how I navigate thru 4 games :

#include <Constants.au3>
#include <IE.au3>
#include <Array.au3>
#include <Excel.au3>

Opt ("MustDeclareVars", 1)

Local $aMatch [] = ["202001010LAL", "202001030LAL", "202001050LAL", "202001070LAL"] ; this is the challenge - how to get those game strings ?
Local $oDivs, $oTable, $aTable
Local $oIE = _IECreate ()
For $i = 0 To UBound($aMatch)-1
  _IENavigate ($oIE, "https://www.basketball-reference.com/boxscores/" & $aMatch[$i] & ".html")
  $oDivs = _IETagNameGetCollection ($oIE, "div")
  For $oDiv in $oDivs
    If StringRegExp ($oDiv.id, "all_box-[A-Z]{3}-game-.+") And $oDiv.className = "table_wrapper table_controls" Then
      ConsoleWrite ($oDiv.id & @CRLF)
      $oTable = $oDiv.GetElementsByTagName("table").item(0)
      $aTable = _IETableWriteToArray ($oTable, True)
      _ArrayDisplay ($aTable, $oDiv.id)
    EndIf
  Next
Next

 

The goal of the script is to obtain the entire season worth of data, all teams, all games.  Since the strings are based on the date + the initials of the home team, I thought that using excel to create a list of dates/home teams and then concatenating the cells together would prove fairly easy to get the complete list of strings.  At which point I figure I would use the rangeread function and use a for loop to cycle through.  

Link to comment
Share on other sites

You do not need to quote everything I said, I know what I have written.  Quotes are essentially to extract part of a post to enhance its pertinence.

Then I suppose my script is relevant to your needs.  You just have to modify the arrays to fit it with your excel writes.

Good luck !

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

×
×
  • Create New...