KarlosTHG

WebPage parser


Hi again guys,

I study at Universidad EAFIT in Colombia :D. The university gives each student a username and password for a platform where we can see our quiz and assignment results. I need to write a script that downloads the page and parses it to fill a GUI with the results. The second part is easy, but I have no idea how to download the page automatically, because it asks for the authentication info. How can I do that?

Sorry for my English, I have only taken a few classes. :D

Any help would be appreciated. Thanks! :D




Use the InetGet function.

To use a username and password when connecting, simply prefix the server name with "username:password@", e.g.

"http://myuser:mypassword@www.somesite.com"

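A minimal sketch of that call, assuming the placeholder host and credentials from the example above. Note that this URL form only helps when the server itself asks for HTTP authentication; a site that logs you in through an HTML form will most likely ignore it.

```autoit
; Sketch: download one page with the credentials embedded in the URL.
; Host and credentials are placeholders, not a real site.
Local $sUrl  = "http://myuser:mypassword@www.somesite.com/"
Local $sFile = @TempDir & "\page.html"

; Remove any old copy so FileExists below reflects this download only.
FileDelete($sFile)

; Option 1 forces a reload from the remote site instead of the local cache.
InetGet($sUrl, $sFile, 1)

If FileExists($sFile) Then
    ConsoleWrite("Saved to " & $sFile & @CRLF)
Else
    ConsoleWrite("Download failed" & @CRLF)
EndIf
```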

That doesn't work: it just downloads the login page. So I tried something different.

I opened the page in Firefox and viewed the page source:

http://webapps.eafit.edu.co/ulises/login.do

<html:html>
  <head>
    <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"/>
<meta http-equiv="Cache-Control" Content="no-cache"/>
<meta http-equiv="Pragma" Content="no-cache"/>
<meta http-equiv="Expires" Content="0"/>
    
    <title>
      Sistema de Admisiones y Registro
    </title>

    <link href="http://webapps.eafit.edu.co/imagenes/v1/styles/estiloEafit2005.css" type="text/css" rel="stylesheet"/>
  </head>
  <body topmargin="0" leftmargin="0">
<script src="http://www.google-analytics.com/urchin.js" type="text/javascript">
</script>
<script type="text/javascript">
_uacct = "UA-1439547-2";
urchinTracker();
</script> 
     


<table  class="noPrint" width="100%" border="0" cellpadding="0" cellspacing="0" background='http://webapps.eafit.edu.co/imagenes/v1/apps/ulises/fondo.gif' bgcolor='#E5E5E5'>
  <tr>
    <td width="182"><img src='http://webapps.eafit.edu.co/imagenes/v1/apps/ulises/logoEafit.gif' border="0"/></td>
    <td align="center"><img src='http://webapps.eafit.edu.co/imagenes/v1/apps/ulises/titulo.gif' border="0"/></td>

    <td width="186">
      <table width="100%" border="0" cellspacing="0" cellpadding="0">
        <tr>
          <td align="right"><a href='http://www.eafit.edu.co' target="_blank"><img src='http://webapps.eafit.edu.co/imagenes/v1/apps/ulises/homeEafit.gif' border="0"/></a></td>
        </tr>
        <tr>
          <td align="right"><img src='http://webapps.eafit.edu.co/imagenes/v1/apps/ulises/imagenTop.gif' border="0"/></td>
        </tr>
      </table></td>

  </tr>
</table>

    







<script type="text/javascript">
  function entrar()
  {
    document.getElementById('loginId').style.display = "none";          
        document.getElementById('imagen').style.display = "";           
        loginForm.submit();
  }
  
  function load()
  {
    document.getElementById('loginId').style.display = "";          
        document.getElementById('imagen').style.display = "none";
  }
  
  
    function hideElement(elementId)
    {
      document.getElementById(elementId).style.display = "none";        
    }
    
    function showElement(elementId)
    {
      document.getElementById(elementId).style.display = "";        
    }
  
    function logueoPorDocumento(){
        if(document.loginForm.tipo.value != '6'){
            hideElement('login');
            showElement('msnDocumento');
            hideElement('msnLogin');
            hideElement('msnLogin2');
            showElement('tdcto');
            showElement('ndcto');
            hideElement('clave');
            showElement('clave');
            hideElement('entrar');
            showElement('entrarRecordar');
           
            document.loginForm.tipo.value = '6';
        }
    }
    
    function logueoPorlogin(){
        if(document.loginForm.tipo.value != '3'){
            showElement('login');
            showElement('msnLogin');
            showElement('msnLogin2');
            hideElement('msnDocumento');
            hideElement('tdcto');
            hideElement('ndcto');
            hideElement('clave');
            showElement('clave');
            hideElement('entrarRecordar');
            showElement('entrar');
            
            document.loginForm.tipo.value = '3';
        }
    }
    function recordarClave(){
        document.loginForm.action= "/ulises/user-search.do";
        document.loginForm.submit();
    }
</script>
<!-- T&iacute;tulo de la P&aacute;gina -->
<p align="center" class="titulo">
  Ingreso al Sistema
</p>
<!-- Formulario -->
<form name="loginForm" method="post" action="/ulises/login-submit.do">
  <!-- Muestra los errores generados por la validaci&oacute;n del formulario -->
  

<div id="loginId" align="center" style='display=""'>
  <table border="0" cellspacing="0" cellpadding="1" align="center">
    <!-- Referencia al atributo tipo del formulario -->
    <input type="hidden" name="tipo" value="3">
    <tr id="login">
        <td>
            <!-- Referencia al label del ApplicationResources -->
            Usuario (*):
        </td>
        <td>

            <!-- Referencia al atributo login del formulario -->
            <input type="text" name="login" maxlength="30" size="15" value="">
        </td>
    </tr>
    <tr id="tdcto" style="display:none">
        <td>
            <!-- Referencia al label del ApplicationResources -->
            Tipo de Documento: 
        </td>

        <td>
            <select name="tipoDcto"><option value="CC">Cédula De Ciudadania</option>
<option value="CE">Cédula De Extranjeria</option>
<option value="CO">Código De Estudiante</option>
<option value="CG">Guatemala - Cédula De Ciudadanía</option>
<option value="NU">Nro Único Identificación Personal</option>
<option value="OT">Otro</option>
<option value="PP">Pasaporte</option>

<option value="RE">Registro</option>
<option value="TI">Tarjeta De Identidad</option></select>
        </td>
    </tr>
    <tr id="ndcto" style="display:none">
        <td>
            <!-- Referencia al label del ApplicationResources -->
            No. de Documento: 
        </td>

        <td>
            <input type="text" name="nroDcto" maxlength="12" size="15" value="">
        </td>
    </tr>
    <tr id="clave">
        <td>
            <!-- Referencia al label del ApplicationResources -->
            Clave (*): 
        </td>

        <td>
            <input type="password" name="clave" maxlength="30" size="15" value="">
        </td>
    </tr>
    <tr id="entrar">
        <td colspan="2">
            <div align="center">
              <a href="javascript:entrar();">
                <input type="image" class="form" src="http://webapps.eafit.edu.co/imagenes/v1/botones/btn_entrar.gif" onclick="javascript:entrar();"/>

              </a>
            </div>
        </td>
    </tr>
    <tr id="entrarRecordar" style="display:none">
        <td colspan="2">
            <div align="center">
               <input type="image" name="" src="http://webapps.eafit.edu.co/imagenes/v1/botones/btn_entrar.gif" border="0" class="form" alt="Entrar">
               <a href="javascript:recordarClave();">

                    <img alt="Recordar Clave" border="0" src="http://webapps.eafit.edu.co/imagenes/v1/botones/btn_recordarClave.gif"/>
                </a>
            </div>
        </td>
    </tr>
  </table>
  <p/>  
  <div align="center" id='msnLogin2'>
      <a href="javascript:logueoPorDocumento();" class="msgMensaje">

          Si no recuerda o no tiene asignado su logín por favor de click aquí para logearse con tipo y número de documento de identidad.
     </a>
  </div>
  <div align="center" class="mini" id='msnLogin'>    
    <br/>
    Use el login y la clave que tenga para entrar al correo electrónico asignado por la Universidad     
    <p/>
    Los campos marcados con * son obligatorios
    <p/>
  </div>
  <div align="center" class="mini" id='msnDocumento' style="display:none">

    Use su tipo y número de documento de identidad y la clave que tiene asignada en el sistema.
    <br/>
    <a href="javascript:logueoPorlogin();">
        Si desea ingresar con el login y clave que tiene asignadas de click aquí 
    </a>
    <p/>
    Los campos marcados con * son obligatorios
    <p/>
  </div>
</div>

  
  <div align="center" class="mini">
| <a href="/ulises/comentarios.do">Comentarios y Sugerencias</a> |
<br/>Universidad EAFIT: Tel&eacute;fono: (57) (4) - 2619500| Dirección: Carrera 49 - 7 Sur 50 | Medellín - Colombia - Suramérica
<br/>&copy; Copyright 2007 Universidad EAFIT &reg; Todos los Derechos Reservados - Centro de Inform&aacute;tica<br/>

 Fecha Actualizaci&oacute;n: 2009-09-09<br/>
Utilice <a href='http://www.microsoft.com/downloads/details.aspx?displaylang=es&FamilyID=1e1550cb-5e5d-48f5-b02b-20b602228de6' target='_blank'>Internet Explorer 6.0</a> o una versi&oacute;n superior de este navegador.</div>

  <script type="text/javascript">
    document.loginForm.login.focus();    
  </script>
</form>
<div id="imagen" align="center" style='display:none'>
  <img border="0" src="http://webapps.eafit.edu.co/imagenes/v1/icons/ico_animado.gif"/>    

</div>

  </body>
</html:html>

Searching through the code, I found an interesting line:

<form name="loginForm" method="post" action="/ulises/login-submit.do">

with this comment right after it:

<!-- Muestra los errores generados por la validaci&oacute;n del formulario -->

which translates to:

"Shows the errors generated by the form validation"

I think it may be PHP or something like that.

Is there a way to find out the syntax of the POST request for login-submit.do?

Thanks


When you open a page in your browser, it is essentially downloaded to your machine and the browser gives you an interface to work with it. I am confused by your question. Once you download the page, how do you plan to work with it and what do you intend to do with the results?

Dale




I want to write a script that shows my schedule, quiz results and other info in my own way, like an offline version of the platform. So I need to download the page, which is login-protected by some kind of PHP, parse it, save it to the hard disk, and show the retrieved information in my application's GUI. I can do everything except the page download :D

I hope I explained myself well.

Thanks


Ok, so you download the page, but instead of the page you wanted you get a login page? You want to parse the login page or something and build an HTTP request based on that? I fail to see where this is going.


I don't need the login page; I need a page that comes after the login. But when I use InetGet on the page I need, I always get the login page source, because I don't know how to pass my login info in the script to download that page.

I don't think I can explain myself any better, jaja :D

Bye and thanks


Read over DaleHohm's reply again and pay close attention.

What is happening when you try to download the page directly:

- The page you are trying to download is redirecting you to the login page (probably because you don't have some cookies set)

- InetGet knows it's being redirected, so it downloads the page it's being redirected to instead

- You're stuck with the login page. LoL.

So what you should do is:

- Try to download the page

- Be redirected to the login page

- Log in on the login page

- Check if the login was successful

- Try to download the page

For this you'll need a better way of handling a website than just the InetGet function, because there are a lot of things involved: POST HTTP requests, cookies, redirection.
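The steps above can be sketched without a browser, using the Windows WinHttpRequest COM object. This is only a sketch under assumptions: the field names (tipo, login, clave) and the form action are taken from the login page source quoted earlier, while $sTargetUrl and the credentials are placeholders. Whether the server accepts exactly these fields is not confirmed anywhere in the thread, and there is no error handling for failed connections.

```autoit
; Sketch only: log in with a plain HTTP POST, then request a protected page.
Local $sBase      = "http://webapps.eafit.edu.co"
Local $sUser      = "myuser"        ; placeholder
Local $sPass      = "mypassword"    ; placeholder
Local $sTargetUrl = $sBase & "/ulises/"   ; hypothetical: replace with the page you need

Local $oHttp = ObjCreate("WinHttp.WinHttpRequest.5.1")
If Not IsObj($oHttp) Then Exit 1

; Step 1: load the login page first, in case the server sets a session cookie there.
$oHttp.Open("GET", $sBase & "/ulises/login.do", False)
$oHttp.Send()

; Step 2: post the form fields. In the page's JavaScript, tipo=3 is the
; login-by-username mode. Values containing special characters would need
; URL-encoding before being sent.
$oHttp.Open("POST", $sBase & "/ulises/login-submit.do", False)
$oHttp.SetRequestHeader("Content-Type", "application/x-www-form-urlencoded")
$oHttp.Send("tipo=3&login=" & $sUser & "&clave=" & $sPass)

; Step 3: reuse the same object so cookies from the login are sent again.
$oHttp.Open("GET", $sTargetUrl, False)
$oHttp.Send()

If $oHttp.Status = 200 Then
    Local $hFile = FileOpen(@ScriptDir & "\page.html", 2)   ; 2 = overwrite
    FileWrite($hFile, $oHttp.ResponseText)
    FileClose($hFile)
EndIf
```

A status of 200 does not prove the login worked: the saved file may still be the login page, so check its contents (for example, look for the "loginForm" form name) before parsing.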

DaleHohm has written a very nice library that handles these things. It uses Internet Explorer's internal workings to interact with a website. You can find out about all of this in the AutoIt help file. The functions all start with _IE and they're very convenient.

I don't attend a university at all. I just do this stuff in my free time. Kthxbye.


How much do you know about HTML and the Document Object Model (DOM)?

To log in, see _IECreate, _IEFormGetObjByName, _IEFormElementGetObjByName, _IEFormElementSetValue and _IEFormImgClick.

To save the page, see _IEDocReadHTML or _IEBodyReadHTML.

There are more refined ways of getting individual elements off the page, but the techniques require knowledge of the DOM.
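As a rough sketch of how those functions fit together for the login page quoted earlier: the form name "loginForm" and the field names "login" and "clave" come from that page source, while the credentials and $sTargetUrl are placeholders. _IEFormSubmit is used here in place of _IEFormImgClick, on the assumption that submitting the form directly has the same effect as clicking the image button.

```autoit
#include <IE.au3>

; Sketch only: fill in and submit the login form, then save another page's HTML.
Local $sTargetUrl = "http://webapps.eafit.edu.co/ulises/"   ; hypothetical: page to save

Local $oIE = _IECreate("http://webapps.eafit.edu.co/ulises/login.do")

; The form is named "loginForm"; its fields are "login" and "clave".
Local $oForm = _IEFormGetObjByName($oIE, "loginForm")
Local $oUser = _IEFormElementGetObjByName($oForm, "login")
Local $oPass = _IEFormElementGetObjByName($oForm, "clave")
_IEFormElementSetValue($oUser, "myuser")        ; placeholder
_IEFormElementSetValue($oPass, "mypassword")    ; placeholder

; Submit without waiting, then wait on the browser itself for the next page.
_IEFormSubmit($oForm, 0)
_IELoadWait($oIE)

; Go to the page you actually want and save its HTML.
_IENavigate($oIE, $sTargetUrl)
Local $sHtml = _IEDocReadHTML($oIE)
Local $hFile = FileOpen(@ScriptDir & "\page.html", 2)   ; 2 = overwrite
FileWrite($hFile, $sHtml)
FileClose($hFile)

_IEQuit($oIE)
```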

Dale




This is how I log in to Google Mail: you log in, retrieve the info you need, and close the browser.

To find the element name used in $o_signin = _IEGetObjByName($oIE, "signIn"), use DebugBar for Internet Explorer (google it).

#include <IE.au3>

; Open the login page in a new IE window
$oIE = _IECreate("http://mail.google.com/mail/?hl=et&tab=wm")

; Get references to the username, password and sign-in elements and fill them in
$o_login = _IEGetObjByName($oIE, "Email")
_IEFormElementSetValue($o_login, "mudak.yo")
$o_password = _IEGetObjByName($oIE, "passwd")
_IEFormElementSetValue($o_password, "dasitmeinpasswerthohoho")
$o_signin = _IEGetObjByName($oIE, "signIn")
_IEAction($o_signin, "click")

; Give the login time to complete, then close the browser
Sleep(5000)

_IEQuit($oIE)

And this is how I work with the web page source. Hope it helps:

#include <INet.au3>

$URL        = 'www.muwebpage.rt'
$SEARCH_FOR = ' Downloads and Information</title>'

; Download the page source and split it into an array of lines
$HTMLSource = _INetGetSource($URL)
$_Arrayline = StringSplit($HTMLSource, @LF)

; Find the first line that contains the search text and extract the link from it
For $i = 1 To $_Arrayline[0]
    If StringInStr($_Arrayline[$i], $SEARCH_FOR) Then
        $DL_Link = _sample($_Arrayline[$i])
        ConsoleWrite($DL_Link & @CRLF)
        ExitLoop
    EndIf
Next

Func _sample($STRING)
    ; Split the line at the search text: [1] = part before it, [2] = part after it
    $split = StringSplit($STRING, ' Downloads and Information</title>', 1)
    ConsoleWrite($split[1] & @CRLF)

    ; Keep what follows the search text, up to the first double quote
    $split = StringSplit($split[2], '"', 1)
    $DL_Link = $split[1]

;~  $split = StringSplit($DL_Link, "/", 1)  ; Get filename from link
;~  $Filename_Save_as = $split[7]

    Return $DL_Link
EndFunc


@Manadar: That is exactly what I need. About the university: I study Business Administration and I also just do this stuff in my free time :D because it's fun for me and exercises my brain :D jajaja.

@DaleHohm: I will try these functions later because I have to study now. About HTML, I only know the basics, and about the DOM I only know its name, but maybe I will read up on it later.

@goldenix: I will try DebugBar and your example later. But in your example, does an IE window have to be shown? That looks a little bad in an application. Can I use the WinSetState function to hide the window?

Thanks to all of you. I can't wait to try your examples, but right now I have to read a document about globalization and how it affects the domestic economy :D:D

Bye, and thanks again for your help.


@goldenix: Hi again. I tried your script and it works very well, and hiding the window works too. Now I can do my project. Many thanks, man, and thanks to everyone else who helped me.
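The thread does not say how the window was hidden. One possible way, as a sketch, is the third parameter of _IECreate in IE.au3, which controls visibility (1 = visible, 0 = hidden). The URL is the login page from earlier in the thread.

```autoit
#include <IE.au3>

; Third parameter of _IECreate: 1 = visible (default), 0 = hidden.
Local $oIE = _IECreate("http://webapps.eafit.edu.co/ulises/login.do", 0, 0)

; ... log in and navigate here, exactly as with a visible window ...

Local $sHtml = _IEBodyReadHTML($oIE)
ConsoleWrite(StringLen($sHtml) & " characters read" & @CRLF)

; A hidden window keeps running in the background unless it is closed explicitly.
_IEQuit($oIE)
```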


Is there a way to download just the HTML without the images to get a fast download?
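One possible answer, as a sketch: a browser window loads images because it renders the page, whereas _INetGetSource (used in the earlier example) requests only the one URL and returns its HTML as text. The URL below is the public login page, used only as an illustration; this sketch does not address whether a login made in IE carries over to such a request.

```autoit
#include <INet.au3>

; Sketch: fetch only the HTML text of one URL; images and stylesheets
; referenced by the page are not requested.
Local $sUrl  = "http://webapps.eafit.edu.co/ulises/login.do"
Local $sHtml = _INetGetSource($sUrl)

If $sHtml = "" Then
    ConsoleWrite("Nothing downloaded" & @CRLF)
Else
    ConsoleWrite(StringLen($sHtml) & " characters of HTML" & @CRLF)
EndIf
```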
