Simple Readability Server [PYTHON]


Readability is a Python library that emulates the "Reading Mode" found in browsers, i.e. it takes a page's HTML and returns a simplified version with headers, footers and scripts removed.

I made a simple server out of it. The server fetches the page at the URL you pass in and returns the part you ask for. It takes the server IP and port as CLI arguments, defaulting to 127.0.0.1:8900.
Example requests:

http://127.0.0.1:8900/?url=https://google.com&output_type=TITLE

http://127.0.0.1:8900/?url=https://google.com&output_type=SHORT_TITLE

http://127.0.0.1:8900/?url=https://google.com&output_type=CONTENT

http://127.0.0.1:8900/?url=https://google.com&output_type=SUMMARY

http://127.0.0.1:8900/health (to check if the server is running)
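
The request URLs above can also be built from a script. Here is a minimal sketch using only the standard library; build_request_url and fetch are hypothetical helper names, not part of the server script, and the defaults assume the server runs on 127.0.0.1:8900:

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_request_url(target_url, output_type, host="127.0.0.1", port=8900):
    """Build a request URL for the readability server (hypothetical helper)."""
    # safe=":/" keeps the scheme and path separators readable, matching the
    # example URLs above; other reserved characters are percent-encoded.
    query = urlencode({"url": target_url, "output_type": output_type}, safe=":/")
    return f"http://{host}:{port}/?{query}"


def fetch(target_url, output_type, timeout=15):
    """Ask a running server for one output type and return the body as text."""
    with urlopen(build_request_url(target_url, output_type), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


print(build_request_url("https://google.com", "TITLE"))
# With the server running: print(fetch("https://google.com", "TITLE"))
```

Target URLs that contain their own ? or & come out percent-encoded here, so they only round-trip if the server decodes the query string.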

import http.server
import logging
import re
import sys
from urllib.parse import parse_qs, urlsplit

import requests
from readability import Document

# Set up logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

# Regular expression to match URLs
URL_REGEX = re.compile(r"^https?://.+$")

# Allowed output types
ALLOWED_OUTPUT_TYPES = ["TITLE", "SHORT_TITLE", "CONTENT", "SUMMARY"]

# User-Agent sent with the outgoing request to the target page
# (it is a request header, so it does not belong on the server's responses)
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"


class RequestHandler(http.server.BaseHTTPRequestHandler):
    def send_text(self, status, body):
        # Send a plain-text response with the given status code
        self.send_response(status)
        self.send_header("Content-type", "text/plain; charset=utf-8")
        self.end_headers()
        self.wfile.write(body.encode("utf-8"))

    def do_GET(self):
        # Log the request
        logging.info(f"Received request: {self.path}")

        parsed = urlsplit(self.path)
        if parsed.path == "/health":
            # This is a health check request, return a 200 status code
            self.send_text(200, "OK")
            return

        # Parse the query string to get the URL and output type.
        # Missing parameters become empty strings and fail validation below
        # instead of raising an IndexError.
        query_params = parse_qs(parsed.query)
        url = query_params.get("url", [""])[0]
        output_type = query_params.get("output_type", [""])[0]

        # Validate the input
        if not URL_REGEX.match(url):
            self.send_text(400, "Invalid URL")
            return
        if output_type not in ALLOWED_OUTPUT_TYPES:
            self.send_text(400, "Invalid output type")
            return

        # Input is valid, proceed with processing the request
        try:
            # The timeout keeps a slow target site from blocking the server forever
            response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
            doc = Document(response.content)
            # Look up the method first so only the requested output is computed
            output = {
                "TITLE": doc.title,
                "SHORT_TITLE": doc.short_title,
                "CONTENT": doc.content,
                "SUMMARY": doc.summary,
            }[output_type]()
        except Exception as e:
            # Log the error and return an error message to the client
            logging.error(f"Error: {e}")
            self.send_text(500, "An error occurred while processing the request")
            return

        # Send the response
        self.send_text(200, output)

# Get the server IP and port from the command line arguments
server_ip = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
server_port = int(sys.argv[2]) if len(sys.argv) > 2 else 8900

# Create the server and run it indefinitely
server_address = (server_ip, server_port)
httpd = http.server.HTTPServer(server_address, RequestHandler)

# Log an info message when the server starts
logging.info(f"Server started on http://{server_ip}:{server_port}")

httpd.serve_forever()
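
The positional sys.argv handling above could also be written with argparse, which adds a --help message and rejects a non-numeric port with a readable error. This is a sketch of one way to do it, not a change the script needs; parse_args is a hypothetical helper name and the defaults mirror the ones in the script:

```python
import argparse


def parse_args(argv=None):
    """Parse the optional server IP and port (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Simple Readability server")
    # Both arguments are optional positionals, like sys.argv[1] and sys.argv[2]
    parser.add_argument("ip", nargs="?", default="127.0.0.1",
                        help="address to bind to (default: 127.0.0.1)")
    parser.add_argument("port", nargs="?", type=int, default=8900,
                        help="port to listen on (default: 8900)")
    return parser.parse_args(argv)


# Example with explicit arguments; in the script you would call parse_args()
# with no argument so it reads sys.argv.
args = parse_args(["0.0.0.0", "9000"])
print(args.ip, args.port)
```

The resulting args.ip and args.port would then replace server_ip and server_port when building server_address.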

Note: make sure you have the readability library (https://github.com/buriy/python-readability) and requests installed before using this:

pip install readability-lxml requests

 


Example use cases:
 

WordPress Blog Post
https://lmilosis.wordpress.com/2020/01/26/19/
http://127.0.0.1:8900/?url=https://lmilosis.wordpress.com/2020/01/26/19/&output_type=SUMMARY

News Article
https://us.cnn.com/2023/01/04/weather/severe-storm-tornado-threat-south-wednesday/index.html
http://127.0.0.1:8900/?url=https://us.cnn.com/2023/01/04/weather/severe-storm-tornado-threat-south-wednesday/index.html&output_type=SUMMARY

It doesn't do too well with JS-heavy sites, since requests only downloads the raw HTML and never runs the page's JavaScript. If you're using another tool to scrape the HTML, you may want to edit the script to take HTML source as input instead.
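
For the HTML-as-input variant, the extraction step could be factored into a helper that takes the HTML directly. This is a rough sketch, assuming readability-lxml is installed; extract_from_html and OUTPUT_METHODS are hypothetical names, not part of the script above:

```python
# Map each output type to the name of the readability Document method it uses
OUTPUT_METHODS = {
    "TITLE": "title",
    "SHORT_TITLE": "short_title",
    "CONTENT": "content",
    "SUMMARY": "summary",
}


def extract_from_html(html, output_type):
    """Return the requested readability output for already-fetched HTML."""
    if output_type not in OUTPUT_METHODS:
        raise ValueError(f"Invalid output type: {output_type}")
    if not html.strip():
        raise ValueError("Empty HTML input")

    # Third-party import (pip install readability-lxml), done after validation
    # so that bad input fails fast before lxml is loaded.
    from readability import Document

    doc = Document(html)
    return getattr(doc, OUTPUT_METHODS[output_type])()
```

The HTML could come from a headless browser or any other scraper that renders JavaScript first. A server handler would then read it from a POST body rather than fetching a URL itself.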
