Jump to content
Beege

MS Classic Bounce Multithreading Example with 128 threads

Recommended Posts

That's kinda cool code :thumbsup: Thanks for sharing.

 

After a long time of assembler abstinence I currently try some ASM code to speed up my TGA loader. Currently I'm stuck with x64 ASM code... 


Please don't send me any personal message and ask for support! I will not reply!

Selection of finest graphical examples at Codepen.io

The own fart smells best!
Her 'sikim hıyar' diyene bir avuç tuz alıp koşma!
¯\_(ツ)_/¯  ٩(●̮̮̃•̃)۶ ٩(-̮̮̃-̃)۶ૐ

Share this post


Link to post
Share on other sites

Thanks UEZ!

I took a look at your loader and have been getting caught up on TGA format. I cant believe I never heard of it. I love the simple header. Replacing those For loops would definitely speed things up. The parts for me that I always get stuck on are dealing with floats.  Have you got asm code working for 32bit yet?

Share this post


Link to post
Share on other sites
33 minutes ago, Beege said:

Thanks UEZ!

I took a look at your loader and have been getting caught up on TGA format. I cant believe I never heard of it. I love the simple header. Replacing those For loops would definitely speed things up. The parts for me that I always get stuck on are dealing with floats.  Have you got asm code working for 32bit yet?

Well, for the TGA loader loading 32-bit image should be relativ fast as it a 1d loop. Only the 2d loops take very long time for larger images.

Currently I've done the x86 ASM code for 15/16 bit images but for x64 the code doesn't work. E.g. loading a 15-bit image with 2789x3500 dim. using native AutoIt it takes on my machine more than 134992 ms, with ASM 37 ms -> 3.648x faster!

This applies only to 8/15/16/24-bit images whose width is not a divider of 4. 

Here the part of the UDF which can be replaced with the non-pro ASM code

Case 15, 16, 24, 32 ;15/16/24/32-bit, as the bitmap format is the same we can use memcpy to copy the pixel data directly to the memory.
                            ;Exeptions are 15/16/24-bit images whose width is not a divider of 4!
            If BitOR($iPxDepth = 15, $iPxDepth = 16, $iPxDepth = 24) And Mod($iW, 4) Then
                Switch $iPxDepth
                    Case 15, 16
                        Local Const $bBinASM1516_x86 = Binary("0x5589E58B5D188B4D1C89C8F7651089452489C8F765148945285389D8D1E08B552801C203550C8B752401C60375200375088B066689024B83FB0075DE5B4983F90075C65DC22400")
                        Local $tBinASM1516_x86 = DllStructCreate("byte asm[" & BinaryLen($bBinASM1516_x86) & "]")
                        $tBinASM1516_x86.asm = $bBinASM1516_x86
                        Local $tMemVar1 = DllStructCreate("dword var"), $tMemVar2 = DllStructCreate("dword var")
                        DllCallAddress("none", DllStructGetPtr($tBinASM1516_x86), _
                                       "ptr", DllStructGetPtr($tSrcBmp), "ptr", DllStructGetPtr($tDestBmp), _
                                       "dword", $iW * 2, "dword", $stride, "dword", $iW - 1, "dword", $iH - 1, "dword", $pitch, _
                                       "ptr", DllStructGetPtr($tMemVar1), "ptr", DllStructGetPtr($tMemVar2))
                    Case 24

The ASM code:

#cs _ASM1516_x86
    use32
    ;pushad

    define tSrcBmp  dword[ebp + 08]
    define tDestBmp dword[ebp + 12]
    define strideS  dword[ebp + 16]
    define strideD  dword[ebp + 20]
    define width    dword[ebp + 24]
    define height   dword[ebp + 28]
    define pitch    dword[ebp + 32]
    define tMemVar1 dword[ebp + 36]
    define tMemVar2 dword[ebp + 40]

    push ebp
    mov ebp, esp

    mov ebx, width ;exc = w - 1
    mov ecx, height ;ecx = h - 1

;~  _ASMDBG_()

    _y:
        mov eax, ecx
        mul strideS
        mov tMemVar1, eax

        mov eax, ecx
        mul strideD
        mov tMemVar2, eax
        push ebx
        _x:
            mov eax, ebx
            shl eax, 1

            mov edx, tMemVar2
            add edx, eax
            add edx, tDestBmp

            mov esi, tMemVar1
            add esi, eax
            add esi, pitch
            add esi, tSrcBmp

            mov eax, [esi]
            mov word[edx], ax

            dec ebx
            cmp ebx, 0
            jne _x
        pop ebx
        dec ecx
        cmp ecx, 0
        jne _y

    pop ebp
    ;popad

    ret 36
#ce _ASM1516_x86

Any idea how to convert is to a working x64 version?

Edited by UEZ

Please don't send me any personal message and ask for support! I will not reply!

Selection of finest graphical examples at Codepen.io

The own fart smells best!
Her 'sikim hıyar' diyene bir avuç tuz alıp koşma!
¯\_(ツ)_/¯  ٩(●̮̮̃•̃)۶ ٩(-̮̮̃-̃)۶ૐ

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

  • Recently Browsing   0 members

    No registered users viewing this page.

  • Similar Content

    • By paradox109
      Hello, 
      I have A simple question about http request. What would be the fastest way to send mupltiple http request at the same time with autoit? The only way i figured  out was to to start multiple processes. This way works fine but its not really a good way. What user would like to see 15 processes running in the background at the same time. I know multithread is also not available in autoit.
    • By Beege
      Here is the latest assembly engine from Tomasz Grysztar, flat assembler g as a dll which I compiled using original fasm engine. He doesn't have it compiled in the download package but it was as easy as compiling a exe in autoit if you ever have to do it yourself. Just open up the file in the fasm editor and press F5. 
      You can read about what makes fasmg different from the original fasm HERE if you want . The minimum you should understand is that this engine is bare bones by itself not capable of very much. The macro engine is the major difference and it uses macros for basically everything now including implementing the x86 instructions and formats. All of these macros are located within the include folder and you should keep that in its original form.  
      When I first got the dll compiled I couldn't get it to generate code in flat binary format. It was working but size of output was over 300 bytes no matter what the assembly code and could just tell it was outputting a different format than binary. Eventually I figured out that within the primary "include\win32ax.inc"', it executes a macro "format PE GUI 4.0" if x86 has not been defined. I underlined macro there because at first I (wasted shit loads of time because I) didn't realize it was a macro (adding a bunch of other includes) since in version 1 the statement "format binary" was a default if not specified and specifically means add nothing extra to the code. So long story short, the part that I was missing is including the cpu type and extensions from include\cpu folder. By default I add x64 type and SSE4 ext includes. Note that the x64 here is not about what mode we are running in, this is for what instructions your cpu supports. if you are running on some really old hardware that may need to be adjusted or if your on to more advanced instructions like the avx extensions, you may have to add those includes to your source. 
      Differences from previous dll function
      I like the error reporting much better in this one. With the last one we had a ton error codes and a variable return structure depending on what kind of error it had. I even had an example showing you what kind of an error would give you correct line numbers vs wouldn't. With this one the stdout is passed to the dll function and it simply prints the line/details it had a problem with to the console. The return value is the number of errors counted.
      It also handles its own memory needs automatically now . If the output region is not big enough it will virtualalloc a new one and virtualfree the previous.  
      Differences in Code
      Earlier this year I showed some examples of how to use the macros to make writing assembly a little more familiar. Almost all the same functionality exists here but there are a couple syntax sugar items gone and slight change in other areas. 
      Whats gone is FIX and PTR. Both syntax sugar that dont really matter. 
      A couple changes to structures as well but these are for the better. One is unnamed elements are allowed now, but if it does not have a name, you are not allowed to initialize those elements during creation because they can only be intialized via syntax name:value . Previously when you initialized the elements, you would do by specifying values in a comma seperated list using the specific order like value1,value2,etc, but this had a problem because it expected commas even when the elements were just padding for alignment so this works out better having to specify the name and no need for _FasmFixInit function. "<" and ">" are not longer used in the initializes ether.
      OLD: $sTag = 'byte x;short y;char sNote[13];long odd[5];word w;dword p;char ext[3];word finish' _(_FasmAu3StructDef('AU3TEST', $sTag));convert and add definition to source _(' tTest AU3TEST ' & _FasmFixInit('1,222,<"AutoItFASM",0>,<41,43,43,44,45>,6,7,"au3",12345', $sTag));create and initalize New: $sTag = 'byte x;short y;char sNote[13];long odd[5];word w;dword p;char ext[3];word finish' _(_fasmg_Au3StructDef('AU3TEST', $sTag)) ;convert and add definition to source _(' tTest AU3TEST x:11,y:22,sNote:"AutoItFASM",odd:41,odd+4:42,odd+8:43,w:6,p:7,ext:"au3",finish:12345');create and initalize Extra Includes
      I created a includeEx folder for the extra macros I wrote/found on the forums. Most of them are written by Thomaz so they may eventually end up in the standard library. 
      Edit: Theres only the one include folder now. All the default includes are in thier own folder within that folder and all the custom ones are top level.  
       Align.inc, Nop.inc, Listing.inc
      The Align and Nop macros work together to align the next statement to whatever boundary you specified and it uses multibyte nop codes to fill in the space. Filling the space with nop is the default but you can also specify a fill value if you want. Align.assume is another macro part of align.inc that can be used to set tell the engine that a certain starting point is assumed to be at a certain boundary alignment and it will do its align calculations based on that value. 
      Listing is a macro great for seeing where and what opcodes are getting generated from each line of assembly code.  Below is an example of the source and output you would see printed to the console during the assembly. I picked this slightly longer example because it best shows use of align, nop, and then the use of listing to verify the align/nop code. Nop codes are instructions that do nothing and one use of them is to insert nop's as space fillers when you want a certian portion of your code to land on a specific boundary offset. I dont know all the best practices here with that (if you do please post!) but its a type of optimization for the cpu.  Because of its nature of doing nothing, I cant just run the code and confirm its correct because it didnt crash. I need to look at what opcodes the actual align statements made and listing made that easy. 
      source example:
      _('procf _main stdcall, pAdd') _(' mov eax, [pAdd]') _(' mov dword[eax], _crc32') _(' mov dword[eax+4], _strlen') _(' mov dword[eax+8], _strcmp') _(' mov dword[eax+12], _strstr') _(' ret') _('endp') _('EQUAL_ORDERED = 1100b') _('EQUAL_ANY = 0000b') _('EQUAL_EACH = 1000b') _('RANGES = 0100b') _('NEGATIVE_POLARITY = 010000b') _('BYTE_MASK = 1000000b') _('align 8') _('proc _crc32 uses ebx ecx esi, pStr') _(' mov esi, [pStr]') _(' xor ebx, ebx') _(' not ebx') _(' stdcall _strlen, esi') _(' .while eax >= 4') _(' crc32 ebx, dword[esi]') _(' add esi, 4') _(' sub eax, 4') _(' .endw') _(' .while eax') _(' crc32 ebx, byte[esi]') _(' inc esi') _(' dec eax') _(' .endw') _(' not ebx') _(' mov eax, ebx') _(' ret') _('endp') _('align 8, 0xCC') ; fill with 0xCC instead of NOP _('proc _strlen uses ecx edx, pStr') _(' mov ecx, [pStr]') _(' mov edx, ecx') _(' mov eax, -16') _(' pxor xmm0, xmm0') _(' .repeat') _(' add eax, 16') _(' pcmpistri xmm0, dqword[edx + eax], 1000b') ;EQUAL_EACH') _(' .until ZERO?') ; repeat loop until Zero flag (ZF) is set _(' add eax, ecx') ; add remainder _(' ret') _('endp') _('align 8') _('proc _strcmp uses ebx ecx edx, pStr1, pStr2') ; ecx = string1, edx = string2' _(' mov ecx, [pStr1]') ; ecx = start address of str1 _(' mov edx, [pStr2]') ; edx = start address of str2 _(' mov eax, ecx') ; eax = start address of str1 _(' sub eax, edx') ; eax = ecx - edx | eax = start address of str1 - start address of str2 _(' sub edx, 16') _(' mov ebx, -16') _(' STRCMP_LOOP:') _(' add ebx, 16') _(' add edx, 16') _(' movdqu xmm0, dqword[edx]') _(' pcmpistri xmm0, dqword[edx + eax], EQUAL_EACH + NEGATIVE_POLARITY') ; EQUAL_EACH + NEGATIVE_POLARITY ; find the first *different* bytes, hence negative polarity' _(' ja STRCMP_LOOP') ;a CF or ZF = 0 above _(' jc STRCMP_DIFF') ;c cf=1 carry _(' xor eax, eax') ; the strings are equal _(' ret') _(' STRCMP_DIFF:') _(' mov eax, ebx') _(' add eax, ecx') _(' ret') _('endp') _('align 8') _('proc _strstr uses ecx edx edi esi, sStrToSearch, sStrToFind') _(' mov ecx, [sStrToSearch]') _(' mov edx, [sStrToFind]') _(' pxor xmm2, xmm2') _(' movdqu xmm2, dqword[edx]') ; load the first 16 bytes of neddle') _(' pxor xmm3, xmm3') _(' lea eax, [ecx - 16]') _(' STRSTR_MAIN_LOOP:') ; find the first possible match of 16-byte fragment in haystack') _(' add eax, 16') _(' pcmpistri xmm2, dqword[eax], EQUAL_ORDERED') _(' ja STRSTR_MAIN_LOOP') _(' jnc STRSTR_NOT_FOUND') _(' add eax, ecx ') ; save the possible match start') _(' mov edi, edx') _(' mov esi, eax') _(' sub edi, esi') _(' sub esi, 16') _(' @@:') ; compare the strings _(' add esi, 16') _(' movdqu xmm1, dqword[esi + edi]') _(' pcmpistrm xmm3, xmm1, EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK') ; mask out invalid bytes in the haystack _(' movdqu xmm4, dqword[esi]') _(' pand xmm4, xmm0') _(' pcmpistri xmm1, xmm4, EQUAL_EACH + NEGATIVE_POLARITY') _(' ja @b') _(' jnc STRSTR_FOUND') _(' sub eax, 15') ;continue searching from the next byte _(' jmp STRSTR_MAIN_LOOP') _(' STRSTR_NOT_FOUND:') _(' xor eax, eax') _(' ret') _(' STRSTR_FOUND:') _(' sub eax, [sStrToSearch]') _(' inc eax') _(' ret') _('endp') Listing Output:
      00000000: use32 00000000: 55 89 E5 procf _main stdcall, pAdd 00000003: 8B 45 08 mov eax, [pAdd] 00000006: C7 00 28 00 00 00 mov dword[eax], _crc32 0000000C: C7 40 04 68 00 00 00 mov dword[eax+4], _strlen 00000013: C7 40 08 90 00 00 00 mov dword[eax+8], _strcmp 0000001A: C7 40 0C D8 00 00 00 mov dword[eax+12], _strstr 00000021: C9 C2 04 00 ret 00000025: localbytes = current 00000025: purge ret?,locals?,endl?,proclocal? 00000025: end namespace 00000025: purge endp? 00000025: EQUAL_ORDERED = 1100b 00000025: EQUAL_ANY = 0000b 00000025: EQUAL_EACH = 1000b 00000025: RANGES = 0100b 00000025: NEGATIVE_POLARITY = 010000b 00000025: BYTE_MASK = 1000000b 00000025: 0F 1F 00 align 8 00000028: 55 89 E5 53 51 56 proc _crc32 uses ebx ecx esi, pStr 0000002E: 8B 75 08 mov esi, [pStr] 00000031: 31 DB xor ebx, ebx 00000033: F7 D3 not ebx 00000035: 56 E8 2D 00 00 00 stdcall _strlen, esi 0000003B: 83 F8 04 72 0D .while eax >= 4 00000040: F2 0F 38 F1 1E crc32 ebx, dword[esi] 00000045: 83 C6 04 add esi, 4 00000048: 83 E8 04 sub eax, 4 0000004B: EB EE .endw 0000004D: 85 C0 74 09 .while eax 00000051: F2 0F 38 F0 1E crc32 ebx, byte[esi] 00000056: 46 inc esi 00000057: 48 dec eax 00000058: EB F3 .endw 0000005A: F7 D3 not ebx 0000005C: 89 D8 mov eax, ebx 0000005E: 5E 59 5B C9 C2 04 00 ret 00000065: localbytes = current 00000065: purge ret?,locals?,endl?,proclocal? 00000065: end namespace 00000065: purge endp? 00000065: CC CC CC align 8, 0xCC 00000068: 55 89 E5 51 52 proc _strlen uses ecx edx, pStr 0000006D: 8B 4D 08 mov ecx, [pStr] 00000070: 89 CA mov edx, ecx 00000072: B8 F0 FF FF FF mov eax, -16 00000077: 66 0F EF C0 pxor xmm0, xmm0 0000007B: .repeat 0000007B: 83 C0 10 add eax, 16 0000007E: 66 0F 3A 63 04 02 08 pcmpistri xmm0, dqword[edx + eax], 1000b 00000085: 75 F4 .until ZERO? 00000087: 01 C8 add eax, ecx 00000089: 5A 59 C9 C2 04 00 ret 0000008F: localbytes = current 0000008F: purge ret?,locals?,endl?,proclocal? 0000008F: end namespace 0000008F: purge endp? 0000008F: 90 align 8 00000090: 55 89 E5 53 51 52 proc _strcmp uses ebx ecx edx, pStr1, pStr2 00000096: 8B 4D 08 mov ecx, [pStr1] 00000099: 8B 55 0C mov edx, [pStr2] 0000009C: 89 C8 mov eax, ecx 0000009E: 29 D0 sub eax, edx 000000A0: 83 EA 10 sub edx, 16 000000A3: BB F0 FF FF FF mov ebx, -16 000000A8: STRCMP_LOOP: 000000A8: 83 C3 10 add ebx, 16 000000AB: 83 C2 10 add edx, 16 000000AE: F3 0F 6F 02 movdqu xmm0, dqword[edx] 000000B2: 66 0F 3A 63 04 02 18 pcmpistri xmm0, dqword[edx + eax], EQUAL_EACH + NEGATIVE_POLARITY 000000B9: 77 ED ja STRCMP_LOOP 000000BB: 72 09 jc STRCMP_DIFF 000000BD: 31 C0 xor eax, eax 000000BF: 5A 59 5B C9 C2 08 00 ret 000000C6: STRCMP_DIFF: 000000C6: 89 D8 mov eax, ebx 000000C8: 01 C8 add eax, ecx 000000CA: 5A 59 5B C9 C2 08 00 ret 000000D1: localbytes = current 000000D1: purge ret?,locals?,endl?,proclocal? 000000D1: end namespace 000000D1: purge endp? 000000D1: 0F 1F 80 00 00 00 00 align 8 000000D8: 55 89 E5 51 52 57 56 proc _strstr uses ecx edx edi esi, sStrToSearch, sStrToFind 000000DF: 8B 4D 08 mov ecx, [sStrToSearch] 000000E2: 8B 55 0C mov edx, [sStrToFind] 000000E5: 66 0F EF D2 pxor xmm2, xmm2 000000E9: F3 0F 6F 12 movdqu xmm2, dqword[edx] 000000ED: 66 0F EF DB pxor xmm3, xmm3 000000F1: 8D 41 F0 lea eax, [ecx - 16] 000000F4: STRSTR_MAIN_LOOP: 000000F4: 83 C0 10 add eax, 16 000000F7: 66 0F 3A 63 10 0C pcmpistri xmm2, dqword[eax], EQUAL_ORDERED 000000FD: 77 F5 ja STRSTR_MAIN_LOOP 000000FF: 73 30 jnc STRSTR_NOT_FOUND 00000101: 01 C8 add eax, ecx 00000103: 89 D7 mov edi, edx 00000105: 89 C6 mov esi, eax 00000107: 29 F7 sub edi, esi 00000109: 83 EE 10 sub esi, 16 0000010C: @@: 0000010C: 83 C6 10 add esi, 16 0000010F: F3 0F 6F 0C 3E movdqu xmm1, dqword[esi + edi] 00000114: 66 0F 3A 62 D9 58 pcmpistrm xmm3, xmm1, EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK 0000011A: F3 0F 6F 26 movdqu xmm4, dqword[esi] 0000011E: 66 0F DB E0 pand xmm4, xmm0 00000122: 66 0F 3A 63 CC 18 pcmpistri xmm1, xmm4, EQUAL_EACH + NEGATIVE_POLARITY 00000128: 77 E2 ja @b 0000012A: 73 0F jnc STRSTR_FOUND 0000012C: 83 E8 0F sub eax, 15 0000012F: EB C3 jmp STRSTR_MAIN_LOOP 00000131: STRSTR_NOT_FOUND: 00000131: 31 C0 xor eax, eax 00000133: 5E 5F 5A 59 C9 C2 08 00 ret 0000013B: STRSTR_FOUND: 0000013B: 2B 45 08 sub eax, [sStrToSearch] 0000013E: 40 inc eax 0000013F: 5E 5F 5A 59 C9 C2 08 00 ret 00000147: localbytes = current 00000147: purge ret?,locals?,endl?,proclocal? 00000147: end namespace 00000147: purge endp?  
      procf and forcea macros
      In my previous post I spoke about the force macro and why the need for it. I added two more macros (procf and forcea) that combine the two and also sets align.assume to the same function. As clarified in the previous post, you should only have to use these macros for the first procedure being defined (since nothing calls that procedure). And since its the first function, it should be the starting memory address which is a good place to initially set the align.assume address to. 
      Attached package should include everything needed and has all the previous examples I posted updated. Let me know if I missed something or you have any issues running the examples and thanks for looking
       
      Update 04/19/2020:
      A couple new macros added. I also got rid of the IncludeEx folder and just made one include folder that has the default include folder within it and all others top level.
      dllstruct macro does the same thing as _fasmg_Au3StructDef(). You can use either one; they both use the macro.
      getmempos macro does the delta trick I showed below using anonymous labels. 
      stdcallw and invokew macros will push any parameters that are raw (quoted) strings as wide characters
      Ifex include file gives .if .ifelse .while .until  the ability to use stdcall/invoke/etc inline. So if you had a function called "_add" you could do .if stdcall(_add,5,5) = 10. All this basically does in the background is perform the stdcall and then replaces the comparison with eax and passes it on to the original default macros, but is super helpful for cleaning up code and took a ton of time learning the macro language to get in place. 
       
      Update 05/19/2020:
      Added fastcallw that does same as stdcallw only
      Added fastcall support for Ifex
      Corrected missing include file include\macro\if.inc within win64au3.inc 
      fasmg 5-19-2020.zip
       
      Previous versions:
       
    • By Beege
      Heres a function for searching for a bitmap within another bitmap. The heart of it is written assembly (source included) and working pretty quick I feel. I have included an example which is pretty basic and should be easily enough for anyone to get the concept. 
      You will be given a small blue window that will take a screencapture of that size:

       
      It will then take a full screenshot and highlight all locations that it found

      Please let me know if you have any issues or questions. Thanks!
       
      Update 8/5/2019:
      Rewrote for fasmg. Added full source with everything needed to modify
      BmpSearch_8-5-2019.7z
      BmpSearch.zip
       
      GAMERS - Asking for help with ANY kind of game automation is against the forum rules. DON'T DO IT.
×
×
  • Create New...