Jump to content

Find the first occurrence


Recommended Posts

This is something that I've been thinking about the whole freaking day today.

I was at a beach and thought about it. Kinda drives me nuts.

The function should find the first occurrence of some wide character inside wide string.

INT __stdcall ChrInStr(LPCWSTR sString, WCHAR sFind)
{
    //...
}

So, the most obvious code (to me) would be:

INT __stdcall ChrInStr(LPCWSTR sString, WCHAR sFind)
{
    if (sString == NULL) return 0;

    for(INT i = 0; sString[i]; ++i) 
    {
        if (sString[i] == sFind) return i;
    }

    return 0;
}

Considering that compiler settings could affect the size I've set it (Visual Studio 2010 Express') to produce the shortest possible code. The result is 39 bytes of x86 code.


55 push ebp
8BEC mov ebp, esp

> BLOCK 20
8B5508 mov edx, dword[ebp+08] ; first parameter is copied to edx ; sString
33C0 xor eax, eax ; eax is 0 ; i = 0
85D2 test edx, edx ; (edx == 0)? ; is (sString == NULL)
7417 je 23d ; if equal go forward 23 bytes (GOTO 50), eax == 0
0FB70A movzx ecx, word[edx] ; ecx is first character in string ; sString[i]
EB0B jmp 11d ; go forward 11 bytes (GOTO 40).

> BLOCK 30
663B4D0C cmp cx, word[ebp+0C] ; see if (e)cx -current character- is the one searched for ; if (sString[i] == sFind)
740C je 12d ; if equal go forward 12 bytes (GOTO 50), eax == i
40 inc eax ; increment eax ; ++i
0FB70C42 movzx ecx, word[edx+2*eax] ; ecx is next character in string

> BLOCK 40
6685C9 test cx, cx ; ((e)cx == 0)? -is null terminator found ; if (!sString[i]), or if you like if (sString[i] == '\0')
75F0 jne -15d ; if not equal go back 15 bytes (GOTO 30) ; continue looping if not
33C0 xor eax, eax ; eax = 0, eax == 0

> BLOCK 50 - STACK JOB
5D pop ebp
C20800 ret 0008
> BLOCK 10 - STACK JOB

But, here's the catch. I saw like 36 bytes (less than 39) of code that does the same thing, compiled with the same compiler. I saw the guy compiling it, I saw everything except the body of the function and compiler settings. Let's assume settings are not the issue, do you have an idea what to do to have it compiled to less instructions as possible? What c++ code would result in shorter code with the same calling convention and other things? Maybe even lesser than his function produce.

How to get rid of this line of code:

movzx ecx, word[edx]
... and jump directly to this:
movzx ecx, word[edx+2*eax]

And of course there is possibility that he was just playing me around knowing I can't resist this stuff. Smartass.

edit: fact that returned 0 is vague is irrelevant.

Edited by trancexx

♡♡♡

.

eMyvnE

Link to comment
Share on other sites

Just an idea, maybe a while loop?

UINT __stdcall ChrInStr(LPCWSTR sString, WCHAR sFind)
{
    if (sString == NULL) return 0;
    INT i = 0;
    while (sString[i])
    {
        if (sString[i] == sFind) return i;
        ++i;
    }

    return 0;
}

Also, don't you want to return something other than 0 if the char is not found? 0 is valid if the first char matches.

Edit: saw your edit

Edited by wraithdu
Link to comment
Share on other sites

Probably compiles the same, but removes one line:

UINT __stdcall ChrInStr(LPCWSTR sString, WCHAR sFind)
{
    if (sString == NULL) return 0;
    INT i = 0;
    while (sString[i])
    {
        if (sString[i++] == sFind) return i-1;
    }

    return 0;
}
Link to comment
Share on other sites

Probably compiles the same, but removes one line:

UINT __stdcall ChrInStr(LPCWSTR sString, WCHAR sFind)
{
    if (sString == NULL) return 0;
    INT i = 0;
    while (sString[i])
    {
        if (sString[i++] == sFind) return i-1;
    }

    return 0;
}

No that's more. 46 bytes of "crash" code.

Your first and mine compiles the same.

Edited by trancexx

♡♡♡

.

eMyvnE

Link to comment
Share on other sites

Do you know for sure that he sanity checked the input parameter? How many bytes do you get if you remove that line "if (sString == NULL) return 0;"?

:)

Valuater's AutoIt 1-2-3, Class... Is now in Session!For those who want somebody to write the script for them: RentACoder"Any technology distinguishable from magic is insufficiently advanced." -- Geek's corollary to Clarke's law
Link to comment
Share on other sites

Hmm, what if you remove the multiple exit points?

INT __stdcall ChrInStr(LPCWSTR sString, WCHAR sFind)
{
    INT r = 0;
    if (sString != NULL) {
        for(INT i = 0; sString[i]; ++i) {
            if (sString[i] == sFind) {
                r = i;
                break;
            }
        }
    }
    return r;
}

And I'm curious if you make this a "safe" function:

INT __stdcall ChrInStr(LPCWSTR sString, INT nLen, WCHAR sFind)
{
    INT r = 0;
    if (sString != NULL) {
        for(INT i = 0; i < nLen; ++i) {
            if (sString[i] == sFind) {
                r = i;
                break;
            }
        }
    }
    return r;
}
Link to comment
Share on other sites

Hmm, what if you remove the multiple exit points?

INT __stdcall ChrInStr(LPCWSTR sString, WCHAR sFind)
{
    INT r = 0;
    if (sString != NULL) {
        for(INT i = 0; sString[i]; ++i) {
            if (sString[i] == sFind) {
                r = i;
                break;
            }
        }
    }
    return r;
}

And I'm curious if you make this a "safe" function:

INT __stdcall ChrInStr(LPCWSTR sString, INT nLen, WCHAR sFind)
{
    INT r = 0;
    if (sString != NULL) {
        for(INT i = 0; i < nLen; ++i) {
            if (sString[i] == sFind) {
                r = i;
                break;
            }
        }
    }
    return r;
}
Both of those are 45 bytes with me.

@PsaltyDS, that check was done for sure. Btw, it's 4 bytes.

The main thing, I think, is to make compiler not doing that one assignment that I showed in the first post. That's redundant since eax is 0 there. word[edx] is the same as word[edx+2*eax] if eax is 0, and that exactly is the case at that point.

Edited by trancexx

♡♡♡

.

eMyvnE

Link to comment
Share on other sites

Are you sure the calling convention is the same? Are you sure that the function body was C/C++ code and not inline assembly? You could save quite a few bytes if you wrote your own assembly (obviously).

Link to comment
Share on other sites

Are you sure the calling convention is the same?

Yes.

Are you sure that the function body was C/C++ code and not inline assembly? You could save quite a few bytes if you wrote your own assembly (obviously).

This is possible option. Maybe he cheated. :)

Do you think some weird type casting could help?

♡♡♡

.

eMyvnE

Link to comment
Share on other sites

Assembly is a great mystery to me. So it sets i to 0 before it checks if sString is null?

33C0                    xor eax, eax                ; eax is 0                                          ;  i = 0
85D2                    test edx, edx                ; (edx == 0)?                                        ;  is (sString == NULL)

Edit: I get it now that's for the loop.

Edit: Wait, no it isn't? BRAINASPLODE!

<--- Relegated to staring at blinking lights in awe for rest of life.

What is the criteria for the end of the for loop? I see no test. Or is that why it sets i to 0 before it compares sString to null?

If the for loop checks sString for null every iteration then is this necessary: if (sString == NULL) return 0;

No no, scratch all of that. I see now.

Edited by LaCastiglione
Link to comment
Share on other sites

Assembly is a great mystery to me. So it sets i to 0 before it checks if sString is null?

33C0                    xor eax, eax                ; eax is 0                                          ;  i = 0
85D2                    test edx, edx                ; (edx == 0)?                                        ;  is (sString == NULL)

Edit: I get it now that's for the loop.

Edit: Wait, no it isn't? BRAINASPLODE!

<--- Relegated to staring at blinking lights in awe for rest of life.

eax is the return value of the function.

After the call if you want to check what's returned you (and the compiler) check eax.

The code that you point to is completely optimized. It's saying "before I do anything, I'll set return value to 0. Maybe change later, maybe not".

Edited by trancexx

♡♡♡

.

eMyvnE

Link to comment
Share on other sites

Thanks for that clarification. =)

I see now that in the condition of the for loop if sString is 0 then you know you've come to the end of the string because any condition that evaluates to 0 terminates the loop.

Well, I'm out of my league here. Yall have fun!

I do think you're comments have been misplaced though.

Edited by LaCastiglione
Link to comment
Share on other sites

This is shorter (1 byte) and faster (more than that):

INT __stdcall ChrInStr(LPCWSTR sString, WCHAR sFind)
{
    if (sString == NULL) return 0;

    for(INT i = 0; *sString; ++i)
    {
        if (*sString == sFind) return i;
        ++sString;
    }

    return 0;
}
It basically does what I thought it should. Only it also increments two variables per cycle therefore loosing bytes. I'm fairly confident he went this way. Next question is obvious.

@LaCastiglione, nothing is misplaced.

♡♡♡

.

eMyvnE

Link to comment
Share on other sites

What about using this?

INT __stdcall ChrInStr(const LPCWSTR sString, WCHAR sFind)
{
    if (sString == NULL) return -1;
    LPCWSTR s = sString;
    while (*s) {
        if (*s == sFind) return s-sString;
        ++s
    }
    return -1;
}
Or this?
INT __stdcall ChrInStr(const LPCWSTR sString, WCHAR sFind)
{
    if (sString == NULL) return -1;
    LPCWSTR s = sString;
    register WCHAR c = *s;
    while (c) {
        if (c == sFind) return s-sString;
        c = *(++s);
    }
    return -1;
}
Edited by ProgAndy

*GERMAN* [note: you are not allowed to remove author / modified info from my UDFs]My UDFs:[_SetImageBinaryToCtrl] [_TaskDialog] [AutoItObject] [Animated GIF (GDI+)] [ClipPut for Image] [FreeImage] [GDI32 UDFs] [GDIPlus Progressbar] [Hotkey-Selector] [Multiline Inputbox] [MySQL without ODBC] [RichEdit UDFs] [SpeechAPI Example] [WinHTTP]UDFs included in AutoIt: FTP_Ex (as FTPEx), _WinAPI_SetLayeredWindowAttributes

Link to comment
Share on other sites

I'm not sure if the syntax will change the compiled code but it's probably:

WCHAR c;
while (c = *sString++)
Or whatever the syntax is. In other words, combine the assignment to c and the check that c != NULL into a single statement. I hate code that does the assignment and test but in this case it could affect the compiled output - or not.
Link to comment
Share on other sites

What about using this?

INT __stdcall ChrInStr(const LPCWSTR sString, WCHAR sFind)
{
    if (sString == NULL) return -1;
    LPCWSTR s = sString;
    while (*s) {
        if (*s == sFind) return s-sString;
        ++s
    }
    return -1;
}
Or this?
INT __stdcall ChrInStr(const LPCWSTR sString, WCHAR sFind)
{
    if (sString == NULL) return -1;
    LPCWSTR s = sString;
    register WCHAR c = *s;
    while (c) {
        if (c == sFind) return s-sString;
        c = *(++s);
    }
    return -1;
}
Both of those compiles to same code. The main issue with it is additional code added as some sort of security check likely due to pointer subtraction. Generated instructions and resulting mnemonics are unknown to me (yet). I don't know how to avoid them.

@Valik, whatever I join compiler splits.

♡♡♡

.

eMyvnE

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...