AutoIt Function For Cosine Similarity (Vector Embeddings)?


Solved by RTFC


I can use the OpenAI API to get arrays containing vector embeddings for a word/phrase using this: https://platform.openai.com/docs/guides/embeddings

But what's the process of comparing the two vector arrays using something like this: https://en.wikipedia.org/wiki/Cosine_similarity

In python, there's a library for this: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html

Anything similar in AutoIt? Thanks!
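For reference, the scikit-learn function above just implements the formula from the Wikipedia page: dot(a, b) / (||a|| * ||b||). A minimal pure-Python sketch of that formula (example vectors are made up):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity per the Wikipedia definition:
    dot(a, b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # ~0.9746
```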


Did I get this right? Just working off of the Wikipedia definition.

; Cosine similarity per the Wikipedia definition:
; dot(a, b) / (||a|| * ||b||)
; (Sqrt and UBound are built in, so no includes are needed.)

Local $embedding1[3] = [1.0, 2.0, 3.0]
Local $embedding2[3] = [4.0, 5.0, 6.0]

; Dot product of the two vectors
Local $dotProduct = 0.0
For $i = 0 To UBound($embedding1) - 1
    $dotProduct += $embedding1[$i] * $embedding2[$i]
Next

; Euclidean norm (magnitude) of each vector
Local $magnitude1 = 0.0
For $i = 0 To UBound($embedding1) - 1
    $magnitude1 += $embedding1[$i] ^ 2
Next
$magnitude1 = Sqrt($magnitude1)

Local $magnitude2 = 0.0
For $i = 0 To UBound($embedding2) - 1
    $magnitude2 += $embedding2[$i] ^ 2
Next
$magnitude2 = Sqrt($magnitude2)

Local $cosineSimilarity = $dotProduct / ($magnitude1 * $magnitude2)

MsgBox(0, "", "Cosine similarity: " & $cosineSimilarity)

 


  • Solution

Looks okay, but you should really look into E4A's DotProduct (section: Multiplication) and GetNorm (section: Reduction) functions.


23 minutes ago, RTFC said:

Looks okay, but you should really look into E4A's DotProduct (section: Multiplication) and GetNorm (section: Reduction) functions.

I remember you recommending this library some time back, and I downloaded it, but it looked so daunting (I don't have a CS background) that I backed off immediately :)
Okay, I'll give it another go :)


How is this daunting? :D

#include "C:\AutoIt\Eigen\Eigen4AutoIt.au3" ; NB adjust path to wherever you put it

Local $embedding1[3] = [1.0, 2.0, 3.0]
Local $embedding2[3] = [4.0, 5.0, 6.0]

_Eigen_StartUp()

$vec1 = _Eigen_CreateMatrix_FromArray($embedding1)
$vec2 = _Eigen_CreateMatrix_FromArray($embedding2)

MsgBox(0, "", "Cosine similarity: " & _
    _Eigen_DotProduct($vec1,$vec2) / (_Eigen_GetNorm($vec1) * _Eigen_GetNorm($vec2)))

_Eigen_CleanUp()

(I don't have a CS background either.)

Edited by RTFC
typo

  • 1 month later...

Update: as of version 5.4 (released: 29 May 2023), E4A supports direct retrieval of the angle between two vectors with function _Eigen_GetVectorAngle ( $vecA, $vecB, $returnRadians = False ). A zero-degree angle signifies parallel vectors (aligned and pointing in the exact same direction), a 90-degree angle perpendicular ones, and a 180-degree angle implies the vectors are anti-parallel (aligned, but pointing in opposite directions).

#include "C:\AutoIt\Eigen\Eigen4AutoIt.au3" ; NB adjust path to wherever you put it

Local $embedding1[3] = [1.0, 2.0, 3.0]
Local $embedding2[3] = [4.0, 5.0, 6.0]

_Eigen_StartUp()

$vec1 = _Eigen_CreateMatrix_FromArray($embedding1)
$vec2 = _Eigen_CreateMatrix_FromArray($embedding2)

MsgBox(0, "", "Angle between vectors (degrees): " & _Eigen_GetVectorAngle($vec1, $vec2))

_Eigen_CleanUp()
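Note that if you still want the similarity value itself rather than the angle, cosine similarity is just the cosine of this angle, so converting back is trivial. Illustrated here in Python terms (degrees assumed, matching the default $returnRadians = False; the example angle is the one the vectors [1,2,3] and [4,5,6] produce):

```python
import math

def angle_to_similarity(angle_degrees):
    """Cosine similarity is the cosine of the angle between the vectors,
    so an angle in degrees converts back like this."""
    return math.cos(math.radians(angle_degrees))

print(angle_to_similarity(12.933154))  # ~0.9746
```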

 


Never jumped on the Python bandwagon myself either. From what I've read in various Stack Overflow threads, you should be able to get significantly better performance by replacing NumPy with raw Eigen/C++, even without GPU/CUDA/MPI refactoring.

If you're serious about setting up ML in this way, I can probably help you. Many of Eigen's speed optimisations are obtained at compile time (e.g. lazy evaluation, smart loop unrolling, and matrix-operation-specific tricks), so if you were to present a snippet of E4A code (say, a UDF that applies a number of E4A functions to some input matrices), I could duplicate/optimise/rewrite that and present you with a single pre-compiled E4A DllCall. I first suggested this when I started the E4A thread many years ago, but so far nobody has taken me up on it. Up to you, of course. If you're worried about your intellectual property, you can PM me instead. In any case, hope it helps.


1 hour ago, RTFC said:

so far nobody has taken me up on this

Would love to :) but nothing in my workflow (so far) has warranted anything extremely complex. At most, I'm using SBERT embeddings plus a Milvus vector DB, doing some vector comparisons, corpus indexing, and some n-gram extractions with Yake.

