
Euclidean distance vs. Cosine distance

Before we talk about Euclidean distance and cosine distance, let's first look at what a norm is.

    1. What is a norm?

        The distance of a vector from the origin is called the norm of that vector.
        To calculate the norm of $\vec{A}$, there are several common choices:
            1) Euclidean distance
                $$d=\sqrt{(x_2-x_1)^2+(y_2-y_1)^2}$$
                The norm is the distance from the origin $(0,0)$, so for a vector $(x_2,y_2)$ it equals $\sqrt{x_2^2+y_2^2}$.
            
            A vector norm based on Euclidean distance is called the L2-norm.

            2) Manhattan distance
                $$d=|x_2-x_1|+|y_2-y_1|$$
                The norm equals $|x_2|+|y_2|$.
            
            A vector norm based on Manhattan distance is called the L1-norm.
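
        Both norms are easy to verify with numpy. Below is a minimal sketch using a made-up 2-D vector; the values are only for illustration:

            import numpy as np

            A = np.array([3.0, 4.0])               # example vector (x_2, y_2)

            l2 = np.sqrt((A ** 2).sum())           # Euclidean (L2) norm: sqrt(3^2 + 4^2) = 5.0
            l1 = np.abs(A).sum()                   # Manhattan (L1) norm: |3| + |4| = 7.0

            # np.linalg.norm gives the same results (ord=2 is the default)
            assert np.isclose(l2, np.linalg.norm(A))
            assert np.isclose(l1, np.linalg.norm(A, ord=1))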

    2. Euclidean distance
        
        If $\vec{A} = (x_1,y_1)$ and $\vec{B} = (x_2,y_2)$,
        the Euclidean distance between $\vec{A}$ and $\vec{B}$ is:
            $$d=\sqrt{(x_2-x_1)^2+(y_2-y_1)^2}$$
        The smaller the distance, the closer the two vectors are.
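
        As a quick sanity check, here is a minimal numpy sketch with made-up points that computes the distance both from the formula and as the norm of the difference vector:

            import numpy as np

            A = np.array([1.0, 2.0])
            B = np.array([4.0, 6.0])

            d = np.sqrt(((B - A) ** 2).sum())             # sqrt((4-1)^2 + (6-2)^2) = 5.0
            assert np.isclose(d, np.linalg.norm(B - A))   # same distance, written as a norm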

    3. Cosine similarity

$$\cos\theta=\frac{\vec{A}\cdot\vec{B}}{||\vec{A}||\cdot||\vec{B}||}=\frac{\sum_{i=1}^{n}x_iy_i}{\sqrt{\sum_{i=1}^{n}x_i^2}\cdot\sqrt{\sum_{i=1}^{n}y_i^2}}$$

        The cosine value measures the angle $\theta$ between the two vectors, and $\cos\theta\in[-1,1]$.
        The closer it is to 1, the more similar the two vectors are.
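
        The formula translates directly into numpy. A minimal sketch with two made-up vectors that point in the same direction:

            import numpy as np

            A = np.array([1.0, 2.0, 3.0])
            B = np.array([2.0, 4.0, 6.0])          # B is A scaled by 2, i.e. the same direction

            cos = A.dot(B) / (np.linalg.norm(A) * np.linalg.norm(B))
            print(cos)                             # ≈ 1.0: parallel vectors are maximally similar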

    4. Cosine distance

        Cosine distance is defined as $1-\cos\theta$, so cosine distance $\in[0,2]$.
        The smaller the distance, the more similar the two vectors are.
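
        Cosine distance is just one minus the similarity. A self-contained sketch with made-up vectors pointing in opposite directions:

            import numpy as np

            A = np.array([1.0, 2.0, 3.0])
            B = np.array([-1.0, -2.0, -3.0])       # B points opposite to A

            cos = A.dot(B) / (np.linalg.norm(A) * np.linalg.norm(B))
            print(1.0 - cos)                       # ≈ 2.0, the maximum cosine distance

        For reference, scipy.spatial.distance.cosine computes this same $1-\cos\theta$ quantity.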

* What is normalization?
        $$\vec{y} = \frac{\vec{x}}{||\vec{x}||}$$
        $\vec{y}$ is $\vec{x}$ after normalization. Every component of $\vec{y}$ lies in $[-1,1]$, and $||\vec{y}||=1$.
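
        A minimal sketch of this normalization with numpy (the vector values are made up):

            import numpy as np

            x = np.array([3.0, 4.0])
            y = x / np.linalg.norm(x)              # [0.6, 0.8]

            print(np.linalg.norm(y))               # ≈ 1.0: a normalized vector has unit norm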

    5. The relationship between Euclidean distance and Cosine distance:

        After normalization, $||\vec{A}||=1$ and $||\vec{B}||=1$, so $\cos\theta=\vec{A}\cdot\vec{B}$.
        Cdist($\vec{A}$, $\vec{B}$) = $1-\vec{A}\cdot\vec{B}$
        Edist($\vec{A}$, $\vec{B}$) = $\sqrt{||\vec{A}-\vec{B}||^2}$ = $\sqrt{||\vec{A}||^2+||\vec{B}||^2-2\vec{A}\cdot\vec{B}}$ = $\sqrt{2}\cdot\sqrt{1-\vec{A}\cdot\vec{B}}$ = $\sqrt{2\cdot\text{Cdist}(\vec{A},\vec{B})}$
        This means that after normalization, Euclidean distance is a monotonically increasing function of cosine distance, so ranking vectors by one gives the same order as ranking by the other.
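
        The sketch below checks this numerically with random made-up vectors: after L2 normalization, Edist $=\sqrt{2\cdot\text{Cdist}}$.

            import numpy as np

            rng = np.random.default_rng(0)
            A, B = rng.normal(size=3), rng.normal(size=3)
            A, B = A / np.linalg.norm(A), B / np.linalg.norm(B)   # normalize both vectors

            cdist = 1.0 - A.dot(B)
            edist = np.linalg.norm(A - B)
            print(np.isclose(edist, np.sqrt(2.0 * cdist)))        # True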

    6. In Python
       
    Use cosine distance to get the top-k most similar vectors:
        import numpy as np

        # features: (N, D) array of stored vectors, assumed already L2-normalized row-wise
        # f: (D,) query vector; K: number of results to return
        f = f / np.linalg.norm(f, axis=-1, keepdims=True)         # L2-normalize the query
        sim = (features * f).sum(axis=1)                          # cosine similarity to every row
        topk_idx = np.argpartition(-sim, tuple(range(K)))[:K]     # indices of the K largest similarities, in order
        topk_val = sim[topk_idx].tolist()                         # their similarity values
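
        Assuming every row of features is unit-length (as noted in the comments above), sim is exactly the cosine similarity, so taking the largest sim values is equivalent to taking the smallest cosine distances (section 5). Passing kth=tuple(range(K)) to np.argpartition also leaves the first K indices in sorted order, from most to least similar, rather than merely partitioned.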
        




        
