Before discussing Euclidean distance and cosine distance, let's first look at what a norm is.
1. What is a norm?
The norm of a vector $\vec{A}$ is its distance from the origin.
The norm of $\vec{A}$ can be computed with different distance measures:
1) Euclidean distance
$$d=\sqrt{(x_2-x_1)^2+(y_2-y_1)^2}$$
The norm is the distance from the origin, so setting $(x_1,y_1)=(0,0)$ gives $\sqrt{x_2^2+y_2^2}$.
The vector norm based on Euclidean distance is called the L2 norm.
2) Manhattan distance
$$d=|x_2-x_1|+|y_2-y_1|$$
Setting $(x_1,y_1)=(0,0)$ again, the norm equals $|x_2|+|y_2|$.
The vector norm based on Manhattan distance is called the L1 norm.
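As a quick check of the two definitions, here is a minimal NumPy sketch. The vector `a` is a made-up example, not something from the derivation above; the hand-written formulas are compared against `np.linalg.norm` with `ord=2` and `ord=1`.

```python
import numpy as np

# Hypothetical example vector, chosen only for illustration.
a = np.array([3.0, -4.0])

# L2 norm: square root of the sum of squared components.
l2_manual = np.sqrt(np.sum(a ** 2))
# L1 norm: sum of the absolute values of the components.
l1_manual = np.sum(np.abs(a))

# np.linalg.norm computes the same quantities via the `ord` argument.
l2 = np.linalg.norm(a, ord=2)
l1 = np.linalg.norm(a, ord=1)

print(l2_manual, l2)  # both 5.0
print(l1_manual, l1)  # both 7.0
```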
2. Euclidean distance
If $\vec{A}=(x_1,y_1)$ and $\vec{B}=(x_2,y_2)$,
the Euclidean distance between $\vec{A}$ and $\vec{B}$ is:
$$d=\sqrt{(x_2-x_1)^2+(y_2-y_1)^2}$$
The smaller the distance, the closer the two vectors.
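The formula can be sketched in NumPy as follows. The two points are made-up values; the snippet only shows that the coordinate formula and the L2 norm of the difference vector give the same number.

```python
import numpy as np

# Hypothetical 2-D points, chosen only for illustration.
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Direct translation of the 2-D formula.
d_formula = np.sqrt((b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2)

# Equivalent: the L2 norm of the difference vector (works in any dimension).
d_norm = np.linalg.norm(b - a)

print(d_formula, d_norm)  # both 5.0
```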
3. Cosine similarity
$$\cos\theta=\frac{\vec{A}\cdot\vec{B}}{||\vec{A}||\cdot||\vec{B}||}=\frac{\sum_{i=1}^{n}x_iy_i}{\sqrt{\sum_{i=1}^{n}x_i^2}\cdot\sqrt{\sum_{i=1}^{n}y_i^2}}$$
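Here $x_i$ and $y_i$ are the $i$-th components of $\vec{A}$ and $\vec{B}$. $\cos\theta$ is the cosine of the angle $\theta$ between the two vectors, so $\cos\theta\in[-1,1]$.
The closer the value is to 1, the more similar the two vectors are in direction.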
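A minimal sketch of the formula in NumPy. The helper name `cosine_similarity` and the test vectors are illustrative choices, and the function assumes neither input is the zero vector.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (both assumed non-zero)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
same = cosine_similarity(a, np.array([2.0, 0.0]))       # same direction
orthogonal = cosine_similarity(a, np.array([0.0, 3.0]))  # 90 degrees apart
opposite = cosine_similarity(a, np.array([-5.0, 0.0]))   # opposite direction

print(same, orthogonal, opposite)  # 1.0 0.0 -1.0
```

Note that the lengths of the vectors do not affect the result, only their directions.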
4. Cosine distance
Cosine distance $=1-\cos\theta$, so cosine distance $\in[0,2]$.
The smaller the distance, the closer the two vectors are in direction.
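To make the range $[0,2]$ concrete, here is a small sketch. The function name `cosine_distance` and the vectors are hypothetical, and non-zero inputs are assumed.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cos(theta) for two non-zero vectors."""
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(1.0 - cos_theta)

a = np.array([1.0, 1.0])
d_same = cosine_distance(a, 3.0 * a)                   # same direction: lower end of the range
d_orth = cosine_distance(a, np.array([1.0, -1.0]))     # orthogonal: middle of the range
d_opp = cosine_distance(a, -a)                         # opposite direction: upper end of the range

print(d_same, d_orth, d_opp)  # approximately 0, 1, 2
```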
* What is normalization?
$$\vec{y} = \frac{\vec{x}}{||\vec{x}||}$$
$\vec{y}$ is $\vec{x}$ after normalization (for $\vec{x}\neq\vec{0}$). Every component of $\vec{y}$ lies in $[-1,1]$, and $||\vec{y}||=1$.
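A short NumPy sketch of L2 normalization, for a single vector and for a matrix whose rows are vectors. The arrays `x` and `X` are made-up inputs and are assumed to contain no zero vectors.

```python
import numpy as np

# Single vector.
x = np.array([3.0, -4.0])
y = x / np.linalg.norm(x)
print(y, np.linalg.norm(y))  # components in [-1, 1], norm approximately 1

# Batch of row vectors: keepdims=True keeps the norms as a column of
# shape (N, 1), so the division broadcasts row by row.
X = np.array([[3.0, -4.0], [0.0, 2.0], [1.0, 1.0]])
Y = X / np.linalg.norm(X, axis=-1, keepdims=True)
print(np.linalg.norm(Y, axis=-1))  # each row norm approximately 1
```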
5. The relationship between Euclidean distance and cosine distance
After normalization, $||\vec{A}||=1$ and $||\vec{B}||=1$, so:
Cdist($\vec{A}$, $\vec{B}$) = $1-\vec{A}\cdot\vec{B}$
Edist($\vec{A}$, $\vec{B}$) = $\sqrt{||\vec{A}-\vec{B}||^2}$ = $\sqrt{||\vec{A}||^2+||\vec{B}||^2-2\vec{A}\cdot\vec{B}}$ = $\sqrt{2-2\vec{A}\cdot\vec{B}}$ = $\sqrt{2}\cdot\sqrt{1-\vec{A}\cdot\vec{B}}$
That is, Edist $=\sqrt{2\cdot\text{Cdist}}$. After normalization, Euclidean distance is a monotonically increasing function of cosine distance, so both rank neighbors in the same order.
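The identity can be checked numerically. The sketch below uses random vectors with an arbitrary seed and dimension (both assumptions for illustration), normalizes them, and compares the Euclidean distance with $\sqrt{2\cdot\text{Cdist}}$.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# Two random vectors, L2-normalized so that ||a|| = ||b|| = 1.
a = rng.normal(size=8)
b = rng.normal(size=8)
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

cdist = 1.0 - np.dot(a, b)      # cosine distance of unit vectors
edist = np.linalg.norm(a - b)   # Euclidean distance

# Edist should equal sqrt(2 * Cdist) up to floating-point error.
matches = bool(np.isclose(edist, np.sqrt(2.0 * cdist)))
print(edist, np.sqrt(2.0 * cdist), matches)
```

Because the square root is monotonically increasing, sorting candidates by either distance gives the same nearest neighbors once everything is normalized.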
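6. In Python
Use cosine similarity to get the top-K vectors. This assumes f is a query vector of shape (D,) and features is an (N, D) matrix whose rows are already L2-normalized:
import numpy as np
f = f / np.linalg.norm(f, axis=-1, keepdims=True)  # normalize the query
sim = (features * f).sum(axis=1)  # dot product of each row with f = cosine similarity
topk_idx = np.argpartition(-sim, tuple(range(K)))[:K]  # indices of the K largest similarities, in descending order
topk_val = sim[topk_idx].tolist()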
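For a version that runs on its own, here is a hedged sketch. The function name `topk_cosine`, the random data, and the choice to normalize `features` inside the function are all assumptions for illustration, not part of the snippet above. It uses `np.argpartition` with a single `kth` value and then sorts only the K selected entries.

```python
import numpy as np

def topk_cosine(features, query, k):
    """Return (indices, similarities) of the k rows most similar to query.

    Assumes 1 <= k <= len(features) and no zero vectors.
    """
    features = features / np.linalg.norm(features, axis=-1, keepdims=True)
    query = query / np.linalg.norm(query)
    sim = features @ query                    # cosine similarity per row
    idx = np.argpartition(-sim, k - 1)[:k]    # k largest, in arbitrary order
    idx = idx[np.argsort(-sim[idx])]          # sort those k in descending order
    return idx, sim[idx]

rng = np.random.default_rng(0)                # arbitrary seed
features = rng.normal(size=(100, 16))         # hypothetical feature matrix
query = 3.0 * features[42]                    # same direction as row 42
idx, val = topk_cosine(features, query, k=5)
print(idx[0])                                 # 42: the row pointing the same way as the query
print(val.tolist())
```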