Cosine Similarity, Pearson Correlation, Inner Products

To begin, a criticism

I picked up the Haskell Data Analysis Cookbook. The book presents examples of comparing data using the Pearson coefficient and using cosine similarity.

pearson xs ys = (n * sxy - sx * sy) / sqrt ((n * sxx - sx * sx) * (n * syy - sy * sy))
    where n   = fromIntegral (length xs)
          sx  = sum xs
          sy  = sum ys
          sxx = sum $ zipWith (*) xs xs
          syy = sum $ zipWith (*) ys ys
          sxy = sum $ zipWith (*) xs ys

cosine xs ys = dot xs ys / (len xs * len ys)
    where dot a b = sum $ zipWith (*) a b
          len a   = sqrt $ dot a a

Although these code snippets are both calculating the ‘similarity’ between two vectors and actually, as we shall see, share a lot of structure, this is not at all apparent from a glance.

We can fix that however…

Definition of an Inner Product

An inner product is conceptually a way to see how long a vector is after projecting it along another (inside some space).

Formally, an inner product is a binary operator satisfying the following properties.

Linearity

\langle u+v,w \rangle = \langle u,w\rangle + \langle v,w\rangle
\langle \alpha u,w\rangle = \alpha\langle u,w\rangle for \alpha \in \mathbb{R}
We are saying that sums inside the left slot can be pulled outside, and that constant factors can be pulled out too.

(Conjugate) Symmetry

\langle u,v \rangle = \langle v,u \rangle or in the complex case, \langle u,v \rangle = \overline{\langle v,u \rangle}
In the real case, we’re saying everything is symmetric – it doesn’t matter which way you do it. In the complex case we have to reflect things by taking the conjugate.

Positive Definiteness

\langle u,u \rangle \ge 0 with equality iff u = 0
Here we’re saying projecting a vector onto itself always results in a non-negative length. Secondly, the only way we can end up with a result of zero is if the vector is the zero vector itself.

From Inner Product to a notion of ‘length’

Intuitively a distance between two things must be

  • positive or zero (a negative distance doesn’t make much sense), with a length of zero corresponding to the zero vector
  • linear (if we scale the vector threefold, the length should also increase threefold)

Given that \langle u,u \rangle \ge 0 we might be tempted to set length(u) := \langle u,u \rangle, but then upon scaling u \rightarrow \alpha u we get length(\alpha u) = \langle \alpha u, \alpha u \rangle = \alpha^2 \langle u,u \rangle – we’re not scaling linearly.

Instead, defining ||u|| := \sqrt{\langle u,u \rangle}, everything is good: now ||\alpha u|| = \sqrt{\alpha^2 \langle u,u \rangle} = |\alpha| \, ||u||.


Now, in the abstract, how similar are two vectors?

How about we first stop caring about how long they are, and ask only whether they point in the same direction. We can project one along the other and see how much it changes in length (shrinks).

Projecting is kind of like seeing what its component is in that direction – i.e. considering 2-dimensional vectors in the plane, projecting a vector onto a unit vector in the x direction will tell you the x component of that vector.

Let’s call two vectors a and b.

Firstly let’s scale them to be both of unit length, \hat{a} = \frac{a}{||a||}, \hat{b} = \frac{b}{||b||}

Now, project one onto the other (remember we’re not caring about order because of symmetry).
similarity(a,b) := \langle \frac{a}{||a||}, \frac{b}{||b||} \rangle

Using linearity we can pull some stuff out (and also assuming everything’s happily a real vector – not caring about taking conjugates)…
similarity(a,b) = \frac{\langle a, b \rangle}{||a|| ||b||}

Making Everything Concrete

Euclidean Inner Product

The dot product we know and love.
a \cdot b = a_1 b_1 + \dots + a_n b_n

Plugging that into the similarity formula, we end up with the cosine similarity we started with!
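As a quick sanity check – the sample vectors here are invented by me, not from the book – parallel vectors should score 1 and orthogonal vectors 0:

```haskell
-- Cosine similarity spelled out with the Euclidean inner product
cosine :: Floating a => [a] -> [a] -> a
cosine xs ys = dot xs ys / (len xs * len ys)
  where dot a b = sum $ zipWith (*) a b
        len a   = sqrt $ dot a a

-- cosine [1,2,3] [2,4,6] ≈ 1.0  (parallel vectors)
-- cosine [1,0]   [0,1]  == 0.0  (orthogonal vectors)
```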

Covariance Inner Product

The covariance between two vectors is defined as Cov(X,Y) = \mathbb{E}((X - \mathbb{E}(X))(Y - \mathbb{E}(Y))) where we’re abusing the notion of expectation somewhat. This in fact works if X and Y are arbitrary \mathbb{L}^2 random variables… but for the very concrete case of finite vectors we could consider \mathbb{E}(X) = \frac{1}{n}(x_1 + \dots + x_n).

With this inner product, projecting one vector onto another measures how covariant the first is with the second – whether or not they move together.

Plugging this inner product into the similarity formula, we instead get the Pearson coefficient!

In fact, given Cov(X,X) = Variance(X), in this space we have length(X) = \sqrt{Variance(X)} = StdDev(X) =: \sigma_X,

i.e. similarity(X,Y) = \frac{Cov(X,Y)}{\sigma_X \sigma_Y}.

Improving the code

Now that we know this structure exists, I posit the following as being better

similarity ip xs ys = ip xs ys / (len xs * len ys)
   where len v = sqrt (ip v v)

-- the inner products
dot xs ys = sum $ zipWith (*) xs ys

-- Cov(X,Y) = E(XY) - E(X)E(Y)
covariance xs ys = exy - (ex * ey)
   where e zs = sum zs / (fromIntegral $ length zs)
         exy = e $ zipWith (*) xs ys
         ex  = e xs
         ey  = e ys

-- the similarity functions
cosineSimilarity = similarity dot
pearsonSimilarity = similarity covariance
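To see it in action – the sample data is invented for illustration – note that the ys below are exactly 2*xs + 1, so they are perfectly linearly related to the xs: their Pearson similarity is 1, while their cosine similarity is dragged below 1 by the constant offset:

```haskell
similarity :: Floating a => ([a] -> [a] -> a) -> [a] -> [a] -> a
similarity ip xs ys = ip xs ys / (len xs * len ys)
  where len v = sqrt (ip v v)

-- the inner products
dot :: Num a => [a] -> [a] -> a
dot xs ys = sum $ zipWith (*) xs ys

-- Cov(X,Y) = E(XY) - E(X)E(Y)
covariance :: Fractional a => [a] -> [a] -> a
covariance xs ys = e (zipWith (*) xs ys) - e xs * e ys
  where e zs = sum zs / fromIntegral (length zs)

cosineSimilarity, pearsonSimilarity :: Floating a => [a] -> [a] -> a
cosineSimilarity  = similarity dot
pearsonSimilarity = similarity covariance

-- pearsonSimilarity [1,2,3] [3,5,7] ≈ 1.0  (ys = 2*xs + 1)
-- cosineSimilarity  [1,2,3] [3,5,7] < 1    (the offset matters to cosine)
```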

Things I’m yet to think about

…though maybe the answers are apparent.

We have a whole load of inner products available to us. What does it mean to use those inner products?
E.g. \langle f,g \rangle = \int_{-\pi}^{\pi} f(t) \overline{g(t)} \, \mathrm{d}t on \mathbb{L}^2[-\pi,\pi] – the inner product behind Fourier series. I’m not sure the resulting similarity is anything particularly special though…


Haskell, Cellular Automata and Sierpinski Rule 90

I’ve just finished writing an article for The Invariant Magazine (the annual Oxford Maths Society magazine which tends to actually get published once a decade) about cellular automata.

Whilst writing, I became a bit distracted, and decided to write some Haskell to print 1D automata. After replacing most variables with single letters, and manually inlining some things, behold:

import Data.Bits
ns xs = zip3 (drop (l-1) xs') xs (tail xs') where xs' = cycle xs; l = length xs
s ru xs = map (\(l,c,r) -> if (2^(l*4 + c*2 + r) .&. ru) == 0 then 0 else 1) $ ns xs
ss = map (\c -> if c == 0 then ' ' else '#')
printAutomata :: Int -> Int -> [Int] -> IO () --one type sig needed
printAutomata n r xs = mapM_ putStrLn $ map ss $ take n $ iterate (s r) xs

Then in GHCI, let’s try to simulate Rule 90 starting with a single cell of state 1.

Prelude> :l ca.hs
[1 of 1] Compiling Main             ( ca.hs, interpreted )
Ok, modules loaded: Main.
*Main> printAutomata 20 90 $ (replicate 20 0) ++ [1] ++ (replicate 20 0)
                   # #                   
                  #   #                  
                 # # # #                 
                #       #                
               # #     # #               
              #   #   #   #              
             # # # # # # # #             
            #               #            
           # #             # #           
          #   #           #   #          
         # # # #         # # # #         
        #       #       #       #        
       # #     # #     # #     # #       
      #   #   #   #   #   #   #   #      
     # # # # # # # # # # # # # # # #     
    #                               #    
   # #                             # #   
  #   #                           #   #  
 # # # #                         # # # # 

Yay Sierpinski!


Dual Numbers, Automatic Differentiation, Haskell

Duals are an extension to a number system such that x^2 = 0 has a non-zero solution.
We could consider them as a quotient:

  • take a ring R
  • quotient R[X] by the ideal < X^2 >

Since we can keep things general in the theory, I’m going to take
duals over a general type a. Maybe I should look into the mathematical
prelude so that I will be able to specify a to be a ring, but that’s
for another day.

> data Dual a = D a a deriving Show

Some Notation

I shall also slip in and out of the notation a + b\mathrm{j} to represent the Dual (D a b).

We could maybe use some utility functions to give us the ‘real’ and ‘dual’ parts

> real, dual :: Dual a -> a
> real (D a _) = a
> dual (D _ b) = b

In some ways, we can consider these as an analogue of the complex numbers,
but instead with j^2 = 0. Anyhow, equality comes naturally as follows.
(We could also derive this from the polynomial definition, comparing coefficients of cosets.)

> instance Eq a => Eq (Dual a) where
>   (D a b) == (D c d) = (a == c) && (b == d)

Now, we want to treat this as a number.
Since things commute, addition and multiplication come naturally.

> instance Num a => Num (Dual a) where
>   (D a b) + (D c d) = D (a+c) (b+d)
>   (D a b) * (D c d) = D (a*c) (b*c + a*d)

Again, negation is fairly obvious

>   negate (D a b) = D (-a) (-b)

Now Haskell wants us to give a definition for signum – i.e. give
some kind of subset of ‘positive’ duals.
Not sure that this really makes much sense. I’ll leave it undefined for now.

>   signum (D a b) = undefined

We also need to give an ‘abs’. Taking inspiration from complex numbers,
if we were to define abs via (D a b)*(D a (-b)) we get D (a*a) 0 – and taking the square root of the ‘real’ part, as with the complex modulus, leaves us with

>   abs (D a b) = D (abs a) 0

This is all kind of crazy! The complex modulus maps \mathbb{C} \rightarrow \mathbb{R}, whereas Haskell requires us to give a function from Duals to Duals. I can’t seem to find anywhere what properties such an abs function should have.

Haskell’s implementation of complex numbers does seem to give something similar to my implementation.
\mathrm{abs } z \mapsto |z| + 0i

For now then, I’ll just stick with the definition above.

Finally we need to give a way of converting from integers to duals – just
take the ‘dual’ part to be 0

>   fromInteger n = D (fromIntegral n) 0

Now since Dual numbers are PRETTY MAGIC
– f (D a 1) = D (f a) (f' a) –
we can automagically differentiate functions of type Num a => a -> a

> deriv :: Num a => (Dual a -> Dual b) -> a -> b
> deriv f x = dual $ f (D x 1)
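Collecting the definitions so far into one standalone sketch (the polynomial being differentiated is just an arbitrary example of mine), we can already handle polynomials:

```haskell
data Dual a = D a a deriving Show

dual :: Dual a -> a
dual (D _ b) = b

instance Num a => Num (Dual a) where
  D a b + D c d  = D (a + c) (b + d)
  D a b * D c d  = D (a * c) (b * c + a * d)
  negate (D a b) = D (negate a) (negate b)
  abs (D a _)    = D (abs a) 0
  signum _       = undefined
  fromInteger n  = D (fromInteger n) 0

-- seed the dual part with 1 and read off the derivative
deriv :: Num a => (Dual a -> Dual b) -> a -> b
deriv f x = dual $ f (D x 1)

-- f x = x^2 + 3x + 1 has f' x = 2x + 3, so:
-- deriv (\x -> x*x + 3*x + 1) 2 == 7.0
```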

Now we could probably do with some notion of division!

but how?!

(D a b) / (D c d) = (D a b) * (D c (-d)) / (c*c) ?!?!

Let’s try it and see if anything crazy happens.

> instance Fractional a => Fractional (Dual a) where
>   (D a b) / (D c d) = D (real x / (c*c)) (dual x / (c*c))
>     where
>       x = (D a b) * (D c (-d))
>   fromRational r = D (fromRational r) 0

Everything seems fine!

Now for maybe the more interesting bit, I’d like to implement Duals for
standard functions – sin, cos, … etc
In Haskell, this means implementing the Floating class.

> instance Floating a => Floating (Dual a) where

Taking pi to be the Dual with ‘real’ part pi seems sensible,
though maybe we should ensure that this is still twice the first
‘positive’ root of cos – though quite what positive means here is unclear.
Since it seems good to have that cos(\pi/2) = 0,
let us take pi to be \pi + 2\pi\mathrm{j}

>   pi = D pi (2*pi)

Next thing to define is exp. Taking the power series definition,
and noting that

(a+b\mathrm{j})^n = a^n + n a^{n-1} b \mathrm{j}
(which follows from binomially expanding (a+b\mathrm{j})^n and noting that every other term contains a \mathrm{j}^2).

it follows trivially that

>   exp (D a b) = D (exp a) (b*(exp a))

Ok, so maybe something better is happening here.
Why not try a general power series. If we have:
f(x) = \sum_0^\infty a_n x^n
Then we can try sticking in the dual x + y\mathrm{j}
f(x + y\mathrm{j}) = \sum_0^\infty a_n \left( x^n + n y x^{n-1}\mathrm{j} \right)
= \sum_0^\infty a_n x^n + \left (y\sum_1^\infty a_n n x^{n-1} \right ) \mathrm{j}

where we see the dual part is the termwise derivative! Woooop!
so, taking what we already ‘know’ of the derivatives of these functions
(though quite how this is applicable when our underlying Num type isn’t sane)
we get

f(a + b\mathrm{j}) = f(a) + bf'(a)

though, there is a problem in that, since Haskell doesn’t ‘know’
these functions are defined in terms of power series, we can’t use
our existing deriv function to produce f' :(
Let’s get cracking then…

>   log (D a b) = D (log a) (b*(1/a))
>   sin (D a b) = D (sin a) (b*(cos a))
>   cos (D a b) = D (cos a) (b*(- sin a))
>   sinh (D a b) = D (sinh a) (b*(cosh a))
>   cosh (D a b) = D (cosh a) (b*(sinh a))

(I’ll leave the inverse trigonometric and hyperbolic functions for now – I’m lazy :D)
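Putting all the pieces into one self-contained sketch – with two departures of mine from the text: division is written directly in quotient-rule form (equivalent to the conjugate trick above), and pi is given a zero dual part, which makes cos (pi/2) come out as exactly zero – we can check that the derivative of sin really is cos:

```haskell
data Dual a = D a a deriving Show

dual :: Dual a -> a
dual (D _ b) = b

instance Num a => Num (Dual a) where
  D a b + D c d  = D (a + c) (b + d)
  D a b * D c d  = D (a * c) (b * c + a * d)
  negate (D a b) = D (negate a) (negate b)
  abs (D a _)    = D (abs a) 0
  signum _       = undefined
  fromInteger n  = D (fromInteger n) 0

instance Fractional a => Fractional (Dual a) where
  D a b / D c d  = D (a / c) ((b * c - a * d) / (c * c))
  fromRational r = D (fromRational r) 0

-- inverse trigonometric/hyperbolic methods omitted, as in the post
instance Floating a => Floating (Dual a) where
  pi            = D pi 0
  exp  (D a b)  = D (exp a)  (b * exp a)
  log  (D a b)  = D (log a)  (b / a)
  sin  (D a b)  = D (sin a)  (b * cos a)
  cos  (D a b)  = D (cos a)  (b * negate (sin a))
  sinh (D a b)  = D (sinh a) (b * cosh a)
  cosh (D a b)  = D (cosh a) (b * sinh a)

deriv :: Num a => (Dual a -> Dual b) -> a -> b
deriv f x = dual $ f (D x 1)

-- deriv sin 0 == 1.0 (= cos 0), and by the chain rule
-- deriv (\x -> sin (x * x)) 1 == 2 * cos 1
```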

We can even do crazy things now by taking duals over complex numbers, though
what this ‘means’ I’m not yet sure.

I shall have to think on this.

On the other hand, we are now able to differentiate fairly complicated functions.

> g x = (sin (x * cos (log (1 / (x**2 - 2*x + 3)))))**3
> g' = deriv g


Haskell Snippet – Summing series

Say I wanted to calculate something like \displaystyle\sum_{r=10}^{40} (r^2 - r - 1)

With Haskell:

> sum[r^2-r-1|r<-[10..40]]

This can be generalised to something like the following:

sumseries what from to = sum [what n | n <- [from..to]]


> sumseries (\r->r^2) 1 10
> sumseries (\r->r^2-r-1) 10 40

On paper, I’d have to

  • split this into two sums: \displaystyle\sum_{r=1}^{40} (r^2 - r -1) - \sum_{r=1}^{9} (r^2 - r -1)
  • find an expression for \displaystyle\sum_{r=1}^{n} (r^2 - r -1) using standard sums of the first n squares and naturals, giving \displaystyle\sum_{r=1}^{n} (r^2 - r -1) = \frac{1}{3}(n-2)n(n+2)
  • substitute n = 40 and n = 9 into this, and subtract the latter from the former
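And the closed form does agree with the direct sum – a quick check of mine:

```haskell
-- direct evaluation of the sum
direct :: Integer
direct = sum [r^2 - r - 1 | r <- [10..40]]

-- the closed form n(n-2)(n+2)/3; always an integer,
-- since one of the three factors is divisible by 3
closed :: Integer -> Integer
closed n = (n - 2) * n * (n + 2) `div` 3

-- direct == closed 40 - closed 9
```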