%matplotlib inline
import numpy as np

# draw 50 samples from a 2-d Gaussian with the given mean and covariance
cov = np.array([[2.0, 0.8], [0.8, 0.6]])
mean = np.array([2., 1.])
rd = np.random.multivariate_normal(mean, cov, 50)
x = rd[:, 0]  # work with the first component for now
x.mean()
Standard deviation: the "average distance of a data point from the mean".
$$ \sigma_x = \sqrt{\frac{\sum_{i=1}^{m} (\bar x - x_i)^2}{m-1}} $$

The estimated mean "sees" a smaller spread than the true mean (a hand-waving argument for the $m-1$ instead of $m$ in the denominator).

But use $m$ in the denominator (i.e. $1/m$) instead if the true mean is known rather than estimated from the data:
# denominator m (numpy's default, ddof=0)
print(np.sqrt(((x - x.mean())**2).mean()))
print(x.std())

# denominator m-1 (unbiased, ddof=1)
print(np.sqrt(((x - x.mean())**2).sum() / (len(x) - 1.)))
print(x.std(ddof=1))
x.var()  # variance; also uses denominator m by default (ddof=0)
x = rd  # from here on, use the full 2-d sample
Note: $C = C^T$ (the covariance matrix is symmetric)
np.cov(x.T)  # covariance matrix (np.cov expects variables in rows, hence the transpose)
np.corrcoef(x.T)  # the same, normalized to the correlation matrix
$-1 \leq \rho \leq 1$
$\rho > 0$ : $x$ and $y$ are correlated
$\rho < 0$ : $x$ and $y$ are anti-correlated
$\rho \approx 0$ : $x$ and $y$ are (linearly) uncorrelated (all three cases are illustrated in the sketch below)
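A quick numerical illustration of the three cases (a minimal sketch, not from the original notebook; the toy samples and the seed are arbitrary):

rng = np.random.default_rng(0)
u = rng.normal(size=500)
noise = 0.3 * rng.normal(size=500)

print(np.corrcoef(u, u + noise)[0, 1])             # rho > 0: correlated
print(np.corrcoef(u, -u + noise)[0, 1])            # rho < 0: anti-correlated
print(np.corrcoef(u, rng.normal(size=500))[0, 1])  # rho ~ 0: uncorrelated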
For what level of measurement is the Pearson correlation adequate?
# example from the book "Python for Data Analysis"
# pandas.io.data has been removed from pandas; pandas-datareader is its successor
from pandas_datareader import data as web
import pandas as pd

all_data = {}
for ticker in ['IBM', 'MSFT', 'AAPL', 'AMZN']:  # 'GOOG' doesn't seem to work here
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2000', '1/5/2015')

price = pd.DataFrame({tic: data['Adj Close'] for tic, data in all_data.items()})
volume = pd.DataFrame({tic: data['Volume'] for tic, data in all_data.items()})
price.plot()
# percent change over a given number of periods (here: day-to-day returns)
returns = price.pct_change()
returns.tail(9)
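For reference, pct_change computes $(p_t - p_{t-1}) / p_{t-1}$; a minimal check on a toy Series (not from the original notebook):

s = pd.Series([100., 110., 99.])
s.pct_change()  # ~ [NaN, 0.1, -0.1]: up 10%, then down 10%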
import matplotlib.pyplot as plt
plt.scatter(returns.MSFT, returns.IBM)
plt.xlabel('price.pct_change of MSFT')
plt.ylabel('price.pct_change of IBM')
# correlation coefficient between two return series
returns.MSFT.corr(returns.IBM)
returns.corr()   # full pairwise correlation matrix
returns.cov()    # full pairwise covariance matrix
returns.corrwith(returns.IBM)   # correlation of each column with the IBM returns
returns.corrwith(volume)        # column-by-column: returns vs. trading volume
Pearson's correlation coefficient is only sensitive to linear correlations:
from IPython.display import SVG, display
# Wikipedia's gallery of scatter plots with their Pearson coefficients:
# display(SVG(url='https://upload.wikimedia.org/wikipedia/commons/d/d4/Correlation_examples2.svg'))
Independence implies zero correlation, but not vice versa:

$$ x \perp y \Leftrightarrow P(x,y) = P(x)\,P(y) \Rightarrow \operatorname{cov}(x,y) = 0 $$

but

$$ x \perp y \nLeftarrow \operatorname{cov}(x,y) = 0 $$
Analytic example: $y = x^2$ is fully determined by $x$, yet the covariance is zero.
x = np.array([-1, 0, 1])
y = x**2  # deterministic function of x
d = np.concatenate((x[:, None], y[:, None]), axis=1)
print(np.cov(d.T))  # the off-diagonal (covariance) entries are zero
plt.scatter(d[:, 0], d[:, 1])
Distance correlation (a non-linear correlation coefficient):
"You take all pairwise distances between sample values of one variable, and do the same for the second variable. Then center the resulting distance matrices (so each has column and row means equal to zero) and average the entries of the matrix which holds componentwise products of the two centered distance matrices. That’s the squared distance covariance between the two variables. The population quantity equals zero if and only if the variables are independent, whatever be the underlying distributions and whatever be the dimen- sion of the two variables. " INTRODUCING THE DISCUSSION PAPER BY SZÉKELY AND RIZZO, MICHAEL A. NEWTON
Literature and Links: