Sunday, 8 June 2014

Running statistics using RunningStats (how to run on Python using SWIG and Distutils)


There is a more accurate method for calculating focal statistics than usually implemented in numerical packages. This is documented here and the C++ code is download from here

The SWIG setup file is simple:

/* File: RunningStats.i */
%module RunningStats
%{
#define SWIG_FILE_WITH_INIT
#include "RunningStats.h"
%}
%include "RunningStats.h"
view raw RunningStats.i hosted with ❤ by GitHub


which you run using:

swig -c++ -python RunningStats.i


The Distutils setup script is:

"""
setup.py file for RunningStats
"""
from distutils.core import setup, Extension
RunningStats_module = Extension('_RunningStats',
sources=['RunningStats_wrap.cxx', 'RunningStats.cpp'],
)
setup (name = 'RunningStats',
version = '0.1',
author = "John D. Cook / Daniel Buscombe",
description = """Compute running stats on an array""",
ext_modules = [RunningStats_module],
py_modules = ["RunningStats"],
)
view raw setup.py hosted with ❤ by GitHub


compile using "python setup.py install"

Then in python, a simple implementation of calculating the global stats from an array would be something like:

import numpy as np
# generate some data
x = np.random.rand(1000,1)
# get the class
import RunningStats
# get an instance of the class
rs = RunningStats.RunningStats()
# global stats
for k in x:
rs.Push(k[0])
print rs.Variance()
print rs.Mean()
print rs.Kurtosis()
print rs.Skewness()
print rs.StandardDeviation()
rs.Clear()
view raw simple_stats.py hosted with ❤ by GitHub


 To show the convergence of locals means on the global mean, you might use the RunningStats class in the following way:

import numpy as np
import matplotlib.pylab as plt
# generate some data
x = np.random.rand(1000,1)
# get the class
import RunningStats
# get an instance of the class
rs = RunningStats.RunningStats()
# convergence on global mean
convav = []
for k in x:
rs.Push(k[0])
convav.append(rs.Mean())
plt.plot(x); plt.plot(convav,'r'); plt.show()
view raw conv_stats.py hosted with ❤ by GitHub

 Finally, to compute running averages on the array you might use something like the following, where the function to window the data efficiently was downloaded from here

import numpy as np
import matplotlib.pylab as plt
# generate some data
x = np.random.rand(1000,1)
# get the class
import RunningStats
# get an instance of the class
rs = RunningStats.RunningStats()
# define a window size for the running average
window_size = 10
from itertools import product
# use sliding window_nd from here: http://www.johnvinyard.com/blog/?p=268
xs = sliding_window_nd(np.squeeze(x),window_size,1)
runav = []
for j in xrange(len(xs)):
for k in xs[j]:
rs.Push(k)
runav.append(rs.Mean())
rs.Clear()
i = np.linspace(1,len(x),len(runav))
plt.plot(x); plt.plot(i,runav,'r');
# compute using a larger window size
window_size = 50
xs = sliding_window_nd(np.squeeze(x),window_size,1)
runav = []
for j in xrange(len(xs)):
for k in xs[j]:
rs.Push(k)
runav.append(rs.Mean())
rs.Clear()
i = np.linspace(1,len(x),len(runav))
plt.plot(i,runav,'k');
plt.show()

Obviously, as well as the mean, the above could be used for variance, standard deviation, kurtosis and skewness.