NumPy: How to do data analysis using Python?


In my last post, we discussed "How to use lists, arrays, Numpy to make life easy with Python?". If you are a beginner, go to the first post of this blog. In this post, I will discuss how data can be analysed in Python. Sometimes we have a huge number of data files and we want to take averages over some specific contents of the files, or want to read only some specific rows and columns. This post is not the first post about data analysis; there are many excellent bloggers and websites who have explained in multiple ways how data should be analysed using Python!


My aim is to bring multiple techniques together and make them easy to use for the reader. When I started learning Python, I found all the information on the internet, but I did not know how to use it. It was very easy to do a statistical analysis of one file (Python makes it a one-second job), but I did not know how to combine multiple files and then analyse only specific data from these files.


I will introduce the basic things here. It will be like ready-to-use coffee: mix it into water and drink it. If you really want to learn in detail how to make the mixture, then look at SciPy's "Statistical Data Analysis in Python", NumPy's "statistics using numpy", and glob's "this cool glob introduction!". Moreover, there is this excellent website "Lynda.com: Introduction to Data Analysis with Python!". There are certainly many more good resources out there, but I do not know or could not cite them all.



In this post, I will go with my simple style and explain by examples how to analyse data using NumPy, SciPy, and glob in Python! For debugging, I would suggest using the Jupyter notebook. Just type "jupyter notebook" in the terminal on Linux/Unix systems to start it. It will give you some options in the browser; open a notebook with whatever Python version it is showing. I have explained in my last posts how to use the IPython notebook; the IPython notebook is now called the Jupyter notebook. If you do not want to use it, you can still use command-line Python or Anaconda.

Which packages do we need, and how do we install them? Instructions on how to install NumPy and SciPy are in my last blogs; Anaconda has most of the modules built in. Open a notebook or a file called stat.py (or anythingyouwant.py).

 
import glob                      # find and sort data files
import numpy as np               # arrays and basic statistics
import matplotlib.pyplot as plt  # plotting
import pylab as pl               # alternative plotting interface
from scipy import stats          # statistical functions
Here glob will manage the files, numpy the arrays, and scipy the statistics, while matplotlib (and pylab) will help with the plotting.




How to do statistics on a single data file in Python?

This is a very easy task and there are many ways to do it; I would just use numpy for this one. I have explained it in my last post ("How to use lists, arrays, Numpy to make life easy with Python?"), but a minimal sketch follows below.
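Here is a minimal sketch of single-file statistics with numpy; note that "single.data" is a hypothetical file name, so replace it with one of your own files.

import numpy as np

a = np.loadtxt('single.data')   # load the whole file into an array
print(a.mean())                 # mean of all values
print(a.std())                  # standard deviation
print(a.min(), a.max())         # smallest and largest values
print(a.mean(axis=0))           # column-wise means, if the file has several columns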

How to analyse multiple files in Python?

How to use glob to sort the data files and analyse them?

This is a very important question for everyone who deals with huge data files and data sets. As a side note, I would like to mention that Python is extremely useful for scientific computation. Here it does not matter how the data appeared; the main question is how the data will be analysed. I will cover everything from very simple statistics to more complicated error analysis methods.

Once all the modules above have been imported, we can use glob to import multiple files into the program.
files = sorted(glob.glob('*.data'))
Now we have imported all the files into our program. The next question is how to read these files? This we will see in the next example code.
for filename in files:            # iterate over the sorted data files
    print(filename)               # show which file is being processed
    a = np.loadtxt(filename)      # load the file into an array (explained in my last post)
    mean_a = np.array([row.mean() for row in a], dtype=np.float64)  # mean of each row of this file
    np.savetxt("./" + filename + ".datt", mean_a, delimiter=" ")    # the mean values of this file are saved here

We have now calculated the mean for all the files in your folder. The condition was that the files had the same ending (*.data); you can exchange this pattern for something else. The "sorted" call only makes sure the files are processed in alphabetical order; if this is not important for you, you can leave the sorted part out.

How to use glob to split a file into multiple arrays?
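There is not much to it: a minimal sketch (assuming hypothetical *.data files with three columns each) is to let glob pick the file and let the unpack option of np.loadtxt return each column as its own array.

import glob
import numpy as np

fname = sorted(glob.glob('*.data'))[0]      # take the first matching data file
x, y, z = np.loadtxt(fname, unpack=True)    # unpack=True gives one array per column
print(x.mean(), y.mean(), z.mean())         # statistics per column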

How to do error analysis in Python?

Bootstrap and jackknife error analysis in Python?

Usually when we do scientific computing, one of our main goals is to determine the error in our quantities. There are multiple ways of analysing the errors. Most of you who are familiar with scientific computing will know about the bootstrap and jackknife methods. In the future, I will be making a post only about error analysis, but in this post, I am giving a ready-to-use way to do this kind of analysis. We will use the AstroML and pandas packages in addition to the packages above. As I wrote, I will make a full post including the maths and codes in the advanced version of this blog, so here I will not explain very much about these packages. I would like to mention that pandas and AstroML are very useful for people who need advanced statistical techniques. Those of you who are into astrophysics should certainly look into the AstroML package on the official webpage.
If you want to read more about pandas, visit the official website, which has great information. There are many nice tutorials which will give a lot of insight into these packages; look at "how to use pandas for statistical analysis?", for example. I will give you here again a short program.

import numpy as np                                 # arrays and basic statistics
import matplotlib.pyplot as plt                    # plotting
from glob import glob                              # file pattern matching
import pandas as pd                                # advanced data handling
import scipy.stats                                 # statistical functions
from astroML.resample import jackknife, bootstrap  # resampling-based error estimates
from astroML.stats import sigmaG                   # robust rank-based width estimator
fnames = glob('*.dat')                     # find all matching data files
arrays = [np.loadtxt(f) for f in fnames]   # each file converted to an array; with multiple columns, each column becomes its own sub-array
print(len(arrays))                         # check how many files were loaded
n = len(arrays)
mean = np.mean(arrays, axis=0)             # mean over the files, element by element
std = np.std(arrays, axis=0)               # standard deviation over the files
var = np.var(arrays, axis=0)               # variance over the files
err = std / np.sqrt(n)                     # standard error of the mean

new = np.column_stack((mean, err))         # first column: mean, second column: error in the mean
namei = "mydata.dat"                       # name of the file we want to save
np.savetxt("./" + namei + ".av", new, delimiter=" ")   # save as a two-column array
for i in range(n):
    dede = np.ravel(arrays[i])             # flatten one file's data to a 1D sample
    mu1, sigma_mu1 = jackknife(dede, np.mean, kwargs=dict(axis=1))             # jackknife mean and its error
    mu2, sigma_mu2 = jackknife(dede, np.std, kwargs=dict(axis=1))              # jackknife standard deviation and its error
    mu1_bootstrap = bootstrap(dede, 10, np.std, kwargs=dict(axis=1, ddof=1))   # bootstrap with 10 resamples
    mu2_bootstrap = bootstrap(dede, 10, sigmaG, kwargs=dict(axis=1))           # bootstrap of the robust sigmaG
    print(fnames[i], mu1, sigma_mu1)
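A quick note on reading the bootstrap output (assuming the variables above): the bootstrap call returns the whole distribution of the resampled statistic, so np.mean(mu1_bootstrap) gives the bootstrap estimate and np.std(mu1_bootstrap) its error. With only 10 resamples this is just a quick sanity check; a few thousand resamples would give a much more reliable error bar.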

 
 
This is it for today... I will update you soon with more advanced programs. As soon as we have touched the tongue of the python, we will start learning how to use the venom! Cheers!






