maths with Python 4: Loading data.

The first tutorial was about installing Python and some packages, and using it for basic maths.

In the second one we used Python to solve the Rössler system.

In the third one we switched to the Anaconda distribution of Python and used it to solve the diffusion equation.

Now we are going to go back to an old post: A Little bit DataMining on Web.

So, we are going to load some data into Python and plot it. As the source we are going to use, again, American FactFinder from the United States.

Step 1. Get the data from American FactFinder. We are going to look for data about New York, and for that we use the tool on the main webpage.

Step 2. Now let’s move into population and see what they have…

Step 3. Ok. Here we are: the data about the population age distribution. Let’s download it.

A plain download, no extras.

Step 4. Open it with a text editor to see what it looks like.

So, there are 4 files. One is a readme describing the files. Another contains notes about the data. The third one contains the labels for the data (age ranges), and the fourth one contains the data as numbers.

The main file here is DEC_10_DP_DPDP1_with_ann.csv, which has 2 rows. The first row holds the keys that identify each value, and the second one holds the data.

The file DEC_10_DP_DPDP1_metadata.csv contains the keys again, each with a description of what it means.
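As a small sketch of how the metadata file can be used, the snippet below builds a lookup from key to description. It assumes the metadata file is a plain CSV with one "key,description" pair per row and no header row. The function name `load_descriptions`, the keys `KEY_A` and `KEY_B`, and all the descriptions are placeholders written for illustration, not copied from the real download.

```python
import csv
import io

# Placeholder text standing in for the metadata file: one "key,description" pair per row.
# The descriptions are illustrative only; check them against your own download.
SAMPLE_METADATA = (
    'GEO.display-label,Example description of the geography label\n'
    'KEY_A,"Example description, with a comma"\n'
    'KEY_B,Another example description\n'
)

def load_descriptions(csvfile):
    """Return a dict mapping each key to its description."""
    descriptions = {}
    for row in csv.reader(csvfile):
        if len(row) >= 2:  # skip blank or malformed rows
            descriptions[row[0]] = row[1]
    return descriptions

descriptions = load_descriptions(io.StringIO(SAMPLE_METADATA))
for key, text in descriptions.items():
    print(key, '->', text)
```

With the real file, the same function could be called on an open file object, for example `open('DEC_10_DP_DPDP1_metadata.csv', newline='')`, to find out which keys hold the age groups you want.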

Step 5. The code! (Remember that WordPress does not always preserve indentation when Python code is pasted, so check the indentation before running it.)

# Select the file to open using a graphical "Open" dialog.
# (These are the Python 3 module names; on Python 2 they are Tkinter and tkFileDialog.)
from tkinter import Tk
from tkinter.filedialog import askopenfilename
import csv
import numpy as np
import matplotlib.pyplot as plt

Tk().withdraw()  # we don't want a full GUI, so keep the root window from appearing
filename = askopenfilename()  # show an "Open" dialog box and return the path to the selected file
print(filename)
# Now the path to our file is in the variable called filename.

# We are going to import the data from FactFinder. This kind of file has two rows:
# the first one with the keys and the second one with the data.
# There is a second file where the keys are explained.
labels = []
data = []
with open(filename, newline='') as csvfile:
    # The csv defaults are used here: comma delimiter and double-quoted fields.
    reader = csv.reader(csvfile)
    for i, row in enumerate(reader):
        if i == 0:
            labels.extend(row)
        else:
            data.extend(row)

# At this point, we have a list with the labels (keys) and a second list with the data.

# The next part is to create a dictionary with the keys and the data. A dictionary is
# a kind of list where you address entries by key instead of by position.
d = {}
for key, val in zip(labels, data):
    try:
        d.setdefault(key, []).append(float(val))
    except ValueError:
        # Not a number (for example the place name), so keep it as text.
        d.setdefault(key, []).append(val)

# Now make the plot like common demographic plots.
# There are 18 age groups; we place one bar every 4 units along the vertical axis.
ind = 4 * np.arange(18)
# Taking a look at the file with the key descriptions, we found that the keys take
# the form HD02_S0... So basically, we put the data we want to use into arrays.
men = np.zeros(18)
for k in range(18):
    men[k] = d['HD02_S0' + str(27 + k)][0]
women = np.zeros(18)
for k in range(18):
    women[k] = d['HD02_S0' + str(52 + k)][0]

fig = plt.figure()
width = 4
# First the men on the left.
ax1 = fig.add_subplot(121)
# This is a horizontal bar plot.
ax1.barh(ind, men, width, color='blue')
# But we need to set the x ticks and their labels.
# We only want 4 ticks.
ticks = [max(max(men), max(women)) * z / 4 for z in range(4)]
ax1.set_xticks(ticks)
# And now we need to write the tick labels and say that the axis is in units of 10^3.
ticks2 = [int(z / 1000) for z in ticks]
ax1.set_xticklabels(ticks2)
ax1.set_xlabel('Population x1000')
# Finally, we invert this axis.
ax1.invert_xaxis()
# Second, the women on the right.
ax2 = fig.add_subplot(122)
ax2.barh(ind, women, width, color='pink')
# We only want 4 ticks.
ax2.set_xticks(ticks)
ax2.set_xticklabels(ticks2)
ax2.set_xlabel('Population x1000')
# As an extra, we can use the Geo label to save the graph as an SVG file with the name of the city.
fig.savefig(d['GEO.display-label'][0] + '.svg')

And this is the output: two horizontal bar plots side by side, men on the left and women on the right.

So… the code has lots of comments, but a little more explanation will help.
The code has 3 parts.

The first one is opening the file. That is done using a GUI (Graphical User Interface), basically a standard "Open file" dialog window. This is quite useful when we don’t want to hard-code where our data is located or when we are dealing with many different files.
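A possible variation on the first part, sketched under a few assumptions: the path is taken from the command line when one is given, and the dialog is only shown otherwise. The function name `choose_file` and the file names in the example call are made up for illustration. The module names are the Python 3 ones (`tkinter`, `tkinter.filedialog`); on Python 2 they are `Tkinter` and `tkFileDialog`.

```python
def choose_file(argv):
    """Return the path given on the command line, or ask with an "Open" dialog."""
    if len(argv) > 1:
        return argv[1]  # path given explicitly, no GUI needed
    # Imported only when needed, so the function also works on machines
    # without a display as long as a path is passed in.
    from tkinter import Tk
    from tkinter.filedialog import askopenfilename
    Tk().withdraw()  # hide the empty root window
    return askopenfilename()

# Example call with a hypothetical argument list (script name, then a path).
print(choose_file(['script.py', 'data.csv']))
```

In a real script the call would be `choose_file(sys.argv)` after `import sys`, which keeps the dialog for interactive use and allows unattended runs too.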

The second one is quite standard for reading CSV files. Since we know that our file has just two rows, we organize what we read from the file into two lists, one for the labels and another one for the data. Once we have them, we build a dictionary from them: basically, a kind of list where you address the elements using keys instead of positions. Once the dictionary is ready, we can use it to create 2 arrays, one with the men's data and the other one with the women's data.
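The same idea can be sketched more compactly with `zip`. This is a minimal illustration, not the code used in the post: the sample text is made up in the same two-row shape as the FactFinder file, the values are placeholders, and the helper names `to_number` and `read_two_row_csv` are hypothetical. Unlike the code above, each key here maps directly to a value rather than to a one-element list.

```python
import csv
import io

# Made-up two-row sample: first row keys, second row values (placeholders).
SAMPLE = (
    'GEO.display-label,HD02_S027,HD02_S028,HD02_S052,HD02_S053\n'
    '"Example city, Example state",10,20,30,(X)\n'
)

def to_number(text):
    """Convert text to float when possible, otherwise keep the string."""
    try:
        return float(text)
    except ValueError:
        return text

def read_two_row_csv(csvfile):
    """Build a dict from a CSV whose first row holds keys and second row values."""
    reader = csv.reader(csvfile)
    keys = next(reader)
    values = next(reader)
    return {key: to_number(value) for key, value in zip(keys, values)}

d = read_two_row_csv(io.StringIO(SAMPLE))
# Consecutive keys can be generated instead of typed out one by one.
men = [d['HD02_S{:03d}'.format(27 + k)] for k in range(2)]
print(d['GEO.display-label'])
print(men)
```

The `{:03d}` format pads the number to three digits, so the same expression keeps working for keys whose number is below 10 or above 99.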

The last part is just a little formatting to plot the data into bar plots.
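The tick arithmetic used for the formatting can be tried on its own. The sketch below repeats the calculation from the code (four evenly spaced ticks up to three quarters of the largest value, labelled in thousands) as a function. The name `population_ticks` is hypothetical, and the input numbers are placeholders rather than census figures.

```python
def population_ticks(men, women, n_ticks=4):
    """Return tick positions and their labels in thousands."""
    largest = max(max(men), max(women))
    positions = [largest * z / n_ticks for z in range(n_ticks)]
    labels = [int(p / 1000) for p in positions]
    return positions, labels

# Placeholder counts, not census figures.
positions, labels = population_ticks([120000, 400000], [90000, 380000])
print(positions)  # [0.0, 100000.0, 200000.0, 300000.0]
print(labels)     # [0, 100, 200, 300]
```

The positions are what goes into `set_xticks` and the labels into `set_xticklabels`. Using the same pair for both subplots is what keeps the men's and women's axes comparable.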

And that’s all.

…

Ok, so here it is. Since we have the code, it is quite easy to download more data and plot it. So a few more graphs and a little bit of Inkscape can make very nice plots. See you soon!

To dig more into Python file I/O: http://www.tutorialspoint.com/python/python_files_io.htm
