Saturday, 30 June 2012

Python BeautifulSoup scraper script

Beautiful Soup is an HTML/XML parser written in Python. It can handle non-standard tags and builds a parse tree from them. It also provides simple, commonly used operations for navigating, searching, and modifying that tree.
Recently I tried the BeautifulSoup module in order to parse HTML by class. Here is a simple scraping script that extracts information from the Apple website.

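 #! /usr/bin/python
 # CGI script (Python 2, BeautifulSoup 3): list iPhone 4S options from the Apple store page
 from BeautifulSoup import BeautifulSoup
 import urllib
 # CGI header; the '\r\n' plus print's own newline ends the header section
 print 'Content-type: text/plain\r\n'
 webpage = urllib.urlopen("http://store.apple.com/us/browse/home/shop_iphone/family/iphone/iphone4s")
 soup = BeautifulSoup(webpage.read())
 # Find the <ul> elements that list the available models
 tags = soup('ul', {'class': 'selection-options all-models'})
 # In the first list, keep <span> tags whose only attribute is one of the wanted classes
 tags = tags[0](lambda tag: len(tag.attrs) == 1 and tag.name == 'span' and
                tag['class'] in ['shipping', 'price', 'color', 'title'])
 for tag in tags:
     print tag.text
     print '-' * 30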


Results:
16GB2
------------------------------
black
------------------------------
From$199
------------------------------
In Stock
------------------------------
16GB2
------------------------------
black
------------------------------
From$199
------------------------------
In Stock
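
The script above depends on Python 2, BeautifulSoup 3, and a page layout that may no longer exist. As a rough illustration of the same filtering idea (keep `span` tags whose only attribute is one of the wanted classes), here is a minimal sketch that swaps in Python 3's standard-library `html.parser` in place of BeautifulSoup. The names `SpanCollector`, `extract_spans`, and `SAMPLE_HTML` are made up for this example, and the sample markup is an assumption, not the real Apple page. It does not restrict the search to the `ul` and does not handle nested spans.

```python
from html.parser import HTMLParser

# Classes we want to keep, same as in the script above
WANTED = {"shipping", "price", "color", "title"}


class SpanCollector(HTMLParser):
    """Collect the text of <span> tags whose only attribute is a wanted class."""

    def __init__(self):
        super().__init__()
        self.results = []    # text of each matching span, in document order
        self._buffer = None  # text chunks while inside a matching span

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if (tag == "span" and len(attrs) == 1
                and attrs[0][0] == "class" and attrs[0][1] in WANTED):
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._buffer is not None:
            self.results.append("".join(self._buffer).strip())
            self._buffer = None


# Hypothetical markup, loosely modelled on the structure the script expects
SAMPLE_HTML = """
<ul class="selection-options all-models">
  <li>
    <span class="title">16GB</span>
    <span class="color">black</span>
    <span class="price">From $199</span>
    <span class="shipping">In Stock</span>
    <span class="other" id="x">ignored</span>
  </li>
</ul>
"""


def extract_spans(html):
    """Return the text of every matching span found in html."""
    parser = SpanCollector()
    parser.feed(html)
    parser.close()
    return parser.results


if __name__ == "__main__":
    for text in extract_spans(SAMPLE_HTML):
        print(text)
        print("-" * 30)
```

For real pages, BeautifulSoup remains the more forgiving choice, since it is designed to cope with malformed markup.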

Tuesday, 26 June 2012

Set up a Galaxy production server in an Ubuntu and Apache environment

1.) Create a user called galaxy (here with the password galaxy)
 admin@myserver:~# adduser galaxy  
 Adding user `galaxy' ...  
 Adding new group `galaxy' (1007) ...  
 Adding new user `galaxy' (1008) with group `galaxy' ...  
 Creating home directory `/home/galaxy' ...  
 Copying files from `/etc/skel' ...  
 Enter new UNIX password:  
 Retype new UNIX password:  
 passwd: password updated successfully  
 Changing the user information for galaxy  
 Enter the new value, or press ENTER for the default  
     Full Name []:  
     Room Number []:  
     Work Phone []:  
     Home Phone []:  
     Other []:  
 Is the information correct? [Y/n] Y  
 admin@myserver:~#  
2.) Switch to the galaxy user and clone the production version of Galaxy
 admin@myserver: su galaxy  
 galaxy@myserver:/home/admin$ cd   
 galaxy@myserver: hg clone https://bitbucket.org/galaxy/galaxy-dist  
3.) If you don't have the Mercurial client needed to clone Galaxy, install it first with the following step
 admin@myserver:sudo apt-get install mercurial  
4.) Set the $TEMP environment variable to Galaxy's new_files_path directory
 galaxy@myserver:~$ export TEMP=/home/galaxy/galaxy-dist/database/tmp  
5.) We need a clean Python interpreter with the correct Python path, so we create one with virtualenv
 galaxy@myserver:wget http://bitbucket.org/ianb/virtualenv/raw/tip/virtualenv.py  
 galaxy@myserver:/usr/bin/python2.6 virtualenv.py --no-site-packages galaxy_env  
6.) Now we need to set up a new database for Galaxy. I am going to create a PostgreSQL database
 galaxy@myserver:~/galaxy-dist$ psql -h localhost -d postgres -U postgres  
 postgres=# CREATE DATABASE galaxy_prod;  
 postgres=# CREATE USER galaxy_prod_user WITH PASSWORD 'galaxy';  
 postgres=# GRANT ALL PRIVILEGES ON DATABASE galaxy_prod to galaxy_prod_user;  
 postgres=# \q  
7.) Then we need to change Galaxy's default server settings to match our server details
 galaxy@myserver:cd galaxy-dist/  
 galaxy@myserver:~/galaxy-dist$ chmod -R 777 universe_wsgi.ini  
 galaxy@myserver:~/galaxy-dist$ vi universe_wsgi.ini  
8.) Here are the basic changes to the universe_wsgi.ini file.
 host = xxx.xxx.23.123 [IP ADDRESS]  
 debug = False  
 use_interactive = False  
 database_connection = postgres://galaxy_prod_user:galaxy@localhost:5432/galaxy_prod  
9.) There are many more changes we can make to Galaxy by customizing universe_wsgi.ini, for instance adding tracks, user privileges, FTP upload, etc. Galaxy has its own server, but some pages have static content, so we can set up a proxy in front of it to improve efficiency
 admin@myserver:vi /etc/httpd/conf/httpd.conf  
10.) Add the following lines to httpd.conf
 <VirtualHost *:80>  
 ServerName xxx.xxx.23.123 [IP ADDRESS]  
 RewriteEngine on  
 #RewriteLog "/etc/httpd/logs/rewrite_log"  
 #RewriteLogLevel 9  
 RewriteRule ^/galaxy$ /galaxy/ [R]  
 #RewriteRule ^/galaxy/static/style/(.*) /home/galaxy/galaxy-dist/static/june_2007_style/blue/$1 [L]  
 #RewriteRule ^/galaxy/static/scripts/(.*) /home/galaxy/galaxy-dist/static/scripts/packed/$1 [L]  
 #RewriteRule ^/galaxy/static/(.*) /home/galaxy/galaxy-dist/static/$1 [L]  
 #RewriteRule ^/galaxy/favicon.ico /home/galaxy/galaxy-dist/static/favicon.ico [L]  
 #RewriteRule ^/galaxy/robots.txt /home/galaxy/galaxy-dist/static/robots.txt [L]  
 RewriteRule ^/galaxy(.*) http://localhost:8080$1 [P]  
 </VirtualHost>  
11.) Now we need to restart the proxy server.
 admin@myserver:/etc/init.d/httpd restart  
12.) Finally, we can run Galaxy
 galaxy@myserver:~/galaxy-dist$ sh ./run.sh --daemon  
13.) We can stop it or check its status using the following commands
 galaxy@myserver:~/galaxy-dist$ sh ./run.sh --stop-daemon  
 galaxy@myserver:~/galaxy-dist$ sh ./run.sh --status  
14.) Done! 
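
The database_connection value from step 8 packs the user, password, host, port, and database name from step 6 into a single URL, and a typo in any part will stop Galaxy from connecting. As a quick sanity check, the following Python 3 sketch splits such a URL into its parts using only the standard library. The helper name `split_connection_url` is made up for this example, and this is not how Galaxy itself reads the setting.

```python
from urllib.parse import urlsplit


def split_connection_url(url):
    """Split a database URL into its parts (hypothetical helper, not part of Galaxy)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        # The path is "/<database name>", so drop the leading slash
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    # Value taken from step 8 above
    url = "postgres://galaxy_prod_user:galaxy@localhost:5432/galaxy_prod"
    for key, value in split_connection_url(url).items():
        print(key, "=", value)
```

The printed user, password, and database name should match what was created in the psql session in step 6.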

Saturday, 23 June 2012

Ubuntu folder/file permission


chmod -R 754 /www/test

7 – owner (current user)
5 – group (set by owner)
4 – anyone else

Each digit is built from the following basic permission values

0 – no permission, this person cannot read, write or execute
1 – execute only
2 – write only
3 – execute and write only (1 + 2)
4 – read only
5 – execute and read only (1 + 4)
6 – write and read only (2 + 4)
7 – execute, write and read (1 + 2 + 4)
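
To make the arithmetic concrete, here is a small Python 3 sketch that decodes a three-digit mode such as 754 into the permissions for owner, group, and everyone else. The function names `describe_digit` and `describe_mode` are made up for this example. It only covers the three basic digits, not special bits such as setuid or the sticky bit.

```python
# Each permission is one bit of the octal digit: read = 4, write = 2, execute = 1
PERMISSION_BITS = ((4, "read"), (2, "write"), (1, "execute"))


def describe_digit(digit):
    """Return the permissions encoded in a single octal digit (0-7)."""
    names = [name for bit, name in PERMISSION_BITS if digit & bit]
    return ", ".join(names) if names else "no permission"


def describe_mode(mode):
    """Decode a three-digit mode string such as '754'."""
    who = ("owner", "group", "others")
    return {w: describe_digit(int(d)) for w, d in zip(who, mode)}


if __name__ == "__main__":
    for who, perms in describe_mode("754").items():
        print(who, "-", perms)
```

For 754 this reports read, write and execute for the owner, read and execute for the group, and read only for everyone else.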