I am interested in the effect of time on plane ticket prices. So I figured I’d go ahead and parse my favorite websites for plane tickets: hipmunk, world ticket center and skyscanner.
I go to hipmunk and type the date and the destinations. I can see that this generates the following http link:
http://www.hipmunk.com/flights/AMS-to-BUE#!dates=Jan27&pax=1
This is good because this means that I can adjust the link and automate some of the data collection. I can change the date and the destinations. If I want to parse the website all I need to do is look for the proper tag and then I’d be done.
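As a minimal sketch of that idea, the link can be assembled from its parts. The helper name build_hipmunk_link and its parameters are my own invention for illustration; the only thing taken from the site is the URL layout seen in the link above, and it may well not cover every option the site supports.

```python
def build_hipmunk_link(origin, destination, date, passengers=1):
    """Assemble a search link following the pattern observed above.

    The function name and parameters are hypothetical; only the URL
    layout is taken from the link shown in the browser.
    """
    template = 'http://www.hipmunk.com/flights/{0}-to-{1}#!dates={2}&pax={3}'
    return template.format(origin, destination, date, passengers)


# Rebuild the link from the example, then try another date.
link = build_hipmunk_link('AMS', 'BUE', 'Jan27')
other_link = build_hipmunk_link('AMS', 'BUE', 'Jan28')
print(link)
print(other_link)
```

Changing the date or the airports is then just a matter of passing different arguments.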
import urllib

link = 'http://www.hipmunk.com/flights/AMS-to-BUE#!dates=Jan27&pax=1'
html = urllib.urlopen(link)
website = html.read()
First I import urllib. I take the original website link and store it as a string in the variable link. Calling urllib.urlopen() opens the link and returns a file-like response object, and html.read() turns that response into a string holding the html the server sent. You can print the website variable to confirm that it does contain html.
Because the website variable is a string, we do not strictly need a parsing library for it; for now we can just use website.find(). We want to check if we can find the cheapest ticket price in the string. It turns out that we cannot find it. Try it out for yourself.
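For readers who have not used find() before, here is a small illustration of how it behaves. The sample string is made up for the example and is not what hipmunk actually returns.

```python
# A made-up snippet of html, standing in for a real page source.
sample = '<div class="price">$921</div>'

found = sample.find('$921')      # index where the text starts
missing = sample.find('$1500')   # -1 means the text is not in the string

print(found)
print(missing)
```

So a result of -1 always means the text is simply not in the string we downloaded.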
If I inspect the element that lists the price in Chrome’s developer tools, I can see the price $921 on the page, so website.find('$921') should return an index. Instead it returned -1. What is happening?
The page uses javascript to fetch its data from the server. The html that first comes back does not contain the search results yet; a large portion of javascript needs to run before the prices appear in the html and we can parse it.
This is where selenium comes into play. Selenium is a neat little python package that drives a real browser, so the javascript on the page actually runs before we ask for the html. It is a bit more advanced than the standard urllib library in python. To get it to work in this example I simply have python execute the following code:
import codecs
import lxml.html as lh
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://www.hipmunk.com/flights/AMS-to-BUE#!dates=Jan27&pax=1')
content = browser.page_source
browser.quit()
Now typing content.find('$921') returns a positive index. We can begin parsing the string.
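One possible way to start parsing, sketched here with a regular expression rather than a full html parser. The function name find_prices and the sample string are assumptions for illustration; the real page source will look different and may need a different pattern.

```python
import re


def find_prices(page_source):
    """Return every dollar amount such as $921 found in the text, as ints.

    Hypothetical helper: it assumes prices are written as a dollar sign
    followed by digits, optionally with commas as thousands separators.
    """
    matches = re.findall(r'\$(\d[\d,]*)', page_source)
    return [int(match.replace(',', '')) for match in matches]


# Made-up stand-in for the string returned by browser.page_source.
sample = '<span>$921</span><span>$1,045</span><span>$980</span>'
prices = find_prices(sample)
cheapest = min(prices)
print(cheapest)
```

Taking the minimum of the list gives the cheapest ticket among the prices that were found.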
The downside of this approach is that it is slow. You will see the browser window being opened and closed while python is executing this code. Still, this allows for a lot of clever hacking.
Big final tip: if you open a browser inside a for loop, call browser.quit() at the very end of each pass through the loop. Otherwise every iteration leaves another browser window open.
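A slightly more defensive version of that tip, as a sketch: wrapping the work in try/finally makes sure the browser is closed even when loading the page fails. The helper name fetch_page_source is hypothetical, and make_browser stands for any callable that returns a selenium-style driver, such as webdriver.Firefox.

```python
def fetch_page_source(make_browser, link):
    """Open a browser, load the link, and always close the browser again.

    make_browser is any callable that returns a selenium-style driver
    (an object with get(), page_source and quit()). Hypothetical helper.
    """
    browser = make_browser()
    try:
        browser.get(link)
        return browser.page_source
    finally:
        # Runs even if browser.get() raises, so no windows are left open.
        browser.quit()
```

With selenium installed, this would be called as fetch_page_source(webdriver.Firefox, link) inside the loop.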
By doing this for about 80 days I was able to see a pattern occur in plane ticket prices:
It seems rather unwise to buy tickets less than 40 days in advance, unless you are buying a last-minute ticket. It must be said that this graph was made on December 1st 2012, so the spike in prices might be due to Christmas. The total code needed to parse skyscanner can be found on GitHub but also here:
import codecs
import lxml.html as lh
from selenium import webdriver
import time
import re

results = []
days = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10',
        '11', '12', '13', '14', '15', '16', '17', '18', '19', '20',
        '21', '22', '23', '24', '25', '26', '27', '28', '29', '30']
months = ['1212', '1301', '1302']

for month in months:
    for day in days:
        browser = webdriver.Firefox()
        link2 = 'http://www.skyscanner.nl/vluchten/ams/buea/' + month + day + '/vliegtarieven-van-amsterdam-schiphol-naar-buenos-aires-in-januari-2013.html'
        browser.get(link2)
        bcontent = browser.page_source
        price = bcontent[bcontent.find('px EUR'):bcontent.find('px EUR')+25]
        if len(price) > 0:
            print 'got one'
            results.append([month+day, str(re.search(r"\d+", price).group())])
        else:
            print month + day
        browser.quit()
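One weakness of the hard-coded days list is that it skips the 31st and produces dates that do not exist, such as February 30th. A possible alternative, sketched below, is to let the datetime module generate the date codes. The function name date_codes is my own, and the start and end dates are just an example range matching the three months used above.

```python
from datetime import date, timedelta


def date_codes(start, end):
    """Yield a yymmdd string for every calendar day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current.strftime('%y%m%d')
        current += timedelta(days=1)


# Example range: December 2012 through February 2013.
codes = list(date_codes(date(2012, 12, 1), date(2013, 2, 28)))
print(codes[0])
print(codes[-1])
```

Each code can then take the place of month + day when the link is built.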

