Parsing dynamic websites

I am interested in how plane ticket prices change with the time left until departure, so I figured I’d go ahead and parse my favorite websites for plane tickets: hipmunk, world ticket center and skyscanner.

I go to hipmunk and type the date and the destinations. I can see that this generates the following http link:

http://www.hipmunk.com/flights/AMS-to-BUE#!dates=Jan27&pax=1

This is good, because it means that I can adjust the link and automate some of the data collection: the date and the destinations are right there in the link. To parse the website, all I should need to do is fetch the page and look for the proper tag.
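
Adjusting the link comes down to string formatting. A minimal sketch, assuming the pattern seen in the one example link also holds for other airports and dates. build_hipmunk_link is a made-up helper name, and writing the day without a leading zero is a guess, since the example only shows a two-digit day:

```python
from datetime import date

# English month abbreviations, spelled out to avoid locale-dependent strftime.
MONTHS = ('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')

def build_hipmunk_link(origin, destination, departure, passengers=1):
    """Build a search link that follows the pattern of the example link.

    Hypothetical helper: the pattern is inferred from a single link and is
    an assumption, not a documented hipmunk interface.
    """
    stamp = '%s%d' % (MONTHS[departure.month - 1], departure.day)
    return 'http://www.hipmunk.com/flights/%s-to-%s#!dates=%s&pax=%d' % (
        origin, destination, stamp, passengers)

# Three consecutive departure dates for the same route.
for day in (27, 28, 29):
    print(build_hipmunk_link('AMS', 'BUE', date(2013, 1, day)))
```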

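import urllib

# Python 2; in Python 3 the same function is urllib.request.urlopen
link = 'http://www.hipmunk.com/flights/AMS-to-BUE#!dates=Jan27&pax=1'
html = urllib.urlopen(link)   # file-like response object
website = html.read()         # the raw html as a string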

First I import the required python library. I take the original website link and store it as a string in the variable link. This then gets opened by calling urllib.urlopen(), which returns a response object for the page. Calling html.read() on that object gives the html as a string. You can print the website variable to confirm that it does contain html.

Because the website variable is a string, we do not need a parsing library per se; we can just use website.find() for now. We want to check if we can find the cheapest ticket price in the string. It turns out that we cannot. Try it out for yourself.

If I inspect the element that lists the price through Chrome’s console, I can see $921 on the page, so website.find('$921') should return an index. Instead it returns -1. What is happening?

Javascript is being used to fetch the data from the server. The html that urllib receives is only the initial page; the search results still need to be generated. A large portion of javascript needs to run before the html contains the prices and we can parse it.
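
The difference can be illustrated with two made-up strings. They are invented stand-ins for what urllib receives and for what the browser shows once the scripts have run, not real hipmunk markup. str.find() returns the index of the first match, or -1 when the substring is absent:

```python
# Invented stand-ins, not real hipmunk markup.
raw_html = ('<html><body><div id="results"></div>'
            '<script src="app.js"></script></body></html>')
rendered_html = ('<html><body><div id="results">'
                 '<span class="price">$921</span></div></body></html>')

# The raw response only holds an empty container and a script tag.
print(raw_html.find('$921'))

# After the script has filled the container, the price can be found.
print(rendered_html.find('$921'))
```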

This is where selenium comes into play. Selenium is a neat little python package that drives a real browser, so the javascript on a webpage gets to run before we ask for the html. It is a bit more advanced than the standard urllib library in python. To get it to work in this example I simply have python execute the following code:

from selenium import webdriver

browser = webdriver.Firefox()   # opens a real Firefox window
browser.get('http://www.hipmunk.com/flights/AMS-to-BUE#!dates=Jan27&pax=1')
content = browser.page_source   # html as the browser currently sees it
browser.quit()                  # close the window again

Now typing content.find('$921') returns a positive index. We can begin parsing the string.
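
As a first step in parsing, a regular expression can pull every dollar amount out of the string. This is only a sketch: find_dollar_prices is an illustrative name, and the sample string is invented rather than taken from the real page.

```python
import re

def find_dollar_prices(html):
    """Return every dollar amount such as '$921' or '$1,204' as an int."""
    # A dollar sign followed by digits, optionally with thousands separators.
    matches = re.findall(r'\$(\d[\d,]*)', html)
    return [int(m.replace(',', '')) for m in matches]

# Invented sample; in practice this would be the content variable.
sample = '<span>$921</span><span>$1,204</span><span>$987</span>'
prices = find_dollar_prices(sample)
print(min(prices))  # the cheapest ticket in the sample
```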

The downside of this approach is that it is slow. You will see a browser window being opened and closed while python is executing this code. Still, this allows for a lot of clever hacking.
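
If the price has not been rendered yet at the moment page_source is read, one option is to keep polling until it shows up. Selenium ships its own WebDriverWait helper for this; the sketch below does the same thing by hand so that it works with anything that has a page_source attribute. wait_for_text and its default timings are assumptions for illustration.

```python
import time

def wait_for_text(browser, marker, timeout=30.0, poll=0.5):
    """Poll browser.page_source until marker appears or timeout passes.

    Returns the page source that contains the marker, or None on timeout.
    """
    deadline = time.time() + timeout
    while True:
        source = browser.page_source
        if marker in source:
            return source
        if time.time() >= deadline:
            return None
        time.sleep(poll)

# Usage with selenium (assumes Firefox and its driver are installed):
# from selenium import webdriver
# browser = webdriver.Firefox()
# browser.get(link)
# content = wait_for_text(browser, '$')
# browser.quit()
```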

Big final tip: if you open a browser inside a for loop, call browser.quit() at the very end of each iteration. Otherwise every pass leaves another browser window open.
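
One way to make sure the browser is always closed, even when a page fails to load, is try/finally. A sketch, where collect_page_sources is an illustrative name and make_browser stands for any zero-argument callable that returns a browser object, such as selenium's webdriver.Firefox:

```python
def collect_page_sources(links, make_browser):
    """Open each link in a fresh browser and return the page sources."""
    sources = []
    for link in links:
        browser = make_browser()
        try:
            browser.get(link)
            sources.append(browser.page_source)
        finally:
            # Runs even if get() raises, so no window is left open.
            browser.quit()
    return sources

# Usage with selenium (assumes Firefox and its driver are installed):
# from selenium import webdriver
# pages = collect_page_sources([link], webdriver.Firefox)
```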

By doing this for about 80 days I was able to see a pattern occur in plane ticket prices:

It seems rather unwise to buy tickets less than 40 days in advance, unless you are buying a last-minute ticket. It must be said that this graph was made on December 1st 2012, so the spike in prices might be due to Christmas. The full code needed to parse skyscanner can be found on github, but also here:

import re
from selenium import webdriver

results = []
days = ['%02d' % d for d in range(1, 31)]   # '01' .. '30'
months = ['1212', '1301', '1302']           # YYMM

for month in months:
    for day in days:
        browser = webdriver.Firefox()
        link2 = ('http://www.skyscanner.nl/vluchten/ams/buea/' + month + day +
                 '/vliegtarieven-van-amsterdam-schiphol-naar-buenos-aires-in-januari-2013.html')
        browser.get(link2)
        bcontent = browser.page_source

        # take the 25 characters starting at 'px EUR'; the price is in there
        start = bcontent.find('px EUR')
        price = bcontent[start:start + 25] if start != -1 else ''
        match = re.search(r"\d+", price)

        if match:
            print('got one')
            results.append([month + day, match.group()])
        else:
            print(month + day)   # no price found for this date
        browser.quit()
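
The hard-coded days list above skips the 31st and asks for dates that do not exist, such as 30 February. A sketch of generating the same YYMMDD codes with datetime instead; date_codes is an illustrative helper, and the start and end dates are assumptions picked to match the months list:

```python
from datetime import date, timedelta

def date_codes(start, end):
    """Yield YYMMDD strings for every day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current.strftime('%y%m%d')
        current += timedelta(days=1)

# December 2012 through February 2013, matching the months list.
codes = list(date_codes(date(2012, 12, 1), date(2013, 2, 28)))
print(len(codes))
```

Each code can then be dropped into the link where month + day is used.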

  • Datanerd

    How did you store and read the data once your script finished?

    • admin

      Dear Datanerd,
      there are a few ways to do this. You can use the file IO that python has built in, or you can use the csv module.

      The following code writes a txt file that can be interpreted as a csv later on:

      filename = "test_file.txt"
      print("Writing to file: %s" % filename)
      # 'with' closes the file even if the write fails
      with open(filename, 'w') as out_file:
          out_file.write("text text text")

      Otherwise the .csv document can also be created by using the csv module in python.


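      import csv

      lol = [[1,2,3],[4,5,6],[7,8,9]]
      item_length = len(lol[0])

      # 'wb' is for Python 2; on Python 3 use open('test.csv', 'w', newline='')
      with open('test.csv', 'wb') as test_file:
          file_writer = csv.writer(test_file)
          # row i of the file holds item i of every inner list (a transpose)
          for i in range(item_length):
              file_writer.writerow([x[i] for x in lol])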

  • jesse

    Thank you for making this tutorial. By modifying your process I was able to (albeit very slowly) scrape over 300 pages reliably every day. It wasn't so hard because the site just changed the id all the way from 2 to 342 for each page :) so it was just a case of counting up and appending. Did you make the graphs yourself or have you used something to automate the plotting? It would be great to know what you use. Thanks again

    • Vincent Warmerdam

      I use R for a lot of data oriented things, particularly the ggplot2 plotting package (details of this package can be found here: http://docs.ggplot2.org/current/ )

      Parsing is slow with this method because it takes time for all the javascript to load, which is a bummer.

      Glad to be of help!

      • jesse

        Yeah, I don't mind the load time. I was so glad to finally create a stable scraper that could handle javascript in an easy way :P Thanks for the plotting package tip!