89pies nomnomnom

3 Jul 2011

How TFL endorsed me as a taxi firm, or how to monitor external links

This is going to be a technical post, so first a short bit of background for anybody who doesn't live in London. Transport for London (TFL) is a local government body responsible for most aspects of travel in London including (but not limited to) the Tube, buses, trains, streets and, most relevant to this post, taxis. They are an official body, so information supplied by them is deemed to be trusted.

I wanted to book a taxi one night and, as my usual company AddisonLee was fully booked (it was the day of the public sector strikes), I decided to find another company on the TFL website. As I usually prefer to book online (I can pay by card) I tried the first link on the page, which was to http://www.onenumbertaxis.co.uk/. However, rather than a site with a nice booking form, I was redirected to http://this-is-not-a-real-web-site.com. I wasn't sure quite why, so I decided to have a quick check, and the easiest way to do this is just to curl it on the command line.
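The same first-hop check can be scripted. The sketch below is a rough Python equivalent using only the standard library: it sends a HEAD request without following redirects and reports the status code and `Location` header. The function names `first_hop` and `describe` are made up for this illustration and are not part of any tool mentioned in this post.

```python
import http.client
from urllib.parse import urlsplit

# Status codes that tell the client to go to another URL.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def first_hop(url, timeout=10.0):
    """Request `url` once, without following redirects.

    Returns (status_code, Location header or None).
    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        # HEAD fetches the headers only, which is all a redirect check needs.
        conn.request("HEAD", target)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def describe(status, location):
    """Turn a (status, location) pair into a one-line summary."""
    if status in REDIRECT_CODES and location:
        return f"{status} redirect to {location}"
    return f"{status} (no redirect)"
```

Calling `describe(*first_hop("http://www.onenumbertaxis.co.uk/"))` would print a one-line summary of wherever that link currently sends you; what it prints depends on the state of the site on the day you run it.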

The response was exactly as I thought it would be: a redirect from the original domain to the new one. As always with a domain which looks interesting, my first response was to see whether it was registered...
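A quick first pass at that question can be scripted as a DNS lookup. This is only a sketch under a big assumption: a name that fails to resolve is a hint that the domain may be free, not proof, because a registered domain can have no DNS records at all. A WHOIS lookup or a registrar search is what actually confirms availability. The function name `domain_resolves` and its `resolver` parameter are invented for this example.

```python
import socket


def domain_resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves to at least one address.

    `resolver` is injectable so the function can be exercised without
    touching the network. A False result is a hint, not proof, that the
    domain is unregistered.
    """
    try:
        return len(resolver(hostname, None)) > 0
    except socket.gaierror:
        # Raised when the name cannot be resolved.
        return False
```

For the domain in this post the call would be `domain_resolves("this-is-not-a-real-web-site.com")`.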

And it's not. This means TFL are effectively linking to a domain which doesn't exist and which anybody can register. Naturally, whenever I find a domain like this I think about registering it, but this one was different: it was linked to from a trusted site, and anybody could easily register it, redirect it to a similar-looking domain and start a fake taxi firm. TFL would then be endorsing an unlicensed minicab firm, which goes against all the good work they have been doing with Cabwise. Of course I went ahead and registered it, and at the time of writing, clicking on the top link on the TFL taxi page takes you to this post, effectively endorsing my site as a reputable taxi company.

The horror stories of what could have happened are for another post, so to continue on the technical aspect: what could they have done to prevent this from happening? As anybody who has used the web knows, it is ever changing, and sites can change with alarming frequency. A site which was nice and safe one day can be full of porn or worse the next, which can seriously damage a company's reputation and image, both for the company who used to own the domain and for anybody linking to it.

So what could TFL have done about this? They can't manually check every link every day, so the only answer which makes sense is automation. There are so many things you can automate online, from reports to tests to content validation and security testing. This issue falls under content validation, and it is a relatively easy one to perform once you have the architecture in place. For a scenario like this I would suggest having some form of data store containing your third-party URLs. This can range from an extra field on a database table for a small site like a blog to a document store such as MongoDB if you have a huge number of external links.
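To make the small-site end of that range concrete, here is a minimal sketch of such a store using SQLite from the Python standard library. The table name `external_links`, its columns and the helper functions are all assumptions made up for this example; a real site would fit this into whatever schema it already has.

```python
import sqlite3

# Hypothetical schema: one row per third-party URL the site links to.
SCHEMA = """
CREATE TABLE IF NOT EXISTS external_links (
    id           INTEGER PRIMARY KEY,
    url          TEXT NOT NULL UNIQUE,
    source_page  TEXT NOT NULL,              -- page on our site carrying the link
    last_checked TEXT,                       -- UTC timestamp, NULL if never checked
    last_status  INTEGER,                    -- HTTP status seen on the last check
    failures     INTEGER NOT NULL DEFAULT 0  -- consecutive failed checks
)
"""


def open_store(path=":memory:"):
    """Open (or create) the link store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_link(conn, url, source_page):
    """Register an external URL; adding the same URL twice is a no-op."""
    conn.execute(
        "INSERT OR IGNORE INTO external_links (url, source_page) VALUES (?, ?)",
        (url, source_page),
    )
    conn.commit()


def record_result(conn, url, status, ok):
    """Store the outcome of a check and update the failure counter."""
    conn.execute(
        "UPDATE external_links SET last_checked = datetime('now'), last_status = ?, "
        "failures = CASE WHEN ? THEN 0 ELSE failures + 1 END WHERE url = ?",
        (status, 1 if ok else 0, url),
    )
    conn.commit()


def links_to_check(conn):
    """Return URLs oldest-check first; SQLite sorts never-checked (NULL) rows first."""
    rows = conn.execute("SELECT url FROM external_links ORDER BY last_checked")
    return [row[0] for row in rows]
```

Keeping a consecutive-failure counter next to each URL means the checker itself can stay stateless between runs.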

How you handle this data once you have decided to automate the testing varies as well. As with all good architecture, you need to design it to scale well without over-engineering it, otherwise you either never finish building it or deliver a system so complex that it is full of issues. However you write it, there are a few key questions you will have to ask yourself:

  1. How often should you check? The system must run on a regular basis. Once a day is probably enough for most applications, but some systems will require a much higher frequency of monitoring.
  2. What constitutes a failure? For some systems this will be the domain not responding within a certain timeframe, a key piece of content missing (usually </html>, to note that the page has failed mid-generation) or a non-200 response.
  3. How do you handle a failure? This should be an action such as queuing another check within a set time frame (e.g. 5 minutes) in case the error is temporary, alerting the monitoring systems or even automatically removing the content from the site.
  4. What happens next? If the content is removed automatically, should it come back automatically, go into a moderation queue or have to be reinstated manually?
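Questions 2 to 4 can be sketched as a couple of small decision functions. This is one possible set of answers rather than the right one: the 5 minute retry delay comes from the example above, while `MAX_FAILURES`, the `CheckResult` type and the action names are assumptions invented for this illustration.

```python
from dataclasses import dataclass
from typing import Optional

RETRY_DELAY_SECONDS = 5 * 60  # recheck after 5 minutes in case the error is temporary
MAX_FAILURES = 3              # hypothetical: consecutive failures before pulling the link


@dataclass
class CheckResult:
    """Outcome of fetching one external link."""
    status: Optional[int]  # HTTP status, or None if nothing came back in time
    body: str = ""         # response body, used for the content check


def failure_reason(result):
    """Question 2: return why the check failed, or None if the link looks healthy."""
    if result.status is None:
        return "timeout"
    if result.status != 200:
        return f"status {result.status}"
    if "</html>" not in result.body.lower():
        # No closing tag suggests the page died mid-generation.
        return "truncated page"
    return None


def next_action(result, previous_failures):
    """Questions 3 and 4: decide what to do after a check.

    `previous_failures` is the number of consecutive failures already recorded.
    """
    if failure_reason(result) is None:
        return "ok"
    if previous_failures + 1 < MAX_FAILURES:
        return "retry"  # queue another check after RETRY_DELAY_SECONDS
    # Repeated failure: take the link down and leave reinstatement to a human.
    return "remove_and_moderate"
```

Note that with this definition of failure a plain redirect such as a 301 counts as a failure when redirects are not followed. For a case like the one in this post that is exactly what you want, since the redirect was the first sign that something had changed.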

There are so many variants of the above that it's impossible to say exactly how your application should behave without sitting down to plan it out. It should be a key part of your architecture, decided during the initial planning and not left until the last minute, as I can guarantee it will otherwise get overlooked in the rush to get the product out. Ensure whoever is looking at this understands your infrastructure as well. In short, ensure your Technical Leads and Architects know what they are doing; this goes without saying but is worth repeating!

In this case I will now wait to see how long it takes TFL to fix the issue after being notified. If they haven't done it in a few days I will chase them again, although leaving your automated testing to random people on the Internet is not really the best approach to take.

Incidentally, the link doesn't have a massively high click-through rate (CTR). I'll try to compile some stats at some point, but I would have expected it to be higher than it is.

Lastly, for anybody who says I should not have registered this domain and should just have told TFL: I have informed them via their web contact form, so hopefully they will fix it quickly. I wanted to register it as soon as I could after seeing it, in case anybody else had seen it and thought of doing the same and redirecting it to something a lot less palatable.
