Search Engines, Friend or Foe?
Input validation, perimeter control, user education, cryptography, physical security, access control; the list goes on and on. Each of these needs its own special considerations, things like: What validations am I going to put in place on my web application to protect my database backend? What kind of input must I discard to ensure stability? What attacks am I to expect?
One item, though, rarely appears on this list of concerns: search engines. There is some awareness of the subject, and there is even a term for the activity, ‘Google hacking’, but I still don’t see it being taken seriously, and not many people know about it.
In this article I will focus mainly on Google, because it is the most popular search engine and because it offers advanced functionality that helps people find what they want efficiently. That efficiency can be used by you and me but can also be used by someone who has less than good intentions.
How can search engines such as Google be a threat to security?
Well, the answer is: in many ways.
The obvious threat is that someone finds restricted information. Google can search not only web pages but, in some cases, also the text inside certain supported document files. These include:
- Adobe Portable Document Format (pdf)
- Adobe PostScript (ps)
- Lotus 1-2-3 (wk1, wk2, wk3, wk4, wk5, wki, wks, wku)
- Lotus WordPro (lwp)
- MacWrite (mw)
- Microsoft Excel (xls)
- Microsoft PowerPoint (ppt)
- Microsoft Word (doc)
- Microsoft Works (wks, wps, wdb)
- Microsoft Write (wri)
- Rich Text Format (rtf)
- Shockwave Flash (swf)
- Text (ans, txt)
This means that, if your files are available online, people can use Google to search for text inside them. A malicious person might search for the phrase “social security number”, and if a file on your site contains that term, it may show up in the results, ready to be downloaded.
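As a rough self-check, the sketch below walks a local copy of a web root and lists every file whose extension appears in the list above. The function name and the idea of auditing a local directory are assumptions made for illustration; the script only shows which files could be indexed if they are reachable, not whether Google has actually indexed them.

```python
from pathlib import Path

# Extensions taken from the list of supported document types above.
INDEXABLE_EXTENSIONS = {
    "pdf", "ps", "wk1", "wk2", "wk3", "wk4", "wk5", "wki", "wks", "wku",
    "lwp", "mw", "xls", "ppt", "doc", "wps", "wdb", "wri", "rtf", "swf",
    "ans", "txt",
}


def find_indexable_documents(web_root):
    """Return paths (relative to web_root) of files with an indexable extension."""
    root = Path(web_root)
    hits = []
    for path in root.rglob("*"):
        extension = path.suffix.lower().lstrip(".")
        if path.is_file() and extension in INDEXABLE_EXTENSIONS:
            hits.append(path.relative_to(root))
    return sorted(hits)


# Hypothetical usage: point it at a local copy of your published folder.
# for document in find_indexable_documents("/var/www/html"):
#     print(document)
```

Any file it reports deserves a second look: does it really need to be public, and does it contain anything you would not want to appear in a search result?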
Google the Super Computer
While I am sure that Google has lots of processing power in its infrastructure, I am not referring to taking advantage of that power directly. Even so, it is possible to use Google as a super computer of sorts. Let’s assume that one has an application which stores passwords as MD5 hashes. Lots of applications do that, and the reasoning sounds solid: if a hash is somehow stolen it is no big deal, because it is very computationally expensive to get the password back from an MD5 hash, right? It would take over a year to break an 8-letter password, right? WRONG. It would take over a year if you simply tried to brute force it, but what if you search Google instead? Yes, that’s right: take an MD5 hash and search for it on Google. If the password is a dictionary word, chances are that you will get a hit! An MD5 hash is always the same for the same input, so any page that lists a word next to its hash effectively works as a lookup table.
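To see why a search can stand in for brute force, here is a minimal sketch of the same idea done locally. The tiny word list is a made-up stand-in for the pages a search engine has already indexed; the point is only that an unsalted MD5 hash of a given word never changes, so a precomputed table can reverse it instantly.

```python
import hashlib


def md5_hex(text):
    """Return the hexadecimal MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def build_lookup_table(words):
    """Map each word's MD5 digest back to the word itself."""
    return {md5_hex(word): word for word in words}


# Hypothetical word list standing in for indexed hash listings.
wordlist = ["password", "obscure", "letmein"]
table = build_lookup_table(wordlist)

# A "stolen" hash is reversed with a single dictionary lookup.
stolen_hash = md5_hex("obscure")
print(table.get(stolen_hash))  # obscure
```

No cracking takes place here: the expensive work was done once, when the table was built, and every later lookup is effectively free.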
I ran some tests and here are the results:
|MD5 that was searched for||The “password”||Number of hits|
It should take roughly 300 days to guess “obscure” with a sequential brute force attack using only the English alphabet in both lower and upper case. Yet a simple search on Google returns an answer in less than a second. For a simple system, an MD5 hash of a password would generally have been considered enough, on the reasoning that the system holds no data that would justify spending years of computational power to crack it, but with search engines you can find the answer in seconds. Google even found a match for the MD5 of a random collection of letters such as AFVX. However, the same was not true for longer random strings or for phrases: searches for the MD5s of complex random characters and of phrases didn’t return any matches. Still, keeping Google in mind and playing it safe, storing a plain MD5 hash of a password is no longer enough; now we need to add a pinch of salt to the mix.
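A minimal sketch of what adding salt can look like follows. Note that it swaps in PBKDF2 with SHA-256 from Python’s standard library instead of salting MD5 directly; that choice, the iteration count and the function names are my assumptions, not something prescribed above. The property that matters is that the same password produces a different stored value every time, so searching for the hash leads nowhere.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor; tune it for your own hardware


def hash_password(password, salt=None):
    """Return (salt, digest); a fresh random salt is generated if none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)


# Store both the salt and the digest alongside the user record.
salt, stored = hash_password("obscure")
print(verify_password("obscure", salt, stored))
print(verify_password("wrong guess", salt, stored))
```

The salt does not need to be secret. Its job is to make each stored hash unique, so that a precomputed table or an indexed list of hashes no longer helps.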
Aiding Malicious Hackers
Another hazard presented by search engines is that they let malicious hackers either find you or study you without giving any warning. An attack comes in one of two forms: either you were a target of opportunity, or you were deliberately targeted. In both cases a malicious hacker is able to use Google as one of his tools.
Target of Opportunity
Sometimes a hacker doesn’t have a particular target in mind; instead he has exploits, and he wants to use them to gain access to as many machines as possible. The first step is to identify the machines that have the vulnerabilities he can exploit. In pre-Google days, this step involved scanning the internet for the service he intended to exploit and then trying to identify the version running on each host the scan turned up, a task that took a very long time to complete. Today, however, a simple search can produce a list of targets in less than a second. Searching for: intitle:index.of “Apache/1.3.34 Server at” for example returns a huge list (3 million+) of domains that are running Apache 1.3.34 and also have directory listing enabled in some of their web folders. Obviously it is not just Apache that can be found this way; it’s IIS, web applications, scripts and even appliances. Anything with a web front-end, really, that might be indexed by Google. For example, searching for: inurl:hp/device/this.LCDispatcher will return a number of web front-ends for HP LaserJet printers.
Please realize that if you try these searches they will return real live systems and printers that people and companies are using. Use these queries to test your own sites; be aware that accessing printers that you do not own might open you up to legal action. These examples are only provided to illustrate the point and the dangers, please do not misuse the knowledge.
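To keep such checks confined to your own site, the queries can be combined with the site: operator. The sketch below builds the query strings and the matching search URLs; the function names and the template list are illustrative assumptions, and example.com stands in for a domain you own. It only constructs strings and sends nothing.

```python
from urllib.parse import quote_plus

# Query fragments taken from the examples above; extend with your own checks.
AUDIT_TEMPLATES = [
    "intitle:index.of",
    "inurl:hp/device/this.LCDispatcher",
]


def build_audit_queries(domain, templates=AUDIT_TEMPLATES):
    """Restrict every template to a single domain using the site: operator."""
    return [f"site:{domain} {template}" for template in templates]


def to_search_url(query):
    """Turn a query string into a Google search URL to open in a browser."""
    return "https://www.google.com/search?q=" + quote_plus(query)


for query in build_audit_queries("example.com"):
    print(query)
    print(to_search_url(query))
```

Opening the resulting URLs by hand in a browser is enough for an occasional audit of your own domain.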
Targeted Attack
In the previous section we saw how Google can help a malicious hacker find a target of opportunity, but how can Google help a hacker who intends to target a specific person?
When a malicious hacker intends to infiltrate a specific target, the first step is to gather intelligence. The attacker needs to know what he can reach; he basically needs to catalogue every service, server and appliance that is accessible from his location (the internet). Previously he would have achieved this by scanning his target for open ports and fingerprinting the services behind them. That worked, and still does today, but it leaves a footprint. This footprint can be detected by firewalls and log analysers and can alert an administrator that someone is scanning his network. This might give the administrator time to prepare and keep a close eye on his network, and possibly even to track down the attacker. At the very least, the scan leaves a trail back to the attacker, even if he finds no weakness to exploit and never acts on his intentions. But what if he does his fingerprinting using Google?
Given a search query like site:[domain], Google will list all the pages it has indexed on that site. In those results one might find running services, servers, versions, scripts and even appliances. The attacker can then start to devise an attack plan without exposing himself and without worrying whether the administrator of the target site is on to him. Additionally, should he decide that he has no way to penetrate his target, he can safely give up without any consequences.
In conclusion, search engines are of course very useful in allowing us to find things easily and quickly. Unfortunately, they also allow people with malicious intent to find the things they are interested in just as easily. My suggestion, therefore, is to keep search engines in mind when going through your security tasks, be they part of development, system administration, web design or anything else that can be affected by a search engine. Don’t depend on security by obscurity, as that obscurity might not be as obscure as you imagine. Taking simple precautions can help a lot. It is possible to control which parts of your site Google will index and which it will ignore. There are a lot of resources that webmasters can use at: http://www.google.com/webmasters/ and http://www.google.com/support/webmasters/
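One common way to tell crawlers what to skip is a robots.txt file at the root of the site. The sketch below uses Python’s standard urllib.robotparser to check how a set of rules would be interpreted. The rules, paths and domain are made-up examples, and the function name is my own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are examples only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /reports/
"""


def is_crawlable(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_crawlable(ROBOTS_TXT, "https://example.com/index.html"))
print(is_crawlable(ROBOTS_TXT, "https://example.com/admin/users"))
```

Keep in mind that robots.txt is only a request that well-behaved crawlers honour, and that the file itself is public. It keeps pages out of search results, but anything truly sensitive still needs real access control.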
It is also important to disable directory listings unless they are absolutely needed, and when they are needed, they should be protected as well. Appliances such as printers should never be connected to the internet unless absolutely necessary, and when they are, make sure that they’re secure and cannot be accessed by everyone. Always remember that an appliance can be used just like any other machine to get a foothold inside your network.
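A quick way to check your own pages for this is to look for the tell-tale markers that the searches earlier in the article rely on. The sketch below inspects HTML you have already fetched from your own server. It assumes the usual Apache-style listing layout (an “Index of /” title and a footer such as “Apache/1.3.34 Server at”); other servers format their listings differently, and the function names are mine.

```python
import re

# Patterns modelled on Apache-style listing pages; other servers will differ.
LISTING_TITLE = re.compile(r"<title>\s*Index of\s+/", re.IGNORECASE)
SERVER_FOOTER = re.compile(
    r"<address>\s*(Apache/[\d.]+)[^<]*Server at", re.IGNORECASE
)


def looks_like_directory_listing(html):
    """Return True if the page title matches an auto-generated listing."""
    return LISTING_TITLE.search(html) is not None


def exposed_server_banner(html):
    """Return the server name and version from the footer, or None."""
    match = SERVER_FOOTER.search(html)
    return match.group(1) if match else None


# Hypothetical page content, as it might be returned by one of your folders.
sample = (
    "<html><head><title>Index of /backup</title></head><body>"
    "<h1>Index of /backup</h1>"
    "<address>Apache/1.3.34 Server at www.example.com Port 80</address>"
    "</body></html>"
)
print(looks_like_directory_listing(sample))  # True
print(exposed_server_banner(sample))         # Apache/1.3.34
```

If either check fires on a page of yours, both the folder contents and the server version are available to anyone who runs the right search.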
Finally I am curious as to your views on the subject – are search engines something you worry about? Do you think that search engines are a threat to the security of your system, but that maybe it’s a threat that’s mitigated through your normal routines and doesn’t really require any additional steps? Maybe you wouldn’t really consider them a threat at all?
I personally think that they’re a threat that is generally overlooked, but perhaps not a huge one at the end of the day, since the steps to protect against it are ultimately best practices that should be followed in the first place.