Can we do Web Automation without a browser? (Google is my Real Best Friend!)

Hi Enthusiasts,

(I am really sorry for the long wait… 😦 )

Many times we search for all kinds of information with the help of the Google search engine. Frankly speaking, I even check my internet connection by typing Google's address into the URL bar. But can you imagine how Google is actually helping us? Why do we mumble every time that "Google is my best friend"? Can you analyze how Google talks with our machines with the help of Wireshark?

Here are some snapshots from my Wireshark capture:

The Wireshark image above is what today's topic is all about: nothing but talking with Google without any browser. A great resource for Google hacking is Google Hacking for Penetration Testers, Volume 2.

Is there any relation between penetration testers and Google? Oh, simply YES!!

Google is just an awesome tool for penetration testers. You just need to understand it properly, and you can gather a lot of information for your respective project 😉 ! All you have to do every time is "just ask" Google. That's it!

Now there are two options for asking Google a question:

  • with browser and
  • without browser

We all know about the browser option, so let's move on to the main point of the article and find interesting stuff without a browser.

Here we use a Perl script to get a listing of files from Google. In this article we will be using Perl with modules such as LWP::UserAgent and WWW::Mechanize.

Before going directly to the exact scripting, we should first understand Perl and Perl script behaviour through the links below:
Perl Download link :- Click Here !
General Information :- Click Here !
Perl Module Information :- Click Here !
Best URL of all time :- ! Google is our best friend now ! 😛

Below is the Perl script used to get a listing of files from Google, with the help of the LWP::UserAgent module to handle the web requests.

—————Perl Script with LWP::UserAgent————-
#!/usr/bin/perl
use LWP::UserAgent;
use HTML::Parse;

$site = $ARGV[0];
$filetype = $ARGV[1];
$searchurl = "http://www.google.com/search?q=site%3A$site+filetype%3A$filetype";
$useragent = new LWP::UserAgent;
$useragent->agent('Mozilla/4.0 (compatible; MSIE 5.0; Windows 95)');
$request = HTTP::Request->new('GET', $searchurl);
$response = $useragent->request($request);
$body = $response->content;
$parsed = HTML::Parse::parse_html($body);
for (@{ $parsed->extract_links(qw(a)) }) {
    ($link) = @$_;
    if ($link =~ m/url/) {
        print $link . "\n";
    }
}

Now let's understand the objects used in the Perl script above.

For defining the interpreter in any scripting language we use a shebang line; here it is #!/usr/bin/perl at the top of the script.

LWP (short for “Library for WWW in Perl”) is a popular group of Perl modules for accessing data on the Web.
LWP::UserAgent is a class implementing a web user agent; LWP::UserAgent objects can be used to dispatch web requests. Click Here for the full LWP::UserAgent description.

The name of the site goes into $site and the type of file goes into $filetype.

The string in $searchurl is a simple Google search URL, with the values of $site and $filetype plugged in at the appropriate places.
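As a side note, anything you plug into such a search string should be percent-encoded first. Below is a minimal, self-contained sketch of that step; the url_escape helper is hand-rolled for illustration (a real script could use the URI::Escape module's uri_escape() instead), and example.com/pdf stand in for the two command-line arguments:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-rolled percent-encoder for illustration only; a real script
# could use the URI::Escape module's uri_escape() instead.
sub url_escape {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9_.~-])/sprintf("%%%02X", ord($1))/ge;
    return $s;
}

# Plug the two command-line style values into the Google search string.
my $site      = 'example.com';
my $filetype  = 'pdf';
my $searchurl = 'http://www.google.com/search?q=site%3A'
              . url_escape($site)
              . '+filetype%3A'
              . url_escape($filetype);
print "$searchurl\n";   # prints http://www.google.com/search?q=site%3Aexample.com+filetype%3Apdf
```

Unreserved characters (letters, digits, `_ . ~ -`) pass through untouched; everything else becomes a %XX escape.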

Then comes the user-agent part of the script, which tells Google which browser agent identifier we claim to be.

HTTP::Request is a class encapsulating HTTP-style requests, consisting of a request line, some headers, and a content body. The basic idea is that we handle the request and the reply through the $request and $response objects.

Next we use the HTML::Parse module to parse the content of $body into something we can work with.

Then we put together a for loop to go through the parsed lines, looking only for the links and, of those links, only the links of the href variety, discarding images and other links in which we are not interested.
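Google commonly wraps its result links in the form /url?q=<real-url>&…, which is why the loop greps for "url". A small self-contained sketch of pulling the real target back out of such wrapped links (the sample hrefs below are invented for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented sample hrefs, shaped like the anchors extract_links() returns.
my @links = (
    '/url?q=http://example.com/report.pdf&sa=U',
    '/images/branding/logo.png',
    '/url?q=http://example.org/notes.pdf&sa=U',
);

# Keep only the /url?q= wrappers and capture the target URL.
my @urls;
for my $link (@links) {
    next unless $link =~ m{^/url\?q=([^&]+)};
    push @urls, $1;
}
print "$_\n" for @urls;
```

The image link falls through the `next unless`, so only the two wrapped document URLs are printed.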

See below my screenshot of the command-line Perl script output:

(I hope you can understand the white patches on the images 😛 )

Save the above script (the file name is up to you), make it executable, and run it as "./<scriptname> <site> <filetype>", for example "./<scriptname> <site> pdf", and you will get results like those in the image above.

Before going to the next Perl module, we can use a handy command-line module, LWP::Simple. Here is the single command line: perl -MLWP::Simple -e "getprint '<url>'"

Now there is another beauty in Perl known as the WWW::Mechanize module; with it we can do nearly anything that we could do from a web browser with a person operating it.

WWW::Mechanize, or Mech for short, is a Perl module for stateful programmatic web browsing, used for automating interaction with websites.

Features include:

  • All HTTP methods
  • High-level hyperlink and HTML form support, without having to parse HTML yourself
  • SSL support
  • Automatic cookies
  • Custom HTTP headers
  • Automatic handling of redirection
  • Proxies
  • HTTP authentication

Mech supports performing a sequence of page fetches including following links and submitting forms. Each fetched page is parsed and its links and forms are extracted. A link or a form can be selected, form fields can be filled and the next page can be fetched. Mech also stores a history of the URLs you’ve visited, which can be queried and revisited.

—————Perl Script with WWW::Mechanize————-

#!/usr/bin/perl
# Handy web browsing in a Perl object
use WWW::Mechanize;

# Name of the site, filetype, searchurl
$site = $ARGV[0];
$filetype = $ARGV[1];
$searchurl = "http://www.google.com/search?q=site%3A$site+filetype%3A$filetype";

# create mech as Handler
$mech = WWW::Mechanize->new();

# Sets user agent string to the expanded version from a table of actual user strings
$mech->agent_alias('Windows Mozilla');

# Page Fetching Method
$mech->get($searchurl);

@links = $mech->find_all_links(url_regex => qr/\d+.+\.$filetype$/);
for $link (@links) {
    $url = $link->url_abs;
    $filename = $url;
    $filename =~ s[.*/][];
    print "downloading $url\n";
    $mech->get($url, ':content_file' => $filename);
}

Well, we have already tagged comments in the above Perl script. Those who want to learn more can read Coding for Penetration Testers: Building Better Tools.
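One line in the loop above worth calling out is the substitution $filename =~ s[.*/][];, which strips everything up to the last slash so only the bare file name remains for saving. A tiny standalone sketch (the sample URL is invented for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# s[.*/][] greedily matches everything up to the LAST '/',
# so only the trailing file name survives the substitution.
my $url = 'http://example.com/docs/2019/report.pdf';
(my $filename = $url) =~ s[.*/][];
print "$filename\n";   # prints report.pdf
```

Because `.*` is greedy, the match always extends to the last slash, which is exactly what we want for a download file name.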

Here is a video that will give you an idea of Wireshark with a browser query:

Note: the Google search used in the above scripts is not the approved way to talk to Google with automation. If you are not careful and abuse this type of connection, Google will notice and ban your IP address. Google has helpfully documented the proper way for us, and we should really be using that. It is a bit out of scope for what we are doing here, but the documentation will get us there for constructing our queries in the approved manner.
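For the curious, here is a hedged sketch of what the approved route looks like: Google's documented Custom Search JSON API takes an API key, a search-engine id (cx) and a query, and returns JSON. The YOUR_API_KEY and YOUR_CX values below are placeholders you must create yourself in Google's developer console, and the actual network fetch is left commented out:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Tiny;   # core HTTP client since Perl 5.14
use JSON::PP;     # core JSON parser

# Placeholders: create your own API key and search-engine id (cx)
# in Google's developer console before running this for real.
my $key   = 'YOUR_API_KEY';
my $cx    = 'YOUR_CX';
my $query = 'site:example.com filetype:pdf';

# Assemble the documented endpoint with a percent-encoded query.
sub build_url {
    my ($key, $cx, $q) = @_;
    $q =~ s/([^A-Za-z0-9_.~-])/sprintf("%%%02X", ord($1))/ge;
    return "https://www.googleapis.com/customsearch/v1?key=$key&cx=$cx&q=$q";
}

my $url = build_url($key, $cx, $query);
print "$url\n";

# With real credentials, the fetch-and-decode step would look like:
# my $res  = HTTP::Tiny->new->get($url);
# my $data = decode_json($res->{content});
# print "$_->{link}\n" for @{ $data->{items} };
```

This way Google hands you clean structured results instead of HTML to scrape, and your queries stay within the terms of service.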