Overview
When you configure the Google Search Appliance (GSA), you have limited control over how content is crawled and how that content is presented to the GSA for further processing. By introducing an Apache server as a proxy into the deployment environment, you can modify content as it is being crawled. The most common use case is filtering, where you strip out or add content to pages as they are crawled. Modifying traffic as it passes through the proxy is also useful if you want to change how the crawler looks to your content sources. To use Apache as a filtering proxy, you configure Apache as a proxy, configure your GSA to use the proxy server, and create filters.
Configuring Apache as a proxy
To configure Apache as a proxy, add the following lines to your httpd.conf:
LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_http_module modules/mod_proxy_http.so
Listen 8080
<VirtualHost *:8080>
  ProxyRequests On
  <Proxy *>
    Order Deny,Allow
    Deny from all
    Allow from 192.168.0.20
  </Proxy>
  ### Add filters here ###
</VirtualHost>
The first three lines in this configuration load the mod_proxy and mod_proxy_http modules and tell the Apache server to start listening on port 8080.
The next section defines a virtual host on port 8080 and tells it to proxy requests (rather than serve results like a normal web server).
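The Order, Deny, and Allow directives inside the Proxy block are the Apache 2.2 access-control syntax. On Apache 2.4 they are only available through mod_access_compat, and the same restriction is usually written with Require. The following is a sketch of the equivalent virtual host, assuming Apache 2.4 with mod_authz_host loaded and reusing the GSA address 192.168.0.20 from the example:

```apache
<VirtualHost *:8080>
    ProxyRequests On
    <Proxy "*">
        # Apache 2.4 syntax (assumption: 2.4 with mod_authz_host):
        # only the GSA at 192.168.0.20 may use the proxy
        Require ip 192.168.0.20
    </Proxy>
    ### Add filters here ###
</VirtualHost>
```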
Locking down the configuration
If the machine where this configuration is running is a public machine, you definitely want to lock this configuration down further to prevent it from being used inappropriately. In this case, you can use simple IP rules to allow proxy requests only from 192.168.0.20 (the GSA).
Testing the proxy server
Once the server is started, test it by performing the following steps:
- telnet to port 8080
- Enter “GET http://www.google.com/”
Configuring your GSA to use the proxy server
To configure the GSA to use the proxy server:
- In the GSA Admin Console, navigate to Content Sources > Web Crawl > Proxy Servers.
- Enter the URL patterns that should use the proxy, the IP address or fully-qualified domain name, and the port of the proxy server that you have configured.
- Click Save.
Using multiple proxy configurations
If you need multiple proxy configurations for your application, you can run multiple instances of Apache on different ports, or you can define filters within a single Apache configuration to handle content based on URL patterns or other parameters.
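As a rough sketch of the single-configuration approach, filters can be scoped with Proxy sections that match the requested URL, so that each URL pattern gets its own filter. The host names below are hypothetical. The names fixrobots and filtervideo refer to the external filters defined in the Creating filters section, and the sketch assumes these sections are placed inside the proxy VirtualHost block:

```apache
# Strip robots meta tags only for pages fetched from one (hypothetical) host
<Proxy "http://intranet.example.com/*">
    SetOutputFilter fixrobots
</Proxy>

# Run the media filter only for a (hypothetical) video archive
<Proxy "http://media.example.com/videos/*">
    SetOutputFilter filtervideo
</Proxy>
```

Requests that match neither pattern pass through the proxy unfiltered, which keeps the per-source rules in one place instead of spreading them across several Apache instances.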
Creating filters
Apache supports two types of filters:
- Input
- Output
The filters in this document are output filters, which you add to the Proxy Virtual Host block.
SetOutputFilter directive
The SetOutputFilter directive can be used to apply a filter to ALL content passing through the proxy:
# Filter robots meta tags
ExtFilterDefine fixrobots mode=output intype=text/html \
cmd="/bin/sed -r 's/(noarchive|noindex|nofollow)//g'"
SetOutputFilter fixrobots
In this example, we define an external filter named “fixrobots” that passes stdin (the requested document) through sed and strips out the strings “noarchive,” “noindex,” and “nofollow.” This allows the GSA to ignore embedded robots meta tags. “sed -r” is quick and easy for regular expression patterns and simple string manipulation, but it is just as easy to use a Perl, PHP, or shell script. Apache passes the document to the filter on stdin and passes the output of the filter back to the GSA.
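ExtFilterDefine is provided by mod_ext_filter, so that module has to be loaded before Apache will accept the filter definition. Assuming the same modules/ directory layout as the mod_proxy lines earlier in this document, the additional line would look something like this:

```apache
# Needed for ExtFilterDefine (external filters such as fixrobots)
LoadModule ext_filter_module modules/mod_ext_filter.so
```

If the module is not loaded, Apache typically refuses to start and reports ExtFilterDefine as an invalid command.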
AddOutputFilterByType directive
The AddOutputFilterByType directive gives you a little more control by enabling you to apply a filter based on MIME type. This is useful if you want to crawl content that the GSA doesn’t natively support, such as images, video, and so on.
# Filter video files
ExtFilterDefine filtervideo mode=output outtype=text/html \
cmd="/home/ericl/mediaFilter.php"
AddOutputFilterByType filtervideo video/x-msvideo video/mp4 video/ audio/mpeg audio/ video/quicktime
In this example, we create an external filter named “filtervideo” that calls an external script, mediaFilter.php. In this case it is a script that would accept binary video files as input and output HTML (embedded metadata) and thumbnails. Because we only want this to happen for specific content types, we use the AddOutputFilterByType directive to specify several multimedia formats. A PHP filter script reads the document from stdin and writes the result to stdout; this skeleton passes the stream through unchanged:
#!/usr/bin/env php
<?php
// PHP filter script: read the whole document from stdin
$stream = file_get_contents('php://stdin');
// If the HTML stream is gzip content, decode it before editing and re-encode it afterwards:
// $content = gzdecode($stream);
// echo gzencode($content);
echo $stream;
Another thing you can do is modify the HTTP headers. A simple example is to replace the GSA’s User-Agent string with a different one:
# Set User Agent of the Proxy
RequestHeader set User-Agent "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR 2.0.50727;)"
This doesn’t modify the headers from the GSA, because those don’t get passed through by the proxy. It just sets the header that Apache uses when it fetches a page. This can be useful if you need to set a specific cookie, User-Agent, or other header to crawl your content.
After the proxy and filters are configured, you can test them by sending your own GET requests, or by:
- Configuring your browser to use the proxy.
- Requesting some URLs.
- Viewing the source.
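The RequestHeader directive used for the User-Agent example is provided by mod_headers, and the same directive can set other request headers, such as a cookie. The sketch below assumes the modules/ layout used earlier in this document. The cookie name and value are placeholders, not values that the GSA or any particular content source requires:

```apache
# RequestHeader comes from mod_headers
LoadModule headers_module modules/mod_headers.so

# Inside the proxy VirtualHost block: send a fixed (placeholder) cookie
# with every request that Apache makes on behalf of the crawler
RequestHeader set Cookie "crawler_session=EXAMPLE_VALUE"
```

Because the header is set on the proxy, every page fetched through that virtual host carries it, so it can be checked with the same browser test described above.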