Using Apache as a Filtering Proxy


Overview


When you configure the Google Search Appliance (GSA), you have limited control over how content is crawled or how that content is presented to the GSA for further processing. By introducing an Apache server as a proxy into the deployment environment, however, you gain the ability to modify content as it is being crawled. The most common use case is filtering, when you need to strip content from (or add content to) pages as they are crawled. Modifying content in transit is also useful when you want to change how the crawler appears to your content sources. You can use Apache as a filtering proxy by:
  1. Configuring Apache as a proxy
  2. Configuring your GSA to use the proxy server

Configuring Apache as a proxy


To configure Apache as a proxy, you need to add the following lines to your httpd.conf:

LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_http_module modules/mod_proxy_http.so
# mod_ext_filter and mod_headers are used by the filter examples later in this document
LoadModule ext_filter_module modules/mod_ext_filter.so
LoadModule headers_module modules/mod_headers.so
Listen 8080
<VirtualHost *:8080>
ProxyRequests On
<Proxy *>
  Order Deny,Allow
  Deny from all
  Allow from 192.168.0.20
</Proxy>
### Add filters here ###
</VirtualHost>
  
The LoadModule lines load the modules the proxy needs, and the Listen directive tells Apache to accept connections on port 8080. The next section defines a virtual host on port 8080 and tells it to proxy requests (rather than serve content like a normal web server).

Locking down the configuration

If the machine running this configuration is publicly reachable, you should lock the configuration down further so that it cannot be abused as an open proxy. Here, simple IP rules allow proxy requests only from 192.168.0.20 (the GSA).

Testing the proxy server

Once the server is started, test it by performing the following steps:
  1. telnet to port 8080
  2. Enter “GET http://www.google.com/”
If the server is working, it returns the source of Google’s home page; if you’re not connecting from an allowed IP, it returns a 403 (Forbidden) error.
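These steps can also be scripted. The sketch below prints the raw request line that the telnet session sends (a proxy-form request carries the absolute URL rather than just a path) and shows, as a comment, an equivalent check with curl; the host proxy.example.com:8080 is a placeholder for your own proxy machine.

```shell
# A proxy-form request line carries the absolute URL, unlike the
# path-only form ("GET / HTTP/1.0") sent to an origin server.
printf 'GET http://www.google.com/ HTTP/1.0\r\n\r\n'

# Equivalent check with curl (proxy.example.com:8080 is a placeholder
# for your proxy host and port):
#   curl -x http://proxy.example.com:8080 http://www.google.com/
# A 403 (Forbidden) response means your client IP is not in the Allow list.
```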

Configuring your GSA to use the proxy server


To configure the GSA to use the proxy server:
  1. In the GSA Admin Console, navigate to Content Sources > Web Crawl > Proxy Servers.
  2. Enter the URL patterns that should use the proxy, the IP address or fully-qualified domain name, and the port of the proxy server that you have configured.
  3. Click Save.
In some cases, you might want to use the proxy for all crawling, in which case you can simply enter “/” as the URL pattern. In other cases, such as crawling video or images, you might want to use the proxy only for that specific content, so enter a more specific URL pattern as appropriate.

Using multiple proxy configurations


If you need multiple proxy configurations for your application, you can run multiple instances of Apache on different ports, or you can define filters within a single Apache configuration to handle content based on URL patterns or other parameters.
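As a sketch of how a second proxy configuration can live alongside the first, the fragment below keeps a single Apache instance but adds another proxy virtual host on its own port (8081 is purely illustrative), with its own filter block; each listener then gets its own entry on the GSA’s Proxy Servers page.

```
# A second proxy listener with its own filters (port 8081 is illustrative)
Listen 8081
<VirtualHost *:8081>
ProxyRequests On
<Proxy *>
  Order Deny,Allow
  Deny from all
  Allow from 192.168.0.20
</Proxy>
### Filters specific to this listener ###
</VirtualHost>
```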

Creating filters


Apache supports two types of filters:
  • Input
  • Output
For proxies, think of the input as the request that the GSA sends to the destination web server, and the output as what the web server sends back to the GSA. For most applications, then, you want to create an output filter. Apache provides several directives for creating output filters; the filters themselves are simply defined as part of the proxy <VirtualHost> block.

SetOutputFilter directive

The SetOutputFilter directive can be used to apply a filter to ALL content passing through the proxy:

# Filter robots meta tags
ExtFilterDefine fixrobots mode=output intype=text/html \
cmd="/bin/sed -r 's/(noarchive|noindex|nofollow)>//g'"
SetOutputFilter fixrobots
In this example, we define an external filter named “fixrobots” that passes stdin (the requested document) through sed, stripping out the strings “noarchive,” “noindex,” and “nofollow” where they immediately precede a closing bracket. This effectively allows the GSA to ignore embedded robots meta tags. “sed -r” is quick and easy for regular-expression patterns and simple string manipulation, but it’s just as easy to use a Perl, PHP, or shell script: Apache passes the file as stdin and returns the filter’s output to the GSA.
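Because mod_ext_filter simply pipes the document through the command on stdin and stdout, you can try a filter from the shell before wiring it into Apache. The sketch below runs the same sed expression by hand; note that, as written, the pattern only matches a directive that immediately precedes the closing bracket (an unquoted meta value), which the sample input is chosen to illustrate.

```shell
# Run the fixrobots command by hand: the document goes in on stdin,
# the filtered document comes out on stdout (exactly how Apache
# invokes an external filter).
printf '<meta name=robots content=noindex>\n' \
  | sed -r 's/(noarchive|noindex|nofollow)>//g'
# The directive and its closing bracket are stripped, so the GSA
# no longer sees a complete noindex tag:
#   <meta name=robots content=
```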

AddOutputFilterByType directive

The AddOutputFilterByType directive gives you a little more control by enabling you to apply a filter based on MIME type. This is useful if you want to crawl content that the GSA doesn’t natively support, such as images, video, and so on.

# Filter video files
ExtFilterDefine filtervideo mode=output outtype=text/html \
cmd="/home/ericl/mediaFilter.php"
AddOutputFilterByType filtervideo video/x-msvideo video/mp4 \
audio/mpeg video/quicktime

# php script code (mediaFilter.php)
#!/usr/bin/php
<?php
// Read the proxied document from stdin
$stream = file_get_contents('php://stdin');
// If the HTML stream arrives gzip-compressed, decode and re-encode it:
// $content = gzdecode($stream);
// echo gzencode($content);
// Otherwise, simply pass the content through unchanged
echo $stream;
In this example, we create an external filter named “filtervideo” that calls an external script, mediaFilter.php. Here, that script would accept binary video files as input and output HTML (embedded metadata) and thumbnails. Because we only want this to happen for specific content types, we use the AddOutputFilterByType directive to list the relevant multimedia MIME types.

You can also modify the HTTP request headers. A simple example is to replace the GSA’s User-Agent string with a different one:

# Set User Agent of the Proxy
RequestHeader set User-Agent "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR 2.0.50727;)"
This doesn’t modify the headers coming from the GSA, which aren’t passed through by the proxy; it sets the header that Apache sends when it fetches a page on the GSA’s behalf. This can be useful if you need to send a specific cookie, User-Agent, or other header to crawl your content.

After the proxy and filters are configured, you can test them by sending your own GET requests, or by:
  1. Configuring your browser to use the proxy.
  2. Requesting some URLs.
  3. Viewing the source.
When everything looks good, simply add the appropriate proxy patterns to the GSA and start crawling. Your cached documents should show the filtered output.