How to detect bot requests and mitigate them.
Problem - A Flood of Searches
A client sees 1,000,000 queries in their usage panel, but they only have 100 users. There’s a good chance that something is wrong.
They initially suspect spam or malicious users, then consider bugs in their own implementation: is a runaway loop triggering millions of search requests? Are automatic refreshes sending too many empty queries? They contact support to help them debug.
Investigation is needed. They start with the Dashboard’s Monitoring page.
With this view, they confirm that there is indeed a spike. But it doesn’t tell them why.
Cause - Google Bots
While there are many reasons for an increase in search operations, the one that concerns us here is caused by Google bots. When Google crawls your website, it has the potential to trigger events that execute queries. For some sites, this might not be a problem, causing only a few unintended searches. But for other sites, it can cause a flood of query requests.
One common reason for this is refinements or facets. Many clients create separate URLs for all of their facet values. Thus, if Google crawls a website, it will trigger a separate URL for every refinement. See the real-life example below for more on this use case.
Investigation
We suspect bots are inflating our search volume. How can we investigate this further?
Get a list of IPs
To validate that a bot is triggering all the unexpected operations, go to your Algolia dashboard, in the “Indices > Search API Logs” section. From there, you can dig into every search request and get the associated IP, or see whether one IP is always making the same request. You may notice that most requests come from a search bot like Googlebot (the bot Google uses to crawl the web and build its search engine).
You can do the same with the get-logs API method.
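As a sketch, here's how you might pull recent log entries from the REST logs endpoint in TypeScript (Node 18+ for the built-in fetch) and count requests per IP. The credentials are placeholders, and the logs endpoint requires a key with logs permissions, such as the admin key:

const APP_ID = 'YourApplicationID'; // placeholder
const LOGS_API_KEY = 'YourAdminAPIKey'; // placeholder; needs the logs ACL

async function topSearchIps(): Promise<void> {
  // Fetch up to 1,000 of the most recent log entries.
  const response = await fetch(`https://${APP_ID}.algolia.net/1/logs?offset=0&length=1000`, {
    headers: {
      'X-Algolia-Application-Id': APP_ID,
      'X-Algolia-API-Key': LOGS_API_KEY,
    },
  });
  const { logs } = (await response.json()) as { logs: Array<{ ip: string }> };

  // Count requests per IP to spot a single noisy client.
  const byIp = new Map<string, number>();
  for (const entry of logs) {
    byIp.set(entry.ip, (byIp.get(entry.ip) ?? 0) + 1);
  }
  console.log([...byIp.entries()].sort((a, b) => b[1] - a[1]).slice(0, 10));
}

topSearchIps();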
Identify the source of the IPs
Check the IPs using a service such as GeoIP. If they trace back to Google, that confirms Googlebot is behind the increase. Here’s a tool to help you trace IP addresses back to their source.
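If you run a Node.js back end, you can also verify Googlebot the way Google itself recommends: a reverse DNS lookup followed by a forward confirmation. Here's a sketch; the sample IP address is only an illustration:

import { lookup, reverse } from 'node:dns/promises';

// Returns true if the IP reverse-resolves to a googlebot.com / google.com host
// and that hostname resolves back to the same IP (Google's documented check).
async function isGooglebot(ip: string): Promise<boolean> {
  try {
    const [hostname] = await reverse(ip);
    if (!hostname || (!hostname.endsWith('.googlebot.com') && !hostname.endsWith('.google.com'))) {
      return false;
    }
    const { address } = await lookup(hostname);
    return address === ip;
  } catch {
    return false; // no PTR record, or the lookup failed
  }
}

isGooglebot('66.249.66.1').then((result) => console.log(result));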
Solution - Exclude URLs from Google Bots
Initial Approach
- Your web host or infrastructure provider is a good point of contact about mitigating bot searches on your website.
- On a CMS such as Magento or WordPress, each platform has marketplace offerings focused on bot management.
Inform Google
- Tell Googlebot not to crawl your search pages at all with a well-configured robots.txt. You can refer to this guide by Google. One strategy is to let Google crawl the main search page but disallow all other search pages.
A Google Trick
- Implement the latest Google reCAPTCHA, which is a very effective way to protect yourself against bots.
Cloudflare
- Cloudflare has good measures against bot abuse as well.
Putting it all together - A real-life example
You see 1M queries but you only have 10 users!
- You use Cloudflare to identify the source of traffic: it was clearly Googlebot.
- On your page, every sort and refinement is a crawlable link, so with 30 different filters the possible combinations add up to billions of crawlable URLs. Google indexed the site for three days straight, resulting in 150k+ query operations.
- You fixed it by adding the appropriate rules to your robots.txt file (the Robots Exclusion Protocol). The general approach was to disallow each refinement and sort.
- For example, you went to your Algolia search page and filtered results by clicking on refinements and sorts. This produced a URL like https://mywebsite.com/?age_group_map=7076&color_map=5924&manufacturer=2838&price=150-234&size=3055&product_list_order=new.
The resulting robots.txt file could be:
User-agent: *
# The * means this section applies to all robots
# You need a separate Disallow line for every URL prefix you want to exclude
Disallow: /?*age_group_map        # don't crawl URLs with the age_group_map query parameter
Disallow: /?*color_map            # don't crawl URLs with the color_map query parameter
Disallow: /?*manufacturer         # don't crawl URLs with the manufacturer query parameter
Disallow: /?*price                # don't crawl URLs with the price query parameter
Disallow: /?*product_list_order   # don't crawl URLs with this sort query parameter
For InstantSearch, you can do the same with this:
Disallow: /?*refinementList
Disallow: /?*sortBy
Other Solutions
Rate Limiting using Algolia API Keys
In Algolia, you can generate a new Search API Key with a limit on the number of queries allowed per IP address per hour. Because bot IP addresses vary greatly, you can't rely on this alone to block every bot without also risking limiting real users. If you would like to retrieve the most active visitor IP addresses, you can find them in your logs.
The exact limit is up to you. We generally recommend starting with a higher number (to avoid limiting real users) and reducing it gradually based on observed usage and bot traffic. See rate limiting for more information.
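As a sketch, here's how you might create such a key with the v4 JavaScript API client; the credentials, description, and limit value are placeholders, and this must run on a trusted back end with your admin API key:

import algoliasearch from 'algoliasearch';

const client = algoliasearch('YourApplicationID', 'YourAdminAPIKey'); // placeholders

async function createRateLimitedSearchKey(): Promise<void> {
  // A search-only key where each IP address can run at most 500 queries per hour.
  const { key } = await client.addApiKey(['search'], {
    description: 'Public search key with a per-IP rate limit',
    maxQueriesPerIPPerHour: 500,
  });
  console.log('New rate-limited search key:', key);
}

createRateLimitedSearchKey();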
Alternative cause - bots other than Google’s crawling the site
After confirming that the unusually high API traffic isn’t from Google bots, you review the logs in more detail and identify one or more IP addresses making a high number of search requests.
Often, this is due to web scraping or a denial-of-service attack. One of the best ways to stop scraping is to block the bot’s IP address from accessing your site.
Third-party tools, such as Cloudflare, AWS Shield, or Akamai, can often detect and block high volumes of requests.
These bots often change their IP address on a regular basis and adapt to the protections you add to your site, so it can be difficult to keep up with these changes.
Solution - restricting access
In addition to rate-limiting your API keys, you can also use these methods to prevent a bot from scraping your site constantly.
- Using secured API keys
- Proxying the traffic through a content delivery network (CDN)
Restrict access with secured API keys
You can prevent bots from replaying the API requests on your site by changing your search API key on a regular basis. This is time-consuming: you need to generate a new search API key each time and then update your front end with it.
Also, bots can often detect when their requests fail and check the site again to get an updated request URL.
To address this, you can use secured API keys. These are virtual API keys that you can use to grant temporary access or to give users access to a subset of data, which makes them well suited to automatically generating short-lived keys.
You generate secured API keys through Algolia’s API, using a back-end service that receives a request for an API key from your front end. The back-end service generates the secured API key and returns it to the front end, and the front-end application uses this secured API key instead of your standard search API key.
You can set the expiration time of the secured API key with the validUntil property. Choose a short time frame, for example a few hours or a day, to prevent bots from scraping your site.
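Here's a back-end sketch (v4 JavaScript client syntax, with Express for the endpoint); the credentials, the one-hour lifetime, and the /api/search-key route are placeholders for illustration:

import algoliasearch from 'algoliasearch';
import express from 'express';

const client = algoliasearch('YourApplicationID', 'YourSearchOnlyAPIKey'); // placeholders
const app = express();

app.get('/api/search-key', (_req, res) => {
  // Unix timestamp (in seconds) after which the key stops working: here, one hour from now.
  const validUntil = Math.floor(Date.now() / 1000) + 60 * 60;

  // Derive a short-lived secured key from the search-only parent key.
  const key = client.generateSecuredApiKey('YourSearchOnlyAPIKey', { validUntil });

  res.json({ key, validUntil });
});

app.listen(3000);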
To check if the API key is still valid, you can check its remaining time. If the key has expired or is close to expiring, make a new request to generate a new one.
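On the front end, one simple approach is to keep the expiry timestamp alongside the key and refresh it shortly before it runs out. This sketch assumes the hypothetical /api/search-key endpoint above returns { key, validUntil }:

let cached: { key: string; validUntil: number } | null = null;

async function getSearchKey(): Promise<string> {
  const now = Math.floor(Date.now() / 1000);
  // Refresh when no key is cached yet, or when it's within five minutes of expiring.
  if (!cached || cached.validUntil - now < 5 * 60) {
    const response = await fetch('/api/search-key');
    cached = (await response.json()) as { key: string; validUntil: number };
  }
  return cached.key;
}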
In your back end, you need to add an API endpoint for generating this secured API key. This endpoint should also be able to detect bot traffic, to prevent a bot from simply requesting fresh keys.
By implementing this mechanism, your front end uses automatically generated API keys that change frequently, without any additional developer involvement. While a bot may scrape the search request initially, once the API key expires, the replayed request no longer returns any data.
Proxying Algolia API requests through a CDN
Another option is to proxy the Algolia API requests through a content delivery network (CDN), such as Cloudflare, AWS, or Akamai. These services can detect higher-than-average request volumes from a single IP address and block that address.
This method is essentially a back-end implementation of Algolia, so it can add latency to API requests. How much depends on the proxy service you use and the quality of its network connections.
To use a proxy, set the hosts parameter when initializing the Algolia client to the domain you want to proxy the API requests through. With this setup, account details such as the application ID and API key can be set to placeholder strings in the client (the client requires them, but they aren't used for the actual Algolia API request). The proxy then supplies the correct application ID and API key when making the Algolia API request, so this information is hidden from end users.
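For example, with the v4 JavaScript client the initialization might look like this; the proxy domain is hypothetical, and the credentials are deliberately meaningless because the proxy injects the real ones:

import algoliasearch from 'algoliasearch';

// The app ID and API key here are dummies: the client requires them, but the
// proxy replaces them with the real credentials before calling Algolia.
const client = algoliasearch('placeholder-app-id', 'placeholder-api-key', {
  hosts: [{ url: 'search-proxy.mywebsite.com' }],
});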
For instance, if you were using Cloudflare, you could create a Cloudflare Worker that takes the POST request details and passes them on to the Algolia API, then returns the API response back to the client.
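Here's a minimal Cloudflare Worker sketch of that idea (module syntax); the environment variable names are assumptions, and a production version would also need to handle CORS and restrict which paths it forwards:

export interface Env {
  ALGOLIA_APP_ID: string;
  ALGOLIA_SEARCH_API_KEY: string;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Forward the incoming search request to Algolia, adding the real credentials.
    const url = new URL(request.url);
    const target = `https://${env.ALGOLIA_APP_ID}-dsn.algolia.net${url.pathname}${url.search}`;

    const upstream = await fetch(target, {
      method: request.method,
      headers: {
        'Content-Type': 'application/json',
        'X-Algolia-Application-Id': env.ALGOLIA_APP_ID,
        'X-Algolia-API-Key': env.ALGOLIA_SEARCH_API_KEY,
      },
      body: request.method === 'POST' ? await request.text() : undefined,
    });

    // Return Algolia's response to the browser.
    return new Response(upstream.body, {
      status: upstream.status,
      headers: { 'Content-Type': 'application/json' },
    });
  },
};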
Building your own back-end proxy
Algolia’s InstantSearch UI library uses a search client to query the API directly from the browser. You can implement your own search client that queries your own back-end service. This can then query Algolia’s APIs from your server.
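A sketch of such a client, assuming a hypothetical /api/search endpoint on your back end that forwards the multi-query requests to Algolia and returns the raw response:

// InstantSearch only needs an object with a search method that resolves to
// Algolia-shaped results, so it can point at your own back end instead.
const searchClient = {
  search(requests: unknown[]) {
    return fetch('/api/search', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ requests }),
    }).then((response) => response.json());
  },
};

// Then pass it to InstantSearch in place of the standard Algolia client:
// instantsearch({ indexName: 'products', searchClient });

On the server, the endpoint would run the forwarded requests against Algolia and return the response unchanged, which also gives you a single place to add rate limiting or bot detection.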