How to detect bot requests and mitigate them.
Problem - A Flood of Searches
A client sees 1,000,000 queries in their usage panel, but they only have 100 users. There’s a good chance that something is wrong.
They initially suspect spam or malicious users, then consider bugs in their own implementation: is a runaway loop triggering millions of search requests? Are automatic refreshes sending too many empty queries? They contact support to help them debug.
Investigation is needed. They start with the Dashboard’s Monitoring page.
With this view, they confirm that there is indeed a spike. But it doesn’t tell them why.
Cause - Google Bots
While there are many reasons for an increase in search operations, the one that concerns us here is caused by Google bots. When Google crawls your website, it has the potential to trigger events that execute queries. For some sites, this might not be a problem, causing only a few unintended searches. But for other sites, it can cause a flood of query requests.
One common reason for this is refinements or facets. Many clients create separate URLs for all of their facet values. Thus, if Google crawls a website, it will trigger a separate URL for every refinement. See the real-life example below for more on this use case.
Investigation
We suspect bots are inflating our search volume. How can we investigate this further?
Get a list of IPs
To validate that a bot is triggering all the unexpected operations, go to your Algolia dashboard, in the “Indices > Search API Logs” section. From there, you can dig into every search request and get the associated IP, or see whether one IP is always making the same request. You may notice that most requests come from a search bot like Googlebot (the bot Google uses to crawl the web and build its search engine).
You can do the same with the get-logs API method.
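As a sketch, here's how you might pull recent log entries from the REST logs endpoint in TypeScript (Node 18+ for the built-in fetch) and count requests per IP. The credentials are placeholders, and the logs endpoint requires a key with logs permissions, such as the admin key:

const APP_ID = 'YourApplicationID'; // placeholder
const LOGS_API_KEY = 'YourAdminAPIKey'; // placeholder; needs the logs ACL

async function topSearchIps(): Promise<void> {
  // Fetch up to 1,000 of the most recent log entries.
  const response = await fetch(`https://${APP_ID}.algolia.net/1/logs?offset=0&length=1000`, {
    headers: {
      'X-Algolia-Application-Id': APP_ID,
      'X-Algolia-API-Key': LOGS_API_KEY,
    },
  });
  const { logs } = (await response.json()) as { logs: Array<{ ip: string }> };

  // Count requests per IP to spot a single noisy client.
  const byIp = new Map<string, number>();
  for (const entry of logs) {
    byIp.set(entry.ip, (byIp.get(entry.ip) ?? 0) + 1);
  }
  console.log([...byIp.entries()].sort((a, b) => b[1] - a[1]).slice(0, 10));
}

topSearchIps();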
Identify the source of the IPs
Check the IPs using a service such as GeoIP. If they trace back to Google, that confirms Googlebot is behind the increase. Here’s a tool to help you trace IP addresses back to their source.
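If you run a Node.js back end, you can also verify Googlebot the way Google itself recommends: a reverse DNS lookup followed by a forward confirmation. Here's a sketch; the sample IP address is only an illustration:

import { lookup, reverse } from 'node:dns/promises';

// Returns true if the IP reverse-resolves to a googlebot.com / google.com host
// and that hostname resolves back to the same IP (Google's documented check).
async function isGooglebot(ip: string): Promise<boolean> {
  try {
    const [hostname] = await reverse(ip);
    if (!hostname || (!hostname.endsWith('.googlebot.com') && !hostname.endsWith('.google.com'))) {
      return false;
    }
    const { address } = await lookup(hostname);
    return address === ip;
  } catch {
    return false; // no PTR record, or the lookup failed
  }
}

isGooglebot('66.249.66.1').then((result) => console.log(result));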
Solution - Exclude URLs from Google Bots
Initial Approach
- Your web host or infrastructure provider is a good point of contact about mitigating bot searches on your website.
- On a CMS such as Magento or WordPress, each platform has marketplace offerings focused on bot management.
Inform Google
- Tell Googlebot not to crawl your search pages at all with a well-configured robots.txt. You can refer to this guide by Google. One strategy is to let Google crawl the main search page but disallow all other search pages.
A Google Trick
- Implement the latest Google reCAPTCHA, which is a very effective way to protect yourself against bots.
Cloudflare
- Cloudflare has good measures against bot abuse as well.
Putting it all together - A real-life example
You see 1M queries but you only have 10 users!
- You use Cloudflare to identify the source of traffic: it was clearly Googlebot.
- On your page, every sort and refinement is a crawlable link, so with 30 different filters the possible combinations add up to billions of crawlable URLs. Google indexed the site for three days straight, resulting in 150k+ query operations.
- You fixed it by adding the appropriate rules to your robots.txt file (the Robots Exclusion Protocol). The general approach was to disallow each refinement and sort.
- For example, you went to your Algolia search page and filtered results by clicking on refinements and sorts. This produced a URL like https://mywebsite.com/?age_group_map=7076&color_map=5924&manufacturer=2838&price=150-234&size=3055&product_list_order=new.
The resulting robots.txt file could be:
User-agent: *
# The * means this section applies to all robots
# You need a separate Disallow line for every URL prefix you want to exclude
Disallow: /?*age_group_map        # don't crawl URLs with the age_group_map query parameter
Disallow: /?*color_map            # don't crawl URLs with the color_map query parameter
Disallow: /?*manufacturer         # don't crawl URLs with the manufacturer query parameter
Disallow: /?*price                # don't crawl URLs with the price query parameter
Disallow: /?*product_list_order   # don't crawl URLs with this sort query parameter
For InstantSearch, you can do the same with this:
Disallow: /?*refinementList
Disallow: /?*sortBy
Other Solutions
Rate Limiting using Algolia API Keys
In Algolia, you can generate a new Search API Key with a limit on the number of queries allowed per IP address per hour. Because bot IP addresses vary greatly, you can't rely on this alone to block every bot without also risking limiting real users. If you would like to retrieve the most active visitor IP addresses, you can find them in your logs.
The exact limit is up to you. We generally recommend starting with a higher number (to avoid limiting real users) and reducing it gradually based on observed usage and bot traffic. See rate limiting for more information.
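As a sketch, here's how you might create such a key with the v4 JavaScript API client; the credentials, description, and limit value are placeholders, and this must run on a trusted back end with your admin API key:

import algoliasearch from 'algoliasearch';

const client = algoliasearch('YourApplicationID', 'YourAdminAPIKey'); // placeholders

async function createRateLimitedSearchKey(): Promise<void> {
  // A search-only key where each IP address can run at most 500 queries per hour.
  const { key } = await client.addApiKey(['search'], {
    description: 'Public search key with a per-IP rate limit',
    maxQueriesPerIPPerHour: 500,
  });
  console.log('New rate-limited search key:', key);
}

createRateLimitedSearchKey();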
Alternative cause - bots other than Google’s crawling the site
After confirming that the unusually high API traffic isn’t from Google bots, you review the logs in more detail and identify one or more IP addresses making a high number of search requests.
Often, this is due to web scraping or a denial-of-service attack. One of the best ways to stop scraping is to block the bot’s IP address from accessing your site.
Third-party tools, such as Cloudflare, AWS Shield, or Akamai, can often detect and block high volumes of requests.
These bots often change their IP address on a regular basis and adapt to the protections you add to your site, so it can be difficult to keep up with these changes.
Solution - restricting access
In addition to rate-limiting your API keys, you can also use these methods to prevent a bot from scraping your site constantly.
- Using secured API keys
- Proxying the traffic through a content delivery network (CDN)
Restrict access with secured API keys
You can prevent bots from replaying the API requests on your site by changing your search API key on a regular basis. This is time-consuming: you need to generate a new search API key each time and then update your front end with it.
Also, bots can often detect when their requests fail and check the site again to get an updated request URL.
To address this, you can use secured API keys. These are virtual API keys that you can use to grant temporary access or to give users access to a subset of data, which makes them well suited to automatically generating short-lived keys.
You generate secured API keys through Algolia’s API, using a back-end service that receives a request for an API key from your front end. The back-end service generates the secured API key and returns it to the front end, and the front-end application uses this secured API key instead of your standard search API key.
You can set the expiration time of the secured API key with the validUntil property. Choose a short time frame, for example a few hours or a day, to prevent bots from scraping your site.
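Here's a back-end sketch (v4 JavaScript client syntax, with Express for the endpoint); the credentials, the one-hour lifetime, and the /api/search-key route are placeholders for illustration:

import algoliasearch from 'algoliasearch';
import express from 'express';

const client = algoliasearch('YourApplicationID', 'YourSearchOnlyAPIKey'); // placeholders
const app = express();

app.get('/api/search-key', (_req, res) => {
  // Unix timestamp (in seconds) after which the key stops working: here, one hour from now.
  const validUntil = Math.floor(Date.now() / 1000) + 60 * 60;

  // Derive a short-lived secured key from the search-only parent key.
  const key = client.generateSecuredApiKey('YourSearchOnlyAPIKey', { validUntil });

  res.json({ key, validUntil });
});

app.listen(3000);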
To check if the API key is still valid, you can check its remaining time. If the key has expired or is close to expiring, make a new request to generate a new one.
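On the front end, one simple approach is to keep the expiry timestamp alongside the key and refresh it shortly before it runs out. This sketch assumes the hypothetical /api/search-key endpoint above returns { key, validUntil }:

let cached: { key: string; validUntil: number } | null = null;

async function getSearchKey(): Promise<string> {
  const now = Math.floor(Date.now() / 1000);
  // Refresh when no key is cached yet, or when it's within five minutes of expiring.
  if (!cached || cached.validUntil - now < 5 * 60) {
    const response = await fetch('/api/search-key');
    cached = (await response.json()) as { key: string; validUntil: number };
  }
  return cached.key;
}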
In your back end, you need to add an API endpoint for generating this secured API key. This endpoint should also be able to detect bot traffic, to prevent a bot from simply requesting fresh keys.
By implementing this mechanism, your front end uses automatically generated API keys that change frequently, without any additional developer involvement. While a bot may scrape the search request initially, once the API key expires, the replayed request no longer returns any data.
Proxying Algolia API requests through a CDN
Another option is to proxy the Algolia API requests through a content delivery network (CDN), such as Cloudflare, AWS, or Akamai. These services can detect higher-than-average request volumes from a single IP address and block that address.
This method is essentially a back-end implementation of Algolia, so it can add latency to API requests. How much depends on the proxy service you use and the quality of its network connections.
To use a proxy, set the hosts parameter when initializing the Algolia client to the domain you want to proxy the API requests through. With this setup, account details such as the application ID and API key can be set to placeholder strings in the client (the client requires them, but they aren't used for the actual Algolia API request). The proxy then supplies the correct application ID and API key when making the Algolia API request, so this information is hidden from end users.
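For example, with the v4 JavaScript client the initialization might look like this; the proxy domain is hypothetical, and the credentials are deliberately meaningless because the proxy injects the real ones:

import algoliasearch from 'algoliasearch';

// The app ID and API key here are dummies: the client requires them, but the
// proxy replaces them with the real credentials before calling Algolia.
const client = algoliasearch('placeholder-app-id', 'placeholder-api-key', {
  hosts: [{ url: 'search-proxy.mywebsite.com' }],
});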
For instance, if you were using Cloudflare, you could create a Cloudflare Worker that takes the POST request details and passes them on to the Algolia API, then returns the API response back to the client.
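Here's a minimal Cloudflare Worker sketch of that idea (module syntax); the environment variable names are assumptions, and a production version would also need to handle CORS and restrict which paths it forwards:

export interface Env {
  ALGOLIA_APP_ID: string;
  ALGOLIA_SEARCH_API_KEY: string;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Forward the incoming search request to Algolia, adding the real credentials.
    const url = new URL(request.url);
    const target = `https://${env.ALGOLIA_APP_ID}-dsn.algolia.net${url.pathname}${url.search}`;

    const upstream = await fetch(target, {
      method: request.method,
      headers: {
        'Content-Type': 'application/json',
        'X-Algolia-Application-Id': env.ALGOLIA_APP_ID,
        'X-Algolia-API-Key': env.ALGOLIA_SEARCH_API_KEY,
      },
      body: request.method === 'POST' ? await request.text() : undefined,
    });

    // Return Algolia's response to the browser.
    return new Response(upstream.body, {
      status: upstream.status,
      headers: { 'Content-Type': 'application/json' },
    });
  },
};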
Building your own back-end proxy
Algolia’s InstantSearch UI library uses a search client to query the API directly from the browser. You can implement your own search client that queries your own back-end service. This can then query Algolia’s APIs from your server.
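A sketch of such a client, assuming a hypothetical /api/search endpoint on your back end that forwards the multi-query requests to Algolia and returns the raw response:

// InstantSearch only needs an object with a search method that resolves to
// Algolia-shaped results, so it can point at your own back end instead.
const searchClient = {
  search(requests: unknown[]) {
    return fetch('/api/search', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ requests }),
    }).then((response) => response.json());
  },
};

// Then pass it to InstantSearch in place of the standard Algolia client:
// instantsearch({ indexName: 'products', searchClient });

On the server, the endpoint would run the forwarded requests against Algolia and return the response unchanged, which also gives you a single place to add rate limiting or bot detection.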