05 Jun 2014
Since the first SaaS IPO by salesforce.com, the
SaaS (Software as a Service) model has boomed over the last decade into a
global market worth billions today. It has come a long way, and it took a lot
of evangelism to get there.
Before salesforce.com and the other SaaS
pioneers succeeded at making SaaS a standard model, IT departments were
clear: the infrastructure, as well as the whole stack, had to stay behind their
walls. Since then, mindsets have shifted with the cloud revolution, and you
can now find software such as Box, Jive or Workday used by many
Fortune 500 companies and millions of SMBs and startups.
Everything is now going SaaS, even core product components such as internal
search. This new generation of SaaS products is facing the same misperceptions
their peers faced years ago. So today, we wanted to dig into the
misperceptions about search as a service in general.
Hosting your search is way more complex and expensive than you may think
Some people prefer to go on-premises because they only pay for the raw resources,
especially if they choose to run open source software on them. By doing this,
they believe they can skip the margin layer built into the price of SaaS
solutions. The problem is that this view greatly underestimates the Total Cost
of Ownership (TCO) of the final solution.
Here are some reasons why hosting your own search engine can get extremely
complex & expensive:
Hardware selection
A search engine has the particularity of being very IO (indexing), RAM
(search) and CPU (indexing + search) intensive. If you want to host it
yourself, you need to make sure your hardware is well sized for the kind of
search you will be handling. We often see companies hosting their search
engine on under-sized EC2 instances that are simply unable to add more
resource-consuming features (faceting, spellchecking, auto-completion).
Selecting the right instance is more difficult than it seems, and you'll need
to revisit your choice whenever your dataset, feature list or queries per second (QPS)
change. Elasticity is not only about adding more servers; it is also about
being able to add end-user features. Each Algolia cluster is backed by 3
high-end bare metal servers with at least the following hardware
configuration:
- CPU: Intel Xeon (E5-1650v2) 6c/12t, 3.5 GHz+/3.9 GHz+
- RAM: 128GB DDR3 ECC 1600MHz
- Disk: 1.2TB SSD (via 3 or 4 high-durability SSD disks in RAID-0)
This configuration is key to providing instant, realtime search, answering
queries in less than 10ms.
Server configuration
Many technical people assume that server configuration
is easy: after all, it should just be a matter of selecting the right EC2
Amazon Machine Image (AMI) plus a puppet/chef configuration, right?
Unfortunately, this isn’t the case for a search engine. Nearly all AMIs
contain standard kernel settings that are okay if you have low traffic, but a
nightmare as soon as your traffic gets heavier. We’ve been working with
search engines for the last 10 years, and we still discover kernel/hardware
corner cases every month! To give you a taste of some heavyweight issues
you’ll encounter, check out the following bullet points:
- IO: Default kernel settings are NOT optimized for SSDs! For example, Linux's I/O scheduler is configured to merge some I/Os in order to reduce hard-drive latency while seeking disk sectors: this makes no sense on an SSD and slows down overall server performance.
- Memory: The kernel caches a lot, and that's cool... most of the time. When you write data to disk, it is actually written to RAM and flushed to disk later by the pdflush process. Some advanced kernel parameters let you control this behavior. vm.dirty_background_ratio is one of them: it configures the maximum percentage of memory that can be "dirty" (in cache) before it is written to disk. In other words, if you have 128GB of RAM and use the default value of 10% for dirty_background_ratio, the system will only start flushing the cache when it reaches 12GB! Flushing such bursts of writes slows down your entire system (even on SSD), killing the speed of all searches & reads. Read more.
- Network: When calling the listen function on BSD and POSIX sockets, an argument called the backlog is accepted. The backlog argument defines the maximum length of the queue of pending connections for the socket. If the backlog argument is higher than the value in net.core.somaxconn, it is silently truncated to that value. The default value is 128, which is way too low! If a connection request arrives when the queue is full, the client may receive an error such as ECONNREFUSED. Read more & even more.
We’ve been working hard to fine-tune such settings and it has allowed us to
handle today several thousands of search operations per second on one server.
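To make this concrete, here is a minimal sketch of the kind of kernel tuning discussed above. The values are illustrative assumptions, not recommendations: the right numbers depend entirely on your hardware and workload.

```sh
# Illustrative kernel tuning for an SSD-backed search server (example values only).

# Use a simpler I/O scheduler: the default merging/reordering logic targets spinning disks.
# Replace sda with your actual SSD device.
echo noop > /sys/block/sda/queue/scheduler

# Start flushing dirty pages sooner, so writes are not accumulated into huge bursts.
sysctl -w vm.dirty_background_ratio=2

# Raise the cap on the queue of pending TCP connections (the default is 128).
sysctl -w net.core.somaxconn=4096
```

Each of these settings maps to one of the pitfalls listed above, and finding the right values for a given machine is exactly the kind of ongoing work a hosted service absorbs for you.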
Deployment & upgrades are complex
Upgrading software is one of the main causes of service outages. Deployment should be
fully automated and capable of rolling back in case of failure.
For safe deployments, you also need a pre-production
setup that duplicates your production environment to validate each new release, as
well as A/B testing with a portion of your traffic. Obviously, such a setup
requires additional servers. At Algolia, we have test and pre-production
servers allowing us to validate every deployment before upgrading your
production cluster. Each time a feature is added or a bug is fixed on the
engine, all of our clusters are updated so that everyone benefits from the
upgrade.
On-premises solutions were not built to be exposed as a public service: you
always need to build extra layers on top of them. And even if these solutions
have plenty of APIs and low-level features, turning them into end-user
features requires time, resources and a lot of engineering (more than just a
full-stack developer!). You may need to re-develop:
- **Auto-completion:** to suggest the best products/queries directly from the search bar while handling security & business filters (not only suggesting popular entries);
- **Instant faceting:** to provide realtime faceting refreshed at each keystroke;
- **Multi-datacenter replication:** to synchronize your data across multiple instances and route queries to the right datacenter to ensure the best search performance all around the world;
- **Query analytics:** to get valuable information on what and how people search;
- **Monitoring:** to track in realtime the state of your servers, the storage you use, the available memory, the performance of your service, etc.
On-premises is not as secure as one might think
Securing a search engine is very complex, and if you choose to do it yourself,
you will face three main challenges:
- Controlling who can access your data: You probably have a model that requires permissions associated with your content. Search as a service providers offer packaged features to handle user-based restrictions. For example, you can generate an API key that can only target specific indexes. Most on-premises search engines do not provide any access-control features.
- Protecting yourself against attacks: There are various attacks your service can suffer from (denial of service, buffer overflow, access control weakness, code injection, etc.). API SaaS providers put a lot of effort into having the best possible security. For example, API providers were among the quickest to react to the "Heartbleed" SSL vulnerability: it only took a few hours after disclosure for Twilio, Firebase and Algolia to fix the issue.
- Protecting yourself from unwarranted downloads: The search feature of your website can easily expose a way to grab all your data. Search as a service providers offer packaged features to help prevent this problem (rate limit, time-limited API Key, user-restricted API Key, etc.).
Mastering these three areas is difficult, and API providers are challenged
every day by their customers to provide a state-of-the-art level of security
in all of them. Reaching the same level of security with an on-premises
solution would simply require too much investment.
Search as a service is not reserved for simple use cases
People tend to believe that search as a service is only good for basic use
cases, which prevents developers from implementing fully featured search
experiences. The fact of the matter is that search as a service simply handles
all of the heavy lifting while keeping the flexibility to easily configure the
engine. It therefore enables any developer, even a front-end-only developer,
to build a complex instant search implementation with filters, faceting or
geo-search. For instance, feel free to take a look at
JadoPado, a customer who developed a fully featured
instant search for their e-commerce store.
An on-premises solution, by contrast, runs inside your walls, so once in
production you will need a dedicated team to constantly track and fix the many
issues you will encounter. Who would think of having a team dedicated to
ensuring their CRM software works fine? It makes no sense if you use SaaS
software, as most people do today. Why should it be any different for a
component such as search? All the heavy lifting and the operational costs are
now concentrated in the SaaS providers' hands, which ultimately makes it far
more cost-efficient for you.
22 May 2014
We recently added support for synonyms in Algolia! It has been the most
requested feature since our launch in September. While it may seem
simple, it actually took us some time to implement because we wanted to do it
differently from classic search engines.
What’s wrong with synonyms
There are two main problems with how existing search engines handle synonyms.
These issues hurt the user experience and can make users think "this
search engine is buggy".
Typeahead
In most search engines, synonyms are not compatible with typeahead search. For
example, if you want tablet to equal ipad in a query, the prefix searches for
t, ta, tab, tabl and table will not trigger the expansion on iPad; only
the full tablet query will. Thus, a single new letter in the search bar can
totally change the result set, catching users off-guard.
Highlighting
Highlighting matched text is a key element of the user experience, especially
when the search engine tolerates typos. It is the difference between making
users think "I don't understand this result" and "This engine was able to
understand my errors". Synonym expansions are rarely highlighted, which
breaks users' trust in the search results and can feel like a bug.
Our implementation
We have identified two different use cases for synonyms: equalities and
placeholders. The first and most common use case is when you tell the search
engine that several words must be considered equal, for example st and street
in an address. The second use case, which we call a placeholder, is when you
indicate that a specific token can be replaced by a set of possible words and
that the token itself is not searchable. For example, the content
`<number> street` could be matched by the queries 1st street or 2nd street but not by the
query number street.
For the first use case, we have added support for synonyms that is compatible
with prefix search, and we have implemented two different ways of doing the highlighting
(controlled by the replaceSynonymsInHighlight query parameter):
- A mode where the original word that matched via a synonym is highlighted. For example, if you have a record that contains black ipad 64GB and a synonym black equals dark, then the queries ipad d, ipad da, ipad dar and ipad dark will all fully highlight the word black. The typeahead search works and the synonym expansion is fully highlighted: **black** **ipad** 64GB.
- A mode where the original word is replaced by the synonym, and the matched prefix is highlighted. For example, the query ipad d will replace black by dark and will highlight the first letter of dark: **d**ark **ipad** 64GB. This method makes it possible to fully explain the results when the original word can be safely replaced by the matched synonym.
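As a rough illustration of how this parameter is used, here is a minimal sketch of a query against the REST search endpoint. The index name, query and parameter semantics are assumptions for illustration, and authentication is omitted for brevity.

```js
// Sketch only: querying an illustrative "products" index with synonym-highlighting control.
// Authentication (application ID / API key) is omitted for brevity.
var params = $.param({
  query: "ipad dar",
  replaceSynonymsInHighlight: true // assumption: selects the mode that replaces "black" with "dark"
});

$.getJSON("http://api.algolia.io/1/indexes/products?" + params, function(data) {
  console.log(data.hits); // hits contain the highlighted values described above
});
```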
For the second use case, we have added support for placeholders. You can add a
specific token to your records that will be safely replaced by a set of words
defined in your configuration. The highlighting mode that replaces the
original word with the expansion makes perfect sense here. For example, if you
have a `<number> mission street` record with a placeholder `<number> =
["1st", "2nd", ...]`, then the query 1st mission street will replace `<number>`
by 1st and will highlight all the words: `**1st mission street**`.
We believe this is a better way to handle synonyms and we hope you will like
it :) We would love to get your feedback and ideas for improvement on this
feature! Feel free to contact us at hey(at)algolia.com.
14 May 2014
At Algolia, we are convinced that search queries need to be sent directly from
the browser (or mobile app) to the search engine in order to deliver a realtime
search experience. This is why we have developed a search backend that replies
within a few milliseconds through an API that handles
security when
called from the browser.
Cross domain requests
For security reasons, the default behavior of a web browser is to block all
queries going to a domain different from the website they are
sent from. So when using an external HTTP-based search API, all your queries
would be blocked because they are sent to an external domain. There are two
methods to call an external API from the browser:
JSONP
The JSONP approach is a workaround that
consists of calling an external API with a DOM <script>
tag. The <script>
tag is allowed to load content from any domain without security restrictions.
The targeted API needs to expose an HTTP GET endpoint and return JavaScript
code instead of the regular JSON data. You can use this jQuery code to
dynamically call a JSONP URL:
$.getJSON( "http://api.algolia.io/1/indexes/users?query=test&callback=?", function( data ) { /* ... */ });
In order to retrieve the API answer from the newly included JavaScript code,
jQuery automatically replaces the callback=? placeholder in the URL with a
generated function name (for example callback=method12), which must be called
by the JavaScript code that your API returns.
This is what a regular JSON reply would look like:
{"results": [ ...]}
Instead, the JSONP-compliant API generates:
method12({"results": [ ...]});
Cross Origin Resource Sharing
CORS (Cross
Origin Resource Sharing) is the proper approach to perform a call to an
external domain. If the remote API is CORS-compliant, you can use a regular
XMLHttpRequest JavaScript object to perform the API call. In practice the
browser will first perform an HTTP OPTIONS request to the remote API to check
which caller domains are allowed and if it is authorized to execute the
requested URL.
For example, here is a CORS preflight request issued by a browser. The most important
lines are the last two headers, which specify the permissions being checked: in
this case, the POST method and three specific HTTP headers.
OPTIONS http://latency.algolia.io/1/indexes/*/queries
> Host: latency.algolia.io
> Origin: http://demos.algolia.com
> Accept-Encoding: gzip,deflate,sdch
> Accept-Language: en-US,en;q=0.8,fr;q=0.6
> User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)
> Accept: */*
> Referer: http://demos.algolia.com/eventbrite/
> Connection: keep-alive
> Access-Control-Request-Headers: x-algolia-api-key, x-algolia-application-id, content-type
> Access-Control-Request-Method: POST
The server reply will be similar to this one:
< HTTP/1.1 200 OK
< Server: nginx/1.6.0
< Date: Tue, 13 May 2014 08:33:55 GMT
< Content-Type: text/plain
< Content-Length: 0
< Connection: keep-alive
< Access-Control-Allow-Origin: *
< Access-Control-Allow-Methods: GET, PUT, DELETE, POST, OPTIONS
< Access-Control-Allow-Headers: x-algolia-api-key, x-algolia-application-id, content-type
< Access-Control-Allow-Credentials: false
< Expires: Wed, 14 May 2014 08:33:55 GMT
< Cache-Control: max-age=86400
< Access-Control-Max-Age: 86400
This answer indicates that this POST method can be called from any domain
(Access-Control-Allow-Origin: * ) and with the requested headers.
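Once the preflight succeeds, the actual query is sent with a plain XMLHttpRequest. The sketch below shows what such a call can look like; the credentials and the request body format are placeholders for illustration, not a copy of our client code.

```js
// Minimal sketch of a CORS call after a successful preflight (placeholder credentials and body).
var xhr = new XMLHttpRequest();
xhr.open("POST", "http://latency.algolia.io/1/indexes/*/queries", true);
xhr.setRequestHeader("Content-Type", "application/json");
xhr.setRequestHeader("X-Algolia-Application-Id", "YourApplicationID");
xhr.setRequestHeader("X-Algolia-API-Key", "YourSearchOnlyAPIKey");
xhr.onload = function() {
  if (xhr.status === 200) {
    console.log(JSON.parse(xhr.responseText));
  } else {
    console.error("Query failed with HTTP " + xhr.status);
  }
};
// The body format below is illustrative only.
xhr.send(JSON.stringify({ requests: [{ indexName: "users", params: "query=test" }] }));
```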
CORS has many advantages. First, it gives access to a real REST API with all
HTTP verbs (mainly GET, POST, PUT, DELETE), and it also allows better error
handling in the API (bad requests, object not found, ...). The major drawback is
that it is only supported by modern browsers (Internet Explorer ≥ 10, Firefox
≥ 3.5, Chrome ≥ 3, Safari ≥ 4 & Opera ≥ 12; Internet Explorer 8 & 9 provide
partial support via the XDomainRequest object).
Our initial conclusion
Because of the advantages of CORS in terms of error handling, we started with
a CORS implementation of our API. We also added specific support for
Internet Explorer 8 & 9 using the XDomainRequest JavaScript object (these
browsers do not support CORS via XMLHttpRequest). The main difference is that
XDomainRequest does not support custom HTTP headers, so we added another way
to specify user credentials: in the body of the POST request (it was initially
only supported via HTTP headers).
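Here is a rough sketch of what that IE 8 & 9 code path looks like. XDomainRequest is the real browser object, but the way the credentials are embedded in the body below is an assumption made for illustration, not our exact wire format.

```js
// Sketch of the IE 8/9 fallback: XDomainRequest cannot set custom request headers,
// so the credentials travel inside the POST body (illustrative format).
if (window.XDomainRequest) {
  var xdr = new XDomainRequest();
  xdr.open("POST", "http://latency.algolia.io/1/indexes/*/queries");
  xdr.onload = function () {
    console.log(JSON.parse(xdr.responseText));
  };
  xdr.send(JSON.stringify({
    "x-algolia-application-id": "YourApplicationID", // assumption: credentials in the body
    "x-algolia-api-key": "YourSearchOnlyAPIKey",
    requests: [{ indexName: "users", params: "query=test" }]
  }));
}
```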
We were confident that we were supporting almost all browsers with this
implementation, as only very old browsers could cause problems. But we were
wrong!
CORS problems
The reality is that CORS still causes problems, even with modern browsers. The
biggest problem we have found is that some firewalls/proxies refuse HTTP
OPTIONS queries. We even found software installed on some computers that
blocked CORS requests, such as the Cisco AnyConnect VPN
client, which is widely used in the enterprise
world. We discovered this issue when a TechCrunch employee was unable to
use the search on crunchbase.com because the
AnyConnect VPN client was installed on their laptop.
Even in 2014 with a large majority of browsers supporting CORS, it is not
possible to have perfect service quality with a CORS-enabled REST API!
The solution
Using JSONP is the only solution to ensure great compatibility with old
browsers and handle problems with a misconfigured firewall/proxy. However,
CORS offers the advantage of proper error-handling, so we do not want to limit
ourselves to JSONP.
In the latest version of our JavaScript client, we decided to use CORS with a
fallback to JSONP. At client initialization time, we check whether the browser
supports CORS and then perform an OPTIONS query to verify that no
firewall/proxy blocks CORS requests. If there is any error, we fall back to
JSONP. All this logic is available in our JavaScript client without any
API/code change for our customers.
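A simplified sketch of that decision logic is shown below. The helper functions are hypothetical stand-ins for the client's internal transport code.

```js
// Illustrative sketch of CORS detection with a JSONP fallback (helper names are hypothetical).
function supportsCors() {
  return typeof XMLHttpRequest !== "undefined" &&
         "withCredentials" in new XMLHttpRequest();
}

function search(query, onSuccess) {
  if (supportsCors()) {
    // Try a regular CORS request first; if it fails (for example a proxy rejecting
    // the OPTIONS preflight), fall back to JSONP.
    sendCorsRequest(query, onSuccess, function onError() {
      sendJsonpRequest(query, onSuccess);
    });
  } else {
    sendJsonpRequest(query, onSuccess);
  }
}
```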
Having CORS support with automatic fallback on JSONP is the best way we have
found to ensure great service quality and to support all corner case
scenarios. If you see any other way to do it, your feedback is very welcome.
02 May 2014
We interviewed Dylan La Com, Growth Product
Manager at Qualaroo &
GrowthHackers.com, about their Algolia
implementation experience.

What role did search play at GrowthHackers before the Algolia
implementation?
When we launched our community site
GrowthHackers.com in October 2013, search was
admittedly an afterthought for us. GrowthHackers is a social-voting site where
marketers, founders, and product-people can share and discuss growth-related
content. At launch, it was unclear what role search would have on the site.
GrowthHackers is built on WordPress, and with that comes WordPress' standard
search functionality. What WP search does is append an additional keyword or
phrase parameter to its typical post query and load a new page with the
results. WP search only indexed the outbound URLs of the articles our members
submitted, which made finding specific content difficult.
Why did you want to give search an update on GrowthHackers?
We started hearing about our lack of a solid search feature from some of our
more active users. One of our members even put together a slide presentation
to prove just how useless our search was [check it out
here]. At the same time, GrowthHackers was becoming more than just a
way to stay up-to-date on the best growth articles, it was becoming the place
to get answers: an encyclopedia for growth-related information. Search volume
at this time was peaking in the mid-hundreds per week. We needed a search
feature that could support this evolving use-case.
Why did you choose Algolia?
We looked at several search solutions before trying Algolia, including
Swiftype, WP Search (plugin), and Srch2. All are great solutions, but
ultimately, we went with Algolia because they had the right mix of features:
Their integration was simple, the documentation was thorough, and there were
plenty of starter templates. I knew it was a good sign when, while looking
through their GitHub repository, I found a demo site with search that
worked very similarly to how we hoped ours would, complete with real-time
results, typo-tolerance, and filters. The Algolia team was incredibly helpful
getting us set up and was there each step of the way through the integration
process, providing resources and best practices for creating a truly top-notch
search experience.
Tell me a little about how the new search works.
Our primary use of Algolia is to store and index user submitted content, and
provide real-time search in our growing database of growth-related articles,
questions, videos and slides. The majority of what we index is article titles
and URLs–strings which are generally small. Visitors to our site often come
with specific growth-related questions and use our search to find answers
quickly. For example, someone interested in learning best practices for
running Twitter ads could type in “Twitter ad” and within milliseconds see
dozens of articles and discussions related to maximizing ROI for Twitter ads.
Using Algolia’s admin dashboard, we’re able to set ranking priorities based on
the number of votes and comments of each article, and make sure the top
results are the most relevant. So, the visitor who searches “Twitter ad” is
shown articles with the highest mix of votes and comments. Algolia took the
search ranking process and wrapped it in a clean and simple interface that
allows anyone, regardless of their experience with search, to easily adjust
and manipulate.
One of the challenges we faced during the integration process was
understanding how to keep our main database synced and up to date with our
Algolia index. User submitted content on GrowthHackers changes often as users
interact with the content. Each post once submitted may receive upvotes and
comments from members in the community. Each post also has a wiki-style
summary field that can be edited by community members. Lastly, posts can have
several states, including published, pending and trashed. In order to ensure
our content on Algolia mirrored the content in our database, we set up a job
queue and a cron process to periodically push updates to our Algolia index.
This has been working quite well for us.
How has the new search impacted engagement?
We released the new search mid-February, and since the release we’ve seen
search volume increase 4-5X. Of course there are several factors at play here,
including increased traffic volume and better search bar placement, but it is
clear that Algolia’s search features have contributed to an impressive
increase in search engagement. On average, visitors who utilize search view
2-3X more pages per session and spend 5-6X longer on the site than those who
don’t search. Algolia’s analytics dashboard provides us with an incredible
glimpse of visitor intent on our site by showing us the queries visitors are
searching for, and trend lines to show popularity over time. With this data,
we’re able to better understand how our visitors want to use our site, and
make better decisions about how to organize the content.
Moving forward, we’re hoping to implement Algolia’s search filters to provide
even better ways to access content on our site. We’re excited to have such a
powerful tool in our stack and hope to experiment with new ways to provide
search functionality throughout GrowthHackers.
09 Apr 2014
Yesterday, the OpenSSL project released an update to fix a serious security
issue. This vulnerability was disclosed in CVE-2014-0160 and is more widely known as the
Heartbleed vulnerability. It allows an attacker to
grab the content in memory on a server. Given the widespread use of OpenSSL
and the versions affected, this vulnerability affects a large percentage of
services on the internet.
Once the exploit was revealed, we responded immediately: All Algolia services
were secured the same day, by 3pm PDT on Monday, April 7th. The fix was
applied on all our API servers and our website. We then generated new SSL
certificates with a new private key.
Our website also depends on Amazon Elastic Load Balancing, which was
affected by this issue and was updated later, on
Tuesday, April 8th. We then changed the website certificate.
All Algolia servers are no longer exposed to this vulnerability.
Your credentials
We took the time to analyze the past activity on our servers and did not find
any suspicious activity. We are confident that no credentials were leaked.
However, given that this exploit existed in the wild for such a long time, it
is possible that an attacker could have stolen API keys or passwords without
our knowledge. As a result, we recommend that all Algolia users change the
passwords on their accounts. We also recommend that you reset your Algolia
administration API key, which you can do at the bottom of the “Credential”
section in your dashboard. Be careful to update it everywhere you use it in
your code (once you have patched your SSL library if you too are vulnerable).
Security at Algolia
The safety and security of our customer data are our highest priorities. We
are continuing to monitor the situation and will respond rapidly to any other
potential threats that may be discovered.
If you have any questions or concerns, please email us directly at
security@algolia.com