08 Jul 2012
You might already be familiar with Evernote. It’s a great company that delivers
an impressive product that I use often, both professionally and personally.
When we heard about them organizing their second developer competition, we
immediately thought about the fantastic opportunity it could be for Algolia!
Let’s sum up:
- It’s an excellent use case for our first lib: search in Evernote mobile apps is quite awful and we could really bring a better user experience!
- It’s a good incentive to create a second demo (after Cities Suggest) that’s more convincing, especially for Evernote’s users.
- It’s an opportunity to get some media coverage :)
- And most of all, it’s an excellent occasion to pitch our lib to the Evernote team. We would love to have them as a happy customer!
As you can see, even if we don’t make it to the finals, the decision was a
no-brainer! But, wait… we’d love to go to the finals! And you can help us!
Part of the competition is to get the maximum public support. You want to help
us? Just go to our submission page and vote once a day! Tell your friends!
Tell your grandma! You can even log in via Facebook ;)
Want to know more about our app “Search for Evernote”? Here comes the video.
You can also download the app directly from
http://www.algolia.com/evernote.html
05 Jul 2012
At one time or another, most developers come across bugs or problems with
Unicode (about 3,720,000 results on Google for the query “unicode bug
developer” at the time of this writing). Let me tell you about my experience
over the last decade and why we have now implemented our own Unicode library
to produce exactly the same results across devices/languages.
I first started to use Unicode in 2004 when I was developing text-mining
software specialized in information extraction. This software was fully
implemented in C++ and I used IBM’s ICU library to be Unicode compliant (all
strings were stored in UTF-16). I also used some of ICU’s normalization
functions based on decomposition, but I did not notice any major problem at
that time. I started to understand the dark side of Unicode later, when I used
it in other languages like Java, Python, and then Objective-C. My first
surprise was discovering that a simple isAlpha(unicodechar c) method can
return different results depending on the library!
I started to look at the standard in detail and downloaded UnicodeData.txt
(the file that contains most of the information about the standard; you can
grab the latest version here).
This file contains descriptions of all Unicode characters. The third column
represents the “General Category” and is documented as:
General Categories
The values in this field are abbreviations for the following. Some of the
values are normative, and some are informative. For more information, see the
Unicode Standard.
Normative Categories
- Lu: Letter, Uppercase
- Ll: Letter, Lowercase
- Lt: Letter, Titlecase
- Mn: Mark, Non-Spacing
- Mc: Mark, Spacing Combining
- Me: Mark, Enclosing
- Nd: Number, Decimal Digit
- Nl: Number, Letter
- No: Number, Other
- Zs: Separator, Space
- Zl: Separator, Line
- Zp: Separator, Paragraph
- Cc: Other, Control
- Cf: Other, Format
- Cs: Other, Surrogate
- Co: Other, Private Use
- Cn: Other, Not Assigned (no characters in the file have this property)
- Lm: Letter, Modifier
- Lo: Letter, Other
- Pc: Punctuation, Connector
- Pd: Punctuation, Dash
- Ps: Punctuation, Open
- Pe: Punctuation, Close
- Pi: Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
- Pf: Punctuation, Final quote (may behave like Ps or Pe depending on usage)
- Po: Punctuation, Other
- Sm: Symbol, Math
- Sc: Symbol, Currency
- Sk: Symbol, Modifier
- So: Symbol, Other
As you can see, there are quite a lot of categories. Some of them are very
easy to understand, like “Lu” (Letter, Uppercase) and “Ll” (Letter, Lowercase),
but some are more complex, like “Lo” (Letter, Other) and “No” (Number, Other),
and this is exactly where the first problem begins.
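To make the file layout concrete, here is a minimal Python sketch that reads the General Category out of one line in the UnicodeData.txt format (semicolon-separated fields, category in the third one). The helper name `parse_unicode_data_line` and the hard-coded sample line are ours, for illustration only; a real program would read the whole file.

```python
import unicodedata

# One line in the UnicodeData.txt format, hard-coded here for illustration.
SAMPLE_LINE = (
    "00BD;VULGAR FRACTION ONE HALF;No;0;ON;"
    "<fraction> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;"
)


def parse_unicode_data_line(line):
    """Return the code point, name and General Category of one line."""
    fields = line.rstrip("\n").split(";")
    return {
        "code_point": int(fields[0], 16),  # first column: hexadecimal code point
        "name": fields[1],                 # second column: character name
        "general_category": fields[2],     # third column: General Category
    }


record = parse_unicode_data_line(SAMPLE_LINE)
print(record["general_category"])

# Python's standard library exposes the same property directly.
print(unicodedata.category(chr(record["code_point"])))
```

Both lookups should agree, since `unicodedata` is built from the same data file (for whichever Unicode version your Python ships with).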
Let’s take the Unicode character U+00BD (½) as an example. It is quite common
in descriptions of spare parts and is defined as “No”… except that some Unicode
libraries consider that it is not a number and return false from their
isNumber(unicodeChar) method (e.g., in Objective-C).
In fact the two most used methods, isAlpha(unicodeChar) and
isNumber(unicodeChar), are not directly defined by the Unicode standard and
are subject to interpretation.
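You do not even need two libraries to see this. As a small illustration (using Python’s built-in string methods, not the Objective-C API mentioned above), a single standard library already offers three different answers to “is this a number?”. The helper `number_views` is a name we made up for this sketch.

```python
import unicodedata


def number_views(ch):
    """Collect the different 'is it a number?' answers Python gives for ch."""
    return {
        "category": unicodedata.category(ch),
        "isdecimal": ch.isdecimal(),  # decimal digits only (category Nd)
        "isdigit": ch.isdigit(),      # also accepts digits such as superscripts
        "isnumeric": ch.isnumeric(),  # also accepts fractions and other numerics
    }


for ch in ("3", "\u00b2", "\u00bd"):  # 3, ², ½
    print(repr(ch), number_views(ch))
```

The digit 3 passes all three tests, ² is rejected only by `isdecimal()`, and ½ is accepted only by `isnumeric()`, even though both ² and ½ belong to the same “No” category.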
The consequence is that results are not the same across devices/languages! In
our case this is a problem because our compiled index is portable, and we want
to have exactly the same results on different devices/languages.
However, this is not the only problem! Unicode normalization is also a tricky
topic. The Unicode standard defines a way to decompose characters (the
Character Decomposition Mapping); for example, U+00E0 (à) is decomposed as
U+0061 (a) + U+0300 ( ̀). But most of the time you do not want a decomposition
but a normalization: the most basic form of a string (lowercase, without
accents, marks, …). This is key to being able to search and compare words. For
example, the normalized form of the French word “Hétérogénéité” is
“heterogeneite”.
To compute this normalized form, most people compute the lowercase form of a
word (well defined by the Unicode standard), then compute the decomposed form,
and finally remove all the diacritics. However, this is not enough.
Normalization cannot always be reduced to just a matter of removing marks.
For example, the German letter ß is widely used and commonly replaced by or
understood as “ss” (you can enter ß in your favorite web search engine and you
will discover that it also searches for “ss”). The problem is that there is no
decomposition for “ß” in the Unicode standard, because this letter is not a
letter with marks.
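Here is a minimal Python sketch of the three-step approach described above, using the standard `unicodedata` module. The function name `naive_normalize` is ours; it illustrates the common recipe, not our library.

```python
import unicodedata


def naive_normalize(text):
    """Lowercase, decompose, then drop combining marks."""
    lowered = text.lower()                              # step 1: lowercase
    decomposed = unicodedata.normalize("NFD", lowered)  # step 2: canonical decomposition
    # Step 3: remove every character whose General Category is a Mark (Mn, Mc, Me).
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


print(naive_normalize("Hétérogénéité"))  # heterogeneite
print(naive_normalize("Straße"))         # straße (the ß is left untouched)
```

The French word comes out as expected, but “Straße” keeps its ß: it has no decomposition, so there is no mark to remove.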
To solve that problem, we need to look in the Character Fallback Substitution
table, which is not part of most Unicode library implementations. This
substitution table defines that “ß” can be replaced by “ss”. There are plenty
of other examples; for instance, U+0153 (œ) and U+00E6 (æ), letters of the
French language, can be replaced by “oe” and “ae”.
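As a rough illustration, such a table can be modeled as a simple mapping from one character to its replacement string. The `FALLBACKS` dictionary below is a hypothetical three-entry excerpt containing only the examples from this post, and `apply_fallbacks` is a name we chose for the sketch.

```python
# Hypothetical excerpt: only the three substitutions mentioned in this post.
FALLBACKS = {
    "\u00df": "ss",  # ß
    "\u0153": "oe",  # œ
    "\u00e6": "ae",  # æ
}


def apply_fallbacks(text):
    """Replace each character that has a fallback; keep the others as they are."""
    return "".join(FALLBACKS.get(ch, ch) for ch in text)


print(apply_fallbacks("straße"))  # strasse
print(apply_fallbacks("cœur"))    # coeur
```

In a full normalization this step would be combined with lowercasing and mark removal, since the table here only lists lowercase letters.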
In the end, this led us to implement our own Unicode library, to ensure that
our isAlpha(unicodechar) and isNumber(unicodechar) methods behave identically
on all devices/languages, and to implement a normalize(unicodestring) method
that includes the character fallback substitution table. By the way, our
implementation of normalization is also far more efficient, because we
implemented it in one step instead of three (lowercase + decomposition +
diacritics removal).
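One way to get a single-step normalization is to precompute, for every character, the result of the three steps plus the fallback substitution, and then normalize a string with a single table lookup per character. The Python sketch below shows that idea under our own assumptions: the names, the three-entry `FALLBACKS` excerpt and the limited code point range are all illustrative, and this is not the actual code of our library.

```python
import unicodedata

# Hypothetical excerpt of a fallback table (examples from this post only).
FALLBACKS = {"\u00df": "ss", "\u0153": "oe", "\u00e6": "ae"}


def _normalize_char(ch):
    """Apply lowercase + decomposition + mark removal + fallback to one character."""
    decomposed = unicodedata.normalize("NFD", ch.lower())
    stripped = "".join(
        c for c in decomposed if not unicodedata.category(c).startswith("M")
    )
    return "".join(FALLBACKS.get(c, c) for c in stripped)


def build_table(last_code_point=0x036F):
    """Precompute replacements for U+0000..last_code_point (illustrative range)."""
    table = {}
    for code_point in range(last_code_point + 1):
        ch = chr(code_point)
        normalized = _normalize_char(ch)
        if normalized != ch:
            table[code_point] = normalized  # combining marks map to ""
    return table


NORMALIZATION_TABLE = build_table()


def normalize(text):
    """Normalize in one pass over the string, using the precomputed table."""
    return text.translate(NORMALIZATION_TABLE)


print(normalize("Hétérogénéité"))
print(normalize("Straße"))
```

The table is built once, so the per-string cost is one lookup per character. Note that a per-character table cannot handle context-sensitive case mappings, which is a trade-off of this sketch.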
I hope you found this post useful and gained a better understanding of the
Unicode standard and the limits of standard Unicode libraries. Feel free to
leave comments or ask for clarification.
03 Jul 2012
On June 19 and 20, I had the chance to participate in the London edition of
LeWeb 2012. This year’s theme was “Faster than Real Time” and we had an
impressive list of speakers! But the true value of LeWeb is elsewhere: it’s in
the 1283 people from 52 countries who were present and with whom you could
network!
They chose Presdo Match to help people meet and honestly… this tool would
benefit from some improvements, especially a mobile version! Still, I was able
to find no fewer than 100 participants having the “mobile” keyword in their
profile and, from there, organize a handful of meetings. Thanks to all of you
who agreed to meet me or whom I met unplanned, with special thanks to Paul
Ardeleanu, Gora Sudindranath, Lindsey C. Holmes, Marius Rostad, Kevin McDonagh
and Alexandre Delivet for their valuable feedback about Algolia.
[Image: Cities Suggest demo @ LeWeb]
I had the opportunity to do the very first demonstrations of our instant
suggest lib, and that was both exhilarating and frustrating! We chose to
develop a small proof of concept suggesting city names from anywhere in the
world. Here’s what I learned:
- A demo is better than many words! Even if most people knew what I meant by “Google instant suggest”, the demo was key in clarifying our offering.
- Even if we chose cities because it was easy to demonstrate (thanks to the geonames database), it can be interesting in itself!
- 100ms seemed a pretty fast response time in our initial testing, but it’s actually way too slow to have a smooth user experience.
Overall, that was a very good experience, and I came back with a few
improvements to implement (most coming from my own frustration while showing
the demo, even though the feedback was actually very positive!).
The most important piece of feedback was about the perceived sluggishness of
the app. We decided to implement an asynchronous version of the lib. Beware,
it actually comes with a drawback for our users: it’s significantly more
difficult to integrate. But it did not take long for us to decide it was the
way to go, since the perception of speed matters so much to users that the
benefit far outweighs the longer integration code. We’ll now work on
simplifying it!
We’ll soon do a post about this demo. In the meantime, stay tuned!
02 Jul 2012
Welcome to The Algolia Blog! It’s always difficult to write the first post of
a blog! What should I talk about? The company, the founders, the business, the
culture? And all of that while knowing that virtually nobody will read it,
except people digging through the archives in a few years (hopefully)!
Let’s concentrate instead on what we’ll be blogging about. Company news,
obviously, but not only that. I expect we’ll write quite a few posts about
technology, algorithms, entrepreneurship, marketing, and whatever else we
want to share with you :)
And most importantly, feel free to participate in the comments or by
contacting us directly. We appreciate your feedback!
Welcome to the Algolia blog!