Let’s search

Getting started

Of course, it requires a lot of time, skills and money. Hopefully, this time it is not all about money…

I- Find a fancy name

I was thinking of a name with a lot of OOO, like gOOOgle or yahOOOO. Obviously bing is out this time
Maybe WhyOOOO or HowOOO or whatOOOO… sounds stupid


II- Have a community of believers

It would gather open source developers, sponsors and general contributors to enrich a giant database
Everything must be transparent, friendly and highly graphical
Everybody must be able to participate collaboratively, as on Wikipedia


III- Advertise

At this point, we are still missing the huge computing power needed to process incoming questions
The more questions we receive, the more we learn and the more effective we get
In return, computing power must grow to keep the processing time under one second for any query
It might be necessary to use grid computing techniques in order to spread the work across hundreds of computers
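
To make the idea of spreading work across many machines a bit more concrete, here is a minimal sketch. It assumes a fixed pool of 200 workers with made-up names (`node-000` and so on) and picks one by hashing the question text. The function name `pick_worker` is hypothetical, and a real grid would also need load balancing and failure handling.

```python
import hashlib

def pick_worker(query: str, workers: list[str]) -> str:
    """Map a query to one worker by hashing its normalized text."""
    # Normalize case and whitespace so near-identical questions land together.
    normalized = " ".join(query.lower().split())
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a large integer, then wrap around.
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]

# Hypothetical machine names, for illustration only.
workers = [f"node-{i:03d}" for i in range(200)]
assigned = pick_worker("Why is the sky blue?", workers)
```

Because the same question always hashes to the same machine, that machine could also cache its answer, which helps with the one-second target.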

Technical challenges

What do you know about current search engine internals?
They work basically in three steps:

  1. Browse the huge web content with robots, following every link, file, image, etc.
  2. Store all this data in a database. All the information is indexed so that it can be accessed later on
  3. Run the search itself against that index with your keywords and criteria
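
The three steps above can be sketched in a few lines. This is only a toy, assuming the crawl has already produced a small dictionary of pages (the URLs and texts are made up). The index is a plain inverted index from word to pages, and a search returns the pages containing every keyword.

```python
import re
from collections import defaultdict

def tokenize(text: str) -> list[str]:
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Step 2: map every word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Step 3: return the pages that contain all the query keywords."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Step 1 (crawling) is replaced here by two hypothetical, hand-written pages.
pages = {
    "example.org/a": "The sky is blue because of Rayleigh scattering",
    "example.org/b": "Blue whales are the largest animals",
}
index = build_index(pages)
hits = search(index, "blue sky")
```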

But when dealing with an open-domain question answering engine, things are a little different…

I- Natural language

Our system must be capable of dealing with natural language.
This is of course much more complicated, but it provides a lot of valuable information about your request. Interrogative pronouns give a clue about the expected result, the tense is important, verbs are meaningful, etc.
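
As a rough illustration of how an interrogative word hints at the expected result, here is a minimal sketch. The mapping table and the function name `expected_answer_type` are assumptions made for this example. A real system would need proper linguistic analysis rather than a lookup on the first question word it finds.

```python
import re

# Hypothetical mapping from question words to the kind of answer expected.
EXPECTED_ANSWER = {
    "who": "person",
    "where": "place",
    "when": "date or time",
    "why": "reason",
    "how": "method or quantity",
    "what": "thing or definition",
    "which": "choice among options",
}

def expected_answer_type(question: str) -> str:
    """Return the answer type hinted at by the first question word found."""
    for word in re.findall(r"[a-z]+", question.lower()):
        if word in EXPECTED_ANSWER:
            return EXPECTED_ANSWER[word]
    return "unknown"

hint = expected_answer_type("Who wrote the Cyclopaedia?")
```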


Using our language means the system could understand the underlying connections between facts, people, etc.
This is no longer about indexing web content but rather about gathering human knowledge

II- Knowledge base

Sorting human knowledge, what a bright idea…

And the tribute goes to a famous Englishman, Ephraim Chambers, who in 1728 released his masterpiece « Cyclopaedia, or, An Universal Dictionary of Arts and Sciences ». Its preface contains one of the first maps of human knowledge.


It was later extended by the Frenchmen Diderot and d’Alembert in the « Encyclopédie ». Although articles were sorted alphabetically, each one had its own leaf in the tree of human knowledge. Quite impressive!

It’s an ancient work, and it is amazing to see how the top-level divisions still make sense today

  1. Memory (all history)
  2. Reason (all sciences)
  3. Imagination (all arts)

Last but not least, the Encyclopédie introduced numerous cross-references between articles, highlighting the relations between things. Stirring ancestors of the hyperlink…

Back to basics: instead of browsing the whole web, our search engine should rely on trusted sources and specialized services
Wikipedia might be a trusted source for general information, whereas the BBC would be good for the latest news or the weather forecast. Wordreference would provide translations and Facebook would cover my private area
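
Picking a trusted source per question could start as a simple routing table. The sketch below is only that: the trigger words are invented for the example, the function name `route` is hypothetical, and nothing here calls any real service. It simply returns the name of the source that would be asked.

```python
import re

# Hypothetical trigger words for each specialized source named above.
SOURCES = [
    ("BBC", {"news", "weather", "forecast", "headlines"}),
    ("Wordreference", {"translate", "translation"}),
    ("Facebook", {"my", "friends"}),
]
DEFAULT_SOURCE = "Wikipedia"  # general information fallback

def route(question: str) -> str:
    """Return the first source whose trigger words appear in the question."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    for source, triggers in SOURCES:
        if words & triggers:
            return source
    return DEFAULT_SOURCE

chosen = route("What is the weather forecast for London?")
```

Keyword triggers are crude, and a question like "translate my news" would match several sources. Here the order of the list decides, which is an arbitrary choice.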

III- Automated learning

The system must be able to learn by itself, based on experience
It means even the best and most powerful system won’t be able to answer any question until it is fed with billions of samples. How does it work? There are various ways:

  • Rely heavily on maths and statistics; this is what computers were made for!
  • Try to mimic the human brain with neural networks, which are mainly used in speech and handwriting recognition


Anyway, remember Google’s crowdsourcing?
Users will be asked to give feedback about the answers
Depending on whether users trust the answers or not, our engine should be able to identify the best algorithm to use for specific patterns in the question. Users might even be able to propose an answer of their own that our engine does not know about yet. Learning, learning…
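
Here is one way the feedback loop could be sketched, assuming each question has already been reduced to a pattern label (such as "who") and that several answering algorithms compete. The class name `FeedbackTracker`, the pattern labels and the algorithm names are all hypothetical. The score is a smoothed share of positive votes, so an algorithm that has not been tried yet starts at 0.5 instead of 0.

```python
from collections import defaultdict

class FeedbackTracker:
    """Keep user votes per (question pattern, algorithm) pair."""

    def __init__(self):
        # (pattern, algorithm) -> [positive votes, total votes]
        self.votes = defaultdict(lambda: [0, 0])

    def record(self, pattern: str, algorithm: str, trusted: bool) -> None:
        """Store one user's verdict on an answer."""
        counts = self.votes[(pattern, algorithm)]
        counts[0] += int(trusted)
        counts[1] += 1

    def best_algorithm(self, pattern: str, algorithms: list[str]) -> str:
        """Pick the algorithm with the highest smoothed trust rate."""
        def score(algorithm: str) -> float:
            positive, total = self.votes.get((pattern, algorithm), (0, 0))
            return (positive + 1) / (total + 2)  # add-one smoothing
        return max(algorithms, key=score)

tracker = FeedbackTracker()
for _ in range(3):
    tracker.record("who", "statistics", trusted=True)
for _ in range(2):
    tracker.record("who", "neural", trusted=False)
choice = tracker.best_algorithm("who", ["neural", "statistics"])
```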