
One of the main reasons the system was so complex was that it had literally been built from scratch in a weakly typed language with no support for data structures or object-oriented programming, and with absolutely horrendous error handling (any Mathematica user can attest to this). I do not believe Mathematica was the appropriate tool for a project as large as Wolfram Alpha, and obviously the performance hit from an interpreted language like Mathematica is very significant when you are running computations at the scale of a search engine (or "knowledge engine," as the engineers around me were quick to correct).

One great decision the Wolfram Alpha people made was to put together an excellent set of internal documentation on how to add new parsing capabilities to the language. Suppose you were tasked with adding queries about something like pregnancy data: you would just write a fairly straightforward module that would capture queries like "I am 6 months pregnant" and return a list of pods (a pod is a computed interpretation of your query; most Wolfram Alpha queries will return at least 5 of them). For pregnancy data, there is a pod that shows you how big the fetus should be, another for how much amniotic fluid there is, and so on. You would then write some Mathematica code to either scrape a website with pregnancy data or integrate with some data set curated by a data curator. This is not difficult to do, and I know of several WA-like projects that have accomplished this already. The problem is that data gets siloed, and data curators have a weak standard for how data and its relations should be expressed.
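To make the module idea concrete, here's a toy sketch in Python (the real modules were written in the Wolfram Language; the pod structure and lookup helpers here are invented for illustration, not WA's actual API):

```python
import re

def lookup_fetal_size(month):
    # In the real system this would pull from a curated data set.
    return f"fetal size for month {month} (from curated data)"

def lookup_fluid_volume(month):
    return f"amniotic fluid volume for month {month} (from curated data)"

def pregnancy_module(query):
    """Claim queries like 'I am 6 months pregnant' and return a list
    of pods, where each pod is one computed interpretation."""
    m = re.search(r"(\d+)\s+months?\s+pregnant", query, re.IGNORECASE)
    if m is None:
        return []  # not our query; some other module may claim it
    month = int(m.group(1))
    return [
        {"title": "Input interpretation",
         "content": f"pregnancy, month {month}"},
        {"title": "Fetal size", "content": lookup_fetal_size(month)},
        {"title": "Amniotic fluid", "content": lookup_fluid_volume(month)},
    ]
```

Each module only has to know its own little slice of the query space, which is exactly what made this easy to extend — and exactly what produced the silos.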

The difficulty arises when you are tasked with handling a complex query like "Which country has the greatest ratio of population to GDP?" Now you're talking about interoperability between two data sets, and although this particular case can be done quite easily using Mathematica's CountryData function:

  In[1]:= First[Sort[# -> CountryData[#, "GDP"]/CountryData[#, "Population"] & /@ CountryData[], Last[#1] > Last[#2] &]]
  Out[1]= "Monaco" -> 226860.
... it is nearly impossible to handle these kinds of situations for general queries that could ask about ratios of anything. For a while, one proposed solution was to make "ratio of population to GDP" its own column in the database table, but obviously this leads to a combinatorial explosion of columns if you are trying to answer general queries.
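The alternative to precomputed ratio columns is to compute ratios on demand from the base properties. A toy Python sketch of the idea (the table layout, names, and numbers are my own, not WA's — with n properties, materializing a column per ordered pair would already mean n*(n-1) columns, before you get to ratios of ratios):

```python
def ratio_query(table, numerator, denominator):
    """Find the entity with the greatest numerator/denominator ratio,
    computed on the fly instead of read from a precomputed column."""
    best = None
    for entity, props in table.items():
        num, den = props.get(numerator), props.get(denominator)
        if num is None or not den:  # skip missing or zero data
            continue
        r = num / den
        if best is None or r > best[1]:
            best = (entity, r)
    return best

# Toy data, not real statistics:
countries = {
    "A": {"population": 1000, "gdp": 50},
    "B": {"population": 500, "gdp": 100},
}
```

Here `ratio_query(countries, "population", "gdp")` picks out `("A", 20.0)` — the hard part in a real knowledge engine is not this arithmetic, but mapping the free-text query onto the right pair of properties in the first place.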

By the time I joined the Alpha team (after working for 2 years on Mathematica) they were already moving some of their most poorly designed data sets into a much better system that used a more rigid set of standards for describing things, places, concepts, relations, etc. I wish I could elaborate more on this because it was really very cool technology running in the background, but Wolfram Research has a real track record of suing people who violate their NDA (Matthew Cook). What I can say is that it fixed some absolutely ridiculous database design decisions - for example, in one table storing athlete performance, athletes who played multiple years got one row per year, keyed by names like BabeRuth1942, BabeRuth1943, BabeRuth1944, and so on. It was then up to the developer to know that the name and year needed to be separated, and that their code had to handle single-year and multi-year athletes differently.
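Every consumer of that table had to reimplement key-splitting logic along these lines (a Python sketch of the kind of code the schema forced on you; the regex and convention are my reconstruction, not the actual codebase):

```python
import re

def split_key(key):
    """Split a denormalized key like 'BabeRuth1942' into (name, year).
    Keys without a trailing year - the table's convention for
    single-season athletes - come back with year=None."""
    m = re.fullmatch(r"(.*?)(\d{4})?", key)
    year = int(m.group(2)) if m.group(2) else None
    return m.group(1), year
```

This is exactly the sort of thing a sane schema makes impossible: name and year as separate columns, and no special case for single-season athletes.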

tl;dr: Don't over-glorify Wolfram Alpha - it gets things done, but the poor performance and unpredictable results are caused by bad planning and poor internal organization. If I were going to make my own knowledge engine, I would spend a long time drawing up an incredibly detailed schema for how every single thing would be represented and how a developer would write a new module for it, before writing a single line of code. These are the lessons gleaned from spending two and a half years wallowing in a Big Ball of Mud (http://laputan.org/mud/).



Thanks for the reply.

So it used a traditional database for storage? I've gone down the triplestore route, with some trepidation. Working out okish so far, although I wish there were better resources around on SPARQL.
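For anyone unfamiliar with the triplestore route: the core idea is tiny — facts are (subject, predicate, object) triples, and queries are patterns with wildcards. A minimal Python sketch (no indexing, no SPARQL, illustrative data only):

```python
def match(store, pattern):
    """Return the triples matching a (subject, predicate, object)
    pattern, where None acts as a wildcard."""
    s, p, o = pattern
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

store = {
    ("Monaco", "instanceOf", "country"),
    ("Monaco", "population", 38000),  # illustrative value
}
```

Real triplestores add indexes over the permutations of (s, p, o) and join these patterns together, which is essentially what a SPARQL basic graph pattern compiles down to.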


Thank you for going into detail describing these things. It is all very fascinating :)



