Embedly Challenge Results

dustingetz · on Feb 13, 2012

> a text of 900 unique words

  >>> words = [2520/i for i in range(1, 900)]
  >>> len(words)
  899

hmmm.. anyway:

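  # Whole-number occurrence counts; `//` keeps integer division on Python 2 and 3.
  occs = [2520 // (n + 1) for n in range(900)]
  assert 900 == len(occs)
  assert [2520, 1260, 840, 630, 504] == occs[:5]

  num_words = sum(occs)

  # Find the smallest number of top-ranked words covering half the text.
  for guess in range(100):
      count = sum(occs[:guess])
      if 2 * count >= num_words: break

  assert 21 == len(occs[:guess])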

screeley · on Feb 13, 2012

Updated accordingly. We used floats instead of ints, which means the answer should have been 21, not 22.

Terretta · on Feb 24, 2012

It would be unusual to have non-integer occurrences of words in the text. But given that the question didn't specify how to round, both 21 and 22 could be valid answers even with only whole-number counts of words. Rounding down gives 21, while rounding to the nearest whole number gives 22.
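A minimal sketch of that comparison, assuming the n-th most frequent of the 900 words occurs 2520/n times, as in the snippets above. The helper name `words_for_half` and the constants are made up for illustration and are not from the challenge itself.

```python
TOP, UNIQUE = 2520, 900  # count of the top word, number of unique words

def words_for_half(counts):
    """Smallest k such that the k most frequent words cover at least half the text."""
    total = sum(counts)
    running = 0
    for k, c in enumerate(counts, start=1):
        running += c
        if 2 * running >= total:
            return k
    raise ValueError("counts are empty")

ranks = range(1, UNIQUE + 1)
floored = [TOP // n for n in ranks]        # round down to whole words
rounded = [round(TOP / n) for n in ranks]  # round to nearest (Python 3 sends ties to even)
exact = [TOP / n for n in ranks]           # fractional "occurrences"

print(words_for_half(floored), words_for_half(rounded), words_for_half(exact))
# prints: 21 22 22
```

The floored counts total 18170 words, so half is reached at the 21st word (9184 words); with rounded or fractional counts the total is about 18598, and the 21st word falls just short.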

Interesting to see how few of the proposed answers used an HTML parsing library (simplistic matching of potentially unknown document syntaxes is a notoriously brittle approach), and I was surprised how few counted depth relative to the article tag.

Given Embedly's business and the setup discussion, it seems a valid solution should work with any arbitrary HTML page containing an article tag and paragraphs within it, yet many of the listed gists either counted P depths by hand (!) or were hard-coded to that one particular document.

If the <article> tag or the <div> by it or the <p> tags had had so much as a space before the closing angle bracket (never mind classes or styles), most of them would have failed. For the most part, only the solutions pulling in an external parsing lib would still have worked. Python's lxml.html.soupparser comes to mind (or lxml.etree for this task), and I was happy to see several similar libs invoked.
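To make the point concrete, here is a rough sketch of measuring each <p>'s depth relative to <article> with a real parser instead of string matching. It swaps in the standard library's html.parser rather than lxml so it has no third-party dependency, and it assumes every non-void tag is explicitly closed (an unclosed <p> would throw the depths off). The names `ParagraphDepths`, `paragraph_depths`, and `sample` are invented for this example.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not change the depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class ParagraphDepths(HTMLParser):
    """Record each <p>'s nesting depth relative to the outermost <article>."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, outermost first
        self.depths = []   # one entry per <p> found inside <article>

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if tag == "p" and "article" in self.stack:
            # Depth 1 means the <p> is a direct child of <article>.
            self.depths.append(len(self.stack) - self.stack.index("article"))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

def paragraph_depths(html):
    parser = ParagraphDepths()
    parser.feed(html)
    parser.close()
    return parser.depths

# Stray spaces, classes, and styles do not matter to the parser.
sample = """
<html><body>
<article class="post" >
  <p >one</p>
  <div style="margin:0" ><p>two<br>still two</p>
    <div><p>three</p></div>
  </div>
</article>
</body></html>
"""
print(paragraph_depths(sample))  # prints: [1, 2, 3]
```

Real-world pages with implicit closes or broken nesting are where a more forgiving parser earns its keep.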

Interesting that you had to replace the document with a cleaned-up one to get more successful answers.

Thanks for sharing the results.

johno215 · on Feb 13, 2012

Living Under a Rock Question:

What is up with all the startups using a .ly domain or "ly" in their name? I can understand using a foreign top-level domain in order to find available domain names, but that does not explain why we don't see ones from all the other international TLDs.

Tossrock · on Feb 13, 2012

I think it's just a naming fad, along the lines of the [word]r naming fad.