A Fun Developer Interview Question

Sandman · on Jan 9, 2011

It seems to me that this is a classic case of "guess the answer I'm looking for". The interviewer is obviously impressed with the concept of a Bloom filter, and I'm not saying that there's anything wrong with Bloom filters, but that kind of thinking in an interview is dangerous. What if the interviewee disagrees with your views, and has strong arguments to use a completely different approach? Would you accept them, would you at least aknowledge them as valid, or would you stick with your solution as 'the only right way to do it'?

dangoldin · on Jan 9, 2011

I wrote it and I'll respond - I do like the concept of a Bloom Filter and do think it's a pretty good solution to this problem but I don't really expect anyone to get to that stage on their own - and very few people do. What I use it for is as a destination and when we finally get there we get to talk about it. You can get a sense of how good a developer is based on how quickly they grasp the concept as well as the discussion that ensues.

I view an interview as more of a conversation and love learning about new concepts.

I think any question can be dangerous in the wrong hands and suffer from the "guess the answer I'm looking for" so it's always up to the interviewer to be unbiased.

Mesmoria · on Jan 9, 2011

I agree. I can only assume that the article is an example of one possible line of solution.

Given the performance characteristics of human typing speed I would use a disc-based database. Simple to implement.

dangoldin · on Jan 9, 2011

I forgot to mention that I do hear the disc based solution quite a bit and it's definitely a valid response given the problem description. But I would still say "let's say you don't have any disk space, how would you go about the problem?" while still acknowledging that the answer was valid.

wccrawford · on Jan 9, 2011

So at all times, you continue to lead them to the answer you want by continuing to add restrictions until you get there. And if they don't understand your favorite new concept immediately, they aren't judged worthy?

This isn't good interview technique.

dangoldin · on Jan 9, 2011

I'm being defensive now but I think that's a pretty extreme view. Every answer is part of the interview process and it's not about the final destination. The more answers that people have the bigger the conversation and the more I can understand how someone thinks which is what you want to get out of an interview.

The fact that someone does quickly grasp the concept is a plus - why do you think it's the wrong approach? I definitely adapt my approach as I do more interviews so any feedback helps.

Xk · on Jan 9, 2011

First of all, let me say this is a significantly better question than many I've seen. However, It disturbs me when someone is interviewing me and they are looking for a particular answer and they consider everything else wrong. Even when I find more efficient answers, if it's not theirs, it's wrong.

Think about it like those little kid riddles: "A man is found dead with a noose around his neck and a puddle of water at his feet. How is it he was killed?" Put on a stool and the stool was removed? Then add the restriction "There are no stools." He was lifted with the rope? "The rope was tied to a bar on the ceiling and couldn't be raised higher." Raised in the air by some strong people and then let hang? "No one raised him from the air." You can make guesses all day long and I can make up excuses about why you're wrong until you finally guess what I wanted you to guess -- that he was standing on a block of ice.

More specific to your question: when I read it the first thing that came to mind was just to have an array of booleans and for each word in the table set H(word) to true and then when looking up a word check if H(word) is true. I don't know what bloom filters are, so I'm not saying this answer is better than them -- but after I suggest that instead of imposing some other restriction, ask me to think about what is wrong with my answer. Talk to me about the false-positives/false-negatives [or lack thereof]. Talk to me about what hash function I could use. Or about why I picked that way. Or what problems could happen if the hash function wasn't perfectly uniform.

dangoldin · on Jan 9, 2011

Fair points but I don't expect people to get to the Bloom filter on their own - I view as more of a process that both of us are driving towards. I also enjoy hearing new solutions - there's always room for discussion and I will try to dig deeper into answers if I sense something is wrong.

At the end I do like to get to the Bloom filter piece though in order to see what the response is as well as talk about something that is new to them.

charrington · on Jan 9, 2011

How about just going with the disk-based line if the candidate goes there? On-disk data structures is a rich area for exploration. I think you're getting people a little riled up here because you seem to want to discuss Bloom filters no matter what. You seem to grok that this should be about thinking skills, data structures, and algorithms. That's great. Forcing the candidate to go where you have a particular interest is not so great.

dangoldin · on Jan 9, 2011

Unfortunately I myself am not too familiar with on-disk data structures and wouldn't be too comfortable interviewing based on that. It probably is something I should brush up on but for the specific role I've been interviewing for it doesn't seem to be a big need.

Stormbringer · on Jan 9, 2011

Good luck with your spell checkers for tightly RAM limited devices with no persistent storage and no external hard drives...

Frankly, I'm surprised and amazed that there is still a market for modding old Tamagotchis. Human ingenuity is amazing.

hackinthebochs · on Jan 9, 2011

I think this is a great interview question, and I wouldn't have came up with the bloom filter answer. The point is the discussion triggered by the question, not any particular answer.

You're taking it like it's a knock on you that he wouldn't 'accept' your answer but lead you down the line of questioning he planned for. But this exactly what interviewers should do: be prepared for a certain line of discussion and nudge the candidate in that direction.

If someone was familiar enough with bloom filters to come up with that answer on their own, that should be an advantage to them. The chances of false negative seems minimal since it's uncommon and I don't think some hack would understand it well enough to use it so appropriately after a cursory glance at a wikipedia page.

On the other hand, the discussion is more important than the answer so the person spouts off bloom filter immediately is short circuiting the very process he's there for. So it balances out either way.

Stormbringer · on Jan 9, 2011

I like many others thought that this was a terrible interview question.

On the other hand, I learned something today (probabilistic data structures?!? WTF?!?)

I think the fundamental problem with the question is that it is the wrong way round. It is designed so that the interviewer can show off his expertise, whereas traditionally the view would be that you need to get the person being interviewed to show off their expertise.

Of course, if what you are really interviewing for is people who will bask in your awesomeness, then it is ideal.

----

I had a really weird interview this week. I couldn't get a word in edgewise. Haven't heard back from them either. Bizarre. If you're not going to let me talk, what is the basis for rejecting me? :D

hackinthebochs · on Jan 9, 2011

I don't get where this question shows he's trying to "show off his expertise". I can see that some people might be sensitive to the situation having been through the type of interview you describe. But I think its going too far assuming thats his motivation for this question.

I think the best types of questions are the ones where you don't know the answer ahead of time. This is where you really get to see how someone thinks. The bloom filter is rare enough that its likely people won't have any exposure to it at all. This gives you a very objective look at how quickly people can grasp new concepts and then apply them. This is exactly what an interview should hope to discover about a candidate.

What I see in the responses to this question are people who aren't very interested in the science side of computer science and thus immediately get suspicious when a question heads in that direction. Perhaps unfortunately, questions like these are going to be biased towards someone who does have a high interest in it.

gfodor · on Jan 9, 2011

If your interview question's success or failure is highly correlated with the person's ability to apply a fairly esoteric data structure it's probably a bad question.

The person who has never come across Bloom filters before has a huge disadvantage in your question, and the person who has seen them before has a huge advantage. I'd guess that as soon as a candidate says "Bloom filter" before you do in an interview, it's an instant "hire" for you, but this is a bad sign since you will get lots of false negatives and false positives.

False negatives will be good devs who never needed a Bloom filter (there are many) and false positives will be green devs that have read the wikipedia article on Bloom filters just before the interview out of dumb luck, or had a question where that data structure was introduced in the way you're introducing it, etc.

hinathan · on Jan 9, 2011

So you're saying this interview question is a probabilistic (with the potential for false positives) tool for detecting membership in the set of people who are equipped to understand the concept? Sounds like a Bloom Filter to me.

dangoldin · on Jan 9, 2011

Nah - if it's obvious the person has heard of a Bloom Filter before it's not a fair question to ask them but I will talk to them about the relationship between the size of the bit array and the number of hash functions to measure how deep the understanding goes.

I actually prefer it when people have never heard of a Bloom filter so that we can actually have a real discussion.

davidst · on Jan 9, 2011

I've done quite a bit of work with dictionaries going as far back as 8-bit machines with tight memory constraints. Bear with me here as it will get interesting.

A Bloom filter is a great data structure but probably not the best choice for this problem. It's not as memory efficient as you think.

Here's why: Think of a Bloom filter as a collection of hash values where each individual hash value is expanded by 44%. The cost of maintaining the same false positive rate of an explicit log2(n) hash is to use 1.44log2(n) bits (see http://en.wikipedia.org/wiki/Bloom_filter for more detail.) So, if you want a false positive rate of 1 in 64k you could store a 16-bit hash in an array or a 23-bit hash in a Bloom filter. That's a significant tax to pay when memory is tight.

In this case it's better to keep a bit array of explicit hashes than to use a Bloom filter.

An array of hashes gives you some options to make it smaller:

1. Sort the array by hash value and store only the differences using a delta coding scheme. The downside is you have to search the entire array (or maintain some indexes into it) to look up a single word.

2. Break the array up into sub-arrays where each sub-array is chosen so the first N bits of the hash are the same (and discarded.) Each sub-array uses a smaller hash value and can still be binary searched.

Of course, strip prefixes and suffixes from the words so the number of hashes that must be stored is reduced.

1.5 bytes per word is easy and with some work you can approach one byte per word with a low error rate.

If you're allowed to do off-line preprocessing on the word list you can do better.

Look up the history of Doug McIlroy's Unix spell checker. He got 75,000 words to fit in about 52k (going from memory here and my copy of Bentley's "Programming Pearls" is at home.) McIlroy traded away random access and had a fairly high error rate to get that much compression. Essentially, what he did was to create a large 1-bit Bloom filter and delta-encoded the differences between set bits (similar to option #1 above.)

If you want to maintain fast random access for individual words you can do this with a Directed Acyclic Word Graph (DAWG). A simple DAWG with fixed node sizes and no stemming of any kind will get two bytes per word on English. Add variable node sizes, stemming and a few other tricks and you can have fast lookup with a near-zero error rate at a cost of about one byte per word.

dangoldin · on Jan 9, 2011

This is a great response - thanks for the info and I will definitely read up a bit more on it.

Just comes to show that there are almost always better solutions than you can think of!

davidst · on Jan 9, 2011

> Just comes to show that there are almost > always better solutions than you can think of!

It's true. One of the characteristics I always looked for when interviewing programmers was whether they were open to other ways of doing things or whether they fixated on the first solution that presented itself. The best programmers are always looking for a better way.

harryh · on Jan 9, 2011

I'd probably just keep a hash of each of the words in the first list rather than all the actual words. Simple & easy to understand for whoever has to deal w/ the code after me.

I'm not 100% convinced that "ability to understand bloom filters after 5 minutes of explanation on them" correlates very highly with what I tend to look for when making a hire. Then again, hiring is such a crap shoot anyways. Mostly you just want to get someone talking about programming for a while and try to determine if they know their shit or not. The exact topic is close to irrelevant.

dangoldin · on Jan 9, 2011

You pretty much nailed it on the head there with the "get someone talking about programming" piece but I try to start with one problem and dig deeper into it rather than having many questions as if going down a checklist.

That's why you also have more than one person do the interview so you can get multiple perspectives.

Adrock · on Jan 9, 2011

One thing I like about this solution is that you can easily trade memory for accuracy by tweaking the hashcode space.

Adrock · on Jan 9, 2011

My personal favorite solution to this problem is to use a trie.

When asked to reduce the memory usage, you can recognize that the trie is actually a DFA and apply a DFA-minimization algorithm. There are several other interesting approaches here:

http://en.wikipedia.org/wiki/Trie#Compressing_tries

dangoldin · on Jan 9, 2011

That's actually pretty neat - no one has really suggested this approach yet.

I don't think that this will work as memory keeps on decreasing - is there a way to make a trie probabilistic?

Adrock · on Jan 9, 2011

Instead of being probabilistically wrong, you can save memory by removing some branches of the trie and becoming deterministically wrong.

You could remove the largest branches that correspond to the fewest words. Alternatively, you could remove the branches that correspond to the least queried words if you know something a priori about the query pattern.

alok-g · on Jan 9, 2011

Instead of making it probabilistic, I would rather add more memory to the system. The idea of making it probabilistic seems to me to be entirely for leading the interview candidate to the solution you have in mind.

hackinthebochs · on Jan 9, 2011

The responses here are amazing. No wonder hiring is so broken for developers. Someone actually comes up with a good question and he's practically attacked for it.

dangoldin · on Jan 9, 2011

Well it's an ongoing process to come up with the question and if I can get better at interviewing it helps me as well.

I usually ask what people think of the question after the interview and the responses tend to be pretty positive (although it may be due to the fact that they are interviewing). It's a great feeling exposing a new concept to a developer though and seeing them get excited about it - those are the exact people you want to work with.

kqueue · on Jan 9, 2011

The way I see it, you wrote an article to tell us you know about bloom filter and your interviewee doesn't.

alok-g · on Jan 9, 2011

Here is a similar interesting problem that I was once posed. Not sure how common or rare is this:

You get a stream of words (using functions like getFirst(), getNext(), boolNoneRemaining(), etc.). Once you reach the end of this stream, you have to output a random word uniformly chosen from this list.

The first answer is of course to store the entire list and maintain the size of this list. The moment it finishes, you multiply the size by a random number ...

Then comes the twist. What if there isn't enough memory to hold the whole list?

The thing that stuck me first is that someone posing this problem by itself meant that there was probably a solution to this, which I could then think of and program shortly.

Now frankly, when I interview, this is usually the last question on my list. Most candidates do not reach this far. Believe it or not, some half of the applicants are not able to draw a finite state machine that could identify if given binary number is odd or even (when its input stream is bits with LSB the last...).