Hacker News

The article is somewhat silly, but there's a kernel of good advice here:

To estimate the "complexity" of a codebase:

1. Remove all comments

2. Replace all spans of whitespace with a single space

3. Concatenate all source together into a single file

4. Compress the resulting text file using gzip -9 (or your favorite compression engine)
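The four steps above can be sketched as a single pipeline. This assumes the sources live under `src/` and that `#`-to-end-of-line is a comment, which is only roughly right; proper comment stripping needs a language-aware tool.

```shell
# Rough sketch of steps 1-4 for a tree of Python files.
# sed strips '#' comments (approximate), tr collapses whitespace,
# gzip -9 compresses, wc -c reports the proxy-complexity byte count.
find src -name '*.py' -exec cat {} + \
  | sed 's/#.*$//' \
  | tr -s '[:space:]' ' ' \
  | gzip -9 \
  | wc -c
```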

The size of the resulting file is a good proxy for overall complexity. It's not heavily affected by naming conventions, and a refactoring that reduces the compressed size is probably good for overall complexity.

It's not a perfect metric as it doesn't include any notion of cyclomatic complexity, but it's a good start and useful to track over time.



I think the value of metrics like this is limited, since code base size only very roughly corresponds to implementation complexity.

Here are some examples where you would increase the compressed code size while not making the project more complex:

1. Adding unit tests to code that was previously untested. Unit tests add little complexity because they don't introduce new interfaces.

2. Splitting a God class up into multiple independent classes. Usually this improves readability thanks to separation of concerns, but it often increases raw code size because each new class adds some boilerplate.

etc.


> I think the value of metrics like this is limited, since code base size only very roughly corresponds to implementation complexity.

This sounds a lot like the "your model is wrong because nuance X" argument. I want to remind you that all models are wrong, but some of them are useful anyway. In particular, I have found the size of source code to be a highly useful predictor of complexity. It has helped me predict where bugs are, where changes are made, where developers point out areas of large technical debt, and many other variables associated with complexity.

The test of a model is not whether it accounts for all theoretical nuances, but rather whether it's empirically useful – and critically, has higher return-on-investment than alternative models. What model do you suggest for implementation complexity that you have verified to be better than code size? Genuinely interested!

(Additionally, I have also successfully used the compressed size of input data to predict the resource requirements of processing that data, without actually having to process it first. This is useful because the compressed size can be approximated on-line rather cheaply.)
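A minimal sketch of that last idea, using Python's zlib: the compressed size of a stream can be accumulated chunk by chunk, so the estimate is available on-line without buffering or fully processing the input first. (`compressed_size` is a made-up helper name, not a real API.)

```python
import zlib

def compressed_size(chunks, level=9):
    """Approximate compressed byte count of a stream of byte chunks,
    computed incrementally as the data arrives."""
    comp = zlib.compressobj(level)
    total = 0
    for chunk in chunks:
        total += len(comp.compress(chunk))
    total += len(comp.flush())
    return total
```

Highly repetitive input will report a size far below its raw length, which is exactly the signal being used as a resource-requirement predictor.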


The GP’s point isn’t that the model is wrong “because nuance X”, it’s that the model directly contradicts good practice.

You also changed the model under consideration (GP said compressed code size; you said just code size), disagreeing with a point the GP didn’t make.


> The GP’s point isn’t that the model is wrong “because nuance X”, it’s that the model directly contradicts good practice.

"Doesn't matter; had predictive value" is what comes to mind. "Good practise" isn't a defense against empirical evidence.

That said, you're right that I missed the compressed part. I haven't tried compressing the source code before analysis, but I do suspect it would improve accuracy rather than decrease it. That's not a rigorous argument though, and I'm willing to accept that uncompressed code size might be a better model than compressed code size.


> Usually this improves readability thanks to separation of concerns, but it often increases raw code size because each new class adds some boiler plate.

That is why compression is mentioned. Boilerplate is something that disappears under good enough compression. It's literally why we call it boilerplate and generally dislike it - because once we spot the pattern, we can mentally compress it away, and then are annoyed that we have to do that mental compression whenever reading or modifying that code. Feels like pointless work, which it is.
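A toy illustration of the point, assuming nothing beyond the standard library: repeated boilerplate nearly vanishes under gzip, while the same number of incompressible bytes barely shrinks at all.

```python
import gzip
import os

# 100 copies of the same class skeleton vs. equally many random bytes.
boiler = b"class Widget:\n    def __init__(self):\n        pass\n" * 100
noise = os.urandom(len(boiler))

small = len(gzip.compress(boiler, 9))  # boilerplate compresses away
large = len(gzip.compress(noise, 9))   # random data does not
```

So under the compressed-size metric, adding boilerplate-heavy classes costs almost nothing, which is the behavior the grandparent comment wanted.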


Why would you include unit tests in the code size or complexity calculations?


Sometimes I've scanned my own code bases for all user-definable variable names and just computed the Levenshtein distance between them. It's kind of useful, but the hurdle for me at least is that I need to run something in a terminal to get the results. Maybe I'd use it more if it were a plugin in my IDE of choice.

Something else you could maybe do is to simplify the code and compare sequences of statements and expressions to each other.

e.g. the two statements "foo = bar; foo += 20" are identical to "zoo = war; zoo += 20" up to renaming.
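That comparison can be sketched with Python's own tokenizer: rename every identifier to a canonical placeholder, so statement sequences that differ only in naming compare equal. (`canonical` is a hypothetical helper, not an existing tool.)

```python
import io
import keyword
import tokenize

def canonical(src):
    """Return src with each distinct identifier replaced by v0, v1, ...
    in order of first appearance; keywords and literals are kept."""
    names = {}
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append(names.setdefault(tok.string, f"v{len(names)}"))
        else:
            out.append(tok.string)
    return " ".join(out)
```

With this, `canonical("foo = bar\nfoo += 20\n")` equals `canonical("zoo = war\nzoo += 20\n")`, matching the example above.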


This is what a minifier does, and those go even further to rename variables.

Another thing that should be pruned away entirely is data files, including all constant strings within the code, since humans should avoid those when focusing on algorithms.

At that point you pretty much have a highly compressed version of what you'd find in CLRS or any other algorithmic text.



