If I were building the software, a corpus of PR, licenses, etc. is the way I'd go. But "they did it statistically" doesn't answer the question "what is the model?" There are many different statistical models one could use. My other post covers a few things we've figured out so far.
But I'm starting to think a rule-based lexicon isn't out of the question, given these >1 scores on some texts.
Or just esthetic rules + word dictionary.
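
To make the >1 point concrete, here's a rough sketch of what I mean by a rule-based lexicon. Everything in it is made up for illustration (the words, the weights, the name `lexicon_score`); it's not a claim about what the actual software does. The only point is that a plain sum of per-word weights has no upper bound, so scores above 1 fall out naturally, whereas a model that outputs a probability would stay at or below 1.

```python
import re

# Hypothetical lexicon: these words and weights are invented for illustration,
# not taken from the software being discussed.
LEXICON = {
    "pursuant": 0.5,
    "hereby": 0.25,
    "leverage": 0.25,
    "synergy": 0.5,
}

def lexicon_score(text, lexicon=LEXICON):
    """Sum the weights of every lexicon word that occurs in the text.

    The sum is not normalized or clamped, so nothing keeps it below 1.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(token, 0.0) for token in tokens)

print(lexicon_score("Pursuant to the license, we hereby leverage synergy."))  # 1.5
print(lexicon_score("Nothing special here."))  # 0.0
```

If the real thing divided by text length or squashed the result into a 0 to 1 range, we wouldn't see scores over 1, which is why those scores make me lean toward something additive like this.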