Here are a few things I've gleaned from experimenting with it:
- It uses a unigram (bag-of-words) model. You can take the same text, randomly permute the words, and you get the same score. This means it can't be using anything order-dependent like POS tagging, phrases, etc.
- It normalizes words by making all letters lowercase. The exact same text in all upper case has the same score.
- The score is eventually normalized by the length of the text. The same text copied multiple times gets the same score.
- The score is not a valid probability, as someone has managed to get scores of 1.16. This makes me believe it's not a Naive Bayes classifier giving you P(Bullshit|Text), though that is what I originally thought it would be.
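To make those observations concrete, here is a minimal sketch of a scorer that would behave the same way. This is a guess at the general shape, not the tool's actual code: the word list and the weights in `BS_WEIGHTS` are made up, and `unigram_score` is a hypothetical name.

```python
import re

# Hypothetical per-word weights; the real tool's word list is unknown.
BS_WEIGHTS = {"strategy": 1.0, "synergy": 0.5, "leverage": 0.25}

def unigram_score(text: str) -> float:
    """Average per-word weight over a lowercased bag of words."""
    # Lowercasing makes the score case-insensitive.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    # Summing ignores word order; dividing by the word count
    # normalizes for text length.
    return sum(BS_WEIGHTS.get(w, 0.0) for w in words) / len(words)
```

Shuffling the words, upper-casing everything, or pasting the text several times leaves the result unchanged. Nothing caps the average at 1 either: if any weight is above 1, a text made mostly of that word scores above 1.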
It is that simple. Looks like they assign a BS level to words, and then take some sort of average bullshit level amongst all words. Word order doesn't matter as slashcom says.
For example, if you take the score from the Oracle Pricing blurb posted by BitMistro, and change 'strategies' to 'goals', you drop down to 0.8 or so. If you add an extra random 'strategy' somewhere, it bumps up to 1.4 or something.
I actually suspect a bug in their weight for 'strategies'... probably a decimal-point error when building the BS level tables.
But similar things happen with other 'bullshitty' words, just to a lesser degree.
It should also be noted that on about 400 short texts (~300 words each), it did not correlate with the Flesch-Kincaid readability measure at all. So it's not measuring something like average word length or syllable counts.
But it's not QUITE a true lexicon, as it handles Out-Of-Vocabulary words quite strangely. If you use as input text:
"PR-Experts, politicians, ad writers or scientists need to be strong here!
BlaBlaMeter unmasks without mercy how much bullshit hides in any text.
A useful tool for everyone involved in writing!
Simply copy your text into the white field and check your writing style. It works with english text up to 15.000 characters (overhead will be cut off). For a meaningful result we recommend a minimum length of 5 sentences."
Then you get 0.16. If you replace the last word 'sentences' with 'strategy' you go up to 0.44. However, if you change the last word to 'sentstrategyences' you get 0.47. Try it: you can basically insert 'strategy' inside ANY word and really raise your score. Actually, if you just insert "strateg" anywhere inside the text, it goes up massively.
So I actually think it's just doing string search counts over a lexicon.
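Roughly what I have in mind, as a sketch only. The stems in `LEXICON_STEMS` and the name `substring_score` are made up for illustration, and dividing by the word count is my assumption about how the length normalization works.

```python
# Hypothetical stems; the real lexicon is unknown.
LEXICON_STEMS = ("strateg", "synerg", "leverag")

def substring_score(text: str) -> float:
    """Count raw substring hits over the lexicon, per word of text."""
    lowered = text.lower()
    n_words = len(lowered.split())
    if n_words == 0:
        return 0.0
    # str.count matches anywhere in the string, so a stem buried
    # inside another word still counts as a hit.
    hits = sum(lowered.count(stem) for stem in LEXICON_STEMS)
    return hits / n_words
```

With this, ending a text in 'strategy' or in 'sentstrategyences' gives exactly the same score, because both contain 'strateg' once. It does not reproduce the small 0.44 vs 0.47 gap, so the real thing must be doing something slightly different on top.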
"Politics are great, come buy our new, brand spanking awesome banana phone, apple, steve jobs, cripplingly epic banana phone. Just great phones, with bananas, no apples to be found here. Samsung can suck on our banana phone. Android is better than iOS."
"Your text: 251 characters, 43 words
Bullshit Index :0.03
Your text shows no or marginal indications of 'bullshit'-English."
You wrote a lot of bullshit, but your text looks pretty normal from a vocabulary standpoint. Looking a bit more at the website, you'll see that what it calls bullshit English is the pattern often used in scientific articles or legal texts (and presidential speeches), where they seem to be saying a lot of really impressive stuff, but all you're left with in the end is a big "?" because you couldn't get half of what the person said.
I'd guess so.
Or maybe they compiled a corpus of bullshit text (scientific articles, PR and political texts), created a little statistical model and are using that to check the level of bullshit of your text.
My thinking is you are measuring word count versus commonly used marketing or political jargon count, but that's probably too simple.