Any chance of an overview of the algorithm you're using to filter out the text? ...

slashcom · on Aug 30, 2012

Here are a few things I've gleaned from experimenting with it:

- It uses a unigram language model. You can take the same text, randomly permute the words, and you get the same score. This means it also can't be using things like POS tagging, phrases, etc.

- It normalizes words by making all letters lowercase. The exact same text in all upper case has the same score.

- The score is eventually normalized by the length of the text. The same text copied multiple times gets the same score.

- The score is not a valid probability, as someone has managed to get a 1.16. This makes me believe it's not a Naive Bayes classifier giving you P(Bullshit|Text), which is what I originally thought it would be.
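To make those four observations concrete, here is a minimal sketch of a scorer that would behave the same way. The word weights and the `score` function are made up for illustration; nobody in the thread knows the site's actual table or formula.

```python
import random
import re

# Hypothetical per-word weights; the real tool's table is unknown.
WEIGHTS = {"strategy": 3.0, "synergy": 2.0, "leverage": 1.5}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def score(text):
    """Average weight per word: order-free, case-free, length-normalized."""
    words = tokenize(text)
    if not words:
        return 0.0
    return sum(WEIGHTS.get(w, 0.0) for w in words) / len(words)

text = "Our strategy is to leverage synergy across every team"

# Randomly permute the words to check that order does not matter.
shuffled_words = text.split()
random.shuffle(shuffled_words)
shuffled = " ".join(shuffled_words)

print(score(text), score(shuffled), score(text.upper()), score(text + " " + text))
print(score("strategy strategy"))  # an average of weights is not capped at 1
```

All four values on the first line come out equal, and since the weights are not probabilities, nothing keeps the result below 1.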

icegreentea · on Aug 30, 2012

It is that simple. Looks like they assign a BS level to words, and then take some sort of average bullshit level amongst all words. Word order doesn't matter as slashcom says.

For example, if you take the score from the Oracle Pricing blurb posted by BitMistro, and change 'strategies' to 'goals', you drop down to 0.8 or so. If you add an extra random 'strategy' somewhere, it bumps up to 1.4 or something.

I actually suspect a bug on their part for 'strategies'... probably a decimal error when building the BS level tables.

But similar things happen with other 'bullshitty' words, just to a lesser degree.

slashcom · on Aug 30, 2012

It should also be noted that on about 400 short texts (~300 words each), it did not correlate with the Flesch-Kincaid readability measure at all. So it's not measuring something like average word length or syllable counts.

But it's not QUITE a true lexicon, as it handles Out-Of-Vocabulary words quite strangely. If you use as input text:

"PR-Experts, politicians, ad writers or scientists need to be strong here! BlaBlaMeter unmasks without mercy how much bullshit hides in any text. A useful tool for everyone involved in writing! Simply copy your text into the white field and check your writing style. It works with english text up to 15.000 characters (overhead will be cut off). For a meaningful result we recommend a minimum length of 5 sentences."

Then you get 0.16. If you replace the last word 'sentences' with 'strategy' you go up to 0.44. However, if you change the last word to 'sentstrategyences' you get 0.47. Try it: you can basically insert 'strategy' inside ANY word and really raise your score. Actually, if you just insert "strateg" anywhere inside the text, it goes up massively.

So I actually think it's just doing string search counts over a lexicon.
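A rough sketch of what "string search counts over a lexicon" could look like, assuming hits are weighted and divided by the word count. The fragments, weights, and normalization here are guesses for illustration, not the site's real values.

```python
# Hypothetical lexicon of fragments and weights; the real entries are unknown.
FRAGMENTS = {"strateg": 3.0, "ization": 1.0, "activit": 1.0}

def substring_score(text):
    """Count fragment occurrences anywhere in the text, ignoring word boundaries."""
    lowered = text.lower()
    n_words = len(lowered.split())
    if n_words == 0:
        return 0.0
    hits = sum(weight * lowered.count(frag) for frag, weight in FRAGMENTS.items())
    return hits / n_words

base = "a minimum length of 5 sentences"
swapped = "a minimum length of 5 strategy"
embedded = "a minimum length of 5 sentstrategyences"

for sample in (base, swapped, embedded):
    print(sample, "->", substring_score(sample))
```

Because `str.count` ignores word boundaries, 'sentstrategyences' scores the same as 'strategy' in this sketch. It does not explain why the embedded version scored slightly higher (0.47 vs 0.44) on the real site.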

icegreentea · on Aug 30, 2012

Ah yes, you're right. If you insert random 'izations' into your text, your BS meter goes up as well. It also has a hardon for 'activity'.

_p62c · on Aug 30, 2012

Most unusual...

"Politics are great, come buy our new, brand spanking awesome banana phone, apple, steve jobs, cripplingly epic banana phone. Just great phones, with bananas, no apples to be found here. Samsung can suck on our banana phone. Android is better than iOS."

"Your text: 251 characters, 43 words Bullshit Index :0.03 Your text shows no or marginal indications of 'bullshit'-English."

DallaRosa · on Aug 30, 2012

You wrote a lot of bullshit, but your text looks pretty normal from a vocabulary standpoint. Looking a bit more at the website, you'll see that what it calls bullshit English is the pattern often used in scientific articles or legal texts (and presidential speeches), where they seem to be saying a lot of really wow stuff but all you're really left with in the end is a big "?" because you couldn't get half of what the person said.

DallaRosa · on Aug 30, 2012

I'd guess so. Or maybe they compiled a corpus of bullshit text (scientific articles, PR and political texts), created a little statistical model and are using that to check the level of bullshit of your text.
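If that guess is right, the word weights could come from something like smoothed log-odds between a "bullshit" corpus and a plain one. This is only a sketch of that idea: the two one-sentence corpora are toy placeholders and `log_odds_weights` is a made-up helper, not anything the site is known to use.

```python
from collections import Counter
import math
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def log_odds_weights(bs_docs, plain_docs, smoothing=1.0):
    """Weight each word by how much more likely it is in the BS corpus."""
    bs = Counter(w for d in bs_docs for w in tokenize(d))
    plain = Counter(w for d in plain_docs for w in tokenize(d))
    vocab = set(bs) | set(plain)
    # Add-one style smoothing so unseen words do not produce log(0).
    bs_total = sum(bs.values()) + smoothing * len(vocab)
    plain_total = sum(plain.values()) + smoothing * len(vocab)
    return {
        w: math.log((bs[w] + smoothing) / bs_total)
        - math.log((plain[w] + smoothing) / plain_total)
        for w in vocab
    }

# Toy placeholder corpora, purely for illustration.
bs_docs = ["we leverage strategic synergies to optimize stakeholder value"]
plain_docs = ["we fixed the bug and shipped the release on friday"]

weights = log_odds_weights(bs_docs, plain_docs)
for word in ("leverage", "bug", "the"):
    print(word, round(weights[word], 3))
```

Words seen only in the BS corpus get positive weights and words seen only in the plain corpus get negative ones. A table built this way is not bounded to [0, 1], which would at least be consistent with scores like 1.16.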