The Python textcluster Package

Earlier I wrote about finding the most common Firefox issues. I wanted to automate that process so I could keep finding these issues continually, but unfortunately I never had the time.

When Firefox Input was announced, I thought about trying this again, this time with Firefox Input data. Then I went on paternity leave and the time kind of crept away. I mentioned the idea this week, though, and it piqued some interest.

So I found myself with a bit of time to work on it. The first stage was releasing a Python library called textcluster.

textcluster takes the work I did earlier and makes it a bit more general-purpose. The idea is that I can do something like this:

from textcluster import Corpus

docs = (
    'Every good boy does fine.',
    'Every good girl does well.',
    'Cats eat rats.',
    "Rats don't sleep.",
)

# Add each document to the corpus, then group the similar ones.
c = Corpus()
for doc in docs:
    c.add(doc)

print(c.cluster())

Which results in:

[
    (
        "Rats don't sleep.",
        {'Cats eat rats.': 0.21353467285253394}
    ),
    (
        'Every good girl does well.',
        {'Every good boy does fine.': 0.32030200927880093}
    )
]

Each result pairs a document with a dict of the documents it was grouped with. The number is the “similarity” between the two strings, scored relative to the entire document corpus.
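To give a feel for what "similarity relative to the corpus" can mean, here is a rough sketch using TF-IDF weights and cosine similarity, which is one common way to score this kind of thing. This is not textcluster's actual code: the function names (tokenize, tfidf_vectors, cosine) are made up for illustration, there is no stemming or stop-word handling, and the scores it produces will not match the numbers above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical helper: lowercase the text and keep runs of letters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(docs):
    # Weight each term by how often it appears in a document and how rare
    # it is across the whole corpus (terms found in every document get 0).
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            dict((t, count * math.log(float(n) / df[t])) for t, count in tf.items())
        )
    return vectors


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = (
    'Every good boy does fine.',
    'Every good girl does well.',
    'Cats eat rats.',
    "Rats don't sleep.",
)

vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # the two "Every good ..." sentences
print(cosine(vectors[2], vectors[3]))  # the two sentences about rats
print(cosine(vectors[0], vectors[2]))  # no shared words
```

Because the weights depend on how many documents contain each word, adding or removing documents changes the scores for every pair. That is the sense in which a score is relative to the whole corpus.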

My next trick is to see if I can run this memory-intensive calculation over a dataset of 25,000 submitted opinions. If I can, we can get some interesting data about what people think of the new Firefox beta.