Metafilter Frequency Tables: 13 years, 636 million words

Today I’ve updated a project I first launched a couple of years ago: Metafilter Frequency Tables, a collection of tables calculated from the last 13 years of comments made by users on Metafilter.com.

The tables break down word frequency for the site as a whole as well as by several major subsites, and for all time as well as by year, month, and day. Each table includes raw count and parts-per-million data for each word. It’s all generated by some Perl I wrote that fetches comments from our database and tabulates them into these text files. Read about the methodology here.
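For anyone wondering what "raw count and parts-per-million" amounts to in practice, here is a rough sketch in Python. It is not the Perl that builds the real tables: the function name `frequency_table` and the tokenizing regex are assumptions made up for illustration, and the actual methodology may split words quite differently.

```python
import re
from collections import Counter


def frequency_table(comments):
    """Return {word: (raw_count, parts_per_million)} for an iterable of comment strings."""
    counts = Counter()
    for text in comments:
        # Assumed tokenization: lowercase runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))

    total = sum(counts.values())
    if total == 0:
        return {}

    # Parts per million: occurrences of the word per million words counted.
    return {word: (n, n * 1_000_000 / total) for word, n in counts.items()}
```

Parts-per-million is what makes tables of different sizes comparable: a word's raw count in a single day and in all 13 years will differ wildly, but its ppm figures can be set side by side.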

This is far from the most carefully constructed set of such tables out there (I am a hobbyist, not a trained linguist, and this whole effort is very much DIY), but it’s the largest I’m aware of focused specifically on this sort of internet-mediated casual textual conversation over the last decade-plus, and I’m hoping it will be of some use or interest to word nerds.

Hello from SXSW. I am doing a thing.

So I’m at South by Southwest, which is neat, and I’m doing a jokey presentation thing with a bunch of other people who are way bigger names than me, which is nuts and awesome.

And so a few people may drop by the blog for the first time today. In which case, hi, people! I’m Josh, I work at Metafilter, I’m a musician, and I like making stuff, and here are some things I’ve made:

ThinkStank is a tumblr I started after SXSW last year. It’s full of ideas I have that I kind of like or kind of hate but mostly don’t have the time or the will to flesh out in any case. Some of the things that have come out of that after all are Useless Fliers (fliers that don’t do anything), Nine Inch Niles (NIN tracks reconstituted entirely from snips of sound from the sitcom Frasier), and silly art things like Ennuigi and Beep Calm and Tenori On.

The Baseball Card Hall of Fame, a post-a-day collection of awards for unorthodox baseball card achievement.

I’m a recording musician: I’ve got lots of mp3s and sporadic blog posts about new music and music experiments on music.joshmillard.com. Last year, I wrote and recorded an album from scratch in the month of February; it’s called Inchoatery, and if you’re interested in this sort of thing you can read a post-month breakdown of the process here.

If you’re curious about what random thing I’ve done in any given week, follow me on Twitter: @joshmillard. Thanks for dropping by!

Language Log 2.0

After one too many rounds of server headaches, the folks over at Language Log have finally jumped to both new hardware and a new CMS, with date-forward content living at http://languagelog.ldc.upenn.edu/nll on a shiny WordPress install.

And while it’s certainly coincidence, I can’t help but note that they’re (for the moment, at least) using the same theme that my recent redesign was built around.  I don’t care if it’s meaningless: I elect to be flattered and there’s not a damned thing anyone can do about it.

I’m also pleased to see that Melvyn Quince is posting again.  Give ’em what for, Doc.

keep new york new york

A quick ad-copy collocation, noticed via this Freakonomics post: the tagline for an NYC campaign for “congestion pricing” (the imposition of peak-hour traffic fees) is this wonderful little five-worder:

“Keep New York New York”.

Considered as a bare string, it doesn’t even look like a sentence to me, and yet it’s perfectly comprehensible (and, in context, not even ambiguous: I doubt anyone will see the sign and think, “yes, yes, congestion pricing, but what does that have to do with Broadway showtunes and who is trying to get rid of them anyway?”).

In Portland (where there’s been, in fact, talk of congestion pricing for some bridges serving the downtown area), there’s a related phrase that’s been around for I don’t know how long: Keep Portland Weird.  As far as that goes, I suppose “Keep Portland Portland” would work too, but there’s something a bit more daring (and divisive, I guess, for folks in the anti-weird camp) about a specific named quality.

furthering furth

Interesting bit from Geoff Pullum at Language Log today, on his discovery of the Scottish preposition “furth”, as in “furth of Glasgow” with the meaning “outside of” or “away from” Glasgow.

He talks a little about the place that the familiar “forth” holds in modern English usage, including a rundown of occurrences of the word in the WSJ archives, but a couple things struck me:

1. Where does “further” come into the history of “furth”?  If one can be furth of Glasgow, could one be (or at least have been, at some muddy point in Scots linguistic history) further of Glasgow?

[Update: while it doesn't actually dig into an answer to the question, I have to admit I missed outright this late line from Geoff on the subject of further: "(Note, though, that as Jim Smith points out to me, all modern English dialects have preserved the comparative and superlative forms further and furthest.)"]

2. “Go forth” is listed in Pullum’s corpus rundown as an afterthought, mixed in with “hiss forth” and “tumble forth” and a couple dozen other relatively uncommon also-rans.  (Set, bring, and put forth are the three big hitters, along with the fixed expression “back and forth”.)

Go forth?  Not common?  How could that be?  It’s easily the first phrase I think of when thinking of “forth”.  What gives? 

Ah, but I grew up in Catholic churches; and how often did I hear God or some prophet or metatron suggest (or some narrative simply record) that this or that figure should or did go forth — to multiply, to war, to Galilee, to sin no more.   And regardless of the texts in the readings or the homily, every Sunday had a guaranteed coda: “Go forth, to love and serve the Lord.”

So I suppose that as long as the WSJ isn’t quoting biblical passages or Catholic mass tropes, “go forth” as a rarity might be perfectly expected after all.  Oh well.

(It probably doesn’t help my case that the second thing I thought of, after “go forth”, was the programming language Forth.  I am not a reliable barometer of baseline forthiness, I don’t think.)

Melvyn Quince, linguistic pinch-hitter.

I’m a big fan of Language Log; I believe it’s one of the finest upenn-hosted collaborative linguistics weblogs Mark Liberman has ever given an alliterative name to.  And the folks (and — let’s not be sexist — folksette) over there by and large do a pretty fine job of talking about the ongoing collapse of the English language as we know it.  Plus, they post a lot of fascinating BBC science news, which saves me some web-surfing.

But this new fellow, this Melvyn Quince, is something else.  For as hard as the rest of that lot have tried, over the years, they’re looking a little old hat, a little out-worded if you will, by Quince, whose first two postings display already the kind of verve and insight and flat-out linguifictive integrity that you just can’t fake. 

And though he’s a newcomer to the Log, an examination of his credentials (e.g. via a google search) will, I think, both underscore his bona fides and suggest, to the discerning reader, the estimable value of his future contributions to this proud, arguably somewhat science-esque discipline.

Simpsons comma linguistics humor found in The

Psychiatrist: Robert was a peaceful boy, sickly and weak from a congenital heart defect. [He shows a picture of SB going to his prom in bed. The jury goes "Awwww!"] But then that Simpson boy started tormenting him, and he crossed over into dementia!

Sideshow Bob (defending himself): To what degree was this dementia blown?

Psychiatrist: Full! [Jury gasps.]

Heidi Harley’s yearly Simpsons ling-joke roundup now available in an exciting 2008 model.

Hex, leet, and constrained alphabets – for me3dia

So Andrew Huff had this silly and wonderful idea: displaying six-character words as hexadecimal strings, and interpreting those strings as hex-style color codes (the #006A3f type strings used notably for declaring colors in HTML code). There are some nice graphics with a few rounds of ideas at the post above; and someone has even thrown together a live application, “L33t text in c0l0r“, to let folks play with the idea in realtime.
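To make the word-as-color idea concrete, here is a small Python sketch of the reading step: split a six-character hex string into three two-digit pairs and treat them as red, green, and blue. The function name `word_to_rgb` is hypothetical, and this is just one plausible way to do it, not the code behind the live application.

```python
import re

# Exactly six hex digits, in either case, as in a #RRGGBB color code.
HEX_WORD = re.compile(r"[0-9a-fA-F]{6}")


def word_to_rgb(word):
    """Return (red, green, blue) for a six-character hex-only string, else None."""
    if not HEX_WORD.fullmatch(word):
        return None
    # Each two-character slice is one color channel in base 16.
    return tuple(int(word[i:i + 2], 16) for i in (0, 2, 4))
```

So a word like "facade" reads as fa, ca, de, which is a pale pinkish color, while anything containing a letter past f simply isn't a color at all.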

So what’s missing? A vocabulary list! And that’s just what I’ve put together over the morning cup of tea. Four lists, actually, available here in order of strictness (and, by association, readability):

alphabetic – only letters a through f allowed
leet – some numeral-for-letter substitutions
leeter – more aggressive substitution
leetcore – identical list to ‘leeter’, but with gratuitous substitutions for A, B and E, too.
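The four tiers above can be sketched as a filter over a word list, shown here in Python. Everything named in it is an assumption for illustration: the substitution tables (`LEET`, `LEETER`, `LEETCORE`) are guesses at what each tier might allow, not the mappings actually used to build the lists, and `encode` and `hex_words` are made-up helper names.

```python
# Hypothetical substitution tables; the real lists may use different mappings.
LEET = {"o": "0", "i": "1", "s": "5"}
LEETER = {**LEET, "l": "1", "t": "7", "g": "9", "z": "2"}
# Same vocabulary as LEETER, plus needless swaps for letters that are already hex.
LEETCORE = {**LEETER, "a": "4", "b": "8", "e": "3"}

HEX_DIGITS = set("0123456789abcdef")


def encode(word, subs):
    """Apply the substitutions; return the result, or None if non-hex characters remain."""
    out = "".join(subs.get(ch, ch) for ch in word.lower())
    return out if set(out) <= HEX_DIGITS else None


def hex_words(words, subs=None, length=6):
    """Map each word of the given length to its hex spelling, skipping words that don't fit."""
    subs = subs or {}
    result = {}
    for word in words:
        if len(word) != length:
            continue
        encoded = encode(word, subs)
        if encoded is not None:
            result[word] = encoded
    return result
```

With no substitutions at all this gives the strict alphabetic list; passing a looser table admits more words at the cost of readability, which is the trade-off the ordering above describes.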

More details about this — generation method, thoughts — after the jump.
