Wanna help edit Calvin and Hobbes transcripts?

I talked a bit in my Calvin and Markov post this morning about the C&H transcripts I’m using to power the Markov chain process, and about how much work it’s likely to be to edit the whole thing. Probably 25-30 hours total.

It’s not difficult stuff—and in fact it’s not a terrible excuse to reread some C&H as you go, if you want to go less than warp-speed—so if you’re interested in helping out, drop me a line in the comments here or via email or @joshmillard on twitter and I can hook you up with a specific chunk of transcript so we avoid any accidental duplication of effort.

I’ll talk a little more about the approach I’ve been taking to it, to give a clear idea of the small handful of details involved in the markup process. As I said in the other post, the situation is that transcripts of every strip do already exist, which is great, but they aren’t broken down by character or panel at all. It’s just one string of words run together per strip.

So the project is to, for e.g. this Christmas Day strip from 1986:

C&H 19861225

Turn each line in the existing transcript, like this:

19861225 Psst! Are you awake? Is it Christmas? It is! It is! Let's go wake Mom and Dad and open all our loot! Since it's Christmas maybe we should let them sleep a little. That's long enough! Wake up! Wake up! It's Christmas!! Quarter to 6. He let us sleep in this year.

Into a set of lines like this:

19861225
C: Psst! Are you awake?
H: Is it Christmas? It is! It is!
C: Let's go wake Mom and Dad and open all our loot!
H: Since it's Christmas maybe we should let them sleep a little.
C: That's long enough! Wake up! Wake up! It's Christmas!!
D: Quarter to 6. He let us sleep in this year.

That’s the whole idea in a nutshell: each line labeled with character code, colon, space; a new line for each balloon.

But here’s a few details worth noting:

1. Character labels

Each character has a specific short label — as seen above, C for Calvin, H for Hobbes, D for Calvin’s dad. I’ve assigned labels to the most common ten or so characters I’ve come across in the first 14 months of the strip, as below:

C = Calvin
H = Hobbes
D = Dad
M = Mom
T = Teacher, Miss Wormwood
P = Principal
SD = Susie Derkins
SS = Spaceman Spiff, when Calvin is pictured as or narrating as/about Spiff
MOE = Moe, the school bully
ROS = Rosalyn, the babysitter
MON = Monsters/creatures of various sorts
MISC = miscellaneous/offscreen character dialogue
SFX = non-spoken sound effects words e.g. CRASH, WHAP

Most of those are self-explanatory. The spots where I’m making more of a judgement call:

– the use of MON to represent various imaginary speakers — I’ve included actual monsters under the bed, alien dialogue from Spiff sequences, and Calvin’s own speaking-as-imaginary-monster-form where he’s shown in panel as a monster rather than himself. Arguably these could be broken out more finely; if you want to propose or run with a more detailed rubric, just note what you’ve come up with and I’ll look at incorporating it formally.

– the use of MISC for non-major but clearly identifiable characters. In some cases dialogue is truly off-screen, e.g. TV show dialogue when Calvin’s watching TV, but some cases it’s more clearly coming from a detailed if unnamed character. There would be no harm in consistently labeling individual characters instead of lumping them under this catch-all, so if you want to e.g. label the family doctor as DOC: or by his name if you happen to know it, that’s fine too. Just make a note of it when you send your work back.

– SFX is meant for non-spoken environmental sounds; generally the distinction between this and stylized speech/shouts/utterances (AAAAHHH!, WAARG!, Z) is pretty clear, but use your best judgement if it’s a weird case.
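If you’d like to sanity-check a chunk before sending it back, something like the following Python sketch would do it. This isn’t part of the project’s Perl code; the function name `check_transcript` and the assumption that every strip header is a bare eight-digit date are mine, for illustration only. It flags any line that is neither a date nor a `LABEL: text` line using one of the labels listed above.

```python
import re

# Labels from the list above; add any new ones (e.g. DOC) that you introduce.
KNOWN_LABELS = {"C", "H", "D", "M", "T", "P", "SD", "SS",
                "MOE", "ROS", "MON", "MISC", "SFX"}

DATE_LINE = re.compile(r"^\d{8}$")                  # e.g. 19861225 (assumed format)
DIALOGUE_LINE = re.compile(r"^([A-Z]+): (\S.*)$")   # label, colon, space, text


def check_transcript(text):
    """Return a list of (line_number, message) problems found in marked-up text."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.rstrip()
        if not line or DATE_LINE.match(line):
            continue  # blank lines and strip dates are fine
        match = DIALOGUE_LINE.match(line)
        if not match:
            problems.append((number, "not a date or a 'LABEL: text' line"))
        elif match.group(1) not in KNOWN_LABELS:
            problems.append((number, "unknown label " + match.group(1)))
    return problems
```

An empty list back means every line parsed; an original, un-marked-up transcript line (date followed by run-together dialogue) gets flagged, which is a handy way to spot strips you skipped.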

2. Different panel/balloon, different line.

In cases where the same character speaks in multiple balloons or panels in a row, I’ve been giving a separate line to each instance. This doesn’t affect my Markov project’s performance, but it feels like a cleaner representation of the structure of the strip’s flow to me.

3. Doubling up shared dialogue

Occasionally, two characters (usually Calvin and Hobbes) will be shown shouting something simultaneously, either with twin word-balloon tails or by implication with more stylized non-balloon text. I’ve preferred to duplicate those lines, giving a separate copy to each character on two consecutive lines. The original transcript generally doesn’t duplicate the text in these contexts, so you’ll need to re-type or copy and paste.

4. Capitalization across panel/balloon breaks

The original transcript often declines to capitalize new panel/balloon starts when continuing dialogue across a break/ellipsis; I prefer to start new lines capitalized for consistency, and so would recommend capitalizing in these situations for the next line, e.g.

Sure, I think it's ... wait a minute.

becomes

H: Sure, I think it's ...
H: Wait a minute.

5. Spelling/transcription errors

The original transcripts are pretty good, but (not surprising given how much of a slog it must have been to type them out) there are occasional typos and a couple recurring issues worth fixing opportunistically if you notice them as you work.

Aside from occasional literal spelling/typing errors, the most common thing I’ve been fixing is the inappropriate use of periods where commas should go and vice versa (and corresponding changes to the following capitalization), and the outright omission of commas. Fix ‘em if you see ‘em.

Calvin and Markov

I’ve spent the last few days building a random generator internet toy called Calvin and Markov. It generates random new weird variations on Bill Watterson’s classic, wonderful comic strip, Calvin and Hobbes, using a Markov chain process and a few hundred lines of Perl code.

It’s a fun, odd machine to just play around with, but if you’re interested in how it works, I’ll detail that below, and include some thoughts on C&H itself and why I built this thing.

(Update: if you’d like to help get the rest of the C&H transcripts neatened up, see here.)

calkov 1

People who know me won’t be surprised that I’m messing around with Markov chains; it’s one of my favorite little intersections of math and linguistic/artistic weirdness, a fairly simple way of analyzing the frequencies of events (like the order in which words appear in a bunch of written text) in order to produce new, novel, semi-coherent output. I’ve built a lot of little Markov-related things over the years.

And some folks who don’t know me at all may find this specific Markov-plus-comics idea familiar anyway because of an old project of mine, off which this new one is based: Garkov, a random Garfield strip generator I wrote several years ago. Calvin and Markov is the result of fits-and-starts cleanup work I’ve done on that original code over the years; at its heart it does just what Garkov does, but does it in a somewhat less stupid way on a couple fronts.

Okay, but how?

calkov 3

There’s a few moving parts to C&M:

1. The Markov chain code

This is a custom implementation of a Markov chain process that I wrote in Perl a few years back because (a) I’m finicky about how my Markoving works and (b) it seemed like a fun thing to write. It’s a content-neutral set of functions — nothing about it is specific to Calvin, or to comic strips. It’s just a bunch of code that will digest an arbitrary collection of text and then burp out new weird sentences when you ask it to.

I’ve made a few small improvements to this code as I’ve revisited it the last few days, but it was feature-complete already.
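For the curious, here is a minimal sketch of the general technique, in Python rather than Perl. It is not the code behind Calvin and Markov; the names `build_table` and `generate`, the two-word state size, and the sentinel tokens are all assumptions made for illustration.

```python
import random
from collections import defaultdict

START, END = "<START>", "<END>"  # sentinel tokens marking line boundaries


def build_table(lines, order=2):
    """Map each run of `order` words to the list of words seen following it."""
    table = defaultdict(list)
    for line in lines:
        words = [START] * order + line.split() + [END]
        for i in range(len(words) - order):
            table[tuple(words[i:i + order])].append(words[i + order])
    return table


def generate(table, order=2, max_words=30, rng=random):
    """Walk the table from the start state, picking a random successor each step."""
    state = (START,) * order
    output = []
    while len(output) < max_words:
        successors = table.get(state)
        if not successors:      # empty corpus or unknown state: stop quietly
            break
        next_word = rng.choice(successors)
        if next_word == END:
            break
        output.append(next_word)
        state = state[1:] + (next_word,)
    return " ".join(output)
```

Because successors are stored with repeats, a word that follows a given pair more often in the input is proportionally more likely to be picked, which is the frequency analysis described above. Nothing here knows about comics; it just digests whatever lines it is handed.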

2. The comic strip art

There’s lots of places to find C&H strips on the internet, both on official comics-hosting sites and elsewhere. I found an archive of the series run sitting around somewhere or other, with all the daily strips rendered to 600 pixels wide, and have been using that.

I’ve selected a few dozen strips that I like, featuring characters with enough dialogue (except maybe Moe, he doesn’t talk much) that they have a variety of things to say, and blanked out the original dialogue in the art (and in some cases tweaked the word balloons to be a little more accommodating to my text insertion process).

It takes a minute or two to blank and neaten up each strip for this step, but there’s some additional work after that, setting up a strip definition file (see below) that adds a few more minutes to the process. Adding more strips to the project is doable in a piece-by-piece fashion and I’ll likely continue adding strip templates in the future to up the variety some more.

3. The dialogue from the comics

Calvin and Hobbes ran daily for about ten years, which, accounting for a couple of sabbaticals by Watterson, means there were on the order of 3,000 original strips published. That’s a lot of text to work with, which is great in theory, but newspaper comics didn’t come with convenient plaintext transcripts. I went into this project knowing I might have to do a lot of typing just to have material to feed to the Markov process. That’s what I did for Garkov, which is why the input corpus for Garkov is based on a few hundred strips rather than a much more significant chunk of the strip’s archives.

Luckily for me, someone, years ago, already decided to tackle this, transcribing with reasonably good accuracy the entire C&H archive, strip by strip. (This is apparently what powers Mike Yingling’s C&H search engine.)

Unluckily for me, they did so by treating each comic strip as a single, run-all-together line of dialogue text without any character breakouts. Which is fine if you’re just trying to search for a strip — if the word “Krakow” appears anywhere in the strip, that strip will be a match — but it’s a problem if you’re doing something character-specific like this project. I want Calvin to say Calvin stuff, Hobbes to say Hobbes stuff, and so on for Dad, Mom, Susie Derkins, etc.

And so I escaped the need to transcribe, but have still had to do a bunch of markup on that original transcription work, for each strip turning a line like this:

19861225 Psst! Are you awake? Is it Christmas? It is! It is! Let's go wake Mom and Dad and open all our loot! Since it's Christmas maybe we should let them sleep a little. That's long enough! Wake up! Wake up! It's Christmas!! Quarter to 6. He let us sleep in this year.

Into a set of lines like this:

19861225
C: Psst! Are you awake?
H: Is it Christmas? It is! It is!
C: Let's go wake Mom and Dad and open all our loot!
H: Since it's Christmas maybe we should let them sleep a little.
C: That's long enough! Wake up! Wake up! It's Christmas!!
D: Quarter to 6. He let us sleep in this year.

That lets me break out each character into a separate collection of lines, and create individual Markov table “brains” for use in strip generation. I also wrote a small script to do that sorting-into-separate-files bit so I don’t have to do the sorting and copying and pasting manually.
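A rough idea of what that kind of sorting script does, sketched in Python (the real one is not shown here, so the function names and the one-file-per-label layout are assumptions):

```python
from collections import defaultdict
from pathlib import Path


def lines_by_character(marked_up_text):
    """Group dialogue by label, dropping date lines and anything unlabeled."""
    groups = defaultdict(list)
    for line in marked_up_text.splitlines():
        label, sep, dialogue = line.partition(": ")
        # A label is a short all-caps code such as C, SD, or MISC.
        if sep and label.isalpha() and label.isupper():
            groups[label].append(dialogue)
    return dict(groups)


def write_character_files(groups, directory):
    """Write one text file per label (e.g. C.txt), one balloon per line."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    for label, lines in groups.items():
        path = out_dir / (label + ".txt")
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Each resulting file can then be fed to the Markov code on its own to build that character’s table.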

So far I’ve marked up the 1985 and 1986 strips, or about 14 months total. That adds up to about 1300 separate lines for Calvin (I peeked, it’s 1,268 lines containing 10,832 words total), a few hundred for Hobbes, 100+ each for Calvin’s mom and dad, and on the order of dozens for the other recurring characters. In a Markov project like this, more input text is generally better (you get more varied, weird, unexpected results), and so marking up more of the transcripts would be a good long-term goal, but it’s tedious and time-consuming; at a good clip, I can do a month’s worth of strips in about 15 minutes, but with what’s done so far that leaves something like 25-30 hours of additional work just to mark up the remaining bulk of transcript. I may keep chipping at it myself; I may try crowdsourcing some of that work if folks are interested in helping.
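The back-of-the-envelope math behind that estimate, as a small Python sketch. It assumes a flat 120 months for “about ten years” and does not subtract the sabbaticals, so it is a ballpark rather than an exact count.

```python
TOTAL_MONTHS = 10 * 12    # "about ten years"; sabbaticals not subtracted (assumption)
MONTHS_DONE = 14          # the 1985 and 1986 strips
MINUTES_PER_MONTH = 15    # markup speed "at a good clip"

remaining_months = TOTAL_MONTHS - MONTHS_DONE
remaining_hours = remaining_months * MINUTES_PER_MONTH / 60
print(remaining_months, remaining_hours)  # 106 26.5
```

Sabbatical gaps pull the figure down a bit and any pace slower than a good clip pushes it back up, which is why the estimate is a range rather than a single number.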

4. Strip definitions

Blanking out the original dialogue in word balloons gives me a canvas to work with, but my code still needs to know where to actually try and put new words. I ginned up a very simple definition file, describing a set of rectangular areas into which my code would paint dialogue; on each run, the program selects a strip template at random, reads in the actual image file (e.g. strip_02.gif) and then the definition file (strip_02.def), and uses both to do its work.

For example, this strip:

strip_02 template

…has the following definition file, listing on the first line which characters appear in the strip (so the code can be appropriately parsimonious and not bother loading Markov data for any other characters) and then on each subsequent line a character who speaks (so the code knows whose “brain” to pull from) and the geometry of the rectangle representing their word balloon:

calvin dad
calvin 98 8 90 4
dad 183 7 45 3
calvin 252 25 90 5
calvin 375 10 140 5
dad 525 10 140 3
calvin 550 55 100 5

Those numbers represent, respectively, the x-position in pixels of the center of the word balloon, the y-position of the top, the width of the balloon in pixels, and the maximum number of lines of dialogue that can appear. That’s a bit of a hacky mess of a format, but it works well enough for these purposes and is reasonably simple for me to generate by hand; if I clean up and generalize the code in the future, I’ll revisit it a bit. (Likely plan: make the x and y coordinates both center values for the middle of the target rectangle, and replace the max-number-of-lines value with a raw pixel distance with the script doing the math at runtime to figure out how many lines that can accommodate.)
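To make the format concrete, here is a minimal sketch of a reader for it in Python. The project itself does this in Perl; the `Balloon` class, the `parse_definition` name, and the derived `x_left` value are hypothetical conveniences, not part of the actual code.

```python
from dataclasses import dataclass


@dataclass
class Balloon:
    character: str
    x_center: int   # x-position of the balloon's center, in pixels
    y_top: int      # y-position of the balloon's top edge, in pixels
    width: int      # balloon width, in pixels
    max_lines: int  # most lines of dialogue that can appear

    @property
    def x_left(self):
        """Left edge of the rectangle, derived from center and width."""
        return self.x_center - self.width // 2


def parse_definition(text):
    """Return (cast, balloons) from the contents of a .def file."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    cast = rows[0]  # first line: every character appearing in the strip
    balloons = []
    for row in rows[1:]:
        name, *numbers = row
        if name not in cast or len(numbers) != 4:
            raise ValueError("bad balloon line: " + " ".join(row))
        balloons.append(Balloon(name, *map(int, numbers)))
    return cast, balloons
```

Run against the strip_02 definition above, this yields a cast of `calvin` and `dad` and six balloons in reading order.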

Generating a strip definition involves some eyeballing with a cursor tool in Photoshop Elements, and then a little bit of nudging (a few pixels left or right or up or down) when I look at the text being populated by the program in practice. A tool that actually translated drawn rectangles into numbers would speed up the process, but I didn’t feel like spending time trying to build that when I could just rough out some definitions by hand to get this initial pass working. Something for the future.

5. Lettering

The lettering in a comic strip (the actual text as written in the panels) is a key part of its visual identity, if not always the most obvious one. Garfield has a clean, very even balloon-ish lettering, like Comic Sans’ sober cousin; Mark Trail has a sturdy feel with bold verticals canted to the right; Zippy the Pinhead has a loose, scrawled-in-a-notebook lilt that matches its surreal tone.

comic-lettering-examples

Change the lettering, and a comic just won’t look like itself.

And Bill Watterson’s lettering is as central to the look of Calvin and Hobbes as almost any comic I can think of; his narrow, unstuffy left-leaning capitals are just how C&H looks, an integral part of not just the meaning but the feel of the strip, even before you take into account those panels where he gets particularly expressive with larger text, alternate lettering styles, sound effects and shouted words and so on.

Truly replicating expressive hand-lettered text automatically is a difficult job and well out of scope of a project like this, but I was happy to see that someone had created a basic font based on Watterson’s lettering, and shared it on Deviant Art. I’ve used that for the text of the strip, to reasonably good effect; my auto-generated text is mechanical and inorganic compared to the real deal, and cramped and often inelegantly placed within balloons on account of the lack of artistic vision on my Markov code’s part, but it at least doesn’t immediately scream NOT BILL’S LETTERING the way subbing in some generic “comic” typeface would have.

For the title of the strip, I was able to find another font based on the familiar sticks-and-angles style of Calvin’s own handwriting that folks most likely associate with the strip.

6. ImageMagick

Aside from the Markov code responsible for the text generation, my Perl script does some basic stuff to wrangle all the rest of the above, most of which isn’t really interesting. But the actual pasting together of an image is a pretty key bit, and that’s something I didn’t want to have to figure out from scratch.

To that end, I used a software library called ImageMagick that handles image generation and manipulation, and which does the work of turning the strip templates and lettering font and generated text into a final, rendered image. (This is a big improvement over Garkov, which renders a strip as a bunch of individual CSS-positioned single-letter images on top of a blanked-out strip, which makes keeping or sharing an image unnecessarily difficult.)

ImageMagick is fast, reasonably powerful, and a confusing goddamn mess to work with. I heartily recommend it and suggest you stay the hell away. It’s that sort of library.

I’ve found myself reaching for other tools in the last few years; if I were starting on this project from scratch today, Perl and ImageMagick is probably not where I’d aim. But having the Garkov codebase to work with was such a head start on this that coming back to those and doing a little bit of angry wrestling was worth the pain.
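Whatever library does the drawing, the generated text first has to be broken into lines that fit a balloon’s width and line limit from the strip definition. A rough sketch of that step in Python, using only the standard library: the average glyph width here is a made-up number (real code would measure text with the actual font), and `fit_to_balloon` is a hypothetical name, not a function from the project.

```python
import textwrap

AVG_CHAR_WIDTH_PX = 7  # assumed average glyph width; depends on font and size


def fit_to_balloon(text, width_px, max_lines, char_width_px=AVG_CHAR_WIDTH_PX):
    """Wrap text to the balloon width; return None if it needs too many lines."""
    chars_per_line = max(1, width_px // char_width_px)
    lines = textwrap.wrap(text, width=chars_per_line)
    return lines if len(lines) <= max_lines else None
```

Returning `None` for text that won’t fit lets the caller simply ask the Markov code for another sentence and try again.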

Okay, but why?

calkov 2

How is easy; I lowered a shoulder and nerded on through it and here we are. Why is trickier. There’s a few different kinds of why.

Why bother?

Because it was (a very specific kind of) fun putting this together, revisiting and improving my Garkov code, getting the whole thing working as well as I could. I really liked making Garkov back in the day, but I also burnt out on it around the time I put it out in the world, partly just because of all the tricky bits I had to sort through and try (and in some cases fail) to find solutions to in order to get it up and running.

Giving this idea another shot years later with everything I’d learned the first time driving me forward more quickly has been satisfying. I may take this newer, cleaner approach and apply it to a Garkov 2.0; I may throw it at some other comics; I may try to get it into other folks’ hands so ten thousand Markoving comic resynthesis projects can bloom.

Why Calvin and Hobbes?

Because I like C&H. Because other folks like it. Because it’s recognizable, and familiar, and the familiarity of the original lends a kind of weird suspension of disbelief to the broken, altered output of this kind of transmogrification process—if a thing looks enough like the real thing, we try to treat it like the real thing a little longer, give it the benefit of the doubt even as we know we should be doubting it. No one will really be fooled by Calvin and Markov, but all of us who’ve read thousands of the originals are wired to sort of give it credit long enough to produce a double-take, which is great.

Why treat C&H so weirdly?

One thing I’ve thought about while working on this (and I’ve heard it from at least one friend I showed the work in progress to as well) is how different Calvin and Hobbes is, as a cultural property, from the previous choice of Garfield. They’re both totemic, instantly recognizable comic strips, but that’s about the end of the similarities; C&H is loved for its doting, dynamic inkwork and artful writing and characterization, while Garfield is generally derided for its predictability, minimalist and samey art, and overall cash-in, sell-out, factory-produced sterility.

And so building Garkov, a machine that swallows up Garfield strips and spits out something dada and broken and absurd, seems like sort of a gimme. Of course people should fuck with Garfield. What else is it good for? I hate Mondays. Etc. It is, however justly or not, an easy target. (And I was not by far the first or the last to futz around with Garfield as a template for recontextualized weirdness; see the links at the bottom of the Garkov page for many others.)

Whereas C&H is a strip people hold up high as more or less the zenith of the modern newspaper comic strip, a piece of work that was so consistently beautiful and smart and heartfelt and uncompromising that nothing on the page could compete with it during its ten-year run, and nothing has been able to replace it in the years since. C&H was funny, but it wasn’t a joke; as mainstream pop cultural artifacts go, it’s pretty unfuckwithable.

You mess with Garfield, no one says How Dare You. Calvin and Hobbes, though…

So I’ve wondered as I built this how people would feel about it. Not so much that I expect condemnation (weird for weird’s sake gets by okay on the internet, and I doubt anyone will get the mistaken impression that I mean any harm here), but really just how they’ll feel about the oddball output of this, given their likely fonder relationship with the source material than in the case of Garkov.

Take a well-written, well-remembered comic strip and render it incoherent, and…and then what? And why?

I didn’t really know what I was going to get when I started. I wasn’t sure if I was going to get anything, honestly, other than new, bad Calvin and Hobbes strips. But I have seen some stuff in the output as I’ve developed this that I genuinely like.

And mostly what that is is this: Calvin as an actual, deeply weird little kid. Not the apt, smartly-written, nail-it-in-four-panels Calvin of Watterson’s work, the kid who we understand to be a kid despite his tremendous vocabulary and delightful, imaginary-or-is-he tiger friend, but a real scatterbrained oddball, the unvarnished stream-of-consciousness pile of developing brain cells that parents and teachers end up dealing with.

The little kid who changes the subject every five seconds. The little kid who says bizarre, contextless things, not as a punchline to a three-panel setup with a beautifully drawn alligator but just actually genuinely out of nowhere. Here’s a Calvin who confuses us, the readers, just as much as he does the adults around him in Watterson’s final-panel reveals.

Not Calvin the apposite, but Calvin the apropos-of-nothing. A Calvin whose head we don’t get to see inside of, a Calvin who we can’t keep up with.

It’s a neat thing, and a plenty satisfying outcome of these last few days of work. I love when something like this can surprise me.

Mary Worth’s Howl, by Al “Screwball” Ginsberg

So, Lauren LoPrete‘s Peanuts + Smiths Lyrics mashup blog, This Charming Life, has been making the rounds; it ended up on Metafilter yesterday, which led to much riffing on other possible comic/band juxtapositions, and I saw someone mention Mary Worth and joked that it should in fact be:

Mary Worth and excerpts from Howl.

And it was just a dumb one-off joke, since who could be more square and straight and constitutionally terrified of something like unfiltered Ginsberg than good ol’ meddlin’ Mary.

But I couldn’t stop thinking about the idea, and so, here we go for real. Heavily excerpted in seven parts, because a one-shot wasn’t enough.

maryworthshowl-01

maryworthshowl-02

maryworthshowl-03

maryworthshowl-04

maryworthshowl-05

maryworthshowl-06

maryworthshowl-07

I’m a little bit in love with Jerkcity HD

Jerkcity is one of the ur-webcomics, a weird chat-transcript-as-comic habit among some friends that started way back in 1998, facilitated by the nearly-as-weird application Microsoft Comic Chat, a piece of software released in 1996 and featuring the art of beloved weirdo Jim Woodring. You’d type, and cartoon characters would emphatically belch out the things you said into word balloons, lettered of course in MS Comic Sans.

jerkcity-orig-4452

The comic was surreal, profane, incoherent, hilarious, self-aware, self-loathing, etc. It managed through its mannered-nonsense approach to vocabularian freestyling to birth what indirectly became one of Metafilter’s favorite memes, “hurf durf butter eater“. The whole mess is a little hard to explain other than that everybody was probably pretty high at the time.

Jerkcity HD, then, is a brand new blog collecting submissions of reworks of the original Jerkcity strips by anybody and everybody, so long as you abide by a few rules. With the original Jerkcity having been updated daily for 15 years now, the chances of actually catching up with the archives are vanishingly slim, but I don’t figure that’s really the point, so let’s not worry about it. The point, such as it is, is that people are jumping in and the results are lovely and awesome and so on.

I’m digging it, and looking forward to seeing all the weird directions people take these reinterpretations in. I’ve got a bunch of ideas of my own, and have chucked my first submission in their (reportedly a bit crowded already) bin, but here it is in the meantime, reworking the strip shown above:

jerkcityhd-73s

New podcast: We Have Such Films To Show You

So! I’ve started doing a podcast miniseries with my friend from the internet, Yakov “griphus on Metafilter” Grinberg. We’re watching all of the Hellraiser films—there’s nine of them so far—and discussing/reviewing/dissecting/boggling-at them, one movie at a time.

It’s called We Have Such Films To Show You—that’s the blog for the show, there—and you can listen to episodes there or subscribe to the podcast via iTunes or RSS.

We recorded our first episode a couple days ago, going over the original 1987 film (written and directed by Clive Barker, the horror author whose novella The Hellbound Heart was the basis for the film), and it was a fantastic time and two hours just flew by. If you’ve seen the film or are otherwise familiar already with the Hellraiser franchise or Barker’s work, you’re pretty much the target demographic already, but we talk in enough loving, rambling detail about the content of the film that it’s probably plenty listenable even coming at it cold if you enjoy listening to a couple of enthusiastic nerds bullshitting about the pros and cons of 80s horror filmmaking.

We’ll be doing another episode every couple weeks.

Doing things vs. wondering if people will notice

This is a thing I presume lots of folks struggle with, but it’s been on my mind a lot lately:

I spend a lot of time that I could spend doing a thing sitting around peeking at page views and comments and web analytics and tumblr likes and so on, wondering who is noticing a thing I did. And it feels pretty off balance sometimes.

Here’s the givens:

1. I really like making stuff; music, writing, art, humor stuff, weird programming stunts, whatever. Building a thing, doing a thing. I like going from “I bet I could do x” to sitting down and making x happen, or making a solid and realistic plan for how to make x happen; I like turning silly ideas into silly realities. It’s rewarding enough in its own right that when I’m doing it right it doesn’t feel like work even if it takes real effort.

2. But I also like positive feedback. I like to know when people like a thing I made; I like to know when what strikes me as funny or interesting or curious enough to put down on paper (or .jpg or .mp3 or or or) is also curious or interesting or funny to other people. I like that connection, that validation, that sense that I’m not just sitting around in my treehouse by myself. I like it when I see that other people get what I’m after. It’s exciting; it’s gratifying.

And I think both of those are pretty normal and reasonable things, but like I said above, it can get to feeling off-balance. If I’m spending more of my time looking, for the nth time, to see if there’s new thumbs going up on something I posted than I’m spending making the next thing? That feels like a problem. Why am I cycling from one site to the next, dowsing for validation, instead of just getting on with the great big stack of Some Day projects that today could be the day for? Why am I sitting around wondering if people will notice what I did, when the answer shouldn’t really matter and the energy I spend on that could be spent making something new?

I try to be self-aware about the whole thing, but self-awareness of a problem and addressing it effectively aren’t the same thing, so some days I end up sitting around sort of being aware that I’m not using my time the way I’d like to be but just not using it well anyway and feeling sort of wrapped up in that conflict and vibrating uselessly. It’s a frustrating thing.

And then I check twitter and flickr and mlkshk and tumblr and metafilter and facebook and my WordPress blogs and my web analytics again. The growth and decentralization of my pool of places-where-I-put-stuff-I-make-or-talk-to-people over time is probably exacerbating this whole thing a bit, and the idea of recentralizing somewhat (likely by routing a lot more of my output through this lately-pretty-dormant blog) has some appeal but isn’t a panacea. (And, turtles all the way down, I would likely do the work to facelift the blog and rework things and then announce that on twitter and sit around wondering who will notice…)

It’s a tricky thing. I’m working on it. I figure a lot of people are. But some days, some months, it feels like more of a thing than usual, and I guess this is one of those.

Now if you’ll excuse me, I think I’ll go draw a picture of Bird President Ptarmigan Jefferson.

Bird Presidents #27: Theodore Crowsevelt

Theodore Crowsevelt

We must treat each bird on his merit and worth as a bird.
~ Theodore Crowsevelt

Theodore “Teddy” Roosevelt; the crow, in this case I think specifically the American Crow though no known species actually have mustaches or wear tiny spectacles.

The first bird president I drew, and so posted out of order. Also I believe I accidentally transposed “merit” and “worth” from the actual quote.

Just a little experiment this morning after I got the name “Crowsevelt” stuck in my brain for some reason.

In retrospect it’s hard to put a mustache on a bird; if I were to do this again I’d probably push the perspective to bring the face forward in the composition and give myself some more room to work with in caricature. But I’m also just not totally sure how to bridge the gap between Teddy and a crow, physiologically speaking, so hey.

Will now spend the rest of the day trying to think of other Presidential bird puns.

Metafilter Frequency Tables: 13 years, 636 million words

I’ve updated today a project I first launched a couple years ago: Metafilter Frequency Tables, a collection of tables calculated from the last 13 years of comments made by users on Metafilter.com.

The tables break down word frequency for the site as a whole as well as by several major subsites, and for all time as well as by year, month, and day. Each table includes raw count and parts-per-million data for each word. It’s all generated by some Perl I wrote that fetches comments from our database and tabulates them into these text files. Read about the methodology here.
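As a toy illustration of what “raw count and parts-per-million” means, here is a sketch in Python. It is not the Perl that generates the real tables, and the tokenizer (lowercase runs of letters and apostrophes) is a deliberately crude assumption rather than the documented methodology.

```python
import re
from collections import Counter

WORD = re.compile(r"[a-z']+")  # crude tokenizer (assumption, not the real method)


def frequency_table(comments):
    """Return {word: (raw_count, parts_per_million)} over all comments."""
    counts = Counter()
    for comment in comments:
        counts.update(WORD.findall(comment.lower()))
    total = sum(counts.values())
    return {word: (n, n * 1_000_000 / total) for word, n in counts.items()}
```

Parts-per-million is just the raw count scaled by the total number of words, which makes tables of very different sizes (one day versus thirteen years, say) directly comparable.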

This isn’t by far the most carefully constructed set of such tables out there — I am a hobbyist, not a trained linguist, and this whole effort is very much DIY — but it’s the largest I’m aware of focused specifically on this sort of internet-mediated casual textual conversation over the last decade-plus, and I’m hoping it will be of some use or interest to word nerds.