Wanna help edit Calvin and Hobbes transcripts?

I talked a bit in my Calvin and Markov post this morning about the C&H transcripts I’m using to power the Markov chain process, and about how much work it’s likely to be to edit the whole thing. Probably 25-30 hours total.

It’s not difficult stuff—and in fact it’s not a terrible excuse to reread some C&H as you go, if you want to go less than warp-speed—so if you’re interested in helping out, drop me a line in the comments here or via email or @joshmillard on twitter and I can hook you up with a specific chunk of transcript so we avoid any accidental duplication of effort.

I’ll talk a little more about the approach I’ve been taking to it, to give a clear idea of the small handful of details involved in the markup process. As I said in the other post, the situation is that transcripts of every strip do already exist, which is great, but they aren’t broken down by character or panel at all. It’s just one string of words run together per strip.

So the project is to, for e.g. this Christmas Day strip from 1986:

C&H 19861225

Turn each line in the existing transcript, like this:

19861225 Psst! Are you awake? Is it Christmas? It is! It is! Let's go wake Mom and Dad and open all our loot! Since it's Christmas maybe we should let them sleep a little. That's long enough! Wake up! Wake up! It's Christmas!! Quarter to 6. He let us sleep in this year.

Into a set of lines like this:

C: Psst! Are you awake?
H: Is it Christmas? It is! It is!
C: Let's go wake Mom and Dad and open all our loot!
H: Since it's Christmas maybe we should let them sleep a little.
C: That's long enough! Wake up! Wake up! It's Christmas!!
D: Quarter to 6. He let us sleep in this year.

That’s the whole idea in a nutshell: each line labeled with character code, colon, space; a new line for each balloon.
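For anyone who wants to sanity-check their marked-up chunk before sending it back, here is a minimal Python sketch of reading that format. It is not the project's actual code (that is Perl); the function name `parse_marked_line` and the regular expression are my own assumptions about what "label, colon, space" means in practice.

```python
import re

# Assumed shape of a marked-up line: uppercase label, colon, one space, text.
LINE_RE = re.compile(r"^([A-Z]+): (.+)$")


def parse_marked_line(line):
    """Split 'C: Psst! Are you awake?' into ('C', 'Psst! Are you awake?')."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a 'LABEL: text' line: {line!r}")
    return match.group(1), match.group(2)


sample = """C: Psst! Are you awake?
H: Is it Christmas? It is! It is!
D: Quarter to 6. He let us sleep in this year."""

# One (label, text) pair per balloon.
balloons = [parse_marked_line(line) for line in sample.splitlines()]
```

A line that raises `ValueError` here is probably missing its label or the space after the colon.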

But here’s a few details worth noting:

1. Character labels

Each character has a specific short label — as seen above, C for Calvin, H for Hobbes, D for Calvin’s dad. I’ve assigned labels to the most common ten or so characters I’ve come across in the first 14 months of the strip, as below:

C = Calvin
H = Hobbes
D = Dad
M = Mom
T = Teacher, Miss Wormwood
P = Principal
SD = Susie Derkins
SS = Spaceman Spiff, when Calvin is pictured as or narrating as/about Spiff
MOE = Moe, the school bully
ROS = Rosalyn, the babysitter
MON = Monsters/creatures of various sorts
MISC = miscellaneous/offscreen character dialogue
SFX = non-spoken sound effects words e.g. CRASH, WHAP

Most of those are self-explanatory. The spots where I’m making more of a judgement call:

– the use of MON to represent various imaginary speakers — I’ve included actual monsters under the bed, alien dialogue from Spiff sequences, and Calvin’s own speaking-as-imaginary-monster-form where he’s shown in panel as a monster rather than himself. Arguably these could be broken out more finely; if you want to propose or run with a more detailed rubric, just note what you’ve come up with and I’ll look at incorporating it formally.

– the use of MISC for non-major but clearly identifiable characters. In some cases dialogue is truly off-screen, e.g. TV show dialogue when Calvin’s watching TV, but some cases it’s more clearly coming from a detailed if unnamed character. There would be no harm in consistently labeling individual characters instead of lumping them under this catch-all, so if you want to e.g. label the family doctor as DOC: or by his name if you happen to know it, that’s fine too. Just make a note of it when you send your work back.

– SFX is meant for non-spoken environmental sounds; generally the distinction between this and stylized speech/shouts/utterances (AAAAHHH!, WAARG!, Z) is pretty clear, but use your best judgement if it’s a weird case.
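Since ad hoc labels like DOC: are welcome as long as they get noted, a small check that lists any labels outside the table above might be handy. This is only a sketch in Python; `KNOWN_LABELS` just restates the table, and `unknown_labels` is a hypothetical helper name.

```python
# The label table from this post, restated as a dictionary.
KNOWN_LABELS = {
    "C": "Calvin", "H": "Hobbes", "D": "Dad", "M": "Mom",
    "T": "Teacher, Miss Wormwood", "P": "Principal",
    "SD": "Susie Derkins", "SS": "Spaceman Spiff",
    "MOE": "Moe", "ROS": "Rosalyn", "MON": "Monsters/creatures",
    "MISC": "miscellaneous/offscreen", "SFX": "sound effects",
}


def unknown_labels(lines):
    """Return, sorted, every label used in 'LABEL: text' lines that is not in the table."""
    found = set()
    for line in lines:
        label, sep, _ = line.partition(": ")
        if sep and label not in KNOWN_LABELS:
            found.add(label)
    return sorted(found)


extra = unknown_labels(["C: I'm sick.", "DOC: Say ahh.", "SFX: WHAP"])
```

Here `extra` comes back as `["DOC"]`, which is the list to mention when you send your work back.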

2. Different panel/balloon, different line.

In cases where the same character speaks in multiple balloons or panels in a row, I’ve been giving a separate line to each instance. This doesn’t affect my Markov project’s performance, but it feels like a cleaner representation of the structure of the strip’s flow to me.

3. Doubling up shared dialogue

Occasionally, two characters (usually Calvin and Hobbes) will be shown shouting something simultaneously, either with twin word-balloon tails or by implication with more stylized non-balloon text. I’ve preferred to duplicate those lines, giving a separate copy to each character on two consecutive lines. The original transcript generally doesn’t duplicate the text in these contexts, so you’ll need to re-type or copy and paste.

4. Capitalization across panel/balloon breaks

The original transcript often declines to capitalize new panel/balloon starts when continuing dialogue across a break/ellipsis; I prefer to start new lines capitalized for consistency, and so would recommend capitalizing in these situations for the next line, e.g.

Sure, I think it's ... wait a minute.

becomes:

H: Sure, I think it's ...
H: Wait a minute.
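That transformation is mechanical enough to sketch in Python, with the caveat that not every ellipsis in the transcript marks a balloon break, so a human still has to decide when to apply it. The function name `split_on_ellipsis` and the assumption that the break is written as " ... " with spaces on both sides are mine.

```python
def split_on_ellipsis(label, text):
    """Split text at ' ... ' and capitalize the start of each resulting line."""
    parts = text.split(" ... ")
    lines = []
    for i, part in enumerate(parts):
        part = part.strip()
        if not part:
            continue
        # Every piece except the last keeps its trailing ellipsis.
        if i < len(parts) - 1:
            part += " ..."
        lines.append(f"{label}: {part[0].upper()}{part[1:]}")
    return lines


result = split_on_ellipsis("H", "Sure, I think it's ... wait a minute.")
```

For the example above, `result` is the two lines `H: Sure, I think it's ...` and `H: Wait a minute.`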

5. Spelling/transcription errors

The original transcripts are pretty good, but (not surprising given how much of a slog it must have been to type them out) there are occasional typos and a couple recurring issues worth fixing opportunistically if you notice them as you work.

Aside from occasional literal spelling/typing errors, the most common thing I’ve been fixing is the inappropriate use of periods where commas should go and vice versa (and corresponding changes to the following capitalization), and the outright omission of commas. Fix ‘em if you see ‘em.

calkov 1

People who know me won’t be surprised that I’m messing around with Markov chains; it’s one of my favorite little intersections of math and linguistic/artistic weirdness, a fairly simple way of analyzing the frequencies of events (like the order in which words appear in a bunch of written text) in order to produce new, novel, semi-coherent output. I’ve built a lot of little Markov-related things over the years.

And some folks who don’t know me at all may find this specific Markov-plus-comics idea familiar anyway because of an old project of mine, off which this new one is based: Garkov, a random Garfield strip generator I wrote several years ago. Calvin and Markov is the result of fits-and-starts cleanup work I’ve done on that original code over the years; at its heart it does just what Garkov does, but does it in a somewhat less stupid way on a couple fronts.

Okay, but how?

calkov 3

There’s a few moving parts to C&M:

1. The Markov chain code

This is a custom implementation of a Markov chain process that I wrote in Perl a few years back because (a) I’m finicky about how my Markoving works and (b) it seemed like a fun thing to write. It’s a content-neutral set of functions — nothing about it is specific to Calvin, or to comic strips. It’s just a bunch of code that will digest an arbitrary collection of text and then burp out new weird sentences when you ask it to.

I’ve made a few small improvements to this code as I’ve revisited it the last few days, but it was feature-complete already.
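To give a rough idea of what "digest text, burp out sentences" means, here is a minimal word-level Markov chain sketch in Python. It is not the Perl implementation described above; the chain order of 2, the start/end markers, and the function names `build_table` and `generate` are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence-boundary markers


def build_table(lines, order=2):
    """Map each run of `order` words to the list of words seen following it."""
    table = defaultdict(list)
    for line in lines:
        words = [START] * order + line.split() + [END]
        for i in range(len(words) - order):
            state = tuple(words[i:i + order])
            table[state].append(words[i + order])
    return table


def generate(table, order=2, rng=random, max_words=30):
    """Walk the table from the start state until END or max_words is reached."""
    state = (START,) * order
    out = []
    while len(out) < max_words:
        nxt = rng.choice(table[state])
        if nxt == END:
            break
        out.append(nxt)
        state = state[1:] + (nxt,)
    return " ".join(out)


corpus = [
    "Let's go wake Mom and Dad and open all our loot!",
    "Let's go wake up! It's Christmas!!",
]
print(generate(build_table(corpus)))
```

Repeated followers are stored more than once in each list, so picking uniformly from the list already weights choices by how often they appeared in the input.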

2. The comic strip art

There’s lots of places to find C&H strips on the internet, both on official comics-hosting sites and elsewhere. I found an archive of the series run sitting around somewhere or other, with all the daily strips rendered to 600 pixels wide, and have been using that.

I’ve selected a few dozen strips that I like, featuring characters with enough dialogue (except maybe Moe, he doesn’t talk much) that they have a variety of things to say, and blanked out the original dialogue in the art (and in some cases tweaked the word balloons to be a little more accommodating to my text insertion process).

It takes a minute or two to blank and neaten up each strip for this step, but there’s some additional work after that, setting up a strip definition file (see below) that adds a few more minutes to the process. Adding more strips to the project is doable in a piece-by-piece fashion and I’ll likely continue adding strip templates in the future to up the variety some more.

3. The dialogue from the comics

Calvin and Hobbes ran daily for about ten years, which, accounting for a couple of sabbaticals by Watterson, means there were on the order of 3,000 original strips published. That’s a lot of text to work with, which is great in theory, but newspaper comics didn’t come with convenient plaintext transcripts. I went into this project knowing I might have to do a lot of typing just to have material to feed to the Markov process. That’s what I did for Garkov, which is why the input corpus for Garkov is based on a few hundred strips rather than a much more significant chunk of the strip’s archives.

Luckily for me, someone, years ago, already decided to tackle this, transcribing with reasonably good accuracy the entire C&H archive, strip by strip. (This is apparently what powers Mike Yingling’s C&H search engine.)

Unluckily for me, they did so by treating each comic strip as a single, run-all-together line of dialogue text without any character breakouts. Which is fine if you’re just trying to search for a strip — if the word “Krakow” appears anywhere in the strip, that strip will be a match — but it’s a problem if you’re doing something character-specific like this project. I want Calvin to say Calvin stuff, Hobbes to say Hobbes stuff, and so on for Dad, Mom, Susie Derkins, etc.

And so I escaped the need to transcribe, but have still had to do a bunch of markup on that original transcription work, for each strip turning a line like this:

19861225 Psst! Are you awake? Is it Christmas? It is! It is! Let's go wake Mom and Dad and open all our loot! Since it's Christmas maybe we should let them sleep a little. That's long enough! Wake up! Wake up! It's Christmas!! Quarter to 6. He let us sleep in this year.

Into a set of lines like this:

C: Psst! Are you awake?
H: Is it Christmas? It is! It is!
C: Let's go wake Mom and Dad and open all our loot!
H: Since it's Christmas maybe we should let them sleep a little.
C: That's long enough! Wake up! Wake up! It's Christmas!!
D: Quarter to 6. He let us sleep in this year.

That lets me break out each character into a separate collection of lines, and create individual Markov table “brains” for use in strip generation. I also wrote a small script to do that sorting-into-separate-files bit so I don’t have to do the sorting and copying and pasting manually.
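The sorting step is simple enough to sketch. What follows is a Python approximation of the idea, not the script I actually use; the one-file-per-label naming (`C.txt`, `H.txt`, and so on) and both function names are assumptions.

```python
from collections import defaultdict
from pathlib import Path


def group_by_character(marked_lines):
    """Collect the text of each 'LABEL: text' line under its label."""
    by_label = defaultdict(list)
    for line in marked_lines:
        label, sep, text = line.strip().partition(": ")
        if sep:  # skip blank or unlabeled lines
            by_label[label].append(text)
    return dict(by_label)


def write_corpora(by_label, out_dir):
    """Write one text file per label (hypothetical naming: C.txt, H.txt, ...)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for label, texts in by_label.items():
        (out / f"{label}.txt").write_text("\n".join(texts) + "\n", encoding="utf-8")


groups = group_by_character([
    "C: Psst! Are you awake?",
    "H: Is it Christmas? It is! It is!",
    "C: Let's go wake Mom and Dad and open all our loot!",
])
```

Each resulting file can then be fed to the Markov code on its own to build that character's "brain".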

So far I’ve marked up the 1985 and 1986 strips, or about 14 months total. That adds up to about 1300 separate lines for Calvin (I peeked, it’s 1,268 lines containing 10,832 words total), a few hundred for Hobbes, 100+ each for Calvin’s mom and dad, and on the order of dozens for the other recurring characters. In a Markov project like this, more input text is generally better (you get more varied, weird, unexpected results), and so marking up more of the transcripts would be a good long-term goal, but it’s tedious and time-consuming; at a good clip, I can do a month’s worth of strips in about 15 minutes, but with what’s done so far that leaves something like 25-30 hours of additional work just to mark up the remaining bulk of transcript. I may keep chipping at it myself; I may try crowdsourcing some of that work if folks are interested in helping.
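As a back-of-the-envelope check on that estimate, here is the arithmetic in Python. The 120-month total is an assumption (taking "about ten years" literally); the sabbaticals would pull the real figure down somewhat.

```python
MINUTES_PER_MONTH_OF_STRIPS = 15  # from the paragraph above
MONTHS_DONE = 14                  # 1985 and 1986, from the paragraph above
TOTAL_MONTHS = 10 * 12            # assumption: "about ten years", sabbaticals ignored

remaining_months = TOTAL_MONTHS - MONTHS_DONE
hours = remaining_months * MINUTES_PER_MONTH_OF_STRIPS / 60
print(remaining_months, hours)
```

That works out to 106 months and 26.5 hours, inside the 25-30 hour range.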

4. Strip definitions

Blanking out the original dialogue in word balloons gives me a canvas to work with, but my code still needs to know where to actually try and put new words. I ginned up a very simple definition file, describing a set of rectangular areas into which my code would paint dialogue; on each run, the program selects a strip template at random, reads in the actual image file (e.g. strip_02.gif) and then the definition file (strip_02.def), and uses both to do its work.

For example, this strip:

strip_02 template

…has the following definition file, listing on the first line which characters appear in the strip (so the code can be appropriately parsimonious and not bother loading Markov data for any other characters) and then on each subsequent line a character who speaks (so the code knows whose “brain” to pull from) and the geometry of the rectangle representing their word balloon:

calvin dad
calvin 98 8 90 4
dad 183 7 45 3
calvin 252 25 90 5
calvin 375 10 140 5
dad 525 10 140 3
calvin 550 55 100 5

Those numbers represent, respectively, the x-position in pixels of the center of the word balloon, the y-position of the top, the width of the balloon in pixels, and the maximum number of lines of dialogue that can appear. That’s a bit of a hacky mess of a format, but it works well enough for these purposes and is reasonably simple for me to generate by hand; if I clean up and generalize the code in the future, I’ll revisit it a bit. (Likely plan: make the x and y coordinates both center values for the middle of the target rectangle, and replace the max-number-of-lines value with a raw pixel distance with the script doing the math at runtime to figure out how many lines that can accommodate.)
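Reading that format back in takes only a few lines. This is a Python sketch rather than the project's Perl, and the `Balloon` class and its field names are my own labels for the four numbers described above.

```python
from dataclasses import dataclass


@dataclass
class Balloon:
    character: str
    center_x: int   # x-position of the balloon's center, in pixels
    top_y: int      # y-position of the balloon's top, in pixels
    width: int      # balloon width, in pixels
    max_lines: int  # maximum number of lines of dialogue


def parse_definition(text):
    """Return (cast, balloons) from the contents of a strip definition file."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    cast = rows[0]
    balloons = []
    for character, *numbers in rows[1:]:
        if character not in cast:
            raise ValueError(f"{character!r} is not listed on the first line")
        if len(numbers) != 4:
            raise ValueError(f"expected 4 numbers, got {len(numbers)}")
        balloons.append(Balloon(character, *map(int, numbers)))
    return cast, balloons


definition = """calvin dad
calvin 98 8 90 4
dad 183 7 45 3
calvin 252 25 90 5
calvin 375 10 140 5
dad 525 10 140 3
calvin 550 55 100 5
"""
cast, balloons = parse_definition(definition)
left_edge = balloons[0].center_x - balloons[0].width // 2
```

Because x is a center value, the left edge of the first balloon is 98 - 45 = 53 pixels in.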

Generating a strip definition involves some eyeballing with a cursor tool in Photoshop Elements, and then a little bit of nudging (a few pixels left or right or up or down) when I look at the text being populated by the program in practice. A tool that actually translated drawn rectangles into numbers would speed up the process, but I didn’t feel like spending time trying to build that when I could just rough out some definitions by hand to get this initial pass working. Something for the future.

5. Lettering

The lettering in a comic strip (the actual text as written in the panels) is a key part of its visual identity, if not always the most obvious one. Garfield has a clean, very even balloon-ish lettering, like Comic Sans’ sober cousin; Mark Trail has a sturdy feel with bold verticals canted to the right; Zippy the Pinhead has a loose, scrawled-in-a-notebook lilt that matches its surreal tone.


Change the lettering, and a comic just won’t look like itself.

And Bill Watterson’s lettering is as central to the look of Calvin and Hobbes as almost any comic I can think of; his narrow, unstuffy left-leaning capitals are just how C&H looks, an integral part of not just the meaning but the feel of the strip, even before you take into account those panels where he gets particularly expressive with larger text, alternate lettering styles, sound effects and shouted words and so on.

Truly replicating expressive hand-lettered text automatically is a difficult job and well out of scope of a project like this, but I was happy to see that someone had created a basic font based on Watterson’s lettering, and shared it on Deviant Art. I’ve used that for the text of the strip, to reasonably good effect; my auto-generated text is mechanical and inorganic compared to the real deal, and cramped and often inelegantly placed within balloons on account of the lack of artistic vision on my Markov code’s part, but it at least doesn’t immediately scream NOT BILL’S LETTERING the way subbing in some generic “comic” typeface would have.

For the title of the strip, I was able to find another font based on the familiar sticks-and-angles style of Calvin’s own handwriting that folks most likely associate with the strip.

6. ImageMagick

Aside from the Markov code responsible for the text generation, my Perl script does some basic stuff to wrangle all the rest of the above, most of which isn’t really interesting. But the actual pasting together of an image is a pretty key bit, and that’s something I didn’t want to have to figure out from scratch.

To that end, I used a software library called ImageMagick that handles image generation and manipulation, and which does the work of turning the strip templates and lettering font and generated text into a final, rendered image. (This is a big improvement over Garkov, which renders a strip as a bunch of individual CSS-positioned single letter images on top of a blanked-out strip, which makes keeping or sharing an image unnecessarily difficult.)
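Before anything gets painted, the generated text has to fit a balloon's width and line budget. Here is a rough Python sketch of that fitting step under a big simplifying assumption: every character is treated as a fixed 6 pixels wide (`avg_char_px`, a made-up number), where real code would measure the actual font. `fit_text` is a hypothetical name, not a function from my script or from ImageMagick.

```python
import textwrap


def fit_text(text, width_px, max_lines, avg_char_px=6):
    """Wrap text for a balloon; return the lines, or None if it needs too many."""
    # Assumption: each character is roughly avg_char_px wide.
    chars_per_line = max(1, width_px // avg_char_px)
    lines = textwrap.wrap(text, width=chars_per_line)
    return lines if len(lines) <= max_lines else None


roomy = fit_text("Wake up! Wake up! It's Christmas!!", width_px=90, max_lines=4)
cramped = fit_text("Wake up! Wake up! It's Christmas!!", width_px=45, max_lines=3)
```

When the result is `None`, as it is for the `cramped` case, one reasonable response is to ask the Markov code for a different sentence and try again.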

ImageMagick is fast, reasonably powerful, and a confusing goddamn mess to work with. I heartily recommend it and suggest you stay the hell away. It’s that sort of library.

I’ve found myself reaching for other tools in the last few years; if I were starting on this project from scratch today, Perl and ImageMagick is probably not where I’d aim. But having the Garkov codebase to work with was such a head start on this that coming back to those and doing a little bit of angry wrestling was worth the pain.

Okay, but why?

calkov 2

How is easy; I lowered a shoulder and nerded on through it and here we are. Why is trickier. There’s a few different kinds of why.

Why bother?

Because it was (a very specific kind of) fun putting this together, revisiting and improving my Garkov code, getting the whole thing working as well as I could. I really liked making Garkov back in the day, but I also burnt out on it around the time I put it out in the world, partly just because of all the tricky bits I had to sort through and try (and in some cases fail) to find solutions to in order to get it up and running.

Giving this idea another shot years later with everything I’d learned the first time driving me forward more quickly has been satisfying. I may take this newer, cleaner approach and apply it to a Garkov 2.0; I may throw it at some other comics; I may try to get it into other folks’ hands so ten thousand Markoving comic resynthesis projects can bloom.

Why Calvin and Hobbes?

Because I like C&H. Because other folks like it. Because it’s recognizable, and familiar, and the familiarity of the original lends a kind of weird suspension of disbelief to the broken, altered output of this kind of transmogrification process—if a thing looks enough like the real thing, we try to treat it like the real thing a little longer, give it the benefit of the doubt even as we know we should be doubting it. No one will really be fooled by Calvin and Markov, but all of us who’ve read thousands of the originals are wired to sort of give it credit long enough to produce a double-take, which is great.

Why treat C&H so weirdly?

One thing I’ve thought about while working on this (and I’ve heard it from at least one friend I showed the work in progress to, as well) is how different Calvin and Hobbes is, as a cultural property, from the previous choice of Garfield. They’re both totemic, instantly recognizable comic strips, but that’s about the end of the similarities; C&H is loved for its doting, dynamic inkwork and artful writing and characterization, while Garfield is generally derided for its predictability, minimalist and samey art, and overall cash-in, sell-out, factory-produced sterility.

And so building Garkov, a machine that swallows up Garfield strips and spits out something dada and broken and absurd, seems like sort of a gimme. Of course people should fuck with Garfield. What else is it good for? I hate Mondays. Etc. It is, however justly or not, an easy target. (And I was not by far the first or the last to futz around with Garfield as a template for recontextualized weirdness; see the links at the bottom of the Garkov page for many others.)

Whereas C&H is a strip people hold up high as more or less the zenith of the modern newspaper comic strip, a piece of work that was so consistently beautiful and smart and heartfelt and uncompromising that nothing on the page could compete with it during its ten-year run, and nothing has been able to replace it in the years since. C&H was funny, but it wasn’t a joke; as mainstream pop cultural artifacts go, it’s pretty unfuckwithable.

You mess with Garfield, no one says How Dare You. Calvin and Hobbes, though…

So I’ve wondered as I built this how people would feel about it. Not so much that I expect condemnation (weird for weird’s sake gets by okay on the internet and I doubt anyone will get the mistaken impression that I mean any harm here) but really just how they’ll feel about the oddball output of this given their likely more fond relationship with the source material than in the case of Garkov.

Take a well-written, well-remembered comic strip and render it incoherent, and…and then what? And why?

I didn’t really know what I was going to get when I started. I wasn’t sure if I was going to get anything, honestly, other than new, bad Calvin and Hobbes strips. But I have seen some stuff in the output as I’ve developed this that I genuinely like.

And mostly what that is is this: Calvin as an actual, deeply weird little kid. Not the apt, smartly-written, nail-it-in-four-panels Calvin of Watterson’s work, the kid who we understand to be a kid despite his tremendous vocabulary and delightful, imaginary-or-is-he tiger friend, but a real scatterbrained oddball, the unvarnished stream-of-consciousness pile of developing brain cells that parents and teachers end up dealing with.

The little kid who changes the subject every five seconds. The little kid who says bizarre, contextless things, not as a punchline to a three-panel setup with a beautifully drawn alligator but just actually genuinely out of nowhere. Here’s a Calvin who confuses us, the readers, just as much as he does the adults around him in Watterson’s final-panel reveals.

Not Calvin the apposite, but Calvin the apropos-of-nothing. A Calvin whose head we don’t get to see inside of, a Calvin who we can’t keep up with.

It’s a neat thing, and a plenty satisfying outcome of these last few days of work. I love when something like this can surprise me.

Metafilter Frequency Tables: 13 years, 636 million words

I’ve updated today a project I first launched a couple years ago: Metafilter Frequency Tables, a collection of tables calculated from the last 13 years of comments made by users on Metafilter.com.

The tables break down word frequency for the site as a whole as well as by several major subsites, and for all time as well as by year, month, and day. Each table includes raw count and parts-per-million data for each word. It’s all generated by some Perl I wrote that fetches comments from our database and tabulates them into these text files. Read about the methodology here.
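For a sense of what a row in one of these tables amounts to, here is a small Python sketch of counting words and converting counts to parts per million. It is not the Perl that generates the real tables; in particular, the tokenizing rule (lowercase runs of letters and apostrophes) is an assumption and will not match the actual methodology.

```python
import re
from collections import Counter


def frequency_table(comments):
    """Return (word, raw_count, parts_per_million) rows, most frequent first."""
    counts = Counter()
    for comment in comments:
        # Assumed tokenizer: lowercase runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", comment.lower()))
    total = sum(counts.values())
    return [(word, n, n * 1_000_000 / total) for word, n in counts.most_common()]


table = frequency_table(["Beans beans", "a plate of beans"])
```

In this toy input "beans" is 3 of 6 words, so its row is `("beans", 3, 500000.0)`.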

This isn’t by far the most carefully constructed set of such tables out there — I am a hobbyist, not a trained linguist, and this whole effort is very much DIY — but it’s the largest I’m aware of focused specifically on this sort of internet-mediated casual textual conversation over the last decade-plus, and I’m hoping it will be of some use or interest to word nerds.

Your MOM stopped chewing her fingernails…

I’ve been quietly refraining from making Your Mom jokes for the last week or so, and it is actually a sort of difficult experience.

Partly because habits are just hard to break; partly because I actually really love “your mom” jokes.

Not because I think making fun of people’s moms is funny; I pretty much stay away from anything resembling a plausible attempt to comment on anyone’s actual mom.

I love your mom jokes because I love playing with language, and the process of yourmomization (for lack of a better word) is a seriously flexible one that lends itself to a wide variety of induced nonce euphemisms from the literal to the absurd. I like how it’s possible to take even the most innocuous sentence and yourmomize it, how the cultural association of “your mom” jokes with implications of e.g. sexual impropriety means even a totally absurd substitution (A: “I collated those files”, B: “I collated your MOM” or B: “Your MOM collated those DICKS”) reads as an acceptable (if often deeply stupid) semantic transformation.

I’ve even put serious thought in the past into, as a project in amateur computational linguistics and natural language processing, building a YourmomBot that would be purpose-built to parse natural language input, identify potential substitutions, and generate a comeback response from those candidates. (It could even be a learning machine: using either an explicit rating mechanism or an NLP heuristic that tries to gauge the positive/negative valence of responses to its comeback, it could build up a model of what substitutions work and use that to weight future candidate selections. As the basis for a joke entry in a future Loebner Prize competition it’d probably at least get a few laughs.)

I love dumb jokes for their own sake, but fundamentally the humor I find in absurdist yourmomization is not so much in the lowbrow implications of any given joke as in the sort of ready-made, Mad Lib universality of the pattern of jokes when made in series; it’s in the way yourmomization, when employed not as a personal attack of opportunity but rather as an always-on regime, is revealed (to paraphrase Stanley Kubrick) not to be hostile so much as indifferent.

A very specific riff on Chomsky, a modified theory of deep linguistic structures: every sentence was actually a your mom joke all along.

But what I’ve realized is that what may be for me a personal exercise in long-form absurdism may as well be, for those around me, an exercise in littering every single conversation with really banal, repetitive only-barely-jokes. Which isn’t really fair to everybody who isn’t me, and isn’t really how I want to come off.

And I have tolerant friends and a deeply tolerant wife; no one is going to tell me I have to cut it out, nobody thinks I’m actually trying to diss their mom. They might groan a bit, which is the least they’re entitled to do, but that’s about all. They’re kind people.

But having a couple of friends visit for a few days at the end of last year and realizing that I was yourmomizing everything, even literally reflexively yourmomizing random snippets of half-overheard conversation from the other room, made me think about whether maybe it’s time to reel it in some, to recalibrate the meter. And probably the most effective way to start that process is to just kick cold turkey for a while.

And so I’ve stopped making “your mom” jokes for the moment. For a few days now.

All I can moderate so far is what comes out of my mouth (or mostly, considering most of my daily conversation is over the internet, what comes out of my fingers), not what comes into my head, and so I’m not really at a psycholinguistic level making any fewer of them, but now instead of having reflexive yourmomization thoughts and then producing them out into the world, I’m actively quashing the production part of the process. Sometimes I mutter them quietly to myself instead of typing them out, but mostly I’ve been getting out ahead of that even.

It feels a bit like stifling a sneeze, one of those stifles where your sinuses get blasted by the blowback and feel unhappy. It’s not a process I’m enjoying. But it’s educational. It’s interesting. And I’m probably annoying fewer people.

And if I bottle up enough of this antiyourmomization frustration it might push me over the edge into actually implementing that chatbot.

12 Variations on Chekhov’s Gun

“If in the first act you have hung a pistol on the wall, then in the following one it should go off. Otherwise don’t put it there.”

– Anton Pavlovich Chekhov, as related by Ilia Gurlyand of a conversation from 1899

“Rules are made to be broken.”

– Ladies Home Journal, 1899

~ The setup ~

(The living room of a modest home. There is a fire burning in the fireplace; hanging above the mantel is a pistol.)

Oh, Uncle Bob, it is so good to see you again after all these years.

(embracing ALICE)
I have missed this old house. How is your mother?


~ The variations ~

1. The Vanilla


You’ve betrayed me, and Mother! Everything this family stands for!

(grabs pistol from mantel, fires it at ALICE, killing her)
Oh god, what have I done?


2. The Ladies Home Journal


You’ve betrayed us all, Uncle Bob!

(walks to fireplace, looks at pistol on the wall)
Yes, well, nobody’s perfect. Hey, where’d you find this neat gun?

Mother bought it on eBay.

It’s really cool looking.

Yeah, I like it a lot.

Overthinking a plate of injokes

So I’ve been playing Glitch for the last few days; it’s a lightweight free-to-play social MMO done as a browser Flash game, sort of a cross between a platformer and combat-free resource-wrangling games like Animal Crossing. It’s a good little time. It looks a bit like this:

But I play a lot of games and don’t mention it here much at all, so why am I bringing up this one? Because of a plate of beans, is why.

It goes like this:

1. I found out about Glitch because I work for and hang out at Metafilter, and a lot of folks on Metafilter are playing it. And so while my little yellow dude has been mining beryl and cavorting with pigs in exchange for steaks and harvesting bubbles off of bubble trees, I’ve been chatting with my fellow Mefites. And not only are there a lot of us playing the game, there’s even a couple folks who work for Tiny Speck, the company that makes Glitch.

2. There’s a long-running joke on Metafilter about how we’re the kind of people who could overthink a plate of beans. It started as a joke someone made during a thread years ago in which people were earnestly deconstructing the performative elements of Alanis Morissette’s cover of Black Eyed Peas’ My Humps, and it caught on, to the point where “beanplating” and “to beanplate” are commonly understood derived verb forms used to describe maybe-needlessly-in-depth analysis of one thing or another. There’s even a song.

3. Glitch has bean trees, on which grow beans. One can harvest beans, and eat them or use them as constituents in recipes. What Glitch hasn’t had is beans arranged tastefully on a plate.

4. Except, well, that now exists, thanks to one of those Mefite Glitchers who happens to work on the game.

Here it is:

Click on it and you get, as with most items in Glitch, a context menu that gives you some basic options and a special option unique to this item: overthinking.

The Big Atreides (or, The Dune Abides)

So there was a brief discussion of a twofer of Dune-related posts over on Metatalk, and it quickly unraveled into a series of questionable Frank Herbert vs. The Coen Brothers jokes when Metafilter user Eideteker said:

The Dune Abides.

You can read through the thread to see the raw output as folks put it together. Fun real-time riffing, people pushing in a few different directions with it over a couple of hours.

Here’s a neatened-up arrangement of my take on it, in the general style of an IMDB “Memorable Quotes” digest:

The Big Atreides

PRINCESS IRULAN [voiceover]: Way out in the stars there was this fella… fella I wanna tell ya about. Fella by the name of Paul Atreides. At least that was the handle his loving parents gave him, but he never had much use for it himself. Mr. Atreides, he called himself “The Dib”. Now, “Dib” – that’s a name no one would self-apply where I come from. But then there was a lot about the Dib that didn’t make a whole lot of sense. And a lot about where he lived, likewise. But then again, maybe that’s why I found the world so darned interestin’.

They call Arrakis “the Spice planet”. I didn’t find it to be that, exactly. But I’ll allow there are some colorful folks there. ‘Course I can’t say I’ve seen Caladan, and I ain’t never been to Ix. And I ain’t never seen no Reverend Mother in her damned undies, so the feller says. But I’ll tell you what – after seeing Dune, and this here story I’m about to unfold, well, I guess I seen a purpose every bit as turrible as you’d see on any of them other planets. And in Galach, too. So I can die with a smile on my face, without feelin’ like the God Emperor gypped me.

Now this here story I’m about to unfold took place back in the early 10190s – just about the time of the conflict with Vlad Harkonnen and the I-treides. I only mention it because sometimes there’s a man… I won’t say a mahdi, ’cause, what’s a mahdi? But sometimes, there’s a man. And I’m talkin’ about the Dib here. Sometimes, there’s a man, well, he’s the man for the place the Bene Gesserit dare not look. He fits right in there. And that’s the Dib, on Dune. And even if he’s a prescient man – and the Dib was most certainly that. Quite possibly the most prescient in Arrakeen, which would place him high in the runnin’ for the most prophetic galaxywide. But sometimes there’s a man, sometimes, there’s a man. Aw. I lost my heighliner of thought here. But… aw, Shaitan. I’ve done introduced him enough.

THE DIB: I’m the Muad’dib. So that’s what you call me. You know, that or, uh, His ‘Dibeness, or uh, ‘Diber, or El Muad’diberino if you’re not into the whole my-name-is-a-killing-word thing.

THE DIB: Yeah, well, that’s just, like, my terrible purpose, man.

CHANI: What do you do for recreation?

THE DIB: Oh, the usual. I jihad. Ride worms around. The occasional prescient spice trance.

DUNCAN IDAHO: Facedancers! Fuck me. I mean, say what you like about the tenets of the Bene Gesserit, ‘Dib, at least it’s an ethos.

FEYD-RAUTHA [to THE DIB]: What’s this day of rest shit? What’s this wormshit? I don’t fuckin’ care! It don’t matter to Feyd. But you’re not foolin’ me, man. You might fool the fucks in the naib, but you don’t fool Feyd. This bush league wheels-within-wheels stuff. Laughable, man – ha ha! I would have killed you Saturday. I kill you next Wednesday instead. Wooo! You got a date Wednesday, baby!

THE DIB: Mind if I do a water of L?

STILGAR: You have got to buck up, man. You cannot drag this negative energy in to the jihad!

THE DIB: Fuck the jihad… Fuck YOU, Stilgar!

STILGAR: Fuck the jihad? All right, I can see you don’t want to be cheered up here, ‘Dib. Come on Gurney, let’s go get us a worm.

IRULAN [voiceover]: I guess that’s the way the whole durned Golden Path keeps perpetuatin’ itself.

CHANI: Do you like sex, Usul?

THE DIB: ‘Scuse me?

CHANI: Sex. The physical act of love. Coitus. Do you like it?

THE DIB: I was talking about the waters of my homeworld.

STILGAR: Jamis was a good fighter, and a good Freman. He was one of us. He was a man who loved the desert… and killing Harkonnens, and as a wormrider he explored the sands of the Great Flat, from Tuono Basin to Habbanya Ridge and… up to… Gara Kulon. He died, like so many young men of his generation, he died before his time. In your wisdom, Shai-Hulud, you took him, as you took so many bright flowering young men on Salusa Secundus, on Giedi Prime, on Bele Tegeuse. These young men gave their lives. And so would Jamis. Jamis, who loved fighting. And so, Jamis of Sietch Tabr, in accordance with what we think your dying wishes might well have been, we commit the water of your body to the tribe, which you loved so well. Good night, sweet prince.

THE DIB: Just say the Litany Against Fear, man.

STILGAR: I’m perfectly calm, ‘Dib.

THE DIB: Yeah, waving the fucking crysknife around?

STILGAR: Calmer than you are.

[The Baron Vladimir Harkonnen and his mentat aide Piter De Vries interrogate Duke Leto Atreides.]

VLADIMIR: Is this your uniform, Leto? Is this your uniform, Leto?

DE VRIES: Look, Baron…

VLADIMIR: Piter, please? Is this your uniform, Leto?

DE VRIES: Just ask him about the signet ring.

VLADIMIR: Is this yours, Leto? Is this your uniform, Leto?

DE VRIES: Is that your thopter out front?

VLADIMIR: Is this your uniform, Leto?

DE VRIES: We know it’s his fucking uniform! Where’s the fucking signet ring, you freakin’ duke?

VLADIMIR: Look, Leto. Have you ever heard of Giedi Prime?

DE VRIES: Oh, for CHOAM’s sake, Vlad…

VLADIMIR: You’re entering a world of pain, Duke. We know that this is your uniform. We know that you had a signet ring.

DE VRIES: And your fucking heir.

VLADIMIR: And your fucking heir. And, we know that this is your uniform.

DE VRIES: We’re going to feed you your uniform, Leto.

VLADIMIR: You’re killing your lady, Leto!

IRULAN: I like your style, ‘Dib.

THE DIB: Well, I dig your style too, man. Got the whole galactic princess thing goin’.

IRULAN: Thankee.

IRULAN [voiceover]: Sometimes you ride the worm and sometimes, well, he rides you.

I may be an X from Y, but…

“I’m just a humble country lawyer…”
— Jimmy Stewart as Paul Biegler, Anatomy of a Murder

“I’m just a simple hyperchicken lawyer from a backwoods asteroid, but…”
— Hyperchicken Lawyer as himself, Futurama

I may be just a simple armchair linguist from the blogosphere, but I know a snowclone when I see one, and this is an interesting one. “I may be an X from Y, but…” is a go-to phrase in an apparently wide variety of contexts where there’s some perceived rhetorical advantage to a disarming self-deprecating framing.

And while its use seems biased toward small-town/rural self-identification (“country bumpkin”, “dumb old hick”, “redneck libertarian”), it’s not limited to those contexts; the speaker sometimes identifies as a city-slicker (“a heal-wearing, wine-drinking gal from the city”), or a woman (“a girl from Manchester”), or a young person (“a kid from manila”), or as a member of any of a variety of other categorizations (“a grandmom from Chicago”, “a gay guy from Belfast”, “an insufferably white dork from Canada”).

Regardless of the role filling the X slot, the thrust remains the same: the speaker acknowledges their disadvantaged status in the context of the current conversation, but! But they’ve got something to say. As a rhetorical trick, it’s nice: a savvy speaker can claim some humility, but more than that can undercut a potential dismissal or (more deviously) put their interlocutor on the defensive by insinuating a dismissive intent on the interlocutor’s part.

Jimmy Stewart’s quoted line from the Otto Preminger court drama captures that rhetorical thrust nicely, but it doesn’t fill out the template I’m interested in here; the “…from Small Town, Wherever” tag is missing. But as later parodies (such as the template-fulfilling Hyperchicken quote above, or Phil Hartman’s drawn-out “I’m just a caveman…but…” spiels from Saturday Night Live) show, the form had legs.

And while the cited examples below don’t generally invoke legal contexts, the pop culture perception of the small town lawyer as a rhetorically manipulative creature makes it reasonable enough that the form would generalize into a framing tactic for everyday conversation, whether sincere or for the sake of a joke or some mix of the two.

So, collected here and alphabetized are all the hits I found on Google for the search string “i may be a * from * but” as well as variants with “an” as the article, “just be”, and “be just”, documenting only those hits that seemed to fit the pattern correctly with a physical location of some sort as the object of the “from” preposition.
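For anyone who would rather run the same filter over a pile of text than over Google results, here is a rough sketch of the search string and its variants as a regular expression. The pattern and the `find_snowclone` helper are my own hypothetical approximation of the wildcard query, not anything Google does, and it makes no attempt to check that the Y part is actually a physical location; that call still has to be made by eye.

```python
import re

# Hypothetical approximation of the query "i may be a * from * but",
# including the "an", "just be", and "be just" variants.
SNOWCLONE = re.compile(
    r"\bI may (?:just be|be just|be) (an?) (.+?) from (.+?),? but\b",
    re.IGNORECASE,
)

def find_snowclone(text):
    """Return (article, X, Y) for the first match in text, or None."""
    match = SNOWCLONE.search(text)
    return match.groups() if match else None
```

Variants I left out of the search, like “I might be a…” or “I’m just a…”, fall through this pattern and come back as `None`.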

(Notes for the future: It might be interesting to take this list and break it up into constituent parts ([a/an] [adjectival modifier(s)] [noun] from [location]) and look at those independently, or plot the noun and adjective constituents of the X part on a map according to the Y location. Also, to keep things focused and simple, I didn’t look in detail, or cite here, variants where there’s a negation (“I may not be a…”), or where alternatives to “I may be” are used (“I might be a…”, “I’m just a…”), or where the Y value is something other than a physical location (“…from another era”), or where the “from Y” tag is absent or significantly different structurally (“I may be a size 2 from the waist up, but…”).)
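As a starting point for that breakdown, a minimal sketch of the split might look like the following. The `split_entry` function is hypothetical, and it leans on a crude assumption: the last word before “from” is the noun, and everything between the article and that noun is a modifier.

```python
def split_entry(entry):
    """Split one cite, e.g. 'a dumb hick from Kentucky', into parts.

    Crude assumption: the last word before the first ' from ' is the
    noun, and any words between the article and the noun are modifiers.
    """
    x_part, _, location = entry.partition(" from ")
    article, _, rest = x_part.partition(" ")
    words = rest.split()
    return {
        "article": article,
        "modifiers": words[:-1],
        "noun": words[-1] if words else "",
        "location": location,
    }
```

That assumption handles the common cases (“an insufferably white dork from Canada” comes out with “dork” as the noun), but it will mangle cites like “a yokel who graduated from UF”, so the output would need a manual pass before anyone starts counting or mapping.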

And so but: the list of cites, preserving capitalization as it occurred in the search results.

I May Be ________ From ___________, But…

an 18 year chick from Canada
a 35-year-old white DJ from Maine
an African far away from home at Christmas-time
an American citizen from Hawaii
an artsy-fartsy from London
a backwards Christian hick from Ohio
a bitter old Klingon from Universe #2
a “blow-in” from Devon
a boy from Bensonhurst, Brooklyn
a bozo originally from Houston
a business school graduate from Canada
a cat from Memphis
a chow-hick from Napa
a country boy Republican from Idaho
a country boy from northern new england
a country bumpkin from the Arctic
a cracker from NYC
a crazy dog from Nagoya, Japan
a dirt-old farmer from Franklin
a dumb hick from Kentucky
a dumb hick from the woods
a dumb nigger from Chicago
a dumb old country boy from Montana
a dumb old hick from a dumb old hick town
an Echidna from another world
an emergency doctor originally from Maryland named Eric McDonald
an exchange student from Tennessee
an ex employee from a year ago
a farm kid from Texas
a fast talking blonde from California
a fireman from Lincoln Nebraska
a flashy girl from Flushing
a ‘fool from Stockholm’
a foolish boy from Chippenham
a foreigner (from an allied country)
a freckled face bitch from Down Under
a “Gaijin” from Texas
a gay guy from Belfast
a general hockey fan from Montreal
A Girl From Homer
a girl from Kansas
a girl from Manchester
a girl from Manchester
a girl from Nebraska
a girl from a Salford council estate
a girl from the Midwest
a gobby bloke from Bermondsey
a goober from Georgia
a grad from Knucklehead U
a graduate from SHS
a grandmom from Chicago
a green-bottomed kid from Texas
a “green go” politician from Alaska
a heal-wearing, wine-drinking gal from the city
a hick from Indianer
a hick from Montana
a hick from N. Louisi-Yana
a hick from Pennsylvania
a hick from a cow college
a hick from the sticks
a hick from up north
a hick girl from Tennessee
a hillbilly from jackson county
a humble Northern boy from England
a humble boy from the Midwest/ Appalachia countryside
a humble techie from a non-medieval institution
a humble woman from Wasilla
an illegal immigrant from Canada
an insufferably white dork from Canada
an Italian from NJ
an Ivy League educated white girl from the mid-west
a Jewish girl from Boston
a kid from manila
a kiwi indian drumma from new zealand
a lily white chick from the subdivision
a little girl from Texas
a little ol’ country boy from Lubbock, Texas
a little old country boy from St Louis
a lowly peasant emigree from Gloucester
a manly naturalist from Northumbria
a middle class ex- factory worker from Surrey
a naive white guy from London
a nice guy from the Midwest
an Okie from Muskogee
an ol’ yellow dog (from New York City)
an old dude from the Northwest
an old frustrated throw back from the 60’s
an old geezer from the hinterlands
an old nobody from NY
an old nobody from NY
an outsider from a different race
an outsider from the lower 48
a pasty Caucasian from a rural farming community
a plain working class bloke originally from the Newtownards Road in Belfast
a plonker from Penge
a privileged white boy from Scarsdale
a redneck from south Arkinsaw
a redneck libertarian from Kentucky
a rookie from England
a rube from Warren County
a schismatic from Rome
a simple Paddy from Ireland
a simple bloke from Otahuhu
a simple country lawyer from your northern neighbor
a simple guy from Bosie, Idaho
a simple guy from Mobile, Alabama
a simple guy from Norway
a simple kid from the trailer park
a simple man from Galveston
a sixteen year old sophmore from Missouri
a sixty year old white woman from St. Louis
a small town girl from Mt. Clemens Mi
a small town girl from Oklahoma
a “small town” girl from Portland
a stranger from afar
a stranger from far away
a stupid metalhead from california
a teenage guy from a small town
a total randomer from Manchester, England
a transplanted hick from New Hampshire
a transplant from CT
an uneducated, “granny taught”, hodad from podunk
an unkwown from the twilight zone
a white boy from the ‘burbs
a white doctor from America
a white girl from Boston
a white girl from Oregon
a woman from Venus
a woman from the south
a Yankee from up north
a yokel from Ohio
a yokel who graduated from UF