Wanna help edit Calvin and Hobbes transcripts?

I talked a bit in my Calvin and Markov post this morning about the C&H transcripts I’m using to power the Markov chain process, and about how much work it’s likely to be to edit the whole thing. Probably 25-30 hours total.

It’s not difficult stuff (and in fact it’s not a terrible excuse to reread some C&H as you go, if you want to go less than warp-speed), so if you’re interested in helping out, drop me a line in the comments here, via email, or @joshmillard on Twitter. I can hook you up with a specific chunk of transcript so we avoid any accidental duplication of effort.

I’ll talk a little more about the approach I’ve been taking to it, to give a clear idea of the small handful of details involved in the markup process. As I said in the other post, the situation is that transcripts of every strip do already exist, which is great, but they aren’t broken down by character or panel at all. It’s just one string of words run together per strip.

So the project is to, for e.g. this Christmas Day strip from 1986:

C&H 19861225

Turn each line in the existing transcript, like this:

19861225 Psst! Are you awake? Is it Christmas? It is! It is! Let's go wake Mom and Dad and open all our loot! Since it's Christmas maybe we should let them sleep a little. That's long enough! Wake up! Wake up! It's Christmas!! Quarter to 6. He let us sleep in this year.

Into a set of lines like this:

C: Psst! Are you awake?
H: Is it Christmas? It is! It is!
C: Let's go wake Mom and Dad and open all our loot!
H: Since it's Christmas maybe we should let them sleep a little.
C: That's long enough! Wake up! Wake up! It's Christmas!!
D: Quarter to 6. He let us sleep in this year.

That’s the whole idea in a nutshell: each line labeled with character code, colon, space; a new line for each balloon.

But here are a few details worth noting:

1. Character labels

Each character has a specific short label — as seen above, C for Calvin, H for Hobbes, D for Calvin’s dad. I’ve assigned labels to the most common ten or so characters I’ve come across in the first 14 months of the strip, as below:

C = Calvin
H = Hobbes
D = Dad
M = Mom
T = Teacher, Miss Wormwood
P = Principal
SD = Susie Derkins
SS = Spaceman Spiff, when Calvin is pictured as or narrating as/about Spiff
MOE = Moe, the school bully
ROS = Rosalyn, the babysitter
MON = Monsters/creatures of various sorts
MISC = miscellaneous/offscreen character dialogue
SFX = non-spoken sound effects words e.g. CRASH, WHAP

Most of those are self-explanatory. The spots where I’m making more of a judgement call:

– the use of MON to represent various imaginary speakers — I’ve included actual monsters under the bed, alien dialogue from Spiff sequences, and Calvin’s own speaking-as-imaginary-monster-form where he’s shown in panel as a monster rather than himself. Arguably these could be broken out more finely; if you want to propose or run with a more detailed rubric, just note what you’ve come up with and I’ll look at incorporating it formally.

– the use of MISC for non-major but clearly identifiable characters. In some cases dialogue is truly off-screen, e.g. TV show dialogue when Calvin’s watching TV, but some cases it’s more clearly coming from a detailed if unnamed character. There would be no harm in consistently labeling individual characters instead of lumping them under this catch-all, so if you want to e.g. label the family doctor as DOC: or by his name if you happen to know it, that’s fine too. Just make a note of it when you send your work back.

– SFX is meant for non-spoken environmental sounds; generally the distinction between this and stylized speech/shouts/utterances (AAAAHHH!, WAARG!, Z) is pretty clear, but use your best judgement if it’s a weird case.
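If you’d like a quick sanity check on a chunk before sending it back, something like the following Python sketch could do it. It’s only an illustration, not part of my actual tooling: the names KNOWN_LABELS and check_markup are made up for this example, and it assumes the file contains nothing but blank lines and "LABEL: text" lines. New labels (like DOC) aren’t treated as errors, just flagged so you remember to mention them.

```python
import re

# Labels taken from the list above; anything else gets flagged for a note.
KNOWN_LABELS = {
    "C", "H", "D", "M", "T", "P", "SD", "SS",
    "MOE", "ROS", "MON", "MISC", "SFX",
}

# Label (capital letters), colon, single space, then the balloon text.
LINE_RE = re.compile(r"^([A-Z]+): (\S.*)$")


def check_markup(lines):
    """Return (parsed, problems) for an iterable of marked-up lines."""
    parsed, problems = [], []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # blank lines are fine
        match = LINE_RE.match(line)
        if match is None:
            problems.append((number, "not in 'LABEL: text' form"))
            continue
        label, text = match.groups()
        if label not in KNOWN_LABELS:
            # Allowed, but worth mentioning when the work is sent back.
            problems.append((number, f"new label {label!r}, worth a note"))
        parsed.append((label, text))
    return parsed, problems
```

Calling check_markup(open("my_chunk.txt")) (the filename is hypothetical) returns the parsed lines plus a list of line numbers to double-check.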

2. Different panel/balloon, different line.

In cases where the same character speaks in multiple balloons or panels in a row, I’ve been giving a separate line to each instance. This doesn’t affect my Markov project’s performance, but it feels like a cleaner representation of the structure of the strip’s flow to me.

3. Doubling up shared dialogue

Occasionally, two characters (usually Calvin and Hobbes) will be shown shouting something simultaneously, either with twin word-balloon tails or by implication with more stylized non-balloon text. I’ve preferred to duplicate those lines, giving a separate copy to each character on two consecutive lines. The original transcript generally doesn’t duplicate the text in these contexts, so you’ll need to re-type or copy and paste.

4. Capitalization across panel/balloon breaks

The original transcript often declines to capitalize new panel/balloon starts when continuing dialogue across a break/ellipsis; I prefer to start new lines capitalized for consistency, and so would recommend capitalizing in these situations for the next line, e.g.

Sure, I think it's ... wait a minute.

becomes:


H: Sure, I think it's ...
H: Wait a minute.
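The capitalization step is mechanical enough that it could be scripted once the line breaks are in place. Here’s a small Python sketch of the idea; the function name capitalize_balloon is hypothetical, and it assumes each line already has its "LABEL: " prefix. Lines that start with a digit or are already capitalized come out unchanged.

```python
def capitalize_balloon(line):
    """Uppercase the first letter of the dialogue in a 'LABEL: text' line."""
    label, sep, text = line.partition(": ")
    if not sep:
        return line  # no label found; leave the line alone
    for i, ch in enumerate(text):
        if ch.isalnum():
            # Stop at the first letter or digit; upper() leaves digits as-is.
            return label + sep + text[:i] + ch.upper() + text[i + 1:]
    return line
```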

5. Spelling/transcription errors

The original transcripts are pretty good, but (not surprising given how much of a slog it must have been to type them out) there are occasional typos and a couple recurring issues worth fixing opportunistically if you notice them as you work.

Aside from occasional literal spelling/typing errors, the most common thing I’ve been fixing is the inappropriate use of periods where commas should go and vice versa (and corresponding changes to the following capitalization), and the outright omission of commas. Fix ’em if you see ’em.

Calvin and Markov

I’ve spent the last few days building a random generator internet toy called Calvin and Markov. It generates random new weird variations on Bill Watterson’s classic, wonderful comic strip, Calvin and Hobbes, using a Markov chain process and a few hundred lines of Perl code.

It’s a fun, odd machine to just play around with, but if you’re interested in how it works, I’ll detail that below, and include some thoughts on C&H itself and why I built this thing.

(Update: if you’d like to help get the rest of the C&H transcripts neatened up, see here.)

calkov 1

People who know me won’t be surprised that I’m messing around with Markov chains; it’s one of my favorite little intersections of math and linguistic/artistic weirdness, a fairly simple way of analyzing the frequencies of events (like the order in which words appear in a bunch of written text) in order to produce new, novel, semi-coherent output. I’ve built a lot of little Markov-related things over the years.

And some folks who don’t know me at all may find this specific Markov-plus-comics idea familiar anyway because of an old project of mine, on which this new one is based: Garkov, a random Garfield strip generator I wrote several years ago. Calvin and Markov is the result of fits-and-starts cleanup work I’ve done on that original code over the years; at its heart it does just what Garkov does, but does it in a somewhat less stupid way on a couple fronts.

Okay, but how?

calkov 3

There’s a few moving parts to C&M:

1. The Markov chain code

This is a custom implementation of a Markov chain process that I wrote in Perl a few years back because (a) I’m finicky about how my Markoving works and (b) it seemed like a fun thing to write. It’s a content-neutral set of functions — nothing about it is specific to Calvin, or to comic strips. It’s just a bunch of code that will digest an arbitrary collection of text and then burp out new weird sentences when you ask it to.
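For a rough idea of what a word-level Markov chain looks like in code, here’s a minimal Python sketch. It is not the Perl implementation described here; the chain order of 2, the start/end markers, and the function names build_table and generate are all assumptions made for illustration.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence boundary markers


def build_table(lines, order=2):
    """Map each run of `order` words to the words seen following it."""
    table = defaultdict(list)
    for line in lines:
        words = [START] * order + line.split() + [END]
        for i in range(len(words) - order):
            key = tuple(words[i:i + order])
            table[key].append(words[i + order])
    return table


def generate(table, order=2, max_words=30, rng=random):
    """Walk the table from the start state until END or max_words."""
    state = (START,) * order
    out = []
    while len(out) < max_words:
        # Repeated followers appear repeatedly in the list, so a uniform
        # choice here is weighted by how often each follower was seen.
        word = rng.choice(table[state])
        if word == END:
            break
        out.append(word)
        state = state[1:] + (word,)
    return " ".join(out)
```

Feed build_table a list of lines, then call generate on the result as many times as you like. With very little input the output mostly parrots the source; the weirdness shows up once many lines share word pairs.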

I’ve made a few small improvements to this code as I’ve revisited it the last few days, but it was feature-complete already.

2. The comic strip art

There’s lots of places to find C&H strips on the internet, both on official comics-hosting sites and elsewhere. I found an archive of the series run sitting around somewhere or other, with all the daily strips rendered to 600 pixels wide, and have been using that.

I’ve selected a few dozen strips that I like, featuring characters with enough dialogue (except maybe Moe, he doesn’t talk much) that they have a variety of things to say, and blanked out the original dialogue in the art (and in some cases tweaked the word balloons to be a little more accommodating to my text insertion process).

It takes a minute or two to blank and neaten up each strip for this step, but there’s some additional work after that, setting up a strip definition file (see below) that adds a few more minutes to the process. Adding more strips to the project is doable in a piece-by-piece fashion and I’ll likely continue adding strip templates in the future to up the variety some more.

3. The dialogue from the comics

Calvin and Hobbes ran daily for about ten years, which, accounting for a couple of sabbaticals by Watterson, means there were on the order of 3,000 original strips published. That’s a lot of text to work with, which is great in theory, but newspaper comics didn’t come with convenient plaintext transcripts. I went into this project knowing I might have to do a lot of typing just to have material to feed to the Markov process. That’s what I did for Garkov, which is why the input corpus for Garkov is based on a few hundred strips rather than a much more significant chunk of the strip’s archives.

Luckily for me, someone, years ago, already decided to tackle this, transcribing with reasonably good accuracy the entire C&H archive, strip by strip. (This is apparently what powers Mike Yingling’s C&H search engine.)

Unluckily for me, they did so by treating each comic strip as a single, run-all-together line of dialogue text without any character breakouts. Which is fine if you’re just trying to search for a strip — if the word “Krakow” appears anywhere in the strip, that strip will be a match — but it’s a problem if you’re doing something character-specific like this project. I want Calvin to say Calvin stuff, Hobbes to say Hobbes stuff, and so on for Dad, Mom, Susie Derkins, etc.

And so I escaped the need to transcribe, but have still had to do a bunch of markup on that original transcription work, for each strip turning a line like this:

19861225 Psst! Are you awake? Is it Christmas? It is! It is! Let's go wake Mom and Dad and open all our loot! Since it's Christmas maybe we should let them sleep a little. That's long enough! Wake up! Wake up! It's Christmas!! Quarter to 6. He let us sleep in this year.

Into a set of lines like this:

C: Psst! Are you awake?
H: Is it Christmas? It is! It is!
C: Let's go wake Mom and Dad and open all our loot!
H: Since it's Christmas maybe we should let them sleep a little.
C: That's long enough! Wake up! Wake up! It's Christmas!!
D: Quarter to 6. He let us sleep in this year.

That lets me break out each character into a separate collection of lines, and create individual Markov table “brains” for use in strip generation. I also wrote a small script to do that sorting-into-separate-files bit so I don’t have to do the sorting and copying and pasting manually.
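The sorting step is simple enough to sketch in a few lines. This is a Python illustration rather than my actual script; the function names and the one-file-per-label layout (C.txt, H.txt, and so on) are assumptions.

```python
from collections import defaultdict
from pathlib import Path


def split_by_character(lines):
    """Group 'LABEL: text' lines into {label: [text, ...]}."""
    by_label = defaultdict(list)
    for raw in lines:
        label, sep, text = raw.strip().partition(": ")
        if sep and text:  # skip blank or unlabeled lines
            by_label[label].append(text)
    return dict(by_label)


def write_corpora(by_label, out_dir):
    """Write one plain-text file per label, one balloon per line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for label, texts in by_label.items():
        path = out / f"{label}.txt"
        path.write_text("\n".join(texts) + "\n", encoding="utf-8")
```

Each resulting file can then be fed to the Markov code on its own to build that character’s table.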

So far I’ve marked up the 1985 and 1986 strips, or about 14 months total. That adds up to about 1300 separate lines for Calvin (I peeked, it’s 1,268 lines containing 10,832 words total), a few hundred for Hobbes, 100+ each for Calvin’s mom and dad, and on the order of dozens for the other recurring characters. In a Markov project like this, more input text is generally better (you get more varied, weird, unexpected results), and so marking up more of the transcripts would be a good long-term goal, but it’s tedious and time-consuming; at a good clip, I can do a month’s worth of strips in about 15 minutes, but with what’s done so far that leaves something like 25-30 hours of additional work just to mark up the remaining bulk of transcript. I may keep chipping at it myself; I may try crowdsourcing some of that work if folks are interested in helping.

4. Strip definitions

Blanking out the original dialogue in word balloons gives me a canvas to work with, but my code still needs to know where to actually try and put new words. I ginned up a very simple definition file, describing a set of rectangular areas into which my code would paint dialogue; on each run, the program selects a strip template at random, reads in the actual image file (e.g. strip_02.gif) and then the definition file (strip_02.def), and uses both to do its work.

For example, this strip:

strip_02 template

…has the following definition file, listing on the first line which characters appear in the strip (so the code can be appropriately parsimonious and not bother loading Markov data for any other characters) and then on each subsequent line a character who speaks (so the code knows whose “brain” to pull from) and the geometry of the rectangle representing their word balloon:

calvin dad
calvin 98 8 90 4
dad 183 7 45 3
calvin 252 25 90 5
calvin 375 10 140 5
dad 525 10 140 3
calvin 550 55 100 5

Those numbers represent, respectively, the x-position in pixels of the center of the word balloon, the y-position of the top, the width of the balloon in pixels, and the maximum number of lines of dialogue that can appear. That’s a bit of a hacky mess of a format, but it works well enough for these purposes and is reasonably simple for me to generate by hand; if I clean up and generalize the code in the future, I’ll revisit it a bit. (Likely plan: make the x and y coordinates both center values for the middle of the target rectangle, and replace the max-number-of-lines value with a raw pixel distance with the script doing the math at runtime to figure out how many lines that can accommodate.)
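To make the format concrete, here’s a hedged Python sketch of a reader for it. The real code is Perl and surely looks different; the Balloon class, the parse_definition name, and the derived x_left value are all made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Balloon:
    character: str
    x_center: int   # x-position of the balloon's center, in pixels
    y_top: int      # y-position of the balloon's top edge, in pixels
    width: int      # balloon width, in pixels
    max_lines: int  # most lines of dialogue that can appear

    @property
    def x_left(self):
        """Left edge of the rectangle, derived from center and width."""
        return self.x_center - self.width // 2


def parse_definition(text):
    """Return (cast, balloons) from the contents of a .def file."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    cast = rows[0]  # first line: the characters appearing in the strip
    balloons = []
    for row in rows[1:]:
        name, *numbers = row
        if name not in cast:
            raise ValueError(f"{name!r} speaks but is not in the cast line")
        x_center, y_top, width, max_lines = map(int, numbers)
        balloons.append(Balloon(name, x_center, y_top, width, max_lines))
    return cast, balloons
```

For the first balloon in the example above (calvin 98 8 90 4), x_left works out to 98 - 45 = 53.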

Generating a strip definition involves some eyeballing with a cursor tool in Photoshop Elements, and then a little bit of nudging (a few pixels left or right or up or down) when I look at the text being populated by the program in practice. A tool that actually translated drawn rectangles into numbers would speed up the process, but I didn’t feel like spending time trying to build that when I could just rough out some definitions by hand to get this initial pass working. Something for the future.

5. Lettering

The lettering in a comic strip (the actual text as written in the panels) is a key part of its visual identity, if not always the most obvious one. Garfield has a clean, very even balloon-ish lettering, like Comic Sans’ sober cousin; Mark Trail has a sturdy feel with bold verticals canted to the right; Zippy the Pinhead has a loose, scrawled-in-a-notebook lilt that matches its surreal tone.


Change the lettering, and a comic just won’t look like itself.

And Bill Watterson’s lettering is as central to the look of Calvin and Hobbes as almost any comic I can think of; his narrow, unstuffy left-leaning capitals are just how C&H looks, an integral part of not just the meaning but the feel of the strip, even before you take into account those panels where he gets particularly expressive with larger text, alternate lettering styles, sound effects and shouted words and so on.

Truly replicating expressive hand-lettered text automatically is a difficult job and well out of scope of a project like this, but I was happy to see that someone had created a basic font based on Watterson’s lettering, and shared it on DeviantArt. I’ve used that for the text of the strip, to reasonably good effect; my auto-generated text is mechanical and inorganic compared to the real deal, and cramped and often inelegantly placed within balloons on account of the lack of artistic vision on my Markov code’s part, but it at least doesn’t immediately scream NOT BILL’S LETTERING the way subbing in some generic “comic” typeface would have.

For the title of the strip, I was able to find another font based on the familiar sticks-and-angles style of Calvin’s own handwriting that folks most likely associate with the strip.

6. ImageMagick

Aside from the Markov code responsible for the text generation, my Perl script does some basic stuff to wrangle all the rest of the above, most of which isn’t really interesting. But the actual pasting together of an image is a pretty key bit, and that’s something I didn’t want to have to figure out from scratch.

To that end, I used a software library called ImageMagick that handles image generation and manipulation, and which does the work of turning the strip templates and lettering font and generated text into a final, rendered image. (This is a big improvement over Garkov, which renders a strip as a bunch of individual CSS-positioned single-letter images on top of a blanked-out strip, which makes keeping or sharing an image unnecessarily difficult.)

ImageMagick is fast, reasonably powerful, and a confusing goddamn mess to work with. I heartily recommend it and suggest you stay the hell away. It’s that sort of library.

I’ve found myself reaching for other tools in the last few years; if I were starting on this project from scratch today, Perl and ImageMagick are probably not where I’d aim. But having the Garkov codebase to work with was such a head start on this that coming back to those and doing a little bit of angry wrestling was worth the pain.

Okay, but why?

calkov 2

How is easy; I lowered a shoulder and nerded on through it and here we are. Why is trickier. There’s a few different kinds of why.

Why bother?

Because it was (a very specific kind of) fun putting this together, revisiting and improving my Garkov code, getting the whole thing working as well as I could. I really liked making Garkov back in the day, but I also burnt out on it around the time I put it out in the world, partly just because of all the tricky bits I had to sort through and try (and in some cases fail) to solve in order to get it up and running.

Giving this idea another shot years later with everything I’d learned the first time driving me forward more quickly has been satisfying. I may take this newer, cleaner approach and apply it to a Garkov 2.0; I may throw it at some other comics; I may try to get it into other folks’ hands so ten thousand Markoving comic resynthesis projects can bloom.

Why Calvin and Hobbes?

Because I like C&H. Because other folks like it. Because it’s recognizable, and familiar, and the familiarity of the original lends a kind of weird suspension of disbelief to the broken, altered output of this kind of transmogrification process—if a thing looks enough like the real thing, we try to treat it like the real thing a little longer, give it the benefit of the doubt even as we know we should be doubting it. No one will really be fooled by Calvin and Markov, but all of us who’ve read thousands of the originals are wired to sort of give it credit long enough to produce a double-take, which is great.

Why treat C&H so weirdly?

One thing I’ve thought about while working on this (and I’ve heard it from at least one friend I showed the work in progress to as well) is how different Calvin and Hobbes is, as a cultural property, from the previous choice of Garfield. They’re both totemic, instantly recognizable comic strips, but that’s about the end of the similarities; C&H is loved for its doting, dynamic inkwork and artful writing and characterization, while Garfield is generally derided for its predictability, minimalist and samey art, and overall cash-in, sell-out, factory-produced sterility.

And so building Garkov, a machine that swallows up Garfield strips and spits out something dada and broken and absurd, seems like sort of a gimme. Of course people should fuck with Garfield. What else is it good for? I hate Mondays. Etc. It is, however justly or not, an easy target. (And I was not by far the first or the last to futz around with Garfield as a template for recontextualized weirdness; see the links at the bottom of the Garkov page for many others.)

Whereas C&H is a strip people hold up high as more or less the zenith of the modern newspaper comic strip, a piece of work that was so consistently beautiful and smart and heartfelt and uncompromising that nothing on the page could compete with it during its ten-year run, and nothing has been able to replace it in the years since. C&H was funny, but it wasn’t a joke; as mainstream pop cultural artifacts go, it’s pretty unfuckwithable.

You mess with Garfield, no one says How Dare You. Calvin and Hobbes, though…

So I’ve wondered as I built this how people would feel about it. Not so much that I expect condemnation (weird for weird’s sake gets by okay on the internet, and I doubt anyone will get the mistaken impression that I mean any harm here), but really just how they’ll feel about the oddball output of this, given their likely fonder relationship with the source material than in the case of Garkov.

Take a well-written, well-remembered comic strip and render it incoherent, and…and then what? And why?

I didn’t really know what I was going to get when I started. I wasn’t sure if I was going to get anything, honestly, other than new, bad Calvin and Hobbes strips. But I have seen some stuff in the output as I’ve developed this that I genuinely like.

And mostly what that is is this: Calvin as an actual, deeply weird little kid. Not the apt, smartly-written, nail-it-in-four-panels Calvin of Watterson’s work, the kid who we understand to be a kid despite his tremendous vocabulary and delightful, imaginary-or-is-he tiger friend, but a real scatterbrained oddball, the unvarnished stream-of-consciousness pile of developing brain cells that parents and teachers end up dealing with.

The little kid who changes the subject every five seconds. The little kid who says bizarre, contextless things, not as a punchline to a three-panel setup with a beautifully drawn alligator but just actually genuinely out of nowhere. Here’s a Calvin who confuses us, the readers, just as much as he does the adults around him in Watterson’s final-panel reveals.

Not Calvin the apposite, but Calvin the apropos-of-nothing. A Calvin whose head we don’t get to see inside of, a Calvin who we can’t keep up with.

It’s a neat thing, and a plenty satisfying outcome of these last few days of work. I love when something like this can surprise me.