OCRing Music from YouTube with Common Lisp: Difference between revisions

Line 1: Line 1:
[[File:Article1.png|right|400px]]


There's [https://www.youtube.com/watch?v=wIEUnBYPtFM a tune on YouTube I always really liked called "Supersquatting."] It was written by Dubmood and Zabutom, two really masterful chiptune composers, but this track always stood out to me as sounding really full and "fat." For those who don't know, these kinds of chiptunes are usually written in "tracker" software, a category of music composing program that was popular in the 90s and early 2000s-- they produce "module" files, which are self-contained music tracks containing both the note data and samples, and which could be played back quickly in a game, keygen, or just for listening. There's a culture surrounding music trackers that's enmeshed with the demoscene, video games, and software cracking, but I digress-- here's a [https://www.youtube.com/watch?v=aiILSgNt23E nice in-depth explanation video about it] if you're interested.


The point is, this track Supersquatting in particular sounds really full, beyond what I normally considered possible with the (these days) rudimentary tools (FastTracker II), so when I first heard it I actually assumed it must have been made in a more traditional DAW with VSTs and stuff. But then Dubmood himself put up an [https://www.youtube.com/watch?v=PmfuUxfgfDA upload on YouTube of the track playing in Skale Tracker], and I saw that it's just an 8-channel FastTracker II tune. Damn. Maybe the channels have some EQ on them or something, but still, that sounds awesome for an XM file.
Line 11: Line 11:
= Attempt 1: Tesseract =


Of all the classical OCR libraries out there, Tesseract is probably the most famous. There are a few knobs to tweak on it, but in general you just chuck your image at it and let it rip. I honestly figured this would work really well, since these are monospaced, easily readable characters that should theoretically be a perfect match for these old-skool OCR techniques. There's a [https://github.com/GOFAI/cl-tesseract Lisp binding out there for it], so I quickly grabbed it, pointed it at a sample image, and...


[[File:Article2.png|600px]]


Wtf? I tried all sorts of techniques to pre-process the image, align the text, whatever, and Tesseract sucked every time. I'm not sure if it's optimized for print or what, but I just could not for the life of me get it to produce correct scans more than like 50% of the time. So it seems to be nearly worthless in this day and age.
Line 25: Line 25:
[[File:Article3.png|600px]]


Urghhh, kinda better than Tesseract, but still, wtf. Again I tried a bunch of methods to enhance the readability of this, but nothing really worked perfectly. It gets it right most of the time but then occasionally just goes nuts and outputs something totally wrong. I guess that's what you get when you have "intelligence" interpreting visual data. I also tried Gemini, same situation. Looking back, maybe I should have cranked the temperature down, but regardless, this solution is a bit overkill anyway, since it does a separate HTTP request to a massive GPU-based model for every little chunk of text, costs a (relative) fortune, and takes forever.


= Attempt 3: Oldskool Pixel Diffing =
Line 83: Line 83:
I guess you could do this with Python+Jupyter as well, but I dunno, this just really feels nifty to me, like it's a natural extension of the REPL experience. It also helped massively while testing out threshold values for the image to see what would work best for classification.
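
To make the threshold-then-diff idea concrete, here's a minimal sketch of that style of classification. It is not the code from the repo: <code>binarize</code>, <code>pixel-distance</code>, and <code>classify-glyph</code> are hypothetical names, and it assumes each glyph arrives as a flat vector of 0-255 grayscale values with one pre-binarized reference template per character.

<syntaxhighlight lang="lisp">
(defun binarize (pixels threshold)
  "Return a bit vector with 1 wherever a grayscale value in PIXELS exceeds THRESHOLD."
  (map 'bit-vector (lambda (p) (if (> p threshold) 1 0)) pixels))

(defun pixel-distance (a b)
  "Count the positions where the equal-length bit vectors A and B differ."
  (count 1 (bit-xor a b)))

(defun classify-glyph (pixels templates &key (threshold 128))
  "Return the label of the template closest to PIXELS, plus its distance.
TEMPLATES is an alist of (label . bit-vector); this layout is an assumption."
  (let ((bits (binarize pixels threshold))
        (best-label nil)
        (best-distance nil))
    (loop for (label . template) in templates
          for distance = (pixel-distance bits template)
          when (or (null best-distance) (< distance best-distance))
            do (setf best-label label
                     best-distance distance))
    (values best-label best-distance)))

;; Toy 2x2 "glyphs", purely for illustration.
(classify-glyph #(200 10 30 250)
                (list (cons #\A #*1001) (cons #\B #*1111)))
;; => #\A, 0
</syntaxhighlight>

Because the threshold is just a keyword argument, it's easy to re-run the same glyph with different values at the REPL and watch the distance change.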


I wired it all up by having FFmpeg dump out a series of BMP images to a pipe (so I could quickly parse the buffer size and read it into <code>lisp-magick-wand</code>) and set up a parallelized loop to call <code>classify</code> and store the parsed-out data.
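
The reason BMP is handy over a pipe is that every BMP file states its own total length in its header: after the two-byte <code>BM</code> magic, bytes 2-5 hold the file size as a little-endian 32-bit integer. Here's a minimal sketch of pulling whole frames off a binary stream that way. It's only an illustration, not the repo's code: <code>read-u32-le</code> and <code>read-bmp-frame</code> are hypothetical names, and it assumes the encoder fills in the size field correctly.

<syntaxhighlight lang="lisp">
(defun read-u32-le (buffer offset)
  "Decode an unsigned 32-bit little-endian integer from BUFFER at OFFSET."
  (logior (aref buffer offset)
          (ash (aref buffer (+ offset 1)) 8)
          (ash (aref buffer (+ offset 2)) 16)
          (ash (aref buffer (+ offset 3)) 24)))

(defun read-bmp-frame (stream)
  "Read one complete BMP file from the octet STREAM.
Returns the frame as an octet vector, or NIL if the stream ends first."
  (let ((header (make-array 6 :element-type '(unsigned-byte 8))))
    (when (= (read-sequence header stream) 6)
      ;; 66 and 77 are the ASCII codes for the "BM" magic.
      (assert (and (= (aref header 0) 66) (= (aref header 1) 77))
              () "Stream is not positioned at a BMP header")
      (let* ((size (read-u32-le header 2))
             (frame (make-array size :element-type '(unsigned-byte 8))))
        ;; Keep the six header bytes already consumed, then read the rest.
        (replace frame header)
        (when (= (read-sequence frame stream :start 6) size)
          frame)))))
</syntaxhighlight>

Calling <code>read-bmp-frame</code> in a loop until it returns <code>NIL</code> yields one octet vector per video frame, each of which can be handed off to a worker for classification.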


<syntaxhighlight lang="lisp">
<syntaxhighlight lang="lisp">
Line 124: Line 124:
[[File:Supersquatting-bad.mp3]]


Well, it doesn't sound nearly as good as the original module, but that's because we're missing all the original samples-- I just replaced most of them with a square wave. There are also a few "typos" from video artifacts causing improper classification of the letters. Meh, good enough, my curiosity about this particular "nerd snipe" is satisfied for now. Now that I (mostly) have the note data for this track, I'd love to do an arrangement for the OPL3 and include it in [[Cubeat | a game I'm making...]] just need to ask Dubmood and Zabutom for permission first :)


Code is all [https://github.com/SuperDisk/trackergrabber here.]
= Addendum =


It has nothing to do with Lisp, but if you like fun video processing problems and want to work with me, [https://www.workatastartup.com/companies/recall-ai apply to Recall.ai ;)]
