OCRing Music from YouTube with Common Lisp

There's a tune on YouTube I always really liked called "Supersquatting." It was written by Dubmood and Zabutom, two really masterful chiptune composers, but this track always stood out to me as sounding really full and "fat." For those who don't know, these kinds of chiptunes are usually written in "tracker" software which is a category of music composing program that was popular in the 90s and early 2000s-- they produce "module" files which are self-contained music trac...
Of all the classical OCR libraries out there, Tesseract is probably the most famous. There are a few knobs to tweak on it, but in general you just chuck your image at it and let it rip. I honestly figured this would work really well, since these are monospaced, easily readable characters that should theoretically be a perfect match for these old-skool OCR techniques. There's a Lisp binding out there for it: https://github.com/GOFAI/cl-tesseract, so I quickly grabbed it, pointed it at a sample image, and...


<dogshit>


Wtf? I tried all sorts of techniques to pre-process the image, align the text, whatever, and Tesseract sucked every time. I'm not sure if it's optimized for print or what, but I just could not for the life of me get it to produce correct scans more than like 50% of the time. So it seems to be nearly worthless in this day and age.
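(If you want to reproduce the Tesseract attempt without going through the binding, shelling out to the tesseract CLI works too. This is a sketch of my own, not code from the post; `ocr-file` is a hypothetical name, and `--psm 6` tells Tesseract to assume a single uniform block of text, which is the closest fit for a tracker's pattern grid.)

```lisp
;; Sketch, not the post's code: OCR one image by shelling out to the
;; tesseract CLI. "stdout" as the output base makes it print the text
;; instead of writing a file; --psm 6 means "assume a single uniform
;; block of text."
(defun ocr-file (path)
  (uiop:run-program (list "tesseract" (namestring path) "stdout" "--psm" "6")
                    :output '(:string :stripped t)))
```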
I've historically been really impressed by GPT-4o's ability to transcribe images. This isn't scientific evidence or anything, but it usually transcribes almost everything I've chucked at it perfectly, even minimally legible text, so I thought that on this platonically ideal OCRable text, it should never fail... right?


<dogshit>


Urghhh, kinda better than Tesseract, but still, wtf. Again I tried a bunch of methods to enhance the readability of this, but nothing really worked perfectly. It gets it right most of the time but then occasionally just goes nuts and puts something totally wrong. I guess that's what you get when you have "intelligence" interpreting visual data. I also tried Gemini, same situation. Looking back, maybe I should have cranked the temperature down, but regardless, this solution is a bit overkill anyway, since it's doing a separate HTTP request to a massive GPU-based model for every little chunk of text, costs a (relative) fortune, and takes forever.
I wrote a little code to load up the directory, and a function to convert a series of "rectangles of interest" into an OCR'd string.


<syntaxhighlight lang="lisp">
(defun load-dir (dir)
   (loop for file in (uiop:directory-files dir)
   ...
                       (alexandria:extremum #'< :key (lambda (x) (compare-images (car x) crop)))
                       (cdr)))))))
</syntaxhighlight>
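The heart of this is nearest-template classification: score a cropped glyph against every known template by total pixel difference, and take the minimum. Here's a self-contained sketch of that idea; the 2D-array image representation and the `classify-crop` name are my assumptions for illustration (the real code works on MagickWand images and uses `alexandria:extremum` for the minimum).

```lisp
;; Illustrative sketch of nearest-template classification. Images are
;; plain 2D arrays of grayscale values here (an assumption; the post
;; operates on MagickWand images).
(defun compare-images (a b)
  "Total absolute pixel difference between images A and B; 0 = identical."
  (loop for i below (array-total-size a)
        sum (abs (- (row-major-aref a i) (row-major-aref b i)))))

(defun classify-crop (crop templates)
  "TEMPLATES is an alist of (image . character). Return the character
whose template image is closest to CROP."
  (cdr (reduce (lambda (best next)
                 (if (< (compare-images (car next) crop)
                        (compare-images (car best) crop))
                     next
                     best))
               templates)))
```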


I included Lisp in the title so I'll give you what you came for and sing its praises a little ;)
To do image handling, I used lisp-magick-wand: https://github.com/ruricolist/lisp-magick-wand which is a thin wrapper around ImageMagick. One thing I like about Lisp is that you can easily wrap native code (even automatically: https://github.com/rpav/cl-autowrap) and then play around with it in the REPL. There's lots of really useful C/Rust/whatever libraries out there, but the edit/compile/run cycle just makes iterating so clunky and slow. Being able to quickly load up the code and start messing around with it is phenomenal for productivity. To make it even better, I found that SLIME comes with a contrib called `slime-media` which lets you display images in the REPL. I quickly wrote up a wrapper function:


<syntaxhighlight lang="lisp">
(defun show (wand)
   (let ((name (format nil "/dev/shm/~a.png" (gensym))))
     ...
     (swank:eval-in-emacs `(slime-media-insert-image (create-image ,name) ,name))
     wand))
</syntaxhighlight>


and suddenly I had the ability to interactively do image operations and immediately see the result! I guess you could do this with Python+Jupyter as well, but I dunno, this just really feels nifty to me, like it's a natural extension of the REPL experience.
Anyway, I wired it all up by having FFmpeg dump out a series of BMP images to a pipe (so I could quickly parse the buffer size and read it into `lisp-magick-wand`) and set up a parallelized loop to call `classify` and store the parsed-out data.


<syntaxhighlight lang="lisp">
(defun drive ()
   (let ((process (uiop:launch-program
   ...
           (run output-stream))
       (uiop:close-streams process))))
</syntaxhighlight>
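A note on why BMP works so nicely over a pipe: every BMP file starts with the magic bytes "BM" followed by its total size as a little-endian 32-bit integer, so six header bytes tell you exactly how much to read for the whole frame. A sketch of such a reader (my reconstruction, not the post's code):

```lisp
;; Sketch (not the post's code): read one BMP frame from a binary stream.
;; A BMP file begins with "BM" and then its total byte size, little-endian.
(defun read-bmp-frame (stream)
  "Read a single BMP image from STREAM and return it as a byte vector."
  (let ((header (make-array 6 :element-type '(unsigned-byte 8))))
    (read-sequence header stream)
    (assert (and (= (aref header 0) (char-code #\B))
                 (= (aref header 1) (char-code #\M)))
            () "Stream is not positioned at a BMP header")
    ;; Bytes 2-5 hold the total file size as a little-endian 32-bit integer.
    (let* ((size (loop for i from 2 below 6
                       sum (ash (aref header i) (* 8 (- i 2)))))
           (frame (make-array size :element-type '(unsigned-byte 8))))
      (replace frame header)                 ; keep the 6 bytes we consumed
      (read-sequence frame stream :start 6)  ; then read the rest of the frame
      frame)))
```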


From there it was a simple matter of formatting the rows according to OpenMPT's paste format, and sending it to the clipboard.
Here's another neat trick for you:


<syntaxhighlight lang="lisp">
(defmacro with-output-to-clipboard (&body body)
   "Captures output from body forms and sends it to xclip. Returns the captured string."
   ...
                         :force-shell nil))
     result))
</syntaxhighlight>
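Under the assumption that the macro captures `*standard-output*` with `with-output-to-string`, a working version might look like this (a reconstruction sketch, not necessarily the original):

```lisp
;; Reconstruction sketch of the clipboard macro. Assumes output capture
;; via WITH-OUTPUT-TO-STRING; the original may differ in details.
(defmacro with-output-to-clipboard (&body body)
  "Captures output from body forms and sends it to xclip. Returns the captured string."
  `(let ((result (with-output-to-string (*standard-output*)
                   ,@body)))
     (uiop:run-program '("xclip" "-selection" "clipboard")
                       :input (make-string-input-stream result)
                       :force-shell nil)
     result))
```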


So then I can just do:


<syntaxhighlight lang="lisp">
(with-output-to-clipboard (print-all-orders))
</syntaxhighlight>


and paste the result directly into OpenMPT. The crazy thing is, it actually worked:


<video>


Well, it doesn't sound nearly as good as the original module, but that's because we're missing all the original samples-- I just replaced most of them with a square wave. There are also a few "typos" from video artifacts causing improper classification of the letters. Meh, good enough, my curiosity about this particular "nerd snipe" is satisfied for now. Now that I (mostly) have the note data for this track, I'd love to do an arrangement for the OPL3 and include it in a game I'm making... just need to ask Dubmood and Zabutom for permission first :)


It has nothing to do with Lisp, but if you like fun video processing problems and want to work with me, apply to Recall.ai: link ;)