OCRing Music from YouTube with Common Lisp

There's a tune on YouTube I always really liked called "Supersquatting." It was written by Dubmood and Zabutom, two really masterful chiptune composers, but this track always stood out to me as sounding really full and "fat." For those who don't know, these kinds of chiptunes are usually written in "tracker" software which is a category of music composing program that was popular in the 90s and early 2000s-- they produce "module" files which are self-contained music trac...
Of all the classical OCR libraries out there, Tesseract is probably the most famous. There are a few knobs to tweak on it, but in general you just chuck your image at it and let it rip. I honestly figured this would work really well, since these are monospaced, easily readable characters that should theoretically be a perfect match for these old-skool OCR techniques. There's a Lisp binding out there for it: https://github.com/GOFAI/cl-tesseract, so I quickly grabbed it, pointed it at a sample image, and...


<dogshit>


Wtf? I tried all sorts of techniques to pre-process the image, align the text, whatever, and Tesseract sucked every time. I'm not sure if it's optimized for print or what, but I just could not for the life of me get it to produce correct scans more than like 50% of the time. So it seems to be nearly worthless in this day and age.
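(If you want to reproduce the Tesseract attempt without going through the binding, shelling out to the tesseract CLI works too. This is a sketch of my own, not code from the post; `ocr-file` is a hypothetical name, and `--psm 6` tells Tesseract to assume a single uniform block of text, which is the closest fit for a tracker's pattern grid.)

```lisp
;; Sketch, not the post's code: OCR one image by shelling out to the
;; tesseract CLI. "stdout" as the output base makes it print the text
;; instead of writing a file; --psm 6 means "assume a single uniform
;; block of text."
(defun ocr-file (path)
  (uiop:run-program (list "tesseract" (namestring path) "stdout" "--psm" "6")
                    :output '(:string :stripped t)))
```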
I've historically been really impressed by GPT-4o's ability to transcribe images. This isn't scientific evidence or anything, but it usually transcribes almost everything I've chucked at it perfectly, even minimally legible text, so I thought that on this platonically ideal OCRable text, it should never fail... right?


<dogshit>


Urghhh, kinda better than Tesseract, but still, wtf. Again I tried a bunch of methods to enhance the readability of this, but nothing really worked perfectly. It gets it right most of the time but then occasionally just goes nuts and puts something totally wrong. I guess that's what you get when you have "intelligence" interpreting visual data. I also tried Gemini, same situation. Looking back, maybe I should have cranked the temperature down, but regardless, this solution is a bit overkill anyway, since it's doing a separate HTTP request to a massive GPU-based model for every little chunk of text, costs a (relative) fortune, and takes forever.
I wrote a little code to load up the directory, and a function to convert a series of "rectangles of interest" into an OCR'd string.


<syntaxhighlight lang="lisp">
(defun load-dir (dir)
   (loop for file in (uiop:directory-files dir)
   ...
                       (alexandria:extremum #'< :key (lambda (x) (compare-images (car x) crop)))
                       (cdr)))))))
</syntaxhighlight>
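The heart of this is nearest-template classification: score a cropped glyph against every known template by total pixel difference, and take the minimum. Here's a self-contained sketch of that idea; the 2D-array image representation and the `classify-crop` name are my assumptions for illustration (the real code works on MagickWand images and uses `alexandria:extremum` for the minimum).

```lisp
;; Illustrative sketch of nearest-template classification. Images are
;; plain 2D arrays of grayscale values here (an assumption; the post
;; operates on MagickWand images).
(defun compare-images (a b)
  "Total absolute pixel difference between images A and B; 0 = identical."
  (loop for i below (array-total-size a)
        sum (abs (- (row-major-aref a i) (row-major-aref b i)))))

(defun classify-crop (crop templates)
  "TEMPLATES is an alist of (image . character). Return the character
whose template image is closest to CROP."
  (cdr (reduce (lambda (best next)
                 (if (< (compare-images (car next) crop)
                        (compare-images (car best) crop))
                     next
                     best))
               templates)))
```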


I included Lisp in the title so I'll give you what you came for and sing its praises a little ;)
To do image handling, I used lisp-magick-wand: https://github.com/ruricolist/lisp-magick-wand which is a thin wrapper around ImageMagick. One thing I like about Lisp is that you can easily wrap native code (even automatically: https://github.com/rpav/cl-autowrap) and then play around with it in the REPL. There's lots of really useful C/Rust/whatever libraries out there, but the edit/compile/run cycle just makes iterating so clunky and slow. Being able to quickly load up the code and start messing around with it is phenomenal for productivity. To make it even better, I found that SLIME comes with a contrib called `slime-media` which lets you display images in the REPL. I quickly wrote up a wrapper function:


<syntaxhighlight lang="lisp">
(defun show (wand)
   (let ((name (format nil "/dev/shm/~a.png" (gensym))))
     ...
     (swank:eval-in-emacs `(slime-media-insert-image (create-image ,name) ,name))
     wand))
</syntaxhighlight>


and suddenly I had the ability to interactively do image operations and immediately see the result! I guess you could do this with Python+Jupyter as well, but I dunno, this just really feels nifty to me, like it's a natural extension of the REPL experience.
Anyway, I wired it all up by having FFmpeg dump out a series of BMP images to a pipe (so I could quickly parse the buffer size and read it into `lisp-magick-wand`) and set up a parallelized loop to call `classify` and store the parsed-out data.


<syntaxhighlight lang="lisp">
(defun drive ()
   (let ((process (uiop:launch-program
   ...
           (run output-stream))
       (uiop:close-streams process))))
</syntaxhighlight>
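A note on why BMP works so nicely over a pipe: every BMP file starts with the magic bytes "BM" followed by its total size as a little-endian 32-bit integer, so six header bytes tell you exactly how much to read for the whole frame. A sketch of such a reader (my reconstruction, not the post's code):

```lisp
;; Sketch (not the post's code): read one BMP frame from a binary stream.
;; A BMP file begins with "BM" and then its total byte size, little-endian.
(defun read-bmp-frame (stream)
  "Read a single BMP image from STREAM and return it as a byte vector."
  (let ((header (make-array 6 :element-type '(unsigned-byte 8))))
    (read-sequence header stream)
    (assert (and (= (aref header 0) (char-code #\B))
                 (= (aref header 1) (char-code #\M)))
            () "Stream is not positioned at a BMP header")
    ;; Bytes 2-5 hold the total file size as a little-endian 32-bit integer.
    (let* ((size (loop for i from 2 below 6
                       sum (ash (aref header i) (* 8 (- i 2)))))
           (frame (make-array size :element-type '(unsigned-byte 8))))
      (replace frame header)                 ; keep the 6 bytes we consumed
      (read-sequence frame stream :start 6)  ; then read the rest of the frame
      frame)))
```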


From there it was a simple matter of formatting the rows according to OpenMPT's paste format, and sending it to the clipboard.
Here's another neat trick for you:


<syntaxhighlight lang="lisp">
(defmacro with-output-to-clipboard (&body body)
   "Captures output from body forms and sends it to xclip. Returns the captured string."
   ...
                         :force-shell nil))
     result))
</syntaxhighlight>
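Under the assumption that the macro captures `*standard-output*` with `with-output-to-string`, a working version might look like this (a reconstruction sketch, not necessarily the original):

```lisp
;; Reconstruction sketch of the clipboard macro. Assumes output capture
;; via WITH-OUTPUT-TO-STRING; the original may differ in details.
(defmacro with-output-to-clipboard (&body body)
  "Captures output from body forms and sends it to xclip. Returns the captured string."
  `(let ((result (with-output-to-string (*standard-output*)
                   ,@body)))
     (uiop:run-program '("xclip" "-selection" "clipboard")
                       :input (make-string-input-stream result)
                       :force-shell nil)
     result))
```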


So then I can just do:


<syntaxhighlight lang="lisp">
(with-output-to-clipboard (print-all-orders))
</syntaxhighlight>


and paste the result directly into OpenMPT. The crazy thing is, it actually worked:


<video>


Well, it doesn't sound nearly as good as the original module, but that's because we're missing all the original samples-- I just replaced most of them with a square wave. There are also a few "typos" from video artifacts causing improper classification of the letters. Meh, good enough, my curiosity about this particular "nerd snipe" is satisfied for now. Now that I (mostly) have the note data for this track, I'd love to do an arrangement for the OPL3 and include it in a game I'm making... just need to ask Dubmood and Zabutom for permission first :)


It has nothing to do with Lisp, but if you like fun video processing problems and want to work with me, apply to Recall.ai: link ;)