Thursday, January 5, 2023

How to extract Chinese hardcoded subtitles from video files

Here is a simple way to extract Chinese hardcoded subtitles from video files for free.

It helps to have basic knowledge of ffmpeg. You will need to be patient if your video file has a lot of hardcoded subtitles. You can try this tutorial on a short video clip with hardcoded subtitles to see how it works. A lot of movie trailers have hardcoded subtitles.

1 - Cropping the video file: Here is the ffmpeg script to crop the video.mp4 file used in this tutorial so that only the subtitle area remains. Please adjust the parameters according to your video file and note the file extension.

ffmpeg -i video.mp4 -filter:v "crop=1850:250:0:830" -c:a copy video-cropped.mp4
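Note that the crop parameter is "crop=width:height:x:y": the width and height of the cropped area, followed by the X and Y position of its top-left corner in the original frame.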
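
To show how the four crop numbers fit together, here is a minimal Python sketch that builds the crop filter string and the matching ffmpeg argument list. The function names and the 1920x1080 frame size are my own assumptions for illustration. Only the crop=width:height:x:y layout and the command itself come from the step above.

```python
def crop_filter(width, height, x, y):
    """Build an ffmpeg crop filter string: crop=width:height:x:y."""
    return f"crop={width}:{height}:{x}:{y}"


def bottom_strip(frame_width, frame_height, strip_height):
    """Crop filter for a full-width strip at the bottom of the frame."""
    return crop_filter(frame_width, strip_height, 0, frame_height - strip_height)


def crop_command(src, dst, crop):
    """Argument list for the crop step (the audio stream is copied unchanged)."""
    return ["ffmpeg", "-i", src, "-filter:v", crop, "-c:a", "copy", dst]


if __name__ == "__main__":
    # Values used in this tutorial.
    print(crop_filter(1850, 250, 0, 830))
    # Assumed 1920x1080 frame, keeping the bottom 250 pixels.
    print(bottom_strip(1920, 1080, 250))
    print(crop_command("video.mp4", "video-cropped.mp4", crop_filter(1850, 250, 0, 830)))
```

If ffmpeg is installed, the list returned by crop_command could be passed to subprocess.run to run the crop.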

2 - Inserting timestamps into the video file: This is the ffmpeg script to embed timestamps in video-cropped.mp4.

ffmpeg -i video-cropped.mp4 -vf "drawtext=text='timestamp\: %{pts \: hms}': x=0: y=2: fontsize=28:fontcolor=yellow: box=1: boxcolor=black" -c:a copy video-cropped-timestamps.mp4
Note that ':' is an argument separator in ffmpeg filters, so if a parameter value contains ':', it must be escaped with '\'.
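
As a small illustration of the escaping note, here is a hedged Python sketch that escapes the colons in a label and rebuilds the drawtext filter from this step. The helper names are hypothetical, and the sketch only handles ':'. It does not cover ffmpeg's other escaping rules for backslashes or quotes.

```python
def escape_colons(value):
    r"""Prefix every ':' in a filter option value with a backslash."""
    return value.replace(":", r"\:")


def drawtext_filter(label="timestamp: "):
    """Rebuild the drawtext filter from this tutorial with an escaped label."""
    # The %{pts \: hms} expansion is copied from the command above.
    text = escape_colons(label) + r"%{pts \: hms}"
    options = [
        f"text='{text}'",
        "x=0",
        "y=2",
        "fontsize=28",
        "fontcolor=yellow",
        "box=1",
        "boxcolor=black",
    ]
    # Options are joined with ':' because it is the argument separator.
    return "drawtext=" + ":".join(options)


if __name__ == "__main__":
    print(drawtext_filter())
```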

3 - Creating image files every second: This is the ffmpeg script to create one image file per second from video-cropped-timestamps.mp4.

ffmpeg -i video-cropped-timestamps.mp4 -start_number 1 -vf fps=1 video-%04d.jpg

Note: For a long movie, creating an image file every second will generate a huge number of files. You could try creating an image file every 2 or 3 seconds, or at even longer intervals.

Here is the script for every 2 seconds:

ffmpeg -i video-cropped-timestamps.mp4 -start_number 1 -vf fps=1/2 video-%04d.jpg

Here is the script for every 3 seconds:

ffmpeg -i video-cropped-timestamps.mp4 -start_number 1 -vf fps=1/3 video-%04d.jpg

Note that it is better to create an image every second so that short subtitles are not missed.
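
To get a feel for the trade-off between the interval and the number of files, here is a rough Python sketch. The function names and the 2-hour duration are assumptions for illustration, and the count is only an estimate. ffmpeg may write one image more or fewer depending on the video.

```python
import math


def fps_value(interval_seconds):
    """Value for the fps filter: one image every `interval_seconds` seconds."""
    return "1" if interval_seconds == 1 else f"1/{interval_seconds}"


def estimated_image_count(duration_seconds, interval_seconds=1):
    """Approximate number of images for a video of the given length."""
    return math.ceil(duration_seconds / interval_seconds)


if __name__ == "__main__":
    duration = 2 * 60 * 60  # assumed 2-hour movie
    for interval in (1, 2, 3):
        count = estimated_image_count(duration, interval)
        print(f"fps={fps_value(interval)}: about {count} images")
```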

4 - Using Adobe Bridge/ACDSee to select the images where text first appears:
Use ctrl-I in Adobe Bridge to flag those images (ACDSee has similar functionality and is much smaller than Adobe Bridge) and copy them to a different folder.
Select 60 images at a time and print them into one pdf using Microsoft Print to PDF with the "fit picture to frame" option unchecked.

Upload the pdf file to Google Drive and open it with Google Docs to run OCR, then copy the result into a text file.
Edit the text file to fix any OCR errors.
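
The grouping of selected images into sets of 60 per pdf in step 4 can be sketched in Python as follows. This is only an illustration of the batching. The function names and the "selected" folder are hypothetical, and the printing and OCR are still done by hand as described.

```python
from pathlib import Path


def list_images(folder, pattern="video-*.jpg"):
    """Sorted list of the image files in a folder (pattern is an assumption)."""
    return sorted(Path(folder).glob(pattern))


def batches(items, size=60):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == "__main__":
    # "selected" is a hypothetical folder holding the flagged images.
    groups = batches(list_images("selected"))
    for number, group in enumerate(groups, start=1):
        print(f"pdf {number}: {len(group)} images")
```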

5 - Using a text editor that supports block (column) selection, select the timestamps and the subtitle text separately and save them to two different text files. Import the 2 files into Subtitle Edit, combine the repeated lines, and save the result as an srt file.
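
For readers who prefer a script, the idea of step 5 can be sketched in Python: read the burned-in timestamps, combine repeated lines, and write srt text. This is not what Subtitle Edit does internally. It is a minimal sketch that assumes the OCR text contains timestamps in the HH:MM:SS.mmm form and that each image covers one interval (1 second by default). All function names are my own.

```python
import re

# Matches timestamps such as 00:01:23.500.
TIMESTAMP_RE = re.compile(r"(\d+):(\d{2}):(\d{2})\.(\d{3})")


def parse_hms(text):
    """Milliseconds for the first HH:MM:SS.mmm timestamp in text, or None."""
    match = TIMESTAMP_RE.search(text)
    if match is None:
        return None
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def format_srt_time(ms):
    """Format milliseconds as an srt timestamp: HH:MM:SS,mmm."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def merge_repeats(samples, interval_ms=1000):
    """Merge back-to-back samples with identical text into [start, end, text] cues.

    `samples` is a list of (start_ms, text) pairs sorted by time.
    """
    cues = []
    for start, text in samples:
        if cues and cues[-1][2] == text and start <= cues[-1][1]:
            cues[-1][1] = start + interval_ms  # extend the previous cue
        else:
            cues.append([start, start + interval_ms, text])
    return cues


def to_srt(cues):
    """Render cues as srt text with numbered blocks."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        times = f"{format_srt_time(start)} --> {format_srt_time(end)}"
        blocks.append(f"{index}\n{times}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    samples = [(1000, "你好"), (2000, "你好"), (5000, "再见")]
    print(to_srt(merge_repeats(samples)))
```

The end time of each cue here is only a guess based on the sampling interval, so the result would still need the synchronization pass in Subtitle Edit.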

6 - Load the video with hardcoded subtitles into Subtitle Edit together with the newly edited srt file, and synchronize the new subtitle file with the video.

The final srt file can be used as the subtitle file.

The above tips are from the following YouTube video; I added my own notes from when I applied them:
https://www.youtube.com/watch?v=2o08WUNDUfY