Building a tool to OCR and translate scanned PDFs without losing the formatting
Thread poster: Kyle Corbitt
Kyle Corbitt
United States
Local time: 14:57
Spanish to English
May 30, 2023

Hi everyone, I've started building a system that combines OCR and MT to quickly produce a draft translation of scanned images and PDFs. It keeps all the formatting of the original document and just adds editable text boxes on top, which saves a ton of time on prep/formatting. It's particularly useful for simple forms like birth certificates (it doesn't work well yet for documents with longer prose). The URL is https://translato.ai

My wife and I actually built this because we wanted a tool like this for ourselves but couldn't find one. We had to manually translate her birth certificate and other documentation when we moved to the US, and I was surprised that there was no way to do that conveniently.

I initially planned for the tool to be used by individuals, but I've actually shown it to a few professional translators and they mentioned that there wasn't a good tool for translating scanned documents for professionals either, so I decided to share it here as well.

I'd really appreciate any feedback on whether this helps your workflow. The service is totally free.

[Edited at 2023-05-30 18:10 GMT]

[Edited at 2023-05-30 21:21 GMT]


 
Stepan Konev  Identity Verified
Russian Federation
Local time: 00:57
English to Russian
PDF output format May 31, 2023

I have tested your tool. Thank you for your effort and work. However, I doubt I can find a use for it. I put in a non-editable JPG file and got a non-editable PDF file back. Any machine translation requires post-editing, which means I would have to OCR the output PDF to make it editable.

 
Samuel Murray  Identity Verified
Netherlands
Local time: 23:57
Member (2006)
English to Afrikaans
@Kyle May 31, 2023

I have tested your tool and it works well for its intended purpose. It takes a bit of experimentation to learn all of its features, because it does not always work the way some users might expect. For example, some might expect to be able to download an editable file. When I tested it, I selected English to English as the language combination so that no machine translation would be inserted into the segments.

 
Kyle Corbitt
United States
Local time: 14:57
Spanish to English
TOPIC STARTER
editing May 31, 2023

Stepan Konev wrote:

I have tested your tool. Thank you for your effort and work. However, I doubt I can find a use for it. I put in a non-editable JPG file and got a non-editable PDF file back. Any machine translation requires post-editing, which means I would have to OCR the output PDF to make it editable.


Hi Stepan, thanks so much for your feedback!

The intention is to do all editing in the tool itself. When your file has been imported, you'll see an interface where all the text is editable. You can then click on any of the text boxes and move them, resize them, etc. Once you're satisfied with that you can then export the final version as a PDF.

That said, I understand you may have a workflow where it's more convenient to export in an editable format and do further post-processing that way. Is there a particular export format that would be most convenient and useful for you?


 
Kyle Corbitt
United States
Local time: 14:57
Spanish to English
TOPIC STARTER
English to English May 31, 2023

Samuel Murray wrote:

I have tested your tool and it works well for its intended purpose. It takes a bit of experimentation to learn all of its features, because it does not always work the way some users might expect. For example, some might expect to be able to download an editable file. When I tested it, I selected English to English as the language combination so that no machine translation would be inserted into the segments.


Hi Samuel, thanks so much for your feedback! I assume you selected English to English because you intend to use it mostly for the OCR capabilities, not for the MT pass? What type of document did you use for your test, and did it have any trouble identifying the text and making it editable?


 

