Textract UnicodeDecodeError on website

Hi,

I am developing a website that takes PDF files and extracts 8-digit numbers from them (Website link + GitHub repo).
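
For context, the number extraction itself could be sketched roughly like this once the PDF text is available as a string. The function name find_eight_digit_numbers is hypothetical and not taken from the linked repo; the pattern assumes an 8-digit number should not be part of a longer run of digits.

```python
import re

# Exactly eight digits, not preceded or followed by another digit,
# so a 9-digit number does not produce a partial match.
EIGHT_DIGITS = re.compile(r"(?<!\d)\d{8}(?!\d)")


def find_eight_digit_numbers(text):
    """Return all standalone 8-digit numbers in text, in order of appearance."""
    return EIGHT_DIGITS.findall(text)
```

The numbers are returned as strings so that leading zeros are preserved.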

However, when I try to process this PDF file, AbenteuerGarten2021.pdf, with the textract package, I get a UnicodeDecodeError.

Strangely, when I run the exact same Flask app on the development server with the exact same PDF file, I don't get an error and it works perfectly.

I am using Python 3.8.2 in my local environment (macOS Big Sur 11.4) and 3.8.0 on PythonAnywhere. Both environments use the following:

  • Flask 2.0.1
  • Werkzeug 2.0.1
  • Textract 1.6.3

Can anyone help me, please? I don't understand where the problem is coming from.


I managed to fix the problem by passing method="pdfminer" to the textract process call. I have no idea why it works, but it does. Posting this for anyone who has a similar problem.
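
For anyone landing here later, a minimal sketch of what that workaround might look like. The helper name extract_pdf_text is made up for illustration; textract.process and its method keyword come from the textract package. As far as I know, textract's PDF parser uses the pdftotext command-line tool unless another method is requested, so method="pdfminer" takes a different extraction path. Whether that is the actual reason the error disappears is only a guess.

```python
def extract_pdf_text(path, method="pdfminer"):
    """Extract text from a PDF with textract, using the given parser method.

    Hypothetical helper: only textract.process() is real library API.
    """
    # Imported lazily so the module still loads where textract is not installed.
    import textract

    # textract.process returns bytes (UTF-8 by default), so decode to str.
    raw = textract.process(path, method=method)
    return raw.decode("utf-8", errors="replace")
```

Decoding with errors="replace" is a defensive choice here, so a stray undecodable byte shows up as a replacement character instead of raising.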

How is your locale set? I tried to reproduce this on PA and on my local machine (Linux, en_US.UTF-8 locale) and got the same error.

How do I find out which locale is set? This could be the solution, because it is probably set to German on my Mac and en_US on PA! (The PDF probably contains äöü since it's in German.)

I think the locale command should work on a Mac (in a terminal).
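
If it helps, the same information can also be read from inside Python, which has the advantage of showing what the running process sees and not just what the terminal reports. A small sketch, with describe_text_environment being a made-up helper name; it only uses the standard library.

```python
import locale
import os
import sys


def describe_text_environment():
    """Collect the locale-related settings visible to this Python process."""
    return {
        # Environment variables that influence the locale (None if unset).
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
        # Encoding Python uses by default when decoding text data.
        "preferred_encoding": locale.getpreferredencoding(False),
        # Encoding used for file names.
        "filesystem_encoding": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for key, value in describe_text_environment().items():
        print(f"{key}: {value}")
```

Running this both locally and inside the web app (for example by logging the result from a view) would show whether the two environments really differ.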

It seems there is no language specified at all.

Nvm, it's en_GB.UTF-8 in PyCharm. Could I copy the settings from my local environment?

Setting LC_CTYPE="en_GB.UTF-8" doesn't fix the problem, but maybe it would if I changed everything else in the "locale" output to match my local environment?

You could also try it the other way round: set the locale in your local environment to the original PA one and check whether that makes the code fail.

Actually, it still works, no matter how I set the values :(

Do you have any other ideas?

Tbh, I'm not sure I'm changing it the right way, because it resets when I restart the console.

Sorry, we're out of ideas.

If you have set environment variables in a console and then start a new console, the environment variables from the first console will not be in the new console. You can make environment variables permanent by exporting the variable in your .bashrc.
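
The reason a new console does not see the old values is that environment variables belong to a single process and are only passed down to its children. A rough Python illustration of that behaviour, where child_sees and the variable name DEMO_LOCALE_VAR are invented for the example:

```python
import os
import subprocess
import sys


def child_sees(name, value):
    """Start a child Python process with name=value set and return what it reads."""
    # Copy the current environment and add the variable for the child only.
    env = dict(os.environ, **{name: value})
    result = subprocess.run(
        [sys.executable, "-c", f"import os; print(os.environ.get({name!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print("child :", child_sees("DEMO_LOCALE_VAR", "en_GB.UTF-8"))
    # The parent process never had the variable set, so this prints None.
    print("parent:", os.environ.get("DEMO_LOCALE_VAR"))
```

The child sees the value, the parent does not, which mirrors what happens between two separate consoles.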