Textract UnicodeDecodeError on website : Forums : PythonAnywhere

Hi,

I am developing a website that should accept PDF files and extract 8-digit numbers from them (Website link + GitHub repo).

However, when I try to process this PDF file, AbenteuerGarten2021.pdf, with the textract package, I get a UnicodeDecodeError!

Strangely, when I run the exact same Flask app on the development server with exactly the same PDF file, I don't get an error and it works perfectly.

I am using Python 3.8.2 in my local environment (macOS Big Sur 11.4) and 3.8.0 on PythonAnywhere. Both environments use the following:

Flask 2.0.1
Werkzeug 2.0.1
Textract 1.6.3

Can anyone please help me? I don't understand where the problem is coming from.

Edit: I managed to fix the problem by passing method="pdfminer" to the textract.process call. I have no idea why it works, but it does. Posting this for anyone with a similar problem.
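For anyone who wants to try the same workaround, here is a minimal sketch of what it might look like. The function names `extract_text` and `find_eight_digit_numbers` are made up for illustration and are not taken from the linked repo. Decoding the result as UTF-8 with `errors="replace"` is also an assumption, and the sketch assumes textract is installed with its pdfminer backend available.

```python
import re


def extract_text(pdf_path):
    """Extract text from a PDF using textract's pdfminer method."""
    # Imported here so the number-matching helper below can be used
    # (and tested) on machines where textract is not installed.
    import textract

    # textract.process returns bytes; method="pdfminer" selects the
    # pdfminer backend instead of the default PDF method.
    raw = textract.process(pdf_path, method="pdfminer")

    # Assumption: the output is UTF-8. errors="replace" keeps a stray
    # byte from raising a second UnicodeDecodeError at this step.
    return raw.decode("utf-8", errors="replace")


def find_eight_digit_numbers(text):
    """Return every run of exactly eight digits found in text."""
    # The lookarounds reject eight digits that are part of a longer number.
    return re.findall(r"(?<!\d)\d{8}(?!\d)", text)


if __name__ == "__main__":
    # Hypothetical usage with the file name from the post above.
    print(find_eight_digit_numbers(extract_text("AbenteuerGarten2021.pdf")))
```

Keeping the matching separate from the extraction makes it possible to check the number matching on plain strings (including ones with äöü) without a PDF at hand.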

JW30 | 13 posts | July 4, 2021, 8:25 a.m. | permalink

How is your locale set? I tried to reproduce this on PA and on my local machine (Linux, en_US.UTF-8 locale) and got the same error.

pafk | 2973 posts | PythonAnywhere staff | July 4, 2021, 5:24 p.m. | permalink

How do I find out which locale is set? This could be the solution, because it is probably set to German on my Mac and en_US on PA! (The PDF probably contains äöü since it's in German.)

JW30 | 13 posts | July 4, 2021, 5:28 p.m. | permalink

I think the locale command should work on a Mac (in a terminal).
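If it helps to compare the two environments from inside Python rather than from a terminal, something like the sketch below could be run on both sides. The helper name `describe_text_environment` is made up for this example. The web app process may not see the same environment as a console, so it may be worth logging this from the Flask app itself as well.

```python
import locale
import os
import sys


def describe_text_environment():
    """Collect locale and encoding settings that may differ between machines."""
    return {
        "python": sys.version.split()[0],
        # Encoding Python uses by default for text files and subprocess pipes.
        "preferred_encoding": locale.getpreferredencoding(False),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # Raw environment variables; None means the variable is not set.
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
    }


if __name__ == "__main__":
    for key, value in describe_text_environment().items():
        print(f"{key}: {value}")
```

Any line that differs between the Mac and PA output would be a candidate for the cause.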

pafk | 2973 posts | PythonAnywhere staff | July 4, 2021, 5:39 p.m. | permalink

It seems there is no language specified at all.

Nvm, it's en_GB.UTF-8 in PyCharm. Can I maybe copy the settings from my local environment?

Setting LC_CTYPE="en_GB.UTF-8" doesn't fix the problem, but maybe it will if I change everything else in the "locale" output to match my local environment?

JW30 | 13 posts | July 4, 2021, 5:43 p.m. | permalink

You could also try it the other way round: set the locale in your local environment to the original PA one and check whether that makes the code fail.

pafk | 2973 posts | PythonAnywhere staff | July 4, 2021, 6:16 p.m. | permalink

Actually, it still works locally, no matter how I set the values :(

Do you have any other ideas?

Tbh I'm not sure if I'm changing it the right way, because when I restart the console it resets.

JW30 | 13 posts | July 4, 2021, 6:40 p.m. | permalink

Sorry, we're out of ideas.

If you have set environment variables in a console and then start a new console, the environment variables from the first console will not be in the new console. You can make environment variables permanent by exporting them in your .bashrc.
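To try out a locale value for a single run without touching .bashrc, one option is to pass it only to a child process. This is a rough sketch, not something tested against the PDF in this thread; the helper name `child_sees` and the value `en_US.UTF-8` are just examples.

```python
import os
import subprocess
import sys


def child_sees(name, value):
    """Start a child Python process with one extra variable and report what it sees."""
    # Copy the current environment and override a single variable.
    # The parent process and any other console are left unchanged.
    env = dict(os.environ)
    env[name] = value
    code = "import os, sys; print(os.environ.get(sys.argv[1]))"
    result = subprocess.run(
        [sys.executable, "-c", code, name],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(child_sees("LC_ALL", "en_US.UTF-8"))
```

The same `env=` argument could be used to launch the actual script under a different locale, which avoids the problem of settings being lost when a console restarts.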

glenn | 9498 posts | PythonAnywhere staff | July 5, 2021, 10:27 a.m. | permalink