I am trying to host a website on PythonAnywhere that extracts 8-digit product numbers from PDF files. The extraction is done with the textract package.
When I try to extract text from certain PDF files, I get a UnicodeDecodeError, as shown here: https://codeshare.io/WdNz6E. However, if I run the website locally on the development server, textract handles the same PDF files and the program runs without errors. The versions of Flask (1.1.2) and textract (1.6.3) are the same in both environments. Python is version 3.8.2 locally and 3.8 on PythonAnywhere. Locally, using PyCharm 2020.3.5 on macOS Big Sur 11.4, the Flask app runs flawlessly on the development server.
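Since the package versions match, one difference I have not ruled out is the default text encoding of the two environments. A minimal sketch for comparing them is below; it uses only the standard library, the function name `describe_encodings` is just an illustrative choice, and whether the encoding is actually the cause is an assumption, not something I have confirmed.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python falls back to in this environment."""
    return {
        # Encoding used by default when converting between str and bytes
        "default": sys.getdefaultencoding(),
        # Encoding used for file names on this system
        "filesystem": sys.getfilesystemencoding(),
        # Encoding the locale settings suggest for text data
        "preferred": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(f"{name}: {value}")
```

Running this both locally and in a PythonAnywhere console would show whether the two environments report different values.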
This is the function used to extract the numbers from the PDF files:
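import re
import textract

def get_numbers(file_path):
    output = ''
    # textract.process returns bytes; str() converts them to their repr text
    pdf_string = str(textract.process(file_path))
    # Replace every non-digit with a space, then split into runs of digits
    numbers_list = re.sub(r'\D', ' ', pdf_string).split()
    for x in numbers_list:
        # Keep only 8-digit runs that have not been collected yet
        if len(x) == 8 and x not in output:
            output += f'{x} '
    return output.rstrip()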
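As a side note, the number-matching step can be written so that it decodes the bytes explicitly instead of calling `str()` on them. The sketch below assumes the extracted text is UTF-8 and replaces any bytes that cannot be decoded; `extract_numbers` is a hypothetical helper that takes the bytes already returned by `textract.process`. It is not meant as a fix for the UnicodeDecodeError itself.

```python
import re

# An 8-digit run that is not part of a longer run of digits
EIGHT_DIGITS = re.compile(r'(?<!\d)\d{8}(?!\d)')


def extract_numbers(raw):
    """Return the unique 8-digit numbers in raw bytes, space-separated."""
    # Assumption: the text is UTF-8; undecodable bytes are replaced, not raised
    text = raw.decode('utf-8', errors='replace')
    matches = EIGHT_DIGITS.findall(text)
    # dict.fromkeys removes duplicates while keeping first-seen order
    return ' '.join(dict.fromkeys(matches))
```

For example, `extract_numbers(b'Item 12345678, ref 1234, item 87654321, again 12345678')` returns `'12345678 87654321'`.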