When I use pdfkit to generate a PDF from HTML on PythonAnywhere and then have fitz (PyMuPDF) read the text back out of that PDF, every space comes back as a replacement character (U+FFFD, shown as �). When I run the exact same code in my home environment, the spaces are extracted correctly. Here is the code I'm running.
import pdfkit
import fitz

def convert_to_pdf(input_html, output_pdf):
    options = {
        "enable-local-file-access": None,
        "encoding": "UTF-8",
        "dpi": 200,
        "page-size": "Letter",
        "margin-top": "0.75in",
        "margin-right": "0.75in",
        "margin-bottom": "0.75in",
        "margin-left": "0.75in",
        "minimum-font-size": "22",
    }
    # Convert to PDF with pdfkit
    pdfkit.from_file(input_html, output_pdf, options=options)

convert_to_pdf(input_html, output_pdf)
pdf = fitz.open(output_pdf)
for i, page in enumerate(pdf):
    text = page.get_text()
    print(text)
When I run this on my home system, I get, for example:
I) the sum of the amount on which tax is determined under subparagraph (A) plus the net capital gain, over (II) taxable income; and (F) 28 percent of the amount of taxable income in excess of the sum of the amounts on which tax is determined under the preceding subparagraphs of this paragraph.
When I run this on PythonAnywhere, I get, for example:
(11)�Dividends�taxed�as�net�capital�gain (A)�In�general (A)�In�general For�purposes�of�this�subsection,�the�term�“net�capital�gain”�means�net�capital gain�(determined�without�regard�to�this�paragraph)�increased�by�qualified�dividend�income. (B)�Qualified�dividend�income (B)�Qualified�dividend�income For�purposes�of�this�paragraph— (i)�In�general (i)�In�general The�term�“qualified�dividend�income”�means�dividends�received�during the�taxable�year�from—
I have also tried modifying the code on PythonAnywhere by wrapping the convert_to_pdf code in disp = Display().start() / disp.stop(), and also putting in config = pdfkit.configuration(wkhtmltopdf='/usr/bin/wkhtmltopdf'), based on some prior forum posts. The code still runs when I do this, but it doesn't fix the output.
When I create the PDF on PythonAnywhere and read it in my home environment, I get the same text with the replacement characters, so fitz seems to behave the same in both environments and the problem seems to lie in how the PDF is generated. Additionally, the PDF looks fine visually whether I generate it in my home environment or on PythonAnywhere; the issue shows up only when I extract the text back out of the PDF.
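To make the symptom easier to compare between the two environments, here is a minimal sketch of how the extracted text could be checked. The helper names whitespace_report and restore_spaces are hypothetical (they are not part of pdfkit or fitz), and restore_spaces assumes that every U+FFFD in the text stands for a space, which matches what I see but is only a stopgap and does not fix the PDF itself.

```python
from collections import Counter

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, the character appearing where spaces should be


def whitespace_report(text):
    """Count ordinary spaces and U+FFFD characters in extracted text."""
    counts = Counter(text)
    return {
        "spaces": counts[" "],
        "replacement_chars": counts[REPLACEMENT_CHAR],
    }


def restore_spaces(text):
    """Stopgap: assume every U+FFFD stands for a space and swap it back."""
    return text.replace(REPLACEMENT_CHAR, " ")


# Short sample modelled on the PythonAnywhere output above
sample = "(A)\ufffdIn\ufffdgeneral For\ufffdpurposes"
print(whitespace_report(sample))
print(restore_spaces(sample))
```

In the script above, the string returned by page.get_text() could be passed to either helper in place of sample.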
Thanks very much for any insights you can provide.