PDFKIT

When I use pdfkit to generate a PDF from HTML on PythonAnywhere and then have fitz (PyMuPDF) read the text back out of that PDF, every space comes back as a replacement character. When I run the exact same code in my home environment, the spaces are extracted correctly. Here is the code I'm running.

import pdfkit
import fitz

def convert_to_pdf(input_html, output_pdf):
    options = {
        'enable-local-file-access': None,
        'encoding': 'UTF-8',
        'dpi': 200,
        'page-size': 'Letter',
        'margin-top': '0.75in',
        'margin-right': '0.75in',
        'margin-bottom': '0.75in',
        'margin-left': '0.75in',
        'minimum-font-size': '22',
    }
    # Convert to PDF with pdfkit
    pdfkit.from_file(input_html, output_pdf, options=options)

convert_to_pdf(input_html, output_pdf)
pdf = fitz.open(output_pdf)
for i, page in enumerate(pdf):
    text = page.get_text()
    print(text)

When I run this on my home system, I get, for example:

I) the sum of the amount on which tax is determined under subparagraph (A) plus the net capital gain, over (II) taxable income; and (F) 28 percent of the amount of taxable income in excess of the sum of the amounts on which tax is determined under the preceding subparagraphs of this paragraph.

When I run this on PythonAnywhere, I get, for example:

(11)�Dividends�taxed�as�net�capital�gain (A)�In�general (A)�In�general For�purposes�of�this�subsection,�the�term�“net�capital�gain”�means�net�capital gain�(determined�without�regard�to�this�paragraph)�increased�by�qualified�dividend�income. (B)�Qualified�dividend�income (B)�Qualified�dividend�income For�purposes�of�this�paragraph— (i)�In�general (i)�In�general The�term�“qualified�dividend�income”�means�dividends�received�during the�taxable�year�from—

I have also tried modifying the code on PythonAnywhere by wrapping the convert_to_pdf call in disp = Display().start() / disp.stop(), and by passing config = pdfkit.configuration(wkhtmltopdf='/usr/bin/wkhtmltopdf'), both based on prior forum posts. The code still runs with these changes, but the output is not fixed.

When I create the PDF on PythonAnywhere and read it in my home environment, I get the same text with the replacement characters, so fitz seems to behave the same in both environments and the problem seems to lie in how the PDF is generated. Additionally, the PDF looks fine whether I generate it at home or on PythonAnywhere; the problem shows up only when I extract the text back out of the PDF.
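To confirm exactly which character is standing in for the spaces, the extracted text can be summarized by code point. This is only a sketch: non_ascii_summary is a hypothetical helper name, not part of fitz or pdfkit, and it assumes the string passed in is whatever page.get_text() returned.

```python
import unicodedata
from collections import Counter


def non_ascii_summary(text):
    """Count each non-ASCII character in text, keyed by code point and Unicode name."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return {
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}": n
        for ch, n in counts.items()
    }


if __name__ == "__main__":
    # Sample string copied from the kind of output shown above
    print(non_ascii_summary("Hello,\ufffdworld"))
```

If the summary shows U+FFFD, the extractor itself substituted a replacement character; a different code point (for example a non-breaking space) would point to a different cause.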

Thanks very much for any insights you can provide.

Do you have the same package versions in both environments?

Yes. Running pip freeze in my home environment and in the Bash console for the app's virtual environment, I get pdfkit==1.0.0, wkhtmltopdf==0.2, and PyMuPDF==1.22.5 in both.
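Note that pip freeze only reports Python packages, while pdfkit shells out to the wkhtmltopdf executable, so the binary's own version may still differ between the two machines. A minimal sketch for checking it, assuming the executable is on PATH under the name wkhtmltopdf (wkhtmltopdf_version is a hypothetical helper name):

```python
import shutil
import subprocess
from typing import Optional


def wkhtmltopdf_version(binary: str = "wkhtmltopdf") -> Optional[str]:
    """Return the text printed by `<binary> --version`, or None if it is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some builds may print to stderr, so fall back to it when stdout is empty
    return (result.stdout or result.stderr).strip()


if __name__ == "__main__":
    print(wkhtmltopdf_version())
```

Running this in both environments and comparing the output would show whether the two setups are really using the same build of the converter.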

Since it's related to converting from HTML and spaces are the issue, maybe that could be a lead (it's an old ticket, but maybe you have different fonts available on your local machine)?

To check if it was the font, I took the font specification out of the CSS entirely on PythonAnywhere, presuming that it would use whatever the default font was, which I think was DejaVuSerif. It did appear to be using DejaVuSerif, but I had the same problem (replacement characters) with that font on PythonAnywhere.

My CSS file did specify font-family: "Times New Roman", Times, serif; I installed times.ttf in the ~/.fonts folder, and I was indeed getting Times New Roman in the PDF. When I check to see what fonts are available on PythonAnywhere, running fc-list : family style spacing, it does list

Times New Roman:style=Regular,Normal,obyčejné,Standard,Κανονικά,Normaali,Normál,Normale,Standaard,Normalny,Обычный,Normálne,Navadno,thường,Arrunta

I noticed something else, which is that the size of the font differs between running it on my home environment and running it on PythonAnywhere. It is much smaller on my home environment. To get roughly the same font size, I have to put 'minimum-font-size': '40' in the pdfkit options for my home environment, and 'minimum-font-size': '22' for the PythonAnywhere environment. That also may suggest that it's something about fonts. Is there another way I could try a different font that you would suggest (given that I have already tried taking font specification out of the CSS file entirely)? Thank you very much for your help with unraveling this.

How are you running the code on PythonAnywhere? If it's from a Bash console, did you run

fc-cache -f -v

...after copying the TTF file into .fonts? I'm wondering if the install hasn't happened completely.

The code is running through a webapp. I tried running the code you suggested in a Bash console in the virtual environment and then reloading the app but the same problem is still occurring. Also, I tried removing the font specification from the CSS file entirely so that the program would use whatever the default font is, and it did use what seemed to be the default font, but the problem still occurred.

What happens if you try to convert some completely trivial HTML file -- something like

<html>
    <body>
        <p>Hello, world</p>
    </body>
</html>

...with no styling information at all?

Ah, good question! Exact same problem when there is no formatting at all. Here is the code:

import fitz
import pdfkit

def convert_to_pdf(input_html, output_pdf):
    pdfkit.from_file(input_html, output_pdf)

def test_file(input_path):
    pdf = fitz.open(input_path)
    for i, page in enumerate(pdf):
        text = page.get_text()
        print(text)

convert_to_pdf('testing.html','output2.pdf')
test_file('output2.pdf')

And here is testing.html:

<html>
    <body>
        <p>Hello, world</p>
    </body>
</html>

And the output when I ran the code in the Bash console in the virtual environment for the webapp was: Hello,�world

Running the exact same thing in my home environment, I got: Hello, world

I have additional information: the issue appears to be that pdfkit puts slightly too much space between the words when I use it on PythonAnywhere (but not in my home environment). PyMuPDF can't recognize the gap, so it puts in Unicode code point 65533 (U+FFFD, the replacement character), giving "Hello,�world". pypdf deals with it by inserting two spaces instead of one ("Hello,  world"). Whatever is making the font size differ between PythonAnywhere and my home environment appears to also be making the spaces between the words slightly too large. The code that produces the two spaces between the words is, with convert_to_pdf the same as above,

from pypdf import PdfReader

def test_file_pypdf(input_path):
    reader = PdfReader(input_path)
    for i, page in enumerate(reader.pages):
        text = page.extract_text()
        print(text)

convert_to_pdf('testing.html','output2.pdf')
test_file_pypdf('output2.pdf')
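If the goal is only to get clean text back out, one possible workaround is to normalize the extracted string afterwards. The sketch below handles both symptoms described above (the replacement character from PyMuPDF and the doubled space from pypdf). It assumes every U+FFFD in the text stands for a space, which matches the examples in this thread but is not guaranteed for other documents; normalize_extracted_text is a hypothetical helper name.

```python
import re

REPLACEMENT_CHAR = "\ufffd"  # Unicode code point 65533


def normalize_extracted_text(text: str) -> str:
    """Map replacement characters to spaces and collapse runs of spaces or tabs."""
    # Assumption: each replacement character was originally a space
    text = text.replace(REPLACEMENT_CHAR, " ")
    # Collapse horizontal whitespace only, so line breaks are preserved
    return re.sub(r"[ \t]+", " ", text)


if __name__ == "__main__":
    print(normalize_extracted_text("Hello,\ufffdworld"))
```

This only cleans up the extracted text; it does not change the spacing inside the PDF itself.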

I wonder if the issue is that PythonAnywhere is running on Ubuntu and my home system is Windows.

https://github.com/wkhtmltopdf/wkhtmltopdf/issues/5069

That may also be a lead. Could you tell us which Python version you are using and how you installed the fitz package? I was trying to reproduce this, but the fitz I found seems to be deprecated and does not have an open method at all.

This happens with Python 3.9 and 3.10. To install PyMuPDF:

pip install PyMuPDF

https://pypi.org/project/PyMuPDF/

(Once it is installed, you do import fitz to use PyMuPDF, but there is also a totally separate, deprecated package named fitz, which is confusing.)
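One way to tell which of the two distributions is actually present in an environment is to query the installed package metadata. This is a small sketch using the standard library's importlib.metadata (Python 3.8+); installed_version is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "PyMuPDF" is the maintained library; "fitz" is the unrelated package
    for name in ("PyMuPDF", "fitz"):
        print(name, installed_version(name))
```

If PyMuPDF reports a version and fitz reports None, then import fitz is resolving to PyMuPDF as intended.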

Thanks -- I got a repro on my account, as well as on my local machine (a different Linux distribution than on PA), so it doesn't seem to be a PA-specific issue.

Sounds like it is likely Linux vs. Windows then. Thank you for looking into this--I will find another workaround for the specific thing I'm trying to do.