Reading from PostgreSQL database is 15 times slower than locally

I am testing reading a single column of interest from 1000 rows of my PostgreSQL database. The column contains a binary object of about 150 KB per row. Here is the Django management command script:

from django.core.management.base import BaseCommand
from content.models import Content
from utils.stats import Timer

class Command(BaseCommand):
    def handle(self, *args, **options):
        objects = Content.objects.all().only("vectors")
        print(f"Looping on {objects.count()} objects", flush=True)
        timer = Timer()
        for obj in objects:  # iterating forces every row (and its vectors column) to be fetched
            pass
        print(f"{timer.stop()}ms", flush=True)

Here is the output in PythonAnywhere console:

Looping on 1000 objects
23501ms

On my laptop, this runs in about 1.6 seconds.

Why is this so much slower in PythonAnywhere and is there any way to solve this (as my website will heavily rely on this operation)?

Do you have exactly the same db locally (maybe add another print to show how many objects were found) or are you connecting to your PA postgres instance when you run tests locally?

Exactly the same (in fact, it has one more row :)

One other strange thing is that this clearly I/O-bound script takes almost 10 seconds of CPU time as reflected in the updated used CPU time in the Tasks tab.

There are a couple of things to remember here:

  • On PythonAnywhere, you're using shared hosting. If you consider that your local machine cost something in the region of US$1,200, and would last about a year if you ran it at 100% use all of the time (like crypto miners do, for example), then that's US$100/month. Add on the costs of security, cooling, electricity, and a machine with the same specs would cost upwards of US$200/month (which is reflected in, for example, AWS pricing). So on PythonAnywhere, where you're paying significantly less than that, you share a machine with other people and you get lower performance as a result.
  • Also, I suspect that when you're running your test locally, you have the database and the code on the same machine -- at least, that's how most people develop. You can expect much higher bandwidth between the processes as a result, as there's no round-trip over the network. Most queries (and I appreciate that this might not be possible in your specific case) are better done by filtering on the server side with WHERE clauses, so that fewer results come back over the network and you're less affected by that (see the sketch after this list).
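
To illustrate that second point, here is a minimal sketch of pushing the filtering to the database; the category field on Content is hypothetical, but the idea is that only the matching rows, and only the vectors column, travel over the network:

from content.models import Content

# The filter() call becomes a WHERE clause that runs on the Postgres server,
# so only the matching rows and their "vectors" column are sent to the client.
# "category" is a hypothetical field, used purely for illustration.
matching = Content.objects.filter(category="news").only("vectors")
for obj in matching:
    pass  # replace with whatever the application does with obj.vectors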

Regarding the CPU time -- that is odd, certainly, but the numbers we use for that come straight from the OS. You'll see similar numbers reported if you run your command under the shell's time utility, e.g. time python manage.py yourcommand. Perhaps Django is doing some kind of heavyweight unmarshalling of the binary data, or something like that?
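
One way to check, if it helps, would be to time CPU and wall-clock time separately around the loop -- just a sketch using the same Content model, not something we have run ourselves. If the two numbers come out close, the time is going on CPU work such as unmarshalling; if the wall-clock number is much larger, the process is mostly waiting on the network or the disk:

import time
from content.models import Content

wall_start = time.perf_counter()   # wall-clock time
cpu_start = time.process_time()    # CPU time of this process only

for obj in Content.objects.all().only("vectors"):
    pass  # force every row (and its vectors column) to be fetched

print(f"wall: {time.perf_counter() - wall_start:.1f}s, "
      f"cpu: {time.process_time() - cpu_start:.1f}s")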

Are you saying that the problem is not slow disk I/O, but network transfer between the machine on which that database is physically located and the machine on which the script is run? If I give you a script to create the database, will you be able to confirm this?

Here is a reproducible test without Django:

import psycopg2
import time
import random
import string

SIZE = 1000 * 200000  # ~200 MB of text, inserted as a single value

def ms_now():
    return int(time.time_ns() / 1000000)

class Timer():
    def __init__(self, timestamp_function=ms_now):
        self.timestamp_function = timestamp_function
        self.start = self.timestamp_function()

    def stop(self):
        return self.timestamp_function() - self.start

def random_text(size):
    return ''.join([random.choice(string.ascii_letters) for _ in range(size)])

def connect():
    # Fill out host, port, dbname, username and password here
    return psycopg2.connect(
        host=host,
        port=port,
        dbname=dbname,
        user=username,
        password=password
    )

def create_database(conn):
    with conn.cursor() as cursor:
        cursor.execute("DROP TABLE IF EXISTS speed_test;")
        cursor.execute(
            "CREATE TABLE speed_test (id serial PRIMARY KEY, data text);")
        conn.commit()
        timer = Timer()
        data = random_text(SIZE)
        print(f"Generation time: {timer.stop()}ms")
        timer = Timer()
        cursor.execute("INSERT INTO speed_test (data) VALUES (%s);", (data,))
        conn.commit()
        print(f"Insertion time: {timer.stop()}ms")

def query_database(conn):
    with conn.cursor() as cursor:
        timer = Timer()
        cursor.execute("SELECT data FROM speed_test;")
        print(f"Query time: {timer.stop()}ms")

conn = connect()
create_database(conn)
query_database(conn)

The output on my laptop:

Generation time: 56417ms
Insertion time: 1604ms
Query time: 337ms

The output in the PythonAnywhere terminal:

Generation time: 228069ms
Insertion time: 39319ms
Query time: 4029ms

I am not sure why text data is faster than my binary data, but the comparison between the local time and the time at PythonAnywhere is telling. While the CPU-intensive task of generating the data is some four times slower than on the laptop, the I/O operations are some 24 times slower in the case of writing to the database and some 12 times slower in the case of reading from the database.

For my particular application, data generation and writing to the database are done over the course of many requests and hence do not slow down my application by too much. However, the reading operation is done in a single request and is critical for my application's performance.

The I/O side of things does seem to suggest that it's related to sending the data over the network, then. Are you running Postgres and your code locally on the same machine?

I assume that your last reply came after my reproducible example, which was delayed for moderation, got posted. Yes, my code was run on the same laptop on which the database resides. Note, however, that I am running under WSL, which means that the CPU times are quite a bit slower than they would be without virtualization. It would be great if you could test this code and form your own conclusions. In particular, can you run it on the same machine on which the physical database resides?

Also, I think that the network transfer time should not depend on the direction of transfer, so that the network transfer overhead should be the same for reading and for writing. However, it took roughly 4 seconds for reading and roughly 39 seconds for writing, which suggests that it's not about network overhead, but about disk I/O overhead.
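
For what it's worth, one way to separate the server-side execution time from the time spent shipping the data to the client might be EXPLAIN ANALYZE, which runs the query on the Postgres server but returns only the plan text. A minimal sketch against the same speed_test table:

import psycopg2

# host, port, dbname, username and password filled in as before
conn = psycopg2.connect(host=host, port=port, dbname=dbname,
                        user=username, password=password)
with conn.cursor() as cursor:
    # The query is executed and timed on the server; only the plan text comes
    # back, so network transfer of the ~200 MB payload is excluded.
    cursor.execute("EXPLAIN ANALYZE SELECT data FROM speed_test;")
    for (line,) in cursor.fetchall():
        print(line)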

Yes, I answered your message after I'd spotted that it had got stuck in a spam filter, so it was written after having read that.

Unfortunately the machines that host Postgres instances aren't really set up to run code.

Regarding reading vs writing, do you have any indexes on the table in question? Writing can be slowed down quite a lot if for every write it has to regenerate indexes.
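
If it's not obvious from the schema, a quick way to check would be something like this -- a minimal sketch using the pg_indexes catalog view, with the connection details filled in as in your script:

import psycopg2

# host, port, dbname, username and password filled in as before
conn = psycopg2.connect(host=host, port=port, dbname=dbname,
                        user=username, password=password)
with conn.cursor() as cursor:
    # Lists every index on the table, including the one Postgres creates
    # implicitly for a PRIMARY KEY.
    cursor.execute("SELECT indexname, indexdef FROM pg_indexes "
                   "WHERE tablename = 'speed_test';")
    for name, definition in cursor.fetchall():
        print(name, definition)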

Please see the create_database function in the code that I posted (https://www.pythonanywhere.com/forums/topic/33658/#id_post_115902), which you can run yourself to confirm my findings. It does not create any indices. I think the reading vs. writing speed comparison is a strong indication that it is disk I/O that's slow.

There are probably a large number of things contributing to the performance difference, not just one: the network hop to the machine, probably slower disks, and, likely the largest contributor, the fact that the machine is shared with other users. The last one matters because both the processors and the disks are in use by other users on the machine.

When the shared nature of the system causes such a dramatic slowdown, it is the very definition of under-provisioning of resources. As a paying user, I feel that I quite rightfully expect that such under-provisioning should not happen.

I think there's a bit of confusion here. Just to re-iterate what I said earlier:

On PythonAnywhere, you're using shared hosting. If you consider that your local machine cost something in the region of US$1,200, and would last about a year if you ran it at 100% use all of the time (like crypto miners do, for example), then that's US$100/month. Add on the costs of security, cooling, electricity, and a machine with the same specs would cost upwards of US$200/month (which is reflected in, for example, AWS pricing). So on PythonAnywhere, where you're paying significantly less than that, you share a machine with other people and you get lower performance as a result.

Your Postgres instance costs US$7/month, and you're comparing performance with a machine that would cost US$200/month. Under-provisioning would be if you got worse than 200/7 ≈ 28 times slower performance, but your experiments are showing that it's only 15 times slower.

First, my application does not use the system 100% of the time, which breaks the analogy completely. Second, I am surprised that you don't know how much your service costs. A Postgres instance is $12. Further, it is not true that a crypto miner lasts only a year. A quick Google search suggests that 5-7 years is the average. Now do your calculation.

If anyone is reading this and can easily run the code at Digital Ocean or another platform with similar pricing, it would be interesting to compare.

The analogy is about the available resources you can use (or not). How did you establish that our Postgres instance costs $12? (Asking because maybe there's something in our UI that is confusing you, as Postgres costs US$7 per month plus an extra $0.20 per GiB of storage.)

Because that's what I am paying... Here is the screenshot from the configuration page.

That's for your entire account. Postgres is a part of that.

Ah, I see now what you mean. I apologize for my remark. Nonetheless, given that a local solution can easily last 5-7 years, the 15 times difference is unjustified.

Practically speaking, the current performance will make me look for a more performant platform very soon. This is regrettable, given the amount of effort I have put into development with PA in mind and, quite frankly, the outstanding support I have been enjoying. So, calculations aside, what, if anything, will you do to help with this situation?

If you find a better-performing solution for the same price, let us know; we are happy to learn something new. From a paid account you can connect to any database out there, so you can test many other Postgres providers.

Since PA is based on AWS, I decided to check it against that to see what level of AWS service I am getting at PA. Here is the code for the test (it is the same as above, except that a smaller amount of data, generated in chunks, is used to fit within the RAM limit of the EC2 free tier):

import psycopg2
import time
import random
import string

CHUNK_SIZE = 10000000  # 10 MB per chunk
N_CHUNKS = 10  # build the ~100 MB payload in chunks, so as to avoid one huge list

def ms_now():
    return int(time.time_ns() / 1000000)

class Timer():
    def __init__(self, timestamp_function=ms_now):
        self.timestamp_function = timestamp_function
        self.start = self.timestamp_function()

    def stop(self):
        return self.timestamp_function() - self.start

def random_text():
    result = ''
    for i in range(N_CHUNKS):
        print(f"Chunk {i+1} of {N_CHUNKS}")
        result += ''.join([random.choice(string.ascii_letters) for _ in range(CHUNK_SIZE)])
    return result

def connect():
    # Fill out host, port, dbname, username and password here
    return psycopg2.connect(
        host=host,
        port=port,
        dbname=dbname,
        user=username,
        password=password
    )

def create_database(conn):
    with conn.cursor() as cursor:
        cursor.execute("DROP TABLE IF EXISTS speed_test;")
        cursor.execute(
            "CREATE TABLE speed_test (id serial PRIMARY KEY, data text);")
        conn.commit()
        timer = Timer()
        data = random_text()
        print(f"Generation time: {timer.stop()}ms")
        timer = Timer()
        cursor.execute("INSERT INTO speed_test (data) VALUES (%s);", (data,))
        conn.commit()
        print(f"Insertion time: {timer.stop()}ms")

def query_database(conn):
    with conn.cursor() as cursor:
        timer = Timer()
        cursor.execute("SELECT data FROM speed_test;")
        print(f"Query time: {timer.stop()}ms")

conn = connect()
create_database(conn)
query_database(conn)

Here are the timing results at PA:

Generation time: 77334ms
Insertion time: 15911ms
Query time: 2389ms

And this what I get by running the script on the FREE TIER of EC2 with the database located in the FREE TIER of RDS (Postgres):

Generation time: 54626ms
Insertion time: 14964ms
Query time: 2325ms

The performance I get from PA is about equal (and even slightly worse as far as the computation is concerned) to the performance of the weakest instances of AWS (free tier is free tier), while I do pay money to PA. In my opinion, this is quite conclusive...

Ok. So then you can stop paying for postgres on PythonAnywhere and use that instance.

First, this is not a solution, because the network overheads will dominate the cost for small queries. This means that I have to either pay for Postgres both at PA and at AWS, or completely switch to AWS.

Second, such a reply is quite unbelievable for a company that wants to stay in business. I wonder how your boss would react to these facts becoming duly publicized after having been ignored by the team. (While on the point, another thing that has been ignored is Postgres's ability to work with vectors through the pgvector extension -- hugely important in the modern machine-learning-oriented world -- which AWS has offered since May and which I have been asking PA for for months.)

Hi there. We primarily aim to simplify web hosting for Python developers, not provide equal or better performance than AWS. Your request for the pgvector extension was added to our backlog, not ignored. We're a small team and unfortunately don't have the bandwidth to implement every new feature that gets requested, as we have other high-priority tasks that need addressing. If performance and up-to-date Postgres extensions are your priority, then AWS might be the better fit for you.

I would argue that installing pgvector presents a very high benefit-to-investment ratio. In particular, my original issue might disappear. The only reason I have to perform these huge reads from the database is that I perform exact KNN searches over hundreds of thousands of vectors. With pgvector, an approximate KNN search query would be fast and return a small result set. This would also reduce the overall load on the system and actually save PA money. And the investment is, well, come on, how long can installing an extension take? This correspondence has probably taken you much longer than resolving the issue would have!
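
To make the suggestion concrete, here is a rough sketch of the kind of query pgvector enables once the extension is available on the server; the items table and its 3-dimensional embedding column are hypothetical, and the connection details are filled in as before:

import psycopg2

# host, port, dbname, username and password filled in as before
conn = psycopg2.connect(host=host, port=port, dbname=dbname,
                        user=username, password=password)
with conn.cursor() as cursor:
    cursor.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cursor.execute("CREATE TABLE IF NOT EXISTS items "
                   "(id serial PRIMARY KEY, embedding vector(3));")
    # An IVFFlat index turns the exact scan into a fast approximate search.
    cursor.execute("CREATE INDEX IF NOT EXISTS items_embedding_idx ON items "
                   "USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);")
    conn.commit()
    # <-> is pgvector's L2 distance operator; only the 10 nearest ids come
    # back, instead of every vector being shipped to the client.
    cursor.execute("SELECT id FROM items ORDER BY embedding <-> %s::vector "
                   "LIMIT 10;", ("[0.1, 0.2, 0.3]",))
    print(cursor.fetchall())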

I would really hate to have to rework my backend to run on EC2 and to leave PA, and then hear in a few months that you finally got to this item in the backlog and the move could be avoided. So, please, can you accommodate this request and give this feature urgent precedence?

It is in our backlog and we may do it at some point. That is all that we can promise.