scrapy and ssh

Hello,

I hope you're doing well.

I have a Scrapyd server deployed on Heroku. Currently, when I perform web scraping, I send the collected data to a PlanetScale database. Afterward, I use my web application to interact with the PlanetScale database to retrieve records and perform other operations.

However, I've run into some problems. First, there is significant latency in this setup. Second, PlanetScale's limitations are becoming apparent, and it may not meet my long-term needs. For these reasons, I am considering switching to a MySQL database on PythonAnywhere (PA).

I've identified three potential approaches to address this issue:

Instead of using Scrapyd, I could run Scrapy directly in a subprocess and save the scraped items straight to the MySQL database on PA. This would eliminate the need for a server outside of PA and make the setup simpler. However, I'm unsure about the impact on processing time: some of my scraping jobs take 2 to 4 minutes to complete, while others finish in under a minute. I'm concerned about how this might affect my CPU usage on PA and the behavior of my web application while these jobs are running.

Another option is to keep the Scrapyd server and modify it to write data to the MySQL database on PA over SSH. However, I'm not sure whether it's better to open an SSH tunnel and keep it open until the scraper completes, or to close and reopen the connection after processing a certain number of items, such as 10. I'd like to minimize the load on the MySQL database on PA.

The third approach involves creating a new route in my web application. With this setup, each scraped item would be sent to this route, which would then be responsible for recording the data in the MySQL database.
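For the second approach, here is a minimal sketch of what "one tunnel for the whole crawl, commit every 10 items" might look like as a Scrapy item pipeline. It assumes the third-party `sshtunnel` and `pymysql` packages are installed on the Scrapyd host. The class name, the `items` table with `title` and `url` columns, and every credential are hypothetical placeholders, and the hostnames follow the usual PythonAnywhere pattern but should be checked against your own account. Whether this is actually lighter on the database than reconnecting every few items has not been measured here.

```python
class TunnelledMySQLPipeline:
    """Scrapy-style pipeline: one SSH tunnel per crawl, batched inserts."""

    batch_size = 10  # commit after this many buffered items
    # Hypothetical table and columns; adjust to the real schema.
    insert_sql = "INSERT INTO items (title, url) VALUES (%s, %s)"

    def open_spider(self, spider):
        # Imported here so the module loads even where these are not installed.
        import pymysql
        import sshtunnel

        # All hostnames and credentials below are placeholders (assumptions).
        self.tunnel = sshtunnel.SSHTunnelForwarder(
            "ssh.pythonanywhere.com",
            ssh_username="PA_USERNAME",
            ssh_password="PA_LOGIN_PASSWORD",
            remote_bind_address=(
                "PA_USERNAME.mysql.pythonanywhere-services.com",
                3306,
            ),
        )
        self.tunnel.start()
        # MySQL is reached through the local end of the tunnel.
        self.conn = pymysql.connect(
            host="127.0.0.1",
            port=self.tunnel.local_bind_port,
            user="PA_USERNAME",
            password="DB_PASSWORD",
            database="PA_USERNAME$default",
        )
        self.buffer = []

    def process_item(self, item, spider):
        # Buffer the row and write only when a full batch has accumulated.
        self.buffer.append((item["title"], item["url"]))
        if len(self.buffer) >= self.batch_size:
            self.flush()
        return item

    def flush(self):
        if not self.buffer:
            return
        with self.conn.cursor() as cursor:
            cursor.executemany(self.insert_sql, self.buffer)
        self.conn.commit()
        self.buffer = []

    def close_spider(self, spider):
        # Write any leftover rows, then always release both resources.
        try:
            self.flush()
        finally:
            self.conn.close()
            self.tunnel.stop()
```

The pipeline would be enabled through Scrapy's `ITEM_PIPELINES` setting. With this layout the tunnel and the MySQL connection are each opened once per crawl, and the database sees one multi-row insert and one commit per 10 items rather than one per item.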

I'm open to any insights or suggestions you may have regarding these approaches. Your expertise would be greatly appreciated.

Those all seem like viable ways of doing it. I think the third one is probably the easiest to implement.
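As a rough illustration of the scraper side of that third option, the sketch below posts items to the web app in small batches using only the standard library. The URL, the bearer token, the `{"items": [...]}` payload shape, and the class name are all assumptions made for the example; the route itself would need to check the token and do the MySQL insert, and its code depends on whichever web framework the app uses.

```python
import json
import urllib.request

# Hypothetical endpoint and shared secret; replace with real values.
INGEST_URL = "https://example.pythonanywhere.com/api/items"
API_TOKEN = "change-me"


def build_request(items, url=INGEST_URL, token=API_TOKEN):
    """Build a JSON POST request carrying a batch of scraped items."""
    body = json.dumps({"items": items}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


class HttpExportPipeline:
    """Scrapy-style pipeline that sends items to the web app in batches."""

    batch_size = 10  # one HTTP request per 10 items

    def open_spider(self, spider):
        self.buffer = []

    def process_item(self, item, spider):
        self.buffer.append(dict(item))
        if len(self.buffer) >= self.batch_size:
            self.flush()
        return item

    def close_spider(self, spider):
        # Send whatever is left when the crawl ends.
        self.flush()

    def flush(self):
        if not self.buffer:
            return
        request = build_request(self.buffer)
        # Blocking call: simple, but it pauses the crawl while it runs.
        with urllib.request.urlopen(request, timeout=30) as response:
            response.read()
        self.buffer = []
```

Batching keeps the number of requests hitting the web app low, and because the database is only ever touched from inside PA, no SSH tunnel is needed. The blocking `urlopen` call is the simplest choice rather than the fastest one; for short crawls that is probably acceptable.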