Forums

feature request: make popular datasets available

Hi,

I recently subscribed to the "Hacker" account and I'm liking it so far.

I'm sure most are aware that Python is one of the more commonly used programming languages for experimenting with and prototyping machine learning and data analysis algorithms. It would be great if PythonAnywhere could provide central "read only" access to some of the major datasets (such as the ones available on Kaggle) to its users.

I'm sure this would be a good selling point for new users to start using PythonAnywhere and a welcome feature for existing users.

Thanks! AK

Interesting. So this is to help people skip the trouble of downloading the data? Or to save on disk space since it will be shared amongst all users?

I took a look at their datasets and it seems like a lot of them are very trendy (i.e. what's popular seems to change every couple of weeks). Are there any long-term popular ones you would recommend?

Conrad, thanks for taking an interest. Data science is all the buzz these days, and PythonAnywhere would do well to get in on the ground floor, as they say. Even if the current hype dials down a bit after a few years, it's not a fad that'll die out, as long as we're generating data the way we are and businesses see it as a way to increase profits.

The advantage would be both of the points you mentioned. There's also the issue of size - some datasets can be quite large. I think if you were willing to allocate, say, a few TB of storage, that should be enough to hold a fair number of datasets of varied sizes. (I don't know how you buy/rent storage, but this doesn't necessarily mean you'd be using that much more capacity overall - assuming a fair number of people are running machine learning/data science experiments on PythonAnywhere, some of them are probably using the same datasets, which means the same data is being duplicated across several user accounts. That duplication would be avoided if the datasets were made available centrally.)
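To make the idea concrete, here's a rough sketch of what usage could look like from a user's point of view - the mount point and file names are entirely made up for illustration, not an existing PythonAnywhere feature:

```python
import pandas as pd

# Hypothetical shared, read-only mount provided by the platform.
# The path and file layout here are invented purely as an example.
SHARED = "/srv/shared-datasets"

# Every user reads the same central copy instead of keeping
# a private duplicate under their own account's storage quota.
train = pd.read_csv(f"{SHARED}/mnist/train.csv")
print(train.shape)
```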

The question of which datasets to include is a tougher one. I haven't been in the game long enough to give a good answer, but I'm sure more experienced users could weigh in. There are some enduring ones (such as the MNIST handwritten digits and iris), some of which are already included by virtue of certain Python libraries being installed (such as scikit-learn). Others, such as the NLTK data, appear not to be available (even though the NLTK library itself is installed) - see the snippet below for the difference.
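For what it's worth, a quick sketch of that difference, assuming scikit-learn and NLTK are both installed (the small scikit-learn datasets ship with the library, whereas NLTK's corpora and models normally need a separate download step, which is exactly the kind of thing a central data store could remove):

```python
from sklearn.datasets import load_iris, load_digits
import nltk

# Small datasets bundled with scikit-learn -- no download needed.
iris = load_iris()
digits = load_digits()  # 8x8 digit images, a mini MNIST-style set
print(iris.data.shape, digits.images.shape)

# NLTK ships only the library code; its data is fetched separately
# and lands under ~/nltk_data by default.
nltk.download("punkt")  # tokenizer models
from nltk.tokenize import word_tokenize
print(word_tokenize("PythonAnywhere could host this data centrally."))
```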

Perhaps users could be polled to decide what datasets to upload, or datasets once uploaded could be retained based on transient popularity (since, as you said, trendy ones come and go) - I'm sure you could figure out a system. I just quoted Kaggle as an example; I don't know what the usage restrictions are.

While it's not entirely irrelevant which datasets you upload, I think just having data of different modalities (image, audio, text, video, etc.) readily available to play with in Python is itself a plus.

Cool! There's a lot to think about in terms of implementing this. My initial thought is that it could be something you explicitly sign up for, accessible only through the consoles/IPython notebooks (and optimized for that).

As a first step, I'll try to talk to experienced big data people, compile a list of the more enduring datasets, and go from there.