Forums

problem with calling nltk.word_tokenize()

I tried to make a simple web app to test the interaction with NLTK on PythonAnywhere, but received a "500 internal server error". What I tried to do was get a text query from the user and return the output of nltk.word_tokenize(). My __init__.py file contains:

@app.route('/analysis', methods=['POST'])
def analysis():
    text_data = request.form.get('text_data', '')
    func = request.form.get('func', '')
    proc_data = data_analys.data_process(text_data)
    return render_template('analysis.html', text_data=proc_data, func=func)

and the data_analys.py file contains:

import nltk
def data_process(data):
    return nltk.word_tokenize(data)

The problem seems to be triggered once the word_tokenize function is called. I'd appreciate your ideas on how to fix this issue.

BR, Omidemon

I managed to solve the problem by downloading the NLTK data using nltk.download() -> d -> book :)

             / ̄ ̄ ̄ ̄ ̄ ̄ ̄ ̄ ̄\    
<<<<<<:>~  <   Yay!          |   
             \_________/

I'm having the same problem from a Python command-line program. There is a very explicit traceback as to the cause of the problem:

Traceback (most recent call last):
  File "classify.py", line 118, in <module>
    nb.train_from_data(data)
  File "classify.py", line 42, in train_from_data
    self.train(doc, category)
  File "classify.py", line 101, in train
    features = self.get_features(item)
  File "classify.py", line 57, in get_features
    all_words = [w for w in word_tokenize(document) if len(w) > 3 and len(w) < 16]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 93, in word_tokenize
    return [token for sent in sent_tokenize(text)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 81, in sent_tokenize
    tokenizer = load('tokenizers/punkt/english.pickle')
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 774, in load
    opened_resource = open(resource_url)
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 888, in _open
    return find(path, path + ['']).open()
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 618, in find
    raise LookupError(resource_not_found)
LookupError:


Resource u'tokenizers/punkt/english.pickle' not found.  Please
use the NLTK Downloader to obtain the resource:  >>> nltk.download()
Searched in:
    - '/home/funderburkjim/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - u''


So a particular tokenizer resource is required, for tokenizing English.

word_tokenize is such a frequently used feature that its failure to work out of the box on PythonAnywhere should be considered a bug in the PythonAnywhere installation of the NLTK library. At least, that's my opinion and suggestion.

Incidentally, I didn't understand the solution mentioned above, namely

"downloading the nltk package using nltk.download() -> d -> book"

Run this from an interactive Python console (using the correct Python version):

import nltk
nltk.download()

and then follow the prompts. The reason it doesn't work out of the box is that you need to choose and download the relevant data yourself.

Thanks for the details on getting the NLTK download working. For anyone else who needs the particular file required by nltk.word_tokenize, the download code is 'punkt', so nltk.download('punkt') fetches it. Incidentally, the download puts the file in a place that the calling NLTK code already knows about, which is a nice detail.

Nice. Thanks for the follow-up info!