Problem with deleting files in elasticsearch

Hello,

I'm ingesting data into Elasticsearch with Python, and I wrote a separate Python script that deletes duplicate entries. When I start the deduplicating script from the bash console, it works perfectly. When I call the .py file from my other scripts, it works once. Then I reproduce the duplicates, and the second time the deduplicating script stops deleting after a couple of entries. In the error log I see that a hash was not found on Elasticsearch. Restarting the Python server helps, but then it only works one more time. What could be the problem? I can't restart the PythonAnywhere server after every deduplicating run. It should work automatically.

Error log of the second deletion run:

2021-07-02 10:33:39,301: Found credentials in shared credentials file: ~/.aws/credentials
2021-07-02 10:33:40,128: POST https://search-xxxxx-mynnywnysa2f3jytnh5syqixvm.eu-central-1.es.amazonaws.com:443/_bulk [status:200 request:0.736s]
2021-07-02 10:33:40,230: POST https://search-xxxxxe-mynnywnysa2f3jytabcdefg.eu-central-1.es.amazonaws.com:443/cdr/_search?size=20 [status:200 request:0.101s]
2021-07-02 10:33:53,308: POST https://search-xxxxx-mynnywnysa2f3jyabcdefg.eu-central-1.es.amazonaws.com:443/cdr/_search?scroll=5m&size=1000 [status:200 request:0.193s]
2021-07-02 10:33:53,411: POST https://search-xxxxx-mynnywnysa2f3jyabcdefg.eu-central-1.es.amazonaws.com:443/_search/scroll [status:200 request:0.098s]
2021-07-02 10:33:53,508: DELETE https://search-xxxxx-mynnywnysa2f3jyabcdefg.eu-central-1.es.amazonaws.com:443/_search/scroll [status:200 request:0.096s]
2021-07-02 10:33:53,604: POST https://search-xxxxx-mynnywnysa2f3jyabcdefg.eu-central-1.es.amazonaws.com:443/cdr/doc/_mget [status:200 request:0.096s]
2021-07-02 10:33:53,706: DELETE https://search-xxxxx-mynnywnysa2f3jyabcdefg.eu-central-1.es.amazonaws.com:443/cdr/_doc/cV3EZnoBaFIiCCZjy_6s [status:404 request:0.102s]
2021-07-02 10:33:53,707: Exception on /deduplicate [POST]
Traceback (most recent call last):
  File "/home/xxxx/.virtualenvs/my-virtualenv/lib/python3.6/site-packages/flask/app.py", line 2070, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/xxxx/.virtualenvs/my-virtualenv/lib/python3.6/site-packages/flask/app.py", line 1515, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/xxxx/.virtualenvs/my-virtualenv/lib/python3.6/site-packages/flask/app.py", line 1513, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/xxxx/.virtualenvs/my-virtualenv/lib/python3.6/site-packages/flask/app.py", line 1499, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/home/xxxx/mysite/flask_app.py", line 63, in deduplicate
    dp.main()
  File "/home/xxxx/mysite/duplicates.py", line 88, in main
    loop_over_hashes_and_remove_duplicates()
  File "/home/xxxx/mysite/duplicates.py", line 71, in loop_over_hashes_and_remove_duplicates
    es.delete(index='cdr', id=dup_id)
  File "/home/xxxx/.virtualenvs/my-virtualenv/lib/python3.6/site-packages/elasticsearch/client/utils.py", line 168, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "/home/xxxx/.virtualenvs/my-virtualenv/lib/python3.6/site-packages/elasticsearch/client/__init__.py", line 605, in delete
    "DELETE", _make_path(index, doc_type, id), params=params, headers=headers
  File "/home/xxxx/.virtualenvs/my-virtualenv/lib/python3.6/site-packages/elasticsearch/transport.py", line 415, in perform_request
    raise e
  File "/home/xxxx/.virtualenvs/my-virtualenv/lib/python3.6/site-packages/elasticsearch/transport.py", line 388, in perform_request
    timeout=timeout,
  File "/home/xxxx/.virtualenvs/my-virtualenv/lib/python3.6/site-packages/elasticsearch/connection/http_requests.py", line 204, in perform_request
    self._raise_error(response.status_code, raw_data)
  File "/home/xxxx/.virtualenvs/my-virtualenv/lib/python3.6/site-packages/elasticsearch/connection/base.py", line 331, in _raise_error
    status_code, error_message, additional_info
elasticsearch.exceptions.NotFoundError: NotFoundError(404, '{"_index":"cdr","_type":"_doc","_id":"cV3EZnoBaFIiCCZjy_6s","_version":1,"result":"not_found","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":2605,"_primary_term":1}')

Thank you!

I'd guess that you're trying to delete something that you previously deleted. So you're probably not tracking what you want deleted vs what you've already deleted. Trace backwards from the error to see why that id was in the list of ids to be deleted, when it should not have been.
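
To illustrate what "tracking" could look like, here is a minimal sketch, not your actual code. The function name delete_once and the delete_fn parameter are made up for the example; the only thing taken from your traceback is that the real delete call is es.delete(index='cdr', id=dup_id).

```python
def delete_once(ids_to_delete, delete_fn):
    """Call delete_fn for each id at most once and return the ids deleted.

    delete_fn is a hypothetical callable that deletes one document by id.
    """
    already_deleted = set()  # local to the call, so every run starts empty
    for doc_id in ids_to_delete:
        if doc_id in already_deleted:
            # This id was handled earlier in the same run; skip it.
            continue
        delete_fn(doc_id)
        already_deleted.add(doc_id)
    return already_deleted


# Demo with a stand-in that only records which ids it was asked to delete.
calls = []
deleted = delete_once(["x", "y", "x"], calls.append)
```

In your script, delete_fn would presumably be something like lambda doc_id: es.delete(index='cdr', id=doc_id). This only guards against repeats within one run; if the id list itself is stale from an earlier run, that needs fixing at the point where the list is built.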

Well, I ingest 50 documents, then the same 50 again. Then I delete the 50, and it works once. I ingest the same 50 documents again, and the second time it doesn't work. And why does it work repeatedly when I start the script from the console?

My guess would be that you're relying on global state in some way. When you run it from the console, the global state is new every time you run it. If you're importing it into your web app, then the global state will only be reset when you reload your web app.
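
Here is a small sketch of the kind of thing I mean. I haven't seen your duplicates.py, so every name below is hypothetical; it just shows how a module-level list keeps its contents between calls within one process, while a list created inside the function does not.

```python
# Hypothetical stand-in for a module like duplicates.py.

seen_ids = []  # module-level: created once per process, not once per call


def collect_ids_buggy(batch):
    """Appends to the module-level list, so ids from earlier calls linger."""
    seen_ids.extend(batch)
    return list(seen_ids)


def collect_ids_fixed(batch):
    """Builds a fresh list on every call, so nothing carries over."""
    local_ids = []
    local_ids.extend(batch)
    return local_ids


# Two calls in the same process, like two requests to the same web app.
first_buggy = collect_ids_buggy(["a", "b"])
second_buggy = collect_ids_buggy(["c"])  # still contains "a" and "b"

first_fixed = collect_ids_fixed(["a", "b"])
second_fixed = collect_ids_fixed(["c"])  # contains only "c"
```

From the console, each run is a new process, so seen_ids starts out empty every time. In a web app the module is imported once and stays loaded, so the second call would still see ids from the first one, including ones that have already been deleted.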