NLTK data is too large for the AWS Lambda size limit

I am writing a Python script that parses a piece of text and returns data in JSON format. I use NLTK to analyze the data. Basically, this is my flow:

Create an endpoint (API Gateway) -> it calls my Lambda function -> the function returns the required data as JSON.

I wrote my script and deployed it to Lambda, but I ran into this problem:

Resource punkt not found. Please use the NLTK Downloader to obtain the resource:

>>> import nltk
>>> nltk.download('punkt')

Searched in:
  - '/home/sbx_user1058/nltk_data'
  - '/usr/share/nltk_data'
  - '/usr/local/share/nltk_data'
  - '/usr/lib/nltk_data'
  - '/usr/local/lib/nltk_data'
  - '/var/lang/nltk_data'
  - '/var/lang/lib/nltk_data'

Even after downloading "punkt", my script still gave me the same error. I tried the solutions here:

Python optimization script extract and process large data files

but the problem is that the nltk_data folder is huge, and Lambda limits the size of what I can deploy.

How can I fix this problem? Or where else can I host my script and still integrate the API call?

I am using Serverless to deploy my Python scripts.

1 answer

There are two things you can do:

  • The error suggests that NLTK is not looking in the directory where your data lives, so point it there explicitly. Either add the directory to NLTK's own search path in code (sys.path only affects module imports, not NLTK data lookups):

import nltk
nltk.data.path.append('/var/task/nltk_data/')

or set it through an environment variable:

    • After running nltk.download(), copy the downloaded directory into the root folder of your AWS Lambda application. Name the directory "nltk_data".

    • In the Lambda function configuration (in the AWS console), add NLTK_DATA = ./nltk_data to the environment variables.
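As a rough sketch of the first option, the handler below adds a bundled nltk_data directory to NLTK's search path before tokenizing. The function names, the event shape ({"text": ...}) and the returned dict are assumptions for illustration, not something taken from the question. It also assumes the tokenizer data is already inside the deployment package.

```python
import os


def bundled_data_dir(dirname="nltk_data"):
    """Return the path of the data directory shipped with the function code."""
    # LAMBDA_TASK_ROOT is set by the Lambda runtime (normally /var/task);
    # fall back to the current directory when running locally.
    root = os.environ.get("LAMBDA_TASK_ROOT", os.getcwd())
    return os.path.join(root, dirname)


def handler(event, context):
    # Hypothetical handler name and event shape: {"text": "..."}.
    import nltk  # imported here so the helper above stays usable without NLTK

    data_dir = bundled_data_dir()
    if data_dir not in nltk.data.path:
        nltk.data.path.append(data_dir)

    # Needs the tokenizer data (punkt, per the error above) under data_dir.
    sentences = nltk.sent_tokenize(event["text"])
    return {"sentences": sentences}
```

Setting the path in code like this means the function does not depend on the NLTK_DATA environment variable being configured in the console.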


  • Reduce the size of the NLTK download, since you don't need all of it.

    • Delete all the zip files and keep only the parts you actually use. For example, if you only need stop words, keep nltk_data/corpora/stopwords and delete the rest.

    • Or, if you need tokenizers, keep nltk_data/tokenizers/punkt. Most resources can be downloaded separately, for example with python -m nltk.downloader punkt, and then you copy the files.
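The trimming step could be scripted roughly as follows. Instead of deleting files in place, this sketch copies only the resources you list from a full download into the directory that gets deployed. The function name and the example paths are hypothetical. It needs Python 3.8 or later for dirs_exist_ok.

```python
import os
import shutil


def copy_needed_resources(full_dir, bundle_dir, keep):
    """Copy only the listed NLTK resource folders into the deployment bundle.

    full_dir   -- a complete nltk_data directory from the downloader
    bundle_dir -- the nltk_data directory that ships with the Lambda code
    keep       -- relative paths such as "tokenizers/punkt"
    Returns the list of resources that were actually copied.
    """
    copied = []
    for relative in keep:
        src = os.path.join(full_dir, relative)
        dst = os.path.join(bundle_dir, relative)
        if not os.path.isdir(src):
            # Not downloaded, or present only as a .zip: skip it.
            continue
        # copytree creates any missing parent directories of dst.
        shutil.copytree(src, dst, dirs_exist_ok=True)
        copied.append(relative)
    return copied
```

For example, copy_needed_resources(os.path.expanduser("~/nltk_data"), "nltk_data", ["tokenizers/punkt", "corpora/stopwords"]) would leave only those two folders in the bundle, assuming the downloader put its data in ~/nltk_data.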


Source: https://habr.com/ru/post/1272767/

