The Jobs API
============

We have been introduced to the concept of *concurrency*: a method for managing resources such that
multiple agents or components of the system can be in progress at the same time without impacting
the correctness of the system. We have also discussed the utility of *asynchronicity*: an approach
in concurrency wherein we can schedule a task, receive an immediate response, and continue on to
other tasks while the previous task works in the background. The tools we will use to achieve this
in the software systems we are building include worker containers, a messaging system, a task queue,
and now the *Jobs API*.

The basic idea is that we will have a new endpoint in our API at a path ``/jobs`` (or something
similar). A user wanting to have our system perform a long-running task will create a new job by
making an HTTP POST request to ``/jobs``, describing the job in the POST message body (in JSON).
Instead of performing the actual computation, the request will simply be recorded in Redis and a
response will be immediately provided to the user. The response will not include the result of the
job itself, but instead it will indicate that the request has been received and it will be worked on
once it gets to the top of the queue. Also, critically, the response will include an ``id`` for the
job so that the user can check the status and, eventually, get the actual result.
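The submit-and-respond pattern described above can be sketched in a few lines. This is only an
illustration under stated assumptions: a plain dictionary (``jobs_db``) stands in for Redis, a
``queue.Queue`` (``task_queue``) stands in for the task queue, and the function names
``submit_job`` and ``get_job_status`` are hypothetical, not part of the Jobs API we will write.

.. code-block:: python

   import json
   import queue
   import uuid

   # In-memory stand-ins (assumptions for illustration only): a dict in place of
   # the Redis jobs database and a queue.Queue in place of the task queue.
   jobs_db = {}
   task_queue = queue.Queue()

   def submit_job(request_body):
       """Record a job and respond immediately, without doing the actual work."""
       params = json.loads(request_body)       # job description sent as JSON
       jid = str(uuid.uuid4())                  # unique id the user can poll with
       job = {'id': jid, 'status': 'submitted', **params}
       jobs_db[jid] = job                       # record the request
       task_queue.put(jid)                      # a worker would pick this up later
       return job                               # note: no result is included yet

   def get_job_status(jid):
       """Look up the current status of a previously submitted job."""
       return jobs_db[jid]['status']

   response = submit_job('{"start": 1, "end": 2}')
   print(response['status'])

The key point is that ``submit_job`` returns as soon as the request is recorded and queued; the
long-running computation happens elsewhere, and the returned ``id`` is what ties the two together.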
*The Jobs API is a Python module that we will write which includes methods and tools for managing
jobs in our software system.*

By the end of this module, students should be able to:

* Explain the purpose and reasoning behind all variables and methods in the Jobs API
* Decide which variables and methods should be private and which should be public
* Organize code for a software system into API, worker, and jobs modules
* Import the Jobs API into other modules to use for jobs functionality
* Perform appropriate ``curl`` requests to POST jobs and GET the result of jobs
* **Design Principles.** The implementation of our Jobs API, composed of multiple Flask routes,
  a task queue persisted in Redis, and a worker program, will demonstrate the use of modularity
  and encapsulation in software design.


Concurrency in the Jobs API
---------------------------

Recall that our big-picture goal is to add a Jobs endpoint to our Flask system that can process
long-running tasks. We will implement our Jobs API with concurrency in mind. The overall
architecture will thus be:

1. Save the request in a database and respond to the user that the analysis will eventually be run.
2. Give the user a unique identifier with which they can check the status of their job and fetch the
   results when they are ready.
3. Queue the job to run so that a worker can pick it up and run it.
4. Build the worker to actually work the job.

Parts **1-3** are the tasks of the Flask API, while part **4** will be a worker, running as a
separate container, that is waiting for new items in the Redis queue.


Code Organization
-----------------

As software systems get larger, it is very important to keep code organized so that finding the
functions, classes, etc. responsible for different behaviors is as easy as possible. To some extent,
this is technology-specific, as different languages, frameworks, etc., have different rules and
conventions about code organization. We'll focus on Python, since that is what we are using.
The basic unit of code organization in Python is called a "module". This is just a Python source
file (ends in a ``.py`` extension) with variables, functions, classes, etc., defined in it. We've
already used a number of modules, including modules that are part of the Python standard library
(e.g., ``json``) and modules that are part of third-party libraries (e.g., ``redis``).

The following should be kept in mind when designing the modules of a larger system:

* Modules should be focused, with specific tasks or functionality in mind, and their names
  (preferably, short) should match their focus.
* Modules are also the most typical entry-point for the Python interpreter itself
  (e.g., ``python some_module.py``).
* Accessing code from external modules is accomplished through the ``import`` statement.
* Circular imports can cause errors: if module A imports an object from module B, module B
  should not import from module A.


Module Design
-------------

The Python standard library is a good source of examples of module design. You can browse the
standard library for Python 3.10 in the official Python documentation.

* We see the Python standard library has modules focused on a variety of computing tasks; for
  example, for working with different data types, such as the ``datetime`` module and the ``array``
  module. The descriptions are succinct:

  * *The datetime module supplies classes for manipulating dates and times.*
  * *This module defines an object type which can compactly represent an array of basic values:
    characters, integers, floating point numbers*

* For working with various file formats: e.g., ``csv``, ``configparser``
* For working with concurrency: ``threading``, ``multiprocessing``, etc.

With this in mind, a first approach might be to break up our system into two modules:

* ``api.py`` - this module contains the Flask web server.
* ``worker.py`` - this module contains the code to execute jobs.
However, both the API server and the workers will need to interact with the database and the queue:

* The API will create new jobs in the database, put new jobs onto the queue, and retrieve the status
  of jobs (and probably the output products of the job).
* The worker will pull jobs off the queue, retrieve jobs from the database, and update them.

This suggests a different structure:

* ``api.py`` - this module contains the Flask web server.
* ``worker.py`` - this module contains the code to execute jobs.
* ``jobs.py`` - this module contains core functionality for working with jobs in Redis (and on the
  queue).

Common code for working with ``redis``/``hotqueue`` can go in the ``jobs.py`` module and be imported
in both ``api.py`` and ``worker.py``.

.. note::

   High-quality modular design is a crucial aspect of building good software. It requires
   significant thought and experience to do correctly, and when done poorly it can have dire
   consequences. In the best case, poor module design can make the software difficult to
   maintain/upgrade; in the worst case, it can prevent it from running correctly at all.

We can sketch out our module design by making a list of the functionality that will be available in
each module. This is only an initial pass at listing the functionality needed -- we will refine it
over time -- but making an initial list is important for thinking through the problem.

``api.py``: This file will contain all the functionality related to the Flask web server, and will
include functions related to each of the API endpoints in our application.

* POST /data -- Load the data into the application. Will write to Redis.
* GET /data?search=... -- List all of the data in the system, optionally filtering with a search
  query parameter. Will read from Redis.
* GET /data/ -- Get a specific object from the dataset using its ``id``. Will read from Redis.
* POST /jobs -- Create a new job. This function will save the job description to Redis and add a
  new task on the queue for the job. Will write to Redis and the queue.
* GET /jobs -- List all the jobs. Will read from Redis.
* GET /jobs/ -- Get the status of a specific job by id. Will read from Redis.
* GET /jobs//results -- Return the outputs (results) of a completed job. Will read from Redis.

``worker.py``: This file will contain all of the functionality needed to get jobs from the task
queue and execute the jobs.

* Get a new job -- Hotqueue consumer to get an item off the queue. Will get from the queue and
  write to Redis to update the status of the job.
* Perform analysis --
* Finalize job -- Saves the results of the analysis and updates the job status to complete. Will
  write to Redis.

``jobs.py``: This file will contain all functionality needed for working with jobs in the Redis
database and the Hotqueue queue.

* Save a new job -- Will need to write to Redis.
* Retrieve an existing job -- Will need to read from Redis.
* Update an existing job -- Will need to read and write to Redis.


Private vs Public Objects
-------------------------

As software projects grow, the notion of public and private access points (functions, variables,
etc.) becomes an increasingly important part of code organization.

* Private objects should only be used within the module in which they are defined. If a developer
  needs to change the implementation of a private object, she only needs to make sure the changes
  work within the existing module.
* Public objects can be used by external modules. Changes to public objects need more careful
  analysis to understand the impact across the system.

Like the layout of code itself, this topic is technology-specific. In this class, we will take a
simplified approach based on our use of Python. Remember, this is a simplification to illustrate
the basic concepts - in practice, more advanced/robust approaches are used.
* We will name private objects starting with a single underscore (``_``) character.
* If an object does not start with an underscore, it should be considered public.

We can see public and private objects in use within the standard library as well. If we open up the
source code for the ``datetime`` module, which can be found on GitHub, we see a mix of public and
private objects and methods.

* Private objects are listed first.
* Public objects start on line 442 with the ``timedelta`` class.


EXERCISE 1
~~~~~~~~~~

Create three files, ``api.py``, ``worker.py``, and ``jobs.py`` in your local directory. You may
wish to start from the files you prepared for Homework 06. You should also have a ``Dockerfile``,
``docker-compose.yml``, and ``requirements.txt`` in this directory to help with containerization
and orchestration.

.. code-block:: console

   [coe332-vm] $ ls
   Dockerfile    api.py    docker-compose.yml    jobs.py    requirements.txt    worker.py

Add the following function and variable definitions to ``jobs.py``. Closely examine each line to
make sure you understand the purpose. Carefully consider which are public and private, and why.

.. code-block:: python
   :linenos:

   import json
   import uuid
   import redis
   from hotqueue import HotQueue

   _redis_ip = 'redis-db'
   _redis_port = 6379

   # Separate Redis databases for raw data (db=0), the queue (db=1), and jobs (db=2)
   rd = redis.Redis(host=_redis_ip, port=_redis_port, db=0)
   q = HotQueue("queue", host=_redis_ip, port=_redis_port, db=1)
   jdb = redis.Redis(host=_redis_ip, port=_redis_port, db=2)

   def _generate_jid():
       """
       Generate a pseudo-random identifier for a job.
       """
       return str(uuid.uuid4())

   def _instantiate_job(jid, status, start, end):
       """
       Create the job object description as a python dictionary. Requires the job id,
       status, start and end parameters.
       """
       return {'id': jid,
               'status': status,
               'start': start,
               'end': end }

   def _save_job(jid, job_dict):
       """Save a job object in the Redis database."""
       jdb.set(jid, json.dumps(job_dict))
       return

   def _queue_job(jid):
       """Add a job to the redis queue."""
       q.put(jid)
       return

   def add_job(start, end, status="submitted"):
       """Create a new job, save it in the Redis database, and add it to the queue."""
       jid = _generate_jid()
       job_dict = _instantiate_job(jid, status, start, end)
       _save_job(jid, job_dict)
       _queue_job(jid)
       return job_dict

   def get_job_by_id(jid):
       """Return job dictionary given jid, or None if no such job exists."""
       job = jdb.get(jid)
       # jdb.get() returns None for a missing key, which json.loads() cannot parse
       return json.loads(job) if job is not None else None

   def update_job_status(jid, status):
       """Update the status of job with job id `jid` to status `status`."""
       job_dict = get_job_by_id(jid)
       if job_dict:
           job_dict['status'] = status
           _save_job(jid, job_dict)
       else:
           raise Exception(f'No job found with id {jid}')


EXERCISE 2
~~~~~~~~~~

Write a skeleton for a Flask app in the file ``api.py``. The Flask app should:

1. Import necessary modules, including some from ``jobs.py``
2. Declare an instance of the Flask class
3. Support a route for POSTing a new job
4. Support a route for GETting job status

.. tip::

   A job POST request might look like:

   .. code-block:: console

      curl localhost:5000/jobs -X POST -d '{"start":1, "end":2}' -H "Content-Type: application/json"

   In this example, we are sending a 'start' and 'end' index which is important for the "work".
   E.g., perhaps the worker is designed to plot all the values between 'start' and 'end'. In
   practice, the app that you develop may require different parameters.


EXERCISE 3
~~~~~~~~~~

Write a skeleton for a worker in the file ``worker.py``. The worker should:

1. Import necessary modules, including some from ``jobs.py``
2. Pull items (job IDs) off the queue
3. When it starts working on a new job, update the job status to 'in progress'
4. Do work (e.g., sleep for 15 seconds)
5. When it finishes working on the job, update the job status to 'complete'


EXERCISE 4
~~~~~~~~~~

Fill out the contents of the ``Dockerfile``, ``docker-compose.yml``, and ``requirements.txt`` in
order to help with containerization and orchestration. Pay careful attention to how you set up and
build the containers. Should we be using one Docker image or two? What should the entrypoint be?


EXERCISE 5
~~~~~~~~~~

Modify the definition of the ``rd``, ``q``, and ``jdb`` objects to not use a hard-coded IP address,
but to instead read the IP address from an environment variable, ``REDIS_IP``. Determine how to set
the value of ``REDIS_IP`` in the ``Dockerfile`` and / or ``docker-compose.yml`` file.
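
As a starting point for Exercise 5, one possible way to read the address from the environment is
sketched below. This is not the only approach: the helper name ``_get_redis_ip`` is hypothetical,
and falling back to ``'redis-db'`` when the variable is unset is an assumption made here so that
the code still runs outside of a configured container.

.. code-block:: python

   import os

   def _get_redis_ip(default='redis-db'):
       """
       Return the Redis host from the REDIS_IP environment variable.
       The fallback value is an assumption for illustration, not a requirement.
       """
       return os.environ.get('REDIS_IP', default)

   _redis_ip = _get_redis_ip()
   print(_redis_ip)

The resulting ``_redis_ip`` value could then be passed as the ``host`` argument when creating
``rd``, ``q``, and ``jdb``. In ``docker-compose.yml``, the variable could be set under a service's
``environment`` key; in a ``Dockerfile``, an ``ENV`` instruction serves a similar purpose.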