Sunday, 26 December 2021

Image uploading and validation architecture diagram

I am working on a project in which I need to design a system that lets users upload, download and search for images in a database. I created a system architecture diagram for the upload component, and I would like some feedback on it. To start, the requirements for the upload component are:

  1. Secure: only valid images are accepted, and each image must be unique.
  2. Scalable: over 1 million requests per month, with 100 terabytes of data uploaded per month.
  3. Robust and operational 24/7.
  4. Data does not need to be immediately available for download; an upload has up to 1 week to complete.
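To put requirement 2 in perspective, here is a quick back-of-envelope calculation, assuming traffic is spread evenly across a 30-day month (real traffic will be burstier, so peak capacity needs headroom above these averages):

```python
# Back-of-envelope estimate of the load implied by requirement 2,
# assuming uploads are spread evenly over a 30-day month.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds

requests_per_month = 1_000_000
bytes_per_month = 100 * 10**12  # 100 TB

avg_requests_per_second = requests_per_month / SECONDS_PER_MONTH
avg_upload_mb_per_second = bytes_per_month / SECONDS_PER_MONTH / 10**6

print(f"~{avg_requests_per_second:.2f} requests/s on average")
print(f"~{avg_upload_mb_per_second:.0f} MB/s sustained inbound bandwidth")
```

That averages out to well under one request per second but roughly 39 MB/s of sustained inbound bandwidth, which suggests bandwidth and storage, rather than request rate, are the harder scaling problems here.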



It would be great to get some feedback on my architecture. If it helps, here are some prompts that may make it easier to structure your feedback:

  1. What is missing?
  2. What must be removed?
  3. What problems were overlooked?
  4. Will this meet the scalability demands required?
  5. What additional changes would you make?



Explanation of architecture diagram:

  1. A Frontend allows a user to upload a compressed folder of images. The frontend lightly verifies the folder's contents to ensure they match some schema (i.e. just JPEGs). The upload is asynchronous: the user only waits for the initial upload, not for the whole pipeline to complete, and is later notified of whether their folder was accepted.
  2. A Load Balancer selects which upload server instance images are sent to.
  3. Upload servers perform additional verification on the images, checking that the content type and schema are correct (i.e. same as above: just JPEGs). The upload servers then chunk the image folder for concurrency, sending subsets of the folder into a queue once they have passed verification.
  4. A Queue receives each subset of the image dataset, together with a unique hash that maps chunks back to the dataset they belong to.
  5. There are two consumers of that queue. One is a Validation Database that keeps, say, 10 percent of the un-validated images as a reference set. The other is an ML Validation Service which, like the upload servers, runs as multiple instances (sorry for the change in graphical representation, I ran out of space). The ML models take the images as input one by one and validate them by running them through their networks. The models run in parallel, so the time it takes to run all ML validation equals the time taken by the slowest model. Only images that pass every validation (i.e. return True for every ML model) are passed on to storage.
  6. If an image passes validation, it is sent to an S3 Bucket within AWS via some Database Connector.
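For the schema checks in steps 1 and 3, one common approach is to inspect each file's magic bytes rather than trust file extensions or client-supplied MIME types. A minimal sketch (the function names are illustrative, not components from the diagram):

```python
# Sketch of the "just JPEGs" verification in steps 1 and 3.
# JPEG files begin with the SOI marker FF D8 and end with EOI FF D9,
# so a cheap structural check can reject obvious non-JPEGs early.
def is_jpeg(data: bytes) -> bool:
    """Return True if `data` is plausibly a JPEG file."""
    return (
        len(data) >= 4
        and data[:2] == b"\xff\xd8"   # SOI (start of image) marker
        and data[-2:] == b"\xff\xd9"  # EOI (end of image) marker
    )

def verify_folder(files: dict[str, bytes]) -> list[str]:
    """Return the filenames that fail the JPEG check."""
    return [name for name, data in files.items() if not is_jpeg(data)]
```

This only guards against mislabeled files; the deeper ML validation in step 5 still has to catch images that are structurally valid JPEGs but undesirable in content.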
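The chunking and dataset hash in steps 3 and 4 could be sketched like this. All names here (`CHUNK_SIZE`, the message fields, the list standing in for the queue) are assumptions for illustration, not details from the diagram:

```python
# Sketch of steps 3-4: an upload server splits a verified folder into
# chunks and enqueues each chunk tagged with a dataset ID, so queue
# consumers can map chunks back to the upload they came from.
import hashlib
import json

CHUNK_SIZE = 100  # images per queue message (assumed tuning knob)

def dataset_id(user_id: str, filenames: list[str]) -> str:
    """Deterministic hash identifying one uploaded folder."""
    digest = hashlib.sha256()
    digest.update(user_id.encode())
    for name in sorted(filenames):
        digest.update(name.encode())
    return digest.hexdigest()

def chunk_and_enqueue(user_id: str, filenames: list[str], queue: list) -> str:
    """Split `filenames` into chunks and enqueue one message per chunk."""
    ds_id = dataset_id(user_id, filenames)
    total = (len(filenames) + CHUNK_SIZE - 1) // CHUNK_SIZE
    for i in range(0, len(filenames), CHUNK_SIZE):
        queue.append(json.dumps({
            "dataset_id": ds_id,        # maps every chunk to its dataset
            "chunk_index": i // CHUNK_SIZE,
            "total_chunks": total,      # lets consumers detect completion
            "images": filenames[i:i + CHUNK_SIZE],
        }))
    return ds_id
```

Carrying `total_chunks` in every message is one way for a downstream consumer to know when a dataset is fully processed and the user can be notified.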



A couple notes that came up while I was thinking about this:

  1. Near real-time over batch processing. Although this pipeline could be batch processed, storing potentially invalid images in some database would cost unnecessary space and money.
  2. Near real-time over stream processing. Processing images too quickly may let invalid images slip into the database, leaving room for malicious actors to exploit weaknesses in the ML models. Near real-time gives quality engineers the luxury of checking on the database every hour instead of being glued to a screen.

Closing Notes:
I'm new to this and hoping to benefit from your substantial experience here on Stack. Please don't be shy about asking a question, even if it might seem basic; I look forward to incorporating your advice and am happy to explain any of these components in more depth. If there are design patterns I could make use of, or links that would be helpful to read, I am open to any book or article suggestions you may have for me.

Thank you in advance!

Architecture Diagram
