mardi 11 janvier 2022

Search a given list of keywords in large number of documents

So i was giving an interview and it was going pretty well except, during the last 20 minutes I got the following question,

You have a service which takes in a resume from users and saves it to a database.
Now, any recruiter do a search query on them based on any combination of the following keywords,
Location, Experience, Domain etc.
For example,
Query 1. Return list of resumes where, experience is 3 years and domain is Java.
Query 2. Return list of resumes where, experience is 2 years and domain is Kotlin and Android.

Experience will be given in exact numbers so resume having '3 years of experience' in Java can be ignored if query is only for '2 years of experience'.
Number of resumes and search queries can be in millions.
Design a data structure to make search queries fast and return list of resumes to the recruiter.
If any pre-processiing is being done on the resume, describe it. Assume that it will be already implemented.

The question boils down to "Find list of documents containing given list of words".

During the interview, I was only able to give the approach of pre-processing each resume by creating a reverse index for each word. Then during search query, our data structure will retrieve the list of documents for each keyword and merge them by taking intersection of results.

For example, for the documents,
Doc1 : 3 years experience in Java and SpringBoot
Doc2 : 2 years experience in Java and Android.

Following reverse indexing for fields can be done during pre-processing,

Experience :
3 => Doc1
2 => Doc2

Domain :
Java => Doc1, Doc2
SpringBoot => Doc1
Android => Doc2

Then for Query 1, we get 'Doc1' for 3 years of experience and, 'Doc1, Doc2' for Java domain.
Intersection of above results will give us 'Doc1'.

And for Query 2, we get 'Doc2' for 2 years of experience and, 'Doc2' for Kotlin and Android.
Intersection of above results will give us no document.

But the recruiter said that this is pretty slow as we are still going through all the resumes having the search query keywords. I thought about it and could only come up with grouping the search keywords in pairs and parallelly retrieving and merge each keywords pairs.

Time ran out and I got rejected. I still cannot figure out a better way for solving this.

Can anyone point out for a good resource on this problem as the problem of searching a set of keywords in documents seems pretty common?

Thanks

Aucun commentaire:

Enregistrer un commentaire