How does Google index text documents?

Question

Unfortunately, I couldn't find much information on this: no PDFs or textbooks that discuss it in enough detail, and no forum posts either. I just want to learn from an example scenario of indexing in big data.

I've found this, but I'd love some explanation and details about it rather than a picture alone.

Welcome to DataScienceSE. In general, Google doesn't publish the details of its methods; as a private company, it doesn't want to and doesn't have to. This picture gives a decent idea of what indexing basically looks like, but there are certainly a lot of tweaks to make things more accurate and more efficient. I'm not sure what kind of details you're looking for. Is the concept of an inverted index clear to you? — Erwan, Jul 10 '22 at 09:49
Yeah, it's somewhat clear to me. I don't need Google's methods if they aren't public, just the general indexing techniques proposed for big data. Lucene is an example of an indexing tool, but I want to learn at the level of this figure rather than how Lucene is coded: a high-level overview. — jewloa, Jul 10 '22 at 10:03
I don't feel confident enough to give a full answer, but the main idea is indeed simply storing the inverted index, usually together with something like TF-IDF. The main issue is efficient storage and access, which is more a distributed-database problem than an NLP one. — Erwan, Jul 10 '22 at 11:30
Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. — Community, Jul 10 '22 at 15:06
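
As a rough illustration of the idea raised in the comments (an inverted index stored together with something like TF-IDF), here is a minimal Python sketch. It is not Google's or Lucene's method, only a toy in-memory version under simplifying assumptions: whitespace tokenization, raw term counts, and an unsmoothed log(N / df) weight. The names `tokenize`, `build_index`, and `search`, and the three sample documents, are made up for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Deliberately naive (assumption): lowercase and split on whitespace.
    return text.lower().split()


def build_index(docs):
    """Map each term to its postings: {doc_id: term_frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Rank documents by the summed TF-IDF weight of the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in any document
        # Rare terms get a higher weight; a term in every document gets 0.
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical sample corpus, keyed by document id.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "the quick dog barks",
}
index = build_index(docs)
results = search(index, len(docs), "quick dog")
```

Here document 3 ranks first because it is the only one containing both query terms. The lookup touches only the postings of the query terms rather than scanning every document, which is the point of inverting the index. At large scale the postings would presumably be compressed and sharded across machines, which is the storage and access problem mentioned in the comments.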

0 Answers