Questions tagged [document-understanding]

10 questions
3 votes, 0 answers

Multi-page document image classification

Sorry for the long post, but I needed it to capture all the details and questions. I am working on a multi-page document image classification problem and am unsure which approach or model architecture to follow. Here is the…
2 votes, 0 answers

What is the meaning of, or explanation for, having multiple tags in a Doc2Vec model's TaggedDocuments?

I've tried reading the other answers on this topic but I'm unsure if I understand completely. For my dataset, I have a series of tagged documents, "good" or "bad." Each document belongs to an entity, and each entity has a different number of…
Jayke
2 votes, 2 answers

"Object" Detection in Textual Data

I have a task where the input is a parsed document (i.e., full text in 1 string or tokens) and I need to classify parts of the text into say 5 classes (i.e., 5 tokens from the entire text are labeled into 5 different classes). Example: Document #1:…
leed
1 vote, 1 answer

Document clustering to merge common labels

I am building a recommendation system and I have to clean up some of the labels that I have. For example, df['resolution_modified'].value_counts() gives: 105829 It is recommended to replace scanner …
1 vote, 0 answers

Highlight specific paragraphs from documents

I have a bunch of documents in which I want to highlight certain paragraphs/keyphrases. I have a list of the most frequently appearing sentences and I want to search for these paragraphs/keyphrases in the document and if they appear, highlight them.…
spectre
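One way to read the approach described in this question (search a document for a known list of sentences or keyphrases and mark each hit) is sketched below. This is only a minimal illustration under stated assumptions, not the asker's code: the function name `highlight_phrases`, the `[[`/`]]` markers, and the sample text are all hypothetical, and matching is assumed to be case-insensitive and tolerant of line breaks inside a phrase.

```python
import re


def highlight_phrases(text, phrases, open_mark="[[", close_mark="]]"):
    """Wrap every occurrence of any known phrase in marker strings.

    Hypothetical helper for illustration only; the markers stand in for
    whatever highlighting the target format (HTML, PDF, ...) supports.
    """
    # Try longer phrases first so a short phrase never pre-empts a longer
    # one that starts at the same position.
    ordered = sorted({p for p in phrases if p.strip()}, key=len, reverse=True)
    if not ordered:
        return text
    # Escape each word so punctuation is matched literally, and let any run
    # of whitespace in the document stand in for a space in the phrase.
    parts = [r"\s+".join(re.escape(w) for w in p.split()) for p in ordered]
    pattern = re.compile("|".join(parts), flags=re.IGNORECASE)
    return pattern.sub(lambda m: f"{open_mark}{m.group(0)}{close_mark}", text)


# Made-up sample document and phrase list.
doc = "Terms: payment is due\nwithin 30 days. Thank you."
known = ["Payment is due within 30 days"]
print(highlight_phrases(doc, known))
```

Exact matching like this only finds phrases that appear nearly verbatim; paraphrased or OCR-damaged text would need fuzzy or embedding-based matching instead.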
1 vote, 1 answer

Data Analytics Documentation

I am working as a data analyst in a company. My colleagues and I use different tools and software to analyze the data and produce the reports (e.g., Excel, Python, R, Alteryx, SQL, Tableau). Each of us uses his/her favourite tools to do the tasks.…
1 vote, 0 answers

Entity Linking for Receipts

I am building a model for reading receipts from their mobile snapshots. After the receipt is OCR'd, I plan to use a variation on LayoutLM for entity extraction. Entities are: "quantity", "price-per-unit", "product-name", "items-price", etc. What is…
0 votes, 1 answer

Identify Resume Structure

I am trying to build a resume parser (from PDF to JSON). After extracting text from a PDF as one long string, how would you split the string into different sections, as the red lines show? Resumes have different formats and people use different…
E.K.
0 votes, 1 answer

How to extract handwritten phone numbers from a huge set of documents?

Say you have a lot of PDF documents, say K documents. Each document Di is Ni pages long. On one of the Ni pages (I don't know which, say Pi), there is the information you need to extract. I am thinking about building a three-step pipeline for…
Anmol Deep
0 votes, 0 answers

What is the most efficient way of image document classification?

So, I am working on a project where I have to extract the sales tax invoice from a PDF document that contains other files along with the invoice. I researched the topic and am considering two solutions. Converting the PDF to images and then…