Questions tagged [document-understanding]
10 questions
3
votes
0 answers
Multi-page document image classification
Sorry for the long post, but I needed the space to capture all the details and questions.
I am working on a multi-page document image classification problem and am unsure which approach or model architecture to follow. Here is the…
asanoop24
- 141
- 1
2
votes
0 answers
What is the meaning of, or explanation for, having multiple tags in a Doc2Vec model's TaggedDocuments?
I've tried reading the other answers on this topic but I'm unsure if I understand completely.
For my dataset, I have a series of tagged documents, "good" or "bad." Each document belongs to an entity, and each entity has a different number of…
Jayke
- 21
- 1
2
votes
2 answers
"Object" Detection in Textual Data
I have a task where the input is a parsed document (i.e., the full text as one string or as tokens) and I need to classify parts of the text into, say, 5 classes (i.e., 5 tokens from the entire text are labeled into 5 different classes).
Example:
Document #1:…
leed
- 145
- 4
1
vote
1 answer
Document clustering to merge common labels
I am building a recommendation system and need to clean up some of my labels. For example,
df['resolution_modified'].value_counts()
gives
105829
It is recommended to replace scanner …
Wolfy
- 237
- 2
- 9
1
vote
0 answers
Highlight specific paragraphs from documents
I have a set of documents in which I want to highlight certain paragraphs/keyphrases. I have a list of the most frequently appearing sentences, and I want to search each document for these paragraphs/keyphrases and highlight any that appear.…
spectre
- 1,831
- 1
- 9
- 29
1
vote
1 answer
Data Analytics Documentation
I work as a data analyst in a company. My colleagues and I use different tools and software to analyze the data and produce reports (e.g., Excel, Python, R, Alteryx, SQL, Tableau).
Each of us uses his/her favourite tools to do the tasks.…
N.IT
- 1,975
- 4
- 17
- 35
1
vote
0 answers
Entity Linking for Receipts
I am building a model for reading receipts from their mobile snapshots. After the receipt is OCR'd, I plan to use a variation on LayoutLM for entity extraction. Entities are: "quantity", "price-per-unit", "product-name", "items-price", etc.
What is…
fierval
- 11
- 1
0
votes
1 answer
Identify Resume Structure
I am trying to build a resume parser (from PDF to JSON). After extracting the text from a PDF as one long string, how would you split the string into different sections, as the red lines show? Resumes have different formats and people use different…
E.K.
- 405
- 4
- 6
0
votes
1 answer
How to extract handwritten phone numbers from a huge set of documents?
Say you have a lot of PDF documents, say K documents.
Each document Di is Ni pages long. In one of the Ni pages (don't know which, say Pi), there is the information you need to extract.
I am thinking about building a three-step pipeline for…
Anmol Deep
- 101
- 3
0
votes
0 answers
What is the most efficient way of image document classification?
I am working on a project where I have to extract a sales tax invoice from a PDF document that contains other files along with the invoice. I researched the topic and am considering two solutions.
Converting the PDF to images and then…
Sardar Arslan
- 1
- 1