
I have a body of PDF documents of differing vintage.

Our group had exported the documents as text to feed them into a natural-language parser (I think) to pull out subject-verb-predicate triples.

This hasn't performed as well as hoped, so I exported the documents as XML using Acrobat Pro, hoping to capture the semantic document structure and pass it in as a hint to the text parser.

One document looked pretty good (something like this):

<TaggedPDF-doc>
  <bookmark-tree>...</bookmark-tree>
  <Sect>...</Sect>
  <Sect>...</Sect>
  <Sect>...</Sect>
  <Sect>...</Sect>
  <Sect>...</Sect>
  <Sect>...</Sect>
  <Sect>...</Sect>
  <Sect>...</Sect>
  <Sect>...</Sect>
</TaggedPDF-doc>

Another one wasn't quite so nice semantically:

<TaggedPDF-doc>
  <bookmark-tree>...</bookmark-tree>
  <Part>
    <H1>(name of document)</H1>
    <P>(title of document)</P>
    <Sect>
      <H1>22 October 2013 </H1>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <Figure>...</Figure>
      <P id="LinkTarget_1388">PREFACE </P>
      <P>1. Scope </P>
      <P>...</P>
      <P>2. Purpose </P>
      <P>...</P>
      <P>3. Application </P>
      <L>...</L>
      <P>Intentionally Blank </P>
      <P id="LinkTarget_1389">SUMMARY OF CHANGES </P>
      <P>...</P>
      <L>...</L>
      <P>Intentionally Blank </P>
      <P id="LinkTarget_1390">TABLE OF CONTENTS </P>
      <P>...</P>
      <P>...</P>    <!-- Chapter 1 started here -->
      <P>...</P>
      <P>...</P>    <!-- Chapter 2 started here -->
      <P>...</P>
      <P>...</P>    <!-- Chapter 3 started here -->
      <P>...</P>
      <P>CHAPTER IV </P>
      <P>...</P>
      <P>(part of chapter 4 title)</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <Link>...</Link>
      <Link>...</Link>
      <P>...</P>
      <Link>...</Link>
      <P>Intentionally Blank </P>
      <P id="LinkTarget_1391">EXECUTIVE SUMMARY </P>
      <P>(section title)</P>
      <P>(a bullet item inside only a paragraph element)</P>
      <P>(a bullet item inside only a paragraph element)</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <P>(some text inside only a paragraph element)</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <P>(some text inside only a paragraph element)</P>
      <P>...</P>
      <Table>...</Table>
      <P>(some text inside only a paragraph element)</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <Sect>...</Sect>
      <Sect>...</Sect>
      <Sect>...</Sect>
      <Sect>...</Sect>
      <Sect>...</Sect>
      <Sect>...</Sect>
      <Sect>...</Sect>
      <Sect>...</Sect>
      <Sect>...</Sect>
      <Sect>...</Sect>
      <Sect>...</Sect>
    </Sect>
  </Part>
</TaggedPDF-doc>

I'm relatively new to data science, but is handling the "normalization" of this kind of data set (XML documents largely in a chapter-section-subsection format) a fairly tractable problem? Or maybe even a solved one?

These are the only two documents I've looked at so far, but I suspect that each document will be snowflake enough that it could benefit from applying a machine learning algorithm to bring some consistency to the tags. I picture something very basic, like nested <section title=""> tags and using the PDF XML output structure as leverage whenever I can.
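As a rough illustration of that idea, here is a minimal rule-based sketch in Python, meant as a baseline to compare any learned approach against rather than a solution. Everything in it is an assumption drawn from the second example only: a <P> is treated as a top-level heading if it has a LinkTarget_ id or is short and all upper case, and as a sub-heading if it starts with a number and a period. The names heading_level and normalize, the 60-character limit, and the body text in the sample are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed pattern for numbered sub-headings such as "1. Scope".
NUMBERED = re.compile(r"^\d+\.\s+\S")


def heading_level(elem):
    """Return 1 for a top-level heading, 2 for a numbered sub-heading, else None."""
    text = (elem.text or "").strip()
    # Only short, non-empty <P> elements are heading candidates (60 is a guess).
    if elem.tag != "P" or not text or len(text) > 60:
        return None
    if elem.get("id", "").startswith("LinkTarget_") or text.isupper():
        return 1
    if NUMBERED.match(text):
        return 2
    return None


def normalize(flat_parent):
    """Group the children of a flat element into nested <section title=""> elements."""
    root = ET.Element("section", title="")
    stack = [(0, root)]  # (level, element) for each currently open section
    for child in flat_parent:
        level = heading_level(child)
        if level is None:
            # Ordinary content goes into the innermost open section.
            stack[-1][1].append(child)
            continue
        # Close sections at the same or a deeper level before opening a new one.
        while stack[-1][0] >= level:
            stack.pop()
        section = ET.SubElement(stack[-1][1], "section", title=child.text.strip())
        stack.append((level, section))
    return root


# Made-up sample shaped like the second document; the body text is invented.
sample = """<Sect>
  <P id="LinkTarget_1388">PREFACE </P>
  <P>1. Scope </P>
  <P>Body text of the scope.</P>
  <P>2. Purpose </P>
  <P>Body text of the purpose.</P>
  <P id="LinkTarget_1389">SUMMARY OF CHANGES </P>
  <P>Body text of the summary.</P>
</Sect>"""

tree = normalize(ET.fromstring(sample))
print(ET.tostring(tree, encoding="unicode"))
```

Rules like these will misfire on other documents (a short body paragraph that begins with "1." would be taken for a heading, for instance), which is exactly where a learned classifier over the same kinds of features might help.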

Clarification: I'd like some insight into algorithms (possibly existing ones) that can "recognize" the semantic sections of a document that's structured inconsistently and loosely, like the second example I posted above. It would be straightforward for me to do this manually, but an algorithm that does this (or even part of it) would be an improvement. I'd like the tag structure of the target document (I guess that's what it's called) to be truer to the type of data each element holds. A subset of the HTML5 tags would work: <section>, <article>, <caption>, <figure>. OpenDocument is probably too much, but I want something a step better than calling everything a <P>.

Any insights / pointing in the right direction?

John
  • "some consistency" is not a very specific target. Any chance you can be more specific in your problem description? What is your ultimate target with the parsed document? etc. – oW_ Apr 06 '20 at 22:35
  • I'm unsure how to make the question clearer. The top document has clearly marked `Sect` tags, while the bottom one has `P` tags for everything from section titles to entire sections of the document to actual paragraphs. But they're both perfectly human-readable and appear well-formatted when a human looks at them. I don't even know how to categorize the problem I'm asking about, or the tools I would use to make headway. All I can say is that it looks like a machine-learning problem. – John Apr 06 '20 at 23:13
  • @oW_ I added clarification. I don't really need a full solution but just something to improve my search terms so I can research productively. – John Apr 06 '20 at 23:33
