4

I have an html file like this:

<h1>Group 1</h1>
<table>
  <tr>
    <td>Col1</td>
    <td>Col2</td>
    <td>Col3</td>
  </tr>
  <tr>
    <td>ValA</td>
    <td>ValB</td>
    <td>ValC</td>
  </tr>
</table>
<h1>Group 2</h1>
<table>
  <tr>
    <td>Col1</td>
    <td>Col2</td>
    <td>Col3</td>
  </tr>
  <tr>
    <td>ValP</td>
    <td>ValQ</td>
    <td>ValR</td>
  </tr>
</table>

I want to read it into Pandas as if it had this structure:

<table>
 <tr>
   <td>Caption</td>
   <td>Col1</td>
   <td>Col2</td>
   <td>Col3</td>
 </tr>
 <tr>
   <td>Group 1</td>
   <td>ValA</td>
   <td>ValB</td>
   <td>ValC</td>
 </tr>
 <tr>
   <td>Group 2</td>
   <td>ValP</td>
   <td>ValQ</td>
   <td>ValR</td>
 </tr>
</table>

I can do it easily with PowerQuery the language for PowerBI:

let
    Source = Web.Page(File.Contents("multiple_tables.html")),
    #"Expanded Data" = Table.ExpandTableColumn(Source, "Data", {"Column1", "Column2", "Column3"}, {"Col1", "Col2", "Col3"}),
    #"Filtered Rows" = Table.SelectRows(#"Expanded Data", each ([Caption] <> "Document") and ([Col1] <> "Col1"))
in
    #"Filtered Rows"

Is there a way to achieve this effect in less than 10 lines of code using Python/Pandas plus some html parser opensource library? Or should I resign myself to writing lower level code to handle this?

Stephen Rauch
  • 1,783
  • 11
  • 21
  • 34
Luxspes
  • 143
  • 1
  • 6

1 Answers1

3

With a bit of parsing via BeautifulSoup, we can get a pandas.Dataframe using pandas.read_html() like:

Code:

def get_tables(source):
    elems = iter(BeautifulSoup(source, 'lxml').find_all(['table', 'h1']))
    df = pd.DataFrame(
        pd.read_html(str(next(elems)), header=0)[0].iloc[0].rename(h1.text)
        for h1 in elems)
    df.index.names = ['Caption']
    return df

Test Code:

import pandas as pd
from bs4 import BeautifulSoup

with open('test.html', 'r') as f:
    print(get_tables(f))

Results:

         Col1  Col2  Col3
Caption                  
Group 1  ValA  ValB  ValC
Group 2  ValP  ValQ  ValR
Stephen Rauch
  • 1,783
  • 11
  • 21
  • 34