Pandas Read_html Function With Colspan=2

I'm using the pandas read_html function to load an HTML table into a DataFrame. However, it fails because the source data has a merged header cell (colspan=2), resulting in this Assert

Solution 1:

If you don't insist on using read_html from pandas, this code does the job:

import pandas as pd
from lxml.html import parse
from urllib.request import urlopen  # Python 3; on Python 2 this was urllib2
from pandas.io.parsers import TextParser

def _unpack(row, kind='td'):
    # Collect the text of every <td> (or <th>) cell in one table row
    elts = row.findall('.//%s' % kind)
    return [val.text_content() for val in elts]

def parse_options_data(table):
    # The first row holds the header cells; the remaining rows hold the data
    rows = table.findall('.//tr')
    header = _unpack(rows[0], kind='th')
    data = [_unpack(r) for r in rows[1:]]
    return TextParser(data, names=header).get_chunk()

parsed = parse(urlopen('http://www.bmfbovespa.com.br/en-us/intros/Limits-and-Haircuts-for-accepting-stocks-as-collateral.aspx?idioma=en-us'))
doc = parsed.getroot()
tables = doc.findall('.//table')
table = parse_options_data(tables[0])

This is taken from the book "Python for Data Analysis" by Wes McKinney.
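If the merged header cell still causes trouble with this approach, one option is to repeat each cell's text once per spanned column while unpacking a row. The following is only a minimal sketch under assumptions: the inline HTML, the column names, and the helper name unpack_with_colspan are made up for illustration and are not from the book or from the page in the question.

```python
import pandas as pd
from lxml.html import fromstring

# Hypothetical sample table with a merged (colspan=2) header cell
HTML = """<table>
  <tr><th colspan="2">Price</th><th>Volume</th></tr>
  <tr><td>1</td><td>2</td><td>3</td></tr>
</table>"""

def unpack_with_colspan(row):
    """Return the cell texts of a row, repeating each cell colspan times."""
    values = []
    for cell in row:
        # Skip anything that is not a header or data cell (e.g. comments)
        if cell.tag not in ("th", "td"):
            continue
        span = int(cell.get("colspan", 1))
        values.extend([cell.text_content().strip()] * span)
    return values

table = fromstring(HTML)
rows = [unpack_with_colspan(r) for r in table.findall(".//tr")]
header, data = rows[0], rows[1:]

# Duplicate column names are allowed, so the merged header appears twice
df = pd.DataFrame(data, columns=header)
print(df)
```

This keeps the header and the data rows the same width, at the cost of duplicate column names. It does not handle rowspan.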

Solution 2:

pandas >= 0.24.0 understands colspan and rowspan attributes. As per the release notes:

result = pd.read_html("""
    <table><thead><tr><th>A</th><th>B</th><th>C</th></tr></thead><tbody><tr><td colspan="2">1</td><td>2</td></tr></tbody></table>""")

result

Out:

[   A  B  C
 0  1  1  2]

Previously this would return the following:

[   A  B   C
 0  1  2 NaN]

I can't test with your link because the URL is not found.
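To check the behaviour locally without the original page, something like the sketch below should work. It assumes pandas >= 0.24.0 with an HTML parser such as lxml installed, and it wraps the markup in io.StringIO because newer pandas releases warn when a literal HTML string is passed to read_html. The variable names are only illustrative.

```python
from io import StringIO

import pandas as pd

# Same table as in the release notes: the first data cell spans two columns
html = """<table>
  <thead><tr><th>A</th><th>B</th><th>C</th></tr></thead>
  <tbody><tr><td colspan="2">1</td><td>2</td></tr></tbody>
</table>"""

print(pd.__version__)  # colspan/rowspan handling needs 0.24.0 or later

# read_html returns a list of DataFrames, one per <table> found
tables = pd.read_html(StringIO(html))
df = tables[0]
print(df)
```

If the value of the spanned cell shows up in both A and B, the installed pandas handles colspan and the same call can be pointed at the real page.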
