An XLS file is a binary file, not a ZIP archive.
That is why you get a ZIP error when you use the openpyxl engine. You can leave the engine unset and let pandas select it.
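When the extension cannot be trusted, one way to see which reader has a chance of working is to look at the first bytes of the file. The sketch below is only an illustration: the function name `sniff_spreadsheet_format` and the labels it returns are made up for this example, and it assumes the well-known ZIP and OLE2 compound-file signatures are enough to tell XLSX and binary XLS apart.

```python
# Leading bytes ("magic numbers") of the container formats involved.
ZIP_MAGIC = b"PK\x03\x04"                         # XLSX is a ZIP archive
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # binary XLS (OLE2 compound file)
UTF8_BOM = b"\xef\xbb\xbf"


def sniff_spreadsheet_format(path):
    """Guess the real format of a file from its first bytes.

    Hypothetical helper: returns "xlsx", "xls", "xml" or "text".
    """
    with open(path, "rb") as fh:
        head = fh.read(64)
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    if head.startswith(OLE2_MAGIC):
        return "xls"
    # Some exported files start with one or more BOMs; skip them.
    while head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    if head.lstrip().startswith(b"<"):
        return "xml"
    # Anything else is treated as plain text, for example CSV.
    return "text"
```

A file named `.xls` that comes back as `"xml"` or `"text"` will not open with either Excel engine, whatever engine you pass to pandas.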
Now that the problem has been identified as a Microsoft Excel 97-2003 XML spreadsheet, I have developed a reader based on this documentation: https://en.wikipedia.org/wiki/Microsoft_Office_XML_formats
import pandas as pd
import xml.etree.ElementTree as ET
from io import StringIO ## for Python 3
import re
### Performs some cleanup on the generated XML
### These cleanups are based on what the https://www.ishares.com/ files need
### You need to customize them based on your own needs
def cleanup(xml):
    # Remove the multiple BOMs in the file
    xml = xml.replace('\ufeff', '')
    # Regular expression that finds an "&" that does not start an entity reference
    pattern = r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)'
    # Escape the lone "&"
    xml = re.sub(pattern, '&amp;', xml)
    # TODO: you can perform other cleanups here
    return xml
### Remove the namespaces to improve code readability
def remove_namespace(tree):
    for elem in tree.iter():
        # Tags and attribute keys look like "{namespace}name"; keep only "name"
        if '}' in elem.tag:
            elem.tag = elem.tag.split('}', 1)[1]
        for key in list(elem.attrib.keys()):
            if '}' in key:
                new_key = key.split('}', 1)[1]
                elem.attrib[new_key] = elem.attrib.pop(key)
### Extract one data frame per worksheet from the XML, given each worksheet name and its first line
def extract_data_from_xml(xml_file, sheets):
    df = {}
    tree = ET.parse(xml_file)
    remove_namespace(tree)
    root = tree.getroot()
    for sheet_name, start_row in sheets.items():
        data = extract_data_from_root(root, sheet_name, start_row)
        if len(data) > 0:
            # The first extracted row holds the column headers
            headers = data[0]
            df[sheet_name] = pd.DataFrame(data[1:], columns=headers)
    return df
### Extract the data array from the parsed XML root, not from the text, to improve performance
def extract_data_from_root(root, sheet_name, start_row):
    data = []
    found_table = False
    acc = 0
    for elem in root.iter():
        if found_table and elem.tag == 'Row':
            row_data = [cell.text for cell in elem.findall('Cell/Data')]
            if acc >= start_row:
                data.append(row_data)
            else:
                # Skip the rows before start_row
                acc += 1
        elif elem.tag == 'Table' or elem.tag == 'Worksheet':
            if elem.attrib.get('Name') == sheet_name:
                found_table = True
            else:
                # Stop once the rows of the requested sheet have been collected
                if found_table and len(data) > 0:
                    break
                else:
                    continue
    return data
### The core function
def read_XML_MS_spreadsheet(filename, sheets):
    with open(filename, mode="r", encoding="utf-8-sig") as fin:
        xml = fin.read()
    xml = cleanup(xml)
    df = extract_data_from_xml(StringIO(xml), sheets)
    return df
def read_XLS_and_XLSX_spreadsheet(filename, sheet_start_rows):
    dfs = {}
    for sheet_name, start_row in sheet_start_rows.items():
        df = pd.read_excel(filename, sheet_name=sheet_name, header=start_row)
        dfs[sheet_name] = df
    return dfs
### main
sheets = {
    'Holdings': 7,
    'Historical': 0,
    'Performance': 4,
    'Distributions': 0,
}
# Originally posted on https://stackoverflow.com/questions/77958287/having-difficulties-to-open-an-old-format-xls-file-with-python-pandas/77958383#77958383
# https://www.ishares.com/us/products/239566/ishares-iboxx-investment-grade-corporate-bond-etf/1521942788811.ajax?fileType=xls&fileName=iShares-iBoxx--Investment-Grade-Corporate-Bond-ETF_fund&dataType=fund
file_path = 'iShares-iBoxx--Investment-Grade-Corporate-Bond-ETF_fund.xls'
## First try to read the file as XLSX or XLS. Pandas will choose the right engine
try:
    df = read_XLS_and_XLSX_spreadsheet(file_path, sheets)
except Exception:
    try:  ## If this isn't an XLS or XLSX file, try to load it as an MS-XML 2003 spreadsheet, see https://en.wikipedia.org/wiki/Microsoft_Office_XML_formats
        df = read_XML_MS_spreadsheet(file_path, sheets)
    except Exception:  ## If that still fails, try to open it as a CSV file
        df = pd.read_csv(file_path)  # CSV is not multi-sheet
# Let's do some stuff
print(df)
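For readers who have never looked inside one of these files, here is a minimal sketch of the Workbook / Worksheet / Table / Row / Cell / Data layout that the reader walks through. The sample workbook and its values are invented for illustration, and `rows_of` is a hypothetical helper. Unlike the code above, it passes a namespace map to `findall` instead of stripping the namespaces, which is just another way to reach the same elements.

```python
import xml.etree.ElementTree as ET

SS = "urn:schemas-microsoft-com:office:spreadsheet"
NS = {"ss": SS}

# Invented two-row workbook, only to show the element layout.
SAMPLE = """<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <ss:Worksheet ss:Name="Holdings">
    <ss:Table>
      <ss:Row>
        <ss:Cell><ss:Data ss:Type="String">Ticker</ss:Data></ss:Cell>
        <ss:Cell><ss:Data ss:Type="String">Weight</ss:Data></ss:Cell>
      </ss:Row>
      <ss:Row>
        <ss:Cell><ss:Data ss:Type="String">AAA</ss:Data></ss:Cell>
        <ss:Cell><ss:Data ss:Type="Number">1.5</ss:Data></ss:Cell>
      </ss:Row>
    </ss:Table>
  </ss:Worksheet>
</ss:Workbook>
"""


def rows_of(root, sheet_name):
    """Return the cell texts of one worksheet as a list of rows."""
    for sheet in root.findall("ss:Worksheet", NS):
        # Attribute keys carry the namespace in "{uri}name" form.
        if sheet.get("{%s}Name" % SS) == sheet_name:
            return [
                [data.text for data in row.findall("ss:Cell/ss:Data", NS)]
                for row in sheet.findall("ss:Table/ss:Row", NS)
            ]
    return []


root = ET.fromstring(SAMPLE)
rows = rows_of(root, "Holdings")
print(rows)
```

Note that every value comes back as text, so numeric columns still need converting after the DataFrame is built.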
@edited 1
Based on that error, the problem is that you have a CSV file that is named as an XLS file.
Try changing the read method to:
df = pd.read_csv(file_path)
@edited 2
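After receiving the link to the specific file, I improved the answer by developing an MS-XML 2003 spreadsheet reader. I also added some cleanup of the externally generated XML file. As a result, the code is now compatible with several file formats: XLSX, XLS, MS-XML, or CSV. You pass in the sheet names together with the starting row you want for each one, and they are imported into pandas DataFrames.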
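To show what the cleanup step is for, here is a minimal sketch assuming the only damage in the file is unescaped ampersands. The pattern and the name `escape_bare_ampersands` are my own choices for this illustration; other exports may need different fixes.

```python
import re
import xml.etree.ElementTree as ET

# An "&" that does not start a named, decimal or hexadecimal reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(xml_text):
    """Replace stray "&" with "&amp;" and leave existing references alone."""
    return BARE_AMP.sub("&amp;", xml_text)


# Made-up fragment with two stray ampersands and one valid reference.
broken = "<Row><Data>AT&T &amp; Co; R&D</Data></Row>"

try:
    ET.fromstring(broken)
    parsed_raw = True
except ET.ParseError:
    # The stray "&" makes the fragment not well-formed.
    parsed_raw = False

fixed = escape_bare_ampersands(broken)
text = ET.fromstring(fixed).find("Data").text
print(parsed_raw, text)
```

The lookahead only skips an `&` that is immediately followed by a reference name and a `;`, so a semicolon further along the line does not hide a stray ampersand.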