I have a spreadsheet in Microsoft Excel 97-2003 XLS format. I tried the following:

import pandas as pd

xlsx_file_path = "C:/temp/a_file.xls"
sheets_dict = pd.read_excel(xlsx_file_path, engine='xlrd', sheet_name=None)

for sheet_name, df_in in sheets_dict.items():
    print(sheet_name)

It gives this error:

  File C:\xxxxxx\site-packages\xlrd\__init__.py:172 in open_workbook
    bk = open_workbook_xls(

  File C:\xxxxxxx\site-packages\xlrd\book.py:79 in open_workbook_xls
    biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)

  File C:\xxxxxxxx\site-packages\xlrd\book.py:1284 in getbof
    bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])

  File C:\xxxxxxxx\site-packages\xlrd\book.py:1278 in bof_error
    raise XLRDError('Unsupported format, or corrupt file: ' + msg)

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf\xef\xbb\xbf<?'

I tried other engines, such as openpyxl, and got the following error:

  File C:\xxxx\lib\zipfile.py:1336 in _RealGetContents
    raise BadZipFile("File is not a zip file")

BadZipFile: File is not a zip file

Is there a way to solve this?

The actual file is: https://www.ishares.com/us/products/239566/ishares-iboxx-investment-grade-corporate-bond-etf/1521942788811.ajax?fileType=xls&fileName=iShares-iBoxx--Investment-Grade-Corporate-Bond-ETF_fund&dataType=fund

Recommended answer

An XLS file is a binary file, not a zipped format, which is why the openpyxl engine fails with a ZIP error. You can leave the engine unset and let pandas select it.
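Before picking an engine, it can help to look at the first bytes of the file, since the extension may be misleading. The following is a minimal sketch, not part of the original answer: the helper names `sniff_spreadsheet_format` and `sniff_file` are hypothetical, and it only distinguishes the OLE2 container used by binary XLS, the ZIP container used by XLSX, and XML text that may be preceded by one or more UTF-8 BOMs (as in the error message above).

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # binary .xls (OLE2 compound file)
ZIP_MAGIC = b"PK\x03\x04"                        # .xlsx is a ZIP archive
UTF8_BOM = b"\xef\xbb\xbf"


def sniff_spreadsheet_format(head: bytes) -> str:
    """Guess the real format from the first bytes of a file (hypothetical helper)."""
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    # Drop any number of leading UTF-8 BOMs before looking for markup
    while head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    if head.lstrip().startswith(b"<"):
        return "xml"
    return "unknown"


def sniff_file(path) -> str:
    """Read a small prefix of the file and classify it."""
    with open(path, "rb") as fh:
        return sniff_spreadsheet_format(fh.read(64))
```

For the bytes reported by xlrd, `sniff_spreadsheet_format(b'\xef\xbb\xbf\xef\xbb\xbf<?')` returns `"xml"`, which suggests the file is XML text rather than a binary workbook.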

As for the Microsoft Excel 97-2003 XML Spreadsheet issue, I have developed a reader based on this documentation: https://en.wikipedia.org/wiki/Microsoft_Office_XML_formats

import pandas as pd
import xml.etree.ElementTree as ET
from io import StringIO ## for Python 3
import re

### Clean up the generated XML before parsing.
### These cleanups are based on what files from https://www.ishares.com/ need;
### customize them for your own files.
def cleanup(xml):
    # Remove the (possibly repeated) BOM characters
    xml = xml.replace('\ufeff', '')
    # Match "&" that does not start an entity reference such as "&amp;" or "&#38;"
    pattern = r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)'

    # Escape the bare "&"
    xml = re.sub(pattern, '&amp;', xml)

    # TODO: add any other cleanup you need
    return xml

### Remove the namespaces to improve code readability
def remove_namespace(tree):
    for elem in tree.iter():
        if '}' in elem.tag:
            elem.tag = elem.tag.split('}', 1)[1]
        for key in list(elem.attrib.keys()):
            if '}' in key:
                new_key = key.split('}', 1)[1]
                elem.attrib[new_key] = elem.attrib.pop(key)

### Build one DataFrame per worksheet from the XML, given {sheet name: first row to read}
def extract_data_from_xml(xml_file, sheets):
    df = {}
    tree = ET.parse(xml_file)
    remove_namespace(tree)
    root = tree.getroot()
    for sheet_name, start_row in sheets.items():
        data = extract_data_from_root(root, sheet_name, start_row)
        if(len(data)>0):
            headers = data[0]
            df[sheet_name] = pd.DataFrame(data[1:], columns=headers)
    return df

### Extract the rows from the parsed XML root rather than from the text, to improve performance
def extract_data_from_root(root, sheet_name, start_row):
    data = []
    found_table = False
    acc = 0
    for elem in root.iter():
        if found_table and elem.tag == 'Row':
            row_data = [cell.text for cell in elem.findall('Cell/Data')]
            if(acc>= start_row):
                data.append(row_data)
            else:
                acc +=1
        elif elem.tag == 'Table' or elem.tag == 'Worksheet':
            if elem.attrib.get('Name') == sheet_name:
                found_table = True
            else:
                if found_table and len(data)>0:
                    break
                else:
                    continue
    return data

### The core function
def read_XML_MS_spreadsheet(filename, sheets):
    with open(filename, mode="r", encoding="utf-8-sig") as fin:
        xml = fin.read()
        xml = cleanup(xml)
    df = extract_data_from_xml(StringIO(xml), sheets)
    return df

def read_XLS_and_XLSX_spreadsheet(filename, sheet_start_rows):
    dfs = {}
    for sheet_name, start_row in sheet_start_rows.items():
        df = pd.read_excel(filename, sheet_name=sheet_name, header=start_row)
        dfs[sheet_name] = df
    return dfs

### main
sheets = {
    'Holdings' : 7,
    'Historical' : 0,
    'Performance' : 4,
    'Distributions' : 0
}
# Originally posted on https://stackoverflow.com/questions/77958287/having-difficulties-to-open-an-old-format-xls-file-with-python-pandas/77958383#77958383
# https://www.ishares.com/us/products/239566/ishares-iboxx-investment-grade-corporate-bond-etf/1521942788811.ajax?fileType=xls&fileName=iShares-iBoxx--Investment-Grade-Corporate-Bond-ETF_fund&dataType=fund
file_path = 'iShares-iBoxx--Investment-Grade-Corporate-Bond-ETF_fund.xls'

## First try to read the file as XLSX or XLS; pandas will choose the right engine
try:
    df = read_XLS_and_XLSX_spreadsheet(file_path, sheets)
except Exception:
    try: ## If this isn't an XLS or XLSX file, try to load it as an MS-XML 2003 Spreadsheet; see https://en.wikipedia.org/wiki/Microsoft_Office_XML_formats
        df = read_XML_MS_spreadsheet(file_path, sheets)
    except Exception: ## If that still fails, try to open it as a CSV file
        df = pd.read_csv(file_path) # CSV is not multi-sheet

# Do something with the result
print(df)
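
The cleanup step above escapes bare "&" characters so that the XML parser does not reject the file. Here is a small self-contained check of that idea on an in-memory string. It is a sketch under assumptions: the sample text is made up, the name `escape_bare_ampersands` is hypothetical, and the pattern treats only well-formed entity references (named, decimal, or hexadecimal) as already escaped.

```python
import re
import xml.etree.ElementTree as ET

# "&" that is NOT the start of an entity reference like &amp; &#38; or &#x26;
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(xml_text: str) -> str:
    """Strip BOM characters and escape ampersands that are not entity references."""
    return BARE_AMP.sub("&amp;", xml_text.replace("\ufeff", ""))


# Made-up sample: two BOMs, one bare "&", one already-escaped "&amp;"
sample = "\ufeff\ufeff<Row><Data>AT&T INC</Data><Data>A &amp; B; C</Data></Row>"
cleaned = escape_bare_ampersands(sample)

# Parsing succeeds only because the bare "&" was escaped
root = ET.fromstring(cleaned)
values = [d.text for d in root.iter("Data")]
print(values)
```

Note that the bare "&" in the first cell is followed later on the same line by a ";", so a lookahead that accepts any text up to the next ";" would leave it unescaped and the parse would fail.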

@edited 1: Based on the error, you have a CSV file that is named as an XLS file, which causes the problem. Try changing the read method to:

df = pd.read_csv(file_path)

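@edited 2: After receiving the link to the specific file, I improved the answer by developing an MS-XML 2003 Spreadsheet reader. I also added some cleanup for the externally generated XML file. As a result, the code is now compatible with several file formats: XLSX, XLS, MS-XML, or CSV. You pass in the spreadsheet along with the starting row wanted for each sheet, and the sheets are imported into pandas DataFrames.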
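
To see how the worksheet name and starting row map onto the XML, here is a minimal sketch that reads a tiny, made-up SpreadsheetML document from a string. It is an alternative illustration rather than the reader above: instead of stripping namespaces, it passes a namespace map to `findall`, and the function name `sheet_rows` and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

SS = "urn:schemas-microsoft-com:office:spreadsheet"
NS = {"ss": SS}

# Made-up document: one note row, then a header row, then one data row
SAMPLE = """<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="Holdings">
    <Table>
      <Row><Cell><Data ss:Type="String">Fund note</Data></Cell></Row>
      <Row>
        <Cell><Data ss:Type="String">Ticker</Data></Cell>
        <Cell><Data ss:Type="String">Weight</Data></Cell>
      </Row>
      <Row>
        <Cell><Data ss:Type="String">ABC</Data></Cell>
        <Cell><Data ss:Type="Number">1.5</Data></Cell>
      </Row>
    </Table>
  </Worksheet>
</Workbook>"""


def sheet_rows(root, sheet_name, start_row=0):
    """Return the cell texts of one worksheet, skipping the first start_row rows."""
    for ws in root.findall("ss:Worksheet", NS):
        # Attributes are namespaced too, so the key is "{namespace}Name"
        if ws.get(f"{{{SS}}}Name") == sheet_name:
            rows = ws.findall("ss:Table/ss:Row", NS)
            return [
                [data.text for data in row.findall("ss:Cell/ss:Data", NS)]
                for row in rows[start_row:]
            ]
    return []


root = ET.fromstring(SAMPLE)
rows = sheet_rows(root, "Holdings", start_row=1)
header, body = rows[0], rows[1:]
print(header, body)
```

The `header` and `body` lists have the same shape as the `data[0]` and `data[1:]` values that the answer's code passes to `pd.DataFrame`. All cell values come back as strings, so numeric columns would still need converting afterwards.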
