Go: multiple Arrow CSV readers on the same file return null

Posted on May 2

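I am trying to use multiple goroutines to read the same file. Each goroutine is assigned a byte offset to start reading from and a number of lines to read (linesLimit).

When the file fits in memory, I managed to do this by setting the chunk size option (csv.WithChunk) to the chunkSize variable. However, when the file is larger than memory, I need to reduce the chunk size. I am attempting something like this: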

package main

import (
    "io"
    "log"
    "os"
    "sync"

    "github.com/apache/arrow/go/v11/arrow"
    "github.com/apache/arrow/go/v11/arrow/csv"
)

// A reader to read lines from the file starting from the byteOffset. The number
// of lines is specified by linesLimit.
func produce(
    id int,
    ch chan<- arrow.Record,
    byteOffset int64,
    linesLimit int64,
    filename string,
    wg *sync.WaitGroup,
) {
    defer wg.Done()

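    fd, err := os.Open(filename)
    if err != nil {
        log.Panicf("reader %d: cannot open %s: %v", id, filename, err)
    }
    defer fd.Close()
    if _, err := fd.Seek(byteOffset, io.SeekStart); err != nil {
        log.Panicf("reader %d: cannot seek to %d: %v", id, byteOffset, err)
    }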

    var remainder int64 = linesLimit % 10
    limit := linesLimit - remainder
    chunkSize := limit / 10

    reader := csv.NewInferringReader(fd,
        csv.WithChunk(int(chunkSize)),
        csv.WithNullReader(true, ""),
        csv.WithComma(','),
        csv.WithHeader(true),
        csv.WithColumnTypes(map[string]arrow.DataType{
            "Start_Time":        arrow.FixedWidthTypes.Timestamp_ns,
            "End_Time":          arrow.FixedWidthTypes.Timestamp_ns,
            "Weather_Timestamp": arrow.FixedWidthTypes.Timestamp_ns,
        }))
    reader.Retain()
    defer reader.Release()

    var count int64
    for reader.Next() {
        rec := reader.Record()
        rec.Retain() // released at the other end of the channel
        ch <- rec
        count += rec.NumRows()
        if count == limit {
            if remainder != 0 {
                flush(id, ch, fd, remainder)
            }
            break
        } else if count > limit {
            log.Panicf("Reader %d read more than it should, expected=%d, read=%d", id, linesLimit, count)
        }
    }

    if reader.Err() != nil {
        log.Panicf("error: %s in line %d,%d", reader.Err().Error(), count, id)
    }
}

func flush(id int,
    ch chan<- arrow.Record,
    fd *os.File,
    limit int64,
) {
    reader := csv.NewInferringReader(fd,
        csv.WithChunk(int(limit)),
        csv.WithNullReader(true, ""),
        csv.WithComma(','),
        csv.WithHeader(false),
    )

    reader.Retain()
    defer reader.Release()

    record := reader.Record()
    record.Retain() // nil pointer dereference error here
    ch <- record
}

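I have tried several versions of the code above, including:

Duplicating the file descriptor.
Copying the file descriptor's offset, opening the same file again, and seeking to that offset.
Closing the first reader before calling flush, or closing the first fd.

No matter how I change the code, the error seems to be the same. Note that any call on the flush reader raises the error, including reader.Next and reader.Err().

Am I using the CSV reader incorrectly? Is this a problem with reusing the same file?

Edit: I don't know if this helps, but opening a new fd in flush without any Seek avoids the error (somehow, any Seek causes the original error to appear). However, without the Seek the code is incorrect (that is, removing the Seek leaves part of the file unread by any goroutine).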

package main

import (
    "bytes"
    "fmt"
    "io"
    "os"

    "github.com/apache/arrow/go/v11/arrow"
    "github.com/apache/arrow/go/v11/arrow/csv"
)

func main() {
    // Create a two-column csv file with this content (the second column has 1024 bytes):
    // 0,000000....
    // 1,111111....
    // 2,222222....
    // 3,333333....
    temp := createTempFile()

    schema := arrow.NewSchema(
        []arrow.Field{
            {Name: "i64", Type: arrow.PrimitiveTypes.Int64},
            {Name: "str", Type: arrow.BinaryTypes.String},
        },
        nil,
    )
    r := csv.NewReader(
        temp, schema,
        csv.WithComma(','),
        csv.WithChunk(3),
    )
    defer r.Release()

    r.Next()

    // To check what's left after the first chunk is read.
    // If the reader stop at the end of the chunk, the content left will be:
    // 3,333333....
    // But in fact, the content left is:
    // 33333333333
    buf, err := io.ReadAll(temp)
    if err != nil {
        panic(err)
    }
    fmt.Printf("%s\n", buf)
}

func createTempFile() *os.File {
    temp, err := os.CreateTemp("", "test*.csv")
    if err != nil {
        panic(err)
    }
    for i := 0; i < 4; i++ {
        fmt.Fprintf(temp, "%d,", i)
        if _, err := temp.Write(bytes.Repeat([]byte{byte('0' + i)}, 1024)); err != nil {
            panic(err)
        }
        if _, err := temp.Write([]byte("\n")); err != nil {
            panic(err)
        }
    }
    if _, err := temp.Seek(0, io.SeekStart); err != nil {
        panic(err)
    }
    return temp
}

