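Despite my best efforts to make a CoreML MLModel process its predictions in parallel, it seems that under the hood Apple forces it to run in a serial, one-by-one manner.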

I made a public repository with a PoC that reproduces the issue: https://github.com/SocialKitLtd/coreml-concurrency-issue.

What I have tried:

  • Recreating the MLModel every time instead of using a global instance
  • Using only the .cpuAndGPU configuration

What I'm trying to achieve:
I'm trying to utilize multithreading to process a bunch of video frames at the same time (assuming the CPU/RAM can take it) faster than the one-by-one strategy.


import UIKit
import CoreML

class ViewController: UIViewController {

    override func viewDidLoad() {
        super.viewDidLoad()

        let parallelTaskCount = 3
        for i in 0 ..< parallelTaskCount {
            // Dispatch each prediction to a concurrent global queue
            DispatchQueue.global(qos: .userInteractive).async {
                let image = UIImage(named: "image.jpg")!
                self.runPrediction(index: i, image: image)
            }
        }
    }

    func runPrediction(index: Int, image: UIImage) {
        let conf = MLModelConfiguration()
        conf.computeUnits = .cpuAndGPU
        conf.allowLowPrecisionAccumulationOnGPU = true
        let myModel = try! MyModel(configuration: conf)
        let myModelInput = try! MyModelInput(LR_inputWith: image.cgImage!)
        // Prediction (result is not used in this PoC)
        _ = try! myModel.prediction(input: myModelInput)
        print("finished processing \(index)")
    }
}

Any help will be highly appreciated.


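When you employ parallel execution on the CPU, you can generally achieve significant performance gains on CPU-bound calculations. But CoreML with .cpuAndGPU is also constrained by the GPU, which diminishes the parallelism you enjoy. In my experiments, I saw only modest performance benefits from running GPU-based CoreML calculations in parallel (13% and 18% improvements over serial execution on an iPhone and an M1 iPad, respectively), but more substantial benefits on a Mac Studio (more than twice as fast).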
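To get a rough feel for this kind of serial-versus-parallel comparison, here is a minimal timing sketch. It is not the code used for the measurements above: `elapsedTime(taskCount:maxConcurrent:work:)` and `busyWork(_:)` are hypothetical names introduced for illustration, and the workload is a stand-in CPU-bound loop rather than a CoreML prediction. The speedup you see will depend on the device and core count.

```swift
import Foundation

/// Runs `taskCount` invocations of `work` on an OperationQueue limited to
/// `maxConcurrent` simultaneous operations and returns the elapsed seconds.
/// (Hypothetical helper, not part of any Apple API.)
func elapsedTime(taskCount: Int,
                 maxConcurrent: Int,
                 work: @escaping @Sendable (Int) -> Void) -> TimeInterval {
    let queue = OperationQueue()
    queue.maxConcurrentOperationCount = maxConcurrent

    let start = ProcessInfo.processInfo.systemUptime
    for i in 0 ..< taskCount {
        queue.addOperation { work(i) }
    }
    // Blocks the calling thread; avoid calling this on the main thread of an app.
    queue.waitUntilAllOperationsAreFinished()
    return ProcessInfo.processInfo.systemUptime - start
}

/// Stand-in CPU-bound workload (an assumption; swap in a prediction call to test a model).
@Sendable func busyWork(_ index: Int) {
    var total = 0.0
    for n in 1 ... 2_000_000 {
        total += Double(n).squareRoot()
    }
    precondition(total > 0)   // uses the result so the loop is not optimized away
}

let serial = elapsedTime(taskCount: 20, maxConcurrent: 1, work: busyWork)
let parallel = elapsedTime(taskCount: 20, maxConcurrent: 3, work: busyWork)
print(String(format: "serial %.3fs, parallel %.3fs, speedup %.2fx",
             serial, parallel, serial / parallel))
```

For a purely CPU-bound workload like this one, the parallel run should approach the number of available cores in speedup. Replacing `busyWork` with a GPU-backed prediction is where you would expect that ratio to shrink.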

Profiling with Instruments (by pressing command-i in Xcode or choosing “Product” » “Profile”) can be illuminating. See Recording Performance Data.

First, let us start with a computeUnits of .cpuOnly for comparison. Here it is running 20 CoreML prediction calls sequentially (with a maxConcurrentOperationCount of 1):

[Screenshot: Instruments trace of the 20 sequential prediction calls]

And, if I switch to the CPU view, I can see that it is jumping between the two performance cores on my iPhone 12 Pro Max:

[Screenshot: Instruments CPU view]


import UIKit
import CoreML
import os.log

private let poi = OSLog(subsystem: "Test", category: .pointsOfInterest)

// Both functions below reference `self`, so they are methods of the view controller.

func processAll() {
    let parallelTaskCount = 20

    let queue = OperationQueue()
    queue.maxConcurrentOperationCount = 3          // or try `1`

    // Signpost interval covering the whole batch
    let id = OSSignpostID(log: poi)
    os_signpost(.begin, log: poi, name: #function, signpostID: id)

    for i in 0 ..< parallelTaskCount {
        queue.addOperation {
            let image = UIImage(named: "image.jpg")!
            self.runPrediction(index: i, image: image, shouldAddContuter: true)
        }
    }

    // Runs only after all previously added operations have finished
    queue.addBarrierBlock {
        os_signpost(.end, log: poi, name: #function, signpostID: id)
    }
}

func runPrediction(index: Int, image: UIImage, shouldAddContuter: Bool = false) {
    // Signpost interval covering a single prediction
    let id = OSSignpostID(log: poi)
    os_signpost(.begin, log: poi, name: #function, signpostID: id, "%d", index)
    defer { os_signpost(.end, log: poi, name: #function, signpostID: id, "%d", index) }

    let conf = MLModelConfiguration()
    conf.computeUnits = .cpuAndGPU                 // contrast to `.cpuOnly`
    conf.allowLowPrecisionAccumulationOnGPU = true
    let myModel = try! MyModel(configuration: conf)
    let myModelInput = try! MyModelInput(LR_inputWith: image.cgImage!)
    // Prediction
    let prediction = try! myModel.prediction(input: myModelInput)
    os_signpost(.event, log: poi, name: "finished processing", "%d %@", index, prediction.featureNames)
}

Note, above I have focused on CPU usage. You can also use the “Core ML” template in Instruments. E.g. here are the Points of Interest and the CoreML tracks next to each other on my M1 iPad Pro (with maxConcurrentOperationCount set to 2 to keep it simple):

[Screenshot: Points of Interest and Core ML tracks in Instruments]


Anyway, in short, you can use Instruments to observe what is going on. Parallel processing can yield significant performance improvements for CPU-bound tasks, but anything requiring the GPU or neural engine will be further constrained by that hardware.




