When you employ parallel execution for CPU-bound calculations, you can often enjoy significant performance gains. But CoreML with .cpuAndGPU is also constrained by the GPU, which reduces the parallelism you can enjoy. In my experiments, I saw only modest benefits from running GPU-based CoreML calculations in parallel (13% and 18% improvements over serial operation on an iPhone and an M1 iPad, respectively), but a more substantial benefit on a Mac Studio (more than twice as fast).
Profiling with Instruments (press command-i in Xcode, or choose "Product" » "Profile") can be illuminating. See Recording Performance Data.
First, let's compare the computeUnits options, starting with the .cpuOnly scenario. Here it is running 20 CoreML prediction calls sequentially (with a maxConcurrentOperationCount of 1):
And if I switch to the CPU view, I can see it bouncing between the two performance cores on my iPhone 12 Pro Max:
That makes sense. OK, now let's change maxConcurrentOperationCount to 3, and the overall processing time (the processAll function) drops from 5 minutes to 3.5 minutes:
When I switch to the CPU view to see what is going on, it appears to start running in parallel on the two performance cores, but then shifts to some of the efficiency cores (presumably because the device's thermal state became strained, which would explain why we did not achieve anything close to 2× performance):
So, parallel execution can offer significant benefits when performing CPU-only CoreML calculations. That said, CPU-only calculations are much slower than GPU-based ones.
When I switch to .cpuAndGPU, the difference between a maxConcurrentOperationCount of 1 and 3 is much less pronounced: it takes 45 seconds when allowing three concurrent operations versus 50 seconds when running serially. Here it is running three in parallel:

And serially:
But unlike the .cpuOnly scenario, you can see in the CPU track that the CPUs are largely idle. Here is the latter, with the CPU view showing the details:
So, one can see that running these on multiple CPUs does not yield much of a performance gain, because this is not CPU-bound, but rather is clearly constrained by the GPU.
Here is my code for the above. Note that I used OperationQueue because it offers a simple mechanism for controlling the degree of concurrency (maxConcurrentOperationCount):
import CoreML
import os.log
import UIKit

private let poi = OSLog(subsystem: "Test", category: .pointsOfInterest)
and
func processAll() {
    let parallelTaskCount = 20

    let queue = OperationQueue()
    queue.maxConcurrentOperationCount = 3          // or try `1`

    let id = OSSignpostID(log: poi)
    os_signpost(.begin, log: poi, name: #function, signpostID: id)

    for i in 0 ..< parallelTaskCount {
        queue.addOperation {
            let image = UIImage(named: "image.jpg")!
            self.runPrediction(index: i, image: image, shouldAddCounter: true)
        }
    }

    queue.addBarrierBlock {
        os_signpost(.end, log: poi, name: #function, signpostID: id)
    }
}

func runPrediction(index: Int, image: UIImage, shouldAddCounter: Bool = false) {
    let id = OSSignpostID(log: poi)
    os_signpost(.begin, log: poi, name: #function, signpostID: id, "%d", index)
    defer { os_signpost(.end, log: poi, name: #function, signpostID: id, "%d", index) }

    let conf = MLModelConfiguration()
    conf.computeUnits = .cpuAndGPU                 // contrast to `.cpuOnly`
    conf.allowLowPrecisionAccumulationOnGPU = true

    let myModel = try! MyModel(configuration: conf)
    let myModelInput = try! MyModelInput(LR_inputWith: image.cgImage!)

    // Prediction
    let prediction = try! myModel.prediction(input: myModelInput)
    os_signpost(.event, log: poi, name: "finished processing", "%d %@", index, prediction.featureNames)
}
Note, above I have focused on CPU usage. You can also use the "Core ML" template in Instruments. E.g., here are the Points of Interest and the Core ML tracks next to each other on my M1 iPad Pro (with maxConcurrentOperationCount set to 2 to keep it simple):
At first glance, it looks like Core ML is processing these requests in parallel, but if I run it again with a maxConcurrentOperationCount of 1 (i.e., serially), those individual compute tasks are quicker, which suggests that in the parallel scenario there is some GPU-related contention.
Anyway, in short, you can use Instruments to observe what is going on. And one can achieve significant performance improvements through parallel processing for CPU-bound tasks only; anything requiring the GPU or the Neural Engine will be further constrained by that hardware.
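As an aside, if you prefer Swift concurrency over OperationQueue, the same throttling can be sketched with a task group that keeps at most three predictions in flight. This is just a sketch under assumptions: runPrediction is the synchronous function above, and note that blocking a cooperative thread with a long-running prediction is a trade-off to be aware of:

```swift
func processAllWithTaskGroup() async {
    let parallelTaskCount = 20
    let maxConcurrent = 3

    await withTaskGroup(of: Void.self) { group in
        for i in 0 ..< parallelTaskCount {
            // Once `maxConcurrent` tasks are in flight, wait for one to
            // finish before adding the next.
            if i >= maxConcurrent {
                _ = await group.next()
            }
            group.addTask {
                let image = UIImage(named: "image.jpg")!
                self.runPrediction(index: i, image: image)
            }
        }
        // The group implicitly awaits any remaining tasks here.
    }
}
```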