这是一个棘手的问题.我的磁盘中存储了大约1,000 * 999 / 20张图像,我想通过成对比较找到彼此相似的图像.所以我必须做大约1,000 * 999 / 2=499500的比较(相似的属性是不可传递的).我的问题不在于如何比较图像,而在于如何在比较过程中有效地管理机器的内存.我已经实现了比较功能:

static bool AreSimilar(ImageInfo x, ImageInfo y)
{
    // Logic
}

...其中ImageInfo是一个类,用于保存一个图像的信息:

class ImageInfo : IDisposable
{
    public string Path { get; init; }
    public System.Drawing.Image Image { get; init; }
    public void Dispose() => Image.Dispose();
}

理想情况下,我希望在内存中加载所有AreSimilar0个图像,然后进行嵌套循环,并为每对图像调用AreSimilar方法,但一次加载所有图像所需的内存远远超过了我机器的可用内存.图像文件非常大,大小差异很大(大多数大小在5到50MB之间).可用的内存是2GB,所以我不能同时加载超过80个图像.从磁盘加载映像非常慢.实际上,从磁盘加载两个图像要比比较它们慢得多

我的问题是如何实现一种方法,在利用所有可用内存的同时,在不超过内存限制的情况下,负责从磁盘加载/卸载图像,并成对生成它们.以下是我试图实现的方法的签名:

static IEnumerable<(TItem, TItem)> GetPairs<TSource, TItem>(
    IReadOnlyList<TSource> source,
    Func<TSource, long> sizeSelector,
    Func<TSource, TItem> itemLoader,
    long maxConcurrentSize) where TItem : IDisposable;

TSource将是文件的路径,TItem将是ImageInfo.我打算这样使用它:

string[] paths = Directory.GetFiles(@"C:\Images", "*.jpg");
var pairs = GetPairs(paths,
    path => new FileInfo(path).Length,
    path => new ImageInfo() { Path = path, Image = Image.FromFile(path) },
    2_000_000_000);
foreach (var (x, y) in pairs)
{
    if (AreSimilar(x, y))
        Console.WriteLine($"{x.Path} and {y.Path} are similar!");
}

我目前不知道如何实现这种方法.这看起来是一项严肃的任务.我现在只有下面的简单版本,它成对加载图像并忽略sizeSelectormaxConcurrentSize参数:

static IEnumerable<(TItem, TItem)> GetPairs<TSource, TItem>(
    IReadOnlyList<TSource> source,
    Func<TSource, long> sizeSelector,
    Func<TSource, TItem> itemLoader,
    long maxConcurrentSize) where TItem : IDisposable
{
    for (int i = 0; i < source.Count; i++)
    {
        using var first = itemLoader(source[i]);
        for (int j = i + 1; j < source.Count; j++)
        {
            using var second = itemLoader(source[j]);
            yield return (first, second);
        }
    }
}

显然,性能很糟糕,因 for each 图像平均加载了约500次.

推荐答案

下面是一个解决您的问题的解决方案,包括一个单元测试.不幸的是,在一次只能加载少量图像的情况下,它的性能很差,最坏的结果是您建议的解决方案加载数量的2倍.但是,当一次可以加载大量图像时,此算法的性能开始优于您的算法,随着允许的内存大小的增加,每个图像的加载限制为1次.

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace UnitTest;


[TestClass]
public class TestComparison
{
    [TestMethod]
    public void Test()
    {
        const int imageCount = 10;
        var (totalLoads, max, pairs) = RunTest(1, 5, 1000, imageCount);
        Assert.AreEqual(10, totalLoads);
        Assert.AreEqual(1, max);

        (_, max, var pairs2) = RunTest(5, 5, 12, imageCount);
        Assert.AreEqual(9, max);

        var expectedPairs = (imageCount - 1) * imageCount / 2;
        Assert.AreEqual(expectedPairs, pairs.Count);
        Assert.AreEqual(expectedPairs, pairs2.Count);
    }

    private (int totalLoads, int maxLoads, List<(ImageInfo, ImageInfo)> pairs) RunTest(int minImageSize, int maxImageSize, long memorySize, int imageCount)
    {
        var images = GenerateImages(imageCount, minImageSize, maxImageSize);
        var paths = images.Keys.ToList();
        var imageLoadCounts = new Dictionary<string, int>();
        var pairs = GetPairs(paths,p=>images[p].Size,LoadImage,memorySize).ToList();

        var totalLoads = imageLoadCounts.Values.Sum();
        var maxLoad = imageLoadCounts.Values.Max();
        return new(totalLoads, maxLoad,pairs);

        ImageInfo LoadImage(string path)
        {
            if (!imageLoadCounts.TryGetValue(path, out int count))
            {
                count = 0;
            }

            count++;
            imageLoadCounts[path] = count;

            return images[path];
        }
    }

    private Dictionary<string, ImageInfo> GenerateImages(int imageCount, int minImageSize, int maxImageSize)
    {
        var images = new Dictionary<string,ImageInfo>();
        for (int i = 0; i < imageCount; i++)
        {
            images[RandomString()] = new() { Value = _random.NextSingle(), Size = _random.Next(minImageSize, maxImageSize) };
        }

        return images;
    }

    class ImageInfo:IDisposable
    {
        public int Size { get; set; }
        public float Value { get; set; }

        public void Dispose()
        {
        }
    }

    private static readonly Random _random = new();


    static string RandomString()
    {
        const int length = 10;
        var str = string.Empty;
        for (int i = 0; i < length; i++)
        {
            str += (char)_random.Next(65, 90);
        }

        return str;
    }



    static bool AreSimilar(ImageInfo x, ImageInfo y) => Math.Abs(y.Value-x.Value)<.1f;
    record Comparison<T>(T Path1, T Path2)
    {
        public bool Contains(T path) => Path1.Equals(path) || Path2.Equals(path);


        public T Other(T path)
        {
            if (Path1.Equals(path)) return Path2;
            if (Path2.Equals(path)) return Path1;
            throw new Exception("invalid path");
        }

        public bool IsEquivalentTo(Comparison<T> other) => (other.Path1.Equals(Path1) && other.Path2.Equals(Path2)) ||
                                                           (other.Path2.Equals(Path1) && other.Path1.Equals(Path2));
    }
    static IEnumerable<(TItem, TItem)> GetPairs<TSource, TItem>(
        IReadOnlyList<TSource> source,
        Func<TSource, long> sizeSelector,
        Func<TSource, TItem> itemLoader,
        long maxConcurrentSize) where TItem : IDisposable
    {

        var itemCount = source.Count;
        var comparisons = new List<Comparison<TSource>>(itemCount * itemCount / 2);
        for (int i = 0; i < itemCount - 1; i++)
        {
            for (int j = i + 1; j < itemCount; j++)
            {
                comparisons.Add(new(source[i], source[j]));
            }
        }

        return GetPairs(comparisons,sizeSelector,itemLoader,maxConcurrentSize);
    }

    static IEnumerable<(TItem, TItem)> GetPairs<TSource,TItem>(List<Comparison<TSource>> remainingComparisons,
        Func<TSource, long> sizeSelector,
        Func<TSource, TItem> itemLoader,
        long maxConcurrentSize) where TItem:IDisposable
    {
        if(!remainingComparisons.Any()) yield break;
        var images = LoadImages(remainingComparisons,sizeSelector,itemLoader,maxConcurrentSize);//load as many images as possible from the remaining comparisons
        var imageCount = images.Count;
        for (int i = 0; i < imageCount - 1; i++)
        {
            var (path1, image1) = images[i];
            for (int j = i + 1; j < imageCount; j++)
            {
                var (path2, image2) = images[j];
                yield return new(image1, image2);
                var comparison = new Comparison<TSource>(path1, path2);
                remainingComparisons.RemoveAll(c => c.IsEquivalentTo(comparison));
            }
        }

        //dispose
        foreach (var image in images.Select(i => i.Image))
        {
            image.Dispose();
        }

        images = null;//allow GC
        foreach (var pair in GetPairs(remainingComparisons,sizeSelector,itemLoader,maxConcurrentSize))
        {
            yield return pair;
        }
    }

    /// <summary>
    /// Loads as many images into memory as possible from the remaining comparison list
    /// </summary>
    static List<(TSource Path, TITem Image)> LoadImages<TSource,TITem>(List<Comparison<TSource>> remainingComparisons, Func<TSource, long> sizeSelector,
        Func<TSource, TITem> itemLoader,
        long maxConcurrentSize)
    {
        var availableMemory = maxConcurrentSize;
        remainingComparisons = remainingComparisons.ToList();//copy the list so we can alter it in local scope
        var loadedImages = new List<(TSource Path, TITem Image)>();
        if (!TryGetSeed(out var seed)) throw new Exception("One of the images is too large to load into memory");
        while (remainingComparisons.Any())
        {
            if (!remainingComparisons.TryGetFirst(c => c.Contains(seed),out var comparison ))
            {
                if (!TryGetSeed(out seed)) break;
                continue;
            }

            var other = comparison.Other(seed);
            remainingComparisons.Remove(comparison);
            if (!TryLoad(other)) break;

        }

        return loadedImages;

        bool TryLoad(TSource path)
        {
            var fileSize = sizeSelector(path);
            if (fileSize > availableMemory) return false;
            loadedImages.Add(new(path, itemLoader(path)));
            availableMemory -= fileSize;
            return true;
        }

        bool TryGetSeed(out TSource seed)
        {
            //first, remove any comparisons that are relevant to the current loaded list
            var loadedImageCount = loadedImages.Count;
            for (int i = 0; i < loadedImageCount-1; i++)
            {
                for (int j = 1; j < loadedImageCount; j++)
                {
                    var localComparison = new Comparison<TSource>(loadedImages[i].Path, loadedImages[j].Path);
                    remainingComparisons.RemoveAll(c => c.IsEquivalentTo(localComparison));
                }
            }

            if (!remainingComparisons.TryGetFirst(out var firstRemaining))
            {
                seed = default;
                return false;
            }

            seed = firstRemaining.Path1;
            return TryLoad(seed);
        }

  


    }
}
public static class Extensions
{
    public static bool TryGetFirst<T>(this IEnumerable<T> items, out T value) =>
        items.TryGetFirst(t => true, out value);
    public static bool TryGetFirst<T>(this IEnumerable<T> items, Predicate<T> condition, out T value)
    {
        foreach (var item in items)
        {
            if (condition(item))
            {
                value = item;
                return true;
            }
        }
        value = default;
        return false;
    }
}

为了回答你的问题,时间复杂性被忽略了.TryGetSeed的当前实现使得时间复杂度为O(N^3),但这很容易得到改进.我怀疑同样的算法可以在O(N^2)时间内完成,这是这个问题的最佳时间复杂度.

Csharp相关问答推荐

使用LayoutKind在C#中嵌套 struct .显式

MS Graph v5.42.0:在寻呼消息时更改页面大小

Rx.Net窗口内部可观测数据提前完成

每个http请求需要60秒,为什么?

C#带主体的主构造函数?

如何将字符串变量传递给JObject C#-无法加载文件或程序集';System.Text.Json

未在数据流块之间传播完成

Regex字母数字校验

链接到字典字符串.拆分为(.Key,.Value)

Blazor Server.NET 8中的Blazore.FluentValidation问题

正在从最小API-InvocationConext.Arguments中检索参数的FromBodyAttribute

在C#ASP.NET内核中使用INT AS-1进行控制器场景的单元测试

在';、';附近有错误的语法.必须声明标量变量";@Checkin";.';

两个DateTimeOffset之间的差异返回意外的负值

如何在单击按钮后多次异步更新标签

根据优先级整理合同列表

引用类型中的值类型属性是否包含装箱的值?

为什么在Web服务中使用异步(&Q)

EF Core可以';t计算表达式值.net 6

如何强制一个过程在不可见的情况下开始?