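I suggest matching words (let a word be a sequence of letters and apostrophes) with the help of a regular expression. Note that plain splitting does not take punctuation into account, so "cat", "cat," and "cat!" would be treated as three different words. Then query both given strings for matches: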
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
...
private static readonly Regex WordsRegex = new Regex(@"[\p{L}']+");
// 1 - in text1, 2 - in text2, 3 - in both text1 and text2
private static List<(string word, int presentAt)> MyWords(string text1, string text2) {
    HashSet<string> words1 = WordsRegex
        .Matches(text1)
        .Cast<Match>()
        .Select(match => match.Value)
        .ToHashSet(StringComparer.OrdinalIgnoreCase);
    HashSet<string> words2 = WordsRegex
        .Matches(text2)
        .Cast<Match>()
        .Select(match => match.Value)
        .ToHashSet(StringComparer.OrdinalIgnoreCase);
    // Union with the same comparer as the sets, so "Cat" and "cat" are not listed twice
    return words1
        .Union(words2, StringComparer.OrdinalIgnoreCase)
        .Select(word => (word, presentAt: (words1.Contains(word) ? 1 : 0) |
                                          (words2.Contains(word) ? 2 : 0)))
        .ToList();
}
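To illustrate the punctuation remark above, here is a minimal sketch that compares splitting on spaces with regex matching. It assumes the same pattern as WordsRegex; the class name SplitVersusRegex, its two helper methods, and the sample text are made up for this illustration and are not part of the original code.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitVersusRegex {
    // Same pattern as WordsRegex: runs of letters and apostrophes
    private static readonly Regex WordsRegex = new Regex(@"[\p{L}']+");

    // Hypothetical helper: naive split on spaces, punctuation stays attached
    public static string[] BySplit(string text) =>
        text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    // Hypothetical helper: regex matching, punctuation is skipped
    public static string[] ByRegex(string text) =>
        WordsRegex.Matches(text).Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main() {
        string text = "cat cat, cat!";
        Console.WriteLine(string.Join("|", BySplit(text))); // cat|cat,|cat!
        Console.WriteLine(string.Join("|", ByRegex(text))); // cat|cat|cat
    }
}
```

With splitting, a HashSet would hold three distinct entries for this text; with regex matching it holds one.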
Demo:
string str1 = "Cat meet's a dog has";
string str2 = "Cat meet's a dog and a bird";
var result = MyWords(str1, str2);
var report = string.Join(Environment.NewLine, result);
Console.Write(report);
Output:
(Cat, 3) # 3: in both str1 and str2
(meet's, 3) # 3: in both str1 and str2
(a, 3) # 3: in both str1 and str2
(dog, 3) # 3: in both str1 and str2
(has, 1) # 1: in str1 only
(and, 2) # 2: in str2 only
(bird, 2) # 2: in str2 only
If you want a more verbose output:
string str1 = "Cat meet's a dog has";
string str2 = "Cat meet's a dog and a bird";
var result = MyWords(str1, str2);
// Indexed by presentAt; index 0 never occurs for words returned by MyWords
string[] options = new string[] {
    "not present",
    "present in first string not present in second string",
    "not present in first string but present in second string",
    "present in first string and present in second string"
};
var report = string.Join(Environment.NewLine, result
    .Select(pair => $"{pair.word} - {options[pair.presentAt]}"));
Console.Write(report);
Output:
Cat - present in first string and present in second string
meet's - present in first string and present in second string
a - present in first string and present in second string
dog - present in first string and present in second string
has - present in first string not present in second string
and - not present in first string but present in second string
bird - not present in first string but present in second string
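
The 1 / 2 / 3 codes are bit flags (1 | 2 = 3), so as one possible variation they could be given names with a [Flags] enum. This is only a sketch: the Presence enum, the PresenceDemo class and the FromCode helper are hypothetical names introduced here and do not appear in the code above.

```csharp
using System;

[Flags]
public enum Presence {
    None = 0,
    First = 1,             // same value as "1 - in text1"
    Second = 2,            // same value as "2 - in text2"
    Both = First | Second  // 3 - in both text1 and text2
}

public static class PresenceDemo {
    // Hypothetical helper: reinterpret the integer code as a named flag value
    public static Presence FromCode(int presentAt) => (Presence)presentAt;

    public static void Main() {
        for (int code = 1; code <= 3; code++) {
            Console.WriteLine($"{code} -> {FromCode(code)}");
        }
    }
}
```

A cast like this would let a report use the enum name in place of a lookup array, at the cost of an extra type.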