JavaScript: how to split a string into words and track each word's index (in the original string)

Posted on January 20

I have a string:

const str = 'a string, a long string'

I want to split it into words (no problem there) and then track the index of each word in the original string.

Actual result:

[
  { word: 'a',      idx: 0 },
  { word: 'string', idx: 2 },
  { word: 'a',      idx: 0 },
  { word: 'long',   idx: 12 },
  { word: 'string', idx: 2 }
]

Expected result:

[
  { word: 'a',      idx: 0 },
  { word: 'string', idx: 2 },
  { word: 'a',      idx: 10 },
  { word: 'long',   idx: 12 },
  { word: 'string', idx: 17 }
]

Code so far:

const str = 'a string, a long string'

const segmenter = new Intl.Segmenter([], { granularity: 'word' })

const getWords = str => {
  const segments = segmenter.segment(str)
  return [...segments]
    .filter(s => s.isWordLike)
    .map(s => s.segment)
}

const words = getWords(str)

const result = words.map(word => ({
  word,
  idx: str.indexOf(word)
}))

console.log(result)

Recommended answer

The objects you are iterating over, which contain segment and isWordLike, also have the index:

const str = 'a string, a long string'

const segmenter = new Intl.Segmenter([], { granularity: 'word' })

const getWordsWithIndexes = str => {
  const segments = segmenter.segment(str)
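  return [...segments]
    // keep only word-like segments (drops whitespace and punctuation)
    .filter(s => s.isWordLike)
    // each segment already carries its start index in the original string
    .map(s => ({ idx: s.index, word: s.segment }))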
}

const result = getWordsWithIndexes(str)

console.log(result)
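
For comparison, the attempt in the question repeats indexes because str.indexOf(word) always searches from the start of the string, so it returns the first occurrence every time. If all you have is the list of words, one possible workaround is to pass a start position as the second argument of indexOf. This is only a minimal sketch: withIndexes and cursor are hypothetical names used for illustration, and it assumes the words appear in the list in the same order as in the string.

```javascript
const str = 'a string, a long string'
const words = ['a', 'string', 'a', 'long', 'string']

// Hypothetical helper: resume each search where the previous word ended
const withIndexes = (str, words) => {
  let cursor = 0
  return words.map(word => {
    const idx = str.indexOf(word, cursor)
    // Only advance when the word was found, so a miss (-1) does not reset the search
    if (idx !== -1) cursor = idx + word.length
    return { word, idx }
  })
}

const result = withIndexes(str, words)
console.log(result)
```

Reading s.index from the segmenter is still the simpler option, since it avoids searching the string a second time.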

Here is the type definition:

interface SegmentData {
    /** A string containing the segment extracted from the original input string. */
    segment: string;
    /** The code unit index in the original input string at which the segment begins. */
    index: number;
    /** The complete input string that was segmented. */
    input: string;
    /**
     * A boolean value only if granularity is "word"; otherwise, undefined.
     * If granularity is "word", then isWordLike is true when the segment is word-like (i.e., consists of letters/numbers/ideographs/etc.); otherwise, false.
     */
    isWordLike?: boolean;
}
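
Since index is described as a code unit index, it should line up with str.slice and str.indexOf on the original string, even when the input contains characters that take two UTF-16 code units. The sketch below checks this on a sample string chosen purely for illustration; the input and the allMatch name are assumptions and not part of the original answer.

```javascript
const segmenter = new Intl.Segmenter([], { granularity: 'word' })

// Sample input chosen for illustration: the emoji occupies two UTF-16 code units
const str = '😀 hi there'

const result = [...segmenter.segment(str)]
  .filter(s => s.isWordLike)
  .map(s => ({ idx: s.index, word: s.segment }))

// Slicing the original string at each idx should give back the same word
const allMatch = result.every(
  ({ idx, word }) => str.slice(idx, idx + word.length) === word
)

console.log(result, allMatch)
```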