Splitting a long content line into several lines while preserving the heading
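I do a lot of text processing, and I like using unix one-liners: they are easy to keep organized over time (compared with a pile of scripts), I can easily chain them together, and I enjoy (re)learning how to use the classic unix tools. Usually I reach for a short awk, perl, or ruby one-liner, depending on which is the most elegant.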
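Here I have some lines, each containing X comma-separated items. I want to split them up, keeping the head word on every resulting line.
Input: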
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab
Output:
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
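Algorithm details:
- Each input line consists of a heading word, then an equals sign, then a comma-separated list with at least one item.
- In this example most items are single words, but an item can contain spaces (e.g. "horseshoe crab" at the end).
- Split into groups of 9 items, unless fewer than 3 items would be left over, in which case the final line absorbs them and may end up with as many as 12 items.
- There are multiple lines, e.g. the next line might be planets.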
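To pin down the rule in the list above, here is a rough awk sketch of the grouping I have in mind. It is only an illustration, not a finished answer: the function name `split_items` and its two optional arguments (chunk size, default 9, and minimum tail size, default 3) are made up for this example. It assumes a leftover of fewer than `min` items gets folded into the preceding line, so with the defaults the last line tops out at 11 items; passing 4 as the second argument would allow a 12-item last line.

```shell
# Hypothetical helper: split_items [chunk_size] [min_tail]
# Reads "heading = a, b, c, ..." lines on stdin and re-emits them in chunks.
split_items() {
  awk -v size="${1:-9}" -v min="${2:-3}" '
    {
      eq = index($0, "=")
      if (eq == 0) { print; next }            # no "=" found: pass the line through

      head = substr($0, 1, eq - 1)            # heading word, trailing blanks trimmed
      sub(/[ \t]+$/, "", head)
      body = substr($0, eq + 1)               # item list, outer blanks trimmed
      gsub(/^[ \t]+|[ \t]+$/, "", body)

      # Split on commas only, so items such as "horseshoe crab" stay intact.
      n = split(body, item, /[ \t]*,[ \t]*/)

      start = 1
      while (start <= n) {
        stop = start + size - 1
        # If fewer than min items would remain, absorb them into this line.
        if (n - stop < min) stop = n
        line = head " = " item[start]
        for (i = start + 1; i <= stop; i++) line = line ", " item[i]
        print line
        start = stop + 1
      }
    }'
}
```

Used as `split_items < input.txt`, or at the end of a pipeline. Splitting on commas rather than whitespace is what keeps multi-word items together, and counting items rather than characters is the part `fold` cannot do.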
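I came up with an approach that first escapes the spaces, then uses unix fold, then uses awk to fill the heading down the first column. The output looks much like the above: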
echo "animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab" \
| \tr ' ,' '_ ' \
| fold -s \
| perl -pe 's/=/\t/; s/^_/\t_/g;' \
| awk 'BEGIN{FS=OFS="\t"} $1==""{$1=p} {p=$1} 1' \
| tr '\t _' '=, '
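But it only considers character length (not item count), and it does not handle the special case where I don't want fewer than 3 items left hanging on the last line.
This feels like an elegant little puzzle to me. Any ideas?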