I have a large file and am looking for a way to sort or cluster the data according to two numeric columns that define ranges, but I could not find a suitable function for this. Could someone who knows please help me?
Thanks in advance.

My file looks like this sample but is much larger. As you can see in the example, the first and second rows are consecutive (the second range starts right after the first one ends, with no gap in between), and so are the third and fourth rows. The fifth and sixth rows, however, are far from each other. I therefore want to treat the first and second rows as one cluster, the third and fourth rows as another cluster, and the fifth and sixth rows as two separate clusters, so that I end up with 4 rows instead of 6.
Example file:

library(data.table)  # provides setDT()
df <- setDT(data.frame(name = c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1"),
  start = c(8480001, 8480251, 10006251, 10006501, 13910501, 14841751),
  end = c(8480250, 8480500, 10006500, 10006750, 13910750, 14842000),
  length = c(250, 250, 250, 250, 250, 250)))

Expected output:

output <- setDT(data.frame(name = c("chr1", "chr1", "chr1", "chr1"),
  start = c(8480001, 10006251, 13910501, 14841751), 
  end = c(8480250, 10006500, 13910750, 14842000), 
  length = c(250, 250, 250, 250))) 

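In the output, I only want the first row of each cluster; for example, for rows 1 and 2, only row 1.

Thanks again.

Recommended answer

We can create a grouping variable based on the difference between 'start' and the lag of 'end', and then select the first row of each group: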

library(data.table)
df[df[, .I[1], cumsum(start - shift(end, fill = first(end)) > 1)]$V1]

-output

     name    start      end length
   <char>    <num>    <num>  <num>
1:   chr1  8480001  8480250    250
2:   chr1 10006251 10006500    250
3:   chr1 13910501 13910750    250
4:   chr1 14841751 14842000    250
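
The one-liner above can be unpacked into separate steps, which makes the grouping logic easier to inspect. The following is only a sketch under two assumptions that the original answer does not state: the rows are sorted by `name` and `start`, and the lag is computed within each `name` so that ranges on different chromosomes are never joined. The helper columns `gap` and `cluster` and the object `first_rows` are illustrative names, not part of the original answer.

```r
library(data.table)

# Sample data from the question
df <- data.table(
  name   = rep("chr1", 6),
  start  = c(8480001, 8480251, 10006251, 10006501, 13910501, 14841751),
  end    = c(8480250, 8480500, 10006500, 10006750, 13910750, 14842000),
  length = rep(250, 6)
)

# Assumption: the lag comparison only makes sense on sorted ranges
setorder(df, name, start)

# Step 1: distance from each start to the previous row's end, per name.
# The first row of each name has no predecessor, so its own end is used
# as the fill value, which keeps that row from being flagged as a gap.
df[, gap := start - shift(end, fill = end[1L]), by = name]

# Step 2: a gap larger than 1 opens a new cluster; cumsum() turns the
# TRUE/FALSE flags into a running cluster id
df[, cluster := cumsum(gap > 1), by = name]

# Step 3: keep the first row of every cluster and drop the helper columns
first_rows <- df[df[, .I[1L], by = .(name, cluster)]$V1,
                 .(name, start, end, length)]
print(first_rows)
```

For the sample data, `cluster` takes the values 0, 0, 1, 1, 2, 3, so rows 1, 3, 5 and 6 are kept. If the file holds a single, already sorted chromosome, the `by = name` grouping and the `setorder()` call change nothing and the result should match the one-liner.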
