背景

I was importing and cleaning a dataset (head included below) using read_csv() (from ), and noticed various columns which probably should be integer columns were left as numeric columns by readr.

问题

What is an elegant way of converting the numeric columns to integer columns if it is possible (i.e. if there would be no rounding/loss in accuracy), and if not, leaving them as numeric?

对于示例数据集,这将意味着转换列"Execution"、"TDCJ Number"、"Age at Execution"和"Weight".

它不需要是一个整齐的解决方案,底座R或任何包装(S)都可以.

已完成的研究:

我试着搜索Google,堆栈溢出,没有任何运气.老实说,我很惊讶这个问题以前没有被问过!read.csv()转换为整数,但不会像read_csv()那样自动转换日期.

生效日期:

# the entire dataset can be downloaded at https://archive.org/download/tx_deathrow_full/tx_deathrow_full.csv if you wish. The head is below

library(tidyverse)

tx_deathrow <- structure(list(Execution = c(553, 552, 551, 550, 549, 548), `Date of Birth` = structure(c(5014, 
-6701, 4110, 6302, 3737, -5266), class = "Date"), `Date of Offence` = structure(c(12743, 
3433, 12389, 13975, 13039, 11444), class = "Date"), `Highest Education Level` = c(9, 
12, 10, 11, 12, 12), `Last Name` = c("Young", "Bible", "Castillo", 
"Davila", "Rodriguez III", "Battaglia"), `First Name` = c("Christopher Anthony", 
"Danny Paul", "Juan Edward", "Erick Daniel", "Rosendo", "John David"
), `TDCJ
Number` = c(999508, 999455, 999502, 999545, 999534, 
999412), `Age at Execution` = c(34, 66, 37, 31, 38, 62), `Date Received` = structure(c(13238, 
12250, 13053, 14302, 14013, 11808), class = "Date"), `Execution Date` = structure(c(17729, 
17709, 17667, 17646, 17617, 17563), class = "Date"), Race = c("Black", 
"White", "Hispanic", "Black", "Hispanic", "White"), County = c("Bexar", 
"Harris", "Bexar", "Tarrant", "Lubbock", "Dallas"), `Eye Color` = c("Brown", 
"Blue", "Brown", "Brown", "Brown", "Green"), Weight = c(216, 
194, 180, 161, 198, 188), Height = c("6' 1\"", "5' 7\"", "5' 11\"", 
"5' 11\"", "5' 8\"", "6' 0\""), `Native County` = c("Bexar", 
"Brazoria", "Bexar", "Tarrant", "Wichita", "Dallas"), `Native State` = c("Texas", 
"Texas", "Texas", "Texas", "Texas", "Texas"), `Last Statement` = c("l want to make sure the Patel family knows I love them like they love me. Make sure the kids in the world know I'm being executed and those kids I've been mentoring keep this fight going. I'm good Warden.", 
NA, "To everyone that has been there for me you know who you are. Love y'all. See y'all on the other side.That's it.", 
"Yes, I would like to say nephew it burns huh. You know I might have lost the fight but I'm still a soldier. I still love you all. To my supporters and family y'all hold it down. Ten Toes down right. That's all.", 
"First I would like to say I have been here since September 2005.  I had the honor and privilege to know many prison guards and staff.  I want to thank all of them.  I would like for everyone to write the people on death row as they are all good men and I am very happy I got to know them.  All of their lives are worth knowing about.\n\nSecondly on February 14th the medical examiner and the chief nurse were engaged in numerous false illegal acts.  They tried to cover up that thousands were wrongfully convicted by Matt Powell, district attorney.  This needs to be brought to justice.\n\nI call upon the FBI to investigate Matt Powell and the Lubbock County Medical Examiner.  Lastly, I was born and raised Catholic and it was not lost upon me that this is Holy Week and last Sunday was Palm Sunday.  Yesterday was my birthday.  Today is the day I join my God and father.  The state may have my body but not my soul.\n\nIn order to save my brothers on death row I call upon Pope Francis and all the people of the world.\n\nLastly, I want everyone to boycott every single business  in the state of Texas until all the businesses are pressed to stop the death penalty.\n\nWith that Lord I commend my spirit.\n\nWarden I am ready to join my father.", 
"No, Well, Hi Mary Jean. See y'all later. Go ahead please.")), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

推荐答案

虽然在注释中已经建议了一些{readr}函数,如parse_guess()type_convert(),但它们只能正确处理类型为character的列,这就是为什么它们不能更正应该是numeric而不是double的列.

一个技巧是将所有列读入参数为col_types = cols(.default = "c")的字符.然后我们可以将结果通过管道传输到type_convert()guess_integer设置为TRUE.

不知道为什么{readr}在阅读CSV时使用不同的过程在内部猜测列—对我来说没有多大意义.

library(readr)
library(dplyr)

tx_deathrow <- read_csv('https://archive.org/download/tx_deathrow_full/tx_deathrow_full.csv',
                        col_types = cols(.default = "c")) %>% 
  type_convert(guess_integer = TRUE)
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   Execution = col_integer(),
#>   `Date of Birth` = col_date(format = ""),
#>   `Date of Offence` = col_date(format = ""),
#>   `Highest Education Level` = col_double(),
#>   `Last Name` = col_character(),
#>   `First Name` = col_character(),
#>   `TDCJ
#> Number` = col_integer(),
#>   `Age at Execution` = col_integer(),
#>   `Date Received` = col_date(format = ""),
#>   `Execution Date` = col_date(format = ""),
#>   Race = col_character(),
#>   County = col_character(),
#>   `Eye Color` = col_character(),
#>   Weight = col_integer(),
#>   Height = col_character(),
#>   `Native County` = col_character(),
#>   `Native State` = col_character(),
#>   `Last Statement` = col_character()
#> )

tx_deathrow %>% 
  glimpse()

#> Rows: 553
#> Columns: 18
#> $ Execution                 <int> 553, 552, 551, 550, 549, 548, 547, 546, 545,…
#> $ `Date of Birth`           <date> 1983-09-24, 1951-08-28, 1981-04-03, 1987-04…
#> $ `Date of Offence`         <date> 2004-11-21, 1979-05-27, 2003-12-03, 2008-04…
#> $ `Highest Education Level` <dbl> 9, 12, 10, 11, 12, 12, 12, 12, 11, 8, 10, 9,…
#> $ `Last Name`               <chr> "Young", "Bible", "Castillo", "Davila", "Rod…
#> $ `First Name`              <chr> "Christopher Anthony", "Danny Paul", "Juan E…
#> $ `TDCJ\nNumber`            <int> 999508, 999455, 999502, 999545, 999534, 9994…
#> $ `Age at Execution`        <int> 34, 66, 37, 31, 38, 62, 64, 55, 47, 38, 46, …
#> $ `Date Received`           <date> 2006-03-31, 2003-07-17, 2005-09-27, 2009-02…
#> $ `Execution Date`          <date> 2018-07-17, 2018-06-27, 2018-05-16, 2018-04…
#> $ Race                      <chr> "Black", "White", "Hispanic", "Black", "Hisp…
#> $ County                    <chr> "Bexar", "Harris", "Bexar", "Tarrant", "Lubb…
#> $ `Eye Color`               <chr> "Brown", "Blue", "Brown", "Brown", "Brown", …
#> $ Weight                    <int> 216, 194, 180, 161, 198, 188, 179, 198, 204,…
#> $ Height                    <chr> "6' 1\"", "5' 7\"", "5' 11\"", "5' 11\"", "5…
#> $ `Native County`           <chr> "Bexar", "Brazoria", "Bexar", "Tarrant", "Wi…
#> $ `Native State`            <chr> "Texas", "Texas", "Texas", "Texas", "Texas",…
#> $ `Last Statement`          <chr> "l want to make sure the Patel family knows …

另一种读入数据的方法是data.table::fread(),如果合适的话,它在将数据正确转换为数字方面做得很好.但是,在这种情况下,日期被转换为IDate,这是一种基于整数的非标准日期格式.此外,通常的标准输出是data.table,但我们可以通过设置data.table = FALSE返回data.frame.不幸的是,我找不到一种方法来阻止日期转换为IDate.

library(data.table)

tx_deathrow <- data.table::fread('https://archive.org/download/tx_deathrow_full/tx_deathrow_full.csv',
                                 data.table = FALSE)

str(tx_deathrow)
#> 'data.frame':    553 obs. of  18 variables:
#>  $ Execution              : int  553 552 551 550 549 548 547 546 545 544 ...
#>  $ Date of Birth          : IDate, format: "1983-09-24" "1951-08-28" ...
#>  $ Date of Offence        : IDate, format: "2004-11-21" "1979-05-27" ...
#>  $ Highest Education Level: num  9 12 10 11 12 12 12 12 11 8 ...
#>  $ Last Name              : chr  "Young" "Bible" "Castillo" "Davila" ...
#>  $ First Name             : chr  "Christopher Anthony" "Danny Paul" "Juan Edward" "Erick Daniel" ...
#>  $ TDCJ
#> Number           : int  999508 999455 999502 999545 999534 999412 999371 999488 999275 999411 ...
#>  $ Age at Execution       : int  34 66 37 31 38 62 64 55 47 38 ...
#>  $ Date Received          : IDate, format: "2006-03-31" "2003-07-17" ...
#>  $ Execution Date         : IDate, format: "2018-07-17" "2018-06-27" ...
#>  $ Race                   : chr  "Black" "White" "Hispanic" "Black" ...
#>  $ County                 : chr  "Bexar" "Harris" "Bexar" "Tarrant" ...
#>  $ Eye Color              : chr  "Brown" "Blue" "Brown" "Brown" ...
#>  $ Weight                 : int  216 194 180 161 198 188 179 198 204 160 ...
#>  $ Height                 : chr  "6' 1\"\"" "5' 7\"\"" "5' 11\"\"" "5' 11\"\"" ...
#>  $ Native County          : chr  "Bexar" "Brazoria" "Bexar" "Tarrant" ...
#>  $ Native State           : chr  "Texas" "Texas" "Texas" "Texas" ...
#>  $ Last Statement         : chr  "l want to make sure the Patel family knows I love them like they love me. Make sure the kids in the world know "| __truncated__ "" "To everyone that has been there for me you know who you are. Love y'all. See y'all on the other side.That's it." "Yes, I would like to say nephew it burns huh. You know I might have lost the fight but I'm still a soldier. I s"| __truncated__ ...

创建于2024—03—16,reprex v2.0.2

R相关问答推荐

为什么以及如何修复Mapview不显示所有点并且st_buffer合并一些区域R?

r中的stat_difference函数不起作用

为什么st_join(ob1,ob2,left = True)返回具有比ob1更多功能的sf对象?

使用gcuminc,如何使用逗号格式化风险表?

如何在kableextra调用cell_spec()中忽略NA?

derrr mutate case_when grepl不能在R中正确返回值

二维样条,严格以一个参数递增

如何在区分不同条件的同时可视化跨时间的连续变量?

在使用bslb和bootstrap5时,有没有办法更改特定dt行的 colored颜色 ?

R中1到n_1,2到n_2,…,n到n_n的所有组合都是列表中的向量?

为什么我使用geom_density的绘图不能到达x轴?

具有重复元素的维恩图

根据约束随机填充向量的元素

在点图上绘制置信度或预测区间ggplot2

向R中的数据帧添加一列,该列统计另一列中每个唯一值的二进制观测值的数量

用满足特定列匹配的另一行替换NA行

如何在R中的两列以上使用联合(&U)?

在使用ggplot2的情况下,如何在使用coord_trans函数的同时,根据未转换的坐标比来定位geom_瓷砖?

修复标签重叠和ggploy内的空间

R:部分修改矩阵对角线的有效方法