How to do a VLOOKUP and fill down (like in Excel) in R

Posted on March 9

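I have a dataset of about 105,000 rows and 30 columns. It contains a categorical variable that I would like to map to a number. In Excel, I would probably do something with VLOOKUP and fill down.

How would I go about doing the same thing in R?

Essentially, I have a HouseType variable and I need to calculate HouseTypeNo. Here is some sample data: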

HouseType HouseTypeNo
Semi            1
Single          2
Row             3
Single          2
Apartment       4
Apartment       4
Row             3

Recommended answer

If I understand your question correctly, here are four ways to do the equivalent of Excel's VLOOKUP and fill down in R:

# load sample data from Q
hous <- read.table(header = TRUE, 
                   stringsAsFactors = FALSE, 
text="HouseType HouseTypeNo
Semi            1
Single          2
Row             3
Single          2
Apartment       4
Apartment       4
Row             3")

# create a toy large table with a 'HouseType' column 
# but no 'HouseTypeNo' column (yet)
largetable <- data.frame(HouseType = as.character(sample(unique(hous$HouseType), 1000, replace = TRUE)), stringsAsFactors = FALSE)

# create a lookup table to get the numbers to fill
# the large table
lookup <- unique(hous)
lookup
#   HouseType HouseTypeNo
# 1      Semi           1
# 2    Single           2
# 3       Row           3
# 5 Apartment           4

Here are four methods to fill in HouseTypeNo in largetable using the values in the lookup table:

First, with merge in base R:

# 1. using base 
base1 <- (merge(lookup, largetable, by = 'HouseType'))
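
One thing to be aware of with merge is that it sorts the result by the key column by default, so the rows no longer line up with the original table. If the original row order matters, a minimal sketch using match from base R is shown here. The small tables are illustrative stand-ins for lookup and largetable, and the name filled is only an assumption for this example.

```r
# Lookup table: one row per house type (same values as in the question)
lookup <- data.frame(
  HouseType   = c("Semi", "Single", "Row", "Apartment"),
  HouseTypeNo = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Small stand-in for the large table; the row order is what we want to keep
largetable <- data.frame(
  HouseType = c("Row", "Semi", "Apartment", "Row", "Single"),
  stringsAsFactors = FALSE
)

# match() gives, for each row of largetable, the row position of its
# HouseType in lookup; indexing HouseTypeNo by that position fills the column
filled <- largetable
filled$HouseTypeNo <- lookup$HouseTypeNo[match(filled$HouseType, lookup$HouseType)]

filled$HouseTypeNo
# 3 1 4 3 2
```

House types that are missing from lookup come out as NA with this approach, which is the same behaviour as a failed VLOOKUP returning #N/A.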

Second, with a named vector in base R:

# 2. using base and a named vector
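# name each number by its house type, taking both from the lookup table
# (rather than assuming the numbers are 1..n in order of first appearance)
housenames <- setNames(lookup$HouseTypeNo, lookup$HouseType)

# index the named vector by house type; unname() drops the repeated names
base2 <- data.frame(HouseType = largetable$HouseType,
                    HouseTypeNo = unname(housenames[largetable$HouseType]))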
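
If no lookup table exists yet and any consistent numbering will do, the numbers can be generated directly from the column. The sketch below assumes the codes should follow the order in which each house type first appears; the variable names are made up for this example.

```r
# House types in the order given in the question's sample data
house_type <- c("Semi", "Single", "Row", "Single",
                "Apartment", "Apartment", "Row")

# Distinct values in order of first appearance
levels_seen <- unique(house_type)

# Position of each value among the distinct values
house_type_no <- match(house_type, levels_seen)

# Same result via a factor with explicitly ordered levels
house_type_no2 <- as.integer(factor(house_type, levels = levels_seen))

house_type_no
# 1 2 3 2 4 4 3
```

Setting the levels explicitly matters here: factor() on its own sorts the levels alphabetically, which would number Apartment as 1.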

Third, using the plyr package:

# 3. using the plyr package
library(plyr)
plyr1 <- join(largetable, lookup, by = "HouseType")

Fourth, using the sqldf package:

# 4. using the sqldf package
library(sqldf)
sqldf1 <- sqldf("SELECT largetable.HouseType, lookup.HouseTypeNo
FROM largetable
INNER JOIN lookup
ON largetable.HouseType = lookup.HouseType")

If some house types in largetable might not exist in lookup, use a left join instead:

sqldf("select * from largetable left join lookup using (HouseType)")

The other solutions would need corresponding changes.
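
As a rough sketch of what those changes could look like in base R: merge takes all.x = TRUE to keep every row of its first table, and a named vector returns NA when indexed by a name it does not contain. The value "Bungalow" is a hypothetical house type added for this example so that one row has no match. (plyr's join already defaults to type = "left", so that call should not need changing.)

```r
lookup <- data.frame(
  HouseType   = c("Semi", "Single", "Row", "Apartment"),
  HouseTypeNo = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# "Bungalow" is a made-up type that is absent from lookup
largetable <- data.frame(
  HouseType = c("Semi", "Bungalow", "Row"),
  stringsAsFactors = FALSE
)

# 1. merge: put largetable first and keep all of its rows
left_merge <- merge(largetable, lookup, by = "HouseType", all.x = TRUE)

# 2. named vector: a name that is not present yields NA
housenames <- setNames(lookup$HouseTypeNo, lookup$HouseType)
left_named <- data.frame(
  HouseType   = largetable$HouseType,
  HouseTypeNo = unname(housenames[largetable$HouseType]),
  stringsAsFactors = FALSE
)

left_named$HouseTypeNo
# 1 NA 3
```

Both results keep all three rows, with NA for the unmatched type. Without all.x = TRUE, merge would silently drop the "Bungalow" row.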

Is that what you wanted to do? Let me know which method you like and I'll add commentary.
