Given two data frames:

df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))

df1
#  CustomerId Product
#           1 Toaster
#           2 Toaster
#           3 Toaster
#           4   Radio
#           5   Radio
#           6   Radio

df2
#  CustomerId   State
#           2 Alabama
#           4 Alabama
#           6    Ohio

How can I do database-style (i.e., SQL-style) joins? That is, how do I get:

  • An inner join of df1 and df2:
    Return only the rows in which the left table has matching keys in the right table.
  • An outer join of df1 and df2:
    Return all rows from both tables, joining records from the left that have matching keys in the right table.
  • A left outer join (or simply left join) of df1 and df2:
    Return all rows from the left table, and any rows with matching keys from the right table.
  • A right outer join of df1 and df2:
    Return all rows from the right table, and any rows with matching keys from the left table.

Extra credit:

How can I do a SQL style select statement?

Recommended answer

By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)

Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable. I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.
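As a quick check of the calls above, here is a minimal sketch that runs each join on the df1 and df2 from the question and counts the rows. The result names (inner, outer, left, right, cross) are just illustrative.

```r
# Rebuild the example data frames from the question
df1 <- data.frame(CustomerId = 1:6,
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 6),
                  State = c(rep("Alabama", 2), "Ohio"))

inner <- merge(df1, df2, by = "CustomerId")                # matching keys only
outer <- merge(df1, df2, by = "CustomerId", all = TRUE)    # all rows from both
left  <- merge(df1, df2, by = "CustomerId", all.x = TRUE)  # all rows from df1
right <- merge(df1, df2, by = "CustomerId", all.y = TRUE)  # all rows from df2
cross <- merge(df1, df2, by = NULL)                        # Cartesian product

nrow(inner)  # 3
nrow(outer)  # 6
nrow(left)   # 6
nrow(right)  # 3
nrow(cross)  # 18

# Rows of df1 with no match in df2 get NA in the State column
sum(is.na(left$State))  # 3
```

Because every CustomerId in df2 also appears in df1, the outer and left joins happen to return the same rows here; they would differ if df2 contained customers missing from df1.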

You can merge on multiple columns by giving by a vector, e.g., by = c("CustomerId", "OrderId").
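A small sketch of a two-column merge, using made-up data: the orders and shipments frames and the OrderId and Carrier columns are hypothetical and not part of the question's data.

```r
# Hypothetical data frames sharing two key columns
orders <- data.frame(CustomerId = c(1, 1, 2),
                     OrderId    = c(10, 11, 10),
                     Product    = c("Toaster", "Radio", "Radio"))
shipments <- data.frame(CustomerId = c(1, 2, 2),
                        OrderId    = c(11, 10, 12),
                        Carrier    = c("A", "B", "A"))

# A row is kept only when BOTH CustomerId and OrderId match
both <- merge(orders, shipments, by = c("CustomerId", "OrderId"))
nrow(both)  # 2: the (1, 11) and (2, 10) pairs
```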

If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2" where CustomerId_in_df1 is the name of the column in the first data frame and CustomerId_in_df2 is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)
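For instance, a minimal sketch with differently named key columns (the left_df and right_df frames are made up for illustration):

```r
# Hypothetical data frames whose key columns have different names
left_df  <- data.frame(CustomerId_in_df1 = 1:3,
                       Product = c("Toaster", "Radio", "Toaster"))
right_df <- data.frame(CustomerId_in_df2 = c(2, 3, 4),
                       State = c("Alabama", "Ohio", "Ohio"))

joined <- merge(left_df, right_df,
                by.x = "CustomerId_in_df1",
                by.y = "CustomerId_in_df2")

# The key column in the result takes its name from by.x
names(joined)  # "CustomerId_in_df1" "Product" "State"
nrow(joined)   # 2 (customers 2 and 3)
```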
