当前位置: 动力学知识库 > 问答 > 编程问答 >

r - Merging two different size datasets, copying the same row conditionally from smaller dataset to several rows in larger dataset

问题描述:

I am completely new with R, and I tried googling a representative solution for my problem for some time, but haven't found an adequate answer so far, so I hope that asking for help might solve this one here.

I should merge two different size data sets (other includes annual data: df_f, and other monthly data: df_m). I should merge the smaller df_f to the larger df_m in a way that rows of df_f are merged conditionally with df_m.

Here is a descriptive example of my problem (with some very basic reproducible numbers):

first dataset

a <- c(1990)

b <- c(1980:1981)

c <- c(1994:1995)

aa <- rep("A", 1)

bb <- rep("B", 2)

cc <- rep("C", 2)

df1 <- data.frame(comp=factor(c(aa, bb, cc)))

df2 <- data.frame(year=factor(c(a, b, c)))

other.columns <- rep("other_columns", length(df1))

df_f <- cbind(df1, df2, other.columns ) # first dataset

second dataset

z <- c(10:12)

x <- c(7:12)

xx <- c(1:9)

v <- c(2:9)

w <- rep(1990, length(z))

e <- rep(1980, length(x))

ee <- rep (1981, length(xx))

r <- rep(1995, length(v))

t <- rep("A", length(z))

y <- rep("B", length(x) + length(xx))

u <- rep("C", length(v))

df3 <- data.frame(month=factor(c(z, x, xx, v)))

df4 <- data.frame(year=factor(c(w, e, ee, r)))

df5 <- data.frame(comp=factor(c(t, y, u)))

df_m <- cbind(df5, df4, df3) # second dataset

Output:

> df_m

comp year month

1 A 1990 10

2 A 1990 11

3 A 1990 12

4 B 1980 7

5 B 1980 8

6 B 1980 9

7 B 1980 10

8 B 1980 11

9 B 1980 12

10 B 1981 1

11 B 1981 2

12 B 1981 3

13 B 1981 4

14 B 1981 5

15 B 1981 6

16 B 1981 7

17 B 1981 8

18 B 1981 9

19 C 1995 2

20 C 1995 3

21 C 1995 4

22 C 1995 5

23 C 1995 6

24 C 1995 7

25 C 1995 8

26 C 1995 9

> df_f

comp year other.columns

1 A 1990 other_columns

2 B 1980 other_columns

3 B 1981 other_columns

4 C 1994 other_columns

5 C 1995 other_columns

I want to have the rows from df_f placed to df_m (store the data from df_f to new columns in df_m) according to the conditions comp, year, and month. Comp (company) needs to match always, but matching the year is conditional to month: if month is >6 then year is matched between datasets, if month is <7 then year + 1 (in df_m) is matched with year (in df_f). Note that a certain row in df_f should be placed into several rows in df_m according to the conditions.

The wanted output clarifies the problem and the goal:

Wanted output:

 comp year month comp year other.columns

1 A 1990 10 A 1990 other_columns

2 A 1990 11 A 1990 other_columns

3 A 1990 12 A 1990 other_columns

4 B 1980 7 B 1980 other_columns

5 B 1980 8 B 1980 other_columns

6 B 1980 9 B 1980 other_columns

7 B 1980 10 B 1980 other_columns

8 B 1980 11 B 1980 other_columns

9 B 1980 12 B 1980 other_columns

10 B 1981 1 B 1980 other_columns

11 B 1981 2 B 1980 other_columns

12 B 1981 3 B 1980 other_columns

13 B 1981 4 B 1980 other_columns

14 B 1981 5 B 1980 other_columns

15 B 1981 6 B 1980 other_columns

16 B 1981 7 B 1981 other_columns

17 B 1981 8 B 1981 other_columns

18 B 1981 9 B 1981 other_columns

19 C 1995 2 C 1994 other_columns

20 C 1995 3 C 1994 other_columns

21 C 1995 4 C 1994 other_columns

22 C 1995 5 C 1994 other_columns

23 C 1995 6 C 1994 other_columns

24 C 1995 7 C 1995 other_columns

25 C 1995 8 C 1995 other_columns

26 C 1995 9 C 1995 other_columns

Thank you very much in advance! I hope the question is clear enough, it was somewhat difficult to explain it at least.

网友答案:

The basic idea to solve your problem is to add an extra column with the year that should be used for matching. I will use the package dpylr for this and other manipulation steps.

Before the tables can be combined, the numeric columns must be converted to be numeric:

library(dplyr)
df_m <- mutate(df_m, year = as.numeric(as.character(year)),
                     month = as.numeric(as.character(month)))
df_f <- mutate(df_f, year = as.numeric(as.character(year)))

The reason is that you want to be able to do numerical comparison with the month (month > 6) and subtract one from the year. You cannot do this with a factor.

Then I add the column to be used for matching:

df_m <- mutate(df_m, match_year = ifelse(month >= 7, year, year - 1))

And in the last step, I join the two tables:

df_new <- left_join(df_m, df_f, by = c("comp", "match_year" = "year"))

The argument by determines which columns of the two data frames should be matched. The output agrees with your result:

##    comp year month match_year other.columns
## 1     A 1990    10       1990 other_columns
## 2     A 1990    11       1990 other_columns
## 3     A 1990    12       1990 other_columns
## 4     B 1980     7       1980 other_columns
## 5     B 1980     8       1980 other_columns
## 6     B 1980     9       1980 other_columns
## 7     B 1980    10       1980 other_columns
## 8     B 1980    11       1980 other_columns
## 9     B 1980    12       1980 other_columns
## 10    B 1981     1       1980 other_columns
## 11    B 1981     2       1980 other_columns
## 12    B 1981     3       1980 other_columns
## 13    B 1981     4       1980 other_columns
## 14    B 1981     5       1980 other_columns
## 15    B 1981     6       1980 other_columns
## 16    B 1981     7       1981 other_columns
## 17    B 1981     8       1981 other_columns
## 18    B 1981     9       1981 other_columns
## 19    C 1995     2       1994 other_columns
## 20    C 1995     3       1994 other_columns
## 21    C 1995     4       1994 other_columns
## 22    C 1995     5       1994 other_columns
## 23    C 1995     6       1994 other_columns
## 24    C 1995     7       1995 other_columns
## 25    C 1995     8       1995 other_columns
## 26    C 1995     9       1995 other_columns
分享给朋友:
您可能感兴趣的文章:
随机阅读: