Given a data frame with certain rows having missing data in certain columns, though they belong to the same with similar A values, I want to prioritize rows with full records when there are multiple occurrences with the same value in column A to retain them.
# Create a sample dataframe
df <- data.frame(
A = c(1, 1, 2, 2, 3, 3, 4, 4),
B = c("a", "a", "b", "b", "c", "c", "d", "e"),
C = c("z", "z", "y", "y", "x", "x", "w", "v"),
D = c(6, NA, 7, NA, 8, 8, NA, 10),
E = c("f", NA, "g", NA, "h", "h", NA, "j")
)
#filter it this way
df_filtered <- df %>%
group_by(A) %>%
arrange(rowSums(is.na(.))) %>%
slice(1) %>%
ungroup()
The filtering process is as follows:
Group by column A.
For each group:
a. First, arrange by the number of NA values in descending order (so rows with fewer NA values come first). arrange(rowSums(is.na(.)))
b. Then, slice to pick the first row.
Finally, ungroup.