Answer to: Counting instances of genotype strings where order is irrelevant within locus
Score: 1
To treat Xx and xX as the same, we can use a strsplit/sort approach, applied to the factor levels rather than to every element for the sake of efficiency. This assumes diploid, biallelic loci, i.e. exactly two characters per locus.
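The idea can be sketched on a single locus first. This is a minimal illustration rather than the final function; the toy vector `g` is a hypothetical input. The alleles are sorted once per unique genotype (the factor levels), and assigning duplicated level names merges those levels.

```r
# Hypothetical toy input: one two-character genotype per element
g <- c("Aa", "aA", "AA", "aa", "aA")
f <- factor(g)

# Sort the alleles within each unique genotype (the levels), not each element
sorted_levels <- vapply(
  strsplit(levels(f), ""),
  \(ch) paste(sort.int(ch, decreasing = TRUE), collapse = ""),
  character(1)
)

# Assigning duplicated level names merges the corresponding levels
levels(f) <- sorted_levels
table(f)
```

The two heterozygote spellings end up in one level, so the table has three entries instead of four. The full function below applies the same step to every locus at once.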
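canonicalize_genotype <- \(x, decreasing.=TRUE) {
  ## sort the two alleles within each two-character genotype
  base_levels <- \(A) {
    strsplit(A, '') |>
      lapply(sort.int, decreasing=decreasing.) |>
      sapply(paste, collapse='')
  }
  x <- unlist(x)
  len <- nchar(x)
  ## all strings need the same, even number of characters
  ## (unique() also works for a single string, where var() would give NA)
  if (length(unique(len)) != 1L || len[1L] %% 2L != 0L) {
    stop('lengths ambiguous.')
  }
  len <- len[1L]
  ## split into loci: one row per string, one column per locus
  a <- vapply(seq_len(len/2) - 1, \(i) {
    substr(x, 1 + 2*i, 2 + 2*i)
  }, FUN.VALUE=character(length(x))) |>
    as.factor()
  ## sort only the unique genotypes; duplicated levels are merged
  levels(a) <- base_levels(levels(a))
  ## back to one row per string, then paste the loci together again
  matrix(a, ncol=len/2) |>
    as.data.frame() |>
    Reduce(f=paste0)
}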
Gives:
> canonicalize_genotype(offspring) |>
+   table()
aabb aaBb aaBB Aabb AaBb AaBB AAbb AABb AABB
   1    2    1    2    4    2    1    2    1
Notice that the table displays Xx rather than xX, as the OP seems to prefer. To get xX, use canonicalize_genotype(., decreasing.=FALSE).
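One caveat, which is an assumption about the reader's setup rather than something the answer covers: whether uppercase sorts before or after lowercase depends on the session's collation locale, so which of Xx or xX a given `decreasing.` value produces may differ between machines. A small sketch using `sort.int(method = "radix")`, which sorts character vectors in the C locale, shows one way to make the order reproducible:

```r
alleles <- c("a", "A")

# Locale-dependent: the result follows the session's collation rules
sort.int(alleles, decreasing = TRUE)

# Locale-independent: radix sorting uses the C locale, where "A" < "a"
upper_first <- sort.int(alleles, method = "radix")
lower_first <- sort.int(alleles, method = "radix", decreasing = TRUE)

paste(upper_first, collapse = "")  # "Aa"
paste(lower_first, collapse = "")  # "aA"
```

Passing `method='radix'` inside the sorting helper would be one possible way to pin the behaviour, at the cost of reversing the meaning of `decreasing.` on systems where lowercase normally sorts first.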
I wrote it so it generalizes to more than two loci [[aa, aA, Aa, AA], [bb, bB, Bb, BB], [cc, cC, Cc, CC], ...]:
> canonicalize_genotype(df) |>
+   table()
aabbCC aaBbCc Aabbcc AabbCC AaBbcc AaBbCc AaBbCC AaBBcc AaBBCC AAbbcc AABbcc AABbCc AABBcc AABBCc AABBCC
     1      1      1      1      1      1      2      1      1      1      1      1      1      1      1
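As a sanity check for the multi-locus case, a slower per-string version can be compared against the function's output. The sketch below is only an illustration: `canon_naive` and the toy vector `g` are hypothetical names, not part of the answer.

```r
# Hypothetical helper: canonicalize a single string, locus by locus
canon_naive <- \(s, decreasing = TRUE) {
  starts <- seq(1L, nchar(s), by = 2L)
  loci <- substring(s, starts, starts + 1L)   # two characters per locus
  sorted <- vapply(
    strsplit(loci, ""),
    \(ch) paste(sort.int(ch, decreasing = decreasing), collapse = ""),
    character(1)
  )
  paste(sorted, collapse = "")
}

# Hypothetical toy input: elements 1, 2 and 4 differ only in allele order
g <- c("aAbB", "AabB", "AAbb", "aABb")
canon <- vapply(g, canon_naive, character(1), USE.NAMES = FALSE)
table(canon)
```

Strings that differ only in allele order within a locus should map to the same canonical form, so the table here has two entries.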
Data:
mk_mat <- \(m, n) {
  replicate(m*m, lapply(seq_len(n), \(i) {
    sample(c(letters[i], LETTERS[i]), replace=TRUE) |>
      paste(collapse='')
  }), simplify=FALSE) |>
    sapply(paste, collapse='') |>
    matrix(m, m)
}
set.seed(42)
df <- mk_mat(4, 3) |>
  as.data.frame()
Question
Score: 4 • Views: 101
Site: stackoverflow