
Answer to: Counting instances of genotype strings where order is irrelevant within locus

Score: 1
Answered: Mar 3, 2026
User Rep: 76,620
To treat Xx and xX as the same, we can use a `strsplit`/`sort` approach, applied to the factor levels rather than to every string for the sake of efficiency. This assumes diploid, biallelic loci, each encoded as two characters, and strings of equal length.

```r
canonicalize_genotype <- \(x, decreasing.=TRUE) {
  # sort the two alleles within each distinct locus string, e.g. "aA" -> "Aa"
  base_levels <- \(A) {
    strsplit(A, '') |>
      lapply(sort.int, decreasing=decreasing.) |>
      sapply(paste, collapse='')
  }
  x <- unlist(x)
  len <- nchar(x)
  # all strings must have the same number of characters
  if (length(unique(len)) != 1L) {
    stop('lengths ambiguous.')
  } else {
    len <- len[1L]
  }
  # split every string into two-character loci (one column per locus)
  a <- vapply(seq_len(len/2) - 1, \(i) {
    substr(x, 1 + 2*i, 2 + 2*i)
  }, FUN.VALUE=character(length(x))) |>
    as.factor()
  # canonicalize the few distinct levels only; equivalent levels are merged
  levels(a) <- base_levels(levels(a))
  # reassemble the loci into one string per genotype
  matrix(a, ncol=len/2) |>
    as.data.frame() |>
    Reduce(f=paste0)
}
```

Gives:

```
> canonicalize_genotype(offspring) |>
+   table()

aabb aaBb aaBB Aabb AaBb AaBB AAbb AABb AABB 
   1    2    1    2    4    2    1    2    1 
```

Notice that the table displays Xx rather than xX, as the OP seems to prefer. To get xX, use `canonicalize_genotype(., decreasing.=FALSE)`.

I wrote it so it generalizes to more than two loci [[aa, aA, Aa, AA], [bb, bB, Bb, BB], [cc, cC, Cc, CC], ...]:

```
> canonicalize_genotype(df) |>
+   table()

aabbCC aaBbCc Aabbcc AabbCC AaBbcc AaBbCc AaBbCC AaBBcc AaBBCC AAbbcc AABbcc AABbCc AABBcc AABBCc AABBCC 
     1      1      1      1      1      1      2      1      1      1      1      1      1      1      1 
```

Data:

```r
mk_mat <- \(m, n) {
  replicate(m*m, lapply(seq_len(n), \(i) {
    sample(c(letters[i], LETTERS[i]), replace=TRUE) |> paste(collapse='')
  }), simplify=FALSE) |>
    sapply(paste, collapse='') |>
    matrix(m, m)
}

set.seed(42)
df <- mk_mat(4, 3) |> as.data.frame()
```
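For comparison, here is a minimal sketch of the same idea without the factor-level shortcut: each string is cut into two-character loci, the alleles within each locus are sorted, and the pieces are pasted back together. The helper name `canon_simple` and the `geno` vector are made up for illustration and are not part of the answer above. It assumes non-empty strings with an even number of characters. It sorts every string rather than only the distinct locus values, so it is likely slower on large inputs.

```r
# Hypothetical helper for illustration: canonicalize each string directly.
canon_simple <- function(x, decreasing = TRUE) {
  canon_one <- function(g) {
    starts <- seq(1L, nchar(g), by = 2L)          # first character of each locus
    loci   <- substring(g, starts, starts + 1L)   # e.g. "aABb" -> "aA", "Bb"
    sorted <- vapply(strsplit(loci, ""), function(p) {
      # order the two alleles within one locus
      paste(sort(p, decreasing = decreasing), collapse = "")
    }, character(1))
    paste(sorted, collapse = "")
  }
  vapply(x, canon_one, character(1), USE.NAMES = FALSE)
}

geno <- c("AaBb", "aABb", "AabB", "aaBB")   # made-up example data
res  <- canon_simple(geno)
table(res)
```

The first three strings collapse into a single class, while `"aaBB"` stays on its own. Whether a heterozygous locus comes out as `Aa` or `aA` for a given `decreasing` value depends on the session's collation locale, since `sort()` orders character vectors by locale. In the C locale, uppercase letters sort before lowercase ones.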
Tags: r, frequency, unordered
Question
Parent Entity
Score: 4 • Views: 101
Site: stackoverflow