Help with unwieldy table #1370

kubu4 · 2022-01-25T20:36:08Z

kubu4
Jan 25, 2022
Maintainer

Alrighty, I have a table that's going to require a lot of manipulation and I think I need some help/suggestions on how to approach it. The table is here: https://github.com/epigeneticstoocean/2018_L18-adult-methylation/blob/main/data/whole_tx_table.csv

It's a large CSV with 67,891 rows.

Here's an excerpt of what it looks like:

t_id	chr	strand	start	end	t_name	num_exons	length	gene_id	gene_name	cov.S12M	FPKM.S12M	cov.S13M	FPKM.S13M	cov.S16F
1	NC_007175.2	+	1	1623	gene-COX1	1	1623	gene-COX1	COX1	197.261856	230.708456	38.658657	63.109	5144.539062
2	NC_007175.2	+	1710	8997	rna-NC_007175.2:1710..8997	2	1469	.	.	2242.554199	2622.788958	96.919327	158.217649	109415.523438
3	NC_007175.2	+	2645	3429	gene-COX3	1	785	gene-COX3	COX3	145.308258	169.945901	44.127384	72.036519	3372.623047

What I want:

gene_name | number_transcripts_per_gene | sample

Problems that I need to figure out how to handle:

Some gene_name values are listed as "." and would need to be relabeled so that each one is unique.
Do not want to count a transcript when the value in a sample column (i.e. a column beginning with FPKM) is 0.
How to do this for each sample (i.e. each column beginning with FPKM).

If anyone can suggest an approach, I'd greatly appreciate it. Feeling a bit overwhelmed when looking at this table and having to handle so many exceptions within the data.

Answered by kubu4

Jan 27, 2022

sr320 · 2022-01-25T20:50:47Z

sr320
Jan 25, 2022
Maintainer

Some gene_name values are listed as "." and would need to be relabeled so that each one is unique.

I would create a new column combining chr and start (chr_start), as I believe this would be unique.

Not super easy, but it's an idea. You would probably need to split, edit, and stack back.

2 replies

kubu4 Jan 25, 2022
Maintainer Author

This is a killer idea. Thanks! I think it should be "easy" to replace "." with the start value...

kubu4 Jan 25, 2022
Maintainer Author

Again, thanks! Here's how I replaced the dots in the gene_name column (using the t_name value as the replacement):

whole_tx_table <- whole_tx_table %>% mutate(gene_name = ifelse(gene_name == ".", t_name, gene_name))
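
For anyone who would rather do this step in the shell, the same relabeling can be sketched with awk. This is a minimal sketch, not the code used above: demo_tx.csv and demo_tx_relabeled.csv are made-up file names, the demo is trimmed to three columns, and it assumes a plain comma-separated file with no quoted fields.

```shell
# Hypothetical demo file: t_name, gene_name, and one FPKM column.
cat > demo_tx.csv <<'EOF'
t_name,gene_name,FPKM.S12M
gene-COX1,COX1,230.708456
rna-NC_007175.2:1710..8997,.,2622.788958
gene-COX3,COX3,169.945901
EOF

# Where gene_name (field 2 in this demo) is ".", substitute t_name (field 1).
# The header line (NR == 1) is passed through untouched.
awk 'BEGIN {FS = OFS = ","} NR > 1 && $2 == "." {$2 = $1} {print}' \
  demo_tx.csv > demo_tx_relabeled.csv

cat demo_tx_relabeled.csv
```

In the full table, t_name is field 6 and gene_name is field 10, so the field numbers would need to change accordingly.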

kubu4 · 2022-01-25T22:22:41Z

kubu4
Jan 25, 2022
Maintainer Author

Making progress...

Counting how many times each unique gene_name occurs can be done like so:

table(whole_tx_table[,"gene_name"])

4 replies

kubu4 Jan 26, 2022
Maintainer Author

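EDITED: This will not work. length() counts every element of the logical vector FPKM.S12M > 0, TRUE and FALSE alike, so it returns the total number of transcripts per gene; sum() would count only the TRUE values. Leaving it in the discussion for reference, though.

More progress...

This was meant to count the transcripts for each unique gene_name that have FPKM values > 0, in a single sample (in this case FPKM.S12M):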

whole_tx_table_fpkm.s12m <- whole_tx_table %>% group_by(gene_name) %>% summarise(count = length(FPKM.S12M > 0))

kubu4 Jan 26, 2022
Maintainer Author

This will rename the column to have a sample name-specific count (albeit not dynamically, as will be needed/desired), but I'm just documenting the process:

whole_tx_table %>%
  group_by(gene_name) %>%
  # sum() counts only the TRUE values; length() would count every transcript
  summarise(count = sum(FPKM.S12M > 0)) %>%
  rename("trans_count_fpkm.s12m" = count)

kubu4 Jan 26, 2022
Maintainer Author

Create vector of library names. Will be used for looping through columns.

library_names <- paste("FPKM", sample_metadata_full$OldSample.ID, sep = ".")
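
If the sample metadata table is not at hand, the same list of column names could probably be read straight from the CSV header instead. A minimal shell sketch, assuming a comma-separated header with no quoted fields; demo_header.csv is a made-up stand-in for the first line of whole_tx_table.csv:

```shell
# Hypothetical demo header mimicking a few columns of whole_tx_table.csv.
printf '%s\n' 'gene_name,cov.S12M,FPKM.S12M,cov.S13M,FPKM.S13M' > demo_header.csv

# Take only the header line, put one column name per line,
# and keep the names that start with "FPKM".
library_names=$(head -n 1 demo_header.csv | tr ',' '\n' | grep '^FPKM')

# Loop over the names (they contain no spaces, so word splitting is safe here).
for name in $library_names; do
  echo "sample column: $name"
done
```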

kubu4 Jan 26, 2022
Maintainer Author

I think I might be switching over to using the shell and awk...

Firstly, just creating a test file. The code below sets the input field separator to a comma and the output field separator to a tab (BEGIN {OFS="\t"}). It skips the header line (NR > 1) and then prints just three columns: gene_name, FPKM.S12M, and FPKM.S13M. The output is sorted by the first field (gene_name), which is needed for downstream line-by-line comparison of the values in that column.

awk -F"," 'BEGIN {OFS="\t"} NR > 1 {print $10, $12, $14}' whole_tx_table.csv \
| sort -k1 \
| head -n 50 \
| column -t \
> unique_gene_name_test.txt

That gives this output:

ATP6          93.883156   55.84006
COX1          230.708456  63.109
COX2          179.993226  74.224269
COX3          169.945901  72.036519
CYTB          228.799722  87.575892
LOC111099029  0.926958    6.124982
LOC111099030  10.124096   5.024844
LOC111099031  0           0
LOC111099031  0           0
LOC111099031  2.279801    2.289838
LOC111099032  17.674714   12.796428
LOC111099033  5.259716    7.326938
LOC111099034  3.514635    2.858349
LOC111099035  0           0
LOC111099035  1.929607    4.409107
LOC111099036  0           0
LOC111099036  1.45196     7.58513
LOC111099037  21.520663   26.353308
LOC111099038  6.019084    5.311657
LOC111099039  12.858404   13.689644
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0.354202    0.265986
LOC111099040  0.587969    0
LOC111099040  2.620288    1.077892
LOC111099040  4.290659    3.487692
LOC111099040  6.42671     6.906503
LOC111099041  0           0
LOC111099041  3.892818    4.934959
LOC111099042  0           0
LOC111099042  13.86859    14.319505
LOC111099043  0           0

Then, using awk arrays, tally how many occurrences of each gene name only if the FPKM value is > 0:

awk '{if ($2 > 0) gene_name[$1]++}; {if ($3 > 0) arr[$1]++}; END{ for (var in gene_name) print var, "\t", gene_name[var], arr[var]}' unique_gene_name_test.txt

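The output looks correct, with one caveat: a gene with no FPKM value > 0 in the first sample column (e.g. LOC111099043) is left out entirely, because the END loop only iterates over the gene_name array.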

LOC111099030  1  1
CYTB          1  1
LOC111099042  1  1
LOC111099037  1  1
LOC111099033  1  1
COX3          1  1
ATP6          1  1
LOC111099039  1  1
LOC111099036  1  1
LOC111099040  5  4
LOC111099035  1  1
LOC111099032  1  1
COX2          1  1
LOC111099038  1  1
LOC111099031  1  1
COX1          1  1
LOC111099029  1  1
LOC111099041  1  1
LOC111099034  1  1

So, this seems to work! This is probably what I'll go with.

Things I don't like, but probably won't take the time to fix:

Don't know how to automatically add headers back to output file.
Don't know how to run command without having to create separate arrays for each column and add each of these arrays to the END print statement.

The final command to handle all 24 samples will be MESSY!
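
Both of those dislikes can probably be worked around inside awk itself. The sketch below is one possible approach, not a tested solution for the real table: it reads the header line to find gene_name and every column starting with FPKM, keeps one count per gene and sample in a single array, and prints a header row first. The file names demo_fpkm.csv and demo_transcript_counts.tsv and the demo values are made up, and it assumes a comma-separated file with no quoted fields.

```shell
# Hypothetical demo input: gene_name plus a mix of cov.* and FPKM.* columns.
cat > demo_fpkm.csv <<'EOF'
gene_name,cov.S12M,FPKM.S12M,cov.S13M,FPKM.S13M
COX1,197.26,230.708456,38.65,63.109
LOC111099031,0,0,0,0
LOC111099031,1.5,2.279801,1.2,2.289838
LOC111099040,0.2,0.587969,0,0
LOC111099040,3.1,4.290659,2.9,3.487692
LOC111099043,0,0,0,0
EOF

awk -F"," '
BEGIN { OFS = "\t" }
NR == 1 {
  # Locate gene_name and every FPKM column from the header, so no field
  # numbers are hard-coded and no per-sample array has to be typed out.
  for (i = 1; i <= NF; i++) {
    if ($i == "gene_name") gene_col = i
    else if ($i ~ /^FPKM/) { fpkm_col[++n] = i; fpkm_name[n] = $i }
  }
  next
}
{
  seen[$gene_col] = 1   # remember every gene, even if all its values are 0
  for (j = 1; j <= n; j++)
    if ($(fpkm_col[j]) > 0) count[$gene_col, j]++
}
END {
  # Header row first, then one row per gene with one count per sample.
  line = "gene_name"
  for (j = 1; j <= n; j++) {
    name = fpkm_name[j]
    sub(/^FPKM/, "transcript_counts", name)
    line = line OFS name
  }
  print line
  for (gene in seen) {
    line = gene
    for (j = 1; j <= n; j++) line = line OFS (count[gene, j] + 0)
    print line
  }
}' demo_fpkm.csv \
| { read -r header; printf '%s\n' "$header"; sort -k1,1; } \
> demo_transcript_counts.tsv

cat demo_transcript_counts.tsv
```

On the demo file, LOC111099040 gets counts of 2 and 1, and LOC111099043 is kept with 0 and 0 rather than being dropped. The read/sort group at the end only keeps the header row on top while sorting the gene rows.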

kubu4 · 2022-01-27T15:54:24Z

kubu4
Jan 27, 2022
Maintainer Author

Guess I should've just gone to StackOverflow right away. Posted my problem and received a solution in R in less than 5 minutes (seriously).

Here's the solution, which is based off of the example table I posted above:

library(dplyr)
desired_result = your_data %>%
  group_by(name_of_first_column) %>%
  summarize(across(everything(), ~sum(. > 0)))

I'll have to figure out some usage of select() to get the full input table and I'll have to figure out a way to programmatically rename all the columns in the output data, but that seems much more feasible for my skillset.

2 replies

kubu4 Jan 27, 2022
Maintainer Author

Aaaaand, here's a "one-liner" to select appropriate columns and get the desired output from all of them:

whole_tx_table %>% select(starts_with(c("gene_name", "FPKM"))) %>% group_by(gene_name) %>% summarise((across(everything(), ~sum(. > 0))))

kubu4 Jan 27, 2022
Maintainer Author

And, just like that, I have a final table with new column names and everything! Beautiful!

# Create table of transcript counts per gene per sample
transcript_counts <- whole_tx_table %>%
  select(starts_with(c("gene_name", "FPKM"))) %>%
  group_by(gene_name) %>%
  summarise(across(everything(), ~sum(. > 0)))

# Rename columns
names(transcript_counts) <- gsub(x = names(transcript_counts), pattern = "FPKM", replacement = "transcript_counts")
