Issue
I just downloaded the TCGA genomic dataset. The genomic data files are organized into one folder per case, and a single sample csv file lists every file. The csv is structured like this:
folder_name, file_name
folder1, file1.txt
folder2, file2.txt
folder3, file3.txt
Each file is a two-column table with one gene per row, in this format:
file_path: folder1/file1.txt
geneA, 5
geneB, 2
geneC, 4
How can I write a loop that opens each file and merges its values into the matching row, to get the following format?
folder_name, file_name, geneA, geneB, geneC
folder1, file1.txt, 5, 2, 4
folder2, file2.txt, 4, 3, 5
folder3, file3.txt, 6, 2, 4
There could be files where one of the genes (e.g. geneB) is missing, in which case a blank or n/a value would be acceptable.
Solution
You can do something like this:
import os
import pandas as pd

# Root directory that contains the per-case folders
ROOT_PATH = 'C:/'

# skipinitialspace drops the blank that follows each comma in the csv
file_info = pd.read_csv('fileinfo.csv', skipinitialspace=True)
file_info.columns = file_info.columns.str.strip()

rows = []
for index, row in file_info.iterrows():
    folder_name = row['folder_name'].strip()
    file_name = row['file_name'].strip()
    file_path = os.path.join(ROOT_PATH, folder_name, file_name)  # Create the file path
    print(file_path)
    # Each gene file has two columns and no header: gene name, value
    file_data = pd.read_csv(file_path, names=['Gene', 'value'], skipinitialspace=True)
    # Transpose so the genes become columns and the file becomes one row
    rows.append(file_data.set_index('Gene').T)

# concat aligns columns by gene name, so a gene missing from a file becomes NaN
merged_data = pd.concat(rows, axis=0, ignore_index=True)
merged_data.columns.name = None  # Remove the index name
# Put the identifying columns first without hard-coding the gene names
merged_data.insert(0, 'folder_name', file_info['folder_name'].str.strip())
merged_data.insert(1, 'file_name', file_info['file_name'].str.strip())
print(merged_data)
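To check how a missing gene behaves before running this on the real download, here is a minimal, self-contained sketch. It builds three small files in a temporary directory, with folder2 deliberately lacking geneB, and merges them. The helper name `merge_gene_files` and the demo data are hypothetical and only serve the illustration. The sketch assumes each gene appears at most once per file.

```python
import os
import tempfile

import pandas as pd


def merge_gene_files(root_path, file_info):
    """Return one row per file, with one column per gene.

    Hypothetical helper: a gene absent from a file comes out as NaN.
    """
    rows = []
    for folder_name, file_name in zip(file_info["folder_name"], file_info["file_name"]):
        file_path = os.path.join(root_path, folder_name, file_name)
        # Read the headerless two-column file as a Series indexed by gene name.
        genes = pd.read_csv(
            file_path,
            header=None,
            names=["gene", "value"],
            index_col="gene",
            skipinitialspace=True,
        )["value"]
        rows.append(genes)
    # A list of Series becomes one row each; columns are the union of gene names.
    gene_table = pd.DataFrame(rows).reset_index(drop=True)
    gene_table.columns.name = None
    return pd.concat([file_info.reset_index(drop=True), gene_table], axis=1)


# Made-up demo data: folder2/file2.txt has no geneB line.
samples = {
    ("folder1", "file1.txt"): "geneA, 5\ngeneB, 2\ngeneC, 4\n",
    ("folder2", "file2.txt"): "geneA, 4\ngeneC, 5\n",
    ("folder3", "file3.txt"): "geneA, 6\ngeneB, 2\ngeneC, 4\n",
}

with tempfile.TemporaryDirectory() as root:
    for (folder, name), text in samples.items():
        os.makedirs(os.path.join(root, folder), exist_ok=True)
        with open(os.path.join(root, folder, name), "w") as handle:
            handle.write(text)
    file_info = pd.DataFrame(list(samples), columns=["folder_name", "file_name"])
    merged = merge_gene_files(root, file_info)

print(merged)
```

The geneB cell for folder2 should come out as NaN, and `merged.to_csv('merged.csv', index=False)` would write it as an empty field, which matches the "blank" option asked for in the question.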
Answered By - User12345