This lesson is still being designed and assembled (Pre-Alpha version)

Working with files

Overview

Teaching: 15 min
Exercises: 16 min
Questions
  • How should I name my files?

  • How does folder organisation help me?

Objectives
  • Understand elements of good naming strategy

  • Evaluate pros and cons of different project organizations

  • Explain how file management helps in being FAIR

Project organization: planning file names and folder structure

Before you even start collecting or working with data, you should decide how you will structure and name files and folders. This will:

Figure: Intro to folder structure. Figure credits: Andrés Romanowski

Consistently naming files and organizing them in folders has two main goals:

Naming your files (and folders)

One important and often overlooked aspect of organizing, sharing, and keeping track of data files is standardising naming.
It is important to develop naming conventions that permit encoding the experimental factors that are important to the project.

File (folder) names should be consistent, meaningful to you and your collaborators, allow you to easily find what you are looking for, give you a sense of the content without opening the file, and identify if something is missing.
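One way to keep names consistent is to generate them from the experimental factors instead of typing them by hand. The following is a minimal sketch, not a prescribed tool: the function name `build_name`, the order of the parts, and the codes (LD/SD, phyA/phyB, on/off, tXX) are assumptions borrowed from the example file names used in this episode.

```python
from datetime import date


def build_name(light, genotype, medium, timepoint, measured, kind="raw", ext="xlsx"):
    """Assemble a file name from experimental factors in a fixed order.

    Hypothetical helper: the part order and codes are illustrative assumptions.
    """
    parts = [
        light.upper(),          # light condition, always upper case (LD or SD)
        genotype,               # sample genotype, e.g. phyA or phyB
        medium,                 # medium code, e.g. on or off
        f"t{timepoint:02d}",    # timepoint, zero-padded to two digits
        measured.isoformat(),   # measurement date as YYYY-MM-DD
    ]
    return "_".join(parts) + f".{kind}.{ext}"


print(build_name("ld", "phyA", "off", 4, date(2020, 8, 12), kind="norm"))
# LD_phyA_off_t04_2020-08-12.norm.xlsx
```

Because the case, the zero padding and the date format are fixed in one place, every name produced this way follows the same convention.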

Naming and sorting (3+2 minutes)

Have a look at the example files from a project similar to the one from the metadata episode.

All the files have been sorted by name and demonstrate consequences of different naming strategies.

For your information, the following conventions were used to encode experimental details:

  • phyB/phyA are sample genotypes
  • sXX is the sample number
  • LD/SD are different light conditions (long or short day)
  • on/off are different media (on sucrose, off sucrose)
  • the measurement date
  • other details are the timepoint and raw or normalized data

2020-07-14_s12_phyB_on_SD_t04.raw.xlsx
2020-07-14_s1_phyA_on_LD_t05.raw.xlsx
2020-07-14_s2_phyB_on_SD_t11.raw.xlsx
2020-08-12_s03_phyA_on_LD_t03.raw.xlsx
2020-08-12_s12_phyB_on_LD_t01.raw.xlsx
2020-08-13_s01_phyB_on_SD_t02.raw.xlsx
2020-7-12_s2_phyB_on_SD_t01.raw.xlsx
AUG-13_phyB_on_LD_s1_t11.raw.xlsx
JUL-31_phyB_on_LD_s1_t03.raw.xlsx

LD_phyA_off_t04_2020-08-12.norm.xlsx
LD_phyA_on_t04_2020-07-14.norm.xlsx
LD_phyB_off_t04_2020-08-12.norm.xlsx
LD_phyB_on_t04_2020-07-14.norm.xlsx
SD_phyB_off_t04_2020-08-13.norm.xlsx
SD_phyB_on_t04_2020-07-12.norm.xlsx
SD_phya_off_t04_2020-08-13.norm.xlsx
SD_phya_ons_t04_2020-07-12.norm.xlsx
ld_phyA_ons_t04_2020-08-12.norm.xlsx

  • What are the problems with having the date first?
  • How do different date formats behave once sorted?
  • Can you tell the importance of leading 0s (zeros)?
  • Is it equally easy to find all data from LD conditions as from ON media?
  • Can you spot the problem with using different cases?
  • Do you see the benefits of keeping a consistent length for each part of the name?
  • Do you see what happens when you mix conventions?

Solution

  • Using dates up front makes it difficult to quickly find data for particular conditions or genotypes. It also masks the “logical” order of samples or timepoints.
  • Named months break the “expected” sorting, as do dates without leading zeros.
  • Without leading zeros, ‘s12’ appears before ‘s1’ and ‘s2’.
  • The first (and second) parts of the name are the easiest to spot.
  • The last file is also from LD conditions but appears after SD; the same happens with the ‘phya’ genotypes.
  • The last 3 file names are the easiest to read, as all parts line up on top of each other thanks to the 3-letter codes ‘ons’ and ‘off’.
  • The lack of consistency makes it very difficult to get data from related samples/conditions.
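The sorting effects described above can be reproduced in a few lines of Python. This is only an illustration: Python's `sorted` compares strings character by character using code points, and some file browsers apply a different ("natural") ordering, so the listing in your own file manager may differ. The shortened names are fragments of the example files.

```python
# Sample labels without and with leading zeros
unpadded = ["s1_phyA", "s2_phyB", "s12_phyB"]
padded = ["s01_phyA", "s02_phyB", "s12_phyB"]
print(sorted(unpadded))  # ['s12_phyB', 's1_phyA', 's2_phyB']
print(sorted(padded))    # ['s01_phyA', 's02_phyB', 's12_phyB']

# Mixed date formats: only the zero-padded YYYY-MM-DD dates sort chronologically
dates = ["2020-08-12", "2020-7-12", "AUG-13", "2020-07-14", "JUL-31"]
print(sorted(dates))
# ['2020-07-14', '2020-08-12', '2020-7-12', 'AUG-13', 'JUL-31']

# Mixed case: upper-case letters sort before lower-case ones
conditions = ["SD_phyB", "ld_phyA", "LD_phyA", "SD_phya"]
print(sorted(conditions))
# ['LD_phyA', 'SD_phyB', 'SD_phya', 'ld_phyA']
```

In the first list, ‘s12’ lands before ‘s1’ because the character ‘2’ sorts before ‘_’. Padding every sample number to the same width removes the problem.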

Some things to take into account to decide on your naming convention are:

Do’s:

Don’ts:

If adding all the relevant details to file names makes them too long, it is often a signal that you should use folders to organize the files and capture some of those parameters.

Folders vs Files (3 minutes)

Have a look at these two different organization strategies:

(1) |– Project
|– |– arab_LD_phyA_off_t04_2020-08-12.metab.xlsx

(2) |– Project
|– |– arabidopsis
|– |– |– long_day
|– |– |– |– phyA
|– |– |– |– |– off_sucrose_2020-08-12
|– |– |– |– |– |– t04.metab.xlsx

Can you think of scenarios in which one is better suited than the other? Hint: think of other files that could be present as well.

Solution

The first strategy can work very well if the project has only a few files, so that all of them can be accessed quickly (no need to change folders) and the different parameters are easily visible. For example, a project with a couple of conditions and a couple of genotypes or species:

|– Project
|– |– arab_LD_phyA_off_t04_2020-08-12.metab.xlsx
|– |– arab_LD_WILD_off_t03_2020-08-11.metab.xlsx
|– |– arab_SD_phyA_off_t01_2020-05-12.metab.xlsx
|– |– arab_SD_WILD_off_t02_2020-05-11.metab.xlsx
|– |– rice_LD_phyA_off_t05_2020-05-02.metab.xlsx
|– |– rice_LD_WILD_off_t06_2020-05-02.metab.xlsx
|– |– rice_SD_phyA_off_t07_2020-06-02.metab.xlsx
|– |– rice_SD_WILD_off_t08_2020-06-02.metab.xlsx

The second strategy works better if we have a lot of individual files for each parameter. For example, imagine the metabolites are measured hourly throughout the day, and there are ten different genotypes, two species and four light conditions. You would not want to have all of the roughly 2000 files in one folder.

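|– Project
|– |– arabidopsis
|– |– |– long_day
|– |– |– |– phyA
|– |– |– |– |– off_sucrose_2020-08-12
|– |– |– |– |– |– t01.metab.xlsx
|– |– |– |– |– |– t02.metab.xlsx
|– |– |– |– |– |– t03.metab.xlsx
|– |– |– |– |– |– …
|– |– |– |– |– |– t23.metab.xlsx
|– |– |– |– |– |– t24.metab.xlsx
|– |– rice
|– |– |– long_day
|– |– |– |– phyA
|– |– |– |– |– off_sucrose_2020-06-03
|– |– |– |– |– |– t01.metab.xlsx
|– |– |– |– |– |– …
|– |– |– |– |– |– t24.metab.xlsx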
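To make the scale concrete, the sketch below counts the files in the scenario described above and builds the nested location for one of them. It assumes 24 hourly timepoints and a single medium and date per combination. The helper `nested_path` and its parameter names are hypothetical, chosen only to mirror the folder levels of the second strategy.

```python
from pathlib import PurePosixPath

# Hourly measurements over one day, ten genotypes, two species, four light conditions
timepoints, genotypes, species_count, light_conditions = 24, 10, 2, 4
total_files = timepoints * genotypes * species_count * light_conditions
print(total_files)  # 1920


def nested_path(species, light, genotype, medium, measured, timepoint):
    """Return the location of one file in the nested layout (hypothetical helper)."""
    return PurePosixPath(
        "Project",
        species,                      # e.g. arabidopsis or rice
        light,                        # e.g. long_day
        genotype,                     # e.g. phyA
        f"{medium}_{measured}",       # medium and measurement date share one level
        f"t{timepoint:02d}.metab.xlsx",
    )


print(nested_path("arabidopsis", "long_day", "phyA", "off_sucrose", "2020-08-12", 4))
# Project/arabidopsis/long_day/phyA/off_sucrose_2020-08-12/t04.metab.xlsx
```

With this layout each innermost folder holds only the 24 timepoint files for one combination of parameters, instead of nearly two thousand files side by side.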

Must do: Document your strategy

Regardless of whether you are using long filenames or incorporating some of the variables within the folder structure, document it!
Always include a PROJECT_STRUCTURE (or README) file describing your file naming and folder organisation conventions.

Strategies to set up a clear folder structure

Establishing a system that allows you to access your files, avoid duplication and ensure that your data can be easily found needs planning.

You can start by developing a logical folder structure. To do so, you need to take into account the following suggestions:

Good enough practices for scientific computing recommendations

The Good enough practices in scientific computing paper makes the following simple recommendations:

  • Put each project in its own directory, which is named after the project
  • Put text documents associated with the project in the ‘doc’ directory
  • Put raw data and metadata in a ‘data’ directory
  • Put files generated during cleanup and analysis in a ‘results’ directory
  • Put project source code in the ‘src’ directory
  • Put compiled programs in the ‘bin’ directory
  • Name all files to reflect their content or function:
    • Use names such as ‘bird_count_table.csv’, ‘notebook.md’, or ‘summarized_results.csv’.
    • Do not use sequential numbers (e.g., result1.csv, result2.csv) or a location in a final manuscript (e.g., fig_3_a.png), since those numbers will almost certainly change as the project evolves.
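The directory layout recommended above can be created once, at the start of a project, with a short script. This is a minimal sketch: the function name `create_skeleton` and the project name `bird_survey` are made up for illustration, and the demo runs inside a temporary directory so that nothing is left behind.

```python
import tempfile
from pathlib import Path

# Sub-directories named in the recommendations above
SUBDIRS = ("doc", "data", "results", "src", "bin")


def create_skeleton(project_dir):
    """Create the project directory and its standard sub-directories."""
    project_dir = Path(project_dir)
    for sub in SUBDIRS:
        # exist_ok=True makes the function safe to run again on an existing project
        (project_dir / sub).mkdir(parents=True, exist_ok=True)
    return project_dir


with tempfile.TemporaryDirectory() as tmp:
    project = create_skeleton(Path(tmp) / "bird_survey")
    print(sorted(p.name for p in project.iterdir()))
    # ['bin', 'data', 'doc', 'results', 'src']
```

For a real project you would pass the actual project location instead of a temporary directory.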

Organization for computing (3 minutes)

Take a look at the folder structure recommended by the Good enough practices in scientific computing paper.

Why do you think this layout is recommended and well suited to a computing project?

.
|– CITATION
|– README
|– LICENSE
|– requirements.txt
|
|– data
| |– birds_count_table.csv
|
|– doc
| |– notebook.md
| |– manuscript.md
| |– changelog.txt
|
|– results
| |– summarized_results.csv
|
|– src
| |– sightings_analysis.py
| |– runall.py
|

Solution

This project structure clearly separates the inputs (the raw data) from the outputs (the results) and the analysis procedure (Python code). Following the same convention (like the src folder for code) makes it easy to find elements of interest, for example the raw data or a particular plotting procedure.

The root directory contains a README file that provides an overview of the project as a whole, a CITATION file that explains how to reference it, and a LICENSE. All three make the project REUSABLE. The src directory contains a controller script, runall.py, that loads the data and triggers the whole analysis.

After you have a plan

Your naming conventions might need some adjustments as the project progresses. Don’t despair, just document it!

If you change the strategy, document it in PROJECT_STRUCTURE (or README), stating why you made the change and when. Update the locations and names of files which followed the old convention.

Backing up your project files and folders

Do you know how and where to keep 3 copies of your data which are always up to date?

Secure data preservation is very difficult to achieve without institutional support and know-how. One option is cloud storage, but not all data may be put in a public cloud.

You should always check your institutional guidelines and what solutions are available in your organization.

Project files organization and FAIR guidelines

FAIR Files (3+2 minutes)

In groups, discuss:

  • How can a strategy for folder organisation and naming conventions help in achieving FAIR data?

Have you realised that following the above suggestions means including valuable metadata as part of your folder structure and file names?

Where to next

Bulk renaming of files can be done with software such as Ant Renamer, RenameIT or Rename4Mac.
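Bulk renaming can also be scripted. The sketch below is one possible approach, not a replacement for the tools above: it zero-pads sample numbers such as `s1` to `s01` in names that follow the `_sXX_` pattern from the earlier exercise. The function names and the regular expression are assumptions made for this example. The renames are planned first, so that the list can be reviewed before any file is touched.

```python
import re
from pathlib import Path

# A sample number such as "s1" or "s12" sitting between two underscores
SAMPLE = re.compile(r"(?<=_)s(\d+)(?=_)")


def pad_sample_number(name, width=2):
    """Zero-pad the sample number in a file name, e.g. s1 -> s01."""
    return SAMPLE.sub(lambda m: f"s{int(m.group(1)):0{width}d}", name)


def plan_renames(folder):
    """Return (old_path, new_path) pairs without renaming anything."""
    plan = []
    for path in sorted(Path(folder).iterdir()):
        new_name = pad_sample_number(path.name)
        if path.is_file() and new_name != path.name:
            plan.append((path, path.with_name(new_name)))
    return plan


def apply_renames(plan):
    """Carry out the planned renames, refusing to overwrite existing files."""
    for old, new in plan:
        if new.exists():
            raise FileExistsError(f"{new} already exists, not renaming {old}")
        old.rename(new)


print(pad_sample_number("2020-07-14_s1_phyA_on_LD_t05.raw.xlsx"))
# 2020-07-14_s01_phyA_on_LD_t05.raw.xlsx
```

A cautious workflow is to print the result of `plan_renames` for a copy of the data folder, check it, and only then call `apply_renames`. Remember to record the change in PROJECT_STRUCTURE (or README).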

Good enough practices in scientific computing (Wilson et al., 2017)

Attribution

Content of this episode was created using the following references as inspiration:

Ed_DaSH

Key Points

  • A good file name hints at the file content

  • Good project organization saves you time

  • Describe your file organization in PROJECT_STRUCTURE