It’s been a while since I last posted, which is a shame because there was definitely stuff to share (though nothing specific comes to mind now).
I’ve noticed that as time goes by, I put more effort into generating larger amounts of metadata to include inside data files. Descriptive file names, a hierarchical directory structure, documentation, personal notes, references to raw data/logs and file indexing all help maintain a certain level of order, but when dealing with the output of many executions of similar programs there’s nothing like metadata.
This can help in debugging, or better yet, in avoiding errors altogether. Accidentally renaming or moving a file isn’t as disastrous as it might otherwise be when rich metadata is available. Metadata is particularly useful when batch-processing files to calculate statistics, plot large figures and so on: a small metadata structure can be loaded much faster than the full variables in order to decide whether a file is relevant for further processing.
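As a minimal sketch of that filtering idea: suppose each data file has a small JSON sidecar next to it holding the metadata (the `.meta.json` naming and the `score` field are my own illustrative choices here, not a standard). Scanning the sidecars is cheap, so we only touch the heavy data files that pass the filter:

```python
import json
from pathlib import Path

def relevant_files(data_dir, min_score=0.9):
    """Yield data files worth loading, judged by their metadata sidecars.

    Assumes each data file foo.npz sits next to a tiny foo.meta.json
    sidecar; only the sidecar is read, never the heavy data itself.
    """
    for meta_path in Path(data_dir).glob("*.meta.json"):
        meta = json.loads(meta_path.read_text())  # small, fast to load
        if meta.get("score", 0.0) >= min_score:
            # foo.meta.json -> foo.npz
            yield meta_path.with_suffix("").with_suffix(".npz")
```

The point of the design is that the decision ("is this run relevant?") never requires parsing the data file itself.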
Besides natural metadata fields, such as time of generation, code version, program flags and input/output files, other fields such as a failure/success score, the node/PC where the execution ran and statistics gathered during the run may also prove useful. Metadata should probably be generated as early as possible, when the data is first parsed, but it may also be updated and enriched as the data is analyzed.
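The fields listed above can all be collected automatically at parse time. A sketch, with illustrative field names of my own choosing (and assuming the code may or may not live in a git repository, hence the fallback):

```python
import getpass
import socket
import subprocess
import sys
from datetime import datetime, timezone

def make_metadata(input_files, output_file):
    """Collect 'natural' metadata as early as possible, when the data
    is first parsed. Field names are illustrative, not a standard."""
    try:
        # code version: current git commit, if running inside a repo
        version = subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"],
            text=True, stderr=subprocess.DEVNULL,
        ).strip()
    except Exception:
        version = "unknown"
    return {
        "generated": datetime.now(timezone.utc).isoformat(),
        "code_version": version,
        "argv": sys.argv,               # program flags as invoked
        "inputs": list(input_files),
        "output": output_file,
        "host": socket.gethostname(),   # node/PC where the run happened
        "user": getpass.getuser(),
    }

# later, enrich the same structure as the analysis proceeds:
# meta = make_metadata(["raw.log"], "result.npz")
# meta["score"] = 0.97  # e.g. a success score computed during the run
```

Because the result is a plain dictionary, enriching it later (run statistics, a success score) is just adding keys before writing it back out.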