Data file format standards

File Format Standards

Standard file formats and file naming conventions help ensure that your data can be uniquely identified and remain accessible for future use. When selecting tools for storing your data, pay special attention to the output formats they produce.

For preservation purposes, whenever possible use data formats that are:

  • Open standard
  • In an easily re-usable format (e.g. .txt as opposed to .pdf)

When listing the data formats you will use, make sure to include:

  • Software necessary to view the data (e.g. SPSS v.3; Microsoft Excel 97-2003)
  • Information about version control
  • If data will be stored in one format during collection and analysis and then converted to another format for preservation: a list of features that may be lost in conversion, such as system-specific labels
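
As an illustration of the checklist above, the following minimal Python sketch renders the three items as plain text for a README file. The `FormatRecord` class, its field names, and the example values are hypothetical and are not part of any standard.

```python
from dataclasses import dataclass, field


@dataclass
class FormatRecord:
    """One data format used in a project (hypothetical structure)."""
    file_format: str
    software: str
    version_control: str
    conversion_losses: list[str] = field(default_factory=list)


def render_record(record: FormatRecord) -> str:
    """Return the record as plain text suitable for a README.txt file."""
    lines = [
        f"Format: {record.file_format}",
        f"Software needed to view: {record.software}",
        f"Version control: {record.version_control}",
    ]
    if record.conversion_losses:
        lines.append("Features that may be lost in conversion:")
        lines.extend(f"  - {loss}" for loss in record.conversion_losses)
    return "\n".join(lines) + "\n"


# Example values are invented for illustration only.
example = FormatRecord(
    file_format="SPSS portable (.por), converted to CSV for preservation",
    software="SPSS; any text editor for the CSV copy",
    version_control="Dated file names, one folder per release",
    conversion_losses=["system-specific labels"],
)
readme_text = render_record(example)
print(readme_text)
```

Keeping this information in a plain-text file next to the data follows the same preference for open, easily re-usable formats.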

File Format Recommendations

Audio
Chemistry (spectra)
Documentation and scripts
Geospatial
Images
Qualitative (text)
Quantitative tabular data with extensive metadata
Quantitative tabular data with minimal metadata
Video

Digital audio data

Preferred Formats Other Acceptable Formats
  • Free Lossless Audio Codec (FLAC) (.flac)
  • Waveform Audio Format (WAV) (.wav)
  • MPEG-1 Audio Layer 3 (.mp3) - spoken word audio only
  • MPEG-1 Audio Layer 3 (.mp3)
  • Audio Interchange File Format (AIFF) (.aif)

Chemistry data

spectroscopy data and other plots which require the capability of representing contours as well as peak position and intensity

Preferred Formats
Convert NMR, IR, Raman, UV, and mass spectrometry files to JCAMP format for ease of sharing.
JCAMP file viewers: JSpecView, ChemDoodle
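
Because JCAMP files are plain text, their descriptive header can be inspected without a dedicated viewer. The sketch below assumes the common JCAMP-DX layout in which each header record has the form `##LABEL=value` and `$$` starts a comment. The sample text is invented, and the function is a simplified reader that ignores the data table, not a full parser.

```python
def read_jcamp_header(text: str) -> dict[str, str]:
    """Collect '##LABEL=value' records that appear before the data table."""
    header: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.split("$$", 1)[0].strip()  # drop '$$' comments
        if not line.startswith("##") or "=" not in line:
            continue  # not a labelled record
        label, value = line[2:].split("=", 1)
        label = label.strip().upper()
        if label in ("XYDATA", "XYPOINTS", "PEAK TABLE"):
            break  # the data table starts here; stop reading the header
        header[label] = value.strip()
    return header


# Invented sample, shortened for illustration.
sample = """##TITLE=Example spectrum
##JCAMP-DX=4.24 $$ format version
##DATA TYPE=INFRARED SPECTRUM
##XUNITS=1/CM
##YUNITS=TRANSMITTANCE
##XYDATA=(X++(Y..Y))
4000 0.91 0.90 0.88
##END=
"""
header = read_jcamp_header(sample)
print(header["TITLE"], "|", header["DATA TYPE"])
```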

Documentation and scripts

Preferred Formats Other Acceptable Formats
  • Open Document Text (.odt)
  • Rich Text Format (.rtf)
  • HTML (.htm, .html)
  • plain text (.txt)
  • widely-used proprietary formats, e.g. MS Word (.doc/.docx) or MS Excel (.xls/.xlsx)
  • XML marked-up text (.xml) according to an appropriate DTD or schema, e.g. XHTML 1.0
  • PDF/A or PDF (.pdf)

Geospatial data

vector and raster data

Preferred Formats Other Acceptable Formats
  • ESRI Shapefile (essential -- .shp, .shx, .dbf; optional -- .prj, .sbx, .sbn)
  • geo-referenced TIFF (.tif, .tfw)
  • CAD data (.dwg)
  • tabular GIS attribute data
  • Keyhole Mark-up Language (KML) (.kml)
  • ESRI Geodatabase format (.mdb)
  • MapInfo Interchange Format (.mif) for vector data

Digital image data

Preferred Formats Other Acceptable Formats
  • TIFF version 6 uncompressed (.tif)
  • JPEG (.jpeg, .jpg)
  • TIFF (other versions) (.tif, .tiff)
  • JPEG 2000 (.jp2)
  • Adobe Portable Document Format (PDF/A, PDF) (.pdf)

Viewers: OMERO for conversion, viewing, and metadata for biological microscope slides and other TIFF files.

Qualitative data
textual

Preferred Formats Other Acceptable Formats
  • eXtensible Mark-up Language (XML) text according to an appropriate Document Type Definition (DTD) or schema (.xml)
  • Rich Text Format (.rtf)
  • plain text data, UTF-8 (Unicode) (.txt)
  • plain text data, ASCII (.txt)
  • Hypertext Mark-up Language (HTML) (.html)
  • widely-used proprietary formats, e.g. MS Word (.doc/.docx)
  • LaTeX (.tex)
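
Before depositing plain text files, it can help to confirm that they really are ASCII or UTF-8. The following minimal Python sketch classifies a byte string by trying each decoding in turn. The function name and return labels are illustrative assumptions.

```python
def classify_text_encoding(data: bytes) -> str:
    """Return 'ASCII', 'UTF-8', or 'unknown' for the given bytes."""
    try:
        data.decode("ascii")
        return "ASCII"
    except UnicodeDecodeError:
        pass  # not pure ASCII; try UTF-8 next
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        return "unknown"  # some other encoding, or binary content


print(classify_text_encoding(b"interview transcript"))
print(classify_text_encoding("café".encode("utf-8")))
print(classify_text_encoding("café".encode("latin-1")))
```

To check a file on disk, read it in binary mode (for example with `pathlib.Path(name).read_bytes()`) and pass the result to the function. Files reported as "unknown" would need to be converted before they match the formats listed above.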

Quantitative tabular data with extensive metadata
a dataset with variable labels, code labels, and defined missing values, in addition to the matrix of data

Preferred Formats Other Acceptable Formats
  • SPSS portable format (.por)
  • delimited text and command ('setup') file (SPSS, Stata, SAS, etc.) containing metadata information
  • structured text or mark-up file containing metadata information, e.g. DDI XML file
  • MS Access (.mdb/.accdb)
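
The "delimited text plus structured metadata file" option can be sketched as follows: the data matrix goes into a delimited text file, while variable labels, code labels, and missing-value definitions go into a separate mark-up file. The element names below are hypothetical and do not follow the DDI schema, and the variables are invented. A real deposit would use an established schema such as DDI.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented example variables with labels, code labels, and missing values.
variables = {
    "sex": {"label": "Sex of respondent",
            "codes": {"1": "Female", "2": "Male"}, "missing": "-9"},
    "age": {"label": "Age in years", "codes": {}, "missing": "-9"},
}
rows = [{"sex": "1", "age": "34"}, {"sex": "2", "age": "-9"}]


def build_csv(records, fieldnames) -> str:
    """Return the data matrix as comma-separated text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def build_codebook(variable_info) -> str:
    """Return the metadata as XML text (hypothetical element names)."""
    root = ET.Element("codebook")
    for name, info in variable_info.items():
        var = ET.SubElement(root, "variable", name=name)
        ET.SubElement(var, "label").text = info["label"]
        ET.SubElement(var, "missing").text = info["missing"]
        for code, meaning in info["codes"].items():
            ET.SubElement(var, "code", value=code).text = meaning
    return ET.tostring(root, encoding="unicode")


csv_text = build_csv(rows, list(variables))
codebook_xml = build_codebook(variables)
print(csv_text)
print(codebook_xml)
```

Writing both strings to disk as a pair (for example `data.csv` and `codebook.xml`, both names illustrative) keeps the metadata readable without the original statistical package.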

Quantitative tabular data with minimal metadata
a matrix of data with or without column headings or variable names, but no other metadata or labelling

Preferred Formats Other Acceptable Formats
  • comma-separated values (CSV) file (.csv)
  • tab-delimited file (.tab), including delimited text of a given character set with SQL data definition statements where appropriate
  • delimited text of a given character set (.txt); use only characters that do not appear in the data as delimiters
  • widely-used formats, e.g. MS Excel (.xls/.xlsx), MS Access (.mdb/.accdb), dBase (.dbf) and OpenDocument Spreadsheet (.ods)
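
The advice to use only delimiters that do not occur in the data can be checked programmatically. This minimal Python sketch picks the first candidate character absent from every cell and then writes the rows with it. The candidate list and function names are assumptions made for illustration.

```python
import csv
import io

# Candidate delimiters, tried in order (an illustrative choice).
CANDIDATE_DELIMITERS = [",", "\t", ";", "|"]


def pick_delimiter(rows) -> str:
    """Return the first candidate that appears in no cell of the data."""
    for delimiter in CANDIDATE_DELIMITERS:
        if not any(delimiter in str(cell) for row in rows for cell in row):
            return delimiter
    raise ValueError("every candidate delimiter occurs in the data")


def write_delimited(rows, delimiter: str) -> str:
    """Return the rows as delimited text using the chosen delimiter."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Invented example: one cell contains a comma, so a comma is not a safe delimiter.
rows = [["site", "note"], ["A1", "wet, cold"], ["B2", "dry"]]
delimiter = pick_delimiter(rows)
text = write_delimited(rows, delimiter)
print(repr(delimiter))
print(text)
```

Recording the chosen delimiter and character set alongside the file lets future users read it without guessing.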

Digital video data

Preferred Formats Other Acceptable Formats
  • MPEG-4 High Profile (.mp4)
  • Motion JPEG 2000 (.mj2)

Adapted from the UK Data Archive recommendations for file formats: Managing and Sharing Data.

See also: Library of Congress Digital Format A-Z Directory

Maintained by: Brian Westra, bwestra@uoregon.edu
Created by bwestra on Jul 24, 2012. Last updated Oct 30, 2013.