Research Data Management: Data format

Tips and best practices for data management across disciplines.

Data types and file formats

After defining what we mean by data, it is helpful to consider what types of data you create and/or work with, and what format those data take. Your data stewardship practices will be dictated by the types of data that you work with, and what format they are in.

Data Types

Data types generally fall into five categories:

Observational
- Captured in situ
- Can’t be recaptured, recreated or replaced
- Examples: Sensor readings, sensory (human) observations, survey results

Experimental
- Data collected under controlled conditions, in situ or laboratory-based
- Should be reproducible, but can be expensive
- Examples: gene sequences, chromatograms, spectroscopy, microscopy

Derived or compiled
- Reproducible, but can be very expensive
- Examples: text and data mining, derived variables, compiled database, 3D models

Simulation
- Results from using a model to study the behavior and performance of an actual or theoretical system
- Models and metadata, where the input can be more important than output data
- Examples: climate models, economic models, biogeochemical models

Reference or canonical
- Static or organic collections of [peer-reviewed] datasets, most probably published and/or curated
- Examples: gene sequence databanks, chemical structures, census data, spatial data portals.

 

Data Formats 

Research data come in many formats: text, numeric, multimedia, models, software languages, discipline-specific (e.g. the crystallographic information file (CIF) in chemistry), and instrument-specific.

Formats more likely to be accessible in the future are:
- Non-proprietary
- Open, documented standards
- In common usage by the research community
- Using standard character encodings (ASCII, UTF-8)
- Uncompressed (desirable, space permitting)
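
To illustrate the character-encoding criterion above, the following Python sketch checks whether a file's bytes decode cleanly as ASCII or UTF-8. The function names `detect_standard_encoding` and `check_file` are hypothetical, and the sketch assumes the file is small enough to read into memory. It only reports whether a file passes; it does not try to identify which legacy encoding a failing file uses.

```python
from pathlib import Path
from typing import Optional


def detect_standard_encoding(data: bytes) -> Optional[str]:
    """Return 'ascii' or 'utf-8' if the bytes decode cleanly, else None."""
    # ASCII is a strict subset of UTF-8, so test the narrower encoding first.
    for encoding in ("ascii", "utf-8"):
        try:
            data.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None


def check_file(path: str) -> Optional[str]:
    """Read a file in binary mode and report its standard encoding, if any."""
    return detect_standard_encoding(Path(path).read_bytes())
```

A result of `None` suggests the file should be re-saved as UTF-8 from the software that produced it before it is shared or archived.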

Use the table below to find an appropriate and recommended format for preserving and sharing your data over the long term.


Type of data | Preferred file formats for sharing, re-use, and preservation | Other acceptable formats


Quantitative tabular data with extensive metadata

  • a dataset with variable labels, code labels, and defined missing values, in addition to the matrix of data
  • SPSS portable format (.por)
  • delimited text and command (‘setup’) file (SPSS, Stata, SAS, etc.) containing metadata information
  • structured text or mark-up file containing metadata information, e.g. DDI XML file


MS Access (.mdb/.accdb)


Quantitative tabular data with minimal metadata

  • a matrix of data with or without column headings or variable names, but no other metadata or labeling

 

 

  • comma-separated values (CSV) file (.csv)
  • tab-delimited file (.tab)
  • including delimited text of given character set with SQL data definition statements where appropriate
  • delimited text of given character set -- only characters not present in the data should be used as delimiters (.txt)
  • widely-used formats, e.g. MS Excel (.xls/.xlsx), MS Access (.mdb/.accdb), dBase (.dbf) and OpenDocument Spreadsheet (.ods)


Geospatial data

vector and raster data

 

  • ESRI Shapefile (essential: .shp, .shx, .dbf; optional: .prj, .sbx, .sbn)
  • geo-referenced TIFF (.tif, .tfw)
  • CAD data (.dwg)
  • tabular GIS attribute data
  • ESRI Geodatabase format (.mdb)
  • MapInfo Interchange Format (.mif) for vector data

 


Qualitative data

textual

 

  • eXtensible Mark-up Language (XML) text according to an appropriate Document Type Definition (DTD) or schema (.xml)
  • Rich Text Format (.rtf)
  • plain text data, UTF-8 (Unicode; .txt)
  • plain text data, ASCII (.txt)
  • Hypertext Mark-up Language (HTML) (.html)
  • widely-used proprietary formats, e.g. MS Word (.doc/.docx)
  • LaTeX (.tex)

 


Digital image data

 


TIFF version 6 uncompressed (.tif)

  • JPEG (.jpeg, .jpg)
  • TIFF (other versions; .tif, .tiff)
  • JPEG 2000 (.jp2)
  • Adobe Portable Document Format (PDF/A, PDF) (.pdf)


Digital audio data

 

  • Free Lossless Audio Codec (FLAC) (.flac)
  • Waveform Audio Format (WAV) (.wav)
  • MPEG-1 Audio Layer 3 (.mp3) - spoken word audio only
  • MPEG-1 Audio Layer 3 (.mp3)
  • Audio Interchange File Format (AIFF) (.aif)


Digital video data

 

  • MPEG-4 High Profile (.mp4)
  • Motion JPEG 2000 (.mj2)


JPEG 2000 (.mj2)


Documentation & Scripts
 

 

  • Rich Text Format (.rtf)
  • Open Document Text (.odt)
  • HTML (.htm, .html)
  • plain text (.txt)
  • widely-used proprietary formats, e.g. MS Word (.doc/.docx) or MS Excel (.xls/ .xlsx)
  • XML marked-up text (.xml) according to an appropriate DTD or schema, e.g. XHTML 1.0
  • PDF/A or PDF (.pdf)


Chemistry data

spectroscopy data and other plots which require the capability of representing contours as well as peak position and intensity

 


Convert NMR, IR, Raman, UV, and Mass Spectrometry files to JCAMP format for ease in sharing.

JCAMP file viewers: JSpecView, ChemDoodle
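
For the delimited-text formats listed in the tabular-data rows of the table above, the advice to use only delimiters that do not occur in the data can be checked programmatically. The Python sketch below is one possible approach, not a prescribed method: the helper names `pick_unused_delimiter` and `to_delimited_text`, and the candidate list and its order, are assumptions made for illustration.

```python
import csv
import io
from typing import List, Optional, Sequence

# Assumed candidates, tried in this order; adjust to your community's habits.
CANDIDATE_DELIMITERS = (",", ";", "|", "\t")


def pick_unused_delimiter(
    rows: List[Sequence[object]],
    candidates: Sequence[str] = CANDIDATE_DELIMITERS,
) -> Optional[str]:
    """Return the first candidate that appears in no cell, or None."""
    cells = [str(cell) for row in rows for cell in row]
    for delimiter in candidates:
        if not any(delimiter in cell for cell in cells):
            return delimiter
    return None


def to_delimited_text(rows: List[Sequence[object]], delimiter: str) -> str:
    """Serialize rows as delimited text using the chosen delimiter."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


rows = [["site", "note"], ["A1", "wet, cold"], ["B2", "dry; warm"]]
delimiter = pick_unused_delimiter(rows)  # "," and ";" both occur in the notes
text = to_delimited_text(rows, delimiter)
```

Whichever delimiter is chosen, it is worth recording it, together with the character set, in the accompanying documentation so that future users can parse the file without guessing.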

 

Sources: University of Edinburgh Information Services
University of Oregon Libraries
California Digital Library

Information duplicated from Oregon State University Libraries