Based on Wiley’s infographic, publishing research data in institutional repositories is not yet the norm, accounting for only about 26% of research data sharing. I think this is a growing area that librarians are starting to move into.
The most common way to share data is as an appendix or supplementary material to a journal article, which suggests that data sharing in its own right is not mainstream practice; it happens only incidentally, when an author chooses to include data with a publication.
Data sharing is relatively new; as open access gains more weight in the academic world, data sharing will follow. It’s a more difficult area for researchers to embrace than OA publishing, because it really requires good information management skills. In the future I’d like to see more data librarian roles dedicated to helping researchers manage, format, describe, preserve and share their data.
The top reason given against data sharing is intellectual property or confidentiality issues. I think this is a valid concern: some research creates sensitive information that would be unwise to make open, possibly for the safety of the researchers themselves or of the human participants in their research. For this reason it does not surprise me that the social sciences have the least open data.
I think aggregators like RDA (Research Data Australia) are very helpful as a portal to many different research repositories. RDA makes its records discoverable through searchable fields: Title, Identifier, Related Organisation, Related People, and Description. Subject headings are applied using a range of different thesauruses. Searchers can also limit results by the time period and location where the data was captured, data provider, license and access.
From looking at the record linked to in the Thing, I thought some of the subject headings on this particular record weren’t all that useful. Some, like “date” and “camera number”, were simply field names from the dataset, and others, like “behaviour” and “cameras”, don’t really describe the content of the data in a way that helps with discovery. Other than that, the information is well filled out.
As an aggregator/portal, RDA doesn’t host the data itself; instead it links to the data provider. There are public stats on how many times the record has been viewed (i.e. page-view count), how many times it has been cited (using Thomson Reuters’ Data Citation Index™) and how many times it has been “accessed”. The access stat is interesting because what it actually counts is the number of times the “Go to Data Provider” link has been clicked. I clicked on the Australian Antarctic Data Centre link, but I only looked at AADC’s repository record; to download the data itself I needed to open an account, so I didn’t bother. Obviously that’s beyond RDA’s control. I just found it interesting because the stat doesn’t necessarily count the people who have seen the data, just the number of people who have followed the link to the data provider’s record.
The life cycle of data includes:
- Creating data
- Processing data
- Analysing data
- Preserving data
- Giving access to data
- Re-using data
This follows on from the previous “thing” – in the scenario, the researcher didn’t seem to think the data would be useful after the article was published, and kept saying “all the information is in the article”, which is why they made no effort to preserve the data, provide access to it, or make it re-usable down the track. So they pretty much stopped at step 3.
The ANDS Guide to Research Data Management in Practice is very helpful. It is a little more detailed than the 6 steps above, listing key activities in pre-research, during research, and post-research. It’s critical to have a data management plan.
Content for Thing 2 (ANDS website)
A data management horror story:
- The researcher didn’t store his data securely. He had it on a USB stick “in a box somewhere”, showing a lack of care. He should have kept backups of his files, e.g. on an external hard drive; there are also secure cloud storage solutions that are appropriate and easy to access later.
- He stored the data in a proprietary file format that could only be read by one particular program. In this case, the company that made the software went bankrupt, so the format became virtually obsolete. To avoid this problem, convert data to the most generic file format possible so it can be read by different programs.
- The data itself was not clearly labelled, using abbreviations that an outsider could not understand. Even the researcher could not remember what some of the labels stood for, and his research colleague was not easy to contact as he had gone back to China. The solution is to make sure that reading the data does not rely on inside knowledge: use clear, self-explanatory labels and provide context.
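The fixes above can be sketched in a few lines of code. This is a minimal illustration, not part of the original story: the field names and sample values are made up, and it assumes Python with only the standard library. It writes the data out in a generic format (CSV) alongside a plain-text data dictionary, so reading the data later depends on neither a particular vendor’s software nor inside knowledge.

```python
import csv

# Hypothetical observations; in practice these would be exported from
# the original analysis software while it can still be run.
observations = [
    {"site_id": "A01", "temp_c": 21.4, "obs_date": "2016-03-01"},
    {"site_id": "A02", "temp_c": 19.8, "obs_date": "2016-03-02"},
]

# 1. Save the data in a generic, program-independent format (CSV).
with open("observations.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["site_id", "temp_c", "obs_date"])
    writer.writeheader()
    writer.writerows(observations)

# 2. Save a plain-text data dictionary so the column labels are
#    self-explanatory to an outsider.
data_dictionary = {
    "site_id": "Sampling site identifier (A01-A20)",
    "temp_c": "Water temperature in degrees Celsius",
    "obs_date": "Observation date, ISO 8601 (YYYY-MM-DD)",
}
with open("README.txt", "w") as f:
    for field, description in data_dictionary.items():
        f.write(f"{field}: {description}\n")
```

Both output files are plain text, so they stay readable even if every piece of software used in the project disappears.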
I’m catching up on 23 (Research Data) Things from ANDS. I’m late to the party, so I’m going at my own pace and blogging my thoughts.
I found Boston University Libraries’ page very helpful, especially for identifying categories of research data:
- Observational: data captured in real-time, usually irreplaceable. For example, sensor data, survey data, sample data, neurological images.
- Experimental: data from lab equipment, often reproducible, but can be expensive. For example, gene sequences, chromatograms, toroid magnetic field data.
- Simulation: data generated from test models where model and metadata are more important than output data. For example, climate models, economic models.
- Derived or compiled: data is reproducible but expensive. For example, text and data mining, compiled database, 3D models.
- Reference or canonical: a (static or organic) conglomeration or collection of smaller (peer-reviewed) datasets, most probably published and curated. For example, gene sequence databanks, chemical structures, or spatial data portals.
Data can come in a great variety of formats, which was quite obvious from looking around the CSIRO Data Access Portal. Having data openly available is a good start in helping more people (including non-scientists) access and reuse research data. Still, I don’t think many people know they can freely access actual research data like this.