Example of a data inconsistency

Just as the correctness of data analysis code should be tested automatically, the consistency of data should be evaluated and monitored as well. Using BacDive’s advanced search and BacDiveR’s bd_retrieve_by_search(), several examples of geographic inconsistencies have been found. Presumably due to an overly strict location-to-country-to-continent mapping, several samples collected from seas neighbouring Russia (like the Sea of Japan) were assigned to Europe.

Two datasets with a geo-logic fault (pun intended)

While one may debate where exactly in Russia the Asian-European border is located, it is clear that Russia’s eastern shoreline lies well within Asia. These and other datasets with East Russian locations have been reported to the BacDive team, and some of them were corrected in BacDive’s 04.07.2018 release.

As long as this specific inconsistency has not been corrected in all datasets, some of the affected ones can still be found and downloaded here.

How to test datasets

If a BacDive user finds an inconsistency within the datasets they use, BacDiveR’s bd_retrieve_by_search() can be used to construct a test case for that problem. In the following example, the test fails as long as BacDive contains datasets with the above-described discrepancy between the geo_loc_name and continent fields.
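A minimal sketch of such a test case, using testthat, might look as follows. It is an illustration under stated assumptions, not the exact test used against BacDive: the helper find_geo_mismatches() and the two sample lists are hypothetical, and the datasets are simplified to flat lists with only a geo_loc_name and a continent field. In real use, the datasets would come from bd_retrieve_by_search() for the corresponding advanced search, which is not reproduced here.

```r
library(testthat)

# Hypothetical helper (not part of BacDiveR): keeps the datasets whose
# location contains `location_pattern` but whose continent is `continent`.
find_geo_mismatches <- function(datasets, location_pattern, continent) {
  Filter(function(d) {
    # isTRUE() guards against datasets that lack a geo_loc_name field
    isTRUE(grepl(location_pattern, d$geo_loc_name, fixed = TRUE)) &&
      identical(d$continent, continent)
  }, datasets)
}

# Made-up stand-ins for downloaded datasets (assumed flat structure,
# not real BacDive entries)
inconsistent <- list(
  list(geo_loc_name = "Russia, Sea of Japan", continent = "Europe"),
  list(geo_loc_name = "Germany, Braunschweig", continent = "Europe")
)
corrected <- list(
  list(geo_loc_name = "Russia, Sea of Japan", continent = "Asia"),
  list(geo_loc_name = "Germany, Braunschweig", continent = "Europe")
)

# Swap in `inconsistent` here to see the test fail
datasets <- corrected

test_that("Sea of Japan samples are not assigned to Europe", {
  expect_length(find_geo_mismatches(datasets, "Sea of Japan", "Europe"), 0)
})
```

Keeping the check in a small function makes it possible to run the same expectation against freshly retrieved datasets whenever a new BacDive release appears.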

Once the inconsistency is corrected in BacDive, the advanced search no longer returns any results, and the above test passes. It can thus be used to monitor the resolution of such a problem after reporting it. Furthermore, the user is alerted (by the test failing again) in case new datasets with the same inconsistency appear in BacDive.

References

See testthat.R-lib.org and the related “R Packages” chapter to learn more about testing in R (Wickham 2011, 2015).

Wickham, Hadley. 2011. “Testthat: Get Started with Testing.” The R Journal 3: 5–10. https://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.

———. 2015. R Packages: Organize, Test, Document, and Share Your Code. 1st edition. Sebastopol, CA: O’Reilly Media. http://r-pkgs.had.co.nz/.