Deposition mechanism
First of all, please note that this repository is intended for institutions only, not for private collectors. If you represent an institution, contact the maintainer and request to join. We will help you with JACQ/GBIF communication and prepare your account in the repository system. The rest of the process is described below.
Image Upload Procedure
You will receive access credentials for remote S3 storage from us, together with a bucket (“folder”) named herbarium-XYZ, where XYZ is your international acronym according to the Index Herbariorum. If you have not heard of S3, think of it as a remote disk, or something similar to an FTP server. A list of supported clients and instructions can be found in the CESNET Object Storage S3 documentation; for Linux users, CrossFTP might be an option.
Once everything is set up, upload your image files to the provided S3 bucket. The upload can run overnight or in the background if needed; you will find the frequency and strategy that best fit your digitization process.
After the upload is complete, use the web interface of the repository to provide a few additional metadata fields that apply to all images in the batch (all uploaded files are treated as a single batch). Once this step is completed and you confirm the import, processing of the images begins. During processing, files are removed from your S3 storage, transferred to our internal system, checked, and stored.
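If you prefer scripting over a graphical client, the upload step could look roughly like the sketch below. It assumes the third-party boto3 library is installed; the endpoint URL, the credentials, and the flat key layout (one object per file, named after the file) are placeholders and assumptions, so use the values you actually receive from us.

```python
from pathlib import Path


def list_tiffs(folder):
    """Return the .tif files in a folder, sorted by name."""
    return sorted(Path(folder).glob("*.tif"))


def upload_batch(folder, bucket, endpoint_url, access_key, secret_key):
    """Upload every .tif file from `folder` into the given S3 bucket.

    `bucket` is the herbarium-XYZ bucket name received with the credentials;
    all other arguments are placeholders for values supplied by the maintainer.
    """
    import boto3  # third-party; imported here so list_tiffs works without it

    client = boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
    uploaded = []
    for path in list_tiffs(folder):
        # Assumption: objects are stored flat, keyed by the bare filename.
        client.upload_file(str(path), bucket, path.name)
        uploaded.append(path.name)
    return uploaded
```

Such a script can be left running overnight; the import itself still has to be confirmed in the web interface afterwards.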
Image Requirements
Format: TIFF (.tif), preferably the original, unedited version. It should retain EXIF metadata (e.g., information about the camera or scanner). Correct DPI values are desirable but not mandatory.
Barcode: Each image must contain a machine-readable barcode that matches an agreed pattern. The exact pattern for your institution will be arranged individually.
If no barcode is found in the image, a filenameFallback option can be enabled. In this case, the system will derive the specimen identifier from the filename, according to a pre-agreed pattern tailored to your institution, as above.
Multiple Barcodes (Multiplier): If multiple barcodes are detected in a single image, the system throws an error. If you enable the multiplier option for a batch, the system will i) require more than one valid barcode to be present (to make the process more predictable) and ii) duplicate the image for each corresponding ID rather than reporting an error and requiring manual ID input.
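The barcode rules above can be summarized in a short sketch. This is not the repository's actual code: the function name, the error class, and the example patterns are hypothetical, and details the text does not specify (filtering barcodes by the pattern before counting them, using a named group "id" in the filename pattern, ignoring the filename when multiplier is on) are assumptions.

```python
import re
from pathlib import PurePath


class SpecimenIdError(ValueError):
    """Raised when no unambiguous specimen ID can be derived."""


def resolve_specimen_ids(barcodes, filename, barcode_pattern, filename_pattern,
                         filename_fallback=False, multiplier=False):
    """Return the list of specimen IDs an image would be attached to."""
    # Assumption: only barcodes matching the agreed pattern are counted.
    valid = [b for b in barcodes if re.fullmatch(barcode_pattern, b)]

    if multiplier:
        # Multiplier mode requires more than one valid barcode.
        if len(valid) < 2:
            raise SpecimenIdError("multiplier needs more than one valid barcode")
        return valid

    if len(valid) > 1:
        raise SpecimenIdError("multiple barcodes detected")
    if len(valid) == 1:
        return valid

    if filename_fallback:
        # Assumption: the filename pattern exposes the ID as a group named "id".
        match = re.fullmatch(filename_pattern, PurePath(filename).stem)
        if match:
            return [match.group("id")]
    raise SpecimenIdError("no specimen ID found")


# Hypothetical patterns for a herbarium with the acronym XYZ.
BARCODE = r"XYZ-\d{6}"
FILENAME = r"(?P<id>XYZ-\d{6})(?:_\d+)?"
```

For example, under these hypothetical patterns a file named XYZ-000123_2.tif without a readable barcode would resolve to XYZ-000123 only when filenameFallback is enabled.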
Validity checks
Besides identifying the specimen ID and verifying that it fits the herbarium the user works in, the system checks for:
- file size larger than 5 MB, to prevent obvious thumbnails from being imported. No upper limit is set, but the general expectation is that a single image should not exceed 600 MB. Lossless LZW compression is recommended.
- uniqueness within the scope of an individual specimen (you cannot upload an identical image to a single specimen twice, but an identical photo can be uploaded to different specimen IDs)
- existence of a PID at an external authority that holds information about the specimen (taxon, locality, etc.). The repository itself does not store these data; it takes them over from, e.g., JACQ. If the PID is missing or cannot be detected in JACQ, you can still import the photo, but you cannot publish it.
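The first two checks can be approximated locally before uploading. The sketch below is only an illustration: how the repository itself detects identical images is not described here, so comparing SHA-256 digests is an assumption, as is counting 1 MB as 1,000,000 bytes. All names are hypothetical.

```python
import hashlib

MIN_SIZE = 5 * 1000 * 1000         # files must be larger than this
SOFT_MAX_SIZE = 600 * 1000 * 1000  # expectation only, not an enforced quota


def check_size(size_bytes):
    """Classify a file size as 'too-small', 'unusually-large' or 'ok'."""
    if size_bytes <= MIN_SIZE:
        return "too-small"
    if size_bytes > SOFT_MAX_SIZE:
        return "unusually-large"
    return "ok"


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(items):
    """Find identical images assigned to the same specimen.

    `items` is an iterable of (specimen_id, path) pairs. Returns a list of
    (duplicate_path, first_path) pairs. The same image under two different
    specimen IDs is not reported.
    """
    seen = {}
    duplicates = []
    for specimen_id, path in items:
        key = (specimen_id, file_digest(path))
        if key in seen:
            duplicates.append((path, seen[key]))
        else:
            seen[key] = path
    return duplicates
```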
Specimen ID requirements
There are two levels of specimen validity check with several options:
- system-wide check - the specimen ID can be any string, but the only whitespace character allowed is the regular space. This space cannot be used at the beginning or end of the ID, or between two digits. Consecutive spaces are not allowed. Since the repository assigns ARK PIDs to specimens, and these PIDs are used as URLs, ARKs are built from a webalized form of the specimen ID (all non-alphanumeric characters are replaced with dashes) to prevent confusion and misbehavior. In very specific (or rather obscure) situations, this can lead to two independent specimens of a herbarium being merged into a single ARK.
- herbarium-level check - an individual pattern (varying in content and strictness), a so-called “regular expression”, is set in cooperation between the herbarium curator and the repository administrator. This pattern is used to check the validity of the ID and can be changed in the future (e.g., to allow processing of another batch of images). There is one regular expression for barcodes detected in the image and one for the filename.
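As an illustration of the system-wide rules, here is a minimal sketch. The function names are hypothetical, and the webalization details are assumptions: each non-alphanumeric character is replaced by exactly one dash, only ASCII letters and digits count as alphanumeric, and letter case is kept.

```python
import re


def is_valid_system_wide(specimen_id):
    """Apply the whitespace rules of the system-wide check."""
    # The regular space is the only whitespace character allowed.
    if any(ch.isspace() and ch != " " for ch in specimen_id):
        return False
    # No space at the beginning or end of the ID.
    if specimen_id.startswith(" ") or specimen_id.endswith(" "):
        return False
    # No consecutive spaces.
    if "  " in specimen_id:
        return False
    # No space between two digits.
    if re.search(r"\d \d", specimen_id):
        return False
    return True


def webalize(specimen_id):
    """Replace every non-alphanumeric character with a dash (assumed form)."""
    return re.sub(r"[^A-Za-z0-9]", "-", specimen_id)
```

Under these assumptions, the two distinct IDs "XYZ 123/a" and "XYZ-123-a" both webalize to "XYZ-123-a", which is the kind of collision mentioned above.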
Options (set on/off by the curator in the web interface):
- filenameFallback - if no barcode is found in the image, the system will derive the specimen identifier from the filename. OFF in the default configuration (= the filename has no relevance to specimen identification by default).
- multiplier - if multiple barcodes are detected in a single image, the system will duplicate the image for each corresponding ID rather than reporting an error and requiring manual ID input. OFF in the default configuration (= no multiplying by default).
- acronymMustBePresent - the international herbarium acronym is expected at the start of the specimen ID. For regional museums, which often have IDs like “B 00123”, turning this option off relaxes that expectation. ON in the default configuration (= the international acronym is required).
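The three options and their defaults, together with the herbarium-level check, might be modelled as in the sketch below. The class, the function, and the example patterns are hypothetical. Matching the whole ID against the pattern and comparing the acronym case-sensitively are assumptions; the real patterns are agreed individually with the administrator.

```python
import re
from dataclasses import dataclass


@dataclass
class HerbariumOptions:
    """Curator-controlled switches with the defaults stated above."""
    filename_fallback: bool = False       # filenameFallback: OFF by default
    multiplier: bool = False              # multiplier: OFF by default
    acronym_must_be_present: bool = True  # acronymMustBePresent: ON by default


def passes_herbarium_check(specimen_id, acronym, pattern, options):
    """Check an ID against the acronym rule and the herbarium's pattern."""
    if options.acronym_must_be_present and not specimen_id.startswith(acronym):
        return False
    # Assumption: the entire ID must match the pattern.
    return re.fullmatch(pattern, specimen_id) is not None
```

With the defaults, an ID like "B 00123" fails for a herbarium whose acronym is XYZ; with acronymMustBePresent turned off and a suitable pattern such as `[A-Z] \d{5}`, it would pass.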