iRODS UGM 2022
Some notes on this year’s User Group Meeting at KU Leuven in Belgium. If an area interests you, I strongly suggest going to the conference page and checking out the slides and videos once released (some, particularly the RENCI ones, are already up on slides.com, so I’ve included links where I know them).
4.3 notes
This was the star of the conference - a work several years in the making, with several significant improvements.
KU Leuven Keynote
Research data network - managing data sharing between Flemish universities et al.
Key quote:
"we must take into account we work for researchers. They are very busy people with a lot of responsibilities"
Technology update
- Logging now done by syslog - finally! Structured messages via JSON
- log level now more discrete per area (e.g. rule engine, or API)
- delay server can migrate between servers
- additional permissions - now covers metadata
  imeta has support for admin flag (and iRODS now supports ADMIN_KW for metadata operations)
- cron functionality - can bring services (e.g. agent factory, delay server) back up (without restarting service)
- GenQuery re-implementation (using flex/bison), support for ORDER BY, grouping, AND and OR on roadmap for 4.3.1/4.3.2
- Jargon 4.3.2.5 - now supports parallel transfer over 1247 (supported directly via consortium, or via Mike Conway?)
- MetalNX - 2.6.0 merged search interfaces, now uses GenQuery.
- Zone Management Tool (ZMT) - 0.2.0 user & resource management, health checks, built on REST API
- NFSRODS - 2.1.0 large file transfer over 1247, comes with docker-compose file
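To show why structured JSON log messages are such a welcome change, here is a minimal Python sketch that filters log lines by area and level. The sample lines and the field names (log_category, log_level, log_message) are assumptions for illustration and should be checked against real server output, as should the idea that syslog puts a plain-text prefix before the JSON object.

```python
import json

# Hypothetical sample lines: a syslog prefix followed by a JSON object.
# The field names are assumptions, not copied from a real server.
SAMPLE_LINES = [
    'Jul  5 10:00:01 host irodsServer[101]: {"log_category": "rule_engine", "log_level": "error", "log_message": "rule failed"}',
    'Jul  5 10:00:02 host irodsServer[101]: {"log_category": "api", "log_level": "info", "log_message": "request ok"}',
    'Jul  5 10:00:03 host cron[7]: not a JSON message',
]


def parse_line(line):
    """Return the JSON payload of a log line as a dict, or None if there is none."""
    start = line.find("{")
    if start == -1:
        return None
    try:
        payload = json.loads(line[start:])
    except json.JSONDecodeError:
        return None
    return payload if isinstance(payload, dict) else None


def filter_messages(lines, category=None, level=None):
    """Yield parsed messages matching the given category and/or level."""
    for line in lines:
        msg = parse_line(line)
        if msg is None:
            continue
        if category is not None and msg.get("log_category") != category:
            continue
        if level is not None and msg.get("log_level") != level:
            continue
        yield msg


if __name__ == "__main__":
    for msg in filter_messages(SAMPLE_LINES, category="rule_engine"):
        print(msg["log_level"], msg["log_message"])
```

With free-text logs the same filtering would need a regular expression per message format; with JSON it is a dictionary lookup.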
Python rule engine plugin
Python3 compliant! Previously was python2.
Indexing capability
- Now using a later elastic client
- tracks iput --metadata & atomic metadata API - use NIEHS schema - adds system metadata along with user defined
Internships
RENCI have several interns this year, and they have been working on:
- refactor rodsServer to modern C++
- live reload for server config, fixing some areas where a restart would be needed
- refactor audit plugin (AMQP) to fix malformed JSON
- Libraries - wrap with higher level APIs
- iRODS Testing Environment Web Application (sounds intriguing!)
Big picture
RENCI want management of iRODS to be about policy design, composition and configuration
Roadmap?
- ‘Console’ refactored icommands as one command line tool
- enhancement to set up script?
Ontoforce (sponsor)
- integrated PubMed, dbSNP, opentargets, clinical trials etc (146 data sets) as reference datasets
- Search public and local datasets and get links to iRODS to get local copy.
- Could connect via Maastricht University iRODS REST API - theoretical, as not yet implemented.
CC-IN2P3
They recently had to move data from Africa by plane and hard disks - these challenges are still with us.
- 26 PB some replication, not all.
- 27 Zones
- 450 million files
Some federated (transatlantic!), most not.
140 PB of data overall, most in mass storage system (tape), some CephFS
- Developers providing applications (Java, C++ APIs) that hide the iRODS interface.
- WebDAV becoming more popular.
- Users are increasingly asking for metadata.
- They have problems with ’thundering herd’ of end users - hundreds of downloads, but infrastructure appears to cope!
- 2 HA physical providers for ALL zones! Using ccirods DNS alias - downtime by removing entry from alias.
- 21 consumers
- Oracle RAC
- still migrating from 3.3.1 in some cases!
- No VMs, no containers!
Max Planck
iRODS for a year so far.
MRI → iRODS
MR data system (aka MrData!) integrates iRODS and makes use of
- Metalnx
- rclone via webdav
- python iRODS client
They found that Cyberduck had caching issues.
Castellum - human subject metadata management
Move data one way to scientific metadata
infra
- did use docker compose, now use Ansible docker module
- more robust than compose, easier than K8S
- some CI/CD being worked on via above deployment into a VM
Programmable authentication workflows, SURF
- Bridge between identity and service providers - some collaborators want to bring their own ID.
- Before this project, iRODS was PAM-enabled, but only the standard user interaction was supported.
I didn’t know there was a Python pam_python.so module!
iRODS data back end for LEXIS
Integrates an organization's HPC with a larger group - HPC-As-A-Service is an ideal, can speak to local HPC, OpenStack, and k8s.
HEAppE middleware. heaapu.org?
Two iRODS Zones, one backed by spectrum scale GPFS, the other Ceph.
Managing high-throughput sequencing and other -omics data with RODEOS and rodeos-ingest at Berlin Institute of Health at Charité
Very similar to the Wellcome Sanger Institute, they do automated data ingest and metadata extraction from Illumina (only).
Landing zone for sequencer(s) → iRODS
Using the automated ingest plugin plus extensions to handle Illumina files (rodeos-ingest): RunInfo.xml & RunParameters.xml get included as metadata!
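To make the idea of turning RunInfo.xml into metadata concrete, here is a small Python sketch that flattens a few fields into attribute/value pairs. The sample XML is trimmed and made up, and the "illumina::" attribute prefix and the choice of fields are my own assumptions; this is not how rodeos-ingest itself maps the file.

```python
import xml.etree.ElementTree as ET

# A trimmed, made-up RunInfo.xml; real files carry more elements.
SAMPLE_RUNINFO = """<RunInfo Version="2">
  <Run Id="220705_M00001_0042_000000000-ABCDE" Number="42">
    <Flowcell>000000000-ABCDE</Flowcell>
    <Instrument>M00001</Instrument>
    <Date>220705</Date>
    <Reads>
      <Read Number="1" NumCycles="151" IsIndexedRead="N" />
      <Read Number="2" NumCycles="8" IsIndexedRead="Y" />
    </Reads>
  </Run>
</RunInfo>
"""


def runinfo_to_avus(xml_text, prefix="illumina::"):
    """Flatten selected RunInfo.xml fields into (attribute, value) pairs."""
    root = ET.fromstring(xml_text)
    run = root.find("Run")
    if run is None:
        raise ValueError("no <Run> element found")
    avus = [
        (prefix + "run_id", run.get("Id", "")),
        (prefix + "run_number", run.get("Number", "")),
    ]
    # Simple child elements become one attribute each.
    for tag in ("Flowcell", "Instrument", "Date"):
        value = run.findtext(tag)
        if value is not None:
            avus.append((prefix + tag.lower(), value.strip()))
    # Each read contributes its cycle count and index flag.
    for read in run.findall("Reads/Read"):
        n = read.get("Number")
        avus.append((f"{prefix}read{n}_cycles", read.get("NumCycles", "")))
        avus.append((f"{prefix}read{n}_is_index", read.get("IsIndexedRead", "")))
    return avus


if __name__ == "__main__":
    for attribute, value in runinfo_to_avus(SAMPLE_RUNINFO):
        print(attribute, "=", value)
```

The resulting pairs could then be attached to the run's collection with whichever client is in use.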
Experimenting with MetalNX. Using WebDAV, icommands, irsync.
Landing zones are CephFS with a Samba Gateway. iRODS data is held on CephFS backend.
Future is ingesting of mass spec files, but metadata formats not well-defined.
Fujifilm Object Archive (sponsor)
They also do backup tapes etc!
S3 compatible API looks like Glacier.
iRODS integration - v2.4.11
The software runs on a Red Hat 8 server and supports Oracle (and other) tape libraries.
Data management environment at the National Cancer Institute
- Lots of copies of data and they diverge over time.
- Insufficient provenance.
- Most important is that the data is always available.
- Control over the data - users want to be able to share it themselves, not have IT as a gatekeeper.
- Archiving into S3 & Glacier via Cloudian
iRODS as an object store for the Galaxy platform
Python iRODS client issues found:
- long running connections dropped, so they had to code a process that recreates the session every 5 mins.
- Found a bug causing slow downloads/uploads with the client, and got fixes into python-irodsclient
- now client is more performant than icommands!
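The session-recreation workaround above can be sketched in a few lines of Python. This is my own guess at the shape of such a wrapper, not the Galaxy team's code: the class name RefreshingSession and its parameters are hypothetical, and the factory is any callable that builds a new session.

```python
import time


class RefreshingSession:
    """Hand out a session object, rebuilding it once it is older than max_age."""

    def __init__(self, factory, max_age=300.0, clock=time.monotonic):
        self._factory = factory  # callable returning a new session
        self._max_age = max_age  # seconds; 300 matches "every 5 mins"
        self._clock = clock      # injectable so the logic can be tested
        self._session = None
        self._created = 0.0

    def get(self):
        """Return the current session, replacing it first if it is too old."""
        now = self._clock()
        if self._session is None or now - self._created >= self._max_age:
            self._close()
            self._session = self._factory()
            self._created = now
        return self._session

    def _close(self):
        # Clean up the old session if it offers a cleanup() method.
        old, self._session = self._session, None
        cleanup = getattr(old, "cleanup", None)
        if callable(cleanup):
            cleanup()
```

With python-irodsclient the factory would be something like a lambda that constructs an iRODSSession with your own host, port, user, password and zone, and callers would use get() each time they need the session rather than holding on to one for hours.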
iRODS Delay Queue
A way to move the delay server role between providers, and to have multiple servers act as delay rule queue processors.
Too much to summarise, read the presentation - excellent.
Hands-free migration in case of disaster!
Can be manually migrated:
iadmin get_delay_server_info
iadmin set_delay_server <hostname>
One can run delay server on a consumer, but requires database credentials to be in server_config.json
At the moment, every server will ask database ‘am I the leader/successor’ every 5 seconds!
irods-grid status will show delay server PID on server it’s run on.
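To picture the ‘am I the leader/successor’ polling, here is a deliberately simplified Python model of the hand-over. It is my own reading of the talk, not the server's actual logic: the function names, the returned action strings and the shared state dictionary (standing in for the catalog) are all hypothetical.

```python
def next_action(me, leader, successor):
    """Decide what one server should do after reading leader/successor.

    Returns one of: "run", "step_down", "promote", "idle".
    """
    if leader == me:
        # A different successor has been named: hand over.
        if successor and successor != me:
            return "step_down"
        return "run"
    if successor == me and not leader:
        # The old leader has stepped down (or none existed): take over.
        return "promote"
    return "idle"


def poll_once(state, hosts):
    """Apply one polling round for every host against a shared state dict."""
    for host in hosts:
        action = next_action(host, state["leader"], state["successor"])
        if action == "step_down":
            state["leader"] = ""
        elif action == "promote":
            state["leader"] = host
            state["successor"] = ""
    return state


if __name__ == "__main__":
    state = {"leader": "a", "successor": ""}
    state["successor"] = "b"  # roughly what naming a new delay server does
    print(poll_once(state, ["a", "b"]))
```

In this model the old leader gives up the role before the successor claims it, so there is never more than one leader at a time; the cost is that every server has to keep polling.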
Towards the FAIRification of lab-data
Tools to ingest data from instruments of any kind.
Initial solution called panacea. Written in C++ for speed, with R and Python bindings, perhaps REST API in future. PhD student project, however RENCI might also be contributing, AIUI?
S3 Resource Plugin Glacier support
Tested on Amazon and Fujifilm systems.
Upload is by setting a flag, download is async except to ‘instant retrieval’ classes.
If not in cache, returns REPLICA_IS_BEING_STAGED, so you would need to try again.
Libs3 didn’t support Glacier, so they forked it and will submit an MR, but there is no timescale for this.
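Since a download of an object that is not yet in the cache comes back with REPLICA_IS_BEING_STAGED, clients need some kind of retry loop. Below is a minimal Python sketch of one with increasing waits. The ReplicaBeingStaged exception is a hypothetical stand-in for however your client surfaces that error, and the delays are placeholders: real archive retrievals can take far longer than a few seconds.

```python
import time


class ReplicaBeingStaged(Exception):
    """Hypothetical stand-in for the REPLICA_IS_BEING_STAGED error."""


def fetch_with_retry(fetch, attempts=5, first_delay=1.0, factor=2.0, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting longer after each staging error."""
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except ReplicaBeingStaged:
            if attempt == attempts:
                # Out of attempts: let the caller see the staging error.
                raise
            sleep(delay)
            delay *= factor
```

Passing sleep in as a parameter keeps the loop easy to test and lets a caller swap in a scheduler instead of blocking.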
Python/PRC based portal and tools for active data support in research contexts
Reimplemented MetalNX (groan) but with metadata schema support including editing the schema. If I understand correctly, one can click in the UI to extract metadata from a supported object type. Including image search text recognition.
It uses OpenSearch/SOLR, flask, and as many mainstream open source components as possible rather than reinventing them. If only more projects did this!
Roadmap:
- integration with Globus
- dataset/metadata ‘packages’
- tar with metadata in JSON schema
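The ‘tar with metadata’ idea on the roadmap could look something like the following Python sketch. The layout (a metadata.json member next to the data files) and the function names are my own assumptions, not the project's format, and no schema validation is attempted here.

```python
import io
import json
import tarfile


def build_package(files, metadata):
    """Return the bytes of a tar archive holding the files plus metadata.json.

    files: dict mapping archive path -> bytes content
    metadata: JSON-serialisable dict describing the dataset
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        members = dict(files)
        members["metadata.json"] = json.dumps(
            metadata, indent=2, sort_keys=True
        ).encode("utf-8")
        for name, content in sorted(members.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


def read_metadata(package_bytes):
    """Extract and parse metadata.json from a package built above."""
    with tarfile.open(fileobj=io.BytesIO(package_bytes), mode="r") as tar:
        member = tar.extractfile("metadata.json")
        return json.load(member)


if __name__ == "__main__":
    pkg = build_package({"data/a.txt": b"hello"}, {"title": "demo"})
    print(read_metadata(pkg))
```

Keeping the metadata inside the archive means the package stays self-describing when it leaves iRODS.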
iRODS Testing Environment
RENCI have built the iRODS testing environment and it’s available for all.
iRODS development environment slides
- build environment
- normal build time 8 hours, can bring it down to 3
- testing environment that can take input from the build environment
- deploy long-running zones
- deploy federated zones
- harness to run federated tests
Future plans:
- CI integration
- web app
- environment reproduction via zone report
- orchestration and agnosticism
Python iRODS Client Library
https://slides.com/irods/ugm2022-python-irods-client-114
iRODS Build and Packaging Update
https://slides.com/irods/ugm2022-irods-build-and-packaging-update
Current status:
- packaged with CPack
- externals packaged with fpm
- libraries in /usr/lib no matter what the distro expects
This means they can’t provide Debian/rpm source packages
no package linting or pbuild
’lazy but sufficient is neither’
- no systemd unit files
- all of the above
- deps supplied manually, not using dpkg-shlibdeps
- new OS a lot of work
- shifting to using dpkg-buildpackage, rpmbuild and git-buildpackage
- Not providing externals if the OS already has a comparable package
- providing systemd default unit!
A lot of work, needs to be done all at once, not sure when it will be ready.
iRODS CSI driver
K8S Container Storage Interface (CSI) driver for iRODS
- auto creates output DIR for analysis in iRODS
- also shares community shared datasets and users’ home directories
- Adds 20 seconds to pod start up (compared to copying all the data before/after)