iRODS UGM 2022

Some notes on this year’s User Group Meeting at KU Leuven in Belgium. If an area interests you, I strongly suggest going to the conference page and checking out the slides and videos once they are released (some, particularly the RENCI ones, are already up on slides.com, so I’ve included links where I know them).

4.3 notes

This was the star of the conference - a work several years in the making, with several significant improvements.

KU Leuven Keynote

Research data network - managing data sharing between Flemish universities et al.

Key quote:

"we must take into account we work for researchers. They are very busy people with a lot of responsibilities"

Technology update

  • Logging is now done by syslog - finally! Structured messages via JSON
  • log levels can now be set separately per area (e.g. rule engine or API)
  • the delay server role can migrate between servers
  • additional permissions - now covers metadata
  • imeta has support for admin flag (and iRODS now supports ADMIN_KW for metadata operations)
  • cron functionality - can bring services (e.g. agent factory, delay server) back up (without restarting service)
  • GenQuery re-implementation (using flex/bison), support for ORDER BY, grouping, AND and OR on roadmap for 4.3.1/4.3.2
  • Jargon 4.3.2.5 - now supports parallel transfer over 1247 (supported directly via consortium, or via Mike Conway?)
  • MetalNX - 2.6.0 merged search interfaces, now uses GenQuery.
  • Zone Management Tool (ZMT) - 0.2.0 user & resource management, health checks, built on REST API
  • NFSRODS - 2.1.0 large file transfer over 1247, comes with docker-compose file
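
Structured JSON log messages (first bullet above) are easy to filter with a few lines of code. Here is a minimal sketch, assuming each syslog line carries a JSON object after the usual syslog prefix. The field names (`log_category`, `log_level`, `log_message`) and the sample lines are my own illustrative assumptions, not something shown in the talk.

```python
import json


def parse_log_line(line):
    """Return the JSON payload of a syslog line as a dict, or None if absent."""
    start = line.find("{")
    if start == -1:
        return None
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        return None


def filter_by_category(lines, category):
    """Keep only the parsed entries whose (assumed) log_category matches."""
    entries = (parse_log_line(line) for line in lines)
    return [e for e in entries if e and e.get("log_category") == category]


# Made-up sample lines; real messages will carry more fields.
SAMPLE = [
    'Jul  5 10:00:01 host irodsServer[123]: {"log_category": "rule_engine", '
    '"log_level": "info", "log_message": "rule fired"}',
    'Jul  5 10:00:02 host irodsServer[123]: {"log_category": "api", '
    '"log_level": "error", "log_message": "bad request"}',
    "Jul  5 10:00:03 host cron[9]: plain text line",
]

for entry in filter_by_category(SAMPLE, "api"):
    print(entry["log_level"], entry["log_message"])
```

Lines without a JSON payload are simply skipped, which is handy when the syslog file is shared with other services.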

Python rule engine plugin

Python 3 compatible! Previously it was Python 2 only.

Indexing capability

  • Now using a later elastic client
  • tracks iput --metadata & atomic metadata API
  • uses the NIEHS schema - adds system metadata along with user-defined metadata

Internships

RENCI have several interns this year, and they have been working on:

  • refactoring rodsServer to modern C++
  • live reload for server config, fixing some areas where a restart would be needed
  • refactoring the audit plugin (AMQP) to fix malformed JSON
  • Libraries - wrapping them with higher-level APIs
  • iRODS Testing Environment Web Application (sounds intriguing!)

Big picture

RENCI want management of iRODS to be about policy design, composition and configuration

Roadmap?

  • ‘Console’ refactored icommands as one command line tool
  • enhancement to set up script?

Ontoforce (sponsor)

Ontoforce

  • integrated PubMed, dbSNP, opentargets, clinical trials etc (146 data sets) as reference datasets
  • Search public and local datasets and get links to iRODS to get local copy.
  • Could connect via Maastricht University iRODS REST API - theoretical, as not yet implemented.

CC-IN2P3

They recently had to move data from Africa by plane on hard disks - these challenges are still with us.

  • 26 PB, some of it replicated, but not all
  • 27 Zones
  • 450 million files

Some federated (transatlantic!), most not.

140 PB of data overall, most in mass storage system (tape), some CephFS

  • Developers providing applications (Java, C++ APIs) that hide the iRODS interface.
  • WebDAV becoming more popular.
  • Users are increasingly asking for metadata.
  • They have problems with ’thundering herd’ of end users - hundreds of downloads, but infrastructure appears to cope!
  • 2 HA physical providers for ALL zones! Using ccirods DNS alias - downtime by removing entry from alias.
    • 21 consumers
    • Oracle RAC
  • still migrating from 3.3.1 in some cases!
  • No VMs, no containers!

Max Planck

iRODS for a year so far.

MRI → iRODS

MR data system (aka MrData!) integrates iRODS and makes use of

  • Metalnx
  • rclone via WebDAV
  • python iRODS client

They found that Cyberduck had caching issues.

Castellum - human subject metadata management

Castellum

Move data one way to scientific metadata

infra

  • did use docker compose, now use Ansible docker module
  • more robust than compose, easier than K8S
  • some CI/CD being worked on via above deployment into a VM

Programmable authentication workflows, SURF

  • Bridge between identity and service providers - some collaborators want to bring their own ID.
  • Before this project, iRODS was PAM-enabled, but only the standard user interaction was supported.

I didn’t know there was a Python pam_python.so module!

iRODS data back end for LEXIS

Integrates an organization’s HPC with a larger group. HPC-as-a-Service is the ideal; it can speak to local HPC, OpenStack, and k8s.

HEAppE middleware. heaapu.org?

Two iRODS Zones, one backed by Spectrum Scale (GPFS), the other by Ceph.

lexis-project

Managing high-throughput sequencing and other -omics data with RODEOS and rodeos-ingest at Berlin Institute of Health at Charité

Very similar to the Wellcome Sanger Institute, they do automated data ingest and metadata extraction from Illumina (only).

Landing zone for sequencer(s) → iRODS

try it out in docker

Using the automated ingest plugin plus extensions to handle Illumina files (rodeos-ingest). RunInfo.xml and RunParameters.xml get included as metadata!

Experimenting with MetalNX. Using WebDAV, icommands, irsync.
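
To give a feel for what "included as metadata" could mean, here is a minimal sketch that flattens the `Run` element of a RunInfo.xml-like document into attribute/value pairs. This is not the rodeos-ingest code: the sample XML is made up, and the function name and the `run_info::` attribute prefix are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample shaped like an Illumina RunInfo.xml; real files have more fields.
SAMPLE_RUN_INFO = """<RunInfo>
  <Run Id="220705_EXAMPLE_0001_FLOWCELL" Number="1">
    <Flowcell>FLOWCELL</Flowcell>
    <Instrument>EXAMPLE</Instrument>
    <Date>220705</Date>
  </Run>
</RunInfo>
"""


def run_info_to_avus(xml_text, prefix="run_info"):
    """Flatten the Run element into (attribute, value) pairs."""
    root = ET.fromstring(xml_text)
    run = root.find("Run")
    if run is None:
        return []
    # XML attributes of <Run> first, in sorted order for stable output.
    avus = [
        (f"{prefix}::{name.lower()}", value)
        for name, value in sorted(run.attrib.items())
    ]
    # Then direct children that carry plain text (nested elements are skipped).
    for child in run:
        text = (child.text or "").strip()
        if text:
            avus.append((f"{prefix}::{child.tag.lower()}", text))
    return avus


for attribute, value in run_info_to_avus(SAMPLE_RUN_INFO):
    print(attribute, "=", value)
```

The resulting pairs could then be attached to the run's collection with whichever client is in use.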

Landing zones are CephFS with a Samba Gateway. iRODS data is held on CephFS backend.

Future work is ingesting mass spec files, but the metadata formats are not well defined.

Fujifilm Object Archive (sponsor)

They also do backup tapes etc!

S3 compatible API looks like Glacier.

iRODS integration - v2.4.11

The software runs on a Red Hat 8 server and supports Oracle (and other) tape libraries.

Data management environment at the National Cancer Institute

  • Lots of copies of data and they diverge over time.
  • Insufficient provenance.
  • Most important is that the data is always available.
  • Control over the data - users want to be able to share it themselves, not have IT as a gatekeeper.
  • Archiving into S3 & Glacier via Cloudian

iRODS as an object store for the Galaxy platform

Galaxy Project

Python iRODS client issues found:

  • long-running connections were dropped, so they had to code a process that recreates the session every 5 minutes.
  • Found a bug in slow downloads/uploads with client, got fixes in python-irodsclient
  • now client is more performant than icommands!
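
The "recreate the session every 5 minutes" workaround can be sketched as a small wrapper. This is not Galaxy's code, just a minimal illustration under my own assumptions: `RefreshingSession` is a hypothetical name, and the factory is any callable that builds a new session (in real use it would construct a python-irodsclient `iRODSSession`, whose `cleanup()` method is called here if present).

```python
import time


class RefreshingSession:
    """Hand out a session object, rebuilding it once it is older than max_age."""

    def __init__(self, factory, max_age_seconds=300, clock=time.monotonic):
        self._factory = factory          # callable returning a fresh session
        self._max_age = max_age_seconds  # 300 s = the 5 minutes from the talk
        self._clock = clock              # injectable so it can be tested
        self._session = None
        self._created_at = None

    def get(self):
        now = self._clock()
        expired = (
            self._session is None
            or now - self._created_at >= self._max_age
        )
        if expired:
            self._close()
            self._session = self._factory()
            self._created_at = now
        return self._session

    def _close(self):
        # Release the old session's connections if it offers cleanup().
        cleanup = getattr(self._session, "cleanup", None)
        if callable(cleanup):
            cleanup()
        self._session = None


# Usage with a stand-in factory; swap in a real session constructor.
sessions = RefreshingSession(factory=object)
first = sessions.get()
print(first is sessions.get())  # same object until the max age is reached
```

Callers ask for `sessions.get()` before each operation instead of holding on to one long-lived session.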

iRODS Delay Queue

A way to move the delay server role between providers, and to have multiple servers process the delay rule queue.

Too much to summarise, read the presentation - excellent.

Hands-free migration in case of disaster!

Can be manually migrated:

iadmin get_delay_server_info
iadmin set_delay_server <hostname>

One can run delay server on a consumer, but requires database credentials to be in server_config.json

At the moment, every server will ask database ‘am I the leader/successor’ every 5 seconds!
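
My mental model of that polling, as a toy sketch: each server compares its own hostname with the leader and successor recorded in the database and decides whether to start or stop its delay server. This is a simplified model written for these notes, not the actual server logic; the function name and the action strings are made up.

```python
def poll_once(my_host, leader, successor, running):
    """One polling cycle for one server.

    Returns (action, new_leader, new_successor), where action is one of
    "start", "stop" or "idle". Empty strings mean "not set".
    """
    if successor:
        if my_host == leader:
            # A migration was requested: the leader steps down and
            # releases leadership.
            return ("stop" if running else "idle"), "", successor
        if my_host == successor and not leader:
            # Leadership is free: the successor takes over.
            return "start", my_host, ""
        return ("stop" if running else "idle"), leader, successor
    if my_host == leader:
        return ("idle" if running else "start"), leader, successor
    return ("stop" if running else "idle"), leader, successor


# Walk through a migration from host "a" to host "b".
leader, successor = "a", "b"
action, leader, successor = poll_once("a", leader, successor, running=True)
print("a:", action)  # the old leader stops
action, leader, successor = poll_once("b", leader, successor, running=False)
print("b:", action)  # the successor starts
print("leader is now", leader)
```

In this model a handover takes at most two polling cycles, which is where the polling interval matters.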

irods-grid status will show delay server PID on server it’s run on.

Towards the FAIRification of lab-data

Tools to ingest data from instruments of any kind.

Initial solution called panacea. Written in C++ for speed, with R and Python bindings, perhaps a REST API in future. PhD student project, however RENCI might also be contributing, AIUI?

fairelabs

S3 Resource Plugin Glacier support

Tested on Amazon and Fujifilm systems.

Upload is by setting a flag, download is async except for ‘instant retrieval’ classes. If the object is not in cache, the server returns REPLICA_IS_BEING_STAGED, so you would need to try again.

Libs3 didn’t support Glacier, so they forked it and will submit an MR, but there is no timescale for this.
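
On the client side, "try again" could look like the retry loop below. It is a minimal sketch under my own assumptions: `ReplicaBeingStaged` is a hypothetical stand-in for however your client surfaces REPLICA_IS_BEING_STAGED, and `fetch` is any callable that attempts the download.

```python
import time


class ReplicaBeingStaged(Exception):
    """Hypothetical stand-in for a REPLICA_IS_BEING_STAGED client error."""


def fetch_with_retry(fetch, attempts=5, delay_seconds=60.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts are used up."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except ReplicaBeingStaged:
            if attempt == attempts:
                raise  # still being staged after the last attempt
            sleep(delay_seconds)


# Demo with a fake download that is "staged" on the third call.
calls = {"count": 0}


def fake_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ReplicaBeingStaged()
    return b"data"


print(fetch_with_retry(fake_fetch, sleep=lambda seconds: None))
```

Glacier retrievals can take hours depending on the storage class, so a real delay would likely be much longer than a minute.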

Python/PRC based portal and tools for active data support in research contexts

Reimplemented MetalNX (groan) but with metadata schema support including editing the schema. If I understand correctly, one can click in the UI to extract metadata from a supported object type. Including image search text recognition.

It uses OpenSearch/SOLR, flask, and as many mainstream open source components as possible rather than reinventing them. If only more projects did this!

Roadmap:

  • integration with Globus
  • dataset/metadata ‘packages’
  • tar with metadata in JSON schema
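
I don't know what their package format will look like, but the "tar with metadata" idea is simple enough to sketch: a tar archive holding the data files plus one JSON document describing them. The `metadata.json` name and both function names are my own assumptions, and this sketch writes plain JSON without validating it against any schema.

```python
import io
import json
import tarfile


def build_package(files, metadata):
    """Bundle files (name -> bytes) and a metadata dict into tar bytes."""
    members = dict(files)
    members["metadata.json"] = json.dumps(
        metadata, indent=2, sort_keys=True
    ).encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def read_metadata(package_bytes):
    """Read the metadata document back out of a package."""
    with tarfile.open(fileobj=io.BytesIO(package_bytes), mode="r") as tar:
        return json.load(tar.extractfile("metadata.json"))


package = build_package(
    {"results/counts.csv": b"gene,count\nabc,1\n"},
    {"title": "example dataset", "files": ["results/counts.csv"]},
)
print(read_metadata(package)["title"])
```

Keeping the metadata inside the archive means the package stays self-describing wherever it is copied.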

iRODS Testing Environment

RENCI have built the iRODS testing environment and it’s available for all.

iRODS development environment slides

iRODS development environment

  • build environment
    • normal build time 8 hours, can bring it down to 3
  • testing environment that can take input from the build environment
    • deploy long-running zones
    • deploy federated zones
    • harness to run federated tests

Future plans:

  • CI integration
  • web app
  • environment reproduction via zone report
  • orchestration and agnosticism

Python iRODS Client Library

https://slides.com/irods/ugm2022-python-irods-client-114

iRODS Build and Packaging Update

https://slides.com/irods/ugm2022-irods-build-and-packaging-update

Current status:

  • packaged with CPack
  • externals packaged with fpm
  • libraries in /usr/lib no matter what the distro expects

This means they can’t provide Debian/RPM source packages, and there is no package linting or pbuild.

‘lazy but sufficient is neither’:

  • no systemd unit files
  • all of the above
  • deps supplied manually, not using dpkg-shlibdeps
  • supporting a new OS is a lot of work

Planned changes:

  • shifting to using dpkg-buildpackage and rpmbuild
  • git-buildpackage
  • not providing externals if the OS already has a comparable package
  • providing a default systemd unit!

A lot of work, needs to be done all at once, not sure when it will be ready.

iRODS CSI driver

Kubernetes Container Storage Interface (CSI) driver for iRODS

  • auto creates output DIR for analysis in iRODS
  • also shares community shared datasets and users homedir
  • Adds 20 seconds to pod start up (compared to copying all the data before/after)