About this Manual

This manual is intended for application programmers who wish to use the APIs described here to integrate the information held in the ORI into their own services and applications.

About the Organisation and Repository Identification (ORI)

The project was developed as a component of the larger UKRepositoryNet+ project under the JISC Repositories and Curation Shared Infrastructure which included other projects such as PIRUS, Repository Junction Broker, SHERPA RoMEO and SHERPA JULIET.

The ORI service provides a standalone middleware tool for identifying academic organisations and their institutional repositories. The service is provided by EDINA who run it as a micro-service.

ORI provides an API giving access to information on circa 25,000 academic organisations, some 14,000 networks that map to those organisations, and the 3,000 repositories that also relate to those organisations.

An important characteristic of the ORI service is that it is a database of organisations which may have a list of repositories as an attribute, as opposed to a database of repositories which have organisations as one of their attributes.

It is an edited union of multiple authoritative and extant data sources, with queries returning:

The results for repositories can be restricted by repository type(s) and/or accepted content type(s). Results are returned in one of a number of formats: JSON (default), XML, or simple text.

A SPARQL query interface to the data is available, and a full dataset can be downloaded as RDF/XML or Turtle Linked Data.

ORI Architecture

The ORI service is built on a range of open source software including Apache with mod_perl, PostgreSQL, and Perl. Data on organisations is gathered regularly by the ORI service from a number of sources using custom CGI scripts written in Perl.

Several documented APIs are provided that support querying of the ORI database by remote applications and the return of data in the requested format: JSON, XML or text. JSON is the default if no format is requested. ORI also provides Linked Data.

Functional Diagram

ORI Functional Diagram

Infrastructure

Details of the infrastructure used for the ORI development and test environment(s) are held by the project software engineer(s). If you have further questions about the service that are not answered by this guide, please contact the EDINA help desk team, or send email directly to edina@ed.ac.uk.

To provide resilience and scale for load, the micro-service runs across two installations of the ORI system, one at each of EDINA's two data centres, with a load balancer service distributing client traffic across these sites.

The Dataset

This section documents the APIs by which clients can query the ORI dataset to retrieve data.

Extent

The ORI dataset is a list of [academic] organisations, with details of networks and repositories associated with them.

The set presently contains data on 25,000 organisations, 3,000 repositories, and 14,000 networks. There are 30,000 URLs and 54,000 names for these objects, so the set is large and growing all the time.

Data Returns

The following data may be returned when the dataset is queried via calls to the APIs.

Organisation Data

org_id The ID for the org (can be used in other API calls)
lat The Latitude held for the organisation
long The Longitude held for the organisation
city The city, or physical location, for the organisation
countrycode The two-letter country-code the organisation is located in (ISO 3166-1 codes)
identities A list of names (and URLs) for the organisation (see below for details)

Data is also pulled in from the identities data. The following are taken from the first identity record:

org_name
org_npri
org_acronym
org_npref
org_iri

From the first matching (else non-matching) URL for the first identity:

org_url
org_upri
org_checked_good
org_date_checked

Repository Data

repo_id The ID for the repository: can be used in other API calls.
lat The Latitude for the repository.
long The Longitude for the repository.
postaddress The address the repository is located at.
countrycode The two-letter country-code the organisation is located in (ISO 3166-1 codes)
oaibaseurl The URL for OAI harvesting.
softwarename What software the repository uses e.g. EPrints, DSpace, flubber, etc.
softwareversion The version of the repository software.
description The main description of the repository.
comment A list of additional comments for the repository.
types A list of the repository's types: institutional, data, etc.
content A list of the content types the repository accepts e.g. Pre-prints, data, etc.
external_ids A list of external IDs e.g. OpenDOAR_123, etc.
language A list of languages used in the repository interface.
sword A list of service document locations for the repository.
identities A list of names and URLs for the organisation. See below for details.

The following are taken from the first identity record.

repo_name
repo_npri
repo_acronym
repo_npref
repo_iri

From the first matching (else non-matching) URL for the first identity.

repo_url
repo_upri
repo_checked_good
repo_date_checked

Network Data

net_id The ID for the network: can be used in other API calls.
inetnum The IP range for the network (123.234.0.0-123.234.63.255).
dec_lower The first IP number of the range (123.234.0.0, from above).
dec_upper The last IP number of the range (123.234.63.255, from above).
identities A list of name(s) for the network: there are no URLs. See below for details.

The following are taken from the first identity record.

net_name
net_npri
net_acronym
net_npref
net_iri
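
As an illustration of how the inetnum, dec_lower and dec_upper fields relate, the sketch below splits an inetnum string into its two ends and checks whether a client address falls inside the range. The function names are hypothetical and are not part of ORI; only Python's standard ipaddress module is used.

```python
import ipaddress


def parse_inetnum(inetnum):
    """Split an inetnum string such as '123.234.0.0-123.234.63.255'
    into (lower, upper) IPv4Address objects."""
    lower_text, upper_text = inetnum.split("-")
    lower = ipaddress.IPv4Address(lower_text.strip())
    upper = ipaddress.IPv4Address(upper_text.strip())
    if lower > upper:
        raise ValueError("lower bound exceeds upper bound: " + inetnum)
    return lower, upper


def in_network(address, inetnum):
    """Return True if address lies within the inetnum range (inclusive)."""
    lower, upper = parse_inetnum(inetnum)
    return lower <= ipaddress.IPv4Address(address) <= upper
```

For example, in_network("123.234.10.1", "123.234.0.0-123.234.63.255") is true, while 123.234.64.0 falls just outside the range.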

Identities

Each entry in the array is a name for the object, with whichever name is defined as "Primary" at the start of the list.

Each identity object contains the following keys (if they exist in the database):

name The name of the object ("Poppleton University", "Plink-Plonk Repository", etc.)
acronym Any acronym the object may be known as ("PU", "PPR", etc.)
npref A true/false flag that indicates whether the name is the preferred term. If the flag is absent, treat the name as preferred: there is no statement that the name is not the preferred term.
pri A true/false flag that indicates if the name is marked as Primary. Again, this flag is not always defined, as there may be only one option, or there may be no definite name that is the primary name.
iri The Open Linked-Data URI to get the linked-data record.
nid The database ID for the name
urls A sub-element containing URL data for the object, as associated with the particular name.

URLs

In the database, there is an association between names and URLs. This is to enable objects to have multi-lingual names, and appropriate URLs for each language (e.g. Ukrainian, Russian, and English).

The urls element contains two keys, 'matching' and 'non_matching', both of which are lists of URL objects:

'urls' => {
            'matching'     => [
                                {....},
                                {....}
                              ],
            'non_matching' => [
                                {....},
                                {....}
                              ]
          }

If a URL is flagged as Primary, it is placed at the front of the appropriate list.

Within each url object, the following data is returned:

url The actual URL.
pri Whether the URL is marked as a primary one.
live A true/false flag to indicate if the URL returns [a non-error] web page.
date The date that the URL was last checked. Note that no history is kept of the alive/not-alive checking. Hosts that are alive are re-checked weekly; hosts that are not flagged as alive are checked daily.
uid The database ID for the URL.
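
The "first matching (else non-matching) URL" rule used for fields such as org_url can be sketched as below. This is a minimal illustration that assumes the data has already been decoded into Python dictionaries and that the key names follow the sample structure above. The function name and the sample record are hypothetical.

```python
def first_url(identity):
    """Return the first 'matching' URL object for an identity, falling back
    to the first 'non_matching' one, or None if the identity has no URLs."""
    urls = identity.get("urls") or {}
    for key in ("matching", "non_matching"):
        candidates = urls.get(key) or []
        if candidates:
            # Primary URLs are placed at the front of each list.
            return candidates[0]
    return None


# Hypothetical sample shaped like the structure described above.
identity = {
    "name": "Poppleton University",
    "urls": {
        "matching": [],
        "non_matching": [
            {"url": "http://www.poppleton.example/", "pri": True, "live": True},
        ],
    },
}
```

Here first_url(identity) falls through the empty matching list and returns the single non_matching entry.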

Main API

The primary contact point for calls to the ORI is http://ori.edina.ac.uk/api.

Data Returns

All APIs return data in the same ways:

  1. You can specify the format either with the Accept header in the HTTP request, or with the format parameter. The format options are 'json', 'xml', or 'text', with 'json' being the default if nothing is specified.
    If there is a callback parameter, and the format is 'json', then a crossDomain package is returned.
  2. All APIs return the data as a nested object, with three top-level elements:
         {
           'message' => {},
           'status'  => 'ok',
           'to'      => 'http://.....'
         }

    status is 'ok' or 'fail', to is the URL that made the query, and message contains the actual data being returned.
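
A client will usually want to check the status element before touching message. The sketch below shows one way to do that for a JSON response body. The names unwrap and OriError are hypothetical, and the sample body is invented to match the three-element envelope described above.

```python
import json


class OriError(Exception):
    """Raised when a response reports a status other than 'ok'."""


def unwrap(body):
    """Decode a JSON response body and return its 'message' element."""
    envelope = json.loads(body)
    if envelope.get("status") != "ok":
        raise OriError("query failed: %s" % envelope.get("to"))
    return envelope.get("message", {})


# Invented sample body with the three top-level elements.
sample = ('{"message": {"net": {}}, "status": "ok", '
          '"to": "http://ori.edina.ac.uk/api?org=2736"}')
```

Calling unwrap(sample) returns the inner message object; a body whose status is fail raises OriError instead.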

Making Calls

Calls are currently made to http://ori.edina.ac.uk/api? and the "locus" for the search can be defined in a number of ways:

  1. You can specify an IP number to base the search on (ip=129.215)
    • If a full quad is not given, then the full range based on what is given is assumed (so 129.215 means 129.215.0.0 to 129.215.255.255)
    • If a range is defined (ie 129.214-129.217) then the upper and lower bands are set accordingly (ie 129.214.0.0-129.217.255.255)
  2. You can specify a geographic location to base the search on (geo=55.95,-3.0)
    • The accuracy for the search depends on the numbers given: the range is always +/- 1 either side of the last decimal place given (so a bounding box of 55.94,-2.9 to 55.96,-3.1)
  3. You can specifically define an organisation ID to fix your search on (org=2736)
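
The expansion rules for partial IP numbers and geographic points can be sketched as follows. This is one reading of the rules above, not the service's own implementation, and the function names are hypothetical. A longitude of -3.0 is used in the example because it is the value that yields the -2.9 to -3.1 band.

```python
from decimal import Decimal


def expand_ip(spec):
    """Expand a partial IP or IP range into (lower, upper) dotted quads."""
    def pad(part, filler):
        octets = [o for o in part.strip().split(".") if o]
        return ".".join(octets + [filler] * (4 - len(octets)))

    low, _, high = spec.partition("-")
    # With no explicit upper band, the lower value defines both ends.
    return pad(low, "0"), pad(high or low, "255")


def geo_bounds(coordinate):
    """Return (low, high): the coordinate +/- 1 in its last decimal place."""
    number = Decimal(coordinate)
    # A step of 1 at the same exponent, e.g. 0.01 for '55.95'.
    step = Decimal((0, (1,), number.as_tuple().exponent))
    return number - step, number + step


def geo_box(geo):
    """Bounding box for a 'lat,long' string: ((lat_lo, lat_hi), (lon_lo, lon_hi))."""
    lat, lon = geo.split(",")
    return geo_bounds(lat.strip()), geo_bounds(lon.strip())
```

Decimal is used rather than float so that the number of decimal places typed by the caller is preserved exactly.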

You can specify multiple locus points; however, how they interact is not yet documented.

In addition to defining the locus for the search, the repositories returned can be tuned to return only those of a certain type, and/or only those that accept particular types of deposits.
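
A request URL that combines a locus with a repository type filter might be assembled as in the sketch below. Only parameters named in this manual (ip, geo, org, type and format) are used. The function name and its keyword arguments are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

API_BASE = "http://ori.edina.ac.uk/api"


def build_query(ip=None, geo=None, org=None, repo_type=None, fmt=None):
    """Build a main API URL from whichever locus and filter values are given."""
    params = []
    for name, value in (("ip", ip), ("geo", geo), ("org", org),
                        ("type", repo_type), ("format", fmt)):
        if value is not None:
            params.append((name, value))
    # Keep the comma in geo values readable rather than percent-encoded.
    return API_BASE + "?" + urlencode(params, safe=",")
```

For instance, build_query(ip="129.215", repo_type=11) restricts a search on the 129.215 range to repositories of type 11.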

Returned Data Object

The data object returned is a set of net objects (indexed by net_id), within which is a list of org objects associated with that network. Within each org object is a list of repo objects. All objects conform to the specification given under Data Returns in The Dataset section. The data is not sorted before being returned.

 {
   'message' => {
     'net' => {
       'i38647' => {
         'dec_lower' => '152.78.0.0',
         'dec_upper' => '152.78.255.255',
         'orgs' => [
           {
             'org_name' => 'AgentLink.org',
             'org_url'  => 'http://www.agentlink.org',
             'repos' => [
               {
                 'repo_name' => 'xxxxxxx',
                 'repo_url'  => 'yyyyyy',
                 ................
               },
               {
                 .................
               }
             ]
           },
           {
           }
         ],
         ...........
       },
       'i39677' => {
         .............
       }
     }
   }
 }
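
Walking the nested net, org and repo levels could look like the sketch below. It assumes the decoded message holds a net element keyed by net_id, as the sample above suggests. The function name and the sample data are hypothetical.

```python
def iter_repos(message):
    """Yield (net_id, org_name, repo_name) for every repository found."""
    for net_id, net in message.get("net", {}).items():
        for org in net.get("orgs", []):
            for repo in org.get("repos", []):
                yield net_id, org.get("org_name"), repo.get("repo_name")


# Invented sample shaped like the returned data object.
message = {
    "net": {
        "i38647": {
            "dec_lower": "152.78.0.0",
            "dec_upper": "152.78.255.255",
            "orgs": [
                {
                    "org_name": "AgentLink.org",
                    "org_url": "http://www.agentlink.org",
                    "repos": [
                        {"repo_name": "Example Repository A"},
                        {"repo_name": "Example Repository B"},
                    ],
                },
                {},
            ],
        },
        "i39677": {},
    }
}
```

Because the data is not sorted before being returned, a caller that needs a stable order should sort the yielded tuples itself.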

get_xxx API

This suite of functions was initially created as part of a set of "data sanity checking" web pages and has now been brought in line with the other functions, and made generic.

Data returns

All APIs return data in the same ways:

  1. You can specify the format either with the Accept header in the HTTP request, or with the format parameter. The options are json, xml, or text, with json being the default if nothing is specified. The get_xxx suite also understands the prototype format (see below).
    If there is a callback parameter, and the format is json, then a crossDomain package is returned.
  2. For prototype returns, the data is formatted as an XHTML unordered list (as per the script.aculo.us/Prototype requirements), with the for attribute set to match EPrints field names.
  3. For all other returns, the data is a list of data records.

Making Calls

This is a suite of three APIs at http://ori.edina.ac.uk/cgi-bin/get_xxx, provided to support AJAX calls.

The basic premise is that the term to be looked up is passed in a parameter q, and all the records that have that term somewhere in the data are returned.

Additional parameters can be used to tune the query:

The three queries are:

get_orgs
This query will search either the name or the URL to return a list of organisations that match. By default, the name field is searched.
get_nets
This query will search either the name or an IP number to return a list of networks that match. By default, the name field is searched; however, if the script spots an IP number, it will automatically switch to an IP search.
get_repos
This query will search either the name or the URL to return a list of repositories that match. By default, the name field is searched.
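
A lookup URL for any of the three queries might be built as in the sketch below. It assumes that each script lives at the get_xxx address with xxx replaced by orgs, nets or repos, which is inferred from the naming above rather than stated. The function name is hypothetical, and only the q and format parameters described in this manual are used.

```python
from urllib.parse import urlencode

GET_BASE = "http://ori.edina.ac.uk/cgi-bin/get_"
KINDS = ("orgs", "nets", "repos")


def get_xxx_url(kind, term, fmt=None):
    """Build a get_orgs, get_nets or get_repos URL for the search term."""
    if kind not in KINDS:
        raise ValueError("kind must be one of: " + ", ".join(KINDS))
    params = {"q": term}
    if fmt is not None:
        params["format"] = fmt
    # urlencode escapes spaces and other reserved characters in the term.
    return GET_BASE + kind + "?" + urlencode(params)
```

Passing an IP number as the term to get_nets needs no extra flag, since the script is described as switching to an IP search on its own.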

list/xxx API

These APIs return lists of values, some of which may be used as parameters for the main API calls.

Data returns

All APIs return data in the same ways:

  1. You can specify the format either with the Accept header in the HTTP request, or with the format parameter. The options are 'json', 'xml', or 'text', with 'json' being the default if nothing is specified.
    • If there's a callback parameter, and the format is json, then a crossDomain package is returned: very useful!
  2. All return the data as a nested object, with three top-level elements:
    •  {
         'message' => {},
         'status'  => 'ok',
         'to'      => 'http://.....'
       }

status is ok or fail, to is the URL that made the query, and message contains the actual data being returned, which is dependent upon the query!

Making Calls

This is a suite of APIs at http://ori.edina.ac.uk/cgi-bin/list/xxx that pull out lists on the following:

type

This lists the type (or classification) of repository. The classification scheme is automatically extended as new types are listed in the up-stream sources.

To use a repository type with the main API, use the required code number, e.g. ?type=11

The count element indicates how many repositories are in the set.

'message' => {
               'type' => [
                           {
                             'code'  => 1,
                             'count' => 57,
                             'text'  => 'Subject (Research Cross-Institutional)'
                           },
                           {
                             'code'  => 2,
                             'count' => 299,
                             'text'  => 'Other'
                           },
                           ......
                         ]
             },

The classification scheme is automatically extended as new types are listed in the up-stream sources, but started as:

Type Code   Descriptive Text
1           Subject (Research Cross-Institutional)
2           Other
3           Disciplinary (Cross-institutional subject repositories)
4           Journal (e-Journal/Publication)
5           Database (Database/A&I Index)
6           Demonstration
7           Institutional (Institutional or departmental repositories)
8           Thesis
9           Undetermined - Repositories whose type has not yet been assessed
10          Aggregating (Archives aggregating data from several subsidiary repositories)
11          Learning (Learning and Teaching Objects)
12          Governmental (Repositories for governmental data)
13          Theses
14          Multi
15          Researchdata
16          Opendata

Adding the parameter full=1 will cause the query to return all the repositories that are of that type listed under a repos element. Note that repositories are not exclusively one type or another, and may appear under multiple types.

The repos sub-elements are indexed by repo_id.
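
Since the classification scheme can grow, a client may prefer to look a code up by its text rather than hard-code the number. The sketch below does this against a decoded message element. The function name is hypothetical, and the sample reuses the two entries shown in the listing above.

```python
def type_code(message, text_prefix):
    """Return the code of the first type whose text starts with text_prefix
    (case-insensitive), or None if nothing matches."""
    wanted = text_prefix.lower()
    for entry in message.get("type", []):
        if entry.get("text", "").lower().startswith(wanted):
            return entry.get("code")
    return None


# The two sample entries from the listing above.
message = {
    "type": [
        {"code": 1, "count": 57, "text": "Subject (Research Cross-Institutional)"},
        {"code": 2, "count": 299, "text": "Other"},
    ]
}
```

The returned code can then be passed to the main API as the type parameter.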

content

This lists the type of content that repositories accept. The classification scheme is automatically extended as new types are listed in the up-stream sources.

To use a content type with the main API, use the required code number.

The count element indicates how many repositories are in the set.

  <message>
    <content>
      <code>1</code>
      <count>112</count>
      <text>Research papers (pre- and postprints)</text>
    </content>
    <content>
      <code>2</code>
      <count>86</count>
      <text>Research papers (preprints only)</text>
    </content>
    .....
  </message>

The classification scheme is automatically extended as new types are listed in the up-stream sources, but started as:

Content Code   Descriptive Text
1              Research papers (pre- and postprints)
2              Research papers (preprints only)
3              Research papers (postprints only)
4              Bibliographic references
5              Conference and workshop papers
6              Theses and dissertations
7              Unpublished reports and working papers
8              Books & chapters and sections
9              Datasets
10             Learning Objects
11             Multimedia and audio-visual materials
12             Software
13             Patents
14             Other special item types

Adding the parameter full=1 will cause the query to return all the repositories that are of that type listed under a repos element. Note that repositories are not exclusively one type or another, and may appear under multiple types.

The repos sub-elements are indexed by repo_id.
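
For clients that request XML, the content list can be read with Python's standard ElementTree parser, as sketched below. The sample is a well-formed version of the two entries shown above. The function name is hypothetical, and a real response may wrap the message element in further elements, which is why the search covers the whole tree.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<message>
  <content>
    <code>1</code>
    <count>112</count>
    <text>Research papers (pre- and postprints)</text>
  </content>
  <content>
    <code>2</code>
    <count>86</count>
    <text>Research papers (preprints only)</text>
  </content>
</message>"""


def content_counts(xml_text):
    """Map each content code to its repository count."""
    root = ET.fromstring(xml_text)
    return {
        int(item.findtext("code")): int(item.findtext("count"))
        for item in root.iter("content")
    }
```

With the sample above, content_counts(SAMPLE) maps code 1 to 112 and code 2 to 86.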

lang

This call lists all the languages the dataset knows about: the ISO 639 codes.

(We are limited to ISO 639-2, as ISO 639-3 and later are not Open Access lists, and there is a clause which states “the product, system, or device does not provide a means to redistribute the code set.”)

The count element indicates how many repositories are in the set.

{
  "to" : "http://ori.edina.ac.uk/cgi-bin/list/lang",
  "status" : "ok",
  "message" : {
    "lang" : [
      {
        "text" : "Abkhazian",
        "iso3_b" : "abk",
        "count" : 0,
        "code" : "ab"
      },
      {
        "text" : "Achinese",
        "iso3_b" : "ace",
        "count" : 0
      },
      ...
    ]
  }
}

Adding the parameter full=1 will cause the query to return all the repositories that use that language, listed under a repos element. Note that repositories are not exclusively one language or another, and may appear under multiple languages.

The repos sub-elements are indexed by repo_id.
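
Many languages in the list have a count of zero, and some entries (such as Achinese above) carry only the three-letter iso3_b code. The sketch below filters a decoded message element down to the languages actually in use, preferring the two-letter code where one exists. The function name is hypothetical and the counts in the sample are invented.

```python
def languages_in_use(message):
    """Return a code for every language with at least one repository,
    using the two-letter code when present and iso3_b otherwise."""
    codes = []
    for entry in message.get("lang", []):
        if entry.get("count", 0) > 0:
            codes.append(entry.get("code") or entry.get("iso3_b"))
    return codes


# Invented counts; the field names follow the listing above.
message = {
    "lang": [
        {"text": "Abkhazian", "iso3_b": "abk", "count": 0, "code": "ab"},
        {"text": "Achinese", "iso3_b": "ace", "count": 2},
        {"text": "English", "iso3_b": "eng", "count": 1500, "code": "en"},
    ]
}
```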

country

This lists all the countries the dataset knows about: the ISO 3166-1 codes.

The count element indicates how many repositories are in the set.

{
  "to" : "http://ori.edina.ac.uk/cgi-bin/list/country",
  "status" : "ok",
  "message" : {
    "country" : [
      {
        "text" : "Andorra",
        "count" : 0,
        "code" : "ad"
      },
      {
        "text" : "United Arab Emirates",
        "count" : 0,
        "code" : "ae"
      },
      ...
    ]
  }
}

Adding the parameter full=1 will cause the query to include all the repositories, under a repos element, that are listed as being of that country.

The repos sub-elements are indexed by repo_id.


org

This lists all the organisations in the dataset. This script will take over 15 minutes to complete as there is a large amount of data to return.

Adding the parameter full=1 will cause the query to also return the repositories for each organisation, listed under a repos element. Running the query with the full flag can take twenty minutes!

The repos sub-elements are, in this situation, listed as described under Repository Data above.

{
  "to" : "http://ori.edina.ac.uk/cgi-bin/list/org",
  "status" : "ok",
  "message" : {
    "org" : {
      "1" : {
        <as per org listing>
      },
      "4" : {
        <as per org listing>
      },
      ...
    }
  }
}

net

This call lists all the network IP range(s) for the organisations in the ORI where these are known. This script will take several minutes to complete as there is a large amount of data to return.

Adding the parameter full=1 will cause the query to return the details of each organisation that is within each network.

Results are returned in ascending order of the net_id.

repo

This call lists all the repositories in the dataset. This script will take several minutes to complete as there is a large amount of data to return.

There is no full=1 flag.

Results are returned in ascending order of the repo_id.

Linked Data Files

Up-to-date raw linked data files are produced each day and can be retrieved in W3C supported formats at: http://ori.edina.ac.uk/reference/linked/1.0/

Further Reading

Open Archives Initiative
The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. OAI has its roots in the open access and institutional repository movements.

Glossary of Terms

Acknowledgement

ORI was developed by EDINA, with funding from JISC.