|Abstract||Identifying experts on rare diseases is a key element in improving the situation of patients and health care providers alike. Information on rare disease experts is provided by a number of online portals such as Orphanet, se-atlas and Expertscape. These are, however, mostly maintained and updated manually, which causes issues with the specificity, completeness and currency of their data. In the case of Expertscape, data is collected via bibliometric analysis of the scientific literature, i.e. authors of publications on a disease are presented as potential experts for that disease. This approach mitigates some of the aforementioned issues, but it introduces a new issue of name ambiguity: publications by multiple authors sharing the same name are not distinguished from one another. In addition, its orientation towards common diseases causes Expertscape to miss many rare diseases.
The goal of this work was therefore to develop an expert finder system based on bibliometric analysis that is specifically oriented towards rare diseases and employs in-depth analysis techniques as well as author name disambiguation. The system should overcome the issues of traditional expert registries and complement the available data on rare disease experts.
A thesaurus of rare disease terms was set up using the Orphanet classification of rare diseases. With these terms, PubMed, a biomedical literature database with more than 26 million citations, was searched for publications on rare diseases, and the respective metadata were extracted. An application that manages the literature extraction iteratively and a staging database for storing the extracted data were set up. The application is also used to maintain the rare disease thesaurus.
The extracted data was subjected to an in-depth analysis in which affiliation entries were segmented to obtain additional structured information about institutions, cities and countries, among others. Further analyses, e.g. to make full use of keywords, were examined but not fully realised. Author name disambiguation was examined extensively. To reduce computational complexity, name instances were grouped prior to disambiguation, and this grouping was based on name similarity instead of exact matches. Multiple similarity metrics were compared against each other and the most suitable matching scheme was employed. A disambiguation approach using pairwise similarity calculation, hierarchical agglomerative clustering and a dynamic cluster-detection method was implemented. The approach was evaluated against a sophisticated disambiguation approach as well as against a naive baseline grouping based on exact name matches. The disambiguation performance was unconvincing, so the naive grouping scheme was retained for the first prototype, along with the resulting ambiguity issues.
The overall system was evaluated in different ways, such as having experts annotate author lists manually and comparing the system to other expert finding approaches on the basis of verified expert lists. The comparison showed that the system is able to identify more rare disease experts than the other approaches. However, the results contain numerous false positive entries, and the data on experts is in part incorrect or out of date.
In conclusion, the bibliometric analysis system is nevertheless a successful proof of principle and can already be used to complement rare disease expert registries by identifying previously unknown experts. The remaining data flaws would need to be resolved in the future, along with further adaptation to changing external conditions such as the access options for source databases. More analysis features could be examined and realised in order to enhance the system and ensure its sustainability.||dc.description.abstract