Data integrity checks

The quality of data is very important to us. Although responsibility for the data ultimately lies with editors and project managers, we take great care to prevent inconsistent data from being entered at a technical level. For example, the user interface does not allow entering begin dates that are later than end dates.

Nevertheless, mistakes can happen, not only at the application level but also, for example, when importing data from other projects or deleting files outside of the application. Because data integrity is important for the quality of research, we implemented functions that check for possible inconsistencies. They are described in detail below.

Orphans

Orphans

This tab shows entries, such as dates, that are not linked to anything. They may be artifacts of imports or bugs and can be deleted. If they keep reappearing (without imports or known bugs), please report the issue.
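The idea behind such a check can be sketched as follows. This is only an illustration, not the application's actual implementation: the function name find_orphans and the representation of links as pairs of ids are assumptions made for this example.

```python
def find_orphans(entry_ids, links):
    """Return the ids of entries that appear in no link, sorted.

    entry_ids: iterable of ids (hypothetical data layout)
    links: iterable of (source_id, target_id) pairs (hypothetical)
    """
    linked = set()
    for source_id, target_id in links:
        linked.add(source_id)
        linked.add(target_id)
    # Whatever is left after removing all linked ids is an orphan.
    return sorted(set(entry_ids) - linked)


# Made-up example data: entry 4 is not part of any link.
entry_ids = [1, 2, 3, 4]
links = [(1, 2), (2, 3)]
print(find_orphans(entry_ids, links))  # prints [4]
```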

Type

These types were created but have no subtypes or associated data. They may originate from the initial installation or may never have been used.

Missing files

File entities that have no corresponding file are listed here, most likely because the file itself no longer exists.

Orphaned files

Files that have no corresponding entity are listed here.
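Missing files and orphaned files are two sides of the same comparison between file entities and the files that actually exist. The following is a minimal sketch of that comparison, assuming that each file is named after the id of its entity (for example 112.png for entity 112). This naming scheme and the function name check_files are assumptions made for illustration only.

```python
from pathlib import Path


def check_files(file_entity_ids, file_names):
    """Return (missing, orphaned) as two sorted lists.

    missing: entity ids without a matching file
    orphaned: file names without a matching entity
    Assumption: the file name without extension equals the entity id.
    """
    entity_ids = {str(entity_id) for entity_id in file_entity_ids}
    # Map "112" -> "112.png" so both directions can be compared.
    stems = {Path(name).stem: name for name in file_names}
    missing = sorted(entity_ids - set(stems))
    orphaned = sorted(
        name for stem, name in stems.items() if stem not in entity_ids)
    return missing, orphaned


# Made-up example data: entity 113 has no file, 200.jpg has no entity.
missing, orphaned = check_files([112, 113], ["112.png", "200.jpg"])
print(missing)   # prints ['113']
print(orphaned)  # prints ['200.jpg']
```

In practice the list of file names could be read from the upload directory, e.g. with os.listdir.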

Circular dependencies

This checks whether an entity is linked to itself. This could happen, for example, if a person is married to herself or a type has itself as its super type. It shouldn’t be possible to create circular dependencies within the application. Nevertheless, it is a useful check, for example when data is imported from other systems.
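A check for self-links can be sketched as below. The representation of a link as a (source id, relation, target id) tuple and the function name find_self_links are hypothetical and only serve this example.

```python
def find_self_links(links):
    """Return all links whose source and target are the same entity.

    links: iterable of (source_id, relation, target_id) tuples
    (hypothetical data layout).
    """
    return [link for link in links if link[0] == link[2]]


# Made-up example data: person 7 is married to herself.
links = [
    (5, "married to", 6),
    (7, "married to", 7),
    (9, "has super type", 8),
]
print(find_self_links(links))  # prints [(7, 'married to', 7)]
```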

Check dates

This tab shows invalid date combinations, e.g. begin dates that are later than end dates. These entries should be corrected. Otherwise they cannot be updated, because the user interface won’t allow saving entries with invalid date combinations.
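The most common case, a begin date later than an end date, can be sketched like this. The tuple layout (id, begin, end) and the function name find_invalid_dates are assumptions for illustration. Entries with a missing begin or end date are skipped here, which may differ from what the application does.

```python
from datetime import date


def find_invalid_dates(entries):
    """Return the ids of entries whose begin date is later than the end date.

    entries: iterable of (entry_id, begin, end) tuples, where begin and
    end are datetime.date objects or None (hypothetical data layout).
    """
    return [
        entry_id
        for entry_id, begin, end in entries
        if begin is not None and end is not None and begin > end
    ]


# Made-up example data: entry 2 ends before it begins.
entries = [
    (1, date(1500, 1, 1), date(1550, 12, 31)),
    (2, date(1600, 5, 1), date(1590, 5, 1)),
    (3, date(1700, 1, 1), None),
]
print(find_invalid_dates(entries))  # prints [2]
```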

Check similar names

Here you can search for similar names. Depending on the selection and the data volume, this might take some time.

  • Classes - select the class in which you want to search for similar names
  • Ratio - select how similar the names should be. 100 is the default and means absolutely identical. The lower the number, the more similar names will be found, but the search will also take longer, so you should begin with a higher number.

To find similar names, the Python fuzzywuzzy package is used, which in turn uses the Levenshtein distance.
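How a ratio threshold affects the result can be illustrated with a small sketch. To keep the example free of third-party packages, it uses SequenceMatcher from the Python standard library as a stand-in for fuzzywuzzy, so the scores are not guaranteed to match those of the application. The function names similarity and find_similar_names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Return a similarity score from 0 to 100, where 100 means identical."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def find_similar_names(names, ratio=100):
    """Return all pairs of names whose similarity is at least ratio.

    Every name is compared with every other name, so the number of
    comparisons grows quadratically with the number of names.
    """
    return [
        (a, b)
        for a, b in combinations(names, 2)
        if similarity(a, b) >= ratio
    ]


# Made-up example data.
names = ["Vienna", "Viena", "Graz"]
print(find_similar_names(names, ratio=100))  # prints []
print(find_similar_names(names, ratio=90))   # prints [('Vienna', 'Viena')]
```

With a ratio of 100 only identical names are reported. Lowering the ratio to 90 also reports the likely misspelling.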