November 4, 2019 – Alert: Possible Data Integrity Issues with Mirroring and Journal Restore
InterSystems has corrected several critical defects that can result in data integrity issues. These defects were identified and corrected within a short time, so InterSystems has simplified the upgrade process by consolidating them into a single package. The effects of encountering these defects may not always be visible. These defects affect InterSystems IRIS, IRIS for Health, Health Connect, Caché, Ensemble, and HealthShare products. All of these defects relate to the application of journal data.
InterSystems recommends that you review this document. It will help you determine your level of risk and also describes mitigations for each defect. Please contact the Worldwide Response Center if you have any questions regarding this alert.
Description of Defects
All messages quoted in the descriptions below appear in cconsole.log (Caché, Ensemble, and HealthShare) or messages.log (InterSystems IRIS, IRIS for Health).
1) Mirror Data Not Applied – Failover Mirror Member Becoming Primary at Instance Startup
This defect affects Caché/Ensemble 2017.2 and 2018.1, all released versions of InterSystems IRIS, and all HealthShare products based on the affected Data Platforms product versions. A list of HealthShare products affected by this issue appears in Appendix A.
If a mirror failover member starts when its failover partner is shut down and it has to retrieve journal data from another mirror member as part of becoming primary, the system may silently fail to apply some journal records. The sequence of messages below indicates that a mirror member retrieved journal data from another mirror member as part of becoming primary:
11/01/19-13:51:59:109 (13062) 0 Mirror manager for <mirname> starting
11/01/19-13:51:59:774 (13062) 0 Retrieving journal file #<#> for mirror <mirname> from <other-mirror-member-address>
… (more files might get retrieved, and data is applied)
11/01/19-14:06:53:165 (13062) 1 Becoming primary mirror server
If your system has encountered this defect, the journal data indicated by one or more of these messages may not have been applied to the databases.
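As an illustration only, the message sequence above could be searched for in a copy of cconsole.log or messages.log with a short script such as the sketch below. The function name and the sample values (mirror name MIRRORA, journal file number 12, address 10.0.0.2) are hypothetical, and the line layout is inferred from the sample messages in this alert rather than from a documented log format. A match shows only that the conditions for the defect occurred, not that data was lost.

```python
import re

# Line layout inferred from the sample messages in this alert (an assumption):
# MM/DD/YY-HH:MM:SS:mmm (pid) severity message-text
LINE_RE = re.compile(r"^(\S+) \((\d+)\) (\d) (.*)$")


def find_retrieve_then_primary(lines):
    """Return one dict per mirror startup in which journal files were
    retrieved from another member before the instance became primary."""
    hits = []
    current = None  # state of the startup currently being tracked, if any
    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if not m:
            continue
        ts, _pid, _severity, text = m.groups()
        if text.startswith("Mirror manager for ") and text.endswith(" starting"):
            current = {"started": ts, "retrieved": []}
        elif current is not None and text.startswith("Retrieving journal file "):
            current["retrieved"].append(text)
        elif current is not None and text == "Becoming primary mirror server":
            if current["retrieved"]:
                hits.append({
                    "started": current["started"],
                    "became_primary": ts,
                    "retrieved": current["retrieved"],
                })
            current = None
    return hits


# Hypothetical sample lines; the placeholders from the alert are filled
# with made-up values purely for demonstration.
sample = [
    "11/01/19-13:51:59:109 (13062) 0 Mirror manager for MIRRORA starting",
    "11/01/19-13:51:59:774 (13062) 0 Retrieving journal file #12 for mirror MIRRORA from 10.0.0.2",
    "11/01/19-14:06:53:165 (13062) 1 Becoming primary mirror server",
]

for hit in find_retrieve_then_primary(sample):
    print(hit["started"], "->", hit["became_primary"],
          len(hit["retrieved"]), "journal file(s) retrieved")
```

To scan a real log, pass an open file object in place of the sample list, for example `find_retrieve_then_primary(open(path))`, where `path` is the location of the log on your system.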
To avoid encountering this defect: If both failover members are down, always start the member that most recently ran as primary before starting the member that most recently ran as backup.
2) Mirror Data Not Applied – Async Mirror Member Shutting Down Dejournaling
This defect affects Caché/Ensemble 2017.2 and 2018.1, all released versions of InterSystems IRIS, and all HealthShare products based on the affected Data Platforms product versions. A list of HealthShare products affected by this issue appears in Appendix A.
When a system meets both of the following conditions, it may encounter this defect:
- An async mirror member has processed more than 2^32 (~4 billion) records in one dejournaling session.
- Dejournaling was stopped manually.
If these conditions are met, then, when dejournaling restarts, some of the processed journal records may not be applied.
The following message indicates that dejournaling was stopped manually:
11/01/19-17:43:06:754 (436) 1 Dejournaling for <mirname> shutting down and set to manual restart required
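The sketch below shows one way this message might be located in a copy of the log. The function and constant names are hypothetical, and the pattern is derived from the single sample message above. Note that the log message alone does not show how many records the session processed, so a match identifies only the second condition (manual stop); the 2^32 threshold is included as a constant for reference.

```python
import re

# Pattern derived from the sample message in this alert (an assumption,
# not a documented log format).
STOP_RE = re.compile(
    r"^(\S+) \((\d+)\) \d Dejournaling for (.+) shutting down "
    r"and set to manual restart required$"
)

# Record-count threshold named in the alert: 2^32 = 4,294,967,296.
RECORD_THRESHOLD = 2 ** 32


def find_manual_dejournal_stops(lines):
    """Return (timestamp, mirror_name) for each manual dejournaling stop."""
    stops = []
    for raw in lines:
        m = STOP_RE.match(raw.strip())
        if m:
            stops.append((m.group(1), m.group(3)))
    return stops
```

Each returned timestamp marks a point after which the async member's databases may warrant a consistency check if the session was long-running.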
To avoid encountering this defect: Stop and restart mirroring before stopping dejournaling on an async member.
3) Journal Records Not Restored – Journal Restore Immediately Following Online Backup Restore (^DBREST)
This defect affects Caché/Ensemble 2018.1.2, all released versions of InterSystems IRIS except 2018.1, and all HealthShare products based on the affected Data Platforms product versions. A list of HealthShare products affected by this issue appears in Appendix A.
This defect is specific to journal restores that are performed in conjunction with a database restore that uses online backup. For more information, see the “Online Backup Utilities” section of the Data Integrity Guide for InterSystems IRIS or the “Caché Online Backup” section of the Caché Data Integrity Guide for Caché and Ensemble.
The defect can cause a journal record not to be applied during the journal restore following the backup restore. If your system has encountered this defect, there is an entry under ^SYS("RESTORE","JOURNAL") in the %SYS namespace that includes the timestamp of the restore as well as the "StartAddress" that was used, such as:
^SYS("RESTORE","JOURNAL","20191101 16:55:01","Files",1,"StartAddress") = 9374648
If the global has no entries that include the “StartAddress” node, the system has not been affected by this defect.
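If you have saved a text listing of the global to a file, a check along the following lines could pick out the relevant nodes. This is a sketch under assumptions: the function name is hypothetical, the listing is assumed to contain one node per line in the same form as the sample entry above, and how the listing is produced is left to you.

```python
import re

# Matches nodes written in the same form as the sample entry in this alert.
# Whitespace around "=" is optional so both spaced and unspaced listings match.
NODE_RE = re.compile(
    r'^\^SYS\("RESTORE","JOURNAL","([^"]+)","Files",(\d+),"StartAddress"\)'
    r"\s*=\s*(\d+)\s*$"
)


def find_start_address_nodes(listing_lines):
    """Return (restore_timestamp, file_index, start_address) per matching node."""
    found = []
    for raw in listing_lines:
        m = NODE_RE.match(raw.strip())
        if m:
            found.append((m.group(1), int(m.group(2)), int(m.group(3))))
    return found


# The sample entry from this alert.
sample = [
    '^SYS("RESTORE","JOURNAL","20191101 16:55:01","Files",1,"StartAddress") = 9374648',
]
for ts, index, address in find_start_address_nodes(sample):
    print("restore at", ts, "file", index, "StartAddress", address)
```

An empty result is consistent with the system not having been affected, provided the listing covers the whole global.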
To avoid encountering this defect: When restoring from online backup, do not apply journals as part of the database restore; instead, first restore the backup, then run ^JRNRESTO. Do not use the journal marker.
4) Journal Records Applied Incorrectly – Mirror Catchup on a Subset of Databases
This defect affects all released versions; however, the likelihood of the defect being triggered is greatly increased in Caché/Ensemble 2018.1.2, all released versions of InterSystems IRIS except 2018.1, and all HealthShare products based on the affected Data Platforms product versions. A list of HealthShare products most likely to be affected by this issue appears in Appendix A.
If a mirror Catchup operation involves one or more of the mirrored databases in a mirror but not all of them, then stale records from a journal file may be applied to the active databases in the mirror, potentially overwriting newer application data.
When Catchup runs, the log contains messages such as:
11/01/19-13:01:55:057 (17752) 0 Mirror <mirname> catchup started for 1 database.
… (other events can occur while Catchup is running)
11/01/19-13:01:55:207 (17752) 0 Mirror Catchup completed for database <path-to-database>
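The following sketch shows how Catchup runs might be listed from a copy of the log so they can be reviewed. The function name and the `total_mirrored` parameter are hypothetical, and the patterns are derived from the two sample messages above. The log does not state how many mirrored databases exist, so the caller must supply that number if a subset flag is wanted.

```python
import re

# Patterns derived from the sample messages in this alert (an assumption).
START_RE = re.compile(
    r"^(\S+) \(\d+\) \d Mirror (.+) catchup started for (\d+) databases?\.?$"
)
DONE_RE = re.compile(
    r"^(\S+) \(\d+\) \d Mirror Catchup completed for database (.+)$"
)


def find_catchups(lines, total_mirrored=None):
    """Return one dict per Catchup run found in the log lines.

    total_mirrored: number of mirrored databases on the member, if known.
    When given, each run is flagged as a subset if it covered fewer databases.
    """
    runs = []
    for raw in lines:
        line = raw.strip()
        m = START_RE.match(line)
        if m:
            count = int(m.group(3))
            runs.append({
                "started": m.group(1),
                "mirror": m.group(2),
                "count": count,
                "subset": None if total_mirrored is None else count < total_mirrored,
                "completed": [],
            })
            continue
        m = DONE_RE.match(line)
        if m and runs:
            # Attribute the completion to the most recently started run.
            runs[-1]["completed"].append(m.group(2))
    return runs
```

For example, `find_catchups(open(path), total_mirrored=5)` would mark a run that started for 1 database as a subset, which is the scenario described above.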
To avoid encountering this defect, make sure the mirror member is disconnected from the primary before running Catchup on a subset of databases.
5) Other Scenarios
Other, much rarer dejournaling defects affecting mirroring, shadowing, and journal restore have also been corrected as part of this alert. All released versions of InterSystems products are vulnerable to at least some of these defects. A short summary of each issue is provided below. Each issue is extremely rare, even when the necessary conditions are met. For more details, please see the maintenance release notes for the corrections listed below in Caché/Ensemble 2018.1.3 or InterSystems IRIS 2019.1.1, or contact the Worldwide Response Center (WRC).
- If mirror shutdown takes at least 10 seconds while stopping mirroring or shutting down an instance, this can result in the same defect on the same versions as issue #2 above. (Typically, mirror shutdown takes less than a second.)
- Mirror dejournaling (on all released versions) can either silently apply a journal record incorrectly or exit due to an error.
- Journal restore (on the same versions as #1 and #2) can either silently apply a journal record incorrectly or exit due to an error.
- With parallel dejournaling enabled, mirroring can record an incorrect database checkpoint. If a system subsequently crashes or the checkpoint is recorded while manually stopping dejournaling, some journal records are not applied when dejournaling starts again. This affects the same versions as #1 and #2.
- A backup mirror member can fail to apply a subset of journal records as part of becoming primary, if the original primary was forced down. This affects all released versions.
- When dejournaling shuts down for any reason, then journal restore, mirroring, and shadowing are vulnerable to the same defect as issue #2. This affects 32-bit platforms for all released versions.
- If dejournaling gets an error (such as <FILEFULL> or <DATABASE>) while applying a journal record to a specific database, it can incorrectly mark a different database inactive. It can then continue applying data to the original database, despite having failed to apply the journal record that caused the error. This affects the same versions as #1 and #2.
- Mirror database Catchup can fail to apply some journal records if a previous database Catchup on that database failed in some way. This affects all released versions.
Verifying Current Data Consistency
You can run DataCheck to verify the consistency of globals across mirror members, but it cannot guarantee that your system never encountered any of these defects. Specifically, if DataCheck reports consistent data, this means the systems are consistent now; however, they may not have been in the past, if affected globals were updated after the defect was encountered. For more information on DataCheck, see the “Data Consistency on Multiple Systems” chapter in the Data Integrity Guide or the Caché Data Integrity Guide.
Information about the Corrections
The corrections for these defects are identified as SML2776, SML2781, SML2782, SML2783, SML2785, JO2990, JO3117, JO3137, JO3140, JO3141, RJF391, RJF392, HYY2362, HYY2364, and HYY2373. They will be included in all future releases, including Caché/Ensemble 2018.1.3 and InterSystems IRIS data platform 2019.1.1 and 2019.3, and are also available as an Ad hoc distribution from the Worldwide Response Center (WRC). If you have any questions regarding this alert, please contact the Worldwide Response Center.
Appendix A: Affected HealthShare Products
For issues #1 and #2, the following HealthShare products are affected:
- Information Exchange 15.03 and 2018.1
- Unified Care Record 2019.1
- Patient Index 15.03, 2018.1, and 2019.1
- Health Insight 15.03, 2018.1, and 2019.1
- Provider Directory 2019.1
- Personal Community 12, 2018.1, and 2019.1
- Health Connect 15.03, if built on Caché/Ensemble 2017.2 or 2018.1
- Health Connect 2019.1
For issue #3, the following HealthShare products are affected:
- Unified Care Record 2019.1
- Patient Index 2019.1
- Health Insight 2019.1
- Provider Directory 2019.1
- Personal Community 2019.1
- Health Connect 15.03, if built on Caché/Ensemble 2018.1.2
- Health Connect 2019.1
For issue #4, all versions of all products are vulnerable. The versions listed for issue #3 are particularly vulnerable to issue #4.