Similar Articles (20 results)
1.
Objective: Simulating electronic health record data offers an opportunity to resolve the tension between data sharing and patient privacy. Recent techniques based on generative adversarial networks have shown promise but neglect the temporal aspect of healthcare. We introduce a generative framework for simulating the trajectory of patients’ diagnoses, along with measures to evaluate utility and privacy.
Materials and Methods: The framework simulates date-stamped diagnosis sequences based on a 2-stage process that 1) sequentially extracts temporal patterns from clinical visits and 2) generates synthetic data conditioned on the learned patterns. We designed 3 utility measures to characterize the extent to which the framework maintains feature correlations and temporal patterns in clinical events. We evaluated the framework with billing codes, represented as phenome-wide association study codes (phecodes), from over 500 000 Vanderbilt University Medical Center electronic health records. We further assessed the privacy risks based on membership inference and attribute disclosure attacks.
Results: The simulated temporal sequences exhibited similar characteristics to real sequences on the utility measures. Notably, diagnosis prediction models based on real versus synthetic temporal data exhibited an average relative difference in area under the ROC curve of 1.6% with a standard deviation of 3.8% across 1276 phecodes. Additionally, the relative differences in mean occurrence age and time between visits were 4.9% and 4.2%, respectively. The privacy risks in the synthetic data, with respect to membership and attribute inference, were negligible.
Conclusion: This investigation indicates that temporal diagnosis code sequences can be simulated in a manner that provides utility and respects privacy.
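As an illustration of the utility comparison described above, the following sketch trains the same classifier on real and on synthetic data, evaluates both on held-out real records, and reports the relative difference in AUC. The toy data, features, and classifier are illustrative assumptions, not the paper's actual phecode pipeline or generative model.

```python
# Sketch of a "train on real vs. train on synthetic" utility check.
# All data here are simulated stand-ins for real and synthetic cohorts.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_cohort(n=2000, p=20):
    """Toy stand-in for a real or synthetic patient feature matrix."""
    X = rng.normal(size=(n, p))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(int)
    return X, y

X_real, y_real = make_cohort()
X_syn, y_syn = make_cohort()   # pretend this came from the generator

X_train, X_test, y_train, y_test = train_test_split(
    X_real, y_real, test_size=0.3, random_state=0)

def auc_of(train_X, train_y):
    model = RandomForestClassifier(random_state=0).fit(train_X, train_y)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

auc_real = auc_of(X_train, y_train)
auc_syn = auc_of(X_syn, y_syn)

# Relative difference in AUC, the utility statistic quoted in the abstract
rel_diff = abs(auc_real - auc_syn) / auc_real
print(f"AUC real={auc_real:.3f}, synthetic={auc_syn:.3f}, relative diff={rel_diff:.1%}")
```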

2.
Objective: This study sought to evaluate whether synthetic data derived from a national coronavirus disease 2019 (COVID-19) dataset could be used for geospatial and temporal epidemic analyses.
Materials and Methods: Using an original dataset (n = 1 854 968 severe acute respiratory syndrome coronavirus 2 tests) and its synthetic derivative, we compared key indicators of COVID-19 community spread through analysis of aggregate and zip code-level epidemic curves, patient characteristics and outcomes, distribution of tests by zip code, and indicator counts stratified by month and zip code. Similarity between the data was statistically and qualitatively evaluated.
Results: In general, synthetic data closely matched original data for epidemic curves, patient characteristics, and outcomes. Synthetic data suppressed labels of zip codes with few total tests (mean = 2.9 ± 2.4; max = 16 tests; 66% reduction of unique zip codes). Epidemic curves and monthly indicator counts were similar between synthetic and original data in a random sample of the most tested (top 1%; n = 171) and for all unsuppressed zip codes (n = 5819), respectively. In small sample sizes, synthetic data utility was notably decreased.
Discussion: Analyses on the population level and of densely tested zip codes (which contained most of the data) were similar between original and synthetically derived datasets. Analyses of sparsely tested populations were less similar and had more data suppression.
Conclusion: In general, synthetic data were successfully used to analyze geospatial and temporal trends. Analyses using small sample sizes or populations were limited, in part due to purposeful data label suppression—an attribute disclosure countermeasure. Users should consider data fitness for use in these cases.
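A minimal sketch of the suppression and comparison steps described above: zip codes with few total tests are masked before monthly epidemic curves are compared. The threshold of 17 is an assumption inferred from the reported maximum of 16 suppressed tests, and the toy data frame is purely illustrative, not the study's dataset.

```python
# Illustrative suppression of sparsely tested zip codes, then curve comparison.
import pandas as pd

SUPPRESSION_THRESHOLD = 17  # assumption: mask zips with fewer than 17 total tests

def suppress_sparse_zips(df, threshold=SUPPRESSION_THRESHOLD):
    counts = df.groupby("zip")["test_id"].count()
    sparse = counts[counts < threshold].index
    out = df.copy()
    out.loc[out["zip"].isin(sparse), "zip"] = "SUPPRESSED"
    return out

def monthly_curve(df):
    # monthly count of positive tests, an "epidemic curve" in miniature
    return df.groupby(df["test_date"].dt.to_period("M"))["positive"].sum()

# toy frames standing in for the original data and its synthetic derivative
original = pd.DataFrame({
    "zip": ["66044"] * 30 + ["66045"] * 3,
    "test_id": range(33),
    "test_date": pd.to_datetime(["2020-04-15"] * 20 + ["2020-05-15"] * 13),
    "positive": [1, 0] * 16 + [1],
})
synthetic = suppress_sparse_zips(original)

print(monthly_curve(original))
print(monthly_curve(synthetic))
print("unique zips:", original["zip"].nunique(), "->", synthetic["zip"].nunique())
```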

3.
Objective: As a long-standing Clinical and Translational Science Awards (CTSA) Program hub, the University of Pittsburgh and the University of Pittsburgh Medical Center (UPMC) developed and implemented a modern research data warehouse (RDW) to efficiently provision electronic patient data for clinical and translational research.
Materials and Methods: We designed and implemented an RDW named Neptune to serve the specific needs of our CTSA. Neptune uses an atomic design in which data are stored at a high level of granularity, as represented in source systems. Neptune contains robust patient identity management tailored for research; integrates patient data from multiple sources, including electronic health records (EHRs), health plans, and research studies; and includes knowledge for mapping to standard terminologies.
Results: Neptune contains data for more than 5 million patients, longitudinally organized as a Health Insurance Portability and Accountability Act (HIPAA) Limited Data Set with dates, and includes structured EHR data, clinical documents, health insurance claims, and research data. Neptune is used as a source of patient data for hundreds of institutional review board-approved research projects by local investigators and for national projects.
Discussion: The design of Neptune was heavily influenced by the large size of UPMC, the varied data sources, and the rich partnership between the University and the healthcare system. It includes several unique aspects, including the physical warehouse straddling the University and UPMC networks and management under a HIPAA Business Associates Agreement.
Conclusion: We describe the design and implementation of an RDW at a large academic healthcare system that uses a distinctive atomic design in which data are stored at a high level of granularity.

4.
Objective: The lack of representative coronavirus disease 2019 (COVID-19) data is a bottleneck for reliable and generalizable machine learning. Data sharing is insufficient without data quality, in which source variability plays an important role. We showcase and discuss potential biases from data source variability for COVID-19 machine learning.
Materials and Methods: We used the publicly available nCov2019 dataset, which includes patient-level data from several countries. We aimed to discover and classify severity subgroups using symptoms and comorbidities.
Results: Cases from the 2 countries with the highest prevalence were divided into separate subgroups with distinct severity manifestations. This variability can reduce the representativeness of training data with respect to the model's target populations and increase model complexity at risk of overfitting.
Conclusions: Data source variability is a potential contributor to bias in distributed research networks. We call for systematic assessment and reporting of data source variability and data quality in COVID-19 data sharing, as key information for reliable and generalizable machine learning.

5.
Objective: With the growing demand for sharing clinical trial data, scalable methods to enable privacy-protective access to high-utility data are needed. Data synthesis is one such method. Sequential trees are commonly used to synthesize health data. It is hypothesized that the utility of the generated data is dependent on the variable order. No assessments of the impact of variable order on synthesized clinical trial data have been performed thus far. Through simulation, we aim to evaluate the variability in the utility of synthetic clinical trial data as variable order is randomly shuffled and to implement an optimization algorithm to find a good order if variability is too high.
Materials and Methods: Six oncology clinical trial datasets were evaluated in a simulation. Three utility metrics were computed comparing real and synthetic data: univariate similarity, similarity in multivariate prediction accuracy, and a distinguishability metric. Particle swarm was implemented to optimize variable order and was compared with a curriculum learning approach to ordering variables.
Results: As the number of variables in a clinical trial dataset increases, there is a pattern of a marked increase in the variability of data utility with order. Particle swarm with a distinguishability hinge loss ensured adequate utility across all 6 datasets. The hinge threshold was selected to avoid overfitting, which can create a privacy problem. This was superior to curriculum learning in terms of utility.
Conclusions: The optimization approach presented in this study gives a reliable way to synthesize high-utility clinical trial datasets.
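One common way to compute a distinguishability metric like the one above is a propensity-score classifier that tries to separate real from synthetic records; a hinge then turns the score into an optimization loss that stops rewarding improvement below a threshold, which helps avoid overfitting. The sketch below is a generic formulation under those assumptions, not the paper's exact implementation, and the threshold value is arbitrary.

```python
# Propensity-score distinguishability (pMSE) with a hinged loss - a sketch.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def distinguishability(real, synthetic):
    X = np.vstack([real, synthetic])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(synthetic))])
    # out-of-fold propensity of each record being real
    prop = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                             cv=5, method="predict_proba")[:, 1]
    # propensity mean squared error: 0 when real/synthetic are indistinguishable
    return float(np.mean((prop - 0.5) ** 2))

def hinge_loss(score, threshold=0.05):
    """Hinged objective: only distinguishability above the threshold is penalized."""
    return max(0.0, score - threshold)

rng = np.random.default_rng(1)
real = rng.normal(size=(500, 8))
synthetic = rng.normal(loc=0.2, size=(500, 8))   # an imperfect synthetic copy
score = distinguishability(real, synthetic)
print(f"pMSE = {score:.4f}, hinged loss = {hinge_loss(score):.4f}")
```

In an order-optimization loop such as the particle swarm described above, this hinged loss would be evaluated for the synthetic dataset produced by each candidate variable order.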

6.
The Patient-Centered Outcomes Research Institute (PCORI) has launched PCORnet, a major initiative to support an effective, sustainable national research infrastructure that will advance the use of electronic health data in comparative effectiveness research (CER) and other types of research. In December 2013, PCORI's board of governors funded 11 clinical data research networks (CDRNs) and 18 patient-powered research networks (PPRNs) for a period of 18 months. CDRNs are based on the electronic health records and other electronic sources of very large populations receiving healthcare within integrated or networked delivery systems. PPRNs are built primarily by communities of motivated patients, forming partnerships with researchers. These patients intend to participate in clinical research, by generating questions, sharing data, volunteering for interventional trials, and interpreting and disseminating results. Rapidly building a new national resource to facilitate a large-scale, patient-centered CER is associated with a number of technical, regulatory, and organizational challenges, which are described here.

7.
Objective: Integrating and harmonizing disparate patient data sources into one consolidated data portal enables researchers to conduct analysis efficiently and effectively.
Materials and Methods: We describe an implementation of Informatics for Integrating Biology and the Bedside (i2b2) to create the Mass General Brigham (MGB) Biobank Portal data repository. The repository integrates data from primary and curated data sources and is updated weekly. The data are made readily available to investigators in a data portal where they can easily construct and export customized datasets for analysis.
Results: As of July 2021, there are 125 645 consented patients enrolled in the MGB Biobank. 88 527 (70.5%) have a biospecimen, 55 121 (43.9%) have completed the health information survey, 43 552 (34.7%) have genomic data, and 124 760 (99.3%) have EHR data. Twenty machine learning computed phenotypes are calculated on a weekly basis. There are currently 1220 active investigators who have run 58 793 patient queries and exported 10 257 analysis files.
Discussion: The Biobank Portal allows noninformatics researchers to conduct study feasibility by querying across many data sources and then extract data that are most useful to them for clinical studies. While institutions require substantial informatics resources to establish and maintain integrated data repositories, they yield significant research value to a wide range of investigators.
Conclusion: The Biobank Portal and other patient data portals that integrate complex and simple datasets enable diverse research use cases. i2b2 tools to implement these registries and make the data interoperable are open source and freely available.

8.
Objective: To propose an algorithm that utilizes only timestamps of longitudinal electronic health record data to classify clinical deterioration events.
Materials and Methods: This retrospective study explores the efficacy of machine learning algorithms in classifying clinical deterioration events among patients in intensive care units using sequences of timestamps of vital sign measurements, flowsheet comments, order entries, and nursing notes. We design a data pipeline to partition events into discrete, regular time bins that we refer to as timesteps. Logistic regressions, random forest classifiers, and recurrent neural networks are trained on datasets of different timestep lengths against a composite outcome of death, cardiac arrest, and Rapid Response Team calls. These models are then validated on a holdout dataset.
Results: A total of 6720 intensive care unit encounters meet the criteria, and the final dataset includes 830 578 timestamps. The gated recurrent unit model utilizes timestamps of vital signs, order entries, flowsheet comments, and nursing notes to achieve the best performance on the time-to-outcome dataset, with an area under the precision-recall curve of 0.101 (0.06, 0.137), a sensitivity of 0.443, and a positive predictive value of 0.092 at the threshold of 0.6.
Discussion and Conclusion: This study demonstrates that our recurrent neural network models, using only timestamps of longitudinal electronic health record data that reflect healthcare processes, achieve well-performing discriminative power.
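The timestep construction described above can be illustrated with a short pandas sketch that counts event timestamps per encounter in fixed-width bins. The 4-hour bin width and the event types shown are assumptions for illustration only, not the study's configuration.

```python
# Bin event timestamps into regular "timesteps" per encounter.
import pandas as pd

BIN = "4H"  # assumed timestep width; the study's actual bin sizes may differ

events = pd.DataFrame({
    "encounter_id": [1, 1, 1, 1, 2, 2],
    "event_type": ["vital", "order", "vital", "note", "vital", "order"],
    "timestamp": pd.to_datetime([
        "2021-01-01 00:30", "2021-01-01 01:10", "2021-01-01 05:00",
        "2021-01-01 09:45", "2021-01-02 02:00", "2021-01-02 03:30"]),
})

# one row per (encounter, timestep), one count column per event type
timesteps = (events
             .groupby(["encounter_id", "event_type",
                       pd.Grouper(key="timestamp", freq=BIN)])
             .size()
             .unstack("event_type", fill_value=0))
print(timesteps)
```

A sequence model such as a gated recurrent unit would then consume these per-timestep count vectors, ordered by time within each encounter.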

9.
10.
Objective: Accurate and robust quality measurement is critical to the future of value-based care. Having incomplete information when calculating quality measures can cause inaccuracies in reported patient outcomes. This research examines how quality calculations vary when using data from an individual electronic health record (EHR) and longitudinal data from a health information exchange (HIE) operating as a multisource registry for quality measurement.
Materials and Methods: Data were sampled from 53 healthcare organizations in 2018. Organizations represented both ambulatory care practices and health systems participating in the state of Kansas HIE. Fourteen ambulatory quality measures for 5300 patients were calculated using the data from an individual EHR source and contrasted to calculations when HIE data were added to locally recorded data.
Results: A total of 79% of patients received care at more than 1 facility during the 2018 calendar year. A total of 12 994 applicable quality measure calculations were compared using data from the originating organization vs longitudinal data from the HIE. A total of 15% of all quality measure calculations changed (P < .001) when including HIE data sources, affecting 19% of patients. Changes in quality measure calculations were observed across measures and organizations.
Discussion: These results demonstrate that quality measures calculated using single-site EHR data may be limited by incomplete information. Effective data sharing significantly changes quality calculations, which affect healthcare payments, patient safety, and care quality.
Conclusions: Federal, state, and commercial programs that use quality measurement as part of reimbursement could promote more accurate and representative quality measurement through methods that increase clinical data sharing.
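A toy example of why adding HIE data changes a quality-measure calculation: a numerator event recorded at another facility is invisible to the single-site EHR, so the local rate understates performance. The measure, patient identifiers, and event names below are hypothetical.

```python
# Quality-measure rate computed from local EHR data alone vs. EHR + HIE data.
local_records = {          # patient_id -> events recorded in the local EHR
    "p1": {"hba1c_test"},
    "p2": set(),            # looks like a care gap locally...
    "p3": set(),
}
hie_records = {             # events contributed by other HIE facilities
    "p2": {"hba1c_test"},   # ...but the test happened elsewhere
}

def measure_rate(records):
    denominator = len(records)
    numerator = sum("hba1c_test" in events for events in records.values())
    return numerator / denominator

combined = {pid: events | hie_records.get(pid, set())
            for pid, events in local_records.items()}

print(f"single-EHR rate: {measure_rate(local_records):.0%}")   # 33%
print(f"EHR + HIE rate:  {measure_rate(combined):.0%}")        # 67%
```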

11.

Objective

There has been a consistent concern about the inadvertent disclosure of personal information through peer-to-peer file sharing applications, such as Limewire and Morpheus. Examples of personal health and financial information being exposed have been published. We wanted to estimate the extent to which personal health information (PHI) is being disclosed in this way, and compare that to the extent of disclosure of personal financial information (PFI).

Design

After careful review and approval of our protocol by our institutional research ethics board, files were downloaded from peer-to-peer file sharing networks and manually analyzed for the presence of PHI and PFI. The geographic region of the IP addresses was determined, and classified as either USA or Canada.

Measurement

We estimated the proportion of files that contain personal health and financial information for each region. We also estimated the proportion of search terms that return files with personal health and financial information. We ascertained and discuss the ethical issues related to this study.

Results

Approximately 0.4% of Canadian IP addresses had PHI, as did 0.5% of US IP addresses. There was more disclosure of financial information, at 1.7% of Canadian IP addresses and 4.7% of US IP addresses. An analysis of search terms used in these file sharing networks showed that a small percentage of the terms would return PHI and PFI files (ie, there are people successfully searching for PFI and PHI on the peer-to-peer file sharing networks).

Conclusion

There is a real risk of inadvertent disclosure of PHI through peer-to-peer file sharing networks, although the risk is not as large as for PFI. Anyone keeping PHI on their computers should avoid installing file sharing applications on their computers, or if they have to use such tools, actively manage the risks of inadvertent disclosure of their, their family's, their clients', or patients' PHI.

12.
Objectives: The coronavirus disease 2019 (COVID-19) is a resource-intensive global pandemic. It is important for healthcare systems to identify high-risk COVID-19-positive patients who need timely health care. This study was conducted to predict the hospitalization of older adults who have tested positive for COVID-19.
Methods: We screened all patients with COVID test records from 11 Mass General Brigham hospitals to identify the study population. A total of 1495 patients with age 65 and above from the outpatient setting were included in the final cohort, among which 459 patients were hospitalized. We conducted a clinician-guided, 3-stage feature selection and phenotyping process using iterative combinations of literature review, clinician expert opinion, and electronic healthcare record data exploration. A list of 44 features, including temporal features, was generated from this process and used for model training. Four machine learning prediction models were developed, including regularized logistic regression, support vector machine, random forest, and neural network.
Results: All 4 models achieved an area under the receiver operating characteristic curve (AUC) greater than 0.80. Random forest achieved the best predictive performance (AUC = 0.83). Albumin, an index for nutritional status, was found to have the strongest association with hospitalization among COVID-positive older adults.
Conclusions: In this study, we developed 4 machine learning models for predicting general hospitalization among COVID-positive older adults. We identified important clinical factors associated with hospitalization and observed temporal patterns in our study cohort. Our modeling pipeline and algorithm could potentially be used to facilitate more accurate and efficient decision support for triaging COVID-positive patients.
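For readers unfamiliar with this kind of model comparison, the sketch below fits the four model classes named above on a simulated feature matrix and compares them by AUC, then inspects the forest's feature importances. The dataset, hyperparameters, and feature indices are stand-ins, not the study's data or code.

```python
# Compare several classifiers by AUC on a simulated 44-feature cohort.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# 44 illustrative features, standing in for the clinician-curated feature list
X, y = make_classification(n_samples=1495, n_features=44, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=2000),
    "support vector machine": SVC(probability=True),
    "random forest": RandomForestClassifier(random_state=0),
    "neural network": MLPClassifier(max_iter=1000, random_state=0),
}
for name, model in models.items():
    auc = roc_auc_score(y_te, model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.2f}")

top = np.argsort(models["random forest"].feature_importances_)[::-1][:5]
print("top feature indices:", top)   # in the study, albumin ranked highest
```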

13.
Objective: The Global Digital Exemplar (GDE) Program is a national attempt to accelerate digital maturity in healthcare providers through promoting knowledge transfer across the English National Health Service (NHS). “Blueprints”—documents capturing implementation experience—were intended to facilitate this knowledge transfer. Here we explore how Blueprints have been conceptualized, produced, and used to promote interorganizational knowledge transfer across the NHS.
Materials and Methods: We undertook an independent national qualitative evaluation of the GDE Program. This involved collecting data using semistructured interviews with implementation staff and clinical leaders in provider organizations, nonparticipant observation of meetings, and key documents. We also attended a range of national meetings and conferences, interviewed national program managers, and analyzed a range of policy documents. Our analysis drew on sociotechnical principles, combining deductive and inductive methods.
Results: Data comprised 508 interviews, 163 observed meetings, and analysis of 325 documents. We found little evidence of Blueprints being adopted in the manner originally conceived by national program managers. However, they proved effective in different ways to those planned. As well as providing a helpful initial guide to a topic, we found that Blueprints served as a method of identifying relevant expertise that paved the way for subsequent discussions and richer knowledge transfers amongst provider organizations. The primary value of Blueprinting, therefore, seemed to be its role as a networking tool. Members of different organizations came together in developing, applying, and sustaining Blueprints through bilateral conversations—in some circumstances also fostering informal communities of practice.
Conclusions: Blueprints may be effective in facilitating knowledge transfer among healthcare organizations, but need to be accompanied by other evolving methods, such as site visits and other networking activities, to iteratively transfer knowledge and experience.

14.
Objective: During the coronavirus disease 2019 (COVID-19) pandemic, federally qualified health centers rapidly mobilized to provide SARS-CoV-2 testing, COVID-19 care, and vaccination to populations at increased risk for COVID-19 morbidity and mortality. We describe the development of a reusable public health data analytics system for reuse of clinical data to evaluate the health burden, disparities, and impact of COVID-19 on populations served by health centers.
Materials and Methods: The Multistate Data Strategy engaged project partners to assess public health readiness and COVID-19 data challenges. An infrastructure for data capture and sharing procedures between health centers and public health agencies was developed to support existing capabilities and data capacities to respond to the pandemic.
Results: Between August 2020 and March 2021, project partners evaluated their data capture and sharing capabilities and reported challenges and preliminary data. Major interoperability challenges included poorly aligned federal, state, and local reporting requirements; lack of unique patient identifiers; lack of access to pharmacy, claims, and laboratory data; missing data; and proprietary data standards and extraction methods.
Discussion: Efforts to access and align project partners’ existing health systems data infrastructure in the context of the pandemic highlighted complex interoperability challenges. These challenges remain significant barriers to real-time data analytics and efforts to improve health outcomes and mitigate inequities through data-driven responses.
Conclusion: The reusable public health data analytics system created in the Multistate Data Strategy can be adapted and scaled for other health center networks to facilitate data aggregation and dashboards for public health, organizational planning, and quality improvement, and can inform local, state, and national COVID-19 response efforts.

15.
Objective: In electronic health record data, the exact time stamp of major health events, defined by significant physiologic or treatment changes, is often missing. We developed and externally validated a method that can accurately estimate these time stamps based on accurate time stamps of related data elements.
Materials and Methods: A novel convolution-based change detection methodology was developed and tested using data from the national deidentified clinical claims OptumLabs data warehouse, then externally validated on a single-center dataset derived from the M Health Fairview system.
Results: We applied the methodology to estimate time to liver transplantation for waitlisted candidates. The median error between the estimated and actual dates was zero days, and the estimated date fell within the evaluation period around the actual date for 92% and 84% of the transplants in the development and validation samples, respectively.
Discussion: The proposed method can accurately estimate missing time stamps. Successful external validation suggests that the proposed method does not need to be refit to each health system; thus, it can be applied even when training data at the health system are insufficient or unavailable. The proposed method was applied to liver transplantation but can be more generally applied to any missing event that is accompanied by multiple related events that have accurate time stamps.
Conclusion: Missing time stamps in electronic health record data can be estimated using time stamps of related events. Since the model was developed on a nationally representative dataset, it could be successfully transferred to a local health system without substantial loss of accuracy.
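A rough sketch of convolution-based change detection in the spirit described above: a step-shaped kernel is slid over a daily signal built from related, accurately time-stamped events, and the strongest response marks the estimated event date. The signal, kernel shape, and window size are assumptions for illustration, not the paper's specification.

```python
# Estimate a missing event date as the strongest step-like change in a
# daily count of related events (e.g., post-transplant-like encounters).
import numpy as np

rng = np.random.default_rng(0)

# simulated daily event counts whose level shifts at day 120 (the true event)
signal = np.concatenate([rng.poisson(1.0, 120), rng.poisson(6.0, 80)]).astype(float)

half = 15                                              # kernel half-width in days
kernel = np.concatenate([-np.ones(half), np.ones(half)]) / (2 * half)

# cross-correlate the signal with the step kernel; peak marks the level shift
response = np.convolve(signal - signal.mean(), kernel[::-1], mode="valid")
estimated_day = int(np.argmax(response)) + half        # align response to signal index
print("estimated change point (day):", estimated_day)  # expect approximately 120
```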

16.
Objective: Providing behavioral health interventions via smartphones allows these interventions to be adapted to the changing behavior, preferences, and needs of individuals. This can be achieved through reinforcement learning (RL), a sub-area of machine learning. However, many challenges could affect the effectiveness of these algorithms in the real world. We provide guidelines for decision-making.
Materials and Methods: Using thematic analysis, we describe challenges, considerations, and solutions for algorithm design decisions in a collaboration between health services researchers, clinicians, and data scientists. We use the design process of an RL algorithm for “DIAMANTE,” a mobile health study for increasing physical activity in underserved patients with diabetes and depression. Over the 1.5-year project, we kept track of the research process using collaborative cloud Google Documents, WhatsApp messenger, and video teleconferencing. We discussed, categorized, and coded critical challenges, and grouped them to create thematic process domains.
Results: Nine challenges emerged, which we divided into 3 major themes: 1) choosing the model for decision-making, including appropriate contextual and reward variables; 2) data handling/collection, such as how to deal with missing or incorrect data in real time; 3) weighing algorithm performance vs effectiveness/implementation in real-world settings.
Conclusion: The creation of effective behavioral health interventions does not depend only on final algorithm performance. Many decisions in the real world are necessary to formulate the design of the problem parameters to which an algorithm is applied. Researchers must document and evaluate these considerations and decisions before and during the intervention period to increase transparency, accountability, and reproducibility.
Trial Registration: clinicaltrials.gov, NCT03490253.
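To make the RL design decisions above more tangible, here is a toy epsilon-greedy contextual bandit of the kind such an adaptive messaging study might consider. The contexts, action set, and reward definition are illustrative assumptions and not the DIAMANTE algorithm.

```python
# Toy epsilon-greedy contextual bandit for choosing a daily message type.
import random
from collections import defaultdict

ACTIONS = ["feedback_message", "motivational_message", "no_message"]
EPSILON = 0.1                    # exploration rate (assumed)

value = defaultdict(float)       # (context, action) -> running mean reward
count = defaultdict(int)

def choose(context):
    if random.random() < EPSILON:                              # explore
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: value[(context, a)])     # exploit

def update(context, action, reward):
    """Incremental mean update; reward could be next-day change in step count."""
    key = (context, action)
    count[key] += 1
    value[key] += (reward - value[key]) / count[key]

random.seed(0)
for day in range(100):
    context = "weekend" if day % 7 in (5, 6) else "weekday"
    action = choose(context)
    # simulated reward; in a real study it would come from the phone's step sensor
    reward = random.gauss(1.0 if action == "motivational_message" else 0.0, 1.0)
    update(context, action, reward)

print({a: round(value[("weekday", a)], 2) for a in ACTIONS})
```

Real deployments face exactly the issues the abstract lists: missing or delayed rewards, incorrect context data, and the trade-off between algorithm performance and practical implementation.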

17.
Objective: Multimodal automated phenotyping (MAP) is a scalable, high-throughput phenotyping method, developed using electronic health record (EHR) data from an adult population. We tested transportability of MAP to a pediatric population.
Materials and Methods: Without additional feature engineering or supervised training, we applied MAP to a pediatric population enrolled in a biobank and evaluated performance against physician-reviewed medical records. We also compared performance of MAP at the pediatric institution and the original adult institution where MAP was developed, including for 6 phenotypes validated at both institutions against physician-reviewed medical records.
Results: MAP performed equally well in the pediatric setting (average AUC 0.98) as it did at the general adult hospital system (average AUC 0.96). MAP’s performance in the pediatric sample was similar across the 6 specific phenotypes also validated against gold-standard labels in the adult biobank.
Conclusions: MAP is highly transportable across diverse populations and has potential for wide-scale use.

18.
Objective: We identified challenges and solutions to using electronic health record (EHR) systems for the design and conduct of pragmatic research.
Materials and Methods: Since 2012, the Health Care Systems Research Collaboratory has served as the resource coordinating center for 21 pragmatic clinical trial demonstration projects. The EHR Core working group invited these demonstration projects to complete a written semistructured survey and used an inductive approach to review responses and identify EHR-related challenges and suggested EHR enhancements.
Results: We received survey responses from 20 projects and identified 21 challenges that fell into 6 broad themes: (1) inadequate collection of patient-reported outcome data, (2) lack of structured data collection, (3) data standardization, (4) resources to support customization of EHRs, (5) difficulties aggregating data across sites, and (6) accessing EHR data.
Discussion: Based on these findings, we formulated 6 prerequisites for PCTs that would enable the conduct of pragmatic research: (1) integrate the collection of patient-centered data into EHR systems, (2) facilitate structured research data collection by leveraging standard EHR functions, usable interfaces, and standard workflows, (3) support the creation of high-quality research data by using standards, (4) ensure adequate IT staff to support embedded research, (5) create aggregate, multidata type resources for multisite trials, and (6) create re-usable and automated queries.
Conclusion: We are hopeful our collection of specific EHR challenges and research needs will drive health system leaders, policymakers, and EHR designers to support these suggestions to improve our national capacity for generating real-world evidence.

19.
Objective: Supporting public health research and the public’s situational awareness during a pandemic requires continuous dissemination of infectious disease surveillance data. Legislation, such as the Health Insurance Portability and Accountability Act of 1996 and recent state-level regulations, permits sharing deidentified person-level data; however, current deidentification approaches are limited. Namely, they are inefficient, relying on retrospective disclosure risk assessments, and do not flex with changes in infection rates or population demographics over time. In this paper, we introduce a framework to dynamically adapt deidentification for near-real-time sharing of person-level surveillance data.
Materials and Methods: The framework leverages a simulation mechanism, capable of application at any geographic level, to forecast the reidentification risk of sharing the data under a wide range of generalization policies. The estimates inform weekly, prospective policy selection to maintain the proportion of records corresponding to a group size less than 11 (PK11) at or below 0.1. Fixing the policy at the start of each week facilitates timely dataset updates and supports sharing granular date information. We use August 2020 through October 2021 case data from Johns Hopkins University and the Centers for Disease Control and Prevention to demonstrate the framework’s effectiveness in maintaining the PK11 threshold of 0.01.
Results: When sharing COVID-19 county-level case data across all US counties, the framework’s approach meets the threshold for 96.2% of daily data releases, while a policy based on current deidentification techniques meets the threshold for 32.3%.
Conclusion: Periodically adapting the data publication policies preserves privacy while enhancing public health utility through timely updates and sharing epidemiologically critical features.
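The PK11 risk metric above can be sketched directly: after a candidate generalization policy is applied to the quasi-identifiers, PK11 is the proportion of records whose equivalence class (group of records sharing the same quasi-identifier values) has fewer than 11 members. The policy choices, fields, and simulated case data below are illustrative assumptions, not the framework's actual policy space.

```python
# Compute PK11 under a few candidate generalization policies.
import numpy as np
import pandas as pd

def pk11(df, quasi_identifiers):
    group_sizes = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
    return float((group_sizes < 11).mean())

def apply_policy(df, age_bin=10, week_only=False):
    """One candidate generalization policy: coarsen age and, optionally, dates."""
    out = df.copy()
    out["age"] = (out["age"] // age_bin) * age_bin
    if week_only:
        out["report_date"] = pd.to_datetime(out["report_date"]).dt.to_period("W").astype(str)
    return out

rng = np.random.default_rng(0)
n = 3000
dates = list(pd.date_range("2021-03-01", periods=14).strftime("%Y-%m-%d"))
cases = pd.DataFrame({
    "age": rng.integers(0, 100, n),
    "sex": rng.choice(["F", "M"], n),
    "report_date": rng.choice(dates, n),
})
qi = ["age", "sex", "report_date"]

for policy in [dict(age_bin=1), dict(age_bin=10), dict(age_bin=10, week_only=True)]:
    risk = pk11(apply_policy(cases, **policy), qi)
    print(policy, f"PK11 = {risk:.3f}")
```

A forecasting-based framework like the one described would evaluate such policies on simulated future case counts and prospectively pick the least coarse policy that keeps PK11 under the chosen threshold.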

20.
Objective: Due to a complex set of processes involved with the recording of health information in the Electronic Health Records (EHRs), the truthfulness of EHR diagnosis records is questionable. We present a computational approach to estimate the probability that a single diagnosis record in the EHR reflects the true disease.
Materials and Methods: Using EHR data on 18 diseases from the Mass General Brigham (MGB) Biobank, we develop generative classifiers on a small set of disease-agnostic features from EHRs that aim to represent Patients, pRoviders, and their Interactions within the healthcare SysteM (PRISM features).
Results: We demonstrate that PRISM features and the generative PRISM classifiers are potent for estimating disease probabilities and exhibit generalizable and transferable distributional characteristics across diseases and patient populations. The joint probabilities we learn about diseases through the PRISM features via PRISM generative models are transferable and generalizable to multiple diseases.
Discussion: The Generative Transfer Learning (GTL) approach with PRISM classifiers enables the scalable validation of computable phenotypes in EHRs without the need for domain-specific knowledge about specific disease processes.
Conclusion: Probabilities computed from the generative PRISM classifier can enhance and accelerate applied Machine Learning research and discoveries with EHR data.
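As a rough analogue of the generative classifiers described above, the sketch below fits a Gaussian naive Bayes model on two hypothetical disease-agnostic features and returns the probability that a diagnosis record reflects true disease. The feature definitions, labeled sample, and model family are assumptions for illustration, not the PRISM specification.

```python
# A simple generative classifier estimating P(true disease | diagnosis record).
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

def sample(n, truly_diseased):
    """Toy 'PRISM-like' features for records carrying a diagnosis code."""
    lam = (4.0, 3.0) if truly_diseased else (1.2, 1.0)
    return np.column_stack([rng.poisson(lam[0], n),    # distinct coded days
                            rng.poisson(lam[1], n)])   # distinct providers coding it

X = np.vstack([sample(300, True), sample(300, False)])
y = np.concatenate([np.ones(300), np.zeros(300)])      # chart-review-style labels

model = GaussianNB().fit(X, y)

# probability that each new diagnosis record reflects the true disease
new_records = np.array([[1, 1], [5, 4]])
print(model.predict_proba(new_records)[:, 1].round(2))
```

Because the features are disease-agnostic, a model of this kind can in principle be transferred across phenotypes, which is the intuition behind the generative transfer learning claim above.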
