Author Archives: Hye Chung Kum

About Hye Chung Kum

Associate Professor

Salud Para Usted y Su Familia

[Health for You and Your Family]

Welcome to the project website

The long-term goal of Salud Para Usted y Su Familia [Health for You and Your Family] (SPUSF) is to reduce the incidence of overweight and obesity among Mexican-heritage children from limited-resource colonias/neighborhoods along the Arizona, New Mexico, and Texas borders with Mexico through a Promotora-led, family-based obesity prevention program that integrates research, education, and extension to target food and beverage consumption, physical activity, and screen-time by changing individual and family behaviors and the home environment in a coordinated manner.

  • USDA (2/01/2015-1/31/2020)
  • PI: Joseph Sharkey
  • Family-Focused Childhood Obesity Prevention

 

Instructions

You have received the link to each document over email, but we are keeping all the links here so it is easy for you to keep track of them.

PCORI Award: Diabetes Education and Wellness Through Faith-based Organizations (FBOs) in Texas

The Patient-Centered Outcomes Research Institute (PCORI) has awarded Texas A&M University with a Tier I award in the amount of $15,000 for a 9-month period. Tier I awards fund the building of the community and capacity necessary to later develop a patient-centered comparative effectiveness research project.  The project is awarded to Mark Lawley of the Industrial and Systems Engineering Department to examine Diabetes Education and Wellness Through Faith-based Organizations (FBOs) in Texas.  Only 17% of proposals submitted were selected for PCORI’s Tier I award.

Diabetes is a chronic disease requiring behavior modification and lifestyle changes to manage the disease. Diabetes is the 7th leading cause of death in the U.S. and costs $245 billion per year.  Texas is the 5th leading state in diabetes prevalence.  It is difficult to effectively control the fast growing trend in diabetes prevalence in Texas since risk factors are very prevalent. For example, about 1 of 3 adults in Texas are obese, and 2 of 3 are either overweight or obese.  Also, more than 50% of adults in Texas are not physically active and about 3 of 4 adults have fewer than 5 servings of fruits and vegetables each day.

Fortunately, proper management reduces the risk of disease progression and complications.  Often, disease management is taught through diabetes education and wellness classes.  FBOs describe organizations or programs associated with a religious congregation and account for a variety of religious backgrounds (e.g., Christian, Catholic, Jewish, Muslim, etc.).  Some FBOs have successfully partnered with health promotion programs to provide preventative health services to at-risk populations with chronic diseases.  FBOs have regular access to a captive adult audience of patients and volunteers and they typically have strong community credibility. Therefore, FBOs will be of central importance in facilitating diabetes management and improving population health.

There are three main thrusts to be executed for this project: partnership development, communication structure, and leadership structure.  For partnership development, the goal is to build a partnership network of more than 40 researchers, diabetes educators, clinicians, patients, and FBOs who are interested in comparing the awareness, behavior modification, and disease-management success of patient populations who receive diabetes education and wellness from traditional sources vs. FBOs.   For communication, the team will utilize a listserv and is currently in the process of developing a website for the partnership network. Finally, for leadership development, the team will form an internal governance structure to facilitate discussions about using FBOs for diabetes education and wellness.

The partnership team initially consists of three researchers (Mark Lawley, Hye-Chung Kum, Michelle Alvarado) from Texas A&M University (TAMU) and the President (Charles Bell) of the Diabetes Health and Wellness Institute (DHWI) at Juanita J. Craft Recreation Center.  Mark Lawley, Ph.D., P.E., (PI) is the TEES Research and One Health Professor of Industrial and Systems Engineering and Biomedical Engineering. Hye-Chung Kum, Ph.D., MSW (Co-PI) is an associate professor of Health Policy and Management at the School of Rural Public Health in the Texas A&M Health Science Center. Michelle Alvarado, Ph.D., (Project Lead) is a postdoctoral research associate in the Industrial and Systems Engineering Department at Texas A&M University.

PCORI’s mission is to help people make informed healthcare decisions, and improve healthcare delivery and its outcomes, by producing and promoting high-integrity, evidence-based information that comes from research guided by patients, caregivers, and the broader healthcare community.  This is the second year PCORI has funded Tier I awards in its “Pipeline to Proposal” process.  Pipeline to Proposal is a three-tier process that aims to build a national community of patients, stakeholders, and researchers who have the expertise and passion to participate in patient-centered outcomes research that leads to high-quality research proposals. Upon successful completion of PCORI’s Tier I award, projects are eligible to advance to Tier II ($25,000 for 12 months) for further development of the partnerships.  This year, 27 of 30 projects advanced from Tier I to Tier II.  Another competitive process is required to receive a PCORI Tier III award ($50,000 for 12 months), whose purpose is to develop high-quality research proposals.


Source: http://www.lchdhealthcare.org/information/diabetes-information/

“The incidence of type II diabetes is increasingly prevalent in the Texas population.  We feel that utilizing FBOs as a means of communicating diabetes education and wellness can be effective in reducing this prevalence.  The PCORI funding will be instrumental in allowing us to develop the partnerships necessary to pursue this research idea.”
Dr. Michelle Alvarado

Record Linkage Basics

What’s Record Linkage

Record linkage (RL), also known as “duplicate detection”, “record matching”, “data matching”, or the “object identity problem”, refers to the task of finding entries that refer to the same individual in two or more files. It is an appropriate technique when you have to join data sets that do not have a unique database key in common. A data set that has undergone record linkage is said to be linked.

For example, suppose a table maintained by the University of North Carolina at Chapel Hill (UNC) stores one student per entry, with the columns “onyen”, “First Name”, “Last Name”, and “SSN”. Bank of America keeps the last three items in its own table of customers. Now take a pair of entries, one from each table. If the nine-digit SSNs in the two entries are the same and the names match, we can conclude that the bank customer is the UNC student; this pair is linked.

Record Linkage in Research

When research requires linking historical records with current surveys or records, record linkage is used to connect the old and new data sets. This need is common, because the information in a record is updated over time as a person’s status changes. In longitudinal research, reconstructing one data set requires linking the data sets from each period so that the series of records for each individual can be tracked.

Record linkage is also needed to link data across agencies. Each agency stores its data in formats suited to its own purposes, so researchers working across different areas need to link data sets from each of them. However, the information systems of different agencies do not share a common ID. Without common IDs, linking records reliably and accurately across data sources is a major challenge.

Basic Algorithm of Record Linkage

There are two basic methods of record linkage: deterministic record linkage and probabilistic record linkage.  Deterministic record linkage is defined as linking two (or more) tables based on agreement rules (exact, approximate, and partial) for matching variables, which are often structured hierarchically.  That is, deterministic record linkage compares one identifier or a group of identifiers across databases; a link is made if all of the compared fields in the records agree to an acceptable level. In practice, people use identifiers such as Social Security Number, birth date, and first and last name as the basic fields to compare. Using combinations of different identifying fields can increase the validity of the links made.
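The deterministic approach can be sketched in a few lines of Python. This is only a minimal illustration, assuming the simplest agreement rule (exact match on every field after trimming whitespace and lower-casing); the function names, field names, and sample records are hypothetical, and the SSNs are made-up placeholders.

```python
def normalize(value):
    """Trim and lower-case a value so trivial formatting differences do not block a match."""
    return value.strip().lower()


def deterministic_link(records_a, records_b, fields):
    """Return (a, b) pairs whose normalized values agree on every field in `fields`."""
    # Index file B by its composite key so each record in file A needs one lookup.
    index = {}
    for rec in records_b:
        key = tuple(normalize(rec[f]) for f in fields)
        index.setdefault(key, []).append(rec)

    links = []
    for rec in records_a:
        key = tuple(normalize(rec[f]) for f in fields)
        for match in index.get(key, []):
            links.append((rec, match))
    return links


# Hypothetical records for illustration only.
students = [
    {"first": "Ann", "last": "Lee", "ssn": "000-00-0001"},
    {"first": "Bob", "last": "Ray", "ssn": "000-00-0002"},
]
customers = [
    {"first": "ann ", "last": "LEE", "ssn": "000-00-0001"},
    {"first": "Bob", "last": "Roy", "ssn": "000-00-0002"},
]

links = deterministic_link(students, customers, ["first", "last", "ssn"])
print(len(links))  # 1: only Ann agrees on all three fields
```

Requiring all three fields to agree rejects the second pair because the last names differ, whereas linking on the SSN alone would accept it. Approximate or partial agreement rules would replace the exact comparison inside the key.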

Probabilistic record linkage is based on the assumption that no single match between variables common to the source databases will identify a client with complete reliability. Instead, a probability function estimates how likely it is that two records belong to the same client, using identifying information such as last and first name, birth date, Social Security Number, or other fields that exist in both data sets.

The process of record linkage can be conceptualized as identifying matched pairs among all possible pairs of observations from two data sets. In practice, one defines the set of true matches (the M set) and the set of true nonmatches (the U set), which are characterized by the m probability and the u probability. The m probability is the probability that a field agrees given that the record pair being examined is a matched pair, and the u probability is the probability that a field agrees given that the record pair being examined is not a matched pair. Although there are many methods for calculating the m and u probabilities, recent studies report that maximum-likelihood-based methods such as the Expectation-Maximization (EM) algorithm are the most effective of the existing algorithms.
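As a rough illustration of how EM can estimate the m and u probabilities, the sketch below treats each candidate pair as a tuple of 0/1 field agreements and assumes the fields agree independently within the matched and nonmatched groups. The function names, starting values, and sample agreement patterns are all hypothetical; this is not a production estimator.

```python
def clamp(x, eps=1e-6):
    """Keep a probability strictly inside (0, 1) to avoid division by zero."""
    return min(max(x, eps), 1 - eps)


def em_m_u(patterns, n_iter=100, p=0.1, m_start=0.9, u_start=0.1):
    """Estimate the match proportion p and per-field m and u probabilities.

    `patterns` holds one tuple of 0/1 agreement indicators per candidate pair.
    """
    n = len(patterns)
    k = len(patterns[0])
    m = [m_start] * k
    u = [u_start] * k
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a true match.
        posteriors = []
        for pattern in patterns:
            like_m, like_u = p, 1 - p
            for i, agrees in enumerate(pattern):
                like_m *= m[i] if agrees else 1 - m[i]
                like_u *= u[i] if agrees else 1 - u[i]
            posteriors.append(like_m / (like_m + like_u))
        # M-step: re-estimate p, m, and u from the posterior-weighted counts.
        match_mass = sum(posteriors)
        p = clamp(match_mass / n)
        for i in range(k):
            agree_match = sum(g for g, pat in zip(posteriors, patterns) if pat[i])
            agree_nonmatch = sum(1 - g for g, pat in zip(posteriors, patterns) if pat[i])
            m[i] = clamp(agree_match / match_mass)
            u[i] = clamp(agree_nonmatch / (n - match_mass))
    return p, m, u


# Hypothetical agreement patterns for three fields (1 = the field agrees).
patterns = (
    [(1, 1, 1)] * 20 + [(1, 1, 0)] * 5 + [(1, 0, 1)] * 5
    + [(0, 0, 0)] * 60 + [(0, 0, 1)] * 10 + [(0, 1, 0)] * 10
)
p_hat, m_hat, u_hat = em_m_u(patterns)
print(p_hat, m_hat, u_hat)
```

The estimates depend on the starting values and on how well the independence assumption holds, so in practice they should be checked against a sample of pairs reviewed by hand.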

Using m and u probabilities, weight is defined to measure the contribution of each field to the probability of making an accurate classification of each pair into M or U sets. The “agreement” weight when a field agrees between the two records being examined is calculated as log2(m/u); the “disagreement” weight when a field does not agree is calculated as log2((1-m)/(1-u)). These weights will vary based on the distribution of values of the identifiers and indicate how powerful a particular variable is in determining whether two records are from the same individual.
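To make the two weight formulas concrete, here is a small Python sketch. The m and u values are hypothetical and chosen only to contrast a highly discriminating field with a weak one.

```python
import math


def field_weights(m, u):
    """Return the (agreement, disagreement) weights for one field."""
    agreement = math.log2(m / u)
    disagreement = math.log2((1 - m) / (1 - u))
    return agreement, disagreement


# Hypothetical probabilities: an SSN rarely agrees by chance, a birth month often does.
ssn_agree, ssn_disagree = field_weights(m=0.95, u=0.001)
month_agree, month_disagree = field_weights(m=0.97, u=1 / 12)

print(round(ssn_agree, 2), round(ssn_disagree, 2))
print(round(month_agree, 2), round(month_disagree, 2))
```

Because the u probability of the SSN is so small, agreement on it earns a much larger positive weight than agreement on birth month, which is what makes it the more powerful identifier in this example.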

Using the composite weight, calculated by summing the weights of the individual data fields, one can classify each pair of records into three groups: a link when the composite weight is above an upper threshold value (U), a non-link when the composite weight is below a lower threshold value (L), and a possible link for clerical review when the composite weight is between L and U. (The threshold U here is distinct from the U set of true nonmatches.) The threshold values can be calculated from the accepted probability of false matches and the accepted probability of false nonmatches.

Based on the theory above, record linkage research has focused mainly on how to match fields and how to determine the threshold values U and L so that pairs are classified as links or non-links more accurately.
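The three-way classification can be sketched as follows. The per-field m and u probabilities and both threshold values are hypothetical; in a real application the thresholds would be derived from the accepted error rates rather than picked by hand.

```python
import math

# Hypothetical (m, u) probabilities for each compared field.
PROBS = {
    "last_name": (0.90, 0.05),
    "first_name": (0.85, 0.10),
    "birth_date": (0.95, 0.01),
}


def composite_weight(agreements, probs):
    """Sum the agreement or disagreement weight of every field."""
    total = 0.0
    for field, (m, u) in probs.items():
        if agreements[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(weight, upper, lower):
    """Classify a pair as a link, a non-link, or a possible link for clerical review."""
    if weight > upper:
        return "link"
    if weight < lower:
        return "non-link"
    return "possible link"


UPPER, LOWER = 6.0, 0.0  # hypothetical thresholds

all_agree = {"last_name": True, "first_name": True, "birth_date": True}
names_only = {"last_name": True, "first_name": True, "birth_date": False}
none_agree = {"last_name": False, "first_name": False, "birth_date": False}

for pair in (all_agree, names_only, none_agree):
    print(classify(composite_weight(pair, PROBS), UPPER, LOWER))
```

With these numbers, a pair that agrees on both names but not on the birth date falls between the two thresholds and is sent to clerical review.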

Fields Matching Methods

Records Matching Techniques

Conclusion

References:

Dunn, Halbert L. (December 1946). “Record Linkage” American Journal of Public Health 36 pp. 1412–1416.

Bong-Ju Lee. Matching and Cleaning Administrative Data. In Studies of Welfare Populations: Data Collection and Research Issues (M. Ver Ploeg, A. Moffitt, and C. Citro, editors). National Academy Press.

Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, Vassilios S. Verykios. Duplicate Record Detection: A Survey. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19 JANUARY 2007

Record Linkage References

Why is this important

  • [RECOMMENDED] Goth G. Running on EMPI. Health information exchanges and the ONC keep trying to find the secret sauce of patient matching. Health data management. 2014;22(2):52-, 4, 6 passim.

Detailed survey in computer science

  • [RECOMMENDED] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, Vassilios S. Verykios. Duplicate Record Detection: A Survey. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19 JANUARY 2007
  • [RECOMMENDED] M. Elfeky, V. Verykios, A. Elmagarmid. TAILOR: A Record Linkage Tool Box. In Proceedings of the 18th International Conference on Data Engineering (ICDE 2002). IEEE Computer Society, Washington, DC, USA
  • N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data (SIGMOD ’06). ACM, New York, NY, USA, 802-803. DOI=10.1145/1142473.1142599 http://doi.acm.org/10.1145/1142473.1142599

What is actually done in the field

  • [RECOMMENDED] S. Weber, H. Lowe, A. Das, et al. A simple heuristic for blindfolded record linkage. J Am Med Inform Assoc. 2012.
  • [RECOMMENDED] F. Boscoe, D. Schrag, K. Chen, et al. Building capacity to assess cancer care in the Medicaid population in New York State. Health Services Research 2011;46(3): 805-20
  • https://www.census.gov/srd/papers/pdf/rrs2006-02.pdf

Private Record Linkage

  • [RECOMMENDED] Rob Hall and Stephen E. Fienberg: Privacy-Preserving Record Linkage. Privacy in Statistical Databases 2010: Lecture Notes in Computer Science, 2011, Volume 6344/2011, pp 269-283, DOI: 10.1007/978-3-642-15838-4_24.
  • Vatsalan, D., Christen, P., & Verykios, V. S. (2013). A taxonomy of privacy-preserving record linkage techniques. Information Systems, 38(6), 946-969
  • L. Bonomi, L. Xiong, J. Lu. LinkIT: Privacy Preserving Record Linkage and Integration via Transformations (demo track). In SIGMOD, 2013
  • http://hiplab.mc.vanderbilt.edu/projects/soempi/ (most recent work in the field)
  • A. Inan, M. Kantarcioglu, E. Bertino, and M. Scannapieco. A hybrid approach to private record linkage. In ICDE, pp 496-505. IEEE, 2008
  • T. Churches and P. Christen. Blind data linkage using n-gram similarity comparisons. In H. Dai, R. Srikant, and C. Zhang, editors, PAKDD, volume 3056 of Lecture Notes in Computer Science, pp 121-126. Springer, 2004

Recent papers based on data mining and machine learning techniques

  • McCoy AB, Wright A, Kahn MG, Shapiro JS, Bernstam EV, Sittig DF. Matching identifiers in electronic health records: implications for duplicate records and patient safety. Bmj Quality & Safety. Mar 2013;22(3):219-224.
  • Peter Christen. 2008. Automatic Record Linkage using Seeded Nearest Neighbor and Support Vector Machine Classification. Proceedings of the ACM SIGKDD 2008 conference, Las Vegas, August 2008.
  • Sunita Sarawagi and Anuradha Bhamidipaty. 2002. Interactive deduplication using active learning. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ’02). ACM, New York, NY, USA, 269-278. DOI=10.1145/775047.775087 http://doi.acm.org/10.1145/775047.775087
  • M. Bilenko, B. Kamath, and R. J. Mooney. Adaptive Blocking: Learning to Scale Up Record Linkage. In Sixth International Conference on Data Mining (ICDM ’06), pp. 87-96, 18-22 Dec. 2006. DOI: 10.1109/ICDM.2006.13

Cornerstone Papers for Probabilistic Record Linkage

  • H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James. Automatic Linkage of Vital Records, Science, 130, pp. 954-959. 1959
  • I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association 1969;64: pp 1183–1210

Papers that look at the impact of record linkage on analysis

  • I. Baldi, A. Ponti, R. Zanetti, G. Ciccone, F. Merletti, and D. Gregori. The impact of record-linkage bias in the Cox model. Journal of Evaluation in Clinical Practice. 16: 92-96. 2010.
  • P. Lahiri and M. Larsen. Regression analysis with linked data. Journal of the American Statistical Association, 100(469):222-230, March 2005
  • F. Scheuren and W. E. Winkler. Regression Analysis of Data Files That Are Computer Matched – Part II. Survey Methodology, 23, 157-165. 1997.

Available Software

  • P. Jurczyk, J. J. Lu, L. Xiong, J. D. Cragan, A. Correa, FRIL: A Tool for Comparative Record Linkage, American Medical Informatics Associations (AMIA) 2008 Annual Symposium
  • Febrl
  • Linkagewiz. http://www.linkagewiz.com/index.htm
  • K. Campbell, D. Deck, and A. Krupski. 2008. Record linkage software in the public domain: a comparison of Link Plus, The Link King, and a ‘basic’ deterministic algorithm. Health Informatics Journal March 2008 vol. 14 no. 1 5-15

Two CS faculty who focus on record linkage

Managing Diabetes in the Digital World

Submitted Proposals

  • NSF: A Smart Diabetes Management System (SDMS)
    PI: Dr. Lawley
  • Google: Virtual Village: Multimedia Social Networking for Managing Type 2 Diabetes
    PI: Dr. Kum

Current Projects

  • modeling scheduling and no-shows at the clinic
  • surveying access to technology among clients
  • modeling continuous glucose monitoring data

Editors in Linux

The most difficult hurdle for many students who start to use Linux is becoming proficient in an editor. The editor is how you communicate with the computer, so spending a little time becoming proficient in a powerful editor is worth it.

This page has some information about the most common editors. If you are totally new to Linux, you can use nano (or pico) for simple things to get you going.

Editors

  • nano or pico : for simple editing
    • nano fn.sas
  • emacs : use ESS for sas
  • vim : see below for more information

ESS setup for Emacs users

  1. Use command “ls -a” to check if the “.emacs” file exists.
  2. If the “.emacs” file exists, open it using the “emacs .emacs” command.
  3. In the .emacs file, add the line (load "/opt/HPM/bin/ess-13.09/lisp/ess-site") at the top.
  4. Save the file and exit. (Press Ctrl+x then Ctrl+s to save, and Ctrl+x then Ctrl+c to exit.)
  5. Edit a SAS file in your directory and check that the editor now shows the syntax in different colors.

old use server

Installation Guide

Connecting to Linux Server

  • Connection configuration (PuTTY & WinSCP & Xming)
  • You will need the IP ADDRESS (66.64.81.149) of the server as you work through the configuration.  Please do not share this IP address with unauthorized users.  It is RESTRICTED information.  Knowing the IP address opens up more potential for attack on the server.  This is why we do not have this information in the pdf document that is more widely accessed.

Using PC SAS to submit jobs to the Linux Server

      • Open the Base SAS application on your PC or laptop
      • At the top of your program, type in the following (run this at the beginning every time you start up Base SAS)
        %let server = 66.64.81.149 5019;
        options comamid = tcp remote = server;
        signon username = _prompt_;
        run;
      • Submit this small program. It will ask for your login/pwd for the server. Log in.
      • After that, you can put any code you want to run on the server between the following two keywords (rsubmit and endrsubmit)
      • Example (this code should work for everyone. Run this to confirm you are set up correctly.)
        rsubmit;
        libname in "/opt/HPM/usr/kum/data";
        proc print data=in.test(obs=10);
        run;
        endrsubmit;
      • More documentation