Primary Data
Important preliminary remark:
Before starting any scientific work in the field of medicine ("research on humans", i.e. research on living humans and on bodies of deceased persons, on human biomaterial, on data from humans and projects of epidemiological research with personal data), a (professional) ethical and legal consultation and approval is usually necessary (see also "Notes on planning, conducting, evaluating and publishing studies").
In almost every research project, personal data are collected and processed. Even pseudonymized subject or patient data are generally considered personal data and are therefore subject to the applicable data protection regulations. In addition, suitable technical and organizational measures must be taken to ensure data security - further information on this can be found, for example, under Research Data Protection on the pages of the Data Protection Officer of the University of Cologne.
Recommendations for the structure and collection of primary data
The following remarks are recommendations for collecting and structuring raw data. Failure to follow them can cause considerable additional work in computer-assisted data entry and significantly complicate the subsequent analysis of the data. If you have questions about the content, statistical consultation is available.
An abridged version of this document is available for download here: Recommendation Primary Data (PDF file) as well as "good" and "bad" data examples (xls file).
Data structure
The common statistical analysis programs (especially the preferred program packages IBM SPSS® Statistics and SAS®) require the raw data to be arranged in a "rectangular" data structure. This means that, for each case (i.e. for each observation unit, such as a patient), the same characteristics are recorded in exactly the same order and number. The variables belonging to a case are entered in one row, so that the number of rows equals the number of cases. Each characteristic is assigned a "field" within the row, with enough character positions to hold its values, so that the number of fields per row equals exactly the number of characteristics collected per case. Field lengths can vary from characteristic to characteristic, but each must be chosen so that every conceivable value of the characteristic can be recorded. For example, the characteristic "Body height in [cm]" can be recorded for all patients in a three-digit field, but not in a two-digit field.
A file structured this way contains one "row" per case (= observation unit). The first fields of each row are usually assigned to variables that can be used to distinguish the respective observation units in a pseudo- or anonymized way. If the patients of a sample are to be regarded as observation units, these could be the variables "Pseudonymized identification number", "Age", "Sex", etc. This is followed by the fields in which the measurements of other characteristics are recorded.
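As an illustration of the "rectangular" requirement, the following minimal Python sketch checks that every row has the same number of fields as the first row. The function name `check_rectangular` and the sample rows are hypothetical and only serve as an example; they are not part of any statistical package.

```python
def check_rectangular(rows):
    """Return the indices of rows whose field count differs from the first row."""
    if not rows:
        return []
    expected = len(rows[0])
    return [i for i, row in enumerate(rows) if len(row) != expected]


# Hypothetical sample data: the second row lacks one field.
rows = [
    ["971265", "25", "124", "110"],
    ["975621", "30", "140"],
    ["984528", "54", "134", "9999"],
]

# Indices (starting at 0) of rows that break the rectangular structure.
print(check_rectangular(rows))
```

A check of this kind can be run before the data are handed over for analysis, so that incomplete rows are found while the source documents are still at hand.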
If individual characteristics are repeatedly recorded for each observation unit at different times, such as when systolic blood pressure is measured immediately before and two hours after administration of a drug, a "separate" field and thus a separate variable must be assigned for this characteristic for each measurement time. In the above example, this would be the variables "systolic blood pressure before" and "systolic blood pressure after".
Pat_ID | Age | SysRR_1 | SysRR_2* | Weight |
---|---|---|---|---|
971265 | 25 | 124 | 110 | 76,0 |
975621 | 30 | 140 | 142 | 56,1 |
984528 | 54 | 134 | 9999 | 84,3 |
*) missing value: 9999
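The following Python sketch shows one possible way to read such a table while treating the code 9999 as a missing value. It assumes, purely for illustration, that the rows were saved as semicolon-separated text without a header row and that decimal commas are used as in the table; the names `parse_value` and `records` are hypothetical.

```python
import csv
import io

MISSING_CODE = "9999"  # missing-value code from the table footnote
COLUMNS = ["Pat_ID", "Age", "SysRR_1", "SysRR_2", "Weight"]

# Assumed export of the table above: semicolon-separated, no header row.
raw = """971265;25;124;110;76,0
975621;30;140;142;56,1
984528;54;134;9999;84,3
"""


def parse_value(text):
    """Convert a field to float; return None for the missing-value code."""
    text = text.strip()
    if text == MISSING_CODE:
        return None
    return float(text.replace(",", "."))  # accept a decimal comma


records = []
for fields in csv.reader(io.StringIO(raw), delimiter=";"):
    record = {"Pat_ID": fields[0].strip()}  # keep the ID as text
    for name, value in zip(COLUMNS[1:], fields[1:]):
        record[name] = parse_value(value)
    records.append(record)

# Mean of SysRR_2 over the non-missing measurements only.
valid = [r["SysRR_2"] for r in records if r["SysRR_2"] is not None]
mean_sysrr_2 = sum(valid) / len(valid)
print(mean_sysrr_2)
```

If the code 9999 were not declared as missing, it would enter the mean as a real measurement, which is why the missing-value code should always be documented together with the data.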
If all characteristics are surveyed more than once per observation unit, it is advisable to record these surveys row by row. Then two different identification codes are assigned: one for each observation unit (e.g. the patient ID) and one per survey (e.g. the number of the examination). Depending on the type of statistical analysis planned, the form in which measurement repetitions are recorded must be considered (assistance is provided here by the statistical supervisor).
Pat_ID | E_No | Age* | Pulse | Weight* |
---|---|---|---|---|
970001 | 1 | 38 | 85 | 65,1 |
970001 | 2 | 9999 | 90 | 66,2 |
975454 | 1 | 35 | 73 | 72,5 |
961111 | 1 | 44 | 68 | 83,5 |
961111 | 2 | 9999 | 60 | 9999 |
961111 | 3 | 9999 | 72 | 91,5 |
*) missing value: 9999
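Depending on the planned analysis, data recorded row by row per survey may have to be rearranged into one row per observation unit. The following Python sketch illustrates such a rearrangement for the Pulse values of the table above; the function `long_to_wide` and the generated column names `Pulse_1`, `Pulse_2`, ... are hypothetical, and examinations that did not take place are filled with the missing-value code 9999.

```python
MISSING = 9999  # missing-value code used in the tables above

# (Pat_ID, E_No, Pulse) taken from the table above.
long_rows = [
    ("970001", 1, 85),
    ("970001", 2, 90),
    ("975454", 1, 73),
    ("961111", 1, 68),
    ("961111", 2, 60),
    ("961111", 3, 72),
]


def long_to_wide(rows, missing=MISSING):
    """Turn one row per examination into one row per patient."""
    exam_numbers = sorted({e_no for _, e_no, _ in rows})
    by_patient = {}
    for pat_id, e_no, pulse in rows:
        by_patient.setdefault(pat_id, {})[e_no] = pulse
    header = ["Pat_ID"] + [f"Pulse_{n}" for n in exam_numbers]
    table = [
        [pat_id] + [values.get(n, missing) for n in exam_numbers]
        for pat_id, values in by_patient.items()
    ]
    return header, table


header, table = long_to_wide(long_rows)
print(header)
for row in table:
    print(row)
```

The result has one separate variable per measurement time, as described for repeated measurements of individual characteristics.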
The resulting file structure can be pictured as the contents of a card index box: each case corresponds to an index card (= "row"), and each surveyed characteristic corresponds to a fixed field on the card (= "variable") whose content varies from case to case (= from card to card).
In addition, the following applies:
- In addition to the actual raw data file, a complete list of variables and the respective value range is useful (so-called data description).
- Anonymize or (at least) pseudonymize personal data; do not record/transmit real names!
- In this context, it is essential to observe the legal regulations - information on this can be found, among other places, on the pages of the relevant data protection officers as well as professional guidelines and sector-specific laws (see also, for example, information on planning, conducting, evaluating and publishing studies).
Data acquisition
For the analysis with the IBM SPSS® Statistics program package, it is also possible to enter the raw data with the IBM SPSS® Statistics data editor. However, if the raw data are entered using other programs (e.g. Microsoft Excel®), the following hints should be noted:
- Variable names may be a maximum of 64 characters long, must begin with a letter (A-Z, a-z), and may not contain umlauts, ß, or special characters (e.g., ! % # - etc.) except for the underscore (_).
- Missing values must be indicated by a special code.
- Calendar dates must not be entered as text (e.g. "June 97").
- Plain text cannot be evaluated directly under any circumstances and must therefore be coded in a meaningful way (e.g. childhood diseases: 1 = measles, 2 = rubella, ...)
- Fields with numeric variables may only contain digits, a sign ("+" or "-"), and a decimal point or comma.
- If variable values can contain not only digits but also alphanumeric characters (i.e. character strings such as "T1a" or "X1y3"), they must contain only the standard ASCII characters, i.e. no umlauts and no ß, in order to be able to transfer the files without interference.
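Some of the hints above can be checked automatically before data entry begins. The following Python sketch is one possible reading of the rules for variable names and numeric fields; the function names and the exact regular expressions are assumptions made for illustration (for example, a numeric field is assumed to need at least one digit before an optional decimal separator).

```python
import re

# Begins with a letter, then letters, digits or underscore, 64 characters at most.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,63}")

# Optional sign, digits, optionally a decimal point or comma followed by digits.
NUMERIC_PATTERN = re.compile(r"[+-]?[0-9]+([.,][0-9]+)?")


def is_valid_variable_name(name):
    """Check a variable name against the naming hints above."""
    return NAME_PATTERN.fullmatch(name) is not None


def is_valid_numeric_field(text):
    """Check that a field contains only a number, without units or text."""
    return NUMERIC_PATTERN.fullmatch(text) is not None


for name in ["SysRR_1", "1st_visit", "Größe", "Body-Height"]:
    print(name, is_valid_variable_name(name))

for value in ["76,0", "-3.5", "12 kg", "June 97"]:
    print(value, is_valid_numeric_field(value))
```

Only `SysRR_1`, `76,0` and `-3.5` pass these checks; the other examples violate one of the hints (leading digit, umlaut and ß, hyphen, unit or text in a numeric field).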
System files are files that cannot be read directly as text files (e.g. in ASCII format), but can only be interpreted by the respective special program package. If the raw data are available as system files that do not correspond to any of the standard formats supported at the IMSB, the data must be transformed from the respective programs into portable files (so-called export files) or output as ASCII files before further processing with IBM SPSS® Statistics. For an import of ASCII files into IBM SPSS® Statistics, the following points must be carefully observed and, as a precaution, checked for compliance before data entry into the respective system:
- Missing values (missings) must be coded in such a way that the ported data can be interpreted by IBM SPSS® Statistics without errors.
- The variable values of each case must either be separated by a blank space, the tab character, or a special character, or each must begin in the same column.
- If the file contains alphanumeric or date variables that contain special or blank characters or are partially missing, they must be arranged in fixed columns. The files must not contain any content (such as header, blank, or result rows) other than the variable values.
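As a sketch of how these points might be met when exporting data, the following Python example writes rows as tab-separated text without header, blank, or result rows, replaces missing entries with an explicit code, and verifies that the output contains only ASCII characters. The function `export_ascii` and the choice of 9999 as the missing-value code are assumptions for illustration; the code that is actually suitable depends on the variables and should be agreed on beforehand.

```python
import csv
import io


def export_ascii(rows, missing_code="9999"):
    """Return the rows as tab-separated ASCII text without a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for row in rows:
        # Replace missing entries (None) with the explicit missing-value code.
        writer.writerow([missing_code if v is None else v for v in row])
    text = buffer.getvalue()
    text.encode("ascii")  # raises UnicodeEncodeError for umlauts, ß, etc.
    return text


# Hypothetical sample rows; None marks a missing measurement.
sample_rows = [
    [971265, 25, 124, 110],
    [984528, 54, 134, None],
]
exported = export_ascii(sample_rows)
print(exported, end="")
```

Every line of the result then contains exactly the variable values of one case, separated by the tab character.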
The use of a spreadsheet program such as Microsoft Excel® for data analysis cannot be recommended, as powerful statistical packages are available for these purposes. For data collection, see "good" and "bad" data examples (xls file).
Important notes
The responsibility for the data (formal or content-related correctness of the data, data backup, data protection) remains with the doctoral candidate or the medical supervisor. In particular, we refer to the "Regulations of the University of Cologne for Ensuring Good Scientific Practice and Dealing with Scientific Misconduct". Section 4 states, among other things, that the person responsible for a research project must ensure that primary data as the basis for publications are stored on durable and secure media for ten years at the institution where they were generated.
In addition, explicit reference is made to the "Guidelines for Safeguarding Good Scientific Practice" of the German Research Foundation (DFG) as well as to the "Guidelines for Handling Research Data" at the University of Cologne, the university-wide "Research Code of Conduct" and the Scientific Integrity Committees of the University of Cologne.
Further useful links can also be found on our page with advice on planning, conducting, evaluating and publishing studies. Please familiarize yourself with the documents relevant to you.