

REVIEW ARTICLE 

Ahead of print publication 


Statistical methods used in medical research and cancer registries: A review
Nazir Ahmad Dar^{1}, Tavseef Ahmad Tali^{1}, Basharat Ahmad Gani^{2}, Mushtaq Ahmad Sofi^{1}, Shahid Rashid Sofi^{1}, Nazir Ahmad Khan^{1}, Arshad Manzoor Najmi^{1}, Afroz Fir^{1}, Syed Nisar Ahmad^{3}
^{1} Department of Radiation Oncology, Sheri Kashmir Institute of Medical Sciences, SKIMS Soura, Srinagar, Jammu and Kashmir, India ^{2} Department of Internal Medicine, Sheri Kashmir Institute of Medical Sciences, SKIMS Soura, Srinagar, Jammu and Kashmir, India ^{3} Department of Medical Oncology, Sheri Kashmir Institute of Medical Sciences, SKIMS Soura, Srinagar, Jammu and Kashmir, India
Date of Submission  14-Jun-2022
Date of Acceptance  29-Aug-2022
Date of Web Publication  02-Nov-2022
Correspondence Address: Mushtaq Ahmad Sofi, Department of Radiation Oncology, Sheri Kashmir Institute of Medical Sciences, SKIMS Soura, Srinagar, Jammu and Kashmir India
Source of Support: None, Conflict of Interest: None DOI: 10.4103/jrcr.jrcr_36_22
Medicine is an ever-changing science, and new knowledge is generated by research and clinical experience. Statistical methods play a vital role in medical research because they allow meaningful conclusions to be drawn from data. Analyzing data and interpreting results is the most exciting stage of research, but it requires a sound knowledge of statistical methods and of when each method is applicable. The statistical methods commonly used in medical research are descriptive and inferential. In descriptive statistics, we describe our data by organizing them into tables and diagrams, by condensing them, and by computing measures of central tendency, dispersion, and correlation. In inferential statistics, we determine whether the treatment or procedure under study produces a meaningful outcome. This is possible only with good knowledge of and skill in the statistical methods used in basic research, which allow clinical researchers to draw accurate and reasonable conclusions. Statistics provides sound methods for collecting data on health-related events and for summarizing and analyzing the results so that valid inferences can be drawn about the research hypothesis. Researchers use statistical tests such as the independent t-test (Student's t-test) and the Chi-square test to compare the treatments used in experimental studies and to check whether the difference between them is significant. The main role of a cancer registry is to capture a clear and complete picture of the cancer burden. To show how confident they are that the results did not happen by chance, researchers use confidence intervals. For example, a 95% confidence interval is a range calculated so that, if the study were repeated many times, about 95% of such intervals would contain the true value.
The aim of this review article is to make researchers aware of the importance and applicability of the statistical methods used in medical research and cancer registries.
Keywords: Cancer registries, descriptive statistics, inferential statistics, statistical methods
How to cite this URL: Dar NA, Tali TA, Gani BA, Sofi MA, Sofi SR, Khan NA, Najmi AM, Fir A, Ahmad SN. Statistical methods used in medical research and cancer registries: A review. J Radiat Cancer Res [Epub ahead of print] [cited 2022 Dec 4]. Available from: https://www.journalrcr.org/preprintarticle.asp?id=360386 
Introduction   
Anyone who is involved in medical research should always keep in mind that science is a search for the truth and that there is no room for bias or inaccuracy in statistical analyses or interpretation. Data analysis must therefore be undertaken in a careful and considered way by people who have a sound knowledge of the nature of the data and of their interpretation. Any errors in statistical analyses will mean that the conclusions of the study may be incorrect. As a result, many journals require reviewers to scrutinize the statistical aspects of submitted articles, and many research groups include statisticians who direct the data analyses. Analyzing data correctly and including detailed documentation are established markers of scientific integrity, which help other researchers reach the same conclusions.
John Snow, the father of epidemiology, studied the cholera epidemic in London in 1854 and demonstrated the value of combining epidemiological and statistical methods in medical research. Medical statistics gained popularity after Bradford Hill's lectures were published as a series of articles in the Lancet and then in book form as Principles of Medical Statistics.^{[1]}
Biostatistics is the part of statistics applied to biological areas. Biological laboratory experiments, medical research (including clinical research), and health services research all use statistical methods. There are three reasons to study biostatistics rather than general statistics:
 Some statistical methods are used more heavily in biostatistics than in other fields. For example, a general statistical textbook would not discuss the life table method of analyzing survival data, although it is important in many biostatistical applications
 Examples are drawn from the biological, medical, and healthcare areas, which gives us references in the specific field. It also helps us understand how to apply statistical methods to a specific field of research such as the biological and health sciences
 The third reason for a biostatistical text is to teach the material to an audience of health professionals. In this case, the interaction between students and teachers, but especially among the students themselves, is of great value in learning and applying the subject matter.
The process of converting data into meaningful information requires a special approach called statistics. Statistics is the body of methods for making wise decisions in the face of uncertainty. In other words, it can be defined as the collection, organization, summarization, analysis, and interpretation of numerical data. Biostatistics is the science that helps in managing medical uncertainties. It mainly consists of steps such as the generation of a hypothesis, the collection of data, and the application of statistical analysis. Ample knowledge of biostatistics is important for research scholars, medical students, and nursing students so that they can design epidemiological studies accurately and draw meaningful conclusions; inadequate knowledge of biostatistics leads to biased results.
Statistical methods help us to answer complex research questions and to handle problems that arise during the collection of data. Biostatistics has developed enormously in recent years because of continuing advances in diverse biomedical fields. For example, new problems in biomedical research have led to the development of new statistical methodologies that would not otherwise have arisen and, at the same time, have favored ingenious adaptations of classical statistical techniques to new contexts of application.^{[2],[3]}
The main role of statistics in research is to help design the research, analyze the data, and draw meaningful conclusions. A meaningful conclusion can be drawn only by using proper statistical tests. Statistics also helps to reduce a large volume of raw data to a form that can be read easily and used for further analysis.^{[4]}
The most important statistical methods used in basic research are summarized below:
Descriptive Statistical Methods   
Mean
The mean is the simplest and most frequently used measure of location. It is defined as the sum of the observations divided by the number of observations. Its most important drawback is that it is affected by extreme values.^{[5]}
Median
The median is defined as the middle observation. It divides the data into two equal parts, one comprising all the values less than the median and the other comprising all the values greater than the median. The median is not affected by extreme values. It is the only average that can be used for qualitative data that can be ranked. There are two cases, odd and even. When the number of observations is odd, we arrange the values in ascending (or descending) order and the middle value is the median. When the number of observations is even, we arrange the values in ascending (or descending) order and the median is the average of the two middle values.^{[6]}
Mode
Mode is the most frequently occurring value in a set of data. Mode is particularly useful in the study of popular sizes. Mode is the average to be used to find the ideal size in a series.^{[7]}
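The three measures of location described above can be compared on a small data set. The sketch below uses Python's standard `statistics` module; the hospital-stay values are hypothetical and were chosen only to show how one extreme value pulls the mean while leaving the median and mode unchanged.

```python
import statistics

# Hypothetical lengths of hospital stay in days; the last value is an extreme observation.
stay_days = [3, 4, 4, 5, 6, 7, 41]

mean_stay = statistics.mean(stay_days)      # sum of observations / number of observations
median_stay = statistics.median(stay_days)  # middle value of the ordered series (odd case)
mode_stay = statistics.mode(stay_days)      # most frequently occurring value

# Even case: the median is the average of the two middle values.
even_median = statistics.median([3, 4, 5, 6])

print(mean_stay, median_stay, mode_stay, even_median)
```

In this illustration the single 41-day stay raises the mean well above most of the observations, whereas the median and the mode stay close to the bulk of the data.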
Range
Range is the simplest measure of dispersion. It can be defined as the difference between the two extreme items of series. The utility of range is that it gives us an idea of variability very quickly.
Range = highest value of series − lowest value of series.
Standard deviation
The standard deviation (SD) is the measure of dispersion most used in research studies and is regarded as very satisfactory. The SD is defined as the positive square root of the mean of the squared deviations of the values from the mean. It describes how much individual measurements differ, on average, from the mean.
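As a minimal sketch of the two measures of dispersion, the range and the SD can be computed step by step as follows. The values are hypothetical. The first SD follows the definition given in the text (divisor n); the second uses the divisor n − 1, which is an addition here and is the form many statistical packages report for sample data.

```python
import math

# Hypothetical measurements, used only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: difference between the two extreme items of the series.
value_range = max(values) - min(values)

# SD: positive square root of the mean of the squared deviations from the mean.
mean = sum(values) / len(values)
squared_deviations = [(x - mean) ** 2 for x in values]
population_sd = math.sqrt(sum(squared_deviations) / len(values))

# Sample SD (divisor n - 1); not described in the text, shown for comparison.
sample_sd = math.sqrt(sum(squared_deviations) / (len(values) - 1))

print(value_range, population_sd, round(sample_sd, 3))
```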
Statistical Inferential Method   
Inferential statistical methods are used to draw meaningful inferences about the characteristics of a population, using tests such as the independent t-test and the Chi-square test to assess whether the treatments given to research subjects differ significantly in their effect. The fundamental principle in applying these methods is to check the normality of the distribution first: if the distribution is normal, we use a parametric test (Student's t-test or analysis of variance [ANOVA]), and if not, we use a nonparametric test (Chi-square test, sign test, Wilcoxon signed-rank test, or Mann–Whitney U test). A rule of thumb for checking normality is based on the mean and SD: if the SD is >50% of the mean (for example, an SD above 25 when the mean is 50), the distribution does not follow normality, and if the SD is <50% of the mean, the distribution follows normality. Some of the important parametric and nonparametric statistical tests most used in medical research are discussed below.
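The rule of thumb above can be written as a short check. This is only a sketch of that rough screen, not a formal normality test; the function name `passes_thumb_rule`, the use of the sample SD, and both data sets are assumptions made for illustration, and the rule is meaningful only for data with a positive mean.

```python
import statistics

def passes_thumb_rule(values):
    """Return True when the SD is below 50% of the mean (the rough rule in the text)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD; an assumption, the text does not say which SD
    return sd < 0.5 * mean

# Hypothetical data sets.
roughly_symmetric = [2, 4, 4, 4, 5, 5, 7, 9]
right_skewed = [1, 1, 2, 2, 3, 20]

print(passes_thumb_rule(roughly_symmetric))
print(passes_thumb_rule(right_skewed))
```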
Student's t-test and Analysis of Variance   
t-tests and ANOVA are both parametric tests and are extensively used in medical research to check the effectiveness of a treatment. Student's t-test is used when two groups are compared, and it has two types: (a) the paired t-test and (b) the unpaired t-test. ANOVA extends the t-test to more than two groups. Both methods are parametric because they assume normality of the data and equality of variances across the comparison groups. Both analyses compare the means of the groups and, when the raw data are skewed, may be performed on log-transformed data. A paired t-test (also known as a dependent or correlated t-test) is a statistical test that compares the means of two related groups, using the mean and SD of the paired differences, to determine whether there is a significant difference between them. An unpaired t-test (also known as an independent t-test) is a statistical procedure that compares the means of two independent or unrelated groups to determine whether there is a significant difference between the two.^{[8],[9],[10],[11]}
Paired t-tests are used when the same item or group is tested twice, which is known as a repeated measures t-test. Examples include (a) the before and after effect of a pharmaceutical treatment on the same group of people, (b) body temperature measured with two different thermometers on the same group of participants, and (c) standardized test results of a group of students before and after a study course.
An unpaired t-test is used to compare the means of two independent groups. An example of an appropriate use is research with two independent groups, such as women and men, that examines whether the average bone density is significantly different between the two groups^{[12],[13],[14]} [Table 1].
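To make the difference between the two t-tests concrete, the sketch below computes the test statistic and degrees of freedom for each. The function names and all measurements are hypothetical. The unpaired version assumes equal variances and uses the pooled variance. The P value would then be read from a t table or obtained from a statistical package, which is not done here.

```python
import math
import statistics

def paired_t(before, after):
    """t statistic and degrees of freedom for paired (before/after) measurements."""
    differences = [b - a for b, a in zip(before, after)]
    n = len(differences)
    standard_error = statistics.stdev(differences) / math.sqrt(n)
    return statistics.mean(differences) / standard_error, n - 1

def unpaired_t(group_a, group_b):
    """Pooled-variance t statistic and degrees of freedom for two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = ((n_a - 1) * statistics.variance(group_a)
                       + (n_b - 1) * statistics.variance(group_b)) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error
    return t, n_a + n_b - 2

# Hypothetical systolic blood pressure of the same five people before and after treatment.
bp_before = [140, 152, 138, 145, 150]
bp_after = [135, 147, 136, 140, 146]
t_paired, df_paired = paired_t(bp_before, bp_after)

# Hypothetical bone density (g/cm^2) in two independent groups.
density_men = [1.10, 1.15, 1.20, 1.25, 1.30]
density_women = [1.00, 1.05, 1.10, 1.15, 1.20]
t_unpaired, df_unpaired = unpaired_t(density_men, density_women)

print(round(t_paired, 3), df_paired, round(t_unpaired, 3), df_unpaired)
```

The paired test works on the within-person differences, so variation between people does not enter its standard error; the unpaired test must account for the spread within both groups.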
ANOVA is also a parametric test; it compares the means of different groups by analyzing the variance between and within them. When data are normally distributed, Student's t-test can be used to assess the significance of the difference between two sample means. To compare three or more independent groups simultaneously, ANOVA can be used. When only one qualitative variable defines the groups, a one-way ANOVA is performed.
For example, to study the effectiveness of different diabetes medications, scientists design an experiment to explore the relationship between the type of medicine and the resulting blood sugar level. The sample population is a set of people. We divide the sample population into multiple groups, and each group receives a particular medicine for a trial period. At the end of the trial period, blood sugar levels are measured for each of the individual participants. Then for each group, the mean blood sugar level is calculated. ANOVA helps to compare these group means to find out if they are statistically different or not.
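The diabetes example can be sketched as a one-way ANOVA computed by hand. The function name and the blood sugar values below are hypothetical; the code returns the F statistic with its degrees of freedom, and the P value would be obtained from an F table or a statistical package.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of measurements."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(group) * (m - grand_mean) ** 2
                     for group, m in zip(groups, group_means))
    # Variation of individual values around their own group mean.
    ss_within = sum((x - m) ** 2
                    for group, m in zip(groups, group_means) for x in group)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return f_statistic, df_between, df_within

# Hypothetical blood sugar levels (mg/dL) after the trial period, one list per medicine.
medicine_a = [104, 105, 106]
medicine_b = [106, 107, 108]
medicine_c = [108, 109, 110]

f_statistic, df_between, df_within = one_way_anova_f([medicine_a, medicine_b, medicine_c])
print(f_statistic, df_between, df_within)
```

A large F value indicates that the group means differ by more than would be expected from the variation within the groups.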
A nonparametric test is one that does not assume that the data follow a normal distribution. Using a nonparametric test does not mean that you know nothing about the population; it usually means that the population data do not follow a normal distribution. The rule of thumb is (a) for nominal or ordinal scale data use a nonparametric test and (b) for interval or ratio scale data use a parametric test, provided the data are normally distributed.
Chi-square test
It is a nonparametric test used to find the association between variables. It is used for categorical variables and can also be applied to continuous variables after grouping them into intervals.
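As an illustration of the test of association, the sketch below computes the Chi-square statistic for a contingency table from the observed and expected counts. The function name and the 2 × 2 table are hypothetical, and no continuity correction is applied.

```python
def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * column_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom

# Hypothetical counts: rows are treatment A and treatment B,
# columns are "responded" and "did not respond".
observed_counts = [[30, 20],
                   [10, 40]]

chi_square, degrees_of_freedom = chi_square_statistic(observed_counts)
print(round(chi_square, 3), degrees_of_freedom)
```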
Sign test
It is the simplest nonparametric test; it tests whether the median of the population differs from a reference or target value. In the sign test, we assign a positive (+) or negative (−) sign to every observation. When the observed value is greater than the reference value, the plus sign is used, and when the observed value is less than the reference value, the negative sign is used. When the observed value is equal to the reference value, the observation is eliminated.
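The sign-counting procedure can be sketched as follows. The function name, the observations, and the reference value are hypothetical. The two-sided P value is taken from the binomial distribution with probability 0.5, which is the usual exact form of the sign test and is an addition to what the text describes.

```python
from math import comb

def sign_test(values, reference):
    """Return (number of plus signs, number of minus signs, two-sided exact P value)."""
    plus = sum(1 for v in values if v > reference)   # observed value above the reference
    minus = sum(1 for v in values if v < reference)  # observed value below the reference
    n = plus + minus                                 # ties with the reference are eliminated
    k = min(plus, minus)
    # Probability of k or fewer of the rarer sign when plus and minus are equally likely.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return plus, minus, min(1.0, 2 * tail)

# Hypothetical observations compared with a reference median of 10.
observations = [12, 15, 9, 14, 16, 10, 13, 17, 11, 18]
plus_count, minus_count, p_value = sign_test(observations, reference=10)
print(plus_count, minus_count, p_value)
```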
Wilcoxon signed-rank test
It estimates the population median and compares it to a reference/target value, assuming that the data come from a symmetric distribution (like the Cauchy or uniform distribution).
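Unlike the sign test, the signed-rank test also uses the size of each difference from the reference value. A minimal sketch of the rank sums is given below, assuming the common conventions that zero differences are dropped and tied absolute differences share their average rank; the function name and data are hypothetical, and the P value lookup is omitted.

```python
def signed_rank_sums(values, reference):
    """Return (W+, W-): sums of ranks of positive and negative differences."""
    differences = [v - reference for v in values if v != reference]
    ordered = sorted(differences, key=abs)  # rank by absolute size of the difference

    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + 1 + j + 1) / 2  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus

# Hypothetical observations compared with a reference median of 10.
w_plus, w_minus = signed_rank_sums([12, 15, 9, 14, 16], reference=10)
print(w_plus, w_minus)
```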
Mann–Whitney test
It compares two independent groups when the dependent variable is either ordinal or continuous.
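A simple way to see what this test measures is to count, over all pairs formed from the two groups, how often a value in one group exceeds a value in the other. The sketch below computes the U statistics in that way, counting ties as one half; the function name and the scores are hypothetical, and the P value lookup is omitted.

```python
def mann_whitney_u(group_a, group_b):
    """Return (U for group A, U for group B) by direct pairwise comparison."""
    u_a = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u_a += 1.0   # A value ranks above B value
            elif a == b:
                u_a += 0.5   # a tie counts as half
    u_b = len(group_a) * len(group_b) - u_a
    return u_a, u_b

# Hypothetical ordinal pain scores in two independent groups.
scores_a = [3, 5, 8, 9]
scores_b = [1, 2, 4, 6, 7]

u_a, u_b = mann_whitney_u(scores_a, scores_b)
print(u_a, u_b, min(u_a, u_b))
```

The smaller of the two U values is the one usually compared with the critical value in published tables.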
Cancer Registries   
Projection of cancer incidence is essential for planning cancer control actions, health care, and allocation of resources. Estimates and projections of cancer incidence are derived from cancer registry data. The cancer registry is an organization for the systematic collection, storage, analysis, interpretation, and reporting of data on subjects with cancer. There are two types of cancer registries, namely the population-based cancer registry (PBCR) and the hospital-based cancer registry (HBCR).
A PBCR systematically collects data on all new cases of cancer occurring in a well-defined population from multiple sources such as government hospitals, private hospitals, nursing homes, clinics, diagnostic laboratories, imaging centers, hospices, and registrars of births and deaths. The coverage is about 10% of the population in India. The National Cancer Registry Program (NCRP) started with a network of three PBCRs in Bangalore, Chennai, and Mumbai and three HBCRs in Chandigarh, Dibrugarh, and Thiruvananthapuram. The number of registries working under the program has expanded greatly since its inception, and presently there are 36 PBCRs and 236 HBCRs registered under the NCRP.
Since cancer is not a notifiable disease, cancer registration in India is active and staff of all registries visit hospitals, pathology laboratories, and all other sources of registration of cancer cases on a routine basis. Death certificates are also scrutinized from the local government units such as municipal corporations and Panchayat Raj Institutes, and information is collected on all cases where cancer is mentioned as a cause of death on the death certificates.^{[15],[16],[17],[18],[19],[20],[21]}
Definitions, Statistical Terms, and Methods   
Cancer registration
It is the process of continuing, systematic collection of data on the occurrence and characteristics of reportable neoplasms with the purpose of helping to assess and control the impact of malignancies on the community.
Cancer case
All neoplasms with a behavior code of “3” as defined by the International Classification of Diseases for Oncology, Third edition are considered reportable.
Cancer registry
It is the office or institution which attempts to collect, store, analyze, and interpret data on persons with cancer.
Population-based cancer registries
These registries systematically collect information on reportable neoplasms from multiple sources in a geographically defined population residing in the area for 1 year.
Hospital-based cancer registries
These registries are concerned with recording the information on the treatment, management, and outcome of cancer patients registered in a particular hospital.
Sources of registration
Hospitals or cancer centers are the sources of registration for cancer registries.
Data processing
Data processing means checking the quality of the data and identifying errors that may have been committed during data submission. These errors must be rectified during data processing before further data analysis is done.
Crude incidence rate refers to the rate obtained by dividing the total number of new cancer cases in a particular year by the corresponding estimated midyear population and multiplying by 100,000.
Age-specific rate (ASpR) refers to the rate obtained by dividing the total number of cancer cases in a given age group by the corresponding estimated population in that age group (for the same gender/site/geographic area/time period) and multiplying by 100,000.
Age-adjusted rate (AAR) or age-standardized rate: cancer incidence increases as age increases. Therefore, the higher the proportion of the older population, the higher is the number of cancers. Most developed and western countries have a higher proportion of the older population. Hence, to make rates of cancer comparable between countries, a world standard population that takes this into account is used to arrive at the AAR or age-standardized rate. This is calculated according to the direct method (Boyle and Parkin, 1991) by obtaining the ASpRs and applying these rates to the standard population in that age group.^{[22],[23],[24],[25],[26]}
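The three rates defined above can be worked through on a small example. In the sketch below, the age bands, case counts, populations, and standard-population weights are all hypothetical; in particular, the weights are not the actual world standard population and serve only to show the arithmetic of the direct method.

```python
# Hypothetical registry data for one year: new cases and midyear population by age group.
# The "standard" weights are illustrative and are NOT the real world standard population.
age_groups = {
    "0-39":  {"cases": 10,  "population": 50_000, "standard": 60_000},
    "40-64": {"cases": 60,  "population": 30_000, "standard": 30_000},
    "65+":   {"cases": 130, "population": 20_000, "standard": 10_000},
}
PER = 100_000  # rates are expressed per 100,000 population

# Crude incidence rate: all new cases divided by the whole midyear population.
total_cases = sum(group["cases"] for group in age_groups.values())
total_population = sum(group["population"] for group in age_groups.values())
crude_rate = total_cases / total_population * PER

# Age-specific rates (ASpR): cases divided by the population of the same age group.
age_specific_rates = {
    name: group["cases"] / group["population"] * PER
    for name, group in age_groups.items()
}

# Age-adjusted rate (direct method): apply each ASpR to the standard population
# of that age group and divide by the total standard population.
total_standard = sum(group["standard"] for group in age_groups.values())
age_adjusted_rate = sum(
    age_specific_rates[name] * group["standard"]
    for name, group in age_groups.items()
) / total_standard

print(round(crude_rate, 1), age_specific_rates, round(age_adjusted_rate, 1))
```

In this illustration the age-adjusted rate is lower than the crude rate because the hypothetical standard population is younger than the registry population, which is the kind of difference in age structure that standardization is meant to remove.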
Conclusion   
The purpose of this article is to make scientists working in medical centers aware of the importance of the statistical methods used in medical research and cancer registries. Good knowledge of statistical methods allows an adequate analysis of the data.
Financial support and sponsorship
Nil.
Conflicts of interest
There are no conflicts of interest.
References   
1.  Hill AB. Principles. In: Hill AB, editor. Principles of Medical Statistics. 1 ^{st} ed. London: Lancet, Oxford University Press; 1937. p. 189. 
2.  DeMets DL, Stormo G, Boehnke M, Louis TA, Taylor J, Dixon D. Training of the next generation of biostatisticians: A call to action in the U.S. Stat Med 2006;25:3415-29.
3.  Zelen M. Biostatisticians, biostatistical science and the future. Stat Med 2006;25:3409-14.
4.  Sprent P. Statistics in medical research. Swiss Med Wkly 2003;133:522-9.
5.  Petrie A, Sabin C, editors. Describing data. In: Medical Statistics at a Glance. UK: Blackwell Science Ltd.; 2000. p. 169.
6.  Kuzma JW, Bohnenblust SE, editors. Summarizing Data: Basic Statistics for the Science. London: Mayfield Publishing Company; 2001 p. 4454.
7.  Manikandan S. Measures of central tendency: Median and mode. J Pharmacol Pharmacother 2011;2:214-5.
8.  Bewick V, Cheek L, Ball J. Statistics review 10: Further nonparametric methods. Crit Care 2004;8:196-9.
9.  Altman DG, Bland JM. Parametric v nonparametric methods for data analysis. BMJ 2009;338:a3167.
10.  Kaur SP. Variables in research. Indian J Res Rep Med Sci 2013;4:36-8.
11.  Ali Z, Bhaskar SB. Basic statistical tools in research and data analysis. Indian J Anaesth 2016;60:662-9.
12.  Magnello ME. Karl Pearson and the origin of modern statistics: An elastician becomes a statistician. Rutherford J 2005-2006;1. Available from: http://Rutherford.org. [Last accessed on 2022 Jun 12].
13.  Rana R, Singhal R. Chi-square test and its application in hypothesis testing. J Pract Cardiovasc Sci 2015;1:69-71.
14.  Acheson ED. Medical Record Linkage. London: Oxford University Press; 1967. 
15.  American Cancer Society. Manual of Tumor Nomenclature and Coding. Washington, DC: American Cancer Society; 1951. 
16.  Baker RJ, Nelder JA. The GLIM System Release 3: Generalized Interactive Linear Modelling. Oxford: Numerical Algorithms Group; 1978. 
17.  Barclay TH. Canada, Saskatchewan. In: Waterhouse J, Muir CS, Correa P, Powell J, editors. Cancer Incidence in Five Continents. Vol. III (IARC Scientific Publications No. 15). Lyon: International Agency for Research on Cancer; 1976. p. 160-3.
18.  Powell J, editor. Cancer Incidence in Five Continents. Vol. III (IARC Scientific Publications No. 15). Lyon: International Agency for Research on Cancer; 1976. p. 160-3.
19.  Danish Cancer Registry. Cancer Incidence in Denmark 1981 and 1982. Copenhagen: Danish Cancer Society; 1985. 
20.  Danish Cancer Society Danish Cancer Registry. Cancer Incidence in Denmark 1984. Copenhagen: Danish Cancer Society; 1987. 
21.  National Cancer Registry Programme (ICMR). Time Trends in Cancer Incidence Rates 1982-2010. Bangalore: National Cancer Registry Programme (ICMR); 2013.
22.  National Cancer Registry Programme (ICMR). Three-Year Report of Hospital Based Cancer Registries 2007-2011. Bangalore: National Cancer Registry Programme (ICMR); 2013.
23.  National Cancer Registry Programme (ICMR). Consolidated Report of Hospital Based Cancer Registries 2012-2014. Bangalore: National Cancer Registry Programme (ICMR); 2016.
24.  National Cancer Registry Programme (ICMR). Three-Year Report of Population Based Cancer Registries 2012-2014. Bangalore: National Cancer Registry Programme (ICMR); 2016.
25.  National Cancer Registry Programme (ICMR). A Report on Cancer Burden in North Eastern States of India 2012-2014. Bangalore: National Cancer Registry Programme (ICMR); 2017.
26.  National Cancer Registry Programme (ICMR). A Report on Cancer Burden in North Eastern States of India 2012-2016. Bangalore: National Cancer Registry Programme (ICMR); 2020.
[Table 1]
