What are Statistics and types of Statistics?

Hina anum September 22, 2023

Data Science Statistics
June 15, 2022

The study of data gathering, analysis, presentation, and interpretation is known as statistics. Government demands for census data and information on a variety of economic processes provided much of the early inspiration for the field of statistics. The current requirement to convert vast volumes of data accessible in a variety of applications into valuable information has prompted theoretical and practical improvements in statistics.

Data are the facts and figures that are collected, analyzed, and summarized for presentation and interpretation. There are two categories of data: quantitative and qualitative. Qualitative data provide labels or names for groups of comparable items, whereas quantitative data measure how much or how many of something there is.

Consider a study that seeks information on the age, gender, marital status, and annual income of a group of 100 people. These attributes are the variables of the study, and each participant has one data value for each variable. A 28-year-old single man with an annual income of $30,000 would be recorded with the data values 28, male, single, and $30,000. With 100 participants and four variables, the data set contains 100 × 4 = 400 items.
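As a rough sketch of how such a data set might be laid out in Python, the snippet below stores one participant as a record with the four variables. The field names are assumptions chosen for illustration, and the data set is filled with placeholder copies of the single example record only so that the items can be counted.

```python
# Each participant is one record; each key is one study variable.
# Field names are illustrative assumptions, not taken from a real study.
variables = ["age", "gender", "marital_status", "annual_income"]

# The example participant from the text: 28, male, single, $30,000.
participant = {
    "age": 28,                   # quantitative
    "gender": "male",            # qualitative
    "marital_status": "single",  # qualitative
    "annual_income": 30_000,     # quantitative
}

# Placeholder data set: 100 copies of the same record, used only to count items.
dataset = [dict(participant) for _ in range(100)]

total_items = sum(len(record) for record in dataset)
print(total_items)  # 100 participants x 4 variables = 400
```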

What are the population and sample?

Population:

A population is a distinct group of individuals, whether it is the people of a country or any collection of individuals who share a common characteristic. In statistics, the population is the complete group from which a sample is drawn for a study.

For example, if we wish to make generalizations about all crows, the statistical population is the collection of all crows that exist now, have ever existed, or will exist in the future. In this situation and many others, it is not feasible to observe the complete statistical population because of time limits, geographic accessibility constraints, and the researcher’s limited resources. Instead, a researcher studies a statistical sample to gain insight into the entire population.

A government may wish to collect information on all residents of a certain area, including gender, color, income, and religion. A census is a way of gathering such information from every member of a population rather than from a sample.

Sample:

A sample is a smaller, more manageable subset of a larger population that shares the population’s characteristics. Samples are used when a population is too large for a statistical study to include every member or observation. A sample should be representative of the full population, not just of one part of it.

For instance, if a pharmaceutical corporation wishes to look into the negative side effects of a medicine on the whole population of a country, a research study that covers everyone is almost impossible. In this scenario, the researcher picks a group of people from each demographic and conducts the study on them, which gives an indication of the drug’s side effects in the wider population.
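The relationship between a population and a sample can be sketched in a few lines of Python. The population below is entirely hypothetical (randomly generated ages), and simple random sampling is used as one possible way to draw the sample; the example only illustrates that a sample statistic can approximate the population value.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population: ages of 10,000 people, generated at random.
population = [random.randint(18, 90) for _ in range(10_000)]

# Simple random sample of 200 people, drawn without replacement.
sample = random.sample(population, k=200)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two means will usually be close but not identical, which is the gap that inferential statistics tries to quantify.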

Types of statistics:

There are two types of statistics.

inferential statistics
descriptive statistics

What is inferential statistics?

Inferential statistics use a random sample of data from a population to describe the population and draw inferences about it. Inferential statistics are useful when it is neither possible nor practical to examine every member of a population. Measuring the diameter of every nail produced in a mill, for example, is impractical. Instead, the diameters of a random sample of nails can be measured, and the information from the sample can be used to draw conclusions about the diameters of all the nails. Inferential statistics is one of the two primary branches of statistics.
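To make the nail example concrete, here is a minimal sketch that estimates the mean diameter of all nails from a sample and attaches an approximate 95% confidence interval. The measurements are simulated, and the process values (3.0 mm mean, 0.05 mm spread) and the sample size of 50 are assumptions made purely for illustration. The interval uses the normal approximation, which is one common choice among several.

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical measurements: diameters (mm) of 50 randomly sampled nails.
# The process values (3.0 mm mean, 0.05 mm spread) are assumptions.
sample = [random.gauss(3.0, 0.05) for _ in range(50)]

n = len(sample)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)        # sample standard deviation (divides by n - 1)
standard_error = sample_sd / math.sqrt(n)

# Approximate 95% confidence interval using the normal critical value.
z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96
lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error

print(f"estimated mean diameter: {sample_mean:.4f} mm")
print(f"approx. 95% CI: ({lower:.4f}, {upper:.4f}) mm")
```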

What are descriptive statistics and types of descriptive statistics?

“Descriptive statistics” refers to the analysis, summary, and presentation of findings related to a data set obtained from a sample or the entire population.

To make the data easier to understand, descriptive statistics summarize the data with numbers such as the mean and the median. They do not require any inference or extrapolation beyond what is already known. Unlike inferential statistics, descriptive statistics are merely a description of the available data and are not based on probability theory.

Types of descriptive statistics:

Measures of central tendency
Measures of dispersion (or variability)

What are measures of central tendency?

A measure of central tendency is a single number that broadly characterizes the center of the data. There are three common measures of this kind.

Mean:

The mean is defined as the ratio of the sum of all the observations in the data to the total number of observations. It is also known as the average. As a consequence, the mean depends on every value in the data set.
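The definition translates directly into code. The values below are made up for the example, and `statistics.mean` from the Python standard library is shown alongside the manual calculation.

```python
import statistics

# Illustrative observations (made up for this example).
data = [4, 8, 15, 16, 23, 42]

mean_by_hand = sum(data) / len(data)   # sum of observations / number of observations
mean_builtin = statistics.mean(data)   # same result from the standard library

print(mean_by_hand)  # 18.0
```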

Median:

The median is the point that divides the sorted data into two halves: half of the observations are lower than the median and the other half are higher. Arrange the data in ascending or descending order before computing the median.

• If the number of observations is odd, the median is the middle observation in the sorted form.

• If the number of observations is even, the median is the average of the two middle observations in the sorted data.
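The odd and even rules can be sketched as a small function. The function name `median` and the sample values are illustrative, and the sketch assumes a non-empty list of numbers.

```python
import statistics

def median(values):
    """Return the median using the odd/even rule (assumes a non-empty list)."""
    ordered = sorted(values)          # arrange in ascending order first
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd count: the single middle value
        return ordered[middle]
    # even count: average of the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_data = [7, 1, 5, 3, 9]        # sorted: 1 3 5 7 9
even_data = [7, 1, 5, 3, 9, 11]   # sorted: 1 3 5 7 9 11

print(median(odd_data))    # 5
print(median(even_data))   # 6.0
```

In practice, `statistics.median` from the standard library gives the same results.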

Mode:

The mode is the value that appears most often in the data set. A data set can have one or more modes.

• If only one value appears the most times, the data is said to be unimodal.

• If two values appear the most times, the data is said to be bimodal; if more than two values appear the most times, the data is said to be multimodal.
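A possible way to find the mode or modes is to count how often each value occurs. The helper names `modes` and `describe` are hypothetical and the inputs are made up. Note that under this simple rule, a data set in which every value occurs exactly once reports every value as a mode.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

def describe(values):
    """Label the data as unimodal, bimodal, or multimodal."""
    k = len(modes(values))
    return {1: "unimodal", 2: "bimodal"}.get(k, "multimodal")

print(modes([1, 2, 2, 3]), describe([1, 2, 2, 3]))          # [2] unimodal
print(modes([1, 1, 2, 2, 3]), describe([1, 1, 2, 2, 3]))    # [1, 2] bimodal
```

The standard library function `statistics.multimode` returns the same list of modes.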

What are Measures of dispersion (or variability)?

Measures of dispersion describe how widely the data are spread around the central value (that is, around a measure of central tendency).

Absolute deviation from mean:

Mean absolute deviation (MAD) is a measure of variation that gives the average absolute distance between each data point in the data set and the mean.
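As a short worked example with made-up values, the mean absolute deviation can be computed in three steps: find the mean, take the absolute distance of each point from it, and average those distances.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values

mean = statistics.mean(data)                          # 5
absolute_deviations = [abs(x - mean) for x in data]   # 3, 1, 1, 1, 0, 0, 2, 4
mad = sum(absolute_deviations) / len(data)

print(mad)  # 1.5
```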

Variance:

Variance measures how far the data points lie from the mean; it is the average of the squared deviations from the mean. Data points with a large variance are widely scattered, whereas data points with a small variance lie closer to the mean of the data set.

Standard deviation:

The standard deviation is defined as the square root of the variance, so it is expressed in the same units as the data.
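The sketch below computes the variance and the standard deviation for a small made-up data set. It uses the population form (dividing by n); the sample form, which divides by n - 1, is also common, and the standard library provides both.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values
mean = sum(data) / len(data)       # 5.0

# Population variance: average squared deviation from the mean.
variance = sum((x - mean) ** 2 for x in data) / len(data)
std_dev = math.sqrt(variance)      # standard deviation = square root of variance

print(variance)  # 4.0
print(std_dev)   # 2.0

# Standard library equivalents:
# pvariance/pstdev divide by n, variance/stdev divide by n - 1.
print(statistics.pvariance(data), statistics.pstdev(data))
print(statistics.variance(data), statistics.stdev(data))
```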

Range:

The range of a data set is the difference between its maximum and minimum values.

Quartiles:

Quartiles are the points that divide a data set into four equal parts. The first, second, and third quartiles of the data set are Q1, Q2, and Q3, respectively.

• 25% of the data points fall below Q1, whereas 75% fall above it.

• Half of the data points are lower than Q2, while the other half are higher. Q2 is the median.

• 75% of the data points are lower than Q3, while 25% are higher.
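The range and the quartiles can be illustrated with the standard library. The data are made up, and `statistics.quantiles` is called with its default method; several quartile conventions exist, so other tools may return slightly different values for the same data.

```python
import statistics

data = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]   # illustrative values

# Range: maximum minus minimum.
data_range = max(data) - min(data)

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)

print("range:", data_range)
print("Q1, Q2, Q3:", q1, q2, q3)
print("median:", statistics.median(data))   # same value as Q2
```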

Skewness:

Skewness measures the asymmetry of a probability distribution. It can be positive, negative, or zero, and for some distributions it is undefined.

Positively skewed: The tail on the right-hand side of the curve is longer than the tail on the left-hand side. In these distributions, the mean is typically greater than the mode.

Negatively skewed: The tail on the left-hand side of the curve is longer than the tail on the right-hand side. In these distributions, the mean is typically less than the mode.
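One common way to put a number on skewness is the moment coefficient, the third central moment divided by the second central moment raised to the power 1.5. The sketch below uses that definition as an assumption (other definitions exist), with made-up data, and it requires data that are not all identical.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]    # long tail on the right
left_tailed = [-x for x in right_tailed]    # mirror image: long tail on the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed) > 0)   # True
print(skewness(left_tailed) < 0)    # True
print(skewness(symmetric))          # 0.0
```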

Kurtosis:

Kurtosis indicates whether the data are heavy-tailed (outliers present) or light-tailed (few or no outliers) compared to a normal distribution. There are three forms of kurtosis:

Mesokurtic: The kurtosis is similar to that of a normal distribution, so the excess kurtosis (kurtosis minus 3) is zero.

Leptokurtic: The tails of the distribution are heavy (outliers are present), and the kurtosis is higher than that of a normal distribution.

Platykurtic: The tails of the distribution are light (few or no outliers), and the kurtosis is lower than that of a normal distribution.
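The three forms can be explored with a small sketch based on excess kurtosis, computed here as the fourth central moment divided by the squared second central moment, minus 3. The function names, the data sets, and the tolerance used to call a value "close to zero" are all assumptions made for illustration.

```python
import random

def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

def classify(k, tolerance=0.1):
    """Label by excess kurtosis; the tolerance is an arbitrary illustrative cut-off."""
    if k > tolerance:
        return "leptokurtic"
    if k < -tolerance:
        return "platykurtic"
    return "mesokurtic"

random.seed(1)
normal_like = [random.gauss(0, 1) for _ in range(20_000)]   # simulated normal data
flat = list(range(1, 101))                # evenly spread values, light tails
with_outliers = [0] * 98 + [-10, 10]      # mostly identical values plus two extremes

for name, data in [("normal_like", normal_like),
                   ("flat", flat),
                   ("with_outliers", with_outliers)]:
    k = excess_kurtosis(data)
    print(name, round(k, 2), classify(k))
```

The simulated normal data should give a value near zero, the evenly spread data a negative value, and the data with two extreme points a large positive value.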

What is sampling distribution?

A sampling distribution shows the values that a statistic takes across random samples drawn from a larger body of data. When working with large volumes of data, data scientists may use sampling distributions to estimate attributes of the data, such as the mean or standard deviation. Such attributes of the full population are called parameters.

A sampling distribution is also a probability distribution, since it uses probability to inform the data scientist about sample statistics. When dealing with large amounts of data, using a sampling distribution makes it easier to draw conclusions. As a result, it is commonly used in data science as a statistical tool.
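A sampling distribution can be approximated by simulation: draw many random samples, compute the statistic for each one, and look at how those values are spread. The sketch below does this for the sample mean. The population is hypothetical (randomly generated and deliberately skewed), and the sample size and number of samples are arbitrary choices.

```python
import random
import statistics

random.seed(7)

# Hypothetical population of 10,000 values (skewed on purpose).
population = [random.expovariate(1.0) for _ in range(10_000)]

sample_size = 40
num_samples = 2_000

# Draw many random samples and record the mean of each one.
sample_means = [
    statistics.fmean(random.sample(population, sample_size))
    for _ in range(num_samples)
]

# The collection of sample means approximates the sampling distribution of the mean.
population_mean = statistics.fmean(population)
mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.stdev(sample_means)
theoretical_spread = statistics.pstdev(population) / sample_size ** 0.5

print(f"population mean:           {population_mean:.3f}")
print(f"mean of sample means:      {mean_of_means:.3f}")
print(f"std. dev. of sample means: {spread_of_means:.3f}")
print(f"sigma / sqrt(n):           {theoretical_spread:.3f}")
```

The mean of the sample means should land close to the population mean, and their spread should be close to the population standard deviation divided by the square root of the sample size.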