Unit II: Review of Basic Data Analytics

  • Exploratory Data Analysis (EDA) is a crucial first step in the data analysis process: the initial examination of a dataset to gain insights, discover patterns, and identify potential relationships between variables. It aims to summarize the data's main characteristics and surface trends that can inform further analysis or hypothesis generation. Key activities include the following (a short pandas sketch follows the list):
    • Summary statistics: Calculation of basic descriptive statistics such as mean, median, mode, standard deviation, and range to understand the central tendency, dispersion, and shape of the data.
    • Data visualization: Creation of visual representations such as histograms, box plots, scatter plots, and bar charts to visually explore the distribution, relationships, and patterns in the data.
    • Data cleaning: Identification and handling of missing values, outliers, or erroneous data points to ensure data quality and accuracy.
    • Correlation analysis: Examination of the strength and direction of relationships between variables using correlation coefficients or scatter plots.
    • Dimensionality reduction: Techniques like Principal Component Analysis (PCA) or t-SNE to reduce the dimensionality of high-dimensional data while preserving its structure and relationships.
    • Feature engineering: Transformation or creation of new variables based on domain knowledge or specific goals, which can enhance the predictive power of machine learning models.
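As a minimal illustration of the first three activities, the sketch below uses pandas and NumPy on a small synthetic dataset (the column names and values are invented for illustration):

```python
# A minimal EDA sketch: summary statistics, missing-value check, correlation.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "height_cm": rng.normal(170, 10, 100),
    "weight_kg": rng.normal(70, 12, 100),
})
df.loc[3, "weight_kg"] = np.nan   # inject a missing value to clean up later

print(df.describe())              # summary statistics: mean, std, quartiles, range
print(df.isna().sum())            # data cleaning: missing values per column
print(df.corr())                  # correlation matrix of the numeric columns
```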
  • Data visualization refers to the representation of data through visual elements like charts, graphs, and maps to facilitate understanding and interpretation of complex information.
    Common types include (a small matplotlib sketch follows the list):
    • Bar charts
    • Line charts
    • Scatter plots
    • Pie charts
    • Histograms
    • Heatmaps
    • Geographic maps
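The sketch below, assuming matplotlib is available, draws three of these chart types on synthetic data:

```python
# Bar chart, scatter plot, and histogram side by side with matplotlib.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].bar(["A", "B", "C"], [5, 3, 7])                    # compare categories
axes[0].set_title("Bar chart")
axes[1].scatter(rng.normal(size=50), rng.normal(size=50))  # show relationships
axes[1].set_title("Scatter plot")
axes[2].hist(rng.normal(size=500), bins=20)                # show a distribution
axes[2].set_title("Histogram")

plt.tight_layout()
plt.show()
```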
  • Data visualization offers several advantages but also has limitations:
    Advantages:
    • Improved comprehension
    • Effective communication
    • Decision-making support
    Disadvantages:
    • Misinterpretation
    • Data limitations
    • Overcomplication
    • Subjectivity
  • Model evaluation relies on quantitative metrics and validation techniques; common ones include (see the scikit-learn sketch after this list):
    • Mean Squared Error (MSE)
    • Accuracy
    • Precision, Recall, and F1-score
    • Receiver Operating Characteristic (ROC) curve
    • Area Under the Curve (AUC)
    • Cross-validation
    • Hypothesis testing
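The sketch below computes several of these metrics with scikit-learn on made-up labels and scores (cross-validation and hypothesis testing need a fitted model and sample data, so they are omitted here):

```python
# Classification and regression-style metrics on toy predictions.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, mean_squared_error)

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                  # made-up ground truth
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                  # made-up hard predictions
y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]  # made-up probabilities

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))  # area under the ROC curve
print("MSE      :", mean_squared_error(y_true, y_score))
```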
  • Hypothesis testing is a statistical technique used to make inferences and draw conclusions about a population based on sample data. It involves two competing hypotheses: the null hypothesis (H0) and the alternative hypothesis (H1).
    Example: In a study on a new drug's effect on blood pressure:
    - H0: The drug has no effect on blood pressure.
    - H1: The drug does have an effect on blood pressure.
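The blood-pressure example can be simulated end to end; all the numbers below are invented, and SciPy is assumed to be available:

```python
# Simulate two groups and run a two-sample t-test at alpha = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
placebo = rng.normal(140, 15, 30)   # systolic BP without the drug
drug    = rng.normal(132, 15, 30)   # systolic BP with the drug

t_stat, p_value = stats.ttest_ind(drug, placebo)
alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the drug appears to affect blood pressure.")
else:
    print("Fail to reject H0: no significant effect detected.")
```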
  • Student's t-test: Assumes equal variances (homoscedasticity); appropriate when the groups have similar variances, even with small samples.
    Welch's t-test: Does not assume equal variances (heteroscedasticity); more robust when variances or sample sizes differ.
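In SciPy, the equal_var flag of scipy.stats.ttest_ind selects between the two; a minimal contrast on synthetic data:

```python
# Student's vs Welch's t-test on groups with unequal variances and sizes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(10, 1, 25)   # smaller variance, smaller sample
b = rng.normal(11, 4, 40)   # larger variance, larger sample

print(stats.ttest_ind(a, b, equal_var=True))    # Student's t-test
print(stats.ttest_ind(a, b, equal_var=False))   # Welch's t-test
```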
  • The Wilcoxon Rank-Sum Test (Mann-Whitney U test) is a nonparametric test to compare the distributions or medians of two independent groups. It is used when data does not meet normality assumptions. It assigns ranks to all observations and compares the sum of ranks between groups.
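A minimal sketch with scipy.stats.mannwhitneyu on skewed (non-normal) synthetic data:

```python
# Wilcoxon rank-sum / Mann-Whitney U test on skewed data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.exponential(1.0, 40)   # skewed, clearly non-normal samples
group_b = rng.exponential(1.5, 40)

u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
```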
  • Type I Error (False Positive): Rejecting the null hypothesis when it is true (probability = α).
    Type II Error (False Negative): Not rejecting the null hypothesis when it is false (probability = β).
    Lowering α reduces Type I errors but increases Type II errors.
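The meaning of α can be checked by simulation: when H0 is true by construction, a test at α = 0.05 should reject about 5% of the time. A sketch:

```python
# Estimate the Type I error rate: both samples share one population, so H0 holds.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, rejections, trials = 0.05, 0, 2000

for _ in range(trials):
    a = rng.normal(0, 1, 30)
    b = rng.normal(0, 1, 30)
    if stats.ttest_ind(a, b).pvalue < alpha:
        rejections += 1               # every rejection here is a Type I error

print("Observed Type I error rate:", rejections / trials)   # close to alpha
```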
  • ANOVA (Analysis of Variance) compares the means of three or more groups to determine if at least one group mean is significantly different.
    Example: Comparing plant heights across three fertilizer treatments using ANOVA to see if any treatment leads to different mean heights.
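The fertilizer example with simulated heights (all values invented), using scipy.stats.f_oneway:

```python
# One-way ANOVA across three fertilizer treatments.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
fert_a = rng.normal(50, 5, 20)   # plant heights under fertilizer A
fert_b = rng.normal(55, 5, 20)   # fertilizer B (shifted mean)
fert_c = rng.normal(50, 5, 20)   # fertilizer C

f_stat, p_value = stats.f_oneway(fert_a, fert_b, fert_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")   # small p => some mean differs
```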
  • Clustering groups similar data points together. K-means clustering partitions data into k clusters by iteratively assigning points to the nearest centroid and updating centroids until convergence.
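A k-means sketch with scikit-learn on synthetic blobs (the cluster count and parameters are illustrative):

```python
# Partition 2-D points into three clusters with k-means.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)   # toy 2-D data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)   # assign/update loop

print(km.cluster_centers_)   # centroids after convergence
print(km.labels_[:10])       # cluster assignments of the first ten points
```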
  • The Apriori algorithm finds frequent itemsets in transactional data and generates association rules (sketched in code after the steps below).
    Steps:
    1. Calculate support for itemsets.
    2. Generate frequent itemsets above the support threshold.
    3. Generate rules from frequent itemsets and calculate confidence.
    4. Prune rules below confidence/support thresholds.
    Visualization: Rule tables, scatter plots, and network graphs.
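A pure-Python sketch of steps 1-4, restricted to itemsets of size 1 and 2 on a toy basket dataset (items and thresholds are invented; real implementations handle arbitrary itemset sizes far more efficiently):

```python
# Toy Apriori: support counting, frequent itemsets, rule generation, pruning.
from itertools import combinations

transactions = [{"milk", "bread"}, {"milk", "eggs"},
                {"milk", "bread", "eggs"}, {"bread", "eggs"}]
min_support, min_confidence = 0.5, 0.6
n = len(transactions)

def support(itemset):
    # Step 1: fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / n

items = sorted({i for t in transactions for i in t})
# Step 2: keep itemsets of size 1 and 2 whose support clears the threshold.
frequent = [frozenset(c) for k in (1, 2) for c in combinations(items, k)
            if support(frozenset(c)) >= min_support]

# Steps 3-4: form {A} -> {B} rules from frequent pairs, prune by confidence.
for pair in (f for f in frequent if len(f) == 2):
    for a in pair:
        b = next(iter(pair - {a}))
        confidence = support(pair) / support(frozenset({a}))
        if confidence >= min_confidence:
            print(f"{{{a}}} -> {{{b}}}  "
                  f"support={support(pair):.2f}  confidence={confidence:.2f}")
```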
  • An association rule is a pattern like {A} ➞ {B}, meaning if A occurs, B is likely to occur.
    Applications:
    • Market basket analysis
    • Recommender systems
    • Customer segmentation
    • Web usage mining
    • Healthcare
    • Fraud detection
  • Student's t-test determines if there is a significant difference between the means of two independent groups, assuming normality and equal variances. It calculates a t-statistic and compares it to a critical value to decide if the difference is significant.
  • Welch's t-test compares the means of two groups when variances are unequal. It adjusts the t-statistic and degrees of freedom for more robust results when variances or sample sizes differ.