Li-Pang Chen, Introduction to Data Science: Data analysis and Prediction Algorithms with R, Journal of the Royal Statistical Society Series A: Statistics in Society, Volume 185, Issue 2, April 2022, Pages 733–734, https://doi.org/10.1111/rssa.12781
Data science has become a popular area and has attracted the attention of many researchers and scientists in recent years. It requires not only statistical knowledge but also computational skills. In 2019, Irizarry published a book entitled Introduction to Data Science: Data Analysis and Prediction Algorithms with R, whose purpose is to introduce R programming together with popular statistical methods and machine learning. The book includes 38 chapters, which the author groups into 6 parts. Moreover, well-organized case studies appear in each chapter, so that readers can learn suitable tools and appropriate solutions for real-world problems.
Part 1 contains 6 chapters and gives a detailed introduction to R programming, covering the R software itself, basic commands, and basic programming such as user-defined functions and loops, as well as tidy data. Part 2 focuses on data visualization, which belongs to descriptive statistics and is an important step in statistical analysis. The author introduces the R package ggplot2 to visualize data distributions through displays such as cumulative distribution functions, boxplots and scatterplots. In addition, outliers and basic summary measures, including the median, IQR and median absolute deviation, are described.
Part 3 covers fundamental statistics: classical probability theory, random variables and the relevant theory, and statistical inference (e.g. parameters and their estimators, construction of confidence intervals, power and p-values). Some further topics are also outlined, including Monte Carlo simulation, confounding, measurement error and the regression fallacy. Part 4 addresses data wrangling, whose purpose is to transform data from its raw form into a tidy form that facilitates the rest of the analysis. Typical tasks that require such steps include string processing, HTML parsing, handling dates and times, and text mining. tidyr is the key package for tidying data, and the author provides detailed descriptions as well as applications.
Part 5 focuses on machine learning. The author first introduces commonly used evaluation metrics for assessing classification performance. After that, methods for high-dimensional data (e.g. dimension reduction and regularization), supervised learning methods (e.g. K-nearest neighbours and random forests) and unsupervised learning approaches (e.g. hierarchical clustering and K-means) are summarized. In the last part, the author introduces some useful productivity tools for managing files (e.g. Git and GitHub) and programming code (e.g. RStudio), so that work developed in one project can be reused in others.
In summary, this book provides good insights and clear guidance on how to analyse data, from descriptive statistics to analytical procedures. It also strikes a balance between detailed descriptions of statistical methods and clear demonstrations of R code. The many colourful figures and step-by-step tutorials make it easy for readers to understand the key ideas and to reproduce the results themselves. This book is suitable for beginners in data science and is a good reference for graduate-level courses.