This book covers fundamental theoretical and practical aspects of data analysis and is useful for both beginners and experienced researchers who are looking for a recipe or an analysis approach. Because R has so many packages, even experienced researchers look up how particular functions are used in an analysis workflow.
About the author
Dr. Altuna Akalin is a bioinformatics scientist and the head of the Bioinformatics and Omics Data Science Platform at the Berlin Institute of Medical Systems Biology, Max Delbrück Center in Berlin. He has been developing computational methods for analyzing and integrating large-scale genomics data sets since 2002. His interest is in using machine learning and statistics to uncover patterns related to important biological variables such as disease state and type. He has lived in the USA, Norway, Turkey, Japan, and Switzerland to pursue research and education in computational genomics.
Blurb
Computational Genomics with R provides a starting point for beginners in genomic data analysis and also guides more advanced practitioners to sophisticated data analysis techniques in genomics. The book covers topics from R programming, to machine learning and statistics, to the latest genomic data analysis techniques. The text provides accessible information and explanations, always with the genomics context in the background. It also contains practical and well-documented examples in R, so readers can analyze their own data by reusing the code presented. Because the field of computational genomics is interdisciplinary, it requires different starting points for people with different backgrounds. For example, a biologist might skip the sections on basic genome biology and start with R programming, whereas a computer scientist might want to start with genome biology.
After reading:
You will know basic techniques for integrating and interpreting multi-omics datasets.
Altuna Akalin is a group leader and head of the Bioinformatics and Omics Data Science Platform at the Berlin Institute of Medical Systems Biology, Max Delbrück Center, Berlin. He has been developing computational methods for analyzing and integrating large-scale genomics data sets since 2002. He has published an extensive body of work in this area. The framework for this book grew out of the yearly computational genomics courses he has been organizing and teaching since 2015.
Contents
Introduction to R for Genomic Data Analysis
  Steps of (genomic) data analysis
    Data collection
    Data quality check and cleaning
    Data processing
    Exploratory data analysis and modeling
    Visualization and reporting
    Why use R for genomics?
  Getting started with R
    Installing packages
    Installing packages in custom locations
    Getting help on functions and packages
  Computations in R
  Data structures
    Vectors
    Matrices
    Data Frames
    Lists
    Factors
  Data types
  Reading and writing data
    Reading large files
  Plotting in R with base graphics
    Combining multiple plots
    Saving plots
  Plotting in R with ggplot
    Combining multiple plots
    ggplot and tidyverse
  Functions and control structures (for, if/else etc)
    User defined functions
    Loops and looping structures in R
  Exercises
    Computations in R
    Data structures in R
    Reading in and writing data out in R
    Plotting in R
    Functions and control structures (for, if/else etc)
Statistics for Genomics
  How to summarize collection of data points: The idea behind statistical distributions
    Describing the central tendency: mean and median
    Describing the spread: measurements of variation
    Precision of estimates: Confidence intervals
  How to test for differences between samples
    Randomization based testing for difference of the means
    Using t-test for difference of the means between two samples
    Multiple testing correction
    Moderated t-tests: using information from multiple comparisons
  Relationship between variables: linear models and correlation
    How to fit a line
    How to estimate the error of the coefficients
    Accuracy of the model
    Regression with categorical variables
    Regression pitfalls
  Exercises
    How to summarize collection of data points: The idea behind statistical distributions
    How to test for differences in samples
    Relationship between variables: linear models and correlation
Exploratory Data Analysis with Unsupervised Machine Learning
  Clustering: grouping samples based on their similarity
    Distance metrics
    Hierarchical clustering
    K-means clustering
    How to choose "k", the number of clusters
  Dimensionality reduction techniques: visualizing complex data sets in 2D
    Principal component analysis
    Other matrix factorization methods for dimensionality reduction
    Multi-dimensional scaling
    t-Distributed Stochastic Neighbor Embedding (t-SNE)
  Exercises
    Clustering
    Dimension Reduction
Predictive Modeling with Supervised Machine Learning
  How machine learning models are fit?
    Machine learning vs Statistics
  Steps in supervised machine learning
  Use case: Disease subtype from genomics data
  Data preprocessing
    Data transformation
    Filtering data and scaling
    Dealing with missing values
  Splitting the data
    Holdout test dataset
    Cross-validation
    Bootstrap resampling
  Predicting the subtype with k-nearest neighbors
  Assessing the performance of our model
    Receiver Operating Characteristic (ROC) Curves
  Model tuning and avoiding overfitting
    Model complexity and bias variance trade-off
    Data split strategies for model tuning and testing
  Variable importance
  How to deal with class imbalance
    Sampling for class balance
    Altering case weights
    Selecting different classification score cutoffs
  Dealing with correlated predictors
  Trees and forests: Random forests in action
    Decision trees
    Trees to forests
    Variable importance
  Logistic regression and regularization
    Regularization in order to avoid overfitting
    Variable importance
  Other supervised algorithms
    Gradient boosting
    Support Vector Machines (SVM)
    Neural networks and deep versions of it
    Ensemble learning
  Predicting continuous variables: regression with machine learning
    Use case: Predicting age from DNA methylation
    Reading and processing the data
    Running random forest regression
  Exercises
    Classification
    Regression
Operations on Genomic Intervals and Genome Arithmetic
  Operations on Genomic Intervals with GenomicRanges package
    How to create and manipulate a GRanges object
    Getting genomic regions into R as GRanges objects
    Finding regions that do/do not overlap with another set of regions
  Dealing with mapped high-th…