CHF70.00
Available for immediate download
The only how-to guide offering a unified, systematic approach to acquiring, cleaning, and managing data in R
Every experienced practitioner knows that preparing data for modeling is a painstaking, time-consuming process. Adding to the difficulty is that most modelers learn the steps involved in cleaning and managing data piecemeal, often on the fly, or they develop their own ad hoc methods. This book helps simplify their task by providing a unified, systematic approach to acquiring, modeling, manipulating, cleaning, and maintaining data in R.
Starting with the very basics, data scientists Samuel E. Buttrey and Lyn R. Whitaker walk readers through the entire process. From what data looks like and what it should look like, they progress through all the steps involved in getting data ready for modeling. They describe best practices for acquiring data from numerous sources; explore key issues in data handling, including text/regular expressions, big data, parallel processing, merging, matching, and checking for duplicates; and outline highly efficient and reliable techniques for documenting data and recordkeeping, including audit trails, getting data back out of R, and more.
The only single-source guide to R data and its preparation, it describes best practices for acquiring, manipulating, cleaning, and maintaining data
Begins with the basics and walks readers through all the steps necessary to get data ready for the modeling process
Provides expert guidance on how to document the processes described so that they are reproducible
Written by seasoned professionals, it provides both introductory and advanced techniques
Features case studies with supporting data and R code, hosted on a companion website
A Data Scientist's Guide to Acquiring, Cleaning and Managing Data in R is a valuable working resource/bench manual for practitioners who collect and analyze data, lab scientists and research associates of all levels of experience, and graduate-level data mining students.
Authors
SAMUEL E. BUTTREY, PhD is an Associate Professor of Operations Research at the Naval Postgraduate School, Monterey, California, USA. LYN R. WHITAKER, PhD is an Associate Professor of Operations Research at the Naval Postgraduate School, Monterey, California, USA.
Contents
About the Authors
Preface
Acknowledgments
About the Companion Website
1 R
1.1 Introduction
1.1.1 What Is R?
1.1.2 Who Uses R and Why?
1.1.3 Acquiring and Installing R
1.1.4 Starting and Quitting R
1.2 Data
1.2.1 Acquiring Data
1.2.2 Cleaning Data
1.2.3 The Goal of Data Cleaning
1.2.4 Making Your Work Reproducible
1.3 The Very Basics of R
1.3.1 Top Ten Quick Facts You Need to Know about R
1.3.2 Vocabulary
1.3.3 Calculating and Printing in R
1.4 Running an R Session
1.4.1 Where Your Data Is Stored
1.4.2 Options
1.4.3 Scripts
1.4.4 R Packages
1.4.5 RStudio and Other GUIs
1.4.6 Locales and Character Sets
1.5 Getting Help
1.5.1 At the Command Line
1.5.2 The Online Manuals
1.5.3 On the Internet
1.5.4 Further Reading
1.6 How to Use This Book
1.6.1 Syntax and Conventions in This Book
1.6.2 The Chapters
2 R Data, Part 1: Vectors
2.1 Vectors
2.1.1 Creating Vectors
2.1.2 Sequences
2.1.3 Logical Vectors
2.1.4 Vector Operations
2.1.5 Names
2.2 Data Types
2.2.1 Some Less-Common Data Types
2.2.2 What Type of Vector Is This?
2.2.3 Converting from One Type to Another
2.3 Subsets of Vectors
2.3.1 Extracting
2.3.2 Vectors of Length 0
2.3.3 Assigning or Replacing Elements of a Vector
2.4 Missing Data (NA) and Other Special Values
2.4.1 The Effect of NAs in Expressions
2.4.2 Identifying and Removing or Replacing NAs
2.4.3 Indexing with NAs
2.4.4 NaN and Inf Values
2.4.5 NULL Values
2.5 The table() Function
2.5.1 Two- and Higher-Way Tables
2.5.2 Operating on Elements of a Table
2.6 Other Actions on Vectors
2.6.1 Rounding
2.6.2 Sorting and Ordering
2.6.3 Vectors as Sets
2.6.4 Identifying Duplicates and Matching
2.6.5 Finding Runs of Duplicate Values
2.7 Long Vectors and Big Data
2.8 Chapter Summary and Critical Data Handling Tools
3 R Data, Part 2: More Complicated Structures
3.1 Introduction
3.2 Matrices
3.2.1 Extracting and Assigning
3.2.2 Row and Column Names
3.2.3 Applying a Function to Rows or Columns
3.2.4 Missing Values in Matrices
3.2.5 Using a Matrix Subscript
3.2.6 Sparse Matrices
3.2.7 Three- and Higher-Way Arrays
3.3 Lists
3.3.1 Extracting and Assigning
3.3.2 Lists in Practice
3.4 Data Frames
3.4.1 Missing Values in Data Frames
3.4.2 Extracting and Assigning in Data Frames
3.4.3 Extracting Things That Aren't There
3.5 Operating on Lists and Data Frames
3.5.1 Split, Apply, Combine
3.5.2 All-Numeric Data Frames
3.5.3 Convenience Functions
3.5.4 Re-Ordering, De-Duplicating, and Sampling from Data Frames
3.6 Date and Time Objects
3.6.1 Formatting Dates
3.6.2 Common Operations on Date Objects
3.6.3 Differences between Dates
3.6.4 Dates and Times
3.6.5 Creating POSIXt Objects
3.6.6 Mathematical Functions for Date and Times
3.6.7 Missing Values in Dates
3.6.8 Using Apply Functions with Dates and Times
3.7 Other Actions on Data Frames
3.7.1 Combining by Rows or Col...