{"id":228,"date":"2021-09-13T08:48:13","date_gmt":"2021-09-13T08:48:13","guid":{"rendered":"https:\/\/www.kindsonthegenius.com\/data-science\/?p=228"},"modified":"2021-09-25T13:47:32","modified_gmt":"2021-09-25T13:47:32","slug":"practical-data-science-class-for-data-science-beginners","status":"publish","type":"post","link":"https:\/\/www.kindsonthegenius.com\/data-science\/practical-data-science-class-for-data-science-beginners\/","title":{"rendered":"Class 1 &#8211; Practical Data Science Class For Data Science Beginners"},"content":{"rendered":"<p>This is your first class as a beginner Data Scientist. It is a practical class with some explanations of the concepts along the way.<\/p>\n<p>Here&#8217;s what we&#8217;ll cover in this class:<\/p>\n<ol>\n<li><a href=\"#t1\">Obtain Free Dataset from sklearn<\/a><\/li>\n<li><a href=\"#t2\">Slicing Your Data<\/a><\/li>\n<li><a href=\"#t3\">Create a Dictionary Using the Dataset<\/a><\/li>\n<li><a href=\"#t4\">Convert to Pandas Dataframe<\/a><\/li>\n<li><a href=\"#t5\">Replace Numerical Values with Target Names<\/a><\/li>\n<li><a href=\"#t6\">Write data to csv<\/a><\/li>\n<li><a href=\"#t7\">Check dimension of the dataset<\/a><\/li>\n<li><a href=\"#t8\">View data types<\/a><\/li>\n<li><a href=\"#t9\">View summary of dataset<\/a><\/li>\n<li><a href=\"#t10\">Check class distribution of data using groupby<\/a><\/li>\n<li><a href=\"#t11\">Check Correlation between features<\/a><\/li>\n<li><a href=\"#t12\">Check skewness of dataset<\/a><\/li>\n<\/ol>\n<p><strong>Note<\/strong>: This class is clearer when you also watch the video lesson.<\/p>\n<p><a href=\"https:\/\/youtu.be\/mdC08NQa5uc\" target=\"_blank\" rel=\"noopener\">Class 1 Video<\/a><\/p>\n<h4><strong id=\"t1\">1. Obtain Free Dataset from sklearn<\/strong><\/h4>\n<p>There are a number of ways to get free datasets. You can also generate your own dataset. One way to get free datasets is to get them from packages in R. 
<a href=\"https:\/\/youtu.be\/cKxgH0fchPc\" target=\"_blank\" rel=\"noopener\">This is explained here<\/a>.<\/p>\n<p>In this tutorial, however, we use the iris dataset from sklearn. The feature values come as a NumPy array in iris.data, and the class labels in iris.target. The code is given below:<\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #008800; font-weight: bold;\">import<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">sklearn.datasets<\/span> <span style=\"color: #008800; font-weight: bold;\">as<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">ds<\/span>\r\niris <span style=\"color: #333333;\">=<\/span> ds<span style=\"color: #333333;\">.<\/span>load_iris()\r\n<\/pre>\n<p>The above code gets the dataset and loads it into the variable iris.<\/p>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t2\">2. Slicing Your Data<\/strong><\/h4>\n<p>Slicing simply means taking a subset of the dataset. The code below separates the feature array into its four columns:<\/p>\n<pre style=\"margin: 0; line-height: 125%;\">col0 <span style=\"color: #333333;\">=<\/span> iris<span style=\"color: #333333;\">.<\/span>data[:,<span style=\"color: #0000dd; font-weight: bold;\">0<\/span>] <span style=\"color: #888888;\"># column 0: sepal length<\/span>\r\ncol1 <span style=\"color: #333333;\">=<\/span> iris<span style=\"color: #333333;\">.<\/span>data[:,<span style=\"color: #0000dd; font-weight: bold;\">1<\/span>] <span style=\"color: #888888;\"># column 1: sepal width<\/span>\r\ncol2 <span style=\"color: #333333;\">=<\/span> iris<span style=\"color: #333333;\">.<\/span>data[:,<span style=\"color: #0000dd; font-weight: bold;\">2<\/span>] <span style=\"color: #888888;\"># column 2: petal length<\/span>\r\ncol3 <span style=\"color: #333333;\">=<\/span> iris<span style=\"color: #333333;\">.<\/span>data[:,<span style=\"color: #0000dd; font-weight: bold;\">3<\/span>] <span style=\"color: #888888;\"># column 3: petal width<\/span>\r\n<\/pre>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t3\">3. 
Create a Dictionary Using the Dataset<\/strong><\/h4>\n<p>Now we create a Python dictionary called iris_dict from the column slices col0 to col3. We need the dictionary to convert the array dataset to a pandas DataFrame.<\/p>\n<pre style=\"margin: 0; line-height: 125%;\">iris_dict <span style=\"color: #333333;\">=<\/span> {<span style=\"background-color: #fff0f0;\">'Sepal Length'<\/span>:col0, \r\n             <span style=\"background-color: #fff0f0;\">'Sepal Width'<\/span>:col1, \r\n             <span style=\"background-color: #fff0f0;\">'Petal Length'<\/span>:col2, \r\n             <span style=\"background-color: #fff0f0;\">'Petal Width'<\/span>:col3, \r\n             <span style=\"background-color: #fff0f0;\">'Target'<\/span>:iris<span style=\"color: #333333;\">.<\/span>target\r\n            }\r\n<\/pre>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t4\">4. Convert to Pandas Dataframe<\/strong><\/h4>\n<p>We pass the dictionary to pd.DataFrame, as shown below.<\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #008800; font-weight: bold;\">import<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">pandas<\/span> <span style=\"color: #008800; font-weight: bold;\">as<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">pd<\/span>\r\niris_df <span style=\"color: #333333;\">=<\/span> pd<span style=\"color: #333333;\">.<\/span>DataFrame(data<span style=\"color: #333333;\">=<\/span>iris_dict)\r\n<\/pre>\n<p>The new dataframe is called iris_df.<\/p>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t5\">5. Replace Numerical Values With Text Target Names<\/strong><\/h4>\n<p>We now replace the numerical values (0, 1, 2) with the actual class names, which are available in iris.target_names. 
The code below does that.<\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\"># Replace Numerical classes with Target names<\/span>\r\ntarget <span style=\"color: #333333;\">=<\/span> iris<span style=\"color: #333333;\">.<\/span>target_names\r\n\r\niris_df<span style=\"color: #333333;\">.<\/span>loc[iris_df[<span style=\"background-color: #fff0f0;\">'Target'<\/span>]<span style=\"color: #333333;\">==<\/span><span style=\"color: #0000dd; font-weight: bold;\">0<\/span>, <span style=\"background-color: #fff0f0;\">'Target'<\/span>] <span style=\"color: #333333;\">=<\/span> target[<span style=\"color: #0000dd; font-weight: bold;\">0<\/span>]\r\niris_df<span style=\"color: #333333;\">.<\/span>loc[iris_df[<span style=\"background-color: #fff0f0;\">'Target'<\/span>]<span style=\"color: #333333;\">==<\/span><span style=\"color: #0000dd; font-weight: bold;\">1<\/span>, <span style=\"background-color: #fff0f0;\">'Target'<\/span>] <span style=\"color: #333333;\">=<\/span> target[<span style=\"color: #0000dd; font-weight: bold;\">1<\/span>]\r\niris_df<span style=\"color: #333333;\">.<\/span>loc[iris_df[<span style=\"background-color: #fff0f0;\">'Target'<\/span>]<span style=\"color: #333333;\">==<\/span><span style=\"color: #0000dd; font-weight: bold;\">2<\/span>, <span style=\"background-color: #fff0f0;\">'Target'<\/span>] <span style=\"color: #333333;\">=<\/span> target[<span style=\"color: #0000dd; font-weight: bold;\">2<\/span>]\r\n<\/pre>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t6\">6. Write Pandas DataFrame to csv<\/strong><\/h4>\n<p>Now you can export this data as a CSV file on your local computer.<\/p>\n<pre style=\"margin: 0; line-height: 125%;\">iris_df<span style=\"color: #333333;\">.<\/span>to_csv(<span style=\"background-color: #fff0f0;\">'irisCSV.csv'<\/span>)\r\n<\/pre>\n<p>It is saved in the same directory as the current notebook you are working with.<\/p>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t7\">7. 
Check dimension of the dataset<\/strong><\/h4>\n<p>The dimension of the dataset is simply its number of rows and columns. You get it using the shape attribute, which for this dataset returns (150, 5):<\/p>\n<pre style=\"margin: 0; line-height: 125%;\">iris_df<span style=\"color: #333333;\">.<\/span>shape\r\n<\/pre>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t8\">8. View data types<\/strong><\/h4>\n<p>This means that you want to know the data types of the columns in your dataset. You get them using the dtypes attribute:<\/p>\n<pre style=\"margin: 0; line-height: 125%;\">iris_df<span style=\"color: #333333;\">.<\/span>dtypes\r\n<\/pre>\n<p>The output below shows Target as int64, which is the result before the replacement in step 5; after the replacement, Target holds text and its dtype is object:<\/p>\n<pre style=\"margin: 0; line-height: 125%;\">Sepal Length    float64\r\nSepal Width     float64\r\nPetal Length    float64\r\nPetal Width     float64\r\nTarget            int64\r\ndtype: object\r\n<\/pre>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t9\">9. View summary of dataset<\/strong><\/h4>\n<p>We can get the summary statistics of our dataset. 
These statistics include the count, mean, standard deviation, minimum, maximum, and quartiles.<\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #008800; font-weight: bold;\">from<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">pandas<\/span> <span style=\"color: #008800; font-weight: bold;\">import<\/span> set_option <span style=\"color: #888888;\"># lets us set the display precision<\/span>\r\nset_option(<span style=\"background-color: #fff0f0;\">'display.precision'<\/span>, <span style=\"color: #0000dd; font-weight: bold;\">2<\/span>)\r\niris_df<span style=\"color: #333333;\">.<\/span>describe()\r\n<\/pre>\n<p>The output of the above code is:<\/p>\n<figure id=\"attachment_230\" aria-describedby=\"caption-attachment-230\" style=\"width: 300px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-13-at-10.23.00.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-230 size-medium\" 
src=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-13-at-10.23.00-300x164.png\" alt=\"Output of dataset summary using describe()\" width=\"300\" height=\"164\" srcset=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-13-at-10.23.00-300x164.png 300w, https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-13-at-10.23.00-1024x560.png 1024w, https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-13-at-10.23.00-768x420.png 768w, https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-13-at-10.23.00.png 1292w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><figcaption id=\"caption-attachment-230\" class=\"wp-caption-text\">Output of dataset summary using describe()<\/figcaption><\/figure>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t10\">10. Check class distribution of data using groupby<\/strong><\/h4>\n<p>The class distribution shows how many rows belong to each class, so you can see whether the classes are balanced. See the video for more explanation.<\/p>\n<pre style=\"margin: 0; line-height: 125%;\">iris_df<span style=\"color: #333333;\">.<\/span>groupby(<span style=\"background-color: #fff0f0;\">'Target'<\/span>)<span style=\"color: #333333;\">.<\/span>size()\r\n<\/pre>\n<p>With the numerical Target column, the output would be as follows (after the replacement in step 5, the labels are the class names instead of 0, 1, 2):<\/p>\n<pre style=\"margin: 0; line-height: 125%;\">Target\r\n0    50\r\n1    50\r\n2    50\r\ndtype: int64\r\n<\/pre>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t11\">11. Check Correlation between features<\/strong><\/h4>\n<p>Correlation measures the strength of the linear relationship between pairs of variables in your dataset. 
Correlation values range from -1 (perfect negative correlation) through 0 (no linear correlation) to 1 (perfect positive correlation).<\/p>\n<pre style=\"margin: 0; line-height: 125%;\">correlations <span style=\"color: #333333;\">=<\/span> iris_df<span style=\"color: #333333;\">.<\/span>corr(method<span style=\"color: #333333;\">=<\/span><span style=\"background-color: #fff0f0;\">'pearson'<\/span>)\r\ncorrelations <span style=\"color: #888888;\"># you can also use print(correlations)<\/span>\r\n<\/pre>\n<p>The output of the above code is given below:<\/p>\n<figure id=\"attachment_231\" aria-describedby=\"caption-attachment-231\" style=\"width: 300px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-13-at-10.32.18.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-231\" src=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-13-at-10.32.18-300x114.png\" alt=\"Output of Correlation\" width=\"300\" height=\"114\" srcset=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-13-at-10.32.18-300x114.png 300w, https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-13-at-10.32.18-1024x389.png 1024w, https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-13-at-10.32.18-768x292.png 768w, https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-13-at-10.32.18.png 1422w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><figcaption id=\"caption-attachment-231\" class=\"wp-caption-text\">Output of Correlation<\/figcaption><\/figure>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t12\">12. 
Check skewness of dataset<\/strong><\/h4>\n<p>Skewness measures how far a distribution departs from symmetry: data that is expected to follow a normal (Gaussian) distribution may instead appear distorted or shifted to the left or the right.<\/p>\n<pre style=\"margin: 0; line-height: 125%;\">iris_df<span style=\"color: #333333;\">.<\/span>skew()\r\n<\/pre>\n<p>The output of the above code, computed with the numerical Target column, is given below:<\/p>\n<pre style=\"margin: 0; line-height: 125%;\">Sepal Length    0.31\r\nSepal Width     0.32\r\nPetal Length   -0.27\r\nPetal Width    -0.10\r\nTarget          0.00\r\ndtype: float64\r\n<\/pre>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is your first class as a beginner Data Scientist. It is a practical class with some explanations of the concepts along the &hellip; <!-- AddThis Share Buttons generic via filter on get_the_excerpt 
--><\/p>\n","protected":false},"author":1,"featured_media":235,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[44,43],"tags":[46,40,45],"class_list":["post-228","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science","category-python","tag-jupyter-notebook","tag-pandas","tag-sklearn"],"jetpack_featured_media_url":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Your-First-Data-Science-Class-With-Python-and-Jupyter-Notebook-2.jpg","_links":{"self":[{"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/posts\/228","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/comments?post=228"}],"version-history":[{"count":7,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/posts\/228\/revisions"}],"predecessor-version":[{"id":278,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/posts\/228\/revisions\/278"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/media\/235"}],"wp:attachment":[{"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/media?parent=228"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/categories?post=228"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/tags?post=228"}],"curies":[{"na
me":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}