{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![](images/python_with_Birds.gif)\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# แนะนำ Pandas (Introduction to Pandas)\n", "**20** minutes\n", "\n", " **วัตถุประสงค์**\n", "\n", "\n", " หลังจากทำทำแล็บ นศ.จะสามารถ \n", "\n", "* ใช้งาน Pandas เข้าถึงและแสดงข้อมูลได้\n", "\n", "Ref: \n", "* http://pandas.pydata.org/pandas-docs/stable/getting_started/index.html\n", "* https://www.kaggle.com/leonardodata/analysis-from-2010-2019-spotify" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Dataset" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "ตอนนี้ในมือของคุณมีชุดข้อมูล (Dataset) เพลง เป็นข้อมูลเพลงฮิตในแต่ละปีตั้งแต่ปี 2010 ถึงปี 2019 โดยข้อมูลอยู่ในรูปแบบของตาราง แต่ละแถว (row) ในตารางเป็นข้อมูลของเพลงแต่ละเพลง และแต่ละคอลัมน์ (column) มีข้อมูลดังต่อไปนี้\n", "\n", "* **artist** - ชื่อศิลปิน (นักร้อง วงดนตรี)\n", "* Top genre คือแนวเพลงหรือประเภทของแนวดนตรี\n", "* Year คือปีของเพลงที่อยู่ในบิลบอร์ด (Billboard)\n", "* Bpm (Tempo) คือจังหวะเร็วช้าขนาดไหน หน่วยเป็น bpm (Beats.Per.Minute)\n", "* Nrgy (Energy) คือพลังหรือความพุ่งของเพลง\n", "* Dnce (Danceability) คือความดิ้น หรือความน่าเต้นของเพลง\n", "* dB (Loudness) คือระดับความดังในหน่วย decibel \n", "* Live (Liveness) คือระดับของการแสดงสด (Live) ของเพลง \n", "* Val (Valence) คือระดับความ positive ในเพลง (ค่าต่ำ: เศร้า-หดหู่-โกรธ,คาสูง: มีความสุข-ให้กำลังใจ)\n", "* Dur (Length) - คือความยาว (ระยะเวลา) ของเพลง/ดนตรี\n", "* Acous (Acousticness) คือความ “acoustic” ของเพลง\n", "* Spch (Speechiness) คือปริมาณของคำพูดหรือเนื้อร้อง\n", "* Popularity คือระดับความนิยม ยิ่งค่าสูงเท่าไหร่ ยิ่งเป็นที่นิยม\n", "\n", "
\n", "\n", "ชุดข้อมูลที่ได้รับอยู่ในรูปตาราง (Format CSV)\n", "\n", "(Source: spotifycharts.com, kaggle.com)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "|#|title |artist |top genre |year|bpm|nrgy|dnce|dB |live|val|dur|acous|spch|pop|\n", "|------|----------------------------------------------------------------------------------------------------|------------------------|-------------------------|----|---|----|----|---|----|---|---|-----|----|---|\n", "|1 |Memories |Maroon 5 |pop |2019|91 |32 |76 |-7 |8 |57 |189|84 |5 |99 |\n", "|2 |Lose You To Love Me |Selena Gomez |dance pop |2019|102|34 |51 |-9 |21 |9 |206|58 |4 |97 |\n", "|3 |Someone You Loved |Lewis Capaldi |pop |2019|110|41 |50 |-6 |11 |45 |182|75 |3 |96 |\n", "|4 |Senorita |Shawn Mendes |canadian pop |2019|117|54 |76 |-6 |9 |75 |191|4 |3 |95 |\n", "|5 |How Do You Sleep? |Sam Smith |pop |2019|111|68 |48 |-5 |8 |35 |202|15 |9 |93 |\n", "|6 |South of the Border (feat. Camila Cabello & Cardi B) |Ed Sheeran |pop |2019|98 |62 |86 |-6 |9 |67 |204|15 |8 |92 |\n", "|7 |Trampoline (with ZAYN) |SHAED |electropop |2019|127|46 |62 |-6 |14 |50 |184|56 |3 |92 |\n", "|8 |Happier |Marshmello |brostep |2019|100|79 |69 |-3 |17 |67 |214|19 |5 |90 |\n", "|9 |Truth Hurts |Lizzo |escape room |2019|158|62 |72 |-3 |12 |41 |173|11 |11 |90 |\n", "|10 |Good as Hell (feat. Ariana Grande) - Remix |Lizzo |escape room |2019|96 |89 |67 |-3 |74 |48 |159|30 |6 |90 |\n", "|11 |Underneath the Tree |Kelly Clarkson |dance pop |2013|160|81 |51 |-5 |21 |69 |230|0 |5 |88 |\n", "|12 |Higher Love |Kygo |edm |2019|104|68 |69 |-7 |10 |40 |228|2 |3 |88 |\n", "|13 |Shape of You |Ed Sheeran |pop |2017|96 |65 |83 |-3 |9 |93 |234|58 |8 |87 |\n", "|14 |Only Human |Jonas Brothers |boy band |2019|94 |50 |80 |-6 |6 |87 |183|11 |7 |87 |\n", "|15 |All of Me |John Legend |neo mellow |2014|120|26 |42 |-7 |13 |33 |270|92 |3 |86 |\n", "|16 |Closer |The Chainsmokers |electropop |2017|95 |52 |75 |-6 |11 |66 |245|41 |3 |86 |\n", "|17 |One Kiss (with Dua Lipa) |Calvin Harris |dance pop |2018|124|86 |79 |-3 |8 |59 |215|4 |11 |86 |\n", "|18 |Beautiful People (feat. Khalid) |Ed Sheeran |pop |2019|93 |65 |64 |-8 |8 |55 |198|12 |19 |86 |\n", "|19 |Sucker |Jonas Brothers |boy band |2019|138|73 |84 |-5 |11 |95 |181|4 |6 |86 |\n", "|20 |Don't Call Me Up |Mabel |dance pop |2019|99 |88 |67 |-3 |8 |23 |178|30 |15 |86 |\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Pandas คือ?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Pandas](https://pypi.org/project/pandas/)* เป็นไลบรารี่ที่ได้รับความนิยมอย่างมากอันหนึ่งของภาษาไพทอน ใช้ในการจัดการข้อมูล (Data wrangling/ Data cleaning) และการวิเคราะห์ข้อมูล (Data analysis) สามารถจัดการกับข้อมูลที่มีขนาดใหญ่ๆ ได้โดยที่ไม่มีปัญหา (ในขณะที่โปรแกรม Spreadsheets อย่าง Excel มีปัญหาเรื่องการประมวลผลช้าและไม่เสถียร)\n", "\n", "หากทำงานเกี่ยวข้องกับ Data (เช่น Data Scientist) Pandas จะเป็นตัวเลือกหลัก เพราะ Pandas สามารถเชื่อมต่อการแหล่งข้อมูลได้หลากหลาย (เช่น CSV file, Text file, Microsoft Excel, Clipboard, HDFS format, Google BIg Query, Oracle database, URL, HTML, JSON file)\n", "หลังจากนั้นก็สามารถจัดเตรียมข้อมูล (Data preparation) ทำความสะอาดข้อมูล (Data cleansing/Data scrubbing) และจัดรูปแบบให้พร้อมกับการนำไปวิเคราะห์ (Data analytics) จนถึงแสดงผล (Data visualization) เบื้องต้นก็ได้\n", "\n", "\n", "เนื่องจาก Pandas ไม่ใช่โมดูลมาตรฐานของไพทอน ดังนั้น ก่อนจะใช้งานจำเป็นต้องติดตั้งลงเครื่องก่อน (pip install หรือ conda install)\n", "\n", "เราจะเริ่มต้นเขียนโปรแกรมโดยการ import เข้ามา ณ ที่นี้จะตั้งชื่อว่า pd เพื่อให้เวลาเรียกใช้งานฟังก์ชันจะได้เขียนง่ายๆ เช่น pd.read_csv() \n", "\n", "(*Pandas มาจาก Panel data ซึ่งหมายถึงข้อมูลที่มีหลายมิติ)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## การอ่านข้อมูลจากไฟล์ในรูปแบบ DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import ไลบรารี่ที่จะใช้งาน" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [] }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "#### เช็ค Version" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pandas version 1.5.1\n" ] } ], "source": [ "print (\"Pandas version\",pd.__version__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**ดาวน์โหลดชุดข้อมูลไฟล์ CSV เข้าสู่โปรแกรมในรูปแบบ DataFrame**\n", "\n", "_The top songs BY YEAR in the world by spotify. This dataset has several variables about the songs and is based on Billboard._\n", "\n", "(ชุดข้อมูลข้างต้นสามารถดาวโหลดได้จาก spotifycharts.com หรือ [kaggle.com](https://www.kaggle.com/leonardopena/top-spotify-songs-from-20102019-by-year) โดยข้อมูลดิบจะเรียงตามปี (Year) ในรูปของไฟล์ CSV (Comma-Separated Values) )\n", "\n", "\n", "ก่อนอื่น เราจะโหลดไฟล์ข้อมูล csv ที่อยู่ในเครื่องของเราเข้าสู่ไพทอนในรูปแบบ DataFrame (DataFrame คือ ตารางข้อมูลที่ถูกแบ่งเป็นแถวและคอลัมน์เหมือน Excel) โดยใช้ฟังก์ชัน read_csv()\n", "\n", "ตัวแปร csv_path เก็บที่อยู่ของไฟล์ (path) .csv ส่งเป็นอาร์กิวเมนต์ให้กับฟังก์ชั่น read_csv ผลลัพธ์จะถูกเก็บในออบเจ็กต์ชื่อ df ซึ่งเป็นชื่อย่อของตัวแปรที่นิยมใช้เก็บ DataFrame" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Read data from CSV file\n", "csv_path = 'docs/Top_Spotify_songs_from_2010_2019.csv'\n", "df = pd.read_csv(csv_path, encoding='ISO-8859-1')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "หมายเหตุ: ASCII: 7 bit, ISO-8859-1: 8 bits, UTF-8: 8-32 bits" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "เราสามารถแสดงตัวอย่างข้อมูล (Preview) 5 แถวบนสุดและ 5 แถวล่างสุดของ Dataframe ได้โดยใช้ฟังก์ชัน ใช้เมธอด head() และ tail() ตามลำดับ" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0titleartisttop genreyearbpmnrgydncedBlivevalduracousspchpop
01Hey, Soul SisterTrainneo mellow2010978967-488021719483
12Love The Way You LieEminemdetroit hip hop2010879375-55264263242382
23TiK ToKKeshadance pop20101208476-32971200101480
34Bad RomanceLady Gagadance pop20101199270-48712950479
45Just the Way You AreBruno Marspop20101098464-59432212478
\n", "
" ], "text/plain": [ " Unnamed: 0 title artist top genre year bpm \\\n", "0 1 Hey, Soul Sister Train neo mellow 2010 97 \n", "1 2 Love The Way You Lie Eminem detroit hip hop 2010 87 \n", "2 3 TiK ToK Kesha dance pop 2010 120 \n", "3 4 Bad Romance Lady Gaga dance pop 2010 119 \n", "4 5 Just the Way You Are Bruno Mars pop 2010 109 \n", "\n", " nrgy dnce dB live val dur acous spch pop \n", "0 89 67 -4 8 80 217 19 4 83 \n", "1 93 75 -5 52 64 263 24 23 82 \n", "2 84 76 -3 29 71 200 10 14 80 \n", "3 92 70 -4 8 71 295 0 4 79 \n", "4 84 64 -5 9 43 221 2 4 78 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print first five rows of the dataframe\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "จะกำหนดจำนวนแถวที่ต้องการแสดงผลก็ทำได้ เช่น" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0titleartisttop genreyearbpmnrgydncedBlivevalduracousspchpop
01Hey, Soul SisterTrainneo mellow2010978967-488021719483
12Love The Way You LieEminemdetroit hip hop2010879375-55264263242382
\n", "
" ], "text/plain": [ " Unnamed: 0 title artist top genre year bpm nrgy \\\n", "0 1 Hey, Soul Sister Train neo mellow 2010 97 89 \n", "1 2 Love The Way You Lie Eminem detroit hip hop 2010 87 93 \n", "\n", " dnce dB live val dur acous spch pop \n", "0 67 -4 8 80 217 19 4 83 \n", "1 75 -5 52 64 263 24 23 82 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(2)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "ในกรณีที่ข้อมูลเป็นไฟล์ excel (.xlsx, .xls) Pandas จะใช้ไลบรารี [xlrd](https://pypi.org/project/xlrd/) โดยเรียกใช้ใช้ฟังก์ชั่น read_excel() ซึ่งผลลัพท์ที่ได้จะเป็น Dataframe เช่นเดียวกัน" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0titleartisttop genreyearbpmnrgydncedBlivevalduracousspchpop
601602Nothing Breaks Like a Heart (feat. Miley Cyrus)Mark Ronsondance pop20191147960-642242171769
602603Kills You SlowlyThe Chainsmokerselectropop20191504470-913232136667
\n", "
" ], "text/plain": [ " Unnamed: 0 title \\\n", "601 602 Nothing Breaks Like a Heart (feat. Miley Cyrus) \n", "602 603 Kills You Slowly \n", "\n", " artist top genre year bpm nrgy dnce dB live val dur \\\n", "601 Mark Ronson dance pop 2019 114 79 60 -6 42 24 217 \n", "602 The Chainsmokers electropop 2019 150 44 70 -9 13 23 213 \n", "\n", " acous spch pop \n", "601 1 7 69 \n", "602 6 6 67 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read data from Excel File and print the first five rows\n", "\n", "xlsx_path = 'docs/Top_Spotify_songs_from_2010_2019.xlsx'\n", "\n", "df = pd.read_excel(xlsx_path)\n", "df.tail(2)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "ในกรณีที่เกิด Error “XLRDError: Excel xlsx file; not supported” ให้ทำการติดตั้ง xlrd โดยใช้คำสั่งต่อไปนี้ แล้วลองโหลดใหม่ดูอีกที" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: xlrd in /Users/tube.sc/.pyenv/versions/3.8.13/lib/python3.8/site-packages (2.0.1)\n" ] } ], "source": [ "# Dependency needed to install file \n", "# Pandas.read_excel a.) uses a library called xlrd internally.\n", "\n", "!pip install xlrd" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## ตรวจสอบชุดข้อมูล" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**เช็คจำนวนแถวและจำนวนคอลัมน์**\n", "ใน Pandas มีวิธีเช็คจำนวนแถวและจำนวนคอลัมน์ได้ง่ายๆ โดยใช้ " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(603, 15)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ชุดข้อมูลมี 603 rows และ 15 columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**ดูค่าสถิติเบื้องต้นของข้อมูลแต่ละ column ใน DataFrame** เช่น mean (ค่าเฉลี่ย), median (ค่ากลาง), standard deviation (std), min (ค่าต่ำสุด), max (ค่าสูงสุด) ฯลฯ" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0yearbpmnrgydncedBlivevalduracousspchpop
count603.000000603.000000603.000000603.000000603.000000603.000000603.000000603.000000603.000000603.000000603.000000603.000000
mean302.0000002014.592040118.54560570.50414664.379768-5.57877317.77446152.225539224.67495914.3267008.35820966.520730
std174.2153842.60705724.79535816.31066413.3787182.79802013.10254322.51302034.13005920.7661657.48316214.517746
min1.0000002010.0000000.0000000.0000000.000000-60.0000000.0000000.000000134.0000000.0000000.0000000.000000
25%151.5000002013.000000100.00000061.00000057.000000-6.0000009.00000035.000000202.0000002.0000004.00000060.000000
50%302.0000002015.000000120.00000074.00000066.000000-5.00000012.00000052.000000221.0000006.0000005.00000069.000000
75%452.5000002017.000000129.00000082.00000073.000000-4.00000024.00000069.000000239.50000017.0000009.00000076.000000
max603.0000002019.000000206.00000098.00000097.000000-2.00000074.00000098.000000424.00000099.00000048.00000099.000000
\n", "
" ], "text/plain": [ " Unnamed: 0 year bpm nrgy dnce \\\n", "count 603.000000 603.000000 603.000000 603.000000 603.000000 \n", "mean 302.000000 2014.592040 118.545605 70.504146 64.379768 \n", "std 174.215384 2.607057 24.795358 16.310664 13.378718 \n", "min 1.000000 2010.000000 0.000000 0.000000 0.000000 \n", "25% 151.500000 2013.000000 100.000000 61.000000 57.000000 \n", "50% 302.000000 2015.000000 120.000000 74.000000 66.000000 \n", "75% 452.500000 2017.000000 129.000000 82.000000 73.000000 \n", "max 603.000000 2019.000000 206.000000 98.000000 97.000000 \n", "\n", " dB live val dur acous spch \\\n", "count 603.000000 603.000000 603.000000 603.000000 603.000000 603.000000 \n", "mean -5.578773 17.774461 52.225539 224.674959 14.326700 8.358209 \n", "std 2.798020 13.102543 22.513020 34.130059 20.766165 7.483162 \n", "min -60.000000 0.000000 0.000000 134.000000 0.000000 0.000000 \n", "25% -6.000000 9.000000 35.000000 202.000000 2.000000 4.000000 \n", "50% -5.000000 12.000000 52.000000 221.000000 6.000000 5.000000 \n", "75% -4.000000 24.000000 69.000000 239.500000 17.000000 9.000000 \n", "max -2.000000 74.000000 98.000000 424.000000 99.000000 48.000000 \n", "\n", " pop \n", "count 603.000000 \n", "mean 66.520730 \n", "std 14.517746 \n", "min 0.000000 \n", "25% 60.000000 \n", "50% 69.000000 \n", "75% 76.000000 \n", "max 99.000000 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "หรือจะดูเฉพาะ mean (ค่าเฉลี่ย), median (ค่ากลาง), standard deviation (std), min (ค่าต่ำสุด), max (ค่าสูงสุด) ก็สามารถทำได้\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/nb/qx4_7k_n2mx3zzwh39qtrg_c0000gn/T/ipykernel_62169/3698961737.py:1: FutureWarning: The default value of numeric_only in DataFrame.mean is deprecated. In a future version, it will default to False. In addition, specifying 'numeric_only=None' is deprecated. Select only valid columns or specify the value of numeric_only to silence this warning.\n", " df.mean()\n" ] }, { "data": { "text/plain": [ "Unnamed: 0 302.000000\n", "year 2014.592040\n", "bpm 118.545605\n", "nrgy 70.504146\n", "dnce 64.379768\n", "dB -5.578773\n", "live 17.774461\n", "val 52.225539\n", "dur 224.674959\n", "acous 14.326700\n", "spch 8.358209\n", "pop 66.520730\n", "dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.mean()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/nb/qx4_7k_n2mx3zzwh39qtrg_c0000gn/T/ipykernel_62169/530051474.py:1: FutureWarning: The default value of numeric_only in DataFrame.median is deprecated. In a future version, it will default to False. In addition, specifying 'numeric_only=None' is deprecated. Select only valid columns or specify the value of numeric_only to silence this warning.\n", " df.median()\n" ] }, { "data": { "text/plain": [ "Unnamed: 0 302.0\n", "year 2015.0\n", "bpm 120.0\n", "nrgy 74.0\n", "dnce 66.0\n", "dB -5.0\n", "live 12.0\n", "val 52.0\n", "dur 221.0\n", "acous 6.0\n", "spch 5.0\n", "pop 69.0\n", "dtype: float64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.median()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/nb/qx4_7k_n2mx3zzwh39qtrg_c0000gn/T/ipykernel_62169/3390915376.py:1: FutureWarning: The default value of numeric_only in DataFrame.std is deprecated. In a future version, it will default to False. In addition, specifying 'numeric_only=None' is deprecated. Select only valid columns or specify the value of numeric_only to silence this warning.\n", " df.std()\n" ] }, { "data": { "text/plain": [ "Unnamed: 0 174.215384\n", "year 2.607057\n", "bpm 24.795358\n", "nrgy 16.310664\n", "dnce 13.378718\n", "dB 2.798020\n", "live 13.102543\n", "val 22.513020\n", "dur 34.130059\n", "acous 20.766165\n", "spch 7.483162\n", "pop 14.517746\n", "dtype: float64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.std()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/nb/qx4_7k_n2mx3zzwh39qtrg_c0000gn/T/ipykernel_62169/3962516015.py:1: FutureWarning: The default value of numeric_only in DataFrame.min is deprecated. In a future version, it will default to False. In addition, specifying 'numeric_only=None' is deprecated. Select only valid columns or specify the value of numeric_only to silence this warning.\n", " df.min()\n" ] }, { "data": { "text/plain": [ "Unnamed: 0 1\n", "artist 3OH!3\n", "top genre acoustic pop\n", "year 2010\n", "bpm 0\n", "nrgy 0\n", "dnce 0\n", "dB -60\n", "live 0\n", "val 0\n", "dur 134\n", "acous 0\n", "spch 0\n", "pop 0\n", "dtype: object" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.min()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/nb/qx4_7k_n2mx3zzwh39qtrg_c0000gn/T/ipykernel_62169/1151452817.py:1: FutureWarning: The default value of numeric_only in DataFrame.max is deprecated. In a future version, it will default to False. In addition, specifying 'numeric_only=None' is deprecated. Select only valid columns or specify the value of numeric_only to silence this warning.\n", " df.max()\n" ] }, { "data": { "text/plain": [ "Unnamed: 0 603\n", "artist will.i.am\n", "top genre tropical house\n", "year 2019\n", "bpm 206\n", "nrgy 98\n", "dnce 97\n", "dB -2\n", "live 74\n", "val 98\n", "dur 424\n", "acous 99\n", "spch 48\n", "pop 99\n", "dtype: object" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.max()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**เช็คความผิดปกติใน DataFrame เบื้องต้น**\n", "\n", "หากเราใช้ฟังชั่น info() เราก็จะทราบว่าข้อมูลเบื้องต้นของ DataFrame เช่น มีกี่แถว, มีกี่คอลัมน์, Missing value มีมากน้อยเท่าไหร่, แต่ละคอลัมน์เป็น Data Type อะไรบ้าง ซึ่งเป็นการเช็คหาความผิดปกติใน DataFrame เบื้องต้นได้" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 603 entries, 0 to 602\n", "Data columns (total 15 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Unnamed: 0 603 non-null int64 \n", " 1 title 603 non-null object\n", " 2 artist 603 non-null object\n", " 3 top genre 603 non-null object\n", " 4 year 603 non-null int64 \n", " 5 bpm 603 non-null int64 \n", " 6 nrgy 603 non-null int64 \n", " 7 dnce 603 non-null int64 \n", " 8 dB 603 non-null int64 \n", " 9 live 603 non-null int64 \n", " 10 val 603 non-null int64 \n", " 11 dur 603 non-null int64 \n", " 12 acous 603 non-null int64 \n", " 13 spch 603 non-null int64 \n", " 14 pop 603 non-null int64 \n", "dtypes: int64(12), object(3)\n", "memory usage: 70.8+ KB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ชุดข้อมูลนี้ไม่มีช่องใดที่มีค่าเป็น NaN (ไม่มีข้อมูล)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "Index(['Unnamed: 0', 'title', 'artist', 'top genre', 'year', 'bpm', 'nrgy',\n", " 'dnce', 'dB', 'live', 'val', 'dur', 'acous', 'spch', 'pop'],\n", " dtype='object')" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "Unnamed: 0 int64\n", "title object\n", "artist object\n", "top genre object\n", "year int64\n", "bpm int64\n", "nrgy int64\n", "dnce int64\n", "dB int64\n", "live int64\n", "val int64\n", "dur int64\n", "acous int64\n", "spch int64\n", "pop int64\n", "dtype: object" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**แสดง index ของ DataFrame**\n", "ตอนสร้าง DataFrame ถ้าไม่ได้กำหนด index จะเป็นเลขลำดับ 0, 1, 2, 3, … โดยอัตโนมัติ" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=603, step=1)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "✄ ---------- เพิ่มเติม ----------- ✄ " ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Unnamed: 0 0\n", "title 0\n", "artist 0\n", "top genre 0\n", "year 0\n", "bpm 0\n", "nrgy 0\n", "dnce 0\n", "dB 0\n", "live 0\n", "val 0\n", "dur 0\n", "acous 0\n", "spch 0\n", "pop 0\n", "dtype: int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# ******************************************************************\n", "df.isnull().sum() # ตรวจสอบค่า null values ว่าในแต่ละคอลัมน์มีอยู่กี่ค่า" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Unnamed: 0 603\n", "title 603\n", "artist 603\n", "top genre 603\n", "year 603\n", "bpm 603\n", "nrgy 603\n", "dnce 603\n", "dB 603\n", "live 603\n", "val 603\n", "dur 603\n", "acous 603\n", "spch 603\n", "pop 603\n", "dtype: int64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.count() # นับจำนวนข้อมูลในแต่ละคอลัมน์" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "#******************************************************************" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ตรวจสอบจำนวนเพลงที่...\n", "\n", "ชื่อเพลงไม่ซ้ำกัน 584 เพลง\n", "\n", "ชื่อศิลปินไม่ซ้ำกัน 184 เพลง" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Unnamed: 0 603\n", "title 584\n", "artist 184\n", "top genre 50\n", "year 10\n", "bpm 104\n", "nrgy 77\n", "dnce 70\n", "dB 14\n", "live 61\n", "val 94\n", "dur 144\n", "acous 75\n", "spch 39\n", "pop 71\n", "dtype: int64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# ดูค่า nunique ของแต่ละคอลัมน์ (Count number of distinct elements in specified axis.\n", "\n", "df.nunique()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# ✄ ****************************************************************** ✄\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## การเข้าถึงข้อมูล (เบื้องต้น)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**การเข้าถึง Column**\n", "\n", "เราสามารถดูข้อมูลหรือเข้าถึง column ที่ต้องได้ โดยการใส่ชื่อคอลัมน์เป็น string ในวงเล็บเหลี่ยม [] (วงเล็บเดี่ยว) หรือใช้ชื่อคอลัมน์เป็นชื่อ attribute ก็ได้\n", "\n", "Pandas มีออบเจ็กต์พิเศษ 2 ชนิดสำหรับใช้บรรจุข้อมูล ซีรีส์ (Pandas series) และ เดตาเฟรม (Pandas dataFrame)) ดังแสดงในรูป" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](images/Anatomy_of_Pandas_Data_Structures.png)\n", "[src] (https://towardsdatascience.com/how-to-master-pandas-for-data-science-b8ab0a9b1042)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0 2010\n", "1 2010\n", "2 2010\n", "3 2010\n", "4 2010\n", " ... \n", "598 2019\n", "599 2019\n", "600 2019\n", "601 2019\n", "602 2019\n", "Name: year, Length: 603, dtype: int64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Access to the column year\n", "y = df['year']\n", "y" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2010\n", "1 2010\n", "2 2010\n", "3 2010\n", "4 2010\n", " ... \n", "598 2019\n", "599 2019\n", "600 2019\n", "601 2019\n", "602 2019\n", "Name: year, Length: 603, dtype: int64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y = df.year\n", "y" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ผลลัพธ์ที่ได้เป็น 1 column เรียกว่า Panda series (1-D dataframe)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ในกรณีที่ต้องการให้เป็น Dataframe ก็สามารถทำได้โดยใช้ วงเล็บเหลี่ยม [[ ]] (วงเล็บคู่) เช่น column artist\n" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
artist
0Train
1Eminem
2Kesha
3Lady Gaga
4Bruno Mars
......
598Mark Ronson
599Ed Sheeran
600DJ Khaled
601Mark Ronson
602The Chainsmokers
\n", "

603 rows × 1 columns

\n", "
" ], "text/plain": [ " artist\n", "0 Train\n", "1 Eminem\n", "2 Kesha\n", "3 Lady Gaga\n", "4 Bruno Mars\n", ".. ...\n", "598 Mark Ronson\n", "599 Ed Sheeran\n", "600 DJ Khaled\n", "601 Mark Ronson\n", "602 The Chainsmokers\n", "\n", "[603 rows x 1 columns]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get the column as a dataframe\n", "y = df[['artist']]\n", "y" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "หากต้องการหลายคอลัมน์ก็สามารถทำได้ แค่ใส่ชื่อ dataframe ในกรณีนี้คือ df และชื่อของชื่อของคอลัมน์ที่ต้องการในวงเล็บคู่ " ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
artistyeartop genre
0Train2010neo mellow
1Eminem2010detroit hip hop
2Kesha2010dance pop
3Lady Gaga2010dance pop
4Bruno Mars2010pop
............
598Mark Ronson2019dance pop
599Ed Sheeran2019pop
600DJ Khaled2019dance pop
601Mark Ronson2019dance pop
602The Chainsmokers2019electropop
\n", "

603 rows × 3 columns

\n", "
" ], "text/plain": [ " artist year top genre\n", "0 Train 2010 neo mellow\n", "1 Eminem 2010 detroit hip hop\n", "2 Kesha 2010 dance pop\n", "3 Lady Gaga 2010 dance pop\n", "4 Bruno Mars 2010 pop\n", ".. ... ... ...\n", "598 Mark Ronson 2019 dance pop\n", "599 Ed Sheeran 2019 pop\n", "600 DJ Khaled 2019 dance pop\n", "601 Mark Ronson 2019 dance pop\n", "602 The Chainsmokers 2019 electropop\n", "\n", "[603 rows x 3 columns]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Access to multiple columns\n", "y = df[['artist','year','top genre']]\n", "y" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "sort_index(), sort_values()**" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "- sort_index() เรียงลำดับ row ตามค่าของ index\n", "- sort_values() เรียงลำดับ row ตามค่าของ column ที่กำหนด" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0titleartisttop genreyearbpmnrgydncedBlivevalduracousspchpop
442443Million Years AgoAdelebritish soul2016000-6000227000
4445You Lost MeChristina Aguileradance pop2010433923-614725785456
87881+1BeyoncÈdance pop2011633830-772627438360
56BabyJustin Biebercanadian pop2010658673-5115421441477
460461PrayingKeshadance pop2017733958-7113223049377
................................................
314315I'll Show YouJustin Biebercanadian pop20151926136-718820051068
462463The GreatestSiaaustralian dance20171927367-667321012776
200201How Ya Doin'? (feat. Missy Elliott)Little Mixdance pop20132019536-3375121194850
362363L.A.LOVE (la la)Fergiedance pop20152023948-82627193290
276277FourFiveSecondsRihannabarbadian pop20152062758-6133518888580
\n", "

603 rows × 15 columns

\n", "
" ], "text/plain": [ " Unnamed: 0 title artist \\\n", "442 443 Million Years Ago Adele \n", "44 45 You Lost Me Christina Aguilera \n", "87 88 1+1 BeyoncÈ \n", "5 6 Baby Justin Bieber \n", "460 461 Praying Kesha \n", ".. ... ... ... \n", "314 315 I'll Show You Justin Bieber \n", "462 463 The Greatest Sia \n", "200 201 How Ya Doin'? (feat. Missy Elliott) Little Mix \n", "362 363 L.A.LOVE (la la) Fergie \n", "276 277 FourFiveSeconds Rihanna \n", "\n", " top genre year bpm nrgy dnce dB live val dur acous spch \\\n", "442 british soul 2016 0 0 0 -60 0 0 227 0 0 \n", "44 dance pop 2010 43 39 23 -6 14 7 257 85 4 \n", "87 dance pop 2011 63 38 30 -7 7 26 274 38 3 \n", "5 canadian pop 2010 65 86 73 -5 11 54 214 4 14 \n", "460 dance pop 2017 73 39 58 -7 11 32 230 49 3 \n", ".. ... ... ... ... ... .. ... ... ... ... ... \n", "314 canadian pop 2015 192 61 36 -7 18 8 200 5 10 \n", "462 australian dance 2017 192 73 67 -6 6 73 210 1 27 \n", "200 dance pop 2013 201 95 36 -3 37 51 211 9 48 \n", "362 dance pop 2015 202 39 48 -8 26 27 193 2 9 \n", "276 barbadian pop 2015 206 27 58 -6 13 35 188 88 5 \n", "\n", " pop \n", "442 0 \n", "44 56 \n", "87 60 \n", "5 77 \n", "460 77 \n", ".. ... \n", "314 68 \n", "462 76 \n", "200 50 \n", "362 0 \n", "276 80 \n", "\n", "[603 rows x 15 columns]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values(by='bpm')" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0titleartisttop genreyearbpmnrgydncedBlivevalduracousspchpop
01Hey, Soul SisterTrainneo mellow2010978967-488021719483
12Love The Way You LieEminemdetroit hip hop2010879375-55264263242382
23TiK ToKKeshadance pop20101208476-32971200101480
34Bad RomanceLady Gagadance pop20101199270-48712950479
45Just the Way You AreBruno Marspop20101098464-59432212478
................................................
598599Find U Again (feat. Camila Cabello)Mark Ronsondance pop20191046661-720161761375
599600Cross Me (feat. Chance the Rapper & PnB Rock)Ed Sheeranpop2019957975-6761206211275
600601No Brainer (feat. Justin Bieber, Chance the Ra...DJ Khaleddance pop20191367653-596526073470
601602Nothing Breaks Like a Heart (feat. Miley Cyrus)Mark Ronsondance pop20191147960-642242171769
602603Kills You SlowlyThe Chainsmokerselectropop20191504470-913232136667
\n", "

603 rows × 15 columns

\n", "
" ], "text/plain": [ " Unnamed: 0 title \\\n", "0 1 Hey, Soul Sister \n", "1 2 Love The Way You Lie \n", "2 3 TiK ToK \n", "3 4 Bad Romance \n", "4 5 Just the Way You Are \n", ".. ... ... \n", "598 599 Find U Again (feat. Camila Cabello) \n", "599 600 Cross Me (feat. Chance the Rapper & PnB Rock) \n", "600 601 No Brainer (feat. Justin Bieber, Chance the Ra... \n", "601 602 Nothing Breaks Like a Heart (feat. Miley Cyrus) \n", "602 603 Kills You Slowly \n", "\n", " artist top genre year bpm nrgy dnce dB live val \\\n", "0 Train neo mellow 2010 97 89 67 -4 8 80 \n", "1 Eminem detroit hip hop 2010 87 93 75 -5 52 64 \n", "2 Kesha dance pop 2010 120 84 76 -3 29 71 \n", "3 Lady Gaga dance pop 2010 119 92 70 -4 8 71 \n", "4 Bruno Mars pop 2010 109 84 64 -5 9 43 \n", ".. ... ... ... ... ... ... .. ... ... \n", "598 Mark Ronson dance pop 2019 104 66 61 -7 20 16 \n", "599 Ed Sheeran pop 2019 95 79 75 -6 7 61 \n", "600 DJ Khaled dance pop 2019 136 76 53 -5 9 65 \n", "601 Mark Ronson dance pop 2019 114 79 60 -6 42 24 \n", "602 The Chainsmokers electropop 2019 150 44 70 -9 13 23 \n", "\n", " dur acous spch pop \n", "0 217 19 4 83 \n", "1 263 24 23 82 \n", "2 200 10 14 80 \n", "3 295 0 4 79 \n", "4 221 2 4 78 \n", ".. ... ... ... ... \n", "598 176 1 3 75 \n", "599 206 21 12 75 \n", "600 260 7 34 70 \n", "601 217 1 7 69 \n", "602 213 6 6 67 \n", "\n", "[603 rows x 15 columns]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_index()" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "**การดูข้อมูลและการเข้าถึงข้อมูล: iloc()**\n", "\n", "วิธีหนึ่งในการเข้าถึงข้อมูลแต่ละช่องคือใช้เมธอด iloc() (**i**nteger-**loc**ation based indexing)\n", "\n", "ยกตัวอย่าง เราสามารถเข้าถึงแถวที่ 1 และคอลัมน์ที่ 2 ได้ดังนี้" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "'Hey, Soul Sister'" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Access the value on the first row and the first column\n", "df.iloc[0, 1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ถ้าเป็นแถวที่ 1 และคอลัมน์ที่ 3 เขียนได้ดังนี้" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Train'" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Access the value on the second row and the first column\n", "df.iloc[0,2]" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "เข้าถึงแถวที่ 2 และคอลัมน์ที่ 2 และแถวที่ 2 และคอลัมน์ที่ 3 ได้ดังนี้" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Love The Way You Lie'" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Access the value on the first row and the third column\n", "df.iloc[1,1]" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Eminem'" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Access the value on the second row and the third column\n", "df.iloc[1,2]" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "แสดงได้ดังรูป" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "![](images/2_loc.png)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "นอกจากจะใช้ index แล้วเรายังสามารถเข้าถึงคอลัมน์โดยใช้ชื่อได้เช่นกันโดยใช้เมธอด loc() (**loc**ation based indexing) (ผลลัพธ์เหมือนกับก่อนหน้านี้)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Hey, Soul Sister'" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Access the column using the name\n", "\n", "df.loc[0, 'title']" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Train'" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Access the column using the name\n", "\n", "df.loc[0, 'artist']" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Love The Way You Lie'" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Access the column using the name\n", "\n", "df.loc[1, 'title']" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Eminem'" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Access the column using the name\n", "\n", "df.loc[1, 'artist']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**การตัด (Slicing)**\n", "\n", "เราสามารถทำการตัดข้อมูลโดยใช้ทั้งดัชนีและชื่อของคอลัมน์ได้" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
titleartisttop genre
1Love The Way You LieEminemdetroit hip hop
2TiK ToKKeshadance pop
\n", "
" ], "text/plain": [ " title artist top genre\n", "1 Love The Way You Lie Eminem detroit hip hop\n", "2 TiK ToK Kesha dance pop" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Slicing the dataframe\n", "\n", "z = df.iloc[1:3, 1:4]\n", "z" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](images/3_loc.png)\n" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
titleartisttop genreyear
0Hey, Soul SisterTrainneo mellow2010
1Love The Way You LieEminemdetroit hip hop2010
2TiK ToKKeshadance pop2010
3Bad RomanceLady Gagadance pop2010
\n", "
" ], "text/plain": [ " title artist top genre year\n", "0 Hey, Soul Sister Train neo mellow 2010\n", "1 Love The Way You Lie Eminem detroit hip hop 2010\n", "2 TiK ToK Kesha dance pop 2010\n", "3 Bad Romance Lady Gaga dance pop 2010" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Slicing the dataframe using name\n", "\n", "z = df.loc[0:3, 'title':'year']\n", "z" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](images/4_loc.png)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**การจัดกลุ่มข้อมูลตาม column ที่กำหนด: groupby()**\n", "\n", "จัดกลุ่มข้อมูลตาม column ที่กำหนดโดยใช้เมธอด groupby()" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/nb/qx4_7k_n2mx3zzwh39qtrg_c0000gn/T/ipykernel_62169/130325292.py:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.median is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.\n", " df.groupby('year').median()\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0bpmnrgydncedBlivevalduracousspchpop
year
201026.0125.082.067.0-5.013.061.0221.03.05.065.0
201178.0126.078.065.0-5.014.057.0231.05.05.066.0
2012122.0125.079.067.0-5.012.068.0225.02.05.072.0
2013175.0126.078.063.0-5.013.052.0227.04.05.067.0
2014239.5125.072.063.5-6.013.554.0223.58.05.065.0
2015316.0120.075.067.0-5.013.052.0220.06.05.068.0
2016403.5113.071.063.0-6.013.043.0217.55.56.069.0
2017476.0106.072.068.0-5.012.054.0222.09.06.071.0
2018540.5110.067.067.5-6.011.547.5211.55.06.074.5
2019588.0104.068.070.0-6.010.052.0202.013.05.085.0
\n", "
" ], "text/plain": [ " Unnamed: 0 bpm nrgy dnce dB live val dur acous spch pop\n", "year \n", "2010 26.0 125.0 82.0 67.0 -5.0 13.0 61.0 221.0 3.0 5.0 65.0\n", "2011 78.0 126.0 78.0 65.0 -5.0 14.0 57.0 231.0 5.0 5.0 66.0\n", "2012 122.0 125.0 79.0 67.0 -5.0 12.0 68.0 225.0 2.0 5.0 72.0\n", "2013 175.0 126.0 78.0 63.0 -5.0 13.0 52.0 227.0 4.0 5.0 67.0\n", "2014 239.5 125.0 72.0 63.5 -6.0 13.5 54.0 223.5 8.0 5.0 65.0\n", "2015 316.0 120.0 75.0 67.0 -5.0 13.0 52.0 220.0 6.0 5.0 68.0\n", "2016 403.5 113.0 71.0 63.0 -6.0 13.0 43.0 217.5 5.5 6.0 69.0\n", "2017 476.0 106.0 72.0 68.0 -5.0 12.0 54.0 222.0 9.0 6.0 71.0\n", "2018 540.5 110.0 67.0 67.5 -6.0 11.5 47.5 211.5 5.0 6.0 74.5\n", "2019 588.0 104.0 68.0 70.0 -6.0 10.0 52.0 202.0 13.0 5.0 85.0" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('year').median()" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/nb/qx4_7k_n2mx3zzwh39qtrg_c0000gn/T/ipykernel_62169/3811477203.py:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.\n", " df.groupby('artist').mean().head()\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0yearbpmnrgydncedBlivevalduracousspchpop
artist
3OH!332.02010.0138.089.068.0-4.036.083.0192.01.08.062.0
5 Seconds of Summer230.02014.0160.095.049.0-4.033.044.0202.00.013.071.0
A Great Big World250.02014.0138.015.045.0-9.09.09.0229.087.03.061.0
Adam Lambert30.02010.0158.579.554.5-4.521.562.0227.51.08.062.5
Adele317.82014.7119.347.352.3-11.410.435.4257.535.34.061.5
\n", "
" ], "text/plain": [ " Unnamed: 0 year bpm nrgy dnce dB live val \\\n", "artist \n", "3OH!3 32.0 2010.0 138.0 89.0 68.0 -4.0 36.0 83.0 \n", "5 Seconds of Summer 230.0 2014.0 160.0 95.0 49.0 -4.0 33.0 44.0 \n", "A Great Big World 250.0 2014.0 138.0 15.0 45.0 -9.0 9.0 9.0 \n", "Adam Lambert 30.0 2010.0 158.5 79.5 54.5 -4.5 21.5 62.0 \n", "Adele 317.8 2014.7 119.3 47.3 52.3 -11.4 10.4 35.4 \n", "\n", " dur acous spch pop \n", "artist \n", "3OH!3 192.0 1.0 8.0 62.0 \n", "5 Seconds of Summer 202.0 0.0 13.0 71.0 \n", "A Great Big World 229.0 87.0 3.0 61.0 \n", "Adam Lambert 227.5 1.0 8.0 62.5 \n", "Adele 257.5 35.3 4.0 61.5 " ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('artist').mean().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [Exercise]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1) จงเขียนโค้ด กำหนดให้ตัวแปร q เก็บข้อมูลคอลัมน์ title ในรูปแบบของ dataframe\n" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "# Write your code below and press Shift+Enter to execute\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2) เขียนโค้ดกำหนดค่าให้ตัวแปร q เก็บ dataframe ที่ประกอบด้วยคอลัมน์ artist title และ year" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "# Write your code below and press Shift+Enter to execute\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3) เขียนโค้ดเข้าถึงข้อมูลที่อยู่ในแถวที่ 102 และคอลัมน์ที่ 3 ของ df\n" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Write your code below and press Shift+Enter to execute\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4) ค้นหาชื่อเพลงที่มีค่า nrgy ต่ำสุดและสูงสุด " ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "# Write your code below and press Shift+Enter to execute\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Author\n", "\n", "S.C.\n", "\n", "### Change Log\n", "\n", " \n", "| Date | Version | Change Description |\n", "|---|---|---|\n", "| 08-08-2021 | 0.1 | First edition |\n", "| | | |\n", "| | | |\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel(3.8.13))", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "toc-autonumbering": true, "toc-showcode": false, "toc-showmarkdowntxt": false, "toc-showtags": false }, "nbformat": 4, "nbformat_minor": 4 }