hwayeon is studying


Basic Concepts of Database

2022. 8. 18. 22:21

What is a Database (DB)?

- Any collection of related information

Phone Book
Shopping list
Todo List
Your 5 best friends
Facebook’s User Base

- Databases can be stored in different ways

On paper
In your mind
On a computer
PowerPoint slides
Comments section

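* Computers + databases = extremely useful

· Computers are great at keeping track of large amounts of information

· Compare the product catalog behind Amazon.com with a paper shopping list: both are databases, but only a computer can realistically manage the first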

Database Management System (DBMS)

: a special software program that helps users create and maintain a database

Makes it easy to manage large amounts of information
Handles security
Handles backups
Imports and exports data
Handles concurrency (many users reading and writing at the same time)
Interacts with software applications written in programming languages

Two Types of Databases

1. Relational Databases: SQL (the most popular type; tables are defined in advance and data is inserted into them afterwards, much like an Excel spreadsheet)

- Organize data into one or more tables

Each table has columns and rows
A unique key identifies each row

2. Non-Relational: NoSQL / "not only SQL"

- Organize data in anything but a traditional table

Key-value stores: each unique key maps directly to a value
Documents: JSON, XML, etc.
Graphs
Flexible Tables
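
To make the difference between the two types concrete, here is a minimal sketch using only Python's standard library. The `student` table, its columns, and the sample documents are made up for illustration; they do not come from the video. The relational half uses SQLite's in-memory mode, and the non-relational half just uses plain dictionaries to stand in for documents.

```python
import json
import sqlite3

# Relational: the table is defined first, every row has the same columns,
# and a unique key (student_id) identifies each row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT, major TEXT)"
)
conn.executemany(
    "INSERT INTO student (student_id, name, major) VALUES (?, ?, ?)",
    [(1, "Jack", "Biology"), (2, "Kate", "Sociology")],
)
rows = conn.execute(
    "SELECT student_id, name, major FROM student ORDER BY student_id"
).fetchall()
conn.close()
print(rows)

# Non-relational (document style): each record describes itself,
# and two documents do not need to share the same fields.
documents = [
    {"_id": 1, "name": "Jack", "major": "Biology"},
    {"_id": 2, "name": "Kate", "clubs": ["chess", "debate"]},
]
print(json.dumps(documents[1]))
```

Note that the second document has no `major` field at all, which a fixed table layout would not allow without a NULL value or a schema change.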

Relational Databases (SQL)

1. Relational Database Management Systems (RDBMS)

- Help users create and maintain a relational database

e.g. MySQL, Oracle, PostgreSQL, MariaDB, etc.

2. Structured Query Language (SQL)

- Standardized language for interacting with RDBMS

- Used to perform CRUD (Create, Read, Update, Delete) operations, as well as other administrative tasks (user management, security, backup, etc.)

- Used to define tables and structures

- SQL code used on one RDBMS is not always portable to another without modification
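
The four CRUD operations can be sketched with Python's built-in `sqlite3` module. This is only an illustration under my own assumptions: the `todo` table and its rows are invented, and the SQL shown is the SQLite dialect, so other RDBMS may need small changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# SQL is also used to define tables and their structure.
conn.execute(
    "CREATE TABLE todo (id INTEGER PRIMARY KEY, task TEXT NOT NULL, done INTEGER DEFAULT 0)"
)

# Create: insert new rows (the id is assigned automatically).
conn.execute("INSERT INTO todo (task) VALUES (?)", ("buy milk",))
conn.execute("INSERT INTO todo (task) VALUES (?)", ("study SQL",))

# Read: fetch the rows back.
after_create = conn.execute("SELECT id, task, done FROM todo ORDER BY id").fetchall()

# Update: change an existing row.
conn.execute("UPDATE todo SET done = 1 WHERE task = ?", ("study SQL",))

# Delete: remove a row.
conn.execute("DELETE FROM todo WHERE task = ?", ("buy milk",))

after_changes = conn.execute("SELECT id, task, done FROM todo ORDER BY id").fetchall()
conn.close()

print(after_create)
print(after_changes)
```

The `?` placeholders let the driver pass values separately from the SQL text, which is the usual way to avoid building statements by string concatenation.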

Non-Relational Databases (NoSQL, "not only SQL")

* JSON: one of the most popular document formats in NoSQL

1. Non-Relational Database Management Systems (NRDBMS)

- Help users create and maintain a non-relational database

e.g. MongoDB, DynamoDB, Apache Cassandra, Firebase, etc.

2. Implementation Specific

- Any non-relational database falls under this category, so there’s no set language standard

- Most NRDBMS implement their own language for performing CRUD and administrative operations on the database.

* This is because there is no single standard language for operating non-relational databases
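
"Implementation specific" means each system decides what its own CRUD operations look like. As a rough sketch, here is a toy in-memory key-value store. `TinyKeyValueStore` and its method names `put`, `get`, and `delete` are hypothetical and are not the API of MongoDB, DynamoDB, or any other real product; the point is only that the interface is whatever the implementer chose.

```python
class TinyKeyValueStore:
    """Hypothetical in-memory key-value store, for illustration only."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Create a new entry or update (overwrite) an existing one."""
        self._data[key] = value

    def get(self, key, default=None):
        """Read the value stored under key, or default if it is missing."""
        return self._data.get(key, default)

    def delete(self, key):
        """Delete the entry; return True if something was removed."""
        if key in self._data:
            del self._data[key]
            return True
        return False


store = TinyKeyValueStore()
store.put("user:1", {"name": "Jack", "friends": 5})   # Create
store.put("user:1", {"name": "Jack", "friends": 6})   # Update
found = store.get("user:1")                           # Read
removed = store.delete("user:1")                      # Delete
missing = store.get("user:1")
print(found, removed, missing)
```

A different store could expose the same four operations under completely different names, which is why code written for one NoSQL system usually cannot be reused on another.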

Database Queries

: requests made to the database management system for specific information

- A query retrieves the specific information I want from a large database, written in a query language such as SQL

- As the database's structure becomes more and more complex, it becomes more difficult to get the specific pieces of information we want

- A query is like a Google search, except that we write it in a query language the DBMS understands instead of plain English
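
As a small sketch of a query, the example below asks for "kitchen products cheaper than 10" in SQL instead of English. The `product` table and its contents are invented for this example, and it again relies on Python's built-in `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, category TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO product (name, category, price) VALUES (?, ?, ?)",
    [
        ("notebook", "stationery", 3.5),
        ("pen", "stationery", 1.2),
        ("kettle", "kitchen", 25.0),
        ("mug", "kitchen", 6.0),
    ],
)

# The query names the columns we want, the table to look in,
# and the conditions a row must meet to be returned.
query = "SELECT name, price FROM product WHERE category = ? AND price < ? ORDER BY price"
cheap_kitchen = conn.execute(query, ("kitchen", 10)).fetchall()
conn.close()

print(cheap_kitchen)
```

Only the mug matches both conditions, so the DBMS returns that single row and ignores everything else in the table.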

Reference

https://www.youtube.com/embed/HXV3zeQKqGY


Exercise: Housing Prices Competition for Kaggle Learn Users

2022. 1. 25. 01:06

Certificate of the course

Competition Code

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "696ffc31",
   "metadata": {
    "papermill": {
     "duration": 0.017932,
     "end_time": "2022-01-24T15:31:02.911243",
     "exception": false,
     "start_time": "2022-01-24T15:31:02.893311",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "**This notebook is an exercise in the [Introduction to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/alexisbcook/machine-learning-competitions).**\n",
    "\n",
    "---\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15b0c621",
   "metadata": {
    "papermill": {
     "duration": 0.015928,
     "end_time": "2022-01-24T15:31:02.944362",
     "exception": false,
     "start_time": "2022-01-24T15:31:02.928434",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# Introduction\n",
    "\n",
    "In this exercise, you will create and submit predictions for a Kaggle competition. You can then improve your model (e.g. by adding features) to apply what you've learned and move up the leaderboard.\n",
    "\n",
    "Begin by running the code cell below to set up code checking and the filepaths for the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "b0beccd9",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-01-24T15:31:02.982067Z",
     "iopub.status.busy": "2022-01-24T15:31:02.972802Z",
     "iopub.status.idle": "2022-01-24T15:31:03.034194Z",
     "shell.execute_reply": "2022-01-24T15:31:03.033471Z",
     "shell.execute_reply.started": "2022-01-24T15:29:24.422310Z"
    },
    "papermill": {
     "duration": 0.076957,
     "end_time": "2022-01-24T15:31:03.034381",
     "exception": false,
     "start_time": "2022-01-24T15:31:02.957424",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Set up code checking\n",
    "from learntools.core import binder\n",
    "binder.bind(globals())\n",
    "from learntools.machine_learning.ex7 import *\n",
    "\n",
    "# Set up filepaths\n",
    "import os\n",
    "if not os.path.exists(\"../input/train.csv\"):\n",
    "    os.symlink(\"../input/home-data-for-ml-course/train.csv\", \"../input/train.csv\")  \n",
    "    os.symlink(\"../input/home-data-for-ml-course/test.csv\", \"../input/test.csv\") "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0022da8",
   "metadata": {
    "papermill": {
     "duration": 0.013738,
     "end_time": "2022-01-24T15:31:03.060426",
     "exception": false,
     "start_time": "2022-01-24T15:31:03.046688",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "Here's some of the code you've written so far. Start by running it again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "93a3c815",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-01-24T15:31:03.095832Z",
     "iopub.status.busy": "2022-01-24T15:31:03.094918Z",
     "iopub.status.idle": "2022-01-24T15:31:05.043489Z",
     "shell.execute_reply": "2022-01-24T15:31:05.044783Z",
     "shell.execute_reply.started": "2022-01-24T15:29:27.611973Z"
    },
    "papermill": {
     "duration": 1.973414,
     "end_time": "2022-01-24T15:31:05.045018",
     "exception": false,
     "start_time": "2022-01-24T15:31:03.071604",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Validation MAE for Random Forest Model: 21,857\n"
     ]
    }
   ],
   "source": [
    "# Import helpful libraries\n",
    "import pandas as pd\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.metrics import mean_absolute_error\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Load the data, and separate the target\n",
    "iowa_file_path = '../input/train.csv'\n",
    "home_data = pd.read_csv(iowa_file_path)\n",
    "y = home_data.SalePrice\n",
    "\n",
    "# Create X (After completing the exercise, you can return to modify this line!)\n",
    "features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']\n",
    "\n",
    "# Select columns corresponding to features, and preview the data\n",
    "X = home_data[features]\n",
    "X.head()\n",
    "\n",
    "# Split into validation and training data\n",
    "train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)\n",
    "\n",
    "# Define a random forest model\n",
    "rf_model = RandomForestRegressor(random_state=1)\n",
    "rf_model.fit(train_X, train_y)\n",
    "rf_val_predictions = rf_model.predict(val_X)\n",
    "rf_val_mae = mean_absolute_error(val_y, rf_val_predictions)\n",
    "\n",
    "print(\"Validation MAE for Random Forest Model: {:,.0f}\".format(rf_val_mae))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48fa7734",
   "metadata": {
    "papermill": {
     "duration": 0.010561,
     "end_time": "2022-01-24T15:31:05.067648",
     "exception": false,
     "start_time": "2022-01-24T15:31:05.057087",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# Train a model for the competition\n",
    "\n",
    "The code cell above trains a Random Forest model on **`train_X`** and **`train_y`**.  \n",
    "\n",
    "Use the code cell below to build a Random Forest model and train it on all of **`X`** and **`y`**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "658419ac",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-01-24T15:31:05.096232Z",
     "iopub.status.busy": "2022-01-24T15:31:05.095508Z",
     "iopub.status.idle": "2022-01-24T15:31:05.707453Z",
     "shell.execute_reply": "2022-01-24T15:31:05.708047Z",
     "shell.execute_reply.started": "2022-01-24T15:29:31.129932Z"
    },
    "papermill": {
     "duration": 0.629754,
     "end_time": "2022-01-24T15:31:05.708225",
     "exception": false,
     "start_time": "2022-01-24T15:31:05.078471",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "RandomForestRegressor()"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# To improve accuracy, create a new Random Forest model which you will train on all training data\n",
    "rf_model_on_full_data = RandomForestRegressor()\n",
    "\n",
    "# fit rf_model_on_full_data on all data from the training data\n",
    "rf_model_on_full_data.fit(X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "797a0776",
   "metadata": {
    "papermill": {
     "duration": 0.011148,
     "end_time": "2022-01-24T15:31:05.731192",
     "exception": false,
     "start_time": "2022-01-24T15:31:05.720044",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "Now, read the file of \"test\" data, and apply your model to make predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "66e57c17",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-01-24T15:31:05.759805Z",
     "iopub.status.busy": "2022-01-24T15:31:05.758751Z",
     "iopub.status.idle": "2022-01-24T15:31:05.836755Z",
     "shell.execute_reply": "2022-01-24T15:31:05.836173Z",
     "shell.execute_reply.started": "2022-01-24T15:29:33.666916Z"
    },
    "papermill": {
     "duration": 0.09341,
     "end_time": "2022-01-24T15:31:05.836927",
     "exception": false,
     "start_time": "2022-01-24T15:31:05.743517",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# path to file you will use for predictions\n",
    "test_data_path = '../input/test.csv'\n",
    "\n",
    "# read test data file using pandas\n",
    "test_data = pd.read_csv(test_data_path)\n",
    "\n",
    "# create test_X which comes from test_data but includes only the columns you used for prediction.\n",
    "# The list of columns is stored in a variable called features\n",
    "test_X = test_data[features]\n",
    "\n",
    "# make predictions which we will submit. \n",
    "test_preds = rf_model_on_full_data.predict(test_X)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "75d68ee6",
   "metadata": {
    "papermill": {
     "duration": 0.01115,
     "end_time": "2022-01-24T15:31:05.859379",
     "exception": false,
     "start_time": "2022-01-24T15:31:05.848229",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "Before submitting, run a check to make sure your `test_preds` have the right format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "b5ea8ac8",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-01-24T15:31:05.888306Z",
     "iopub.status.busy": "2022-01-24T15:31:05.887256Z",
     "iopub.status.idle": "2022-01-24T15:31:05.893013Z",
     "shell.execute_reply": "2022-01-24T15:31:05.892351Z",
     "shell.execute_reply.started": "2022-01-24T15:29:36.461312Z"
    },
    "papermill": {
     "duration": 0.022541,
     "end_time": "2022-01-24T15:31:05.893159",
     "exception": false,
     "start_time": "2022-01-24T15:31:05.870618",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "application/javascript": [
       "parent.postMessage({\"jupyterEvent\": \"custom.exercise_interaction\", \"data\": {\"outcomeType\": 1, \"valueTowardsCompletion\": 1.0, \"interactionType\": 1, \"questionType\": 2, \"questionId\": \"1_CheckSubmittablePreds\", \"learnToolsVersion\": \"0.3.4\", \"failureMessage\": \"\", \"exceptionClass\": \"\", \"trace\": \"\"}}, \"*\")"
      ],
      "text/plain": [
       "<IPython.core.display.Javascript object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "<span style=\"color:#33cc33\">Correct</span>"
      ],
      "text/plain": [
       "Correct"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Check your answer (To get credit for completing the exercise, you must get a \"Correct\" result!)\n",
    "step_1.check()\n",
    "# step_1.solution()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1a069dfc",
   "metadata": {
    "papermill": {
     "duration": 0.011894,
     "end_time": "2022-01-24T15:31:05.917282",
     "exception": false,
     "start_time": "2022-01-24T15:31:05.905388",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# Generate a submission\n",
    "\n",
    "Run the code cell below to generate a CSV file with your predictions that you can use to submit to the competition."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "73ad408b",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-01-24T15:31:05.946009Z",
     "iopub.status.busy": "2022-01-24T15:31:05.945003Z",
     "iopub.status.idle": "2022-01-24T15:31:05.958956Z",
     "shell.execute_reply": "2022-01-24T15:31:05.959500Z",
     "shell.execute_reply.started": "2022-01-24T15:29:41.508334Z"
    },
    "papermill": {
     "duration": 0.030196,
     "end_time": "2022-01-24T15:31:05.959691",
     "exception": false,
     "start_time": "2022-01-24T15:31:05.929495",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Run the code to save predictions in the format used for competition scoring\n",
    "\n",
    "output = pd.DataFrame({'Id': test_data.Id,\n",
    "                       'SalePrice': test_preds})\n",
    "output.to_csv('submission.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90e380a5",
   "metadata": {
    "papermill": {
     "duration": 0.012283,
     "end_time": "2022-01-24T15:31:05.984922",
     "exception": false,
     "start_time": "2022-01-24T15:31:05.972639",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# Submit to the competition\n",
    "\n",
    "To test your results, you'll need to join the competition (if you haven't already).  So open a new window by clicking on **[this link](https://www.kaggle.com/c/home-data-for-ml-course)**.  Then click on the **Join Competition** button.\n",
    "\n",
    "![join competition image](https://i.imgur.com/axBzctl.png)\n",
    "\n",
    "Next, follow the instructions below:\n",
    "1. Begin by clicking on the **Save Version** button in the top right corner of the window.  This will generate a pop-up window.  \n",
    "2. Ensure that the **Save and Run All** option is selected, and then click on the **Save** button.\n",
    "3. This generates a window in the bottom left corner of the notebook.  After it has finished running, click on the number to the right of the **Save Version** button.  This pulls up a list of versions on the right of the screen.  Click on the ellipsis **(...)** to the right of the most recent version, and select **Open in Viewer**.  This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.\n",
    "4. Click on the **Output** tab on the right of the screen.  Then, click on the file you would like to submit, and click on the **Submit** button to submit your results to the leaderboard.\n",
    "\n",
    "You have now successfully submitted to the competition!\n",
    "\n",
    "If you want to keep working to improve your performance, select the **Edit** button in the top right of the screen. Then you can change your code and repeat the process. There's a lot of room to improve, and you will climb up the leaderboard as you work.\n",
    "\n",
    "\n",
    "# Continue Your Progress\n",
    "There are many ways to improve your model, and **experimenting is a great way to learn at this point.**\n",
    "\n",
    "The best way to improve your model is to add features.  To add more features to the data, revisit the first code cell, and change this line of code to include more column names:\n",
    "```python\n",
    "features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']\n",
    "```\n",
    "\n",
    "Some features will cause errors because of issues like missing values or non-numeric data types.  Here is a complete list of potential columns that you might like to use, and that won't throw errors:\n",
    "- 'MSSubClass'\n",
    "- 'LotArea'\n",
    "- 'OverallQual' \n",
    "- 'OverallCond' \n",
    "- 'YearBuilt'\n",
    "- 'YearRemodAdd' \n",
    "- '1stFlrSF'\n",
    "- '2ndFlrSF' \n",
    "- 'LowQualFinSF' \n",
    "- 'GrLivArea'\n",
    "- 'FullBath'\n",
    "- 'HalfBath'\n",
    "- 'BedroomAbvGr' \n",
    "- 'KitchenAbvGr' \n",
    "- 'TotRmsAbvGrd' \n",
    "- 'Fireplaces' \n",
    "- 'WoodDeckSF' \n",
    "- 'OpenPorchSF'\n",
    "- 'EnclosedPorch' \n",
    "- '3SsnPorch' \n",
    "- 'ScreenPorch' \n",
    "- 'PoolArea' \n",
    "- 'MiscVal' \n",
    "- 'MoSold' \n",
    "- 'YrSold'\n",
    "\n",
    "Look at the list of columns and think about what might affect home prices.  To learn more about each of these features, take a look at the data description on the **[competition page](https://www.kaggle.com/c/home-data-for-ml-course/data)**.\n",
    "\n",
    "After updating the code cell above that defines the features, re-run all of the code cells to evaluate the model and generate a new submission file.  \n",
    "\n",
    "\n",
    "# What's next?\n",
    "\n",
    "As mentioned above, some of the features will throw an error if you try to use them to train your model.  The **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** course will teach you how to handle these types of features. You will also learn to use **xgboost**, a technique giving even better accuracy than Random Forest.\n",
    "\n",
    "The **[Pandas](https://kaggle.com/Learn/Pandas)** course will give you the data manipulation skills to quickly go from conceptual idea to implementation in your data science projects. \n",
    "\n",
    "You are also ready for the **[Deep Learning](https://kaggle.com/Learn/intro-to-Deep-Learning)** course, where you will build models with better-than-human level performance at computer vision tasks."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31035197",
   "metadata": {
    "papermill": {
     "duration": 0.011881,
     "end_time": "2022-01-24T15:31:06.009511",
     "exception": false,
     "start_time": "2022-01-24T15:31:05.997630",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "---\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/intro-to-machine-learning/discussion) to chat with other learners.*"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.12"
  },
  "papermill": {
   "default_parameters": {},
   "duration": 14.840316,
   "end_time": "2022-01-24T15:31:06.834359",
   "environment_variables": {},
   "exception": null,
   "input_path": "__notebook__.ipynb",
   "output_path": "__notebook__.ipynb",
   "parameters": {},
   "start_time": "2022-01-24T15:30:51.994043",
   "version": "2.3.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
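
The notebook reports its result as a validation MAE (mean absolute error). As a quick sketch of what that number means, MAE is the average of the absolute differences between the true and predicted values. The helper `mean_absolute_error_manual` and the three sample prices below are my own illustration, not part of the course code or the competition data.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one value is required")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up sale prices and predictions, each off by exactly 10,000.
actual = [200000, 150000, 310000]
predicted = [210000, 140000, 300000]

mae = mean_absolute_error_manual(actual, predicted)
print("MAE: {:,.0f}".format(mae))
```

Because the absolute value ignores the sign of each error, swapping the two arguments gives the same result, and a lower MAE means the predictions are closer to the real prices on average.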
