Explore California Housing Dataset

 

California Housing Dataset


🔹 Dataset Characteristics

| Feature | Description |
|---|---|
| Number of Instances | 20,640 |
| Number of Attributes | 8 numerical predictive attributes + 1 target + 1 categorical |
| Target Variable | Median House Value |

🔹 Context

This dataset is used in the book
“Hands-On Machine Learning with Scikit-Learn and TensorFlow” by Aurélien Géron.

It is widely used as an introductory dataset for machine learning because:

  • It requires only basic preprocessing

  • It has clear and interpretable features

  • It is moderate in size (not too small, not too large)



🔹 Dataset Description

The dataset contains information about housing in California districts based on the 1990 U.S. Census.

Each row represents a census block group, which is:

  • The smallest geographical unit for which the U.S. Census Bureau publishes sample data

  • Typically home to 600 to 3,000 people


🔹 Features (Attributes)

| No. | Attribute | Description |
|---|---|---|
| 1 | longitude | East-west position of the block group (values are negative in California; more negative = farther west) |
| 2 | latitude | How far north the block group is (higher = farther north) |
| 3 | housing_median_age | Median age of houses in a block group |
| 4 | total_rooms | Total number of rooms in a block group |
| 5 | total_bedrooms | Total number of bedrooms in a block group |
| 6 | population | Total population of the block group |
| 7 | households | Number of households in the block group |
| 8 | median_income | Median income (in tens of thousands of USD) |
| 9 | median_house_value | Median house value (target variable) |
| 10 | ocean_proximity | Categorical feature (proximity to the ocean) |

🎯 Target Variable

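  • median_house_value

  • Represents the median house value of a block group

    • In housing.csv the values are in US dollars (e.g., 452600.0 → $452,600)

    • In the scikit-learn version (MedHouseVal) the values are in hundreds of thousands of dollars (e.g., 2.5 → $250,000)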
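Because housing.csv stores the target in plain dollars while the scikit-learn copy uses units of $100,000, a small conversion helps when comparing results from the two. The following is a minimal sketch with made-up sample values; the variable names are illustrative only.

```python
import pandas as pd

# Made-up sample values in plain dollars, the scale used in housing.csv.
values_in_dollars = pd.Series(
    [452600.0, 358500.0, 250000.0], name="median_house_value"
)

# Convert to units of $100,000 (the scale of scikit-learn's MedHouseVal).
values_in_100k = values_in_dollars / 100_000

# Convert back to dollars.
back_to_dollars = values_in_100k * 100_000

print(values_in_100k.tolist())
print(back_to_dollars.tolist())
```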


⚠️ Important Notes

  • Dataset is not fully cleaned

  • May require:

    • Handling missing values (e.g., total_bedrooms)

    • Encoding the categorical variable (ocean_proximity)

    • Feature scaling
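The three preprocessing steps listed above can be sketched with pandas alone. This is one possible approach, not the only one: it assumes median imputation, z-score standardization, and one-hot encoding, and it runs on a tiny made-up sample that reuses the column names from housing.csv. The function name preprocess is hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample using column names from housing.csv.
sample = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "median_income": [8.3, 8.3, 7.3, 5.6],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "<1H OCEAN"],
})

def preprocess(df):
    """Return a copy of df that is imputed, standardized, and one-hot encoded."""
    out = df.copy()

    # 1. Handle missing values: fill total_bedrooms with the column median.
    median_bedrooms = out["total_bedrooms"].median()
    out["total_bedrooms"] = out["total_bedrooms"].fillna(median_bedrooms)

    # 2. Feature scaling: standardize numeric columns to mean 0, std 1.
    numeric_cols = out.select_dtypes(include="number").columns
    means = out[numeric_cols].mean()
    stds = out[numeric_cols].std(ddof=0)
    out[numeric_cols] = (out[numeric_cols] - means) / stds

    # 3. Encoding: replace ocean_proximity with one indicator column per category.
    out = pd.get_dummies(out, columns=["ocean_proximity"])
    return out

clean = preprocess(sample)
print(clean.isnull().sum().sum())  # number of remaining missing values
print(list(clean.columns))
```

In a real experiment the median, mean, and standard deviation would normally be computed on the training split only and then reused on the test split.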


🔍 Additional Insights

  • A household = group of people living in one home

  • Some blocks may have:

    • Few households

    • Many rooms (e.g., vacation areas)

  • This can lead to unusual feature values
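One way to spot such blocks is to divide total_rooms by households and look for very large ratios. The sketch below uses made-up numbers, and the cutoff of 20 rooms per household is an arbitrary assumption chosen only for illustration.

```python
import pandas as pd

# Made-up block groups; the last one mimics a vacation area
# with many rooms but very few households.
blocks = pd.DataFrame({
    "total_rooms": [1200.0, 3000.0, 900.0],
    "households": [400.0, 1000.0, 15.0],
})

# Rooms per household for each block group.
blocks["rooms_per_household"] = blocks["total_rooms"] / blocks["households"]

# Hypothetical cutoff for "unusual"; tune it after looking at the real data.
THRESHOLD = 20
unusual = blocks[blocks["rooms_per_household"] > THRESHOLD]

print(blocks)
print(unusual)
```

The same ratio is what the scikit-learn version of the dataset reports as AveRooms, which is why that column contains a few very large values.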


📥 Dataset File

  • File name: housing.csv

  • Source: California Housing Dataset (1990 Census)

  • Download file: housing.csv


Experiment-1


1. Download the file housing.csv
2. Open Google Colab
3. Start a new Python notebook
4. Upload the housing.csv file
5. Analyse the data file
6. Visualize the data

Python Code


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


# Read the data file (adjust the path to wherever housing.csv was uploaded in Colab)
df = pd.read_csv("/content/sample_data/housing.csv")
# Print the DataFrame
print(df)
# Print the shape as (rows, columns)
print(df.shape)
# Print the first 10 and last 10 rows
print(df.head(10))
print(df.tail(10))
# Print the column names
print(df.columns)
# Print the values of a single feature
print(df['total_rooms'])
# Select one input feature and the target (dependent) variable, first 150 rows only
X = df['total_rooms'].head(150).values.reshape(-1, 1)
y = df['median_house_value'].head(150).values.reshape(-1, 1)

# Visualize these values as a scatter plot
plt.scatter(X, y)
plt.xlabel("Total Rooms")
plt.ylabel("Median House Value")
plt.title("Scatter Plot of Total Rooms vs. Median House Value")
plt.show()

# Print the statistical summary of the numeric columns
print(df.describe())
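As a possible next step in the analysis, the correlation of each numeric column with the target gives a quick idea of which features are worth plotting. This is a minimal sketch: correlation_with_target is a hypothetical helper, and the demo values are made up so the block runs without housing.csv.

```python
import pandas as pd

def correlation_with_target(df, target="median_house_value"):
    """Pearson correlation of every numeric column with the target column."""
    # Keep numeric columns only; ocean_proximity is text and is skipped.
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    return corr.sort_values(ascending=False)

# Made-up demo rows using column names from housing.csv.
demo = pd.DataFrame({
    "median_income": [2.0, 4.0, 6.0, 8.0],
    "housing_median_age": [40.0, 10.0, 30.0, 20.0],
    "median_house_value": [100000.0, 200000.0, 300000.0, 400000.0],
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "NEAR BAY"],
})

result = correlation_with_target(demo)
print(result)
```

With the real file loaded, the call would be correlation_with_target(df).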

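The California housing dataset can also be fetched and used directly with scikit-learn. Note that the scikit-learn version differs from housing.csv: it has 8 numeric features (MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude), no ocean_proximity column, and a target expressed in units of $100,000. Study the code below to understand how to use it.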

from sklearn.datasets import fetch_california_housing
import pandas as pd
import matplotlib.pyplot as plt
housing = fetch_california_housing()
print(housing.DESCR)
print(housing.data.shape)
print(housing.target.shape)
print(housing.feature_names)

df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['MedHouseVal'] = housing.target
print("\nShape of Data:")
print(df.shape)

print("\nFirst 10 Rows:")
print(df.head(10))

print("\nColumns:")
print(df.columns)

print("\nStatistical Summary:")
print(df.describe())

print("\nMissing Values:")
print(df.isnull().sum())

# -----------------------------
# Scatter plot (average rooms vs. house value)
# -----------------------------
plt.figure()
plt.scatter(df['AveRooms'], df['MedHouseVal'])
plt.xlabel("Average Rooms")
plt.ylabel("House Value")
plt.title("Rooms vs House Value")
plt.show()
