Lab 7 Manual Data Pipeline

1. Scenario

You are building a system to monitor Cars entering and exiting five carparks. The data that you are capturing is car plate number, time of entry, time of exit and the carpark. In the past two months there seemed to be a overcrowding issues at the carparks. You are to prepare the data for the Data Scientists and Analysts to investigate the issue.

We start this practical exercises with a common starting point, reading in data and explore the data using Microsoft Excel or other visualisations or dashboarding tools.

2. Prepare Python Environment

Open a terminal to run these commands to contain all your python codes in the python environment

 mkdir -p ~/Documents/projects/ee3801

 cd ~/Documents/projects/ee3801

 python3 -m venv venv_ee3801

 source venv_ee3801/bin/activate

Open Visual Studio Code. Add Folder into Workspace (“~/Documents/projects/ee3801”).
Right click on the added folder ee3801 and create a new jupyter notebook file “manual_data_pipeline.ipynb”.

Select Kernel - Python Environment “venv_ee3801”. (For Windows) Select your python interpreter first then kernel “Python: Select Interpreter” > “~/Documents/projects/ee3801/venv_ee3801/Script/python.exe”.

 # install python library
    
 !python -m pip install faker
 !python -m pip install pandas
 !python -m pip install matplotlib
 !python -m pip install --upgrade pip

When prompted “Running cells with ‘venv_ee3801 (Python 3.x.x)’ requires the ipykernel package.”, click Install.

 # import libraries
    
 from faker import Faker
 import json
 from datetime import datetime, timedelta
 import random
 import pandas as pd
 import matplotlib.pyplot as plt
    
 # set the date format
 date_format = "%d/%m/%Y %H:%M:%S"
    
 import os
 home_directory = os.path.expanduser("~")
 os.chdir(home_directory+'/Documents/projects/ee3801')

3. Generate data

Copy and paste these codes into your notebook and run.

 # faker is a python library that generates fake data
 # here we want to generate fake data to simulate cars entering into the carpark
 fake=Faker()
 fake.license_plate()

Define the CarPark class and declare function that generate cars entering the carpark for the past 3 months between 9 am to 8 pm for every minute and second.

 # define the CarPark class
 class CarPark:
     def __init__(self, Plate, LocationID, Entry_DateTime, Exit_DateTime, Parking_Charges):
         self.Plate = Plate
         self.LocationID = LocationID
         self.Entry_DateTime = Entry_DateTime
         self.Exit_DateTime = Exit_DateTime
         self.Parking_Charges = Parking_Charges
    
 def generate_past_datetime(days):
     """
     Generates a list of random datetime objects within the last 2-3 months,
     between 9 AM and 8 PM, including minute and second precision.
    
     Returns:
         datetime object.
     """
     # generated_dates = []
     now = datetime.now()
    
     # Define the time range for the past n months
     end_date = now
     start_date = now - timedelta(days=days) 
    
     time_delta_total_seconds = int((end_date - start_date).total_seconds())
    
     # Generate a random number of seconds within the 2-3 month range
     random_seconds_past = random.randint(0, time_delta_total_seconds)
     random_date_base = start_date + timedelta(seconds=random_seconds_past)
    
     # Generate random time components between 9 AM and 8 PM
     random_hour = random.randint(9, 20)  # 9 to 20 (inclusive for 8 PM)
     random_minute = random.randint(0, 59)
     random_second = random.randint(0, 59)
    
     # Combine date and time components
     generated_datetime = random_date_base.replace(
         hour=random_hour,
         minute=random_minute,
         second=random_second,
         microsecond=0  # Set microseconds to 0 for minute/second precision
     )
    
     return generated_datetime
    
 # generate cars entering the carpark for the past 2-3 months between 9 am to 8 pm for every minute and second
 def createPastCarEntry():
     car = CarPark(
         Plate= fake.license_plate(),
         LocationID="Park"+str(random.randint(0, 5)),
         Entry_DateTime=generate_past_datetime(90).strftime(date_format), # Approximately 3 months ago
         Exit_DateTime="",
         Parking_Charges=float(0)
     )
    
     # return a json format
     return json.dumps(car.__dict__) 

Generate 1000 cars.

 # Generate 1000 cars
 carpark_system = []
 for i in range(1000):
     thiscar_dict = eval(createPastCarEntry())
     carpark_system.append(list(thiscar_dict.values()))
    
 carpark_system_df = pd.DataFrame(carpark_system, columns=list(eval(createPastCarEntry()).keys()))
 print(len(carpark_system_df))
 carpark_system_df.head()
        

Save to csv file for analysis.

 # export to csv for further analysis
 carpark_system_df.to_csv("data/carpark_system.csv", encoding='utf-8-sig', index=False)
    

4. Python Visualisation

In your notebook, plot the data to show the total number of cars in each location.

 carpark_system_df.groupby(['LocationID'])['LocationID'].count().plot(title="Total number of cars in each location")
 index_carpark_system_df = carpark_system_df.set_index('Entry_DateTime')
 index_carpark_system_df.index = pd.to_datetime(index_carpark_system_df.index, errors='coerce')
    

Plot a chart to show which time of the day has more cars entering the carpark.

 df = pd.DataFrame({'hour':index_carpark_system_df.index.hour, 
     'plate': index_carpark_system_df['Plate']})\
     .groupby('hour')['plate'].count()
    
 ax = df.plot(title="Total number of cars in each hour of the day")
    
 # show data labels
 for i, txt in enumerate(df):
     ax.annotate(txt, (df.index[i], df.values[i]), textcoords="offset points", xytext=(0,5), ha='center')
 plt.show()
    

Plot a chart to show how many cars is in each carpark at every hour. Is the carparks overcrowded?

 pd.DataFrame({'hour':index_carpark_system_df.index.hour, 
     'plate': index_carpark_system_df['Plate'], 
     'carpark': index_carpark_system_df['LocationID']})\
     .groupby(['hour','carpark'])['plate'].count()

 df = pd.DataFrame({'hour':index_carpark_system_df.index.hour, 
     'plate': index_carpark_system_df['Plate'], 
     'carpark': index_carpark_system_df['LocationID']})\
     .groupby(['hour','carpark'])['plate'].count()
 df.unstack().plot(title="Number of cars in the car park in each hour of the day")

5. Microsoft Excel (Pivot Tables)

Open the carpark_system.csv file in Microsoft Excel.
Click on Insert > Pivot Chart > Ok
A new worksheet will be opened.
Drag and drop the field name Entry_DateTime into Rows, Plate into Values, LocationID into Columns. Note: If the date does not show Month or Day groupings, go to your system date settings and set to English (Singapore).

6. Microsoft Power BI

The amount of data might increase as the days goes by, we want to quickly visualise the charts everytime we refresh the data.

For Windows users,

Go to powerbi.microsoft.com. Download and install Microsoft Power BI.
Start MS Power BI.

For Mac users,

Download VMWare Horizon Client
Launch the VMWAre Horizon Client and create connection to https://thinlab.nus.edu.sg
Launch the virtual machine, click on “Virtual Student Desktop”.
Login using nusstu\<your student id> and password
In the virtual windows, start MS Power BI.
Your computer will need to be in NUS VPN.

Loading and visualising data

Upload your generated carpark_system.csv file into OneDrive:

”~/Library/CloudStorage/OneDrive-NationalUniversityofSingapore/ee3801/data/carpark_system.csv”.
Right click on the triple dots (…) and copy the link.
Open MS Power BI Desktop.
Select New > Report.
Select Get Data > Web. Paste the link that you have copied in step 2. Remove the question mark (?), parameters and values at the end of the link. Click Ok and Load. Authentication prompt might pop up, select Organisation > Sign in > Connect.
Select Transform Data.
Change the Data Type of Entry_DateTime and Exit_DateTime to Locale > DateTime > Locale > English (Singapore).
Save your MS Power BI file as carpark_system.pbix. Click on Apply. Click on Close and Apply.
Drag and drop the Line chart into the workspace. Drag and drop Entry_DateTime into X-axis, Plate into Y-axis, LocationID into Legend.
You can see the chart. Click on the arrow down (V) beside X-axis Entry_DateTime > Choose Date Hierarchy. Remove Year and Qtr.

New data is coming in… Use the codes below to generate new car entry for every hour from 9 am to 8 pm for every minute and second.

def createNewCarEntry():
    car = CarPark(
        Plate= fake.license_plate(),
        LocationID="Park"+str(random.randint(0, 5)),
        Entry_DateTime=generate_past_datetime(1).strftime(date_format), # Approximately 1 day ago
        Exit_DateTime="",
        Parking_Charges=float(0)
    )
    
    return json.dumps(car.__dict__)

Generate 100 cars.

# read latest file
carpark_system_df = pd.read_csv("data/carpark_system.csv", encoding='utf-8-sig', index_col=False)
# carpark_system_df.drop(columns=['Unnamed: 0'], inplace=True)
print(len(carpark_system_df))
carpark_system_df.head()
    
# Generate more cars, append to list and save csv
carpark_system = []
for i in range(100):
    thiscar_dict = eval(createNewCarEntry())
    carpark_system.append(list(thiscar_dict.values()))
    
new_carpark_system_df = pd.DataFrame(carpark_system, columns=list(eval(createNewCarEntry()).keys()))
print(len(new_carpark_system_df))
print(new_carpark_system_df.head())
    
updated_carpark_system_df = pd.concat([carpark_system_df,new_carpark_system_df], axis=0)
print(len(updated_carpark_system_df))
updated_carpark_system_df.head()
    
# export to csv for further analysis
updated_carpark_system_df.to_csv("data/carpark_system.csv", encoding='utf-8-sig', index=False)
# export to your OneDrive too for on-demand refresh (replace your file in OneDrive)
updated_carpark_system_df.to_csv("~/Library/CloudStorage/OneDrive-NationalUniversityofSingapore/ee3801/data/carpark_system.csv", encoding='utf-8-sig', index=False)
        

The codes above should replace the carpark_system.csv in OneDrive. However if it failed you can still manually copy and replace your generated carpark_system.csv file into OneDrive.
In MS Power BI, click on the refresh button.
Drag and drop the Card into the workspace and count the number of Plates.

Conclusion

In this lab, you learned the

manual task of generating data,
visualising using python,
Microsoft Excel and
created a simple data pipeline in Microsoft Power BI.

This is the most simple form of data pipeline that is often the first steps in an organisation with a basic data maturity level. However this is an essential step towards building a more automated streamlined data pipeline.

Questions to ponder

With MS Sharepoint through OneDrive and MS Power BI we are able to see the latest data by one click. How do we remove the step to click and yet refresh the data?
What is the time taken to refresh the data?
As the data refreshes only when you click on the refresh button, if you have multiple users how do you share this data with them and how do you ensure that they will access up-to-date data?
What if your company does not subscribe to Microsoft Power Platform?

Submissions next Wed 9pm (8 Oct 2025)

Submit your ipynb as a pdf, excel file and pbi file. Save your ipynb as a html file, open in browser and print as a pdf. Include in your submission:

Answer the Questions to ponder

~ The End ~