Lab 7 Manual Data Pipeline

1. Scenario

You are building a system to monitor Cars entering and exiting five carparks. The data that you are capturing is car plate number, time of entry, time of exit and the carpark. In the past two months there seemed to be a overcrowding issues at the carparks. You are to prepare the data for the Data Scientists and Analysts to investigate the issue.

We start this practical exercises with a common starting point, reading in data and explore the data using Microsoft Excel or other visualisations or dashboarding tools.

2. Prepare Python Environment

  1. Open a terminal to run these commands to contain all your python codes in the python environment

     mkdir -p ~/Documents/projects/ee3801
    
     cd ~/Documents/projects/ee3801
    
     python3 -m venv venv_ee3801
    
     source venv_ee3801/bin/activate
    
  2. Open Visual Studio Code. Add Folder into Workspace (“~/Documents/projects/ee3801”).

  3. Right click on the added folder ee3801 and create a new jupyter notebook file “manual_data_pipeline.ipynb”.

  4. Select Kernel - Python Environment “venv_ee3801”. (For Windows) Select your python interpreter first then kernel “Python: Select Interpreter” > “~/Documents/projects/ee3801/venv_ee3801/Script/python.exe”.

     # install python library
        
     !python -m pip install faker
     !python -m pip install pandas
     !python -m pip install matplotlib
     !python -m pip install --upgrade pip
    

    When prompted “Running cells with ‘venv_ee3801 (Python 3.x.x)’ requires the ipykernel package.”, click Install.

     # import libraries
        
     from faker import Faker
     import json
     from datetime import datetime, timedelta
     import random
     import pandas as pd
     import matplotlib.pyplot as plt
        
     # set the date format
     date_format = "%d/%m/%Y %H:%M:%S"
        
     import os
     home_directory = os.path.expanduser("~")
     os.chdir(home_directory+'/Documents/projects/ee3801')
    

3. Generate data

  1. Copy and paste these codes into your notebook and run.

     # faker is a python library that generates fake data
     # here we want to generate fake data to simulate cars entering into the carpark
     fake=Faker()
     fake.license_plate()
    
  2. Define the CarPark class and declare function that generate cars entering the carpark for the past 3 months between 9 am to 8 pm for every minute and second.

     # define the CarPark class
     class CarPark:
         def __init__(self, Plate, LocationID, Entry_DateTime, Exit_DateTime, Parking_Charges):
             self.Plate = Plate
             self.LocationID = LocationID
             self.Entry_DateTime = Entry_DateTime
             self.Exit_DateTime = Exit_DateTime
             self.Parking_Charges = Parking_Charges
        
     def generate_past_datetime(days):
         """
         Generates a list of random datetime objects within the last 2-3 months,
         between 9 AM and 8 PM, including minute and second precision.
        
         Returns:
             datetime object.
         """
         # generated_dates = []
         now = datetime.now()
        
         # Define the time range for the past n months
         end_date = now
         start_date = now - timedelta(days=days) 
        
         time_delta_total_seconds = int((end_date - start_date).total_seconds())
        
         # Generate a random number of seconds within the 2-3 month range
         random_seconds_past = random.randint(0, time_delta_total_seconds)
         random_date_base = start_date + timedelta(seconds=random_seconds_past)
        
         # Generate random time components between 9 AM and 8 PM
         random_hour = random.randint(9, 20)  # 9 to 20 (inclusive for 8 PM)
         random_minute = random.randint(0, 59)
         random_second = random.randint(0, 59)
        
         # Combine date and time components
         generated_datetime = random_date_base.replace(
             hour=random_hour,
             minute=random_minute,
             second=random_second,
             microsecond=0  # Set microseconds to 0 for minute/second precision
         )
        
         return generated_datetime
        
     # generate cars entering the carpark for the past 2-3 months between 9 am to 8 pm for every minute and second
     def createPastCarEntry():
         car = CarPark(
             Plate= fake.license_plate(),
             LocationID="Park"+str(random.randint(0, 5)),
             Entry_DateTime=generate_past_datetime(90).strftime(date_format), # Approximately 3 months ago
             Exit_DateTime="",
             Parking_Charges=float(0)
         )
        
         # return a json format
         return json.dumps(car.__dict__) 
    
  3. Generate 1000 cars.

     # Generate 1000 cars
     carpark_system = []
     for i in range(1000):
         thiscar_dict = eval(createPastCarEntry())
         carpark_system.append(list(thiscar_dict.values()))
        
     carpark_system_df = pd.DataFrame(carpark_system, columns=list(eval(createPastCarEntry()).keys()))
     print(len(carpark_system_df))
     carpark_system_df.head()
            
    
  4. Save to csv file for analysis.

     # export to csv for further analysis
     carpark_system_df.to_csv("data/carpark_system.csv", encoding='utf-8-sig', index=False)
        
    

4. Python Visualisation

  1. In your notebook, plot the data to show the total number of cars in each location.

     carpark_system_df.groupby(['LocationID'])['LocationID'].count().plot(title="Total number of cars in each location")
     index_carpark_system_df = carpark_system_df.set_index('Entry_DateTime')
     index_carpark_system_df.index = pd.to_datetime(index_carpark_system_df.index, errors='coerce')
        
    
  2. Plot a chart to show which time of the day has more cars entering the carpark.

     df = pd.DataFrame({'hour':index_carpark_system_df.index.hour, 
         'plate': index_carpark_system_df['Plate']})\
         .groupby('hour')['plate'].count()
        
     ax = df.plot(title="Total number of cars in each hour of the day")
        
     # show data labels
     for i, txt in enumerate(df):
         ax.annotate(txt, (df.index[i], df.values[i]), textcoords="offset points", xytext=(0,5), ha='center')
     plt.show()
        
    
  3. Plot a chart to show how many cars is in each carpark at every hour. Is the carparks overcrowded?

     pd.DataFrame({'hour':index_carpark_system_df.index.hour, 
         'plate': index_carpark_system_df['Plate'], 
         'carpark': index_carpark_system_df['LocationID']})\
         .groupby(['hour','carpark'])['plate'].count()
    
     df = pd.DataFrame({'hour':index_carpark_system_df.index.hour, 
         'plate': index_carpark_system_df['Plate'], 
         'carpark': index_carpark_system_df['LocationID']})\
         .groupby(['hour','carpark'])['plate'].count()
     df.unstack().plot(title="Number of cars in the car park in each hour of the day")
    

5. Microsoft Excel (Pivot Tables)

  1. Open the carpark_system.csv file in Microsoft Excel.

  2. Click on Insert > Pivot Chart > Ok

  3. A new worksheet will be opened.

  4. Drag and drop the field name Entry_DateTime into Rows, Plate into Values, LocationID into Columns. Note: If the date does not show Month or Day groupings, go to your system date settings and set to English (Singapore).

6. Microsoft Power BI

The amount of data might increase as the days goes by, we want to quickly visualise the charts everytime we refresh the data.

For Windows users,

  1. Go to powerbi.microsoft.com. Download and install Microsoft Power BI.

  2. Start MS Power BI.

For Mac users,

  1. Download VMWare Horizon Client

  2. Launch the VMWAre Horizon Client and create connection to https://thinlab.nus.edu.sg

  3. Launch the virtual machine, click on “Virtual Student Desktop”.

  4. Login using nusstu\<your student id> and password

  5. In the virtual windows, start MS Power BI.

  6. Your computer will need to be in NUS VPN.

Loading and visualising data

  1. Upload your generated carpark_system.csv file into OneDrive:

    ”~/Library/CloudStorage/OneDrive-NationalUniversityofSingapore/ee3801/data/carpark_system.csv”.

  2. Right click on the triple dots (…) and copy the link.

  3. Open MS Power BI Desktop.

  4. Select New > Report.

  5. Select Get Data > Web. Paste the link that you have copied in step 2. Remove the question mark (?), parameters and values at the end of the link. Click Ok and Load. Authentication prompt might pop up, select Organisation > Sign in > Connect.

  6. Select Transform Data.

  7. Change the Data Type of Entry_DateTime and Exit_DateTime to Locale > DateTime > Locale > English (Singapore).


  8. Save your MS Power BI file as carpark_system.pbix. Click on Apply. Click on Close and Apply.

  9. Drag and drop the Line chart into the workspace. Drag and drop Entry_DateTime into X-axis, Plate into Y-axis, LocationID into Legend.

  10. You can see the chart. Click on the arrow down (V) beside X-axis Entry_DateTime > Choose Date Hierarchy. Remove Year and Qtr.


  11. New data is coming in… Use the codes below to generate new car entry for every hour from 9 am to 8 pm for every minute and second.

    def createNewCarEntry():
        car = CarPark(
            Plate= fake.license_plate(),
            LocationID="Park"+str(random.randint(0, 5)),
            Entry_DateTime=generate_past_datetime(1).strftime(date_format), # Approximately 1 day ago
            Exit_DateTime="",
            Parking_Charges=float(0)
        )
        
        return json.dumps(car.__dict__)
    
  12. Generate 100 cars.

    # read latest file
    carpark_system_df = pd.read_csv("data/carpark_system.csv", encoding='utf-8-sig', index_col=False)
    # carpark_system_df.drop(columns=['Unnamed: 0'], inplace=True)
    print(len(carpark_system_df))
    carpark_system_df.head()
        
    # Generate more cars, append to list and save csv
    carpark_system = []
    for i in range(100):
        thiscar_dict = eval(createNewCarEntry())
        carpark_system.append(list(thiscar_dict.values()))
        
    new_carpark_system_df = pd.DataFrame(carpark_system, columns=list(eval(createNewCarEntry()).keys()))
    print(len(new_carpark_system_df))
    print(new_carpark_system_df.head())
        
    updated_carpark_system_df = pd.concat([carpark_system_df,new_carpark_system_df], axis=0)
    print(len(updated_carpark_system_df))
    updated_carpark_system_df.head()
        
    # export to csv for further analysis
    updated_carpark_system_df.to_csv("data/carpark_system.csv", encoding='utf-8-sig', index=False)
    # export to your OneDrive too for on-demand refresh (replace your file in OneDrive)
    updated_carpark_system_df.to_csv("~/Library/CloudStorage/OneDrive-NationalUniversityofSingapore/ee3801/data/carpark_system.csv", encoding='utf-8-sig', index=False)
            
    
  13. The codes above should replace the carpark_system.csv in OneDrive. However if it failed you can still manually copy and replace your generated carpark_system.csv file into OneDrive.

  14. In MS Power BI, click on the refresh button.

  15. Drag and drop the Card into the workspace and count the number of Plates.

Conclusion

In this lab, you learned the

  • manual task of generating data,
  • visualising using python,
  • Microsoft Excel and
  • created a simple data pipeline in Microsoft Power BI.

This is the most simple form of data pipeline that is often the first steps in an organisation with a basic data maturity level. However this is an essential step towards building a more automated streamlined data pipeline.

Questions to ponder

  • With MS Sharepoint through OneDrive and MS Power BI we are able to see the latest data by one click. How do we remove the step to click and yet refresh the data?
  • What is the time taken to refresh the data?
  • As the data refreshes only when you click on the refresh button, if you have multiple users how do you share this data with them and how do you ensure that they will access up-to-date data?
  • What if your company does not subscribe to Microsoft Power Platform?

Submissions next Wed 9pm (8 Oct 2025)

Submit your ipynb as a pdf, excel file and pbi file. Save your ipynb as a html file, open in browser and print as a pdf. Include in your submission:

Answer the Questions to ponder

~ The End ~