In a project, we usually want to execute codes many times, but there are also some codes we only want to execute once. It is a good idea to utilize a file made specifically for running a program once and let’s call it in this case the runOnce.py file.
For this HackerNews project, we will initialize and populate tables within our runOnce.py file to create ops_tracker, ops_name, top_stories, and users. Information for each table below:
- ops_tracker: Shows the time stamps of each automated execution within the database. Contains primary_key, tracking_key, start_time, end_time, and the ops_key.
- ops_name: Shows the name and details about what each operation does classified by an ops_key. Contains: ops_key (primary_key of table), ops_name, ops_detail, effective_start_date, and effective_end_date.
- top_stories: Shows the top stories created and are categorized by the tracking_key. Contains primary_key, tracking_key, id, deleted, type, by, time, text, dead, parent, poll, and kids.
- users: show user information tied with primary_key and tracking_key. Contains primary_key, tracking_key, id, created, karma, about, and submitted
Alright, now that we know what we are going to initialize, let’s do it using Python!
First, let’s import our required libraries and connect to the database
import pandas as pd import os as os from sqlalchemy import (create_engine, Column, DateTime, Integer, MetaData, Table, Boolean, VARCHAR) from sqlalchemy.dialects.postgresql import TIMESTAMP from dotenv import load_dotenv load_dotenv('.env') user = os.environ['login'] password = os.environ['password'] dbname = os.environ['dbname'] host = os.environ['host'] port = os.environ['port'] engine = create_engine( ( f"postgresql://{user}:" f"{password}" f"@{host}:{port}/{dbname}" ) ) connection = engine.connect()
Now initialize our tables by declaring a variable, table name, and column information to each table.
metadata = MetaData() ops_tracker = Table( 'ops_tracker', metadata, Column('primary_key', Integer, primary_key=True, autoincrement=True), Column('tracking_key', Integer, nullable=False), Column('start_time', TIMESTAMP, nullable=False), Column('end_time', TIMESTAMP, nullable=False), Column('ops_key', Integer, nullable=False) ) ops_name = Table( 'ops_name', metadata, Column('ops_key', Integer, nullable=False, primary_key=True), Column('ops_name', VARCHAR, nullable=False), Column('ops_detail', VARCHAR, nullable=True), Column('effective_start_date', TIMESTAMP, nullable=False), Column('effective_end_date', TIMESTAMP, nullable=False) ) top_stories = Table( 'top_stories', metadata, Column('primary_key', Integer, primary_key=True, autoincrement=True), Column('tracking_key', Integer, nullable=False), Column('id', Integer, nullable=True), Column('deleted', Boolean, nullable=True), Column('type', VARCHAR, nullable=True), Column('by', VARCHAR, nullable=True), Column('time', Integer, nullable=True), Column('text', VARCHAR, nullable=True), Column('dead', Boolean, nullable=True), Column('parent', Integer, nullable=True), Column('poll', Integer, nullable=True), Column('kids', VARCHAR, nullable=True), Column('url', VARCHAR, nullable=True), Column('score', Integer, nullable=True), Column('title', VARCHAR, nullable=True), Column('parts', VARCHAR, nullable=True), Column('descendants', Integer, nullable=True), Column('extracted_datetime', TIMESTAMP, nullable=True) ) users = Table( 'users', metadata, Column('primary_key', Integer, primary_key=True, autoincrement=True), Column('tracking_key', Integer, nullable=False), Column('id', VARCHAR, nullable=False), Column('created', Integer, nullable=True), Column('karma', Integer, nullable=True), Column('about', VARCHAR, nullable=True), Column('submitted', VARCHAR, nullable=True), Column('extracted_datetime', TIMESTAMP, nullable=True) ) metadata.create_all(engine)
With metadata.create_all(engine)
, we have now created empty tables in our schema.
Since our tables are currently empty, we will first fill in our ‘ops_name’ table with information on certain operations we will use later on.
dataOpsName = pd.DataFrame( { 'ops_key': [1,2,3], 'ops_name': ['retrieve_top_stories', 'stories_update', 'users_update'], 'ops_detail': ['Initial operations to get top news', 'Update top news', 'Update data related to users'], 'effective_start_date': ['2023-12-03','2023-12-03','2023-12-03'], 'effective_end_date': ['9999-12-31','9999-12-31','9999-12-31'] } ) dataOpsName.to_sql( name = 'ops_name', con=connection, if_exists='append', index=False ) connection.commit()
Now, we have inserted rows into the ops_names
table! This will be the key to what type of operations is used for other operations tables.