With access to the database, we can write functions to collect information from our tables. There are four functions to be implemented in this customFunctionsGeneral file which will be utilized in the main to retrieve data.
The getMaxTrackingKey function returns a brand new tracking key for the upcoming HackerNews execution after the latest tracking key.
# Function 1 def getMaxTrackingKey(): """Retrieves the greatest tracking key that's present in the ops_tracker table and returns an integer representing the tracking key that's greater than the current max tracking key incremented by 1. """ sv.query = """ SELECT tracking_key from ops_tracker; """ tracking = pd.read_sql_query(sv.query, sv.connection) if len(tracking) == 0: max_tracking_key = 1 else: max_tracking_key = tracking['tracking_key'].max() + 1 return max_tracking_key
The uploadOpsTime function uses the parameters given by the user to create a data frame to be added to the ops_tracker table in the database. This includes the duration of a particular execution, what tracking key it is (should be the latest as it is real-time), and what type of execution was just done in correlation to the function execution.
# Function 2 def uploadOpsTime(maxTrackingKey='', starttime='', endtime='', ops_key=''): """ Creates and appends a dataframe containing information about the HackerNews execution. """ zOpsTracking = pd.DataFrame( { 'tracking_key': [maxTrackingKey], 'start_time': [starttime], 'end_time': [endtime], 'ops_key': [ops_key] } ) zOpsTracking.to_sql( name='ops_tracker', con=sv.connection, if_exists='append', index=False ) sv.connection.commit()
The retrievePreviousTopNews function takes in the number of days before the current date given by the user and returns all the distinct IDs from the top_stories table between the length of days specified by the user.
Note: ‘%s’ % str in the “query” assignment line is essentially replacing the s with str which is the max_lookback_days.
# Function 3 def retrievePreviousTopNews(max_lookback_days=30): """ Takes in number of days prior to current date as beginning range and returns the distinct ids from top_stories between timeframe. """ # First, Check if the input is an integer. if type(max_lookback_days) is not int: print('stop you did not supply integer') else: # Parse max_lookback_days as a variable str = max_lookback_days query = 'SELECT DISTINCT id from top_stories WHERE extracted_datetime >= current_date - %s' % str # First we get the data we want data_to_look_back = pd.read_sql_query( query, con=sv.connection ) return data_to_look_back
The retrieveDistinctUsers function tries to get all distinct users in the top_stories table.
# Function 4 def retrieveDistinctUsers(): """ Returns all distint users from top_stories if exists""" # Define Query query = 'SELECT DISTINCT by FROM top_stories ORDER BY by asc' # Next we get data try: dataUsers = pd.read_sql_query( query, con=sv.connection ) return dataUsers except Exception as e: write_log('I am in checkIfPostgresConnectionWorks exception. There is an issue with database connection') sys.exit(1)
Now with these essential functions, we can use them in our main file later to retrieve data that we need from our database.