We are working with a vehicle fleet inspection agency to provide a spectrum of analytics concerning their client organizations. Their most pressing need and the current task is to populate a widget on a dashboard showing the three organizations with the highest proportion of vehicles that failed their last adjudicated inspection.
In order to populate the widget, we need to produce a report in a plain text file named
virs_report.tsv. It should include three TAB-separated columns:
org_name: The names of the organizations as inferred from the source data files.
tot_v: How many vehicles with inspections exist in those organizations.
failed_v: How many vehicles in those organizations are currently in a failed state.
We do not have access to the customer's main tracking database but we do have monthly dumps of vehicle inspection report (VIR) records. These contain enough information that we can determine everything we really need to know about both the vehicles being inspected and the organizations that own those vehicles.
The three monthly files they gave us so far are
vir_202008.csv, and once the system is in place new files will be provided monthly. Though the files are provided for your inspection, other teams are already loading the flat files into database tables named
vir_202008 respectively inside the provided SQLite database file
virs_db_sqlite.db where the load step is already complete, so write your code starting with that.
In addition to this report, we anticipate the client will want to extend the dashboard to include additional analyses. To enable this, your solution should structure the data so that it is amenable to other queries on related topics.
Our customer has provided a few additional details related to how the source data is maintained by their inspection software:
- A vehicle's or organization's ID will remain the same over time, even if something else about it changes (e.g, an organization changes its name)
- Whenever a vehicle is inspected, an "inspection record" is created. An inspection record is identified by a unique combination of
inspection_date, which can be considered a compound key. It may be adjudicated as passing or failing, or not adjudicated at that time. This means there can either be a value in the
inspection_passedcolumn, or it can be empty. However, an inspection record can get updated if the inspection is re-adjudicated. When this happens, the
inspection_passedvalue may change. A re-adjudicated inspection record would appear in a subsequent VIR (a new dump).
- The information in the latest entry pertaining to a vehicle or an organization is authoritative. For example, if the latest inspection record for a particular vehicle indicates that it now belongs to a different organization, we must interpret that as the vehicle having been transferred between organizations and its previous inspections no longer count against that organization.
- Until a new inspection result is indicated, the previous inspection result continues to apply. There can be long delays between when an inspection is conducted, and when a result is updated to passing or failing, so each successive dump is likely to update a few inspection records received in previous dumps.
- In a given inspection record, only
inspection_passedis ever updated, so changes to vehicles or organizations can only be inferred from subsequent inspection records.
The code you used to produce the structure and report is the main substance of your answer that we'll be evaluating, so please provide all the code necessary to reproduce your work. The exception is that we don't need to see how you wrote the report to disk as long as you provide the code that generated the report data.
Along with the report file, please provide all code used to produce the report. The code you used to structure the data and produce the report is the main substance of your homework submission that we will be evaluating.
Please include any relevant instructions or caveats for how to use your code to reproduce the result in your COMMENTS file.
This is intended to be a quick and hopefully fun task. We want to see how you approach this problem and how that relates to similar problems. Please take care to ensure your solution is not too tailored to these specific requirements and allows for future queries. However, there is no need or benefit to going beyond the stated requirements. For example, we do not care about and do not want to retain past states of the data, so we don’t need time series tables.
Your submission should include:
- A report in
- All code used to prepare the data for analysis (e.g. creating and populating tables) and generate the report.
COMMENTS.mdfile explaining how to use your code to reproduce your result.