Batch Streaming to Automatically Read in Data Counts

Working with large datasets is unavoidable these days, so being able to process the data in batches can significantly reduce analysis time. This program uses PySpark to import data from a directory and run counts.

The Data

All files were provided by the Bellevue University DSC650 GitHub repository.

Technologies

The project was completed in Jupyter Notebook and uses the following packages:

  • PySpark
  • shutil
  • time
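
As a rough illustration of how these three packages could fit together, the sketch below copies files into a watched directory one at a time (shutil and time) while a PySpark streaming query keeps a running count of the lines it reads. This is an assumption about the general approach, not the notebook's actual code: the function names, the app name, the one-second delay, and the choice to count lines of text are all hypothetical.

```python
import shutil
import time
from pathlib import Path


def stage_files(source_dir, watch_dir, delay_seconds=1.0):
    """Copy files one at a time into watch_dir, pausing between copies.

    Hypothetical helper: simulates data arriving in batches.
    Returns the names of the files that were copied, in order.
    """
    source = Path(source_dir)
    target = Path(watch_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source.iterdir()):
        if path.is_file():
            shutil.copy(path, target / path.name)
            copied.append(path.name)
            time.sleep(delay_seconds)  # give the stream time to pick up the file
    return copied


def start_line_count_stream(watch_dir):
    """Start a streaming query that prints a running count of lines read.

    Hypothetical helper: assumes the input can be read as plain text.
    """
    # Imported here so stage_files can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("BatchCounts").getOrCreate()
    lines = (
        spark.readStream
        .option("maxFilesPerTrigger", 1)  # process one new file per micro-batch
        .text(str(watch_dir))
    )
    counts = lines.groupBy().count()  # global running total of lines
    return (
        counts.writeStream
        .outputMode("complete")  # re-emit the full total after each batch
        .format("console")
        .start()
    )
```

Under these assumptions, a notebook cell would call start_line_count_stream on the watched directory first, then stage_files to feed it from the data folder, and finally stop() on the returned query once the counts have been printed.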

Launch

All necessary code is included in the Jupyter notebook. The data files can be found in the data folder.