Batch Streaming to Automatically Read in Data Counts
Working with large datasets is unavoidable these days, and processing the data in batches can significantly reduce analysis time. This program uses PySpark to import data from a directory and run counts.
The Data
All files were provided by the Bellevue University DSC650 GitHub repository.
Technologies
This project was completed in Jupyter Notebook. The following packages are used:
- PySpark
- shutil
- time
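
The notebook code is not reproduced here. As a rough illustration of how these pieces might fit together, the sketch below uses only `shutil` and `time` to copy files one at a time into a watched directory and keep a running count after each arrival. It is a standard-library stand-in, not the notebook's method: a plain Python loop takes the place of the PySpark reader, and the function name `stream_files_in_batches`, the folder names, the sample files, and the choice to count words are all assumptions made for the example.

```python
import shutil
import tempfile
import time
from collections import Counter
from pathlib import Path


def stream_files_in_batches(source_dir, watch_dir, pause_seconds=0.0):
    """Copy files one at a time into watch_dir, yielding running word counts.

    Hypothetical helper: a plain loop stands in for a PySpark streaming read.
    """
    source_dir, watch_dir = Path(source_dir), Path(watch_dir)
    watch_dir.mkdir(parents=True, exist_ok=True)
    totals = Counter()
    for src in sorted(source_dir.iterdir()):
        if not src.is_file():
            continue
        dest = watch_dir / src.name
        shutil.copy(src, dest)                   # simulate a new batch arriving
        totals.update(dest.read_text().split())  # count whitespace-separated words
        yield src.name, dict(totals)             # snapshot of the counts so far
        time.sleep(pause_seconds)                # pause before the next batch


# Demo with throwaway files (made-up sample data, not the DSC650 files).
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source"
    source.mkdir()
    (source / "batch_1.txt").write_text("a b a\n")
    (source / "batch_2.txt").write_text("b c\n")
    results = list(stream_files_in_batches(source, Path(tmp) / "incoming"))

for name, counts in results:
    print(name, counts)
```

Each yielded pair is the name of the file that just arrived and the cumulative counts up to that point, so the totals grow as more batches land. Raising `pause_seconds` spaces out the arrivals, which is presumably the role `time` plays when feeding a streaming job.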
Launch
All necessary code is included in the Jupyter notebook. The data files can be found in the data folder.