Performing 20,000 concurrent file searches in less than 1 minute.
Most programming languages make working with concurrency and parallel threads extremely complicated. Elixir instead chose to follow Erlang and keep the concept of concurrency simple. What I'm about to show in this post is something I'd never attempt in Ruby, Python, or any of the other languages that weren't specifically designed for concurrency.
Understanding concurrent programming is vital to building scalable, high-performance AI programs. AI systems are expected to sift through large amounts of data very quickly and make smart decisions based on what they find. Rather than beat a dead horse, I'll illustrate the point with a problem, then show the concurrent solution I used to address it.
Everyone loves big data, but there are times when it is extremely difficult to manage. Beyond managing it, you need programs that can interpret it. As time goes on, data sets are only going to get larger and larger, so processing data sequentially is no longer an option, nor has it been for the past 20 years. Ask Google.
If someone asked you to search 10 files to find out how many times a product was purchased, you'd say, "No sweat, I could code that up easily." Then time goes on and that person comes back and says, "Hey, I need you to search 5,000 files." You say, "Yikes, OK, I can manage." You notice the program's performance getting slower and slower. You start hoping they don't need more, but they always do. Finally they come to you and say, "OMG, our growth is exponential!! We have 20,000 files now! Can you find how many people purchased a specific product?!"
This is a simplistic version of what's been happening to startups as a whole. The challenges keep increasing, and eventually you discover that you need better tools to address them. AI is no different. In fact, it is much harder, because such a system does little else but sift through tons and tons of data before making a rational decision. It's a much different way to program than thinking in terms of building a CRUD app.
During my automation projects I'm finding that there are often multiple data sources to read from. These data sources are usually files with tons of data in them. Other times they are databases or web services, but for the most part, when you're making smart programs they need to be able to read and analyze tons of files on the fly, especially if you're working with a partner that does nightly data dumps to FTP servers.
Let's say, for example, you had 20,000 customer files, each with 5,000 different purchase ids in them. These files come in on one of those nightly data dumps. You hate this partner because you wish they would use a more modern way to pass data so that you could query it more easily, but that's not how the real world works. Sometimes you're stuck with bad third-party integrations, and it's up to you to deal with it. The boss wants you to find how many customers listed in the files have the purchase id 777 in their purchase history and, most importantly, determine who doesn't. Each file lists all the purchase ids for that customer. The business needs to know this because they don't want to waste time marketing a product to a customer who has already been sold that product. So you now need to look for purchase id 777 in 20,000 customer files!
I love efficiency, so I decided to build an automated system responsible for finding a specific purchase id. It returns a full list of all the customers who have it in their purchase history and those who don't, so I can rerun it whenever that data is needed again with a different purchase id. For the sake of the example, I wrote a small script that generated 20,000 files, each with 5,000 random purchase id numbers, into the data/people/ directory. The steps of the automation algorithm look like this:
- Get the target purchase id from the user
- Get all 20,000 files in the data/people path
- Parse and prepare the files
- Spawn off one process for each file
- Perform a binary search algorithm on each file in its own process for the target purchase id
- Send the status of the finished process to a result list (which can be an Elixir Agent)
- Display the results list
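The throwaway data-generation script mentioned above might look something like this sketch. The directory layout, file names, and id range are my assumptions for the example; the ids are sorted on write so that a binary search can later be run over each file.

```elixir
defmodule DataGenerator do
  # Sketch of a test-data generator (names and ranges are assumptions):
  # writes `file_count` files, each holding `ids_per_file` random purchase
  # ids, one per line, sorted ascending so binary search applies.
  def generate(dir \\ "data/people", file_count \\ 20_000, ids_per_file \\ 5_000) do
    File.mkdir_p!(dir)

    for n <- 1..file_count do
      contents =
        Stream.repeatedly(fn -> Enum.random(1..100_000) end)
        |> Enum.take(ids_per_file)
        |> Enum.sort()
        |> Enum.map_join("\n", &Integer.to_string/1)

      File.write!(Path.join(dir, "person_#{n}.txt"), contents)
    end

    :ok
  end
end
```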
How many lines of code did that take? 80, maybe less. Check out the solution below...
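A sketch along the lines of those steps is below. It assumes each file under data/people/ holds one purchase id per line in ascending order (so binary search applies); the module and function names are illustrative, not the original listing.

```elixir
defmodule PurchaseSearch do
  # Searches every file under `dir` for `target_id`, one process per file.
  # Returns a list of {filename, found?} tuples collected in an Agent.
  def run(target_id, dir \\ "data/people") do
    # The Agent holds the shared results list.
    {:ok, results} = Agent.start_link(fn -> [] end)

    dir
    |> Path.join("*.txt")
    |> Path.wildcard()
    |> Enum.map(fn file ->
      # One lightweight BEAM process (Task) per file.
      Task.async(fn ->
        found = file |> parse() |> binary_search(target_id)
        Agent.update(results, fn acc -> [{Path.basename(file), found} | acc] end)
      end)
    end)
    |> Task.await_many(:infinity)

    Agent.get(results, & &1)
  end

  # Parse a file into a tuple of integers; tuples give O(1) element access,
  # which binary search needs.
  defp parse(file) do
    file
    |> File.read!()
    |> String.split("\n", trim: true)
    |> Enum.map(&String.to_integer/1)
    |> List.to_tuple()
  end

  # Classic binary search over a sorted tuple.
  defp binary_search(ids, target), do: binary_search(ids, target, 0, tuple_size(ids) - 1)

  defp binary_search(_ids, _target, low, high) when low > high, do: false

  defp binary_search(ids, target, low, high) do
    mid = div(low + high, 2)

    case elem(ids, mid) do
      ^target -> true
      id when id < target -> binary_search(ids, target, mid + 1, high)
      _ -> binary_search(ids, target, low, mid - 1)
    end
  end
end
```

Because each Task is a lightweight BEAM process rather than an OS thread, spawning 20,000 of them is routine, and the Agent serializes access to the shared results list.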
The result of this code is shown next.
Notice the time! It took just under a minute to search 20,000 files on a laptop! The machine I ran this on has a 2.5 GHz Intel Core i7 processor. Just imagine the performance if I ran it on a 3.0 GHz Core i7-5960X or a supercomputer!
In short: concurrent programs are the way to go!