Wednesday, October 17, 2012

I'm a C++ dinosaur, but I'm OK

So here's a nice challenge. Let's say you have a list of member email addresses which you get from your account list. But you also have a list of email addresses that you have of your customers, addresses that your customers have used to communicate with you. And finally you have a list of email addresses of potential customers, but who are not yet signed up.

Now let's say someone decides to do a survey of future and existing customers, and sends out the survey to a list that purports to be that. And when the numbers of the survey come in.. mass confusion arises because the numbers don't add up.

So we now have a bunch of files, 'email addresses we mailed the survey to', 'main account email addresses', 'potential customer email addresses' and 'other customer email addresses'. And nobody knows why the last three don't add up to the first list.

Did I mention typos and polluted data? The pain.

So what would normal people do? Well, no idea, but we got on the case using 'diff'. This was tremendously helpful, but only up to a point. It is hard to answer questions like 'give me everybody not on list A but who is on B or C'. This is of course all very easy to do from SQL.

But I wasn't feeling like that. So, because C++ is what I do, I wrote a C++ program. I might be a dinosaur like that, but I feel compelled to point out that the program I present below will scale to literally billions and billions of customers. You know, in case you have more users than there are people on the planet.

So what does this small program do? Using slightly modern C++, it reads lines of text from several files. And once it is done, it will print for each unique line of text in which files it was found. This allows you to quickly see for each email address if it was part of 'main account email addresses', but perhaps not of 'potential customer addresses' etc.

Once it has printed that, it prints out a list of those same lines of text per 'configuration'. So you get groups of lines, and one group might represent 'on the list of addresses we emailed to, but not in any of the other files'.  In other words, no clue why we emailed this address.

Another group might represent 'addresses we emailed and only present on the potential customer list'. And finally, and importantly, you might get the group 'we didn't email, but IS actually a customer'.

The output can be read by most spreadsheets, and they'll do the right thing.

All in all, this helped solve a tricky problem, and was actually implemented in less time than it would take to do it by hand. I hope ;-)

// read all input files, output for each line of text in which of the input files it was found
// public domain code by bert.hubert@netherlabs.nl
#include <string>
#include <vector>
#include <map>
#include <iostream>
#include <fstream>
#include <boost/dynamic_bitset.hpp>
#include <boost/algorithm/string.hpp>
#include <boost/foreach.hpp>

using namespace std;

// this allows us to make a Case Insensitive container
struct CIStringCompare: public std::binary_function<string, string, bool>  
{
  bool operator()(const string& a, const string& b) const
  {
    return strcasecmp(a.c_str(), b.c_str()) < 0;
  }
};

int main(int argc, char**argv)
{
  typedef map<string, boost::dynamic_bitset<>, CIStringCompare> presence_t;
  presence_t presence;
  
  string line;
  cout << '\t';
  for(int n = 1; n < argc; ++n) {
    cout << argv[n] << '\t';
    ifstream ifs(argv[n]);
    if(!ifs) {
      cerr<<"Unable to open '"<<argv[n]<<"' for reading\n"<<endl;
      exit(EXIT_FAILURE);
    }
    
    while(getline(ifs, line)) {
      boost::trim(line);
      if(line.empty())
        continue;
      presence_t::iterator iter = presence.find(line);
      if(iter == presence.end()) { // not present, do a very efficient 'insert & get location'
        iter = presence.insert(make_pair(line, boost::dynamic_bitset<>(argc-1))).first;  
      }
      iter->second[n-1]=1;
    }
  }
  cout << '\n';
  
  // this is where we store the reverse map, 'presence groups', so which lines where present in file1, but not file2 etc
  typedef map<boost::dynamic_bitset<>, vector<string> > revpresence_t;
  revpresence_t revpresence;
  
  BOOST_FOREACH(presence_t::value_type& val, presence) {
    revpresence[val.second].push_back(val.first);
    cout << val.first << '\t';
    for (boost::dynamic_bitset<>::size_type i = 0; i < val.second.size(); ++i) {
      cout << val.second[i] << '\t';
    }
    cout << endl;
  }
  
  cout << "\nPer group output\t\n";
  BOOST_FOREACH(revpresence_t::value_type& val, revpresence) {
    cout<<"\nGroup: \t";
    
    for (boost::dynamic_bitset<>::size_type i = 0; i < val.first.size(); ++i) {
      cout << val.first[i] << '\t';
    }
    
    cout << endl;
    
    BOOST_FOREACH(string& entry, val.second) {
      cout << entry << "\t\n";
    }
  }
}

No comments:

Post a Comment