
What should I use to play with over 400 million rows of data?

Posted by Kujo
225-911-5736
Member since Dec 2015
6031 posts
Posted on 1/9/17 at 2:32 am
I love Excel, and I want something with the same ease of use and ability to manipulate rows that Excel has, but with the capacity to handle the data load.

Data Type: Telemetry data (i.e., vehicle placement, credentials, load, etc.). A geo-positioning/mapping add-in or feature would be sweet.

Data Issues: I need to purge duplicate or erroneous entries, i.e., where column a&b = c&d and column x falls in between y&z. I also need to cross-reference column m against the "DRIVERLIST" table for status, and column n against the "fuel type" table.

Data Use: generate reports based on the overall industry and then on individual carriers, individual routes, drivers, route heat maps, etc.



The data is currently on Oracle SQL Developer.

Should I learn Python?
Posted by mdomingue
Lafayette, LA
Member since Nov 2010
37822 posts
Posted on 1/9/17 at 7:47 am to
I wish I had experience with any of this so I could give you advice, but I don't. I've never used Python.

I'm assuming the raw data will be stored in a SQL-type database and you're looking for an interface to retrieve and manipulate the data into useful chunks for display/reporting/decision-making purposes.

I'd be curious to know what solution you come up with. If you don't mind, keep us posted, particularly if you find your solution elsewhere.
Posted by tom
Baton Rouge
Member since Jun 2007
8495 posts
Posted on 1/9/17 at 7:52 am to
The basics of Oracle SQL are easy enough to pick up in a few hours. It's much easier than Excel or Python. Plus, you already have SQL Developer.

quote:

column a&b = c&d and column x falls in between y&z

This is almost SQL.
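
For example, a rough sketch of what that could look like (table and column names below are made up, since I don't know your schema):

-- rows matching the duplicate/erroneous criteria, with the DRIVERLIST cross-reference
SELECT t.*, d.status
FROM   telemetry t
       JOIN driverlist d ON d.driver_id = t.col_m
WHERE  t.col_a = t.col_c
AND    t.col_b = t.col_d
AND    t.col_x BETWEEN t.col_y AND t.col_z;

Once you're comfortable with what that returns, the same conditions can drive the purge.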
Posted by TrebleHook
Member since Jun 2016
1356 posts
Posted on 1/9/17 at 7:58 am to
SabiDojo could probably code something for you
Posted by BoogaBear
Member since Jul 2013
6457 posts
Posted on 1/9/17 at 8:01 am to
PL/SQL is pretty straightforward, and you could run through some coding classes online and knock out what you need.

If you're more visual, check out Pentaho. There is a community version that you can use for free.

Pentaho

ETA: Your data isn't in Oracle SQL Developer. Your data is likely in an Oracle database. SQL Developer is simply a query tool.

ETA2: Pentaho has add-ins for address data (Melissa Data) and geo mapping.
This post was edited on 1/9/17 at 8:31 am
Posted by castorinho
Member since Nov 2010
84407 posts
Posted on 1/9/17 at 8:04 am to
Stop tracking me, Telematics.
Posted by flyAU
Scottsdale
Member since Dec 2010
24897 posts
Posted on 1/9/17 at 8:24 am to
This post tells me I am very glad to not be in logistics.
Posted by jeff5891
Member since Aug 2011
15824 posts
Posted on 1/9/17 at 8:25 am to
R, SAS (if you know how to code), pandas for Python

quote:

Data Use: generate reports based on the overall industry and then on individual carriers, individual routes, drivers, route heat maps, etc.


ggplot2 in R, Tableau, pandas


Honestly, I would use R or pandas if you want a secondary program to help with analysis and data visualization. The SQL language can be used for both as well.
This post was edited on 1/9/17 at 8:43 am
Posted by Scream4LSU
Member since Sep 2007
1140 posts
Posted on 1/9/17 at 11:22 am to
If you already have it in a relational database (Oracle), you can just use PL/SQL to manipulate/clean the data and query it for reports. Do you have access to the Oracle server? Do you need nicely formatted reports, or is tabular data fine?
Posted by TigerinATL
Member since Feb 2005
62437 posts
Posted on 1/9/17 at 11:46 am to
quote:

Do you need nicely formatted reports, or is tabular data fine?


My take based on his questions is that he's used to exporting data from the database and playing around with it in Excel. Now the data has gotten too large for that to be a practical solution. If that's the case I'm thinking Tableau might be closer to what he's looking for.
Posted by lynxcat
Member since Jan 2008
24713 posts
Posted on 1/9/17 at 2:48 pm to
Alteryx. You will thank me. It should be more than capable of meeting your needs, and it has spatial mapping capabilities.

It's built around visual, object-based workflows, and you don't have to learn a coding language.
Posted by lynxcat
Member since Jan 2008
24713 posts
Posted on 1/9/17 at 2:51 pm to
Tableau is amazing for visualizations but there are much better tools for data cleansing.

Alteryx integrates with Tableau too.

I'm not associated with the company but I do use the program from time to time and have been extremely impressed with its ease of use and speed of analysis.
Posted by foshizzle
Washington DC metro
Member since Mar 2008
40599 posts
Posted on 1/9/17 at 5:41 pm to
quote:

The basics of Oracle SQL are easy enough to pick up in a few hours. It's much easier than Excel or Python. Plus, you already have SQL Developer.



Oracle is a very good answer for this and the great news is that you can download it for free. You can even get the full enterprise-level version if you promise it's just for personal use. Then you can use SQL Developer to query your data.

If you've never installed Oracle, there are several tutorials on YouTube that will help.

I actually have this installed on my Surface Pro 3. Since it has an SSD, my tablet probably runs queries faster than most major corporations. Of course, a Surface probably won't be able to store 400 million rows, but you could put it on a larger SSD in a desktop and get very good results.
Posted by Kujo
225-911-5736
Member since Dec 2015
6031 posts
Posted on 1/9/17 at 11:57 pm to
quote:

My take based on his questions is that he's used to exporting data from the database and playing around with it in Excel.


Exactly. I can visually see the errors, i.e., one vehicle kept reporting a single trip multiple times, another is breaking a single trip into multiple stops and simultaneous starts. Unless you "see" that trend, you don't know it's there... I'm guessing, unless you write some type of check code: if the duration between x and x-1 is less than 1 minute, flag an error?
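
That check might look something like this in Oracle SQL (a rough sketch with made-up table/column names, assuming trip_start and trip_end are DATE columns):

-- flag trips that start less than a minute after the previous trip ended
-- for the same vehicle (1/1440 of a day = 1 minute)
SELECT t.*,
       CASE WHEN t.trip_start
                 - LAG(t.trip_end) OVER (PARTITION BY t.vehicle_id
                                         ORDER BY t.trip_start) < 1/1440
            THEN 'CHECK: starts under 1 min after previous trip'
       END AS flag
FROM   trips t;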

This is a new position for them (I'm the only one they have), which involves a lot of easy performance reports (no big deal) from their proprietary system. But now they want me to start getting deeper into the data, because they have a GPS tracking monitor on all vehicles that they want to incorporate into internal trip/route data.

I can play with one vehicle's data for a day and give a report about it, or use it in an investigation into a driver or something "pinpoint"... but when I look at an entire day of all vehicles, I crash before I can begin.

We have over 2,000 vehicles on the road at a time, running even more drivers in different jurisdictions... and I have to be able to:

ensure data integrity
have the quarterly report in such a manner that any question can be answered on the fly (like a pivot table)

"So kujo, what's the difference in westward bound route times in October versus December?"

October is 5% faster.

"Really, Where is more time saved before noon or afternoon?"

Uh, frick, Imma have to reload all the fraking data again and then try to write some time-split code into it, and hopefully by the end of the day I can get that ONE question answered... with 80... 60% confidence I did it right.


Posted by Kujo
225-911-5736
Member since Dec 2015
6031 posts
Posted on 1/9/17 at 11:59 pm to
quote:

Oracle is a very good answer


It takes forever to run a query covering more than one day, and when I tried to export a quarter's worth of data to a CSV file, it took 3 days to save.

I am not a fan
Posted by tom
Baton Rouge
Member since Jun 2007
8495 posts
Posted on 1/10/17 at 8:45 am to
Any third-party solution you use is going to either run off of Oracle or export the data to some Oracle equivalent and then use that. So it will take as long as, or longer than, doing it correctly in Oracle.

400 million records is a lot. How far back does that data go? Is all of that time important for your reporting purposes? If not, you may consider automatically copying/summarizing small portions of that data into your own separate table (with indexes appropriate for your queries) on a nightly basis.

Your strategy in querying should be to minimize/eliminate your calls to the 400 million record table and use GROUP BY to minimize the rows returned.

GROUP BY turns a result like this:
Driver A 1/09/2017 6:45:12 - 6:53:33 1.4 miles
Driver A 1/09/2017 7:00:03 - 7:14:56 2.3 miles
Driver A 1/09/2017 7:20:03 - 7:22:01 0.3 miles
... (400000000 rows)

into something more manageable like this:
Driver A 1/09/2017 22 miles
Driver B 1/09/2017 38 miles
... (xxxxxx rows)

or:
All drivers 1/09/2017 687 miles
All drivers 1/08/2017 722 miles
... (xxxx rows)

or even:
All drivers All dates 57892346 miles
(1 row)
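
In SQL terms that's just a GROUP BY, roughly (table and column names made up):

-- one row per driver per day instead of one row per trip
SELECT driver_id,
       TRUNC(trip_start) AS trip_date,
       SUM(trip_miles)   AS total_miles
FROM   trips
GROUP  BY driver_id, TRUNC(trip_start);

-- or one row per day across all drivers
SELECT TRUNC(trip_start) AS trip_date,
       SUM(trip_miles)   AS total_miles
FROM   trips
GROUP  BY TRUNC(trip_start);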
Posted by BoogaBear
Member since Jul 2013
6457 posts
Posted on 1/10/17 at 11:37 am to
quote:

full enterprise-level


Highly unlikely he has the hardware to support it.

quote:

Since it has an SSD, my tablet probably runs queries faster than most major corporations

Absolutely not; we have a $10MM Exadata appliance for our Oracle databases.
We also have a Vertica machine with over 100 cores.

It sounds like all of your data is being dumped into tables and it's not very usable. Are the tables indexed at all?

A GROUP BY on 400MM records is going to be extremely heavy. Sounds like you need some ETL processes over there.
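
Even something simple would help; a rough sketch with made-up table/column names (assuming trip_start and trip_end are DATE columns):

-- index the columns the reports filter on
CREATE INDEX ix_trips_vehicle_start ON trips (vehicle_id, trip_start);

-- pre-aggregate nightly so reports never hit the 400mm-row table directly
CREATE MATERIALIZED VIEW mv_daily_route_times
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
AS
SELECT route_id,
       TRUNC(trip_start) AS trip_date,
       CASE WHEN TO_NUMBER(TO_CHAR(trip_start, 'HH24')) < 12
            THEN 'AM' ELSE 'PM' END AS day_half,
       AVG((trip_end - trip_start) * 24 * 60) AS avg_trip_minutes,
       COUNT(*) AS trip_count
FROM   trips
GROUP  BY route_id,
          TRUNC(trip_start),
          CASE WHEN TO_NUMBER(TO_CHAR(trip_start, 'HH24')) < 12
               THEN 'AM' ELSE 'PM' END;

Refresh that nightly, and the "October vs. December, AM vs. PM" kind of question becomes a query over a few thousand rows instead of 400 million.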
This post was edited on 1/10/17 at 11:39 am
Posted by bigblake
Member since Jun 2011
2536 posts
Posted on 1/10/17 at 7:46 pm to
(no message)
This post was edited on 3/28/17 at 5:58 am
Posted by philabuck
NE Ohio
Member since Sep 2008
10389 posts
Posted on 1/10/17 at 7:55 pm to
I've never used it, but BigQuery may work (Google Cloud Console). You should be able to get a $300 credit to toy around with.

LINK
Posted by Hogwall Jackson
Member since Feb 2013
5201 posts
Posted on 1/10/17 at 8:45 pm to
What company, if you don't mind? I'm in transportation as well. I can't imagine doing that much data! That is nuts. I've gotten into dedicated, so I just worry about the same 40 guys every day, and those are my guys. It's easy to see arrival/departure times each day versus planned times, and we can see where we are losing time. But then again, it's only 40 drivers :)