What should I use to play with over 400 million rows of data?
Posted on 1/9/17 at 2:32 am
I love Excel, and I want something that has the ease of use and the capability to manipulate rows the way Excel does, but with the capacity to handle the data load.
Data Type: Telemetry data (i.e. vehicle placement, credentials, load, etc.). A geo-positioning mapping add-in/feature would be sweet.
Data Issues: Need to purge duplicate or erroneous entries, i.e. where column a&b = c&d and column x falls in between y&z. Cross-reference column m against table "DRIVERLIST" for status, column n against table "fuel type".
Data Use: Generate reports based on the overall industry, then individual carriers, individual routes, drivers, route heat maps, etc.
Data is currently in Oracle SQL Developer.
Should I learn Python?
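A rough sketch of what the cross-reference checks could look like in SQL, assuming hypothetical table and column names (TELEMETRY, DRIVER_ID, and FUEL_CODE are stand-ins for whatever the real schema uses):

-- Rows whose driver has no matching entry in DRIVERLIST
SELECT t.*
  FROM telemetry t
  LEFT JOIN driverlist d ON t.driver_id = d.driver_id
 WHERE d.driver_id IS NULL;

-- Same idea for fuel type
SELECT t.*
  FROM telemetry t
  LEFT JOIN fuel_type f ON t.fuel_code = f.fuel_code
 WHERE f.fuel_code IS NULL;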
Posted on 1/9/17 at 7:47 am to Kujo
Wish I had experience with anything to give you advice, but I don't. I have never used Python.
I'm assuming the raw data will be stored in a SQL-type database and you're looking for an interface to retrieve and manipulate the data into useful chunks for display/reporting/decision-making purposes.
I'd be curious to know what solution you come up with. If you don't mind, keep us posted, particularly if you find your solution elsewhere.
Posted on 1/9/17 at 7:52 am to Kujo
Oracle SQL is easy enough to pick up the basics of in a few hours. It is much easier than Excel or Python. Plus, you already have SQL Developer.
quote:
column a&b = c&d and column x falls in between y&z
This is almost SQL.
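Spelled out with the OP's placeholder column names, the purge rule really is almost a one-liner; a sketch, assuming the rows live in a stand-in table called TELEMETRY:

DELETE FROM telemetry
 WHERE a = c
   AND b = d
   AND x BETWEEN y AND z;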

Posted on 1/9/17 at 7:58 am to Kujo
SabiDojo could probably code something for you
Posted on 1/9/17 at 8:01 am to Kujo
PL/SQL is pretty straightforward, and you could run through some coding classes online and knock out what you need.
If you're more visual, check out Pentaho. There is a community version that you can use for free.
Pentaho
ETA: Your data isn't in Oracle SQL Developer. Your data is likely in an Oracle database. SQL Developer is simply a query tool.
ETA2: Pentaho has add-ins for address data (Melissa Data) and geo mapping.
This post was edited on 1/9/17 at 8:31 am
Posted on 1/9/17 at 8:24 am to Kujo
This post tells me I am very glad to not be in logistics.
Posted on 1/9/17 at 8:25 am to Kujo
R, SAS (if you know how to code), pandas for Python
quote:
Data Use: Generate reports based on the overall industry, then individual carriers, individual routes, drivers, route heat maps, etc.
ggplot2 in R, Tableau, pandas
Honestly, I would use R or pandas if you want a secondary program to help with analysis and data visualization. SQL can be used for both as well.
This post was edited on 1/9/17 at 8:43 am
Posted on 1/9/17 at 11:22 am to Kujo
If you already have it in a relational database (Oracle), you can just use PL/SQL to manipulate/clean the data and query it for reports. Do you have access to the Oracle server? Do you need nicely formatted reports, or is tabular data fine?
Posted on 1/9/17 at 11:46 am to Scream4LSU
quote:
Do you need nicely formatted reports, or is tabular data fine?
My take based on his questions is that he's used to exporting data from the database and playing around with it in Excel. Now the data has gotten too large for that to be a practical solution. If that's the case, I'm thinking Tableau might be closer to what he's looking for.
Posted on 1/9/17 at 2:48 pm to Kujo
Alteryx. You will thank me. It should be more than capable of meeting your needs, and it has spatial mapping capabilities.
It is object-oriented, built around workflows, and you do not have to learn a coding language.
Posted on 1/9/17 at 2:51 pm to TigerinATL
Tableau is amazing for visualizations but there are much better tools for data cleansing.
Alteryx integrates with Tableau too.
I'm not associated with the company but I do use the program from time to time and have been extremely impressed with its ease of use and speed of analysis.
Posted on 1/9/17 at 5:41 pm to tom
quote:
Oracle SQL is easy enough to pick up the basics of in a few hours. It is much easier than Excel or Python. Plus, you already have SQL Developer.
Oracle is a very good answer for this, and the great news is that you can download it for free. You can even get the full enterprise-level version if you promise it's just for personal use. Then you can use SQL Developer to query your data.
If you've never installed Oracle, there are several tutorials on YouTube that will help.
I actually have this installed on my Surface Pro 3. Since it has an SSD, my tablet probably runs queries faster than most major corporations. Of course, a Surface probably won't be able to store 400 million rows, but you could put the data on a larger SSD in a desktop and get very good results.
Posted on 1/9/17 at 11:57 pm to TigerinATL
quote:
My take based on his questions is that he's used to exporting data from the database and playing around with it in Excel.
Exactly. I can visually see the errors, i.e. one vehicle kept reporting a single trip multiple times, another is breaking a single trip into multiple stops and simultaneous starts. Unless you "see" that trend, you don't know it's there... I'm guessing unless you write some type of check code: if the duration between x and x-1 is less than 1 minute, flag an error?
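One way to write that check: a sketch using Oracle's LAG() window function, assuming a stand-in TELEMETRY table with VEHICLE_ID and TRIP_START (TIMESTAMP) columns:

-- Flag trips that start less than 1 minute after the same
-- vehicle's previous trip; likely duplicates or split trips
SELECT vehicle_id,
       trip_start,
       CASE
         WHEN trip_start - LAG(trip_start)
                OVER (PARTITION BY vehicle_id ORDER BY trip_start)
              < INTERVAL '1' MINUTE
         THEN 'POSSIBLE DUPLICATE'
       END AS flag
  FROM telemetry;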
This is a new position for them (I am the only one they have) which involves a lot of easy performance reports (no big deal) from their proprietary system. But now they want me to start getting deeper into the data, because they have this GPS tracking monitor on all vehicles that they want to incorporate into internal trip/route data.
I can play with one vehicle's data for a day and give a report about that data, or use it in an investigation into a driver or something "pinpoint"... but when I look at an entire day of all vehicles, I crash before I can begin.
We have over 2,000 vehicles on the road at a time, running even more drivers in different jurisdictions... and I have to be able to:
ensure data integrity
have the quarterly report in such a manner that any question can be answered on the fly (like a pivot table)
"So Kujo, what's the difference in westward-bound route times in October versus December?"
October is 5% faster.
"Really? Where is more time saved, before noon or afternoon?"
Uh, frick, Imma have to reload all the fraking data again and then try to write some time-split code into it, and hopefully by the end of the day I can get that ONE question answered... with 80... 60% confidence I did it right.
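For what it's worth, once the data is queryable, that follow-up question is a single GROUP BY; a sketch, again with made-up names (TELEMETRY, DIRECTION, TRIP_START, TRIP_MINUTES):

SELECT TO_CHAR(trip_start, 'MM') AS month,
       CASE WHEN TO_CHAR(trip_start, 'HH24') < '12'
            THEN 'before noon' ELSE 'afternoon' END AS period,
       AVG(trip_minutes) AS avg_route_time
  FROM telemetry
 WHERE direction = 'W'
   AND TO_CHAR(trip_start, 'MM') IN ('10', '12')
 GROUP BY TO_CHAR(trip_start, 'MM'),
          CASE WHEN TO_CHAR(trip_start, 'HH24') < '12'
               THEN 'before noon' ELSE 'afternoon' END;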
Posted on 1/9/17 at 11:59 pm to foshizzle
quote:
Oracle is a very good answer
It takes forever to run a query covering more than one day, and when I tried to export a quarter's worth of data to a CSV file, it took 3 days to save.
I am not a fan.
Posted on 1/10/17 at 8:45 am to Kujo
Any 3rd-party solution you use is going to either run off of Oracle or export the data to some Oracle equivalent and then use that, so it will take as long as or longer than doing it correctly in Oracle.
400 million records is a lot. How far back does that data go? Is all of that time important for your reporting purposes? If not, you may consider automatically copying/summarizing small portions of that data into your own separate table (with indexes appropriate for your queries) on a nightly basis.
Your strategy in querying should be to minimize/eliminate your calls to the 400 million record table and use GROUP BY to minimize the rows returned.
GROUP BY turns a result like this:
Driver A 1/09/2017 6:45:12 - 6:53:33 1.4 miles
Driver A 1/09/2017 7:00:03 - 7:14:56 2.3 miles
Driver A 1/09/2017 7:20:03 - 7:22:01 0.3 miles
... (400000000 rows)
into something more manageable like this:
Driver A 1/09/2017 22 miles
Driver B 1/09/2017 38 miles
... (xxxxxx rows)
or:
All drivers 1/09/2017 687 miles
All drivers 1/08/2017 722 miles
... (xxxx rows)
or even:
All drivers All dates 57892346 miles
(1 row)
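The query behind that kind of rollup is short. A sketch, assuming a stand-in TELEMETRY table with DRIVER_ID, TRIP_DATE, and MILES columns:

-- Per-driver daily mileage (the middle example above)
SELECT driver_id,
       trip_date,
       SUM(miles) AS total_miles
  FROM telemetry
 GROUP BY driver_id, trip_date;

-- Drop driver_id from both lists for the per-day rollup, or
-- drop both grouping columns for the single-row grand total.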
Posted on 1/10/17 at 11:37 am to foshizzle
quote:
full enterprise-level
Highly unlikely he has the hardware to support it.
quote:
Since it has an SSD, my tablet probably runs queries faster than most major corporations
Absolutely not. We have a $10MM Exadata appliance for our Oracle databases.
We also have a Vertica machine with over 100 cores.
It sounds like all of your data is being dumped into tables and it's not very usable. Are the tables indexed at all?
A GROUP BY on 400mm records is going to be extremely heavy. Sounds like you need some ETL processes over there.
This post was edited on 1/10/17 at 11:39 am
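A minimal version of that ETL idea, assuming the same stand-in TELEMETRY table: roll the raw pings up into a small, indexed summary table on a nightly basis and point all reporting at it instead of the 400mm-row source.

-- Nightly rollup: one row per driver per day instead of one
-- row per GPS ping
CREATE TABLE daily_driver_summary AS
SELECT driver_id,
       TRUNC(trip_start) AS trip_date,
       COUNT(*)          AS ping_count,
       SUM(miles)        AS total_miles
  FROM telemetry
 GROUP BY driver_id, TRUNC(trip_start);

CREATE INDEX daily_summary_ix
    ON daily_driver_summary (trip_date, driver_id);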
Posted on 1/10/17 at 8:45 pm to Kujo
What company, if you don't mind? I am in transportation as well. I can't imagine doing that much data! That is nuts. I've gotten into dedicated, so I just worry about the same 40 guys every day, and those are my guys. It's easy to see arrival/departure times each day versus planned times, and we can see where we are losing time. But then again, it's only 40 drivers :)