
Sr. Data Engineer Resume


Dallas, TX

PROFESSIONAL SUMMARY:

  • Over 10 years of experience in Machine Learning, Statistical Modeling, Predictive Modeling, Data Analytics, Data Modeling, Data Architecture, Data Analysis, Data Mining, Text Mining and Natural Language Processing (NLP).
  • Experience with Artificial Intelligence algorithms, Business Intelligence, and analytics models (such as Decision Trees and Linear & Logistic Regression) using Hadoop (Hive, Pig), R, Python, Spark, Scala, MS Excel, SQL, PostgreSQL, and Erwin.
  • Extensive experience with business intelligence (BI) technologies and tools such as OLAP, data warehousing, reporting and querying tools, data mining, and spreadsheets.
  • Hands-on experience with R packages and libraries such as ggplot2, Shiny, h2o, dplyr, reshape2, plotly, R Markdown, ElemStatLearn, and caTools.
  • Installed and configured additional services on AWS EC2, RDS, S3, and other AWS service instances.
  • Expertise in working with MongoDB and Apache Cassandra.
  • Worked with various Python modules such as requests, boto, flake8, Flask, mock, and nose.
  • Excellent working knowledge of Big Data Hadoop (Hortonworks), HDFS architecture, R, Python, Jupyter, Pandas, NumPy, scikit-learn, Matplotlib, PyHive, Keras, Hive, NoSQL (HBase), Sqoop, Pig, MapReduce, Oozie, and Spark MLlib.
  • Efficient in developing logical and physical data models and organizing data per business requirements using Sybase PowerDesigner, Erwin, and ER Studio in both OLTP and OLAP applications.
  • Strong understanding of when to use an ODS, a data mart, or a data warehouse.
  • Experienced in employing R, MATLAB, SAS, Tableau, and SQL for data cleaning, data visualization, risk analysis, and predictive analytics.
  • Adept at using the SAS Enterprise suite, R, Python, and Big Data technologies including Hadoop, Hive, Pig, Sqoop, Cassandra, Oozie, Flume, MapReduce, and Cloudera Manager for the design of business intelligence applications.
  • Hands-on experience in Linear and Logistic Regression, K-Means cluster analysis, Decision Trees, KNN, SVM, Random Forest, Market Basket Analysis, NLTK/Naïve Bayes, Sentiment Analysis, Text Mining/Text Analytics, and Time Series Forecasting (a short illustrative sketch follows this list).
  • Experienced with databases including Oracle, XML, DB2, Teradata 15/14, Netezza, SQL Server, Big Data, and NoSQL.
  • Expertise in Machine Learning with Spark MLlib using Python.
  • Familiar with data architecture including data ingestion pipeline design, Hadoop information architecture, data modeling and data mining, machine learning and advanced data processing. Experience optimizing ETL workflows.
  • Worked with engineering teams to integrate algorithms and data into Return Path solutions.
  • Worked closely with other data scientists to create data driven products.
  • Extensive experience in development of T-SQL, DTS, OLAP, PL/SQL, Stored Procedures, Triggers, Functions, Packages, performance tuning and optimization for business logic implementation.
  • Experienced using query tools such as SQL Developer, PL/SQL Developer, and Teradata SQL Assistant.
  • Excellent at performing data transfer between SAS and various databases and data file formats such as XLS, CSV, DBF, and MDB.
  • Proficient in Hadoop, HDFS, Hive, MapReduce, Pig, and NoSQL databases such as MongoDB, HBase, and Cassandra; expertise in applying data mining and optimization techniques in B2B and B2C industries; proficient in Machine Learning, Data/Text Mining, Statistical Analysis, and Predictive Modeling.
  • Experienced in Data Modeling and Data Analysis using Dimensional and Relational Data Modeling, Star Schema/Snowflake Modeling, Fact & Dimension tables, and Physical & Logical Data Modeling.
  • Expert skills in SAS DATA step, PROC step, SQL, ETL, data mining, SAS Macro, SAS/ACCESS, SAS/STAT, SAS/GRAPH, SAS DI Studio, SAS BI Platform, SAS Web Report Studio, SAS BI Dashboard, SAS Stored Process, SAS Management Console, Enterprise Guide, Enterprise Miner, SAS VA, and ODS, with procedures such as PROC SQL, import, export, means, summary, freq, tabulate, report, univariate, append, print, sort, transpose, format, glm, corr, factor, t-test, chi-square, ANOVA, ARIMA, ARMA, rank, reg, logistic, and boxplot.
  • Strong analytical and problem-solving skills, along with the ability to understand current business processes and implement efficient solutions.
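
A minimal illustrative sketch of the kind of supervised classification work listed above, using scikit-learn logistic regression; the input file and column names (churn.csv, "churned") are hypothetical placeholders, not taken from any specific project:

    # Illustrative only: train and evaluate a logistic regression classifier.
    # The CSV path and target column are hypothetical placeholders.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report

    df = pd.read_csv("churn.csv")                  # hypothetical input extract
    X = df.drop(columns=["churned"])               # feature columns
    y = df["churned"]                              # binary target label

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))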

TECHNICAL SKILLS:

Machine Learning Algorithms: Neural Networks, Decision Trees, Support Vector Machines, Random Forest, Convolutional Neural Networks, Logistic Regression, PCA, K-means, KNN.

Database Design Tools and Data Modeling: MS Visio, Erwin 4.5/4.0, Star Schema/Snowflake Schema modeling, Fact & Dimension tables, physical & logical data modeling, Normalization and De-normalization techniques, Kimball & Inmon Methodologies

Big Data/Hadoop Technologies: Hadoop, HDFS, Map Reduce, Hive, Pig, YARN, Impala, Sqoop, Flume, Spark, Kafka, Storm, Drill, Zookeeper and Oozie

Languages: HTML5, DHTML, CSS3, C, C++, XML, WSDL, R/R Studio, SAS Enterprise Guide, SAS, R (caret, Weka, ggplot), Perl, MATLAB, Mathematica, FORTRAN, DTD, Schemas, JSON, Ajax, Java, Scala, Python (NumPy, SciPy, Pandas, Gensim, Keras), JavaScript, Shell Scripting

Databases: Microsoft SQL Server 2008/2010/2012, MySQL 4.x/5.x, Oracle 11g/12c, DB2, Netezza, Teradata. NoSQL Databases: Cassandra, HBase, MongoDB, MariaDB

Business Intelligence Tools: Tableau Server, Tableau Reader, Tableau, Splunk, SAP Business Objects, OBIEE, SAP Business Intelligence, QlikView, Amazon Redshift, Azure Data Warehouse

Development Tools: Microsoft SQL Studio, IntelliJ, Eclipse, NetBeans.

Operating Systems: All versions of Windows, UNIX, Linux, macOS, Sun Solaris

PROFESSIONAL EXPERIENCE:

Confidential, Dallas, TX

Sr. Data Engineer

Responsibilities:

  • Worked with several R packages including knitr, dplyr, SparkR, CausalInfer, and SpaceTime.
  • Coded R functions to interface with Caffe Deep Learning Framework.
  • Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python for developing various machine learning algorithms.
  • Installed and used Caffe Deep Learning Framework.
  • Worked with different data formats such as JSON and XML and applied machine learning algorithms in Python.
  • Hands-on with Git/GitHub for code check-ins/check-outs and branching.
  • Performed data manipulation and aggregation from different sources using Nexus, Business Objects, Toad, Power BI, and Smart View.
  • Implemented Agile Methodology for building an internal application.
  • Focused on integration overlap and Informatica's newer commitment to MDM following its acquisition of Identity Systems.
  • Set up storage and data analysis tools in the Amazon Web Services (AWS) cloud computing infrastructure.
  • Experienced in installation, configuration, troubleshooting, and performance tuning of WebLogic, Apache, IIS, and Tomcat.
  • Implemented end-to-end systems for Data Analytics and Data Automation, integrated with custom visualization tools, using R, Mahout, Hadoop, and MongoDB.
  • Worked with Data Architects and IT Architects to understand the movement and storage of data, using ER Studio 9.7.
  • Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, and Python with a broad variety of machine learning methods including classification, regression, and dimensionality reduction; utilized the engine to increase user lifetime by 45% and triple user conversions for Confidential categories.
  • Used Spark DataFrames, Spark SQL, and Spark MLlib extensively, designing and developing POCs using Scala, Spark SQL, and MLlib (a short PySpark sketch follows this list).
  • Expertise in Spark Streaming (Lambda Architecture), Spark SQL, and tuning and debugging Spark clusters on Mesos.
  • Used Data Quality Validation techniques to validate Critical Data Elements (CDE) and identified various anomalies.
  • Extensively worked with Erwin Data Modeler to design data models.
  • Developed various QlikView data models by extracting and using data from various source files, DB2, Excel, flat files, and Big Data sources.
  • Participated in all phases of data mining, data collection, data cleaning, model development, validation, and visualization, and performed gap analysis.
  • Good knowledge of Hadoop architecture and various components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, Secondary NameNode, and MapReduce concepts.
  • As an architect, delivered various complex OLAP databases/cubes, scorecards, dashboards, and reports.
  • Programmed a utility in Python using multiple packages (SciPy, NumPy, Pandas).
  • Implemented Classification using supervised algorithms like Logistic Regression, Decision trees, KNN, Naive Bayes.
  • Designed both 3NF data models for ODS, OLTP systems and Dimensional Data Models using Star and Snowflake Schemas.
  • Updated Python scripts to match training data with our database stored in AWS Cloud Search, so that we would be able to assign each document a response label for further classification.
  • Created SQL tables with referential integrity and developed queries using SQL, SQL Plus, and PL/SQL.
  • Designed and developed Use Case, Activity Diagrams, Sequence Diagrams, OOD (Object-oriented Design) using UML and Visio.
  • Interacted with Business Analysts, SMEs, and other Data Architects to understand business needs and functionality for various project solutions.
  • Created, managed, and utilized policies for S3 buckets and Glacier for storage and backup on AWS.
  • Wrote ETL jobs to read from web APIs using REST and HTTP calls and loaded into HDFS using Java and Talend.
  • Identified and executed process improvements, hands-on in various technologies such as Oracle, Informatica, and Business Objects.
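
A minimal PySpark sketch of the Spark DataFrame / Spark SQL / MLlib work referenced above; the S3 path, table, and column names are hypothetical placeholders:

    # Illustrative only: DataFrame + Spark SQL + MLlib pipeline sketch.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("mllib-poc").getOrCreate()

    # Load a hypothetical labelled dataset and expose it to Spark SQL.
    df = spark.read.parquet("s3://example-bucket/training.parquet")
    df.createOrReplaceTempView("training")
    sample = spark.sql("SELECT age, income, label FROM training WHERE label IS NOT NULL")

    # Assemble numeric columns into the feature vector MLlib expects, then fit a model.
    assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
    train = assembler.transform(sample)
    model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
    model.transform(train).select("label", "prediction").show(5)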

Environment: AWS, R, Informatica, Python, HDFS, ODS, OLTP, Oracle 10g, Hive, OLAP, DB2, Metadata, MS Excel, Mainframes, MS Visio, MapReduce, Rational Rose, SQL, and MongoDB.

Confidential, Phoenix, AZ

Sr. Data Engineer

Responsibilities:

  • Provided the architectural leadership in shaping strategic, business technology projects, with an emphasis on application architecture.
  • Utilized domain knowledge and application portfolio knowledge to play a key role in defining the future state of large, business technology programs.
  • Participated in all phases of data mining, data collection, data cleaning, developing models, validation, and visualization and performed Gap analysis.
  • Developed Map Reduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
  • Migrated the application onto the AWS Cloud.
  • Created ecosystem models (e.g. conceptual, logical, physical, canonical) that are required for supporting services within the enterprise data architecture (conceptual data model for defining the major subject areas used, ecosystem logical model for defining standard business meaning for entities and fields, and an ecosystem canonical model for defining the standard messages and formats to be used in data integration services throughout the ecosystem).
  • Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python for developing various machine learning algorithms, and applied algorithms such as linear regression, multivariate regression, naive Bayes, Random Forests, K-means, and KNN for data analysis.
  • Conducted studies and rapid plots, using advanced data mining and statistical modeling techniques to build a solution that optimizes the quality and performance of data.
  • Demonstrated experience in the design and implementation of statistical models, predictive models, enterprise data models, metadata solutions, and data lifecycle management in both RDBMS and Big Data environments.
  • Designed multiple Python packages that were used within a large ETL process used to load 2TB of data from an existing Oracle database into a new PostgreSQL cluster.
  • Analyzed large data sets, applied machine learning techniques, and developed predictive and statistical models, enhancing them by leveraging best-in-class modeling techniques.
  • Worked on database design, relational integrity constraints, OLAP, OLTP, Cubes and Normalization (3NF) and De-normalization of the database.
  • Leveraged ETL methods for ETL solutions and data warehouse tools for reporting and analysis.
  • Used CSVExcelStorage to parse files with different delimiters in Pig.
  • Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark using Scala.
  • Developed multiple MapReduce jobs in Java to clean datasets.
  • Developed code to write canonical model JSON records from numerous input sources to Kafka queues.
  • Worked on customer segmentation using an unsupervised learning technique, clustering (a short illustrative sketch follows this list).
  • Worked with various Teradata 15 tools and utilities such as Teradata Viewpoint, MultiLoad, ARC, Teradata Administrator, BTEQ, and other Teradata utilities.
  • Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib, and Python with a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
  • Developed Linux Shell scripts by using NZSQL/NZLOAD utilities to load data from flat files to the Netezza database.
  • Designed and implemented the system architecture for an Amazon EC2-based cloud-hosted solution for the client.
  • Tested Complex ETL Mappings and Sessions based on business user requirements and business rules to load data from source flat files and RDBMS tables to Confidential tables.
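
A minimal illustrative sketch of the customer-segmentation clustering mentioned above, using scikit-learn K-means as one common choice of algorithm; the input file and feature columns are hypothetical placeholders:

    # Illustrative only: segment customers with K-means clustering.
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    customers = pd.read_csv("customers.csv")                    # hypothetical extract
    features = customers[["annual_spend", "visits", "tenure"]]  # hypothetical features

    scaled = StandardScaler().fit_transform(features)           # scale before clustering
    kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
    customers["segment"] = kmeans.fit_predict(scaled)

    # Profile the resulting segments.
    print(customers.groupby("segment")[["annual_spend", "visits"]].mean())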

Environment: Erwin r9.6, Python, SQL, Oracle 12c, Netezza, SQL Server, Informatica, Java, SSRS, PL/SQL, T-SQL, Tableau, MLlib, regression, cluster analysis, Scala, NLP, Spark, Kafka, MongoDB, logistic regression, Hadoop, Hive, Teradata, random forest, OLAP, Azure, MariaDB, SAP CRM, HDFS, ODS, NLTK, SVM, JSON, XML, Cassandra, MapReduce, AWS.

Confidential, Atlanta, GA

Sr. Data Engineer

RESPONSIBILITIES:

  • Supported MapReduce Programs running on the cluster.
  • Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
  • Configured the Hadoop cluster with the NameNode and slave nodes and formatted HDFS.
  • Used the Oozie workflow engine to run multiple Hive and Pig jobs.
  • Ran MapReduce programs on the cluster.
  • Developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
  • Analyzed the partitioned and bucketed data and computed various metrics for reporting.
  • Involved in loading data from RDBMS and web logs into HDFS using Sqoop and Flume.
  • Worked on loading the data from MySQL to HBase where necessary using Sqoop.
  • Developed Hive queries for analysis across different banners (a short PyHive sketch follows this list).
  • Extracted data from Twitter using Java and the Twitter API; parsed JSON-formatted Twitter data and uploaded it to the database.
  • Launched Amazon EC2 cloud instances using Amazon Machine Images (Linux/Ubuntu) and configured launched instances for specific applications.
  • Exported the result set from Hive to MySQL using Sqoop after processing the data.
  • Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
  • Hands-on experience working with SequenceFile, Avro, and HAR file formats and compression.
  • Used Hive to partition and bucket data.
  • Fetched live stream data from DB2 to an HBase table using Spark Streaming and Apache Kafka.
  • Implemented Apache Pig scripts to load data from and store data into Hive.
  • Wrote MapReduce programs with the Java API to cleanse structured and unstructured data.
  • Wrote Pig Scripts to perform ETL procedures on the data in HDFS.
  • Created HBase tables to store various data formats of data coming from different portfolios.
  • Worked on improving performance of existing Pig and Hive Queries.
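
A minimal sketch of running one of the Hive analysis queries noted above from Python via PyHive; the host, database, table, and column names are hypothetical placeholders:

    # Illustrative only: query a (hypothetical) partitioned Hive table with PyHive.
    from pyhive import hive

    conn = hive.Connection(host="hive-server.example.com", port=10000, database="retail")
    cursor = conn.cursor()

    # Aggregate customer behaviour for one (hypothetical) date partition, per banner.
    cursor.execute("""
        SELECT banner, COUNT(DISTINCT customer_id) AS customers, SUM(amount) AS revenue
        FROM transactions
        WHERE dt = '2016-01-01'
        GROUP BY banner
    """)

    for banner, customers, revenue in cursor.fetchall():
        print(banner, customers, revenue)

    conn.close()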

Environment: SQL Server, Oracle 9i, MS Office, Apache, Teradata, Informatica, ER Studio, XML, Business Objects.

Confidential, Minneapolis, MN

Data Analyst/Modeler

RESPONSIBILITIES:

  • Participated in JAD sessions, gathered information from Business Analysts, end users and other stakeholders to determine the requirements.
  • Developed logical and physical data models that capture current-state and future-state data elements and data flows using ER Studio.
  • Applied data warehousing methodologies and dimensional data modeling techniques such as Star/Snowflake schemas using Erwin 9.1.
  • Extensively used Aginity Netezza Workbench to perform various DDL and DML operations on the Netezza database.
  • Designed the Data Warehouse and MDM hub conceptual, logical, and physical data models.
  • Performed daily monitoring of Oracle instances using Oracle Enterprise Manager, ADDM, and TOAD; monitored users, tablespaces, memory structures, rollback segments, logs, and alerts.
  • Involved in Teradata SQL development, unit testing, and performance tuning, ensuring testing issues were resolved using defect reports.
  • Implemented Apache Pig scripts to load data from and store data into Hive.
  • Customized reports using the SAS macro facility, PROC REPORT, and PROC TABULATE.
  • Translated business and data requirements into logical data models in support of Enterprise Data Models, ODS, OLAP, OLTP, operational data structures, and analytical systems.
  • Worked on database testing; wrote complex SQL queries to verify transactions and business logic, such as identifying duplicate rows, using SQL Developer and PL/SQL Developer.
  • Used Teradata SQL Assistant, Teradata Administrator, PMON and data load/export utilities like BTEQ, Fast Load, Multi Load, Fast Export and TPump on UNIX/Windows environments and running the batch process for Teradata.
  • Worked on data profiling and data validation to ensure the accuracy of data between the warehouse and source systems (a short illustrative sketch follows this list).
  • Hands-on with data warehouse concepts such as data warehouse architecture, Star schema, Snowflake schema, data marts, and dimension and fact tables.
  • Developed SQL Queries to fetch complex data from different tables in remote databases using joins, database links and Bulk collects.
  • Migrated databases from legacy systems and SQL Server to Oracle and Netezza.
  • Reviewed the logical model with application developers, ETL Team, DBAs and testing team to provide information about the data model and business requirements.
  • Fetched live stream data from DB2 to an HBase table using Spark Streaming and Apache Kafka.
  • Worked on SQL Server components SSIS (SQL Server Integration Services), SSAS (Analysis Services), and SSRS (Reporting Services).
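
A minimal illustrative sketch of the warehouse-versus-source validation mentioned above, using pandas and SQLAlchemy; the connection URLs, table names, and key column are hypothetical placeholders, and the appropriate database drivers are assumed to be installed:

    # Illustrative only: reconcile a source table against its warehouse copy.
    import pandas as pd
    from sqlalchemy import create_engine

    source = create_engine("oracle://user:password@source-host/orders")     # hypothetical
    warehouse = create_engine("netezza://user:password@dw-host/orders_dw")  # hypothetical

    src = pd.read_sql("SELECT order_id, amount FROM orders", source)
    dwh = pd.read_sql("SELECT order_id, amount FROM dw_orders", warehouse)

    # Row-count reconciliation plus value-level comparison on the business key.
    print("source rows:", len(src), "warehouse rows:", len(dwh))
    merged = src.merge(dwh, on="order_id", how="outer",
                       suffixes=("_src", "_dwh"), indicator=True)
    print(merged["_merge"].value_counts())        # rows missing from either side
    mismatched = merged.query("_merge == 'both' and amount_src != amount_dwh")
    print("amount mismatches:", len(mismatched))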

Environment: ER Studio, Teradata 13.1, SQL, PL/SQL, BTEQ, DB2, Oracle, Apache, MDM, Netezza, ETL, RTF, UNIX, SQL Server 2010, Informatica, SSRS, SSIS, SSAS, SAS, Aginity.

Confidential

Data Analyst

RESPONSIBILITIES:

  • Processed data received from vendors and loaded it into the database. The process was carried out on a weekly basis, reports were delivered bi-weekly, and the extracted data was checked for integrity.
  • Documented requirements and obtained signoffs.
  • Coordinated between the Business users and development team in resolving issues.
  • Documented data cleansing and data profiling.
  • Wrote SQL scripts to meet business requirements.
  • Analyzed views and produced reports.
  • Tested cleansed data for integrity and uniqueness (a short illustrative sketch follows this list).
  • Automated the existing system to achieve faster and more accurate data loading.
  • Generated weekly and bi-weekly reports for the client business team using Business Objects and documented them.
  • Learned to create Business Process Models.
  • Managed multiple projects simultaneously, tracking them toward varying timelines through a combination of business and technical skills.
  • Good understanding of clinical practice management, medical and laboratory billing, and insurance claim processing, with process flow diagrams.
  • Assisted the QA team in creating test scenarios that cover a day in the life of the patient for Inpatient and Ambulatory workflows.
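
A minimal illustrative sketch of the integrity and uniqueness checks noted above, using pandas; the file name, key column, and required columns are hypothetical placeholders:

    # Illustrative only: basic integrity and uniqueness checks on a cleansed extract.
    import pandas as pd

    df = pd.read_csv("vendor_extract_cleansed.csv")        # hypothetical weekly file

    required = ["member_id", "claim_date", "amount"]
    nulls_per_column = df[required].isnull().sum()         # nulls in required fields
    duplicate_keys = df["member_id"].duplicated().sum()    # uniqueness of the key
    negative_amounts = (df["amount"] < 0).sum()            # simple range/integrity rule

    print("nulls per required column:\n", nulls_per_column)
    print("duplicate member_id rows:", duplicate_keys)
    print("negative amounts:", negative_amounts)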

Environment: SQL, data profiling, data loading, QA team.

