Data Engineer/Big Data Developer Resume
Plano, Texas
PROFESSIONAL SUMMARY:
- Over 7 years of IT development experience, including Data Warehousing and Data Analysis, and more than 4 years of experience in the Big Data ecosystem using the Hadoop framework and related technologies such as HDFS, MapReduce, Hive, Pig, Spark, HBase, Flume, Oozie, Sqoop, Impala, Kafka, Splunk, and Zookeeper.
- Excellent knowledge of distributed storage (HDFS) and distributed processing (MapReduce, YARN) for real-time streaming and batch processing.
- Experience in developing Python and Spark DataFrame code to perform data transformations and build tables in Hive (a sketch follows this summary).
- Interpreted problems and identified areas of improvement using data analysis, data mining, statistics and machine learning techniques.
- Wrote MapReduce programs in Java and Python and extended Hive functionality with custom UDFs.
- Experience in extracting data from RDBMS into HDFS using Sqoop.
- Worked on collecting real-time streaming data and log data from log collectors into HDFS using Flume and Kafka.
- Experience with NoSQL databases such as HBase for key-based low latency queries.
- Experience in analyzing data in HDFS through Impala, Hive and Spark.
- Experience in workflow scheduling and monitoring tools like Control-M and Oozie.
- Hands-on experience in performing ETL operations in the Ab Initio environment and creating dashboard reports using Tableau.
- Hands-on experience in Python, Java, Multi-threaded processing, SQL and PL/SQL.
- Hands-on experience in loading and processing unstructured data (log files, XML data) into HDFS using Python and Flume.
- Experience in performance tuning using partitioning, bucketing, and indexing in Hive.
- Hands-on experience testing Hadoop jobs with the MRUnit framework.
- Flexible with Unix/Linux environments, working with operating systems such as CentOS, Red Hat, and Ubuntu.
- In-depth knowledge of database connectivity for databases like Oracle 12c/11g/10g, MS SQL Server 2005/2008, MS Access, DB2 and Teradata.
- Experience in writing ad-hoc MapReduce jobs using Pig Latin.
- Experience in performing Spark RDD transformations to process data using Data Frames and Data Pipelines.
- Experience in working with lambda architecture using Scala on Spark.
- Worked extensively on developing PySpark scripts to generate output files in formats such as Avro, JSON, and XML.
- Experience in working with Hive tables in developing data pipelines, implementing complex business logic and optimizing Hive queries.
- Worked on importing data into HBase using HBase shell and HBase client API.
- Experience in automating job flows using Oozie.
- Experience in working with Apache SOLR and Elasticsearch.
- Worked on loading and unloading data to and from files in an Amazon S3 bucket.
- Experience in developing shell scripts in UNIX and using SQL or PL/SQL to process data from the input file and load it into the database.
- Involved in developing generic and custom Ab Initio graphs for the Unix environment.
- Experience in building a data lake to consolidate existing historical data.
- Worked on pulling relevant raw data from the data lake for analysis according to the requirements.
- Developed Unix shell wrapper scripts to run graphs in development, testing, and production environments.
- Experience in planning and scheduling Control-M jobs and automating workloads to reduce incident volume.
- Excellent understanding of statistics, machine learning, data mining, predictive analysis, data warehousing concepts, data modeling procedures, data profiling, data structures and algorithms.
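Illustrative PySpark sketch of the DataFrame transformation and Hive loading work summarized above; a minimal example, assuming a Hive-enabled Spark session, with hypothetical file paths, table, and column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hive-enabled Spark session (assumes access to a Hive metastore).
spark = (SparkSession.builder
         .appName("ingest-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Read a delimited flat file into a DataFrame (path is hypothetical).
raw = spark.read.csv("/data/landing/orders.csv", header=True, inferSchema=True)

# Basic transformations: cast the timestamp and derive a partition column.
orders = (raw
          .withColumn("order_ts", F.to_timestamp("order_ts"))
          .withColumn("order_dt", F.to_date("order_ts")))

# Write into a partitioned Hive table; partitioning supports faster querying.
(orders.write
       .mode("overwrite")
       .partitionBy("order_dt")
       .format("parquet")
       .saveAsTable("sales.orders"))
```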
TECHNICAL SKILLS:
Big Data Ecosystem: HDFS, MapReduce, Spark, YARN, Pig, Hive, Impala, HBase, Sqoop, Flume, Cloudera Hue, SOLR, Storm, Kafka, Elasticsearch, Oozie, Zookeeper
Languages: Java, Scala, Python, Pig Latin, HiveQL, SQL, PL/SQL, XML, JSON
Database Systems: Oracle 12c/11g/10g, MS SQL Server 2005/2008, MS Access, DB2, Teradata, Greenplum
NoSQL Databases: HBase, Cassandra, MongoDB
IDEs: Eclipse, IntelliJ, NetBeans
Scripting Tools: UNIX Shell Scripting, Perl
Operating Systems: Linux, Unix, Windows 7/Vista/XP/10
ETL Tools: Ab Initio (GDE 3.2.2, Co>Operating System 3.0.4.10), Tableau
Scheduling Tools: Control-M, Tidal Enterprise Scheduler, Crontab
PROFESSIONAL EXPERIENCE:
Confidential, Plano, Texas
Data Engineer/Big Data Developer
Responsibilities:
- Built a generic data ingestion framework to extract data from multiple sources like Oracle, delimited flat files, XML, Parquet, and JSON, using it to build Hive/Impala tables.
- Built big data analytic solutions to provide near real-time and batch data reports to business users according to requirements.
- Responsible for the design, development, and maintenance of Oozie workflows integrating shell, Java, Sqoop, Hive, and Spark action nodes to run data pipelines.
- Used Python to parse XML files and create flat files from them (a sketch follows this list).
- Worked with Spark DataFrames, Datasets, and RDDs using Python to transform data and load it into Hive tables based on the requirements.
- Worked extensively with PySpark/Spark SQL for data cleansing and for generating DataFrames and RDDs.
- Analyzed SQL scripts for the design and implementation of solutions using PySpark.
- Worked on creating Spark applications in Scala using map, flatMap, reduceByKey, filter, groupByKey, distinct, cogroup, join, count, collect, and reduce functions to process data.
- Used Impala for low latency queries, visualization and faster-querying purposes.
- Imported, exported, and appended incremental data into HDFS using PySpark and Sqoop from an Oracle database and ingested it into Hive tables.
- Used PySpark to build tables that require multiple computations and non-equi joins.
- Exported analyzed data to relational databases using PySpark and Sqoop for visualization and to generate reports for the BI team.
- Used HBase to support front end applications that retrieve data using row keys.
- Built a data quality framework using Java and Impala to run data rules, generate reports, and send daily email notifications of successful and failed business-critical jobs to business users.
- Determined the size of data and the level of computation required to process it, and leveraged suitable methodologies to transform the data and compute aggregations.
- Handled the design and support of multi-tenancy on our data platform to allow other teams to run their applications.
- Worked on configuration and automation of workflows using Control-M and led the production support teams through their operational, scheduling and monitoring activities.
- Created partitioned tables in Hive for better performance and faster querying.
- Worked extensively on creating Hive tables to store data that resulted after querying from large data sets.
- Worked on debugging and performance tuning of Hive and Pig jobs.
- Knowledge of SQL, Numpy, Pandas, Scikit-learn, and PySpark for data analysis and model building.
- Worked with file formats such as Avro and text and compression codecs such as Snappy, Gzip, and Bzip2.
- Processed JSON files using PySpark and created Hive tables.
- Developed Python regular expression (re module) operations in the Hadoop/Hive environment.
- Developed on-the-fly decryption support for Hive, Pig, and custom MapReduce use cases using the Java API.
- Involved in using HCatalog to access Hive metadata from MapReduce or Pig code.
- Worked on data pre-processing and cleaning to perform feature engineering and data imputation for missing values in various datasets using Python.
- Automated jobs for pulling or sending files from and to SFTP servers according to business requirements.
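Illustrative Python sketch of the XML-to-flat-file parsing mentioned above; a minimal example that assumes one record per <customer> element, with hypothetical tag names and paths:

```python
import csv
import xml.etree.ElementTree as ET

def xml_to_flat_file(xml_path, out_path):
    """Parse a simple XML feed and write it out as a pipe-delimited flat file."""
    tree = ET.parse(xml_path)
    root = tree.getroot()
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="|")
        writer.writerow(["customer_id", "name", "city"])  # header row
        for rec in root.iter("customer"):                 # hypothetical record tag
            writer.writerow([
                rec.findtext("id", default=""),
                rec.findtext("name", default=""),
                rec.findtext("city", default=""),
            ])

if __name__ == "__main__":
    xml_to_flat_file("customers.xml", "customers.psv")
```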
Confidential, Scottsdale, Arizona
Data Engineer
Responsibilities:
- Developed multiple MapReduce programs for extraction, transformation, and aggregation of data from multiple file formats including XML, JSON, CSV, and compressed formats.
- Wrote MapReduce jobs to parse and process data as per requirements.
- Designed and implemented MapReduce jobs to support distributed processing and provide end-to-end solutions in the Cloudera environment.
- Implemented automation of workloads using Shell Scripting.
- Involved in debugging Java MapReduce programs that process data.
- Implemented different processing logic for both ORC and text file formats.
- Built the data lake environment from scratch by pooling data from different sources.
- Worked extensively on importing and exporting data into HDFS and Hive/Impala tables from Relational Database Systems using Sqoop.
- Used Flume to collect and store data from different sources into HDFS and later processed it using HiveQL, Pig Latin and Java.
- Built scalable multi-threaded applications for large data processing using Pig.
- Developed Pig Latin scripts to extract data from files and load it into HDFS.
- Worked on HBase whenever required for low latency queries.
- Used Oozie workflow engine to run multiple Hive/Impala and Pig jobs.
- Assigned schemas and created Hive tables with partitioning and bucketing for faster analytics and better performance.
- Implemented Spark RDD transformations and actions to migrate MapReduce algorithms to Spark (a sketch follows this list).
- Developed User Defined Functions (UDFs) to provide custom Hive and Pig capabilities.
- Used lambda architecture to build large-scale, distributed data processing systems.
- Created solutions built on NoSQL databases and compared them with SQL Server-based solutions.
- Involved in backup activities, preparation of transition documents, and ETL (Ab Initio) design and testing.
- Involved in migrating ETL processes from Ab Initio to Hadoop environments.
- Involved in loading and unloading data to and from HDFS with Ab Initio, using Hadoop read/write components and direct Hive scripts.
- Performed data loads and unloads from DB2 and Oracle databases to Hive using Sqoop scripts.
- Guided testing teams in test data preparation to ensure test coverage, including production data scrubbing and test data manipulation using Hive and Pig scripts.
- Responsible for building and maintaining a regression test tool using Ab Initio, Hive, and Python scripts.
- Worked with support teams to resolve operational and performance issues.
- Designed distributed solutions for parallel processing of large data.
- Worked with cross-functional teams to produce efficient data results and reports as per the business requirements.
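Illustrative PySpark sketch of migrating a MapReduce-style map/reduce aggregation to Spark RDD transformations, as referenced above; the input path and record layout are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-migration-sketch").getOrCreate()
sc = spark.sparkContext

# Each input line is assumed to look like: "<date>,<store_id>,<sale_amount>"
lines = sc.textFile("/data/sales/daily.txt")

daily_totals = (lines
                .map(lambda line: line.split(","))
                .filter(lambda parts: len(parts) == 3)           # drop malformed rows
                .map(lambda parts: (parts[0], float(parts[2])))  # "map" phase
                .reduceByKey(lambda a, b: a + b))                # "reduce" phase

for day, total in daily_totals.collect():
    print(day, total)
```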
Confidential, Columbus, Ohio
ETL Developer
Responsibilities:
- Developed several Ab Initio graphs based on business requirements using components such as Reformat, Rollup, Join, Scan, Normalize, and Gather.
- Maintained Unix shell scripts for moving Ab Initio code from one Unix environment to another.
- Designed, developed, tested, and implemented Ab Initio graphs and Korn shell scripts.
- Developed Ab Initio graphs for data validation using validation components such as Compare Records and Compute Checksum.
- Implemented data parallelism in graphs using Ab Initio partition components to segment data and operate on each segment simultaneously.
- Responsible for documentation of complete Graphs and their components.
- Involved in Ab Initio design and configuration for ETL, data mapping, transformation, and loading in a complex, high-volume environment, processing data at the terabyte level.
- Involved in automating the ETL process through scheduling.
- Responsible for cleansing data from source systems using Ab Initio components such as Reformat and Filter by Expression.
- Developed psets (parameter sets) to impose reusable business restrictions and improve graph performance.
- Developed several partition-based Ab Initio Graphs for high volume data warehouse.
- Experience in batch scheduling using Control-M Workload Automation.
- Experience in Control-M version migration and maintaining server integrity to reduce downtime in the production environment.
- Experience in data analysis to reduce incident frequency using various incident and change management tools such as Hive, Automic Workload Automation, Control-M, Resolve, and HP Service Manager.
- Coordinated with various L3 support and SME teams to track and document incident data.
Confidential
Business Analyst
Responsibilities:
- Performed SWOT analysis of transaction processing in the current-state system and workflows, and estimated savings from proposed workflow changes and system enhancements.
- Identified return on investment (ROI) opportunities to reduce costs and increase transaction processing efficiency through automation.
- Modeled the system using Use Cases and State diagrams, and performed gap, risk, and cost-benefit analysis.
- Delivered Business Requirement Documentation (BRD) from Use Case specifications, identified goals and objectives, and supported Functional Specification Documentation (FSD) and Data Mapping Documentation.
- Researched and modeled activity workflows to report business trends for business unit development.
- Wrote SQL statements to extract data from application back-end and verify it with front-end results.
- Involved in developing and handling PL/SQL packages, procedures and database triggers.
- Identified data sources, coordinated data feeds from external sources, and developed reports, queries, and forms to meet the information needs of business partners.
- Involved in writing business logic and in developing and modifying a business portal that allows end users to apply for financial licenses online.
- Generated customer data reports using PL/SQL procedures and handled generation of user profiles.
- Involved in the development of a Tallyman collection product that allows businesses to manage the recovery of debt from customers.
- Provided technical support for clients, business partners and end-users.
- Involved in redefining subscriber profiles using PL/SQL procedures and triggers based on the number of cheque bounces, barring activities, etc.
- Prepared technical specification, interface, and detailed design documents to generate customer status reports using PL/SQL stored procedures in Oracle 11g.