
Big Data Engineer Resume

Thousand Oaks, CA

SUMMARY

  • Sr. Big Data Developer with more than 8 years of experience in the design and development of analytics/big data applications using leading industry tools, working with Fortune firms.
  • Well-rounded experience in ETL, Hadoop, Spark, data modeling, and data visualization.
  • Good understanding of big data concepts such as Hadoop, MapReduce, YARN, Spark, RDDs, DataFrames, Datasets, and Streaming.
  • Adept in statistical programming languages such as Python, R, and MATLAB, as well as Apache Spark and big data technologies such as Hadoop, Hive, Pig, and BigQuery.
  • Deep understanding of and exposure to the big data ecosystem.
  • Experienced in writing Pig Latin scripts, MapReduce jobs, and HiveQL.
  • Experience using various packages in R and Python, including caret, dplyr, RWeka, gmodels, RCurl, tm, C50, twitteR, NLP, reshape2, rjson, plyr, pandas, NumPy, seaborn, SciPy, matplotlib, scikit-learn, Beautiful Soup, rpy2, and partykit.
  • 5+ years of experience in BI application design with MicroStrategy, Tableau, and Power BI.
  • Proficient in Hive, Oracle, SQL Server, SQL, PL/SQL, and T-SQL, and in managing very large databases.
  • Hands-on programming experience in Java and Scala.
  • Experience writing in-house UNIX shell scripts for Hadoop and big data development.
  • Skilled in performance tuning of data pipelines, distributed datasets, databases, and SQL queries.
  • Work closely with development teams to ensure accurate integration of machine learning models into firm platforms
  • Develop API services in an Agile environment
  • Hands-on experience with AWS cloud services (VPC, EC2, S3, RDS, Redshift, Data Pipeline, EMR, DynamoDB, WorkSpaces, Lambda, Kinesis, SNS, SQS).
  • Worked with the Informatica PowerCenter tool: Source Analyzer, Warehouse Designer, Mapping and Mapplet Designer, and Transformation Developer.
  • Good knowledge of NoSQL databases such as DynamoDB and MongoDB.
  • Experience with NoSQL and columnar databases such as HBase, Apache Cassandra, Vertica, and MongoDB.
  • 3+ years of project management experience with Agile, Kanban, and Scrum.
  • 5+ years of experience managing migrations to DEV/UAT/PROD environments.
  • Strong data modeling skills, with experience developing complex data models using Unified Modeling Language (UML), ER diagrams, and conceptual/physical diagrams.
  • Strong data architecture experience with Star and Snowflake schemas.
  • Strong in data lake building and OLAP services.
  • Assists with the development of data models for the data warehouse; provides recommendations for all data strategies and out-of-scope processes.
  • Full understanding of common machine learning concepts; prior academic research in machine learning.
  • Work in an agile environment and contribute to the improvement of our development processes
  • Work with the other IT groups such as Infrastructure, Networks, Web Application Development to design BI solutions for internal and external use
  • Implement process improvements (automation, performance tuning, workflow optimization).
  • Follow SDLC methodologies in all development activities, including design, development, testing, and quality assurance support.

TECHNICAL SKILLS

Big Data: Hadoop, Sqoop, Flume, Hive, Spark, Pig, Kafka, Talend, HBase, Impala

ETL Tools: Informatica, Talend, Microsoft SSIS, Confidential DataStage, DBT

Databases: Oracle, SQL Server 2016, Teradata, Netezza, MS Access, Snowflake

Reporting: MicroStrategy, Microsoft Power BI, Tableau, SSRS, Business Objects (Crystal)

Business Intelligence: MDM, Metadata, Data Cleansing, OLAP, OLTP, SCD, SOA, REST, Web Services.

Tools: Ambari, SQL Developer, TOAD, Erwin, H2O.ai, Visio, Teradata

Operating Systems: Windows Server, UNIX/Linux (Red Hat, Solaris, AIX)

Languages: UNIX shell scripting, Scala, SQL, PL/SQL, T-SQL, Python, R

PROFESSIONAL EXPERIENCE

Confidential, Thousand Oaks, CA

Big Data Engineer

Responsibilities:

  • Work with project managers, business leaders, and technical teams to finalize requirements and create the solution design and architecture.
  • Architect the data lake by cataloging the source data, analyzing entity relationships, and aligning the design with performance, schedule, and reporting requirements.
  • Architect BI applications as enterprise solutions for Supply Chain, Online, and Finance.
  • Design and develop Hadoop ETL solutions to move data to the data lake using big data tools such as Sqoop, Hive, Spark, HDFS, and Talend.
  • Design and develop Spark code using the Scala programming language and Spark SQL for high-speed data processing to meet critical business requirements.
  • Experience using Kafka, Spark, Elasticsearch, and Cassandra.
  • Design and develop ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
  • Configured site-to-site VPN connections and Direct Connect for high-rate data transfer.
  • Implement RDD/Dataset/DataFrame transformations in Scala through SparkContext and HiveContext.
  • Import Python libraries into the transformation logic to implement core functionality.
  • Wrote Spark SQL and embedded the SQL in Scala files to generate jars for submission to the Hadoop cluster (a simplified PySpark sketch of this pattern follows this list).
  • Used the NoSQL database Amazon DynamoDB to store data for the reporting application.
  • Wrote AWS Lambda code in Python to convert, compare, and sort nested JSON files (see the Lambda sketch after this list).
  • Strong database development skills in stored procedures, query languages, and performance optimization in RDBMS (DB2) as well as Cassandra and Hadoop.
  • Involved in migrating objects from Teradata to Snowflake.
  • Configured Direct Connect and VPN connectivity with the AWS VPC.
  • Hands-on experience with various AWS services such as Redshift clusters and Route 53 domain configuration.
  • Designed and constructed AWS data pipelines using API Gateway, Lambda, S3, DynamoDB, and Snowflake: API Gateway receives the response from a Lambda function that retrieves data from Snowflake and converts the response into JSON format.
  • Constructing a state-of-the-art data lake on AWS using EMR, Spark, NiFi, Kafka, and Java.
  • Partner with DBT on delivery of data definitions and align with the instance data conversion team.
  • Develop algorithms and scripts in Hadoop to import data from source systems and persist it in HDFS (Hadoop Distributed File System) for staging purposes.
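
The Spark SQL transformations above were written in Scala on the project; the following is a minimal sketch of the same pattern rendered in PySpark for brevity, assuming hypothetical table, column, and path names: a Sqoop-staged table is registered as a view, transformed with Spark SQL, and persisted to the data lake.

    from pyspark.sql import SparkSession

    # Minimal sketch; Hive support is assumed enabled and all object names are illustrative
    spark = (SparkSession.builder
             .appName("orders-staging-transform")
             .enableHiveSupport()
             .getOrCreate())

    # Raw data that Sqoop landed in the Hive staging database
    spark.table("staging.orders_raw").createOrReplaceTempView("orders_raw")

    # Spark SQL transformation, analogous to the SQL embedded in the Scala jars
    curated = spark.sql("""
        SELECT order_id,
               customer_id,
               CAST(order_ts AS DATE) AS order_date,
               SUM(line_amount)       AS order_total
        FROM orders_raw
        GROUP BY order_id, customer_id, CAST(order_ts AS DATE)
    """)

    # Persist to the curated zone of the data lake, partitioned by date
    (curated.write
            .mode("overwrite")
            .partitionBy("order_date")
            .parquet("hdfs:///datalake/curated/orders"))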
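
A hedged sketch of the Lambda pattern for nested JSON follows: the function flattens nested documents, sorts them, and writes a converted copy back to S3. The trigger, bucket layout, and field names (for example event_ts) are assumptions for illustration, not the project's actual configuration.

    import json
    import boto3

    s3 = boto3.client("s3")

    def flatten(obj, parent_key="", sep="."):
        """Recursively flatten a nested JSON object into dotted keys."""
        items = {}
        for key, value in obj.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else key
            if isinstance(value, dict):
                items.update(flatten(value, new_key, sep))
            else:
                items[new_key] = value
        return items

    def lambda_handler(event, context):
        # Assumed trigger: an S3 put event carrying the bucket and object key
        record = event["Records"][0]["s3"]
        bucket = record["bucket"]["name"]
        key = record["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

        documents = json.loads(body)                    # assumed: a JSON array of nested objects
        flat = [flatten(doc) for doc in documents]
        flat.sort(key=lambda d: d.get("event_ts", ""))  # sort on a hypothetical timestamp field

        # Write the converted, sorted file back under a processed/ prefix
        s3.put_object(Bucket=bucket,
                      Key=f"processed/{key}",
                      Body=json.dumps(flat).encode("utf-8"))
        return {"records": len(flat)}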

Environment: Hortonworks 2.3.5, Sqoop, DBT, Hive, Informatica, AWS, Spark, Scala, Python, T-SQL, PL/SQL, Talend, DataStage, MicroStrategy 2019 (Developer), UNIX, H2O.ai, Ambari, Oozie.

Confidential, Columbia, MD

Data Engineer

Responsibilities:

  • Utilized Apache Spark with Python to develop and execute big data analytics and machine learning applications; executed machine learning use cases with Spark ML and MLlib (see the pipeline sketch after this list).
  • Responsible for data engineering functions including, but not limited to, data extraction, transformation, loading, and integration in support of enterprise data infrastructures: data warehouses, operational data stores, and master data management.
  • Responsible for data services and data movement infrastructures
  • Identified areas of improvement in the existing business by unearthing insights from vast amounts of data using machine learning techniques.
  • Implemented a generic, highly available ETL framework for bringing related data into Hadoop and Cassandra from various sources using Spark.
  • Led discussions with users to gather business process and data requirements and develop a variety of conceptual, logical, and physical data models. Expert in business intelligence and data visualization tools: Tableau, MicroStrategy.
  • Hands-on experience with GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, Dataproc, and Stackdriver.
  • Worked on machine learning over large datasets using Spark and MapReduce.
  • Developed Spark code in Scala and Python for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows big data resources.
  • Extracted, transformed, and loaded data sources to generate CSV data files using Python programming and SQL queries.
  • Experience with AWS cloud services: EC2, EMR, RDS, Redshift, S3
  • Stored and retrieved data from data-warehouses using Amazon Redshift.
  • Worked on Teradata SQL queries, Teradata indexes, and utilities such as MultiLoad, TPump, FastLoad, and FastExport.
  • Worked on Lambda functions that aggregate data from incoming events and store the results in Amazon DynamoDB. Wrote Terraform templates as AWS infrastructure-as-code to build staging and production environments and to set up builds and automation for Jenkins.
  • Experience designing DAGs using Airflow, Luigi, and AWS Data Pipeline (see the Airflow sketch after this list).
  • Developed and deployed the outcome using Spark and Scala code on a Hadoop cluster running on GCP.
  • Used data warehousing concepts such as the Ralph Kimball and Bill Inmon methodologies, OLAP, OLTP, star schemas, snowflake schemas, fact tables, and dimension tables.
  • Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data. Created various types of data visualizations using Python and Tableau.
  • Worked with H2O to explore data and to create and tune models.
  • Develop Hive logic & Stored Procedures to implement business rules and perform data transformation
  • Developed UNIX shell scripts to perform Hadoop ETL functions such as running Sqoop jobs, creating external/internal Hive tables, and initiating HQL scripts and BigQuery jobs.
  • Develop scripts in Hive to perform transformations on the data and load to target systems for use by the data analysts for reporting.
  • Scheduled jobs through Apache Oozie by creating workflow and properties files and submitting the jobs.
  • Scheduled different Snowflake jobs using NiFi
  • Knowledge of machine learning libraries like Spark MLlib, Mahout, and JSAT
  • Designed workflows with many sessions, including decision, assignment, event-wait, and event-raise tasks, and used the Informatica scheduler to schedule jobs.
  • Work with Technical Program Management and Software Quality Assurance teams to develop means to measure and monitor overall project health
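
The Spark ML work above can be illustrated with a minimal PySpark pipeline sketch; the feature table analytics.customer_features, its columns, and the churned label are hypothetical stand-ins for the project's actual data.

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("spark-ml-sketch").getOrCreate()

    # Hypothetical curated feature table with a binary label column
    df = spark.table("analytics.customer_features")
    train, test = df.randomSplit([0.8, 0.2], seed=42)

    # Assemble the numeric feature columns into the single vector the estimator expects
    assembler = VectorAssembler(
        inputCols=["tenure_months", "monthly_spend", "support_tickets"],
        outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="churned")

    model = Pipeline(stages=[assembler, lr]).fit(train)
    auc = BinaryClassificationEvaluator(labelCol="churned").evaluate(model.transform(test))
    print(f"Test AUC: {auc:.3f}")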
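
The DAG design mentioned above is sketched below as a small Airflow DAG (Airflow 2.x import paths assumed); the DAG id, schedule, and task callables are placeholders rather than the project's real pipeline.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator  # Airflow 2.x import path

    def extract(**context):
        # Placeholder: pull the day's files from the source system
        pass

    def transform(**context):
        # Placeholder: run the Spark/Hive transformation step
        pass

    def load(**context):
        # Placeholder: publish curated data to the warehouse
        pass

    with DAG(
        dag_id="daily_ingest_sketch",  # hypothetical DAG name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args={"retries": 1, "retry_delay": timedelta(minutes=10)},
    ) as dag:
        t_extract = PythonOperator(task_id="extract", python_callable=extract)
        t_transform = PythonOperator(task_id="transform", python_callable=transform)
        t_load = PythonOperator(task_id="load", python_callable=load)

        # Linear dependency chain: extract -> transform -> load
        t_extract >> t_transform >> t_load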

Environment: Hadoop, MapReduce, Spark, Spark MLlib, Tableau, SQL, Excel, VBA, SAS, MATLAB, AWS, SPSS, Cassandra, Oracle, MongoDB, SQL Server 2012, DB2, T-SQL, PL/SQL, XML.

Confidential, Chanhassen, MN

Big Data Engineer/Data Architect

Responsibilities:

  • Performed extensive data analysis and coordinated with the client teams to develop data models
  • Worked as the BI SME, converting business requirements into technical requirements and documentation.
  • Developed HQL scripts in Hive and Spark SQL to perform transformations on relational data, and used Sqoop to export data back to the databases.
  • Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
  • Developed UNIX shell scripts to perform ELT operations on big data, such as running Sqoop jobs, creating external/internal Hive tables, and initiating HQL scripts and BigQuery jobs.
  • Developed the ETL/SQL code to load data from raw staging relational databases and ingest the data into the Hadoop environment using Sqoop.
  • Developed and executed a migration strategy to move Data Warehouse from an Oracle platform to AWS Redshift.
  • Implemented high availability on replicated storage backends in the cloud (S3, HDFS, DynamoDB, Redshift, Cassandra, HBase, RDS, etc.).
  • Optimize Spark code in Scala through reengineering the DAG logic to use minimal resources and provide high throughput
  • Used Informatica file-watch events to poll the FTP sites for the external mainframe files.
  • Developed Pig scripts to transform unstructured and semi-structured streaming data.
  • Developed data flow architecture & physical data model with Data Warehouse Architect
  • Wrote unit scripts to automate data load and performed data transformation operations
  • Performance-tuned the Hive code through the use of map joins, partitioning, vectorization, and computed statistics.
  • Performance-tuned the Spark code by minimizing shuffle operations, caching and persisting reusable RDDs, and adjusting the number of executors/cores/tasks (see the tuning sketch after this list).
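
A hedged PySpark sketch of the tuning techniques above follows; the Hive session settings are shown as comments, and the table and path names are assumptions used only to illustrate statistics, shuffle sizing, and reuse of a persisted dataset.

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    # Hive-side tuning as it would be issued in the Hive session (shown here as comments):
    #   SET hive.auto.convert.join=true;                 -- map joins for small dimension tables
    #   SET hive.vectorized.execution.enabled=true;      -- vectorized scans and filters
    #   ANALYZE TABLE curated.orders COMPUTE STATISTICS; -- statistics for the optimizer

    spark = (SparkSession.builder
             .appName("tuning-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Size the shuffle width to the cluster before wide aggregations
    spark.conf.set("spark.sql.shuffle.partitions", "200")

    # Persist a dataset that several downstream steps reuse instead of recomputing it
    orders = spark.table("curated.orders").persist(StorageLevel.MEMORY_AND_DISK)

    daily_totals = orders.groupBy("order_date").sum("order_total")
    daily_totals.write.mode("overwrite").parquet("hdfs:///datalake/marts/daily_totals")

    orders.unpersist()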

Environment: Informatica, DataStage, Hadoop, Shell Scripting, AWS, Scala, Sqoop, Hive, Oracle, MicroStrategy, Tableau, PL/SQL, Java, UNIX

Confidential, Gwynedd, PA

Big Data Developer

Responsibilities:

  • Extracted and profiled data from the customer, commercial loans and retail source systems that would provide the data needed for the loan reporting requirements
  • Determined criteria and wrote scripts for technical and business data quality checks, error handling and rejected reports during the data quality stage
  • Provided input on the design of the physical and logical architecture, the source-to-target mappings of the data warehouse, and the ETL process.
  • Created UNIX shell scripts to run the Informatica workflows and control the ETL flow.
  • Created Hive tables and loaded data from HDFS to Hive tables as per the requirement.
  • Processed complex XML and XSLT files and generated derived fields to be loaded into the database.
  • Converted large XML files into multiple XML files as required by the downstream application.
  • Loaded the processed XML files into the database tables.
  • Mapped source files and generated target files in multiple formats such as XML, Excel, and CSV.
  • Transformed the data and reports retrieved from various sources and generated derived fields.
  • Wrote complex SQL queries to validate the reports.
  • Wrote user-defined functions to transform data into the required formats.
  • Developed Talend jobs using context variables and scheduled the jobs to run automatically.
  • Extensively worked on Data Mapper to map complex JSON formats to XML.
  • Copied data to AWS S3 for storage and used the COPY command to transfer the data to Redshift, via Talend connectors integrated with Redshift (see the COPY sketch after this list).
  • BI development for multiple technical projects running in parallel
  • Participate in development and implementation of product roadmap
  • Create technical blueprints and solution architecture diagrams
  • Troubleshoot and resolve incidents
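
The S3-to-Redshift load above ran through Talend's Redshift connectors; a minimal Python sketch of the underlying COPY command is shown below, with placeholder connection details, bucket, table, and IAM role.

    import psycopg2

    # Placeholder connection details for an example Redshift cluster
    conn = psycopg2.connect(
        host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="etl_user",
        password="********")

    # COPY loads the S3 prefix in parallel across the cluster's slices
    copy_sql = """
        COPY staging.loan_events
        FROM 's3://example-bucket/exports/loan_events/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy-role'
        FORMAT AS CSV
        IGNOREHEADER 1
        TIMEFORMAT 'auto';
    """

    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)

    conn.close()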

Environment: Talend, Hadoop, Hortonworks, DataStage, AWS, Redshift, UNIX, Hive, Informatica, Control-M

Confidential, Waltham, MA

Data analyst/ BI developer

Responsibilities:

  • Extracted data from five operational databases containing almost two terabytes of data, loaded into the data warehouse and subsequently populated seven data marts
  • Created complex transformations, mappings, mapplets, reusable items, scheduled workflows based on the business logic and rules
  • Developed ETL job workflows with QC reporting and analysis frameworks
  • Developed Informatica mappings, Lookups, Reusable Components, Sessions, Work Flows etc. (on ETL side) as per the design documents/communication
  • Designed Metadata tables at source staging table to profile data and perform impact analysis
  • Performed query tuning and setting optimization on the Oracle database (rule and cost based)
  • Created Cardinalities, Contexts, Joins and Aliases for resolving loops and checked the data integrity
  • Debugged issues, fixed critical bugs and assisted in code deployments to QA and production
  • Coordinated with the external teams to assure the quality of master data and conduct UAT/integration testing
  • Implemented PowerExchange CDC for mainframes to load certain large data modules into the data warehouse and capture changing data
  • Designed and developed exception handling, data standardization procedures and quality assurance controls
  • Used Cognos for analysis and presentation layers
  • Develop Cognos 10 cubes using Framework Manager, Report Studio and Query Studio
  • Provide performance management and tuning
  • Develop in several BI reporting tool suites
  • Provide technical oversight to consultant partners

Environment: Informatica, Java/SOAP/Web Services, Oracle, DB2, SAS, Shell Scripting, TOAD, SQL Plus, Scheduler

Confidential

Data Analyst

Responsibilities:

  • Used Microsoft Visio and Rational Rose to design the use case diagrams, class models, sequence diagrams, and activity diagrams for the application's SDLC process. Utilized SQL to extract data from statewide databases for analysis.
  • Worked with other teams to analyze customers and marketing parameters.
  • Conducted design reviews and technical reviews with other project stakeholders.
  • Was part of the complete project life cycle, from requirements to production support.
  • Created test plan documents for all back-end database modules. Used MS Excel, MS Access, and SQL to write and run various queries.
  • Develop back-end machine learning applications
  • Used traceability matrix to trace the requirements of the organization
  • Performed data profiling to cleanse the data in the database and raise the data issues found
  • Recommended structural changes and enhancements to systems and databases
  • Assisted in mining data from the SQL database that was used in several significant presentations

Environment: Microsoft Visio, SQL, MS Excel, MS Access, Data mining
